Models

DRP Model

Contains the DRPModel class.

The DRPModel class is an abstract wrapper class for drug response prediction models.

class drevalpy.models.drp_model.DRPModel

Bases: ABC

Abstract wrapper class for drug response prediction models.

It has two boolean attributes: is_single_drug_model, which indicates whether the model is a single drug model, and early_stopping, which indicates whether early stopping is used.

abstractmethod build_model(hyperparameters)

Builds the model, for models that use hyperparameters.

Subclasses should call self.log_hyperparameters(hyperparameters) at the beginning of this method to ensure hyperparameters are logged to wandb if enabled.

Parameters:: hyperparameters (dict[str, Any]) – hyperparameters for the model
Return type:: None

Example:

def build_model(self, hyperparameters: dict[str, Any]) -> None:
    self.log_hyperparameters(hyperparameters)  # Log to wandb
    self.model = ElasticNet(alpha=hyperparameters["alpha"], l1_ratio=hyperparameters["l1_ratio"])

abstract property cell_line_views: list[str]

Returns the sources the model needs as input for describing the cell line.

Returns:: cell line views, e.g., [“methylation”, “gene_expression”, “mirna_expression”, “mutation”]. If the model does not use cell line features, return an empty list.

compute_and_log_final_metrics(dataset, additional_metrics=None, prefix='val_')

Compute final performance metrics from a dataset and log them to wandb.

This method computes R^2 and PCC (always), plus any additional metrics specified. The metrics are both logged to wandb history and stored in the run summary.

Parameters:

dataset (DrugResponseDataset) – DrugResponseDataset with predictions and response
additional_metrics (list[str] | None) – optional list of additional metrics to compute (e.g., [“RMSE”, “MAE”])
prefix (str) – metric name prefix indicating which split the metrics belong to (for example, use "val_" for validation and "test_" for test metrics)

Return type:

dict[str, float]

Returns:

dictionary of computed metrics

compute_performance_metrics(predictions, targets, prefix='')

Compute R^2 and PCC metrics from predictions and targets.

This is a convenience method for computing performance metrics consistently across all models. It always computes R^2 and PCC in addition to any other metrics that may be needed.

Parameters:

predictions (ndarray) – model predictions array
targets (ndarray) – ground truth targets array
prefix (str) – optional prefix for metric keys (e.g., val_, train_)

Return type:

dict[str, float]

Returns:

dictionary of computed metrics with optional prefix
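As an illustration of the two metrics and the key prefix, here is a minimal sketch in plain Python. It is not the library's implementation: the function name performance_metrics_sketch and the key names "R^2" and "PCC" are assumptions chosen to match the wording above.

```python
import math


def performance_metrics_sketch(predictions, targets, prefix=""):
    """Return R^2 and the Pearson correlation (PCC), keyed with an optional prefix."""
    n = len(targets)
    mean_t = sum(targets) / n
    mean_p = sum(predictions) / n
    # R^2 = 1 - SS_res / SS_tot (undefined if all targets are identical)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    r2 = 1.0 - ss_res / ss_tot
    # PCC = covariance / (std of predictions * std of targets)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictions, targets))
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    pcc = cov / math.sqrt(var_p * ss_tot)
    # Key names are an assumption for this sketch.
    return {f"{prefix}R^2": r2, f"{prefix}PCC": pcc}


targets = [1.0, 2.0, 3.0, 4.0]
perfect = performance_metrics_sketch(targets, targets, prefix="val_")
doubled = performance_metrics_sketch([2.0, 4.0, 6.0, 8.0], targets, prefix="val_")
```

The second call shows why both metrics are reported: predictions that are perfectly correlated with the targets but on the wrong scale keep a PCC of 1 while R^2 becomes negative.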

abstract property drug_views: list[str]

Returns the sources the model needs as input for describing the drug.

Returns:: drug views, e.g., [“descriptors”, “fingerprints”, “targets”]. If the model does not use drug features, return an empty list.

early_stopping = False

finish_wandb()

Finish the wandb run. Call this when training is complete.

Return type:: None

get_concatenated_features(cell_line_view, drug_view, cell_line_ids_output, drug_ids_output, cell_line_input, drug_input)

Concatenates the features to an input matrix X for the given cell line and drug views.

Parameters:

cell_line_view (str | None) – gene expression, methylation, etc.
drug_view (str | None) – ids, fingerprints, etc.
cell_line_ids_output (ndarray) – cell line ids
drug_ids_output (ndarray) – drug ids
cell_line_input (FeatureDataset | None) – input associated with the cell line
drug_input (FeatureDataset | None) – input associated with the drug

Return type:

ndarray

Returns:

X, the feature matrix needed for, e.g., sklearn models

Raises:

ValueError – if no features are provided

This method can be used, e.g., in the train method to build one large input feature matrix whose rows are the samples and whose columns are the concatenated cell line and drug features. It is an alternative to using DataLoaders and is intended for models that operate on the whole input matrix at once.

Example:

x = self.get_concatenated_features(
    cell_line_view="gene_expression",
    drug_view="fingerprints",
    cell_line_ids_output=output.cell_line_ids,
    drug_ids_output=output.drug_ids,
    cell_line_input=cell_line_input,
    drug_input=drug_input,
)
self.model.fit(x, output.response)
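Conceptually, the concatenation can be pictured as in the following sketch. It is not the library's implementation: plain dictionaries stand in for FeatureDataset, and the function name concatenate_features_sketch and the toy values are made up for illustration.

```python
def concatenate_features_sketch(cell_line_ids, drug_ids, cell_line_features, drug_features):
    """Build one row per (cell line, drug) sample by joining both feature vectors."""
    rows = []
    for cell_line_id, drug_id in zip(cell_line_ids, drug_ids):
        # Cell line features come first, followed by the drug features.
        rows.append(cell_line_features[cell_line_id] + drug_features[drug_id])
    return rows


# Toy data (hypothetical): two expression values per cell line, three fingerprint bits per drug.
cell_line_features = {"CL1": [0.5, 1.2], "CL2": [0.1, 0.9]}
drug_features = {"D1": [1, 0, 1], "D2": [0, 1, 1]}

x = concatenate_features_sketch(
    ["CL1", "CL2", "CL1"], ["D1", "D1", "D2"], cell_line_features, drug_features
)
```

Each row of x corresponds to one response value, so the same cell line or drug can appear in several rows.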

get_feature_matrices(cell_line_ids, drug_ids, cell_line_input, drug_input)

Returns the feature matrices for the given cell line and drug ids by retrieving the correct views.

Parameters:

cell_line_ids (ndarray) – cell line identifiers
drug_ids (ndarray) – drug identifiers
cell_line_input (FeatureDataset | None) – cell line omics features
drug_input (FeatureDataset | None) – drug omics features

Return type:

dict[str, ndarray]

Returns:

dictionary with the feature matrices

Raises:

ValueError – if the input does not contain the correct views

This can, e.g., be done to produce the input for the predict() method of deep learning models. Example:

input_data = self.get_feature_matrices(
    cell_line_ids=cell_line_ids,
    drug_ids=drug_ids,
    cell_line_input=cell_line_input,
    drug_input=drug_input,
)
(
    gene_expression,
    mutations,
    cnvs
) = (
    input_data["gene_expression"],
    input_data["mutations"],
    input_data["copy_number_variation_gistic"]
)
return self.model.predict(gene_expression, mutations, cnvs)

It can also be used to produce separate inputs for the train()/predict() methods of models that do not operate on the concatenated input matrix:

inputs = self.get_feature_matrices(
    cell_line_ids=output.cell_line_ids,
    drug_ids=output.drug_ids,
    cell_line_input=cell_line_input,
    drug_input=drug_input,
)
(
    gene_expression,
    methylation,
    mutations,
    copy_number_variation_gistic,
    fingerprints,
) = (
    inputs["gene_expression"],
    inputs["methylation"],
    inputs["mutations"],
    inputs["copy_number_variation_gistic"],
    inputs["fingerprints"],
)
self.model.fit(
    gene_expression, methylation, mutations, copy_number_variation_gistic, fingerprints, output.response
)

classmethod get_hyperparameter_set()

Loads the hyperparameters from a yaml file which is located in the same directory as the model.

Return type:

list[dict[str, Any]]

Returns:

list of hyperparameter sets

Raises:

ValueError – if the hyperparameters are not in the correct format
KeyError – if the model is not found in the hyperparameters file
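The return type, a list of dictionaries, suggests that each entry is one complete hyperparameter configuration. The following sketch shows one way such a list can be produced from a search space. It assumes that the search space maps each hyperparameter name to a list of candidate values and that the list is the Cartesian product of those values; both are assumptions, and expand_grid is a hypothetical helper, not part of drevalpy. Reading the YAML file itself is not shown.

```python
from itertools import product


def expand_grid(search_space):
    """Expand {name: [candidates, ...]} into a list of {name: value} dictionaries."""
    names = list(search_space)
    candidate_lists = [search_space[name] for name in names]
    # One dictionary per combination of candidate values.
    return [dict(zip(names, combination)) for combination in product(*candidate_lists)]


# Hypothetical search space, as it might look after parsing a YAML entry for one model.
space = {"alpha": [0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}
grid = expand_grid(space)
```

Each element of grid has the shape expected by build_model, for example {"alpha": 0.1, "l1_ratio": 0.2}.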

abstractmethod classmethod get_model_name()

Returns the name of the model.

Return type:: str
Returns:: model name

get_wandb_logger()

Get a WandbLogger for PyTorch Lightning integration.

This method creates a WandbLogger that uses the existing wandb run. Returns None if wandb is not enabled.

Return type:: Any | None
Returns:: WandbLogger instance or None

init_wandb(project, config=None, name=None, tags=None, finish_previous=True)

Initialize wandb logging for this model instance.

Parameters:

project (str) – wandb project name
config (dict[str, Any] | None) – dictionary of configuration to log (e.g., hyperparameters, dataset info)
name (str | None) – run name (defaults to model name)
tags (list[str] | None) – list of tags for the run
finish_previous (bool) – whether to finish any existing wandb run before starting a new one

Return type:

None

is_single_drug_model = False

is_wandb_enabled()

Check if wandb logging is enabled for this model instance.

Return type:: bool
Returns:: True if wandb is initialized and active, False otherwise

classmethod load(directory)

Load a model, including trainable parameters, hyperparameters, scalers, encoders.

This method should fully reconstruct an instance of the model using the files in the specified directory.

Within the DrEval evaluation framework, this method only needs to be implemented if a final production model should be saved.

Parameters:: directory (str) – Source directory containing the saved model files
Raises:: NotImplementedError – if the method is not implemented by the subclass
Return type:: DRPModel

abstractmethod load_cell_line_features(data_path, dataset_name)

Load the cell line features before the train/predict method is called.

Must be implemented by all models. Could, e.g., call get_multiomics_feature_dataset() or load_and_select_gene_features() from models/utils.py.

Parameters:

data_path (str) – path to the data, e.g., data/
dataset_name (str) – name of the dataset, e.g., “GDSC2”

Return type:

FeatureDataset

Returns:

FeatureDataset with the cell line features

abstractmethod load_drug_features(data_path, dataset_name)

Load the drug features before the train/predict method is called.

Must be implemented by all models that use drug features. Could, e.g., call load_drug_fingerprint_features() or load_drug_ids_from_csv() from models/utils.py.

For single drug models, this method can return None.

Parameters:

data_path (str) – path to the data, e.g., data/
dataset_name (str) – name of the dataset, e.g., “GDSC2”

Return type:

FeatureDataset | None

Returns:

FeatureDataset or None

log_final_metrics(metrics)

Store final metrics in the wandb run summary.

This method is used to record final metrics (e.g., after validation or after a hyperparameter trial). Metrics are stored with their original names (e.g., val_RMSE, test_RMSE) without additional prefixes.

Parameters:: metrics (dict[str, float]) – dictionary of metric names to values
Return type:: None

log_hyperparameters(hyperparameters)

Log hyperparameters to wandb.

This method is meant to be called at the beginning of build_model; hyperparameters are only logged when wandb is enabled. Subclasses can override this to add additional hyperparameter logging.

During hyperparameter tuning, config updates are skipped to avoid overwriting. Only the final best hyperparameters are logged to wandb.config.

Parameters:: hyperparameters (dict[str, Any]) – dictionary of hyperparameters to log
Return type:: None

log_metrics(metrics, step=None)

Log metrics to wandb.

Subclasses can call this method to log custom metrics during training.

Parameters:

metrics (dict[str, float]) – dictionary of metric names to values
step (int | None) – optional step number for the metrics

Return type:

None

abstractmethod predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the response for the given input.

Parameters:

drug_ids (ndarray) – array of drug ids; also passed to single drug models, where it just contains the same drug id for every sample
cell_line_ids (ndarray) – list of cell line ids
cell_line_input (FeatureDataset) – input associated with the cell line, required for all models
drug_input (FeatureDataset | None) – input associated with the drug, optional because single drug models do not use drug features

Return type:

ndarray

Returns:

predicted response

save(directory)

Save the model, including trainable parameters, hyperparameters, scalers, encoders.

This method should serialize all necessary components to allow full reconstruction of the model later via the load method.

Within the DrEval evaluation framework, this method only needs to be implemented if a final production model should be saved.

Parameters:: directory (str) – Target directory where the model and metadata should be saved
Raises:: NotImplementedError – if the method is not implemented by the subclass
Return type:: None
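To make the save/load contract concrete, here is a minimal round-trip sketch. ToyModel, its coefficients attribute, and the file names hyperparameters.json and coefficients.pkl are all hypothetical; a real subclass would store whatever its own model needs (weights, scalers, encoders) in a format of its choosing.

```python
import json
import pickle
import tempfile
from pathlib import Path


class ToyModel:
    """Hypothetical model used only to illustrate a save/load round trip."""

    def __init__(self, hyperparameters=None):
        self.hyperparameters = hyperparameters or {}
        self.coefficients = None  # stands in for trainable parameters

    def save(self, directory):
        target = Path(directory)
        target.mkdir(parents=True, exist_ok=True)
        # Hyperparameters as human-readable JSON, parameters as a pickle.
        (target / "hyperparameters.json").write_text(json.dumps(self.hyperparameters))
        with open(target / "coefficients.pkl", "wb") as handle:
            pickle.dump(self.coefficients, handle)

    @classmethod
    def load(cls, directory):
        source = Path(directory)
        model = cls(json.loads((source / "hyperparameters.json").read_text()))
        with open(source / "coefficients.pkl", "rb") as handle:
            model.coefficients = pickle.load(handle)
        return model


with tempfile.TemporaryDirectory() as tmp:
    original = ToyModel({"alpha": 0.1})
    original.coefficients = [0.3, -1.2]
    original.save(tmp)
    restored = ToyModel.load(tmp)
```

The point of the round trip is that restored can be used for prediction without retraining, which is why every fitted component has to be written out by save.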

abstractmethod train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')

Trains the model.

Parameters:

output (DrugResponseDataset) – training data associated with the response output
cell_line_input (FeatureDataset) – input associated with the cell line, required for all models
drug_input (FeatureDataset | None) – input associated with the drug, optional because single drug models do not use drug features
output_earlystopping (DrugResponseDataset | None) – optional early stopping dataset
model_checkpoint_dir (str) – directory to save the model checkpoints

Return type:

None

Utility functions

Utility functions for loading and processing data.

class drevalpy.models.utils.ProteomicsMedianCenterAndImputeTransformer(feature_threshold=0.7, n_features=1000, normalization_downshift=1.8, normalization_width=0.3, imputation_seed=100)

Bases: BaseEstimator, TransformerMixin

Performs median centering and imputation of proteomics data.

fit(X, y=None)

Learns the top n_features complete proteins and calculates the mean median of the training cell lines.

Parameters:

X – input proteomics data
y – not used

Returns:

self

transform(X)

Median center the data and impute missing values with downshifted normal distribution.

Parameters:: X – input proteomics data
Returns:: transformed proteomics data
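The two steps named above can be sketched for a single sample as follows. This is an illustration under assumptions, not the transformer's code: it assumes that "downshifted normal distribution" means drawing from a normal distribution with mean (observed mean - downshift * observed standard deviation) and standard deviation (width * observed standard deviation), and the function name median_center_and_impute_sketch is made up.

```python
import math
import random
import statistics


def median_center_and_impute_sketch(row, target_median, downshift=1.8, width=0.3, seed=100):
    """Shift one sample to a target median, then fill NaNs from a downshifted normal."""
    observed = [v for v in row if not math.isnan(v)]
    # Median centering: move the sample so that its median equals the target median.
    shift = target_median - statistics.median(observed)
    centered = [v if math.isnan(v) else v + shift for v in row]
    # Imputation parameters come from the observed (centered) values;
    # at least two observed values are needed for the standard deviation.
    centered_observed = [v for v in centered if not math.isnan(v)]
    mean = statistics.fmean(centered_observed)
    std = statistics.stdev(centered_observed)
    rng = random.Random(seed)  # fixed seed keeps the imputation reproducible
    return [
        rng.gauss(mean - downshift * std, width * std) if math.isnan(v) else v
        for v in centered
    ]


row = [1.0, 2.0, float("nan"), 3.0]
out = median_center_and_impute_sketch(row, target_median=0.0)
```

Observed values are only shifted, while the missing entry is replaced by a draw that tends to lie below the observed range, reflecting the idea that missing proteomics values often correspond to low abundance.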

class drevalpy.models.utils.VarianceFeatureSelector(view, k=1000)

Bases: object

Selects the top-k features with highest variance for a specific omics view.

Stores a boolean mask after fitting on training data and applies it consistently to other datasets.

Parameters:

view (str)
k (int)

fit(cell_line_input, output)

Fit the selector to the training data by computing a variance-based mask.

Parameters:

cell_line_input (FeatureDataset) – FeatureDataset containing omics features
output (DrugResponseDataset) – DrugResponseDataset with the training cell line IDs

Return type:

None

transform(cell_line_input)

Apply the feature mask to reduce the dataset to selected features.

Parameters:: cell_line_input (FeatureDataset) – FeatureDataset to transform
Return type:: FeatureDataset
Returns:: reduced FeatureDataset
Raises:: RuntimeError – if selector was not fitted
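The fit/transform split described above can be sketched in plain Python. This is not the class itself: VarianceMaskSketch is a hypothetical stand-in that uses dictionaries of lists instead of FeatureDataset, and it only illustrates computing the mask on training cell lines and reusing it elsewhere.

```python
from statistics import pvariance


class VarianceMaskSketch:
    """Hypothetical stand-in: keep the k features with the highest training variance."""

    def __init__(self, k):
        self.k = k
        self.mask = None

    def fit(self, features, train_ids):
        # Only the training cell lines contribute to the variance estimate.
        rows = [features[cell_line_id] for cell_line_id in train_ids]
        n_features = len(rows[0])
        variances = [pvariance([row[j] for row in rows]) for j in range(n_features)]
        ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
        keep = set(ranked[: self.k])
        self.mask = [j in keep for j in range(n_features)]

    def transform(self, features):
        if self.mask is None:
            raise RuntimeError("selector was not fitted")
        # The stored mask is applied unchanged to every cell line, seen or unseen.
        return {
            cell_line_id: [value for value, kept in zip(row, self.mask) if kept]
            for cell_line_id, row in features.items()
        }


features = {"A": [1.0, 5.0, 0.0], "B": [1.0, -5.0, 1.0], "C": [1.0, 0.0, 2.0]}
selector = VarianceMaskSketch(k=2)
selector.fit(features, train_ids=["A", "B"])
reduced = selector.transform(features)
```

Fitting on the training cell lines only keeps information from validation and test cell lines out of the feature selection.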

drevalpy.models.utils.get_multiomics_feature_dataset(data_path, dataset_name, gene_lists=None, omics=None)

Get multiomics feature dataset for the given list of OMICs.

Parameters:

data_path (str) – path to the data, e.g., data/
dataset_name (str) – name of the dataset, e.g., GDSC2
gene_lists (dict | None) – dictionary of names of lists of genes to include, for each omics type, e.g., {“gene_expression”: “landmark_genes_reduced”}; if None, the features are not reduced
omics (list[str] | None) – list of omics to include, e.g., [“gene_expression”, “methylation”]

Return type:

FeatureDataset

Returns:

FeatureDataset with the multiomics features

Raises:

ValueError – if no omics features are found

drevalpy.models.utils.iterate_features(df, feature_type)

Iterate over features.

Parameters:

df (DataFrame) – DataFrame with the features
feature_type (str) – type of feature, e.g., gene_expression, methylation, etc.

Return type:

dict[str, dict[str, ndarray]]

Returns:

dictionary with the features

drevalpy.models.utils.load_and_select_gene_features(feature_type, gene_list, data_path, dataset_name)

Load and reduce features of a single feature type, ensuring selection and ordering based on the gene list.

Attention: if gene_list is None, all features are loaded, which can be problematic for cross study prediction.

Parameters:

feature_type (str) – type of feature, e.g., gene_expression, methylation, etc.
gene_list (str | None) – list of genes to include, e.g., landmark_genes
data_path (str) – path to the data, e.g., data/
dataset_name (str) – name of the dataset, e.g., GDSC2

Return type:

FeatureDataset

Returns:

FeatureDataset with the reduced features

Raises:

ValueError – if genes from gene_list are missing in the dataset

drevalpy.models.utils.load_cl_ids_and_tissues_from_csv(path, dataset_name)

Load cell line ids and optional tissue annotations from csv file.

Parameters:

path (str) – path to the data, e.g., data/
dataset_name (str) – name of the dataset, e.g., GDSC2

Return type:

FeatureDataset

Returns:

FeatureDataset with cell line ids and tissue annotations, if available

drevalpy.models.utils.load_cl_ids_from_csv(path, dataset_name)

Load cell line ids from csv file.

Parameters:

path (str) – path to the data, e.g., data/
dataset_name (str) – name of the dataset, e.g., GDSC2

Return type:

FeatureDataset

Returns:

FeatureDataset with the cell line ids

drevalpy.models.utils.load_drug_fingerprint_features(data_path, dataset_name, fill_na=True, n_bits=128)

Load drug features from fingerprints.

Parameters:

data_path (str) – path to the data, e.g., data/
dataset_name (str) – name of the dataset, e.g., GDSC2
fill_na – whether to use default pubchemid-hashed fingerprints if fingerprint is not available
n_bits – number of bits in the fingerprint

Return type:

FeatureDataset

Returns:

FeatureDataset with the drug fingerprints

drevalpy.models.utils.load_drug_ids_from_csv(data_path, dataset_name)

Load drug ids from csv file.

Parameters:

data_path (str) – path to the data, e.g., data/
dataset_name (str) – name of the dataset, e.g., GDSC2

Return type:

FeatureDataset

Returns:

FeatureDataset with the drug ids

drevalpy.models.utils.load_generic_csv(path, dataset_name, feature_name, index_col='cell_line_name')

Loads a generic CSV file with cell line IDs as index and features as columns.

Parameters:

path (str) – path to the data, e.g., data/
dataset_name (str) – name of the dataset, e.g., GDSC2
feature_name (str) – name of the feature, e.g., gene_expression
index_col – name of the index column, e.g., cell_line_id

Return type:

FeatureDataset

Returns:

FeatureDataset with the features

drevalpy.models.utils.load_multi_cell_line_view(cell_line_views, data_path, dataset_name, model_name)

Load cell line features for a multi-view model.

Known omics types use specific gene lists for subsetting. Unknown types are loaded in full.

Parameters:

cell_line_views (list[str]) – list of cell line views
data_path (str) – path to the data, e.g., data/
dataset_name (str) – name of the dataset, e.g., GDSC1
model_name (str) – name of the model, used for error messages

Return type:

FeatureDataset

Returns:

FeatureDataset containing the cell line features

Raises:

ValueError – if cell_line_views is empty

drevalpy.models.utils.load_single_cell_line_view(cell_line_views, data_path, dataset_name, model_name)

Load cell line features for a single-view model.

If the view is “gene_expression”, the landmark_genes_reduced list is used for subsetting. Otherwise, the whole CSV is loaded.

Parameters:

cell_line_views (list[str]) – list of cell line views (must have exactly one element)
data_path (str) – path to the data, e.g., data/
dataset_name (str) – name of the dataset, e.g., GDSC1
model_name (str) – name of the model, used for error messages

Return type:

FeatureDataset

Returns:

FeatureDataset containing the cell line features

Raises:

ValueError – if cell_line_views is empty or has more than one element

drevalpy.models.utils.load_single_drug_view(drug_views, data_path, dataset_name, model_name)

Load drug features for a single-view model.

If drug_views is empty, drug IDs are loaded. If “fingerprints”, fingerprints are loaded. Otherwise, the CSV is loaded generically.

Parameters:

drug_views (list[str]) – list of drug views (at most one element)
data_path (str) – path to the data, e.g., data/
dataset_name (str) – name of the dataset, e.g., GDSC1
model_name (str) – name of the model, used for error messages

Return type:

FeatureDataset | None

Returns:

FeatureDataset containing the drug features

Raises:

ValueError – if more than one drug view is specified

drevalpy.models.utils.load_tissues_from_csv(path, dataset_name)

Load tissues from csv file.

Parameters:

path (str) – path to the data, e.g., data/
dataset_name (str) – name of the dataset, e.g., GDSC2

Return type:

FeatureDataset

Returns:

FeatureDataset with the tissues

drevalpy.models.utils.log10_and_set_na(x)

Log10 transform and set NaN for infinite values.

Parameters:: x – input array
Returns:: log10 transformed array with NaN for infinite values
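The described behavior can be sketched without array libraries. The helper below is hypothetical and works on a plain list; it assumes that non-positive inputs, whose logarithm would be infinite or undefined, should also end up as NaN.

```python
import math


def log10_and_set_na_sketch(values):
    """Return log10 of each value, using NaN where the result would not be finite."""
    result = []
    for value in values:
        if value > 0 and math.isfinite(value):
            result.append(math.log10(value))
        else:
            # log10(0) would be -inf, log10(inf) would be inf, negatives are undefined.
            result.append(math.nan)
    return result


transformed = log10_and_set_na_sketch([100.0, 0.0, 1.0, float("inf")])
```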

drevalpy.models.utils.prepare_expression_and_methylation(cell_line_input, cell_line_ids, training, gene_expression_scaler=None, methylation_scaler=None, methylation_pca=None)

Applies preprocessing to gene expression and optionally methylation views.

Applies arcsinh + scaling to gene expression if a scaler is provided.
Applies scaling + PCA to methylation if both a scaler and PCA are provided.
The transformations are applied to all cell lines in cell_line_input; if training=True, the scalers and PCA are fitted only on the given IDs.

Parameters:

cell_line_input (FeatureDataset) – FeatureDataset with the cell line features
cell_line_ids (ndarray) – IDs of the cell lines used for training or transformation
training (bool) – Whether to fit the scalers/PCA (True) or just apply transformation (False)
gene_expression_scaler (TransformerMixin | None) – Optional fitted or to-be-fitted scaler for gene expression
methylation_scaler (TransformerMixin | None) – Optional fitted or to-be-fitted scaler for methylation
methylation_pca (PCA | None) – Optional PCA transformer for methylation

Return type:

FeatureDataset

Returns:

FeatureDataset with the transformed features

drevalpy.models.utils.prepare_proteomics(cell_line_input, cell_line_ids, training, transformer)

Applies log10 transform and proteomics normalization (centering + imputation) to proteomics view.

Parameters:

cell_line_input (FeatureDataset) – FeatureDataset with proteomics features
cell_line_ids (ndarray) – cell line IDs for training or transformation
training (bool) – whether to fit or only transform
transformer (ProteomicsMedianCenterAndImputeTransformer) – Proteomics transformer

Return type:

FeatureDataset

Returns:

transformed FeatureDataset

drevalpy.models.utils.scale_gene_expression(cell_line_input, cell_line_ids, training, gene_expression_scaler)

Scales gene expression in place using arcsinh transformation and a provided scaler.

Parameters:

cell_line_input (FeatureDataset) – FeatureDataset with the cell line features
cell_line_ids (ndarray) – IDs of cell lines to use for fitting or transformation
training (bool) – whether to fit the scaler (True) or only apply the transformation (False)
gene_expression_scaler (TransformerMixin) – sklearn transformer for gene expression

Return type:

FeatureDataset

Returns:

FeatureDataset with the transformed features
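The interplay of the training flag and the cell line IDs can be sketched as follows. This is an illustration, not the function's code: ColumnStandardizer is a hypothetical stand-in for an sklearn-style scaler, dictionaries of lists stand in for FeatureDataset, and, unlike the function described above, the sketch returns a new dictionary instead of modifying its input.

```python
import math
import statistics


class ColumnStandardizer:
    """Hypothetical stand-in for an sklearn-style scaler (zero mean, unit variance per column)."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means = [statistics.fmean(column) for column in columns]
        # Constant columns get a divisor of 1.0 to avoid division by zero.
        self.stds = [statistics.pstdev(column) or 1.0 for column in columns]
        return self

    def transform(self, rows):
        return [[(v - m) / s for v, m, s in zip(row, self.means, self.stds)] for row in rows]


def scale_expression_sketch(features, cell_line_ids, training, scaler):
    """Apply arcsinh to all cell lines; fit the scaler on the given IDs only if training."""
    transformed = {cid: [math.asinh(v) for v in row] for cid, row in features.items()}
    if training:
        scaler.fit([transformed[cid] for cid in cell_line_ids])
    ids = list(transformed)
    scaled = scaler.transform([transformed[cid] for cid in ids])
    return dict(zip(ids, scaled))


features = {"A": [0.0, 10.0], "B": [2.0, 10.0], "C": [5.0, 10.0]}
scaler = ColumnStandardizer()
train_scaled = scale_expression_sketch(features, ["A", "B"], training=True, scaler=scaler)
test_scaled = scale_expression_sketch(features, ["C"], training=False, scaler=scaler)
```

With training=False the previously fitted statistics are reused, so test cell lines are scaled with parameters learned from the training cell lines only.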

drevalpy.models.utils.unique(array)

Get unique values ordered by first occurrence.

Parameters:: array – array of values
Returns:: unique values ordered by first occurrence
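For plain Python sequences, the same behavior can be sketched with an insertion-ordered dictionary. unique_in_order is a hypothetical helper for illustration and makes no claim about how the library function is implemented.

```python
def unique_in_order(values):
    """Return the distinct values, keeping the order of their first occurrence."""
    # dict preserves insertion order, so duplicate keys keep their first position.
    return list(dict.fromkeys(values))


drug_ids = ["D2", "D1", "D2", "D3", "D1"]
first_seen = unique_in_order(drug_ids)
```

In contrast to a sorted unique, this keeps the result aligned with the order in which the ids appear in the data.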
