Models
DRP Model
Contains the DRPModel class.
The DRPModel class is an abstract wrapper class for drug response prediction models.
- class drevalpy.models.drp_model.DRPModel
Bases: ABC
Abstract wrapper class for drug response prediction models.
The class has two boolean attributes: is_single_drug_model indicates whether the model is a single drug model, and early_stopping indicates whether early stopping is used.
- abstractmethod build_model(hyperparameters)
Builds the model, for models that use hyperparameters.
Subclasses should call self.log_hyperparameters(hyperparameters) at the beginning of this method to ensure hyperparameters are logged to wandb if enabled.
Example:
    def build_model(self, hyperparameters: dict[str, Any]) -> None:
        self.log_hyperparameters(hyperparameters)  # Log to wandb
        self.model = ElasticNet(
            alpha=hyperparameters["alpha"],
            l1_ratio=hyperparameters["l1_ratio"],
        )
- abstract property cell_line_views: list[str]
Returns the sources the model needs as input for describing the cell line.
- Returns:
cell line views, e.g., [“methylation”, “gene_expression”, “mirna_expression”, “mutation”]. If the model does not use cell line features, return an empty list.
- compute_and_log_final_metrics(dataset, additional_metrics=None, prefix='val_')
Compute final performance metrics from a dataset and log them to wandb.
This method computes R^2 and PCC (always), plus any additional metrics specified. The metrics are both logged to wandb history and stored in the run summary.
- Parameters:
dataset (DrugResponseDataset) – DrugResponseDataset with predictions and response
additional_metrics (list[str] | None) – optional list of additional metrics to compute (e.g., [“RMSE”, “MAE”])
prefix (str) – metric name prefix indicating which split the metrics belong to (for example, use "val" for validation and "test" for test metrics)
- Return type:
- Returns:
dictionary of computed metrics
- compute_performance_metrics(predictions, targets, prefix='')
Compute R^2 and PCC metrics from predictions and targets.
This is a convenience method for computing performance metrics consistently across all models. It always computes R^2 and PCC in addition to any other metrics that may be needed.
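As a worked illustration of the two metrics named above, the following minimal sketch computes R^2 and the Pearson correlation coefficient (PCC) in plain Python. It is not the drevalpy implementation: the function names and the dictionary keys ("R^2", "Pearson") are assumptions chosen for this example, and only the use of a prefix mirrors the documented signature.

```python
import math


def r_squared(predictions, targets):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for p, t in zip(predictions, targets))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


def pearson(predictions, targets):
    """Pearson correlation coefficient between predictions and targets."""
    n = len(targets)
    mean_p = sum(predictions) / n
    mean_t = sum(targets) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictions, targets))
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    var_t = sum((t - mean_t) ** 2 for t in targets)
    return cov / math.sqrt(var_p * var_t)


def compute_metrics_sketch(predictions, targets, prefix=""):
    # Key names are hypothetical; the real method may name them differently.
    return {
        f"{prefix}R^2": r_squared(predictions, targets),
        f"{prefix}Pearson": pearson(predictions, targets),
    }


metrics = compute_metrics_sketch(
    [1.0, 2.0, 3.0, 5.0], [1.0, 2.0, 3.0, 4.0], prefix="val_"
)
```

Note that the two metrics can disagree: predictions that are perfectly correlated with the targets but offset or rescaled still reach a PCC of 1 while R^2 drops.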
- abstract property drug_views: list[str]
Returns the sources the model needs as input for describing the drug.
- Returns:
drug views, e.g., [“descriptors”, “fingerprints”, “targets”]. If the model does not use drug features, return an empty list.
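The two view properties can be illustrated with a small stand-in. The sketch below does not import drevalpy: ViewsBase is a hypothetical base class that only mimics the abstract properties and class attributes described here, and ExpressionOnlyModel is an invented subclass for a single drug model that reads gene expression and no drug features.

```python
from abc import ABC, abstractmethod


class ViewsBase(ABC):
    """Stand-in for the parts of DRPModel shown here; not the drevalpy class."""

    is_single_drug_model = False
    early_stopping = False

    @property
    @abstractmethod
    def cell_line_views(self) -> list[str]:
        ...

    @property
    @abstractmethod
    def drug_views(self) -> list[str]:
        ...


class ExpressionOnlyModel(ViewsBase):
    """Hypothetical single drug model that only reads gene expression."""

    is_single_drug_model = True

    @property
    def cell_line_views(self) -> list[str]:
        return ["gene_expression"]

    @property
    def drug_views(self) -> list[str]:
        return []  # the model does not use drug features


model = ExpressionOnlyModel()
```

Because both properties are abstract, a subclass that omits one of them cannot be instantiated, which surfaces a missing declaration early.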
- early_stopping = False
- get_concatenated_features(cell_line_view, drug_view, cell_line_ids_output, drug_ids_output, cell_line_input, drug_input)
Concatenates the features to an input matrix X for the given cell line and drug views.
- Parameters:
cell_line_view (str | None) – gene expression, methylation, etc.
cell_line_ids_output (ndarray) – cell line ids
drug_ids_output (ndarray) – drug ids
cell_line_input (FeatureDataset | None) – input associated with the cell line
drug_input (FeatureDataset | None) – input associated with the drug
- Return type:
- Returns:
X, the feature matrix needed for, e.g., sklearn models
- Raises:
ValueError – if no features are provided
This can, e.g., be done in the training method to produce a large input feature matrix for the model where the rows are the samples and the columns are the cell line and drug features concatenated. This method is an alternative to using DataLoaders. It is used for models operating on the whole input matrix at once.
Example:
    x = self.get_concatenated_features(
        cell_line_view="gene_expression",
        drug_view="fingerprints",
        cell_line_ids_output=output.cell_line_ids,
        drug_ids_output=output.drug_ids,
        cell_line_input=cell_line_input,
        drug_input=drug_input,
    )
    self.model.fit(x, output.response)
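To make the shape of the resulting matrix concrete, here is a minimal sketch of the concatenation idea using plain dictionaries and lists. It assumes, for illustration only, that features are stored as a mapping from id to vector; the helper name concatenate_features and this data layout are not part of drevalpy.

```python
def concatenate_features(cell_line_features, drug_features, cell_line_ids, drug_ids):
    """Build one row per sample: cell line features followed by drug features."""
    if cell_line_features is None and drug_features is None:
        raise ValueError("no features provided")
    rows = []
    for cl_id, drug_id in zip(cell_line_ids, drug_ids):
        row = []
        if cell_line_features is not None:
            row.extend(cell_line_features[cl_id])
        if drug_features is not None:
            row.extend(drug_features[drug_id])
        rows.append(row)
    return rows


# Toy data: two cell lines with two features, two drugs with three bits.
expression = {"CL1": [0.1, 0.2], "CL2": [0.3, 0.4]}
fingerprints = {"D1": [1, 0, 1], "D2": [0, 1, 1]}
X = concatenate_features(
    expression, fingerprints, ["CL1", "CL1", "CL2"], ["D1", "D2", "D1"]
)
```

Each row corresponds to one (cell line, drug) pair, so a cell line that was screened against several drugs has its feature vector repeated once per pair.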
- get_feature_matrices(cell_line_ids, drug_ids, cell_line_input, drug_input)
Returns the feature matrices for the given cell line and drug ids by retrieving the correct views.
- Parameters:
cell_line_ids (ndarray) – cell line identifiers
drug_ids (ndarray) – drug identifiers
cell_line_input (FeatureDataset | None) – cell line omics features
drug_input (FeatureDataset | None) – drug omics features
- Return type:
- Returns:
dictionary with the feature matrices
- Raises:
ValueError – if the input does not contain the correct views
This can, e.g., be done to produce the input for the predict() method for deep learning models.
Example:
    input_data = self.get_feature_matrices(
        cell_line_ids=cell_line_ids,
        drug_ids=drug_ids,
        cell_line_input=cell_line_input,
        drug_input=drug_input,
    )
    (gene_expression, mutations, cnvs) = (
        input_data["gene_expression"],
        input_data["mutations"],
        input_data["copy_number_variation_gistic"],
    )
    return self.model.predict(gene_expression, mutations, cnvs)
Or to produce separate inputs for the train()/predict() method for other models if the model does not operate on the concatenated input matrix:
    inputs = self.get_feature_matrices(
        cell_line_ids=output.cell_line_ids,
        drug_ids=output.drug_ids,
        cell_line_input=cell_line_input,
        drug_input=drug_input,
    )
    (
        gene_expression,
        methylation,
        mutations,
        copy_number_variation_gistic,
        fingerprints,
    ) = (
        inputs["gene_expression"],
        inputs["methylation"],
        inputs["mutations"],
        inputs["copy_number_variation_gistic"],
        inputs["fingerprints"],
    )
    self.model.fit(
        gene_expression,
        methylation,
        mutations,
        copy_number_variation_gistic,
        fingerprints,
        output.response,
    )
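The difference to the concatenated matrix is that each view stays separate while all matrices share the same row order. The sketch below illustrates this with plain dictionaries. The layout (view name mapped to id mapped to vector) and the helper name feature_matrices_sketch are assumptions for this example, and the error condition is simplified to missing ids rather than the view check performed by the real method.

```python
def feature_matrices_sketch(cell_line_ids, drug_ids, cell_line_input, drug_input):
    """Return one row-aligned matrix per view, keyed by view name."""
    matrices = {}
    for ids, features in ((cell_line_ids, cell_line_input), (drug_ids, drug_input)):
        if features is None:
            continue  # e.g., single drug models without drug features
        for view, per_id in features.items():
            missing = [i for i in ids if i not in per_id]
            if missing:
                raise ValueError(f"view {view!r} lacks ids {missing}")
            matrices[view] = [per_id[i] for i in ids]
    return matrices


cell_lines = {
    "gene_expression": {"CL1": [0.1, 0.2], "CL2": [0.3, 0.4]},
    "mutations": {"CL1": [1], "CL2": [0]},
}
drugs = {"fingerprints": {"D1": [1, 0], "D2": [0, 1]}}
matrices = feature_matrices_sketch(
    ["CL1", "CL2", "CL1"], ["D1", "D1", "D2"], cell_lines, drugs
)
```

Row i of every matrix describes the same sample, which is what allows the views to be passed to a model as separate inputs.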
- classmethod get_hyperparameter_set()
Loads the hyperparameters from a yaml file which is located in the same directory as the model.
- abstractmethod classmethod get_model_name()
Returns the name of the model.
- Return type:
- Returns:
model name
- get_wandb_logger()
Get a WandbLogger for PyTorch Lightning integration.
This method creates a WandbLogger that uses the existing wandb run. Returns None if wandb is not enabled.
- init_wandb(project, config=None, name=None, tags=None, finish_previous=True)
Initialize wandb logging for this model instance.
- is_single_drug_model = False
- is_wandb_enabled()
Check if wandb logging is enabled for this model instance.
- Return type:
- Returns:
True if wandb is initialized and active, False otherwise
- classmethod load(directory)
Load a model, including trainable parameters, hyperparameters, scalers, and encoders.
This method should fully reconstruct an instance of the model using the files in the specified directory.
Only needs to be implemented for the DrEval evaluation framework, if a final production model should be saved.
- Parameters:
directory (str) – Source directory containing the saved model files
- Raises:
NotImplementedError – if the method is not implemented by the subclass
- Return type:
- abstractmethod load_cell_line_features(data_path, dataset_name)
Load the cell line features before the train/predict method is called.
Required to implement for all models. Could, e.g., call get_multiomics_feature_dataset() or load_and_select_gene_features() from models/utils.py.
- Parameters:
- Return type:
- Returns:
FeatureDataset with the cell line features
- abstractmethod load_drug_features(data_path, dataset_name)
Load the drug features before the train/predict method is called.
Required to implement for all models that use drug features. Could, e.g., call load_drug_fingerprint_features() or load_drug_ids_from_csv() from models/utils.py.
For single drug models, this method can return None.
- Parameters:
- Return type:
- Returns:
FeatureDataset or None
- log_final_metrics(metrics)
Store final metrics in the wandb run summary.
This method is used to record final metrics (e.g., after validation or after a hyperparameter trial). Metrics are stored with their original names (e.g., val_RMSE, test_RMSE) without additional prefixes.
- log_hyperparameters(hyperparameters)
Log hyperparameters to wandb.
This method is called automatically by build_model when wandb is enabled. Subclasses can override this to add additional hyperparameter logging.
During hyperparameter tuning, config updates are skipped to avoid overwriting. Only the final best hyperparameters are logged to wandb.config.
- log_metrics(metrics, step=None)
Log metrics to wandb.
Subclasses can call this method to log custom metrics during training.
- abstractmethod predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)
Predicts the response for the given input.
- Parameters:
drug_ids (ndarray) – list of drug ids; also used for single drug models, where it is just an array containing the same drug id
cell_line_ids (ndarray) – list of cell line ids
cell_line_input (FeatureDataset) – input associated with the cell line, required for all models
drug_input (FeatureDataset | None) – input associated with the drug, optional because single drug models do not use drug features
- Return type:
- Returns:
predicted response
- save(directory)
Save the model, including trainable parameters, hyperparameters, scalers, and encoders.
This method should serialize all necessary components to allow full reconstruction of the model later via the load method.
Only needs to be implemented for the DrEval evaluation framework, if a final production model should be saved.
- Parameters:
directory (str) – Target directory where the model and metadata should be saved
- Raises:
NotImplementedError – if the method is not implemented by the subclass
- Return type:
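The save and load contract (everything written by save must be enough for load to rebuild the instance) can be sketched with a toy class. ToyLinearModel, the file name model.json, and the JSON layout are all hypothetical and only illustrate the round trip; a real model would typically also persist fitted scalers and encoders.

```python
import json
import os
import tempfile


class ToyLinearModel:
    """Hypothetical model; only save/load mirror the documented contract."""

    def __init__(self, hyperparameters, weights):
        self.hyperparameters = hyperparameters
        self.weights = weights

    def save(self, directory):
        # Write everything needed for reconstruction into the target directory.
        os.makedirs(directory, exist_ok=True)
        state = {"hyperparameters": self.hyperparameters, "weights": self.weights}
        with open(os.path.join(directory, "model.json"), "w") as handle:
            json.dump(state, handle)

    @classmethod
    def load(cls, directory):
        # Rebuild a full instance from the files written by save().
        with open(os.path.join(directory, "model.json")) as handle:
            state = json.load(handle)
        return cls(state["hyperparameters"], state["weights"])


with tempfile.TemporaryDirectory() as tmp:
    original = ToyLinearModel({"alpha": 0.5}, [0.1, -0.2])
    original.save(tmp)
    restored = ToyLinearModel.load(tmp)
```

Making load a classmethod, as in the documented signature, means callers do not need an existing instance or its hyperparameters to restore a saved model.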
- abstractmethod train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')
Trains the model.
- Parameters:
output (DrugResponseDataset) – training data associated with the response output
cell_line_input (FeatureDataset) – input associated with the cell line, required for all models
drug_input (FeatureDataset | None) – input associated with the drug, optional because single drug models do not use drug features
output_earlystopping (DrugResponseDataset | None) – optional early stopping dataset
model_checkpoint_dir (str) – directory to save the model checkpoints
- Return type:
Utility functions
Utility functions for loading and processing data.
- class drevalpy.models.utils.ProteomicsMedianCenterAndImputeTransformer(feature_threshold=0.7, n_features=1000, normalization_downshift=1.8, normalization_width=0.3, imputation_seed=100)
Bases: BaseEstimator, TransformerMixin
Performs median centering and imputation of proteomics data.
- fit(X, y=None)
Learns the top n_features complete proteins and calculates the mean median of the training cell lines.
- Parameters:
X – input proteomics data
y – not used
- Returns:
self
- transform(X)
Median centers the data and imputes missing values from a downshifted normal distribution.
- Parameters:
X – input proteomics data
- Returns:
transformed proteomics data
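The idea behind median centering followed by downshifted-normal imputation can be sketched for a single sample as follows. This is a simplified illustration under stated assumptions, not the transformer's actual code: it assumes that centering shifts the sample so its median matches a reference median learned on the training data, and that missing values are drawn from a normal distribution with mean (mean - downshift * sd) and standard deviation (width * sd) of the observed values. The function name center_and_impute is hypothetical; the default values are taken from the constructor signature above.

```python
import random
import statistics


def center_and_impute(sample, reference_median, downshift=1.8, width=0.3, seed=100):
    """Median center one sample and fill None entries from a downshifted normal."""
    rng = random.Random(seed)
    observed = [v for v in sample if v is not None]
    # Shift the sample so that its median equals the reference median.
    shift = reference_median - statistics.median(observed)
    centered = [v + shift for v in observed]
    mu = statistics.mean(centered)
    sd = statistics.stdev(centered)
    # Missing values are assumed to be low-abundance, hence the downshift.
    return [
        v + shift if v is not None else rng.gauss(mu - downshift * sd, width * sd)
        for v in sample
    ]


result = center_and_impute([10.0, 12.0, None, 14.0], reference_median=0.0)
```

Drawing imputed values from the lower tail reflects the common assumption that proteins are missing because their abundance fell below the detection limit, not at random.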
- class drevalpy.models.utils.VarianceFeatureSelector(view, k=1000)
Bases: object
Selects the top-k features with highest variance for a specific omics view.
Stores a boolean mask after fitting on training data and applies it consistently to other datasets.
- fit(cell_line_input, output)
Fit the selector to the training data by computing a variance-based mask.
- Parameters:
cell_line_input (FeatureDataset) – FeatureDataset containing omics features
output (DrugResponseDataset) – DrugResponseDataset with the training cell line IDs
- Return type:
- transform(cell_line_input)
Apply the feature mask to reduce the dataset to selected features.
- Parameters:
cell_line_input (FeatureDataset) – FeatureDataset to transform
- Return type:
- Returns:
reduced FeatureDataset
- Raises:
RuntimeError – if selector was not fitted
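The fit-then-transform pattern described here (compute a boolean mask on training data, reuse it unchanged elsewhere) can be sketched without drevalpy. TopKVarianceMask is a hypothetical stand-in that works on lists of rows instead of a FeatureDataset; only the behaviour of the mask and the RuntimeError on an unfitted selector follow the description above.

```python
import statistics


class TopKVarianceMask:
    """Keep the k columns with the highest variance in the training rows."""

    def __init__(self, k):
        self.k = k
        self.mask = None

    def fit(self, rows):
        columns = list(zip(*rows))
        variances = [statistics.pvariance(column) for column in columns]
        ranked = sorted(range(len(variances)), key=lambda j: variances[j], reverse=True)
        keep = set(ranked[: self.k])
        self.mask = [j in keep for j in range(len(variances))]
        return self

    def transform(self, rows):
        if self.mask is None:
            raise RuntimeError("selector was not fitted")
        return [[v for v, keep in zip(row, self.mask) if keep] for row in rows]


train_rows = [[1, 5, 0], [1, 9, 1], [1, 1, 2]]
selector = TopKVarianceMask(k=2).fit(train_rows)
reduced = selector.transform([[7, 8, 9]])
```

Fitting on training rows only and reusing the stored mask keeps the selected columns identical across splits and avoids leaking information from test data into feature selection.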
- drevalpy.models.utils.get_multiomics_feature_dataset(data_path, dataset_name, gene_lists=None, omics=None)
Get multiomics feature dataset for the given list of OMICs.
- Parameters:
data_path (str) – path to the data, e.g., data/
dataset_name (str) – name of the dataset, e.g., GDSC2
gene_lists (dict | None) – dictionary of names of lists of genes to include, for each omics type, e.g., {“gene_expression”: “landmark_genes_reduced”}; if None, the features are not reduced
omics (list[str] | None) – list of omics to include, e.g., [“gene_expression”, “methylation”]
- Return type:
- Returns:
FeatureDataset with the multiomics features
- Raises:
ValueError – if no omics features are found
- drevalpy.models.utils.iterate_features(df, feature_type)
Iterate over features.
- drevalpy.models.utils.load_and_select_gene_features(feature_type, gene_list, data_path, dataset_name)
Load and reduce features of a single feature type, ensuring selection and ordering based on the gene list.
Attention: if gene_list is None, all features are loaded, which can be problematic for cross study prediction.
- Parameters:
- Return type:
- Returns:
FeatureDataset with the reduced features
- Raises:
ValueError – if genes from gene_list are missing in the dataset
- drevalpy.models.utils.load_cl_ids_from_csv(path, dataset_name)
Load cell line ids from csv file.
- Parameters:
- Return type:
- Returns:
FeatureDataset with the cell line ids
- drevalpy.models.utils.load_drug_fingerprint_features(data_path, dataset_name, fill_na=True, n_bits=128)
Load drug features from fingerprints.
- Parameters:
- Return type:
- Returns:
FeatureDataset with the drug fingerprints
- drevalpy.models.utils.load_drug_ids_from_csv(data_path, dataset_name)
Load drug ids from csv file.
- Parameters:
- Return type:
- Returns:
FeatureDataset with the drug ids
- drevalpy.models.utils.load_generic_csv(path, dataset_name, feature_name, index_col='cell_line_name')
Loads a generic CSV file with cell line IDs as index and features as columns.
- Parameters:
- Return type:
- Returns:
FeatureDataset with the features
- drevalpy.models.utils.load_multi_cell_line_view(cell_line_views, data_path, dataset_name, model_name)
Load cell line features for a multi-view model.
Known omics types use specific gene lists for subsetting. Unknown types are loaded in full.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the cell line features
- Raises:
ValueError – if cell_line_views is empty
- drevalpy.models.utils.load_single_cell_line_view(cell_line_views, data_path, dataset_name, model_name)
Load cell line features for a single-view model.
If the view is “gene_expression”, the landmark_genes_reduced list is used for subsetting. Otherwise, the whole CSV is loaded.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the cell line features
- Raises:
ValueError – if cell_line_views is empty or has more than one element
- drevalpy.models.utils.load_single_drug_view(drug_views, data_path, dataset_name, model_name)
Load drug features for a single-view model.
If drug_views is empty, drug IDs are loaded. If “fingerprints”, fingerprints are loaded. Otherwise, the CSV is loaded generically.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the drug features
- Raises:
ValueError – if more than one drug view is specified
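The selection rule described for load_single_drug_view can be written down as a small decision function. The sketch below only returns the name of the loader that the rule points to; choose_drug_loader is a hypothetical helper, and whether the real function delegates to exactly these loaders is an assumption based on the function names documented in this module.

```python
def choose_drug_loader(drug_views):
    """Return the name of the loader matching the documented rule."""
    if len(drug_views) > 1:
        raise ValueError("more than one drug view specified")
    if not drug_views:
        return "load_drug_ids_from_csv"  # no view: only drug IDs are needed
    if drug_views[0] == "fingerprints":
        return "load_drug_fingerprint_features"
    return "load_generic_csv"  # any other view is read as a generic CSV


choices = [choose_drug_loader(v) for v in ([], ["fingerprints"], ["descriptors"])]
```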
- drevalpy.models.utils.load_tissues_from_csv(path, dataset_name)
Load tissues from csv file.
- Parameters:
- Return type:
- Returns:
FeatureDataset with the tissues
- drevalpy.models.utils.log10_and_set_na(x)
Log10 transform and set NaN for infinite values.
- Parameters:
x – input array
- Returns:
log10 transformed array with NaN for infinite values
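A plain-Python analogue shows why the NaN step is needed: the logarithm of zero is negative infinity, which would otherwise propagate into later scaling steps. The helper log10_or_nan below is a hypothetical stand-in that works on lists; it is not the drevalpy function, which operates on arrays.

```python
import math


def log10_or_nan(values):
    """Log10 transform; results that would be infinite or undefined become NaN."""
    result = []
    for value in values:
        if value > 0 and math.isfinite(value):
            result.append(math.log10(value))
        else:
            # Zero, negative, infinite, or NaN input has no finite log10.
            result.append(math.nan)
    return result


transformed = log10_or_nan([100.0, 0.0, 1.0, float("inf")])
```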
- drevalpy.models.utils.prepare_expression_and_methylation(cell_line_input, cell_line_ids, training, gene_expression_scaler=None, methylation_scaler=None, methylation_pca=None)
Applies preprocessing to gene expression and optionally methylation views.
Applies arcsinh + scaling to gene expression if a scaler is provided.
Applies scaling + PCA to methylation if both a scaler and PCA are provided.
Applies to all cell lines in cell_line_input, using fitting only on the given IDs if training=True.
- Parameters:
cell_line_input (FeatureDataset) – FeatureDataset with the cell line features
cell_line_ids (ndarray) – IDs of the cell lines used for training or transformation
training (bool) – Whether to fit the scalers/PCA (True) or just apply transformation (False)
gene_expression_scaler (TransformerMixin | None) – Optional fitted or to-be-fitted scaler for gene expression
methylation_scaler (TransformerMixin | None) – Optional fitted or to-be-fitted scaler for methylation
methylation_pca (PCA | None) – Optional PCA transformer for methylation
- Return type:
- Returns:
FeatureDataset with the transformed features
- drevalpy.models.utils.prepare_proteomics(cell_line_input, cell_line_ids, training, transformer)
Applies log10 transform and proteomics normalization (centering + imputation) to proteomics view.
- Parameters:
cell_line_input (FeatureDataset) – FeatureDataset with proteomics features
cell_line_ids (ndarray) – cell line IDs for training or transformation
training (bool) – whether to fit or only transform
transformer (ProteomicsMedianCenterAndImputeTransformer) – Proteomics transformer
- Return type:
- Returns:
transformed FeatureDataset
- drevalpy.models.utils.scale_gene_expression(cell_line_input, cell_line_ids, training, gene_expression_scaler)
Scales gene expression in place using an arcsinh transformation and a provided scaler.
- Parameters:
cell_line_input (FeatureDataset) – FeatureDataset with the cell line features
cell_line_ids (ndarray) – IDs of cell lines to use for fitting or transformation
training (bool) – whether to fit the scaler (True) or only transform (False)
gene_expression_scaler (TransformerMixin) – sklearn transformer for gene expression
- Return type:
- Returns:
FeatureDataset with the transformed features
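The combination of an arcsinh transformation with a scaler that is fitted on training cell lines only can be sketched in plain Python. The helper below is an illustration under assumptions: it uses a hand-written standardization (subtract mean, divide by standard deviation) as a stand-in for whatever sklearn scaler is passed in, stores features as a mapping from cell line id to vector, and fits and transforms in one call instead of using a training flag.

```python
import math
import statistics


def arcsinh_and_standardize(features, train_ids):
    """Apply arcsinh to all cell lines, then standardize with training statistics."""
    transformed = {cl: [math.asinh(v) for v in vec] for cl, vec in features.items()}
    n_features = len(next(iter(transformed.values())))
    means, stds = [], []
    for j in range(n_features):
        # Statistics come from the training cell lines only.
        column = [transformed[cl][j] for cl in train_ids]
        means.append(statistics.mean(column))
        stds.append(statistics.pstdev(column) or 1.0)  # guard constant columns
    return {
        cl: [(v - m) / s for v, m, s in zip(vec, means, stds)]
        for cl, vec in transformed.items()
    }


scaled = arcsinh_and_standardize(
    {"A": [0.0], "B": [10.0], "C": [100.0]}, train_ids=["A", "B"]
)
```

The held-out cell line C is transformed with the statistics of A and B, so its scaled value can fall outside the range seen during training.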
- drevalpy.models.utils.unique(array)
Get unique values ordered by first occurrence.
- Parameters:
array – array of values
- Returns:
unique values ordered by first occurrence
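Ordering by first occurrence differs from the sorted output that, for example, numpy.unique returns. A minimal plain-Python equivalent of the behaviour described here relies on dictionaries preserving insertion order; the helper name unique_in_order is hypothetical and the real function may be implemented differently.

```python
def unique_in_order(values):
    """Return unique values in the order in which they first appear."""
    return list(dict.fromkeys(values))


ordered = unique_in_order(["b", "a", "b", "c", "a"])
sorted_unique = sorted(set(["b", "a", "b", "c", "a"]))
```

Here ordered keeps "b" first because it appears first in the input, whereas sorted_unique starts with "a".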