Models

DRP Model

Contains the DRPModel class.

The DRPModel class is an abstract wrapper class for drug response prediction models.

class drevalpy.models.drp_model.DRPModel

Bases: ABC

Abstract wrapper class for drug response prediction models.

Subclasses wrap a concrete drug response prediction model. Two boolean attributes describe the model: is_single_drug_model indicates whether it is a single drug model, and early_stopping indicates whether early stopping is used.

abstractmethod build_model(hyperparameters)

Builds the model, for models that use hyperparameters.

Parameters:

hyperparameters (dict[str, Any]) – hyperparameters for the model

Return type:

None

Example:

self.model = ElasticNet(alpha=hyperparameters["alpha"], l1_ratio=hyperparameters["l1_ratio"])
abstract property cell_line_views: list[str]

Returns the sources the model needs as input for describing the cell line.

Returns:

cell line views, e.g., ["methylation", "gene_expression", "mirna_expression", "mutation"]. If the model does not use cell line features, return an empty list.

abstract property drug_views: list[str]

Returns the sources the model needs as input for describing the drug.

Returns:

drug views, e.g., ["descriptors", "fingerprints", "targets"]. If the model does not use drug features, return an empty list.

early_stopping = False
get_concatenated_features(cell_line_view, drug_view, cell_line_ids_output, drug_ids_output, cell_line_input, drug_input)

Concatenates the features to an input matrix X for the given cell line and drug views.

Parameters:
  • cell_line_view (str | None) – gene expression, methylation, etc.

  • drug_view (str | None) – ids, fingerprints, etc.

  • cell_line_ids_output (ndarray) – cell line ids

  • drug_ids_output (ndarray) – drug ids

  • cell_line_input (FeatureDataset | None) – input associated with the cell line

  • drug_input (FeatureDataset | None) – input associated with the drug

Return type:

ndarray

Returns:

X, the feature matrix needed for, e.g., sklearn models

Raises:

ValueError – if no features are provided

This can, e.g., be done in the training method to produce a large input feature matrix for the model where the rows are the samples and the columns are the cell line and drug features concatenated. This method is an alternative to using DataLoaders. It is used for models operating on the whole input matrix at once.

Example:

x = self.get_concatenated_features(
    cell_line_view="gene_expression",
    drug_view="fingerprints",
    cell_line_ids_output=output.cell_line_ids,
    drug_ids_output=output.drug_ids,
    cell_line_input=cell_line_input,
    drug_input=drug_input,
)
self.model.fit(x, output.response)
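To make the row/column layout concrete, here is a minimal plain-Python sketch of the concatenation idea. It is not the library implementation: `concatenate_features_sketch` and the dictionary-based inputs are hypothetical stand-ins for the FeatureDataset objects, chosen only to show that each sample row is the cell line features followed by the drug features.

```python
def concatenate_features_sketch(cell_line_features, drug_features, cell_line_ids, drug_ids):
    """Build one row per (cell line, drug) sample.

    Hypothetical stand-in: features are plain dicts mapping an id to a
    feature vector, not drevalpy FeatureDataset objects.
    """
    rows = []
    for cl_id, drug_id in zip(cell_line_ids, drug_ids):
        # Columns: cell line features first, then drug features.
        rows.append(list(cell_line_features[cl_id]) + list(drug_features[drug_id]))
    return rows


cell_line_features = {"CL1": [0.1, 0.2], "CL2": [0.3, 0.4]}
drug_features = {"D1": [1, 0, 1], "D2": [0, 1, 1]}

# Three samples: (CL1, D1), (CL2, D1), (CL1, D2)
x = concatenate_features_sketch(
    cell_line_features,
    drug_features,
    cell_line_ids=["CL1", "CL2", "CL1"],
    drug_ids=["D1", "D1", "D2"],
)
```

The resulting matrix has one row per entry in the id arrays and as many columns as both feature vectors combined.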
get_feature_matrices(cell_line_ids, drug_ids, cell_line_input, drug_input)

Returns the feature matrices for the given cell line and drug ids by retrieving the correct views.

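Parameters:
  • cell_line_ids – cell line ids

  • drug_ids – drug ids

  • cell_line_input – input associated with the cell line

  • drug_input – input associated with the drug

Return type: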

dict[str, ndarray]

Returns:

dictionary with the feature matrices

Raises:

ValueError – if the input does not contain the correct views

This can be used, e.g., to produce the input for the predict() method of deep learning models.

Example:

input_data = self.get_feature_matrices(
    cell_line_ids=cell_line_ids,
    drug_ids=drug_ids,
    cell_line_input=cell_line_input,
    drug_input=drug_input,
)
(
    gene_expression,
    mutations,
    cnvs
) = (
    input_data["gene_expression"],
    input_data["mutations"],
    input_data["copy_number_variation_gistic"]
)
return self.model.predict(gene_expression, mutations, cnvs)

It can also produce separate inputs for the train()/predict() methods of models that do not operate on the concatenated input matrix:

inputs = self.get_feature_matrices(
    cell_line_ids=output.cell_line_ids,
    drug_ids=output.drug_ids,
    cell_line_input=cell_line_input,
    drug_input=drug_input,
)
(
    gene_expression,
    methylation,
    mutations,
    copy_number_variation_gistic,
    fingerprints,
) = (
    inputs["gene_expression"],
    inputs["methylation"],
    inputs["mutations"],
    inputs["copy_number_variation_gistic"],
    inputs["fingerprints"],
)
self.model.fit(
    gene_expression, methylation, mutations, copy_number_variation_gistic, fingerprints, output.response
)
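The dictionary returned by get_feature_matrices can be pictured with the following plain-Python sketch. It is an illustration under assumptions, not the library code: `feature_matrices_sketch`, its explicit `cell_line_views`/`drug_views` arguments, and the nested-dict inputs are hypothetical. It shows one matrix per view, with rows aligned to the given ids, and a ValueError when a view is missing.

```python
def feature_matrices_sketch(cell_line_ids, drug_ids, cell_line_input, drug_input,
                            cell_line_views, drug_views):
    """Return {view name: rows aligned with the sample ids}.

    Hypothetical stand-in: inputs are dicts of the form
    {identifier: {view name: feature vector}}.
    """
    matrices = {}
    for views, ids, features in (
        (cell_line_views, cell_line_ids, cell_line_input),
        (drug_views, drug_ids, drug_input),
    ):
        for view in views:
            if any(view not in features[identifier] for identifier in ids):
                raise ValueError(f"view {view!r} is missing from the input")
            matrices[view] = [features[identifier][view] for identifier in ids]
    return matrices


cell_line_input = {
    "CL1": {"gene_expression": [1.0, 2.0]},
    "CL2": {"gene_expression": [3.0, 4.0]},
}
drug_input = {"D1": {"fingerprints": [1, 0, 1]}}

inputs = feature_matrices_sketch(
    cell_line_ids=["CL1", "CL2"],
    drug_ids=["D1", "D1"],
    cell_line_input=cell_line_input,
    drug_input=drug_input,
    cell_line_views=["gene_expression"],
    drug_views=["fingerprints"],
)
```

Each entry of `inputs` has one row per sample, so the matrices can be passed to a model side by side.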
classmethod get_hyperparameter_set()

Loads the hyperparameters from a yaml file which is located in the same directory as the model.

Return type:

list[dict[str, Any]]

Returns:

list of hyperparameter sets

Raises:
  • ValueError – if the hyperparameters are not in the correct format

  • KeyError – if the model is not found in the hyperparameters file
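As an illustration of what a "list of hyperparameter sets" can look like, the sketch below expands a grid into individual sets. Two things are assumptions and not confirmed by this page: that the YAML file maps each model name to a dictionary of candidate value lists, and that the sets are formed as the Cartesian product of those lists. `expand_grid` and the already-parsed `parsed_yaml` dictionary are hypothetical.

```python
from itertools import product


def expand_grid(parsed_yaml, model_name):
    """Expand {parameter: [candidate values]} into a list of parameter dicts.

    Assumption: the grid is a Cartesian product of the candidate values.
    """
    if model_name not in parsed_yaml:
        raise KeyError(f"model {model_name!r} not found in the hyperparameters file")
    grid = parsed_yaml[model_name]
    names = sorted(grid)
    return [
        dict(zip(names, combination))
        for combination in product(*(grid[name] for name in names))
    ]


# Stand-in for the parsed content of a YAML file (hypothetical layout).
parsed_yaml = {"ElasticNet": {"alpha": [0.1, 1.0], "l1_ratio": [0.2, 0.5]}}
hyperparameter_sets = expand_grid(parsed_yaml, "ElasticNet")
```

Each dictionary in `hyperparameter_sets` has the shape that build_model(hyperparameters) expects as its argument.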

abstractmethod classmethod get_model_name()

Returns the name of the model.

Return type:

str

Returns:

model name

is_single_drug_model = False
classmethod load(directory)

Load a model, including trainable parameters, hyperparameters, scalers, and encoders.

This method should fully reconstruct an instance of the model using the files in the specified directory.

Only needs to be implemented if a final production model should be saved with the DrEval evaluation framework.

Parameters:

directory (str) – Source directory containing the saved model files

Raises:

NotImplementedError – if the method is not implemented by the subclass

Return type:

DRPModel

abstractmethod load_cell_line_features(data_path, dataset_name)

Load the cell line features before the train/predict method is called.

Must be implemented by all models. Could, e.g., call get_multiomics_feature_dataset() or load_and_select_gene_features() from models/utils.py.

Parameters:
  • data_path (str) – path to the data, e.g., data/

  • dataset_name (str) – name of the dataset, e.g., “GDSC2”

Return type:

FeatureDataset

Returns:

FeatureDataset with the cell line features

abstractmethod load_drug_features(data_path, dataset_name)

Load the drug features before the train/predict method is called.

Must be implemented by all models that use drug features. Could, e.g., call load_drug_fingerprint_features() or load_drug_ids_from_csv() from models/utils.py.

For single drug models, this method can return None.

Parameters:
  • data_path (str) – path to the data, e.g., data/

  • dataset_name (str) – name of the dataset, e.g., “GDSC2”

Return type:

FeatureDataset | None

Returns:

FeatureDataset or None

abstractmethod predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the response for the given input.

Parameters:
  • drug_ids (ndarray) – list of drug ids. Also used for single drug models, where it is an array that repeats the same drug id

  • cell_line_ids (ndarray) – list of cell line ids

  • cell_line_input (FeatureDataset) – input associated with the cell line, required for all models

  • drug_input (FeatureDataset | None) – input associated with the drug, optional because single drug models do not use drug features

Return type:

ndarray

Returns:

predicted response

save(directory)

Save the model, including trainable parameters, hyperparameters, scalers, and encoders.

This method should serialize all necessary components to allow full reconstruction of the model later via the load method.

Only needs to be implemented if a final production model should be saved with the DrEval evaluation framework.

Parameters:

directory (str) – Target directory where the model and metadata should be saved

Raises:

NotImplementedError – if the method is not implemented by the subclass

Return type:

None
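The contract between save and load is a round trip: everything written to the directory must be enough to rebuild an equivalent instance. The following sketch shows that pattern on a toy class. `TinyModel`, its attributes, and the file names are hypothetical; a real model would additionally persist fitted scalers, encoders, or network weights in whatever format suits them.

```python
import json
import tempfile
from pathlib import Path


class TinyModel:
    """Toy stand-in (not a drevalpy class) that only stores two attributes."""

    def __init__(self, hyperparameters, weights):
        self.hyperparameters = hyperparameters
        self.weights = weights

    def save(self, directory):
        # Serialize every component needed to reconstruct the model.
        path = Path(directory)
        path.mkdir(parents=True, exist_ok=True)
        (path / "hyperparameters.json").write_text(json.dumps(self.hyperparameters))
        (path / "weights.json").write_text(json.dumps(self.weights))

    @classmethod
    def load(cls, directory):
        # Rebuild an instance purely from the files in the directory.
        path = Path(directory)
        hyperparameters = json.loads((path / "hyperparameters.json").read_text())
        weights = json.loads((path / "weights.json").read_text())
        return cls(hyperparameters, weights)


with tempfile.TemporaryDirectory() as tmp:
    TinyModel({"alpha": 0.5}, [0.1, -0.3]).save(tmp)
    restored = TinyModel.load(tmp)
```

Because load is a classmethod, it can be called without an existing instance, which matches the classmethod load(directory) signature documented above.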

abstractmethod train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')

Trains the model.

Parameters:
  • output (DrugResponseDataset) – training data associated with the response output

  • cell_line_input (FeatureDataset) – input associated with the cell line, required for all models

  • drug_input (FeatureDataset | None) – input associated with the drug, optional because single drug models do not use drug features

  • output_earlystopping (DrugResponseDataset | None) – optional early stopping dataset

  • model_checkpoint_dir (str) – directory to save the model checkpoints

Return type:

None
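To show how the pieces of the interface fit together, here is a self-contained sketch of the subclassing pattern. It deliberately does not import drevalpy: `SketchDRPModel` is a hypothetical stand-in that mirrors only part of the documented interface (the feature loading methods and the early stopping arguments of train are omitted), and `MeanResponseModel` is a toy model that uses a plain dict in place of a DrugResponseDataset.

```python
from abc import ABC, abstractmethod
from statistics import mean


class SketchDRPModel(ABC):
    """Hypothetical stand-in mirroring part of the documented interface."""

    is_single_drug_model = False
    early_stopping = False

    @classmethod
    @abstractmethod
    def get_model_name(cls) -> str: ...

    @property
    @abstractmethod
    def cell_line_views(self) -> list[str]: ...

    @property
    @abstractmethod
    def drug_views(self) -> list[str]: ...

    @abstractmethod
    def build_model(self, hyperparameters: dict) -> None: ...

    @abstractmethod
    def train(self, output, cell_line_input, drug_input=None) -> None: ...

    @abstractmethod
    def predict(self, cell_line_ids, drug_ids, cell_line_input, drug_input=None): ...


class MeanResponseModel(SketchDRPModel):
    """Toy model: predicts the mean training response for every sample."""

    @classmethod
    def get_model_name(cls) -> str:
        return "MeanResponse"

    @property
    def cell_line_views(self) -> list[str]:
        return []  # no cell line features are used

    @property
    def drug_views(self) -> list[str]:
        return []  # no drug features are used

    def build_model(self, hyperparameters: dict) -> None:
        self.mean_ = None  # nothing to configure for this toy model

    def train(self, output, cell_line_input, drug_input=None) -> None:
        self.mean_ = mean(output["response"])

    def predict(self, cell_line_ids, drug_ids, cell_line_input, drug_input=None):
        return [self.mean_] * len(cell_line_ids)


model = MeanResponseModel()
model.build_model({})
model.train({"response": [1.0, 2.0, 3.0]}, cell_line_input=None)
predictions = model.predict(["CL1", "CL2"], ["D1", "D1"], cell_line_input=None)
```

Instantiating the abstract base directly fails with a TypeError, which is how the ABC machinery enforces that every abstract method is implemented.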

Utility functions

Utility functions for loading and processing data.

class drevalpy.models.utils.ProteomicsMedianCenterAndImputeTransformer(feature_threshold=0.7, n_features=1000, normalization_downshift=1.8, normalization_width=0.3)

Bases: BaseEstimator, TransformerMixin

Performs median centering and imputation of proteomics data.

fit(X, y=None)

Learns the top n_features complete proteins and calculates the mean median of the training cell lines.

Parameters:
  • X – input proteomics data

  • y – not used

Returns:

self

transform(X)

Median center the data and impute missing values with downshifted normal distribution.

Parameters:

X – input proteomics data

Returns:

transformed proteomics data
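The sketch below illustrates median centering followed by imputation from a downshifted normal distribution for a single sample. It is an assumption-laden illustration, not the transformer's code: it assumes the common scheme in which missing values are drawn from a normal distribution with mean `mean - downshift * std` and standard deviation `width * std` of the observed values. The function name and the `target_median` argument are hypothetical; the default values are taken from the constructor signature above.

```python
import math
import random
from statistics import mean, median, stdev


def median_center_and_impute(sample, target_median, downshift=1.8, width=0.3, rng=None):
    """Center one sample on target_median, then fill NaNs from a downshifted normal.

    Assumed scheme: imputed ~ Normal(mean - downshift * std, width * std),
    with mean and std computed on the centered observed values.
    """
    rng = rng or random.Random(0)
    observed = [value for value in sample if not math.isnan(value)]
    shift = target_median - median(observed)
    centered = [value + shift for value in observed]
    mu, sigma = mean(centered), stdev(centered)
    return [
        rng.gauss(mu - downshift * sigma, width * sigma) if math.isnan(value) else value + shift
        for value in sample
    ]


sample = [10.0, 12.0, math.nan, 14.0]
transformed = median_center_and_impute(sample, target_median=20.0)
```

Observed values keep their relative spacing and are only shifted, while the missing entry is replaced by a draw that tends to lie below the observed values.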

class drevalpy.models.utils.VarianceFeatureSelector(view, k=1000)

Bases: object

Selects the top-k features with highest variance for a specific omics view.

Stores a boolean mask after fitting on training data and applies it consistently to other datasets.

Parameters:
  • view – omics view to select features from

  • k – number of highest-variance features to keep

fit(cell_line_input, output)

Fit the selector to the training data by computing a variance-based mask.

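Parameters:
  • cell_line_input – FeatureDataset to fit the selector on

  • output – training data associated with the response output

Return type: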

None

transform(cell_line_input)

Apply the feature mask to reduce the dataset to selected features.

Parameters:

cell_line_input (FeatureDataset) – FeatureDataset to transform

Return type:

FeatureDataset

Returns:

reduced FeatureDataset

Raises:

RuntimeError – if selector was not fitted
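The fit/transform split can be sketched in plain Python as follows. `TopKVarianceSelector` is a hypothetical stand-in that works on lists of rows, not on FeatureDataset objects, and it uses the population variance; the library class may compute the variance differently.

```python
from statistics import pvariance


class TopKVarianceSelector:
    """Hypothetical stand-in: keep the k columns with the highest variance."""

    def __init__(self, k):
        self.k = k
        self.mask = None

    def fit(self, rows):
        # Column-wise variance over the training rows.
        variances = [pvariance(column) for column in zip(*rows)]
        ranked = sorted(range(len(variances)), key=lambda j: variances[j], reverse=True)
        keep = set(ranked[: self.k])
        self.mask = [j in keep for j in range(len(variances))]

    def transform(self, rows):
        if self.mask is None:
            raise RuntimeError("selector was not fitted")
        # The same mask is applied to any dataset, preserving column order.
        return [[value for value, keep in zip(row, self.mask) if keep] for row in rows]


train_rows = [[1.0, 5.0, 0.0], [1.0, 9.0, 2.0], [1.0, 1.0, 4.0]]
selector = TopKVarianceSelector(k=2)
selector.fit(train_rows)
reduced = selector.transform([[7.0, 8.0, 9.0]])
```

Storing the mask at fit time is what keeps the selected columns identical between training and test data.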

drevalpy.models.utils.get_multiomics_feature_dataset(data_path, dataset_name, gene_lists=None, omics=None)

Get multiomics feature dataset for the given list of OMICs.

Parameters:
  • data_path (str) – path to the data, e.g., data/

  • dataset_name (str) – name of the dataset, e.g., GDSC2

  • gene_lists (dict | None) – dictionary mapping each omics type to the name of the gene list to include, e.g., {"gene_expression": "landmark_genes_reduced"}. If None, the features are not reduced

  • omics (list[str] | None) – list of omics to include, e.g., ["gene_expression", "methylation"]

Return type:

FeatureDataset

Returns:

FeatureDataset with the multiomics features

Raises:

ValueError – if no omics features are found

drevalpy.models.utils.iterate_features(df, feature_type)

Iterate over features.

Parameters:
  • df (DataFrame) – DataFrame with the features

  • feature_type (str) – type of feature, e.g., gene_expression, methylation, etc.

Return type:

dict[str, dict[str, ndarray]]

Returns:

dictionary with the features

drevalpy.models.utils.load_and_select_gene_features(feature_type, gene_list, data_path, dataset_name)

Load and reduce features of a single feature type, ensuring selection and ordering based on the gene list.

Attention: if gene_list is None, all features are loaded, which can be problematic for cross study prediction.

Parameters:
  • feature_type (str) – type of feature, e.g., gene_expression, methylation, etc.

  • gene_list (str | None) – name of the gene list to include, e.g., landmark_genes

  • data_path (str) – path to the data, e.g., data/

  • dataset_name (str) – name of the dataset, e.g., GDSC2

Return type:

FeatureDataset

Returns:

FeatureDataset with the reduced features

Raises:

ValueError – if genes from gene_list are missing in the dataset

drevalpy.models.utils.load_cl_ids_from_csv(path, dataset_name)

Load cell line ids from csv file.

Parameters:
  • path (str) – path to the data, e.g., data/

  • dataset_name (str) – name of the dataset, e.g., GDSC2

Return type:

FeatureDataset

Returns:

FeatureDataset with the cell line ids

drevalpy.models.utils.load_drug_fingerprint_features(data_path, dataset_name, fill_na=True, n_bits=128)

Load drug features from fingerprints.

Parameters:
  • data_path (str) – path to the data, e.g., data/

  • dataset_name (str) – name of the dataset, e.g., GDSC2

  • fill_na – whether to use default pubchemid-hashed fingerprints if fingerprint is not available

  • n_bits – number of bits in the fingerprint

Return type:

FeatureDataset

Returns:

FeatureDataset with the drug fingerprints

drevalpy.models.utils.load_drug_ids_from_csv(data_path, dataset_name)

Load drug ids from csv file.

Parameters:
  • data_path (str) – path to the data, e.g., data/

  • dataset_name (str) – name of the dataset, e.g., GDSC2

Return type:

FeatureDataset

Returns:

FeatureDataset with the drug ids

drevalpy.models.utils.load_tissues_from_csv(path, dataset_name)

Load tissues from csv file.

Parameters:
  • path (str) – path to the data, e.g., data/

  • dataset_name (str) – name of the dataset, e.g., GDSC2

Return type:

FeatureDataset

Returns:

FeatureDataset with the tissues

drevalpy.models.utils.log10_and_set_na(x)

Log10 transform and set NaN for infinite values.

Parameters:

x – input array

Returns:

log10 transformed array with NaN for infinite values
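A plain-Python sketch of the same idea is shown below. The library function operates on arrays; this hypothetical `log10_and_set_na_sketch` works on a list and maps every input whose logarithm would be infinite or undefined (zero, infinity, negative values) directly to NaN.

```python
import math


def log10_and_set_na_sketch(values):
    """Log10-transform values; inputs without a finite log10 become NaN."""
    result = []
    for value in values:
        if value > 0 and math.isfinite(value):
            result.append(math.log10(value))
        else:
            # log10(0) would be -inf and log10(inf) would be inf.
            result.append(math.nan)
    return result


transformed = log10_and_set_na_sketch([100.0, 0.0, math.inf, 1.0])
```

Replacing infinite results with NaN lets later steps treat them like any other missing value.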

drevalpy.models.utils.prepare_expression_and_methylation(cell_line_input, cell_line_ids, training, gene_expression_scaler=None, methylation_scaler=None, methylation_pca=None)

Applies preprocessing to gene expression and optionally methylation views.

  • Applies arcsinh + scaling to gene expression if a scaler is provided.

  • Applies scaling + PCA to methylation if both a scaler and PCA are provided.

  • Transforms all cell lines in cell_line_input. If training=True, the scalers and PCA are fitted only on the given IDs.

Parameters:
  • cell_line_input (FeatureDataset) – FeatureDataset with the cell line features

  • cell_line_ids (ndarray) – IDs of the cell lines used for training or transformation

  • training (bool) – Whether to fit the scalers/PCA (True) or just apply transformation (False)

  • gene_expression_scaler (TransformerMixin | None) – Optional fitted or to-be-fitted scaler for gene expression

  • methylation_scaler (TransformerMixin | None) – Optional fitted or to-be-fitted scaler for methylation

  • methylation_pca (PCA | None) – Optional PCA transformer for methylation

Return type:

FeatureDataset

Returns:

FeatureDataset with the transformed features
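The "fit on the training IDs, transform all cell lines" behavior for gene expression can be sketched as follows. This is an illustration under assumptions: `arcsinh_and_scale` is hypothetical, the dict input stands in for a FeatureDataset, and standard scaling (zero mean, unit variance) is assumed here, whereas the real function uses whichever scaler is passed in.

```python
import math
from statistics import mean, pstdev


def arcsinh_and_scale(features, train_ids):
    """Apply arcsinh, fit a standard scaler on train_ids, transform every cell line.

    Hypothetical stand-in: features maps a cell line id to a feature vector.
    """
    transformed = {
        cell_line: [math.asinh(value) for value in vector]
        for cell_line, vector in features.items()
    }
    # Scaling statistics come from the training cell lines only.
    columns = list(zip(*(transformed[cell_line] for cell_line in train_ids)))
    means = [mean(column) for column in columns]
    stds = [pstdev(column) or 1.0 for column in columns]  # guard against zero variance
    return {
        cell_line: [(value - m) / s for value, m, s in zip(vector, means, stds)]
        for cell_line, vector in transformed.items()
    }


features = {"CL1": [0.0, 10.0], "CL2": [2.0, 30.0], "CL3": [5.0, 50.0]}
scaled = arcsinh_and_scale(features, train_ids=["CL1", "CL2"])
```

CL3 is transformed with statistics it did not contribute to, which is what prevents information from held-out cell lines leaking into the scaler.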

drevalpy.models.utils.prepare_proteomics(cell_line_input, cell_line_ids, training, transformer)

Applies log10 transform and proteomics normalization (centering + imputation) to proteomics view.

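Parameters:
  • cell_line_input – FeatureDataset with the cell line features

  • cell_line_ids – IDs of cell lines to use for fitting or transformation

  • training – whether to fit or transform

  • transformer – transformer that applies the proteomics normalization (centering + imputation)

Return type: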

FeatureDataset

Returns:

transformed FeatureDataset

drevalpy.models.utils.scale_gene_expression(cell_line_input, cell_line_ids, training, gene_expression_scaler)

Scales gene expression in place using an arcsinh transformation and a provided scaler.

Parameters:
  • cell_line_input (FeatureDataset) – FeatureDataset with the cell line features

  • cell_line_ids (ndarray) – IDs of cell lines to use for fitting or transformation

  • training (bool) – whether to fit or transform

  • gene_expression_scaler (TransformerMixin) – sklearn transformer for gene expression

Return type:

FeatureDataset

Returns:

FeatureDataset with the transformed features

drevalpy.models.utils.unique(array)

Get unique values ordered by first occurrence.

Parameters:

array – array of values

Returns:

unique values ordered by first occurrence
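Order-preserving deduplication can be sketched in plain Python with a dictionary, since dictionaries keep insertion order. `unique_in_order` is a hypothetical name for illustration; the library function may be implemented differently.

```python
def unique_in_order(values):
    """Return the distinct values, keeping the order of first occurrence."""
    # dict.fromkeys keeps only the first occurrence of each key, in order.
    return list(dict.fromkeys(values))


drug_ids = ["B", "A", "B", "C", "A"]
ordered = unique_in_order(drug_ids)
```

This differs from a sorted unique operation, which would return the values in sorted order instead of the order in which they first appear.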

Implemented models