Implemented baselines

Flexible Input System

The sklearn baseline models support flexible inputs. Rather than hardcoding which omic data type a model uses, you configure cell_line_views and drug_views directly in the hyperparameters.yaml file. A single model class (e.g., ElasticNet, RandomForest, KNNRegressor) can therefore be trained on gene expression, proteomics, or any other available omic without needing a separate Python class for each combination.

This replaces the previously separate model classes (ProteomicsRandomForest, ProteomicsElasticNet, SingleDrugProteomicsRandomForest, SingleDrugProteomicsElasticNet), which have been removed in favor of this unified approach.

Configuring the input views

The default RandomForest configuration uses gene expression and fingerprints:

RandomForest:
  cell_line_views:
    - gene_expression
  drug_views:
    - fingerprints
  n_estimators:
    - 100
  max_depth:
    - 5
    - 10
    - 30
  ...

To train the same Random Forest on proteomics data instead, change cell_line_views:

RandomForest:
  cell_line_views:
    - proteomics
  drug_views:
    - fingerprints
  n_estimators:
    - 100
  ...

For the MultiViewRandomForest, multiple cell line views can be specified as a nested list:

MultiViewRandomForest:
  cell_line_views:
    - - gene_expression
      - methylation
      - mutations
      - copy_number_variation_gistic
  drug_views:
    - fingerprints
  ...
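The nested list in the last example is easier to read if each top-level list entry is treated as one candidate value for that hyperparameter. The sketch below assumes exactly that; it is an illustration, not drevalpy's own tuning code. `expand_grid` is a hypothetical helper, and the dictionaries mirror what a YAML parser would return for the entries above.

```python
from itertools import product


def expand_grid(search_space):
    """Return one dict per combination of candidate values.

    Assumption: every value in `search_space` is a list of candidates,
    as in the YAML snippets above.
    """
    names = list(search_space)
    combos = product(*(search_space[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# What a YAML parser would yield for the default RandomForest entry
# (the elided "..." keys are left out here).
random_forest_space = {
    "cell_line_views": ["gene_expression"],
    "drug_views": ["fingerprints"],
    "n_estimators": [100],
    "max_depth": [5, 10, 30],
}

# The nested list makes the whole group of views a single candidate.
multi_view_space = {
    "cell_line_views": [
        ["gene_expression", "methylation", "mutations", "copy_number_variation_gistic"]
    ],
    "drug_views": ["fingerprints"],
}

rf_grid = expand_grid(random_forest_space)
mv_grid = expand_grid(multi_view_space)

print(len(rf_grid))                   # one combination per max_depth value
print(mv_grid[0]["cell_line_views"])  # the full list of four views
```

Under this reading, writing the four views as a flat list would instead describe four separate single-view candidates.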

How features are loaded

The feature loading depends on which view is specified in the configuration:

  • gene_expression: Loaded with the landmark_genes_reduced gene list for feature selection.

  • fingerprints: Loaded using the precomputed Morgan fingerprints provided with each dataset.

  • proteomics: Loaded as a generic CSV. The ProteomicsMedianCenterAndImputeTransformer is automatically initialized for preprocessing.

  • Any other feature name (e.g., methylation, mutations, copy_number_variation_gistic, or a custom name): The model calls load_generic_csv, which looks for a CSV file at <data_path>/<dataset_name>/<feature_name>.csv. The CSV must have cell_line_name as the index column. All columns (except cellosaurus_id, which is dropped if present) are used as features.

This means you can use any custom omic by placing a correctly formatted CSV in the dataset directory and setting cell_line_views to the file’s name (without the .csv extension).
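As a rough illustration of the expected file layout, the sketch below writes a small custom feature file and reads it back the way the description above suggests. It uses only the standard library and is not `load_generic_csv` itself; the feature name `my_custom_omic`, the dataset name `ToyDataset`, the cell line names and ids, and the helper `read_feature_csv` are all made up for the example.

```python
import csv
import tempfile
from pathlib import Path


def read_feature_csv(data_path, dataset_name, feature_name):
    """Read <data_path>/<dataset_name>/<feature_name>.csv into a dict.

    Hypothetical stand-in for the loader described above: rows are keyed
    by cell_line_name, and cellosaurus_id is dropped if present.
    """
    csv_path = Path(data_path) / dataset_name / f"{feature_name}.csv"
    features = {}
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            name = row.pop("cell_line_name")
            row.pop("cellosaurus_id", None)
            features[name] = {col: float(val) for col, val in row.items()}
    return features


# Temporary directory standing in for the real data path.
data_path = tempfile.mkdtemp()
dataset_dir = Path(data_path) / "ToyDataset"
dataset_dir.mkdir()

# Placeholder rows; the ids and values are invented for illustration.
with open(dataset_dir / "my_custom_omic.csv", "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["cell_line_name", "cellosaurus_id", "feature_a", "feature_b"])
    writer.writerow(["CellLineA", "CVCL_0001", 0.5, 1.25])
    writer.writerow(["CellLineB", "CVCL_0002", 2.0, 0.0])

custom_features = read_feature_csv(data_path, "ToyDataset", "my_custom_omic")
print(custom_features["CellLineA"])
```

With such a file in place, the matching configuration entry would list my_custom_omic under cell_line_views.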

For drug features the same logic applies: fingerprints loads the precomputed fingerprints, an empty drug_views list loads only the drug IDs, and any other name loads the CSV at <data_path>/<dataset_name>/<feature_name>.csv.

Proteomics-specific hyperparameters

When proteomics is specified as a cell line view, the following hyperparameters control the preprocessing transformer:

  • proteomics_feature_threshold (default: 0.7): minimum fraction of non-NA values required per protein

  • proteomics_n_features (default: 1000): number of top-variance features to select

  • proteomics_normalization_width (default: 0.3): width parameter for median-center normalization

  • proteomics_normalization_downshift (default: 1.8): downshift parameter for median-center normalization
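The first two hyperparameters can be pictured as a two-step feature filter. The sketch below is a simplified stand-in under stated assumptions, not the ProteomicsMedianCenterAndImputeTransformer: it assumes missing values are represented as `None` and that variance is computed over the observed values only, and it leaves out the normalization controlled by the other two hyperparameters. The function name `select_proteins` and the toy data are hypothetical.

```python
from statistics import pvariance


def select_proteins(table, feature_threshold=0.7, n_features=1000):
    """Filter proteins by completeness, then keep the most variable ones.

    `table` maps protein name -> list of values across cell lines,
    with None marking a missing (NA) measurement.
    """
    variances = {}
    for protein, values in table.items():
        observed = [v for v in values if v is not None]
        # Step 1: require a minimum fraction of non-NA values.
        if len(observed) / len(values) < feature_threshold:
            continue
        variances[protein] = pvariance(observed)
    # Step 2: keep the n_features proteins with the highest variance.
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:n_features]


toy_table = {
    "P1": [1.0, 2.0, 3.0, 4.0],    # complete, high variance
    "P2": [1.0, 1.1, None, 1.0],   # 75% observed, low variance
    "P3": [5.0, None, None, 0.0],  # only 50% observed, filtered out at 0.7
}

selected = select_proteins(toy_table, feature_threshold=0.7, n_features=2)
print(selected)
```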

Naive Predictors

Simple mean-based predictors that serve as lower-bound baselines. These models do not use any cell line or drug features. They predict the mean response value computed from the training set, aggregated at different levels (global, per drug, per cell line, per tissue, or per tissue-drug combination).

Implements the naive predictor models.

The naive predictor models are simple models that predict the mean of the response values. The NaivePredictor predicts the overall mean of the response, the NaiveCellLineMeanPredictor predicts the mean of the response per cell line, and the NaiveDrugMeanPredictor predicts the mean of the response per drug. The NaiveTissueMeanPredictor predicts the mean of the response per tissue. The NaiveTissueDrugMeanPredictor predicts the mean of the response per tissue-drug combination. The NaiveMeanEffectsPredictor predicts the response as the overall mean plus the cell line effect plus the drug effect and should be the strongest naive baseline.
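All of the per-group predictors follow the same pattern: average the training responses within each group and fall back to the dataset mean for groups that were not seen. A minimal sketch of that pattern, assuming plain Python lists as inputs (the function names and toy values are illustrative and not part of drevalpy):

```python
from collections import defaultdict
from statistics import mean


def fit_group_means(group_keys, responses):
    """Return (dataset_mean, {group: mean response}) from training data."""
    grouped = defaultdict(list)
    for key, response in zip(group_keys, responses):
        grouped[key].append(response)
    group_means = {key: mean(values) for key, values in grouped.items()}
    return mean(responses), group_means


def predict_group_means(group_keys, dataset_mean, group_means):
    """Look up each group's mean; unseen groups get the dataset mean."""
    return [group_means.get(key, dataset_mean) for key in group_keys]


# Toy training data: drug ids and one response value per measurement.
train_drugs = ["drugA", "drugA", "drugB", "drugB"]
train_response = [1.0, 3.0, 5.0, 7.0]

dataset_mean, drug_means = fit_group_means(train_drugs, train_response)
predictions = predict_group_means(["drugA", "drugB", "drugC"], dataset_mean, drug_means)
print(predictions)  # [2.0, 6.0, 4.0]; drugC is unseen and gets the dataset mean
```

Using cell line names, tissues, or (tissue, drug) tuples as the group keys gives the other variants.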

class drevalpy.models.baselines.naive_pred.NaiveCellLineMeanPredictor

Bases: NaiveModel

Naive predictor model that predicts the mean of the response per cell line.

cell_line_views = ['cell_line_name']
drug_views = ['pubchem_id']
classmethod get_model_name()

Returns the model name.

Return type:

str

Returns:

NaiveCellLineMeanPredictor

load_cell_line_features(data_path, dataset_name)

Loads the cell line features, in this case the cell line ids.

Parameters:
  • data_path (str) – path to the data

  • dataset_name (str) – name of the dataset

Return type:

FeatureDataset

Returns:

FeatureDataset containing the cell line ids

load_drug_features(data_path, dataset_name)

Loads the drug features.

Parameters:
  • data_path (str) – Path to the data.

  • dataset_name (str) – Name of the dataset.

Return type:

FeatureDataset

Returns:

FeatureDataset containing the drug IDs.

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the cell line mean for each drug-cell line combination.

If the cell line is not in the training set, the dataset mean is used.

Return type:

ndarray

Returns:

array of the same length as the input cell_line_ids, containing the cell line mean for each entry

predict_cl(cl_id)

Predicts the mean of the response for a given cell line.

If the cell line is not in the training set, the dataset mean is used.

Parameters:

cl_id (str) – Cell line ID

Return type:

float

Returns:

predicted response

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='None')

Computes the mean per cell line.

If a cell line encountered later at prediction time is not in the training set, the overall mean is used.

Return type:

None

class drevalpy.models.baselines.naive_pred.NaiveDrugMeanPredictor

Bases: NaiveModel

Naive predictor model that predicts the mean of the response per drug.

cell_line_views = ['cell_line_name']
drug_views = ['pubchem_id']
classmethod get_model_name()

Returns the model name.

Return type:

str

Returns:

NaiveDrugMeanPredictor

load_cell_line_features(data_path, dataset_name)

Loads the cell line features.

Parameters:
  • data_path (str) – Path to the data.

  • dataset_name (str) – Name of the dataset.

Return type:

FeatureDataset

Returns:

FeatureDataset containing the cell line IDs.

load_drug_features(data_path, dataset_name)

Loads the drug features, in this case the drug ids.

Parameters:
  • data_path (str) – path to the data

  • dataset_name (str) – name of the dataset

Return type:

FeatureDataset

Returns:

FeatureDataset containing the drug ids

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the drug mean for each drug-cell line combination.

If the drug is not in the training set, the dataset mean is used.

Return type:

ndarray

Returns:

array of the same length as the input drug_ids, containing the drug mean for each entry

predict_drug(drug_id)

Predicts the mean of the response for a given drug.

If the drug is not in the training set, the dataset mean is used.

Parameters:

drug_id (str) – ID of the drug

Returns:

predicted response

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='None')

Computes the mean per drug. If a drug encountered later at prediction time is not in the training set, the overall mean is used.

Raises:

ValueError – If drug_input is None

Return type:

None

class drevalpy.models.baselines.naive_pred.NaiveMeanEffectsPredictor

Bases: NaiveModel

ANOVA-like predictor model.

Predicts the response as: response = overall_mean + cell_line_effect + drug_effect.

Here:
  • cell_line_effect = (cell line mean - overall_mean)

  • drug_effect = (drug mean - overall_mean)

This formulation ensures that the overall mean is not counted twice.
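Substituting the two effect definitions shows why: overall_mean + (cell line mean - overall_mean) + (drug mean - overall_mean) simplifies to cell line mean + drug mean - overall_mean. The short sketch below works through that arithmetic with made-up numbers. The function name is hypothetical, and passing `None` models a cell line or drug that was not seen during training, whose effect is then zero.

```python
def mean_effects_prediction(overall_mean, cell_line_mean=None, drug_mean=None):
    """Combine the overall mean with additive cell line and drug effects.

    A mean of None stands for a cell line or drug that was not seen
    during training; its effect is then zero.
    """
    cell_line_effect = 0.0 if cell_line_mean is None else cell_line_mean - overall_mean
    drug_effect = 0.0 if drug_mean is None else drug_mean - overall_mean
    return overall_mean + cell_line_effect + drug_effect


# Made-up training statistics.
overall = 4.0
cell_line_mean = 3.0  # this cell line's mean response is below the overall mean
drug_mean = 2.5       # this drug's mean response is below the overall mean

print(mean_effects_prediction(overall, cell_line_mean, drug_mean))  # 1.5
print(mean_effects_prediction(overall, None, drug_mean))            # 2.5
print(mean_effects_prediction(overall))                             # 4.0
```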

cell_line_views = ['cell_line_name']
drug_views = ['pubchem_id']
classmethod get_model_name()

Returns the name of the model.

Return type:

str

Returns:

The name of the model as a string.

load_cell_line_features(data_path, dataset_name)

Loads the cell line features.

Parameters:
  • data_path (str) – Path to the data.

  • dataset_name (str) – Name of the dataset.

Return type:

FeatureDataset

Returns:

FeatureDataset containing the cell line IDs.

load_drug_features(data_path, dataset_name)

Loads the drug features.

Parameters:
  • data_path (str) – Path to the data.

  • dataset_name (str) – Name of the dataset.

Return type:

FeatureDataset

Returns:

FeatureDataset containing the drug IDs.

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts responses for given cell line and drug pairs.

The prediction is computed as:

prediction = overall_mean + cell_line_effect + drug_effect

If a cell line or drug has not been seen during training, their effect is set to zero.

Return type:

ndarray

Returns:

NumPy array of predicted responses.

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')

Trains with overall mean, cell line effects, and drug effects.

Raises:

ValueError – If drug_input is None.

Return type:

None

class drevalpy.models.baselines.naive_pred.NaiveModel

Bases: DRPModel

Base class for all naive predictor models which are based on simple dataset stats.

This class provides a shared interface and save/load mechanism for simple statistical models that predict drug response based on dataset means, stratified by drug, cell line, or tissue.

build_model(hyperparameters)

Builds the model.

Naive models do not require any hyperparameter tuning.

Parameters:

hyperparameters (dict) – Dictionary of hyperparameters (not used).

classmethod load(directory)

Loads the model parameters from the given directory.

Reads the ‘naive_model.json’ file and initializes a NaiveModel instance with the loaded parameters.

Parameters:

directory (str) – Path to the directory where the model is saved.

Return type:

NaiveModel

Returns:

An instance of NaiveModel with the loaded parameters.

save(directory)

Saves the model parameters to the given directory.

Serializes dataset_mean and any available subclass-specific attributes to a JSON file named ‘naive_model.json’. Creates the directory if it doesn’t exist.

Parameters:

directory (str) – Path to the directory where the model will be saved.

Return type:

None

class drevalpy.models.baselines.naive_pred.NaivePredictor

Bases: NaiveModel

Naive predictor model that predicts the overall mean of the response.

cell_line_views = ['cell_line_name']
drug_views = ['pubchem_id']
classmethod get_model_name()

Returns the model name.

Return type:

str

Returns:

NaivePredictor

load_cell_line_features(data_path, dataset_name)

Loads the cell line features, in this case the cell line ids.

Parameters:
  • data_path (str) – path to the data

  • dataset_name (str) – name of the dataset

Return type:

FeatureDataset

Returns:

FeatureDataset containing the cell line ids

load_drug_features(data_path, dataset_name)

Loads the drug features, in this case the drug ids.

Parameters:
  • data_path (str) – path to the data

  • dataset_name (str) – name of the dataset

Return type:

FeatureDataset

Returns:

FeatureDataset containing the drug ids

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the dataset mean for each drug-cell line combination.

Return type:

ndarray

Returns:

array of the same length as the input cell_line_ids, containing the dataset mean

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')

Computes the overall mean of the output response values and stores it.

Return type:

None

class drevalpy.models.baselines.naive_pred.NaiveTissueDrugMeanPredictor

Bases: NaiveModel

Naive predictor model that predicts the mean of the response per tissue-drug combination.

This model combines tissue and drug information to predict the mean response aggregated across all cell lines from the same tissue tested on the same drug. If a (tissue, drug) combination was not seen during training, it falls back to the overall dataset mean.

cell_line_views = ['tissue']
drug_views = ['pubchem_id']
classmethod get_model_name()

Returns the model name.

Return type:

str

Returns:

NaiveTissueDrugMeanPredictor

classmethod load(directory)

Loads the model parameters from the given directory.

Overrides the base class load method to convert string keys back to tuple keys.

Parameters:

directory (str) – Path to the directory where the model is saved.

Return type:

NaiveTissueDrugMeanPredictor

Returns:

An instance of NaiveTissueDrugMeanPredictor with the loaded parameters.

load_cell_line_features(data_path, dataset_name)

Loads the cell line features, in this case the tissue annotations.

Parameters:
  • data_path (str) – path to the data

  • dataset_name (str) – name of the dataset

Return type:

FeatureDataset

Returns:

FeatureDataset containing the tissue ids

load_drug_features(data_path, dataset_name)

Loads the drug features, in this case the drug ids.

Parameters:
  • data_path (str) – path to the data

  • dataset_name (str) – name of the dataset

Return type:

FeatureDataset

Returns:

FeatureDataset containing the drug ids

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the tissue-drug mean for each drug-cell line combination.

If the (tissue, drug) combination is not in the training set, the dataset mean is used.

Parameters:
  • cell_line_ids (ndarray) – cell line ids

  • drug_ids (ndarray) – drug ids (used directly, following NaiveDrugMeanPredictor pattern)

  • cell_line_input (FeatureDataset) – tissue features

  • drug_input (FeatureDataset | None) – not needed

Return type:

ndarray

Returns:

array of the same length as the input containing the tissue-drug mean or dataset mean

save(directory)

Saves the model parameters to the given directory.

Overrides the base class save method to handle tuple keys in tissue_drug_means by converting them to JSON-serializable string keys.

Parameters:

directory (str) – Path to the directory where the model will be saved.

Return type:

None
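JSON object keys must be strings, which is why the tuple keys need converting. The following is a minimal sketch of one way to do such a round trip. The separator, the helper names, and the toy values are assumptions, and the actual on-disk encoding used by the class may differ.

```python
import json

SEPARATOR = "||"  # assumed separator; must not occur in tissue or drug ids


def encode_keys(tissue_drug_means):
    """Turn (tissue, drug) tuple keys into 'tissue||drug' string keys."""
    return {
        f"{tissue}{SEPARATOR}{drug}": value
        for (tissue, drug), value in tissue_drug_means.items()
    }


def decode_keys(encoded):
    """Invert encode_keys, restoring the (tissue, drug) tuple keys."""
    return {tuple(key.split(SEPARATOR, 1)): value for key, value in encoded.items()}


# Toy statistics with invented tissue names and drug ids.
tissue_drug_means = {("lung", "1234"): 0.8, ("skin", "5678"): 1.6}

payload = json.dumps(
    {"dataset_mean": 1.2, "tissue_drug_means": encode_keys(tissue_drug_means)}
)
restored = decode_keys(json.loads(payload)["tissue_drug_means"])
print(restored == tissue_drug_means)  # True
```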

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='None')

Computes the mean per tissue-drug combination. Falls back to the overall mean for unknown combinations.

Raises:

ValueError – If drug_input is None.

Return type:

None

class drevalpy.models.baselines.naive_pred.NaiveTissueMeanPredictor

Bases: NaiveModel

Naive predictor model that predicts the mean of the response per tissue.

cell_line_views = ['tissue']
drug_views = []
classmethod get_model_name()

Returns the model name.

Return type:

str

Returns:

NaiveTissueMeanPredictor

load_cell_line_features(data_path, dataset_name)

Loads the cell line features, in this case the tissue annotations.

Parameters:
  • data_path (str) – path to the data

  • dataset_name (str) – name of the dataset

Return type:

FeatureDataset

Returns:

FeatureDataset containing the tissue ids

load_drug_features(data_path, dataset_name)

Loads the drug features.

Parameters:
  • data_path (str) – Path to the data.

  • dataset_name (str) – Name of the dataset.

Return type:

FeatureDataset

Returns:

FeatureDataset containing the drug IDs.

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the tissue mean for each drug-cell line combination.

If the tissue is not in the training set, the dataset mean is used.

Return type:

ndarray

Returns:

array of the same length as the input cell_line_ids, containing the tissue mean for each entry

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='None')

Computes the mean per tissue. Falls back to the overall mean for unknown tissues.

Return type:

None

Sklearn Models

Scikit-learn-based models for drug response prediction. All models in this module support flexible inputs (see Flexible Input System above). By default they concatenate cell line features and drug features into a single input matrix. Available models: ElasticNetModel, LassoModel, RandomForest, SVMRegressor, KNNRegressor, GradientBoosting, and AdaBoostDecisionTree.

Contains sklearn baseline models: ElasticNet, Lasso, RandomForest, SVM, KNN, GradientBoosting, AdaBoost.
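To make the concatenation concrete, the sketch below builds one input row per (cell line, drug) pair by joining the two feature vectors. It assumes features are held in plain dictionaries keyed by id. The names and toy vectors are hypothetical, and drevalpy's own FeatureDataset handling is not reproduced here.

```python
def build_input_matrix(cell_line_ids, drug_ids, cell_line_features, drug_features):
    """Concatenate cell line and drug feature vectors row by row."""
    rows = []
    for cell_line_id, drug_id in zip(cell_line_ids, drug_ids):
        rows.append(cell_line_features[cell_line_id] + drug_features[drug_id])
    return rows


# Toy features: 3 "genes" per cell line and a 4-bit "fingerprint" per drug.
cell_line_features = {"CL1": [0.1, 0.2, 0.3], "CL2": [1.1, 1.2, 1.3]}
drug_features = {"D1": [1, 0, 0, 1], "D2": [0, 1, 1, 0]}

X = build_input_matrix(
    ["CL1", "CL1", "CL2"], ["D1", "D2", "D1"], cell_line_features, drug_features
)
print(len(X), len(X[0]))  # 3 rows, 3 + 4 = 7 columns
```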

class drevalpy.models.baselines.sklearn_models.AdaBoostDecisionTree

Bases: SklearnModel

AdaBoost model using Decision Trees as weak learners for drug response prediction.

build_model(hyperparameters)

Builds the model from hyperparameters.

Parameters:

hyperparameters (dict) – Hyperparameters for the model. Contains n_estimators, max_depth, min_samples_split and min_samples_leaf.

classmethod get_model_name()

Returns the model name.

Return type:

str

Returns:

AdaBoostDecisionTree

class drevalpy.models.baselines.sklearn_models.ElasticNetModel

Bases: SklearnModel

ElasticNet model for drug response prediction.

build_model(hyperparameters)

Builds the ElasticNet model from hyperparameters.

Parameters:

hyperparameters (dict) – Contains L1 ratio and alpha.

classmethod get_model_name()

Returns the model name.

Return type:

str

Returns:

ElasticNet

class drevalpy.models.baselines.sklearn_models.GradientBoosting

Bases: SklearnModel

Gradient Boosting model for drug response prediction.

build_model(hyperparameters)

Builds the model from hyperparameters.

Parameters:

hyperparameters (dict) – Hyperparameters for the model. Contains n_estimators, learning_rate, max_depth, and subsample

classmethod get_model_name()

Returns the model name.

Return type:

str

Returns:

GradientBoosting

class drevalpy.models.baselines.sklearn_models.KNNRegressor

Bases: SklearnModel

k-nearest neighbors regression model for drug response prediction.

build_model(hyperparameters)

Builds the model from hyperparameters.

Parameters:

hyperparameters (dict) – Hyperparameters for the model. Contains neighbors, weights.

classmethod get_model_name()

Returns the model name.

Return type:

str

Returns:

KNNRegressor

class drevalpy.models.baselines.sklearn_models.LassoModel

Bases: SklearnModel

Lasso regression model for drug response prediction.

build_model(hyperparameters)

Builds the Lasso model from hyperparameters.

Parameters:

hyperparameters (dict) – Contains alpha.

classmethod get_model_name()

Returns the model name.

Return type:

str

Returns:

Lasso

class drevalpy.models.baselines.sklearn_models.RandomForest

Bases: SklearnModel

RandomForest model for drug response prediction.

build_model(hyperparameters)

Builds the model from hyperparameters.

Parameters:

hyperparameters (dict) – Hyperparameters for the model. Contains n_estimators, criterion, max_samples, max_depth and n_jobs.

classmethod get_model_name()

Returns the model name.

Return type:

str

Returns:

RandomForest

class drevalpy.models.baselines.sklearn_models.SVMRegressor

Bases: SklearnModel

SVM model for drug response prediction.

build_model(hyperparameters)

Builds the model from hyperparameters.

Parameters:

hyperparameters (dict) – Hyperparameters for the model. Contains kernel, C, epsilon, and max_iter.

classmethod get_model_name()

Returns the model name.

Return type:

str

Returns:

SVR (Support Vector Regressor)

class drevalpy.models.baselines.sklearn_models.SklearnModel

Bases: DRPModel

Parent class that contains the common methods for the sklearn models.

build_model(hyperparameters)

Builds the model from hyperparameters.

Flexible input support: Initializes the cell_line_views and drug_views to the values specified in the hyperparameters.yaml file. If nothing is specified, gene_expression and fingerprints are used.

If proteomics is specified in the hyperparameters, the ProteomicsMedianCenterAndImputeTransformer is initialized.

Parameters:

hyperparameters (dict) – Custom hyperparameters for the model; these have to be defined in the child class.
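The fallback behaviour described above could look roughly like the following. This is a hedged sketch rather than the actual build_model body: the function `resolve_views` is hypothetical, and only the two keys and their defaults are taken from the description.

```python
def resolve_views(hyperparameters):
    """Pick the input views from the hyperparameters, with defaults."""
    cell_line_views = hyperparameters.get("cell_line_views", "gene_expression")
    drug_views = hyperparameters.get("drug_views", "fingerprints")
    # Accept a single view name or a list of names (an assumption).
    if isinstance(cell_line_views, str):
        cell_line_views = [cell_line_views]
    if isinstance(drug_views, str):
        drug_views = [drug_views]
    # Proteomics needs its own preprocessing transformer.
    uses_proteomics = "proteomics" in cell_line_views
    return cell_line_views, drug_views, uses_proteomics


print(resolve_views({"n_estimators": 100}))
print(resolve_views({"cell_line_views": "proteomics", "n_estimators": 100}))
```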

cell_line_views = []
drug_views = []
classmethod get_model_name()

Returns the model name.

Raises:

NotImplementedError – If the method is not implemented in the child class.

Return type:

str

classmethod load(directory)

Load a trained sklearn-based model and its preprocessing components from disk.

Loads:
  • model.pkl: the trained sklearn model

  • hyperparameters.json: model hyperparameters (optional)

  • scaler.pkl: gene expression scaler (optional)

  • proteomics_transformer.pkl: proteomics transformer (optional)

Parameters:

directory (str) – path to the directory where model files are stored

Return type:

SklearnModel

Returns:

an instance of the model with restored state

Raises:

FileNotFoundError – if model.pkl is missing

load_cell_line_features(data_path, dataset_name)

Loads the cell line features for a single-view sklearn model.

Parameters:
  • data_path (str) – Path to the data

  • dataset_name (str) – Name of the dataset

Return type:

FeatureDataset

Returns:

FeatureDataset containing the cell line features

load_drug_features(data_path, dataset_name)

Load the drug features for a single-view sklearn model.

Parameters:
  • data_path (str) – Path to the data

  • dataset_name (str) – Name of the dataset

Return type:

FeatureDataset | None

Returns:

FeatureDataset containing the drug features

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the response for the given input.

Return type:

ndarray

Returns:

predicted drug response

save(directory)

Save the trained model and any associated preprocessing components to the given directory.

Saves:
  • model.pkl: the trained sklearn model

  • hyperparameters.json: dictionary of model hyperparameters (if present)

  • scaler.pkl: fitted gene expression scaler (if present)

  • proteomics_transformer.pkl: fitted proteomics transformer (if present)

Parameters:

directory (str) – path to the directory where model files will be stored

Raises:

ValueError – if the model is not trained

Return type:

None
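The file layout listed above, with a required model file and optional companions, can be sketched as follows. A stand-in dictionary replaces the trained estimator so the example needs only the standard library. The function names are hypothetical, only two of the four files are handled, and the real methods may differ in detail.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_components(directory, model, hyperparameters=None):
    """Write model.pkl and, if given, hyperparameters.json."""
    if model is None:
        raise ValueError("Cannot save a model that has not been trained.")
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    with open(directory / "model.pkl", "wb") as handle:
        pickle.dump(model, handle)
    if hyperparameters is not None:
        with open(directory / "hyperparameters.json", "w") as handle:
            json.dump(hyperparameters, handle)


def load_components(directory):
    """Read model.pkl (required) and hyperparameters.json (optional)."""
    directory = Path(directory)
    model_path = directory / "model.pkl"
    if not model_path.exists():
        raise FileNotFoundError(f"Missing {model_path}")
    with open(model_path, "rb") as handle:
        model = pickle.load(handle)
    hyperparameters = None
    hyper_path = directory / "hyperparameters.json"
    if hyper_path.exists():
        with open(hyper_path) as handle:
            hyperparameters = json.load(handle)
    return model, hyperparameters


# Round trip with a stand-in "model" object.
target = Path(tempfile.mkdtemp()) / "rf_model"
save_components(target, {"coef": [0.5, -1.0]}, {"n_estimators": 100})
model, hyperparameters = load_components(target)
print(model, hyperparameters)
```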

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')

Trains the model.

The number of features is the number of genes + the number of fingerprints.

Parameters:
  • output (DrugResponseDataset) – training dataset containing the response output

  • cell_line_input (FeatureDataset) – training dataset containing gene expression data

  • drug_input (FeatureDataset | None) – training dataset containing fingerprints data

  • output_earlystopping (DrugResponseDataset | None) – not needed

  • model_checkpoint_dir (str) – not needed

Return type:

None

Single-Drug Baselines

Single-drug variants of the sklearn models. These models are trained separately for each drug, using only cell line features (no drug features). Available models: SingleDrugRandomForest and SingleDrugElasticNet. Both support flexible inputs for the cell line view.

SingleDrugElasticNet and SingleDrugRandomForest classes. Fit a model for each drug separately.
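Training one model per drug amounts to splitting the training data by drug id and fitting each subset independently. A minimal sketch, assuming in-memory lists and using a trivial mean estimator as a stand-in for ElasticNet or RandomForest (all names here are illustrative, not drevalpy APIs):

```python
from collections import defaultdict
from statistics import mean


class MeanEstimator:
    """Stand-in regressor that ignores features and predicts the mean."""

    def fit(self, features, responses):
        self.mean_ = mean(responses)
        return self

    def predict(self, features):
        return [self.mean_ for _ in features]


def fit_per_drug(drug_ids, cell_line_features, responses):
    """Fit one estimator per drug on that drug's rows only."""
    rows_by_drug = defaultdict(lambda: ([], []))
    for drug_id, features, response in zip(drug_ids, cell_line_features, responses):
        rows_by_drug[drug_id][0].append(features)
        rows_by_drug[drug_id][1].append(response)
    return {
        drug_id: MeanEstimator().fit(X, y) for drug_id, (X, y) in rows_by_drug.items()
    }


drug_ids = ["D1", "D1", "D2"]
cell_line_features = [[0.1, 0.2], [0.3, 0.4], [0.1, 0.2]]  # no drug features needed
responses = [2.0, 4.0, 10.0]

models = fit_per_drug(drug_ids, cell_line_features, responses)
print(sorted(models))                      # ['D1', 'D2']
print(models["D1"].predict([[0.5, 0.5]]))  # [3.0]
```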

class drevalpy.models.baselines.singledrug_baselines.SingleDrugElasticNet

Bases: ElasticNetModel

SingleDrugElasticNet class.

build_model(hyperparameters)

Overwrites drug views to be empty.

Parameters:

hyperparameters (dict) – hyperparameters

drug_views = []
early_stopping = False
classmethod get_model_name()

Returns the model name.

Return type:

str

Returns:

SingleDrugElasticNet

is_single_drug_model = True
load_drug_features(data_path, dataset_name)

Load drug features. Not needed for SingleDrugElasticNet.

Parameters:
  • data_path – path to the data

  • dataset_name – name of the dataset

Returns:

None

class drevalpy.models.baselines.singledrug_baselines.SingleDrugRandomForest

Bases: RandomForest

SingleDrugRandomForest class.

build_model(hyperparameters)

Overwrites drug views to be empty.

Parameters:

hyperparameters (dict) – hyperparameters

drug_views = []
early_stopping = False
classmethod get_model_name()

Returns the model name.

Return type:

str

Returns:

SingleDrugRandomForest

is_single_drug_model = True
load_drug_features(data_path, dataset_name)

Load drug features. Not needed for SingleDrugRandomForest.

Parameters:
  • data_path – path to the data

  • dataset_name – name of the dataset

Returns:

None

Multi-View Random Forest

A Random Forest that accepts multiple cell line views simultaneously (e.g., gene expression, methylation, mutations, and copy number variation). Each view is loaded and preprocessed independently, then all feature matrices are concatenated before training. Methylation data is reduced with PCA before concatenation.

Contains the Multi-OMICS Random Forest model.
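The load-then-concatenate flow can be pictured with the sketch below. It is a simplified illustration under assumptions: views are dictionaries keyed by cell line name, only cell lines present in every view are kept, and a hypothetical `reduce_view` callable stands in for the PCA step applied to methylation (here it merely truncates the vector, which is not PCA).

```python
def concatenate_views(views, view_order, reducers=None):
    """Join per-view feature vectors into one vector per cell line."""
    reducers = reducers or {}
    # Assumption: keep only cell lines that have data in every view.
    shared = set.intersection(*(set(views[name]) for name in view_order))
    combined = {}
    for cell_line in sorted(shared):
        vector = []
        for name in view_order:
            features = views[name][cell_line]
            if name in reducers:
                features = reducers[name](features)
            vector.extend(features)
        combined[cell_line] = vector
    return combined


def reduce_view(features, n_components=2):
    """Placeholder for dimensionality reduction: keeps the first entries."""
    return features[:n_components]


# Toy views with invented cell line names and values.
views = {
    "gene_expression": {"CL1": [0.1, 0.2], "CL2": [0.3, 0.4]},
    "methylation": {"CL1": [0.9, 0.8, 0.7, 0.6], "CL2": [0.5, 0.4, 0.3, 0.2]},
    "mutations": {"CL1": [1, 0], "CL3": [0, 1]},
}

combined = concatenate_views(
    views,
    ["gene_expression", "methylation", "mutations"],
    reducers={"methylation": reduce_view},
)
print(combined)  # only CL1 appears in all three views
```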

class drevalpy.models.baselines.multi_view_random_forest.MultiViewRandomForest

Bases: RandomForest

Multi-View Random Forest model.

cell_line_views = ['gene_expression', 'methylation', 'mutations', 'copy_number_variation_gistic']
classmethod get_model_name()

Returns the model name.

Return type:

str

Returns:

MultiViewRandomForest

classmethod load(directory)

Loads the trained model, hyperparameters, scaler, and PCA transformer from the specified directory.

Parameters:

directory (str) – Path to the directory where model components are stored.

Return type:

MultiViewRandomForest

Returns:

An instance of MultiViewRandomForest with restored state.

load_cell_line_features(data_path, dataset_name)

Loads the cell line features for a multi-view random forest.

Parameters:
  • data_path (str) – data path e.g. data/

  • dataset_name (str) – dataset name e.g. GDSC1

Return type:

FeatureDataset

Returns:

FeatureDataset containing the cell line omics features

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the response for the given input.

Return type:

ndarray

Returns:

predicted response

Raises:

RuntimeError – if PCA has not been fit

save(directory)

Saves the trained model, hyperparameters, scaler, and PCA transformer to the specified directory.

Parameters:

directory (str) – Path to the directory where model components will be saved.

Return type:

None

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')

Trains the model: the number of features is the number of genes + the number of fingerprints.

Return type:

None