Implemented baselines
Flexible Input System
The sklearn baseline models support flexible inputs. Rather than hardcoding which omic data type a model uses,
you configure cell_line_views and drug_views directly in the hyperparameters.yaml file.
A single model class (e.g., ElasticNet, RandomForest, KNNRegressor) can therefore be trained on gene expression,
proteomics, or any other available omic without needing a separate Python class for each combination.
This replaces the previously separate model classes (ProteomicsRandomForest, ProteomicsElasticNet,
SingleDrugProteomicsRandomForest, SingleDrugProteomicsElasticNet), which have been removed in favor
of this unified approach.
Configuring the input views
The default RandomForest configuration uses gene expression and fingerprints:
RandomForest:
  cell_line_views:
    - gene_expression
  drug_views:
    - fingerprints
  n_estimators:
    - 100
  max_depth:
    - 5
    - 10
    - 30
  ...
To train the same Random Forest on proteomics data instead, change cell_line_views:
RandomForest:
  cell_line_views:
    - proteomics
  drug_views:
    - fingerprints
  n_estimators:
    - 100
  ...
For the MultiViewRandomForest, multiple cell line views can be specified as a nested list:
MultiViewRandomForest:
  cell_line_views:
    - - gene_expression
      - methylation
      - mutations
      - copy_number_variation_gistic
  drug_views:
    - fingerprints
  ...
How features are loaded
The feature loading depends on which view is specified in the configuration:
- gene_expression: Loaded with the landmark_genes_reduced gene list for feature selection.
- fingerprints: Loaded using the precomputed Morgan fingerprints provided with each dataset.
- proteomics: Loaded as a generic CSV. The ProteomicsMedianCenterAndImputeTransformer is automatically initialized for preprocessing.
- Any other feature name (e.g., methylation, mutations, copy_number_variation_gistic, or a custom name): The model calls load_generic_csv, which looks for a CSV file at <data_path>/<dataset_name>/<feature_name>.csv. The CSV must have cell_line_name as the index column. All columns (except cellosaurus_id, which is dropped if present) are used as features.
This means you can use any custom omic by placing a correctly formatted CSV in the dataset directory
and setting cell_line_views to the file’s name (without the .csv extension).
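The CSV layout expected for a custom cell line view can be sketched with pandas as follows. This is only an illustration of the documented contract, not the library's load_generic_csv: the helper name read_view_csv, the dataset name MyDataset, and the view name my_custom_omic are assumptions made for the demo.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_view_csv(data_path: str, dataset_name: str, feature_name: str) -> pd.DataFrame:
    """Hypothetical helper mirroring the documented CSV layout (not drevalpy's API)."""
    csv_path = Path(data_path) / dataset_name / f"{feature_name}.csv"
    # Documented contract: cell_line_name is the index column.
    table = pd.read_csv(csv_path, index_col="cell_line_name")
    # cellosaurus_id is dropped if present; every remaining column is a feature.
    return table.drop(columns=["cellosaurus_id"], errors="ignore")


# Demo with a made-up dataset directory and a made-up custom view.
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp) / "MyDataset"
    dataset_dir.mkdir()
    (dataset_dir / "my_custom_omic.csv").write_text(
        "cell_line_name,cellosaurus_id,feat_a,feat_b\n"
        "CL1,CVCL_0001,0.1,1.0\n"
        "CL2,CVCL_0002,0.2,2.0\n"
    )
    features = read_view_csv(tmp, "MyDataset", "my_custom_omic")

print(list(features.columns))
```

With a file like this in place, the matching configuration entry would be the file name without its extension, here my_custom_omic.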
For drug features the same logic applies: fingerprints loads the precomputed fingerprints, an empty
drug_views list loads only the drug IDs, and any other name loads the CSV at
<data_path>/<dataset_name>/<feature_name>.csv.
Proteomics-specific hyperparameters
When proteomics is specified as a cell line view, the following hyperparameters control the
preprocessing transformer:
- proteomics_feature_threshold (default: 0.7): minimum fraction of non-NA values required per protein
- proteomics_n_features (default: 1000): number of top-variance features to select
- proteomics_normalization_width (default: 0.3): width parameter for median-center normalization
- proteomics_normalization_downshift (default: 1.8): downshift parameter for median-center normalization
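To make the first two hyperparameters concrete, here is a minimal sketch of a completeness filter followed by a top-variance selection. It is not the ProteomicsMedianCenterAndImputeTransformer itself: the function name filter_and_select and the toy table are assumptions, the order of the steps inside the real transformer may differ, and the normalization and imputation controlled by the width and downshift parameters are deliberately left out because their formula is not given here.

```python
import numpy as np
import pandas as pd


def filter_and_select(
    proteins: pd.DataFrame, feature_threshold: float = 0.7, n_features: int = 1000
) -> pd.DataFrame:
    """Hypothetical sketch: completeness filter, then top-variance selection."""
    # Keep proteins that are non-NA in at least `feature_threshold` of the cell lines.
    observed_fraction = proteins.notna().mean(axis=0)
    kept = proteins.loc[:, observed_fraction >= feature_threshold]
    # Rank the remaining proteins by variance (pandas skips NaNs) and keep the top n.
    top = kept.var(axis=0).nlargest(min(n_features, kept.shape[1])).index
    return kept[top]


# Toy table (made up): rows are cell lines, columns are proteins.
toy = pd.DataFrame(
    {
        "P1": [1.0, 2.0, 3.0, 4.0],           # fully observed, moderate variance
        "P2": [1.0, np.nan, np.nan, np.nan],  # 25% observed, fails the 0.7 threshold
        "P3": [5.0, 5.1, 5.0, 5.1],           # fully observed, low variance
        "P4": [0.0, 10.0, np.nan, 20.0],      # 75% observed, highest variance
    },
    index=["CL1", "CL2", "CL3", "CL4"],
)
selected = filter_and_select(toy, feature_threshold=0.7, n_features=2)
print(list(selected.columns))
```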
Naive Predictors
Simple mean-based predictors that serve as lower-bound baselines. These models do not use any cell line or drug features. They predict the mean response value computed from the training set, aggregated at different levels (global, per drug, per cell line, per tissue, or per tissue-drug combination).
Implements the naive predictor models.
The naive predictor models are simple models that predict the mean of the response values. The NaivePredictor predicts the overall mean of the response, the NaiveCellLineMeanPredictor predicts the mean of the response per cell line, and the NaiveDrugMeanPredictor predicts the mean of the response per drug. The NaiveTissueMeanPredictor predicts the mean of the response per tissue. The NaiveTissueDrugMeanPredictor predicts the mean of the response per tissue-drug combination. The NaiveMeanEffectsPredictor predicts the response as the overall mean plus the cell line effect plus the drug effect and should be the strongest naive baseline.
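The shared idea behind the per-group predictors (mean per key, with the overall mean as fallback for unseen keys) can be sketched in plain Python. The function names and the toy data below are assumptions for illustration and are not taken from the drevalpy implementation.

```python
from collections import defaultdict
from statistics import mean


def fit_group_means(keys, responses):
    """Return (overall_mean, {key: mean response for that key})."""
    grouped = defaultdict(list)
    for key, value in zip(keys, responses):
        grouped[key].append(value)
    return mean(responses), {key: mean(values) for key, values in grouped.items()}


def predict_group_means(keys, overall_mean, group_means):
    """Look up each key's mean; unseen keys fall back to the overall mean."""
    return [group_means.get(key, overall_mean) for key in keys]


# Toy training data (made up): one response per measured pair, keyed by drug.
train_drugs = ["D1", "D1", "D2", "D2"]
train_responses = [1.0, 3.0, 5.0, 7.0]

overall, drug_means = fit_group_means(train_drugs, train_responses)
# D3 was never seen during training, so it receives the overall mean.
predictions = predict_group_means(["D1", "D2", "D3"], overall, drug_means)
print(predictions)
```

The same two functions cover the other aggregation levels if the keys are cell line names, tissues, or (tissue, drug) tuples instead of drug IDs.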
- class drevalpy.models.baselines.naive_pred.NaiveCellLineMeanPredictor
Bases: NaiveModel
Naive predictor model that predicts the mean of the response per cell line.
- cell_line_views = ['cell_line_name']
- drug_views = ['pubchem_id']
- classmethod get_model_name()
Returns the model name.
- Return type:
- Returns:
NaiveCellLineMeanPredictor
- load_cell_line_features(data_path, dataset_name)
Loads the cell line features, in this case the cell line ids.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the cell line ids
- load_drug_features(data_path, dataset_name)
Loads the drug features.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the drug IDs.
- predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)
Predicts the cell line mean for each drug-cell line combination.
If the cell line is not in the training set, the dataset mean is used.
- Parameters:
cell_line_ids (ndarray) – cell line ids
drug_ids (ndarray) – not needed
cell_line_input (FeatureDataset) – not needed
drug_input (FeatureDataset | None) – not needed
- Return type:
- Returns:
array of the same length as the input cell_line_id containing the cell line mean
- predict_cl(cl_id)
Predicts the mean of the response for a given cell line.
If the cell line is not in the training set, the dataset mean is used.
- train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='None')
Computes the mean per cell line.
If a cell line encountered later at prediction time was not in the training set, the overall mean is used.
- Parameters:
output (DrugResponseDataset) – training dataset containing the response output
cell_line_input (FeatureDataset) – cell line inputs
drug_input (FeatureDataset | None) – not needed
output_earlystopping (DrugResponseDataset | None) – not needed
model_checkpoint_dir (str) – not needed
- Return type:
- class drevalpy.models.baselines.naive_pred.NaiveDrugMeanPredictor
Bases: NaiveModel
Naive predictor model that predicts the mean of the response per drug.
- cell_line_views = ['cell_line_name']
- drug_views = ['pubchem_id']
- classmethod get_model_name()
Returns the model name.
- Return type:
- Returns:
NaiveDrugMeanPredictor
- load_cell_line_features(data_path, dataset_name)
Loads the cell line features.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the cell line IDs.
- load_drug_features(data_path, dataset_name)
Loads the drug features, in this case the drug ids.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the drug ids
- predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)
Predicts the drug mean for each drug-cell line combination.
If the drug is not in the training set, the dataset mean is used.
- Parameters:
cell_line_ids (ndarray) – not needed
drug_ids (ndarray) – drug ids
cell_line_input (FeatureDataset) – not needed
drug_input (FeatureDataset | None) – not needed
- Return type:
- Returns:
array of the same length as the input drug_id containing the drug mean
- predict_drug(drug_id)
Predicts the mean of the response for a given drug.
If the drug is not in the training set, the dataset mean is used.
- Parameters:
drug_id (str) – ID of the drug
- Returns:
predicted response
- train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='None')
Computes the mean per drug. If a drug encountered later at prediction time was not in the training set, the overall mean is used.
- Parameters:
output (DrugResponseDataset) – training dataset containing the response output
cell_line_input (FeatureDataset) – not needed
drug_input (FeatureDataset | None) – drug ids
output_earlystopping (DrugResponseDataset | None) – not needed
model_checkpoint_dir (str) – not needed
- Raises:
ValueError – If drug_input is None
- Return type:
- class drevalpy.models.baselines.naive_pred.NaiveMeanEffectsPredictor
Bases: NaiveModel
ANOVA-like predictor model.
Predicts the response as: response = overall_mean + cell_line_effect + drug_effect.
Here:
  cell_line_effect = (cell line mean - overall_mean)
  drug_effect = (drug mean - overall_mean)
This formulation ensures that the overall mean is not counted twice.
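A small worked example makes the additive formula concrete. The sketch below is written in plain Python under the stated definitions. The function names and the toy responses are assumptions, and the zero effect for unseen cell lines or drugs follows the behavior documented for predict.

```python
from statistics import mean


def fit_mean_effects(cell_lines, drugs, responses):
    """Return the overall mean plus per-cell-line and per-drug effects."""
    overall = mean(responses)

    def effects(keys):
        per_key = {}
        for key, value in zip(keys, responses):
            per_key.setdefault(key, []).append(value)
        # Effect = group mean minus overall mean, so the overall mean is added only once.
        return {key: mean(values) - overall for key, values in per_key.items()}

    return overall, effects(cell_lines), effects(drugs)


def predict_mean_effects(cell_line, drug, overall, cl_effects, drug_effects):
    # Unseen cell lines or drugs contribute an effect of zero.
    return overall + cl_effects.get(cell_line, 0.0) + drug_effects.get(drug, 0.0)


# Toy data (made up): two cell lines, two drugs, fully crossed.
cell_lines = ["A", "A", "B", "B"]
drugs = ["X", "Y", "X", "Y"]
responses = [1.0, 3.0, 5.0, 7.0]

overall, cl_eff, drug_eff = fit_mean_effects(cell_lines, drugs, responses)
seen_pair = predict_mean_effects("A", "X", overall, cl_eff, drug_eff)
unseen_drug = predict_mean_effects("A", "Z", overall, cl_eff, drug_eff)
```

Here the overall mean is 4.0, cell line A has effect -2.0 and drug X has effect -1.0, so the pair (A, X) is predicted as 1.0. For the unseen drug Z only the cell line effect applies, giving 2.0.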
- cell_line_views = ['cell_line_name']
- drug_views = ['pubchem_id']
- classmethod get_model_name()
Returns the name of the model.
- Return type:
- Returns:
The name of the model as a string.
- load_cell_line_features(data_path, dataset_name)
Loads the cell line features.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the cell line IDs.
- load_drug_features(data_path, dataset_name)
Loads the drug features.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the drug IDs.
- predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)
Predicts responses for given cell line and drug pairs.
The prediction is computed as:
  prediction = overall_mean + cell_line_effect + drug_effect
If a cell line or drug has not been seen during training, their effect is set to zero.
- Parameters:
cell_line_ids (ndarray) – Array of cell line IDs.
drug_ids (ndarray) – Array of drug IDs.
cell_line_input (FeatureDataset) – Not used.
drug_input (FeatureDataset | None) – Not used.
- Return type:
- Returns:
NumPy array of predicted responses.
- train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')
Trains with overall mean, cell line effects, and drug effects.
- Parameters:
output (DrugResponseDataset) – Training dataset containing the response output.
cell_line_input (FeatureDataset) – Feature dataset containing cell line IDs.
drug_input (FeatureDataset | None) – Feature dataset containing drug IDs. Must not be None.
output_earlystopping (DrugResponseDataset | None) – Not used.
model_checkpoint_dir (str) – Not used.
- Raises:
ValueError – If drug_input is None.
- Return type:
- class drevalpy.models.baselines.naive_pred.NaiveModel
Bases: DRPModel
Base class for all naive predictor models, which are based on simple dataset statistics.
This class provides a shared interface and save/load mechanism for simple statistical models that predict drug response based on dataset means, stratified by drug, cell line, or tissue.
- build_model(hyperparameters)
Builds the model.
Naive models do not require any hyperparameter tuning.
- Parameters:
hyperparameters (dict) – Dictionary of hyperparameters (not used).
- classmethod load(directory)
Loads the model parameters from the given directory.
Reads the ‘naive_model.json’ file and initializes a NaiveModel instance with the loaded parameters.
- Parameters:
directory (str) – Path to the directory where the model is saved.
- Return type:
- Returns:
An instance of NaiveModel with the loaded parameters.
- save(directory)
Saves the model parameters to the given directory.
Serializes dataset_mean and any available subclass-specific attributes to a JSON file named ‘naive_model.json’. Creates the directory if it doesn’t exist.
- class drevalpy.models.baselines.naive_pred.NaivePredictor
Bases: NaiveModel
Naive predictor model that predicts the overall mean of the response.
- cell_line_views = ['cell_line_name']
- drug_views = ['pubchem_id']
- load_cell_line_features(data_path, dataset_name)
Loads the cell line features, in this case the cell line ids.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the cell line ids
- load_drug_features(data_path, dataset_name)
Loads the drug features, in this case the drug ids.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the drug ids
- predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)
Predicts the dataset mean for each drug-cell line combination.
- Parameters:
cell_line_ids (ndarray) – cell line ids
drug_ids (ndarray) – not needed
cell_line_input (FeatureDataset) – not needed
drug_input (FeatureDataset | None) – not needed
- Return type:
- Returns:
array of the same length as the input cell line id containing the dataset mean
- train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')
Computes the overall mean of the output response values and stores it.
- Parameters:
output (DrugResponseDataset) – training dataset containing the response output
cell_line_input (FeatureDataset) – not needed
drug_input (FeatureDataset | None) – not needed
output_earlystopping (DrugResponseDataset | None) – not needed
model_checkpoint_dir (str) – not needed
- Return type:
- class drevalpy.models.baselines.naive_pred.NaiveTissueDrugMeanPredictor
Bases: NaiveModel
Naive predictor model that predicts the mean of the response per tissue-drug combination.
This model combines tissue and drug information to predict the mean response aggregated across all cell lines from the same tissue tested on the same drug. If a (tissue, drug) combination was not seen during training, it falls back to the overall dataset mean.
- cell_line_views = ['tissue']
- drug_views = ['pubchem_id']
- classmethod get_model_name()
Returns the model name.
- Return type:
- Returns:
NaiveTissueDrugMeanPredictor
- classmethod load(directory)
Loads the model parameters from the given directory.
Overrides the base class load method to convert string keys back to tuple keys.
- Parameters:
directory (str) – Path to the directory where the model is saved.
- Return type:
- Returns:
An instance of NaiveTissueDrugMeanPredictor with the loaded parameters.
- load_cell_line_features(data_path, dataset_name)
Loads the cell line features, in this case the tissue annotations.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the tissue ids
- load_drug_features(data_path, dataset_name)
Loads the drug features, in this case the drug ids.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the drug ids
- predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)
Predicts the tissue-drug mean for each drug-cell line combination.
If the (tissue, drug) combination is not in the training set, the dataset mean is used.
- Parameters:
cell_line_ids (ndarray) – cell line ids
drug_ids (ndarray) – drug ids (used directly, following the NaiveDrugMeanPredictor pattern)
cell_line_input (FeatureDataset) – tissue features
drug_input (FeatureDataset | None) – not needed
- Return type:
- Returns:
array of the same length as the input containing the tissue-drug mean or dataset mean
- save(directory)
Saves the model parameters to the given directory.
Overrides the base class save method to handle tuple keys in tissue_drug_means by converting them to JSON-serializable string keys.
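JSON objects only allow string keys, which is why a dictionary keyed by (tissue, drug) tuples needs a conversion step on save and the reverse step on load. The round trip can be sketched as below. The separator, the helper names, and the exact JSON structure are assumptions, and the actual encoding used by NaiveTissueDrugMeanPredictor may differ.

```python
import json

SEPARATOR = "||"  # assumed delimiter; must not occur inside tissue names or drug IDs


def encode_keys(tissue_drug_means):
    """Turn (tissue, drug) tuple keys into single string keys for JSON."""
    return {
        f"{tissue}{SEPARATOR}{drug}": value
        for (tissue, drug), value in tissue_drug_means.items()
    }


def decode_keys(encoded):
    """Split the string keys back into (tissue, drug) tuples."""
    return {tuple(key.split(SEPARATOR, 1)): value for key, value in encoded.items()}


# Toy values (made up).
means = {("lung", "1234"): 0.5, ("skin", "5678"): -1.25}
payload = json.dumps({"dataset_mean": 0.1, "tissue_drug_means": encode_keys(means)})
restored = decode_keys(json.loads(payload)["tissue_drug_means"])
```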
- train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='None')
Computes the mean per tissue-drug combination. Falls back to the overall mean for unknown combinations.
- Parameters:
output (DrugResponseDataset) – training dataset with .response and .drug_ids
cell_line_input (FeatureDataset) – tissue features for cell lines
drug_input (FeatureDataset | None) – drug id features
output_earlystopping (DrugResponseDataset | None) – not needed
model_checkpoint_dir (str) – not needed
- Raises:
ValueError – If drug_input is None.
- Return type:
- class drevalpy.models.baselines.naive_pred.NaiveTissueMeanPredictor
Bases: NaiveModel
Naive predictor model that predicts the mean of the response per tissue.
- cell_line_views = ['tissue']
- drug_views = []
- classmethod get_model_name()
Returns the model name.
- Return type:
- Returns:
NaiveTissueMeanPredictor
- load_cell_line_features(data_path, dataset_name)
Loads the cell line features, in this case the tissue annotations.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the tissue ids
- load_drug_features(data_path, dataset_name)
Loads the drug features.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the drug IDs.
- predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)
Predicts the tissue mean for each drug-cell line combination.
If the tissue is not in the training set, the dataset mean is used.
- Parameters:
cell_line_ids (ndarray) – cell line ids
drug_ids (ndarray) – not needed
cell_line_input (FeatureDataset) – tissue features
drug_input (FeatureDataset | None) – not needed
- Return type:
- Returns:
array of the same length as the input cell_line_id containing the tissue mean
- train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='None')
Computes the mean per tissue. Falls back to the overall mean for unknown tissues.
- Parameters:
output (DrugResponseDataset) – training dataset with .response
cell_line_input (FeatureDataset) – tissue features for cell lines
drug_input (FeatureDataset | None) – not needed
output_earlystopping (DrugResponseDataset | None) – not needed
model_checkpoint_dir (str) – not needed
- Return type:
Sklearn Models
Scikit-learn-based models for drug response prediction. All models in this module support flexible inputs
(see Flexible Input System above). By default they concatenate cell line features and drug features into
a single input matrix. Available models: ElasticNetModel, LassoModel, RandomForest, SVMRegressor,
GradientBoosting, AdaBoostDecisionTree, and KNNRegressor.
Contains sklearn baseline models: ElasticNet, RandomForest, SVM, AdaBoost.
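The default concatenation of cell line and drug features into one input matrix can be pictured with a few lines of NumPy. This is a sketch of the idea only: the dictionaries stand in for the feature datasets, and build_input_matrix is a hypothetical helper, not a drevalpy function.

```python
import numpy as np


def build_input_matrix(cell_line_ids, drug_ids, cell_line_features, drug_features):
    """Stack [cell line features | drug features] for every (cell line, drug) pair."""
    rows = [
        np.concatenate([cell_line_features[cl], drug_features[drug]])
        for cl, drug in zip(cell_line_ids, drug_ids)
    ]
    return np.vstack(rows)


# Toy features (made up): 3 expression values per cell line, 4 fingerprint bits per drug.
cell_line_features = {
    "CL1": np.array([0.1, 0.2, 0.3]),
    "CL2": np.array([1.0, 1.1, 1.2]),
}
drug_features = {
    "D1": np.array([1, 0, 1, 0]),
    "D2": np.array([0, 1, 1, 1]),
}

# One row per measured pair; the column count is 3 + 4 = 7.
X = build_input_matrix(
    ["CL1", "CL2", "CL1"], ["D1", "D1", "D2"], cell_line_features, drug_features
)
print(X.shape)
```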
- class drevalpy.models.baselines.sklearn_models.AdaBoostDecisionTree
Bases: SklearnModel
AdaBoost model using Decision Trees as weak learners for drug response prediction.
- class drevalpy.models.baselines.sklearn_models.ElasticNetModel
Bases: SklearnModel
ElasticNet model for drug response prediction.
- class drevalpy.models.baselines.sklearn_models.GradientBoosting
Bases: SklearnModel
Gradient Boosting model for drug response prediction.
- class drevalpy.models.baselines.sklearn_models.KNNRegressor
Bases: SklearnModel
K-nearest neighbors regression model for drug response prediction.
- class drevalpy.models.baselines.sklearn_models.LassoModel
Bases: SklearnModel
Lasso regression model for drug response prediction.
- class drevalpy.models.baselines.sklearn_models.RandomForest
Bases: SklearnModel
RandomForest model for drug response prediction.
- class drevalpy.models.baselines.sklearn_models.SVMRegressor
Bases: SklearnModel
SVM model for drug response prediction.
- class drevalpy.models.baselines.sklearn_models.SklearnModel
Bases: DRPModel
Parent class that contains the common methods for the sklearn models.
- build_model(hyperparameters)
Builds the model from hyperparameters.
Flexible input support: Initializes the cell_line_views and drug_views to the values specified in the hyperparameters.yaml file. If nothing is specified, gene_expression and fingerprints are used.
If proteomics is specified in the hyperparameters, the ProteomicsMedianCenterAndImputeTransformer is initialized.
- Parameters:
hyperparameters (dict) – Custom hyperparameters for the model; they have to be defined in the child class.
- cell_line_views = []
- drug_views = []
- classmethod get_model_name()
Returns the model name.
- Raises:
NotImplementedError – If the method is not implemented in the child class.
- Return type:
- classmethod load(directory)
Load a trained sklearn-based model and its preprocessing components from disk.
Loads:
- model.pkl: the trained sklearn model
- hyperparameters.json: model hyperparameters (optional)
- scaler.pkl: gene expression scaler (optional)
- proteomics_transformer.pkl: proteomics transformer (optional)
- Parameters:
directory (str) – path to the directory where model files are stored
- Return type:
- Returns:
an instance of the model with restored state
- Raises:
FileNotFoundError – if model.pkl is missing
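The file layout described for load (a required model.pkl plus optional companion files) can be illustrated with a small round trip using only the standard library. The helper names save_artifacts and load_artifacts are hypothetical, a plain dictionary stands in for the trained model, and only the hyperparameters file is shown as an optional component.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_artifacts(directory, model, hyperparameters=None):
    """Write model.pkl and, if given, hyperparameters.json."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    with open(target / "model.pkl", "wb") as handle:
        pickle.dump(model, handle)
    if hyperparameters is not None:
        with open(target / "hyperparameters.json", "w") as handle:
            json.dump(hyperparameters, handle)


def load_artifacts(directory):
    """Read model.pkl (required) and hyperparameters.json (optional)."""
    source = Path(directory)
    model_file = source / "model.pkl"
    if not model_file.exists():
        raise FileNotFoundError(f"{model_file} is missing")
    with open(model_file, "rb") as handle:
        model = pickle.load(handle)
    hyper_file = source / "hyperparameters.json"
    hyperparameters = json.loads(hyper_file.read_text()) if hyper_file.exists() else None
    return model, hyperparameters


with tempfile.TemporaryDirectory() as tmp:
    # A dict stands in for a fitted estimator in this demo.
    save_artifacts(tmp, model={"coef": [0.5, -1.0]}, hyperparameters={"alpha": 0.1})
    restored_model, restored_hp = load_artifacts(tmp)
    try:
        load_artifacts(Path(tmp) / "empty")
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True
```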
- load_cell_line_features(data_path, dataset_name)
Loads the cell line features for a single-view sklearn model.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the cell line features
- load_drug_features(data_path, dataset_name)
Load the drug features for a single-view sklearn model.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the drug features
- predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)
Predicts the response for the given input.
- Parameters:
drug_ids (ndarray) – drug ids
cell_line_ids (ndarray) – cell line ids
drug_input (FeatureDataset | None) – drug input
cell_line_input (FeatureDataset) – cell line input
- Return type:
- Returns:
predicted drug response
- save(directory)
Save the trained model and any associated preprocessing components to the given directory.
Saves:
- model.pkl: the trained sklearn model
- hyperparameters.json: dictionary of model hyperparameters (if present)
- scaler.pkl: fitted gene expression scaler (if present)
- proteomics_transformer.pkl: fitted proteomics transformer (if present)
- Parameters:
directory (str) – path to the directory where model files will be stored
- Raises:
ValueError – if the model is not trained
- Return type:
- train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')
Trains the model.
With the default views (gene_expression and fingerprints), the number of features is the number of genes + the number of fingerprints.
- Parameters:
output (DrugResponseDataset) – training dataset containing the response output
cell_line_input (FeatureDataset) – training dataset containing gene expression data
drug_input (FeatureDataset | None) – training dataset containing fingerprints data
output_earlystopping (DrugResponseDataset | None) – not needed
model_checkpoint_dir (str) – not needed
- Return type:
Single-Drug Baselines
Single-drug variants of the sklearn models. These models are trained separately for each drug, using only
cell line features (no drug features). Available models: SingleDrugRandomForest and
SingleDrugElasticNet. Both support flexible inputs for the cell line view.
SingleDrugElasticNet and SingleDrugRandomForest class. Fit a model for each drug separately.
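Fitting one model per drug on cell line features only can be sketched with scikit-learn as follows. The grouping helper fit_per_drug, the toy data, and the chosen alpha and l1_ratio values are assumptions for illustration, and how the framework itself dispatches single-drug models is not shown here.

```python
from collections import defaultdict

import numpy as np
from sklearn.linear_model import ElasticNet


def fit_per_drug(drug_ids, cell_line_matrix, responses):
    """Fit a separate ElasticNet for every drug, using cell line features only."""
    rows_by_drug = defaultdict(list)
    for row, drug in enumerate(drug_ids):
        rows_by_drug[drug].append(row)

    models = {}
    for drug, rows in rows_by_drug.items():
        model = ElasticNet(alpha=0.01, l1_ratio=0.5)
        model.fit(cell_line_matrix[rows], responses[rows])
        models[drug] = model
    return models


# Toy data (made up): one cell line feature, two drugs with opposite trends.
cell_line_matrix = np.array([[0.0], [1.0], [2.0], [3.0], [0.0], [1.0], [2.0], [3.0]])
drug_ids = ["D1"] * 4 + ["D2"] * 4
responses = np.array([0.0, 2.0, 4.0, 6.0, 0.0, -1.0, -2.0, -3.0])

models = fit_per_drug(drug_ids, cell_line_matrix, responses)
pred_d1 = models["D1"].predict(np.array([[3.0]]))[0]
pred_d2 = models["D2"].predict(np.array([[3.0]]))[0]
```

Because each drug gets its own model, no drug features are needed, which matches the empty drug_views of the single-drug classes.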
- class drevalpy.models.baselines.singledrug_baselines.SingleDrugElasticNet
Bases: ElasticNetModel
SingleDrugElasticNet class.
- build_model(hyperparameters)
Overwrites drug views to be empty.
- Parameters:
hyperparameters (dict) – hyperparameters
- drug_views = []
- early_stopping = False
- classmethod get_model_name()
Returns the model name.
- Return type:
- Returns:
SingleDrugElasticNet
- is_single_drug_model = True
- load_drug_features(data_path, dataset_name)
Load drug features. Not needed for SingleDrugElasticNet.
- Parameters:
data_path – path to the data
dataset_name – name of the dataset
- Returns:
None
- class drevalpy.models.baselines.singledrug_baselines.SingleDrugRandomForest
Bases: RandomForest
SingleDrugRandomForest class.
- build_model(hyperparameters)
Overwrites drug views to be empty.
- Parameters:
hyperparameters (dict) – hyperparameters
- drug_views = []
- early_stopping = False
- classmethod get_model_name()
Returns the model name.
- Return type:
- Returns:
SingleDrugRandomForest
- is_single_drug_model = True
- load_drug_features(data_path, dataset_name)
Load drug features. Not needed for SingleDrugRandomForest.
- Parameters:
data_path – path to the data
dataset_name – name of the dataset
- Returns:
None
Multi-View Random Forest
A Random Forest that accepts multiple cell line views simultaneously (e.g., gene expression, methylation, mutations, and copy number variation). Each view is loaded and preprocessed independently, then all feature matrices are concatenated before training. Methylation data is reduced with PCA before concatenation.
Contains the Multi-OMICS Random Forest model.
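The reduce-then-concatenate step for multiple views can be sketched with NumPy. The PCA below is a plain SVD-based projection written for the demo, and the view sizes, the number of components, and the helper name pca_reduce are assumptions. According to the API below, the fitted PCA is stored with the model and reused at prediction time, whereas this sketch fits and transforms in a single step.

```python
import numpy as np


def pca_reduce(matrix, n_components):
    """Project rows onto the top principal components (plain SVD-based PCA)."""
    centered = matrix - matrix.mean(axis=0)
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    return centered @ components[:n_components].T


# Toy views (made up): 6 cell lines with different feature counts per view.
rng = np.random.default_rng(0)
n_cell_lines = 6
views = {
    "gene_expression": rng.normal(size=(n_cell_lines, 5)),
    "methylation": rng.normal(size=(n_cell_lines, 20)),
    "mutations": rng.integers(0, 2, size=(n_cell_lines, 4)).astype(float),
}

# Only methylation is compressed; the other views are used as loaded.
views["methylation"] = pca_reduce(views["methylation"], n_components=3)
combined = np.hstack(
    [views[name] for name in ("gene_expression", "methylation", "mutations")]
)
print(combined.shape)
```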
- class drevalpy.models.baselines.multi_view_random_forest.MultiViewRandomForest
Bases: RandomForest
Multi-View Random Forest model.
- cell_line_views = ['gene_expression', 'methylation', 'mutations', 'copy_number_variation_gistic']
- classmethod get_model_name()
Returns the model name.
- Return type:
- Returns:
MultiViewRandomForest
- classmethod load(directory)
Loads the trained model, hyperparameters, scaler, and PCA transformer from the specified directory.
- Parameters:
directory (str) – Path to the directory where model components are stored.
- Return type:
- Returns:
An instance of MultiViewRandomForest with restored state.
- load_cell_line_features(data_path, dataset_name)
Loads the cell line features for a multi-view random forest.
- Parameters:
- Return type:
- Returns:
FeatureDataset containing the cell line omics features
- predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)
Predicts the response for the given input.
- Parameters:
cell_line_ids (ndarray) – cell line ids
drug_ids (ndarray) – drug ids
cell_line_input (FeatureDataset) – cell line input
drug_input (FeatureDataset | None) – drug input
- Return type:
- Returns:
predicted response
- Raises:
RuntimeError – if PCA has not been fit
- save(directory)
Saves the trained model, hyperparameters, scaler, and PCA transformer to the specified directory.
- train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')
Trains the model: the number of features is the number of genes + the number of fingerprints.
- Parameters:
output (DrugResponseDataset) – training dataset containing the response output
cell_line_input (FeatureDataset) – training dataset containing the OMICs
drug_input (FeatureDataset | None) – training dataset containing fingerprints data
output_earlystopping (DrugResponseDataset | None) – not needed
model_checkpoint_dir (str) – not needed
- Return type: