Implemented baselines 

Returns:

FeatureDataset containing the cell line ids

load_drug_features(data_path, dataset_name)

Loads the drug features.

Parameters:

data_path (str) – Path to the data.
dataset_name (str) – Name of the dataset.

Return type:

Returns:

FeatureDataset containing the drug IDs.

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the cell line mean for each drug-cell line combination.

If the cell line is not in the training set, the dataset mean is used.

Parameters:

cell_line_ids (ndarray) – cell line ids
drug_ids (ndarray) – not needed
cell_line_input (FeatureDataset) – not needed
drug_input (FeatureDataset | None) – not needed

Return type:

Returns:

Array of the same length as cell_line_ids, containing the cell line mean for each entry.

predict_cl(cl_id)

Predicts the mean of the response for a given cell line.

If the cell line is not in the training set, the dataset mean is used.

Parameters:: cl_id (str) – Cell line ID
Return type:: float
Returns:: predicted response
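The per-cell-line mean with a dataset-mean fallback can be sketched in a few lines. This is a minimal illustration using only the standard library, not the drevalpy implementation; the helper name fit_cell_line_means and the plain-list inputs are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def fit_cell_line_means(cell_line_ids, responses):
    """Return (dataset_mean, {cell_line_id: mean response}). Hypothetical helper."""
    grouped = defaultdict(list)
    for cl_id, response in zip(cell_line_ids, responses):
        grouped[cl_id].append(response)
    dataset_mean = mean(responses)
    cell_line_means = {cl_id: mean(values) for cl_id, values in grouped.items()}
    return dataset_mean, cell_line_means


def predict_cl(cl_id, dataset_mean, cell_line_means):
    """Look up the cell line mean; unseen cell lines get the dataset mean."""
    return cell_line_means.get(cl_id, dataset_mean)


dataset_mean, cl_means = fit_cell_line_means(["A", "A", "B"], [1.0, 3.0, 5.0])
seen = predict_cl("A", dataset_mean, cl_means)    # mean of 1.0 and 3.0
unseen = predict_cl("Z", dataset_mean, cl_means)  # falls back to the dataset mean
```

The same pattern applies to the per-drug and per-tissue predictors, with the grouping key swapped.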

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='None')

Computes the mean per cell line.

If a cell line seen at prediction time was not in the training set, the overall mean is used.

Parameters:

output (DrugResponseDataset) – training dataset containing the response output
cell_line_input (FeatureDataset) – cell line inputs
drug_input (FeatureDataset | None) – not needed
output_earlystopping (DrugResponseDataset | None) – not needed
model_checkpoint_dir (str) – not needed

Return type:

class drevalpy.models.baselines.naive_pred.NaiveDrugMeanPredictor

Bases: NaiveModel

Naive predictor model that predicts the mean of the response per drug.

cell_line_views = ['cell_line_name']

drug_views = ['pubchem_id']

classmethod get_model_name()

Returns the model name.

Return type:: str
Returns:: NaiveDrugMeanPredictor

load_cell_line_features(data_path, dataset_name)

Loads the cell line features.

Parameters:

data_path (str) – Path to the data.
dataset_name (str) – Name of the dataset.

Return type:

Returns:

FeatureDataset containing the cell line IDs.

load_drug_features(data_path, dataset_name)

Loads the drug features, in this case the drug ids.

Parameters:

data_path (str) – path to the data
dataset_name (str) – name of the dataset

Return type:

Returns:

FeatureDataset containing the drug ids

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the drug mean for each drug-cell line combination.

If the drug is not in the training set, the dataset mean is used.

Parameters:

cell_line_ids (ndarray) – not needed
drug_ids (ndarray) – drug ids
cell_line_input (FeatureDataset) – not needed
drug_input (FeatureDataset | None) – not needed

Return type:

Returns:

Array of the same length as drug_ids, containing the drug mean for each entry.

predict_drug(drug_id)

Predicts the mean of the response for a given drug.

If the drug is not in the training set, the dataset mean is used.

Parameters:: drug_id (str) – ID of the drug
Returns:: predicted response

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='None')

Computes the mean per drug. If a drug seen at prediction time was not in the training set, the overall mean is used.

Parameters:

output (DrugResponseDataset) – training dataset containing the response output
cell_line_input (FeatureDataset) – not needed
drug_input (FeatureDataset | None) – drug id
output_earlystopping (DrugResponseDataset | None) – not needed
model_checkpoint_dir (str) – not needed

Raises:

ValueError – If drug_input is None

Return type:

class drevalpy.models.baselines.naive_pred.NaiveMeanEffectsPredictor

Bases: NaiveModel

ANOVA-like predictor model.

Predicts the response as: response = overall_mean + tissue_effect + cell_line_residual_effect + drug_effect.

Here:

tissue_effect = (tissue mean - overall_mean)
cell_line_residual_effect = (cell line mean - tissue mean for that cell line)
drug_effect = (drug mean - overall_mean)

This formulation avoids double-counting tissue signal already captured by cell line means. For unseen cell lines with a known tissue, the tissue effect provides a fallback. If tissue information is not available, the model falls back to the simpler formulation: response = overall_mean + cell_line_effect + drug_effect.
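The decomposition above can be written out as a short sketch. It assumes plain Python lists as inputs and hypothetical helper names (group_means, fit_mean_effects, predict_mean_effects); effects for unseen cell lines, tissues, or drugs are taken to be zero. It is an illustration of the formula, not the library's code.

```python
from collections import defaultdict
from statistics import mean


def group_means(keys, values):
    """Mean of values per key."""
    grouped = defaultdict(list)
    for key, value in zip(keys, values):
        grouped[key].append(value)
    return {key: mean(vals) for key, vals in grouped.items()}


def fit_mean_effects(cell_lines, tissues, drugs, responses):
    """Estimate the overall mean and the three additive effects."""
    overall = mean(responses)
    tissue_of = dict(zip(cell_lines, tissues))  # assumes one tissue per cell line
    tissue_means = group_means(tissues, responses)
    cell_line_means = group_means(cell_lines, responses)
    drug_means = group_means(drugs, responses)
    return {
        "overall_mean": overall,
        "tissue_effect": {t: m - overall for t, m in tissue_means.items()},
        "cell_line_residual_effect": {
            c: m - tissue_means[tissue_of[c]] for c, m in cell_line_means.items()
        },
        "drug_effect": {d: m - overall for d, m in drug_means.items()},
    }


def predict_mean_effects(params, cell_line, tissue, drug):
    """Sum the effects; anything unseen contributes zero."""
    return (
        params["overall_mean"]
        + params["tissue_effect"].get(tissue, 0.0)
        + params["cell_line_residual_effect"].get(cell_line, 0.0)
        + params["drug_effect"].get(drug, 0.0)
    )


params = fit_mean_effects(
    ["c1", "c1", "c2", "c2"],
    ["lung", "lung", "skin", "skin"],
    ["d1", "d2", "d1", "d2"],
    [1.0, 3.0, 5.0, 7.0],
)
known = predict_mean_effects(params, "c1", "lung", "d1")
new_cell_line = predict_mean_effects(params, "c9", "lung", "d2")
```

In the second call the cell line is unseen, so only the tissue and drug effects are added to the overall mean.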

cell_line_views = ['cell_line_name']

drug_views = ['pubchem_id']

classmethod get_model_name()

Returns the name of the model.

Return type:: str
Returns:: The name of the model as a string.

load_cell_line_features(data_path, dataset_name)

Loads the cell line features.

Parameters:

data_path (str) – Path to the data.
dataset_name (str) – Name of the dataset.

Return type:

Returns:

FeatureDataset containing the cell line IDs and tissue annotations, if available.

load_drug_features(data_path, dataset_name)

Loads the drug features.

Parameters:

data_path (str) – Path to the data.
dataset_name (str) – Name of the dataset.

Return type:

Returns:

FeatureDataset containing the drug IDs.

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts responses for given cell line and drug pairs.

The prediction is computed as:: prediction = overall_mean + tissue_effect + cell_line_residual_effect + drug_effect

If a cell line, tissue, or drug has not been seen during training, their effect is set to zero.

Parameters:

cell_line_ids (ndarray) – Array of cell line IDs.
drug_ids (ndarray) – Array of drug IDs.
cell_line_input (FeatureDataset) – Feature dataset containing tissue annotations.
drug_input (FeatureDataset | None) – Not used.

Return type:

Returns:

NumPy array of predicted responses.

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')

Trains with overall mean, tissue effects, cell line residual effects, and drug effects.

Parameters:

output (DrugResponseDataset) – Training dataset containing the response output.
cell_line_input (FeatureDataset) – Feature dataset containing cell line IDs and tissue annotations.
drug_input (FeatureDataset | None) – Feature dataset containing drug IDs. Must not be None.
output_earlystopping (DrugResponseDataset | None) – Not used.
model_checkpoint_dir (str) – Not used.

Raises:

ValueError – If drug_input is None.

Return type:

class drevalpy.models.baselines.naive_pred.NaiveModel

Bases: DRPModel

Base class for all naive predictor models which are based on simple dataset stats.

This class provides a shared interface and save/load mechanism for simple statistical models that predict drug response based on dataset means, stratified by drug, cell line, or tissue.

build_model(hyperparameters)

Builds the model.

Naive models do not require any hyperparameter tuning.

Parameters:: hyperparameters (dict) – Dictionary of hyperparameters (not used).

classmethod load(directory)

Loads the model parameters from the given directory.

Reads the ‘naive_model.json’ file and initializes a NaiveModel instance with the loaded parameters.

Parameters:: directory (str) – Path to the directory where the model is saved.
Return type:: NaiveModel
Returns:: An instance of NaiveModel with the loaded parameters.

save(directory)

Saves the model parameters to the given directory.

Serializes dataset_mean and any available subclass-specific attributes to a JSON file named ‘naive_model.json’. Creates the directory if it doesn’t exist.

Parameters:: directory (str) – Path to the directory where the model will be saved.
Return type:: None
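As a rough sketch of the save/load round trip described above: the file name naive_model.json and the dataset_mean attribute come from the documentation, while the function names and the drug_means entry are hypothetical and only serve the example.

```python
import json
import os
import tempfile


def save_naive_params(directory, params):
    """Write the parameter dict to <directory>/naive_model.json."""
    os.makedirs(directory, exist_ok=True)  # create the directory if it doesn't exist
    with open(os.path.join(directory, "naive_model.json"), "w") as f:
        json.dump(params, f)


def load_naive_params(directory):
    """Read the parameter dict back from <directory>/naive_model.json."""
    with open(os.path.join(directory, "naive_model.json")) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "model")
    # "drug_means" stands in for a subclass-specific attribute.
    save_naive_params(target, {"dataset_mean": 0.5, "drug_means": {"d1": 0.25}})
    restored = load_naive_params(target)
```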

class drevalpy.models.baselines.naive_pred.NaivePredictor

Bases: NaiveModel

Naive predictor model that predicts the overall mean of the response.

cell_line_views = ['cell_line_name']

drug_views = ['pubchem_id']

classmethod get_model_name()

Returns the model name.

Return type:: str
Returns:: NaivePredictor

load_cell_line_features(data_path, dataset_name)

Loads the cell line features, in this case the cell line ids.

Parameters:

data_path (str) – path to the data
dataset_name (str) – name of the dataset

Return type:

Returns:

FeatureDataset containing the cell line ids

load_drug_features(data_path, dataset_name)

Loads the drug features, in this case the drug ids.

Parameters:

data_path (str) – path to the data
dataset_name (str) – name of the dataset

Return type:

Returns:

FeatureDataset containing the drug ids

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the dataset mean for each drug-cell line combination.

Parameters:

cell_line_ids (ndarray) – cell line ids
drug_ids (ndarray) – not needed
cell_line_input (FeatureDataset) – not needed
drug_input (FeatureDataset | None) – not needed

Return type:

Returns:

Array of the same length as cell_line_ids, with every entry set to the dataset mean.

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')

Computes the overall mean of the output response values and stores it.

Parameters:

output (DrugResponseDataset) – training dataset containing the response output
cell_line_input (FeatureDataset) – not needed
drug_input (FeatureDataset | None) – not needed
output_earlystopping (DrugResponseDataset | None) – not needed
model_checkpoint_dir (str) – not needed

Return type:

class drevalpy.models.baselines.naive_pred.NaiveTissueDrugMeanPredictor

Bases: NaiveModel

Naive predictor model that predicts the mean of the response per tissue-drug combination.

This model combines tissue and drug information to predict the mean response aggregated across all cell lines from the same tissue tested on the same drug. If a (tissue, drug) combination was not seen during training, it falls back to the overall dataset mean.

cell_line_views = ['tissue']

drug_views = ['pubchem_id']

classmethod get_model_name()

Returns the model name.

Return type:: str
Returns:: NaiveTissueDrugMeanPredictor

classmethod load(directory)

Loads the model parameters from the given directory.

Overrides the base class load method to convert string keys back to tuple keys.

Parameters:: directory (str) – Path to the directory where the model is saved.
Return type:: NaiveTissueDrugMeanPredictor
Returns:: An instance of NaiveTissueDrugMeanPredictor with the loaded parameters.

load_cell_line_features(data_path, dataset_name)

Loads the cell line features, in this case the tissue annotations.

Parameters:

data_path (str) – path to the data
dataset_name (str) – name of the dataset

Return type:

Returns:

FeatureDataset containing the tissue ids

load_drug_features(data_path, dataset_name)

Loads the drug features, in this case the drug ids.

Parameters:

data_path (str) – path to the data
dataset_name (str) – name of the dataset

Return type:

Returns:

FeatureDataset containing the drug ids

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the tissue-drug mean for each drug-cell line combination.

If the (tissue, drug) combination is not in the training set, the dataset mean is used.

Parameters:

cell_line_ids (ndarray) – cell line ids
drug_ids (ndarray) – drug ids (used directly, following NaiveDrugMeanPredictor pattern)
cell_line_input (FeatureDataset) – tissue features
drug_input (FeatureDataset | None) – not needed

Return type:

Returns:

array of the same length as the input containing the tissue-drug mean or dataset mean

save(directory)

Saves the model parameters to the given directory.

Overrides the base class save method to handle tuple keys in tissue_drug_means by converting them to JSON-serializable string keys.

Parameters:: directory (str) – Path to the directory where the model will be saved.
Return type:: None
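JSON object keys must be strings, which is why (tissue, drug) tuple keys need converting on save and restoring on load. One possible encoding is sketched below; the separator and the function names are assumptions for illustration, and the library may use a different scheme.

```python
import json

SEPARATOR = "||"  # hypothetical delimiter; must not occur in tissue or drug IDs


def encode_keys(tissue_drug_means):
    """Turn (tissue, drug) tuple keys into 'tissue||drug' string keys."""
    return {
        f"{tissue}{SEPARATOR}{drug}": value
        for (tissue, drug), value in tissue_drug_means.items()
    }


def decode_keys(encoded):
    """Invert encode_keys: split each string key back into a tuple."""
    return {tuple(key.split(SEPARATOR, 1)): value for key, value in encoded.items()}


means = {("lung", "d1"): 1.5, ("skin", "d1"): 6.0}
payload = json.dumps(encode_keys(means))
restored = decode_keys(json.loads(payload))
```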

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='None')

Computes the mean per tissue-drug combination. Falls back to the overall mean for unknown combinations.

Parameters:

output (DrugResponseDataset) – training dataset with .response and .drug_ids
cell_line_input (FeatureDataset) – tissue features for cell lines
drug_input (FeatureDataset | None) – drug id features
output_earlystopping (DrugResponseDataset | None) – not needed
model_checkpoint_dir (str) – not needed

Raises:

ValueError – If drug_input is None.

Return type:

class drevalpy.models.baselines.naive_pred.NaiveTissueMeanPredictor

Bases: NaiveModel

Naive predictor model that predicts the mean of the response per tissue.

cell_line_views = ['tissue']

drug_views = []

classmethod get_model_name()

Returns the model name.

Return type:: str
Returns:: NaiveTissueMeanPredictor

load_cell_line_features(data_path, dataset_name)

Loads the cell line features, in this case the tissue annotations.

Parameters:

data_path (str) – path to the data
dataset_name (str) – name of the dataset

Return type:

Returns:

FeatureDataset containing the tissue ids

load_drug_features(data_path, dataset_name)

Loads the drug features.

Parameters:

data_path (str) – Path to the data.
dataset_name (str) – Name of the dataset.

Return type:

Returns:

FeatureDataset containing the drug IDs.

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the tissue mean for each drug-cell line combination.

If the tissue is not in the training set, the dataset mean is used.

Parameters:

cell_line_ids (ndarray) – cell line ids
drug_ids (ndarray) – not needed
cell_line_input (FeatureDataset) – tissue features
drug_input (FeatureDataset | None) – not needed

Return type:

Returns:

Array of the same length as cell_line_ids, containing the tissue mean for each entry.

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='None')

Computes the mean per tissue. Falls back to the overall mean for unknown tissues.

Parameters:

output (DrugResponseDataset) – training dataset with .response
cell_line_input (FeatureDataset) – tissue features for cell lines
drug_input (FeatureDataset | None) – not needed
output_earlystopping (DrugResponseDataset | None) – not needed
model_checkpoint_dir (str) – not needed

Return type:

Sklearn Models

Scikit-learn-based models for drug response prediction. All models in this module support flexible inputs (see Flexible Input System above). By default they concatenate cell line features and drug features into a single input matrix. Available models: ElasticNetModel, LassoModel, RandomForest, SVMRegressor, KNNRegressor, GradientBoosting, and AdaBoostDecisionTree.
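To make the concatenation concrete, here is a small sketch that builds one input row per (cell line, drug) pair. It uses plain dictionaries and lists rather than FeatureDataset objects, and build_input_matrix is a hypothetical helper, not part of the module.

```python
def build_input_matrix(cell_line_ids, drug_ids, cell_line_features, drug_features):
    """One row per pair: cell line features followed by drug features."""
    return [
        list(cell_line_features[cl]) + list(drug_features[drug])
        for cl, drug in zip(cell_line_ids, drug_ids)
    ]


# Toy features: two expression values per cell line, three fingerprint bits per drug.
cell_line_features = {"c1": [0.1, 0.2], "c2": [0.3, 0.4]}
drug_features = {"d1": [1, 0, 1], "d2": [0, 1, 1]}

X = build_input_matrix(
    ["c1", "c2", "c1"], ["d1", "d1", "d2"], cell_line_features, drug_features
)
```

Each row of X has as many columns as the cell line and drug feature vectors combined, and can be passed to a regressor together with the matching response values.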

Contains sklearn baseline models: ElasticNet, Lasso, RandomForest, SVM, KNN, GradientBoosting, AdaBoost.

class drevalpy.models.baselines.sklearn_models.AdaBoostDecisionTree

Bases: SklearnModel

AdaBoost model using Decision Trees as weak learners for drug response prediction.

build_model(hyperparameters)

Builds the model from hyperparameters.

Parameters:: hyperparameters (dict) – Hyperparameters for the model. Contains n_estimators, max_depth, min_samples_split and min_samples_leaf.

classmethod get_model_name()

Returns the model name.

Return type:: str
Returns:: AdaBoostDecisionTree

class drevalpy.models.baselines.sklearn_models.ElasticNetModel

Bases: SklearnModel

ElasticNet model for drug response prediction.

build_model(hyperparameters)

Builds the ElasticNet model from hyperparameters.

Parameters:: hyperparameters (dict) – Contains L1 ratio and alpha.

classmethod get_model_name()

Returns the model name.

Return type:: str
Returns:: ElasticNet

class drevalpy.models.baselines.sklearn_models.GradientBoosting

Bases: SklearnModel

Gradient Boosting model for drug response prediction.

build_model(hyperparameters)

Builds the model from hyperparameters.

Parameters:: hyperparameters (dict) – Hyperparameters for the model. Contains n_estimators, learning_rate, max_depth, and subsample

classmethod get_model_name()

Returns the model name.

Return type:: str
Returns:: GradientBoosting

class drevalpy.models.baselines.sklearn_models.KNNRegressor

Bases: SklearnModel

KNNRegressor model that uses k-nearest neighbors for drug response prediction.

build_model(hyperparameters)

Builds the model from hyperparameters.

Parameters:: hyperparameters (dict) – Hyperparameters for the model. Contains neighbors, weights.

classmethod get_model_name()

Returns the model name.

Return type:: str
Returns:: KNNRegressor

class drevalpy.models.baselines.sklearn_models.LassoModel

Bases: SklearnModel

Lasso regression model for drug response prediction.

build_model(hyperparameters)

Builds the Lasso model from hyperparameters.

Parameters:: hyperparameters (dict) – Contains alpha.

classmethod get_model_name()

Returns the model name.

Return type:: str
Returns:: Lasso

class drevalpy.models.baselines.sklearn_models.RandomForest

Bases: SklearnModel

RandomForest model for drug response prediction.

build_model(hyperparameters)

Builds the model from hyperparameters.

Parameters:: hyperparameters (dict) – Hyperparameters for the model. Contains n_estimators, criterion, max_samples, max_depth and n_jobs.

classmethod get_model_name()

Returns the model name.

Return type:: str
Returns:: RandomForest

class drevalpy.models.baselines.sklearn_models.SVMRegressor

Bases: SklearnModel

SVM model for drug response prediction.

build_model(hyperparameters)

Builds the model from hyperparameters.

Parameters:: hyperparameters (dict) – Hyperparameters for the model. Contains kernel, C, epsilon, and max_iter.

classmethod get_model_name()

Returns the model name.

Return type:: str
Returns:: SVR (Support Vector Regressor)

class drevalpy.models.baselines.sklearn_models.SklearnModel

Bases: DRPModel

Parent class that contains the common methods for the sklearn models.

build_model(hyperparameters)

Builds the model from hyperparameters.

Flexible input support: Initializes the cell_line_views and drug_views to the values specified in the hyperparameters.yaml file. If nothing is specified, gene_expression and fingerprints are used.

If proteomics is specified in the hyperparameters, the ProteomicsMedianCenterAndImputeTransformer is initialized.

Parameters:: hyperparameters (dict) – Custom hyperparameters for the model, have to be defined in the child class.
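The default-view behaviour can be pictured as a dictionary lookup with fallbacks. In the sketch below, the hyperparameter keys "cell_line_views" and "drug_views" and the function resolve_views are assumptions for illustration; only the default view names and the proteomics condition are taken from the description above.

```python
DEFAULT_CELL_LINE_VIEWS = ["gene_expression"]
DEFAULT_DRUG_VIEWS = ["fingerprints"]


def resolve_views(hyperparameters):
    """Pick views from the hyperparameters, falling back to the defaults."""
    cell_line_views = list(
        hyperparameters.get("cell_line_views", DEFAULT_CELL_LINE_VIEWS)
    )
    drug_views = list(hyperparameters.get("drug_views", DEFAULT_DRUG_VIEWS))
    # A proteomics transformer is only needed when proteomics is requested.
    needs_proteomics_transformer = "proteomics" in cell_line_views
    return cell_line_views, drug_views, needs_proteomics_transformer


defaults = resolve_views({})
custom = resolve_views({"cell_line_views": ["proteomics"]})
```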

cell_line_views = []

drug_views = []

classmethod get_model_name()

Returns the model name.

Raises:: NotImplementedError – If the method is not implemented in the child class.
Return type:: str

classmethod load(directory)

Load a trained sklearn-based model and its preprocessing components from disk.

Loads:

model.pkl: the trained sklearn model
hyperparameters.json: model hyperparameters (optional)
scaler.pkl: gene expression scaler (optional)
proteomics_transformer.pkl: proteomics transformer (optional)

Parameters:: directory (str) – path to the directory where model files are stored
Return type:: SklearnModel
Returns:: an instance of the model with restored state
Raises:: FileNotFoundError – if model.pkl is missing

load_cell_line_features(data_path, dataset_name)

Loads the cell line features for a single-view sklearn model.

Parameters:

data_path (str) – Path to the data
dataset_name (str) – Name of the dataset

Return type:

Returns:

FeatureDataset containing the cell line features

load_drug_features(data_path, dataset_name)

Load the drug features for a single-view sklearn model.

Parameters:

data_path (str) – Path to the data
dataset_name (str) – Name of the dataset

Return type:

FeatureDataset | None

Returns:

FeatureDataset containing the drug features

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the response for the given input.

Parameters:

drug_ids (ndarray) – drug ids
cell_line_ids (ndarray) – cell line ids
drug_input (FeatureDataset | None) – drug input
cell_line_input (FeatureDataset) – cell line input

Return type:

Returns:

predicted drug response

save(directory)

Save the trained model and any associated preprocessing components to the given directory.

Saves:

model.pkl: the trained sklearn model
hyperparameters.json: dictionary of model hyperparameters (if present)
scaler.pkl: fitted gene expression scaler (if present)
proteomics_transformer.pkl: fitted proteomics transformer (if present)

Parameters:: directory (str) – path to the directory where model files will be stored
Raises:: ValueError – if the model is not trained
Return type:: None
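The required-plus-optional file layout can be sketched as follows. The file names model.pkl, hyperparameters.json, and scaler.pkl come from the description; the use of pickle, the directory creation, and the function names are assumptions, and a plain dict stands in for a fitted estimator.

```python
import json
import os
import pickle
import tempfile


def save_components(directory, model, hyperparameters=None, scaler=None):
    """Always write model.pkl; write the optional files only when present."""
    if model is None:
        raise ValueError("Cannot save: the model has not been trained.")
    os.makedirs(directory, exist_ok=True)  # assumption: create the directory if missing
    with open(os.path.join(directory, "model.pkl"), "wb") as f:
        pickle.dump(model, f)
    if hyperparameters is not None:
        with open(os.path.join(directory, "hyperparameters.json"), "w") as f:
            json.dump(hyperparameters, f)
    if scaler is not None:
        with open(os.path.join(directory, "scaler.pkl"), "wb") as f:
            pickle.dump(scaler, f)


def load_components(directory):
    """Read model.pkl (required) plus whichever optional files exist."""
    model_path = os.path.join(directory, "model.pkl")
    if not os.path.exists(model_path):
        raise FileNotFoundError(f"model.pkl not found in {directory}")
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    hyperparameters, scaler = None, None
    hp_path = os.path.join(directory, "hyperparameters.json")
    if os.path.exists(hp_path):
        with open(hp_path) as f:
            hyperparameters = json.load(f)
    scaler_path = os.path.join(directory, "scaler.pkl")
    if os.path.exists(scaler_path):
        with open(scaler_path, "rb") as f:
            scaler = pickle.load(f)
    return model, hyperparameters, scaler


with tempfile.TemporaryDirectory() as tmp:
    save_components(tmp, {"coef": [1.0, 2.0]}, hyperparameters={"alpha": 0.1})
    model, hyperparameters, scaler = load_components(tmp)
```

Because no scaler was saved in this round trip, the loader returns None for it instead of failing.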

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')

Trains the model.

The number of features is the number of genes + the number of fingerprints.

Parameters:

output (DrugResponseDataset) – training dataset containing the response output
cell_line_input (FeatureDataset) – training dataset containing gene expression data
drug_input (FeatureDataset | None) – training dataset containing fingerprints data
output_earlystopping (DrugResponseDataset | None) – not needed
model_checkpoint_dir (str) – not needed

Return type:

Single-Drug Baselines

Single-drug variants of the sklearn models. These models are trained separately for each drug, using only cell line features (no drug features). Available models: SingleDrugRandomForest and SingleDrugElasticNet. Both support flexible inputs for the cell line view.

SingleDrugElasticNet and SingleDrugRandomForest classes. A separate model is fit for each drug.
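Training one model per drug amounts to splitting the training pairs by drug ID and fitting on cell line features only. The sketch below shows that split with hypothetical names (train_per_drug, fit_mean_model); the stand-in estimator simply predicts the mean response so the example stays dependency-free, whereas the real classes fit an ElasticNet or Random Forest.

```python
from collections import defaultdict


def train_per_drug(drug_ids, cell_line_rows, responses, fit_fn):
    """Group samples by drug and fit one independent model per drug."""
    grouped = defaultdict(lambda: ([], []))
    for drug, row, y in zip(drug_ids, cell_line_rows, responses):
        grouped[drug][0].append(row)  # cell line features only, no drug features
        grouped[drug][1].append(y)
    return {drug: fit_fn(X, y) for drug, (X, y) in grouped.items()}


def fit_mean_model(X, y):
    """Stand-in estimator: ignores X and predicts the mean of y."""
    average = sum(y) / len(y)
    return lambda rows: [average for _ in rows]


models = train_per_drug(
    ["d1", "d1", "d2"],
    [[0.1], [0.2], [0.3]],
    [1.0, 3.0, 5.0],
    fit_mean_model,
)
d1_prediction = models["d1"]([[0.5]])
d2_prediction = models["d2"]([[0.5], [0.6]])
```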

class drevalpy.models.baselines.singledrug_baselines.SingleDrugElasticNet

Bases: ElasticNetModel

SingleDrugElasticNet class.

build_model(hyperparameters)

Overwrites drug views to be empty.

Parameters:: hyperparameters (dict) – hyperparameters

drug_views = []

early_stopping = False

classmethod get_model_name()

Returns the model name.

Return type:: str
Returns:: SingleDrugElasticNet

is_single_drug_model = True

load_drug_features(data_path, dataset_name)

Load drug features. Not needed for SingleDrugElasticNet.

Parameters:

data_path – path to the data
dataset_name – name of the dataset

Returns:

None

class drevalpy.models.baselines.singledrug_baselines.SingleDrugRandomForest

Bases: RandomForest

SingleDrugRandomForest class.

build_model(hyperparameters)

Overwrites drug views to be empty.

Parameters:: hyperparameters (dict) – hyperparameters

drug_views = []

early_stopping = False

classmethod get_model_name()

Returns the model name.

Return type:: str
Returns:: SingleDrugRandomForest

is_single_drug_model = True

load_drug_features(data_path, dataset_name)

Load drug features. Not needed for SingleDrugRandomForest.

Parameters:

data_path – path to the data
dataset_name – name of the dataset

Returns:

None

Multi-View Random Forest

A Random Forest that accepts multiple cell line views simultaneously (e.g., gene expression, methylation, mutations, and copy number variation). Each view is loaded and preprocessed independently, then all feature matrices are concatenated before training. Methylation data is reduced with PCA before concatenation.
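The concatenation step can be sketched with plain dictionaries. The example assumes each view maps cell line IDs to a feature list and that only cell lines present in every view are kept; both are assumptions for illustration, and the per-view preprocessing (including the PCA reduction of methylation) is deliberately left out.

```python
def concatenate_views(views, view_order):
    """Join per-view feature lists into one vector per cell line.

    views: {view_name: {cell_line_id: [features]}}
    Only cell lines available in all requested views are returned (assumption).
    """
    shared_ids = set.intersection(*(set(views[name]) for name in view_order))
    return {
        cl: [value for name in view_order for value in views[name][cl]]
        for cl in sorted(shared_ids)
    }


views = {
    "gene_expression": {"c1": [1.0, 2.0], "c2": [3.0, 4.0]},
    "mutations": {"c1": [0, 1], "c3": [1, 1]},
}
combined = concatenate_views(views, ["gene_expression", "mutations"])
```

Here only c1 appears in both views, so it is the only cell line in the combined result, with the expression values followed by the mutation flags.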

Contains the Multi-OMICS Random Forest model.

class drevalpy.models.baselines.multi_view_random_forest.MultiViewRandomForest

Bases: RandomForest

Multi-View Random Forest model.

cell_line_views = ['gene_expression', 'methylation', 'mutations', 'copy_number_variation_gistic']

classmethod get_model_name()

Returns the model name.

Return type:: str
Returns:: MultiViewRandomForest

classmethod load(directory)

Loads the trained model, hyperparameters, scaler, and PCA transformer from the specified directory.

Parameters:: directory (str) – Path to the directory where model components are stored.
Return type:: MultiViewRandomForest
Returns:: An instance of MultiViewRandomForest with restored state.

load_cell_line_features(data_path, dataset_name)

Loads the cell line features for a multi-view random forest.

Parameters:

data_path (str) – data path e.g. data/
dataset_name (str) – dataset name e.g. GDSC1

Return type:

Returns:

FeatureDataset containing the cell line omics features

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the response for the given input.

Parameters:

cell_line_ids (ndarray) – cell line ids
drug_ids (ndarray) – drug ids
cell_line_input (FeatureDataset) – cell line input
drug_input (FeatureDataset | None) – drug input

Return type:

Returns:

predicted response

Raises:

RuntimeError – if PCA has not been fit

save(directory)

Saves the trained model, hyperparameters, scaler, and PCA transformer to the specified directory.

Parameters:: directory (str) – Path to the directory where model components will be saved.
Return type:: None

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')

Trains the model: the number of features is the number of genes + the number of fingerprints.

Parameters:

output (DrugResponseDataset) – training dataset containing the response output
cell_line_input (FeatureDataset) – training dataset containing the OMICs
drug_input (FeatureDataset | None) – training dataset containing fingerprints data
output_earlystopping (DrugResponseDataset | None) – not needed
model_checkpoint_dir (str) – not needed

Return type: