How to use DrEvalPy

Here, we document how to run DrEval with our implemented models and datasets. You can either do this with the standalone supplied here or with the associated Nextflow pipeline drugresponseeval. We recommend the use of our Nextflow pipeline for computational demanding runs and for improved reproducibility. No knowledge of Nextflow is required to run it. The Nextflow pipeline is available on the nf-core GitHub, the corresponding documentation can be found here. Documentation of the standalone is provided below.

Run a drug response experiment results with `drevalpy`

You can run it the drug response pipeline, which can test drug response models via:

drevalpy --help

Example:

drevalpy --run_id my_first_run --models NaiveDrugMeanPredictor ElasticNet --dataset_name TOYv1 --test_mode LCO

Note: You need at least 7 CV splits to get a meaningful critical difference diagram and the corresponding p-values.

--run_id TEXT: Identifier to save the results. [default: my_run]

--path_data TEXT: Path to the data directory. [default: data] All data files should be stored in this directory and will be downloaded into this directory. The location of the datasets are resolved by <path_data>/<dataset_name>/<dataset_name>.csv. If providing raw viability data, the file needs to be named <dataset_name>_raw.csv instead and --no_refitting needs to be unspecified for automated curve fitting (thats the default) (see --no_refitting for details and also check the Custom Datasets section).

--models TEXT: Model to evaluate or list of models to compare. For a list of available models, see the Available Models section.

--baselines TEXT: List of baselines to evaluate. If NaiveMeanEffectsPredictor is not part of them, we will add it. For a list of available baselines, see the Available Models section. The baselines are also hyperparameter-tuned and compared to the models, but no randomization or robustness tests are run. NaiveMeanEffectsPredictor is always run as it is required for evaluation.

--test_mode TEXT

Which tests to run:

LPO — Leave-random-Pairs-Out
LCO — Leave-Cell-line-Out
LTO — Leave-Tissue-Out
LDO — Leave-Drug-Out

Can be a list, e.g. 'LPO LCO LTO LDO' to run all tests. [default: LPO]. For more information, see the Available Settings section.

--randomization_mode TEXT

Which randomization tests to run in addition to the normal run. None disables randomization tests. For more information, see the Available Randomization Tests section. Available modes:

SVCC — Single View Constant (while others are perturbed) for Cell Lines
SVRC — Single View Random (while others are held constant) for Cell Lines
SVCD — Single View Constant (while others are perturbed) for Drugs
SVRD — Single View Random (while others are held constant) for Drugs

Can be a list, e.g. 'SVCC SVCD'.

--randomization_type TEXT

Type of randomization to use:

permutation — shuffles features over instances while preserving feature distributions
invariant — preserves a key characteristic such as matrix mean/standard deviation or network degree

[default: permutation]

--n_trials_robustness INTEGER: Number of trials for the robustness test. The robustness test trains the model with varying seeds multiple times to check stability. 0 disables the robustness test. [default: 0]. For more information, see the Robustness Test section.

--dataset_name TEXT: Name of the dataset to use. For a list of available datasets, see the Available Datasets section. For information on how to use custom datasets, see the Custom Datasets section. [default: GDSC1]

--cross_study_datasets TEXT: List of datasets to use for cross-study prediction evaluation. [default: []]

--path_out TEXT: Path to the output directory. [default: results/]

--no_refitting: If not set, the measure is appended with ‘_curvecurator’. If a custom dataset_name was provided, this will invoke the fitting procedure of raw viability data, which is expected to exist at <path_data>/<dataset_name>/<dataset_name>_raw.csv. The fitted dataset will be stored in the same folder, in a file called <dataset_name>.csv. Also check the Custom Datasets section. Default is False, i.e., curvecurated drug response measures are utilized.

--curve_curator_cores INTEGER: Maximum number of cores used to fit curves with CurveCurator, capped at the number of curves to fit. Only used when --no_refitting is not set. [default: 1]

--curve_curator_normalize: Normalize response values to [0, 1] for CurveCurator. [default: False]

--measure TEXT: Drug response measure used as prediction target. If using one of the available datasets (see --dataset_name), this is restricted to one of [‘LN_IC50’, ‘EC50’, ‘IC50’, ‘pEC50’, ‘AUC’, ‘response’]. This corresponds to the names of the columns that contain theses measures in the provided input dataset. If providing a custom dataset, this may differ. If the option --no_refitting is not set, the prefix ‘_curvecurator’ is automatically appended, e.g. ‘LN_IC50_curvecurator’, to allow using the refit measures instead of the ones originally published for the available datasets, allowing for better dataset comparability (refit measures are already provided in the available datasets or computed as part of the fitting procedure when providing custom raw viability datasets, see --no_refitting for details). [default: LN_IC50]

--overwrite: Overwrite existing results with the same --path_out and --run_id.

--optim_metric TEXT: Metric for hyperparameter tuning. For more information, see the Available Metrics section. One of MSE, RMSE, MAE, R^2, Pearson, Spearman, Kendall. [default: RMSE]

--wandb_project TEXT: Optional Weights & Biases project name. If provided, enables wandb logging for all DRPModel instances.

--n_cv_splits INTEGER: Number of cross-validation splits. [default: 7]

--response_transformation TEXT: Transformation applied to the response variable during training and prediction; retransformed after final predictions. For more information, see the Available Response Transformations section. One of standard, minmax, robust.

--multiprocessing: If set, we will use raytune for fitting. Default is False. [default: False]

--model_checkpoint_dir TEXT: Directory to save model checkpoints. [default: TEMPORARY]

--final_model_on_full_data: Save a final model trained and tuned on the union of all folds after cross-validation.

--no_hyperparameter_tuning: Disable hyperparameter tuning and use the first hyperparameter set.

Visualize and evaluate results with `drevalpy-report`

Executing the main script drevalpy will generate a folder with the results which includes the predictions of all models in all specified settings. The drevalpy report CLI will evaluate the results with all available metrics and create an HTML report with many visualizations. You can run it with the following command:

drevalpy report [-h] --run_id RUN_ID --dataset_name DATASET [--path_data PATH_DATA] [--result_path RESULT_PATH]

Options:

-h, --help: Show help message and exit.
--run_id RUN_ID: Identifier for the run which was used when executing the drevalpy command.
--dataset_name DATASET: Name of the dataset which was used when executing the drevalpy command.
--path_data PATH_DATA: Path to the data directory, default: data.
--result_path RESULT_PATH: Path to the results directory, default: results.

Example:

drevalpy report --run_id my_first_run --dataset_name TOYv1

The report will be stored in the results/RUN_ID folder. You can open the index.html file in your browser to view the report.

Available Settings

DrEval is designed to ensure that drug response prediction models are evaluated in a consistent and reproducible manner. We offer four settings via the --test_mode parameter:

Image visualizing the Leave-Pair-Out setting

Image visualizing the Leave-Cell-Line-Out setting

Image visualizing the Leave-Tissue-Out setting

Image visualizing the Leave-Drug-Out setting

Leave-Pair-Out (LPO): Random pairs of cell lines and drugs are left out for validation/testing but both the drug and the cell line might already be present in the training set. This is the easiest setting for your model but also the most uninformative one. The only application scenario for this setting is when you want to test whether your model can complete the missing values in the training set.
Leave-Cell-Line-Out (LCO): Random cell lines are left out for validation/testing but the drugs might already be present in the training set. This setting is more challenging than LPO but still relatively easy. The application scenario for this setting is when you want to test whether your model can predict the response of a new cell line. This is very relevant for personalized medicine.
Leave-Tissue-Out (LTO): Random tissues are left out for validation/testing but the drugs might already be present in the training set. This setting is more challenging than LCO. The application scenario for this setting is when you want to test whether your model can predict the response of a new tissue. This is very relevant for drug repurposing.
Leave-Drug-Out (LDO): Random drugs are left out for validation/testing but the cell lines might already be present in the training set. This setting is the most challenging one. The application scenario for this setting is when you want to test whether your model can predict the response of a new drug. This is very relevant for drug development.

An underlying issue is that drugs have a rather unique IC50 range. That means that by just predicting the mean IC50 that a drug has in the training set (aggregated over all cell lines), you can already achieve a seemingly good prediction (as evaluated by naive R^2 or correlation metrics). This is why we also offer the possibility to compare your model to a NaivePredictor that predicts the mean IC50 of all drugs in the training set. We also offer several less naive predictors: NaiveCellLineMeanPredictor, NaiveDrugMeanPredictor, NaiveTissueMeanPredictor, and NaiveTissueDrugMeanPredictor. The NaiveCellLineMeanPredictor predicts the mean IC50 of a cell line in the training set, the NaiveDrugMeanPredictor predicts the mean IC50 of a drug in the training set, the NaiveTissueMeanPredictor predicts the mean IC50 of a tissue in the training set, and the NaiveTissueDrugMeanPredictor predicts the mean IC50 per tissue-drug combination (aggregated across all cell lines with that tissue-drug pair). The NaiveMeanEffectPredictor combines the effects of cell lines and drugs. It is equivalent to the NaiveCellLineMeanPredictor and NaiveDrugMeanPredictor for the LDO and LCO settings, respectively, as test cell line effects and drug effects are unknown in these settings.

In LCO, NaiveTissueDrugMeanPredictor is the strongest baseline, while in all other settings, NaiveMeanEffectPredictor is the strongest.

Available Models

In addition to the Naive Predictors, we offer a variety of more advanced baseline models and some state-of-the-art models to compare your model against. You can either set them as baselines or as models via the --models and --baselines parameters. We first identify the best hyperparameters for all models and baselines in a cross-validation setting. Then, we train the models on the whole training set and evaluate them on the test set. For --models, you can also perform randomization and robustness tests. The --baselines are skipped for these tests.

The sklearn baseline models (AdaBoostDecisionTree, ElasticNet, GradientBoosting, KNNRegressor, Lasso, RandomForest, SVR, SingleDrugRandomForest, SingleDrugElasticNet), MultiViewXGBoost, and the machine learning baselines (SimpleNeuralNetwork, MultiViewNeuralNetwork) support flexible inputs: the input types can be configured via cell_line_views and drug_views in hyperparameters.yaml without needing separate model classes. By default they use gene expression and fingerprints. See the sklearn model Flexible Input System or the SimpleNeuralNetwork Flexible Input System for details.

Model Name	Baseline / Published Model	Multi-Drug Model / Single-Drug Model	Description
NaivePredictor	Baseline Method	Multi-Drug Model	Most simple method. Predicts the mean response of all drugs in the training set.
NaiveCellLineMeanPredictor	Baseline Method	Multi-Drug Model	Predicts the mean response of the cell line in the training set.
NaiveDrugMeanPredictor	Baseline Method	Multi-Drug Model	Predicts the mean response of the drug in the training set.
NaiveMeanEffectsPredictor	Baseline Method	Multi-Drug Model	Predicts using ANOVA-like mean effect model of cell lines and drugs
NaiveTissueMeanPredictor	Baseline Method	Multi-Drug Model	Predicts the mean response of the tissue in the training set.
NaiveTissueDrugMeanPredictor	Baseline Method	Multi-Drug Model	Predicts the mean response per tissue-drug combination in the training set (aggregated across all cell lines with that tissue-drug pair). Falls back to the overall dataset mean for unseen combinations.
AdaBoostDecisionTree	Baseline Method	Multi-Drug Model	Fits an Sklearn AdaBoost Regressor with Decision Tree base estimators. Supports flexible inputs (default: gene expression or proteomics + fingerprints).
ElasticNet	Baseline Method	Multi-Drug Model	Fits an Sklearn Elastic Net, Lasso, or Ridge model. Supports flexible inputs (default: gene expression or proteomics + fingerprints).
Lasso	Baseline Method	Multi-Drug Model	Explicitly fits an Sklearn Lasso model. Supports flexible inputs (default: gene expression or proteomics + fingerprints).
SingleDrugElasticNet	Baseline Method	Single-Drug Model	Fits an Elastic Net model for each drug separately. Supports flexible inputs (default: gene expression).
GradientBoosting	Baseline Method	Multi-Drug Model	Fits an Sklearn Histogram-based Gradient Boosting Regression Tree. Supports flexible inputs (default: gene expression or proteomics + fingerprints).
MultiViewXGBoost	Baseline Method	Multi-Drug Model	Fits an XGBoost XGBRegressor on a single or multiple cell line views. Supports flexible inputs (default: gene expression or proteomics or [gene expression + methylation + mutations + copy number variation] + fingerprints).
KNNRegressor	Baseline Method	Multi-Drug Model	Fits an Sklearn KNNRegressor. Supports flexible inputs (default: gene expression or proteomics + fingerprints).
RandomForest	Baseline Method	Multi-Drug Model	Fits an Sklearn Random Forest Regressor. Supports flexible inputs (default: gene expression or proteomics + fingerprints).
MultiViewRandomForest	Baseline Method	Multi-Drug Model	Fits an Sklearn Random Forest Regressor on multiple cell line views (default: gene expression, methylation, mutations, copy number variation) and drug fingerprints. Methylation dimensionality is reduced with PCA.
SingleDrugRandomForest	Baseline Method	Single-Drug Model	Fits an Sklearn Random Forest Regressor for each drug separately. Supports flexible inputs (default: gene expression).
SVR	Baseline Method	Multi-Drug Model	Fits an Sklearn Support Vector Regressor. Supports flexible inputs (default: gene expression or proteomics + fingerprints).
SimpleNeuralNetwork	Baseline Method	Multi-Drug Model	Fits a simple feedforward neural network (implemented with Pytorch Lightning) on flexible cell line and drug input (concatenated input) with 3 layers of varying dimensions and Dropout layers. Default: gene expression + fingerprints or drug_chemberta_embeddings.
MultiViewNeuralNetwork	Baseline Method	Multi-Drug Model	Fits a simple feedforward neural network (implemented with Pytorch Lightning) on flexible omic inputs (default: gene expression, methylation, mutation, copy number variation data), and drug fingerprints (concatenated input) with 3 layers of varying dimensions and Dropout layers. The dimensionality of the methylation data, if supplied, is reduced with a PCA to the first 100 components before it is fed to the model.
DrugGNN	Baseline Method	Multi-Drug Model	Represents drugs as graph, encodes their structure with a 3-layer GNN. Uses a 2-layer MLP for encoding gene expression. Concatenates the representations and feeds them through 2 more MLP layers.
PharmaFormer	Published Model	Multi-Drug Model	Transformer-based model using byte-pair encoded drug SMILES and gene expression features for drug response prediction.
SRMF	Published Model	Multi-Drug Model	Similarity Regularization Matrix Factorization model by Wang et al. on gene expression data and drug fingerprints. Re-implemented Matlab code into Python. The basic idea is to represent each drug and each cell line by their respective similarities to all other drugs/cell lines. Those similarities are mapped into a shared latent low-dimensional space from which responses are predicted.
MOLIR	Published Model	Single-Drug Model	Regression extension of MOLI: multi-omics late integration deep neural network. by Sharifi-Noghabi et al. Takes somatic mutation, copy number variation and gene expression data as input. MOLI reduces the dimensionality of each omics type with a hidden layer, concatenates them into one representation and optimizes this representation via a combined cost function consisting of a triplet loss and a binary cross-entropy loss. We implemented a regression adaption with MSE loss and an adapted triplet loss for regression.
SuperFELTR	Published Model	Single-Drug Model	Regression extension of SuperFELT: supervised feature extraction learning using triplet loss for drug response by Park et al. Very similar to MOLI(R). In MOLI(R), encoders and the classifier were trained jointly. Super.FELT(R) trains them independently. MOLI(R) was trained without feature selection (except for the Variance Threshold on the gene expression). Super.FELT(R) uses feature selection for all omics data.
DIPK	Published Model	Multi-Drug Model	Deep neural network Integrating Prior Knowledge from Li et al. Uses gene interaction relationships (encoded by a graph auto-encoder), gene expression profiles (encoded by a denoising auto-encoder), and molecular topologies (encoded by MolGNet). Those features are integrated using multi-head attention layers.
Precily	Published Model	Multi-Drug Model	Precily from Chawla et al. Uses GSVA pathway-activity scores with SMILESVec drug embeddings. Features are concatenated and passed through multiple linear layers with ReLU and Dropout.

Available Datasets

We provide commonly used datasets to evaluate your model on (GDSC1, GDSC2, CCLE, CTRPv2) via the --dataset_name parameter. Further, we provide 2 datasets with more clinical relevance: BeatAML2 and PDX_Bruna.

Dataset Name	Number of DRP Curves	Number of Drugs	Number of Cell Lines	Description
GDSC1	316,506	378	970	The Genomics of Drug Sensitivity in Cancer (GDSC) dataset version 1.
GDSC2	234,437	287	969	The Genomics of Drug Sensitivity in Cancer (GDSC) dataset version 2.
CCLE	11,670	24	503	The Cancer Cell Line Encyclopedia (CCLE) dataset.
CTRPv1	60,758	354	243	The Cancer Therapeutics Response Portal (CTRP) dataset version 1.
CTRPv2	395,025	546	886	The Cancer Therapeutics Response Portal (CTRP) dataset version 2.
TOYv1	2,711	36	90	A toy dataset for testing purposes subsetted from CTRPv2.
TOYv2	2,784	36	90	A second toy dataset for cross study testing purposes. 80 cell lines and 32 drugs overlap TOYv2.
BeatAML2	62,487	166	569 (patients)	Ex vivo drug sensitivity screening for a cohort of acute myeloid leukemia (AML) patients.
PDX_Bruna	2,559	104	37 (mouse passages)	Ex vivo drug sensitivity screening for short-term cultures of PDTX-derived tumor cells from breast cancer patients

If not specifying --no_refitting option with these datasets (default: false), the desired measure provided with the --measure option is appended with “_curvecurator”, e.g. “IC50_curvecurator”. In the provided datasets, these are the measures calculated with the same fitting procedure using CurveCurator. To use the measures reported from the original publications of the dataset, use the --no_refitting option, which will use the original measures as provided in the datasets.

This however makes it hard to do cross-study comparisons, since the measures may not be directly comparable due to differences in the fitting procedures used by the original authors. It is therefore recommended to always use DrEvalPy without the --no_refitting option, which will lead to the use of the refitted measures that are calculated with the same procedure for all datasets.

Corresponding feature data

The datasets have corresponding cell-line and drug feature data. The sources are as follows:

GDSC1 & 2:
- Gene expression: RMA-normalized microarray expression data from the GDSC Data Portal (raw data).
- Methylation: Preprocessed Beta Values for all CpG islands, IlluminaHumanMethylation450 BeadChip GDSC Data Portal.
CCLE, CTRPv1, CTRPv2:
- Gene expression: reprocessed RNA-seq data PRJNA523380
- Methylation: DepMap Beta Values for RRBS clusters CCLE_RRBS_TSS_CpG_clusters_20180614.txt
Used by GDSC1, 2, CCLE, CTRPv1 and v2:
- Mutation & CNV data: Sanger Cell Model Passports.
- Proteomics: Raw data at PRIDE: PXD030304
BeatAML2:
- Gene expression: RNA-seq but not re-processed because of missing FASTQ files. Taken from the corresponding website
- Mutation data would have been available but is measured too shallow, so we chose not to include it
PDX_Bruna:
- Retrieved from the corresponding figshare
- Gene expression: Microarray expression data
- Copy number variation: Reprocessed with GISTIC2.0
- Mutation data would have been available but is measured too shallow, so we chose not to include it
- Methylation data would have been available but only Promoter methylation data which is incompatible with the CpG methylation data we have for the other screens.
Drug features
- Morgan Fingerprints were generated with RDKit from SMILES either downloaded from PubChem or provided by GDSC.
- DIPK associated drive
  
  MolGNet features were generated from SMILES
  
  BIONIC features were generated from top expressed genes
Gene lists
- The 978 landmark genes are from the L1000 assay
- The drug target genes are the genes targeted by the drugs used in GDSC, extractable from the GDSC Data Portal (compounds annotation).
- The intersection lists are features occurring in all datasets for the respective OMICs to ensure that cross-study predictions can easily be done because the features are shared.
- Reduced versions of the lists only containing genes occurring in all datasets

For more information on the preprocessing, please refer to the corresponding GitHub Repo.

Custom Datasets

You can also provide your own custom dataset via the --dataset_name parameter by specifying a name that is not in the list of the available datasets. This can be prefit data (not recommended for comparability reasons) or raw viability data that is automatically fit with the exact same procedure that was used to refit the available datasets in the previous section.

Raw viability data

DrEvalPy expects a csv-formatted file in the location <path_data>/<dataset>/<dataset_name>_raw.csv (corresponding to the --path_data and --dataset_name options), which contains the raw viability data in long format with the columns [“dose”, “response”, “sample”, “drug”] and an optional “replicate” column. If replicates are provided, the procedure will fit one curve per sample / drug pair using all replicates.
All dosages have to be provided in µM! Drevalpy will compute the following response measures:
- pEC50_curvecurator: computed internally by CurveCurator. Is computed as -log10(EC50_curvecurator[M]).
- EC50_curvecurator: given in µM
- IC50_curvecurator: given in µM
- LN_IC50_curvecurator: computed from IC50_curvecurator
- AUC_curvecurator
The option --curve_curator_cores must be set. --no_refitting must not be set.
DrEvalPy provides all results of the fitting in the same folder including the fitted curves in a file folder <path_data>/<dataset>/<dataset_name>.csv

Prefit viability data

DrEvalPy expects a csv-formatted file in the location <path_data>/<dataset>/<dataset_name>.csv (corresponding to the --path_data and --dataset_name options), with at least the columns [“cell_line_id”, “drug_id”, <measure>”] where <measure> is replaced with the name of the measure you provide.
For LTO, you must also provide a “tissue” column with tissue information
Available measures depend on the column names and can be provided using the –measure option.
It is required that you use measure names that are also working with the available datasets if you use the --cross_study_datasets option
Your dataset will be read in with the DrugResponseDataset.from_csv method (drevalpy.datasets.dataset); Example response file would support the measure AUC.

Available Randomization Tests

We offer the possibility to test how much the performance of your model deteriorates when you randomize the input training data. We have several randomization modes and types available.

The modes are supplied via --randomization_mode and the types via --randomization_type.:

SVCC: Single View Constant for Cell Lines: A single cell line view (e.g., gene expression) is held unperturbed while the others are randomized.
SVCD: Single View Constant for Drugs: A single drug view (e.g., drug fingerprints) is held unperturbed while the others are randomized.
SVRC: Single View Random for Cell Lines: A single cell line view (e.g., gene expression) is randomized while the others are held unperturbed.
SVRD: Single View Random for Drugs: A single drug view (e.g., drug fingerprints) is randomized while the others are held unperturbed.

Currently, we support two ways of randomizing the data. The default is permututation.

Permutation: Permutes the features over the instances, keeping the distribution of the features the same but dissolving the relationship to the target.
Invariant: The randomization is done in a way that a key characteristic of the feature is preserved. In case of matrices, this is the mean and standard deviation of the feature view for this instance, for networks it is the degree distribution.

Robustness Test

The robustness test is a test where the model is trained with varying seeds. This is done multiple times to see how stable the model is. Via --n_trials_robustness, you can specify the number of trials for the robustness tests.

Available Metrics

We offer a variety of metrics to evaluate your model on. The default is the R^2 score. You can change the metric via the --optim_metric parameter. The following metrics are available:

R^2: The coefficient of determination. The higher the better.
MSE: The mean squared error. The lower the better.
RMSE: The root mean squared error. The lower the better.
MAE: The mean absolute error. The lower the better.
Pearson: The Pearson correlation coefficient. The higher the better.
Spearman: The Spearman correlation coefficient. The higher the better.
Kendall: The Kendall correlation coefficient. The higher the better.
Normalized [R^2, Pearson, Spearman, Kendall]: A version of the metric where the true and predicted response values are normalized by the predictions of the NaiveMeanEffectsPredictor.

Available Response Transformations

We offer the possibility to transform the response data before training the model. This can be done via the --response_transformation parameter. The following transformations are available:

None: No transformation is applied.
standard: The sklearn StandardScaler is applied.
minmax: The sklearn MinMaxScaler is applied.
robust: The sklearn RobustScaler is applied.