Datasets

Dataset module

Defines the different dataset classes.

Two classes are provided: DrugResponseDataset for response values and FeatureDataset for feature values. They both inherit from the abstract class Dataset. The DrugResponseDataset class stores drug response values per cell line and drug, and can be split into training, validation and test sets for cross-validation. The FeatureDataset class stores feature values per cell line or drug, can also store meta information for the feature views, and can be used to randomize feature vectors.

class drevalpy.datasets.dataset.DrugResponseDataset(response, cell_line_ids, drug_ids, tissues=None, predictions=None, dataset_name='unnamed')

Bases: object

Drug response dataset.

Parameters:
add_rows(other)

Adds rows from another dataset.

Parameters:

other (DrugResponseDataset) – other dataset

Return type:

None

property cell_line_ids: ndarray

Returns the cell_line_ids.

Returns:

numpy array containing cell_line_ids values.

copy()

Returns a copy of the drug response dataset.

Returns:

copy of the dataset

property cv_splits: list[dict[str, DrugResponseDataset]]

Returns the cv_splits.

Returns:

DrugResponseDatasets containing the CV_splits.

property dataset_name: str

Returns the name of this DrugResponseDataset.

Used in the pipeline.

Returns:

dataset name.

property drug_ids: ndarray

Returns the drug_ids.

Returns:

numpy array containing drug_ids values.

fit_transform(response_transformation)

Fit and transform the response data and prediction data of the dataset.

Parameters:

response_transformation (TransformerMixin) – e.g., StandardScaler, MinMaxScaler, RobustScaler

Return type:

None

classmethod from_csv(input_file, dataset_name='unknown', measure='response', tissue_column='tissue')

Load a dataset from a csv file.

This function creates a DrugResponseDataset from a provided input file in csv format. The following columns are required:

  • response: the drug response values as floating point values

  • cell_line_name: a string identifier for cell lines

  • pubchem_id: a string identifier for drugs

  • predictions: an optional column containing drug response predictions

  • LN_IC50_curvecurator: the name of the column containing the measure to predict

Parameters:
  • input_file (str | Path) – Path to the csv file containing the data to be loaded

  • dataset_name (str) – Optional name to associate the dataset with, default = “unknown”

  • measure (str) – The name of the column containing the measure to predict, default = “response”

  • tissue_column (str | None) – Optional name of the column containing tissue types

Raises:

ValueError – If the required columns are not found in the input file

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset object containing data from provided csv file.
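
The column requirement can be illustrated without drevalpy itself. The following is a minimal sketch, using only the standard library, of the kind of check that the Raises entry describes. The helper name missing_columns and the sample rows are hypothetical, and the real loader may validate its input differently.

```python
import csv
import io

# Identifier columns listed as required above; the measure column is added per call.
REQUIRED = ("cell_line_name", "pubchem_id")


def missing_columns(csv_text, measure="response"):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [col for col in (*REQUIRED, measure) if col not in header]


# Hypothetical sample data, not taken from any real dataset.
good = "cell_line_name,pubchem_id,response\nCL1,123,0.5\n"
bad = "cell_line_name,response\nCL1,0.5\n"

print(missing_columns(good))  # nothing is missing
print(missing_columns(bad))   # the drug identifier column is missing
```

A loader built on such a check could raise ValueError whenever the returned list is non-empty.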

inverse_transform(response_transformation)

Inverse transform the response data and prediction data of the dataset.

Parameters:

response_transformation (TransformerMixin) – e.g., StandardScaler, MinMaxScaler, RobustScaler

Return type:

None

load_splits(path)

Load cross validation splits from path/cv_split_0_train.csv and path/cv_split_0_test.csv.

Parameters:

path (str) – path to the directory containing the cv split files

Raises:

AssertionError – if no cv split files are found in path

Return type:

None

mask(mask)

Removes rows from the dataset based on a boolean mask.

Parameters:

mask (ndarray) – boolean mask

Raises:

ValueError – if mask is not boolean or integer

Return type:

None

property predictions: ndarray | None

Returns the predictions if they exist.

Returns:

numpy array containing prediction values or None.

reduce_to(cell_line_ids=None, drug_ids=None)

Removes all rows which contain a cell_line not in cell_line_ids or a drug not in drug_ids.

Parameters:
  • cell_line_ids (ndarray | None) – cell line IDs or None to keep all cell lines

  • drug_ids (ndarray | None) – drug IDs or None to keep all drugs

Return type:

None
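
The filtering rule of reduce_to can be sketched on plain tuples. This is an illustration under stated assumptions, not the library's implementation: the function reduce_rows and the row layout are hypothetical, and the sketch returns a new list, whereas the documented method modifies the dataset in place and returns None.

```python
def reduce_rows(rows, cell_line_ids=None, drug_ids=None):
    """Keep only rows whose cell line and drug are in the allowed sets.

    rows is a list of (cell_line_id, drug_id, response) tuples.
    Passing None for a filter keeps every value of that column.
    """
    keep_cells = None if cell_line_ids is None else set(cell_line_ids)
    keep_drugs = None if drug_ids is None else set(drug_ids)
    return [
        (cell, drug, resp)
        for cell, drug, resp in rows
        if (keep_cells is None or cell in keep_cells)
        and (keep_drugs is None or drug in keep_drugs)
    ]


# Hypothetical example rows.
rows = [("CL1", "D1", 0.2), ("CL1", "D2", 0.4), ("CL2", "D1", 0.9)]
print(reduce_rows(rows, cell_line_ids=["CL1"]))
print(reduce_rows(rows, cell_line_ids=["CL1"], drug_ids=["D1"]))
```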

remove_nan_responses()

Removes rows with NaN values in the response.

Return type:

None

remove_rows(indices)

Removes rows from the dataset.

Parameters:

indices (ndarray) – indices of rows to remove

Raises:

ValueError – if indices are out of bounds or not 1-dimensional

Return type:

None

property response: ndarray

Returns the response values.

Returns:

numpy array containing response values.

save_splits(path)

Save cross validation splits to path/cv_split_0_train.csv and path/cv_split_0_test.csv.

Parameters:

path (str) – path to the directory where the cv split files are saved

Raises:

AssertionError – if DrugResponseDataset was not split

shuffle(random_state=42)

Shuffles the dataset.

Parameters:

random_state (int) – random state

Return type:

None

split_dataset(n_cv_splits, mode, split_validation=True, split_early_stopping=True, validation_ratio=0.1, random_state=42)

Splits the dataset into training, validation and test sets for cross-validation.

Parameters:
  • n_cv_splits (int) – number of cross-validation splits, e.g., 5

  • mode (str) – split mode (‘LPO’, ‘LCO’, ‘LTO’, ‘LDO’)

  • split_validation (bool) – if True, a validation set is generated

  • split_early_stopping (bool) – if True, an early stopping set is generated

  • validation_ratio (float) – ratio of validation set size to training set size

  • random_state (int) – random state

Return type:

list[dict]

Returns:

list of dictionaries containing the cross-validation datasets. Each fold is a dictionary with keys ‘train’, ‘validation’, ‘test’, ‘validation_es’, ‘early_stopping’.

Raises:
  • ValueError – if mode is not ‘LPO’, ‘LCO’, ‘LTO’, or ‘LDO’

  • ValueError – if LTO cross-validation but tissue information not provided
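
To make the grouped split modes concrete, here is a minimal sketch of a leave-group-out split. It assumes that ‘LCO’ means no cell line is shared between the training and test side of a fold (and analogously drugs for ‘LDO’ and tissues for ‘LTO’). The function leave_group_out_folds is hypothetical and does not reproduce how split_dataset builds validation or early stopping sets.

```python
import random


def leave_group_out_folds(group_ids, n_cv_splits, random_state=42):
    """Yield (train_idx, test_idx) so that no group appears on both sides."""
    groups = sorted(set(group_ids))
    random.Random(random_state).shuffle(groups)
    # Deal the shuffled groups round-robin into n_cv_splits buckets.
    buckets = [set(groups[i::n_cv_splits]) for i in range(n_cv_splits)]
    for held_out in buckets:
        test_idx = [i for i, g in enumerate(group_ids) if g in held_out]
        train_idx = [i for i, g in enumerate(group_ids) if g not in held_out]
        yield train_idx, test_idx


# Hypothetical cell line column of a response table (one entry per row).
cell_lines = ["CL1", "CL1", "CL2", "CL2", "CL3", "CL3", "CL4", "CL4"]
folds = list(leave_group_out_folds(cell_lines, n_cv_splits=2))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Passing the drug or tissue column instead of the cell line column gives the corresponding grouped split under the same assumption.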

property tissue: ndarray | None

Returns the tissue types if they exist.

Returns:

numpy array containing tissue types or None.

to_csv(path)

Stores the drug response dataset on disk.

Parameters:

path (str | Path) – path to desired storage location

to_dataframe()

Convert the dataset into a pandas DataFrame.

Return type:

DataFrame

Returns:

pandas DataFrame of the dataset

transform(response_transformation)

Apply transformation to the response data and prediction data of the dataset.

Parameters:

response_transformation (TransformerMixin) – e.g., StandardScaler, MinMaxScaler, RobustScaler

Return type:

None
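
fit_transform, transform and inverse_transform follow the familiar scikit-learn pattern. The sketch below shows that pattern on plain lists, assuming the usual workflow of fitting on training responses only and reusing the fitted statistics elsewhere. SimpleStandardScaler is a hypothetical stand-in, not a drevalpy or scikit-learn class.

```python
from statistics import mean, pstdev


class SimpleStandardScaler:
    """Tiny stand-in for a scikit-learn style scaler (hypothetical helper)."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.scale_ = pstdev(values) or 1.0  # avoid division by zero
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.scale_ for v in values]

    def inverse_transform(self, values):
        return [v * self.scale_ + self.mean_ for v in values]


train_response = [1.0, 2.0, 3.0, 4.0]
test_response = [2.5, 5.0]

scaler = SimpleStandardScaler().fit(train_response)  # fit on training data only
train_scaled = scaler.transform(train_response)
test_scaled = scaler.transform(test_response)        # reuse the fitted statistics
restored = scaler.inverse_transform(test_scaled)     # back to the original scale
print(train_scaled, test_scaled, restored)
```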

class drevalpy.datasets.dataset.FeatureDataset(features, meta_info=None)

Bases: object

Class for feature datasets.

This class represents datasets with one or more views of features associated with a set of entities, such as drugs or cell lines. The feature data is stored in a nested dictionary structure:

{
    identifier_1: {
        view_name_1: feature_vector,
        view_name_2: feature_vector,
        …
    },
    identifier_2: {
        view_name_1: feature_vector,
        view_name_2: feature_vector,
        …
    }
}

  • Each outer key is a string identifier (e.g. a cell line ID or drug ID)

  • Each inner key is the name of a view (e.g. ‘gene_expression’, ‘fingerprints’)

  • Each inner value is a feature vector or object representing that view for the identifier
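
As a concrete illustration of this layout, the following sketch builds such a nested dictionary by hand and reads identifiers, view names and one view back out. All identifiers, view names and values are made up for the example.

```python
# Hypothetical identifiers, view names, and values for illustration only.
features = {
    "CL1": {"gene_expression": [0.1, 2.3, 1.7], "mutations": [0, 1, 0]},
    "CL2": {"gene_expression": [1.2, 0.4, 3.1], "mutations": [1, 1, 0]},
}

identifiers = list(features)                      # outer keys
view_names = list(next(iter(features.values())))  # inner keys of the first entry

# Every identifier is expected to carry the same set of views.
assert all(set(views) == set(view_names) for views in features.values())

# Stack one view into a row-per-identifier matrix (a list of lists here).
matrix = [features[i]["gene_expression"] for i in identifiers]
print(identifiers, view_names, matrix)
```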

Parameters:
add_features(other)

Adds feature views from another dataset using an inner join (only common identifiers are kept).

Parameters:

other (FeatureDataset) – other dataset

Raises:

AssertionError – if feature views overlap

Return type:

None
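
The inner-join behaviour can be sketched on two nested dictionaries. This is an illustration of the documented semantics, not the library code; inner_join_views is a hypothetical helper and returns a new dictionary instead of modifying the dataset in place.

```python
def inner_join_views(left, right):
    """Merge two nested feature dicts, keeping only shared identifiers."""
    left_views = {v for views in left.values() for v in views}
    right_views = {v for views in right.values() for v in views}
    # Mirrors the documented AssertionError for overlapping views.
    assert not (left_views & right_views), "feature views overlap"
    common = [i for i in left if i in right]
    return {i: {**left[i], **right[i]} for i in common}


# Hypothetical single-view datasets with partially overlapping identifiers.
expression = {"CL1": {"gene_expression": [0.1]}, "CL2": {"gene_expression": [0.2]}}
mutations = {"CL2": {"mutations": [1]}, "CL3": {"mutations": [0]}}

merged = inner_join_views(expression, mutations)
print(merged)  # only CL2 is present in both inputs
```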

add_meta_info(other)

Adds meta information to the feature dataset.

Parameters:

other (FeatureDataset) – other dataset

Return type:

None

apply(function, view)

Applies a function to the features of a view.

Parameters:
  • function (Callable) – function to apply

  • view (str) – view to apply the function to

copy()

Returns a copy of the feature dataset.

Returns:

copy of the dataset

property features: dict[str, dict[str, Any]]

Returns the features.

Returns:

features of this FeatureDataset

fit_transform_features(train_ids, transformer, view)

Fits and applies a transformation. Fitting is done only on the train_ids.

Parameters:
  • train_ids (ndarray) – The IDs corresponding to the training dataset.

  • transformer (TransformerMixin) – sklearn transformer

  • view (str) – the view to transform

Returns:

The modified FeatureDataset with the transformed features of the given view.

Raises:
classmethod from_csv(path_to_csv, id_column, view_name, drop_columns=None, transpose=False, extract_meta_info=True)

Load a one-view feature dataset from a csv file.

The rows of the csv file represent the instances (cell lines or drugs), and the columns represent the features. The column named by id_column contains the identifiers of the instances. All unrelated columns (e.g. other id columns) should be passed as drop_columns; they are removed from the dataset.

Parameters:
  • path_to_csv (str | Path) – path to the csv file containing the data to be loaded

  • view_name (str) – name of the view (e.g. gene_expression)

  • id_column (str) – name of the column containing the identifiers

  • drop_columns (list[str] | None) – list of columns to drop (e.g. other identifier columns)

  • transpose (bool) – if True, the csv is transposed, i.e. the rows become columns and vice versa

  • extract_meta_info (bool) – if True, extracts meta information from the dataset, e.g. gene names for gene expression

Returns:

FeatureDataset object containing data from provided csv file.

get_feature_matrix(view, identifiers)

Returns the feature matrix for the given view.

The feature view must be a vector or matrix.

Parameters:
  • view (str) – view name

  • identifiers (ndarray) – list of identifiers (cell lines or drugs)

Return type:

ndarray

Returns:

feature matrix

Raises:
property identifiers: ndarray

Returns the identifiers of the features.

Used in the pipeline.

Returns:

feature identifiers of this FeatureDataset

property meta_info: dict[str, Any]

Returns the meta information.

Returns:

Meta information of this FeatureDataset

randomize_features(views_to_randomize, randomization_type)

Randomizes the feature vectors.

Permutation permutes the feature vectors. Invariant means that the randomization is done in a way that a key characteristic of the feature is preserved. In case of matrices, this is the mean and standard deviation of the feature view for this instance, for networks it is the degree distribution.

Parameters:
  • views_to_randomize (str | list[str]) – name of feature view or list of names of multiple feature views to randomize. The other views are not randomized.

  • randomization_type (str) – randomization type (‘permutation’, ‘invariant’).

Raises:
  • AssertionError – if randomization_type is not ‘permutation’ or ‘invariant’

  • ValueError – if no invariant randomization is available for the feature view type

Return type:

None
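
One simple way to randomize a feature vector while keeping its mean and standard deviation is to shuffle its entries. The sketch below demonstrates only that property. It is an assumption that this resembles the ‘invariant’ mode; the library may instead preserve these statistics by other means, and shuffle_within_vector is a hypothetical helper.

```python
import random
from statistics import mean, pstdev


def shuffle_within_vector(vector, rng):
    """Return a copy of vector with its entries in random order."""
    shuffled = list(vector)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)
original = [0.5, 1.5, 2.0, 4.0, 7.0]
randomized = shuffle_within_vector(original, rng)

# The multiset of values is unchanged, so mean and standard deviation match.
print(mean(original), mean(randomized))
print(pstdev(original), pstdev(randomized))
```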

to_csv(path, id_column, view_name)

Save the feature dataset to a CSV file. If meta_info is available for the view and valid, it will be written as column names.

Parameters:
  • path (str | Path) – Path to the CSV file.

  • id_column (str) – Name of the column containing the identifiers.

  • view_name (str) – Name of the view.

transform_features(ids, transformer, view)

Applies a transformation like standard scaling to features.

Parameters:
  • ids (ndarray) – The IDs to transform

  • transformer (TransformerMixin) – fitted sklearn transformer

  • view (str) – the view to transform

Raises:
property view_names: list[str]

Returns the view_names.

Returns:

view_names of this FeatureDataset

drevalpy.datasets.dataset.split_early_stopping_data(validation_dataset, test_mode)

Splits the validation dataset into a validation and an early stopping dataset.

Parameters:
  • validation_dataset (DrugResponseDataset) – input validation dataset

  • test_mode (str) – LPO, LCO, LTO, LDO

Raises:

ValueError – if test_mode is not one of the expected values

Return type:

tuple[DrugResponseDataset, DrugResponseDataset]

Returns:

the resulting validation and early stopping datasets

Loaders

Contains functions to load the GDSC1, GDSC2, CCLE, CTRPv1, CTRPv2, BeatAML2, PDX_Bruna, and Toy datasets, as well as custom datasets.

drevalpy.datasets.loader.check_measure(measure_queried, measures_data, dataset_name)

Check if the queried measure is in the dataset.

Parameters:
  • measure_queried (str) – The measure to check.

  • measures_data (list[str]) – The measures in the dataset.

  • dataset_name (str) – The name of the dataset.

Raises:

ValueError – If the measure is not found in the dataset.

Return type:

None
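
The documented behaviour amounts to a membership test that raises on failure. A minimal sketch, assuming nothing about the real error message (the wording below and the name check_measure_sketch are hypothetical):

```python
def check_measure_sketch(measure_queried, measures_data, dataset_name):
    """Raise ValueError when the requested measure is not available."""
    if measure_queried not in measures_data:
        raise ValueError(
            f"Measure '{measure_queried}' not found in dataset {dataset_name}. "
            f"Available measures: {', '.join(measures_data)}"
        )


available = ["LN_IC50_curvecurator", "response"]
check_measure_sketch("response", available, "custom")  # passes silently
try:
    check_measure_sketch("AUC", available, "custom")
except ValueError as err:
    message = str(err)
    print(message)
```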

drevalpy.datasets.loader.load_beataml2(path_data='data', measure='LN_IC50_curvecurator')

Loads the BeatAML2 dataset.

Parameters:
  • path_data (str) – Path to the dataset.

  • measure (str) – The name of the column containing the measure to predict, default: LN_IC50_curvecurator

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs.

drevalpy.datasets.loader.load_ccle(path_data='data', measure='LN_IC50_curvecurator')

Loads the CCLE dataset.

Parameters:
  • path_data (str) – Path to the dataset.

  • measure (str) – The name of the column containing the measure to predict, default = “LN_IC50_curvecurator”

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs.

drevalpy.datasets.loader.load_ctrpv1(path_data='data', measure='LN_IC50_curvecurator')

Load CTRPv1 dataset.

Parameters:
  • path_data (str) – Path to the location of CTRPv1 dataset

  • measure (str) – The name of the column containing the measure to predict, default = “LN_IC50_curvecurator”

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs

drevalpy.datasets.loader.load_ctrpv2(path_data='data', measure='LN_IC50_curvecurator')

Load CTRPv2 dataset.

Parameters:
  • path_data (str) – Path to the location of CTRPv2 dataset

  • measure (str) – The name of the column containing the measure to predict, default: LN_IC50_curvecurator

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs

drevalpy.datasets.loader.load_custom(path_data, dataset_name='custom', measure='response', tissue_column=None)

Load custom dataset.

Parameters:
  • path_data (str | Path) – Path to location of custom dataset

  • dataset_name (str) – Name of the dataset.

  • measure (str) – The name of the column containing the measure to predict, default = “response”

  • tissue_column (str | None) – The name of the column containing the tissue type. If None, no tissue information is loaded.

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs

drevalpy.datasets.loader.load_dataset(dataset_name, path_data='data', measure='response', curve_curator=False, cores=1, tissue_column=None, normalize=False)

Load a dataset based on the dataset name.

Parameters:
  • dataset_name (str) – The name of the dataset to load. Can be one of (‘GDSC1’, ‘GDSC2’, ‘CCLE’, ‘TOYv1’, or ‘TOYv2’) to download provided datasets, or any other name to allow for custom datasets.

  • path_data (str) – The parent path in which custom or downloaded datasets should be located, or in which raw viability data is to be found for fitting with CurveCurator (see param curve_curator for details). The location of a dataset is resolved as <path_data>/<dataset_name>/<dataset_name>.csv.

  • measure (str) – The name of the column containing the measure to predict, default = “response”. If curve_curator is True, this measure is appended with “_curvecurator”, e.g. “response_curvecurator” to distinguish between measures provided by the original source of a dataset, or the measures fit by CurveCurator.

  • curve_curator (bool) – If True, the measure is appended with “_curvecurator”. If a custom dataset_name was provided, this will invoke the fitting procedure of raw viability data, which is expected to exist at <path_data>/<dataset_name>/<dataset_name>_raw.csv. The fitted dataset will be stored in the same folder, in a file called <dataset_name>.csv

  • cores (int) – Number of cores to use for CurveCurator fitting. Only used when curve_curator is True, default = 1

  • tissue_column (str | None) – The name of the column containing the tissue type. If None, no tissue information is loaded. This is only used when loading a custom dataset. Default = None.

  • normalize (bool) – Whether to normalize the response values to [0, 1] for curvecurator. Default = False. Only used for custom datasets when curve_curator is True.

Return type:

DrugResponseDataset

Returns:

A DrugResponseDataset containing response, cell line IDs, drug IDs, and dataset name.

Raises:

FileNotFoundError – If the custom dataset or raw viability data could not be found at the given path.
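
The path and measure conventions described in the parameter list can be written down with pathlib. This is a sketch of the stated rules only. The function resolve_paths is hypothetical, and whether a measure name that already ends in “_curvecurator” receives the suffix again is not specified above.

```python
from pathlib import Path


def resolve_paths(dataset_name, path_data="data", measure="response", curve_curator=False):
    """Return (fitted_csv, raw_csv, measure_column) following the rules above."""
    folder = Path(path_data) / dataset_name
    fitted_csv = folder / f"{dataset_name}.csv"
    raw_csv = folder / f"{dataset_name}_raw.csv"
    # With curve_curator=True the measure name gets the "_curvecurator" suffix.
    measure_column = f"{measure}_curvecurator" if curve_curator else measure
    return fitted_csv, raw_csv, measure_column


# "my_screen" is a made-up custom dataset name.
fitted, raw, column = resolve_paths("my_screen", curve_curator=True)
print(fitted.as_posix(), raw.as_posix(), column)
```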

drevalpy.datasets.loader.load_gdsc1(path_data='data', measure='LN_IC50_curvecurator')

Loads the GDSC1 dataset.

Parameters:
  • path_data (str) – Path to the dataset.

  • measure (str) – The name of the column containing the measure to predict, default = “LN_IC50_curvecurator”

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs.

drevalpy.datasets.loader.load_gdsc2(path_data='data', measure='LN_IC50_curvecurator')

Loads the GDSC2 dataset.

Parameters:
  • path_data (str) – Path to the dataset.

  • measure (str) – The name of the column containing the measure to predict, default = “LN_IC50_curvecurator”

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs.

drevalpy.datasets.loader.load_pdx_bruna(path_data='data', measure='LN_IC50_curvecurator')

Loads the PDX_Bruna dataset.

Parameters:
  • path_data (str) – Path to the dataset.

  • measure (str) – The name of the column containing the measure to predict, default: LN_IC50_curvecurator

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs.

drevalpy.datasets.loader.load_toyv1(path_data='data', measure='LN_IC50_curvecurator')

Loads small Toy dataset, subsampled from CTRPv2.

Parameters:
  • path_data (str) – Path to the dataset.

  • measure (str) – The name of the column containing the measure to predict, default = “LN_IC50_curvecurator”

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs.

drevalpy.datasets.loader.load_toyv2(path_data='data', measure='LN_IC50_curvecurator')

Loads small Toy dataset, subsampled from GDSC2. Can be used to test cross study prediction.

Parameters:
  • path_data (str) – Path to the dataset.

  • measure (str) – The name of the column containing the measure to predict, default = “LN_IC50_curvecurator”

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs.

CurveCurator

Contains all functions required for CurveCurator fitting.

CurveCurator publication: Bayer, F.P., Gander, M., Kuster, B. et al. CurveCurator: a recalibrated F-statistic to assess, classify, and explore significance of dose–response curves. Nat Commun 14, 7902 (2023). https://doi.org/10.1038/s41467-023-43696-z

CurveCurator applies a recalibrated F-statistic for p-value estimation of 4-parameter log-logistic regression fits. In drevalpy, this can be used to generate higher-quality training data, since quality measures such as the p-value, R², or relevance score can be used to filter out low-quality viability measurements.

drevalpy.datasets.curvecurator.fit_curves(input_file, output_dir, dataset_name, cores, normalize=False)

Fit curves for provided raw viability data.

This function reads viability data in a predefined input format, preprocesses the data to be readable by CurveCurator, fits curves to the data using CurveCurator, and postprocesses the fitted data to the format required by drevalpy.

Parameters:
  • input_file (str) – Path to the file containing the raw viability data

  • output_dir (str) – Path to store all the files to, including the preprocessed data, the config.toml for CurveCurator, CurveCurator’s output files, and the postprocessed data

  • dataset_name (str) – The name of the dataset, used as the file name of the postprocessed <dataset_name>.csv file

  • cores (int) – The number of cores to use for fitting the curves with CurveCurator. The value written into the config.toml is capped at the number of curves to fit, i.e. min(n_curves, cores)

  • normalize (bool) – Whether to normalize the response values to [0, 1] for curvecurator. Default = False.

drevalpy.datasets.curvecurator.postprocess(output_folder, dataset_name)

Postprocess CurveCurator output files.

This function reads all curves.tsv files created by CurveCurator, which contain the fitted curve parameters, postprocesses them to be used by drevalpy and combines everything in one <dataset_name>.csv file for usage by drevalpy.

Parameters:
  • output_folder (str) – Path to the output folder of CurveCurator containing the curves.txt file.

  • dataset_name (str) – The name of the dataset, used as the file name of the postprocessed <dataset_name>.csv file

drevalpy.datasets.curvecurator.preprocess(input_file, output_dir, dataset_name, cores, normalize=False)

Preprocess raw viability data and create required input files for CurveCurator.

This function takes an input file containing raw viability data in long format. The required columns are “dose”, “response”, “sample”, and “drug”, with an optional “replicate” column. If there are multiple dose ranges or numbers of replicates, groups of the form (maxdose, mindose, n_replicates) are created to keep the number of parameters for fitting low and the input dataframes for CurveCurator as dense as possible. All dosages must be provided in µM. All responses must already be normalized against the control, and the control response itself must not be included.

Parameters:
  • input_file (str) – Path to csv file containing the raw viability data

  • output_dir (str) – Path to store all the files to, including the preprocessed data, the config.toml for CurveCurator, CurveCurator’s output files, and the postprocessed data

  • dataset_name (str) – Name of the dataset

  • cores (int) – The number of cores to use for fitting the curves with CurveCurator. The value written into the config.toml is capped at the number of curves to fit, i.e. min(n_curves, cores)

  • normalize (bool) – Whether to normalize the response values to [0, 1] for curvecurator. Default = False.

Raises:

ValueError – If required columns are not found in the provided input file.
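
The grouping by (maxdose, mindose, n_replicates) can be sketched with plain dictionaries. The rows below are invented, and grouping per (sample, drug) curve is an assumption made for the illustration; the actual preprocessing may form its groups at a different level.

```python
from collections import defaultdict

# Hypothetical long-format rows: (sample, drug, dose in µM, replicate, response).
rows = [
    ("CL1", "D1", 0.1, 1, 0.95), ("CL1", "D1", 1.0, 1, 0.60), ("CL1", "D1", 10.0, 1, 0.20),
    ("CL2", "D1", 0.1, 1, 0.90), ("CL2", "D1", 1.0, 1, 0.70), ("CL2", "D1", 10.0, 1, 0.30),
    ("CL1", "D2", 0.01, 1, 0.99), ("CL1", "D2", 1.0, 1, 0.50),
    ("CL1", "D2", 0.01, 2, 0.97), ("CL1", "D2", 1.0, 2, 0.55),
]

# Collect doses and replicate labels per (sample, drug) curve.
curves = defaultdict(lambda: {"doses": set(), "replicates": set()})
for sample, drug, dose, replicate, _ in rows:
    curves[(sample, drug)]["doses"].add(dose)
    curves[(sample, drug)]["replicates"].add(replicate)

# Curves sharing (maxdose, mindose, n_replicates) end up in the same group.
groups = defaultdict(list)
for key, info in curves.items():
    signature = (max(info["doses"]), min(info["doses"]), len(info["replicates"]))
    groups[signature].append(key)

for signature, members in groups.items():
    print(signature, members)
```

Curves in one group share the same dose grid and replicate count, which is what keeps each per-group input table dense.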

Utility functions

Utility functions for datasets.

drevalpy.datasets.utils.download_dataset(dataset_name, data_path='data', redownload=False)

Download the latest dataset from Zenodo.

Parameters:
  • dataset_name (str) – dataset name, from “GDSC1”, “GDSC2”, “CCLE”, “CTRPv1”, “CTRPv2”, “TOYv1”, “TOYv2”, “meta”

  • data_path (str) – where to save the data

  • redownload (bool) – whether to redownload the data

Raises:

HTTPError – if the download fails

drevalpy.datasets.utils.download_from_url(dataset_name, file_url)

Download a file from a given URL.

Parameters:
  • dataset_name (str) – name of the dataset

  • file_url (str) – exact URL to the zip file

Return type:

Response

Returns:

HTTP response containing response.content

Raises:

HTTPError – if the download fails

drevalpy.datasets.utils.permute_features(features, identifiers, views_to_permute, all_views)

Permute the specified views for each entity (= cell line or drug).

E.g. each cell line gets the feature vector/graph/image… of another cell line. Drawn without replacement.

Parameters:
  • features (dict[str, dict[str, Any]]) – dictionary of features

  • identifiers (ndarray) – array of identifiers

  • views_to_permute (list[str]) – list of views to permute

  • all_views (list[str]) – list of all views

Return type:

dict

Returns:

permuted features
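
A minimal sketch of such a permutation on the nested feature dictionary follows. It mirrors the documented parameters but is not the library code: permute_views and its seed argument are hypothetical, and a random permutation may occasionally map an entity to itself.

```python
import random


def permute_views(features, identifiers, views_to_permute, all_views, seed=0):
    """Give each identifier the selected views of another identifier."""
    rng = random.Random(seed)
    donors = list(identifiers)
    rng.shuffle(donors)  # a permutation, i.e. drawn without replacement
    return {
        entity: {
            view: features[donor if view in views_to_permute else entity][view]
            for view in all_views
        }
        for entity, donor in zip(identifiers, donors)
    }


# Hypothetical two-view feature dictionary.
features = {
    "CL1": {"gene_expression": [1.0], "mutations": [0]},
    "CL2": {"gene_expression": [2.0], "mutations": [1]},
    "CL3": {"gene_expression": [3.0], "mutations": [1]},
}
ids = list(features)
permuted = permute_views(features, ids, ["gene_expression"], ["gene_expression", "mutations"])
print(permuted)
```

Views not listed in views_to_permute stay attached to their original identifier.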

drevalpy.datasets.utils.randomize_graph(original_graph)

Randomizes the graph by shuffling the edges while preserving the degree sequence.

Parameters:

original_graph (Graph) – The original graph

Return type:

Graph

Returns:

Randomized graph with the same degree sequence and node attributes
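
Degree-preserving randomization is commonly done with double-edge swaps. The sketch below applies that technique to a plain edge list of an undirected simple graph. It is an illustration under that assumption: the function names are hypothetical, node attributes are not handled, and the library function operates on a Graph object rather than on tuples.

```python
import random
from collections import Counter


def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Rewire an undirected simple graph with double-edge swaps."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        # Proposed swap: (a, b), (c, d) -> (a, d), (c, b).
        if a == d or c == b:
            continue  # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present or new1 == new2:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


def degrees(edges):
    """Count how many edges touch each node."""
    return Counter(node for edge in edges for node in edge)


original = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (1, 3)]
shuffled = degree_preserving_shuffle(original, n_swaps=20)
print(shuffled)
print(degrees(original) == degrees(shuffled))
```

Each accepted swap removes one edge at every involved node and adds one back, so every node keeps its degree.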

drevalpy.datasets.utils.unzip_data(path_to_zip, response, data_path)

Unzips the downloaded data.

Parameters:
  • path_to_zip (Path) – Path to the zip file to be unzipped.

  • response (Response) – HTTP response containing response.content

  • data_path (str) – Where the unzipped directory should be stored