Datasets

Dataset module

Defines the different dataset classes.

The module provides two classes: DrugResponseDataset for response values and FeatureDataset for feature values. They both inherit from the abstract class Dataset. The DrugResponseDataset class stores drug response values per cell line and drug, and can be split into training, validation and test sets for cross-validation. The FeatureDataset class stores feature values per cell line or drug, can also store meta information for the feature views, and can be used to randomize feature vectors.

class drevalpy.datasets.dataset.DrugResponseDataset(response, cell_line_ids, drug_ids, tissues=None, predictions=None, dataset_name='unnamed')

Bases: object

Drug response dataset.

Parameters:

response (ndarray)
cell_line_ids (ndarray)
drug_ids (ndarray)
tissues (ndarray | None)
predictions (ndarray | None)
dataset_name (str)

add_rows(other)

Adds rows from another dataset.

Parameters:: other (DrugResponseDataset) – other dataset
Return type:: None

property cell_line_ids: ndarray

Returns the cell_line_ids.

Returns:: numpy array containing cell_line_ids values.

copy()

Returns a copy of the drug response dataset.

Returns:: copy of the dataset

property cv_splits: list[dict[str, DrugResponseDataset]]

Returns the cv_splits.

Returns:: DrugResponseDatasets containing the CV_splits.

property dataset_name: str

Returns the name of this DrugResponseDataset.

Used in the pipeline.

Returns:: dataset name.

property drug_ids: ndarray

Returns the drug_ids.

Returns:: numpy array containing drug_ids values.

fit_transform(response_transformation)

Fit and transform the response data and prediction data of the dataset.

Parameters:: response_transformation (TransformerMixin) – e.g., StandardScaler, MinMaxScaler, RobustScaler
Return type:: None

classmethod from_csv(input_file, dataset_name='unknown', measure='response', tissue_column='tissue')

Load a dataset from a csv file.

This function creates a DrugResponseDataset from a provided input file in csv format. The file is expected to contain the following columns (the predictions column is optional):

response: the drug response values as floating point values
cell_line_name: a string identifier for cell lines
pubchem_id: a string identifier for drugs
predictions: an optional column containing drug response predictions
LN_IC50_curvecurator: the name of the column containing the measure to predict

Parameters:

input_file (str | Path) – Path to the csv file containing the data to be loaded
dataset_name (str) – Optional name to associate the dataset with, default = “unknown”
measure (str) – The name of the column containing the measure to predict, default = “response”
tissue_column (str | None) – Optional column name of column containing tissue types

Raises:

ValueError – If the required columns are not found in the input file

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset object containing data from provided csv file.
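The column check described for from_csv can be sketched as follows. This is a minimal standard-library illustration and not drevalpy's implementation; the helper name read_response_rows and the REQUIRED set are assumptions made for the example.

```python
import csv
import io

# Hypothetical set of identifier columns, taken from the column list above.
REQUIRED = {"cell_line_name", "pubchem_id"}


def read_response_rows(handle, measure="response"):
    """Read rows from a CSV handle and check that the needed columns exist."""
    reader = csv.DictReader(handle)
    missing = (REQUIRED | {measure}) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    # Convert the measure column to float; keep the identifiers as strings.
    return [
        (row["cell_line_name"], row["pubchem_id"], float(row[measure]))
        for row in reader
    ]


demo = io.StringIO("cell_line_name,pubchem_id,response\nA549,2244,0.5\n")
rows = read_response_rows(demo)
```

Passing a different measure name checks for and reads that column instead of response.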

inverse_transform(response_transformation)

Inverse transform the response data and prediction data of the dataset.

Parameters:: response_transformation (TransformerMixin) – e.g., StandardScaler, MinMaxScaler, RobustScaler
Return type:: None
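The relationship between fit_transform and inverse_transform can be illustrated with a small round trip. The class TinyStandardScaler below is a hypothetical stand-in that only mimics the fit/transform/inverse_transform protocol of scalers such as StandardScaler; it is not part of drevalpy or scikit-learn.

```python
import statistics


class TinyStandardScaler:
    """Hypothetical stand-in for a scaler with fit/transform/inverse_transform."""

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        # Population standard deviation; fall back to 1.0 for constant input.
        self.scale_ = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.scale_ for v in values]

    def inverse_transform(self, values):
        return [v * self.scale_ + self.mean_ for v in values]


response = [1.0, 2.0, 3.0, 4.0]
scaler = TinyStandardScaler().fit(response)
scaled = scaler.transform(response)            # zero mean, unit variance
restored = scaler.inverse_transform(scaled)    # back on the original scale
```

Applying the inverse transformation to scaled responses or predictions puts them back on the original scale, which is typically what is needed before reporting errors in the original units.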

load_splits(path)

Load cross validation splits from path/cv_split_0_train.csv and path/cv_split_0_test.csv.

Parameters:: path (str) – path to the directory containing the cv split files
Raises:: AssertionError – if no cv split files are found in path
Return type:: None

mask(mask)

Removes rows from the dataset based on a boolean mask.

Parameters:: mask (ndarray) – boolean mask
Raises:: ValueError – if mask is not boolean or integer
Return type:: None

property predictions: ndarray | None

Returns the predictions if they exist.

Returns:: numpy array containing prediction values or None.

reduce_to(cell_line_ids=None, drug_ids=None)

Removes all rows which contain a cell_line not in cell_line_ids or a drug not in drug_ids.

Parameters:

cell_line_ids (ndarray | None) – cell line IDs or None to keep all cell lines
drug_ids (ndarray | None) – drug IDs or None to keep all drugs

Return type:

None
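The filtering rule of reduce_to can be sketched on plain tuples. This is an illustration of the described semantics only; the function reduce_rows and the (cell line, drug, response) row layout are assumptions, and unlike reduce_to the sketch returns a new list instead of modifying a dataset in place.

```python
def reduce_rows(rows, cell_line_ids=None, drug_ids=None):
    """Keep rows whose cell line and drug are both allowed; None keeps all."""
    keep_cells = None if cell_line_ids is None else set(cell_line_ids)
    keep_drugs = None if drug_ids is None else set(drug_ids)
    return [
        (cell, drug, value)
        for cell, drug, value in rows
        if (keep_cells is None or cell in keep_cells)
        and (keep_drugs is None or drug in keep_drugs)
    ]


rows = [("A", "d1", 0.1), ("A", "d2", 0.2), ("B", "d1", 0.3)]
only_a = reduce_rows(rows, cell_line_ids=["A"])
a_and_d1 = reduce_rows(rows, cell_line_ids=["A"], drug_ids=["d1"])
```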

remove_nan_responses()

Removes rows with NaN values in the response.

Return type:: None

remove_rows(indices)

Removes rows from the dataset.

Parameters:: indices (ndarray) – indices of rows to remove
Raises:: ValueError – if indices are out of bounds or not 1-dimensional
Return type:: None

property response: ndarray

Returns the response values.

Returns:: numpy array containing response values.

save_splits(path)

Save cross validation splits to path/cv_split_0_train.csv and path/cv_split_0_test.csv.

Parameters:: path (str) – path to the directory where the cv split files are saved
Raises:: AssertionError – if DrugResponseDataset was not split

shuffle(random_state=42)

Shuffles the dataset.

Parameters:: random_state (int) – random state
Return type:: None

split_dataset(n_cv_splits, mode, split_validation=True, split_early_stopping=True, validation_ratio=0.1, random_state=42)

Splits the dataset into training, validation and test sets for cross-validation.

Parameters:

n_cv_splits (int) – number of cross-validation splits, e.g., 5
mode (str) – split mode (‘LPO’, ‘LCO’, ‘LTO’, ‘LDO’)
split_validation (bool) – if True, a validation set is generated
split_early_stopping (bool) – if True, an early stopping set is generated
validation_ratio (float) – ratio of validation set size to training set size
random_state (int) – random state

Return type:

list[dict]

Returns:

list of dictionaries containing the cross-validation datasets. Each fold is a dictionary with keys ‘train’, ‘validation’, ‘test’, ‘validation_es’, ‘early_stopping’.

Raises:

ValueError – if mode is not ‘LPO’, ‘LCO’, ‘LTO’, or ‘LDO’
ValueError – if mode is ‘LTO’ but tissue information is not provided
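The idea behind the group-based split modes can be sketched as follows, reading LCO as "leave cell lines out" (an interpretation of the acronym, which is not spelled out above). The function group_folds is hypothetical and only covers train and test sets; it is not how split_dataset is implemented.

```python
import random


def group_folds(rows, n_splits, group_index, seed=42):
    """Split rows into folds so that each group lands in exactly one test set."""
    groups = sorted({row[group_index] for row in rows})
    random.Random(seed).shuffle(groups)
    folds = []
    for k in range(n_splits):
        test_groups = set(groups[k::n_splits])  # every n-th group goes to fold k
        folds.append(
            {
                "train": [r for r in rows if r[group_index] not in test_groups],
                "test": [r for r in rows if r[group_index] in test_groups],
            }
        )
    return folds


# Rows are (cell line, drug, response); group_index=0 groups by cell line.
rows = [(c, d, 0.0) for c in "ABCD" for d in ("d1", "d2")]
folds = group_folds(rows, n_splits=2, group_index=0)
```

Grouping by index 1 instead would keep all rows of a drug together, which corresponds to a drug-wise split under the same reading.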

property tissue: ndarray | None

Returns the tissue types if they exist.

Returns:: numpy array containing tissue types or None.

to_csv(path)

Stores the drug response dataset on disk.

Parameters:: path (str | Path) – path to desired storage location

to_dataframe()

Convert the dataset into a pandas DataFrame.

Return type:: DataFrame
Returns:: pandas DataFrame of the dataset

transform(response_transformation)

Apply transformation to the response data and prediction data of the dataset.

Parameters:: response_transformation (TransformerMixin) – e.g., StandardScaler, MinMaxScaler, RobustScaler
Return type:: None

class drevalpy.datasets.dataset.FeatureDataset(features, meta_info=None)

Bases: object

Class for feature datasets.

This class represents datasets with one or more views of features associated with a set of entities, such as drugs or cell lines. The feature data is stored in a nested dictionary structure:

    {
        identifier_1: {
            view_name_1: feature_vector,
            view_name_2: feature_vector,
            …
        },
        identifier_2: {
            view_name_1: feature_vector,
            view_name_2: feature_vector,
            …
        },
        …
    }

Each outer key is a string identifier (e.g. a cell line ID or drug ID)
Each inner key is the name of a view (e.g. ‘gene_expression’, ‘fingerprints’)
Each inner value is a feature vector or object representing that view for the identifier
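A concrete instance of this nested structure might look like the following. The identifiers and the view name mutations are made up for illustration, and plain lists stand in for the numpy arrays that would normally hold the feature vectors.

```python
# Outer keys: identifiers; inner keys: view names; inner values: feature vectors.
features = {
    "cell_line_1": {"gene_expression": [0.1, 0.2], "mutations": [0, 1]},
    "cell_line_2": {"gene_expression": [0.3, 0.4], "mutations": [1, 1]},
}

identifiers = list(features)                      # outer keys
view_names = list(next(iter(features.values())))  # inner keys of the first entry
vector = features["cell_line_2"]["gene_expression"]
```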

Parameters:

features (dict[str, dict[str, Any]])
meta_info (dict[str, Any] | None)

add_features(other)

Adds feature views from another dataset. Inner join (only common identifiers are kept).

Parameters:: other (FeatureDataset) – other dataset
Raises:: AssertionError – if feature views overlap
Return type:: None
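The inner-join behaviour of add_features can be sketched on plain nested dictionaries. The function join_views is a hypothetical illustration that returns a new dictionary, whereas add_features modifies the dataset in place.

```python
def join_views(features, other):
    """Merge the views of two nested feature dicts, keeping common identifiers."""
    views = set(next(iter(features.values()), {}))
    other_views = set(next(iter(other.values()), {}))
    assert not (views & other_views), "feature views overlap"
    common = [identifier for identifier in features if identifier in other]
    return {i: {**features[i], **other[i]} for i in common}


cells = {"A": {"expr": [1.0]}, "B": {"expr": [2.0]}}
extra = {"B": {"mut": [0]}, "C": {"mut": [1]}}
joined = join_views(cells, extra)  # only "B" is present in both
```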

add_meta_info(other)

Adds meta information to the feature dataset.

Parameters:: other (FeatureDataset) – other dataset
Return type:: None

apply(function, view)

Applies a function to the features of a view.

Parameters:

function (Callable) – function to apply
view (str) – view to apply the function to

copy()

Returns a copy of the feature dataset.

Returns:: copy of the dataset

property features: dict[str, dict[str, Any]]

Returns the features.

Returns:: features of this FeatureDataset

fit_transform_features(train_ids, transformer, view)

Fits and applies a transformation. Fitting is done only on the train_ids.

Parameters:

train_ids (ndarray) – The IDs corresponding to the training dataset.
transformer (TransformerMixin) – sklearn transformer
view (str) – the view to transform

Returns:

The modified FeatureDataset with the features of the given view transformed.

Raises:

AssertionError – if view is not in the FeatureDataset
AssertionError – if train IDs are not unique

classmethod from_csv(path_to_csv, id_column, view_name, drop_columns=None, transpose=False, extract_meta_info=True)

Load a one-view feature dataset from a csv file.

The rows of the csv file represent the instances (cell lines or drugs), the columns represent the features. A column named id_column contains the identifiers of the instances. All unrelated columns (e.g. other id columns) should be passed as drop_columns; they are removed from the dataset.

Parameters:

path_to_csv (str | Path) – path to the csv file containing the data to be loaded
view_name (str) – name of the view (e.g. gene_expression)
id_column (str) – name of the column containing the identifiers
drop_columns (list[str] | None) – list of columns to drop (e.g. other identifier columns)
transpose (bool) – if True, the csv is transposed, i.e. the rows become columns and vice versa
extract_meta_info (bool) – if True, extracts meta information from the dataset, e.g. gene names for gene expression

Returns:

FeatureDataset object containing data from provided csv file.

get_feature_matrix(view, identifiers)

Returns the feature matrix for the given view.

The feature view must be a vector or matrix.

Parameters:

view (str) – view name
identifiers (ndarray) – list of identifiers (cell lines or drugs)

Return type:

ndarray

Returns:

feature matrix

Raises:

AssertionError – if no identifiers are given
AssertionError – if view is not in the FeatureDataset
AssertionError – if identifiers are not in the FeatureDataset
AssertionError – if feature vectors of view have different lengths
AssertionError – if view is not a numpy array, i.e. not a vector or matrix
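The checks listed above can be mirrored in a short sketch that stacks one row per requested identifier. The function feature_matrix is hypothetical and returns a list of lists rather than an ndarray; allowing repeated identifiers is an assumption of the sketch.

```python
def feature_matrix(features, view, identifiers):
    """Stack the vectors of one view, one row per requested identifier."""
    assert len(identifiers) > 0, "no identifiers given"
    missing = [i for i in identifiers if i not in features]
    assert not missing, f"identifiers not in dataset: {missing}"
    assert all(view in features[i] for i in identifiers), f"unknown view {view}"
    rows = [list(features[i][view]) for i in identifiers]
    assert len({len(row) for row in rows}) == 1, "feature vectors differ in length"
    return rows


features = {"A": {"expr": [1.0, 2.0]}, "B": {"expr": [3.0, 4.0]}}
# Rows follow the order of the identifiers; repeats are allowed in this sketch.
matrix = feature_matrix(features, "expr", ["B", "A", "B"])
```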

property identifiers: ndarray

Returns the identifiers of the features.

Used in the pipeline.

Returns:: feature identifiers of this FeatureDataset

property meta_info: dict[str, Any]

Returns the meta information.

Returns:: Meta information of this FeatureDataset

randomize_features(views_to_randomize, randomization_type)

Randomizes the feature vectors.

Permutation permutes the feature vectors. Invariant means that the randomization is done in a way that a key characteristic of the feature is preserved. In case of matrices, this is the mean and standard deviation of the feature view for this instance, for networks it is the degree distribution.

Parameters:

views_to_randomize (str | list[str]) – name of feature view or list of names of multiple feature views to randomize. The other views are not randomized.
randomization_type (str) – randomization type (‘permutation’, ‘invariant’).

Raises:

AssertionError – if randomization_type is not ‘permutation’ or ‘invariant’
ValueError – if no invariant randomization is available for the feature view type

Return type:

None

to_csv(path, id_column, view_name)

Save the feature dataset to a CSV file. If meta_info is available for the view and valid, it will be written as column names.

Parameters:

path (str | Path) – Path to the CSV file.
id_column (str) – Name of the column containing the identifiers.
view_name (str) – Name of the view.

transform_features(ids, transformer, view)

Applies a transformation like standard scaling to features.

Parameters:

ids (ndarray) – The IDs to transform
transformer (TransformerMixin) – fitted sklearn transformer
view (str) – the view to transform

Raises:

AssertionError – if view is not in the FeatureDataset
AssertionError – if a cell line is missing
AssertionError – if IDs are not unique
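The division of labour between fitting on training IDs and applying an already fitted transformer to other IDs can be sketched as follows. MeanCenterer and transform_view are hypothetical helpers written for this illustration; they imitate the fit/transform protocol with plain lists and are not drevalpy's code.

```python
class MeanCenterer:
    """Hypothetical transformer: subtracts per-feature means learned in fit."""

    def fit(self, matrix):
        n_rows = len(matrix)
        self.means_ = [sum(col) / n_rows for col in zip(*matrix)]
        return self

    def transform(self, matrix):
        return [[v - m for v, m in zip(row, self.means_)] for row in matrix]


def transform_view(features, ids, transformer, view):
    """Apply an already fitted transformer to one view of the given IDs."""
    assert len(set(ids)) == len(ids), "IDs are not unique"
    for identifier in ids:
        assert identifier in features, f"{identifier} is missing"
        vector = features[identifier][view]
        features[identifier][view] = transformer.transform([vector])[0]


features = {
    "train_1": {"expr": [1.0, 10.0]},
    "train_2": {"expr": [3.0, 30.0]},
    "test_1": {"expr": [2.0, 40.0]},
}
train_ids = ["train_1", "train_2"]
# Fit on the training IDs only, so no test information leaks into the means.
centerer = MeanCenterer().fit([features[i]["expr"] for i in train_ids])
transform_view(features, train_ids, centerer, "expr")
transform_view(features, ["test_1"], centerer, "expr")
```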

property view_names: list[str]

Returns the view_names.

Returns:: view_names of this FeatureDataset

drevalpy.datasets.dataset.split_early_stopping_data(validation_dataset, test_mode)

Splits the validation dataset into a validation and an early stopping dataset.

Parameters:

validation_dataset (DrugResponseDataset) – input validation dataset
test_mode (str) – LPO, LCO, LTO, LDO

Raises:

ValueError – if test_mode is not one of the expected values

Return type:

tuple[DrugResponseDataset, DrugResponseDataset]

Returns:

the resulting validation and early stopping datasets

Loaders

Contains functions to load the provided datasets (GDSC1, GDSC2, CCLE, CTRPv1, CTRPv2, BeatAML2, PDX_Bruna, and the Toy datasets) as well as custom datasets.

drevalpy.datasets.loader.check_measure(measure_queried, measures_data, dataset_name)

Check if the queried measure is in the dataset.

Parameters:

measure_queried (str) – The measure to check.
measures_data (list[str]) – The measures in the dataset.
dataset_name (str) – The name of the dataset.

Raises:

ValueError – If the measure is not found in the dataset.

Return type:

None
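A check of this kind amounts to a membership test. The sketch below is a hypothetical re-implementation for illustration; the wording of the error message is an assumption.

```python
def check_measure_sketch(measure_queried, measures_data, dataset_name):
    """Raise ValueError if the queried measure is not among the available ones."""
    if measure_queried not in measures_data:
        raise ValueError(
            f"Measure '{measure_queried}' not found in dataset {dataset_name}. "
            f"Available measures: {', '.join(measures_data)}"
        )


available = ["LN_IC50_curvecurator", "response"]
check_measure_sketch("response", available, "TOYv1")  # passes silently
```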

drevalpy.datasets.loader.load_beataml2(path_data='data', measure='LN_IC50_curvecurator')

Loads the BeatAML2 dataset.

Parameters:

path_data (str) – Path to the dataset.
measure (str) – The name of the column containing the measure to predict, default = “LN_IC50_curvecurator”

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs.

drevalpy.datasets.loader.load_ccle(path_data='data', measure='LN_IC50_curvecurator')

Loads the CCLE dataset.

Parameters:

path_data (str) – Path to the dataset.
measure (str) – The name of the column containing the measure to predict, default = “LN_IC50_curvecurator”

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs.

drevalpy.datasets.loader.load_ctrpv1(path_data='data', measure='LN_IC50_curvecurator')

Load CTRPv1 dataset.

Parameters:

path_data (str) – Path to the location of CTRPv1 dataset
measure (str) – The name of the column containing the measure to predict, default = “LN_IC50_curvecurator”

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs

drevalpy.datasets.loader.load_ctrpv2(path_data='data', measure='LN_IC50_curvecurator')

Load CTRPv2 dataset.

Parameters:

path_data (str) – Path to the location of CTRPv2 dataset
measure (str) – The name of the column containing the measure to predict, default = “LN_IC50_curvecurator”

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs

drevalpy.datasets.loader.load_custom(path_data, dataset_name='custom', measure='response', tissue_column=None)

Load custom dataset.

Parameters:

path_data (str | Path) – Path to location of custom dataset
dataset_name (str) – Name of the dataset.
measure (str) – The name of the column containing the measure to predict, default = “response”
tissue_column (str | None) – The name of the column containing the tissue type. If None, no tissue information is loaded.

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs

drevalpy.datasets.loader.load_dataset(dataset_name, path_data='data', measure='response', curve_curator=False, cores=1, tissue_column=None, normalize=False)

Load a dataset based on the dataset name.

Parameters:

dataset_name (str) – The name of the dataset to load. Can be one of (‘GDSC1’, ‘GDSC2’, ‘CCLE’, ‘TOYv1’, or ‘TOYv2’) to download provided datasets, or any other name to allow for custom datasets.
path_data (str) – The parent path in which custom or downloaded datasets should be located, or in which raw viability data is to be found for fitting with CurveCurator (see param curve_curator for details). The location of a dataset is resolved as <path_data>/<dataset_name>/<dataset_name>.csv.
measure (str) – The name of the column containing the measure to predict, default = “response”. If curve_curator is True, this measure is appended with “_curvecurator”, e.g. “response_curvecurator” to distinguish between measures provided by the original source of a dataset, or the measures fit by CurveCurator.
curve_curator (bool) – If True, the measure is appended with “_curvecurator”. If a custom dataset_name was provided, this will invoke the fitting procedure of raw viability data, which is expected to exist at <path_data>/<dataset_name>/<dataset_name>_raw.csv. The fitted dataset will be stored in the same folder, in a file called <dataset_name>.csv
cores (int) – Number of cores to use for CurveCurator fitting. Only used when curve_curator is True, default = 1
tissue_column (str | None) – The name of the column containing the tissue type. If None, no tissue information is loaded. This is only used when loading a custom dataset. Default = None.
normalize (bool) – Whether to normalize the response values to [0, 1] for curvecurator. Default = False. Only used for custom datasets when curve_curator is True.

Return type:

DrugResponseDataset

Returns:

A DrugResponseDataset containing response, cell line IDs, drug IDs, and dataset name.

Raises:

FileNotFoundError – If the custom dataset or raw viability data could not be found at the given path.
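The path and measure conventions described for load_dataset can be sketched with pathlib. The helper resolve_dataset_files and the dataset name MyScreen are hypothetical; the sketch only reproduces the naming rules stated above and does not load anything.

```python
from pathlib import Path


def resolve_dataset_files(dataset_name, path_data="data", measure="response",
                          curve_curator=False):
    """Return the fitted CSV path, the raw CSV path, and the effective measure."""
    folder = Path(path_data) / dataset_name
    fitted = folder / f"{dataset_name}.csv"
    raw = folder / f"{dataset_name}_raw.csv"
    if curve_curator:
        # Distinguishes CurveCurator-fitted measures from source-provided ones.
        measure = f"{measure}_curvecurator"
    return fitted, raw, measure


fitted, raw, measure = resolve_dataset_files("MyScreen", curve_curator=True)
```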

drevalpy.datasets.loader.load_gdsc1(path_data='data', measure='LN_IC50_curvecurator')

Loads the GDSC1 dataset.

Parameters:

path_data (str) – Path to the dataset.
measure (str) – The name of the column containing the measure to predict, default = “LN_IC50_curvecurator”

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs.

drevalpy.datasets.loader.load_gdsc2(path_data='data', measure='LN_IC50_curvecurator')

Loads the GDSC2 dataset.

Parameters:

path_data (str) – Path to the dataset.
measure (str) – The name of the column containing the measure to predict, default = “LN_IC50_curvecurator”

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs.

drevalpy.datasets.loader.load_pdx_bruna(path_data='data', measure='LN_IC50_curvecurator')

Loads the PDX_Bruna dataset.

Parameters:

path_data (str) – Path to the dataset.
measure (str) – The name of the column containing the measure to predict, default = “LN_IC50_curvecurator”

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs.

drevalpy.datasets.loader.load_toyv1(path_data='data', measure='LN_IC50_curvecurator')

Loads small Toy dataset, subsampled from CTRPv2.

Parameters:

path_data (str) – Path to the dataset.
measure (str) – The name of the column containing the measure to predict, default = “LN_IC50_curvecurator”

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs.

drevalpy.datasets.loader.load_toyv2(path_data='data', measure='LN_IC50_curvecurator')

Loads small Toy dataset, subsampled from GDSC2. Can be used to test cross study prediction.

Parameters:

path_data (str) – Path to the dataset.
measure (str) – The name of the column containing the measure to predict, default = “LN_IC50_curvecurator”

Return type:

DrugResponseDataset

Returns:

DrugResponseDataset containing response, cell line IDs, and drug IDs.

CurveCurator

Contains all functions required for CurveCurator fitting.

CurveCurator publication: Bayer, F.P., Gander, M., Kuster, B. et al. CurveCurator: a recalibrated F-statistic to assess, classify, and explore significance of dose–response curves. Nat Commun 14, 7902 (2023). https://doi.org/10.1038/s41467-023-43696-z

CurveCurator applies a recalibrated F-statistic for p-value estimation of 4-parameter log-logistic regression fits. In drevalpy, this can be used to generate higher-quality training data, since quality measures such as the p-value, R², or relevance score can be used to filter out low-quality viability measurements.

drevalpy.datasets.curvecurator.fit_curves(input_file, output_dir, dataset_name, cores, normalize=False)

Fit curves for provided raw viability data.

This function reads viability data in a predefined input format, preprocesses the data to be readable by CurveCurator, fits curves to the data using CurveCurator, and postprocesses the fitted data to a format required by drevalpy.

Parameters:

input_file (str) – Path to the file containing the raw viability data
output_dir (str) – Path to store all the files to, including the preprocessed data, the config.toml for CurveCurator, CurveCurator’s output files, and the postprocessed data
dataset_name (str) – The name of the dataset; used to name the postprocessed <dataset_name>.csv file
cores (int) – The number of cores to be used for fitting the curves using CurveCurator. The value written into the config.toml is capped at the number of curves to fit, i.e. min(n_curves, cores)
normalize (bool) – Whether to normalize the response values to [0, 1] for curvecurator. Default = False.

drevalpy.datasets.curvecurator.postprocess(output_folder, dataset_name)

Postprocess CurveCurator output files.

This function reads all curves.txt files created by CurveCurator, which contain the fitted curve parameters, postprocesses them, and combines everything into one <dataset_name>.csv file for use by drevalpy.

Parameters:

output_folder (str) – Path to the output folder of CurveCurator containing the curves.txt file.
dataset_name (str) – The name of the dataset; used to name the postprocessed <dataset_name>.csv file

drevalpy.datasets.curvecurator.preprocess(input_file, output_dir, dataset_name, cores, normalize=False)

Preprocess raw viability data and create required input files for CurveCurator.

This function takes an input file containing raw viability data in long format. The required columns are “dose”, “response”, “sample”, and “drug”, with an optional “replicate” column. If there are multiple dose ranges or numbers of replicates, groups of the form (maxdose, mindose, n_replicates) are created to keep the number of parameters for fitting low and the input dataframes for CurveCurator as dense as possible. All dosages must be provided in µM. All responses must already be normalized against the control, and the control responses themselves must not be included.

Parameters:

input_file (str) – Path to csv file containing the raw viability data
output_dir (str) – Path to store all the files to, including the preprocessed data, the config.toml for CurveCurator, CurveCurator’s output files, and the postprocessed data
dataset_name (str) – Name of the dataset
cores (int) – The number of cores to be used for fitting the curves using CurveCurator. The value written into the config.toml is capped at the number of curves to fit, i.e. min(n_curves, cores)
normalize (bool) – Whether to normalize the response values to [0, 1] for curvecurator. Default = False.

Raises:

ValueError – If required columns are not found in the provided input file.
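The grouping into (maxdose, mindose, n_replicates) signatures can be sketched on a list of long-format records. The function group_curves is a hypothetical illustration; in particular, treating a missing replicate column as a single replicate is an assumption of the sketch.

```python
from collections import defaultdict


def group_curves(records):
    """Group (sample, drug) curves by (max dose, min dose, number of replicates)."""
    doses = defaultdict(set)
    replicates = defaultdict(set)
    for rec in records:
        key = (rec["sample"], rec["drug"])
        doses[key].add(rec["dose"])
        replicates[key].add(rec.get("replicate", 1))  # assume one replicate
    groups = defaultdict(list)
    for key in doses:
        signature = (max(doses[key]), min(doses[key]), len(replicates[key]))
        groups[signature].append(key)
    return dict(groups)


records = [
    {"sample": "A", "drug": "d1", "dose": 0.1, "response": 0.9},
    {"sample": "A", "drug": "d1", "dose": 10.0, "response": 0.2},
    {"sample": "B", "drug": "d1", "dose": 0.1, "response": 0.8},
    {"sample": "B", "drug": "d1", "dose": 10.0, "response": 0.3},
    {"sample": "A", "drug": "d2", "dose": 1.0, "response": 0.7},
    {"sample": "A", "drug": "d2", "dose": 100.0, "response": 0.1},
]
groups = group_curves(records)
```

Curves that share a signature can then be placed in one dense table, since they cover the same dose range with the same number of replicates.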

Utility functions

Utility functions for datasets.

drevalpy.datasets.utils.download_dataset(dataset_name, data_path='data', redownload=False)

Download the latest dataset from Zenodo.

Parameters:

dataset_name (str) – dataset name, from “GDSC1”, “GDSC2”, “CCLE”, “CTRPv1”, “CTRPv2”, “TOYv1”, “TOYv2”, “meta”
data_path (str) – where to save the data
redownload (bool) – whether to redownload the data

Raises:

HTTPError – if the download fails

drevalpy.datasets.utils.download_from_url(dataset_name, file_url)

Download a file from a given URL.

Parameters:

dataset_name (str) – name of the dataset
file_url (str) – exact URL to the zip file

Return type:

Response

Returns:

HTTP response containing response.content

Raises:

HTTPError – if the download fails

drevalpy.datasets.utils.permute_features(features, identifiers, views_to_permute, all_views)

Permute the specified views for each entity (= cell line or drug).

E.g. each cell line gets the feature vector/graph/image… of another cell line. Drawn without replacement.

Parameters:

features (dict[str, dict[str, Any]]) – dictionary of features
identifiers (ndarray) – array of identifiers
views_to_permute (list[str]) – list of views to permute
all_views (list[str]) – list of all views

Return type:

dict

Returns:

permuted features
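A permutation of this kind can be sketched with a shuffled list of donor identifiers. The function permute_views is hypothetical; whether drevalpy permutes each view independently, and whether an entity may keep its own features by chance, is not stated above, so both behaviours here are assumptions.

```python
import random


def permute_views(features, identifiers, views_to_permute, all_views, seed=0):
    """Reassign the selected views between identifiers; copy the other views."""
    rng = random.Random(seed)
    permuted = {identifier: {} for identifier in identifiers}
    for view in all_views:
        donors = list(identifiers)
        if view in views_to_permute:
            rng.shuffle(donors)  # a permutation, i.e. drawn without replacement
        for identifier, donor in zip(identifiers, donors):
            permuted[identifier][view] = features[donor][view]
    return permuted


features = {
    "A": {"expr": [1.0], "fp": [10]},
    "B": {"expr": [2.0], "fp": [20]},
    "C": {"expr": [3.0], "fp": [30]},
}
ids = ["A", "B", "C"]
permuted = permute_views(features, ids, ["expr"], ["expr", "fp"])
```

Because the donors are a permutation, every original vector is used exactly once, so the overall feature distribution is unchanged while the link between entity and features is broken.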

drevalpy.datasets.utils.randomize_graph(original_graph)

Randomizes the graph by shuffling the edges while preserving the degree sequence.

Parameters:: original_graph (Graph) – The original graph
Return type:: Graph
Returns:: Randomized graph with the same degree sequence and node attributes
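One common way to shuffle edges while keeping every node's degree is the double edge swap, sketched below on a plain edge list. This is not necessarily the method randomize_graph uses; the function names and the representation of the graph as a list of node pairs are assumptions.

```python
import random
from collections import Counter


def degrees(edges):
    """Count how many edges touch each node."""
    return Counter(node for edge in edges for node in edge)


def degree_preserving_shuffle(edges, n_swaps=100, seed=0):
    """Rewire an undirected simple graph with double edge swaps."""
    rng = random.Random(seed)
    edge_list = [tuple(edge) for edge in edges]
    present = {frozenset(edge) for edge in edge_list}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        # Proposed swap: a-b, c-d  ->  a-d, c-b
        if len({a, b, c, d}) < 4:
            continue  # the two edges share a node; skip to avoid self-loops
        new_1, new_2 = frozenset((a, d)), frozenset((c, b))
        if new_1 in present or new_2 in present:
            continue  # the swap would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new_1, new_2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (1, 3)]
shuffled = degree_preserving_shuffle(edges)
```

Each accepted swap removes one edge at each of the four nodes and adds one back, which is why the degree sequence is unchanged.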

drevalpy.datasets.utils.unzip_data(path_to_zip, response, data_path)

Unzips the downloaded data.

Parameters:

path_to_zip (Path) – Path to the zip file to be unzipped.
response (Response) – HTTP response containing response.content
data_path (str) – Where the unzipped directory should be stored
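Extracting an archive from downloaded bytes can be sketched with the standard library. The function unzip_bytes is a hypothetical simplification that takes the raw bytes (what response.content would hold) instead of a Response object, and the archive built here is a stand-in for a real download.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def unzip_bytes(content, data_path):
    """Extract a zip archive held in memory into data_path."""
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        archive.extractall(data_path)


# Build a small archive in memory in place of a downloaded response body.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("TOYv1/TOYv1.csv", "cell_line_name,pubchem_id,response\n")

target = tempfile.mkdtemp()
unzip_bytes(buffer.getvalue(), target)
extracted = Path(target) / "TOYv1" / "TOYv1.csv"
```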