Datasets
Dataset module
Defines the different dataset classes.
It provides DrugResponseDataset for response values and FeatureDataset for feature values. They both inherit from the abstract class Dataset. The DrugResponseDataset class stores drug response values per cell line and drug. The FeatureDataset class stores feature values per cell line or drug and can also store meta information for the feature views. A DrugResponseDataset can be split into training, validation and test sets for cross-validation. A FeatureDataset can be used to randomize feature vectors.
- class drevalpy.datasets.dataset.DrugResponseDataset(response, cell_line_ids, drug_ids, tissues=None, predictions=None, dataset_name='unnamed')
Bases: object
Drug response dataset.
- Parameters:
- add_rows(other)
Adds rows from another dataset.
- Parameters:
other (DrugResponseDataset) – other dataset
- Return type:
- property cell_line_ids: ndarray
Returns the cell_line_ids.
- Returns:
numpy array containing cell_line_ids values.
- copy()
Returns a copy of the drug response dataset.
- Returns:
copy of the dataset
- property cv_splits: list[dict[str, DrugResponseDataset]]
Returns the cv_splits.
- Returns:
DrugResponseDatasets containing the CV_splits.
- property dataset_name: str
Returns the name of this DrugResponseDataset.
Used in the pipeline.
- Returns:
dataset name.
- fit_transform(response_transformation)
Fit and transform the response data and prediction data of the dataset.
- Parameters:
response_transformation (TransformerMixin) – e.g., StandardScaler, MinMaxScaler, RobustScaler
- Return type:
- classmethod from_csv(input_file, dataset_name='unknown', measure='response', tissue_column='tissue')
Load a dataset from a csv file.
This function creates a DrugResponseDataset from a provided input file in csv format. The following columns are required:
response: the drug response values as floating point values
cell_line_name: a string identifier for cell lines
pubchem_id: a string identifier for drugs
predictions: an optional column containing drug response predictions
LN_IC50_curvecurator: the name of the column containing the measure to predict
- Parameters:
input_file (str | Path) – Path to the csv file containing the data to be loaded
dataset_name (str) – Optional name to associate the dataset with, default = “unknown”
measure (str) – The name of the column containing the measure to predict, default = “response”
tissue_column (str | None) – Optional name of the column containing tissue types
- Raises:
ValueError – If the required columns are not found in the input file
- Return type:
- Returns:
DrugResponseDataset object containing data from provided csv file.
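The column check described for from_csv can be pictured with a minimal, standard-library-only sketch. It is not drevalpy's implementation: the helper name `check_required_columns` and the sample rows are hypothetical, and only the column names `cell_line_name`, `pubchem_id`, and the `measure` column are taken from the description above.

```python
import csv
import io


def check_required_columns(csv_text, measure="response"):
    """Hypothetical helper: parse CSV text and verify the required columns exist."""
    reader = csv.DictReader(io.StringIO(csv_text))
    required = {"cell_line_name", "pubchem_id", measure}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    return list(reader)


# Made-up example data: one valid file, one without the drug identifier column.
good = "cell_line_name,pubchem_id,response\nCL1,D1,0.42\n"
rows = check_required_columns(good)

bad = "cell_line_name,response\nCL1,0.42\n"
try:
    check_required_columns(bad)
    error_message = None
except ValueError as err:
    error_message = str(err)
```

Note that the csv module returns every value as a string, so a real loader would still need to convert the measure column to floating point values.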
- inverse_transform(response_transformation)
Inverse transform the response data and prediction data of the dataset.
- Parameters:
response_transformation (TransformerMixin) – e.g., StandardScaler, MinMaxScaler, RobustScaler
- Return type:
- load_splits(path)
Load cross validation splits from path/cv_split_0_train.csv and path/cv_split_0_test.csv.
- Parameters:
path (str) – path to the directory containing the cv split files
- Raises:
AssertionError – if no cv split files are found in path
- Return type:
- mask(mask)
Removes rows from the dataset based on a boolean mask.
- Parameters:
mask (ndarray) – boolean mask
- Raises:
ValueError – if mask is not boolean or integer
- Return type:
- property predictions: ndarray | None
Returns the predictions if they exist.
- Returns:
numpy array containing prediction values or None.
- reduce_to(cell_line_ids=None, drug_ids=None)
Removes all rows which contain a cell_line not in cell_line_ids or a drug not in drug_ids.
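As a rough illustration of this filtering rule, the sketch below works on plain tuples instead of a DrugResponseDataset. The function name `reduce_rows` is hypothetical, and treating `None` as "do not filter on this column" is an assumption suggested by the default arguments, not something the description states.

```python
def reduce_rows(rows, cell_line_ids=None, drug_ids=None):
    """Keep only (cell_line, drug, response) rows whose IDs are in the allowed sets.

    Assumption: None means "do not filter on this column".
    """
    keep_cells = None if cell_line_ids is None else set(cell_line_ids)
    keep_drugs = None if drug_ids is None else set(drug_ids)
    return [
        (cell, drug, resp)
        for cell, drug, resp in rows
        if (keep_cells is None or cell in keep_cells)
        and (keep_drugs is None or drug in keep_drugs)
    ]


# Made-up rows for illustration.
rows = [("CL1", "D1", 0.1), ("CL2", "D1", 0.2), ("CL1", "D2", 0.3), ("CL3", "D3", 0.4)]
reduced = reduce_rows(rows, cell_line_ids=["CL1", "CL2"], drug_ids=["D1"])
```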
- remove_rows(indices)
Removes rows from the dataset.
- Parameters:
indices (ndarray) – indices of rows to remove
- Raises:
ValueError – if indices are out of bounds or not 1-dimensional
- Return type:
- property response: ndarray
Returns the response values.
- Returns:
numpy array containing response values.
- save_splits(path)
Save cross validation splits to path/cv_split_0_train.csv and path/cv_split_0_test.csv.
- Parameters:
path (str) – path to the directory where the cv split files are saved
- Raises:
AssertionError – if DrugResponseDataset was not split
- shuffle(random_state=42)
Shuffles the dataset.
- split_dataset(n_cv_splits, mode, split_validation=True, split_early_stopping=True, validation_ratio=0.1, random_state=42)
Splits the dataset into training, validation and test sets for cross-validation.
- Parameters:
n_cv_splits (int) – number of cross-validation splits, e.g., 5
mode (str) – split mode (‘LPO’, ‘LCO’, ‘LTO’, ‘LDO’)
split_validation (bool) – if True, a validation set is generated
split_early_stopping (bool) – if True, an early stopping set is generated
validation_ratio (float) – ratio of validation set size to training set size
random_state (int) – random state
- Return type:
- Returns:
list of dictionaries containing the cross-validation datasets. Each fold is a dictionary with keys ‘train’, ‘validation’, ‘test’, ‘validation_es’, ‘early_stopping’.
- Raises:
ValueError – if mode is not ‘LPO’, ‘LCO’, ‘LTO’, or ‘LDO’
ValueError – if LTO cross-validation is requested but tissue information is not provided
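To make the group-wise split modes more concrete, here is a minimal sketch of a "leave groups out" split using only the standard library. It assumes that a mode such as LCO means no cell line may appear in both the training and the test part of a fold; that reading of the abbreviation, the function name `leave_group_out_folds`, and the round-robin assignment are all assumptions, not drevalpy's actual procedure.

```python
import random


def leave_group_out_folds(groups, n_splits, random_state=42):
    """Return (train_idx, test_idx) pairs so that no group is shared within a fold."""
    unique = sorted(set(groups))
    random.Random(random_state).shuffle(unique)
    # Deal the shuffled groups round-robin into n_splits test buckets.
    buckets = [set(unique[i::n_splits]) for i in range(n_splits)]
    folds = []
    for test_groups in buckets:
        test_idx = [i for i, g in enumerate(groups) if g in test_groups]
        train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
        folds.append((train_idx, test_idx))
    return folds


# One entry per response row: the cell line that row belongs to (made-up IDs).
cell_lines = ["CL1", "CL1", "CL2", "CL2", "CL3", "CL3", "CL4", "CL4"]
folds = leave_group_out_folds(cell_lines, n_splits=2)
```

Passing drug IDs or tissue labels as `groups` would give the analogous drug-wise or tissue-wise split under the same assumption.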
- property tissue: ndarray | None
Returns the tissue types if they exist.
- Returns:
numpy array containing tissue types or None.
- to_csv(path)
Stores the drug response dataset on disk.
- to_dataframe()
Convert the dataset into a pandas DataFrame.
- Return type:
DataFrame
- Returns:
pandas DataFrame of the dataset
- transform(response_transformation)
Apply transformation to the response data and prediction data of the dataset.
- Parameters:
response_transformation (TransformerMixin) – e.g., StandardScaler, MinMaxScaler, RobustScaler
- Return type:
- class drevalpy.datasets.dataset.FeatureDataset(features, meta_info=None)
Bases: object
Class for feature datasets.
This class represents datasets with one or more views of features associated with a set of entities, such as drugs or cell lines. The feature data is stored in a nested dictionary structure:
{
    identifier_1: {
        view_name_1: feature_vector,
        view_name_2: feature_vector,
        …
    },
    identifier_2: {
        view_name_1: feature_vector,
        view_name_2: feature_vector,
        …
    },
}
Each outer key is a string identifier (e.g. a cell line ID or drug ID)
Each inner key is the name of a view (e.g. ‘gene_expression’, ‘fingerprints’)
Each inner value is a feature vector or object representing that view for the identifier
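A small, concrete instance of this nested structure, written as a plain Python dictionary, may help. The identifiers, view names, and numbers are made up for illustration, and plain lists stand in for whatever vector type the real dataset holds.

```python
# Identifiers, view names, and values below are made up for illustration.
features = {
    "cell_line_1": {
        "gene_expression": [0.5, 1.2, 3.3],
        "methylation": [0.1, 0.9],
    },
    "cell_line_2": {
        "gene_expression": [0.7, 1.0, 2.8],
        "methylation": [0.3, 0.4],
    },
}

identifiers = list(features)                          # outer keys
view_names = list(features["cell_line_1"])            # inner keys
vector = features["cell_line_2"]["gene_expression"]   # inner value
```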
- add_features(other)
Adds features views from another dataset. Inner join (only common identifiers are kept).
- Parameters:
other (FeatureDataset) – other dataset
- Raises:
AssertionError – if feature views overlap
- Return type:
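The inner-join behaviour of add_features can be sketched on plain nested dictionaries. This is an illustration under stated assumptions rather than the library's code: `inner_join_views` is a hypothetical name, and the data are made up.

```python
def inner_join_views(features, other):
    """Merge two nested feature dicts, keeping only identifiers present in both."""
    views_a = {v for views in features.values() for v in views}
    views_b = {v for views in other.values() for v in views}
    assert not views_a & views_b, "feature views overlap"
    common = [ident for ident in features if ident in other]
    return {ident: {**features[ident], **other[ident]} for ident in common}


expr = {"CL1": {"gene_expression": [1.0]}, "CL2": {"gene_expression": [2.0]}}
meth = {"CL2": {"methylation": [0.3]}, "CL3": {"methylation": [0.7]}}
joined = inner_join_views(expr, meth)
```

Only CL2 occurs in both inputs, so it is the only identifier left after the join, now carrying both views.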
- add_meta_info(other)
Adds meta information to the feature dataset.
- Parameters:
other (FeatureDataset) – other dataset
- Return type:
- apply(function, view)
Applies a function to the features of a view.
- copy()
Returns a copy of the feature dataset.
- Returns:
copy of the dataset
- property features: dict[str, dict[str, Any]]
Returns the features.
- Returns:
features of this FeatureDataset
- fit_transform_features(train_ids, transformer, view)
Fits and applies a transformation. Fitting is done only on the train_ids.
- Parameters:
train_ids (ndarray) – The IDs corresponding to the training dataset.
transformer (TransformerMixin) – sklearn transformer
view (str) – the view to transform
- Returns:
The modified FeatureDataset with transformed features for the given view.
- Raises:
AssertionError – if view is not in the FeatureDataset
AssertionError – if train IDs are not unique
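The point of fitting only on the training identifiers is that statistics of held-out entities never leak into the transformation. The sketch below shows that idea with a hand-written standard scaler on nested dictionaries. The helper names `fit_scaler` and `apply_scaler` are hypothetical and do not mirror the sklearn transformer interface the method expects.

```python
from statistics import mean, pstdev


def fit_scaler(features, train_ids, view):
    """Estimate per-feature mean and std from the training identifiers only."""
    assert len(set(train_ids)) == len(train_ids), "train IDs are not unique"
    columns = list(zip(*(features[i][view] for i in train_ids)))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero
    return means, stds


def apply_scaler(features, ids, view, params):
    """Scale the given identifiers in place with previously fitted parameters."""
    means, stds = params
    for i in ids:
        features[i][view] = [(x - m) / s for x, m, s in zip(features[i][view], means, stds)]


features = {
    "CL1": {"expr": [1.0, 10.0]},
    "CL2": {"expr": [3.0, 10.0]},
    "CL3": {"expr": [5.0, 12.0]},  # held out: scaled, but never used for fitting
}
params = fit_scaler(features, ["CL1", "CL2"], "expr")
apply_scaler(features, ["CL1", "CL2", "CL3"], "expr", params)
```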
- classmethod from_csv(path_to_csv, id_column, view_name, drop_columns=None, transpose=False, extract_meta_info=True)
Load a one-view feature dataset from a csv file.
The rows of the csv file represent the instances (cell lines or drugs), the columns represent the features. A column named id_column contains the identifiers of the instances. All unrelated columns (e.g. other id columns) should be listed in drop_columns; they are removed from the dataset.
- Parameters:
path_to_csv (str | Path) – path to the csv file containing the data to be loaded
view_name (str) – name of the view (e.g. gene_expression)
id_column (str) – name of the column containing the identifiers
drop_columns (list[str] | None) – list of columns to drop (e.g. other identifier columns)
transpose (bool) – if True, the csv is transposed, i.e. the rows become columns and vice versa
extract_meta_info (bool) – if True, extracts meta information from the dataset, e.g. gene names for gene expression
- Returns:
FeatureDataset object containing data from provided csv file.
- get_feature_matrix(view, identifiers)
Returns the feature matrix for the given view.
The feature view must be a vector or matrix.
- Parameters:
- Return type:
- Returns:
feature matrix
- Raises:
AssertionError – if no identifiers are given
AssertionError – if view is not in the FeatureDataset
AssertionError – if identifiers are not in the FeatureDataset
AssertionError – if feature vectors of view have different lengths
AssertionError – if view is not a numpy array, i.e. not a vector or matrix
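The checks listed above can be read as a short recipe for assembling a matrix from per-identifier vectors. The following sketch follows that recipe with plain lists instead of numpy arrays. `feature_matrix` is a hypothetical helper and not the library method.

```python
def feature_matrix(features, view, identifiers):
    """Stack the vectors of one view into a row-per-identifier matrix (list of lists)."""
    assert len(identifiers) > 0, "no identifiers given"
    assert all(ident in features for ident in identifiers), "unknown identifier"
    assert all(view in features[ident] for ident in identifiers), "unknown view"
    rows = [list(features[ident][view]) for ident in identifiers]
    assert len({len(row) for row in rows}) == 1, "feature vectors differ in length"
    return rows


# Made-up data for illustration.
features = {
    "CL1": {"expr": [1.0, 2.0]},
    "CL2": {"expr": [3.0, 4.0]},
}
# Rows follow the order of the requested identifiers, and repeats are allowed here.
matrix = feature_matrix(features, "expr", ["CL2", "CL1", "CL2"])
```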
- property identifiers: ndarray
Returns the identifiers of the features.
Used in the pipeline.
- Returns:
feature identifiers of this FeatureDataset
- property meta_info: dict[str, Any]
Returns the meta information.
- Returns:
Meta information of this FeatureDataset
- randomize_features(views_to_randomize, randomization_type)
Randomizes the feature vectors.
Permutation permutes the feature vectors. Invariant means that the randomization is done in a way that a key characteristic of the feature is preserved. In case of matrices, this is the mean and standard deviation of the feature view for this instance, for networks it is the degree distribution.
- Parameters:
- Raises:
AssertionError – if randomization_type is not ‘permutation’ or ‘invariant’
ValueError – if no invariant randomization is available for the feature view type
- Return type:
- to_csv(path, id_column, view_name)
Save the feature dataset to a CSV file. If meta_info is available for the view and valid, it will be written as column names.
- transform_features(ids, transformer, view)
Applies a transformation like standard scaling to features.
- Parameters:
ids (ndarray) – The IDs to transform
transformer (TransformerMixin) – fitted sklearn transformer
view (str) – the view to transform
- Raises:
AssertionError – if view is not in the FeatureDataset
AssertionError – if a cell line is missing
AssertionError – if IDs are not unique
- drevalpy.datasets.dataset.split_early_stopping_data(validation_dataset, test_mode)
Splits the validation dataset into a validation and an early stopping dataset.
- Parameters:
validation_dataset (DrugResponseDataset) – input validation dataset
test_mode (str) – LPO, LCO, LTO, LDO
- Raises:
ValueError – if test_mode is not one of the expected values
- Return type:
- Returns:
the resulting validation and early stopping datasets
Loaders
Contains functions to load the provided datasets (GDSC1, GDSC2, CCLE, CTRPv1, CTRPv2, BeatAML2, PDX_Bruna, and the Toy datasets) as well as custom datasets.
- drevalpy.datasets.loader.check_measure(measure_queried, measures_data, dataset_name)
Check if the queried measure is in the dataset.
- drevalpy.datasets.loader.load_beataml2(path_data='data', measure='LN_IC50_curvecurator')
Loads the BeatAML2 dataset.
- Parameters:
- Return type:
- Returns:
DrugResponseDataset containing response, cell line IDs, and drug IDs.
- drevalpy.datasets.loader.load_ccle(path_data='data', measure='LN_IC50_curvecurator')
Loads the CCLE dataset.
- Parameters:
- Return type:
- Returns:
DrugResponseDataset containing response, cell line IDs, and drug IDs.
- drevalpy.datasets.loader.load_ctrpv1(path_data='data', measure='LN_IC50_curvecurator')
Load CTRPv1 dataset.
- Parameters:
- Return type:
- Returns:
DrugResponseDataset containing response, cell line IDs, and drug IDs
- drevalpy.datasets.loader.load_ctrpv2(path_data='data', measure='LN_IC50_curvecurator')
Load CTRPv2 dataset.
- Parameters:
- Return type:
- Returns:
DrugResponseDataset containing response, cell line IDs, and drug IDs
- drevalpy.datasets.loader.load_custom(path_data, dataset_name='custom', measure='response', tissue_column=None)
Load custom dataset.
- Parameters:
- Return type:
- Returns:
DrugResponseDataset containing response, cell line IDs, and drug IDs
- drevalpy.datasets.loader.load_dataset(dataset_name, path_data='data', measure='response', curve_curator=False, cores=1, tissue_column=None, normalize=False)
Load a dataset based on the dataset name.
- Parameters:
dataset_name (str) – The name of the dataset to load. Can be one of (‘GDSC1’, ‘GDSC2’, ‘CCLE’, ‘TOYv1’, or ‘TOYv2’) to download provided datasets, or any other name to allow for custom datasets.
path_data (str) – The parent path in which custom or downloaded datasets should be located, or in which raw viability data is to be found for fitting with CurveCurator (see param curve_curator for details). The location of a dataset is resolved as <path_data>/<dataset_name>/<dataset_name>.csv.
measure (str) – The name of the column containing the measure to predict, default = “response”. If curve_curator is True, “_curvecurator” is appended to this measure, e.g. “response_curvecurator”, to distinguish between measures provided by the original source of a dataset and measures fit by CurveCurator.
curve_curator (bool) – If True, “_curvecurator” is appended to the measure. If a custom dataset_name was provided, this will invoke the fitting procedure of raw viability data, which is expected to exist at <path_data>/<dataset_name>/<dataset_name>_raw.csv. The fitted dataset will be stored in the same folder, in a file called <dataset_name>.csv.
cores (int) – Number of cores to use for CurveCurator fitting. Only used when curve_curator is True, default = 1.
tissue_column (str | None) – The name of the column containing the tissue type. If None, no tissue information is loaded. This is only used when loading a custom dataset. Default = None.
normalize (bool) – Whether to normalize the response values to [0, 1] for CurveCurator. Default = False. Only used for custom datasets when curve_curator is True.
- Return type:
- Returns:
A DrugResponseDataset containing response, cell line IDs, drug IDs, and dataset name.
- Raises:
FileNotFoundError – If the custom dataset or raw viability data could not be found at the given path.
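The path and measure-name conventions described for load_dataset can be summarized in a few lines of pathlib code. This is only a sketch of the documented layout: `resolve_dataset_files` is a hypothetical helper, "my_screen" is a made-up dataset name, and nothing here touches the file system or calls drevalpy.

```python
from pathlib import Path


def resolve_dataset_files(dataset_name, path_data="data", measure="response", curve_curator=False):
    """Return (fitted_csv, raw_csv, measure_column) following the documented layout."""
    base = Path(path_data) / dataset_name
    fitted_csv = base / f"{dataset_name}.csv"      # <path_data>/<dataset_name>/<dataset_name>.csv
    raw_csv = base / f"{dataset_name}_raw.csv"     # raw viability data for CurveCurator
    measure_column = f"{measure}_curvecurator" if curve_curator else measure
    return fitted_csv, raw_csv, measure_column


fitted, raw, column = resolve_dataset_files("my_screen", curve_curator=True)
```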
- drevalpy.datasets.loader.load_gdsc1(path_data='data', measure='LN_IC50_curvecurator')
Loads the GDSC1 dataset.
- Parameters:
- Return type:
- Returns:
DrugResponseDataset containing response, cell line IDs, and drug IDs.
- drevalpy.datasets.loader.load_gdsc2(path_data='data', measure='LN_IC50_curvecurator')
Loads the GDSC2 dataset.
- drevalpy.datasets.loader.load_pdx_bruna(path_data='data', measure='LN_IC50_curvecurator')
Loads the PDX_Bruna dataset.
- Parameters:
- Return type:
- Returns:
DrugResponseDataset containing response, cell line IDs, and drug IDs.
- drevalpy.datasets.loader.load_toyv1(path_data='data', measure='LN_IC50_curvecurator')
Loads small Toy dataset, subsampled from CTRPv2.
- Parameters:
- Return type:
- Returns:
DrugResponseDataset containing response, cell line IDs, and drug IDs.
- drevalpy.datasets.loader.load_toyv2(path_data='data', measure='LN_IC50_curvecurator')
Loads small Toy dataset, subsampled from GDSC2. Can be used to test cross study prediction.
- Parameters:
- Return type:
- Returns:
DrugResponseDataset containing response, cell line IDs, and drug IDs.
CurveCurator
Contains all functions required for CurveCurator fitting.
CurveCurator publication: Bayer, F.P., Gander, M., Kuster, B. et al. CurveCurator: a recalibrated F-statistic to assess, classify, and explore significance of dose–response curves. Nat Commun 14, 7902 (2023). https://doi.org/10.1038/s41467-023-43696-z
CurveCurator applies a recalibrated F-statistic for p-value estimation of 4-parameter log-logistic regression fits. In drevalpy, this can be used to generate higher-quality training data, since quality measures such as p-value, R2, or relevance score can be used to filter out low-quality viability measurements.
- drevalpy.datasets.curvecurator.fit_curves(input_file, output_dir, dataset_name, cores, normalize=False)
Fit curves for provided raw viability data.
This function reads viability data in a predefined input format, preprocesses the data to be readable by CurveCurator, fits curves to the data using CurveCurator, and postprocesses the fitted data to a format required by drevalpy.
- Parameters:
input_file (str) – Path to the file containing the raw viability data
output_dir (str) – Path to store all the files to, including the preprocessed data, the config.toml for CurveCurator, CurveCurator’s output files, and the postprocessed data
dataset_name (str) – The name of the dataset, used to name the postprocessed <dataset_name>.csv file
cores (int) – The number of cores to be used for fitting the curves using CurveCurator. The value written into the config.toml is the minimum of the number of curves to fit and the number given, i.e. min(n_curves, cores)
normalize (bool) – Whether to normalize the response values to [0, 1] for CurveCurator. Default = False.
- drevalpy.datasets.curvecurator.postprocess(output_folder, dataset_name)
Postprocess CurveCurator output files.
This function reads all curves.tsv files created by CurveCurator, which contain the fitted curve parameters, postprocesses them to be used by drevalpy and combines everything in one <dataset_name>.csv file for usage by drevalpy.
- drevalpy.datasets.curvecurator.preprocess(input_file, output_dir, dataset_name, cores, normalize=False)
Preprocess raw viability data and create required input files for CurveCurator.
This function takes an input file containing raw viability data in long format. The required columns are “dose”, “response”, “sample”, and “drug”, with an optional “replicate” column. If there are multiple dose ranges or numbers of replicates, groups of the form (maxdose, mindose, n_replicates) are created to keep the number of parameters for fitting low and the input dataframes for CurveCurator as dense as possible. All dosages must be provided in µM! All responses must already be normalized against the control, and the control responses themselves must not be included.
- Parameters:
input_file (str) – Path to csv file containing the raw viability data
output_dir (str) – Path to store all the files to, including the preprocessed data, the config.toml for CurveCurator, CurveCurator’s output files, and the postprocessed data
dataset_name (str) – Name of the dataset
cores (int) – The number of cores to be used for fitting the curves using CurveCurator. The value written into the config.toml is the minimum of the number of curves to fit and the number given, i.e. min(n_curves, cores)
normalize (bool) – Whether to normalize the response values to [0, 1] for CurveCurator. Default = False.
- Raises:
ValueError – If required columns are not found in the provided input file.
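The grouping by (maxdose, mindose, n_replicates) can be illustrated with a small standard-library sketch on a list of records. It is an interpretation of the description, not the actual preprocessing code: `group_curves` is a hypothetical helper, the records are made up, and counting a missing “replicate” column as a single replicate is an assumption.

```python
from collections import defaultdict

REQUIRED = {"dose", "response", "sample", "drug"}


def group_curves(records):
    """Group (sample, drug) curves by (maxdose, mindose, n_replicates)."""
    for rec in records:
        missing = REQUIRED - rec.keys()
        if missing:
            raise ValueError(f"Missing required columns: {sorted(missing)}")
    doses = defaultdict(set)
    replicates = defaultdict(set)
    for rec in records:
        key = (rec["sample"], rec["drug"])
        doses[key].add(rec["dose"])
        replicates[key].add(rec.get("replicate", 1))  # "replicate" is optional
    groups = defaultdict(list)
    for key in doses:
        signature = (max(doses[key]), min(doses[key]), len(replicates[key]))
        groups[signature].append(key)
    return dict(groups)


# Made-up long-format records; doses are in µM.
records = [
    {"sample": "CL1", "drug": "D1", "dose": 0.1, "response": 0.9},
    {"sample": "CL1", "drug": "D1", "dose": 10.0, "response": 0.2},
    {"sample": "CL2", "drug": "D1", "dose": 0.1, "response": 0.8},
    {"sample": "CL2", "drug": "D1", "dose": 10.0, "response": 0.4},
    {"sample": "CL1", "drug": "D2", "dose": 1.0, "response": 0.7},
    {"sample": "CL1", "drug": "D2", "dose": 100.0, "response": 0.1},
]
groups = group_curves(records)
```

Curves that share a dose range and replicate count end up in the same group, which is what keeps each per-group input table dense.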
Utility functions
Utility functions for datasets.
- drevalpy.datasets.utils.download_dataset(dataset_name, data_path='data', redownload=False)
Download the latest dataset from Zenodo.
- drevalpy.datasets.utils.download_from_url(dataset_name, file_url)
Download a file from a given URL.
- drevalpy.datasets.utils.permute_features(features, identifiers, views_to_permute, all_views)
Permute the specified views for each entity (= cell line or drug).
E.g. each cell line gets the feature vector/graph/image… of another cell line. Drawn without replacement.
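Drawing without replacement amounts to applying one permutation of the identifiers to the selected views. A minimal sketch on nested dictionaries follows. `permute_views` and its `seed` argument are hypothetical, and unlike the description suggests, a plain shuffle can occasionally map an entity to itself, so this is only an approximation of the documented behaviour.

```python
import random


def permute_views(features, views_to_permute, seed=0):
    """Reassign the chosen views across identifiers via a random permutation."""
    identifiers = list(features)
    donors = identifiers[:]
    random.Random(seed).shuffle(donors)  # a permutation: each donor is used exactly once
    return {
        ident: {
            view: (features[donor][view] if view in views_to_permute else value)
            for view, value in features[ident].items()
        }
        for ident, donor in zip(identifiers, donors)
    }


# Made-up data for illustration.
features = {
    "CL1": {"expr": [1.0], "meth": [0.1]},
    "CL2": {"expr": [2.0], "meth": [0.2]},
    "CL3": {"expr": [3.0], "meth": [0.3]},
}
permuted = permute_views(features, {"expr"})
```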
- drevalpy.datasets.utils.randomize_graph(original_graph)
Randomizes the graph by shuffling the edges while preserving the degree sequence.
- Parameters:
original_graph (Graph) – The original graph
- Return type:
Graph
- Returns:
Randomized graph with the same degree sequence and node attributes
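One standard way to shuffle edges while keeping every node's degree is the double edge swap: two edges a-b and c-d are replaced by a-d and c-b whenever that creates neither a self-loop nor a duplicate edge. The sketch below implements this on a plain edge list. Whether randomize_graph uses this exact procedure is not stated here, so treat it as an illustration of the technique. `double_edge_swap` and `degrees` are names chosen for the example.

```python
import random
from collections import Counter


def double_edge_swap(edges, n_swaps, seed=0, max_tries=1000):
    """Rewire an undirected simple graph while keeping every node's degree."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    done = 0
    tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        # Proposed rewiring: a-b, c-d  ->  a-d, c-b
        if len({a, b, c, d}) < 4:
            continue  # shared endpoint: would create a self-loop or change nothing
        if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
            continue  # would create a duplicate edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


def degrees(edges):
    """Count how many edges touch each node."""
    return Counter(node for edge in edges for node in edge)


original = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]  # a 6-node ring
randomized = double_edge_swap(original, n_swaps=3)
```

Each accepted swap removes one edge and adds one edge at every involved node, so the degree sequence is unchanged by construction.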