MOLIR
MOLIR Model
Contains the MOLIR model, a regression adaptation of the MOLI model.
Original authors: Sharifi-Noghabi et al. (2019, 10.1093/bioinformatics/btz318). Code adapted from their GitHub repository (https://github.com/hosseinshn/MOLI) and from Hauptmann et al. (2023, 10.1186/s12859-023-05166-7; https://github.com/kramerlab/Multi-Omics_analysis).
- class drevalpy.models.MOLIR.molir.MOLIR
Bases: DRPModel
Regression extension of MOLI: multi-omics late integration deep neural network.
Takes somatic mutation, copy number variation and gene expression data as input. MOLI uses type-specific encoding subnetworks to learn features for each omics type, concatenates them into one representation and optimizes this representation via a combined cost function consisting of a triplet loss and a binary cross-entropy loss. MOLIR is a regression adaptation that uses an MSE loss instead of the binary cross-entropy loss and adds a mechanism to find positive and negative samples for the triplet loss.
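As a rough sketch of the combined cost function described above, the regression variant can be written as an MSE term plus a weighted triplet term. The weight $\gamma$, the margin $m$, and the use of Euclidean distances on the concatenated representation $z$ are assumptions made for illustration, not details confirmed by this page; $z_i^{+}$ and $z_i^{-}$ denote the representations of the positive and negative sample chosen for sample $i$.

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \gamma \, \mathcal{L}_{\mathrm{triplet}},
\qquad
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2,
\qquad
\mathcal{L}_{\mathrm{triplet}} = \frac{1}{N} \sum_{i=1}^{N}
  \max\!\left( 0,\; \lVert z_i - z_i^{+} \rVert_2 - \lVert z_i - z_i^{-} \rVert_2 + m \right)
```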
- build_model(hyperparameters)
Builds the model from hyperparameters.
- cell_line_views = ['gene_expression', 'mutations', 'copy_number_variation_gistic']
- drug_views = []
- early_stopping = True
- is_single_drug_model = True
- load_cell_line_features(data_path, dataset_name)
Loads the cell line features: gene expression, mutations and copy number variation.
- Parameters:
- Return type:
- Returns:
FeatureDataset with gene expression, mutations and copy number variation
- load_drug_features(data_path, dataset_name)
Returns None, as drug features are not needed for MOLIR.
- Parameters:
- Return type:
- Returns:
None
- predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)
Predicts the drug response.
If there was no training data, only nans will be returned.
- Parameters:
cell_line_ids (ndarray) – Cell lines to predict
drug_ids (ndarray) – Drugs to predict
cell_line_input (FeatureDataset) – cell line omics features
drug_input (FeatureDataset | None) – drug features, not needed
- Return type:
- Returns:
Predicted drug response
- Raises:
ValueError – If the model was not trained
- train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')
Initializes and trains the model.
Originally, the gene expression data was reduced using a variance threshold (0.05) and standardized. This implementation instead selects the 1000 most variable genes to avoid issues with the variance threshold. Then, the model is initialized with the hyperparameters and the dimensions of the gene expression, mutation and copy number variation data. If there is no training data, the model is set to None (and predictions are skipped as well). If there is not enough training data, predictions are made with the randomly initialized model.
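The preprocessing described above (keep the most variable genes, then standardize) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the drevalpy implementation: the helper names `select_most_variable` and `standardize`, the use of population variance, and the toy data are all hypothetical.

```python
from statistics import mean, pstdev, pvariance


def select_most_variable(matrix, feature_names, k):
    """Keep the k columns with the highest variance (hypothetical helper)."""
    n_features = len(feature_names)
    variances = [pvariance([row[j] for row in matrix]) for j in range(n_features)]
    # Rank columns by variance, keep the top k, then restore the original column order.
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    top = sorted(ranked[:k])
    reduced = [[row[j] for j in top] for row in matrix]
    return reduced, [feature_names[j] for j in top]


def standardize(matrix):
    """Scale each column to zero mean and unit variance; constant columns become 0."""
    n_features = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_features)]
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # avoid division by zero
    return [[(row[j] - means[j]) / stds[j] for j in range(n_features)] for row in matrix]


# Toy expression matrix: 4 cell lines x 3 genes (made-up values).
expression = [
    [1.0, 5.0, 0.0],
    [1.0, 7.0, 2.0],
    [1.0, 9.0, 4.0],
    [1.0, 11.0, 6.0],
]
genes = ["GENE_A", "GENE_B", "GENE_C"]

reduced, kept = select_most_variable(expression, genes, k=2)
scaled = standardize(reduced)
```

Selecting a fixed number of genes always yields the same input dimension, whereas a variance threshold can keep a different number of genes for every training set.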
- Parameters:
output (DrugResponseDataset) – drug response data
cell_line_input (FeatureDataset) – cell line omics features, i.e., gene expression, mutations and copy number variation
drug_input (FeatureDataset | None) – drug features, not needed
output_earlystopping (DrugResponseDataset | None) – early stopping data, not used when there is not enough data
model_checkpoint_dir (str) – directory to save the model checkpoints
- Raises:
ValueError – If drug_input is None.
- Return type:
Model utils
Utility functions for the MOLIR model.
Original authors of MOLI: Sharifi-Noghabi et al. (2019, 10.1093/bioinformatics/btz318) Code adapted from: Hauptmann et al. (2023, 10.1186/s12859-023-05166-7), https://github.com/kramerlab/Multi-Omics_analysis
- class drevalpy.models.MOLIR.utils.MOLIEncoder(input_size, output_size, dropout_rate)
Bases: Module
Encoder of the MOLIR model, identical to the encoders of the original MOLI model.
The MOLIR model has three encoders, one each for the gene expression, mutation, and copy number variation data, which are trained together.
- forward(x)
Forward pass of the encoder.
- Parameters:
x (Tensor) – omic input features
- Return type:
Tensor
- Returns:
encoded omic features
- class drevalpy.models.MOLIR.utils.MOLIModel(hpams, input_dim_expr, input_dim_mut, input_dim_cnv)
Bases: RegressionMetricsMixin, LightningModule
PyTorch Lightning module for the MOLIR model.
The architecture of the MOLIR model is identical to the MOLI model, except that the final sigmoid layer is omitted and a regression MSE loss is used instead of a binary cross-entropy loss. Additionally, early stopping is used instead of tuning the number of epochs as a hyperparameter.
- Parameters:
- configure_optimizers()
Overrides the configure_optimizers method from PyTorch Lightning.
- Return type:
Optimizer
- Returns:
optimizers for the MOLIR expression, mutation, copy number variation encoders, and regressor
- fit(output_train, cell_line_input, output_earlystopping=None, patience=5, model_checkpoint_dir='checkpoints', wandb_project=None)
Trains the MOLIR model.
First, the ranges for the triplet loss are determined using the standard deviation of the training responses. Then, the training and validation data loaders are created. The model is trained using the Lightning Trainer with an early stopping callback (patience of 5 by default).
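The role of the patience parameter can be illustrated without any deep learning dependency. The function below is a simplified stand-in for patience-based early stopping, not the Lightning callback itself; the name `epochs_until_stop` and the loss values are hypothetical.

```python
def epochs_until_stop(val_losses, patience=5):
    """Return (best_epoch, last_epoch_run) for patience-based early stopping."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Training ran through all epochs without exhausting the patience.
    return best_epoch, len(val_losses) - 1


# Made-up validation losses: the best value occurs at epoch 2.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74, 0.73, 0.9]
best_epoch, last_epoch = epochs_until_stop(losses, patience=5)
```

In this toy run, training stops after five consecutive epochs without improvement, and the best epoch is the one whose weights a checkpoint callback would keep.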
- Parameters:
output_train (DrugResponseDataset) – training dataset containing the response output
cell_line_input (FeatureDataset) – feature dataset containing the omics data of the cell lines
output_earlystopping (DrugResponseDataset | None) – early stopping dataset
patience (int) – patience for early stopping
model_checkpoint_dir (str) – directory to save the model checkpoints
wandb_project (str | None) – optional wandb project name for logging. If provided, uses WandbLogger for PyTorch Lightning training.
- Return type:
- forward(x_gene, x_mutation, x_cna)
Forward pass of the MOLIR model.
- Parameters:
x_gene (Tensor) – gene expression input
x_mutation (Tensor) – mutation input
x_cna (Tensor) – copy number variation input
- Return type:
Tensor
- Returns:
predicted drug response
- predict(gene_expression, mutations, copy_number)
Perform prediction on given input data.
If there was enough training data to train the model, the model from the best epoch was saved in the checkpoint callback and is loaded now. If there was not enough training data, the model is only randomly initialized.
- training_step(batch, batch_idx)
Training step of the MOLIR model.
- class drevalpy.models.MOLIR.utils.MOLIRegressor(input_size, dropout_rate)
Bases: Module
Regressor of the MOLIR model.
It is identical to the regressor of the original MOLI model, except for the omission of the final sigmoid activation function. After the three encoders, the encoded features are concatenated and fed into the regressor.
- forward(x)
Forward pass of the regressor.
- Parameters:
x (Tensor) – concatenated encoded features
- Return type:
Tensor
- Returns:
predicted drug response
- class drevalpy.models.MOLIR.utils.RegressionDataset(output, cell_line_input)
Bases: Dataset
Dataset for regression tasks for the data loader.
- Parameters:
output (DrugResponseDataset)
cell_line_input (FeatureDataset)
- drevalpy.models.MOLIR.utils.create_dataset_and_loaders(batch_size, output_train, cell_line_input, output_earlystopping=None)
Creates the RegressionDataset (torch Dataset) and the DataLoader for the training and validation data.
- Parameters:
batch_size (int) – specified batch size
output_train (DrugResponseDataset) – response values for the training data
cell_line_input (FeatureDataset) – omic input features of the cell lines
output_earlystopping (DrugResponseDataset | None) – early stopping dataset
- Return type:
- Returns:
training and validation data loaders
- drevalpy.models.MOLIR.utils.filter_and_sort_omics(model, gene_expression, mutations, cnvs, cell_line_input)
Filters out features that were not present during training and imputes missing features with zeros.
This is necessary because the feature order might have changed or more features are available (cross-study setting).
- Parameters:
model (DRPModel) – the calling MOLIR or SuperFELTR model instance (self)
gene_expression (ndarray) – new gene expression data from which to predict
mutations (ndarray) – new mutation data from which to predict
cnvs (ndarray) – new copy number variation data from which to predict
cell_line_input (FeatureDataset) – needed for meta information (feature names)
- Return type:
- Returns:
filtered and sorted gene expression, mutations, and copy number variation data
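The alignment step described for filter_and_sort_omics can be sketched as follows: columns are reordered to match the training features, features unseen during training are dropped, and training features missing from the new data are filled with zeros. The helper `align_features` and the gene names are hypothetical, and plain lists stand in for the ndarray inputs of the real function.

```python
def align_features(rows, new_names, train_names):
    """Reorder columns to match train_names; absent features become 0.0 (hypothetical helper)."""
    position = {name: j for j, name in enumerate(new_names)}
    aligned = []
    for row in rows:
        aligned.append(
            [row[position[name]] if name in position else 0.0 for name in train_names]
        )
    return aligned


# Feature order seen during training versus the order in a new dataset.
train_genes = ["TP53", "KRAS", "EGFR"]
new_genes = ["EGFR", "BRAF", "TP53"]  # BRAF is extra, KRAS is missing

new_expression = [
    [0.1, 0.2, 0.3],
    [1.1, 1.2, 1.3],
]

aligned = align_features(new_expression, new_genes, train_genes)
```

After alignment, every row has exactly one value per training feature in the training order, so the input layers see the dimensions they were built with.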
- drevalpy.models.MOLIR.utils.generate_triplets_indices(y, positive_range, negative_range, random_seed=None)
Generates triplets for the MOLIR model.
The positive and negative ranges are determined by the standard deviation of the response values. For a given sample, another sample is considered positive if its response value lies within the positive range of that sample's response value; the positive range is ±10% of the standard deviation of all response values. A sample is considered negative if its response value is at least one standard deviation away from the response value of the given sample.
- Parameters:
- Return type:
- Returns:
positive and negative sample indices for each sample
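A minimal sketch of the selection rule described above, using only the standard library. It is not the drevalpy implementation: the function name `sketch_triplet_indices` is hypothetical, and the fallbacks (nearest sample as positive, farthest sample as negative when no candidate qualifies) are assumptions made so that the example always returns an index.

```python
import random


def sketch_triplet_indices(y, positive_range, negative_range, random_seed=None):
    """Pick one positive and one negative index per sample (illustrative only)."""
    rng = random.Random(random_seed)
    positives, negatives = [], []
    for i, anchor in enumerate(y):
        others = [j for j in range(len(y)) if j != i]
        # Positive candidates: responses close to the anchor's response.
        pos = [j for j in others if abs(y[j] - anchor) <= positive_range]
        # Negative candidates: responses far from the anchor's response.
        neg = [j for j in others if abs(y[j] - anchor) >= negative_range]
        # Assumed fallbacks: nearest sample as positive, farthest as negative.
        positives.append(
            rng.choice(pos) if pos else min(others, key=lambda j: abs(y[j] - anchor))
        )
        negatives.append(
            rng.choice(neg) if neg else max(others, key=lambda j: abs(y[j] - anchor))
        )
    return positives, negatives


# Made-up response values with two tight clusters and one outlier.
responses = [0.0, 0.05, 1.0, 1.02, 3.0]
pos_idx, neg_idx = sketch_triplet_indices(
    responses, positive_range=0.1, negative_range=1.0, random_seed=42
)
```

Passing a fixed random_seed makes the sampled triplets reproducible across runs.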
- drevalpy.models.MOLIR.utils.get_dimensions_of_omics_data(cell_line_input)
Determines the dimensions of the omics data for the creation of the input layers.
- Parameters:
cell_line_input (FeatureDataset) – omic input features of the cell lines
- Return type:
- Returns:
dimensions of the gene expression, mutations, and copy number variation data
- drevalpy.models.MOLIR.utils.make_ranges(output)
Compute the positive and negative range for the triplet loss.
- Parameters:
output (DrugResponseDataset) – drug response dataset
- Return type:
- Returns:
positive and negative range for the triplet loss
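Based on the description of the triplet ranges given for generate_triplets_indices, the two ranges could be derived from the responses roughly as below. The function name `ranges_from_responses`, the plain list input (instead of a DrugResponseDataset), and the choice of the population standard deviation are assumptions for illustration.

```python
from statistics import pstdev


def ranges_from_responses(responses, positive_fraction=0.1):
    """Return (positive_range, negative_range) from the response spread (illustrative)."""
    std = pstdev(responses)  # assumption: population standard deviation
    positive_range = positive_fraction * std  # 10% of the standard deviation
    negative_range = std  # one full standard deviation
    return positive_range, negative_range


# Made-up responses whose population standard deviation is 2.
responses = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
positive_range, negative_range = ranges_from_responses(responses)
```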