MOLIR

MOLIR Model

Contains the MOLIR model, a regression adaptation of the MOLI model.

Original authors: Sharifi-Noghabi et al. (2019, 10.1093/bioinformatics/btz318). Code adapted from their GitHub repository, https://github.com/hosseinshn/MOLI, and from Hauptmann et al. (2023, 10.1186/s12859-023-05166-7), https://github.com/kramerlab/Multi-Omics_analysis

class drevalpy.models.MOLIR.molir.MOLIR

Bases: DRPModel

Regression extension of MOLI: multi-omics late integration deep neural network.

Takes somatic mutation, copy number variation, and gene expression data as input. MOLI uses type-specific encoding subnetworks to learn features for each omics type, concatenates them into one representation, and optimizes this representation via a combined cost function consisting of a triplet loss and a binary cross-entropy loss. MOLIR is a regression adaptation that replaces the binary cross-entropy loss with an MSE loss and adds a mechanism to find positive and negative samples for the triplet loss.
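The combined cost function described above can be sketched in plain Python. This is a minimal illustration, not the drevalpy implementation: the function names, the `triplet_weight` parameter, and the use of Euclidean distance are assumptions. The triplet term uses the standard margin formulation max(d(a, p) - d(a, n) + margin, 0).

```python
import math


def mse(preds, targets):
    """Mean squared error between predicted and observed responses."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchors, positives, negatives, margin):
    """Mean of max(d(a, p) - d(a, n) + margin, 0) over all triplets."""
    losses = [
        max(euclidean(a, p) - euclidean(a, n) + margin, 0.0)
        for a, p, n in zip(anchors, positives, negatives)
    ]
    return sum(losses) / len(losses)


def combined_loss(preds, targets, anchors, positives, negatives,
                  margin=1.0, triplet_weight=1.0):
    """Regression loss plus weighted triplet loss.

    triplet_weight is a hypothetical knob for this sketch; how the real
    model weights the two terms is not stated in this documentation.
    """
    return mse(preds, targets) + triplet_weight * triplet_margin_loss(
        anchors, positives, negatives, margin
    )
```

The triplet term is zero whenever the negative sample is already farther from the anchor than the positive sample by at least the margin, so only poorly separated triplets contribute to the loss.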

build_model(hyperparameters)

Builds the model from hyperparameters.

Parameters: hyperparameters (dict[str, Any]) – Custom hyperparameters for the model, including mini_batch, the layer dimensions (h_dim1, h_dim2, h_dim3), learning_rate, dropout_rate, weight_decay, gamma, epochs, and margin.
Return type: None
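As an illustration, a hyperparameter dictionary with the keys listed above might look as follows. The key names come from this documentation; every value is an arbitrary assumption chosen for the example, not a default or a recommendation, and `missing_keys` is a hypothetical helper.

```python
from typing import Any

# Key names follow the build_model documentation; all values are
# illustrative assumptions, not defaults of the library.
example_hyperparameters: dict[str, Any] = {
    "mini_batch": 32,
    "h_dim1": 256,
    "h_dim2": 128,
    "h_dim3": 64,
    "learning_rate": 1e-3,
    "dropout_rate": 0.3,
    "weight_decay": 1e-4,
    "gamma": 0.5,
    "epochs": 20,
    "margin": 1.0,
}

REQUIRED_KEYS = {
    "mini_batch", "h_dim1", "h_dim2", "h_dim3", "learning_rate",
    "dropout_rate", "weight_decay", "gamma", "epochs", "margin",
}


def missing_keys(hyperparameters: dict[str, Any]) -> list[str]:
    """Return the documented keys that are absent from the dictionary."""
    return sorted(REQUIRED_KEYS - set(hyperparameters))
```

A check like `missing_keys(example_hyperparameters)` before calling `build_model` catches an incomplete configuration early.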

cell_line_views = ['gene_expression', 'mutations', 'copy_number_variation_gistic']

drug_views = []

early_stopping = True

classmethod get_model_name()

Returns the model name.

Return type: str
Returns: MOLIR

is_single_drug_model = True

load_cell_line_features(data_path, dataset_name)

Loads the cell line features: gene expression, mutations and copy number variation.

Parameters:

data_path (str) – path to the data
dataset_name (str) – name of the dataset

Return type:

FeatureDataset

Returns:

FeatureDataset with gene expression, mutations and copy number variation

load_drug_features(data_path, dataset_name)

Returns None, as drug features are not needed for MOLIR.

Parameters:

data_path (str) – path to the data
dataset_name (str) – name of the dataset

Return type:

FeatureDataset | None

Returns:

None

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the drug response.

If there was no training data, only NaN values are returned.

Parameters:

cell_line_ids (ndarray) – Cell lines to predict
drug_ids (ndarray) – Drugs to predict
cell_line_input (FeatureDataset) – cell line omics features
drug_input (FeatureDataset | None) – drug features, not needed

Return type:

ndarray

Returns:

Predicted drug response

Raises:

ValueError – If the model was not trained

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')

Initializes and trains the model.

Originally, the gene expression data was reduced using a variance threshold (0.05) and standardized. This implementation instead selects the 1000 most variable genes to avoid issues with the variance threshold. Then, the model is initialized with the hyperparameters and the dimensions of the gene expression, mutation, and copy number variation data. If there is no training data, the model is set to None (and predictions will be skipped as well). If there is not enough training data, predictions are made with the randomly initialized model.

Parameters:

output (DrugResponseDataset) – drug response data
cell_line_input (FeatureDataset) – cell line omics features, i.e., gene expression, mutations and copy number variation
drug_input (FeatureDataset | None) – drug features, not needed
output_earlystopping (DrugResponseDataset | None) – early stopping data, not used when there is not enough data
model_checkpoint_dir (str) – directory to save the model checkpoints

Raises:

ValueError – If drug_input is not None.

Return type:

None
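The gene selection step described for train can be sketched as follows. This is a stdlib-only illustration under stated assumptions: the function names are hypothetical, rows are cell lines and columns are genes, and the separate `standardize` helper is included only because the description mentions standardization. How the actual implementation scales the data is not specified here.

```python
import statistics


def select_most_variable(matrix, feature_names, k):
    """Keep the k columns with the highest variance (rows are samples)."""
    columns = list(zip(*matrix))
    variances = [statistics.pvariance(col) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: variances[j], reverse=True)
    top = sorted(ranked[:k])  # keep the original column order
    reduced = [[row[j] for j in top] for row in matrix]
    return reduced, [feature_names[j] for j in top]


def standardize(matrix):
    """Scale each column to zero mean and unit variance.

    Constant columns are mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*matrix))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in matrix
    ]
```

Selecting a fixed number of genes always yields the same input dimension, whereas a fixed variance threshold can keep very few or very many genes depending on the dataset.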

Model utils

Utility functions for the MOLIR model.

Original authors of MOLI: Sharifi-Noghabi et al. (2019, 10.1093/bioinformatics/btz318). Code adapted from Hauptmann et al. (2023, 10.1186/s12859-023-05166-7), https://github.com/kramerlab/Multi-Omics_analysis

class drevalpy.models.MOLIR.utils.MOLIEncoder(input_size, output_size, dropout_rate)

Bases: Module

Encoder of the MOLIR model, identical to the encoders of the original MOLI model.

The MOLIR model has three encoders for the gene expression, mutations, and copy number variation data which are trained together.

Parameters:

input_size (int)
output_size (int)
dropout_rate (float)

forward(x)

Forward pass of the encoder.

Parameters: x (Tensor) – omic input features
Return type: Tensor
Returns: encoded omic features

class drevalpy.models.MOLIR.utils.MOLIModel(hpams, input_dim_expr, input_dim_mut, input_dim_cnv)

Bases: RegressionMetricsMixin, LightningModule

PyTorch Lightning module for the MOLIR model.

The architecture of the MOLIR model is identical to that of the MOLI model, except that the final sigmoid layer is omitted and a regression MSE loss is used instead of a binary cross-entropy loss. Additionally, early stopping is used instead of tuning the number of epochs as a hyperparameter.

Parameters:

hpams (dict[str, int | float])
input_dim_expr (int)
input_dim_mut (int)
input_dim_cnv (int)

configure_optimizers()

Overrides the configure_optimizers method from PyTorch Lightning.

Return type: Optimizer
Returns: optimizer for the MOLIR expression, mutation, and copy number variation encoders and the regressor

fit(output_train, cell_line_input, output_earlystopping=None, patience=5, model_checkpoint_dir='checkpoints', wandb_project=None)

Trains the MOLIR model.

First, the ranges for the triplet loss are determined using the standard deviation of the training responses. Then, the training and validation data loaders are created. The model is trained using the Lightning Trainer with an early stopping callback, whose patience defaults to 5.

Parameters:

output_train (DrugResponseDataset) – training dataset containing the response output
cell_line_input (FeatureDataset) – feature dataset containing the omics data of the cell lines
output_earlystopping (DrugResponseDataset | None) – early stopping dataset
patience (int) – for early stopping
model_checkpoint_dir (str) – directory to save the model checkpoints
wandb_project (str | None) – optional wandb project name for logging. If provided, uses WandbLogger for PyTorch Lightning training.

Return type:

None
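The effect of the patience parameter can be illustrated with a simplified stand-in for an early stopping callback. The sketch below does not use PyTorch Lightning; the function name and the list of validation losses are hypothetical. Training stops once the validation loss has failed to improve for `patience` consecutive epochs.

```python
def early_stopping_epochs(val_losses, patience=5):
    """Return (best_epoch, last_epoch_run) for a sequence of validation losses.

    Simplified illustration of patience-based early stopping: an epoch counts
    as an improvement only if its loss is strictly lower than the best so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # stop early
    return best_epoch, len(val_losses) - 1  # ran through all epochs
```

The weights from `best_epoch` are the ones worth keeping, which is why a checkpoint of the best epoch is saved and reloaded for prediction.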

forward(x_gene, x_mutation, x_cna)

Forward pass of the MOLIR model.

Parameters:

x_gene (Tensor) – gene expression input
x_mutation (Tensor) – mutation input
x_cna (Tensor) – copy number variation input

Return type:

Tensor

Returns:

predicted drug response

predict(gene_expression, mutations, copy_number)

Perform prediction on given input data.

If there was enough training data to train the model, the model from the best epoch was saved in the checkpoint callback and is loaded now. If there was not enough training data, the model is only randomly initialized.

Parameters:

gene_expression (ndarray) – gene expression data
mutations (ndarray) – mutation data
copy_number (ndarray) – copy number variation data

Return type:

ndarray

Returns:

predicted drug response

training_step(batch, batch_idx)

Training step of the MOLIR model.

Parameters:

batch (list[Tensor]) – batch of gene expression, mutations, copy number variation, and response
batch_idx (int) – index of the batch

Return type:

Tensor

Returns:

combined loss

validation_step(batch, batch_idx)

Validation step of the MOLIR model.

Parameters:

batch (list[Tensor]) – batch of gene expression, mutations, copy number variation, and response
batch_idx (int) – index of the batch

Return type:

Tensor

Returns:

combined loss

class drevalpy.models.MOLIR.utils.MOLIRegressor(input_size, dropout_rate)

Bases: Module

Regressor of the MOLIR model.

It is identical to the regressor of the original MOLI model, except for the omission of the final sigmoid activation function. After the three encoders, the encoded features are concatenated and fed into the regressor.

Parameters:

input_size (int)
dropout_rate (float)

forward(x)

Forward pass of the regressor.

Parameters: x (Tensor) – concatenated encoded features
Return type: Tensor
Returns: predicted drug response

class drevalpy.models.MOLIR.utils.RegressionDataset(output, cell_line_input)

Bases: Dataset

Dataset for regression tasks for the data loader.

Parameters:

output (DrugResponseDataset)
cell_line_input (FeatureDataset)

drevalpy.models.MOLIR.utils.create_dataset_and_loaders(batch_size, output_train, cell_line_input, output_earlystopping=None)

Creates the RegressionDataset (torch Dataset) and the DataLoader for the training and validation data.

Parameters:

batch_size (int) – specified batch size
output_train (DrugResponseDataset) – response values for the training data
cell_line_input (FeatureDataset) – omic input features of the cell lines
output_earlystopping (DrugResponseDataset | None) – early stopping dataset

Return type:

tuple[DataLoader, DataLoader | None]

Returns:

training and validation data loaders

drevalpy.models.MOLIR.utils.filter_and_sort_omics(model, gene_expression, mutations, cnvs, cell_line_input)

Filters out features that were not present during training and imputes missing features with zeros.

This is necessary because the feature order might have changed or more features are available (cross-study setting).

Parameters:

model (DRPModel) – the calling model instance (self), either MOLIR or SuperFELTR
gene_expression (ndarray) – new gene expression data from which to predict
mutations (ndarray) – new mutation data from which to predict
cnvs (ndarray) – new copy number variation data from which to predict
cell_line_input (FeatureDataset) – needed for meta information (feature names)

Return type:

tuple[ndarray, ndarray, ndarray]

Returns:

filtered and sorted gene expression, mutations, and copy number variation data
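The alignment idea behind filter_and_sort_omics can be sketched for a single omics matrix. This is a stdlib-only illustration, not the library code: `align_features` and its arguments are hypothetical names, and the example assumes that the feature names seen during training are available as a list.

```python
def align_features(matrix, feature_names, training_features):
    """Reorder columns to the training feature order.

    Features that are in the new data but were not seen during training are
    dropped; features seen during training but absent from the new data are
    imputed with zeros.
    """
    column_of = {name: j for j, name in enumerate(feature_names)}
    return [
        [row[column_of[name]] if name in column_of else 0.0
         for name in training_features]
        for row in matrix
    ]
```

For example, new data with columns `["TP53", "EGFR", "KRAS"]` aligned to the training order `["KRAS", "BRCA1", "TP53"]` drops EGFR, fills BRCA1 with zeros, and swaps the remaining two columns.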

drevalpy.models.MOLIR.utils.generate_triplets_indices(y, positive_range, negative_range, random_seed=None)

Generates triplets for the MOLIR model.

The positive and negative ranges are derived from the standard deviation of the response values. For a given anchor sample, another sample is considered positive if its response value lies within the positive range of the anchor's response value. The positive range is ±10% of the standard deviation of all response values. A sample is considered negative if its response value is at least one standard deviation away from the anchor's response value.

Parameters:

y (ndarray) – response values
positive_range (float) – positive range for the triplet loss
negative_range (float) – negative range for the triplet loss
random_seed (int | None) – random seed for reproducibility

Return type:

tuple[ndarray, ndarray]

Returns:

positive and negative sample indices for each sample
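The selection rule can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library function: the name `sketch_triplet_indices` is hypothetical, one positive and one negative index are drawn at random per anchor, and the fallback to the nearest or farthest other sample when no candidate qualifies is an assumption of this sketch.

```python
import random


def sketch_triplet_indices(y, positive_range, negative_range, random_seed=None):
    """Pick one positive and one negative sample index for every anchor.

    Requires at least two response values.
    """
    rng = random.Random(random_seed)
    positives, negatives = [], []
    for i, anchor in enumerate(y):
        others = [j for j in range(len(y)) if j != i]
        pos = [j for j in others if abs(y[j] - anchor) <= positive_range]
        neg = [j for j in others if abs(y[j] - anchor) >= negative_range]
        # Assumed fallback: nearest sample as positive, farthest as negative.
        if not pos:
            pos = [min(others, key=lambda j: abs(y[j] - anchor))]
        if not neg:
            neg = [max(others, key=lambda j: abs(y[j] - anchor))]
        positives.append(rng.choice(pos))
        negatives.append(rng.choice(neg))
    return positives, negatives
```

Passing the same `random_seed` twice yields the same triplets, which mirrors the purpose of the random_seed parameter documented above.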

drevalpy.models.MOLIR.utils.get_dimensions_of_omics_data(cell_line_input)

Determines the dimensions of the omics data for the creation of the input layers.

Parameters: cell_line_input (FeatureDataset) – omic input features of the cell lines
Return type: tuple[int, int, int]
Returns: dimensions of the gene expression, mutations, and copy number variation data

drevalpy.models.MOLIR.utils.make_ranges(output)

Compute the positive and negative range for the triplet loss.

Parameters: output (DrugResponseDataset) – drug response dataset
Return type: tuple[float, float]
Returns: positive and negative range for the triplet loss
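Based on the description of generate_triplets_indices (positive range of 10% of the standard deviation, negative range of one standard deviation), the computation might look like the sketch below. It operates on a plain list of response values rather than a DrugResponseDataset, and the use of the population standard deviation is an assumption.

```python
import statistics


def sketch_make_ranges(responses):
    """Return (positive_range, negative_range) from the response values.

    Assumption: population standard deviation; the real function may differ.
    """
    std = statistics.pstdev(responses)
    positive_range = 0.1 * std  # samples this close to the anchor are positives
    negative_range = std        # samples at least this far away are negatives
    return positive_range, negative_range
```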