MOLIR

MOLIR Model

Contains the MOLIR model, a regression adaptation of the MOLI model.

Original authors: Sharifi-Noghabi et al. (2019, 10.1093/bioinformatics/btz318). Code adapted from their GitHub repository, https://github.com/hosseinshn/MOLI, and from Hauptmann et al. (2023, 10.1186/s12859-023-05166-7), https://github.com/kramerlab/Multi-Omics_analysis.

class drevalpy.models.MOLIR.molir.MOLIR

Bases: DRPModel

Regression extension of MOLI: multi-omics late integration deep neural network.

Takes somatic mutation, copy number variation and gene expression data as input. MOLI uses type-specific encoding subnetworks to learn features for each omics type, concatenates them into one representation and optimizes this representation via a combined cost function consisting of a triplet loss and a binary cross-entropy loss. MOLIR is a regression adaptation that replaces the binary cross-entropy loss with an MSE loss and adds a mechanism for finding positive and negative samples for the triplet loss.
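The combined cost function described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function names are hypothetical, and weighting the triplet term by gamma is an assumption based on the gamma and margin hyperparameters. The model itself operates on PyTorch tensors.

```python
import math


def euclidean(a, b):
    """L2 distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


def mse(predictions, targets):
    """Mean squared error over paired predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def combined_loss(predictions, targets, triplets, margin, gamma):
    """MSE plus a gamma-weighted mean triplet loss (the weighting is an assumption)."""
    mean_triplet = sum(triplet_loss(a, p, n, margin) for a, p, n in triplets) / len(triplets)
    return mse(predictions, targets) + gamma * mean_triplet


# Toy example: one (anchor, positive, negative) triplet of 2-d embeddings
# and two predicted responses.
triplets = [([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])]
loss = combined_loss([1.0, 2.0], [1.0, 4.0], triplets, margin=1.0, gamma=0.5)
```

In the toy example the negative embedding is already far enough from the anchor, so the triplet term is zero and only the MSE contributes.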

build_model(hyperparameters)

Builds the model from hyperparameters.

Parameters:

hyperparameters (dict[str, Any]) – Custom hyperparameters for the model, includes mini_batch, layer dimensions (h_dim1, h_dim2, h_dim3), learning_rate, dropout_rate, weight_decay, gamma, epochs, and margin.

Return type:

None
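As an illustration, a hyperparameter dictionary with the keys listed above might look as follows. All values are hypothetical placeholders, not the defaults shipped with the library, and the reading of h_dim1 to h_dim3 as the output widths of the three encoders is an assumption.

```python
# Hypothetical values for illustration only; not the library's defaults.
hyperparameters = {
    "mini_batch": 32,
    "h_dim1": 64,
    "h_dim2": 32,
    "h_dim3": 32,
    "learning_rate": 1e-3,
    "dropout_rate": 0.3,
    "weight_decay": 1e-4,
    "gamma": 0.5,
    "epochs": 20,
    "margin": 1.0,
}

# Assumption: each h_dim is the output width of one omics encoder, so the
# concatenated representation is as wide as the three outputs combined.
regressor_input_size = (
    hyperparameters["h_dim1"] + hyperparameters["h_dim2"] + hyperparameters["h_dim3"]
)
```

Such a dictionary would be passed as the hyperparameters argument of build_model.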

cell_line_views = ['gene_expression', 'mutations', 'copy_number_variation_gistic']
drug_views = []
early_stopping = True
classmethod get_model_name()

Returns the model name.

Return type:

str

Returns:

MOLIR

is_single_drug_model = True
load_cell_line_features(data_path, dataset_name)

Loads the cell line features: gene expression, mutations and copy number variation.

Parameters:
  • data_path (str) – path to the data

  • dataset_name (str) – name of the dataset

Return type:

FeatureDataset

Returns:

FeatureDataset with gene expression, mutations and copy number variation

load_drug_features(data_path, dataset_name)

Returns None, as drug features are not needed for MOLIR.

Parameters:
  • data_path (str) – path to the data

  • dataset_name (str) – name of the dataset

Return type:

FeatureDataset | None

Returns:

None

predict(cell_line_ids, drug_ids, cell_line_input, drug_input=None)

Predicts the drug response.

If there was no training data, only NaN values are returned.

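Parameters:
  • cell_line_ids (ndarray) – IDs of the cell lines to predict for

  • drug_ids (ndarray) – IDs of the drugs to predict for

  • cell_line_input (FeatureDataset) – cell line omics features, i.e., gene expression, mutations and copy number variation

  • drug_input (FeatureDataset | None) – drug features, not needed
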
Return type:

ndarray

Returns:

Predicted drug response

Raises:

ValueError – If the model was not trained

train(output, cell_line_input, drug_input=None, output_earlystopping=None, model_checkpoint_dir='checkpoints')

Initializes and trains the model.

Originally, the gene expression data was reduced using a variance threshold (0.05) and standardized. To avoid issues with the variance threshold, the 1000 most variable genes are used instead. Then, the model is initialized with the hyperparameters and the dimensions of the gene expression, mutation and copy number variation data. If there is no training data, the model is set to None (and predictions will be skipped as well). If there is not enough training data, predictions are made with the randomly initialized model.

Parameters:
  • output (DrugResponseDataset) – drug response data

  • cell_line_input (FeatureDataset) – cell line omics features, i.e., gene expression, mutations and copy number variation

  • drug_input (FeatureDataset | None) – drug features, not needed

  • output_earlystopping (DrugResponseDataset | None) – early stopping data, not used when there is not enough data

  • model_checkpoint_dir (str) – directory to save the model checkpoints

Raises:

ValueError – If drug_input is not None.

Return type:

None
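The gene selection step mentioned in the description of train (keeping the most variable genes) can be sketched as below. This is a dependency-free illustration under assumptions: the helper name and the toy gene names are hypothetical, and the library's own feature selection may differ in detail.

```python
from statistics import pvariance


def top_variable_features(matrix, feature_names, k):
    """Return the names of the k features with the highest variance across samples.

    matrix: list of samples (rows = cell lines), each a list of feature values.
    """
    columns = list(zip(*matrix))  # transpose into per-feature tuples
    variances = [pvariance(column) for column in columns]
    ranked = sorted(range(len(feature_names)), key=lambda i: variances[i], reverse=True)
    keep = sorted(ranked[:k])  # restore the original feature order
    return [feature_names[i] for i in keep]


# Toy expression matrix: 3 cell lines x 3 genes.
expression = [
    [1.0, 5.0, 2.0],
    [1.1, 0.0, 2.0],
    [0.9, 9.0, 2.1],
]
selected = top_variable_features(expression, ["GENE_A", "GENE_B", "GENE_C"], k=2)
```

With k=1000, the same idea yields the 1000 most variable genes regardless of the absolute scale of the variances, which is what makes it more robust than a fixed threshold.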

Model utils

Utility functions for the MOLIR model.

Original authors of MOLI: Sharifi-Noghabi et al. (2019, 10.1093/bioinformatics/btz318). Code adapted from Hauptmann et al. (2023, 10.1186/s12859-023-05166-7), https://github.com/kramerlab/Multi-Omics_analysis.

class drevalpy.models.MOLIR.utils.MOLIEncoder(input_size, output_size, dropout_rate)

Bases: Module

Encoder of the MOLIR model, which is identical to the encoder of the original MOLI model.

The MOLIR model has three encoders, one each for the gene expression, mutation, and copy number variation data, which are trained together.

Parameters:
  • input_size (int)

  • output_size (int)

  • dropout_rate (float)

forward(x)

Forward pass of the encoder.

Parameters:

x (Tensor) – omic input features

Return type:

Tensor

Returns:

encoded omic features

class drevalpy.models.MOLIR.utils.MOLIModel(hpams, input_dim_expr, input_dim_mut, input_dim_cnv)

Bases: RegressionMetricsMixin, LightningModule

PyTorch Lightning module for the MOLIR model.

The architecture of the MOLIR model is identical to the MOLI model, except that the final sigmoid layer is omitted and a regression MSE loss is used instead of a binary cross-entropy loss. Additionally, early stopping is used instead of tuning the number of epochs as a hyperparameter.

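Parameters:
  • hpams (dict) – hyperparameters of the model

  • input_dim_expr (int) – number of gene expression features

  • input_dim_mut (int) – number of mutation features

  • input_dim_cnv (int) – number of copy number variation features
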
configure_optimizers()

Overrides the configure_optimizers method from PyTorch Lightning.

Return type:

Optimizer

Returns:

optimizer for the MOLIR gene expression, mutation, and copy number variation encoders and the regressor

fit(output_train, cell_line_input, output_earlystopping=None, patience=5, model_checkpoint_dir='checkpoints', wandb_project=None)

Trains the MOLIR model.

First, the ranges for the triplet loss are determined using the standard deviation of the training responses. Then, the training and validation data loaders are created. The model is trained using the Lightning Trainer with an early stopping callback; the patience defaults to 5.

Parameters:
  • output_train (DrugResponseDataset) – training dataset containing the response output

  • cell_line_input (FeatureDataset) – feature dataset containing the omics data of the cell lines

  • output_earlystopping (DrugResponseDataset | None) – early stopping dataset

  • patience (int) – for early stopping

  • model_checkpoint_dir (str) – directory to save the model checkpoints

  • wandb_project (str | None) – optional wandb project name for logging. If provided, uses WandbLogger for PyTorch Lightning training.

Return type:

None
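The effect of the patience parameter can be sketched in plain Python. This is a simplified model of early stopping on a validation loss, not the Lightning callback itself (which offers further options such as a minimum improvement); the helper name is hypothetical.

```python
def best_epoch_with_patience(val_losses, patience=5):
    """Return (best_epoch, last_epoch_run) for a sequence of validation losses.

    Training stops once the loss has not improved for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9, 0.8, 0.6]
best, stopped = best_epoch_with_patience(losses, patience=5)
```

In this toy run the loss at epoch 2 is never beaten within five epochs, so training stops at epoch 7 and the later improvement at epoch 8 is never reached. The checkpoint from the best epoch is what predict loads afterwards.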

forward(x_gene, x_mutation, x_cna)

Forward pass of the MOLIR model.

Parameters:
  • x_gene (Tensor) – gene expression input

  • x_mutation (Tensor) – mutation input

  • x_cna (Tensor) – copy number variation input

Return type:

Tensor

Returns:

predicted drug response

predict(gene_expression, mutations, copy_number)

Perform prediction on given input data.

If there was enough training data to train the model, the model from the best epoch was saved in the checkpoint callback and is loaded now. If there was not enough training data, the model is only randomly initialized.

Parameters:
  • gene_expression (ndarray) – gene expression data

  • mutations (ndarray) – mutation data

  • copy_number (ndarray) – copy number variation data

Return type:

ndarray

Returns:

predicted drug response

training_step(batch, batch_idx)

Training step of the MOLIR model.

Parameters:
  • batch (list[Tensor]) – batch of gene expression, mutations, copy number variation, and response

  • batch_idx (int) – index of the batch

Return type:

Tensor

Returns:

combined loss

validation_step(batch, batch_idx)

Validation step of the MOLIR model.

Parameters:
  • batch (list[Tensor]) – batch of gene expression, mutations, copy number variation, and response

  • batch_idx (int) – index of the batch

Return type:

Tensor

Returns:

combined loss

class drevalpy.models.MOLIR.utils.MOLIRegressor(input_size, dropout_rate)

Bases: Module

Regressor of the MOLIR model.

It is identical to the regressor of the original MOLI model, except for the omission of the final sigmoid activation function. After the three encoders, the encoded features are concatenated and fed into the regressor.

Parameters:
  • input_size (int)

  • dropout_rate (float)

forward(x)

Forward pass of the regressor.

Parameters:

x (Tensor) – concatenated encoded features

Return type:

Tensor

Returns:

predicted drug response

class drevalpy.models.MOLIR.utils.RegressionDataset(output, cell_line_input)

Bases: Dataset

Dataset for regression tasks for the data loader.

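Parameters:
  • output (DrugResponseDataset) – drug response data

  • cell_line_input (FeatureDataset) – omics features of the cell lines
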
drevalpy.models.MOLIR.utils.create_dataset_and_loaders(batch_size, output_train, cell_line_input, output_earlystopping=None)

Creates the RegressionDataset (torch Dataset) and the DataLoader for the training and validation data.

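Parameters:
  • batch_size (int) – batch size for the data loaders

  • output_train (DrugResponseDataset) – training dataset containing the response output

  • cell_line_input (FeatureDataset) – feature dataset containing the omics data of the cell lines

  • output_earlystopping (DrugResponseDataset | None) – early stopping dataset
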
Return type:

tuple[DataLoader, DataLoader | None]

Returns:

training and validation data loaders

drevalpy.models.MOLIR.utils.filter_and_sort_omics(model, gene_expression, mutations, cnvs, cell_line_input)

Filters out features that were not present during training and imputes missing features with zeros.

This is necessary because the feature order might have changed or more features are available (cross-study setting).

Parameters:
  • model (DRPModel) – the calling model instance (self), either MOLIR or SuperFELTR

  • gene_expression (ndarray) – new gene expression data from which to predict

  • mutations (ndarray) – new mutation data from which to predict

  • cnvs (ndarray) – new copy number variation data from which to predict

  • cell_line_input (FeatureDataset) – needed for meta information (feature names)

Return type:

tuple[ndarray, ndarray, ndarray]

Returns:

filtered and sorted gene expression, mutations, and copy number variation data
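The alignment described for filter_and_sort_omics can be sketched for a single omics view as follows. This is a dependency-free illustration under assumptions: the helper name and gene names are hypothetical, and the library works on NumPy arrays with feature names taken from the FeatureDataset meta information.

```python
def align_features(matrix, new_names, trained_names):
    """Reorder columns to the training order.

    Features unseen during training are dropped; features missing from the
    new data are imputed with zeros.
    """
    position = {name: i for i, name in enumerate(new_names)}
    return [
        [row[position[name]] if name in position else 0.0 for name in trained_names]
        for row in matrix
    ]


trained = ["TP53", "KRAS", "EGFR"]    # feature order seen during training
new_names = ["EGFR", "BRCA1", "TP53"]  # different order, one new, one missing
new_data = [[0.5, 1.0, 2.0], [0.1, 0.0, 3.0]]
aligned = align_features(new_data, new_names, trained)
```

Here BRCA1 is dropped because the model never saw it, and KRAS is filled with zeros because the new dataset does not provide it.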

drevalpy.models.MOLIR.utils.generate_triplets_indices(y, positive_range, negative_range, random_seed=None)

Generates triplets for the MOLIR model.

The positive and negative ranges are derived from the standard deviation of the response values. A sample is considered positive if its response value lies within the positive range of the reference sample's response value; the positive range is ±10% of the standard deviation of all response values. A sample is considered negative if its response value is at least one standard deviation away from the response value of the reference sample.

Parameters:
  • y (ndarray) – response values

  • positive_range (float) – positive range for the triplet loss

  • negative_range (float) – negative range for the triplet loss

  • random_seed (int | None) – random seed for reproducibility

Return type:

tuple[ndarray, ndarray]

Returns:

positive and negative sample indices for each sample
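The selection rule can be sketched in plain Python. This is an illustration under assumptions, not the library's code: the function name is hypothetical, it returns a list of index triples rather than two arrays, and it simply skips samples without a valid positive or negative partner, which may differ from how the library handles that case.

```python
import random


def sketch_triplet_indices(y, positive_range, negative_range, random_seed=None):
    """For each sample, pick one positive and one negative partner index."""
    rng = random.Random(random_seed)
    triplets = []
    for i, reference in enumerate(y):
        # Positives: other samples whose response is close to the reference.
        positives = [j for j, v in enumerate(y) if j != i and abs(v - reference) <= positive_range]
        # Negatives: samples whose response is far from the reference.
        negatives = [j for j, v in enumerate(y) if j != i and abs(v - reference) >= negative_range]
        if positives and negatives:  # assumption: skip samples without partners
            triplets.append((i, rng.choice(positives), rng.choice(negatives)))
    return triplets


y = [0.0, 0.05, 2.0, 2.02]
triplets = sketch_triplet_indices(y, positive_range=0.1, negative_range=1.0, random_seed=42)
```

In the toy data each sample has exactly one close neighbour, so the positive partner is fixed and only the choice of negative depends on the seed.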

drevalpy.models.MOLIR.utils.get_dimensions_of_omics_data(cell_line_input)

Determines the dimensions of the omics data for the creation of the input layers.

Parameters:

cell_line_input (FeatureDataset) – omic input features of the cell lines

Return type:

tuple[int, int, int]

Returns:

dimensions of the gene expression, mutations, and copy number variation data

drevalpy.models.MOLIR.utils.make_ranges(output)

Compute the positive and negative range for the triplet loss.

Parameters:

output (DrugResponseDataset) – drug response dataset

Return type:

tuple[float, float]

Returns:

positive and negative range for the triplet loss
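Based on the description of generate_triplets_indices (positive range of 10% of the standard deviation, negative range of one standard deviation), the computation can be sketched as below. The function name is hypothetical, it takes a plain list of response values instead of a DrugResponseDataset, and the use of the population standard deviation is an assumption.

```python
from statistics import pstdev


def sketch_make_ranges(responses):
    """Derive the triplet ranges from the spread of the training responses."""
    std = pstdev(responses)     # population standard deviation (assumption)
    positive_range = 0.1 * std  # within 10% of one standard deviation
    negative_range = std        # at least one standard deviation away
    return positive_range, negative_range


pos, neg = sketch_make_ranges([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

For these responses the standard deviation is 2, so samples within 0.2 of each other count as positives and samples at least 2 apart count as negatives.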