Introduction

This vignette will walk a reader through the nebula() function.

Setup

Before going through the tutorial, install {Nebula}.

remotes::install_github("nebula-group/nebula", build_vignettes = TRUE)

library(nebula)

Basic Usage

We’ll be using the colon data set throughout this example. This set contains data from nrow(colon$modal$gene_expr) patients with colon adenocarcinoma (COAD) from the TCGA with available mRNA gene expression and DNA alteration data.

In this dataset there are 1239 gene expression features and 187 DNA alteration features (encompassing copy number alteration, DNA mutation, and DNA methylation data).


head(colon$modal$gene_expr[1:5,1:5])
#>                     12          13         26         33         55
#> TCGA-3L-AA1B -2.423038 -0.06077479  4.6400521 -2.8386386  0.4246023
#> TCGA-4N-A93T -3.210290 -0.82757599  5.0946978 -1.9419793 -0.6309671
#> TCGA-4T-AA8H -2.895921 -1.90435159  4.6929238 -2.8386386  1.4881064
#> TCGA-5M-AAT4 -3.970366 -1.07540683  4.0248334 -2.4807544  0.4058703
#> TCGA-5M-AAT6 -1.651016  1.94444404 -0.5901572  0.3651033 -1.6379581

head(colon$modal$dna_alt[1:5, 1:5])
#>              25 91 92 207 208
#> TCGA-3L-AA1B  0  0  0   0   0
#> TCGA-4N-A93T  0  0  0   0   0
#> TCGA-4T-AA8H  0  0  0   0   0
#> TCGA-5M-AAT4  0  0  0   0   0
#> TCGA-5M-AAT6  0  0  0   0   0

We also include network information in the colon dataset, which indicates network edges both within and between modalities.

For example, below we can see the 10th gene in the first modality has a network connection to the 8th gene in the first modality.


head(colon$network)
#>      mode1 gene1 mode2 gene2
#> [1,]     1    10     1     8
#> [2,]     1    23     1     8
#> [3,]     1    30     1     8
#> [4,]     1    46     1     8
#> [5,]     1    47     1     8
#> [6,]     1    48     1     8

Now perform network-driven subtype discovery on the example colon data.

We recommend setting a seed for reproducible results, since there is a stochastic nature to the Bayesian subtype assignments.

# set seed for reproducibility
set.seed(123)

t1 <- Sys.time()
res <- nebula(
  data = colon$modal, 
  modtype = c(0, 1), 
  E = colon$network,
  H = 3, 
  modeta = c(1, 0.2),
  nu = 1,
  alpha = 1,
  lam = 1, 
  alpha_sigma = 10,
  beta_sigma = 10,
  alpha_p = 1, 
  beta_p = 1
)
t2 <- Sys.time()

t2-t1
#> Time difference of 12.99246 secs

Now we can examine the output.

There are subtype assignments for all samples:

head(res$clustering)
#> [1] 1 3 3 3 2 3

summary(as.factor(res$clustering))
#>   1   2   3 
#>  94  29 216

You may also access probabilities that each sample is assigned to each subtype. Above, we saw the first sample was assigned to subtype “1”. Below, we see that the probability of the first sample sample being assigned to subtype “1” was close to one (after rounding).


round(head(res$clus_pr), 3)
#>      [,1] [,2] [,3]
#> [1,]    1    0    0
#> [2,]    0    0    1
#> [3,]    0    0    1
#> [4,]    0    0    1
#> [5,]    0    1    0
#> [6,]    0    0    1

Finally, you may also access feature selection information in the output. For each feature in each mode, there is a TRUE/FALSE value for whether the feature is a defining variable in each cluster (in the res$defvar element), as well as the probability of that feature being a defining variable for the cluster (res$defvar_pr)