vignettes/nebula_tutorial.Rmd
nebula_tutorial.Rmd
This vignette will walk a reader through the nebula()
function.
Before going through the tutorial, install {Nebula}.
remotes::install_github("nebula-group/nebula", build_vignettes = TRUE) library(nebula)
We’ll be using the colon
data set throughout this example. This set contains data from nrow(colon$modal$gene_expr)
patients with colon adenocarcinoma (COAD) from the TCGA with available mRNA gene expression and DNA alteration data.
In this dataset there are 1239 gene expression features and 187 DNA alteration features (encompassing copy number alteration, DNA mutation, and DNA methylation data).
head(colon$modal$gene_expr[1:5,1:5]) #> 12 13 26 33 55 #> TCGA-3L-AA1B -2.423038 -0.06077479 4.6400521 -2.8386386 0.4246023 #> TCGA-4N-A93T -3.210290 -0.82757599 5.0946978 -1.9419793 -0.6309671 #> TCGA-4T-AA8H -2.895921 -1.90435159 4.6929238 -2.8386386 1.4881064 #> TCGA-5M-AAT4 -3.970366 -1.07540683 4.0248334 -2.4807544 0.4058703 #> TCGA-5M-AAT6 -1.651016 1.94444404 -0.5901572 0.3651033 -1.6379581 head(colon$modal$dna_alt[1:5, 1:5]) #> 25 91 92 207 208 #> TCGA-3L-AA1B 0 0 0 0 0 #> TCGA-4N-A93T 0 0 0 0 0 #> TCGA-4T-AA8H 0 0 0 0 0 #> TCGA-5M-AAT4 0 0 0 0 0 #> TCGA-5M-AAT6 0 0 0 0 0
We also include network information in the colon
dataset, which indicates network edges both within and between modalities.
For example, below we can see the 10th gene in the first modality has a network connection to the 8th gene in the first modality.
head(colon$network) #> mode1 gene1 mode2 gene2 #> [1,] 1 10 1 8 #> [2,] 1 23 1 8 #> [3,] 1 30 1 8 #> [4,] 1 46 1 8 #> [5,] 1 47 1 8 #> [6,] 1 48 1 8
Now perform network-driven subtype discovery on the example colon data.
We recommend setting a seed for reproducible results, since there is a stochastic nature to the Bayesian subtype assignments.
# set seed for reproducibility set.seed(123) t1 <- Sys.time() res <- nebula( data = colon$modal, modtype = c(0, 1), E = colon$network, H = 3, modeta = c(1, 0.2), nu = 1, alpha = 1, lam = 1, alpha_sigma = 10, beta_sigma = 10, alpha_p = 1, beta_p = 1 ) t2 <- Sys.time() t2-t1 #> Time difference of 12.99246 secs
Now we can examine the output.
There are subtype assignments for all samples:
You may also access probabilities that each sample is assigned to each subtype. Above, we saw the first sample was assigned to subtype “1”. Below, we see that the probability of the first sample sample being assigned to subtype “1” was close to one (after rounding).
round(head(res$clus_pr), 3) #> [,1] [,2] [,3] #> [1,] 1 0 0 #> [2,] 0 0 1 #> [3,] 0 0 1 #> [4,] 0 0 1 #> [5,] 0 1 0 #> [6,] 0 0 1
Finally, you may also access feature selection information in the output. For each feature in each mode, there is a TRUE/FALSE value for whether the feature is a defining variable in each cluster (in the res$defvar
element), as well as the probability of that feature being a defining variable for the cluster (res$defvar_pr
)