Built using scvi-tools.
Performs nonlinear factor analysis and incorporates gene set information to pre-annotate factors with gene sets.
PALAVA is a nonlinear factor analysis method that incorporates prior knowledge through gene sets. It provides interpretable dimension reduction for analysing biological signals in the data. Each annotated latent variable in the model uses a gene set as prior knowledge, so the biological meaning of the gene set is associated with the corresponding latent variable. This yields a more meaningful latent space, because the latent variables are pre-annotated with biological meaning.

We also assume that we do not know in advance which gene sets (or biological processes) are relevant to the data, so a reasonably large surplus of gene sets can be provided. The method produces factor importance scores that rank the factors by importance. The model is flexible enough to infer nonlinear relationships between genes, which lets it capture more complicated biological processes in the data.

The design of the annotated decoder also accounts for errors in the gene sets. Consequently, interpretability techniques can be used to refine the gene sets based on information from the data: the method can introduce relevant genes into a gene set, or leave gene set genes unused if the count data contain no such signal.
- Create a conda environment with Python 3.10:
conda create --name palava-env python=3.10
- Activate the conda environment:
conda activate palava-env
- Then run:
pip install git+ssh://git@github.com/shimlab/PALAVA.git
If you want an editable installation, or want to install from a cloned repo, then:
- Clone this repository.
- After cloning, navigate to the repo (the same path as setup.py) and run
pip install -e .
or remove the -e for a normal installation.
This notebook test-runs the method and visualises its output. It requires the palava_on_sim_data_a_test.h5ad data file in the directory. Using a GPU is highly recommended: training takes less than 10 minutes on a GPU (approximately 1 hour on a CPU).
- The raw counts of the single-cell RNA-seq data to analyse (these should not be log-transformed)
- A collection of gene sets that you think could be relevant to the data at hand (for example, the 50 Hallmark gene sets)
- An example on simulated data can be found in example_notebooks.
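As a rough illustration of the two inputs listed above, the sketch below parses gene sets from GMT-formatted text (the tab-separated format in which collections such as Hallmark are commonly distributed), builds a binary gene-by-gene-set membership mask, and runs a simple check that a count matrix has not been log-transformed. The function names (`parse_gmt`, `build_mask`, `looks_like_raw_counts`), the toy data, and the mask layout are assumptions made for this example; they are not part of the PALAVA API, and the package may expect its inputs in a different form.

```python
def parse_gmt(text):
    """Parse GMT text: each line is name<TAB>description<TAB>gene1<TAB>gene2..."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        gene_sets[fields[0]] = [g for g in fields[2:] if g]
    return gene_sets


def build_mask(gene_names, gene_sets):
    """Return (set_names, mask) where mask[i][j] == 1 if gene i is in gene set j.

    Genes listed in a gene set but absent from the data are ignored.
    """
    index = {g: i for i, g in enumerate(gene_names)}
    set_names = list(gene_sets)
    mask = [[0] * len(set_names) for _ in gene_names]
    for j, name in enumerate(set_names):
        for gene in gene_sets[name]:
            i = index.get(gene)
            if i is not None:
                mask[i][j] = 1
    return set_names, mask


def looks_like_raw_counts(rows):
    """Heuristic: raw counts are non-negative whole numbers."""
    return all(v >= 0 and float(v).is_integer() for row in rows for v in row)


# Toy data (hypothetical gene and gene set names).
gene_names = ["G1", "G2", "G3", "G4"]
gmt_text = "SET_A\ttoy set\tG1\tG2\tG9\nSET_B\ttoy set\tG3\n"

set_names, mask = build_mask(gene_names, parse_gmt(gmt_text))
```

Here `G9` is dropped because it is not measured in the data, and `G4` belongs to no gene set, so its row in the mask is all zeros.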
- Factor importance scores: rank the factors by importance
- Factor activations: the representation of the data in terms of factors (factor-to-cell relationship, analogous to the PCA representation of the data)
- Factor scores: the factor-to-gene relationship (analogous to factor loadings). These can be used to refine the gene sets.
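To make the outputs listed above more concrete, the sketch below shows one way they might be post-processed once they have been extracted as plain Python values: ranking factors by their importance scores, and comparing the genes with the largest absolute factor scores against the original gene set. The function names, the toy numbers, and the top-k cutoff are all assumptions made for illustration; this is not PALAVA's own refinement procedure, and the package may return its outputs in a different structure.

```python
def rank_factors(importance):
    """Return factor names sorted from most to least important."""
    return sorted(importance, key=importance.get, reverse=True)


def refine_gene_set(gene_names, factor_scores, gene_set, top_k):
    """Compare a gene set with the top_k genes by absolute factor score.

    Returns (top, added, dropped):
      top     - the top_k genes for this factor
      added   - top genes that were not in the original gene set
      dropped - original gene set genes that did not make the top_k
    """
    ranked = sorted(zip(gene_names, factor_scores),
                    key=lambda pair: abs(pair[1]), reverse=True)
    top = [gene for gene, _ in ranked[:top_k]]
    original = set(gene_set)
    added = [g for g in top if g not in original]
    dropped = [g for g in gene_set if g not in top]
    return top, added, dropped


# Toy values (hypothetical, not produced by the model).
importance = {"SET_A": 0.7, "SET_B": 0.1, "SET_C": 0.2}
ranking = rank_factors(importance)

gene_names = ["G1", "G2", "G3", "G4"]
scores_for_set_a = [0.9, 0.05, -0.6, 0.0]
top, added, dropped = refine_gene_set(gene_names, scores_for_set_a,
                                      gene_set=["G1", "G2"], top_k=2)
```

In this toy case `G3` is suggested as an addition to the gene set and `G2` as a candidate for removal, mirroring the idea that factor scores can both introduce relevant genes and leave unsupported gene set genes unused.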
- resolve warnings
- If using a Mac, use accelerator='cpu'; the package is not compatible with mps.
