Quickstart#
Consider three ways in which you might arrive at hypotheses about gene sets during analyses:
1. You obtain a DataFrame of differentially expressed genes and store it as a parquet file at the end of your analysis.
2. You read a paper that reports up- and down-regulated genes associated with a mechanism or treatment. You write down a few notes along the lines of “I read paper X and these were the findings” and store them in a text file (or similar).
3. You run or train an ML model and obtain genes either as a prediction or by linearly decoding the latent space of your deep learning model. You store either the model or the prediction as an artifact.
In all of these cases, the “actual analysis result” is a largely unstructured artifact that can’t easily be queried or tabulated to guide thinking about the next experiment.
Hence, we introduce a metadata registry, AnalysisResult, that links the actual artifacts and analyses and can serve as a decision-making vehicle for the team.
In this quickstart, we illustrate how to use AnalysisResult with a simplified example of scenario 1 mentioned above.
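To make the contrast with an unstructured artifact concrete, here is a minimal sketch of what a queryable record could look like. It uses plain Python only; `ToyAnalysisResult`, `results_with_up_gene`, the gene names, and the file names are hypothetical illustrations and not the lrex schema.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAnalysisResult:
    """Hypothetical stand-in for a structured analysis record (not the lrex schema)."""

    name: str
    up_genes: set = field(default_factory=set)
    down_genes: set = field(default_factory=set)
    artifact_paths: list = field(default_factory=list)


# Two made-up records: one from a DEG analysis, one from notes on a paper.
registry = [
    ToyAnalysisResult("DEG analysis", {"GENE_A", "GENE_B"}, {"GENE_C"}, ["degs_up.parquet"]),
    ToyAnalysisResult("notes on paper X", {"GENE_C"}, {"GENE_A"}, ["notes.txt"]),
]


def results_with_up_gene(results, symbol):
    """Return the names of all records that list `symbol` among their up genes."""
    return [r.name for r in results if symbol in r.up_genes]


hits = results_with_up_gene(registry, "GENE_C")
print(hits)
```

Because the gene sets are fields of the record rather than content buried in a file, the question "which analyses found this gene up-regulated?" becomes a one-line query.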
!lamin init --storage ./test-lrex --schema bionty,lrex
💡 connected lamindb: testuser1/test-lrex
import lamindb as ln
import bionty as bt
import lrex as lx
import scanpy as sc
ln.settings.transform.stem_uid = "NwPopSnhDS1t"
ln.settings.transform.version = "1"
💡 connected lamindb: testuser1/test-lrex
Run an analysis#
Run a mock analysis:
# track run with parameters
pvals_adj = 0.05
ln.track(params={"pvals_adj": pvals_adj})
# get mock data
adata = ln.core.datasets.anndata_human_immune_cells(populate_registries=True)
adata
# run analysis
sc.tl.rank_genes_groups(
    adata,
    use_raw=False,
    groupby="donor",
    method="wilcoxon",
    groups=["582C"],
    reference="rest",
)
rank_genes_groups_df = sc.get.rank_genes_groups_df(adata, "582C")
rank_genes_groups_df.head()
degs_up = rank_genes_groups_df[
    (rank_genes_groups_df["logfoldchanges"] > 0)
    & (rank_genes_groups_df["pvals_adj"] < pvals_adj)
]
degs_down = rank_genes_groups_df[
    (rank_genes_groups_df["logfoldchanges"] < 0)
    & (rank_genes_groups_df["pvals_adj"] < pvals_adj)
]
💡 notebook imports: bionty==0.42.9 lamindb==0.71.2 lrex==0.0.1 scanpy==1.10.1
💡 saved: Transform(version='1', uid='NwPopSnhDS1t5zKv', name='Quickstart', key='quickstart', type='notebook', updated_at=2024-05-07 20:50:28 UTC, created_by_id=1)
💡 saved: Run(uid='YfhvZfAm2142lyCMPuUz', json={'pvals_adj': 0.05}, transform_id=1, created_by_id=1)
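The filtering step in the mock analysis can be checked on a toy table. The sketch below assumes a DataFrame with the same three column names used above (`names`, `logfoldchanges`, `pvals_adj`); the values, the gene names, and the helper `split_degs` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the ranked-genes DataFrame; all values are made up.
toy_df = pd.DataFrame(
    {
        "names": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
        "logfoldchanges": [2.1, -1.4, 0.8, -0.3],
        "pvals_adj": [0.001, 0.01, 0.20, 0.04],
    }
)


def split_degs(df, threshold):
    """Split significant genes into up- and down-regulated by the sign of the fold change."""
    significant = df["pvals_adj"] < threshold
    up = df[significant & (df["logfoldchanges"] > 0)]
    down = df[significant & (df["logfoldchanges"] < 0)]
    return up, down


toy_up, toy_down = split_degs(toy_df, threshold=0.05)
print(toy_up["names"].tolist())
print(toy_down["names"].tolist())
```

Here `GENE_C` has a positive fold change but is dropped because its adjusted p-value exceeds the threshold, which is the same behavior as the two boolean masks in the analysis cell.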
Store analysis results#
Detailed results:
result_up = ln.Artifact.from_df(degs_up, description="DEGs up").save()
result_down = ln.Artifact.from_df(degs_down, description="DEGs down").save()
Abstracted results:
genes_up = bt.Gene.from_values(degs_up["names"].values, bt.Gene.ensembl_gene_id, organism="human")
genes_down = bt.Gene.from_values(degs_down["names"].values, bt.Gene.ensembl_gene_id, organism="human")
❗ did not create Gene record for 1 non-validated ensembl_gene_id: 'ENSG00000269028'
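The warning above reports an identifier that could not be validated. As a rough mental model only (this is not how bionty implements validation), the check can be pictured as partitioning identifiers against a reference set. The function `partition_by_reference` and the tiny reference set are hypothetical; the identifiers themselves are taken from this page.

```python
def partition_by_reference(values, reference):
    """Split values into those found in the reference and those that are not."""
    validated = [v for v in values if v in reference]
    non_validated = [v for v in values if v not in reference]
    return validated, non_validated


# Hypothetical reference holding two identifiers that appear in the query output on this page.
toy_reference = {"ENSG00000084070", "ENSG00000142937"}
queried = ["ENSG00000084070", "ENSG00000269028", "ENSG00000142937"]

validated, non_validated = partition_by_reference(queried, toy_reference)
print(non_validated)
```

Only validated identifiers would end up as records in this picture, so a gene set built this way can be smaller than the DataFrame it was derived from.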
Create the AnalysisResult record:
analysis = lx.AnalysisResult(name="My analysis").save()
analysis.up_genes.set(genes_up)
analysis.down_genes.set(genes_down)
analysis.artifacts.set([result_up, result_down])
Queries on AnalysisResult#
analysis.up_genes.df().head()
id | uid | symbol | stable_id | ensembl_gene_id | ncbi_gene_ids | biotype | description | synonyms | organism_id | public_source_id | created_at | updated_at | created_by_id
---|---|---|---|---|---|---|---|---|---|---|---|---|---
874 | 5rVZ6jQgHzOA | SMAP2 | None | ENSG00000084070 | 64744 | protein_coding | small ArfGAP2 | SMAP1L | 1 | 9 | 2024-05-07 20:50:42.912682+00:00 | 2024-05-07 20:50:42.912696+00:00 | 1
973 | 3QdwQJwY0B0b | RPS8 | None | ENSG00000142937 | 6202 | protein_coding | ribosomal protein S8 | S8 | 1 | 9 | 2024-05-07 20:50:42.932163+00:00 | 2024-05-07 20:50:42.932177+00:00 | 1
2087 | PL64XVlf6Aco | RPS27 | None | ENSG00000177954 | 6232 | protein_coding | ribosomal protein S27 | S27|MPS-1|MPS1 | 1 | 9 | 2024-05-07 20:50:43.164002+00:00 | 2024-05-07 20:50:43.164017+00:00 | 1
2438 | 7MB2cNP2oD4y | SELL | None | ENSG00000188404 | 6402 | protein_coding | selectin L | LAM1|HLHRC|PLNHR|LEU-8|LSEL|LAM-1|LYAM-1|LYAM1... | 1 | 9 | 2024-05-07 20:50:43.386625+00:00 | 2024-05-07 20:50:43.386640+00:00 | 1
2848 | 7lqYcBIs0hYC | FCMR | None | ENSG00000162894 | 9214 | protein_coding | Fc mu receptor | FAIM3|TOSO|FCMUR | 1 | 9 | 2024-05-07 20:50:43.471022+00:00 | 2024-05-07 20:50:43.471036+00:00 | 1
analysis.down_genes.df().head()
id | uid | symbol | stable_id | ensembl_gene_id | ncbi_gene_ids | biotype | description | synonyms | organism_id | public_source_id | created_at | updated_at | created_by_id
---|---|---|---|---|---|---|---|---|---|---|---|---|---
1180 | 5Kzn6IKvI1ES | JUN | None | ENSG00000177606 | 3725 | protein_coding | Jun proto-oncogene, AP-1 transcription factor ... | C-JUN|AP-1 | 1 | 9 | 2024-05-07 20:50:42.976136+00:00 | 2024-05-07 20:50:42.976150+00:00 | 1
2057 | 5wZu55GSw4d4 | S100A6 | None | ENSG00000197956 | 6277 | protein_coding | S100 calcium binding protein A6 | CABP|2A9|CACY|PRA | 1 | 9 | 2024-05-07 20:50:43.158945+00:00 | 2024-05-07 20:50:43.158959+00:00 | 1
2059 | 3ck8bNMmVpIh | S100A4 | None | ENSG00000196154 | 6275 | protein_coding | S100 calcium binding protein A4 | 18A2|P9KA|PEL98|MTS1|FSP1|CAPL|42A | 1 | 9 | 2024-05-07 20:50:43.159282+00:00 | 2024-05-07 20:50:43.159297+00:00 | 1
2771 | cM5Scoeej5xN | BTG2 | None | ENSG00000159388 | 7832 | protein_coding | BTG anti-proliferation factor 2 | MGC126064|MGC126063|PC3|APRO1|TIS21 | 1 | 9 | 2024-05-07 20:50:43.454949+00:00 | 2024-05-07 20:50:43.454963+00:00 | 1
3769 | 2NWfGttVcAlz | YPEL5 | None | ENSG00000119801 | 51646 | protein_coding | yippee like 5 | CGI-127 | 1 | 9 | 2024-05-07 20:50:43.665980+00:00 | 2024-05-07 20:50:43.665995+00:00 | 1
analysis.artifacts.df()
id | version | uid | storage_id | key | suffix | accessor | description | size | hash | hash_type | n_objects | n_observations | transform_id | run_id | visibility | key_is_virtual | created_at | updated_at | created_by_id
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | None | 5VQWjoJAUeO7obqZP87u | 1 | None | .parquet | DataFrame | DEGs up | 6310 | Erme0TktObLQCmKSNSVMAw | md5 | None | None | 1 | 1 | 1 | True | 2024-05-07 20:51:04.731997+00:00 | 2024-05-07 20:51:04.732026+00:00 | 1
2 | None | E1UuvYA9h4ZQt1qZs8IP | 1 | None | .parquet | DataFrame | DEGs down | 7273 | 9_HwJdp8bwCqNzsx9YZ05w | md5 | None | None | 1 | 1 | 1 | True | 2024-05-07 20:51:04.739717+00:00 | 2024-05-07 20:51:04.739741+00:00 | 1
The actual analysis (a notebook, a pipeline, a script, a UI interaction):
analysis.transform
Transform(version='1', uid='NwPopSnhDS1t5zKv', name='Quickstart', key='quickstart', type='notebook', updated_at=2024-05-07 20:50:28 UTC, created_by_id=1)
The run of the analysis including parameters:
analysis.run
Run(uid='YfhvZfAm2142lyCMPuUz', started_at=2024-05-07 20:50:28 UTC, json={'pvals_adj': 0.05}, is_consecutive=True, transform_id=1, created_by_id=1)
Data lineage:
analysis.transform.view_parents()
Make a new version of the analysis#
Say we re-run an analysis and want to make a new version. Here’s how we can do this:
analysis_v2 = lx.AnalysisResult(is_new_version_of=analysis).save()
analysis_v2.versions.df()
id | version | uid | name | description | transform_id | run_id | created_at | updated_at | created_by_id
---|---|---|---|---|---|---|---|---|---
1 | 1 | xvyQYNh3 | My analysis | None | 1 | 1 | 2024-05-07 20:51:11.525839+00:00 | 2024-05-07 20:51:11.747865+00:00 | 1
2 | 2 | xvyQYNh3jbaS | My analysis | None | 1 | 1 | 2024-05-07 20:51:11.759980+00:00 | 2024-05-07 20:51:11.760008+00:00 | 1
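In the output above, both rows carry the same name and their uids start with the same characters (`xvyQYNh3`). A minimal sketch of grouping records into version families by such a shared prefix follows; the prefix length of 8 and the helper `group_by_stem` are assumptions made for illustration, not documented lrex behavior.

```python
from collections import defaultdict


def group_by_stem(records, stem_length=8):
    """Group (version, uid) pairs by the first `stem_length` characters of the uid."""
    families = defaultdict(list)
    for version, uid in records:
        families[uid[:stem_length]].append(version)
    # Sort versions within each family for a stable, readable result.
    return {stem: sorted(versions) for stem, versions in families.items()}


# (version, uid) pairs copied from the table above.
records = [("1", "xvyQYNh3"), ("2", "xvyQYNh3jbaS")]
families = group_by_stem(records)
print(families)
```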