Quickstart#

Consider three ways in which you might arrive at hypotheses about gene sets during an analysis:

  1. You obtain a DataFrame with differentially expressed genes and you store it at the end of your analysis as a parquet file.

  2. You read a paper that reports up- and down-regulated genes associated with a mechanism or treatment. You write down a few notes along the lines of “I read paper X and these were the findings” and store them in a text file (or similar).

  3. You run or train an ML model and obtain genes either as a prediction or by linearly decoding a latent space of your deep learning model. You store either the model or prediction as an artifact.

The problem with all three approaches is that the “actual analysis result” is a somewhat unstructured artifact that cannot easily be queried or put into tabular form to guide thinking about the next experiment.

Hence, we introduce a metadata registry, AnalysisResult, that links the actual artifacts with the analyses that produced them and can serve as a decision-making vehicle for the team.

In this quickstart, we illustrate how to use AnalysisResult with a simplified example of scenario 1 mentioned above.

!lamin init --storage ./test-lrex --schema bionty,lrex
💡 connected lamindb: testuser1/test-lrex
import lamindb as ln
import bionty as bt
import lrex as lx
import scanpy as sc

ln.settings.transform.stem_uid = "NwPopSnhDS1t"
ln.settings.transform.version = "1"
💡 connected lamindb: testuser1/test-lrex

Run an analysis#

Run a mock analysis:

# track run with parameters
pvals_adj = 0.05
ln.track(params={"pvals_adj": pvals_adj})

# get mock data
adata = ln.core.datasets.anndata_human_immune_cells(populate_registries=True)
adata

# run analysis
sc.tl.rank_genes_groups(
    adata,
    use_raw=False,
    groupby="donor",
    method="wilcoxon",
    groups=["582C"],
    reference="rest",
)
rank_genes_groups_df = sc.get.rank_genes_groups_df(adata, "582C")
rank_genes_groups_df.head()
degs_up = rank_genes_groups_df[
    (rank_genes_groups_df["logfoldchanges"] > 0)
    & (rank_genes_groups_df["pvals_adj"] < pvals_adj)
]
degs_down = rank_genes_groups_df[
    (rank_genes_groups_df["logfoldchanges"] < 0)
    & (rank_genes_groups_df["pvals_adj"] < pvals_adj)
]
💡 notebook imports: bionty==0.42.9 lamindb==0.71.2 lrex==0.0.1 scanpy==1.10.1
💡 saved: Transform(version='1', uid='NwPopSnhDS1t5zKv', name='Quickstart', key='quickstart', type='notebook', updated_at=2024-05-07 20:50:28 UTC, created_by_id=1)
💡 saved: Run(uid='YfhvZfAm2142lyCMPuUz', json={'pvals_adj': 0.05}, transform_id=1, created_by_id=1)
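
The filtering step above can be tried without scanpy or a LaminDB instance. The sketch below applies the same two masks to a small hand-made table. The gene names and values are invented for illustration, and `split_degs` is a hypothetical helper; only the column names (`names`, `logfoldchanges`, `pvals_adj`) follow the output of `sc.get.rank_genes_groups_df`.

```python
import pandas as pd

# Hypothetical stand-in for rank_genes_groups_df; all values are invented.
mock_df = pd.DataFrame(
    {
        "names": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
        "logfoldchanges": [2.1, -1.7, 0.4, -0.2],
        "pvals_adj": [0.001, 0.01, 0.20, 0.60],
    }
)


def split_degs(df, pvals_adj=0.05):
    """Split a DEG table into significant up- and down-regulated genes."""
    significant = df["pvals_adj"] < pvals_adj
    up = df[significant & (df["logfoldchanges"] > 0)]
    down = df[significant & (df["logfoldchanges"] < 0)]
    return up, down


mock_up, mock_down = split_degs(mock_df)
print(mock_up["names"].tolist())
print(mock_down["names"].tolist())
```

Genes that fail the adjusted p-value threshold (here `GENE_C` and `GENE_D`) end up in neither set, which is why the threshold is worth tracking as a run parameter.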

Store analysis results#

Detailed results, the full tables of up- and down-regulated genes, stored as artifacts:

result_up = ln.Artifact.from_df(degs_up, description="DEGs up").save()
result_down = ln.Artifact.from_df(degs_down, description="DEGs down").save()

Abstracted results, as gene records looked up by Ensembl gene ID:

genes_up = bt.Gene.from_values(degs_up["names"].values, bt.Gene.ensembl_gene_id, organism="human")
genes_down = bt.Gene.from_values(degs_down["names"].values, bt.Gene.ensembl_gene_id, organism="human")
did not create Gene record for 1 non-validated ensembl_gene_id: 'ENSG00000269028'

Create the AnalysisResult record:

analysis = lx.AnalysisResult(name="My analysis").save()
analysis.up_genes.set(genes_up)
analysis.down_genes.set(genes_down)
analysis.artifacts.set([result_up, result_down])

Queries on AnalysisResult#

analysis.up_genes.df().head()
id    uid           symbol  stable_id  ensembl_gene_id  ncbi_gene_ids  biotype         description            synonyms                                           organism_id  public_source_id  created_at                        updated_at                        created_by_id
874   5rVZ6jQgHzOA  SMAP2   None       ENSG00000084070  64744          protein_coding  small ArfGAP2          SMAP1L                                             1            9                 2024-05-07 20:50:42.912682+00:00  2024-05-07 20:50:42.912696+00:00  1
973   3QdwQJwY0B0b  RPS8    None       ENSG00000142937  6202           protein_coding  ribosomal protein S8   S8                                                 1            9                 2024-05-07 20:50:42.932163+00:00  2024-05-07 20:50:42.932177+00:00  1
2087  PL64XVlf6Aco  RPS27   None       ENSG00000177954  6232           protein_coding  ribosomal protein S27  S27|MPS-1|MPS1                                     1            9                 2024-05-07 20:50:43.164002+00:00  2024-05-07 20:50:43.164017+00:00  1
2438  7MB2cNP2oD4y  SELL    None       ENSG00000188404  6402           protein_coding  selectin L             LAM1|HLHRC|PLNHR|LEU-8|LSEL|LAM-1|LYAM-1|LYAM1...  1            9                 2024-05-07 20:50:43.386625+00:00  2024-05-07 20:50:43.386640+00:00  1
2848  7lqYcBIs0hYC  FCMR    None       ENSG00000162894  9214           protein_coding  Fc mu receptor         FAIM3|TOSO|FCMUR                                   1            9                 2024-05-07 20:50:43.471022+00:00  2024-05-07 20:50:43.471036+00:00  1
analysis.down_genes.df().head()
id    uid           symbol  stable_id  ensembl_gene_id  ncbi_gene_ids  biotype         description                                        synonyms                             organism_id  public_source_id  created_at                        updated_at                        created_by_id
1180  5Kzn6IKvI1ES  JUN     None       ENSG00000177606  3725           protein_coding  Jun proto-oncogene, AP-1 transcription factor ...  C-JUN|AP-1                           1            9                 2024-05-07 20:50:42.976136+00:00  2024-05-07 20:50:42.976150+00:00  1
2057  5wZu55GSw4d4  S100A6  None       ENSG00000197956  6277           protein_coding  S100 calcium binding protein A6                    CABP|2A9|CACY|PRA                    1            9                 2024-05-07 20:50:43.158945+00:00  2024-05-07 20:50:43.158959+00:00  1
2059  3ck8bNMmVpIh  S100A4  None       ENSG00000196154  6275           protein_coding  S100 calcium binding protein A4                    18A2|P9KA|PEL98|MTS1|FSP1|CAPL|42A   1            9                 2024-05-07 20:50:43.159282+00:00  2024-05-07 20:50:43.159297+00:00  1
2771  cM5Scoeej5xN  BTG2    None       ENSG00000159388  7832           protein_coding  BTG anti-proliferation factor 2                    MGC126064|MGC126063|PC3|APRO1|TIS21  1            9                 2024-05-07 20:50:43.454949+00:00  2024-05-07 20:50:43.454963+00:00  1
3769  2NWfGttVcAlz  YPEL5   None       ENSG00000119801  51646          protein_coding  yippee like 5                                      CGI-127                              1            9                 2024-05-07 20:50:43.665980+00:00  2024-05-07 20:50:43.665995+00:00  1
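
Because `.df()` returns ordinary pandas DataFrames, the up and down gene sets can be combined into one long table for filtering or comparison across analyses. The following is a minimal sketch under stated assumptions: `up_df` and `down_df` are hand-written stand-ins for `analysis.up_genes.df()` and `analysis.down_genes.df()` (reduced to the `symbol` and `ensembl_gene_id` columns, with a few values copied from the output above), and `tabulate_gene_sets` is a hypothetical helper, not part of lrex.

```python
import pandas as pd

# Stand-ins for analysis.up_genes.df() and analysis.down_genes.df();
# only two of the registry columns are reproduced here.
up_df = pd.DataFrame(
    {
        "symbol": ["SMAP2", "RPS8", "SELL"],
        "ensembl_gene_id": ["ENSG00000084070", "ENSG00000142937", "ENSG00000188404"],
    }
)
down_df = pd.DataFrame(
    {
        "symbol": ["JUN", "S100A6"],
        "ensembl_gene_id": ["ENSG00000177606", "ENSG00000197956"],
    }
)


def tabulate_gene_sets(up, down, analysis_name):
    """Stack up and down gene sets into one table with a direction column."""
    columns = ["symbol", "ensembl_gene_id"]
    long_table = pd.concat(
        [up[columns].assign(direction="up"), down[columns].assign(direction="down")],
        ignore_index=True,
    )
    # Record which analysis the rows came from, as the first column.
    long_table.insert(0, "analysis", analysis_name)
    return long_table


gene_table = tabulate_gene_sets(up_df, down_df, "My analysis")
print(gene_table.groupby("direction")["symbol"].count())
```

Tables built this way for several analyses can be concatenated and then filtered by gene or by direction.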
analysis.artifacts.df()
id  version  uid                   storage_id  key   suffix    accessor   description  size  hash                    hash_type  n_objects  n_observations  transform_id  run_id  visibility  key_is_virtual  created_at                        updated_at                        created_by_id
1   None     5VQWjoJAUeO7obqZP87u  1           None  .parquet  DataFrame  DEGs up      6310  Erme0TktObLQCmKSNSVMAw  md5        None       None            1             1       1           True            2024-05-07 20:51:04.731997+00:00  2024-05-07 20:51:04.732026+00:00  1
2   None     E1UuvYA9h4ZQt1qZs8IP  1           None  .parquet  DataFrame  DEGs down    7273  9_HwJdp8bwCqNzsx9YZ05w  md5        None       None            1             1       1           True            2024-05-07 20:51:04.739717+00:00  2024-05-07 20:51:04.739741+00:00  1

The actual analysis (a notebook, a pipeline, a script, a UI interaction):

analysis.transform
Transform(version='1', uid='NwPopSnhDS1t5zKv', name='Quickstart', key='quickstart', type='notebook', updated_at=2024-05-07 20:50:28 UTC, created_by_id=1)

The run of the analysis including parameters:

analysis.run
Run(uid='YfhvZfAm2142lyCMPuUz', started_at=2024-05-07 20:50:28 UTC, json={'pvals_adj': 0.05}, is_consecutive=True, transform_id=1, created_by_id=1)

Data lineage:

analysis.transform.view_parents()
[Figure: lineage graph rendered by view_parents()]

Make a new version of the analysis#

Say we re-run an analysis and want to make a new version. Here’s how we can do this:

analysis_v2 = lx.AnalysisResult(is_new_version_of=analysis).save()
analysis_v2.versions.df()
id  version  uid           name         description  transform_id  run_id  created_at                        updated_at                        created_by_id
1   1        xvyQYNh3      My analysis  None         1             1       2024-05-07 20:51:11.525839+00:00  2024-05-07 20:51:11.747865+00:00  1
2   2        xvyQYNh3jbaS  My analysis  None         1             1       2024-05-07 20:51:11.759980+00:00  2024-05-07 20:51:11.760008+00:00  1