Analyze a collection in memory

Here, we’ll analyze the growing collection by loading it into memory, which is only possible if the collection fits into memory. If your data is too large for that, you’ll likely want to iterate over the collection to train a model, which is the topic of the next page.

import lamindb as ln
import bionty as bt

→ connected lamindb: testuser1/test-scrna

ln.context.uid = "mfWKm8OtAzp80000"
ln.context.track()

→ notebook imports: bionty==0.49.0 lamindb==0.76.2 scanpy==1.10.2

→ created Transform('mfWKm8OtAzp80000') & created Run('2024-08-28 16:09:33.564978+00:00')

ln.Collection.df()

	uid	version	is_latest	name	description	hash	reference	reference_type	visibility	transform_id	meta_artifact_id	run_id	created_by_id	updated_at
id
2	GADJXgAxQnlJ5fuc0001	2	True	My versioned scRNA-seq collection	None	dBJLoG6NFZ8WwlWqnfyFdQ	None	None	1	2	None	2	1	2024-08-28 16:09:22.896162+00:00
1	GADJXgAxQnlJ5fuc0000	None	False	My versioned scRNA-seq collection	None	exJtsBYH53iiebYH-Qx0sw	None	None	1	1	None	1	1	2024-08-28 16:09:22.883567+00:00

collection = ln.Collection.get(
    name="My versioned scRNA-seq collection", version="2"
)

collection.artifacts.df()

	uid	version	is_latest	description	key	suffix	type	size	hash	n_objects	n_observations	_hash_type	_accessor	visibility	_key_is_virtual	storage_id	transform_id	run_id	created_by_id	updated_at
id
1	CPEgHimtxrJYYqN70000	None	True	Human immune cells from Conde22	None	.h5ad	dataset	57612943	9sXda5E7BYiVoDOQkTC0KB	None	1648	sha1-fl	AnnData	1	True	1	1	1	1	2024-08-28 16:08:53.383893+00:00
2	hFbQL54nW2ecIrnb0000	None	True	10x reference adata	None	.h5ad	dataset	853388	jxR7kj0-xk-84u5sv3J9CQ	None	70	md5	AnnData	1	True	1	2	2	1	2024-08-28 16:09:21.727718+00:00

If the collection isn’t too large, we can now load it into memory.

Under the hood, the AnnData objects are concatenated during loading.

How long this takes depends on several factors, such as the number and size of the artifacts.

If you load the collection often, consider storing a concatenated version of it rather than the individual pieces.

adata = collection.load()

By default, concatenation uses an outer join, as in pandas:

adata

AnnData object with n_obs × n_vars = 1718 × 36503
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain', 'donor', 'tissue', 'assay', 'artifact_uid'
    obsm: 'X_pca', 'X_umap'
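
To make the join semantics concrete, here is a minimal pure-Python sketch of outer versus inner joins along the variable axis. It is not how AnnData or pandas implement concatenation; the helper `concat_rows`, the gene names, and the values are hypothetical and only illustrate that an outer join keeps the union of variables, while an inner join keeps only the shared ones.

```python
# Pure-Python sketch of outer vs. inner joins along the variable axis.
# The helper, gene names, and values are hypothetical toy data.

def concat_rows(tables, join="outer", fill=None):
    """Stack row-wise tables (each a list of dicts keyed by column name)."""
    # Collect the set of columns present in each table.
    column_sets = [set().union(*(row.keys() for row in t)) for t in tables]
    if join == "outer":
        columns = sorted(set().union(*column_sets))
    elif join == "inner":
        columns = sorted(set.intersection(*column_sets))
    else:
        raise ValueError(f"unknown join: {join!r}")
    # Re-key every row to the joined columns, filling gaps with `fill`.
    rows = [
        {col: row.get(col, fill) for col in columns}
        for table in tables
        for row in table
    ]
    return columns, rows


table_a = [{"GENE1": 1.0, "GENE2": 0.0}, {"GENE1": 2.0, "GENE2": 3.0}]
table_b = [{"GENE2": 5.0, "GENE3": 4.0}]

outer_columns, outer_rows = concat_rows([table_a, table_b], join="outer")
inner_columns, inner_rows = concat_rows([table_a, table_b], join="inner")

print(outer_columns)  # union of all columns
print(inner_columns)  # only the columns shared by every table
```

In this toy example, the outer join keeps all three genes and fills the entries a table does not have with a placeholder, which mirrors why the number of variables in the loaded object can exceed that of any single artifact.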

The AnnData object references the individual artifacts in its .obs annotations:

adata.obs.artifact_uid.cat.categories

Index(['hFbQL54nW2ecIrnb0000', 'CPEgHimtxrJYYqN70000'], dtype='object')
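
Because every observation carries the uid of its source artifact, per-artifact counts can be recovered from the concatenated object. The sketch below uses a plain Python list as a hypothetical stand-in for the `artifact_uid` column; the counts are taken from the `n_observations` column of the artifact table above.

```python
from collections import Counter

# Hypothetical stand-in for adata.obs["artifact_uid"]: one uid per observation.
# The counts match the n_observations values listed for the two artifacts.
artifact_uids = (
    ["CPEgHimtxrJYYqN70000"] * 1648 + ["hFbQL54nW2ecIrnb0000"] * 70
)

# Count how many observations originate from each artifact.
cells_per_artifact = Counter(artifact_uids)
total_cells = sum(cells_per_artifact.values())

for uid, n_cells in cells_per_artifact.items():
    print(uid, n_cells)
print("total:", total_cells)
```

The total agrees with the 1718 observations of the loaded object. On the real object, `adata.obs["artifact_uid"].value_counts()` should give the same per-artifact breakdown.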

We can easily obtain Ensembl IDs for gene symbols using the lookup object:

genes = bt.Gene.lookup(field="symbol")

genes.itm2b.ensembl_gene_id

'ENSG00000136156'
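
Conceptually, a lookup object exposes records through attribute access on a normalized field value. The following is a minimal sketch of that idea, not bionty's implementation: `make_lookup` and the second record are hypothetical, and only the ITM2B Ensembl ID is taken from the output above.

```python
from types import SimpleNamespace

# Toy records; only the ITM2B Ensembl ID comes from the output above,
# the second record is a made-up placeholder.
records = [
    {"symbol": "ITM2B", "ensembl_gene_id": "ENSG00000136156"},
    {"symbol": "HYPOTHETICAL-GENE", "ensembl_gene_id": "ENSG_PLACEHOLDER"},
]


def make_lookup(records, field):
    """Expose records as attributes named after a normalized field value."""
    lookup = SimpleNamespace()
    for record in records:
        # Lowercase and replace non-alphanumeric characters with underscores.
        key = "".join(c if c.isalnum() else "_" for c in record[field]).lower()
        setattr(lookup, key, SimpleNamespace(**record))
    return lookup


toy_genes = make_lookup(records, field="symbol")
print(toy_genes.itm2b.ensembl_gene_id)
```

Normalizing the keys is what allows a symbol such as ITM2B to be reached as `itm2b`, with tab completion in an interactive session.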

Let us create a plot:

import scanpy as sc

sc.pp.pca(adata, n_comps=2)

sc.pl.pca(
    adata,
    color=genes.itm2b.ensembl_gene_id,
    title=(
        f"{genes.itm2b.symbol} / {genes.itm2b.ensembl_gene_id} /"
        f" {genes.itm2b.description}"
    ),
    save="_itm2b",
)

WARNING: saving figure to file figures/pca_itm2b.pdf

[Figure: PCA plot of the loaded collection, colored by ITM2B expression]

We could save the plot as a PDF artifact and then view it in the lineage graph:

artifact = ln.Artifact("./figures/pca_itm2b.pdf", description="My result on ITM2B")
artifact.save()
artifact.view_lineage()

But given that the image is part of the notebook, we can also rely on the report that is created when we save the notebook:

ln.context.finish()