Gene

lamindb provides access to the following public gene ontologies through bionty:

  1. Ensembl

  2. NCBI Gene

Here we show how to access and search gene ontologies to standardize new data.

# pip install 'lamindb[bionty]'
!lamin init --storage ./test-public-ontologies --schema bionty
import bionty as bt
import pandas as pd

PublicOntology objects

Let us create a public ontology accessor with public(), which chooses a default public ontology source from Source. It returns a PublicOntology object, which you can think of as a public registry:

public = bt.Gene.public(organism="human")
public
→ connected lamindb: testuser1/test-public-ontologies
PublicOntology
Entity: Gene
Organism: human
Source: ensembl, release-112
#terms: 75829

As with registries, you can export the ontology as a DataFrame:

df = public.df()
df.head()
   ensembl_gene_id  symbol  ncbi_gene_id  biotype         description                                        synonyms
0  ENSG00000000003  TSPAN6  7105          protein_coding  tetraspanin 6                                      TSPAN-6|T245|TM4SF6
1  ENSG00000000005  TNMD    64102         protein_coding  tenomodulin                                        TEM|MYODULIN|CHM1L|TENDIN|BRICD4
2  ENSG00000000419  DPM1    8813          protein_coding  dolichyl-phosphate mannosyltransferase subunit...  CDGIE|MPDS
3  ENSG00000000457  SCYL3   57147         protein_coding  SCY1 like pseudokinase 3                           PACE1|PACE-1
4  ENSG00000000460  FIRRM   55732         protein_coding  FIGNL1 interacting regulator of recombination ...  C1ORF112|FLJ10706|APOLO1|FLIP|MEICA1
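
The synonyms column in this export is a single pipe-delimited string per gene. If you need one row per synonym, a minimal pandas sketch looks like the following. It uses two rows copied from the output above; the names `df_small` and `synonym_table` are illustrative and not part of bionty.

```python
import pandas as pd

# Two rows copied from the export above; only the columns needed here.
df_small = pd.DataFrame(
    {
        "symbol": ["TSPAN6", "TNMD"],
        "synonyms": ["TSPAN-6|T245|TM4SF6", "TEM|MYODULIN|CHM1L|TENDIN|BRICD4"],
    }
)

# Split each pipe-delimited string into a list, then emit one row per synonym.
synonym_table = (
    df_small.assign(synonym=df_small["synonyms"].str.split("|"))
    .explode("synonym")[["symbol", "synonym"]]
    .reset_index(drop=True)
)
print(synonym_table)
```

The long format makes it easy to join or filter on individual synonyms with ordinary pandas operations.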

Unlike registries, you can also export it as a Pronto object via public.ontology.

Look up terms

As with registries, terms can be looked up with auto-complete:

lookup = public.lookup()

The . accessor provides normalized terms (lowercase, containing only alphanumeric characters and underscores):

lookup.tcf7
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', biotype='protein_coding', description='transcription factor 7 ', synonyms='TCF-1')

To look up the exact original strings, convert the lookup object to a dict and use the [] accessor:

lookup_dict = lookup.dict()
lookup_dict["TCF7"]
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', biotype='protein_coding', description='transcription factor 7 ', synonyms='TCF-1')

By default, lookup keys are generated from the entity's name field (for genes, the symbol, as in the examples above). You can specify another field to look up:

lookup = public.lookup(public.ncbi_gene_id)

If multiple entries are matched, they are returned as a list:

lookup.bt_100126572
Gene(ensembl_gene_id='ENSG00000203733', symbol='GJE1', ncbi_gene_id='100126572', biotype='protein_coding', description='gap junction protein epsilon 1 ', synonyms='CX23')
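
To make the key normalization concrete, here is a rough sketch of how such attribute-style keys could be derived. It is not bionty's implementation: `to_lookup_key` is a hypothetical helper, and the rule of prepending `bt_` to keys that start with a digit is inferred from the `lookup.bt_100126572` example above.

```python
import re


def to_lookup_key(value: str, prefix: str = "bt") -> str:
    """Hypothetical helper: turn an arbitrary string into an attribute-style key."""
    # Lowercase, then collapse every run of non-alphanumeric characters into "_".
    key = re.sub(r"[^0-9a-z]+", "_", value.lower())
    # Python attribute names cannot start with a digit, so prepend a prefix
    # (assumption based on the bt_100126572 key shown above).
    if key[:1].isdigit():
        key = f"{prefix}_{key}"
    return key


for original in ["TCF7", "100126572", "HLA-DRB1"]:
    print(original, "->", to_lookup_key(original))
```

Because several original strings can collapse to the same normalized key, the dict accessor shown earlier remains the way to retrieve an entry by its exact original string.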

Search terms

Search behaves in the same way as it does for registries:

public.search("TP53").head(3)
          ensembl_gene_id  ncbi_gene_id  biotype         description        synonyms  __ratio__
symbol
TP53      ENSG00000141510  7157          protein_coding  tumor protein p53  LFS1|P53  100.0
TP53TG3D  ENSG00000205456  102723655     protein_coding  TP53 target 3D               90.0
TP53TG3C  ENSG00000205457  24150         protein_coding  TP53 target 3C               90.0

By default, search also covers synonyms:

public.search("PDL1").head(3)
          ensembl_gene_id  ncbi_gene_id  biotype               description                                        synonyms                             __ratio__
symbol
CD274     ENSG00000120217  29126         protein_coding        CD274 molecule                                     PD-L1|PDCD1LG1|B7H1|PDL1|B7-H1|B7-H  100.0
GAPDHP69  ENSG00000223460  None          processed_pseudogene  glyceraldehyde 3 phosphate dehydrogenase pseud...  GAPDL14|GAPDHL14                     90.0
GAPDHP68  ENSG00000233876  None          processed_pseudogene  glyceraldehyde 3 phosphate dehydrogenase pseud...  GAPDHL13|GAPDL13                     90.0

You can turn off synonym matching by passing synonyms_field=None:

public.search("PDL1", synonyms_field=None).head(3)
        ensembl_gene_id  ncbi_gene_id  biotype         description                                        synonyms                  __ratio__
symbol
SPDL1   ENSG00000040275  54908         protein_coding  spindle apparatus coiled-coil protein 1            CCDC99|HSPINDLY|FLJ20364  88.888889
PODNL1  ENSG00000132000  79883         protein_coding  podocan like 1                                     SLRR5B|FLJ23447           80.000000
PKD2L1  ENSG00000107593  9033          protein_coding  polycystin 2 like 1, transient receptor potent...  PKD2L|PCL|PKDL|TRPP3      80.000000
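
The __ratio__ column reads like a string-similarity score on a 0 to 100 scale. To illustrate why including synonyms changes the top hit, here is a toy search built on difflib from the Python standard library. This is a sketch under stated assumptions, not bionty's search algorithm: `similarity` and `toy_search` are hypothetical helpers, and the three-row table is a hand-picked excerpt of the results above.

```python
from difflib import SequenceMatcher

import pandas as pd

# Hand-picked excerpt of the symbols and synonyms shown above.
genes = pd.DataFrame(
    {
        "symbol": ["CD274", "SPDL1", "TP53"],
        "synonyms": [
            "PD-L1|PDCD1LG1|B7H1|PDL1|B7-H1|B7-H",
            "CCDC99|HSPINDLY|FLJ20364",
            "LFS1|P53",
        ],
    }
)


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0 to 100 scale."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def toy_search(query: str, table: pd.DataFrame, use_synonyms: bool = True) -> pd.DataFrame:
    """Score each row by its best-matching symbol or (optionally) synonym."""

    def score(row: pd.Series) -> float:
        candidates = [row["symbol"]]
        if use_synonyms and row["synonyms"]:
            candidates += row["synonyms"].split("|")
        return max(similarity(query, candidate) for candidate in candidates)

    scored = table.assign(ratio=table.apply(score, axis=1))
    return scored.sort_values("ratio", ascending=False).set_index("symbol")


print(toy_search("PDL1", genes))
print(toy_search("PDL1", genes, use_synonyms=False))
```

With synonyms enabled, CD274 ranks first because PDL1 is one of its synonyms; with synonyms disabled, only the symbols are compared and SPDL1 becomes the closest match, mirroring the two outputs above.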

Search a different field (the default is the name field, which is the symbol for genes):

public.search("tumor protein p53", field=public.description).head()
                   ensembl_gene_id  symbol  ncbi_gene_id  biotype         synonyms                                           __ratio__
description
tumor protein p53  ENSG00000141510  TP53    7157          protein_coding  LFS1|P53                                           100.000000
tumor protein p73  ENSG00000078900  TP73    7161          protein_coding  P73                                                94.117647
tumor protein p63  ENSG00000073282  TP63    8626          protein_coding  TP53CP|TP73L|P73L|OFC8|EEC3|P51|P53CP|SHFM4|TP...  94.117647
tumor protein D52  ENSG00000076554  TPD52   124188259     protein_coding  D52|HD52|N8L                                       88.235294
tumor protein D52  ENSG00000076554  TPD52   7163          protein_coding  D52|HD52|N8L                                       88.235294

Standardize gene identifiers

Let us generate a DataFrame that stores a number of gene identifiers, some of which are corrupted:

data = {
    "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
    "ncbi id": ["29974", "1", "5133", "corrupted"],
    "ensembl_gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000188389",
        "ENSGcorrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")
df_orig
                 gene symbol  ncbi id
ensembl_gene_id
ENSG00000148584  A1CF         29974
ENSG00000121410  A1BG         1
ENSG00000188389  FANCD1       5133
ENSGcorrupted    corrupted    corrupted

First, we check which of our index values validate against the ontology reference:

validated = public.validate(df_orig.index, public.ensembl_gene_id)
df_orig.index[~validated]
! 1 term (25.00%) is not validated: ENSGcorrupted
Index(['ENSGcorrupted'], dtype='object', name='ensembl_gene_id')
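
Conceptually, validation returns a boolean mask saying which values occur in the chosen reference field. The sketch below mimics that idea with plain pandas membership testing; it is an illustration only, not how validate() is implemented, and `reference_ids` is a stand-in for the ensembl_gene_id column of the public ontology.

```python
import pandas as pd

# Stand-in for the reference field (here: three known Ensembl gene IDs).
reference_ids = pd.Series(
    ["ENSG00000148584", "ENSG00000121410", "ENSG00000188389"]
)

query = pd.Index(
    ["ENSG00000148584", "ENSG00000121410", "ENSG00000188389", "ENSGcorrupted"],
    name="ensembl_gene_id",
)

# Boolean mask: True where the query value is present in the reference.
validated = query.isin(reference_ids)
not_validated = query[~validated]
share_not_validated = 100 * (~validated).mean()

print(
    f"{len(not_validated)} term ({share_not_validated:.2f}%) "
    f"is not validated: {', '.join(not_validated)}"
)
```

Because the result is an ordinary boolean array, it can be negated with ~ and used directly for indexing, as in the cell above.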

Next, we validate the NCBI gene IDs and the gene symbols against the ontology:

# based on NCBI gene ID
public.validate(df_orig["ncbi id"], public.ncbi_gene_id)
! 1 term (25.00%) is not validated: corrupted
array([ True,  True,  True, False])
# based on Gene symbols
validated_symbols = public.validate(df_orig["gene symbol"], public.symbol)
df_orig["gene symbol"][~validated_symbols]
! 2 terms (50.00%) are not validated: FANCD1, corrupted
ensembl_gene_id
ENSG00000188389       FANCD1
ENSGcorrupted      corrupted
Name: gene symbol, dtype: object

Here, two of the gene symbols are not validated. Use inspect() to find out why:

public.inspect(df_orig["gene symbol"], public.symbol);
! 2 terms (50.00%) are not validated for symbol: FANCD1, corrupted
   detected 1 terms with synonym: FANCD1
→  standardize terms via .standardize()

The log suggests using .standardize():

mapped_symbol_synonyms = public.standardize(df_orig["gene symbol"])
mapped_symbol_synonyms
['A1CF', 'A1BG', 'BRCA2', 'corrupted']

Optionally, you can return a mapper in the form of {synonym1: standardized_name1, ...}:

public.standardize(df_orig["gene symbol"], return_mapper=True)
{'FANCD1': 'BRCA2'}
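
The idea behind synonym-based standardization can be sketched in a few lines: build a synonym-to-symbol dictionary from the reference table, keep values that are already valid symbols, and map known synonyms to their standard symbol. The code below is a simplified illustration, not bionty's implementation. `build_synonym_mapper` and `toy_standardize` are hypothetical helpers, and the reference table is abbreviated: it lists only the FANCD1 synonym that the output above confirms.

```python
import pandas as pd

# Abbreviated toy reference: only the synonym confirmed by the output above.
reference = pd.DataFrame(
    {
        "symbol": ["A1CF", "A1BG", "BRCA2"],
        "synonyms": ["", "", "FANCD1"],
    }
)


def build_synonym_mapper(table: pd.DataFrame) -> dict:
    """Map every pipe-delimited synonym to its standard symbol."""
    mapper = {}
    for symbol, synonyms in zip(table["symbol"], table["synonyms"]):
        for synonym in filter(None, synonyms.split("|")):
            # Keep the first symbol seen if a synonym is ambiguous.
            mapper.setdefault(synonym, symbol)
    return mapper


def toy_standardize(values, table: pd.DataFrame) -> list:
    """Replace known synonyms; leave valid symbols and unknown values as-is."""
    known_symbols = set(table["symbol"])
    mapper = build_synonym_mapper(table)
    return [v if v in known_symbols else mapper.get(v, v) for v in values]


print(toy_standardize(["A1CF", "A1BG", "FANCD1", "corrupted"], reference))
print(build_synonym_mapper(reference))
```

Values that are neither a symbol nor a known synonym, such as "corrupted", pass through unchanged, which matches the behavior shown above.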

We can use the standardized symbols as the new index:

df_curated = df_orig.reset_index()
df_curated.index = mapped_symbol_synonyms
df_curated
           ensembl_gene_id  gene symbol  ncbi id
A1CF       ENSG00000148584  A1CF         29974
A1BG       ENSG00000121410  A1BG         1
BRCA2      ENSG00000188389  FANCD1       5133
corrupted  ENSGcorrupted    corrupted    corrupted

You can convert identifiers by passing return_field to standardize():

public.standardize(
    df_curated.index,
    field=public.symbol,
    return_field=public.ensembl_gene_id,
)
['ENSG00000148584', 'ENSG00000121410', 'ENSG00000139618', 'corrupted']

Pass return_mapper=True to return the mappable identifiers as a dict:

public.standardize(
    df_curated.index,
    field=public.symbol,
    return_field=public.ensembl_gene_id,
    return_mapper=True,
)
{'A1BG': 'ENSG00000121410',
 'BRCA2': 'ENSG00000139618',
 'A1CF': 'ENSG00000148584'}
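
One way to use such a mapper is to apply it with pandas and flag whatever could not be mapped. The sketch below copies the dict returned above; the variable names are illustrative.

```python
import pandas as pd

# Mapper copied from the standardize() output above.
symbol_to_ensembl = {
    "A1BG": "ENSG00000121410",
    "BRCA2": "ENSG00000139618",
    "A1CF": "ENSG00000148584",
}

symbols = pd.Series(["A1CF", "A1BG", "BRCA2", "corrupted"], name="symbol")

# Series.map leaves NaN wherever a symbol has no entry in the mapper.
ensembl_ids = symbols.map(symbol_to_ensembl)
unmapped = symbols[ensembl_ids.isna()]

print(ensembl_ids)
print("unmapped:", unmapped.tolist())
```

Unlike standardize() with return_field, which passes unmappable values through unchanged, Series.map yields missing values for them, so they are easy to isolate for manual review.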

Ontology source versions

For any given entity, we can choose from a number of versions:

bt.Gene.list_source().df()
uid entity organism name version in_db currently_used description url md5 source_website dataframe_artifact_id run_id created_by_id updated_at
id
11 4UGN bionty.Gene human ensembl release-112 False True Ensembl s3://bionty-assets/df_human__ensembl__release-... 4ccda4d88720a326737376c534e8446b https://www.ensembl.org None None 1 2024-08-28 16:06:12.911961+00:00
12 1HoN bionty.Gene human ensembl release-111 False False Ensembl s3://bionty-assets/df_human__ensembl__release-... f9183bc44abb34459984e137b5de8af1 https://www.ensembl.org None None 1 2024-08-28 16:06:12.911998+00:00
13 5dmX bionty.Gene human ensembl release-110 False False Ensembl s3://bionty-assets/df_human__ensembl__release-... 832f3947e83664588d419608a469b528 https://www.ensembl.org None None 1 2024-08-28 16:06:12.912035+00:00
14 404r bionty.Gene human ensembl release-109 False False Ensembl s3://bionty-assets/human_ensembl_release-109_G... 72da9968c74e96d136a489a6102a4546 https://www.ensembl.org None None 1 2024-08-28 16:06:12.912072+00:00
15 4r4f bionty.Gene mouse ensembl release-112 False True Ensembl s3://bionty-assets/df_mouse__ensembl__release-... 519cf7b8acc3c948274f66f3155a3210 https://www.ensembl.org None None 1 2024-08-28 16:06:12.912109+00:00
16 5yZh bionty.Gene mouse ensembl release-111 False False Ensembl s3://bionty-assets/df_mouse__ensembl__release-... 5c071655347458307ac92b208f3c903a https://www.ensembl.org None None 1 2024-08-28 16:06:12.912146+00:00
17 34Tj bionty.Gene mouse ensembl release-110 False False Ensembl s3://bionty-assets/df_mouse__ensembl__release-... fa4ce130f2929aefd7ac3bc8eaf0c4de https://www.ensembl.org None None 1 2024-08-28 16:06:12.912182+00:00
18 PGj9 bionty.Gene mouse ensembl release-109 False False Ensembl s3://bionty-assets/mouse_ensembl_release-109_G... 08a1165061151b270b985317322bd2ed https://www.ensembl.org None None 1 2024-08-28 16:06:12.912219+00:00
19 4RPA bionty.Gene saccharomyces cerevisiae ensembl release-112 False True Ensembl s3://bionty-assets/df_saccharomyces cerevisiae... 11775126b101233525a0a9e2dd64edae https://www.ensembl.org None None 1 2024-08-28 16:06:12.912255+00:00
20 4Yyq bionty.Gene saccharomyces cerevisiae ensembl release-111 False False Ensembl s3://bionty-assets/df_saccharomyces cerevisiae... a15fab1d9d15a56d32fd2fd8a8fa250a https://www.ensembl.org None None 1 2024-08-28 16:06:12.912291+00:00
21 772a bionty.Gene saccharomyces cerevisiae ensembl release-110 False False Ensembl s3://bionty-assets/df_saccharomyces cerevisiae... 2e59495a3e87ea6575e408697dd73459 https://www.ensembl.org None None 1 2024-08-28 16:06:12.912327+00:00
# only lists the sources that are currently used
bt.Gene.list_source(currently_used=True).df()
uid entity organism name version in_db currently_used description url md5 source_website dataframe_artifact_id run_id created_by_id updated_at
id
11 4UGN bionty.Gene human ensembl release-112 False True Ensembl s3://bionty-assets/df_human__ensembl__release-... 4ccda4d88720a326737376c534e8446b https://www.ensembl.org None None 1 2024-08-28 16:06:12.911961+00:00
15 4r4f bionty.Gene mouse ensembl release-112 False True Ensembl s3://bionty-assets/df_mouse__ensembl__release-... 519cf7b8acc3c948274f66f3155a3210 https://www.ensembl.org None None 1 2024-08-28 16:06:12.912109+00:00
19 4RPA bionty.Gene saccharomyces cerevisiae ensembl release-112 False True Ensembl s3://bionty-assets/df_saccharomyces cerevisiae... 11775126b101233525a0a9e2dd64edae https://www.ensembl.org None None 1 2024-08-28 16:06:12.912255+00:00
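
In the listings above, each organism has exactly one gene source flagged as currently used. The following sketch reproduces that filter on a small hand-built table with plain pandas. It stands in for the Source registry only for illustration: the rows are an excerpt of the human and mouse entries shown above, and `version_by_organism` is an illustrative name.

```python
import pandas as pd

# Excerpt of the human and mouse rows from the source listing above.
sources = pd.DataFrame(
    {
        "organism": ["human", "human", "mouse", "mouse"],
        "name": ["ensembl"] * 4,
        "version": ["release-112", "release-111", "release-112", "release-111"],
        "currently_used": [True, False, True, False],
    }
)

# Keep only the sources flagged as currently used.
current = sources[sources["currently_used"]]

# Summarize which version is active for each organism.
version_by_organism = dict(zip(current["organism"], current["version"]))
print(version_by_organism)
```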

When creating a PublicOntology object, we can choose a specific source or version:

source = bt.Source.get(
    name="ensembl", version="release-112", organism="human"
)
public = bt.Gene.public(source=source)
public
PublicOntology
Entity: Gene
Organism: human
Source: ensembl, release-112
#terms: 75829

The currently used ontologies can be displayed using:

bt.Source.filter(currently_used=True).df()
uid entity organism name version in_db currently_used description url md5 source_website dataframe_artifact_id run_id created_by_id updated_at
id
1 33TU bionty.Organism vertebrates ensembl release-112 False True Ensembl https://ftp.ensembl.org/pub/release-112/specie... 0ec37e77f4bc2d0b0b47c6c62b9f122d https://www.ensembl.org None None 1 2024-08-28 16:06:12.911561+00:00
6 6bbV bionty.Organism bacteria ensembl release-57 False True Ensembl https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacte... ee28510ed5586ea7ab4495717c96efc8 https://www.ensembl.org None None 1 2024-08-28 16:06:12.911772+00:00
7 6s9n bionty.Organism fungi ensembl release-57 False True Ensembl http://ftp.ensemblgenomes.org/pub/fungi/releas... dbcde58f4396ab8b2480f7fe9f83df8a https://www.ensembl.org None None 1 2024-08-28 16:06:12.911810+00:00
8 2PmT bionty.Organism metazoa ensembl release-57 False True Ensembl http://ftp.ensemblgenomes.org/pub/metazoa/rele... 424636a574fec078a61cbdddb05f9132 https://www.ensembl.org None None 1 2024-08-28 16:06:12.911848+00:00
9 7GPH bionty.Organism plants ensembl release-57 False True Ensembl https://ftp.ensemblgenomes.ebi.ac.uk/pub/plant... eadaa1f3e527e4c3940c90c7fa5c8bf4 https://www.ensembl.org None None 1 2024-08-28 16:06:12.911886+00:00
10 4tsk bionty.Organism all ncbitaxon 2023-06-20 False True NCBItaxon Ontology s3://bionty-assets/df_all__ncbitaxon__2023-06-... 00d97ba65627f1cd65636d2df22ea76c https://github.com/obophenotype/ncbitaxon None None 1 2024-08-28 16:06:12.911923+00:00
11 4UGN bionty.Gene human ensembl release-112 False True Ensembl s3://bionty-assets/df_human__ensembl__release-... 4ccda4d88720a326737376c534e8446b https://www.ensembl.org None None 1 2024-08-28 16:06:12.911961+00:00
15 4r4f bionty.Gene mouse ensembl release-112 False True Ensembl s3://bionty-assets/df_mouse__ensembl__release-... 519cf7b8acc3c948274f66f3155a3210 https://www.ensembl.org None None 1 2024-08-28 16:06:12.912109+00:00
19 4RPA bionty.Gene saccharomyces cerevisiae ensembl release-112 False True Ensembl s3://bionty-assets/df_saccharomyces cerevisiae... 11775126b101233525a0a9e2dd64edae https://www.ensembl.org None None 1 2024-08-28 16:06:12.912255+00:00
22 3EYy bionty.Protein human uniprot 2024-03 False True Uniprot s3://bionty-assets/df_human__uniprot__2024-03_... b5b9e7645065b4b3187114f07e3f402f https://www.uniprot.org None None 1 2024-08-28 16:06:12.912364+00:00
25 01RW bionty.Protein mouse uniprot 2024-03 False True Uniprot s3://bionty-assets/df_mouse__uniprot__2024-03_... b1b6a196eb853088d36198d8e3749ec4 https://www.uniprot.org None None 1 2024-08-28 16:06:12.912473+00:00
28 3kDh bionty.CellMarker human cellmarker 2.0 False True CellMarker s3://bionty-assets/human_cellmarker_2.0_CellMa... d565d4a542a5c7e7a06255975358e4f4 http://bio-bigdata.hrbmu.edu.cn/CellMarker None None 1 2024-08-28 16:06:12.912582+00:00
29 7bV5 bionty.CellMarker mouse cellmarker 2.0 False True CellMarker s3://bionty-assets/mouse_cellmarker_2.0_CellMa... 189586732c63be949e40dfa6a3636105 http://bio-bigdata.hrbmu.edu.cn/CellMarker None None 1 2024-08-28 16:06:12.912619+00:00
30 6LyR bionty.CellLine all clo 2022-03-21 False True Cell Line Ontology https://data.bioontology.org/ontologies/CLO/su... ea58a1010b7e745702a8397a526b3a33 https://bioportal.bioontology.org/ontologies/CLO None None 1 2024-08-28 16:06:12.912655+00:00
32 1Lhf bionty.CellType all cl 2024-05-15 False True Cell Ontology http://purl.obolibrary.org/obo/cl/releases/202... 8a8638a9e79567935793e5007704c650 https://obophenotype.github.io/cell-ontology None None 1 2024-08-28 16:06:12.912727+00:00
39 MUtA bionty.Tissue all uberon 2024-08-07 False True Uberon multi-species anatomy ontology http://purl.obolibrary.org/obo/uberon/releases... http://obophenotype.github.io/uberon None None 1 2024-08-28 16:06:12.912977+00:00
47 2L2r bionty.Disease all mondo 2024-06-04 False True Mondo Disease Ontology http://purl.obolibrary.org/obo/mondo/releases/... c47e8edb894c01f2511dfe0751fbc428 https://mondo.monarchinitiative.org None None 1 2024-08-28 16:06:12.913268+00:00
54 4ksw bionty.Disease human doid 2024-05-29 False True Human Disease Ontology http://purl.obolibrary.org/obo/doid/releases/2... bbefd72247d638edfcd31ec699947407 https://disease-ontology.org None None 1 2024-08-28 16:06:12.913523+00:00
62 69Xc bionty.ExperimentalFactor all efo 3.66.0 False True The Experimental Factor Ontology http://www.ebi.ac.uk/efo/releases/v3.66.0/efo.owl 6bd24217c740af7e1e771c1dabc9680b https://bioportal.bioontology.org/ontologies/EFO None None 1 2024-08-28 16:06:12.913823+00:00
67 48fB bionty.Phenotype human hp 2024-04-26 False True Human Phenotype Ontology https://github.com/obophenotype/human-phenotyp... e0f2e534eb2ad44a4d45573ef27b508f https://hpo.jax.org None None 1 2024-08-28 16:06:12.916315+00:00
72 4t7Q bionty.Phenotype mammalian mp 2024-06-18 False True Mammalian Phenotype Ontology https://github.com/mgijax/mammalian-phenotype-... 795d8378fe48ec13b41d01a86dd1c86c https://github.com/mgijax/mammalian-phenotype-... None None 1 2024-08-28 16:06:12.916497+00:00
75 sqPX bionty.Phenotype zebrafish zp 2024-04-18 False True Zebrafish Phenotype Ontology https://github.com/obophenotype/zebrafish-phen... 2231ebaa95becf8ff34a33c95a8d4350 https://github.com/obophenotype/zebrafish-phen... None None 1 2024-08-28 16:06:12.916602+00:00
79 6S4q bionty.Phenotype all pato 2024-03-28 False True Phenotype And Trait Ontology http://purl.obolibrary.org/obo/pato/releases/2... 6b1eaacd3d453b34375ce2e31c16328a https://github.com/pato-ontology/pato None None 1 2024-08-28 16:06:12.916748+00:00
81 7Ent bionty.Pathway all go 2024-06-17 False True Gene Ontology https://data.bioontology.org/ontologies/GO/sub... 7fa7ade5e3e26eab3959a7e4bc89ad4f http://geneontology.org None None 1 2024-08-28 16:06:12.916819+00:00
86 3rm9 BFXPipeline all lamin 1.0.0 False True Bioinformatics Pipeline s3://bionty-assets/df_all__lamin__1.0.0__BFXpi... https://lamin.ai None None 1 2024-08-28 16:06:12.916994+00:00
87 ugaI Drug all dron 2024-08-05 False True Drug Ontology https://data.bioontology.org/ontologies/DRON/s... https://bioportal.bioontology.org/ontologies/DRON None None 1 2024-08-28 16:06:12.917028+00:00
91 1GbF bionty.DevelopmentalStage human hsapdv 2024-05-28 False True Human Developmental Stages https://github.com/obophenotype/developmental-... https://github.com/obophenotype/developmental-... None None 1 2024-08-28 16:06:12.917166+00:00
93 10va bionty.DevelopmentalStage mouse mmusdv 2024-05-28 False True Mouse Developmental Stages https://github.com/obophenotype/developmental-... https://github.com/obophenotype/developmental-... None None 1 2024-08-28 16:06:12.917234+00:00
95 MJRq bionty.Ethnicity human hancestro 3.0 False True Human Ancestry Ontology https://github.com/EBISPOT/hancestro/raw/3.0/h... 76dd9efda9c2abd4bc32fc57c0b755dd https://github.com/EBISPOT/hancestro None None 1 2024-08-28 16:06:12.917304+00:00
96 5JnV BioSample all ncbi 2023-09 False True NCBI BioSample attributes s3://bionty-assets/df_all__ncbi__2023-09__BioS... 918db9bd1734b97c596c67d9654a4126 https://www.ncbi.nlm.nih.gov/biosample/docs/at... None None 1 2024-08-28 16:06:12.917339+00:00