How do I validate & annotate arbitrary data structures?
This guide walks through the low-level API that lets you validate iterables.
You can then use the records created during validation to annotate a dataset.
How do I validate based on a public ontology?
LaminDB makes it easy to validate categorical variables based on registries that inherit from CanCurate.
CanCurate methods validate against the registries in your LaminDB instance.
In Manage biological registries, you’ll see how to extend standard validation to validation against public references using a PublicOntology object, e.g., via public_genes = bt.Gene.public().
By default, from_values() considers a match in a public reference a validated value for any bionty entity.
# pip install 'lamindb[bionty,zarr]'
!lamin init --storage ./test-curate-any --modules bionty
Define a test dataset.
import lamindb as ln
import bionty as bt
import zarr
import numpy as np
data = zarr.open_group(store="data.zarr", mode="a")
data.create_dataset(name="temperature", shape=(3,), dtype="float32")
data.create_dataset(name="knockout_gene", shape=(3,), dtype=str)
data.create_dataset(name="disease", shape=(3,), dtype=str)
data["knockout_gene"][:] = np.array(
    ["ENSG00000139618", "ENSG00000141510", "ENSG00000133703"]
)
data["disease"][:] = np.random.default_rng().choice(
    ["MONDO:0004975", "MONDO:0004980"], 3
)
→ connected lamindb: testuser1/test-curate-any
Validate and standardize vectors
Read the disease array from the zarr group into memory.
disease = data["disease"][:]
validate() validates vector-like values against reference values in a registry.
It returns a boolean vector indicating where a value has an exact match in the reference values.
bt.Disease.validate(disease, field=bt.Disease.ontology_id)
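To make the meaning of the boolean vector concrete, here is a minimal pure-Python sketch of exact-match validation. It is not LaminDB's implementation: the helper `exact_match_mask` and the `reference_ids` set are hypothetical stand-ins for a registry field and its reference values.

```python
from typing import Iterable, List, Set


def exact_match_mask(values: Iterable[str], reference: Set[str]) -> List[bool]:
    """Return True for each value that has an exact match in `reference`."""
    return [value in reference for value in values]


# Hypothetical reference values standing in for a registry's ontology_id field.
reference_ids = {"MONDO:0004975", "MONDO:0004980"}

# The second value differs in case, so it is not an exact match.
mask = exact_match_mask(
    ["MONDO:0004975", "mondo:0004980", "MONDO:0004980"], reference_ids
)
print(mask)  # [True, False, True]
```

The point of the sketch is that matching is exact: a value that differs only in case or spelling does not count as validated.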
When validation fails, you can call inspect() to figure out what to do.
inspect() applies the same definition of validation as validate(), but returns a richer InspectResult object. Most importantly, it logs the recommended curation steps that would render the data validated.
Note: you can use standardize() to standardize synonyms.
bt.Disease.inspect(disease, field=bt.Disease.ontology_id)
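The idea behind inspecting and then standardizing can be sketched in plain Python. Everything below is an illustration under stated assumptions, not LaminDB's code: `InspectSketch`, `inspect_sketch`, `standardize_sketch`, and the toy reference and synonym data are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class InspectSketch:
    """Hypothetical result container: values split by validation status."""

    validated: List[str] = field(default_factory=list)
    non_validated: List[str] = field(default_factory=list)
    # Non-validated values that could be fixed by mapping a synonym.
    synonyms_mapper: Dict[str, str] = field(default_factory=dict)


def inspect_sketch(
    values: Iterable[str], reference: Set[str], synonyms: Dict[str, str]
) -> InspectSketch:
    """Split values into validated / non-validated and collect fixable synonyms."""
    result = InspectSketch()
    for value in values:
        if value in reference:
            result.validated.append(value)
        else:
            result.non_validated.append(value)
            if value in synonyms:
                result.synonyms_mapper[value] = synonyms[value]
    return result


def standardize_sketch(values: Iterable[str], synonyms: Dict[str, str]) -> List[str]:
    """Replace known synonyms by their standard name; leave other values as-is."""
    return [synonyms.get(value, value) for value in values]


# Toy data, made up for illustration only.
reference_names = {"disease A", "disease B"}
synonym_map = {"dis. A": "disease A"}
raw = ["disease A", "dis. A", "unknown"]

result = inspect_sketch(raw, reference_names, synonym_map)
standardized = standardize_sketch(raw, synonym_map)
print(result.non_validated)  # ['dis. A', 'unknown']
print(standardized)          # ['disease A', 'disease A', 'unknown']
```

In this toy run, one non-validated value becomes valid after standardization, while the other would need a new record or a manual correction.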
Bulk-creating records with from_values() returns only validated records.
diseases = bt.Disease.from_values(disease, field=bt.Disease.ontology_id).save()
Repeat the process for more labels:
projects = ln.ULabel.from_values(
    ["Project A", "Project B"],
    field=ln.ULabel.name,
    create=True,  # create non-validated labels
).save()
genes = bt.Gene.from_values(
    data["knockout_gene"][:], field=bt.Gene.ensembl_gene_id
).save()
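The difference between the default behavior of the from_values() calls above and the call with create=True can be illustrated with a small pure-Python sketch. It is a simplification under assumptions, not LaminDB's implementation: `records_from_values` is a hypothetical helper, records are plain dictionaries, and producing one record per distinct value is an assumption of the sketch.

```python
from typing import Dict, Iterable, List, Set


def records_from_values(
    values: Iterable[str], reference: Set[str], create: bool = False
) -> List[Dict[str, object]]:
    """Build one record per distinct value.

    By default, only values found in `reference` yield a record.
    With create=True, non-validated values yield a record as well.
    """
    records = []
    # Assumption of this sketch: de-duplicate while preserving input order.
    for value in dict.fromkeys(values):
        validated = value in reference
        if validated or create:
            records.append({"name": value, "validated": validated})
    return records


known_labels = {"Project A"}  # hypothetical existing labels
labels = ["Project A", "Project B", "Project A"]

only_validated = records_from_values(labels, known_labels)
with_created = records_from_values(labels, known_labels, create=True)
print([r["name"] for r in only_validated])  # ['Project A']
print([r["name"] for r in with_created])    # ['Project A', 'Project B']
```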
Annotate the dataset
Register the dataset as an artifact:
artifact = ln.Artifact("data.zarr", key="my_dataset.zarr").save()
Annotate with features:
ln.Feature(name="project", dtype=ln.ULabel).save()
ln.Feature(name="disease", dtype=bt.Disease.ontology_id).save()
ln.Feature(name="knockout_gene", dtype=bt.Gene.ensembl_gene_id).save()
artifact.features.add_values(
    {"project": projects, "knockout_gene": genes, "disease": diseases}
)
artifact.describe()
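The keys of the dictionary passed to add_values() are the names of the features saved just before. As a rough sketch of checking such keys against a set of known feature names, the snippet below uses a hypothetical helper, `unknown_feature_keys`. It does not call LaminDB and makes no claim about how LaminDB performs this check internally.

```python
from typing import Dict, List, Set


def unknown_feature_keys(values: Dict[str, object], feature_names: Set[str]) -> List[str]:
    """Return the dictionary keys that do not correspond to a known feature name."""
    return sorted(key for key in values if key not in feature_names)


# Feature names as defined in the guide; the values are placeholders.
known_features = {"project", "disease", "knockout_gene"}
annotation = {"project": "...", "knockout_gene": "...", "disease": "..."}

missing = unknown_feature_keys(annotation, known_features)
print(missing)  # []

# A misspelled key is reported instead of being silently accepted.
typo = unknown_feature_keys({"diseaze": "..."}, known_features)
print(typo)  # ['diseaze']
```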