Arc Virtual Cell Atlas: scRNA-seq

The Arc Virtual Cell Atlas hosts one of the largest collections of scRNA-seq datasets.

Lamin mirrors the dataset for simplified access here: laminlabs/arc-virtual-cell-atlas.

If you use the data academically, please cite the original publications, Youngblut et al. (2025) and Zhang et al. (2025).

Connect to the source instance.

# pip install 'lamindb[jupyter,bionty,wetlab,gcp]'
!lamin connect laminlabs/arc-virtual-cell-atlas

Note

If you want to transfer artifacts or metadata into your own instance, use .using("laminlabs/arc-virtual-cell-atlas") when accessing registries and then call .save() (see Transfer data).

import lamindb as ln
import bionty as bt
import wetlab as wl
import pyarrow.compute as pc
import anndata as ad

Tahoe-100M

project_tahoe = ln.Project.get(name="Tahoe-100M")
project_tahoe

Project(uid='H5MwZwyA62rG', name='Tahoe-100M', is_type=False, url='https://arcinstitute.org/tools/virtualcellatlas', branch_id=1, space_id=1, created_by_id=1, created_at=2025-02-26 16:03:40 UTC)

# one collection in this project
project_tahoe.collections.df()

	uid	key	description	hash	reference	reference_type	space_id	meta_artifact_id	version	is_latest	run_id	created_at	created_by_id	_aux	branch_id
id
1	BpavRL4ntRTzWEE50000	tahoe100	None	GCLk4ZgQxgWspjmEUk3gIg	None	None	1	None	2025-02-25	True	3	2025-02-26 13:51:22.787537+00:00	1	None	1

Every individual dataset in the atlas is an .h5ad file that is registered as an artifact in LaminDB.

Artifact-level metadata are registered and can be explored as follows:

# get the collection: https://lamin.ai/laminlabs/arc-virtual-cell-atlas/collection/BpavRL4ntRTzWEE5
collection_tahoe = ln.Collection.get(key="tahoe100")
# 14 artifacts in this collection, each corresponding to one plate
artifacts_tahoe = collection_tahoe.artifacts.distinct()
artifacts_tahoe.df()

	uid	key	description	suffix	kind	otype	size	hash	n_files	n_observations	_hash_type	_key_is_virtual	_overwrite_versions	space_id	storage_id	schema_id	version	is_latest	run_id	created_at	created_by_id	_aux	branch_id
id
1362	56uA9lPPmJ4zLUcr0000	2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WS...	None	.h5ad	dataset	AnnData	26536400717	j1FXsX7hs7u+eBqnWnmNHw	None	8044908	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:17.849980+00:00	1	None	1
1365	9L9HZ55HqUL0aqaR0000	2025-02-25/h5ad/plate13_filt_Vevo_Tahoe100M_WS...	None	.h5ad	dataset	AnnData	28071589885	RKOiaay+CHvv+Ukk/N+28A	None	8501658	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:18.977981+00:00	1	None	1
1372	aAHQ3zbD7n1asyYr0000	2025-02-25/h5ad/plate6_filt_Vevo_Tahoe100M_WSe...	None	.h5ad	dataset	AnnData	28934897078	NYvQEqVClziHm0ozWhOw1w	None	7545393	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:21.629962+00:00	1	None	1
1367	aJIqo7bNyJAs9z0r0000	2025-02-25/h5ad/plate1_filt_Vevo_Tahoe100M_WSe...	None	.h5ad	dataset	AnnData	19070623904	9iCNcouMqfNS3HA/2GUWOA	None	5481420	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:19.737995+00:00	1	None	1
1375	BDttiuV3Te8VB0dU0000	2025-02-25/h5ad/plate9_filt_Vevo_Tahoe100M_WSe...	None	.h5ad	dataset	AnnData	18791302576	4kHbVbmreg6akW6ZgsjxaA	None	5866669	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:22.759201+00:00	1	None	1
1374	czC19UpUEszVH2bU0000	2025-02-25/h5ad/plate8_filt_Vevo_Tahoe100M_WSe...	None	.h5ad	dataset	AnnData	30390935958	ilAzEPIh4FlDeTFaJ1dILw	None	8880979	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:22.387666+00:00	1	None	1
1373	DC5cacdJr1VoEXnl0000	2025-02-25/h5ad/plate7_filt_Vevo_Tahoe100M_WSe...	None	.h5ad	dataset	AnnData	16514746341	NOS4MY6eYYPOnAB8ViyWYg	None	5692117	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:22.009157+00:00	1	None	1
1371	EZATJLC4jE7pmwo40000	2025-02-25/h5ad/plate5_filt_Vevo_Tahoe100M_WSe...	None	.h5ad	dataset	AnnData	19763140865	VMBKFzOI5cj7UC1UDENP4A	None	6419498	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:21.255154+00:00	1	None	1
1363	omn7JStfJMzy8m6O0000	2025-02-25/h5ad/plate11_filt_Vevo_Tahoe100M_WS...	None	.h5ad	dataset	AnnData	23230802756	N2mzoYlMLEl6PdecaYyDvw	None	7435869	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:18.229629+00:00	1	None	1
1364	S2h2rPLCaUhZAM9u0000	2025-02-25/h5ad/plate12_filt_Vevo_Tahoe100M_WS...	None	.h5ad	dataset	AnnData	37495736876	VjAkWVFGVpzAMi9Innusuw	None	10487057	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:18.600910+00:00	1	None	1
1370	tKTeff0ugWqAm4P70000	2025-02-25/h5ad/plate4_filt_Vevo_Tahoe100M_WSe...	None	.h5ad	dataset	AnnData	23292672278	BkBXznbSovNWXtzPFITPcQ	None	7004356	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:20.879928+00:00	1	None	1
1366	vn5cUJCHbjpPPsZx0000	2025-02-25/h5ad/plate14_filt_Vevo_Tahoe100M_WS...	None	.h5ad	dataset	AnnData	22427932564	FrnStRehP16siRGG35ou+g	None	6518806	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:19.357999+00:00	1	None	1
1369	XVSrkq9pyF1OBLgG0000	2025-02-25/h5ad/plate3_filt_Vevo_Tahoe100M_WSe...	None	.h5ad	dataset	AnnData	13173722269	Jnrt7DaSUCGn8D8LS2itaw	None	4705402	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:20.497965+00:00	1	None	1
1368	ZFeVfd0ugAHeWCxm0000	2025-02-25/h5ad/plate2_filt_Vevo_Tahoe100M_WSe...	None	.h5ad	dataset	AnnData	29037152127	usxviuqGbuw0RYnECCVCWw	None	8064658	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:20.113956+00:00	1	None	1

50 cell lines.

artifacts_tahoe.list("cell_lines__name")[:5]

['A-172', 'A-427', 'A498', 'A549', 'AN3 CA']

380 compounds.

artifacts_tahoe.list("compounds__name")[:5]

['18β-Glycyrrhetinic acid',
 '4EGI-1',
 '5-Azacytidine',
 '5-Fluorouracil',
 '8-Hydroxyquinoline']

1,138 perturbations.

artifacts_tahoe.list("compound_perturbations__name")[:5]

["[('18β-Glycyrrhetinic acid', 0.05, 'uM')]",
 "[('18β-Glycyrrhetinic acid', 0.5, 'uM')]",
 "[('18β-Glycyrrhetinic acid', 5.0, 'uM')]",
 "[('4EGI-1', 0.05, 'uM')]",
 "[('4EGI-1', 0.5, 'uM')]"]
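
Each perturbation name shown above is the string form of a Python list of (compound, concentration, unit) tuples. A minimal sketch for turning such a name back into structured values, assuming the names are always valid Python literals as in this output; the helper `parse_perturbation` is hypothetical and not part of LaminDB:

```python
import ast

# Names copied from the output above; each is the repr of a list of tuples.
names = [
    "[('18β-Glycyrrhetinic acid', 0.05, 'uM')]",
    "[('4EGI-1', 0.5, 'uM')]",
]


def parse_perturbation(name):
    """Return a list of (compound, concentration, unit) tuples."""
    # literal_eval only evaluates literals, so it is safe on untrusted strings
    return ast.literal_eval(name)


for name in names:
    for compound, concentration, unit in parse_perturbation(name):
        print(f"{compound}: {concentration} {unit}")
```

`ast.literal_eval` is used instead of `eval` because it refuses anything other than plain literals.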

# check the curated metadata of the first artifact
artifact1 = artifacts_tahoe[0]
artifact1.describe()

Artifact .h5ad · AnnData · dataset
├── General
│   ├── uid: 56uA9lPPmJ4zLUcr0000          hash: j1FXsX7hs7u+eBqnWnmNHw
│   ├── size: 24.7 GB                      n_observations: 8044908
│   ├── space: all                         branch: main
│   ├── created_at: 2025-02-25 23:22:17    created_by: sunnyosun (Sunny Sun)
│   ├── key: 2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WServicesFrom_ParseGigalab.h5ad
│   ├── storage location / path: 
│   │   gs://arc-ctc-tahoe100/2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WServicesFrom_ParseGigalab.h5ad
│   └── transform: register-tahoe100.ipynb
├── Dataset features
│   ├── var • 62710                     [bionty.Gene.stable_id]                                                    
│   │   TSPAN6                          float                                                                      
│   │   TNMD                            float                                                                      
│   │   DPM1                            float                                                                      
│   │   SCYL3                           float                                                                      
│   │   C1orf112                        float                                                                      
│   │   FGR                             float                                                                      
│   │   CFH                             float                                                                      
│   │   FUCA2                           float                                                                      
│   │   GCLC                            float                                                                      
│   │   NFYA                            float                                                                      
│   │   STPG1                           float                                                                      
│   │   NIPAL3                          float                                                                      
│   │   LAS1L                           float                                                                      
│   │   ENPP4                           float                                                                      
│   │   SEMA3F                          float                                                                      
│   │   CFTR                            float                                                                      
│   │   ANKIB1                          float                                                                      
│   │   CYP51A1                         float                                                                      
│   │   KRIT1                           float                                                                      
│   │   RAD52                           float                                                                      
│   └── obs • 16                        [Feature]                                                                  
│       cell_line                       cat[bionty.CellLine.description]   A-172, A-427, A498, A549, AN3 CA, AsPC-…
│       cell_name                       cat[bionty.CellLine]               A-172, A-427, A498, A549, AN3 CA, AsPC-…
│       drug                            cat[wetlab.Compound]               5-Azacytidine, 5-Fluorouracil, Abirater…
│       drugname_drugconc               cat[wetlab.CompoundPerturbation]   [('5-Azacytidine', 0.05, 'uM')], [('5-F…
│       pass_filter                     cat[ULabel[PassFilter]]            full, minimal                           
│       phase                           cat[ULabel[Phase]]                 G1, G2M, S                              
│       plate                           cat[ULabel[Plate]]                 plate10                                 
│       sample                          cat[wetlab.Biosample]              smp_2359, smp_2360, smp_2361, smp_2362,…
│       gene_count                      int                                                                        
│       tscp_count                      int                                                                        
│       mread_count                     int                                                                        
│       pcnt_mito                       float                                                                      
│       S_score                         float                                                                      
│       G2M_score                       float                                                                      
│       sublibrary                      str                                                                        
│       BARCODE                         str                                                                        
└── Labels
    └── .references                     Reference                          Tahoe-100M: A Giga-Scale Single-Cell Pe…
        .projects                       Project                            Tahoe-100M                              
        .compounds                      wetlab.Compound                    Bestatin (hydrochloride), Ataluren, Can…
        .compound_perturbations         wetlab.CompoundPerturbation        [('Bestatin (hydrochloride)', 0.05, 'uM…
        .biosamples                     wetlab.Biosample                   smp_2359, smp_2360, smp_2361, smp_2362,…
        .organisms                      bionty.Organism                    human                                   
        .cell_lines                     bionty.CellLine                    NCI-H1573, NCI-H460, hTERT-HPNE, SW48, …
        .ulabels                        ULabel                             plate10, G1, G2M, S, full, minimal
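
The `size` column of the artifact table is in bytes, while the summary above prints 24.7 GB. The two numbers agree if the summary uses binary units (1 GB = 2**30 bytes); that reading is an inference from the values shown here, not a documented guarantee. A quick check:

```python
size_bytes = 26_536_400_717  # `size` of the plate10 artifact in the table above

size_binary = size_bytes / 2**30   # gibibytes (binary units)
size_decimal = size_bytes / 10**9  # gigabytes (decimal units)

print(f"{size_binary:.1f} GiB, {size_decimal:.1f} GB")
# prints "24.7 GiB, 26.5 GB"
```

Plan local disk space for `.cache()` from the byte count rather than from the rounded figure.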

16 obs metadata features.

artifact1.features["obs"].df()

/tmp/ipykernel_3754/2428349911.py:1: FutureWarning: Use slots[slot].members instead of __getitem__, __getitem__ will be removed in the future.
  artifact1.features["obs"].df()

	uid	name	dtype	is_type	unit	description	array_rank	array_size	array_shape	proxy_dtype	synonyms	_expect_many	_curation	space_id	type_id	run_id	created_at	created_by_id	_aux	branch_id
id
9	bujDkB4Nd1S5	S_score	float	None	None	Inferred S phase score	0	0	None	None	None	True	None	1	None	3	2025-02-25 22:31:22.144135+00:00	1	{'af': {'0': None, '1': True}}	1
3	PVpyJhciLdCQ	pass_filter	cat[ULabel[PassFilter]]	None	None	"Full" filters are more stringent on gene_coun...	0	0	None	None	None	True	None	1	None	3	2025-02-25 22:25:30.918235+00:00	1	{'af': {'0': None, '1': True}}	1
7	PZDiL36nJSFv	mread_count	int	None	None	Number of reads per cell	0	0	None	None	None	True	None	1	None	3	2025-02-25 22:30:31.810331+00:00	1	{'af': {'0': None, '1': True}}	1
4	vshELphl73qp	cell_line	cat[bionty.CellLine.description]	None	None	Cell line information (if applicable)	0	0	None	None	None	True	None	1	None	3	2025-02-25 22:27:22.393997+00:00	1	{'af': {'0': None, '1': True}}	1
1	YRSYWdIiesqL	plate	cat[ULabel[Plate]]	None	None	Plate identifier	0	0	None	None	None	True	None	1	None	3	2025-02-25 22:03:51.786985+00:00	1	{'af': {'0': None, '1': True}}	1
19	gQE1h3fIBiSf	sample	cat[wetlab.Biosample]	None	None	Unique treatment identifier, distinguishes rep...	0	0	None	None	None	True	None	1	None	3	2025-02-26 10:59:36.743558+00:00	1	{'af': {'0': None, '1': True}}	1
5	IjSP1lCY3Hyw	gene_count	int	None	None	Number of genes with at least one count	0	0	None	None	None	True	None	1	None	3	2025-02-25 22:30:30.668750+00:00	1	{'af': {'0': None, '1': True}}	1
6	LHUmmYKjIGPl	tscp_count	int	None	None	Number of transcripts, aka UMI count	0	0	None	None	None	True	None	1	None	3	2025-02-25 22:30:31.236532+00:00	1	{'af': {'0': None, '1': True}}	1
18	fLwdFKBUhBY9	drugname_drugconc	cat[wetlab.CompoundPerturbation]	None	None	Drug name, concentration, and concentration unit	0	0	None	None	None	True	None	1	None	3	2025-02-25 23:04:17.541812+00:00	1	{'af': {'0': None, '1': True}}	1
17	Q0cj2JR5Juwn	drug	cat[wetlab.Compound]	None	None	Drug name, parsed out from the drugname_drugco...	0	0	None	None	None	True	None	1	None	3	2025-02-25 23:02:05.717794+00:00	1	{'af': {'0': None, '1': True}}	1
15	3X4d0QEUuprp	sublibrary	str	None	None	Sublibrary ID (related to library prep and seq...	0	0	None	None	None	True	None	1	None	3	2025-02-25 22:35:14.673178+00:00	1	{'af': {'0': None, '1': True}}	1
16	dQELv2sIVnJX	BARCODE	str	None	None	Barcode ID	0	0	None	None	None	True	None	1	None	3	2025-02-25 22:35:15.627971+00:00	1	{'af': {'0': None, '1': True}}	1
8	X640W5tBUPOQ	pcnt_mito	float	None	None	Percentage of mitochondrial reads	0	0	None	None	None	True	None	1	None	3	2025-02-25 22:31:21.581885+00:00	1	{'af': {'0': None, '1': True}}	1
10	CF0O0e0WZxFz	G2M_score	float	None	None	Inferred G2M score	0	0	None	None	None	True	None	1	None	3	2025-02-25 22:31:22.708895+00:00	1	{'af': {'0': None, '1': True}}	1
2	QboQ1Q1Yxsjn	phase	cat[ULabel[Phase]]	None	None	Inferred cell cycle phase	0	0	None	None	None	True	None	1	None	3	2025-02-25 22:21:56.935262+00:00	1	{'af': {'0': None, '1': True}}	1
11	KPT70T8xJLIt	cell_name	cat[bionty.CellLine]	None	None	Commonly-used cell name (related to the cell_l...	0	0	None	None	None	True	None	1	None	3	2025-02-25 22:32:56.082195+00:00	1	{'af': {'0': None, '1': True}}	1

Query artifacts of interest based on metadata

Since all metadata are registered in the SQL database, we can explore the datasets without opening the underlying files.

Let’s find which datasets contain A549 cells perturbed with Piroxicam.

# lookup objects give you pythonic access to the values
cell_lines = bt.CellLine.lookup("ontology_id")
drugs = wl.Compound.lookup()

artifacts_a549_piroxicam = artifacts_tahoe.filter(
    cell_lines=cell_lines.cvcl_0023, compounds=drugs.piroxicam
)
artifacts_a549_piroxicam.df()

	uid	key	description	suffix	kind	otype	size	hash	n_files	n_observations	_hash_type	_key_is_virtual	_overwrite_versions	space_id	storage_id	schema_id	version	is_latest	run_id	created_at	created_by_id	_aux	branch_id
id
1362	56uA9lPPmJ4zLUcr0000	2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WS...	None	.h5ad	dataset	AnnData	26536400717	j1FXsX7hs7u+eBqnWnmNHw	None	8044908	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:17.849980+00:00	1	None	1
1363	omn7JStfJMzy8m6O0000	2025-02-25/h5ad/plate11_filt_Vevo_Tahoe100M_WS...	None	.h5ad	dataset	AnnData	23230802756	N2mzoYlMLEl6PdecaYyDvw	None	7435869	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:18.229629+00:00	1	None	1
1364	S2h2rPLCaUhZAM9u0000	2025-02-25/h5ad/plate12_filt_Vevo_Tahoe100M_WS...	None	.h5ad	dataset	AnnData	37495736876	VjAkWVFGVpzAMi9Innusuw	None	10487057	md5	False	False	1	2	3	None	True	1	2025-02-25 23:22:18.600910+00:00	1	None	1

You can download an .h5ad into your local cache:

artifact1.cache()

Or stream it:

artifact1.open()

Open the obs metadata parquet file as a PyArrow Dataset

Open the obs metadata file (2.29 GB) as a PyArrow Dataset.

obs_metadata = ln.Artifact.filter(
    key__endswith="obs_metadata.parquet", projects=project_tahoe
).one()
obs_metadata

Artifact(uid='y1TTR9wbrmZEwpOa0000', is_latest=True, key='2025-02-25/metadata/obs_metadata.parquet', suffix='.parquet', kind='dataset', otype='DataFrame', size=2293981573, hash='qEWOpGw9CmQVzaElyMWT1Q', n_observations=100648790, branch_id=1, space_id=1, storage_id=2, run_id=1, created_by_id=1, created_at=2025-02-25 19:33:42 UTC)
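
As a sanity check, the `n_observations` of this parquet file should equal the total number of cells across the 14 plate-level .h5ad artifacts. The sketch below simply re-adds the values copied from the artifact table earlier on this page; it does not query the instance:

```python
# n_observations per plate, copied from the artifact table above
n_observations = {
    "plate1": 5_481_420,
    "plate2": 8_064_658,
    "plate3": 4_705_402,
    "plate4": 7_004_356,
    "plate5": 6_419_498,
    "plate6": 7_545_393,
    "plate7": 5_692_117,
    "plate8": 8_880_979,
    "plate9": 5_866_669,
    "plate10": 8_044_908,
    "plate11": 7_435_869,
    "plate12": 10_487_057,
    "plate13": 8_501_658,
    "plate14": 6_518_806,
}

total_cells = sum(n_observations.values())
print(f"{len(n_observations)} plates, {total_cells:,} cells")
# prints "14 plates, 100,648,790 cells"
```

The total matches `n_observations=100648790` reported for obs_metadata.parquet.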

obs_metadata_ds = obs_metadata.open()
obs_metadata_ds.schema

Find the A549 cells that are perturbed with Piroxicam:

filter_expr = (pc.field("cell_name") == cell_lines.cvcl_0023.name) & (
    pc.field("drug") == drugs.piroxicam.name
)
obs_metadata_df = obs_metadata_ds.scanner(filter=filter_expr).to_table().to_pandas()
obs_metadata_df.value_counts("plate")

plate
plate12    2818
plate10    2812
plate11    2279
Name: count, dtype: int64

obs_metadata_df.head()

	plate	BARCODE_SUB_LIB_ID	sample	gene_count	tscp_count	mread_count	drugname_drugconc	drug	cell_line	sublibrary	BARCODE	pcnt_mito	S_score	G2M_score	phase	pass_filter	cell_name
29314	plate10	50_030_183-lib_1681	smp_2408	644	863	1024	[('Piroxicam', 0.05, 'uM')]	Piroxicam	CVCL_0023	lib_1681	50_030_183	0.101970	-0.282297	-0.165568	G1	full	A549
29337	plate10	50_035_135-lib_1681	smp_2408	1130	1570	1827	[('Piroxicam', 0.05, 'uM')]	Piroxicam	CVCL_0023	lib_1681	50_035_135	0.077070	-0.335042	-0.280220	G1	full	A549
29338	plate10	50_035_171-lib_1681	smp_2408	1058	1534	1809	[('Piroxicam', 0.05, 'uM')]	Piroxicam	CVCL_0023	lib_1681	50_035_171	0.124511	-0.402028	-0.404579	G1	full	A549
29352	plate10	50_038_157-lib_1681	smp_2408	1265	1883	2240	[('Piroxicam', 0.05, 'uM')]	Piroxicam	CVCL_0023	lib_1681	50_038_157	0.147106	-0.455343	-0.311355	G1	full	A549
29355	plate10	50_039_078-lib_1681	smp_2408	1355	1914	2258	[('Piroxicam', 0.05, 'uM')]	Piroxicam	CVCL_0023	lib_1681	50_039_078	0.070010	-0.349396	0.186264	G2M	full	A549

Retrieve the corresponding cells from h5ad files.

# map each plate to the list of matching cell IDs found above
plate_cells = obs_metadata_df.groupby("plate")["BARCODE_SUB_LIB_ID"].apply(list)

adatas = []
for artifact in artifacts_a549_piroxicam:
    plate = artifact.features.get_values()["plate"]
    idxs = plate_cells.get(plate)
    print(f"Loading {len(idxs)} cells from plate {plate}")
    with artifact.open() as store:
        adata = store[idxs].to_memory()  # can also subset genes here
        adatas.append(adata)

scBaseCount

project_scbasecount = ln.Project.get(name="scBaseCount")
project_scbasecount

Project(uid='vdK00t9DGwHP', name='scBaseCount', is_type=False, url='https://arcinstitute.org/tools/virtualcellatlas', branch_id=1, space_id=1, created_by_id=1, created_at=2025-02-26 16:04:08 UTC)

This project has 105 collections (21 organisms x 5 count features):

project_scbasecount.collections.df()

	uid	key	description	hash	reference	reference_type	space_id	meta_artifact_id	version	is_latest	run_id	created_at	created_by_id	_aux	branch_id
id
81	5iQtPFoyW3VA8gUO0000	scBaseCount/GeneFull_ExonOverIntron/Pan_troglo...	None	NJ27SxZhEUjXc4NGBxDFig	None	None	1	None	2025-02-25	True	10	2025-03-03 11:07:06.118837+00:00	1	None	1
31	0qwMfYdB4HMAfm5J0000	scBaseCount/GeneFull_ExonOverIntron/Drosophila...	None	9_xsnr1W0pjqB6vMGY27Kg	None	None	1	None	2025-02-25	True	10	2025-03-03 11:02:06.418327+00:00	1	None	1
87	QyeOMM8Qu2Yc637f0000	scBaseCount/Velocyto/Schistosoma_mansoni	None	7XZzjMBlIJQMqrcOhYFQYQ	None	None	1	None	2025-02-25	True	10	2025-03-03 11:07:36.194395+00:00	1	None	1
71	rForlsvLjM8zEgbO0000	scBaseCount/GeneFull_ExonOverIntron/Oryza_sativa	None	SqNuN0qVtQskeDnAZPRLrQ	None	None	1	None	2025-02-25	True	10	2025-03-03 11:06:15.137130+00:00	1	None	1
68	wXctL2347aWNGnf90000	scBaseCount/Gene/Oryza_sativa	None	LTqCz0GuUi1CnbHM_zi9qw	None	None	1	None	2025-02-25	True	10	2025-03-03 11:06:00.109765+00:00	1	None	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
55	BLamUQZhqBTnHG4K0000	scBaseCount/GeneFull_Ex50pAS/Homo_sapiens	None	SLBug97gNkMCZ3Gd2Bp1Aw	None	None	1	None	2025-02-25	True	10	2025-03-03 11:04:28.695376+00:00	1	None	1
27	2wPZaiNxigodW7X60000	scBaseCount/Velocyto/Danio_rerio	None	ceCKmkcgKyk_bRHhjGodTQ	None	None	1	None	2025-02-25	True	10	2025-03-03 11:01:45.771604+00:00	1	None	1
23	kXjTL9XbRysx3A8P0000	scBaseCount/Gene/Danio_rerio	None	TOhVCAQMVTRO8VD27SF6WQ	None	None	1	None	2025-02-25	True	10	2025-03-03 11:01:25.162863+00:00	1	None	1
58	TMcFueJifRSFVrSq0000	scBaseCount/Gene/Macaca_mulatta	None	OuNCmFSkmfKiLjvGEbBVKw	None	None	1	None	2025-02-25	True	10	2025-03-03 11:05:04.524140+00:00	1	None	1
8	ttGkPgXxLDO4sSXF0000	scBaseCount/Gene/Bos_taurus	None	jn1Nhcdt0lpB1I3hQ4SgFw	None	None	1	None	2025-02-25	True	10	2025-03-03 11:00:09.130314+00:00	1	None	1

105 rows × 15 columns
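
The collection keys in this table appear to follow the layout scBaseCount/<count feature>/<organism>. That layout is inferred from the untruncated keys shown above, not from a documented contract. A small sketch that splits such keys and groups count features by organism; `split_key` is a hypothetical helper:

```python
from collections import defaultdict

# Untruncated keys copied from the table above
keys = [
    "scBaseCount/Velocyto/Schistosoma_mansoni",
    "scBaseCount/GeneFull_ExonOverIntron/Oryza_sativa",
    "scBaseCount/Gene/Oryza_sativa",
    "scBaseCount/GeneFull_Ex50pAS/Homo_sapiens",
    "scBaseCount/Velocyto/Danio_rerio",
    "scBaseCount/Gene/Danio_rerio",
    "scBaseCount/Gene/Macaca_mulatta",
    "scBaseCount/Gene/Bos_taurus",
]


def split_key(key):
    """Split a collection key into (count_feature, organism)."""
    _project, count_feature, organism = key.split("/")
    return count_feature, organism


features_by_organism = defaultdict(set)
for key in keys:
    count_feature, organism = split_key(key)
    features_by_organism[organism].add(count_feature)

for organism, features in sorted(features_by_organism.items()):
    print(organism, sorted(features))

# 21 organisms x 5 count features gives the 105 collections reported above
assert 21 * 5 == 105
```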

Query artifacts of interest based on metadata

Often you might not want to access all the h5ads in a collection, but rather filter them by metadata:

organisms = bt.Organism.lookup()
tissues = bt.Tissue.lookup()
efos = bt.ExperimentalFactor.lookup()
feature_counts = ln.ULabel.filter(type__name="STARsolo count features").lookup()

h5ads_brain = ln.Artifact.filter(
    suffix=".h5ad",
    projects=project_scbasecount,
    organisms=organisms.human,
    ulabels=feature_counts.genefull_ex50pas,
    tissues=tissues.brain,
    experimental_factors=efos.single_cell,
    experiments__name__contains="CRISPRi",  # `perturbation` column is registered in `wetlab.Experiment`
).distinct()

h5ads_brain.df()

	uid	key	description	suffix	kind	otype	size	hash	n_files	n_observations	_hash_type	_key_is_virtual	_overwrite_versions	space_id	storage_id	schema_id	version	is_latest	run_id	created_at	created_by_id	_aux	branch_id
id
104180	1AlmBH0wFzUqosGV0000	2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/...	None	.h5ad	dataset	AnnData	3448668	A0k605SWKyxecLUFjNqS8A	None	6164	None	False	True	1	3	55	None	True	10	2025-02-28 16:46:25.771217+00:00	1	None	1
104186	24rg7gDQqP0EQRq30000	2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/...	None	.h5ad	dataset	AnnData	35229865	EA3jW7rwaZhIwtZpLLNCQQ	None	7463	None	False	True	1	3	55	None	True	10	2025-02-28 16:46:25.771217+00:00	1	None	1
104204	2vZHojPycv8uPoXp0000	2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/...	None	.h5ad	dataset	AnnData	35133716	Ud5Je3ue2dQcG53leo1nhA	None	4709	None	False	True	1	3	55	None	True	10	2025-02-28 16:46:25.771217+00:00	1	None	1
104174	3EbJEIJnCGqnEMUI0000	2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/...	None	.h5ad	dataset	AnnData	5727864	nddvJ0NRE3/rTAfQgyubow	None	7376	None	False	True	1	3	55	None	True	10	2025-02-28 16:46:25.771217+00:00	1	None	1
104205	3JlzQ4PcN58pOxM50000	2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/...	None	.h5ad	dataset	AnnData	35877513	elUEIdXpHR1xfltqUYPBgw	None	4718	None	False	True	1	3	55	None	True	10	2025-02-28 16:46:25.771217+00:00	1	None	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
104197	Wg6YBPWCwfU4Vr960000	2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/...	None	.h5ad	dataset	AnnData	38354054	JJCCXbqWTaIeV5vJvOllzw	None	7627	None	False	True	1	3	55	None	True	10	2025-02-28 16:46:25.771217+00:00	1	None	1
104170	YqiNrGCXc1cM9Dg90000	2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/...	None	.h5ad	dataset	AnnData	5494309	kMbDZo5QMSt3WzLKZjsdCg	None	7383	None	False	True	1	3	55	None	True	10	2025-02-28 16:46:25.771217+00:00	1	None	1
104219	zAxkTKnxCUEBAibd0000	2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/...	None	.h5ad	dataset	AnnData	37935375	D/xXUsmFZ14802xqd5cWaw	None	7616	None	False	True	1	3	55	None	True	10	2025-02-28 16:46:25.771217+00:00	1	None	1
104206	ZgGYpGntv2sF92Wg0000	2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/...	None	.h5ad	dataset	AnnData	36858036	fUND8GyVTUu3KrDEhmYYLg	None	9128	None	False	True	1	3	55	None	True	10	2025-02-28 16:46:25.771217+00:00	1	None	1
104166	ZmSJbhRC4WeK1nyA0000	2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/...	None	.h5ad	dataset	AnnData	40518635	gdcEf34j7wAVvxcUby9UDw	None	7114	None	False	True	1	3	55	None	True	10	2025-02-28 16:46:25.771217+00:00	1	None	1

64 rows × 23 columns

Load the h5ad files with obs metadata

Load the h5ads as a single AnnData:

adatas = []
for artifact in h5ads_brain[:5]:  # only load the first 5 artifacts to save CI time
    adatas.append(artifact.load())

# sample-level obs metadata is stored separately in parquet files (see below)
adata_concat = ad.concat(adatas)
adata_concat

Open the sample metadata:

sample_meta = ln.Artifact.filter(
    key__endswith="sample_metadata.parquet",
    projects=project_scbasecount,
    organisms=organisms.human,
    ulabels=feature_counts.genefull_ex50pas,
).one()

sample_meta

Artifact(uid='WCHkcyWN8L6pDI4E0000', is_latest=True, key='2025-02-25/metadata/GeneFull_Ex50pAS/Homo_sapiens/sample_metadata.parquet', suffix='.parquet', kind='dataset', otype='DataFrame', size=531878, hash='4QrqW8DQVRl6bKNYiJhq3g', n_observations=16077, branch_id=1, space_id=1, storage_id=3, run_id=2, created_by_id=1, created_at=2025-02-25 20:41:32 UTC)

sample_meta_dataset = sample_meta.open()
sample_meta_dataset.schema

Fetch corresponding sample metadata:

filter_expr = pc.field("srx_accession").isin(
    adata_concat.obs["SRX_accession"].astype(str)
)
df = sample_meta_dataset.scanner(filter=filter_expr).to_table().to_pandas()

Add the sample metadata to the AnnData:

adata_concat.obs = adata_concat.obs.merge(
    df, left_on="SRX_accession", right_on="srx_accession"
)
adata_concat

AnnData object with n_obs × n_vars = 38206 × 36601
    obs: 'gene_count', 'umi_count', 'SRX_accession', 'entrez_id', 'srx_accession', 'file_path', 'obs_count', 'lib_prep', 'tech_10x', 'cell_prep', 'organism', 'tissue', 'disease', 'perturbation', 'cell_line', 'czi_collection_id', 'czi_collection_name'

adata_concat.obs.head()

	gene_count	umi_count	SRX_accession	entrez_id	srx_accession	file_path	obs_count	lib_prep	tech_10x	cell_prep	organism	tissue	disease	perturbation	cell_line	czi_collection_id	czi_collection_name
0	2748	5134.0	SRX10606628	14083632	SRX10606628	gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...	7641	10x_Genomics	3_prime_gex	single_cell	Homo sapiens	brain	Down syndrome	CRISPR/Cas9, CRISPRi, or small-molecule inhibi...	DS1	None	None
1	2351	4639.0	SRX10606628	14083632	SRX10606628	gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...	7641	10x_Genomics	3_prime_gex	single_cell	Homo sapiens	brain	Down syndrome	CRISPR/Cas9, CRISPRi, or small-molecule inhibi...	DS1	None	None
2	2184	4293.0	SRX10606628	14083632	SRX10606628	gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...	7641	10x_Genomics	3_prime_gex	single_cell	Homo sapiens	brain	Down syndrome	CRISPR/Cas9, CRISPRi, or small-molecule inhibi...	DS1	None	None
3	2469	5307.0	SRX10606628	14083632	SRX10606628	gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...	7641	10x_Genomics	3_prime_gex	single_cell	Homo sapiens	brain	Down syndrome	CRISPR/Cas9, CRISPRi, or small-molecule inhibi...	DS1	None	None
4	4144	9340.0	SRX10606628	14083632	SRX10606628	gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...	7641	10x_Genomics	3_prime_gex	single_cell	Homo sapiens	brain	Down syndrome	CRISPR/Cas9, CRISPRi, or small-molecule inhibi...	DS1	None	None
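
Note that `DataFrame.merge` returns a fresh integer index, which is why the rows above are labelled 0, 1, 2, ... rather than by cell barcode. If the original obs index should be kept, one option is a left merge followed by restoring the index. A minimal sketch on toy data (all values invented), assuming each accession appears at most once in the sample metadata:

```python
import pandas as pd

# Toy stand-ins for adata.obs and the sample metadata (values are invented)
obs = pd.DataFrame(
    {"SRX_accession": ["SRX1", "SRX1", "SRX2"]},
    index=["cell_a", "cell_b", "cell_c"],
)
sample_df = pd.DataFrame(
    {"srx_accession": ["SRX1", "SRX2"], "tissue": ["brain", "liver"]}
)

merged = obs.merge(
    sample_df,
    left_on="SRX_accession",
    right_on="srx_accession",
    how="left",               # keep every cell, in the original order
    validate="many_to_one",   # fail loudly if an accession is duplicated
)
# merge() discards the left index; restore the per-cell labels
merged.index = obs.index
print(merged)
```

With `how="left"` and unique keys on the right side, the merged frame has exactly one row per cell in the original order, so reassigning the index is safe.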