Bird’s eye view#

You typically want to know where files & datasets came from.

Here, you’ll backtrace file transformations through notebooks, pipelines & app uploads in a complex research project (based on Schmidt22).

Can you give more concrete reasons why I should care about data lineage?

Data lineage allows you to trace biological insights back to their sources, verify experimental outcomes, meet stringent regulatory standards, and generally increase the reproducibility of scientific discoveries.

While tracking data lineage is easier when it's governed by deterministic pipelines, it becomes hard when it's governed by interactive, human-driven analyses.

This is where LaminDB fills a gap in the tooling landscape.

Hide code cell content
# initialize a test instance for this notebook
# this should be run before importing lamindb in Python
!lamin login testuser1
!lamin delete mydata
!lamin init --storage ./mydata
!lamin login testuser2
# load testuser1's instance using testuser2
!lamin load testuser1/mydata
✅ logged in with email testuser1@lamin.ai and id DzTjkKse
💡 deleting instance testuser1/mydata
🔶 could not delete as instance settings do not exist locally. did you provide a wrong instance name? could you try loading it?
💡 creating schemas: core==0.45.2 
🌱 saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-12 05:45:48)
🌱 saved: Storage(id='kADkunWO', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata', type='local', updated_at=2023-08-12 05:45:48, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/mydata
💡 did not register local instance on hub (if you want, call `lamin register`)

✅ logged in with email testuser2@lamin.ai and id bKeW4T6E
💡 found cached instance metadata: /home/runner/.lamin/instance--testuser1--mydata.env
🌱 saved: User(id='bKeW4T6E', handle='testuser2', email='testuser2@lamin.ai', name='Test User2', updated_at=2023-08-12 05:45:51)
✅ loaded instance: testuser1/mydata

import lamindb as ln
✅ loaded instance: testuser1/mydata (lamindb 0.50.3)
Hide code cell content
# to make this guide's example richer, let's create data registered in uploads and pipeline runs by testuser1:
bfx_run_output = ln.dev.datasets.dir_scrnaseq_cellranger(
    "perturbseq", basedir=ln.settings.storage, output_only=False
)
ln.setup.login("testuser1")
transform = ln.Transform(name="Chromium 10x upload", type="pipeline")
ln.track(transform)
file1 = ln.File(bfx_run_output.parent / "fastq/perturbseq_R1_001.fastq.gz")
file1.save()
file2 = ln.File(bfx_run_output.parent / "fastq/perturbseq_R2_001.fastq.gz")
file2.save()
# let's now log in testuser2 to start the guide
ln.setup.login("testuser2")
✅ logged in with email testuser1@lamin.ai and id DzTjkKse
🌱 saved: Transform(id='7QdAuXkrS6mIz8', name='Chromium 10x upload', stem_id='7QdAuXkrS6mI', version='0', type='pipeline', updated_at=2023-08-12 05:45:53, created_by_id='DzTjkKse')
🌱 saved: Run(id='eQjDbd5C6V61trGBYTae', run_at=2023-08-12 05:45:53, transform_id='7QdAuXkrS6mIz8', created_by_id='DzTjkKse')
💡 file in storage '/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata' with key 'fastq/perturbseq_R1_001.fastq.gz'
💡 file in storage '/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata' with key 'fastq/perturbseq_R2_001.fastq.gz'
✅ logged in with email testuser2@lamin.ai and id bKeW4T6E

Track a bioinformatics pipeline#

When working with a pipeline, we’ll register it before running it.

This only happens once and could be done by anyone on your team.

ln.Transform(name="Cell Ranger", version="7.2.0", type="pipeline").save()

Before running the pipeline, query or search for the corresponding transform record:

transform = ln.Transform.filter(name="Cell Ranger", version="7.2.0").one()

Pass the record to track() to set a global run_context:

ln.track(transform)
✅ loaded: Transform(id='ezGsYAqUOyESsM', name='Cell Ranger', stem_id='ezGsYAqUOyES', version='7.2.0', type='pipeline', updated_at=2023-08-12 05:45:54, created_by_id='bKeW4T6E')
🌱 saved: Run(id='CoRnH4SDheN14LhLMkXg', run_at=2023-08-12 05:45:54, transform_id='ezGsYAqUOyESsM', created_by_id='bKeW4T6E')

Now, let’s stage (download) a few files from an instrument upload:

files = ln.File.filter(key__startswith="fastq/perturbseq").all()
filepaths = [file.stage() for file in files]
💡 adding file IAIffwcqv4Av6edxFaG7 as input for run CoRnH4SDheN14LhLMkXg, adding parent transform 7QdAuXkrS6mIz8
💡 adding file t3Nnv4E8cTQNbxrdnzo1 as input for run CoRnH4SDheN14LhLMkXg, adding parent transform 7QdAuXkrS6mIz8

Assume we processed them and obtained 3 output files in a folder 'filtered_feature_bc_matrix':

ln.File.tree("./mydata/perturbseq/filtered_feature_bc_matrix/")
filtered_feature_bc_matrix (0 sub-directories & 3 files): 
├── features.tsv.gz
├── matrix.mtx.gz
└── barcodes.tsv.gz
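For intuition, the listing printed by `ln.File.tree` resembles a plain `pathlib` walk. A minimal sketch of such a helper (hypothetical, not LaminDB code):

```python
from pathlib import Path


def tree(path: str) -> str:
    """Render a one-level listing similar to ln.File.tree's output."""
    root = Path(path)
    entries = sorted(root.iterdir())
    subdirs = [p for p in entries if p.is_dir()]
    files = [p for p in entries if p.is_file()]
    lines = [f"{root.name} ({len(subdirs)} sub-directories & {len(files)} files):"]
    for i, p in enumerate(files):
        branch = "└──" if i == len(files) - 1 else "├──"
        lines.append(f"{branch} {p.name}")
    return "\n".join(lines)
```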

output_files = ln.File.from_dir("./mydata/perturbseq/filtered_feature_bc_matrix/")
ln.save(output_files)
✅ created 3 files from directory using storage /home/runner/work/lamin-usecases/lamin-usecases/docs/mydata and key = perturbseq/filtered_feature_bc_matrix/
🌱 storing file '87fqD9f1GZUT2ZJ9IZIb' with key 'perturbseq/filtered_feature_bc_matrix/features.tsv.gz'
🌱 storing file 'cG78eIKkYrfcj5KDety9' with key 'perturbseq/filtered_feature_bc_matrix/matrix.mtx.gz'
🌱 storing file 'sVSh07hhgZTTIn9SX3aA' with key 'perturbseq/filtered_feature_bc_matrix/barcodes.tsv.gz'

Each of these files now has transform and run records. For instance:

output_files[0].transform
Transform(id='ezGsYAqUOyESsM', name='Cell Ranger', stem_id='ezGsYAqUOyES', version='7.2.0', type='pipeline', updated_at=2023-08-12 05:45:54, created_by_id='bKeW4T6E')
output_files[0].run
Run(id='CoRnH4SDheN14LhLMkXg', run_at=2023-08-12 05:45:54, transform_id='ezGsYAqUOyESsM', created_by_id='bKeW4T6E')

Let’s look at the data lineage at this stage:

output_files[0].view_lineage()
https://d33wubrfki0l68.cloudfront.net/3125eb18b328957948ecba8c957c77e9ebc1e6f5/ce2dc/_images/f16147c34dfedf0c195cb72f76c7a4bb7800cf8ce42ae0acbdb72f9ddfcf4b8f.svg

And let’s keep running the Cell Ranger pipeline in the background:

Hide code cell content
# continue with more processing steps on the Cell Ranger output data
transform = ln.Transform(
    name="Preprocess Cell Ranger outputs", version="2.0", type="pipeline"
)
ln.track(transform)

[f.stage() for f in output_files]
filepath = ln.dev.datasets.schmidt22_perturbseq(basedir=ln.settings.storage)
file = ln.File(filepath, description="perturbseq counts")
file.save()
🌱 saved: Transform(id='P8YMJSNGbe9z0b', name='Preprocess Cell Ranger outputs', stem_id='P8YMJSNGbe9z', version='2.0', type='pipeline', updated_at=2023-08-12 05:45:54, created_by_id='bKeW4T6E')
🌱 saved: Run(id='ntpBREA7zLyoKlYnOkWe', run_at=2023-08-12 05:45:54, transform_id='P8YMJSNGbe9z0b', created_by_id='bKeW4T6E')
💡 adding file 87fqD9f1GZUT2ZJ9IZIb as input for run ntpBREA7zLyoKlYnOkWe, adding parent transform ezGsYAqUOyESsM
💡 adding file cG78eIKkYrfcj5KDety9 as input for run ntpBREA7zLyoKlYnOkWe, adding parent transform ezGsYAqUOyESsM
💡 adding file sVSh07hhgZTTIn9SX3aA as input for run ntpBREA7zLyoKlYnOkWe, adding parent transform ezGsYAqUOyESsM
💡 file in storage '/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata' with key 'schmidt22_perturbseq.h5ad'
💡 file is AnnDataLike, consider using File.from_anndata() to link var_names and obs.columns as features

Track app upload & analytics#

The hidden cell below simulates additional analytic steps including:

  • uploading phenotypic screen data

  • scRNA-seq analysis

  • analyses of the integrated datasets

Hide code cell content
# app upload
ln.setup.login("testuser1")
transform = ln.Transform(name="Upload GWS CRISPRa result", type="app")
ln.track(transform)

# upload and analyze the GWS data
filepath = ln.dev.datasets.schmidt22_crispra_gws_IFNG(ln.settings.storage)
file = ln.File(filepath, description="Raw data of schmidt22 crispra GWS")
file.save()
ln.setup.login("testuser2")
transform = ln.Transform(name="GWS CRIPSRa analysis", type="notebook")
ln.track(transform)

file_wgs = ln.File.filter(key="schmidt22-crispra-gws-IFNG.csv").one()
df = file_wgs.load().set_index("id")
hits_df = df[df["pos|fdr"] < 0.01].copy()
file_hits = ln.File(hits_df, description="hits from schmidt22 crispra GWS")
file_hits.save()
✅ logged in with email testuser1@lamin.ai and id DzTjkKse
🌱 saved: Transform(id='nXCqeblqzJCIz8', name='Upload GWS CRISPRa result', stem_id='nXCqeblqzJCI', version='0', type='app', updated_at=2023-08-12 05:45:55, created_by_id='DzTjkKse')
🌱 saved: Run(id='4PRgkXShgitGEEUUs7L9', run_at=2023-08-12 05:45:55, transform_id='nXCqeblqzJCIz8', created_by_id='DzTjkKse')
💡 file in storage '/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata' with key 'schmidt22-crispra-gws-IFNG.csv'
✅ logged in with email testuser2@lamin.ai and id bKeW4T6E
🌱 saved: Transform(id='vWgq1bwq94jFz8', name='GWS CRIPSRa analysis', stem_id='vWgq1bwq94jF', version='0', type='notebook', updated_at=2023-08-12 05:45:56, created_by_id='bKeW4T6E')
🌱 saved: Run(id='f9J4ZFdFmnIAL3kk6fG9', run_at=2023-08-12 05:45:56, transform_id='vWgq1bwq94jFz8', created_by_id='bKeW4T6E')
💡 adding file lP2qDxYjzi1ggGJvSRwe as input for run f9J4ZFdFmnIAL3kk6fG9, adding parent transform nXCqeblqzJCIz8
💡 file will be copied to default storage upon `save()` with key 'uNJ663ZfSXZNDB9uWrbx.parquet'
💡 file is a dataframe, consider using File.from_df() to link column names as features
🌱 storing file 'uNJ663ZfSXZNDB9uWrbx' with key '.lamindb/uNJ663ZfSXZNDB9uWrbx.parquet'

Let’s see how the data lineage of this looks:

file = ln.File.filter(description="hits from schmidt22 crispra GWS").one()
file.view_lineage()
https://d33wubrfki0l68.cloudfront.net/0a8bc8eb7e0d86489d75a0420d4989444a5a3ed0/73065/_images/f3e1c2aaac8ebc21320ede594365c86b02ad203eecbfeabafc0238f530d37b74.svg

Track notebooks#

In the background, somebody integrated and analyzed the outputs of the app upload and the Cell Ranger pipeline:

Hide code cell content
# let's add analytics on top of the Cell Ranger pipeline and the phenotypic screen
transform = ln.Transform(
    name="Perform single cell analysis, integrating with CRISPRa screen",
    type="notebook",
)
ln.track(transform)

file_ps = ln.File.filter(description__icontains="perturbseq").one()
adata = file_ps.load()
screen_hits = file_hits.load()
import scanpy as sc

sc.tl.score_genes(adata, adata.var_names.intersection(screen_hits.index).tolist())
filesuffix = "_fig1_score-wgs-hits.png"
sc.pl.umap(adata, color="score", show=False, save=filesuffix)
filepath = f"figures/umap{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
filesuffix = "fig2_score-wgs-hits-per-cluster.png"
sc.pl.matrixplot(
    adata, groupby="cluster_name", var_names=["score"], show=False, save=filesuffix
)
filepath = f"figures/matrixplot_{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
🌱 saved: Transform(id='J6x5ZIyYoEkuz8', name='Perform single cell analysis, integrating with CRISPRa screen', stem_id='J6x5ZIyYoEku', version='0', type='notebook', updated_at=2023-08-12 05:45:56, created_by_id='bKeW4T6E')
🌱 saved: Run(id='GzzvnF7vAdNoTdlAkrLc', run_at=2023-08-12 05:45:56, transform_id='J6x5ZIyYoEkuz8', created_by_id='bKeW4T6E')
💡 adding file HbZdt18uc1X5Q1WhbXUk as input for run GzzvnF7vAdNoTdlAkrLc, adding parent transform P8YMJSNGbe9z0b
💡 adding file uNJ663ZfSXZNDB9uWrbx as input for run GzzvnF7vAdNoTdlAkrLc, adding parent transform vWgq1bwq94jFz8
WARNING: saving figure to file figures/umap_fig1_score-wgs-hits.png
💡 file will be copied to default storage upon `save()` with key 'figures/umap_fig1_score-wgs-hits.png'
🌱 storing file 'c2406U3gQFawDxgv5nK9' with key 'figures/umap_fig1_score-wgs-hits.png'
WARNING: saving figure to file figures/matrixplot_fig2_score-wgs-hits-per-cluster.png
💡 file will be copied to default storage upon `save()` with key 'figures/matrixplot_fig2_score-wgs-hits-per-cluster.png'
🌱 storing file '1LUsIItZ6djAJZ1t6IXu' with key 'figures/matrixplot_fig2_score-wgs-hits-per-cluster.png'

The outcome is a few figures stored as image files. Let's query one of them and look at its data lineage:

file = ln.File.filter(key__contains="figures/matrixplot").one()
file.view_lineage()
https://d33wubrfki0l68.cloudfront.net/4612552a97bf14862c5c562f2add84531107bb78/3fa18/_images/e3ba1b40189f8f171e89f0f39fe14151599c06f84f496cd900fb89961562a465.svg

We’d now like to track the current Jupyter notebook to continue the work:

ln.track()
🌱 saved: Transform(id='1LCd8kco9lZUz8', name='Bird's eye view', short_name='birds-eye', stem_id='1LCd8kco9lZU', version='0', type=notebook, updated_at=2023-08-12 05:45:58, created_by_id='bKeW4T6E')
🌱 saved: Run(id='4jMS4YGDvJCbNm8ZSi5F', run_at=2023-08-12 05:45:58, transform_id='1LCd8kco9lZUz8', created_by_id='bKeW4T6E')

Let’s load the image file:

file.stage()
💡 adding file 1LUsIItZ6djAJZ1t6IXu as input for run 4jMS4YGDvJCbNm8ZSi5F, adding parent transform J6x5ZIyYoEkuz8
PosixPath('/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata/figures/matrixplot_fig2_score-wgs-hits-per-cluster.png')

We see that the image file is tracked as an input of the current notebook. The input is highlighted; the notebook appears at the bottom:

file.view_lineage()
https://d33wubrfki0l68.cloudfront.net/27e053431554c3de055758ba7d2cf1924bb89d95/a4412/_images/9aed3ffaa6f550eac4d9dbd38e42b2552c3460e4c12557d92b7e3c6c156c667d.svg

We can also look purely at the sequence of transforms:

transform = ln.Transform.search("Track data lineage", return_queryset=True).first()
transform.parents.df()
name short_name stem_id version type reference updated_at created_by_id
id
7QdAuXkrS6mIz8 Chromium 10x upload None 7QdAuXkrS6mI 0 pipeline None 2023-08-12 05:45:53 DzTjkKse
transform.view_parents()
https://d33wubrfki0l68.cloudfront.net/f9939e46475402b4715fd184f954cff048f706f4/683ee/_images/dfccc46d29c713a056cf3cf141405a4dabc87ece8547a7bc6789979d00836e2e.svg
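Conceptually, `view_parents` walks the `parents` relation of the transform graph. A toy traversal over stand-in records (pure Python, not LaminDB's implementation; all names below are illustrative):

```python
def all_ancestors(transform, seen=None):
    """Collect every transform reachable upstream via .parents."""
    if seen is None:
        seen = set()
    for parent in transform.parents:
        if parent not in seen:
            seen.add(parent)
            all_ancestors(parent, seen)
    return seen


# toy stand-in for Transform records with a .parents set
class ToyTransform:
    def __init__(self, name, parents=()):
        self.name = name
        self.parents = set(parents)


upload = ToyTransform("Chromium 10x upload")
cellranger = ToyTransform("Cell Ranger", [upload])
notebook = ToyTransform("analysis notebook", [cellranger])
lineage = {t.name for t in all_ancestors(notebook)}
```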

And if you or another user re-runs the notebook, the logging will report its parent transforms:

ln.track()
✅ loaded: Transform(id='1LCd8kco9lZUz8', name='Bird's eye view', short_name='birds-eye', stem_id='1LCd8kco9lZU', version='0', type='notebook', updated_at=2023-08-12 05:45:58, created_by_id='bKeW4T6E')
✅ loaded: Run(id='4jMS4YGDvJCbNm8ZSi5F', run_at=2023-08-12 05:45:58, transform_id='1LCd8kco9lZUz8', created_by_id='bKeW4T6E')
💡   parent transform: Transform(id='J6x5ZIyYoEkuz8', name='Perform single cell analysis, integrating with CRISPRa screen', stem_id='J6x5ZIyYoEku', version='0', type='notebook', updated_at=2023-08-12 05:45:58, created_by_id='bKeW4T6E')

Data lineage graph#

To summarize, let’s re-render the data lineage graph:

file.view_lineage()
https://d33wubrfki0l68.cloudfront.net/27e053431554c3de055758ba7d2cf1924bb89d95/a4412/_images/9aed3ffaa6f550eac4d9dbd38e42b2552c3460e4c12557d92b7e3c6c156c667d.svg

Understand runs#

Under the hood, we already tracked pipeline and notebook runs through run_context.

You can see this most easily by looking at the File.run attribute (in addition to File.transform).

File objects are the inputs and outputs of such runs.

Sometimes, we don’t want to create a global run context but manually pass a run when creating a file:

run = ln.Run(transform=transform)
ln.File(filepath, run=run)

When accessing a file via stage(), load() or backed(), two things happen:

  1. The current run gets added to file.input_of

  2. The transform of that file gets added as a parent of the current transform
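These two steps can be pictured with a toy model of the bookkeeping (pure Python; not LaminDB's actual implementation):

```python
class Transform:
    def __init__(self, name):
        self.name = name
        self.parents = set()


class Run:
    def __init__(self, transform):
        self.transform = transform
        self.inputs = []


class File:
    def __init__(self, name, run):
        self.name = name
        self.run = run       # run that created this file
        self.input_of = []   # runs that consumed this file


def stage(file, current_run):
    """Simulate accessing a file inside a tracked run."""
    # 1. the current run gets added to file.input_of
    file.input_of.append(current_run)
    current_run.inputs.append(file)
    # 2. the file's transform becomes a parent of the current transform
    current_run.transform.parents.add(file.run.transform)


pipeline = Transform("Cell Ranger")
notebook = Transform("analysis notebook")
output = File("matrix.mtx.gz", Run(pipeline))
run = Run(notebook)
stage(output, run)
```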

Run outputs are automatically tracked as data sources once you call ln.track(). If you'd rather not auto-track run inputs, switch it off with ln.settings.track_run_inputs = False (see: Can I automatically track run inputs?).

You can also track run inputs on a case-by-case basis by passing is_run_input=True, e.g.:

file.load(is_run_input=True)

Query by provenance#

We can query or search for the notebook that created the file:

transform = ln.Transform.search("GWS CRIPSRa analysis", return_queryset=True).first()

And then find all the files created by that notebook:

ln.File.filter(transform=transform).df()
storage_id key suffix accessor description version initial_version_id size hash hash_type transform_id run_id updated_at created_by_id
id
uNJ663ZfSXZNDB9uWrbx kADkunWO None .parquet DataFrame hits from schmidt22 crispra GWS None None 18368 yw5f-kMLJhaNhdEF-lhxOQ md5 vWgq1bwq94jFz8 f9J4ZFdFmnIAL3kk6fG9 2023-08-12 05:45:56 bKeW4T6E

Which transform ingested a given file?

file = ln.File.filter().first()
file.transform
Transform(id='7QdAuXkrS6mIz8', name='Chromium 10x upload', stem_id='7QdAuXkrS6mI', version='0', type='pipeline', updated_at=2023-08-12 05:45:53, created_by_id='DzTjkKse')

And which user?

file.created_by
User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-12 05:45:48)

Which transforms were created by a given user?

users = ln.User.lookup(field="handle")
ln.Transform.filter(created_by=users.testuser2).df()
name short_name stem_id version type reference updated_at created_by_id
id
ezGsYAqUOyESsM Cell Ranger None ezGsYAqUOyES 7.2.0 pipeline None 2023-08-12 05:45:54 bKeW4T6E
P8YMJSNGbe9z0b Preprocess Cell Ranger outputs None P8YMJSNGbe9z 2.0 pipeline None 2023-08-12 05:45:54 bKeW4T6E
vWgq1bwq94jFz8 GWS CRIPSRa analysis None vWgq1bwq94jF 0 notebook None 2023-08-12 05:45:56 bKeW4T6E
J6x5ZIyYoEkuz8 Perform single cell analysis, integrating with... None J6x5ZIyYoEku 0 notebook None 2023-08-12 05:45:58 bKeW4T6E
1LCd8kco9lZUz8 Bird's eye view birds-eye 1LCd8kco9lZU 0 notebook None 2023-08-12 05:45:58 bKeW4T6E

Which notebooks were created by a given user?

ln.Transform.filter(created_by=users.testuser2, type="notebook").df()
name short_name stem_id version type reference updated_at created_by_id
id
vWgq1bwq94jFz8 GWS CRIPSRa analysis None vWgq1bwq94jF 0 notebook None 2023-08-12 05:45:56 bKeW4T6E
J6x5ZIyYoEkuz8 Perform single cell analysis, integrating with... None J6x5ZIyYoEku 0 notebook None 2023-08-12 05:45:58 bKeW4T6E
1LCd8kco9lZUz8 Bird's eye view birds-eye 1LCd8kco9lZU 0 notebook None 2023-08-12 05:45:58 bKeW4T6E

And of course, we can also view all recent additions to the entire database:

ln.view()
Hide code cell output
File

storage_id key suffix accessor description version initial_version_id size hash hash_type transform_id run_id updated_at created_by_id
id
1LUsIItZ6djAJZ1t6IXu kADkunWO figures/matrixplot_fig2_score-wgs-hits-per-clu... .png None None None None 28814 JYIPcat0YWYVCX3RVd3mww md5 J6x5ZIyYoEkuz8 GzzvnF7vAdNoTdlAkrLc 2023-08-12 05:45:58 bKeW4T6E
c2406U3gQFawDxgv5nK9 kADkunWO figures/umap_fig1_score-wgs-hits.png .png None None None None 118999 laQjVk4gh70YFzaUyzbUNg md5 J6x5ZIyYoEkuz8 GzzvnF7vAdNoTdlAkrLc 2023-08-12 05:45:57 bKeW4T6E
uNJ663ZfSXZNDB9uWrbx kADkunWO None .parquet DataFrame hits from schmidt22 crispra GWS None None 18368 yw5f-kMLJhaNhdEF-lhxOQ md5 vWgq1bwq94jFz8 f9J4ZFdFmnIAL3kk6fG9 2023-08-12 05:45:56 bKeW4T6E
lP2qDxYjzi1ggGJvSRwe kADkunWO schmidt22-crispra-gws-IFNG.csv .csv None Raw data of schmidt22 crispra GWS None None 1729685 cUSH0oQ2w-WccO8_ViKRAQ md5 nXCqeblqzJCIz8 4PRgkXShgitGEEUUs7L9 2023-08-12 05:45:55 DzTjkKse
HbZdt18uc1X5Q1WhbXUk kADkunWO schmidt22_perturbseq.h5ad .h5ad AnnData perturbseq counts None None 20659936 la7EvqEUMDlug9-rpw-udA md5 P8YMJSNGbe9z0b ntpBREA7zLyoKlYnOkWe 2023-08-12 05:45:54 bKeW4T6E
sVSh07hhgZTTIn9SX3aA kADkunWO perturbseq/filtered_feature_bc_matrix/barcodes... .tsv.gz None None None None 6 4rKXb9tuQUWnNTH2ZUh45g md5 ezGsYAqUOyESsM CoRnH4SDheN14LhLMkXg 2023-08-12 05:45:54 bKeW4T6E
cG78eIKkYrfcj5KDety9 kADkunWO perturbseq/filtered_feature_bc_matrix/matrix.m... .mtx.gz None None None None 6 HcVzgqt2nTf495vRMf4_Cw md5 ezGsYAqUOyESsM CoRnH4SDheN14LhLMkXg 2023-08-12 05:45:54 bKeW4T6E
87fqD9f1GZUT2ZJ9IZIb kADkunWO perturbseq/filtered_feature_bc_matrix/features... .tsv.gz None None None None 6 HXR7vE6fTla-rTkwXG8I-A md5 ezGsYAqUOyESsM CoRnH4SDheN14LhLMkXg 2023-08-12 05:45:54 bKeW4T6E
t3Nnv4E8cTQNbxrdnzo1 kADkunWO fastq/perturbseq_R2_001.fastq.gz .fastq.gz None None None None 6 dTFVvtlfDLFXd-8qsandFw md5 7QdAuXkrS6mIz8 eQjDbd5C6V61trGBYTae 2023-08-12 05:45:53 DzTjkKse
IAIffwcqv4Av6edxFaG7 kADkunWO fastq/perturbseq_R1_001.fastq.gz .fastq.gz None None None None 6 ZyFkKerhv1D1kT9z_aMl1g md5 7QdAuXkrS6mIz8 eQjDbd5C6V61trGBYTae 2023-08-12 05:45:53 DzTjkKse
Run

transform_id run_at created_by_id reference reference_type
id
eQjDbd5C6V61trGBYTae 7QdAuXkrS6mIz8 2023-08-12 05:45:53 DzTjkKse None None
CoRnH4SDheN14LhLMkXg ezGsYAqUOyESsM 2023-08-12 05:45:54 bKeW4T6E None None
ntpBREA7zLyoKlYnOkWe P8YMJSNGbe9z0b 2023-08-12 05:45:54 bKeW4T6E None None
4PRgkXShgitGEEUUs7L9 nXCqeblqzJCIz8 2023-08-12 05:45:55 DzTjkKse None None
f9J4ZFdFmnIAL3kk6fG9 vWgq1bwq94jFz8 2023-08-12 05:45:56 bKeW4T6E None None
GzzvnF7vAdNoTdlAkrLc J6x5ZIyYoEkuz8 2023-08-12 05:45:56 bKeW4T6E None None
4jMS4YGDvJCbNm8ZSi5F 1LCd8kco9lZUz8 2023-08-12 05:45:58 bKeW4T6E None None
Storage

root type region updated_at created_by_id
id
kADkunWO /home/runner/work/lamin-usecases/lamin-usecase... local None 2023-08-12 05:45:51 bKeW4T6E
Transform

name short_name stem_id version type reference updated_at created_by_id
id
1LCd8kco9lZUz8 Bird's eye view birds-eye 1LCd8kco9lZU 0 notebook None 2023-08-12 05:45:58 bKeW4T6E
J6x5ZIyYoEkuz8 Perform single cell analysis, integrating with... None J6x5ZIyYoEku 0 notebook None 2023-08-12 05:45:58 bKeW4T6E
vWgq1bwq94jFz8 GWS CRIPSRa analysis None vWgq1bwq94jF 0 notebook None 2023-08-12 05:45:56 bKeW4T6E
nXCqeblqzJCIz8 Upload GWS CRISPRa result None nXCqeblqzJCI 0 app None 2023-08-12 05:45:55 DzTjkKse
P8YMJSNGbe9z0b Preprocess Cell Ranger outputs None P8YMJSNGbe9z 2.0 pipeline None 2023-08-12 05:45:54 bKeW4T6E
ezGsYAqUOyESsM Cell Ranger None ezGsYAqUOyES 7.2.0 pipeline None 2023-08-12 05:45:54 bKeW4T6E
7QdAuXkrS6mIz8 Chromium 10x upload None 7QdAuXkrS6mI 0 pipeline None 2023-08-12 05:45:53 DzTjkKse
User

handle email name updated_at
id
bKeW4T6E testuser2 testuser2@lamin.ai Test User2 2023-08-12 05:45:51
DzTjkKse testuser1 testuser1@lamin.ai Test User1 2023-08-12 05:45:48
Hide code cell content
!lamin login testuser1
!lamin delete mydata
!rm -r ./mydata
✅ logged in with email testuser1@lamin.ai and id DzTjkKse
💡 deleting instance testuser1/mydata
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--mydata.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
🔶     consider manually delete your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/mydata