Quickstart#

This is a quickstart guide to using Oxbow. Oxbow lets you access potentially larger-than-memory genomic files as tabular data structures, such as data frames.

Create a DataSource#

Use the convenience function associated with your file type. The returned DataSource object can be used to access the data in the file.

import oxbow as ox

ds = ox.from_bam("data/sample.bam")

Into data frames#

If the dataset fits comfortably in memory, you can materialize it fully as a Pandas or Polars data frame.

ds.pd()  # or ds.to_pandas()

	qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end
0	HWI-BRUNOP16X_0001:3:48:4861:11838#0	163	chr1	10542	0	50M	chr1	10571.0	79	CGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCT...	gggggggggggggggggggggggggeggggR\_[\ggggghggggg...	10591
1	HWI-BRUNOP16X_0001:3:28:6650:168848#0	16	chr1	10546	16	75M	NaN	NaN	0	ATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG...	fggggggggdgdggcdfggggfgggggggggggggggggggggggg...	10620
2	HWI-BRUNOP16X_0001:3:8:20066:88158#0	16	chr1	946457	0	75M	NaN	NaN	0	TAGTCCGAGGTCTCCTGAACCTTCCCAAGCAGCTGCTGCACCTGCC...	BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBd`aed``__U^__]_g...	946531
3	HWI-BRUNOP16X_0001:3:27:10302:58768#0	16	chr1	1014060	37	75M	NaN	NaN	0	AGCTGAATGGGCAGGTCCCCCAGAAGATCGGCGTGCACGCCTTCCA...	BBBBBBBBBBBBBBBBcYRcffggfgf_gfg\deegfgfgfcggcg...	1014134
4	HWI-BRUNOP16X_0001:3:65:3144:143676#0	83	chr3	196957	60	50M	chr3	196008.0	-999	GTAACGCTCCCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGG...	BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^STTTSZW`beTTTTTS...	197006
5	HWI-BRUNOP16X_0001:3:68:13088:156644#0	16	chr3	196958	37	75M	NaN	NaN	0	GACCCCCCCGGCCCCCGGCGCCCCCCCGCCCCGCCCCCGGGCGGGC...	BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB...	197032
6	HWI-BRUNOP16X_0001:3:48:3417:101389#0	163	chr3	196961	60	50M	chr3	319702.0	122791	GCTTACCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGGCTCG...	gggggggggggggggggggggggfdaggggggdgggfgdhbe\T`B...	197010
7	HWI-BRUNOP16X_0001:3:46:17583:95767#0	161	chrX	503847	0	50M	chr4	185365552.0	0	TTTTATTTTTTTTTTTGAGATGGAGTCTCGCTCTTGTCACCGAGGC...	ddfdfd____dffff]__aeZ]\XZSPSNSSSSSSbbaabZ_``BB...	503896
8	HWI-BRUNOP16X_0001:3:4:7989:14941#0	16	chrY	586185	0	75M	NaN	NaN	0	GTGCGATCTCGGTTCGCTGCAACCTCTGCTTCCCAGGTTCAAGTGA...	BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB...	586259
9	HWI-BRUNOP16X_0001:3:44:11450:50194#0	0	chrY	587561	0	75M	NaN	NaN	0	NNTGCAGTGAGCTGAGATTGTGCCACTGCACTCCAGCCTGGGTGAC...	BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB...	587635

ds.pl()  # or ds.to_polars()

shape: (10, 12)

qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end
str	u16	cat	i32	u8	str	cat	i32	i32	str	str	i32
"HWI-BRUNOP16X_0001:3:48:4861:1…	163	"chr1"	10542	0	"50M"	"chr1"	10571	79	"CGAAATCTGTGCAGAGGAGAACGCAGCTCC…	"gggggggggggggggggggggggggegggg…	10591
"HWI-BRUNOP16X_0001:3:28:6650:1…	16	"chr1"	10546	16	"75M"	null	null	0	"ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…	"fggggggggdgdggcdfggggfgggggggg…	10620
"HWI-BRUNOP16X_0001:3:8:20066:8…	16	"chr1"	946457	0	"75M"	null	null	0	"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	946531
"HWI-BRUNOP16X_0001:3:27:10302:…	16	"chr1"	1014060	37	"75M"	null	null	0	"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…	"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…	1014134
"HWI-BRUNOP16X_0001:3:65:3144:1…	83	"chr3"	196957	60	"50M"	"chr3"	196008	-999	"GTAACGCTCCCGGACCCTGCGCGCCCCCGT…	"BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^…	197006
"HWI-BRUNOP16X_0001:3:68:13088:…	16	"chr3"	196958	37	"75M"	null	null	0	"GACCCCCCCGGCCCCCGGCGCCCCCCCGCC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	197032
"HWI-BRUNOP16X_0001:3:48:3417:1…	163	"chr3"	196961	60	"50M"	"chr3"	319702	122791	"GCTTACCGGACCCTGCGCGCCCCCGTCCCG…	"gggggggggggggggggggggggfdagggg…	197010
"HWI-BRUNOP16X_0001:3:46:17583:…	161	"chrX"	503847	0	"50M"	"chr4"	185365552	0	"TTTTATTTTTTTTTTTGAGATGGAGTCTCG…	"ddfdfd____dffff]__aeZ]\XZSPSNS…	503896
"HWI-BRUNOP16X_0001:3:4:7989:14…	16	"chrY"	586185	0	"75M"	null	null	0	"GTGCGATCTCGGTTCGCTGCAACCTCTGCT…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	586259
"HWI-BRUNOP16X_0001:3:44:11450:…	0	"chrY"	587561	0	"75M"	null	null	0	"NNTGCAGTGAGCTGAGATTGTGCCACTGCA…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	587635

Into lazy data structures#

If the data source is very large, you can also load it into a lazy or “out-of-core” data structure, such as a Polars lazy frame or Dask data frame.

df = ds.pl(lazy=True)
df.show_graph()

[Figure: query plan graph rendered by df.show_graph()]

df.head().collect()

shape: (5, 12)

qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end
str	u16	cat	i32	u8	str	cat	i32	i32	str	str	i32
"HWI-BRUNOP16X_0001:3:48:4861:1…	163	"chr1"	10542	0	"50M"	"chr1"	10571	79	"CGAAATCTGTGCAGAGGAGAACGCAGCTCC…	"gggggggggggggggggggggggggegggg…	10591
"HWI-BRUNOP16X_0001:3:28:6650:1…	16	"chr1"	10546	16	"75M"	null	null	0	"ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…	"fggggggggdgdggcdfggggfgggggggg…	10620
"HWI-BRUNOP16X_0001:3:8:20066:8…	16	"chr1"	946457	0	"75M"	null	null	0	"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	946531
"HWI-BRUNOP16X_0001:3:27:10302:…	16	"chr1"	1014060	37	"75M"	null	null	0	"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…	"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…	1014134
"HWI-BRUNOP16X_0001:3:65:3144:1…	83	"chr3"	196957	60	"50M"	"chr3"	196008	-999	"GTAACGCTCCCGGACCCTGCGCGCCCCCGT…	"BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^…	197006

Oxbow data sources can also be loaded into a DuckDB relation.

import duckdb

conn = duckdb.connect(":memory:")
ds = ox.from_gtf("data/gencode.v47.annotation.gtf")
rel = ds.to_duckdb(conn)
conn.sql(
    "SELECT seqid AS chrom, type, start, rel.end, strand, attributes.gene_name "
    "FROM rel "
    "WHERE attributes.gene_name = 'PCSK9' "
    "LIMIT 10"
).pl()

Note

See the Streams and Fragments section for details on building Dask data frames.

Range queries#

Data sources with indexes support querying genomic ranges. This is the case for htslib formats that are compressed with the BGZF gzip variant and indexed with an appropriate companion index file (e.g., .bai, .tbi, .csi). The BBI formats, BigWig and BigBed, possess an internal index and support range queries without an index file.

You can specify one or more ranges to the constructor or pass them to the regions() method. All records overlapping the query ranges will be returned.

ds = ox.from_bam("data/sample.bam", index="data/sample.bam.bai")
ds = ds.regions("chr1:900000-1100000")

ds.pl()

shape: (2, 12)

qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end
str	u16	cat	i32	u8	str	cat	i32	i32	str	str	i32
"HWI-BRUNOP16X_0001:3:8:20066:8…	16	"chr1"	946457	0	"75M"	null	null	0	"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	946531
"HWI-BRUNOP16X_0001:3:27:10302:…	16	"chr1"	1014060	37	"75M"	null	null	0	"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…	"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…	1014134
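
To make "overlapping" concrete, here is a minimal pure-Python sketch of the selection rule. It is not how Oxbow evaluates a query (Oxbow consults the index rather than scanning records), and it assumes a region string with 1-based inclusive coordinates and records treated as the closed interval [pos, end]. The helpers parse_region and overlaps are hypothetical names used only for illustration; the coordinates are copied from the sample tables above.

```python
import re

def parse_region(region):
    """Split 'chr1:900000-1100000' into (chrom, start, end).

    A bare name such as 'chr1' selects the whole chromosome.
    """
    m = re.fullmatch(r"([^:]+)(?::([\d,]+)-([\d,]+))?", region)
    if m is None:
        raise ValueError(f"unrecognized region: {region!r}")
    chrom, start, end = m.groups()
    if start is None:
        return chrom, None, None
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))

def overlaps(record, region):
    """True if the (rname, pos, end) record shares at least one base with the region."""
    rname, pos, end = record
    chrom, start, stop = parse_region(region)
    if rname != chrom:
        return False
    if start is None:
        return True
    return pos <= stop and end >= start

# (rname, pos, end) for a few of the sample records shown earlier.
records = [
    ("chr1", 10542, 10591),
    ("chr1", 10546, 10620),
    ("chr1", 946457, 946531),
    ("chr1", 1014060, 1014134),
    ("chr3", 196957, 197006),
]
hits = [r for r in records if overlaps(r, "chr1:900000-1100000")]
print(hits)
```

Only the two chr1 records that start after position 900000 pass the test, which agrees with the two-row result above.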

If the index file exists in the same location as the source file, it is automatically detected.

ox.from_bam("data/sample.bam").regions(["chr1", "chr3"]).pl()

shape: (7, 12)

qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end
str	u16	cat	i32	u8	str	cat	i32	i32	str	str	i32
"HWI-BRUNOP16X_0001:3:48:4861:1…	163	"chr1"	10542	0	"50M"	"chr1"	10571	79	"CGAAATCTGTGCAGAGGAGAACGCAGCTCC…	"gggggggggggggggggggggggggegggg…	10591
"HWI-BRUNOP16X_0001:3:28:6650:1…	16	"chr1"	10546	16	"75M"	null	null	0	"ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…	"fggggggggdgdggcdfggggfgggggggg…	10620
"HWI-BRUNOP16X_0001:3:8:20066:8…	16	"chr1"	946457	0	"75M"	null	null	0	"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	946531
"HWI-BRUNOP16X_0001:3:27:10302:…	16	"chr1"	1014060	37	"75M"	null	null	0	"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…	"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…	1014134
"HWI-BRUNOP16X_0001:3:65:3144:1…	83	"chr3"	196957	60	"50M"	"chr3"	196008	-999	"GTAACGCTCCCGGACCCTGCGCGCCCCCGT…	"BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^…	197006
"HWI-BRUNOP16X_0001:3:68:13088:…	16	"chr3"	196958	37	"75M"	null	null	0	"GACCCCCCCGGCCCCCGGCGCCCCCCCGCC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	197032
"HWI-BRUNOP16X_0001:3:48:3417:1…	163	"chr3"	196961	60	"50M"	"chr3"	319702	122791	"GCTTACCGGACCCTGCGCGCCCCCGTCCCG…	"gggggggggggggggggggggggfdagggg…	197010

ox.from_bigwig("data/sample.bw").regions("chr21:10900000-15000000").pl()

shape: (3, 4)

chrom	start	end	value
str	u32	u32	f32
"chr21"	10971770	10971775	40.0
"chr21"	14787100	14787105	60.0
"chr21"	14959050	14959055	20.0

Note

Oxbow handles multiple ranges as separate fragments. For more details, see the Streams and Fragments section below.

Column projection#

Oxbow lets you select only the columns you need and will not parse the others.

import polars as pl

ox.from_bam(
    "data/sample.bam", 
    fields=["rname", "pos", "end", "mapq"],
).regions(
    "chr1"
).pl()

shape: (4, 4)

rname	pos	end	mapq
cat	i32	i32	u8
"chr1"	10542	10591	0
"chr1"	10546	10620	16
"chr1"	946457	946531	0
"chr1"	1014060	1014134	37

In data systems lingo, selecting columns is also known as “projection”. The lazy data structures returned by an Oxbow data source can “push down” the projection to Oxbow, so that unselected fields are never parsed. In the following example, only the four fields passed to the Polars LazyFrame.select method are parsed when the output is computed.

df = (
    ox.from_bam("data/sample.bam")
    .regions("chr1")
    .pl(lazy=True)
    .select(
        pl.col("rname").alias("chrom"),
        pl.col("pos").alias("start"),
        "end",
        "mapq"
    )
    .collect()
)
df

shape: (4, 4)

chrom	start	end	mapq
cat	i32	i32	u8
"chr1"	10542	10591	0
"chr1"	10546	10620	16
"chr1"	946457	946531	0
"chr1"	1014060	1014134	37
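
The idea behind projection pushdown can be sketched without any library. The toy reader below converts only the requested columns of tab-delimited lines and counts how many values it parsed. ToyReader, its field list, and the two input lines are hypothetical and exist only to illustrate the principle; this is not Oxbow's reader.

```python
FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar"]
CONVERTERS = {"flag": int, "pos": int, "mapq": int}

class ToyReader:
    """Hypothetical line reader that parses only the projected columns."""

    def __init__(self, lines):
        self.lines = lines
        self.conversions = 0  # number of cell values actually parsed

    def read(self, fields=None):
        wanted = FIELDS if fields is None else fields
        indexes = [FIELDS.index(name) for name in wanted]
        rows = []
        for line in self.lines:
            cells = line.rstrip("\n").split("\t")
            row = {}
            for name, i in zip(wanted, indexes):
                convert = CONVERTERS.get(name, str)
                row[name] = convert(cells[i])
                self.conversions += 1
            rows.append(row)
        return rows

lines = [
    "read1\t163\tchr1\t10542\t0\t50M",
    "read2\t16\tchr1\t10546\t16\t75M",
]
reader = ToyReader(lines)
rows = reader.read(fields=["rname", "pos", "mapq"])
print(rows)
print(reader.conversions)  # 6 values parsed, versus 12 for the full records
```

With a lazy frame, the list of wanted fields is derived from the query itself, so the caller does not have to pass it explicitly.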

Nested and composite fields#

Oxbow can handle the complex field structures of genomics file formats because they can all be mapped to Arrow constructs like lists, arrays, and structs.

For example, fields like SAM tags, VCF info and samples, and GTF attributes are exposed as struct columns in Arrow-native libraries like Polars, which are easy and efficient to manipulate.

SAM/BAM tags#

The htslib alignment formats, SAM and BAM, have optional fields called tags that are defined inline, rather than in a header or manifest. These definitions, each a tuple of a tag name and a type code, can be provided explicitly to the data source constructor for projection.

df = (
    ox.from_bam(
        "data/sample.bam", 
        fields=None,
        tag_defs=[('MD', 'Z'), ('NM', 'C')]
    )
    .regions("chr1")
    .pl()
    .select(
        pl.col("tags").struct.unnest()
    )
)
df

shape: (4, 2)

MD	NM
str	i64
"18C31"	1
"14C52A7"	2
"2T0G5T65"	3
"7G1C4A2A57"	4

When you call the with_tags() method, Oxbow scans an initial number of rows (set by scan_rows) to discover tag definitions and adds them to the schema.

df = (
    ox.from_bam("data/sample.bam")
    .with_tags()
    .regions("chr1")
    .pl()
)
df

shape: (4, 13)

qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end	tags
str	u16	cat	i32	u8	str	cat	i32	i32	str	str	i32	struct[12]
"HWI-BRUNOP16X_0001:3:48:4861:1…	163	"chr1"	10542	0	"50M"	"chr1"	10571	79	"CGAAATCTGTGCAGAGGAGAACGCAGCTCC…	"gggggggggggggggggggggggggegggg…	10591	{0,"18C31",1,"brain_50_fcb",0,3,8,null,0,1,0,"82"}
"HWI-BRUNOP16X_0001:3:28:6650:1…	16	"chr1"	10546	16	"75M"	null	null	0	"ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…	"fggggggggdgdggcdfggggfgggggggg…	10620	{null,"14C52A7",2,"brain_75_fca",null,1,5,null,0,2,0,"85"}
"HWI-BRUNOP16X_0001:3:8:20066:8…	16	"chr1"	946457	0	"75M"	null	null	0	"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	946531	{null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"}
"HWI-BRUNOP16X_0001:3:27:10302:…	16	"chr1"	1014060	37	"75M"	null	null	0	"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…	"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…	1014134	{null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"}

df['tags'].struct.unnest().head()

shape: (4, 12)

AM	MD	NM	RG	SM	X0	X1	XA	XG	XM	XO	XT
i64	str	i64	str	i64	i64	i64	str	i64	i64	i64	str
0	"18C31"	1	"brain_50_fcb"	0	3	8	null	0	1	0	"82"
null	"14C52A7"	2	"brain_75_fca"	null	1	5	null	0	2	0	"85"
null	"2T0G5T65"	3	"brain_75_fca"	null	2	0	"2,-131443143,75M,3;"	0	3	0	"82"
null	"7G1C4A2A57"	4	"brain_75_fca"	null	1	0	null	0	4	0	"85"
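
As a rough illustration of what tag discovery involves, the sketch below scans the first few records of SAM text and collects (tag, type code) pairs. It is a simplified stand-in, not Oxbow's implementation: discover_tag_defs and sam_line are hypothetical helpers, the default of 1024 rows is an arbitrary choice for this sketch, and the input is SAM text, where integer tags carry the code i (BAM stores width-specific codes such as C).

```python
def discover_tag_defs(sam_lines, scan_rows=1024):
    """Collect (tag, type_code) pairs from the first `scan_rows` alignment lines."""
    seen = {}
    scanned = 0
    for line in sam_lines:
        if line.startswith("@"):  # header lines carry no tags
            continue
        if scanned >= scan_rows:
            break
        scanned += 1
        # Columns 12 and onward of a SAM record are TAG:TYPE:VALUE triples.
        for item in line.rstrip("\n").split("\t")[11:]:
            tag, type_code, _value = item.split(":", 2)
            seen.setdefault(tag, type_code)
    return sorted(seen.items())

def sam_line(qname, *tags):
    """Build a minimal SAM record: 11 mandatory columns plus optional tags."""
    mandatory = [qname, "16", "chr1", "10546", "16", "4M", "*", "0", "0", "ACGT", "FFFF"]
    return "\t".join(mandatory + list(tags))

lines = [
    "@HD\tVN:1.6",
    sam_line("r1", "NM:i:1", "MD:Z:18C31"),
    sam_line("r2", "NM:i:2", "MD:Z:14C52A7"),
    sam_line("r3", "NM:i:3", "XA:Z:2,-131443143,75M,3;"),
]
print(discover_tag_defs(lines, scan_rows=2))
print(discover_tag_defs(lines))
```

Scanning only two rows misses the XA tag, which first appears in the third record. A tag that only occurs beyond the scanned rows has to be declared explicitly, which is one reason to pass definitions yourself when you know which tags you need.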

GTF/GFF attributes#

GTF/GFF attributes are analogous to SAM tags. For GTF, the type is always "String". For GFF, attributes can be "String" or "Array", the latter materializing as a list column.

df = (
    ox.from_gff("data/sample.gff")
    .with_attributes()
    .pl()
)
df.head()

shape: (5, 9)

seqid	source	type	start	end	score	strand	frame	attributes
str	str	str	i32	i32	f32	str	u8	struct[18]
"chr13"	"HAVANA"	"exon"	81326030	81326191	null	"+"	null	{"exon:ENST00000782961.1:2","ENST00000782961.1",null,"ENSE00004156517.1","2","ENSG00000229309.3","ENSG00000229309","lncRNA","OTTHUMG00000017146.2",null,null,"2",null,["basic", "Ensembl_canonical", "TAGENE"],"ENST00000782961.1","ENST00000782961",null,"lncRNA"}
"chr6"	"HAVANA"	"CDS"	32002399	32002540	null	"+"	1	{"CDS:ENST00000498271.1","ENST00000498271.1","CCDS59005.1","ENSE00001878698.1","40","ENSG00000244731.10","C4A","protein_coding","OTTHUMG00000031186.6","OTTHUMT00000356896.1","HGNC:1323","2","ENSP00000420212.1",["RNA_Seq_supported_only", "basic", … "CCDS"],"ENST00000498271.1","C4A-246","1","protein_coding"}
"chr10"	"HAVANA"	"exon"	72930538	72930637	null	"+"	null	{"exon:ENST00000334011.10:8","ENST00000334011.10","CCDS7318.1","ENSE00001170094.1","8","ENSG00000138315.13","OIT3","protein_coding","OTTHUMG00000018444.2","OTTHUMT00000048596.2","HGNC:29953","2","ENSP00000333900.5",["basic", "Ensembl_canonical", … "CCDS"],"ENST00000334011.10","OIT3-201","1","protein_coding"}
"chr1"	"HAVANA"	"exon"	497210	497299	null	"-"	null	{"exon:ENST00000641916.1:4","ENST00000641916.1",null,"ENSE00003812605.1","4","ENSG00000290385.2","ENSG00000290385","lncRNA",null,"OTTHUMT00000493599.1",null,"2",null,null,"ENST00000641916.1","ENST00000641916",null,"lncRNA"}
"chr13"	"HAVANA"	"CDS"	35655579	35655749	null	"+"	2	{"CDS:ENST00000629018.4","ENST00000629018.4",null,"ENSE00000938859.1","28","ENSG00000172915.20","NBEA","protein_coding","OTTHUMG00000016724.2",null,"HGNC:7648","2","ENSP00000486239.3",["RNA_Seq_supported_only", "mRNA_start_NF", "cds_start_NF"],"ENST00000629018.4","NBEA-207","5","protein_coding"}

df['attributes'].struct.unnest().head()

shape: (5, 18)

ID	Parent	ccdsid	exon_id	exon_number	gene_id	gene_name	gene_type	havana_gene	havana_transcript	hgnc_id	level	protein_id	tag	transcript_id	transcript_name	transcript_support_level	transcript_type
str	str	str	str	str	str	str	str	str	str	str	str	str	list[str]	str	str	str	str
"exon:ENST00000782961.1:2"	"ENST00000782961.1"	null	"ENSE00004156517.1"	"2"	"ENSG00000229309.3"	"ENSG00000229309"	"lncRNA"	"OTTHUMG00000017146.2"	null	null	"2"	null	["basic", "Ensembl_canonical", "TAGENE"]	"ENST00000782961.1"	"ENST00000782961"	null	"lncRNA"
"CDS:ENST00000498271.1"	"ENST00000498271.1"	"CCDS59005.1"	"ENSE00001878698.1"	"40"	"ENSG00000244731.10"	"C4A"	"protein_coding"	"OTTHUMG00000031186.6"	"OTTHUMT00000356896.1"	"HGNC:1323"	"2"	"ENSP00000420212.1"	["RNA_Seq_supported_only", "basic", … "CCDS"]	"ENST00000498271.1"	"C4A-246"	"1"	"protein_coding"
"exon:ENST00000334011.10:8"	"ENST00000334011.10"	"CCDS7318.1"	"ENSE00001170094.1"	"8"	"ENSG00000138315.13"	"OIT3"	"protein_coding"	"OTTHUMG00000018444.2"	"OTTHUMT00000048596.2"	"HGNC:29953"	"2"	"ENSP00000333900.5"	["basic", "Ensembl_canonical", … "CCDS"]	"ENST00000334011.10"	"OIT3-201"	"1"	"protein_coding"
"exon:ENST00000641916.1:4"	"ENST00000641916.1"	null	"ENSE00003812605.1"	"4"	"ENSG00000290385.2"	"ENSG00000290385"	"lncRNA"	null	"OTTHUMT00000493599.1"	null	"2"	null	null	"ENST00000641916.1"	"ENST00000641916"	null	"lncRNA"
"CDS:ENST00000629018.4"	"ENST00000629018.4"	null	"ENSE00000938859.1"	"28"	"ENSG00000172915.20"	"NBEA"	"protein_coding"	"OTTHUMG00000016724.2"	null	"HGNC:7648"	"2"	"ENSP00000486239.3"	["RNA_Seq_supported_only", "mRNA_start_NF", "cds_start_NF"]	"ENST00000629018.4"	"NBEA-207"	"5"	"protein_coding"
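
The difference between "String" and "Array" attributes can be sketched in plain Python. The function below parses one GFF3 attributes column given a mapping of attribute name to type; parse_gff_attributes and the attr_defs mapping are hypothetical and only mirror the behavior described above, with missing attributes left as None.

```python
from urllib.parse import unquote

def parse_gff_attributes(column, attr_defs):
    """Parse a GFF3 attributes column using {name: "String" | "Array"} definitions."""
    parsed = {name: None for name in attr_defs}
    for pair in column.strip().strip(";").split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        kind = attr_defs.get(key)
        if kind == "Array":
            # Multi-valued attributes are comma-separated in GFF3.
            parsed[key] = [unquote(v) for v in value.split(",")]
        elif kind == "String":
            parsed[key] = unquote(value)
        # Attributes without a definition are not projected.
    return parsed

attr_defs = {"ID": "String", "gene_name": "String", "tag": "Array", "ccdsid": "String"}
column = (
    "ID=exon:ENST00000782961.1:2;gene_name=ENSG00000229309;"
    "tag=basic,Ensembl_canonical,TAGENE;level=2"
)
attrs = parse_gff_attributes(column, attr_defs)
print(attrs)
```

The tag attribute comes back as a list, ccdsid is None because the record does not carry it, and level is dropped because it has no definition.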

Important

As of oxbow v0.7, alignment file tag definitions and annotation file attribute definitions are no longer auto-discovered by default; this behavior is now opt-in. Use the with_tags() or with_attributes() methods, respectively, to discover or specify tag/attribute definitions.

VCF/BCF info fields#

For the htslib variant call formats, VCF and BCF, the subfields of the INFO field are defined in the VCF header, so they do not need to be discovered by sniffing rows and you do not need to specify types.

By default, all info fields are parsed (info_fields="*"). You can project a subset by passing a list of field names, or skip INFO parsing entirely by setting the info_fields argument to None.

(
    ox.from_vcf(
        "data/sample.vcf.gz",
        info_fields=None,
    )
    .pl()
).head()

shape: (5, 7)

chrom	pos	id	ref	alt	qual	filter
cat	i32	list[str]	str	list[str]	f32	list[str]
"1"	65872	[]	"T"	["G"]	44.18	[]
"1"	69511	[]	"A"	["G"]	2552.929932	[]
"1"	762273	[]	"G"	["A"]	19085.929688	[]
"1"	866511	[]	"C"	["CCCCT"]	3136.889893	[]
"1"	876499	[]	"A"	["G"]	3338.929932	[]

df = (
    ox.from_vcf(
        "data/sample.vcf.gz",
        info_fields=["TYPE", "snpeff.Effect", "snpeff.Gene_Name", "snpeff.Transcript_BioType"],
    )
    .pl()
)
df.head()

shape: (5, 8)

chrom	pos	id	ref	alt	qual	filter	info
cat	i32	list[str]	str	list[str]	f32	list[str]	struct[4]
"1"	65872	[]	"T"	["G"]	44.18	[]	{["SNP"],["intergenic_region"],null,null}
"1"	69511	[]	"A"	["G"]	2552.929932	[]	{["SNP"],["sequence_feature[transmembrane_region:Transmembrane_region]", "sequence_feature[disulfide_bond]", "missense_variant"],["OR4F5", "OR4F5", "OR4F5"],["protein_coding", "protein_coding", "protein_coding"]}
"1"	762273	[]	"G"	["A"]	19085.929688	[]	{["SNP"],["non_coding_exon_variant"],["LINC00115"],["lincRNA"]}
"1"	866511	[]	"C"	["CCCCT"]	3136.889893	[]	{["Insertion"],["intron_variant"],["SAMD11"],["protein_coding"]}
"1"	876499	[]	"A"	["G"]	3338.929932	[]	{["SNP"],["intron_variant"],["SAMD11"],["protein_coding"]}

df.unnest("info").head()

shape: (5, 11)

chrom	pos	id	ref	alt	qual	filter	TYPE	snpeff.Effect	snpeff.Gene_Name	snpeff.Transcript_BioType
cat	i32	list[str]	str	list[str]	f32	list[str]	list[str]	list[str]	list[str]	list[str]
"1"	65872	[]	"T"	["G"]	44.18	[]	["SNP"]	["intergenic_region"]	null	null
"1"	69511	[]	"A"	["G"]	2552.929932	[]	["SNP"]	["sequence_feature[transmembrane_region:Transmembrane_region]", "sequence_feature[disulfide_bond]", "missense_variant"]	["OR4F5", "OR4F5", "OR4F5"]	["protein_coding", "protein_coding", "protein_coding"]
"1"	762273	[]	"G"	["A"]	19085.929688	[]	["SNP"]	["non_coding_exon_variant"]	["LINC00115"]	["lincRNA"]
"1"	866511	[]	"C"	["CCCCT"]	3136.889893	[]	["Insertion"]	["intron_variant"]	["SAMD11"]	["protein_coding"]
"1"	876499	[]	"A"	["G"]	3338.929932	[]	["SNP"]	["intron_variant"]	["SAMD11"]	["protein_coding"]
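
To see why no row sniffing is needed, consider how INFO definitions appear in a VCF header. The sketch below extracts each field's declared Number and Type from ##INFO lines. The header lines are made-up but typical examples (they are not taken from data/sample.vcf.gz), info_defs is a hypothetical helper, and the regular expression assumes the conventional ID, Number, Type key order.

```python
import re

INFO_LINE = re.compile(r"##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,>]+)")

def info_defs(header_lines):
    """Map each INFO ID to its declared (Number, Type)."""
    defs = {}
    for line in header_lines:
        m = INFO_LINE.match(line)
        if m:
            name, number, type_ = m.groups()
            defs[name] = (number, type_)
    return defs

header = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency, for each ALT allele">',
    '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
defs = info_defs(header)
print(defs)
```

Because the name, cardinality, and type of every subfield are declared up front, a schema can be built before any record is read.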

VCF/BCF sample genotype data#

For the htslib variant call formats, each variant call record is associated with an arbitrary number of so-called FORMAT fields that provide genotype-related information for each sample. Like INFO, these fields are defined in the header.

Using the samples and genotype_fields arguments, you can project any subset of samples as separate struct columns and project any subset of their associated genotype fields. Use samples="*" to select all samples or a list to select a subset.

df = ox.from_vcf(
    "data/sample.vcf.gz",
    info_fields=None,
    samples=['NA12891', 'NA12892'],
).pl()
df.head()

shape: (5, 9)

chrom	pos	id	ref	alt	qual	filter	NA12891	NA12892
cat	i32	list[str]	str	list[str]	f32	list[str]	struct[6]	struct[6]
"1"	65872	[]	"T"	["G"]	44.18	[]	{[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 439],18}	{[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 437],18}
"1"	69511	[]	"A"	["G"]	2552.929932	[]	{null,null,null,{[null, null],[false, false]},null,null}	{[0, 39],39,99,{[1, 1],[true, true]},[1289, 117, 0],null}
"1"	762273	[]	"G"	["A"]	19085.929688	[]	{[0, 82],82,99,{[1, 1],[true, true]},[2952, 247, 0],127}	{[0, 68],68,99,{[1, 1],[true, true]},[2485, 204, 0],127}
"1"	866511	[]	"C"	["CCCCT"]	3136.889893	[]	{[0, 13],13,37,{[1, 1],[true, true]},[512, 37, 0],26}	{[0, 9],9,27,{[1, 1],[true, true]},[402, 27, 0],26}
"1"	876499	[]	"A"	["G"]	3338.929932	[]	{[0, 17],17,51,{[1, 1],[true, true]},[645, 51, 0],26}	{[0, 9],9,27,{[1, 1],[true, true]},[355, 27, 0],26}

Each sample column is essentially a sub-dataframe of that sample’s genotype fields.

df['NA12892'].struct.unnest().head()

shape: (5, 6)

AD	DP	GQ	GT	PL	TP
list[i32]	i32	i32	struct[2]	list[i32]	i32
[14, 2]	16	21	{[0, 1],[true, true]}	[21, 0, 437]	18
[0, 39]	39	99	{[1, 1],[true, true]}	[1289, 117, 0]	null
[0, 68]	68	99	{[1, 1],[true, true]}	[2485, 204, 0]	127
[0, 9]	9	27	{[1, 1],[true, true]}	[402, 27, 0]	26
[0, 9]	9	27	{[1, 1],[true, true]}	[355, 27, 0]	26

Important

As of oxbow v0.7, variant file sample columns are no longer projected by default; they are opt-in. We recommend using the with_samples() method, described below, to project them.

The recommended approach to project sample genotype data is to use the with_samples() method. Declaring samples this way further nests all sample-related data in a single “samples” struct column for convenience.

df = (
    ox.from_vcf(
        "data/sample.vcf.gz",
        info_fields=None,
    )
    .with_samples()
    .pl()
)
df.head()

shape: (5, 8)

chrom	pos	id	ref	alt	qual	filter	samples
cat	i32	list[str]	str	list[str]	f32	list[str]	struct[3]
"1"	65872	[]	"T"	["G"]	44.18	[]	{{[15, 0],15,45,{[0, 0],[true, true]},[0, 45, 520],18},{[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 439],18},{[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 437],18}}
"1"	69511	[]	"A"	["G"]	2552.929932	[]	{{null,null,null,{[null, null],[false, false]},null,null},{null,null,null,{[null, null],[false, false]},null,null},{[0, 39],39,99,{[1, 1],[true, true]},[1289, 117, 0],null}}
"1"	762273	[]	"G"	["A"]	19085.929688	[]	{{[0, 67],67,99,{[1, 1],[true, true]},[2510, 202, 0],127},{[0, 82],82,99,{[1, 1],[true, true]},[2952, 247, 0],127},{[0, 68],68,99,{[1, 1],[true, true]},[2485, 204, 0],127}}
"1"	866511	[]	"C"	["CCCCT"]	3136.889893	[]	{{[0, 13],13,38,{[1, 1],[true, true]},[583, 38, 0],26},{[0, 13],13,37,{[1, 1],[true, true]},[512, 37, 0],26},{[0, 9],9,27,{[1, 1],[true, true]},[402, 27, 0],26}}
"1"	876499	[]	"A"	["G"]	3338.929932	[]	{{[0, 12],12,36,{[1, 1],[true, true]},[465, 36, 0],26},{[0, 17],17,51,{[1, 1],[true, true]},[645, 51, 0],26},{[0, 9],9,27,{[1, 1],[true, true]},[355, 27, 0],26}}

df.unnest("samples").head()

shape: (5, 10)

chrom	pos	id	ref	alt	qual	filter	NA12878i	NA12891	NA12892
cat	i32	list[str]	str	list[str]	f32	list[str]	struct[6]	struct[6]	struct[6]
"1"	65872	[]	"T"	["G"]	44.18	[]	{[15, 0],15,45,{[0, 0],[true, true]},[0, 45, 520],18}	{[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 439],18}	{[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 437],18}
"1"	69511	[]	"A"	["G"]	2552.929932	[]	{null,null,null,{[null, null],[false, false]},null,null}	{null,null,null,{[null, null],[false, false]},null,null}	{[0, 39],39,99,{[1, 1],[true, true]},[1289, 117, 0],null}
"1"	762273	[]	"G"	["A"]	19085.929688	[]	{[0, 67],67,99,{[1, 1],[true, true]},[2510, 202, 0],127}	{[0, 82],82,99,{[1, 1],[true, true]},[2952, 247, 0],127}	{[0, 68],68,99,{[1, 1],[true, true]},[2485, 204, 0],127}
"1"	866511	[]	"C"	["CCCCT"]	3136.889893	[]	{[0, 13],13,38,{[1, 1],[true, true]},[583, 38, 0],26}	{[0, 13],13,37,{[1, 1],[true, true]},[512, 37, 0],26}	{[0, 9],9,27,{[1, 1],[true, true]},[402, 27, 0],26}
"1"	876499	[]	"A"	["G"]	3338.929932	[]	{[0, 12],12,36,{[1, 1],[true, true]},[465, 36, 0],26}	{[0, 17],17,51,{[1, 1],[true, true]},[645, 51, 0],26}	{[0, 9],9,27,{[1, 1],[true, true]},[355, 27, 0],26}

You can also customize how sample genotype data are nested by using the group_by argument to with_samples(). By default (group_by="sample"), the columns are grouped first by sample name, then by genotype field name. By setting group_by="field", you can swap the nesting order to group columns first by genotype field name, then by sample name.

df = (
    ox.from_vcf(
        "data/sample.vcf.gz",
        info_fields=None,
    )
    .with_samples(
        ['NA12891', 'NA12892'],
        genotype_fields=['AD', 'DP', 'GQ', 'PL', 'TP'],
        group_by="field",
    )    
).pl()
df.head()

shape: (5, 8)

chrom	pos	id	ref	alt	qual	filter	samples
cat	i32	list[str]	str	list[str]	f32	list[str]	struct[5]
"1"	65872	[]	"T"	["G"]	44.18	[]	{{[14, 2],[14, 2]},{16,16},{21,21},{[21, 0, 439],[21, 0, 437]},{18,18}}
"1"	69511	[]	"A"	["G"]	2552.929932	[]	{{null,[0, 39]},{null,39},{null,99},{null,[1289, 117, 0]},{null,null}}
"1"	762273	[]	"G"	["A"]	19085.929688	[]	{{[0, 82],[0, 68]},{82,68},{99,99},{[2952, 247, 0],[2485, 204, 0]},{127,127}}
"1"	866511	[]	"C"	["CCCCT"]	3136.889893	[]	{{[0, 13],[0, 9]},{13,9},{37,27},{[512, 37, 0],[402, 27, 0]},{26,26}}
"1"	876499	[]	"A"	["G"]	3338.929932	[]	{{[0, 17],[0, 9]},{17,9},{51,27},{[645, 51, 0],[355, 27, 0]},{26,26}}

df.unnest("samples").head()

shape: (5, 12)

chrom	pos	id	ref	alt	qual	filter	AD	DP	GQ	PL	TP
cat	i32	list[str]	str	list[str]	f32	list[str]	struct[2]	struct[2]	struct[2]	struct[2]	struct[2]
"1"	65872	[]	"T"	["G"]	44.18	[]	{[14, 2],[14, 2]}	{16,16}	{21,21}	{[21, 0, 439],[21, 0, 437]}	{18,18}
"1"	69511	[]	"A"	["G"]	2552.929932	[]	{null,[0, 39]}	{null,39}	{null,99}	{null,[1289, 117, 0]}	{null,null}
"1"	762273	[]	"G"	["A"]	19085.929688	[]	{[0, 82],[0, 68]}	{82,68}	{99,99}	{[2952, 247, 0],[2485, 204, 0]}	{127,127}
"1"	866511	[]	"C"	["CCCCT"]	3136.889893	[]	{[0, 13],[0, 9]}	{13,9}	{37,27}	{[512, 37, 0],[402, 27, 0]}	{26,26}
"1"	876499	[]	"A"	["G"]	3338.929932	[]	{[0, 17],[0, 9]}	{17,9}	{51,27}	{[645, 51, 0],[355, 27, 0]}	{26,26}

In this case, each genotype field column is a data series containing the values of that field associated with each of the samples.

df.unnest("samples")['DP'].struct.unnest().head()

shape: (5, 2)

NA12891	NA12892
i32	i32
16	16
null	39
82	68
13	9
17	9
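
The two nesting orders hold the same values and differ only in which key comes first. A minimal sketch with plain dictionaries, using the DP and GQ values of the third record (position 762273) from the tables above; regroup_by_field is a hypothetical helper, not part of Oxbow.

```python
def regroup_by_field(by_sample):
    """Swap {sample: {field: value}} into {field: {sample: value}}."""
    by_field = {}
    for sample, fields in by_sample.items():
        for field, value in fields.items():
            by_field.setdefault(field, {})[sample] = value
    return by_field

# group_by="sample": sample name first, then genotype field.
by_sample = {
    "NA12891": {"DP": 82, "GQ": 99},
    "NA12892": {"DP": 68, "GQ": 99},
}
# group_by="field": genotype field first, then sample name.
by_field = regroup_by_field(by_sample)
print(by_field)
```

Grouping by field is convenient when you want one value per sample side by side, for example to compare read depth across samples.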

BED schemas#

Oxbow understands BEDn+m schema specifiers to interpret the contents of BED files.

ox.from_bed("data/sample.bed", bed_schema="bed3+").pl().head()

shape: (5, 4)

chrom	start	end	rest
str	i64	i64	str
"chr1"	1100000	1200000	"A1 . . 1100000 1200000 226,56,…
"chr1"	1550000	1600000	"A1 . . 1550000 1600000 226,56,…
"chr1"	1900000	2450000	"A1 . . 1900000 2450000 226,56,…
"chr10"	50000	250000	"AB . . 50000 250000 94,189,62"
"chr10"	250000	650000	"A2 . . 250000 650000 247,130,0"

ox.from_bed("data/sample.bed", bed_schema="bed3+6").pl().head()

shape: (5, 9)

chrom	start	end	BED3+1	BED3+2	BED3+3	BED3+4	BED3+5	BED3+6
str	i64	i64	str	str	str	str	str	str
"chr1"	1100000	1200000	"A1"	"."	"."	"1100000"	"1200000"	"226,56,56"
"chr1"	1550000	1600000	"A1"	"."	"."	"1550000"	"1600000"	"226,56,56"
"chr1"	1900000	2450000	"A1"	"."	"."	"1900000"	"2450000"	"226,56,56"
"chr10"	50000	250000	"AB"	"."	"."	"50000"	"250000"	"94,189,62"
"chr10"	250000	650000	"A2"	"."	"."	"250000"	"650000"	"247,130,0"

ox.from_bed("data/sample.bed", bed_schema="bed9").pl().head()

shape: (5, 9)

chrom	start	end	name	score	strand	thickStart	thickEnd	itemRgb
str	i64	i64	str	u16	cat	i64	i64	array[u8, 3]
"chr1"	1100000	1200000	"A1"	null	null	1100000	1200000	[226, 56, 56]
"chr1"	1550000	1600000	"A1"	null	null	1550000	1600000	[226, 56, 56]
"chr1"	1900000	2450000	"A1"	null	null	1900000	2450000	[226, 56, 56]
"chr10"	50000	250000	"AB"	null	null	50000	250000	[94, 189, 62]
"chr10"	250000	650000	"A2"	null	null	250000	650000	[247, 130, 0]
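
The three specifiers used above can be read as follows: bedN selects the first N standard fields, a trailing + collects everything else into a single rest column, and +M splits the remainder into M string columns. The sketch below reproduces the column names seen in the outputs. bed_columns is a hypothetical helper, and the names of standard fields 10 to 12 are taken from the UCSC BED format rather than from any output shown here.

```python
import re

STANDARD = [
    "chrom", "start", "end", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb",
    "blockCount", "blockSizes", "blockStarts",
]

def bed_columns(spec):
    """Return the column names implied by a BEDn, BEDn+ or BEDn+m specifier."""
    m = re.fullmatch(r"bed(\d+)(\+(\d*))?", spec.lower())
    if m is None:
        raise ValueError(f"unrecognized BED schema: {spec!r}")
    n = int(m.group(1))
    if not 3 <= n <= 12:
        raise ValueError("n must be between 3 and 12")
    names = STANDARD[:n]
    if m.group(2) is None:      # "bed9": standard fields only
        return names
    if m.group(3) == "":        # "bed3+": one unparsed remainder column
        return names + ["rest"]
    extra = int(m.group(3))     # "bed3+6": six extra string columns
    return names + [f"BED{n}+{i}" for i in range(1, extra + 1)]

print(bed_columns("bed3+"))
print(bed_columns("bed3+6"))
print(bed_columns("bed9"))
```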

BigBed AutoSql#

BigBed records natively store genomic coordinate fields and a flat string containing the “rest” of the data (equivalent to a bed3+ schema).

ox.from_bigbed("data/autosql-sample.bb").pl().head()

shape: (5, 4)

chrom	start	end	rest
str	u32	u32	str
"chr1"	11868	14409	"ENST00000456328.2 1000 + 11868…
"chr1"	14403	29570	"ENST00000488147.1 1000 - 14403…
"chr1"	17368	17436	"ENST00000619216.1 1000 - 17368…
"chr1"	29553	31097	"ENST00000473358.1 1000 + 29553…
"chr1"	30365	30503	"ENST00000607096.1 1000 + 30365…

If a BigBed file contains AutoSql definitions of its record fields and types, Oxbow can parse them.

ox.from_bigbed("data/autosql-sample.bb", schema="autosql").pl().head()

shape: (5, 20)

chrom	start	end	name	score	strand	thickStart	thickEnd	reserved	blockCount	blockSizes	chromStarts	name2	cdsStartStat	cdsEndStat	exonFrames	type	geneName	geneName2	geneType
str	u32	u32	str	u32	str	u32	u32	u32	i32	list[i32]	list[i32]	str	str	str	list[i32]	str	str	str	str
"chr1"	11868	14409	"ENST00000456328.2"	1000	"+"	11868	11868	null	3	[359, 109, 1189]	[0, 744, 1352]	"DDX11L1"	"none"	"none"	[-1, -1, -1]	"none"	"ENST00000456328.2"	"DDX11L1"	"none"
"chr1"	14403	29570	"ENST00000488147.1"	1000	"-"	14403	14403	null	11	[98, 34, … 37]	[0, 601, … 15130]	"WASH7P"	"none"	"none"	[-1, -1, … -1]	"none"	"ENST00000488147.1"	"WASH7P"	"none"
"chr1"	17368	17436	"ENST00000619216.1"	1000	"-"	17368	17368	null	1	[68]	[0]	"MIR6859-2"	"none"	"none"	[-1]	"none"	"ENST00000619216.1"	"MIR6859-2"	"none"
"chr1"	29553	31097	"ENST00000473358.1"	1000	"+"	29553	29553	null	3	[486, 104, 122]	[0, 1010, 1422]	"MIR1302-11"	"none"	"none"	[-1, -1, -1]	"none"	"ENST00000473358.1"	"MIR1302-11"	"none"
"chr1"	30365	30503	"ENST00000607096.1"	1000	"+"	30365	30365	null	1	[138]	[0]	"MIR1302-9"	"none"	"none"	[-1]	"none"	"ENST00000607096.1"	"MIR1302-9"	"none"

Custom BED schemas#

You can impose a custom parsing interpretation—field names and types (beyond the first three fields)—on a BED or BigBed file as long as the text values in those fields are compatible with the types you impose.

Pass in a BED schema as a tuple of (str, dict[str, str]): a specifier for the 3 to 12 standard BED fields ("bed{n}") plus custom extended fields encoded as a dictionary mapping field name to type name. Types can be declared using C-style AutoSql names (string, short, float, double, etc.) or Rust-style numeric shorthands (i8, u8, i32, f32, f64, etc.). Fixed and variable-length array types can be declared using int[10] and int[] (AutoSql style) or [i32; 10] and [i32] (Rust shorthand style).

(
    ox.from_bigbed(
        "data/autosql-sample.bb", 
        schema=("bed4", {"score": "double", "strand": "string"})
    )
    .pl()
    .head()
)

shape: (5, 6)

chrom	start	end	name	score	strand
str	u32	u32	str	f64	str
"chr1"	11868	14409	"ENST00000456328.2"	1000.0	"+"
"chr1"	14403	29570	"ENST00000488147.1"	1000.0	"-"
"chr1"	17368	17436	"ENST00000619216.1"	1000.0	"-"
"chr1"	29553	31097	"ENST00000473358.1"	1000.0	"+"
"chr1"	30365	30503	"ENST00000607096.1"	1000.0	"+"

narrowpeak = (
    "bed6",
    {"fold_change": "f64", "-log10p": "f64", "-log10q": "f64", "relSummit": "i64"}
)
(
    ox.from_bed(
        "data/ENCFF758CQW.100.bed.gz", 
        bed_schema=narrowpeak,
        compression="gzip"
    )
    .pl()
    .head()
)

shape: (5, 10)

chrom	start	end	name	score	strand	fold_change	-log10p	-log10q	relSummit
str	i64	i64	str	u16	cat	f64	f64	f64	i64
"chr1"	86499906	86500478	null	1000	null	269.56463	-1.0	4.53508	306
"chr7"	25565806	25566365	null	1000	null	267.92568	-1.0	4.53508	275
"chr14"	49862021	49862498	null	1000	null	266.99777	-1.0	4.53508	212
"chr20"	58209727	58210234	null	1000	null	262.28789	-1.0	4.53508	273
"chr7"	151172497	151172982	null	1000	null	261.30677	-1.0	4.53508	242
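
The two type notations describe the same things. The sketch below normalizes a declaration in either style to an (element type, shape) pair. The name table follows the widths of the UCSC AutoSql types; whether Oxbow accepts exactly this set of names is an assumption, and normalize_type is a hypothetical helper, not part of the library.

```python
import re

# AutoSql scalar names and the Rust-style shorthand of the same width.
AUTOSQL_TO_RUST = {
    "byte": "i8", "ubyte": "u8",
    "short": "i16", "ushort": "u16",
    "int": "i32", "uint": "u32",
    "float": "f32", "double": "f64",
}

def normalize_type(decl):
    """Return (element_type, shape); shape is "scalar", "list", or a fixed length."""
    decl = decl.strip()
    rust = re.fullmatch(r"\[\s*(\w+)\s*(?:;\s*(\d+)\s*)?\]", decl)
    autosql = re.fullmatch(r"(\w+)\s*\[\s*(\d*)\s*\]", decl)
    if rust:
        elem, size = rust.group(1), rust.group(2)
    elif autosql:
        elem, size = autosql.group(1), autosql.group(2) or None
    else:
        elem, size = decl, "scalar"
    elem = AUTOSQL_TO_RUST.get(elem, elem)
    if size == "scalar":
        return elem, "scalar"
    return elem, ("list" if size is None else int(size))

for decl in ["double", "string", "int[]", "int[10]", "[i32]", "[i32; 10]"]:
    print(decl, "->", normalize_type(decl))
```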

Zoom levels#

The UCSC BBI formats store multiple “zoom” or “reduction” levels. These are tables of fixed-resolution genomic bins containing summary statistics of the signal of a BigWig track or the interval coverage depth of a BigBed track.

ds = ox.from_bigwig("data/sample.bw")
ds.zoom_levels

[2621440, 10485760, 41943040]

ds.zoom(ds.zoom_levels[1]).regions("chr21").pl()

shape: (5, 8)

chrom	start	end	bases_covered	min	max	sum	sum_squares
cat	u32	u32	u64	f64	f64	f64	f64
"chr21"	9486505	17408540	90	20.0	80.0	4000.0	224000.0
"chr21"	17829945	26140885	155	0.0	80.0	7900.0	470000.0
"chr21"	27133600	36015675	205	0.0	80.0	9000.0	472000.0
"chr21"	36097355	44412085	190	0.0	80.0	7200.0	376000.0
"chr21"	45704025	48129895	65	20.0	80.0	2800.0	148000.0
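
The summary columns are sufficient to derive per-bin statistics such as the mean and standard deviation of the signal. The sketch below does this for the first bin in the table above. It assumes that sum and sum_squares are totals over the covered bases only; summarize is a hypothetical helper, not an Oxbow function.

```python
import math

def summarize(bases_covered, total, sum_squares):
    """Mean and population standard deviation over the covered bases of one bin."""
    mean = total / bases_covered
    # Clamp at zero to guard against tiny negative values from rounding.
    variance = max(sum_squares / bases_covered - mean ** 2, 0.0)
    return mean, math.sqrt(variance)

# First bin above: bases_covered=90, sum=4000.0, sum_squares=224000.0
mean, std = summarize(90, 4000.0, 224000.0)
print(round(mean, 2), round(std, 2))
```

The same arithmetic can be applied to whole columns of the zoom-level data frame.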

Remote files and file-like objects#

You can pull data directly from HTTP and cloud storage URLs. If needed, paths or URLs to index files must be given explicitly.

ds = ox.from_bam(
    "https://oxbow-ngs.s3.us-east-2.amazonaws.com/example.bam",
    index="https://oxbow-ngs.s3.us-east-2.amazonaws.com/example.bam.bai"
)

Instead of file paths or URLs, the source and index inputs to a data source can be callables that open a binary I/O stream, i.e. zero-argument functions that return any Python file-like object.

ds = ox.from_bam(
    lambda : open("sample.bam", "rb"),
    index=lambda : open("sample.bam.bai", "rb"),
)

This gives you the power to customize your own transports – to read remote sources, diverse file system implementations, or different file encodings – independently of oxbow itself!
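As a minimal sketch of that idea, the snippet below builds such callables with the standard library only. The helper names `make_opener` and `make_bytes_opener` are hypothetical, and the file written here is a placeholder, not a valid BAM file. The point is only that a callable can be invoked more than once, each call yielding a fresh, independent stream.

```python
import io
import tempfile
from pathlib import Path


def make_opener(path):
    """Return a zero-argument callable that opens `path` as a binary stream."""
    def opener():
        return open(path, "rb")
    return opener


def make_bytes_opener(payload):
    """Return a zero-argument callable that serves in-memory bytes."""
    return lambda: io.BytesIO(payload)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.bin"
    path.write_bytes(b"placeholder bytes, not a real BAM file")

    open_source = make_opener(path)
    # Each call returns its own stream with its own read position.
    with open_source() as first, open_source() as second:
        first.read(4)
        positions = (first.tell(), second.tell())

in_memory = make_bytes_opener(b"\x00\x01\x02\x03")
header = in_memory().read(2)

# With real files, the callables would be passed as in the example above:
# ds = ox.from_bam(make_opener("sample.bam"), index=make_opener("sample.bam.bai"))
```
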

Libraries like fsspec or smart_open can be used for this purpose.

from fsspec.implementations.cached import CachingFileSystem
from s3fs import S3FileSystem

# HTTPS access through fsspec, wrapped in a local caching layer
url = "https://oxbow-ngs.s3.us-east-2.amazonaws.com/example.bam"
httpfs = CachingFileSystem(target_protocol="https")
ds = ox.from_bam(
    lambda : httpfs.open(url, "rb"),
    index=lambda : httpfs.open(url + ".bai", "rb"),
)

# Anonymous (unsigned) access to a public S3 bucket
s3fs = S3FileSystem(anon=True)
s3_uri = "s3://oxbow-ngs/example.bam"
ds = ox.from_bam(
    lambda : s3fs.open(s3_uri, "rb"),
    index=lambda : s3fs.open(s3_uri + ".bai", "rb"),
    tag_defs=[],
)
# The index allows a range query against the remote file
ds.regions("chr1:82744-85000").pl()

Streams and Fragments#

An oxbow data source object streams data via a sequence of Arrow RecordBatches. This stream is exposed as an iterator and you can use it to materialize each batch manually.

ds = ox.from_bam("data/sample.bam", batch_size=100)
batch = next(ds.batches())
batch

arro3.core.RecordBatch
+----------------------------------------+--------+-------------------------+---------+-------+-------+-------------------------+-----------+--------+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------+---------+
| qname                                  | flag   | rname                   | pos     | mapq  | cigar | rnext                   | pnext     | tlen   | seq                                                                         | qual                                                                        | end     |
| Utf8                                   | UInt16 | Dictionary(Int32, Utf8) | Int32   | UInt8 | Utf8  | Dictionary(Int32, Utf8) | Int32     | Int32  | Utf8                                                                        | Utf8                                                                        | Int32   |
+----------------------------------------+--------+-------------------------+---------+-------+-------+-------------------------+-----------+--------+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------+---------+
| HWI-BRUNOP16X_0001:3:48:4861:11838#0   | 163    | chr1                    | 10542   | 0     | 50M   | chr1                    | 10571     | 79     | CGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG                          | gggggggggggggggggggggggggeggggR\_[\ggggghggggggggg                          | 10591   |
| HWI-BRUNOP16X_0001:3:28:6650:168848#0  | 16     | chr1                    | 10546   | 16    | 75M   | null                    | null      | 0      | ATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAGCTCCGCC | fggggggggdgdggcdfggggfgggggggggggggggggggggggggfggggggggggggggggggggggggggg | 10620   |
| HWI-BRUNOP16X_0001:3:8:20066:88158#0   | 16     | chr1                    | 946457  | 0     | 75M   | null                    | null      | 0      | TAGTCCGAGGTCTCCTGAACCTTCCCAAGCAGCTGCTGCACCTGCCGGCAGTAGTTGGCCACCTTGCACTCCCGG | BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBd`aed``__U^__]_ggggcggggd]\\\[\]^]]gggggdfcbb | 946531  |
| HWI-BRUNOP16X_0001:3:27:10302:58768#0  | 16     | chr1                    | 1014060 | 37    | 75M   | null                    | null      | 0      | AGCTGAATGGGCAGGTCCCCCAGAAGATCGGCGTGCACGCCTTCCAGCAGCGTCTGGCTGTCCACCCGAGCGGTG | BBBBBBBBBBBBBBBBcYRcffggfgf_gfg\deegfgfgfcggcggfggggcgggggcgcggfgggggggggeg | 1014134 |
| HWI-BRUNOP16X_0001:3:65:3144:143676#0  | 83     | chr3                    | 196957  | 60    | 50M   | chr3                    | 196008    | -999   | GTAACGCTCCCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGGCTCG                          | BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^STTTSZW`beTTTTTSSTTT                          | 197006  |
| HWI-BRUNOP16X_0001:3:68:13088:156644#0 | 16     | chr3                    | 196958  | 37    | 75M   | null                    | null      | 0      | GACCCCCCCGGCCCCCGGCGCCCCCCCGCCCCGCCCCCGGGCGGGCGGGGGGGAGAAGGCGCCCGAGGGGAGGCG | BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB^bg_`[^\]X`ZZcggggdfgggggggg | 197032  |
| HWI-BRUNOP16X_0001:3:48:3417:101389#0  | 163    | chr3                    | 196961  | 60    | 50M   | chr3                    | 319702    | 122791 | GCTTACCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGGCTCGGGGG                          | gggggggggggggggggggggggfdaggggggdgggfgdhbe\T`BBBBB                          | 197010  |
| HWI-BRUNOP16X_0001:3:46:17583:95767#0  | 161    | chrX                    | 503847  | 0     | 50M   | chr4                    | 185365552 | 0      | TTTTATTTTTTTTTTTGAGATGGAGTCTCGCTCTTGTCACCGAGGCTGGA                          | ddfdfd____dffff]__aeZ]\XZSPSNSSSSSSbbaabZ_``BBBBBB                          | 503896  |
| HWI-BRUNOP16X_0001:3:4:7989:14941#0    | 16     | chrY                    | 586185  | 0     | 75M   | null                    | null      | 0      | GTGCGATCTCGGTTCGCTGCAACCTCTGCTTCCCAGGTTCAAGTGATTCTCCGGCCTCAGCCTCCCAAGTAGCNN | BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB | 586259  |
| HWI-BRUNOP16X_0001:3:44:11450:50194#0  | 0      | chrY                    | 587561  | 0     | 75M   | null                    | null      | 0      | NNTGCAGTGAGCTGAGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGGTAGACTGTGTCTCAAAAAAAAAAA | BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB | 587635  |
+----------------------------------------+--------+-------------------------+---------+-------+-------+-------------------------+-----------+--------+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------+---------+

pl.from_arrow(batch)

shape: (10, 12)

qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end
str	u16	cat	i32	u8	str	cat	i32	i32	str	str	i32
"HWI-BRUNOP16X_0001:3:48:4861:1…	163	"chr1"	10542	0	"50M"	"chr1"	10571	79	"CGAAATCTGTGCAGAGGAGAACGCAGCTCC…	"gggggggggggggggggggggggggegggg…	10591
"HWI-BRUNOP16X_0001:3:28:6650:1…	16	"chr1"	10546	16	"75M"	null	null	0	"ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…	"fggggggggdgdggcdfggggfgggggggg…	10620
"HWI-BRUNOP16X_0001:3:8:20066:8…	16	"chr1"	946457	0	"75M"	null	null	0	"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	946531
"HWI-BRUNOP16X_0001:3:27:10302:…	16	"chr1"	1014060	37	"75M"	null	null	0	"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…	"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…	1014134
"HWI-BRUNOP16X_0001:3:65:3144:1…	83	"chr3"	196957	60	"50M"	"chr3"	196008	-999	"GTAACGCTCCCGGACCCTGCGCGCCCCCGT…	"BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^…	197006
"HWI-BRUNOP16X_0001:3:68:13088:…	16	"chr3"	196958	37	"75M"	null	null	0	"GACCCCCCCGGCCCCCGGCGCCCCCCCGCC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	197032
"HWI-BRUNOP16X_0001:3:48:3417:1…	163	"chr3"	196961	60	"50M"	"chr3"	319702	122791	"GCTTACCGGACCCTGCGCGCCCCCGTCCCG…	"gggggggggggggggggggggggfdagggg…	197010
"HWI-BRUNOP16X_0001:3:46:17583:…	161	"chrX"	503847	0	"50M"	"chr4"	185365552	0	"TTTTATTTTTTTTTTTGAGATGGAGTCTCG…	"ddfdfd____dffff]__aeZ]\XZSPSNS…	503896
"HWI-BRUNOP16X_0001:3:4:7989:14…	16	"chrY"	586185	0	"75M"	null	null	0	"GTGCGATCTCGGTTCGCTGCAACCTCTGCT…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	586259
"HWI-BRUNOP16X_0001:3:44:11450:…	0	"chrY"	587561	0	"75M"	null	null	0	"NNTGCAGTGAGCTGAGATTGTGCCACTGCA…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	587635

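Because batches arrive one at a time, an aggregate can be computed without holding the whole file in memory. The sketch below counts rows over any iterable of batches. It assumes each batch exposes a `num_rows` attribute, as Arrow record batches in pyarrow and arro3 do, and it uses stand-in objects instead of a real data source so that it runs on its own. The helper name `count_rows` is hypothetical.

```python
from types import SimpleNamespace


def count_rows(batches):
    """Sum the row counts of a stream of record batches, one batch at a time."""
    total = 0
    for batch in batches:
        # Only the current batch needs to be resident in memory.
        total += batch.num_rows
    return total


# Stand-ins for record batches; a generator mimics a lazy stream.
fake_batches = (SimpleNamespace(num_rows=n) for n in (100, 100, 37))
total = count_rows(fake_batches)
print(total)

# With a real data source, the call would look like:
# count_rows(ox.from_bam("data/sample.bam", batch_size=100).batches())
```
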
Data sources can be logically grouped into fragments. Without random access, a data source contains only a single fragment.

ds = ox.from_bam("data/sample.bam")
ds.fragments()

[<oxbow._pyarrow.BatchReaderFragment at 0x7c928e83da90>]

When you register range queries, each query gets mapped to a unique fragment. Each fragment generates an independent stream of record batches.

ds = ox.from_bam("data/sample.bam").regions(["chr1", "chr3", "chrX"])
ds.fragments()

[<oxbow._pyarrow.BatchReaderFragment at 0x7c9287752350>,
 <oxbow._pyarrow.BatchReaderFragment at 0x7c92877c2060>,
 <oxbow._pyarrow.BatchReaderFragment at 0x7c92877c1a70>]

Dask data frames#

Dask takes a different approach from the streaming paradigm of Polars and DuckDB: it subdivides a data set into a known number of independently accessible logical partitions, each of which is expected to fit in memory. When you convert an Oxbow data source into a Dask data frame, Oxbow maps fragments to partitions:

df = (
    ox.from_bam("data/sample.bam")
    .regions(["chr1", "chrX", "chrY"])
    .dd()  # or to_dask()
)
df

Dask DataFrame Structure:

	qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end
npartitions=3
	string	uint16	category[known]	int32	uint8	string	category[known]	int32	int32	string	string	int32
	...	...	...	...	...	...	...	...	...	...	...	...
	...	...	...	...	...	...	...	...	...	...	...	...
	...	...	...	...	...	...	...	...	...	...	...	...

Dask Name: to_string_dtype, 2 expressions

df.partitions[1].compute()

	qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end
0	HWI-BRUNOP16X_0001:3:46:17583:95767#0	161	chrX	503847	0	50M	chr4	185365552	0	TTTTATTTTTTTTTTTGAGATGGAGTCTCGCTCTTGTCACCGAGGC...	ddfdfd____dffff]__aeZ]\XZSPSNSSSSSSbbaabZ_``BB...	503896
