Quickstart

Quickstart#

This is a quickstart guide to using Oxbow. Oxbow lets you access potentially larger-than-memory genomic files as tabular data structures, such as data frames.

Create a DataSource#

Use the convenience function associated with your file type. The returned DataSource object can be used to access the data in the file.

import oxbow as ox

ds = ox.from_bam("data/sample.bam")

Into data frames#

If the dataset fits comfortably in memory, you can materialize it fully as a Pandas or Polars data frame.

ds.pd()  # or ds.to_pandas()

	qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end	tags
0	HWI-BRUNOP16X_0001:3:48:4861:11838#0	163	chr1	10542	0	50M	chr1	10571.0	79	CGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCT...	gggggggggggggggggggggggggeggggR\_[\ggggghggggg...	10591	{'AM': 0.0, 'MD': '18C31', 'NM': 1, 'RG': 'bra...
1	HWI-BRUNOP16X_0001:3:28:6650:168848#0	16	chr1	10546	16	75M	NaN	NaN	0	ATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG...	fggggggggdgdggcdfggggfgggggggggggggggggggggggg...	10620	{'AM': None, 'MD': '14C52A7', 'NM': 2, 'RG': '...
2	HWI-BRUNOP16X_0001:3:8:20066:88158#0	16	chr1	946457	0	75M	NaN	NaN	0	TAGTCCGAGGTCTCCTGAACCTTCCCAAGCAGCTGCTGCACCTGCC...	BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBd`aed``__U^__]_g...	946531	{'AM': None, 'MD': '2T0G5T65', 'NM': 3, 'RG': ...
3	HWI-BRUNOP16X_0001:3:27:10302:58768#0	16	chr1	1014060	37	75M	NaN	NaN	0	AGCTGAATGGGCAGGTCCCCCAGAAGATCGGCGTGCACGCCTTCCA...	BBBBBBBBBBBBBBBBcYRcffggfgf_gfg\deegfgfgfcggcg...	1014134	{'AM': None, 'MD': '7G1C4A2A57', 'NM': 4, 'RG'...
4	HWI-BRUNOP16X_0001:3:65:3144:143676#0	83	chr3	196957	60	50M	chr3	196008.0	-999	GTAACGCTCCCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGG...	BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^STTTSZW`beTTTTTS...	197006	{'AM': 37.0, 'MD': '0C0A0G1G0C0T1A41', 'NM': 7...
5	HWI-BRUNOP16X_0001:3:68:13088:156644#0	16	chr3	196958	37	75M	NaN	NaN	0	GACCCCCCCGGCCCCCGGCGCCCCCCCGCCCCGCCCCCGGGCGGGC...	BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB...	197032	{'AM': None, 'MD': '0A0G0A0G1T0T0A1C1G0A3T4G6T...
6	HWI-BRUNOP16X_0001:3:48:3417:101389#0	163	chr3	196961	60	50M	chr3	319702.0	122791	GCTTACCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGGCTCG...	gggggggggggggggggggggggfdaggggggdgggfgdhbe\T`B...	197010	{'AM': 37.0, 'MD': '50', 'NM': 0, 'RG': 'brain...
7	HWI-BRUNOP16X_0001:3:46:17583:95767#0	161	chrX	503847	0	50M	chr4	185365552.0	0	TTTTATTTTTTTTTTTGAGATGGAGTCTCGCTCTTGTCACCGAGGC...	ddfdfd____dffff]__aeZ]\XZSPSNSSSSSSbbaabZ_``BB...	503896	{'AM': 0.0, 'MD': '4T36C8', 'NM': 2, 'RG': 'br...
8	HWI-BRUNOP16X_0001:3:4:7989:14941#0	16	chrY	586185	0	75M	NaN	NaN	0	GTGCGATCTCGGTTCGCTGCAACCTCTGCTTCCCAGGTTCAAGTGA...	BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB...	586259	{'AM': None, 'MD': '4C10A10C16C29T0G0', 'NM': ...
9	HWI-BRUNOP16X_0001:3:44:11450:50194#0	0	chrY	587561	0	75M	NaN	NaN	0	NNTGCAGTGAGCTGAGATTGTGCCACTGCACTCCAGCCTGGGTGAC...	BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB...	587635	{'AM': None, 'MD': '0G0G48T0G23', 'NM': 4, 'RG...

ds.pl()  # or ds.to_polars()

shape: (10, 13)

qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end	tags
str	u16	cat	i32	u8	str	cat	i32	i32	str	str	i32	struct[12]
"HWI-BRUNOP16X_0001:3:48:4861:1…	163	"chr1"	10542	0	"50M"	"chr1"	10571	79	"CGAAATCTGTGCAGAGGAGAACGCAGCTCC…	"gggggggggggggggggggggggggegggg…	10591	{0,"18C31",1,"brain_50_fcb",0,3,8,null,0,1,0,"82"}
"HWI-BRUNOP16X_0001:3:28:6650:1…	16	"chr1"	10546	16	"75M"	null	null	0	"ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…	"fggggggggdgdggcdfggggfgggggggg…	10620	{null,"14C52A7",2,"brain_75_fca",null,1,5,null,0,2,0,"85"}
"HWI-BRUNOP16X_0001:3:8:20066:8…	16	"chr1"	946457	0	"75M"	null	null	0	"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	946531	{null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"}
"HWI-BRUNOP16X_0001:3:27:10302:…	16	"chr1"	1014060	37	"75M"	null	null	0	"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…	"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…	1014134	{null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"}
"HWI-BRUNOP16X_0001:3:65:3144:1…	83	"chr3"	196957	60	"50M"	"chr3"	196008	-999	"GTAACGCTCCCGGACCCTGCGCGCCCCCGT…	"BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^…	197006	{37,"0C0A0G1G0C0T1A41",7,"brain_50_fcb",37,1,0,null,0,7,0,"85"}
"HWI-BRUNOP16X_0001:3:68:13088:…	16	"chr3"	196958	37	"75M"	null	null	0	"GACCCCCCCGGCCCCCGGCGCCCCCCCGCC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	197032	{null,"0A0G0A0G1T0T0A1C1G0A3T4G6T4G1T2C3C3T0C27",19,"brain_75_fca",null,1,0,null,0,19,0,"85"}
"HWI-BRUNOP16X_0001:3:48:3417:1…	163	"chr3"	196961	60	"50M"	"chr3"	319702	122791	"GCTTACCGGACCCTGCGCGCCCCCGTCCCG…	"gggggggggggggggggggggggfdagggg…	197010	{37,"50",0,"brain_50_fcb",37,1,0,null,0,0,0,"85"}
"HWI-BRUNOP16X_0001:3:46:17583:…	161	"chrX"	503847	0	"50M"	"chr4"	185365552	0	"TTTTATTTTTTTTTTTGAGATGGAGTCTCG…	"ddfdfd____dffff]__aeZ]\XZSPSNS…	503896	{0,"4T36C8",2,"brain_50_fcb",0,18,174,null,0,2,0,"82"}
"HWI-BRUNOP16X_0001:3:4:7989:14…	16	"chrY"	586185	0	"75M"	null	null	0	"GTGCGATCTCGGTTCGCTGCAACCTCTGCT…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	586259	{null,"4C10A10C16C29T0G0",6,"brain_75_fca",null,2,2,"X,-586185,75M,6;3,+196723225,75M,7;19,+13666092,75M,7;",0,6,0,"82"}
"HWI-BRUNOP16X_0001:3:44:11450:…	0	"chrY"	587561	0	"75M"	null	null	0	"NNTGCAGTGAGCTGAGATTGTGCCACTGCA…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	587635	{null,"0G0G48T0G23",4,"brain_75_fca",null,6,54,null,0,4,0,"82"}

Into lazy data structures#

If the data source is very large, you can also load it into a lazy or “out-of-core” data structure, such as a Polars lazy frame or Dask data frame.

df = ds.pl(lazy=True)
df.show_graph()

../_images/705ef5a0788b7340fdbd4c206bf057a36c07dba9e34ccaef2aaa3b5251a61613.svg

df.head().collect()

shape: (5, 13)

qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end	tags
str	u16	cat	i32	u8	str	cat	i32	i32	str	str	i32	struct[12]
"HWI-BRUNOP16X_0001:3:48:4861:1…	163	"chr1"	10542	0	"50M"	"chr1"	10571	79	"CGAAATCTGTGCAGAGGAGAACGCAGCTCC…	"gggggggggggggggggggggggggegggg…	10591	{0,"18C31",1,"brain_50_fcb",0,3,8,null,0,1,0,"82"}
"HWI-BRUNOP16X_0001:3:28:6650:1…	16	"chr1"	10546	16	"75M"	null	null	0	"ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…	"fggggggggdgdggcdfggggfgggggggg…	10620	{null,"14C52A7",2,"brain_75_fca",null,1,5,null,0,2,0,"85"}
"HWI-BRUNOP16X_0001:3:8:20066:8…	16	"chr1"	946457	0	"75M"	null	null	0	"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	946531	{null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"}
"HWI-BRUNOP16X_0001:3:27:10302:…	16	"chr1"	1014060	37	"75M"	null	null	0	"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…	"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…	1014134	{null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"}
"HWI-BRUNOP16X_0001:3:65:3144:1…	83	"chr3"	196957	60	"50M"	"chr3"	196008	-999	"GTAACGCTCCCGGACCCTGCGCGCCCCCGT…	"BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^…	197006	{37,"0C0A0G1G0C0T1A41",7,"brain_50_fcb",37,1,0,null,0,7,0,"85"}

Oxbow data sources can also be loaded into a DuckDB relation.

import duckdb

conn = duckdb.connect(":memory:")
ds = ox.from_gtf("data/gencode.v47.annotation.gtf")
rel = ds.to_duckdb(conn)
conn.sql(
    "SELECT seqid as chrom, type, start, rel.end, strand, attributes.gene_name " \
    "FROM rel " \
    "WHERE attributes.gene_name = 'PCSK9'" \
    "LIMIT 10"
).pl()

Note

See the Streams and Fragments section for details on building Dask data frames.

Range queries#

Data sources with indexes support querying genomic ranges. This is the case for htslib formats that are compressed with the BGZF gzip variant and indexed with an appropriate companion index file (e.g., .bai, .tbi, .csi). The BBI formats, BigWig and BigBed, possess an internal index and support range queries without an index file.

You can specify one or more ranges to the constructor or pass them to the regions() method. All records overlapping the query ranges will be returned.

ds = ox.from_bam("data/sample.bam", index="data/sample.bam.bai")
ds = ds.regions("chr1:900000-1100000")

ds.pl()

shape: (2, 13)

qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end	tags
str	u16	cat	i32	u8	str	cat	i32	i32	str	str	i32	struct[12]
"HWI-BRUNOP16X_0001:3:8:20066:8…	16	"chr1"	946457	0	"75M"	null	null	0	"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	946531	{null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"}
"HWI-BRUNOP16X_0001:3:27:10302:…	16	"chr1"	1014060	37	"75M"	null	null	0	"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…	"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…	1014134	{null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"}

If the index file exists in the same location as the source file, it is automatically detected.

ox.from_bam("data/sample.bam").regions(["chr1", "chr3"]).pl()

shape: (7, 13)

qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end	tags
str	u16	cat	i32	u8	str	cat	i32	i32	str	str	i32	struct[12]
"HWI-BRUNOP16X_0001:3:48:4861:1…	163	"chr1"	10542	0	"50M"	"chr1"	10571	79	"CGAAATCTGTGCAGAGGAGAACGCAGCTCC…	"gggggggggggggggggggggggggegggg…	10591	{0,"18C31",1,"brain_50_fcb",0,3,8,null,0,1,0,"82"}
"HWI-BRUNOP16X_0001:3:28:6650:1…	16	"chr1"	10546	16	"75M"	null	null	0	"ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…	"fggggggggdgdggcdfggggfgggggggg…	10620	{null,"14C52A7",2,"brain_75_fca",null,1,5,null,0,2,0,"85"}
"HWI-BRUNOP16X_0001:3:8:20066:8…	16	"chr1"	946457	0	"75M"	null	null	0	"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	946531	{null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"}
"HWI-BRUNOP16X_0001:3:27:10302:…	16	"chr1"	1014060	37	"75M"	null	null	0	"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…	"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…	1014134	{null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"}
"HWI-BRUNOP16X_0001:3:65:3144:1…	83	"chr3"	196957	60	"50M"	"chr3"	196008	-999	"GTAACGCTCCCGGACCCTGCGCGCCCCCGT…	"BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^…	197006	{37,"0C0A0G1G0C0T1A41",7,"brain_50_fcb",37,1,0,null,0,7,0,"85"}
"HWI-BRUNOP16X_0001:3:68:13088:…	16	"chr3"	196958	37	"75M"	null	null	0	"GACCCCCCCGGCCCCCGGCGCCCCCCCGCC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	197032	{null,"0A0G0A0G1T0T0A1C1G0A3T4G6T4G1T2C3C3T0C27",19,"brain_75_fca",null,1,0,null,0,19,0,"85"}
"HWI-BRUNOP16X_0001:3:48:3417:1…	163	"chr3"	196961	60	"50M"	"chr3"	319702	122791	"GCTTACCGGACCCTGCGCGCCCCCGTCCCG…	"gggggggggggggggggggggggfdagggg…	197010	{37,"50",0,"brain_50_fcb",37,1,0,null,0,0,0,"85"}

ox.from_bigwig("data/sample.bw").regions("chr21:10900000-15000000").pl()

shape: (3, 4)

chrom	start	end	value
str	u32	u32	f32
"chr21"	10971770	10971775	40.0
"chr21"	14787100	14787105	60.0
"chr21"	14959050	14959055	20.0

Note

Oxbow handles multiple ranges as separate fragments. For more details, see the Streams and Fragments section below.

Column projection#

Oxbow lets you select only the columns you need and will not parse the others.

import polars as pl

ox.from_bam(
    "data/sample.bam", 
    fields=["rname", "pos", "end", "mapq"],
    tag_defs=[],
).regions(
    "chr1"
).pl()

shape: (4, 4)

rname	pos	end	mapq
cat	i32	i32	u8
"chr1"	10542	10591	0
"chr1"	10546	10620	16
"chr1"	946457	946531	0
"chr1"	1014060	1014134	37

In data systems lingo, selecting columns is also known as “projection”. The lazy data structures returned by an oxbow data source are able to “push down” the projection operation to oxbow to prevent full record parsing. In the following example, only the four fields passed to the polars LazyFrame.select method will be parsed when the output gets computed.

df = (
    ox.from_bam("data/sample.bam")
    .regions("chr1")
    .pl(lazy=True)
    .select(
        pl.col("rname").alias("chrom"),
        pl.col("pos").alias("start"),
        "end",
        "mapq"
    )
    .collect()
)
df

shape: (4, 4)

chrom	start	end	mapq
cat	i32	i32	u8
"chr1"	10542	10591	0
"chr1"	10546	10620	16
"chr1"	946457	946531	0
"chr1"	1014060	1014134	37

Nested and complex fields#

Oxbow can handle the complex field structures of genomics file formats because they can all be mapped to Arrow constructs like lists, arrays, and structs.

For example, fields like SAM tags, VCF info and samples, and GTF attributes are exposed as struct columns in Arrow-native libraries like Polars, which are easy and efficient to manipulate.

SAM/BAM tags#

The htslib alignment formats, SAM and BAM, have optional fields called tags that are defined inline, rather than in a header or manifest. These definitions, a tuple of a tag name and type code, can be provided explicitly to the data source constructor for projection.

df = (
    ox.from_bam(
        "data/sample.bam", 
        fields=[],
        tag_defs=[('MD', 'Z'), ('NM', 'C')]
    )
    .regions("chr1")
    .pl()
    .select(
        pl.col("tags").struct.unnest()
    )
)
df

shape: (4, 2)

MD	NM
str	i64
"18C31"	1
"14C52A7"	2
"2T0G5T65"	3
"7G1C4A2A57"	4

By default, oxbow will scan an initial number of rows to discover tag definitions (determined by tag_scan_rows). Set tag_defs=[] to ignore tags entirely.

df = (
    ox.from_bam(
        "data/sample.bam", 
        tag_defs=[],
    )
    .regions("chr1")
    .pl()
)
df

shape: (4, 12)

qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end
str	u16	cat	i32	u8	str	cat	i32	i32	str	str	i32
"HWI-BRUNOP16X_0001:3:48:4861:1…	163	"chr1"	10542	0	"50M"	"chr1"	10571	79	"CGAAATCTGTGCAGAGGAGAACGCAGCTCC…	"gggggggggggggggggggggggggegggg…	10591
"HWI-BRUNOP16X_0001:3:28:6650:1…	16	"chr1"	10546	16	"75M"	null	null	0	"ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…	"fggggggggdgdggcdfggggfgggggggg…	10620
"HWI-BRUNOP16X_0001:3:8:20066:8…	16	"chr1"	946457	0	"75M"	null	null	0	"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	946531
"HWI-BRUNOP16X_0001:3:27:10302:…	16	"chr1"	1014060	37	"75M"	null	null	0	"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…	"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…	1014134

GTF/GFF attributes#

GTF/GFF attributes are analogous to SAM tags. For GTF, the type is always "String". For GFF, attributes can be "String" or "Array", the latter materializing as a list column.

df = (
    ox.from_gff("data/sample.gff")
    .pl()
)
df.head()

shape: (5, 9)

seqid	source	type	start	end	score	strand	frame	attributes
str	str	str	i32	i32	f32	str	u8	struct[18]
"chr13"	"HAVANA"	"exon"	81326030	81326191	null	"+"	null	{"exon:ENST00000782961.1:2","ENST00000782961.1",null,"ENSE00004156517.1","2","ENSG00000229309.3","ENSG00000229309","lncRNA","OTTHUMG00000017146.2",null,null,"2",null,["basic", "Ensembl_canonical", "TAGENE"],"ENST00000782961.1","ENST00000782961",null,"lncRNA"}
"chr6"	"HAVANA"	"CDS"	32002399	32002540	null	"+"	1	{"CDS:ENST00000498271.1","ENST00000498271.1","CCDS59005.1","ENSE00001878698.1","40","ENSG00000244731.10","C4A","protein_coding","OTTHUMG00000031186.6","OTTHUMT00000356896.1","HGNC:1323","2","ENSP00000420212.1",["RNA_Seq_supported_only", "basic", … "CCDS"],"ENST00000498271.1","C4A-246","1","protein_coding"}
"chr10"	"HAVANA"	"exon"	72930538	72930637	null	"+"	null	{"exon:ENST00000334011.10:8","ENST00000334011.10","CCDS7318.1","ENSE00001170094.1","8","ENSG00000138315.13","OIT3","protein_coding","OTTHUMG00000018444.2","OTTHUMT00000048596.2","HGNC:29953","2","ENSP00000333900.5",["basic", "Ensembl_canonical", … "CCDS"],"ENST00000334011.10","OIT3-201","1","protein_coding"}
"chr1"	"HAVANA"	"exon"	497210	497299	null	"-"	null	{"exon:ENST00000641916.1:4","ENST00000641916.1",null,"ENSE00003812605.1","4","ENSG00000290385.2","ENSG00000290385","lncRNA",null,"OTTHUMT00000493599.1",null,"2",null,null,"ENST00000641916.1","ENST00000641916",null,"lncRNA"}
"chr13"	"HAVANA"	"CDS"	35655579	35655749	null	"+"	2	{"CDS:ENST00000629018.4","ENST00000629018.4",null,"ENSE00000938859.1","28","ENSG00000172915.20","NBEA","protein_coding","OTTHUMG00000016724.2",null,"HGNC:7648","2","ENSP00000486239.3",["RNA_Seq_supported_only", "mRNA_start_NF", "cds_start_NF"],"ENST00000629018.4","NBEA-207","5","protein_coding"}

df['attributes'].struct.unnest().head()

shape: (5, 18)

ID	Parent	ccdsid	exon_id	exon_number	gene_id	gene_name	gene_type	havana_gene	havana_transcript	hgnc_id	level	protein_id	tag	transcript_id	transcript_name	transcript_support_level	transcript_type
str	str	str	str	str	str	str	str	str	str	str	str	str	list[str]	str	str	str	str
"exon:ENST00000782961.1:2"	"ENST00000782961.1"	null	"ENSE00004156517.1"	"2"	"ENSG00000229309.3"	"ENSG00000229309"	"lncRNA"	"OTTHUMG00000017146.2"	null	null	"2"	null	["basic", "Ensembl_canonical", "TAGENE"]	"ENST00000782961.1"	"ENST00000782961"	null	"lncRNA"
"CDS:ENST00000498271.1"	"ENST00000498271.1"	"CCDS59005.1"	"ENSE00001878698.1"	"40"	"ENSG00000244731.10"	"C4A"	"protein_coding"	"OTTHUMG00000031186.6"	"OTTHUMT00000356896.1"	"HGNC:1323"	"2"	"ENSP00000420212.1"	["RNA_Seq_supported_only", "basic", … "CCDS"]	"ENST00000498271.1"	"C4A-246"	"1"	"protein_coding"
"exon:ENST00000334011.10:8"	"ENST00000334011.10"	"CCDS7318.1"	"ENSE00001170094.1"	"8"	"ENSG00000138315.13"	"OIT3"	"protein_coding"	"OTTHUMG00000018444.2"	"OTTHUMT00000048596.2"	"HGNC:29953"	"2"	"ENSP00000333900.5"	["basic", "Ensembl_canonical", … "CCDS"]	"ENST00000334011.10"	"OIT3-201"	"1"	"protein_coding"
"exon:ENST00000641916.1:4"	"ENST00000641916.1"	null	"ENSE00003812605.1"	"4"	"ENSG00000290385.2"	"ENSG00000290385"	"lncRNA"	null	"OTTHUMT00000493599.1"	null	"2"	null	null	"ENST00000641916.1"	"ENST00000641916"	null	"lncRNA"
"CDS:ENST00000629018.4"	"ENST00000629018.4"	null	"ENSE00000938859.1"	"28"	"ENSG00000172915.20"	"NBEA"	"protein_coding"	"OTTHUMG00000016724.2"	null	"HGNC:7648"	"2"	"ENSP00000486239.3"	["RNA_Seq_supported_only", "mRNA_start_NF", "cds_start_NF"]	"ENST00000629018.4"	"NBEA-207"	"5"	"protein_coding"

VCF/BCF info fields#

For the htslib variant call formats, VCF and BCF, the subfields of the INFO field are defined in the VCF header, so they do not need to be discovered by sniffing rows and you do not need to specify types.

By default, all info fields are parsed. You can project any subset or ignore them entirely using the info_fields argument.

(
    ox.from_vcf(
        "data/sample.vcf.gz",
        info_fields=[],
        samples=[],
    )
    .pl()
).head()

shape: (5, 7)

chrom	pos	id	ref	alt	qual	filter
cat	i32	list[str]	str	list[str]	f32	list[str]
"1"	65872	[]	"T"	["G"]	44.18	[]
"1"	69511	[]	"A"	["G"]	2552.929932	[]
"1"	762273	[]	"G"	["A"]	19085.929688	[]
"1"	866511	[]	"C"	["CCCCT"]	3136.889893	[]
"1"	876499	[]	"A"	["G"]	3338.929932	[]

df = (
    ox.from_vcf(
        "data/sample.vcf.gz",
        info_fields=["TYPE", "snpeff.Effect", "snpeff.Gene_Name", "snpeff.Transcript_BioType"],
        samples=[],
    )
    .pl()
)
df.head()

shape: (5, 8)

chrom	pos	id	ref	alt	qual	filter	info
cat	i32	list[str]	str	list[str]	f32	list[str]	struct[4]
"1"	65872	[]	"T"	["G"]	44.18	[]	{["SNP"],"intergenic_region",null,null}
"1"	69511	[]	"A"	["G"]	2552.929932	[]	{["SNP"],"sequence_feature[transmembrane_region:Transmembrane_region]","OR4F5","protein_coding"}
"1"	762273	[]	"G"	["A"]	19085.929688	[]	{["SNP"],"non_coding_exon_variant","LINC00115","lincRNA"}
"1"	866511	[]	"C"	["CCCCT"]	3136.889893	[]	{["Insertion"],"intron_variant","SAMD11","protein_coding"}
"1"	876499	[]	"A"	["G"]	3338.929932	[]	{["SNP"],"intron_variant","SAMD11","protein_coding"}

df.unnest("info").head()

shape: (5, 11)

chrom	pos	id	ref	alt	qual	filter	TYPE	snpeff.Effect	snpeff.Gene_Name	snpeff.Transcript_BioType
cat	i32	list[str]	str	list[str]	f32	list[str]	list[str]	str	str	str
"1"	65872	[]	"T"	["G"]	44.18	[]	["SNP"]	"intergenic_region"	null	null
"1"	69511	[]	"A"	["G"]	2552.929932	[]	["SNP"]	"sequence_feature[transmembrane…	"OR4F5"	"protein_coding"
"1"	762273	[]	"G"	["A"]	19085.929688	[]	["SNP"]	"non_coding_exon_variant"	"LINC00115"	"lincRNA"
"1"	866511	[]	"C"	["CCCCT"]	3136.889893	[]	["Insertion"]	"intron_variant"	"SAMD11"	"protein_coding"
"1"	876499	[]	"A"	["G"]	3338.929932	[]	["SNP"]	"intron_variant"	"SAMD11"	"protein_coding"

VCF/BCF sample genotype data#

For the htslib variant call formats, each variant call record is associated with an arbitrary number of so-called FORMAT fields that provide genotype-related information for each sample. Like INFO, these fields are defined in the header.

Using the samples and genotype_fields arguments, you can project any subset of samples as separate struct columns and project any subset of their associated genotype fields.

df = ox.from_vcf(
    "data/sample.vcf.gz",
    info_fields=[],
    samples=['NA12891', 'NA12892'],
).pl()
df.head()

shape: (5, 9)

chrom	pos	id	ref	alt	qual	filter	NA12891	NA12892
cat	i32	list[str]	str	list[str]	f32	list[str]	struct[6]	struct[6]
"1"	65872	[]	"T"	["G"]	44.18	[]	{[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 439],18}	{[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 437],18}
"1"	69511	[]	"A"	["G"]	2552.929932	[]	{null,null,null,{[null, null],[false, false]},null,null}	{[0, 39],39,99,{[1, 1],[true, true]},[1289, 117, 0],null}
"1"	762273	[]	"G"	["A"]	19085.929688	[]	{[0, 82],82,99,{[1, 1],[true, true]},[2952, 247, 0],127}	{[0, 68],68,99,{[1, 1],[true, true]},[2485, 204, 0],127}
"1"	866511	[]	"C"	["CCCCT"]	3136.889893	[]	{[0, 13],13,37,{[1, 1],[true, true]},[512, 37, 0],26}	{[0, 9],9,27,{[1, 1],[true, true]},[402, 27, 0],26}
"1"	876499	[]	"A"	["G"]	3338.929932	[]	{[0, 17],17,51,{[1, 1],[true, true]},[645, 51, 0],26}	{[0, 9],9,27,{[1, 1],[true, true]},[355, 27, 0],26}

Each sample column is essentially a sub-dataframe of genotype fields.

df['NA12892'].struct.unnest().head()

shape: (5, 6)

AD	DP	GQ	GT	PL	TP
list[i32]	i32	i32	struct[2]	list[i32]	i32
[14, 2]	16	21	{[0, 1],[true, true]}	[21, 0, 437]	18
[0, 39]	39	99	{[1, 1],[true, true]}	[1289, 117, 0]	null
[0, 68]	68	99	{[1, 1],[true, true]}	[2485, 204, 0]	127
[0, 9]	9	27	{[1, 1],[true, true]}	[402, 27, 0]	26
[0, 9]	9	27	{[1, 1],[true, true]}	[355, 27, 0]	26

You can also customize how sample genotype data are nested by using the genotype_by argument. By default (genotype_by="sample"), the columns are grouped first by sample name, then by genotype field name. By setting genotype_by="field", you can swap the nesting order to group columns first by genotype field name, then by sample name.

df = ox.from_vcf(
    "data/sample.vcf.gz",
    info_fields=[],
    samples=['NA12891', 'NA12892'],
    genotype_fields=['AD', 'DP', 'GQ', 'PL', 'TP'],
    genotype_by="field",
).pl()
df.head()

shape: (5, 12)

chrom	pos	id	ref	alt	qual	filter	AD	DP	GQ	PL	TP
cat	i32	list[str]	str	list[str]	f32	list[str]	struct[2]	struct[2]	struct[2]	struct[2]	struct[2]
"1"	65872	[]	"T"	["G"]	44.18	[]	{[14, 2],[14, 2]}	{16,16}	{21,21}	{[21, 0, 439],[21, 0, 437]}	{18,18}
"1"	69511	[]	"A"	["G"]	2552.929932	[]	{null,[0, 39]}	{null,39}	{null,99}	{null,[1289, 117, 0]}	{null,null}
"1"	762273	[]	"G"	["A"]	19085.929688	[]	{[0, 82],[0, 68]}	{82,68}	{99,99}	{[2952, 247, 0],[2485, 204, 0]}	{127,127}
"1"	866511	[]	"C"	["CCCCT"]	3136.889893	[]	{[0, 13],[0, 9]}	{13,9}	{37,27}	{[512, 37, 0],[402, 27, 0]}	{26,26}
"1"	876499	[]	"A"	["G"]	3338.929932	[]	{[0, 17],[0, 9]}	{17,9}	{51,27}	{[645, 51, 0],[355, 27, 0]}	{26,26}

In this case, each genotype field column is a data series containing the values of that field associated with each of the samples.

df['DP'].struct.unnest().head()

shape: (5, 2)

NA12891	NA12892
i32	i32
16	16
null	39
82	68
13	9
17	9

BED schemas#

Oxbow understands BEDn+m schema specifiers to interpret the contents of BED files.

ox.from_bed("data/sample.bed", bed_schema="bed3+").pl().head()

shape: (5, 4)

chrom	start	end	rest
str	i64	i64	str
"chr1"	1100001	1200000	"A1 . . 1100000 1200000 226,56,…
"chr1"	1550001	1600000	"A1 . . 1550000 1600000 226,56,…
"chr1"	1900001	2450000	"A1 . . 1900000 2450000 226,56,…
"chr10"	50001	250000	"AB . . 50000 250000 94,189,62"
"chr10"	250001	650000	"A2 . . 250000 650000 247,130,0"

ox.from_bed("data/sample.bed", bed_schema="bed3+6").pl().head()

shape: (5, 9)

chrom	start	end	BED3+1	BED3+2	BED3+3	BED3+4	BED3+5	BED3+6
str	i64	i64	str	str	str	str	str	str
"chr1"	1100001	1200000	"A1"	"."	"."	"1100000"	"1200000"	"226,56,56"
"chr1"	1550001	1600000	"A1"	"."	"."	"1550000"	"1600000"	"226,56,56"
"chr1"	1900001	2450000	"A1"	"."	"."	"1900000"	"2450000"	"226,56,56"
"chr10"	50001	250000	"AB"	"."	"."	"50000"	"250000"	"94,189,62"
"chr10"	250001	650000	"A2"	"."	"."	"250000"	"650000"	"247,130,0"

ox.from_bed("data/sample.bed", bed_schema="bed9").pl().head()

shape: (5, 9)

chrom	start	end	name	score	strand	thickStart	thickEnd	itemRgb
str	i64	i64	str	u16	cat	i64	i64	array[u8, 3]
"chr1"	1100001	1200000	"A1"	null	null	1100000	1200000	[226, 56, 56]
"chr1"	1550001	1600000	"A1"	null	null	1550000	1600000	[226, 56, 56]
"chr1"	1900001	2450000	"A1"	null	null	1900000	2450000	[226, 56, 56]
"chr10"	50001	250000	"AB"	null	null	50000	250000	[94, 189, 62]
"chr10"	250001	650000	"A2"	null	null	250000	650000	[247, 130, 0]

BigBed AutoSql#

Oxbow can also parse BigBed records that contain AutoSql definitions of the records.

ox.from_bigbed("data/autosql-sample.bb").pl().head()

shape: (5, 4)

chrom	start	end	rest
str	u32	u32	str
"chr1"	11868	14409	"ENST00000456328.2 1000 + 11868…
"chr1"	14403	29570	"ENST00000488147.1 1000 - 14403…
"chr1"	17368	17436	"ENST00000619216.1 1000 - 17368…
"chr1"	29553	31097	"ENST00000473358.1 1000 + 29553…
"chr1"	30365	30503	"ENST00000607096.1 1000 + 30365…

ox.from_bigbed("data/autosql-sample.bb", schema="autosql").pl().head()

shape: (5, 20)

chrom	start	end	name	score	strand	thickStart	thickEnd	reserved	blockCount	blockSizes	chromStarts	name2	cdsStartStat	cdsEndStat	exonFrames	type	geneName	geneName2	geneType
str	u32	u32	str	u32	str	u32	u32	u32	i32	list[i32]	list[i32]	str	str	str	list[i32]	str	str	str	str
"chr1"	11868	14409	"ENST00000456328.2"	1000	"+"	11868	11868	null	3	[359, 109, 1189]	[0, 744, 1352]	"DDX11L1"	"none"	"none"	[-1, -1, -1]	"none"	"ENST00000456328.2"	"DDX11L1"	"none"
"chr1"	14403	29570	"ENST00000488147.1"	1000	"-"	14403	14403	null	11	[98, 34, … 37]	[0, 601, … 15130]	"WASH7P"	"none"	"none"	[-1, -1, … -1]	"none"	"ENST00000488147.1"	"WASH7P"	"none"
"chr1"	17368	17436	"ENST00000619216.1"	1000	"-"	17368	17368	null	1	[68]	[0]	"MIR6859-2"	"none"	"none"	[-1]	"none"	"ENST00000619216.1"	"MIR6859-2"	"none"
"chr1"	29553	31097	"ENST00000473358.1"	1000	"+"	29553	29553	null	3	[486, 104, 122]	[0, 1010, 1422]	"MIR1302-11"	"none"	"none"	[-1, -1, -1]	"none"	"ENST00000473358.1"	"MIR1302-11"	"none"
"chr1"	30365	30503	"ENST00000607096.1"	1000	"+"	30365	30365	null	1	[138]	[0]	"MIR1302-9"	"none"	"none"	[-1]	"none"	"ENST00000607096.1"	"MIR1302-9"	"none"

Zoom levels#

The UCSC BBI formats store multiple “zoom” or “reduction” levels. These are tables of fixed-resolution genomic bins containing summary statistics of the signal of a BigWig track track or the interval coverage depth of a BigBed track.

ds = ox.from_bigwig("data/sample.bw")
ds.zoom_levels

[2621440, 10485760, 41943040]

ds.zoom(ds.zoom_levels[1]).regions("chr21").pl()

shape: (5, 8)

chrom	start	end	bases_covered	min	max	sum	sum_squares
cat	u32	u32	u64	f64	f64	f64	f64
"chr21"	9486505	17408540	90	20.0	80.0	4000.0	224000.0
"chr21"	17829945	26140885	155	0.0	80.0	7900.0	470000.0
"chr21"	27133600	36015675	205	0.0	80.0	9000.0	472000.0
"chr21"	36097355	44412085	190	0.0	80.0	7200.0	376000.0
"chr21"	45704025	48129895	65	20.0	80.0	2800.0	148000.0

Remote files and file-like objects#

Instead of using file paths, source and index inputs to create a data source can alternatively be callables that open a binary I/O stream, i.e. any Python file-like object.

ds = ox.from_bam(
    lambda : open("sample.bam", "rb"),
    index=lambda : open("sample.bam.bai", "rb"),
)

This gives you the power to customize your own transports – to read remote sources, diverse file system implementations, or different file encodings – independently of oxbow itself.

Libraries like fsspec or smart_open can be used for this purpose.

from fsspec.implementations.cached import CachingFileSystem
from s3fs import S3FileSystem

url = "https://oxbow-ngs.s3.us-east-2.amazonaws.com/example.bam"
httpfs = CachingFileSystem(target_protocol="https")
ds = ox.from_bam(
    lambda : httpfs.open(url, "rb"),
    index=lambda : httpfs.open(url + ".bai", "rb"),
)

s3fs = S3FileSystem(anon=True)
s3_uri = "s3://oxbow-ngs/example.bam"
ds = ox.from_bam(
    lambda : s3fs.open(s3_uri, "rb"),
    index=lambda : s3fs.open(s3_uri + ".bai", "rb"),
    tag_defs=[],
)

Streams and Fragments#

An oxbow data source object streams data via a sequence of Arrow RecordBatches. This stream is exposed as an iterator and you can use it to materialize each batch manually.

ds = ox.from_bam("data/sample.bam", batch_size=100)
batch = next(ds.batches())
batch

pyarrow.RecordBatch
qname: string
flag: uint16
rname: dictionary<values=string, indices=int32, ordered=0>
pos: int32
mapq: uint8
cigar: string
rnext: dictionary<values=string, indices=int32, ordered=0>
pnext: int32
tlen: int32
seq: string
qual: string
end: int32
tags: struct<AM: int64, MD: string, NM: int64, RG: string, SM: int64, X0: int64, X1: int64, XA: string, XG: int64, XM: int64, XO: int64, XT: string>
  child 0, AM: int64
  child 1, MD: string
  child 2, NM: int64
  child 3, RG: string
  child 4, SM: int64
  child 5, X0: int64
  child 6, X1: int64
  child 7, XA: string
  child 8, XG: int64
  child 9, XM: int64
  child 10, XO: int64
  child 11, XT: string
----
qname: ["HWI-BRUNOP16X_0001:3:48:4861:11838#0","HWI-BRUNOP16X_0001:3:28:6650:168848#0","HWI-BRUNOP16X_0001:3:8:20066:88158#0","HWI-BRUNOP16X_0001:3:27:10302:58768#0","HWI-BRUNOP16X_0001:3:65:3144:143676#0","HWI-BRUNOP16X_0001:3:68:13088:156644#0","HWI-BRUNOP16X_0001:3:48:3417:101389#0","HWI-BRUNOP16X_0001:3:46:17583:95767#0","HWI-BRUNOP16X_0001:3:4:7989:14941#0","HWI-BRUNOP16X_0001:3:44:11450:50194#0"]
flag: [163,16,16,16,83,16,163,161,16,0]
rname: -- dictionary:
["chrY","chr20","chrX","chr13","chr22","chr10","chr6","chr19","chr14","chr18",...,"chr11","chr17","chr8","chr7","chr15","chr12","chr1","chr16","chr5","chr3"]-- indices:
[20,20,20,20,23,23,23,2,0,0]
pos: [10542,10546,946457,1014060,196957,196958,196961,503847,586185,587561]
mapq: [0,16,0,37,60,37,60,0,0,0]
cigar: ["50M","75M","75M","75M","50M","75M","50M","50M","75M","75M"]
rnext: -- dictionary:
["chrY","chr20","chrX","chr13","chr22","chr10","chr6","chr19","chr14","chr18",...,"chr11","chr17","chr8","chr7","chr15","chr12","chr1","chr16","chr5","chr3"]-- indices:
[20,null,null,null,23,null,23,11,null,null]
pnext: [10571,null,null,null,196008,null,319702,185365552,null,null]
tlen: [79,0,0,0,-999,0,122791,0,0,0]
seq: ["CGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG","ATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAGCTCCGCC","TAGTCCGAGGTCTCCTGAACCTTCCCAAGCAGCTGCTGCACCTGCCGGCAGTAGTTGGCCACCTTGCACTCCCGG","AGCTGAATGGGCAGGTCCCCCAGAAGATCGGCGTGCACGCCTTCCAGCAGCGTCTGGCTGTCCACCCGAGCGGTG","GTAACGCTCCCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGGCTCG","GACCCCCCCGGCCCCCGGCGCCCCCCCGCCCCGCCCCCGGGCGGGCGGGGGGGAGAAGGCGCCCGAGGGGAGGCG","GCTTACCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGGCTCGGGGG","TTTTATTTTTTTTTTTGAGATGGAGTCTCGCTCTTGTCACCGAGGCTGGA","GTGCGATCTCGGTTCGCTGCAACCTCTGCTTCCCAGGTTCAAGTGATTCTCCGGCCTCAGCCTCCCAAGTAGCNN","NNTGCAGTGAGCTGAGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGGTAGACTGTGTCTCAAAAAAAAAAA"]
...

pl.from_arrow(batch)

shape: (10, 13)

qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end	tags
str	u16	cat	i32	u8	str	cat	i32	i32	str	str	i32	struct[12]
"HWI-BRUNOP16X_0001:3:48:4861:1…	163	"chr1"	10542	0	"50M"	"chr1"	10571	79	"CGAAATCTGTGCAGAGGAGAACGCAGCTCC…	"gggggggggggggggggggggggggegggg…	10591	{0,"18C31",1,"brain_50_fcb",0,3,8,null,0,1,0,"82"}
"HWI-BRUNOP16X_0001:3:28:6650:1…	16	"chr1"	10546	16	"75M"	null	null	0	"ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…	"fggggggggdgdggcdfggggfgggggggg…	10620	{null,"14C52A7",2,"brain_75_fca",null,1,5,null,0,2,0,"85"}
"HWI-BRUNOP16X_0001:3:8:20066:8…	16	"chr1"	946457	0	"75M"	null	null	0	"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	946531	{null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"}
"HWI-BRUNOP16X_0001:3:27:10302:…	16	"chr1"	1014060	37	"75M"	null	null	0	"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…	"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…	1014134	{null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"}
"HWI-BRUNOP16X_0001:3:65:3144:1…	83	"chr3"	196957	60	"50M"	"chr3"	196008	-999	"GTAACGCTCCCGGACCCTGCGCGCCCCCGT…	"BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^…	197006	{37,"0C0A0G1G0C0T1A41",7,"brain_50_fcb",37,1,0,null,0,7,0,"85"}
"HWI-BRUNOP16X_0001:3:68:13088:…	16	"chr3"	196958	37	"75M"	null	null	0	"GACCCCCCCGGCCCCCGGCGCCCCCCCGCC…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	197032	{null,"0A0G0A0G1T0T0A1C1G0A3T4G6T4G1T2C3C3T0C27",19,"brain_75_fca",null,1,0,null,0,19,0,"85"}
"HWI-BRUNOP16X_0001:3:48:3417:1…	163	"chr3"	196961	60	"50M"	"chr3"	319702	122791	"GCTTACCGGACCCTGCGCGCCCCCGTCCCG…	"gggggggggggggggggggggggfdagggg…	197010	{37,"50",0,"brain_50_fcb",37,1,0,null,0,0,0,"85"}
"HWI-BRUNOP16X_0001:3:46:17583:…	161	"chrX"	503847	0	"50M"	"chr4"	185365552	0	"TTTTATTTTTTTTTTTGAGATGGAGTCTCG…	"ddfdfd____dffff]__aeZ]\XZSPSNS…	503896	{0,"4T36C8",2,"brain_50_fcb",0,18,174,null,0,2,0,"82"}
"HWI-BRUNOP16X_0001:3:4:7989:14…	16	"chrY"	586185	0	"75M"	null	null	0	"GTGCGATCTCGGTTCGCTGCAACCTCTGCT…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	586259	{null,"4C10A10C16C29T0G0",6,"brain_75_fca",null,2,2,"X,-586185,75M,6;3,+196723225,75M,7;19,+13666092,75M,7;",0,6,0,"82"}
"HWI-BRUNOP16X_0001:3:44:11450:…	0	"chrY"	587561	0	"75M"	null	null	0	"NNTGCAGTGAGCTGAGATTGTGCCACTGCA…	"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…	587635	{null,"0G0G48T0G23",4,"brain_75_fca",null,6,54,null,0,4,0,"82"}

Data sources can be logically grouped into fragments. Without random access, a data source contains only a single fragment.

ds = ox.from_bam("data/sample.bam")
ds.fragments()

[<oxbow._pyarrow.BatchReaderFragment at 0x706d5477fbb0>]

When you register range queries, each query gets mapped to a unique fragment. Each fragment generates an independent stream of record batches.

ds = ox.from_bam("data/sample.bam").regions(["chr1", "chr3", "chrX"])
ds.fragments()

[<oxbow._pyarrow.BatchReaderFragment at 0x706d5477f490>,
 <oxbow._pyarrow.BatchReaderFragment at 0x706d5476deb0>,
 <oxbow._pyarrow.BatchReaderFragment at 0x706d54883ac0>]

Dask data frames#

Dask uses a different approach than the streaming paradigm of Polars and DuckDB: it subdivides a data set into a known number of independently accessible logical partitions, each of which is expected to fit in memory. When you convert an Oxbow data source into a Dask data frame, oxbow maps fragments to partitions:

df = (
    ox.from_bam("data/sample.bam")
    .regions(["chr1", "chrX", "chrY"])
    .dd()  # or to_dask()
)
df

Dask DataFrame Structure:

	qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end	tags
npartitions=3
	string	uint16	category[known]	int32	uint8	string	category[known]	int32	int32	string	string	int32	string
	...	...	...	...	...	...	...	...	...	...	...	...	...
	...	...	...	...	...	...	...	...	...	...	...	...	...
	...	...	...	...	...	...	...	...	...	...	...	...	...

Dask Name: to_string_dtype, 2 expressions

df.partitions[1].compute()

	qname	flag	rname	pos	mapq	cigar	rnext	pnext	tlen	seq	qual	end	tags
0	HWI-BRUNOP16X_0001:3:46:17583:95767#0	161	chrX	503847	0	50M	chr4	185365552	0	TTTTATTTTTTTTTTTGAGATGGAGTCTCGCTCTTGTCACCGAGGC...	ddfdfd____dffff]__aeZ]\XZSPSNSSSSSSbbaabZ_``BB...	503896	{'AM': 0, 'MD': '4T36C8', 'NM': 2, 'RG': 'brai...

Quickstart

Contents

Quickstart#

Create a DataSource#

Into data frames#

Into lazy data structures#

Range queries#

Column projection#

Nested and complex fields#

SAM/BAM tags#

GTF/GFF attributes#

VCF/BCF info fields#

VCF/BCF sample genotype data#

BED schemas#

BigBed AutoSql#

Zoom levels#

Remote files and file-like objects#

Streams and Fragments#

Dask data frames#