Quickstart#

This is a quickstart guide to using Oxbow. Oxbow lets you access potentially larger-than-memory genomic files as tabular data structures, such as data frames.

Create a DataSource#

Use the convenience function associated with your file type. The returned DataSource object can be used to access the data in the file.

import oxbow as ox

ds = ox.from_bam("data/sample.bam")

Into data frames#

If the dataset fits comfortably in memory, you can materialize it fully as a Pandas or Polars data frame.

ds.pd()  # or ds.to_pandas()
qname flag rname pos mapq cigar rnext pnext tlen seq qual end tags
0 HWI-BRUNOP16X_0001:3:48:4861:11838#0 163 chr1 10542 0 50M chr1 10571.0 79 CGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCT... gggggggggggggggggggggggggeggggR\_[\ggggghggggg... 10591 {'AM': 0.0, 'MD': '18C31', 'NM': 1, 'RG': 'bra...
1 HWI-BRUNOP16X_0001:3:28:6650:168848#0 16 chr1 10546 16 75M NaN NaN 0 ATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG... fggggggggdgdggcdfggggfgggggggggggggggggggggggg... 10620 {'AM': None, 'MD': '14C52A7', 'NM': 2, 'RG': '...
2 HWI-BRUNOP16X_0001:3:8:20066:88158#0 16 chr1 946457 0 75M NaN NaN 0 TAGTCCGAGGTCTCCTGAACCTTCCCAAGCAGCTGCTGCACCTGCC... BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBd`aed``__U^__]_g... 946531 {'AM': None, 'MD': '2T0G5T65', 'NM': 3, 'RG': ...
3 HWI-BRUNOP16X_0001:3:27:10302:58768#0 16 chr1 1014060 37 75M NaN NaN 0 AGCTGAATGGGCAGGTCCCCCAGAAGATCGGCGTGCACGCCTTCCA... BBBBBBBBBBBBBBBBcYRcffggfgf_gfg\deegfgfgfcggcg... 1014134 {'AM': None, 'MD': '7G1C4A2A57', 'NM': 4, 'RG'...
4 HWI-BRUNOP16X_0001:3:65:3144:143676#0 83 chr3 196957 60 50M chr3 196008.0 -999 GTAACGCTCCCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGG... BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^STTTSZW`beTTTTTS... 197006 {'AM': 37.0, 'MD': '0C0A0G1G0C0T1A41', 'NM': 7...
5 HWI-BRUNOP16X_0001:3:68:13088:156644#0 16 chr3 196958 37 75M NaN NaN 0 GACCCCCCCGGCCCCCGGCGCCCCCCCGCCCCGCCCCCGGGCGGGC... BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB... 197032 {'AM': None, 'MD': '0A0G0A0G1T0T0A1C1G0A3T4G6T...
6 HWI-BRUNOP16X_0001:3:48:3417:101389#0 163 chr3 196961 60 50M chr3 319702.0 122791 GCTTACCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGGCTCG... gggggggggggggggggggggggfdaggggggdgggfgdhbe\T`B... 197010 {'AM': 37.0, 'MD': '50', 'NM': 0, 'RG': 'brain...
7 HWI-BRUNOP16X_0001:3:46:17583:95767#0 161 chrX 503847 0 50M chr4 185365552.0 0 TTTTATTTTTTTTTTTGAGATGGAGTCTCGCTCTTGTCACCGAGGC... ddfdfd____dffff]__aeZ]\XZSPSNSSSSSSbbaabZ_``BB... 503896 {'AM': 0.0, 'MD': '4T36C8', 'NM': 2, 'RG': 'br...
8 HWI-BRUNOP16X_0001:3:4:7989:14941#0 16 chrY 586185 0 75M NaN NaN 0 GTGCGATCTCGGTTCGCTGCAACCTCTGCTTCCCAGGTTCAAGTGA... BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB... 586259 {'AM': None, 'MD': '4C10A10C16C29T0G0', 'NM': ...
9 HWI-BRUNOP16X_0001:3:44:11450:50194#0 0 chrY 587561 0 75M NaN NaN 0 NNTGCAGTGAGCTGAGATTGTGCCACTGCACTCCAGCCTGGGTGAC... BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB... 587635 {'AM': None, 'MD': '0G0G48T0G23', 'NM': 4, 'RG...
ds.pl()  # or ds.to_polars()
shape: (10, 13)
qnameflagrnameposmapqcigarrnextpnexttlenseqqualendtags
stru16cati32u8strcati32i32strstri32struct[12]
"HWI-BRUNOP16X_0001:3:48:4861:1…163"chr1"105420"50M""chr1"1057179"CGAAATCTGTGCAGAGGAGAACGCAGCTCC…"gggggggggggggggggggggggggegggg…10591{0,"18C31",1,"brain_50_fcb",0,3,8,null,0,1,0,"82"}
"HWI-BRUNOP16X_0001:3:28:6650:1…16"chr1"1054616"75M"nullnull0"ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…"fggggggggdgdggcdfggggfgggggggg…10620{null,"14C52A7",2,"brain_75_fca",null,1,5,null,0,2,0,"85"}
"HWI-BRUNOP16X_0001:3:8:20066:8…16"chr1"9464570"75M"nullnull0"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…946531{null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"}
"HWI-BRUNOP16X_0001:3:27:10302:…16"chr1"101406037"75M"nullnull0"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…1014134{null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"}
"HWI-BRUNOP16X_0001:3:65:3144:1…83"chr3"19695760"50M""chr3"196008-999"GTAACGCTCCCGGACCCTGCGCGCCCCCGT…"BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^…197006{37,"0C0A0G1G0C0T1A41",7,"brain_50_fcb",37,1,0,null,0,7,0,"85"}
"HWI-BRUNOP16X_0001:3:68:13088:…16"chr3"19695837"75M"nullnull0"GACCCCCCCGGCCCCCGGCGCCCCCCCGCC…"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…197032{null,"0A0G0A0G1T0T0A1C1G0A3T4G6T4G1T2C3C3T0C27",19,"brain_75_fca",null,1,0,null,0,19,0,"85"}
"HWI-BRUNOP16X_0001:3:48:3417:1…163"chr3"19696160"50M""chr3"319702122791"GCTTACCGGACCCTGCGCGCCCCCGTCCCG…"gggggggggggggggggggggggfdagggg…197010{37,"50",0,"brain_50_fcb",37,1,0,null,0,0,0,"85"}
"HWI-BRUNOP16X_0001:3:46:17583:…161"chrX"5038470"50M""chr4"1853655520"TTTTATTTTTTTTTTTGAGATGGAGTCTCG…"ddfdfd____dffff]__aeZ]\XZSPSNS…503896{0,"4T36C8",2,"brain_50_fcb",0,18,174,null,0,2,0,"82"}
"HWI-BRUNOP16X_0001:3:4:7989:14…16"chrY"5861850"75M"nullnull0"GTGCGATCTCGGTTCGCTGCAACCTCTGCT…"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…586259{null,"4C10A10C16C29T0G0",6,"brain_75_fca",null,2,2,"X,-586185,75M,6;3,+196723225,75M,7;19,+13666092,75M,7;",0,6,0,"82"}
"HWI-BRUNOP16X_0001:3:44:11450:…0"chrY"5875610"75M"nullnull0"NNTGCAGTGAGCTGAGATTGTGCCACTGCA…"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…587635{null,"0G0G48T0G23",4,"brain_75_fca",null,6,54,null,0,4,0,"82"}

Into lazy data structures#

If the data source is very large, you can also load it into a lazy or “out-of-core” data structure, such as a Polars lazy frame or Dask data frame.

df = ds.pl(lazy=True)
df.show_graph()
../_images/705ef5a0788b7340fdbd4c206bf057a36c07dba9e34ccaef2aaa3b5251a61613.svg
df.head().collect()
shape: (5, 13)
qnameflagrnameposmapqcigarrnextpnexttlenseqqualendtags
stru16cati32u8strcati32i32strstri32struct[12]
"HWI-BRUNOP16X_0001:3:48:4861:1…163"chr1"105420"50M""chr1"1057179"CGAAATCTGTGCAGAGGAGAACGCAGCTCC…"gggggggggggggggggggggggggegggg…10591{0,"18C31",1,"brain_50_fcb",0,3,8,null,0,1,0,"82"}
"HWI-BRUNOP16X_0001:3:28:6650:1…16"chr1"1054616"75M"nullnull0"ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…"fggggggggdgdggcdfggggfgggggggg…10620{null,"14C52A7",2,"brain_75_fca",null,1,5,null,0,2,0,"85"}
"HWI-BRUNOP16X_0001:3:8:20066:8…16"chr1"9464570"75M"nullnull0"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…946531{null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"}
"HWI-BRUNOP16X_0001:3:27:10302:…16"chr1"101406037"75M"nullnull0"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…1014134{null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"}
"HWI-BRUNOP16X_0001:3:65:3144:1…83"chr3"19695760"50M""chr3"196008-999"GTAACGCTCCCGGACCCTGCGCGCCCCCGT…"BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^…197006{37,"0C0A0G1G0C0T1A41",7,"brain_50_fcb",37,1,0,null,0,7,0,"85"}

Oxbow data sources can also be loaded into a DuckDB relation.

import duckdb

conn = duckdb.connect(":memory:")
ds = ox.from_gtf("data/gencode.v47.annotation.gtf")
rel = ds.to_duckdb(conn)
conn.sql(
    "SELECT seqid as chrom, type, start, rel.end, strand, attributes.gene_name " \
    "FROM rel " \
    "WHERE attributes.gene_name = 'PCSK9'" \
    "LIMIT 10"
).pl()

Note

See the Streams and Fragments section for details on building Dask data frames.

Range queries#

Data sources with indexes support querying genomic ranges. This is the case for htslib formats that are compressed with the BGZF gzip variant and indexed with an appropriate companion index file (e.g., .bai, .tbi, .csi). The BBI formats, BigWig and BigBed, possess an internal index and support range queries without an index file.

You can specify one or more ranges to the constructor or pass them to the regions() method. All records overlapping the query ranges will be returned.

ds = ox.from_bam("data/sample.bam", index="data/sample.bam.bai")
ds = ds.regions("chr1:900000-1100000")

ds.pl()
shape: (2, 13)
qnameflagrnameposmapqcigarrnextpnexttlenseqqualendtags
stru16cati32u8strcati32i32strstri32struct[12]
"HWI-BRUNOP16X_0001:3:8:20066:8…16"chr1"9464570"75M"nullnull0"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…946531{null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"}
"HWI-BRUNOP16X_0001:3:27:10302:…16"chr1"101406037"75M"nullnull0"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…1014134{null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"}

If the index file exists in the same location as the source file, it is automatically detected.

ox.from_bam("data/sample.bam").regions(["chr1", "chr3"]).pl()
shape: (7, 13)
qnameflagrnameposmapqcigarrnextpnexttlenseqqualendtags
stru16cati32u8strcati32i32strstri32struct[12]
"HWI-BRUNOP16X_0001:3:48:4861:1…163"chr1"105420"50M""chr1"1057179"CGAAATCTGTGCAGAGGAGAACGCAGCTCC…"gggggggggggggggggggggggggegggg…10591{0,"18C31",1,"brain_50_fcb",0,3,8,null,0,1,0,"82"}
"HWI-BRUNOP16X_0001:3:28:6650:1…16"chr1"1054616"75M"nullnull0"ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…"fggggggggdgdggcdfggggfgggggggg…10620{null,"14C52A7",2,"brain_75_fca",null,1,5,null,0,2,0,"85"}
"HWI-BRUNOP16X_0001:3:8:20066:8…16"chr1"9464570"75M"nullnull0"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…946531{null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"}
"HWI-BRUNOP16X_0001:3:27:10302:…16"chr1"101406037"75M"nullnull0"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…1014134{null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"}
"HWI-BRUNOP16X_0001:3:65:3144:1…83"chr3"19695760"50M""chr3"196008-999"GTAACGCTCCCGGACCCTGCGCGCCCCCGT…"BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^…197006{37,"0C0A0G1G0C0T1A41",7,"brain_50_fcb",37,1,0,null,0,7,0,"85"}
"HWI-BRUNOP16X_0001:3:68:13088:…16"chr3"19695837"75M"nullnull0"GACCCCCCCGGCCCCCGGCGCCCCCCCGCC…"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…197032{null,"0A0G0A0G1T0T0A1C1G0A3T4G6T4G1T2C3C3T0C27",19,"brain_75_fca",null,1,0,null,0,19,0,"85"}
"HWI-BRUNOP16X_0001:3:48:3417:1…163"chr3"19696160"50M""chr3"319702122791"GCTTACCGGACCCTGCGCGCCCCCGTCCCG…"gggggggggggggggggggggggfdagggg…197010{37,"50",0,"brain_50_fcb",37,1,0,null,0,0,0,"85"}
ox.from_bigwig("data/sample.bw").regions("chr21:10900000-15000000").pl()
shape: (3, 4)
chromstartendvalue
stru32u32f32
"chr21"109717701097177540.0
"chr21"147871001478710560.0
"chr21"149590501495905520.0

Note

Oxbow handles multiple ranges as separate fragments. For more details, see the Streams and Fragments section below.

Column projection#

Oxbow lets you select only the columns you need and will not parse the others.

import polars as pl

ox.from_bam(
    "data/sample.bam", 
    fields=["rname", "pos", "end", "mapq"],
    tag_defs=[],
).regions(
    "chr1"
).pl()
shape: (4, 4)
rnameposendmapq
cati32i32u8
"chr1"10542105910
"chr1"105461062016
"chr1"9464579465310
"chr1"1014060101413437

In data systems lingo, selecting columns is also known as “projection”. The lazy data structures returned by an oxbow data source are able to “push down” the projection operation to oxbow to prevent full record parsing. In the following example, only the four fields passed to the polars LazyFrame.select method will be parsed when the output gets computed.

df = (
    ox.from_bam("data/sample.bam")
    .regions("chr1")
    .pl(lazy=True)
    .select(
        pl.col("rname").alias("chrom"),
        pl.col("pos").alias("start"),
        "end",
        "mapq"
    )
    .collect()
)
df
shape: (4, 4)
chromstartendmapq
cati32i32u8
"chr1"10542105910
"chr1"105461062016
"chr1"9464579465310
"chr1"1014060101413437

Nested and complex fields#

Oxbow can handle the complex field structures of genomics file formats because they can all be mapped to Arrow constructs like lists, arrays, and structs.

For example, fields like SAM tags, VCF info and samples, and GTF attributes are exposed as struct columns in Arrow-native libraries like Polars, which are easy and efficient to manipulate.

SAM/BAM tags#

The htslib alignment formats, SAM and BAM, have optional fields called tags that are defined inline, rather than in a header or manifest. These definitions, a tuple of a tag name and type code, can be provided explicitly to the data source constructor for projection.

df = (
    ox.from_bam(
        "data/sample.bam", 
        fields=[],
        tag_defs=[('MD', 'Z'), ('NM', 'C')]
    )
    .regions("chr1")
    .pl()
    .select(
        pl.col("tags").struct.unnest()
    )
)
df
shape: (4, 2)
MDNM
stri64
"18C31"1
"14C52A7"2
"2T0G5T65"3
"7G1C4A2A57"4

By default, oxbow will scan an initial number of rows to discover tag definitions (determined by tag_scan_rows). Set tag_defs=[] to ignore tags entirely.

df = (
    ox.from_bam(
        "data/sample.bam", 
        tag_defs=[],
    )
    .regions("chr1")
    .pl()
)
df
shape: (4, 12)
qnameflagrnameposmapqcigarrnextpnexttlenseqqualend
stru16cati32u8strcati32i32strstri32
"HWI-BRUNOP16X_0001:3:48:4861:1…163"chr1"105420"50M""chr1"1057179"CGAAATCTGTGCAGAGGAGAACGCAGCTCC…"gggggggggggggggggggggggggegggg…10591
"HWI-BRUNOP16X_0001:3:28:6650:1…16"chr1"1054616"75M"nullnull0"ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…"fggggggggdgdggcdfggggfgggggggg…10620
"HWI-BRUNOP16X_0001:3:8:20066:8…16"chr1"9464570"75M"nullnull0"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…946531
"HWI-BRUNOP16X_0001:3:27:10302:…16"chr1"101406037"75M"nullnull0"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…1014134

GTF/GFF attributes#

GTF/GFF attributes are analogous to SAM tags. For GTF, the type is always "String". For GFF, attributes can be "String" or "Array", the latter materializing as a list column.

df = (
    ox.from_gff("data/sample.gff")
    .pl()
)
df.head()
shape: (5, 9)
seqidsourcetypestartendscorestrandframeattributes
strstrstri32i32f32stru8struct[18]
"chr13""HAVANA""exon"8132603081326191null"+"null{"exon:ENST00000782961.1:2","ENST00000782961.1",null,"ENSE00004156517.1","2","ENSG00000229309.3","ENSG00000229309","lncRNA","OTTHUMG00000017146.2",null,null,"2",null,["basic", "Ensembl_canonical", "TAGENE"],"ENST00000782961.1","ENST00000782961",null,"lncRNA"}
"chr6""HAVANA""CDS"3200239932002540null"+"1{"CDS:ENST00000498271.1","ENST00000498271.1","CCDS59005.1","ENSE00001878698.1","40","ENSG00000244731.10","C4A","protein_coding","OTTHUMG00000031186.6","OTTHUMT00000356896.1","HGNC:1323","2","ENSP00000420212.1",["RNA_Seq_supported_only", "basic", … "CCDS"],"ENST00000498271.1","C4A-246","1","protein_coding"}
"chr10""HAVANA""exon"7293053872930637null"+"null{"exon:ENST00000334011.10:8","ENST00000334011.10","CCDS7318.1","ENSE00001170094.1","8","ENSG00000138315.13","OIT3","protein_coding","OTTHUMG00000018444.2","OTTHUMT00000048596.2","HGNC:29953","2","ENSP00000333900.5",["basic", "Ensembl_canonical", … "CCDS"],"ENST00000334011.10","OIT3-201","1","protein_coding"}
"chr1""HAVANA""exon"497210497299null"-"null{"exon:ENST00000641916.1:4","ENST00000641916.1",null,"ENSE00003812605.1","4","ENSG00000290385.2","ENSG00000290385","lncRNA",null,"OTTHUMT00000493599.1",null,"2",null,null,"ENST00000641916.1","ENST00000641916",null,"lncRNA"}
"chr13""HAVANA""CDS"3565557935655749null"+"2{"CDS:ENST00000629018.4","ENST00000629018.4",null,"ENSE00000938859.1","28","ENSG00000172915.20","NBEA","protein_coding","OTTHUMG00000016724.2",null,"HGNC:7648","2","ENSP00000486239.3",["RNA_Seq_supported_only", "mRNA_start_NF", "cds_start_NF"],"ENST00000629018.4","NBEA-207","5","protein_coding"}
df['attributes'].struct.unnest().head()
shape: (5, 18)
IDParentccdsidexon_idexon_numbergene_idgene_namegene_typehavana_genehavana_transcripthgnc_idlevelprotein_idtagtranscript_idtranscript_nametranscript_support_leveltranscript_type
strstrstrstrstrstrstrstrstrstrstrstrstrlist[str]strstrstrstr
"exon:ENST00000782961.1:2""ENST00000782961.1"null"ENSE00004156517.1""2""ENSG00000229309.3""ENSG00000229309""lncRNA""OTTHUMG00000017146.2"nullnull"2"null["basic", "Ensembl_canonical", "TAGENE"]"ENST00000782961.1""ENST00000782961"null"lncRNA"
"CDS:ENST00000498271.1""ENST00000498271.1""CCDS59005.1""ENSE00001878698.1""40""ENSG00000244731.10""C4A""protein_coding""OTTHUMG00000031186.6""OTTHUMT00000356896.1""HGNC:1323""2""ENSP00000420212.1"["RNA_Seq_supported_only", "basic", … "CCDS"]"ENST00000498271.1""C4A-246""1""protein_coding"
"exon:ENST00000334011.10:8""ENST00000334011.10""CCDS7318.1""ENSE00001170094.1""8""ENSG00000138315.13""OIT3""protein_coding""OTTHUMG00000018444.2""OTTHUMT00000048596.2""HGNC:29953""2""ENSP00000333900.5"["basic", "Ensembl_canonical", … "CCDS"]"ENST00000334011.10""OIT3-201""1""protein_coding"
"exon:ENST00000641916.1:4""ENST00000641916.1"null"ENSE00003812605.1""4""ENSG00000290385.2""ENSG00000290385""lncRNA"null"OTTHUMT00000493599.1"null"2"nullnull"ENST00000641916.1""ENST00000641916"null"lncRNA"
"CDS:ENST00000629018.4""ENST00000629018.4"null"ENSE00000938859.1""28""ENSG00000172915.20""NBEA""protein_coding""OTTHUMG00000016724.2"null"HGNC:7648""2""ENSP00000486239.3"["RNA_Seq_supported_only", "mRNA_start_NF", "cds_start_NF"]"ENST00000629018.4""NBEA-207""5""protein_coding"

VCF/BCF info fields#

For the htslib variant call formats, VCF and BCF, the subfields of the INFO field are defined in the VCF header, so they do not need to be discovered by sniffing rows and you do not need to specify types.

By default, all info fields are parsed. You can project any subset or ignore them entirely using the info_fields argument.

(
    ox.from_vcf(
        "data/sample.vcf.gz",
        info_fields=[],
        samples=[],
    )
    .pl()
).head()
shape: (5, 7)
chromposidrefaltqualfilter
cati32list[str]strlist[str]f32list[str]
"1"65872[]"T"["G"]44.18[]
"1"69511[]"A"["G"]2552.929932[]
"1"762273[]"G"["A"]19085.929688[]
"1"866511[]"C"["CCCCT"]3136.889893[]
"1"876499[]"A"["G"]3338.929932[]
df = (
    ox.from_vcf(
        "data/sample.vcf.gz",
        info_fields=["TYPE", "snpeff.Effect", "snpeff.Gene_Name", "snpeff.Transcript_BioType"],
        samples=[],
    )
    .pl()
)
df.head()
shape: (5, 8)
chromposidrefaltqualfilterinfo
cati32list[str]strlist[str]f32list[str]struct[4]
"1"65872[]"T"["G"]44.18[]{["SNP"],"intergenic_region",null,null}
"1"69511[]"A"["G"]2552.929932[]{["SNP"],"sequence_feature[transmembrane_region:Transmembrane_region]","OR4F5","protein_coding"}
"1"762273[]"G"["A"]19085.929688[]{["SNP"],"non_coding_exon_variant","LINC00115","lincRNA"}
"1"866511[]"C"["CCCCT"]3136.889893[]{["Insertion"],"intron_variant","SAMD11","protein_coding"}
"1"876499[]"A"["G"]3338.929932[]{["SNP"],"intron_variant","SAMD11","protein_coding"}
df.unnest("info").head()
shape: (5, 11)
chromposidrefaltqualfilterTYPEsnpeff.Effectsnpeff.Gene_Namesnpeff.Transcript_BioType
cati32list[str]strlist[str]f32list[str]list[str]strstrstr
"1"65872[]"T"["G"]44.18[]["SNP"]"intergenic_region"nullnull
"1"69511[]"A"["G"]2552.929932[]["SNP"]"sequence_feature[transmembrane…"OR4F5""protein_coding"
"1"762273[]"G"["A"]19085.929688[]["SNP"]"non_coding_exon_variant""LINC00115""lincRNA"
"1"866511[]"C"["CCCCT"]3136.889893[]["Insertion"]"intron_variant""SAMD11""protein_coding"
"1"876499[]"A"["G"]3338.929932[]["SNP"]"intron_variant""SAMD11""protein_coding"

VCF/BCF sample genotype data#

For the htslib variant call formats, each variant call record is associated with an arbitrary number of so-called FORMAT fields that provide genotype-related information for each sample. Like INFO, these fields are defined in the header.

Using the samples and genotype_fields arguments, you can project any subset of samples as separate struct columns and project any subset of their associated genotype fields.

df = ox.from_vcf(
    "data/sample.vcf.gz",
    info_fields=[],
    samples=['NA12891', 'NA12892'],
).pl()
df.head()
shape: (5, 9)
chromposidrefaltqualfilterNA12891NA12892
cati32list[str]strlist[str]f32list[str]struct[6]struct[6]
"1"65872[]"T"["G"]44.18[]{[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 439],18}{[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 437],18}
"1"69511[]"A"["G"]2552.929932[]{null,null,null,{[null, null],[false, false]},null,null}{[0, 39],39,99,{[1, 1],[true, true]},[1289, 117, 0],null}
"1"762273[]"G"["A"]19085.929688[]{[0, 82],82,99,{[1, 1],[true, true]},[2952, 247, 0],127}{[0, 68],68,99,{[1, 1],[true, true]},[2485, 204, 0],127}
"1"866511[]"C"["CCCCT"]3136.889893[]{[0, 13],13,37,{[1, 1],[true, true]},[512, 37, 0],26}{[0, 9],9,27,{[1, 1],[true, true]},[402, 27, 0],26}
"1"876499[]"A"["G"]3338.929932[]{[0, 17],17,51,{[1, 1],[true, true]},[645, 51, 0],26}{[0, 9],9,27,{[1, 1],[true, true]},[355, 27, 0],26}

Each sample column is essentially a sub-dataframe of genotype fields.

df['NA12892'].struct.unnest().head()
shape: (5, 6)
ADDPGQGTPLTP
list[i32]i32i32struct[2]list[i32]i32
[14, 2]1621{[0, 1],[true, true]}[21, 0, 437]18
[0, 39]3999{[1, 1],[true, true]}[1289, 117, 0]null
[0, 68]6899{[1, 1],[true, true]}[2485, 204, 0]127
[0, 9]927{[1, 1],[true, true]}[402, 27, 0]26
[0, 9]927{[1, 1],[true, true]}[355, 27, 0]26

You can also customize how sample genotype data are nested by using the genotype_by argument. By default (genotype_by="sample"), the columns are grouped first by sample name, then by genotype field name. By setting genotype_by="field", you can swap the nesting order to group columns first by genotype field name, then by sample name.

df = ox.from_vcf(
    "data/sample.vcf.gz",
    info_fields=[],
    samples=['NA12891', 'NA12892'],
    genotype_fields=['AD', 'DP', 'GQ', 'PL', 'TP'],
    genotype_by="field",
).pl()
df.head()
shape: (5, 12)
chromposidrefaltqualfilterADDPGQPLTP
cati32list[str]strlist[str]f32list[str]struct[2]struct[2]struct[2]struct[2]struct[2]
"1"65872[]"T"["G"]44.18[]{[14, 2],[14, 2]}{16,16}{21,21}{[21, 0, 439],[21, 0, 437]}{18,18}
"1"69511[]"A"["G"]2552.929932[]{null,[0, 39]}{null,39}{null,99}{null,[1289, 117, 0]}{null,null}
"1"762273[]"G"["A"]19085.929688[]{[0, 82],[0, 68]}{82,68}{99,99}{[2952, 247, 0],[2485, 204, 0]}{127,127}
"1"866511[]"C"["CCCCT"]3136.889893[]{[0, 13],[0, 9]}{13,9}{37,27}{[512, 37, 0],[402, 27, 0]}{26,26}
"1"876499[]"A"["G"]3338.929932[]{[0, 17],[0, 9]}{17,9}{51,27}{[645, 51, 0],[355, 27, 0]}{26,26}

In this case, each genotype field column is a data series containing the values of that field associated with each of the samples.

df['DP'].struct.unnest().head()
shape: (5, 2)
NA12891NA12892
i32i32
1616
null39
8268
139
179

BED schemas#

Oxbow understands BEDn+m schema specifiers to interpret the contents of BED files.

ox.from_bed("data/sample.bed", bed_schema="bed3+").pl().head()
shape: (5, 4)
chromstartendrest
stri64i64str
"chr1"11000011200000"A1 . . 1100000 1200000 226,56,…
"chr1"15500011600000"A1 . . 1550000 1600000 226,56,…
"chr1"19000012450000"A1 . . 1900000 2450000 226,56,…
"chr10"50001250000"AB . . 50000 250000 94,189,62"
"chr10"250001650000"A2 . . 250000 650000 247,130,0"
ox.from_bed("data/sample.bed", bed_schema="bed3+6").pl().head()
shape: (5, 9)
chromstartendBED3+1BED3+2BED3+3BED3+4BED3+5BED3+6
stri64i64strstrstrstrstrstr
"chr1"11000011200000"A1""."".""1100000""1200000""226,56,56"
"chr1"15500011600000"A1""."".""1550000""1600000""226,56,56"
"chr1"19000012450000"A1""."".""1900000""2450000""226,56,56"
"chr10"50001250000"AB""."".""50000""250000""94,189,62"
"chr10"250001650000"A2""."".""250000""650000""247,130,0"
ox.from_bed("data/sample.bed", bed_schema="bed9").pl().head()
shape: (5, 9)
chromstartendnamescorestrandthickStartthickEnditemRgb
stri64i64stru16cati64i64array[u8, 3]
"chr1"11000011200000"A1"nullnull11000001200000[226, 56, 56]
"chr1"15500011600000"A1"nullnull15500001600000[226, 56, 56]
"chr1"19000012450000"A1"nullnull19000002450000[226, 56, 56]
"chr10"50001250000"AB"nullnull50000250000[94, 189, 62]
"chr10"250001650000"A2"nullnull250000650000[247, 130, 0]

BigBed AutoSql#

Oxbow can also parse BigBed records that contain AutoSql definitions of the records.

ox.from_bigbed("data/autosql-sample.bb").pl().head()
shape: (5, 4)
chromstartendrest
stru32u32str
"chr1"1186814409"ENST00000456328.2 1000 + 11868…
"chr1"1440329570"ENST00000488147.1 1000 - 14403…
"chr1"1736817436"ENST00000619216.1 1000 - 17368…
"chr1"2955331097"ENST00000473358.1 1000 + 29553…
"chr1"3036530503"ENST00000607096.1 1000 + 30365…
ox.from_bigbed("data/autosql-sample.bb", schema="autosql").pl().head()
shape: (5, 20)
chromstartendnamescorestrandthickStartthickEndreservedblockCountblockSizeschromStartsname2cdsStartStatcdsEndStatexonFramestypegeneNamegeneName2geneType
stru32u32stru32stru32u32u32i32list[i32]list[i32]strstrstrlist[i32]strstrstrstr
"chr1"1186814409"ENST00000456328.2"1000"+"1186811868null3[359, 109, 1189][0, 744, 1352]"DDX11L1""none""none"[-1, -1, -1]"none""ENST00000456328.2""DDX11L1""none"
"chr1"1440329570"ENST00000488147.1"1000"-"1440314403null11[98, 34, … 37][0, 601, … 15130]"WASH7P""none""none"[-1, -1, … -1]"none""ENST00000488147.1""WASH7P""none"
"chr1"1736817436"ENST00000619216.1"1000"-"1736817368null1[68][0]"MIR6859-2""none""none"[-1]"none""ENST00000619216.1""MIR6859-2""none"
"chr1"2955331097"ENST00000473358.1"1000"+"2955329553null3[486, 104, 122][0, 1010, 1422]"MIR1302-11""none""none"[-1, -1, -1]"none""ENST00000473358.1""MIR1302-11""none"
"chr1"3036530503"ENST00000607096.1"1000"+"3036530365null1[138][0]"MIR1302-9""none""none"[-1]"none""ENST00000607096.1""MIR1302-9""none"

Zoom levels#

The UCSC BBI formats store multiple “zoom” or “reduction” levels. These are tables of fixed-resolution genomic bins containing summary statistics of the signal of a BigWig track track or the interval coverage depth of a BigBed track.

ds = ox.from_bigwig("data/sample.bw")
ds.zoom_levels
[2621440, 10485760, 41943040]
ds.zoom(ds.zoom_levels[1]).regions("chr21").pl()
shape: (5, 8)
chromstartendbases_coveredminmaxsumsum_squares
catu32u32u64f64f64f64f64
"chr21"9486505174085409020.080.04000.0224000.0
"chr21"17829945261408851550.080.07900.0470000.0
"chr21"27133600360156752050.080.09000.0472000.0
"chr21"36097355444120851900.080.07200.0376000.0
"chr21"45704025481298956520.080.02800.0148000.0

Remote files and file-like objects#

Instead of using file paths, source and index inputs to create a data source can alternatively be callables that open a binary I/O stream, i.e. any Python file-like object.

ds = ox.from_bam(
    lambda : open("sample.bam", "rb"),
    index=lambda : open("sample.bam.bai", "rb"),
)

This gives you the power to customize your own transports – to read remote sources, diverse file system implementations, or different file encodings – independently of oxbow itself.

Libraries like fsspec or smart_open can be used for this purpose.

from fsspec.implementations.cached import CachingFileSystem
from s3fs import S3FileSystem

url = "https://oxbow-ngs.s3.us-east-2.amazonaws.com/example.bam"
httpfs = CachingFileSystem(target_protocol="https")
ds = ox.from_bam(
    lambda : httpfs.open(url, "rb"),
    index=lambda : httpfs.open(url + ".bai", "rb"),
)

s3fs = S3FileSystem(anon=True)
s3_uri = "s3://oxbow-ngs/example.bam"
ds = ox.from_bam(
    lambda : s3fs.open(s3_uri, "rb"),
    index=lambda : s3fs.open(s3_uri + ".bai", "rb"),
    tag_defs=[],
)

Streams and Fragments#

An oxbow data source object streams data via a sequence of Arrow RecordBatches. This stream is exposed as an iterator and you can use it to materialize each batch manually.

ds = ox.from_bam("data/sample.bam", batch_size=100)
batch = next(ds.batches())
batch
pyarrow.RecordBatch
qname: string
flag: uint16
rname: dictionary<values=string, indices=int32, ordered=0>
pos: int32
mapq: uint8
cigar: string
rnext: dictionary<values=string, indices=int32, ordered=0>
pnext: int32
tlen: int32
seq: string
qual: string
end: int32
tags: struct<AM: int64, MD: string, NM: int64, RG: string, SM: int64, X0: int64, X1: int64, XA: string, XG: int64, XM: int64, XO: int64, XT: string>
  child 0, AM: int64
  child 1, MD: string
  child 2, NM: int64
  child 3, RG: string
  child 4, SM: int64
  child 5, X0: int64
  child 6, X1: int64
  child 7, XA: string
  child 8, XG: int64
  child 9, XM: int64
  child 10, XO: int64
  child 11, XT: string
----
qname: ["HWI-BRUNOP16X_0001:3:48:4861:11838#0","HWI-BRUNOP16X_0001:3:28:6650:168848#0","HWI-BRUNOP16X_0001:3:8:20066:88158#0","HWI-BRUNOP16X_0001:3:27:10302:58768#0","HWI-BRUNOP16X_0001:3:65:3144:143676#0","HWI-BRUNOP16X_0001:3:68:13088:156644#0","HWI-BRUNOP16X_0001:3:48:3417:101389#0","HWI-BRUNOP16X_0001:3:46:17583:95767#0","HWI-BRUNOP16X_0001:3:4:7989:14941#0","HWI-BRUNOP16X_0001:3:44:11450:50194#0"]
flag: [163,16,16,16,83,16,163,161,16,0]
rname: -- dictionary:
["chrY","chr20","chrX","chr13","chr22","chr10","chr6","chr19","chr14","chr18",...,"chr11","chr17","chr8","chr7","chr15","chr12","chr1","chr16","chr5","chr3"]-- indices:
[20,20,20,20,23,23,23,2,0,0]
pos: [10542,10546,946457,1014060,196957,196958,196961,503847,586185,587561]
mapq: [0,16,0,37,60,37,60,0,0,0]
cigar: ["50M","75M","75M","75M","50M","75M","50M","50M","75M","75M"]
rnext: -- dictionary:
["chrY","chr20","chrX","chr13","chr22","chr10","chr6","chr19","chr14","chr18",...,"chr11","chr17","chr8","chr7","chr15","chr12","chr1","chr16","chr5","chr3"]-- indices:
[20,null,null,null,23,null,23,11,null,null]
pnext: [10571,null,null,null,196008,null,319702,185365552,null,null]
tlen: [79,0,0,0,-999,0,122791,0,0,0]
seq: ["CGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG","ATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAGCTCCGCC","TAGTCCGAGGTCTCCTGAACCTTCCCAAGCAGCTGCTGCACCTGCCGGCAGTAGTTGGCCACCTTGCACTCCCGG","AGCTGAATGGGCAGGTCCCCCAGAAGATCGGCGTGCACGCCTTCCAGCAGCGTCTGGCTGTCCACCCGAGCGGTG","GTAACGCTCCCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGGCTCG","GACCCCCCCGGCCCCCGGCGCCCCCCCGCCCCGCCCCCGGGCGGGCGGGGGGGAGAAGGCGCCCGAGGGGAGGCG","GCTTACCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGGCTCGGGGG","TTTTATTTTTTTTTTTGAGATGGAGTCTCGCTCTTGTCACCGAGGCTGGA","GTGCGATCTCGGTTCGCTGCAACCTCTGCTTCCCAGGTTCAAGTGATTCTCCGGCCTCAGCCTCCCAAGTAGCNN","NNTGCAGTGAGCTGAGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGGTAGACTGTGTCTCAAAAAAAAAAA"]
...
pl.from_arrow(batch)
shape: (10, 13)
qnameflagrnameposmapqcigarrnextpnexttlenseqqualendtags
stru16cati32u8strcati32i32strstri32struct[12]
"HWI-BRUNOP16X_0001:3:48:4861:1…163"chr1"105420"50M""chr1"1057179"CGAAATCTGTGCAGAGGAGAACGCAGCTCC…"gggggggggggggggggggggggggegggg…10591{0,"18C31",1,"brain_50_fcb",0,3,8,null,0,1,0,"82"}
"HWI-BRUNOP16X_0001:3:28:6650:1…16"chr1"1054616"75M"nullnull0"ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…"fggggggggdgdggcdfggggfgggggggg…10620{null,"14C52A7",2,"brain_75_fca",null,1,5,null,0,2,0,"85"}
"HWI-BRUNOP16X_0001:3:8:20066:8…16"chr1"9464570"75M"nullnull0"TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…946531{null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"}
"HWI-BRUNOP16X_0001:3:27:10302:…16"chr1"101406037"75M"nullnull0"AGCTGAATGGGCAGGTCCCCCAGAAGATCG…"BBBBBBBBBBBBBBBBcYRcffggfgf_gf…1014134{null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"}
"HWI-BRUNOP16X_0001:3:65:3144:1…83"chr3"19695760"50M""chr3"196008-999"GTAACGCTCCCGGACCCTGCGCGCCCCCGT…"BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^…197006{37,"0C0A0G1G0C0T1A41",7,"brain_50_fcb",37,1,0,null,0,7,0,"85"}
"HWI-BRUNOP16X_0001:3:68:13088:…16"chr3"19695837"75M"nullnull0"GACCCCCCCGGCCCCCGGCGCCCCCCCGCC…"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…197032{null,"0A0G0A0G1T0T0A1C1G0A3T4G6T4G1T2C3C3T0C27",19,"brain_75_fca",null,1,0,null,0,19,0,"85"}
"HWI-BRUNOP16X_0001:3:48:3417:1…163"chr3"19696160"50M""chr3"319702122791"GCTTACCGGACCCTGCGCGCCCCCGTCCCG…"gggggggggggggggggggggggfdagggg…197010{37,"50",0,"brain_50_fcb",37,1,0,null,0,0,0,"85"}
"HWI-BRUNOP16X_0001:3:46:17583:…161"chrX"5038470"50M""chr4"1853655520"TTTTATTTTTTTTTTTGAGATGGAGTCTCG…"ddfdfd____dffff]__aeZ]\XZSPSNS…503896{0,"4T36C8",2,"brain_50_fcb",0,18,174,null,0,2,0,"82"}
"HWI-BRUNOP16X_0001:3:4:7989:14…16"chrY"5861850"75M"nullnull0"GTGCGATCTCGGTTCGCTGCAACCTCTGCT…"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…586259{null,"4C10A10C16C29T0G0",6,"brain_75_fca",null,2,2,"X,-586185,75M,6;3,+196723225,75M,7;19,+13666092,75M,7;",0,6,0,"82"}
"HWI-BRUNOP16X_0001:3:44:11450:…0"chrY"5875610"75M"nullnull0"NNTGCAGTGAGCTGAGATTGTGCCACTGCA…"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…587635{null,"0G0G48T0G23",4,"brain_75_fca",null,6,54,null,0,4,0,"82"}

Data sources can be logically grouped into fragments. Without random access, a data source contains only a single fragment.

ds = ox.from_bam("data/sample.bam")
ds.fragments()
[<oxbow._pyarrow.BatchReaderFragment at 0x706d5477fbb0>]

When you register range queries, each query gets mapped to a unique fragment. Each fragment generates an independent stream of record batches.

ds = ox.from_bam("data/sample.bam").regions(["chr1", "chr3", "chrX"])
ds.fragments()
[<oxbow._pyarrow.BatchReaderFragment at 0x706d5477f490>,
 <oxbow._pyarrow.BatchReaderFragment at 0x706d5476deb0>,
 <oxbow._pyarrow.BatchReaderFragment at 0x706d54883ac0>]

Dask data frames#

Dask uses a different approach than the streaming paradigm of Polars and DuckDB: it subdivides a data set into a known number of independently accessible logical partitions, each of which is expected to fit in memory. When you convert an Oxbow data source into a Dask data frame, oxbow maps fragments to partitions:

df = (
    ox.from_bam("data/sample.bam")
    .regions(["chr1", "chrX", "chrY"])
    .dd()  # or to_dask()
)
df
Dask DataFrame Structure:
qname flag rname pos mapq cigar rnext pnext tlen seq qual end tags
npartitions=3
string uint16 category[known] int32 uint8 string category[known] int32 int32 string string int32 string
... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: to_string_dtype, 2 expressions
df.partitions[1].compute()
qname flag rname pos mapq cigar rnext pnext tlen seq qual end tags
0 HWI-BRUNOP16X_0001:3:46:17583:95767#0 161 chrX 503847 0 50M chr4 185365552 0 TTTTATTTTTTTTTTTGAGATGGAGTCTCGCTCTTGTCACCGAGGC... ddfdfd____dffff]__aeZ]\XZSPSNSSSSSSbbaabZ_``BB... 503896 {'AM': 0, 'MD': '4T36C8', 'NM': 2, 'RG': 'brai...