Quickstart#

This is a quickstart guide to using Oxbow. Oxbow lets you access potentially larger-than-memory genomic files as tabular data structures, such as data frames.

Create a DataSource#

Use the convenience function associated with your file type. The returned DataSource object can be used to access the data in the file.

import oxbow as ox

ds = ox.from_bam("data/sample.bam")

Into data frames#

If the dataset fits comfortably in memory, you can materialize it fully as a Pandas or Polars data frame.

ds.pd()  # or ds.to_pandas()
qname flag rname pos mapq cigar rnext pnext tlen seq qual end
0 HWI-BRUNOP16X_0001:3:48:4861:11838#0 163 chr1 10542 0 50M chr1 10571.0 79 CGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCT... gggggggggggggggggggggggggeggggR\_[\ggggghggggg... 10591
1 HWI-BRUNOP16X_0001:3:28:6650:168848#0 16 chr1 10546 16 75M NaN NaN 0 ATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG... fggggggggdgdggcdfggggfgggggggggggggggggggggggg... 10620
2 HWI-BRUNOP16X_0001:3:8:20066:88158#0 16 chr1 946457 0 75M NaN NaN 0 TAGTCCGAGGTCTCCTGAACCTTCCCAAGCAGCTGCTGCACCTGCC... BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBd`aed``__U^__]_g... 946531
3 HWI-BRUNOP16X_0001:3:27:10302:58768#0 16 chr1 1014060 37 75M NaN NaN 0 AGCTGAATGGGCAGGTCCCCCAGAAGATCGGCGTGCACGCCTTCCA... BBBBBBBBBBBBBBBBcYRcffggfgf_gfg\deegfgfgfcggcg... 1014134
4 HWI-BRUNOP16X_0001:3:65:3144:143676#0 83 chr3 196957 60 50M chr3 196008.0 -999 GTAACGCTCCCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGG... BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^STTTSZW`beTTTTTS... 197006
5 HWI-BRUNOP16X_0001:3:68:13088:156644#0 16 chr3 196958 37 75M NaN NaN 0 GACCCCCCCGGCCCCCGGCGCCCCCCCGCCCCGCCCCCGGGCGGGC... BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB... 197032
6 HWI-BRUNOP16X_0001:3:48:3417:101389#0 163 chr3 196961 60 50M chr3 319702.0 122791 GCTTACCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGGCTCG... gggggggggggggggggggggggfdaggggggdgggfgdhbe\T`B... 197010
7 HWI-BRUNOP16X_0001:3:46:17583:95767#0 161 chrX 503847 0 50M chr4 185365552.0 0 TTTTATTTTTTTTTTTGAGATGGAGTCTCGCTCTTGTCACCGAGGC... ddfdfd____dffff]__aeZ]\XZSPSNSSSSSSbbaabZ_``BB... 503896
8 HWI-BRUNOP16X_0001:3:4:7989:14941#0 16 chrY 586185 0 75M NaN NaN 0 GTGCGATCTCGGTTCGCTGCAACCTCTGCTTCCCAGGTTCAAGTGA... BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB... 586259
9 HWI-BRUNOP16X_0001:3:44:11450:50194#0 0 chrY 587561 0 75M NaN NaN 0 NNTGCAGTGAGCTGAGATTGTGCCACTGCACTCCAGCCTGGGTGAC... BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB... 587635
ds.pl()  # or ds.to_polars()
shape: (10, 12)
qname                             flag  rname   pos      mapq  cigar  rnext   pnext      tlen    seq                               qual                              end
str                               u16   cat     i32      u8    str    cat     i32        i32     str                               str                               i32
"HWI-BRUNOP16X_0001:3:48:4861:1…  163   "chr1"  10542    0     "50M"  "chr1"  10571      79      "CGAAATCTGTGCAGAGGAGAACGCAGCTCC…  "gggggggggggggggggggggggggegggg…  10591
"HWI-BRUNOP16X_0001:3:28:6650:1…  16    "chr1"  10546    16    "75M"  null    null       0       "ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…  "fggggggggdgdggcdfggggfgggggggg…  10620
"HWI-BRUNOP16X_0001:3:8:20066:8…  16    "chr1"  946457   0     "75M"  null    null       0       "TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…  "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…  946531
"HWI-BRUNOP16X_0001:3:27:10302:…  16    "chr1"  1014060  37    "75M"  null    null       0       "AGCTGAATGGGCAGGTCCCCCAGAAGATCG…  "BBBBBBBBBBBBBBBBcYRcffggfgf_gf…  1014134
"HWI-BRUNOP16X_0001:3:65:3144:1…  83    "chr3"  196957   60    "50M"  "chr3"  196008     -999    "GTAACGCTCCCGGACCCTGCGCGCCCCCGT…  "BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^…  197006
"HWI-BRUNOP16X_0001:3:68:13088:…  16    "chr3"  196958   37    "75M"  null    null       0       "GACCCCCCCGGCCCCCGGCGCCCCCCCGCC…  "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…  197032
"HWI-BRUNOP16X_0001:3:48:3417:1…  163   "chr3"  196961   60    "50M"  "chr3"  319702     122791  "GCTTACCGGACCCTGCGCGCCCCCGTCCCG…  "gggggggggggggggggggggggfdagggg…  197010
"HWI-BRUNOP16X_0001:3:46:17583:…  161   "chrX"  503847   0     "50M"  "chr4"  185365552  0       "TTTTATTTTTTTTTTTGAGATGGAGTCTCG…  "ddfdfd____dffff]__aeZ]\XZSPSNS…  503896
"HWI-BRUNOP16X_0001:3:4:7989:14…  16    "chrY"  586185   0     "75M"  null    null       0       "GTGCGATCTCGGTTCGCTGCAACCTCTGCT…  "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…  586259
"HWI-BRUNOP16X_0001:3:44:11450:…  0     "chrY"  587561   0     "75M"  null    null       0       "NNTGCAGTGAGCTGAGATTGTGCCACTGCA…  "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…  587635

Into lazy data structures#

If the data source is very large, you can also load it into a lazy or “out-of-core” data structure, such as a Polars lazy frame or Dask data frame.

df = ds.pl(lazy=True)
df.show_graph()
df.head().collect()
shape: (5, 12)
qname                             flag  rname   pos      mapq  cigar  rnext   pnext      tlen    seq                               qual                              end
str                               u16   cat     i32      u8    str    cat     i32        i32     str                               str                               i32
"HWI-BRUNOP16X_0001:3:48:4861:1…  163   "chr1"  10542    0     "50M"  "chr1"  10571      79      "CGAAATCTGTGCAGAGGAGAACGCAGCTCC…  "gggggggggggggggggggggggggegggg…  10591
"HWI-BRUNOP16X_0001:3:28:6650:1…  16    "chr1"  10546    16    "75M"  null    null       0       "ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…  "fggggggggdgdggcdfggggfgggggggg…  10620
"HWI-BRUNOP16X_0001:3:8:20066:8…  16    "chr1"  946457   0     "75M"  null    null       0       "TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…  "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…  946531
"HWI-BRUNOP16X_0001:3:27:10302:…  16    "chr1"  1014060  37    "75M"  null    null       0       "AGCTGAATGGGCAGGTCCCCCAGAAGATCG…  "BBBBBBBBBBBBBBBBcYRcffggfgf_gf…  1014134
"HWI-BRUNOP16X_0001:3:65:3144:1…  83    "chr3"  196957   60    "50M"  "chr3"  196008     -999    "GTAACGCTCCCGGACCCTGCGCGCCCCCGT…  "BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^…  197006

Oxbow data sources can also be loaded into a DuckDB relation.

import duckdb

conn = duckdb.connect(":memory:")
ds = ox.from_gtf("data/gencode.v47.annotation.gtf")
rel = ds.to_duckdb(conn)
conn.sql(
    "SELECT seqid AS chrom, type, start, rel.end, strand, attributes.gene_name "
    "FROM rel "
    "WHERE attributes.gene_name = 'PCSK9' "
    "LIMIT 10"
).pl()

Note

See the Streams and Fragments section for details on building Dask data frames.

Range queries#

Data sources with indexes support querying genomic ranges. This is the case for htslib formats that are compressed with the BGZF gzip variant and indexed with an appropriate companion index file (e.g., .bai, .tbi, .csi). The BBI formats, BigWig and BigBed, possess an internal index and support range queries without an index file.

You can specify one or more ranges to the constructor or pass them to the regions() method. All records overlapping the query ranges will be returned.

ds = ox.from_bam("data/sample.bam", index="data/sample.bam.bai")
ds = ds.regions("chr1:900000-1100000")

ds.pl()
shape: (2, 12)
qname                             flag  rname   pos      mapq  cigar  rnext   pnext      tlen    seq                               qual                              end
str                               u16   cat     i32      u8    str    cat     i32        i32     str                               str                               i32
"HWI-BRUNOP16X_0001:3:8:20066:8…  16    "chr1"  946457   0     "75M"  null    null       0       "TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…  "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…  946531
"HWI-BRUNOP16X_0001:3:27:10302:…  16    "chr1"  1014060  37    "75M"  null    null       0       "AGCTGAATGGGCAGGTCCCCCAGAAGATCG…  "BBBBBBBBBBBBBBBBcYRcffggfgf_gf…  1014134

If the index file exists in the same location as the source file, it is automatically detected.

ox.from_bam("data/sample.bam").regions(["chr1", "chr3"]).pl()
shape: (7, 12)
qname                             flag  rname   pos      mapq  cigar  rnext   pnext      tlen    seq                               qual                              end
str                               u16   cat     i32      u8    str    cat     i32        i32     str                               str                               i32
"HWI-BRUNOP16X_0001:3:48:4861:1…  163   "chr1"  10542    0     "50M"  "chr1"  10571      79      "CGAAATCTGTGCAGAGGAGAACGCAGCTCC…  "gggggggggggggggggggggggggegggg…  10591
"HWI-BRUNOP16X_0001:3:28:6650:1…  16    "chr1"  10546    16    "75M"  null    null       0       "ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…  "fggggggggdgdggcdfggggfgggggggg…  10620
"HWI-BRUNOP16X_0001:3:8:20066:8…  16    "chr1"  946457   0     "75M"  null    null       0       "TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…  "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…  946531
"HWI-BRUNOP16X_0001:3:27:10302:…  16    "chr1"  1014060  37    "75M"  null    null       0       "AGCTGAATGGGCAGGTCCCCCAGAAGATCG…  "BBBBBBBBBBBBBBBBcYRcffggfgf_gf…  1014134
"HWI-BRUNOP16X_0001:3:65:3144:1…  83    "chr3"  196957   60    "50M"  "chr3"  196008     -999    "GTAACGCTCCCGGACCCTGCGCGCCCCCGT…  "BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^…  197006
"HWI-BRUNOP16X_0001:3:68:13088:…  16    "chr3"  196958   37    "75M"  null    null       0       "GACCCCCCCGGCCCCCGGCGCCCCCCCGCC…  "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…  197032
"HWI-BRUNOP16X_0001:3:48:3417:1…  163   "chr3"  196961   60    "50M"  "chr3"  319702     122791  "GCTTACCGGACCCTGCGCGCCCCCGTCCCG…  "gggggggggggggggggggggggfdagggg…  197010
ox.from_bigwig("data/sample.bw").regions("chr21:10900000-15000000").pl()
shape: (3, 4)
chrom    start     end       value
str      u32       u32       f32
"chr21"  10971770  10971775  40.0
"chr21"  14787100  14787105  60.0
"chr21"  14959050  14959055  20.0

Note

Oxbow handles multiple ranges as separate fragments. For more details, see the Streams and Fragments section below.
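
As a rough illustration of the overlap semantics described in this section, the plain-Python sketch below parses a region string and keeps the records that overlap it. It is not how oxbow performs range queries (those go through the index). The helpers parse_region and overlaps are hypothetical names, and treating both the query and the record coordinates as closed intervals is an assumption made only for this example.

```python
import re

# Hypothetical helpers for illustration only; these are not part of oxbow.
_REGION = re.compile(r"^(?P<chrom>[^:]+)(?::(?P<start>[\d,]+)-(?P<end>[\d,]+))?$")


def parse_region(region):
    """Parse "chrom" or "chrom:start-end" into (chrom, start, end)."""
    m = _REGION.match(region)
    if m is None:
        raise ValueError(f"cannot parse region: {region!r}")
    if m["start"] is None:
        return m["chrom"], None, None  # whole-chromosome query
    start = int(m["start"].replace(",", ""))
    end = int(m["end"].replace(",", ""))
    if start > end:
        raise ValueError(f"start is greater than end in {region!r}")
    return m["chrom"], start, end


def overlaps(record, region):
    """Return True if a (chrom, pos, end) record overlaps the query region.

    Assumes closed intervals on both sides (an assumption of this sketch).
    """
    r_chrom, r_start, r_end = record
    chrom, start, end = region
    if r_chrom != chrom:
        return False
    if start is None:
        return True
    return r_start <= end and r_end >= start


# (rname, pos, end) values taken from the BAM example output above.
records = [
    ("chr1", 10542, 10591),
    ("chr1", 946457, 946531),
    ("chr1", 1014060, 1014134),
    ("chr3", 196957, 197006),
]
query = parse_region("chr1:900000-1100000")
hits = [r for r in records if overlaps(r, query)]
print(hits)  # the two chr1 records starting at 946457 and 1014060
```

The same two records are the ones returned by the regions("chr1:900000-1100000") example above.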

Column projection#

Oxbow lets you select only the columns you need and will not parse the others.

import polars as pl

ox.from_bam(
    "data/sample.bam", 
    fields=["rname", "pos", "end", "mapq"],
).regions(
    "chr1"
).pl()
shape: (4, 4)
rname   pos      end      mapq
cat     i32      i32      u8
"chr1"  10542    10591    0
"chr1"  10546    10620    16
"chr1"  946457   946531   0
"chr1"  1014060  1014134  37

In data systems lingo, selecting columns is also known as “projection”. The lazy data structures returned by an oxbow data source are able to “push down” the projection operation to oxbow to prevent full record parsing. In the following example, only the four fields passed to the polars LazyFrame.select method will be parsed when the output gets computed.

df = (
    ox.from_bam("data/sample.bam")
    .regions("chr1")
    .pl(lazy=True)
    .select(
        pl.col("rname").alias("chrom"),
        pl.col("pos").alias("start"),
        "end",
        "mapq"
    )
    .collect()
)
df
shape: (4, 4)
chrom   start    end      mapq
cat     i32      i32      u8
"chr1"  10542    10591    0
"chr1"  10546    10620    16
"chr1"  946457   946531   0
"chr1"  1014060  1014134  37

Nested and composite fields#

Oxbow can handle the complex field structures of genomics file formats because they can all be mapped to Arrow constructs like lists, arrays, and structs.

For example, fields like SAM tags, VCF info and samples, and GTF attributes are exposed as struct columns in Arrow-native libraries like Polars, which are easy and efficient to manipulate.

SAM/BAM tags#

The htslib alignment formats, SAM and BAM, have optional fields called tags that are defined inline, rather than in a header or manifest. These definitions, a tuple of a tag name and type code, can be provided explicitly to the data source constructor for projection.

df = (
    ox.from_bam(
        "data/sample.bam", 
        fields=None,
        tag_defs=[('MD', 'Z'), ('NM', 'C')]
    )
    .regions("chr1")
    .pl()
    .select(
        pl.col("tags").struct.unnest()
    )
)
df
shape: (4, 2)
MD            NM
str           i64
"18C31"       1
"14C52A7"     2
"2T0G5T65"    3
"7G1C4A2A57"  4

When you call the with_tags() method, oxbow scans an initial number of rows (determined by scan_rows) to discover tag definitions and adds them to the schema.

df = (
    ox.from_bam("data/sample.bam")
    .with_tags()
    .regions("chr1")
    .pl()
)
df
shape: (4, 13)
qname                             flag  rname   pos      mapq  cigar  rnext   pnext      tlen    seq                               qual                              end      tags
str                               u16   cat     i32      u8    str    cat     i32        i32     str                               str                               i32      struct[12]
"HWI-BRUNOP16X_0001:3:48:4861:1…  163   "chr1"  10542    0     "50M"  "chr1"  10571      79      "CGAAATCTGTGCAGAGGAGAACGCAGCTCC…  "gggggggggggggggggggggggggegggg…  10591    {0,"18C31",1,"brain_50_fcb",0,3,8,null,0,1,0,"82"}
"HWI-BRUNOP16X_0001:3:28:6650:1…  16    "chr1"  10546    16    "75M"  null    null       0       "ATCTGTGCAGAGGAGAACGCAGCTCCGCCC…  "fggggggggdgdggcdfggggfgggggggg…  10620    {null,"14C52A7",2,"brain_75_fca",null,1,5,null,0,2,0,"85"}
"HWI-BRUNOP16X_0001:3:8:20066:8…  16    "chr1"  946457   0     "75M"  null    null       0       "TAGTCCGAGGTCTCCTGAACCTTCCCAAGC…  "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…  946531   {null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"}
"HWI-BRUNOP16X_0001:3:27:10302:…  16    "chr1"  1014060  37    "75M"  null    null       0       "AGCTGAATGGGCAGGTCCCCCAGAAGATCG…  "BBBBBBBBBBBBBBBBcYRcffggfgf_gf…  1014134  {null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"}
df['tags'].struct.unnest().head()
shape: (4, 12)
AM    MD            NM   RG              SM    X0   X1   XA                     XG   XM   XO   XT
i64   str           i64  str             i64   i64  i64  str                    i64  i64  i64  str
0     "18C31"       1    "brain_50_fcb"  0     3    8    null                   0    1    0    "82"
null  "14C52A7"     2    "brain_75_fca"  null  1    5    null                   0    2    0    "85"
null  "2T0G5T65"    3    "brain_75_fca"  null  2    0    "2,-131443143,75M,3;"  0    3    0    "82"
null  "7G1C4A2A57"  4    "brain_75_fca"  null  1    0    null                   0    4    0    "85"
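
To make the idea of tag discovery more concrete, here is a minimal sketch that scans the optional fields of the first few SAM text records and collects (tag name, type code) pairs. It is only an illustration of the concept, not oxbow's implementation. The function name discover_tag_defs and the default scan_rows value are assumptions. Note also that SAM text uses the type code i for all integers, whereas BAM records carry narrower codes such as C, as in the tag_defs example above.

```python
# Hypothetical sketch of tag discovery over SAM text lines; not oxbow's code.
def discover_tag_defs(sam_lines, scan_rows=1024):
    """Collect (tag, type_code) pairs from the first `scan_rows` records.

    The default of 1024 rows is an arbitrary choice for this example.
    """
    seen = {}
    n_scanned = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        if n_scanned >= scan_rows:
            break
        n_scanned += 1
        fields = line.rstrip("\n").split("\t")
        # The first 11 columns are mandatory; the rest are TAG:TYPE:VALUE.
        for optional in fields[11:]:
            tag, type_code, _value = optional.split(":", 2)
            seen.setdefault(tag, type_code)
    return sorted(seen.items())


mandatory = ["163", "chr1", "10542", "0", "4M", "=", "10571", "79", "ACGT", "IIII"]
lines = [
    "@HD\tVN:1.6",
    "\t".join(["read1", *mandatory, "NM:i:1", "MD:Z:18C31"]),
    "\t".join(["read2", *mandatory, "NM:i:3", "XA:Z:2,-131443143,75M,3;"]),
]
tag_defs = discover_tag_defs(lines)
print(tag_defs)  # [('MD', 'Z'), ('NM', 'i'), ('XA', 'Z')]
```

A tag that first appears after the scanned rows would be missed by a scan like this, which is one reason to pass definitions explicitly when you already know them.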

GTF/GFF attributes#

GTF/GFF attributes are analogous to SAM tags. For GTF, the type is always "String". For GFF, attributes can be "String" or "Array", the latter materializing as a list column.

df = (
    ox.from_gff("data/sample.gff")
    .with_attributes()
    .pl()
)
df.head()
shape: (5, 9)
seqidsourcetypestartendscorestrandframeattributes
strstrstri32i32f32stru8struct[18]
"chr13""HAVANA""exon"8132603081326191null"+"null{"exon:ENST00000782961.1:2","ENST00000782961.1",null,"ENSE00004156517.1","2","ENSG00000229309.3","ENSG00000229309","lncRNA","OTTHUMG00000017146.2",null,null,"2",null,["basic", "Ensembl_canonical", "TAGENE"],"ENST00000782961.1","ENST00000782961",null,"lncRNA"}
"chr6""HAVANA""CDS"3200239932002540null"+"1{"CDS:ENST00000498271.1","ENST00000498271.1","CCDS59005.1","ENSE00001878698.1","40","ENSG00000244731.10","C4A","protein_coding","OTTHUMG00000031186.6","OTTHUMT00000356896.1","HGNC:1323","2","ENSP00000420212.1",["RNA_Seq_supported_only", "basic", … "CCDS"],"ENST00000498271.1","C4A-246","1","protein_coding"}
"chr10""HAVANA""exon"7293053872930637null"+"null{"exon:ENST00000334011.10:8","ENST00000334011.10","CCDS7318.1","ENSE00001170094.1","8","ENSG00000138315.13","OIT3","protein_coding","OTTHUMG00000018444.2","OTTHUMT00000048596.2","HGNC:29953","2","ENSP00000333900.5",["basic", "Ensembl_canonical", … "CCDS"],"ENST00000334011.10","OIT3-201","1","protein_coding"}
"chr1""HAVANA""exon"497210497299null"-"null{"exon:ENST00000641916.1:4","ENST00000641916.1",null,"ENSE00003812605.1","4","ENSG00000290385.2","ENSG00000290385","lncRNA",null,"OTTHUMT00000493599.1",null,"2",null,null,"ENST00000641916.1","ENST00000641916",null,"lncRNA"}
"chr13""HAVANA""CDS"3565557935655749null"+"2{"CDS:ENST00000629018.4","ENST00000629018.4",null,"ENSE00000938859.1","28","ENSG00000172915.20","NBEA","protein_coding","OTTHUMG00000016724.2",null,"HGNC:7648","2","ENSP00000486239.3",["RNA_Seq_supported_only", "mRNA_start_NF", "cds_start_NF"],"ENST00000629018.4","NBEA-207","5","protein_coding"}
df['attributes'].struct.unnest().head()
shape: (5, 18)
IDParentccdsidexon_idexon_numbergene_idgene_namegene_typehavana_genehavana_transcripthgnc_idlevelprotein_idtagtranscript_idtranscript_nametranscript_support_leveltranscript_type
strstrstrstrstrstrstrstrstrstrstrstrstrlist[str]strstrstrstr
"exon:ENST00000782961.1:2""ENST00000782961.1"null"ENSE00004156517.1""2""ENSG00000229309.3""ENSG00000229309""lncRNA""OTTHUMG00000017146.2"nullnull"2"null["basic", "Ensembl_canonical", "TAGENE"]"ENST00000782961.1""ENST00000782961"null"lncRNA"
"CDS:ENST00000498271.1""ENST00000498271.1""CCDS59005.1""ENSE00001878698.1""40""ENSG00000244731.10""C4A""protein_coding""OTTHUMG00000031186.6""OTTHUMT00000356896.1""HGNC:1323""2""ENSP00000420212.1"["RNA_Seq_supported_only", "basic", … "CCDS"]"ENST00000498271.1""C4A-246""1""protein_coding"
"exon:ENST00000334011.10:8""ENST00000334011.10""CCDS7318.1""ENSE00001170094.1""8""ENSG00000138315.13""OIT3""protein_coding""OTTHUMG00000018444.2""OTTHUMT00000048596.2""HGNC:29953""2""ENSP00000333900.5"["basic", "Ensembl_canonical", … "CCDS"]"ENST00000334011.10""OIT3-201""1""protein_coding"
"exon:ENST00000641916.1:4""ENST00000641916.1"null"ENSE00003812605.1""4""ENSG00000290385.2""ENSG00000290385""lncRNA"null"OTTHUMT00000493599.1"null"2"nullnull"ENST00000641916.1""ENST00000641916"null"lncRNA"
"CDS:ENST00000629018.4""ENST00000629018.4"null"ENSE00000938859.1""28""ENSG00000172915.20""NBEA""protein_coding""OTTHUMG00000016724.2"null"HGNC:7648""2""ENSP00000486239.3"["RNA_Seq_supported_only", "mRNA_start_NF", "cds_start_NF"]"ENST00000629018.4""NBEA-207""5""protein_coding"

Important

As of oxbow v0.7, alignment file tag definitions and annotation file attribute definitions are no longer auto-discovered by default—this behavior is opt-in. Use the with_tags() or with_attributes() methods, respectively, to discover or specify tag/attribute definitions.
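
For readers unfamiliar with the attribute column, the following sketch shows how a GFF3-style attribute string could be split into "String" and "Array" values in plain Python. It is a simplified illustration, not oxbow's parser. The function name parse_gff3_attributes is hypothetical, and deciding per record whether a value is a list is a simplification, since in a data frame each column has a single type.

```python
from urllib.parse import unquote


# Hypothetical helper for illustration; not part of oxbow.
def parse_gff3_attributes(column9):
    """Split a GFF3 attribute string into a dict.

    Comma-separated values become lists; everything else stays a string.
    """
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        parts = [unquote(v) for v in value.split(",")]
        attrs[unquote(key)] = parts if len(parts) > 1 else parts[0]
    return attrs


attrs = parse_gff3_attributes(
    "ID=exon:ENST00000782961.1:2;Parent=ENST00000782961.1;"
    "tag=basic,Ensembl_canonical,TAGENE"
)
print(attrs["tag"])  # ['basic', 'Ensembl_canonical', 'TAGENE']
```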

VCF/BCF info fields#

For the htslib variant call formats, VCF and BCF, the subfields of the INFO field are defined in the VCF header, so they do not need to be discovered by sniffing rows and you do not need to specify types.

By default, all info fields are parsed (info_fields="*"). You can project any subset or ignore them entirely by setting the info_fields argument to None.

(
    ox.from_vcf(
        "data/sample.vcf.gz",
        info_fields=None,
    )
    .pl()
).head()
shape: (5, 7)
chrom  pos     id         ref  alt        qual          filter
cat    i32     list[str]  str  list[str]  f32           list[str]
"1"    65872   []         "T"  ["G"]      44.18         []
"1"    69511   []         "A"  ["G"]      2552.929932   []
"1"    762273  []         "G"  ["A"]      19085.929688  []
"1"    866511  []         "C"  ["CCCCT"]  3136.889893   []
"1"    876499  []         "A"  ["G"]      3338.929932   []
df = (
    ox.from_vcf(
        "data/sample.vcf.gz",
        info_fields=["TYPE", "snpeff.Effect", "snpeff.Gene_Name", "snpeff.Transcript_BioType"],
    )
    .pl()
)
df.head()
shape: (5, 8)
chrom  pos     id         ref  alt        qual          filter     info
cat    i32     list[str]  str  list[str]  f32           list[str]  struct[4]
"1"    65872   []         "T"  ["G"]      44.18         []         {["SNP"],["intergenic_region"],null,null}
"1"    69511   []         "A"  ["G"]      2552.929932   []         {["SNP"],["sequence_feature[transmembrane_region:Transmembrane_region]", "sequence_feature[disulfide_bond]", "missense_variant"],["OR4F5", "OR4F5", "OR4F5"],["protein_coding", "protein_coding", "protein_coding"]}
"1"    762273  []         "G"  ["A"]      19085.929688  []         {["SNP"],["non_coding_exon_variant"],["LINC00115"],["lincRNA"]}
"1"    866511  []         "C"  ["CCCCT"]  3136.889893   []         {["Insertion"],["intron_variant"],["SAMD11"],["protein_coding"]}
"1"    876499  []         "A"  ["G"]      3338.929932   []         {["SNP"],["intron_variant"],["SAMD11"],["protein_coding"]}
df.unnest("info").head()
shape: (5, 11)
chromposidrefaltqualfilterTYPEsnpeff.Effectsnpeff.Gene_Namesnpeff.Transcript_BioType
cati32list[str]strlist[str]f32list[str]list[str]list[str]list[str]list[str]
"1"65872[]"T"["G"]44.18[]["SNP"]["intergenic_region"]nullnull
"1"69511[]"A"["G"]2552.929932[]["SNP"]["sequence_feature[transmembrane_region:Transmembrane_region]", "sequence_feature[disulfide_bond]", "missense_variant"]["OR4F5", "OR4F5", "OR4F5"]["protein_coding", "protein_coding", "protein_coding"]
"1"762273[]"G"["A"]19085.929688[]["SNP"]["non_coding_exon_variant"]["LINC00115"]["lincRNA"]
"1"866511[]"C"["CCCCT"]3136.889893[]["Insertion"]["intron_variant"]["SAMD11"]["protein_coding"]
"1"876499[]"A"["G"]3338.929932[]["SNP"]["intron_variant"]["SAMD11"]["protein_coding"]

VCF/BCF sample genotype data#

For the htslib variant call formats, each variant call record is associated with an arbitrary number of so-called FORMAT fields that provide genotype-related information for each sample. Like INFO, these fields are defined in the header.

Using the samples and genotype_fields arguments, you can project any subset of samples as separate struct columns and project any subset of their associated genotype fields. Use samples="*" to select all samples or a list to select a subset.

df = ox.from_vcf(
    "data/sample.vcf.gz",
    info_fields=None,
    samples=['NA12891', 'NA12892'],
).pl()
df.head()
shape: (5, 9)
chrom  pos     id         ref  alt        qual          filter     NA12891                                                   NA12892
cat    i32     list[str]  str  list[str]  f32           list[str]  struct[6]                                                 struct[6]
"1"    65872   []         "T"  ["G"]      44.18         []         {[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 439],18}     {[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 437],18}
"1"    69511   []         "A"  ["G"]      2552.929932   []         {null,null,null,{[null, null],[false, false]},null,null}  {[0, 39],39,99,{[1, 1],[true, true]},[1289, 117, 0],null}
"1"    762273  []         "G"  ["A"]      19085.929688  []         {[0, 82],82,99,{[1, 1],[true, true]},[2952, 247, 0],127}  {[0, 68],68,99,{[1, 1],[true, true]},[2485, 204, 0],127}
"1"    866511  []         "C"  ["CCCCT"]  3136.889893   []         {[0, 13],13,37,{[1, 1],[true, true]},[512, 37, 0],26}     {[0, 9],9,27,{[1, 1],[true, true]},[402, 27, 0],26}
"1"    876499  []         "A"  ["G"]      3338.929932   []         {[0, 17],17,51,{[1, 1],[true, true]},[645, 51, 0],26}     {[0, 9],9,27,{[1, 1],[true, true]},[355, 27, 0],26}

Each sample column is essentially a sub-dataframe of that sample’s genotype fields.

df['NA12892'].struct.unnest().head()
shape: (5, 6)
AD         DP   GQ   GT                     PL              TP
list[i32]  i32  i32  struct[2]              list[i32]       i32
[14, 2]    16   21   {[0, 1],[true, true]}  [21, 0, 437]    18
[0, 39]    39   99   {[1, 1],[true, true]}  [1289, 117, 0]  null
[0, 68]    68   99   {[1, 1],[true, true]}  [2485, 204, 0]  127
[0, 9]     9    27   {[1, 1],[true, true]}  [402, 27, 0]    26
[0, 9]     9    27   {[1, 1],[true, true]}  [355, 27, 0]    26

Important

As of oxbow v0.7, variant file sample columns are no longer projected by default—they are opt-in. We recommend using the with_samples() API, below, to do this.

The recommended approach to project sample genotype data is to use the with_samples() method. Declaring samples this way further nests all sample-related data in a single “samples” struct column for convenience.

df = (
    ox.from_vcf(
        "data/sample.vcf.gz",
        info_fields=None,
    )
    .with_samples()
    .pl()
)
df.head()
shape: (5, 8)
chromposidrefaltqualfiltersamples
cati32list[str]strlist[str]f32list[str]struct[3]
"1"65872[]"T"["G"]44.18[]{{[15, 0],15,45,{[0, 0],[true, true]},[0, 45, 520],18},{[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 439],18},{[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 437],18}}
"1"69511[]"A"["G"]2552.929932[]{{null,null,null,{[null, null],[false, false]},null,null},{null,null,null,{[null, null],[false, false]},null,null},{[0, 39],39,99,{[1, 1],[true, true]},[1289, 117, 0],null}}
"1"762273[]"G"["A"]19085.929688[]{{[0, 67],67,99,{[1, 1],[true, true]},[2510, 202, 0],127},{[0, 82],82,99,{[1, 1],[true, true]},[2952, 247, 0],127},{[0, 68],68,99,{[1, 1],[true, true]},[2485, 204, 0],127}}
"1"866511[]"C"["CCCCT"]3136.889893[]{{[0, 13],13,38,{[1, 1],[true, true]},[583, 38, 0],26},{[0, 13],13,37,{[1, 1],[true, true]},[512, 37, 0],26},{[0, 9],9,27,{[1, 1],[true, true]},[402, 27, 0],26}}
"1"876499[]"A"["G"]3338.929932[]{{[0, 12],12,36,{[1, 1],[true, true]},[465, 36, 0],26},{[0, 17],17,51,{[1, 1],[true, true]},[645, 51, 0],26},{[0, 9],9,27,{[1, 1],[true, true]},[355, 27, 0],26}}
df.unnest("samples").head()
shape: (5, 10)
chromposidrefaltqualfilterNA12878iNA12891NA12892
cati32list[str]strlist[str]f32list[str]struct[6]struct[6]struct[6]
"1"65872[]"T"["G"]44.18[]{[15, 0],15,45,{[0, 0],[true, true]},[0, 45, 520],18}{[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 439],18}{[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 437],18}
"1"69511[]"A"["G"]2552.929932[]{null,null,null,{[null, null],[false, false]},null,null}{null,null,null,{[null, null],[false, false]},null,null}{[0, 39],39,99,{[1, 1],[true, true]},[1289, 117, 0],null}
"1"762273[]"G"["A"]19085.929688[]{[0, 67],67,99,{[1, 1],[true, true]},[2510, 202, 0],127}{[0, 82],82,99,{[1, 1],[true, true]},[2952, 247, 0],127}{[0, 68],68,99,{[1, 1],[true, true]},[2485, 204, 0],127}
"1"866511[]"C"["CCCCT"]3136.889893[]{[0, 13],13,38,{[1, 1],[true, true]},[583, 38, 0],26}{[0, 13],13,37,{[1, 1],[true, true]},[512, 37, 0],26}{[0, 9],9,27,{[1, 1],[true, true]},[402, 27, 0],26}
"1"876499[]"A"["G"]3338.929932[]{[0, 12],12,36,{[1, 1],[true, true]},[465, 36, 0],26}{[0, 17],17,51,{[1, 1],[true, true]},[645, 51, 0],26}{[0, 9],9,27,{[1, 1],[true, true]},[355, 27, 0],26}

You can also customize how sample genotype data are nested by using the group_by argument to with_samples(). By default (group_by="sample"), the columns are grouped first by sample name, then by genotype field name. By setting group_by="field", you can swap the nesting order to group columns first by genotype field name, then by sample name.

df = (
    ox.from_vcf(
        "data/sample.vcf.gz",
        info_fields=None,
    )
    .with_samples(
        ['NA12891', 'NA12892'],
        genotype_fields=['AD', 'DP', 'GQ', 'PL', 'TP'],
        group_by="field",
    )    
).pl()
df.head()
shape: (5, 8)
chromposidrefaltqualfiltersamples
cati32list[str]strlist[str]f32list[str]struct[5]
"1"65872[]"T"["G"]44.18[]{{[14, 2],[14, 2]},{16,16},{21,21},{[21, 0, 439],[21, 0, 437]},{18,18}}
"1"69511[]"A"["G"]2552.929932[]{{null,[0, 39]},{null,39},{null,99},{null,[1289, 117, 0]},{null,null}}
"1"762273[]"G"["A"]19085.929688[]{{[0, 82],[0, 68]},{82,68},{99,99},{[2952, 247, 0],[2485, 204, 0]},{127,127}}
"1"866511[]"C"["CCCCT"]3136.889893[]{{[0, 13],[0, 9]},{13,9},{37,27},{[512, 37, 0],[402, 27, 0]},{26,26}}
"1"876499[]"A"["G"]3338.929932[]{{[0, 17],[0, 9]},{17,9},{51,27},{[645, 51, 0],[355, 27, 0]},{26,26}}
df.unnest("samples").head()
shape: (5, 12)
chrom  pos     id         ref  alt        qual          filter     AD                 DP         GQ         PL                               TP
cat    i32     list[str]  str  list[str]  f32           list[str]  struct[2]          struct[2]  struct[2]  struct[2]                        struct[2]
"1"    65872   []         "T"  ["G"]      44.18         []         {[14, 2],[14, 2]}  {16,16}    {21,21}    {[21, 0, 439],[21, 0, 437]}      {18,18}
"1"    69511   []         "A"  ["G"]      2552.929932   []         {null,[0, 39]}     {null,39}  {null,99}  {null,[1289, 117, 0]}            {null,null}
"1"    762273  []         "G"  ["A"]      19085.929688  []         {[0, 82],[0, 68]}  {82,68}    {99,99}    {[2952, 247, 0],[2485, 204, 0]}  {127,127}
"1"    866511  []         "C"  ["CCCCT"]  3136.889893   []         {[0, 13],[0, 9]}   {13,9}     {37,27}    {[512, 37, 0],[402, 27, 0]}      {26,26}
"1"    876499  []         "A"  ["G"]      3338.929932   []         {[0, 17],[0, 9]}   {17,9}     {51,27}    {[645, 51, 0],[355, 27, 0]}      {26,26}

In this case, each genotype field column is a data series containing the values of that field associated with each of the samples.

df.unnest("samples")['DP'].struct.unnest().head()
shape: (5, 2)
NA12891  NA12892
i32      i32
16       16
null     39
82       68
13       9
17       9
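
The difference between the two nesting orders can be pictured with ordinary nested dictionaries. The sketch below is only an analogy for what group_by="sample" and group_by="field" produce. It does not use oxbow, regroup_by_field is a hypothetical helper, and the values are copied from the first record in the output above.

```python
# Hypothetical helper for illustration; not part of oxbow.
def regroup_by_field(by_sample):
    """Swap a {sample: {field: value}} nesting into {field: {sample: value}}."""
    by_field = {}
    for sample, fields in by_sample.items():
        for field, value in fields.items():
            by_field.setdefault(field, {})[sample] = value
    return by_field


# Genotype values of the first variant record, nested sample-first.
by_sample = {
    "NA12891": {"AD": [14, 2], "DP": 16, "GQ": 21},
    "NA12892": {"AD": [14, 2], "DP": 16, "GQ": 21},
}
by_field = regroup_by_field(by_sample)
print(by_field["DP"])  # {'NA12891': 16, 'NA12892': 16}
```

Field-first nesting is convenient when you want to compare one genotype field across samples, as in the DP example above.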

BED schemas#

Oxbow understands BEDn+m schema specifiers to interpret the contents of BED files.

ox.from_bed("data/sample.bed", bed_schema="bed3+").pl().head()
shape: (5, 4)
chrom    start    end      rest
str      i64      i64      str
"chr1"   1100000  1200000  "A1 . . 1100000 1200000 226,56,…
"chr1"   1550000  1600000  "A1 . . 1550000 1600000 226,56,…
"chr1"   1900000  2450000  "A1 . . 1900000 2450000 226,56,…
"chr10"  50000    250000   "AB . . 50000 250000 94,189,62"
"chr10"  250000   650000   "A2 . . 250000 650000 247,130,0"
ox.from_bed("data/sample.bed", bed_schema="bed3+6").pl().head()
shape: (5, 9)
chrom    start    end      BED3+1  BED3+2  BED3+3  BED3+4     BED3+5     BED3+6
str      i64      i64      str     str     str     str        str        str
"chr1"   1100000  1200000  "A1"    "."     "."     "1100000"  "1200000"  "226,56,56"
"chr1"   1550000  1600000  "A1"    "."     "."     "1550000"  "1600000"  "226,56,56"
"chr1"   1900000  2450000  "A1"    "."     "."     "1900000"  "2450000"  "226,56,56"
"chr10"  50000    250000   "AB"    "."     "."     "50000"    "250000"   "94,189,62"
"chr10"  250000   650000   "A2"    "."     "."     "250000"   "650000"   "247,130,0"
ox.from_bed("data/sample.bed", bed_schema="bed9").pl().head()
shape: (5, 9)
chrom    start    end      name  score  strand  thickStart  thickEnd  itemRgb
str      i64      i64      str   u16    cat     i64         i64       array[u8, 3]
"chr1"   1100000  1200000  "A1"  null   null    1100000     1200000   [226, 56, 56]
"chr1"   1550000  1600000  "A1"  null   null    1550000     1600000   [226, 56, 56]
"chr1"   1900000  2450000  "A1"  null   null    1900000     2450000   [226, 56, 56]
"chr10"  50000    250000   "AB"  null   null    50000       250000    [94, 189, 62]
"chr10"  250000   650000   "A2"  null   null    250000      650000    [247, 130, 0]

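The three specifiers used above ("bed3+", "bed3+6", and "bed9") follow the BEDn+m pattern. As a hedged illustration of how such a specifier breaks down, the sketch below parses one into its standard and extended field counts. The function parse_bed_spec is hypothetical and says nothing about how oxbow parses specifiers internally. The 3 to 12 bound on standard fields follows the usual BED convention.

```python
import re

# Hypothetical helper for illustration; not part of oxbow.
_BED_SPEC = re.compile(r"^bed(\d+)(?:\+(\d*))?$", re.IGNORECASE)


def parse_bed_spec(spec):
    """Parse a BEDn+m specifier into (n_standard, n_extended).

    n_extended is 0 for "bedN", an int for "bedN+M", and None for "bedN+",
    where the number of extra fields is left unspecified.
    """
    m = _BED_SPEC.match(spec)
    if m is None:
        raise ValueError(f"not a BEDn+m specifier: {spec!r}")
    n_standard = int(m.group(1))
    if not 3 <= n_standard <= 12:
        raise ValueError("the number of standard BED fields must be 3 to 12")
    extra = m.group(2)
    if extra is None:
        n_extended = 0
    elif extra == "":
        n_extended = None
    else:
        n_extended = int(extra)
    return n_standard, n_extended


print(parse_bed_spec("bed3+"))   # (3, None)
print(parse_bed_spec("bed3+6"))  # (3, 6)
print(parse_bed_spec("bed9"))    # (9, 0)
```

In the outputs above, the unspecified case corresponds to the single "rest" column, while "bed3+6" yields six separate string columns.
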
BigBed AutoSql#

BigBed records natively store genomic coordinate fields and a flat string containing the “rest” of the data (equivalent to a bed3+ schema).

ox.from_bigbed("data/autosql-sample.bb").pl().head()
shape: (5, 4)
chrom   start  end    rest
str     u32    u32    str
"chr1"  11868  14409  "ENST00000456328.2 1000 + 11868…
"chr1"  14403  29570  "ENST00000488147.1 1000 - 14403…
"chr1"  17368  17436  "ENST00000619216.1 1000 - 17368…
"chr1"  29553  31097  "ENST00000473358.1 1000 + 29553…
"chr1"  30365  30503  "ENST00000607096.1 1000 + 30365…

If a BigBed file contains AutoSql definitions of its record fields and types, Oxbow can parse them.

ox.from_bigbed("data/autosql-sample.bb", schema="autosql").pl().head()
shape: (5, 20)
chromstartendnamescorestrandthickStartthickEndreservedblockCountblockSizeschromStartsname2cdsStartStatcdsEndStatexonFramestypegeneNamegeneName2geneType
stru32u32stru32stru32u32u32i32list[i32]list[i32]strstrstrlist[i32]strstrstrstr
"chr1"1186814409"ENST00000456328.2"1000"+"1186811868null3[359, 109, 1189][0, 744, 1352]"DDX11L1""none""none"[-1, -1, -1]"none""ENST00000456328.2""DDX11L1""none"
"chr1"1440329570"ENST00000488147.1"1000"-"1440314403null11[98, 34, … 37][0, 601, … 15130]"WASH7P""none""none"[-1, -1, … -1]"none""ENST00000488147.1""WASH7P""none"
"chr1"1736817436"ENST00000619216.1"1000"-"1736817368null1[68][0]"MIR6859-2""none""none"[-1]"none""ENST00000619216.1""MIR6859-2""none"
"chr1"2955331097"ENST00000473358.1"1000"+"2955329553null3[486, 104, 122][0, 1010, 1422]"MIR1302-11""none""none"[-1, -1, -1]"none""ENST00000473358.1""MIR1302-11""none"
"chr1"3036530503"ENST00000607096.1"1000"+"3036530365null1[138][0]"MIR1302-9""none""none"[-1]"none""ENST00000607096.1""MIR1302-9""none"

Custom BED schemas#

You can impose a custom parsing interpretation—field names and types (beyond the first three fields)—on a BED or BigBed file as long as the text values in those fields are compatible with the types you impose.

Pass in a BED schema as a tuple of (str, dict[str, str]): a specifier for the 3-12 standard BED fields ("bed{n}") plus custom extended fields encoded as a dictionary mapping field name to type name. Types can be declared using C-style AutoSql names (string, short, float, double, etc.) or Rust-style numeric shorthands (i8, u8, i32, f32, f64, etc.). Fixed- and variable-length array types can be declared using int[10] and int[] (AutoSql style) or [i32; 10] and [i32] (Rust shorthand style).

(
    ox.from_bigbed(
        "data/autosql-sample.bb", 
        schema=("bed4", {"score": "double", "strand": "string"})
    )
    .pl()
    .head()
)
shape: (5, 6)
chrom   start  end    name                 score   strand
str     u32    u32    str                  f64     str
"chr1"  11868  14409  "ENST00000456328.2"  1000.0  "+"
"chr1"  14403  29570  "ENST00000488147.1"  1000.0  "-"
"chr1"  17368  17436  "ENST00000619216.1"  1000.0  "-"
"chr1"  29553  31097  "ENST00000473358.1"  1000.0  "+"
"chr1"  30365  30503  "ENST00000607096.1"  1000.0  "+"
narrowpeak = (
    "bed6",
    {"fold_change": "f64", "-log10p": "f64", "-log10q": "f64", "relSummit": "i64"}
)
(
    ox.from_bed(
        "data/ENCFF758CQW.100.bed.gz", 
        bed_schema=narrowpeak,
        compression="gzip"
    )
    .pl()
    .head()
)
shape: (5, 10)
chrom    start      end        name  score  strand  fold_change  -log10p  -log10q  relSummit
str      i64        i64        str   u16    cat     f64          f64      f64      i64
"chr1"   86499906   86500478   null  1000   null    269.56463    -1.0     4.53508  306
"chr7"   25565806   25566365   null  1000   null    267.92568    -1.0     4.53508  275
"chr14"  49862021   49862498   null  1000   null    266.99777    -1.0     4.53508  212
"chr20"  58209727   58210234   null  1000   null    262.28789    -1.0     4.53508  273
"chr7"   151172497  151172982  null  1000   null    261.30677    -1.0     4.53508  242

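The two array notations described in this section can be told apart with a couple of regular expressions. The sketch below is a hypothetical helper (parse_type is not an oxbow function) that splits a type declaration into its element type and optional fixed length. It deliberately does not translate between AutoSql and Rust type names.

```python
import re

# Hypothetical helper for illustration; not part of oxbow.
_AUTOSQL_ARRAY = re.compile(r"^(\w+)\[(\d*)\]$")                 # int[] or int[10]
_RUST_ARRAY = re.compile(r"^\[\s*(\w+)\s*(?:;\s*(\d+)\s*)?\]$")  # [i32] or [i32; 10]


def parse_type(decl):
    """Return (element_type, fixed_length, is_array) for a type declaration.

    fixed_length is None for scalars and for variable-length arrays.
    """
    decl = decl.strip()
    for pattern in (_AUTOSQL_ARRAY, _RUST_ARRAY):
        m = pattern.match(decl)
        if m:
            length = int(m.group(2)) if m.group(2) else None
            return m.group(1), length, True
    return decl, None, False


for decl in ["double", "int[]", "int[10]", "[i32]", "[i32; 10]"]:
    print(decl, "->", parse_type(decl))
```
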
Zoom levels#

The UCSC BBI formats store multiple “zoom” or “reduction” levels. These are tables of fixed-resolution genomic bins containing summary statistics of the signal of a BigWig track or the interval coverage depth of a BigBed track.

ds = ox.from_bigwig("data/sample.bw")
ds.zoom_levels
[2621440, 10485760, 41943040]
ds.zoom(ds.zoom_levels[1]).regions("chr21").pl()
shape: (5, 8)
chrom    start     end       bases_covered  min   max   sum     sum_squares
cat      u32       u32       u64            f64   f64   f64     f64
"chr21"  9486505   17408540  90             20.0  80.0  4000.0  224000.0
"chr21"  17829945  26140885  155            0.0   80.0  7900.0  470000.0
"chr21"  27133600  36015675  205            0.0   80.0  9000.0  472000.0
"chr21"  36097355  44412085  190            0.0   80.0  7200.0  376000.0
"chr21"  45704025  48129895  65             20.0  80.0  2800.0  148000.0

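To give a feel for what the summary columns mean, the sketch below computes the same kinds of statistics (bases_covered, min, max, sum, sum_squares) over fixed-size bins from a few base-resolution intervals. It is a simplified illustration rather than the algorithm used to build BBI files. The function summarize_bins is hypothetical, bins are assumed to start at multiples of the bin size, each interval is assumed to fall within a single bin, and sums are assumed to be weighted by the number of bases covered.

```python
# Hypothetical sketch of zoom-level style summaries; not oxbow or UCSC code.
def summarize_bins(intervals, bin_size):
    """Summarize (start, end, value) intervals into fixed-size bins.

    Assumes half-open intervals that do not straddle a bin boundary.
    """
    bins = {}
    for start, end, value in intervals:
        n_bases = end - start
        stats = bins.setdefault(
            start // bin_size,
            {"bases_covered": 0, "min": value, "max": value,
             "sum": 0.0, "sum_squares": 0.0},
        )
        stats["bases_covered"] += n_bases
        stats["min"] = min(stats["min"], value)
        stats["max"] = max(stats["max"], value)
        stats["sum"] += value * n_bases
        stats["sum_squares"] += value * value * n_bases
    return bins


# The three chr21 intervals from the BigWig range query shown earlier.
intervals = [
    (10971770, 10971775, 40.0),
    (14787100, 14787105, 60.0),
    (14959050, 14959055, 20.0),
]
summary = summarize_bins(intervals, bin_size=10485760)
print(summary)
```

Storing the sum and the sum of squares alongside the covered base count lets a reader derive the mean and variance of a bin without rescanning the underlying data.
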
Remote files and file-like objects#

You can pull data directly from HTTP and cloud storage URLs. If needed, paths or URLs to index files must be given explicitly.

ds = ox.from_bam(
    "https://oxbow-ngs.s3.us-east-2.amazonaws.com/example.bam",
    index="https://oxbow-ngs.s3.us-east-2.amazonaws.com/example.bam.bai"
)

Instead of file paths or URLs, the source and index inputs used to create a data source can be callables that open and return a binary I/O stream, i.e., any Python file-like object.

ds = ox.from_bam(
    lambda : open("sample.bam", "rb"),
    index=lambda : open("sample.bam.bai", "rb"),
)
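
The key point of this pattern is that you pass a function that opens a stream, not an already opened stream. The standalone sketch below shows what that buys you, using an in-memory buffer in place of a real file. The name make_opener is hypothetical, and the snippet does not call oxbow at all.

```python
import io


# Hypothetical helper for illustration; not part of oxbow.
def make_opener(payload):
    """Return a zero-argument callable that opens a fresh binary stream."""
    def opener():
        return io.BytesIO(payload)
    return opener


open_source = make_opener(b"example bytes")

first = open_source()
first.read(7)            # consume part of the first stream

second = open_source()   # a new, independent stream
print(second.tell())     # 0
print(second.read(7))    # b'example'
```

Because every call yields an independent stream positioned at the start, a consumer can reopen the source whenever it needs to without depending on the state of a shared file handle.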

This gives you the power to customize your own transports – to read remote sources, diverse file system implementations, or different file encodings – independently of oxbow itself!

Libraries like fsspec or smart_open can be used for this purpose.

from fsspec.implementations.cached import CachingFileSystem
from s3fs import S3FileSystem

url = "https://oxbow-ngs.s3.us-east-2.amazonaws.com/example.bam"
httpfs = CachingFileSystem(target_protocol="https")
ds = ox.from_bam(
    lambda : httpfs.open(url, "rb"),
    index=lambda : httpfs.open(url + ".bai", "rb"),
)

s3fs = S3FileSystem(anon=True)
s3_uri = "s3://oxbow-ngs/example.bam"
ds = ox.from_bam(
    lambda : s3fs.open(s3_uri, "rb"),
    index=lambda : s3fs.open(s3_uri + ".bai", "rb"),
    tag_defs=[],
)
ds.regions("chr1:82744-85000").pl()

Streams and Fragments#

An oxbow data source object streams data via a sequence of Arrow RecordBatches. This stream is exposed as an iterator and you can use it to materialize each batch manually.

ds = ox.from_bam("data/sample.bam", batch_size=100)
batch = next(ds.batches())
batch
arro3.core.RecordBatch
+----------------------------------------+--------+-------------------------+---------+-------+-------+-------------------------+-----------+--------+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------+---------+
| qname                                  | flag   | rname                   | pos     | mapq  | cigar | rnext                   | pnext     | tlen   | seq                                                                         | qual                                                                        | end     |
| Utf8                                   | UInt16 | Dictionary(Int32, Utf8) | Int32   | UInt8 | Utf8  | Dictionary(Int32, Utf8) | Int32     | Int32  | Utf8                                                                        | Utf8                                                                        | Int32   |
+----------------------------------------+--------+-------------------------+---------+-------+-------+-------------------------+-----------+--------+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------+---------+
| HWI-BRUNOP16X_0001:3:48:4861:11838#0   | 163    | chr1                    | 10542   | 0     | 50M   | chr1                    | 10571     | 79     | CGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG                          | gggggggggggggggggggggggggeggggR\_[\ggggghggggggggg                          | 10591   |
| HWI-BRUNOP16X_0001:3:28:6650:168848#0  | 16     | chr1                    | 10546   | 16    | 75M   | null                    | null      | 0      | ATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAGCTCCGCC | fggggggggdgdggcdfggggfgggggggggggggggggggggggggfggggggggggggggggggggggggggg | 10620   |
| HWI-BRUNOP16X_0001:3:8:20066:88158#0   | 16     | chr1                    | 946457  | 0     | 75M   | null                    | null      | 0      | TAGTCCGAGGTCTCCTGAACCTTCCCAAGCAGCTGCTGCACCTGCCGGCAGTAGTTGGCCACCTTGCACTCCCGG | BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBd`aed``__U^__]_ggggcggggd]\\\[\]^]]gggggdfcbb | 946531  |
| HWI-BRUNOP16X_0001:3:27:10302:58768#0  | 16     | chr1                    | 1014060 | 37    | 75M   | null                    | null      | 0      | AGCTGAATGGGCAGGTCCCCCAGAAGATCGGCGTGCACGCCTTCCAGCAGCGTCTGGCTGTCCACCCGAGCGGTG | BBBBBBBBBBBBBBBBcYRcffggfgf_gfg\deegfgfgfcggcggfggggcgggggcgcggfgggggggggeg | 1014134 |
| HWI-BRUNOP16X_0001:3:65:3144:143676#0  | 83     | chr3                    | 196957  | 60    | 50M   | chr3                    | 196008    | -999   | GTAACGCTCCCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGGCTCG                          | BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^STTTSZW`beTTTTTSSTTT                          | 197006  |
| HWI-BRUNOP16X_0001:3:68:13088:156644#0 | 16     | chr3                    | 196958  | 37    | 75M   | null                    | null      | 0      | GACCCCCCCGGCCCCCGGCGCCCCCCCGCCCCGCCCCCGGGCGGGCGGGGGGGAGAAGGCGCCCGAGGGGAGGCG | BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB^bg_`[^\]X`ZZcggggdfgggggggg | 197032  |
| HWI-BRUNOP16X_0001:3:48:3417:101389#0  | 163    | chr3                    | 196961  | 60    | 50M   | chr3                    | 319702    | 122791 | GCTTACCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGGCTCGGGGG                          | gggggggggggggggggggggggfdaggggggdgggfgdhbe\T`BBBBB                          | 197010  |
| HWI-BRUNOP16X_0001:3:46:17583:95767#0  | 161    | chrX                    | 503847  | 0     | 50M   | chr4                    | 185365552 | 0      | TTTTATTTTTTTTTTTGAGATGGAGTCTCGCTCTTGTCACCGAGGCTGGA                          | ddfdfd____dffff]__aeZ]\XZSPSNSSSSSSbbaabZ_``BBBBBB                          | 503896  |
| HWI-BRUNOP16X_0001:3:4:7989:14941#0    | 16     | chrY                    | 586185  | 0     | 75M   | null                    | null      | 0      | GTGCGATCTCGGTTCGCTGCAACCTCTGCTTCCCAGGTTCAAGTGATTCTCCGGCCTCAGCCTCCCAAGTAGCNN | BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB | 586259  |
| HWI-BRUNOP16X_0001:3:44:11450:50194#0  | 0      | chrY                    | 587561  | 0     | 75M   | null                    | null      | 0      | NNTGCAGTGAGCTGAGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGGTAGACTGTGTCTCAAAAAAAAAAA | BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB | 587635  |
+----------------------------------------+--------+-------------------------+---------+-------+-------+-------------------------+-----------+--------+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------+---------+
pl.from_arrow(batch)
shape: (10, 12)
qname flag rname pos mapq cigar rnext pnext tlen seq qual end
str u16 cat i32 u8 str cat i32 i32 str str i32
"HWI-BRUNOP16X_0001:3:48:4861:1… 163 "chr1" 10542 0 "50M" "chr1" 10571 79 "CGAAATCTGTGCAGAGGAGAACGCAGCTCC… "gggggggggggggggggggggggggegggg… 10591
"HWI-BRUNOP16X_0001:3:28:6650:1… 16 "chr1" 10546 16 "75M" null null 0 "ATCTGTGCAGAGGAGAACGCAGCTCCGCCC… "fggggggggdgdggcdfggggfgggggggg… 10620
"HWI-BRUNOP16X_0001:3:8:20066:8… 16 "chr1" 946457 0 "75M" null null 0 "TAGTCCGAGGTCTCCTGAACCTTCCCAAGC… "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB… 946531
"HWI-BRUNOP16X_0001:3:27:10302:… 16 "chr1" 1014060 37 "75M" null null 0 "AGCTGAATGGGCAGGTCCCCCAGAAGATCG… "BBBBBBBBBBBBBBBBcYRcffggfgf_gf… 1014134
"HWI-BRUNOP16X_0001:3:65:3144:1… 83 "chr3" 196957 60 "50M" "chr3" 196008 -999 "GTAACGCTCCCGGACCCTGCGCGCCCCCGT… "BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^… 197006
"HWI-BRUNOP16X_0001:3:68:13088:… 16 "chr3" 196958 37 "75M" null null 0 "GACCCCCCCGGCCCCCGGCGCCCCCCCGCC… "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB… 197032
"HWI-BRUNOP16X_0001:3:48:3417:1… 163 "chr3" 196961 60 "50M" "chr3" 319702 122791 "GCTTACCGGACCCTGCGCGCCCCCGTCCCG… "gggggggggggggggggggggggfdagggg… 197010
"HWI-BRUNOP16X_0001:3:46:17583:… 161 "chrX" 503847 0 "50M" "chr4" 185365552 0 "TTTTATTTTTTTTTTTGAGATGGAGTCTCG… "ddfdfd____dffff]__aeZ]\XZSPSNS… 503896
"HWI-BRUNOP16X_0001:3:4:7989:14… 16 "chrY" 586185 0 "75M" null null 0 "GTGCGATCTCGGTTCGCTGCAACCTCTGCT… "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB… 586259
"HWI-BRUNOP16X_0001:3:44:11450:… 0 "chrY" 587561 0 "75M" null null 0 "NNTGCAGTGAGCTGAGATTGTGCCACTGCA… "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB… 587635

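Because batches arrive one at a time, a larger-than-memory file can be reduced to a summary without being materialized in full. The sketch below shows that pattern with plain Python lists. `fake_batches` is a hypothetical stand-in for `ds.batches()`, and a real loop would operate on Arrow RecordBatches rather than lists of strings.

```python
from collections import Counter


def fake_batches(records, batch_size):
    """Yield successive slices of `records`, standing in for ds.batches()."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]


# Reference names (rname) of the ten records shown above.
rnames = ["chr1"] * 4 + ["chr3"] * 3 + ["chrX"] + ["chrY"] * 2

counts = Counter()
n_batches = 0
for batch in fake_batches(rnames, batch_size=4):
    # Only the current batch is needed here: update the running totals,
    # then move on to the next batch.
    counts.update(batch)
    n_batches += 1
```

After the loop, `counts` holds the number of records per reference sequence, accumulated batch by batch.
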
A data source is logically divided into fragments. Without random access, it consists of a single fragment.

ds = ox.from_bam("data/sample.bam")
ds.fragments()
[<oxbow._pyarrow.BatchReaderFragment at 0x7c928e83da90>]

When you register range queries, each query gets mapped to a unique fragment. Each fragment generates an independent stream of record batches.

ds = ox.from_bam("data/sample.bam").regions(["chr1", "chr3", "chrX"])
ds.fragments()
[<oxbow._pyarrow.BatchReaderFragment at 0x7c9287752350>,
 <oxbow._pyarrow.BatchReaderFragment at 0x7c92877c2060>,
 <oxbow._pyarrow.BatchReaderFragment at 0x7c92877c1a70>]
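
Since each fragment generates its own stream, fragments can be consumed independently of one another. The toy model below illustrates the idea with plain Python generators and a thread pool. It is not how oxbow implements fragments: `make_fragment` and `count_records` are hypothetical, and the record counts are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def make_fragment(region, n_records, batch_size=2):
    """Return a zero-argument callable producing one independent batch stream.

    Hypothetical stand-in for a fragment; records are just (region, i) tuples.
    """
    def stream():
        for start in range(0, n_records, batch_size):
            stop = min(start + batch_size, n_records)
            yield [(region, i) for i in range(start, stop)]
    return stream


def count_records(fragment):
    """Drain one fragment's stream and return the number of records seen."""
    return sum(len(batch) for batch in fragment())


# One fragment per registered query, kept in query order.
queries = {"chr1": 4, "chr3": 3, "chrX": 1}
fragments = [make_fragment(region, n) for region, n in queries.items()]

# The streams share no state, so they can be drained concurrently or in any order.
with ThreadPoolExecutor() as pool:
    per_fragment = list(pool.map(count_records, fragments))
```

Because `pool.map` returns results in input order, `per_fragment` lines up with the order of the queries.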

Dask data frames#

Dask takes a different approach from the streaming paradigm of Polars and DuckDB: it subdivides a dataset into a known number of independently accessible logical partitions, each of which is expected to fit in memory. When you convert an oxbow data source into a Dask data frame, oxbow maps each fragment to a partition:

df = (
    ox.from_bam("data/sample.bam")
    .regions(["chr1", "chrX", "chrY"])
    .dd()  # or to_dask()
)
df
Dask DataFrame Structure:
qname flag rname pos mapq cigar rnext pnext tlen seq qual end
npartitions=3
string uint16 category[known] int32 uint8 string category[known] int32 int32 string string int32
... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: to_string_dtype, 2 expressions
df.partitions[1].compute()
qname flag rname pos mapq cigar rnext pnext tlen seq qual end
0 HWI-BRUNOP16X_0001:3:46:17583:95767#0 161 chrX 503847 0 50M chr4 185365552 0 TTTTATTTTTTTTTTTGAGATGGAGTCTCGCTCTTGTCACCGAGGC... ddfdfd____dffff]__aeZ]\XZSPSNSSSSSSbbaabZ_``BB... 503896