Quickstart#
This is a quickstart guide to using Oxbow. Oxbow lets you access potentially larger-than-memory genomic files as tabular data structures, such as data frames.
Create a DataSource#
Use the convenience function associated with your file type. The returned DataSource object can be used to access the data in the file.
import oxbow as ox
ds = ox.from_bam("data/sample.bam")
Into data frames#
If the dataset fits comfortably in memory, you can materialize it fully as a Pandas or Polars data frame.
ds.pd() # or ds.to_pandas()
| | qname | flag | rname | pos | mapq | cigar | rnext | pnext | tlen | seq | qual | end | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HWI-BRUNOP16X_0001:3:48:4861:11838#0 | 163 | chr1 | 10542 | 0 | 50M | chr1 | 10571.0 | 79 | CGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCT... | gggggggggggggggggggggggggeggggR\_[\ggggghggggg... | 10591 | {'AM': 0.0, 'MD': '18C31', 'NM': 1, 'RG': 'bra... |
| 1 | HWI-BRUNOP16X_0001:3:28:6650:168848#0 | 16 | chr1 | 10546 | 16 | 75M | NaN | NaN | 0 | ATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG... | fggggggggdgdggcdfggggfgggggggggggggggggggggggg... | 10620 | {'AM': None, 'MD': '14C52A7', 'NM': 2, 'RG': '... |
| 2 | HWI-BRUNOP16X_0001:3:8:20066:88158#0 | 16 | chr1 | 946457 | 0 | 75M | NaN | NaN | 0 | TAGTCCGAGGTCTCCTGAACCTTCCCAAGCAGCTGCTGCACCTGCC... | BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBd`aed``__U^__]_g... | 946531 | {'AM': None, 'MD': '2T0G5T65', 'NM': 3, 'RG': ... |
| 3 | HWI-BRUNOP16X_0001:3:27:10302:58768#0 | 16 | chr1 | 1014060 | 37 | 75M | NaN | NaN | 0 | AGCTGAATGGGCAGGTCCCCCAGAAGATCGGCGTGCACGCCTTCCA... | BBBBBBBBBBBBBBBBcYRcffggfgf_gfg\deegfgfgfcggcg... | 1014134 | {'AM': None, 'MD': '7G1C4A2A57', 'NM': 4, 'RG'... |
| 4 | HWI-BRUNOP16X_0001:3:65:3144:143676#0 | 83 | chr3 | 196957 | 60 | 50M | chr3 | 196008.0 | -999 | GTAACGCTCCCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGG... | BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^STTTSZW`beTTTTTS... | 197006 | {'AM': 37.0, 'MD': '0C0A0G1G0C0T1A41', 'NM': 7... |
| 5 | HWI-BRUNOP16X_0001:3:68:13088:156644#0 | 16 | chr3 | 196958 | 37 | 75M | NaN | NaN | 0 | GACCCCCCCGGCCCCCGGCGCCCCCCCGCCCCGCCCCCGGGCGGGC... | BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB... | 197032 | {'AM': None, 'MD': '0A0G0A0G1T0T0A1C1G0A3T4G6T... |
| 6 | HWI-BRUNOP16X_0001:3:48:3417:101389#0 | 163 | chr3 | 196961 | 60 | 50M | chr3 | 319702.0 | 122791 | GCTTACCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGGCTCG... | gggggggggggggggggggggggfdaggggggdgggfgdhbe\T`B... | 197010 | {'AM': 37.0, 'MD': '50', 'NM': 0, 'RG': 'brain... |
| 7 | HWI-BRUNOP16X_0001:3:46:17583:95767#0 | 161 | chrX | 503847 | 0 | 50M | chr4 | 185365552.0 | 0 | TTTTATTTTTTTTTTTGAGATGGAGTCTCGCTCTTGTCACCGAGGC... | ddfdfd____dffff]__aeZ]\XZSPSNSSSSSSbbaabZ_``BB... | 503896 | {'AM': 0.0, 'MD': '4T36C8', 'NM': 2, 'RG': 'br... |
| 8 | HWI-BRUNOP16X_0001:3:4:7989:14941#0 | 16 | chrY | 586185 | 0 | 75M | NaN | NaN | 0 | GTGCGATCTCGGTTCGCTGCAACCTCTGCTTCCCAGGTTCAAGTGA... | BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB... | 586259 | {'AM': None, 'MD': '4C10A10C16C29T0G0', 'NM': ... |
| 9 | HWI-BRUNOP16X_0001:3:44:11450:50194#0 | 0 | chrY | 587561 | 0 | 75M | NaN | NaN | 0 | NNTGCAGTGAGCTGAGATTGTGCCACTGCACTCCAGCCTGGGTGAC... | BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB... | 587635 | {'AM': None, 'MD': '0G0G48T0G23', 'NM': 4, 'RG... |
ds.pl() # or ds.to_polars()
| qname | flag | rname | pos | mapq | cigar | rnext | pnext | tlen | seq | qual | end | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u16 | cat | i32 | u8 | str | cat | i32 | i32 | str | str | i32 | struct[12] |
| "HWI-BRUNOP16X_0001:3:48:4861:1… | 163 | "chr1" | 10542 | 0 | "50M" | "chr1" | 10571 | 79 | "CGAAATCTGTGCAGAGGAGAACGCAGCTCC… | "gggggggggggggggggggggggggegggg… | 10591 | {0,"18C31",1,"brain_50_fcb",0,3,8,null,0,1,0,"82"} |
| "HWI-BRUNOP16X_0001:3:28:6650:1… | 16 | "chr1" | 10546 | 16 | "75M" | null | null | 0 | "ATCTGTGCAGAGGAGAACGCAGCTCCGCCC… | "fggggggggdgdggcdfggggfgggggggg… | 10620 | {null,"14C52A7",2,"brain_75_fca",null,1,5,null,0,2,0,"85"} |
| "HWI-BRUNOP16X_0001:3:8:20066:8… | 16 | "chr1" | 946457 | 0 | "75M" | null | null | 0 | "TAGTCCGAGGTCTCCTGAACCTTCCCAAGC… | "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB… | 946531 | {null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"} |
| "HWI-BRUNOP16X_0001:3:27:10302:… | 16 | "chr1" | 1014060 | 37 | "75M" | null | null | 0 | "AGCTGAATGGGCAGGTCCCCCAGAAGATCG… | "BBBBBBBBBBBBBBBBcYRcffggfgf_gf… | 1014134 | {null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"} |
| "HWI-BRUNOP16X_0001:3:65:3144:1… | 83 | "chr3" | 196957 | 60 | "50M" | "chr3" | 196008 | -999 | "GTAACGCTCCCGGACCCTGCGCGCCCCCGT… | "BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^… | 197006 | {37,"0C0A0G1G0C0T1A41",7,"brain_50_fcb",37,1,0,null,0,7,0,"85"} |
| "HWI-BRUNOP16X_0001:3:68:13088:… | 16 | "chr3" | 196958 | 37 | "75M" | null | null | 0 | "GACCCCCCCGGCCCCCGGCGCCCCCCCGCC… | "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB… | 197032 | {null,"0A0G0A0G1T0T0A1C1G0A3T4G6T4G1T2C3C3T0C27",19,"brain_75_fca",null,1,0,null,0,19,0,"85"} |
| "HWI-BRUNOP16X_0001:3:48:3417:1… | 163 | "chr3" | 196961 | 60 | "50M" | "chr3" | 319702 | 122791 | "GCTTACCGGACCCTGCGCGCCCCCGTCCCG… | "gggggggggggggggggggggggfdagggg… | 197010 | {37,"50",0,"brain_50_fcb",37,1,0,null,0,0,0,"85"} |
| "HWI-BRUNOP16X_0001:3:46:17583:… | 161 | "chrX" | 503847 | 0 | "50M" | "chr4" | 185365552 | 0 | "TTTTATTTTTTTTTTTGAGATGGAGTCTCG… | "ddfdfd____dffff]__aeZ]\XZSPSNS… | 503896 | {0,"4T36C8",2,"brain_50_fcb",0,18,174,null,0,2,0,"82"} |
| "HWI-BRUNOP16X_0001:3:4:7989:14… | 16 | "chrY" | 586185 | 0 | "75M" | null | null | 0 | "GTGCGATCTCGGTTCGCTGCAACCTCTGCT… | "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB… | 586259 | {null,"4C10A10C16C29T0G0",6,"brain_75_fca",null,2,2,"X,-586185,75M,6;3,+196723225,75M,7;19,+13666092,75M,7;",0,6,0,"82"} |
| "HWI-BRUNOP16X_0001:3:44:11450:… | 0 | "chrY" | 587561 | 0 | "75M" | null | null | 0 | "NNTGCAGTGAGCTGAGATTGTGCCACTGCA… | "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB… | 587635 | {null,"0G0G48T0G23",4,"brain_75_fca",null,6,54,null,0,4,0,"82"} |
Into lazy data structures#
If the data source is very large, you can also load it into a lazy or “out-of-core” data structure, such as a Polars lazy frame or Dask data frame.
df = ds.pl(lazy=True)
df.show_graph()
df.head().collect()
| qname | flag | rname | pos | mapq | cigar | rnext | pnext | tlen | seq | qual | end | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u16 | cat | i32 | u8 | str | cat | i32 | i32 | str | str | i32 | struct[12] |
| "HWI-BRUNOP16X_0001:3:48:4861:1… | 163 | "chr1" | 10542 | 0 | "50M" | "chr1" | 10571 | 79 | "CGAAATCTGTGCAGAGGAGAACGCAGCTCC… | "gggggggggggggggggggggggggegggg… | 10591 | {0,"18C31",1,"brain_50_fcb",0,3,8,null,0,1,0,"82"} |
| "HWI-BRUNOP16X_0001:3:28:6650:1… | 16 | "chr1" | 10546 | 16 | "75M" | null | null | 0 | "ATCTGTGCAGAGGAGAACGCAGCTCCGCCC… | "fggggggggdgdggcdfggggfgggggggg… | 10620 | {null,"14C52A7",2,"brain_75_fca",null,1,5,null,0,2,0,"85"} |
| "HWI-BRUNOP16X_0001:3:8:20066:8… | 16 | "chr1" | 946457 | 0 | "75M" | null | null | 0 | "TAGTCCGAGGTCTCCTGAACCTTCCCAAGC… | "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB… | 946531 | {null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"} |
| "HWI-BRUNOP16X_0001:3:27:10302:… | 16 | "chr1" | 1014060 | 37 | "75M" | null | null | 0 | "AGCTGAATGGGCAGGTCCCCCAGAAGATCG… | "BBBBBBBBBBBBBBBBcYRcffggfgf_gf… | 1014134 | {null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"} |
| "HWI-BRUNOP16X_0001:3:65:3144:1… | 83 | "chr3" | 196957 | 60 | "50M" | "chr3" | 196008 | -999 | "GTAACGCTCCCGGACCCTGCGCGCCCCCGT… | "BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^… | 197006 | {37,"0C0A0G1G0C0T1A41",7,"brain_50_fcb",37,1,0,null,0,7,0,"85"} |
Oxbow data sources can also be loaded into a DuckDB relation.
import duckdb
conn = duckdb.connect(":memory:")
ds = ox.from_gtf("data/gencode.v47.annotation.gtf")
rel = ds.to_duckdb(conn)
conn.sql(
    "SELECT seqid AS chrom, type, start, rel.end, strand, attributes.gene_name "
    "FROM rel "
    "WHERE attributes.gene_name = 'PCSK9' "
    "LIMIT 10"
).pl()
Note
See the Streams and Fragments section for details on building Dask data frames.
Range queries#
Data sources with indexes support querying genomic ranges. This is the case for htslib formats that are compressed with the BGZF gzip variant and indexed with an appropriate companion index file (e.g., .bai, .tbi, .csi). The BBI formats, BigWig and BigBed, possess an internal index and support range queries without an index file.
You can pass one or more ranges to the constructor or to the regions() method. All records that overlap the query ranges are returned.
ds = ox.from_bam("data/sample.bam", index="data/sample.bam.bai")
ds = ds.regions("chr1:900000-1100000")
ds.pl()
| qname | flag | rname | pos | mapq | cigar | rnext | pnext | tlen | seq | qual | end | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u16 | cat | i32 | u8 | str | cat | i32 | i32 | str | str | i32 | struct[12] |
| "HWI-BRUNOP16X_0001:3:8:20066:8… | 16 | "chr1" | 946457 | 0 | "75M" | null | null | 0 | "TAGTCCGAGGTCTCCTGAACCTTCCCAAGC… | "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB… | 946531 | {null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"} |
| "HWI-BRUNOP16X_0001:3:27:10302:… | 16 | "chr1" | 1014060 | 37 | "75M" | null | null | 0 | "AGCTGAATGGGCAGGTCCCCCAGAAGATCG… | "BBBBBBBBBBBBBBBBcYRcffggfgf_gf… | 1014134 | {null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"} |
If the index file exists in the same location as the source file, it is automatically detected.
ox.from_bam("data/sample.bam").regions(["chr1", "chr3"]).pl()
| qname | flag | rname | pos | mapq | cigar | rnext | pnext | tlen | seq | qual | end | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u16 | cat | i32 | u8 | str | cat | i32 | i32 | str | str | i32 | struct[12] |
| "HWI-BRUNOP16X_0001:3:48:4861:1… | 163 | "chr1" | 10542 | 0 | "50M" | "chr1" | 10571 | 79 | "CGAAATCTGTGCAGAGGAGAACGCAGCTCC… | "gggggggggggggggggggggggggegggg… | 10591 | {0,"18C31",1,"brain_50_fcb",0,3,8,null,0,1,0,"82"} |
| "HWI-BRUNOP16X_0001:3:28:6650:1… | 16 | "chr1" | 10546 | 16 | "75M" | null | null | 0 | "ATCTGTGCAGAGGAGAACGCAGCTCCGCCC… | "fggggggggdgdggcdfggggfgggggggg… | 10620 | {null,"14C52A7",2,"brain_75_fca",null,1,5,null,0,2,0,"85"} |
| "HWI-BRUNOP16X_0001:3:8:20066:8… | 16 | "chr1" | 946457 | 0 | "75M" | null | null | 0 | "TAGTCCGAGGTCTCCTGAACCTTCCCAAGC… | "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB… | 946531 | {null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"} |
| "HWI-BRUNOP16X_0001:3:27:10302:… | 16 | "chr1" | 1014060 | 37 | "75M" | null | null | 0 | "AGCTGAATGGGCAGGTCCCCCAGAAGATCG… | "BBBBBBBBBBBBBBBBcYRcffggfgf_gf… | 1014134 | {null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"} |
| "HWI-BRUNOP16X_0001:3:65:3144:1… | 83 | "chr3" | 196957 | 60 | "50M" | "chr3" | 196008 | -999 | "GTAACGCTCCCGGACCCTGCGCGCCCCCGT… | "BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^… | 197006 | {37,"0C0A0G1G0C0T1A41",7,"brain_50_fcb",37,1,0,null,0,7,0,"85"} |
| "HWI-BRUNOP16X_0001:3:68:13088:… | 16 | "chr3" | 196958 | 37 | "75M" | null | null | 0 | "GACCCCCCCGGCCCCCGGCGCCCCCCCGCC… | "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB… | 197032 | {null,"0A0G0A0G1T0T0A1C1G0A3T4G6T4G1T2C3C3T0C27",19,"brain_75_fca",null,1,0,null,0,19,0,"85"} |
| "HWI-BRUNOP16X_0001:3:48:3417:1… | 163 | "chr3" | 196961 | 60 | "50M" | "chr3" | 319702 | 122791 | "GCTTACCGGACCCTGCGCGCCCCCGTCCCG… | "gggggggggggggggggggggggfdagggg… | 197010 | {37,"50",0,"brain_50_fcb",37,1,0,null,0,0,0,"85"} |
ox.from_bigwig("data/sample.bw").regions("chr21:10900000-15000000").pl()
| chrom | start | end | value |
|---|---|---|---|
| str | u32 | u32 | f32 |
| "chr21" | 10971770 | 10971775 | 40.0 |
| "chr21" | 14787100 | 14787105 | 60.0 |
| "chr21" | 14959050 | 14959055 | 20.0 |
Note
Oxbow handles multiple ranges as separate fragments. For more details, see the Streams and Fragments section below.
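The overlap semantics described in this section can be sketched in plain Python. This is an illustration only, not Oxbow's implementation: `Region`, `parse_region`, and `overlaps` are hypothetical helpers, and treating `pos` and `end` as a closed interval is an assumption based on the tables above. Contig names that themselves contain a colon are not handled.

```python
import re
from typing import NamedTuple, Optional


class Region(NamedTuple):
    chrom: str
    start: Optional[int]  # None means "whole chromosome"
    end: Optional[int]


def parse_region(text: str) -> Region:
    """Parse 'chr1' or 'chr1:900000-1100000' into a Region (hypothetical helper)."""
    match = re.fullmatch(r"([^:]+)(?::([\d,]+)-([\d,]+))?", text)
    if match is None:
        raise ValueError(f"unrecognized region string: {text!r}")
    chrom, start, end = match.groups()
    if start is None:
        return Region(chrom, None, None)
    return Region(chrom, int(start.replace(",", "")), int(end.replace(",", "")))


def overlaps(region: Region, chrom: str, pos: int, end: int) -> bool:
    """True if the closed interval [pos, end] on chrom intersects the region."""
    if chrom != region.chrom:
        return False
    if region.start is None:
        return True
    return pos <= region.end and end >= region.start


# (rname, pos, end) of the chr1 records shown in the full table earlier
records = [
    ("chr1", 10542, 10591),
    ("chr1", 10546, 10620),
    ("chr1", 946457, 946531),
    ("chr1", 1014060, 1014134),
]
query = parse_region("chr1:900000-1100000")
hits = [rec for rec in records if overlaps(query, *rec)]
print(hits)  # [('chr1', 946457, 946531), ('chr1', 1014060, 1014134)]
```

The two hits are the same two records returned by the `chr1:900000-1100000` query above. An indexed reader reaches them without scanning the whole file, which this linear filter does not attempt to model.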
Column projection#
Oxbow lets you select only the columns you need and will not parse the others. This is referred to as column “projection”.
import polars as pl
ox.from_bam(
    "data/sample.bam",
    fields=["rname", "pos", "end", "mapq"],
    tag_defs=[],  # an empty list omits the tags column
).regions(
    "chr1"
).pl()
| rname | pos | end | mapq |
|---|---|---|---|
| cat | i32 | i32 | u8 |
| "chr1" | 10542 | 10591 | 0 |
| "chr1" | 10546 | 10620 | 16 |
| "chr1" | 946457 | 946531 | 0 |
| "chr1" | 1014060 | 1014134 | 37 |
The lazy data structures returned by a data source can push column projection down to Oxbow. In the following example, only the four fields passed to the Polars LazyFrame.select method are parsed when the output is computed.
df = (
    ox.from_bam("data/sample.bam")
    .regions("chr1")
    .pl(lazy=True)
    .select(
        pl.col("rname").alias("chrom"),
        pl.col("pos").alias("start"),
        "end",
        "mapq",
    )
    .collect()
)
df
| chrom | start | end | mapq |
|---|---|---|---|
| cat | i32 | i32 | u8 |
| "chr1" | 10542 | 10591 | 0 |
| "chr1" | 10546 | 10620 | 16 |
| "chr1" | 946457 | 946531 | 0 |
| "chr1" | 1014060 | 1014134 | 37 |
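To make the idea of projection concrete, here is a toy sketch in plain Python. It is not how Oxbow reads BAM records; `SCHEMA` and `parse_projected` are hypothetical names, and a tab-delimited text line stands in for a real record. The point is only that type conversion is performed for the requested columns and skipped for the rest.

```python
from typing import Dict, List, Tuple

# Hypothetical schema: column name -> (position in the line, converter)
SCHEMA: Dict[str, Tuple[int, type]] = {
    "qname": (0, str),
    "flag": (1, int),
    "rname": (2, str),
    "pos": (3, int),
    "mapq": (4, int),
    "cigar": (5, str),
}


def parse_projected(line: str, fields: List[str]) -> dict:
    """Convert only the requested columns of a tab-delimited record."""
    raw = line.rstrip("\n").split("\t")
    row = {}
    for name in fields:
        index, convert = SCHEMA[name]
        row[name] = convert(raw[index])
    return row


line = "HWI-BRUNOP16X_0001:3:48:4861:11838#0\t163\tchr1\t10542\t0\t50M\n"
row = parse_projected(line, ["rname", "pos", "mapq"])
print(row)  # {'rname': 'chr1', 'pos': 10542, 'mapq': 0}
```

The savings grow with the cost of the skipped columns, which is why omitting wide or nested fields such as sequences, qualities, and tags tends to matter most.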
Nested and complex fields#
Oxbow can handle the complex field structures of genomics file formats because they can all be mapped to Arrow constructs like lists, arrays, and structs.
For example, fields like SAM tags, VCF info and samples, and GTF attributes are exposed as struct columns in Arrow-native libraries like Polars, which are easy and efficient to manipulate.
GTF/GFF attributes#
GTF/GFF attributes are analogous to SAM tags. For GTF, the type is always "String". For GFF, attributes can be "String" or "Array", the latter materializing as a list column.
df = (
    ox.from_gff("data/sample.gff")
    .pl()
)
df.head()
| seqid | source | type | start | end | score | strand | frame | attributes |
|---|---|---|---|---|---|---|---|---|
| str | str | str | i32 | i32 | f32 | str | u8 | struct[18] |
| "chr13" | "HAVANA" | "exon" | 81326030 | 81326030 | null | "+" | null | {"exon:ENST00000782961.1:2","ENST00000782961.1",null,"ENSE00004156517.1","2","ENSG00000229309.3","ENSG00000229309","lncRNA","OTTHUMG00000017146.2",null,null,"2",null,["basic", "Ensembl_canonical", "TAGENE"],"ENST00000782961.1","ENST00000782961",null,"lncRNA"} |
| "chr6" | "HAVANA" | "CDS" | 32002399 | 32002399 | null | "+" | 1 | {"CDS:ENST00000498271.1","ENST00000498271.1","CCDS59005.1","ENSE00001878698.1","40","ENSG00000244731.10","C4A","protein_coding","OTTHUMG00000031186.6","OTTHUMT00000356896.1","HGNC:1323","2","ENSP00000420212.1",["RNA_Seq_supported_only", "basic", … "CCDS"],"ENST00000498271.1","C4A-246","1","protein_coding"} |
| "chr10" | "HAVANA" | "exon" | 72930538 | 72930538 | null | "+" | null | {"exon:ENST00000334011.10:8","ENST00000334011.10","CCDS7318.1","ENSE00001170094.1","8","ENSG00000138315.13","OIT3","protein_coding","OTTHUMG00000018444.2","OTTHUMT00000048596.2","HGNC:29953","2","ENSP00000333900.5",["basic", "Ensembl_canonical", … "CCDS"],"ENST00000334011.10","OIT3-201","1","protein_coding"} |
| "chr1" | "HAVANA" | "exon" | 497210 | 497210 | null | "-" | null | {"exon:ENST00000641916.1:4","ENST00000641916.1",null,"ENSE00003812605.1","4","ENSG00000290385.2","ENSG00000290385","lncRNA",null,"OTTHUMT00000493599.1",null,"2",null,null,"ENST00000641916.1","ENST00000641916",null,"lncRNA"} |
| "chr13" | "HAVANA" | "CDS" | 35655579 | 35655579 | null | "+" | 2 | {"CDS:ENST00000629018.4","ENST00000629018.4",null,"ENSE00000938859.1","28","ENSG00000172915.20","NBEA","protein_coding","OTTHUMG00000016724.2",null,"HGNC:7648","2","ENSP00000486239.3",["RNA_Seq_supported_only", "mRNA_start_NF", "cds_start_NF"],"ENST00000629018.4","NBEA-207","5","protein_coding"} |
df['attributes'].struct.unnest().head()
| ID | Parent | ccdsid | exon_id | exon_number | gene_id | gene_name | gene_type | havana_gene | havana_transcript | hgnc_id | level | protein_id | tag | transcript_id | transcript_name | transcript_support_level | transcript_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | str | str | str | str | str | str | str | str | str | list[str] | str | str | str | str |
| "exon:ENST00000782961.1:2" | "ENST00000782961.1" | null | "ENSE00004156517.1" | "2" | "ENSG00000229309.3" | "ENSG00000229309" | "lncRNA" | "OTTHUMG00000017146.2" | null | null | "2" | null | ["basic", "Ensembl_canonical", "TAGENE"] | "ENST00000782961.1" | "ENST00000782961" | null | "lncRNA" |
| "CDS:ENST00000498271.1" | "ENST00000498271.1" | "CCDS59005.1" | "ENSE00001878698.1" | "40" | "ENSG00000244731.10" | "C4A" | "protein_coding" | "OTTHUMG00000031186.6" | "OTTHUMT00000356896.1" | "HGNC:1323" | "2" | "ENSP00000420212.1" | ["RNA_Seq_supported_only", "basic", … "CCDS"] | "ENST00000498271.1" | "C4A-246" | "1" | "protein_coding" |
| "exon:ENST00000334011.10:8" | "ENST00000334011.10" | "CCDS7318.1" | "ENSE00001170094.1" | "8" | "ENSG00000138315.13" | "OIT3" | "protein_coding" | "OTTHUMG00000018444.2" | "OTTHUMT00000048596.2" | "HGNC:29953" | "2" | "ENSP00000333900.5" | ["basic", "Ensembl_canonical", … "CCDS"] | "ENST00000334011.10" | "OIT3-201" | "1" | "protein_coding" |
| "exon:ENST00000641916.1:4" | "ENST00000641916.1" | null | "ENSE00003812605.1" | "4" | "ENSG00000290385.2" | "ENSG00000290385" | "lncRNA" | null | "OTTHUMT00000493599.1" | null | "2" | null | null | "ENST00000641916.1" | "ENST00000641916" | null | "lncRNA" |
| "CDS:ENST00000629018.4" | "ENST00000629018.4" | null | "ENSE00000938859.1" | "28" | "ENSG00000172915.20" | "NBEA" | "protein_coding" | "OTTHUMG00000016724.2" | null | "HGNC:7648" | "2" | "ENSP00000486239.3" | ["RNA_Seq_supported_only", "mRNA_start_NF", "cds_start_NF"] | "ENST00000629018.4" | "NBEA-207" | "5" | "protein_coding" |
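The difference between "String" and "Array" attributes can be illustrated with a small parser for the GFF3 attribute column, where `key=value` pairs are separated by semicolons and multiple values by commas. This is a simplified sketch rather than Oxbow's parser: `parse_gff3_attributes` and its `array_keys` parameter are hypothetical, the input string is reconstructed from the first row of the table above, and percent-encoded characters are not decoded.

```python
from typing import Dict, List, Set, Union

AttrValue = Union[str, List[str]]


def parse_gff3_attributes(text: str, array_keys: Set[str]) -> Dict[str, AttrValue]:
    """Split a GFF3 attribute column into a dict of strings and lists."""
    attrs: Dict[str, AttrValue] = {}
    for item in text.strip().split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        key = key.strip()
        if key in array_keys:
            attrs[key] = value.split(",")  # "Array" attribute -> list
        else:
            attrs[key] = value             # "String" attribute
    return attrs


# Reconstructed from the first row shown above (subset of attributes)
raw = (
    "ID=exon:ENST00000782961.1:2;Parent=ENST00000782961.1;"
    "gene_type=lncRNA;tag=basic,Ensembl_canonical,TAGENE"
)
attrs = parse_gff3_attributes(raw, array_keys={"tag"})
print(attrs["tag"])  # ['basic', 'Ensembl_canonical', 'TAGENE']
```

Declaring which keys are arrays up front keeps every row of a column the same type, which a columnar struct field requires.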
VCF/BCF info fields#
For the htslib variant call formats, VCF and BCF, the subfields of the INFO field are defined in the VCF header, so they do not need to be discovered by sniffing rows and you do not need to specify types.
By default, all info fields are parsed. You can project any subset or ignore them entirely using the info_fields argument.
(
    ox.from_vcf(
        "data/sample.vcf.gz",
        info_fields=[],  # omit the info column
        samples=[],      # omit all sample columns
    )
    .pl()
).head()
| chrom | pos | id | ref | alt | qual | filter |
|---|---|---|---|---|---|---|
| cat | i32 | list[str] | str | list[str] | f32 | list[str] |
| "1" | 65872 | [] | "T" | ["G"] | 44.18 | [] |
| "1" | 69511 | [] | "A" | ["G"] | 2552.929932 | [] |
| "1" | 762273 | [] | "G" | ["A"] | 19085.929688 | [] |
| "1" | 866511 | [] | "C" | ["CCCCT"] | 3136.889893 | [] |
| "1" | 876499 | [] | "A" | ["G"] | 3338.929932 | [] |
df = (
    ox.from_vcf(
        "data/sample.vcf.gz",
        info_fields=["TYPE", "snpeff.Effect", "snpeff.Gene_Name", "snpeff.Transcript_BioType"],
        samples=[],
    )
    .pl()
)
df.head()
| chrom | pos | id | ref | alt | qual | filter | info |
|---|---|---|---|---|---|---|---|
| cat | i32 | list[str] | str | list[str] | f32 | list[str] | struct[4] |
| "1" | 65872 | [] | "T" | ["G"] | 44.18 | [] | {["SNP"],"intergenic_region",null,null} |
| "1" | 69511 | [] | "A" | ["G"] | 2552.929932 | [] | {["SNP"],"sequence_feature[transmembrane_region:Transmembrane_region]","OR4F5","protein_coding"} |
| "1" | 762273 | [] | "G" | ["A"] | 19085.929688 | [] | {["SNP"],"non_coding_exon_variant","LINC00115","lincRNA"} |
| "1" | 866511 | [] | "C" | ["CCCCT"] | 3136.889893 | [] | {["Insertion"],"intron_variant","SAMD11","protein_coding"} |
| "1" | 876499 | [] | "A" | ["G"] | 3338.929932 | [] | {["SNP"],"intron_variant","SAMD11","protein_coding"} |
df.unnest("info").head()
| chrom | pos | id | ref | alt | qual | filter | TYPE | snpeff.Effect | snpeff.Gene_Name | snpeff.Transcript_BioType |
|---|---|---|---|---|---|---|---|---|---|---|
| cat | i32 | list[str] | str | list[str] | f32 | list[str] | list[str] | str | str | str |
| "1" | 65872 | [] | "T" | ["G"] | 44.18 | [] | ["SNP"] | "intergenic_region" | null | null |
| "1" | 69511 | [] | "A" | ["G"] | 2552.929932 | [] | ["SNP"] | "sequence_feature[transmembrane… | "OR4F5" | "protein_coding" |
| "1" | 762273 | [] | "G" | ["A"] | 19085.929688 | [] | ["SNP"] | "non_coding_exon_variant" | "LINC00115" | "lincRNA" |
| "1" | 866511 | [] | "C" | ["CCCCT"] | 3136.889893 | [] | ["Insertion"] | "intron_variant" | "SAMD11" | "protein_coding" |
| "1" | 876499 | [] | "A" | ["G"] | 3338.929932 | [] | ["SNP"] | "intron_variant" | "SAMD11" | "protein_coding" |
VCF/BCF sample genotype data#
For the htslib variant call formats, each variant call record is associated with an arbitrary number of so-called FORMAT fields that provide genotype-related information for each sample. Like INFO, these fields are defined in the header.
Using the samples and genotype_fields arguments, you can project any subset of samples as separate struct columns and project any subset of their associated genotype fields.
df = ox.from_vcf(
    "data/sample.vcf.gz",
    info_fields=[],
    samples=["NA12891", "NA12892"],
).pl()
df.head()
| chrom | pos | id | ref | alt | qual | filter | NA12891 | NA12892 |
|---|---|---|---|---|---|---|---|---|
| cat | i32 | list[str] | str | list[str] | f32 | list[str] | struct[6] | struct[6] |
| "1" | 65872 | [] | "T" | ["G"] | 44.18 | [] | {[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 439],18} | {[14, 2],16,21,{[0, 1],[true, true]},[21, 0, 437],18} |
| "1" | 69511 | [] | "A" | ["G"] | 2552.929932 | [] | {null,null,null,{[null, null],[false, false]},null,null} | {[0, 39],39,99,{[1, 1],[true, true]},[1289, 117, 0],null} |
| "1" | 762273 | [] | "G" | ["A"] | 19085.929688 | [] | {[0, 82],82,99,{[1, 1],[true, true]},[2952, 247, 0],127} | {[0, 68],68,99,{[1, 1],[true, true]},[2485, 204, 0],127} |
| "1" | 866511 | [] | "C" | ["CCCCT"] | 3136.889893 | [] | {[0, 13],13,37,{[1, 1],[true, true]},[512, 37, 0],26} | {[0, 9],9,27,{[1, 1],[true, true]},[402, 27, 0],26} |
| "1" | 876499 | [] | "A" | ["G"] | 3338.929932 | [] | {[0, 17],17,51,{[1, 1],[true, true]},[645, 51, 0],26} | {[0, 9],9,27,{[1, 1],[true, true]},[355, 27, 0],26} |
Each sample column is essentially a sub-dataframe of genotype fields.
df['NA12892'].struct.unnest().head()
| AD | DP | GQ | GT | PL | TP |
|---|---|---|---|---|---|
| list[i32] | i32 | i32 | struct[2] | list[i32] | i32 |
| [14, 2] | 16 | 21 | {[0, 1],[true, true]} | [21, 0, 437] | 18 |
| [0, 39] | 39 | 99 | {[1, 1],[true, true]} | [1289, 117, 0] | null |
| [0, 68] | 68 | 99 | {[1, 1],[true, true]} | [2485, 204, 0] | 127 |
| [0, 9] | 9 | 27 | {[1, 1],[true, true]} | [402, 27, 0] | 26 |
| [0, 9] | 9 | 27 | {[1, 1],[true, true]} | [355, 27, 0] | 26 |
You can also customize how sample genotype data are nested by using the genotype_by argument. By default (genotype_by="sample"), the columns are grouped first by sample name, then by genotype field name. By setting genotype_by="field", you can swap the nesting order to group columns first by genotype field name, then by sample name.
df = ox.from_vcf(
    "data/sample.vcf.gz",
    info_fields=[],
    samples=["NA12891", "NA12892"],
    genotype_fields=["AD", "DP", "GQ", "PL", "TP"],
    genotype_by="field",  # group by genotype field first, then by sample
).pl()
df.head()
| chrom | pos | id | ref | alt | qual | filter | AD | DP | GQ | PL | TP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cat | i32 | list[str] | str | list[str] | f32 | list[str] | struct[2] | struct[2] | struct[2] | struct[2] | struct[2] |
| "1" | 65872 | [] | "T" | ["G"] | 44.18 | [] | {[14, 2],[14, 2]} | {16,16} | {21,21} | {[21, 0, 439],[21, 0, 437]} | {18,18} |
| "1" | 69511 | [] | "A" | ["G"] | 2552.929932 | [] | {null,[0, 39]} | {null,39} | {null,99} | {null,[1289, 117, 0]} | {null,null} |
| "1" | 762273 | [] | "G" | ["A"] | 19085.929688 | [] | {[0, 82],[0, 68]} | {82,68} | {99,99} | {[2952, 247, 0],[2485, 204, 0]} | {127,127} |
| "1" | 866511 | [] | "C" | ["CCCCT"] | 3136.889893 | [] | {[0, 13],[0, 9]} | {13,9} | {37,27} | {[512, 37, 0],[402, 27, 0]} | {26,26} |
| "1" | 876499 | [] | "A" | ["G"] | 3338.929932 | [] | {[0, 17],[0, 9]} | {17,9} | {51,27} | {[645, 51, 0],[355, 27, 0]} | {26,26} |
In this case, each genotype field column is a data series containing the values of that field associated with each of the samples.
df['DP'].struct.unnest().head()
| NA12891 | NA12892 |
|---|---|
| i32 | i32 |
| 16 | 16 |
| null | 39 |
| 82 | 68 |
| 13 | 9 |
| 17 | 9 |
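The two nesting orders carry the same information. As a rough illustration with plain dictionaries standing in for struct columns, the hypothetical helper `swap_nesting` below converts the per-sample layout of a single record into the per-field layout. The values are taken from the first record in the tables above.

```python
from typing import Any, Dict

Nested = Dict[str, Dict[str, Any]]


def swap_nesting(by_sample: Nested) -> Nested:
    """Turn {sample: {field: value}} into {field: {sample: value}}."""
    by_field: Nested = {}
    for sample, fields in by_sample.items():
        for field, value in fields.items():
            by_field.setdefault(field, {})[sample] = value
    return by_field


# First record (chrom 1, pos 65872), restricted to two genotype fields
by_sample = {
    "NA12891": {"DP": 16, "PL": [21, 0, 439]},
    "NA12892": {"DP": 16, "PL": [21, 0, 437]},
}
by_field = swap_nesting(by_sample)
print(by_field["DP"])  # {'NA12891': 16, 'NA12892': 16}
```

Grouping by field is convenient when you want one genotype field across all samples as a small table, as with the `DP` column above.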
BED schemas#
Oxbow understands BEDn+m schema specifiers to interpret the contents of BED files.
ox.from_bed("data/sample.bed", bed_schema="bed3+").pl().head()
| chrom | start | end | rest |
|---|---|---|---|
| str | i64 | i64 | str |
| "chr1" | 1100001 | 1200000 | "A1 . . 1100000 1200000 226,56,… |
| "chr1" | 1550001 | 1600000 | "A1 . . 1550000 1600000 226,56,… |
| "chr1" | 1900001 | 2450000 | "A1 . . 1900000 2450000 226,56,… |
| "chr10" | 50001 | 250000 | "AB . . 50000 250000 94,189,62" |
| "chr10" | 250001 | 650000 | "A2 . . 250000 650000 247,130,0" |
ox.from_bed("data/sample.bed", bed_schema="bed3+6").pl().head()
| chrom | start | end | BED3+1 | BED3+2 | BED3+3 | BED3+4 | BED3+5 | BED3+6 |
|---|---|---|---|---|---|---|---|---|
| str | i64 | i64 | str | str | str | str | str | str |
| "chr1" | 1100001 | 1200000 | "A1" | "." | "." | "1100000" | "1200000" | "226,56,56" |
| "chr1" | 1550001 | 1600000 | "A1" | "." | "." | "1550000" | "1600000" | "226,56,56" |
| "chr1" | 1900001 | 2450000 | "A1" | "." | "." | "1900000" | "2450000" | "226,56,56" |
| "chr10" | 50001 | 250000 | "AB" | "." | "." | "50000" | "250000" | "94,189,62" |
| "chr10" | 250001 | 650000 | "A2" | "." | "." | "250000" | "650000" | "247,130,0" |
ox.from_bed("data/sample.bed", bed_schema="bed9").pl().head()
| chrom | start | end | name | score | strand | thickStart | thickEnd | itemRgb |
|---|---|---|---|---|---|---|---|---|
| str | i64 | i64 | str | u16 | cat | i64 | i64 | array[u8, 3] |
| "chr1" | 1100001 | 1200000 | "A1" | null | null | 1100000 | 1200000 | [226, 56, 56] |
| "chr1" | 1550001 | 1600000 | "A1" | null | null | 1550000 | 1600000 | [226, 56, 56] |
| "chr1" | 1900001 | 2450000 | "A1" | null | null | 1900000 | 2450000 | [226, 56, 56] |
| "chr10" | 50001 | 250000 | "AB" | null | null | 50000 | 250000 | [94, 189, 62] |
| "chr10" | 250001 | 650000 | "A2" | null | null | 250000 | 650000 | [247, 130, 0] |
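The three specifiers used above differ only in how columns after the first n are treated. The sketch below expands a specifier into column names that mirror the outputs shown in this section. It is an illustration, not Oxbow's parser: `bed_columns` is a hypothetical helper, and the names of fields 10 to 12 follow the UCSC BED description rather than any output shown here.

```python
import re
from typing import List

# Standard BED field names in order (BED12 uses all of them)
BED_FIELDS = [
    "chrom", "start", "end", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb",
    "blockCount", "blockSizes", "blockStarts",
]


def bed_columns(spec: str) -> List[str]:
    """Expand 'bed3+', 'bed3+6' or 'bed9' into a list of column names."""
    match = re.fullmatch(r"bed(\d+)(\+(\d*))?", spec.lower())
    if match is None:
        raise ValueError(f"not a BEDn+m specifier: {spec!r}")
    n = int(match.group(1))
    if not 3 <= n <= 12:
        raise ValueError("n must be between 3 and 12")
    columns = BED_FIELDS[:n]
    if match.group(2) is None:
        return columns                  # 'bedn': standard fields only
    if match.group(3) == "":
        return columns + ["rest"]       # 'bedn+': one catch-all text column
    m = int(match.group(3))
    return columns + [f"BED{n}+{i}" for i in range(1, m + 1)]


print(bed_columns("bed3+"))   # ['chrom', 'start', 'end', 'rest']
print(bed_columns("bed3+6"))
print(bed_columns("bed9"))
```

Choosing a larger n gives typed standard columns such as `itemRgb`, while `+m` keeps the extra columns as untyped strings.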
BigBed AutoSql#
Oxbow can also parse BigBed files that embed an AutoSql definition of their records.
ox.from_bigbed("data/autosql-sample.bb").pl().head()
| chrom | start | end | rest |
|---|---|---|---|
| str | u32 | u32 | str |
| "chr1" | 11868 | 14409 | "ENST00000456328.2 1000 + 11868… |
| "chr1" | 14403 | 29570 | "ENST00000488147.1 1000 - 14403… |
| "chr1" | 17368 | 17436 | "ENST00000619216.1 1000 - 17368… |
| "chr1" | 29553 | 31097 | "ENST00000473358.1 1000 + 29553… |
| "chr1" | 30365 | 30503 | "ENST00000607096.1 1000 + 30365… |
ox.from_bigbed("data/autosql-sample.bb", schema="autosql").pl().head()
| chrom | start | end | name | score | strand | thickStart | thickEnd | reserved | blockCount | blockSizes | chromStarts | name2 | cdsStartStat | cdsEndStat | exonFrames | type | geneName | geneName2 | geneType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u32 | u32 | str | u32 | str | u32 | u32 | u32 | i32 | list[i32] | list[i32] | str | str | str | list[i32] | str | str | str | str |
| "chr1" | 11868 | 14409 | "ENST00000456328.2" | 1000 | "+" | 11868 | 11868 | null | 3 | [359, 109, 1189] | [0, 744, 1352] | "DDX11L1" | "none" | "none" | [-1, -1, -1] | "none" | "ENST00000456328.2" | "DDX11L1" | "none" |
| "chr1" | 14403 | 29570 | "ENST00000488147.1" | 1000 | "-" | 14403 | 14403 | null | 11 | [98, 34, … 37] | [0, 601, … 15130] | "WASH7P" | "none" | "none" | [-1, -1, … -1] | "none" | "ENST00000488147.1" | "WASH7P" | "none" |
| "chr1" | 17368 | 17436 | "ENST00000619216.1" | 1000 | "-" | 17368 | 17368 | null | 1 | [68] | [0] | "MIR6859-2" | "none" | "none" | [-1] | "none" | "ENST00000619216.1" | "MIR6859-2" | "none" |
| "chr1" | 29553 | 31097 | "ENST00000473358.1" | 1000 | "+" | 29553 | 29553 | null | 3 | [486, 104, 122] | [0, 1010, 1422] | "MIR1302-11" | "none" | "none" | [-1, -1, -1] | "none" | "ENST00000473358.1" | "MIR1302-11" | "none" |
| "chr1" | 30365 | 30503 | "ENST00000607096.1" | 1000 | "+" | 30365 | 30365 | null | 1 | [138] | [0] | "MIR1302-9" | "none" | "none" | [-1] | "none" | "ENST00000607096.1" | "MIR1302-9" | "none" |
Zoom levels#
The UCSC BBI formats store multiple “zoom” or “reduction” levels. These are tables of fixed-resolution genomic bins containing summary statistics of the signal of a BigWig track or the interval coverage depth of a BigBed track.
ds = ox.from_bigwig("data/sample.bw")
ds.zoom_levels
[2621440, 10485760, 41943040]
ds.zoom(ds.zoom_levels[1]).regions("chr21").pl()
| chrom | start | end | bases_covered | min | max | sum | sum_squares |
|---|---|---|---|---|---|---|---|
| cat | u32 | u32 | u64 | f64 | f64 | f64 | f64 |
| "chr21" | 9486505 | 17408540 | 90 | 20.0 | 80.0 | 4000.0 | 224000.0 |
| "chr21" | 17829945 | 26140885 | 155 | 0.0 | 80.0 | 7900.0 | 470000.0 |
| "chr21" | 27133600 | 36015675 | 205 | 0.0 | 80.0 | 9000.0 | 472000.0 |
| "chr21" | 36097355 | 44412085 | 190 | 0.0 | 80.0 | 7200.0 | 376000.0 |
| "chr21" | 45704025 | 48129895 | 65 | 20.0 | 80.0 | 2800.0 | 148000.0 |
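The running sums in each zoom record are enough to derive per-bin statistics. The sketch below computes the mean and the population standard deviation from `bases_covered`, `sum`, and `sum_squares`. `summarize_bin` is a hypothetical helper, and the choice of the population formula (rather than the sample formula) is an assumption.

```python
import math


def summarize_bin(bases_covered: int, total: float, sum_squares: float) -> dict:
    """Derive mean and standard deviation from a zoom record's running sums."""
    if bases_covered == 0:
        return {"mean": math.nan, "std": math.nan}
    mean = total / bases_covered
    # Population variance via E[x^2] - E[x]^2, clamped against rounding error
    variance = max(sum_squares / bases_covered - mean * mean, 0.0)
    return {"mean": mean, "std": math.sqrt(variance)}


# First chr21 bin from the table above
stats = summarize_bin(bases_covered=90, total=4000.0, sum_squares=224000.0)
print(round(stats["mean"], 2), round(stats["std"], 2))  # 44.44 22.66
```

Because sums are additive, adjacent bins can be merged by adding their `bases_covered`, `sum`, and `sum_squares` values and taking the overall `min` and `max` before deriving statistics.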
Remote files and file-like objects#
Instead of file paths, the source and index inputs of a data source can be callables that open and return a binary I/O stream, i.e., any Python file-like object.
ds = ox.from_bam(
    lambda: open("sample.bam", "rb"),
    index=lambda: open("sample.bam.bai", "rb"),
)
This gives you the power to customize your own transports – to read remote sources, diverse file system implementations, or different file encodings – independently of oxbow itself.
Libraries like fsspec or smart_open can be used for this purpose.
from fsspec.implementations.cached import CachingFileSystem
from s3fs import S3FileSystem

# Read over HTTPS through a caching file system
url = "https://oxbow-ngs.s3.us-east-2.amazonaws.com/example.bam"
httpfs = CachingFileSystem(target_protocol="https")
ds = ox.from_bam(
    lambda: httpfs.open(url, "rb"),
    index=lambda: httpfs.open(url + ".bai", "rb"),
)

# Read from a public S3 bucket anonymously
s3fs = S3FileSystem(anon=True)
s3_uri = "s3://oxbow-ngs/example.bam"
ds = ox.from_bam(
    lambda: s3fs.open(s3_uri, "rb"),
    index=lambda: s3fs.open(s3_uri + ".bai", "rb"),
    tag_defs=[],
)
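One plausible reason for passing a callable rather than an already open file object is that each call can hand back a fresh, independent stream. The sketch below shows that pattern with in-memory bytes and no Oxbow involved. `make_opener` and `peek_magic` are hypothetical helpers, and the gzip-compressed payload merely stands in for a remote file.

```python
import gzip
import io
from typing import BinaryIO, Callable

# Stand-in for a remote object: gzip-compressed bytes held in memory
payload = gzip.compress(b"example payload")


def make_opener(data: bytes) -> Callable[[], BinaryIO]:
    """Return a zero-argument callable that opens a new stream on each call."""
    return lambda: io.BytesIO(data)


def peek_magic(opener: Callable[[], BinaryIO], n: int = 2) -> bytes:
    """Open a stream, read its first n bytes, and close it again."""
    with opener() as stream:
        return stream.read(n)


opener = make_opener(payload)
first = peek_magic(opener)
second = peek_magic(opener)  # a second call starts again from offset 0
print(first == second == b"\x1f\x8b")  # True: both reads see the gzip magic bytes
```

Any transport that can produce a seekable binary stream fits the same shape, which is what the fsspec and s3fs examples above rely on.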
Streams and Fragments#
An Oxbow data source streams data as a sequence of Arrow RecordBatches. The stream is exposed as an iterator, which you can use to materialize each batch manually.
ds = ox.from_bam("data/sample.bam", batch_size=100)
batch = next(ds.batches())
batch
pyarrow.RecordBatch
qname: string
flag: uint16
rname: dictionary<values=string, indices=int32, ordered=0>
pos: int32
mapq: uint8
cigar: string
rnext: dictionary<values=string, indices=int32, ordered=0>
pnext: int32
tlen: int32
seq: string
qual: string
end: int32
tags: struct<AM: int64, MD: string, NM: int64, RG: string, SM: int64, X0: int64, X1: int64, XA: string, XG: int64, XM: int64, XO: int64, XT: string>
child 0, AM: int64
child 1, MD: string
child 2, NM: int64
child 3, RG: string
child 4, SM: int64
child 5, X0: int64
child 6, X1: int64
child 7, XA: string
child 8, XG: int64
child 9, XM: int64
child 10, XO: int64
child 11, XT: string
----
qname: ["HWI-BRUNOP16X_0001:3:48:4861:11838#0","HWI-BRUNOP16X_0001:3:28:6650:168848#0","HWI-BRUNOP16X_0001:3:8:20066:88158#0","HWI-BRUNOP16X_0001:3:27:10302:58768#0","HWI-BRUNOP16X_0001:3:65:3144:143676#0","HWI-BRUNOP16X_0001:3:68:13088:156644#0","HWI-BRUNOP16X_0001:3:48:3417:101389#0","HWI-BRUNOP16X_0001:3:46:17583:95767#0","HWI-BRUNOP16X_0001:3:4:7989:14941#0","HWI-BRUNOP16X_0001:3:44:11450:50194#0"]
flag: [163,16,16,16,83,16,163,161,16,0]
rname: -- dictionary:
["chrY","chr20","chrX","chr13","chr22","chr10","chr6","chr19","chr14","chr18",...,"chr11","chr17","chr8","chr7","chr15","chr12","chr1","chr16","chr5","chr3"]-- indices:
[20,20,20,20,23,23,23,2,0,0]
pos: [10542,10546,946457,1014060,196957,196958,196961,503847,586185,587561]
mapq: [0,16,0,37,60,37,60,0,0,0]
cigar: ["50M","75M","75M","75M","50M","75M","50M","50M","75M","75M"]
rnext: -- dictionary:
["chrY","chr20","chrX","chr13","chr22","chr10","chr6","chr19","chr14","chr18",...,"chr11","chr17","chr8","chr7","chr15","chr12","chr1","chr16","chr5","chr3"]-- indices:
[20,null,null,null,23,null,23,11,null,null]
pnext: [10571,null,null,null,196008,null,319702,185365552,null,null]
tlen: [79,0,0,0,-999,0,122791,0,0,0]
seq: ["CGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG","ATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAGCTCCGCC","TAGTCCGAGGTCTCCTGAACCTTCCCAAGCAGCTGCTGCACCTGCCGGCAGTAGTTGGCCACCTTGCACTCCCGG","AGCTGAATGGGCAGGTCCCCCAGAAGATCGGCGTGCACGCCTTCCAGCAGCGTCTGGCTGTCCACCCGAGCGGTG","GTAACGCTCCCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGGCTCG","GACCCCCCCGGCCCCCGGCGCCCCCCCGCCCCGCCCCCGGGCGGGCGGGGGGGAGAAGGCGCCCGAGGGGAGGCG","GCTTACCGGACCCTGCGCGCCCCCGTCCCGGCTCCCGGCCGGCTCGGGGG","TTTTATTTTTTTTTTTGAGATGGAGTCTCGCTCTTGTCACCGAGGCTGGA","GTGCGATCTCGGTTCGCTGCAACCTCTGCTTCCCAGGTTCAAGTGATTCTCCGGCCTCAGCCTCCCAAGTAGCNN","NNTGCAGTGAGCTGAGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGGTAGACTGTGTCTCAAAAAAAAAAA"]
...
pl.from_arrow(batch)
| qname | flag | rname | pos | mapq | cigar | rnext | pnext | tlen | seq | qual | end | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u16 | cat | i32 | u8 | str | cat | i32 | i32 | str | str | i32 | struct[12] |
| "HWI-BRUNOP16X_0001:3:48:4861:1… | 163 | "chr1" | 10542 | 0 | "50M" | "chr1" | 10571 | 79 | "CGAAATCTGTGCAGAGGAGAACGCAGCTCC… | "gggggggggggggggggggggggggegggg… | 10591 | {0,"18C31",1,"brain_50_fcb",0,3,8,null,0,1,0,"82"} |
| "HWI-BRUNOP16X_0001:3:28:6650:1… | 16 | "chr1" | 10546 | 16 | "75M" | null | null | 0 | "ATCTGTGCAGAGGAGAACGCAGCTCCGCCC… | "fggggggggdgdggcdfggggfgggggggg… | 10620 | {null,"14C52A7",2,"brain_75_fca",null,1,5,null,0,2,0,"85"} |
| "HWI-BRUNOP16X_0001:3:8:20066:8… | 16 | "chr1" | 946457 | 0 | "75M" | null | null | 0 | "TAGTCCGAGGTCTCCTGAACCTTCCCAAGC… | "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB… | 946531 | {null,"2T0G5T65",3,"brain_75_fca",null,2,0,"2,-131443143,75M,3;",0,3,0,"82"} |
| "HWI-BRUNOP16X_0001:3:27:10302:… | 16 | "chr1" | 1014060 | 37 | "75M" | null | null | 0 | "AGCTGAATGGGCAGGTCCCCCAGAAGATCG… | "BBBBBBBBBBBBBBBBcYRcffggfgf_gf… | 1014134 | {null,"7G1C4A2A57",4,"brain_75_fca",null,1,0,null,0,4,0,"85"} |
| "HWI-BRUNOP16X_0001:3:65:3144:1… | 83 | "chr3" | 196957 | 60 | "50M" | "chr3" | 196008 | -999 | "GTAACGCTCCCGGACCCTGCGCGCCCCCGT… | "BBBBBBBBBBBBBB_TTSSS[[Obbd`]e^… | 197006 | {37,"0C0A0G1G0C0T1A41",7,"brain_50_fcb",37,1,0,null,0,7,0,"85"} |
| "HWI-BRUNOP16X_0001:3:68:13088:… | 16 | "chr3" | 196958 | 37 | "75M" | null | null | 0 | "GACCCCCCCGGCCCCCGGCGCCCCCCCGCC… | "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB… | 197032 | {null,"0A0G0A0G1T0T0A1C1G0A3T4G6T4G1T2C3C3T0C27",19,"brain_75_fca",null,1,0,null,0,19,0,"85"} |
| "HWI-BRUNOP16X_0001:3:48:3417:1… | 163 | "chr3" | 196961 | 60 | "50M" | "chr3" | 319702 | 122791 | "GCTTACCGGACCCTGCGCGCCCCCGTCCCG… | "gggggggggggggggggggggggfdagggg… | 197010 | {37,"50",0,"brain_50_fcb",37,1,0,null,0,0,0,"85"} |
| "HWI-BRUNOP16X_0001:3:46:17583:… | 161 | "chrX" | 503847 | 0 | "50M" | "chr4" | 185365552 | 0 | "TTTTATTTTTTTTTTTGAGATGGAGTCTCG… | "ddfdfd____dffff]__aeZ]\XZSPSNS… | 503896 | {0,"4T36C8",2,"brain_50_fcb",0,18,174,null,0,2,0,"82"} |
| "HWI-BRUNOP16X_0001:3:4:7989:14… | 16 | "chrY" | 586185 | 0 | "75M" | null | null | 0 | "GTGCGATCTCGGTTCGCTGCAACCTCTGCT… | "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB… | 586259 | {null,"4C10A10C16C29T0G0",6,"brain_75_fca",null,2,2,"X,-586185,75M,6;3,+196723225,75M,7;19,+13666092,75M,7;",0,6,0,"82"} |
| "HWI-BRUNOP16X_0001:3:44:11450:… | 0 | "chrY" | 587561 | 0 | "75M" | null | null | 0 | "NNTGCAGTGAGCTGAGATTGTGCCACTGCA… | "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB… | 587635 | {null,"0G0G48T0G23",4,"brain_75_fca",null,6,54,null,0,4,0,"82"} |
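Because the stream is an ordinary iterator, you can also aggregate over it without ever holding the whole file in memory. The following is a minimal sketch, not part of Oxbow itself: `summarize_mapq` is a hypothetical helper, and `_DemoBatch` and `_DemoColumn` are stand-ins that mimic only the `column(name).to_pylist()` calls of a `pyarrow.RecordBatch`, so the example runs without a BAM file.

```python
from typing import Any, Dict, Iterable, List


def summarize_mapq(batches: Iterable[Any], min_mapq: int = 30) -> Dict[str, int]:
    """Count all records and well-mapped records, one batch at a time."""
    total = 0
    well_mapped = 0
    for batch in batches:
        # Only the current batch's mapq values are held in memory.
        mapq = batch.column("mapq").to_pylist()
        total += len(mapq)
        well_mapped += sum(1 for q in mapq if q is not None and q >= min_mapq)
    return {"total": total, "well_mapped": well_mapped}


class _DemoColumn:
    """Hypothetical stand-in for a pyarrow array."""

    def __init__(self, values: List[int]) -> None:
        self._values = values

    def to_pylist(self) -> List[int]:
        return list(self._values)


class _DemoBatch:
    """Hypothetical stand-in for a pyarrow.RecordBatch."""

    def __init__(self, **columns: List[int]) -> None:
        self._columns = columns

    def column(self, name: str) -> _DemoColumn:
        return _DemoColumn(self._columns[name])


# The ten mapq values printed above, split into two stand-in batches of five.
demo_batches = [
    _DemoBatch(mapq=[0, 16, 0, 37, 60]),
    _DemoBatch(mapq=[37, 60, 0, 0, 0]),
]
summary = summarize_mapq(demo_batches)
print(summary)
```

With a real data source, the same helper should accept the iterator directly, for example `summarize_mapq(ox.from_bam("data/sample.bam", batch_size=100).batches())`. The threshold of 30 is an arbitrary choice for illustration.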
A data source is logically divided into one or more fragments. Without random access, it contains only a single fragment.
ds = ox.from_bam("data/sample.bam")
ds.fragments()
[<oxbow._pyarrow.BatchReaderFragment at 0x717deb7836f0>]
When you register range queries, each query is mapped to its own fragment, and each fragment generates an independent stream of record batches.
ds = ox.from_bam("data/sample.bam").regions(["chr1", "chr3", "chrX"])
ds.fragments()
[<oxbow._pyarrow.BatchReaderFragment at 0x717deb783a80>,
<oxbow._pyarrow.BatchReaderFragment at 0x717deb779eb0>,
<oxbow._pyarrow.BatchReaderFragment at 0x717df01fb460>]
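If you want to look up the fragment for a particular region, one option is to pair the region list with the fragment list. This is a sketch under an assumption the text above does not state explicitly: that `fragments()` returns fragments in the same order in which the regions were registered. `fragments_by_region` is a hypothetical helper, and the strings below are stand-ins for real fragment objects.

```python
from typing import Dict, Sequence, TypeVar

F = TypeVar("F")


def fragments_by_region(regions: Sequence[str], fragments: Sequence[F]) -> Dict[str, F]:
    """Pair each region with a fragment, ASSUMING both are in the same order."""
    if len(regions) != len(fragments):
        raise ValueError(
            f"expected one fragment per region, got {len(fragments)} "
            f"fragments for {len(regions)} regions"
        )
    return dict(zip(regions, fragments))


regions = ["chr1", "chr3", "chrX"]
# Stand-ins; with Oxbow these would come from
# ox.from_bam("data/sample.bam").regions(regions).fragments()
demo_fragments = ["fragment-0", "fragment-1", "fragment-2"]

lookup = fragments_by_region(regions, demo_fragments)
print(lookup["chr3"])
```

The length check guards against the case where the one-fragment-per-query mapping does not hold, so a mismatch fails loudly instead of silently mislabeling fragments.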
Dask data frames#
Dask takes a different approach from the streaming paradigm of Polars and DuckDB: it subdivides a dataset into a known number of independently accessible logical partitions, each of which is expected to fit in memory. When you convert an Oxbow data source into a Dask data frame, Oxbow maps fragments to partitions:
df = (
    ox.from_bam("data/sample.bam")
    .regions(["chr1", "chrX", "chrY"])
    .dd()  # or to_dask()
)
df
| | qname | flag | rname | pos | mapq | cigar | rnext | pnext | tlen | seq | qual | end | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=3 | | | | | | | | | | | | | |
| | string | uint16 | category[known] | int32 | uint8 | string | category[known] | int32 | int32 | string | string | int32 | string |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
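Since each partition is expected to fit in memory, you can process partitions one at a time rather than computing the whole data frame. The sketch below counts rows per region that way. It assumes partition order follows the order of the registered regions, which the text does not guarantee. `rows_per_region` is a hypothetical helper, and `_DemoFrame` and `_DemoPartition` are stand-ins that mimic only the `npartitions`, `partitions[i]`, and `compute()` members of a Dask data frame, with arbitrary row counts.

```python
from typing import Any, Dict, List, Sequence


def rows_per_region(df: Any, regions: Sequence[str]) -> Dict[str, int]:
    """Count rows per region, computing a single partition at a time."""
    if df.npartitions != len(regions):
        raise ValueError("expected exactly one partition per region")
    counts: Dict[str, int] = {}
    for i, region in enumerate(regions):
        # Only partition i is materialized here; the others stay lazy.
        counts[region] = len(df.partitions[i].compute())
    return counts


class _DemoPartition:
    """Hypothetical stand-in for a single-partition Dask data frame."""

    def __init__(self, rows: List[str]) -> None:
        self._rows = rows

    def compute(self) -> List[str]:
        return list(self._rows)


class _DemoFrame:
    """Hypothetical stand-in for a Dask data frame."""

    def __init__(self, partitions: List[_DemoPartition]) -> None:
        self.partitions = partitions
        self.npartitions = len(partitions)


# Arbitrary stand-in rows, not real counts from the sample file.
demo_df = _DemoFrame([
    _DemoPartition(["r1", "r2", "r3", "r4"]),
    _DemoPartition(["r5"]),
    _DemoPartition(["r6", "r7"]),
])
counts = rows_per_region(demo_df, ["chr1", "chrX", "chrY"])
print(counts)
```

With the real `df` defined above, the call would be `rows_per_region(df, ["chr1", "chrX", "chrY"])`.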
df.partitions[1].compute()
| | qname | flag | rname | pos | mapq | cigar | rnext | pnext | tlen | seq | qual | end | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HWI-BRUNOP16X_0001:3:46:17583:95767#0 | 161 | chrX | 503847 | 0 | 50M | chr4 | 185365552 | 0 | TTTTATTTTTTTTTTTGAGATGGAGTCTCGCTCTTGTCACCGAGGC... | ddfdfd____dffff]__aeZ]\XZSPSNSSSSSSbbaabZ_``BB... | 503896 | {'AM': 0, 'MD': '4T36C8', 'NM': 2, 'RG': 'brai... |