Coordinate conventions

In practice, there are two commonly used conventions for the numerical representation of genomic intervals by their bounding base coordinates and the semantics of their inclusion or exclusion. These conventions are often informally referred to as “coordinate systems”. Mixing the two is one of the most common sources of off-by-one bugs in bioinformatics pipelines because genomic file formats and tools use or expect different conventions.

Oxbow lets you pick the coordinate convention for the output in Arrow batches and lets you specify the interpretation of input query ranges, independently of the format on disk.

Notation

SAM, VCF, and GFF use 1-based, fully-closed intervals; BED, BigBed, and BigWig use 0-based, half-open intervals. To make matters worse, some binary formats like BAM and BCF use 0-based representations internally while most tools surface those values as 1-based (but some don’t).

A helpful observation is that, numerically, the two conventions differ only in the encoding of the start coordinate: the end coordinate is the same number in both. Although it can be conceptually misleading, a very useful mnemonic terminology for distinguishing them is "0-based start, 1-based end" versus "1-based start, 1-based end".
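As a minimal sketch of that observation, the two helpers below convert a single interval between the conventions. The function names are hypothetical and are not part of Oxbow; they only illustrate that the start shifts by one while the end stays put.

```python
def to_closed(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based half-open interval to 1-based closed."""
    return start + 1, end


def to_half_open(start: int, end: int) -> tuple[int, int]:
    """Convert a 1-based closed interval to 0-based half-open."""
    return start - 1, end


# The first 100 bases of a sequence, written in each convention.
half_open = (0, 100)
closed = to_closed(*half_open)

print(closed)                 # (1, 100)
print(to_half_open(*closed))  # (0, 100)

# The interval length is end - start only in the half-open form;
# the closed form needs end - start + 1.
print(half_open[1] - half_open[0])   # 100
print(closed[1] - closed[0] + 1)     # 100
```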

Oxbow uses a compact two-character notation for coordinate conventions based on the mnemonic terminology, where the first character is the base of the start coordinate and the second is the “base” of the end coordinate:

| Code | Informal name | Format name | Bracket notation | Native to |
|---|---|---|---|---|
| "01" | 0-based start, 1-based end | 0-based, half-open | [start, end) | BED, BigBed, BigWig |
| "11" | 1-based start, 1-based end | 1-based, closed | [start, end] | SAM, BAM, CRAM, VCF, BCF, GFF, GTF † |

† as returned by htslib-based tools and by noodles

Each format defaults to its own native convention, so by default Oxbow does not change anything. You only need to think about coordinate systems if you want output that differs from the format’s native convention — for example, normalizing every source to "01" before joining alignment records against BED features.
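To see why normalizing before a join matters, here is a small pure-Python sketch that does not use Oxbow at all. The `START_BASE` table, `convert`, and `overlaps` are hypothetical helpers introduced only for illustration; the codes "01" and "11" are the ones from the table above.

```python
# Base of the start coordinate for each convention code.
START_BASE = {"01": 0, "11": 1}


def convert(start: int, end: int, src: str, dst: str) -> tuple[int, int]:
    """Re-encode an interval from convention `src` to convention `dst`."""
    return start - START_BASE[src] + START_BASE[dst], end


def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """Overlap test that is only valid when both intervals are "01"."""
    return a[0] < b[1] and b[0] < a[1]


alignment_11 = (101, 150)  # 1-based closed: bases 101..150
feature_01 = (150, 200)    # 0-based half-open: bases 151..200 when counted from 1

# Mixing conventions: treating both tuples as closed reports a false overlap.
naive = alignment_11[0] <= feature_01[1] and feature_01[0] <= alignment_11[1]
print(naive)  # True

# Normalizing the alignment to "01" first gives the correct answer.
alignment_01 = convert(*alignment_11, src="11", dst="01")
print(alignment_01)                        # (100, 150)
print(overlaps(alignment_01, feature_01))  # False
```

The two intervals are adjacent but do not share a base, which is exactly the kind of case where a mixed-convention comparison goes wrong.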

1. Output via the coords argument

Every Python data source factory accepts a coords keyword argument:

import oxbow as ox

# Native: BAM positions are emitted 1-based.
ds = ox.from_bam("data/sample.bam")
ds.pl().select("rname", "pos", "end").head()

# Coerce BAM positions to 0-based half-open to match BED tracks.
ds = ox.from_bam("data/sample.bam", coords="01")
ds.pl().select("rname", "pos", "end").head()

# Native: BED positions are emitted 0-based.
ds = ox.from_bed("data/sample.bed")

# Coerce BED positions to 1-based closed to match SAM/VCF.
ds = ox.from_bed("data/sample.bed", coords="11")

Only the start column changes; end coordinates are the same in either system.

2. Input via region queries

Representations of query regions also need to be interpreted according to a convention. Oxbow accepts two notations: implicit UCSC-style notation and explicit bracket notation.

UCSC-style notation: use DataSource’s convention

Familiar chr1:10000-20000 style, with optional , or _ thousands separators:

ds.regions("chr1:10,000-20,000")
ds.regions("chr1:10_000-20_000")
ds.regions("chr1")               # whole chromosome

This notation is ambiguous: chr1:10000-20000 could mean either convention depending on context, so Oxbow needs to know which one you meant. The rule is:

An oxbow DataSource interprets the region according to its own coords setting. So from_bam("...", coords="01").regions("chr1:10000-20000") treats the region as 0-based half-open, matching the output it will produce.
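The rule can be pictured with a small stand-alone parser. This is only a sketch under assumptions: `parse_ucsc` is a hypothetical helper, not Oxbow's actual parser, and it normalizes every region to 0-based half-open so that the effect of the `coords` setting is visible.

```python
import re
from typing import Optional, Tuple

# chrom, optionally followed by :start-end with "," or "_" separators.
_UCSC = re.compile(r"^(?P<chrom>[^:]+)(?::(?P<start>[\d,_]+)-(?P<end>[\d,_]+))?$")


def parse_ucsc(
    region: str, coords: str = "11"
) -> Tuple[str, Optional[int], Optional[int]]:
    """Parse a UCSC-style region and return it as 0-based half-open.

    `coords` says how the numbers in the string should be interpreted.
    """
    if coords not in ("01", "11"):
        raise ValueError(f"unknown coordinate convention: {coords!r}")
    m = _UCSC.match(region)
    if m is None:
        raise ValueError(f"not a UCSC-style region: {region!r}")
    if m["start"] is None:
        return m["chrom"], None, None  # whole chromosome
    start = int(m["start"].replace(",", "").replace("_", ""))
    end = int(m["end"].replace(",", "").replace("_", ""))
    if coords == "11":
        start -= 1  # only the start differs between conventions
    return m["chrom"], start, end


# The same string selects different bases depending on the convention.
print(parse_ucsc("chr1:10,000-20,000", coords="01"))  # ('chr1', 10000, 20000)
print(parse_ucsc("chr1:10_000-20_000", coords="11"))  # ('chr1', 9999, 20000)
print(parse_ucsc("chr1"))                             # ('chr1', None, None)
```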

Bracket notation: explicit, self-describing

If you want the region to mean the same thing regardless of context, use bracket notation. The brackets carry the coordinate convention in the string itself:

ds.regions("chr1:[10000,20000)")   # 0-based half-open
ds.regions("chr1:[10001,20000]")   # 1-based closed (same interval as above)

Bracket notation overrides any coords setting. Only _ is accepted as a thousands separator in this form (since , separates start from end):

ds.regions("chr1:[10_000,20_000)")

This is the recommended notation when a region is constructed somewhere far from the scanner that consumes it — for example, written into a config file, a CLI flag, or a pipeline manifest — because it is unambiguous on its own.
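To illustrate why bracket notation is self-describing, the sketch below reads the convention from the closing bracket alone. `parse_bracket` is a hypothetical helper written for this example, not Oxbow's implementation; it returns 0-based half-open coordinates so that the two spellings of the same interval can be compared directly.

```python
import re
from typing import Tuple

# chrom:[start,end) for 0-based half-open, chrom:[start,end] for 1-based closed.
_BRACKET = re.compile(
    r"^(?P<chrom>[^:]+):\[(?P<start>[\d_]+),(?P<end>[\d_]+)(?P<close>[\)\]])$"
)


def parse_bracket(region: str) -> Tuple[str, int, int]:
    """Parse a bracket-notation region and return it as 0-based half-open."""
    m = _BRACKET.match(region)
    if m is None:
        raise ValueError(f"not a bracket-notation region: {region!r}")
    start = int(m["start"].replace("_", ""))
    end = int(m["end"].replace("_", ""))
    if m["close"] == "]":
        start -= 1  # closed form uses a 1-based start
    return m["chrom"], start, end


# Both spellings describe the same bases, with no coords setting needed.
print(parse_bracket("chr1:[10000,20000)"))    # ('chr1', 10000, 20000)
print(parse_bracket("chr1:[10001,20000]"))    # ('chr1', 10000, 20000)
print(parse_bracket("chr1:[10_000,20_000)"))  # ('chr1', 10000, 20000)
```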

Tip

A practical convention: stick to UCSC notation when you’re typing regions interactively next to the scanner that uses them, and switch to bracket notation in any code that needs to be portable across formats.