oxbow.core.PyBedScanner

class oxbow.core.PyBedScanner(src, bed_schema, compressed=False, fields=None, coords=None)

A BED file scanner.

Parameters:
  • src (str or file-like) – The path to the BED file or a file-like object.

  • bed_schema (str, list[tuple[str, str]], or dict[str, str]) – The BED schema. Can be a specifier string (e.g., “bed6+3”), a list of (name, type) tuples, or a dict mapping names to types.

  • compressed (bool, optional [default: False]) – Whether the source is BGZF-compressed.

  • fields (list[str], optional) – Names of the BED fields to include in the schema.

  • coords (Literal["01", "11"], optional [default: "01"]) – Coordinate system for returning positions and interpreting query ranges. “01” for 0-based half-open, “11” for 1-based closed.
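The two coords conventions describe the same interval with different start values. The conversion can be sketched in plain Python, independent of oxbow; the helper names below are hypothetical and are not part of the library.

```python
def zero_based_to_one_based(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based half-open interval [start, end) to 1-based closed [start, end]."""
    return start + 1, end


def one_based_to_zero_based(start: int, end: int) -> tuple[int, int]:
    """Convert a 1-based closed interval [start, end] to 0-based half-open [start, end)."""
    return start - 1, end


# The first 100 bases of a chromosome in each convention.
half_open = (0, 100)                           # "01": 0-based, half-open
closed = zero_based_to_one_based(*half_open)   # "11": 1-based, closed
print(closed)  # (1, 100)

# The interval length is end - start in "01" and end - start + 1 in "11".
length_01 = half_open[1] - half_open[0]
length_11 = closed[1] - closed[0] + 1
print(length_01, length_11)  # 100 100
```

Only the start coordinate changes between the two conventions; the end coordinate is numerically the same.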

__init__()

Methods

  • __init__()

  • field_names() – Return the names of the BED fields.

  • scan([columns, batch_size, limit]) – Scan batches of records from the file.

  • scan_byte_ranges(byte_ranges[, columns, ...]) – Scan batches of records from specified byte ranges in the file.

  • scan_query(region[, index, columns, ...]) – Scan batches of records from a genomic range query on a BGZF-encoded file.

  • scan_virtual_ranges(vpos_ranges[, columns, ...]) – Scan batches of records from virtual position ranges in a BGZF file.

  • schema() – Return the Arrow schema.

field_names()

Return the names of the BED fields.

scan(columns=None, batch_size=1024, limit=None)

Scan batches of records from the file.

Parameters:
  • columns (list[str], optional) – Names of the columns to project.

  • batch_size (int, optional [default: 1024]) – The number of records to include in each batch.

  • limit (int, optional) – The maximum number of records to scan. If None, records are scanned until EOF.

Returns:

An iterator yielding Arrow record batches.

Return type:

arro3 RecordBatchReader (pycapsule)
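How batch_size and limit interact can be modeled in plain Python. This is a sketch of the documented semantics, not oxbow's implementation: iter_batches is a hypothetical helper, and it assumes that limit caps the total number of records across all batches and that the final batch may be shorter than batch_size.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def iter_batches(records: Iterable, batch_size: int = 1024,
                 limit: Optional[int] = None) -> Iterator[list]:
    """Yield lists of up to batch_size records, stopping after limit records."""
    it = iter(records)
    if limit is not None:
        it = islice(it, limit)  # cap the total number of records consumed
    while True:
        batch = list(islice(it, batch_size))
        if not batch:  # source exhausted (EOF) or limit reached
            return
        yield batch


records = range(2500)  # stand-in for 2500 BED records
sizes_all = [len(b) for b in iter_batches(records)]
sizes_limited = [len(b) for b in iter_batches(records, limit=1500)]
print(sizes_all)      # [1024, 1024, 452]
print(sizes_limited)  # [1024, 476]
```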

scan_byte_ranges(byte_ranges, columns=None, batch_size=1024, limit=None)

Scan batches of records from specified byte ranges in the file.

The byte positions must align with record boundaries.

Parameters:
  • byte_ranges (list[tuple[int, int]]) – List of (start, end) byte position tuples to read from.

  • columns (list[str], optional) – Names of the columns to project.

  • batch_size (int, optional [default: 1024]) – The number of records to include in each batch.

  • limit (int, optional) – The maximum number of records to scan. If None, all records in the specified ranges are scanned.

Returns:

An iterator yielding Arrow record batches.

Return type:

arro3 RecordBatchReader (pycapsule)
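One way to obtain byte ranges that align with record boundaries in an uncompressed BED file is to note the offset at which each line starts. The sketch below does this on an in-memory file. line_aligned_ranges is a hypothetical helper, not part of oxbow, and it assumes one record per newline-terminated line, no header lines, and that each (start, end) pair is end-exclusive.

```python
import io


def line_aligned_ranges(f, lines_per_range: int) -> list[tuple[int, int]]:
    """Split a binary file into (start, end) byte ranges that fall on line boundaries."""
    ranges = []
    start = f.tell()
    count = 0
    # Use readline() rather than file iteration so that tell() stays accurate.
    for _line in iter(f.readline, b""):
        count += 1
        if count == lines_per_range:
            end = f.tell()
            ranges.append((start, end))
            start, count = end, 0
    if count:  # trailing group with fewer than lines_per_range lines
        ranges.append((start, f.tell()))
    return ranges


bed = io.BytesIO(
    b"chr1\t0\t100\n"
    b"chr1\t150\t300\n"
    b"chr2\t10\t20\n"
)
byte_ranges = line_aligned_ranges(bed, lines_per_range=2)
print(byte_ranges)  # [(0, 24), (24, 35)]
```

A list in this shape matches the list[tuple[int, int]] type that byte_ranges expects.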

scan_query(region, index=None, columns=None, batch_size=1024, limit=None)

Scan batches of records from a genomic range query on a BGZF-encoded file.

This operation requires an index file.

Parameters:
  • region (str) – Genomic range string in the format “chr:start-end”, “chr:[start,end]” or “chr:[start,end)”.

  • index (path or file-like, optional) – The index file to use for querying the region. If None and the source was provided as a path, the scanner attempts to load the index from the same path with an additional extension.

  • columns (list[str], optional) – Names of the columns to project.

  • batch_size (int, optional [default: 1024]) – The number of records to include in each batch.

  • limit (int, optional) – The maximum number of records to scan. If None, all records intersecting the query range are scanned.

Returns:

An iterator yielding Arrow record batches.

Return type:

arro3 RecordBatchReader (pycapsule)
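The three region notations can be told apart with a regular expression. The sketch below is illustrative only: parse_region is a hypothetical helper and does not reproduce oxbow's parser. It extracts the components and records which notation was used. It does not decide how the plain "chr:start-end" form is interpreted, which the class documentation ties to the coords setting, and it assumes chromosome names contain no colon.

```python
import re

_REGION = re.compile(
    r"""
    ^(?P<chrom>[^:]+):
    (?:
        (?P<s1>\d+)-(?P<e1>\d+)                      # chr:start-end
      | \[(?P<s2>\d+),(?P<e2>\d+)(?P<close>[\])])    # chr:[start,end] or chr:[start,end)
    )$
    """,
    re.VERBOSE,
)


def parse_region(region: str) -> tuple[str, int, int, str]:
    """Return (chrom, start, end, notation) for a region string."""
    m = _REGION.match(region)
    if m is None:
        raise ValueError(f"unrecognized region string: {region!r}")
    if m["s1"] is not None:
        return m["chrom"], int(m["s1"]), int(m["e1"]), "plain"
    notation = "closed" if m["close"] == "]" else "half-open"
    return m["chrom"], int(m["s2"]), int(m["e2"]), notation


print(parse_region("chr1:1000-2000"))    # ('chr1', 1000, 2000, 'plain')
print(parse_region("chr1:[1000,2000]"))  # ('chr1', 1000, 2000, 'closed')
print(parse_region("chr1:[1000,2000)"))  # ('chr1', 1000, 2000, 'half-open')
```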

scan_virtual_ranges(vpos_ranges, columns=None, batch_size=1024, limit=None)

Scan batches of records from virtual position ranges in a BGZF file.

The virtual positions must align with record boundaries: the compressed offset must point to the beginning of a BGZF block, and the uncompressed offset must point to the beginning or end of a record decoded within that block.

Parameters:
  • vpos_ranges (list[tuple[vpos, vpos]]) – List of virtual position ranges as pairs. Each virtual position can be given as either a packed virtual position (int), or an unpacked tuple of ints (c, u) specifying the compressed and uncompressed offsets, respectively.

  • columns (list[str], optional) – Names of the columns to project.

  • batch_size (int, optional [default: 1024]) – The number of records to include in each batch.

  • limit (int, optional) – The maximum number of records to scan. If None, all records in the specified ranges are scanned.

Returns:

An iterator yielding Arrow record batches.

Return type:

arro3 RecordBatchReader (pycapsule)
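The packed and unpacked forms of a virtual position are related by the standard BGZF layout: the compressed block offset occupies the upper 48 bits and the offset within the uncompressed block occupies the lower 16 bits. A minimal sketch of the conversion follows; pack_vpos and unpack_vpos are hypothetical helper names, and the offsets are made-up values for illustration.

```python
def pack_vpos(coffset: int, uoffset: int) -> int:
    """Pack (compressed offset, uncompressed offset) into a 64-bit BGZF virtual position."""
    if not 0 <= coffset < (1 << 48):
        raise ValueError("compressed offset must fit in 48 bits")
    if not 0 <= uoffset < (1 << 16):
        raise ValueError("uncompressed offset must fit in 16 bits")
    return (coffset << 16) | uoffset


def unpack_vpos(vpos: int) -> tuple[int, int]:
    """Split a packed virtual position into (compressed offset, uncompressed offset)."""
    return vpos >> 16, vpos & 0xFFFF


# The same range written with unpacked tuples and with packed ints.
start, end = (0, 0), (4096, 128)  # illustrative offsets only
unpacked_ranges = [(start, end)]
packed_ranges = [(pack_vpos(*start), pack_vpos(*end))]
print(packed_ranges)  # [(0, 268435584)]
```

Either list follows the shape described for vpos_ranges above.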

schema()

Return the Arrow schema.

Return type:

arro3 Schema (pycapsule)