oxbow.core.PyGffScanner

class oxbow.core.PyGffScanner(src, compressed=False, fields=None, attribute_defs=None)

A GFF file scanner.

Parameters:

src (str or file-like) – The path to the GFF file or a file-like object.
compressed (bool, optional [default: False]) – Whether the source is BGZF-compressed.
fields (list[str], optional) – Names of the fixed fields to project.
attribute_defs (list[tuple[str, str]], optional [default: None]) – Definitions for the "attributes" struct column, given as (name, type) tuples. If None, the attributes column is omitted. Use the attribute_defs() method to discover definitions.

Methods

`__init__`()
`attribute_defs`([scan_rows])	Discover attribute definitions by sniffing scan_rows records.
`field_names`()	Return the names of the fixed fields.
`scan`([columns, batch_size, limit])	Scan batches of records from the file.
`scan_byte_ranges`(byte_ranges[, columns, ...])	Scan batches of records from specified byte ranges in the file.
`scan_query`(region[, index, columns, ...])	Scan batches of records from a genomic range query on a BGZF-encoded file.
`scan_virtual_ranges`(vpos_ranges[, columns, ...])	Scan batches of records from virtual position ranges in a BGZF file.
`schema`()	Return the Arrow schema.

attribute_defs(scan_rows=1024)

Discover attribute definitions by sniffing scan_rows records.

The reader stream is reset to its original position after scanning.

Parameters:

scan_rows (int, optional [default: 1024]) – The number of records to scan. If None, all records are scanned.

Returns:

A list of attribute definitions, where each definition is a tuple of the attribute name and its type (String or Array).

Return type:

list[tuple[str, str]]
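
To illustrate what sniffing attribute definitions means, here is a minimal pure-Python sketch. It is not oxbow's implementation. It assumes GFF3-style attribute strings (`tag=value` pairs separated by `;`) and assumes that a tag with comma-separated values maps to Array while any other tag maps to String. The function name `sniff_attribute_defs` is hypothetical.

```python
def sniff_attribute_defs(attribute_strings, scan_rows=1024):
    """Infer (name, type) pairs from GFF3-style attribute column strings.

    Illustration only: the String/Array rule used here is an assumption,
    not a description of how oxbow decides the type.
    """
    defs = {}  # dicts keep insertion order: name -> "String" or "Array"
    for i, attrs in enumerate(attribute_strings):
        # scan_rows=None means "look at every record"
        if scan_rows is not None and i >= scan_rows:
            break
        for pair in attrs.strip().split(";"):
            if "=" not in pair:
                continue  # skip empty or malformed entries
            name, value = pair.split("=", 1)
            name = name.strip()
            kind = "Array" if "," in value else "String"
            # Once a tag has been seen with several values, keep it as Array.
            if defs.get(name) != "Array":
                defs[name] = kind
    return list(defs.items())
```

A list in this shape, list[tuple[str, str]], is what the attribute_defs constructor parameter is documented to accept.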

scan(columns=None, batch_size=1024, limit=None)

Scan batches of records from the file.

Parameters:

columns (list[str], optional) – Names of the columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, records are scanned until EOF.

Return type:

arro3 RecordBatchReader (pycapsule)
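
As a rough usage sketch, the helper below drives scan() with the documented keyword arguments and counts the records it yields. The helper name `count_records` is hypothetical. It assumes the returned reader can be iterated batch by batch and that each batch exposes a `num_rows` attribute, neither of which this page states.

```python
def count_records(scanner, columns=None, batch_size=1024, limit=None):
    """Count records produced by scanner.scan().

    Assumptions (not confirmed by this page): the returned reader is
    iterable and each yielded batch has a `num_rows` attribute.
    """
    reader = scanner.scan(columns=columns, batch_size=batch_size, limit=limit)
    total = 0
    for batch in reader:
        total += batch.num_rows
    return total
```

Here `scanner` would be an already constructed PyGffScanner. Passing a short `columns` list keeps each batch narrow when only a count is needed.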

scan_byte_ranges(byte_ranges, columns=None, batch_size=1024, limit=None)

Scan batches of records from specified byte ranges in the file.

Parameters:

byte_ranges (list[tuple[int, int]]) – List of (start, end) byte position tuples to read from.
columns (list[str], optional) – Names of the columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan.

Return type:

arro3 RecordBatchReader (pycapsule)

scan_query(region, index=None, columns=None, batch_size=1024, limit=None)

Scan batches of records from a genomic range query on a BGZF-encoded file.

Parameters:

region (str) – Genomic region in the format “chr:start-end”.
index (path or file-like, optional) – The index file to use for querying the region.
columns (list[str], optional) – Names of the columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan.

Return type:

arro3 RecordBatchReader (pycapsule)
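
The region string can be validated before it is passed to scan_query. The sketch below is a hypothetical helper, not part of oxbow. It only checks the documented "chr:start-end" shape and makes no assumption about whether coordinates are 0-based or 1-based, which this page does not state.

```python
def parse_region(region):
    """Split a 'chr:start-end' string into (name, start, end).

    Splits on the last ':' so that sequence names containing ':' survive.
    Raises ValueError if the string does not have the documented shape.
    """
    name, sep, interval = region.rpartition(":")
    start_text, dash, end_text = interval.partition("-")
    if not sep or not dash:
        raise ValueError(f"expected 'chr:start-end', got {region!r}")
    return name, int(start_text), int(end_text)
```

For example, `parse_region("chr1:1000-2000")` yields the name `"chr1"` and the integers 1000 and 2000.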

scan_virtual_ranges(vpos_ranges, columns=None, batch_size=1024, limit=None)

Scan batches of records from virtual position ranges in a BGZF file.

Parameters:

vpos_ranges (list[tuple[vpos, vpos]]) – List of virtual position ranges, each given as a (start, end) pair.
columns (list[str], optional) – Names of the columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan.

Return type:

arro3 RecordBatchReader (pycapsule)
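
For background, a BGZF virtual position is defined in the SAM/BAM specification as a 64-bit value whose upper 48 bits hold the byte offset of a compressed block in the file and whose lower 16 bits hold the offset within that block's uncompressed data. The sketch below packs and unpacks that layout. The helper names are hypothetical, and whether scan_virtual_ranges accepts plain integers in this packed form is an assumption this page does not confirm.

```python
def make_vpos(compressed_offset, uncompressed_offset):
    """Pack a BGZF virtual position from its two components."""
    if not 0 <= compressed_offset < (1 << 48):
        raise ValueError("compressed offset must fit in 48 bits")
    if not 0 <= uncompressed_offset < (1 << 16):
        raise ValueError("uncompressed offset must fit in 16 bits")
    return (compressed_offset << 16) | uncompressed_offset


def split_vpos(vpos):
    """Unpack a virtual position into (compressed, uncompressed) offsets."""
    return vpos >> 16, vpos & 0xFFFF
```

Because the block offset occupies the high bits, comparing two virtual positions as integers orders them by their position in the file.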

schema()

Return the Arrow schema.