oxbow.core.PyGtfScanner

class oxbow.core.PyGtfScanner(src, compressed=False)

A GTF file scanner.

Parameters:
  • src (str or file-like) – The path to the GTF file or a file-like object.

  • compressed (bool, optional [default: False]) – Whether the source is BGZF-compressed. If None, it is assumed to be uncompressed.
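
As a rough illustration of choosing the compressed flag, the sketch below inspects the first bytes of a file for a BGZF block header before constructing the scanner. The helper names looks_like_bgzf and open_gtf_scanner are hypothetical and not part of oxbow. The check assumes the "BC" extra subfield is the first subfield in the gzip header, which holds for files written by common tools but is not guaranteed by the format.

```python
BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip member header with the FEXTRA flag set


def looks_like_bgzf(path):
    """Heuristically check whether a file starts with a BGZF block header."""
    with open(path, "rb") as fh:
        header = fh.read(14)
    # Assumption: the "BC" subfield comes first in the gzip extra field,
    # so its two identifier bytes sit at offsets 12 and 13.
    return (
        len(header) == 14
        and header[:4] == BGZF_MAGIC
        and header[12:14] == b"BC"
    )


def open_gtf_scanner(path):
    """Create a PyGtfScanner, setting `compressed` from the file contents."""
    # Imported lazily so the detection helper works without oxbow installed.
    from oxbow.core import PyGtfScanner

    return PyGtfScanner(str(path), compressed=looks_like_bgzf(path))
```

A plain gzip file (not BGZF) is reported as uncompressed by this heuristic, so such files would need to be recompressed or handled separately.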

__init__()

Methods

  • __init__()

  • attribute_defs([scan_rows]) – Discover attribute definitions by sniffing scan_rows records.

  • field_names() – Return the names of the fixed fields.

  • scan([fields, attribute_defs, batch_size, limit]) – Scan batches of records from the file.

  • scan_query(region[, index, fields, ...]) – Scan batches of records from a genomic range query on a BGZF-encoded file.

  • schema([fields, attribute_defs]) – Return the Arrow schema.

attribute_defs(scan_rows=1024)

Discover attribute definitions by sniffing scan_rows records.

The reader stream is reset to its original position after scanning.

Parameters:

scan_rows (int, optional [default: 1024]) – The number of records to scan. If None, all records are scanned.

Returns:

A list of attribute definitions, where each definition is a tuple of the attribute name and its type (always String for GTF).

Return type:

list[tuple[str, str]]
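
To make the idea of "sniffing" concrete, here is a pure-Python sketch that collects attribute names from the ninth column of the first scan_rows records of a GTF text stream. It is an illustration only and is not oxbow's implementation; the function name sniff_attribute_defs and the regular expression are assumptions made for this example.

```python
import re

# Matches `key "value";` or `key value;` pairs in the GTF attributes column.
_ATTR_RE = re.compile(r'\s*(\S+)\s+(?:"[^"]*"|[^;]*);?')


def sniff_attribute_defs(lines, scan_rows=1024):
    """Return (name, "String") pairs in first-seen order."""
    names = {}  # dicts preserve insertion order
    seen = 0
    for line in lines:
        # Skip comment/header lines and blank lines without counting them.
        if line.startswith("#") or not line.strip():
            continue
        if scan_rows is not None and seen >= scan_rows:
            break
        seen += 1
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # malformed record: no attributes column
        for match in _ATTR_RE.finditer(columns[8]):
            names.setdefault(match.group(1), "String")
    return list(names.items())
```

Because only the first scan_rows records are inspected, attributes that first appear later in the file are missed; passing scan_rows=None trades speed for completeness.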

field_names()

Return the names of the fixed fields.

scan(fields=None, attribute_defs=None, batch_size=1024, limit=None)

Scan batches of records from the file.

Parameters:
  • fields (list[str], optional) – Names of the fixed fields to project.

  • attribute_defs (list[tuple[str, str]], optional) – Definitions of attribute fields to project.

  • batch_size (int, optional [default: 1024]) – The number of records to include in each batch.

  • limit (int, optional) – The maximum number of records to scan. If None, records are scanned until EOF.

Returns:

A PyCapsule stream iterator for the record batches.

Return type:

pyo3_arrow.PyRecordBatchReader
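
The following sketch shows one way to combine field_names() with scan() so that a mistyped field name fails early with a readable message. The wrapper scan_projected is hypothetical; it relies only on the two methods documented here and forwards any remaining keyword arguments (attribute_defs, batch_size, limit) unchanged.

```python
def scan_projected(scanner, wanted_fields, **scan_kwargs):
    """Validate a projection against field_names(), then call scan()."""
    available = list(scanner.field_names())
    unknown = [name for name in wanted_fields if name not in available]
    if unknown:
        raise ValueError(
            f"unknown fixed fields {unknown}; available fields: {available}"
        )
    # Remaining keyword arguments are passed through to scan() as-is.
    return scanner.scan(fields=list(wanted_fields), **scan_kwargs)
```

The returned object is the PyCapsule stream described above. Assuming a PyArrow version that provides RecordBatchReader.from_stream, it can likely be consumed with pyarrow.RecordBatchReader.from_stream(stream) and then read batch by batch or collected with read_all().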

scan_query(region, index=None, fields=None, attribute_defs=None, batch_size=1024, limit=None)

Scan batches of records from a genomic range query on a BGZF-encoded file.

This operation requires an index file.

Parameters:
  • region (str) – Genomic region in the format “chr:start-end”.

  • index (path or file-like, optional) – The index file to use for querying the region. If None and the source was provided as a path, the scanner attempts to load the index from the source path with an additional extension appended.

  • fields (list[str], optional) – Names of the fixed fields to project.

  • attribute_defs (list[tuple[str, str]], optional) – Definitions of attribute fields to project.

  • batch_size (int, optional [default: 1024]) – The number of records to include in each batch.

  • limit (int, optional) – The maximum number of records to scan. If None, all records matching the query are scanned.

Returns:

A PyCapsule stream iterator for the record batches.

Return type:

pyo3_arrow.PyRecordBatchReader
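
Since the region is passed as a string, it can help to validate it before issuing a query. The sketch below checks the documented "chr:start-end" format and returns its parts. The helper parse_region is hypothetical: oxbow parses the region itself and may accept other forms, so this only enforces the format stated above.

```python
def parse_region(region):
    """Split "chr:start-end" into (chrom, start, end) or raise ValueError."""
    # rpartition tolerates contig names that themselves contain ":".
    chrom, sep, interval = region.rpartition(":")
    if not sep or not chrom:
        raise ValueError(f"expected 'chr:start-end', got {region!r}")
    start_text, dash, end_text = interval.partition("-")
    if not dash:
        raise ValueError(f"expected 'chr:start-end', got {region!r}")
    # int() raises ValueError for empty or non-numeric coordinates.
    start, end = int(start_text), int(end_text)
    if start < 0 or end < start:
        raise ValueError(f"invalid interval in {region!r}")
    return chrom, start, end
```

A caller might run parse_region(region) first and then pass the original string to scan_query(region, ...), so that malformed input is reported before any file or index access.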

schema(fields=None, attribute_defs=None)

Return the Arrow schema.

Parameters:
  • fields (list[str], optional) – Names of the fixed fields to project.

  • attribute_defs (list[tuple[str, str]], optional) – Definitions of attribute fields to project.

Return type:

pyo3_arrow.PySchema