oxbow.core.PyFastaScanner


class oxbow.core.PyFastaScanner(src, compressed=False)

A FASTA file scanner.

Parameters:

src (str or file-like) – The path to the FASTA file or a file-like object.
compressed (bool, optional [default: False]) – Whether the source is BGZF-compressed.
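The `compressed` flag has to match the actual encoding of the source. As a minimal sketch, the flag could be chosen by sniffing the first bytes of the file for the BGZF header layout (gzip magic with the extra-field flag set and a `BC` subfield). The helper names `looks_like_bgzf` and `open_scanner` are hypothetical and not part of oxbow, and the check is only a heuristic.

```python
# Fixed start of a BGZF block: gzip magic (1f 8b), deflate (08), FEXTRA flag (04).
BGZF_MAGIC = b"\x1f\x8b\x08\x04"


def looks_like_bgzf(header):
    """Heuristic check of the first 14 bytes of a file for a BGZF header.

    Assumes the 'BC' extra subfield comes first, as in typical BGZF files.
    """
    return (
        len(header) >= 14
        and header[:4] == BGZF_MAGIC
        and header[12:14] == b"BC"
    )


def open_scanner(path):
    """Hypothetical helper: build a PyFastaScanner with a sniffed flag."""
    # Imported lazily so the sniffing helper works without oxbow installed.
    from oxbow.core import PyFastaScanner

    with open(path, "rb") as fh:
        header = fh.read(14)
    return PyFastaScanner(path, compressed=looks_like_bgzf(header))
```

Calling `open_scanner("genome.fa.gz")` (a placeholder path) would then pass `compressed=True` only when the header matches.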


Methods

`__init__`()
`field_names`()	Return the names of the fixed fields.
`scan`([fields, batch_size, limit])	Scan the source as record batches.
`scan_query`(regions[, index, gzi, fields, ...])	Scan sequence slices as record batches from a list of genomic ranges.
`schema`([fields])	Return the Arrow schema.

field_names()

Return the names of the fixed fields.

scan(fields=None, batch_size=1, limit=None)

Scan the source as record batches.

Parameters:

fields (list[str], optional) – Names of the fixed fields to project.
batch_size (int, optional [default: 1]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, records are scanned until EOF.

Returns:

An iterator yielding Arrow record batches.

Return type:

arro3 RecordBatchReader (pycapsule)

Notes

Since reference sequences are often large, the default batch size is set to 1.
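A minimal sketch of a full scan might look like the following. It assumes that pyarrow is installed and that the returned reader exports the Arrow C stream PyCapsule interface, so that `pyarrow.RecordBatchReader.from_stream` can import it. The helpers `write_demo_fasta` and `scan_to_table` and the demo file contents are illustrative only.

```python
import pathlib
import tempfile

# Tiny two-record FASTA used only for illustration.
DEMO_FASTA = ">chr1\nACGTACGT\nACGT\n>chr2\nGGCC\n"


def write_demo_fasta(directory):
    """Write the demo FASTA into `directory` and return its path."""
    path = pathlib.Path(directory) / "demo.fa"
    path.write_text(DEMO_FASTA)
    return path


def scan_to_table(path, fields=None, batch_size=1, limit=None):
    """Scan a FASTA file and collect all batches into a pyarrow Table."""
    # Lazy imports: oxbow and pyarrow are only needed when this is called.
    import pyarrow as pa
    from oxbow.core import PyFastaScanner

    scanner = PyFastaScanner(str(path), compressed=False)
    stream = scanner.scan(fields=fields, batch_size=batch_size, limit=limit)
    # Assumption: `stream` implements __arrow_c_stream__, so pyarrow can
    # consume it without copying through Python objects.
    return pa.RecordBatchReader.from_stream(stream).read_all()
```

With the default `batch_size=1`, each record arrives in its own batch, which keeps peak memory close to the size of the largest single sequence.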

scan_query(regions, index=None, gzi=None, fields=None, batch_size=1024)

Scan sequence slices as record batches from a list of genomic ranges.

This operation requires an index file.

Parameters:

regions (list[str]) – Genomic ranges in the format “chr:start-end”.
index (path or file-like, optional) – The FAI index file to use for slicing the reference sequences. If None and the source was provided as a path, the scanner attempts to load the index from the same path with an additional extension appended.
gzi (path or file-like, optional) – A GZI index file to use if the source is BGZF-encoded.
fields (list[str], optional) – Names of the fixed fields to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
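The region strings described above can be built from tuples before calling `scan_query`. The sketch below uses two hypothetical helpers, `format_region` and `query_regions`, that are not part of oxbow. It passes coordinates through unchanged, since the coordinate convention is not stated here, and it assumes that supplying a GZI index implies a BGZF-compressed source.

```python
def format_region(chrom, start, end):
    """Format one genomic range as "chr:start-end"."""
    if start > end:
        raise ValueError(f"start ({start}) is greater than end ({end})")
    return f"{chrom}:{start}-{end}"


def query_regions(path, intervals, index=None, gzi=None, batch_size=1024):
    """Hypothetical wrapper: scan slices for (chrom, start, end) tuples."""
    # Imported lazily so format_region works without oxbow installed.
    from oxbow.core import PyFastaScanner

    regions = [format_region(*interval) for interval in intervals]
    # Assumption: a GZI index is only supplied for BGZF-compressed sources.
    scanner = PyFastaScanner(path, compressed=gzi is not None)
    return scanner.scan_query(
        regions, index=index, gzi=gzi, batch_size=batch_size
    )
```

For example, `format_region("chr1", 100, 200)` produces `"chr1:100-200"`, which matches the expected format.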

Returns:

An iterator yielding Arrow record batches.

Return type:

arro3 RecordBatchReader (pycapsule)

Notes

An FAI index is required to slice the reference sequences. If the source is BGZF-compressed, an additional GZI index is also required. The GZI index is used to translate uncompressed positions (from the FAI index) into compressed positions in the BGZF file.
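The two lookups described in this note can be sketched in pure Python. The first part follows the standard FAI columns (length, offset, bases per line, bytes per line) to locate a base in the uncompressed file. The second part maps that uncompressed offset onto a BGZF block using a list of (compressed offset, uncompressed offset) pairs. This is an illustration of the idea rather than oxbow's implementation, and the function names and in-memory index layout are assumptions.

```python
from bisect import bisect_right


def build_fai(fasta_bytes):
    """Return {name: (length, offset, linebases, linewidth)} for FASTA bytes."""
    index = {}
    pos = 0
    name = None
    for line in fasta_bytes.splitlines(keepends=True):
        if line.startswith(b">"):
            name = line[1:].split()[0].decode()
            # Sequence data starts right after the header line.
            index[name] = [0, pos + len(line), 0, 0]
        elif name is not None and line.strip():
            entry = index[name]
            bases = len(line.rstrip(b"\r\n"))
            if entry[0] == 0:
                # The first sequence line fixes the line geometry.
                entry[2], entry[3] = bases, len(line)
            entry[0] += bases
        pos += len(line)
    return {key: tuple(value) for key, value in index.items()}


def base_offset(entry, pos0):
    """Byte offset in the uncompressed file of 0-based base `pos0`."""
    length, offset, linebases, linewidth = entry
    if not 0 <= pos0 < length:
        raise IndexError(pos0)
    return offset + (pos0 // linebases) * linewidth + pos0 % linebases


def to_compressed(gzi_pairs, uoffset):
    """Map an uncompressed offset to (block start, offset within block).

    `gzi_pairs` is a sorted list of (compressed, uncompressed) block starts;
    the block at (0, 0) is added here in case the list omits it.
    """
    blocks = [(0, 0)] + list(gzi_pairs)
    i = bisect_right([u for _, u in blocks], uoffset) - 1
    coffset, ustart = blocks[i]
    return coffset, uoffset - ustart
```

A reader would seek to the returned block start in the compressed file, inflate that block, and skip the within-block offset to reach the requested base.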

schema(fields=None)

Return the Arrow schema.

Parameters:

fields (list[str], optional) – Names of the fixed fields to project.

Return type:

arro3 Schema (pycapsule)
