oxbow.core.PyBamScanner
- class oxbow.core.PyBamScanner(src, compressed=True, fields=None, tag_defs=None)
A BAM file scanner.
- Parameters:
src (str or file-like) – The path to the BAM file or a file-like object.
compressed (bool, optional [default: True]) – Whether the source is BGZF-compressed.
fields (str or list[str] or None, optional [default: "*"]) – Standard SAM fields to include. "*" for all, None to omit, or a list of field names.
tag_defs (list[tuple[str, str]], optional [default: None]) – Tag definitions for the "tags" struct column. None omits the tags column. Use the tag_defs() method to discover definitions.
- __init__()
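The tag_defs parameter suggests a two-step construction: open a scanner, sniff the tag definitions, then open a second scanner with them. The following is a minimal sketch of that flow, not a documented recipe. The helper name open_with_tags and its scanner_cls argument are hypothetical, added so the flow can be tried with a stand-in class; only the constructor keywords and the tag_defs() method come from this page.

```python
def open_with_tags(src, scanner_cls=None, scan_rows=1024):
    """Open `src` once to sniff tags, then again with the tags column enabled.

    `scanner_cls` is a hypothetical injection point; by default the real
    oxbow.core.PyBamScanner is imported lazily.
    """
    if scanner_cls is None:
        from oxbow.core import PyBamScanner as scanner_cls  # needs oxbow installed

    # First pass: tag_defs is left as None, so the "tags" column is omitted.
    probe = scanner_cls(src)
    tag_defs = probe.tag_defs(scan_rows)

    # Second pass: request all standard fields plus the discovered tags.
    return scanner_cls(src, fields="*", tag_defs=tag_defs)
```

Passing fields="*" explicitly avoids relying on the default value of fields.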
Methods
__init__()
chrom_names() – Return the names of the reference sequences.
chrom_sizes() – Return the names of the reference sequences and their lengths in bp.
field_names() – Return the names of the standard SAM fields.
model() – Return the string representation of the alignment model.
scan([columns, batch_size, limit]) – Scan batches of records from the file.
scan_byte_ranges(byte_ranges[, columns, ...]) – Scan batches of records from specified byte ranges in the file.
scan_query(region[, index, columns, ...]) – Scan batches of records from a genomic range query on a BGZF-encoded file.
scan_unmapped([index, columns, batch_size, ...]) – Scan batches of records from the set of unaligned reads.
scan_virtual_ranges(vpos_ranges[, columns, ...]) – Scan batches of records from virtual position ranges in a BGZF file.
schema() – Return the Arrow schema.
tag_defs([scan_rows]) – Discover tag definitions by sniffing scan_rows records.
- chrom_names()
Return the names of the reference sequences.
- chrom_sizes()
Return the names of the reference sequences and their lengths in bp.
- field_names()
Return the names of the standard SAM fields.
- model()
Return the string representation of the alignment model.
- scan(columns=None, batch_size=1024, limit=None)
Scan batches of records from the file.
- Parameters:
columns (list[str], optional) – Names of the top-level columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, records are scanned until EOF.
- Returns:
An iterator yielding Arrow record batches.
- Return type:
arro3 RecordBatchReader (pycapsule)
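As a rough illustration of how batch_size and limit shape the stream, the sketch below counts the rows in the first few batches. It assumes the returned reader can be iterated batch by batch and that each batch exposes a num_rows attribute. The helper head_row_counts is hypothetical and works on any such iterable.

```python
from itertools import islice


def head_row_counts(batches, max_batches=3):
    """Return the row counts of the first `max_batches` batches."""
    return [batch.num_rows for batch in islice(batches, max_batches)]


# Hypothetical usage, assuming oxbow is installed and "sample.bam" exists:
#   from oxbow.core import PyBamScanner
#   reader = PyBamScanner("sample.bam").scan(batch_size=512, limit=2000)
#   print(head_row_counts(reader))
```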
- scan_byte_ranges(byte_ranges, columns=None, batch_size=1024, limit=None)
Scan batches of records from specified byte ranges in the file.
The byte positions must align with record boundaries.
- Parameters:
byte_ranges (list[tuple[int, int]]) – List of (start, end) byte position tuples to read from.
columns (list[str], optional) – Names of the top-level columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, all records in the specified ranges are scanned.
- Returns:
An iterator yielding Arrow record batches.
- Return type:
arro3 RecordBatchReader (pycapsule)
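This page does not say whether byte_ranges must be sorted or non-overlapping. As a precaution, a caller might tidy the list first. The helper below is a hypothetical sketch that validates the pairs, sorts them, and merges ranges that overlap or touch. Because merged endpoints are always taken from the input, ranges that were aligned to record boundaries stay aligned.

```python
def normalize_byte_ranges(ranges):
    """Validate (start, end) pairs, sort them, and merge overlapping or touching ones."""
    cleaned = []
    for start, end in sorted(ranges):
        if start < 0 or end < start:
            raise ValueError(f"invalid byte range: ({start}, {end})")
        if cleaned and start <= cleaned[-1][1]:
            # Overlaps or touches the previous range: extend it.
            cleaned[-1] = (cleaned[-1][0], max(cleaned[-1][1], end))
        else:
            cleaned.append((start, end))
    return cleaned


print(normalize_byte_ranges([(100, 200), (0, 50), (150, 300)]))
# [(0, 50), (100, 300)]
```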
- scan_query(region, index=None, columns=None, batch_size=1024, limit=None)
Scan batches of records from a genomic range query on a BGZF-encoded file.
This operation requires an index file.
- Parameters:
region (str) – Genomic region in the format “chr:start-end”.
index (path or file-like, optional) – The index file to use for querying the region. If None and the source was provided as a path, we will attempt to load the index from the same path with an additional extension.
columns (list[str], optional) – Names of the top-level columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, all records intersecting the query range are scanned.
- Returns:
An iterator yielding Arrow record batches.
- Return type:
arro3 RecordBatchReader (pycapsule)
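The region argument is a plain string, so it can help to build and check it programmatically. The two helpers below are hypothetical and cover only the "chr:start-end" form described above. They make no claim about the coordinate convention or about other region syntaxes the scanner may accept.

```python
def format_region(chrom, start, end):
    """Build a "chr:start-end" region string."""
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    return f"{chrom}:{start}-{end}"


def parse_region(region):
    """Split a "chr:start-end" string into (chrom, start, end)."""
    # Split on the last colon so that contig names containing ":" survive.
    chrom, sep, span = region.rpartition(":")
    if not sep or not chrom:
        raise ValueError(f"not a chr:start-end region: {region!r}")
    start, dash, end = span.partition("-")
    if not dash:
        raise ValueError(f"missing start-end span: {region!r}")
    return chrom, int(start), int(end)


print(parse_region(format_region("chr1", 1000, 2000)))
# ('chr1', 1000, 2000)
```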
- scan_unmapped(index=None, columns=None, batch_size=1024, limit=None)
Scan batches of records from the set of unaligned reads.
This operation requires an index file.
- Parameters:
index (path or file-like, optional) – The index file to use for locating the unaligned reads. If None and the source was provided as a path, we will attempt to load the index from the same path with an additional extension.
columns (list[str], optional) – Names of the top-level columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, records are scanned until EOF.
- Returns:
An iterator yielding Arrow record batches.
- Return type:
arro3 RecordBatchReader (pycapsule)
- scan_virtual_ranges(vpos_ranges, columns=None, batch_size=1024, limit=None)
Scan batches of records from virtual position ranges in a BGZF file.
The virtual positions must align with record boundaries. That means that the compressed offset must point to the beginning of a BGZF block and the uncompressed offset must point to the beginning or end of a record decoded within the block.
- Parameters:
vpos_ranges (list[tuple[vpos, vpos]]) – List of virtual position ranges as pairs. Each virtual position can be given as either a packed virtual position (int), or an unpacked tuple of ints (c, u) specifying the compressed and uncompressed offsets, respectively.
columns (list[str], optional) – Names of the top-level columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, all records in the specified ranges are scanned.
- Returns:
An iterator yielding Arrow record batches.
- Return type:
arro3 RecordBatchReader (pycapsule)
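To relate the packed and unpacked forms of a virtual position, the sketch below converts between them. It assumes the packed integer uses the standard BGZF layout from the SAM specification: the compressed offset in the upper 48 bits and the uncompressed offset in the lower 16 bits. The helper names pack_vpos and unpack_vpos are hypothetical.

```python
def pack_vpos(coffset, uoffset):
    """Pack a (compressed, uncompressed) offset pair into one BGZF virtual position."""
    if not 0 <= uoffset < (1 << 16):
        raise ValueError("uncompressed offset must fit in 16 bits")
    if not 0 <= coffset < (1 << 48):
        raise ValueError("compressed offset must fit in 48 bits")
    return (coffset << 16) | uoffset


def unpack_vpos(vpos):
    """Split a packed virtual position into (compressed, uncompressed) offsets."""
    return vpos >> 16, vpos & 0xFFFF


print(pack_vpos(1, 2))     # 65538
print(unpack_vpos(65538))  # (1, 2)
```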
- schema()
Return the Arrow schema.
- Return type:
arro3 Schema (pycapsule)
- tag_defs(scan_rows=1024)
Discover tag definitions by sniffing scan_rows records.
The reader stream is reset to its original position after scanning.
- Parameters:
scan_rows (int, optional [default: 1024]) – The number of records to scan. If None, all records are scanned.
- Returns:
A list of tag definitions, where each definition is a tuple of the tag name and the SAM tag type code.
- Return type:
list[tuple[str, str]]
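Since the result is a plain list of (name, type code) tuples, it can be filtered before being passed back to the constructor as tag_defs. The sketch below keeps only a chosen subset of tags. The helper select_tags is hypothetical and the sample definitions are illustrative values, not output from a real file.

```python
def select_tags(tag_defs, wanted):
    """Keep only the tag definitions whose name is in `wanted`, preserving order."""
    wanted = set(wanted)
    return [(name, code) for name, code in tag_defs if name in wanted]


# Illustrative values only; real definitions come from scanner.tag_defs().
discovered = [("NM", "i"), ("MD", "Z"), ("AS", "i")]
print(select_tags(discovered, ["NM", "AS"]))
# [('NM', 'i'), ('AS', 'i')]
```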