oxbow.core.PyBamScanner
- class oxbow.core.PyBamScanner(src, compressed=True)
A BAM file scanner.
- Parameters:
src (str or file-like) – The path to the BAM file or a file-like object.
compressed (bool, optional [default: True]) – Whether the source is BGZF-compressed.
- __init__()
Methods
__init__()
chrom_names() – Return the names of the reference sequences.
chrom_sizes() – Return the names of the reference sequences and their lengths in bp.
field_names() – Return the names of the fixed fields.
scan([fields, tag_defs, batch_size, limit]) – Scan batches of records from the file.
scan_byte_ranges(byte_ranges[, fields, ...]) – Scan batches of records from specified byte ranges in the file.
scan_query(region[, index, fields, ...]) – Scan batches of records from a genomic range query on a BGZF-encoded file.
scan_unmapped([index, fields, tag_defs, ...]) – Scan batches of records from the set of unaligned reads.
scan_virtual_ranges(vpos_ranges[, fields, ...]) – Scan batches of records from virtual position ranges in a BGZF file.
schema([fields, tag_defs]) – Return the Arrow schema.
tag_defs([scan_rows]) – Discover tag definitions by sniffing scan_rows records.
- chrom_names()
Return the names of the reference sequences.
- chrom_sizes()
Return the names of the reference sequences and their lengths in bp.
- field_names()
Return the names of the fixed fields.
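Because the scan methods take a `fields` list of fixed-field names, it can help to check a projection against `field_names()` before scanning. The following is a minimal sketch, not part of oxbow: `check_fields` is a hypothetical helper, and it assumes only that `field_names()` returns an iterable of strings.

```python
def check_fields(scanner, requested):
    """Return `requested` as a list if every name is a known fixed field.

    `scanner` is assumed to expose `field_names()` as documented above.
    This helper is illustrative and is not part of oxbow.
    """
    known = set(scanner.field_names())
    unknown = [name for name in requested if name not in known]
    if unknown:
        raise ValueError(f"unknown fixed fields: {unknown}")
    return list(requested)
```

The returned list can then be passed as the `fields` argument of `scan` or `schema`.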
- scan(fields=None, tag_defs=None, batch_size=1024, limit=None)
Scan batches of records from the file.
- Parameters:
fields (list[str], optional) – Names of the fixed fields to project.
tag_defs (list[tuple[str, str]], optional) – Definitions of tag fields to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, records are scanned until EOF.
- Returns:
An iterator yielding Arrow record batches.
- Return type:
arro3 RecordBatchReader (pycapsule)
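As an illustration of how `batch_size` and `limit` combine, the sketch below requests only the first `n` records. `scan_head` is a hypothetical wrapper, not an oxbow function; it relies only on the `scan` signature documented above and works with any object that exposes it.

```python
def scan_head(scanner, n, fields=None, batch_size=1024):
    """Request at most `n` records from `scanner`.

    The batch size is capped at `n` so that a small preview does not
    ask for batches larger than the number of records wanted.
    Illustrative helper; not part of oxbow.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return scanner.scan(fields=fields, batch_size=min(batch_size, n), limit=n)
```

The result is whatever `scan` returns, which this page describes as an arro3 RecordBatchReader exposed as a pycapsule, so an Arrow-compatible library is needed to consume it.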
- scan_byte_ranges(byte_ranges, fields=None, tag_defs=None, batch_size=1024, limit=None)
Scan batches of records from specified byte ranges in the file.
The byte positions must align with record boundaries.
- Parameters:
byte_ranges (list[tuple[int, int]]) – List of (start, end) byte position tuples to read from.
fields (list[str], optional) – Names of the fixed fields to project.
tag_defs (list[tuple[str, str]], optional) – Definitions of tag fields to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, all records in the specified ranges are scanned.
- Returns:
An iterator yielding Arrow record batches.
- Return type:
arro3 RecordBatchReader (pycapsule)
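When byte ranges come from several sources they may overlap or arrive unsorted. This page does not say how `scan_byte_ranges` treats such input, so one cautious option is to sort and merge the ranges first. The sketch below uses a hypothetical helper, `coalesce_ranges`, and assumes that each range is a `(start, end)` pair with `start <= end` and that touching ranges can be merged safely.

```python
def coalesce_ranges(ranges):
    """Sort (start, end) pairs and merge any that overlap or touch.

    Merging keeps only original start and end positions as boundaries,
    so ranges that were aligned to record boundaries stay aligned.
    Illustrative helper; not part of oxbow.
    """
    merged = []
    for start, end in sorted(ranges):
        if start > end:
            raise ValueError(f"invalid range: ({start}, {end})")
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```

The merged list has the `list[tuple[int, int]]` shape that `byte_ranges` expects.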
- scan_query(region, index=None, fields=None, tag_defs=None, batch_size=1024, limit=None)
Scan batches of records from a genomic range query on a BGZF-encoded file.
This operation requires an index file.
- Parameters:
region (str) – Genomic region in the format “chr:start-end”.
index (path or file-like, optional) – The index file to use for querying the region. If None and the source was provided as a path, we will attempt to load the index from the same path with an additional extension.
fields (list[str], optional) – Names of the fixed fields to project.
tag_defs (list[tuple[str, str]], optional) – Definitions of tag fields to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, all records intersecting the query range are scanned.
- Returns:
An iterator yielding Arrow record batches.
- Return type:
arro3 RecordBatchReader (pycapsule)
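To query a whole genome window by window, region strings in the "chr:start-end" format can be generated from the output of `chrom_sizes()`. The sketch below is an assumption-laden illustration: `tile_regions` is a hypothetical helper, coordinates are taken to be 1-based and inclusive (the samtools convention, which this page does not confirm), and `chrom_sizes` is assumed to be a mapping or a sequence of `(name, length)` pairs.

```python
def tile_regions(chrom_sizes, window):
    """Yield "chr:start-end" strings covering each reference in windows.

    Assumes 1-based, inclusive coordinates. `chrom_sizes` may be a dict
    or a list of (name, length) pairs. Illustrative; not part of oxbow.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    for name, length in dict(chrom_sizes).items():
        for start in range(1, length + 1, window):
            end = min(start + window - 1, length)
            yield f"{name}:{start}-{end}"
```

Each generated string can be passed as the `region` argument, for example `scanner.scan_query(region)` inside a loop over `tile_regions(scanner.chrom_sizes(), 1_000_000)`.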
- scan_unmapped(index=None, fields=None, tag_defs=None, batch_size=1024, limit=None)
Scan batches of records from the set of unaligned reads.
This operation requires an index file.
- Parameters:
index (path or file-like, optional) – The index file to use for locating the unaligned reads. If None and the source was provided as a path, we will attempt to load the index from the same path with an additional extension.
fields (list[str], optional) – Names of the fixed fields to project.
tag_defs (list[tuple[str, str]], optional) – Definitions of tag fields to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, records are scanned until EOF.
- Returns:
An iterator yielding Arrow record batches.
- Return type:
arro3 RecordBatchReader (pycapsule)
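Since `scan_query` and `scan_unmapped` both need an index, it can be useful to locate one explicitly and pass it as `index` rather than rely on automatic discovery. This page does not state which extension is appended to the source path, so the sketch below assumes the conventional BAM index suffixes `.bai` and `.csi`. Both helpers are hypothetical and not part of oxbow.

```python
from pathlib import Path


def candidate_index_paths(src, extensions=(".bai", ".csi")):
    """List index paths formed by appending each extension to `src`.

    The default extensions are an assumption based on common BAM
    index naming, not on documented oxbow behavior.
    """
    src = Path(src)
    return [src.with_name(src.name + ext) for ext in extensions]


def find_index(src, extensions=(".bai", ".csi")):
    """Return the first candidate index path that exists, or None."""
    for path in candidate_index_paths(src, extensions):
        if path.exists():
            return path
    return None
```

If `find_index` returns a path, it can be converted with `str(...)` and supplied as the `index` argument.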
- scan_virtual_ranges(vpos_ranges, fields=None, tag_defs=None, batch_size=1024, limit=None)
Scan batches of records from virtual position ranges in a BGZF file.
The virtual positions must align with record boundaries. That means that the compressed offset must point to the beginning of a BGZF block and the uncompressed offset must point to the beginning or end of a record decoded within the block.
- Parameters:
vpos_ranges (list[tuple[vpos, vpos]]) – List of virtual position ranges as pairs. Each virtual position can be given either as a packed virtual position (int) or as an unpacked tuple of ints (c, u) specifying the compressed and uncompressed offsets, respectively.
fields (list[str], optional) – Names of the fixed fields to project.
tag_defs (list[tuple[str, str]], optional) – Definitions of tag fields to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, all records in the specified ranges are scanned.
- Returns:
An iterator yielding Arrow record batches.
- Return type:
arro3 RecordBatchReader (pycapsule)
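The packed and unpacked forms of a virtual position are related by simple bit arithmetic. The sketch below follows the standard BGZF layout from the SAM/BAM specification, in which the compressed offset occupies the upper 48 bits and the uncompressed offset the lower 16 bits. It assumes oxbow's packed integers use this same layout; `pack_vpos` and `unpack_vpos` are illustrative names, not oxbow functions.

```python
def pack_vpos(c, u):
    """Pack a compressed offset `c` and uncompressed offset `u` into one int.

    Standard BGZF layout: (c << 16) | u, with u < 2**16 and c < 2**48.
    """
    if not 0 <= u < (1 << 16):
        raise ValueError("uncompressed offset must fit in 16 bits")
    if not 0 <= c < (1 << 48):
        raise ValueError("compressed offset must fit in 48 bits")
    return (c << 16) | u


def unpack_vpos(vpos):
    """Split a packed virtual position into (compressed, uncompressed)."""
    return vpos >> 16, vpos & 0xFFFF
```

Under that assumption, `(pack_vpos(c1, u1), pack_vpos(c2, u2))` and `((c1, u1), (c2, u2))` describe the same range.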
- schema(fields=None, tag_defs=None)
Return the Arrow schema.
- Parameters:
fields (list[str], optional) – Names of the fixed fields to project.
tag_defs (list[tuple[str, str]], optional) – Definitions of tag fields to project.
- Return type:
arro3 Schema (pycapsule)
- tag_defs(scan_rows=1024)
Discover tag definitions by sniffing scan_rows records.
The reader stream is reset to its original position after scanning.
- Parameters:
scan_rows (int, optional [default: 1024]) – The number of records to scan. If None, all records are scanned.
- Returns:
A list of tag definitions, where each definition is a tuple of the tag name and the SAM tag type code.
- Return type:
list[tuple[str, str]]
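The discovered definitions have the same `list[tuple[str, str]]` shape that the scan methods accept as `tag_defs`, so a common pattern is to sniff once and then project only the tags of interest. The sketch below shows the filtering step with a hypothetical helper, `select_tag_defs`; the tag names and type codes in the usage note are illustrative values, not output from a real file.

```python
def select_tag_defs(discovered, wanted):
    """Pick the definitions for `wanted` tag names, in the order requested.

    `discovered` is a list of (tag name, type code) tuples such as the
    one `tag_defs()` is documented to return. Illustrative; not part
    of oxbow.
    """
    by_name = dict(discovered)
    missing = [tag for tag in wanted if tag not in by_name]
    if missing:
        raise KeyError(f"tags not found in sniffed records: {missing}")
    return [(tag, by_name[tag]) for tag in wanted]
```

For example, given `[("NM", "i"), ("MD", "Z")]`, selecting `["MD"]` yields `[("MD", "Z")]`, which can be passed as `tag_defs` to `scan`. A tag that is absent from the first `scan_rows` records will not be discovered, so a larger `scan_rows` may be needed for rare tags.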