oxbow.core.PyCramScanner

class oxbow.core.PyCramScanner(src, compressed=None)

A CRAM file scanner.

Parameters:

src (str or file-like) – The path to the CRAM file or a file-like object.

Methods

`__init__`()
`chrom_names`()	Return the names of the reference sequences.
`chrom_sizes`()	Return the names of the reference sequences and their lengths in bp.
`field_names`()	Return the names of the fixed fields.
`scan`([reference, reference_index, fields, ...])	Scan batches of records from the file.
`scan_query`(region[, index, reference, ...])	Scan batches of records from a genomic range.
`schema`([fields, tag_defs])	Return the Arrow schema.
`tag_defs`([scan_rows])	Discover tag definitions by sniffing scan_rows records.

chrom_sizes()

Return the names of the reference sequences and their lengths in bp.

scan(reference=None, reference_index=None, fields=None, tag_defs=None, batch_size=1024, limit=None)

Scan batches of records from the file.

Parameters:

reference (path or file-like, optional) – The external reference FASTA file to use for decoding bases in the CRAM records. If not provided, sequence bases or references must be embedded in the CRAM file.
reference_index (path or file-like, optional) – The index file for the reference FASTA file.
fields (list[str], optional) – Names of the fixed fields to project.
tag_defs (list[tuple[str, str]], optional) – Definitions of tag fields to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, records are scanned until EOF.

Returns:

An iterator yielding Arrow record batches.

Return type:

arro3 RecordBatchReader (pycapsule)

scan_query(region, index=None, reference=None, reference_index=None, fields=None, tag_defs=None, batch_size=1024, limit=None)

Scan batches of records from a genomic range.

This operation requires an index file.

Parameters:

region (str) – Genomic region in the format “chr:start-end”.
index (path or file-like, optional) – The index file to use for querying the region. If None and the source was provided as a path, the index is looked up at the same path with an additional extension.
reference (path or file-like, optional) – The external reference FASTA file to use for decoding bases in the CRAM records. If not provided, sequence bases or references must be embedded in the CRAM file.
reference_index (path or file-like, optional) – The index file for the reference FASTA file.
fields (list[str], optional) – Names of the fixed fields to project.
tag_defs (list[tuple[str, str]], optional) – Definitions of tag fields to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, all records intersecting the query range are scanned.

Returns:

An iterator yielding Arrow record batches.

Return type:

arro3 RecordBatchReader (pycapsule)

schema(fields=None, tag_defs=None)

Return the Arrow schema.

Parameters:

fields (list[str], optional) – Names of the fixed fields to project.
tag_defs (list[tuple[str, str]], optional) – Definitions of tag fields to project.

Return type:

arro3 Schema (pycapsule)

tag_defs(scan_rows=1024)

Discover tag definitions by sniffing scan_rows records.

The reader stream is reset to its original position after scanning.

Parameters:

scan_rows (int, optional [default: 1024]) – The number of records to scan.

Returns:

A list of tag definitions, where each definition is a tuple of the tag name and the SAM tag type code.

Return type:

list[tuple[str, str]]