oxbow.core.PyCramScanner

class oxbow.core.PyCramScanner(src, compressed=None, fields=None, tag_defs=None, reference=None, reference_index=None)

A CRAM file scanner.

Parameters:
  • src (str or file-like) – The path to the CRAM file or a file-like object.

  • fields (str or list[str] or None, optional [default: "*"]) – Standard SAM fields to include. "*" for all, None to omit, or a list of field names.

  • tag_defs (list[tuple[str, str]], optional [default: None]) – Tag definitions for the "tags" struct column. None omits the tags column. Use the tag_defs() method to discover definitions.

__init__()

Methods

__init__()

chrom_names()

Return the names of the reference sequences.

chrom_sizes()

Return the names of the reference sequences and their lengths in bp.

field_names()

Return the names of the standard SAM fields.

model()

Return the string representation of the alignment model.

scan([columns, batch_size, limit])

Scan batches of records from the file.

scan_query(region[, index, columns, ...])

Scan batches of records from a genomic range.

schema()

Return the Arrow schema.

tag_defs([scan_rows])

Discover tag definitions by sniffing scan_rows records.

chrom_names()

Return the names of the reference sequences.

chrom_sizes()

Return the names of the reference sequences and their lengths in bp.
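A minimal sketch of how the reference metadata might be turned into a lookup table. It assumes that chrom_sizes() returns an iterable of (name, length) pairs, which this page does not state explicitly; reference_lengths is a hypothetical helper and not part of oxbow.

```python
def reference_lengths(scanner):
    """Map each reference sequence name to its length in bp.

    Assumption: scanner.chrom_sizes() yields (name, length) pairs.
    """
    return {name: int(length) for name, length in scanner.chrom_sizes()}
```

If the assumption about the return shape does not hold, the dictionary comprehension fails with an unpacking error instead of returning wrong data.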

field_names()

Return the names of the standard SAM fields.

model()

Return the string representation of the alignment model.

scan(columns=None, batch_size=1024, limit=None)

Scan batches of records from the file.

Parameters:
  • columns (list[str], optional) – Names of the top-level columns to project.

  • batch_size (int, optional [default: 1024]) – The number of records to include in each batch.

  • limit (int, optional) – The maximum number of records to scan. If None, records are scanned until EOF.

Returns:

An iterator yielding Arrow record batches.

Return type:

arro3 RecordBatchReader (pycapsule)
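One way to consume the returned reader is to hand it to pyarrow. The sketch below assumes pyarrow 15 or newer is installed and that the object returned by scan() implements the Arrow PyCapsule stream interface, as the return type suggests. scan_to_table is a hypothetical helper, not part of oxbow.

```python
def scan_to_table(scanner, columns=None, batch_size=1024, limit=None):
    """Collect the scanned batches into a single pyarrow.Table.

    Assumptions: pyarrow >= 15 is installed, and the object returned by
    scan() implements the Arrow PyCapsule stream interface.
    """
    import pyarrow as pa  # imported lazily so the helper can be defined without it

    stream = scanner.scan(columns=columns, batch_size=batch_size, limit=limit)
    # Wrap the stream in a pyarrow reader, then materialize every batch.
    reader = pa.RecordBatchReader.from_stream(stream)
    return reader.read_all()
```

Reading everything into one table holds the whole result in memory, so for large files it is probably better to pass a limit or to iterate over the reader batch by batch.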

scan_query(region, index=None, columns=None, batch_size=1024, limit=None)

Scan batches of records from a genomic range.

This operation requires an index file.

Parameters:
  • region (str) – Genomic region in the format “chr:start-end”.

  • index (path or file-like, optional) – The index file to use for querying the region. If None and the source was provided as a path, the scanner attempts to load the index from the same path with an additional extension (conventionally .crai for CRAM).

  • columns (list[str], optional) – Names of the top-level columns to project.

  • batch_size (int, optional [default: 1024]) – The number of records to include in each batch.

  • limit (int, optional) – The maximum number of records to scan. If None, all records intersecting the query range are scanned.

Returns:

An iterator yielding Arrow record batches.

Return type:

arro3 RecordBatchReader (pycapsule)
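A region query can be sketched as follows. format_region and count_in_region are hypothetical helpers, not part of oxbow. The sketch assumes pyarrow 15 or newer is installed and that the reader returned by scan_query() implements the Arrow PyCapsule stream interface. It makes no claim about whether coordinates are 0-based or 1-based, since this page does not say.

```python
def format_region(chrom, start, end):
    """Build a "chr:start-end" region string in the format scan_query() expects."""
    if start > end:
        raise ValueError("start must not exceed end")
    return f"{chrom}:{start}-{end}"


def count_in_region(scanner, chrom, start, end, index=None):
    """Count the records returned for a region by summing batch row counts.

    Assumptions: pyarrow >= 15 is installed, and the object returned by
    scan_query() implements the Arrow PyCapsule stream interface.
    """
    import pyarrow as pa  # imported lazily so the helper can be defined without it

    stream = scanner.scan_query(format_region(chrom, start, end), index=index)
    reader = pa.RecordBatchReader.from_stream(stream)
    # Batches are consumed one at a time, so memory use stays bounded.
    return sum(batch.num_rows for batch in reader)
```

Passing index explicitly is useful when the index file does not sit next to the CRAM file. Leaving it as None relies on the lookup described for the index parameter.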

schema()

Return the Arrow schema.

Return type:

arro3 Schema (pycapsule)

tag_defs(scan_rows=1024)

Discover tag definitions by sniffing scan_rows records.

The reader stream is reset to its original position after scanning.

Parameters:
  • scan_rows (int, optional [default: 1024]) – The number of records to scan.

Returns:

A list of tag definitions, where each definition is a tuple of the tag name and the SAM tag type code.

Return type:

list[tuple[str, str]]
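Because tag definitions are supplied to the constructor, one plausible workflow is a two-pass open: build a scanner to discover the definitions, then build a second one that includes the "tags" column. The sketch below assumes src is a path, since reusing a single file-like object for two scanners may not be safe. open_with_discovered_tags and its scanner_cls parameter are hypothetical and not part of oxbow.

```python
def open_with_discovered_tags(src, scan_rows=1024, scanner_cls=None, **kwargs):
    """Open a CRAM scanner whose "tags" column covers the sniffed tags.

    scanner_cls defaults to oxbow.core.PyCramScanner; it is a parameter only
    so that the sketch can be exercised with a stand-in class.
    Returns the configured scanner and the discovered definitions.
    """
    if scanner_cls is None:
        from oxbow.core import PyCramScanner as scanner_cls

    # First pass: a scanner without tag definitions, used only for discovery.
    probe = scanner_cls(src, **kwargs)
    defs = probe.tag_defs(scan_rows)

    # Second pass: feed the discovered (name, type code) tuples back in.
    return scanner_cls(src, tag_defs=defs, **kwargs), defs
```

Tags that first appear after the sniffed records would not be discovered this way, so a larger scan_rows may be worth the extra time on heterogeneous files.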