oxbow.from_sam


oxbow.from_sam(source: str | Path | Callable[[], IO[bytes] | str], compression: Literal['infer', 'bgzf', 'gzip', None] = 'infer', *, fields: Literal['*'] | list[str] | None = '*', tag_defs: list[tuple[str, str]] | None = None, regions: str | list[str] | None = None, index: str | Path | Callable[[], IO[bytes] | str] | None = None, batch_size: int = 131072) -> SamFile

Create a SAM file data source.

Changed in version 0.7.0: The tag_scan_rows parameter was removed, and tag definitions are no longer discovered by default. The tag_defs parameter now defaults to None, which omits tag definitions. To perform tag discovery, call the with_tags() method on the returned data source; its scan_rows parameter controls how many records are scanned.
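The two ways of getting tag columns described above can be sketched as follows. This is a minimal illustration, not an authoritative recipe: the helper names, the chosen tags, the use of SAM type codes ("i", "Z") as the second tuple element, the scan_rows value, and the assumption that with_tags() returns the data source to use are all assumptions made for the example.

```python
# Explicit tag definitions as (tag name, type code) pairs.
# Assumption: the second element is the SAM type code for the tag.
TAG_DEFS = [("NM", "i"), ("MD", "Z")]


def open_with_explicit_tags(path):
    """Hypothetical helper: project only the tags listed in TAG_DEFS."""
    import oxbow as ox  # imported lazily so the sketch loads without oxbow

    return ox.from_sam(path, tag_defs=TAG_DEFS)


def open_with_discovered_tags(path, scan_rows=1024):
    """Hypothetical helper: discover tag definitions by scanning records.

    scan_rows=1024 is an arbitrary illustrative value.
    """
    import oxbow as ox

    # Assumption: with_tags() returns the data source to use afterwards.
    return ox.from_sam(path).with_tags(scan_rows=scan_rows)
```

Explicit definitions avoid a scan of the file, while discovery is convenient when the tags present in a file are not known in advance.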

Parameters:
  • source (str, pathlib.Path, or Callable) – The URI or path to the SAM file, or a callable that opens the file as a file-like object.

  • compression (Literal["infer", "bgzf", "gzip", None], default: "infer") – Compression of the source bytestream. If “infer” and source is a URI or path, the compression is guessed from the file extension: “.gz” and “.bgz” are both interpreted as BGZF. Pass “gzip” to decode regular GZIP. If None, the source bytestream is assumed to be uncompressed. For more customized decoding, provide a callable source instead.

  • fields (list[str] or "*", optional [default: "*"]) – Standard SAM fields to include. By default, all standard fields are included.

  • tag_defs (list[tuple[str, str]], optional [default: None]) – Definitions for tags to project. These will be nested in a “tags” column. If None, tag definitions are omitted. To discover tag definitions, use the with_tags() method on the returned data source.

  • regions (str or list[str], optional [default: None]) – One or more genomic regions to query. Only applicable if an associated index file is available.

  • index (str, pathlib.Path, or Callable, optional) – An optional index file associated with the SAM file. If source is a URI or path to a BGZF-compressed file and an index file exists with the same name plus a “.tbi” or “.csi” extension, the index is detected automatically.

  • batch_size (int, optional [default: 131072]) – The number of records to read in each batch.
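The parameter rules above can be made concrete with a short sketch. The first three helpers are plain Python that only mirror the documented behavior (extension-based compression inference, index file naming, and a callable source); they are not oxbow's implementation. All helper names are hypothetical, the case-sensitive extension match is an assumption, and the "chr1:1-100000" region string format is assumed from common samtools-style usage.

```python
def guess_compression(source):
    """Mirror the documented "infer" rule: '.gz' or '.bgz' means BGZF."""
    # Assumption: the match is case-sensitive.
    return "bgzf" if str(source).endswith((".gz", ".bgz")) else None


def candidate_index_paths(source):
    """Index names that share the source name plus '.tbi' or '.csi'."""
    return [f"{source}.tbi", f"{source}.csi"]


def make_source(path):
    """Build a zero-argument callable that opens the file as a byte stream."""
    return lambda: open(path, "rb")


def open_region(path, region="chr1:1-100000"):
    """Hypothetical helper: query one region with an explicit index path."""
    import oxbow as ox  # imported lazily so the sketch loads without oxbow

    return ox.from_sam(
        path,
        compression="bgzf",
        regions=region,
        index=f"{path}.tbi",
        batch_size=65536,  # arbitrary value, smaller than the default 131072
    )
```

Passing index explicitly is only needed when the index does not follow the naming pattern shown in candidate_index_paths.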

Returns:

A data source object representing the SAM file.

Return type:

SamFile

Notes

Sequence Alignment Map (SAM) is a widely used text-based format for storing biological sequences aligned to a reference sequence.
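To illustrate the text layout behind the fields and tag_defs parameters, the following sketch parses a single SAM alignment line by hand. Each alignment line has 11 tab-separated mandatory fields followed by optional TAG:TYPE:VALUE entries. The field names follow the SAM specification; the column names oxbow produces may differ, and parse_sam_record and the example record are hypothetical.

```python
# The 11 mandatory SAM fields, in file order.
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_record(line):
    """Split one alignment line into mandatory fields and optional tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    # Optional fields look like TAG:TYPE:VALUE; values are kept as strings.
    tags = {}
    for item in cols[11:]:
        tag, type_code, value = item.split(":", 2)
        tags[tag] = (type_code, value)
    record["tags"] = tags
    return record


example = "r001\t99\tchr1\t7\t30\t8M\t=\t37\t39\tTTAGATAA\t*\tNM:i:1"
rec = parse_sam_record(example)
print(rec["RNAME"], rec["POS"], rec["tags"])
```

A hand-written parser like this is only for orientation; header lines (starting with "@") and typed tag values are not handled.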

See also

from_bam

Create a BAM file data source.

from_cram

Create a CRAM file data source.