oxbow.core.CramFile

__init__(source: str | Callable[[], IO[bytes] | str], compressed: bool = False, *, fields: Literal['*'] | list[str] | None = '*', tag_defs: list[tuple[str, str]] | None = None, coords: Literal['01', '11'] = '11', regions: str | list[str] | None = None, index: str | Callable[[], IO[bytes] | str] | None = None, reference: str | Callable[[], IO[bytes] | str] | None = None, reference_index: str | Callable[[], IO[bytes] | str] | None = None, batch_size: int = 131072)
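
As a rough illustration of the signature above, the sketch below collects a few of the keyword arguments and checks them against the documented literal values before constructing a reader. The helpers `cram_options` and `open_cram` are hypothetical and not part of oxbow; only the keyword names, their defaults, and the allowed values of `coords` and `fields` are taken from the signature. The import is deferred so the option check runs without oxbow installed.

```python
from typing import Any


def cram_options(*, fields="*", coords="11", regions=None, batch_size=131072):
    """Collect a subset of the documented keyword arguments (hypothetical helper)."""
    # The signature documents coords as Literal['01', '11'].
    if coords not in ("01", "11"):
        raise ValueError(f"coords must be '01' or '11', got {coords!r}")
    # The signature documents fields as '*', a list of names, or None.
    if isinstance(fields, str) and fields != "*":
        raise ValueError("fields must be '*', a list of names, or None")
    return {
        "fields": fields,
        "coords": coords,
        "regions": regions,
        "batch_size": batch_size,
    }


def open_cram(source: str, **options: Any):
    """Open a CRAM file with checked options; requires the oxbow package."""
    from oxbow.core import CramFile  # deferred import

    return CramFile(source, **cram_options(**options))
```

A call such as `open_cram("sample.cram", regions="chr1:1-1000")` would then pass the checked options through; the file name here is a placeholder.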

Methods

`__init__`(source[, compressed, fields, ...])
`batches`()	Generate record batches from the data source.
`dataset`()	Convert the data source into a dataset.
`dd`([find_divisions])	Convert the data source to a Dask DataFrame.
`fragments`()	Get fragments of the data source.
`pd`()	Convert the dataset to a Pandas DataFrame.
`pl`([lazy])	Convert the data source to a Polars DataFrame or LazyFrame.
`regions`(regions)	Query one or more genomic ranges within the data source.
`scanner`()	Create a low-level scanner for the data source.
`to_dask`([find_divisions])	Convert the data source to a Dask DataFrame.
`to_duckdb`(conn)	Convert the data source into a DuckDB Relation.
`to_ipc`()	Serialize the data source as an Arrow IPC stream.
`to_pandas`()	Convert the dataset to a Pandas DataFrame.
`to_polars`([lazy])	Convert the data source to a Polars DataFrame or LazyFrame.
`with_tags`([tag_defs, scan_rows])	Return a new data source with the specified tag definitions.

Attributes

`chrom_names`	List of reference sequence names declared in the header.
`chrom_sizes`	List of reference sequence names and their lengths in bp.
`columns`	The top-level column names of the projection.
`schema`	The arrow schema of the projection.
`tag_defs`	List of definitions for interpreting tags.

batches() → Generator

Generate record batches from the data source.

Yields: RecordBatch – A record batch from the data source.
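
Because `batches()` is a generator, records can be processed one batch at a time instead of being loaded all at once. The sketch below assumes each yielded object exposes a `num_rows` attribute, as `pyarrow.RecordBatch` does; the helper name `count_rows` is illustrative only.

```python
def count_rows(batches):
    """Sum the row counts of an iterable of record batches, one batch at a time."""
    total = 0
    for batch in batches:
        # Assumption: each batch has a num_rows attribute (true for pyarrow.RecordBatch).
        total += batch.num_rows
    return total
```

With a `CramFile` instance named `cram` (hypothetical), this would be called as `count_rows(cram.batches())`.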

property chrom_names: list[str]

List of reference sequence names declared in the header.

property chrom_sizes: list[tuple[str, int]]

List of reference sequence names and their lengths in bp.
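
The `list[tuple[str, int]]` shape of `chrom_sizes` converts directly to a lookup table. The values below are stand-ins shaped like the documented return type, not read from a real file, and `total_length` is a hypothetical helper.

```python
def total_length(chrom_sizes):
    """Total reference length in bp for a list of (name, length) pairs."""
    return sum(size for _, size in chrom_sizes)


# Stand-in data with the documented shape (not from a real CRAM header).
example_sizes = [("chr1", 1000), ("chr2", 500)]

# A dict gives name-to-length lookup.
lengths = dict(example_sizes)
```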

property columns: list[str]

The top-level column names of the projection.

dataset() → BatchReaderDataset

Convert the data source into a dataset.

A dataset is a collection of fragments that can be processed as a single logical entity.

Returns: A dataset representation of the data source.
Return type: BatchReaderDataset

dd(find_divisions=False)

Convert the data source to a Dask DataFrame.

Parameters: find_divisions (bool, optional [default: False]) – If True, find divisions for the Dask DataFrame.
Returns: A Dask DataFrame representation of the data source.
Return type: dask.dataframe.DataFrame

fragments() → list[BatchReaderFragment]

Get fragments of the data source.

Fragments represent parts of the data source that can be processed independently.

Returns: A list of fragments representing parts of the data source.
Return type: list of BatchReaderFragment

pd()

Convert the dataset to a Pandas DataFrame.

Returns: A Pandas DataFrame representation of the dataset.
Return type: pandas.DataFrame

pl(lazy=False)

Convert the data source to a Polars DataFrame or LazyFrame.

Parameters: lazy (bool, optional [default: False]) – If True, returns a LazyFrame.
Returns: A polars representation of the data source.
Return type: polars.DataFrame | polars.LazyFrame

regions(regions: str | list[str]) → Self

Query one or more genomic ranges within the data source.

This method returns a new instance of the data source with the same parameters, except that the regions to select are replaced by the ones given.

Parameters: regions (str | list[str]) – The regions to select from the data source. This can be a single region or a list of regions.
Return type: DataSource

Notes

Genomic range strings can be in the following formats:

UCSC-style "chr:start-end": interpreted using the coordinate system of the data source.
Bracket-style "chr:[start,end]": explicitly 1-based, end-inclusive.
Bracket-style "chr:[start,end)": explicitly 0-based, end-exclusive.
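
To make the three notations concrete, here is a small standalone sketch that normalizes each form to a 0-based, end-exclusive interval. It is not oxbow's parser. It assumes that `coords='11'` means 1-based, end-inclusive and `coords='01'` means 0-based, end-exclusive, which is an inference from the bracket notations rather than something stated above. The function name `to_half_open` is hypothetical.

```python
import re

# "chr:[start,end]" or "chr:[start,end)"
_BRACKET = re.compile(r"^(?P<chrom>.+):\[(?P<start>\d+),\s*(?P<end>\d+)(?P<close>[\])])$")
# "chr:start-end"
_UCSC = re.compile(r"^(?P<chrom>.+):(?P<start>\d+)-(?P<end>\d+)$")


def to_half_open(region, coords="11"):
    """Return (chrom, start, end) with a 0-based start and an exclusive end."""
    m = _BRACKET.match(region)
    if m:
        # A closing "]" marks 1-based, end-inclusive; ")" marks 0-based, end-exclusive.
        one_based = m["close"] == "]"
    else:
        m = _UCSC.match(region)
        if m is None:
            raise ValueError(f"unrecognized region string: {region!r}")
        # Assumption: '11' is 1-based inclusive, '01' is 0-based exclusive.
        one_based = coords == "11"
    start, end = int(m["start"]), int(m["end"])
    if one_based:
        start -= 1  # shifting the start converts a closed 1-based range to half-open
    return m["chrom"], start, end
```

Under these assumptions, `"chr1:[100,200]"` and `"chr1:[99,200)"` describe the same interval.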

scanner() → Any

Create a low-level scanner for the data source.

property schema: Schema

The arrow schema of the projection.

property tag_defs: list[tuple[str, str]]

List of definitions for interpreting tags.

to_dask(find_divisions=False)

Convert the data source to a Dask DataFrame.

Parameters: find_divisions (bool, optional [default: False]) – If True, find divisions for the Dask DataFrame.
Returns: A Dask DataFrame representation of the data source.
Return type: dask.dataframe.DataFrame

to_duckdb(conn)

Convert the data source into a DuckDB Relation.

Parameters: conn (duckdb.DuckDBPyConnection) – The DuckDB connection.
Returns: A DuckDB Relation representation of the data source.
Return type: duckdb.DuckDBPyRelation

to_ipc() → bytes

Serialize the data source as an Arrow IPC stream.

Returns: The serialized data source in Arrow IPC format.
Return type: bytes
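
Since `to_ipc()` returns plain `bytes`, the payload can be written to disk or handed to any Arrow IPC reader. The sketch below is one possible way to do this. The helper names `save_ipc` and `load_ipc` are hypothetical, and reading back assumes pyarrow is installed.

```python
from pathlib import Path


def save_ipc(payload, path):
    """Write an Arrow IPC stream to disk; returns the number of bytes written."""
    return Path(path).write_bytes(payload)


def load_ipc(path):
    """Read an Arrow IPC stream back as a pyarrow.Table (requires pyarrow)."""
    import pyarrow.ipc as ipc  # deferred so save_ipc works without pyarrow

    # open_stream accepts a bytes-like source and returns a stream reader.
    return ipc.open_stream(Path(path).read_bytes()).read_all()
```

With a `CramFile` instance named `cram` (hypothetical), `save_ipc(cram.to_ipc(), "reads.arrows")` would persist the stream; the file name is a placeholder.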

to_pandas()

Convert the dataset to a Pandas DataFrame.

Returns: A Pandas DataFrame representation of the dataset.
Return type: pandas.DataFrame

to_polars(lazy=False)

Convert the data source to a Polars DataFrame or LazyFrame.

Parameters: lazy (bool, optional [default: False]) – If True, returns a LazyFrame.
Returns: A polars representation of the data source.
Return type: polars.DataFrame | polars.LazyFrame

with_tags(tag_defs: list[tuple[str, str]] | None = None, *, scan_rows: int = 1024) → Self

Return a new data source with the specified tag definitions.

Parameters:

tag_defs (list[tuple[str, str]] or None, optional [default: None]) – Definitions for tags to project. These will be nested in a “tags” column. If None (default), tag definitions are discovered by scanning records in the file, which is controlled by the scan_rows parameter.
scan_rows (int, optional [default: 1024]) – Number of rows to scan for tag discovery if tag_defs is None. Set to -1 to scan the entire file, which may be slow for large files.

Returns: A new data source with the specified tag definitions.
Return type: Self

Notes

Tag definitions take the form of a list of (tag_name, tag_type) tuples, where tag_name is a 2-character string and tag_type is a single-character type code as defined in the SAM specification.

Type codes:

A: Printable character
i: Signed integer
f: Floating point number
Z: String
H: Hex string
B: Array (comma-separated values with type code prefix, e.g., “i,1,2,3”)
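
The format described above can be checked before calling `with_tags`. The following is a minimal sketch, not part of oxbow: `check_tag_defs` is a hypothetical helper that enforces only the two rules stated in these notes (a 2-character name and one of the listed type codes).

```python
# Type codes listed in the notes above.
SAM_TAG_TYPES = {"A", "i", "f", "Z", "H", "B"}


def check_tag_defs(tag_defs):
    """Validate (tag_name, tag_type) pairs and return them as a list."""
    checked = []
    for name, code in tag_defs:
        if len(name) != 2:
            raise ValueError(f"tag name must be 2 characters, got {name!r}")
        if code not in SAM_TAG_TYPES:
            raise ValueError(f"unknown type code {code!r} for tag {name!r}")
        checked.append((name, code))
    return checked
```

For example, `check_tag_defs([("NM", "i"), ("MD", "Z")])` validates two standard SAM tags, and the result could then be passed as `tag_defs` to `with_tags`.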
