oxbow.core.BamFile
- class oxbow.core.BamFile(source: str | Callable[[], IO[bytes] | str], compressed: bool = False, *, fields: list[str] | None = None, tag_defs: list[tuple[str, str]] | None = None, tag_scan_rows: int = 1024, regions: str | list[str] | None = None, index: str | Callable[[], IO[bytes] | str] | None = None, batch_size: int = 131072)
- __init__(source: str | Callable[[], IO[bytes] | str], compressed: bool = False, *, fields: list[str] | None = None, tag_defs: list[tuple[str, str]] | None = None, tag_scan_rows: int = 1024, regions: str | list[str] | None = None, index: str | Callable[[], IO[bytes] | str] | None = None, batch_size: int = 131072)
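As a rough illustration of the signature above, the following sketch wraps the constructor in a small helper that forwards only the keyword-only options the caller actually sets. The helper name `open_bam` and its `bam_cls` parameter are hypothetical and exist only so the example can be run without oxbow installed; the file path, region string, and field names in the usage comment are likewise assumptions.

```python
from typing import Any, Optional, Union


def open_bam(
    path: str,
    *,
    regions: Optional[Union[str, list[str]]] = None,
    fields: Optional[list[str]] = None,
    bam_cls: Any = None,
):
    """Construct a BamFile, passing only the options that were provided."""
    if bam_cls is None:
        # Imported lazily so the helper itself does not require oxbow at import time.
        from oxbow.core import BamFile as bam_cls
    kwargs: dict[str, Any] = {}
    if regions is not None:
        kwargs["regions"] = regions
    if fields is not None:
        kwargs["fields"] = fields
    # `source` is positional; everything after `compressed` is keyword-only.
    return bam_cls(path, **kwargs)


# Hypothetical usage (path, region, and field names are placeholders):
# bam = open_bam("sample.bam", regions="chr1:1-1000", fields=["qname", "pos"])
```

Options left as `None` are not forwarded, so the defaults shown in the signature stay in effect.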
Methods
__init__(source[, compressed, fields, ...])
batches() – Generate record batches from the data source.
dataset() – Convert the data source into a dataset.
dd([find_divisions]) – Convert the data source to a Dask DataFrame.
fragments() – Get fragments of the data source.
pd() – Convert the dataset to a Pandas DataFrame.
pl([lazy]) – Convert the data source to a Polars DataFrame or LazyFrame.
regions(regions) – Query one or more genomic ranges within the data source.
scanner() – Create a low-level scanner for the data source.
to_dask([find_divisions]) – Convert the data source to a Dask DataFrame.
to_duckdb(conn) – Convert the data source into a DuckDB Relation.
to_ipc() – Serialize the data source as an Arrow IPC stream.
to_pandas() – Convert the dataset to a Pandas DataFrame.
to_polars([lazy]) – Convert the data source to a Polars DataFrame or LazyFrame.
Attributes
chrom_names – List of reference sequence names declared in the header.
chrom_sizes – List of reference sequence names and their lengths in bp.
columns – The top-level column names of the projection.
schema – The arrow schema of the projection.
tag_defs – List of definitions for interpreting tag records.
- batches() -> Generator
Generate record batches from the data source.
- Yields:
RecordBatch – A record batch from the data source.
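Because batches are yielded one at a time, simple aggregates can be computed without materializing the whole file. The sketch below counts rows across a stream of batches; it assumes only that each batch exposes a `num_rows` attribute, as a pyarrow `RecordBatch` does. The helper name `count_rows` is hypothetical.

```python
from typing import Iterable, Protocol


class HasNumRows(Protocol):
    """Anything that reports a row count, e.g. a pyarrow RecordBatch."""

    num_rows: int


def count_rows(batches: Iterable[HasNumRows]) -> int:
    """Sum row counts over a stream of batches, holding one batch at a time."""
    total = 0
    for batch in batches:
        total += batch.num_rows
    return total


# Hypothetical usage with a BamFile instance named `bam`:
# n_records = count_rows(bam.batches())
```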
- property chrom_names: list[str]
List of reference sequence names declared in the header.
- property chrom_sizes: list[tuple[str, int]]
List of reference sequence names and their lengths in bp.
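A list of `(name, length)` tuples converts directly into a lookup table, and can also be used to build one region string per reference sequence. The sketch below assumes 1-based, inclusive `name:start-end` region strings in the samtools/UCSC style; this page does not state the region syntax that `regions` expects, so treat that format as an assumption. The sample sizes are made up.

```python
def whole_chrom_regions(chrom_sizes: list[tuple[str, int]]) -> list[str]:
    """Build one region string per reference sequence, spanning its full length."""
    # Assumption: regions are written as 1-based, inclusive "name:start-end".
    return [f"{name}:1-{length}" for name, length in chrom_sizes]


# Hypothetical values standing in for `bam.chrom_sizes`.
sizes = [("chr1", 1000), ("chr2", 500)]

lengths = dict(sizes)  # name -> length lookup
region_strings = whole_chrom_regions(sizes)
```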
- property columns: list[str]
The top-level column names of the projection.
- dataset() -> BatchReaderDataset
Convert the data source into a dataset.
A dataset is a collection of fragments that can be processed as a single logical entity.
- Returns:
A dataset representation of the data source.
- Return type:
BatchReaderDataset
- dd(find_divisions=False)
Convert the data source to a Dask DataFrame.
- Parameters:
find_divisions (bool, optional) – If True, find divisions for the Dask DataFrame. Defaults to False.
- Returns:
A Dask DataFrame representation of the data source.
- Return type:
dask.dataframe.DataFrame
- fragments() -> list[BatchReaderFragment]
Get fragments of the data source.
Fragments represent parts of the data source that can be processed independently.
- Returns:
A list of fragments representing parts of the data source.
- Return type:
list of BatchReaderFragment
- pd()
Convert the dataset to a Pandas DataFrame.
- Returns:
A Pandas DataFrame representation of the dataset.
- Return type:
pandas.DataFrame
- pl(lazy=False)
Convert the data source to a Polars DataFrame or LazyFrame.
- Parameters:
lazy (bool, optional) – If True, return a LazyFrame instead of a DataFrame. Defaults to False.
- Returns:
A polars representation of the data source.
- Return type:
polars.DataFrame | polars.LazyFrame
- regions(regions: str | list[str]) -> Self
Query one or more genomic ranges within the data source.
This method creates a new instance of the data source with the same parameters, except that its regions are replaced by the ones given here.
- Parameters:
regions (str | list[str]) – The regions to select from the data source. This can be a single region or a list of regions.
- Return type:
DataSource
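The "new instance with the same parameters, different regions" behavior described above can be pictured with a small stand-in class. This is an illustration of the pattern only, not oxbow's implementation: `SourceSketch`, its `selected` field, and the example path and region string are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional, Union


@dataclass(frozen=True)
class SourceSketch:
    """Hypothetical stand-in for a data source; not an oxbow class."""

    path: str
    batch_size: int = 131072
    selected: Optional[tuple[str, ...]] = None

    def regions(self, regions: Union[str, list[str]]) -> "SourceSketch":
        """Return a copy that keeps every parameter except the selected regions."""
        if isinstance(regions, str):
            regions = [regions]  # accept a single region or a list of regions
        return replace(self, selected=tuple(regions))


whole = SourceSketch("sample.bam")
subset = whole.regions("chr1:1-1000")
```

In this sketch the original object is left untouched, so the unrestricted source and the region-restricted copy can be used side by side.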
- scanner() -> Any
Create a low-level scanner for the data source.
- property schema: Schema
The arrow schema of the projection.
- property tag_defs: list[tuple[str, str]]
List of definitions for interpreting tag records.
- to_dask(find_divisions=False)
Convert the data source to a Dask DataFrame.
- Parameters:
find_divisions (bool, optional) – If True, find divisions for the Dask DataFrame. Defaults to False.
- Returns:
A Dask DataFrame representation of the data source.
- Return type:
dask.dataframe.DataFrame
- to_duckdb(conn)
Convert the data source into a DuckDB Relation.
- Parameters:
conn (duckdb.DuckDBPyConnection) – The DuckDB connection.
- Returns:
A DuckDB Relation representation of the data source.
- Return type:
duckdb.DuckDBPyRelation
- to_ipc() -> bytes
Serialize the data source as an Arrow IPC stream.
- Returns:
The serialized data source in Arrow IPC format.
- Return type:
bytes
- to_pandas()
Convert the dataset to a Pandas DataFrame.
- Returns:
A Pandas DataFrame representation of the dataset.
- Return type:
pandas.DataFrame
- to_polars(lazy=False)
Convert the data source to a Polars DataFrame or LazyFrame.
- Parameters:
lazy (bool, optional) – If True, return a LazyFrame instead of a DataFrame. Defaults to False.
- Returns:
A polars representation of the data source.
- Return type:
polars.DataFrame | polars.LazyFrame