oxbow.core.VcfFile
- class oxbow.core.VcfFile(source: str | Callable[[], IO[bytes] | str], compressed: bool = False, *, fields=None, info_fields: list[str] | None = None, samples: list[str] | None = None, genotype_fields: list[str] | None = None, genotype_by: Literal['sample', 'field'] = 'sample', regions: str | list[str] | None = None, index: str | Callable[[], IO[bytes] | str] | None = None, batch_size: int = 131072)
- __init__(source: str | Callable[[], IO[bytes] | str], compressed: bool = False, *, fields=None, info_fields: list[str] | None = None, samples: list[str] | None = None, genotype_fields: list[str] | None = None, genotype_by: Literal['sample', 'field'] = 'sample', regions: str | list[str] | None = None, index: str | Callable[[], IO[bytes] | str] | None = None, batch_size: int = 131072)
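According to the signature, `source` and `index` accept either a path string or a zero-argument callable that returns a byte stream or a path. The sketch below shows one way such a callable might be built for local files. The helper name `make_opener` and the file names are hypothetical, and whether oxbow expects a fresh handle on every call is an assumption.

```python
from typing import IO, Callable


def make_opener(path: str) -> Callable[[], IO[bytes]]:
    """Return a zero-argument callable that opens `path` for binary reading."""

    def opener() -> IO[bytes]:
        # A new handle is opened on each call; the caller is responsible
        # for closing it.
        return open(path, "rb")

    return opener


# Hypothetical usage (requires oxbow; file names are placeholders):
#
#   from oxbow.core import VcfFile
#   vcf = VcfFile(
#       make_opener("calls.vcf.gz"),
#       compressed=True,
#       index=make_opener("calls.vcf.gz.tbi"),
#   )
```

Passing a callable instead of a path defers opening the file until the data source needs it, which may be convenient when the stream comes from somewhere other than the local filesystem.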
Methods
__init__(source[, compressed, fields, ...])
batches() – Generate record batches from the data source.
dataset() – Convert the data source into a dataset.
dd([find_divisions]) – Convert the data source to a Dask DataFrame.
fragments() – Get fragments of the data source.
pd() – Convert the dataset to a Pandas DataFrame.
pl([lazy]) – Convert the data source to a Polars DataFrame or LazyFrame.
regions(regions) – Query one or more genomic ranges within the data source.
scanner() – Create a low-level scanner for the data source.
to_dask([find_divisions]) – Convert the data source to a Dask DataFrame.
to_duckdb(conn) – Convert the data source into a DuckDB Relation.
to_ipc() – Serialize the data source as an Arrow IPC stream.
to_pandas() – Convert the dataset to a Pandas DataFrame.
to_polars([lazy]) – Convert the data source to a Polars DataFrame or LazyFrame.
Attributes
chrom_names – List of reference sequence names declared in the header.
chrom_sizes – List of reference sequence names and their lengths in bp.
columns – The top-level column names of the projection.
genotype_field_defs – List of FORMAT field definitions declared in the header.
info_field_defs – List of INFO field definitions declared in the header.
samples – List of sample IDs declared in the header.
schema – The arrow schema of the projection.
- batches() -> Generator[RecordBatch, None, None]
Generate record batches from the data source.
- Yields:
pa.RecordBatch – A record batch from the data source.
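Because `batches()` is a generator, records can be consumed one batch at a time without holding the whole file in memory. A minimal sketch that counts records this way; the helper name `count_rows` is hypothetical, and it relies only on the `num_rows` attribute of `pyarrow.RecordBatch`.

```python
from typing import Any, Iterable


def count_rows(batches: Iterable[Any]) -> int:
    """Sum `num_rows` over a stream of record batches without retaining them."""
    total = 0
    for batch in batches:
        # pyarrow.RecordBatch exposes its row count as `num_rows`.
        total += batch.num_rows
    return total


# Hypothetical usage, with `vcf` an already constructed VcfFile:
#
#   n_records = count_rows(vcf.batches())
```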
- property chrom_names: list[str]
List of reference sequence names declared in the header.
- property chrom_sizes: list[tuple[str, int]]
List of reference sequence names and their lengths in bp.
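Since `chrom_sizes` is a list of `(name, length)` pairs, it converts directly into a lookup table. The sketch below illustrates that; the helper name `summarize_chrom_sizes` is hypothetical.

```python
def summarize_chrom_sizes(
    chrom_sizes: list[tuple[str, int]],
) -> tuple[dict[str, int], int]:
    """Return a name -> length mapping and the total length in bp."""
    # dict() accepts an iterable of (key, value) pairs directly.
    lengths = dict(chrom_sizes)
    total_bp = sum(lengths.values())
    return lengths, total_bp


# Hypothetical usage, with `vcf` an already constructed VcfFile:
#
#   lengths, total_bp = summarize_chrom_sizes(vcf.chrom_sizes)
```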
- property columns: list[str]
The top-level column names of the projection.
- dataset() -> BatchReaderDataset
Convert the data source into a dataset.
A dataset is a collection of fragments that can be processed as a single logical entity.
- Returns:
A dataset representation of the data source.
- Return type:
BatchReaderDataset
- dd(find_divisions=False)
Convert the data source to a Dask DataFrame.
- Parameters:
find_divisions (bool, optional) – If True, find divisions for the Dask DataFrame. Defaults to False.
- Returns:
A Dask DataFrame representation of the data source.
- Return type:
dask.dataframe.DataFrame
- fragments() -> list[BatchReaderFragment]
Get fragments of the data source.
Fragments represent parts of the data source that can be processed independently.
- Returns:
A list of fragments representing parts of the data source.
- Return type:
list of BatchReaderFragment
- property genotype_field_defs: list[tuple[str, str, str]]
List of FORMAT field definitions declared in the header.
- property info_field_defs: list[tuple[str, str, str]]
List of INFO field definitions declared in the header.
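The header definitions can be used to check a requested projection before passing it to the `info_fields` or `genotype_fields` constructor parameters. A minimal sketch follows. It assumes that the first element of each definition tuple is the field ID (for example `DP`); this page does not state the tuple layout. The helper name `select_declared_fields` is hypothetical.

```python
def select_declared_fields(
    field_defs: list[tuple[str, str, str]], wanted: list[str]
) -> list[str]:
    """Keep only the names in `wanted` that the header declares, in `wanted` order."""
    # Assumption: the first tuple element is the field ID (e.g. "DP").
    declared = {field_def[0] for field_def in field_defs}
    return [name for name in wanted if name in declared]


# Hypothetical usage, with `vcf` an already constructed VcfFile:
#
#   info = select_declared_fields(vcf.info_field_defs, ["DP", "AF"])
```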
- pd()
Convert the dataset to a Pandas DataFrame.
- Returns:
A Pandas DataFrame representation of the dataset.
- Return type:
pandas.DataFrame
- pl(lazy=False)
Convert the data source to a Polars DataFrame or LazyFrame.
- Parameters:
lazy (bool, optional) – If True, return a LazyFrame instead of a DataFrame. Defaults to False.
- Returns:
A polars representation of the data source.
- Return type:
polars.DataFrame | polars.LazyFrame
- regions(regions: str | list[str]) -> Self
Query one or more genomic ranges within the data source.
This method creates a new instance of the data source with the same parameters, overriding the regions to select from the data source.
- Parameters:
regions (str | list[str]) – The regions to select from the data source. This can be a single region or a list of regions.
- Return type:
DataSource
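Since `regions()` returns a new instance rather than modifying the existing one, the original data source stays usable after a query. The sketch below builds region strings from coordinate tuples and passes them as a list. The `chrom:start-end` string format is an assumption (this page does not specify the region syntax), and the helper names `format_region` and `query_intervals` are hypothetical.

```python
from typing import Any


def format_region(chrom: str, start: int, end: int) -> str:
    """Format an interval as a `chrom:start-end` string (assumed syntax)."""
    if end < start:
        raise ValueError(f"end ({end}) precedes start ({start})")
    return f"{chrom}:{start}-{end}"


def query_intervals(vcf: Any, intervals: list[tuple[str, int, int]]) -> Any:
    """Return a new data source restricted to the given intervals."""
    region_strings = [format_region(*interval) for interval in intervals]
    # `regions` returns a new instance; `vcf` itself is left unchanged.
    return vcf.regions(region_strings)


# Hypothetical usage, with `vcf` an already constructed, indexed VcfFile:
#
#   subset = query_intervals(vcf, [("chr1", 1000, 2000), ("chr2", 1, 500)])
```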
- property samples: list[str]
List of sample IDs declared in the header.
- scanner() -> Any
Create a low-level scanner for the data source.
- property schema: Schema
The arrow schema of the projection.
- to_dask(find_divisions=False)
Convert the data source to a Dask DataFrame.
- Parameters:
find_divisions (bool, optional) – If True, find divisions for the Dask DataFrame. Defaults to False.
- Returns:
A Dask DataFrame representation of the data source.
- Return type:
dask.dataframe.DataFrame
- to_duckdb(conn)
Convert the data source into a DuckDB Relation.
- Parameters:
conn (duckdb.DuckDBPyConnection) – The DuckDB connection.
- Returns:
A DuckDB Relation representation of the data source.
- Return type:
duckdb.DuckDBPyRelation
- to_ipc() -> bytes
Serialize the data source as an Arrow IPC stream.
- Returns:
The serialized data source in Arrow IPC format.
- Return type:
bytes
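The returned bytes can be written to disk or sent over a network as-is. A minimal sketch that persists them to a file; the helper name `write_ipc` and the output file name are hypothetical.

```python
from pathlib import Path
from typing import Any


def write_ipc(vcf: Any, path: str) -> int:
    """Write the Arrow IPC stream produced by `vcf.to_ipc()` to `path`.

    Returns the number of bytes written.
    """
    payload = vcf.to_ipc()
    return Path(path).write_bytes(payload)


# Hypothetical usage, with `vcf` an already constructed VcfFile:
#
#   write_ipc(vcf, "calls.arrows")
#
# The stream can then be read back with pyarrow, for example:
#
#   import pyarrow as pa
#   table = pa.ipc.open_stream("calls.arrows").read_all()
```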
- to_pandas()
Convert the dataset to a Pandas DataFrame.
- Returns:
A Pandas DataFrame representation of the dataset.
- Return type:
pandas.DataFrame
- to_polars(lazy=False)
Convert the data source to a Polars DataFrame or LazyFrame.
- Parameters:
lazy (bool, optional) – If True, return a LazyFrame instead of a DataFrame. Defaults to False.
- Returns:
A polars representation of the data source.
- Return type:
polars.DataFrame | polars.LazyFrame