oxbow.core.GtfFile
- class oxbow.core.GtfFile(source: str | Callable[[], IO[bytes] | str], compressed: bool = False, *, fields: Literal['*'] | list[str] | None = '*', attribute_defs: list[tuple[str, str]] | None = None, regions: str | list[str] | None = None, index: str | Callable[[], IO[bytes] | str] | None = None, batch_size: int = 131072)
- __init__(source: str | Callable[[], IO[bytes] | str], compressed: bool = False, *, fields: Literal['*'] | list[str] | None = '*', attribute_defs: list[tuple[str, str]] | None = None, regions: str | list[str] | None = None, index: str | Callable[[], IO[bytes] | str] | None = None, batch_size: int = 131072)
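As a minimal sketch of how the constructor might be called, the hypothetical helper below passes only parameters that appear in the signature above. The helper name `open_gtf` and the choice of arguments are illustrative assumptions; `oxbow` is imported lazily so the function can be defined without the package installed.

```python
def open_gtf(path, attribute_defs=None, batch_size=131072):
    """Open an uncompressed GTF file as an oxbow GtfFile (hypothetical helper)."""
    # Imported here so that defining the helper does not require oxbow.
    from oxbow.core import GtfFile

    return GtfFile(
        path,
        compressed=False,               # plain-text input
        attribute_defs=attribute_defs,  # None: use the library's default handling
        batch_size=batch_size,          # same default as in the signature
    )
```

A call such as `open_gtf("annotations.gtf", attribute_defs=[("gene_id", "String")])` would then return a data source; the file name here is a placeholder.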
Methods
__init__(source[, compressed, fields, ...])
batches() – Generate record batches from the data source.
dataset() – Convert the data source into a dataset.
dd([find_divisions]) – Convert the data source to a Dask DataFrame.
fragments() – Get fragments of the data source.
pd() – Convert the dataset to a Pandas DataFrame.
pl([lazy]) – Convert the data source to a Polars DataFrame or LazyFrame.
regions(regions) – Query one or more genomic ranges within the data source.
scanner() – Create a low-level scanner for the data source.
to_dask([find_divisions]) – Convert the data source to a Dask DataFrame.
to_duckdb(conn) – Convert the data source into a DuckDB Relation.
to_ipc() – Serialize the data source as an Arrow IPC stream.
to_pandas() – Convert the dataset to a Pandas DataFrame.
to_polars([lazy]) – Convert the data source to a Polars DataFrame or LazyFrame.
with_attributes([attribute_defs, scan_rows]) – Return a new data source with the specified attribute definitions.
Attributes
attribute_defs – List of definitions for interpreting attribute records.
columns – The top-level column names of the projection.
schema – The Arrow schema of the projection.
- property attribute_defs: list[tuple[str, str]]
List of definitions for interpreting attribute records.
- batches() -> Generator
Generate record batches from the data source.
- Yields:
RecordBatch – A record batch from the data source.
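Streaming over batches avoids materializing the whole file at once. The sketch below counts records that way; it assumes each yielded batch exposes a `num_rows` attribute, as `pyarrow.RecordBatch` does. `StubBatch` and `StubSource` are stand-ins invented here so the example runs without a real GTF file.

```python
from dataclasses import dataclass


@dataclass
class StubBatch:
    """Stand-in for a record batch; only num_rows is modeled."""
    num_rows: int


class StubSource:
    """Stand-in for a GtfFile that yields batches of the given sizes."""

    def __init__(self, sizes):
        self._sizes = list(sizes)

    def batches(self):
        for size in self._sizes:
            yield StubBatch(size)


def count_records(source):
    """Sum the row counts of all batches produced by source.batches()."""
    total = 0
    for batch in source.batches():
        total += batch.num_rows  # assumes a pyarrow.RecordBatch-like object
    return total


demo_total = count_records(StubSource([3, 2, 5]))
```

With a real data source, `count_records(gtf)` would be called in the same way, and memory use should be bounded by roughly one batch at a time.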
- property columns: list[str]
The top-level column names of the projection.
- dataset() -> BatchReaderDataset
Convert the data source into a dataset.
A dataset is a collection of fragments that can be processed as a single logical entity.
- Returns:
A dataset representation of the data source.
- Return type:
BatchReaderDataset
- dd(find_divisions=False)
Convert the data source to a Dask DataFrame.
- Parameters:
find_divisions (bool, optional [default: False]) – If True, find divisions for the Dask DataFrame.
- Returns:
A Dask DataFrame representation of the data source.
- Return type:
dask.dataframe.DataFrame
- fragments() -> list[BatchReaderFragment]
Get fragments of the data source.
Fragments represent parts of the data source that can be processed independently.
- Returns:
A list of fragments representing parts of the data source.
- Return type:
list of BatchReaderFragment
- pd()
Convert the dataset to a Pandas DataFrame.
- Returns:
A Pandas DataFrame representation of the dataset.
- Return type:
pandas.DataFrame
- pl(lazy=False)
Convert the data source to a Polars DataFrame or LazyFrame.
- Parameters:
lazy (bool, optional [default: False]) – If True, returns a LazyFrame.
- Returns:
A polars representation of the data source.
- Return type:
polars.DataFrame | polars.LazyFrame
- regions(regions: str | list[str]) -> Self
Query one or more genomic ranges within the data source.
This method creates a new instance of the data source with the same parameters, overriding the regions to select from the data source.
- Parameters:
regions (str | list[str]) – The regions to select from the data source. This can be a single region or a list of regions.
- Return type:
DataSource
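The page does not spell out the region string syntax. The sketch below assumes the common `chrom:start-end` form and builds such strings before passing them to `regions()`. Both helper names are hypothetical, and the accepted syntax should be checked against the library.

```python
def format_region(chrom, start=None, end=None):
    """Build a region string, assuming the 'chrom:start-end' convention."""
    if start is None and end is None:
        return chrom  # whole reference sequence
    if start is None or end is None:
        raise ValueError("start and end must be given together")
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return f"{chrom}:{start}-{end}"


def select_regions(source, intervals):
    """Call source.regions() with one region string per (chrom, start, end)."""
    region_strings = [format_region(*interval) for interval in intervals]
    return source.regions(region_strings)
```

Because `regions()` returns a new data source rather than modifying the original, the result of `select_regions` can be converted or iterated independently of the unfiltered source.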
- scanner() -> Any
Create a low-level scanner for the data source.
- property schema: Schema
The Arrow schema of the projection.
- to_dask(find_divisions=False)
Convert the data source to a Dask DataFrame.
- Parameters:
find_divisions (bool, optional [default: False]) – If True, find divisions for the Dask DataFrame.
- Returns:
A Dask DataFrame representation of the data source.
- Return type:
dask.dataframe.DataFrame
- to_duckdb(conn)
Convert the data source into a DuckDB Relation.
- Parameters:
conn (duckdb.DuckDBPyConnection) – The DuckDB connection.
- Returns:
A DuckDB Relation representation of the data source.
- Return type:
duckdb.DuckDBPyRelation
- to_ipc() -> bytes
Serialize the data source as an Arrow IPC stream.
- Returns:
The serialized data source in Arrow IPC format.
- Return type:
bytes
- to_pandas()
Convert the dataset to a Pandas DataFrame.
- Returns:
A Pandas DataFrame representation of the dataset.
- Return type:
pandas.DataFrame
- to_polars(lazy=False)
Convert the data source to a Polars DataFrame or LazyFrame.
- Parameters:
lazy (bool, optional [default: False]) – If True, returns a LazyFrame.
- Returns:
A polars representation of the data source.
- Return type:
polars.DataFrame | polars.LazyFrame
- with_attributes(attribute_defs: list[tuple[str, str]] | None = None, *, scan_rows: int = 1024) -> Self
Return a new data source with the specified attribute definitions.
- Parameters:
attribute_defs (list[tuple[str, str]] or None, optional [default: None]) – Definitions for attributes to project. These will be nested in an “attributes” column. If None (default), attribute definitions are discovered by scanning records in the file, which is controlled by the scan_rows parameter.
scan_rows (int, optional [default: 1024]) – Number of rows to scan for attribute discovery if attribute_defs is None. Set to -1 to scan the entire file, which may be slow for large files.
- Returns:
A new data source with the specified attribute definitions.
- Return type:
Self
Notes
Attribute definitions are tuples of (name, type), where type is a string indicating how to interpret the attribute values.
Attribute types:
“String”: a string value
“Array”: a comma-separated list of values
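To make the (name, type) definitions and the `scan_rows` behaviour concrete, here is a standard-library sketch. It is not oxbow's implementation. It assumes the usual GTF `key "value";` attribute syntax, treats every discovered attribute as "String", and uses hypothetical helper names throughout.

```python
import re

# Matches one `key "value";` or `key value;` pair in a GTF attributes column.
_PAIR = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^;\s]+))\s*;?')


def iter_pairs(raw):
    """Yield (key, value) pairs from a raw GTF attributes string."""
    for match in _PAIR.finditer(raw):
        quoted, bare = match.group(2), match.group(3)
        yield match.group(1), quoted if quoted is not None else bare


def discover_attribute_defs(lines, scan_rows=1024):
    """Collect attribute names from the first scan_rows records (-1: all)."""
    names = []
    scanned = 0
    for line in lines:
        if 0 <= scan_rows <= scanned:
            break  # scan budget used up; a negative budget never stops early
        columns = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(columns) < 9:
            continue  # skip comments and malformed rows
        scanned += 1
        for key, _ in iter_pairs(columns[8]):
            if key not in names:
                names.append(key)
    # Assumption: every discovered attribute is typed as "String".
    return [(name, "String") for name in names]


def interpret_attributes(raw, attribute_defs):
    """Project one record's attributes according to (name, type) definitions."""
    found = {}
    for key, value in iter_pairs(raw):
        found.setdefault(key, value)  # keep the first occurrence of each key
    record = {}
    for name, kind in attribute_defs:
        value = found.get(name)
        if value is None:
            record[name] = None  # attribute absent from this record
        elif kind == "String":
            record[name] = value
        elif kind == "Array":
            record[name] = value.split(",")  # comma-separated list of values
        else:
            raise ValueError(f"unsupported attribute type: {kind!r}")
    return record
```

A small scan budget is fast but can miss attributes that first appear late in the file, which is the trade-off behind setting `scan_rows` to -1.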