oxbow.core.GtfFile

__init__(source: str | Callable[[], IO[bytes] | str], compressed: bool = False, *, fields: Literal['*'] | list[str] | None = '*', attribute_defs: list[tuple[str, str]] | None = None, regions: str | list[str] | None = None, index: str | Callable[[], IO[bytes] | str] | None = None, batch_size: int = 131072)
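As a minimal sketch of how the constructor parameters fit together, the helper below collects keyword arguments that match the signature above. The helper names `gtf_options` and `open_gtf`, the file names, and the rule that a `.gz` suffix implies `compressed=True` are assumptions made for illustration; they are not part of oxbow.

```python
def gtf_options(path, regions=None, index=None, batch_size=131072):
    """Build keyword arguments matching the documented GtfFile signature."""
    options = {
        # Assumption: a ".gz" suffix means the file should be read as compressed.
        "compressed": path.endswith(".gz"),
        "fields": "*",  # documented default
        "batch_size": batch_size,  # documented default is 131072
    }
    # Only pass the optional arguments that were actually supplied.
    if regions is not None:
        options["regions"] = regions
    if index is not None:
        options["index"] = index
    return options


def open_gtf(path, **kwargs):
    """Open a GTF file with oxbow (imported lazily, so oxbow must be installed)."""
    from oxbow.core import GtfFile

    return GtfFile(path, **gtf_options(path, **kwargs))


# Hypothetical file names, used only to show the resulting keyword arguments.
print(gtf_options("annotations.gtf.gz", index="annotations.gtf.gz.tbi"))
```

Keeping the option-building step separate from the constructor call makes it possible to inspect the arguments without touching any file.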

Methods

`__init__`(source[, compressed, fields, ...])
`batches`()	Generate record batches from the data source.
`dataset`()	Convert the data source into a dataset.
`dd`([find_divisions])	Convert the data source to a Dask DataFrame.
`fragments`()	Get fragments of the data source.
`pd`()	Convert the dataset to a Pandas DataFrame.
`pl`([lazy])	Convert the data source to a Polars DataFrame or LazyFrame.
`regions`(regions)	Query one or more genomic ranges within the data source.
`scanner`()	Create a low-level scanner for the data source.
`to_dask`([find_divisions])	Convert the data source to a Dask DataFrame.
`to_duckdb`(conn)	Convert the data source into a DuckDB Relation.
`to_ipc`()	Serialize the data source as an Arrow IPC stream.
`to_pandas`()	Convert the dataset to a Pandas DataFrame.
`to_polars`([lazy])	Convert the data source to a Polars DataFrame or LazyFrame.
`with_attributes`([attribute_defs, scan_rows])	Return a new data source with the specified attribute definitions.

Attributes

`attribute_defs`	List of definitions for interpreting attribute records.
`columns`	The top-level column names of the projection.
`schema`	The Arrow schema of the projection.

property attribute_defs: list[tuple[str, str]]

List of definitions for interpreting attribute records.

batches() → Generator

Generate record batches from the data source.

Yields: RecordBatch – A record batch from the data source.
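Streaming batches is useful when the whole file should not be held in memory at once. The sketch below counts records batch by batch. The function name `count_records` and the file path are hypothetical, and the code assumes each yielded batch is a `pyarrow.RecordBatch`, which exposes `num_rows`.

```python
def count_records(source):
    """Count rows by streaming record batches instead of loading everything."""
    total = 0
    for batch in source.batches():
        # Assumption: each batch is a pyarrow.RecordBatch with a num_rows attribute.
        total += batch.num_rows
    return total


# Usage with oxbow installed (hypothetical path):
# from oxbow.core import GtfFile
# print(count_records(GtfFile("annotations.gtf")))
```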

property columns: list[str]

The top-level column names of the projection.

dataset() → BatchReaderDataset

Convert the data source into a dataset.

A dataset is a collection of fragments that can be processed as a single logical entity.

Returns: A dataset representation of the data source.
Return type: BatchReaderDataset

dd(find_divisions=False)

Convert the data source to a Dask DataFrame.

Parameters: find_divisions (bool, optional [default: False]) – If True, find divisions for the Dask DataFrame.
Returns: A Dask DataFrame representation of the data source.
Return type: dask.dataframe.DataFrame

fragments() → list[BatchReaderFragment]

Get fragments of the data source.

Fragments represent parts of the data source that can be processed independently.

Returns: A list of fragments representing parts of the data source.
Return type: list of BatchReaderFragment

pd()

Convert the data source to a Pandas DataFrame.

Returns: A Pandas DataFrame representation of the data source.
Return type: pandas.DataFrame

pl(lazy=False)

Convert the data source to a Polars DataFrame or LazyFrame.

Parameters: lazy (bool, optional [default: False]) – If True, returns a LazyFrame.
Returns: A Polars representation of the data source.
Return type: polars.DataFrame | polars.LazyFrame

regions(regions: str | list[str]) → Self

Query one or more genomic ranges within the data source.

This method creates a new instance of the data source with the same parameters, overriding the regions to select from the data source.

Parameters: regions (str | list[str]) – The regions to select from the data source. This can be a single region or a list of regions.
Returns: A new data source instance restricted to the given regions.
Return type: DataSource
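The following sketch shows one way to build region strings and pass them to `regions()`. The `chrom:start-end` notation is an assumption (the text above does not specify the string format), and the helper names `format_region` and `select_regions` are hypothetical.

```python
def format_region(chrom, start=None, end=None):
    """Format a region string; the "chrom:start-end" notation is an assumption."""
    if start is None or end is None:
        return chrom  # no coordinates given: name the reference sequence only
    return f"{chrom}:{start}-{end}"


def select_regions(source, intervals):
    """Return a new data source restricted to (chrom, start, end) tuples."""
    return source.regions([format_region(*interval) for interval in intervals])


# Usage with oxbow installed (hypothetical path and coordinates):
# from oxbow.core import GtfFile
# subset = select_regions(GtfFile("annotations.gtf.gz", compressed=True),
#                         [("chr1", 10000, 20000), ("chr2", 1, 5000)])
```

Because `regions()` returns a new instance, the original data source can still be used for other queries afterwards.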

scanner() → Any

Create a low-level scanner for the data source.

property schema: Schema

The Arrow schema of the projection.

to_dask(find_divisions=False)

Convert the data source to a Dask DataFrame.

Parameters: find_divisions (bool, optional [default: False]) – If True, find divisions for the Dask DataFrame.
Returns: A Dask DataFrame representation of the data source.
Return type: dask.dataframe.DataFrame

to_duckdb(conn)

Convert the data source into a DuckDB Relation.

Parameters: conn (duckdb.DuckDBPyConnection) – The DuckDB connection.
Returns: A DuckDB Relation representation of the data source.
Return type: duckdb.DuckDBPyRelation
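A short sketch of using the returned relation: it previews a few rows with the relation's `limit` and `fetchall` methods. The function name `preview_rows` and the file path are hypothetical, and no column names are assumed.

```python
def preview_rows(source, conn, n=5):
    """Fetch the first n rows of the data source through DuckDB."""
    relation = source.to_duckdb(conn)  # documented to return a DuckDBPyRelation
    return relation.limit(n).fetchall()


# Usage with oxbow and duckdb installed (hypothetical path):
# import duckdb
# from oxbow.core import GtfFile
# rows = preview_rows(GtfFile("annotations.gtf"), duckdb.connect())
```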

to_ipc() → bytes

Serialize the data source as an Arrow IPC stream.

Returns: The serialized data source in Arrow IPC format.
Return type: bytes
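Since `to_ipc()` returns plain bytes, the result can be written straight to disk or sent over a network. The sketch below saves it to a file. The function name `save_ipc` and the file names are hypothetical.

```python
from pathlib import Path


def save_ipc(source, path):
    """Write the Arrow IPC stream bytes to a file and return the byte count."""
    data = source.to_ipc()
    Path(path).write_bytes(data)
    return len(data)


# Usage with oxbow installed (hypothetical paths):
# from oxbow.core import GtfFile
# save_ipc(GtfFile("annotations.gtf"), "annotations.arrows")
```

A file written this way should be readable by an Arrow IPC stream reader, for example `pyarrow.ipc.open_stream`.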

to_pandas()

Convert the data source to a Pandas DataFrame.

Returns: A Pandas DataFrame representation of the data source.
Return type: pandas.DataFrame

to_polars(lazy=False)

Convert the data source to a Polars DataFrame or LazyFrame.

Parameters: lazy (bool, optional [default: False]) – If True, returns a LazyFrame.
Returns: A Polars representation of the data source.
Return type: polars.DataFrame | polars.LazyFrame
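With `lazy=True` the result is a `polars.LazyFrame`, so a query can be built first and materialized later with `collect()`. A minimal sketch, in which the function name `first_rows` and the file path are hypothetical:

```python
def first_rows(source, n=5):
    """Build a lazy query and collect only the first n rows."""
    lazy_frame = source.to_polars(lazy=True)  # polars.LazyFrame when lazy=True
    return lazy_frame.head(n).collect()  # collect() returns a polars.DataFrame


# Usage with oxbow and polars installed (hypothetical path):
# from oxbow.core import GtfFile
# print(first_rows(GtfFile("annotations.gtf"), n=10))
```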

with_attributes(attribute_defs: list[tuple[str, str]] | None = None, *, scan_rows: int = 1024) → Self

Return a new data source with the specified attribute definitions.

Parameters:

attribute_defs (list[tuple[str, str]] or None, optional [default: None]) – Definitions for attributes to project. These will be nested in an “attributes” column. If None (default), attribute definitions are discovered by scanning records in the file, which is controlled by the scan_rows parameter.
scan_rows (int, optional [default: 1024]) – Number of rows to scan for attribute discovery if attribute_defs is None. Set to -1 to scan the entire file, which may be slow for large files.

Returns:

A new data source with the specified attribute definitions.

Return type:

Self

Notes

Attribute definitions are tuples of (name, type), where type is a string indicating how to interpret the attribute values.

Attribute types:

“String”: a string value

“Array”: a comma-separated list of values
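To make the two attribute types concrete, here is a small pure-Python sketch of how (name, type) definitions could be applied to raw attribute strings. It illustrates the semantics described above and is not oxbow's implementation. The function `interpret_attributes` and the attribute names are hypothetical.

```python
def interpret_attributes(raw, attribute_defs):
    """Interpret raw attribute strings according to (name, type) definitions."""
    parsed = {}
    for name, kind in attribute_defs:
        value = raw.get(name)
        if value is None:
            parsed[name] = None  # attribute absent from this record
        elif kind == "String":
            parsed[name] = value  # keep the value as a single string
        elif kind == "Array":
            parsed[name] = value.split(",")  # comma-separated list of values
        else:
            raise ValueError(f"unknown attribute type: {kind!r}")
    return parsed


# Hypothetical definitions and record, for illustration only.
defs = [("gene_id", "String"), ("tags", "Array")]
print(interpret_attributes({"gene_id": "G1", "tags": "a,b"}, defs))
```

A list shaped like `defs` is what `with_attributes(attribute_defs=defs)` expects. Passing `None` instead triggers discovery over the first `scan_rows` records.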
