oxbow.arrow.BatchReaderFragment#
- class oxbow.arrow.BatchReaderFragment(make_batchreader: Callable[[list[str] | None, int], RecordBatchReader | Iterator[RecordBatch]], schema: Schema, batch_size: int = 131072, partition_expression: Expression = None)#
A Fragment that emits RecordBatches from a reproducible source.
To provide stateless replay, a new record batch iterator over the same records is constructed whenever a scanner is requested.
- Parameters:
make_batchreader (Callable[[list[str] | None, int], RecordBatchIter]) – A function that recreates a specific stream of record batches.
schema (pyarrow.Schema) – The schema of the RecordBatches.
batch_size (int, optional) – The maximum row count for scanned record batches.
partition_expression (pyarrow.dataset.Expression, optional) – A partition expression for the fragment, by default None.
- schema#
The schema of the RecordBatches in the fragment.
- Type:
pyarrow.Schema
- partition_expression#
An expression that evaluates to true for all data viewed by this fragment.
- Type:
pyarrow.dataset.Expression
- __init__(make_batchreader: Callable[[list[str] | None, int], RecordBatchReader | Iterator[RecordBatch]], schema: Schema, batch_size: int = 131072, partition_expression: Expression = None)[source]#
Create a BatchReaderFragment from a BatchReader factory function.
- Parameters:
make_batchreader (Callable[[list[str] | None, int], RecordBatchIter]) – A function that recreates a specific stream of record batches.
schema (pyarrow.Schema) – The schema of the RecordBatches.
batch_size (int, optional) – The maximum row count for scanned record batches.
partition_expression (pyarrow.dataset.Expression, optional) – A partition expression for the fragment, by default None.
Notes
The make_batchreader function should accept the following arguments and return a pyarrow.RecordBatchReader:
columns: The columns to project.
batch_size: The maximum number of rows per batch.
Methods
__init__(make_batchreader, schema[, ...])Create a BatchReaderFragment from a BatchReader factory function.
count_rows([filter, batch_size, ...])head(num_rows[, columns, filter, ...])iter_batches([columns, batch_size])Iterate over batches in the fragment.
scanner([schema, columns, filter, ...])Build a scan operation against the fragment.
take(indices[, columns, filter, batch_size, ...])to_batches([schema, columns, filter, ...])Scan and read the fragment as materialized record batches.
to_table([schema, columns, filter, ...])Attributes
An expression that evaluates to true for all data viewed by this fragment.
physical_schemaThe schema of the RecordBatches in the fragment.
- iter_batches(columns: list[str] | None = None, batch_size: int | None = None) Iterator[RecordBatch][source]#
Iterate over batches in the fragment.
- Parameters:
columns (list[str], optional) – Names of columns to project. By default, all available columns are projected.
batch_size (int, optional) – The maximum row count for scanned record batches.
- Returns:
An iterator of record batches.
- Return type:
Iterator[pyarrow.RecordBatch]
- property partition_expression: Expression#
An expression that evaluates to true for all data viewed by this fragment.
- Return type:
pyarrow.dataset.Expression
- scanner(schema: pa.Schema | None = None, columns: list[str] | None = None, filter: ds.Expression | None = None, batch_size: int | None = None, batch_readahead: int = 16, fragment_readahead: int = 4, fragment_scan_options: ds.FragmentScanOptions | None = None, use_threads: bool = True, memory_pool: ds.MemoryPool | None = None, **kwargs) ds.Scanner[source]#
Build a scan operation against the fragment.
- Parameters:
schema (pyarrow.Schema, optional) – The schema to use for scanning. If not specified, uses the Fragment’s physical schema.
columns (list[str], optional) – Names of columns to project. By default, all available columns are projected.
filter (pyarrow.dataset.Expression, optional) – A filter expression. Scan will return only the rows matching the filter.
batch_size (int, optional) – The maximum row count for scanned record batches.
batch_readahead (int, optional) – The number of batches to read ahead in a file.
fragment_readahead (int, optional) – The number of fragments/files to read ahead.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions, optional) – Options specific to a particular scan and fragment type, by default None.
use_threads (bool, optional) – If enabled, maximum parallelism will be used, by default True.
memory_pool (pyarrow.dataset.MemoryPool, optional) – For memory allocations, if required. By default, uses the default pool.
- Returns:
A scanner object for the fragment.
- Return type:
pyarrow.dataset.Scanner
- property schema: Schema#
The schema of the RecordBatches in the fragment.
- Return type:
pyarrow.Schema
- to_batches(schema: pa.Schema | None = None, columns: list[str] | None = None, filter: ds.Expression | None = None, batch_size: int | None = None, batch_readahead: int = 16, fragment_readahead: int = 4, fragment_scan_options: ds.FragmentScanOptions | None = None, use_threads: bool = True, memory_pool: ds.MemoryPool | None = None, **kwargs) Iterator[pa.RecordBatch][source]#
Scan and read the fragment as materialized record batches.
Projections and filters are applied if specified.
- Parameters:
schema (pyarrow.Schema, optional) – The schema to use for scanning. If not specified, uses the Fragment’s physical schema.
columns (list[str], optional) – Names of columns to project. By default, all available columns are projected.
filter (pyarrow.dataset.Expression, optional) – A filter expression. Scan will return only the rows matching the filter.
batch_size (int, optional) – The maximum row count for scanned record batches.
batch_readahead (int, optional) – The number of batches to read ahead in a file.
fragment_readahead (int, optional) – The number of fragments/files to read ahead,.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions, optional) – Options specific to a particular scan and fragment type, by default None.
use_threads (bool, optional) – If enabled, maximum parallelism will be used, by default True.
memory_pool (pyarrow.dataset.MemoryPool, optional) – For memory allocations, if required. By default, uses the default pool.
- Returns:
An iterator of record batches.
- Return type:
Iterator[pyarrow.RecordBatch]