oxbow.arrow.BatchReaderFragment

oxbow.arrow.BatchReaderFragment#

class oxbow.arrow.BatchReaderFragment(make_batchreader: Callable[[list[str] | None, int], RecordBatchReader | Iterator[RecordBatch]], schema: Schema, batch_size: int = 131072, partition_expression: Expression = None)#

A Fragment that emits RecordBatches from a reproducible source.

To provide stateless replay, a new record batch iterator over the same records is constructed whenever a scanner is requested.

Parameters:

make_batchreader (Callable[[list[str] | None, int], RecordBatchIter]) – A function that recreates a specific stream of record batches.
schema (pyarrow.Schema) – The schema of the RecordBatches.
batch_size (int, optional) – The maximum row count for scanned record batches.
partition_expression (pyarrow.dataset.Expression, optional) – A partition expression for the fragment, by default None.

schema#

The schema of the RecordBatches in the fragment.

Type:: pyarrow.Schema

partition_expression#

An expression that evaluates to true for all data viewed by this fragment.

Type:: pyarrow.dataset.Expression

__init__(make_batchreader: Callable[[list[str] | None, int], RecordBatchReader | Iterator[RecordBatch]], schema: Schema, batch_size: int = 131072, partition_expression: Expression = None)[source]#

Create a BatchReaderFragment from a BatchReader factory function.

Parameters:

make_batchreader (Callable[[list[str] | None, int], RecordBatchIter]) – A function that recreates a specific stream of record batches.
schema (pyarrow.Schema) – The schema of the RecordBatches.
batch_size (int, optional) – The maximum row count for scanned record batches.
partition_expression (pyarrow.dataset.Expression, optional) – A partition expression for the fragment, by default None.

Notes

The make_batchreader function should accept the following arguments and return a pyarrow.RecordBatchReader:

columns: The columns to project.
batch_size: The maximum number of rows per batch.

Methods

`__init__`(make_batchreader, schema[, ...])	Create a BatchReaderFragment from a BatchReader factory function.
`count_rows`([filter, batch_size, ...])
`head`(num_rows[, columns, filter, ...])
`iter_batches`([columns, batch_size])	Iterate over batches in the fragment.
`scanner`([schema, columns, filter, ...])	Build a scan operation against the fragment.
`take`(indices[, columns, filter, batch_size, ...])
`to_batches`([schema, columns, filter, ...])	Scan and read the fragment as materialized record batches.
`to_table`([schema, columns, filter, ...])

Attributes

`partition_expression`	An expression that evaluates to true for all data viewed by this fragment.
`physical_schema`
`schema`	The schema of the RecordBatches in the fragment.

iter_batches(columns: list[str] | None = None, batch_size: int | None = None) → Iterator[RecordBatch][source]#

Iterate over batches in the fragment.

Parameters:

columns (list[str], optional) – Names of columns to project. By default, all available columns are projected.
batch_size (int, optional) – The maximum row count for scanned record batches.

Returns:

An iterator of record batches.

Return type:

Iterator[pyarrow.RecordBatch]

property partition_expression: Expression#

An expression that evaluates to true for all data viewed by this fragment.

Return type:: pyarrow.dataset.Expression

scanner(schema: pa.Schema | None = None, columns: list[str] | None = None, filter: ds.Expression | None = None, batch_size: int | None = None, batch_readahead: int = 16, fragment_readahead: int = 4, fragment_scan_options: ds.FragmentScanOptions | None = None, use_threads: bool = True, memory_pool: ds.MemoryPool | None = None, **kwargs) → ds.Scanner[source]#

Build a scan operation against the fragment.

Parameters:

schema (pyarrow.Schema, optional) – The schema to use for scanning. If not specified, uses the Fragment’s physical schema.
columns (list[str], optional) – Names of columns to project. By default, all available columns are projected.
filter (pyarrow.dataset.Expression, optional) – A filter expression. Scan will return only the rows matching the filter.
batch_size (int, optional) – The maximum row count for scanned record batches.
batch_readahead (int, optional) – The number of batches to read ahead in a file.
fragment_readahead (int, optional) – The number of fragments/files to read ahead.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions, optional) – Options specific to a particular scan and fragment type, by default None.
use_threads (bool, optional) – If enabled, maximum parallelism will be used, by default True.
memory_pool (pyarrow.dataset.MemoryPool, optional) – For memory allocations, if required. By default, uses the default pool.

Returns:

A scanner object for the fragment.

Return type:

pyarrow.dataset.Scanner

property schema: Schema#

The schema of the RecordBatches in the fragment.

Return type:: pyarrow.Schema

to_batches(schema: pa.Schema | None = None, columns: list[str] | None = None, filter: ds.Expression | None = None, batch_size: int | None = None, batch_readahead: int = 16, fragment_readahead: int = 4, fragment_scan_options: ds.FragmentScanOptions | None = None, use_threads: bool = True, memory_pool: ds.MemoryPool | None = None, **kwargs) → Iterator[pa.RecordBatch][source]#

Scan and read the fragment as materialized record batches.

Projections and filters are applied if specified.

Parameters:

schema (pyarrow.Schema, optional) – The schema to use for scanning. If not specified, uses the Fragment’s physical schema.
columns (list[str], optional) – Names of columns to project. By default, all available columns are projected.
filter (pyarrow.dataset.Expression, optional) – A filter expression. Scan will return only the rows matching the filter.
batch_size (int, optional) – The maximum row count for scanned record batches.
batch_readahead (int, optional) – The number of batches to read ahead in a file.
fragment_readahead (int, optional) – The number of fragments/files to read ahead,.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions, optional) – Options specific to a particular scan and fragment type, by default None.
use_threads (bool, optional) – If enabled, maximum parallelism will be used, by default True.
memory_pool (pyarrow.dataset.MemoryPool, optional) – For memory allocations, if required. By default, uses the default pool.

Returns:

An iterator of record batches.

Return type:

Iterator[pyarrow.RecordBatch]

oxbow.arrow.BatchReaderFragment

Contents

oxbow.arrow.BatchReaderFragment#