oxbow.arrow.BatchReaderFragment#

class oxbow.arrow.BatchReaderFragment(make_batchreader: Callable[[list[str] | None, int], RecordBatchReader | Iterator[RecordBatch]], schema: Schema, batch_size: int = 131072, partition_expression: Expression = None, tokenize: Any = None)#

A Fragment that emits RecordBatches from a reproducible source.

To provide stateless replay, a new record batch iterator over the same records is constructed whenever a scanner is requested.

Parameters:
  • make_batchreader (Callable[[list[str] | None, int], RecordBatchIter]) – A function that recreates a specific stream of record batches.

  • schema (pyarrow.Schema) – The schema of the RecordBatches.

  • batch_size (int, optional) – The maximum row count for scanned record batches.

  • partition_expression (pyarrow.dataset.Expression, optional) – A partition expression for the fragment, by default None.

schema#

The schema of the RecordBatches in the fragment.

Type:

pyarrow.Schema

partition_expression#

An expression that evaluates to true for all data viewed by this fragment.

Type:

pyarrow.dataset.Expression

__init__(make_batchreader: Callable[[list[str] | None, int], RecordBatchReader | Iterator[RecordBatch]], schema: Schema, batch_size: int = 131072, partition_expression: Expression = None, tokenize: Any = None)[source]#

Create a BatchReaderFragment from a BatchReader factory function.

Parameters:
  • make_batchreader (Callable[[list[str] | None, int], RecordBatchIter]) – A function that recreates a specific stream of record batches.

  • schema (pyarrow.Schema) – The schema of the RecordBatches.

  • batch_size (int, optional) – The maximum row count for scanned record batches.

  • partition_expression (pyarrow.dataset.Expression, optional) – A partition expression for the fragment, by default None.

Notes

The make_batchreader function should accept the following arguments and return a pyarrow.RecordBatchReader:

  • columns: The columns to project.

  • batch_size: The maximum number of rows per batch.

Methods

__init__(make_batchreader, schema[, ...])

Create a BatchReaderFragment from a BatchReader factory function.

count_rows([filter, batch_size, ...])

head(num_rows[, columns, filter, ...])

iter_batches([columns, batch_size])

Iterate over batches in the fragment.

scanner([schema, columns, filter, ...])

Build a scan operation against the fragment.

take(indices[, columns, filter, batch_size, ...])

to_batches([schema, columns, filter, ...])

Scan and read the fragment as materialized record batches.

to_table([schema, columns, filter, ...])

Attributes

partition_expression

An expression that evaluates to true for all data viewed by this fragment.

physical_schema

schema

The schema of the RecordBatches in the fragment.

iter_batches(columns: list[str] | None = None, batch_size: int | None = None) Iterator[RecordBatch][source]#

Iterate over batches in the fragment.

Parameters:
  • columns (list[str], optional) – Names of columns to project. By default, all available columns are projected.

  • batch_size (int, optional) – The maximum row count for scanned record batches.

Returns:

An iterator of record batches.

Return type:

Iterator[pyarrow.RecordBatch]

property partition_expression: Expression#

An expression that evaluates to true for all data viewed by this fragment.

Return type:

pyarrow.dataset.Expression

scanner(schema: pa.Schema | None = None, columns: list[str] | None = None, filter: ds.Expression | None = None, batch_size: int | None = None, batch_readahead: int = 16, fragment_readahead: int = 4, fragment_scan_options: ds.FragmentScanOptions | None = None, use_threads: bool = True, memory_pool: ds.MemoryPool | None = None, **kwargs) ds.Scanner[source]#

Build a scan operation against the fragment.

Parameters:
  • schema (pyarrow.Schema, optional) – The schema to use for scanning. If not specified, uses the Fragment’s physical schema.

  • columns (list[str], optional) – Names of columns to project. By default, all available columns are projected.

  • filter (pyarrow.dataset.Expression, optional) – A filter expression. Scan will return only the rows matching the filter.

  • batch_size (int, optional) – The maximum row count for scanned record batches.

  • batch_readahead (int, optional) – The number of batches to read ahead in a file.

  • fragment_readahead (int, optional) – The number of fragments/files to read ahead.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions, optional) – Options specific to a particular scan and fragment type, by default None.

  • use_threads (bool, optional) – If enabled, maximum parallelism will be used, by default True.

  • memory_pool (pyarrow.dataset.MemoryPool, optional) – For memory allocations, if required. By default, uses the default pool.

Returns:

A scanner object for the fragment.

Return type:

pyarrow.dataset.Scanner

property schema: Schema#

The schema of the RecordBatches in the fragment.

Return type:

pyarrow.Schema

to_batches(schema: pa.Schema | None = None, columns: list[str] | None = None, filter: ds.Expression | None = None, batch_size: int | None = None, batch_readahead: int = 16, fragment_readahead: int = 4, fragment_scan_options: ds.FragmentScanOptions | None = None, use_threads: bool = True, memory_pool: ds.MemoryPool | None = None, **kwargs) Iterator[pa.RecordBatch][source]#

Scan and read the fragment as materialized record batches.

Projections and filters are applied if specified.

Parameters:
  • schema (pyarrow.Schema, optional) – The schema to use for scanning. If not specified, uses the Fragment’s physical schema.

  • columns (list[str], optional) – Names of columns to project. By default, all available columns are projected.

  • filter (pyarrow.dataset.Expression, optional) – A filter expression. Scan will return only the rows matching the filter.

  • batch_size (int, optional) – The maximum row count for scanned record batches.

  • batch_readahead (int, optional) – The number of batches to read ahead in a file.

  • fragment_readahead (int, optional) – The number of fragments/files to read ahead,.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions, optional) – Options specific to a particular scan and fragment type, by default None.

  • use_threads (bool, optional) – If enabled, maximum parallelism will be used, by default True.

  • memory_pool (pyarrow.dataset.MemoryPool, optional) – For memory allocations, if required. By default, uses the default pool.

Returns:

An iterator of record batches.

Return type:

Iterator[pyarrow.RecordBatch]