oxbow.core.PyVcfScanner

class oxbow.core.PyVcfScanner(src, compressed=False, fields=None, info_fields=None, genotype_fields=None, samples=None, genotype_by=None)

A VCF file scanner.

Parameters:

src (str or file-like) – The path to the VCF file or a file-like object.
compressed (bool, optional [default: False]) – Whether the source is BGZF-compressed.
fields (list[str], optional) – Names of the fixed fields to project.
info_fields (list[str], optional) – Names of the INFO fields to project.
genotype_fields (list[str], optional) – Names of the sample-specific genotype fields to project.
samples (list[str], optional) – Names of the samples to include in the genotype fields.
genotype_by (Literal["sample", "field"], optional [default: "sample"]) – How to project the genotype fields. If “sample”, the columns correspond to the samples. If “field”, the columns correspond to the genotype fields.
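The constructor parameters above can be exercised with a small wrapper. This is a minimal sketch, not part of oxbow: the helper name `open_vcf`, the suffix-based guess for `compressed`, and the `scanner_cls` hook (used only so the helper can be tried without oxbow installed) are all assumptions made for illustration.

```python
from pathlib import Path


def open_vcf(path, scanner_cls=None, **projection):
    """Construct a scanner, guessing `compressed` from the file suffix.

    `projection` is forwarded unchanged and may hold the documented keywords:
    fields, info_fields, genotype_fields, samples, genotype_by.
    """
    if scanner_cls is None:
        # Imported lazily so the helper can be exercised without oxbow installed.
        from oxbow.core import PyVcfScanner as scanner_cls
    # Assumption: BGZF-compressed VCF files carry a .gz or .bgz suffix.
    compressed = Path(path).suffix in {".gz", ".bgz"}
    return scanner_cls(str(path), compressed=compressed, **projection)
```

A call such as `open_vcf("calls.vcf.gz", info_fields=["DP"], genotype_by="field")` would then pass `compressed=True` along with the projection keywords; the file name and the `DP` field are placeholders.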

__init__()

Methods

`__init__`()
`chrom_names`()	Return the names of the reference sequences.
`chrom_sizes`()	Return the names of the reference sequences and their lengths in bp.
`field_names`()	Return the names of the fixed fields.
`genotype_field_defs`()	Return the definitions of the FORMAT fields.
`genotype_field_names`()	Return the names of the FORMAT fields.
`info_field_defs`()	Return the definitions of the INFO fields.
`info_field_names`()	Return the names of the INFO fields.
`sample_names`()	Return the names of the samples.
`scan`([columns, batch_size, limit])	Scan batches of records from the file.
`scan_byte_ranges`(byte_ranges[, columns, ...])	Scan batches of records from specified byte ranges in the file.
`scan_query`(region[, index, columns, ...])	Scan batches of records from a genomic range query on a BGZF-encoded file.
`scan_virtual_ranges`(vpos_ranges[, columns, ...])	Scan batches of records from virtual position ranges in a BGZF file.
`schema`()	Return the Arrow schema.

chrom_names()

Return the names of the reference sequences.

chrom_sizes()

Return the names of the reference sequences and their lengths in bp.

field_names()

Return the names of the fixed fields.

genotype_field_defs()

Return the definitions of the FORMAT fields.

genotype_field_names()

Return the names of the FORMAT fields.

info_field_defs()

Return the definitions of the INFO fields.

info_field_names()

Return the names of the INFO fields.

sample_names()

Return the names of the samples.
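One use for the reference sequence names and lengths is to tile each sequence into fixed-size windows expressed as "chr:start-end" strings. The sketch below is illustrative only: `tile_regions` is a hypothetical helper, it assumes the sizes are available as (name, length) pairs, and it assumes 1-based, inclusive coordinates, none of which is stated in this reference.

```python
def tile_regions(chrom_sizes, window):
    """Yield "chr:start-end" strings covering each reference sequence.

    Assumes `chrom_sizes` is an iterable of (name, length) pairs and that
    coordinates are 1-based and inclusive.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    for name, length in chrom_sizes:
        for start in range(1, length + 1, window):
            # Clamp the final window to the end of the sequence.
            end = min(start + window - 1, length)
            yield f"{name}:{start}-{end}"


regions = list(tile_regions([("chr1", 250), ("chrM", 100)], window=100))
```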

scan(columns=None, batch_size=1024, limit=None)

Scan batches of records from the file.

Parameters:

columns (list[str], optional) – Names of the columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, records are scanned until EOF.

Returns:

An iterator yielding Arrow record batches.

Return type:

arro3 RecordBatchReader (pycapsule)
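To make the interaction between `batch_size` and `limit` concrete, the arithmetic can be sketched in plain Python. This assumes that every batch except possibly the last is full and that `limit` caps the total record count; the reference does not spell out either point, and `expected_batch_sizes` is a hypothetical helper rather than an oxbow function.

```python
def expected_batch_sizes(total_records, batch_size=1024, limit=None):
    """Return the batch lengths a scan would produce under the stated assumptions."""
    # With no limit, every record in the file is scanned.
    n = total_records if limit is None else min(total_records, limit)
    full, rest = divmod(n, batch_size)
    # All batches are full except possibly a shorter final one.
    return [batch_size] * full + ([rest] if rest else [])


sizes_default = expected_batch_sizes(2500)
sizes_limited = expected_batch_sizes(2500, batch_size=1000, limit=1500)
```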

scan_byte_ranges(byte_ranges, columns=None, batch_size=1024, limit=None)

Scan batches of records from specified byte ranges in the file.

The byte positions must align with record boundaries.

Parameters:

byte_ranges (list[tuple[int, int]]) – List of (start, end) byte position tuples to read from.
columns (list[str], optional) – Names of the columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, all records in the specified ranges are scanned.

Returns:

An iterator yielding Arrow record batches.

Return type:

arro3 RecordBatchReader (pycapsule)
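Because the byte positions must fall on record boundaries, the ranges usually have to be derived from the file itself. The following sketch computes such ranges for an uncompressed VCF read as a binary stream. `record_byte_ranges` is a hypothetical helper, and treating `end` as exclusive is an assumption, since the reference does not say whether the end position is inclusive.

```python
import io


def record_byte_ranges(stream, records_per_range):
    """Return (start, end) byte offsets that each span whole VCF records.

    Header lines (starting with "#") are skipped; `end` is exclusive here.
    """
    if records_per_range <= 0:
        raise ValueError("records_per_range must be positive")
    ranges = []
    offset = 0    # byte offset of the line being examined
    start = None  # start offset of the range being built
    count = 0     # records collected in the range being built
    for line in stream:
        is_record = bool(line.strip()) and not line.startswith(b"#")
        if is_record:
            if start is None:
                start = offset
            count += 1
        offset += len(line)
        if is_record and count == records_per_range:
            ranges.append((start, offset))
            start, count = None, 0
    if start is not None:
        # Flush the final, possibly shorter, range.
        ranges.append((start, offset))
    return ranges


# Small in-memory example with made-up records.
header = b"##fileformat=VCFv4.3\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
records = [
    b"1\t100\t.\tA\tG\t.\t.\t.\n",
    b"1\t200\t.\tC\tT\t.\t.\t.\n",
    b"2\t50\t.\tG\tA\t.\t.\t.\n",
]
ranges = record_byte_ranges(io.BytesIO(header + b"".join(records)), 2)
```

Each resulting pair starts at the first byte of a record and stops just past a trailing newline, so no range splits a record.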

scan_query(region, index=None, columns=None, batch_size=1024, limit=None)

Scan batches of records from a genomic range query on a BGZF-encoded file.

This operation requires an index file.

Parameters:

region (str) – Genomic region in the format “chr:start-end”.
index (path or file-like, optional) – The index file to use for querying the region. If None and the source was provided as a path, the scanner attempts to load the index from the same path with an additional extension.
columns (list[str], optional) – Names of the columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, all records intersecting the query range are scanned.

Returns:

An iterator yielding Arrow record batches.

Return type:

arro3 RecordBatchReader (pycapsule)
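The "chr:start-end" region format can be validated before a query is issued. The parser below is a sketch and not oxbow's own parsing logic: `parse_region` is a hypothetical helper, and it splits on the last colon so that sequence names containing colons are still handled.

```python
def parse_region(region):
    """Split "chr:start-end" into (chrom, start, end) with integer bounds."""
    # Split on the last colon so names such as "HLA-A*01:01" stay intact.
    chrom, sep, interval = region.rpartition(":")
    if not sep or not chrom:
        raise ValueError(f"expected 'chr:start-end', got {region!r}")
    start_text, dash, end_text = interval.partition("-")
    if not dash:
        raise ValueError(f"expected 'chr:start-end', got {region!r}")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"start exceeds end in {region!r}")
    return chrom, start, end


simple = parse_region("chr1:1000-2000")
with_colons = parse_region("HLA-A*01:01:100-200")
```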

scan_virtual_ranges(vpos_ranges, columns=None, batch_size=1024, limit=None)

Scan batches of records from virtual position ranges in a BGZF file.

The virtual positions must align with record boundaries: the compressed offset must point to the beginning of a BGZF block, and the uncompressed offset must point to the beginning or end of a record decoded within that block.

Parameters:

vpos_ranges (list[tuple[vpos, vpos]]) – List of virtual position ranges as pairs. Each virtual position can be given as either a packed virtual position (int), or an unpacked tuple of ints (c, u) specifying the compressed and uncompressed offsets, respectively.
columns (list[str], optional) – Names of the columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, all records in the specified ranges are scanned.

Returns:

An iterator yielding Arrow record batches.

Return type:

arro3 RecordBatchReader (pycapsule)
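The packed and unpacked forms of a virtual position are related by the standard BGZF layout, in which the compressed offset occupies the upper 48 bits and the uncompressed offset the lower 16 bits. The helpers below are a sketch of that conversion; `pack_vpos` and `unpack_vpos` are hypothetical names and not part of oxbow.

```python
def pack_vpos(coffset, uoffset):
    """Pack (compressed offset, uncompressed offset) into one integer."""
    if not 0 <= uoffset < (1 << 16):
        raise ValueError("uncompressed offset must fit in 16 bits")
    if not 0 <= coffset < (1 << 48):
        raise ValueError("compressed offset must fit in 48 bits")
    return (coffset << 16) | uoffset


def unpack_vpos(vpos):
    """Split a packed virtual position into (compressed, uncompressed) offsets."""
    return vpos >> 16, vpos & 0xFFFF


packed = pack_vpos(4096, 12)
unpacked = unpack_vpos(packed)
```

Under this layout, a range such as `(pack_vpos(0, 0), pack_vpos(4096, 12))` and its unpacked equivalent `((0, 0), (4096, 12))` describe the same span; the offsets here are placeholders.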

schema()

Return the Arrow schema.

Return type:

arro3 Schema (pycapsule)
