oxbow.core.PyVcfScanner
- class oxbow.core.PyVcfScanner(src, compressed=False, fields=None, info_fields=None, genotype_fields=None, samples=None, genotype_by=None)
A VCF file scanner.
- Parameters:
src (str or file-like) – The path to the VCF file or a file-like object.
compressed (bool, optional [default: False]) – Whether the source is BGZF-compressed.
fields (list[str], optional) – Names of the fixed fields to project.
info_fields (list[str], optional) – Names of the INFO fields to project.
genotype_fields (list[str], optional) – Names of the sample-specific genotype fields to project.
samples (list[str], optional) – Names of the samples to include in the genotype fields.
genotype_by (Literal["sample", "field"], optional [default: "sample"]) – How to project the genotype fields. If “sample”, the columns correspond to the samples. If “field”, the columns correspond to the genotype fields.
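As a rough illustration of the two genotype_by layouts, the sketch below regroups toy genotype values in plain Python. The sample names, field names, and the nested-dict shape are assumptions made for illustration only; it does not call oxbow and does not show the Arrow column types the scanner actually produces.

```python
# Toy genotype values for one VCF record, keyed by (sample, field).
# Sample and field names here are made up for illustration.
calls = {
    ("NA001", "GT"): "0/1",
    ("NA001", "DP"): 12,
    ("NA002", "GT"): "1/1",
    ("NA002", "DP"): 30,
}


def group_genotypes(calls, genotype_by="sample"):
    """Group values so that the top-level keys play the role of columns."""
    if genotype_by not in ("sample", "field"):
        raise ValueError("genotype_by must be 'sample' or 'field'")
    columns = {}
    for (sample, field), value in calls.items():
        # Choose which key becomes the column and which is nested inside it.
        if genotype_by == "sample":
            outer, inner = sample, field
        else:
            outer, inner = field, sample
        columns.setdefault(outer, {})[inner] = value
    return columns


by_sample = group_genotypes(calls, "sample")  # one entry per sample
by_field = group_genotypes(calls, "field")    # one entry per genotype field
```

With "sample", the top-level keys are the sample names; with "field", they are the genotype field names.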
- __init__()
Methods
__init__()
chrom_names() – Return the names of the reference sequences.
chrom_sizes() – Return the names of the reference sequences and their lengths in bp.
field_names() – Return the names of the fixed fields.
genotype_field_defs() – Return the definitions of the FORMAT fields.
genotype_field_names() – Return the names of the FORMAT fields.
info_field_defs() – Return the definitions of the INFO fields.
info_field_names() – Return the names of the INFO fields.
sample_names() – Return the names of the samples.
scan([columns, batch_size, limit]) – Scan batches of records from the file.
scan_byte_ranges(byte_ranges[, columns, ...]) – Scan batches of records from specified byte ranges in the file.
scan_query(region[, index, columns, ...]) – Scan batches of records from a genomic range query on a BGZF-encoded file.
scan_virtual_ranges(vpos_ranges[, columns, ...]) – Scan batches of records from virtual position ranges in a BGZF file.
schema() – Return the Arrow schema.
- chrom_names()
Return the names of the reference sequences.
- chrom_sizes()
Return the names of the reference sequences and their lengths in bp.
- field_names()
Return the names of the fixed fields.
- genotype_field_defs()
Return the definitions of the FORMAT fields.
- genotype_field_names()
Return the names of the FORMAT fields.
- info_field_defs()
Return the definitions of the INFO fields.
- info_field_names()
Return the names of the INFO fields.
- sample_names()
Return the names of the samples.
- scan(columns=None, batch_size=1024, limit=None)
Scan batches of records from the file.
- Parameters:
columns (list[str], optional) – Names of the columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, records are scanned until EOF.
- Returns:
An iterator yielding Arrow record batches.
- Return type:
arro3 RecordBatchReader (pycapsule)
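A minimal usage sketch for scan() follows. It assumes that oxbow and pyarrow (version 15 or later) are installed, and that the returned reader exports the Arrow C stream interface so that pyarrow can consume it; head_records is a hypothetical helper name, not part of the library. The imports are deferred into the function so that defining it does not require either package.

```python
def head_records(path, n=10, compressed=False):
    """Read up to n records from a VCF file into a pyarrow Table.

    Hypothetical helper. Assumes oxbow and pyarrow >= 15 are installed.
    """
    import pyarrow as pa
    from oxbow.core import PyVcfScanner

    scanner = PyVcfScanner(path, compressed=compressed)
    # limit caps the total number of records; batch_size sets records per batch.
    stream = scanner.scan(batch_size=1024, limit=n)
    # Assumption: the returned object implements the Arrow PyCapsule stream
    # protocol, which RecordBatchReader.from_stream accepts.
    reader = pa.RecordBatchReader.from_stream(stream)
    return reader.read_all()
```

Calling head_records("variants.vcf") would then return at most ten records; the file name is a placeholder.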
- scan_byte_ranges(byte_ranges, columns=None, batch_size=1024, limit=None)
Scan batches of records from specified byte ranges in the file.
The byte positions must align with record boundaries.
- Parameters:
byte_ranges (list[tuple[int, int]]) – List of (start, end) byte position tuples to read from.
columns (list[str], optional) – Names of the columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, all records in the specified ranges are scanned.
- Returns:
An iterator yielding Arrow record batches.
- Return type:
arro3 RecordBatchReader (pycapsule)
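One way to obtain byte ranges that align with record boundaries is to walk the lines of an uncompressed VCF and note where each record starts and ends. The sketch below does this in plain Python on in-memory bytes; record_byte_ranges is a hypothetical helper, and treating end as exclusive is an assumption that should be checked against the scanner's behavior.

```python
def record_byte_ranges(data: bytes, records_per_range: int):
    """Return (start, end) ranges that begin and end on record boundaries.

    data is the full content of an uncompressed VCF file. Header lines
    (starting with '#') are skipped. Each range covers up to
    records_per_range consecutive records; end is exclusive (an assumption).
    """
    if records_per_range < 1:
        raise ValueError("records_per_range must be at least 1")
    ranges = []
    pos = 0        # byte offset of the current line
    start = None   # start offset of the range being built
    count = 0      # records accumulated in the current range
    for line in data.splitlines(keepends=True):
        end = pos + len(line)
        if not line.startswith(b"#"):
            if start is None:
                start = pos
            count += 1
            if count == records_per_range:
                ranges.append((start, end))
                start, count = None, 0
        pos = end
    if start is not None:
        # Final, possibly shorter, range.
        ranges.append((start, pos))
    return ranges


# Tiny made-up VCF body: two header lines followed by three records.
vcf = b"##fileformat=VCFv4.3\n#CHROM\tPOS\n1\t100\n1\t200\n2\t300\n"
ranges = record_byte_ranges(vcf, 2)
```

The resulting list could then be passed as byte_ranges, provided the offsets refer to the same uncompressed file the scanner was opened on.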
- scan_query(region, index=None, columns=None, batch_size=1024, limit=None)
Scan batches of records from a genomic range query on a BGZF-encoded file.
This operation requires an index file.
- Parameters:
region (str) – Genomic region in the format “chr:start-end”.
index (path or file-like, optional) – The index file to use for querying the region. If None and the source was provided as a path, the scanner attempts to load the index from the same path with an additional extension appended.
columns (list[str], optional) – Names of the columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, all records intersecting the query range are scanned.
- Returns:
An iterator yielding Arrow record batches.
- Return type:
arro3 RecordBatchReader (pycapsule)
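The region argument is a plain string, so it can help to build and validate it from separate components. The helpers below are hypothetical and not part of oxbow; they only handle the "chr:start-end" form named above and make no claim about the coordinate convention the scanner applies.

```python
def format_region(chrom: str, start: int, end: int) -> str:
    """Build a region string of the form 'chr:start-end'."""
    if start > end:
        raise ValueError("start must not exceed end")
    return f"{chrom}:{start}-{end}"


def parse_region(region: str):
    """Split 'chr:start-end' into (chrom, start, end)."""
    # Split on the last colon, since some contig names contain colons.
    chrom, sep, span = region.rpartition(":")
    if not sep or not chrom:
        raise ValueError(f"missing 'chr:' prefix in region: {region!r}")
    start, sep, end = span.partition("-")
    if not sep:
        raise ValueError(f"missing 'start-end' span in region: {region!r}")
    return chrom, int(start), int(end)
```

For example, format_region("chr1", 10000, 20000) produces a string suitable for the region parameter.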
- scan_virtual_ranges(vpos_ranges, columns=None, batch_size=1024, limit=None)
Scan batches of records from virtual position ranges in a BGZF file.
The virtual positions must align with record boundaries: the compressed offset must point to the beginning of a BGZF block, and the uncompressed offset must point to the beginning or end of a record decoded within that block.
- Parameters:
vpos_ranges (list[tuple[vpos, vpos]]) – List of virtual position ranges as pairs. Each virtual position can be given as either a packed virtual position (int) or an unpacked tuple of ints (c, u) specifying the compressed and uncompressed offsets, respectively.
columns (list[str], optional) – Names of the columns to project.
batch_size (int, optional [default: 1024]) – The number of records to include in each batch.
limit (int, optional) – The maximum number of records to scan. If None, all records in the specified ranges are scanned.
- Returns:
An iterator yielding Arrow record batches.
- Return type:
arro3 RecordBatchReader (pycapsule)
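The relationship between the packed and unpacked forms of a virtual position can be sketched as follows. In the BGZF format, a virtual offset stores the compressed block offset in the upper 48 bits and the offset within the uncompressed block in the lower 16 bits. The function names pack_vpos and unpack_vpos are hypothetical helpers, not part of oxbow.

```python
def pack_vpos(coffset: int, uoffset: int) -> int:
    """Pack (compressed offset, uncompressed offset) into one integer."""
    if not 0 <= uoffset < (1 << 16):
        raise ValueError("uncompressed offset must fit in 16 bits")
    if not 0 <= coffset < (1 << 48):
        raise ValueError("compressed offset must fit in 48 bits")
    return (coffset << 16) | uoffset


def unpack_vpos(vpos: int):
    """Split a packed virtual position into (c, u)."""
    return vpos >> 16, vpos & 0xFFFF


packed = pack_vpos(1000, 25)
c, u = unpack_vpos(packed)
```

Either form, the packed integer or the (c, u) tuple, describes the same position.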
- schema()
Return the Arrow schema.
- Return type:
arro3 Schema (pycapsule)