oxbow.from_vcf
- oxbow.from_vcf(source: str | Path | Callable[[], IO[bytes] | str], compression: Literal['infer', 'bgzf', 'gzip', None] = 'infer', *, fields: list[str] | None = None, info_fields: list[str] | None = None, samples: list[str] | None = None, genotype_fields: list[str] | None = None, genotype_by: Literal['sample', 'field'] = 'sample', regions: str | list[str] | None = None, index: str | Path | Callable[[], IO[bytes] | str] | None = None, batch_size: int = 131072) → VcfFile
Create a VCF file data source.
- Parameters:
source (str, pathlib.Path, or Callable) – The URI or path to the VCF file, or a callable that opens the file as a file-like object.
compression (Literal["infer", "bgzf", "gzip", None], default: "infer") – Compression of the source bytestream. If “infer” and source is a URI or path, the file’s compression is guessed from the extension, where “.gz” or “.bgz” is interpreted as BGZF. Pass “gzip” to decode regular GZIP. If None, the source bytestream is assumed to be uncompressed. For more customized decoding, provide a callable source instead.
fields (list[str], optional) – Specific fixed fields to project. By default, all fixed fields are included.
info_fields (list[str], optional [default: None]) – INFO fields to project. These will be nested under an “info” column. If None, all INFO fields declared in the header are included. To omit all INFO fields, set info_fields=[].
samples (list[str], optional [default: None]) – A subset of samples to include in the genotype output. If None, all samples declared in the header are included. To omit all sample genotype data, set samples=[].
genotype_fields (list[str], optional [default: None]) – Genotype (aka “FORMAT”) fields to project for each sample. If None, all FORMAT fields declared in the header are included.
genotype_by (Literal["sample", "field"], optional [default: "sample"]) – Determines how genotype-specific data is organized. If “sample”, each sample is provided as a separate column with nested FORMAT fields. If “field”, each FORMAT field is provided as a separate column with nested sample name fields.
regions (str | list[str], optional) – One or more genomic regions to query. Only applicable if an associated index file is available.
index (str, pathlib.Path, or Callable, optional) – An optional index file associated with the VCF file. If source is a URI or path, is BGZF-compressed, and the index file shares the same name with a “.tbi” or “.csi” extension, the index file is automatically detected.
batch_size (int, optional [default: 131072]) – The number of records to read in each batch.
- Returns:
A data source object representing the VCF file.
- Return type:
VcfFile
Notes
The Variant Call Format (VCF) is a text-based format used to store information about genomic variants. It is widely used in bioinformatics for storing and sharing variant data from sequencing projects.
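The compression and index defaults documented for from_vcf can be pictured with a small pure-Python sketch. It is illustrative only and is not oxbow's implementation: the helper names guess_compression and candidate_index_paths are hypothetical, and the sketch assumes the index name is formed by appending “.tbi” or “.csi” to the full file name, which is the usual tabix-style convention.

```python
from pathlib import Path


def guess_compression(source):
    """Mimic the documented "infer" rule: ".gz" or ".bgz" is read as BGZF."""
    suffix = Path(str(source)).suffix
    if suffix in {".gz", ".bgz"}:
        return "bgzf"
    return None  # no recognized extension: treated as uncompressed


def candidate_index_paths(source):
    """List index names that might sit next to a BGZF-compressed VCF.

    Assumption: the index extension is appended to the whole file name.
    """
    if guess_compression(source) != "bgzf":
        return []  # auto-detection is only documented for BGZF sources
    return [f"{source}.tbi", f"{source}.csi"]


print(guess_compression("calls.vcf.gz"))
print(guess_compression("calls.vcf"))
print(candidate_index_paths("calls.vcf.gz"))
```

Note that a regular GZIP file with a “.gz” extension would also be guessed as BGZF under this rule, which is why the documentation asks for compression="gzip" to be passed explicitly in that case.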
See also
from_bcf – Create a BCF file data source.
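Examples
A minimal usage sketch, assuming oxbow is installed and importable as oxbow. The helper names projection_kwargs and open_vcf_subset, the sample name, and the region string are hypothetical; only the keyword arguments are taken from the signature above. It also assumes the file declares a GT FORMAT field and, for region queries, that an index is available.

```python
def projection_kwargs(samples, region=None):
    """Build keyword arguments for a narrow projection of a VCF file."""
    kwargs = {
        "info_fields": [],           # omit all INFO fields
        "samples": list(samples),    # keep only the requested samples
        "genotype_fields": ["GT"],   # assumes GT is declared in the header
        "genotype_by": "field",      # one column per FORMAT field
    }
    if region is not None:
        kwargs["regions"] = region   # only applicable when an index exists
    return kwargs


def open_vcf_subset(path, samples, region=None):
    """Create a VCF data source restricted to the given samples and region."""
    import oxbow as ox  # imported lazily so the helpers load without oxbow

    return ox.from_vcf(path, **projection_kwargs(samples, region))


# Hypothetical call; the path, sample name, and region are placeholders:
# vcf = open_vcf_subset("calls.vcf.gz", ["NA12878"], region="chr1:1-1000")
```

Keeping the keyword arguments in a separate helper makes the projection easy to inspect or reuse before any file is opened.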