oxbow.from_vcf

oxbow.from_vcf(source: str | Path | Callable[[], IO[bytes] | str], compression: Literal['infer', 'bgzf', 'gzip', None] = 'infer', *, fields: Literal['*'] | list[str] | None = '*', info_fields: Literal['*'] | list[str] | None = '*', genotype_fields: Literal['*'] | list[str] | None = '*', genotype_by: Literal['sample', 'field'] = 'sample', samples: Literal['*'] | list[str] | None = None, samples_nested: bool = False, coords: Literal['01', '11'] = '11', regions: str | list[str] | None = None, index: str | Path | Callable[[], IO[bytes] | str] | None = None, batch_size: int = 131072) → VcfFile

Create a VCF file data source.

Changed in version 0.7.0: The samples parameter now defaults to omitting sample genotype data (None) instead of including all samples ("*"). To include samples, pass a value to the samples parameter or use the with_samples() method on the returned data source.
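
Because sample genotype data is omitted by default since 0.7.0, it has to be requested explicitly. The sketch below shows one way to do that using only the documented samples and genotype_fields parameters. The sample names, the FORMAT field IDs, the file name, and the open_vcf helper are placeholders for illustration and are not part of oxbow.

```python
# Keyword arguments built only from parameters documented for from_vcf.
# Sample names and FORMAT field IDs below are placeholders, not real data.
ALL_SAMPLES = {"samples": "*"}
SOME_SAMPLES = {
    "samples": ["NA12878", "NA12891"],   # hypothetical sample names
    "genotype_fields": ["GT", "DP"],     # hypothetical FORMAT field IDs
}


def open_vcf(path, **kwargs):
    """Hypothetical helper: forward keyword arguments to oxbow.from_vcf."""
    import oxbow  # imported lazily so this module loads without oxbow installed

    return oxbow.from_vcf(path, **kwargs)


# Usage (requires oxbow and a real file; the path is a placeholder):
# vcf = open_vcf("calls.vcf.gz", **SOME_SAMPLES)
```

Passing samples="*" restores the pre-0.7.0 behaviour of including every sample declared in the header.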

Parameters:
  • source (str, pathlib.Path, or Callable) – The URI or path to the VCF file, or a callable that opens the file as a file-like object.

  • compression (Literal["infer", "bgzf", "gzip", None], default: "infer") – Compression of the source bytestream. If “infer” and source is a URI or path, the file’s compression is guessed based on the extension, where “.gz” or “.bgz” is interpreted as BGZF. Pass “gzip” to decode regular GZIP. If None, the source bytestream is assumed to be uncompressed. For more customized decoding, provide a callable source instead.

  • fields ("*", list[str], or None, optional [default: "*"]) – Fixed fields to project. "*" includes all standard fields. Pass a list to select specific fields. None omits all fixed fields.

  • info_fields ("*", list[str], or None, optional [default: "*"]) – INFO fields to project, nested under an "info" column. "*" includes all INFO fields declared in the header. Pass a list to select specific fields. None omits the info column entirely.

  • genotype_fields ("*", list[str], or None, optional [default: "*"]) – Genotype (aka “FORMAT”) fields to project for each sample. "*" includes all FORMAT fields declared in the header. Pass a list to select specific fields. None omits the genotype fields.

  • genotype_by (Literal["sample", "field"], optional [default: "sample"]) – Determines how genotype-specific data is organized. If “sample”, each sample is provided as a separate column with nested FORMAT fields. If “field”, each FORMAT field is provided as a separate column with nested sample name fields.

  • samples ("*", list[str], or None, optional [default: None]) – Samples to include in the genotype output. "*" includes all samples declared in the header. Pass a list to select specific samples. None omits all sample genotype data.

  • samples_nested (bool, optional [default: False]) – Whether to nest sample data under a single structured column.

  • regions (str or list[str], optional [default: None]) – One or more genomic regions to query. Only applicable if an associated index file is available.

  • index (str, pathlib.Path, or Callable, optional) – An optional index file associated with the VCF file. If source is a URI or path to a BGZF-compressed file, and an index file with the same name plus a “.tbi” or “.csi” extension exists, that index file is detected automatically.

  • batch_size (int, optional [default: 131072]) – The number of records to read in each batch.
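
The extension-based rules described for compression and index can be sketched in plain Python. This is an illustration of the documented behaviour only, not oxbow's implementation: guess_compression and candidate_index_paths are hypothetical helpers, and the treatment of other extensions and the lookup order are assumptions.

```python
def guess_compression(source):
    """Mirror the documented "infer" rule: ".gz" or ".bgz" is read as BGZF."""
    if str(source).endswith((".gz", ".bgz")):
        return "bgzf"
    # Assumption: any other extension is treated as uncompressed.
    return None


def candidate_index_paths(source):
    """List sibling index names: the source name plus ".tbi" or ".csi"."""
    # Assumption: ".tbi" is listed before ".csi"; the documentation
    # does not state which one takes precedence.
    return [str(source) + ext for ext in (".tbi", ".csi")]
```

For a regular GZIP file that happens to end in “.gz”, the inferred value would be wrong, which is why the documentation asks for compression="gzip" to be passed explicitly in that case.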

Returns:

A data source object representing the VCF file.

Return type:

VcfFile
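
A minimal sketch of a region query that returns a VcfFile, assuming an indexed, BGZF-compressed file. Only documented parameters are used. The helper names, the file name, and the region string are placeholders, and the "chr1:1-100000" syntax is assumed to follow the usual chromosome:start-end convention.

```python
def region_query_kwargs(regions, index=None, info_fields="*"):
    """Hypothetical helper: collect from_vcf keyword arguments for a region query."""
    kwargs = {
        "regions": regions,          # a single region string or a list of them
        "info_fields": info_fields,  # "*" keeps every INFO field in the header
        "samples": None,             # omit genotype data (the documented default)
    }
    if index is not None:
        # Only needed when the index cannot be detected automatically.
        kwargs["index"] = index
    return kwargs


def open_regions(path, regions, **options):
    """Hypothetical helper: open a VCF data source restricted to some regions."""
    import oxbow  # imported lazily so this module loads without oxbow installed

    return oxbow.from_vcf(path, **region_query_kwargs(regions, **options))


# Usage (requires oxbow, a real file, and its index; names are placeholders):
# vcf = open_regions("calls.vcf.gz", "chr1:1-100000")
```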

Notes

The Variant Call Format (VCF) is a text-based format used to store information about genomic variants. It is widely used in bioinformatics for storing and sharing variant data from sequencing projects.
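
To make the terms used in the parameter list concrete, the sketch below splits one record of a tiny, made-up VCF document into its fixed fields, INFO fields, and per-sample FORMAT fields using only the standard library. It is not how oxbow parses VCF; the record, the sample names S1 and S2, and the split_record helper are invented for illustration.

```python
# A minimal, made-up VCF document: "##" meta lines, one "#CHROM" header
# line, then tab-separated records.
VCF_LINES = [
    "##fileformat=VCFv4.3",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "S1", "S2"]),
    "\t".join(["chr1", "101", ".", "A", "G", "50", "PASS",
               "DP=14", "GT", "0/1", "1/1"]),
]


def split_record(header_line, record_line):
    """Split one record into fixed fields, INFO fields, and genotypes."""
    columns = header_line.lstrip("#").split("\t")
    row = dict(zip(columns, record_line.split("\t")))

    # The first seven columns (CHROM through FILTER) are the fixed fields.
    fixed = {name: row[name] for name in columns[:7]}

    # INFO is a ";"-separated list of KEY=VALUE pairs or bare flags.
    info = {}
    for item in row["INFO"].split(";"):
        if item == ".":
            continue  # "." marks an empty INFO column
        key, sep, value = item.partition("=")
        info[key] = value if sep else True

    # FORMAT names the ":"-separated keys used in every sample column.
    format_keys = row["FORMAT"].split(":")
    genotypes = {
        sample: dict(zip(format_keys, row[sample].split(":")))
        for sample in columns[9:]
    }
    return fixed, info, genotypes


fixed, info, genotypes = split_record(VCF_LINES[3], VCF_LINES[4])
```

The genotypes mapping here is keyed by sample with FORMAT fields nested inside, which corresponds to the layout described for genotype_by="sample".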

See also

from_bcf

Create a BCF file data source.