oxbow.from_bcf

oxbow.from_bcf(source: str | Path | Callable[[], IO[bytes] | str], compression: Literal['bgzf', None] = 'bgzf', *, fields: Literal['*'] | list[str] | None = '*', info_fields: Literal['*'] | list[str] | None = '*', genotype_fields: Literal['*'] | list[str] | None = '*', genotype_by: Literal['sample', 'field'] = 'sample', samples: Literal['*'] | list[str] | None = None, samples_nested: bool = False, coords: Literal['01', '11'] = '11', regions: str | list[str] | None = None, index: str | Path | Callable[[], IO[bytes] | str] | None = None, batch_size: int = 131072) -> BcfFile

Create a BCF file data source.

Changed in version 0.7.0: The samples parameter now defaults to omitting sample genotype data (None) instead of including all samples ("*"). To include samples, pass a value to the samples parameter or use the with_samples() method on the returned data source.
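A minimal sketch of opting back into genotype data after this change, using only the samples parameter described above. The helper name open_with_genotypes and its opener argument are hypothetical; opener exists only so the sketch can be exercised without oxbow installed, and defaults to oxbow.from_bcf.

```python
from typing import Any, Callable, Optional, Sequence, Union


def open_with_genotypes(
    path: str,
    samples: Union[str, Sequence[str]] = "*",
    opener: Optional[Callable[..., Any]] = None,
) -> Any:
    """Open a BCF source with sample genotype data included.

    `opener` is a hypothetical hook for testing; by default the
    documented oxbow.from_bcf function is used.
    """
    if opener is None:
        import oxbow  # imported lazily; only needed when no opener is given

        opener = oxbow.from_bcf

    if isinstance(samples, str):
        # "*" selects every sample; any other string is one sample name.
        selected: Union[str, list] = samples if samples == "*" else [samples]
    else:
        selected = list(samples)

    # Since 0.7.0, samples must be requested explicitly.
    return opener(path, samples=selected)
```

Calling open_with_genotypes("example.bcf") would request all samples, matching the pre-0.7.0 default; the file name is a placeholder.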

Parameters:
  • source (str, pathlib.Path, or Callable) – The URI or path to the BCF file, or a callable that opens the file as a file-like object.

  • compression (Literal["bgzf", None], default: "bgzf") – Compression of the source bytestream. By default, BCF sources are assumed to be BGZF-compressed. If None, the source is assumed to be uncompressed. For more custom decoding, provide a callable source instead.

  • fields ("*", list[str], or None, optional [default: "*"]) – Fixed fields to project. "*" includes all standard fields. Pass a list to select specific fields. None omits all fixed fields.

  • info_fields ("*", list[str], or None, optional [default: "*"]) – INFO fields to project, nested under an "info" column. "*" includes all INFO fields declared in the header. Pass a list to select specific fields. None omits the info column entirely.

  • genotype_fields ("*", list[str], or None, optional [default: "*"]) – Genotype (aka “FORMAT”) fields to project for each sample. "*" includes all FORMAT fields declared in the header. Pass a list to select specific fields. None omits the genotype fields.

  • genotype_by (Literal["sample", "field"], optional [default: "sample"]) – Determines how genotype-specific data is organized. If "sample", each sample is provided as a separate column with nested FORMAT fields. If "field", each FORMAT field is provided as a separate column with nested sample name fields.

  • samples ("*", list[str], or None, optional [default: None]) – Samples to include in the genotype output. "*" includes all samples declared in the header. Pass a list to select specific samples. None omits all sample genotype data.

  • samples_nested (bool, optional [default: False]) – Whether to nest sample data under a single structured column.

  • regions (str | list[str], optional) – One or more genomic regions to query. Only applicable if an associated index file is available.

  • index (str, pathlib.Path, or Callable, optional) – The URI or path to an index file associated with the BCF file, or a callable that opens the index as a file-like object. If source is a URI or path, is BGZF-compressed, and the index file shares the same name with a ".csi" extension, the index file is detected automatically.

  • batch_size (int, optional [default: 131072]) – The number of records to read in each batch.
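The projection and query parameters above can be combined in one call. The sketch below gathers them into keyword arguments for from_bcf; the helper names bcf_query_kwargs and open_subset are hypothetical, and the field, sample, and region values in the usage note are placeholders. The "chrom:start-end" region syntax is an assumption, not something this page specifies.

```python
from typing import Any, Optional, Sequence, Union


def bcf_query_kwargs(
    info_fields: Optional[Sequence[str]] = None,
    samples: Optional[Sequence[str]] = None,
    genotype_fields: Optional[Sequence[str]] = None,
    regions: Union[str, Sequence[str], None] = None,
) -> dict:
    """Collect from_bcf keyword arguments for a narrow projection.

    Arguments left as None are not forwarded, so from_bcf keeps its
    documented defaults for them.
    """
    kwargs: dict = {}
    if info_fields is not None:
        kwargs["info_fields"] = list(info_fields)
    if samples is not None:
        # Samples are omitted by default, so they are requested explicitly.
        kwargs["samples"] = list(samples)
        if genotype_fields is not None:
            kwargs["genotype_fields"] = list(genotype_fields)
    if regions is not None:
        # A single region string passes through; several become a list.
        kwargs["regions"] = regions if isinstance(regions, str) else list(regions)
    return kwargs


def open_subset(path: str, **selection: Any) -> Any:
    """Open a BCF source with the selection above (requires oxbow)."""
    import oxbow  # imported lazily; only needed when this function is called

    return oxbow.from_bcf(path, **bcf_query_kwargs(**selection))
```

For example, open_subset("example.bcf", info_fields=["DP"], samples=["S1"], genotype_fields=["GT"], regions="chr1:1-1000000") would request one INFO field and one FORMAT field for one sample within one region. A region query of this kind only applies when an index file is available.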

Returns:

A data source object representing the BCF file.

Return type:

BcfFile

Notes

The Binary Call Format (BCF) is a binary representation of the Variant Call Format (VCF), designed for efficient storage and processing of genomic variant data. It is commonly used in large-scale sequencing projects.
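Because BCF is a binary format that is usually, but not always, wrapped in BGZF, it can help to inspect the first bytes of a file before choosing the compression argument. The sketch below is an illustration rather than part of oxbow: guess_compression is a hypothetical helper that relies on the BGZF block header layout (a gzip header carrying a "BC" extra subfield) and the "BCF" magic string of an uncompressed BCF2 stream.

```python
import struct
from typing import Optional

BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip magic, deflate, FEXTRA flag set
BCF_MAGIC = b"BCF\x02"            # start of an uncompressed BCF2 stream


def guess_compression(head: bytes) -> Optional[str]:
    """Return "bgzf" or None, suitable for from_bcf's compression argument.

    `head` should hold at least the first 18 bytes of the file.
    Raises ValueError if the bytes match neither layout.
    """
    if head[:4] == BGZF_MAGIC and len(head) >= 12:
        # Bytes 10-11 hold XLEN, the total length of the gzip extra field.
        (xlen,) = struct.unpack_from("<H", head, 10)
        extra = head[12:12 + xlen]
        pos = 0
        # Walk the extra subfields looking for the BGZF marker 'B','C', length 2.
        while pos + 4 <= len(extra):
            si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
            if (si1, si2, slen) == (66, 67, 2):
                return "bgzf"
            pos += 4 + slen
    if head[:4] == BCF_MAGIC:
        return None
    raise ValueError("not a BGZF block or an uncompressed BCF2 stream")
```

The result could then be passed along as compression=guess_compression(head), where head is read from the start of the file in binary mode.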

See also

from_vcf

Create a VCF file data source.