vcflib: 一个用于解析和操作VCF文件的开源Python库
Sentieon开发了一个名为vcflib的开源Python库,用于解析和操作变异检测的VCF结果文件。以下是该库的一些亮点功能:
-
支持读写未压缩或bgzip压缩并经过tabix索引的VCF文件;
-
支持VCF和gVCF(以.g.vcf.gz为后缀)文件;
-
输出文件会即时写入tribble(.vcf.idx)和tabix(vcf.gz.tbi)索引,无需对数据进行单独的索引遍历;
-
通过Sharder类支持跨基因组区域的并行化处理;
-
支持Python2.7和Python3。
使用示例 - VCF过滤
example/filter_dp.py中提供了一个简单的脚本示例,用于从输入的VCF中过滤掉DP<10的变异。
#!/usr/bin/env python
"""
An example script that utilizes vcflib to remove variants in a VCF file by DP
"""
from __future__ import print_function
import argparse
import os
import os.path
import sys
import vcflib
def filter_vcf(in_vcf, out_vcf, min_dp=10):
""" Filter variants in the input VCF with a INFO/DP < min_dp """
for variant in in_vcf:
dp = variant.info.get("DP", None) # INFO fields are stored as a dict
if dp isnotNoneand dp >= min_dp:
out_vcf.emit(variant)
return
def parse_args(argv):
parser = argparse.ArgumentParser(
description=__doc__,
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument("--input_vcf", required=True, help="The input VCF")
parser.add_argument("--output_vcf", required=True, help="The output VCF")
parser.add_argument(
"--min_dp", default=10, type=int, help="The minimum DP for a variant"
)
parser.add_argument(
"--n_threads",
default=1,
type=int,
help="Number of threads to use for VCF processing",
)
parser.add_argument(
"--step_size",
default=10 * 1000 * 1000,
type=int,
help="Size of the shards",
)
return parser.parse_args(argv)
def main(argv=None):
args = parse_args(argv)
assert os.path.isfile(args.input_vcf)
in_vcf = vcflib.VCF(str(args.input_vcf))
# `VCF()` accepts an option mode argument. Behavior is similar to mode in `open()`
out_vcf = vcflib.VCF(str(args.output_vcf), "wb")
out_vcf.copy_header(
in_vcf
) # Copy headers from the input VCF to the output VCF
out_vcf.emit_header() # Write the VCF header in the output VCF
if args.n_threads < 2:
# Run the filtering in the current process without parallelization
filter_vcf(in_vcf, out_vcf, args.min_dp)
else:
## Run the filtering in multiple processes - requires a VCF with contig
## lengths in the header
# Initialize the Sharder object with the number of threads
sharder = vcflib.Sharder(args.n_threads)
# Partition the genome into equal-sized shards
try:
contig_lengths = [
(contig, 0, int(d["length"]))
for contig, d in in_vcf.contigs.items()
]
except KeyError:
return (
"This example script requires a VCF with contig lengths in"
" the header when using multiple threads"
)
shards = sharder.cut(contig_lengths, args.step_size)
# Run all of the shards in parallel
sharder.run(shards, filter_vcf, [], in_vcf, out_vcf, args.min_dp)
out_vcf.close() # Close the output VCF and write the index file
return os.EX_OK
if __name__ == "__main__":
sys.exit(main())
运行命令如下:
PYTHONPATH=$(pwd) python example/filter_dp.py --input_vcf <VCF> --output_vcf <VCF>
vcflib为处理VCF文件提供了方便的工具,支持压缩文件、多种文件格式、即时索引、并行化等特性,可以提高VCF文件的解析和操作效率。欢迎感兴趣的开发者试用并给出宝贵意见,Sentieon团队也欢迎社区开发者为该项目做出贡献。
项目地址:https://github.com/Sentieon/vcflib/
来源:毅硕科技
本文来自博客园,作者:生物信息与育种,转载请注明原文链接:https://www.cnblogs.com/miyuanbiotech/p/18929983。若要及时了解动态信息,请关注同名微信公众号:生物信息与育种。

浙公网安备 33010602011771号