This vignette shows how to preprocess individual trio VCFs (proband, father, mother) into a clean trio VCF suitable for UPDhmm.
You should have one VCF (e.g. oputput from GATK) per family member:
proband.vcf.gz
mother.vcf.gz
father.vcf.gz
Each file should be bgzipped (.vcf.gz
) and indexed with tabix (.tbi
).
For this step you will need a reference genome FASTA indexed with samtools faidx
.
bcftools norm -m-any -f reference.fa proband.vcf.gz -Oz -o proband.norm.vcf.gz
bcftools norm -m-any -f reference.fa mother.vcf.gz -Oz -o mother.norm.vcf.gz
bcftools norm -m-any -f reference.fa father.vcf.gz -Oz -o father.norm.vcf.gz
tabix -p vcf proband.norm.vcf.gz
tabix -p vcf mother.norm.vcf.gz
tabix -p vcf father.norm.vcf.gz
Keep only essential fields (drop INFO/FORMAT annotations not required downstream). The resulting VCFs are lighter.
# Keep only GT, AD, DP, and GQ fields (remove all other INFO)
bcftools annotate -x INFO,^FORMAT/GT,FORMAT/AD,FORMAT/DP,FORMAT/GQ \
proband.norm.vcf.gz -Oz -o proband.clean.vcf.gz
bcftools annotate -x INFO,^FORMAT/GT,FORMAT/AD,FORMAT/DP,FORMAT/GQ \
mother.norm.vcf.gz -Oz -o mother.clean.vcf.gz
bcftools annotate -x INFO,^FORMAT/GT,FORMAT/AD,FORMAT/DP,FORMAT/GQ \
father.norm.vcf.gz -Oz -o father.clean.vcf.gz
# Index the cleaned VCFs
tabix -p vcf proband.clean.vcf.gz
tabix -p vcf mother.clean.vcf.gz
tabix -p vcf father.clean.vcf.gz
Goal:
If the input files are gVCFs, you can directly merge them, as they already include all genomic positions.
If they are standard VCFs, merge the files first and then remove missing genotypes (./.
) to keep only fully called variants.
#--------------------------------------------------------------
# Option 1: For standard VCFs
#--------------------------------------------------------------
# Merge the three individuals
bcftools merge \
proband.clean.vcf.gz \
mother.clean.vcf.gz \
father.clean.vcf.gz \
-Oz -o trio_merged_raw.vcf.gz
# Retain only biallelic variants
bcftools view -m2 -M2 trio_merged_raw.vcf.gz -Oz -o trio_merged_biallelic.vcf.gz
# Remove sites where all genotypes are homozygous reference (0/0)
bcftools view \
-i 'COUNT(FORMAT/GT="0/0") != 3' \
trio_merged_biallelic.vcf.gz -Oz -o trio_merged_nohom.vcf.gz
# Remove missing genotypes (./.) to keep only fully called variants
bcftools view \
-e 'GT="./."' \
trio_merged_nohom.vcf.gz -Oz -o trio_merged_clean.vcf.gz
#--------------------------------------------------------------
# Option 2: For gVCFs
#--------------------------------------------------------------
# gVCFs already include all genomic sites, so intersection is not needed
bcftools merge \
proband.clean.vcf.gz \
mother.clean.vcf.gz \
father.clean.vcf.gz \
-Oz -o trio_merged_raw.vcf.gz
# Retain only biallelic variants
bcftools view -m2 -M2 trio_merged_raw.vcf.gz -Oz -o trio_merged_biallelic.vcf.gz
# Remove sites where all are 0/0 (non-informative for UPD analysis)
bcftools view \
-i 'COUNT(FORMAT/GT="0/0") != 3' \
trio_merged_biallelic.vcf.gz -Oz -o trio_merged_clean.vcf.gz
#--------------------------------------------------------------
# Normalize and index the merged VCF
#--------------------------------------------------------------
# Normalize (split multi-allelics if any remain and remove duplicates)
bcftools norm -m -both -d all -Oz -o trio_merged_norm.vcf.gz trio_merged_clean.vcf.gz
# Index for downstream tools
tabix -p vcf trio_merged_norm.vcf.gz
Before detecting UPD events, it is recommended to exclude genomic regions prone to alignment artifacts or abnormal variant density — such as centromeres, segmental duplications, and immune complex regions (e.g., HLA and KIR).
These regions can lead to false-positive signals due to mapping ambiguity or high polymorphism.
To simplify this step, a curated set of BED masks is available in the following Zenodo repository:
🔗 Zenodo – UPDhmm Excluded Regions
The repository provides BED files for both reference genome builds:
hg19_excluded_regions.bed
hg38_excluded_regions.bed
These files include merged genomic intervals covering:
Download the appropriate BED file for your genome build and use it to mask excluded regions from the merged trio VCF:
# Mask problematic regions
bcftools view -T ^merged_mask.bed \
trio_merged_norm.vcf.gz -Oz -o trio_masked.vcf.gz
# Index the masked VCF
tabix -p vcf trio_masked.vcf.gz
sessionInfo()
## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] karyoploteR_1.35.3 regioneR_1.41.3
## [3] VariantAnnotation_1.55.1 Rsamtools_2.25.3
## [5] Biostrings_2.77.2 XVector_0.49.1
## [7] SummarizedExperiment_1.39.2 Biobase_2.69.1
## [9] GenomicRanges_1.61.5 IRanges_2.43.5
## [11] S4Vectors_0.47.4 Seqinfo_0.99.2
## [13] MatrixGenerics_1.21.0 matrixStats_1.5.0
## [15] BiocGenerics_0.55.3 generics_0.1.4
## [17] dplyr_1.1.4 UPDhmm_1.5.1
## [19] BiocStyle_2.37.1
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 bitops_1.0-9 gridExtra_2.3
## [4] rlang_1.1.6 magrittr_2.0.4 biovizBase_1.57.1
## [7] compiler_4.5.1 RSQLite_2.4.3 GenomicFeatures_1.61.6
## [10] png_0.1-8 vctrs_0.6.5 ProtGenerics_1.41.0
## [13] stringr_1.5.2 pkgconfig_2.0.3 crayon_1.5.3
## [16] fastmap_1.2.0 magick_2.9.0 backports_1.5.0
## [19] rmarkdown_2.30 UCSC.utils_1.5.0 tinytex_0.57
## [22] bit_4.6.0 xfun_0.53 cachem_1.1.0
## [25] GenomeInfoDb_1.45.12 jsonlite_2.0.0 blob_1.2.4
## [28] DelayedArray_0.35.3 BiocParallel_1.43.4 parallel_4.5.1
## [31] cluster_2.1.8.1 R6_2.6.1 bslib_0.9.0
## [34] stringi_1.8.7 RColorBrewer_1.1-3 bezier_1.1.2
## [37] rtracklayer_1.69.1 rpart_4.1.24 jquerylib_0.1.4
## [40] Rcpp_1.1.0 bookdown_0.45 knitr_1.50
## [43] base64enc_0.1-3 Matrix_1.7-4 nnet_7.3-20
## [46] tidyselect_1.2.1 rstudioapi_0.17.1 dichromat_2.0-0.1
## [49] abind_1.4-8 yaml_2.3.10 codetools_0.2-20
## [52] curl_7.0.0 lattice_0.22-7 tibble_3.3.0
## [55] KEGGREST_1.49.2 S7_0.2.0 evaluate_1.0.5
## [58] foreign_0.8-90 pillar_1.11.1 BiocManager_1.30.26
## [61] checkmate_2.3.3 RCurl_1.98-1.17 ensembldb_2.33.2
## [64] ggplot2_4.0.0 scales_1.4.0 glue_1.8.0
## [67] lazyeval_0.2.2 Hmisc_5.2-4 tools_4.5.1
## [70] BiocIO_1.19.0 data.table_1.17.8 BSgenome_1.77.2
## [73] GenomicAlignments_1.45.4 XML_3.99-0.19 grid_4.5.1
## [76] AnnotationDbi_1.71.2 colorspace_2.1-2 htmlTable_2.4.3
## [79] restfulr_0.0.16 Formula_1.2-5 cli_3.6.5
## [82] HMM_1.0.2 S4Arrays_1.9.1 AnnotationFilter_1.33.0
## [85] gtable_0.3.6 sass_0.4.10 digest_0.6.37
## [88] SparseArray_1.9.1 rjson_0.2.23 htmlwidgets_1.6.4
## [91] farver_2.1.2 memoise_2.0.1 htmltools_0.5.8.1
## [94] lifecycle_1.0.4 httr_1.4.7 bit64_4.6.0-1
## [97] bamsignals_1.41.2