gnomAD: expanding multi-allelic variants in VCF (Part 2)

VCF Playground – Level 2

Following my previous post on GC elements’ order, I’m now going to present an empirical proof of this convention! As a reminder, the inferred order of elements within a GC field is:

GC=AA, AB,BB, AC,BC,CC, AD,BD,CD,DD, AE,BE,CE,DE,EE, ...   (1)

Notation used:

  • Reference AlleleA
  • Alternative AllelesBCD, … (in order of appearance in the VCF file)

We would need to have sufficient example entries from a VCF to show concordance between the hypothesised -but not yet validated-  convention (1) and the GC values as calculated using the AC and AN fields from the VCF data (equations from 1st post copied below for convenience):

  • Hom = {value in Hom field in VCF or gnomAD browser}
  • Het AC β€“ (2 x  Hom)
  • Ref AN / 2 – Hom – Het

So, let’s start our calculations… I will first cite two representative examples from chr21 with 4 alleles that validate our convention (1):

Examples


  1. VCF entry: 21, 9416219, G, A,GCATT,GCA

gnomAD browser entry:
http://gnomad.broadinstitute.org/variant/21-9416219-G-A

GC=685, 13609, 272, 26, 17, 0, 6, 3, 0, 0


(GC=AA, AB, BB, AC, BC, CC, AD, BD, CD, DD)

  • Allele B:
Hom = 272
Het = AC - 2*Hom = 14173 - 2*272 = 13629
Ref_b = AN/2 - Het - Hom = 29236/2 - Het - Hom = 717 = 
= AA + AC + AD = 685 + 26 + 6  
# (Rest of A counts, apart from AB which is already counted in Het)

  1. VCF entry: 21, 9415561, CG, C,TG,GG

gnomAD browser entry:
http://gnomad.broadinstitute.org/variant/21-9415561-CG-C

GC=2925, 6449, 1, 1825, 84, 0, 14, 1, 0, 0


(GC=AA, AB, BB, AC, BC, CC, AD, BD, CD, DD)

  • Allele B:
Het = 6536 - 2*1 = 6534
Hom = 1
Ref_b = 22598/2 - Het - Hom = 4764    = AA + AC + AD = 2925 + 1825 + 14 
# (Rest of A counts, apart from AB which is already counted in Het)
  • Allele C: (in VCF: CG/TG, in browser: C/T)
Het = 1909 - 2*0 = 1909     = AC + BC + CD = 1825 + 84 + 0 = 
Hom = 0
Ref_c = 22598/2 - Het - Hom = 9390 =  
= AA + AB + AD + BB + BD = 2925 + 6449 + 14  + 1 + 1 
# (Rest of A and B counts, apart from AC and BC,
which are already counted in Het)
  • Allele D: (in VCF: CG/GG, in browser: C/T)
Het = 15 - 2*0 = 15     = AD + BD + CD = 14 + 1 + 0
Hom = 0
Ref_d = 22598/2 - Het - Hom = 11284 = 
= AA + AB + BB + AC + BC + CC = 2925 + 6449 + 1 + 1825 + 84 + 0
# (Rest of A, B and C counts, apart from AD, BD and CD,
which are already counted in Het)

So far, so good…

However, I’m still not convinced even after several validation examples of 4 alleles, especially when they have several zero values in some pairs within GC…

Let’s try to see if we can get something really solid from chr21, such as an entry with no zero-value at the last pair of the GC field:

$ zcat < gnomad.genomes.r2.0.2.sites.chr21.vcf.bgz | awk -F "GC=" '{print $2}' | cut -d';' -f1 | sort | uniq | grep -v "0$" | head -20

1151,536,2,27,0,1,261,1,0,2
12801,768,5,2,0,0,1,0,0,0,0,0,0,0,1
12863,77,0,44,0,0,21,0,0,1
4517,11,0,125,0,12,12,0,0,1
46,1435,0,439,47,1,2135,70,211,1,1417,788,155,497,49,921,849,310,986,367,117,25,29,15,34,13,14,1

And voila! The last line looks like the perfect example to test our convention: 7 alleles and just one zero-value pair in GC! So, let’s find the relevant entry in the original VCF file (did that already with grep) and then continue with the validation:

ULTIMATE CHECK: 7 alleles

  1. VCF entry: 21, 9421430, GTTT, GT,GTT,G,GTTTTTT,GTTTT,GTTTTT

gnomAD browser entry:


http://gnomad.broadinstitute.org/variant/21-9421430-GTT-G

GC=46, 1435, 0, 439, 47, 1, 2135, 70, 211, 1, 1417, 788, 155, 497, 49, 921, 849, 310, 986, 367, 117, 25, 29, 15, 34, 13, 14, 1


(GC=AA, AB, BB, AC, BC, CC, AD, BD, CD, DD, AE, BE, CE, DE, EE, AF, BF, CF, DF, EF, FF, AG, BG, CG, DG, EG, FG, GG)

  • Allele B: (in VCF: GTTT/GT, in browser: GTT/G)
Het = 3218 - 2 * 0 = 3218 = 1435 + 47 + 70 + 788 + 849 + 29 = 
= AB + BC + BD + BE + BF + BG
  • Allele C: (in VCF: GTTT/GTT, in browser: GT/G)
Het = 1179 - 2 = 1177 = 439 + 47 + 211 + 155 + 310 + 15 = 
= AC + BC + CD + CE + CF + CG
  • Allele D: (in VCF: GTTT/G, in browser: GTTT/G)
Het = 3935 - 2 = 3933 = 2135 + 70 + 211 + 497 + 986 + 34 =
= AD + BD + CD + DE + DF + DG
  • Allele E: (in VCF: GTTT/G, in browser: G/GTTT)
Hom = 49
Het = 3335 - 2*49 = 3237 = 1417 + 788 + 155 + 497 + 367 + 13 =
= AE + BE + CE + DE + EF + EG
Ref_e = 21944/2 - Het - Hom = 7686 = 
= 46 + 1435 + 439 + 2135 + 921 + 25 + 0 + 47 + 70 + 849 + 29 + 1 + 211 + 310 + 15 + 1 + 986 + 34 + 117 + 14 + 1
= AA + AB + AC + AD + AF + AG + BB + BC + BD + BF + BG + CC + CD + CF + CG + DD + DF + DG + FF + FG + GG =  
# (Rest of A, B, C and D counts, apart from AE, BE, CE, DE which are already counted in Het)    

So, that was it! πŸ™‚

Feel free to test any other alleles from this example to get a full grasp of these calculations!


SPECIAL CASES

There are some entries in the gnomAD VCF files where allele counts derived from GC do not add up to the total AN value.

This is due to the presence of alleles with spanning deletions (denoted by STAR_* fields) that are not counted eventually in the Genotype Count field.

Example

  • 219438414ACTTTTTGCGCCCACCGCGGA,ACGG,GCTTTTTGCGCCCACCGCGG

gnomAD browser entry

http://gnomad.broadinstitute.org/variant/21-9438414-ACTTTTTGCGCCCACCGCGG-A

GC=1151, 536, 2, 27, 0, 1, 261, 1, 0, 2


(GC=AA, AB, BB, AC, BC, CC, AD, BD, CD, DD)

AN = 11108   fields in GC cannot add up to this AN value

  • Allele B: (in VCF: ACTTTTTGCGCCCACCGCGG/A, in browser: ACTTTTTGCGCCCACCGCGG/A)
Het = 541 - 2*2 = 537
  • Allele C: (in VCF: ACTTTTTGCGCCCACCGCGG/ACGG, in browser: ACTTTTTGCGCCCACCG/A)
Het = 29 - 2*1 = 27
  • Allele D: (in VCF: ACTTTTTGCGCCCACCGCGG/GCTTTTTGCGCCCACCGCGG, in browser: A/G)
Het = 266 - 2*2 = 262 == AD + BD

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s