Showing posts with label 23andMe. Show all posts

Wednesday, July 14, 2010

23andMe: SNPs So Nice They Call Them Twice

...with all of the duplicates tabulated below.

If you've tested with 23andMe and you've gotten the Complete Edition, you can download your raw data. What you'll get is a file containing the results for each SNP that was tested.

For each SNP, there's a line in the file containing:

  • the id number of the SNP (starting with "rs" for a standard SNP id, or with "i" for one of 23andMe's internal identifiers);
  • the chromosome the SNP is on;
  • the location of the SNP on that chromosome;
  • your genotype for that SNP.

Nearly 600,000 SNPs are tested for, so this file is fairly large.
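Here's a minimal sketch of reading such a file in Python. It assumes the four fields are tab-separated and that header lines start with "#", which you should check against your own download. The function name `parse_raw_data` and the genotypes in the sample are made up for illustration; the ids and locations are the ones mentioned later in this post.

```python
import io


def parse_raw_data(lines):
    """Parse raw-data lines into {snp_id: (chromosome, position, genotype)}.

    Assumption: fields are tab-separated and comment lines start with '#'.
    """
    snps = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        snp_id, chromosome, position, genotype = line.split("\t")
        snps[snp_id] = (chromosome, int(position), genotype)
    return snps


# Illustrative sample only: the genotypes here are invented.
sample = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs28936413\t2\t38151881\tCC\n"
    "i3002894\t2\t38151881\tCC\n"
)
snps = parse_raw_data(sample)
print(snps["rs28936413"])
```

With a real download you would pass an open file handle instead of the `io.StringIO` sample.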

Now, if you look closely at the data, you'll find that some of the locations are there twice. The two instances have different id numbers, but the same chromosome and the same location. Sometimes the two genotypes they'll give for you are the same, and sometimes they're different.

In fact, there are two locations that are reported three times, under three different names.
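One way to spot these repeats is to group the SNP ids by chromosome and location and keep any group with more than one id. The sketch below assumes the data is already in a dictionary mapping each id to a (chromosome, position, genotype) tuple; the function name and the sample genotypes are hypothetical.

```python
from collections import defaultdict


def find_duplicate_locations(snps):
    """Return {(chromosome, position): [ids]} for locations listed more than once."""
    by_location = defaultdict(list)
    for snp_id, (chromosome, position, _genotype) in snps.items():
        by_location[(chromosome, position)].append(snp_id)
    return {loc: sorted(ids) for loc, ids in by_location.items() if len(ids) > 1}


# Illustrative input: ids and locations from this post, genotypes invented.
snps = {
    "rs28936413": ("2", 38151881, "CC"),
    "i3002894": ("2", 38151881, "CC"),
    "rs3091244": ("1", 157951289, "AG"),
}
duplicates = find_duplicate_locations(snps)
print(duplicates)
```

A location reported three times would simply show up with three ids in its list.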

I've compiled a list of all the duplicate 23andMe SNPs, as of April 26, 2010, along with the possible alleles (variants) reported for each one.

So, why does 23andMe include these apparent duplicates? The duplicate SNPs fall into two distinct categories, and the rationale may be different for these two different types.

 

Category 1: The duplicate SNPs all have the same possible alleles.

For example, on chromosome 2, location 38151881 is reported both as i3002894 and as rs28936413, and both of them are C/T SNPs (that is, C and T are the only values tested for, and reported, on that SNP).

For the Category 1 duplicates, some are simply SNPs that have been given two different rs labels in the literature. But most of them involve 23andMe's internal identifiers (starting with i instead of rs). Perhaps some of these are there as quality control — testing the same SNP in two different ways on the chip. It's conceivable (although it seems unlikely to me) that some of these have different flanking base sequences on the chip, in which case they would be testing for different things.

 

Category 2: The duplicate SNPs have different possible alleles.

For example, on chromosome 1, location 156921704 is reported both as i3002777 (an A/T SNP) and as i3002778 (a G/T SNP).
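The split between the two categories comes down to comparing allele sets. Here's a small sketch, assuming you already have the possible alleles for each id at a given location (they are not part of the raw data file itself, so the input format here is hypothetical).

```python
def classify_duplicate(alleles_by_id):
    """Return 1 if every id at a location has the same allele set, else 2."""
    allele_sets = {frozenset(alleles) for alleles in alleles_by_id.values()}
    return 1 if len(allele_sets) == 1 else 2


# The two examples from this post, with alleles written as strings.
category_a = classify_duplicate({"i3002894": "CT", "rs28936413": "CT"})
category_b = classify_duplicate({"i3002777": "AT", "i3002778": "GT"})
print(category_a, category_b)
```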

The Category 2 SNPs are more interesting than the Category 1 SNPs, as we'll see.

First, though, we need to take a brief detour...

In general, DNA consists of a sequence of bases, each one an A, C, G, or T. (It's actually a sequence of base pairs, but only one base is reported in each pair, since the other base is complementary and is determined by the first one.)

The vast majority of DNA, more than 99.9% of the base pairs, is the same in all human beings. But with more than 3 billion base pairs in the human genome, even a 0.1% difference between two individuals comes to more than 3 million differences.

These differences take a variety of forms. For example, there can be entire base sequences that are present in some people but absent in others (insertions and deletions). Segments can be repeated a varying number of times (copy number variation). Segments can be reversed in some individuals. But the simplest kind of variation is where a substantial segment of DNA is the same in everybody except that the base in one specific location in the segment varies: maybe some people have an A at that location, and some people have a C.

These are called SNPs — single-nucleotide polymorphisms — specific locations in our DNA where there are differences between people, but where the surrounding bases are the same for everybody (or nearly everybody).

Theoretically, any of the four bases A, C, G, and T can occur at any SNP. However, most SNPs are biallelic. This means that only two alleles actually appear at that location if you look throughout the population. So we might describe a SNP as an A/C SNP, for example, if everybody (or nearly everybody) has either an A or a C for that SNP.

But a very small number of locations are triallelic — if you look in a number of people, you'll find three different bases appearing at that spot, not just two. Even more rarely, you'll find SNPs where all four bases appear in the population.

The Illumina chip can apparently test directly for some of these triallelic SNPs. For example, rs3091244 (location 157951289 on chromosome 1) is an A/G/T SNP. The genotype reported for rs3091244 for any particular individual will be a pair of bases where each base is an A, G, or T.

Now, it looks like there are additional triallelic, or potentially triallelic, SNPs that 23andMe has decided to handle by testing for the possible alleles in pairs. For instance, look at location 156921704 on chromosome 1. 23andMe has defined one SNP i3002777 to test for A vs. T, and a second SNP i3002778 to test for G vs. T.

I don't know if these Category 2 SNPs are known to be triallelic, or if 23andMe is just unsure as to which bases may appear at each one. It's also possible that 23andMe wants to double-check some normal biallelic SNPs to be sure that they're not getting spurious results, or to see if they really are biallelic.

 
THE TABLES

Sunday, May 16, 2010

What's Hard to Call?

It looks like heterozygous base pairs are harder for 23andMe (or its Illumina chip) to call than homozygous base pairs.

I tested with 23andMe twice.  Of the autosomal SNPs that were no-calls on one of my tests but were genotyped on the other test, about half were heterozygous and about half were homozygous.

This is in contrast to the fact that about 68% of the autosomal SNPs overall were homozygous.

So my heterozygous SNPs were disproportionately represented among the no-calls.
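The comparison above can be sketched as follows. This assumes each test is a dictionary from SNP id to a two-letter genotype, already filtered to autosomal SNPs, and that a no-call is written as "--" (an assumption about the file format). The function names and the toy data are hypothetical.

```python
NO_CALL = "--"  # assumed no-call marker


def is_heterozygous(genotype):
    """True when the two reported bases differ."""
    return len(genotype) == 2 and genotype[0] != genotype[1]


def het_fraction_among_rescued(test_a, test_b):
    """Fraction heterozygous among SNPs that are a no-call on exactly one test."""
    rescued = []
    for snp_id in test_a.keys() & test_b.keys():
        a, b = test_a[snp_id], test_b[snp_id]
        if (a == NO_CALL) != (b == NO_CALL):
            rescued.append(b if a == NO_CALL else a)  # keep the genotyped call
    if not rescued:
        return None
    return sum(is_heterozygous(g) for g in rescued) / len(rescued)


# Toy data, invented for illustration.
test_a = {"s1": "--", "s2": "AA", "s3": "AG", "s4": "--"}
test_b = {"s1": "AG", "s2": "--", "s3": "AG", "s4": "--"}
fraction = het_fraction_among_rescued(test_a, test_b)
print(fraction)
```

A fraction near 0.5, set against roughly 32% heterozygous overall, is the pattern described here.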

Call Me Sometime

When you test with 23andMe, there are always some no-calls: locations at which the genotyping chip wasn't able to get a good reading, so no result is reported.

I tested twice with 23andMe. The first time I tested, there were 2607 no-calls, and the second time there were 3260 no-calls.

But these generally weren't the same SNPs. Most of the time, a no-call on one test was resolved by the other test. Only 459 SNPs were no-calls both times.

Merging the results from the two tests actually gives 544 no-calls, since I have to add as new no-calls the 85 SNPs that were reported differently on the two tests.
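The merging rule described here can be written as a small function: take whichever test made a call, and fall back to a no-call when the two calls disagree. This is a sketch only; it assumes "--" marks a no-call and treats "AG" and "GA" as the same genotype.

```python
NO_CALL = "--"  # assumed no-call marker


def merge_calls(first, second):
    """Merge one SNP's genotype from two tests."""
    if first == NO_CALL:
        return second  # second may itself be a no-call
    if second == NO_CALL:
        return first
    # Both tests made a call: keep it only if they agree (ignoring base order).
    return first if sorted(first) == sorted(second) else NO_CALL


print(merge_calls("--", "AG"))
print(merge_calls("AA", "AG"))

# 459 SNPs were no-calls both times, plus 85 that disagreed:
merged_no_calls = 459 + 85
print(merged_no_calls)
```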

There are 578,320 SNPs that 23andMe reports on. So merging the two tests brings the no-call rate from 0.45% and 0.56% individually all the way down to 0.094% for the combo data.
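For anyone who wants to check the arithmetic, the rates work out like this (the variable names are just for illustration):

```python
TOTAL_SNPS = 578_320


def no_call_rate_percent(no_calls):
    """No-calls as a percentage of all SNPs reported on."""
    return 100 * no_calls / TOTAL_SNPS


first = no_call_rate_percent(2607)
second = no_call_rate_percent(3260)
merged = no_call_rate_percent(459 + 85)
print(f"{first:.2f}% {second:.2f}% {merged:.3f}%")  # 0.45% 0.56% 0.094%
```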

By the way, it's possible that some of the 459 repeated no-calls don't represent inadequacies in the test at all but instead indicate microdeletions—short fragments of DNA that most people have but that are missing in my genome. (For the autosomes, this would require my having inherited a microdeletion from both parents, which seems unlikely.)

Saturday, May 15, 2010

My Spitting Image: 23andMe error rate

The company 23andMe offers a DNA testing service. Send them a sample of your spit, and they'll test your DNA at nearly 600,000 nucleotide locations.

These locations are among the single-nucleotide polymorphisms, or SNPs, the spots in human DNA which are known to differ frequently among individuals — unlike the vast majority of our genetic code which is identical among all humans.

I've been curious what the 23andMe error rate is. So I took advantage of their DNA Day offer to test a second time.

The new results from 23andMe just came in (amazingly fast -- only 10 days after their receipt of my spit kit, even though they said it would take 6-8 weeks).

Comparing the raw data, 85 SNPs were called differently by 23andMe in the two tests. Of these 85 SNPs, 73 were called as homozygous one time and heterozygous the other time. The remaining 12 were homozygous but opposite on the two occasions. (None of the differences were on the X, Y, or mitochondrial unpaired locations.)
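Sorting the discrepancies into these two kinds is straightforward to sketch. The function names and the sample pairs below are hypothetical, and the "other" bucket (two different heterozygous calls) is included only for completeness.

```python
from collections import Counter


def is_heterozygous(genotype):
    """True when the two reported bases differ."""
    return genotype[0] != genotype[1]


def discordance_type(first, second):
    """Describe how two calls for the same SNP relate to each other."""
    if sorted(first) == sorted(second):
        return "same"
    het_first, het_second = is_heterozygous(first), is_heterozygous(second)
    if het_first != het_second:
        return "het/hom"  # homozygous one time, heterozygous the other
    if not het_first:
        return "opposite homozygous"
    return "other"  # two different heterozygous calls


# Invented pairs of calls, for illustration only.
pairs = [("AA", "AG"), ("CT", "TT"), ("GG", "AA"), ("CC", "CC")]
counts = Counter(discordance_type(a, b) for a, b in pairs)
print(counts)
```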

This suggests a 23andMe error rate of 0.0074%, based on calls that differed on the two occasions.
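The post doesn't spell out the denominator, so the following is only one reconstruction that reproduces the figure. It assumes each discrepancy means exactly one wrong call out of the two made, and it counts only SNPs that were genotyped on both tests, using the no-call counts from the "Call Me Sometime" post.

```python
TOTAL_SNPS = 578_320
NO_CALLS_FIRST = 2607
NO_CALLS_SECOND = 3260
NO_CALLS_BOTH = 459
DISCORDANT = 85

# SNPs with a no-call on at least one test (inclusion-exclusion).
no_call_either = NO_CALLS_FIRST + NO_CALLS_SECOND - NO_CALLS_BOTH
called_both = TOTAL_SNPS - no_call_either

# Assumption: one wrong call per discrepancy, out of two calls per SNP.
error_rate_percent = 100 * DISCORDANT / (2 * called_both)
print(called_both)
print(f"{error_rate_percent:.4f}%")  # 0.0074%
```

Dividing 85 by the SNP count alone would give about twice this figure, which is the rate of discrepancies per SNP rather than errors per call.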

The actual error rate is presumably somewhat higher than this, since some of the SNPs that were called one time but were no-calls the other time may also be incorrect, but I have no way of identifying those. It's also possible that there are a few systematic errors, which would be wrong nearly every time.

I also have Family Finder results from Family Tree DNA; they have some SNPs in common with 23andMe. The two companies use different chips: 23andMe uses an Illumina chip, and FTDNA went with Affymetrix.

On the SNPs that 23andMe disagreed on and that Family Finder also covered, FTDNA always agreed with one or the other of the 23andMe results. (It was conceivable that all three results would sometimes be different, with one heterozygous genotype and two opposite homozygous genotypes, but that didn't happen at all in my data.)

All the above data excludes no-calls. I'll put something up on those in a separate post. By the way, in case anyone who uses 23andMe is wondering, my alter ego shows up as my identical twin in Relative Finder!