...with all of the duplicates tabulated below.
If you've tested with 23andMe and you've gotten the Complete Edition, you can download your raw data. What you'll get is a file containing the results for each SNP that was tested.
For each SNP, there's a line in the file containing:
- the id number of the SNP (starting with "rs" for a standard SNP id, or with "i" for one of 23andMe's internal identifiers);
- the chromosome the SNP is on;
- the location of the SNP on that chromosome;
- your genotype for that SNP.
Nearly 600,000 SNPs are tested for, so this file is fairly large.
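If you want to work with the file programmatically, here is a minimal sketch of a reader for it. It assumes the download is plain text with one SNP per line, the four fields separated by tabs, and header lines starting with "#"; check your own file before relying on that. The names SnpCall and parse_raw_data are made up for this illustration, and the genotypes in the sample are invented.

```python
from typing import Iterable, List, NamedTuple


class SnpCall(NamedTuple):
    snp_id: str       # "rs..." or "i..." identifier
    chromosome: str   # kept as text, since labels may not all be numeric
    position: int     # location of the SNP on that chromosome
    genotype: str     # your reported genotype, e.g. "CT"


def parse_raw_data(lines: Iterable[str]) -> List[SnpCall]:
    """Parse raw-data lines, assuming tab-separated fields and '#' comments."""
    calls = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        snp_id, chromosome, position, genotype = line.split("\t")
        calls.append(SnpCall(snp_id, chromosome, int(position), genotype))
    return calls


# Invented sample; the ids and location are real, the genotypes are not.
sample = [
    "# rsid\tchromosome\tposition\tgenotype",
    "i3002894\t2\t38151881\tCC",
    "rs28936413\t2\t38151881\tCC",
]
calls = parse_raw_data(sample)
print(calls[0].snp_id, calls[0].position)
```

To read a real download, pass an open file object instead of the sample list, since iterating over a file yields its lines.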
Now, if you look closely at the data, you'll find that some of the locations appear twice. The two instances have different id numbers, but the same chromosome and the same location. Sometimes the two genotypes reported for you are the same, and sometimes they're different.
In fact, there are two locations that are reported three times, under three different names.
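Finding these repeats is a matter of grouping the SNPs by chromosome and location and keeping the groups with more than one id. The sketch below shows one way to do that; it is not necessarily how my own list was compiled. The function name find_duplicate_locations is hypothetical, the input is assumed to be (id, chromosome, location, genotype) tuples, and the genotypes and the third record are invented.

```python
from collections import defaultdict


def find_duplicate_locations(calls):
    """Group (snp_id, chromosome, position, genotype) tuples by location.

    Returns only the locations reported under more than one id.
    """
    by_location = defaultdict(list)
    for snp_id, chromosome, position, genotype in calls:
        by_location[(chromosome, position)].append((snp_id, genotype))
    return {
        location: entries
        for location, entries in by_location.items()
        if len(entries) > 1
    }


# Invented sample data for illustration.
calls = [
    ("i3002894", "2", 38151881, "CC"),
    ("rs28936413", "2", 38151881, "CC"),
    ("rs0000001", "2", 38152000, "AG"),  # hypothetical id, not a duplicate
]
duplicates = find_duplicate_locations(calls)
for (chromosome, position), entries in duplicates.items():
    print(chromosome, position, entries)
```

Grouping on the (chromosome, location) pair rather than on the location alone matters, because the same location number can occur on different chromosomes.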
I've compiled a list of all the duplicate 23andMe SNPs, as of April 26, 2010, along with the possible alleles (variants) reported for each one.
So, why does 23andMe include these apparent duplicates? The duplicate SNPs fall into two distinct categories, and the rationale may differ between the two.
Category 1: The duplicate SNPs all have the same possible alleles.
For example, on chromosome 2, location 38151881 is reported both as i3002894 and as rs28936413, and both of them are C/T SNPs (that is, C and T are the only values tested for, and reported, on that SNP).
For the Category 1 duplicates, some are simply SNPs that have been given two different rs labels in the literature. But most of them involve 23andMe's internal identifiers (starting with i instead of rs). Perhaps some of these are there as quality control — testing the same SNP in two different ways on the chip. It's conceivable (although it seems unlikely to me) that some of these have different flanking base sequences on the chip, in which case they would be testing for different things.
Category 2: The duplicate SNPs have different possible alleles.
For example, on chromosome 1, location 156921704 is reported both as i3002777 (an A/T SNP) and as i3002778 (a G/T SNP).
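The two categories can be told apart mechanically by comparing the allele sets of the SNPs that share a location. Here is a small sketch using the two examples given so far; classify_duplicate is a name invented for this illustration, and alleles are written as strings such as "CT" for a C/T SNP.

```python
def classify_duplicate(allele_sets):
    """Return 1 if all SNPs at a location test the same alleles, else 2."""
    distinct = {frozenset(alleles) for alleles in allele_sets}
    return 1 if len(distinct) == 1 else 2


# Possible alleles for the two example locations from the text.
examples = {
    ("2", 38151881): {"i3002894": "CT", "rs28936413": "CT"},
    ("1", 156921704): {"i3002777": "AT", "i3002778": "GT"},
}
categories = {
    location: classify_duplicate(snps.values())
    for location, snps in examples.items()
}
print(categories)
```

Comparing sets rather than strings means that "CT" and "TC" count as the same pair of alleles.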
The Category 2 SNPs are more interesting than the Category 1 SNPs, as we'll see.
First, though, we need to take a brief detour...
In general, DNA consists of a sequence of bases, each one an A, C, G, or T. (It's actually a sequence of base pairs, but only one base is reported in each pair, since the other base is complementary and is determined by the first one.)
The vast majority of DNA, more than 99.9% of the base pairs, is the same in all human beings. But with more than 3 billion base pairs in the human genome, even a 0.1% difference between two individuals comes to more than 3 million differences.
These differences take a variety of forms. For example, there can be entire base sequences that are present in some people but absent in others (insertions and deletions). Segments can be repeated a varying number of times (copy number variation). Segments can be reversed in some individuals. But the simplest kind of variation is where a substantial segment of DNA is the same in everybody except that the base in one specific location in the segment varies: maybe some people have an A at that location, and some people have a C.
These are called SNPs — single-nucleotide polymorphisms — specific locations in our DNA where there are differences between people, but where the surrounding bases are the same for everybody (or nearly everybody).
Theoretically, any of the four bases A, C, G, and T can occur at any SNP. However, most SNPs are biallelic. This means that only two alleles actually appear at that location across the population. So we might describe a SNP as an A/C SNP, for example, if everybody (or nearly everybody) has either an A or a C for that SNP.
But a very small number of locations are triallelic — if you look in a number of people, you'll find three different bases appearing at that spot, not just two. Even more rarely, you'll find SNPs where all four bases appear in the population.
The Illumina chip can apparently test directly for some of these triallelic SNPs. For example, rs3091244 (location 157951289 on chromosome 1) is an A/G/T SNP. The genotype reported for rs3091244 for any particular individual will be a pair of bases where each base is an A, G, or T.
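To make the difference concrete, here is a short sketch that lists the genotypes possible for a given set of alleles. It treats a genotype as an unordered pair of bases, so "AG" and "GA" count as one genotype; that is an assumption made for the illustration, and possible_genotypes is an invented name.

```python
from itertools import combinations_with_replacement


def possible_genotypes(alleles):
    """List every unordered pair of bases drawn from the given alleles."""
    return [
        "".join(pair)
        for pair in combinations_with_replacement(sorted(alleles), 2)
    ]


biallelic = possible_genotypes("AC")    # an A/C SNP
triallelic = possible_genotypes("AGT")  # an A/G/T SNP such as rs3091244
print(biallelic)   # ['AA', 'AC', 'CC']
print(triallelic)  # ['AA', 'AG', 'AT', 'GG', 'GT', 'TT']
```

A biallelic SNP allows three genotypes, a triallelic SNP allows six, and a SNP where all four bases appear would allow ten.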
Now, it looks like there are additional triallelic, or potentially triallelic, SNPs that 23andMe has decided to handle by testing for the possible alleles in pairs. For instance, look at location 156921704 on chromosome 1. 23andMe has defined one SNP i3002777 to test for A vs. T, and a second SNP i3002778 to test for G vs. T.
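Put another way, pooling the alleles of the paired tests at one location gives the full set of bases being checked there. A minimal sketch for the example above, with combined_alleles as a name invented for the illustration:

```python
def combined_alleles(allele_sets):
    """Pool the alleles tested by every SNP defined at one location."""
    combined = set()
    for alleles in allele_sets:
        combined |= set(alleles)
    return sorted(combined)


# The two SNPs defined at location 156921704 on chromosome 1.
tests_at_location = {"i3002777": "AT", "i3002778": "GT"}
alleles = combined_alleles(tests_at_location.values())
print(alleles)  # ['A', 'G', 'T']
```

Three pooled alleles is what you would expect if the location is being treated as triallelic, though it does not show that all three actually occur in the population.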
I don't know if these Category 2 SNPs are known to be triallelic, or if 23andMe is just unsure as to which bases may appear at each one. It's also possible that 23andMe wants to double-check some normal biallelic SNPs to be sure that they're not getting spurious results, or to see if they really are biallelic.
THE TABLES