Wednesday, July 14, 2010

23andMe: SNPs So Nice They Call Them Twice

...with all of the duplicates tabulated below.

If you've tested with 23andMe and you've gotten the Complete Edition, you can download your raw data. What you'll get is a file containing the results for each SNP that was tested.

For each SNP, there's a line in the file containing:

  • the id number of the SNP (starting with "rs" for a standard SNP id, or with "i" for one of 23andMe's internal identifiers);
  • the chromosome the SNP is on;
  • the location of the SNP on that chromosome;
  • your genotype for that SNP.

Nearly 600,000 SNPs are tested for, so this file is fairly large.
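If you'd like to work with the file programmatically, here is a minimal Python sketch of reading it. It assumes the four fields are tab-separated in the order listed above and that header or comment lines start with "#"; check your own download before relying on that. The names `Snp` and `parse_raw_data` are mine, and the genotypes in the sample are made up.

```python
from typing import Iterable, Iterator, NamedTuple


class Snp(NamedTuple):
    snp_id: str      # "rs..." or "i..." identifier
    chromosome: str  # kept as text, since "X", "Y" and "MT" are not numbers
    position: int    # location on that chromosome
    genotype: str    # e.g. "CT"


def parse_raw_data(lines: Iterable[str]) -> Iterator[Snp]:
    """Yield one Snp per data line, skipping blank lines and '#' comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        snp_id, chromosome, position, genotype = line.split("\t")
        yield Snp(snp_id, chromosome, int(position), genotype)


# Illustrative lines only: real ids and location, invented genotypes.
sample = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs28936413\t2\t38151881\tCC",
    "i3002894\t2\t38151881\tCC",
]
for snp in parse_raw_data(sample):
    print(snp)
```

To read a real download you would pass an open file object instead of `sample`, since a file is also an iterable of lines.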

Now, if you look closely at the data, you'll find that some of the locations appear twice. The two instances have different id numbers, but the same chromosome and the same location. Sometimes the two genotypes reported for you are the same, and sometimes they're different.

In fact, there are two locations that are reported three times, under three different names.
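Here is one way you might hunt for these repeated locations yourself: group the SNPs by (chromosome, location) and keep the groups with more than one entry. This is only a sketch; the function name `find_duplicate_locations` is my own, and the genotypes in the sample data are invented.

```python
from collections import defaultdict


def find_duplicate_locations(snps):
    """Map (chromosome, position) -> [(id, genotype), ...] for repeated locations.

    `snps` is any iterable of (snp_id, chromosome, position, genotype) tuples.
    """
    by_location = defaultdict(list)
    for snp_id, chromosome, position, genotype in snps:
        by_location[(chromosome, position)].append((snp_id, genotype))
    return {loc: found for loc, found in by_location.items() if len(found) > 1}


# Ids and locations are examples from this post; the genotypes are made up.
snps = [
    ("rs28936413", "2", 38151881, "CC"),
    ("i3002894", "2", 38151881, "CC"),
    ("i3002777", "1", 156921704, "TT"),
    ("i3002778", "1", 156921704, "TT"),
    ("rs3091244", "1", 157951289, "AG"),
]
for location, found in find_duplicate_locations(snps).items():
    print(location, found)
```

A location reported three times simply shows up as a group with three entries.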

I've compiled a list of all the duplicate 23andMe SNPs, as of April 26, 2010, along with the possible alleles (variants) reported for each one.

So, why does 23andMe include these apparent duplicates? The duplicate SNPs fall into two distinct categories, and the rationale may be different for these two different types.

 

Category 1: The duplicate SNPs all have the same possible alleles.

For example, on chromosome 2, location 38151881 is reported both as i3002894 and as rs28936413, and both of them are C/T SNPs (that is, C and T are the only values tested for, and reported, on that SNP).

For the Category 1 duplicates, some are simply SNPs that have been given two different rs labels in the literature. But most of them involve 23andMe's internal identifiers (starting with i instead of rs). Perhaps some of these are there as quality control — testing the same SNP in two different ways on the chip. It's conceivable (although it seems unlikely to me) that some of these have different flanking base sequences on the chip, in which case they would be testing for different things.

 

Category 2: The duplicate SNPs have different possible alleles.

For example, on chromosome 1, location 156921704 is reported both as i3002777 (an A/T SNP) and as i3002778 (a G/T SNP).
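The two categories can be told apart mechanically once you know the possible alleles for each id. Note that one person's raw data can't tell you the possible alleles; they have to come from a compiled list like mine. The sketch below assumes each duplicate's alleles are given as a short string such as "CT", and the function name is hypothetical.

```python
def classify_duplicate(allele_sets):
    """Return 1 if every duplicate has the same possible alleles, otherwise 2."""
    distinct = {frozenset(alleles) for alleles in allele_sets}
    return 1 if len(distinct) == 1 else 2


print(classify_duplicate(["CT", "CT"]))  # i3002894 and rs28936413 -> 1
print(classify_duplicate(["AT", "GT"]))  # i3002777 and i3002778 -> 2
```

Using sets means the order in which the alleles are written ("CT" versus "TC") doesn't matter.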

The Category 2 SNPs are more interesting than the Category 1 SNPs, as we'll see.

First, though, we need to take a brief detour...

In general, DNA consists of a sequence of bases, each one an A, C, G, or T. (It's actually a sequence of base pairs, but only one base is reported in each pair, since the other base is complementary and is determined by the first one.)

The vast majority of DNA, more than 99.9% of the base pairs, is the same in all human beings. But with more than 3 billion base pairs in the human genome, even a 0.1% difference between two individuals comes to more than 3 million differences.

These differences take a variety of forms. For example, there can be entire base sequences that are present in some people but absent in others (insertions and deletions). Segments can be repeated a varying number of times (copy number variation). Segments can be reversed in some individuals. But the simplest kind of variation is where a substantial segment of DNA is the same in everybody except that the base in one specific location in the segment varies: maybe some people have an A at that location, and some people have a C.

These are called SNPs — single-nucleotide polymorphisms — specific locations in our DNA where there are differences between people, but where the surrounding bases are the same for everybody (or nearly everybody).

Theoretically the four bases A, C, G, and T can occur at any SNP. However, most of these SNPs are biallelic. This means that, at that location, only two alleles actually appear there if you look throughout the population. So we might describe a SNP as an A/C SNP, for example, if everybody (or nearly everybody) has either an A or a C for that SNP.

But a very small number of locations are triallelic — if you look in a number of people, you'll find three different bases appearing at that spot, not just two. Even more rarely, you'll find SNPs where all four bases appear in the population.

The Illumina chip can apparently test directly for some of these triallelic SNPs. For example, rs3091244 (location 157951289 on Chromosome 1) is an A/G/T SNP. The genotype reported for rs3091244 for any particular individual will be a pair of bases where each base is an A, G, or T.
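To make the difference concrete, here is a small sketch that lists the genotypes a person could have at a SNP, given its possible alleles. It treats a genotype as an unordered pair of bases, which assumes the person has two copies of the chromosome; the function name is my own.

```python
from itertools import combinations_with_replacement


def possible_genotypes(alleles):
    """All unordered pairs of bases that can be drawn from `alleles`."""
    pairs = combinations_with_replacement(sorted(alleles), 2)
    return ["".join(pair) for pair in pairs]


print(possible_genotypes("AC"))   # ['AA', 'AC', 'CC']
print(possible_genotypes("AGT"))  # ['AA', 'AG', 'AT', 'GG', 'GT', 'TT']
```

So a biallelic SNP has three possible genotypes, while a triallelic SNP like this one has six.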

Now, it looks like there are additional triallelic, or potentially triallelic, SNPs that 23andMe has decided to handle by testing for the possible alleles in pairs. For instance, look at location 156921704 on chromosome 1. 23andMe has defined one SNP i3002777 to test for A vs. T, and a second SNP i3002778 to test for G vs. T.

I don't know if these Category 2 SNPs are known to be triallelic, or if 23andMe is just unsure as to which bases may appear at each one. It's also possible that 23andMe wants to double-check some normal biallelic SNPs to be sure that they're not getting spurious results, or to see if they really are biallelic.

 
THE TABLES

Monday, July 12, 2010

Combinatorics 7: Exceptions

This is the last in a series of 7 posts on sharing combinatorics:

Part 1: Combinatorial Principles for DNA Sharing

Part 2: Non-transitivity

Part 3: A Genetic Pigeonhole Principle

Part 4: Transitivity Principles

Part 5: X-Chromosome Transitivity Principles

Part 6: Mutual Sharing Principle

Part 7: Exceptions

 

If 23andMe reports something that seems to violate one of our genetic combinatorial principles, the anomaly is presumably due to one of the following:

(1) [the only true exception] unusual situations where two matches on different strands butt up against one another by coincidence, making what are really two separate matches look like one long match;

(2) genotyping errors;

(3) uncertain boundaries on short, close-to-the-threshold matches;

(4) microdeletions (or other, rarer phenomena such as chimeras);

(5) 23andMe's practice of reporting regions as half-identical even though they have a small number of isolated non-matching SNPs;

or (6) on the X chromosome, 23andMe's varying minimum threshold size for reporting matches, which depends on the genders of the individuals whose genomes are being compared.

Combinatorics 6: Mutual Sharing Principle

This is the sixth in a series of 7 posts on sharing combinatorics.

 

Mutual Sharing Principle:
If a number of people all share with one another (that is, if every pair shares), then either:

(a) there is a common strand that all of them share;

or

(b) there are at most 3 genotypes that all of them fall into; in other words, the people can be divided into 3 sets S, T, and U for which the following three statements are true: all the people in S are fully identical to one another, all the people in T are fully identical to one another, and all the people in U are fully identical to one another. In case (b), we can also say that at least one-third of the people are fully identical to one another.
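If you'd like to convince yourself of this, the principle can be checked by brute force in a toy model. The model is my own simplification: on the region in question, each person is a pair of strand labels, two people share when they have a label in common, and they are fully identical when their pairs are equal. It ignores the real-world complications covered in the Exceptions post.

```python
from itertools import combinations, combinations_with_replacement


def shares(p, q):
    """Half-identical sharing: at least one strand label in common."""
    return bool(set(p) & set(q))


def check_mutual_sharing(n, labels="abcde"):
    """Test the principle on every group of n people built from `labels`."""
    genotypes = list(combinations_with_replacement(labels, 2))
    for group in combinations_with_replacement(genotypes, n):
        if not all(shares(p, q) for p, q in combinations(group, 2)):
            continue  # not every pair shares, so the principle says nothing
        has_common_strand = bool(set.intersection(*(set(g) for g in group)))
        if not has_common_strand and len(set(group)) > 3:
            return False  # would be a counterexample
    return True


print(all(check_mutual_sharing(n) for n in range(2, 6)))

# A typical case (b): every pair shares, but no strand is common to all three.
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(all(shares(p, q) for p, q in combinations(triangle, 2)))
```

The triangle at the end is why case (b) allows up to 3 genotypes: with strands a, b, and c, the three pairs ab, bc, and ac all overlap one another without sharing a single common strand.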


 

Here is a stronger mutual sharing principle for the X chromosome.

Mutual Sharing Principle for the X chromosome:

If a number of people share with one another on a region on the X chromosome and if at least one of them is male, then there is a common strand that all of them share.

If all the people are female, though, then the best we can do is the general version at the beginning of this post.
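The X-chromosome version can be checked in the same kind of toy model. Here I assume a male is represented by a single strand label and a female by a pair of labels, and that two people share when they have a label in common. The function name and the choice of four labels are arbitrary.

```python
from itertools import combinations, combinations_with_replacement


def shares(p, q):
    """Sharing on the region: at least one strand label in common."""
    return bool(set(p) & set(q))


def check_x_mutual_sharing(n, labels="abcd"):
    """Test the X-chromosome principle on every group of n people."""
    males = [(label,) for label in labels]
    females = list(combinations_with_replacement(labels, 2))
    for group in combinations_with_replacement(males + females, n):
        if all(len(person) == 2 for person in group):
            continue  # all female: only the general principle applies
        if not all(shares(p, q) for p, q in combinations(group, 2)):
            continue  # not every pair shares
        if not set.intersection(*(set(person) for person in group)):
            return False  # a male is present but there is no common strand
    return True


print(all(check_x_mutual_sharing(n) for n in range(2, 5)))
```

The reason it works is simple: a male has only one strand to share, so everybody who shares with him must carry that strand.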

Combinatorics 5: X‑Chromosome Transitivity Principles

This is the fifth in a series of 7 posts on sharing combinatorics.

 

Here are some strengthened transitive principles that apply when the sharing is on the X chromosome. What we can say depends on the genders of the people involved.

Recall that men have one X chromosome and women have two X chromosomes. This is the reason for analyzing sharing on the X chromosome specially. In the case of a male, there is no difference between half-identical sharing and fully identical sharing when looking at regions on the X chromosome. In the case of a female, "sharing" refers to the usual half-identical sharing.

 
 
Transitive Principles for X-Chromosome Sharing

Anybody—Male—Anybody:
If person A shares with a male person on the X chromosome, and if that second person shares with a third person B, then A shares with B.

 
Male—Females—Male:

If male A shares with two or more females on the X chromosome, and if those females all share with male B, then either:

(a) More commonly — A and B share with each other;

or

(b) More rarely — All the females share fully identically on the region in question.

 
Male—Females—Female,
or Female—Females—Male:

If person A shares with N females on the X chromosome, and if those N females all share with person B, and if A and B are of opposite sex, then either:

(a) More commonly — A and B share with each other;

or

(b) More rarely — At least N/2 (rounding up) of the N females share fully identically on the region in question.

 
The only other case is Female—Females—Female, but the best we can do there is the general version, applicable to all chromosomes.
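To see how the three cases compare, here is a brute-force sketch in a toy model of my own: a male is a single strand label, a female is a pair of labels, sharing means having a label in common, and fully identical females have equal pairs. For every A and B who don't share, it looks at every possible set of n females who share with both, and reports the worst case, that is, the smallest size the largest fully identical group can have.

```python
from collections import Counter
from itertools import combinations_with_replacement


def shares(p, q):
    """Sharing on the region: at least one strand label in common."""
    return bool(set(p) & set(q))


def worst_case_identical(a_pool, b_pool, n, labels="abcd"):
    """Smallest possible 'largest identical group' among n females in between."""
    females = list(combinations_with_replacement(labels, 2))
    worst = n
    for a in a_pool:
        for b in b_pool:
            if shares(a, b):
                continue  # case (a): A and B share with each other
            middles = [f for f in females if shares(a, f) and shares(f, b)]
            for group in combinations_with_replacement(middles, n):
                worst = min(worst, max(Counter(group).values()))
    return worst


labels = "abcd"
males = [(label,) for label in labels]
females = list(combinations_with_replacement(labels, 2))

n = 5
print(worst_case_identical(males, males, n))      # Male-Females-Male: 5 (all)
print(worst_case_identical(males, females, n))    # Male-Females-Female: 3
print(worst_case_identical(females, females, n))  # Female-Females-Female: 2
```

With n = 5 the three answers are all 5, N/2 rounded up, and N/4 rounded up, which matches the principles above and the general version.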

Combinatorics 4: Transitivity Principles

This is the fourth in a series of 7 posts on sharing combinatorics.

 

As before, "share" means "share a particular half-identical region with" (all regions are at the same locations).

Sharing "fully identically" means that the two people match on that region on both homologous chromosomes, not just on one.

Our transitive principles all involve somebody (person A) sharing with one or more other people, who themselves all share with somebody else (person B). True transitivity would say that person A must share with person B, but this isn't necessarily true.

Although sharing isn't transitive, here is a valid transitive-like principle, which we'll state first for sharing with 5 people, and then in its general formulation:

 
Five-Way Transitive Principle:

If person A shares with 5 other people, and if those 5 people all share with person B, then either:

(a) More commonly — A and B share with each other;

or

(b) More rarely — at least 2 of the 5 other people share fully identically on the region in question.

 
N-Way Transitive Principle:

If person A shares with N other people, and if those N people all share with person B, then either:

(a) More commonly — A and B share with each other;

or

(b) More rarely — at least N/4 (rounding up) of the N other people share fully identically on the region in question.
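Where does the N/4 come from? Here is a sketch of the reasoning in a toy model that I'm assuming for illustration: A has two distinct strands, B has two distinct strands, and A and B have no strand in common. Anyone who shares with both must then carry one strand from A and one from B, which leaves only 4 possible genotypes. The labels "a1", "b1", and so on are invented.

```python
from collections import Counter
from itertools import product
from math import ceil

person_a = ("a1", "a2")
person_b = ("b1", "b2")  # no strand in common with A, so case (a) fails

# Anyone sharing with both A and B carries one strand of A and one of B.
candidates = list(product(person_a, person_b))


def fewest_identical(n):
    """Smallest possible size of the largest fully identical group among n others."""
    return min(
        max(Counter(group).values()) for group in product(candidates, repeat=n)
    )


print(candidates)
print(fewest_identical(5))  # 2
print(all(fewest_identical(n) == ceil(n / 4) for n in range(1, 7)))
```

With 5 people and only 4 genotypes to choose from, at least 2 of them must land on the same genotype, which is the Five-Way principle. If A or B happened to carry the same strand twice there would be even fewer genotypes available, so this is the worst case.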


 

We can also state some strengthened transitivity principles for sharing on the X chromosome, but those are in a separate post.

Combinatorics 3: A Genetic Pigeonhole Principle

This is the third in a series of 7 posts on sharing combinatorics.

 

In the following, "share" means "share a particular half-identical region with" (all regions are at the same locations).

Sharing "fully identically" means that the two people match on that region on both homologous chromosomes, not just on one.

 

Genetic Pigeonhole Principle:
If a person shares with a number of other people, those other people can be divided into two sets S and T, where everybody in S shares with one another, and everybody in T shares with one another.

[Looking at the raw data, all the people in S share one strand with the original person, and all the people in T share the other strand with the original person.]

It's possible for S or T to be empty, or to have just one person in it. In that case, the other set contains all, or all but one, of the other people.
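Here is a small sketch of how the split into S and T could be carried out if you knew each person's two strands on the region. That is a big assumption, since 23andMe's raw data isn't phased; the strand labels and the function name are invented for illustration.

```python
from itertools import combinations


def shares(p, q):
    """Half-identical sharing: at least one strand label in common."""
    return bool(set(p) & set(q))


def split_matches(person, others):
    """Split people who share with `person` into S and T by the strand they share.

    Assumes everybody in `others` really does share with `person`. Somebody
    who matches on both strands is placed in S.
    """
    first_strand, _second_strand = person
    s = [other for other in others if first_strand in other]
    t = [other for other in others if first_strand not in other]
    return s, t


alice = ("x", "y")
others = [("x", "p"), ("q", "y"), ("x", "y"), ("r", "x")]
s, t = split_matches(alice, others)
print(s)  # [('x', 'p'), ('x', 'y'), ('r', 'x')]
print(t)  # [('q', 'y')]
```

Everybody in S carries Alice's first strand and everybody in T carries her second, so the people within each set all share with one another.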

We can also state the following two consequences to the Genetic Pigeonhole Principle:

 

Three-Way Pigeonhole Principle:
If a person shares with 3 other people, then at least 2 of those other people share with each other. [Looking at the raw data, those two would share a common strand with the original person.]

 

N-Way Pigeonhole Principle:
If a person shares with N other people, then at least N/2 (rounding up if N is odd) of those other people all share with one another.

[Again, all of these N/2 people share a common strand with the original person.]
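The N-way version can also be checked by brute force. The sketch below uses a toy model of my own: each person is a pair of strand labels, and two people share when they have a label in common. It tries every possible set of n people who share with a fixed person and confirms that some N/2 of them (rounded up) all share with one another.

```python
from itertools import combinations, combinations_with_replacement
from math import ceil


def shares(p, q):
    """Half-identical sharing: at least one strand label in common."""
    return bool(set(p) & set(q))


def has_mutual_group(people, size):
    """True if some `size` of the people all share with one another."""
    return any(
        all(shares(p, q) for p, q in combinations(group, 2))
        for group in combinations(people, size)
    )


def check_n_way_pigeonhole(n, labels="xyab"):
    """Test the principle for every choice of n people sharing with ('x', 'y')."""
    person = ("x", "y")
    genotypes = list(combinations_with_replacement(labels, 2))
    matches = [g for g in genotypes if shares(person, g)]
    return all(
        has_mutual_group(others, ceil(n / 2))
        for others in combinations_with_replacement(matches, n)
    )


print(all(check_n_way_pigeonhole(n) for n in range(1, 6)))
```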


 

Finally, we have a stronger pigeonhole principle for the special case of a male sharing on the X chromosome.

Men have one X chromosome; women have two X chromosomes. This is the reason for analyzing sharing on the X chromosome specially. In the case of a male, there is no difference between half-identical sharing and fully identical sharing when looking at regions on the X chromosome. In the case of a female, "sharing" refers to the usual half-identical sharing.

 

Pigeonhole Principle for Sharing with the Male X Chromosome:
If a male shares on the X chromosome with a number of other people, then they all share with one another; in fact, they all share a common strand.

Note: For a female sharing on the X chromosome, there isn't a strengthened version; the general formulation that applies to all chromosomes is the best we can do.

Sunday, July 11, 2010

Combinatorics 2: Non‑transitivity

This is the second in a series of 7 posts on sharing combinatorics.

 

Sharing isn't transitive, as I've already mentioned: it's quite possible for Person A to share a segment with Person B, and for Person B to share that same segment with Person C, but for Person A not to share with Person C.

For example, Person A and Person B could have the same bases 1,000,000 to 2,000,000 on one copy each of Chromosome 1. And Person B and Person C could also share the same bases 1,000,000 to 2,000,000 on one copy each of Chromosome 1. But we could be talking about the two different Chromosome 1's in Person B:


As you can see, Person A doesn't share on that segment with Person C — even though A shares with B and B shares with C.

After all, if A is related to B's mother, and C is related to B's father, there's no reason to expect that Person A and Person C should be related to each other.
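The same situation can be written out as a tiny sketch. I'm representing each person, on the segment in question, as a pair of strand labels, and saying that two people share when they have a label in common. The labels themselves are invented: "m" stands for the copy B got from his mother and "f" for the copy from his father.

```python
def shares(p, q):
    """Half-identical sharing: at least one strand label in common."""
    return bool(set(p) & set(q))


person_a = ("m", "a2")  # A matches B's maternal copy
person_b = ("m", "f")   # B's two copies of Chromosome 1 on this segment
person_c = ("f", "c2")  # C matches B's paternal copy

print(shares(person_a, person_b))  # True
print(shares(person_b, person_c))  # True
print(shares(person_a, person_c))  # False
```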

We'll see that, even though sharing isn't transitive, there are some combinatorial laws that do apply to genome sharing, including a transitive-like law.

Tuesday, July 6, 2010

Combinatorial Principles for DNA Sharing, 1 of 7

This is the first in a series of 7 posts on sharing combinatorics.

 

This series of posts is about several combinatorial principles that apply to DNA sharing among three or more people.

For a brief introduction to sharing and half-identical regions, please see Basics: Sharing and Half-Identical Regions.
 

Sharing isn't transitive. You can find instances on 23andMe where somebody shares on the same region with two different people, but those two people don't share with one another.

So, if Alice shares with both Bob and Charlie (on the same region), we can't infer that Bob and Charlie must share. But what if Alice also shares with a third person on the same region—Bob, Charlie, and Deborah? Can we reach any conclusion about Bob, Charlie, and Deborah?

Here's another way to look at this: If Alice shares with Bob and Bob shares with Charlie (on the same region), we can't conclude that Alice shares with Charlie. But what if Alice shares with two people, both of whom share with Charlie? Or three people? Or five people?

Or, what if we just have a number of people, all of whom share with one another on the same region? Can anything be concluded from that?

It turns out that there are some combinatorial principles that apply to sharing, even though simple transitivity fails to hold.