AncestryDNA has genotyped over two million human samples, making it the largest consumer DNA database. Our research team is tasked with analyzing these data and finding new ways to use them to connect people with their personal ancestry.
I will discuss the machine learning problems involved in connecting people to their ancestors at three different levels. The first problem is estimating a person's admixture with respect to several established populations, typically from hundreds or thousands of years ago. This problem has received considerable attention in the population genetics community, and I will briefly describe a new approach that uses models of haplotype frequency inferred from data. The second problem is estimating much more recent ancestry, such as particular migrations from Europe to settlements in America. This is a clustering problem based on DNA shared between individuals. Finally, I will discuss using pedigrees to connect people, through their DNA, to families with specific named ancestors.