vallejuelo serie 108: Admixed

Unravelling the hidden ancestry of American admixed populations

Francesco Montinaro1,2, George B.J. Busby2,3, Vincenzo L. Pascali1, Simon Myers3,4, Garrett Hellenthal5 & Cristian Capelli2

The movement of people into the Americas has brought different populations into contact, and contemporary American genomes are the product of a range of complex admixture events. Here we apply a haplotype-based ancestry identification approach to a large set of genome-wide SNP data from a variety of American, European and African populations to determine the contributions of different ancestral populations to the Americas. Our results provide a fine-scale characterization of the source populations, identify a series of novel, previously unreported contributions from Africa and Europe and highlight geohistorical structure in the ancestry of American admixed populations.

DOI: 10.1038/ncomms7596 OPEN

1 Institute of Legal Medicine, Catholic University, Largo F. Vito 1, Rome 00168, Italy. 2 Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, UK. 3 Wellcome Trust Center for Human Genetics, Roosevelt Drive, Oxford OX3 7BN, UK. 4 Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK. 5 UCL Genetics Institute, University College London, WC1E 6BT Gower Street, UK. Correspondence and requests for materials should be addressed to C.C. (email: cristian.capelli@zoo.ox.ac.uk).

NATURE COMMUNICATIONS | 6:6596 | DOI: 10.1038/ncomms7596 | www.nature.com/naturecommunications
© 2015 Macmillan Publishers Limited. All rights reserved.

The genetic make-up of the Americas has been significantly shaped by the Colonial Era and the Atlantic slave trade. Given its historical and epidemiological implications, the estimation of the genetic ancestry of admixed American populations has been the subject of much attention1–5.
However, despite historical evidence suggesting a wide heterogeneity in the European and African ancestry composition, sources have often been identified in terms of macrogeographic areas (for example, Southern versus Northern Europe) or by single populations as ‘consensus’ continental sources (for example, Yoruba from Nigeria for the whole of Africa). More recently, a significant contribution by the Spaniards has been highlighted for Caribbean and Southern American groups4,5. However, these methods, based on the local ancestry at a continental scale, make the identification of multiple sources from the same continent challenging.

In order to obtain a finer characterization of the ancestry landscape of admixed American populations, we implemented a novel inference method that reconstructs local genomic ancestry using a haplotype-based approach6,7. It has been shown in previous investigations6–8 that approaches based on haplotypes allow for a finer reconstruction of genetic structure when compared with classical approaches that directly employ single-marker genotypes, and that they are characterized by a lower degree of bias due to the ascertainment process of the polymorphisms studied9.

We applied this methodology to genome-wide single-nucleotide polymorphism (SNP) data from more than 2,500 individuals collected from various putatively admixed American and Caribbean populations. We compared the DNA of these ‘recipient’ groups to that of a cross-section of world-wide ‘donor’ populations that act as surrogates for the true ancestral source groups (Fig. 1, Supplementary Table 1), generating a detailed description of the genomic contribution of these groups to admixed American populations.

Results

Clustering of donor populations.
In order to minimize the impact of within-source genetic heterogeneity in the ancestry characterization process, we partitioned the 1,414 individuals from 42 population-label donors into genetically homogeneous clusters using a CHROMOPAINTER and fineSTRUCTURE analysis as described in the Methods section. This identified 78 clusters (Fig. 2, Supplementary Table 2) related by a hierarchical tree, with a broad correlation between clusters and geographic origin, allowing the grouping of clusters into 13 groups within Europe, Africa and East Asia/America (Fig. 2; Supplementary Table 2).

African individuals are divided into 33 clusters. Populations from West Africa showed a high degree of homogeneity, with all the Yoruba individuals from Nigeria forming a single cluster and the Mandenka from Senegal grouped into two. Individuals from Eastern and Southern Africa were distributed across 20 different clusters from three different regions (East Africa, South Africa and South West Africa), perhaps because of the complex demographic histories of populations from these areas10–12. In our collection of donor individuals, South-Central Africa is represented only by Bantu-speaking individuals from South Africa, while the South West Africa and the East Africa region clusters are represented exclusively by Herero and a Bantu speaker from South Africa (one individual from the HGDP data set13) and Bantu speakers from Kenya, respectively. Interestingly, one of the Herero individuals clusters together with Sandawe individuals instead of the other Herero individuals. Pygmies, Sandawe and San (Khoisan/Pygmies14) were separated into clusters, essentially according to their population labels, although with some labelled groups differentiated into multiple clusters (Fig. 2, Supplementary Table 2).

European individuals are differentiated into 37 clusters that we grouped into six geographic regions (Fig. 2, Supplementary Table 2).
As previously reported, Sardinians and Basques formed population-specific groups15,16. Notably, by implementing the haplotype-based approach we were able to (i) detect eight individuals who are more related to the Basque population than to the Spanish individuals, within the Spanish data set included in the 1000 Genomes Project panel (cluster ‘Basque 1’)17, probably reflecting Basque ancestry, and to (ii) differentiate them from the French Basque population included in the HGDP data set13 (cluster ‘Basque 2’). We identified five Spanish clusters (‘SW Europe’; two of them also including a single French individual), highlighting the presence of a non-negligible heterogeneity in the country18. The South-Eastern Europe group (‘SE Europe’) contains 10 clusters composed of individuals from Romania, Cyprus, Italy (excluding Sardinia), Bulgaria, Greece and France (one individual). Notably, Italian individuals are distributed into four different clusters according to their geographic origin (Supplementary Table 2). A North-Western Europe group (‘NW Europe’) consists of eight clusters comprising individuals from the British Isles, Orkney Islands, Norway, France, Germany and Austria.

Figure 1 | Approximate geographic sampling location of donor and recipient populations analysed. Colours refer to the 13 groups as described in Fig. 2 and Supplementary Table 2. Circles and diamonds refer, respectively, to donors and recipients. Donor groups: Basque, NE Europe, SE Europe, NW Europe, SW Europe, Sardinia, S Africa, E Africa, W Africa, SW Africa, East Asia/America, Senegambia, Khoisan/Pygmies. Recipients: Dominican R., Colombia A, Colombia B, Barbados, African-American A, African-American B, Ecuador, Maya, Mexico, Peru, Puerto Rico A, Puerto Rico B.
Similarly to the Basque populations, our approach clusters 23 individuals in a clade containing members of the Orcadian sample from the HGDP13. The North-Eastern Europe group (‘NE Europe’) is composed of eight clusters including individuals from Lithuania, Poland, Belarus, Hungary, Russia, Germany, Austria, Finland and Norway.

Native American and East Asian (China) individuals are grouped into eight clusters, each exclusively containing individuals from the same labelled sample. These results confirm the extent of genetic structure in Africa and Europe, and provide a number of potential donor groups to the present-day American populations.

Ancestry composition of the American populations. We fit each of the American admixed populations as a mixture of the identified donor groups19 (see Methods, Supplementary Data 1). The contribution to the American admixed populations for the 23 most representative clusters and macro-areas is reported in Fig. 3 and Supplementary Fig. 1. This analysis assumes that haplotypes from the admixing populations are well represented within a mixture of present-day sampled groups. We were concerned that the demographic and evolutionary complexity of the peopling of the Americas20, coupled with the high genetic drift among Native American populations, might make the identification of the Native American contribution challenging. In particular, the true admixing groups from this region might be highly drifted from the possible ‘donor’ groups sampled, particularly given our geographically relatively sparse sample of such donor groups. To reduce this effect we always allowed a single well-sampled East Asian group (China) as a potential donor in the analysis, to act as a surrogate for haplotypes carried by any Native American donor population incompletely captured as a mixture of sampled Native American groups.
Because this donor group is still likely to be strongly drifted relative to this East Asian ‘surrogate’, we also repeated our analysis after ‘masking’ direct copying of the China population in the mixture-fitting step, although we still allowed all groups to contribute in the mixture. We compared the continental ancestry contributions from the full painting and the East Asian masked painting with an ADMIXTURE21 analysis performed at K = 3 (Supplementary Figs 2 and 3), which closely matches the Africa, Europe and Asia/Native Americans partition. Continental ancestry estimates are highly correlated (P value < 10^-12) between all three approaches (Supplementary Fig. 2), although the masking procedure reduced the squared distance between our continental ancestry estimates and those from ADMIXTURE21 by 5.4-fold for Europe and 7.9-fold for Asia/Native Americans, suggesting a slight gain in accuracy using this procedure. No major difference is seen for African contributions, while identified donor populations contributing to the mixture were very similar in both approaches; therefore, we henceforth report results on the basis of the masking procedure (Fig. 3).

Estimated African ancestry ranges from virtually 0 (Maya) to 0.87 (Barbados) across the analysed populations. Caribbean populations show a higher African component than Southern American ones, consistent with historical records that documented a larger number of slaves in the Caribbean Islands22,23. Although our sampling of Africans is incomplete, we see variation among groups in similarity to present-day populations from different parts of Africa. In all groups, the Yoruba from West Africa are the largest contributor, confirming this region as the major component of African slaves1,2,4.
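The mixture-fitting idea can be illustrated with a minimal sketch. The Fig. 3 caption states that contributions were estimated with a non-negative least squares approach; everything else here is an assumption made for illustration. The donor names and copying profiles are invented toy values, the function name `fit_mixture` is hypothetical, and the solver is a plain projected gradient descent rather than whatever routine the authors actually used.

```python
def fit_mixture(donor_profiles, recipient, iterations=2000):
    """Fit recipient ~ sum_j x_j * donor_j with x_j >= 0, then scale x to sum to 1.

    Solved by projected gradient descent (an illustrative stand-in for a
    dedicated NNLS routine).
    """
    names = list(donor_profiles)
    n_dim = len(recipient)
    # Step size 1 / trace(A^T A), which never exceeds 1 / (largest eigenvalue).
    step = 1.0 / sum(v * v for name in names for v in donor_profiles[name])
    coef = {name: 1.0 / len(names) for name in names}
    for _ in range(iterations):
        # Residual of the current fit: A x - b.
        residual = [
            sum(coef[name] * donor_profiles[name][i] for name in names) - recipient[i]
            for i in range(n_dim)
        ]
        # Gradient step, then projection onto the non-negative orthant.
        for name in names:
            grad = sum(donor_profiles[name][i] * residual[i] for i in range(n_dim))
            coef[name] = max(0.0, coef[name] - step * grad)
    total = sum(coef.values())
    if total == 0.0:
        raise ValueError("all coefficients were driven to zero")
    return {name: value / total for name, value in coef.items()}


# Toy copying profiles (invented numbers, not data from the study).
DONORS = {
    "donor_A": [0.70, 0.10, 0.10, 0.10],
    "donor_B": [0.10, 0.70, 0.10, 0.10],
    "donor_C": [0.10, 0.10, 0.70, 0.10],
}
# A recipient built as 0.6 * donor_A + 0.3 * donor_B + 0.1 * donor_C.
RECIPIENT = [0.46, 0.28, 0.16, 0.10]

mixture = fit_mixture(DONORS, RECIPIENT)
print({name: round(value, 3) for name, value in mixture.items()})
```

In this toy setting, "masking" a donor would amount to changing which entries of the profiles are used in the fit; the sketch does not attempt to reproduce that step.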
However, our fine-scale analysis suggests additional genetic contributions from populations from other parts of Africa, with contributions from particular groups sampled in Senegambia (the Mandenka), Southern (South African Bantu language speakers) and Eastern Africa (Kenyan Bantu language speakers) identified in 6 of the 12 populations we investigated. Historical reports indicate that Senegambia and South-Eastern Africa contributed an average of 6 and 4% of all disembarked slaves to the Americas (totalling several hundred thousand individuals), respectively, with ethnic groups from Senegal and Mozambique being among the 10 most prominent according to slavery documentation22.

Figure 2 | fineSTRUCTURE clustering of the analysed individuals. Tree of the analysed individuals pooled in 78 clusters as inferred by fineSTRUCTURE. Colours follow macro-area affiliations as in Fig. 1.
In addition, more than 30% of the total slaves arriving in mainland Spanish America up to the 1630s came from Senegambia23, and we accordingly find that the relative contribution from the Mandenka is higher in all areas historically under Spanish rule (Fig. 4).

The degree of resolution in the identification of the sources provided by our approach is also evident in the fine characterization of the European component, which ranges between 0.078 (Barbados) and 0.79 (Puerto Rico). We specifically identify Spaniards among other available Southern European populations as the most represented European source for all nine Hispanic/Latino populations. In contrast, the most represented European sources in the Afro-Americans and Barbadians were Great Britain clusters (Figs 3 and 4a), in full agreement with historical records24,25; a small amount of Spanish ancestry is also inferred in these groups. Interestingly, among the Spaniards, two clusters do not contribute to any of the analysed populations, presumably reflecting a differential contribution of Iberian regions to the genetic pool of American populations.

Among smaller genetic contributions, we identify for the first time a genetic signature of Basque ancestry in five (out of six) of the Continental South American populations, ranging from 0.015 in the Maya population to 0.07 in Colombia. It has been documented that Basque individuals were a considerable fraction of Spanish immigrants in the XVI and XVII centuries, especially to Mexico, Cuba, Chile, Peru and Colombia26. These results could explain, at least in part, the recently observed structure in the Spanish component of the Continental but not Caribbean populations4. Among the remaining European clusters the most represented, contributing to five of the analysed populations, is composed of individuals from South Italy and Sicily. This might indicate a minor contribution from the Italian peninsula as documented in historical records27. Interestingly, we also identified a considerable fraction of French ancestry in one African-American sample, in agreement with French immigration into the Southern United States during colonial times28,29.

Figure 3 | Contribution of the 23 most informative clusters inferred by fineSTRUCTURE to the analysed recipient populations. Contribution of the most informative 23 clusters to the American and Caribbean populations estimated using the non-negative least square approach. Standard error based on jack-knife resampling (22 replicates) is reported. Panels: Ecuador, Maya, Mexico, Peru, Puerto Rico A, Puerto Rico B, Barbados, Colombia A, Colombia B, Dominican R., African-American A, African-American B. Vertical axis: relative contribution. Clusters shown: South Africa Bantu 2, South Africa Bantu 3, Bantu Kenya 1, Bantu Kenya 3, Bantu Kenya 4, Herero 4, Mandenka 2, Yoruba, Cyprus, South Italy, Bulgaria 2, Great Britain 4, France, Russia, Colombia 2, Surui, Karitiana 2, Pima 1, China, Spain 1, Spain 2, Spain 3, Basque 1.
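The Fig. 3 caption reports standard errors from jack-knife resampling with 22 replicates. A minimal sketch of a delete-one jack-knife standard error follows. It assumes, without confirmation from the text, that each replicate leaves out one of the 22 autosomes and that replicates are weighted equally; the replicate values are invented and the function name `jackknife_se` is hypothetical.

```python
import math


def jackknife_se(leave_one_out_estimates):
    """Delete-one jack-knife standard error from n leave-one-out estimates.

    SE = sqrt((n - 1) / n * sum_i (theta_i - mean(theta))^2)
    """
    n = len(leave_one_out_estimates)
    if n < 2:
        raise ValueError("at least two replicates are required")
    mean = sum(leave_one_out_estimates) / n
    squared_deviations = sum((est - mean) ** 2 for est in leave_one_out_estimates)
    return math.sqrt((n - 1) / n * squared_deviations)


# 22 invented leave-one-chromosome-out estimates of one cluster's contribution.
replicates = [0.30 + 0.001 * (i % 5) for i in range(22)]
standard_error = jackknife_se(replicates)
print(f"mean = {sum(replicates) / len(replicates):.4f}, SE = {standard_error:.4f}")
```

If chromosomes were weighted by length or SNP count, a weighted block jack-knife would be the natural variant; the equal-weight form above is only the simplest case.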
At the individual level, the analysis highlights a high heterogeneity in several analysed populations (Supplementary Fig. 4), as expected given recent admixture. This is particularly evident in the African-American populations, in which, for the African ancestry, the inferred contributions of Mandenka and W Africa range from 0 to 35% and 0 to 100%, respectively. For the European contribution, a few individuals possessed a high degree of inferred Spanish (95% confidence interval (CI) 0–0.27) or Italian ancestry (95% CI 0–0.14), while global Native American ancestry varies from 0 to 65%.

Clusters versus population-label-based ancestry reconstruction. We explored the variation in ancestry determination when using a population-label-based approach instead of a clustering-based one by comparing estimates obtained using the same set of source individuals but grouped in different ways (Supplementary Fig. 5). Population labels might mask contributions by, for example, falsely grouping genetically distinct donor populations with different actual contributions to an admixed population. In accordance with this concern, although results were mainly similar, the label-based approach inferred the French population (partially replacing Great Britain) as the major source for the African-American and Barbados samples and no longer detected the Basques as a source population. A more refined ancestry depiction by a cluster-based approach is not unexpected for the European sources, given the population stratification following the complex ancient and more recent admixture history of the continent7,13,30,31. These results indicate that using fine-scale genetics-based clustering methods on the basis of phased data to replace or supplement sample-based labels can strongly improve the resolution of ancestry reconstruction.

Analysis of relative ancestry composition.
We used a hierarchical clustering algorithm on the basis of the Euclidean distances between relative ancestry proportions to explore the dissimilarities in source composition across admixed populations (Fig. 4) and constructed the 80% consensus tree of 1,000 simulated data sets (see Methods section). Clustering based on European components broadly supports two groups of recipient populations: one containing Afro-Americans and Barbadians, the other containing all of the remaining populations (Fig. 4a). Notably, these clusters match the English and Spanish colonies in the Americas and reflect geohistorical differences in the migration pattern from the Northern hemisphere23 (Voyages: The Trans-Atlantic Slave Trade Database: http://www.slavevoyages.org/tast/assessment/estimates.faces) as suggested by their different European source composition (Fig. 4a). In addition, the Caribbean Islands Puerto Rico and Dominican Republic tend to cluster together, probably reflecting a different migration pattern between Caribbean and mainland America. On the other hand, no particular clustering, apart from that between the two African-American groups, emerges when the African relative composition is considered, reflecting the complexity of the slave trade dynamics (Fig. 4b).

Discussion

Our results provide new insights into the genetic make-up of American populations, highlighting the underappreciated heterogeneity of ancestral components across American populations and the power of haplotype-based analytical techniques in identifying fine-scale ancestry without strong prior assumptions. The application of this approach to additional admixed populations (for example, Brazilians) and the inclusion of more sources, particularly from Africa and the Americas, are expected to further clarify the complexity of the ancestry composition of the American continent.

Methods

Data set.
We assembled from the literature a data set composed of 4,139 individuals from 64 populations sampled from Europe, Africa, East Asia (represented by a single sample from China) and the Americas, genotyped with different Illumina platforms (Supplementary Table 1). The data set was filtered using PLINK ver. 1.07 (ref. 32) to retain only SNPs and individuals with genotyping success rate >98%, retaining 250,800 autosomal markers. We screened the pruned data set using KING33 and removed individuals with a kinship parameter higher than 0.0884 as potentially related, as indicated in the software’s manual. The final data set is composed of 3,960 individuals from 64 populations. Of these, 12 were treated as ‘recipients’ (African-American A, African-American B, Barbados, Colombia A, Colombia B, Dominican Republic, Ecuador, Maya, Mexico, Peru, Puerto Rico A and Puerto Rico B), and the remaining 52 as donors, as described below.

Phasing. The data set was phased using the Segmented Haplotype Estimation and Imputation tool ver. 2 (ShapeIT) software34, which improves the Hidden Markov
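
As an illustration of the Results analysis of relative ancestry composition (Fig. 4), the following minimal sketch clusters recipient populations hierarchically from the Euclidean distances between their relative ancestry proportions. The text does not state the linkage criterion, so average linkage is assumed here; the population names and proportions are invented, the function names are hypothetical, and the 80% consensus tree over 1,000 simulated data sets is not reproduced.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two vectors of relative ancestry proportions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = {name: [name] for name in profiles}  # cluster label -> member names
    merges = []

    def cluster_distance(a, b):
        # Average of all pairwise distances between members of the two clusters.
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(euclidean(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, cluster_distance(a, b)))
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
    return merges


# Invented relative European source proportions (three hypothetical sources).
PROFILES = {
    "pop_W": [0.80, 0.10, 0.10],
    "pop_X": [0.75, 0.15, 0.10],
    "pop_Y": [0.10, 0.80, 0.10],
    "pop_Z": [0.20, 0.70, 0.10],
}

merges = average_linkage(PROFILES)
for left, right, dist in merges:
    print(f"merge {left} + {right} at distance {dist:.3f}")
```

With these toy values the two populations sharing the first dominant source merge first, the two sharing the second merge next, and the two pairs join last, which mirrors the kind of two-group structure described for the European components.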

Wednesday, 11 November 2015
