Wednesday, 11 November 2015
Admixed
Unravelling the hidden ancestry of American
admixed populations
Francesco Montinaro1,2, George B.J. Busby2,3, Vincenzo L. Pascali1, Simon Myers3,4, Garrett Hellenthal5 & Cristian Capelli2
The movement of people into the Americas has brought different populations into contact,
and contemporary American genomes are the product of a range of complex admixture
events. Here we apply a haplotype-based ancestry identification approach to a large set of
genome-wide SNP data from a variety of American, European and African populations to
determine the contributions of different ancestral populations to the Americas. Our results
provide a fine-scale characterization of the source populations, identify a series of novel,
previously unreported contributions from Africa and Europe and highlight geohistorical
structure in the ancestry of American admixed populations.
DOI: 10.1038/ncomms7596 OPEN
1 Institute of Legal Medicine, Catholic University, Largo F. Vito 1, Rome 00168, Italy. 2 Department of Zoology, University of Oxford, South Parks Road, Oxford
OX1 3PS, UK. 3 Wellcome Trust Center for Human Genetics, Roosevelt Drive, Oxford OX3 7BN, UK. 4 Department of Statistics, University of Oxford, 1 South
Parks Road, Oxford OX1 3TG, UK. 5 UCL Genetics Institute, University College London, WC1E 6BT Gower Street, UK. Correspondence and requests for
materials should be addressed to C.C. (email: cristian.capelli@zoo.ox.ac.uk).
NATURE COMMUNICATIONS | 6:6596 | DOI: 10.1038/ncomms7596 | www.nature.com/naturecommunications
© 2015 Macmillan Publishers Limited. All rights reserved.
The genetic make-up of the Americas has been significantly
shaped by the Colonial Era and the Atlantic slave trade.
Given its historical and epidemiological implications, the
estimation of the genetic ancestry of admixed American populations
has been the subject of much attention1–5. However, despite
historical evidence suggesting a wide heterogeneity in the European
and African ancestry composition, sources have often been
identified in terms of macrogeographic areas (for example,
Southern versus Northern Europe) or by single populations as
‘consensus’ continental sources (for example, Yoruba from Nigeria
for the whole of Africa). More recently, a significant contribution
by the Spaniards has been highlighted for Caribbean and Southern
American groups4,5. However, the methods used in these studies, which infer local
ancestry at a continental scale, make the identification of multiple
sources from the same continent challenging.
In order to obtain a finer characterization of the ancestry
landscape of admixed American populations, we implemented a
novel inference method that reconstructs local genomic ancestry
using a haplotype-based approach6,7. It has been shown in
previous investigations6–8 that approaches based on haplotypes
allow for a finer reconstruction of genetic structure when
compared with classical approaches that directly employ single-marker
genotypes, and that they are characterized by a lower
degree of bias due to the ascertainment process of the
polymorphisms studied9. We applied this methodology to
genome-wide single-nucleotide polymorphism (SNP) data from
more than 2,500 individuals collected from various putatively
admixed American and Caribbean populations. We compared the
DNA of these ‘recipient’ groups to that of a cross-section of
world-wide ‘donor’ populations that act as surrogates for the true
ancestral source groups (Fig. 1, Supplementary Table 1),
generating a detailed description of the genomic contribution of
these groups to admixed American populations.
Results
Clustering of donor populations. In order to minimize the
impact of within-source genetic heterogeneity in the ancestry
characterization process, we partitioned the 1,414 individuals
from 42 population-label donors into genetically homogeneous
clusters using a CHROMOPAINTER and fineSTRUCTURE
analysis as described in the Methods section. This identified 78
clusters (Fig. 2, Supplementary Table 2) related by a hierarchical
tree, with a broad correlation between clusters and geographic
origin, allowing the grouping of clusters in 13 groups within
Europe, Africa and East Asia/America (Fig. 2; Supplementary
Table 2).
African individuals are divided among 33 clusters. Populations
from West Africa showed a high degree of homogeneity, with all
the Yoruba individuals from Nigeria forming a single cluster and
the Mandenka from Senegal grouped into two. Individuals from
Eastern and Southern Africa were distributed across 20 different
clusters from three different regions (East Africa, South Africa
and South West Africa), perhaps because of the complex
demographic histories of populations from these areas10–12. In
our collection of donor individuals, South-Central Africa is
represented only by Bantu-speaking individuals from South
Africa, while the South West Africa and the East Africa region
clusters are represented exclusively by Herero and a Bantu
speaker from South Africa (one individual from the HGDP data
set13) and Bantu speakers from Kenya, respectively. Interestingly,
one of the Herero individuals clusters together with Sandawe
individuals instead of the other Herero individuals.
Pygmies, Sandawe and San (Khoisan/Pygmies14) were
separated into clusters, essentially according to their population
labels, although with some labelled groups differentiated into
multiple clusters (Fig. 2, Supplementary Table 2).
European individuals are differentiated into 37 clusters that
we grouped into six geographic regions (Fig. 2, Supplementary
Table 2).
As previously reported, Sardinians and Basques formed
population-specific groups15,16. Notably, by implementing the
haplotype-based approach we were able to (i) detect eight
individuals, within the Spanish data set included in the 1000
Genomes Project panel17, who are more related to the Basque
population than to the other Spanish individuals (cluster
‘Basque 1’), probably reflecting Basque ancestry, and to
(ii) differentiate them from the French Basque population
included in the HGDP data set13 (cluster ‘Basque 2’). We
identified five Spanish clusters (‘SW Europe’; two of them also
including a single French individual), highlighting the presence
of non-negligible heterogeneity in the country18.
Figure 1 | Approximate geographic sampling location of donor and recipient populations analysed. Colours refer to the 13 groups as described
in Fig. 2 and Supplementary Table 2. Circles and diamonds refer, respectively, to donors and recipients.
Legend, donor groups: Basque, NE Europe, SE Europe, NW Europe, SW Europe, Sardinia, S Africa, E Africa, W Africa, SW Africa, East Asia/America, Senegambia, Khoisan/Pygmies.
Legend, recipients: Dominican R., Colombia B, Colombia A, Barbados, African-American A, African-American B, Ecuador, Maya, Mexico, Peru, Puerto Rico A, Puerto Rico B.
The South-Eastern Europe group (‘SE Europe’) contains
10 clusters composed of individuals from Romania, Cyprus,
Italy (excluding Sardinia), Bulgaria, Greece and France (one
individual). Notably, Italian individuals are distributed into
four different clusters according to their geographic origin
(Supplementary Table 2).
A North-Western Europe group (‘NW Europe’) consists of
eight clusters comprising individuals from the British Isles, Orkney
Islands, Norway, France, Germany and Austria. Similarly to the
Basque populations, our approach clusters 23 individuals in a
clade containing members of the Orcadian sample from the
HGDP13.
The North-Eastern Europe group (‘NE Europe’) is composed
of eight clusters including individuals from Lithuania, Poland,
Belarus, Hungary, Russia, Germany, Austria, Finland and
Norway. Native American and East Asian (China) individuals
are grouped into eight clusters, each exclusively containing
individuals from the same labelled sample. These results confirm
the extent of genetic structure in Africa and Europe, and provide
a number of potential donor groups to the present-day American
populations.
Ancestry composition of the American populations. We fit each
of the American admixed populations as a mixture of the
identified donor groups19 (see Methods, Supplementary Data 1).
The contribution to the American admixed populations for the 23
most representative clusters and macro-areas is reported in Fig. 3
and Supplementary Fig. 1. This analysis assumes that haplotypes
from the admixing populations are well represented within a
mixture of present-day sampled groups. We were concerned that
the demographic and evolutionary complexity of the peopling of
the Americas20, coupled with the high genetic drift among Native
American populations, might make the identification of the
Native American contribution challenging. In particular, the true
admixing groups from this region might be highly drifted from
the possible ‘donor’ groups sampled, particularly given our
geographically relatively sparse sample of such donor groups. To
reduce this effect we always allowed a single well-sampled East
Asian group (China) as a potential donor in the analysis, to act as
a surrogate for haplotypes carried by any Native American donor
population incompletely captured as a mixture of sampled Native
American groups. Because this donor group is still likely to be
strongly drifted relative to this East Asian ‘surrogate’, we also
repeated our analysis after ‘masking’ direct copying from the China
population in the mixture-fitting step, although we still allowed
all groups to contribute in the mixture. We compared the
continental ancestry contributions from the full painting and the
East Asian masked painting with an ADMIXTURE21 analysis
performed at K = 3 (Supplementary Figs 2 and 3), which closely
matches the Africa, Europe and Asia/Native Americans partition.
Continental ancestry estimates are highly correlated (P value
< 10^-12) between all three approaches (Supplementary Fig. 2),
although the masking procedure reduced the squared distance
between our continental ancestry estimates and those from
ADMIXTURE21 by 5.4-fold for Europe and 7.9-fold for
Asia/Native Americans, suggesting a
slight gain in accuracy. No major difference
is seen for African contributions, while identified donor
populations contributing to the mixture were very similar in
both approaches; therefore, we henceforth report results on the
basis of the masking procedure (Fig. 3).
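The mixture-fitting step can be illustrated with a minimal sketch. It assumes that each donor cluster and each recipient population is summarized by a vector of copying proportions (an assumed input format), and it fits the recipient vector as a non-negative mixture of donor vectors whose coefficients sum to one. The projected-gradient solver, the function names, the `masked` argument (one reading of the masking step, in which the masked entries are dropped from every vector while all donors remain available in the mixture) and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) == 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, value in enumerate(u, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / i
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def fit_mixture(recipient, donors, masked=(), n_iter=2000):
    """Fit recipient ~ sum_k beta_k * donors[k], with beta >= 0 and sum(beta) == 1.

    recipient: list of copying proportions (hypothetical input format).
    donors: dict mapping donor name -> copying vector of the same length.
    masked: indices of copying-vector entries to drop before fitting.
    """
    masked_set = set(masked)
    keep = [i for i in range(len(recipient)) if i not in masked_set]
    names = list(donors)
    y = [recipient[i] for i in keep]
    cols = [[donors[name][i] for i in keep] for name in names]
    # Step size 1/L, with L bounded by the squared Frobenius norm of the design.
    lipschitz = sum(x * x for col in cols for x in col)
    beta = [1.0 / len(names)] * len(names)
    for _ in range(n_iter):
        fitted = [sum(b * col[j] for b, col in zip(beta, cols)) for j in range(len(y))]
        residual = [f - t for f, t in zip(fitted, y)]
        gradient = [sum(r * c for r, c in zip(residual, col)) for col in cols]
        beta = project_to_simplex([b - g / lipschitz for b, g in zip(beta, gradient)])
    return dict(zip(names, beta))


# Toy data (not from the paper): the recipient is 0.7 * donor_A + 0.3 * donor_B.
toy_donors = {
    "donor_A": [0.8, 0.1, 0.1],
    "donor_B": [0.1, 0.8, 0.1],
    "donor_C": [0.1, 0.1, 0.8],
}
toy_weights = fit_mixture([0.59, 0.31, 0.10], toy_donors)
# toy_weights is approximately {"donor_A": 0.7, "donor_B": 0.3, "donor_C": 0.0}
```

A dedicated non-negative least squares solver would normally replace the hand-written loop; the loop is used here only to keep the sketch free of dependencies.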
Estimated African ancestry ranges from virtually 0 (Maya) to
0.87 (Barbados) across the analysed populations.
Caribbean populations show a higher African component
than Southern American ones, consistent with historical records
that documented a larger number of slaves in the Caribbean
Islands22,23.
Although our sampling of Africans is incomplete, we see
variation among groups in similarity to present-day populations
from different parts of Africa. In all groups, the Yoruba from
West Africa are the largest contributor, confirming this region as
the major source of African slaves1,2,4. However, our fine-scale
analysis suggests additional genetic contributions from
populations from other parts of Africa, with contributions from
particular groups sampled in Senegambia (the Mandenka),
Southern (South African Bantu language speakers) and Eastern
Africa (Kenyan Bantu language speakers) identified in 6 out of 12
populations we investigated. Historical reports indicate that
Senegambia and South-Eastern Africa contributed an average of 6
and 4% of all disembarked slaves to the Americas (totalling
several hundred thousand individuals), respectively, with
ethnic groups from Senegal and Mozambique being among the
10 most prominent according to slavery documentation22. In
addition, more than 30% of the total slaves arriving in mainland
Spanish America up to the 1630s came from Senegambia23, and
we accordingly find that the relative contribution from the
Mandenka is higher in all areas historically under Spanish
rule (Fig. 4).
Figure 2 | fineSTRUCTURE clustering of the analysed individuals. Tree of the analysed individuals pooled in 78 clusters as inferred by fineSTRUCTURE.
Colours follow macro-area affiliations as in Fig. 1.
Tip labels (clusters): South Africa Bantu 1–11, San 1–3, Bantu Kenya 1–4, Herero 1–5, Sandawe 1–5, Mbuti Pygmy, Biaka Pygmy, Mandenka 1–2, Yoruba, Romania 1, Georgia, Cyprus, South Italy, North Italy, Tuscany 1–2, Bulgaria 2, Greece, Spain 1–5, Sardinia 1–3, Basque 1–2, Great Britain 1–4, Norway, France, Orkney 1–3, Lithuania, Poland, Hungary, Russia, Finland 1–4, Colombia 1–2, Surui, Karitiana 2, Pima 1–2, China, Bulgaria 1, Karitiana 1.
The degree of resolution in the identification of the sources
provided by our approach is also evident in the fine characterization
of the European component, which ranges between 0.078
(Barbados) and 0.79 (Puerto Rico). We specifically identify
Spaniards among other available Southern European populations
as the most represented European source for all nine Hispanic/
Latino populations. In contrast, the most represented European
sources in the Afro-Americans and Barbadians were Great Britain
clusters (Figs 3 and 4a), in full agreement with historical
records24,25; a small amount of Spanish ancestry is also inferred
in these groups. Interestingly, two of the Spanish clusters
do not contribute to any of the analysed populations, presumably
reflecting a differential contribution of Iberian regions to the genetic
pool of American populations.
Among smaller genetic contributions, we identify for the first
time a genetic signature of Basque ancestry in five (out of six) of
the Continental South American populations, ranging from
0.015 in the Maya population to 0.07 in Colombia. It has been
documented that Basque individuals were a considerable fraction
of Spanish immigrants in the XVI and XVII centuries, especially
to Mexico, Cuba, Chile, Peru and Colombia26. These results could
explain, at least in part, the recently observed structure in the
Spanish component of the Continental but not Caribbean
populations4.
Figure 3 | Contribution of the 23 most informative clusters inferred by fineSTRUCTURE to the analysed recipient populations. Contribution
of the most informative 23 clusters to the American and Caribbean populations estimated using the non-negative least squares approach. Standard error
based on jack-knife resampling (22 replicates) is reported.
Panels: Ecuador, Maya, Mexico, Peru, Puerto Rico A, Puerto Rico B, Barbados, Colombia A, Colombia B, Dominican R., African-American A, African-American B. y axis: relative contribution (0.0–0.6).
x axis (clusters): South Africa Bantu 2, South Africa Bantu 3, Bantu Kenya 1, Bantu Kenya 3, Bantu Kenya 4, Herero 4, Mandenka 2, Yoruba, Cyprus, South Italy, Bulgaria 2, Great Britain 4, France, Russia, Colombia 2, Surui, Karitiana 2, Pima 1, China, Spain 1, Spain 2, Spain 3, Basque 1.
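The jack-knife standard errors reported for Fig. 3 can be sketched as follows. The excerpt does not say how the 22 replicates are formed; the sketch assumes a delete-one scheme (for example, leaving out one autosome at a time) and uses the standard unweighted jack-knife formula. The function name and the input format are hypothetical.

```python
import math


def jackknife_standard_error(replicate_estimates):
    """Delete-one jack-knife standard error.

    replicate_estimates: one estimate of the same quantity per replicate,
    each computed with a different block of data left out (assumed scheme).
    """
    n = len(replicate_estimates)
    if n < 2:
        raise ValueError("need at least two jack-knife replicates")
    mean = sum(replicate_estimates) / n
    squared_deviations = sum((x - mean) ** 2 for x in replicate_estimates)
    # The (n - 1) / n factor accounts for the overlap between replicates.
    return math.sqrt((n - 1) / n * squared_deviations)
```

With 22 replicates, the function would be called once per donor cluster and recipient population, on the 22 leave-one-out mixture coefficients.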
Among the remaining European clusters the most represented,
contributing to five of the analysed populations, is composed of
individuals from South Italy and Sicily. This might indicate a
minor contribution from the Italian peninsula as documented
in historical records27. Interestingly, we also identified a
considerable fraction of French ancestry in one African-American
sample, in agreement with French immigration into
the Southern United States during colonial times28,29.
At the individual level, the analysis highlights a high
heterogeneity in several analysed populations (Supplementary
Fig. 4), as expected given recent admixture. This is particularly
evident in the African-American populations, in which, for the
African ancestry, the inferred contributions of Mandenka and W
Africa range from 0 to 35% and 0 to 100%, respectively. For the
European contribution, a few individuals possessed a high degree
of inferred Spanish (95% confidence interval (CI) 0–0.27) or
Italian ancestry (95% CI 0–0.14), while global Native American
ancestry varies from 0 to 65%.
Clusters versus population-label-based ancestry reconstruction.
We explored the variation in ancestry determination when using
a population-label-based approach instead of a clustering-based
one by comparing estimates obtained using the same set of source
individuals but grouped in different ways (Supplementary Fig. 5).
Population labels might mask contributions by, for example,
falsely grouping genetically distinct donor populations with
different actual contributions to an admixed population. In
accordance with this concern, although results were mainly
similar, the label-based approach inferred the French population
(partially replacing Great Britain) as the major source for the
African-American and Barbados samples and no longer detected
the Basques as a source population. A more refined ancestry
depiction by a cluster-based approach is not unexpected for the
European sources, given the population stratification following
the complex ancient and more recent admixture history of the
continent7,13,30,31. These results indicate that using fine-scale
genetics-based clustering methods on the basis of phased data to
replace or supplement sample-based labels can strongly improve
the resolution of ancestry reconstruction.
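The difference between the two grouping schemes can be illustrated with a toy sketch: the same individuals are averaged once by population label and once by genetic cluster. The per-individual vectors, the individual identifiers and the function name are hypothetical; only the label and cluster names echo those used in the text.

```python
def group_means(vectors, grouping):
    """Average per-individual vectors within each group.

    vectors: dict individual -> list of floats (hypothetical copying vector).
    grouping: dict individual -> group name (population label or cluster).
    """
    sums, counts = {}, {}
    for individual, vector in vectors.items():
        group = grouping[individual]
        if group not in sums:
            sums[group] = [0.0] * len(vector)
            counts[group] = 0
        sums[group] = [s + v for s, v in zip(sums[group], vector)]
        counts[group] += 1
    return {g: [s / counts[g] for s in sums[g]] for g in sums}


# Toy data: two individuals with the same label but distinct genetic profiles.
toy_vectors = {"ind_1": [0.9, 0.1], "ind_2": [0.1, 0.9]}
by_label = {"ind_1": "Spain", "ind_2": "Spain"}
by_cluster = {"ind_1": "Spain 1", "ind_2": "Basque 1"}

label_profiles = group_means(toy_vectors, by_label)      # one blended profile
cluster_profiles = group_means(toy_vectors, by_cluster)  # two distinct profiles
```

In the toy case the label-based average blends two distinct profiles into one intermediate donor, which is the kind of masking the paragraph above describes.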
Analysis of relative ancestry composition. We used a hierarchical
clustering algorithm on the basis of the Euclidean
distances between relative ancestry proportions to explore the
dissimilarities in source composition across admixed populations
(Fig. 4) and constructed the 80% consensus tree of 1,000
simulated data sets (see Methods section).
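A minimal sketch of this kind of analysis is given below. It rescales the contributions of a chosen set of sources (for example, the European ones) so that they sum to one, computes Euclidean distances between populations and merges them agglomeratively. The linkage rule is not stated in this excerpt, so average linkage is an assumption, as are the function names and the toy values; the consensus-tree step over simulated data sets is not shown.

```python
import math


def relative_proportions(contributions, sources):
    """Rescale the chosen source contributions so that they sum to one."""
    total = sum(contributions[s] for s in sources)
    return [contributions[s] / total for s in sources]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Agglomerative clustering; returns the merges in the order they occur.

    profiles: dict mapping population name -> relative ancestry vector.
    Each merge is (members_of_cluster_1, members_of_cluster_2, distance).
    """
    clusters = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pair_distances = [
                    euclidean(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ]
                d = sum(pair_distances) / len(pair_distances)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Toy contributions (not from the paper) for three hypothetical recipients.
european_sources = ["Spain", "Great Britain"]
toy = {
    "pop_A": {"Spain": 0.40, "Great Britain": 0.02, "Yoruba": 0.30},
    "pop_B": {"Spain": 0.30, "Great Britain": 0.03, "Yoruba": 0.10},
    "pop_C": {"Spain": 0.01, "Great Britain": 0.12, "Yoruba": 0.80},
}
toy_profiles = {name: relative_proportions(c, european_sources) for name, c in toy.items()}
toy_merges = average_linkage(toy_profiles)  # pop_A and pop_B merge first
```

Because the proportions are rescaled within the chosen sources, populations with very different total European ancestry can still cluster together when the composition of that ancestry is similar.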
Clustering based on European components broadly supports
two groups of recipient populations: one containing Afro-Americans
and Barbadians, the other containing all of the
remaining populations (Fig. 4a). Notably, these clusters match the
English and Spanish colonies in the Americas and reflect
geohistorical differences in the migration pattern from the
Northern hemisphere23 (Voyages: The Trans-Atlantic Slave
Trade Database: http://www.slavevoyages.org/tast/assessment/estimates.faces)
as suggested by their different European source
composition (Fig. 4a). In addition, the Caribbean Islands Puerto
Rico and Dominican Republic tend to cluster together, probably
reflecting a different migration pattern between Caribbean and
mainland America.
On the other hand, no particular clustering, apart from that of
the two African-American groups, emerges when the African
relative composition is considered, reflecting the complexity of
the slave trade dynamics (Fig. 4b).
Discussion
Our results provide new insights into the genetic make-up of
American populations, highlighting the underappreciated heterogeneity
of ancestral components across American populations
and the power of haplotype-based analytical techniques in
identifying fine-scale ancestry without strong prior assumptions.
The application of this approach to additional admixed populations
(for example, Brazilians) and the inclusion of more sources,
particularly from Africa and the Americas, are expected to further
clarify the complexity of the ancestry composition of the
American continent.
Methods
Data set. We assembled from the literature a data set composed of 4,139 individuals
from 64 populations sampled from Europe, Africa, East Asia (represented by a
single sample from China) and the Americas, genotyped with different Illumina
platforms (Supplementary Table 1). The data set was filtered using PLINK ver. 1.07
(ref. 32) to retain only SNPs and individuals with genotyping success rate >98%,
retaining 250,800 autosomal markers.
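The call-rate filter can be sketched on a toy genotype table. This is an illustration of the criterion only, not of how PLINK is invoked: the input format (a dictionary of genotype lists with `None` for missing calls), the function names and the choice to filter individuals before SNPs are assumptions.

```python
def call_rate(values):
    """Fraction of non-missing genotypes (None marks a missing call)."""
    return sum(v is not None for v in values) / len(values)


def filter_by_call_rate(genotypes, min_rate=0.98):
    """Keep individuals, then SNPs, whose call rate exceeds min_rate.

    genotypes: dict individual -> list of genotype calls (0, 1, 2 or None).
    Returns (kept_individuals, kept_snp_indices).
    """
    kept_individuals = [
        ind for ind, calls in genotypes.items() if call_rate(calls) > min_rate
    ]
    n_snps = len(next(iter(genotypes.values())))
    kept_snps = [
        j
        for j in range(n_snps)
        if kept_individuals
        and call_rate([genotypes[ind][j] for ind in kept_individuals]) > min_rate
    ]
    return kept_individuals, kept_snps


# Toy table (not real data); a low threshold is used because it is so small.
toy_genotypes = {
    "ind1": [0, 1, 2, None],
    "ind2": [1, 1, 0, None],
    "ind3": [2, 0, 1, 1],
    "ind4": [None, None, None, 1],
}
kept_individuals, kept_snps = filter_by_call_rate(toy_genotypes, min_rate=0.7)
# kept_individuals == ["ind1", "ind2", "ind3"], kept_snps == [0, 1, 2]
```

The order of the two filters matters: removing poorly genotyped individuals first can rescue SNPs that would otherwise fall below the threshold.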
We screened the pruned data set using KING33 to remove potentially related
individuals, defined as those with a kinship coefficient higher than 0.0884, as indicated in the
software’s manual. The final data set is composed of 3,960 individuals from
64 populations. Of these, 12 were treated as ‘recipients’ (African-American A,
African-American B, Barbados, Colombia A, Colombia B, Dominican Republic,
Ecuador, Maya, Mexico, Peru, Puerto Rico A and Puerto Rico B), and the
remaining 52 as donors, as described below.
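The relatedness screen can be sketched as a pruning step on pairwise kinship estimates. The text gives only the 0.0884 threshold; which member of a related pair was removed is not stated, so the greedy rule below (drop the individual involved in the most related pairs, ties broken by identifier) is an assumption, as are the function name and the input format.

```python
def prune_related(kinship, threshold=0.0884):
    """Greedily drop individuals until no pair exceeds the kinship threshold.

    kinship: dict mapping (id_1, id_2) -> estimated kinship coefficient.
    Returns the set of removed individuals.
    """
    related = [pair for pair, phi in kinship.items() if phi > threshold]
    removed = set()
    while related:
        degree = {}
        for a, b in related:
            degree[a] = degree.get(a, 0) + 1
            degree[b] = degree.get(b, 0) + 1
        # Remove the individual in the most related pairs; break ties by name.
        worst = min(degree, key=lambda ind: (-degree[ind], ind))
        removed.add(worst)
        related = [pair for pair in related if worst not in pair]
    return removed


# Toy kinship estimates (not real data).
toy_kinship = {("a", "b"): 0.25, ("b", "c"): 0.12, ("c", "d"): 0.01}
removed = prune_related(toy_kinship)  # {"b"}: removing b resolves both related pairs
```

Dropping the most connected individual first tends to retain more samples than removing one member of every related pair independently.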
Phasing. The data set was phased using the Segmented Haplotype Estimation and
Imputation tool ver. 2 (ShapeIT) software34, which improves the Hidden Markov