library(admix)
The clustering of populations following admixture models is, for now, based on the Ksample test theory (see (Milhaud and Vandekerkhove 2023). Consider \(K\) samples. For \(i=1,...,K\), sample \(X^{(i)} = (X_1^{(i)}, ..., X_{n_i}^{(i)})\) follows \[L_i(x) = p_i F_i(x) + (1p_i) G_i, \qquad x \in \mathbb{R}.\]
We still use IBM approach to perform pairwise hypothesis testing. The idea is to adapt the Ksample test procedure to obtain a datadriven method that cluster the \(K\) populations into \(N\) subgroups, characterized by a common unknown mixture component. The advantages of such an approach is twofold:
This clustering technique thus allows to cluster unobserved subpopulations instead of individuals. We call this algorithm the Ksample 2component mixture clustering (K2MC).
We now detail the steps of the algorithm.
If $H_0$ is not rejected then $S_1 = \{x,y\}$,\\
Else $S_1 = \{x\}$, $S_{c+1} = \{y\}$ and then $c=c+1$.
Select $u={\rm argmin}\{d(i,j); i\in S_c, j\in S\setminus \bigcup_{k=1}^c S_k\}$;
Test $H_0$ the simultaneous equality of all the $f_j$, $j\in S_c$ :\\
If $H_0$ not rejected, then put $S_c=S_c\bigcup \{u\}$;\\
Else $S_{c+1} = \{u\}$ and $c = c+1$.
We present a case study with 5 populations to cluster on \(\mathbb{R}^+\), with GammaExponential and GammaGamma mixtures.
## Simulate data (chosen parameters indicate 3 clusters (populations (1,3), (2,5) and 4)!):
< list(f1 = "gamma", g1 = "exp",
list.comp f2 = "gamma", g2 = "exp",
f3 = "gamma", g3 = "gamma",
f4 = "exp", g4 = "exp",
f5 = "gamma", g5 = "exp")
< list(f1 = list(shape = 16, rate = 4), g1 = list(rate = 1/3.5),
list.param f2 = list(shape = 14, rate = 2), g2 = list(rate = 1/5),
f3 = list(shape = 16, rate = 4), g3 = list(shape = 12, rate = 2),
f4 = list(rate = 1/2), g4 = list(rate = 1/7),
f5 = list(shape = 14, rate = 2), g5 = list(rate = 1/6))
< rsimmix(n=6000, unknownComp_weight=0.8, comp.dist = list(list.comp$f1,list.comp$g1),
A.sim comp.param = list(list.param$f1, list.param$g1))$mixt.data
< rsimmix(n=6000, unknownComp_weight=0.5, comp.dist = list(list.comp$f2,list.comp$g2),
B.sim comp.param = list(list.param$f2, list.param$g2))$mixt.data
< rsimmix(n=6000, unknownComp_weight=0.65, comp.dist = list(list.comp$f3,list.comp$g3),
C.sim comp.param = list(list.param$f3, list.param$g3))$mixt.data
< rsimmix(n=6000, unknownComp_weight=0.75, comp.dist = list(list.comp$f4,list.comp$g4),
D.sim comp.param = list(list.param$f4, list.param$g4))$mixt.data
< rsimmix(n=6000, unknownComp_weight=0.55, comp.dist = list(list.comp$f5,list.comp$g5),
E.sim comp.param = list(list.param$f5, list.param$g5))$mixt.data
## Look for the clusters:
< list(f1 = NULL, g1 = "exp",
list.comp f2 = NULL, g2 = "exp",
f3 = NULL, g3 = "gamma",
f4 = NULL, g4 = "exp",
f5 = NULL, g5 = "exp")
< list(f1 = NULL, g1 = list(rate = 1/3.5),
list.param f2 = NULL, g2 = list(rate = 1/5),
f3 = NULL, g3 = list(shape = 12, rate = 2),
f4 = NULL, g4 = list(rate = 1/7),
f5 = NULL, g5 = list(rate = 1/6))
< admix_clustering(samples = list(A.sim,B.sim,C.sim,D.sim,E.sim), n_sim_tab = 10,
clusters comp.dist=list.comp, comp.param=list.param, parallel=FALSE, n_cpu=2)
#>

  0%

====  8%

==============================  60%

========================================  80%

================================================== 100%
clusters#> Call:
#> admix_clustering(samples = list(A.sim, B.sim, C.sim, D.sim, E.sim),
#> n_sim_tab = 10, comp.dist = list.comp, comp.param = list.param,
#> parallel = FALSE, n_cpu = 2)
#>
#> The number of populations/samples under study is 5.
#> The level of the underlying ksample testing procedure is set to 5%.
#>
#> The number of detected clusters in these populations equals 3.
#> The pvalues of the ksample tests (showing when to close the clusters (i.e. pvalue < 0.05) equal: 0.333, 0, 0, 0.111.
#>
#> The list of clusters with populations belonging to them (in numeric format, i.e. inside c()) :
#>  Cluster #1: vector of populations c(1, 3)
#>  Cluster #2: vector of populations 4
#>  Cluster #3: vector of populations c(2, 5)
#>
#> The list of estimated weights for the unknown component distributions in each detected cluster
#> (in the same format and order as listed populations for clusters just above) :
#>  estimated weights of the unknown component distributions for cluster 1 : c(0.809178290213258, 0.631001196480248)
#>  estimated weights of the unknown component distributions for cluster 2 : numeric(0)
#>  estimated weights of the unknown component distributions for cluster 3 : c(0.523639573546502, 0.58440106951725)
#>
#> The matrix giving the distances between populations, used in the clustering procedure through the ksample tests:
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.00000000 29.9669866 0.02926702 7.540128 45.3573537
#> [2,] 29.96698661 0.0000000 19.15790613 4.827311 0.1379936
#> [3,] 0.02926702 19.1579061 0.00000000 69.499273 20.9549883
#> [4,] 7.54012842 4.8273114 69.49927347 0.000000 3.1603965
#> [5,] 45.35735370 0.1379936 20.95498826 3.160397 0.0000000