In many cases, the connected families that we have to integrate over may be large but sparsely connected. As an example, consider the following three generation family:
# the data set we use
dat <- data.frame(
id = 1:48,
mom = c(NA, NA, 2L, 2L, 2L, NA, NA, 7L, 7L, 7L, 3L, 3L, 3L, 3L, NA, 15L, 15L, 43L, 18L, NA, NA, 21L, 21L, 9L, 9L, 9L, 9L, NA, NA, 29L, 29L, 29L, 30L, 30L, NA, NA, 36L, 36L, 36L, 38L, 38L, NA, NA, 43L, 43L, 43L, 32L, 32L),
dad = c(NA, NA, 1L, 1L, 1L, NA, NA, 6L, 6L, 6L, 8L, 8L, 8L, 8L, NA, 4L, 4L, 42L, 5L, NA, NA, 20L, 20L, 22L, 22L, 22L, 22L, NA, NA, 28L, 28L, 28L, 23L, 23L, NA, NA, 35L, 35L, 35L, 31L, 31L, NA, NA, 42L, 42L, 42L, 45L, 45L),
sex = c(1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L))
# plot the pedigree
library(kinship2, quietly = TRUE)
ped <- with(dat, pedigree(id = id, dadid = dad, momid = mom, sex = sex))
par(mar = c(1, 1, 1, 1))
plot(ped)
In the plot, circles are females and squares are males. An individual may be repeated in which case duplicate entries are illustrated by a dashed line. If we observe outcomes for all 48 members of the family then we have to do 48 dimensional as everybody are marginally dependent if we have an additive genetic effect or similar effects. We assume that this is true in the rest of this vignette. Though it is possible also use the same method we describe here in other cases if one specifies a suitable graph, like the one we will get to soon, for the random effect structure.
To see that everybody are connected, notice that
The relationships can also be represented as graph. For this purposes, we assign the functions below to create a list of edges between parents and their children. You can likely skip the functions.
# returns the mothers id.
#
# Args:
# pedigree: the pedigree object.
get_momid <- function(peddat)
with(peddat,
vapply(mindex, function(x) if(x > 0L) id[x] else NA_integer_, 1L))
# returns the fathers id.
#
# Args:
# pedigree: the pedigree object.
get_dadid <- function(peddat)
with(peddat,
vapply(findex, function(x) if(x > 0L) id[x] else NA_integer_, 1L))
# creates an edge list to pass to igraph. An edge is included between children
# and parents.
#
# Args:
# pedigree: the pedigree object.
create_igraph_input <- function(peddat){
id <- peddat$id
father <- get_dadid(peddat)
mother <- get_momid(peddat)
# TODO: this is O(n^2)
stopifnot(anyDuplicated(id) < 1)
out <- lapply(id, function(x){
# find the children
children_idx <- which(x == father | x == mother)
children <- if(length(children_idx) > 0)
id[children_idx] else NULL
# get the correct order (smallest first) and return
is_larger <- x > children
cbind(
ifelse(is_larger, children, x ),
ifelse(is_larger, x , children))
})
out <- do.call(rbind, out)
as.data.frame(out[!duplicated(out), ])
}
The graph version of the family looks like this:
# For some reason, kinship2::pedigree requires that we provide both a father
# and mother or none. Therefore, we create a mock object. You can skip this
get_pedigree_mock <- function(id, dadid, momid, sex){
if(is.factor(sex))
sex <- as.integer(sex)
# checks
n <- length(id)
stopifnot(n > 0, length(dadid) == n, length(momid) == n, length(sex) == n,
all(is.finite(sex)), all(sex %in% 1:2),
all(is.na(dadid) | dadid %in% id),
all(is.na(momid) | momid %in% id),
all(is.finite(id)))
# create objects to return
findex <- match(dadid, id, nomatch = 0L)
mindex <- match(momid, id, nomatch = 0L)
structure(
list(famid = rep(1L, n), id = id, findex = findex, mindex = mindex,
sex = factor(sex, levels = 1:2, labels = c("male", "famle"))),
class = "pedigree")
}
# assign function to plot the pedigree as a graph
do_graph_plot <- function(dat){
ped <- with(dat, get_pedigree_mock(
id = id, dadid = dad, momid = mom, sex = sex))
library(igraph, quietly = TRUE)
g_dat <- create_igraph_input(ped)
graph_fam <- graph.data.frame(g_dat, directed = FALSE)
par(mar = c(1, 1, 1, 1))
plot(graph_fam, vertex.size = 12, vertex.color = "gray",
vertex.label.cex = .75, layout = layout_with_kk)
}
do_graph_plot(dat)
A node/vertex in the graph represents an individual and an edge indicates a parent-child relation. The graph have the property that if we split (find a connected partition of) the above into two sub-families (subgraphs) by removing child-parent relations (removing edges) then we only have to do two smaller integrals rather than one large. Having said that, the 48 dimensional integral that we show here is not a problem but higher dimensional integrals may be.
The above suggests the following procedure for simplifying the computational problem:
We have implemented procedures to do just this. It also turns out that it is fine to not satisfy number 1. That is, to not get two connected sets in the partition. We will return to this later and to what we have implemented but first provide an example. We start by splitting the above family in way that only satisfies 1.:
# get the partition
library(pedmod)
only_balanced <- max_balanced_partition_pedigree(
id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L)
#> Exited early. Balance criterion is 4
#> Exited early. Balance criterion is 7
#> Cannot improve balance criterion further
#> Found split. Balance criterion is 24
#> Found perfect balanced connected partition early. Balance criterion is 24
#> Found approximately balanced connected partition. Balance criterion is 24
only_balanced$balance_criterion # the smallest of the two sets
#> [1] 24
removed_edges <- only_balanced$removed_edges
removed_edges # the relationships we have removed
#> from to
#> [1,] 6 8
#> [2,] 6 10
#> [3,] 7 9
#> [4,] 32 47
#> [5,] 32 48
# assign function to get the new data. You can likely skip this
get_reduced_data <- function(dat, removed_edges){
for(i in 1:nrow(removed_edges)){
. <- function(child, parent){
idx <- which(dat$id == removed_edges[i, child])
idx_m <- which(dat$mom[idx] == removed_edges[i, parent])
if(length(idx_m > 0)){
dat[idx[idx_m], "mom"] <<- NA_integer_
return(TRUE)
}
idx_d <- which(dat$dad[idx] == removed_edges[i, parent])
if(length(idx_d > 0)){
dat[idx[idx_d], "dad"] <<- NA_integer_
return(TRUE)
}
FALSE
}
if(.(1L, 2L))
next
.(2L, 1L)
}
dat
}
new_dat <- get_reduced_data(dat, removed_edges)
# redo the plot
do_graph_plot(new_dat)
The above shows the family after we remove the 5 relationships (edges) that our algorithm finds. To illustrate where the cut is in the original graph, we can color the vertices according to which set they are in:
# assign function to show the split in the original graph
show_split <- function(dat, partition_obj){
ped <- with(dat, pedigree(id = id, dadid = dad, momid = mom, sex = sex))
g_dat <- create_igraph_input(ped)
graph_fam <- graph.data.frame(g_dat, directed = FALSE)
nam <- vertex_attr(graph_fam)$name
V(graph_fam)$color[nam %in% partition_obj$set_1] <- "lightblue"
V(graph_fam)$color[nam %in% partition_obj$set_2] <- "lightgray"
par(mar = c(1, 1, 1, 1))
plot(graph_fam, vertex.size = 12, vertex.label.cex = .75,
layout = layout_with_kk)
}
show_split(dat, only_balanced)
Next, we use the slack
argument to reduce the number of relationships we remove (the number of edges we cut):
also_cut <- max_balanced_partition_pedigree(
id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L,
slack = .1)
#> Exited early. Balance criterion is 4
#> Exited early. Balance criterion is 7
#> Cannot improve balance criterion further
#> Found split. Balance criterion is 24
#> Found perfect balanced connected partition early. Balance criterion is 24
#> Found approximately balanced connected partition. Balance criterion is 24
#> Starting to reduce the cost of the cut in a block of size 38. The partition in the block consists of two sets of size 17 and 21. The cut cost is 5
#> Starting iteration 1
#> Found 6 vertices to move
#> Keept 2 moves with a gain of 1 and a balance criterion of 24
#> The cut cost is 4
#> Starting iteration 2
#> Found 6 vertices to move
#> Keept 0 moves
#> The cut cost is 4
also_cut$balance_criterion # the smallest of the two sets
#> [1] 24
removed_edges <- also_cut$removed_edges
removed_edges # the relationships we have removed
#> from to
#> [1,] 6 9
#> [2,] 7 9
#> [3,] 32 48
#> [4,] 45 47
# redo the plot
show_split(dat, also_cut)
We only removed 4 relationships (edges) this time.
In many applications, we only observe some individuals. Therefore, we want the to split the family into equal sizes in terms of the individuals that we actually observe. We can do this by providing individual weights (vertex weights).
As an example, we will take the family we have been working with and assume that we only observe individuals in the final generation. Thus, we will set the weight of these individuals to a one and the remaining to some small positive number (say \(10^{-5}\)):
# add the weights
is_final <- c(16:17, 11:14, 24:27, 33:34, 40:41, 47:48, 19L)
dat$id_weight <- ifelse(dat$id %in% is_final, 1., 1e-5)
weighted_partition <- max_balanced_partition_pedigree(
id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L,
slack = .1, id_weight = dat$id_weight)
#> Exited early. Balance criterion is 4e-05
#> Exited early. Balance criterion is 2.00005
#> Cannot improve balance criterion further
#> Found split. Balance criterion is 8.00019
#> Exited early. Balance criterion is 2.00001
#> Found approximately balanced connected partition. Balance criterion is 8.00019
#> Starting to reduce the cost of the cut in a block of size 38. The partition in the block consists of two sets of size 20 and 18. The cut cost is 6
#> Starting iteration 1
#> Found 8 vertices to move
#> Keept 1 moves with a gain of 2 and a balance criterion of 8.00018
#> The cut cost is 4
#> Starting iteration 2
#> Found 7 vertices to move
#> Keept 0 moves
#> The cut cost is 4
weighted_partition$balance_criterion # the smallest of the two sets
#> [1] 8
removed_edges <- weighted_partition$removed_edges
removed_edges # the relationships we have removed
#> from to
#> [1,] 6 8
#> [2,] 7 8
#> [3,] 32 47
#> [4,] 32 48
# plot the new graph
plot_weighted <- function(dat, removed_edges){
new_dat <- get_reduced_data(dat, removed_edges)
ped <- with(new_dat, get_pedigree_mock(
id = id, dadid = dad, momid = mom, sex = sex))
g_dat <- create_igraph_input(ped)
graph_fam <- graph.data.frame(g_dat, directed = FALSE)
V(graph_fam)$color <- ifelse(vertex_attr(graph_fam)$name %in% is_final,
"gray", "white")
par(mar = c(1, 1, 1, 1))
plot(graph_fam, vertex.size = 12, vertex.label.cex = .75,
layout = layout_with_kk)}
plot_weighted(dat, removed_edges)
show_split(dat, weighted_partition)
We have colored vertices which we observe in the first graph to make it easy to see that there is a as close as possible to an equal number in each sub-family.
We can notice that the previous solution removes relations between the individuals we observe and their immediate ancestors. We can add a larger weight to these relationships to avoid or discourage this. To do so, we need to use the father_weight
and mother_weight
arguments as shown below:
# add the weights
is_final <- c(16:17, 11:14, 24:27, 33:34, 40:41, 47:48, 19L)
dat$id_weight <- ifelse(dat$id %in% is_final, 1., 1e-5)
# add the edge weights
dat$father_weight <- dat$mother_weight <- ifelse(dat$id %in% is_final, 10., 1.)
# find the partition
weighted_partition <- max_balanced_partition_pedigree(
id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L,
slack = .1, id_weight = dat$id_weight, father_weight = dat$father_weight,
mother_weight = dat$mother_weight)
#> Exited early. Balance criterion is 4e-05
#> Exited early. Balance criterion is 2.00005
#> Cannot improve balance criterion further
#> Found split. Balance criterion is 8.00019
#> Exited early. Balance criterion is 2.00001
#> Found approximately balanced connected partition. Balance criterion is 8.00019
#> Starting to reduce the cost of the cut in a block of size 38. The partition in the block consists of two sets of size 20 and 18. The cut cost is 60
#> Starting iteration 1
#> Found 8 vertices to move
#> Keept 2 moves with a gain of 56 and a balance criterion of 8.00017
#> The cut cost is 4
#> Starting iteration 2
#> Found 6 vertices to move
#> Keept 0 moves
#> The cut cost is 4
weighted_partition$balance_criterion # the smallest of the two sets
#> [1] 8
removed_edges <- weighted_partition$removed_edges
removed_edges # the relationships we have removed
#> from to
#> [1,] 6 8
#> [2,] 7 8
#> [3,] 28 32
#> [4,] 29 32
# plot the new graph
plot_weighted(dat, removed_edges)
show_split(dat, weighted_partition)