reclin2 has the functionality to use a cluster created by parallel or snow for record linkage. There are a couple of advantages to this. First, record linkage can be computationally intensive, as all records from both datasets have to be compared to each other. Splitting the computation over multiple cores or CPUs can give a substantial speed benefit, and the problem is easy to parallelize. Second, when using a snow cluster, the computation can be distributed over multiple machines, allowing reclin2 to use the memory of all of these machines. Besides being computationally intensive, record linkage can also be memory intensive, as all pairs are stored in memory.
Parallelization over k cluster nodes is realised by randomly splitting the first dataset x into k equally sized parts and distributing these over the nodes. The second dataset y is copied to each of the nodes. Therefore, it is beneficial for memory consumption if the first dataset is the largest of the two. On each node the local y is compared to the local x and a local set of pairs is generated. For most operations there exist methods for cluster_pairs. These usually consist of running the operations for the regular pairs on each of the nodes.
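The splitting step described above can be illustrated in plain R. This is only a sketch of the idea and not the code reclin2 uses internally; the helper split_over_nodes is a hypothetical name introduced for this illustration.

```r
# Hypothetical helper (not part of reclin2): randomly assign every row of x
# to one of k nodes so that the part sizes differ by at most one.
split_over_nodes <- function(x, k) {
  node <- sample(rep(seq_len(k), length.out = nrow(x)))
  split(x, node)
}

set.seed(1)
x <- data.frame(id = 1:7)
parts <- split_over_nodes(x, k = 2)

# Number of records that would be sent to each node (4 and 3 here).
sizes <- sapply(parts, nrow)
print(sizes)
```

Because only x is split while y is copied in full to every node, the memory needed per node shrinks with k only for x, which is why the larger dataset is best passed as x.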
Below an example is given using a small cluster. It is assumed that the reader has read the introduction vignette and knows the general procedure of record linkage.
This example repeats the one from the introduction vignette, but this time using a cluster.
> library(reclin2)
We will work with a pair of data sets with artificial data. They are tiny, but that allows us to see what happens. In this example we will perform ‘classic’ probabilistic record linkage.
> data("linkexample1", "linkexample2")
> print(linkexample1)
  id lastname firstname    address sex postcode
1  1    Smith      Anna 12 Mainstr   F  1234 AB
2  2    Smith    George 12 Mainstr   M  1234 AB
3  3  Johnson      Anna 61 Mainstr   F  1234 AB
4  4  Johnson   Charles 61 Mainstr   M  1234 AB
5  5  Johnson    Charly 61 Mainstr   M  1234 AB
6  6 Schwartz       Ben  1 Eaststr   M  6789 XY
> print(linkexample2)
  id lastname firstname       address  sex postcode
1  2    Smith    Gearge 12 Mainstreet <NA>  1234 AB
2  3   Jonson        A. 61 Mainstreet    F  1234 AB
3  4  Johnson   Charles    61 Mainstr    F  1234 AB
4  6 Schwartz       Ben        1 Main    M  6789 XY
5  7 Schwartz      Anna     1 Eaststr    F  6789 XY
We first have to start a cluster. Pairs can then be generated using any of the cluster_pair_* functions.
> library(parallel)
> cl <- makeCluster(2)
> pairs <- cluster_pair_blocking(cl, linkexample1, linkexample2, "postcode")
> print(pairs)
  Cluster 'default' with size: 2
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

Showing a random selection of pairs:
.x .y
1: 1 1
2: 3 1
3: 1 3
4: 3 2
5: 5 2
6: 4 1
7: 2 1
8: 4 3
9: 2 2
10: 2 3
The print function collects a few (max 6) pairs from each of the nodes and shows those. Other cluster_pair_* functions are cluster_pair and cluster_pair_minsim.
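As an illustration of one of the other functions, the sketch below generates all possible pairs with cluster_pair. It assumes that cluster_pair takes the cluster and the two datasets as its first three arguments, like cluster_pair_blocking above, and that it accepts a name argument for the set of pairs; check the package documentation before relying on either. A separate cluster (cl_all, a name chosen for this sketch) is used so that the example stands on its own.

```r
library(reclin2)
library(parallel)
data("linkexample1", "linkexample2")

# Separate two-node cluster so this sketch does not touch 'cl' or 'pairs'.
cl_all <- makeCluster(2)

# No blocking: every record of linkexample1 is paired with every record of
# linkexample2. The 'name' argument is assumed to label the set of pairs.
all_pairs <- cluster_pair(cl_all, linkexample1, linkexample2, name = "all")
print(all_pairs)

stopCluster(cl_all)
```

Without blocking the number of pairs grows with the product of the sizes of both datasets, so for realistic data cluster_pair_blocking or cluster_pair_minsim is usually the more practical choice.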
The cluster_pair_* functions return an object of type cluster_pairs. Most other methods work the same as for regular pairs. For example, to compare the pairs on variables:
> compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"),
+ default_comparator = cmp_jarowinkler(0.9), inplace = TRUE)
> print(pairs)
  Cluster 'default' with size: 2
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

Showing a random selection of pairs:
.x .y lastname firstname address sex
1: 3 3 1.000000 0.4642857 1.0000000 1
2: 1 2 0.000000 0.5833333 0.8641026 1
3: 1 1 1.000000 0.4722222 0.9230769 NA
4: 1 3 0.447619 0.4642857 0.9333333 1
5: 5 2 0.952381 0.0000000 0.9230769 0
6: 2 3 0.447619 0.5396825 0.9333333 0
7: 2 2 0.000000 0.0000000 0.8641026 0
8: 4 2 0.952381 0.0000000 0.9230769 0
9: 4 1 0.447619 0.6428571 0.8641026 NA
10: 4 3 1.000000 1.0000000 1.0000000 0
The code above was copy-pasted from the introduction. There, the argument inplace = TRUE was used, which adds the new variables to the existing pairs. One difference between regular pairs and cluster_pairs is that most methods for cluster_pairs modify the existing pairs in place. The inplace argument is therefore ignored here and we can simply use:
> compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"),
+ default_comparator = cmp_jarowinkler(0.9))
> print(pairs)
  Cluster 'default' with size: 2
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

Showing a random selection of pairs:
.x .y lastname firstname address sex
1: 5 3 1.000000 0.8492063 1.0000000 0
2: 5 1 0.447619 0.5555556 0.8641026 NA
3: 3 1 0.447619 0.4722222 0.8641026 NA
4: 3 2 0.952381 0.5833333 0.9230769 1
5: 3 3 1.000000 0.4642857 1.0000000 1
6: 2 1 1.000000 0.8888889 0.9230769 NA
7: 6 5 1.000000 0.5277778 1.0000000 0
8: 2 2 0.000000 0.0000000 0.8641026 0
9: 4 3 1.000000 1.0000000 1.0000000 0
10: 4 2 0.952381 0.0000000 0.9230769 0
Most methods for cluster_pairs do have a new_name argument that generates a new set of pairs on the cluster nodes. For example, the following code generates a new set of pairs and does not modify the existing pairs:
> pairs2 <- compare_pairs(pairs, on =
+ c("lastname", "firstname", "address", "sex"), new_name = "pairs2")
> print(pairs2)
  Cluster 'pairs2' with size: 2
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

Showing a random selection of pairs:
.x .y lastname firstname address sex
1: 1 1 TRUE FALSE FALSE NA
2: 5 1 FALSE FALSE FALSE NA
3: 3 3 TRUE FALSE TRUE TRUE
4: 3 1 FALSE FALSE FALSE NA
5: 5 2 FALSE FALSE FALSE FALSE
6: 2 2 FALSE FALSE FALSE FALSE
7: 2 3 FALSE FALSE FALSE FALSE
8: 4 1 FALSE FALSE FALSE NA
9: 2 1 TRUE FALSE FALSE NA
10: 4 2 FALSE FALSE FALSE FALSE
> print(pairs)
  Cluster 'default' with size: 2
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

Showing a random selection of pairs:
.x .y lastname firstname address sex
1: 5 1 0.447619 0.5555556 0.8641026 NA
2: 3 3 1.000000 0.4642857 1.0000000 1
3: 1 2 0.000000 0.5833333 0.8641026 1
4: 5 3 1.000000 0.8492063 1.0000000 0
5: 1 3 0.447619 0.4642857 0.9333333 1
6: 6 4 1.000000 1.0000000 0.6111111 1
7: 2 2 0.000000 0.0000000 0.8641026 0
8: 4 2 0.952381 0.0000000 0.9230769 0
9: 2 1 1.000000 0.8888889 0.9230769 NA
10: 6 5 1.000000 0.5277778 1.0000000 0
The function compare_vars offers more flexibility than compare_pairs. It can, for example, compare multiple variables at the same time (e.g. compare birth day and month allowing for swaps) or generate multiple results from comparing on one variable. This method also works on cluster_pairs.
The next step in the process is to determine which pairs of records belong to the same entity and which do not. As in the introduction vignette we will use the classic method. Again, we hardly need to change the code from the introduction:
> m <- problink_em(~ lastname + firstname + address + sex, data = pairs)
> print(m)
M- and u-probabilities estimated by the EM-algorithm:
  Variable M-probability U-probability
  lastname     0.9990000   0.001152679
 firstname     0.1999999   0.000100000
   address     0.8999206   0.285831118
       sex     0.3002011   0.285427112

Matching probability: 0.5885595.
> pairs <- predict(m, pairs = pairs, add = TRUE)
> print(pairs)
  Cluster 'default' with size: 2
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

Showing a random selection of pairs:
.x .y lastname firstname address sex weights
1: 1 1 1.000000 0.4722222 0.9230769 NA 7.7103862
2: 3 3 1.000000 0.4642857 1.0000000 1 7.9350221
3: 5 1 0.447619 0.5555556 0.8641026 NA 0.6717426
4: 5 3 1.000000 0.8492063 1.0000000 0 8.5458257
5: 3 1 0.447619 0.4722222 0.8641026 NA 0.6017106
6: 2 2 0.000000 0.0000000 0.8641026 0 -6.3177171
7: 2 3 0.447619 0.5396825 0.9333333 0 0.7937508
8: 6 4 1.000000 1.0000000 0.6111111 1 14.6796595
9: 4 1 0.447619 0.6428571 0.8641026 NA 0.7713174
10: 2 1 1.000000 0.8888889 0.9230769 NA 8.6064218
We can then select the pairs with a weight above a threshold.
> pairs <- select_threshold(pairs, "threshold", score = "weights", threshold = 8)
> print(pairs)
  Cluster 'default' with size: 2
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

Showing a random selection of pairs:
.x .y lastname firstname address sex weights threshold
1: 3 2 0.952381 0.5833333 0.9230769 1 4.0674910 FALSE
2: 1 1 1.000000 0.4722222 0.9230769 NA 7.7103862 FALSE
3: 5 2 0.952381 0.0000000 0.9230769 0 3.6961688 FALSE
4: 1 2 0.000000 0.5833333 0.8641026 1 -5.9463949 FALSE
5: 5 1 0.447619 0.5555556 0.8641026 NA 0.6717426 FALSE
6: 2 2 0.000000 0.0000000 0.8641026 0 -6.3177171 FALSE
7: 2 3 0.447619 0.5396825 0.9333333 0 0.7937508 FALSE
8: 4 3 1.000000 1.0000000 1.0000000 0 15.4915816 TRUE
9: 4 1 0.447619 0.6428571 0.8641026 NA 0.7713174 FALSE
10: 6 5 1.000000 0.5277778 1.0000000 0 7.9139248 FALSE
And this is roughly where we have to stop working with cluster_pairs. The remaining subset of selected pairs should now be small enough to work with comfortably on the local machine; the most computationally intensive steps have been done. When we are not sure exactly what the threshold should be, we can also use a more conservative threshold. That should still reduce the number of pairs enough to work locally. Using cluster_collect we can copy the selected pairs (or all pairs) to the local session:
> pairs <- select_threshold(pairs, "threshold", score = "weights", threshold = 0)
> local_pairs <- cluster_collect(pairs, "threshold")
> print(local_pairs)
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 15 pairs
  Blocking on: 'postcode'

.x .y lastname firstname address sex weights threshold
1: 1 1 1.000000 0.4722222 0.9230769 NA 7.7103862 TRUE
2: 1 3 0.447619 0.4642857 0.9333333 1 0.8042090 TRUE
3: 3 1 0.447619 0.4722222 0.8641026 NA 0.6017106 TRUE
4: 3 2 0.952381 0.5833333 0.9230769 1 4.0674910 TRUE
5: 3 3 1.000000 0.4642857 1.0000000 1 7.9350221 TRUE
6: 5 1 0.447619 0.5555556 0.8641026 NA 0.6717426 TRUE
7: 5 2 0.952381 0.0000000 0.9230769 0 3.6961688 TRUE
8: 5 3 1.000000 0.8492063 1.0000000 0 8.5458257 TRUE
9: 2 1 1.000000 0.8888889 0.9230769 NA 8.6064218 TRUE
10: 2 3 0.447619 0.5396825 0.9333333 0 0.7937508 TRUE
11: 4 1 0.447619 0.6428571 0.8641026 NA 0.7713174 TRUE
12: 4 2 0.952381 0.0000000 0.9230769 0 3.6961688 TRUE
13: 4 3 1.000000 1.0000000 1.0000000 0 15.4915816 TRUE
14: 6 4 1.000000 1.0000000 0.6111111 1 14.6796595 TRUE
15: 6 5 1.000000 0.5277778 1.0000000 0 7.9139248 TRUE
local_pairs is a regular pairs object (and therefore a data.table) which can be operated upon like any pairs object. cluster_collect also has the option clear, which when TRUE deletes the pairs on the cluster nodes. After this we can use the code from the introduction vignette:
> local_pairs <- compare_vars(local_pairs, "truth", on_x = "id", on_y = "id")
> local_pairs <- select_n_to_m(local_pairs, "weights", variable = "ntom", threshold = 0)
> table(local_pairs$truth, local_pairs$ntom)
        FALSE TRUE
  FALSE    11    0
  TRUE      0    4
> linked_data_set <- link(local_pairs, selection = "ntom")
> print(linked_data_set)
  Total number of pairs: 4 pairs

   .y .x id.x lastname.x firstname.x  address.x sex.x postcode.x .id id.y
1:  1  2    2      Smith      George 12 Mainstr     M    1234 AB   2    2
2:  2  3    3    Johnson        Anna 61 Mainstr     F    1234 AB   3    3
3:  3  4    4    Johnson     Charles 61 Mainstr     M    1234 AB   4    4
4:  4  6    6   Schwartz         Ben  1 Eaststr     M    6789 XY   6    6
   lastname.y firstname.y     address.y sex.y postcode.y
1:      Smith      Gearge 12 Mainstreet  <NA>    1234 AB
2:     Jonson          A. 61 Mainstreet     F    1234 AB
3:    Johnson     Charles    61 Mainstr     F    1234 AB
4:   Schwartz         Ben        1 Main     M    6789 XY
The cluster_pairs object is a list with two elements:
- cluster: a copy of the parallel or snow cluster.
- name: the name of the environment on the cluster nodes in which the pairs are stored.
On the cluster nodes there exists an environment (reclin2:::reclin_env). For each set of pairs, an environment containing the pairs is created inside that environment. To demonstrate, let us get the first pair on each of the nodes:
> clusterCall(pairs$cluster, function(name) {
+ pairs <- reclin2:::reclin_env[[name]]$pairs
+ head(pairs, 1)
+ }, name = pairs$name)
[[1]]
  First data set:  3 records
  Second data set: 5 records
  Total number of pairs: 1 pairs
  Blocking on: 'postcode'

   .x .y lastname firstname   address sex  weights threshold
1:  1  1        1 0.4722222 0.9230769  NA 7.710386      TRUE

[[2]]
  First data set:  3 records
  Second data set: 5 records
  Total number of pairs: 1 pairs
  Blocking on: 'postcode'

   .x .y lastname firstname   address sex  weights threshold
1:  1  1        1 0.8888889 0.9230769  NA 8.606422      TRUE
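The storage layout described above, one environment per named set of pairs inside a package-level environment, can be mimicked in plain R. The sketch below only illustrates the idea: store is a hypothetical stand-in for the environment on a single node and the small data frames are made-up placeholders for pairs.

```r
# Stand-in for the package-level environment on one node (hypothetical name).
store <- new.env()

# One child environment per named set of pairs.
store[["default"]] <- new.env()
store[["default"]]$pairs <- data.frame(.x = 1:2, .y = c(1L, 3L))
store[["pairs2"]] <- new.env()
store[["pairs2"]]$pairs <- data.frame(.x = 1L, .y = 2L)

# Look-up by name, mirroring reclin_env[[name]]$pairs in the call above.
name <- "default"
first_pair <- head(store[[name]]$pairs, 1)
print(first_pair)
```

Keeping each set of pairs under its own name is what lets a cluster_pairs object stay small locally: it only needs to carry the cluster and the name.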
Regular pairs are also a data.table. Therefore, it is easy to manually create columns, select or aggregate. Because the pairs of a cluster_pairs object are distributed over the cluster nodes, this is more difficult for cluster_pairs. To help with this, reclin2 has two helper functions: cluster_call and cluster_modify_pairs.
You can pass cluster_call the cluster_pairs object and a function. This function is called on each cluster node and is passed the pairs object, the local x and the local y (in that order). This can be used to modify the pairs or to calculate statistics from the pairs. The results of the function calls are returned by cluster_call. Therefore, if the sole goal is to modify the pairs, make sure to return NULL (or at least something small). Below we use cluster_call to draw a random stratified sample of pairs:
> compare_vars(pairs, "id")
> cluster_call(pairs, function(pairs, ...) {
+ sel1 <- sample(which(pairs$id), 2)
+ sel2 <- sample(which(!pairs$id), 2)
+ pairs[, sample := FALSE]
+ pairs[c(sel1, sel2), sample := TRUE]
+ NULL
+ })
> sample <- cluster_collect(pairs, "sample")
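cluster_call can also be used purely to compute a statistic, in which case only a small result travels back from each node. The sketch below counts the pairs per node and sums them locally. It uses its own cluster and pairs (cl_demo and demo_pairs, names chosen for this sketch) so that it stands alone; the total should agree with the 17 pairs reported by print(pairs) earlier.

```r
library(reclin2)
library(parallel)
data("linkexample1", "linkexample2")

# Separate two-node cluster so this sketch does not touch 'cl' or 'pairs'.
cl_demo <- makeCluster(2)
demo_pairs <- cluster_pair_blocking(cl_demo, linkexample1, linkexample2,
  "postcode")

# Each node returns only the number of rows of its local pairs object.
n_per_node <- cluster_call(demo_pairs, function(pairs, ...) nrow(pairs))

# Combine the per-node results in the local session.
total <- sum(unlist(n_per_node))
print(total)

stopCluster(cl_demo)
```

Since x is split randomly over the nodes, the per-node counts can differ between runs; only their sum is fixed.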
cluster_modify_pairs is very similar to cluster_call but is mainly meant for modifying the pairs object, although the stratified-sampling example above also used cluster_call for that. When the function passed to cluster_modify_pairs returns a data.table, this data.table overwrites the pairs object. cluster_modify_pairs also accepts a new_name argument. When it is set, a new pairs object is created.
Let’s use the sample from above to estimate a model and then use cluster_modify_pairs to add the predictions to the pairs:
> mglm <- glm(id ~ lastname + firstname, data = sample)
> cluster_modify_pairs(pairs, function(pairs, model, ...) {
+ pairs$pmodel <- predict(model, newdata = pairs, type = "response")
+ pairs
+ }, model = mglm)
Finally, stop the cluster.
> stopCluster(cl)