Capture melt

2023-08-31

Capture melt

This vignette explains how to use functions for “melting” wide data tables, i.e. converting to tall/long data tables. To clarify the discussion we introduce the following three terms:

The nc functions use data.table::melt internally:

Both are useful when you want to use a regular expression to specify both (1) the set of input columns to reshape and (2) some information to extract from those column names.

Reshaping several input columns into a single output column

Sometimes you want to melt a “wide” data table which has several distinct pieces of information encoded in each column name. One example is the familiar iris data, which have flower part and measurement dimension encoded in each of four column names:

head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

Those four reshape column names can be specified via a regex in nc::capture_melt_single. The first argument is the input data table to reshape, and the subsequent arguments are interpreted as a pattern which is passed to nc::capture_first_vec. Any input column names which match the specified regex will be passed as measure.vars to melt:

(iris.tall <- nc::capture_melt_single(
  iris,
  part=".*",
  "[.]",
  dim=".*",
  value.name="cm"))
#>        Species  part    dim  cm
#>   1:    setosa Sepal Length 5.1
#>   2:    setosa Sepal Length 4.9
#>   3:    setosa Sepal Length 4.7
#>   4:    setosa Sepal Length 4.6
#>   5:    setosa Sepal Length 5.0
#>  ---                           
#> 596: virginica Petal  Width 2.3
#> 597: virginica Petal  Width 1.9
#> 598: virginica Petal  Width 2.0
#> 599: virginica Petal  Width 2.3
#> 600: virginica Petal  Width 1.8

Note the output above has one copy column (Species), two capture columns (part, dim), and one reshape column (cm). Internally the function joins the result of nc::capture_first_vec (on column names) to the result of melt (on the data).

The reshaped data can be plotted with different parts on rows and different dimensions on columns:


if(require(ggplot2)){
  ggplot()+
    theme_bw()+
    theme(panel.spacing=grid::unit(0, "lines"))+
    facet_grid(part ~ dim)+
    geom_bar(aes(cm, fill=Species), data=iris.tall)
}
#> Loading required package: ggplot2

Reshaping several input columns into multiple output columns

We could instead use capture_melt_multiple to get multiple output columns. Like capture_melt_single, the first argument of capture_melt_multiple is the subject data table and the following arguments form a pattern which is matched to the input data column names. However the pattern must have at least two groups:

(iris.part.cols <- nc::capture_melt_multiple(
  iris,
  column=".*",
  "[.]",
  dim=".*"))
#>        Species    dim Petal Sepal
#>   1:    setosa Length   1.4   5.1
#>   2:    setosa Length   1.4   4.9
#>   3:    setosa Length   1.3   4.7
#>   4:    setosa Length   1.5   4.6
#>   5:    setosa Length   1.4   5.0
#>  ---                             
#> 296: virginica  Width   2.3   3.0
#> 297: virginica  Width   1.9   2.5
#> 298: virginica  Width   2.0   3.0
#> 299: virginica  Width   2.3   3.4
#> 300: virginica  Width   1.8   3.0

Note that the reshaped table above contains one copy column (Species), one capture column (dim), and two reshape columns (Petal, Sepal). We can plot these data to see whether or not sepals are bigger than petals:


if(require(ggplot2)){
  ggplot()+
    theme_bw()+
    theme(panel.spacing=grid::unit(0, "lines"))+
    facet_grid(dim ~ Species)+
    coord_equal()+
    geom_abline(slope=1, intercept=0, color="grey")+
    geom_point(aes(
      Petal, Sepal),
      data=iris.part.cols)
}

It is clear from the plot above that sepals are indeed both longer and wider than petals, on each measured plant.

Melting WHO data with a more complex pattern

Another data set where it is useful to do column name pattern matching followed by melting is the World Health Organization data:


if(requireNamespace("tidyr")){
  data(who, package="tidyr")
}else{
  who <- data.frame(id=1, new_sp_m5564=2, newrel_f65=3)
}
#> Loading required namespace: tidyr
names(who)
#>  [1] "country"      "iso2"         "iso3"         "year"         "new_sp_m014" 
#>  [6] "new_sp_m1524" "new_sp_m2534" "new_sp_m3544" "new_sp_m4554" "new_sp_m5564"
#> [11] "new_sp_m65"   "new_sp_f014"  "new_sp_f1524" "new_sp_f2534" "new_sp_f3544"
#> [16] "new_sp_f4554" "new_sp_f5564" "new_sp_f65"   "new_sn_m014"  "new_sn_m1524"
#> [21] "new_sn_m2534" "new_sn_m3544" "new_sn_m4554" "new_sn_m5564" "new_sn_m65"  
#> [26] "new_sn_f014"  "new_sn_f1524" "new_sn_f2534" "new_sn_f3544" "new_sn_f4554"
#> [31] "new_sn_f5564" "new_sn_f65"   "new_ep_m014"  "new_ep_m1524" "new_ep_m2534"
#> [36] "new_ep_m3544" "new_ep_m4554" "new_ep_m5564" "new_ep_m65"   "new_ep_f014" 
#> [41] "new_ep_f1524" "new_ep_f2534" "new_ep_f3544" "new_ep_f4554" "new_ep_f5564"
#> [46] "new_ep_f65"   "newrel_m014"  "newrel_m1524" "newrel_m2534" "newrel_m3544"
#> [51] "newrel_m4554" "newrel_m5564" "newrel_m65"   "newrel_f014"  "newrel_f1524"
#> [56] "newrel_f2534" "newrel_f3544" "newrel_f4554" "newrel_f5564" "newrel_f65"

Each column which starts with new has three distinct pieces of information encoded in its name: diagnosis type (e.g. sp or rel), gender (m or f), and age range (e.g. 5564 or 1524). We would like to use a regex to match these column names, then using the matching columns as measure.vars in a melt, then join the two results.

new.diag.gender <- list(
  "new_?",
  diagnosis=".*",
  "_",
  gender=".")
nc::capture_melt_single(who, new.diag.gender, ages=".*")
#>                          country iso2 iso3 year diagnosis gender ages value
#>     1:               Afghanistan   AF  AFG 1997        sp      m  014     0
#>     2:               Afghanistan   AF  AFG 1998        sp      m  014    30
#>     3:               Afghanistan   AF  AFG 1999        sp      m  014     8
#>     4:               Afghanistan   AF  AFG 2000        sp      m  014    52
#>     5:               Afghanistan   AF  AFG 2001        sp      m  014   129
#>    ---                                                                     
#> 76042:                  Viet Nam   VN  VNM 2013       rel      f   65  3110
#> 76043: Wallis and Futuna Islands   WF  WLF 2013       rel      f   65     2
#> 76044:                     Yemen   YE  YEM 2013       rel      f   65   360
#> 76045:                    Zambia   ZM  ZMB 2013       rel      f   65   669
#> 76046:                  Zimbabwe   ZW  ZWE 2013       rel      f   65   725

Note the output includes the new reshape column called value by default, as in melt. The input reshape column names which matched the specified pattern, and there is a new column for each group in that pattern. The following example shows how to rename the value column and use numeric type conversion functions:

years.pattern <- list(new.diag.gender, ages=list(
  min.years="0|[0-9]{2}", as.numeric,
  max.years="[0-9]{0,2}", function(x)ifelse(x=="", Inf, as.numeric(x))))
(who.typed <- nc::capture_melt_single(
  who, years.pattern,
  value.name="count"))
#>                          country iso2 iso3 year diagnosis gender ages min.years
#>     1:               Afghanistan   AF  AFG 1997        sp      m  014         0
#>     2:               Afghanistan   AF  AFG 1998        sp      m  014         0
#>     3:               Afghanistan   AF  AFG 1999        sp      m  014         0
#>     4:               Afghanistan   AF  AFG 2000        sp      m  014         0
#>     5:               Afghanistan   AF  AFG 2001        sp      m  014         0
#>    ---                                                                         
#> 76042:                  Viet Nam   VN  VNM 2013       rel      f   65        65
#> 76043: Wallis and Futuna Islands   WF  WLF 2013       rel      f   65        65
#> 76044:                     Yemen   YE  YEM 2013       rel      f   65        65
#> 76045:                    Zambia   ZM  ZMB 2013       rel      f   65        65
#> 76046:                  Zimbabwe   ZW  ZWE 2013       rel      f   65        65
#>        max.years count
#>     1:        14     0
#>     2:        14    30
#>     3:        14     8
#>     4:        14    52
#>     5:        14   129
#>    ---                
#> 76042:       Inf  3110
#> 76043:       Inf     2
#> 76044:       Inf   360
#> 76045:       Inf   669
#> 76046:       Inf   725
str(who.typed)
#> Classes 'data.table' and 'data.frame':   76046 obs. of  10 variables:
#>  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
#>  $ iso2     : chr  "AF" "AF" "AF" "AF" ...
#>  $ iso3     : chr  "AFG" "AFG" "AFG" "AFG" ...
#>  $ year     : num  1997 1998 1999 2000 2001 ...
#>  $ diagnosis: chr  "sp" "sp" "sp" "sp" ...
#>  $ gender   : chr  "m" "m" "m" "m" ...
#>  $ ages     : chr  "014" "014" "014" "014" ...
#>  $ min.years: num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ max.years: num  14 14 14 14 14 14 14 14 14 14 ...
#>  $ count    : num  0 30 8 52 129 90 127 139 151 193 ...
#>  - attr(*, ".internal.selfref")=<externalptr>

Note in the code/result above that non-character captured output columns can be obtained by specifying type conversion functions in the pattern.

Melting monthly Eurostat data

Another example is exchange rate data from Eurostat. We first use read the entire data set into R:

ert.gz <- system.file(
  "extdata", "ert_eff_ic_m.tsv.gz", package="nc", mustWork=TRUE)
ert.all <- data.table::fread(ert.gz, na.strings=":")
ert.all[1:5, 1:5]
#>    exch_rt,unit,geo\\time 2020M07 2020M06 2020M05 2020M04
#> 1:       NEER_EA19,I10,AT  100.00  100.00   100.0  100.00
#> 2:       NEER_EA19,I10,AU   88.50   88.39    86.3   83.46
#> 3:       NEER_EA19,I10,BE  100.00  100.00   100.0  100.00
#> 4:       NEER_EA19,I10,BG  100.00  100.00   100.0  100.00
#> 5:       NEER_EA19,I10,BR   38.49   39.94    38.0   40.45

We see that the first column has some CSV data which we can parse via:

ert.first <- ert.all[, 1]
csv.lines <- c(sub("\\\\.*", "", names(ert.first)), ert.first[[1]])
ert.first.dt <- data.table::fread(text=paste(csv.lines, collapse="\n"))
ert.wide <- data.table::data.table(ert.first.dt, ert.all[,-1])
ert.wide[1:5, 1:5]
#>      exch_rt unit geo 2020M07 2020M06
#> 1: NEER_EA19  I10  AT  100.00  100.00
#> 2: NEER_EA19  I10  AU   88.50   88.39
#> 3: NEER_EA19  I10  BE  100.00  100.00
#> 4: NEER_EA19  I10  BG  100.00  100.00
#> 5: NEER_EA19  I10  BR   38.49   39.94

The wide data table can then be melted:

(ert.tall <- nc::capture_melt_single(
  ert.wide,
  year="[0-9]{4}", as.integer,
  "M",
  month="[0-9]{2}", as.integer))
#>         exch_rt unit geo year month   value
#>    1: NEER_EA19  I10  AT 2020     7  100.00
#>    2: NEER_EA19  I10  AU 2020     7   88.50
#>    3: NEER_EA19  I10  BE 2020     7  100.00
#>    4: NEER_EA19  I10  BG 2020     7  100.00
#>    5: NEER_EA19  I10  BR 2020     7   38.49
#>   ---                                      
#> 2229: NEER_EA19  I10  BE 1994     1   99.07
#> 2230: NEER_EA19  I10  BG 1994     1 4808.49
#> 2231: NEER_EA19  I10  BR 1994     1 1225.11
#> 2232: NEER_EA19  I10  CA 1994     1   92.19
#> 2233: NEER_EA19  I10  CH 1994     1   83.43

After that we can create a time variable and plot via


ert.tall[, month.IDate := data.table::as.IDate(
  sprintf("%d-%d-15", year, month))]
if(require("ggplot2")){
  ggplot()+
    geom_hline(aes(
      yintercept=value),
      color="grey",
      data=data.frame(value=100))+
    geom_line(aes(
      month.IDate, value, color=geo),
      data=ert.tall[geo %in% c("CA", "US", "JP", "FR")])+
    facet_grid(exch_rt ~ .)+
    theme_bw()+
    theme(panel.spacing=grid::unit(0, "lines"))
}

Another way to do it would be via

nc::capture_melt_single(ert.wide, month.POSIXct="[0-9].*", function(x){
  as.POSIXct(strptime(paste0(x,"15"), "%YM%m%d"))
})
#>         exch_rt unit geo month.POSIXct   value
#>    1: NEER_EA19  I10  AT    2020-07-15  100.00
#>    2: NEER_EA19  I10  AU    2020-07-15   88.50
#>    3: NEER_EA19  I10  BE    2020-07-15  100.00
#>    4: NEER_EA19  I10  BG    2020-07-15  100.00
#>    5: NEER_EA19  I10  BR    2020-07-15   38.49
#>   ---                                         
#> 2229: NEER_EA19  I10  BE    1994-01-15   99.07
#> 2230: NEER_EA19  I10  BG    1994-01-15 4808.49
#> 2231: NEER_EA19  I10  BR    1994-01-15 1225.11
#> 2232: NEER_EA19  I10  CA    1994-01-15   92.19
#> 2233: NEER_EA19  I10  CH    1994-01-15   83.43

Melting into multiple output columns with missing input columns

What if the input data set has “missing” input columns?

iris.missing <- iris[, names(iris) != "Sepal.Length"]
head(iris.missing)
#>   Sepal.Width Petal.Length Petal.Width Species
#> 1         3.5          1.4         0.2  setosa
#> 2         3.0          1.4         0.2  setosa
#> 3         3.2          1.3         0.2  setosa
#> 4         3.1          1.5         0.2  setosa
#> 5         3.6          1.4         0.2  setosa
#> 6         3.9          1.7         0.4  setosa

In that case melting into multiple columns is an error by default:

iris.pattern <- list(column=".*", "[.]", dim=".*")
nc::capture_melt_multiple(iris.missing, iris.pattern)
#> Error in (function (subject.names, match.dt, no.match, fill = TRUE) : need dim=same count for each value, but have: Length=1 Width=2; please change pattern, edit input column names, or use fill=TRUE to output missing values

The error message explains that the number of input columns for each value of dim must be the same, but there is one for Length and two for Width. To ignore the error and fill the output with missing values,

nc::capture_melt_multiple(iris.missing, iris.pattern, fill=TRUE)
#>        Species    dim Petal Sepal
#>   1:    setosa Length   1.4    NA
#>   2:    setosa Length   1.4    NA
#>   3:    setosa Length   1.3    NA
#>   4:    setosa Length   1.5    NA
#>   5:    setosa Length   1.4    NA
#>  ---                             
#> 296: virginica  Width   2.3   3.0
#> 297: virginica  Width   1.9   2.5
#> 298: virginica  Width   2.0   3.0
#> 299: virginica  Width   2.3   3.4
#> 300: virginica  Width   1.8   3.0

Note the missing values in the table above, which correspond to the missing input column in the original/wide data set.

Multiple output columns of different types

Some real-world data sets can be reshaped into output columns with different types. An example data set from the PROVEDIt benchmark in criminology:

peaks.csv <- system.file(
  "extdata", "RD12-0002_PP16HS_5sec_GM_F_1P.csv",
  package="nc", mustWork=TRUE)
peaks.wide <- data.table::fread(peaks.csv)
print(data.table::data.table(
  names=names(peaks.wide),
  class=sapply(peaks.wide, class)),
  topn=10)
#>            names     class
#>   1: Sample File character
#>   2:      Marker character
#>   3:         Dye character
#>   4:    Allele 1 character
#>   5:      Size 1   numeric
#>   6:    Height 1   integer
#>   7:    Allele 2 character
#>   8:      Size 2   numeric
#>   9:    Height 2   integer
#>  10:    Allele 3 character
#>  ---                      
#> 294:   Height 97   logical
#> 295:   Allele 98   logical
#> 296:     Size 98   logical
#> 297:   Height 98   logical
#> 298:   Allele 99   logical
#> 299:     Size 99   logical
#> 300:   Height 99   logical
#> 301:  Allele 100   logical
#> 302:    Size 100   logical
#> 303:  Height 100   logical

There are 303 columns, with info for 100 peaks. Each peak has three features: Allele=character, Size=numeric, and Height=integer. The ending peaks are class logical because they are all missing. These data can be reshaped via

peaks.tall <- nc::capture_melt_multiple(
  peaks.wide,
  column=".*",
  " ",
  peak="[0-9]+", as.integer,
  na.rm=TRUE)
#> Warning in melt.data.table(data.table(L[["data"]]), measure.vars = measure.vars, :
#> 'measure.vars' [Allele 1, Allele 2, Allele 3, Allele 4, ...] are not all of the same
#> type. By order of hierarchy, the molten data value column will be of type 'character'.
#> All measure variables not of type 'character' will be coerced too. Check DETAILS in
#> ?melt.data.table for more on coercion.
#> Warning in melt.data.table(data.table(L[["data"]]), measure.vars = measure.vars, :
#> 'measure.vars' [Height 1, Height 2, Height 3, Height 4, ...] are not all of the same
#> type. By order of hierarchy, the molten data value column will be of type 'integer'. All
#> measure variables not of type 'integer' will be coerced too. Check DETAILS in
#> ?melt.data.table for more on coercion.
#> Warning in melt.data.table(data.table(L[["data"]]), measure.vars = measure.vars, :
#> 'measure.vars' [Size 1, Size 2, Size 3, Size 4, ...] are not all of the same type. By
#> order of hierarchy, the molten data value column will be of type 'double'. All measure
#> variables not of type 'double' will be coerced too. Check DETAILS in ?melt.data.table for
#> more on coercion.
options(width=90)
print(peaks.tall)
#>                                Sample File  Marker Dye peak Allele Height   Size
#>   1: A02-RD12-0002-35-0.5PP16-001.5sec.fsa D3S1358   B    1     OL      5  94.68
#>   2: A02-RD12-0002-35-0.5PP16-001.5sec.fsa    TH01   B    1     OL      3 148.69
#>   3: A02-RD12-0002-35-0.5PP16-001.5sec.fsa  D21S11   B    1     OL      7 195.84
#>   4: A02-RD12-0002-35-0.5PP16-001.5sec.fsa  D18S51   B    1     OL      4 285.51
#>   5: A02-RD12-0002-35-0.5PP16-001.5sec.fsa Penta E   B    1     OL      3 369.36
#>  ---                                                                            
#> 438: A02-RD12-0002-35-0.5PP16-001.5sec.fsa     FGA   Y   25     OL      5 413.28
#> 439: A02-RD12-0002-35-0.5PP16-001.5sec.fsa Penta E   B   26     OL      3 479.21
#> 440: A02-RD12-0002-35-0.5PP16-001.5sec.fsa     FGA   Y   26     OL      4 415.55
#> 441: A02-RD12-0002-35-0.5PP16-001.5sec.fsa     FGA   Y   27     OL      4 428.06
#> 442: A02-RD12-0002-35-0.5PP16-001.5sec.fsa     FGA   Y   28     OL      4 445.08
str(peaks.tall)
#> Classes 'data.table' and 'data.frame':   442 obs. of  7 variables:
#>  $ Sample File: chr  "A02-RD12-0002-35-0.5PP16-001.5sec.fsa" "A02-RD12-0002-35-0.5PP16-001.5sec.fsa" "A02-RD12-0002-35-0.5PP16-001.5sec.fsa" "A02-RD12-0002-35-0.5PP16-001.5sec.fsa" ...
#>  $ Marker     : chr  "D3S1358" "TH01" "D21S11" "D18S51" ...
#>  $ Dye        : chr  "B" "B" "B" "B" ...
#>  $ peak       : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Allele     : chr  "OL" "OL" "OL" "OL" ...
#>  $ Height     : int  5 3 7 4 3 4 3 197 45 4 ...
#>  $ Size       : num  94.7 148.7 195.8 285.5 369.4 ...
#>  - attr(*, "sorted")= chr "peak"
#>  - attr(*, ".internal.selfref")=<externalptr>