Shuffling Columns With data.table

June 4, 2022 at 10 AM

Yesterday, in a post syndicated to R-bloggers, kjytay asked how to programmatically shuffle a data.table column in place, since the straightforward approach didn’t work well.

Here are two other ways to solve the same problem, one using data.table::set() and the other using .SD with .SDcols:

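library(data.table)

# Shuffle one column by reference with set(); colname is a string.
scramble_set <- function(input_dt, colname) {
  set(input_dt, j = colname, value = sample(input_dt[[colname]]))
}

# Shuffle one column by reference by reordering the rows of a one-column .SD.
scramble_sd <- function(input_dt, colname) {
  input_dt[, c(colname) := .SD[sample(.I, .N)], .SDcols = colname]
}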

Each approach returns the correct result and avoids the strange dispatch problem that appears when the column to shuffle is itself named “colname”.
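As a quick sanity check, here is a minimal sketch that runs both functions on a table whose column is itself called colname. The table demo_dt, its contents, and the seed are made up for illustration; the function bodies are repeated from above so the snippet runs on its own.

```r
library(data.table)

# Same bodies as above, repeated so this sketch is self-contained.
scramble_set <- function(input_dt, colname) {
  set(input_dt, j = colname, value = sample(input_dt[[colname]]))
}
scramble_sd <- function(input_dt, colname) {
  input_dt[, c(colname) := .SD[sample(.I, .N)], .SDcols = colname]
}

# Hypothetical demo table: the column to shuffle is literally named "colname".
demo_dt <- data.table(colname = 1:10, other = letters[1:10])
before <- copy(demo_dt)  # copy() because both functions modify by reference

set.seed(42)  # arbitrary seed, only for reproducibility
scramble_set(demo_dt, "colname")
after_set <- copy(demo_dt)

scramble_sd(demo_dt, "colname")
after_sd <- copy(demo_dt)

# Each result should hold the same values as before, in some shuffled order,
# with the other column left untouched.
print(after_set)
print(after_sd)
```

Taking copy() snapshots is the one design choice worth noting: because both functions update the table in place, comparing against the original requires a copy made before the call.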

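It’s good to check performance with these kinds of things, too, especially when .SD is involved. On this five-row table, set() handily outperforms the other two solutions, with a median time roughly 9 times lower than kjytay’s original solution (which I named “orig”) and roughly 16 times lower than the .SD version: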

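library(microbenchmark)

# scramble_orig() holds kjytay's original solution (not repeated here).
microbenchmark(
  orig = scramble_orig(input_dt, "x"),
  set = scramble_set(input_dt, "x"),
  sd = scramble_sd(input_dt, "x"),
  setup = {
    # Rebuild the table and reset the seed so check = "identical" can compare results.
    input_dt <- data.table(x = 1:5)
    set.seed(1)
  },
  check = "identical"
)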
Unit: microseconds
 expr     min       lq      mean  median       uq      max neval
 orig 291.970 315.4400 351.52132 319.474 327.5635 3248.663   100
  set  33.196  36.0965  61.62936  37.262  39.5380 2419.880   100
   sd 557.834 591.2370 636.88657 597.579 616.2675 3821.737   100
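With only five rows, the timings above mostly reflect per-call overhead rather than the cost of the shuffle itself. A rough sketch for repeating the comparison on a larger table might look like the following. The size n_rows, the table name big_dt, and times = 20 are arbitrary choices for illustration, and kjytay's original function is left out because it is not reproduced in this post. No results are shown, since they will vary by machine.

```r
library(data.table)
library(microbenchmark)

# Same bodies as earlier in the post, repeated so this runs on its own.
scramble_set <- function(input_dt, colname) {
  set(input_dt, j = colname, value = sample(input_dt[[colname]]))
}
scramble_sd <- function(input_dt, colname) {
  input_dt[, c(colname) := .SD[sample(.I, .N)], .SDcols = colname]
}

n_rows <- 1e5L  # hypothetical table size; adjust to taste

res <- microbenchmark(
  set = scramble_set(big_dt, "x"),
  sd = scramble_sd(big_dt, "x"),
  setup = {
    # Fresh table and seed before each timed call, as in the benchmark above.
    big_dt <- data.table(x = seq_len(n_rows))
    set.seed(1)
  },
  times = 20L
)
print(res)
```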
