Yesterday, in a post syndicated to R-bloggers, kjytay asked how to programmatically shuffle a data.table column in place, since the straightforward approach didn’t work well.
Here are two other ways to solve the same problem, one using data.table::set() and the other .SDcols:
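library(data.table)

# Overwrite the column by reference with a shuffled copy of its values
scramble_set <- function(input_dt, colname) {
  set(input_dt, j = colname, value = sample(input_dt[[colname]]))
}

# Restrict .SD to the target column and reorder its rows
scramble_sd <- function(input_dt, colname) {
  input_dt[, c(colname) := .SD[sample(.I, .N)], .SDcols = colname]
}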
Each approach returns the correct result and avoids the strange dispatch problem when trying to shuffle a column named “colname”.
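As a quick check of both claims, here is a minimal sketch that runs each function on a column literally named “colname”. The tables `dt_set` and `dt_sd`, their contents, and the seed are made up for illustration and are not from kjytay’s post.

```r
library(data.table)

scramble_set <- function(input_dt, colname) {
  set(input_dt, j = colname, value = sample(input_dt[[colname]]))
}

scramble_sd <- function(input_dt, colname) {
  input_dt[, c(colname) := .SD[sample(.I, .N)], .SDcols = colname]
}

# Hypothetical table; one column is deliberately named "colname".
dt_set <- data.table(id = 1:5, colname = letters[1:5])
dt_sd  <- copy(dt_set)  # independent copy, so each function gets its own table

# Reset the seed before each call so the two shuffles can be compared.
set.seed(1)
scramble_set(dt_set, "colname")
set.seed(1)
scramble_sd(dt_sd, "colname")

# Neither result was assigned back; both tables were updated by reference.
print(dt_set)
print(dt_sd)
```

Only the “colname” column is reordered; the `id` column stays as it was, so this is a column shuffle rather than a row shuffle.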
It’s good to check performance with these kinds of things, too, especially when .SD is involved, and set() handily outperforms the other two solutions (I named kjytay’s original solution “orig”):
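library(microbenchmark)

# scramble_orig() is kjytay's original solution from the post, not shown here
microbenchmark(
  orig = scramble_orig(input_dt, "x"),
  set = scramble_set(input_dt, "x"),
  sd = scramble_sd(input_dt, "x"),
  setup = {
    input_dt <- data.table(x = 1:5)
    set.seed(1)
  },
  check = "identical"
)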
Unit: microseconds
 expr     min       lq      mean  median       uq      max neval
 orig 291.970 315.4400 351.52132 319.474 327.5635 3248.663   100
  set  33.196  36.0965  61.62936  37.262  39.5380 2419.880   100
   sd 557.834 591.2370 636.88657 597.579 616.2675 3821.737   100
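The timings above come from a five-row table, where fixed per-call overhead may account for much of the difference. As a rough sketch, assuming one wants to see how the two approaches from this post compare on a larger table, the benchmark could be rerun along these lines. The row count `n_rows` and `times = 20` are arbitrary assumptions, and no timings are shown because they will vary by machine.

```r
library(data.table)
library(microbenchmark)

scramble_set <- function(input_dt, colname) {
  set(input_dt, j = colname, value = sample(input_dt[[colname]]))
}

scramble_sd <- function(input_dt, colname) {
  input_dt[, c(colname) := .SD[sample(.I, .N)], .SDcols = colname]
}

n_rows <- 1e5  # hypothetical table size, chosen only for illustration

big_bench <- microbenchmark(
  set = scramble_set(input_dt, "x"),
  sd  = scramble_sd(input_dt, "x"),
  setup = {
    # Rebuild the table before every timing, since both functions modify it in place
    input_dt <- data.table(x = seq_len(n_rows))
    set.seed(1)
  },
  times = 20
)

print(big_bench)
```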