Optimizing R

Recently, I’ve been planning how to move libaffy to R. Most of the package is no longer needed thanks to the excellent affy R/Bioconductor package, but IRON (iterative rank-order normalization) is popular at my institution, so an R port is on the table. Part of this planning is optimization - the goal of libaffy was to provide optimized code for processing Affymetrix GeneChip data, and converting it into very slow R would run contrary to that goal.

Options

I decided to perform some experiments to determine the “right” approach to optimizing my code in R - meaning code that works over large datasets while keeping memory and compute in check. As I understand it, there are several approaches to this problem:

  • All R: This approach uses R’s optimized numerical routines, with as many vectorized operations (apply, sweep, etc.) as possible to keep it performant. The benefit is that there are fewer dependencies and nothing to compile.
  • Optimized R: The second approach is to use packages like matrixStats for optimized matrix-level operations. Since most of libaffy is based on fairly basic numerical routines, this seems a reasonable tradeoff between writing my own low-level code and using R directly. (A small illustration of these first two options follows this list.)
  • Rcpp: An excellent C++ interface for R. It handles most of the interfacing and provides some very nice helpers for getting C routines implemented. Given that the routines are C, this should be much faster, while keeping a nice wrapper for R.
  • R/C Interface: R also has a lower-level C interface, which involves a number of C macros, protecting memory allocations, and other bookkeeping. This is as good as it gets for execution speed (minimal additional overhead), but it is also much harder to write and maintain.

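To make the first two options concrete, here is roughly what per-column medians on one probeset’s submatrix would look like under each (illustrative calls only, not the benchmark harness itself):

m <- matrix(rexp(15 * 10), nrow = 15)   # stand-in for one probeset's rows across 10 arrays

# All R: an interpreted loop over columns via apply()
apply(m, 2, median)

# Optimized R: a single call into matrixStats' compiled column-median routine
matrixStats::colMedians(m)
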
Right away, this indicates that my hope of maintaining a single code base (libaffy C and libaffy R) is not realistic. While it is possible to unbox the various data structures and then call straight into C (reusing the same library), there is code and performance overhead in doing so. Since R has become one of the de facto tools for bioinformatics, it seems that a move (rather than code sharing) is in order.

Example

I chose the mean, the median and Tukey’s biweight as the examples to implement in various ways to explore the benefits. Mean and median are easy, although the median can be optimized using a partial sort rather than a full sort. Tukey’s biweight is fairly self-contained but does require two median computations (one for location, one for spread), which can be expensive. Plus, the affy package already implements what turns out to be a fairly efficient version (affy::tukey.biweight()). This is also a good example since we can simulate large datasets with many different “probesets”, or sets of row indices.
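
For reference, a one-step Tukey biweight fits in a few lines of plain R. This is only a sketch of the general idea, assuming the usual tuning constant c = 5 and a small epsilon to guard against a zero spread (the exact constants and details in libaffy and affy may differ):

tukey_biweight_sketch <- function(x, c = 5, epsilon = 1e-4) {
  m <- median(x)                             # first median: location
  s <- median(abs(x - m))                    # second median: spread (unscaled MAD)
  u <- (x - m) / (c * s + epsilon)           # scaled distances from the center
  w <- ifelse(abs(u) < 1, (1 - u^2)^2, 0)    # biweight: downweight, then drop, outliers
  sum(w * x) / sum(w)                        # weighted mean
}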

The dataset is a simple simulated dataset of 100 “arrays” (columns) by 1,000,000 “probes” (rows). A series of “probesets” is defined as lists of row indices. The code to generate these is:

generate_dataset <- function(n = 100, p = 1000000, np = 50000) {
  # n "arrays" (columns) by p "probes" (rows) of exponentially distributed intensities
  d <- matrix(data = 1000 * rexp(n * p), ncol = n, nrow = p)
  # np "probesets", each a list of 15 random row indices into the data matrix
  probes <- sapply(1:np, \(.x) { as.integer(runif(n = 15, min = 1, max = p)) }, simplify = FALSE)
  # give each probeset an Affymetrix-style name, e.g. "12345_at"
  probes <- setNames(
    probes,
    nm = sprintf("%05d_at", as.integer(runif(n = np, min = 10000, max = 50000)))
  )
  list(data = d, probes = probes)
}

Initial Experiment

To get an early read on results and to test the process, I started with a smaller dataset:

> d <- generate_dataset(n = 10, p = 1000, np = 100)
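
The min/lq/mean/median/uq/max/neval columns in the tables below are the summary format produced by microbenchmark; a simplified sketch of such a harness, shown here for the mean comparison only (the tables list the full set of expressions, including the libaffyperftest helpers not shown in this post), looks like:

library(microbenchmark)

mb <- microbenchmark(
  sapply(d$probes, function(.x) { colMeans(d$data[.x, ]) }, simplify = FALSE),
  sapply(d$probes, function(.x) { apply(d$data[.x, ], 2, mean) }, simplify = FALSE),
  sapply(d$probes, function(.x) { matrixStats::colMeans2(d$data[.x, ]) }, simplify = FALSE),
  times = 100
)
summary(mb)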

A quick summary of the results: Optimized R is the most efficient, followed by the Rcpp version. The pure-R approaches (apply) were the slowest. Of note, purrr::map is generally a tad slower than sapply (but comparing the two was not the purpose of the experiment).

Means

|expr | min| lq| mean| median| uq| max| neval|
|:---|---:|---:|---:|---:|---:|---:|---:|
|sapply(d$probes, function(.x) { colMeans(d$data[.x, ]) }, simplify = FALSE) | 214.840| 223.5525| 230.6885| 228.4725| 237.2875| 272.814| 100|
|sapply(d$probes, function(.x) { apply(d$data[.x, ], 2, mean) }, simplify = FALSE) | 2343.191| 2463.3620| 2795.6420| 2497.3305| 2565.4110| 12769.860| 100|
|sapply(d$probes, function(.x) { apply(d$data[.x, ], 2, libaffyperftest::mean) }, simplify = FALSE) | 2334.294| 2487.9825| 2712.8142| 2529.8230| 2560.6345| 12119.846| 100|
|sapply(d$probes, function(.x) { libaffyperftest::col_means(d$data[.x, ]) }, simplify = FALSE) | 479.946| 499.4005| 513.5799| 511.0035| 526.1940| 570.392| 100|
|sapply(d$probes, function(.x) { matrixStats::colMeans2(d$data[.x, ]) }, simplify = FALSE) | 200.039| 207.7470| 217.0446| 214.8810| 222.4455| 258.177| 100|
|purrr::map(d$probes, function(.x) { colMeans(d$data[.x, ]) }) | 227.960| 239.2555| 350.5229| 247.5990| 258.1975| 10228.803| 100|

Medians

|expr | min| lq| mean| median| uq| max| neval|
|:---|---:|---:|---:|---:|---:|---:|---:|
|sapply(d$probes, function(.x) { apply(d$data[.x, ], 2, median) }, simplify = FALSE) | 2902.964| 3039.146| 3211.5386| 3099.5795| 3155.3805| 12629.230| 100|
|sapply(d$probes, function(.x) { apply(d$data[.x, ], 2, libaffyperftest::median) }, simplify = FALSE) | 2937.199| 3039.883| 3622.1544| 3116.3690| 3214.1540| 22920.025| 100|
|sapply(d$probes, function(.x) { apply(d$data[.x, ], 2, naive_median1) }, simplify = FALSE) | 3008.416| 3128.854| 3296.6329| 3190.9685| 3253.4935| 12945.996| 100|
|sapply(d$probes, function(.x) { apply(d$data[.x, ], 2, naive_median2) }, simplify = FALSE) | 3177.418| 3275.224| 3365.5322| 3345.0055| 3404.0865| 3786.965| 100|
|sapply(d$probes, function(.x) { apply(d$data[.x, ], 2, naive_median3) }, simplify = FALSE) | 2352.949| 2454.854| 3079.5481| 2507.1705| 2559.9785| 12334.071| 100|
|sapply(d$probes, function(.x) { libaffyperftest::col_medians(d$data[.x, ]) }, simplify = FALSE) | 1129.140| 1164.626| 1189.3936| 1179.7955| 1211.1810| 1312.697| 100|
|sapply(d$probes, function(.x) { matrixStats::colMedians(d$data[.x, ]) }, simplify = FALSE) | 306.926| 318.529| 425.9756| 325.8475| 341.3455| 9807.938| 100|
|purrr::map(d$probes, function(.x) { matrixStats::colMedians(d$data[.x, ]) }) | 319.800| 340.013| 355.8661| 353.9735| 365.6585| 440.381| 100|

Tukey

|expr | min| lq| mean| median| uq| max| neval|
|:---|---:|---:|---:|---:|---:|---:|---:|
|sapply(d$probes, function(.x) { apply(d$data[.x, ], 2, affy::tukey.biweight) }, simplify = FALSE) | 18.694852| 19.580452| 21.442210| 19.804455| 22.386656| 100.769595| 100|
|sapply(d$probes, function(.x) { apply(d$data[.x, ], 2, libaffyperftest::tukey_biweight) }, simplify = FALSE) | 4.581873| 4.745196| 4.926748| 4.794192| 4.856020| 8.676707| 100|
|sapply(d$probes, function(.x) { apply(d$data[.x, ], 2, matrixStats_tukeybw_vector) }, simplify = FALSE) | 5.801090| 6.018472| 6.363052| 6.097090| 6.166708| 9.681248| 100|
|sapply(d$probes, function(.x) { affy::tukeybiweight(d$data[.x, ]) }, simplify = FALSE) | 19.101244| 19.806075| 21.080633| 20.040083| 22.658957| 42.038407| 100|
|sapply(d$probes, function(.x) { matrixStats_tukeybw_matrix(d$data[.x, ]) }, simplify = FALSE) | 1.520444| 1.599185| 1.700660| 1.629586| 1.672554| 5.077030| 100|
|sapply(d$probes, function(.x) { libaffyperftest::col_tukey_biweights(d$data[.x, ]) }, simplify = FALSE) | 2.538720| 2.630724| 2.749495| 2.671253| 2.709936| 6.819407| 100|
|purrr::map(d$probes, function(.x) { matrixStats_tukeybw_matrix(d$data[.x, ]) }) | 1.569152| 1.645720| 1.682940| 1.679504| 1.711750| 1.860047| 100|

Conclusions

Exclusions

After my experiments, there are two options I can rule out: Optimized R and the R/C Interface.

Optimized R does not make sense for my purposes. While it is incredibly powerful and useful, it took a fair amount of time to implement, and it partitions the problem differently from how the algorithm was originally conceived, which hurts understandability. It was faster than straight R, but not by as much as I’d hoped.

The R/C Interface does not make sense either. Significant code changes are needed to manage the R interface within the C code, and this also limits readability. Based on my experience with Tukey’s biweight, I was able to convert libaffy C code to Rcpp code in 30 minutes. That has significant value, and I believe it also maintains a fair level of understandability.

Design Patterns

One of the things that emerged from the testing (and that presumably matrixStats learned long ago) is that iteratively calling an R function is not as efficient as applying a function across a whole matrix in compiled code. That is, using apply in R to take the median of each column is slower than a single C routine that computes all of the column medians. This means that any apply-type activity is best served by C code directly. However, it does appear that we can expose per-vector entry points (e.g., for a single row or column) in Rcpp and then implement the matrix-wise apply in Rcpp as well. This avoids the repeated R interpretation (everything stays in C) while still offering the per-vector version; a sketch of the pattern follows.
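
Here is a sketch of that pattern (not the actual libaffyperftest code), using column means so the C++ stays short: a per-vector entry point plus a matrix-wise entry point that reuses it, both exported to R.

Rcpp::sourceCpp(code = '
#include <Rcpp.h>
using namespace Rcpp;

// Per-vector entry point: can be called from R on a single column, e.g. via apply()
// [[Rcpp::export]]
double vec_mean_cpp(NumericVector x) {
  double s = 0.0;
  for (int i = 0; i < x.size(); i++) s += x[i];
  return s / x.size();
}

// Matrix-wise entry point: the per-column loop happens in C++, not in R
// [[Rcpp::export]]
NumericVector col_means_cpp(NumericMatrix m) {
  NumericVector out(m.ncol());
  for (int j = 0; j < m.ncol(); j++) {
    NumericVector col = m(_, j);
    out[j] = vec_mean_cpp(col);
  }
  return out;
}
')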

For instance, when considering Tukey’s biweight, it is convenient to have a per-vector version. Then, if you don’t have a massive dataset, you can just apply(m, 2, tukey), or, if performance is an issue, call col_tukeys(m). This provides both options with relatively low coding overhead for the col version, as the usage example below shows.
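
Using the hypothetical functions from the mean sketch above, the same two usage styles look like this:

m <- d$data[d$probes[[1]], ]   # one probeset's rows, all arrays

# Flexible per-vector style: fine for small data
apply(m, 2, vec_mean_cpp)

# Matrix-wise style: one call, no per-column dispatch back into R
col_means_cpp(m)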

Tradeoffs

Rcpp offered good performance, but not the best. Optimized R showed the best results, since it uses C underneath (and is, obviously, heavily optimized). In terms of readability and coding effort, though, Rcpp provides the benefit of a C-based speedup without the complexity of a straight C implementation.

Written on December 5, 2023