Phew… So it’s already time for the end of GSoC 2020. Time sure flew by this summer. Now it’s time for me to take a step back and ponder over my summer with The R Project for Statistical Computing. So, let’s go through my summer in chronological fashion.
Around the time The R Project for Statistical Computing announced its participation in GSoC 2020, a list of proposed projects was published. The `rco` project, the R Code Optimizer, caught my attention at the very first moment and I haven’t looked back since.
Although the existing `rco` repository was extremely well maintained and documented, I, being a newbie, still faced quite a lot of difficulties and, most probably, gave my mentors, Dr. JC Rodriguez and Mauricio “Pachá” Vargas Sepúlveda, a hard time.
As part of the tests to be completed to be considered as a potential student developer for R on the `rco` project, I was required to build a completely functional, working optimizer.
I decided to work on a suggested optimizer that made column extraction faster and more efficient. With the patient guidance of my mentors, I was able to come up with a column extraction optimizer that performed as expected. This got me so excited that I went on to build another value extraction optimizer without any explicit incentive :D
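To give a flavour of the idea, here is a hypothetical before/after sketch (the exact rewrite `rco` performs may differ): extracting a column through the generic `[[` goes through S3 method dispatch, whereas the internal `.subset2()` skips it, which is where the speed-up comes from.

```r
df <- data.frame(x = runif(1e5), y = runif(1e5))

# code as the user wrote it: dispatches to `[[.data.frame` on every call
original  <- function() for (i in 1:1e4) df[["x"]]
# hypothetical optimized form: the internal .subset2() avoids the dispatch
optimized <- function() for (i in 1:1e4) .subset2(df, "x")

system.time(original())
system.time(optimized())
```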
It was only a few days before my acceptance into the GSoC 2020 program that my mentors gave their insights into both of these optimizers and explained how they could be broken, owing to the extreme flexibility that the R language offers. Since `.subset2()` is not a reserved keyword in the R language, it can be overwritten. Hence, while optimizing, we can never be sure that the functions being called have not been re-defined by the user. However, this served as a great primer as to what to expect when GSoC officially began.
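A minimal illustration of that fragility, assuming the optimizer rewrites extractions to `.subset2()`:

```r
# .subset2 is an ordinary binding, not a reserved word, so user code can mask it
.subset2 <- function(x, i, ...) stop("not the base .subset2 any more!")

df <- data.frame(x = 1:3)
df[["x"]]          # the original code still returns 1 2 3
.subset2(df, "x")  # the "optimized" code now throws an error
```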
The next tasks I was assigned were not too glorious, dramatic or bling-y, but they were essential steps for the upkeep of a package on CRAN. Firstly, I designed a vignette that lists all the potential optimizers that could be implemented in the `rco` package in the future. Next, I went on to collate the efforts put into the `rco` library by the Google Code-In students, creating PRs such that every student gets credit as a contributor while hand-picking the best of the examples and explanations to include in the official `rco` documentation. It was a riveting experience for me, as I had to learn and apply some really advanced concepts of `git` and GitHub to pull this off, including, but not limited to, cherry-picking, resolving merge conflicts, adding remotes, etc.
As Dwight D. Eisenhower once said,
> In preparing for battle I have always found that plans are useless, but planning is indispensable.
I had listed several optimization techniques that could be implemented in `rco`, both in my GSoC proposal and in the potential-optimizers vignette, but when we got down to discussing the design of these optimizers, it became quite clear to my mentors and me that we were walking on thin ice. With the extreme constraints imposed by R’s flexible nature, only a few of our optimization strategies seemed bullet-proof.
I went ahead with the idea of Jump/Conditional Threading and started designing the optimizer. The objective was to replace `if` statements with `else` branches wherever possible, and also to group together the code blocks of different `if` statements whose conditions were the same. This concept is succinctly covered here.
To see the speed-ups provided by this optimization strategy, have a look at this example:
```r
cond_thread <- function(n) {
  evens <- 0
  evens_sum <- 0
  odds <- 0
  for (i in seq_len(n)) {
    if (i %% 2 == 0) { # same logical as the next if condition (can be merged)
      evens <- evens + 1
    }
    if (i %% 2 == 0) {
      evens_sum <- evens_sum + i
    }
    if (!(i %% 2 == 0)) { # exact negation of the previous if (can be an else)
      odds <- odds + 1
    }
  }
}
```
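For reference, here is a hand-written sketch of what the threaded version could look like (the optimizer’s actual output may differ in formatting), with the identically guarded blocks merged and the negated condition turned into an `else`:

```r
cond_thread_optimized <- function(n) {
  evens <- 0
  evens_sum <- 0
  odds <- 0
  for (i in seq_len(n)) {
    if (i %% 2 == 0) {
      # the two blocks guarded by the same condition are merged
      evens <- evens + 1
      evens_sum <- evens_sum + i
    } else {
      # the exact negation becomes the else branch
      odds <- odds + 1
    }
  }
}
```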
Next, I set my eyes on the Memory Pre-Allocation optimization technique. The objective here was to save R programmers from the sin of growing a vector inside loops. In our experience, vectors, lists, etc. are often initialized with `NULL` or `c()` out of convenience. Bothering about the size of these vectors may seem trivial to the programmer, but it has a huge impact on the performance of the script.
To see the speed-ups provided by this optimization strategy, consider the difference between growing a vector inside a loop and pre-allocating it to its final length.
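A minimal sketch of the two patterns (my own illustration, not `rco` output):

```r
# growing the result with c(): the vector is re-allocated and copied on
# every iteration
grow <- function(n) {
  x <- NULL
  for (i in seq_len(n)) x <- c(x, i * 2)
  x
}

# pre-allocating the result to its final length: memory is reserved once
prealloc <- function(n) {
  x <- numeric(n)
  for (i in seq_len(n)) x[i] <- i * 2
  x
}

system.time(grow(5e4))
system.time(prealloc(5e4))
```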
As mentioned in the last segment, we had pretty much exhausted all the optimization strategies that could be implemented without breaking anything. Now it was on us to decide whether to work on new optimizers that had a non-zero chance of breaking code, or to pivot and change objectives. Following the lead of my mentors, I decided to put the optimizers that could break on hold and focus on further bolstering the `rco` package.
Owing to the fantastic upkeep of the `rco` repository, it had almost no bugs in the optimizers, and that is no mean feat. However, an issue reporting a bug was opened. The bug was that an optimizer, namely `opt_dead_expr()`, did not function when the user used `;`. Ideally, the optimizer shouldn’t have been affected by the usage of `;`, but here it was, a bug lying dormant for over 6 months.
I decided to tackle this bug head on. I created a separate branch for the bug and reproduced it. After lots of trials and tribulations, and lots of reverse engineering, I zeroed in on the problem. It lay in very low-level R code. When an R script without any `;` is parsed, the statements get tokenized as something called an `expr`, but when `;` is used, the parser tokenizes them as an `exprlist`. The existing optimizer did not handle the `exprlist` case, so I appended code that handles `exprlist`, and the optimizer started functioning normally irrespective of the usage of `;`. This issue could arise in other optimizers too, but given that it has been solved in one optimizer, solving it in the others would not prove challenging. Also, in the process of making the `opt_dead_expr()` optimizer work, I further bolstered its robustness by doubling the number of test cases that the optimizer must pass, including tests with `;`.
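A quick way to look at this from the console, using the same parse data that `rco` works on:

```r
# Compare the tokens the parser produces for the same statements written
# with a newline versus with ';'
tokens_of <- function(code) {
  unique(utils::getParseData(parse(text = code, keep.source = TRUE))$token)
}

tokens_of("x <- 1 \n x")  # newline-separated statements
tokens_of("x <- 1; x")    # ';'-separated: per the bug described above, this is
                          # the case where the 'exprlist' token shows up
```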
To reduce dependencies on third-party applications and to promote homogeneity, we decided to go for an in-house solution for the CI/CD needs of our package, i.e. GitHub Actions. Earlier, the popular choices for carrying out CI/CD operations were third-party applications such as Travis CI or codecov.io, but lately more and more developers and organizations have been migrating to GitHub Actions, and we decided to follow suit. The GitHub Marketplace wasn’t much help, as there was not much support for the R language compared to other languages such as Rust, Ruby, TypeScript, etc. So I went through several pieces of documentation and the scarce examples available, and created a branch that renders a website and completes the testing that Travis used to do. We left out code coverage, as GitHub does not yet support badges that show the percentage of coverage.
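For readers who want to set up something similar for their own package, the `usethis` helpers are one possible shortcut (an alternative route, not necessarily the one we took for `rco`); they drop ready-made workflow files into `.github/workflows/`:

```r
# install.packages("usethis")
usethis::use_github_action("check-standard")  # run R CMD check on several platforms
usethis::use_github_action("pkgdown")         # build and deploy the documentation site
```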
One of the users of `rco` opened an issue with a feature request. The request was as follows:

> I think it would be nifty to be able to create a report out of rco optimisers. The idea would be to analyse the whole code and return some kind of markdown / bookdown that will list all the results from the optimisers, without changing the original code. It might be useful to analyse your own code and learn what you could do better, but also it could be use in the industry to analyse the quality of code.
While actually implementing this idea would have been a tall order, my mentors and I discussed that, as a first iteration, we could work on a function which could be called to check exactly how many files from a folder, or from a group of files, can be optimized, and which optimizers apply to each of them. If interested, the user could then run that specific file, along with the optimizers named in the output, through the `rco_gui()` function and see the code difference.
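A rough sketch of what such a first iteration could look like (names and behaviour here are my own assumptions, not the final `rco` API):

```r
library(rco)

# List the .R files in a folder that rco's optimizers would change at all.
# Assumption: optimize_text() returns the optimized code as a character string.
optimizable_files <- function(path) {
  files <- list.files(path, pattern = "\\.R$", full.names = TRUE)
  changed <- vapply(files, function(f) {
    code <- paste(readLines(f), collapse = "\n")
    !identical(optimize_text(code), code)
  }, logical(1))
  files[changed]
}

optimizable_files("R/")
```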
Well, I never thought that 3 months could pass in a jiffy, but these months working with The R Project in this program have been the golden months of my life. Never to be forgotten, and to be re-lived time and again. I am proud and happy to state that my GSoC experience and project were mostly a hit. Yes, we did face obstacles, but each and every time we emerged stronger. I have now become a life-long fan of open source.
Open source is, simply put, magic. And no one can ever get enough of magic, right >.<