Phew… So it’s already time for the end of GSoC 2020. Time sure flew by this summer. Now it’s time for me to take a step back and ponder over my summer with The R Project for Statistical Computing. So, let’s go through my summer in chronological fashion.
Around the time The R Project for Statistical Computing announced its participation in GSoC 2020, a list of proposed projects was published. The `rco` project, the R Code Optimizer, caught my attention at the very first moment and I haven’t looked back since.
Although the existing `rco` repository was extremely well maintained and documented, I, being a newbie, still faced quite a lot of difficulties and, most probably, gave my mentors, Dr. JC Rodriguez and Mauricio “Pachá” Vargas Sepúlveda, a hard time.
As part of the tests to be completed to be considered as a potential student developer for R on the `rco` project, I was required to build a completely functional, working optimizer.
I decided to work on a suggested optimizer that made column extraction faster and more efficient. With the patient guidance of my mentors, I was able to come up with a column extraction optimizer that performed as expected. This got me so excited that I went on to build another value extraction optimizer without any explicit incentive :D
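To give a flavour of the idea, here is a hypothetical before/after sketch (the exact rewrite `rco` performs may differ): extracting a column through the generic `[[` goes through S3 method dispatch, whereas the internal `.subset2()` skips it, which is where the speed-up comes from.

```r
df <- data.frame(x = runif(1e5), y = runif(1e5))

# code as the user wrote it: dispatches to `[[.data.frame` on every call
original  <- function() for (i in 1:1e4) df[["x"]]
# hypothetical optimized form: the internal .subset2() avoids the dispatch
optimized <- function() for (i in 1:1e4) .subset2(df, "x")

system.time(original())
system.time(optimized())
```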
It was only a few days before my acceptance into the GSoC 2020 program that my mentors gave their insights into both of these optimizers and explained how they could be broken, owing to the extreme flexibility that the R language offers. Since `.subset2()` is not a reserved keyword in the R language, it can be overwritten. Hence, while optimizing, we can never be sure that the functions being called have not been re-defined by the user. However, this served as a great primer as to what to expect when GSoC officially began.
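A minimal illustration of that fragility, assuming the optimizer rewrites extractions to `.subset2()`:

```r
# .subset2 is an ordinary binding, not a reserved word, so user code can mask it
.subset2 <- function(x, i, ...) stop("not the base .subset2 any more!")

df <- data.frame(x = 1:3)
df[["x"]]          # the original code still returns 1 2 3
.subset2(df, "x")  # the "optimized" code now throws an error
```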
The next tasks I was assigned were not too glorious, dramatic or bling-y, but they were essential steps for the upkeep of a package on CRAN. Firstly, I designed a vignette that lists all the potential optimizers that could be implemented in the `rco` package in the future. Next, I went on to collate the efforts put into the `rco` library by the Google Code-In students, creating PRs such that every student gets credit as a contributor while hand-picking the best of the examples and explanations to include in the official `rco` documentation. It was a riveting experience for me, as I had to learn and apply some really advanced concepts of `git` and GitHub to pull this off, including, but not limited to, cherry-picking, resolving merge conflicts, adding remotes, etc.
As Dwight D. Eisenhower once said,
> In preparing for battle I have always found that plans are useless, but planning is indispensable.
I had listed several optimization techniques that could be implemented in `rco`, both in my GSoC proposal and in the potential-optimizers vignette, but when we got down to discussing the design of these optimizers, it became quite clear to my mentors and me that we were walking on thin ice. With the extreme constraints imposed by R’s flexible nature, only a few of our optimization strategies seemed bullet-proof.
I went ahead with the idea of Jump/Conditional Threading and started designing the optimizer. The objective was to replace `if` statements with `else` branches wherever possible, and also to group together the code blocks of different `if` statements whose conditions were the same. This concept is succinctly covered here.
To see the speed-ups provided by this optimization strategy, have a look at this example:
```r
cond_thread <- function(n) {
  evens <- 0
  evens_sum <- 0
  odds <- 0
  for (i in seq_len(n)) {
    if (i %% 2 == 0) { # same logical as the next if condition (can be merged)
      evens <- evens + 1
    }
    if (i %% 2 == 0) {
      evens_sum <- evens_sum + i
    }
    if (!(i %% 2 == 0)) { # exact negation of the previous if (can be an else)
      odds <- odds + 1
    }
  }
}
```
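For reference, here is a hand-written sketch of what the threaded version could look like (the optimizer’s actual output may differ in formatting), with the identically guarded blocks merged and the negated condition turned into an `else`:

```r
cond_thread_optimized <- function(n) {
  evens <- 0
  evens_sum <- 0
  odds <- 0
  for (i in seq_len(n)) {
    if (i %% 2 == 0) {
      # the two blocks guarded by the same condition are merged
      evens <- evens + 1
      evens_sum <- evens_sum + i
    } else {
      # the exact negation becomes the else branch
      odds <- odds + 1
    }
  }
}
```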
Next, I set my eyes on the Memory Pre-Allocation optimization technique. The objective here was to save R programmers from the sin of growing a vector inside loops. In our experience, vectors, lists, etc. are often initialized with `NULL` or `c()` out of convenience. Bothering about the size of these vectors may seem trivial to the programmer, but it has a huge impact on the performance of the script.
To see the speed-ups provided by this optimization strategy, consider the difference between growing a vector inside a loop and pre-allocating it to its final length.
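A minimal sketch of the two patterns (my own illustration, not `rco` output):

```r
# growing the result with c(): the vector is re-allocated and copied on
# every iteration
grow <- function(n) {
  x <- NULL
  for (i in seq_len(n)) x <- c(x, i * 2)
  x
}

# pre-allocating the result to its final length: memory is reserved once
prealloc <- function(n) {
  x <- numeric(n)
  for (i in seq_len(n)) x[i] <- i * 2
  x
}

system.time(grow(5e4))
system.time(prealloc(5e4))
```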
As mentioned in the last segment, we had pretty much exhausted all the optimization strategies that could be implemented without breaking anything. Now it was on us to decide whether to work on new optimizers that had a non-zero chance of breaking code, or to pivot and change objectives. Following the lead of my mentors, I decided to put the optimizers that could break on hold and focus on further bolstering the `rco` package.
Owing to the fantastic upkeep of the `rco` repository, it had almost no bugs in the optimizers, and that is no mean feat. However, an issue reporting a bug was opened. The bug was that an optimizer, namely `opt_dead_expr()`, did not function when the user used `;`. Ideally, the optimizer shouldn’t have been affected by the usage of `;`, but here it was, a bug lying dormant for over 6 months.
I decided to tackle this bug head on. I created a separate branch for the bug and reproduced it. After lots of trials and tribulations, and lots of reverse engineering, I zeroed in on the problem. It lay in very low-level R code. When an R script without any `;` is parsed, the statements get tokenized as something called an `expr`, but when `;` is used, the parser tokenizes them as an `exprlist`. The existing optimizer did not handle the `exprlist` case, so I appended code that handles `exprlist`, and the optimizer started functioning normally irrespective of the usage of `;`. This issue could arise in other optimizers too, but given that it has been solved in one optimizer, solving it in the others would not prove challenging. Also, in the process of making the `opt_dead_expr()` optimizer work, I further bolstered its robustness by doubling the number of test cases that the optimizer must pass, including tests with `;`.
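A quick way to look at this from the console, using the same parse data that `rco` works on:

```r
# Compare the tokens the parser produces for the same statements written
# with a newline versus with ';'
tokens_of <- function(code) {
  unique(utils::getParseData(parse(text = code, keep.source = TRUE))$token)
}

tokens_of("x <- 1 \n x")  # newline-separated statements
tokens_of("x <- 1; x")    # ';'-separated: per the bug described above, this is
                          # the case where the 'exprlist' token shows up
```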
To reduce dependencies on third-party applications and to promote homogeneity, we decided to go for an in-house solution for the CI/CD needs of our package, i.e. GitHub Actions. Earlier, the popular choices for carrying out CI/CD operations were third-party applications such as Travis CI or codecov.io, but lately more and more developers and organizations have been migrating to GitHub Actions, and we decided to follow suit. The GitHub Marketplace wasn’t much help, as there was not much support for the R language compared to other languages such as Rust, Ruby, TypeScript, etc. So I went through several pieces of documentation and the scarce examples available, and created a branch that renders a website and completes the testing that Travis used to do. We left out code coverage, as GitHub does not yet support badges that show the percentage of coverage.
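For readers who want to set up something similar for their own package, the `usethis` helpers are one possible shortcut (an alternative route, not necessarily the one we took for `rco`); they drop ready-made workflow files into `.github/workflows/`:

```r
# install.packages("usethis")
usethis::use_github_action("check-standard")  # run R CMD check on several platforms
usethis::use_github_action("pkgdown")         # build and deploy the documentation site
```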
One of the users of `rco` opened an issue with a feature request. The request was as follows:

> I think it would be nifty to be able to create a report out of rco optimisers. The idea would be to analyse the whole code and return some kind of markdown / bookdown that will list all the results from the optimisers, without changing the original code. It might be useful to analyse your own code and learn what you could do better, but also it could be use in the industry to analyse the quality of code.
While actually implementing this idea would have been a tall order, my mentors and I discussed that, as a first iteration, we could work on a function which could be called to check exactly how many files from a folder, or from a group of files, can be optimized, and which optimizers apply to each of them. If interested, the user could then run that specific file, along with the optimizers named in the output, through the `rco_gui()` function and see the code difference.
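A rough sketch of what such a first iteration could look like (names and behaviour here are my own assumptions, not the final `rco` API):

```r
library(rco)

# List the .R files in a folder that rco's optimizers would change at all.
# Assumption: optimize_text() returns the optimized code as a character string.
optimizable_files <- function(path) {
  files <- list.files(path, pattern = "\\.R$", full.names = TRUE)
  changed <- vapply(files, function(f) {
    code <- paste(readLines(f), collapse = "\n")
    !identical(optimize_text(code), code)
  }, logical(1))
  files[changed]
}

optimizable_files("R/")
```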
Well, I never thought that 3 months could pass in a jiffy, but these months working with The R Project in this program have been the golden months of my life. Never to be forgotten, and to be re-lived time and again. I am proud and happy to state that my GSoC experience and project were mostly a hit. Yes, we did face obstacles, but each and every time we emerged stronger. I have now become a life-long fan of open source.
Open source is, simply put, magic. And no one can ever get enough of magic, right >.<