
Welcome to post 49 in the
R4 series.
The Two Cultures is a term first used by C.P. Snow in a 1959 speech and monograph focused on the split between the humanities and the sciences. Decades later, the term was (quite famously) re-used by Leo Breiman in a (somewhat prophetic) 2001 article about the split between data models and algorithmic models. In this note, we argue that statistical computing practice and deployment can also be described via this Two Cultures moniker.
Referring to the term linking these foundational pieces is of course headline bait. Yet when preparing for the discussion of r2u in the invited talk in Mons (video, slides), it occurred to me that there is in fact a wide gulf between two alternative approaches to using R and, specifically, deploying packages.
On the one hand we have the approach described by my friend Jeff as "you go to the Apple store, buy the nicest machine you can afford, install what you need and then never ever touch it". A computer / workstation / laptop is seen as an immutable object where every attempt at change may lead to breakage, instability, and general chaos and is hence best avoided. If you know Jeff, you know he exaggerates. Maybe only slightly though.
Similarly, an entire sub-culture of users striving for reproducibility (and sometimes also replicability) does the same. This is for example evidenced by the popularity of the renv package by Rcpp collaborator and pal Kevin. The expressed hope is that by nailing down a (sub)set of packages, outcomes are constrained to be unchanged. Hope springs eternal, clearly. (Personally, if need be, I do the same with Docker containers and their respective Dockerfiles.)
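As a minimal sketch of that pinning workflow (with an illustrative package name, and assuming renv is used in its standard project-based manner):

```r
## Sketch of the 'pin it down' culture via renv
install.packages("renv")        # one-time setup
renv::init()                    # project-local library plus renv.lock
install.packages("data.table")  # illustrative package, used as usual
renv::snapshot()                # record exact versions in renv.lock
## later, or on another machine: recreate the recorded state
renv::restore()
```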
On the other hand, rolling is a fundamentally different approach. One (well known) example is Google building everything at @HEAD. The entire (ginormous) code base is treated as a mono-repo which at any point in time is expected to be buildable as is. All changes made are pre-tested to be free of side effects on other parts. This sounds hard, and likely is more involved than the alternative of a "whatever works" approach of independent changes and just hoping for the best.
Another example is a rolling (Linux) distribution such as Debian. Changes are first committed to a staging place (Debian calls this the unstable distribution) and, if no side effects are seen, propagated after a fixed number of days to the rolling distribution (called testing). With this mechanism, testing should always be installable too. And based on the rolling distribution, at certain times (for Debian roughly every two years) a release is made from testing into stable (following more elaborate testing). The released stable version is then immutable (apart from fixes for seriously grave bugs and of course security updates). So this provides the connection between frequent, rolling updates and an immutable fixed set: a release.
This Debian approach has been influential for many other projects, including CRAN, as can be seen in aspects of its system providing a rolling set of curated packages. Instead of a staging area for all packages, extensive tests are run on candidate packages before an update is added. This aims to ensure quality and consistency, and has worked remarkably well. We argue that it has clearly contributed to the success and renown of CRAN.
Now, when accessing
CRAN
from
R, we fundamentally have
two accessor functions. But seemingly only one is widely known
and used. In what we may call the Jeff model, everybody is happy to
deploy
install.packages()
for
initial
installations.
That sentiment is clearly expressed
by
this bsky post: "One of my #rstats coding rituals is that every time I load a @vincentab.bsky.social package I go check for a new version because invariably it's been updated with 18 new major features"
And that is why we have
two cultures.
Because some of us, yours truly included, also use
update.packages()
at recurring (frequent!!) intervals:
daily or near-daily for me. The goodness and, dare I say, gift of
packages is not limited to those by my pal
Vincent.
CRAN updates all the time, and
updates are (generally) full of (usually excellent) changes, fixes, or
new features. So update frequently! Doing (many but small) updates
(frequently) is less invasive than (large, infrequent) waterfall-style
changes!
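The corresponding habit for the rolling culture is equally small. A sketch (ask = FALSE merely suppresses the per-package prompt):

```r
## Sketch of the 'rolling' culture: refresh the installed set against CRAN
old.packages()                # optional: list what is currently outdated
update.packages(ask = FALSE)  # update everything without prompting
```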
But the fear of change, or disruption, is clearly pervasive. One can
only speculate why. Is the experience of updating so painful on other
operating systems? Is it perhaps a lack of exposure to, or tutorials on, best practices?
These Two Cultures coexist. When I delivered the talk in Mons, I
briefly asked for a show of hands among all the
R users in the audience to see who
in fact does use
update.packages()
regularly. And maybe a
handful of hands went up: surprisingly few!
Now back to the context of installing packages: Clearly only
installing has its uses. For continuous integration checks we generally
install into ephemeral temporary setups. Some debugging work may be with
one-off container or virtual machine setups. But all other uses may well
be under maintained setups. So consider calling
update.packages()
once in a while. Or even weekly or daily.
The
rolling feature of
CRAN is a real benefit, and it is
there for the taking and enrichment of your statistical computing
experience.
So to sum up, the real power is to use
install.packages()
to obtain fabulous new statistical
computing resources, ideally in an instant; and
update.packages()
to keep these fabulous resources
current and free of (known) bugs.
For both tasks, relying on
binary installations accelerates
and eases the process. And where available, using
binary
installation with system-dependency support as
r2u does makes it easier
still, following the
r2u slogan of "Fast. Easy. Reliable. Pick All Three." Give it a try!
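As a rough sketch, assuming an Ubuntu system already configured following the r2u documentation (which bridges install.packages() to apt via the bspm package), the day-to-day experience can be as simple as:

```r
## Sketch assuming an r2u-configured Ubuntu system with bspm available
bspm::enable()                  # route install.packages() to apt binaries
install.packages("data.table")  # delivered as a pre-built binary, system deps included
update.packages(ask = FALSE)    # rolling updates arrive the same fast way
```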
This post by Dirk
Eddelbuettel originated on his Thinking inside the box
blog. If you like this or other open-source work I do, you can now sponsor me at
GitHub.