3 R Code Workflows
In the previous chapter we introduced two frameworks for thinking about the steps required for a successful data science project. Next we will explore commonly used R code workflows for orchestrating the end-to-end running of analysis or modelling projects.
3.0.1 R Scripts
The most basic workflow is to colocate all code into a single R script. This is a common starting place for beginners or when completing small, basic tasks in R. An obvious limitation is the inability to separate out logical components for readability, testing, debugging and control flow.
3.0.2 Monolithic Markdown
Commonly adopted tools to promote more literate programming [1] are notebooks such as RMarkdown or, more recently, Quarto [2]. These tools allow users to write plain English commentary in markdown or a visual editor and splice in ‘code chunks’. Typically the notebook is rendered sequentially, from top to bottom, and knitted into some form of output such as HTML, PDF or Word.
This is a great way to make code more readable and self-contained while managing complexity. Obvious drawbacks exist around the inflexible execution order, control flow and caching. These can also get very long!
3.0.3 Control Scripts
Analysts who prefer a more scripted workflow will often attempt to break down the complexity of their project into smaller chunks, often placing parts of the analysis into their own R script.
The next question is, how do we orchestrate the running of all these R scripts? This is usually solved with a ‘control’ or ‘run’ script, which calls source() on the relevant scripts in the right order.
This is a step in the right direction, but requires lots of overhead in managing state and data flows between scripts, often by manually ‘caching’ results. The scripts are often not self-contained and this can quickly be a recipe for disaster for more complex projects.
An example of this is the ProjectTemplate [3] framework.
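A control script of this kind can be sketched in a few lines of base R. The file names (01-load.R and so on), their contents and the run_env environment are hypothetical and chosen only for illustration; the step scripts are written to a temporary directory so that the sketch runs on its own.

```r
# Hypothetical three-step project, written to a temporary directory so the
# sketch is self-contained. In practice these files already exist on disk.
project_dir <- tempfile("control-script-demo-")
dir.create(project_dir)
writeLines("raw <- data.frame(x = 1:10, y = 2 * (1:10))",
           file.path(project_dir, "01-load.R"))
writeLines("clean <- raw[raw$x > 2, ]",
           file.path(project_dir, "02-clean.R"))
writeLines("col_means <- colMeans(clean)",
           file.path(project_dir, "03-summarise.R"))

# run.R: the control script. Order matters because each step relies on
# objects created by the previous one.
steps <- c("01-load.R", "02-clean.R", "03-summarise.R")
run_env <- new.env()  # shared state that every step reads from and writes to
for (step in steps) {
  source(file.path(project_dir, step), local = run_env)
}

# Results are only available through the shared environment.
run_env$col_means
```

Note that 02-clean.R cannot run unless 01-load.R has already created raw. That hidden dependency between scripts is the state-management overhead described above.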
3.0.4 {targets}
{targets} [4] is an R package that allows users to adopt a make-like [5] pipeline philosophy for their R code. This has the advantage of more sophisticated handling of computationally intensive workflows and provides a more opinionated structure to follow. The drawbacks are that your collaborators are forced to adopt the same framework and that there is an initial learning curve.
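As a rough illustration, a minimal pipeline definition might look like the sketch below. It assumes the {targets} package is installed. The helper functions clean_data() and summarise_data() and the target names are hypothetical, and the built-in airquality dataset stands in for real project data.

```r
# _targets.R (hypothetical example; this file lives at the project root)
library(targets)

# Illustrative helper functions; in a real project these usually sit in R/
clean_data <- function(df) df[stats::complete.cases(df), ]
summarise_data <- function(df) colMeans(df[, c("Ozone", "Temp")])

# Each target names a result and the command that produces it.
# {targets} infers the dependency graph from the symbols used in each command.
list(
  tar_target(raw, datasets::airquality),
  tar_target(clean, clean_data(raw)),
  tar_target(ozone_temp_means, summarise_data(clean))
)
```

With this file in place, calling targets::tar_make() from the project root builds the pipeline and skips targets that are already up to date. targets::tar_read(ozone_temp_means) then retrieves a stored result.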
3.0.5 R Package
An R Package is the canonical way to organise and ‘package’ R code for use and sharing. It provides easy means to share, install, document, test and run code.
Given it is the adopted standard for packaging R code, many users have adapted the structure to run analysis projects and research compendia [6]. This is possible; however, the structure is most commonly applied when building tools or algorithms rather than analysis workflows.
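To make the package structure concrete, the sketch below generates a bare skeleton with utils::package.skeleton(), which ships with base R. The package name demopkg and the function count_missing() are hypothetical. Many developers scaffold packages with other tooling; base R is used here only to keep the example free of extra dependencies.

```r
# A hypothetical function we want to package up.
pkg_env <- new.env()
pkg_env$count_missing <- function(x) sum(is.na(x))

# Create the skeleton inside a temporary directory.
pkg_parent <- tempfile("pkg-demo-")
dir.create(pkg_parent)
utils::package.skeleton(
  name = "demopkg",
  list = "count_missing",   # objects to include
  environment = pkg_env,    # where to find those objects
  path = pkg_parent
)

# Inspect the generated layout: metadata files plus code (R/) and help (man/).
list.files(file.path(pkg_parent, "demopkg"), recursive = TRUE)
```

The generated DESCRIPTION, NAMESPACE, R/ and man/ entries are the pieces that make code installable and documentable in a standard way.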
3.1 Choosing the right workflow
So which workflow should you use?
Unfortunately this is not a straightforward decision. For quick experimental code you are unlikely to create a new R package. For a complex production deployed model, you really don’t want all your code in one giant R script.
Picking the correct workflow means aligning it with the project's goals and scope. Often this choice will evolve throughout the project.
3.1.1 An evolution
A concept or idea might be tested in a single R script, much as you would sketch an idea on the back of a napkin. Next you might break this down into chunks and add some prose, headings and plots so you can share it and have others understand it. Next you might refactor the messy code into functions to better control the flow and improve development practices. These functions can be documented and unit tested once you know you want to rely on them. To orchestrate the running and dependency structure, and to avoid re-running slow and complex code, you may use the {targets} package. Finally, to re-use, share and improve on the functions, you might spin these out into their own R package!
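The middle of that evolution, refactoring into a documented and tested function, might look like the following sketch. The function rescale01() is a hypothetical example. The comments follow the roxygen2 convention, and base stopifnot() stands in for a dedicated unit testing framework to keep the example dependency-free.

```r
#' Rescale a numeric vector to the range [0, 1]
#'
#' @param x A numeric vector with at least two distinct finite values.
#' @return A numeric vector the same length as `x`.
rescale01 <- function(x) {
  stopifnot(is.numeric(x))
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Minimal checks that fail loudly if the behaviour changes.
stopifnot(
  isTRUE(all.equal(rescale01(c(0, 5, 10)), c(0, 0.5, 1))),
  is.na(rescale01(c(1, NA, 3))[2])  # missing values are preserved
)
```

Once a function like this is documented and tested, moving it into a pipeline or a package is mostly a matter of relocating the file.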
3.1.2 Repro-retro
I talked a little about how you might want to weight and prioritise the elements of reproducibility in an rstudio::global talk in 2019. I used the concept of a reproducibility retrospective (repro-retro). Feel free to conduct your own repro-retro.