6  Elements of Production-Deployed R Code

R has a tricky historical reputation as a programming language. Because of its origins in academia, it is often argued that R is not fit for production deployment. The recent development of tools to support the deployment of R workloads is changing this. Below we explore some key technical elements that should be included in any stable deployment of a data science workflow, adapted from Kreuzberger, Kühl, and Hirschl (2023).

6.1 Orchestration

Orchestration refers to how your code is structured and run. We explored various ways of structuring our code for data science workflows in Chapter 2. There are two key facets to code orchestration: separating out the functional components of your project, and orchestrating those components to perform some task.

As R is a functional programming language, the canonical way to structure our code is with functions. In a practical sense, functions are a convenient way to package, document, test and control the execution of code in a project.
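
For example, a single step of a workflow might be wrapped in a small function like the following (a minimal sketch; the function name and behaviour are hypothetical):

# A hypothetical pipeline step: clean raw measurement data
clean_measurements <- function(data) {
  # Drop incomplete rows and standardise column names
  data <- data[stats::complete.cases(data), ]
  names(data) <- tolower(names(data))
  data
}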

Once the functional elements of your project are stored, documented and tested appropriately, they need to be run in the correct order. This can be achieved using a number of frameworks, such as notebooks, an R package, {targets} and more.
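
As an illustration, a minimal {targets} pipeline places the orchestration in a _targets.R file (a sketch; the file path and the clean_measurements() function from above are hypothetical):

# _targets.R: a hypothetical two-step pipeline
library(targets)
tar_source()  # load project functions (e.g. clean_measurements()) from R/
list(
  tar_target(raw_data, readr::read_csv("data/measurements.csv")),
  tar_target(clean_data, clean_measurements(raw_data))
)

Running targets::tar_make() then builds each target in dependency order, skipping any targets that are already up to date.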

6.2 Automation

Automation refers to how your project is ‘built’. In other words, how it is pushed and pulled from your local development environment into a remote environment where other users can access the results. Again, many options exist, with varying levels of automation.

Manual approaches, such as click-button deployment from your IDE or copying files by hand to a remote server, are one option.

graph LR
a1[Local Compute]-->a2[Server]<-->a3[User]

A better approach is to stage your code using a remote version control repository. This acts as a store for source code that can be more conveniently ‘pulled’ or synchronised to the server where your analysis is hosted for users. Often, accompanying model artefacts or data will also be stored somewhere remote for the server to access (a short code sketch follows the diagram below).

graph LR
a1[Local Compute]-->a2[Code Repo]
a1[Local Compute]-->a3[(Data Store)]
a2[Code Repo]-->a4[Server]
a3[(Data Store)]-->a4[Server]
a4[Server]<-->a5[User]
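
For R projects, the {usethis} package offers helpers for this staging step (a sketch; it assumes a GitHub account has already been configured):

# Initialise local version control, then create and push to a GitHub remote
usethis::use_git()
usethis::use_github()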

More contemporary approaches involve Continuous Integration and Continuous Deployment (CI/CD) practices. This is a DevOps-style workflow that automatically builds, tests and deploys code that has been pushed to a remote version control repository (see the sketch after the diagram below). A thorough exploration of CI/CD solutions is beyond the scope of this book; applied to machine learning, this practice is now a clearly defined sub-specialty known as MLOps.

graph LR
a1[Local Compute]-->a2[Code Repo]
 subgraph id1 [DevOps Pipeline]
  a2[[Code Repo]]-->b1[[Build]]
  b1[[Build]]-->b2[[Test]]
  b2[[Test]]-->b3[[Deploy to Container Registry]]
  end
a1[Local Compute]-->a3[(Data Store)]
b3[[Deploy to Container Registry]]-->a4[Server/Serverless App]
a3[(Data Store)]-->a4
a4<-->a5[User]
style id1 fill:lightblue;
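
As a small taste of what this looks like in the R ecosystem, {usethis} can scaffold a GitHub Actions workflow, one of many CI/CD providers (a sketch):

# Add a standard R CMD check workflow under .github/workflows/
usethis::use_github_action("check-standard")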

6.3 Reproducibility

6.3.1 Code dependencies

When we talk about reproducibility, there are many elements of a data science workflow that need to be considered. The first step towards a reproducible pipeline is ensuring that all users have the same code that is used to run the project. The recommended approach here is to ensure that all code is checked into a remote repository using a version control system such as git.

That way, other collaborators can clone the code base from the remote git repository and branch or fork from it in order to make their own changes. Those changes can then be integrated back using proper version control principles.
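
Git is usually driven from the command line, but the {gert} package exposes the same operations from R (a sketch; the repository URL is hypothetical):

# Clone a collaborator's repository and start a feature branch
gert::git_clone("https://github.com/example/analysis.git", path = "analysis")
gert::git_branch_create("my-feature", repo = "analysis")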

6.3.2 Package dependencies

Another key element of reproducibility is ensuring that all users can resolve the same R package dependencies. This includes having the correct packages, installing those packages from the correct locations and ensuring the versions of those packages are equivalent. A convenient solution for these problems is the {renv}1 package. The {renv} package helps by creating a locally stored project library, along with the infrastructure to record it and automate its restoration for other users.
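
The core {renv} workflow is brief (a minimal sketch; each call is run in the project’s R session):

renv::init()      # create a project-local library and lockfile (renv.lock)
renv::snapshot()  # record the exact package versions in use
renv::restore()   # reinstall those recorded versions on another machine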

6.3.3 System dependencies

System dependencies describe the software that is installed on the computational environment where the project is run. These may include external libraries such as libxml or GDAL. Tools like Posit Public Package Manager2 can provide some analysis of the system dependencies required to support R packages on various operating systems. Another solution, explored in the next section, is using technologies such as Docker3.
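
For example, the {remotes} package can query this kind of information from R (a sketch; the printed output is indicative only):

# System libraries needed by the xml2 package on Ubuntu 20.04
remotes::system_requirements("ubuntu", "20.04", package = "xml2")
#> e.g. "apt-get install -y libxml2-dev"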

6.3.4 Operating System dependencies

Even when running the same code with the same R packages, users may find differences in how the code performs depending on their operating system: for example, Windows versus Linux versus macOS. It is common in a production setting to deploy code to an external server, most likely running a Linux operating system. A way to control operating system and system dependencies is through technologies such as Docker. Docker provides a way for developers to specify the exact building blocks of a computational environment (using a Dockerfile) and instructions to run those components in the right order. Once built, the environment is packaged as a container image that can be run agnostically on different hardware.
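
A minimal Dockerfile for an R project might look like this (a sketch using the community rocker images; the script name is hypothetical):

# Start from a versioned R base image
FROM rocker/r-ver:4.3.2

# Install a system dependency (e.g. for the xml2 package)
RUN apt-get update && apt-get install -y libxml2-dev

# Copy the project in and restore its package library with {renv}
COPY . /project
WORKDIR /project
RUN R -e 'install.packages("renv"); renv::restore()'

# Run the analysis when the container starts
CMD ["Rscript", "run_analysis.R"]

Building this file produces a container image that can then be run on any host with Docker installed.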

6.3.5 Hardware dependencies

Hardware dependencies are a little trickier. There are physical hardware constraints on how a project runs regardless of the operating system and software running on it. For example, this is commonly seen with the use of CPU versus GPU technologies and different types of processor chips. A thorough exploration of hardware dependencies is outside the scope of this guide.

6.4 Version Control

Version control is a critical aspect of ensuring reproducibility in production-deployed data science solutions. It includes version control not only for code but also for data and models. Version control for code is commonly achieved using the git4 software, a version control system that allows users to manage changes to software and code collaboratively.

If the data science project results in a statistical or machine learning model being fit, it is the model itself that will be the deployed artefact in order for inference to be performed. It is therefore critical that this model is also versioned. Again, there are many solutions for this. A recent development in the R ecosystem is the {vetiver} package5, which brings MLOps practices to the R ecosystem. The exploration of version control for data is beyond the scope of this book; however, it is recommended that data are also versioned and stored appropriately in a secured data store, such as a data lake or data warehouse.
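
Versioning a fitted model with {vetiver} takes only a few lines (a minimal sketch using a toy linear model and a temporary pins board):

library(vetiver)
library(pins)

# A toy model to version
model <- lm(mpg ~ wt, data = mtcars)

# Wrap the model with its metadata and write it to a versioned pins board
v <- vetiver_model(model, "mtcars-mpg")
board <- board_temp(versioned = TRUE)
vetiver_pin_write(board, v)

Each subsequent call to vetiver_pin_write() stores a new version of the model alongside its metadata.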

6.5 Metadata and Documentation

It is important when deploying artefacts into a production setting that there is appropriate metadata. Metadata refers to tags, versions, dates and other relevant information describing the work being deployed. We’ll explore this further in a modelling context later on using the {vetiver} package. For metadata and documentation of the functional aspects of the code, we can rely on the internal documentation conventions used in R packages.

To achieve elegant function documentation we can use the {roxygen2}6 package. This provides a useful template for documenting our functions using tags that are automatically rendered into help documents. Another form of documentation in an R package is the README. The package README is an important artefact that tells users what the software does, how it is to be run, and any other important information such as the licence, prerequisites and dependencies.
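
Continuing the earlier example, {roxygen2} documentation is written as structured comments above the function (a sketch):

#' Clean raw measurement data
#'
#' Drops incomplete rows and standardises column names.
#'
#' @param data A data frame of raw measurements.
#' @return A cleaned data frame.
#' @export
clean_measurements <- function(data) {
  data <- data[stats::complete.cases(data), ]
  names(data) <- tolower(names(data))
  data
}

Running devtools::document() (or roxygen2::roxygenise()) renders these tags into the package’s help files.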

6.6 Testing

Finally, we must address unit testing. Writing tests is an important element of software engineering. In terms of our code, testing can be performed using the {testthat}7 R package. The {testthat} package provides a framework for software testing that integrates well with writing R packages and with the RStudio IDE; tests can also be run interactively on a directory.
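
A unit test for the earlier hypothetical function might look like this (a sketch, typically saved under tests/testthat/):

library(testthat)

test_that("clean_measurements drops incomplete rows", {
  raw <- data.frame(A = c(1, NA, 3), B = c("x", "y", "z"))
  cleaned <- clean_measurements(raw)
  expect_equal(nrow(cleaned), 2)      # the row containing NA is removed
  expect_named(cleaned, c("a", "b"))  # column names are lower-cased
})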

6.7 Putting it into practice

Pulling these elements and tools into practice will be done in Chapter 8.


  1. https://rstudio.github.io/renv/articles/renv.html

  2. https://packagemanager.posit.co/client/#/

  3. https://www.docker.com/get-started/

  4. https://git-scm.com/

  5. https://rstudio.github.io/vetiver-r/

  6. https://roxygen2.r-lib.org/

  7. https://testthat.r-lib.org/index.html