Rcpp Notes From a Newbie

Intro

You probably already know that C++ has speed advantages compared to R, and Rcpp is the package for exposing that fast and efficient C++ code to R. For me, the motive to learn Rcpp (and C++) stems from xgboost and lightgbm, two prominent machine learning libraries used in Kaggle competitions. At their core, they’re both written in C++. However, they both have R and Python interfaces, which, in my opinion, are a huge part of their popularity. For years I’ve had a nagging desire to understand how this works and how I can build my own cross-platform models. Here I outline some notes as I get my hands dirty with C++ and Rcpp.

Prerequisites

  1. Learn R
  2. Learn C++ (I found Frank Mitropoulos’s course on Udemy fantastically helpful.)

Setup

I’ll assume you have R, RStudio, Rcpp, and a C++ compiler installed…

  1. Open a fresh R session and run Rcpp::evalCpp("1+1"). If R doesn’t return 2 in the console, something’s wrong.
  2. In RStudio, File > New File > C++, and save as cpp_functions.cpp
  3. In RStudio, File > New File > R Script, and save as r_functions.R
  4. By default, RStudio prefills cpp_functions.cpp with starter code and leaves r_functions.R empty. Let’s modify these files as follows:

// cpp_functions.cpp

//--- Load header files --------------------------------------

#include <Rcpp.h>

# r_functions.R

#---- Load Rcpp ---------------------------------------------

library(Rcpp)  # version 1.0.0

1. Hello World

Create a C++ function hello_world() that simply prints “Hello World” and call it from R.

// cpp_functions.cpp

#include <Rcpp.h>

//---- 1. Hello World ---------------------------------------------
// Print "Hello World!" to the console

// [[Rcpp::export]]
void hello_world(){
  Rcpp::Rcout << "Hello World!" << std::endl;
}

# r_functions.R

#---- Load Rcpp ---------------------------------------------

library(Rcpp)  # version 1.0.0

#---- Compile cpp_functions.cpp ---------------------------------------------

sourceCpp('cpp_functions.cpp')

#---- 1. Hello World ---------------------------------------------
# Create a function hello_world() that prints "Hello World" to the console

hello_world()

# What does hello_world() return?
foo <- hello_world()
foo  # NULL

A few notes about this…

  1. In order to expose our C++ hello_world() function to R, we have to put the special tag // [[Rcpp::export]] just above the function definition.
  2. We use Rcpp::Rcout instead of the more common std::cout, as recommended by Dirk Eddelbuettel (Rcpp’s primary author and maintainer).
  3. In r_functions.R we use Rcpp’s sourceCpp() function to compile our C++ code and expose it to R.

2. User Input

Create a function hello_master() that prompts the user with “Enter your name”. After entering your name and hitting enter, “Hello Master your_name!” should be printed to the console.

// cpp_functions.cpp

//---- 2. Hello Master ---------------------------------------------
// Prompt the user to enter their name
// Print "Hello Master <user's name>!"

// [[Rcpp::export]]
void hello_master(){
  Rcpp::Environment base = Rcpp::Environment("package:base");
  Rcpp::Function readline = base["readline"];
  
  // Prompt the user to enter a string
  std::string mystring = Rcpp::as<std::string>(readline("Enter your name: "));
  Rcpp::Rcout << "Hello Master " << mystring << "!" << std::endl;
}

# r_functions.R

#---- 2. User Input ---------------------------------------------

hello_master()

Notes

  1. In this example, C++ is actually calling base R’s readline() function to get the user input, so hello_master() only prompts when R is running interactively.

3. Add Numbers

Create a function add_numbers(a, b) that adds two numbers a and b.

// cpp_functions.cpp

//---- 3. Add Numbers ---------------------------------------------
// Add two numbers and return the result

// [[Rcpp::export]]
double add_numbers(double a, double b){
  // Add two numbers and return the result
  
  return a + b;
}

# r_functions.R

#---- 3. Add Numbers ---------------------------------------------
# Add two numbers and return the result

add_numbers(a = 1, b = 1)    # 2
add_numbers(1L, 1L)          # 2
class(add_numbers(1L, 1L))   # numeric, not integer! 
add_numbers(1, NA_integer_)  # NA
add_numbers(1)               # Error in add_numbers(1) : argument "b" is missing, with no default

Notes

  1. When we add two integers, Rcpp casts them to doubles automatically and returns a double
  2. When we try adding a number and NA_integer_, we get back NA_real_
  3. We get a nice error message if we forget one of the arguments

4. Random Number Generation

Create a function roll_die() that returns a random integer between 1 and 6.

// cpp_functions.cpp

//---- 4. Random Number Generation ---------------------------------------------
// Simulate rolling a fair die

// [[Rcpp::export]]
int roll_die(){
  // Returns a random integer between 1 and 6
  
  // Create a vector of possible values
  Rcpp::IntegerVector vals = Rcpp::IntegerVector::create(1, 2, 3, 4, 5, 6);
  
  // Roll the die
  int result = Rcpp::as<int>(Rcpp::sample(vals, 1));
  
  // Return the result
  return result;
}

# r_functions.R

#---- 4. Random Number Generation ---------------------------------------------
# Simulate rolling a fair die

roll_die()  # 3
roll_die()  # 1
roll_die()  # 4

set.seed(2016); roll_die()  # 2
set.seed(2016); roll_die()  # 2
set.seed(2016); roll_die()  # 2

Notes

  1. Rcpp’s sample() pays attention to the random seed in R so that we can get reproducible results
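
To see the same seed-in, same-rolls-out idea outside of R, here’s a minimal sketch in plain C++ (no Rcpp, so it compiles on its own). It swaps R’s random number generator for the standard library’s std::mt19937, so the actual numbers will not match what set.seed() gives you in R. The function name roll_dice and its seed and n parameters are my own inventions for illustration.

```cpp
#include <random>
#include <vector>

// Hypothetical helper (not part of Rcpp): roll a fair die n times using a
// generator seeded with `seed`. This uses std::mt19937, NOT R's RNG.
std::vector<int> roll_dice(unsigned int seed, int n) {
  std::mt19937 gen(seed);                        // seeded generator
  std::uniform_int_distribution<int> die(1, 6);  // bounds are inclusive
  std::vector<int> rolls;
  rolls.reserve(n);
  for (int i = 0; i < n; ++i) {
    rolls.push_back(die(gen));
  }
  return rolls;
}
```

Calling roll_dice(2016, 3) twice should give two identical vectors, which is the same behavior we saw with set.seed(2016) followed by roll_die().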

5. Function Prototypes

  1. Create a function called hi_mom() that prints “hi mom” to the console
  2. Create a function called hi_dad() that prints “hi dad” to the console
  3. Modify hi_mom() so that after printing “hi mom”, the function randomly decides whether to call hi_dad() with 50% probability
  4. Modify hi_dad() so that after printing “hi dad”, the function randomly decides whether to call hi_mom() with 50% probability

// cpp_functions.cpp

//---- 5. Function Prototypes ---------------------------------------------

void hi_mom();
void hi_dad();

// [[Rcpp::export]]
void hi_mom(){
  // Print "hi mom" to the console
  // Then randomly decide whether to call hi_dad()
  
  Rcpp::Rcout << "hi mom" << std::endl;
  if(Rcpp::runif(1)[0] > 0.5) hi_dad();
}

// [[Rcpp::export]]
void hi_dad(){
  // Print "hi dad" to the console
  // Then randomly decide whether to call hi_mom()
  
  Rcpp::Rcout << "hi dad" << std::endl;
  if(Rcpp::runif(1)[0] > 0.5) hi_mom();
}

# r_functions.R

#---- 5. Function Prototypes ---------------------------------------------

hi_mom()
hi_dad()

Notes

  1. If we exclude the function prototypes void hi_mom(); and void hi_dad();, then sourceCpp('cpp_functions.cpp') fails with the error use of undeclared identifier 'hi_dad'. When the C++ compiler reaches the body of hi_mom(), it sees a call to hi_dad(), but at that point hi_dad() hasn’t been declared yet (since it’s defined below hi_mom()). The prototypes simply tell the compiler that these functions exist even though we haven’t defined them yet. Strictly speaking, only the hi_dad() prototype is required here, because hi_mom() is already defined by the time hi_dad() calls it.
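
The prototype idea isn’t specific to Rcpp. Here’s a small sketch in plain C++ where two functions call each other. I’ve made it deterministic (no random numbers) so it’s easy to check. The names is_even and is_odd are just illustrative.

```cpp
// Prototypes: declare both functions before either one is defined.
// Only the is_odd prototype is strictly required, since is_even is
// already defined by the time is_odd calls it.
bool is_even(unsigned int n);
bool is_odd(unsigned int n);

// n is even if it is 0, or if n - 1 is odd.
bool is_even(unsigned int n) {
  if (n == 0) return true;
  return is_odd(n - 1);  // is_odd is defined further down
}

// n is odd if it is not 0 and n - 1 is even.
bool is_odd(unsigned int n) {
  if (n == 0) return false;
  return is_even(n - 1);
}
```

If you delete the two prototype lines, the compiler should stop at the call to is_odd() inside is_even() with the same kind of undeclared-identifier error.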

6. Pass by Value vs Reference

Create functions add_one(int x), add_two(int &x), add_three(Rcpp::IntegerVector x), … with slightly different implementations. Observe how, if we call these functions from R, some of them actually change the value of the variable we pass into them.

// cpp_functions.cpp

//---- 6. Pass by value/reference ---------------------------------------------

// [[Rcpp::export]]
int add_one(int x){
  x = x + 1;
  return x;
}

// [[Rcpp::export]]
int add_two(int &x){
  x = x + 2;
  return x;
}

// [[Rcpp::export]]
Rcpp::IntegerVector add_three(Rcpp::IntegerVector x){
  x = x + 3;
  return x;
}

// [[Rcpp::export]]
Rcpp::IntegerVector add_four(Rcpp::IntegerVector x){
  x = Rcpp::clone(x);  // force a deep copy so the caller's vector is untouched
  x = x + 4;
  return x;
}

# r_functions.R

#---- 6. Pass by value/reference ---------------------------------------------

x <- 1L; cbind(add_one(x), x)    # 2 1
x <- 1L; cbind(add_two(x), x)    # 3 1
x <- 1L; cbind(add_three(x), x)  # 4 4
x <- 1;  cbind(add_three(x), x)  # 4 1 (type conversion)
x <- 1L; cbind(add_four(x), x)   # 5 1

Notes

  1. By default, Rcpp passes vectors from R to C++ without copying them (effectively by reference), so any changes you make to the input parameter in C++ can show up in R. This is why add_three(x) with x = 1L changes the value of x from 1L to 4L. However, if Rcpp has to convert the input from one type to another, it works on a converted copy and the original object is not modified. When we call add_one(x) with x = 1L, Rcpp converts x from an R integer vector to a plain old int, so x in R’s environment is not modified. Similarly, add_two(x) with x = 1L (integer vector to int) and add_three(x) with x = 1 (numeric to integer vector) both involve conversions, so x is left alone. Lastly, add_four(x) uses Rcpp’s clone() function to force a copy so that the original x variable is not modified.
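
For comparison, here’s a rough sketch of the same four ideas in plain C++ with std::vector instead of Rcpp types. The function names are made up for this example. One difference worth noticing: in plain C++, an int& parameter really does modify the caller’s variable, whereas add_two() called from R did not, because Rcpp handed it a converted temporary.

```cpp
#include <vector>

// By value: the function works on its own copy of x.
int add_one_by_value(int x) {
  x += 1;
  return x;
}

// By reference: the caller's variable is modified.
int add_two_by_ref(int &x) {
  x += 2;
  return x;
}

// Vector by reference: elements are modified in place.
std::vector<int> add_three_in_place(std::vector<int> &x) {
  for (int &v : x) v += 3;
  return x;
}

// Explicit copy (similar in spirit to Rcpp's clone()): the input is untouched.
std::vector<int> add_four_copy(const std::vector<int> &x) {
  std::vector<int> y = x;  // deep copy
  for (int &v : y) v += 4;
  return y;
}
```

Starting from a variable equal to 1 each time, the first and last functions leave the caller’s variable at 1, while the two by-reference versions change it to 3 and 4 respectively.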

7. Mean of a Vector

Create a function my_mean(x, na_rm = false) that returns the mean of a vector. my_mean() should behave just like base R’s mean()

// cpp_functions.cpp

//---- 7. Mean of a Vector ---------------------------------------------

// [[Rcpp::export]]
double my_mean(Rcpp::NumericVector x, bool na_rm = false){
  
  double s = 0;
  double count = 0;
  double val_i = 0;
  bool hasPosInf = false;
  bool hasNegInf = false;
  
  // Loop through x
  for(int i = 0; i < x.size(); i++){
    val_i = x[i];
    
    // Check if val_i is NaN...
    if(R_IsNaN(val_i)){
      if(na_rm){
        continue;
      } else{
        return R_NaN;
      }
    }
    
    // Check if val_i is NA...
    if(R_IsNA(val_i)){
      if(na_rm){
        continue;
      } else{
        return R_NaReal;
      }
    }
    
    // Check if val_i is +Inf
    if(val_i == R_PosInf){
      hasPosInf = true;
      if(hasNegInf) return R_NaN;
    }
    
    // Check if val_i is -Inf
    if(val_i == R_NegInf){
      hasNegInf = true;
      if(hasPosInf) return R_NaN;
    }
    
    // Update the current sum and count
    s += val_i;
    count++;
  }
  
  // Special cases
  if(hasPosInf) return R_PosInf;
  if(hasNegInf) return R_NegInf;
  if(count == 0) return R_NaN;
  
  // Calculate the mean
  double result = s/count;
  
  return result;
}

# r_functions.R

my_mean(c(1, 2, 3))                  # 2
my_mean(c(1, NA, 3))                 # NA
my_mean(c(1, NaN, 3))                # NaN
my_mean(c(1, NA, 3), na_rm = TRUE)   # 2
my_mean(c(1, NaN, 3), na_rm = TRUE)  # 2
my_mean(NA_real_, na_rm = TRUE)      # NaN
my_mean(c(1, Inf))                   # Inf
my_mean(c(1, -Inf))                  # -Inf
my_mean(c(1, -Inf, Inf))             # NaN

Notes

  1. We can’t use a dot “.” in C++ variable names, so I’ve changed na.rm to na_rm
  2. Use R_IsNaN() to check for NaN
  3. Use R_IsNA() to check for NA
  4. The constants R_NaN, R_NaInt, R_NaReal, R_NaString, R_PosInf, and R_NegInf correspond to R’s NaN, NA_integer_, NA_real_, NA_character_, Inf, and -Inf
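
As a point of comparison, here’s a simplified sketch of the same function in plain C++ without Rcpp. It’s not a drop-in replacement: plain C++ has no notion of R’s NA, so std::isnan() treats NA and NaN alike, and the name mean_std is just something I made up. It also leans on ordinary IEEE floating-point arithmetic for the infinite cases (Inf plus -Inf gives NaN) instead of tracking them with flags.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical plain-C++ analogue of my_mean(). Cannot tell NA from NaN.
double mean_std(const std::vector<double> &x, bool na_rm = false) {
  const double nan = std::numeric_limits<double>::quiet_NaN();
  double sum = 0.0;
  std::size_t count = 0;

  for (double v : x) {
    if (std::isnan(v)) {
      if (na_rm) continue;  // skip missing values
      return nan;           // otherwise the mean is undefined
    }
    // Infinite values need no special handling here:
    // Inf + finite = Inf, and Inf + (-Inf) = NaN.
    sum += v;
    ++count;
  }

  if (count == 0) return nan;  // nothing left to average
  return sum / static_cast<double>(count);
}
```

With this sketch, mean_std({1, 2, 3}) gives 2, and a vector containing both Inf and -Inf gives NaN, matching the behavior of the examples above apart from the NA versus NaN distinction.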