Sampling Integers with RcppArmadillo
DISCLAIMER: This is a blog post from my graduate school years. The blog is no longer maintained and the material might include typos and errors.
To my surprise, it turns out that sampling integers with unequal probabilities using the RcppArmadillo package is more subtle than it appears at first. To explain the problem, consider the following situation. We have a length-$K$ vector, $\mathbf p = [p_1,p_2,…,p_K]^\top$, which satisfies $\sum_{j=1}^K p_j = 1$. In C++ code that uses the Armadillo library, we want to sample an integer from $\mathcal X = \{1,2,…,K\}$ according to these probabilities, i.e., sample the integer $1$ with probability $p_1$, the integer $2$ with probability $p_2$, and so on. In addition, after sampling, we need the vector $\mathbf p$ for some further calculations, and the end result of the program is to be exported into R.
While the sampling function of the RcppArmadillo package is super convenient, it has a small problem in these situations: namely, the function modifies the probability vector, $\mathbf p$, in its calculations.
The following code might illustrate the point. First, we load the RcppArmadillo package:
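```r
library(RcppArmadillo)
```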
Our code for sampling looks like the following. First, we include the header for the sampling function and tell Rcpp that the code depends on the Armadillo library. It suffices to include the RcppArmadilloExtensions/sample.h header, as it will include RcppArmadillo.h automatically.
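A minimal version of this chunk might look as follows:

```cpp
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadilloExtensions/sample.h>
```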
Next, we write our sampling function:
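A sketch along these lines (the name samp_arma is illustrative; the exact signature of the original function is not preserved):

```cpp
// [[Rcpp::export]]
arma::uword samp_arma(Rcpp::NumericVector &pvec) {
  // the candidate outcomes 0, 1, ..., K - 1
  Rcpp::IntegerVector opts = Rcpp::seq(0, pvec.size() - 1);
  // draw a single outcome (with replacement), using pvec as the vector of
  // sampling probabilities, and convert the draw to an unsigned integer
  return (arma::uword) Rcpp::RcppArmadillo::sample(opts, 1, true, pvec)[0];
}
```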
Notice, first, that we pass the pvec object by reference to the function in order to avoid making a deep copy. Second, as the Rcpp::RcppArmadillo::sample function returns its draws in a vector, we extract the single draw and explicitly convert it into an arma::uword object (Armadillo's unsigned integer type).
Now, when testing the code we observe the following:
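A chunk along these lines reproduces the behavior (the original seed is not preserved, so the numbers below are those of the original run):

```r
pvec <- runif(4)
pvec <- pvec / sum(pvec)
print(pvec)            # the original probabilities
print(samp_arma(pvec))
print(pvec)            # pvec after sampling
```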
## [1] 0.1214495 0.3329164 0.1727188 0.3729152
## [1] 0
## [1] 0.3729152 0.7058317 0.8785505 1.0000000
Observe that the pvec object has changed after the sampling is done: it now contains the cumulative probabilities (of the probabilities sorted in decreasing order). If we need to do some subsequent calculations with the pvec object after using it to sample integers, this might be problematic. We would face two options: calculate the finite differences of the modified vector to recover the original probabilities, or pass the vector of probabilities by value. (Unfortunately, passing pvec by constant reference will not work.) Both of these options might become expensive if the program needs to sample very often and pvec has many elements.
After having a look into the source code of the function, I decided to write up a new function based on it. The logic of how the function works is quite straightforward. If we have only two possible outcomes, $1$ and $2$, and we want to sample $1$ with probability $p_1$ and $2$ with probability $p_2 = 1-p_1$, we can just draw a random number, $u$, from the interval $(0,1)$ and return $1$ if $u < p_1$ and $2$ otherwise. As the probability of $u$ being smaller than or equal to $p_1$ is exactly $p_1$, we would be drawing $1$ with the correct probability.
For the general case, we consider the length-$K$ vector of cumulative probabilities, $\mathbf q = [q_1, q_2, …, q_K]^\top$, where $q_1 = p_1$ and, in general, $q_j = \sum_{i \le j} p_i$. Since $q_{j} - q_{j-1} = \sum_{i\le j}p_i - \sum_{i\le j-1} p_i = p_j$, we see that the length of the interval $(q_{j-1}, q_j)$ is equal to the probability with which we want to sample the $j$th outcome. Hence, we can draw $U\sim \text{Uniform}(0,1)$ and choose outcome $j$ if $q_{j-1} < U < q_j$, where we define $q_0 = 0$.
Now, there are two additional tricks that we might want to consider. First, after sampling $U$ from $(0,1)$, how do we find the interval into which $U$ falls? We might loop over all the intervals and stop when the desired interval is reached; yet, this can be inefficient if we want to sample multiple observations using the same probability vector, because some of the intervals, $(q_{j-1}, q_j)$, are wider than others. If we start the loop from the first interval and advance in order, it would be ideal if the first interval were the widest and, thus, had the largest probability of $U$ falling into it, the second interval the second widest, and so on. Sorting the probabilities in decreasing order before accumulating them achieves exactly this and can save quite a lot of time, as each draw will likely iterate over fewer intervals.
To incorporate this consideration, we might write our tentative sampling code as follows:
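A sketch (call it samp_temp; it assumes a properly normalized pvec):

```cpp
// [[Rcpp::export]]
arma::uword samp_temp(const arma::vec &pvec) {
  // order the outcomes so that the largest probabilities come first
  arma::uvec indx = arma::sort_index(pvec, "descend");
  // cumulative probabilities of the sorted vector
  arma::vec qvec = arma::cumsum(pvec(indx));
  // draw u ~ Uniform(0, 1)
  double u = R::runif(0.0, 1.0);
  // return the first outcome whose cumulative probability reaches u
  for (arma::uword k = 0; k < qvec.n_elem; ++k) {
    if (u <= qvec(k)) return indx(k);
  }
  // safeguard: rounding may leave the last element of qvec slightly below 1
  return indx(qvec.n_elem - 1);
}
```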
Notice that once u <= qvec(k) is true, the program immediately terminates and returns the outcome corresponding to the $(k+1)$th interval (the $(k+1)$th because indices in C++ start from zero). Thus, if the loop does not terminate in the first iteration (k = 0), we know that u > qvec(0); if it then terminates in the second iteration (k = 1), this means that qvec(0) < u <= qvec(1), and so on. Also, the code draws only a single integer; the extension to drawing N samples would be straightforward (although we'll not pursue this here).
This is all good, but these calculations can become quite unstable if the probabilities contained in pvec get very small. So, for better numerical stability, it is always good to operate in log-space. We won't go into too much detail, but only note that there is a special function to calculate $\log(1 + x)$ in a numerically stable way. Thus, whenever you encounter a situation in which you have to add one to a real number and take the logarithm, never use log(1 + x) but instead use log1p(x), which is provided in almost all programming languages (for example, try typing ?log1p into your R console).
So, why is this important? Suppose we are working with log-probabilities. We have $\log p_i, \log p_j \ll 0$, and we want to calculate $\log(p_i + p_j)$. As both $\log p_i$ and $\log p_j$ are very small, exponentiating them might lead to underflow. For example, if you have $\log p_i = \log p_j = -750$ and you exponentiate them in R, you'll get:
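```r
lp1 <- -750
lp2 <- -750
exp(lp1)
```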
## [1] 0
Hence, exp(lp1) + exp(lp2) will evaluate to zero, and taking the logarithm evaluates to -Inf:
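```r
log(exp(lp1) + exp(lp2))
```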
## [1] -Inf
We know that this answer cannot be true: since both $p_i$ and $p_j$ are strictly positive, their sum is positive as well, and thus $\log(p_i + p_j) > -\infty$.
A better way to calculate $\log(p_i + p_j)$ is the following. We rewrite the quantity as
\[\begin{aligned} \log(p_i + p_j) & =\log(e^{\log p_i}+ e^{\log p_j}) = \log\left[ p_i\Big(1 + p_i^{-1}e^{\log p_j}\Big) \right] = \log p_i + \log\Big(1 + e^{\log p_j - \log p_i}\Big) \\ &=\log p_i + \text{log1p}\Big[\exp(\log p_j - \log p_i)\Big] \end{aligned}\]

and use the log1p function to evaluate $\log(p_i + p_j)$ as follows:
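```r
lp1 + log1p(exp(lp2 - lp1))
```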
## [1] -749.3069
In C++, we might code up a function to calculate the logarithm of the sum of two exponentials as follows:
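A sketch; to keep the exponential from overflowing, we factor out the larger of the two arguments (i.e., we apply the formula above with $p_i \ge p_j$):

```cpp
#include <algorithm>
#include <cmath>

// log(exp(x) + exp(y)), factoring out the larger term for stability
double log_add_exp(const double x, const double y) {
  double m = std::max(x, y);
  return m + std::log1p(std::exp(std::min(x, y) - m));
}
```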
We might, further, use this function to accumulate log-probabilities on the log-scale:
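For instance:

```cpp
// running log-sum-exp: element k of the result is
// log(exp(x(0)) + exp(x(1)) + ... + exp(x(k)))
arma::vec log_accu_exp(const arma::vec &x) {
  arma::vec y(x.n_elem);
  y(0) = x(0);
  for (arma::uword k = 1; k < x.n_elem; ++k) {
    y(k) = log_add_exp(y(k - 1), x(k));
  }
  return y;
}
```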
So, if we enter a vector $\mathbf x = [\log(x_1),\log(x_2),…,\log(x_K)]^\top$ into log_accu_exp, the first element of the returned vector will be $\log(e^{\log(x_1)}) = \log(x_1)$, the second element $\log(e^{\log(x_1)} + e^{\log(x_2)}) = \log(x_1 + x_2)$, and so on.
Now, we are ready to write our second (tentative) sampling function. Here we use the fact that if $U\sim \text{Uniform}(0,1)$, then for a constant $c>0$,

\[\Pr[ -\log(U) \le c] = 1 - \Pr[U \le \exp(-c)] = 1 - e^{-c},\]

which shows that $Z = -\log(U)$ is distributed as an $\text{Exponential}(1)$ random variable. To sample the log of a $\text{Uniform}(0,1)$ draw, we can therefore sample from the $\text{Exponential}(1)$ distribution and negate the result.
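A sketch of this tentative sampler (lsamp_temp expects properly normalized log-probabilities):

```cpp
// [[Rcpp::export]]
arma::uword lsamp_temp(const arma::vec &lpvec) {
  // order the outcomes so that the largest log-probabilities come first
  arma::uvec indx = arma::sort_index(lpvec, "descend");
  // cumulative probabilities on the log-scale
  arma::vec lqvec = log_accu_exp(lpvec(indx));
  // log(U) for U ~ Uniform(0, 1): minus an Exponential(1) draw
  double lu = -R::rexp(1.0);
  for (arma::uword k = 0; k < lqvec.n_elem; ++k) {
    if (lu <= lqvec(k)) return indx(k);
  }
  return indx(lqvec.n_elem - 1); // numerical safeguard
}
```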
Lastly, we adapt the function so that it can take log-probability vectors that are not properly normalized. Let $\mathbf p^{\ast} = [p_1^{\ast}, …, p_K^{\ast}]^\top$ be an unnormalized probability vector such that $\mathbf p = s^{-1}\mathbf p^{\ast}$ with $s = \sum_j p_j^{\ast}$. To sample according to $\mathbf p^{\ast}$, we might first normalize $\mathbf p^{\ast}$ and enter the normalized form into our sampling function. Here, we use an alternative approach and instead change the uniform draw from the interval $(0,1)$ to $(0, s)$. As we are working on the log-scale, however, this means that we have to sum the exponentials of the log-probabilities and then take the logarithm again to obtain $\log(s)$ (which, again, can lead to underflow if some of the probabilities are small), because
\[\log(s) = \log\left(\sum_{j=1}^K p_j^{\ast}\right) = \log\left(\sum_{j=1}^K \exp[\log(p_j^{\ast})]\right).\]

Fortunately, we have already calculated this quantity when accumulating the probabilities on the log-scale in the lqvec vector. That is, the last element of lqvec is log(exp(lpvec[1]) + ... + exp(lpvec[K])), which is exactly log(sum(exp(lpvec))), the normalizing quantity we need.
Also,

\[\Pr[-\log(sU)\le c] = 1 - \Pr[U \le e^{-(c+\log s )}] = 1 - e^{-(c+ \log s)}, \quad c > -\log s,\]

which is the distribution function of a $\text{shifted-Exponential}(1, -\log s)$ random variable. Thus, to sample uniformly from the interval $(0, s)$ on the log-scale, we might subtract from $\log s$ a draw from an $\text{Exponential}(1)$ distribution.
So, our final sampling function looks like the following:
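A sketch (the only change relative to lsamp_temp is the shifted uniform draw):

```cpp
// [[Rcpp::export]]
arma::uword lsamp_one(const arma::vec &lpvec) {
  arma::uvec indx = arma::sort_index(lpvec, "descend");
  arma::vec lqvec = log_accu_exp(lpvec(indx));
  // uniform draw from (0, s) on the log-scale: lqvec.max() equals log(s),
  // the log of the normalizing constant
  double lu = lqvec.max() - R::rexp(1.0);
  for (arma::uword k = 0; k < lqvec.n_elem; ++k) {
    if (lu <= lqvec(k)) return indx(k);
  }
  return indx(lqvec.n_elem - 1); // numerical safeguard
}
```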
Notice that this function will work for properly normalized probability vectors as well. In these situations, lqvec.max() would be equal to zero, and so we have exactly the same sampling function as lsamp_temp.
We might also write a version of the function that accepts unnormalized probability vectors (instead of log-probability vectors):
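A sketch on the natural scale:

```cpp
// [[Rcpp::export]]
arma::uword samp_one(const arma::vec &pvec) {
  arma::uvec indx = arma::sort_index(pvec, "descend");
  arma::vec qvec = arma::cumsum(pvec(indx));
  // uniform draw from (0, s), where s = qvec.max() is the sum of the
  // (possibly unnormalized) probabilities
  double u = R::runif(0.0, qvec.max());
  for (arma::uword k = 0; k < qvec.n_elem; ++k) {
    if (u <= qvec(k)) return indx(k);
  }
  return indx(qvec.n_elem - 1); // numerical safeguard
}
```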
A small Monte Carlo simulation suggests that the code works well:
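A simulation along these lines (the number of draws, n.sim, is illustrative, and pvec is assumed to hold the original probabilities, re-created since samp_arma overwrote it):

```r
n.sim <- 1e6
draws.int  <- sample.int(4, n.sim, replace = TRUE, prob = pvec) - 1
draws.one  <- replicate(n.sim, samp_one(pvec))
draws.lone <- replicate(n.sim, lsamp_one(log(pvec)))
rbind(norm.p    = pvec / sum(pvec),
      samp.int  = table(draws.int) / n.sim,
      samp_one  = table(draws.one) / n.sim,
      lsamp_one = table(draws.lone) / n.sim)
```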
## 0 1 2 3
## norm.p 0.1214495 0.3329164 0.1727188 0.3729152
## samp.int 0.1219100 0.3326480 0.1723480 0.3730940
## samp_one 0.1213660 0.3326760 0.1720160 0.3739420
## lsamp_one 0.1218680 0.3335720 0.1721560 0.3724040
The full C++ code used is reproduced below:
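A reconstruction of the file (the comments and exact formatting of the original are not preserved):

```cpp
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadilloExtensions/sample.h>
#include <algorithm>
#include <cmath>

// sampling via Rcpp::RcppArmadillo::sample -- overwrites pvec!
// [[Rcpp::export]]
arma::uword samp_arma(Rcpp::NumericVector &pvec) {
  Rcpp::IntegerVector opts = Rcpp::seq(0, pvec.size() - 1);
  return (arma::uword) Rcpp::RcppArmadillo::sample(opts, 1, true, pvec)[0];
}

// log(exp(x) + exp(y)) without overflow
double log_add_exp(const double x, const double y) {
  double m = std::max(x, y);
  return m + std::log1p(std::exp(std::min(x, y) - m));
}

// running log-sum-exp of a vector of log-values
arma::vec log_accu_exp(const arma::vec &x) {
  arma::vec y(x.n_elem);
  y(0) = x(0);
  for (arma::uword k = 1; k < x.n_elem; ++k) {
    y(k) = log_add_exp(y(k - 1), x(k));
  }
  return y;
}

// sample one index from a (possibly unnormalized) probability vector
// [[Rcpp::export]]
arma::uword samp_one(const arma::vec &pvec) {
  arma::uvec indx = arma::sort_index(pvec, "descend");
  arma::vec qvec = arma::cumsum(pvec(indx));
  double u = R::runif(0.0, qvec.max());
  for (arma::uword k = 0; k < qvec.n_elem; ++k) {
    if (u <= qvec(k)) return indx(k);
  }
  return indx(qvec.n_elem - 1);
}

// sample one index from (possibly unnormalized) log-probabilities
// [[Rcpp::export]]
arma::uword lsamp_one(const arma::vec &lpvec) {
  arma::uvec indx = arma::sort_index(lpvec, "descend");
  arma::vec lqvec = log_accu_exp(lpvec(indx));
  double lu = lqvec.max() - R::rexp(1.0);
  for (arma::uword k = 0; k < lqvec.n_elem; ++k) {
    if (lu <= lqvec(k)) return indx(k);
  }
  return indx(lqvec.n_elem - 1);
}
```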