Predicted Probabilities from Ordered Logit Models
DISCLAIMER: This is a blog post from my graduate school years. The blog is no longer maintained and the material might include typos and errors.
I was preparing materials for a lab session to the graduate students at NYU, when I realized that there is no straightforward way to calculate uncertainty estimates for predicted probabilities from ordered logit models in R. The predict
method for MASS::polr
objects, for example, returns the predicted probabilities but not the estimated standard errors (try using the se = TRUE
option; it won’t work.)
This created a problem. Due to traveling, I was not able to prepare in advance; so, I was sitting in front of my labtop at 2 am the day before class thinking about what to do. In the end, I coded up a short function that calculates confidence intervals for predictions of ordered logit models based on Monte Carlo simulation and added it to the appendix of my teaching materials. The model, code, and a baby Monte Carlo study to check whether the coverage rate of the calculated confidence intervals are close to the nominal rate, will be the main topic of the current post.
The Ordered Logit Model
The theory to calculate confidence intervals for predicted probabilities is quite straightforward. We assume that we have a latent variable $Y_i^\ast$, which is generated by the DGP
\[Y_i^\ast = \mathbf x_i'\boldsymbol\beta + \epsilon_i^\ast\]where $\epsilon_i^\ast \sim \text{Logistic}(0,1)$. Unfortunately, we don’t observe $Y_i^\ast$ but only a truncated version of it. The link between $Y_i^\ast$ and the observed but truncated outcome variable $Y_i$ is given as
\[Y_i = k\times \mathbb I_{(\tau_{k-1}, \tau_k]}(Y_i^\ast), \qquad k = 1,2,...,K\]where $K$ is the number of response categories, $\mathbb I_A(x) = 1$ if $x\in A$ and $0$ otherwise, and where $\boldsymbol\tau = [\tau_0, \tau_1,…,\tau_{K-1},\tau_{K}]’$ are threshold parameters with $\tau_0 = -\infty$ and $\tau_K= \infty$ so that only $\{\tau_1,…,\tau_{K-1}\}$ are estimands.
As the latent error term has a standard Logistic distribution, the probability of the event $\{Y_i = k\}$ is
\[\begin{aligned} \pi^{(k)}(\mathbf x_i) &= \Pr[Y_i = k \,\vert\, \mathbf X_i = \mathbf x_i] \\ &= \Pr[\mathbb I_{(\tau_{k-1}, \tau_k]}(Y_i^\ast) = 1 \,\vert\, \mathbf X_i = \mathbf x_i]\\ &= \Pr[\tau_{k-1} < Y_i^\ast \le \tau_k\,\vert\, \mathbf X_i = \mathbf x_i] \\ &= \Pr[\tau_{k-1}< \mathbf x_i'\boldsymbol\beta + \epsilon_i^\ast\le \tau_k]\\ &= \Pr[\tau_{k-1} - \mathbf x_i'\boldsymbol\beta < \epsilon_i^\ast \le \tau_{k} - \mathbf x_i'\boldsymbol\beta ]\\ &= \Lambda_{0,1}(\tau_k - \mathbf x_i'\boldsymbol\beta) - \Lambda_{0,1}(\tau_{k-1} - \mathbf x_i'\boldsymbol\beta) \\ &= \Lambda_{\mathbf x_i'\boldsymbol\beta, 1}(\tau_k) - \Lambda_{\mathbf x_i'\boldsymbol\beta, 1}(\tau_{k-1}), \end{aligned}\]where $\Lambda_{a, b}(\cdot)$ is the Logistic CDF with location and scale parameters, respectively, equal to $a$ and $b$. The likelihood of an iid sample of size $n$ under the model is
\[\mathcal L(\boldsymbol\theta) = \prod_{i=1}^n \prod_{k=1}^K\Big[\pi^{(k)}(\mathbf x_i)\Big]^{\mathbb I_{\{k\}}(y_i)}.\]Maximizing $\mathcal L(\boldsymbol\theta)$ with respect to $\boldsymbol\theta =[\boldsymbol\beta, \boldsymbol\tau]’$ gives us the MLE
\[\hat{\boldsymbol\theta}_\text{MLE} = \text{argmax}_{\boldsymbol\theta \in \boldsymbol\Theta}\mathcal L(\boldsymbol\theta)\]where $\boldsymbol\Theta$ is the parameter space. Under the usual regularity conditions, we can rely on the Central Limit Theorem to show that
\[\sqrt{n} (\hat{\boldsymbol\theta}_\text{MLE} - \boldsymbol\theta) \longrightarrow \mathbf Z\]in distribution as $n \rightarrow \infty$, where $\mathbf Z \sim \text{Normal}(\mathbf 0, \mathcal I (\boldsymbol\theta)^{-1})$ and where $\mathcal I(\boldsymbol\theta)$ is the Fisher Information. Hence, in large samples, we might approximate the distribution of $\hat{\boldsymbol\theta}_\text{MLE}$ with a $\text{Normal}(\boldsymbol\theta, n^{-1}\mathcal I(\hat{\boldsymbol\theta}_\text{MLE})^{-1})$ distribution.
Quantifying Uncertainty in Predicted Probabilities
After estimating the parameter $\hat{\boldsymbol\theta} = [\hat{\boldsymbol\beta}, \hat{\boldsymbol\tau}]’$ (we’ve dropped the MLE subscript for notational clarity), the predicted probabilities can be calculated straightforwardly as
\[\hat{\pi}(\mathbf x_i)^{(k)} = \Lambda_{\mathbf x_i'\hat{\boldsymbol\beta}, 1}(\hat{\tau}_k) - \Lambda_{\mathbf x_i' \hat{\boldsymbol\beta}, 1}(\hat{\tau}_{k-1})\]for $i = 1,…,n$ and $k = 1,…, K$, where we define, again, $\hat\tau_0 = -\infty$ and $\hat\tau_K = \infty$.
The uncertainty in these estimates might be calculated in several ways. Two often used methods are Monte Carlo simulations, following the dictum that everything of a distribution can be learned by simulating from it, and the delta method (also called the method of propagation of error).
Confidence Intervals via the Delta Method
The delta method relies on the first-order Taylor Expension at a the MLE to approximate the distribution of a function of the MLE. Here we are interested in the distribution of the predicted probability at a point $\mathbf X = \mathbf x_\ast$. As $\hat{\boldsymbol\theta}$ is a consistent estimator of $\boldsymbol\theta$, and the mapping $\boldsymbol\theta \mapsto \Lambda_{\mathbf x_\ast’\boldsymbol\beta}(\tau_k)$ is continuous, it follows from the Continuous Mapping Theorem that
\[\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast) = \Lambda_{\mathbf x_\ast'\hat{\boldsymbol\beta}}(\hat\tau_k) - \Lambda_{\mathbf x_\ast'\hat{\boldsymbol\beta}}(\hat\tau_{k-1})\quad \longrightarrow\quad \Lambda_{\mathbf x_\ast'\boldsymbol\beta}(\tau_k) - \Lambda_{\mathbf x_\ast'\boldsymbol\beta}(\tau_{k-1})=\pi(\boldsymbol\theta; \mathbf x_\ast)\]in probability as $n\rightarrow \infty$, where we have made the dependence of the predicted probabilities on the parameter vector explicit and dropped the superscript $\{(k)\}$ for clarity. Taylor-expanding $\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast)$ about $\boldsymbol\theta$, we obtain
\[\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast) = \pi(\boldsymbol\theta; \mathbf x_\ast) + \nabla \pi(\boldsymbol\theta; \mathbf x_\ast)'(\hat{\boldsymbol\theta} - \boldsymbol\theta) + R(\hat{\boldsymbol\theta}, \boldsymbol\theta)\]where $R(\hat{\boldsymbol\theta}, \boldsymbol\theta)$ is the remainder. As $\hat{\boldsymbol\theta} \rightarrow \boldsymbol\theta$ in probability, $R \rightarrow 0$ in probability as well; so, by Slutsky’s Theorem,
\[\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast) \longrightarrow \pi(\boldsymbol\theta; \mathbf x_\ast) + \nabla \pi(\boldsymbol\theta; \mathbf x_\ast)'(\hat{\boldsymbol\theta} - \boldsymbol\theta)\]in distribution as $n\rightarrow\infty$. But as $\sqrt{n}(\hat{\boldsymbol\theta} - \boldsymbol\theta)$ converges to a $\text{Normal}(0, \mathcal I(\boldsymbol\theta)^{-1})$ distribution, it follows that the asymptotic distribution $\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast)$ is
\[W \sim \text{Normal}\Big(\pi(\boldsymbol\theta; \mathbf x_\ast),\; \nabla \pi(\boldsymbol\theta; \mathbf x_\ast)' [\mathcal I(\boldsymbol\theta)^{-1}/n]\nabla \pi(\boldsymbol\theta; \mathbf x_\ast)\Big).\]The matrix sandwiched in the middle of the gradients is just the covariance matrix of the MLE, which can be estimated in the usual way. The gradient vector has the form
\[\nabla \pi(\boldsymbol\theta; \mathbf x_\ast) = \left[\frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \beta_1}, ..., \frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \beta_J}, \frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \tau_1},...,\frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \tau_{K-1}}\right]'\]where $J$ is the number of predictors (not including the constant; this model has no constant). Further,
\[\begin{aligned} \frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \beta_j} &= \big[ \lambda_{\mathbf x_\ast'\boldsymbol\beta, 1}(\tau_{k-1}) -\lambda_{\mathbf x_\ast'\boldsymbol\beta, 1}(\tau_k)\big] x_{\ast, j}, \qquad j = 1,2,...,J \\ \frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \tau_k} &= \lambda_{\mathbf x_\ast'\boldsymbol\beta, 1}(\tau_{k}), \\ \frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \tau_{k-1}} &= - \lambda_{\mathbf x_\ast'\boldsymbol\beta, 1}(\tau_{k-1}),\\ \end{aligned}\]where $\lambda_{a,b}$ is the Logistic pdf with location parameter $a$ and scale parameter $b$, and $x_{\ast, j}$ the $j$th element of the vector $\mathbf x_\ast$. Notice that all partial derivatives are continuous functions of $\boldsymbol\theta$. So we might rely on the Continuous Mapping Theorem once again, and use the MLE-based plug-in estimators to estimate these quantities.
The way to calculate delta method based standard errors, which can then be used to calculate confidence intervals, in R
is shows below:
pred_polr_delta = function(
m, # fitted model
cat, # outcome to predict
newdata) # data for which preds should be made
{
if (class(m) != "polr")
stop("fitted model is not of class polr")
if (
sum(
c("data.frame",
"data.table",
"matrix") %in% class(newdata)
) == 0
)
stop("newdata is not a data.table, data.frame, or matrix")
# get estimates
beta = coef(m)
V = vcov(m)
tau_a = c(-Inf, m$zeta, Inf)
if (!is.matrix(newdata))
X = as.matrix(newdata)
# linear predictor
xb = X %*% beta
# beta gradient
grad_beta = do.call(
"rbind",
lapply(
seq_along(xb),
function(w) {
(dlogis(tau_a[cat], xb[w], 1.0) -
dlogis(tau_a[cat + 1], xb[w], 1.0)) * X[w, ]
}
)
)
# tau gradient
if (cat == 1) {
grad_tau = cbind(
dlogis(tau_a[cat + 1], xb, 1.0), 0
)
} else if (cat == (length(tau_a) - 1)) {
grad_tau = cbind(
0,
-1.0 * dlogis(tau_a[cat], xb, 1.0)
)
} else {
grad_tau = cbind(
-1.0 * dlogis(tau_a[cat], xb, 1.0),
dlogis(tau_a[cat + 1], xb, 1.0)
)
}
# gradient
grad_xb = cbind(grad_beta, grad_tau)
# get est. s.e.
sqrt(
apply(grad_xb, 1, function(w) {
w %*% V %*% w
})
)
}
Confidence Intervals via Simulation
Another way to estimate confidence intervals is to draw samples from the large sample distribution of the MLE. To obtain 95% confidence intervals for our predictions, the procedure is as follows:
-
We know (using somewhat sloppy notation) that \(\hat{\boldsymbol \theta} \sim \text{Normal}(\boldsymbol\theta, n^{-1}\mathcal I(\hat{\boldsymbol\theta})^{-1})\) in large samples. So, we might draw $S$ simulation draws from this distribution, which we denote by $\boldsymbol\theta^{(1)}, \boldsymbol\theta^{(2)},…, \boldsymbol\theta^{(S)}$.
-
For each simulated $\boldsymbol\theta^{(s)}, s= 1,2,…,S$, we calculate the predicted probabilities at $\mathbf x_\ast$. The predicted logit for the $s$th simulation draw is
\[\nu_\ast^{(s)} = \mathbf x_\ast'\boldsymbol\beta^{(s)}\]and the predicted probability of outcome $k$ at $\mathbf x_\ast$
\[\hat{\pi}(\mathbf x_\ast)^{(k), (s)} = \Lambda_{\nu_\ast^{(s)}, 1}(\tau_k^{(s)}) - \Lambda_{\nu_\ast^{(s)}, 1}(\tau_{k-1}^{(s)})\] -
This will give us $S$ simulated values of $\hat\pi(\mathbf x_\ast)^{(k),(s)}, k = 1,2,…,K$. For each outcome $k$, we can then summarize uncertainty using these draws. For example, to obtain the 95% confidence intervals for the predicted probability of the $k$th outcome, we can calculate the $0.025$th and $0.975$th quantile of the distribution of $\hat\pi(\mathbf x_{\ast})^{(k),(s)}$ across the simulation draws $s = 1,2,…,S$.
The following function implements these steps:
sim_ci_pred_polr = function(
m, # the model
newdata, # data for predictions
level, # confidence level
n_sim = 1000, # no of simulation draws
return_fit = TRUE, # return fitted values
return_list = TRUE) # return list
{
if (!requireNamespace("abind"))
stop("Install package abind to use this function")
if (class(m) != "polr")
stop("fitted model is not of class polr")
# transform new_dat into matrix
X = as.matrix(newdata)
# get coefs
beta = coef(m)
n_beta = length(beta)
# get thresholds
tau = m$zeta
n_cat = length(tau) + 1
# get variance-covariance matrix
Sigma = vcov(m)
message(
paste0("Calculating ", 100 * level,
"% confidence intervals with ",
n_sim,
" simulation draws ... ")
)
# simulate coefficients from multivariate normal
sim_res = MASS::mvrnorm(
n_sim,
mu = c(beta, tau),
Sigma = Sigma)
# for each of simulated coefs, make predictions
pred = lapply(1:nrow(sim_res), function(w) {
# linear predictor
xb = X %*% sim_res[w, 1:n_beta]
aug_tau = c(-Inf, sim_res[w, (n_beta + 1):ncol(sim_res)], Inf)
P = matrix(NA, nrow = nrow(X), ncol = n_cat)
for (i in 2:length(aug_tau)) {
P[, i - 1] = plogis(aug_tau[i], xb) -
plogis(aug_tau[i - 1], xb)
}
return(P)
})
# transform to array
pred_arr = simplify2array(pred)
# get lower and upper bound probs
lo = 0.5 * (1.0 - level)
hi = 1.0 - lo
# calculate CIs
ci = apply(pred_arr, 1:2, quantile, probs = c(lo, hi))
if (return_fit) {
# linear predictor
xb = X %*% beta
aug_tau = c(-Inf, tau, Inf)
P = matrix(NA, nrow = nrow(X), ncol = n_cat)
for (i in 2:length(aug_tau)) {
P[, i - 1] = plogis(aug_tau[i], xb) -
plogis(aug_tau[i - 1], xb)
}
# add fit and re-arrange
res = aperm(
abind::abind(P, ci, along = 1),
c(2, 1, 3)
)
# add dimension names
dimnames(res) = list(
NULL,
c("fit", "lower", "upper"),
paste0("outcome_", 1:n_cat)
)
} else {
# re-arrange
res = aperm(ci, c(2, 1, 3))
# add dimension names
dimnames(res) = list(
NULL,
c("lower", "upper"),
paste0("outcome_", 1:n_cat)
)
}
if (!return_list)
return(res)
# transform back to list
res_list = lapply(
seq.int(dim(res)[3]), function(w) res[ , , w]
)
# add names
names(res_list) = dimnames(res)[[3]]
return(res_list)
}
A Short Simulation
We might compare the coverage of the estimated 95% confidence interval with the nominal level. First, we load the needed packages and set up a cluster for parallel computing:
library(doParallel)
library(doRNG)
registerDoParallel(cores = 4)
Next, we set the seed, set up the “true” parameter values, and generate the fixed part of the dataset:
set.seed(12562)
# obs
n = 3000
# coefficients
true_beta = c(.8, .25)
true_tau = c(-.5, 2)
# predictors (dummy & continuous)
x1 = sample(c(0, 1), n, replace = T)
x2 = runif(n, -2, 2)
# linear predictor
xb = cbind(x1, x2) %*% true_beta
# data to make predictions for
pred_dat = data.frame(x1 = 1, x2 = c(-1, 0, 1))
Lastly, we run the simulation. We start with the simulation method:
res = foreach(ii = 1:1000) %dorng% {
# generate latent outcome
ystar = xb + rlogis(n)
# generate obseved outcome
y = ifelse(
ystar < true_tau[1], 1,
ifelse(ystar >= true_tau[1] & ystar < true_tau[2],
2, 3)
)
# fit model
fit = MASS::polr(factor(y) ~ x1 + x2, Hess = T)
# calculate 95% CI
s = sim_ci_pred_polr(
m = fit,
newdata = pred_dat,
level = .95,
n_sim = 1000,
return_fit = FALSE,
return_list = TRUE)
# return predictions
cbind(do.call(rbind, s),
x = rep(c(-1, 0, 1), 3),
outcome = rep(1:3, each = 3)
)
}
# true predicted probabilities
true_xb = cbind(1, c(-1, 0, 1)) %*% true_beta
true_p = c(
plogis(true_tau[1], true_xb),
plogis(true_tau[2], true_xb) - plogis(true_tau[1], true_xb),
plogis(true_tau[2], true_xb, lower.tail = F)
)
# check whether CIs contain true parameter
in_range = do.call(
rbind,
lapply(res, function(w) true_p > w[,1] & true_p < w[,2])
)
# coverage rate
cov_rate = matrix(colMeans(in_range), nrow = 3, byrow = T)
rownames(cov_rate) = paste0("x2 = ", c(-1, 0, 1))
colnames(cov_rate) = paste0("outcome = ", 1:3)
Next, we check the confidence intervals created with the delta method:
res_delta = foreach(ii = 1:1000) %dorng% {
# generate latent outcome
ystar = xb + rlogis(n)
# generate obseved outcome
y = ifelse(
ystar < true_tau[1], 1,
ifelse(ystar >= true_tau[1] & ystar < true_tau[2],
2, 3)
)
# fit model
fit = MASS::polr(factor(y) ~ x1 + x2,
Hess = T)
# calculate 95% CI
se = do.call(
`c`,
lapply(
1:3,
function(cat) pred_polr_delta(
m = fit,
cat = cat,
newdata = pred_dat
)
)
)
# point estimates
p = c(predict(fit, newdata = pred_dat, type = "p"))
# return intervals
z = -1.0 * qnorm(0.025)
cbind(p - z * se, p + z * se)
}
# resolve cluster
stopImplicitCluster()
# check whether CIs contain true parameter
in_range_delta = do.call(
rbind,
lapply(
res_delta,
function(w) true_p > w[,1] & true_p < w[,2])
)
# coverage rate
cov_rate_delta = matrix(
colMeans(in_range_delta), nrow = 3, byrow = T
)
rownames(cov_rate_delta) = paste0("x2 = ", c(-1, 0, 1))
colnames(cov_rate_delta) = paste0("outcome = ", 1:3)
The results of the simulation are as follows:
# print results (simulation)
print(cov_rate)
## outcome = 1 outcome = 2 outcome = 3
## x2 = -1 0.958 0.961 0.949
## x2 = 0 0.952 0.943 0.941
## x2 = 1 0.944 0.945 0.953
# print results (delta method)
print(cov_rate_delta)
## outcome = 1 outcome = 2 outcome = 3
## x2 = -1 0.958 0.962 0.957
## x2 = 0 0.951 0.954 0.945
## x2 = 1 0.942 0.936 0.958
It’s hard to be certain with only a handful of simulation runs. But the coverage rate seems to be quite close to the nominal level for both methods.