Predicted Probabilities from Ordered Logit Models
DISCLAIMER: This is a blog post from my graduate school years. The blog is no longer maintained and the material might include typos and errors.
I was preparing materials for a lab session for graduate students at NYU when I realized that there is no straightforward way to calculate uncertainty estimates for predicted probabilities from ordered logit models in R. The predict method for MASS::polr objects, for example, returns the predicted probabilities but not the estimated standard errors (try using the se = TRUE option; it won't work).
This created a problem. Due to traveling, I had not been able to prepare in advance; so, I was sitting in front of my laptop at 2 am the day before class, thinking about what to do. In the end, I coded up a short function that calculates confidence intervals for predictions from ordered logit models based on Monte Carlo simulation and added it to the appendix of my teaching materials. The model, the code, and a baby Monte Carlo study checking whether the coverage rate of the calculated confidence intervals is close to the nominal rate are the topic of this post.
The Ordered Logit Model
The theory behind calculating confidence intervals for predicted probabilities is quite straightforward. We assume that we have a latent variable $Y_i^\ast$, which is generated by the DGP
\[Y_i^\ast = \mathbf x_i'\boldsymbol\beta + \epsilon_i^\ast\]where $\epsilon_i^\ast \sim \text{Logistic}(0,1)$. Unfortunately, we don’t observe $Y_i^\ast$ but only a truncated version of it. The link between $Y_i^\ast$ and the observed but truncated outcome variable $Y_i$ is given as
\[Y_i = k\times \mathbb I_{(\tau_{k-1}, \tau_k]}(Y_i^\ast), \qquad k = 1,2,...,K\]where $K$ is the number of response categories, $\mathbb I_A(x) = 1$ if $x\in A$ and $0$ otherwise, and where $\boldsymbol\tau = [\tau_0, \tau_1,…,\tau_{K-1},\tau_{K}]’$ are threshold parameters with $\tau_0 = -\infty$ and $\tau_K= \infty$ so that only $\{\tau_1,…,\tau_{K-1}\}$ are estimands.
As the latent error term has a standard Logistic distribution, the probability of the event $\{Y_i = k\}$ is
\[\begin{aligned} \pi^{(k)}(\mathbf x_i) &= \Pr[Y_i = k \,\vert\, \mathbf X_i = \mathbf x_i] \\ &= \Pr[\mathbb I_{(\tau_{k-1}, \tau_k]}(Y_i^\ast) = 1 \,\vert\, \mathbf X_i = \mathbf x_i]\\ &= \Pr[\tau_{k-1} < Y_i^\ast \le \tau_k\,\vert\, \mathbf X_i = \mathbf x_i] \\ &= \Pr[\tau_{k-1}< \mathbf x_i'\boldsymbol\beta + \epsilon_i^\ast\le \tau_k]\\ &= \Pr[\tau_{k-1} - \mathbf x_i'\boldsymbol\beta < \epsilon_i^\ast \le \tau_{k} - \mathbf x_i'\boldsymbol\beta ]\\ &= \Lambda_{0,1}(\tau_k - \mathbf x_i'\boldsymbol\beta) - \Lambda_{0,1}(\tau_{k-1} - \mathbf x_i'\boldsymbol\beta) \\ &= \Lambda_{\mathbf x_i'\boldsymbol\beta, 1}(\tau_k) - \Lambda_{\mathbf x_i'\boldsymbol\beta, 1}(\tau_{k-1}), \end{aligned}\]where $\Lambda_{a, b}(\cdot)$ is the Logistic CDF with location and scale parameters, respectively, equal to $a$ and $b$. The likelihood of an iid sample of size $n$ under the model is
\[\mathcal L(\boldsymbol\theta) = \prod_{i=1}^n \prod_{k=1}^K\Big[\pi^{(k)}(\mathbf x_i)\Big]^{\mathbb I_{\{k\}}(y_i)}.\]Maximizing $\mathcal L(\boldsymbol\theta)$ with respect to $\boldsymbol\theta =[\boldsymbol\beta, \boldsymbol\tau]’$ gives us the MLE
\[\hat{\boldsymbol\theta}_\text{MLE} = \text{argmax}_{\boldsymbol\theta \in \boldsymbol\Theta}\mathcal L(\boldsymbol\theta)\]where $\boldsymbol\Theta$ is the parameter space. Under the usual regularity conditions, we can rely on the Central Limit Theorem to show that
\[\sqrt{n} (\hat{\boldsymbol\theta}_\text{MLE} - \boldsymbol\theta) \longrightarrow \mathbf Z\]in distribution as $n \rightarrow \infty$, where $\mathbf Z \sim \text{Normal}(\mathbf 0, \mathcal I (\boldsymbol\theta)^{-1})$ and where $\mathcal I(\boldsymbol\theta)$ is the Fisher Information. Hence, in large samples, we might approximate the distribution of $\hat{\boldsymbol\theta}_\text{MLE}$ with a $\text{Normal}(\boldsymbol\theta, n^{-1}\mathcal I(\hat{\boldsymbol\theta}_\text{MLE})^{-1})$ distribution.
Quantifying Uncertainty in Predicted Probabilities
After estimating the parameter $\hat{\boldsymbol\theta} = [\hat{\boldsymbol\beta}, \hat{\boldsymbol\tau}]’$ (we’ve dropped the MLE subscript for notational clarity), the predicted probabilities can be calculated straightforwardly as
\[\hat{\pi}(\mathbf x_i)^{(k)} = \Lambda_{\mathbf x_i'\hat{\boldsymbol\beta}, 1}(\hat{\tau}_k) - \Lambda_{\mathbf x_i' \hat{\boldsymbol\beta}, 1}(\hat{\tau}_{k-1})\]for $i = 1,…,n$ and $k = 1,…, K$, where we define, again, $\hat\tau_0 = -\infty$ and $\hat\tau_K = \infty$.
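To make this concrete, here is a minimal sketch (with simulated data that is not from the original post) that computes the predicted probabilities by hand from a MASS::polr fit and checks them against the predict method:

```r
## Minimal sketch with made-up data: compute predicted probabilities "by hand"
## from a polr fit and compare them with predict(). The data-generating values
## below are illustrative assumptions, not those used in the original post.
library(MASS)

set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
ystar <- 0.5 * x1 - 1.0 * x2 + rlogis(n)              # latent variable
y  <- cut(ystar, c(-Inf, -1, 1, Inf), labels = 1:3, ordered_result = TRUE)
dat <- data.frame(y, x1, x2)

fit <- polr(y ~ x1 + x2, data = dat, Hess = TRUE)     # Hess = TRUE stores the Hessian

x_star <- c(x1 = 0, x2 = 1)                           # point at which to predict
eta    <- sum(coef(fit) * x_star)                     # linear predictor x'beta
tau    <- c(-Inf, fit$zeta, Inf)                      # thresholds padded with -Inf and Inf
p_hand <- plogis(tau[-1] - eta) - plogis(tau[-length(tau)] - eta)

p_pred <- predict(fit, newdata = data.frame(t(x_star)), type = "probs")
rbind(by_hand = unname(p_hand), predict = p_pred)     # the two should agree
```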
The uncertainty in these estimates might be calculated in several ways. Two commonly used methods are Monte Carlo simulation, following the dictum that everything about a distribution can be learned by simulating from it, and the delta method (also called the method of propagation of error).
Confidence Intervals via the Delta Method
The delta method relies on a first-order Taylor expansion at the MLE to approximate the distribution of a function of the MLE. Here we are interested in the distribution of the predicted probability at a point $\mathbf X = \mathbf x_\ast$. As $\hat{\boldsymbol\theta}$ is a consistent estimator of $\boldsymbol\theta$, and the mapping $\boldsymbol\theta \mapsto \Lambda_{\mathbf x_\ast’\boldsymbol\beta}(\tau_k)$ is continuous, it follows from the Continuous Mapping Theorem that
\[\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast) = \Lambda_{\mathbf x_\ast'\hat{\boldsymbol\beta}}(\hat\tau_k) - \Lambda_{\mathbf x_\ast'\hat{\boldsymbol\beta}}(\hat\tau_{k-1})\quad \longrightarrow\quad \Lambda_{\mathbf x_\ast'\boldsymbol\beta}(\tau_k) - \Lambda_{\mathbf x_\ast'\boldsymbol\beta}(\tau_{k-1})=\pi(\boldsymbol\theta; \mathbf x_\ast)\]in probability as $n\rightarrow \infty$, where we have made the dependence of the predicted probabilities on the parameter vector explicit and dropped the superscript $\{(k)\}$ for clarity. Taylor-expanding $\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast)$ about $\boldsymbol\theta$, we obtain
\[\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast) = \pi(\boldsymbol\theta; \mathbf x_\ast) + \nabla \pi(\boldsymbol\theta; \mathbf x_\ast)'(\hat{\boldsymbol\theta} - \boldsymbol\theta) + R(\hat{\boldsymbol\theta}, \boldsymbol\theta)\]where $R(\hat{\boldsymbol\theta}, \boldsymbol\theta)$ is the remainder. As $\hat{\boldsymbol\theta} \rightarrow \boldsymbol\theta$ in probability, $R \rightarrow 0$ in probability as well; so, by Slutsky’s Theorem,
\[\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast) \longrightarrow \pi(\boldsymbol\theta; \mathbf x_\ast) + \nabla \pi(\boldsymbol\theta; \mathbf x_\ast)'(\hat{\boldsymbol\theta} - \boldsymbol\theta)\]in distribution as $n\rightarrow\infty$. But as $\sqrt{n}(\hat{\boldsymbol\theta} - \boldsymbol\theta)$ converges to a $\text{Normal}(0, \mathcal I(\boldsymbol\theta)^{-1})$ distribution, it follows that the asymptotic distribution $\hat\pi(\hat{\boldsymbol\theta}; \mathbf x_\ast)$ is
\[W \sim \text{Normal}\Big(\pi(\boldsymbol\theta; \mathbf x_\ast),\; \nabla \pi(\boldsymbol\theta; \mathbf x_\ast)' [\mathcal I(\boldsymbol\theta)^{-1}/n]\nabla \pi(\boldsymbol\theta; \mathbf x_\ast)\Big).\]The matrix sandwiched in the middle of the gradients is just the covariance matrix of the MLE, which can be estimated in the usual way. The gradient vector has the form
\[\nabla \pi(\boldsymbol\theta; \mathbf x_\ast) = \left[\frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \beta_1}, ..., \frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \beta_J}, \frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \tau_1},...,\frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \tau_{K-1}}\right]'\]where $J$ is the number of predictors (not including the constant; this model has no constant). Further,
\[\begin{aligned} \frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \beta_j} &= \big[ \lambda_{\mathbf x_\ast'\boldsymbol\beta, 1}(\tau_{k-1}) -\lambda_{\mathbf x_\ast'\boldsymbol\beta, 1}(\tau_k)\big] x_{\ast, j}, \qquad j = 1,2,...,J \\ \frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \tau_k} &= \lambda_{\mathbf x_\ast'\boldsymbol\beta, 1}(\tau_{k}), \\ \frac{\partial \pi(\boldsymbol\theta, \mathbf x_\ast)}{\partial \tau_{k-1}} &= - \lambda_{\mathbf x_\ast'\boldsymbol\beta, 1}(\tau_{k-1}),\\ \end{aligned}\]where $\lambda_{a,b}$ is the Logistic pdf with location parameter $a$ and scale parameter $b$, and $x_{\ast, j}$ the $j$th element of the vector $\mathbf x_\ast$. Notice that all partial derivatives are continuous functions of $\boldsymbol\theta$. So we might rely on the Continuous Mapping Theorem once again, and use the MLE-based plug-in estimators to estimate these quantities.
The way to calculate delta-method-based standard errors in R, which can then be used to calculate confidence intervals, is shown below:
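Here is a minimal sketch of how this might look. The function name delta_ci and its interface are mine rather than the code from the original post; it assumes a model fit with MASS::polr and Hess = TRUE, so that vcov() returns the estimated covariance matrix of the slopes followed by the thresholds.

```r
## Sketch of delta-method standard errors and Wald-type confidence intervals
## for predicted probabilities from a MASS::polr fit. Names and defaults are
## illustrative assumptions, not the original code.
delta_ci <- function(fit, x_star, level = 0.95) {
  beta <- coef(fit)                      # slope estimates
  tau  <- fit$zeta                       # estimated thresholds
  K    <- length(tau) + 1L               # number of outcome categories
  J    <- length(beta)                   # number of predictors
  V    <- vcov(fit)                      # covariance of (beta, tau), in that order
  eta  <- sum(beta * x_star)             # linear predictor x'beta

  tau_ext <- c(-Inf, tau, Inf)
  probs   <- plogis(tau_ext[-1] - eta) - plogis(tau_ext[-(K + 1)] - eta)

  se <- numeric(K)
  for (k in 1:K) {
    lo <- tau_ext[k]; hi <- tau_ext[k + 1]
    grad <- numeric(J + K - 1)
    ## d pi / d beta_j = [lambda(tau_{k-1} - eta) - lambda(tau_k - eta)] * x_j
    grad[1:J] <- (dlogis(lo - eta) - dlogis(hi - eta)) * x_star
    ## d pi / d tau_k = lambda(tau_k - eta);  d pi / d tau_{k-1} = -lambda(tau_{k-1} - eta)
    if (k < K) grad[J + k]     <- dlogis(hi - eta)
    if (k > 1) grad[J + k - 1] <- grad[J + k - 1] - dlogis(lo - eta)
    se[k] <- sqrt(drop(t(grad) %*% V %*% grad))
  }

  z <- qnorm(1 - (1 - level) / 2)
  cbind(estimate = probs, lower = probs - z * se, upper = probs + z * se)
}
```

With the fit from the earlier snippet, delta_ci(fit, c(x1 = 0, x2 = 1)) returns the point estimate together with Wald-type lower and upper bounds for each of the $K$ outcome categories.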
Confidence Intervals via Simulation
Another way to estimate confidence intervals is to draw samples from the large sample distribution of the MLE. To obtain 95% confidence intervals for our predictions, the procedure is as follows:
- We know (using somewhat sloppy notation) that \(\hat{\boldsymbol \theta} \sim \text{Normal}(\boldsymbol\theta, n^{-1}\mathcal I(\hat{\boldsymbol\theta})^{-1})\) in large samples. So, we might draw $S$ simulation draws from this distribution, which we denote by $\boldsymbol\theta^{(1)}, \boldsymbol\theta^{(2)},…, \boldsymbol\theta^{(S)}$.
- For each simulated $\boldsymbol\theta^{(s)}, s= 1,2,…,S$, we calculate the predicted probabilities at $\mathbf x_\ast$. The predicted logit for the $s$th simulation draw is
\[\nu_\ast^{(s)} = \mathbf x_\ast'\boldsymbol\beta^{(s)}\]and the predicted probability of outcome $k$ at $\mathbf x_\ast$ is
\[\hat{\pi}(\mathbf x_\ast)^{(k), (s)} = \Lambda_{\nu_\ast^{(s)}, 1}(\tau_k^{(s)}) - \Lambda_{\nu_\ast^{(s)}, 1}(\tau_{k-1}^{(s)})\]
- This will give us $S$ simulated values of $\hat\pi(\mathbf x_\ast)^{(k),(s)}, k = 1,2,…,K$. For each outcome $k$, we can then summarize uncertainty using these draws. For example, to obtain the 95% confidence intervals for the predicted probability of the $k$th outcome, we can calculate the $0.025$th and $0.975$th quantile of the distribution of $\hat\pi(\mathbf x_{\ast})^{(k),(s)}$ across the simulation draws $s = 1,2,…,S$.
The following function implements these steps:
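A sketch along these lines is given below; the function name sim_ci, the default number of draws, and the use of MASS::mvrnorm for the multivariate normal draws are my choices rather than the original code.

```r
## Sketch of simulation-based confidence intervals for predicted probabilities
## from a MASS::polr fit (names and defaults are illustrative assumptions).
sim_ci <- function(fit, x_star, S = 1000, level = 0.95) {
  theta_hat <- c(coef(fit), fit$zeta)            # MLE of (beta, tau)
  V         <- vcov(fit)                         # estimated covariance matrix
  J         <- length(coef(fit))                 # number of predictors
  K         <- length(fit$zeta) + 1L             # number of outcome categories

  ## Draw S parameter vectors from the approximate sampling distribution of the MLE.
  draws <- MASS::mvrnorm(S, mu = theta_hat, Sigma = V)

  ## For each draw, compute the K predicted probabilities at x_star.
  p_draws <- t(apply(draws, 1, function(theta) {
    eta <- sum(theta[1:J] * x_star)
    tau <- c(-Inf, theta[-(1:J)], Inf)
    plogis(tau[-1] - eta) - plogis(tau[-(K + 1)] - eta)
  }))                                            # S x K matrix

  ## Quantile-based confidence bounds for each outcome category.
  alpha <- 1 - level
  t(apply(p_draws, 2, quantile, probs = c(alpha / 2, 1 - alpha / 2)))
}
```

For example, sim_ci(fit, c(x1 = 0, x2 = 1)) returns, for each outcome category, the simulation-based lower and upper confidence bounds.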
A Short Simulation
We might compare the coverage of the estimated 95% confidence interval with the nominal level. First, we load the needed packages and set up a cluster for parallel computing:
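For instance, something like the following (the exact packages and cluster size used in the original are not reproduced here; parallel ships with base R):

```r
## Illustrative setup; the original packages and cluster size are unknown.
library(MASS)        # polr() and mvrnorm()
library(parallel)    # PSOCK cluster for the Monte Carlo replications

n_cores <- max(1L, detectCores() - 1L)        # leave one core free
cl <- makeCluster(n_cores)
invisible(clusterEvalQ(cl, library(MASS)))    # load MASS on every worker
```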
Next, we set the seed, set up the “true” parameter values, and generate the fixed part of the dataset:
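For example (the true slopes and thresholds below are placeholder values, not those from the original post; the prediction points use the three x2 values that appear in the results further down):

```r
## "True" parameters and fixed design (values are illustrative assumptions).
set.seed(42)
n    <- 1000
beta <- c(0.5, -1.0)                        # true slopes for x1 and x2
tau  <- c(-1.0, 1.0)                        # true thresholds => K = 3 outcomes

x1 <- rnorm(n)                              # fixed covariates, reused in every replication
x2 <- sample(c(-1, 0, 1), n, replace = TRUE)
X  <- cbind(x1, x2)

x_new <- cbind(x1 = 0, x2 = c(-1, 0, 1))    # points at which coverage is evaluated
```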
Lastly, we run the simulation. We start with the simulation method:
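The coverage loop might look roughly as follows, reusing the sim_ci() sketch from above; the replication count of 1,000 is an assumption:

```r
## True predicted probabilities at the three evaluation points (K x 3 matrix).
true_probs <- sapply(1:3, function(r) {
  eta <- sum(beta * x_new[r, ])
  plogis(c(tau, Inf) - eta) - plogis(c(-Inf, tau) - eta)
})

## One Monte Carlo replication: simulate outcomes, refit the model, and record
## whether the simulation-based intervals cover the true probabilities.
one_rep <- function(i) {
  ystar <- drop(X %*% beta) + rlogis(n)
  y   <- cut(ystar, c(-Inf, tau, Inf), labels = 1:3, ordered_result = TRUE)
  dat <- data.frame(y = y, x1 = X[, 1], x2 = X[, 2])
  fit <- polr(y ~ x1 + x2, data = dat, Hess = TRUE)
  sapply(1:3, function(r) {
    ci <- sim_ci(fit, x_new[r, ], S = 1000)
    ci[, 1] <= true_probs[, r] & true_probs[, r] <= ci[, 2]
  })
}

clusterSetRNGStream(cl, 42)                 # reproducible parallel RNG
clusterExport(cl, c("X", "beta", "tau", "n", "x_new", "true_probs", "sim_ci"))
res_sim <- parLapply(cl, 1:1000, one_rep)
coverage_sim <- Reduce(`+`, res_sim) / length(res_sim)   # coverage rates by outcome and point
```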
Next, we check the confidence intervals created with the delta method:
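The delta-method check follows the same pattern, swapping in the delta_ci() sketch:

```r
## Same replication structure as above, with delta-method intervals instead.
one_rep_delta <- function(i) {
  ystar <- drop(X %*% beta) + rlogis(n)
  y   <- cut(ystar, c(-Inf, tau, Inf), labels = 1:3, ordered_result = TRUE)
  dat <- data.frame(y = y, x1 = X[, 1], x2 = X[, 2])
  fit <- polr(y ~ x1 + x2, data = dat, Hess = TRUE)
  sapply(1:3, function(r) {
    ci <- delta_ci(fit, x_new[r, ])
    ci[, "lower"] <= true_probs[, r] & true_probs[, r] <= ci[, "upper"]
  })
}

clusterExport(cl, "delta_ci")
res_delta <- parLapply(cl, 1:1000, one_rep_delta)
coverage_delta <- Reduce(`+`, res_delta) / length(res_delta)

stopCluster(cl)   # shut down the workers
```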
The results of the simulation are as follows:
```
##          outcome = 1  outcome = 2  outcome = 3
## x2 = -1        0.958        0.961        0.949
## x2 =  0        0.952        0.943        0.941
## x2 =  1        0.944        0.945        0.953

##          outcome = 1  outcome = 2  outcome = 3
## x2 = -1        0.958        0.962        0.957
## x2 =  0        0.951        0.954        0.945
## x2 =  1        0.942        0.936        0.958
```
It’s hard to be certain with only a handful of simulation runs. But the coverage rate seems to be quite close to the nominal level for both methods.