Margin call: Is odds ratio really that good?

2023-05-11T00:00:00-05:00

It has been commonly argued in sociology that the odds ratio, as a measure of association between categorical variables, is appealing because of its “margin-free” property. This has always baffled me, so I decided to articulate my bafflement here as a sanity check.

To begin with, here’s my understanding of what people are really trying to achieve when they use a “margin-free” measure. The idea is to separate absolute/marginal/uniform/across-the-board changes in the marginal distribution from (relative) changes in the disparity/association (e.g., Mare, 1981). In other words, people want a measure of disparities/associations that’s impervious to marginal changes. And odds ratio is such a measure in terms of cell or joint probabilities.

Consider a simple two-by-two table, where the dependent variable is a binary indicator of college graduation, and the independent variable is a binary SES indicator. Then the odds ratio is defined as

\begin{equation} \left. \frac{Pr(college=1 \mid SES=high)}{Pr(college=0 \mid SES=high)} \middle/ \frac{Pr(college=1 \mid SES=low)}{Pr(college=0 \mid SES=low)} \right., \end{equation} which can be written as \begin{equation} \left. \frac{Pr(college=1, SES=high)}{Pr(college=0, SES=high)} \middle/ \frac{Pr(college=1, SES=low)}{Pr(college=0, SES=low)} \right.. \end{equation}
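The step between the two expressions can be spelled out as follows, where $s$ is a placeholder I introduce for either SES group (high or low). Each conditional probability is the corresponding joint probability divided by $Pr(SES=s)$, and that marginal cancels within each odds:

$$ \frac{Pr(college=1 \mid SES=s)}{Pr(college=0 \mid SES=s)} = \frac{Pr(college=1, SES=s)/Pr(SES=s)}{Pr(college=0, SES=s)/Pr(SES=s)} = \frac{Pr(college=1, SES=s)}{Pr(college=0, SES=s)}. $$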

So Expression (1) is in terms of conditional probabilities, and Expression (2) is in terms of joint probabilities (or equivalently, cell probabilities, referring to the cells of the 2-by-2 table). The odds ratio is margin-free in the sense that if we multiply both $Pr(college=1, SES=high)$ and $Pr(college=1, SES=low)$ by a number $u$, and correspondingly multiply $Pr(college=0, SES=high)$ and $Pr(college=0, SES=low)$ by $v=\frac{1-u[Pr(college=1, SES=high)+Pr(college=1, SES=low)]}{Pr(college=0, SES=high) + Pr(college=0, SES=low)}$, then the odds ratio remains unchanged (the value of $v$ is chosen so that the four joint probabilities still sum to 1). Note that this logic only works on Expression (2), not Expression (1), because of the constraints that $Pr(college=1 \mid SES=high)+Pr(college=0 \mid SES=high)=1$ and $Pr(college=1 \mid SES=low)+Pr(college=0 \mid SES=low)=1$.
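As a quick numerical check of this invariance, here is a minimal Python sketch. The cell probabilities and the value of $u$ are made up for illustration, and the function names are my own.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio from the four joint (cell) probabilities.

    a = Pr(college=1, SES=high), b = Pr(college=0, SES=high),
    c = Pr(college=1, SES=low),  d = Pr(college=0, SES=low).
    """
    return (a / b) / (c / d)


def rescale(a, b, c, d, u):
    """Multiply the college=1 cells by u and the college=0 cells by v,
    where v is chosen so that the four cells still sum to 1."""
    v = (1 - u * (a + c)) / (b + d)
    if v <= 0:
        raise ValueError("u is too large: the rescaled cells are not probabilities")
    return u * a, v * b, u * c, v * d


# Made-up cell probabilities, for illustration only.
cells = (0.30, 0.20, 0.10, 0.40)
rescaled = rescale(*cells, u=1.5)

# The two odds ratios agree up to floating-point rounding.
print(odds_ratio(*cells), odds_ratio(*rescaled))
# The rescaled cells still sum to 1 up to rounding.
print(sum(rescaled))
```

The check in `rescale` is only there to reject values of $u$ that would make $v$ non-positive.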

And this is why I find the margin-free property of odds ratio not that appealing. To me, joint probabilities and changes defined in terms of them are substantively not as natural and relevant as conditional probabilities and the corresponding changes. I think conditional probabilities are more natural because I tend to think about them when I think about stratification. For example, it feels right to talk about the probability of college graduation and how it varies by SES, not the probability of being in a cell jointly defined by graduation status and SES. I think this intuition also works well with how statistical models are set up in practice: we often model the conditional probability as, for example, $Pr(college=1 \mid SES)= invlogit(\alpha+\beta SES)$, but we never model a joint probability.

In fact, once we focus on conditional probabilities, both risk difference and risk ratio seem to be measures that are more naturally “margin free”. In particular, if there had been an across-the-board additive expansion in education (adding the same number to $Pr(college=1 \mid SES=high)$ and $Pr(college=1 \mid SES=low)$), the risk difference, \begin{equation} Pr(college=1 \mid SES=high)-Pr(college=1 \mid SES=low), \end{equation} wouldn’t change. And if there had been an across-the-board multiplicative expansion in education (multiplying $Pr(college=1 \mid SES=high)$ and $Pr(college=1 \mid SES=low)$ by the same number), the risk ratio, \begin{equation} \frac{Pr(college=1 \mid SES=high)}{Pr(college=1 \mid SES=low)}, \end{equation} wouldn’t change. These changes are marginal in the sense that they happen uniformly to the two SES groups, and risk difference and risk ratio are each impervious to one of these changes. In contrast, the odds ratio in general does not remain the same under either of these changes.
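The same point can be checked numerically. The sketch below uses made-up graduation probabilities and made-up shift sizes (0.1 additive, 1.2 multiplicative); the function names are my own.

```python
import math


def risk_difference(p_high, p_low):
    return p_high - p_low


def risk_ratio(p_high, p_low):
    return p_high / p_low


def conditional_odds_ratio(p_high, p_low):
    """Odds ratio written in terms of the two conditional probabilities."""
    return (p_high / (1 - p_high)) / (p_low / (1 - p_low))


# Made-up values of Pr(college=1 | SES=high) and Pr(college=1 | SES=low).
p_high, p_low = 0.6, 0.2

scenarios = {
    "baseline": (p_high, p_low),
    "additive (+0.1)": (p_high + 0.1, p_low + 0.1),
    "multiplicative (x1.2)": (p_high * 1.2, p_low * 1.2),
}

for label, (ph, pl) in scenarios.items():
    print(
        f"{label:>22}: "
        f"RD={risk_difference(ph, pl):.3f}  "
        f"RR={risk_ratio(ph, pl):.3f}  "
        f"OR={conditional_odds_ratio(ph, pl):.3f}"
    )
```

In this example the risk difference is the same in the baseline and additive rows, the risk ratio is the same in the baseline and multiplicative rows, and the odds ratio differs across all three.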

Horvitz–Thompson and Weighted Least Squares

2023-01-30T00:00:00-06:00

Inverse probability weighting (IPW) is a popular tool for estimating the average treatment effect (ATE) of a binary variable under the conditional ignorability assumption. There are multiple variants of IPW. In particular, I used to be intrigued by the relationship between the so-called Horvitz–Thompson (HT) estimator and the Weighted Least Squares (WLS) estimator, both of which implement IPW. Both estimators are introduced in popular causal inference textbooks. For example, Angrist and Pischke (2009, p.82) focus on the HT estimator, while Morgan and Winship (2014, pp.228-9) focus on the WLS estimator. However, I haven’t seen them introduced together in these textbooks, and their relationship could seem a little unclear. In fact, a quick simulation would reveal that the classic version of the HT estimator generally gives a different ATE estimate from that of the WLS estimator in finite samples (although asymptotically they both converge to the true ATE).

As it turns out, the WLS estimator exactly coincides with a “stabilized” version of the HT estimator. Let $Y$ be the outcome, $T$ be a binary treatment, and $\pi=E(T \mid X)$ be the propensity score defined in terms of covariates $X$. Then the stabilized HT estimator for ATE is

$$ \frac{ \sum \frac{T_i}{\pi_i} Y_i }{ \sum \frac{T_i}{\pi_i} } - \frac{ \sum \frac{1-T_i}{1-\pi_i} Y_i}{ \sum \frac{1-T_i}{1-\pi_i}}. $$

The classic HT estimator is just the expression above with each denominator replaced by $n$. The stabilized HT estimator is also called the Hájek estimator or the normalized estimator. The WLS estimator for the ATE is the second element of the 2-by-1 vector $(\boldsymbol{T}'\boldsymbol{W}\boldsymbol{T})^{-1}\boldsymbol{T}'\boldsymbol{W}\boldsymbol{Y}$, where $\boldsymbol{T}$ is an n-by-2 matrix whose first column is a vector of 1’s and whose second column is the variable $T$, $\boldsymbol{W}$ is an n-by-n diagonal matrix whose $i$-th diagonal element is $W_i := \frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i}$, and $\boldsymbol{Y}$ is an n-by-1 vector of the variable $Y$. The equivalence between the stabilized HT estimator and the WLS estimator was briefly mentioned in Freedman and Berk (2008, p.406). To demystify this equivalence, I offer a proof of it in this post, which is unsurprisingly quite short.

\begin{aligned} &\mathrel{\phantom{=}}(\boldsymbol{T}'\boldsymbol{W}\boldsymbol{T})^{-1}\boldsymbol{T}'\boldsymbol{W}\boldsymbol{Y} \\\
&= \begin{pmatrix} \sum W_i & \sum W_i T_i \\\
\sum W_i T_i & \sum W_i T_i^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum W_i Y_i \\\
\sum W_i T_i Y_i \end{pmatrix} \\\
&= \begin{pmatrix} \sum \frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} & \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] T_i \\\
\sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] T_i & \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] T_i^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] Y_i \\\
\sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] T_i Y_i \end{pmatrix} \\\
&= \begin{pmatrix} \sum \frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} & \sum \frac{T_i}{\pi_i} \\\
\sum \frac{T_i}{\pi_i} & \sum \frac{T_i}{\pi_i} \end{pmatrix}^{-1} \begin{pmatrix} \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] Y_i \\\
\sum \frac{T_i}{\pi_i} Y_i \end{pmatrix} \\\
&=\frac{1}{\sum\frac{T_i}{\pi_i} \sum\frac{1-T_i}{1-\pi_i}} \begin{pmatrix} \sum \frac{T_i}{\pi_i} & -\sum \frac{T_i}{\pi_i} \\\
-\sum \frac{T_i}{\pi_i} & \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] \end{pmatrix} \begin{pmatrix} \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] Y_i \\\
\sum \frac{T_i}{\pi_i} Y_i \end{pmatrix} \\\
&=\frac{1}{\sum\frac{T_i}{\pi_i} \sum\frac{1-T_i}{1-\pi_i}} \begin{pmatrix} \sum \frac{T_i}{\pi_i} \sum \frac{1-T_i}{1-\pi_i} Y_i \\\
\sum \frac{T_i}{\pi_i} Y_i \sum \frac{1-T_i}{1-\pi_i} - \sum \frac{T_i}{\pi_i} \sum \frac{1-T_i}{1-\pi_i} Y_i \end{pmatrix} \\\
&=\begin{pmatrix} \sum \frac{1-T_i}{1-\pi_i} Y_i / \sum \frac{1-T_i}{1-\pi_i} \\\
\sum \frac{T_i}{\pi_i} Y_i / \sum \frac{T_i}{\pi_i} - \sum \frac{1-T_i}{1-\pi_i} Y_i / \sum \frac{1-T_i}{1-\pi_i} \end{pmatrix}. \end{aligned} Note that the second element equals (1). Hence, the equivalence is established simply by re-writting the WLS estimator in matrix notation.

Now, a more interesting and harder question is which IPW estimator should be preferred: classic HT or stabilized HT (equivalently, WLS)? It is commonly argued that the stabilized HT often has a lower variance than the classic HT. A working paper by Khan and Ugander (2021) explains this phenomenon and, very interestingly, develops an adaptive approach that finds the most efficient mixture of the two estimators in any application.

References

Angrist, Joshua D., and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics. Princeton: Princeton University Press.
Freedman, David A., and Richard A. Berk. 2008. “Weighting Regressions by Propensity Scores.” Evaluation Review 32 (4): 392–409. https://doi.org/10.1177/0193841X08317586.
Khan, Samir, and Johan Ugander. 2021. “Adaptive Normalization for IPW Estimation.” arXiv. http://arxiv.org/abs/2106.07695.
Morgan, Stephen L., and Christopher Winship. 2014. Counterfactuals and Causal Inference: Methods and Principles For Social Research. 2nd edition. Analytical Methods for Social Research. Cambridge, UK: Cambridge University Press.

Ang Yu
