It’s worth writing out the basic assumption of logit models first. For a logit model to estimate something interpretable, a functional form assumption on the conditional probability $Pr(Y=1 \mid X)$ is needed: \(Pr(Y=1 \mid X)=\frac{\exp(\alpha+\beta X)}{1+\exp(\alpha+\beta X)}.\)

I’ll call this the logit functional form. Given this assumption, using a logit model to regress $Y$ on $X$ will give you consistent estimates of $\alpha$ and $\beta$.

Short answer: because you (the researcher) are making self-contradictory assumptions! In two nested logit models, the researcher first uses logit to regress $Y$ on $X_1$, and then uses logit to regress $Y$ on $X_1$ and $X_2$ together. But to do this, the researcher implicitly assumes both that $Pr(Y=1 \mid X_1)=\frac{\exp(\alpha+\beta X_1)}{1+\exp(\alpha+\beta X_1)}$ and that $Pr(Y=1 \mid X_1, X_2)=\frac{\exp(\alpha+\beta_1 X_1 + \beta_2 X_2)}{1+\exp(\alpha+\beta_1 X_1 + \beta_2 X_2)}$. However, by the law of iterated expectations, $Pr(Y=1 \mid X_1)=E[Pr(Y=1 \mid X_1, X_2) \mid X_1]$, which means that if $Pr(Y=1 \mid X_1, X_2)$ has a logit functional form, it is generally impossible for $Pr(Y=1 \mid X_1)$ to also have a logit functional form. (This can be numerically verified with a simple DGP.) Hence, by estimating nested logits, the researcher contradicts themselves. Logit itself, on the other hand, makes no mistake.
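Here’s a minimal numeric sketch of that verification, with an assumed DGP in which $X_2$ is an independent Bernoulli covariate. If the implied $Pr(Y=1 \mid X_1)$ were logit in $X_1$, its log odds would be exactly linear in $X_1$, so second differences over an equally spaced grid would all be zero; they aren’t.

```python
# Numeric check (a minimal sketch; the DGP coefficients below are assumed for
# illustration): if Pr(Y=1 | X1, X2) is logit in (X1, X2), the implied
# Pr(Y=1 | X1) is generally NOT logit in X1.
import math

def invlogit(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

# Assumed DGP: Pr(Y=1 | X1, X2) = invlogit(0.5 + 1.0*X1 + 2.0*X2),
# with X2 ~ Bernoulli(0.5), independent of X1.
def pr_y_given_x1(x1):
    return 0.5 * invlogit(0.5 + 1.0 * x1 + 2.0) + 0.5 * invlogit(0.5 + 1.0 * x1)

# If Pr(Y=1 | X1) had a logit form, these second differences would be zero.
log_odds = [logit(pr_y_given_x1(x1)) for x1 in (-1.0, 0.0, 1.0, 2.0)]
second_diffs = [log_odds[i + 2] - 2 * log_odds[i + 1] + log_odds[i] for i in range(2)]
print(second_diffs)  # clearly nonzero -> marginal log odds are not linear in X1
```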

Short answer: no! Recent methodological literature tends to give the impression that omitting covariates that are independent of the variable of interest will bias the estimated logit coefficient. But this impression is false. The reasoning in the literature amounts to saying that if $Pr(Y=1 \mid X_1, X_2)$ has the logit functional form $\frac{\exp(\alpha+\beta_1 X_1 + \beta_2 X_2)}{1+\exp(\alpha+\beta_1 X_1 + \beta_2 X_2)}$, then using a logit model to regress $Y$ on $X_1$ alone will give you a biased estimate of $\beta_1$, even if $X_1$ and $X_2$ are independent. This is technically true, but again, a researcher who doesn’t want to be self-contradictory just wouldn’t use a logit model to regress $Y$ on $X_1$ if they believe $Pr(Y=1 \mid X_1, X_2)$ has a logit functional form.

If, instead, we just assume $Pr(Y=1 \mid X_1)$ alone has a logit functional form $\frac{\exp(\alpha+\beta X_1)}{1+\exp(\alpha+\beta X_1)}$, then the logit model will simply give us consistent estimates of $\alpha$ and $\beta$, regardless of whether there’s any variation in $Y$ within each level of $X_1$. Hence, logit is not really “biased” in the presence of unmodelled variation, as long as the researcher doesn’t go against their own assumption.
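A quick simulation sketch of this point (the parameter values are assumed for illustration): with a binary $X_1$, the logit MLE is saturated, so it equals the empirical log odds within each level of $X_1$, and it converges to the true $(\alpha, \beta)$ even though $Y$ still varies within each level.

```python
# A minimal simulation (assumed DGP): when Pr(Y=1 | X1) itself has the logit
# form, logit regression of Y on a binary X1 recovers (alpha, beta) despite
# the unmodelled variation of Y within each level of X1.
import math
import random

def invlogit(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

random.seed(0)
alpha, beta = -0.5, 1.2          # assumed true parameters
n = 200_000
x1 = [random.random() < 0.5 for _ in range(n)]
y = [random.random() < invlogit(alpha + beta * x) for x in x1]

# With a single binary regressor, the logit MLE is saturated, so it equals
# the empirical log odds within each level of X1.
p0 = sum(yi for yi, xi in zip(y, x1) if not xi) / (n - sum(x1))
p1 = sum(yi for yi, xi in zip(y, x1) if xi) / sum(x1)
alpha_hat = logit(p0)
beta_hat = logit(p1) - logit(p0)
print(round(alpha_hat, 2), round(beta_hat, 2))  # close to -0.5 and 1.2
```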

Short answer: forget about logit, and also forget about the linear probability model, for that matter. Even if logit itself doesn’t make mistakes, it nevertheless makes it easy for researchers to make them, because the researcher has to assume the logit functional form in order to use logit. But there really is no need for us to assume any functional form, including the assumed functional form of the linear probability model: $Pr(Y=1 \mid X)=\alpha + \beta X$. What we care about are conditional probabilities like $Pr(Y=1 \mid X)$ or maybe log odds $\log \left[\frac{Pr(Y=1 \mid X)}{1-Pr(Y=1 \mid X)} \right]$ (or maybe an effect measure such as risk difference or marginal odds ratio). These estimands are in no way tied to any specific model (logit, probit, linear probability model). In fact, since we, as humans, tend to make mistakes about functional form assumptions, we shouldn’t be given the power to determine the functional form at all. Given a chosen estimand, we can and should just let the machine take over and learn the functional form for us. Replacing logit with machine learning, we have nothing to lose, but everything to gain.
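As a minimal sketch of this estimand-first workflow (the DGP is assumed for illustration, and empirical stratum frequencies stand in for a fitted ML learner, which would play the same role for continuous covariates): estimate $Pr(Y=1 \mid X)$ without any functional form assumption, then read off whichever estimand you care about.

```python
# A minimal sketch of the estimand-first approach: estimate Pr(Y=1 | X)
# without assuming any functional form (here, empirical frequencies within
# levels of a discrete X; a flexible ML learner would play the same role for
# continuous X), then compute the chosen estimand directly.
import math
import random

random.seed(1)
n = 100_000
data = []
for _ in range(n):
    x = random.choice([0, 1])
    # Assumed DGP: an arbitrary, non-logit conditional probability.
    p = 0.2 if x == 0 else 0.55
    data.append((x, random.random() < p))

def p_hat(x_level):
    ys = [y for x, y in data if x == x_level]
    return sum(ys) / len(ys)

p0, p1 = p_hat(0), p_hat(1)
risk_difference = p1 - p0
log_odds_ratio = math.log(p1 / (1 - p1)) - math.log(p0 / (1 - p0))
print(round(risk_difference, 2))   # close to the true value 0.35
print(round(log_odds_ratio, 2))
```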

To begin with, here’s my understanding of what people are really trying to achieve when they use a “margin-free” measure. The idea is to separate absolute/marginal/uniform/across-the-board changes in the marginal distribution from (relative) changes in the disparity/association (e.g., Mare, 1981). In other words, people want a measure of disparities/associations that’s impervious to marginal changes. And the odds ratio is such a measure in terms of cell or joint probabilities.

Consider a simple two-by-two table, where the dependent variable is a binary indicator of college graduation, and the independent variable is a binary SES indicator. Then the odds ratio is defined as

\begin{equation} \left. \frac{Pr(college=1 \mid SES=high)}{Pr(college=0 \mid SES=high)} \middle/ \frac{Pr(college=1 \mid SES=low)}{Pr(college=0 \mid SES=low)} \right., \end{equation} which can be written as \begin{equation} \left. \frac{Pr(college=1, SES=high)}{Pr(college=0, SES=high)} \middle/ \frac{Pr(college=1, SES=low)}{Pr(college=0, SES=low)} \right.. \end{equation}

So Expression (1) is in terms of conditional probabilities, and Expression (2) is in terms of joint probabilities (or equivalently, cell probabilities, referring to cells of the 2-by-2 table). And the odds ratio is margin-free in the sense that if we multiply both $Pr(college=1, SES=high)$ and $Pr(college=1, SES=low)$ by a number $u$, and correspondingly multiply $Pr(college=0, SES=high)$ and $Pr(college=0, SES=low)$ by $v=\frac{1-u[Pr(college=1, SES=high)+Pr(college=1, SES=low)]}{Pr(college=0, SES=high) + Pr(college=0, SES=low)}$, then the odds ratio remains unchanged (the value of $v$ is chosen so that the four joint probabilities still sum to 1). Note that this logic only works on Expression (2), not Expression (1), due to the constraint that $Pr(college=1 \mid SES=high)+Pr(college=0 \mid SES=high)=1$ and $Pr(college=1 \mid SES=low)+Pr(college=0 \mid SES=low)=1$.
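Here’s a quick numerical check of this margin-free property, with assumed cell probabilities:

```python
# Numeric check of the margin-free property (assumed joint probabilities):
# scale the college=1 cells by u and the college=0 cells by the compensating
# v; the odds ratio is unchanged.
p11, p01 = 0.30, 0.20   # Pr(college=1, SES=high), Pr(college=0, SES=high)
p10, p00 = 0.15, 0.35   # Pr(college=1, SES=low),  Pr(college=0, SES=low)

def odds_ratio(p11, p01, p10, p00):
    return (p11 / p01) / (p10 / p00)

u = 1.4                                   # across-the-board expansion factor
v = (1 - u * (p11 + p10)) / (p01 + p00)   # keeps the four cells summing to 1
new = (u * p11, v * p01, u * p10, v * p00)
print(round(sum(new), 10))                               # 1.0
print(odds_ratio(p11, p01, p10, p00), odds_ratio(*new))  # equal (u, v cancel)
```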

And this is why I find the margin-free property of odds ratio not that appealing. To me, joint probabilities and changes defined in terms of them are substantively not as natural and relevant as conditional probabilities and the corresponding changes. I think conditional probabilities are more natural because I tend to think about them when I think about stratification. For example, it feels right to talk about the probability of college graduation and how it varies by SES, not the probability of being in a cell jointly defined by graduation status and SES. I think this intuition also works well with how statistical models are often set up in practice: we often model the conditional probability (for example, $Pr(college=1 \mid SES)= invlogit(\alpha+\beta SES)$) instead of a joint probability.

In fact, once we focus on conditional probabilities, both risk difference and risk ratio seem to be measures that are more naturally “margin free”. In particular, if there had been an across-the-board additive expansion in education (adding the same number to $Pr(college=1 \mid SES=high)$ and $Pr(college=1 \mid SES=low)$), the risk difference, \begin{equation} Pr(college=1 \mid SES=high)-Pr(college=1 \mid SES=low), \end{equation} wouldn’t change. And if there had been an across-the-board multiplicative expansion in education (multiplying $Pr(college=1 \mid SES=high)$ and $Pr(college=1 \mid SES=low)$ by the same number), the risk ratio, \begin{equation} \frac{Pr(college=1 \mid SES=high)}{Pr(college=1 \mid SES=low)}, \end{equation} wouldn’t change. These changes are marginal in the sense that they happen uniformly to the two SES groups, and risk difference and risk ratio are each impervious to one of these changes. In contrast, odds ratio does not remain the same under either of these changes.
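A quick numerical check of these invariances, with assumed conditional probabilities:

```python
# Numeric check (assumed conditional probabilities): risk difference is
# invariant to an additive across-the-board change, risk ratio to a
# multiplicative one, while the odds ratio is invariant to neither.
p_high, p_low = 0.60, 0.40  # Pr(college=1 | SES=high), Pr(college=1 | SES=low)

def rd(a, b): return a - b
def rr(a, b): return a / b
def odds_r(a, b): return (a / (1 - a)) / (b / (1 - b))

# Additive expansion: add 0.1 to both groups.
a_high, a_low = p_high + 0.1, p_low + 0.1
# Multiplicative expansion: multiply both groups by 1.2.
m_high, m_low = p_high * 1.2, p_low * 1.2

print(rd(p_high, p_low), rd(a_high, a_low))  # both 0.2
print(rr(p_high, p_low), rr(m_high, m_low))  # both 1.5
print(odds_r(p_high, p_low), odds_r(a_high, a_low), odds_r(m_high, m_low))
```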

Both estimators are introduced in popular causal inference textbooks. For example, Angrist and Pischke (2009, p. 82) focus on the HT estimator, while Morgan and Winship (2014, pp. 228–9) focus on the WLS estimator. However, I haven’t seen them introduced together in these textbooks, and their relationship could seem a little unclear. In fact, a quick simulation would reveal that the classic version of the HT estimator gives a different ATE estimate from that of the WLS estimator in any finite sample (although asymptotically they both converge to the true ATE).

As it turns out, **the WLS estimator exactly coincides with a “stabilized” version of the HT estimator**. Let $Y$ be the outcome, $T$ be a binary treatment, and $\pi=E(T \mid X)$ be the propensity score defined in terms of covariates $X$. Then the stabilized HT estimator for ATE is

$$ \frac{ \sum \frac{T_i}{\pi_i} Y_i }{ \sum \frac{T_i}{\pi_i} } - \frac{ \sum \frac{1-T_i}{1-\pi_i} Y_i}{ \sum \frac{1-T_i}{1-\pi_i}}. \tag{1} $$

The classic HT estimator is just (1) with the denominators replaced by $n$. The stabilized HT estimator is also called the Hájek estimator or the normalized estimator. And the WLS estimator for ATE is the second element of the 2-by-1 vector $(\boldsymbol{T}'\boldsymbol{W}\boldsymbol{T})^{-1}\boldsymbol{T}'\boldsymbol{W}\boldsymbol{Y}$, where $\boldsymbol{T}$ is an $n$-by-2 matrix whose first column is a vector of 1’s and whose second column is the variable $T$, $\boldsymbol{W}$ is an $n$-by-$n$ diagonal matrix whose $i$-th diagonal element is $W_i := \frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i}$, and $\boldsymbol{Y}$ is an $n$-by-1 vector of the variable $Y$. The equivalence between the stabilized HT estimator and the WLS estimator was briefly mentioned in Freedman and Berk (2008, p. 406). To demystify this equivalence, I offer a proof of it in this post, which is unsurprisingly quite short.
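Before the proof, here’s a quick numerical check (simulated data with an assumed DGP; the WLS coefficient is computed directly from the matrix formula): the WLS estimate matches the stabilized HT estimate exactly, while the classic HT estimate differs in a finite sample.

```python
# Numeric check (assumed DGP): WLS equals stabilized HT exactly; classic HT
# differs in a finite sample.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-x))            # true propensity score
t = rng.random(n) < pi                   # treatment assignment
y = 1.0 + 2.0 * t + x + rng.normal(size=n)

w1, w0 = t / pi, (1 - t) / (1 - pi)      # IPW weights for treated / control
ht_classic = (w1 @ y) / n - (w0 @ y) / n
ht_stabilized = (w1 @ y) / w1.sum() - (w0 @ y) / w0.sum()

# WLS of Y on (1, T) with weights W_i = T_i/pi_i + (1-T_i)/(1-pi_i).
W = np.diag(w1 + w0)
T = np.column_stack([np.ones(n), t.astype(float)])
wls = np.linalg.solve(T.T @ W @ T, T.T @ W @ y)[1]

print(abs(wls - ht_stabilized))  # zero up to floating-point error
print(abs(wls - ht_classic))     # nonzero in a finite sample
```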

\begin{aligned}
&\mathrel{\phantom{=}}(\boldsymbol{T}'\boldsymbol{W}\boldsymbol{T})^{-1}\boldsymbol{T}'\boldsymbol{W}\boldsymbol{Y} \\
&= \begin{pmatrix}
\sum W_i & \sum W_i T_i \\
\sum W_i T_i & \sum W_i T_i^2
\end{pmatrix}^{-1}
\begin{pmatrix}
\sum W_i Y_i \\
\sum W_i T_i Y_i
\end{pmatrix} \\
&= \begin{pmatrix}
\sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] & \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] T_i \\
\sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] T_i & \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] T_i^2
\end{pmatrix}^{-1}
\begin{pmatrix}
\sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] Y_i \\
\sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] T_i Y_i
\end{pmatrix} \\
&= \begin{pmatrix}
\sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] & \sum \frac{T_i}{\pi_i} \\
\sum \frac{T_i}{\pi_i} & \sum \frac{T_i}{\pi_i}
\end{pmatrix}^{-1}
\begin{pmatrix}
\sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] Y_i \\
\sum \frac{T_i}{\pi_i} Y_i
\end{pmatrix} \\
&=\frac{1}{\sum\frac{T_i}{\pi_i} \sum\frac{1-T_i}{1-\pi_i}}
\begin{pmatrix}
\sum \frac{T_i}{\pi_i} & -\sum \frac{T_i}{\pi_i} \\
-\sum \frac{T_i}{\pi_i} & \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right]
\end{pmatrix}
\begin{pmatrix}
\sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i} \right] Y_i \\
\sum \frac{T_i}{\pi_i} Y_i
\end{pmatrix} \\
&=\frac{1}{\sum\frac{T_i}{\pi_i} \sum\frac{1-T_i}{1-\pi_i}}
\begin{pmatrix}
\sum \frac{T_i}{\pi_i} \sum \frac{1-T_i}{1-\pi_i} Y_i \\
\sum \frac{T_i}{\pi_i} Y_i \sum \frac{1-T_i}{1-\pi_i} - \sum \frac{T_i}{\pi_i} \sum \frac{1-T_i}{1-\pi_i} Y_i
\end{pmatrix} \\
&=\begin{pmatrix}
\sum \frac{1-T_i}{1-\pi_i} Y_i \Big/ \sum \frac{1-T_i}{1-\pi_i} \\
\sum \frac{T_i}{\pi_i} Y_i \Big/ \sum \frac{T_i}{\pi_i} - \sum \frac{1-T_i}{1-\pi_i} Y_i \Big/ \sum \frac{1-T_i}{1-\pi_i}
\end{pmatrix}.
\end{aligned}
Note that the second element equals (1). Hence, the equivalence is established simply by rewriting the WLS estimator in matrix notation.

Now, a more interesting and harder question is which IPW estimator should be preferred: the classic HT or the stabilized HT (equivalently, WLS)? It is commonly argued that the stabilized HT often has a lower variance than the classic HT. A working paper by Khan and Ugander (2021) explains this phenomenon and, very interestingly, develops an adaptive approach that finds the most efficient mixture of the two estimators in any application.
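For intuition, here’s a small Monte Carlo sketch (the DGP is assumed, chosen so the outcome has a large common offset, a textbook case where stabilization helps):

```python
# A small Monte Carlo sketch (assumed DGP) illustrating the common finding
# that the stabilized (Hajek) estimator has lower variance than the classic
# HT estimator when the outcome has a large common offset.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 2000
classic, stabilized = [], []
for _ in range(reps):
    x = rng.normal(size=n)
    pi = 1.0 / (1.0 + np.exp(-x))
    t = rng.random(n) < pi
    y = 10.0 + 1.0 * t + rng.normal(size=n)  # large offset inflates classic HT
    w1, w0 = t / pi, (1 - t) / (1 - pi)
    classic.append((w1 @ y) / n - (w0 @ y) / n)
    stabilized.append((w1 @ y) / w1.sum() - (w0 @ y) / w0.sum())

print(np.var(classic) > np.var(stabilized))  # True in this DGP
```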

Angrist, Joshua D., and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics. Princeton: Princeton University Press.

Freedman, David A., and Richard A. Berk. 2008. “Weighting Regressions by Propensity Scores.” Evaluation Review 32 (4): 392–409. https://doi.org/10.1177/0193841X08317586.

Khan, Samir, and Johan Ugander. 2021. “Adaptive Normalization for IPW Estimation.” arXiv. http://arxiv.org/abs/2106.07695.

Morgan, Stephen L., and Christopher Winship. 2014. Counterfactuals and Causal Inference: Methods and Principles for Social Research. 2nd edition. Analytical Methods for Social Research. Cambridge, UK: Cambridge University Press.