Horvitz–Thompson and Weighted Least Squares


Inverse probability weighting (IPW) is a popular tool for estimating the average treatment effect (ATE) of a binary treatment under the conditional ignorability assumption. There are multiple variants of IPW. In particular, I used to be intrigued by the relationship between the so-called Horvitz–Thompson (HT) estimator and the weighted least squares (WLS) estimator, both of which implement IPW. Both estimators appear in popular causal inference textbooks: Angrist and Pischke (2009, p. 82) focus on the HT estimator, while Morgan and Winship (2014, pp. 228–229) focus on the WLS estimator. However, I haven't seen the two introduced together in these textbooks, and their relationship can seem a little unclear. In fact, a quick simulation reveals that the classic version of the HT estimator gives a different ATE estimate from the WLS estimator in any finite sample (although asymptotically both converge to the true ATE).
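To make the discrepancy concrete, here is a minimal simulation sketch in Python/NumPy. The data-generating process, sample size, and seed are illustrative choices of mine, not taken from any of the cited texts:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Illustrative data: one confounder X, true ATE = 1.
X = rng.normal(size=n)
pi = 1 / (1 + np.exp(-X))              # true propensity score
T = rng.binomial(1, pi)
Y = 1.0 * T + X + rng.normal(size=n)

# Classic HT estimator: both denominators are n.
ht_classic = np.mean(T / pi * Y) - np.mean((1 - T) / (1 - pi) * Y)

# WLS estimator: regress Y on [1, T] with weights W_i = T_i/pi_i + (1-T_i)/(1-pi_i).
W = T / pi + (1 - T) / (1 - pi)
Tmat = np.column_stack([np.ones(n), T])
beta = np.linalg.solve(Tmat.T @ (W[:, None] * Tmat), Tmat.T @ (W * Y))

print(ht_classic, beta[1])  # the two point estimates differ
```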

As it turns out, the WLS estimator exactly coincides with a “stabilized” version of the HT estimator. Let $Y$ be the outcome, $T$ be a binary treatment, and $\pi = E(T \mid X)$ be the propensity score defined in terms of covariates $X$. Then the stabilized HT estimator for the ATE is

$$ \frac{ \sum \frac{T_i}{\pi_i} Y_i }{ \sum \frac{T_i}{\pi_i} } - \frac{ \sum \frac{1-T_i}{1-\pi_i} Y_i}{ \sum \frac{1-T_i}{1-\pi_i}}. \tag{1} $$

The classic HT estimator is just (1) with both denominators replaced by $n$. The stabilized HT estimator is also called the Hájek estimator or the normalized estimator. The WLS estimator for the ATE is the second element of the 2-by-1 vector $(\boldsymbol{T}'\boldsymbol{W}\boldsymbol{T})^{-1}\boldsymbol{T}'\boldsymbol{W}\boldsymbol{Y}$, where $\boldsymbol{T}$ is an $n$-by-2 matrix whose first column is a vector of 1's and whose second column is the variable $T$, $\boldsymbol{W}$ is an $n$-by-$n$ diagonal matrix whose $i$-th diagonal element is $W_i := \frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i}$, and $\boldsymbol{Y}$ is an $n$-by-1 vector of the variable $Y$. The equivalence between the stabilized HT estimator and the WLS estimator was briefly mentioned in Freedman and Berk (2008, p. 406). To demystify this equivalence, I offer a proof of it in this post, which is unsurprisingly quite short.

$$
\begin{aligned}
&\mathrel{\phantom{=}} (\boldsymbol{T}'\boldsymbol{W}\boldsymbol{T})^{-1}\boldsymbol{T}'\boldsymbol{W}\boldsymbol{Y} \\
&= \begin{pmatrix} \sum W_i & \sum W_i T_i \\ \sum W_i T_i & \sum W_i T_i^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum W_i Y_i \\ \sum W_i T_i Y_i \end{pmatrix} \\
&= \begin{pmatrix} \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i}\right] & \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i}\right] T_i \\ \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i}\right] T_i & \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i}\right] T_i^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i}\right] Y_i \\ \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i}\right] T_i Y_i \end{pmatrix} \\
&= \begin{pmatrix} \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i}\right] & \sum \frac{T_i}{\pi_i} \\ \sum \frac{T_i}{\pi_i} & \sum \frac{T_i}{\pi_i} \end{pmatrix}^{-1} \begin{pmatrix} \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i}\right] Y_i \\ \sum \frac{T_i}{\pi_i} Y_i \end{pmatrix} \\
&= \frac{1}{\sum \frac{T_i}{\pi_i} \sum \frac{1-T_i}{1-\pi_i}} \begin{pmatrix} \sum \frac{T_i}{\pi_i} & -\sum \frac{T_i}{\pi_i} \\ -\sum \frac{T_i}{\pi_i} & \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i}\right] \end{pmatrix} \begin{pmatrix} \sum \left[\frac{T_i}{\pi_i}+\frac{1-T_i}{1-\pi_i}\right] Y_i \\ \sum \frac{T_i}{\pi_i} Y_i \end{pmatrix} \\
&= \frac{1}{\sum \frac{T_i}{\pi_i} \sum \frac{1-T_i}{1-\pi_i}} \begin{pmatrix} \sum \frac{T_i}{\pi_i} \sum \frac{1-T_i}{1-\pi_i} Y_i \\ \sum \frac{T_i}{\pi_i} Y_i \sum \frac{1-T_i}{1-\pi_i} - \sum \frac{T_i}{\pi_i} \sum \frac{1-T_i}{1-\pi_i} Y_i \end{pmatrix} \\
&= \begin{pmatrix} \sum \frac{1-T_i}{1-\pi_i} Y_i \Big/ \sum \frac{1-T_i}{1-\pi_i} \\ \sum \frac{T_i}{\pi_i} Y_i \Big/ \sum \frac{T_i}{\pi_i} - \sum \frac{1-T_i}{1-\pi_i} Y_i \Big/ \sum \frac{1-T_i}{1-\pi_i} \end{pmatrix}.
\end{aligned}
$$

The third equality uses $T_i^2 = T_i$ and $T_i(1-T_i) = 0$, which reduce every weighted sum involving $T_i$ to $\sum \frac{T_i}{\pi_i}$; the fourth inverts the 2-by-2 matrix, whose determinant is $\sum \frac{T_i}{\pi_i} \sum \frac{1-T_i}{1-\pi_i}$. Note that the second element of the final vector equals (1). Hence, the equivalence is established simply by rewriting the WLS estimator in matrix notation.
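The conclusion can also be checked numerically. Continuing the simulation sketch above, the stabilized HT estimate and the WLS slope coefficient agree up to floating-point error:

```python
# Stabilized (Hajek) HT estimator: normalized weighted means in each arm.
ht_stab = (np.sum(T / pi * Y) / np.sum(T / pi)
           - np.sum((1 - T) / (1 - pi) * Y) / np.sum((1 - T) / (1 - pi)))

print(np.isclose(ht_stab, beta[1]))  # True: identical up to rounding error
```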

Now, a more interesting and harder question is which IPW estimator should be preferred: the classic HT or the stabilized HT (equivalently, WLS)? It is commonly argued that the stabilized HT often has a lower variance than the classic HT. A working paper by Khan and Ugander (2021) explains this phenomenon and, very interestingly, develops an adaptive approach that finds the most efficient mixture of the two estimators in any given application.
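To give a flavor of the adaptive idea, here is a deliberately crude stand-in, not Khan and Ugander's actual procedure: take a convex combination $\hat{\tau}(\lambda) = \lambda \hat{\tau}_{\text{classic}} + (1-\lambda)\hat{\tau}_{\text{stabilized}}$ and pick the $\lambda$ that minimizes a bootstrap estimate of the variance. Continuing the simulation above:

```python
# Crude illustration only (NOT Khan and Ugander's estimator): mix the two
# IPW estimates and choose the mixing weight with the smallest bootstrap variance.
def ipw_pair(T, pi, Y):
    a, b = T / pi, (1 - T) / (1 - pi)
    classic = np.mean(a * Y) - np.mean(b * Y)
    stabilized = np.sum(a * Y) / np.sum(a) - np.sum(b * Y) / np.sum(b)
    return classic, stabilized

lams = np.linspace(0, 1, 21)
draws = np.empty((200, 2))
for r in range(200):
    idx = rng.integers(0, n, size=n)   # bootstrap resample
    draws[r] = ipw_pair(T[idx], pi[idx], Y[idx])

variances = [np.var(lam * draws[:, 0] + (1 - lam) * draws[:, 1]) for lam in lams]
lam_star = lams[np.argmin(variances)]

classic, stabilized = ipw_pair(T, pi, Y)
print(lam_star, lam_star * classic + (1 - lam_star) * stabilized)
```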

References

Angrist, Joshua D., and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton: Princeton University Press.
Freedman, David A., and Richard A. Berk. 2008. “Weighting Regressions by Propensity Scores.” Evaluation Review 32 (4): 392–409. https://doi.org/10.1177/0193841X08317586.
Khan, Samir, and Johan Ugander. 2021. “Adaptive Normalization for IPW Estimation.” arXiv. http://arxiv.org/abs/2106.07695.
Morgan, Stephen L., and Christopher Winship. 2014. Counterfactuals and Causal Inference: Methods and Principles for Social Research. 2nd edition. Analytical Methods for Social Research. Cambridge, UK: Cambridge University Press.
