Big data, machine learning, and causal inference in economics

Our first lesson is not for training your hands, but for training your mind.

Author: Jau-er CHEN


Machine learning methods have been actively studied in economic big-data settings in recent years, cf. Athey (2017) and Athey and Imbens (2019). Big data is not only about massive samples and rich observed characteristics, but also about new types of datasets, including, for example, images and sensor data. Because big data by no means automatically and successfully tackles challenging problems, applied economics hinges as well on doing good data. Moreover, machine learning techniques often require adaptation to exploit the structure of economic problems, or adaptation that changes the optimization criteria of the algorithms for economic policy analysis. Doing good data is closely related to the domain knowledge and empirical strategies used to identify causal effects in most economic empirics. These adaptations form an emerging research area at the intersection of machine learning and econometrics, called causal machine learning in the economics literature.

Most empirical studies in economics aim at program evaluation or, equivalently, at estimating causal effects. Constructing the counterfactual and then estimating causal effects rely on an appropriately chosen identification strategy. Neither off-the-shelf machine learning methods nor big data suffices to reveal causal effects without the help of identification strategies. Domain knowledge and a good understanding of the institutional background of the dataset play an essential role in choosing or designing a convincing identification strategy. That is, identification of the causal effect precedes training machine learning models. In economics, the econometric “Furious Five” are the most extensively used identification strategies for causal inference: randomized controlled trials (RCT), regression (a matchmaker), the instrumental variable approach, the difference-in-differences method, and the regression discontinuity design (RDD), cf. Angrist and Pischke (2009, 2015). Each member of the Furious Five depends on specific identifying restrictions, some of which are untestable; researchers therefore need to justify those restrictions and interpret the resulting causal effect cautiously. In what follows, we discuss the identifying restrictions used in RCT and RDD, and their relation to machine learning in the economic big-data environment.

An RCT requires that researchers can randomly assign the treatment, which identifies the average treatment effect. Even for companies in the information technology industry, implementing pure randomized controlled trials is costly. With big data and contextual bandits, which observe characteristics of units that can be used in the assignment mechanism, a more effective and efficient RCT-built-in machine learning model can be implemented to estimate the treatment effect. The epsilon-greedy policy is a leading example in this setting: with probability (1 - epsilon), the assignment is based on the past data, and with probability epsilon, the assignment is an RCT (see the sketch following this paragraph). If the treatment effects are heterogeneous, and if that heterogeneity is associated with observed characteristics of the units, then bandit learning may yield substantial gains from assigning units to different treatments based on these characteristics. Consequently, many information technology companies nowadays use bandit learning rather than an RCT, which does not utilize the new information gathered during the experiment until after the experiment has ended.
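To make the policy concrete, the following is a minimal sketch of an epsilon-greedy contextual bandit. The arm count, epsilon, the per-arm ridge-regression reward model, and the simulated environment are all illustrative assumptions, not features of any production system discussed above.

```python
# A minimal epsilon-greedy contextual bandit sketch; every quantity here
# (arm count, epsilon, reward model, simulated environment) is an assumption.
import numpy as np

rng = np.random.default_rng(0)
n_rounds, n_arms, d, epsilon = 2000, 2, 3, 0.1

# Per-arm ridge-regression statistics: A_k = X'X + I and b_k = X'y for arm k.
A = [np.eye(d) for _ in range(n_arms)]
b = [np.zeros(d) for _ in range(n_arms)]

def choose_arm(x):
    # With probability epsilon, assign at random (the built-in RCT);
    # otherwise exploit the arm with the highest predicted reward.
    if rng.random() < epsilon:
        return int(rng.integers(n_arms))
    preds = [x @ np.linalg.solve(A[k], b[k]) for k in range(n_arms)]
    return int(np.argmax(preds))

for t in range(n_rounds):
    x = rng.normal(size=d)                      # unit's observed characteristics
    k = choose_arm(x)                           # treatment assignment
    # Simulated heterogeneous effect: arm 1 helps only when x[0] > 0.
    y = 0.5 * k * np.sign(x[0]) + rng.normal()  # realized outcome
    A[k] += np.outer(x, x)                      # update the chosen arm's model
    b[k] += y * x
```

With epsilon = 0.1, roughly 10% of units receive a purely randomized assignment, preserving an RCT inside the experiment, while the remaining assignments exploit what the accumulated data suggest about each unit's characteristics.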
If the treatment assignment is completely determined by a running variable and the units cannot manipulate that variable, the regression discontinuity design can be used to identify the treatment effect around the cut-off point of the running variable. For instance, when estimating the causal effect of alcohol consumption by American young adults on mortality, an RDD may use the minimum legal drinking age as the running variable; the treatment assignment around the cut-off at age 21 is as good as randomly assigned (for details, see Carpenter and Dobkin 2009). A stylized version of this design is sketched below. It is worth noting that RDD-built-in machine learning models produce the cut-off point and thereby generate natural experiments, which can be used to identify the treatment effect, evaluate the performance of the algorithm, and accordingly lead to a better decision-making process, cf. Narita and Yata (2020, Machine Learning is Natural Experiment). Specific examples include the COMPAS algorithm developed by Equivant and used as a decision-support tool by U.S. courts to assess the likelihood of a defendant becoming a recidivist, and the surge-pricing algorithm used by UberX, see Cohen et al. (2016). We emphasize here the benefit for causal inference of natural experiments generated by machine learning algorithms in the big-data environment. We deliberately set aside the debates over what fairness means and how each definition could be mathematically specified when implementing machine learning models, especially in the case of the COMPAS algorithm. In short, identification precedes estimation of causal effects, and the machine learning algorithm involved generates certain identification strategies (i.e., natural experiments) that can in turn be used to improve the algorithm. This process can be exploited iteratively.
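As a concrete illustration of the RDD logic in the drinking-age example, here is a minimal sharp-RDD sketch based on local linear fits on each side of the cut-off. The simulated data, the true jump of 2.0, and the bandwidth are assumptions for illustration; this is not the Carpenter and Dobkin (2009) analysis itself.

```python
# A minimal sharp-RDD sketch: local linear fits on each side of the cut-off.
# The simulated data, the true jump of 2.0, and the bandwidth h are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, cutoff, h = 5000, 21.0, 1.0                 # sample size, cut-off, bandwidth

age = rng.uniform(19, 23, n)                   # running variable
treated = (age >= cutoff).astype(float)        # assignment fully determined by age
y = 1.0 + 0.3 * (age - cutoff) + 2.0 * treated + rng.normal(size=n)

def fit_at_cutoff(mask):
    # Local linear regression of y on (age - cutoff) within the window;
    # the intercept is the fitted value at the cut-off.
    X = np.column_stack([np.ones(mask.sum()), age[mask] - cutoff])
    beta, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
    return beta[0]

right = fit_at_cutoff((age >= cutoff) & (age < cutoff + h))
left = fit_at_cutoff((age >= cutoff - h) & (age < cutoff))
print(f"estimated jump at the cut-off: {right - left:.2f}")  # close to 2.0
```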

Although some substantive economic problems are naturally cast as prediction problems and then analyzed by off-the-shelf machine learning methods, cf. Mullainathan and Spiess (2017), many applied works call for valid confidence intervals for a causal parameter of interest. In the econometric literature, two causal machine learning approaches are currently available to estimate treatment effects through adapted machine learning algorithms while providing valid standard errors for a causal parameter of interest, such as the average treatment effect or a quantile treatment effect: double machine learning (DML), cf. Chernozhukov et al. (2018), and generalized random forests (GRF), cf. Athey et al. (2019). DML utilizes techniques such as sample splitting, cross-fitting, and Neyman orthogonalization to improve the performance of adapted machine learning estimators in causal inference (a minimal cross-fitting sketch appears below), and it can handle high-dimensional datasets in which researchers observe massive numbers of characteristics of the units. GRF estimates heterogeneous (individual) treatment effects and explores which variables account for the heterogeneity in the treatment effect; this information is crucial for optimal policies mapping individuals’ observed characteristics to treatments.
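The following is a minimal sketch of DML for the partially linear model Y = theta*D + g(X) + e, using two-fold cross-fitting with random forests as the nuisance learners. The data-generating process, the choice of learners, and the fold count are illustrative assumptions.

```python
# A minimal DML sketch for the partially linear model Y = theta*D + g(X) + e.
# The data-generating process, the learners, and the fold count are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
n, p, theta = 2000, 10, 1.5
X = rng.normal(size=(n, p))
D = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=n)  # treatment depends on X
Y = theta * D + X[:, 1] ** 2 + rng.normal(size=n)    # outcome

res_y, res_d = np.empty(n), np.empty(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    # Cross-fitting: estimate the nuisances E[Y|X] and E[D|X] on one fold,
    # then form residuals on the held-out fold.
    m_y = RandomForestRegressor(random_state=0).fit(X[train], Y[train])
    m_d = RandomForestRegressor(random_state=0).fit(X[train], D[train])
    res_y[test] = Y[test] - m_y.predict(X[test])
    res_d[test] = D[test] - m_d.predict(X[test])

# Neyman-orthogonal moment: regress residualized Y on residualized D.
theta_hat = (res_d @ res_y) / (res_d @ res_d)
J = np.mean(res_d ** 2)
psi_sq = np.mean((res_d * (res_y - theta_hat * res_d)) ** 2)
se = np.sqrt(psi_sq / J ** 2 / n)
print(f"theta_hat = {theta_hat:.3f}, SE = {se:.3f}")  # close to theta = 1.5
```

The final regression of residualized Y on residualized D is the Neyman-orthogonal step: first-order errors in the two nuisance estimates do not bias theta_hat, and cross-fitting prevents overfitting in the nuisance models from leaking into the residuals.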
To illustrate the advantages of causal machine learning, we reproduce an empirical study conducted by Chen and Hsiang (2019). Based on the instrumental variable identification strategy and the GRF framework, Chen and Hsiang (2019) reinvestigate the distributional effect of 401(k) participation on net financial assets. Examining the effects of the 401(k) retirement savings plan on accumulated wealth is an issue of long-standing empirical interest in economics. The data, with 9,915 observations, are from the 1991 Survey of Income and Program Participation. The outcome variable is net financial assets. The treatment variable is a binary indicator of participation in the 401(k) plan. The instrumental variable is an indicator of eligibility to enroll in the 401(k) plan. Control variables (observed characteristics) consist of age, income, family size, education, marital status, two-earner status, defined benefit pension status, individual retirement account (IRA) participation status, and homeownership status. Their estimation results show that 401(k) participation has larger positive effects on net financial assets for people with higher savings propensity, which corresponds to the upper conditional quantiles. The estimated treatment effects display a monotonically increasing pattern across the conditional distribution of net financial assets, and the effects are all statistically significant.

[Figure: variable importance for treatment effect heterogeneity in the 401(k) study.]

In addition, based on the measure of variable importance considered in Chen and Hsiang (2019), the figure depicts that income, age, education, and family size are the four most important variables in explaining treatment effect heterogeneity. On average, income and age are the most important variables accounting for heterogeneity, with variable importance values of 64.4% and 15.6%, respectively. We should interpret the variable importance measure with caution, because researchers could reduce the importance measure of one variable by adding a highly correlated additional variable to the model (the sketch below illustrates this dilution); accordingly, in this case, the two highly correlated variables have to share the sample splits when implementing the GRF algorithm. Even with this caution, we now have an additional dimension, the quantile index, along which to compare variable importance across quantiles. In particular, the importance of the age variable increases as the savings propensity (quantile index) goes up, whereas the importance of the income variable decreases across the conditional distribution of net financial assets. Furthermore, these four variables are also identified as important in the context of DML with high-dimensional observed characteristics, cf. Chen, Huang, and Tien (2020). As illustrated above, the empirical study highlights the advantages stemming from causal machine learning methods. In practice, understanding treatment effect heterogeneity is also useful for estimating optimal policy assignments (the interested reader is referred to Athey and Imbens 2017).
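The correlated-variable caveat is easy to demonstrate with a generic random forest. Note that this sketch uses scikit-learn's impurity-based importance, not the GRF measure of Chen and Hsiang (2019); the variable names and the simulated model are assumptions.

```python
# Demonstrating the caveat: duplicating a strong predictor splits its
# importance. Uses scikit-learn's impurity-based importance, not the GRF
# measure; names and the simulated model are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n = 2000
income = rng.normal(size=n)
age = rng.normal(size=n)
y = 2.0 * income + 0.5 * age + rng.normal(size=n)

base = RandomForestRegressor(random_state=0).fit(
    np.column_stack([income, age]), y)
# Add a near-duplicate of income; the two correlated columns now share credit.
dup = income + rng.normal(scale=0.01, size=n)
extended = RandomForestRegressor(random_state=0).fit(
    np.column_stack([income, age, dup]), y)

print("importances [income, age]:     ", base.feature_importances_)
print("importances [income, age, dup]:", extended.feature_importances_)
```

In the second fit, the near-duplicate absorbs part of the importance previously attributed to income, even though the underlying relationship is unchanged.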

References

  1. Angrist, J. and J.S. Pischke (2009). Mostly Harmless Econometrics: An Empiricist’s Companion, Princeton University Press.
  2. Angrist, J. and J.S. Pischke (2015). Mastering ’Metrics: The Path from Cause to Effect, Princeton University Press.
  3. Athey, S. (2017). “Beyond prediction: Using big data for policy problems,” Science, 355, 483-485.
  4. Athey, S. and G.W. Imbens (2017). “The state of applied econometrics: Causality and policy evaluation,” Journal of Economic Perspectives, 31, 3-32.
  5. Athey, S. and G.W. Imbens (2019). “Machine learning methods that economists should know about,” Annual Review of Economics, 11, 685-725.
  6. Athey, S., J. Tibshirani, and S. Wager (2019). “Generalized random forests,” The Annals of Statistics, 47, 1148-1178.
  7. Carpenter, C. and C. Dobkin (2009). “The effect of alcohol consumption on mortality: Regression discontinuity evidence from the minimum drinking age,” American Economic Journal: Applied Economics, 1, 164-182.
  8. Chen, J.-E. and C.-W. Hsiang (2019). “Causal random forests model using instrumental variable quantile regression,” Econometrics, 7, 1-22.
  9. Chen, J.-E., C.-H. Huang, and J.-J. Tien (2020). “Debiased/double machine learning for instrumental variable quantile regressions,” Working Paper, Center for Research in Econometric Theory and Applications, National Taiwan University, Taipei, Taiwan.
  10. Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018). “Double/debiased machine learning for treatment and structural parameters,” The Econometrics Journal, 21, C1-C68.
  11. Cohen, P., R. Hahn, J. Hall, S. Levitt, and R. Metcalfe (2016). “Using big data to estimate consumer surplus: The case of Uber,” NBER Working Paper 22627.
  12. Mullainathan, S. and J. Spiess (2017). “Machine learning: An applied econometric approach,” Journal of Economic Perspectives, 31, 87-106.
  13. Narita, Y. and K. Yata (2020). “Machine learning is natural experiment,” Working Paper, Yale University.