Evaluating and Improving Experimental Forecasts
Paper Session
Friday, Jan. 5, 2024 8:00 AM - 10:00 AM (CST)
- Chair: Nicholas G. Otis, University of California-Berkeley
Policy Choice and the Wisdom of Crowds
Abstract
Using data from seven large-scale randomized experiments, I test whether crowds of academic experts can forecast the relative effectiveness of policy interventions. Eight hundred sixty-three academic experts provided 9,295 forecasts of the causal effects from these experiments, which span a diverse set of interventions (e.g., information provision, psychotherapy, soft-skills training), outcomes (e.g., consumption, COVID-19 vaccination, employment), and locations (Jordan, Kenya, Sweden, the United States). For each policy comparison (a pair of policies and an outcome), I calculate the percentage of crowd forecasts that correctly rank the policies by their experimentally estimated treatment effects. While only 65% of individual experts identify which of two competing policies will have the larger causal effect, the average forecast from bootstrapped crowds of 30 experts identifies the better policy 86% of the time, or 92% when restricting the analysis to pairs of policies whose effects differ at the p < 0.10 level. Only 10 experts are needed to produce an 18-percentage-point (27%) improvement in policy choice.
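A minimal sketch of the bootstrapped-crowd exercise described in the abstract, under assumed inputs (one forecast per expert for each of two policies and the corresponding experimental estimates); the paper's exact data structure and procedure are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def crowd_accuracy(forecasts_a, forecasts_b, true_effect_a, true_effect_b,
                   crowd_size=30, n_bootstrap=1000):
    """Share of bootstrapped crowds whose mean forecast ranks the two
    policies the same way as the experimental estimates do.

    forecasts_a, forecasts_b: hypothetical arrays of individual expert
    forecasts for each policy (one entry per expert).
    """
    forecasts_a = np.asarray(forecasts_a)
    forecasts_b = np.asarray(forecasts_b)
    true_sign = np.sign(true_effect_a - true_effect_b)
    n_experts = len(forecasts_a)
    correct = 0
    for _ in range(n_bootstrap):
        # Draw a crowd of experts with replacement and average their forecasts.
        idx = rng.choice(n_experts, size=crowd_size, replace=True)
        crowd_sign = np.sign(forecasts_a[idx].mean() - forecasts_b[idx].mean())
        correct += (crowd_sign == true_sign)
    return correct / n_bootstrap
```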
Rationalizing Entrepreneurs' Forecasts
Abstract
We analyze, benchmark, and run randomized controlled trials on a panel of 7,463 U.S. entrepreneurs making incentivized sales forecasts. We assess accuracy using a novel administrative dataset obtained in collaboration with a leading U.S. payment processing firm. At baseline, only 13% of entrepreneurs can forecast their firm’s sales in the next three months within 10% of the realized value, with 7.3% of the mean squared error attributable to bias and the remaining 92.7% attributable to noise. Our first intervention rewards entrepreneurs up to $400 for accurate forecasts, our second requires respondents to review historical sales data, and our third provides forecasting training. Increased reward payments significantly reduce bias but have no effect on noise, despite inducing entrepreneurs to spend more time answering. The historical sales data intervention has no effect on bias but significantly reduces noise. Since bias is only a minor part of forecasting errors, reward payments have small effects on mean squared error, while the historical data intervention reduces it by 12.4%. The training intervention has negligible effects on bias, noise, and ultimately mean squared error. Our results suggest that while offering financial incentives makes forecasts more realistic, firms may not fully realize the benefits of having easy access to past performance data.
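One standard way to split mean squared forecast error into a bias component and a noise component, sketched below; the paper's exact definitions of bias and noise may differ from this simple decomposition, and the function name and inputs are illustrative:

```python
import numpy as np

def mse_decomposition(forecasts, actuals):
    """Decompose mean squared forecast error into bias (squared mean
    error) and noise (error variance) shares."""
    errors = np.asarray(forecasts) - np.asarray(actuals)
    mse = np.mean(errors ** 2)
    bias_sq = np.mean(errors) ** 2   # systematic over- or under-forecasting
    noise = mse - bias_sq            # equals the variance of the errors
    return {"mse": mse,
            "bias_share": bias_sq / mse,
            "noise_share": noise / mse}
```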
The Gender Gap in Confidence: Expected But Not Accounted For
Abstract
We investigate how the gender gap in confidence affects the views that evaluators (e.g., employers) hold about men and women. If evaluators fail to account for the confidence gap, it may cause overly pessimistic views about women. Alternatively, if evaluators expect and account for the confidence gap, such a detrimental impact may be avoided. We find robust evidence for the former: even when the confidence gap is expected, evaluators fail to account for it. This “contagious” nature of the gap persists across many interventions and types of evaluators. Only a targeted intervention that facilitates Bayesian updating proves (somewhat) effective.
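A stylized sketch of the Bayesian-updating logic the abstract alludes to: an evaluator who treats stated confidence as a noisy signal of ability, and who knows that women's stated confidence is shifted down by an expected gap, should correct for that shift before updating. All parameter values, names, and the conjugate-normal setup below are illustrative assumptions, not the paper's design:

```python
def posterior_ability(stated_confidence, is_woman, gap=0.1,
                      prior_mean=0.5, prior_var=0.04, noise_var=0.04):
    """Conjugate-normal update of an evaluator's belief about ability.

    The stated confidence is first corrected for the expected gender
    gap, then shrunk toward the prior mean in proportion to how noisy
    the signal is relative to the prior.
    """
    corrected_signal = stated_confidence + (gap if is_woman else 0.0)
    weight = prior_var / (prior_var + noise_var)
    return weight * corrected_signal + (1 - weight) * prior_mean

# With identical underlying ability, a debiased evaluator rates both
# candidates the same, while using raw confidence would rate the woman lower.
belief_woman = posterior_ability(0.6, is_woman=True)
belief_man = posterior_ability(0.7, is_woman=False)
```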
JEL Classifications
- C8 - Data Collection and Data Estimation Methodology; Computer Programs
- C9 - Design of Experiments