Evaluating and Improving Experimental Forecasts

Paper Session

Friday, Jan. 5, 2024 8:00 AM - 10:00 AM (CST)

Grand Hyatt, Travis B
Hosted By: Econometric Society
  • Chair: Nicholas G. Otis, University of California-Berkeley

Weighing the Evidence: Which Studies Count?

Eva Vivalt
University of Toronto
Aidan Coville
World Bank
Sampada KC
University of British Columbia


We present results from two experiments run at World Bank and Inter-American Development Bank workshops on how policymakers, policy practitioners, and researchers weigh evidence and seek information from impact evaluations. We find that policymakers and policy practitioners care more about attributes of studies associated with external validity than internal validity, while for researchers the reverse is true. The policymakers and policy practitioners with the most accurate forecasts of estimated program impacts were those who acted most like researchers in seeking evidence, and vice versa. These preferences can yield large differences in the estimated effects of pursued policies: policymakers were willing to accept a program with a 6.3-percentage-point smaller effect on enrollment rates if it were recommended by a local expert, a difference larger than the effects of most programs.

Policy Choice and the Wisdom of Crowds

Nicholas G. Otis
University of California-Berkeley


Using data from seven large-scale randomized experiments, I test whether crowds of academic experts can forecast the relative effectiveness of policy interventions. Eight hundred sixty-three academic experts provided 9,295 forecasts of the causal effects from these experiments, which span a diverse set of interventions (e.g., information provision, psychotherapy, soft-skills training), outcomes (e.g., consumption, COVID-19 vaccination, employment), and locations (Jordan, Kenya, Sweden, the United States). For each policy comparison (a pair of policies and an outcome), I calculate the percentage of crowd forecasts that correctly rank the policies by their experimentally estimated treatment effects. While only 65% of individual experts identify which of two competing policies will have a larger causal effect, the average forecast from bootstrapped crowds of 30 experts identifies the better policy 86% of the time, or 92% when restricting analysis to pairs of policies whose effects differ at the p < 0.10 level. Only 10 experts are needed to produce an 18-percentage-point (27%) improvement in policy choice.
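The aggregation logic in this abstract can be illustrated with a small simulation. The sketch below is purely hypothetical (the effect sizes, noise level, and expert count are illustrative assumptions, not the paper's data); it shows why averaging forecasts across a bootstrapped crowd ranks two policies correctly more often than an individual expert does.

```python
import random

random.seed(0)

# Hypothetical true effects for two competing policies (illustrative only):
# policy A is truly more effective than policy B.
TRUE_A, TRUE_B = 0.30, 0.20

def expert_forecast(true_effect, noise_sd=0.15):
    """One expert's noisy, unbiased forecast of a policy's effect."""
    return random.gauss(true_effect, noise_sd)

# Simulate a pool of 863 experts, each forecasting both policies.
experts = [(expert_forecast(TRUE_A), expert_forecast(TRUE_B)) for _ in range(863)]

def correct_rank(fa, fb):
    """A forecast pair ranks the policies correctly iff A is forecast above B."""
    return fa > fb

# Individual accuracy: share of single experts who rank the policies correctly.
individual = sum(correct_rank(fa, fb) for fa, fb in experts) / len(experts)

def crowd_accuracy(k, draws=2000):
    """Accuracy of the mean forecast from bootstrapped crowds of k experts."""
    hits = 0
    for _ in range(draws):
        crowd = random.choices(experts, k=k)  # bootstrap sample with replacement
        mean_a = sum(fa for fa, _ in crowd) / k
        mean_b = sum(fb for _, fb in crowd) / k
        hits += correct_rank(mean_a, mean_b)
    return hits / draws

print(f"individual: {individual:.2f}, crowd of 30: {crowd_accuracy(30):.2f}")
```

Averaging cancels idiosyncratic forecast noise, so the crowd mean's standard error shrinks with the square root of the crowd size, which is the mechanism behind the individual-versus-crowd gap reported above.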

Rationalizing Entrepreneurs' Forecasts

Nicholas Bloom
Stanford University
Mihai Codreanu
Stanford University
Robert Fletcher
Stanford University


We analyze, benchmark, and run randomized controlled trials on a panel of 7,463 U.S. entrepreneurs making incentivized sales forecasts. We assess accuracy using a novel administrative dataset obtained in collaboration with a leading U.S. payment processing firm. At baseline, only 13% of entrepreneurs can forecast their firm's sales in the next three months within 10% of the realized value, with 7.3% of the mean squared error attributable to bias and the remaining 92.7% attributable to noise. Our first intervention rewards entrepreneurs up to $400 for accurate forecasts, our second requires respondents to review historical sales data, and our third provides forecasting training. Increased reward payments significantly reduce bias but have no effect on noise, despite inducing entrepreneurs to spend more time answering. The historical sales data intervention has no effect on bias but significantly reduces noise. Since bias is only a minor part of forecasting errors, reward payments have small effects on mean squared error, while the historical data intervention reduces it by 12.4%. The training intervention has negligible effects on bias, noise, and ultimately mean squared error. Our results suggest that while offering financial incentives makes forecasts more realistic, firms may not fully realize the benefits of having easy access to past performance data.
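The bias/noise split of mean squared error used in this abstract follows the standard identity MSE = bias² + variance. A minimal sketch of that decomposition, using made-up forecast and realized-sales numbers (not the paper's data):

```python
# Illustrative decomposition of forecast errors into bias and noise,
# mirroring the identity MSE = bias^2 + variance. Numbers are invented.
forecasts = [105.0, 92.0, 110.0, 98.0, 120.0]  # hypothetical sales forecasts
actuals   = [100.0, 100.0, 100.0, 100.0, 100.0]  # hypothetical realized sales

errors = [f - a for f, a in zip(forecasts, actuals)]
n = len(errors)

mse = sum(e * e for e in errors) / n
bias = sum(errors) / n                             # systematic over-forecasting
noise = sum((e - bias) ** 2 for e in errors) / n   # dispersion around the bias

assert abs(mse - (bias ** 2 + noise)) < 1e-9       # the identity holds exactly

bias_share = bias ** 2 / mse  # fraction of MSE attributable to bias
print(f"bias share of MSE: {bias_share:.1%}")
```

In this toy example most of the error is noise rather than bias, which is the same qualitative pattern (7.3% bias vs. 92.7% noise) that drives the paper's finding that reducing bias alone barely moves mean squared error.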

The Gender Gap in Confidence: Expected But Not Accounted For

Christine Exley
Harvard Business School
Kirby Nielsen
California Institute of Technology


We investigate how the gender gap in confidence affects the views that evaluators (e.g., employers) hold about men and women. If evaluators fail to account for the confidence gap, it may cause overly pessimistic views about women. Alternatively, if evaluators expect and account for the confidence gap, such a detrimental impact may be avoided. We find robust evidence for the former: even when the confidence gap is expected, evaluators fail to account for it. This "contagious" nature of the gap persists across many interventions and types of evaluators. Only a targeted intervention that facilitates Bayesian updating proves (somewhat) effective.
JEL Classifications
  • C8 - Data Collection and Data Estimation Methodology; Computer Programs
  • C9 - Design of Experiments