Six Lessons about Crowd Prediction

In this entry, we briefly summarize several thought-provoking findings from the Good Judgment Project. These findings helped us improve our aggregated forecasts and win the Aggregative Contingent Estimation (ACE) forecasting tournament, sponsored by IARPA. We also provide references to the original papers for curious readers.

1. Prediction skill over time[1]

Prior research had demonstrated that expert forecasters can be embarrassingly close to dart-throwing chimps in accuracy. Does this mean that accurate forecasting is simply a matter of luck, with no skill involved? Not at all.

Here’s how we tested for the persistence of forecasting skill. Within a sample of 600 Good Judgment Project forecasters, we identified the 100 most and least accurate forecasters, based on standardized Brier scores,[2] over the first 25 questions in the forecasting tournament. Then, we tracked their scores over the next 175 questions.
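For intuition, here is a minimal sketch of how two-alternative Brier scores can be computed and standardized within each question, in the spirit of footnote [2]. The data layout is hypothetical and the project's actual scoring pipeline was more involved:

```python
from statistics import mean, stdev

def brier(p, outcome):
    # Two-alternative quadratic (Brier) score: 0 is perfect, 2 is worst.
    # p is the probability assigned to the event; outcome is 1 or 0.
    return (p - outcome) ** 2 + ((1 - p) - (1 - outcome)) ** 2

def standardized_scores(scores_by_question):
    # scores_by_question: {question: {forecaster: brier score}} (hypothetical layout).
    # Z-scoring within each question makes the mean score zero,
    # with higher values denoting lower accuracy.
    per_forecaster = {}
    for scores in scores_by_question.values():
        m, s = mean(scores.values()), stdev(scores.values())
        for forecaster, sc in scores.items():
            per_forecaster.setdefault(forecaster, []).append((sc - m) / s if s else 0.0)
    return {f: mean(zs) for f, zs in per_forecaster.items()}
```

With scores computed this way, the "top 100" and "bottom 100" groups are simply the tails of the standardized-score distribution over the first 25 questions.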

As we can see in the figure below, the top 100 forecasters (based on initial performance) were consistently more accurate than the bottom 100, with the top guns beating the rear-guard on 169 out of 174 questions.

Upshot: Forecasting accuracy is a matter of skill, not just luck.

2. Training and teaming[3]

How important are situational factors in affecting individual forecasting performance? We focused on two such factors: training and teaming.

Training was delivered in a one-hour online module that covered forecasting reasoning tips, such as using base rates, building simple mathematical models, and updating one’s beliefs in response to news. Teaming allowed forecasters to collaborate as members of 12-15 person teams that had online tools for allocating effort and for sharing information and rationales with one another. As the figure shows, training and teaming significantly reduced forecasting error in the tournament. These results replicated across all four seasons of the tournament.

Upshot: Forecasting training and teaming improve forecasting performance.

3. Superforecasters[4]

The first two lessons concerned persistence of individual skill and the importance of environmental factors. We wondered what would happen if we introduced highly skilled forecasters into an enriched environment. To test whether such “tracking” would further improve performance, we promoted the top 2% most accurate forecasters to “superforecaster” status and placed them in teams. The resulting super teams had an elite-egalitarian structure: they were composed of top past-performers, all of whom had equal rights and responsibilities within their teams.

The performance of super teams was extremely strong. We document this using a simple version of a discontinuity analysis: we compare the superforecasters (top 2%) with those who just missed the cut (top 3-5%). In Year 1, when the selection took place, both groups performed much better than average. In Years 2 and 3, super teams increased their lead over the comparison group. Rather than regressing toward the mean, super teams increased their level of engagement and produced highly accurate forecasts.

Upshot: Tracking top performers and placing them in flat, non-hierarchical teams improves motivation, engagement and performance.


4. Belief updating[5]

What is the best behavioral predictor of prediction skill, apart from accuracy? One potential indicator is belief updating. The average duration of Good Judgment Project forecasting questions was over three months; forecasters were able to update their predictions whenever they wished. The pattern of belief updating was strongly and robustly related to forecasting accuracy.

We measured both the frequency of belief updates and their magnitude. For example, we could compare a forecaster who makes 1.5 predictions per question with an average update of 20 percentage points against another who makes 2.3 predictions per question and updates by 11 percentage points on average. Forecasters who updated their beliefs more often, and in smaller increments, tended to be more accurate than those who made fewer or larger updates. Frequency and magnitude each independently predicted accuracy, and we verified the robustness of these relationships both in and out of sample.
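As a rough sketch (with a hypothetical data layout, not the project's actual code), the two updating measures can be computed from a forecaster's history like this:

```python
def update_stats(history):
    # history: {question_id: [p1, p2, ...]} -- chronological probability
    # forecasts on each question (hypothetical layout).
    n_questions = len(history)
    n_forecasts = sum(len(ps) for ps in history.values())
    deltas = [abs(later - earlier)
              for ps in history.values()
              for earlier, later in zip(ps, ps[1:])]
    frequency = n_forecasts / n_questions                      # forecasts per question
    magnitude = sum(deltas) / len(deltas) if deltas else 0.0   # mean absolute update size
    return frequency, magnitude
```

A forecaster with history {"q1": [0.60, 0.65, 0.70], "q2": [0.30]} makes 2.0 forecasts per question with a mean update of 5 percentage points, the frequent, small-increment profile associated with accuracy.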

Upshot: Frequent, small belief updates are the marks of an accurate forecaster.

5. Measuring forecasting skill over time[6]

So which are the best predictors of individual forecasting skill? It depends on the amount of performance data available.

With minimal accuracy data, we can best predict future performance with behavioral/effort measures (belief updating), situational variables (in our case, training and teaming) and dispositional measures, such as fluid intelligence, cognitive reflection, numeracy and actively open-minded thinking.

The more data available on past accuracy of participants, the less noisy this measure becomes. As soon as we had 10 resolved questions on which to judge individual accuracy, past accuracy became the best single predictor of future performance. Once we had 50 resolved questions, past accuracy was more predictive of future performance than a model combining dispositional, situational and behavioral measures.

Upshot: Dispositional, behavioral and situational factors all predict individual accuracy, but once enough performance data accumulates, past accuracy becomes the best predictor.


6. Prediction polls and markets[7]

The results discussed above are all derived from prediction polls (surveys), a method for crowdsourcing probability judgments. Individual estimates from prediction polls can be aggregated to produce wisdom-of-crowds forecasts. Prediction markets also produce crowd assessments, by aggregating the price signals of market participants. Which method produces more accurate forecasts, prediction polls or prediction markets?

We compared a continuous double auction market with individual and team-based prediction polls over the course of one year in the tournament. Forecasters were randomly assigned to conditions and produced more than 50,000 market orders and 100,000 probability predictions. Accuracy scores across 114 questions are shown below. Prediction markets outperformed simple, unweighted forecast aggregates from polls. However, the accuracy of poll aggregates increased when we introduced three adjustments, derived out of sample, to the aggregation algorithms: temporal decay placed higher relative weight on more recent forecasts; past-performance weights increased the relative influence of individuals who updated their forecasts more frequently and had better track records of accuracy; and a recalibration function pushed aggregated estimates toward the extremes of the probability scale. With these adjustments, team-based prediction polls significantly outperformed prediction markets.
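The three adjustments can be sketched as follows. This is an illustrative simplification: the parameter names are made up, and the exact decay, weighting and recalibration functions the project fit out of sample were more elaborate:

```python
def aggregate(forecasts, now, half_life=10.0, skill_weights=None, a=2.0):
    # forecasts: list of (forecaster, time_in_days, probability) tuples.
    # 1) Temporal decay: halve a forecast's weight every `half_life` days.
    # 2) Performance weights: optional per-forecaster multipliers reflecting
    #    track record and updating behavior (assumed supplied by the caller).
    # 3) Recalibration: push the weighted mean toward 0 or 1 by raising
    #    the odds to an exponent a > 1 (one common extremizing form).
    num = den = 0.0
    for forecaster, t, p in forecasts:
        w = 0.5 ** ((now - t) / half_life)
        w *= (skill_weights or {}).get(forecaster, 1.0)
        num += w * p
        den += w
    p_bar = num / den
    odds = (p_bar / (1.0 - p_bar)) ** a
    return odds / (1.0 + odds)
```

For example, a weighted mean of 0.70 extremizes to roughly 0.84 with a = 2, while a weighted mean of 0.50 stays at 0.50.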

Upshot: Probability prediction polls can produce more accurate crowd estimates than prediction markets.

NOTE: This blog entry was prepared by Pavel Atanasov and Angela Minster, based on research conducted by the Good Judgment Project.

[1] Based on research presented in: Mellers, B., Stone, E., Atanasov, P., Rohrbaugh, N., Metz, S. E., Ungar, L., & Tetlock, P. (2015). The psychology of intelligence analysis: Drivers of prediction accuracy in world politics. Journal of Experimental Psychology: Applied, 21(1), 1.

[2] Standardized Brier scores are calculated so that higher scores denote lower accuracy, and the mean score across all forecasters is zero.

[3] Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., Scott, S., Moore, D., Atanasov, P., Swift, S., Murray, T., Tetlock, P.  (2014) Psychological strategies for winning a geopolitical forecasting tournament. Psychological Science, 25(5), 1106-1115.

[4] Mellers, B., Stone, E., Murray, T., Minster, A., Rohrbaugh, N., Bishop, M., Chen, E., Baker, J., Hou, Y., Horowitz, M., Ungar, L & Tetlock, P. (2015). Identifying and Cultivating Superforecasters as a Method of Improving Probabilistic Predictions. Perspectives on Psychological Science, 10(3), 267-281.

[5] Atanasov, P., Witkowski, J., Mellers, B., & Tetlock, P. (In prep). Small, frequent updates en route to accuracy. Working paper.

[6] Based on research presented in: Mellers, B., Stone, E., Atanasov, P., Rohrbaugh, N., Metz, S. E., Ungar, L., & Tetlock, P. (2015). The psychology of intelligence analysis: Drivers of prediction accuracy in world politics. Journal of Experimental Psychology: Applied, 21(1), 1.

[7] Atanasov, P., Rescober, P., Stone, E., Servan-Schreiber, E., Tetlock, P., Ungar, L., & Mellers, B. (2015). Distilling the Wisdom of Crowds: Prediction Markets vs. Prediction Polls. Revision submitted to Management Science.


Regina Joseph: Five Forecasts for a New Year

Rejoice forecasters, for the glad tidings of the season are especially upon us at the end of 2014. News of the Good Judgment Project’s results further infiltrated the media in the last twelve months: after last year’s great coverage in the New York Times, Washington Post, and The Economist, we broke big-time this year thanks in particular to a broadcast featuring Superforecaster Elaine Rich on National Public Radio, as well as hits in The Financial Times, The Wall Street Journal, and even the UK’s notorious tabloid font of all things Kardashian and Bieber, The Daily Mail! The list goes on and on.

As awareness of GJP’s unparalleled quest for accuracy and empirical rigor in forecasting grew in 2014, a very distinct change crept into the prose of the geopolitical infosphere’s usual pundits and prognosticators.

All of a sudden, think tanks, consultancies, and experts everywhere augmented their typically qualitative forecasts of foreign affairs with quantitative commentary and evaluative hindsight. Assigning a probability to a potential event became more prevalent among authorities and bloviators alike. No doubt some of this can be attributed to the rise of a data overflow mined by popular luminaries like Nate Silver and assorted “freakonomists.” But GJP’s quantitative forecasting methodology and the early successes of the superforecasters clearly motivated some talking heads towards greater accountability. We keep score, so now others must too, if only to maintain credibility and relevancy in a metrics-driven world.

For GJP forecasters, year’s end may not necessarily trigger a taking of stock—after all, we’ve still got plenty of open questions that bridge well into 2015 in this, our last season of the ACE tournament. So in these last few hours of 2014, let’s look ahead at five possible areas of change in the coming year:

  1. North Korea: Will 2015 be the year that GJP forecasters see Kim Jong-Un vacate office? The wobbling republic has both China and South Korea watching very carefully. As ROK seeks advantage in bringing the split countries closer, China contemplates its key foreign policy objective of avoiding a unified Korea on its borders.  My forecast for Kim Jong-Un vacating by June 2015: 20%.
  2. Russia: The rouble has hit the skids and Putin is closing in the wagons. His strategy combining information warfare; the imprisonment of opposition figures; cozying up to former Soviet satellite states via bribes and threats; and jingoistic appeals to nationalism and religion has worked to keep the Russian citizenry on side despite crippling Western sanctions. But forecasters may need to determine how long Putin can keep simmering discontent at bay, especially if Russia goes into economic default. My forecast for financial default before June 2015 is high (80%), but Putin vacating office before then is low (3%).
  3. The Middle East: Peace appears further off than ever in the region: Syria’s war continues after 3 long years with no end in sight; Bibi Netanyahu’s mission to make Israel a faith-identified nation has riven the government and derailed Palestinian peace talks for the foreseeable future; Libya and Yemen are unraveling as they get pulled apart by rival factions; and Iraq is struggling to vanquish the Islamic State. But Saudi Arabia’s oil production moves are an attempt to contain and weaken regional rivals, whether in the form of states like Iran or non-state actors like ISIS. Will decisive Sunni triumphs (whether for the oil-producing monarchists of the Gulf Cooperation Council or for Salafist jihadists) or Shia victories (like an Iranian nuclear deal) top headlines in 2015? My forecast for a breakthrough in negotiations between the P5+1 and Iran before June 2015 is on the low side (40%).
  4. Europe: The European Union weathered 2014’s economic doldrums, the rise of the far right and elections in the European Parliament, and managed to stay intact. But could a new government in Greece and populist fatigue over austerity re-heat the potential for a Grexit? My forecast is that it won’t be before June 2015 (1%), but that the announcement of a referendum before the end of 2015 is more probable (65%).
  5. China: China’s new luxury-loving middle classes and princelings may be revving up their cars, but the state itself is decelerating. The spot price of iron ore has crashed; ghost cities threaten to topple the once-frothy real estate boom; and emboldened citizens, especially in large cities and ethnic enclaves, will increase their opposition to the Central Committee. Still, China will continue to vigorously seek resources, whether via Silk Road revivalism in the Central Asian Republics, or via Europe’s Arctic north or port-laden south. Forecasters in 2015 may need to consider if China’s slowdown will hasten cooperation in both foreign and domestic affairs or back it into a confrontational corner. For example, will China enter into TPP negotiations before June 2015? My forecast: 10%.

 Tell us your forecasts for 2015 and Happy New Year to all!

Superforecaster and geopolitical expert Regina Joseph is a consultant to the Good Judgment Project’s question-generation team. The five forecasts in this blog post represent Regina’s personal opinions. Will GJP forecasters agree? We’ll find out in the remainder of Season 4.


Jay Ulfelder on Wisdom of Crowds FTW

I’m a cyclist who rides indoors a fair amount, especially in cold or wet weather. A couple of months ago, I bought an indoor cycle with a flywheel and a power meter. For the past several years, I’d been using the kind of trainer you attach to the back wheel of your bike for basement rides. Now, though, my younger son races, so I wanted something we could both use without too much fuss, and his coach wants to see power data from his home workouts.

To train properly with a power meter, I need to benchmark my current fitness. The conventional benchmark is Functional Threshold Power (FTP), which you can estimate from your average power output over a 20-minute test. To get the best estimate, you need to go as hard as you can for the full 20 minutes. To do that, you need to pace yourself. Go out too hard and you’ll blow up partway through. Go out too easy and you’ll probably end up lowballing yourself.

Once you have an estimate of your FTP, that pacing is easy to do: just ride at the wattage you expect to average. But what do you do when you’re taking the test for the first time?

I decided to solve that problem by appealing to the wisdom of the crowd. When I ride outdoors, I often ride with the same group, and many of those guys train with power meters. That means they know me and they know power data. Basically, I had my own little panel of experts.

Early this week, I emailed that group, told them how much I weigh (about 155 lbs), and asked them to send me estimates of the wattage they thought I could hold for 20 minutes. Weight matters because power covaries with it. What the other guys observe is my speed, which is a function of power relative to weight. So, to estimate power based on observed speed, they need to know my weight, too.

I got five responses that ranged from 300 to 350. Based on findings from the Good Judgment Project, I decided to use the median of those five guesses—314—as my best estimate.
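The numbers make the case for the median nicely. With five hypothetical guesses in the reported 300-350 range (the post gives only the range and the median), a couple of high guesses pull the mean up but leave the median alone:

```python
from statistics import mean, median

# Hypothetical guesses; the post reports only the 300-350 range and a median of 314.
guesses = [300, 310, 314, 335, 350]

print(median(guesses))          # 314 -- the middle value, robust to the extremes
print(round(mean(guesses), 1))  # 321.8 -- pulled upward by the two high guesses
```

This robustness to outliers is one reason the median is a popular default when aggregating a small crowd of point estimates.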

I did the test on Tuesday. After 15 minutes of easy spinning, I did 3 x 30 sec at about 300W with 30 sec easy in between, then another 2 min easy, then 3 min steady above 300W, then 7 min easy, and then I hit it. Following emailed advice from Dave Guttenplan, who sometimes rides with our group, I started out a little below my target, then ramped up my effort after about 5 min. At the halfway point, I peeked at my interval data and saw that I was averaging 310W. With 5 min to go, I tried to up the pace a bit more. With 1 min to go, I tried to dial up again and found I couldn’t go much harder. No finish-line sprint for me. When the 20-minute mark finally arrived, I hit the “interval” button, dialed the resistance down, and spent the next minute or so trying not to barf—a good sign that I’d given it just about all I had.

And guess what the final average was: 314!

Now, you might be thinking I tried to hit that number because it makes for a good story. Of course I was using the number as a guideline, but I’m as competitive as the next guy, so I was actually pretty motivated to outperform the group’s expectations. Over the last few minutes of the test, I was getting a bit cross-eyed, too, and I don’t remember checking the output very often.

This result is also partly coincidence. Even the best power meters have a margin of error of about 2 percent, and that’s assuming they’re properly calibrated. So the best I can say is that my average output from that test was probably around 314W, give or take several watts.

Still, as an applied stats guy who regularly works with “wisdom of crowds” systems, I thought this was a great illustration of those methods’ utility. In this case, the remarkable accuracy of the crowd-based estimate surely had a lot to do with the crowd’s expertise. I only got five guesses, but they came from people who know a lot about me as a rider and whose experience training with power and looking at other riders’ numbers has given them a strong feel for the distribution of these stats. If I’d asked a much bigger crowd who didn’t know me or the data, I suspect the estimate would have missed badly (like this one). Instead, I got just what I needed.

ADMIN NOTE: This post is cross-posted from Jay’s Dart-Throwing Chimp blog, with his permission.


Karen Ruth Adams: “Reflections of a (Female Subject-Matter Expert) Superforecaster”

In August, I attended the Good Judgment Project’s 2014 Superforecaster Conference, where the top 2% of last year’s 7,200 forecasters and top forecasters from previous seasons met with the principal investigators.  We discussed the project’s findings to date, changes for Season 4, and plans for the future.

At the conference, I was struck by both the diversity and lack of diversity among the forecasters.  There was a lot of occupational diversity.  I met people who work in finance, IT, materials science, law, and other commercial sectors.  Yet, considering that the aim of the study is to improve national security forecasting by US intelligence agencies, there was a notable lack of subject-matter experts.  I met just a handful of security scholars from academia and think tanks and policy makers and practitioners from government and non-profits.

I was also surprised to see few women.  It wasn’t that I expected a 50/50 ratio.  I thought women would be 25-33% of forecasters, mirroring the percentage of women among American faculty in political science, security scholars at the International Studies Association, policy analysts and leadership staff at Washington think tanks, and senior US national security and foreign policy officials.  Instead I learned from GJP researcher Pavel Atanasov that at the beginning of Season 3 (Fall 2013), women were just 17% of GJP forecasters.  By the end of the season (Spring 2014), women had dropped out at higher rates (35%) than men (29%).  Among this year’s superforecasters, just 7% are women.

As a woman who has spent decades developing expertise on international relations and human, national, and international security, and as a citizen who would like US security forecasting and policy to improve, this concerns me.  It also concerns GJP’s principal investigators, who have asked forecasters to offer suggestions for improving the mix.  This post is a contribution to that conversation.  I explain why I joined the project and what I’ve done and learned so far.  I also offer some thoughts about what remains to be discovered and improved about gender and expertise among forecasters.

Why I Joined the Project

In March 2011, I received an intriguing email via a listserve of strategic studies scholars.  Bob Jervis, a noted expert on national and international security, was looking for

knowledgeable people to participate in a quite unprecedented study of forecasting sponsored by Intelligence Advanced Research Projects Activity (“IARPA”) and focused on a wide range of political, economic and military trends around the globe.  The goal of this unclassified project is to explore the effectiveness of techniques such as prediction markets, probability elicitation, training, incentives and aggregation that the research literature suggests offer some hope of helping forecasters see at least a little further and more reliably into the future.

Bob was recruiting for the GJP team on behalf of principal investigators Barbara Mellers, Don Moore, and Phil Tetlock.  According to Bob, the “minimum time commitment would be several hours in passing training exercises, grappling with forecasting problems, and updating your forecasting response to new evidence throughout the year.”  The rewards would be “Socratic self-knowledge,” the opportunity to learn and be assessed on “state-of-the-art techniques (training and incentive systems) designed to augment accuracy,” a $150 honorarium, and the opportunity to compete anonymously with the freedom to go public later.  In addition, Bob said he thought it would be fun.

I immediately said yes, for two reasons.  First, I remembered Phil Tetlock from my time as a political science graduate student at UC Berkeley, and I trusted him to run an interesting and high-quality study in which my anonymity would be protected.  That was important to me because I wanted to take the risk of forecasting without worrying about the effects on my scholarly reputation.  After all, in Expert Political Judgment, Phil had shown that experts (highly educated professionals in academia, think tanks, government, and international organizations) weren’t much better at forecasting than “dilettantes” (undergraduates).  Moreover, they had trouble outperforming “dart-throwing chimps” (chance) and generally underperformed computers (extrapolation algorithms).  As an expert in security studies, I saw this as my chance to try to prove him wrong.

Second, as a recently-tenured political science professor, I had begun to expand my research program, focusing less on individual publications and more on my career contribution.  For several years, I had been developing a new framework for studying, teaching, and improving human, national, and international security.  One element of the project is helping students evaluate and explain the historical and current security levels of various actors (individuals, social groups, and states) and predict future security levels.  Thus the opportunity to find out more about forecasting and try my hand as a participant-observer was too good to ignore.

What I’ve Done So Far

Since Fall 2011, I’ve participated on the GJP team in all three years of IARPA’s ACE tournament.  In Season 1, I was in an individual prediction polling group, making individual forecasts on a survey platform with no interaction with other participants.  In Season 2, I was on an interactive individual survey platform, where participants could explain their forecasts and see their own and others’ accuracy ratings (Brier scores).  In Season 3, I was on the Inkling platform, one of two large prediction markets, where participants were given 50,000 “inkles” to buy and sell “stocks” in answers to questions, with probabilities expressed as prices.  We could also make comments and see one another’s scores (earnings).

Over time, my accuracy has improved.  I moved from the top 18% in Season 1 (Brier score of .42) to the top 8% in Season 2 (Brier score of .34), and top 1% in Season 3 (no Brier score because I was in the prediction market, where I more than tripled my “money,” finishing 6th of 693 forecasters).  My best categories have consistently been those in which I have the most expertise:  international relations, military conflict, and diplomacy.

How GJP Has Enhanced My Skills and Confidence

Participating in the study has done what I hoped it would.  It has improved my forecasting skill.  By compelling me to express forecasts in stark probabilistic terms and by using clear and generally fair rules to score them, GJP has given me a laboratory in which to learn how to balance the forecasting skills of “calibration” (understanding base rates) and “discrimination” (identifying exceptions).

Now, when a colleague or reporter asks what I think will happen in an international conflict or international organization, I think more clearly about the theories and facts I’m using to arrive at my answer, the probabilities I assign to various outcomes, and the confidence I wish to express.  In my security class, I model this process for my students and ask them to make their own forecasts.

Participating in the GJP has also increased my confidence.  Like most experts, I used to be reluctant to make point predictions that could run afoul of complexity or chance and be taken out of context.  Moreover, like many professional women, I suffered from both “imposter syndrome” — the feeling that I don’t know enough and will be found out — and the knowledge that women’s qualifications and contributions are systematically discounted.  Thus it never helped to be told that I should bluff, like men.

Thanks to the GJP, I’ve learned I don’t have to pretend to be something I’m not.  I have a good sense of where my expertise lies, where it makes the most difference in improving group accuracy, and when and on what terms I wish to go public with predictions.  I also know it’s not a weakness but a strength to approach forecasting with humility.

As Bob Jervis predicted, participating in the GJP has also been fun.  I don’t worry about being right all of the time.  In Season 3, I was one of the most frequent commenters, revealing my forecasts and logic, and asking for feedback.  When I’m wrong, I have other forecasters to laugh with and learn from.

In August, before the superforecaster conference, I revealed my identity.  That was a surprise to some of my fellow forecasters, who had assumed Fork Aster was a man.

What Remains to be Discovered about Gender and Expertise

Before the superforecaster conference, I wondered if being out as a female subject-matter expert would change the dynamics of my participation.  Would I speak up less often for fear of being dubbed a pointy-headed intellectual?  Would GJP turn into a forum in which women’s comments were discounted or ignored, or in which successful women were deemed unlikable?

My concerns about gender were allayed at the conference.  Although women are just 7% of superforecasters, they were not segregated by choice or default.  Women sat and stood and worked in groups with men.  To me, this shows the value of initial anonymity.  Superforecasters were known to be good forecasters, whether or not they were known to be women.  The women who were there had made a cut based on merit, so they were accepted and confident.

But they weren’t overconfident.  After all, the GJP’s major findings are that forecasters perform best when they understand probability, are open-minded, and are scored for accuracy.  Together, this means superforecasters of all stripes know that perfection is unattainable, there’s a good chance they’re wrong and should listen to other views, and the best way to improve is to put themselves out there to be scored.

My concerns about expertise were allayed in the first month of Season 4.  In the superforecaster market, I’ve been speaking up about as much as I did last year.  Moreover, I’ve found that instead of saying less for fear of being wrong, I’ve been tempted to say more than I can confidently support simply to burnish my credentials.  Thus this year is shaping up to be a test of whether I can remain open-minded despite having my reputation at stake.  With the recent brouhaha about faux experts and experts-for-hire, this will be very interesting indeed.

Like all forecasters, I care a great deal about how “right” and “wrong” are scored.  As someone who is trying to build her confidence, I also care about ratings and rankings.  But I’m not motivated by fake money, and I doubt most subject-matter experts are either.  So this year, I’m looking forward to a new metric, “market contribution,” which will summarize each market forecaster’s contribution to the market’s accuracy (Brier score).  To understand what motivates forecasters of various types, I hope GJP will track via surveys and team and market behavior the extent to which individuals seem to be motivated by problem-solving, competition, social interaction, accuracy, and other goals.

Why and How Female and Expert Participation Should Be Improved

In one field after another, studies have found that groups with more diverse participants and in particular more women make more accurate decisions.  That is reason enough for GJP to redouble its efforts to recruit and retain women.  It also speaks to the importance of preserving occupational diversity.  Yet GJP should also make an effort to recruit more subject-matter experts.  Otherwise, it will be hard to evaluate whether Tetlock’s earlier findings about the overall unreliability of expert political judgment are valid.

Although I’m not an expert on the effects of gender and expertise on participation in scientific studies, I have some thoughts about how GJP could recruit and retain more women and subject-matter experts.

First, it’s important to think about how the recruiting pitch sounds.  The one I got was perfect for me.  It appealed to my expertise and love of learning, my desire to improve US security policy, and my sense of fun.  It also seemed reasonable.  A few hours, a few updates…  no big commitment.  In fact, the requirement has been higher.  In the first three years, it took me about 5 hours per week on top of my regular current events reading to research, answer, and discuss the required 25 questions.  Since women spend more time than men on the “second shift” of family and household work, the time requirement probably depresses female participation and retention rates.  If security scholars and practitioners have heavier work obligations than individuals in the private sector, high time commitments could depress their participation as well.  Since the whole point of expertise is to be good at something in particular instead of everything in general, perhaps GJP should set different participation expectations for subject-matter experts.  It would be more fair to all forecasters, though, either to reduce the time requirements overall or to provide more financial or reputational compensation.

Second, where does the recruiting pitch go?  According to Project Director Terry Murray, the GJP has not made a systematic effort to recruit a diverse pool of forecasters.  Instead, the project has relied on word of mouth by the principal investigators and advisors (most of whom are men) and serendipitous media coverage.  To include more women across the professions, GJP should reach out to interest groups such as the World Affairs Council and skill-building networks such as Lean In.  (For a big bang, GJP could collaborate with Lean In to produce a video about how to improve forecasting skills).  To reach more subject-matter experts, GJP should recruit through professional organizations such as Women in International Security (which has both male and female members), and the international security studies divisions of the American and international political science associations.

Third, what are GJP training materials, questions, and discussions like?  Since women dropped out of Season 3 at higher rates than men, there may be something about the experience itself that’s a turn-off.  As a middle-aged, female security expert, I was used to a lot of the bravado I saw on the GJP boards, and I was willing to live with it because I wanted to learn something.  I also thought my anonymous participation might improve things.  Other women may drop out because they find the exchanges unpleasant or irrelevant, or because they lack the confidence to weigh in.

Still other women may be turned off by some of the recommended reading.  I devoured Kathryn Schulz’s Being Wrong.  But it was a shock to open Daniel Kahneman’s Thinking Fast and Slow and discover that the first chapter features the cognitive challenge posed by a photograph of an angry woman.  Later, it emerges that one of the most debated problems in cognitive science relates to experiments in which people erroneously assume that a woman who lives in Berkeley is more likely to be a feminist and a bank teller than simply a bank teller.  To increase female participation, researchers and forecasters need to think carefully about their language and examples so they don’t evoke what Kahneman refers to as “System 1 errors” in a whole subset of participants.  Although researchers don’t intend for their work to have these effects, if they’re not attentive, it can.

A Continuing Conversation

At the superforecaster conference, many of the male participants asked me how I thought female participation could be improved.  When they found out I’m a political science Ph.D. and professor, they asked me the same thing about including more security scholars.  They were not just being polite.  Over the past three years, as we’ve helped GJP reportedly outperform intelligence analysts, we’ve all learned the value of open-minded thinking, and we know it’s more likely in groups with diverse participants whose individual contributions are heard and valued.

For now, my recommendations for GJP are to review the participation requirements, reach out to organizations and networks populated by women and subject-matter experts, and survey current and past participants about their impressions of the work load and content and their reasons for staying in or leaving the project.

My recommendation to women and subject-matter experts is to give forecasting a shot.  Decide what you want to get out of the project and what kind of participant you want to be.  Then do your best.  See what you learn and what others learn from you.  Forecast anonymously at first, then come out if you like.  I predict a lot of wonky fun – intellectual puzzles, interesting exchanges of ideas, head-to-head competition, some memorable “Aha!” moments, and the pride of knowing that you have contributed in some small way to improving security forecasting.

And you?  What do you suggest?  Let’s continue the conversation.


Karen Ruth Adams (aka Fork Aster) is an associate professor of political science at the University of Montana.

Posted in IARPA ACE Tournament, Recruitment, superforecasters | Tagged | Comments Off on Karen Ruth Adams: “Reflections of a (Female Subject-Matter Expert) Superforecaster”

A Little Summer Reading

Summer means vacation time for many Good Judgment Project forecasters (we’re currently on hiatus between forecasting seasons), but our research team is busily working on plans to make this the best and most useful forecasting season yet!

While you await news about Season 4, which will begin in August, we wanted to bring your attention to two recent articles by GJP investigator Michael C. Horowitz, an associate professor of political science at UPenn, that discuss the Good Judgment Project.

In the first, an article for The National Interest, Professor Horowitz and co-author Dafna Rand of CNAS lay out a case for what they call “The Crisis-Prevention Directorate.” Since trouble around the world is often hard to predict, Horowitz and Rand argue that the National Security Council Staff should create a new crisis-prevention directorate that not only draws on trained personnel from throughout the national-security community, but explicitly draws on Good Judgment Project methodologies to help the President anticipate and head off crises before they happen.

The second article, by Professor Horowitz in Politico, looks at recent analogies that President Obama has made between baseball and American foreign policy. Horowitz uses the lens of Moneyball to assess Obama’s statement that United States foreign policy should be focused on hitting singles rather than home runs. He also describes the Good Judgment Project as essentially the “moneyball” of national security decision-making. And just as one of the challenges described in Moneyball was the integration of sabermetrics into mainstream baseball decision-making, the next challenge for the Good Judgment Project and our sponsors is ensuring the lessons that our forecasters have taught us over the last three years make their way into how the US government thinks about national security decision-making moving forward.

To what extent will the national-security community incorporate lessons from GJP and other quantitative forecasting initiatives over the next 3-5 years? Suggestion: Record your own prediction today and then check five years hence to see how accurate you were. Keeping score is probably the number one “secret of the superforecasters”!
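
If you do keep score, one common yardstick for binary questions (the Brier score, which GJP uses in standardized form to rank forecasters) is easy to compute yourself: it is just the squared gap between your stated probability and what actually happened. A minimal sketch:

```python
def brier(p, outcome):
    """Binary Brier score: squared gap between forecast probability p
    and the outcome (1 if the event happened, 0 if not).
    Lower is better; always answering 0.5 scores 0.25."""
    return (p - outcome) ** 2

# Forecasting 80% on an event that happens scores 0.04;
# the same forecast on an event that doesn't scores 0.64.
hit = brier(0.8, 1)
miss = brier(0.8, 0)
```

Averaging this score across many questions, and comparing it against a constant 0.5 baseline, gives a rough personal track record.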


Jay Ulfelder: Crowds Aren’t Magic

One of my cousins, Steve Ulfelder, writes good mystery novels. He left a salaried writer’s job 13 years ago to freelance and make time to pen those books. In March, he posted this on Facebook:

CONTEST! When I began freelancing, I decided to track the movies I saw to remind myself that this was a nice bennie you can’t have when you’re an employee (I like to see early-afternoon matinees in near-empty theaters). I don’t review them or anything; I simply keep a Word file with dates and titles.

Here’s the contest: How many movies have I watched in the theater since January 1, 2001? Type your answer as a comment. Entries close at 8pm tonight, east coast time. Closest guess gets a WOLVERINE BROS. T-shirt and a signed copy of the Conway Sax novel of your choice. The eminently trustworthy Martha Ruch Ulfelder is holding a slip of paper with the correct answer.

I read that post and thought: Now, that’s my bag. I haven’t seen Steve in a while and didn’t have a clear idea of how many movies he’s seen in the past 13 years, but I do know about Francis Galton and that ox at the county fair. Instead of just hazarding a guess of my own, I would give myself a serious shot at outwitting Steve’s Facebook crowd by averaging their guesses.

After a handful of Steve’s friends had submitted answers, I posted the average of them as a comment of my own, then updated it periodically as more guesses came in. I had to leave the house not long before the contest was set to close, so I couldn’t include the last few entrants in my final answer. Still, I had about 40 guesses in my tally at the time and was feeling pretty good about my chances of winning that t-shirt and book.

In the end, 45 entries got posted before Steve’s 8 PM deadline, and my unweighted average wasn’t even close. The histogram below shows the distribution of the crowd’s guesses and the actual answer. Most people guessed fewer than 300 movies, but a couple of extreme values on the high side pulled the average up to 346.  Meanwhile, the correct answer was 607, nearly one standard deviation (286) above that mean. I hadn’t necessarily expected to win, but I was surprised to see that 12 of the 45 guesses—including the winner at 600—landed closer to the truth than the average did.
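
Out of curiosity about how sensitive such an average is to a few extreme entries, here is a minimal sketch of the arithmetic, using made-up guesses (the actual 45 entries from the Facebook thread aren't reproduced here):

```python
import statistics

# Hypothetical stand-ins for the contest guesses; illustrative only.
guesses = [150, 200, 250, 275, 300, 310, 350, 450, 500, 550, 600, 650, 845, 1200]
truth = 607  # the actual answer Steve was holding

mean = statistics.mean(guesses)      # pulled around by the extreme values
spread = statistics.pstdev(guesses)  # population standard deviation

# How many individual guesses landed closer to the truth than the mean did?
closer_than_mean = sum(abs(g - truth) < abs(mean - truth) for g in guesses)
```

With these toy numbers the mean lands near 474 and several individual guesses beat it, the same qualitative pattern as in the real contest, where 12 of 45 guesses were closer than the crowd average.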

I read the results of my impromptu experiment as a reminder that crowds are often smart, but they aren’t magic. Retellings of Galton’s experiment sometimes make it seem like even pools of poorly informed guessers will automatically produce an accurate estimate, but, apparently, that’s not true.

As I thought about how I might have done better, I got to wondering if there was something about Galton’s crowd that made it particularly powerful for his task. Maybe we should expect a bunch of county fair–goers in nineteenth century England to be good at guessing the weight of farm animals. Still, the replication of Galton’s experiment under various conditions suggests that domain knowledge helps, but it isn’t essential. So maybe this was just an unusually hard problem. Steve has seen an average of nearly one movie in theaters each week for the past 13 years. In my experience, that’s pretty extreme, so even with the hint he dropped in his post about being a frequent moviegoer, it’s easy to see why the crowd would err on the low side. Or maybe this result was just a fluke, and if we could rerun the process with different or larger pools, the average would usually do much better.

Whatever the reason for this particular failure, though, the results of my experiment also got me thinking again about ways we might improve on the unweighted average as a method of gleaning intelligence from crowds. Unweighted averages are a reasonable strategy when we don’t have reliable information about variation in the quality of the individual guesses, but that’s not always the case. For example, if Steve’s wife or kids had posted answers in this contest, it probably would have been wise to give their guesses more weight on the assumption that they knew better than acquaintances or distant relatives like me.
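
As a toy illustration of that idea (the names and weights below are invented for this sketch, not drawn from the actual contest), a weighted average simply multiplies each guess by a credibility weight before pooling:

```python
# Invented guesses and credibility weights, for illustration only.
# Household members are assumed to know Steve's habits best.
guesses = {"spouse": 550, "kid": 500, "cousin": 346, "acquaintance": 250}
weights = {"spouse": 4.0, "kid": 3.0, "cousin": 1.0, "acquaintance": 1.0}

weighted = sum(weights[k] * guesses[k] for k in guesses) / sum(weights.values())
unweighted = sum(guesses.values()) / len(guesses)
```

Here the household members' guesses dominate, pulling the aggregate toward their presumably better-informed numbers; the hard part in practice is knowing what the weights should be.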

Figuring out smarter ways to aggregate forecasts is also an area of active experimentation for the Good Judgment Project (GJP), and the results so far are encouraging. The project’s core strategy involves discovering who the most accurate forecasters are and leaning more heavily on them. I couldn’t do this in Steve’s single-shot contest, but GJP gets to see forecasters’ track records on large numbers of questions and has been using them to great effect. In the recently ended Season 3, GJP’s “superforecasters” were grouped into teams and encouraged to collaborate, and this approach has proved very effective. In a paper published this spring, GJP has also shown that it can do well with nonlinear aggregations derived from a simple statistical model that adjusts for systematic bias in forecasters’ judgments. Team GJP’s bias-correction model beats not only the unweighted average but also a number of widely used and more complex nonlinear algorithms.
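
One simple member of that family of nonlinear aggregators averages probability forecasts on the log-odds scale and then pushes the result away from 0.5, to counteract the tendency of pooled forecasts to be under-confident. A minimal sketch, with an illustrative exponent rather than the value GJP fits from data:

```python
import math

def extremized_mean(probs, a=2.5):
    """Average probability forecasts on the log-odds scale, then multiply
    by a > 1 to push the aggregate away from 0.5 ("extremizing").
    The exponent a is illustrative; in practice it is fit to past data."""
    logits = [math.log(p / (1 - p)) for p in probs]
    mean_logit = sum(logits) / len(logits)
    return 1 / (1 + math.exp(-a * mean_logit))

forecasts = [0.60, 0.70, 0.65, 0.75]
aggregate = extremized_mean(forecasts)
# The extremized aggregate is more confident than the simple average.
```

With these inputs the simple average is 0.675, while the extremized aggregate lands around 0.86, a more decisive answer to the same question.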

Those are just a couple of the possibilities that are already being explored, and I’m sure people will keep coming up with new and occasionally better ones. After all, there’s a lot of money to be made and bragging rights to be claimed in those margins. In the meantime, we can use Steve’s movie-counting contest to remind ourselves that crowds aren’t automatically as clairvoyant as we might hope, so we should keep thinking about ways to do better.


Jay Ulfelder on the Rigor-Relevance Tradeoff

I came to the Good Judgment Project (GJP) two years ago, in Season 2, as a forecaster, excited about contributing to an important research project and curious to learn more about my skill at prediction. I did pretty well at the latter, and GJP did very well at the former. I’m also a political scientist who happened to have more time on my hands than many of my colleagues, because I work as an independent consultant and didn’t have a full plate at that point. So, in Season 3, the project hired me to work as one of its lead question writers.

Going into that role, I had anticipated that one of the main challenges would be negotiating what Phil Tetlock calls the “rigor-relevance trade-off”—finding questions that are relevant to the project’s U.S. government sponsors and can be answered as unambiguously as possible. That forecast was correct, but even armed with that information, I failed to anticipate just how hard it often is to strike this balance.

The rigor-relevance trade-off exists because most of the big questions about global politics concern latent variables. Sometimes we care about specific political events because of their direct consequences, but more often we care about those events because of what they reveal to us about deeper forces shaping the world. For example, we can’t just ask if China will become more cooperative or more belligerent, because cooperation and belligerence are abstractions that we can’t directly observe. Instead, we have to find events or processes that (a) we can see and (b) that are diagnostic of that latent quality. For example, we can tell when China issues another statement reiterating its claim to the Senkaku Islands, but that happens a lot, so it doesn’t give us much new information about China’s posture. If China were to fire on Japanese aircraft or vessels in the vicinity of the islands—or, for that matter, to renounce its claim to them—now that would be interesting.

It’s tempting to forego some rigor to ask directly about the latent stuff, but it’s also problematic. For the forecast’s consumers, we need to be able to explain clearly what a forecast does and does not cover, so they can use the information appropriately. As forecasters, we need to understand what we’re being asked to anticipate so we can think clearly about the forces and pathways that might or might not produce the relevant outcome. And then there’s the matter of scoring the results. If we can’t agree on what eventually happened, we won’t agree on the accuracy of the predictions. Then the consumers don’t know how reliable those forecasts are, the producers don’t get the feedback they need, and everyone gets frustrated and demotivated.

It’s harder to formulate rigorous questions than many people realize until they try to do it, even on things that seem like they should be easy to spot. Take coups. It’s not surprising that the U.S. government might be keen on anticipating coups in various countries for various reasons. It is, however, surprisingly hard to define a “coup” in such a way that virtually everyone would agree on whether or not one had occurred.

In the past few years, Egypt has served up a couple of relevant examples. Was the departure of Hosni Mubarak in 2011 a coup? On that question, two prominent scholarly projects that use similar definitions to track coups and coup attempts couldn’t agree. Where one source saw an “overt attempt by the military or other elites within the state apparatus to unseat the sitting head of state using unconstitutional means,” the other saw the voluntary resignation of a chief executive due to a loss of his authority and a prompt return to civilian-led government. And what about the ouster of Mohammed Morsi in July 2013? On that, those academic sources could readily agree, but many Egyptians who applauded Morsi’s removal—and, notably, the U.S. government—could not.

We see something similar on Russian military intervention in Ukraine. Not long after Russia annexed Crimea, GJP posted a question asking whether or not Russian armed forces would invade the eastern Ukrainian cities of Kharkiv or Donetsk before 1 May 2014. The arrival of Russian forces in Ukrainian cities would obviously be relevant to U.S. policy audiences, and with Ukraine under such close international scrutiny, it seemed like that turn of events would be relatively easy to observe as well.

Unfortunately, that hasn’t been the case. As Mark Galeotti explained in a mid-April blog post,

When the so-called “little green men” deployed in Crimea, they were very obviously Russian forces, simply without their insignia. They wore Russian uniforms, followed Russian tactics and carried the latest, standard Russian weapons.

However, the situation in eastern Ukraine is much less clear. U.S. Secretary of State John Kerry has asserted that it was “clear that Russian special forces and agents have been the catalyst behind the chaos of the last 24 hours.” However, it is hard to find categorical evidence of this.

Even evidence that seemed incontrovertible when it emerged, like video of a self-proclaimed Russian lieutenant colonel in the Ukrainian city of Horlivka, has often been debunked.

This doesn’t mean we were wrong to ask about Russian intervention in eastern Ukraine. If anything, the intensity of the debate over whether or not that’s happened simply confirms how relevant this topic was. Instead, it implies that we chose the wrong markers for it. We correctly anticipated that further Russian intervention was possible if not probable, but we—like many others—failed to anticipate the unconventional forms that intervention would take.

Both of these examples show how hard it can be to formulate rigorous questions for forecasting tournaments, even on topics that are of keen interest to everyone involved and seem like naturals for the task. In an ideal world, we could focus exclusively on relevance and ask directly about all the deeper forces we want to understand and anticipate. As usual, though, that ideal world isn’t the one we inhabit. Instead, we struggle to find events and processes whose outcomes we can discern that will also reveal something telling about those deeper forces at play.



Statistical models vs. human judgments: The Nate Silver controversy seen through a GJP lens

Do statistical models always outperform human forecasters?

The storm of controversy over Nate Silver’s new 538 website has transformed what had been a question of largely academic interest into a staple of water-cooler conversation. Critics such as Paul Krugman complain that Silver improperly dismisses the role of human expertise in forecasting, which Krugman views as essential to developing meaningful statistical models.

The Good Judgment Project’s research team has given the human vs. machine debate a lot of thought. GJP’s chief data scientist Lyle Ungar shared his views in a recent interview. According to Ungar,

The bottom line is that if you have lots of data and the world isn’t changing too much, you can use statistical methods. For questions with more uncertainty, human experts become more important.

Ungar sees the geopolitical questions posed in the ACE tournament as well suited to human forecasting:

Some problems, like the geo-political forecasting we are doing, require lots of collection of information and human thought. Prediction markets and team-based forecasts both work well for sifting through the conflicting information about international events. Computer models mostly don’t work as well here – there isn’t a long enough track record of, say, elections or coups in Mali to fit a good statistical model, and it isn’t obvious what other countries are ‘similar.’

GJP does see a role for statistical models in geopolitical forecasting. For example, we use both prediction markets and statistical aggregation techniques to combine the judgments of our forecasters and generate probability estimates that typically are more accurate than simple unweighted averages of human predictions.

Moreover, for many geopolitical forecasting questions, we see promise in a human-machine hybrid approach that combines the best strengths of human judgments and statistical models. (Think “Kasparov plus Deep Blue” in the chess context.) We hope to take initial steps to test this approach in Season 4 of the IARPA ACE forecasting tournament. Stay tuned for further developments!



Reflections on Season 3 Thus Far (part 1)

As 2013 draws to a close, so does the first half of Season 3 of the IARPA forecasting tournament. This seems as good a time as any to review some of the highs and lows of this tournament season. We begin our review with a look at IFP #1318, which closed on Christmas Day and—in our view—represents one of the biggest surprises thus far in Season 3.

A surprising visit to Yasukuni

In mid-November, we launched a question asking whether Japan’s Prime Minister Shinzo Abe would visit the controversial Yasukuni Shrine before the year’s end. Just a few days before the question was scheduled to expire, Abe did in fact visit the shrine—something that few forecasters anticipated.

A post-visit comment by one team forecaster reflected the general amazement: “Jaw-drop! I have to wonder if this was telegraphed and we missed it, or it really was a spontaneous decision.”

Forecasters who posted good scores on this question seem to have taken Prime Minister Abe at his word. As the Japan News reported, the Prime Minister had pledged during his most recent election campaign to visit the shrine while in office. Nonetheless, over the past year, Abe had avoided commitment to any specific date for a visit. A few weeks before the question launched, there was a brief flurry of news stories in which an Abe aide indicated his expectation that the visit would occur before the end of the year, though the report was downplayed by another Japanese official. In retrospect, the aide seems to have been providing accurate intelligence: Abe’s visit occurred on the first anniversary of his current administration and before the end of the year.

Those who failed to anticipate Abe’s shrine visit can take comfort from a poll published in a major Japanese newspaper on December 24th, based on a question posed to over 1,000 Japanese households on December 21st and 22nd:

Of the respondents, 48 percent appreciated Prime Minister Abe’s decision to refrain from visiting Tokyo’s controversial Yasukuni Shrine where Class A war criminals are enshrined along with the war dead since he took office, as compared with 37 percent who do not appreciate his decision.

The poll itself reflects that there was great uncertainty about what Abe would do only a few days before his visit.

But, forecasters who steadily decreased their probability estimates as the end-date for this question approached may want to think twice before applying this strategy to future questions that have reasonable potential for surprise endings. At least one superforecaster had noted the risk that the Yasukuni Shrine question could be this season’s “Monti”—referencing a question that caught forecasters off-guard almost exactly a year earlier, when Italy’s then-Prime Minister followed through on his stated intention to resign, rather than stay in office as leader of a minority government, after Berlusconi withdrew his support from the Monti-led coalition. And, of course, there is the now-infamous question “1007” from Season 1, which asked whether a “lethal confrontation” would occur in the South China Sea. That question resolved as “Yes” near the scheduled closing date when a Chinese fishing boat captain fatally stabbed a South Korean coast-guard official who had boarded the fishing boat.

In these cases, “surprise” outcomes occurred late in a question’s lifespan, when many forecasters had been tapering their predictions toward a 0% likelihood that the event would occur. And, in all of these cases, the outcome reflected the actions of a single individual who was able to produce a “Yes” outcome with little or no advance news coverage that would have alerted those following the question closely that the event was about to occur.

If there is a lesson to be learned here, it seems to be: Take a moment to think about the ways that an event could occur. If the event of interest does not require elaborate preparations beforehand and can be accomplished by one or two people, with little fanfare, it may hold more potential for a surprise outcome than our first impression would lead us to assume. This is particularly true when there is evidence early in the lifespan of a question suggesting that the event might occur, followed by no news, as opposed to contradictory news, later in the lifespan of a question.


GJP in the News, Again (and Again)

The Economist’s The World in 2014 issue just hit the newsstands, focusing international attention on questions that Good Judgment Project forecasters consider on a daily basis: What geopolitical outcomes can we expect over the next 12-14 months?

One outcome that our forecasters may not have anticipated, though, is that the Good Judgment Project itself would be featured in this annual compendium of forecasts. An article co-authored by GJP’s Phil Tetlock and journalist Dan Gardner poses the question “Who’s good at forecasts?” and offers insight into “How to sort the best from the rest.” Participants in the ongoing ACE forecasting tournament sponsored by the Intelligence Advanced Research Projects Activity (IARPA) will not be surprised to learn that Tetlock and Gardner believe that such forecasting tournaments are the best way to compare forecasting ability.

Tetlock and Gardner’s brief article does not address a second benefit of such tournaments: participants can improve their forecasting skills through a combination of training and practice, with frequent feedback on their accuracy. Combining training and practice with what GJP’s research suggests is a stable trait of forecasting skill seems to produce the phenomenon that GJP calls “superforecasters.”

GJP’s top forecasters have been so accurate that, according to a recent report by Washington Post columnist David Ignatius, they even outperformed the forecasts of intelligence analysts who have access to classified information.

In a brief video clip on The Economist’s web site, Phil Tetlock notes that “People are not as good at anticipating the future as they think they are.” If you would like to test your own forecasting skills against GJP’s best forecasters, we invite you to register to become a Good Judgment Project forecaster. Because of the strong demand to participate, we expect to open new slots soon as part of the ongoing Season 3 tournament. Who knows? Maybe you have the skills of which superforecasters are made!
