Scorekeeping and Its Limits
Why headline scores alone leave interpretation incomplete—in polling and in market research.
Scorekeeping is the dominant mode of public polling and most market research: report the headline number, add a few cross-tabs, and offer narrative commentary. A Gallup release reports that 38 percent of U.S. adults approve of the president's job performance and suggests that immigration may be weighing on the rating—but does not show, under a disclosed explanatory model, how much the score would change if immigration evaluations improved or worsened. A concept test reports a top-2-box purchase intent score of 62 percent but doesn't show which features are driving it or which segments are closest to converting.
The problem is not that scores are wrong. It's that they're incomplete. As the pollster Lou Harris observed in 1961, presenting findings as "a series of relatively isolated headlines" stops short of "the full reporting that gives a reader an understanding of what forces really are at work." Stopping there leaves interpretation to outside commentators who are not bound by the same data or the same disclosure rules.
The site you're on is built around one answer to that problem: couple the score with a disclosed explanatory model, expressed in the score's own units, so that citizens, journalists, researchers, and clients can see not just where the headline stands, but what is moving it—and where that account is vulnerable.
Binary outcomes dominate more than most practitioners realize. Across the major service lines of market research and public opinion research, the headline metric is almost always a binary one:
- Brand tracking: Is the respondent aware of the brand, or not? Does she consider it? Would he recommend it?
- Concept testing: Would the respondent "definitely or probably" buy? That top-2-box purchase intent score is the core deliverable.
- Copy testing: Did the ad communicate the intended message? Did it shift brand perception in the desired direction?
- CX research: Is the customer a promoter (top-2-box on NPS or overall satisfaction), or not?
- Public opinion polling: Does the respondent approve or disapprove? Support or oppose?
The top-2-box score—collapsing a rating scale into "top two responses vs. everything else"—is one of the most widely used metrics in market research precisely because it forces a binary frame: you're either in the key group or you're not. That framing is intuitive, actionable, and easy to track over time.
What's puzzling is that both fields have embraced the binary outcome while largely ignoring the analytic method purpose-built for it. Both have accepted the destination but declined the most reliable route to get there.
Salience–influence divergence is the gap between what stands out in a descriptive record—what dominates cross-tabs and commentary—and what actually moves the headline once competing predictors are evaluated jointly in an explanatory model.
The 1963 Harris–Newsweek survey offers a concrete example. Military defense and Kennedy's handling of the Berlin situation looked highly favorable in the descriptive record: roughly three in four likely voters rated his handling of them as excellent or very good, making them prominent in any cross-tab view. Yet once those evaluations were assessed alongside other foreign-policy measures, the economy, and civil rights, neither produced a meaningful shift in the approval headline. Their descriptive visibility did not carry over into headline-unit movement.
This is not a data anomaly. It is what happens when multiple predictors compete for explanatory weight together. A cross-tab cannot show it. An explanatory model can—and that is the analytical core of what this site demonstrates.
Logistic regression is, as statisticians Andrew Gelman and Jennifer Hill put it, "the standard way to model binary outcomes." It models the probability of belonging to one group versus another—top-2-box versus not, approve versus disapprove—while jointly accounting for the effects of multiple predictor variables. That joint accounting is what separates it from crosstabulation, which can only examine one or two variables at a time before cell sizes become too small to trust.
The output—a predicted probability for any combination of predictor values—translates naturally into the percentage-point language that pollsters and market researchers already use. A top-2-box score is a probability. A presidential approval rating is a probability. Logistic regression simply makes those probabilities more precise, more defensible, and more actionable by expressing them as a function of measured predictors evaluated together.
That is why logistic regression is a natural fit for the framework this site demonstrates—not because it is the only possible method, but because it produces output that can be translated directly into the score's own units on release day.
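To make the translation concrete, here is a minimal sketch in Python using statsmodels. The file name, the variable names (approve, khrushchev_rating, and so on), and the coding are hypothetical stand-ins rather than the 1963 dataset's actual schema; the point is only that a fitted logistic model returns predicted probabilities in the same units as the headline.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file and column names; the real 1963 dataset uses its own coding.
df = pd.read_csv("harris_1963.csv")
y = df["approve"]                                   # 1 = approve, 0 = disapprove
X = pd.get_dummies(
    df[["khrushchev_rating", "economy_rating", "civil_rights_rating"]],
    drop_first=True,                                # one reference level per predictor
).astype(float)
X = sm.add_constant(X)

model = sm.Logit(y, X).fit(disp=0)

# Predicted probabilities are already in headline units; their sample average
# is the model-implied baseline that sits next to the published score.
baseline = model.predict(X).mean()
print(f"Model-implied baseline approval: {baseline:.1%}")
```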
The difference is more fundamental than it might appear. A crosstabulation works entirely on observed data—it reports what was actually recorded among people who already held a given combination of views. It might show that 44 percent of men and 36 percent of women are enthusiastic about driverless vehicles—but only because respondents who are men and respondents who are women already existed in the sample. The cross-tab cannot tell you what the enthusiasm rate would be under conditions that don't yet exist in the data.
The simulators on this site do something categorically different: they change the data. The All-or-Nothing Simulator sets a chosen predictor to the same value for everyone—for example, "what if all likely voters rated Kennedy's handling of Khrushchev as Poor?"—and recomputes the model-implied headline under those counterfactual conditions. The Fine-Tuning Simulator redistributes response frequencies within each predictor, approximating how the headline would shift if public opinion moved in a specific direction. Neither operation is possible with a cross-tab, because neither is asking about people who already exist in the data.
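Continuing the sketch above, the All-or-Nothing operation amounts to a few lines: copy the design matrix, overwrite one predictor's dummy columns so every respondent sits in the chosen category, and average the new predictions. The column names remain hypothetical.

```python
# Counterfactual: every respondent rates Kennedy's handling of Khrushchev "Poor".
X_cf = X.copy()
khrush_cols = [c for c in X_cf.columns if c.startswith("khrushchev_rating_")]
X_cf[khrush_cols] = 0.0                 # clear the observed ratings
X_cf["khrushchev_rating_Poor"] = 1.0    # assumes "Poor" is not the reference level

scenario = model.predict(X_cf).mean()
print(f"Scenario headline: {scenario:.1%} "
      f"({(scenario - baseline) * 100:+.1f} points vs. baseline)")
```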
Cross-tabs also struggle when four or more variables are involved—cell sizes shrink to the point where results are unstable, and there is no way to hold competing predictors constant simultaneously. Scenario simulation produces credible estimates for any scenario, along with two-sided 95 percent confidence intervals, because it operates on a fitted explanatory model rather than raw cell counts.
One prominent public opinion researcher, after reviewing one of our simulators, described the output as "predictive crosstab analysis." That captures the family resemblance—but the underlying operation is counterfactual, not descriptive.
Logistic regression is used in some research contexts—discrete choice modeling, attribution modeling, and propensity scoring, for example. (George Terhanian introduced propensity scoring to the market research community in the late 1990s, as described in Public Opinion Quarterly.) But it is used only rarely for core work in brand tracking, concept testing, copy testing, customer experience research, and public polling. Three reasons account for most of the gap:
- Researchers have followed academic convention—focusing on the sign and significance of logit coefficients. As sociologist Richard Williams has noted, that convention gives "little emphasis on the substantive and practical significance of the findings." Coefficients in log-odds units are not meaningful to most clients or citizens.
- Major statistical software packages don't produce scenario contrasts in headline units through pre-packaged procedures. Getting there requires custom programming that most research firms haven't built—and a browser-based interface that makes the results accessible to non-technical audiences.
- A scenario contrast cannot be summarized with a single number the way a linear regression coefficient can. Its size depends on where respondents sit on the probability curve—effects are smaller near 0 or 1 than in the middle. Some statisticians have called this "intractable." The tools on this site treat it as expected behavior rather than a defect: what the model shows near the extremes is often real.
The Mirror: A Coupled Release in Practice
What the framework proposes, how the 1963 case demonstrates it, and what each tool does.
The Mirror is the public-facing component of a coupled release: a traditional press release linked, on the same day results appear, to a single browser-based object containing the score, one disclosed explanatory model expressed in headline units, two vulnerability checks, and the tools needed to inspect, challenge, and extend that account.
This site is a proof-of-concept demonstration of that framework, grounded in a real dataset: a 1963 Harris–Newsweek national political poll that measured Kennedy's presidential approval alongside foreign-policy, economic, and civil-rights evaluations. The 1963 case was chosen deliberately as a demanding test—a legacy survey not designed for this purpose. If even that data can support a meaningful coupled release, then a contemporary poll built expressly for it should do so more clearly and under better rules.
The framework was proposed in an academic paper, Beyond Isolated Headlines, submitted to Public Opinion Quarterly. The site is its companion—a working demonstration of what the paper argues is now technically and practically feasible.
Under the framework proposed in the paper, a coupled release contains five elements, all issued on the same day as one disclosed public object:
- The score — the published headline percentage (e.g., 57% Kennedy approval), aligned with the explanatory model's baseline.
- One explanatory model in headline units — scenario contrasts showing how the score changes when measured predictors shift, expressed as predicted percentages and percentage-point differences with two-sided 95% confidence intervals.
- An omitted-variable check — screening for variables measured but excluded from the model, and for hypothetical factors the questionnaire never captured.
- A nonresponse check — showing how sensitive the headline is to explicit assumptions about who did not respond.
- The tools to inspect, challenge, and extend — the Mirror modules described below.
Transparency and engagement are the point—not merely that the release is checkable, which is a baseline expectation once materials are public, but that it becomes common analytic ground on the day it appears.
The Survey Explorer is the descriptive starting point of the Mirror. It lets users select variables from the 1963 Harris–Newsweek dataset—spanning policy evaluations, candidate matchups, demographics, and issue-specific measures—and inspect the resulting pattern through weighted frequencies, cross-tabs, and correlations. Results can be pinned within a session for side-by-side comparison.
Its role is orienting rather than explanatory. The Explorer shows what stands out in the observed record, but it does not jointly condition on other predictors—and that limit is stated explicitly. The move from the Explorer to the Model Builder is precisely the move from "what looks prominent in the cross-tabs" to "what actually moves the headline once predictors compete together." That gap is where salience–influence divergence lives.
The Model Builder is the Mirror's core explanatory module. It fits the disclosed explanatory model used for the paper's scenario reporting and lets users build and compare alternative specifications from the same collected data. For each declared model it generates a corresponding scenario interface, so results are reported as predicted percentages and percentage-point shifts—not coefficients.
The Builder supports three binary outcomes: presidential approval, vote intention (Kennedy vs. Goldwater), and support for a personal income tax cut. Its outputs include the baseline predicted percentage with two-sided 95 percent uncertainty, fit and separation summaries, a calibration assessment, and predicted margins by predictor level.
The Auto-Build option runs forward stepwise selection across available variables and reports the full selection path. That model is labeled as optimized for discrimination rather than as the published explanatory model—a useful contrast showing what a fit-seeking specification looks like versus the disclosed one. Models dominated by vote-intention items may discriminate better while offering less explanatory purchase, because near-synonyms of the outcome can improve fit without clarifying what is actually moving it.
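For readers who want the mechanics, the sketch below shows one way to run a forward stepwise pass over dummy-coded predictor blocks. The Auto-Build's actual entry and stopping criteria are not spelled out in this section; this version scores candidates by AIC purely as a stand-in.

```python
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(y, candidate_blocks):
    """Greedy forward selection over predictor blocks, scored here by AIC.

    candidate_blocks: dict mapping a predictor's name to its dummy-coded DataFrame.
    Returns the selection path as a list of (predictor, AIC) pairs.
    """
    selected, path, best_aic = [], [], float("inf")
    while True:
        trials = []
        for name, block in candidate_blocks.items():
            if name in selected:
                continue
            cols = pd.concat([candidate_blocks[s] for s in selected] + [block], axis=1)
            fit = sm.Logit(y, sm.add_constant(cols)).fit(disp=0)
            trials.append((fit.aic, name))
        if not trials:
            break
        aic, name = min(trials)
        if aic >= best_aic:            # no remaining candidate improves the criterion
            break
        best_aic = aic
        selected.append(name)
        path.append((name, round(aic, 1)))
    return path
```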
The Builder also hosts the Variable-Ranking Tool and Synthetic-Variable Sensitivity Module, described separately below.
The Variable-Ranking Tool (accessible within the Model Builder) addresses one kind of omission: variables measured in the questionnaire but left out of the published explanatory model. It ranks them using two signals—the strength of their association with the outcome and their distinctiveness from predictors already in the model. Variables that largely duplicate what the model already captures are penalized; those combining a strong outcome association with low redundancy rank highest.
Its output is a ranked shortlist with association and redundancy indicators and brief measurement guidance. It is a screening device, not an explanatory verdict. A high rank signals possible incremental value—not guaranteed improvement once the variable is formally modeled.
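The tool's exact scoring rule is not reproduced here, but the two-signal logic can be sketched simply: measure each omitted variable's association with the outcome, penalize its overlap with predictors already in the model, and rank by the net score. This version uses plain correlations and assumes numerically coded variables; it is illustrative, not the site's implementation.

```python
import numpy as np
import pandas as pd

def rank_omitted(y, included: pd.DataFrame, omitted: pd.DataFrame) -> pd.DataFrame:
    """Screen measured-but-unmodeled variables (illustrative scoring rule only).

    association: |corr| with the binary outcome.
    redundancy:  max |corr| with any predictor already in the model.
    score:       association * (1 - redundancy).
    """
    rows = []
    for col in omitted.columns:
        association = abs(np.corrcoef(omitted[col], y)[0, 1])
        redundancy = max(abs(np.corrcoef(omitted[col], included[c])[0, 1])
                         for c in included.columns)
        rows.append({"variable": col, "association": association,
                     "redundancy": redundancy,
                     "score": association * (1 - redundancy)})
    return pd.DataFrame(rows).sort_values("score", ascending=False)
```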
The Synthetic-Variable Sensitivity Module (also within the Model Builder) addresses a harder omission problem: not a measured variable left out of the model, but an important factor the questionnaire never captured at all. Users specify a hypothetical missing factor—its target correlation with the outcome, its maximum overlap with existing predictors, and its number of response categories. The module constructs that factor iteratively and shows how diagnostics and scenario contrasts change with and without it.
This is a stress test, not a finding about what was "actually missing." If the stated targets cannot be met under the overlap constraint, the shortfall is flagged rather than silently relaxed. The purpose is to ask: how much does the published interpretation depend on the absence of a factor of this kind?
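The module's construction algorithm is iterative and is not reproduced here, but a single pass of the idea can be sketched: blend the outcome signal with noise that has been purged of the existing predictors, check the achieved correlation and overlap against the stated targets, and flag any shortfall. X is assumed to be a numeric predictor matrix (a NumPy array); everything else is a stand-in.

```python
import numpy as np

def synthetic_factor(y, X, target_r=0.30, max_overlap=0.20, n_categories=4, seed=0):
    """One illustrative pass at constructing a hypothetical missing factor."""
    rng = np.random.default_rng(seed)
    y_std = (y - y.mean()) / y.std()

    # Noise purged of the existing predictors, so overlap with them stays low.
    noise = rng.standard_normal(len(y))
    beta, *_ = np.linalg.lstsq(X, noise, rcond=None)
    noise = noise - X @ beta
    noise = (noise - noise.mean()) / noise.std()

    z = target_r * y_std + np.sqrt(1 - target_r**2) * noise
    achieved_r = np.corrcoef(z, y)[0, 1]
    overlap = max(abs(np.corrcoef(z, X[:, j])[0, 1]) for j in range(X.shape[1]))
    if overlap > max_overlap or abs(achieved_r - target_r) > 0.05:
        print("Targets not met under the overlap constraint; flag, do not relax.")

    # Cut into ordered response categories at the quantiles.
    cuts = np.quantile(z, np.linspace(0, 1, n_categories + 1)[1:-1])
    return np.digitize(z, cuts)
```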
The All-or-Nothing Simulator applies one category choice per predictor to the full sample and recomputes the model-implied headline percentage. It is designed to show the implications of maximally uniform assumptions—what the approval score would become if everyone evaluated Khrushchev as "Poor," for example—and to make extrapolation risk visible when scenarios move far from observed support.
Outputs include the scenario percentage, the percentage-point change from baseline, a two-sided 95 percent interval, and an illustrative population-count translation. Because it compounds assumptions across predictors, the interval reflects model uncertainty under the stated routine, not the plausibility of the scenario itself. Population counts are scale cues, not forecasts. The module also includes an Explore Each Factor view that varies one predictor at a time from the current scenario position and ranks the resulting one-change shifts.
Where the All-or-Nothing Simulator sets everyone to a single category, the Fine-Tuning Simulator lets users redistribute response frequencies within each predictor. Instead of asking "what if everyone evaluated Khrushchev as Poor?", it asks "what if the share evaluating Khrushchev as Poor increased from 9 percent to 20 percent, with the difference drawn from the 'Only fair' group?" That is a more realistic scenario—closer to how public opinion actually shifts in practice.
Outputs include the updated scenario percentage, the shift from baseline, a two-sided 95 percent interval, a population-count translation, and calibration diagnostics for the redistributed scenario. Scenarios can be saved within the session for comparison. As assumptions move farther from the observed distributions, reliability warnings become more prominent rather than remaining implicit.
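Again continuing the first sketch, one way to approximate a Fine-Tuning scenario is to reweight respondents so the predictor's marginal distribution matches the target and then take the weight-averaged prediction. The simulator's internal routine may differ; the category labels and shares below follow the hypothetical coding used earlier.

```python
import numpy as np

# Target: the "Poor" share rises to 20%, with the increase drawn from "Only fair".
levels = df["khrushchev_rating"]
observed = levels.value_counts(normalize=True)

target = observed.copy()
moved = 0.20 - observed["Poor"]
target["Poor"] = 0.20
target["Only fair"] = observed["Only fair"] - moved

# Respondents keep their own answers; only their weights change.
w = levels.map(target / observed).to_numpy()
scenario = np.average(model.predict(X), weights=w)
print(f"Fine-tuned scenario: {scenario:.1%} "
      f"({(scenario - baseline) * 100:+.1f} points vs. baseline)")
```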
The Non-Response Test operates on the published headline directly rather than on the fitted model. Users specify how many respondents to replace and assign an assumed outcome rate to the replacement group. The module simulates the adjusted headline and benchmarks the shift against the poll's reported margin of error—the quantity readers already encounter in public releases.
For example: replacing 73 percent of the 1963 sample with a group at 26 percent approval lowers the headline from 57 percent to 34 percent—a shift more than eight times the sampling error. That scenario may not reflect actual nonresponse in 1963, but the exercise does what statistician Howard Wainer argued adjustments should do: force explicit confrontation with missingness assumptions and show, in headline units, whether they would materially alter the released score.
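The arithmetic behind that example is a simple blend, assuming the retained respondents keep the observed approval rate (the module's simulation converges on the same expected value):

```python
def adjusted_headline(observed_rate, replaced_share, assumed_rate):
    """Blend the retained sample (at its observed rate) with the replacement group."""
    return (1 - replaced_share) * observed_rate + replaced_share * assumed_rate

# The 1963 example: replace 73% of the sample with a group at 26% approval.
print(adjusted_headline(0.57, 0.73, 0.26))   # ~0.34 -- the headline falls to about 34%
```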
The module supports multiple binary outcomes so users can see that nonresponse sensitivity may differ across headlines within the same release. Its key guardrail: it reports sensitivity to stated assumptions, not estimates of actual nonresponse bias.
All scenario outputs in the Mirror include two-sided 95 percent confidence intervals—lower and upper bounds on the scenario estimate. The interpretation: if the same study were fielded 100 times under identical conditions, intervals constructed this way would contain the true value in roughly 95 of them. Wider intervals signal less certainty; a row with bounds of, say, 1–49 percent is telling you the model cannot pin down that estimate with confidence.
The intervals are computed using the delta method, which approximates the standard error of a scenario percentage by combining the gradient of the scenario estimate with respect to the model coefficients and the coefficient variance–covariance matrix.
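Continuing the statsmodels sketch, the standard delta-method recipe for a scenario percentage looks like this. The site's implementation details are not shown here, but the ingredients are the same: the gradient of the average predicted probability with respect to the coefficients, and the coefficient covariance matrix.

```python
import numpy as np
from scipy import stats

def scenario_ci(model, X_scenario, level=0.95):
    """Delta-method interval for the average predicted probability of a scenario."""
    Xm = np.asarray(X_scenario, dtype=float)
    beta = np.asarray(model.params)
    p = 1.0 / (1.0 + np.exp(-(Xm @ beta)))        # per-respondent probabilities
    estimate = p.mean()
    grad = (p * (1 - p)) @ Xm / len(p)            # d(mean p) / d(beta)
    se = float(np.sqrt(grad @ np.asarray(model.cov_params()) @ grad))
    z = stats.norm.ppf(0.5 + level / 2)
    return estimate, estimate - z * se, estimate + z * se
```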
One important caveat: the confidence interval captures only sampling variability—uncertainty from how respondents were selected. It does not account for nonresponse bias, question-wording effects, model misspecification, or omitted variables. Those vulnerabilities are addressed by the separate omission and nonresponse modules.
If the starting probability of a key outcome is very close to 0 or 1—as in online advertising, where click-through rates are often below 1 percent—the scenario contrast of an otherwise important predictor can look trivially small. An analysis might show that active-voice copy increases click-through probability from .0025 to .0067, a difference of .0042—easy to overlook.
Conventional logistic regression output (logits, z-scores, odds ratios) would flag this: .0042 translates to a 172 percent odds increase and a one-point logit increase, both substantial. The statistician Frederick Mosteller called this kind of pairing "balancing biases"—letting the weakness of one method be covered by the strength of another. Near the extremes of the probability scale, conventional output covers for scenario simulation, not the other way around. The Mirror is designed to keep both layers visible.
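The arithmetic is easy to reproduce: a one-point logit effect is an odds ratio of e, roughly the 172 percent increase cited, and applying that same effect at two different baselines shows why the percentage-point contrast looks negligible near the floor of the scale.

```python
import numpy as np

def shift(p, d_logit):
    """Apply the same logit-scale effect at a different baseline probability."""
    return 1.0 / (1.0 + np.exp(-(np.log(p / (1 - p)) + d_logit)))

print(shift(0.50, 1.0) - 0.50)       # ~0.23   -- about 23 points mid-scale
print(shift(0.0025, 1.0) - 0.0025)   # ~0.0043 -- under half a point near the floor
```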
Applying It to Your Research
How the coupled-release framework applies to specific research contexts beyond the 1963 demonstration.
A concept test delivers a top-2-box purchase intent score. That score tells you where you stand. The simulator tells you what to do about it. Specifically:
- Which concept features drive purchase intent most—and by how much—so you can refine before re-testing.
- Which socio-demographic segments have the highest predicted purchase probability, including group size and expected number of buyers. That is the foundation of high-ROI targeting.
- An individual-level purchase probability for any member of the target population, enabling personalized marketing.
Brand trackers report scores wave over wave—awareness, consideration, NPS—with narrative commentary suggesting what may be driving movement. That is scorekeeping. A coupled release would instead disclose an explanatory model alongside the score, showing which brand perceptions, if improved, would move the top-2-box consideration score most—and which demographics are closest to converting.
For copy testing, the same logic applies: rather than comparing top-2-box scores across executions, the simulator identifies which elements of each execution drive the outcome and by how much, so creative decisions are grounded in evidence rather than preference or intuition.
Stated importance asks respondents directly how important each attribute is. Derived importance uses regression to infer importance from the relationship between attribute ratings and the key outcome. Many agencies plot both on an importance-performance matrix.
That matrix is useful but won't tell you how much the top-2-box score would increase if mobile check-in ratings improved by one scale point, controlling for other factors. Scenario simulation answers that question directly, turning importance analysis from a ranking exercise into an investment guide with stated uncertainty.
Interpreting the effect of a socio-demographic variable requires care. Imagine setting the Age variable to 18-29 for everyone in a marijuana legalization simulator while holding all other variables constant. The simulator might return a simulated probability of .75—a 7-percentage-point increase over the .68 starting probability—and a simulated population count suggesting 192 million adults would support legalization if all 255 million were 18-29 years old.
That second number requires a major assumption: that everyone could be 18-29. That is obviously impossible. Many researchers therefore hold socio-demographic variables constant when estimating population-wide effects, and use them instead to zoom in on specific segments where targeting is actually feasible.
Design, Data, and Working With Us
How to build or adapt a survey for this framework, and how to get started.
The pre-fieldwork decisions are where a coupled release is won or lost. Before a single interview is conducted, the adopting organization should name: the headline it intends to feature, a bounded set of plausible predictors of that headline measured alongside it, and the scenario and interval rules that will govern released contrasts. In practice that reduces to three things:
- The dependent variable should be a true binary from the start—ideally a top-2-box score or an explicit approve/disapprove question—not a scale collapsed post hoc.
- The predictor variables should be attitudes, perceptions, or behaviors that are plausibly related to the headline and could plausibly be influenced—linchpins, not just controls.
- For question format, research suggests unipolar, four-category, fully-anchored scales work well across modes and languages.
You should also consider linking and syncing: designing data sources from the outset to share common factors—sampling frames, collection dates, question wording—so they can be combined and cross-validated later.
Ready to explore the 1963 data yourself?
Start with the Survey Explorer to see what stood out descriptively—then move to the Model Builder to find what actually moved the headline.