Marketing teams using "How did you hear about us?" (HDYHAU) surveys face a persistent challenge: partial response rates. Without a complete set of responses, critical budget decisions rest on incomplete data. This raises a fundamental question: how can we confidently attribute channel revenue with limited survey responses?
This research presents a statistical framework for extrapolating channel attribution from partial survey responses. Using Monte Carlo simulations, we demonstrate that accurate channel attribution can be achieved with response rates around 40%. This finding fundamentally changes how businesses should approach attribution survey optimization.
Summary of Findings
Our statistical analysis reveals three critical insights:
- Response rates and prediction accuracy demonstrate a non-linear relationship, with Mean Squared Error (MSE), a measure of prediction accuracy where lower values indicate better predictions, stabilizing around a 40% response rate. Beyond this threshold, additional responses yield diminishing returns in accuracy improvement.
- The model's accuracy exhibits systematic variation with both survey volume and channel options:
a. Higher survey volumes (n > 50,000) achieve stable predictions at lower response rates.
b. Reducing channel options from 24 to 6 decreases the required response rate for reliable predictions by approximately 15%.
- Order value distributions maintain statistical consistency across channels (standard deviation σ < 0.1), validating our ability to accurately estimate revenue across different marketing channels using response-based extrapolation.
These findings provide marketing teams with a mathematically robust framework for making data-driven decisions about channel effectiveness while optimizing their attribution survey strategy.
The goal of this research was two-fold:
- Develop a reliable extrapolation model to account for partial survey response rates and accurately estimate channel-specific revenue contributions to optimize marketing spend.
- Determine the minimum response rate required for accurate predictions using said model.
This research will ultimately be integrated into Fairing’s attribution survey product and its predictive extrapolation features.
Key Concepts
Confidence Intervals
A confidence interval (CI) is a range of values that is likely to contain the true value of a parameter. More precisely, a 95% CI means that if we repeated the sampling process many times, roughly 95% of the resulting intervals would contain the true mean. Higher confidence levels (e.g., 99%) give more certainty but result in wider intervals, reducing precision. Conversely, lower levels (e.g., 90%) make intervals narrower but increase the risk of missing the true mean.
We use a 95% CI in this analysis to reflect most of the distribution’s variability and provide a good balance of precision and certainty. When reporting extrapolated channel contributions, the 95% CI helps stakeholders understand the potential variability in the estimate. This interval narrows as completion rates increase, improving precision as more data becomes available.
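As a quick illustration, here is a minimal Python sketch of reading a 95% CI off Monte Carlo output by taking percentiles of the simulated values. The draws below are synthetic stand-ins for one channel's simulated share, not output from our model.

```python
import numpy as np

# Synthetic stand-in for simulated draws of one channel's share.
rng = np.random.default_rng(seed=0)
simulated_shares = rng.beta(a=30, b=70, size=10_000)  # centered near 30%

# A 95% CI from simulation output: the 2.5th and 97.5th percentiles.
lower, upper = np.percentile(simulated_shares, [2.5, 97.5])
print(f"estimated share: {simulated_shares.mean():.3f} "
      f"(95% CI: {lower:.3f} to {upper:.3f})")
```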
Error Metrics
- Mean Squared Error (MSE) measures the average squared deviation between predicted and actual values. Because the errors are squared, MSE emphasizes larger errors, which helps identify specific response rates or completion thresholds where predictions deviate significantly.
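For concreteness, here is a minimal MSE sketch in Python; the channel proportions are illustrative numbers, not figures from this study.

```python
import numpy as np

def mse(predicted: np.ndarray, actual: np.ndarray) -> float:
    """Mean Squared Error: the average of squared prediction errors."""
    return float(np.mean((predicted - actual) ** 2))

# Illustrative predicted vs. actual channel proportions.
predicted = np.array([0.42, 0.31, 0.17, 0.10])
actual = np.array([0.40, 0.33, 0.15, 0.12])
print(mse(predicted, actual))  # squaring penalizes large misses more heavily
```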
The Model
The goal of this model is to estimate the distribution of total customers across marketing channels and the revenue contribution of each channel. Since not all customers respond to the survey question, the model uses a combination of known response data and Monte Carlo simulations to infer the likely channel distribution among non-respondents. A Monte Carlo simulation is a statistical method that uses random sampling to model and predict outcomes in uncertain situations by running many simulated scenarios.
Input Data
- Survey Responses
- Number of non-respondents
- Response rate
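To make the inputs concrete, here is a hypothetical example; the channel names and counts are illustrative, not drawn from the study's dataset.

```python
# Hypothetical inputs for one survey question (illustrative only).
survey_responses = {
    "Instagram": 450,
    "Podcast": 300,
    "Google Search": 150,
    "Friend / Family": 100,
}
n_respondents = sum(survey_responses.values())   # 1,000
n_non_respondents = 2_334                        # customers who skipped
response_rate = n_respondents / (n_respondents + n_non_respondents)  # ~0.30
```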
Steps in Extrapolation
- Calculate the proportion of survey respondents attributed to each channel. This distribution serves as the basis for simulating the responses of non-respondents.
- Run a Monte Carlo simulation to generate response distributions for non-respondents, drawing from a Dirichlet distribution to ensure that the simulated proportions sum to 1.
- Hyperparameters: the simulation takes (1) channel weights and (2) a confidence on those weights. We define (1) as the observed channel distribution among respondents and (2) as the response rate: the higher the response rate, the more confident we are that the current fraction of responses resembles the full population.
- Use the respondent data and the simulated non-respondent data to estimate the total customer distribution by channel.
- Perform over 10,000 simulations to generate possible channel proportions for non-respondents. This approach captures variability in potential responses and results in a distribution of possible outcomes, rather than a single deterministic estimate. A minimal code sketch of these steps follows.
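The sketch below is a simplified Python illustration of the loop above, not the production implementation; in particular, the mapping from response rate to the Dirichlet concentration parameter is an assumption made for demonstration.

```python
import numpy as np

def extrapolate_channels(responses: dict[str, int],
                         n_non_respondents: int,
                         concentration: float,
                         n_sims: int = 10_000,
                         seed: int = 0) -> np.ndarray:
    """Monte Carlo extrapolation of the full-population channel mix."""
    rng = np.random.default_rng(seed)
    counts = np.array(list(responses.values()), dtype=float)
    observed_props = counts / counts.sum()

    # Dirichlet prior centered on the observed proportions; the epsilon
    # keeps every alpha positive even for zero-count channels.
    alpha = observed_props * concentration + 1e-6

    totals = np.empty((n_sims, len(counts)))
    for i in range(n_sims):
        sim_props = rng.dirichlet(alpha)         # plausible non-respondent mix
        sim_counts = rng.multinomial(n_non_respondents, sim_props)
        totals[i] = counts + sim_counts          # respondents + simulated

    return totals / totals.sum(axis=1, keepdims=True)  # per-simulation shares

# Usage with hypothetical inputs (the illustrative numbers shown earlier);
# scaling the concentration by the response rate is one simple choice.
survey_responses = {"Instagram": 450, "Podcast": 300,
                    "Google Search": 150, "Friend / Family": 100}
n_non_respondents, response_rate = 2_334, 0.30
sims = extrapolate_channels(survey_responses, n_non_respondents,
                            concentration=response_rate * 100)
print(sims.mean(axis=0))                         # expected channel shares
print(np.percentile(sims, [2.5, 97.5], axis=0))  # 95% CI per channel
```

Centering the Dirichlet prior on the observed proportions means a higher concentration (i.e., a higher response rate) keeps simulated non-respondent mixes close to what respondents reported, while a lower concentration lets them vary more widely.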
Key Assumptions
- Channel Independence: Assumes that the channel distribution for respondents is similar to that for non-respondents. This allows us to extrapolate the observed channel proportions to the unobserved portion of the customer base. We acknowledge there could be non-response bias across channels and proceed with this assumption for the purposes of this analysis.
- Dirichlet Distribution for Simulations: A Dirichlet distribution is used in the Monte Carlo simulation step, as it enforces the proportions to sum to 1, reflecting realistic constraints on channel attribution.
- Equal Customer Value Across Channels: Assumes that, on average, customers from each channel contribute similarly.
- Accurate Representation of Non-Respondents Using Observed Data: The model assumes that the behavior of respondents can be a reasonable proxy for non-respondents. Though a simplification, this assumption is typically valid when there are no significant biases in the survey responses (i.e., the survey respondents are fairly representative of the total customer base). See Appendix A for a more thorough explanation of this.
Impact of Completion Rates on Prediction Accuracy
With a low response rate (5-20%), predictions are less accurate due to insufficient data, causing high variability in channel contribution estimates. As the response rate increases (especially beyond 30%), MSE decreases, indicating improved prediction accuracy. This trend continues until the error stabilizes, suggesting that a certain amount of data is enough to produce reliable extrapolations.
Based on our findings, error plateaus at around 40% response rate. Beyond this threshold, additional data has minimal impact on improving accuracy.
Using this error plateau as a benchmark, we calculate optimal response targets for each survey question. Once a question reaches its target response rate, you can confidently rely on those results while focusing the remaining survey capacity on other valuable questions. This approach helps maximize the insights you gain from each survey touchpoint while maintaining statistical reliability and respecting your customers' time.
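The plateau can be reproduced in miniature with a quick simulation. The sketch below is deliberately simplified: it scores a plug-in estimate built from respondents alone rather than the full Dirichlet model, and the "true" channel mix and survey volume are invented for illustration, but it exhibits the same flattening of the error curve.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical "true" mix over 6 channels at a fixed survey volume.
true_props = np.array([0.35, 0.25, 0.15, 0.12, 0.08, 0.05])
n_customers = 50_000

for rate in (0.05, 0.10, 0.20, 0.30, 0.40, 0.60):
    n_resp = int(n_customers * rate)
    errors = []
    for _ in range(200):  # repeat to average out sampling noise
        observed = rng.multinomial(n_resp, true_props)
        estimate = observed / n_resp   # plug-in estimate from respondents
        errors.append(np.mean((estimate - true_props) ** 2))
    print(f"response rate {rate:.0%}: MSE = {np.mean(errors):.2e}")
```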
What impacts the error curve?
- A higher number of views translates into a flatter error curve: with more views, the true channel distribution can be recovered at a lower response rate.
- A lower number of response options also translates into a flatter error curve: for the same number of views, fewer response options mean the true channel distribution can be recovered at a lower response rate. This is because there are fewer parameters to estimate, leading to less variability in the estimates.
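Both effects can be seen by sweeping survey volume and channel count in the same style of simulation; the configurations and the randomly drawn "true" distributions below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Sweep views and channel counts to compare error curves.
for n_views, n_channels in ((10_000, 24), (50_000, 24), (50_000, 6)):
    true_props = rng.dirichlet(np.ones(n_channels))  # random "true" mix
    for rate in (0.10, 0.20, 0.40):
        n_resp = int(n_views * rate)
        errors = [np.mean((rng.multinomial(n_resp, true_props) / n_resp
                           - true_props) ** 2) for _ in range(200)]
        print(f"views={n_views:>6}  channels={n_channels:>2}  "
              f"rate={rate:.0%}: MSE={np.mean(errors):.1e}")
```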
Extrapolation of Revenue
The following analysis shows that the order total has a generally similar distribution across all channels, as seen in the box plot. This suggests that regardless of how customers discover a brand, their purchase amounts are largely consistent. This consistency provides confidence that we can reasonably extrapolate revenue across channels even when data is incomplete.
Methodology
The revenue extrapolation was carried out using a similar Monte Carlo approach to the above, where we simulated customer responses based on observed data. This time, we applied the extrapolation model by determining weights for each channel according to the revenue generated per channel, rather than by customer volume alone. This adjustment ensures that the estimated revenue reflects the observed differences in spending behavior across channels, providing a more tailored estimate.
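A minimal sketch of the revenue-weighted variant follows; the respondent revenue figures, concentration parameter, and unobserved-revenue total are all hypothetical values chosen for illustration, whereas the actual model derives these from observed data.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical respondent revenue per channel (illustrative figures).
channel_revenue = np.array([52_000.0, 38_500.0, 21_000.0, 14_500.0])
revenue_props = channel_revenue / channel_revenue.sum()

# Weight the Dirichlet prior by revenue share rather than customer volume.
alpha = revenue_props * 100.0       # assumed concentration parameter
unobserved_revenue = 180_000.0      # assumed revenue from non-respondents

sim_props = rng.dirichlet(alpha, size=10_000)  # shape (10000, 4)
sim_revenue = channel_revenue + sim_props * unobserved_revenue

print(sim_revenue.mean(axis=0))                         # expected totals
print(np.percentile(sim_revenue, [2.5, 97.5], axis=0))  # 95% CI per channel
```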
Key Results
- The extrapolated revenue estimates show a strong alignment with actual revenue, allowing us to predict revenue within one standard deviation. This means that even with partial response data, our model provides reliable revenue projections.
- The precision of our predictions improves as the response rate increases. As we move closer to a full response rate, the confidence intervals narrow, making our revenue projections increasingly accurate.
The following visualization demonstrates projected revenue if every buyer responded to the “How Did You Hear About Us?” (HDYHAU) survey. This scenario uses the current response rate (30%) to estimate how closely our current extrapolation matches actual revenue by channel. The graph compares Current Revenue (given partial responses) against Real Revenue (hypothetical 100% response). Channels are sorted to reflect the variations in revenue distribution, and this comparison shows that even at a lower response rate, our model performs well, closely approximating real revenue across channels.
Making strategic decisions about your marketing requires reliable data you can trust. This research shows exactly how many responses you need to make confident choices about your channels. We've grounded this in rigorous statistics to help you navigate partial response rates with clarity. Want to see this analysis for your own data? Drop us a line at [email protected] or message us in your Fairing account.
Appendix A: Studying non-response bias
In analyzing channel engagement across different geo-locations, we observed a statistically significant non-response bias in the data. Specifically, responses skew toward certain channels depending on the local response rate, which means that customers from high-response areas may be overrepresented in the original dataset. See the following graph of example data from an eight-figure brand:
To address this non-response bias, we modified the extrapolation model by adjusting the priors used in the Dirichlet distribution. Here’s how:
- Defining separate priors for high and low response rates. These priors were based on the observed differences in channel engagement between the two groups.
- High response rates were defined as rates above the median value across geo-locations (i.e., US states), and low response rates as those below it.
- To ensure reliable response-rate estimates, only states with more than 20 views were taken into account.
During the Monte Carlo simulations, the adjusted priors allowed the model to “expect” a different distribution of channel engagement based on the type of location (high vs. low response). By updating the priors to account for non-response bias, the model now provides a more accurate extrapolation of channel distribution. This modification ensures that the extrapolated results better reflect the diversity of response behavior across various geographies, reducing the overrepresentation of high-response channels and giving a more balanced view of channel engagement.
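To illustrate the adjustment, here is a minimal sketch with separate priors for high- and low-response geographies; the prior weights, confidence multipliers, and non-respondent counts are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=4)

# Hypothetical group-specific priors, split at the median state-level
# response rate; weights and confidence multipliers are illustrative.
prior_high = np.array([0.40, 0.30, 0.20, 0.10]) * 80  # more confident
prior_low = np.array([0.25, 0.25, 0.30, 0.20]) * 40   # less confident
n_non_resp = {"high": 900, "low": 1_400}               # illustrative counts

n_sims = 10_000
totals = np.zeros((n_sims, 4))
for i in range(n_sims):
    for alpha, n in ((prior_high, n_non_resp["high"]),
                     (prior_low, n_non_resp["low"])):
        props = rng.dirichlet(alpha)        # group-specific channel mix
        totals[i] += rng.multinomial(n, props)

blended = totals.mean(axis=0)
print(blended / blended.sum())  # blended non-respondent channel shares
```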