
Statistical Significance

When you run an experiment, conduct a survey, take a poll, or analyze a set of data, you’re taking a sample of some population of interest, not examining every data point you possibly could. Statistical significance helps quantify whether a result is likely due to chance or to some factor of interest. When a finding is (statistically) significant, it simply means you can feel confident that it’s real, not that you just got lucky (or unlucky) in choosing the sample.

The Process of Evaluating

No matter what you’re studying, the process for evaluating significance is the same.

  1. Establish a null hypothesis (a straw man that you’re trying to disprove). For example, in an experiment on a marketing campaign, the null hypothesis might be “on average, customers don’t prefer our new campaign to the old one”. In an experiment introducing a new surgical intervention, the null hypothesis might be “the new surgical intervention does not reduce the number of patient deaths”.

  2. Set a target and interpret the p-value. The significance level is an expression of how rare your results would be, under the assumption that the null hypothesis is true. It is usually expressed as a p-value, and the lower the p-value, the less likely the results are due purely to chance. Setting a target can be dauntingly complex and depends on what we are analyzing. For the surgical intervention, we may want a very low p-value (such as 0.001) to be conservative. But if we are testing whether the new marketing concept is better, we are probably willing to accept a higher value (such as 0.2).

  3. Collect data.

  4. Plot the results. The graph will help us understand the variation, the sampling error, and the statistical significance.

  5. Calculate statistics.
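
To make steps 3 and 5 concrete, here is a minimal sketch for the marketing-campaign example (plotting, step 4, is omitted for brevity). The spend data are simulated; the means, spread, and group sizes are assumptions chosen purely for illustration, not figures from any real study.

# Hypothetical data and a two-sample t-test (steps 3 and 5).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
old = rng.normal(loc=8.41, scale=4.0, size=200)   # spend under the old campaign
new = rng.normal(loc=10.17, scale=4.0, size=200)  # spend under the new campaign

# Null hypothesis: on average, customers don't prefer the new campaign.
t_stat, p_value = stats.ttest_ind(new, old, alternative="greater")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")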

After following this process, we want to know whether the findings are “significant”. However, a statistically significant result is not necessarily of practical importance, because:

  • Practical significance is business relevance

  • Statistical significance is the confidence that a result isn’t due purely to chance

Give an example in which a result is statistically significant but not practically significant.

[2]:
hide_answer()

Example: Weight-Loss Program

Researchers are studying a new weight-loss program. Using a large sample, they construct a \(95\%\) confidence interval for the mean amount of weight loss after six months on the program to be \([0.12, 0.20]\). All measurements were taken in pounds. Note that this confidence interval does not contain \(0\), so we know that the results were statistically significant at a \(0.05\) alpha level. However, most people would say that the results are not practically significant, because after six months on a weight-loss program we would want to lose more than \(0.12\) to \(0.20\) pounds.
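
The distinction can be checked with a tiny sketch. The 5-pound threshold below is our own assumption for the smallest practically relevant loss; it is not part of the study.

# Statistical vs. practical significance for the weight-loss interval.
ci_low, ci_high = 0.12, 0.20   # 95% CI for mean weight loss (pounds)
practical_threshold = 5.0      # assumed smallest loss anyone would care about

statistically_significant = not (ci_low <= 0 <= ci_high)  # 0 lies outside the CI
practically_significant = ci_low >= practical_threshold

print(statistically_significant)  # True: the interval excludes 0
print(practically_significant)    # False: 0.12-0.20 lb is far below 5 lb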

Case: Surgical Intervention

We need to decide whether a new surgical intervention is more appropriate for a cancer patient with a brain tumor compared to the standard chemotherapy treatment:

Hypothetical result 1: The new surgical intervention significantly reduced the number of patient deaths compared to the current standard chemotherapy treatment (p=0.04)

How would you evaluate the result?

[3]:
hide_answer()

What is “significant”? Do the analysts mean considerably fewer deaths or just statistically significantly fewer deaths? It is not clear, but probably just the latter, since a p-value is reported. Here we have no information on exactly how superior the new surgical intervention was compared to chemotherapy.

Hypothetical result 2: The new surgical intervention significantly reduced the number of patient deaths compared to the current standard chemotherapy treatment (p=0.04). After five years, there were two fewer deaths in the intervention group.

How would you evaluate the result?

[4]:
hide_answer()

Now the exact difference in number of deaths is provided and it is clear the word “significantly” refers to statistical significance in this case, since two is not a large difference. As a manager, it is now important to contextualize these results. In an exploratory study in which each group had only 10 participants, two fewer deaths in the intervention group would be meaningful and warrant further investigation. In a large clinical trial with 1000 participants in each group, two fewer deaths, even if statistically significant, is less impressive. In this case, we would consider the two interventions more or less equal and base our treatment decision on other factors.
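
Simple arithmetic makes the contextualization concrete. The death counts below are hypothetical, chosen only to match the two scenarios described above.

# Same absolute difference, very different absolute risk reduction.
def risk_reduction(deaths_control, deaths_treatment, n_per_group):
    """Absolute risk reduction between two equally sized groups."""
    return (deaths_control - deaths_treatment) / n_per_group

print(risk_reduction(5, 3, 10))      # exploratory study: 0.2, i.e. 20 points
print(risk_reduction(12, 10, 1000))  # large trial: 0.002, i.e. 0.2 points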

Hypothetical result 3: The new surgical intervention did not statistically significantly reduce the number of patient deaths compared to the current standard chemotherapy treatment (p=0.07). After five years, there was 1 death in the surgical intervention group and 9 deaths in the standard chemotherapy group.

How would you evaluate the result?

[5]:
hide_answer()

This time we have a statistically non-significant result that corresponds to a seemingly large point estimate (8 fewer deaths). In this case, it appears the treatment has an important effect, but perhaps the study lacks sufficient power for this difference to be statistically significant. Again, we need more information about the size of the trial to contextualize the results. Simply concluding, by virtue of statistical hypothesis testing, that this study shows no difference between groups would be inappropriate.
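
The idea of insufficient power can be illustrated by simulation: even a real, large effect frequently fails to reach p < 0.05 in a small trial. The group size and true death rates below are assumptions made purely for illustration; they are not taken from the case.

# Simulated power of a small two-group trial (all parameters assumed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20                           # assumed patients per group
p_surgery, p_chemo = 0.05, 0.30  # assumed true five-year death rates
n_sims = 2000

significant = 0
for _ in range(n_sims):
    d_surg = rng.binomial(n, p_surgery)   # deaths in the surgery group
    d_chemo = rng.binomial(n, p_chemo)    # deaths in the chemotherapy group
    table = [[d_surg, n - d_surg], [d_chemo, n - d_chemo]]
    _, p = stats.fisher_exact(table)
    if p < 0.05:
        significant += 1

# The fraction of simulated trials reaching significance estimates the power;
# with groups this small it falls well short of the conventional 0.80.
print(f"estimated power: {significant / n_sims:.2f}")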

Case: Marketing Campaign

The marketing department comes up with a new concept, and you want to see if it works better than the current one. You can’t show it to every single target customer, of course, so you choose a sample group. When you run the results, you find that those who saw the new campaign spent $10.17 on average, more than the $8.41 spent by those who saw the old one.

This $1.76 might seem like a big, and perhaps important, difference. But in reality you may have been unlucky, drawing a sample of people who do not represent the larger population; in fact, maybe there was no difference at all between the two campaigns in their influence on consumers’ purchasing behavior. This is called sampling error, something you must contend with in any test that does not include the entire population of interest.

What causes sampling error?

[6]:
hide_answer()
  • There are two main contributors to sampling error: the size of the sample and the variation in the underlying population.

  • Sample size follows our intuition:

    • All else being equal, we will feel more comfortable in the accuracy of the campaigns’ $1.76 difference if we show the new one to 1,000 people rather than just 25. Of course, showing the campaign to more people costs more, so we have to balance the need for a larger sample size against the budget.

    • The same is true of statistical significance: with bigger sample sizes, you’re less likely to get results that merely reflect randomness. Thus a small difference, which may not be practically significant, can still be statistically significant.

  • Variation is a little trickier to understand; consider the graph below.

    [Figure: two possible distributions of customer purchases with the same average, one with less variation and one with more]

    • In the graph, each plot expresses a different possible distribution of customer purchases under the campaign. In the plot with less variation, most people spend roughly the same amount; some spend a few dollars more or less, but if you pick a customer at random, chances are good that they’ll be close to the average. So it’s less likely that you’ll select a sample that looks vastly different from the total population, which means you can be relatively confident in your results.

    • In the plot with more variation, people vary more widely in how much they spend. The average is still the same, but quite a few people spend much more or much less. If you pick a customer at random, chances are higher that they’ll be far from the average. So if you select a sample from a more varied population, you can’t be as confident in your results.

  • To summarize, the important thing to understand is that the greater the variation in the underlying population, the larger the sampling error; the simulation sketch below illustrates both effects.
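
A short simulation, with purely illustrative parameters, shows that the spread of sample means (the sampling error) shrinks as the sample grows and widens as the population varies more.

# Sampling error vs. sample size and population variation (illustrative).
import numpy as np

rng = np.random.default_rng(0)

def sampling_error(pop_std, sample_size, n_draws=5000, pop_mean=9.0):
    """Standard deviation of sample means across repeated draws."""
    means = [rng.normal(pop_mean, pop_std, sample_size).mean()
             for _ in range(n_draws)]
    return np.std(means)

for pop_std in (2.0, 6.0):   # less vs. more variation in spending
    for n in (25, 1000):     # small vs. large sample
        print(f"pop std = {pop_std}, n = {n}: "
              f"sampling error = {sampling_error(pop_std, n):.3f}")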

Should we adopt the new campaign? Evaluate it under the following scenarios.

[7]:
hide_answer()
  • The new marketing campaign shows a $1.76 increase (more than 20%) in average sales. If the p-value is 0.02, the result is also statistically significant, and we should adopt the new campaign.

  • If the p-value comes in at 0.2, the result is not statistically significant, but since the boost is so large, we’ll likely still proceed, though perhaps with a bit more caution.

  • But what if the difference were only a few cents? If the p-value comes in at 0.2, we’ll stick with our current campaign or explore other options. Even with a p-value of 0.01, the result, though likely real, is quite small. In this case, our decision will probably rest on other factors, such as the cost of implementing the new campaign.

Advice to Managers

How to evaluate?

[8]:
hide_answer()

Although software will report statistical significance, it is still helpful to know the process described above in order to understand and interpret the results, because managers should not trust a model they don’t understand.

How do managers use it?

[9]:
hide_answer()

Companies use statistical significance to understand how strongly the results of an experiment they’ve conducted should influence the decisions they make. Managers want to know what findings say about what they should do in the real world. But confidence intervals and hypothesis tests were designed to support science, where the idea is to learn something that will stand the test of time. So rather than obsessing about whether your findings are precisely right, think about the implication of each finding for the decision you’re hoping to make. What would you do differently if the finding were different?

What mistakes do managers make?

[10]:
hide_answer()
  • The word “significant” is often used in business to mean that a finding is strategically important. When you look at the results of a survey or experiment, ask about the statistical significance if the analyst hasn’t reported it.

  • Statistical significance tests help you account for potential sampling errors, but what is often more worrisome is non-sampling error: cases where the experimental and/or measurement protocols didn’t go according to plan, such as people lying on the survey, data getting lost, or mistakes being made in the analysis.

  • Keep in mind the practical application of the findings.

  • By all means use statistics, but always wed them with good judgment.