Overview

P-values appear throughout the scientific literature, yet they are often misconstrued, misunderstood or misinterpreted. The p in p-value stands for probability, which is where a lot of statistics begins. What exactly does a p-value tell us? Let’s find out.

Formal Definitions

Hypotheses

Hypotheses are the best place to start for any science experiment or project. A hypothesis is a statement about what you want to know. First you need to start with your research question. For example, do people who lick lollipops get more sore tongues than people who do not? To answer this formally in statistics, we need a null hypothesis (\(H_0\)) and an alternative hypothesis (\(H_A\)). We will use two abbreviations here: LL for lollipop lickers and NLL for non-lollipop lickers. Our question becomes: “Do LL have more sore tongues than NLL?”

Rephrasing this in statistical terms, \(H_0\) is that people who lick lollipops do not have more sore tongues than people who do not. The question now is: what is our alternative hypothesis? Well, there are three options to choose from:

1. We think LL have more sore tongues than NLL
2. We think LL have fewer sore tongues than NLL
3. We don’t know, but we don’t think they are the same!

Options 1 and 2 are called one-sided hypotheses. A one-sided hypothesis means you are assuming the direction of the difference: bigger or smaller. Option 3 is a two-sided hypothesis, which means we simply don’t know the direction! We will assume we don’t know, and use \(p\) to denote the proportion of people with sore tongues (ST) in each group.

\[ H_0: p_{LL-ST} = p_{NLL-ST} \]

\[ H_A: p_{LL-ST} \neq p_{NLL-ST} \]

Now, before we continue we need to review how okay we are with being wrong. Being wrong is okay, but there are two different types of being wrong.

Type I and II Errors

A type I error, sometimes referred to as \(\alpha\) error, is the error of saying something happened when it did not. For example, the error of me assuming you bought a balloon when you did not. In hypothesis testing terms, it is rejecting \(H_0\) when \(H_0\) is actually true. A type II error, sometimes referred to as \(\beta\) error, is the error of saying something did not happen when it did. Using our balloon example, that is me assuming you didn’t buy a balloon when you did. In hypothesis testing terms, it is failing to reject \(H_0\) when \(H_0\) is actually false.
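As a rough illustration of what a type I error rate looks like in practice, the sketch below simulates many studies in which the null hypothesis is true by construction, then counts how often a two-sample test of proportions rejects it anyway. The group size, the shared sore-tongue proportion, the number of simulations and the seed are all made-up values, not numbers from this post.

```r
set.seed(42)          # arbitrary seed, only for reproducibility
n_sims      <- 2000   # number of simulated studies (assumed)
n_per_group <- 200    # hypothetical group size
p_true      <- 0.4    # same proportion in both groups, so H0 is true
alpha       <- 0.05   # significance level

p_values <- replicate(n_sims, {
  x_ll  <- rbinom(1, n_per_group, p_true)  # sore tongues among LL
  x_nll <- rbinom(1, n_per_group, p_true)  # sore tongues among NLL
  prop.test(c(x_ll, x_nll), c(n_per_group, n_per_group),
            correct = FALSE)$p.value
})

# Every rejection here is a type I error, because H0 is true.
type1_rate <- mean(p_values < alpha)
type1_rate
```

The share of rejections should land near the chosen \(\alpha\), which is the sense in which \(\alpha\) is the type I error rate.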

Power

The power of a statistical test is the probability that it correctly rejects the null hypothesis when the null hypothesis is false (Gelman and Carlin 2014). It is tied to the type II error rate \(\beta\) through the formula below.

\[ Power = 1 - \beta \]

Remember our lollipop example? There, power would be the probability that we conclude LL and NLL do not have the same proportion of sore tongues, given that the proportions really do differ.
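To make the link between power and \(\beta\) concrete, here is a small sketch using `power.prop.test()` from base R's stats package. The two proportions and the group size are hypothetical planning values chosen for illustration, not estimates from the lollipop data.

```r
# Hypothetical planning values: 30% sore tongues in NLL, 50% in LL.
p_nll <- 0.30
p_ll  <- 0.50

# Power of a two-sided test with 100 people per group.
pw <- power.prop.test(n = 100, p1 = p_nll, p2 = p_ll,
                      sig.level = 0.05)$power
beta <- 1 - pw   # type II error rate, from Power = 1 - beta

# The reverse question: people per group needed for 80% power.
n_needed <- power.prop.test(p1 = p_nll, p2 = p_ll,
                            sig.level = 0.05, power = 0.80)$n

c(power = pw, beta = beta, n_per_group = ceiling(n_needed))
```

The same function answers either question, depending on which of `n` and `power` is left out.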

Test Statistic

Alright, I know at this point I’m boring you. One last definition and then we can get to the good stuff. A test statistic is exactly what it sounds like. A statistic used for tests! What kind of tests you may ask? Hypothesis tests like we have above. Now, the test statistic in itself may follow a variety of distributions depending on the test. I’ll leave distributions to another post.

P-Value

The probability of such an extreme value of the test statistic occurring if the null hypothesis were true is often called the P-Value (Bland 2015). “Such an extreme” here means at least as extreme as the value we actually observed.
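As a sketch of how this definition connects to the one-sided and two-sided hypotheses from earlier, the snippet below turns a test statistic into a p-value under each alternative. It assumes the test statistic follows a standard normal distribution under \(H_0\), and the value 2.1 is a made-up example.

```r
z <- 2.1   # hypothetical observed test statistic

# One-sided: H_A says the difference is greater than zero.
p_greater <- pnorm(z, lower.tail = FALSE)

# One-sided: H_A says the difference is less than zero.
p_less <- pnorm(z)

# Two-sided: H_A only says the groups are not the same,
# so extreme values in both tails count.
p_two_sided <- 2 * pnorm(-abs(z))

c(greater = p_greater, less = p_less, two_sided = p_two_sided)
```

The two-sided p-value is twice the smaller one-sided p-value, which is the price of not committing to a direction in advance.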

Significance Level

Remember \(\alpha\) from before? The probability of a type I error that we are willing to accept is called the significance level. The value picked is usually 0.05, but that choice is a convention and completely arbitrary. It could just as well be 0.20 or 0.02.

Important

There is no such thing as more significant. A result is either significant at the level that was specified before the analysis, or it is not. For example, a p-value of 0.01 is not more significant than a p-value of 0.05.

Back to our Problem

Finally, after all those boring definitions we are back to where we started. How are we going to determine whether people who are lollipop lickers have more sore tongues? This requires us to do a bit more thinking to pick the right statistical tool from the toolbox.

Let’s think about this question a bit more first. Lollipop lickers tend to be younger, right? Do boys like lollipops more than girls? Let’s assume that both of those things are true. We will want to control for both of those variables. Control, in this sense, means adding the variables to the equation, as below.

\[ ST = LL + age + sex \]

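Now, the type of regression we will use is a generalized linear model (GLM). Generalized here means the model extends ordinary linear regression to outcomes that are not continuous, such as our yes/no sore tongue outcome. For a binary outcome like this, the GLM is a logistic regression, which models the log odds of having a sore tongue. Another post will cover the differences between GLMs and regular old regression (OLS).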
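A minimal sketch of what fitting such a model looks like in R is shown here. The data are simulated, and the column names (`st`, `ll`, `age`, `sex`) and the "true" coefficients are invented for the example, so the output will not match the results later in this post.

```r
set.seed(1)   # arbitrary seed for reproducibility
n <- 422      # same size as the dataset described in the post

# Simulated stand-in data; column names are hypothetical.
sim <- data.frame(
  ll  = rbinom(n, 1, 0.5),       # 1 = lollipop licker
  age = round(runif(n, 5, 80)),  # age in years
  sex = rbinom(n, 1, 0.4)        # 1 = female
)

# Made-up true coefficients on the log-odds scale.
log_odds <- 1 + 1.5 * sim$ll - 0.05 * sim$age + 0.8 * sim$sex
sim$st <- rbinom(n, 1, plogis(log_odds))  # 1 = sore tongue

# Logistic regression: a GLM with binomial family and logit link.
fit <- glm(st ~ ll + age + sex,
           family = binomial(link = "logit"), data = sim)
coefs <- summary(fit)$coefficients
coefs
```

The coefficient table has one row per term and columns for the estimate, standard error, z value and p-value, the same layout as the results table further down.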

Data on Lollipop Lickers and Sore Tongues

Now we know our question, have our tool and want to find out the answer. What are we missing? The most important thing of all: data! Luckily, there is a database that collects such data. It has data on 422 people. Below are the summary statistics for our groups.

	Sore Tongues (n = 181)	Not Sore Tongues (n = 241)
Lollipop Lickers, n (%)	146 (80.7%)	60 (24.9%)
Age, mean (SD)	17.9 (7.30)	48.0 (17.8)
Female, n (%)	107 (59.1%)	52 (21.6%)
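Before adjusting for anything, the counts in the summary table are enough for a quick unadjusted look. The sketch below rebuilds the 2x2 table of sore tongue by lollipop licking from those counts and computes a crude odds ratio and a chi-squared test. This ignores age and sex entirely, so it is only a first glance and should not be expected to match the adjusted model.

```r
# Counts taken from the summary table: 146 of 181 people with a sore
# tongue are LL, and 60 of 241 people without a sore tongue are LL.
tab <- matrix(c(146, 181 - 146,
                 60, 241 - 60),
              nrow = 2, byrow = TRUE,
              dimnames = list(sore_tongue = c("yes", "no"),
                              group = c("LL", "NLL")))
tab

# Crude (unadjusted) odds ratio from the cross-product.
crude_or <- (tab["yes", "LL"] * tab["no", "NLL"]) /
            (tab["yes", "NLL"] * tab["no", "LL"])
crude_or

# Chi-squared test of independence on the same table.
crude_test <- chisq.test(tab)
crude_test$p.value
```

The crude odds ratio comes out at roughly 12.6, before any adjustment for age and sex.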

Analysis Time!

Using our handy dandy formula from above and our tool from the toolbox (GLM), we get the results below.

Term	Estimate	SE	Statistic (Z-Value)	Pr(>|z|) (P-Value)
(Intercept)	6.31	1.20	5.25	1.53e-7
LL	2.77	0.484	5.73	1.00e-8
Age	-0.317	0.0499	-6.36	2.04e-10
Sex	2.15	0.493	4.36	1.32e-5

Now for the Interpretation of the P-Value

The test statistic for each regression coefficient is defined by \(\beta / SE(\beta)\), where \(\beta\) is the estimated coefficient. For example, the test statistic for age is shown below. Note that it is close to, but not exactly, the -6.36 in the table. We will chalk up the difference to rounding of the estimate and SE.

\[ -0.317/0.0499 = -6.35 \]
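The same division can be repeated for every row of the results table. This short sketch types in the rounded estimates and standard errors from the table and recomputes each z-value; the variable names are just for illustration.

```r
# Rounded values copied from the results table.
results <- data.frame(
  term     = c("(Intercept)", "LL", "Age", "Sex"),
  estimate = c(6.31, 2.77, -0.317, 2.15),
  se       = c(1.20, 0.484, 0.0499, 0.493)
)

# Test statistic for each coefficient: estimate divided by its SE.
results$z <- results$estimate / results$se
results
```

Each recomputed value agrees with the table to within rounding.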

Now, this number on its own is pretty useless. We want to know, what is our p-value?! Well, to do that we turn to the normal distribution.

Normal Distribution

Using the normal distribution, since our sample size is bigger than 30 (a wonderfully random number. Honestly, I have no idea why 30 is the cutoff), we can determine our almighty p-value. Because our alternative hypothesis is two-sided, we take the area in one tail and double it.

2*pnorm(-6.358)

[1] 2.043975e-10

Ta-da! We have our p-value of 2.04e-10. This is less than our pre-specified significance level of 0.05. What does it mean though? From earlier, it is the probability of a test statistic at least as extreme as ours (-6.36) occurring if the null hypothesis were true. Finally, we can use significant in our interpretation, right? Technically yes, however there are some strong caveats we need to go over first.
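The same calculation works for every term in the results table. As a sketch, the snippet below takes the rounded z-values from the table and converts each into a two-sided p-value. Small differences from the table are expected because the z-values here are rounded.

```r
# Rounded z-values copied from the results table.
z <- c("(Intercept)" = 5.25, LL = 5.73, Age = -6.36, Sex = 4.36)

# Two-sided p-value: twice the area in the tail beyond |z|.
p <- 2 * pnorm(-abs(z))
signif(p, 3)

# Compare against the pre-specified significance level.
alpha <- 0.05
p < alpha
```

Using `-abs(z)` means the sign of the coefficient does not matter for the two-sided p-value.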

Issues with the P-Value

P-values are typically overemphasized. They are a piece of the puzzle, but they are not the whole puzzle. Arguably, a more important piece is the effect size. For our example, lollipop lickers have increased odds of a sore tongue compared to non-lollipop lickers (OR = 5.87, 95% CI: 2.53 to 13.87). The effect size, in this case the odds ratio (OR), should be considered alongside the p-value. Finally, there are two other errors we should go over that highlight how cautious we should be about the p-value.
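In a logistic regression, a coefficient becomes an odds ratio by exponentiating it. As a sketch of that conversion, the snippet below takes the rounded Age row from the results table and computes an odds ratio with a Wald-type 95% interval. The Wald interval is an assumption made here for simplicity; other interval methods exist and give somewhat different limits.

```r
# Rounded Age row copied from the results table.
estimate <- -0.317
se       <- 0.0499

# Odds ratio per one-year increase in age.
or_age <- exp(estimate)

# Wald-type 95% interval: build it on the log-odds scale,
# then exponentiate both limits.
z_crit <- qnorm(0.975)
ci_age <- exp(estimate + c(-1, 1) * z_crit * se)

round(c(OR = or_age, lower = ci_age[1], upper = ci_age[2]), 3)
```

An odds ratio below 1 here says the odds of a sore tongue go down with each additional year of age, which tells us about size and direction in a way the p-value alone does not.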

Type M & Type S Error

Often type I and type II errors are highlighted, however type M and type S errors are important to note as well. A type S (sign) error is the probability of a statistically significant estimate being in the wrong direction (Gelman and Carlin 2014). A type M (magnitude) error, also called the exaggeration ratio, is the factor by which the magnitude of a statistically significant effect might be overestimated (Gelman and Carlin 2014). Definitions are always great, but an example helps to better understand.

Using our example from before, and the function outlined in Gelman and Carlin (2014):

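retrodesign <- function(A, s, alpha=.05, df=Inf, n.sims=10000){
  # A: hypothesized true effect size (assumed positive), s: standard error
  z <- qt(1-alpha/2, df)              # critical value for a two-sided test
  p.hi <- 1 - pt(z-A/s, df)           # P(significant with a positive estimate)
  p.lo <- pt(-z-A/s, df)              # P(significant with a negative estimate)
  power <- p.hi + p.lo
  typeS <- p.lo/power                 # P(wrong sign | significant), given A > 0
  estimate <- A + s*rt(n.sims,df)     # simulated replicated estimates
  significant <- abs(estimate) > s*z
  exaggeration <- mean(abs(estimate)[significant])/A   # type M error
  return(list(power=power, typeS=typeS, exaggeration=exaggeration))
}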

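retrodesign(abs(-0.31703), 0.04986)

The function assumes the true effect A is positive, so we pass in the absolute value of our age coefficient. The type S error is then essentially 0, while the type M error (exaggeration ratio) is about 1. This means that, if the true effect really were as large as our estimate, there would be almost no chance of a significant estimate having the wrong sign, and its magnitude would barely be exaggerated. Sounds good, right? The catch is that we plugged in our own estimate as the true effect. Gelman and Carlin (2014) recommend taking the effect size from external information instead, and with a smaller plausible effect both errors grow quickly.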
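To get a feel for how type S and type M errors depend on the assumed true effect, here is a sketch that evaluates them over a few effect sizes while holding the standard error at the value from our age coefficient. It is a simplified normal-distribution variant of the Gelman and Carlin function, written to be self-contained; the function name `design_errors`, the grid of effect sizes and the seed are made up for illustration, and effect sizes are entered as positive magnitudes because the calculation assumes a positive true effect.

```r
set.seed(2014)   # arbitrary seed for the simulation step

design_errors <- function(A, s, alpha = 0.05, n_sims = 10000) {
  z    <- qnorm(1 - alpha / 2)      # two-sided critical value
  p_hi <- 1 - pnorm(z - A / s)      # significant, correct sign (A > 0)
  p_lo <- pnorm(-z - A / s)         # significant, wrong sign
  power <- p_hi + p_lo
  estimate    <- A + s * rnorm(n_sims)   # simulated replications
  significant <- abs(estimate) > s * z
  c(true_effect  = A,
    power        = power,
    typeS        = p_lo / power,
    exaggeration = mean(abs(estimate)[significant]) / A)
}

s <- 0.04986                            # SE of the age coefficient
effects <- c(0.02, 0.05, 0.10, 0.317)   # hypothetical true effects

results <- t(sapply(effects, design_errors, s = s))
round(results, 3)
```

Reading down the rows, smaller assumed effects come with lower power, a higher chance of a wrong sign and a larger exaggeration ratio, which is why the choice of plausible effect size matters so much for this kind of calculation.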

So are P-values garbage?

P-values have a place in science, however they should be interpreted with caution and as one piece of the puzzle. Other pieces include the effect size, the study design, the sample that was studied and a myriad of other components.

References

Bland, Martin. 2015. An Introduction to Medical Statistics. Oxford University Press.

Gelman, Andrew, and John Carlin. 2014. “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.” Perspectives on Psychological Science 9 (6): 641–51.