Python Inferential Statistics Cheat Sheet#

import polars as pl
import pingouin as pg

# prepare examples for analysis
from datascipsych import datasets
data = pl.read_csv(datasets.get_dataset_file("Onton2005"))
one_sample = (
    data.filter(pl.col("probe") == "lure")
    .group_by("subject")
    .agg(pl.col("correct").mean())
    .sort("subject")
)
paired_sample = (
    data.group_by("subject", "probe")
    .agg(pl.col("response_time").mean())
    .sort("subject", "probe")
    .pivot("probe", index="subject", values="response_time")
)
one_way = (
    data.filter(pl.col("probe") == "target")
    .group_by("subject", "set_size")
    .agg(pl.col("response_time").mean())
    .sort("subject", "set_size")
)
two_way = (
    data.group_by("subject", "probe", "set_size")
    .agg(pl.col("response_time").mean())
    .sort("subject", "probe", "set_size")
)

Interpreting the \(p\)-value#

The \(p\)-value represents the probability of having observed a difference at least as extreme as the one we observed in our sample, assuming that there is no effect. This is not the probability that our observations are due to chance. Instead, it is the probability of our observations occurring if we assume that they are due to chance.

When the \(p\)-value is small, we may decide to reject the null hypothesis. In psychology, the usual standard is that we decide to reject the null hypothesis when \(p < 0.05\). This means that, if the null hypothesis is true, we will have a false positive (that is, a false rejection of the null hypothesis) less than 5% of the time.

If \(p < 0.05\), then we conclude that the null hypothesis can be rejected and that there is a significant difference. If \(p \geq 0.05\), then we fail to reject the null hypothesis and conclude that there is not a significant difference.
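
Pingouin returns test results as a pandas DataFrame, so you can apply this decision rule in code by pulling out the p-value and comparing it to your alpha level. Here is a minimal sketch, using the one-sample data prepared above (the test itself is explained in the next section):

# a minimal sketch of the p < 0.05 decision rule
alpha = 0.05
result = pg.ttest(one_sample["correct"], 0.5, alternative="greater")
p_value = result["p-val"].iloc[0]  # Pingouin results are pandas DataFrames
if p_value < alpha:
    print("Reject the null hypothesis: significant difference")
else:
    print("Fail to reject the null hypothesis: no significant difference")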

One-sample t-test#

If you want to test whether some distribution of measures is significantly different from a specific null value, use a one-sample t-test. For example, say you want to test whether participants responded correctly more than 50% of the time on lure trials. Because you have a specific hypothesis about the direction of the effect (that is, that accuracy will be greater than 50%, not less), you can use a one-tailed test (indicated by setting alternative="greater").

one_sample.head()
shape: (5, 2)
subject  correct
i64      f64
1        0.934783
2        1.0
3        1.0
4        0.955556
5        0.934783
null_value = 0.5
pg.ttest(one_sample["correct"], null_value, alternative="greater")
T dof alternative p-val CI95% cohen-d BF10 power
T-test 66.848551 22 greater 3.287812e-27 [0.95, inf] 13.938886 4.636e+23 1.0

To report the results of this test, you could write something like:

We tested whether response accuracy was greater than chance (0.5) using a one-tailed t-test. Accuracy was significantly greater than chance (t(22)=66.85, p=3.3e-27, d=13.94).

Note that the \(p\)-value is very small and is therefore written using scientific notation. 3.3e-27 means \(3.3 \times 10^{-27}\), which is 3.3 with the decimal point shifted 27 positions to the left.
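
If you need to round a p-value like this yourself, Python's scientific-notation format specifier handles it; for example:

# round a very small p-value to one decimal place in scientific notation
p = 3.287812e-27
print(f"{p:.1e}")  # prints 3.3e-27, as written in the report above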

For \(t\)-tests, use Cohen’s \(d\) as a measure of effect size. Cohen’s \(d\) is the difference between the means divided by an estimate of the standard deviation. Values of \(d\) are interpreted as “small” (around 0.2), “medium” (around 0.5), or “large” (around 0.8).
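
For the one-sample case, that works out to the difference between the sample mean and the null value, divided by the sample standard deviation. A minimal sketch of the calculation, reusing the null_value defined above (it should closely match the cohen-d column in Pingouin's output):

# compute Cohen's d for the one-sample test by hand
d = (one_sample["correct"].mean() - null_value) / one_sample["correct"].std()
print(d)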

Two-sample paired t-test#

If you want to test whether some measure that was collected for each subject in different conditions is statistically different between conditions, use a paired t-test. For example, say you wanted to test whether response times were different on lure trials and target trials.

paired_sample.head()
shape: (5, 3)
subject  lure      target
i64      f64       f64
1        0.952467  0.850602
2        1.02368   0.847816
3        1.828131  1.251809
4        1.656073  1.482451
5        1.568472  1.244664

Setting paired=True indicates that the samples come from the same subjects and are in the same order.

pg.ttest(paired_sample["lure"], paired_sample["target"], paired=True)
T dof alternative p-val CI95% cohen-d BF10 power
T-test 2.969829 22 two-sided 0.007072 [0.04, 0.24] 0.450835 6.57 0.542759

To report the results of this test, you could write something like:

We tested whether response time differed between target and lure trials using a paired t-test. We observed a significant difference in response time (\(t(22)=2.97\), \(p=0.0071\), \(d=0.45\)), with slower response times on lure trials.
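
To check the direction of the effect described in the report, you can compare the condition means directly, or look at the mean of the per-subject differences. A quick sketch using the paired_sample table:

# condition means and mean paired difference (positive means lure is slower)
print(paired_sample["lure"].mean(), paired_sample["target"].mean())
print((paired_sample["lure"] - paired_sample["target"]).mean())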

One-way repeated-measures ANOVA#

If some measure was observed for more than two conditions and you want to test whether the measure varied between conditions, use a repeated-measures analysis of variance (ANOVA). For example, say that you wanted to examine whether response time varied depending on the set size variable.

one_way.head()
shape: (5, 3)
subject  set_size  response_time
i64      i64       f64
1        3         0.884958
1        5         0.830895
1        7         0.83307
2        3         0.755198
2        5         0.868365
pg.rm_anova(
    data=one_way.to_pandas(),  # some Pingouin functions require Pandas
    dv="response_time",
    within="set_size", 
    subject="subject",
)
Source ddof1 ddof2 F p-unc ng2 eps
0 set_size 2 44 7.831295 0.001232 0.058064 0.923359

To report the results of this test, you could write something like:

We tested whether response time varied with set size using a one-way repeated-measures ANOVA. We observed a significant effect of set size (F(2, 44)=7.83, p=0.0012, ng2=0.058).

For ANOVAs, report generalized eta squared (the ng2 column) as the measure of effect size.
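
Because rm_anova returns a pandas DataFrame, you can also extract the statistics from the result to build the report string instead of copying numbers by hand. A minimal sketch, re-running the same test and storing the result (column names match the output shown above):

# store the ANOVA table and pull out the values needed for the report
res = pg.rm_anova(
    data=one_way.to_pandas(),
    dv="response_time",
    within="set_size",
    subject="subject",
)
row = res.iloc[0]
print(
    f"F({row['ddof1']:.0f}, {row['ddof2']:.0f})={row['F']:.2f}, "
    f"p={row['p-unc']:.4f}, ng2={row['ng2']:.3f}"
)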

Two-way repeated-measures ANOVA#

If two variables of interest were manipulated and you want to determine whether some measure varied with either variable (or with their combination), use a two-way repeated-measures ANOVA. For example, say that there was a set size variable that could be 3, 5, or 7 items and a probe variable that could be either target or lure, and you want to examine how response time varied based on set size and probe.

two_way.head(6)
shape: (6, 4)
subject  probe     set_size  response_time
i64      str       i64       f64
1        "lure"    3         0.895321
1        "lure"    5         0.926712
1        "lure"    7         1.028225
1        "target"  3         0.884958
1        "target"  5         0.830895
1        "target"  7         0.83307
pg.rm_anova(
    data=two_way.to_pandas(),  # some Pingouin functions require Pandas
    dv="response_time",
    within=["set_size", "probe"],
    subject="subject",
)
Source SS ddof1 ddof2 MS F p-unc p-GG-corr ng2 eps
0 set_size 1.049384 2 44 0.524692 13.292064 0.000031 0.000257 0.065393 0.721376
1 probe 0.545674 1 22 0.545674 8.312252 0.008634 0.008634 0.035106 1.000000
2 set_size * probe 0.109888 2 44 0.054944 2.167342 0.126550 0.137903 0.007274 0.807938

To report the results of this test, you could write something like:

We tested whether response time varied with set size and probe type using a two-way repeated-measures ANOVA. We observed a significant effect of set size (F(2, 44)=13.29, p=0.00026, ng2=0.065), a significant effect of probe (F(1, 22)=8.31, p=0.0087, ng2=0.035), and no significant interaction (F(2, 44)=2.17, p=0.14, ng2=0.0073). Greenhouse-Geisser-corrected p-values are reported.

The Greenhouse-Geisser correction addresses the sphericity assumption of standard repeated-measures ANOVAs, which requires (roughly) that the different conditions be equally correlated with one another. In this case, the three set sizes may not be equally correlated, and the p-GG-corr column gives a p-value that is corrected for this possibility.
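
If you want to check this assumption directly, Pingouin provides Mauchly's test of sphericity through pg.sphericity. A minimal sketch for the set size factor in the one-way data (if the test indicates a violation, report the corrected p-value):

# Mauchly's test of sphericity for the set size factor
spher_test = pg.sphericity(
    data=one_way.to_pandas(),
    dv="response_time",
    within="set_size",
    subject="subject",
)
print(spher_test.spher, spher_test.pval)  # spher is False when sphericity is violated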