Python Inferential Statistics Cheat Sheet#

import polars as pl
import pingouin as pg

# prepare examples for analysis
from datascipsych import datasets
data = pl.read_csv(datasets.get_dataset_file("Onton2005"))
one_sample = (
    data.filter(pl.col("probe") == "lure")
    .group_by("subject")
    .agg(pl.col("correct").mean())
    .sort("subject")
)
paired_sample = (
    data.group_by("subject", "probe")
    .agg(pl.col("response_time").mean())
    .sort("subject", "probe")
    .pivot("probe", index="subject", values="response_time")
)
one_way = (
    data.filter(pl.col("probe") == "target")
    .group_by("subject", "set_size")
    .agg(pl.col("response_time").mean())
    .sort("subject", "set_size")
)
two_way = (
    data.group_by("subject", "probe", "set_size")
    .agg(pl.col("response_time").mean())
    .sort("subject", "probe", "set_size")
)

Interpreting the \(p\)-value#

The \(p\)-value represents the probability of having observed a difference at least as extreme as the one we observed in our sample, assuming that there is no effect. This is not the probability that our observations are due to chance. Instead, it is the probability of our observations occurring if we assume that they are due to chance.

When the \(p\)-value is small, we may decide to reject the null hypothesis. In psychology, the usual standard is that we decide to reject the null hypothesis when \(p < 0.05\). This means that, if the null hypothesis is true, we will have a false positive (that is, a false rejection of the null hypothesis) less than 5% of the time.

If \(p < 0.05\), then we conclude that the null hypothesis can be rejected and that there is a significant difference. If \(p \geq 0.05\), then we fail to reject the null hypothesis and conclude that there is not a significant difference.
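
Pingouin returns test results as a pandas DataFrame, so you can apply this decision rule in code by pulling out the p-value and comparing it to your alpha level. Here is a minimal sketch, using the one-sample data prepared above (the test itself is explained in the next section):

# a minimal sketch of the p < 0.05 decision rule
alpha = 0.05
result = pg.ttest(one_sample["correct"], 0.5, alternative="greater")
p_value = result["p-val"].iloc[0]  # Pingouin results are pandas DataFrames
if p_value < alpha:
    print("Reject the null hypothesis: significant difference")
else:
    print("Fail to reject the null hypothesis: no significant difference")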

One-sample t-test#

If you want to test whether some distribution of measures is significantly different from a specific null value, use a one-sample t-test. For example, say you want to test whether participants responded correctly more than 50% of the time on lure trials. Because you have a specific hypothesis about the direction of the effect (that is, that accuracy will be greater than 50%, not less), you can use a one-tailed test (indicated by setting alternative="greater").

one_sample.head()
shape: (5, 2)
subject  correct
i64      f64
1        0.934783
2        1.0
3        1.0
4        0.955556
5        0.934783
null_value = 0.5
pg.ttest(one_sample["correct"], null_value, alternative="greater")
T dof alternative p-val CI95% cohen-d BF10 power
T-test 66.848551 22 greater 3.287812e-27 [0.95, inf] 13.938886 4.636e+23 1.0

To report the results of this test, you could write something like:

We tested whether response accuracy was greater than chance (0.5) using a one-tailed t-test. Accuracy was significantly greater than chance (t(22)=66.85, p=3.3e-27, d=13.94).

Note that the \(p\)-value is very small and is therefore written using scientific notation. 3.3e-27 means \(3.3 \times 10^{-27}\), which is 3.3 with the decimal point shifted 27 positions to the left.
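
If you need to round a p-value like this yourself, Python's scientific-notation format specifier handles it; for example:

# round a very small p-value to one decimal place in scientific notation
p = 3.287812e-27
print(f"{p:.1e}")  # prints 3.3e-27, as written in the report above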

For \(t\)-tests, use Cohen’s \(d\) as a measure of effect size. Cohen’s \(d\) is the difference between the means divided by an estimate of the standard deviation. Values of \(d\) are interpreted as “small” (around 0.2), “medium” (around 0.5), or “large” (around 0.8).
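
For the one-sample case, that works out to the difference between the sample mean and the null value, divided by the sample standard deviation. A minimal sketch of the calculation, reusing the null_value defined above (it should closely match the cohen-d column in Pingouin's output):

# compute Cohen's d for the one-sample test by hand
d = (one_sample["correct"].mean() - null_value) / one_sample["correct"].std()
print(d)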

Two-sample paired t-test#

If you want to test whether some measure that was collected for each subject in different conditions is statistically different between conditions, use a paired t-test. For example, say you wanted to test whether response times were different on lure trials and target trials.

paired_sample.head()
shape: (5, 3)
subject  lure      target
i64      f64       f64
1        0.952467  0.850602
2        1.02368   0.847816
3        1.828131  1.251809
4        1.656073  1.482451
5        1.568472  1.244664

Setting paired=True indicates that the samples come from the same subjects and are in the same order.

pg.ttest(paired_sample["lure"], paired_sample["target"], paired=True)
T dof alternative p-val CI95% cohen-d BF10 power
T-test 2.969829 22 two-sided 0.007072 [0.04, 0.24] 0.450835 6.57 0.542759

To report the results of this test, you could write something like:

We tested whether response time differed between target and lure trials using a paired t-test. We observed a significant difference in response time (\(t(22)=2.97\), \(p=0.0071\), \(d=0.45\)), with slower response times on lure trials.
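
To check the direction of the effect described in the report, you can compare the condition means directly, or look at the mean of the per-subject differences. A quick sketch using the paired_sample table:

# condition means and mean paired difference (positive means lure is slower)
print(paired_sample["lure"].mean(), paired_sample["target"].mean())
print((paired_sample["lure"] - paired_sample["target"]).mean())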

One-way repeated-measures ANOVA#

If some measure was observed for more than two conditions and you want to test whether the measure varied between conditions, use a repeated-measures analysis of variance (ANOVA). For example, say that you wanted to examine whether response time varied depending on the set size variable.

one_way.head()
shape: (5, 3)
subject  set_size  response_time
i64      i64       f64
1        3         0.884958
1        5         0.830895
1        7         0.83307
2        3         0.755198
2        5         0.868365
pg.rm_anova(
    data=one_way.to_pandas(),  # some Pingouin functions require Pandas
    dv="response_time",
    within="set_size", 
    subject="subject",
)
Source ddof1 ddof2 F p-unc ng2 eps
0 set_size 2 44 7.831295 0.001232 0.058064 0.923359

To report the results of this test, you could write something like:

We tested whether response time varied with set size using a one-way repeated-measures ANOVA. We observed a significant effect of set size (F(2, 44)=7.83, p=0.0012, ng2=0.058).

For ANOVAs, report generalized eta squared (the ng2 column) as the measure of effect size.
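
Because rm_anova returns a pandas DataFrame, you can also extract the statistics from the result to build the report string instead of copying numbers by hand. A minimal sketch, re-running the same test and storing the result (column names match the output shown above):

# store the ANOVA table and pull out the values needed for the report
res = pg.rm_anova(
    data=one_way.to_pandas(),
    dv="response_time",
    within="set_size",
    subject="subject",
)
row = res.iloc[0]
print(
    f"F({row['ddof1']:.0f}, {row['ddof2']:.0f})={row['F']:.2f}, "
    f"p={row['p-unc']:.4f}, ng2={row['ng2']:.3f}"
)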

Two-way repeated-measures ANOVA#

If two variables of interest were manipulated and you want to determine whether some measure varied with either variable (or with their combination), use a two-way repeated-measures ANOVA. For example, say that there was a set size variable that could be 3, 5, or 7 items and a probe variable that could be either target or lure, and you want to examine how response time varied based on set size and probe.

two_way.head(6)
shape: (6, 4)
subject  probe     set_size  response_time
i64      str       i64       f64
1        "lure"    3         0.895321
1        "lure"    5         0.926712
1        "lure"    7         1.028225
1        "target"  3         0.884958
1        "target"  5         0.830895
1        "target"  7         0.83307
pg.rm_anova(
    data=two_way.to_pandas(),  # some Pingouin functions require Pandas
    dv="response_time",
    within=["set_size", "probe"],
    subject="subject",
)
Source SS ddof1 ddof2 MS F p-unc p-GG-corr ng2 eps
0 set_size 1.049384 2 44 0.524692 13.292064 0.000031 0.000257 0.065393 0.721376
1 probe 0.545674 1 22 0.545674 8.312252 0.008634 0.008634 0.035106 1.000000
2 set_size * probe 0.109888 2 44 0.054944 2.167342 0.126550 0.137903 0.007274 0.807938

To report the results of this test, you could write something like:

We tested whether response time varied with set size and probe type using a two-way repeated-measures ANOVA. We observed a significant effect of set size (F(2, 44)=13.29, p=0.00026, ng2=0.065), a significant effect of probe (F(1, 22)=8.31, p=0.0087, ng2=0.035), and no significant interaction (F(2, 44)=2.17, p=0.14, ng2=0.0073). Greenhouse-Geisser-corrected p-values are reported.

The Greenhouse-Geisser correction addresses the sphericity assumption of standard repeated-measures ANOVAs, which requires (roughly) that the different conditions be equally correlated with one another. In this case, the three set sizes may not be equally correlated, and the p-GG-corr column gives a p-value that is corrected for this possibility.
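
If you want to check this assumption directly, Pingouin provides Mauchly's test of sphericity through pg.sphericity. A minimal sketch for the set size factor in the one-way data (if the test indicates a violation, report the corrected p-value):

# Mauchly's test of sphericity for the set size factor
spher_test = pg.sphericity(
    data=one_way.to_pandas(),
    dv="response_time",
    within="set_size",
    subject="subject",
)
print(spher_test.spher, spher_test.pval)  # spher is False when sphericity is violated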