Project Iteration#

Real code projects change over time as developers add features, refactor code to improve organization, and write tests to make sure their code works as intended. These iterative changes can make the difference between a project that delivers reliable results and one that is abandoned. We will discuss some important ways that projects can be developed over time to improve usability, flexibility, and extensibility.

Python scripts#

Python scripts are useful for running programs that do not need to be interactive; that is, once they are started, they do not require the user to do anything. They can be very powerful for completing repetitive tasks.

In this course, we have focused on Jupyter notebooks, which are very flexible and well-suited for data analysis and visualization. However, they are relatively complicated to run. First, you have to open an IDE like Visual Studio Code or a web browser. Then you have to set up a kernel to run the notebook. Finally, you have to execute the code you want to run, either by running all cells or by running individual cells manually.

Python scripts, in contrast, can be executed by running one command in a terminal. The script interface can be written to be flexible, allowing the user to easily change options that affect how the program runs.

Script types#

There are two basic kinds of scripts: script files and installed scripts.

Script files are individual .py files that can be run using the python command. To use them, you must either be in the same directory as the script or know the full path to it.

Installed scripts work like any other command installed on your computer, such as python or pip. You can run them just by typing the name of the script, without specifying the full path to their location, and they can be executed regardless of your current directory.

Script files#

Script files are simple to write; you just create a file with a .py extension with the Python code you want to execute.

A very simple script#

Let’s try making a simple script. We’ll start with a script that just prints “Hello world.”

Create a new file in the main directory of this project called hello.py, with the following contents:

print("Hello world.")

This isn’t much of a script, but it technically qualifies. Run it by opening a terminal and typing python hello.py.

Using arguments#

Similar to Python functions, most scripts take at least one argument. Arguments are used to specify something about how the script will run.

Arguments come after the name of the script and are separated by spaces:

python myscript.py argument1 argument2 argument3 ...

In the code for the script, we can fetch any arguments that the user supplied using the sys module. The sys.argv variable is a list containing the name of the script followed by any arguments passed to it. sys.argv[0] has the name of the script, sys.argv[1] has the first argument, sys.argv[2] has the second argument, and so on.

Let’s edit hello.py to take one argument, which we will call user.

import sys
user = sys.argv[1]  # gets the argument to the script
print(f"Hello {user}.")

Now we can call our script with an argument. For example, we can greet Dave using python hello.py Dave.

Exercise: script file#

Write a new script called describe.py that loads a CSV file using Polars and runs describe to display an overview.

Your script should take one argument that gives the path to a CSV file. For example, from the main project directory, the path to the Osth2019 dataset is src/datascipsych/data/Osth2019.csv. Your script should read the CSV using Polars, get a description of the dataset using the describe function, and print it.

Use your script to print a description of the Osth2019 dataset.

Installed script commands#

Installed scripts are commands that have been installed into your virtual environment. They are a little more complicated to set up.

First, we need a function in one of the modules in our package that we want to turn into a script. We can use the hello function in the cli module (“cli” stands for command-line interface). The function, together with the import it depends on, looks like this:

import sys


def hello():
    """Print a greeting."""
    if len(sys.argv) > 1:
        user = sys.argv[1]
    else:
        user = "world"
    print(f"Hello {user}.")

In this function, we have an optional user argument. If it is not specified (which we can tell because sys.argv has fewer than two items), we use the default value, "world".

To make the hello function available as a command, we must have settings in the pyproject.toml file to indicate the name we want for the new command and where the function can be found:

[project.scripts]
hello = "datascipsych.cli:hello"

To specify where the function is, start with the package and a dot (datascipsych.), then the module and a colon (cli:), and finally the name of the function (hello). That gives us "datascipsych.cli:hello". The text on the left side of the equal sign indicates what the new command should be named (hello).

Run pip install -e . to install the datascipsych package, including our hello command. When installing the package, pip will find our hello function and make it into a command that we can call from the terminal.

Open a terminal and try running hello and hello Dave. Note that, unlike when we used a script file, now we are using an installed command. That is why we don’t have to write python first or give a full filename this time; instead we can just type hello.

Using Click#

Packages have been developed to make it easier to create command-line tools using Python. Click allows you to quickly add more advanced features, like optional inputs, just by adding a few lines of code before a function.

In the cli.py module, we have another function called hello_click (shown with the import it depends on):

import click


@click.command()
@click.option("--user", default="world", help="User to greet.")
def hello_click(user):
    """Print a greeting."""
    print(f"Hello {user}.")

The @click statements are an example of what is called a function decorator. Function decorators, which start with an @ sign, allow a function to be modified or wrapped without changing the code inside it. In this case, the Click package uses decorators to turn an ordinary function into a command that can get inputs from the terminal and pass them into the function.
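To illustrate the general idea, here is a minimal decorator that is unrelated to Click (the names here are made up for this example). It wraps a function so that its return value is modified:

def shout(func):
    """Modify a function so its return value is uppercased."""
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout
def greet(user):
    return f"Hello {user}."

print(greet("Dave"))  # prints "HELLO DAVE."

The @shout line replaces greet with the wrapped version. Click uses the same mechanism to attach command-line handling to hello_click.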

In pyproject.toml, this line under [project.scripts] sets up a command called hello-click.

hello-click = "datascipsych.cli:hello_click"

Run pip install -e . to install hello-click.

In the terminal, run hello-click --help. You should see a message showing the options for running the command. Click automatically puts this message together for us, based on how we have set up the function. You can customize the user with the --user flag. For example, try hello-click --user Dave.

Click has a lot of features to make it easier to define inputs to Python scripts. See the website for details.

Using unit tests to ensure code correctness#

When developing code, it is very important to check that the code is working as expected. Unit tests are a powerful method for checking code output.

Using assert statements#

The assert statement makes it easy to run a check of some assumption or output from a function.

An assert statement evaluates an expression and checks whether the result is false. If it is, an AssertionError is raised. For example, running the code below will raise an error.

test_variable = 2
assert test_variable == 3

The idea is that we made an assertion that test_variable == 3. Our assertion was incorrect, because test_variable is actually 2, so an error is raised.

We can use assert statements to check assumptions and throw an error if our assumptions are incorrect. For example, we can write a unit test by defining some input to a function, calling it to get an output, and comparing that to the correct answer that we have calculated by hand.

Say we have written a function to calculate the standard error of the mean (SEM) using NumPy, and we want to make sure it is working as expected. We define a test case (here, a set of observations of some hypothetical measure) and manually calculate the SEM for that test case. Finally, we use an assert statement to check that our function gives the same result as our manual calculation.

import numpy as np
def sem(x):
    """Calculate the standard error of the mean for an array."""
    return np.std(x, ddof=1) / np.sqrt(x.size)


x_to_test = np.array([1, 3, 2, 4, 8, 2, 4])
# Manual calculations: STD=2.299, N=7, sqrt(N)=2.646, SEM=STD/sqrt(N)=0.869
assert np.round(sem(x_to_test), 3) == 0.869  # compare the function output to our calculations

Note that this is a similar pattern to what we have used in the assignments in this course. Assert statements can be used to automatically check answers that are output from analysis code (assuming you know the correct answer for a test case, or know how to calculate it).

Test-driven development#

In the test-driven development method, you choose a test case, determine what the result should be for that test case, write an automated test, and then write code to make your test work.

This can be very helpful when you have clear requirements for how code should run, which are relatively easy to write into a test. Then you have a concrete target when writing your code.
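As a sketch of this workflow (the rescale function and its test values are hypothetical, not part of our package), we might first write the test, then the code:

# Step 1: choose a test case and work out the expected result by hand.
# Rescaling [1, 2, 3] to the range [0, 1] should give [0.0, 0.5, 1.0].
def test_rescale():
    assert rescale([1, 2, 3]) == [0.0, 0.5, 1.0]

# Step 2: write the simplest code that makes the test pass.
def rescale(x):
    low, high = min(x), max(x)
    return [(v - low) / (high - low) for v in x]

Until Step 2 is done, the test fails; once it passes, we know the function meets our requirement for that test case.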

Completing the assignments in this course is an example of test-driven development. Problems involve writing code that will pass the set of assert statements at the end of each problem, so you can automatically check whether your solution is (probably) correct.

Running test suites using pytest#

A code project may have many different tests, put together as a test suite. This makes it possible to automatically test whether different aspects of the project are working as expected.

To make it easier to run test suites, developers have created systems for managing and running tests. The pytest package makes it easy to add tests to a project.

To use pytest, first create a directory in the main project directory called tests. Under the tests directory, add .py files that start with test. For example, if you wanted to test some data cleaning functions, you could place your tests in tests/test_data_cleaning.py.
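The resulting project layout might look something like this (the src layout matches this project; your file names may differ):

pyproject.toml
src/
    datascipsych/
        ...
tests/
    test_data_cleaning.py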

To add a test, edit your .py file to add functions whose names start with test. For example:

import polars as pl


def test_drop_nulls():
    df = pl.DataFrame({"a": [1, 2, 3, 4, 5], "b": [1, 1, 2, None, 2]})
    df_drop = df.drop_nulls()
    assert df_drop.shape == (4, 2)  # check that row with null has been dropped

You can install pytest using pip install pytest. Then open a terminal and run pytest. That should automatically detect your tests directory, the test_data_cleaning.py module, and the test_drop_nulls function within it. It will run all the tests it finds and let you know if any of them raised an error. If there were no errors, the test “passed”; if there was an error, the test “failed”.

Having a suite of tests that can be run using pytest is very helpful for periodically checking that code still runs as expected.

For example, changes to dependencies can sometimes violate your expectations, leading to broken code. Test suites make it much easier to identify problems when they appear.

The pytest package has a lot of options, including tools for preparing and using test data. See the documentation for details.

Debugging#

Debugging tools make it easier to figure out what is going wrong in a Python program. They are most useful for diagnosing problems that occur in functions, loops, or long scripts, where variables can be hard to keep track of.

The calculate_stats function below has a bug in it. If we try to call it, there will be an error. To help figure out what is going on, we can add a breakpoint. Breakpoints indicate where execution should pause so that we can inspect the state of the program. Try uncommenting the calculate_stats call below; it will throw a ColumnNotFoundError. Then try it again with a breakpoint. In VS Code, we can click on the left side of the code cell, next to the stats = ... line, to add a breakpoint. After adding a breakpoint, click on the menu next to the play button at the top left of the code cell and select Debug Cell.

import polars as pl
def calculate_stats(df, subject, condition, dv):
    subject_means = (
        df.group_by(subject, condition)
        .agg(mean=pl.col(dv).mean())
        .sort(subject, condition)
    )
    stats = subject_means.group_by(condition).agg(pl.col(dv).mean()).sort(condition)
    return stats

df = pl.DataFrame(
    {
        "subject": [1, 1, 1, 1, 2, 2, 2, 2], 
        "condition": [1, 1, 2, 2, 1, 1, 2, 2], 
        "correct": [0, 1, 0, 1, 1, 1, 0, 0]
    }
)
# calculate_stats(df, "subject", "condition", "correct")  # this will throw an error

The Debug Console makes it possible to inspect variables at a breakpoint. From that, we can see the problem: in the subject_means DataFrame, there is no "correct" column. Instead, we need to use the new "mean" column.

This version of the function works as expected, giving us an average for each condition.

def calculate_stats(df, subject, condition, dv):
    subject_means = (
        df.group_by(subject, condition)
        .agg(mean=pl.col(dv).mean())
        .sort(subject, condition)
    )
    stats = subject_means.group_by(condition).agg(pl.col("mean").mean()).sort(condition)
    return stats

calculate_stats(df, "subject", "condition", "correct")
shape: (2, 2)
┌───────────┬──────┐
│ condition ┆ mean │
│ ---       ┆ ---  │
│ i64       ┆ f64  │
╞═══════════╪══════╡
│ 1         ┆ 0.75 │
│ 2         ┆ 0.25 │
└───────────┴──────┘

Sharing Python packages#

Python packages can be shared with others through PyPI and GitHub. Both methods make it possible for others to install your package using Pip.

Sharing through the Python Package Index#

Python packages can be published to the official Python Package Index (PyPI) to make them easily accessible to users. Packages hosted there can be installed by just running pip install [packagename], where [packagename] is the name of your package. For example, my package for analysis of free-recall data, Psifr, can be installed by running pip install psifr.

If you have followed the directions in this course for setting up an installable package with a pyproject.toml file, you have already done most of the work necessary to host a package on PyPI. See the Python Packaging User Guide for details.
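As a rough sketch, publishing a release generally involves building distribution files and then uploading them; the build and twine packages are commonly used for this (the commands below assume you have registered a PyPI account):

pip install build twine
python -m build  # create source and wheel distributions in the dist directory
twine upload dist/*  # upload the distributions to PyPI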

Sharing through GitHub#

You can also install packages directly from GitHub. For example, to install Psifr from the latest code on GitHub:

pip install psifr@git+https://github.com/mortonne/psifr

To install a package from PyPI, we only have to indicate the name of the package (for example, psifr). When installing from GitHub, we need to specify more information.

First, we indicate the name the package should be installed under using psifr@. Next, git+ indicates that we want to access a Git repository. Finally, we have the URL for the GitHub webpage for the project we want to install: https://github.com/mortonne/psifr. See the Pip documentation page on VCS support for details.

A GitHub dependency, like psifr@git+https://github.com/mortonne/psifr, can also be used in a dependency list in a pyproject.toml file to indicate that a project requires that package. When installing the project, the GitHub package will automatically be downloaded and installed.
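For example, the dependency list in pyproject.toml might look something like this (the polars entry is just a placeholder for other dependencies):

[project]
dependencies = [
    "polars",
    "psifr@git+https://github.com/mortonne/psifr",
]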

Using a third-party package#

Once we have installed a package using Pip, we can run code from the modules in that package. For example, this code will load some sample data and convert it to a Polars DataFrame. Try installing Psifr from PyPI or GitHub, then uncommenting the code below and running it. In Visual Studio Code, you can uncomment a block of code by highlighting it and running Edit > Toggle Line Comment.

# import polars as pl
# from psifr import fr
# df = fr.sample_data("Morton2013")
# data = pl.DataFrame(fr.merge_free_recall(df))
# data.head()

Sharing Jupyter notebooks#

When collaborating or sharing results with others, you can have them clone or download your code project; however, running the code involves some setup work. Different methods can be used to share Jupyter notebooks more directly.

Static notebooks on GitHub#

GitHub makes it easy to share the results of a data analysis project. Jupyter notebooks are automatically rendered, allowing visitors to see your code and the results that you got the last time you ran it. However, notebooks are only updated when you push changes to GitHub, and users will not be able to edit and run code themselves.

For example, the sample project has a notebook on GitHub that you can view. The output of each cell shows what the results were the last time the notebook was run and changes were pushed to GitHub.

Executable notebooks on Binder#

The Binder service lets you create an interactive notebook that you can share with anyone, to let them run your code interactively. This can be a convenient way to share results with a collaborator without them having to clone your project, create a Python virtual environment, install your project, open your notebook, and specify the kernel.

To use Binder, you provide the URL for a GitHub repository and the path to a Jupyter notebook relative to the main directory of your repository. After you fill in a form with information about the repository and the location of the notebook you want to run, Binder will create an environment to run the notebook and open an interface where you can make edits and run code.

Binder currently does not support installing dependencies from a pyproject.toml file. You can either create a file in your main project directory called requirements.txt with one dependency per line, or you can instruct users to add code at the top of the notebook to install any necessary dependencies.
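For the first option, a minimal requirements.txt listing two dependencies (the exact packages will depend on your notebook) would look like this:

polars
matplotlib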

For the second option, to install Polars, users can add this line at the top of the notebook:

!pip install polars

The ! indicates to Jupyter that you want to run something outside the usual Jupyter environment. It lets you run pip directly from inside a notebook to install Polars into the environment that is running the notebook.

To use a module from a code project on GitHub (for example, if the notebook is designed to use a module defined in the same project), you can use the same method as in the Sharing through GitHub section. For example:

!pip install project@git+https://github.com/mortonne/datascipsych-project

Google Colab#

The Google Colab website is another option for hosting interactive notebooks. The default kernel has many common data science packages already installed, so setup is sometimes easier compared to using Binder.

Python packages for data science#

This course focuses on a core set of packages for data science, including NumPy, SciPy, Polars, Matplotlib, Seaborn, and Pingouin. However, there are other packages that are useful for more specialized applications, such as advanced statistics and machine learning. Working in Python makes it easy to incorporate these tools into your analyses.

Advanced statistics#

The statsmodels package provides many methods for estimating regression models, including linear regression, logistic regression, and multilevel regression. Models can be specified using formulas like y ~ x, similar to R.
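As a minimal sketch of the formula interface (the data here are made up for illustration), a linear regression can be fit like this:

import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data: does study time (x) predict test score (y)?
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6], "y": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]})
model = smf.ols("y ~ x", data=df).fit()  # ordinary least squares regression
print(model.summary())  # coefficient estimates, standard errors, and fit statistics

Note that statsmodels works with pandas DataFrames, so data in a Polars DataFrame can be converted first using the to_pandas method.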

The Bambi package helps define and estimate Bayesian multilevel models.

Machine learning#

The scikit-learn package provides tools for pattern classification, clustering, and dimensionality reduction, which are useful for working with high-dimensional data such as brain activity measures.
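As a tiny sketch of the scikit-learn interface (using simulated data rather than real brain measures), a classifier can be trained and evaluated like this:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# simulate a small, high-dimensional classification dataset
X, y = make_classification(n_samples=100, n_features=20, random_state=1)
clf = LogisticRegression().fit(X, y)  # train a pattern classifier
print(clf.score(X, y))  # classification accuracy on the training data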

The TensorFlow package helps to create, train, and deploy deep learning models. It also provides access to many pre-trained machine-learning models that are available on TensorFlow Hub. Deep learning models are increasingly being used in psychology research to simulate perception, learning, and decision making.

The PyTorch package is another popular package for creating and training deep learning models. Compared to TensorFlow, it is used more often for research and small-scale projects.

Summary#

Python scripts#

Scripts are command-line tools that you can run in the terminal. They make it easier to run programs that do not need to be as interactive as Jupyter notebooks. Scripts may be standalone files or may be installed as commands.

Unit tests#

Unit tests are automatic checks of whether code is producing expected results. They can be written one at a time using individual assert statements or collected into a test suite using tools like pytest.

Debugging#

Debugging tools can help identify problems with functions and loops, by making it possible to inspect variables during code execution.

Sharing Python packages#

Python packages can be shared with others through the Python Package Index or version-tracking services like GitHub.

Sharing notebooks#

Static notebooks can be shared through GitHub. Dynamic, executable notebooks can be shared through services like Binder and Google Colab.

Python packages for data science#

Python has a large ecosystem of data science packages that include tools for advanced statistics and machine learning.