12. Software Packages#

Software packages organize code to make it easier to install and use. Anyone can write and share Python packages through the Python Package Index (PyPI) or other repositories like GitHub. Unlike code in Jupyter notebooks, packages make it easy to develop functions that can be used in different contexts; for example, a package might include an analysis function that is used in multiple notebooks.

12.1. Python modules#

Python modules are .py files that contain functions and attributes. They can be accessed by importing them. We can import the analysis module because it is in the same directory as this notebook.

import analysis

For example, we have a function that takes trial types and responses and calculates d-prime, a measure of response accuracy.

trial_type = ["target", "lure", "lure", "target", "target", "target"]
response = ["old", "old", "new", "new", "old", "old"]
analysis.dprime(trial_type, response)
np.float64(0.967421566101701)
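To see what a function like this does internally, here is a rough sketch of a d-prime calculation (this is not the actual code in analysis.py; it uses a log-linear correction, so its result differs from the output above):

```python
from statistics import NormalDist

def dprime(trial_type, response):
    """Sketch of d-prime: z(hit rate) minus z(false-alarm rate)."""
    hits = sum(t == "target" and r == "old" for t, r in zip(trial_type, response))
    false_alarms = sum(t == "lure" and r == "old" for t, r in zip(trial_type, response))
    n_target = sum(t == "target" for t in trial_type)
    n_lure = sum(t == "lure" for t in trial_type)
    # Log-linear correction keeps rates away from 0 and 1, where z is infinite
    hit_rate = (hits + 0.5) / (n_target + 1)
    fa_rate = (false_alarms + 0.5) / (n_lure + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

trial_type = ["target", "lure", "lure", "target", "target", "target"]
response = ["old", "old", "new", "new", "old", "old"]
print(round(dprime(trial_type, response), 3))  # 0.524 with this correction
```

The correction used by the course module may differ, which is why the two values do not match exactly.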

We can use help to see docstrings for the module and each function, just like we can do for built-in modules like math and third-party modules like numpy.

help(analysis)
Help on module analysis:

NAME
    analysis - Sample module with some sample functions for data analysis.

FUNCTIONS
    dprime(trial_type, response)
        Calculate d-prime for recognition memory task responses.

        Args:
          trial_type:
            An iterable with strings, indicating whether each trial is a "target"
            or "lure".
          response:
            An iterable with strings, indicating whether the response on each trial
            was "old" or "new".

        Returns:
          The d-prime measure of recognition accuracy.

    exclude_fast_responses(response_times, threshold)
        Exclude response times that are too fast.

        Args:
          response_times:
            An iterable with response times.
          threshold:
            Threshold for marking response times. Response times less than or equal
            to the threshold will be marked False.

        Returns:
          filtered:
            An array with only the included response times.
          is_included:
            A boolean array that is False for responses less than the threshold,
            and True otherwise.

FILE
    /home/runner/work/datascipsych/datascipsych/book/chapters/chapter12/analysis.py

Let’s try adding a new function. Edit analysis.py to add a function called hello that takes no inputs and just prints "hello world" when called. Add a docstring that says "Print 'hello world'". Try calling it using analysis.hello().

# analysis.hello()  # uncomment to try running the new function

You probably got an error saying AttributeError: module 'analysis' has no attribute 'hello'. This is because Python does not automatically update a module when you change the code. Try importing the module again using import analysis. Then call the function using analysis.hello().

import analysis
# analysis.hello()  # uncomment to try running the new function

You probably got the same error again. This is because import only runs if the module has not been imported already. Python skips re-importing previously imported modules because importing takes time, and a module may be imported in multiple places when importing a package. To avoid running the same code multiple times, Python will only import a module once.
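This caching is visible in the sys.modules dictionary, where Python stores every module it has imported; a second import statement just reuses the cached entry:

```python
import sys
import math

# Imported modules are cached in sys.modules; importing again reuses the cache
cached = sys.modules["math"]
import math  # this line does not re-run the module's code
print(sys.modules["math"] is cached)  # True
```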

What should we do then, if we want to work on developing a new module?

We can always restart the notebook kernel and re-run all the cells. This will re-run all import statements from scratch and get the latest code for each module. This takes time, though, and can slow things down when working on a notebook.

A better solution is to use the importlib module to reload a module that has already been imported.

import importlib
importlib.reload(analysis)
# analysis.hello()  # uncomment to try running the new function
# help(analysis.hello)  # uncomment to see function docstring
<module 'analysis' from '/home/runner/work/datascipsych/datascipsych/book/chapters/chapter12/analysis.py'>

The importlib.reload function takes a module, looks up the current source code, and updates the imported module to reflect the latest code.

Exercise: developing a module#

Add a function called ismissing to analysis.py that takes in a response NumPy array and returns a boolean array that is True for items that are equal to "n/a". Make a test array and check that your function works.

Advanced#

Extend your function to also work with lists. If one of the inputs to a function may be either a list or a NumPy array, you can use np.asarray to either convert to a NumPy array (in case the input is a list) or do nothing (if the input is already a NumPy array).
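As one possible sketch of the approach described above (your own solution may differ), np.asarray makes a single function work for both lists and arrays:

```python
import numpy as np

def ismissing(responses):
    """Return a boolean array that is True where a response equals "n/a"."""
    # np.asarray converts a list to an array and leaves an array unchanged
    return np.asarray(responses) == "n/a"

print(ismissing(["old", "n/a", "new"]).tolist())  # [False, True, False]
```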

# answer here

12.2. The search path#

Python does not automatically have access to all Python code on your computer. To import a module, it must be in a list of directories called the search path.

We can see the current path using sys.path. The sys module provides access to information that is specific to your system.

import sys
sys.path
['/opt/hostedtoolcache/Python/3.13.6/x64/lib/python313.zip',
 '/opt/hostedtoolcache/Python/3.13.6/x64/lib/python3.13',
 '/opt/hostedtoolcache/Python/3.13.6/x64/lib/python3.13/lib-dynload',
 '',
 '/home/runner/work/datascipsych/datascipsych/.venv/lib/python3.13/site-packages',
 '/home/runner/work/datascipsych/datascipsych/src']

The path shown will depend on your system. Each entry in the list is one directory where Python will look for modules.

The empty string indicates the current directory; if it is included, this means that modules in your current directory, like analysis.py, can be imported.

You should also see a site-packages directory; this is where Python packages are installed when you run uv sync or pip install.

You may also see the src directory of this project. If a project is installed using a command like uv sync or pip install -e ., then the source code will be added to your path so you can make changes and have them immediately “installed”.
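One way to check whether a module can be found anywhere on the current search path, without actually importing it, is importlib.util.find_spec:

```python
import importlib.util

# find_spec searches the path and returns None when a module cannot be found
print(importlib.util.find_spec("math") is not None)           # True
print(importlib.util.find_spec("not_a_real_module") is None)  # True
```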

We can use the os module to see files in a directory. Let’s use it to look at the files in the current directory.

import os
os.listdir(".")
['review.ipynb',
 'src',
 '__pycache__',
 'jupyter',
 'software_packages.ipynb',
 'analysis.py',
 'assignment12.ipynb']

This should show the analysis.py file that we’ve been working with. Because the current directory is automatically added to the search path, we can import modules from it.

If we look in the site-packages directory, we can see all the packages that have been installed into our virtual environment.

sp = [p for p in sys.path if p.endswith("site-packages")][0]
os.listdir(sp)[:10]
['imagesize-1.4.1.dist-info',
 'typing_extensions-4.15.0.dist-info',
 'terminado',
 'ipython-9.7.0.dist-info',
 'attrs-25.4.0.dist-info',
 'zipp',
 'defusedxml',
 'cffi-2.0.0.dist-info',
 'pythonjsonlogger',
 'python_dateutil-2.9.0.post0.dist-info']

Finally, if the datascipsych package is installed, the search path will include a src directory containing that package. This setup helps us develop a Python package with one or more modules, which we can import and use in our notebooks.

sp = [p for p in sys.path if p.endswith("src")]
if sp:
    print(os.listdir(sp[0]))
['datascipsych.egg-info', 'datascipsych']

The examples module has versions of the hello and ismissing functions that we wrote earlier.

from datascipsych import examples
help(examples)
Help on module datascipsych.examples in datascipsych:

NAME
    datascipsych.examples - Module with example functions.

FUNCTIONS
    hello()
        Print a greeting.

    ismissing(responses)
        Check if responses are n/a.

FILE
    /home/runner/work/datascipsych/datascipsych/src/datascipsych/examples.py

Note the FILE section at the end, which should show the path to the examples.py module file in your copy of the datascipsych project.

Exercise: the search path#

Normally, only some standard directories and the current directory are on the Python search path. However, it can be modified to add additional directories.
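As a self-contained illustration of appending to the path (using a throwaway module written to a temporary directory, rather than the exercise's src directory):

```python
import os
import sys
import tempfile

# Write a small module (hypothetical name "tempmod") to a new directory
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "tempmod.py"), "w") as f:
    f.write('GREETING = "hello from tempmod"\n')

sys.path.append(tmpdir)  # now Python can find modules in tmpdir
import tempmod
print(tempmod.GREETING)  # hello from tempmod
```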

First, try running import example_module. This import should fail because example_module is not in the same directory as this notebook and is therefore not on the Python path by default.

Try running sys.path.append("src") to add the src directory to your path temporarily. Then run import example_module to import the src/example_module.py file. Run help(example_module) to display documentation for the file. Note that it is not usually best to edit the Python path directly; instead, code should be installed by making and installing Python packages.

# answer here

12.3. Python packages#

Python packages are directories that contain code, documentation, and files that specify how the code should be installed.

A common approach for Python coding is to have .py files in a directory, which can be imported as modules and used in notebooks and scripts. This approach is simpler than creating Python packages, but packages have advantages.

Python packages include information about the code's version, authors, Python version requirements, and package dependencies like NumPy and Polars. You can also indicate the specific versions of dependencies that are required. Packages are installable just like any of the packages you can download using uv or pip, so once your package is installed, you can change your current directory and still have access to your code. Finally, making a package allows you to share your code in a form that others can easily install into their Python environment.

To create a Python package, we need this basic directory setup:

myproject
├── pyproject.toml
└── src
    └── myproject
        ├── __init__.py
        ├── mymodule1.py
        └── mymodule2.py

The myproject directory has all of the files related to your project. The pyproject.toml file has metadata about your project, and includes information about how to install your package. The src directory contains source code for the package. The myproject directory has the name that will be used when importing code from your package (this will usually have the same name as the main outer directory). The __init__.py file is a (usually) empty file that lets Python know that this is a package.

The module files under src/myproject will be importable. For example, after installation, you could import mymodule1 using from myproject import mymodule1.
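We can simulate this layout and import behavior in a temporary directory (the module contents here are hypothetical; a real package would be installed with uv or pip rather than by editing the path):

```python
import os
import sys
import tempfile

# Recreate the layout above in a temporary directory
root = tempfile.mkdtemp()
pkg = os.path.join(root, "src", "myproject")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "mymodule1.py"), "w") as f:
    f.write("def double(x):\n    return 2 * x\n")

# Installing the package would put src on the search path; simulate that here
sys.path.append(os.path.join(root, "src"))
from myproject import mymodule1
print(mymodule1.double(21))  # 42
```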

Exercise: Python packages#

Look at the main directory of the datascipsych project. Look in the src/datascipsych directory and open the datasets.py module. Try importing that module using from datascipsych import datasets. Look at the documentation for the module using help(datasets).

Open the pyproject.toml file and look at the various settings. What dependencies are listed?

# answer here

12.4. Package configuration#

The pyproject.toml file in the main project directory provides metadata about the package, including dependencies that must be installed to use it.

The pyproject.toml file follows a standard format. The file extension is .toml, which stands for Tom's Obvious, Minimal Language. It's designed for files like this, which configure how programs work.

The available features of pyproject.toml files are still evolving. See the Python Packaging User Guide for up-to-date details. Here, we will cover some of the most commonly used features.

There are two main “tables”, which are indicated by a name in brackets. The tables are [project], which contains information about the project, and [build-system], which has information about the program that will be used to install the code.

The first fields in the [project] table specify the name and version of the package, a short description, and information about the authors. For example:

[project]
name = "datascipsych"
version = "0.1.0"
description = "Code for the Data Science for Psychology course."
authors = [
    {name = "Neal W Morton", email = "mortonne@gmail.com"}
]

TOML code looks a lot like Python code (it has lists and dictionaries), but the syntax is a little different.

The version code should generally follow Semantic Versioning. In a version string “X.Y.Z”:

“X” indicates the major version. A change in the major version indicates that there has been a breaking change; that is, updating to that version may cause code that used to work to stop working.

“Y” indicates the minor version. A change in the minor version indicates that functionality has been added in a backward-compatible manner. That is, updating will give access to new features, but should not break existing code.

“Z” indicates the patch version. A change in the patch version indicates that one or more bugs have been fixed.

If the major version is 0, that indicates that the package isn’t stable yet, meaning that updates may break existing code that uses the package. New packages can start with version 0.1.0.

The [project] table also lists any required dependencies. For example, if we want to be able to import and use NumPy, Polars, and Seaborn in our code, we could include this list in pyproject.toml:

dependencies = [
    "numpy",
    "polars",
    "seaborn"
]

We can optionally specify requirements for dependency versions. It’s often a good idea to at least indicate the major version of dependencies, because changes to major versions may cause your code to stop working. For example, if you tested your code with NumPy 2.2.2, and want to make sure your code is always run with a compatible version, you could put "numpy ~= 2.2" in your pyproject.toml file.
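The "~=" ("compatible release") specifier roughly means "at least this version, but below the next major version". A simplified sketch of that rule (the real specification, PEP 440, is richer; this only handles "X.Y" specs):

```python
def is_compatible(version, spec):
    """Simplified sketch of "~= X.Y": at least X.Y, below X+1.0."""
    major, minor, *_ = (int(p) for p in version.split("."))
    spec_major, spec_minor = (int(p) for p in spec.split("."))
    # Same major version, and at least the specified minor version
    return major == spec_major and minor >= spec_minor

print(is_compatible("2.2.2", "2.2"))  # True
print(is_compatible("2.1.0", "2.2"))  # False: too old
print(is_compatible("3.0.0", "2.2"))  # False: breaking major version
```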

See the Python Packaging User Guide for information about version specifiers.

Finally, the pyproject.toml file will also include information about the program that should be used to build our package for installation. There are multiple packaging programs for Python, including uv, Setuptools, Hatchling, Flit, PDM, and Poetry.

For example, this code specifies how to build and install a package using uv:

[build-system]
requires = ["uv_build>=0.9.13,<0.10.0"]
build-backend = "uv_build"

To avoid having to manually create a package skeleton, we can use the uv init command. In the terminal, in the main package directory, run uv init --lib to generate a project skeleton with recommended files. This will include a pyproject.toml template with placeholders for project metadata, which you can then edit manually.

Exercise: creating a package#

Create a new folder on your computer called myproject. Open that directory in your IDE. Open the terminal and run uv init --lib. This will create a src directory with a myproject directory in it, and an __init__.py file within that. It will also create a .python-version file that indicates the version of Python that will be used with the project, a pyproject.toml file with information about the package, and a README file where documentation about the package can be placed.

Open the pyproject.toml in the myproject directory. In the [project] table, you will see fields for name, description, version, and authors. Edit pyproject.toml to change the name and description. Below the [project] table you will see a [build-system] table that determines how the project should be built and installed.

You can run uv add [package1] [package2] ... to add Python packages as dependencies for your package. Use uv add to add the numpy and polars packages as dependencies.

12.5. Version tracking#

Professional programmers use programs called version control systems (VCS) to keep track of changes to their code. This makes it possible to see when code was changed, who changed it, and why. When using a VCS, old code is never lost; it’s always available somewhere in the stored history of the code repository.

Today, most programmers use Git for version tracking. Git is a command-line tool that can be used from a terminal. There are many commands for adding files to a repository, tracking changes, and syncing the history of changes with a remote copy of the repository stored on a service like GitHub. See the cheat sheet for an overview of commands and terminology. Most day-to-day work, however, can be done more easily through an IDE like VSCode.

There are three main parts to a Git repository: the workspace, the index, and the local repository.

The workspace includes the files that you can see in the directory being tracked. You edit files here, using programs such as an IDE.

The index is where changes are staged. The set of all staged changes can be put into a snapshot called a commit.

The local repository holds a history of all commits. The local repository can be synced with a remote repository such as a GitHub repository.

Exercise: version tracking#

In your IDE, use its version tracking features to view the history of the datascipsych repository. This history is stored on your computer in the local Git repository. Open the notebook corresponding to this chapter in your IDE. Make an edit to code or text in the notebook and use your IDE’s version tracking features to view those modifications.

12.6. Creating a Git repository#

To track code, you must first create a Git repository. You will need to have Git installed on your system.

If you have initialized a project using uv init, then your project will already have a Git repository associated with it. The Git history is stored in a hidden directory called .git. Git repositories may also be initialized using IDE applications.

If you have an initialized Git repository, then you will be able to see the history of tracked changes to your repository in your IDE. Changes will only be added to the history if they are explicitly added and then committed.

The Git directory in your project’s directory is known as a local repository. Local repositories may be synced with online services such as GitHub, which are known as remote repositories.

We can pull from a remote repository to get the latest history of changes and apply them, or push to a remote repository to send our latest changes there.

When a Git repository has been initialized, changes have been committed to its history, and those changes have been pushed to GitHub, then anyone with access to the GitHub repository can clone it to get a complete record of all changes that have been committed to the repository.

This book has been written and edited by writing code to a local Git repository. Changes are pushed to the datascipsych repository on GitHub. The GitHub repository can then be cloned, allowing code examples to be run on your computer.

Exercise: Creating a Git repository#

Go back to the myproject directory that you created in the last exercise. Open the terminal and run ls -a to show hidden files and directories; you should see a .git directory. Look at the Git history using your IDE’s version tracking features. If you have a GitHub account, and if your IDE supports creating GitHub repositories, you can optionally publish your project to a new GitHub repository.

12.7. Making a commit#

When you have made a change to your code or documentation that you want to add to your record of changes in your repository, you will make a commit, a snapshot of all the files in your repository after making those changes.

Before making your first commit, you will have to define the name and email address that will be associated with your commits. To do this, you will need to open the terminal and run the following commands.

git config --global user.name "[name]"
git config --global user.email "[email address]"

Replace [name] and [email address] with the name and email address you want to associate with commits.

Next, you will have to add any changes that you want to track. Use your IDE to view your local changes and add any changes you want to commit. Most IDEs have features for adding code “chunks” within a file, so you don’t have to include all changes at once.

Try to add a set of changes that are related to one another. These changes may involve code changes in multiple files, but should be related in some way.

Finally, write a commit message to explain what changes you are making. The Conventional Commits standard describes a good general method for writing commit messages to help others (and yourself in the future) understand what you were trying to do with your changes.

Each commit has a type code to indicate the goal of the changes.

feat: New feature

fix: A fix of some bug

refactor: A change to how code is organized that does not change functionality

style: A change to coding style

build: Something related to the project or installation of the project (e.g., changes to pyproject.toml)

docs: A change to project documentation

Conventional commits start with a type code, followed by a colon, a space, and then a short message describing the change. The change is written as an imperative. For example, a new feature to add data analysis tools could have a commit message like feat: add data analysis tools.

If you look at the history of this project, you may notice that the commits use Conventional Commits format.

To finish making a commit, make sure all the changes you want to include are added, then enter your commit message and make the commit to add it to your repository history. If you are syncing with an external repository such as a GitHub repository, you can then push your changes to that repository. You can also pull any changes from the external repository that you don’t have yet.

Exercise: making a commit#

Go back to your myproject directory. Use your IDE to look at the changes in your local workspace. Add the .gitignore file and make a commit. Set the commit message to feat: add .gitignore file. Look at your commit and the associated changes in your IDE.

12.8. Reverting and stashing changes#

Sometimes, you will make a change that you don’t want to commit to your Git repository.

IDEs commonly have features to revert changes. Reverting a file will change the file back to the last version that was stored as a commit. Make sure you are okay with losing those changes before reverting.

When pulling changes from an external repository, local changes in your workspace may lead to conflicts. Git’s stash feature provides a convenient way of dealing with these conflicts without losing any changes.

You can create a stash of all uncommitted changes in your repository. Making a stash will move all your changes to the new stash, cleaning up your workspace. You can then pull any changes without conflicts.

Finally, pop the stash. This will apply the changes stored in the stash to your workspace and delete the stash, which is no longer needed.

Exercise: reverting and stashing changes#

Open the src/example_module.py file and add a line with this comment: # TODO: write some code here. Save the file. Then use your IDE’s version tracking features to revert the file to its previous state.

Edit the file to change the module’s docstring to add and practice stashing to the end of the sentence. Then use your IDE’s version tracking features to stash that change. Verify that your change is gone. Finally, pop the stash to re-apply your change to the file.

12.9. Summary#

Python modules contain code that can be imported and run. We can create a module by simply creating a file ending in .py that contains Python code.

The Python search path determines what .py files can be imported. To import a file, you must make sure it is somewhere on the search path, which can be accessed through the path attribute of the sys module. Python automatically includes the current directory on the search path, so .py files in the same directory as a Jupyter notebook may be imported.

Python packages include both Python code and instructions for installing it. This includes information about dependencies, which are other Python packages that are used in the package. Python packages should also contain metadata describing the purpose of the package, listing authors, and the system that should be used for installing the package. Packages should also include a README file documenting how to install and use it.

Programs for version tracking such as Git are used to keep track of changes to code. Each set of changes is called a commit and contains information about what files were added or changed, the author, and the date and time. Websites such as GitHub can be used to make it more convenient to share code with others.