Python Packages#

Python packages are used for functionality that is not included in the Python standard library. Anyone can write and share Python packages through the Python Package Index (PyPI) or other repositories like GitHub. Unlike code in Jupyter notebooks, packages make it easy to develop functions that can be used in different contexts; for example, a package might include an analysis function that is used in multiple notebooks.

Python modules#

Python modules are .py files that contain functions and attributes. They can be accessed by importing them. We can import the analysis module because it is in the same directory as this notebook.

import analysis

For example, the module has a function that takes trial types and responses and calculates d-prime, a measure of response accuracy.

trial_type = ["target", "lure", "lure", "target", "target", "target"]
response = ["old", "old", "new", "new", "old", "old"]
analysis.dprime(trial_type, response)
np.float64(0.967421566101701)
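As a rough illustration, d-prime is conventionally computed as the difference between the z-transformed hit rate and false-alarm rate. Here is a minimal sketch using only the standard library; the actual analysis.dprime implementation may differ (for example, by applying a correction for extreme rates), so its output will not match this exactly.

```python
from statistics import NormalDist

def dprime_sketch(trial_type, response):
    """Uncorrected d-prime: z(hit rate) - z(false-alarm rate)."""
    hits = sum(t == "target" and r == "old" for t, r in zip(trial_type, response))
    false_alarms = sum(t == "lure" and r == "old" for t, r in zip(trial_type, response))
    hit_rate = hits / trial_type.count("target")
    fa_rate = false_alarms / trial_type.count("lure")
    # note: rates of exactly 0 or 1 would make inv_cdf undefined here
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

trial_type = ["target", "lure", "lure", "target", "target", "target"]
response = ["old", "old", "new", "new", "old", "old"]
print(round(dprime_sketch(trial_type, response), 4))  # prints 0.6745
```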

We can use help to see docstrings for the module and each function, just like we can do for built-in modules like math and third-party modules like numpy.

help(analysis)
Help on module analysis:

NAME
    analysis - Sample module with some sample functions for data analysis.

FUNCTIONS
    dprime(trial_type, response)
        Calculate d-prime for recognition memory task responses.

        Args:
          trial_type:
            An iterable with strings, indicating whether each trial is a "target"
            or "lure".
          response:
            An iterable with strings, indicating whether the response on each trial
            was "old" or "new".

        Returns:
          The d-prime measure of recognition accuracy.

    exclude_fast_responses(response_times, threshold)
        Exclude response times that are too fast.

        Args:
          response_times:
            An iterable with response times.
          threshold:
            Threshold for marking response times. Response times less than or equal
            to the threshold will be marked False.

        Returns:
          filtered:
            An array with only the included response times.
          is_included:
          A boolean array that is False for responses less than or equal to
          the threshold, and True otherwise.

FILE
    /home/runner/work/datascipsych/datascipsych/book/assignments/assignment8/analysis.py

Let’s try adding a new function. Edit analysis.py to add a function called hello that takes no inputs and just prints "hello world" when called. Add a docstring that says "Print 'hello world'". Try calling it using analysis.hello().
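For reference, the new function in analysis.py could be as simple as this sketch:

```python
def hello():
    """Print 'hello world'."""
    print("hello world")
```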

# analysis.hello()  # uncomment to try running the new function

You probably got an error saying AttributeError: module 'analysis' has no attribute 'hello'. This is because Python does not automatically update a module when you change the code. Try importing the module again using import analysis. Then call the function using analysis.hello().

import analysis
# analysis.hello()  # uncomment to try running the new function

You probably got the same error again. This is because import only runs if the module has not been imported already. Python skips re-importing previously imported modules because importing takes time, and a module may be imported in multiple places when a package is loaded. To avoid running the same code multiple times, Python will only import a module once.

What should we do then, if we want to work on developing a new module?

We can always restart the notebook kernel and re-run all the cells. This takes time, though, and can slow things down when working on a notebook.

A better solution is to use the importlib module to reload a module that has already been imported.

import importlib
importlib.reload(analysis)
# analysis.hello()  # uncomment to try running the new function
# help(analysis.hello)  # uncomment to see function docstring
<module 'analysis' from '/home/runner/work/datascipsych/datascipsych/book/assignments/assignment8/analysis.py'>

The importlib.reload function takes a module, looks up the current source code, and updates the imported module to reflect the latest code.

Exercise: developing a module#

Add a function called ismissing to analysis.py that takes in a response NumPy array and returns a boolean array that is True for items that are equal to "n/a". Make a test array and check that your function works.

Advanced#

Extend your function to also work with lists. If one of the inputs to a function may be either a list or a NumPy array, you can use np.asarray to either convert to a NumPy array (in case the input is a list) or do nothing (if the input is already a NumPy array).
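To see what np.asarray does, here is a quick demonstration (assuming NumPy is installed, as elsewhere in this notebook):

```python
import numpy as np

x_list = [1, 2, 3]
x_arr = np.array([1, 2, 3])

# a list is converted to a new array
print(type(np.asarray(x_list)))    # <class 'numpy.ndarray'>
# an existing array is returned unchanged, without copying
print(np.asarray(x_arr) is x_arr)  # True
```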

# answer here

The search path#

Python does not automatically have access to all Python code on your computer. To import a module, it must be in a list of directories called the search path.

We can see the current path using sys.path. The sys module provides access to information that is specific to your system.

import sys
sys.path
['/opt/hostedtoolcache/Python/3.12.10/x64/lib/python312.zip',
 '/opt/hostedtoolcache/Python/3.12.10/x64/lib/python3.12',
 '/opt/hostedtoolcache/Python/3.12.10/x64/lib/python3.12/lib-dynload',
 '',
 '/opt/hostedtoolcache/Python/3.12.10/x64/lib/python3.12/site-packages']

The path shown will depend on your system. Each entry in the list is one directory where Python will look for modules.

The empty string indicates the current directory; if it is included, this means that modules in your current directory, like analysis.py, can be imported.
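To see the search path in action, the following sketch writes a throwaway module to a temporary directory, adds that directory to sys.path, and imports it. The module name mymod is hypothetical, not part of this project.

```python
import os
import sys
import tempfile

# write a minimal module to a temporary directory
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "mymod.py"), "w") as f:
    f.write("GREETING = 'hello'\n")

# add the directory to the search path, then the module is importable
sys.path.append(tmpdir)
import mymod
print(mymod.GREETING)  # prints hello
```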

You should also see a site-packages directory; this is where Python packages are installed when you run pip install.

You may also see the src directory of this project. If a project is installed using a command like pip install -e ., with the -e (editable) flag, then the source code will be added to your path so you can make changes and have them immediately “installed”.

We can use the os module to see files in a directory. Let’s use it to look at the files in the current directory.

import os
os.listdir(".")
['__pycache__', 'analysis.py', 'python_packages.ipynb', 'assignment8.ipynb']

This should show the analysis.py file that we’ve been working with. Because the current directory is automatically added to the search path, we can import modules from it.

If we look in the site-packages directory, we can see all the packages that have been installed into our virtual environment.

sp = [p for p in sys.path if p.endswith("site-packages")][0]
os.listdir(sp)[:10]
['statsmodels',
 'uc_micro',
 'zipp',
 'asttokens',
 'sniffio',
 'distutils-precedence.pth',
 'platformdirs-4.3.8.dist-info',
 'babel-2.17.0.dist-info',
 'sqlalchemy',
 'alabaster-0.7.16.dist-info']

Finally, if the datascipsych package has been installed, the search path will include a src directory containing that package. This setup helps us develop a Python package with one or more modules, which we can import and use in our notebooks.

sp = [p for p in sys.path if p.endswith("src")]
if sp:
    print(os.listdir(sp[0]))

The examples module has versions of the hello and ismissing functions that we wrote earlier.

from datascipsych import examples
help(examples)
Help on module datascipsych.examples in datascipsych:

NAME
    datascipsych.examples - Module with example functions.

FUNCTIONS
    hello()
        Print a greeting.

    ismissing(responses)
        Check if responses are n/a.

FILE
    /opt/hostedtoolcache/Python/3.12.10/x64/lib/python3.12/site-packages/datascipsych/examples.py

Note the FILE section at the end, which should show the path to the examples.py module file in your copy of the datascipsych project.

Python packages#

A common approach for Python coding is to have .py files in a directory, which can be imported as modules and used in notebooks and scripts. This approach is simpler than creating Python packages, but packages have advantages.

Python packages include information about the code version, authors, any Python version requirements, and package dependencies like NumPy and Polars. You can also indicate the specific versions of dependencies that are required. Packages are also installable, just like any of the packages you can download using pip; once your package has been installed, you can change your current directory and still have access to your code. Finally, making a package allows you to share your code in a form that others can easily install into their Python environment.

To create a Python package, we need this basic directory setup:

myproject
├── pyproject.toml
└── src
    └── mypackage
        ├── __init__.py
        ├── mymodule1.py
        └── mymodule2.py

The myproject directory has all of the files related to your project. The pyproject.toml file has metadata about your project, and includes information about how to install your package. The src directory contains source code for the package. The mypackage directory should have the same name as your package. The __init__.py file is a (usually) empty file that lets Python know that this is a package.

The module files under mypackage will be importable. For example, after installation, you could import mymodule1 using from mypackage import mymodule1.
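As a demonstration, the following sketch builds the mypackage layout in a temporary directory and imports a module from it. The names come from the example layout above; the double function is hypothetical.

```python
import os
import sys
import tempfile

# build the package layout: a package directory with __init__.py and a module
root = tempfile.mkdtemp()
pkg = os.path.join(root, "mypackage")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "mymodule1.py"), "w") as f:
    f.write("def double(x):\n    return 2 * x\n")

# once the parent directory is on the search path, the package is importable
sys.path.append(root)
from mypackage import mymodule1
print(mymodule1.double(21))  # prints 42
```

In a real project, installing the package (rather than editing sys.path) makes the same import work anywhere.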

Package configuration#

The pyproject.toml file in the main project directory provides metadata about the package, including dependencies that must be installed to use it.

The pyproject.toml file follows a standard format. The file extension is .toml, which stands for Tom’s Obvious, Minimal Language. It’s designed for files like this one, which configure how programs work.

The available features of pyproject.toml files are still evolving. See the Python Packaging User Guide for up-to-date details.

There are two main “tables”, which are indicated by a name in brackets. The tables are [project], which contains information about the project, and [build-system], which has information about the program that will be used to install the code.

The first fields specify the name and version of the package, a short description, and information about the authors. For example:

[project]
name = "datascipsych"
version = "0.1.0"
description = "Code for the Data Science for Psychology course."
authors = [
    {name = "Neal W Morton", email = "mortonne@gmail.com"}
]

TOML code looks a lot like Python code (it has lists and dictionaries), but the syntax is a little different.

The version string should generally follow Semantic Versioning. In a version string “X.Y.Z”:

“X” indicates the major version. A change in the major version indicates that there has been a breaking change; that is, updating to that version may cause code that used to work to stop working.

“Y” indicates the minor version. A change in the minor version indicates that functionality has been added in a backward-compatible manner. That is, updating will give access to new features, but should not break existing code.

“Z” indicates the patch version. A change in the patch version indicates that one or more bugs have been fixed.

If the major version is 0, that indicates that the package isn’t stable yet, meaning that updates may break existing code that uses the package.
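The ordering implied by these version strings can be illustrated with a small hypothetical helper that splits a version into integer parts:

```python
def parse_version(v):
    """Split an 'X.Y.Z' version string into a (major, minor, patch) tuple."""
    return tuple(int(part) for part in v.split("."))

# tuple comparison orders versions correctly, even with multi-digit parts
print(parse_version("2.10.0") > parse_version("2.9.3"))  # True
print(parse_version("0.1.0") < parse_version("1.0.0"))   # True
```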

In the [project] table, we should also list any required dependencies. For example, if we want to be able to import and use NumPy, Polars, and Seaborn in our code, we could include this list in pyproject.toml:

dependencies = [
    "numpy",
    "polars",
    "seaborn"
]

We can optionally specify requirements for dependency versions. See the Python Packaging User Guide for information about version specifiers.

It’s often a good idea to at least indicate the major version of dependencies, because changes to major versions may cause your code to stop working. For example, if you tested your code with NumPy 2.2.2, and want to make sure your code is always run with a compatible version, you could put "numpy ~= 2.2" in your pyproject.toml file.
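For example, version specifiers can be written directly in the dependencies list. The compatible-release operator “~= 2.2” means at least 2.2 but below 3.0:

```toml
dependencies = [
    "numpy ~= 2.2",   # compatible release: >= 2.2, < 3.0
    "polars >= 1.0",  # at least version 1.0
]
```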

Finally, we should also include information about the program that should be used to build our package for installation. There are multiple packaging programs for Python, including Setuptools, Hatchling, Flit, PDM, and Poetry.

Setuptools is the oldest, most popular, and most flexible. We’ll use Setuptools here.

To indicate the build system, we add a [build-system] table. For example, this code can be added to indicate that Setuptools should be used:

[build-system]
requires = ["setuptools >= 61.0"]
build-backend = "setuptools.build_meta"

Exercise: creating a package#

Create a new folder on your system called myproject. Open that directory in your IDE. Then create a src directory with a mypackage directory in it, and a blank __init__.py file within that.

In the myproject directory, create a file called pyproject.toml. In the [project] table, add entries for name, description, version, and authors. Add numpy and polars as dependencies. Finally, add a [build-system] table for Setuptools.

Set up a virtual environment for your project.

Version tracking#

Professional programmers use programs called version control systems (VCS) to keep track of changes to their code. This makes it possible to see when code was changed, who changed it, and why. When using a VCS, old code is never lost; it’s always available somewhere in the stored history of the code repository.

Today, most programmers use Git for version tracking. Git is a command-line tool that can be used from a terminal. There are many commands that can be used to add files to a repository, track changes, and sync the history of changes with a remote copy of the repository stored on a service like GitHub. See the cheat sheet for an overview of commands and terminology. Most day-to-day work, though, can be done through an IDE like VSCode.

There are three main parts to a Git repository: the workspace, the index, and the local repository.

The workspace includes the files that you can see in the directory being tracked. You edit files here, using programs such as an IDE.

The index is where changes are staged. The set of all staged changes can be put into a snapshot called a commit.

The local repository holds a history of all commits. The local repository can be synced with a remote repository such as a GitHub repository.
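These three parts can be seen in action in a throwaway repository. This is a sketch assuming Git is installed; the file name and commit message are just examples.

```shell
cd "$(mktemp -d)"                 # a throwaway directory for the demo
git init --quiet
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "print('hi')" > analysis.py  # workspace: edit a file
git add analysis.py               # index: stage the change
git commit --quiet -m "feat: add analysis module"  # local repository: commit
git log --oneline                 # shows the new commit
```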

Creating a Git repository#

To track code, you must first create a Git repository. It’s usually easiest to create a repository on a service like GitHub, but you can also make a local repository directly on your computer. Either way, you will need to have Git installed on your system.

To create a repository on GitHub, go to your Dashboard page and click on the “New” button. You will be prompted to write a repository name and description. Select whether you want your repository to be public or private. Check the box to “Add a README file”. Under “Add .gitignore”, select Python. This will keep Git from trying to track common temporary files associated with Python programs. If you care about how others can or can’t use your code, consider choosing a license. There is a link where you can learn more about different licenses. Finally, click on “Create repository”.

After creating your repository on GitHub, you can clone your repository to your computer using an IDE such as VSCode.

It is also possible to create a Git repository directly on your computer. You will need to initialize a Git project in your code directory before you can start tracking changes. You will have to create the README file, .gitignore file, and license file yourself.

Exercise: Creating a Git repository#

Go back to the myproject directory that you created in the last exercise. Use your IDE to initialize a Git repository. If you have a GitHub account, and if your IDE supports creating GitHub repositories, you can optionally publish your changes to a new GitHub repository.

Create a file called .gitignore with the following contents:

*.egg-info
build/

This will keep temporary package files from being tracked by Git.

Making a commit#

When you have made a change to your code or documentation that you want to add to your record of changes in your repository, you will make a commit, a snapshot of all the files in your repository after making those changes.

Before making your first commit, you will have to define the name and email address that will be associated with your commits. To do this, you will need to open a terminal program and run the following commands.

git config --global user.name "[name]"
git config --global user.email "[email address]"

Replace [name] and [email address] with the name and email address you want to associate with commits.

Next, you will have to add any changes that you want to track. Use your IDE to view your local changes and add any changes you want to commit. Most IDEs have features for adding code “chunks” within a file, so you don’t have to include all changes at once.

Try to find a set of changes that are related to one another. These changes may involve code changes in multiple files, but should be related in some way.

Finally, write a commit message to explain what changes you are making. The Conventional Commits standard describes a good general method for writing commit messages to help others (and yourself in the future) understand what you were trying to do with your changes.

Each commit has a type code to indicate the goal of the changes.

feat: New feature

fix: A fix of some bug

refactor: A change to how code is organized that does not change functionality

style: A change to coding style

build: Something related to the project or installation of the project (e.g., changes to pyproject.toml)

docs: A change to project documentation

Conventional commits start with a type code, followed by a colon, a space, and then a short message describing the change. The change is written as an imperative. For example, a new feature to add data analysis tools could have a commit message like feat: add data analysis tools.

If you look at the history of this project, you may notice that the commits use Conventional Commits format.

In your IDE, make sure all the changes you want to include are added, then enter your commit message and make the commit to add it to your repository history. If you are syncing with an external repository such as a GitHub repository, you can then push your changes to that repository. You can also pull any changes from the external repository that you don’t have yet.

Exercise: making a commit#

Go back to your myproject directory. Use your IDE to look at the changes in your local workspace. Add the .gitignore file and make a commit. Set the commit message to feat: add .gitignore file. Look at your commit and the associated changes in your IDE.

Reverting changes#

Sometimes, you will make a change that you don’t want to commit to your Git repository. IDEs commonly have features to revert changes. Reverting a file will change the file back to the last version that was stored as a commit. Make sure you are okay with losing those changes before reverting.

Another option is to stash your changes.

Stashes#

When pulling changes from an external repository, local changes in your workspace may lead to conflicts. Git’s stash feature provides a convenient way of dealing with these conflicts. You can create a stash of all uncommitted changes in your repository. Making a stash will move all your changes to the new stash, cleaning up your workspace. You can then pull any changes without conflicts. Finally, pop the stash. This will apply the changes stored in the stash to your workspace and delete the stash, which is no longer needed.