Debug CPython Lib and Module

October 18, 2024

In this post, I'm going to show how to debug Python libraries and their C module implementations in the CPython interpreter. The libraries (e.g., asyncio and json in the Lib directory of the CPython project) act as the bridge between users and the underlying C modules. Hence, when debugging the CPython interpreter, we use the built-in debugger pdb to trace the libraries in the Python world and gdb for the C world.

Debug Python Lib with PDB breakpoint()

You can simply insert breakpoint() anywhere in the Python code, then type s (or step) at the pdb prompt to step into the library function implementation.

import asyncio

async def hello_world():
    print("hello...")
    breakpoint()
    await asyncio.sleep(1)
    print("world")

asyncio.run(hello_world())

The script above stops at breakpoint(), and from there you can interactively trace the subsequent execution.
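Under the hood, breakpoint() calls sys.breakpointhook(), whose default implementation starts pdb. The sketch below illustrates that dispatch without opening an interactive prompt; recording_hook and calls are hypothetical names used only for this illustration.

```python
import sys

calls = []  # hypothetical log of hook invocations, for illustration only


def recording_hook(*args, **kwargs):
    # Stand-in for the default hook: record the call instead of starting pdb.
    calls.append((args, kwargs))


original_hook = sys.breakpointhook
sys.breakpointhook = recording_hook
try:
    breakpoint()  # dispatches to whatever sys.breakpointhook currently is
finally:
    # Restore the original hook so later breakpoint() calls behave normally.
    sys.breakpointhook = original_hook

print("hook invocations:", len(calls))
```

Because the hook is looked up on every call, a debugging session can be redirected or silenced without editing the library code that contains breakpoint().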

Cross the boundary

Some Python library functions actually delegate to an implementation in a C module. Here we use asyncio as an example of how Python and C are bridged. asyncio.events.get_event_loop() obtains the event loop of the current thread (if one exists); its sibling get_running_loop(), shown below, carries a note that it is implemented in C.

Lib/asyncio/events.py
def get_running_loop():
    """Return the running event loop.  Raise a RuntimeError if there is none.

    This function is thread-specific.
    """
    # NOTE: this function is implemented in C (see _asynciomodule.c)
    loop = _get_running_loop()
    if loop is None:
        raise RuntimeError('no running event loop')
    return loop

And the C implementation behind get_event_loop is

Modules/_asynciomodule.c
static PyObject *
get_event_loop(asyncio_state *state)
{
    PyObject *loop;
    PyObject *policy;

    _PyThreadStateImpl *ts = (_PyThreadStateImpl *)_PyThreadState_GET();
    loop = Py_XNewRef(ts->asyncio_running_loop);

    if (loop != NULL) {
        return loop;
    }

    policy = PyObject_CallNoArgs(state->asyncio_get_event_loop_policy);
    if (policy == NULL) {
        return NULL;
    }

    loop = PyObject_CallMethodNoArgs(policy, &_Py_ID(get_event_loop));
    Py_DECREF(policy);
    return loop;
}

which returns the running event loop stored in the internal per-thread state, falling back to the event loop policy when no loop is running. So how is the Python function get_event_loop bound (and forwarded) to the C version?

If you look at the last few lines of _asynciomodule.c, you can find the following declaration:

Modules/_asynciomodule.c
static struct PyModuleDef _asynciomodule = {
    .m_base = PyModuleDef_HEAD_INIT,
    .m_name = "_asyncio",
    .m_doc = module_doc,
    .m_size = sizeof(asyncio_state),
    .m_methods = asyncio_methods,
    .m_slots = module_slots,
    .m_traverse = module_traverse,
    .m_clear = module_clear,
    .m_free = (freefunc)module_free,
};

PyMODINIT_FUNC
PyInit__asyncio(void)
{
    return PyModuleDef_Init(&_asynciomodule);
}

where the module is defined with its name, the size of its per-module state, and its public methods. When asyncio is imported in Python, its events submodule imports _asyncio, and PyInit__asyncio is invoked (see importlib for more details; we may cover this later).

To declare such a lib-module bridge, you should add a line to Modules/Setup (or to Modules/Setup.stdlib.in for standard-library extension modules).
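The bridge can also be observed from the Python side. The sketch below assumes a CPython build in which asyncio/events.py keeps the pure-Python definition under the private alias _py_get_running_loop (an internal name that may change between versions) and rebinds the public name to the C function when _asyncio imports successfully. implementation_of is a hypothetical helper written for this illustration.

```python
import inspect
import asyncio.events as events


def implementation_of(func):
    """Return "C" for built-in (C-level) callables, "Python" otherwise."""
    return "C" if inspect.isbuiltin(func) else "Python"


# Private alias kept by events.py for the pure-Python definition.
print("pure-Python alias:", implementation_of(events._py_get_running_loop))

# Public name: expected to be the C version when _asyncio is available,
# and the pure-Python fallback otherwise.
print("public name:", implementation_of(events.get_running_loop))
```

If the public name is reported as Python, the _asyncio extension was probably not built or could not be imported, and a GDB breakpoint in _asynciomodule.c would never be hit.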

C module with GDB

We'll use GDB to trace the execution path in the C module implementation. It's better to build the interpreter with debug information by passing the --with-pydebug option to configure.
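Before attaching GDB, it can help to confirm that the interpreter you are about to run really is a debug build. A minimal sketch, relying on the fact that sys.gettotalrefcount only exists in debug builds; is_debug_build is a hypothetical helper name.

```python
import sys
import sysconfig


def is_debug_build():
    # sys.gettotalrefcount is only present when CPython was configured
    # with --with-pydebug, so its presence is a quick indicator.
    return hasattr(sys, "gettotalrefcount")


print("debug build:", is_debug_build())
# The build-time configuration variable gives a second opinion; it may be
# None or 0 on release builds and on some platforms.
print("Py_DEBUG config var:", sysconfig.get_config_var("Py_DEBUG"))
```

Run it with the freshly built ./python rather than the system interpreter, since the two can differ.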

Python extension for GDB

When you build CPython from source, a GDB extension file named python-gdb.py is also generated in the build directory. Following the GDB helper guide, you should relax GDB's auto-load protection by adding set auto-load safe-path / to ~/.gdbinit or ~/.config/gdb/gdbinit so that GDB can automatically load the helper script from the build directory. Note that / marks every path as trusted; listing only the build directory is the more conservative choice.

Set breakpoint

To see the frame stack when importing the asyncio module, we run the simple script above under GDB: gdb --args ./python test.py (here the python executable comes from a debug build of the source code). Then set the breakpoint via b PyInit__asyncio. You may need to allow pending breakpoints on shared libraries that are not loaded yet, or add

set breakpoint pending on

to your gdbinit file.
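The breakpoint is pending because _asyncio is typically built as a shared library that is loaded only when the import runs. The following sketch uses importlib.util.find_spec to show where a module would be loaded from; extension_origin is a hypothetical helper name, and the printed path depends on your build.

```python
import importlib.util


def extension_origin(name):
    """Return where the import system would load `name` from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# A path to a shared library means GDB cannot resolve symbols from it
# until the import has happened; "built-in" means the module is compiled
# into the interpreter binary itself.
print("_asyncio:", extension_origin("_asyncio"))
print("sys:", extension_origin("sys"))
```

If _asyncio reports a shared-library path, PyInit__asyncio is unknown to GDB at startup, which is why the pending-breakpoint setting is needed.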

With the help of python-gdb.py, you can inspect the Python frame stack via py-bt:

(gdb) py-bt
Traceback (most recent call first):
  <built-in method create_dynamic of module object at remote 0x7ffff7988890>
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 1046, in create_module
  File "<frozen importlib._bootstrap>", line 813, in module_from_spec
  File "<frozen importlib._bootstrap>", line 921, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1330, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1359, in _find_and_load
  File "/mnt/data/home/amd/repos/cpython/Lib/asyncio/events.py", line 839, in <module>
    from _asyncio import (_get_running_loop, _set_running_loop,
  <built-in method exec of module object at remote 0x7ffff796d490>
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 752, in exec_module
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1330, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1359, in _find_and_load
  <built-in method __import__ of module object at remote 0x7ffff796d490>
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1414, in _handle_fromlist
  File "/mnt/data/home/amd/repos/cpython/Lib/asyncio/base_events.py", line 40, in <module>
    from . import events
  <built-in method exec of module object at remote 0x7ffff796d490>
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 752, in exec_module
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1330, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1359, in _find_and_load
  File "/mnt/data/home/amd/repos/cpython/Lib/asyncio/__init__.py", line 8, in <module>
    from .base_events import *
  <built-in method exec of module object at remote 0x7ffff796d490>
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 752, in exec_module
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1330, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1359, in _find_and_load
  File "/mnt/data/home/amd/repos/asyncio-learning/examples/run.py", line 1, in <module>
    import asyncio

Hence we can see both the C module frames and the Python code frames.

PyObject is the "first-class citizen" of the C world of Python, from which every object type is derived. It is defined as:

Include/object.h
#ifndef Py_GIL_DISABLED
struct _object {
#if (defined(__GNUC__) || defined(__clang__)) \
        && !(defined __STDC_VERSION__ && __STDC_VERSION__ >= 201112L)
    // On C99 and older, anonymous union is a GCC and clang extension
    __extension__
#endif
#ifdef _MSC_VER
    // Ignore MSC warning C4201: "nonstandard extension used:
    // nameless struct/union"
    __pragma(warning(push))
    __pragma(warning(disable: 4201))
#endif
    union {
       Py_ssize_t ob_refcnt;
#if SIZEOF_VOID_P > 4
       PY_UINT32_T ob_refcnt_split[2];
#endif
    };
#ifdef _MSC_VER
    __pragma(warning(pop))
#endif

    PyTypeObject *ob_type;
};

which contains the reference count ob_refcnt and the type pointer ob_type. For instance, if an object prints as:

(gdb) p *module
$5 = {{ob_refcnt = 23, ob_refcnt_split = {23, 0}}, ob_type = 0x555555bc85c0 <PyModule_Type>}

we can see that this object is a PyModuleObject since its ob_type is PyModule_Type. We can then print its concrete fields with a type cast:

(gdb) p *(PyModuleObject*)module
$6 = {ob_base = {{ob_refcnt = 23, ob_refcnt_split = {23, 0}}, ob_type = 0x555555bc85c0 <PyModule_Type>},
...
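The same header can be peeked at from Python with ctypes, which is a handy cross-check for what GDB prints. This is only a sketch under several assumptions: it relies on the CPython detail that id(obj) is the object's address, it mirrors the default (non-free-threaded) layout shown above, and older builds configured with reference tracing place extra fields before ob_refcnt. PyObjectHeader and read_header are hypothetical names.

```python
import ctypes
import sysconfig
import asyncio


class PyObjectHeader(ctypes.Structure):
    # Mirrors struct _object for the default build: one pointer-sized
    # reference-count slot followed by the type pointer.
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
    ]


def read_header(obj):
    # CPython implementation detail: id(obj) is the memory address of obj.
    return PyObjectHeader.from_address(id(obj))


# The free-threaded build (Py_GIL_DISABLED) uses a different layout, so the
# sketch skips it rather than reading the wrong offsets.
if not sysconfig.get_config_var("Py_GIL_DISABLED"):
    header = read_header(asyncio)
    print("ob_type matches type(asyncio):", header.ob_type == id(type(asyncio)))
```

The ob_type comparison corresponds to the PyModule_Type address shown by GDB. The raw ob_refcnt slot is left uninterpreted here because newer CPython versions pack additional information into it.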