In this short series of blog posts, I will show you how to fuzz Python native extensions with Atheris, libFuzzer, and AFL++.
In this post, I focus on whitebox fuzzing, which assumes that the source code of the native extension is available. In the next post, I cover the opposite case: blackbox (binary-only) fuzzing.
Why should I fuzz native extensions?
Behind the scenes, many Python modules use C/C++ to run performance-critical code. For example, running cloc on NumPy 1.26.4 reports roughly 167,000 lines of C code, plus another 48,000 lines of C/C++ headers and C++. C/C++ is prone to memory corruption, and a memory corruption bug in a native extension can let an attacker escape a Python sandbox (see the example exploit in NumPy 1.11.0).
    ~/numpy$ cloc *
        2041 text files.
        2039 unique files.
         162 files ignored.

    github.com/AlDanial/cloc v 1.90  T=4.56 s (416.3 files/s, 183300.1 lines/s)
    -------------------------------------------------------------------------------
    Language                     files          blank        comment           code
    -------------------------------------------------------------------------------
    Python                         736          47161          81270         170695
    C                              173          40866          71153         167309
    reStructuredText               479          23464          16957          65645
    CSV                             30              0              0          36702
    C/C++ Header                   230           4563           7477          32005
    SVG                             26             48             11          19314
    C++                             32           2405           1360          16052
    Cython                          19           2875           7664           5723
    SWIG                             9            330            538           2573
    Meson                           17            194            344           2458
    diff                             9            350           1556           1817
    Fortran 90                      59            120             91            953
    Fortran 77                      36             28            109            537
    YAML                             5             63             62            354
    Markdown                        10            110              0            342
    Bourne Shell                    10             45             96            311
    make                             4             48             44            213
    TOML                             2             33             12            173
    sed                              1              0             12            139
    JSON                             1             18              0             73
    DOS Batch                        1             11              8             55
    CSS                              1             17              6             51
    CMake                            1              9              9             47
    INI                              3              1              0             38
    NAnt script                      1              7              0             31
    HTML                             1              2              0             21
    TeX                              1              0              0             20
    -------------------------------------------------------------------------------
    SUM:                          1897         122768         188779         523651
    -------------------------------------------------------------------------------
Among the many techniques for detecting bugs, fuzzing is an effective and easy way to find memory corruption bugs in Python native extensions. For example, Trail of Bits found a bug in the cbor2 module with Atheris, the fuzzer introduced below.
How can I find fuzzing targets?
If you have the source code available, this is more or less straightforward. For example, go to GitHub and identify native functions that parse obscure or complex data structures (e.g., file parsers). Ideally, such functions handle input that an attacker can control via untrusted channels (e.g., sockets or files). Functions that work with Python objects make interesting targets too. See the warning from the Python documentation:
To avoid memory corruption, extension writers should never try to operate on Python objects with the functions exported by the C library: malloc(), calloc(), realloc() and free().
If you don’t have the native extension’s source code, you can simply import it into Python and run dir(<native_extension>) to list its available methods. You could also open the binary in a decompiler and look for references to PyArg_ParseTuple to identify exported functions.
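To make the dir() approach concrete, here is a minimal sketch that filters a module's attributes down to functions implemented in native code. It uses the standard library's binascii module purely as a stand-in for the extension under test, and the helper name list_native_callables is made up for this example.

```python
import importlib
import inspect


def list_native_callables(module_name):
    """Return the public names in a module that are built-in (C-level) functions."""
    module = importlib.import_module(module_name)
    names = []
    for name in dir(module):
        if name.startswith("_"):
            continue  # skip private and dunder attributes
        # inspect.isbuiltin() is true for functions implemented in C
        if inspect.isbuiltin(getattr(module, name)):
            names.append(name)
    return names


if __name__ == "__main__":
    # binascii is only a stand-in; pass the name of your own extension instead
    for name in list_native_callables("binascii"):
        print(name)
```

Note that inspect.isbuiltin() only matches functions, so types defined in C will not show up in this list. Functions that accept bytes or buffers are usually the first candidates worth a harness.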
Fuzz hello world extension with Atheris
I prepared a hello world native extension as a vulnerable example to get you started. In this section, we’ll fuzz it with Atheris (version 2.3.0), which is a coverage-guided fuzzer for Python code as well as native C extensions.
First, go through the required setup steps:
- Make sure you have a working Docker installation.
- Clone this repository and cd into it:
    $ git clone -b fuzz-extensions-in-2024 https://github.com/stulle123/fuzzing_native_python_extensions
- Grab two ☕ ☕ and build the Docker container (this will take a while):
    $ docker build -t fuzz -f Dockerfile .
- Run it (all your changes will be lost once you exit the container; only the fuzzer’s crash files will be saved to your host’s current working directory):
    $ docker run -it --rm -v $(pwd):/app/output/ --name fuzz fuzz bash
- Now we can manually trigger the bug:
    [fuzz e20a39df2e4a] /app $ LD_PRELOAD=$(python3 -c "import atheris; print(atheris.path())")/asan_with_fuzzer.so \
      ASAN_OPTIONS="allocator_may_return_null=1,detect_leaks=0,external_symbolizer_path=$CLANG_DIR/bin/llvm-symbolizer" \
      python3 -c "import memory; memory.corruption(b'FUZZ')"
- Next, trigger the bug with Atheris. Crash files will be stored on your host’s current working directory.
    [fuzz e20a39df2e4a] /app $ LD_PRELOAD=$(python3 -c "import atheris; print(atheris.path())")/asan_with_fuzzer.so \
      ASAN_OPTIONS="allocator_may_return_null=1,detect_leaks=0,external_symbolizer_path=$CLANG_DIR/bin/llvm-symbolizer" \
      python3 ./fuzzing_native_python_extensions/atheris/atheris_fuzz.py -artifact_prefix=/app/output/
Fuzz NumPy with Atheris
Next, we will look at a real-world integer overflow bug in NumPy v1.11.0, which I took from Gabe Pike’s great blog post. Can you spot it 😉? (Spoiler)
    NPY_NO_EXPORT PyObject *
    PyArray_Resize(PyArrayObject *self, PyArray_Dims *newshape, int refcheck,
                   NPY_ORDER order)
    {
        // npy_intp is `long long`
        npy_intp* new_dimensions = newshape->ptr;
        npy_intp newsize = 1;
        int new_nd = newshape->len;
        int k;
        // NPY_MAX_INTP is MAX_LONGLONG (0x7fffffffffffffff)
        npy_intp largest = NPY_MAX_INTP / PyArray_DESCR(self)->elsize;
        for(k = 0; k < new_nd; k++) {
            newsize *= new_dimensions[k];
            if (newsize <= 0 || newsize > largest) {
                return PyErr_NoMemory();
            }
        }
        if (newsize == 0) {
            sd = PyArray_DESCR(self)->elsize;
        }
        else {
            sd = newsize*PyArray_DESCR(self)->elsize;
        }
        /* Reallocate space if needed */
        new_data = realloc(PyArray_DATA(self), sd);
        if (new_data == NULL) {
            PyErr_SetString(PyExc_MemoryError,
                            "cannot allocate memory for array");
            return NULL;
        }
        ((PyArrayObject_fields *)self)->data = new_data;
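Spoiler ahead: the size check runs only after the multiplication, so a product that wraps around can slip past it. The sketch below replays the loop from the excerpt in Python. It assumes signed 64-bit multiplication wraps in two's complement (formally undefined behavior in C, but what typically happens on x86-64), and the helper names wrap_int64 and checked_size are made up for illustration.

```python
NPY_MAX_INTP = (1 << 63) - 1  # 0x7fffffffffffffff


def wrap_int64(value):
    """Reduce an arbitrary Python int to a signed 64-bit value (two's complement)."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


def checked_size(dims, elsize):
    """Replay the size loop from the excerpt; return newsize, or None if rejected."""
    largest = NPY_MAX_INTP // elsize
    newsize = 1
    for dim in dims:
        newsize = wrap_int64(newsize * dim)  # the overflow happens here
        if newsize <= 0 or newsize > largest:
            return None  # corresponds to PyErr_NoMemory()
    return newsize


if __name__ == "__main__":
    # 3 * 6148914691236517206 == 2**64 + 2, which wraps around to 2
    dims = (3, 6148914691236517206)
    print("accepted element count:", checked_size(dims, elsize=8))
    print("element count implied by the shape:", dims[0] * dims[1])
```

With these dimensions the check accepts a size of 2 elements, so only 16 bytes would be reallocated while the requested shape describes about 1.8e19 elements. That mismatch between buffer size and shape is what makes out-of-bounds access possible.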
As with the other example, let’s trigger the bug manually first and then with Atheris:
- Trigger the bug with my example trigger_bug_in_numpy_1.11.0.py script:
    [fuzz e20a39df2e4a] /app $ LD_PRELOAD=$($CC -print-file-name=libclang_rt.ubsan_standalone-x86_64.so) \
      ASAN_OPTIONS="allocator_may_return_null=1,detect_leaks=0,external_symbolizer_path=$CLANG_DIR/bin/llvm-symbolizer" \
      python3 ./fuzzing_native_python_extensions/course/trigger_bug_in_numpy_1.11.0.py
- Now, find the bug with Atheris:
    [fuzz e20a39df2e4a] /app $ LD_PRELOAD=$(python3 -c "import atheris; print(atheris.path())")/asan_with_fuzzer.so \
      ASAN_OPTIONS="allocator_may_return_null=1,detect_leaks=0,external_symbolizer_path=$CLANG_DIR/bin/llvm-symbolizer" \
      python3 ./fuzzing_native_python_extensions/course/numpy_fuzz.py -artifact_prefix=/app/output/
Fuzz hello world extension with libFuzzer
Atheris runs libFuzzer under the hood, so it’s no surprise that you can also fuzz Python native extensions with libFuzzer directly:
- Build the fuzzing harness:
    [fuzz e20a39df2e4a] /app $ clang++ $(python3-config --embed --cflags) $(python3-config --embed --ldflags) \
      -fsanitize=address,fuzzer -g -o lib_fuzz ./fuzzing_native_python_extensions/libfuzzer/lib_fuzz.c
- Run it:
    [fuzz e20a39df2e4a] /app $ ASAN_OPTIONS="allocator_may_return_null=1,detect_leaks=0,external_symbolizer_path=$CLANG_DIR/bin/llvm-symbolizer" \
      ./lib_fuzz -artifact_prefix=/app/output/
Fuzz hello world extension with AFL++
Fuzzing Python native extensions with AFL++ (version 4.22a) is also straightforward:
- Remove the memory Python module and rebuild it with the AFL++ compiler wrappers:
    [fuzz e20a39df2e4a] /app $ pip3 uninstall -y memory && rm -rf ./fuzzing_native_python_extensions/build/ && \
      CC=afl-clang-fast CXX=afl-clang-fast++ \
      LD=afl-clang-fast LDSHARED="clang -shared" python3 -m pip install ./fuzzing_native_python_extensions
- Build the fuzzing harness:
    [fuzz e20a39df2e4a] /app $ afl-clang-fast $(python3-config --embed --cflags) $(python3-config --embed --ldflags) \
      -o afl_fuzz ./fuzzing_native_python_extensions/afl++/whitebox_fuzz.c
- Run it:
    [fuzz e20a39df2e4a] /app $ afl-fuzz -i ./fuzzing_native_python_extensions/afl++/in -o /app/output -- ./afl_fuzz
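afl-fuzz reads its initial test cases from the directory given with -i and needs at least one file there. The repository already ships an in directory, but if you adapt the setup to your own extension you will need your own seeds. Here is a small sketch for generating a few. The seed names and payloads are arbitrary examples of mine, and the demo writes to a temporary directory rather than the repository.

```python
import pathlib
import tempfile

# Hypothetical seed inputs; replace them with samples that resemble real input
SEEDS = {
    "ascii": b"FUZZ",
    "nul": b"\x00",
    "long": b"A" * 64,
}


def write_seed_corpus(directory):
    """Write one file per seed into directory and return the created paths."""
    out = pathlib.Path(directory)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, payload in SEEDS.items():
        path = out / f"seed_{name}.bin"
        path.write_bytes(payload)
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_seed_corpus(tmp):
            print(path.name, path.stat().st_size, "bytes")
```

Small, valid-looking seeds tend to work best, since AFL++ mutates them and large files slow down every execution.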
Conclusion
Of the three popular fuzzers explored here, Atheris is the most straightforward to set up. I can’t say which fuzzer is the fastest, as I haven’t taken any measurements. I’ll leave that to a future post ;)
In the next blog post, I will detail how to fuzz Python C extensions when the source code isn’t available.