Out-of-memory issue
Working with big data frames can crash a process with an out-of-memory error. Even with Spark, this can happen when you collect results back to the driver (master) node.
The out-of-memory issue can be prevented in the following two ways:
1: signal the Garbage Collector asap
(venv) ➜ ~ python -m memory_profiler t.py
Line #    Mem usage     Increment  Occurrences   Line Contents
==============================================================
    35   150.836 MiB   150.836 MiB           1   @profile
    36                                           def func():
    37   192.664 MiB    41.828 MiB           1       df1 = <big df 1>
    38  1347.035 MiB  1154.371 MiB           1       df2 = <big df 2>
In the code above, two big data frames, df1 and df2, exist in memory at the same time: df1 takes 41.828 MiB and df2 takes 1154.371 MiB (see the Increment column).
A better way to use memory is to load and process one data frame at a time, and del it as soon as it is no longer needed.
(venv) ➜ ~ python -m memory_profiler t.py
Line #    Mem usage     Increment  Occurrences   Line Contents
==============================================================
    35   150.738 MiB   150.738 MiB           1   @profile
    36                                           def func():
    37   192.449 MiB    41.711 MiB           1       df1 = <big df 1>
    38   191.016 MiB    -1.434 MiB           1       del df1
    39  1373.656 MiB  1182.641 MiB           1       df2 = <big df 2>
    40   334.859 MiB  -1038.797 MiB          1       del df2
In the code above, del df1 drops the last reference to the data frame, so the Garbage Collector can reclaim its memory (here the process released only 1.434 MiB). del df2 released 1038.797 MiB.
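A minimal sketch of this pattern (load_df1 and load_df2 are hypothetical loaders standing in for your real data sources):

from memory_profiler import profile

@profile
def func():
    df1 = load_df1()          # big data frame 1 (hypothetical loader)
    result1 = df1.mean()      # process df1 ...
    del df1                   # drop the reference so its memory can be reclaimed
    df2 = load_df2()          # big data frame 2 (hypothetical loader)
    result2 = df2.mean()
    del df2
    return result1, result2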
2: use chunked/iterator output
If your code builds a single very big data frame that may not fit in memory, enable the chunked/iterator option instead; most data-loading APIs already provide one.
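For example, pandas can return an iterator of chunks instead of one big data frame (a sketch; the file name and chunk size are assumptions):

import pandas as pd

# chunksize makes read_csv return an iterator of data frames
with pd.read_csv("big.csv", chunksize=100_000) as df_iter:
    for df in df_iter:
        df.shape              # process one chunk at a time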
(venv) ➜ ~ python -m memory_profiler t.py
Line #    Mem usage     Increment  Occurrences   Line Contents
==============================================================
    35   150.836 MiB   150.836 MiB           1   @profile
    36                                           def func():
    37   167.367 MiB    16.531 MiB           1       df_iter = <chunked>
    38   779.520 MiB    76.082 MiB          31       for df in df_iter:
    39   779.520 MiB  -452.129 MiB          30           df.shape
In the code above, the chunked API splits the very big output into 30 chunks (see the Occurrences column).
Memory usage does not grow with the total output size because on every iteration the variable df is rebound to a new chunk, and the Garbage Collector releases the memory used by the old one.
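The same idea applies to the Spark case mentioned at the top: instead of collect(), which pulls the whole result into the driver at once, toLocalIterator() streams it one partition at a time (a sketch; the table name and per-row handling are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("events")            # hypothetical big table

# rows = df.collect()                 # loads ALL rows into the driver -> may OOM
for row in df.toLocalIterator():      # streams rows, one partition at a time
    print(row)                        # process each row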
Note: Monitor Memory in Python
- pip install memory-profiler
- decorate the function with @profile
- run the file with python -m memory_profiler
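A minimal t.py, for reference (the array size is arbitrary):

# t.py
from memory_profiler import profile
import numpy as np

@profile
def func():
    a = np.ones((1024, 1024))   # ~8 MiB of float64
    del a

if __name__ == "__main__":
    func()

Then run: python -m memory_profiler t.py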