Get your Python code for data preparation to perform significantly faster with just a few lines of code. Take advantage of the built-in concurrent.futures module.

This post shows how to utilize all of your CPU cores when executing Python code for data preparation, just by adding a few extra lines of code.

A common data preparation task

One very common task in data science is data preparation; preparing images for model training is a typical example.
The following code snippet demonstrates such a job: iterating over the images in a folder, manipulating each image, and finally saving each change.

Create file compress.py

The file should contain the following code.

from PIL import Image
from io import BytesIO


def compress_image(image_file, max_size=4000, max_dim=150, dim_diff_w=0, dim_diff_h=0,
                   orig_image_w=None, orig_image_h=None):
    print_values = {
        'max_size': max_size,
        'max_dim': max_dim,
        'dim_diff_w': dim_diff_w,
        'dim_diff_h': dim_diff_h,
        'orig_image_w': orig_image_w,
        'orig_image_h': orig_image_h,
    }
    print(print_values)

    orig_image = Image.open(image_file)
    orig_image_w, orig_image_h = orig_image.size

    # Scale both dimensions by the same factor so the aspect ratio is preserved.
    if orig_image_w > max_dim:
        dim_diff_w = orig_image_w - max_dim
        orig_image_h = round(orig_image_h * max_dim / orig_image_w)
        orig_image_w = max_dim

    if orig_image_h > max_dim:
        dim_diff_h = orig_image_h - max_dim
        orig_image_w = round(orig_image_w * max_dim / orig_image_h)
        orig_image_h = max_dim

    with BytesIO() as file_bytes:
        # Only resize and re-save if at least one dimension was reduced above.
        if dim_diff_h + dim_diff_w > 0:
            # Image.LANCZOS is the current name of the filter formerly
            # exposed as Image.ANTIALIAS.
            resize_image = orig_image.resize((orig_image_w, orig_image_h), Image.LANCZOS)

            # Save to an in-memory buffer first so the resulting file size
            # can be checked before anything is written to disk. PNG ignores
            # the JPEG-style quality option, so optimize=True is enough.
            resize_image.save(file_bytes, optimize=True, format='png')

            if file_bytes.tell() <= max_size:
                # Small enough: overwrite the original file.
                file_bytes.seek(0, 0)
                with open(image_file, 'wb') as f_output:
                    f_output.write(file_bytes.read())
            else:
                # Still too large: retry with a 10% smaller maximum dimension.
                new_max_dim = round(max_dim - (max_dim * 0.10))
                compress_image(image_file, max_size=max_size, max_dim=new_max_dim,
                               orig_image_w=orig_image_w, orig_image_h=orig_image_h)

The function compress_image() checks the dimensions of your image and adjusts them according to the set max_dim, scaling both dimensions by the same factor so the aspect ratio is preserved. Once resized, the image is saved in memory rather than on disk, and its size is checked. If the max_size limit is still not satisfied, the function is called recursively with a lower value for max_dim; otherwise, the original image is overwritten with the resized version. If you want to know more about the libraries PIL and io.BytesIO, please visit their respective documentation.
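
Before measuring anything, you can sanity-check the function on a single file. The path below is only a placeholder; point it at any PNG of your own.

from compress import compress_image

# Hypothetical example path; replace it with one of your own images.
compress_image('../data/test_compres/sample.png')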

The next task is to use this function and measure how long it takes to execute on some arbitrary number of images. For the purpose of testing, we will create one more Python file.

Create file test_compress.py

The file should contain the following code.

import time
import glob
import compress

start_time = time.time()
for file_name in glob.glob("../data/test_compres/*.png"):
    compress.compress_image(file_name)
end_time = time.time()
print(end_time - start_time)

The above code builds a list of all image files in a specific folder, in my case “../data/test_compres/*.png”; the pattern ensures that only PNG files are included. For each image path in the list, the function compress_image(file_name) is called. The time module is used to measure the elapsed time of the run.
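
As a side note, time.perf_counter() is generally the preferred clock for this kind of measurement, since it has a higher resolution than time.time(); swapping it in is mechanical:

import glob
import time
import compress

start_time = time.perf_counter()  # monotonic, high-resolution timer
for file_name in glob.glob("../data/test_compres/*.png"):
    compress.compress_image(file_name)
print(time.perf_counter() - start_time)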

Running the file test_compress.py prints an elapsed time of 6.6042399406433105 seconds.
This is rather slow, so let's improve it using one of Python's standard library modules, concurrent.futures.

Using Python Concurrent Futures

The concurrent.futures module in Python allows us to utilize all the cores available on the machine where the code is executed.
The first example we ran used only one of the four cores available on my Mac. Let's see how much we can improve the time if we use all four.
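
If you are unsure how many cores your own machine has, the standard library can tell you; this is also the value ProcessPoolExecutor defaults to when no max_workers is given:

import os

print(os.cpu_count())  # 4 on the Mac used in this post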

In order to make this improvement, we need to import a new library and change the code in file test_compress.py slightly. The modified test_compress.py file should now contain the following.

import concurrent.futures
import time
import glob
import compress


start_time = time.time()
list_files = glob.glob("../data/test_compres/*.png")
with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
    executor.map(compress.compress_image, list_files)
end_time = time.time()
print(end_time - start_time)

This version of test_compress.py is slightly different from the previous one. A new module, concurrent.futures, is imported, and compress_image() is now called in a different way: a ProcessPoolExecutor creates a pool of worker processes, and executor.map() calls compress_image() for each item in the list list_files. That's all; now let's run the file again.

Running the file test_compress.py now prints an elapsed time of 3.422382116317749 seconds. That is nearly twice as fast, an improvement of roughly 90% in speed.
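
One caveat worth knowing before you adopt this pattern: on platforms where worker processes are spawned rather than forked (Windows, and macOS on newer Python versions), the pool must be created under an if __name__ == '__main__': guard, or each worker will try to re-run the script when it imports it. Here is a sketch of the same script with that guard, which also lets the executor default to one worker per CPU and consumes the results so that errors raised in workers are not silently dropped:

import concurrent.futures
import glob
import time
import compress


def main():
    start_time = time.time()
    list_files = glob.glob("../data/test_compres/*.png")
    # No max_workers given: the pool defaults to os.cpu_count() workers.
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Iterating over the results re-raises any exception thrown
        # inside a worker process, instead of hiding it.
        for _ in executor.map(compress.compress_image, list_files):
            pass
    print(time.time() - start_time)


if __name__ == '__main__':
    main()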

Conclusion

You may have expected that if one core could do the task in about 6.6 seconds, four cores would do it in roughly 1.65 seconds. In practice, scaling depends on what your code is doing: starting worker processes, passing arguments between processes, and writing results to disk all add overhead, and other tasks may scale better or worse than this one. Still, nearly doubling the speed just by adding two lines of code is awesome.
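
Because part of this particular workload is disk I/O, it can also be worth trying threads instead of processes; with concurrent.futures that is a one-line change, though whether it helps depends on how much of the time is spent on I/O versus pixel work:

import concurrent.futures
import glob
import compress

list_files = glob.glob("../data/test_compres/*.png")
# Threads share one interpreter, so there is no process start-up cost;
# for I/O-heavy workloads this can be competitive with processes.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(compress.compress_image, list_files)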

There are many other optimizations that can improve the performance of your Python code. An interesting one is Cython, which compiles your Python code to C. Cython does an excellent job of this without requiring you to know any C programming. Using Cython is out of scope for this post, but it's an interesting topic worth exploring.