Federated learning (FL) is transforming machine learning by enabling decentralized training across multiple devices. In this post, we will explore some performance insights using the FEDn framework, comparing Python and C++ clients with and without training. The goal is to understand how these implementations handle computational load and memory usage over time, and how the choice of libraries impacts performance.
Before diving into the results, let’s outline the experimental setup:
Dataset: The MNIST Handwritten Digit Dataset was used for training. This dataset consists of 60,000 training images and 10,000 test images of handwritten digits (0-9), making it a standard benchmark for evaluating machine learning models.
Model: A standard neural network model, available open-source, was used for both Python and C++ implementations. The model architecture was kept consistent across both clients to ensure a fair comparison (a minimal sketch of the network appears under "Code Used" below).
Model Architecture:
Layer 1: Transforms the input (784 features) into a 64-dimensional hidden representation.
Layer 2: Further reduces the 64-dimensional representation to 32 dimensions.
Layer 3: Maps the 32-dimensional representation to the final output (10 classes).
Input Size: 784 (for 28x28 images)
Output Size: 10 (for 10 classes)
Client Nodes: Both Python and C++ clients were deployed on Google Cloud Platform (GCP) using e2-medium instances, each providing 2 vCPUs and 4 GB of memory.
This setup ensures that both clients operate under the same hardware constraints, allowing for a direct comparison of performance.
Code Used:
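The full client code is available open-source; purely to illustrate the architecture listed above, a minimal sketch using the LibTorch C++ frontend could look like the following. Only the layer dimensions come from the description above; the layer names, ReLU activations, and log-softmax output are assumptions made for this sketch.

```cpp
#include <torch/torch.h>

// Minimal sketch of the 784 -> 64 -> 32 -> 10 network described above.
// Activations and the output layer choice are assumptions; only the
// layer dimensions are taken from the setup.
struct Net : torch::nn::Module {
    Net() {
        fc1 = register_module("fc1", torch::nn::Linear(784, 64));
        fc2 = register_module("fc2", torch::nn::Linear(64, 32));
        fc3 = register_module("fc3", torch::nn::Linear(32, 10));
    }

    torch::Tensor forward(torch::Tensor x) {
        x = x.reshape({x.size(0), 784});                     // flatten 28x28 images
        x = torch::relu(fc1->forward(x));
        x = torch::relu(fc2->forward(x));
        return torch::log_softmax(fc3->forward(x), /*dim=*/1);
    }

    torch::nn::Linear fc1{nullptr}, fc2{nullptr}, fc3{nullptr};
};
```

The PyTorch model in the Python client mirrors this structure, which is what keeps the comparison between the two clients fair.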
First, let’s look at the idle state, where no training is performed. This gives us a baseline for resource consumption.
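The exact monitoring tooling is not the focus here, but for context, one simple way to sample a client's resident memory on Linux is to read VmRSS from /proc/self/status. The snippet below is a minimal sketch of that idea, not the measurement setup actually used for the plots.

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Return the resident set size (VmRSS) of the current process in kB.
// Linux-specific sketch; it simply parses /proc/self/status.
long resident_memory_kb() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {    // line looks like "VmRSS:   65536 kB"
            return std::stol(line.substr(6));  // parse the number after the label
        }
    }
    return -1;  // field not found (e.g. non-Linux system)
}

int main() {
    std::cout << "Resident memory: " << resident_memory_kb() << " kB\n";
    return 0;
}
```

Sampling a value like this at regular intervals is one way to build the kind of idle and training profiles discussed next.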
Python Client: The CPU usage remains consistently low, hovering around 0-2%. Memory usage is also modest, staying under 70 MB. This indicates that the Python client is lightweight when idle, making it suitable for scenarios where resources need to be conserved.
C++ Client: The C++ client shows higher memory usage in the idle state, staying under 160 MB, despite similarly minimal CPU usage (0-2%). This is not unexpected: it follows from how the C++ executable is compiled and linked against its dependencies. The higher memory footprint is attributed to the libraries used in the C++ implementation, such as LibTorch and cnpy, which carry some initialization overhead even when no training is being performed.
Now, we evaluate the performance during active training over 10 rounds. Prior to the 10-round experiment, a single warm-up round was conducted to initialize the model and ensure a stable starting point.
Python Client (PyTorch): During training, CPU usage spikes to nearly 100%, indicating that one CPU core is fully utilized. Memory usage also increases significantly, rising from under 60 MB to a peak of around 700 MB. This sharp increase reflects the intensive computational and memory demands of training neural networks with PyTorch. The consistently high CPU usage suggests that Python is effectively leveraging the available resources to handle the workload, but at the cost of higher memory consumption.
C++ Client (LibTorch + cnpy): Similarly, the C++ client shows CPU usage close to 100%, indicating that it, too, fully utilizes one CPU core during training. Its CPU plot does show more frequent idle gaps, suggesting that it finishes each task more quickly and spends more time waiting between tasks; in other words, the C++ implementation processes the workload more efficiently and needs less continuous CPU time. Memory usage, however, behaves quite differently from Python. While the C++ client starts with a higher idle footprint (under 160 MB), its memory consumption does not rise sharply during training, peaking at around 450 MB. This is a stark contrast to the Python client, whose memory usage grows more than tenfold during training. The stability of the C++ client's memory usage is attributed to the libraries used (a sketch of how they fit together follows this list):
LibTorch: The C++ version of PyTorch, LibTorch, is optimized for performance and memory efficiency. It provides a lower-level interface, which allows for finer control over resource allocation and avoids the memory bloat often seen in higher-level frameworks.
cnpy: This library is used for handling NumPy files in C++. It is lightweight and efficient, contributing to the stable memory footprint during training.
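To make this concrete, here is a rough sketch of how one training round could be wired together with LibTorch and cnpy: read the global parameters from a flat NumPy file, train for one pass over MNIST, and write the updated parameters back. The file names, flat float32 parameter layout, batch size, and learning rate are illustrative assumptions rather than the actual client code, and Net refers to the architecture sketched earlier.

```cpp
#include <torch/torch.h>
#include "cnpy.h"
#include <memory>
#include <vector>

// Hypothetical single training round; file names and hyperparameters are
// assumptions for illustration only.
void run_round(std::shared_ptr<Net> net) {
    // 1. Load global parameters from a flat float32 .npy file via cnpy and
    //    copy them into the model's parameter tensors.
    cnpy::NpyArray arr = cnpy::npy_load("global_weights.npy");
    float* src = arr.data<float>();
    {
        torch::NoGradGuard no_grad;
        size_t offset = 0;
        for (auto& p : net->parameters()) {
            auto flat = torch::from_blob(src + offset, p.sizes(), torch::kFloat32);
            p.copy_(flat);
            offset += p.numel();
        }
    }

    // 2. One pass over the MNIST training set with plain SGD (LibTorch data API).
    auto dataset = torch::data::datasets::MNIST("./data")
                       .map(torch::data::transforms::Normalize<>(0.1307, 0.3081))
                       .map(torch::data::transforms::Stack<>());
    auto loader = torch::data::make_data_loader(std::move(dataset), /*batch_size=*/64);
    torch::optim::SGD optimizer(net->parameters(), /*lr=*/0.01);

    for (auto& batch : *loader) {
        optimizer.zero_grad();
        auto output = net->forward(batch.data);
        auto loss = torch::nll_loss(output, batch.target);
        loss.backward();
        optimizer.step();
    }

    // 3. Flatten the updated parameters and write them back with cnpy.
    std::vector<float> out;
    for (const auto& p : net->parameters()) {
        auto flat = p.detach().reshape({-1}).contiguous();
        out.insert(out.end(), flat.data_ptr<float>(), flat.data_ptr<float>() + flat.numel());
    }
    cnpy::npy_save("updated_weights.npy", out.data(), {out.size()}, "w");
}
```

Exchanging parameters as raw float buffers like this keeps the serialization path lean, which fits the stable memory profile observed for the C++ client during training.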
CPU Utilization: Both Python and C++ clients fully utilize one CPU core during training, reaching close to 100% usage. This is expected, as training neural networks is computationally intensive and requires significant processing power.
Memory Usage: The two clients diverge here. The Python client's memory consumption rises sharply during training, peaking around 700 MB, while the C++ client stays stable at roughly 450 MB despite its higher idle footprint.
Library Impact: The choice of libraries plays a significant role in memory efficiency. While LibTorch and cnpy introduce some overhead in the idle state, they shine during training by maintaining a stable and lower memory footprint compared to Python’s PyTorch implementation.
Framework Choice: The choice between Python and C++ in FEDn depends on your specific needs. Python offers flexibility and ease of use, especially with frameworks like PyTorch, making it ideal for rapid prototyping and development. On the other hand, C++ provides better memory efficiency during training, which can be advantageous in resource-constrained environments, despite its slightly higher idle memory usage.
Interoperability at Scale: One of the most important aspects of this experiment is that both Python and C++ clients were connected to the same FEDn project and contributed to the aggregation process. This demonstrates the interoperability of the FEDn framework, which can seamlessly integrate clients written in different programming languages and still perform federated aggregation effectively. This capability is crucial for real-world deployments where heterogeneous environments are common.
These experiments, conducted on GCP e2-medium instances using the MNIST dataset and a standard neural network model, provide a clear picture of how different implementations handle the demands of federated learning. While the C++ client uses more memory in the idle state, it demonstrates superior memory efficiency during training, with memory consumption remaining stable and significantly lower than Python’s. This is a key advantage of the C++ implementation, especially in environments where memory resources are limited.
Moreover, the ability of the FEDn framework to support interoperability at scale—where both Python and C++ clients can participate in the same federated learning process—highlights its flexibility and readiness for real-world, heterogeneous deployments. Understanding these performance characteristics can help you optimize your federated learning workflows, whether you prioritize development speed, resource efficiency, or interoperability.