Data parallelism, weak and strong scaling, and what you need to know to scale a single training job to multiple CPUs or GPUs
Training of many state-of-the-art deep neural networks is a very compute-intensive task. It can take hours, days or even weeks to train a model with a single computational device, such as a CPU or GPU. To speed up the training, you have to scale out, to distribute the computations to multiple devices. The most commonly used approach to distribute the training is data parallelism
, when every computational device possesses its own replica of a model and computes a model update based on its own shard of data. Two options are possible for data parallelism: Strong
and weak
scaling. HPE's Deep Learning Benchmarking Suite supports both.
Strong scaling
Strong scaling assumes that the problem size remains the same and we vary the number of computational devices. From a deep learning point of view it means that we fix a batch size and vary a number of CPUs/GPUs to train a neural network. For instance, given a batch size of 1024 training samples, an eight-GPU system will give us the following benchmarking configurations:
#GPUs | GPUs IDs | per-GPU batch size | effective batch size |
1 | 0 | 1024 | 1024 |
2 | 0,1 | 512 | 1024 |
4 | 0,1,2,3 | 256 | 1024 |
8 | 0,1,2,3,4,5,6,7 | 128 | 1024 |
The efficiency of strong scaling is calculated in the following manner:
efficiency = t1 / (N * tN) * 100%
Where t1 is the time of solving a problem with one compute device, N is the number of compute devices, and tN is the time to solve the same problem with these N devices. For instance, an ideal case is when the tN time is N times smaller than t1 and we get ideal efficiency of 100%.
With strong scaling, an effective batch size
is constant, and per-device batch size is varying. In case of synchronous training, strong scaling with multiple devices is equivalent to training with a single device. The effective batch will be the same, and model updates will happen based on gradients computed for the same number of training samples.
Weak scaling
Weak scaling assumes we keep the amount of work per compute device per iteration fixed. With N compute devices we end up solving N times larger problem than with one. In deep learning world it means we keep per-device batch size fixed. With the same hardware setup as it is described above, we get the following benchmarking configurations assuming that the per GPU batch size is 128:
#GPUs | GPUs IDs | per-GPU batch size | effective batch size |
1 | 0 | 128 | 128 |
2 | 0,1 | 128 | 256 |
4 | 0,1,2,3 | 128 | 512 |
8 | 0,1,2,3,4,5,6,7 | 128 | 1024 |
The efficiency of weak scaling is calculated in the following way:
efficiency = t1 / tN * 100%
Where t1 is the time of solving a problem with one compute device and tN is the time to solve the N times larger problem with N compute devices. Ideally the tN time is exactly the same as t1 and we get ideal efficiency of 100%.
With weak scaling, a per-device batch size is constant and the effective batch size
increases with the assignment of more compute devices to the training job.
It is much easier to utilize a larger number of compute devices efficiently with weak scaling, as the amount of work per unit doesn't decreases when more units are added. At the same time, weak scaling may lead to very large effective batches, when convergence will suffer, and it won't be faster after all.
Multi-GPU benchmarking with DLBS
All supported frameworks (BVLC/NVIDIA Caffe, Caffe2, MXNet and TensorFlow) can be benchmarked in a multi-GPU mode, excluding Intel's fork of Caffe and apparently NVIDIA's inference engine TensorRT.
Weak scaling
The default option implemented in DLBS is a weak
scaling. Users need to provide a --exp.device_batch
and exp.gpus
parameters. The first one is an integer value specifying a per-GPU batch size. The second one is a comma-separated list of GPU identifiers. For instance, the following command line launches TensorFlow benchmark with ResNet50 on a an eight-GPU system with a per GPU batch size being equal to 32:
python experimenter.py run -Pexp.framework='"tensorflow"' \ -Pexp.model='"resnet50"' \ -Pexp.gpus='"0,1,2,3,4,5,6,7"' \ -Pexp.device_batch=32 \ -Pexp.bench_root='"./benchmarks/my_experiment"'\ -Pexp.log_file='"${exp.bench_root}/tf_resnet50.log"'
DLBS will compute internal parameter exp.effective_batch
by multiplying number of GPUs by a per-GPU device batch size:
{ "exp.num_gpus": "$(len('${exp.gpus}'.replace(',', ' ').split()))$", "exp.device": "$('gpu' if ${exp.num_gpus} > 0 else 'cpu')$", "exp.effective_batch": "$(${exp.num_gpus}*${exp.device_batch} if '${exp.device}' == 'gpu' else ${exp.device_batch})$" }
Strong scaling
There are several ways to benchmark strong scaling. One is to correctly adjust a per-device batch size with parameter exp.device_batch
so that the effective batch size does not change. The mechanism of extensions can effectively be used to disable certain benchmarks that result in non-desirable effective batch sizes.
The second approach is to provide effective batch size as a base parameter and compute per-device batch size based on it. For instance, to explore Caffe2's strong scaling with AlexNet model with 1024 images in a batch on an eight-GPU system, users can run the following command:
python experimenter.py run -Pexp.framework='"caffe2"' \ -Pexp.model='"alexnet"' \ -Vexp.gpus='["0", "0,1", "0,1,2,3", "0,1,2,3,4,5,6,7"]' \ -Pexp.effective_batch=1024\ -Pexp.device_batch='"$(${exp.effective_batch}/${exp.num_gpus})$"'\ -Pexp.bench_root='"./benchmarks/my_experiment"'\ -Pexp.log_file='"${exp.bench_root}/${exp.id}.log"'
For an introduction to specifying benchmark configurations, read the introduction section. For commonly used parameters and framework specific parameters, read the parameters and the framework specific parameters sections.
BVLC/NVIDIA Caffe
The family of Caffe frameworks supports multi-GPU training with the NCCL/NCCL2 library. The only configuration parameter that affects this is the comma-separated list of GPUs, exp.gpus
.
TensorFlow
We use the tf_cnn_benchmarks project as a backend for TensorFlow framework. The following parameters affect multi-GPU training:
exp.gpus
Comma separated list of GPUs to use.tensorflow.var_update
The method for managing variables: parameter_server, replicated, distributed_replicated, independent (source code).tensorflow.use_nccl
Whether to use nccl all-reduce primitives where possible (source code)tensorflow.local_parameter_device
Device to use as parameter server: cpu or gpu (see source code)
Caffe2
In the current version of a Caffe2 backend, the only parameter that affects multi-GPU training is exp.gpus
.
MXNet
Currently, in our MXNet backend two parameters affect multi-GPU training:
exp.gpus
Comma-separated list of GPUs to use.mxnet.kv_store
A method to aggregate gradients 'local', 'device', 'dist_sync', 'dist_device_sync' or 'dist_async'. See this page for more details.