Best GPU benchmarking test 2019
With 640 Tensor Cores, the Tesla V100 was the world's first GPU to break the 100 teraFLOPS (TFLOPS) barrier of deep learning performance, and it comes with 16 GB of high-bandwidth HBM2 memory.

We used our AIME R400 server for testing. It is an environment built to run high performance GPUs, providing optimal cooling and the ability to run each GPU in a PCIe 3.0 x16 slot directly connected to the CPU. The PCIe connectivity has a measurable influence on deep learning performance, especially in multi-GPU configurations.
The visual recognition model ResNet50 is used for our benchmark. As a classic deep learning network, with its complex 50 layer architecture of different convolutional and residual layers, it is still a good network for comparing achievable deep learning performance. Because it is used in many benchmarks, a close to optimal implementation is available, which drives the GPU to maximum performance and shows where the performance limits of the devices are.

Getting the best performance out of Tensorflow

Some care was taken to get the most performance out of Tensorflow for benchmarking. One of the most important settings for optimizing the workload for each type of GPU is the batch size. The batch size specifies how many samples are propagated through the network in parallel; the gradients computed for each sample are averaged over the batch, and this averaged result is then applied to adjust the weights of the network.

The best batch size in terms of performance is directly related to the amount of GPU memory available. A larger batch size increases the parallelism and improves the utilization of the GPU cores. The basic rule is to increase the batch size so that the complete GPU memory is used, but the batch size must not exceed the available GPU memory: otherwise memory swapping mechanisms kick in and reduce performance, or the application simply crashes with an 'out of memory' exception.

A large batch size has, to some extent, no negative effect on the training results; on the contrary, a large batch size can have a positive effect and lead to more generalized results. An example is BigGAN, where batch sizes as high as 2,048 are suggested to deliver the best results. A further interesting read about the influence of the batch size on the training results was published by OpenAI.
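As an illustration of how such a throughput measurement can be set up, here is a minimal sketch that times ResNet50 training steps on synthetic data with a configurable batch size using the Keras API. The batch size of 64, the step count, the synthetic input and the helper name `benchmark_resnet50` are assumptions for illustration only, not the exact setup behind the published numbers.

```python
import time
import numpy as np
import tensorflow as tf

def benchmark_resnet50(batch_size=64, steps=50, img_size=224, num_classes=1000):
    """Time ResNet50 training steps on synthetic data (illustrative sketch)."""
    model = tf.keras.applications.ResNet50(weights=None, classes=num_classes)
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

    # Synthetic images and labels so the measurement reflects GPU compute, not input I/O.
    images = np.random.rand(batch_size, img_size, img_size, 3).astype("float32")
    labels = np.random.randint(0, num_classes, size=(batch_size,))

    # Warm-up step: graph construction and memory allocation happen here.
    model.train_on_batch(images, labels)

    start = time.time()
    for _ in range(steps):
        model.train_on_batch(images, labels)
    elapsed = time.time() - start

    print(f"batch size {batch_size}: {steps * batch_size / elapsed:.1f} images/sec")

if __name__ == "__main__":
    benchmark_resnet50(batch_size=64)
```

Increasing `batch_size` until the reported images/sec stops improving, or until an out-of-memory error appears, is one practical way to locate the sweet spot described above.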
Tensorflow XLA

A Tensorflow performance feature that was lately declared stable is XLA (Accelerated Linear Algebra). It optimizes the network graph by dynamically compiling parts of the network into kernels tailored to the specific device. This can bring performance benefits of 10% to 30% compared to the statically crafted Tensorflow kernels for the different layer types. The feature can be turned on by a simple option or environment flag and has a direct effect on the execution performance. How to enable XLA in your projects, read here.
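How the flag is spelled depends on the Tensorflow version in use; as a sketch, XLA auto-clustering can typically be requested either through an environment variable set before Tensorflow initializes, or programmatically in Tensorflow 2.x. Treat both as assumptions to verify against the release you run.

```python
import os

# Option 1: environment flag, set before Tensorflow initializes its devices.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=2"

import tensorflow as tf

# Option 2: programmatic switch for XLA JIT compilation (Tensorflow 2.x).
tf.config.optimizer.set_jit(True)
```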
Float 16bit / Mixed Precision Learning

Concerning inference jobs, a lower floating point precision, and even lower 8 or 4 bit integer resolution, is already well established and used to improve performance. Studies suggest that float 16bit precision can also be applied to training tasks with negligible loss in training accuracy, and it can speed up training jobs dramatically. Applying float 16bit precision is not that trivial, as the model has to be adjusted to use it. Because not all calculation steps should be done at the lower bit precision, the mixing of different bit resolutions for calculation is referred to as "mixed precision". The full potential of mixed precision learning will be better explored with Tensorflow 2.X and will probably be the development trend for improving deep learning framework performance. We provide benchmarks for both float 32bit and 16bit precision as a reference to demonstrate the potential.
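As a minimal sketch of what enabling mixed precision looks like with the Keras API in Tensorflow 2.x: the `set_global_policy` call is the newer spelling (older 2.x releases expose it under an `experimental` namespace), and the tiny model is only there to show where the policy and the float32 output layer fit, not the benchmark network itself.

```python
import tensorflow as tf

# Run compute in float16 while keeping variables in float32 ("mixed precision");
# the float16 matrix math is what the Tensor Cores accelerate.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(128,)),
    # The final softmax stays in float32 for numerical stability, since not
    # every calculation step should be done at the lower precision.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])

# Under the mixed_float16 policy Keras applies loss scaling to the optimizer,
# which protects small float16 gradients from underflowing to zero.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```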