In large-scale training, the speed gap between GPU and CPU is very large.

Overview

GPU

  • CPU (Central Processing Unit): the central processor.
  • GPU (Graphics Processing Unit): the graphics processor.

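When programmers write code for a CPU, they tend to use complex control logic to optimize the algorithm and reduce the running time of each task, that is, its latency.
When programmers write code for a GPU, they instead exploit its ability to process massive amounts of data, raising the overall data throughput to hide latency.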
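The latency/throughput trade-off can be made concrete with a little arithmetic. The sketch below uses made-up numbers (a hypothetical CPU core that finishes one task in 1 ms, and a hypothetical GPU that needs 10 ms per task but runs 1,000 tasks at once). These figures and the helper `total_time_ms` are assumptions for illustration, not measurements of real hardware.

```python
import math

def total_time_ms(num_tasks, latency_ms, parallel_width):
    """Time to finish num_tasks when parallel_width tasks run at once
    and each batch of tasks takes latency_ms."""
    batches = math.ceil(num_tasks / parallel_width)
    return batches * latency_ms

# Hypothetical numbers, chosen only to illustrate the trade-off.
CPU = {"latency_ms": 1.0, "parallel_width": 1}
GPU = {"latency_ms": 10.0, "parallel_width": 1000}

for n in (1, 100_000):
    cpu_ms = total_time_ms(n, **CPU)
    gpu_ms = total_time_ms(n, **GPU)
    print(f"{n} tasks: CPU {cpu_ms:.0f} ms, GPU {gpu_ms:.0f} ms")
```

With these assumed numbers the CPU wins on a single task (1 ms against 10 ms), while the GPU finishes 100,000 tasks in 1,000 ms against 100,000 ms: higher per-task latency, far higher throughput.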

[Figure]

In the figure, green marks the compute units, orange-red the memory units, and orange-yellow the control units.

First you need hardware support: a graphics card capable of GPU-accelerated computing. This post uses NVIDIA GPUs as the example.

CUDA

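CUDA (Compute Unified Device Architecture) is the computing platform released by the GPU vendor NVIDIA. It is a general-purpose parallel computing architecture that enables GPUs to solve complex computational problems.

CUDA provides a scalable programming model, so CUDA code that has already been written can run on GPUs with any number of cores. The number of physical processors is known to the system only at run time.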
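The idea of a scalable model can be illustrated by analogy in plain Python. This is not CUDA code: the sketch splits work into independent blocks and hands them to however many workers the machine reports at run time, the way a GPU schedules thread blocks onto its available processors. The names `scale_block` and `run_blocks` are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def scale_block(block):
    """Process one independent block of work (here: double each element)."""
    return [2 * x for x in block]

def run_blocks(data, block_size, num_workers=None):
    """Split data into independent blocks and run them on the available
    workers; the result does not depend on how many workers there are."""
    if num_workers is None:
        # Discovered at run time, like a GPU's processor count.
        num_workers = os.cpu_count() or 1
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(scale_block, blocks))  # map keeps block order
    return [y for block in results for y in block]

data = list(range(10))
# Same answer with 1 worker or 4: the program does not encode the worker count.
print(run_blocks(data, block_size=3, num_workers=1) == run_blocks(data, block_size=3, num_workers=4))
```

Because each block is independent, the same program is correct on one worker or many; only the running time changes.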

CUDNN

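NVIDIA cuDNN is a GPU-accelerated library for deep neural networks. It emphasizes performance, ease of use, and low memory overhead. cuDNN can be integrated into higher-level machine learning frameworks, such as the popular Caffe framework from UC Berkeley. Its simple, drop-in design lets developers focus on designing and implementing neural network models rather than on performance tuning, while still getting high-performance parallel computation on the GPU.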

Frameworks with GPU Support

TensorFlow-GPU

For installation instructions, see https://tensorflow.google.cn/install

Note that the versions of tensorflow_gpu, Python, the compiler, cuDNN, and CUDA must match one another exactly.

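For the Windows platform, the version correspondences can be found in:

  • the TensorFlow documentation
  • the correspondence between graphics driver versions and CUDA versions
  • the correspondence between CUDA versions and cuDNN versions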

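Open the NVIDIA Control Panel and go to System Information to see the CUDA driver version currently supported.

This CUDA driver version is an upper bound: you can install only that version of CUDA or an earlier one.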
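The rule above can be written as a small check. This is a minimal sketch: the helper names are hypothetical, and the driver-side version "10.1" is an assumed example value, not one taken from the author's machine.

```python
def parse_version(text):
    """Turn a dotted version string such as '10.0.130' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

def toolkit_is_installable(toolkit_version, driver_cuda_version):
    """True if the toolkit's major.minor version does not exceed the
    CUDA version reported by the driver."""
    return parse_version(toolkit_version)[:2] <= parse_version(driver_cuda_version)[:2]

# Assumed example: the control panel reports CUDA 10.1.
print(toolkit_is_installable("10.0.130", "10.1"))  # True
print(toolkit_is_installable("10.2.89", "10.1"))   # False
```

Comparing tuples of integers rather than raw strings avoids mistakes such as treating "9.2" as newer than "10.0".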

Based on my own setup, the environment I plan to use is:

  • tensorflow_gpu-1.14.0
  • python 3.7
  • MSVC 2017
  • cuDNN 7.4.2.24
  • CUDA 10.0.130

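Installing CUDA

Go to cuda-toolkit-archive, choose the CUDA Toolkit version you need, then download and install it.

I chose the network installer. Note that the first path the installer asks for is only the temporary extraction path. Choose the custom installation and deselect components you do not need, such as NVIDIA GeForce Experience.

To verify the installation, run in cmd: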

nvcc -V

Make sure the relevant environment variables are configured.
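One way to check the environment from Python is sketched below. It assumes the installer has set a `CUDA_PATH` variable and put `nvcc` on `PATH`, which is the usual result on Windows but is treated here as an assumption. The helper names are hypothetical.

```python
import os
import shutil
import subprocess

def find_nvcc():
    """Return the full path of nvcc if it is on PATH, else None."""
    return shutil.which("nvcc")

def nvcc_version_text():
    """Return the output of `nvcc -V`, or None if nvcc is not on PATH."""
    nvcc = find_nvcc()
    if nvcc is None:
        return None
    done = subprocess.run([nvcc, "-V"], capture_output=True, text=True)
    return done.stdout

# CUDA_PATH is assumed to be set by the installer; None means it is missing.
print("CUDA_PATH =", os.environ.get("CUDA_PATH"))
print(nvcc_version_text() or "nvcc not found on PATH")
```

If `nvcc` is reported as missing even though the toolkit is installed, the toolkit's `bin` directory is probably not on `PATH`.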

Installing cuDNN

Go to cudnn-archive, choose the cuDNN version that matches your CUDA Toolkit version, then download and extract it.

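I installed CUDA in its default location, C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0

Extract cuDNN into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0.

Alternatively, keep cuDNN somewhere else and configure separate environment variables for it.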
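After extracting, it can help to confirm that the cuDNN files landed where the toolkit expects them. The sketch below assumes the usual cuDNN 7.x layout on Windows (`bin\cudnn64_7.dll`, `include\cudnn.h`, `lib\x64\cudnn.lib`). Other cuDNN versions use different file names, so treat the list as an assumption to adjust. The helper name is hypothetical.

```python
from pathlib import Path

# Assumed layout for cuDNN 7.x on Windows; adjust for other versions.
EXPECTED_CUDNN_FILES = (
    Path("bin") / "cudnn64_7.dll",
    Path("include") / "cudnn.h",
    Path("lib") / "x64" / "cudnn.lib",
)

def missing_cudnn_files(cuda_root):
    """Return the expected cuDNN files that are absent under cuda_root."""
    root = Path(cuda_root)
    return [str(rel) for rel in EXPECTED_CUDNN_FILES if not (root / rel).is_file()]

cuda_root = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0"
missing = missing_cudnn_files(cuda_root)
print("cuDNN looks complete" if not missing else f"missing: {missing}")
```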

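Installing TensorFlow-GPU

If you previously installed the CPU version of TensorFlow, uninstall it first.

If you install with conda inside a virtual environment, conda installs the required CUDA and cuDNN dependencies automatically.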

conda install tensorflow-gpu

Here is a short GPU test program:

import time
import tensorflow as tf  # TensorFlow 1.x API

begin = time.time()
# Pin the ops below to the first GPU.
with tf.device('/gpu:0'):
    rand_t = tf.random_uniform([50, 50], 0, 10, dtype=tf.float32, seed=0)
    a = tf.Variable(rand_t)
    b = tf.Variable(rand_t)
    c = tf.matmul(a, b)
    init = tf.global_variables_initializer()
# log_device_placement=True prints the device each op is assigned to.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(init)
print(sess.run(c))
end = time.time()
print(end - begin, 's')

Output:

2020-10-11 09:36:56.713905: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2020-10-11 09:36:56.742138: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library nvcuda.dll
2020-10-11 09:36:57.193243: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.176
pciBusID: 0000:01:00.0
2020-10-11 09:36:57.199952: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-10-11 09:36:57.204946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2020-10-11 09:36:58.622428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-11 09:36:58.626880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2020-10-11 09:36:58.630073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2020-10-11 09:36:58.633910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1382 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0
2020-10-11 09:36:58.657797: I tensorflow/core/common_runtime/direct_session.cc:296] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0

random_uniform/RandomUniform: (RandomUniform): /job:localhost/replica:0/task:0/device:GPU:0
2020-10-11 09:36:58.683509: I tensorflow/core/common_runtime/placer.cc:54] random_uniform/RandomUniform: (RandomUniform)/job:localhost/replica:0/task:0/device:GPU:0
random_uniform/sub: (Sub): /job:localhost/replica:0/task:0/device:GPU:0
2020-10-11 09:36:58.690684: I tensorflow/core/common_runtime/placer.cc:54] random_uniform/sub: (Sub)/job:localhost/replica:0/task:0/device:GPU:0
random_uniform/mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:0
2020-10-11 09:36:58.705871: I tensorflow/core/common_runtime/placer.cc:54] random_uniform/mul: (Mul)/job:localhost/replica:0/task:0/device:GPU:0
random_uniform: (Add): /job:localhost/replica:0/task:0/device:GPU:0
2020-10-11 09:36:58.719880: I tensorflow/core/common_runtime/placer.cc:54] random_uniform: (Add)/job:localhost/replica:0/task:0/device:GPU:0
Variable: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
2020-10-11 09:36:58.738611: I tensorflow/core/common_runtime/placer.cc:54] Variable: (VariableV2)/job:localhost/replica:0/task:0/device:GPU:0
Variable/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:0
2020-10-11 09:36:58.749906: I tensorflow/core/common_runtime/placer.cc:54] Variable/Assign: (Assign)/job:localhost/replica:0/task:0/device:GPU:0
Variable/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
2020-10-11 09:36:58.759648: I tensorflow/core/common_runtime/placer.cc:54] Variable/read: (Identity)/job:localhost/replica:0/task:0/device:GPU:0
Variable_1: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
2020-10-11 09:36:58.769854: I tensorflow/core/common_runtime/placer.cc:54] Variable_1: (VariableV2)/job:localhost/replica:0/task:0/device:GPU:0
Variable_1/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:0
2020-10-11 09:36:58.776819: I tensorflow/core/common_runtime/placer.cc:54] Variable_1/Assign: (Assign)/job:localhost/replica:0/task:0/device:GPU:0
Variable_1/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
2020-10-11 09:36:58.792460: I tensorflow/core/common_runtime/placer.cc:54] Variable_1/read: (Identity)/job:localhost/replica:0/task:0/device:GPU:0
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2020-10-11 09:36:58.808352: I tensorflow/core/common_runtime/placer.cc:54] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
init: (NoOp): /job:localhost/replica:0/task:0/device:GPU:0
2020-10-11 09:36:58.823256: I tensorflow/core/common_runtime/placer.cc:54] init: (NoOp)/job:localhost/replica:0/task:0/device:GPU:0
random_uniform/shape: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2020-10-11 09:36:58.838355: I tensorflow/core/common_runtime/placer.cc:54] random_uniform/shape: (Const)/job:localhost/replica:0/task:0/device:GPU:0
random_uniform/min: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2020-10-11 09:36:58.853758: I tensorflow/core/common_runtime/placer.cc:54] random_uniform/min: (Const)/job:localhost/replica:0/task:0/device:GPU:0
random_uniform/max: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2020-10-11 09:36:58.868493: I tensorflow/core/common_runtime/placer.cc:54] random_uniform/max: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[[1405.2429 1441.7413 1364.38 ... 1480.2251 1279.0061 1620.0938 ]
[1232.6589 1344.4458 1169.7095 ... 1205.1284 1040.5566 1421.967 ]
[1209.3164 1180.3206 1158.1396 ... 1200.0344 1014.03217 1222.5107 ]
...
[1298.9648 1262.9236 1205.6918 ... 1396.479 1090.7253 1437.241 ]
[1118.2473 1209.0151 1077.7229 ... 1180.7025 1076.4694 1139.742 ]
[1200.8866 1297.2266 1260.01 ... 1289.4297 1165.2448 1433.4183 ]]
2.6225805282592773 s

The log shows that the GPU is doing the work.

You can also run nvidia-smi in cmd (C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe) to monitor GPU usage and change GPU state.
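For scripted monitoring, nvidia-smi can also be called from Python. The sketch below assumes `nvidia-smi` is on `PATH` (on Windows you may need to add its folder first) and uses the tool's `--query-gpu` and `--format=csv,noheader` options. The helper name `query_gpus` is hypothetical.

```python
import shutil
import subprocess

QUERY_FIELDS = "name,utilization.gpu,memory.used,memory.total"

def query_gpus():
    """Return one CSV line per GPU, or None if nvidia-smi is unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    done = subprocess.run(
        [exe, f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if done.returncode != 0:
        return None
    return [line.strip() for line in done.stdout.splitlines() if line.strip()]

gpus = query_gpus()
print(gpus if gpus is not None else "nvidia-smi not available")
```

Each returned line lists the GPU name, its utilization, and the used and total memory, which is enough to confirm that a training job is actually running on the GPU.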

Keras-GPU

Install it directly with conda. Note that tensorflow-gpu must be installed first.

conda install keras-gpu

Pytorch-GPU

You will not find documentation more thorough than the official site, so refer directly to https://pytorch.org/.

conda install pytorch torchvision cudatoolkit=10.0

GPU test program:

import torch

flag = torch.cuda.is_available()
print(flag)

ngpu = 1
# Decide which device we want to run on
device = torch.device("cuda:0" if (flag and ngpu > 0) else "cpu")
print(device)
# Only query the GPU name when CUDA is in use, so the script also runs on CPU-only machines.
if device.type == "cuda":
    print(torch.cuda.get_device_name(0))
print(torch.rand(3, 3).to(device))

Output:

True
cuda:0
GeForce GTX 960M
tensor([[0.0208, 0.2799, 0.4918],
[0.0020, 0.1067, 0.8207],
[0.5531, 0.0994, 0.2108]], device='cuda:0')
