
PyTorch Tutorial 13.6: Concise Implementation for Multiple GPUs


Implementing parallelism from scratch for every new model is no fun. Moreover, there is significant benefit in optimizing synchronization tools for high performance. In the following we show how to do this using the high-level APIs of the deep learning framework. The mathematics and the algorithms are the same as in Section 13.5. Quite unsurprisingly, you will need at least two GPUs to run the code of this section.

# PyTorch version
import torch
from torch import nn
from d2l import torch as d2l

# MXNet version
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l

npx.set_np()

13.6.1. A Toy Network

Let's use a slightly more meaningful network than LeNet from Section 13.5, one that is still sufficiently easy and fast to train. We pick a ResNet-18 variant (He et al., 2016). Since the input images are tiny, we modify it slightly. In particular, the difference from Section 8.6 is that we use a smaller convolution kernel, stride, and padding at the beginning. Moreover, we remove the max-pooling layer.

# PyTorch version
#@save
def resnet18(num_classes, in_channels=1):
  """A slightly modified ResNet-18 model."""
  def resnet_block(in_channels, out_channels, num_residuals,
           first_block=False):
    blk = []
    for i in range(num_residuals):
      if i == 0 and not first_block:
        blk.append(d2l.Residual(out_channels, use_1x1conv=True,
                    strides=2))
      else:
        blk.append(d2l.Residual(out_channels))
    return nn.Sequential(*blk)

  # This model uses a smaller convolution kernel, stride, and padding and
  # removes the max-pooling layer
  net = nn.Sequential(
    nn.Conv2d(in_channels, 64, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU())
  net.add_module("resnet_block1", resnet_block(64, 64, 2, first_block=True))
  net.add_module("resnet_block2", resnet_block(64, 128, 2))
  net.add_module("resnet_block3", resnet_block(128, 256, 2))
  net.add_module("resnet_block4", resnet_block(256, 512, 2))
  net.add_module("global_avg_pool", nn.AdaptiveAvgPool2d((1,1)))
  net.add_module("fc", nn.Sequential(nn.Flatten(),
                    nn.Linear(512, num_classes)))
  return net

# MXNet version
#@save
def resnet18(num_classes):
  """A slightly modified ResNet-18 model."""
  def resnet_block(num_channels, num_residuals, first_block=False):
    blk = nn.Sequential()
    for i in range(num_residuals):
      if i == 0 and not first_block:
        blk.add(d2l.Residual(
          num_channels, use_1x1conv=True, strides=2))
      else:
        blk.add(d2l.Residual(num_channels))
    return blk

  net = nn.Sequential()
  # This model uses a smaller convolution kernel, stride, and padding and
  # removes the max-pooling layer
  net.add(nn.Conv2D(64, kernel_size=3, strides=1, padding=1),
      nn.BatchNorm(), nn.Activation('relu'))
  net.add(resnet_block(64, 2, first_block=True),
      resnet_block(128, 2),
      resnet_block(256, 2),
      resnet_block(512, 2))
  net.add(nn.GlobalAvgPool2D(), nn.Dense(num_classes))
  return net
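
As a quick sanity check (an illustrative snippet, not part of the original text), we can feed a dummy Fashion-MNIST-sized batch through the PyTorch variant on the CPU and confirm that it yields one logit per class:

# Illustrative sanity check: a batch of 4 single-channel 28x28 images
# should map to 4 rows of 10 class logits
net = resnet18(num_classes=10)
X = torch.rand(4, 1, 28, 28)
print(net(X).shape)  # torch.Size([4, 10])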

13.6.2. Network Initialization

We will initialize the network inside the training loop. For a refresher on initialization methods, see Section 5.4.

# PyTorch version
net = resnet18(10)
# Get a list of GPUs
devices = d2l.try_all_gpus()
# We will initialize the network inside the training loop

The initialize function allows us to initialize parameters on a device of our choice. For a refresher on initialization methods see Section 5.4. What is particularly convenient is that it also allows us to initialize the network on multiple devices simultaneously. Let’s try how this works in practice.

# MXNet version
net = resnet18(10)
# Get a list of GPUs
devices = d2l.try_all_gpus()
# Initialize all the parameters of the network
net.initialize(init=init.Normal(sigma=0.01), ctx=devices)

Using the split_and_load function introduced in Section 13.5 we can divide a minibatch of data and copy portions to the list of devices provided by the devices variable. The network instance automatically uses the appropriate GPU to compute the value of the forward propagation. Here we generate 4 observations and split them over the GPUs.

x = np.random.uniform(size=(4, 1, 28, 28))
x_shards = gluon.utils.split_and_load(x, devices)
net(x_shards[0]), net(x_shards[1])
[08:00:43] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
(array([[ 2.2610207e-06, 2.2045981e-06, -5.4046786e-06, 1.2869955e-06,
     5.1373163e-06, -3.8297967e-06, 1.4339059e-07, 5.4683451e-06,
     -2.8279192e-06, -3.9651104e-06],
    [ 2.0698672e-06, 2.0084667e-06, -5.6382510e-06, 1.0498458e-06,
     5.5506434e-06, -4.1065491e-06, 6.0830087e-07, 5.4521784e-06,
     -3.7365021e-06, -4.1891640e-06]], ctx=gpu(0)),
 array([[ 2.4629783e-06, 2.6015525e-06, -5.4362617e-06, 1.2938218e-06,
     5.6387889e-06, -4.1360108e-06, 3.5758853e-07, 5.5125256e-06,
     -3.1957325e-06, -4.2976326e-06],
    [ 1.9431673e-06, 2.2600434e-06, -5.2698201e-06, 1.4807417e-06,
     5.4830934e-06, -3.9678889e-06, 7.5751018e-08, 5.6764356e-06,
     -3.2530229e-06, -4.0943951e-06]], ctx=gpu(1)))

Once data passes through the network, the corresponding parameters are initialized on the device the data passed through. This means that initialization happens on a per-device basis. Since we picked GPU 0 and GPU 1 for initialization, the network is initialized only there, and not on the CPU. In fact, the parameters do not even exist on the CPU. We can verify this by printing out the parameters and observing any errors that might arise.

weight = net[0].params.get('weight')

try:
  weight.data()
except RuntimeError:
  print('not initialized on cpu')
weight.data(devices[0])[0], weight.data(devices[1])[0]
not initialized on cpu
(array([[[ 0.01382882, -0.01183044, 0.01417865],
     [-0.00319718, 0.00439528, 0.02562625],
     [-0.00835081, 0.01387452, -0.01035946]]], ctx=gpu(0)),
 array([[[ 0.01382882, -0.01183044, 0.01417865],
     [-0.00319718, 0.00439528, 0.02562625],
     [-0.00835081, 0.01387452, -0.01035946]]], ctx=gpu(1)))

Next, let's replace the code for evaluating the accuracy with one that works in parallel across multiple devices. This serves as a replacement for the evaluate_accuracy_gpu function from Section 7.6. The main difference is that we split a minibatch before invoking the network. All else is essentially identical.

#@save
def evaluate_accuracy_gpus(net, data_iter, split_f=d2l.split_batch):
  """Compute the accuracy for a model on a dataset using multiple GPUs."""
  # Query the list of devices
  devices = list(net.collect_params().values())[0].list_ctx()
  # No. of correct predictions, no. of predictions
  metric = d2l.Accumulator(2)
  for features, labels in data_iter:
    X_shards, y_shards = split_f(features, labels, devices)
    # Run in parallel
    pred_shards = [net(X_shard) for X_shard in X_shards]
    metric.add(sum(float(d2l.accuracy(pred_shard, y_shard)) for
            pred_shard, y_shard in zip(
              pred_shards, y_shards)), labels.size)
  return metric[0] / metric[1]
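
A usage sketch (assuming net has been initialized on the GPUs as above, and using the Fashion-MNIST loader from the d2l package):

# Usage sketch: evaluate the multi-GPU model on the test set
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=256)
print(evaluate_accuracy_gpus(net, test_iter))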

13.6.3. Training

As before, the training code needs to perform several basic functions for efficient parallelism:

  • Network parameters need to be initialized across all devices.

  • While iterating over the dataset, minibatches are to be divided across all devices.

  • We compute the loss and its gradient in parallel across the devices.

  • Gradients are aggregated and parameters are updated accordingly.

In the end we compute the accuracy (again in parallel) to report the final performance of the network. The training routine is quite similar to the implementations in previous chapters, except that we need to split and aggregate data.

# PyTorch version
def train(net, num_gpus, batch_size, lr):
  train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
  devices = [d2l.try_gpu(i) for i in range(num_gpus)]
  def init_weights(module):
    if type(module) in [nn.Linear, nn.Conv2d]:
      nn.init.normal_(module.weight, std=0.01)
  net.apply(init_weights)
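
The source breaks off at this point. What follows is a sketch of the rest of the routine in the same style as earlier chapters: the model is replicated across the GPUs with nn.DataParallel, which scatters each minibatch over the devices and aggregates gradients automatically. The d2l.Timer, d2l.Animator, and d2l.evaluate_accuracy_gpu calls follow the d2l library's conventions; treat this continuation as an assumption rather than the original text.

  # Sketch (assumed continuation): replicate the model on all GPUs;
  # nn.DataParallel splits each input minibatch across device_ids and
  # gathers outputs and gradients on the first device
  net = nn.DataParallel(net, device_ids=devices)
  trainer = torch.optim.SGD(net.parameters(), lr)
  loss = nn.CrossEntropyLoss()
  timer, num_epochs = d2l.Timer(), 10
  animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
  for epoch in range(num_epochs):
    net.train()
    timer.start()
    for X, y in train_iter:
      trainer.zero_grad()
      # Moving data to the first device suffices: DataParallel scatters it
      X, y = X.to(devices[0]), y.to(devices[0])
      l = loss(net(X), y)
      l.backward()
      trainer.step()
    timer.stop()
    animator.add(epoch + 1, (d2l.evaluate_accuracy_gpu(net, test_iter),))
  print(f'test acc: {animator.Y[0][-1]:.2f}, {timer.avg():.1f} sec/epoch '
        f'on {str(devices)}')

With two GPUs this could then be invoked as, for example, train(net, num_gpus=2, batch_size=512, lr=0.2).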
