Split Training of Classification Models with Millions of Classes

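Many face recognition algorithms are trained as classifiers. A major problem with this approach is that the final fully connected layer has a huge number of parameters. Taking 512-dimensional features stored as float32 as an example: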

Classes      Weight matrix shape   Weight matrix size (MB)
1,000,000    512 x 1,000,000       1953
2,000,000    512 x 2,000,000       3906
5,000,000    512 x 5,000,000       9765
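
The sizes in the table follow directly from the shape of the weight matrix. Here is a quick sketch of the arithmetic, assuming float32 weights (4 bytes each) and ignoring the bias; `fc_weight_mb` is a hypothetical helper name used only for illustration:

```python
def fc_weight_mb(num_classes, feat_dim=512, bytes_per_param=4):
    """Size in MB of a feat_dim x num_classes weight matrix (bias ignored)."""
    return feat_dim * num_classes * bytes_per_param / 1024 ** 2

for n in (1_000_000, 2_000_000, 5_000_000):
    print(f"{n:>9,} classes: {fc_weight_mb(n):.1f} MB")
```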

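With even more classes, a consumer GPU such as the 1080 Ti can no longer hold the layer, not to mention the extra memory needed for the intermediate results of the forward/backward passes.

Open-source datasets keep growing. Even without private data, open data alone is now enough to push the class count to one million. At that scale, training on a single card is difficult and the model has to be split across GPUs.

Model splitting

The most obvious way to split the model is to split its largest layer, the fc layer.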

import torch
from torchvision.models import resnet50

class facemodel(torch.nn.Module):
    def __init__(self,num_classes):
        super(facemodel,self).__init__()
        # The backbone lives on GPU-0
        self.backbone = resnet50().to(torch.device("cuda:0"))
        self.backbone.fc = torch.nn.Linear(2048, 512,bias=True).to(torch.device("cuda:0"))
        self.fc1 = torch.nn.Linear(512, int(num_classes / 6)).to(torch.device("cuda:0"))
        # Put the larger part of the fc layer on GPU-1; GPU-0 also needs room
        # for the backbone's forward/backward buffers, so it gets the smaller share
        self.fc2 = torch.nn.Linear(512, num_classes - int(num_classes / 6)).to(torch.device("cuda:1"))
    def forward(self,x):
        x = self.backbone(x)
        x1 = self.fc1(x)
        x2 = self.fc2(x.to(torch.device("cuda:1")))
        # Move the GPU-1 logits back to GPU-0 so the loss can be computed there
        return torch.cat([x1,x2.to(torch.device("cuda:0"))],dim = 1)

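Take a model with 2 million classes as an example:

from torchsummary import summary  # assumed source of summary(): the torchsummary package

net = facemodel(2000000)
summary(net,(3,224,224))

The model's parameter count is as follows: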

================================================================
Total params: 1,050,557,120
Trainable params: 1,050,557,120
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 301.82
Params size (MB): 4007.56
Estimated Total Size (MB): 4309.95
----------------------------------------------------------------

In theory, a single card (11178 MiB on a 1080 Ti) can fit a batch of (11178 - 4007.56) / 301.82 = 23.76 samples, so two cards should manage roughly twice that, 47.52.
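
The numbers in the summary and the estimate above can be reproduced by hand. The sketch below assumes torchvision's standard ResNet-50 (25,557,032 parameters, of which 2,049,000 belong to its original 1000-way fc layer) and float32 weights; the variable names are only for illustration:

```python
RESNET50_PARAMS = 25_557_032        # torchvision resnet50, including its 1000-way fc
ORIGINAL_FC = 2048 * 1000 + 1000    # weights + bias of the fc layer that gets replaced
num_classes, feat_dim = 2_000_000, 512

# Backbone with the fc layer swapped for a 2048 -> 512 embedding layer
backbone = RESNET50_PARAMS - ORIGINAL_FC + (2048 * feat_dim + feat_dim)
# fc1 and fc2 together: one weight column and one bias per class
classifier = feat_dim * num_classes + num_classes
total_params = backbone + classifier
params_mb = total_params * 4 / 1024 ** 2

gpu_mb = 11178          # memory of one GTX 1080 Ti as reported by nvidia-smi
per_sample_mb = 301.82  # "Forward/backward pass size" from the summary
single_card = (gpu_mb - params_mb) / per_sample_mb

print(total_params)           # 1050557120
print(round(params_mb, 2))    # 4007.56
print(round(single_card, 2))  # 23.76
```

This is likely an optimistic bound, since it leaves out the CUDA context and allocator overhead that show up in the idle nvidia-smi readings.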

Now let's see how large a batch_size actually fits on two cards.

With the model loaded, the memory allocated on the two GPUs is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   59C    P8    20W / 250W |   1531MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 29%   52C    P8    19W / 250W |   3841MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     19447      C   /home/dai/py36env/bin/python                1521MiB |
|    1     19447      C   /home/dai/py36env/bin/python                3831MiB |
+-----------------------------------------------------------------------------+

Try batch_size=64:

batch_size = 64
img = torch.ones(batch_size,3,224,224).cuda()
out = net(img)
label = torch.ones(batch_size).long().to(torch.device("cuda:0"))
loss = torch.nn.CrossEntropyLoss()(out,label)
loss.backward()
loss.item()

After one backward pass with a batch_size of 64, the GPU memory usage is as follows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   73C    P2    84W / 250W |   9855MiB / 11178MiB |     56%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   61C    P2    79W / 250W |   7505MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     19963      C   /home/dai/py36env/bin/python                9845MiB |
|    1     19963      C   /home/dai/py36env/bin/python                7495MiB |
+-----------------------------------------------------------------------------+

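So after splitting the model, training with a larger batch_size becomes possible.

However, the memory readings above reveal a problem: the forward/backward memory growth differs between the two GPUs, and their utilization differs a lot as well. This wastes memory, and having one GPU do the work while the other looks on for long periods is also an easy way to wear out one of the cards.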
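
The imbalance can be quantified from the two nvidia-smi readings above. This small sketch subtracts the idle usage from the usage after the backward pass (numbers copied from the tables; the dictionary names are illustrative):

```python
batch_size = 64
idle_mib = {"GPU0": 1531, "GPU1": 3841}    # after building the model
loaded_mib = {"GPU0": 9855, "GPU1": 7505}  # after one forward/backward pass

growth = {gpu: loaded_mib[gpu] - idle_mib[gpu] for gpu in idle_mib}
for gpu, mib in growth.items():
    print(f"{gpu}: +{mib} MiB total, {mib / batch_size:.1f} MiB per sample")

# The concatenated logits (64 x 2,000,000 float32 values) are gathered on GPU0
logits_mib = batch_size * 2_000_000 * 4 / 1024 ** 2
print(f"logits: {logits_mib:.1f} MiB")
```

GPU0 grows more than twice as much as GPU1, which is plausible given that it holds the backbone activations, the gathered logits, and the loss computation.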

To address this, we can try a finer-grained split of the model.

A finer-grained split

We can also split the resnet50 backbone itself across the two GPUs:

class face_model(torch.nn.Module):
    def __init__(self,num_classes):
        super(face_model,self).__init__()
        backbone = resnet50()
        self.bottom = torch.nn.Sequential(
                backbone.conv1,backbone.bn1, backbone.relu, backbone.maxpool
            ).to(torch.device("cuda:0"))
        self.layer1 = backbone.layer1.to(torch.device("cuda:0"))
        self.layer2 = backbone.layer2.to(torch.device("cuda:0"))
        self.layer3 = backbone.layer3.to(torch.device("cuda:1"))
        self.layer4 = backbone.layer4.to(torch.device("cuda:1"))
        self.avgpool = backbone.avgpool.to(torch.device("cuda:1"))
        self.fc = torch.nn.Linear(2048, 512,bias=True).to(torch.device("cuda:1"))
        self.fc1 = torch.nn.Linear(in_features=512, out_features = int(num_classes / 2),bias=True).to(torch.device("cuda:0"))
        self.fc2 = torch.nn.Linear(in_features=512, out_features = num_classes - int(num_classes / 2),bias=True).to(torch.device("cuda:1"))
    def forward(self,x):
        # First half of the backbone runs on GPU-0
        x = x.to(torch.device("cuda:0"))
        x = self.bottom(x)
        x = self.layer1(x)
        x = self.layer2(x)
        # Second half of the backbone runs on GPU-1
        x = x.to(torch.device("cuda:1"))
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x).squeeze(3).squeeze(2)
        x = self.fc(x)
        # Each GPU computes the logits for its own half of the classes
        x2 = self.fc2(x)
        x1 = self.fc1(x.to(torch.device("cuda:0")))
        # Gather all logits on GPU-0 for the loss
        return torch.cat([x1,x2.to(torch.device("cuda:0"))],dim = 1)

net = face_model(2000000)

Note: move networks and tensors with to(device), not cuda(GPUID).

With no load, the memory usage is fairly balanced:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   64C    P2    76W / 250W |   2539MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 29%   62C    P2    80W / 250W |   2625MiB / 11178MiB |     62%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      9574      C   /home/dai/py36env/bin/python                2529MiB |
|    1      9574      C   /home/dai/py36env/bin/python                2615MiB |
+-----------------------------------------------------------------------------+

But as soon as it runs with a batch size of 64, it looks like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   67C    P2    81W / 250W |  10945MiB / 11178MiB |     78%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 31%   62C    P2    81W / 250W |   6315MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      9574      C   /home/dai/py36env/bin/python               10935MiB |
|    1      9574      C   /home/dai/py36env/bin/python                6305MiB |
+-----------------------------------------------------------------------------+

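Both memory and load look quite unbalanced. I think this can be addressed in two ways:

1. Move more of the fc layer's weights to GPU1;
2. Distribute the loss computation across both GPUs.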
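
For the first option, the split ratio of the fc layer is the knob to turn. The following is a minimal sketch of a hypothetical helper, `split_classes`, that divides the classes between GPUs according to relative weights and reports the float32 weight memory of each shard (512-dimensional features assumed, bias ignored). The ratios shown are illustrative, not tuned values:

```python
def split_classes(num_classes, weights):
    """Split num_classes into len(weights) shards proportional to weights."""
    total = sum(weights)
    sizes = [num_classes * w // total for w in weights[:-1]]
    sizes.append(num_classes - sum(sizes))  # last shard absorbs the rounding remainder
    return sizes

def shard_mb(n, feat_dim=512):
    """float32 weight memory in MB for a shard with n classes."""
    return feat_dim * n * 4 / 1024 ** 2

for ratio in ([1, 1], [1, 3], [1, 5]):
    sizes = split_classes(2_000_000, ratio)
    print(ratio, sizes, [round(shard_mb(n)) for n in sizes])
```

The two sizes would then be used as the out_features of fc1 (GPU0) and fc2 (GPU1). A ratio of [1, 5] reproduces the 1/6 split used in the first model.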

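Computing the loss on two GPUs

Loss functions in face recognition are often fairly complex, which makes this load imbalance even more pronounced. To alleviate it, the loss can be computed inside forward, with half of the batch handled on each GPU: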

class face_model(torch.nn.Module):
    def __init__(self,num_classes):
        super(face_model,self).__init__()
        backbone = resnet50()
        self.bottom = torch.nn.Sequential(
                backbone.conv1,backbone.bn1, backbone.relu, backbone.maxpool
            ).to(torch.device("cuda:0"))
        self.layer1 = backbone.layer1.to(torch.device("cuda:0"))
        self.layer2 = backbone.layer2.to(torch.device("cuda:0"))
        self.layer3 = backbone.layer3.to(torch.device("cuda:1"))
        self.layer4 = backbone.layer4.to(torch.device("cuda:1"))
        self.avgpool = backbone.avgpool.to(torch.device("cuda:1"))
        self.fc = torch.nn.Linear(2048, 512,bias=True).to(torch.device("cuda:1"))
        self.fc1 = torch.nn.Linear(in_features=512, out_features = int(num_classes / 2),bias=True).to(torch.device("cuda:0"))
        self.fc2 = torch.nn.Linear(in_features=512, out_features = num_classes - int(num_classes / 2),bias=True).to(torch.device("cuda:1"))
    def forward(self,x,label):
        x = x.to(torch.device("cuda:0"))
        x = self.bottom(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = x.to(torch.device("cuda:1"))
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x).squeeze(3).squeeze(2)
        x = self.fc(x)
        x2 = self.fc2(x)
        x1 = self.fc1(x.to(torch.device("cuda:0")))
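        x = torch.cat([x1,x2.to(torch.device("cuda:0"))],dim = 1)
        # Split the batch in two: the first half of the samples is scored on GPU-0, the second half on GPU-1
        half = len(label) // 2
        loss1 = torch.nn.CrossEntropyLoss()(x[:half],label[:half].to(torch.device("cuda:0")))
        loss2 = torch.nn.CrossEntropyLoss()(x[half:].to(torch.device("cuda:1")),label[half:].to(torch.device("cuda:1")))
        # Average the two partial losses on GPU-0 (equals the full-batch mean for an even batch size)
        return (loss1 + loss2.to(torch.device("cuda:0"))) / 2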
net = face_model(2000000)
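
One detail of averaging the two half-batch losses: `(loss1 + loss2) / 2` matches the full-batch mean only when both halves contain the same number of samples. The pure-Python sketch below checks this with made-up per-sample loss values (illustrative numbers, not results from the model); the function names are hypothetical:

```python
def mean(values):
    return sum(values) / len(values)

def split_mean(per_sample_losses):
    """Average of the two half-batch means, as in the forward pass above."""
    half = len(per_sample_losses) // 2
    return (mean(per_sample_losses[:half]) + mean(per_sample_losses[half:])) / 2

def weighted_split_mean(per_sample_losses):
    """Combine the halves by their sums so the result equals the full mean."""
    half = len(per_sample_losses) // 2
    first, second = per_sample_losses[:half], per_sample_losses[half:]
    return (sum(first) + sum(second)) / len(per_sample_losses)

even = [0.5, 1.5, 2.0, 4.0]
odd = [0.5, 1.5, 2.0, 4.0, 7.0]
print(split_mean(even), mean(even))
print(split_mean(odd), mean(odd), weighted_split_mean(odd))
```

With an even batch size such as 64 the two formulations agree, so the model code is unaffected in that case.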

As the GPU info below shows, distributing the loss improves the memory allocation slightly, and GPU utilization looks somewhat more normal as well.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   86C    P2   166W / 250W |  10701MiB / 11178MiB |     43%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 34%   62C    P2    81W / 250W |   7053MiB / 11178MiB |     74%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     11743      C   /home/dai/py36env/bin/python               10691MiB |
|    1     11743      C   /home/dai/py36env/bin/python                7043MiB |
+-----------------------------------------------------------------------------+

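Model speed

Splitting the model adds many data transfer operations, so training speed naturally drops quite a bit. The asynchrony between PyTorch's front end and back end can be used to optimize speed. For details, see:

https://zhuanlan.zhihu.com/p/87596314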