A Step-by-Step Guide to Object Detection with YOLO

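Introduction

As the old saying goes, "if you do not advance you fall back; if you are not glad you are worried." In this era of sweeping AI change, basic AI skills are essential for anyone working in technology. This article starts from an object detection application and walks step by step through using YOLO for object detection. It focuses on applying an existing tool (the YOLO model) to object detection, that is, on model inference, and does not go into theory or model training.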

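A brief introduction to YOLO

YOLO is short for "You Only Look Once", an object detection deep network proposed in 2015 by Joseph Redmon, Ali Farhadi, and colleagues. YOLOv2 followed in 2017, mainly as the YOLOv2 model trained on the COCO dataset and the YOLO9000 model, which can detect more than 9000 object categories. Compared with v1, v2 added batch normalization (BN) layers, raised the classifier's input resolution (224×224 → 448×448), and introduced anchor boxes, multi-scale training, and other features. YOLOv3 uses an FPN-style structure to improve feature representation, uses binary cross-entropy for the classification loss, adds residual connections, and is of course deeper: its backbone is Darknet-53. More recently YOLOv4 and YOLOv5 have also appeared; I have not used them, so they are not covered here. Detailed comparisons are easy to find online.

This article uses YOLOv3 for the object detection task. For convenience it is simply called YOLO below.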

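Key points

To use YOLO for object detection, a few questions come up naturally:

1. Where do we get the model?

2. How do we load the model and run inference with it?

3. What does the model output look like, and how do we use it?

4. How do we verify that the model works correctly?

These questions are the groundwork for reusing an existing model, transfer-learning style, in real work (leaving aside inference optimization, fine-tuning, and so on). With a different network or a different application the process is much the same. The rest of the article answers these questions one by one.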

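Loading the model

Loading YOLO requires three files: a cfg file that describes the network structure, a class label file that lists the names of all labels (for example car, truck), and a weights file for the network.

The network structure file can be downloaded here:

https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg

For the class label file we use the COCO dataset labels:

https://github.com/pjreddie/darknet/blob/master/data/coco.names

The weights file:

https://pjreddie.com/media/files/yolov3.weights

Take a quick look at yolov3.cfg: it describes a fully convolutional network, with shortcut and route layers implementing the residual structure. Note that the last conv layer has 255 filters; where this 255 comes from is explained later.

With these files in hand, how do we load them into a model that can run inference?

Many frameworks could be used, such as TensorFlow or PyTorch. We could parse the cfg file, build the network as defined, then parse the weights file and copy the weights block by block into the corresponding layers. We do not do that here (building the network from the cfg is tedious, although it does deepen your understanding of YOLO; for PyTorch see the blog post [2]). Instead we use the readNetFromDarknet function from the dnn module that ships with OpenCV. The snippet is as follows: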

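import cv2

cfg = "cfg/yolov3.cfg"
weights = "cfg/yolov3.weights"
# Build the network from the Darknet config and weights files
net = cv2.dnn.readNetFromDarknet(cfg, weights)
# Run on OpenCV's own backend, on the CPU
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)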

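This is straightforward: prepare the downloaded cfg and weights files, call the function provided by cv2, and set the backend and the compute target. The resulting network object is then ready for inference.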
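
Loading fails with a fairly opaque error when a file path is wrong, so it can help to check the files first. The sketch below is only an illustration: the helper name missing_files is hypothetical, and the paths are assumed to match the ones used in the snippet above.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


# Assumed locations; adjust them to wherever the files were downloaded.
required = ["cfg/yolov3.cfg", "cfg/yolov3.weights"]
missing = missing_files(required)
if missing:
    print("Missing model files:", missing)
```

If the list is empty, the call to readNetFromDarknet should at least be able to find its inputs.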

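Reading image data and running inference

Read the image with OpenCV's imread to get a frame, then convert it into the format the model expects as input, as in the following code: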

# Scale pixels to [0, 1], resize to 416x416, swap BGR -> RGB, no cropping
blob = cv2.dnn.blobFromImage(frame, 1/255, (416, 416), [0, 0, 0], 1, crop=False)
net.setInput(blob)

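A quick walkthrough of the blobFromImage arguments used in the first line:

First argument: the image data returned by imread.

Second argument: 1/255 is the scalefactor, i.e. each channel value is multiplied by 1/255. If channel values range from 0 to 255, this scales them into the range 0 to 1.

Third argument: the size the whole image is resized to, here (416, 416), which is the model's input size.

Fourth argument: [0, 0, 0] is the per-channel mean. The function first subtracts the mean from each channel value and then multiplies by scalefactor; here the mean is all zeros.

Fifth argument: swapRB, whether to swap the R and B channels. It is set to 1 here, so the BGR image read by OpenCV is converted to RGB order.

Sixth argument: crop is False here, so the image is not cropped.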
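
To make these arguments concrete, here is a rough NumPy sketch of what the call does to an image that is already 416×416. It is an approximation for illustration, not OpenCV's implementation: resizing and cropping are left out, and the function name to_blob is hypothetical.

```python
import numpy as np


def to_blob(image_bgr, scale=1 / 255, mean=(0.0, 0.0, 0.0), swap_rb=True):
    """Approximate blobFromImage for an image already at the network size."""
    img = image_bgr.astype(np.float32)
    if swap_rb:
        img = img[:, :, ::-1]                    # BGR -> RGB
    img = (img - np.array(mean, dtype=np.float32)) * scale
    chw = np.transpose(img, (2, 0, 1))           # HWC -> CHW
    return chw[np.newaxis, ...]                  # add the batch dimension


# Synthetic BGR frame: blue channel 0, green and red channels 255
frame = np.full((416, 416, 3), 255, dtype=np.uint8)
frame[:, :, 0] = 0
blob = to_blob(frame)
print(blob.shape)  # (1, 3, 416, 416)
```

After the swap, channel 0 of the blob holds the red values (scaled to 1.0) and channel 2 holds the blue values (0.0).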

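Once the input is set we can run the forward pass. Before that, we need the following code:

import numpy as np
layersNames = net.getLayerNames()
# flatten() handles both the Nx1 and the flat index arrays returned by different OpenCV versions
names = [layersNames[i - 1] for i in np.array(net.getUnconnectedOutLayers()).flatten()]

The first statement lists the names of all layers, something like ['conv_1', 'conv_2', ...].

The second statement first finds all layers whose outputs are not connected to any other layer, which are the network's output layers, and then uses their indices to look up the layer names. The indices start at 1, hence the "- 1".

Finally, run inference with:

outs = net.forward(names)

which performs the forward computation.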
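
The index handling is easy to get wrong because, depending on the OpenCV version, getUnconnectedOutLayers may return either an N×1 array or a flat array of 1-based indices. The sketch below isolates that lookup in a small helper; the helper name and the layer names in the example are hypothetical.

```python
import numpy as np


def output_layer_names(layer_names, out_layer_ids):
    """Map 1-based output-layer indices to names, for flat or Nx1 index arrays."""
    ids = np.array(out_layer_ids).flatten()
    return [layer_names[int(i) - 1] for i in ids]


# Hypothetical layer names, only for illustration
layer_names = ["conv_0", "yolo_1", "conv_2", "yolo_3"]
print(output_layer_names(layer_names, [[2], [4]]))  # Nx1 layout
print(output_layer_names(layer_names, [2, 4]))      # flat layout
```

Both calls print the same two names, so the rest of the code does not need to care which layout the installed version uses.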

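Model output

The inference result is a list with one array per output layer. Each row of an array describes one bounding box, with the fields in this order:

(box_center_x, box_center_y, box_width, box_height, detection_confidence, class1_confidence, class2_confidence, ..., class80_confidence)

The first four fields give the position, width, and height of the bounding box, the fifth is the detection confidence score, and the remaining 80 are the confidences that the detected object belongs to each class.

Note that the four bounding box values are normalized. They must be multiplied by the actual image width and height to get the final pixel coordinates in the image.

Notice that these fields add up to 4 + 1 + 80 = 85, and with three anchor boxes predicted per grid cell, 3 * 85 = 255. That explains the magic number 255 mentioned earlier.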
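
The same arithmetic can be written out as a short script. The strides 32, 16, and 8 and the three anchors per cell are the standard YOLOv3 settings and are stated here as assumptions, since this article does not list them; with a 416×416 input they also give the total number of candidate boxes.

```python
num_classes = 80          # COCO classes
box_fields = 4 + 1        # x, y, w, h plus detection confidence
anchors_per_cell = 3      # assumed: standard YOLOv3 setting

row_length = box_fields + num_classes      # fields per detection row
filters = anchors_per_cell * row_length    # filters of the conv layer feeding each output

# Assumed strides of the three YOLOv3 output scales for a 416x416 input
grids = [416 // stride for stride in (32, 16, 8)]
total_boxes = sum(g * g * anchors_per_cell for g in grids)

print(row_length, filters)   # 85 255
print(grids, total_boxes)    # [13, 26, 52] 10647
```

So a single forward pass yields on the order of ten thousand candidate rows, which is why the filtering steps that follow matter.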

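The code that iterates over all forward outputs is as follows:

import numpy as np

frameHeight, frameWidth = frame.shape[:2]
classIds, confidences, boxes = [], [], []

for out in outs:
    for detection in out:
        # Fields from index 5 onward are the per-class confidences
        scores = detection[5:]
        classId = np.argmax(scores)
        confidence = scores[classId]
        if confidence > confThreshold:
            # Box fields are normalized; scale them to pixel units
            center_x = int(detection[0] * frameWidth)
            center_y = int(detection[1] * frameHeight)
            width = int(detection[2] * frameWidth)
            height = int(detection[3] * frameHeight)
            # Convert the box center to its top-left corner
            left = int(center_x - width / 2)
            top = int(center_y - height / 2)
            classIds.append(int(classId))
            confidences.append(float(confidence))
            boxes.append([left, top, width, height])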

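Here confThreshold is a configuration parameter used to filter out low-confidence detections. The first four values of each detection are used to compute the bounding box's pixel coordinates in the image. Starting at index 5 there are 80 values, the confidences for the 80 classes, and we take the class with the highest confidence as the classification result for the detection.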
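
The per-row logic can be tried without a model by decoding a hand-written row. The sketch below is a pure-Python version of the loop body; the function name decode_detection and the three-class example row are made up for illustration.

```python
def decode_detection(row, frame_width, frame_height, conf_threshold):
    """Turn one normalized detection row into (class_id, confidence, [left, top, w, h]).

    Returns None when the best class confidence does not exceed the threshold.
    """
    scores = row[5:]
    class_id = max(range(len(scores)), key=lambda k: scores[k])
    confidence = float(scores[class_id])
    if confidence <= conf_threshold:
        return None
    center_x = int(row[0] * frame_width)
    center_y = int(row[1] * frame_height)
    width = int(row[2] * frame_width)
    height = int(row[3] * frame_height)
    left = int(center_x - width / 2)
    top = int(center_y - height / 2)
    return class_id, confidence, [left, top, width, height]


# Made-up row with only three classes instead of 80
row = [0.5, 0.5, 0.25, 0.5, 0.9, 0.1, 0.8, 0.05]
print(decode_detection(row, 640, 480, 0.5))  # (1, 0.8, [240, 120, 160, 240])
```

With a threshold of 0.9 the same row is rejected, because the decision uses the best class confidence (0.8) rather than the detection confidence in field 5.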

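Finally, because one object may be covered by several bounding boxes, a non-maximum suppression (NMS) step is needed: among the boxes covering the same object, keep only the one with the highest confidence.

Fortunately OpenCV provides a function for this. We only need to pass in the bounding boxes and their confidences.

indices = cv2.dnn.NMSBoxes(boxes, confidences, confThreshold, nmsThreshold)

Here boxes is the list of all boxes obtained above, confidences holds the corresponding confidence values, confThreshold is the confidence threshold, and nmsThreshold is the overlap threshold used in the NMS computation.

The returned indices are indices into boxes; with them we get the final set of bounding boxes.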
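
For readers who want to see what such a step does, here is a minimal pure-Python sketch of greedy NMS based on intersection over union (IoU). It illustrates the general technique and is not claimed to reproduce cv2.dnn.NMSBoxes exactly; the function names are hypothetical and boxes use the [left, top, width, height] layout from above.

```python
def iou(a, b):
    """Intersection over union of two [left, top, width, height] boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, score_threshold, iou_threshold):
    """Keep the highest-scoring boxes, dropping any that overlap a kept box too much."""
    order = sorted(
        (i for i, s in enumerate(scores) if s > score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [[10, 10, 100, 100], [12, 12, 100, 100], [200, 200, 50, 50]]
scores = [0.9, 0.8, 0.95]
print(greedy_nms(boxes, scores, 0.5, 0.4))  # [2, 0]
```

Box 1 overlaps box 0 almost completely and has a lower score, so it is suppressed, while the distant box 2 is kept.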

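final_res = []
# flatten() copes with both the Nx1 and the flat index arrays from different OpenCV versions
for i in np.array(indices).flatten():
    label = classes[classIds[i]]
    confidence = confidences[i]
    box = boxes[i]
    # append takes a single argument, so store the result as one tuple
    final_res.append((box, confidence, label))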

classes is the list of names read from coco.names.
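
Reading that list takes only a few lines. The sketch below assumes the file has one class name per line, as coco.names does; the function names and the path in the comment are illustrative.

```python
def parse_class_names(lines):
    """Strip whitespace from each line and drop empty lines."""
    return [line.strip() for line in lines if line.strip()]


def load_class_names(path):
    """Read one class name per line from a label file such as coco.names."""
    with open(path, "r", encoding="utf-8") as f:
        return parse_class_names(f)


# Example with in-memory lines; with a real file use load_class_names("cfg/coco.names")
print(parse_class_names(["person\n", "bicycle\n", "car\n", "\n"]))
```

The position of a name in the list is its class id, so classes[classId] turns the index found by argmax into a readable label.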

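Visualizing the detections

With the boxes from above, we can check the detection quality by drawing the image together with its bounding boxes. In practice this means calling OpenCV's rectangle function for each box and then displaying the image.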

for newbox in boxes:
    p1 = (int(newbox[0]), int(newbox[1]))                          # top-left corner
    p2 = (int(newbox[0] + newbox[2]), int(newbox[1] + newbox[3]))  # bottom-right corner
    cv2.rectangle(frame, p1, p2, (255, 0, 0), 2, 1)
# imshow needs a window name as its first argument
cv2.imshow("detections", frame)
cv2.waitKey(0)

Note that rectangle expects p1 and p2 to be the top-left and bottom-right corners of the box, so a small calculation is needed.
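
That calculation can be wrapped in a small helper. The version below also clamps the corners to the image bounds, which is an optional extra and not something the drawing call requires; the function name box_to_corners is hypothetical.

```python
def box_to_corners(box, frame_width, frame_height):
    """Convert [left, top, width, height] to clamped (top-left, bottom-right) points."""
    left, top, width, height = box
    x1 = min(max(int(left), 0), frame_width - 1)
    y1 = min(max(int(top), 0), frame_height - 1)
    x2 = min(max(int(left + width), 0), frame_width - 1)
    y2 = min(max(int(top + height), 0), frame_height - 1)
    return (x1, y1), (x2, y2)


print(box_to_corners([240, 120, 160, 240], 640, 480))  # ((240, 120), (400, 360))
print(box_to_corners([-10, 5, 50, 700], 640, 480))     # ((0, 5), (40, 479))
```

The two returned points can be passed directly as p1 and p2.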

Results

Original image:

Detection result:

References

[1] An introduction to the YOLO family of networks on CSDN:

https://blog.csdn.net/weixin_38673554/article/details/106009117

[2] A blog post on implementing YOLOv3 in PyTorch: https://blog.paperspace.com/how-to-implement-a-yolo-v3-object-detector-from-scratch-in-pytorch-part-3/

The complete example from the notebook:

import os
import cv2
import numpy as np
from IPython.display import Image, display
from PIL import Image as p_image

print(os.getcwd())
cfg = "../yolo3/yolov3.cfg"
weights = "../yolo3/yolov3.weights"
labels = "../yolo3/coco.names"
image = '../yolo3/dog-cycle-car.png'
conf_threshold = 0.9
nms_threshold = 0.4
#for f in os.listdir(os.getcwd()):
    #print(f)

# Fail early if any of the model files is missing
for path in (weights, cfg, labels):
    assert os.path.exists(path), "missing file: " + path

def init_nn():
    """Load the YOLOv3 network and the COCO class names."""
    nn = cv2.dnn.readNetFromDarknet(cfg, weights)
    nn.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
    nn.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)
    classes = []
    with open(labels, 'r') as f:
        for l in f.readlines():
            classes.append(l.strip())
    return classes, nn

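def yolo_detect(nn, image_path):
    """Run one forward pass and return the raw outputs of the output layers."""
    img = cv2.imread(image_path)
    blob = cv2.dnn.blobFromImage(img, 1/255, (416, 416), [0, 0, 0], 1, crop=False)
    nn.setInput(blob)
    layersNames = nn.getLayerNames()
    # flatten() handles both the Nx1 and the flat index arrays from different OpenCV versions
    out_ids = np.array(nn.getUnconnectedOutLayers()).flatten()
    names = [layersNames[i - 1] for i in out_ids]
    print("target output layers: ", names)
    outs = nn.forward(names)
    return outs
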
def nms_filter(yolo_det, image):
    """Threshold the raw detections, convert boxes to pixels, and apply NMS."""
    frame = cv2.imread(image)
    frame_h, frame_w = frame.shape[:2]
    class_ids = []
    confidence = []
    boxes = []
    for dets in yolo_det:
        for d in dets:
            # each row is x, y, w, h, obj_conf, class1_conf, ...
            scores = d[5:]
            class_id = int(np.argmax(scores))
            confi = scores[class_id]
            if confi > conf_threshold:
                class_ids.append(class_id)
                confidence.append(float(confi))
                center_x = int(d[0] * frame_w)
                center_y = int(d[1] * frame_h)
                width = int(d[2] * frame_w)
                height = int(d[3] * frame_h)
                left = int(center_x - width / 2)
                top = int(center_y - height / 2)
                boxes.append([left, top, width, height])
    print("confidence, ", confidence)
    print("res_d, ", boxes)
    indices = cv2.dnn.NMSBoxes(boxes, confidence, conf_threshold, nms_threshold)
    res = []
    for idx in np.array(indices).flatten():
        # store the class id (not the box index) so the label can be looked up later
        res.append([class_ids[idx], confidence[idx], boxes[idx]])
    return res

def visual_detection(yolo_res, image):
    """Draw every detection above the confidence threshold, without NMS."""
    frame = cv2.imread(image)
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    for dets in yolo_res:
        for d in dets:
            scores = d[5:]
            confi = scores[np.argmax(scores)]
            if confi > conf_threshold:
                center_x = int(d[0] * frame.shape[1])
                center_y = int(d[1] * frame.shape[0])
                width = int(d[2] * frame.shape[1])
                height = int(d[3] * frame.shape[0])
                left = int(center_x - width / 2)
                top = int(center_y - height / 2)
                right = left + width
                down = top + height
                cv2.rectangle(frame, (left, top), (right, down), (255, 0, 0), 2)
    print("after detect.")
    display(p_image.fromarray(frame))

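def visual_detection_nms(res, image, classes):
    """Draw the boxes that survived NMS together with their class labels."""
    frame = cv2.imread(image)
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    for d in res:
        class_id = d[0]
        confi = d[1]
        box = d[2]
        left = box[0]
        top = box[1]
        right = left + box[2]
        down = top + box[3]
        cls = classes[class_id]
        cv2.rectangle(frame, (left, top), (right, down), (255, 0, 0), 2)
        # write the class name and confidence just above the box
        cv2.putText(frame, "%s %.2f" % (cls, confi), (left, max(top - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1)
    print("after detect.")
    display(p_image.fromarray(frame))
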
# 1 init nn
classes, nn = init_nn()
print(classes)
#print(nn.getLayerNames())
print("example png: ")
display(Image(filename=image))
# 2 do detection
yolo_res = yolo_detect(nn, image)
# 3 nms filter and box transform
detections = nms_filter(yolo_res, image)
# 4 visualize
visual_detection_nms(detections, image, classes)

-- Technical Writing 101 Training Camp --