The Three Core Computer Vision Tasks: Classification, Localization, and Detection

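These notes follow the Python programming assignments of Stanford's cs231n course (deep learning for computer vision) and use them to work through the course's main ideas and some of the underlying math. The notes on this topic are split into two articles. This is the first one: it covers image localization and detection, and then uses recurrent networks from deep learning, RNNs and LSTMs, to process the resulting image features.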

Spatial Localization and Detection

Take the ILSVRC competition as an example. It includes three computer vision tasks: classification, localization, and detection.

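Classification: what is in the image? Predict the top-5 classes.
Localization: where is the object, and what is it? Predict the top-5 classes plus a bounding box for each class (the box must overlap the ground truth by at least 50%).
Detection: where are the objects, and what are they all?

Localization sits between classification and detection. Classification and localization share the same dataset, while detection uses an additional dataset in which the objects are smaller.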
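The 50% overlap criterion is usually measured as intersection over union (IoU) between the predicted box and the ground-truth box. Below is a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; the function name `iou` is chosen here for illustration and does not come from the course code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


pred = (0, 0, 10, 10)
truth = (5, 0, 15, 10)
# Intersection 50, union 150: about 0.33, below the 0.5 threshold
print(iou(pred, truth))
```

Under this reading, a predicted box counts as correct only when `iou(pred, truth) >= 0.5`.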

Here is a figure that makes the differences between the tasks easier to see:

Computer Vision Tasks

Classification and localization can be handled together. Another figure helps build intuition:

Classification + Localization

So how do we do both at once? We can treat localization as a regression problem, as the figure below shows:

Localization as Regression
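One way to read "localization as regression": the network gets a second output head that predicts four box numbers, trained with an L2 loss against the ground-truth box, alongside the usual softmax classification loss. The sketch below assumes a four-number box encoding and a simple weighted sum of the two losses. The function names and the `box_weight` parameter are illustrative and are not taken from the course code.

```python
import numpy as np


def softmax_loss(scores, y):
    """Mean cross-entropy loss. scores: (N, C), y: (N,) integer labels."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()


def l2_box_loss(pred_boxes, true_boxes):
    """Mean squared L2 distance between predicted and true boxes, both (N, 4)."""
    return np.mean(np.sum((pred_boxes - true_boxes) ** 2, axis=1))


def classification_plus_localization_loss(scores, y, pred_boxes, true_boxes,
                                          box_weight=1.0):
    """Sum of the classification loss and the weighted box regression loss."""
    return softmax_loss(scores, y) + box_weight * l2_box_loss(pred_boxes, true_boxes)


scores = np.zeros((2, 3))                       # uniform class scores
labels = np.array([0, 1])
true_boxes = np.array([[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 2.0, 2.0]])
pred_boxes = true_boxes + 0.1                   # slightly off predictions
print(classification_plus_localization_loss(scores, labels, pred_boxes, true_boxes))
```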

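Can object detection also be treated as a regression problem? The number of objects differs from image to image, so the number of coordinates to predict differs as well, which makes regression a poor fit. Another idea is to treat detection as a classification problem, but then the classifier has to be run many times at different positions, which is very time-consuming.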

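For object detection, R-CNN is the classic deep learning approach, and its ideas have set the direction of the field over the past two years. Here is a brief outline of the R-CNN algorithm: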

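Given an input image, we first generate about 2,000 candidate object boxes (region proposals), then use a CNN to extract a 4096-dimensional feature vector from the image patch inside each box, and finally use an SVM to classify the object in each box.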

R-CNN
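The three stages above (propose boxes, extract CNN features, classify with an SVM) can be sketched as a small pipeline. Everything below is a placeholder skeleton: `propose_regions` and `extract_features` are hypothetical callables standing in for the region proposal method and the CNN, and the linear SVM is reduced to a weight matrix and a bias. Real systems add further steps that are left out here.

```python
import numpy as np


def rcnn_detect(image, propose_regions, extract_features, svm_weights, svm_bias):
    """Toy R-CNN style pipeline.

    propose_regions(image)  -> list of (x1, y1, x2, y2) boxes   (hypothetical)
    extract_features(patch) -> 1-D feature vector               (hypothetical)
    svm_weights: (num_classes, feature_dim), svm_bias: (num_classes,)
    """
    detections = []
    for (x1, y1, x2, y2) in propose_regions(image):
        patch = image[y1:y2, x1:x2]                  # crop the candidate box
        feat = extract_features(patch)               # stands in for the CNN
        scores = svm_weights.dot(feat) + svm_bias    # linear SVM class scores
        cls = int(np.argmax(scores))
        detections.append(((x1, y1, x2, y2), cls, float(scores[cls])))
    return detections


# Toy stand-ins: an 8x8 image with a bright top-left square
image = np.zeros((8, 8))
image[:4, :4] = 1.0
boxes = lambda img: [(0, 0, 4, 4), (4, 4, 8, 8)]
features = lambda patch: np.array([patch.mean(), 1.0 - patch.mean()])
detections = rcnn_detect(image, boxes, features, np.eye(2), np.zeros(2))
for box, cls, score in detections:
    print(box, cls, score)
```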

Here is how object detection has evolved over the past two years: R-CNN ---> SPP-net ---> Fast R-CNN ---> Faster R-CNN

Recurrent Neural Networks

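RNNs are mainly used to process sequence data. In a traditional neural network, data flows from the input layer to the hidden layers and then to the output layer. Adjacent layers are fully connected, but the nodes within a layer are not connected to each other. Such a plain network is powerless for many problems. For example, to predict the next word in a sentence you generally need the preceding words, because the words in a sentence are not independent of each other. RNNs are called recurrent because the current output of a sequence also depends on the earlier outputs. Concretely, the network remembers earlier information and uses it when computing the current output: the hidden-layer nodes are now connected across time steps, and the input to the hidden layer includes not only the output of the input layer but also the hidden layer's own output from the previous time step. The figure below shows a typical RNN (the right side is a simplified view of the left):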

RNNs

To explain the RNN more clearly, we can unroll the network over time:

A RNN and the unfolding in time of the computation involved in its forward computation.

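An RNN uses the same parameters (U, W, V) at every time step. In general, the input and output differ from one time step to the next: for sequence data, the items are fed in one at a time, and each item has its own output (for example, the next item in the sequence).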
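Parameter sharing is easiest to see in a loop that applies the same three matrices at every step. The sketch below assumes the common formulation s_t = tanh(U x_t + W s_{t-1}) and o_t = V s_t, with column vectors and no bias. This form is an assumption made for illustration, and `rnn_unrolled` is not a function from the assignment.

```python
import numpy as np


def rnn_unrolled(xs, s_init, U, W, V):
    """Run a simple RNN over a list of input vectors with shared U, W, V."""
    s = s_init
    outputs = []
    for x in xs:                          # one loop iteration per time step
        s = np.tanh(U.dot(x) + W.dot(s))  # same U and W at every step
        outputs.append(V.dot(s))          # same V at every step
    return outputs, s


rng = np.random.RandomState(0)
D, H, O, T = 3, 4, 2, 5                   # input, hidden, output sizes; steps
U, W, V = rng.randn(H, D), rng.randn(H, H), rng.randn(O, H)
xs = [rng.randn(D) for _ in range(T)]
outputs, s_last = rnn_unrolled(xs, np.zeros(H), U, W, V)
print(len(outputs), outputs[0].shape, s_last.shape)
```

However long the sequence is, the number of parameters stays fixed; only the number of loop iterations grows.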

1. Backpropagation algorithm

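The forward pass of an RNN is computed step by step in time order. The backward pass starts at the last time step and propagates the accumulated error back through time. The forward and backward formulas are given below: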

BPTT
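For reference, the two passes can also be written out in the notation of the rnn_layers.py code further below (row-vector convention, weights W_x and W_h, bias b). This is a sketch consistent with that code rather than a transcription of the figure. Time steps run over t = 1, ..., T, h_0 is the initial hidden state, dh_t is the gradient that reaches h_t directly from the loss at step t, and the recursion starts from \delta_{T+1} = 0.

```latex
\begin{aligned}
\text{forward:}\quad  & h_t = \tanh\!\left(x_t W_x + h_{t-1} W_h + b\right) \\
\text{backward:}\quad & \delta_t = \left(dh_t + \delta_{t+1} W_h^{\top}\right) \odot \left(1 - h_t^{2}\right) \\
                      & dx_t = \delta_t W_x^{\top}, \qquad dh_0 = \delta_1 W_h^{\top} \\
                      & dW_x = \sum_{t=1}^{T} x_t^{\top} \delta_t, \qquad
                        dW_h = \sum_{t=1}^{T} h_{t-1}^{\top} \delta_t, \qquad
                        db = \sum_{t=1}^{T} \sum_{n=1}^{N} \delta_t^{(n)}
\end{aligned}
```

The sums over t appear because the same W_x, W_h, and b are used at every time step, so their gradients accumulate across steps.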

2. Image captioning

As the name suggests, given an image, the model automatically generates a text description. Like this:

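In Assignment3, we will use a CNN plus an RNN to implement automatic image captioning, using the Microsoft COCO dataset (http://mscoco.org/). How do we put the architecture together? Here is a figure to build intuition first: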

CNN + RNN

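In the figure above, a CNN extracts features from the input image. The extracted features are then used as the initial state of the RNN's hidden layer (equivalent to the hidden layer's output at t = -1) and fed into the hidden layer at the first time step (t = 0). At each time step, the RNN outputs the item that follows the current input item (for example, given the input "straw", it outputs "hat").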
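The "output is the next item" idea amounts to shifting each caption by one position to form input and target sequences, which is what the `loss` code in rnn.py further below does with `captions_in` and `captions_out`. Here is a small illustration with a toy vocabulary; the vocabulary and the special tokens `<NULL>`, `<START>`, and `<END>` are assumptions made for this example.

```python
import numpy as np

# Hypothetical toy vocabulary, for illustration only
vocab = ['<NULL>', '<START>', '<END>', 'a', 'straw', 'hat']
word_to_idx = {w: i for i, w in enumerate(vocab)}

caption = ['<START>', 'a', 'straw', 'hat', '<END>', '<NULL>']
captions = np.array([[word_to_idx[w] for w in caption]])    # shape (N=1, T+1)

captions_in = captions[:, :-1]     # what the RNN reads at each time step
captions_out = captions[:, 1:]     # what it should predict at each time step
mask = captions_out != word_to_idx['<NULL>']   # padding does not count

for t in range(captions_in.shape[1]):
    status = 'counted' if mask[0, t] else 'ignored'
    print(vocab[captions_in[0, t]], '->', vocab[captions_out[0, t]], status)
```

The mask keeps padded positions out of the loss, so captions of different lengths can share one batch.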

Below is a more detailed diagram to help with the first part of Assignment3, that is, the tasks in RNN_Captioning.ipynb:

Image Captioning by CNN+RNN

3. Python programming task (RNN)

In this part we need to complete the following programming tasks (it is also worth understanding captioning_solver.py):

--> rnn_layers.py, everything except the LSTM part
--> rnn.py

The code is as follows:

--> rnn_layers.py

__coauthor__ = 'Deeplayer'
# 8.13.2016 #

import numpy as np


def rnn_step_forward(x, prev_h, Wx, Wh, b):
    # One time step of a vanilla RNN with tanh activation
    next_h = np.tanh(x.dot(Wx) + prev_h.dot(Wh) + b)
    cache = (x, Wx, Wh, prev_h, next_h)
    return next_h, cache


def rnn_step_backward(dnext_h, cache):
    x, Wx, Wh, prev_h, next_h = cache
    dtanh = 1 - next_h ** 2                     # (N, H)
    dx = (dnext_h * dtanh).dot(Wx.T)            # (N, D)
    dprev_h = (dnext_h * dtanh).dot(Wh.T)       # (N, H)
    dWx = x.T.dot(dnext_h * dtanh)              # (D, H)
    dWh = prev_h.T.dot(dnext_h * dtanh)         # (H, H)
    db = np.sum((dnext_h * dtanh), axis=0)
    return dx, dprev_h, dWx, dWh, db


def rnn_forward(x, h0, Wx, Wh, b):
    N, T, D = x.shape
    _, H = h0.shape
    h = np.zeros((N, T, H))
    h_interm = h0
    cache = []
    for i in range(T):
        h[:, i, :], cache_sub = rnn_step_forward(x[:, i, :], h_interm, Wx, Wh, b)
        h_interm = h[:, i, :]
        cache.append(cache_sub)
    return h, cache


def rnn_backward(dh, cache):
    x, Wx, Wh, prev_h, next_h = cache[-1]
    _, D = x.shape
    N, T, H = dh.shape
    dx = np.zeros((N, T, D))
    dh0 = np.zeros((N, H))
    dWx = np.zeros((D, H))
    dWh = np.zeros((H, H))
    db = np.zeros(H)
    dprev_h_ = np.zeros((N, H))
    # Walk backward in time; note that cache.pop() consumes the cache list
    for i in range(T-1, -1, -1):
        dx_, dprev_h_, dWx_, dWh_, db_ = rnn_step_backward(dh[:, i, :] + dprev_h_, cache.pop())
        dx[:, i, :] = dx_
        dh0 = dprev_h_
        dWx += dWx_
        dWh += dWh_
        db += db_
    return dx, dh0, dWx, dWh, db


def word_embedding_forward(x, W):
    N, T = x.shape
    V, D = W.shape
    out = np.zeros((N, T, D))
    for n in range(N):
        for t in range(T):
            out[n, t, :] = W[x[n, t]]
    cache = (x, W)
    return out, cache


def word_embedding_backward(dout, cache):
    x, W = cache
    N, T, D = dout.shape
    dW = np.zeros(W.shape)
    for n in range(N):
        for t in range(T):
            dW[x[n, t]] += dout[n, t, :]
    return dW

--> rnn.py

__coauthor__ = 'Deeplayer'
# 8.13.2016 #

# The two functions below are methods of the captioning model class in rnn.py.

def loss(self, features, captions):
    captions_in = captions[:, :-1]
    captions_out = captions[:, 1:]
    mask = (captions_out != self._null)
    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
    W_embed = self.params['W_embed']
    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']
    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']
    loss, grads = 0.0, {}

    # forward pass
    imf2hid = features.dot(W_proj) + b_proj     # initial hidden state: [N, H]
    word_vectors, word_cache = word_embedding_forward(captions_in, W_embed)     # [N, T, W]
    if self.cell_type == 'rnn':
        hidden, rnn_cache = rnn_forward(word_vectors, imf2hid, Wx, Wh, b)       # [N, T, H]
    else:
        hidden, lstm_cache = lstm_forward(word_vectors, imf2hid, Wx, Wh, b)
    scores, h2v_cache = temporal_affine_forward(hidden, W_vocab, b_vocab)       # [N, T, V]
    loss, dscores = temporal_softmax_loss(scores, captions_out, mask)

    # backward pass
    dhidden, grads['W_vocab'], grads['b_vocab'] = temporal_affine_backward(dscores, h2v_cache)
    if self.cell_type == 'rnn':
        dword_vectors, dimf2hid, grads['Wx'], grads['Wh'], grads['b'] = rnn_backward(dhidden, rnn_cache)
    else:
        dword_vectors, dimf2hid, grads['Wx'], grads['Wh'], grads['b'] = lstm_backward(dhidden, lstm_cache)
    grads['W_embed'] = word_embedding_backward(dword_vectors, word_cache)
    grads['W_proj'] = features.T.dot(dimf2hid)
    grads['b_proj'] = np.sum(dimf2hid, axis=0)

    return loss, grads


def sample(self, features, max_length=30):
    N = features.shape[0]
    captions = self._null * np.ones((N, max_length), dtype=np.int32)    # [N, max_length]

    # Unpack parameters
    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
    W_embed = self.params['W_embed']                                    # [V, W]
    V, W = W_embed.shape
    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']
    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']   # [H, V]

    h = features.dot(W_proj) + b_proj                                   # [N, H]
    c = np.zeros(h.shape)
    init_word = np.repeat(self._start, N)
    captions[:, 0] = init_word

    # Greedy decoding: feed the previous word back in and take the argmax
    for i in range(1, max_length):
        onehots = np.eye(V)[captions[:, i-1]]                           # [N, V]
        word_vectors = onehots.dot(W_embed)                             # [N, W]
        if self.cell_type == 'rnn':
            h, cache = rnn_step_forward(word_vectors, h, Wx, Wh, b)
        else:
            h, c, cache = lstm_step_forward(word_vectors, h, c, Wx, Wh, b)
        scores = h.dot(W_vocab) + b_vocab                               # [N, V]
        captions[:, i] = np.argmax(scores, axis=1)

    return captions

Long Short-Term Memory Networks(LSTM Network)

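The RNN described above has one problem: it cannot handle long-term dependencies. As the sequence grows long, the link between earlier and later information becomes weaker and weaker until it disappears, which is the so-called vanishing gradient phenomenon. The LSTM, a special RNN architecture, addresses the long-term dependency problem. It was proposed by Hochreiter & Schmidhuber in 1997, and many improved versions have followed.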
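The vanishing effect can be observed numerically by pushing a gradient backward through many tanh RNN steps and tracking its norm. The sketch below is a toy demonstration, not part of the assignment: the recurrent matrix is deliberately built as 0.5 times an orthogonal matrix so that each backward step shrinks the gradient by at least half, and `backprop_norms` is a name invented for this example.

```python
import numpy as np


def backprop_norms(Wh, T, seed=0):
    """Norm of the hidden-state gradient after each backward step of a tanh RNN."""
    rng = np.random.RandomState(seed)
    H = Wh.shape[0]
    h = np.zeros(H)
    hs = []
    for _ in range(T):                     # forward pass with random inputs
        h = np.tanh(rng.randn(H) + h.dot(Wh))
        hs.append(h)
    grad = np.ones(H)                      # gradient arriving at the last step
    norms = []
    for h in reversed(hs):                 # same rule as dprev_h in a tanh RNN step
        grad = (grad * (1 - h ** 2)).dot(Wh.T)
        norms.append(np.linalg.norm(grad))
    return norms


H = 20
Q, _ = np.linalg.qr(np.random.RandomState(1).randn(H, H))   # orthogonal matrix
Wh = 0.5 * Q                                                # largest singular value 0.5
norms = backprop_norms(Wh, T=30)
print(norms[0], norms[-1])   # the second number is many orders of magnitude smaller
```

With a recurrent matrix whose singular values are large, the same product can blow up instead, which is the related exploding gradient problem.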

Below we introduce the standard LSTM, which is also the structure used in the assignment. The material draws on Colah's blog post (http://colah.github.io/posts/2015-08-Understanding-LSTMs/).

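Like the RNN, the LSTM repeats the same module along the time sequence, but each LSTM module is more complex than its RNN counterpart and contains four layers (three gates plus one tanh layer that produces candidate memory values). The horizontal line running along the top of the box in the figure below is called the cell state. The LSTM uses its gates to modify the information on the cell state linearly, which helps keep the link between earlier and later information from decaying as the sequence grows long.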

The repeating module in an LSTM

The three gates are described below:

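Forget gate: controlled by a sigmoid. Based on the previous output ht-1 and the current input xt, it produces a value ft between 0 and 1 that decides how much of the information carried over from the previous time step, Ct-1, is let through (that is, how much of Ct-1 is forgotten).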

Forget gate

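Input gate: a sigmoid decides which values are written into the cell state. The values themselves are generated by a tanh layer and are called the candidate values Ct (written with a tilde on top).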

Input gate

Now we update the cell state (dropping information that is no longer needed and adding new information), as shown below:

Updating cell state

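Output gate: a sigmoid layer decides which parts of the cell state will be output. We then pass the current cell state through tanh and multiply it by the output of the sigmoid layer, which yields ht, the part of the information we decided to output.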

Output gate
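Put together, the three gates and the cell update can be summarized in the notation of the lstm_step_forward code further below (row-vector convention, with the four pre-activations stacked into one matrix a of width 4H). Here g plays the role of the candidate values, c_t is the cell state, and \sigma is the sigmoid; the correspondence to the symbols used in the figures is an interpretation.

```latex
\begin{aligned}
a &= x_t W_x + h_{t-1} W_h + b, \qquad a = [\,a_i \;\; a_f \;\; a_o \;\; a_g\,] \\
i &= \sigma(a_i), \quad f = \sigma(a_f), \quad o = \sigma(a_o), \quad g = \tanh(a_g) \\
c_t &= f \odot c_{t-1} + i \odot g \\
h_t &= o \odot \tanh(c_t)
\end{aligned}
```

The cell update in the third line is additive: when f is close to 1, c_{t-1} passes through almost unchanged, which is what lets information and gradients survive over many steps.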

What we have covered so far is the standard LSTM. The LSTM also has many variants, and we introduce a few of them here.

The first, proposed by Gers & Schmidhuber in 2000, adds "peephole connections" to the LSTM. The main change is that the three gate layers also receive the cell state as input.

Peephole Connection

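Another variant couples the forget and input gates, so that forgetting and inputting happen together. As the formula in the figure below shows, new information is written only into the parts that are being forgotten.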

Coupled forget & input gate
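In code, coupling the two gates means that the input gate is no longer computed separately but is taken as 1 - f. The sketch below shows only the cell-state update under that assumption; `coupled_cell_update` and its arguments (`af` and `ag` for the forget-gate and candidate pre-activations) are names invented for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def coupled_cell_update(prev_c, af, ag):
    """Cell update with a coupled forget/input gate: the input gate is 1 - f."""
    f = sigmoid(af)                  # forget gate
    g = np.tanh(ag)                  # candidate values
    return f * prev_c + (1.0 - f) * g


prev_c = np.array([[2.0, -1.0]])
# A forget gate of exactly 0.5 averages the old state and the candidate
print(coupled_cell_update(prev_c, np.zeros((1, 2)), np.zeros((1, 2))))
```

Compared with the standard update `f * prev_c + i * g`, this variant needs one gate fewer and keeps each cell entry a convex combination of the old value and the candidate.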

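Another variant is the Gated Recurrent Unit (GRU), proposed by Cho, et al. in 2014. It merges the forget and input gates into a single update gate. It also merges the cell state and the hidden state and makes a few other changes. The resulting model is simpler than the standard LSTM.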

GRU
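For comparison with the LSTM step, a GRU forward step can be sketched in the same style. The version below follows the formulation in Colah's post (update gate z, reset gate r, candidate state, then an interpolation between the old state and the candidate). The weight layout is an assumption made here: the columns of `Wx`, `Wh`, and `b` are stacked as update gate, reset gate, candidate, a bias is added, and `gru_step_forward` is not part of the assignment.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step_forward(x, prev_h, Wx, Wh, b):
    """One GRU step. x: (N, D), prev_h: (N, H), Wx: (D, 3H), Wh: (H, 3H), b: (3H,)."""
    H = prev_h.shape[1]
    # Update gate z and reset gate r share the first 2H columns
    a = x.dot(Wx[:, :2*H]) + prev_h.dot(Wh[:, :2*H]) + b[:2*H]
    z, r = sigmoid(a[:, :H]), sigmoid(a[:, H:])
    # Candidate state: the reset gate scales how much of prev_h is used
    h_tilde = np.tanh(x.dot(Wx[:, 2*H:]) + (r * prev_h).dot(Wh[:, 2*H:]) + b[2*H:])
    # Single hidden state: interpolate between the old state and the candidate
    return (1 - z) * prev_h + z * h_tilde


rng = np.random.RandomState(0)
N, D, H = 2, 3, 4
next_h = gru_step_forward(rng.randn(N, D), rng.randn(N, H),
                          rng.randn(D, 3*H), rng.randn(H, 3*H), np.zeros(3*H))
print(next_h.shape)
```

There is no separate cell state here: the hidden state is the only thing carried from step to step.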

Before giving the LSTM code, here is the model structure diagram for image captioning with a standard LSTM:

Image Captioning by CNN+LSTM

The code is as follows:

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    _, H = prev_h.shape
    a = x.dot(Wx) + prev_h.dot(Wh) + b       # (N, 4H)
    ai, af, ao, ag = a[:, 0:H], a[:, H:2*H], a[:, 2*H:3*H], a[:, 3*H:]
    i, f, o, g = sigmoid(ai), sigmoid(af), sigmoid(ao), np.tanh(ag)
    next_c = f * prev_c + i * g
    next_h = o * np.tanh(next_c)
    cache = (x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, next_h)
  
    return next_h, next_c, cache


def lstm_step_backward(dnext_h, dnext_c, cache):
    _, H = dnext_h.shape
    x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, next_h = cache
    ai, af, ao, ag = a[:, 0:H], a[:, H:2*H], a[:, 2*H:3*H], a[:, 3*H:]
    # Build a new array instead of using +=, so the caller's dnext_c is not modified
    dnext_c = dnext_c + dnext_h * o * (1 - (np.tanh(next_c))**2)
    do = dnext_h * np.tanh(next_c)
    df = dnext_c * prev_c
    dprev_c = dnext_c * f
    di = dnext_c * g
    dg = dnext_c * i
    dai = di * (sigmoid(ai) * (1-sigmoid(ai)))
    daf = df * (sigmoid(af) * (1-sigmoid(af)))
    dao = do * (sigmoid(ao) * (1-sigmoid(ao)))
    dag = dg * (1 - np.tanh(ag)**2)
    da = np.hstack((dai, daf, dao, dag))               # (N, 4H)
    dx = da.dot(Wx.T)                                  # (N, D)
    dWx = x.T.dot(da)
    dprev_h = da.dot(Wh.T)
    dWh = prev_h.T.dot(da)
    db = np.sum(da, axis=0)

    return dx, dprev_h, dprev_c, dWx, dWh, db

def lstm_forward(x, h0, Wx, Wh, b):
    N, T, D = x.shape
    _, H = h0.shape
    cache = []
    hidden = h0
    h = np.zeros((N, T, H))
    cell = np.zeros((N, H))
    for i in range(T):
        hidden, cell, sub_cache = lstm_step_forward(x[:, i, :], hidden, cell, Wx, Wh, b)
        cache.append(sub_cache)
        h[:, i, :] = hidden

    return h, cache


def lstm_backward(dh, cache):
    x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, next_h = cache[0]
    N, T, H = dh.shape
    N, D = x.shape
    dh0 = np.zeros((N, H))
    db = np.zeros(4*H)
    dWx = np.zeros((D, 4*H))
    dWh = np.zeros((H, 4*H))
    dx = np.zeros((N, T, D))
    dprev_c_ = np.zeros((N, H))
    dprev_h = np.zeros((N, H))
    for i in range(T-1, -1, -1):
        dx_, dprev_h, dprev_c_, dWx_, dWh_, db_ = lstm_step_backward(dh[:, i, :] + dprev_h, dprev_c_, cache.pop())
        dWx += dWx_
        dWh += dWh_
        db += db_
        dx[:, i, :] += dx_
        dh0 = dprev_h
  
    return dx, dh0, dWx, dWh, db
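The LSTM code above calls `sigmoid`, which is not defined in the excerpt. A numerically stable version might look like the following sketch. This is an assumption about what the helper does, not the author's exact code.

```python
import numpy as np


def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    x = np.asarray(x, dtype=float)
    z = np.exp(-np.abs(x))                 # always in (0, 1], so it never overflows
    # For x >= 0 use 1 / (1 + e^-x); for x < 0 use the equivalent e^x / (1 + e^x)
    return np.where(x >= 0, 1.0 / (1.0 + z), z / (1.0 + z))


print(sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```

The naive form `1 / (1 + np.exp(-x))` gives the same values for moderate inputs but triggers overflow warnings for large negative x.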

Here are some results on the validation set:

Some examples on validation set