Machine Learning Exercises in Python, Part 5: Neural Networks

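In this post we'll tackle the handwritten digits data set again, this time using a feed-forward neural network trained with backpropagation. We'll implement unregularized and regularized versions of the neural network cost function and compute the gradients with the backpropagation algorithm. Finally, we'll run the algorithm through an optimizer and evaluate the network's performance on the handwritten digits data set.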

Since the data set is the same one used in the last exercise, we'll reuse the code from last time to load the data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat
%matplotlib inline

data = loadmat('data/ex3data1.mat')
data
{'X': array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 '__globals__': [],
 '__header__':'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Oct 16 13:09:09 2011',
 '__version__':'1.0',
 'y': array([[10],
        [10],
        [10],
        ...,
        [9],
        [9],
        [9]], dtype=uint8)}

We'll need these later and use them often, so let's create some useful variables up front.

X = data['X']
y = data['y']
X.shape, y.shape

((5000, 400), (5000, 1))

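We also need to one-hot encode the labels. One-hot encoding turns a class label n (out of k classes) into a vector of length k in which index n is "hot" (1) and all other entries are zero. scikit-learn has a built-in utility we can use for this.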

from sklearn.preprocessing import OneHotEncoder
# scikit-learn 1.2+ names this argument sparse_output; older releases use sparse=False
encoder = OneHotEncoder(sparse_output=False)
y_onehot = encoder.fit_transform(y)
y_onehot.shape

(5000, 10)

y[0], y_onehot[0, :]

(array([10], dtype=uint8),
 array([0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]))
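To make the encoding above concrete, here is a minimal sketch that builds the same kind of one-hot matrix by hand with NumPy. The three labels and the variable names are made up for illustration; the sketch assumes, as in this data set, that labels run from 1 to 10, so label n lands in column n - 1.

```python
import numpy as np

# Hypothetical mini label set using the same 1..10 convention as the data set
labels = np.array([10, 1, 9])
num_classes = 10

# Label n lights up column n - 1, so label 10 maps to the last column
one_hot = np.zeros((labels.size, num_classes))
one_hot[np.arange(labels.size), labels - 1] = 1.0

print(one_hot[0])  # [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
```

The first row matches the encoder output shown for y[0], where the label 10 maps to the final position.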

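The neural network we'll build for this exercise has an input layer matching the size of our instance data (400 plus the bias unit), a hidden layer with 25 units (26 including the bias unit), and an output layer with 10 units corresponding to our one-hot encoded class labels. The first piece to implement is a cost function that evaluates the loss for a given set of network parameters. The underlying math is easier to follow when the cost function is broken into several smaller pieces. Here are the functions required to compute the cost.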

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_propagate(X, theta1, theta2):
    # X, theta1 and theta2 are expected to be np.matrix, so * is matrix multiplication
    m = X.shape[0]

    a1 = np.insert(X, 0, values=np.ones(m), axis=1)  # prepend the bias column
    z2 = a1 * theta1.T
    a2 = np.insert(sigmoid(z2), 0, values=np.ones(m), axis=1)  # prepend the bias column
    z3 = a2 * theta2.T
    h = sigmoid(z3)

    return a1, z2, a2, z3, h

def cost(params, input_size, hidden_size, num_labels, X, y, learning_rate):
    # learning_rate is unused in this unregularized version
    m = X.shape[0]
    X = np.matrix(X)
    y = np.matrix(y)

    # reshape the parameter array into parameter matrices for each layer
    theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))

    # run the feed-forward pass
    a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)

    # compute the cost
    J = 0
    for i in range(m):
        first_term = np.multiply(-y[i, :], np.log(h[i, :]))
        second_term = np.multiply((1 - y[i, :]), np.log(1 - h[i, :]))
        J += np.sum(first_term - second_term)

    J = J / m

    return J

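We've used the sigmoid function before, so that's not new. The forward-propagate function computes the hypothesis for each training instance given the current parameters (in other words, given the current state of the network and a set of inputs, it computes the output of every layer). The hypothesis matrix (denoted h), which contains the predicted probability for each class, should have the same shape as the one-hot encoding of y. Finally, the cost function runs the forward-propagation step and computes the error between the hypothesis (the predictions) and the true labels for the instances.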

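We can test this quickly to convince ourselves that it works as expected. Seeing the output of the intermediate steps also helps to understand what's going on.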

# initial setup
input_size = 400
hidden_size = 25
num_labels = 10
learning_rate = 1

# randomly initialize a parameter array of the size of the full network's parameters
params = (np.random.random(size=hidden_size * (input_size + 1) + num_labels * (hidden_size + 1)) - 0.5) * 0.25

m = X.shape[0]
X = np.matrix(X)
y = np.matrix(y)

# unravel the parameter array into parameter matrices for each layer
theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))

theta1.shape, theta2.shape

((25, 401), (10, 26))

a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
a1.shape, z2.shape, a2.shape, z3.shape, h.shape

((5000, 401), (5000, 25), (5000, 26), (5000, 10), (5000, 10))
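The shapes above follow directly from the matrix products in the forward pass. As a rough sketch, the same pass can also be written with plain ndarrays and the @ operator instead of np.matrix. This is an alternative formulation, not the article's code, and the function name, tiny layer sizes, and random inputs are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_propagate_arrays(X, theta1, theta2):
    """Same computation as forward_propagate, for plain ndarrays using @."""
    m = X.shape[0]
    a1 = np.hstack([np.ones((m, 1)), X])            # prepend the bias column
    z2 = a1 @ theta1.T
    a2 = np.hstack([np.ones((m, 1)), sigmoid(z2)])  # prepend the bias column
    z3 = a2 @ theta2.T
    h = sigmoid(z3)
    return a1, z2, a2, z3, h

# Hypothetical tiny network: 4 inputs, 3 hidden units, 2 output classes, 5 examples
rng = np.random.default_rng(0)
X_small = rng.random((5, 4))
theta1_small = (rng.random((3, 5)) - 0.5) * 0.25
theta2_small = (rng.random((2, 4)) - 0.5) * 0.25

a1_s, z2_s, a2_s, z3_s, h_s = forward_propagate_arrays(X_small, theta1_small, theta2_small)
print(a1_s.shape, z2_s.shape, a2_s.shape, z3_s.shape, h_s.shape)
# (5, 5) (5, 3) (5, 4) (5, 2) (5, 2)
```

The pattern is the same as in the full-size network: each activation matrix gains one bias column, and each z matrix has one column per unit in the next layer.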

After computing the hypothesis matrix h, the cost function applies the cost equation to compute the total error between y and h.

cost(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)

6.8228086634127862
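The per-example loop in the cost function sums the cross-entropy terms one row at a time. As a sketch under the assumption that h and y are arrays of the same shape, the same quantity can be computed in a single vectorized expression. The function name and the toy inputs below are hypothetical.

```python
import numpy as np

def cross_entropy_cost(h, y):
    """Unregularized cost: per-class log loss summed over classes, averaged over examples."""
    h = np.asarray(h, dtype=float)
    y = np.asarray(y, dtype=float)
    m = y.shape[0]
    return np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h)) / m

# Hypothetical hypotheses for 2 examples and 2 classes, all equal to 0.5
h_demo = np.array([[0.5, 0.5], [0.5, 0.5]])
y_demo = np.array([[1, 0], [0, 1]])

# Each example contributes -log(0.5) - log(0.5) = 2 * log(2)
print(np.isclose(cross_entropy_cost(h_demo, y_demo), 2 * np.log(2)))  # True
```

Because np.asarray converts np.matrix inputs to ordinary arrays, the * here is element-wise regardless of which type is passed in.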

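The next step is to add regularization to the cost function, which adds a penalty term that scales with the magnitude of the parameters. The equation boils down to a single line of code added to the cost function. Just add the following right before the return statement. Note that the argument named learning_rate acts as the regularization strength in this line.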

J += (float(learning_rate) / (2 * m)) * (np.sum(np.power(theta1[:, 1:], 2)) + np.sum(np.power(theta2[:, 1:], 2)))
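A small worked example may help show what this penalty line computes. The sketch below wraps the same expression in a hypothetical helper, with lam standing in for the argument the article calls learning_rate, and uses made-up weights whose bias column is deliberately large to show that it is excluded.

```python
import numpy as np

def regularization_penalty(theta1, theta2, lam, m):
    """L2 penalty over all weights except the bias column (column 0)."""
    squared = np.sum(np.power(theta1[:, 1:], 2)) + np.sum(np.power(theta2[:, 1:], 2))
    return (float(lam) / (2 * m)) * squared

# Hypothetical weights: the first column holds bias weights and is ignored
theta1_demo = np.array([[100.0, 1.0, 2.0]])
theta2_demo = np.array([[100.0, 3.0]])

penalty = regularization_penalty(theta1_demo, theta2_demo, lam=1.0, m=2)
print(penalty)  # (1 / 4) * (1 + 4 + 9) = 3.5
```

The helper uses np.power rather than ** so that it keeps squaring element-wise when given np.matrix inputs, for which ** means a matrix power.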

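Next up is the backpropagation algorithm. Backpropagation computes the parameter updates that reduce the network's error on the training data. The first thing we need is a function that computes the gradient of the sigmoid function we created earlier.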

def sigmoid_gradient(z): 
    return np.multiply(sigmoid(z), (1 - sigmoid(z)))
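One way to sanity-check this derivative is to compare it with a central finite-difference estimate. The sketch below repeats the two functions so that it runs on its own; the test points and the step size eps are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_gradient(z):
    return np.multiply(sigmoid(z), (1 - sigmoid(z)))

z = np.array([-2.0, 0.0, 2.0])
eps = 1e-5  # assumed step size for the finite-difference estimate

numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_gradient(z)

print(np.allclose(numeric, analytic))  # True
print(analytic[1])                     # 0.25
```

The gradient peaks at 0.25 when z is 0 and shrinks toward zero as z moves away from zero in either direction.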

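Now we're ready to implement backpropagation to compute the gradients. Since the computations required for backpropagation are a superset of those required by the cost function, we'll extend the cost function to also perform backpropagation and return both the cost and the gradients. The reason for not simply calling the existing cost function from within backprop, which would make the design more modular, is that backprop uses several other variables computed inside the cost function. I'll skip ahead to the full implementation, which also includes gradient regularization.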

def backprop(params, input_size, hidden_size, num_labels, X, y, learning_rate):
    ##### this section is identical to the cost function logic we already saw #####
    m = X.shape[0]
    X = np.matrix(X)
    y = np.matrix(y)

    # reshape the parameter array into parameter matrices for each layer
    theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))

    # run the feed-forward pass
    a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)

    # initializations
    J = 0
    delta1 = np.zeros(theta1.shape)  # (25, 401)
    delta2 = np.zeros(theta2.shape)  # (10, 26)

    # compute the cost
    for i in range(m):
        first_term = np.multiply(-y[i, :], np.log(h[i, :]))
        second_term = np.multiply((1 - y[i, :]), np.log(1 - h[i, :]))
        J += np.sum(first_term - second_term)

    J = J / m

    # add the cost regularization term (learning_rate acts as the regularization strength)
    J += (float(learning_rate) / (2 * m)) * (np.sum(np.power(theta1[:, 1:], 2)) + np.sum(np.power(theta2[:, 1:], 2)))

    ##### end of cost function logic, below is the new part #####

    # perform backpropagation, one training example at a time
    for t in range(m):
        a1t = a1[t, :]  # (1, 401)
        z2t = z2[t, :]  # (1, 25)
        a2t = a2[t, :]  # (1, 26)
        ht = h[t, :]  # (1, 10)
        yt = y[t, :]  # (1, 10)

        d3t = ht - yt  # (1, 10)

        z2t = np.insert(z2t, 0, values=np.ones(1))  # (1, 26)
        d2t = np.multiply((theta2.T * d3t.T).T, sigmoid_gradient(z2t))  # (1, 26)

        delta1 = delta1 + (d2t[:, 1:]).T * a1t
        delta2 = delta2 + d3t.T * a2t

    delta1 = delta1 / m
    delta2 = delta2 / m

    # add the gradient regularization term (bias column excluded)
    delta1[:, 1:] = delta1[:, 1:] + (theta1[:, 1:] * learning_rate) / m
    delta2[:, 1:] = delta2[:, 1:] + (theta2[:, 1:] * learning_rate) / m

    # unravel the gradient matrices into a single array
    grad = np.concatenate((np.ravel(delta1), np.ravel(delta2)))

    return J, grad

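The first part of the function computes the error by running the data and the current parameters through the network (the forward-propagate function) and comparing the output to the true labels. The total error across the whole data set is represented as J. This is the part we discussed earlier as the cost function.

The rest of the function essentially answers the question "how can I adjust my parameters to reduce the error the next time I run through the network?" It does this by computing each layer's contribution to the total error and producing a "gradient" matrix (that is, how much to change each parameter and in which direction) used to make the adjustment.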

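The hardest part of the backprop computation is getting the matrix dimensions right. By the way, if you find it confusing when to use A * B versus np.multiply(A, B), you're not alone. With np.matrix operands, A * B is matrix multiplication, while np.multiply(A, B) multiplies element by element.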
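The difference between the two operators is easiest to see on a small example. The sketch below uses two made-up 2x2 matrices and also shows how the meaning of * changes for plain ndarrays, where @ is the matrix product.

```python
import numpy as np

A = np.matrix([[1, 2], [3, 4]])
B = np.matrix([[10, 20], [30, 40]])

matrix_product = A * B           # np.matrix: matrix multiplication -> [[70, 100], [150, 220]]
elementwise = np.multiply(A, B)  # element-by-element product      -> [[10, 40], [90, 160]]

print(matrix_product)
print(elementwise)

# With plain ndarrays, * is element-wise and @ is the matrix product
A_arr, B_arr = np.asarray(A), np.asarray(B)
print(np.array_equal(A_arr * B_arr, elementwise))     # True
print(np.array_equal(A_arr @ B_arr, matrix_product))  # True
```

This is why the article's functions convert their inputs with np.matrix before using *: the same expression on ndarrays would silently compute something else or fail on a shape mismatch.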

Let's test it out to make sure the function returns what we expect.

J, grad = backprop(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)
J, grad.shape

(6.8281541822949299, (10285,))
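Checking the shape confirms the gradient has one entry per parameter, but not that the values are right. A common complementary check, which the original post does not perform, is to compare an analytic gradient against a central-difference estimate. The sketch below shows the idea on a stand-in cost with a known gradient; the helper name, the toy cost, and the step size are all hypothetical.

```python
import numpy as np

def numerical_gradient(cost_fn, params, eps=1e-4):
    """Central-difference estimate of the gradient of cost_fn at params."""
    grad = np.zeros_like(params, dtype=float)
    for i in range(params.size):
        step = np.zeros_like(params, dtype=float)
        step[i] = eps
        grad[i] = (cost_fn(params + step) - cost_fn(params - step)) / (2 * eps)
    return grad

# Stand-in cost with a known gradient: J(p) = sum(p^2), so dJ/dp = 2p
def toy_cost(p):
    return np.sum(p ** 2)

p0 = np.array([0.5, -1.0, 2.0])
print(np.allclose(numerical_gradient(toy_cost, p0), 2 * p0))  # True
```

To apply the same idea to the network, one could pass a wrapper such as lambda p: backprop(p, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)[0] as cost_fn. Each entry needs two full cost evaluations, so in practice this is probably only feasible on a much smaller network or a handful of parameters.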

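Finally, we're ready to train our network and use it to make predictions. This is roughly the same as in the earlier exercise with multi-class logistic regression.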

from scipy.optimize import minimize

# minimize the objective function
fmin = minimize(fun=backprop, x0=params, args=(input_size, hidden_size, num_labels, X, y_onehot, learning_rate),
                method='TNC', jac=True, options={'maxiter': 250})
fmin
status:3
success:False
   nfev:250
    fun:0.33900736818312283
      x: array([-8.85740564e-01,  2.57420350e-04, -4.09396202e-04, ...,
        1.44634791e+00,  1.68974302e+00,  7.10121593e-01])
message:'Max. number of function evaluations reach'
    jac: array([-5.11463703e-04,  5.14840700e-08, -8.18792403e-08, ...,
       -2.48297749e-04, -3.17870911e-04, -3.31404592e-04])
    nit:21

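We put a bound on the number of iterations because the objective function is unlikely to converge completely. Our total cost has dropped below 0.5 though, which is a good indicator that the algorithm is working. Let's take the parameters it found and forward-propagate them through the network to get some predictions. We have to reshape the optimizer's output to match the parameter matrix shapes the network expects, then run forward propagation to generate a hypothesis for the input data.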

X = np.matrix(X)
theta1 = np.matrix(np.reshape(fmin.x[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
theta2 = np.matrix(np.reshape(fmin.x[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))

a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)
y_pred = np.array(np.argmax(h, axis=1) + 1)
y_pred
array([[10],
       [10],
       [10],
       ...,
       [9],
       [9],
       [9]], dtype=int64)
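The reshape step used before this prediction repeats the slicing that also appears inside the cost and backprop functions. As a sketch, that logic could be factored into one helper. The helper name is hypothetical, and it returns plain ndarrays, so the results would need wrapping in np.matrix before being passed to the article's forward_propagate.

```python
import numpy as np

def unpack_params(params, input_size, hidden_size, num_labels):
    """Split a flat parameter vector into the two per-layer weight matrices."""
    split = hidden_size * (input_size + 1)
    theta1 = np.reshape(params[:split], (hidden_size, input_size + 1))
    theta2 = np.reshape(params[split:], (num_labels, hidden_size + 1))
    return theta1, theta2

# Round trip on the article's layer sizes with a dummy parameter vector
flat = np.arange(25 * 401 + 10 * 26, dtype=float)
print(flat.size)  # 10285

t1, t2 = unpack_params(flat, input_size=400, hidden_size=25, num_labels=10)
print(t1.shape, t2.shape)  # (25, 401) (10, 26)

repacked = np.concatenate((np.ravel(t1), np.ravel(t2)))
print(np.array_equal(repacked, flat))  # True
```

The round trip mirrors what the optimizer relies on: backprop flattens the gradients with the same ravel-and-concatenate order that the reshape undoes.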

Finally, we can compute the accuracy to see how well our trained network is doing.

correct = [1 if a == b else 0 for (a, b) in zip(y_pred, y)]
accuracy = sum(map(int, correct)) / float(len(correct))
print('accuracy = {0}%'.format(accuracy * 100))

accuracy = 99.22%
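The list comprehension above compares predictions and labels one pair at a time. A vectorized comparison gives the same fraction in one expression. The sketch below uses four made-up predictions shaped like y_pred and y, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical predictions and labels shaped (m, 1), like y_pred and y in the article
y_pred_demo = np.array([[10], [10], [9], [3]])
y_true_demo = np.array([[10], [9], [9], [3]])

# Fraction of positions where prediction and label agree
accuracy_demo = np.mean(y_pred_demo.ravel() == y_true_demo.ravel())
print('accuracy = {0}%'.format(accuracy_demo * 100))  # accuracy = 75.0%
```

Since y is an np.matrix at this point in the article, converting both operands with np.asarray before calling ravel would keep the comparison one-dimensional.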

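And we're done! We've successfully implemented a basic feed-forward neural network with backpropagation and used it to classify images of handwritten digits.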

This article is a translation of a post by John Wittenauer. Original URL:

http://www.johnwittenauer.net/machine-learning-exercises-in-python-part-5/