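Machine Learning Experiment: The Backpropagation Algorithm

I recently studied the backpropagation algorithm, and this post records the experiment I did with it.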

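Theory

What Is Backpropagation

The "backward" in backpropagation is the counterpart of a neural network's forward propagation: forward propagation runs from the input layer to the output layer, and backpropagation runs from the output layer back to the input layer.

Forward propagation takes the input (input layer) through a series of weighted sums (z = wx + b) and nonlinear transformations (activation functions) to produce the output (output layer). The weights and biases used along the way have to be learned. At the start both are just randomly initialized values. Training then uses the difference (the loss) between the training-set labels and the predictions made with the current parameters to adjust the weights and biases so that later predictions become more accurate. Backpropagation is the step that works out how each weight and bias should change, and that is its purpose.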

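The Math Behind Backpropagation

Backpropagation is built on the chain rule from calculus. Its core is computing the gradient of the loss function with respect to every weight and bias, and then using those gradients to update the parameters so that the network's predictions move closer to the true values.

Suppose we have a multi-layer neural network with loss function $L$, which measures the difference between the network's predictions and the true labels. Denote the weight matrix of layer $l$ by $W^{l}$ and its bias vector by $b^{l}$. The algorithm consists of the following core steps:

Forward propagation: starting from the input layer, compute the output of each layer in turn. For layer $l$, first compute the weighted sum $z^{l}$: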

$z^{l} = W^{l}a^{l - 1}+b^{l}$

where $a^{l - 1}$ is the activation of the previous layer. Then pass $z^{l}$ through the activation function $\sigma$ to obtain the layer's activation $a^{l}$:

$a^{l}=\sigma(z^{l})$

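Repeat this layer by layer until the output layer is reached.

Why forward propagation is needed: it records the weighted input and the activation of every layer, which the backward pass needs in order to compute gradients.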
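The forward pass described above can be sketched in NumPy as follows. This is a minimal illustration rather than the experiment's code: the helper name `forward_pass`, the layer sizes and the random seed are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(weights, biases, x):
    """Return every weighted input z^l and every activation a^l (hypothetical helper).

    weights[i] has shape (n_out, n_in), biases[i] has shape (n_out, 1),
    and x is a column vector of shape (n_in, 1).
    """
    activations = [x]   # a^0 is the input itself
    zs = []
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b       # z^l = W^l a^{l-1} + b^l
        a = sigmoid(z)      # a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations

rng = np.random.default_rng(0)   # fixed seed, only for reproducibility
sizes = [4, 10, 3]               # assumed layer sizes for the illustration
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]
x = rng.standard_normal((4, 1))

zs, activations = forward_pass(weights, biases, x)
print([a.shape for a in activations])  # [(4, 1), (10, 1), (3, 1)]
```

Keeping both lists is the point of the exercise: the backward pass reuses the stored values instead of recomputing them.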

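Computing the loss: evaluate the loss function $L$. A common choice is the mean squared error (MSE):

$L=\frac{1}{2m}\sum_{i=1}^{m}\left(y^{(i)}-a^{L,(i)}\right)^2$

where $m$ is the number of samples, $y^{(i)}$ is the true label of the $i$-th sample, and $a^{L,(i)}$ is the network's prediction for that sample. (In the superscript of $a^{L}$, $L$ is the index of the output layer, not the loss.)
This looks cumbersome, but in practice the algorithm only needs the partial derivative of $L$ with respect to $a^{L}$, which for a single sample is

$\frac{\partial L}{\partial a^{L}}=a^{L}-y$

which is very simple.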
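As a quick sanity check, the derivative of the single-sample squared error with respect to the prediction $a^{L}$ can be compared against a central finite difference. The sketch below uses made-up values for the output and the label, and the function names are only for illustration.

```python
import numpy as np

def mse_loss(a, y):
    """Half squared error for one sample: 0.5 * sum((y - a)^2)."""
    return 0.5 * np.sum((y - a) ** 2)

def mse_grad(a, y):
    """Analytic derivative of mse_loss with respect to a."""
    return a - y

a = np.array([[0.8], [0.1], [0.3]])   # made-up network output
y = np.array([[1.0], [0.0], [0.0]])   # one-hot label

eps = 1e-6
numeric = np.zeros_like(a)
for i in range(a.shape[0]):
    step = np.zeros_like(a)
    step[i] = eps
    # central difference approximation of dL/da_i
    numeric[i] = (mse_loss(a + step, y) - mse_loss(a - step, y)) / (2 * eps)

print(np.allclose(numeric, mse_grad(a, y)))  # True
```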

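Alternatively, one can use the cross-entropy loss. For a single sample it is

$L=-\left[y\ln a^{L}+(1-y)\ln(1-a^{L})\right]$

and its partial derivative with respect to $a^{L}$ is

$\frac{\partial L}{\partial a^{L}}=-\frac{y}{a^{L}}+\frac{1-y}{1-a^{L}}$

where $y$ is the true label of the sample and $a^{L}$ is the network's prediction.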
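One reason cross-entropy pairs well with a sigmoid output: multiplying the derivative above by $\sigma'(z^{L})=a^{L}(1-a^{L})$ collapses to $a^{L}-y$. The sketch below checks this numerically; the weighted inputs and the label are made-up values, and it assumes the output activation is the sigmoid.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_grad(a, y):
    """Derivative of -[y ln a + (1 - y) ln(1 - a)] with respect to a."""
    return -y / a + (1.0 - y) / (1.0 - a)

z = np.array([[0.5], [-1.2], [2.0]])   # made-up weighted inputs of the output layer
y = np.array([[1.0], [0.0], [0.0]])    # one-hot label
a = sigmoid(z)

# chain rule: dL/dz = dL/da * sigma'(z), with sigma'(z) = a * (1 - a)
delta = cross_entropy_grad(a, y) * a * (1.0 - a)

print(np.allclose(delta, a - y))  # True
```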

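Backpropagation: starting from the output layer, work backwards layer by layer and compute the gradient of the loss with respect to each layer's weights and biases, that is, $\frac{\partial L}{\partial W^{l}}$ and $\frac{\partial L}{\partial b^{l}}$. As an example, we derive the backward pass with the mean squared error as the loss function and the sigmoid as the activation function: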

$\frac{\partial L}{\partial z^{L}}=\frac{\partial L}{\partial a^{L}}\frac{\partial a^{L}}{\partial z^{L}}=\frac{\partial L}{\partial a^{L}}\sigma'(z^{L})=(a^{L}-y)\sigma'(z^{L})$

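It follows that

$\frac{\partial L}{\partial W^{L}}=\frac{\partial L}{\partial z^{L}}\frac{\partial z^{L}}{\partial W^{L}}=(a^{L}-y)\sigma'(z^{L})\,(a^{L-1})^{T}$

$\frac{\partial L}{\partial b^{L}}=\frac{\partial L}{\partial z^{L}}\frac{\partial z^{L}}{\partial b^{L}}=(a^{L}-y)\sigma'(z^{L})$

Here the product with $\sigma'(z^{L})$ is element-wise. For a hidden layer $l<L$, writing $\delta^{l}=\frac{\partial L}{\partial z^{l}}$, the chain rule gives the recursion

$\delta^{l}=\left((W^{l+1})^{T}\delta^{l+1}\right)\odot\sigma'(z^{l})$

with $\frac{\partial L}{\partial W^{l}}=\delta^{l}(a^{l-1})^{T}$ and $\frac{\partial L}{\partial b^{l}}=\delta^{l}$.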
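A common way to gain confidence in gradients like those derived above is a numerical gradient check. The sketch below is a self-contained illustration, assuming a small sigmoid network with a half-squared-error loss; the sizes, seed and function names are made up for the example and it is not the experiment's code.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def loss(weights, biases, x, y):
    """Half squared error of the network output for one sample."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return 0.5 * np.sum((a - y) ** 2)

def backprop(weights, biases, x, y):
    """Gradients of the loss with respect to every weight matrix and bias vector."""
    activations, zs = [x], []
    for W, b in zip(weights, biases):
        zs.append(W @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)
    # output layer: delta = (a - y) * sigma'(z), with sigma'(z) = a * (1 - a)
    delta = (activations[-1] - y) * activations[-1] * (1 - activations[-1])
    for l in range(len(weights) - 1, -1, -1):
        grads_W[l] = delta @ activations[l].T   # delta times previous activation, transposed
        grads_b[l] = delta
        if l > 0:
            a_prev = activations[l]
            # propagate the error one layer back
            delta = (weights[l].T @ delta) * a_prev * (1 - a_prev)
    return grads_W, grads_b

rng = np.random.default_rng(1)   # fixed seed, only for reproducibility
sizes = [4, 5, 3]                # assumed layer sizes
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]
x = rng.standard_normal((4, 1))
y = np.array([[0.0], [1.0], [0.0]])

grads_W, grads_b = backprop(weights, biases, x, y)

eps = 1e-6
max_diff = 0.0
for l, W in enumerate(weights):
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        plus = loss(weights, biases, x, y)
        W[idx] = old - eps
        minus = loss(weights, biases, x, y)
        W[idx] = old   # restore the weight
        numeric = (plus - minus) / (2 * eps)
        max_diff = max(max_diff, abs(numeric - grads_W[l][idx]))

print("largest difference between analytic and numeric gradients:", max_diff)
```

If the analytic gradients are right, the reported difference should be tiny compared with the gradient values themselves.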

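Parameter update: use the computed gradients with an optimization algorithm (such as stochastic gradient descent) to update the weights and biases. The learning rate $\eta$ controls the step size of each update:

$W^{l} \gets W^{l}-\eta\frac{\partial L}{\partial W^{l}}$

$b^{l} \gets b^{l}-\eta\frac{\partial L}{\partial b^{l}}$

Repeat the steps above until training is finished.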
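To see the update rule at work, the sketch below repeats the forward pass, gradient computation and update on a single sigmoid layer and one training sample. All values (input, label, learning rate, seed, number of steps) are arbitrary choices for the illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)   # fixed seed, only for reproducibility
W = rng.standard_normal((3, 4))
b = rng.standard_normal((3, 1))
x = np.array([[0.2], [0.7], [0.1], [0.9]])   # made-up input
y = np.array([[0.0], [0.0], [1.0]])          # made-up one-hot label
eta = 0.5                                    # learning rate, chosen arbitrarily

losses = []
for step in range(200):
    a = sigmoid(W @ x + b)                     # forward pass
    losses.append(0.5 * np.sum((a - y) ** 2))  # half squared error
    delta = (a - y) * a * (1 - a)              # dL/dz for MSE + sigmoid
    W -= eta * (delta @ x.T)                   # W <- W - eta * dL/dW
    b -= eta * delta                           # b <- b - eta * dL/db

print(f"loss before: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

The loss after the loop should be lower than at the start, which is all the update rule promises for a sufficiently small learning rate.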

The Experiment

Environment

Python 3.10.12
Only external library: numpy
Operating system: Linux, Ubuntu 22.04.3 LTS

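Task

This experiment is an assignment for my university's artificial intelligence course. It asks us to implement the backpropagation algorithm and train an iris classification model. The original task statement follows:

Using the iris dataset (75 training samples, 75 test samples), build a neural network with 4 inputs, 10 hidden units and 3 outputs, run it independently 10 times, and compute the mean accuracy and standard deviation. The steps are:

1. Load the iris dataset and split it into a training set and a test set.
2. Normalize the input features of the training set and the test set.
3. Initialize the network's weights and biases.
4. Train the network for at least 500 epochs.
5. Predict on the test set and compute the accuracy.
6. Run the steps above independently 10 times and compute the mean accuracy and standard deviation.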
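Step 2 can be done in more than one way. One common option is per-feature min-max scaling, with the minimum and range taken from the training set only and then reused for the test set. The sketch below shows that option under the assumption that samples are stored as rows of a 2-D array; the function names and the sample values are made up, and this is not necessarily the variant the assignment intends.

```python
import numpy as np

def minmax_fit(train_X):
    """Per-feature minimum and range, computed on the training set only."""
    lo = train_X.min(axis=0)
    span = train_X.max(axis=0) - lo
    span[span == 0] = 1.0   # avoid division by zero for constant features
    return lo, span

def minmax_apply(X, lo, span):
    """Scale X with previously fitted statistics."""
    return (X - lo) / span

# made-up rows in the usual iris feature order
train_X = np.array([[5.1, 3.5, 1.4, 0.2],
                    [7.0, 3.2, 4.7, 1.4],
                    [6.3, 3.3, 6.0, 2.5]])
test_X = np.array([[5.9, 3.0, 5.1, 1.8]])

lo, span = minmax_fit(train_X)
train_N = minmax_apply(train_X, lo, span)
test_N = minmax_apply(test_X, lo, span)

print(train_N.min(axis=0), train_N.max(axis=0))
```

Test values can fall slightly outside [0, 1] with this approach, because the test set is scaled with the training set's minimum and range.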

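Code

The core backpropagation part has already been explained in detail above; the rest is explained by comments in the code.
The Network class is based on the following GitHub code: https://github.com/unexploredtest/neural-networks-and-deep-learning/blob/master/src/network.py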

import random
import time
import numpy as np

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes  # list with the number of neurons in each layer
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):  # forward propagation
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, train_data, epochs, mini_batch_size, learning_rate):
        # shuffle a copy so the caller's sample order (and reported error indices) stays stable
        train_data = list(train_data)
        n = len(train_data)
        for j in range(epochs):
            if j % 50 == 0:
                start_time = time.time()  # record the start time of each block of 50 epochs
            random.shuffle(train_data)  # shuffle the training samples
            mini_batches = [train_data[k:k+mini_batch_size]
                            for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, learning_rate)

            if (j + 1) % 50 == 0:  # at the end of each block of 50 epochs
                end_time = time.time()  # record the end time
                elapsed_time = end_time - start_time  # total time for the 50 epochs
                print("Epoch {0}-{1} complete in {2:.3f} seconds".format(j - 49, j, elapsed_time))

    def update_mini_batch(self, mini_batch, learning_rate): 
        # update the weights and biases using the averaged gradient of one mini-batch
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(learning_rate/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(learning_rate/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

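######################### Core: backpropagation ##############################
    def backprop(self, x, y):
        """
        Run backpropagation and compute the gradient of the loss with respect to the weights and biases.

        Args:
        x (np.ndarray): feature vector of a single input sample.
        y (np.ndarray): target output vector for that sample.

        Returns:
        tuple: (nabla_b, nabla_w), the bias gradients and the weight gradients.
        """
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # the initial activation is the input sample
        activation = x
        # activations of every layer, starting with the input sample
        activations = [x]
        # weighted inputs of every layer
        zs = []
        # forward pass: compute the weighted input and activation of every layer
        for b, w in zip(self.biases, self.weights):
            # weighted input of the current layer
            z = np.dot(w, activation)+b
            # store it in zs
            zs.append(z)
            # activation of the current layer
            activation = sigmoid(z)
            # store it in activations
            activations.append(activation)
        # error term of the output layer
        delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
        # the bias gradient of the output layer equals its error term
        nabla_b[-1] = delta
        # weight gradient of the output layer
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # backward pass: starting from the second-to-last layer, compute each layer's error term and gradients
        for l in range(2, self.num_layers):
            # weighted input of the current layer
            z = zs[-l]
            # derivative of the activation function at the current layer
            sp = sigmoid_prime(z)
            # error term of the current layer
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            # the bias gradient of the current layer equals its error term
            nabla_b[-l] = delta
            # weight gradient of the current layer
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        # return the bias gradients and the weight gradients
        return (nabla_b, nabla_w)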

#####################################################################


    def evaluate(self, test_data):
        test_results = [(np.argmax(self.feedforward(x)), np.argmax(y)) for (x, y) in test_data]
        correct_count = sum(int(x == y) for (x, y) in test_results)
        error_indices = [i for i, (x, y) in enumerate(test_results) if x != y]
        return correct_count, error_indices

    def cost_derivative(self, output_activations, y):
        # derivative of the mean squared error with respect to the output activations
        return (output_activations-y)

def sigmoid(z):  # value of the sigmoid function at z
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):  # derivative of the sigmoid function at z
    return sigmoid(z)*(1-sigmoid(z))

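def loadData(path):  # read the data file: whitespace-separated features, class index in the last column
    data = []
    with open(path,'r') as file:
        for line in file:
            line = line.strip().split()
            if not line:  # skip blank lines
                continue
            features = np.array([float(x) for x in line[:-1]]).reshape(-1,1)
            label = np.zeros((3,1))  # the output uses one-hot encoding
            label[int(line[-1])] = 1
            data.append((features,label))
    return data

def normalization(data):  # min-max scaling
    # Note: each sample's feature vector is scaled to [0, 1] on its own
    # (per sample, not per feature column across the dataset).
    normalized_data = []
    for i in range(len(data)):
        ndata = (data[i][0] - np.min(data[i][0])) / (np.max(data[i][0]) - np.min(data[i][0]))
        normalized_data.append((ndata, data[i][1]))
    return normalized_data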

###################### Main ###########################

# load the training and test sets, then normalize them
traindata = loadData('Iris-train.txt')
testdata = loadData('Iris-test.txt')

'''
traindata = [
    (
        np.array([[5.1], [3.5], [1.4], [0.2]]), # features of the first sample
        np.array([[1], [0], [0]])  # label of the first sample (one-hot encoded)
    ),
    # other samples...
]
traindata[0]: the first sample in the traindata list, a tuple of features and label.
traindata[0][0]: the features of the first sample in traindata.
traindata[0][1]: the label of the first sample in traindata.
'''

traindata = normalization(traindata)
testdata = normalization(testdata)

trainacc = []
testacc = []
train_error_info = []
test_error_info = []

# train the model
for iteration in range(10):
    model = Network([4, 10, 3])  # 4 input features, one hidden layer of 10 neurons, 3 output classes (4 x 10 x 3)

    model.SGD(traindata, epochs=750, mini_batch_size=10, learning_rate=0.12)

    train_correct, train_error_indices = model.evaluate(traindata)
    trainacc.append(train_correct / len(traindata))
    train_error_info.append((train_error_indices, train_correct, len(traindata)))

    test_correct, test_error_indices = model.evaluate(testdata)
    testacc.append(test_correct / len(testdata))
    test_error_info.append((test_error_indices, test_correct, len(testdata)))

print("Training set accuracy")
for i, acc in enumerate(trainacc):
    print(f"Run {i + 1}: {acc * 100:.2f}%")
    error_indices, correct, total = train_error_info[i]
    print(f"Misclassified indices: {error_indices}")
    print(f"Correct/total: {correct}/{total}")

print("Test set accuracy")
for i, acc in enumerate(testacc):
    print(f"Run {i + 1}: {acc * 100:.2f}%")
    error_indices, correct, total = test_error_info[i]
    print(f"Misclassified indices: {error_indices}")
    print(f"Correct/total: {correct}/{total}")

# mean and standard deviation are both reported in percent
print(f"Mean training accuracy: {np.mean(trainacc) * 100:.2f}%, standard deviation: {np.std(trainacc) * 100:.2f}%")
print(f"Mean test accuracy: {np.mean(testacc) * 100:.2f}%, standard deviation: {np.std(testacc) * 100:.2f}%")