Original article: Training a Convolutional Neural Network from scratch, written on June 7, 2019.

In this post we will take a deeper look at Convolutional Neural Networks (CNNs), focusing on how to train one. You will learn how to derive the gradients, implement the backprop pass (using only numpy), and finally assemble a complete training pipeline. We assume you already have a basic idea of what a CNN is; if not, read the original author's introductory post first. Parts of this post also assume some familiarity with multivariable calculus; a standard university calculus course is more than enough.

1 Setup

Our main task is to use a CNN to solve the MNIST handwritten-digit recognition problem.

Our CNN has one convolutional layer (Conv layer), one Max Pooling layer, and one Softmax layer, as shown in the figure below:

The model takes a 28x28 grayscale MNIST image as input and outputs a 10-dimensional vector, one element per digit.

In code, we write three classes, one per layer: Conv3x3, MaxPool2, and Softmax. Each class implements a forward() method that performs the forward computation for that layer. The code looks like this:

conv = Conv3x3(8)                  # 28x28x1 -> 26x26x8
pool = MaxPool2()                  # 26x26x8 -> 13x13x8
softmax = Softmax(13 * 13 * 8, 10) # 13x13x8 -> 10

def forward(image, label):
  '''
  Completes a forward pass of the CNN and calculates the accuracy and
  cross-entropy loss.
  - image is a 2d numpy array
  - label is a digit
  '''
  # We transform the image from [0, 255] to [-0.5, 0.5] to make it easier
  # to work with. This is standard practice.
  out = conv.forward((image / 255) - 0.5)
  out = pool.forward(out)
  out = softmax.forward(out)

  # Calculate cross-entropy loss and accuracy. np.log() is the natural log.
  loss = -np.log(out[label])
  acc = 1 if np.argmax(out) == label else 0

  return out, loss, acc

The full code is available on Github.

Running it as-is produces:

MNIST CNN initialized!
[Step 100] Past 100 steps: Average Loss 2.302 | Accuracy: 11%
[Step 200] Past 100 steps: Average Loss 2.302 | Accuracy: 8%
[Step 300] Past 100 steps: Average Loss 2.302 | Accuracy: 3%
[Step 400] Past 100 steps: Average Loss 2.302 | Accuracy: 12%

The accuracy here is very poor, because the model has not been trained yet.
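
This is exactly what we should expect from an untrained network: with roughly uniform probabilities of about 0.1 over the 10 digits, the cross-entropy loss is

\[ -\ln(0.1) \approx 2.302 \]

which matches the losses printed above, and the accuracy sits around the 10% you would get from random guessing.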

2 Training the Model

Training a neural network typically consists of two phases:

  1. A forward phase, in which the input is passed through the network to compute the output;
  2. A backward phase, in which gradients are computed and the weights are updated.

We follow this pattern to train our CNN. There are two main design decisions in the implementation:

  1. During the forward phase, each layer caches any data it needs (inputs and intermediate values) for use in the backward phase.
  2. During the backward phase, each layer receives the gradient from the layer after it and returns the gradient it computes for the layer before it.

This design keeps our training implementation clean and organized. The best way to see it is to look at the code:

# Feed forward
out = conv.forward((image / 255) - 0.5)
out = pool.forward(out)
out = softmax.forward(out)

# Calculate initial gradient
gradient = np.zeros(10)
# ...

# Backprop
gradient = softmax.backprop(gradient)
gradient = pool.backprop(gradient)
gradient = conv.backprop(gradient)

Next, let's work through backprop for each layer (the forward passes are straightforward).

3 Backprop: Softmax

Following the backward order, we start with the Softmax layer. First, recall the definition of the cross-entropy loss:

\[ L = -\ln(p_c) \]

where \(p_c\) is the predicted probability of the correct class \(c\). The first thing we need is the gradient fed into the Softmax layer's backward phase, \(\partial L / \partial out_s\), where \(out_s\) is the output of the Softmax layer: a 10-dimensional vector whose elements are the probabilities of the corresponding digits. This is easy to compute:

\[ \frac{\partial L}{\partial out_s(i)}=\left\{\begin{array}{ll} 0 & \text { if } i \neq c \\ -\frac{1}{p_{i}} & \text { if } i=c \end{array}\right. \]

where \(c\) is the correct class. In code:

gradient = np.zeros(10)
gradient[label] = -1 / out[label]

This is the input to the Softmax layer's backward pass; from it, the Softmax layer has to compute a gradient to hand to the layer before it. Let's first look at the forward implementation:

class Softmax:
  # ...

  def forward(self, input):
    '''
    Performs a forward pass of the softmax layer using the given input.
    Returns a 1d numpy array containing the respective probability values.
    - input can be any array with any dimensions.
    '''
    self.last_input_shape = input.shape

    input = input.flatten()
    self.last_input = input

    input_len, nodes = self.weights.shape

    totals = np.dot(input, self.weights) + self.biases
    self.last_totals = totals

    exp = np.exp(totals)
    return exp / np.sum(exp, axis=0)

Looking at this code, what the author calls the Softmax layer is really a fully-connected layer followed by a softmax activation.
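
The forward code above uses self.weights and self.biases, which are set up in the layer's constructor. The constructor isn't shown in this post; as a minimal sketch (based on the author's part-1 code, so treat the exact initialization scheme as an assumption), it looks roughly like this:

class Softmax:
  # A fully-connected layer with softmax activation.

  def __init__(self, input_len, nodes):
    # Divide by input_len to reduce the variance of the initial values (assumed init scheme).
    self.weights = np.random.randn(input_len, nodes) / input_len
    self.biases = np.zeros(nodes)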

We cache three things:

  • the shape of the input;
  • the flattened input values;
  • totals, the input to the softmax activation.

With these three cached values we can start computing gradients. We already computed the input to the Softmax backprop above, and only the element for out_s(c) is nonzero, so inside the Softmax layer we can ignore everything else and only differentiate \(out_s(c)\) with respect to the totals. Recall how out_s(c) is computed:

\[ out_s(c)=\frac{e^{t_{c}}}{\sum_{i} e^{t_{i}}}=\frac{e^{t_{c}}}{S} \]

Applying the chain rule, for \(k \neq c\):

\[ \begin{aligned} \frac{\partial out_s(c)}{\partial t_{k}} &=-e^{t_{c}} S^{-2}\left(\frac{\partial S}{\partial t_{k}}\right) \\ &=-e^{t_{c}} S^{-2}\left(e^{t_{k}}\right) \\ &=\frac{-e^{t_{c}} e^{t_{k}}}{S^{2}} \end{aligned} \]

and for \(t_c\):

\[ \begin{aligned} \frac{\partial out_s(c)}{\partial t_{c}} &=\frac{S e^{t_{c}}-e^{t_{c}} \frac{\partial S}{\partial t_{c}}}{S^{2}} \\ &=\frac{S e^{t_{c}}-e^{t_{c}} e^{t_{c}}}{S^{2}} \\ &=\frac{e^{t_{c}}\left(S-e^{t_{c}}\right)}{S^{2}} \end{aligned} \]
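
As a quick sanity check (my addition, not part of the original post), we can compare these two formulas against a finite-difference estimate on a small, made-up vector of totals:

import numpy as np

def softmax_c(t, c):
  # Probability that the softmax of t assigns to class c.
  exp = np.exp(t)
  return exp[c] / np.sum(exp)

t = np.array([1.0, 2.0, 0.5])  # hypothetical totals
c = 1                          # index of the correct class

S = np.sum(np.exp(t))
analytic = -np.exp(t[c]) * np.exp(t) / S ** 2             # case k != c
analytic[c] = np.exp(t[c]) * (S - np.exp(t[c])) / S ** 2  # case k == c

eps = 1e-6
numeric = np.array([
  (softmax_c(t + eps * np.eye(3)[k], c) - softmax_c(t, c)) / eps
  for k in range(3)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # expect True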

Let's turn this part into code:

class Softmax:
  # ...

  def backprop(self, d_L_d_out):
    '''
    Performs a backward pass of the softmax layer.
    Returns the loss gradient for this layer's inputs.
    - d_L_d_out is the loss gradient for this layer's outputs.
    '''
    # We know only 1 element of d_L_d_out will be nonzero
    for i, gradient in enumerate(d_L_d_out):
      if gradient == 0:
        continue

      # e^totals
      t_exp = np.exp(self.last_totals)

      # Sum of all e^totals
      S = np.sum(t_exp)

      # Gradients of out[i] against totals
      d_out_d_t = -t_exp[i] * t_exp / (S ** 2)
      d_out_d_t[i] = t_exp[i] * (S - t_exp[i]) / (S ** 2)

      # ... to be continued

What we have here is still only the gradient with respect to an intermediate variable. What we ultimately need are the gradients of the loss with respect to the weights, the biases, and the input:

  1. We use \(\partial L / \partial W\) to update the weight matrix;
  2. We use \(\partial L / \partial b\) to update the bias vector;
  3. \(\partial L / \partial input\) is passed back to the previous layer.

To compute these three gradients, we start from:

\[ t = w \cdot input + b \]

These gradients are all easy to compute:

\[\begin{aligned} &\frac{\partial t}{\partial w}=input\\ &\frac{\partial t}{\partial b}=1\\ &\frac{\partial t}{\partial input}=w \end{aligned}\]

Putting everything together with the chain rule:

\[\begin{aligned} \frac{\partial L}{\partial w} &=\frac{\partial L}{\partial out} * \frac{\partial out}{\partial t} * \frac{\partial t}{\partial w} \\ \frac{\partial L}{\partial b} &=\frac{\partial L}{\partial out} * \frac{\partial out}{\partial t} * \frac{\partial t}{\partial b} \\ \frac{\partial L}{\partial input} &=\frac{\partial L}{\partial out} * \frac{\partial out}{\partial t} * \frac{\partial t}{\partial input} \end{aligned}\]

Turning this into code gives:

class Softmax:
  # ...

  def backprop(self, d_L_d_out):
    '''
    Performs a backward pass of the softmax layer.
    Returns the loss gradient for this layer's inputs.
    - d_L_d_out is the loss gradient for this layer's outputs.
    '''
    # We know only 1 element of d_L_d_out will be nonzero
    for i, gradient in enumerate(d_L_d_out):
      if gradient == 0:
        continue

      # e^totals
      t_exp = np.exp(self.last_totals)

      # Sum of all e^totals
      S = np.sum(t_exp)

      # Gradients of out[i] against totals
      d_out_d_t = -t_exp[i] * t_exp / (S ** 2)
      d_out_d_t[i] = t_exp[i] * (S - t_exp[i]) / (S ** 2)

      # Gradients of totals against weights/biases/input
      d_t_d_w = self.last_input
      d_t_d_b = 1
      d_t_d_inputs = self.weights

      # Gradients of loss against totals
      d_L_d_t = gradient * d_out_d_t

      # Gradients of loss against weights/biases/input
      d_L_d_w = d_t_d_w[np.newaxis].T @ d_L_d_t[np.newaxis]
      d_L_d_b = d_L_d_t * d_t_d_b
      d_L_d_inputs = d_t_d_inputs @ d_L_d_t

      # ... to be continued
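
The only slightly tricky line above is the outer product that builds d_L_d_w. A quick shape check (my addition, using random stand-ins of the right sizes) shows why it yields a matrix with the same shape as the weights:

import numpy as np

input_len, nodes = 13 * 13 * 8, 10
d_t_d_w = np.random.randn(input_len)  # stands in for self.last_input
d_L_d_t = np.random.randn(nodes)      # stands in for gradient * d_out_d_t

# (input_len, 1) @ (1, nodes) -> (input_len, nodes), matching self.weights.
d_L_d_w = d_t_d_w[np.newaxis].T @ d_L_d_t[np.newaxis]
print(d_L_d_w.shape)  # (1352, 10)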

All that's left is the output. We use the gradients with respect to \(W\) and \(b\) to update those parameters with SGD, and we return the gradient with respect to the input so it keeps flowing backward.
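
For reference, the SGD update is just a single gradient step with learning rate \(\eta\) (learn_rate in the code):

\[ w \leftarrow w - \eta \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial L}{\partial b} \]

In code: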

class Softmax:
  # ...

  def backprop(self, d_L_d_out, learn_rate):
    '''
    Performs a backward pass of the softmax layer.
    Returns the loss gradient for this layer's inputs.
    - d_L_d_out is the loss gradient for this layer's outputs.
    - learn_rate is a float
    '''
    # We know only 1 element of d_L_d_out will be nonzero
    for i, gradient in enumerate(d_L_d_out):
      if gradient == 0:
        continue

      # e^totals
      t_exp = np.exp(self.last_totals)

      # Sum of all e^totals
      S = np.sum(t_exp)

      # Gradients of out[i] against totals
      d_out_d_t = -t_exp[i] * t_exp / (S ** 2)
      d_out_d_t[i] = t_exp[i] * (S - t_exp[i]) / (S ** 2)

      # Gradients of totals against weights/biases/input
      d_t_d_w = self.last_input
      d_t_d_b = 1
      d_t_d_inputs = self.weights

      # Gradients of loss against totals
      d_L_d_t = gradient * d_out_d_t

      # Gradients of loss against weights/biases/input
      d_L_d_w = d_t_d_w[np.newaxis].T @ d_L_d_t[np.newaxis]
      d_L_d_b = d_L_d_t * d_t_d_b
      d_L_d_inputs = d_t_d_inputs @ d_L_d_t

      # Update weights / biases
      self.weights -= learn_rate * d_L_d_w
      self.biases -= learn_rate * d_L_d_b

      return d_L_d_inputs.reshape(self.last_input_shape)

Note that before returning we reshape d_L_d_inputs to match the shape of the original input. Let's verify our code by training just the Softmax layer:

# Imports and setup here
# ...

def forward(image, label):
  # Implementation excluded
  # ...

def train(im, label, lr=.005):
  '''
  Completes a full training step on the given image and label.
  Returns the cross-entropy loss and accuracy.
  - image is a 2d numpy array
  - label is a digit
  - lr is the learning rate
  '''
  # Forward
  out, loss, acc = forward(im, label)

  # Calculate initial gradient
  gradient = np.zeros(10)
  gradient[label] = -1 / out[label]

  # Backprop
  gradient = softmax.backprop(gradient, lr)
  # TODO: backprop MaxPool2 layer
  # TODO: backprop Conv3x3 layer

  return loss, acc

print('MNIST CNN initialized!')

# Train!
loss = 0
num_correct = 0
for i, (im, label) in enumerate(zip(train_images, train_labels)):
  if i > 0 and i % 100 == 99:
    print(
      '[Step %d] Past 100 steps: Average Loss %.3f | Accuracy: %d%%' %
      (i + 1, loss / 100, num_correct)
    )
    loss = 0
    num_correct = 0

  l, acc = train(im, label)
  loss += l
  num_correct += acc

This training loop is not exactly standard practice: it feeds in one example at a time and reports accuracy on the training data itself.

The output after running it is:

MNIST CNN initialized!
[Step 100] Past 100 steps: Average Loss 2.239 | Accuracy: 18%
[Step 200] Past 100 steps: Average Loss 2.140 | Accuracy: 32%
[Step 300] Past 100 steps: Average Loss 1.998 | Accuracy: 48%
[Step 400] Past 100 steps: Average Loss 1.861 | Accuracy: 59%
[Step 500] Past 100 steps: Average Loss 1.789 | Accuracy: 56%
[Step 600] Past 100 steps: Average Loss 1.809 | Accuracy: 48%
[Step 700] Past 100 steps: Average Loss 1.718 | Accuracy: 63%
[Step 800] Past 100 steps: Average Loss 1.588 | Accuracy: 69%
[Step 900] Past 100 steps: Average Loss 1.509 | Accuracy: 71%
[Step 1000] Past 100 steps: Average Loss 1.481 | Accuracy: 70%

Accuracy improves substantially: our CNN has started to learn.

4 Backprop: Max Pooling

The Max Pooling layer has no parameters at all, but we still need to implement backprop so the gradient keeps propagating backward. Again, we start with the forward pass:

class MaxPool2:

  def forward(self, input):
    '''
    Performs a forward pass of the maxpool layer using the given input.
    Returns a 3d numpy array with dimensions (h / 2, w / 2, num_filters).
    - input is a 3d numpy array with dimensions (h, w, num_filters)
    '''
    self.last_input = input

    # More implementation
    # ...

The forward pass of MaxPool splits the input into 2x2 blocks and outputs the maximum of each block. For backprop, we therefore take the incoming gradient (which has the smaller, pooled shape) and route each gradient value back to the position of the maximum in the corresponding 2x2 block of the original input; every other position gets 0.
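
Since the figures from the original post aren't reproduced here, here is a minimal numeric sketch (my own made-up numbers) of the routing for a single 2x2 block:

import numpy as np

block = np.array([[1.0, 3.0],
                  [2.0, 0.0]])  # forward input; the max is 3.0 at position (0, 1)
d_L_d_out = 0.7                 # incoming gradient for this pooled output value

# Only the position that held the max receives the gradient.
d_L_d_block = np.zeros_like(block)
d_L_d_block[block == block.max()] = d_L_d_out
print(d_L_d_block)
# [[0.  0.7]
#  [0.  0. ]]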

The original post illustrates this with a small worked example and figures; the code implementation is as follows:

class MaxPool2:
  # ...

  def iterate_regions(self, image):
    '''
    Generates non-overlapping 2x2 image regions to pool over.
    - image is a 2d numpy array
    '''
    h, w, _ = image.shape
    new_h = h // 2
    new_w = w // 2

    for i in range(new_h):
      for j in range(new_w):
        im_region = image[(i * 2):(i * 2 + 2), (j * 2):(j * 2 + 2)]
        yield im_region, i, j

  def backprop(self, d_L_d_out):
    '''
    Performs a backward pass of the maxpool layer.
    Returns the loss gradient for this layer's inputs.
    - d_L_d_out is the loss gradient for this layer's outputs.
    '''
    d_L_d_input = np.zeros(self.last_input.shape)

    for im_region, i, j in self.iterate_regions(self.last_input):
      h, w, f = im_region.shape
      amax = np.amax(im_region, axis=(0, 1))

      for i2 in range(h):
        for j2 in range(w):
          for f2 in range(f):
            # If this pixel was the max value, copy the gradient to it
            if im_region[i2, j2, f2] == amax[f2]:
              d_L_d_input[i * 2 + i2, j * 2 + j2, f2] = d_L_d_out[i, j, f2]

    return d_L_d_input

This implementation uses explicit Python loops over every pixel, so it is quite slow.
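
As an aside (my addition, not the author's code), the same routing can be vectorized with a reshape trick, assuming the input height and width are even, as they are here (26x26):

import numpy as np

def maxpool2_backprop_vectorized(last_input, d_L_d_out):
  # last_input: (h, w, f) cached forward input; d_L_d_out: (h // 2, w // 2, f).
  h, w, f = last_input.shape
  blocks = last_input.reshape(h // 2, 2, w // 2, 2, f)

  # Mark, within each 2x2 block, the positions equal to that block's max.
  maxes = blocks.max(axis=(1, 3), keepdims=True)
  mask = (blocks == maxes)

  # Broadcast each pooled gradient over its block, keeping it only at the max.
  d_L_d_input = mask * d_L_d_out.reshape(h // 2, 1, w // 2, 1, f)
  return d_L_d_input.reshape(h, w, f)

Like the loop version above, this copies the gradient to every position that ties for the maximum of its block.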

That's all we need for the Max Pooling backprop.

5 Backprop: Conv

The convolutional layer is the core of a CNN. As before, we start by caching data during the forward phase:

class Conv3x3:
  # ...

  def forward(self, input):
    '''
    Performs a forward pass of the conv layer using the given input.
    Returns a 3d numpy array with dimensions (h, w, num_filters).
    - input is a 2d numpy array
    '''
    self.last_input = input

    # More implementation
    # ...

To keep things simple, we assume the input to the Conv layer is a 2d array (i.e. a single channel), which holds because this Conv layer is the first layer of the network. In a more general network, an intermediate Conv layer would receive a 3d array with an extra channel dimension.

What we mainly care about is the gradient of the loss with respect to the filters. The MaxPool layer already hands the Conv layer \(\partial L / \partial out\), so we only need \(\partial out / \partial filters\). To work out this gradient, first consider the question: if we change a filter weight, how does that affect the Conv layer's output?

In fact, changing any filter weight changes the entire output image. To simplify the problem, let's consider just one output pixel: how does changing a filter weight affect that particular pixel? Here is a very simple example:

We have a \(3 \times 3\) image and a \(3 \times 3\) filter that together produce a single \(1 \times 1\) output. If we increase the center weight of the filter, the output increases by the image's center pixel value times that change.

Thinking it through, the derivative of an output pixel with respect to a filter element is simply the value of the input image at the corresponding position. In formulas:

\[ \begin{aligned} out(i, j) &=\text{convolve}(image, filter) \\ &=\sum_{x=0}^{3} \sum_{y=0}^{3} image(i+x, j+y) * filter(x, y) \end{aligned} \]

\[ \frac{\partial\, out(i, j)}{\partial\, filter(x, y)}=image(i+x, j+y) \]

For the full output image, we then have:

\[ \frac{\partial L}{\partial filter(x, y)}=\sum_{i} \sum_{j} \frac{\partial L}{\partial out(i, j)} * \frac{\partial out(i, j)}{\partial filter(x, y)} \]
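
A quick numeric check of the single-pixel case (my addition, with made-up values): the lone output is np.sum(image * filter), so nudging filter[1, 1] changes the output at a rate equal to image[1, 1].

import numpy as np

image = np.arange(9, dtype=float).reshape(3, 3)  # hypothetical 3x3 image
filt = np.random.randn(3, 3)                     # hypothetical 3x3 filter

out = np.sum(image * filt)  # the single output of a valid 3x3 convolution

eps = 1e-6
bumped = filt.copy()
bumped[1, 1] += eps
numeric = (np.sum(image * bumped) - out) / eps
print(numeric, image[1, 1])  # both approximately 4.0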

Now we can implement it:

class Conv3x3:
  # ...

  def iterate_regions(self, image):
    '''
    Generates all possible 3x3 image regions using valid padding.
    - image is a 2d numpy array.
    '''
    h, w = image.shape

    for i in range(h - 2):
      for j in range(w - 2):
        im_region = image[i:(i + 3), j:(j + 3)]
        yield im_region, i, j

  def backprop(self, d_L_d_out, learn_rate):
    '''
    Performs a backward pass of the conv layer.
    - d_L_d_out is the loss gradient for this layer's outputs.
    - learn_rate is a float.
    '''
    d_L_d_filters = np.zeros(self.filters.shape)

    for im_region, i, j in self.iterate_regions(self.last_input):
      for f in range(self.num_filters):
        d_L_d_filters[f] += d_L_d_out[i, j, f] * im_region

    # Update filters
    self.filters -= learn_rate * d_L_d_filters

    # We aren't returning anything here since we use Conv3x3 as
    # the first layer in our CNN. Otherwise, we'd need to return
    # the loss gradient for this layer's inputs, just like every
    # other layer in our CNN.
    return None

Because the Conv layer is the first layer of the network, its backprop simply returns None.

6 Training the CNN

Our model is now complete. We can train it with the following code:

import mnist
import numpy as np
from conv import Conv3x3
from maxpool import MaxPool2
from softmax import Softmax

# We only use the first 1k examples of each set in the interest of time.
# Feel free to change this if you want.
train_images = mnist.train_images()[:1000]
train_labels = mnist.train_labels()[:1000]
test_images = mnist.test_images()[:1000]
test_labels = mnist.test_labels()[:1000]

conv = Conv3x3(8)                  # 28x28x1 -> 26x26x8
pool = MaxPool2()                  # 26x26x8 -> 13x13x8
softmax = Softmax(13 * 13 * 8, 10) # 13x13x8 -> 10

def forward(image, label):
  '''
  Completes a forward pass of the CNN and calculates the accuracy and
  cross-entropy loss.
  - image is a 2d numpy array
  - label is a digit
  '''
  # We transform the image from [0, 255] to [-0.5, 0.5] to make it easier
  # to work with. This is standard practice.
  out = conv.forward((image / 255) - 0.5)
  out = pool.forward(out)
  out = softmax.forward(out)

  # Calculate cross-entropy loss and accuracy. np.log() is the natural log.
  loss = -np.log(out[label])
  acc = 1 if np.argmax(out) == label else 0

  return out, loss, acc

def train(im, label, lr=.005):
  '''
  Completes a full training step on the given image and label.
  Returns the cross-entropy loss and accuracy.
  - image is a 2d numpy array
  - label is a digit
  - lr is the learning rate
  '''
  # Forward
  out, loss, acc = forward(im, label)

  # Calculate initial gradient
  gradient = np.zeros(10)
  gradient[label] = -1 / out[label]

  # Backprop
  gradient = softmax.backprop(gradient, lr)
  gradient = pool.backprop(gradient)
  gradient = conv.backprop(gradient, lr)

  return loss, acc

print('MNIST CNN initialized!')

# Train the CNN for 3 epochs
for epoch in range(3):
  print('--- Epoch %d ---' % (epoch + 1))

  # Shuffle the training data
  permutation = np.random.permutation(len(train_images))
  train_images = train_images[permutation]
  train_labels = train_labels[permutation]

  # Train!
  loss = 0
  num_correct = 0
  for i, (im, label) in enumerate(zip(train_images, train_labels)):
    if i > 0 and i % 100 == 99:
      print(
        '[Step %d] Past 100 steps: Average Loss %.3f | Accuracy: %d%%' %
        (i + 1, loss / 100, num_correct)
      )
      loss = 0
      num_correct = 0

    l, acc = train(im, label)
    loss += l
    num_correct += acc

# Test the CNN
print('\n--- Testing the CNN ---')
loss = 0
num_correct = 0
for im, label in zip(test_images, test_labels):
  _, l, acc = forward(im, label)
  loss += l
  num_correct += acc

num_tests = len(test_images)
print('Test Loss:', loss / num_tests)
print('Test Accuracy:', num_correct / num_tests)

Running this for three epochs produces the following output:

MNIST CNN initialized!
--- Epoch 1 ---
[Step 100] Past 100 steps: Average Loss 2.254 | Accuracy: 18%
[Step 200] Past 100 steps: Average Loss 2.167 | Accuracy: 30%
[Step 300] Past 100 steps: Average Loss 1.676 | Accuracy: 52%
[Step 400] Past 100 steps: Average Loss 1.212 | Accuracy: 63%
[Step 500] Past 100 steps: Average Loss 0.949 | Accuracy: 72%
[Step 600] Past 100 steps: Average Loss 0.848 | Accuracy: 74%
[Step 700] Past 100 steps: Average Loss 0.954 | Accuracy: 68%
[Step 800] Past 100 steps: Average Loss 0.671 | Accuracy: 81%
[Step 900] Past 100 steps: Average Loss 0.923 | Accuracy: 67%
[Step 1000] Past 100 steps: Average Loss 0.571 | Accuracy: 83%
--- Epoch 2 ---
[Step 100] Past 100 steps: Average Loss 0.447 | Accuracy: 89%
[Step 200] Past 100 steps: Average Loss 0.401 | Accuracy: 86%
[Step 300] Past 100 steps: Average Loss 0.608 | Accuracy: 81%
[Step 400] Past 100 steps: Average Loss 0.511 | Accuracy: 83%
[Step 500] Past 100 steps: Average Loss 0.584 | Accuracy: 89%
[Step 600] Past 100 steps: Average Loss 0.782 | Accuracy: 72%
[Step 700] Past 100 steps: Average Loss 0.397 | Accuracy: 84%
[Step 800] Past 100 steps: Average Loss 0.560 | Accuracy: 80%
[Step 900] Past 100 steps: Average Loss 0.356 | Accuracy: 92%
[Step 1000] Past 100 steps: Average Loss 0.576 | Accuracy: 85%
--- Epoch 3 ---
[Step 100] Past 100 steps: Average Loss 0.367 | Accuracy: 89%
[Step 200] Past 100 steps: Average Loss 0.370 | Accuracy: 89%
[Step 300] Past 100 steps: Average Loss 0.464 | Accuracy: 84%
[Step 400] Past 100 steps: Average Loss 0.254 | Accuracy: 95%
[Step 500] Past 100 steps: Average Loss 0.366 | Accuracy: 89%
[Step 600] Past 100 steps: Average Loss 0.493 | Accuracy: 89%
[Step 700] Past 100 steps: Average Loss 0.390 | Accuracy: 91%
[Step 800] Past 100 steps: Average Loss 0.459 | Accuracy: 87%
[Step 900] Past 100 steps: Average Loss 0.316 | Accuracy: 92%
[Step 1000] Past 100 steps: Average Loss 0.460 | Accuracy: 87%

--- Testing the CNN ---
Test Loss: 0.5979384893783474
Test Accuracy: 0.78

In the end we reach 78% accuracy on the test set.

Note that the accuracy on the training set is noticeably higher than on the test set, which indicates some overfitting.