# Machine Learning: Artificial Neural Networks

ANNs are the fundamental building blocks of Deep Learning, and they have a wide range of uses:

ANNs are at the very core of Deep Learning. They are versatile, powerful, and scalable, making them ideal to tackle large and highly complex Machine Learning tasks, such as classifying billions of images (e.g., Google Images), powering speech recognition services (e.g., Apple’s Siri), recommending the best videos to watch to hundreds of millions of users every day (e.g., YouTube), or learning to beat the world champion at the game of Go by examining millions of past games and then playing against itself (DeepMind’s AlphaGo).

## From Biological to Artificial Neurons

ANNs already have a long history, which we will not go into here. Today they are enjoying a renaissance, mainly for the following reasons:

• There is now a huge quantity of high-quality data available for training neural networks, and ANNs are well suited to very large and complex problems
• The enormous increase in computing power, such as powerful GPUs, provides the foundation for training large neural networks
• The training algorithms have been improved, bringing large gains in performance
• Some theoretical limitations of ANNs have turned out to be benign in practice; across a great many experiments, ANNs perform well and come remarkably close to the global optimum
• Funding is flowing into ANNs, which further stimulates their development

### Biological Neurons

ANNs are largely inspired by biological neurons, so let's first take a quick look at the parts of a biological neuron:

• The cell body processes the signals coming in through the dendrites; in an ANN it plays the role of the activation function, which decides the neuron's output
• The dendrites connect to other neurons; in an ANN they correspond to a neuron's connections to its inputs
• The axon carries the neuron's output

### Logical Computations with Neurons

• The leftmost network is an identity function: the state of A is passed straight to C
• The second network performs a logical AND: C is activated only when both A and B are activated
• The third network performs a logical OR: C is activated if either A or B is activated
• The fourth activates C only when A is activated and B is not
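A minimal NumPy sketch of these networks, using neurons that fire when their weighted input reaches a threshold of 2 (the specific weights below are one possible encoding chosen for illustration, not taken from the original figure):

```python
import numpy as np

def fires(inputs, weights, threshold=2):
    """A neuron fires (outputs 1) when the weighted sum of its active input connections reaches the threshold."""
    return int(np.dot(inputs, weights) >= threshold)

for A in (0, 1):
    for B in (0, 1):
        identity = fires([A],    [2])       # two connections from A
        a_and_b  = fires([A, B], [1, 1])    # one connection from each of A and B
        a_or_b   = fires([A, B], [2, 2])    # two connections from each of A and B
        a_not_b  = fires([A, B], [2, -2])   # inhibitory connection from B
        print(A, B, "->", identity, a_and_b, a_or_b, a_not_b)
```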

### The Perceptron

The Perceptron is one of the simplest ANN architectures. Its artificial neuron is called a linear threshold unit (LTU), the name we will keep using below; its inputs and output are numbers. An LTU computes a weighted sum of its inputs and then applies a step function to that sum:

$z = w_1x_1 + w_2x_2 + \cdots + w_nx_n = \vec{w}^T \cdot \vec{x}$

$h_w(\vec{x}) = step(z) = step(\vec{w}^T \cdot \vec{x})$

$heaviside(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$

$sgn(z) = \begin{cases} -1 & \text{if } z < 0 \\ 0 & \text{if } z = 0 \\ 1 & \text{if } z > 0 \end{cases}$

A Perceptron is simply a single layer of LTUs. We already know that an LTU is one artificial neuron; put several LTUs in one layer to act as the output layer, add an input layer, and you have a Perceptron. A bias neuron is also added to the input layer. The Perceptron is trained with the following learning rule:

$w_{i,j}^{(next \ step)} = w_{i,j} + \eta(y_j - \hat{y}_j)x_i$

• $w_{i,j}$ is the weight between the ith input neuron and the jth output neuron
• $x_i$ is the ith input value of the current training instance
• $\hat{y}_j$ is the output of the jth output neuron for the current training instance
• $y_j$ is the target output of the jth output neuron for the current training instance
• $\eta$ is the learning rate
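To make the update rule concrete, here is a minimal NumPy sketch of a single training step (the function name, shapes, and sample values are made up for illustration):

```python
import numpy as np

def perceptron_step(W, x, y_true, eta=0.1):
    """One Perceptron learning step. W has shape (n_inputs, n_outputs)."""
    y_hat = (x @ W >= 0).astype(np.float64)    # step activation of each output LTU
    W += eta * np.outer(x, y_true - y_hat)     # w_ij += eta * (y_j - y_hat_j) * x_i
    return W

W = np.zeros((3, 1))                           # 2 inputs + 1 bias input, 1 output neuron
x = np.array([1.0, 2.0, 0.5])                  # bias term first, then the two features
W = perceptron_step(W, x, y_true=np.array([1.0]))
```

Note that the weights only change when the LTU misclassifies the instance, since the term $(y_j - \hat{y}_j)$ is zero otherwise.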

Scikit-Learn provides a Perceptron class that implements a single-LTU network; here it is trained on two petal features of the iris dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()                   # the iris dataset was not loaded in the original snippet
X = iris.data[:, (2, 3)]             # petal length, petal width
y = (iris.target == 0).astype(int)   # Iris setosa?

per_clf = Perceptron(max_iter=100, tol=-np.inf, random_state=42)
per_clf.fit(X, y)

y_pred = per_clf.predict([[2, 0.5]])

"""
array([1])
"""
```


The following small hand-built MLP, made of step-activation neurons, computes the XOR function:

```python
def heaviside(z):
    return (z >= 0).astype(z.dtype)

def mlp_xor(x1, x2, activation=heaviside):
    return activation(-activation(x1 + x2 - 1.5) + activation(x1 + x2 - 0.5) - 0.5)
```
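Evaluating this network on all four input combinations confirms that it computes XOR, something a single Perceptron cannot represent:

```python
import numpy as np

x1 = np.array([0., 0., 1., 1.])
x2 = np.array([0., 1., 0., 1.])
mlp_xor(x1, x2)   # array([0., 1., 1., 0.])
```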


### Multi-Layer Perceptron and Backpropagation

Backpropagation trains an MLP by doing a forward pass to compute every node's value, then a reverse pass that applies the chain rule through the network to obtain partial derivatives such as $\partial{n_7}/\partial{n_5}$ between connected nodes, and finally a Gradient Descent step using those gradients. Popular activation functions include:

• logistic function: $\sigma(z) = 1/(1 + e^{-z})$
• hyperbolic tangent function: $\tanh(z) = 2\sigma(2z) - 1$
• ReLU function: $ReLU(z) = \max(0, z)$
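For reference, a minimal NumPy sketch of these three activation functions, vectorized over an array of inputs:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return 2.0 * logistic(2.0 * z) - 1.0   # equivalent to np.tanh(z)

def relu(z):
    return np.maximum(0, z)

z = np.linspace(-5, 5, 11)
print(logistic(z), tanh(z), relu(z), sep="\n")
```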

MLPs are often used for classification, with each output neuron corresponding to one class. For a classification task, therefore, the individual activation functions of the output layer are dropped and replaced with a shared softmax function, so the network outputs one estimated probability per class; these probabilities can then support custom strategies such as applying a probability threshold.
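As a quick illustration of that last point, here is a minimal, numerically stable softmax sketch in NumPy (the helper name and the 0.9 threshold are arbitrary choices for illustration):

```python
import numpy as np

def softmax(logits):
    # subtract the row maximum for numerical stability, then exponentiate and normalize
    exps = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exps / exps.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1]])
probs = softmax(scores)                 # roughly array([[0.659, 0.242, 0.099]])
prediction = probs.argmax(axis=-1)      # class 0
confident = probs.max(axis=-1) > 0.9    # custom threshold on the estimated probability
```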

## Training an MLP with TensorFlow’s High-Level API

```python
import tensorflow as tf

# Load MNIST with Keras, flatten the 28x28 images, scale pixels to [0, 1],
# and carve the first 5,000 training images out as a validation set.
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28 * 28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28 * 28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

# A DNNClassifier with two hidden layers (300 and 100 neurons) and 10 output classes.
feature_cols = [tf.feature_column.numeric_column("X", shape=[28 * 28])]
dnn_clf = tf.estimator.DNNClassifier(hidden_units=[300, 100], n_classes=10, feature_columns=feature_cols)

input_fn = tf.estimator.inputs.numpy_input_fn(x={"X": X_train}, y=y_train, num_epochs=40, batch_size=50, shuffle=True)
dnn_clf.train(input_fn=input_fn)
```


The estimator also provides an evaluation method:

```python
valid_input_fn = tf.estimator.inputs.numpy_input_fn(x={"X": X_valid}, y=y_valid, shuffle=False)
eval_results = dnn_clf.evaluate(input_fn=valid_input_fn)

"""
{'accuracy': 0.982,
 'average_loss': 0.09565534,
 'loss': 11.956917,
 'global_step': 44000}
"""
```


```python
test_input_fn = tf.estimator.inputs.numpy_input_fn(x={"X": X_test}, y=y_test, shuffle=False)
y_pred_iter = dnn_clf.predict(input_fn=test_input_fn)
y_pred = list(y_pred_iter)
y_pred[0]

"""
{'logits': array([ -7.4078193,   2.8550382,   1.8491653,   6.945245 ,  -5.996856 ,
         -0.6053193, -10.372234 ,  22.27766  ,  -4.490141 ,   2.99006  ],
       dtype=float32),
 'probabilities': array([1.2816213e-13, 3.6716565e-09, 1.3428177e-09, 2.1938979e-07,
        5.2545213e-13, 1.1535802e-10, 6.6119684e-15, 9.9999976e-01,
        2.3707813e-12, 4.2024433e-09], dtype=float32),
 'class_ids': array([7]),
 'classes': array([b'7'], dtype=object),
 'all_class_ids': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32),
 'all_classes': array([b'0', b'1', b'2', b'3', b'4', b'5', b'6', b'7', b'8', b'9'],
       dtype=object)}
"""
```


## Training a DNN Using Plain TensorFlow

### Construction Phase

```python
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

def reset_graph(seed=42):
    # helper used throughout the book's notebooks (its definition is assumed here):
    # clear the default graph and fix the random seeds for reproducible runs
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
```


```python
def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        stddev = 2 / np.sqrt(n_inputs)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name="kernel")
        b = tf.Variable(tf.zeros([n_neurons]), name="bias")
        Z = tf.matmul(X, W) + b
        if activation is not None:
            return activation(Z)
        else:
            return Z
```


• Each layer gets its own name scope via the name argument; this makes the graph much clearer when you inspect it in TensorBoard
• n_inputs is used to compute the number of input features feeding into this layer
• The third step creates the weight matrix connecting this layer to the previous one; its shape (n_inputs, n_neurons) holds one weight for every connection between a neuron of the previous layer and a neuron of this layer, and the weights are initialized randomly from a truncated Gaussian (normal) distribution
• A bias is added for every neuron in the layer
• The layer's output is then computed; it is a matrix with one row per training instance, holding the values of this layer's neurons for that instance
• If an activation function was specified, it is applied to the result; otherwise the raw result is returned

with tf.name_scope("dnn"):
hidden1 = neuron_layer(X, n_hidden1, name="hidden1", activation=tf.nn.relu)
hidden2 = neuron_layer(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
logits = neuron_layer(hidden2, n_outputs, name="outputs")


with tf.name_scope("loss"):
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy, name="loss")


```python
learning_rate = 0.01

with tf.name_scope("train"):
    # the optimizer was not defined in the original snippet; plain Gradient Descent is assumed here
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)
```


with tf.name_scope("eval"):
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))


```python
init = tf.global_variables_initializer()
saver = tf.train.Saver()
```


### Execution Phase

```python
n_epochs = 40
batch_size = 50

def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_batch = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Batch accuracy:", acc_batch, "Val accuracy:", acc_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")
```


```
0 Batch accuracy: 0.9 Val accuracy: 0.9146
1 Batch accuracy: 0.92 Val accuracy: 0.936
2 Batch accuracy: 0.96 Val accuracy: 0.945
...
37 Batch accuracy: 1.0 Val accuracy: 0.9776
38 Batch accuracy: 1.0 Val accuracy: 0.9792
39 Batch accuracy: 1.0 Val accuracy: 0.9776
```


### Using the Neural Network

```python
with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")  # or better, use save_path
    X_new_scaled = X_test[:20]
    Z = logits.eval(feed_dict={X: X_new_scaled})
    y_pred = np.argmax(Z, axis=1)

print("Predicted classes:", y_pred)
print("Actual classes:   ", y_test[:20])

"""
Predicted classes: [7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5 9 7 3 4]
Actual classes:    [7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5 9 7 3 4]
"""
```


with tf.name_scope("dnn"):
hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1", activation=tf.nn.relu)
hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
logits = tf.layers.dense(hidden2, n_outputs, name="outputs")


## Exercises

### 2. Why is it generally preferable to use a Logistic Regression classifier rather than a classical Perceptron (i.e., a single layer of linear threshold units trained using the Perceptron training algorithm)? How can you tweak a Perceptron to make it equivalent to a Logistic Regression classifier?

A classical Perceptron will converge only if the dataset is linearly separable, and it won’t be able to estimate class probabilities. In contrast, a Logistic Regression classifier will converge to a good solution even if the dataset is not linearly separable, and it will output class probabilities. If you change the Perceptron’s activation function to the logistic activation function (or the softmax activation function if there are multiple neurons), and if you train it using Gradient Descent (or some other optimization algorithm minimizing the cost function, typically cross entropy), then it becomes equivalent to a Logistic Regression classifier.

### 3. Why was the logistic activation function a key ingredient in training the first MLPs?

The logistic activation function was a key ingredient in training the first MLPs because its derivative is always nonzero, so Gradient Descent can always roll down the slope. When the activation function is a step function, Gradient Descent cannot move, as there is no slope at all.

### 4. Name three popular activation functions. Can you draw them?

The step function, the logistic function, the hyperbolic tangent, the rectified linear unit (see Figure 10-8). See Chapter 11 for other examples, such as ELU and variants of the ReLU.

### 5. Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function.

• What is the shape of the input matrix X?

• What about the shape of the hidden layer’s weight vector $W_h$, and the shape of its bias vector $b_h$?

• What is the shape of the output layer’s weight vector $W_o$, and its bias vector $b_o$?

• What is the shape of the network’s output matrix Y?

• Write the equation that computes the network’s output matrix Y as a function of X, $W_h$, $b_h$, $W_o$ and $b_o$.
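A sketch of the answers (here $m$ denotes the number of instances in a batch, a symbol not fixed by the question):

• X has shape $(m, 10)$
• $W_h$ has shape $(10, 50)$ and $b_h$ has shape $(50,)$
• $W_o$ has shape $(50, 3)$ and $b_o$ has shape $(3,)$
• Y has shape $(m, 3)$
• $Y = ReLU(ReLU(X \, W_h + b_h) \, W_o + b_o)$, where each bias vector is broadcast across the rows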

### 6. How many neurons do you need in the output layer if you want to classify email into spam or ham? What activation function should you use in the output layer? If instead you want to tackle MNIST, how many neurons do you need in the output layer, using what activation function? Answer the same questions for getting your network to predict housing prices as in Chapter 2.

To classify email into spam or ham, you just need one neuron in the output layer of a neural network—for example, indicating the probability that the email is spam. You would typically use the logistic activation function in the output layer when estimating a probability. If instead you want to tackle MNIST, you need 10 neurons in the output layer, and you must replace the logistic function with the softmax activation function, which can handle multiple classes, outputting one probability per class. Now, if you want your neural network to predict housing prices like in Chapter 2, then you need one output neuron, using no activation function at all in the output layer.

### 7. What is backpropagation and how does it work? What is the difference between backpropagation and reverse-mode autodiff?

Backpropagation is a technique used to train artificial neural networks. It first computes the gradients of the cost function with regards to every model parameter (all the weights and biases), and then it performs a Gradient Descent step using these gradients. This backpropagation step is typically performed thousands or millions of times, using many training batches, until the model parameters converge to values that (hopefully) minimize the cost function. To compute the gradients, backpropagation uses reverse-mode autodiff (although it wasn’t called that when backpropagation was invented, and it has been reinvented several times). Reverse-mode autodiff performs a forward pass through a computation graph, computing every node’s value for the current training batch, and then it performs a reverse pass, computing all the gradients at once (see Appendix D for more details). So what’s the difference? Well, backpropagation refers to the whole process of training an artificial neural network using multiple backpropagation steps, each of which computes gradients and uses them to perform a Gradient Descent step. In contrast, reverse-mode autodiff is simply a technique to compute gradients efficiently, and it happens to be used by backpropagation.
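To make this concrete, here is a minimal NumPy sketch of one backpropagation step on a tiny made-up network (one logistic hidden layer, a linear output, and a squared-error loss; all names and values are purely illustrative):

```python
import numpy as np

# Tiny example: 2 inputs -> 2 hidden (logistic) -> 1 linear output, squared error.
rng = np.random.default_rng(42)
x = rng.normal(size=(1, 2))            # one training instance
y = np.array([[1.0]])                  # its target value
W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))
W2, b2 = rng.normal(size=(2, 1)), np.zeros((1, 1))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Forward pass: compute and store every node's value.
z1 = x @ W1 + b1
h = sigmoid(z1)
y_hat = h @ W2 + b2
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Reverse pass: propagate gradients from the loss back to every parameter.
d_yhat = y_hat - y                     # dLoss/dy_hat
dW2 = h.T @ d_yhat                     # dLoss/dW2
db2 = d_yhat                           # dLoss/db2
d_h = d_yhat @ W2.T                    # dLoss/dh
d_z1 = d_h * h * (1 - h)               # logistic derivative: sigma'(z) = sigma(z)(1 - sigma(z))
dW1 = x.T @ d_z1
db1 = d_z1

# One Gradient Descent step using these gradients.
eta = 0.1
W1 -= eta * dW1; b1 -= eta * db1
W2 -= eta * dW2; b2 -= eta * db2
```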

### 8. Can you list all the hyperparameters you can tweak in an MLP? If the MLP overfits the training data, how could you tweak these hyperparameters to try to solve the problem?

Here is a list of all the hyperparameters you can tweak in a basic MLP: the number of hidden layers, the number of neurons in each hidden layer, and the activation function used in each hidden layer and in the output layer. In general, the ReLU activation function (or one of its variants; see Chapter 11) is a good default for the hidden layers. For the output layer, in general you will want the logistic activation function for binary classification, the softmax activation function for multiclass classification, or no activation function for regression.

If the MLP overfits the training data, you can try reducing the number of hidden layers and reducing the number of neurons per hidden layer.

### 9. Train a deep MLP on the MNIST dataset and see if you can get over 98% precision. Just like in the last exercise of Chapter 9, try adding all the bells and whistles (i.e., save checkpoints, restore the last checkpoint in case of an interruption, add summaries, plot learning curves using TensorBoard, and so on).

```python
from datetime import datetime
import os

reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1", activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
    loss_summary = tf.summary.scalar('log_loss', loss)

learning_rate = 0.01

with tf.name_scope("train"):
    # the optimizer was not defined in the original snippet; plain Gradient Descent is assumed here
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    accuracy_summary = tf.summary.scalar('accuracy', accuracy)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

def log_dir(prefix=""):
    now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
    root_logdir = "tf_logs"
    if prefix:
        prefix += "_"
    name = prefix + "run_" + now
    return "{}/{}/".format(root_logdir, name)

logdir = log_dir("mnist_dnn")

file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

m, n = X_train.shape
n_epochs = 10001
batch_size = 50
n_batches = int(np.ceil(m / batch_size))

checkpoint_path = "/tmp/my_deep_mnist_model.ckpt"
checkpoint_epoch_path = checkpoint_path + ".epoch"
final_model_path = "./my_deep_mnist_model"

best_loss = np.infty
epochs_without_progress = 0
max_epochs_without_progress = 50

def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch

with tf.Session() as sess:
    if os.path.isfile(checkpoint_epoch_path):
        # if the checkpoint file exists, restore the model and load the epoch number
        with open(checkpoint_epoch_path, "rb") as f:
            start_epoch = int(f.read())
        print("Training was interrupted. Continuing at epoch", start_epoch)
        saver.restore(sess, checkpoint_path)
    else:
        start_epoch = 0
        sess.run(init)

    for epoch in range(start_epoch, n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val, loss_val, accuracy_summary_str, loss_summary_str = sess.run(
            [accuracy, loss, accuracy_summary, loss_summary],
            feed_dict={X: X_valid, y: y_valid})
        # write the summaries so the learning curves show up in TensorBoard
        file_writer.add_summary(accuracy_summary_str, epoch)
        file_writer.add_summary(loss_summary_str, epoch)
        if epoch % 5 == 0:
            print("Epoch:", epoch,
                  "\tValidation accuracy: {:.3f}%".format(accuracy_val * 100),
                  "\tLoss: {:.5f}".format(loss_val))
            saver.save(sess, checkpoint_path)
            with open(checkpoint_epoch_path, "wb") as f:
                f.write(b"%d" % (epoch + 1))
            if loss_val < best_loss:
                saver.save(sess, final_model_path)
                best_loss = loss_val
            else:
                epochs_without_progress += 5
                if epochs_without_progress > max_epochs_without_progress:
                    print("Early stopping")
                    break
```

