Deep Learning Miscellaneous Notes (updated irregularly)
Generally speaking, the objective function used to train a neural network is very hard to optimize; for instance, it has a huge number of local optima. It is commonly believed that using the result of pre-training as the initialization for back-propagation helps place the starting point of (stochastic) gradient descent in a relatively good region, so that it converges to a better (locally optimal) solution. In addition, pre-training is also believed to act as a form of regularization, improving generalization performance. For a detailed discussion of this topic, see (Erhan, Courville, Bengio & Vincent, 2010).
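To make the first point concrete, here is a minimal numpy sketch of greedy layer-wise pre-training with autoencoders (the toy data, layer sizes and learning rate are arbitrary choices of mine, not taken from the references): each layer is trained unsupervised to reconstruct the representation produced by the layer below, and the stacked encoder weights are then used as the initialization for supervised back-propagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_autoencoder(X, n_hidden, lr=0.1, epochs=200):
    """Train a one-hidden-layer autoencoder on X (n_samples x n_in) with plain
    gradient descent and return the encoder weights/bias as an initialization
    for the corresponding layer of the deep network."""
    n_samples, n_in = X.shape
    W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))   # encoder
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, n_in))   # decoder
    b2 = np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)                   # encode
        X_hat = H @ W2 + b2                        # decode (linear output)
        err = (X_hat - X) / n_samples              # grad of 0.5 * mean sq. error
        dW2, db2 = H.T @ err, err.sum(axis=0)
        dZ1 = (err @ W2.T) * H * (1.0 - H)         # back-prop through the sigmoid
        dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1

# Greedy layer-wise pre-training: each layer is trained as an autoencoder on
# the codes produced by the layer below; the stacked encoders then serve as
# the starting point for supervised back-propagation (fine-tuning).
X = rng.normal(size=(500, 20))                     # unlabeled toy data
init_weights, codes = [], X
for n_hidden in (16, 8):
    W, b = pretrain_autoencoder(codes, n_hidden)
    init_weights.append((W, b))
    codes = sigmoid(codes @ W + b)                 # input for the next layer
# `init_weights` is the pre-trained initialization; a supervised output layer
# would be added on top before fine-tuning the whole stack with SGD.
```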
As for whether pre-training is strictly necessary: experimentally, we already know that when there is enough training data, and the (random) initial values and the non-linearity between neurons are chosen well, purely supervised training without any pre-training can also achieve very good results (Ciresan, Meier, Gambardella & Schmidhuber, 2010), (Glorot, Bordes & Bengio, 2011), (Sutskever, Martens, Dahl & Hinton, 2013). However, these results are usually obtained with large amounts of data, combined with various tricks (Montavon, Orr & Muller, 2012), high-performance GPUs and carefully optimized parallel algorithms, after training for a "long enough" time. So it is perhaps not hard to explain why deep neural networks could not be trained very successfully before the era of "big data" and GPU parallelism.
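The "good random initialization plus a suitable non-linearity" recipe can be sketched roughly as follows (the layer sizes are made up for illustration); a scaled uniform initialization of the Glorot/Xavier type combined with ReLU units is one common such choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(n_in, n_out):
    """Scaled uniform ("Glorot/Xavier") initialization: the scale keeps the
    variance of activations and gradients roughly constant across layers."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def relu(z):
    return np.maximum(0.0, z)          # a non-saturating non-linearity

# A deep stack initialized this way can be trained with purely supervised
# (stochastic) gradient descent, without any pre-training stage.
sizes = [784, 512, 512, 10]
params = [(glorot_uniform(a, b), np.zeros(b))
          for a, b in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:        # ReLU on the hidden layers only
            x = relu(x)
    return x                           # logits for the output layer
```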
Using Convolution: conditions and limitations of convolution for representation learning
We can think of the use of convolution as introducing an infinitely strong prior probability distribution over the parameters of a layer. This prior says that the function the layer should learn contains only local interactions and is equivariant to translation. This view of convolution as an infinitely strong prior makes it clear that the efficiency improvements of convolution come with a caveat: convolution is only applicable when the assumptions made by this prior are close to true. The use of convolution constrains the class of functions that the layer can represent. If the function that a layer needs to learn is indeed a local, translation invariant function, then the layer will be dramatically more efficient if it uses convolution rather than matrix multiplication. If the necessary function does not have these properties, then using a convolutional layer will cause the model to have high training error.
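A small numpy experiment (the toy image and kernel are my own choices) makes the "equivariant to translation" part of this prior tangible: shifting the input shifts the feature map by the same amount, and the convolutional layer needs only kernel-sized parameters rather than one weight per input-output pair.

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 2-D cross-correlation with 'valid' boundaries (no padding)."""
    kh, kw = k.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))      # toy "image"
k = rng.standard_normal((3, 3))      # toy 3x3 kernel

y = conv2d_valid(x, k)

# Translate the input one pixel downwards: the feature map translates by the
# same amount (equivariance), apart from the border rows of the 'valid' output.
x_shifted = np.zeros_like(x)
x_shifted[1:, :] = x[:-1, :]
y_shifted = conv2d_valid(x_shifted, k)
print(np.allclose(y_shifted[1:, :], y[:-1, :]))   # True

# Parameter count: 9 shared kernel weights, versus 8*8 * 6*6 = 2304 weights
# for an unconstrained matrix multiplication producing the same output size.
print(k.size, x.size * y.size)
```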
Pooling
In all cases, pooling helps to make the representation become invariant to small translations of the input. This means that if we translate the input by a small amount, the values of most of the pooled outputs do not change. See Fig. 11.6 for an example of how this works. Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is. For example, when determining whether an image contains a face, we need not know the location of the eyes with pixel-perfect accuracy, we just need to know that there is an eye on the left side of the face and an eye on the right side of the face. In other contexts, it is more important to preserve the location of a feature. For example, if we want to find a corner defined by two edges meeting at a specific orientation, we need to preserve the location of the edges well enough to test whether they meet. The use of pooling can be viewed as adding an infinitely strong prior that the function the layer learns must be invariant to small translations. When this assumption is correct, it can greatly improve the statistical efficiency of the network.
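A small numpy sketch (the sparse "detector" map and the pooling size below are my own toy choices) shows this invariance: after shifting the input by one pixel, every detector activation moves, but most of the max-pooled outputs are unchanged because each maximum usually stays inside its pooling region.

```python
import numpy as np

def max_pool(x, size=3):
    """Non-overlapping max pooling over size x size regions."""
    H, W = x.shape
    return x.reshape(H // size, size, W // size, size).max(axis=(1, 3))

rng = np.random.default_rng(0)

# A sparse "detector" map: mostly zeros with a few strong activations,
# roughly what a feature detector followed by a ReLU might produce.
detector = np.zeros((12, 12))
rows = rng.integers(0, 12, size=6)
cols = rng.integers(0, 11, size=6)          # leave room to shift right by 1
detector[rows, cols] = rng.uniform(1.0, 2.0, size=6)

pooled = max_pool(detector)

# Shift every activation one pixel to the right.
shifted = np.zeros_like(detector)
shifted[:, 1:] = detector[:, :-1]
pooled_shifted = max_pool(shifted)

# Most pooled outputs are identical even though every detector output moved.
unchanged = np.mean(pooled == pooled_shifted)
print(f"fraction of pooled outputs unchanged: {unchanged:.2f}")
```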
Locally connected
Locally connected layers are useful when we know that each feature should be a function of a small part of space, but there is no reason to think that the same feature should occur across all of space. For example, if we want to tell if an image is a picture of a face, we only need to look for the mouth in the bottom half of the image.
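A rough 1-D sketch (the sizes are invented for illustration) contrasts the two: a locally connected layer keeps the local receptive field of convolution but gives every output position its own kernel instead of sharing one.

```python
import numpy as np

in_len, k = 32, 3
out_len = in_len - k + 1

def convolution_1d(x, w):
    """One shared k-tap kernel applied at every position (k parameters)."""
    return np.array([w @ x[i:i + k] for i in range(out_len)])

def locally_connected_1d(x, W):
    """A private k-tap kernel per output position (out_len * k parameters)."""
    return np.array([W[i] @ x[i:i + k] for i in range(out_len)])

rng = np.random.default_rng(0)
x = rng.standard_normal(in_len)
w_shared = rng.standard_normal(k)             #   3 parameters (convolution)
W_local = rng.standard_normal((out_len, k))   #  90 parameters (locally connected)
# A fully connected layer of the same output size would need
# out_len * in_len = 960 parameters and would not be restricted to
# local interactions at all.
```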
Bias
Generally, we do not use only a linear operation in order to transform from the inputs to the outputs in a convolutional layer. We generally also add some bias term to each output before applying the nonlinearity. This raises the question of how to share parameters among the biases. For locally connected layers it is natural to give each unit its own bias, and for tiled convolution, it is natural to share the biases with the same tiling pattern as the kernels. For convolutional layers, it is typical to have one bias per channel of the output and share it across all locations within each convolution map. However, if the input is of known, fixed size, it is also possible to learn a separate bias at each location of the output map. Separating the biases may slightly reduce the statistical efficiency of the model, but also allows the model to correct for differences in the image statistics at different locations. For example, when using implicit zero padding, detector units at the edge of the image receive less total input and may need larger biases.
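In numpy terms (the shapes are chosen arbitrarily), the two sharing schemes differ only in the shape of the bias that is broadcast over the feature map:

```python
import numpy as np

C, H, W = 4, 8, 8                          # output channels, height, width
pre_activation = np.random.randn(C, H, W)  # result of the convolutions

# Convolutional layer: one bias per output channel, shared across locations.
bias_per_channel = np.zeros((C, 1, 1))
out_shared = pre_activation + bias_per_channel      # broadcast to (C, H, W)

# Fixed-size input (or locally connected layer): one bias per output unit,
# so units near the zero-padded border can learn larger biases.
bias_per_location = np.zeros((C, H, W))
out_separate = pre_activation + bias_per_location
```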
Chapter 1 Deep Learning for AI
Proceeding from shallow to deep and from the general to the specific, this chapter introduces the background of artificial intelligence → machine learning → representation learning → deep learning, including the main problems each addresses, its strengths, and its shortcomings. The focus is on the core problems that deep learning solves and on a conceptual understanding of what deep learning is.
To summarize, deep learning, the subject of this book, is an approach to AI. Specifically, it is a type of machine learning, a technique that allows computer systems to improve with experience and data. According to the authors of this book, machine learning is the only viable approach to building AI systems that can operate in complicated, real-world environments. Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts. Fig. 1.2 illustrates the relationship between these different AI disciplines. Fig. 1.3 gives a high-level schematic of how each works. Deep learning is the subject of this book. It involves learning multiple levels of representation, corresponding to different levels of abstraction.
[Fig. 1.2: the relationship between the different AI disciplines]
