Notes: CS231n Assignment 2 (Part 1)

The second assignment is quite hard, but working (copying) through it still taught me a lot...

1. Fully-Connected Neural Nets

The first task is to refactor the earlier neural-network code so that we can build fully-connected networks of arbitrary size. The code is organized in a modular way; the idea is as follows:

# Forward pass
def layer_forward(x, w):
  """ Receive inputs x and weights w """
  # Do some computations ...
  z = # ... some intermediate value, cached so the backward pass can reuse it
  # Do some more computations ...
  out = # the output

  cache = (x, w, z, out) # Values we need to compute gradients

  return out, cache

# Backward pass
def layer_backward(dout, cache):
  """
  Receive derivative of loss with respect to outputs and cache,
  and compute derivative with respect to inputs.
  """
  # Unpack cache values
  x, w, z, out = cache

  # Use values in cache to compute derivatives
  dx = # Derivative of loss with respect to x
  dw = # Derivative of loss with respect to w

  return dx, dw

Following this modular design, the assignment asks us to implement the layers below:

import numpy as np  # excerpt from cs231n/layers.py


def affine_forward(x, w, b):
  """
  x has shape (N, d_1, ..., d_k): the first dimension is the minibatch size and
  the remaining dimensions are the shape of one image, so everything after the
  first dimension is flattened into a single vector.

  Inputs:
  - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
  - w: A numpy array of weights, of shape (D, M)
  - b: A numpy array of biases, of shape (M,)

  Returns a tuple of:
  - out: output, of shape (N, M)
  - cache: (x, w, b)
  """
  out = None
  N = x.shape[0]
  x_new = x.reshape(N, -1)        # flatten to a 2D array of shape (N, D)
  out = np.dot(x_new, w) + b
  cache = (x, w, b)               # out itself does not need to be cached
  return out, cache


def affine_backward(dout, cache):
  x, w, b = cache
  dx, dw, db = None, None, None
  dx = np.dot(dout, w.T)
  dx = np.reshape(dx, x.shape)    # back to the original input shape
  x_new = x.reshape(x.shape[0], -1)
  dw = np.dot(x_new.T, dout)
  db = np.sum(dout, axis=0, keepdims=True)   # shape (1, M), matching how b is initialized below
  return dx, dw, db


def relu_forward(x):
  """
  Computes the forward pass for a layer of rectified linear units (ReLUs).

  Input:
  - x: Inputs, of any shape

  Returns a tuple of:
  - out: Output, of the same shape as x
  - cache: x
  """
  out = None
  out = np.maximum(0, x)
  cache = x
  return out, cache


def relu_backward(dout, cache):
  dx, x = None, cache
  #############################################################################
  # TODO: Implement the ReLU backward pass.                                   #
  #############################################################################
  dx = dout
  dx[x <= 0] = 0                  # gradient only flows where the input was positive
  #############################################################################
  #                             END OF YOUR CODE                              #
  #############################################################################
  return dx

The one formula above worth a second look is db = np.sum(dout, axis=0, keepdims=True). At first glance it looks like an averaging step is missing, but the sum is actually correct: in the forward pass b is broadcast to every example in the minibatch, so its gradient is the sum of dout over the batch dimension, and the 1/N averaging is already baked into dout by the softmax loss. The gradient-check code does not need any special treatment for it.
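To convince yourself, here is a minimal numerical check. This is just a sketch: num_grad is a small hypothetical finite-difference helper written here for illustration, not the assignment's eval_numerical_gradient_array, and the sizes are made up.

import numpy as np

def num_grad(f, param, h=1e-5):
  # Centered finite differences over every entry of param (modified in place).
  grad = np.zeros_like(param)
  it = np.nditer(param, flags=['multi_index'])
  while not it.finished:
    idx = it.multi_index
    old = param[idx]
    param[idx] = old + h
    fp = f()
    param[idx] = old - h
    fm = f()
    param[idx] = old
    grad[idx] = (fp - fm) / (2 * h)
    it.iternext()
  return grad

x = np.random.randn(5, 4, 3)   # 5 examples, flattened to D = 12 inside affine_forward
w = np.random.randn(12, 7)
b = np.random.randn(1, 7)      # same (1, M) bias shape used throughout this post
dout = np.random.randn(5, 7)   # a made-up upstream gradient

_, cache = affine_forward(x, w, b)
_, _, db = affine_backward(dout, cache)

# For a fixed dout, the gradient of sum(out * dout) with respect to b is exactly db.
db_num = num_grad(lambda: np.sum(affine_forward(x, w, b)[0] * dout), b)
print(np.max(np.abs(db - db_num)))   # should be ~1e-8 or smaller

The sum over the batch dimension matches the numerical gradient without any extra 1/N factor.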

With these two basic layers done, we can build a "sandwich" layer. Since the fc-relu pattern is very common, the assignment provides it directly:

def affine_relu_forward(x, w, b):
  """
  Convenience layer that performs an affine transform followed by a ReLU

  Inputs:
  - x: Input to the affine layer
  - w, b: Weights for the affine layer

  Returns a tuple of:
  - out: Output from the ReLU
  - cache: Object to give to the backward pass
  """
  a, fc_cache = affine_forward(x, w, b)
  out, relu_cache = relu_forward(a)
  cache = (fc_cache, relu_cache)
  return out, cache


def affine_relu_backward(dout, cache):
  """
  Backward pass for the affine-relu convenience layer
  """
  fc_cache, relu_cache = cache
  da = relu_backward(dout, relu_cache)
  dx, dw, db = affine_backward(da, fc_cache)
  return dx, dw, db
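For reference, a minimal shape check of the sandwich layer could look like this (the sizes below are arbitrary, chosen only for illustration):

import numpy as np

x = np.random.randn(2, 3, 4)          # 2 examples, each flattened to D = 12
w = np.random.randn(12, 5)
b = np.random.randn(1, 5)

out, cache = affine_relu_forward(x, w, b)
print(out.shape)      # (2, 5), with every entry >= 0 because of the ReLU

dout = np.random.randn(2, 5)
dx, dw, db = affine_relu_backward(dout, cache)
print(dx.shape)       # (2, 3, 4)  -- same shape as x
print(dw.shape)       # (12, 5)
print(db.shape)       # (1, 5)     -- keepdims=True in affine_backward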

There is another wrapper built on top of these layers that I won't go through here; instead, let's go straight to the most powerful class so far, FullyConnectedNet. Code and comments first:

import numpy as np  # as in cs231n/classifiers/fc_net.py

from cs231n.layers import *
from cs231n.layer_utils import *


class FullyConnectedNet(object):
  """
  A fully-connected neural network with an arbitrary number of hidden layers,
  ReLU nonlinearities, and a softmax loss function. This will also implement
  dropout and batch normalization as options. For a network with L layers,
  the architecture will be

  {affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax

  where batch normalization and dropout are optional, and the {...} block is
  repeated L - 1 times.

  Similar to the TwoLayerNet above, learnable parameters are stored in the
  self.params dictionary and will be learned using the Solver class.
  """

  def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
               dropout=0, use_batchnorm=False, reg=0.0,
               weight_scale=1e-2, dtype=np.float32, seed=None):
    """
    Initialize a new FullyConnectedNet.

    Inputs:
    - hidden_dims: A list of integers giving the size of each hidden layer.
    - input_dim: An integer giving the size of the input.
    - num_classes: An integer giving the number of classes to classify.
    - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=0 then
      the network should not use dropout at all.
    - use_batchnorm: Whether or not the network should use batch normalization.
    - reg: Scalar giving L2 regularization strength.
    - weight_scale: Scalar giving the standard deviation for random
      initialization of the weights.
    - dtype: A numpy datatype object; all computations will be performed using
      this datatype. float32 is faster but less accurate, so you should use
      float64 for numeric gradient checking.
    - seed: If not None, then pass this random seed to the dropout layers. This
      will make the dropout layers deterministic so we can gradient check the
      model.
    """
    self.use_batchnorm = use_batchnorm
    self.use_dropout = dropout > 0
    self.reg = reg
    self.num_layers = 1 + len(hidden_dims)
    self.dtype = dtype
    self.params = {}

    ############################################################################
    # TODO: Initialize the parameters of the network, storing all values in    #
    # the self.params dictionary. Store weights and biases for the first layer #
    # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
    # initialized from a normal distribution with standard deviation equal to  #
    # weight_scale and biases should be initialized to zero.                   #
    #                                                                          #
    # When using batch normalization, store scale and shift parameters for the #
    # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
    # beta2, etc. Scale parameters should be initialized to one and shift      #
    # parameters should be initialized to zero.                                #
    ############################################################################
    # Sizes of every layer: hidden_dims is a list, so wrap the input and output
    # dimensions around it to get one list covering the whole network.
    layers_dims = [input_dim] + hidden_dims + [num_classes]
    for i in xrange(self.num_layers):
      self.params['W' + str(i + 1)] = weight_scale * np.random.randn(layers_dims[i], layers_dims[i + 1])
      self.params['b' + str(i + 1)] = np.zeros((1, layers_dims[i + 1]))
      if self.use_batchnorm and i < len(hidden_dims):  # the last layer needs no batchnorm
        self.params['gamma' + str(i + 1)] = np.ones((1, layers_dims[i + 1]))
        self.params['beta' + str(i + 1)] = np.zeros((1, layers_dims[i + 1]))
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    # When using dropout we need to pass a dropout_param dictionary to each
    # dropout layer so that the layer knows the dropout probability and the mode
    # (train / test). You can pass the same dropout_param to each dropout layer.
    self.dropout_param = {}
    if self.use_dropout:
      self.dropout_param = {'mode': 'train', 'p': dropout}
      if seed is not None:
        self.dropout_param['seed'] = seed

    # With batch normalization we need to keep track of running means and
    # variances, so we need to pass a special bn_param object to each batch
    # normalization layer. You should pass self.bn_params[0] to the forward pass
    # of the first batch normalization layer, self.bn_params[1] to the forward
    # pass of the second batch normalization layer, etc.
    self.bn_params = []
    if self.use_batchnorm:
      self.bn_params = [{'mode': 'train'} for i in xrange(self.num_layers - 1)]

    # Cast all parameters to the correct datatype
    for k, v in self.params.iteritems():
      self.params[k] = v.astype(dtype)


  def loss(self, X, y=None):
    """
    Compute loss and gradient for the fully-connected net.

    Input / output: Same as TwoLayerNet above.
    """
    X = X.astype(self.dtype)
    mode = 'test' if y is None else 'train'

    # Set train/test mode for batchnorm params and dropout param since they
    # behave differently during training and testing.
    if self.dropout_param is not None:
      self.dropout_param['mode'] = mode
    if self.use_batchnorm:
      for bn_param in self.bn_params:
        bn_param['mode'] = mode

    scores = None
    ############################################################################
    # TODO: Implement the forward pass for the fully-connected net, computing  #
    # the class scores for X and storing them in the scores variable.          #
    #                                                                          #
    # When using dropout, you'll need to pass self.dropout_param to each       #
    # dropout forward pass.                                                    #
    #                                                                          #
    # When using batch normalization, you'll need to pass self.bn_params[0] to #
    # the forward pass for the first batch normalization layer, pass           #
    # self.bn_params[1] to the forward pass for the second batch normalization #
    # layer, etc.                                                              #
    ############################################################################
    h, cache1, cache2, cache3, cache4, bn, out = {}, {}, {}, {}, {}, {}, {}
    out[0] = X  # store the output of every layer; by this convention X is out[0]

    # Forward pass: compute loss
    for i in xrange(self.num_layers - 1):
      # Fetch the parameters of layer i
      w, b = self.params['W' + str(i + 1)], self.params['b' + str(i + 1)]
      if self.use_batchnorm:
        gamma, beta = self.params['gamma' + str(i + 1)], self.params['beta' + str(i + 1)]
        h[i], cache1[i] = affine_forward(out[i], w, b)
        bn[i], cache2[i] = batchnorm_forward(h[i], gamma, beta, self.bn_params[i])
        out[i + 1], cache3[i] = relu_forward(bn[i])
        if self.use_dropout:
          out[i + 1], cache4[i] = dropout_forward(out[i + 1], self.dropout_param)
      else:
        out[i + 1], cache3[i] = affine_relu_forward(out[i], w, b)
        if self.use_dropout:
          out[i + 1], cache4[i] = dropout_forward(out[i + 1], self.dropout_param)

    W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
    scores, cache = affine_forward(out[self.num_layers - 1], W, b)  # the final, affine-only layer
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    # If test mode return early
    if mode == 'test':
      return scores

    loss, grads = 0.0, {}
    ############################################################################
    # TODO: Implement the backward pass for the fully-connected net. Store the #
    # loss in the loss variable and gradients in the grads dictionary. Compute #
    # data loss using softmax, and make sure that grads[k] holds the gradients #
    # for self.params[k]. Don't forget to add L2 regularization!               #
    #                                                                          #
    # When using batch normalization, you don't need to regularize the scale   #
    # and shift parameters.                                                    #
    #                                                                          #
    # NOTE: To ensure that your implementation matches ours and you pass the   #
    # automated tests, make sure that your L2 regularization includes a factor #
    # of 0.5 to simplify the expression for the gradient.                      #
    ############################################################################
    data_loss, dscores = softmax_loss(scores, y)
    reg_loss = 0
    for i in xrange(self.num_layers):
      reg_loss += 0.5 * self.reg * np.sum(self.params['W' + str(i + 1)] * self.params['W' + str(i + 1)])
    loss = data_loss + reg_loss

    # Backward pass: compute gradients
    dout, dbn, dh, ddrop = {}, {}, {}, {}
    t = self.num_layers - 1
    # cache here is the one returned by the final affine_forward above
    dout[t], grads['W' + str(t + 1)], grads['b' + str(t + 1)] = affine_backward(dscores, cache)
    for i in xrange(t):
      if self.use_batchnorm:
        if self.use_dropout:
          dout[t - i] = dropout_backward(dout[t - i], cache4[t - 1 - i])
        dbn[t - 1 - i] = relu_backward(dout[t - i], cache3[t - 1 - i])
        dh[t - 1 - i], grads['gamma' + str(t - i)], grads['beta' + str(t - i)] = batchnorm_backward(dbn[t - 1 - i], cache2[t - 1 - i])
        dout[t - 1 - i], grads['W' + str(t - i)], grads['b' + str(t - i)] = affine_backward(dh[t - 1 - i], cache1[t - 1 - i])
      else:
        if self.use_dropout:
          dout[t - i] = dropout_backward(dout[t - i], cache4[t - 1 - i])
        dout[t - 1 - i], grads['W' + str(t - i)], grads['b' + str(t - i)] = affine_relu_backward(dout[t - i], cache3[t - 1 - i])

    # Add the regularization gradient contribution
    for i in xrange(self.num_layers):
      grads['W' + str(i + 1)] += self.reg * self.params['W' + str(i + 1)]
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    return loss, grads
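As a quick sanity check, mirroring what the assignment notebook does: with tiny random weights and no regularization, the initial softmax loss should be close to log(C) for C classes. The sketch below assumes the layer functions used inside the class (softmax_loss, batchnorm_forward, dropout_forward, ...) are in scope, as they are in fc_net.py; the toy sizes here are made up.

import numpy as np

np.random.seed(0)
N, D, C = 3, 5 * 4 * 3, 7
X = np.random.randn(N, 5, 4, 3)
y = np.random.randint(C, size=N)

model = FullyConnectedNet([100, 50], input_dim=D, num_classes=C,
                          reg=0.0, weight_scale=1e-2, dtype=np.float64)
loss, grads = model.loss(X, y)
print(loss)                    # roughly log(7) ~ 1.95 when the weights are tiny
print(sorted(grads.keys()))    # ['W1', 'W2', 'W3', 'b1', 'b2', 'b3']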

     Because the code above is upper-level code, it doesn't have to worry about how each layer's backprop is implemented (we already did that), so it is fairly easy to follow. But we are not done yet: we still need a Solver to actually optimize the network.

import numpy as np

from cs231n import optim


class Solver(object):
  """
  A Solver encapsulates all the logic necessary for training classification
  models. The Solver performs stochastic gradient descent using different
  update rules defined in optim.py.

  The solver accepts both training and validation data and labels so it can
  periodically check classification accuracy on both training and validation
  data to watch out for overfitting.

  To train a model, you will first construct a Solver instance, passing the
  model, dataset, and various options (learning rate, batch size, etc) to the
  constructor. You will then call the train() method to run the optimization
  procedure and train the model.

  After the train() method returns, model.params will contain the parameters
  that performed best on the validation set over the course of training.
  In addition, the instance variable solver.loss_history will contain a list
  of all losses encountered during training and the instance variables
  solver.train_acc_history and solver.val_acc_history will be lists containing
  the accuracies of the model on the training and validation set at each epoch.

  Example usage might look something like this:

  data = {
    'X_train': # training data
    'y_train': # training labels
    'X_val': # validation data
    'y_val': # validation labels
  }
  model = MyAwesomeModel(hidden_size=100, reg=10)
  solver = Solver(model, data,
                  update_rule='sgd',
                  optim_config={
                    'learning_rate': 1e-3,
                  },
                  lr_decay=0.95,
                  num_epochs=10, batch_size=100,
                  print_every=100)
  solver.train()


  A Solver works on a model object that must conform to the following API:

  - model.params must be a dictionary mapping string parameter names to numpy
    arrays containing parameter values.

  - model.loss(X, y) must be a function that computes training-time loss and
    gradients, and test-time classification scores, with the following inputs
    and outputs:

    Inputs:
    - X: Array giving a minibatch of input data of shape (N, d_1, ..., d_k)
    - y: Array of labels, of shape (N,) giving labels for X where y[i] is the
      label for X[i].

    Returns:
    If y is None, run a test-time forward pass and return:
    - scores: Array of shape (N, C) giving classification scores for X where
      scores[i, c] gives the score of class c for X[i].

    If y is not None, run a training time forward and backward pass and return
    a tuple of:
    - loss: Scalar giving the loss
    - grads: Dictionary with the same keys as self.params mapping parameter
      names to gradients of the loss with respect to those parameters.
  """

  def __init__(self, model, data, **kwargs):
    """
    Construct a new Solver instance.

    Required arguments:
    - model: A model object conforming to the API described above
    - data: A dictionary of training and validation data with the following:
      'X_train': Array of shape (N_train, d_1, ..., d_k) giving training images
      'X_val': Array of shape (N_val, d_1, ..., d_k) giving validation images
      'y_train': Array of shape (N_train,) giving labels for training images
      'y_val': Array of shape (N_val,) giving labels for validation images

    Optional arguments:
    - update_rule: A string giving the name of an update rule in optim.py.
      Default is 'sgd'.
    - optim_config: A dictionary containing hyperparameters that will be
      passed to the chosen update rule. Each update rule requires different
      hyperparameters (see optim.py) but all update rules require a
      'learning_rate' parameter so that should always be present.
    - lr_decay: A scalar for learning rate decay; after each epoch the learning
      rate is multiplied by this value.
    - batch_size: Size of minibatches used to compute loss and gradient during
      training.
    - num_epochs: The number of epochs to run for during training.
    - print_every: Integer; training losses will be printed every print_every
      iterations.
    - verbose: Boolean; if set to false then no output will be printed during
      training.
    """
    self.model = model
    self.X_train = data['X_train']
    self.y_train = data['y_train']
    self.X_val = data['X_val']
    self.y_val = data['y_val']

    # Unpack keyword arguments
    self.update_rule = kwargs.pop('update_rule', 'sgd')
    self.optim_config = kwargs.pop('optim_config', {})
    self.lr_decay = kwargs.pop('lr_decay', 1.0)
    self.batch_size = kwargs.pop('batch_size', 100)
    self.num_epochs = kwargs.pop('num_epochs', 10)

    self.print_every = kwargs.pop('print_every', 10)
    self.verbose = kwargs.pop('verbose', True)

    # Throw an error if there are extra keyword arguments
    if len(kwargs) > 0:
      extra = ', '.join('"%s"' % k for k in kwargs.keys())
      raise ValueError('Unrecognized arguments %s' % extra)

    # Make sure the update rule exists, then replace the string
    # name with the actual function
    if not hasattr(optim, self.update_rule):
      raise ValueError('Invalid update_rule "%s"' % self.update_rule)
    self.update_rule = getattr(optim, self.update_rule)

    self._reset()


  def _reset(self):
    """
    Set up some book-keeping variables for optimization. Don't call this
    manually.
    """
    # Set up some variables for book-keeping
    self.epoch = 0
    self.best_val_acc = 0
    self.best_params = {}
    self.loss_history = []
    self.train_acc_history = []
    self.val_acc_history = []

    # Make a deep copy of the optim_config for each parameter
    self.optim_configs = {}
    for p in self.model.params:
      d = {k: v for k, v in self.optim_config.iteritems()}
      self.optim_configs[p] = d


  def _step(self):
    """
    Make a single gradient update. This is called by train() and should not
    be called manually.
    """
    # Make a minibatch of training data
    num_train = self.X_train.shape[0]
    batch_mask = np.random.choice(num_train, self.batch_size)
    X_batch = self.X_train[batch_mask]
    y_batch = self.y_train[batch_mask]

    # Compute loss and gradient
    loss, grads = self.model.loss(X_batch, y_batch)
    self.loss_history.append(loss)

    # Perform a parameter update
    for p, w in self.model.params.iteritems():
      dw = grads[p]
      config = self.optim_configs[p]
      next_w, next_config = self.update_rule(w, dw, config)  # many different update rules are possible
      self.model.params[p] = next_w
      self.optim_configs[p] = next_config


  def check_accuracy(self, X, y, num_samples=None, batch_size=100):
    """
    Check accuracy of the model on the provided data.

    Inputs:
    - X: Array of data, of shape (N, d_1, ..., d_k)
    - y: Array of labels, of shape (N,)
    - num_samples: If not None, subsample the data and only test the model
      on num_samples datapoints.
    - batch_size: Split X and y into batches of this size to avoid using too
      much memory.

    Returns:
    - acc: Scalar giving the fraction of instances that were correctly
      classified by the model.
    """

    # Maybe subsample the data
    N = X.shape[0]
    if num_samples is not None and N > num_samples:
      mask = np.random.choice(N, num_samples)
      N = num_samples
      X = X[mask]
      y = y[mask]

    # Compute predictions in batches
    num_batches = N / batch_size
    if N % batch_size != 0:
      num_batches += 1
    y_pred = []
    for i in xrange(num_batches):
      start = i * batch_size
      end = (i + 1) * batch_size
      scores = self.model.loss(X[start:end])
      y_pred.append(np.argmax(scores, axis=1))
    y_pred = np.hstack(y_pred)
    acc = np.mean(y_pred == y)

    return acc


  def train(self):
    """
    Run optimization to train the model.
    """
    num_train = self.X_train.shape[0]
    iterations_per_epoch = max(num_train / self.batch_size, 1)
    num_iterations = self.num_epochs * iterations_per_epoch

    for t in xrange(num_iterations):
      self._step()

      # Maybe print training loss
      if self.verbose and t % self.print_every == 0:
        print '(Iteration %d / %d) loss: %f' % (
               t + 1, num_iterations, self.loss_history[-1])

      # At the end of every epoch, increment the epoch counter and decay the
      # learning rate.
      epoch_end = (t + 1) % iterations_per_epoch == 0
      if epoch_end:
        self.epoch += 1
        for k in self.optim_configs:
          self.optim_configs[k]['learning_rate'] *= self.lr_decay

      # Check train and val accuracy on the first iteration, the last
      # iteration, and at the end of each epoch.
      first_it = (t == 0)
      last_it = (t == num_iterations - 1)
      if first_it or last_it or epoch_end:
        train_acc = self.check_accuracy(self.X_train, self.y_train,
                                        num_samples=1000)
        val_acc = self.check_accuracy(self.X_val, self.y_val)
        self.train_acc_history.append(train_acc)
        self.val_acc_history.append(val_acc)

        if self.verbose:
          print '(Epoch %d / %d) train acc: %f; val_acc: %f' % (
                 self.epoch, self.num_epochs, train_acc, val_acc)

        # Keep track of the best model
        if val_acc > self.best_val_acc:
          self.best_val_acc = val_acc
          self.best_params = {}
          for k, v in self.model.params.iteritems():
            self.best_params[k] = v.copy()

    # At the end of training swap the best params into the model
    self.model.params = self.best_params

   At this point we have essentially built a small framework for fully-connected deep networks. A quick recap of what was done:

    1. Implemented the forward and backward passes for the fully-connected (affine) layer and the ReLU layer.

    2. Implemented the sandwich functions, which simply chain the basic layers together.

    3. Implemented the FullyConnectedNet class: given the hyperparameters of the network, it returns the corresponding model.

    4. Implemented the Solver class: given a model and the data, it runs the actual optimization.

   Some points to keep in mind:

    1. The forward pass needs to cache some intermediate values, so each layer returns both cache and out.

    2. Wiring up multiple layers requires careful indexing: for layer i, the input is out[i], the output is out[i+1], and its cached values live in cache[i].

    3. The plain SGD update rule is still rather naive; other update rules are worth trying later (see the sketch right after this list).
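For example, a momentum update could be written in the same interface that Solver expects from optim.py, i.e. update_rule(w, dw, config) -> (next_w, next_config). This is only a sketch of the idea, not the assignment's reference implementation:

def sgd_momentum(w, dw, config=None):
  # SGD with momentum: keep a velocity that accumulates the negative gradient.
  if config is None:
    config = {}
  config.setdefault('learning_rate', 1e-2)
  config.setdefault('momentum', 0.9)
  v = config.get('velocity', np.zeros_like(w))

  v = config['momentum'] * v - config['learning_rate'] * dw
  next_w = w + v

  config['velocity'] = v   # carried over to the next call via next_config
  return next_w, config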

  With all of the code above written, what next?

  There are a few very useful tricks worth remembering.

  When you have built a neural network and are about to run it on your dataset, don't start with the full, original dataset. The best first step is to overfit a small subset of the data, which proves that your network is actually capable of learning; at that stage you can tune aggressively... my personal suggestion is a smallish learning rate, more iterations, and a weight scale chosen case by case.
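As a concrete illustration, the overfitting experiment could look like the sketch below. The data dictionary layout follows the Solver docstring and is assumed to be the CIFAR-10 dict loaded elsewhere in the assignment; the layer sizes and hyperparameters are made-up starting points, not recommended values.

import numpy as np

num_train = 50   # deliberately tiny subset
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

model = FullyConnectedNet([100, 100], weight_scale=1e-2, dtype=np.float64)
solver = Solver(model, small_data,
                update_rule='sgd',
                optim_config={'learning_rate': 1e-2},
                lr_decay=0.95,
                num_epochs=20, batch_size=25,
                print_every=10)
solver.train()
# If the network can learn at all, training accuracy should reach ~1.0 on these
# 50 examples within a few epochs, while validation accuracy stays low.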

Summary: there is a lot of material in the second assignment; that's all for this post, to be continued.

 

posted @ 2016-08-12 15:28  亦轩Dhc