
The learning-rate note quoted from the Inception training code:

# **IMPORTANT**
# Please note that this learning rate schedule is heavily dependent on the
# hardware architecture, batch size and any changes to the model architecture
# specification. Selecting a finely tuned learning rate schedule is an
# empirical process that requires some experimentation. Please see the
# README.md for more guidance and discussion.
#
# With 8 Tesla K40's and a batch size = 256, the following setup achieves
# precision@1 = 73.5% after 100 hours and 100K steps (20 epochs).
# Learning rate decay factor selected from http://arxiv.org/abs/1404.5997.
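As a sketch of the decay scheme this comment (and the Krizhevsky paper it cites) describes, TF 1.x's tf.train.exponential_decay can express it. The 0.1 / 0.16 / 30-epoch constants below are the defaults I recall from the Inception training script; treat them as assumptions:

```python
import tensorflow as tf  # TF 1.x API

# Constants assumed to match the Inception script's defaults (assumptions):
INITIAL_LEARNING_RATE = 0.1        # starting learning rate
LEARNING_RATE_DECAY_FACTOR = 0.16  # multiply the LR by this each interval
NUM_EPOCHS_PER_DECAY = 30.0        # decay once every 30 epochs
EXAMPLES_PER_EPOCH = 1281167       # ILSVRC-2012 training-set size
BATCH_SIZE = 256                   # the batch size the comment above assumes

global_step = tf.train.get_or_create_global_step()
decay_steps = int(EXAMPLES_PER_EPOCH / BATCH_SIZE * NUM_EPOCHS_PER_DECAY)

# Staircase decay: lr = 0.1 * 0.16 ** floor(global_step / decay_steps)
lr = tf.train.exponential_decay(INITIAL_LEARNING_RATE,
                                global_step,
                                decay_steps,
                                LEARNING_RATE_DECAY_FACTOR,
                                staircase=True)
```

Note that decay_steps is derived from the batch size, so dropping from 256 to 32 stretches each decay interval 8x in steps; this is exactly why the comment warns that the schedule must be retuned for different hardware.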

Open TensorBoard: tensorboard --logdir=/tmp/imagenet_train
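For that command to show a loss curve, the training loop has to write summary events into /tmp/imagenet_train. A minimal TF 1.x sketch; the dummy loss variable here just stands in for the real training loss:

```python
import tensorflow as tf  # TF 1.x API

# Dummy stand-in for the real training loss, so this snippet runs on its own.
loss = tf.Variable(13.0, name='total_loss')
train_op = tf.assign_sub(loss, 0.01)   # fake "training step" that lowers the loss

tf.summary.scalar('total_loss', loss)  # the curve TensorBoard will plot
merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('/tmp/imagenet_train')  # matches --logdir above

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        summary, _ = sess.run([merged, train_op])
        writer.add_summary(summary, step)  # one point per step on the loss curve
writer.close()
```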

ImageNet training data: roughly 1,000k images (the actual ILSVRC-2012 training set is ~1,281k). Training the Inception v3 network on a GTX 1060 with batch_size = 32 runs at about 32 examples/sec.
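At that throughput the epoch time follows directly; a quick back-of-the-envelope check in plain Python, using the post's round 1,000k figure:

```python
examples_per_sec = 32           # measured throughput on the GTX 1060
batch_size = 32
examples_per_epoch = 1_000_000  # the post's round figure; ILSVRC-2012 is ~1,281,167

steps_per_epoch = examples_per_epoch / batch_size               # = 31,250 steps
hours_per_epoch = examples_per_epoch / examples_per_sec / 3600  # ~ 8.7 hours

print('%d steps/epoch, %.1f h/epoch' % (steps_per_epoch, hours_per_epoch))
# -> 31250 steps/epoch, 8.7 h/epoch, so the post's 70k steps is roughly 2 epochs
```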

After 20 hours and 70k steps, 32 × 70k ≈ 2,240k examples had been seen (about 2 epochs); the loss dropped from 13 to 8, but its rate of decrease was already slowing.

After 55 hours and 204k steps, 32 × 204k ≈ 6,530k examples (about 6 epochs); the loss had dropped from 13 to 7, and from around step 120k the curve was nearly flat.

After 4 days and 1 hour (97 h) and 360k steps, 32 × 360k ≈ 11,520k examples (about 10 epochs); the loss was still around 7, the curve having stayed nearly flat since step 120k.

Eval: precision @ 1 = 0.5584 recall @ 5 = 0.8052 [50016 examples]
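As a minimal sketch of how these two metrics are typically computed in TF 1.x, using tf.nn.in_top_k (as I recall, the Inception eval script does the same); the random logits and labels below are stand-ins for real model outputs:

```python
import numpy as np
import tensorflow as tf  # TF 1.x API

# Random stand-ins for the model's logits and the validation labels,
# so the snippet runs on its own (batch of 8, 1000 ImageNet classes).
logits = tf.constant(np.random.randn(8, 1000).astype(np.float32))
labels = tf.constant(np.random.randint(0, 1000, size=8).astype(np.int32))

top_1 = tf.nn.in_top_k(logits, labels, 1)  # True where argmax == label
top_5 = tf.nn.in_top_k(logits, labels, 5)  # True where label is in the top 5

with tf.Session() as sess:
    t1, t5 = sess.run([top_1, top_5])
print('precision @ 1 = %.4f recall @ 5 = %.4f' % (t1.mean(), t5.mean()))
```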


A similar question: https://stackoverflow.com/questions/38259166/training-tensorflow-inception-v3-imagenet-on-modest-hardware-setup. That poster also fell short of the best reported numbers:

2016-06-06 12:07:52.245005: precision @ 1 = 0.5767 recall @ 5 = 0.8143 [50016 examples]
2016-06-09 22:35:10.118852: precision @ 1 = 0.5957 recall @ 5 = 0.8294 [50016 examples]
2016-06-14 15:30:59.532629: precision @ 1 = 0.6112 recall @ 5 = 0.8396 [50016 examples]
2016-06-20 13:57:14.025797: precision @ 1 = 0.6136 recall @ 5 = 0.8423 [50016 examples]

On a small hardware setup like yours, it will be difficult to achieve maximum performance. Generally speaking, CNNs achieve their best performance with the largest batch sizes possible, which means that CNN training is often limited by the maximum batch size that can fit in GPU memory.
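One practical way to act on that advice is to probe for the largest batch size the GPU will hold by growing it until TensorFlow raises an out-of-memory error. A hedged sketch; the toy matmul model below merely stands in for building Inception v3 at the given batch size:

```python
import tensorflow as tf  # TF 1.x API

def fits_in_memory(batch_size):
    """Run one training step at this batch size; False on GPU out-of-memory."""
    tf.reset_default_graph()
    # Toy stand-in for the real model: 299x299x3 inputs through one big
    # matmul. Replace this with the actual Inception v3 graph.
    images = tf.random_normal([batch_size, 299, 299, 3])
    flat = tf.reshape(images, [batch_size, 299 * 299 * 3])
    weights = tf.Variable(tf.random_normal([299 * 299 * 3, 1000]))
    loss = tf.reduce_mean(tf.matmul(flat, weights))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
    try:
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            sess.run(train_op)
        return True
    except tf.errors.ResourceExhaustedError:
        return False

batch_size = 16
while fits_in_memory(batch_size * 2):  # double until the next size would OOM
    batch_size *= 2
print('largest batch size that fit:', batch_size)
```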

 
posted on 2017-10-03 11:46 by 浅蓝