About single crop / multiple crops

What are single-crop and multi-crop testing?

shicaiyang (星空下的巫师) explains [1]: training of course uses random crops, but at test time there are several tricks:

• Single crop: simply resize the test image to some scale (e.g. 256xN), take the center crop (the region in the middle of the image, e.g. 224x224), and feed it to the CNN to evaluate the model.
• Multiple crops can take many concrete forms, chosen as you like, for example:
• 10 crops: take the four corners (top-left, bottom-left, top-right, bottom-right) and the center, plus the horizontal flip of each. The CNN's predictions on these 10 crops are averaged to give the final prediction.
• 144 crops: slightly more involved; taking ImageNet as an example:
• First resize the image to 4 scales (e.g. 256xN, 320xN, 384xN, 480xN)
• At each scale, take square regions at 3 positions (leftmost, center, rightmost)
• For each square region, take the 10 crops of 224x224 described above, giving 4x3x10 = 120 crops
• Also resize each square region directly to 224x224 and take its horizontal flip, giving another 4x3x2 = 24 crops
• In total, 120 + 24 = 144 crops; the average of the predictions over all crops is the model's output for the test image
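As a sketch (not from the original post), the 10-crop scheme above can be written with plain NumPy slicing; the 256x256 input and 224 crop size are assumptions matching the example:

```python
import numpy as np

def ten_crops(img, crop=224):
    """Return the 10 crops described above: 4 corners + center,
    plus the horizontal flip of each. img is an HxWxC array."""
    h, w = img.shape[:2]
    ch, cw = (h - crop) // 2, (w - crop) // 2
    offsets = [(0, 0), (0, w - crop),                 # top-left, top-right
               (h - crop, 0), (h - crop, w - crop),   # bottom-left, bottom-right
               (ch, cw)]                              # center
    crops = [img[y:y + crop, x:x + crop] for y, x in offsets]
    crops += [c[:, ::-1] for c in crops]              # horizontal flips
    return np.stack(crops)                            # shape (10, crop, crop, C)

batch = ten_crops(np.zeros((256, 256, 3), dtype=np.float32))
print(batch.shape)  # (10, 224, 224, 3)
```

At test time the model's softmax outputs on these 10 crops would be averaged; the 144-crop scheme just repeats this over 4 scales and 3 square positions per scale.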

The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips).

The easiest and most common method to reduce overfitting on image data is to artificially enlarge
the dataset using label-preserving transformations (e.g., [25, 4, 5]). We employ two distinct forms
of data augmentation, both of which allow transformed images to be produced from the original
images with very little computation, so the transformed images do not need to be stored on disk.
In our implementation, the transformed images are generated in Python code on the CPU while the
GPU is training on the previous batch of images. So these data augmentation schemes are, in effect,
computationally free.
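The CPU-generates-while-GPU-trains pattern described above can be sketched with a background thread feeding a bounded queue; `load_batch` and `augment` here are hypothetical placeholders, not AlexNet's actual code:

```python
import queue
import threading

def prefetch_batches(load_batch, augment, n_batches, maxsize=2):
    """Produce augmented batches on a background (CPU) thread while the
    consumer (e.g. the GPU training step) works on the previous batch."""
    q = queue.Queue(maxsize=maxsize)

    def producer():
        for i in range(n_batches):
            q.put(augment(load_batch(i)))  # CPU-side augmentation
        q.put(None)  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    while (batch := q.get()) is not None:
        yield batch

# Hypothetical usage: each batch is already augmented by the time the
# training step asks for it, so augmentation is effectively free.
for batch in prefetch_batches(lambda i: [i], lambda b: [x * 2 for x in b], 3):
    pass  # train_step(batch) would run here
```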

The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224x224 patches (and their horizontal reflections) from the 256x256 images and training our network on these extracted patches. This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks. At test time, the network makes a prediction by extracting five 224x224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network's softmax layer on the ten patches.
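The factor of 2048 is just arithmetic: the paper counts 256 - 224 = 32 distinct offsets per axis, squared for the two axes, times 2 for the horizontal flip:

```python
# Augmentation factor as counted in the AlexNet paper: 32 offsets per axis
# (256 - 224), squared, times 2 for the horizontal flip.
offsets_per_axis = 256 - 224          # 32
augmentation_factor = offsets_per_axis ** 2 * 2
print(augmentation_factor)  # 2048
```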

name: "AlexNet"
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 227
    mean_file: "data/ilsvrc12/imagenet_mean.binaryproto"
  }
  data_param {
    source: "examples/imagenet/ilsvrc12_train_lmdb"
    batch_size: 256
    backend: LMDB
  }
}
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    mirror: false
    crop_size: 227
    mean_file: "data/ilsvrc12/imagenet_mean.binaryproto"
  }
  data_param {
    source: "examples/imagenet/ilsvrc12_val_lmdb"
    batch_size: 50
    backend: LMDB
  }
}


int h_off = 0;
int w_off = 0;
if (crop_size) {
  height = crop_size;
  width = crop_size;
  // We only do random crop when we do training.
  if (phase_ == TRAIN) {
    h_off = Rand(datum_height - crop_size + 1);
    w_off = Rand(datum_width - crop_size + 1);
  } else {
    h_off = (datum_height - crop_size) / 2;
    w_off = (datum_width - crop_size) / 2;
  }
}
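The same offset logic in a standalone sketch (names mirror the Caffe snippet; Python's `randrange` stands in for Caffe's `Rand`):

```python
import random

def crop_offsets(datum_height, datum_width, crop_size, train, rng=random):
    """Mirror of the Caffe logic above: random offsets during training,
    center-crop offsets otherwise."""
    if not crop_size:
        return 0, 0
    if train:
        h_off = rng.randrange(datum_height - crop_size + 1)
        w_off = rng.randrange(datum_width - crop_size + 1)
    else:
        h_off = (datum_height - crop_size) // 2
        w_off = (datum_width - crop_size) // 2
    return h_off, w_off

print(crop_offsets(256, 256, 227, train=False))  # (14, 14)
```

With the prototxt settings above (256x256 LMDB images, crop_size 227), the TEST phase always takes the crop at offset (14, 14), while the TRAIN phase samples one of the 30x30 possible offsets.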