YOLO v1, YOLO v2, and YOLO v3 series

      Object detection models fall mainly into two-stage and one-stage approaches; the main one-stage representatives are the YOLO series and SSD. Below are some brief notes from studying the YOLO series.

1 YOLO v1

   YOLO v1 was proposed in the 2015 paper "You Only Look Once: Unified, Real-Time Object Detection" and is the pioneering work of one-stage object detection. Its network architecture is shown below: 24 convolutional layers followed by two fully connected layers; note that the last fully connected layer can be understood as a linear transformation from 1*4096 to 1*1470 (7*7*30).

  Understanding YOLO v1 comes down to three points:

  1.1 Grid division: the input image is 448*448 and YOLO divides it into 49 (7*7) cells. Each cell is responsible for predicting only one object: if an object's center point falls inside a cell, that cell is responsible for predicting that object.
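  As a minimal example of this assignment (assuming the ground-truth center (cx, cy) is given in pixels on the 448*448 input; the values are hypothetical):

S, img_size = 7, 448
cell_size = img_size / S              # 64 pixels per cell

cx, cy = 210.0, 315.0                 # hypothetical ground-truth box center, in pixels
col, row = int(cx // cell_size), int(cy // cell_size)
print(col, row)                       # cell (3, 4) is responsible for predicting this object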

  

  1.2  Prediction output: the final network output is 7*7*30, which can also be viewed as 49 vectors of size 1*30. Each vector is composed as (x, y, w, h, confidence) * 2 + 20, i.e., each vector predicts two bounding boxes with their confidences, plus the probabilities of the object belonging to each of the 20 classes (the VOC dataset has 20 classes).
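  A minimal sketch of this layout (the tensor is random and only illustrates the shapes):

import torch

S, B, C = 7, 2, 20
output = torch.randn(S, S, B * 5 + C)   # network output reshaped to 7*7*30

cell = output[3, 4]                     # the 1*30 vector of one cell
boxes = cell[:B * 5].view(B, 5)         # two boxes, each (x, y, w, h, confidence)
class_probs = cell[B * 5:]              # probabilities over the 20 VOC classes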

 

      1.3  Understanding the loss function: the loss is given below; the following concepts need to be clarified first.

       S²: the final network output is 7*7*30, so there are 49 cells;

       B: each cell (1*30) predicts two bboxes, so B = 2; only the bbox with the larger IoU against the ground truth takes part in the computation;

       7*7 positive mask 𝕝ij^obj: from the initial grid division, a cell's value is 1 if a ground-truth center point falls inside it; only cells with value 1 take part in the computation;

       7*7 negative mask 𝕝ij^noobj: the inverse of the positive mask.
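      The figure of the loss is not reproduced here; as given in the YOLO v1 paper, the loss is

$$
\begin{aligned}
Loss =\;& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\Big] \\
      +\;& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\Big] \\
      +\;& \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
      + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 \\
      +\;& \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\big(p_i(c)-\hat{p}_i(c)\big)^2
\end{aligned}
$$

with $\lambda_{coord}=5$ and $\lambda_{noobj}=0.5$.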

     (1) Coordinate loss: the first part of the loss is the coordinate loss of the predicted bboxes. Two points deserve attention: first, the square root is taken of the width and height, which suppresses the loss of large objects and balances the loss gap between small and large objects; second, a weight factor of 5 is used because too few positive samples take part in the computation (e.g., only three cells in the 7*7 mask above contribute coordinates), so their weight is increased.

    (2) Confidence loss: the second part is the confidence loss of positive and negative bboxes. Note the ground-truth confidence: for positive samples it is IoU(predicted box, ground truth) * 1, and for negative samples it is IoU * 0. In addition, since negative samples far outnumber positive ones, the negative-sample weight factor is set to 0.5.

     

    (3) Classification loss: the third part is the loss on the predicted class. The predicted class probabilities come from the network output, and the ground truth is the one-hot encoding of the labeled class (it can be viewed as a 20-class task; for class 5 the encoding is 00001000000000000000).

   Main characteristics of YOLO v1

   Advantages: (1) one-stage, so detection is fast

   Disadvantages:

    (1) Does not handle crowded objects well (each cell predicts only one object when the grid is divided)

    (2) Poor detection of small objects, and poor generalization to objects with new aspect ratios

    (3) The network does not use batch normalization

Below is a PyTorch implementation of the YOLO v1 network and the loss computation (not tested, for understanding only):

import torch
import torch.nn as nn
from torch.nn import functional
from torch.autograd import Variable
import torchvision.models as models


class YoloLoss(nn.Module):
    def __init__(self, n_batch, B, C, lambda_coord, lambda_noobj, use_gpu=False):
        """

        :param n_batch: number of batches
        :param B: number of bounding boxes per cell
        :param C: number of classes
        :param lambda_coord: weight factor for losses from cells that contain objects
        :param lambda_noobj: weight factor for losses from cells that do not contain objects
        """
        super(YoloLoss, self).__init__()
        self.n_batch = n_batch
        self.B = B # assume there are two bounding boxes
        self.C = C
        self.lambda_coord = lambda_coord
        self.lambda_noobj = lambda_noobj
        self.use_gpu = use_gpu

    def compute_iou(self, bbox1, bbox2):
        """
        Compute the intersection over union of two set of boxes, each box is [x1,y1,w,h]
        :param bbox1: (tensor) bounding boxes, size [N,4]
        :param bbox2: (tensor) bounding boxes, size [M,4]
        :return:
        """
        # compute [x1,y1,x2,y2] w.r.t. top left and bottom right coordinates separately
        b1x1y1 = bbox1[:,:2]-bbox1[:,2:]**2 # [N, (x1,y1)=2]
        b1x2y2 = bbox1[:,:2]+bbox1[:,2:]**2 # [N, (x2,y2)=2]
        b2x1y1 = bbox2[:,:2]-bbox2[:,2:]**2 # [M, (x1,y1)=2]
        b2x2y2 = bbox2[:,:2]+bbox2[:,2:]**2 # [M, (x1,y1)=2]
        box1 = torch.cat((b1x1y1.view(-1,2), b1x2y2.view(-1, 2)), dim=1) # [N,4], 4=[x1,y1,x2,y2]
        box2 = torch.cat((b2x1y1.view(-1,2), b2x2y2.view(-1, 2)), dim=1) # [M,4], 4=[x1,y1,x2,y2]
        N = box1.size(0)
        M = box2.size(0)

        tl = torch.max(
            box1[:,:2].unsqueeze(1).expand(N,M,2),  # [N,2] -> [N,1,2] -> [N,M,2]
            box2[:,:2].unsqueeze(0).expand(N,M,2),  # [M,2] -> [1,M,2] -> [N,M,2]
        )
        br = torch.min(
            box1[:,2:].unsqueeze(1).expand(N,M,2),  # [N,2] -> [N,1,2] -> [N,M,2]
            box2[:,2:].unsqueeze(0).expand(N,M,2),  # [M,2] -> [1,M,2] -> [N,M,2]
        )

        wh = br - tl  # [N,M,2]
        wh[(wh<0).detach()] = 0
        #wh[wh<0] = 0
        inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]

        area1 = (box1[:,2]-box1[:,0]) * (box1[:,3]-box1[:,1])  # [N,]
        area2 = (box2[:,2]-box2[:,0]) * (box2[:,3]-box2[:,1])  # [M,]
        area1 = area1.unsqueeze(1).expand_as(inter)  # [N,] -> [N,1] -> [N,M]
        area2 = area2.unsqueeze(0).expand_as(inter)  # [M,] -> [1,M] -> [N,M]

        iou = inter / (area1 + area2 - inter)
        return iou

    def forward(self, pred_tensor, target_tensor):
        """

        :param pred_tensor: [batch,SxSx(Bx5+20))]
        :param target_tensor: [batch,S,S,Bx5+20]
        :return: total loss
        """
        n_elements = self.B * 5 + self.C
        batch = target_tensor.size(0)
        target_tensor = target_tensor.view(batch,-1,n_elements)
        #print(target_tensor.size())
        #print(pred_tensor.size())
        pred_tensor = pred_tensor.view(batch,-1,n_elements)
        coord_mask = target_tensor[:,:,5] > 0
        noobj_mask = target_tensor[:,:,5] == 0
        coord_mask = coord_mask.unsqueeze(-1).expand_as(target_tensor)
        noobj_mask = noobj_mask.unsqueeze(-1).expand_as(target_tensor)

        coord_target = target_tensor[coord_mask].view(-1,n_elements)
        coord_pred = pred_tensor[coord_mask].view(-1,n_elements)
        class_pred = coord_pred[:,self.B*5:]
        class_target = coord_target[:,self.B*5:]
        box_pred = coord_pred[:,:self.B*5].contiguous().view(-1,5)
        box_target = coord_target[:,:self.B*5].contiguous().view(-1,5)

        noobj_target = target_tensor[noobj_mask].view(-1,n_elements)
        noobj_pred = pred_tensor[noobj_mask].view(-1,n_elements)

        # compute loss which do not contain objects
        if self.use_gpu:
            noobj_target_mask = torch.cuda.ByteTensor(noobj_target.size())
        else:
            noobj_target_mask = torch.ByteTensor(noobj_target.size())
        noobj_target_mask.zero_()
        for i in range(self.B):
            noobj_target_mask[:,i*5+4] = 1
        noobj_target_c = noobj_target[noobj_target_mask] # only compute loss of c size [2*B*noobj_target.size(0)]
        noobj_pred_c = noobj_pred[noobj_target_mask]
        noobj_loss = functional.mse_loss(noobj_pred_c, noobj_target_c, size_average=False)

        # compute loss which contain objects
        if self.use_gpu:
            coord_response_mask = torch.cuda.ByteTensor(box_target.size())
            coord_not_response_mask = torch.cuda.ByteTensor(box_target.size())
        else:
            coord_response_mask = torch.ByteTensor(box_target.size())
            coord_not_response_mask = torch.ByteTensor(box_target.size())
        coord_response_mask.zero_()
        coord_not_response_mask = ~coord_not_response_mask.zero_()
        for i in range(0,box_target.size()[0],self.B):
            box1 = box_pred[i:i+self.B]
            box2 = box_target[i:i+self.B]
            iou = self.compute_iou(box1[:, :4], box2[:, :4])
            max_iou, max_index = iou.max(0)
            if self.use_gpu:
                max_index = max_index.data.cuda()
            else:
                max_index = max_index.data
            coord_response_mask[i+max_index]=1
            coord_not_response_mask[i+max_index]=0

        # 1. response loss
        box_pred_response = box_pred[coord_response_mask].view(-1, 5)
        box_target_response = box_target[coord_response_mask].view(-1, 5)
        contain_loss = functional.mse_loss(box_pred_response[:, 4], box_target_response[:, 4], size_average=False)
        loc_loss = functional.mse_loss(box_pred_response[:, :2], box_target_response[:, :2], size_average=False) +\
                   functional.mse_loss(box_pred_response[:, 2:4], box_target_response[:, 2:4], size_average=False)
        # 2. not response loss
        box_pred_not_response = box_pred[coord_not_response_mask].view(-1, 5)
        box_target_not_response = box_target[coord_not_response_mask].view(-1, 5)

        # compute class prediction loss
        class_loss = functional.mse_loss(class_pred, class_target, size_average=False)

        # compute total loss
        total_loss = self.lambda_coord * loc_loss + contain_loss + self.lambda_noobj * noobj_loss + class_loss
        return total_loss



def test():
    voc = False
    vot = not voc
    if voc:
        img_folder = '../codedata/voc2012train/JPEGImages'
        file = '../voc2012.txt'
        img_size = 448
        train_dataset = YoloDataset(img_folder=img_folder, file=file, img_size=img_size, S=7, B=2, C=20, transforms=[transforms.ToTensor()])
        train_loader = DataLoader(train_dataset, batch_size=2, shuffle=False, num_workers=0)
        train_iter = iter(train_loader)
        img, target = next(train_iter)
        print(target.size())
        target = Variable(target)
        img = Variable(img)
        net = YOLO_V1()
        pred = net(img)
        yololoss = YoloLoss(n_batch=2, B=2, C=20, lambda_coord=5, lambda_noobj=0.5)
        print(pred.size())
        print(target.size())
        loss = yololoss(pred, target)
        print(loss)

    if vot:
        img_folder = './small_train_dataset'
        bboxes = dd.io.load('girl_bbox_4dim.h5')
        learning_rate = 0.0005
        img_size = 224
        num_epochs = 2
        lambda_coord = 5
        lambda_noobj = .5
        n_batch = 5
        S = 7
        B = 2
        C = 1
        train_dataset = VotDataset(img_folder=img_folder, bboxes=bboxes, img_size=img_size, S=S, B=B, C=C,
                                   transforms=[transforms.ToTensor()])
        train_loader = DataLoader(train_dataset, batch_size=n_batch, shuffle=False, num_workers=2)
        yololoss = YoloLoss(n_batch=n_batch, B=B, C=C, lambda_coord=5, lambda_noobj=0.5)
        train_iter = iter(train_loader)
        img, target = next(train_iter)
        target = Variable(target)
        img = Variable(img)

        model = models.vgg16(pretrained=True)
        model.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 11 * 7 * 7),
            nn.Sigmoid(),
        )
        model.train()

        loss_fn = YoloLoss(n_batch, B, C, lambda_coord, lambda_noobj)
        optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-4)

        use_gpu = False
        for epoch in range(num_epochs):
            for i, (images, target) in enumerate(train_loader):
                images = Variable(images)
                target = Variable(target)
                if use_gpu:
                    images, target = images.cuda(), target.cuda()

                pred = model(images)
                print(pred.size())
                print(target.size())
                loss = loss_fn(pred, target)
                print(i + 1, loss)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                if i == 10:
                    break
            break



if __name__=='__main__':
    from own_yolo_v1.network import *
    from own_yolo_v1.load_dataset import *
    test()
(Code above: Yolo_loss)
import torch.nn as nn


class Flatten(nn.Module):
    def __init__(self):
        super(Flatten, self).__init__()
    def forward(self, x):
        return x.view(x.size(0), -1)


class YOLO_V1(nn.Module):
    def __init__(self):
        super(YOLO_V1, self).__init__()
        C = 20  # number of classes
        print("\n------Initiating YOLO v1------\n")
        self.conv_layer1 = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7, stride=2, padding=7//2),
            nn.BatchNorm2d(64),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_layer2 = nn.Sequential(
            nn.Conv2d(in_channels=64, out_channels=192, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(192),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_layer3 = nn.Sequential(
            nn.Conv2d(in_channels=192, out_channels=128, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_layer4 = nn.Sequential(
            nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=512, out_channels=512, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(1024),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_layer5 = nn.Sequential(
            nn.Conv2d(in_channels=1024, out_channels=512, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=1024, out_channels=512, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=2, padding=3//2),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1),
        )
        self.conv_layer6 = nn.Sequential(
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1)
        )
        self.flatten = Flatten()
        self.conn_layer1 = nn.Sequential(
            nn.Linear(in_features=7*7*1024, out_features=4096),
            nn.Dropout(),
            nn.LeakyReLU(0.1)
        )
        self.conn_layer2 = nn.Sequential(nn.Linear(in_features=4096, out_features=7 * 7 * (2 * 5 + C)))

    def forward(self, input):
        conv_layer1 = self.conv_layer1(input)
        conv_layer2 = self.conv_layer2(conv_layer1)
        conv_layer3 = self.conv_layer3(conv_layer2)
        conv_layer4 = self.conv_layer4(conv_layer3)
        conv_layer5 = self.conv_layer5(conv_layer4)
        conv_layer6 = self.conv_layer6(conv_layer5)
        flatten = self.flatten(conv_layer6)
        conn_layer1 = self.conn_layer1(flatten)
        output = self.conn_layer2(conn_layer1)
        return output


'''
def test():
    from own_yolo_v1.load_dataset import *
    from torch.autograd import Variable
    img_folder = '../codedata/voc2012train/JPEGImages'
    file = '../voc2012.txt'
    img_size = 448
    train_dataset = YoloDataset(img_folder=img_folder, file=file, img_size=img_size, transforms=[transforms.ToTensor()])
    train_loader = DataLoader(train_dataset, batch_size=2, shuffle=False, num_workers=0)
    train_iter = iter(train_loader)
    img, target = next(train_iter)
    img = Variable(img)
    net = YOLO_V1()
    output = net(img)
    print(output.size())


if __name__ == '__main__':
    test()
'''
(Code above: Yolo V1 network)

 

 2. YOLO v2

    YOLO v2 was proposed in the 2016 paper "YOLO9000: Better, Faster, Stronger". It adopts a new backbone called Darknet-19, with 19 convolutional layers and 5 max-pooling layers, reducing computation by about 33% compared with YOLO v1. Its structure is as follows:

Structure used for ImageNet pre-training:

 

 Model structure when training for the detection task (multi-scale feature fusion is introduced):

 

 

YOLO v2 makes five main improvements over YOLO v1:

      (1) Adds Batch Normalization and removes dropout

      (2) High resolution classifier

      (3) Introduces anchors

      (4) Fine-grained features (fusion of low-level and high-level features)

      (5) Multi-scale training (training with images of different sizes)

   2.1 High resolution classifier (+4% mAP)

    In YOLO v1 the classifier is pre-trained on ImageNet at 224*224, while detection uses 448*448 images, so the network has to adapt to the new resolution. YOLO v2 therefore adds an extra fine-tuning stage, as follows:

    a. Pre-train the classifier on ImageNet (224*224) for about 160 epochs

    b. Resize the ImageNet images to 448*448 and fine-tune for another 10 epochs so the model adapts to larger inputs

    c. Using the weights pre-trained above, fine-tune on the actual detection dataset (416*416); the final output is 13*13

  2.2 Anchors

   Borrowing the anchor idea from Faster R-CNN, k-means clustering is run on the widths and heights of the objects in the VOC (or COCO) dataset, producing 5 cluster centers, so 5 anchor widths/heights are selected (the clustering distance metric is distance = 1 - IOU(bbox, cluster)):

COCO: (0.57273, 0.677385), (1.87446, 2.06253), (3.33843, 5.47434), (7.88282, 3.52778), (9.77052, 9.16828)
VOC: (1.3221, 1.73145), (3.19275, 4.00944), (5.05587, 8.09892), (9.47112, 4.84053), (11.2364, 10.0071)

  In this way each grid cell corresponds to 5 anchors of different widths/heights, as shown below (the widths/heights above are relative to a grid cell; the actual sizes are obtained by multiplying by 32):
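  A minimal sketch of this clustering, assuming `boxes` is an (N, 2) numpy array of ground-truth (w, h) pairs in grid-cell units (the function names are illustrative, and empty clusters are not handled):

import numpy as np

def wh_iou(boxes, centers):
    """IoU between (w, h) pairs, assuming all boxes share the same center."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + centers[None, :, 0] * centers[None, :, 1] - inter
    return inter / union                                         # (N, k)

def kmeans_anchors(boxes, k=5, iters=100):
    centers = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1 - wh_iou(boxes, centers), axis=1)   # distance = 1 - IOU
        centers = np.array([boxes[assign == i].mean(axis=0) for i in range(k)])
    return centers

anchors_wh = kmeans_anchors(np.random.rand(1000, 2) * 13)        # dummy data just to run the sketch
print(anchors_wh)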

 

  Computing the predicted bboxes (taking 416*416 input and 13*13 output as an example):

    (1) The input image is 416*416 and the final output is 13*13*125, where 125 means 5*(5 + 20): 5 anchors, each predicting 25 values ([x, y, w, h, confidence] + 20 class probabilities), i.e., every anchor predicts one group of values.

    (2) Among the 25 values predicted by an anchor, x and y are offsets relative to the top-left corner of the grid cell and are squashed into (0, 1) by a sigmoid. For a 13*13 grid and the cell at index (6, 6), if the predicted x, y become xoffset, yoffset after the sigmoid, the actual center is x = 6 + xoffset, y = 6 + yoffset; since 0 < xoffset < 1 and 0 < yoffset < 1, the predicted center always stays inside cell (6, 6). The predicted w, h are relative to the anchor: they are passed through exp and multiplied by the anchor's (w, h) to obtain the box size.

     (3) Since these values are in 13*13 grid units, they are mapped back to the original image size by multiplying by the downsampling factor 32.

  The actual computation in code looks like the following:
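  (The snippet from the original post is not reproduced here; below is a minimal sketch of the decoding described above, assuming `pred` is the raw output for one image reshaped to (5, 25, 13, 13) and `anchors` holds the 5 anchor sizes in grid-cell units. Function and variable names are illustrative.)

import torch

def decode_yolov2(pred, anchors, stride=32):
    """pred: (A, 25, S, S) raw output for one image; anchors: (A, 2) in grid-cell units."""
    A, _, S, _ = pred.shape
    grid_x = torch.arange(S).repeat(S, 1).float()        # (S, S), column index of each cell
    grid_y = grid_x.t()                                  # (S, S), row index of each cell

    x = torch.sigmoid(pred[:, 0]) + grid_x               # cell offset + cell index
    y = torch.sigmoid(pred[:, 1]) + grid_y
    w = torch.exp(pred[:, 2]) * anchors[:, 0].view(A, 1, 1)
    h = torch.exp(pred[:, 3]) * anchors[:, 1].view(A, 1, 1)
    conf = torch.sigmoid(pred[:, 4])

    boxes = torch.stack([x, y, w, h], dim=-1) * stride   # back to input-image pixels
    return boxes, conf                                   # (A, S, S, 4), (A, S, S)

boxes, conf = decode_yolov2(torch.randn(5, 25, 13, 13), torch.tensor(
    [(0.57273, 0.677385), (1.87446, 2.06253), (3.33843, 5.47434), (7.88282, 3.52778), (9.77052, 9.16828)]))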

  2.3 Fine-Grained Features

    In the architecture above there is a shortcut that concatenates a low-level feature map (26*26*512) with the final feature map (13*13*1024), fusing low-level positional features with high-level semantic features. Because the 26*26 map is larger, the network uses a Reorg (passthrough) layer to reshape it down to 13*13, as shown below:
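  A minimal sketch of this reshape, in the spirit of a space-to-depth operation (the exact element ordering of Darknet's Reorg layer differs slightly, but the shape change is the same):

import torch

def reorg(x, stride=2):
    """(N, C, H, W) -> (N, C*stride*stride, H/stride, W/stride)."""
    n, c, h, w = x.shape
    x = x.view(n, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 3, 5, 1, 2, 4).contiguous()
    return x.view(n, c * stride * stride, h // stride, w // stride)

out = reorg(torch.randn(1, 512, 26, 26))
print(out.shape)  # torch.Size([1, 2048, 13, 13]), then concatenated with the 13*13*1024 map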

 

   2.4 Multi-scale Training

    In the architecture above, the last layer (Conv22) is a 1*1 convolution with 125 output channels that replaces the fully connected layer, so the network can handle inputs of any size. During training, every 10 batches the authors pick a new input resolution from [320×320, 352×352, ..., 608×608] (all multiples of 32, since the output is downsampled 32×), which makes the model more robust. (With a 416*416 input the output is 13*13*125; with a 320*320 input it is 10*10*125.)
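A minimal sketch of this schedule (dummy tensors; in a real pipeline the dataloader would produce the resized batches, and F.interpolate only stands in for that step):

import random
import torch
import torch.nn.functional as F

sizes = list(range(320, 609, 32))          # 320, 352, ..., 608
img_size = 416
for batch_i in range(30):
    if batch_i % 10 == 0:                  # pick a new scale every 10 batches
        img_size = random.choice(sizes)
    imgs = torch.randn(8, 3, 416, 416)     # dummy batch loaded at the base resolution
    imgs = F.interpolate(imgs, size=img_size, mode="nearest")
    # forward pass ...; a 416 input yields a 13*13 grid, a 320 input yields 10*10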

Characteristics of YOLO v2:

   (1) Uses the Darknet-19 backbone, with fewer layers than YOLO v1 and no fully connected layers, so less computation and faster inference;

  (2) Replacing the fully connected layers with convolutions removes the restriction on input size, and multi-scale training makes detection more robust across image sizes;

   (3) Each cell predicts with 5 anchor boxes, which is more effective for crowded scenes and small objects.

3. YOLO9000

  YOLO9000 was proposed in the same paper as YOLO v2. Built on top of YOLOv2, it can detect more than 9000 categories; its main contribution is a joint training strategy for classification and detection. For details see: https://zhuanlan.zhihu.com/p/35325884

4. YOLO v3

   YOLO v3 was proposed in the 2018 paper "YOLOv3: An Incremental Improvement". Its backbone is Darknet-53, shown below (it borrows ideas from ResNet and FPN). Each grid cell of YOLO v3 predicts 3 anchor boxes; each box needs the five basic parameters (x, y, w, h, confidence) plus probabilities for 80 classes (COCO dataset), so the output depth is 3*(5 + 80) = 255 (y1, y2, y3 all have depth 255).

  Compared with ResNet, the residual block in Darknet looks like this:
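  Since the figure is not included here, a minimal sketch of one such block (the class name is illustrative): a 1x1 convolution halves the channels, a 3x3 convolution restores them, and a skip connection adds the input back.

import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.conv2(self.conv1(x))   # skip connection

x = torch.randn(1, 64, 208, 208)
print(DarknetResidual(64)(x).shape)            # torch.Size([1, 64, 208, 208])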

 

  Following the FPN idea, feature maps at different scales are fused and predictions are made at each scale, as shown below:

  Like v2, the YOLO v3 backbone downsamples the output feature map to 1/32 of the input, so the input size is usually required to be a multiple of 32. A comparison between Darknet-53 in YOLO v3 and Darknet-19 in YOLO v2 is shown below:

 A reference PyTorch implementation of YOLO v3:

from __future__ import division

from models import *
# from utils.logger import *
from utils.utils import *
from utils.datasets import *
from utils.parse_config import *
from test import evaluate

# from terminaltables import AsciiTable

import os
import sys
import time
import datetime
import argparse

import torch
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision import transforms
from torch.autograd import Variable
import torch.optim as optim

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=100, help="number of epochs")
    parser.add_argument("--batch_size", type=int, default=8, help="size of each image batch")
    parser.add_argument("--gradient_accumulations", type=int, default=2, help="number of gradient accums before step")
    parser.add_argument("--model_def", type=str, default="config/yolov3.cfg", help="path to model definition file")
    parser.add_argument("--data_config", type=str, default="config/coco.data", help="path to data config file")
    parser.add_argument("--pretrained_weights", type=str, help="if specified starts from checkpoint model")
    parser.add_argument("--n_cpu", type=int, default=8, help="number of cpu threads to use during batch generation")
    parser.add_argument("--img_size", type=int, default=416, help="size of each image dimension")
    parser.add_argument("--checkpoint_interval", type=int, default=1, help="interval between saving model weights")
    parser.add_argument("--evaluation_interval", type=int, default=1, help="interval evaluations on validation set")
    parser.add_argument("--compute_map", default=False, help="if True computes mAP every tenth batch")
    parser.add_argument("--multiscale_training", default=True, help="allow for multi-scale training")
    opt = parser.parse_args()
    print(opt)

    # logger = Logger("logs")

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # os.makedirs("output", exist_ok=True)
    # os.makedirs("checkpoints", exist_ok=True)

    # Get data configuration
    data_config = parse_data_config(opt.data_config)
    train_path = data_config["train"]
    valid_path = data_config["valid"]
    class_names = load_classes(data_config["names"])

    # Initiate model
    model = Darknet(opt.model_def).to(device)
    model.apply(weights_init_normal)

    # If specified we start from checkpoint
    if opt.pretrained_weights:
        if opt.pretrained_weights.endswith(".pth"):
            model.load_state_dict(torch.load(opt.pretrained_weights))
        else:
            model.load_darknet_weights(opt.pretrained_weights)

    # model = torch.nn.DataParallel(model).cuda()

    # Get dataloader
    dataset = ListDataset(train_path, augment=True, multiscale=opt.multiscale_training)
    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=opt.batch_size,
        shuffle=True,
        num_workers=opt.n_cpu,
        pin_memory=True,
        collate_fn=dataset.collate_fn,
    )

    optimizer = torch.optim.Adam(model.parameters())

    metrics = [
        "grid_size",
        "loss",
        "x",
        "y",
        "w",
        "h",
        "conf",
        "cls",
        "cls_acc",
        "recall50",
        "recall75",
        "precision",
        "conf_obj",
        "conf_noobj",
    ]

    for epoch in range(opt.epochs):
        model.train()
        start_time = time.time()
        for batch_i, (img_pth, imgs, targets) in enumerate(dataloader):
            batches_done = len(dataloader) * epoch + batch_i
            # img: (batch_size, channel, height, width)
            # target: (num, 6)  6=>(batch_index, cls, center_x, center_y, width, height)
            imgs = Variable(imgs.to(device))
            targets = Variable(targets.to(device), requires_grad=False)

            loss, outputs = model(imgs, targets)
            loss.backward()

            if batches_done % opt.gradient_accumulations:
                # Accumulates gradient before each step
                optimizer.step()
                optimizer.zero_grad()

            # ----------------
            #   Log progress
            # ----------------

            log_str = "\n---- [Epoch %d/%d, Batch %d/%d] ----\n" % (epoch, opt.epochs, batch_i, len(dataloader))

            # metric_table = [["Metrics", *["YOLO Layer {}".format(i) for i in range(len(model.yolo_layers))]]]

            # Log metrics at each YOLO layer
            for i, metric in enumerate(metrics):
                formats = {m: "%.6f" for m in metrics}
                formats["grid_size"] = "%2d"
                formats["cls_acc"] = "%.2f%%"
                row_metrics = [formats[metric] % yolo.metrics.get(metric, 0) for yolo in model.yolo_layers]
                # metric_table += [[metric, *row_metrics]]

                # Tensorboard logging
                tensorboard_log = []
                for j, yolo in enumerate(model.yolo_layers):
                    for name, metric in yolo.metrics.items():
                        if name != "grid_size":
                            tensorboard_log += [("{}_{}".format(name, j+1), metric)]
                tensorboard_log += [("loss", loss.item())]
                # logger.list_of_scalars_summary(tensorboard_log, batches_done)

            # log_str += AsciiTable(metric_table).table
            log_str += "\nTotal loss {}".format(loss.item())

            # Determine approximate time left for epoch
            epoch_batches_left = len(dataloader) - (batch_i + 1)
            time_left = datetime.timedelta(seconds=epoch_batches_left * (time.time() - start_time) / (batch_i + 1))
            log_str += "\n---- ETA {}".format(time_left)

            print(log_str)

            model.seen += imgs.size(0)

            # if batch_i > 10:
            #     break

        if epoch % opt.evaluation_interval == 0:
            print("\n---- Evaluating Model ----")
            # Evaluate the model on the validation set
            precision, recall, AP, f1, ap_class = evaluate(
                model,
                path=valid_path,
                iou_thres=0.5,
                conf_thres=0.5,
                nms_thres=0.5,
                img_size=opt.img_size,
                batch_size=8,
            )
            evaluation_metrics = [
                ("val_precision", precision.mean()),
                ("val_recall", recall.mean()),
                ("val_mAP", AP.mean()),
                ("val_f1", f1.mean()),
            ]
            # logger.list_of_scalars_summary(evaluation_metrics, epoch)

            # Print class APs and mAP
            ap_table = [["Index", "Class name", "AP"]]
            for i, c in enumerate(ap_class):
                ap_table += [[c, class_names[c], "%.5f" % AP[i]]]
            # print(AsciiTable(ap_table).table)
            print("---- mAP {}".format(AP.mean()))

        if epoch % opt.checkpoint_interval == 0:
            torch.save(model.state_dict(), "checkpoints/yolov3_ckpt_%d.pth" % epoch)
(Code above: train)
# -*- coding: utf-8 -*-
from __future__ import division

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import numpy as np

from utils.parse_config import *
from utils.utils import build_targets, to_cpu, non_max_suppression

# import matplotlib.pyplot as plt
# import matplotlib.patches as patches


def create_modules(module_defs):
    """
    Constructs module list of layer blocks from module configuration in module_defs
    """
    hyperparams = module_defs.pop(0)
    output_filters = [int(hyperparams["channels"])]
    module_list = nn.ModuleList()
    for module_i, module_def in enumerate(module_defs):
        modules = nn.Sequential()

        if module_def["type"] == "convolutional":
            bn = int(module_def["batch_normalize"])
            filters = int(module_def["filters"])
            kernel_size = int(module_def["size"])
            pad = (kernel_size - 1) // 2
            modules.add_module(
                "conv_{}".format(module_i),
                nn.Conv2d(
                    in_channels=output_filters[-1],
                    out_channels=filters,
                    kernel_size=kernel_size,
                    stride=int(module_def["stride"]),
                    padding=pad,
                    bias=not bn,
                ),
            )
            if bn:
                modules.add_module("batch_norm_{}".format(module_i), nn.BatchNorm2d(filters, momentum=0.9, eps=1e-5))
            if module_def["activation"] == "leaky":
                modules.add_module("leaky_{}".format(module_i), nn.LeakyReLU(0.1))

        elif module_def["type"] == "maxpool":
            kernel_size = int(module_def["size"])
            stride = int(module_def["stride"])
            if kernel_size == 2 and stride == 1:
                modules.add_module("_debug_padding_{}".format(module_i), nn.ZeroPad2d((0, 1, 0, 1)))
            maxpool = nn.MaxPool2d(kernel_size=kernel_size, stride=stride, padding=int((kernel_size - 1) // 2))
            modules.add_module("maxpool_{}".format(module_i), maxpool)

        elif module_def["type"] == "upsample":
            upsample = Upsample(scale_factor=int(module_def["stride"]), mode="nearest")
            modules.add_module("upsample_{}".format(module_i), upsample)

        elif module_def["type"] == "route":
            layers = [int(x) for x in module_def["layers"].split(",")]
            filters = sum([output_filters[1:][i] for i in layers])
            modules.add_module("route_{}".format(module_i), EmptyLayer())

        elif module_def["type"] == "shortcut":
            filters = output_filters[1:][int(module_def["from"])]
            modules.add_module("shortcut_{}".format(module_i), EmptyLayer())

        elif module_def["type"] == "yolo":
            anchor_idxs = [int(x) for x in module_def["mask"].split(",")]
            # Extract anchors
            anchors = [int(x) for x in module_def["anchors"].split(",")]
            anchors = [(anchors[i], anchors[i + 1]) for i in range(0, len(anchors), 2)]
            anchors = [anchors[i] for i in anchor_idxs]
            num_classes = int(module_def["classes"])
            img_size = int(hyperparams["height"])
            # Define detection layer
            yolo_layer = YOLOLayer(anchors, num_classes, img_size)
            modules.add_module("yolo_{}".format(module_i), yolo_layer)
        # Register module list and number of output filters
        module_list.append(modules)
        output_filters.append(filters)

    return hyperparams, module_list


class Upsample(nn.Module):
    """ nn.Upsample is deprecated """

    def __init__(self, scale_factor, mode="nearest"):
        super(Upsample, self).__init__()
        self.scale_factor = scale_factor
        self.mode = mode

    def forward(self, x):
        x = F.interpolate(x, scale_factor=self.scale_factor, mode=self.mode)
        return x


class EmptyLayer(nn.Module):
    """Placeholder for 'route' and 'shortcut' layers"""

    def __init__(self):
        super(EmptyLayer, self).__init__()


class YOLOLayer(nn.Module):
    """Detection layer"""

    def __init__(self, anchors, num_classes, img_dim=416):
        super(YOLOLayer, self).__init__()
        self.anchors = anchors
        self.num_anchors = len(anchors)
        self.num_classes = num_classes
        self.ignore_thres = 0.5
        self.mse_loss = nn.MSELoss()
        self.bce_loss = nn.BCELoss()
        self.obj_scale = 1
        self.noobj_scale = 100
        self.metrics = {}
        self.img_dim = img_dim
        self.grid_size = 0  # grid size

    def compute_grid_offsets(self, grid_size, cuda=True):
        self.grid_size = grid_size
        g = self.grid_size
        FloatTensor = torch.cuda.FloatTensor if cuda else torch.FloatTensor
        self.stride = self.img_dim / self.grid_size  # downsampling factor from the input image to this feature map
        # Calculate offsets for each grid
        # grid_x, grid_y(1, 1, gride, gride)
        self.grid_x = torch.arange(g).repeat(g, 1).view([1, 1, g, g]).type(FloatTensor)
        self.grid_y = torch.arange(g).repeat(g, 1).t().view([1, 1, g, g]).type(FloatTensor)

        # the image is downsampled by `stride`, so the anchors must be scaled down by the same factor
        self.scaled_anchors = FloatTensor([(a_w / self.stride, a_h / self.stride) for a_w, a_h in self.anchors])

        # scaled_anchors has shape (3, 2): 3 anchors, each with (w, h); the two lines below split w and h
        self.anchor_w = self.scaled_anchors[:, 0:1].view((1, self.num_anchors, 1, 1))  # (1, 3, 1, 1)
        self.anchor_h = self.scaled_anchors[:, 1:2].view((1, self.num_anchors, 1, 1))  # (1, 3, 1, 1)

    def forward(self, x, targets=None, img_dim=None):

        # Tensors for cuda support
        FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
        LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor
        ByteTensor = torch.cuda.ByteTensor if x.is_cuda else torch.ByteTensor

        self.img_dim = img_dim  # (img_size)
        num_samples = x.size(0)  # (img_batch)
        grid_size = x.size(2)  # (feature_map_size)
        # print x.shape  # (batch_size, 255, grid_size, grid_size)

        prediction = (
            x.view(num_samples, self.num_anchors, 5 + self.num_classes, grid_size, grid_size)
            .permute(0, 1, 3, 4, 2)
            .contiguous()
        )
        # print prediction.shape (batch_size, num_anchors, grid_size, grid_size, 85)

        # Get outputs
        # prediction holds the raw predictions: each of the grid_size*grid_size cells has num_anchors (3) anchor boxes
        # x, y, w, h and pred_conf all have the same shape: (batch_size, num_anchors, grid_size, grid_size)
        x = torch.sigmoid(prediction[..., 0])  # Center x
        y = torch.sigmoid(prediction[..., 1])  # Center y
        w = prediction[..., 2]  # Width
        h = prediction[..., 3]  # Height
        pred_conf = torch.sigmoid(prediction[..., 4])  # Conf
        pred_cls = torch.sigmoid(prediction[..., 5:])  # Cls pred. (batch_size, num_anchor, gride_size, grid_size, cls)

        # If grid size does not match current we compute new offsets
        # print grid_size, self.grid_size
        if grid_size != self.grid_size:
            self.compute_grid_offsets(grid_size, cuda=x.is_cuda)

        # print self.grid_x, self.grid_y, self.anchor_w, self.anchor_h
        # Add offset and scale with anchors
        pred_boxes = FloatTensor(prediction[..., :4].shape)
        # per-cell offsets: each cell has unit length 1 and the predicted center (x, y) lies in (0, 1), so they can be added directly
        pred_boxes[..., 0] = x.data + self.grid_x  # grid_x/grid_y: (1, 1, grid, grid)
        pred_boxes[..., 1] = y.data + self.grid_y
        pred_boxes[..., 2] = torch.exp(w.data) * self.anchor_w  # anchor_w/anchor_h: (1, 3, 1, 1)
        pred_boxes[..., 3] = torch.exp(h.data) * self.anchor_h

        # (batch_size, num_anchors*grid_size*grid_size, 85)
        output = torch.cat(
            (
                # (batch_size, num_anchors*grid_size*grid_size, 4)
                pred_boxes.view(num_samples, -1, 4) * self.stride,  # scale back up to the input-image size
                # (batch_size, num_anchors*grid_size*grid_size, 1)
                pred_conf.view(num_samples, -1, 1),
                # (batch_size, num_anchors*grid_size*grid_size, 80)
                pred_cls.view(num_samples, -1, self.num_classes),
            ),
            -1,
        )

        if targets is None:
            return output, 0
        else:
            # pred_boxes => (batch_size, anchor_num, gride, gride, 4)
            # pred_cls => (batch_size, anchor_num, gride, gride, 80)
            # targets => (num, 6)  6=>(batch_index, cls, center_x, center_y, widht, height)
            # scaled_anchors => (3, 2)
            # print pred_boxes.shape, pred_cls.shape, targets.shape, self.scaled_anchors.shape
            iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf = build_targets(
                pred_boxes=pred_boxes,
                pred_cls=pred_cls,
                target=targets,
                anchors=self.scaled_anchors,
                ignore_thres=self.ignore_thres,
            )
            # iou_scores: IoU between each responsible predicted box and its ground-truth box (higher IoU = higher score)
            # class_mask: 1 where the class is predicted correctly (right cell, best-matching anchor, and right class)
            # obj_mask: 1 for the anchor assigned to each ground-truth box (one anchor per ground-truth box)
            # noobj_mask: 1 for anchors whose IoU with every ground-truth box is below the ignore threshold
            # tx, ty, tw, th: target coordinates and sizes the model should regress to
            # tcls: target class of each ground-truth box
            # tconf: target confidence of every anchor

            # iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th and tconf all have shape (batch, num_anchors, grid, grid)
            # the predicted x, y, w, h and pred_conf have the same shape (batch, num_anchors, grid, grid)

            # tcls and pred_cls both have shape (batch, num_anchors, grid, grid, num_classes)


            # Loss : Mask outputs to ignore non-existing objects (except with conf. loss)
            # coordinate and size losses:
            loss_x = self.mse_loss(x[obj_mask], tx[obj_mask])
            loss_y = self.mse_loss(y[obj_mask], ty[obj_mask])
            loss_w = self.mse_loss(w[obj_mask], tw[obj_mask])
            loss_h = self.mse_loss(h[obj_mask], th[obj_mask])
            # anchor confidence losses:
            loss_conf_obj = self.bce_loss(pred_conf[obj_mask], tconf[obj_mask])  # tconf[obj_mask] is all ones
            loss_conf_noobj = self.bce_loss(pred_conf[noobj_mask], tconf[noobj_mask])  # tconf[noobj_mask] is all zeros
            loss_conf = self.obj_scale * loss_conf_obj + self.noobj_scale * loss_conf_noobj
            # classification loss
            loss_cls = self.bce_loss(pred_cls[obj_mask], tcls[obj_mask])

            # total loss
            total_loss = loss_x + loss_y + loss_w + loss_h + loss_conf + loss_cls

            # Metrics
            cls_acc = 100 * class_mask[obj_mask].mean()
            conf_obj = pred_conf[obj_mask].mean()
            conf_noobj = pred_conf[noobj_mask].mean()
            conf50 = (pred_conf > 0.5).float()
            iou50 = (iou_scores > 0.5).float()
            iou75 = (iou_scores > 0.75).float()
            detected_mask = conf50 * class_mask * tconf

            obj_mask = obj_mask.float()

            # print type(iou50), type(detected_mask), type(conf50.sum()), type(iou75), type(obj_mask)
            #
            # print iou50.dtype, detected_mask.dtype, conf50.sum().dtype, iou75.dtype, obj_mask.dtype
            precision = torch.sum(iou50 * detected_mask) / (conf50.sum() + 1e-16)
            recall50 = torch.sum(iou50 * detected_mask) / (obj_mask.sum() + 1e-16)
            recall75 = torch.sum(iou75 * detected_mask) / (obj_mask.sum() + 1e-16)

            self.metrics = {
                "loss": to_cpu(total_loss).item(),
                "x": to_cpu(loss_x).item(),
                "y": to_cpu(loss_y).item(),
                "w": to_cpu(loss_w).item(),
                "h": to_cpu(loss_h).item(),
                "conf": to_cpu(loss_conf).item(),
                "cls": to_cpu(loss_cls).item(),
                "cls_acc": to_cpu(cls_acc).item(),
                "recall50": to_cpu(recall50).item(),
                "recall75": to_cpu(recall75).item(),
                "precision": to_cpu(precision).item(),
                "conf_obj": to_cpu(conf_obj).item(),
                "conf_noobj": to_cpu(conf_noobj).item(),
                "grid_size": grid_size,
            }

            return output, total_loss


class Darknet(nn.Module):
    """YOLOv3 object detection model"""

    def __init__(self, config_path, img_size=416):
        super(Darknet, self).__init__()
        self.module_defs = parse_model_config(config_path)
        self.hyperparams, self.module_list = create_modules(self.module_defs)
        self.yolo_layers = [layer[0] for layer in self.module_list if hasattr(layer[0], "metrics")]
        self.img_size = img_size
        self.seen = 0
        self.header_info = np.array([0, 0, 0, self.seen, 0], dtype=np.int32)

    def forward(self, x, targets=None):
        img_dim = x.shape[2]
        loss = 0
        layer_outputs, yolo_outputs = [], []
        for i, (module_def, module) in enumerate(zip(self.module_defs, self.module_list)):
            if module_def["type"] in ["convolutional", "upsample", "maxpool"]:
                x = module(x)
            elif module_def["type"] == "route":
                x = torch.cat([layer_outputs[int(layer_i)] for layer_i in module_def["layers"].split(",")], 1)
            elif module_def["type"] == "shortcut":
                layer_i = int(module_def["from"])
                x = layer_outputs[-1] + layer_outputs[layer_i]
            elif module_def["type"] == "yolo":
                x, layer_loss = module[0](x, targets, img_dim)
                loss += layer_loss
                yolo_outputs.append(x)
            layer_outputs.append(x)
        yolo_outputs = to_cpu(torch.cat(yolo_outputs, 1))
        return yolo_outputs if targets is None else (loss, yolo_outputs)

    def load_darknet_weights(self, weights_path):
        """Parses and loads the weights stored in 'weights_path'"""

        # Open the weights file
        with open(weights_path, "rb") as f:
            header = np.fromfile(f, dtype=np.int32, count=5)  # First five are header values
            self.header_info = header  # Needed to write header when saving weights
            self.seen = header[3]  # number of images seen during training
            weights = np.fromfile(f, dtype=np.float32)  # The rest are weights

        # Establish cutoff for loading backbone weights
        cutoff = None
        if "darknet53.conv.74" in weights_path:
            cutoff = 75

        ptr = 0
        for i, (module_def, module) in enumerate(zip(self.module_defs, self.module_list)):
            if i == cutoff:
                break
            if module_def["type"] == "convolutional":
                conv_layer = module[0]
                if module_def["batch_normalize"]:
                    # Load BN bias, weights, running mean and running variance
                    bn_layer = module[1]
                    num_b = bn_layer.bias.numel()  # Number of biases
                    # Bias
                    bn_b = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(bn_layer.bias)
                    bn_layer.bias.data.copy_(bn_b)
                    ptr += num_b
                    # Weight
                    bn_w = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(bn_layer.weight)
                    bn_layer.weight.data.copy_(bn_w)
                    ptr += num_b
                    # Running Mean
                    bn_rm = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(bn_layer.running_mean)
                    bn_layer.running_mean.data.copy_(bn_rm)
                    ptr += num_b
                    # Running Var
                    bn_rv = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(bn_layer.running_var)
                    bn_layer.running_var.data.copy_(bn_rv)
                    ptr += num_b
                else:
                    # Load conv. bias
                    num_b = conv_layer.bias.numel()
                    conv_b = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(conv_layer.bias)
                    conv_layer.bias.data.copy_(conv_b)
                    ptr += num_b
                # Load conv. weights
                num_w = conv_layer.weight.numel()
                conv_w = torch.from_numpy(weights[ptr : ptr + num_w]).view_as(conv_layer.weight)
                conv_layer.weight.data.copy_(conv_w)
                ptr += num_w

    def save_darknet_weights(self, path, cutoff=-1):
        """
            @:param path    - path of the new weights file
            @:param cutoff  - save layers between 0 and cutoff (cutoff = -1 -> all are saved)
        """
        fp = open(path, "wb")
        self.header_info[3] = self.seen
        self.header_info.tofile(fp)

        # Iterate through layers
        for i, (module_def, module) in enumerate(zip(self.module_defs[:cutoff], self.module_list[:cutoff])):
            if module_def["type"] == "convolutional":
                conv_layer = module[0]
                # If batch norm, load bn first
                if module_def["batch_normalize"]:
                    bn_layer = module[1]
                    bn_layer.bias.data.cpu().numpy().tofile(fp)
                    bn_layer.weight.data.cpu().numpy().tofile(fp)
                    bn_layer.running_mean.data.cpu().numpy().tofile(fp)
                    bn_layer.running_var.data.cpu().numpy().tofile(fp)
                # Load conv bias
                else:
                    conv_layer.bias.data.cpu().numpy().tofile(fp)
                # Load conv weights
                conv_layer.weight.data.cpu().numpy().tofile(fp)

        fp.close()
(Code above: models)
# coding:utf8
from __future__ import division
import math
import time
import tqdm
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import numpy as np
# import matplotlib.pyplot as plt
# import matplotlib.patches as patches


def to_cpu(tensor):
    return tensor.detach().cpu()


def load_classes(path):
    """
    Loads class labels at 'path'
    """
    fp = open(path, "r")
    names = fp.read().split("\n")[:-1]
    return names


def weights_init_normal(m):
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        torch.nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find("BatchNorm2d") != -1:
        torch.nn.init.normal_(m.weight.data, 1.0, 0.02)
        torch.nn.init.constant_(m.bias.data, 0.0)


def rescale_boxes(boxes, current_dim, original_shape):
    """ Rescales bounding boxes to the original shape """
    orig_h, orig_w = original_shape
    # The amount of padding that was added
    pad_x = max(orig_h - orig_w, 0) * (current_dim / max(original_shape))
    pad_y = max(orig_w - orig_h, 0) * (current_dim / max(original_shape))
    # Image height and width after padding is removed
    unpad_h = current_dim - pad_y
    unpad_w = current_dim - pad_x
    # Rescale bounding boxes to dimension of original image
    boxes[:, 0] = ((boxes[:, 0] - pad_x // 2) / unpad_w) * orig_w
    boxes[:, 1] = ((boxes[:, 1] - pad_y // 2) / unpad_h) * orig_h
    boxes[:, 2] = ((boxes[:, 2] - pad_x // 2) / unpad_w) * orig_w
    boxes[:, 3] = ((boxes[:, 3] - pad_y // 2) / unpad_h) * orig_h
    return boxes


def xywh2xyxy(x):
    y = x.new(x.shape)
    y[..., 0] = x[..., 0] - x[..., 2] / 2
    y[..., 1] = x[..., 1] - x[..., 3] / 2
    y[..., 2] = x[..., 0] + x[..., 2] / 2
    y[..., 3] = x[..., 1] + x[..., 3] / 2
    return y


def ap_per_class(tp, conf, pred_cls, target_cls):
    """ Compute the average precision, given the recall and precision curves.
    Source: https://github.com/rafaelpadilla/Object-Detection-Metrics.
    # Arguments
        tp:    True positives (list).
        conf:  Objectness value from 0-1 (list).
        pred_cls: Predicted object classes (list).
        target_cls: True object classes (list).
    # Returns
        The average precision as computed in py-faster-rcnn.
    """

    # Sort by objectness
    i = np.argsort(-conf)
    tp, conf, pred_cls = tp[i], conf[i], pred_cls[i]

    # Find unique classes
    unique_classes = np.unique(target_cls)

    # Create Precision-Recall curve and compute AP for each class
    ap, p, r = [], [], []
    for c in tqdm.tqdm(unique_classes, desc="Computing AP"):
        i = pred_cls == c
        n_gt = (target_cls == c).sum()  # Number of ground truth objects
        n_p = i.sum()  # Number of predicted objects

        if n_p == 0 and n_gt == 0:
            continue
        elif n_p == 0 or n_gt == 0:
            ap.append(0)
            r.append(0)
            p.append(0)
        else:
            # Accumulate FPs and TPs
            fpc = (1 - tp[i]).cumsum()
            tpc = (tp[i]).cumsum()

            # Recall
            recall_curve = tpc / (n_gt + 1e-16)
            r.append(recall_curve[-1])

            # Precision
            precision_curve = tpc / (tpc + fpc)
            p.append(precision_curve[-1])

            # AP from recall-precision curve
            ap.append(compute_ap(recall_curve, precision_curve))

    # Compute F1 score (harmonic mean of precision and recall)
    p, r, ap = np.array(p), np.array(r), np.array(ap)
    f1 = 2 * p * r / (p + r + 1e-16)

    return p, r, ap, f1, unique_classes.astype("int32")


def compute_ap(recall, precision):
    """ Compute the average precision, given the recall and precision curves.
    Code originally from https://github.com/rbgirshick/py-faster-rcnn.

    # Arguments
        recall:    The recall curve (list).
        precision: The precision curve (list).
    # Returns
        The average precision as computed in py-faster-rcnn.
    """
    # correct AP calculation
    # first append sentinel values at the end
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))

    # compute the precision envelope
    for i in range(mpre.size - 1, 0, -1):
        mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])

    # to calculate area under PR curve, look for points
    # where X axis (recall) changes value
    i = np.where(mrec[1:] != mrec[:-1])[0]

    # and sum (\Delta recall) * prec
    ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
    return ap


def get_batch_statistics(outputs, targets, iou_threshold):
    """ Compute true positives, predicted scores and predicted labels per sample """
    batch_metrics = []

    # print "outputs len: {}".format(len(outputs))
    # print "targets shape: {}".format(targets.shape)
    # outputs: (batch_size, pred_boxes_num, 7) 7 =》x,y,w,h,conf,class_conf,class_pred
    # target:  (num, 6)  6=>(batch_index, cls, center_x, center_y, widht, height)
    for sample_i in range(len(outputs)):

        if outputs[sample_i] is None:
            continue

        # output: (pred_boxes_num, 7) 7 =》x,y,w,h,conf,class_conf,class_pred
        output = outputs[sample_i]
        # print "output: {}".format(output.shape)

        pred_boxes = output[:, :4]  # predicted x, y, w, h
        pred_scores = output[:, 4]  # predicted box confidence
        pred_labels = output[:, -1]  # predicted class label

        # array of length pred_boxes_num, initialized to 0; an entry is set to 1 when the predicted box matches a ground-truth box
        true_positives = np.zeros(pred_boxes.shape[0])

        # get the class labels of the ground-truth boxes
        # annotations = targets[targets[:, 0] == sample_i][:, 1:]
        annotations = targets[targets[:, 0] == sample_i]
        annotations = annotations[:, 1:] if len(annotations) else []
        target_labels = annotations[:, 0] if len(annotations) else []

        if len(annotations):  # len(annotations) > 0 means this image has ground-truth boxes
            detected_boxes = []
            target_boxes = annotations[:, 1:]  # ground-truth x, y, w, h

            for pred_i, (pred_box, pred_label) in enumerate(zip(pred_boxes, pred_labels)):

                # If targets are found break
                if len(detected_boxes) == len(annotations):
                    break

                # Ignore if label is not one of the target labels
                # if the predicted label is not among the ground-truth labels, the prediction must be wrong
                if pred_label not in target_labels:
                    continue

                # compute the IoU of this predicted box with all ground-truth boxes; keep the largest value (iou) and the index of the matching ground-truth box (box_index)
                iou, box_index = bbox_iou(pred_box.unsqueeze(0), target_boxes).max(0)
                # if the largest IoU exceeds the threshold, that ground-truth box counts as detected; take care not to record it twice
                if iou >= iou_threshold and box_index not in detected_boxes:
                    true_positives[pred_i] = 1  # mark this predicted box as a true positive
                    detected_boxes += [box_index]  # record the matched ground-truth index so a ground-truth box is matched by at most one predicted box
        # save the prediction info for this image
        # true_positives: 1 if the predicted box is correct, 0 otherwise
        # pred_scores: confidence of each predicted box
        # pred_labels: class label of each predicted box
        batch_metrics.append([true_positives, pred_scores, pred_labels])
    return batch_metrics


def bbox_wh_iou(wh1, wh2):
    wh2 = wh2.t()
    w1, h1 = wh1[0], wh1[1]
    w2, h2 = wh2[0], wh2[1]
    # print w1, w2, h1, h2

    inter_area = torch.min(w1, w2) * torch.min(h1, h2)
    union_area = (w1 * h1 + 1e-16) + w2 * h2 - inter_area
    # print inter_area, union_area
    return inter_area / union_area


def bbox_iou(box1, box2, x1y1x2y2=True):
    """
    Returns the IoU of two bounding boxes
    """
    if not x1y1x2y2:
        # Transform from center and width to exact coordinates
        b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2
        b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2
        b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2
        b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2
    else:
        # Get the coordinates of bounding boxes
        b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]
        b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]

    # get the corrdinates of the intersection rectangle
    inter_rect_x1 = torch.max(b1_x1, b2_x1)
    inter_rect_y1 = torch.max(b1_y1, b2_y1)
    inter_rect_x2 = torch.min(b1_x2, b2_x2)
    inter_rect_y2 = torch.min(b1_y2, b2_y2)
    # Intersection area
    inter_area = torch.clamp(inter_rect_x2 - inter_rect_x1 + 1, min=0) * torch.clamp(
        inter_rect_y2 - inter_rect_y1 + 1, min=0
    )
    # Union Area
    b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)
    b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)

    iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)

    return iou


def non_max_suppression(prediction, conf_thres=0.5, nms_thres=0.4):
    """
    Removes detections with lower object confidence score than 'conf_thres' and performs
    Non-Maximum Suppression to further filter detections.
    Returns detections with shape:
        (x1, y1, x2, y2, object_conf, class_score, class_pred)
    """
    # prediction: (batch_size, num_anchors*grid_size*grid_size*3, 85) 85 => (x,y,w,h, conf, cls)
    # From (center x, center y, width, height) to (x1, y1, x2, y2)
    prediction[..., :4] = xywh2xyxy(prediction[..., :4])
    output = [None for _ in range(len(prediction))]

    for image_i, image_pred in enumerate(prediction):
        # Filter out confidence scores below threshold
        # keep confident boxes: filter out predictions whose objectness confidence is below the threshold
        # print image_pred.shape (num_anchors*grid_size*grid_size*3, 85) 85 => (x,y,w,h, conf, cls)
        image_pred = image_pred[image_pred[:, 4] >= conf_thres]
        # print image_pred.shape  (more_than_conf_thres_num, 85) 85 => (x,y,w,h, conf, cls)

        # If none are remaining => process next image
        # after confidence filtering, check whether any boxes remain; if all were filtered out, no objects were detected in this image
        if not image_pred.size(0):
            continue

        # Object confidence times class confidence
        # score: take each box's largest class probability and multiply it by the box's objectness confidence,
        # so both class accuracy and confidence are taken into account; every remaining box gets one score
        score = image_pred[:, 4] * image_pred[:, 5:].max(1)[0]
        # Sort by it
        # sort the boxes by score in descending order
        # image_pred = image_pred[(-score).argsort()]
        # image_pred ==> (more_than_conf_thres_num, 85) 85 => (x,y,w,h, conf, cls)
        image_pred = image_pred[torch.sort(-score, dim=0)[1]]
        # image_pred[:, 5:] ==> (more_than_conf_thres_num, cls)
        # get each box's best class score (class_confs) and class index (class_preds)
        class_confs, class_preds = image_pred[:, 5:].max(1, keepdim=True)
        # concatenate x, y, w, h, conf with the class score and class index
        # detections ==> (more_than_conf_thres_num, 7) 7 => x,y,w,h,conf,class_conf,class_pred
        detections = torch.cat((image_pred[:, :5], class_confs.float(), class_preds.float()), 1)

        # Perform non-maximum suppression
        keep_boxes = []
        while detections.size(0):
            # detections[0, :4] is the first box, i.e. the highest-scoring box still in the sequence
            # compute the IoU between this highest-scoring box and every box in the sequence; IoU > threshold is marked 1, otherwise 0
            large_overlap = bbox_iou(detections[0, :4].unsqueeze(0), detections[:, :4]) > nms_thres
            # mark the boxes whose class label matches that of the highest-scoring box
            label_match = detections[0, -1] == detections[:, -1]

            # Indices of boxes with lower confidence scores, large IOUs and matching labels
            # a large IoU with the highest-scoring box means a box overlaps it heavily;
            # if such a box also has the same label, it is predicting the same object,
            # so it is marked as invalid (the highest-scoring box itself is included).
            invalid = large_overlap & label_match
            # take the confidences of these boxes and use them as weights
            weights = detections[invalid, 4:5]

            # Merge overlapping bboxes by order of confidence
            # merge the boxes that predict the same object into one (confidence-weighted average), taken as the best box:
            detections[0, :4] = (weights * detections[invalid, :4]).sum(0) / weights.sum()
            # keep the merged box as a final detection
            keep_boxes += [detections[0]]
            # ~invalid flips the mask, keeping the remaining boxes for the next round
            detections = detections[~invalid]
        if keep_boxes:
            # each image ends up with pred_boxes_num final boxes; output[image_i] has shape
            # (pred_boxes_num, 7) 7 => x,y,w,h,conf,class_conf,class_pred
            output[image_i] = torch.stack(keep_boxes)

    # (batch_size, pred_boxes_num, 7) 7 => x,y,w,h,conf,class_conf,class_pred
    return output


def build_targets(pred_boxes, pred_cls, target, anchors, ignore_thres):
    # pred_boxes => (batch_size, anchor_num, gride, gride, 4)
    # pred_cls => (batch_size, anchor_num, gride, gride, 80)
    # targets => (num, 6)  6=>(batch_index, cls, center_x, center_y, widht, height)
    # anchors => (3, 2)

    ByteTensor = torch.cuda.ByteTensor if pred_boxes.is_cuda else torch.ByteTensor
    FloatTensor = torch.cuda.FloatTensor if pred_boxes.is_cuda else torch.FloatTensor

    nB = pred_boxes.size(0)  # batch num
    nA = pred_boxes.size(1)  # anchor num
    nC = pred_cls.size(-1)  # class num => 80
    nG = pred_boxes.size(2)  # gride

    # Output tensors
    obj_mask = ByteTensor(nB, nA, nG, nG).fill_(0)  # (batch_size, anchor_num, gride, gride)
    noobj_mask = ByteTensor(nB, nA, nG, nG).fill_(1)
    class_mask = FloatTensor(nB, nA, nG, nG).fill_(0)
    iou_scores = FloatTensor(nB, nA, nG, nG).fill_(0)
    tx = FloatTensor(nB, nA, nG, nG).fill_(0)
    ty = FloatTensor(nB, nA, nG, nG).fill_(0)
    tw = FloatTensor(nB, nA, nG, nG).fill_(0)
    th = FloatTensor(nB, nA, nG, nG).fill_(0)
    tcls = FloatTensor(nB, nA, nG, nG, nC).fill_(0)  # (batch_size, anchor_num, gride, gride, class_num)

    # Convert to position relative to box
    # convert the normalized x, y, w, h into grid units: the current feature-map size is nG, so multiply by nG
    # print target[:, 2:6].shape  # (num, 4)
    target_boxes = target[:, 2:6] * nG  # (num, 4)  4=>(center_x, center_y, width, height)
    gxy = target_boxes[:, :2]  # (num, 2)
    gwh = target_boxes[:, 2:]  # (num, 2)
    # print target_boxes.shape, gxy.shape, gwh.shape

    # Get anchors with best iou
    # assign each ground-truth box the best of the three anchor shapes.
    # anchors are the preset anchor shapes and gwh the labeled widths/heights; only the shapes are compared, so the center coordinates do not matter here.
    ious = torch.stack([bbox_wh_iou(anchor, gwh) for anchor in anchors])  # (3, num)
    # ious is (3, num); for every ground-truth box pick the anchor with the largest IoU: best_ious is that IoU value, best_n the anchor index
    best_ious, best_n = ious.max(0)  # best_ious and best_n both have length num; best_n is the anchor index for each ground-truth box

    # Separate target values
    # .t() transposes: (num, 2) => (2, num)
    # (2, num)  2 => (batch_index, cls): b holds the num batch indices, target_labels the num class labels
    b, target_labels = target[:, :2].long().t()
    gx, gy = gxy.t()  # gx holds the num x values, gy the num y values
    gw, gh = gwh.t()
    gi, gj = gxy.long().t()  # .long() truncates the floats, giving the grid-cell coordinates containing each box center


    # --------------------------- build obj_mask (object anchors) and noobj_mask (background anchors): start ----------------------------
    # Set masks
    # image b in the batch has a ground-truth center in grid cell (gj, gi), and the best-matching anchor for that box is best_n
    obj_mask[b, best_n, gj, gi] = 1  # best anchor of the cell containing a ground-truth center is set to 1
    noobj_mask[b, best_n, gj, gi] = 0  # and set to 0 in noobj_mask (the opposite of obj_mask)

    # Set noobj mask to zero where iou exceeds ignore threshold
    # ious.t(): (3, num) => (num, 3)
    # unlike the previous step, which picked the single best anchor per ground-truth box,
    # here any anchor whose IoU with a ground-truth box exceeds the threshold is treated as valid, i.e. its position in noobj_mask is set to 0
    for i, anchor_ious in enumerate(ious.t()):
        noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0

    # the steps above produce obj_mask and noobj_mask: obj_mask anchors correspond one-to-one with ground-truth objects;
    # noobj_mask anchors are neither the best match for any object nor above the IoU threshold with one, which makes positives and negatives more distinct.
    # --------------------------- build obj_mask and noobj_mask: end ------------------------------

    # --------------------------- build the normalized target coordinates (tx, ty, tw, th): start ------------------------------
    # Coordinates
    # Re-normalize x, y, w, h.
    # Note why this is done: the normalization here differs from that of the incoming target,
    # where x, y, w, h are divided by img_size (i.e. expressed as a fraction of the image).
    # Here the center x, y are relative to the grid cell and w, h are relative to the anchor; these normalized values are what the model regresses to.
    tx[b, best_n, gj, gi] = gx - gx.floor()
    ty[b, best_n, gj, gi] = gy - gy.floor()
    # Width and height
    tw[b, best_n, gj, gi] = torch.log(gw / anchors[best_n][:, 0] + 1e-16)
    th[b, best_n, gj, gi] = torch.log(gh / anchors[best_n][:, 1] + 1e-16)
    # --------------------------- build the normalized target coordinates (tx, ty, tw, th): end ---------------------------------


    # One-hot encoding of label
    # image b in the batch has a ground-truth center in cell (gj, gi); its best anchor is best_n and its class is target_labels
    tcls[b, best_n, gj, gi, target_labels] = 1

    # Compute label correctness and iou at best anchor
    # class_mask: 1 where the class is predicted correctly (right cell, best-matching anchor, and right class)
    class_mask[b, best_n, gj, gi] = (pred_cls[b, best_n, gj, gi].argmax(-1) == target_labels).float()
    # iou_scores: IoU between the responsible predicted boxes and the ground-truth boxes; higher IoU = higher score
    iou_scores[b, best_n, gj, gi] = bbox_iou(pred_boxes[b, best_n, gj, gi], target_boxes, x1y1x2y2=False)
    # tconf: the anchor assigned to a ground-truth box gets target confidence 1; converted to float so it can be compared with the predicted confidence in the loss
    tconf = obj_mask.float()


    # iou_scores: IoU between the responsible predicted boxes and the ground-truth boxes; higher IoU = higher score
    # class_mask: 1 where the class is predicted correctly (right cell, best-matching anchor, and right class)
    # obj_mask: 1 for the anchor assigned to each ground-truth box (one anchor per ground-truth box)
    # noobj_mask: 1 for anchors whose IoU with every ground-truth box is below the ignore threshold
    # tx, ty, tw, th: target coordinates and sizes the model should regress to
    # tcls: target class of each ground-truth box
    # tconf: target confidence of every anchor
    return iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf
(Code above: utils)

 

 References:

  https://www.zybuluo.com/rianusr/note/1417734

  https://www.cnblogs.com/hellcat/p/10375310.html

  https://blog.csdn.net/leviopku/article/details/82660381
