Building the Long-Tailed Datasets Places365-LT and ImageNet-LT

Besides the CIFAR family, the three datasets used most often for long-tailed image classification are Places365, iNaturalist2018, and ImageNet (2012).

The items to download for each dataset, with their download links:

Places365: Small images (256 * 256) with easy directory structure - Train and val images. 21G. [link]
iNaturalist2018: All training and validation images [120GB] [link]
ImageNet: Training images (Task 1 & 2), 138GB, and Validation images (all tasks), 6.3GB. [link] [Details of dataset construction]

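Of these three datasets, iNaturalist2018 is long-tailed by nature. Places365 and ImageNet (2012) were turned into long-tailed datasets by earlier work, that is, by manually selecting a subset of the samples from each class.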

These two hand-made long-tailed datasets were first proposed in Large-Scale Long-Tailed Recognition in an Open World. The file names of the images selected to build them are recorded in .txt files: [.txt]
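
Before copying anything, it can be useful to check how imbalanced a split actually is. The following is a minimal sketch, assuming every line of a checklist has the form `<relative path> <class index>` (as in the validation example shown in this post). The function names `count_images_per_class` and `summarize` are hypothetical helpers, not part of any released code.

```python
from collections import Counter


def count_images_per_class(checklist_path):
    """Return a Counter mapping class index -> number of listed images."""
    counts = Counter()
    with open(checklist_path, "r") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            # Assumption: the second field is an integer class index.
            counts[int(parts[1])] += 1
    return counts


def summarize(counts):
    """Return (number of classes, largest class size, smallest class size)."""
    sizes = sorted(counts.values(), reverse=True)
    return len(sizes), sizes[0], sizes[-1]
```

Calling `summarize(count_images_per_class('./ImageNet_LT_train.txt'))` gives the number of classes together with the head and tail class sizes, which is a quick sanity check that the downloaded list is the long-tailed split.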

All that is left is to filter the original datasets according to the .txt files and copy the listed images:

import pathlib
import shutil


def select_data(src_root, dest_root, checklist_path: str):
    """Copy every file listed in the checklist from src_root to dest_root, keeping relative paths."""

    with open(checklist_path, 'r') as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines instead of raising IndexError
            # Each line is "<relative path> <class index>"; only the path is needed here.
            rel_path = fields[0]
            file_path = pathlib.Path(src_root) / rel_path
            assert file_path.exists(), f"File not found: {file_path}"
            dest_path = pathlib.Path(dest_root) / rel_path
            dest_path.parent.mkdir(parents=True, exist_ok=True)
            if not dest_path.exists():
                shutil.copy(file_path, dest_path)
                print(f"Copy {file_path} to {dest_path}")
            else:
                print(f"Skip {file_path}: {dest_path} already exists")

if __name__ == "__main__":
    src_root_list = [
        r'F:\dataset\places365',
        r'F:\dataset\ILSVRC2012'
    ]
    dest_root_list = [
        r'E:\dataset-lt\places365',
        r'E:\dataset-lt\ILSVRC2012'
    ]
    checklist_list = [
        './Places_LT_train.txt',
        './Places_LT_test.txt',
        './ImageNet_LT_train.txt',
    ]
    # Uncomment the call for each split you want to copy.
    # select_data(src_root_list[0], dest_root_list[0], checklist_list[0])
    # select_data(src_root_list[0], dest_root_list[0], checklist_list[1])
    # select_data(src_root_list[1], dest_root_list[1], checklist_list[2])

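Note that the ImageNet-LT validation set is not handled here. (It is also called the test set: the original ImageNet test set comes without labels, so papers usually evaluate on the validation set.) The reason is that each line of the validation .txt file from the link above looks like this: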

val/n01440764/ILSVRC2012_val_00029930.JPEG 0

This directory structure differs from that of the downloaded validation images. For how to build the validation set, see: issue
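
One possible way to bridge the mismatch is sketched below. It assumes the downloaded validation images sit in a single flat directory (for example `ILSVRC2012_val_00029930.JPEG` directly inside it) and rebuilds the `val/<wnid>/<file>` layout that the checklist expects. The function name `build_val_split` is hypothetical, and this is not necessarily the approach described in the linked issue.

```python
import pathlib
import shutil


def build_val_split(flat_val_dir, dest_root, checklist_path):
    """Copy flat validation images into the layout listed in the checklist.

    Assumption: flat_val_dir holds all validation images directly, while each
    checklist line looks like 'val/n01440764/ILSVRC2012_val_00029930.JPEG 0'.
    Returns the file names that were listed but not found.
    """
    flat_val_dir = pathlib.Path(flat_val_dir)
    dest_root = pathlib.Path(dest_root)
    missing = []
    with open(checklist_path, "r") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            # Checklist paths use forward slashes regardless of the local OS.
            rel_path = pathlib.PurePosixPath(parts[0])
            src = flat_val_dir / rel_path.name
            if not src.exists():
                missing.append(rel_path.name)
                continue
            dest = dest_root.joinpath(*rel_path.parts)
            dest.parent.mkdir(parents=True, exist_ok=True)
            if not dest.exists():
                shutil.copy(src, dest)
    return missing
```

The returned list makes it easy to see whether any listed image is absent from the download, rather than stopping at the first missing file.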

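In addition, the ImageNet validation download includes a file named ILSVRC2012_validation_ground_truth.txt. As far as I can tell, its labels do not match the validation labels used here (possibly because it follows a different class numbering), so the validation ground truth should still be taken from the downloaded ImageNet_LT_test.txt.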

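Of course, you can also skip the filtering and use the original datasets directly. The dataloader looks up files through the .txt lists anyway, so the copying step only serves to pull out the images that are actually used, nothing more.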
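
To illustrate that last point, here is a minimal sketch of a checklist-driven index that reads straight from the original dataset root without copying anything. The class name `ChecklistDataset` is hypothetical, and the line format `<relative path> <class index>` is assumed. It only returns paths and labels; decoding images, or wrapping it in a framework-specific dataset class, is left to the caller.

```python
import pathlib


class ChecklistDataset:
    """Index a dataset through a checklist file without copying any images."""

    def __init__(self, root, checklist_path):
        self.root = pathlib.Path(root)
        self.samples = []
        with open(checklist_path, "r") as f:
            for line in f:
                parts = line.split()
                if len(parts) < 2:
                    continue  # skip blank or malformed lines
                # Store (absolute image path, integer class index).
                self.samples.append((self.root / parts[0], int(parts[1])))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        # Returns (image path, class index); loading the image is up to the caller.
        return self.samples[index]
```

Because only the listed files are ever indexed, pointing `root` at the full original dataset behaves the same as pointing it at the filtered copy.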

posted @ 2025-04-26 12:09 片刻的自由
