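CpaCol Code Reproduction
The paper is "Contrastive Pedestrian Attentive and Correlation Learning Network for Occluded Person Re-Identification". I ran into quite a few problems while reproducing it, so I am keeping these notes for the next time I need to reproduce it.
General modifications
The config part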
_C.INPUT = CN()
# Dataset path
# _C.INPUT.OCC_PTH = '/media/userdisk1/jbl/occluded/occluded_patch/'
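The setting above is presumably a dataset path and needs to be set to your own path. The path ../occluded_patch appears in several places, so at first glance it has to be changed in each of them.
In practice, changing it only in the launch script should be enough.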
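Since the path shows up in several files, it can help to list every occurrence before deciding where to edit. The following is a minimal sketch that uses only the standard library; the function name `find_occurrences` and the list of file extensions are my own assumptions, not part of the CpaCol repository.

```python
import os

def find_occurrences(root, needle="occluded_patch",
                     exts=(".py", ".yml", ".yaml", ".sh")):
    """Return (relative_path, line_number, stripped_line) for each line
    under `root` that contains `needle`. Only files whose names end with
    one of `exts` are scanned (the extension list is an assumption)."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            # errors="ignore" avoids crashing on files with odd encodings
            with open(path, encoding="utf-8", errors="ignore") as fh:
                for lineno, line in enumerate(fh, start=1):
                    if needle in line:
                        rel = os.path.relpath(path, root)
                        hits.append((rel, lineno, line.strip()))
    return hits
```

Calling `find_occurrences(".")` from the repository root and printing the result gives a checklist of the places where the path is hard-coded.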
The configs part
INPUT:
  # OCC_PTH: '/home/ym/workspace/CpaCol-main/occluded_patch/Duke'
  OCC_PTH: '/home/ym/workspace/CpaCol-main/datasets/data/Occluded_Duke/'
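This is again a dataset path, and it has to point all the way into the Occ dataset.
Note that ROOT_DIR in the DATASETS section also needs to be changed; it refers to the root directory of the datasets.
In addition, it is not yet clear how DEVICE_ID in the MODEL section should be set, because other parts of the code specify the GPU index explicitly later on. This still needs to be verified.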
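Major problems
Dataset path problem
All dataset paths had been set, yet a path error still occurred at runtime. One possible explanation is that the failing path does not refer to the occduke dataset at all, but to some other component mentioned in the paper.
The error occurs in the image-opening code in make_model.py. The code expects a file, but what it opens is a folder under the dataset directory. This suggests that the path is not meant to point to the occduke dataset but to some other set of image files.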
self.occlusion_path = cfg.INPUT.OCC_PTH
self.occlusion_bank = []
filelist = os.listdir(self.occlusion_path)
print(f"File list: {filelist}")  # check that the file list is correct
for file in filelist:
    print(f"Processing file: {file}")  # check that each entry is a file
    print(self.occlusion_path, 'occlusion_path')
    img_path = os.path.join(self.occlusion_path, file)
    img = Image.open(img_path).convert('RGB')
    img = TF.to_tensor(img)
    self.occlusion_bank.append(img)
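Reasoning: the original source contains
OCC_PTH: '../occluded_patch/Duke/'
so this is probably not the occduke dataset path but a folder of images. I therefore considered looking for the code that generates this folder, i.e. code whose output path is occluded_patch/Duke or OCC_PTH.
Solution: going back to the GitHub page, I found that the author provides the occluded_patch files. After downloading them and placing them in the repository root, the code runs.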
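The loading loop shown earlier calls Image.open on every entry of OCC_PTH, so any subdirectory in that folder makes it fail. A quick pre-flight check like the one below can reveal a wrong OCC_PTH before training starts. This is only a sketch: `check_occlusion_dir` and the extension list are hypothetical names I introduce here, not code from the repository.

```python
import os

# Assumed set of image extensions; adjust to match the downloaded patches.
IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".bmp")

def check_occlusion_dir(path):
    """Split the entries of `path` into image files and everything else.

    A non-empty second list (for example dataset subfolders) is a hint
    that OCC_PTH points at a dataset root rather than a patch folder.
    """
    if not os.path.isdir(path):
        raise NotADirectoryError(f"OCC_PTH is not a directory: {path}")
    images, others = [], []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full) and name.lower().endswith(IMAGE_EXTS):
            images.append(name)
        else:
            others.append(name)
    return images, others
```

If `others` contains directory names, the path most likely points at the dataset itself instead of the occluded_patch folder.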
Error loading the pretrained model
After the first training stage finished, loading the model raised an error:
Traceback (most recent call last):
File "train.py", line 71, in <module>
model = make_model(cfg, num_class=num_classes, camera_num=camera_num, view_num = view_num)
File "/home/ym/workspace/CpaCol-main/model/make_model.py", line 627, in make_model
model = build_transformer_local(num_class, camera_num, view_num, cfg, __factory_T_type, rearrange=cfg.MODEL.RE_ARRANGE)
File "/home/ym/workspace/CpaCol-main/model/make_model.py", line 384, in __init__
assert os.path.exists(cfg.MODEL.STAGE_1_MODEL)
AssertionError
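Reasoning: according to dist_train_occDuke.sh, the model is saved every 30 epochs, but SOLVER.MAX_EPOCHS was originally only 140, which means epoch 140 is never saved. Perhaps the author found during debugging that the results peaked at epoch 140 and did not update the rest of the code?
Solution: increase the number of epochs to 150 with SOLVER.MAX_EPOCHS 150
and, for the second stage, change the setting to MODEL.STAGE_1_MODEL './logs/occ_Duke/stage_1/transformer_150.pth' \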
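The arithmetic behind this fix can be checked with a few lines. The sketch below assumes a checkpoint is written whenever the epoch number is a multiple of the checkpoint period; I have not confirmed that this is exactly the condition the repository uses, and `saved_epochs` is a hypothetical helper.

```python
def saved_epochs(max_epochs, checkpoint_period):
    """Epochs at which a checkpoint is written, assuming a save happens
    whenever epoch % checkpoint_period == 0 (epochs counted from 1)."""
    return [e for e in range(1, max_epochs + 1)
            if e % checkpoint_period == 0]

# With 140 epochs and a period of 30, no epoch-140 checkpoint is written.
print(saved_epochs(140, 30))  # [30, 60, 90, 120]
# With 150 epochs the last checkpoint coincides with the final epoch.
print(saved_epochs(150, 30))  # [30, 60, 90, 120, 150]
```

Under that assumption, MODEL.STAGE_1_MODEL should name a checkpoint whose epoch number appears in this list.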
Error when creating a new distributed process group
Traceback (most recent call last):
File "train.py", line 89, in <module>
num_query, args.local_rank
File "/home/ym/workspace/CpaCol-main/processor/processor.py", line 55, in do_train
for n_iter, (img, vid, target_cam, target_view) in enumerate(train_loader):
File "/opt/anaconda3/envs/CpaCol/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 291, in __iter__
return _MultiProcessingDataLoaderIter(self)
File "/opt/anaconda3/envs/CpaCol/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 764, in __init__
self._try_put_index()
File "/opt/anaconda3/envs/CpaCol/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 994, in _try_put_index
index = self._next_index()
File "/opt/anaconda3/envs/CpaCol/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 357, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/opt/anaconda3/envs/CpaCol/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 208, in __iter__
for idx in self.sampler:
File "/home/ym/workspace/CpaCol-main/datasets/sampler_ddp.py", line 148, in __iter__
seed = shared_random_seed()
File "/home/ym/workspace/CpaCol-main/datasets/sampler_ddp.py", line 108, in shared_random_seed
all_ints = all_gather(ints)
File "/home/ym/workspace/CpaCol-main/datasets/sampler_ddp.py", line 77, in all_gather
group = _get_global_gloo_group()
File "/home/ym/workspace/CpaCol-main/datasets/sampler_ddp.py", line 18, in _get_global_gloo_group
return dist.new_group(backend="gloo")
File "/opt/anaconda3/envs/CpaCol/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1786, in new_group
timeout=timeout)
File "/opt/anaconda3/envs/CpaCol/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 503, in _new_process_group_helper
timeout=timeout)
RuntimeError: Socket Timeout
A quick Baidu search suggests that increasing the timeout may be enough to fix this.
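One untested way to try this: torch.distributed.new_group accepts a timeout argument (a datetime.timedelta), and the traceback shows that the group is created in sampler_ddp.py with dist.new_group(backend="gloo"). The sketch below wraps that call with a longer timeout. The helper name `new_gloo_group_with_timeout`, the 60-minute default, and the `new_group_fn` parameter are assumptions of mine. Raising the timeout will not help if some rank never reaches this call at all.

```python
from datetime import timedelta

def new_gloo_group_with_timeout(minutes=60, new_group_fn=None):
    """Create a gloo process group with a longer timeout.

    `new_group_fn` defaults to torch.distributed.new_group; it is a
    parameter only so the helper can be exercised without a running
    process group. The 60-minute default is an arbitrary assumption.
    """
    if new_group_fn is None:
        import torch.distributed as dist  # imported lazily on purpose
        new_group_fn = dist.new_group
    return new_group_fn(backend="gloo", timeout=timedelta(minutes=minutes))
```

In sampler_ddp.py, the line `return dist.new_group(backend="gloo")` could then be replaced by a call to this helper, keeping in mind that every rank must still reach the call for the group to be created.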