A Step-by-Step Tutorial: Organizing the Market-1501 Dataset with One Python Script for PyTorch Training


张建站

2026/6/8 12:05:57


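When you first start a person re-identification (ReID) project, the hardest part is often not building the model but the tedious work of data preprocessing. Market-1501 is a classic dataset, but its directory layout is awkward, and newcomers can easily lose a lot of time just organizing the data. This tutorial walks through a Python script that converts the original Market-1501 dataset into a PyTorch-friendly layout. You only need to run one script, and the data is ready in about five minutes.

1. Understanding the original Market-1501 structure

Before reorganizing anything, we need to understand the original directory layout and the file naming rule. The original Market-1501 directory contains these key folders:

```
Market-1501/
├── bounding_box_test/    # test set images
├── bounding_box_train/   # training set images
├── query/                # query images
├── gt_bbox/              # hand-annotated bounding boxes
└── gt_query/             # annotations for the query set
```

File names follow a fixed pattern. Taking 0002_c1s1_000451_03.jpg as an example:

- 0002: person ID
- c1: camera number (1-6)
- s1: video sequence number
- 000451: frame number
- 03: detection box number (00 means a hand-annotated box)

This layout carries a lot of information, but it is inconvenient to use directly for PyTorch training. We need to convert it into a hierarchy with one folder per person ID.

2. Preparing the Python conversion script

Below is the complete conversion script. It is based on layumi's code, with some cleanup and added comments:

```python
import os
from shutil import copyfile

from tqdm import tqdm  # progress bar


def convert_market_to_pytorch(download_path="./Market"):
    """
    Convert the Market-1501 dataset into a PyTorch-friendly layout.

    :param download_path: path to the original dataset
    """
    # Create the output directory
    save_path = os.path.join(download_path, "pytorch")
    os.makedirs(save_path, exist_ok=True)

    # Source subdirectory -> target subdirectory
    dir_mapping = {
        "bounding_box_train": "train_all",
        "bounding_box_test": "gallery",
        "query": "query",
        "gt_bbox": "multi-query",  # optional multi-query data
    }

    # Process each subdirectory
    for src_dir, target_dir in dir_mapping.items():
        src_path = os.path.join(download_path, src_dir)
        if not os.path.exists(src_path):
            continue

        target_path = os.path.join(save_path, target_dir)
        os.makedirs(target_path, exist_ok=True)

        print(f"Processing {src_dir} -> {target_dir}")
        for file in tqdm(os.listdir(src_path)):
            if not file.endswith(".jpg"):
                continue

            # The person ID is the part of the file name before the first underscore
            person_id = file.split("_")[0]

            # Create the per-person subdirectory
            person_dir = os.path.join(target_path, person_id)
            os.makedirs(person_dir, exist_ok=True)

            # Copy the file
            src_file = os.path.join(src_path, file)
            dst_file = os.path.join(person_dir, file)
            copyfile(src_file, dst_file)

    print("Conversion finished. Output directory:", save_path)


if __name__ == "__main__":
    convert_market_to_pytorch()
```

3. Using the script

3.1 Environment setup

Make sure Python 3.6 or later is installed (the script uses f-strings), along with the following dependency:

```
pip install tqdm
```

3.2 Steps

1) Extract the original Market-1501 dataset into ./Market, or change the path in the script.
2) Save the script above as convert_market1501.py.
3) Run the script:

```
python convert_market1501.py
```

3.3 Output structure

After conversion, the directory structure looks like this:

```
pytorch/
├── train_all/            # full training set
│   ├── 0002/             # one folder per person ID
│   │   ├── 0002_c1s1_000451_03.jpg
│   │   └── ...
│   └── ...
├── gallery/              # test set
├── query/                # query set
└── multi-query/          # created only if gt_bbox/ exists
```

4. Advanced extensions

4.1 Train/validation split

For actual training, we usually split the training set further into training and validation subsets. Add the following function; it writes train/ and val/ next to the directory you pass in (for example pytorch/train_all):

```python
import os
import random
from shutil import copyfile


def split_train_val(train_path, val_ratio=0.2):
    """
    Split the training set into training and validation subsets.

    :param train_path: path to the full training set (e.g. pytorch/train_all)
    :param val_ratio: fraction of each person's images used for validation
    """
    # Create the output directories as siblings of train_path
    train_out = os.path.normpath(os.path.join(train_path, "..", "train"))
    val_out = os.path.normpath(os.path.join(train_path, "..", "val"))
    os.makedirs(train_out, exist_ok=True)
    os.makedirs(val_out, exist_ok=True)

    # Iterate over every person
    for person_id in os.listdir(train_path):
        person_dir = os.path.join(train_path, person_id)
        if not os.path.isdir(person_dir):
            continue

        images = [f for f in os.listdir(person_dir) if f.endswith(".jpg")]
        if not images:
            continue

        # Randomly pick the validation images (at least one per person)
        val_size = max(1, int(len(images) * val_ratio))
        val_images = set(random.sample(images, val_size))

        # Copy each file into train/ or val/
        for img in images:
            src = os.path.join(person_dir, img)
            if img in val_images:
                dst_dir = os.path.join(val_out, person_id)
            else:
                dst_dir = os.path.join(train_out, person_id)

            os.makedirs(dst_dir, exist_ok=True)
            copyfile(src, os.path.join(dst_dir, img))
```

4.2 Dataset statistics

A small statistics function helps you understand how the dataset is distributed. The second field of a file name combines camera and sequence (for example c1s1), so only the camera part is counted:

```python
import os

import numpy as np


def analyze_dataset(dataset_path):
    """
    Print basic statistics for a converted dataset split.

    :param dataset_path: path to a split directory (e.g. pytorch/train_all)
    """
    stats = {
        "num_persons": 0,
        "num_images": 0,
        "images_per_person": [],
        "cameras": set(),
    }

    for person_id in os.listdir(dataset_path):
        person_dir = os.path.join(dataset_path, person_id)
        if not os.path.isdir(person_dir):
            continue

        stats["num_persons"] += 1
        images = [f for f in os.listdir(person_dir) if f.endswith(".jpg")]
        stats["num_images"] += len(images)
        stats["images_per_person"].append(len(images))

        # Collect camera IDs: "c1s1" -> "c1"
        for img in images:
            cam_id = img.split("_")[1].split("s")[0]
            stats["cameras"].add(cam_id)

    # Print the statistics
    print(f"Number of persons: {stats['num_persons']}")
    print(f"Number of images: {stats['num_images']}")
    print(f"Average images per person: {np.mean(stats['images_per_person']):.1f}")
    print(f"Number of cameras: {len(stats['cameras'])}")
```

5. Integrating with the PyTorch DataLoader

The converted layout integrates easily with PyTorch's Dataset class. Below is a simple example implementation. The raw person IDs are sparse (0002, 0007, ...), so they are mapped to contiguous class indices, which is what classification losses such as CrossEntropyLoss expect:

```python
import os

from PIL import Image
from torch.utils.data import Dataset


class Market1501Dataset(Dataset):
    def __init__(self, root, transform=None):
        self.root = root
        self.transform = transform
        self.samples = []

        # Map raw person IDs to contiguous class indices 0..N-1
        person_ids = sorted(
            d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
        )
        self.class_to_idx = {pid: idx for idx, pid in enumerate(person_ids)}

        # Collect all image paths and labels
        for person_id in person_ids:
            person_dir = os.path.join(root, person_id)
            for img_name in sorted(os.listdir(person_dir)):
                if img_name.endswith(".jpg"):
                    img_path = os.path.join(person_dir, img_name)
                    self.samples.append((img_path, self.class_to_idx[person_id]))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        img_path, label = self.samples[idx]
        img = Image.open(img_path).convert("RGB")

        if self.transform:
            img = self.transform(img)

        return img, label
```

Usage example:

```python
from torch.utils.data import DataLoader
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((256, 128)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Assumes split_train_val() from section 4.1 has been run, so pytorch/train exists
train_dataset = Market1501Dataset("./Market/pytorch/train", transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
```

6. Troubleshooting common problems

In practice you may run into the following problems:

- File permissions: make sure you have read and write permission on the target directory. On Linux/macOS you may need chmod.
- Paths: build paths with os.path.join for cross-platform compatibility, and check that a path exists with os.path.exists(path).
- Out of memory: for large datasets, use generators instead of loading all data at once, and consider an efficient storage format such as lmdb.
- File name parsing errors: make sure the file names follow the Market-1501 naming rule, and add error handling inside the per-file loop. Note that the isdigit() check below also skips the junk images in the test set whose ID is -1:

```python
# Inside the "for file in ..." loop of the conversion script
try:
    person_id = file.split("_")[0]
    if not person_id.isdigit():
        continue
except Exception as e:
    print(f"Error parsing {file}: {e}")
    continue
```

With this complete workflow, you can quickly convert the original Market-1501 dataset into a format suited to PyTorch training, save a lot of preprocessing time, and focus on model development and tuning.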
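The naming rule from section 1 (person ID, camera, sequence, frame, detection box) can also be captured in one small parser, which is handy if you later need camera IDs for evaluation. The following is a minimal sketch, not part of the original script: the names `Market1501Name` and `parse_market1501_name` are hypothetical, and the regular expression assumes file names shaped like 0002_c1s1_000451_03.jpg, with -1 allowed as a person ID for junk images in the test set.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: <person>_c<camera>s<sequence>_<frame>_<bbox>.jpg
_NAME_RE = re.compile(r"^(-?\d+)_c(\d+)s(\d+)_(\d+)_(\d+)\.jpg$")


class Market1501Name(NamedTuple):
    """Fields encoded in a Market-1501 file name (hypothetical helper type)."""
    person_id: int
    camera: int
    sequence: int
    frame: int
    bbox: int


def parse_market1501_name(filename: str) -> Optional[Market1501Name]:
    """Return the parsed fields, or None if the name does not match the pattern."""
    match = _NAME_RE.match(filename)
    if match is None:
        return None
    # Every captured group is numeric, so each converts cleanly to int
    return Market1501Name(*(int(group) for group in match.groups()))


if __name__ == "__main__":
    print(parse_market1501_name("0002_c1s1_000451_03.jpg"))
    # Market1501Name(person_id=2, camera=1, sequence=1, frame=451, bbox=3)
    print(parse_market1501_name("Thumbs.db"))
    # None
```

Returning None for names that do not match lets the caller decide whether to skip or log unexpected files, instead of failing in the middle of a long copy loop.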
