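# Building an Action Recognition Dataset from Scratch: A Complete Hands-On Guide with MMAction2 and PoseC3D

When I first tried to train an action recognition model on self-recorded videos in the lab, I hit every pitfall imaginable: failed keypoint extraction, inconsistent dataset formats, and baffling dimension errors during training. If you are struggling to turn a pile of video files into a standard dataset that PoseC3D can read, this hands-on guide may save you around 40 hours of trial and error.

## 1. Environment Setup and Toolchain

Before processing any video data, you need a complete toolchain. This is more than a single `pip install`: several OpenMMLab frameworks have to work together, so create a dedicated conda environment:

```shell
conda create -n mmaction python=3.8 -y
conda activate mmaction
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
```

Key components to install:

| Component | Version requirement | Install command |
| --- | --- | --- |
| MMDetection | ≥3.0.0 | `pip install mmdet==3.2.0` |
| MMPose | ≥1.0.0 | `pip install mmpose==1.0.0` |
| MMAction2 | ≥1.0.0 | `pip install mmaction2==1.0.0` |
| MMEngine | ≥0.7.0 | `pip install mmengine==0.7.0` |

These 1.x/3.x releases also depend on MMCV 2.x, which is not in the list above. If pip reports a version conflict between the pinned packages, install the versions it asks for.

Verify the installation:

```python
import mmdet, mmpose, mmaction
print(mmdet.__version__, mmpose.__version__, mmaction.__version__)
```

Common problems and fixes:

- Error `No module named 'setuptools.command.build'`: run `python -m pip install --upgrade pip setuptools wheel`.
- CUDA version mismatch: CUDA 11.3 or later is recommended; check your version with `nvcc --version`.

## 2. Preprocessing the Video Material

Raw videos usually differ in resolution and length, so standardize them first:

```python
import os

import cv2


def preprocess_video(input_path, output_dir, target_resolution=(1280, 720)):
    """Resize one video to target_resolution and re-encode it as XVID .avi."""
    cap = cv2.VideoCapture(input_path)
    # Keep the fractional frame rate (e.g. 29.97); fall back to 30 if unknown.
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    fourcc = cv2.VideoWriter_fourcc(*'XVID')
    os.makedirs(output_dir, exist_ok=True)
    # XVID is written into an .avi container, so force that extension.
    stem = os.path.splitext(os.path.basename(input_path))[0]
    output_path = os.path.join(output_dir, stem + '.avi')
    out = cv2.VideoWriter(output_path, fourcc, fps, target_resolution)
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:  # end of stream or read error
            break
        # cv2.resize takes (width, height), matching target_resolution.
        resized_frame = cv2.resize(frame, target_resolution)
        out.write(resized_frame)
    cap.release()
    out.release()
```

Expected directory layout, one folder per action class:

```
my_dataset/
├── boxing/
│   ├── video1.avi
│   └── video2.avi
├── handclapping/
│   ├── video3.avi
│   └── video4.avi
└── handwaving/
    ├── video5.avi
    └── video6.avi
```

> Tip: prepare at least 50 video samples per class, and keep each video between 5 and 10 seconds long.

## 3. Skeleton Keypoint Extraction and Data Conversion

The `ntu_pose_extraction.py` script shipped with MMAction2 has to be adapted before it can process a custom dataset. The key change is a batch loop that derives the label from the folder name:

```python
# Modified batch keypoint extraction
import os

import mmengine


def batch_pose_extraction(video_root, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    # Sort so that the folder-to-label mapping is the same on every run.
    action_dirs = sorted(
        d for d in os.listdir(video_root)
        if os.path.isdir(os.path.join(video_root, d)))
    for label_idx, action in enumerate(action_dirs):
        video_files = sorted(
            f for f in os.listdir(os.path.join(video_root, action))
            if f.endswith(('.avi', '.mp4')))
        for video_file in video_files:
            video_path = os.path.join(video_root, action, video_file)
            stem = os.path.splitext(video_file)[0]
            # Prefix the class name so equal file names in different classes
            # do not overwrite each other.
            output_path = os.path.join(output_dir, f'{action}_{stem}.pkl')
            # ntu_pose_extraction is the modified version of the function in
            # the MMAction2 script, changed to accept the label explicitly.
            anno = ntu_pose_extraction(video_path, label_idx)
            mmengine.dump(anno, output_path)
```

Run the extraction:

```shell
python modified_pose_extraction.py --video-root ./my_dataset --output-dir ./pkl_output
```

Each resulting annotation has the following structure, where N is the number of persons, T the number of frames, and V the number of keypoints:

```python
{
    'keypoint': np.ndarray,        # shape (N, T, V, 2)
    'keypoint_score': np.ndarray,  # shape (N, T, V)
    'frame_dir': str,
    'label': int,
    'total_frames': int
}
```

The MMAction2 skeleton annotation format also includes `img_shape` and `original_shape` for 2D poses, so keep those fields if your extraction function produces them.

## 4. Packaging the Dataset and Adjusting the Config

The individual pkl files have to be merged into one annotation file with a train/val split:

```python
import os
import pickle

from sklearn.model_selection import train_test_split


def merge_pkl_files(pkl_dir, output_path, test_size=0.2):
    all_data = []
    for pkl_file in sorted(os.listdir(pkl_dir)):
        if pkl_file.endswith('.pkl'):
            with open(os.path.join(pkl_dir, pkl_file), 'rb') as f:
                all_data.append(pickle.load(f))
    # Stratify by label so both splits keep the class balance;
    # a fixed seed makes the split reproducible.
    train_data, val_data = train_test_split(
        all_data,
        test_size=test_size,
        stratify=[d['label'] for d in all_data],
        random_state=0)
    with open(output_path, 'wb') as f:
        pickle.dump({
            'split': {
                'train': [d['frame_dir'] for d in train_data],
                'val': [d['frame_dir'] for d in val_data]
            },
            'annotations': all_data
        }, f)
```

Key parameters to change in the PoseC3D config:

```python
model = dict(
    cls_head=dict(
        num_classes=6  # set to your actual number of classes (3 for the example tree above)
    ))
dataset_type = 'PoseDataset'
ann_file = 'path/to/your_merged.pkl'  # the merged pkl file
# Training schedule
train_cfg = dict(
    max_epochs=240,  # adjust to the dataset size
    val_interval=5)
```

## 5. Training Optimization and Practical Tips

Example command to start training. The original used the 0.x-style keys `data.videos_per_gpu` and `optimizer.lr`; with the MMAction2 1.x versions installed above, the corresponding keys are `train_dataloader.batch_size` and `optim_wrapper.optimizer.lr`:

```shell
python tools/train.py configs/skeleton/posec3d/your_config.py \
    --work-dir work_dirs/your_exp \
    --cfg-options \
    train_dataloader.batch_size=16 \
    optim_wrapper.optimizer.lr=0.1
```

Performance optimization strategies:

Data augmentation: add spatio-temporal augmentation to the pipeline. Check that these transform names are registered in your MMAction2 version; if they are not, they need to be implemented as custom transforms.

```python
train_pipeline = [
    dict(type='PoseRandomRotate', max_angle=20),
    dict(type='PoseRandomScale', scale_range=0.2)
]
```

Mixed-precision training: in 0.x-style configs, add `fp16 = dict(loss_scale=512.0)`; with MMAction2 1.x, pass `--amp` to `tools/train.py` instead.

Learning-rate warmup: a 5-epoch linear warmup followed by cosine annealing over the remaining 235 epochs.

```python
param_scheduler = [
    dict(type='LinearLR', start_factor=0.1, by_epoch=True, begin=0, end=5),
    dict(type='CosineAnnealingLR', T_max=235, by_epoch=True, begin=5, end=240)
]
```

Troubleshooting table:

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Keypoint dimension mismatch | Inconsistent video resolution | Preprocess all videos to 1280x720 |
| `CUDA out of memory` | Batch size too large | Reduce the batch size (`train_dataloader.batch_size`, or `videos_per_gpu` in 0.x configs) |
| Validation accuracy fluctuates widely | Imbalanced class distribution | Use a weighted sampling strategy |
| Training loss does not decrease | Poorly chosen learning rate | Try a learning rate between 0.01 and 0.1 |

In a real project on fitness action recognition, I found that setting the skeleton sequence length to 96 frames (about 3 seconds) improved accuracy by about 7% over the default 48 frames. The takeaway is that `clip_len` should be tuned to the duration of the actions you want to recognize.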
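The observation above can be turned into a rough starting heuristic. The sketch below is not part of MMAction2: the helper name `suggest_clip_len`, the rounding multiple of 16, and the cap of 128 frames are all assumptions made for illustration. It only follows the framing used here, where the clip length in frames is roughly the action duration times the frame rate. In the stock PoseC3D configs, `clip_len` is set on the `UniformSampleFrames` pipeline step, so the result is plugged in there.

```python
import math


def suggest_clip_len(action_seconds, fps, multiple=16, max_len=128):
    """Suggest a clip_len covering one action (illustrative heuristic only).

    The frame count (duration * fps) is rounded up to the next multiple of
    `multiple` and capped at `max_len`. Both defaults are assumptions.
    """
    if action_seconds <= 0 or fps <= 0:
        raise ValueError('action_seconds and fps must be positive')
    frames = action_seconds * fps
    rounded = math.ceil(frames / multiple) * multiple
    return min(rounded, max_len)


# A 3-second action at 30 fps is 90 frames, rounded up to 96.
clip_len = suggest_clip_len(3.0, 30.0)

# Where the value goes in a PoseC3D-style pipeline step.
sampling_step = dict(type='UniformSampleFrames', clip_len=clip_len)
print(sampling_step)
```

Treat the suggested value as a first guess and confirm it on your validation set; longer clips also increase memory use, so the batch size may need to drop accordingly.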