Preface

This tutorial is designed to help robotics beginners get started with MuJoCo reinforcement learning quickly. It proceeds step by step, from environment setup, to simple motion control, to autonomous exploration with reinforcement learning, with gradually increasing difficulty, so that beginners can build a learning roadmap and mental framework and then explore custom features and innovations on top of it. Reference code for all projects is provided at https://github.com/gimmyy/Tutorial_Mujoco_RL/tree/main; readers are encouraged to combine theory with practice and to do their own secondary development. Reference links are collected at the end of the article.

Note: the author's operating system is Windows 10. Because of system differences, some code may fail if copied verbatim, so adapt it to your own OS. Training runs on CPU only, which is very friendly to beginners without a GPU or with limited compute.

1. MuJoCo installation and environment setup

The MuJoCo package exists in old and new versions, and many tutorials found online still describe the old versions (MuJoCo 200/210), whose installation requires importing a license key and manually configuring the install location; this can be misleading. In fact, since 2022 MuJoCo has been released by DeepMind with an official, standard Python interface, while the old MuJoCo 2xx workflow (whose Python bindings, mujoco-py, were maintained by OpenAI) is no longer updated. We do not install the old version: the new version is a standard Python package and installs automatically with just a couple of commands.

Step 1. Create and activate a conda virtual environment:

```bash
conda create -n orca python=3.11
conda activate orca
```

Step 2. Install the MuJoCo library:

```bash
pip install mujoco
```

pip automatically installs a mujoco build that matches your Python version.

Step 3. Test. Type python at the command line to enter the Python interpreter, then enter the following commands line by line:

```python
import mujoco
print(mujoco.__version__)
```

The printed version (3.6.0 in the author's environment) indicates a successful installation.

2. Hands-on 1: Loading a robot model in MuJoCo and implementing simple control

Learning points: building a MuJoCo project, modifying and importing a robot model file (MJCF), the simulation loop, visualizing results.

This section follows the blog post https://blog.csdn.net/Vint_LU/article/details/147948556. The project uses a UR5e arm to push an object placed at a fixed position. By reproducing that article you close the full loop: building a MuJoCo project, modifying the MJCF robot model file, importing the model, running the simulation, and visualizing the result.

3. Hands-on 2: MuJoCo + RL inverted pendulum

Learning points: setting up a reinforcement learning training framework, adapting the MuJoCo physics environment, training an RL agent.

This section follows the official gymnasium documentation: https://gymnasium.org.cn/tutorials/training_agents/mujoco_reinforce/. The official code does not visualize the simulation, so I added a visualization module; see https://github.com/gimmyy/Tutorial_Mujoco_RL/tree/main/2tutorial_InvertedPendulum.

The project controls an inverted pendulum (InvertedPendulum) with reinforcement learning inside the MuJoCo physics simulation. The RL algorithm is a policy gradient method, and the model is a custom network, Policy_Network:

```python
import torch
from torch import nn


class Policy_Network(nn.Module):
    """Parametrized Policy Network."""

    def __init__(self, obs_space_dims: int, action_space_dims: int):
        """Initializes a neural network that estimates the mean and standard deviation
        of a normal distribution from which an action is sampled.

        Args:
            obs_space_dims: Dimension of the observation space
            action_space_dims: Dimension of the action space
        """
        super().__init__()

        hidden_space1 = 16  # Nothing special with 16, feel free to change
        hidden_space2 = 32  # Nothing special with 32, feel free to change

        # Shared Network
        self.shared_net = nn.Sequential(
            nn.Linear(obs_space_dims, hidden_space1),
            nn.Tanh(),
            nn.Linear(hidden_space1, hidden_space2),
            nn.Tanh(),
        )

        # Policy Mean specific Linear Layer
        self.policy_mean_net = nn.Sequential(
            nn.Linear(hidden_space2, action_space_dims)
        )

        # Policy Std Dev specific Linear Layer
        self.policy_stddev_net = nn.Sequential(
            nn.Linear(hidden_space2, action_space_dims)
        )

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """Conditioned on the observation, returns the mean and standard deviation
        of a normal distribution from which an action is sampled.

        Args:
            x: Observation from the environment

        Returns:
            action_means: predicted mean of the normal distribution
            action_stddevs: predicted standard deviation of the normal distribution
        """
        shared_features = self.shared_net(x.float())

        action_means = self.policy_mean_net(shared_features)
        action_stddevs = torch.log(
            1 + torch.exp(self.policy_stddev_net(shared_features))
        )

        return action_means, action_stddevs
```
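This network is driven by a REINFORCE-style policy-gradient update. The sketch below is only a minimal illustration of how the two network outputs are typically consumed: the agent class, its hyperparameters, and the helper names are illustrative rather than the tutorial's exact code, and only Policy_Network is assumed from above.

```python
import numpy as np
import torch
from torch.distributions.normal import Normal


class SimpleREINFORCE:
    """Illustrative sketch of a REINFORCE agent built around Policy_Network."""

    def __init__(self, obs_space_dims: int, action_space_dims: int):
        self.gamma = 0.99  # discount factor
        self.net = Policy_Network(obs_space_dims, action_space_dims)
        self.optimizer = torch.optim.AdamW(self.net.parameters(), lr=1e-4)
        self.log_probs: list[torch.Tensor] = []
        self.rewards: list[float] = []

    def sample_action(self, state: np.ndarray) -> np.ndarray:
        # Gaussian policy: sample an action from the predicted mean/stddev
        mean, stddev = self.net(torch.tensor(np.array([state])))
        dist = Normal(mean, stddev + 1e-6)
        action = dist.sample()
        self.log_probs.append(dist.log_prob(action).sum())
        return action.numpy().flatten()

    def update(self) -> None:
        # Monte-Carlo returns, computed backwards through the episode
        returns, running = [], 0.0
        for r in reversed(self.rewards):
            running = r + self.gamma * running
            returns.insert(0, running)
        # policy-gradient loss: negative log-probabilities weighted by returns
        loss = sum(-lp * g for lp, g in zip(self.log_probs, returns))
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        self.log_probs, self.rewards = [], []
```

During each episode the training loop appends the environment reward to agent.rewards after every step and calls agent.update() when the episode ends; the official tutorial implements its own agent along these lines, and the link above has the exact code.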
The wrapper for the InvertedPendulum-v4 environment is already integrated into the gym library:

```python
import gymnasium as gym

env = gym.make("InvertedPendulum-v4")
wrapped_env = gym.wrappers.RecordEpisodeStatistics(env, 50)  # Records episode-reward
```

The step and reward functions do not need to be written by hand either; InvertedPendulum-v4 ships with ready-made implementations that can be called directly:

```python
obs, reward, terminated, truncated, info = wrapped_env.step(action)
```

Running the code, you can watch the inverted pendulum's state in real time in the MuJoCo simulation. We find that the gymnasium library integrates with MuJoCo very well, which makes later reinforcement learning on more complex control problems much more convenient.

4. Hands-on 3: MuJoCo + RL for robot obstacle avoidance and path planning

Learning points: writing a custom robot environment class, designing the reward function, using the RL models packaged by the stable_baselines3 library.

This section follows https://github.com/LitchiCheng/mujoco-learning (rl_panda_reach_target_high_profile.py, "MuJoCo simulation, PPO reinforcement learning, end-effector path planning to a target position"). The project uses PPO to drive a robot arm's end-effector around an obstacle to a specified target position. Unlike the previous section, because stable_baselines3 is introduced, the RL model no longer needs to be written by hand: it is instantiated directly from the library, and the network architecture and hyperparameters are changed only through parameters. The env class and the reward function, however, must be designed and tuned by the user for the control objective. This division of labor lets users focus more on the robot-control side instead of over-engineering the RL model itself; a minimal sketch of what such a user-written environment looks like is given next, before the full class.
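As a hedged illustration of the interface stable_baselines3 expects from a custom environment, here is a bare-bones skeleton. The class name, observation size, and zero dynamics are placeholders invented for illustration; the real environment used in this project, PandaObstacleEnv, follows right after.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3.common.env_checker import check_env


class MinimalReachEnv(gym.Env):
    """Illustrative skeleton only: shows the members SB3 relies on
    (observation_space, action_space, reset, step), with dummy dynamics."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(7,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(14,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        obs = np.zeros(14, dtype=np.float32)
        return obs, {}  # gymnasium API: (observation, info)

    def step(self, action):
        obs = np.zeros(14, dtype=np.float32)
        reward = 0.0  # reward shaping goes here
        terminated, truncated = False, False
        return obs, reward, terminated, truncated, {}


check_env(MinimalReachEnv())  # raises if the interface does not match what SB3 expects
```

check_env is a convenient sanity check before launching a long training run; the real PandaObstacleEnv below fills the same four members with MuJoCo stepping, goal and obstacle randomization, and a shaped reward.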
The environment class inherits from gym.Env and matches the model interface provided by stable_baselines3, so it can be passed directly as the env argument when the PPO model is initialized:

```python
import time
from typing import Optional

import gymnasium as gym
import mujoco
import mujoco.viewer
import numpy as np
from gymnasium import spaces


class PandaObstacleEnv(gym.Env):
    def __init__(self, visualize: bool = False):
        super(PandaObstacleEnv, self).__init__()
        # check_flag_file / write_flag_file are helpers from the reference script;
        # only the first environment instance gets a viewer
        if not check_flag_file():
            write_flag_file()
            self.visualize = visualize
        else:
            self.visualize = False
        self.handle = None
        self.model = mujoco.MjModel.from_xml_path("./model/franka_emika_panda/scene_pos_with_obstacles.xml")
        self.data = mujoco.MjData(self.model)
        if self.visualize:
            self.handle = mujoco.viewer.launch_passive(self.model, self.data)
            self.handle.cam.distance = 3.0
            self.handle.cam.azimuth = 0.0
            self.handle.cam.elevation = -30.0
            self.handle.cam.lookat = np.array([0.2, 0.0, 0.4])
        self.np_random = np.random.default_rng(None)
        self.end_effector_id = mujoco.mj_name2id(self.model, mujoco.mjtObj.mjOBJ_BODY, "ee_center_body")
        self.home_joint_pos = np.array(self.model.key_qpos[0][:7], dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(7,), dtype=np.float32)
        # 7 joint angles, goal position, obstacle position, obstacle sphere radius
        self.obs_size = 7 + 3 + 3 + 1
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(self.obs_size,), dtype=np.float32)
        # self.goal_position = np.array([0.4, 0.3, 0.4], dtype=np.float32)
        self.goal_position = np.array([0.4, -0.3, 0.4], dtype=np.float32)
        self.goal_arrival_threshold = 0.005
        self.goal_visu_size = 0.02
        self.goal_visu_rgba = [0.1, 0.3, 0.3, 0.8]
        for i in range(self.model.ngeom):
            name = mujoco.mj_id2name(self.model, mujoco.mjtObj.mjOBJ_GEOM, i)
            if name == "obstacle_0":
                self.obstacle_id_1 = i
        self.obstacle_position = np.array(self.model.geom_pos[self.obstacle_id_1], dtype=np.float32)
        self.obstacle_size = self.model.geom_size[self.obstacle_id_1][0]
        self.last_action = self.home_joint_pos

    def _render_scene(self) -> None:
        if not self.visualize or self.handle is None:
            return
        self.handle.user_scn.ngeom = 0
        total_geoms = 1
        self.handle.user_scn.ngeom = total_geoms
        # draw the goal position as a small sphere
        mujoco.mjv_initGeom(
            self.handle.user_scn.geoms[0],
            mujoco.mjtGeom.mjGEOM_SPHERE,
            size=[self.goal_visu_size, 0.0, 0.0],
            pos=self.goal_position,
            mat=np.eye(3).flatten(),
            rgba=np.array(self.goal_visu_rgba, dtype=np.float32)
        )

    def reset(self, seed: Optional[int] = None, options: Optional[dict] = None) -> tuple[np.ndarray, dict]:
        super().reset(seed=seed)
        if seed is not None:
            self.np_random = np.random.default_rng(seed)
        # reset the joints to the home pose
        mujoco.mj_resetData(self.model, self.data)
        self.data.qpos[:7] = self.home_joint_pos
        self.data.qpos[7:] = [0.04, 0.04]
        mujoco.mj_forward(self.model, self.data)
        # randomize the goal and obstacle along the y axis
        self.goal_position = np.array([self.goal_position[0],
                                       self.np_random.uniform(-0.3, 0.3),
                                       self.goal_position[2]], dtype=np.float32)
        self.obstacle_position = np.array([self.obstacle_position[0],
                                           self.np_random.uniform(-0.3, 0.3),
                                           self.obstacle_position[2]], dtype=np.float32)
        self.model.geom_pos[self.obstacle_id_1] = self.obstacle_position
        mujoco.mj_step(self.model, self.data)
        if self.visualize:
            self._render_scene()
        obs = self._get_observation()
        self.start_t = time.time()
        return obs, {}

    def _get_observation(self) -> np.ndarray:
        joint_pos = self.data.qpos[:7].copy().astype(np.float32)
        return np.concatenate([joint_pos,
                               self.goal_position,
                               self.obstacle_position + np.random.normal(0, 0.001, size=3),
                               np.array([self.obstacle_size], dtype=np.float32)])

    def _calc_reward(self, joint_angles: np.ndarray, action: np.ndarray) -> tuple[float, float, bool]:
        now_ee_pos = self.data.body(self.end_effector_id).xpos.copy()
        dist_to_goal = np.linalg.norm(now_ee_pos - self.goal_position)
        # nonlinear distance reward
        if dist_to_goal < self.goal_arrival_threshold:
            distance_reward = 20.0 * (1.0 + (1.0 - (dist_to_goal / self.goal_arrival_threshold)))
        elif dist_to_goal < 2 * self.goal_arrival_threshold:
            distance_reward = 10.0 * (1.0 + (1.0 - (dist_to_goal / (2 * self.goal_arrival_threshold))))
        elif dist_to_goal < 3 * self.goal_arrival_threshold:
            distance_reward = 5.0 * (1.0 + (1.0 - (dist_to_goal / (3 * self.goal_arrival_threshold))))
        else:
            distance_reward = 1.0 / (1.0 + dist_to_goal)
        # action smoothness penalty
        smooth_penalty = 0.001 * np.linalg.norm(action - self.last_action)
        # collision penalty
        contact_reward = 10.0 * self.data.ncon
        # joint-limit penalty
        joint_penalty = 0.0
        for i in range(7):
            min_angle, max_angle = self.model.jnt_range[:7][i]
            if joint_angles[i] < min_angle:
                joint_penalty += 0.5 * (min_angle - joint_angles[i])
            elif joint_angles[i] > max_angle:
                joint_penalty += 0.5 * (joint_angles[i] - max_angle)
        time_penalty = 0.001 * (time.time() - self.start_t)
        total_reward = (distance_reward - contact_reward - smooth_penalty
                        - joint_penalty - time_penalty)
        self.last_action = action.copy()
        return total_reward, dist_to_goal, self.data.ncon > 0

    def step(self, action: np.ndarray) -> tuple[np.ndarray, np.float32, bool, bool, dict]:
        # map the normalized action in [-1, 1] to each joint's range
        joint_ranges = self.model.jnt_range[:7]
        scaled_action = np.zeros(7, dtype=np.float32)
        for i in range(7):
            scaled_action[i] = joint_ranges[i][0] + (action[i] + 1) * 0.5 * (joint_ranges[i][1] - joint_ranges[i][0])
        self.data.ctrl[:7] = scaled_action
        self.data.qpos[7:] = [0.04, 0.04]
        mujoco.mj_step(self.model, self.data)
        reward, dist_to_goal, collision = self._calc_reward(self.data.qpos[:7], action)
        terminated = False
        if collision:
            reward -= 10.0
            terminated = True
        if dist_to_goal < self.goal_arrival_threshold:
            terminated = True
            print(f"[Success] distance to goal: {dist_to_goal:.3f}, reward: {reward:.3f}")
        if not terminated:
            if time.time() - self.start_t > 20.0:
                reward -= 10.0
                print("[Timeout] episode took too long")
                terminated = True
        if self.visualize and self.handle is not None:
            self.handle.sync()
            time.sleep(0.01)
        obs = self._get_observation()
        info = {
            "is_success": not collision and terminated and (dist_to_goal < self.goal_arrival_threshold),
            "distance_to_goal": dist_to_goal,
            "collision": collision
        }
        return obs, reward.astype(np.float32), terminated, False, info

    def seed(self, seed: Optional[int] = None) -> list[Optional[int]]:
        self.np_random = np.random.default_rng(seed)
        return [seed]

    def close(self) -> None:
        if self.visualize and self.handle is not None:
            self.handle.close()
            self.handle = None
```

The PPO model is instantiated directly from the library, with hyperparameters set through its arguments:

```python
import torch
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# ENV_KWARGS, POLICY_KWARGS and n_envs are defined elsewhere in the reference script
env = make_vec_env(
    env_id=lambda: PandaObstacleEnv(**ENV_KWARGS),
    n_envs=n_envs,
    seed=42,
    vec_env_cls=SubprocVecEnv,
    vec_env_kwargs={"start_method": "spawn"}  # alternatively "fork"
)
model = PPO(
    policy="MlpPolicy",
    env=env,
    policy_kwargs=POLICY_KWARGS,
    verbose=1,
    n_steps=2048,
    batch_size=2048,
    n_epochs=10,
    gamma=0.99,
    ent_coef=0.02,  # larger entropy coefficient (was 0.001) keeps late-stage exploration and improves generalization
    clip_range=0.15,  # limit the size of each policy update
    max_grad_norm=0.5,  # gradient clipping to prevent exploding gradients
    learning_rate=lambda f: 1e-4 * (1 - f),  # linear learning-rate decay, starting from 1e-4
    device="cuda" if torch.cuda.is_available() else "cpu",
    tensorboard_log="./tensorboard/panda_obstacle_avoidance/"
)
```
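The excerpt stops at constructing the model. Below is a hedged sketch of what typically follows: training, saving a checkpoint, and a visual test rollout. The timestep budget, checkpoint path, and rollout length are illustrative values, not taken from the reference script.

```python
from stable_baselines3 import PPO

model.learn(total_timesteps=1_000_000)   # budget is illustrative
model.save("./ckpt/ppo_panda_obstacle")  # path is illustrative
env.close()

# Reload the checkpoint and roll it out in a single environment that
# requests the passive viewer (subject to the flag-file logic above)
test_env = PandaObstacleEnv(visualize=True)
model = PPO.load("./ckpt/ppo_panda_obstacle", env=test_env)

obs, _ = test_env.reset()
for _ in range(2000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = test_env.step(action)
    if terminated or truncated:
        obs, _ = test_env.reset()
test_env.close()
```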
Model test results: the small sphere near the end-effector marks the target position and the large gray sphere is the obstacle; the end-effector is able to move around the obstacle and reach the target position.

5. Hands-on 4: MuJoCo + OrcaHand cube reorientation

Learning points: understanding the OrcaHand model files, customizing the scene and the training framework.

This section follows the orcahand open-source GitHub repository: https://github.com/orcahand/orca_sim. The authors only provide a random policy for the cube_orientation task; building on that task framework, we train a reinforcement learning model so that the dexterous hand flips the cube into a target orientation.

First, define the environment classes. Since the code is long, only the key points are listed here. The parent class BaseOrcaHandEnv inherits from gym.Env:

```python
class BaseOrcaHandEnv(gym.Env[np.ndarray, np.ndarray]):
```

The cube-reorientation task is then a subclass of BaseOrcaHandEnv:

```python
class OrcaHandRightCubeOrientation(BaseOrcaHandEnv):
```

The observation and reward functions are defined in this class:

```python
def _get_obs(self) -> np.ndarray:
    base = super()._get_obs()
    if not hasattr(self, "_cube_qpos_adr"):
        return base
    return np.concatenate([
        base,
        self._cube_red_face_world_normal(),
        np.array([self._red_face_up_alignment()], dtype=np.float64)
    ])

def _get_reward(self) -> float:
    align = 0.5 * (self._red_face_up_alignment() + 1)  # reward for the red face pointing up
    lift = np.clip(self.data.xpos[self._cube_body_id, 2] - 0.12, 0, 0.12) / 0.12  # reward for lifting the cube
    drop = 1.0 if self._cube_dropped() else 0.0  # penalty for dropping the cube
    return float(align + 0.1 * lift - drop)
```

PPO training model:

```python
model = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(
        activation_fn=nn.LeakyReLU,
        net_arch=dict(pi=[256, 128], vf=[256, 128])
    ),
    learning_rate=3e-4,
    batch_size=256,
    n_steps=1024,
    gamma=0.99,
    verbose=1
)
```

Training result: the fingers keep moving and turn the cube so that its red face points up.

Having worked through the above, you should now have a working mental framework for MuJoCo + RL. From here you can train and tune on your own, based on your own target task and choice of robot.

References
[1] https://blog.csdn.net/Vint_LU/article/details/147948556
[2] https://gymnasium.org.cn/tutorials/training_agents/mujoco_reinforce/
[3] https://github.com/LitchiCheng/mujoco-learning
[4] https://github.com/orcahand/orca_sim