# Deploying Qwen3.6 on Ascend 910B (vLLM-Ascend)
## 1️⃣ Environment

### 1.1 Hardware

| Item | Value |
| --- | --- |
| NPU | Ascend 910B (A2 architecture) |
| Cards | multi-card, e.g. 8 cards (davinci0~7) |

### 1.2 Software

| Item | Value |
| --- | --- |
| OS | openEuler |
| CANN | 8.5.0 |
| Inference framework | vLLM (Ascend version) |

### 1.3 Model

- Model: Qwen3.6-35B-A3B
- Example local path: `/data/yourproject/models/qwen3.6-35b-a3b`

```bash
huggingface-cli download Qwen/Qwen3.6-35B-A3B \
  --local-dir /data/yourproject/models/qwen3.6-35b-a3b
```

## 2️⃣ Choosing the image

❗ The image must match the NPU architecture.

| NPU | Image |
| --- | --- |
| A2 (910/910B) | `docker pull quay.io/ascend/vllm-ascend:v0.18.0rc1-openeuler` |
| A3 | `docker pull quay.io/ascend/vllm-ascend:v0.18.0rc1-a3-openeuler` |

## 3️⃣ Starting the container

✅ Recommended command (standard):

```bash
docker run -it \
  --name qwen3_6_vllm \
  --privileged \
  --shm-size 64g \
  -v /usr/local/Ascend:/usr/local/Ascend \
  -v /dev:/dev \
  -v /data/yourproject/models/qwen3.6-35b-a3b:/models \
  -v /data/yourproject:/workspace \
  -p 8010:8010 \
  quay.io/ascend/vllm-ascend:v0.18.0rc1-openeuler \
  bash
```

Parameter notes:

| Parameter | Purpose |
| --- | --- |
| `--device` | Maps individual NPUs (e.g. `--device /dev/davinci0`); mounting `-v /dev:/dev` directly is simpler |
| `-v /usr/local/Ascend` | Mounts the host driver/CANN into the container |
| `--shm-size 64g` | Avoids OOM with large models |
| `/models` | Model directory |
| `/workspace` | Project directory |
| `--privileged` | Avoids permission issues |

## 4️⃣ Ascend environment variables and dynamic-library troubleshooting

The Ascend inference environment is essentially a layered runtime:

1. CANN (base execution layer)
2. ATB (operator optimization layer)
3. Inference engine (MindIE / vLLM, etc.)

### The nature of the problem

Common errors:

```text
ImportError: libascend_hal.so: cannot open shared object file
OSError: libatb.so: cannot open shared object file
```

Root cause: the dynamic-library search path (`LD_LIBRARY_PATH`) is not configured correctly.

### ❗ Problem 1: inconsistent CANN version paths

- Actual path: `/usr/local/Ascend/cann-8.5.0`
- Path referenced by scripts: `cann-8.5.1`

✅ Option 1 (temporary verification only):

```bash
ln -s /usr/local/Ascend/cann-8.5.0 /usr/local/Ascend/cann-8.5.1
```

⚠️ Not recommended for long-term use; there may be ABI risks.

✅ Option 2 (recommended): unify the environment variables

```bash
export ASCEND_HOME=/usr/local/Ascend
export ASCEND_TOOLKIT_HOME=$ASCEND_HOME/cann-8.5.0
```

### ❗ Problem 2: dynamic libraries not found (the core issue)

✅ One-time fix (recommended):

```bash
export ASCEND_HOME=/usr/local/Ascend
export ASCEND_TOOLKIT_HOME=$ASCEND_HOME/cann-8.5.0
export LD_LIBRARY_PATH=\
$ASCEND_HOME/driver/lib64:\
$ASCEND_HOME/driver/lib64/driver:\
$ASCEND_TOOLKIT_HOME/lib64:\
$ASCEND_TOOLKIT_HOME/lib64/plugin/opskernel:\
$ASCEND_HOME/nnal/atb/latest/atb/cxx_abi_1/lib:\
$LD_LIBRARY_PATH
```

## 5️⃣ Starting the vLLM inference service

Startup command:

```bash
vllm serve /models \
  --served-model-name qwen3.6 \
  --host 0.0.0.0 \
  --port 8010 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --trust-remote-code \
  --async-scheduling
```

Parameter notes:

| Parameter | Purpose |
| --- | --- |
| `--tensor-parallel-size` | Number of cards used for tensor parallelism |
| `--max-model-len` | Maximum context length |
| `--async-scheduling` | Improves throughput |

To run on specific cards, set this before starting:

```bash
export ASCEND_VISIBLE_DEVICES=0,1
```

## 6️⃣ Testing the API

```bash
curl http://localhost:8010/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6",
    "prompt": "The future of AI is",
    "max_tokens": 100
  }'
```

Multimodal request (chat-style messages, so it goes to `/v1/chat/completions`):

```bash
curl http://localhost:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustration?"}
      ]}
    ]
  }'
```

## 7️⃣ Performance tuning suggestions

Reference: the model page on ModelScope — https://modelscope.cn/models/Qwen/Qwen3.6-35B-A3B
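Before changing any tuning parameters, it helps to establish a baseline. Below is a minimal sketch of a concurrency check against the endpoint started in section 5️⃣; it assumes the service is listening on `localhost:8010` with served model name `qwen3.6`, and the request count `N=16` is an arbitrary value chosen for illustration.

```bash
#!/usr/bin/env bash
# Rough load check: fire N concurrent completion requests and report wall-clock time.
# Assumes the vLLM service from section 5 is running on localhost:8010
# with --served-model-name qwen3.6; adjust both to your deployment.
N=16   # number of concurrent requests (illustration value)

START=$(date +%s)
for i in $(seq 1 "$N"); do
  curl -s http://localhost:8010/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3.6", "prompt": "The future of AI is", "max_tokens": 128}' \
    > /dev/null &
done
wait   # wait for all background curl processes to finish
END=$(date +%s)
echo "Finished $N concurrent requests in $((END - START)) s"
```

Running `npu-smi info` in another terminal while this executes gives a quick view of per-card utilization and memory, which makes it easier to judge whether changes such as a different `--tensor-parallel-size` actually help.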