# Cube-to-Vec-to-Cube-to-Vec Pattern

[Free download link] cannbot-skills: CANNBot is a series of agents for CANN development that improve development efficiency; this repository provides reusable Skills modules for them. Project address: https://gitcode.com/cann/cannbot-skills

Generic baseline only. For a2 (b3) kernels, prefer `agent/references/patterns/a2-cube-vec-cube-vec.md` (and the softmax variant `a2-cube-vec-cube-vec-softmax.md`), which add delayed-consumer and running-statistic rules specific to a2.

Read this file when one cube stage feeds vec logic, then another cube stage, then a final vec stage. This is the highest-complexity staged pattern currently worth documenting as a dedicated route.

## Use this pattern when

- there are at least two cube-heavy stages with vec-side logic between them
- one tile may be produced in one iteration and consumed in a later iteration
- delayed state such as softmax stats or rescale factors must follow the consumer lifetime

## Minimal flow

- `cube stage 1 -> vec stage 1 -> cube stage 2 -> vec stage 2`

In practice this often becomes a one-tile lookahead schedule with warmup and drain.

## What usually matters most

- keeping producer and delayed consumer lifetimes separate
- giving delayed stages their own counters
- deciding whether the bridge should stay on chip or go through GM workspace
- keeping scalar state aligned with the delayed consumer, not with the original producer
- validating each stage before trusting the fused version

## Stable repository lessons

- if stage 2 reuses a stage 1 operand one iteration later, keep that operand on chip when the lifetime fits
- if the reuse does not fit cleanly, materialize an explicit GM workspace instead of forcing a fake on-chip story
- do not normalize too early when the numerator and denominator streams must both finish first
- when the live query side is truly one row, flatten `(B, H)` into one `BH` axis and keep `rows = 1` instead of forcing a wider row tile
- for half-input `BASES = 256` attention on a5, keep the outer `256` tile in L1, use `splitk = 64` for `q @ k.t()`, and `splitn = 64` for `p @ v`
- for fp8 decode attention with external scales, mask invalid tail columns to `-inf` before `rowmax`, scale the probability tile only after the float `row_sum` update, and compensate with a final `scale_v / P_SCALE`
- if the delayed cube consumer wants packed-NZ input, pack the vec-produced tile in UB first, then publish that NZ view into `L1`

## One-tile lookahead scheduling detail

The retained MLA kernel (`agent/example/kernels/a5/test_mla_entire.py`) uses a four-stage on-chip flow:

- cube: produce score tile `i`
- vec: update streaming softmax state and cast score tile `i` to probability tile `i`
- cube: consume delayed probability tile `i-1` with the matching value/key tile
- vec: rescale and accumulate the delayed output tile `i-1`

Stable control pattern:

- `for s in range(0, S + TILE, TILE)` with:
  - `if s < S`: producer side (warmup + steady state)
  - `if s > 0`: delayed consumer side (steady state + drain)

On-chip operand reuse:

- if stage 2 must reuse a stage 1 operand one iteration later, keep that operand resident on chip instead of round-tripping to GM
- in the MLA kernel, `k_nope` stays in `l1kn` and the vec-produced `p` tile is published directly into `l1p`
- in `agent/example/kernels/a5/mha_ifa_nz.py`, the vec-produced `p` tile is first packed with `reg_to_ub(...)` and then published to `l1p` as `.nz()`

Delayed scalar state:

- delayed scalar state must follow the consumer lifetime, not the producer lifetime
- cache per-tile `row_exp_diff` / rescale factors in a slot indexed by the delayed consumer counter
- keep running `row_max`, `row_sum`, and `output_acc` under a single vec owner to avoid duplicate updates

## Typical files to study

- `agent/example/kernels/a5/test_mla_entire.py`
- `agent/example/kernels/a5/mha_ifa.py`
- `agent/example/kernels/a5/mha_ifa_256.py`
- `agent/example/kernels/a5/mha_ifa_fp8_scale_256.py`
- `agent/example/kernels/a5/mha_ifa_nz.py`
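The one-tile lookahead control pattern (one loop that runs one extra step, with the producer side active while `s < S` and the delayed consumer side active once `s > 0`) can be illustrated with a small host-side trace. This is only a sketch in plain Python: `lookahead_schedule` and the stage labels `cube1`, `vec1`, `cube2`, `vec2` are hypothetical names for illustration and are not part of the repository's kernel API. It shows how warmup, steady state, and drain fall out of a single loop, and why the delayed side keeps its own counter.

```python
def lookahead_schedule(S, TILE):
    """Return the ordered (stage, tile_index) events of a one-tile lookahead loop.

    Hypothetical helper for illustration only; real kernels issue cube/vec work
    instead of appending to a list.
    """
    events = []
    prod = 0  # producer-side tile counter
    cons = 0  # delayed-consumer tile counter, kept separate on purpose
    # One extra iteration past S lets the delayed consumer drain the last tile.
    for s in range(0, S + TILE, TILE):
        if s < S:  # warmup + steady state
            events.append(("cube1", prod))  # produce score tile `prod`
            events.append(("vec1", prod))   # softmax update + cast to probability tile
            prod += 1
        if s > 0:  # steady state + drain
            events.append(("cube2", cons))  # consume delayed probability tile
            events.append(("vec2", cons))   # rescale and accumulate delayed output
            cons += 1
    return events


if __name__ == "__main__":
    for stage, tile in lookahead_schedule(S=12, TILE=4):
        print(stage, tile)
```

With `S=12` and `TILE=4`, the first iteration only produces tile 0 (warmup), the middle iterations produce tile `i` and consume tile `i-1`, and the final iteration only consumes the last tile (drain). The producer and consumer counts match even when `S` is not a multiple of `TILE`.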
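The delayed scalar state rules can also be checked numerically on the host. The sketch below is a pure-Python model of single-row (`rows = 1`) streaming softmax attention with a one-tile delay. It is an assumption-laden illustration, not the MLA kernel: `delayed_streaming_attention`, `reference_attention`, `NSLOT`, `p_slot`, and `diff_slot` are hypothetical names, and ordinary lists stand in for L1/UB buffers. The producer caches each tile's rescale factor (`row_exp_diff`) in a slot, the delayed consumer reads it back by its own counter, and normalization happens only after both the numerator (`out_acc`) and denominator (`row_sum`) streams have finished.

```python
import math


def delayed_streaming_attention(scores, values, tile):
    """Single-row attention output computed with a one-tile lookahead.

    scores: list of S floats (one row of q @ k.t()); values: S x D nested list.
    """
    S, D = len(scores), len(values[0])
    NSLOT = 2                      # producer writes one slot while the consumer reads the other
    p_slot = [None] * NSLOT        # probability tiles waiting for the delayed cube stage
    diff_slot = [1.0] * NSLOT      # per-tile row_exp_diff, read back by the consumer
    row_max, row_sum = -math.inf, 0.0
    out_acc = [0.0] * D
    prod = cons = 0                # separate producer / delayed-consumer counters

    for s in range(0, S + tile, tile):
        if s < S:
            # vec stage 1: streaming softmax update for tile `prod`
            tile_scores = scores[s:s + tile]
            new_max = max(row_max, max(tile_scores))
            diff = math.exp(row_max - new_max)  # 0.0 on the first tile
            p = [math.exp(x - new_max) for x in tile_scores]
            row_sum = row_sum * diff + sum(p)
            row_max = new_max
            p_slot[prod % NSLOT] = p
            diff_slot[prod % NSLOT] = diff
            prod += 1
        if s > 0:
            # cube stage 2 + vec stage 2 for the delayed tile `cons`
            slot = cons % NSLOT
            start = cons * tile
            p = p_slot[slot]
            pv = [sum(p[j] * values[start + j][d] for j in range(len(p)))
                  for d in range(D)]
            # Rescale with the factor cached for THIS tile, not the producer's latest one.
            out_acc = [o * diff_slot[slot] + x for o, x in zip(out_acc, pv)]
            cons += 1
    # Normalize only after numerator and denominator streams are complete.
    return [o / row_sum for o in out_acc]


def reference_attention(scores, values):
    """Plain softmax(scores) @ values, used only to check the streaming result."""
    m = max(scores)
    w = [math.exp(x - m) for x in scores]
    z = sum(w)
    return [sum(w[i] * values[i][d] for i in range(len(scores))) / z
            for d in range(len(values[0]))]
```

If the consumer instead used the producer's most recent `diff`, the accumulator would be rescaled one tile too early and the result would no longer match `reference_attention`. Indexing the slot with the consumer counter avoids that.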