Building effective agents
Published Dec 19, 2024
https://www.anthropic.com/engineering/building-effective-agents

Over the past year, we've worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.

In this post, we share what we've learned from working with our customers and building agents ourselves, and give practical advice for developers on building effective agents.

What are agents?

"Agent" can be defined in several ways. Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks. Others use the term to describe more prescriptive implementations that follow predefined workflows. At Anthropic, we categorize all these variations as agentic systems, but draw an important architectural distinction between workflows and agents:

Workflows are systems where LLMs and tools are orchestrated through predefined code paths.
Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.

Below, we will explore both types of agentic systems in detail.
In Appendix 1 ("Agents in Practice"), we describe two domains where customers have found particular value in using these kinds of systems.

When (and when not) to use agents

When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all. Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense.

When more complexity is warranted, workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale. For many applications, however, optimizing single LLM calls with retrieval and in-context examples is usually enough.

When and how to use frameworks

There are many frameworks that make agentic systems easier to implement, including:

The Claude Agent SDK;
Strands Agents SDK by AWS;
Rivet, a drag-and-drop GUI LLM workflow builder; and
Vellum, another GUI tool for building and testing complex workflows.

These frameworks make it easy to get started by simplifying standard low-level tasks like calling LLMs, defining and parsing tools, and chaining calls together. However, they often create extra layers of abstraction that can obscure the underlying prompts and responses, making them harder to debug. They can also make it tempting to add complexity when a simpler setup would suffice.

We suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code.
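To illustrate how little code a direct implementation needs, here is a minimal sketch of a two-step chain. `call_llm` is a placeholder for a real API call through your provider's SDK; it returns canned text here so the sketch is self-contained.

```python
# Minimal sketch: many "agentic" patterns are just a few direct API calls.
# call_llm is a placeholder for a real LLM call (e.g. a chat-completion
# request via your provider's SDK); stubbed here to stay self-contained.

def call_llm(prompt: str) -> str:
    # Stubbed response so the sketch runs without network access.
    return f"[model output for: {prompt.splitlines()[0]}]"

def summarize_then_translate(text: str) -> str:
    """A two-step chain implemented directly, with no framework."""
    summary = call_llm(f"Summarize in one sentence:\n{text}")
    return call_llm(f"Translate to Spanish:\n{summary}")
```

Swapping the stub for a real API call is the only change needed to run this against a live model.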
Incorrect assumptions about what's under the hood are a common source of customer error.

See our cookbook for some sample implementations.

Building blocks, workflows, and agents

In this section, we'll explore the common patterns for agentic systems we've seen in production. We'll start with our foundational building block, the augmented LLM, and progressively increase complexity, from simple compositional workflows to autonomous agents.

Building block: The augmented LLM

The basic building block of agentic systems is an LLM enhanced with augmentations such as retrieval, tools, and memory. Our current models can actively use these capabilities: generating their own search queries, selecting appropriate tools, and determining what information to retain.

We recommend focusing on two key aspects of the implementation: tailoring these capabilities to your specific use case and ensuring they provide an easy, well-documented interface for your LLM.
While there are many ways to implement these augmentations, one approach is through our recently released Model Context Protocol, which allows developers to integrate with a growing ecosystem of third-party tools with a simple client implementation.

For the remainder of this post, we'll assume each LLM call has access to these augmented capabilities.

Workflow: Prompt chaining

Prompt chaining decomposes a task into a sequence of steps, where each LLM call processes the output of the previous one. You can add programmatic checks (see "gate" in the diagram below) on any intermediate steps to ensure that the process is still on track.

When to use this workflow: This workflow is ideal for situations where the task can be easily and cleanly decomposed into fixed subtasks. The main goal is to trade off latency for higher accuracy, by making each LLM call an easier task.

Examples where prompt chaining is useful:

Generating marketing copy, then translating it into a different language.
Writing an outline of a document, checking that the outline meets certain criteria, then writing the document based on the outline.

Workflow: Routing

Routing classifies an input and directs it to a specialized follow-up task. This workflow allows for separation of concerns, and building more specialized prompts.
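A minimal routing sketch, using the customer-support categories from the examples below. The category names and the `call_llm` stub are illustrative, not a prescribed API; a real implementation would send both calls to a live model.

```python
# Routing sketch: one cheap classification call picks a specialized prompt.
# call_llm is a stand-in for a real LLM API call; the canned replies make
# the example self-contained and deterministic.

SPECIALIZED_PROMPTS = {
    "refund": "You handle refund requests. Be precise about policy.",
    "technical": "You are a technical support specialist.",
    "general": "You answer general product questions.",
}

def call_llm(system: str, user: str) -> str:
    # Stub: a real implementation would send (system, user) to the model.
    if "Classify" in system:
        return "refund" if "refund" in user.lower() else "general"
    return f"[{system.split('.')[0]}] reply to: {user}"

def route(query: str) -> str:
    label = call_llm(
        "Classify the query as one of: refund, technical, general. "
        "Reply with the label only.", query
    ).strip()
    system = SPECIALIZED_PROMPTS.get(label, SPECIALIZED_PROMPTS["general"])
    return call_llm(system, query)
```

The classifier call could equally be a small, cheap model while the specialized prompts go to a more capable one.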
Without this workflow, optimizing for one kind of input can hurt performance on other inputs.

When to use this workflow: Routing works well for complex tasks where there are distinct categories that are better handled separately, and where classification can be handled accurately, either by an LLM or a more traditional classification model/algorithm.

Examples where routing is useful:

Directing different types of customer service queries (general questions, refund requests, technical support) into different downstream processes, prompts, and tools.
Routing easy/common questions to smaller, cost-efficient models like Claude Haiku 4.5 and hard/unusual questions to more capable models like Claude Sonnet 4.5 to optimize for best performance.

Workflow: Parallelization

LLMs can sometimes work simultaneously on a task and have their outputs aggregated programmatically. This workflow, parallelization, manifests in two key variations:

Sectioning: Breaking a task into independent subtasks run in parallel.
Voting: Running the same task multiple times to get diverse outputs.

When to use this workflow: Parallelization is effective when the divided subtasks can be parallelized for speed, or when multiple perspectives or attempts are needed for higher confidence results.
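Both parallelization variants can be sketched in a few lines. Here `call_llm` is a deterministic stub standing in for a real API call, and the vulnerability check is a toy example of the voting use case.

```python
# Parallelization sketch: "sectioning" fans subtasks out concurrently,
# "voting" runs the same check several times and aggregates the results.
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    # Stub for a real LLM call: flag prompts containing a risky pattern.
    return "FLAG" if "eval(" in prompt else "OK"

def section(subtask_prompts: list) -> list:
    # Sectioning: independent subtasks run in parallel threads.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(call_llm, subtask_prompts))

def vote(prompt: str, n: int = 3, threshold: int = 2) -> bool:
    # Voting: n parallel attempts; flag if at least `threshold` agree.
    votes = section([prompt] * n)
    return votes.count("FLAG") >= threshold
```

With a real model the n attempts would genuinely differ (temperature, prompt variants), which is what makes the vote informative; the threshold is how you trade false positives against false negatives.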
For complex tasks with multiple considerations, LLMs generally perform better when each consideration is handled by a separate LLM call, allowing focused attention on each specific aspect.

Examples where parallelization is useful:

Sectioning:
Implementing guardrails where one model instance processes user queries while another screens them for inappropriate content or requests. This tends to perform better than having the same LLM call handle both guardrails and the core response.
Automating evals for evaluating LLM performance, where each LLM call evaluates a different aspect of the model's performance on a given prompt.

Voting:
Reviewing a piece of code for vulnerabilities, where several different prompts review and flag the code if they find a problem.
Evaluating whether a given piece of content is inappropriate, with multiple prompts evaluating different aspects or requiring different vote thresholds to balance false positives and negatives.

Workflow: Orchestrator-workers

In the orchestrator-workers workflow, a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results.

When to use this workflow: This workflow is well-suited for complex tasks where you can't predict the subtasks needed (in coding, for example, the number of files that need to be changed and the nature of the change in each file likely depend on the task).
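A sketch of the orchestrator-workers shape, with both roles played by a canned `call_llm` stub. The point is structural: the subtask list is produced by the model at runtime rather than written into the code.

```python
# Orchestrator-workers sketch: the orchestrator call decides the subtasks
# at runtime, workers execute them, and a final call synthesizes results.
# call_llm is a deterministic stub for a real LLM API call.

def call_llm(prompt: str) -> str:
    if prompt.startswith("Plan:"):
        # Orchestrator decides which files to touch for this input.
        return "edit main.py\nedit utils.py"
    return f"done: {prompt}"

def orchestrate(task: str) -> str:
    plan = call_llm(f"Plan: list the subtasks for: {task}")
    subtasks = plan.splitlines()      # not pre-defined; chosen by the model
    results = [call_llm(s) for s in subtasks]
    return call_llm("Synthesize:\n" + "\n".join(results))
```

In a real system the worker calls could also run in parallel, since the orchestrator has already made them independent.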
While it's topographically similar to parallelization, the key difference is its flexibility: subtasks aren't pre-defined, but determined by the orchestrator based on the specific input.

Examples where orchestrator-workers is useful:

Coding products that make complex changes to multiple files each time.
Search tasks that involve gathering and analyzing information from multiple sources for possible relevant information.

Workflow: Evaluator-optimizer

In the evaluator-optimizer workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.

When to use this workflow: This workflow is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value. The two signs of good fit are, first, that LLM responses can be demonstrably improved when a human articulates their feedback; and second, that the LLM can provide such feedback.
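The generate-evaluate loop can be sketched as below. Both calls are deterministic stubs for real LLM calls, and the accept-only-revised rule is a toy evaluation criterion; the loop structure and iteration cap are the part that carries over to real systems.

```python
# Evaluator-optimizer sketch: one call drafts, another critiques, looping
# until the evaluator accepts or an iteration cap is hit.

def generate(task: str, feedback: str = "") -> str:
    # Stub generator: a revision incorporates the critique.
    return f"draft of {task}" + (" (revised)" if feedback else "")

def evaluate(draft: str) -> str:
    # Stub evaluator: returns "PASS" or a critique string.
    return "PASS" if "revised" in draft else "tone is too rough"

def refine(task: str, max_iters: int = 3) -> str:
    draft = generate(task)
    for _ in range(max_iters):
        feedback = evaluate(draft)
        if feedback == "PASS":
            return draft
        draft = generate(task, feedback)
    return draft
```

The iteration cap matters: without clear acceptance criteria the evaluator can keep asking for changes indefinitely.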
This is analogous to the iterative writing process a human writer might go through when producing a polished document.

Examples where evaluator-optimizer is useful:

Literary translation where there are nuances that the translator LLM might not capture initially, but where an evaluator LLM can provide useful critiques.
Complex search tasks that require multiple rounds of searching and analysis to gather comprehensive information, where the evaluator decides whether further searches are warranted.

Agents

Agents are emerging in production as LLMs mature in key capabilities: understanding complex inputs, engaging in reasoning and planning, using tools reliably, and recovering from errors. Agents begin their work with either a command from, or interactive discussion with, the human user. Once the task is clear, agents plan and operate independently, potentially returning to the human for further information or judgement. During execution, it's crucial for the agents to gain "ground truth" from the environment at each step (such as tool call results or code execution) to assess their progress. Agents can then pause for human feedback at checkpoints or when encountering blockers. The task often terminates upon completion, but it's also common to include stopping conditions (such as a maximum number of iterations) to maintain control.

Agents can handle sophisticated tasks, but their implementation is often straightforward. They are typically just LLMs using tools based on environmental feedback in a loop. It is therefore crucial to design toolsets and their documentation clearly and thoughtfully.
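That "tools in a loop" structure can be sketched as below. `call_llm`, the tool table, and the `tool:` action format are all illustrative stand-ins; a real agent would parse tool-use blocks from the model's API response and feed tool results back as messages.

```python
# Agent loop sketch: the model picks tools based on environmental
# feedback until it signals completion or hits a stopping condition.

def call_llm(context: str) -> str:
    # Stub policy: list files once, then declare the task done.
    return "tool:list_files" if "list_files" not in context else "DONE"

TOOLS = {"list_files": lambda: "main.py utils.py"}

def run_agent(task: str, max_turns: int = 10) -> str:
    context = task
    for _ in range(max_turns):            # stopping condition keeps control
        action = call_llm(context)
        if action == "DONE":
            return context
        tool = TOOLS[action.removeprefix("tool:")]
        context += f"\n{action} -> {tool()}"   # ground truth fed back
    return context
```

Everything else discussed above (checkpoints for human feedback, sandboxing, guardrails) wraps around this loop rather than replacing it.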
We expand on best practices for tool development in Appendix 2 ("Prompt Engineering your Tools").

When to use agents: Agents can be used for open-ended problems where it's difficult or impossible to predict the required number of steps, and where you can't hardcode a fixed path. The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making. Agents' autonomy makes them ideal for scaling tasks in trusted environments.

The autonomous nature of agents means higher costs, and the potential for compounding errors. We recommend extensive testing in sandboxed environments, along with the appropriate guardrails.

Examples where agents are useful:

The following examples are from our own implementations:

A coding agent to resolve SWE-bench tasks, which involve edits to many files based on a task description;
Our "computer use" reference implementation, where Claude uses a computer to accomplish tasks.

Combining and customizing these patterns

These building blocks aren't prescriptive. They're common patterns that developers can shape and combine to fit different use cases. The key to success, as with any LLM features, is measuring performance and iterating on implementations. To repeat: you should consider adding complexity only when it demonstrably improves outcomes.

Summary

Success in the LLM space isn't about building the most sophisticated system.
It's about building the right system for your needs. Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.

When implementing agents, we try to follow three core principles:

Maintain simplicity in your agent's design.
Prioritize transparency by explicitly showing the agent's planning steps.
Carefully craft your agent-computer interface (ACI) through thorough tool documentation and testing.

Frameworks can help you get started quickly, but don't hesitate to reduce abstraction layers and build with basic components as you move to production. By following these principles, you can create agents that are not only powerful but also reliable, maintainable, and trusted by their users.

Acknowledgements

Written by Erik S. and Barry Zhang. This work draws upon our experiences building agents at Anthropic and the valuable insights shared by our customers, for which we're deeply grateful.

Appendix 1: Agents in practice

Our work with customers has revealed two particularly promising applications for AI agents that demonstrate the practical value of the patterns discussed above. Both applications illustrate how agents add the most value for tasks that require both conversation and action, have clear success criteria, enable feedback loops, and integrate meaningful human oversight.

A. Customer support

Customer support combines familiar chatbot interfaces with enhanced capabilities through tool integration.
This is a natural fit for more open-ended agents because:

Support interactions naturally follow a conversation flow while requiring access to external information and actions;
Tools can be integrated to pull customer data, order history, and knowledge base articles;
Actions such as issuing refunds or updating tickets can be handled programmatically; and
Success can be clearly measured through user-defined resolutions.

Several companies have demonstrated the viability of this approach through usage-based pricing models that charge only for successful resolutions, showing confidence in their agents' effectiveness.

B. Coding agents

The software development space has shown remarkable potential for LLM features, with capabilities evolving from code completion to autonomous problem-solving. Agents are particularly effective because:

Code solutions are verifiable through automated tests;
Agents can iterate on solutions using test results as feedback;
The problem space is well-defined and structured; and
Output quality can be measured objectively.

In our own implementation, agents can now solve real GitHub issues in the SWE-bench Verified benchmark based on the pull request description alone. However, whereas automated testing helps verify functionality, human review remains crucial for ensuring solutions align with broader system requirements.

Appendix 2: Prompt engineering your tools

No matter which agentic system you're building, tools will likely be an important part of your agent. Tools enable Claude to interact with external services and APIs by specifying their exact structure and definition in our API. When Claude responds, it will include a tool use block in the API response if it plans to invoke a tool. Tool definitions and specifications should be given just as much prompt engineering attention as your overall prompts. In this brief appendix, we describe how to prompt engineer your tools.

There are often several ways to specify the same action.
For instance, you can specify a file edit by writing a diff, or by rewriting the entire file. For structured output, you can return code inside markdown or inside JSON. In software engineering, differences like these are cosmetic and can be converted losslessly from one to the other. However, some formats are much more difficult for an LLM to write than others. Writing a diff requires knowing how many lines are changing in the chunk header before the new code is written. Writing code inside JSON (compared to markdown) requires extra escaping of newlines and quotes.

Our suggestions for deciding on tool formats are the following:

Give the model enough tokens to "think" before it writes itself into a corner.
Keep the format close to what the model has seen naturally occurring in text on the internet.
Make sure there's no formatting overhead, such as having to keep an accurate count of thousands of lines of code, or string-escaping any code it writes.

One rule of thumb is to think about how much effort goes into human-computer interfaces (HCI), and plan to invest just as much effort in creating good agent-computer interfaces (ACI). Here are some thoughts on how to do so:

Put yourself in the model's shoes. Is it obvious how to use this tool, based on the description and parameters, or would you need to think carefully about it? If so, then it's probably also true for the model. A good tool definition often includes example usage, edge cases, input format requirements, and clear boundaries from other tools.
How can you change parameter names or descriptions to make things more obvious? Think of this as writing a great docstring for a junior developer on your team. This is especially important when using many similar tools.
Test how the model uses your tools: run many example inputs in our workbench to see what mistakes the model makes, and iterate.
Poka-yoke your tools.
Change the arguments so that it is harder to make mistakes.

While building our agent for SWE-bench, we actually spent more time optimizing our tools than the overall prompt. For example, we found that the model would make mistakes with tools using relative filepaths after the agent had moved out of the root directory. To fix this, we changed the tool to always require absolute filepaths, and we found that the model used this method flawlessly.
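The absolute-filepath fix above can be sketched as a tool definition plus a validating implementation. The tool name, description wording, and schema shape here are hypothetical; the point is that the constraint is stated in the description and enforced in code, so a relative path produces actionable feedback rather than a silent wrong result.

```python
# Poka-yoke sketch: a hypothetical file-read tool whose description and
# validation both require absolute paths, making the relative-path
# mistake hard to commit. The schema shape is illustrative only.
import os

VIEW_FILE_TOOL = {
    "name": "view_file",
    "description": (
        "Read a file. `path` MUST be absolute (e.g. /repo/src/main.py); "
        "relative paths are rejected."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}

def view_file(path: str) -> str:
    if not os.path.isabs(path):
        # Return the constraint as a tool result the model can act on.
        return "error: path must be absolute, e.g. /repo/src/main.py"
    return f"[contents of {path}]"
```

Validating inside the tool and echoing the rule back is usually more effective than relying on the description alone, since the error message arrives exactly when the model is about to repeat the mistake.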