1. Introduction

LangChain (including LangGraph) is one of the most widely used frameworks for AI application development, and applications built on it are gradually becoming an important part of enterprise systems. OpenTelemetry, as an overall monitoring solution for enterprise applications, naturally has to cover these key applications. Fortunately this is quite simple to set up; below I describe the basic approach.

2. Using the LangSmith mechanism to monitor LangChain applications with OpenTelemetry

I previously wrote an article on monitoring Python applications with OpenTelemetry. In principle that approach should work here too, but unfortunately its LangChain instrumentation has a bug that has not been fixed yet. In fact, much like Spring Framework in the Java world, the Python LangChain framework itself has good OpenTelemetry support, implemented through LangSmith. We only need to activate the LangSmith mechanism — no LangSmith server is required — and the data can be sent directly to an OpenTelemetry server. The steps are as follows.

3. Starting an OpenTelemetry backend

An OpenTelemetry backend is a set of tools — databases and UIs — that support OpenTelemetry metrics/traces/logs/profiles. The most common setup connects different backends through an OpenTelemetry Collector. If you are a beginner, or your system is small, you can simply use the Docker-based Grafana LGTM image: it is almost a one-command installation, provided you have a Docker-capable environment. Assuming you want to install LGTM under /opt/lgtm (any directory works), the commands on a Linux system are:

```
docker pull grafana/otel-lgtm
mkdir /opt/lgtm
cd /opt/lgtm
wget https://raw.githubusercontent.com/grafana/docker-otel-lgtm/main/run-lgtm.sh
chmod +x run-lgtm.sh
sed -i 's/3000:3000/3100:3000/' run-lgtm.sh
```

Note the last command: Grafana's default external port in LGTM is 3000, which often conflicts with other applications, so I changed it to 3100.

Starting LGTM is then a single command:

```
cd /opt/lgtm
./run-lgtm.sh
```

LGTM listens on the following ports:

- 4317/4318: OTLP ports, receiving metrics/traces/logs data
- 3100: the Grafana UI port
- 9090: the Prometheus port, for debugging
- 4040: the Pyroscope port receiving profiles (to be merged into 4317/4318 in the future)

4. Activating a LangChain application to send data to OpenTelemetry

1) Add the Python package langsmith

Usually you first enter the virtual environment the application uses (venv, Conda, or similar). Then you can install the package with:

```
pip install -U langsmith
```

The more formal way, of course, is to add langsmith to the application's requirements.txt and rerun:

```
pip install -r requirements.txt
```

2) Set the necessary environment variables before starting the LangChain application

If you normally start the application with `python app.py`, change the startup to:

```
export LANGSMITH_TRACING=true
export LANGSMITH_OTEL_ENABLED=true
export LANGSMITH_OTEL_ONLY=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export LS_APM_OTEL_ENABLED=true
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_SERVICE_NAME=MY_SERVICE1
python app.py
```

Adjust the OTEL_SERVICE_NAME environment variable to suit your own application. If the OpenTelemetry backend (LGTM in this example) is not on the local machine — say it is at 1.2.3.4 — set the corresponding OTEL_EXPORTER_OTLP_ENDPOINT:

```
export OTEL_EXPORTER_OTLP_ENDPOINT=http://1.2.3.4:4318
```
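If you prefer to keep the configuration in code rather than in a wrapper script, the same variables can also be set from Python before LangChain is imported. Below is a minimal sketch; the endpoint and service name are the example values used above, and whether your process layout allows setting them this early is an assumption:

```python
import os

# Equivalent of the export lines above: route LangSmith tracing to
# OpenTelemetry only (no LangSmith server), over OTLP/HTTP.
otel_env = {
    "LANGSMITH_TRACING": "true",
    "LANGSMITH_OTEL_ENABLED": "true",
    "LANGSMITH_OTEL_ONLY": "true",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4318",
    "LS_APM_OTEL_ENABLED": "true",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
    "OTEL_SERVICE_NAME": "MY_SERVICE1",
}
os.environ.update(otel_env)

# Import LangChain/LangGraph only AFTER the environment is prepared,
# so the LangSmith tracer picks these settings up at import time.
```

The key point is ordering: the variables must be in the environment before any LangChain module is imported.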
5. Monitoring dashboards for LangChain applications

Open "http://localhost:3100" to reach LGTM's Grafana UI (for a remote host, replace localhost with its hostname or IP) and log in with admin/admin.

Although LGTM supports all data types (metrics/traces/logs/profiles), OpenTelemetry's automatic instrumentation for Python currently produces only traces. Since some metrics can be computed from traces, our Grafana dashboard contains just these two kinds of data: traces and metrics.

Below is a screenshot of the dashboard I built, with a short walkthrough.

The two panels in the first row both show throughput: the left one in requests per second, the right one in bytes per second. The PromQL behind the left panel is:

```
sum by (service, span_name, status_code) (
  rate(traces_spanmetrics_calls_total[2m])
)
```

And behind the right panel:

```
sum by (service, span_name, status_code) (rate(traces_spanmetrics_size_total[2m]))
```

The second row shows latency, using this PromQL:

```
histogram_quantile(0.90, sum by (le, service, span_name, status_code) (rate(traces_spanmetrics_latency_bucket[2m])))
```

The third row is the list of traces. Traces can be analyzed in many ways; here is one example. Inside an agent built on LangGraph, the calls are fairly complex, and many of them are chosen automatically. For performance analysis we need a clear view of the logical order of those calls and the cost of each step, and this is where traces are especially valuable.

First, query the traces sent by these agents. Click the three dots at the top right of the Traces panel, choose "Edit" to open its settings, and pick a suitable query method. Suppose one class of agent runs under the service name "agent1": select or type "agent1" as the service name, then click "Refresh" to list all "agent1" traces in the selected time range, as shown in the figure below.

Click one of the traces; the next figure shows the internal call relationships and their timing. This agent is still fairly simple: the root span agent1 spawns two child spans ("retrieval" and "chatbot"), corresponding to two LangGraph nodes. The "retrieval" node splits into three child spans (VectorStoreRetriever, ChatPromptTemplate, and route_after_retrieval), which the code shows to be three calls, while "chatbot" has a single child span, "ChatOllama".

Now click the leaf span with the longest duration, "ChatOllama". Its details, shown in the figure below, include the model name, tokens, and the input and answer, among other information.

Finally, here is the dashboard's source, for reference:

```json
{
  "apiVersion": "dashboard.grafana.app/v2",
  "kind": "Dashboard",
  "metadata": {
    "name": "adkqq5s",
    "namespace": "default",
    "uid": "8e305963-7a7d-4d83-b256-e40d9b932f12",
    "resourceVersion": "1777469615660990",
    "generation": 11,
    "creationTimestamp": "2026-04-18T15:09:59Z",
    "labels": { "grafana.app/deprecatedInternalID": "885991689437184" },
    "annotations": {
      "grafana.app/createdBy": "user:eferfp17srfnkb",
      "grafana.app/folder": "",
      "grafana.app/saved-from-ui": "Grafana v13.0.1 (a100054f)",
      "grafana.app/updatedBy": "user:eferfp17srfnkb",
      "grafana.app/updatedTimestamp": "2026-04-29T13:33:35Z"
    }
  },
  "spec": {
    "annotations": [
      {
        "kind": "AnnotationQuery",
        "spec": {
          "query": { "kind": "DataQuery", "group": "grafana", "version": "v0", "datasource": { "name": "-- Grafana --" }, "spec": {} },
          "enable": true,
          "hide": true,
          "iconColor": "rgba(0, 211, 255, 1)",
          "name": "Annotations Alerts",
          "builtIn": true
        }
      }
    ],
    "cursorSync": "Off",
    "editable": true,
    "elements": {
      "panel-1": {
        "kind": "Panel",
        "spec": {
          "id": 1,
          "title": "Throughput (req/s)",
          "description": "",
          "links": [],
          "data": {
            "kind": "QueryGroup",
            "spec": {
              "queries": [
                {
                  "kind": "PanelQuery",
                  "spec": {
                    "query": {
                      "kind": "DataQuery",
                      "group": "prometheus",
                      "version": "v0",
                      "datasource": { "name": "prometheus" },
                      "spec": { "editorMode": "code", "expr": "sum by (service, span_name, status_code) (\r\n rate(traces_spanmetrics_calls_total[2m])\r\n)", "instant": true, "legendFormat": "__auto", "range": true }
                    },
                    "refId": "A",
                    "hidden": false
                  }
                }
              ],
              "transformations": [],
              "queryOptions": {}
            }
          },
          "vizConfig": {
            "kind": "VizConfig",
            "group": "timeseries",
            "version": "13.0.1",
            "spec": {
              "options": {
                "annotations": { "clustering": -1, "multiLane": false },
                "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true },
                "tooltip": { "hideZeros": false, "mode": "single", "sort": "none" }
              },
              "fieldConfig": {
                "defaults": {
                  "unit": "reqps",
                  "thresholds": { "mode": "absolute", "steps": [ { "value": 0, "color": "green" }, { "value": 80, "color": "red" } ] },
                  "color": { "mode": "palette-classic" },
                  "custom": { "axisBorderShow": false, "axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "barWidthFactor": 0.6, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "insertNulls": false, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "showValues": false, "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }
                },
                "overrides": []
              }
            }
          }
        }
      },
      "panel-2": {
        "kind": "Panel",
        "spec": {
          "id": 2,
          "title": "Latency (s)",
          "description": "",
          "links": [],
          "data": {
            "kind": "QueryGroup",
            "spec": {
              "queries": [
                {
                  "kind": "PanelQuery",
                  "spec": {
                    "query": {
                      "kind": "DataQuery",
                      "group": "prometheus",
                      "version": "v0",
                      "datasource": { "name": "prometheus" },
                      "spec": { "editorMode": "code", "expr": "histogram_quantile(0.90, sum by (le, service, span_name, status_code) (rate(traces_spanmetrics_latency_bucket[2m])))", "instant": true, "legendFormat": "__auto", "range": true }
                    },
                    "refId": "A",
                    "hidden": false
                  }
                }
              ],
              "transformations": [],
              "queryOptions": {}
            }
          },
          "vizConfig": {
            "kind": "VizConfig",
            "group": "timeseries",
            "version": "13.0.1",
            "spec": {
              "options": {
                "annotations": { "clustering": -1, "multiLane": false },
                "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true },
                "tooltip": { "hideZeros": false, "mode": "single", "sort": "none" }
              },
              "fieldConfig": {
                "defaults": {
                  "unit": "s",
                  "thresholds": { "mode": "absolute", "steps": [ { "value": 0, "color": "green" }, { "value": 80, "color": "red" } ] },
                  "color": { "mode": "palette-classic" },
                  "custom": { "axisBorderShow": false, "axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "barWidthFactor": 0.6, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "insertNulls": false, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "showValues": false, "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }
                },
                "overrides": []
              }
            }
          }
        }
      },
      "panel-3": {
        "kind": "Panel",
        "spec": {
          "id": 3,
          "title": "Throughput (bytes/sec)",
          "description": "",
          "links": [],
          "data": {
            "kind": "QueryGroup",
            "spec": {
              "queries": [
                {
                  "kind": "PanelQuery",
                  "spec": {
                    "query": {
                      "kind": "DataQuery",
                      "group": "prometheus",
                      "version": "v0",
                      "datasource": { "name": "prometheus" },
                      "spec": { "editorMode": "code", "expr": "sum by (service, span_name, status_code) (rate(traces_spanmetrics_size_total[2m]))", "instant": true, "legendFormat": "__auto", "range": true }
                    },
                    "refId": "A",
                    "hidden": false
                  }
                }
              ],
              "transformations": [],
              "queryOptions": {}
            }
          },
          "vizConfig": {
            "kind": "VizConfig",
            "group": "timeseries",
            "version": "13.0.1",
            "spec": {
              "options": {
                "annotations": { "clustering": -1, "multiLane": false },
                "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true },
                "tooltip": { "hideZeros": false, "mode": "single", "sort": "none" }
              },
              "fieldConfig": {
                "defaults": {
                  "unit": "Bps",
                  "thresholds": { "mode": "absolute", "steps": [ { "value": 0, "color": "green" }, { "value": 80, "color": "red" } ] },
                  "color": { "mode": "palette-classic" },
                  "custom": { "axisBorderShow": false, "axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "barWidthFactor": 0.6, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "insertNulls": false, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "showValues": false, "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }
                },
                "overrides": []
              }
            }
          }
        }
      },
      "panel-4": {
        "kind": "Panel",
        "spec": {
          "id": 4,
          "title": "Traces",
          "description": "",
          "links": [],
          "data": {
            "kind": "QueryGroup",
            "spec": {
              "queries": [
                {
                  "kind": "PanelQuery",
                  "spec": {
                    "query": {
                      "kind": "DataQuery",
                      "group": "tempo",
                      "version": "v0",
                      "datasource": { "name": "tempo" },
                      "spec": { "filters": [ { "id": "b664759c", "operator": "", "scope": "span" } ], "key": "Q-9f6990d1-f1a3-4fe9-a46f-63846a4c475e-0", "limit": 30, "metricsQueryType": "range", "queryType": "traceqlSearch", "serviceMapUseNativeHistograms": false, "spss": 3, "tableType": "traces" }
                    },
                    "refId": "A",
                    "hidden": false
                  }
                }
              ],
              "transformations": [],
              "queryOptions": {}
            }
          },
          "vizConfig": {
            "kind": "VizConfig",
            "group": "table",
            "version": "13.0.1",
            "spec": {
              "options": { "cellHeight": "sm", "showHeader": true, "sortBy": [ { "desc": true, "displayName": "Service" } ] },
              "fieldConfig": {
                "defaults": {
                  "thresholds": { "mode": "absolute", "steps": [ { "value": 0, "color": "green" }, { "value": 80, "color": "red" } ] },
                  "custom": { "align": "auto", "cellOptions": { "type": "auto" }, "footer": { "reducers": [] }, "inspect": false }
                },
                "overrides": []
              }
            }
          }
        }
      }
    },
    "layout": {
      "kind": "GridLayout",
      "spec": {
        "items": [
          { "kind": "GridLayoutItem", "spec": { "x": 0, "y": 0, "width": 12, "height": 8, "element": { "kind": "ElementReference", "name": "panel-1" } } },
          { "kind": "GridLayoutItem", "spec": { "x": 12, "y": 0, "width": 12, "height": 8, "element": { "kind": "ElementReference", "name": "panel-3" } } },
          { "kind": "GridLayoutItem", "spec": { "x": 0, "y": 8, "width": 24, "height": 8, "element": { "kind": "ElementReference", "name": "panel-2" } } },
          { "kind": "GridLayoutItem", "spec": { "x": 0, "y": 16, "width": 24, "height": 7, "element": { "kind": "ElementReference", "name": "panel-4" } } }
        ]
      }
    },
    "links": [],
    "liveNow": false,
    "preload": false,
    "tags": [],
    "timeSettings": { "timezone": "browser", "from": "now-30m", "to": "now", "autoRefresh": "", "autoRefreshIntervals": [ "5s", "10s", "30s", "1m", "5m", "15m", "30m", "1h", "2h", "1d" ], "hideTimepicker": false, "fiscalYearStartMonth": 0 },
    "title": "LangChain1",
    "variables": []
  }
}
```

6. Summary

This article described how to monitor LangChain applications with OpenTelemetry. Data collection is implemented through the LangSmith mechanism: setting a few environment variables is all it takes to send monitoring data to an OpenTelemetry server. The article detailed the steps for deploying Grafana LGTM with Docker as the monitoring backend, including port configuration and startup; provided a guide to configuring the Python environment; and showed how to use a dashboard to view the application's throughput, latency, and call-chain traces. In particular, it demonstrated how to analyze the complex call relationships inside a LangGraph agent: the trace view gives a clear picture of each internal node's execution time and logical order, providing a direct basis for performance optimization. Altogether the approach delivers end-to-end monitoring of LangChain applications, with complete visualization from metrics to traces.
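As a closing practical note: if no data shows up in Grafana, the first thing to rule out is connectivity. Below is a stdlib-only sketch that checks whether the LGTM ports from section 3 are reachable; the host and port numbers are the defaults used in this article, so adjust them if your setup differs:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Ports exposed by the LGTM container (see section 3).
checks = {"OTLP/gRPC": 4317, "OTLP/HTTP": 4318, "Grafana UI": 3100}
for name, port in checks.items():
    status = "open" if port_open("localhost", port) else "CLOSED"
    print(f"{name} (localhost:{port}): {status}")
```

If OTLP/HTTP (4318) is closed here, the environment variables in section 4 cannot deliver any data, so fix the container or firewall before debugging the application itself.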