Agentforce · 2026-03-05

Agentforce Evaluation in Practice: RAG Metrics, Observability, and Go-Live Gates

Why Agentforce Needs a Real Evaluation System Now

Over the past year, many teams focused on “getting an agent live first.” The typical result: session volume grows, but real resolution rates lag; responses sound fluent, yet factual mistakes still appear; operations teams can see ticket outcomes, but not where the agent failed in the middle of the flow.

Since Spring ’26, Salesforce has introduced clearer quality signals around Agentforce and related analytics (including stronger signals for retrieval/RAG quality, session performance, and testing workflows). That gives teams a chance to move AI operations from subjective feedback to metric-driven governance.

Use a Three-Layer Metric Model: Retrieval, Generation, Outcome

Don’t rely on “user satisfaction” alone. Break Agentforce quality into three layers:

  • Layer 1: Retrieval quality — Did the system fetch the right and sufficient context?
  • Layer 2: Generation quality — Given context, is the response accurate, complete, and actionable?
  • Layer 3: Business outcomes — Are you improving deflection, first-response time, and resolution metrics?

You need all three. Outcome-only metrics hide root causes; model-only metrics can’t prove business value.
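The three layers can be kept apart in the data model itself, so an aggregate in one layer never masks a regression in another. A minimal sketch (field names are illustrative, not a Salesforce API):

```python
from dataclasses import dataclass

@dataclass
class SessionEval:
    # Layer 1: retrieval quality
    context_relevance: float   # 0..1, how well retrieved context matches the intent
    # Layer 2: generation quality
    groundedness: float        # 0..1, share of response claims supported by context
    # Layer 3: business outcome
    resolved: bool             # did the session end without a human handoff?

def layer_scores(evals: list[SessionEval]) -> dict[str, float]:
    """Aggregate each layer separately so one layer cannot hide another."""
    n = len(evals)
    return {
        "retrieval": sum(e.context_relevance for e in evals) / n,
        "generation": sum(e.groundedness for e in evals) / n,
        "outcome": sum(e.resolved for e in evals) / n,
    }
```

Reporting the three numbers side by side is what makes root-cause triage possible: a high outcome score with a low retrieval score, for example, usually means the agent is succeeding on easy intents while silently failing retrieval on hard ones.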

Operationalizing RAG Metrics: Go Beyond “Document Hit Rate”

A common mistake in Agentforce programs is treating “a retrieved document exists” as success. A stronger KPI set is:

  • Context relevance: retrieved context truly matches intent, not topic-adjacent noise.
  • Groundedness/Faithfulness: claims in the response are supported by retrieved evidence.
  • Answer completeness: key constraints (object, time, condition, exceptions) are covered.
  • Latency by stage: isolate retrieval latency from generation latency.
  • Fallback rate: transfer-to-human, no-answer, or default-template frequency.

Also segment by intent type (lookup, action, explanation). Thresholds should vary by intent: action intents can tolerate higher latency but must meet stricter accuracy bars.
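Per-intent thresholds can be expressed as plain data and checked mechanically. The specific numbers below are assumptions to be tuned per org; only the shape matters:

```python
# Per-intent thresholds: (max P95 latency in seconds, min groundedness).
# Values are illustrative assumptions, not recommendations.
THRESHOLDS = {
    "lookup":      (2.0, 0.90),
    "explanation": (3.0, 0.90),
    "action":      (5.0, 0.98),  # action intents: more latency allowed, stricter accuracy
}

def passes(intent_type: str, p95_latency_s: float, groundedness: float) -> bool:
    """Check one intent segment against its own thresholds."""
    max_latency, min_grounded = THRESHOLDS[intent_type]
    return p95_latency_s <= max_latency and groundedness >= min_grounded
```

Keeping thresholds in data rather than code means recalibrating them (e.g. during the monthly cadence below) is a config change, not a release.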

Observability Is Not a Dashboard — It’s a Diagnostic Path

Many teams already have agent dashboards. The problem: they show anomalies, not causes. Upgrade observability to diagnosis:

  1. Session level: total sessions, successful sessions, handoffs, average response time.
  2. Step level: actions invoked, knowledge sources queried, fallback points.
  3. Version level: metric regressions after Prompt/Topic/Action changes.

Core rule: every production change should be replayable, comparable, and attributable. Otherwise, “quality dropped this week” becomes guesswork.
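One way to make "replayable, comparable, attributable" concrete is a structured step-level trace event that captures inputs, pins the agent version, and carries session/step identity. A sketch with assumed field names:

```python
import json
import time
import uuid

def trace_event(session_id: str, step: str, agent_version: str,
                inputs: dict, outputs: dict, latency_ms: float) -> str:
    """Emit one step-level trace record as JSON.

    Replayable: inputs are captured verbatim.
    Comparable: agent_version pins the Prompt/Topic/Action revision.
    Attributable: session_id + step locate the failure point.
    """
    return json.dumps({
        "event_id": str(uuid.uuid4()),
        "ts": time.time(),
        "session_id": session_id,
        "step": step,                    # e.g. "retrieve", "generate", "invoke_action"
        "agent_version": agent_version,
        "inputs": inputs,
        "outputs": outputs,
        "latency_ms": latency_ms,
    })
```

With events in this shape, "quality dropped this week" becomes a query: group by `agent_version` and `step`, compare metric distributions before and after the change.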

Go-Live Gates: Define “Ready to Release” as Hard Rules

Set standardized release gates for every agent. At minimum:

  • Retrieval quality. Bar: core-intent relevance remains above threshold. If failed: fix the data source/retrieval strategy before prompt tuning.
  • High-risk accuracy. Bar: a higher threshold for billing/contract/compliance intents. If failed: force restricted templates or human handoff.
  • Latency SLO. Bar: P95 response latency within target. If failed: split and profile retrieval vs. tool-call bottlenecks.
  • Failure recoverability. Bar: stable fallback with context continuity. If failed: improve fallback copy and handoff payloads.
  • Regression test. Bar: no decline on the historical critical test set. If failed: block the release and roll back.

Once gates are defined, automate enforcement: run fixed evaluation suites pre-release, then monitor aggressively for 24–72 hours post-release.
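Automated enforcement can be as simple as a table of named predicates evaluated in CI before release. Gate names, metric keys, and thresholds below are illustrative assumptions; wire them to your real evaluation suite:

```python
# Each gate is a named predicate over the pre-release metrics dict.
# Thresholds are illustrative, not recommendations.
GATES = {
    "retrieval_relevance": lambda m: m["core_intent_relevance"] >= 0.85,
    "high_risk_accuracy":  lambda m: m["high_risk_groundedness"] >= 0.98,
    "latency_slo":         lambda m: m["p95_latency_s"] <= 3.0,
    "regression":          lambda m: m["critical_suite_pass_rate"] >= m["baseline_pass_rate"],
}

def release_gate(metrics: dict) -> tuple[bool, list[str]]:
    """Return (ready, failed_gate_names). Every gate must pass to release."""
    failed = [name for name, check in GATES.items() if not check(metrics)]
    return (not failed, failed)
```

Returning the list of failed gates, not just a boolean, is what makes the "If failed" column actionable: each gate name maps to a specific remediation path.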

Recommended Operating Cadence

  • Daily: review top failed intents, top failed sessions, and latency spikes.
  • Weekly: classify new failure modes (retrieval drift, prompt ambiguity, action permission gaps).
  • Monthly: recalibrate evaluation datasets, retire stale cases, add new business scenarios.

This turns your program from reactive bug fixing into continuous capability building.

Common Mistakes and Corrections

  • Mistake 1: one-time UAT only → AI quality drifts with data and process changes; evaluation must be continuous.
  • Mistake 2: blaming the model for everything → many failures come from retrieval, permissions, or tool orchestration.
  • Mistake 3: coverage before accuracy → in high-risk domains, start narrow and correct, then expand.
  • Mistake 4: tracking averages only → P95/P99 latency and long-tail failure modes are what users feel.
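Mistake 4 is easy to demonstrate: a handful of long-tail sessions barely moves the mean but dominates the P95. A nearest-rank percentile sketch:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

For latencies `[1, 2, 3, 4, 100]` the mean is 22 while the P95 is 100: the average suggests a tolerable experience, but one in twenty users waits 100 seconds.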

Conclusion

The hard part of Agentforce is no longer “can we build it?” but “can we run it reliably, explainably, and iteratively?” A clear RAG metric system, diagnostic observability path, and explicit go-live gates move AI programs from demo mode to engineering mode.

If you are moving to production, start with a minimum closed loop: pick three high-value intents, define thresholds, enforce release gates, and scale from there. Shipping a quality loop is more important than shipping broad feature coverage.

References

  • Salesforce Spring ’26 Release Notes (Agentforce monthly updates)
  • The Salesforce Developer’s Guide to the Spring ’26 Release (developer.salesforce.com)
  • Run Agent Tests | Agentforce Developer Guide (developer.salesforce.com)
  • About Knowledge/RAG Quality Data and Metrics (help.salesforce.com)
  • Agentforce Analytics / Reports documentation (help.salesforce.com)
  • Build and Optimize Agents with New Agentforce 360 Features (developer.salesforce.com)
