Enterprise AI Platform: 7 technical lessons for putting AI Agents into production

ARTICLE SUMMARY

Based on more than 1,500 AI Agent implementations across the platform, Pipefy has mapped the 7 technical decisions that sustain an enterprise AI platform in production, from the prompt to RAG curation, from evals to process governance. See each of them in detail.

At Web Summit Rio 2026, Adrianno Esnarriaga, the Product Manager leading Pipefy’s AI Agents squad, opened his masterclass with a sharp provocation: “today, I want to put an end to all the hype around AI Agents.”

The provocation isn’t a stylistic choice. It’s a direct consequence of a striking data point recently released by MIT in the GenAI Divide study: 95% of Generative AI initiatives in enterprises fail to generate revenue.

After more than 1,500 AI Agent implementations across the platform, Pipefy’s evidence points to an explanation that isn’t about the model, the data, or the tool; it’s about craft.

The difference between getting an AI Agent project off the ground and keeping it running in production lies in 7 technical decisions — lessons that apply to any enterprise AI platform, but that only become visible once real operations begin to expose what prompts, demos, and POCs never demand.

This article details each of these lessons, with the most common antipatterns observed in the field and the technical principles that sustain AI Agents in production.

What 1,500 implementations teach about the AI Agents hype

The paradox MIT points to isn’t a technology problem. The models available today (Claude, GPT-5, Gemini, fine-tuned open-weight models) handle most use cases in AI and machine learning applied to business processes. Academic benchmarks reflect that maturity.

The difference between what works in a benchmark and what works in a real operation shows up across three fronts that pilots rarely expose:

1. Input variability

In production, the agent receives contracts with atypical clauses, spreadsheets with empty fields, poorly scanned documents, WhatsApp messages in casual Portuguese. None of that looks like the POC test set.

2. Human noise

Operators approve wrong decisions, correct right ones, and silently create exceptions that become rules. Without structured feedback, that noise enters the agent as truth.

3. Legacy stack

ERPs, HR systems, insurance platforms, and IT solutions that have been running for 15 years weren’t designed to talk to AI Agents. They need to be orchestrated, not replaced.

None of this is solved by a better prompt, but it can be addressed by the 7 technical decisions that follow. Before diving in, however, it’s worth drawing a boundary that often stays implicit in discussions about AI Agents.

Figure: Input variability, human noise, and legacy stack are the three fronts that pilots rarely expose and that separate benchmarks from real operations.

From traditional automation to the enterprise AI platform

Traditional automation, meaning classical RPA (Robotic Process Automation) and integration scripts, is deterministic: define the rule, the robot executes, and the result is predictable. It works very well for repetitive tasks with structured inputs, and only for those, which is why every mature enterprise RPA solution eventually demands smarter layers to handle what isn’t deterministic.

AI Agents, by contrast, are probabilistic. They interpret variable inputs, make decisions within defined boundaries, and produce outputs that need to be evaluated. Teams that migrate from RPA to AI Agents without rethinking their engineering framework underestimate this transition, and the data MIT highlights largely corresponds to organizations that kept thinking of the agent as just “a smarter RPA.”

An enterprise AI platform exists precisely to support this difference: it orchestrates processes with governance, connects legacy systems, embeds business rules in auditable workflows, and allows AI Agents for enterprises to operate with clearly defined responsibility at every step.

Recognizing this boundary is the starting point. The 7 technical decisions described below are the craft behind that crossing, observed across more than 1,500 implementations at Pipefy, in increasing order of technical depth: from the isolated prompt to the design of the process.

Pipefy’s 7 technical lessons to put AI Agents into production

Each lesson below first surfaced as an antipattern before becoming a practice. What separates a mature AI Agent operation from one that stalls is the interval between observing the failure in the field and engineering the technical solution. Each lesson documents that interval:

1. The prompt is just the beginning (and usually where teams get stuck)

The first instinct of any team starting with AI Agents is to treat the prompt as the product. As Esnarriaga puts it in his Web Summit Rio 2026 talk, “obsession with the prompt is just the beginning” — and that’s exactly where most teams get stuck: they spend weeks refining text, find a prompt that works in 80% of POC cases, and call it a win.

In production, the isolated prompt becomes a technical ceiling. Once real operations kick in, the team realizes that 80% of the work lives outside the prompt: in context curation, in scope control, in the architecture of the workflow that delivers clean input to the model.

The most common antipattern is the monolithic prompt: that 4,000-token prompt trying to cover every case, every exception, every instruction. It works until the first unforeseen case, and trying to patch it by adding “just one more little rule” only makes things worse.

The lesson is structural: the prompt is a starting point, not a destination. Teams that treat the prompt as the technical ceiling never reach production.
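One way to picture the work that lives outside the prompt is a small preparation step that runs before the model is ever called. The sketch below is illustrative only: AgentTask, normalize, and build_prompt are hypothetical names, not Pipefy APIs, and the idea that each task declares its required fields is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentTask:
    """Hypothetical task definition: short instructions plus a fixed scope."""
    name: str
    instructions: str
    required_fields: tuple[str, ...]


def normalize(record: dict) -> dict:
    """Trim whitespace and drop empty values so the model sees clean input."""
    cleaned = {}
    for key, value in record.items():
        if value is None:
            continue
        text = str(value).strip()
        if text:
            cleaned[key] = text
    return cleaned


def build_prompt(task: AgentTask, record: dict) -> str:
    """Build a short prompt that contains only the fields the task needs."""
    data = normalize(record)
    missing = [name for name in task.required_fields if name not in data]
    if missing:
        # Fail in the workflow, before the model sees a half-empty input.
        raise ValueError(f"missing required fields: {missing}")
    lines = [f"{name}: {data[name]}" for name in task.required_fields]
    return task.instructions + "\n\n" + "\n".join(lines)
```

In this sketch the instructions stay short because scope control and input cleaning happen in code: fields outside the task’s scope never reach the model, and incomplete inputs are rejected before any model call.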

2. Without curation, RAG becomes structured noise

The second disappointment for teams that move past prompts is realizing that RAG (Retrieval-Augmented Generation), on its own, solves nothing. The company’s documents are uploaded to a vector store and plugged into the agent, and the agent ends up citing wrong excerpts, outdated sources, or sources that look right but are out of context.

Data is the oil, but you don’t run a car on oil — you run it on gasoline. Without curation, the company’s knowledge base is just crude oil.
Adrianno Esnarriaga
Product Manager – AI Agents | Pipefy

Curation means making editorial decisions about what goes into the RAG knowledge base. Documents that are duplicated, outdated, describe processes that no longer exist, or look right but contradict the company’s current policy are filtered out before indexing.

Vector stores don’t replace curation; they amplify whatever has been decided to be inside. That’s why Intelligent Document Processing (IDP) stops being an isolated feature and becomes a prerequisite for any serious RAG: it’s what extracts, classifies, and validates what can actually feed the agent.
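The mechanical part of that curation can be sketched as a filter that runs before anything is indexed. This is a minimal sketch under stated assumptions: the Doc fields (status, last_reviewed), the one-year freshness window, and the curate function are all hypothetical. Detecting documents that contradict current policy remains an editorial decision that this code does not attempt.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    last_reviewed: date
    status: str  # assumed values: "current" or "retired"


def fingerprint(text: str) -> str:
    """Hash of the text with case and whitespace normalized."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def curate(docs: list[Doc], today: date, max_age_days: int = 365) -> list[Doc]:
    """Keep current, recently reviewed documents and drop exact duplicates."""
    seen: set[str] = set()
    kept: list[Doc] = []
    # Newest first, so the most recently reviewed copy of a duplicate wins.
    for doc in sorted(docs, key=lambda d: d.last_reviewed, reverse=True):
        if doc.status != "current":
            continue  # describes a process that no longer exists
        if (today - doc.last_reviewed).days > max_age_days:
            continue  # not reviewed within the freshness window
        digest = fingerprint(doc.text)
        if digest in seen:
            continue  # duplicate of a document already kept
        seen.add(digest)
        kept.append(doc)
    return kept
```

Only the output of a step like curate would be embedded and written to the vector store, so the store amplifies a deliberate selection rather than the raw archive.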

3. Eval suites as the “safety belt”

Hallucination isn’t a technical detail. It’s the number one reason AI Agent operations get shut down after reaching production.

The agent cites a source that doesn’t exist, assigns a value the clause doesn’t authorize, asserts wrong information with full confidence — and the operator, with no fact-checking tool at hand, either believes it or pulls the plug.

The antipattern is relying on “feeling” to detect hallucination. It works while volume is low. It breaks in the first month of real production.

The lesson is technical: automated eval suites are the safety belt for any agent in production. Automated fact-checking, mandatory source citation, semantic regression on critical cases, comparing versions of the same agent against the same input. Without that arsenal, the agent’s hallucination can derail the entire project.

This lesson is technical rather than architectural: auditable governance is the rail the platform provides, and the eval suite is the engineering that makes each individual agent worthy of that rail.
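One of the checks named above, mandatory source citation, is simple enough to sketch. The example assumes a hypothetical convention in which the agent marks sources inline as [doc:ID]; the marker format and the check_citations function are illustrative, not part of any specific platform.

```python
import re

# Assumed citation marker, e.g. "[doc:policy-7]". The format is hypothetical.
CITATION = re.compile(r"\[doc:([A-Za-z0-9_-]+)\]")


def check_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    cited = CITATION.findall(answer)
    if not cited:
        problems.append("no source cited")
    for doc_id in cited:
        if doc_id not in retrieved_ids:
            # The agent cited something it was never given.
            problems.append(f"cited unknown source: {doc_id}")
    return problems
```

A check like this only proves that every cited source exists among the retrieved documents. Whether the source actually supports the claim still needs fact-checking, for example by a judge model or a human reviewer.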

See also: AI Governance: How Pipefy Mitigates Risks and Ensures Safe Use of Artificial Intelligence

4. The “do-it-all” agent doesn’t deliver anything well

A single agent trying to solve everything works as a demo but breaks as an operation. In Esnarriaga’s words, “the more things you throw at the same agent, the more tangled and stumbling it gets.”

The reason is structural: each new responsibility adds context, context competes for attention inside the prompt, and the quality of every decision drops.

The principle behind Pipefy’s AI Agents platform is the opposite: specialized agents, each with a single responsibility, connected by the process. One agent reads the document, another classifies risk, another writes a message to the employee. Each does one specific thing very well, and the workflow connects all of their work.

Specialization isn’t an architectural luxury. It’s the only way to maintain quality as volume scales.
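The shape of that design can be sketched with plain functions standing in for agents. Everything here is hypothetical: the three step functions replace real model calls with trivial logic, and the income threshold is an invented rule used only to make the example runnable.

```python
from typing import Callable

AgentStep = Callable[[dict], dict]


def read_document(case: dict) -> dict:
    """Stand-in for a document-reading agent: parses 'key: value' text."""
    key, _, value = case["document_text"].partition(":")
    return {**case, "fields": {key.strip(): int(value)}}


def classify_risk(case: dict) -> dict:
    """Stand-in for a risk agent; the 3000 threshold is an invented rule."""
    income = case["fields"]["income"]
    return {**case, "risk": "low" if income >= 3000 else "high"}


def draft_message(case: dict) -> dict:
    """Stand-in for a messaging agent."""
    return {**case, "message": f"Your documents were received (risk: {case['risk']})."}


def run_workflow(case: dict, steps: list[AgentStep]) -> dict:
    """The process connects the agents and records which step ran, in order."""
    trail = []
    for step in steps:
        case = step(case)
        trail.append(step.__name__)
    return {**case, "trail": trail}


result = run_workflow(
    {"document_text": "income: 4200"},
    [read_document, classify_risk, draft_message],
)
```

Each function sees only what it needs and can be tested, replaced, or re-evaluated on its own, while the workflow owns the ordering and the trail.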

5. TDD for AI Agents: LLM-as-judge as the missing eval

The software industry learned 20 years ago that code without automated testing breaks in production. The same curve is repeating itself with AI Agents — only too fast for many teams to notice.

The antipattern is shipping to production without a single automated eval suite. The team trusts that the prompt is solid, the agent “seems to work,” and the first base-model update (released by the provider, with no warning) breaks the behavior.

The lesson is technical: Test-Driven Development (TDD) for AI Agents means building an eval suite that runs before every deploy, comparing the new version against the baseline on a representative set of cases. LLM-as-judge has entered the picture as the central piece: a trusted model is used to judge, at scale, whether the agent’s output meets predefined criteria.

Testing an agent is fundamentally different from testing deterministic software or RPA. The right behavior isn’t a single output fixed by the input; it’s a distribution of acceptable outputs. Testing has changed, not gone away: evals check pass rates against criteria rather than exact matches.
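A pre-deploy gate of this kind can be sketched in a few lines. The sketch assumes the agent and the judge are both plain callables; in practice the judge would wrap a call to a trusted model, but here keyword_judge is a deterministic stand-in so the example runs without any provider. The names pass_rate and deploy_gate, the three samples per case, and the tolerance value are all assumptions.

```python
from typing import Callable

Agent = Callable[[str], str]
Judge = Callable[[str, str, str], bool]  # (input, output, criterion) -> passed?
Case = tuple[str, str]                   # (input text, criterion)


def keyword_judge(text: str, output: str, criterion: str) -> bool:
    """Stand-in for an LLM judge: passes if the criterion appears in the output."""
    return criterion in output


def pass_rate(agent: Agent, judge: Judge, cases: list[Case], samples: int = 3) -> float:
    """Judge several samples per case, because outputs form a distribution."""
    if not cases:
        raise ValueError("eval suite needs at least one case")
    verdicts = [
        judge(text, agent(text), criterion)
        for text, criterion in cases
        for _ in range(samples)
    ]
    return sum(verdicts) / len(verdicts)


def deploy_gate(candidate: Agent, baseline: Agent, judge: Judge,
                cases: list[Case], tolerance: float = 0.02) -> bool:
    """Allow the deploy only if the candidate does not regress past the tolerance."""
    return pass_rate(candidate, judge, cases) >= pass_rate(baseline, judge, cases) - tolerance
```

Running the same gate after a provider updates the base model turns a silent behavior change into a failed check instead of a production incident.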

6. Human-in-the-loop isn’t an approval button, but a confidence gradient

Most Human-in-the-Loop (HITL) implementations get stuck in a binary pattern: the agent acts, the human approves or rejects.

The antipattern is a binary User Interface (UI) that never evolves. The operator approves everything manually in the first month, the third month, the sixth month, and nothing changes. The agent keeps depending on human approval for every decision, and ROI never materializes, because the human workload hasn’t decreased.

The lesson is one of technical design: HITL needs to be a confidence gradient. In the beginning, the operator validates 100% of decisions. As the agent proves consistent across a class of cases, the human checkpoint narrows to the exceptions within that class. Step by step, the agent’s scope expands — without ever leaving the governed rail.

This is exactly the design behind intelligent process automation on a platform like Pipefy: every workflow step defines who decides, and that configuration evolves with the confidence documented in the operation.
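The gradient can be made concrete as a review rate computed per class of cases. This is a sketch under assumptions, not Pipefy’s implementation: the thresholds (50 reviews, 90% and 98% agreement, a 5% spot-check floor, 0.8 minimum agent confidence) and the names ClassStats, review_rate, and needs_human are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClassStats:
    """Human review history for one class of cases."""
    reviewed: int = 0  # decisions a human has checked
    agreed: int = 0    # of those, how many the human confirmed


def review_rate(stats: ClassStats, min_reviews: int = 50, start: float = 0.90,
                target: float = 0.98, floor: float = 0.05) -> float:
    """Fraction of this class's decisions that still go to a human."""
    if stats.reviewed < min_reviews:
        return 1.0  # not enough evidence yet: validate everything
    agreement = stats.agreed / stats.reviewed
    if agreement <= start:
        return 1.0
    if agreement >= target:
        return floor  # keep a small spot-check, never zero
    progress = (agreement - start) / (target - start)
    return 1.0 - progress * (1.0 - floor)


def needs_human(stats: ClassStats, agent_confidence: float, sample: float,
                min_confidence: float = 0.8) -> bool:
    """Route to a human on low confidence, or when the sample falls in the review rate.

    `sample` is a number in [0, 1), e.g. from random.random(), passed in by the caller.
    """
    if agent_confidence < min_confidence:
        return True  # exceptions always reach a person
    return sample < review_rate(stats)
```

With these assumed numbers, a class with 94% agreement over 100 reviews still sends roughly half of its decisions to a human, while a class at 99% keeps only the 5% spot-check plus any low-confidence cases.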

Figure: Human-in-the-Loop as a confidence gradient. The human checkpoint narrows to exceptions as the agent expands its governed scope across the operation.

7. An agent without a process is born “unemployed”

The central lesson, the one that gives the masterclass its core thesis, is also the most neglected: “an agent without a process is born ‘unemployed.’”

Most organizations start with the model: they pick the LLM, define the prompt, connect integrations, and only then ask, “now, what process is this actually solving?” In that order, the agent is born without a clear destination.

An enterprise AI platform starts with the process. The workflow defines where decisions happen, who’s responsible, what the SLA is, what the exceptions are, and what the audit needs to record. The agent enters that rail with clear responsibility and a defined scope, and that’s why it can last in production.

That inversion is what differentiates AI process automation from simply “releasing an agent into the operation.”
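Starting with the process can be read as a data-modeling decision: the workflow is declared first, and the agent is only one possible decider inside it. The sketch below is a hypothetical model, not Pipefy’s schema. WorkflowStep, Workflow, and the field names are assumptions chosen to mirror the elements named above: decision point, responsible owner, SLA, and audit record.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class WorkflowStep:
    name: str
    owner: str       # team accountable for the step
    decided_by: str  # assumed values: "agent" or "human"
    sla_hours: int


@dataclass
class Workflow:
    name: str
    steps: list[WorkflowStep]
    audit_log: list[dict] = field(default_factory=list)

    def record(self, step_name: str, actor: str, outcome: str) -> None:
        """Log a decision, rejecting actors the step does not authorize."""
        by_name = {step.name: step for step in self.steps}
        if step_name not in by_name:
            raise KeyError(f"unknown step: {step_name}")
        step = by_name[step_name]
        if actor != step.decided_by:
            raise PermissionError(f"{actor} may not decide step {step_name}")
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "workflow": self.name,
            "step": step.name,
            "owner": step.owner,
            "actor": actor,
            "outcome": outcome,
            "sla_hours": step.sla_hours,
        })
```

In a model like this, moving a step from human to agent is a change to the decided_by value rather than a code rewrite, and every decision leaves an audit entry tied to an owner.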

When to apply each lesson (and where teams typically fail)

Each of the 7 lessons has a specific warning sign that appears before the problem escalates. Recognizing that sign early is what separates the team that corrects course in weeks from the one that only catches the failure once the project is already at risk.

Lesson	Sign it’s missing in your project	Typical antipattern observed
1. Prompt is just the beginning	Every new use case starts a new prompt from scratch	A monolithic 4,000-token prompt trying to cover everything
2. RAG curation	Agent cites “hallucinated” or irrelevant sources	Uploading documents to the vector store with no editing
3. Eval against hallucination	Bugs only show up in production	Random manual validation, with no automated fact-checking
4. Specialization	“One agent that does it all”	Multiple unrelated tasks executed by the same agent
5. TDD for AI	Regressions appear after every model update	Deploying without an automated eval suite
6. HITL gradient	Operators manually approve everything, month after month	Binary approve/reject UI, with no evolution of scope
7. Process as home	Agent abandoned after the go-live	Agent connected to a system, with no process owner

Puma: the 7 lessons applied to HR at a global retail brand

The 7 lessons aren’t theory. They’ve become the foundation of how real operations run, and the case that best supports that today is Puma.

Puma is a global sportswear brand with more than 500 employees in Brazil, retail stores, distribution centers, and continuous hiring cycles.

Before Pipefy, HR handled 40 to 50 hires per month through a manual process that took 20 business days, validating RG (national ID), CPF (taxpayer ID), and supporting documents one by one. The resulting operational errors directly affected payroll, which is subject to e-Social compliance requirements.

With Pipefy, the team structured a 21-day onboarding workflow with daily automated communication cadences, plus AI Agent–powered document reading and validation reaching 90% accuracy.

Mapping the lessons onto the actual design of the solution:

Lesson 1 (“the prompt is just the beginning”): document reading wasn’t solved by prompt alone — it required a fully structured workflow.
Lesson 2 (curation) appears in the centralized employee database, reused across multiple processes.
Lesson 4 (specialization) shows up in the agent’s well-defined scope: it reads specific documents (RG, CPF, proof of income), not the entire HR function.
Lesson 6 (HITL gradient): instead of having the operator review every document the agent processes, only the cases where the agent itself flags low confidence are routed to human review.
And Lesson 7 (process as “home”) is the onboarding workflow itself: the agent lives inside a clear process, with step-level SLAs and direct e-Social integration.

The quantifiable results back the design:

10 hours/month saved on document reading.
More than 10,000 actions automated in a single year.
10+ departments using the platform.
And 29+ active processes, more than 10 of which are dedicated to HR.

We’re very pleased with how processes have advanced and with the efficiency Pipefy has brought through its tools. Today, we’ve gained not only in productivity, but also in the time we can dedicate to continuously improving other workflows.
Wanderson Andrade
P&O Payroll Analyst | Puma

How Pipefy translates the 7 lessons into an enterprise AI platform

As an enterprise AI platform, Pipefy implements AI Agents as part of the process architecture, never as isolated resources.

This means agents are born inside governed workflows, not loose. Every agent has a single responsibility, a defined scope, and a configurable human checkpoint that evolves with the confidence documented in the operation.

The platform also embeds audit trails per action, role-based access control, and model flexibility (BYOLLM, or bring your own LLM) from the very first flow.

And because the process is the agent’s “home,” the workflow owner defines where the agent decides on its own and where it routes to human review, without having to rewrite code for every adjustment.

Pipefy AI Report: The leap of Artificial Intelligence in Latin America

To explain how these 7 lessons translate into an AI strategy for your operation — and to show how Puma, Roca, and Banco Sofisa have applied them in practice — Pipefy has prepared an exclusive report: “The leap of Artificial Intelligence in Latin America: the next ‘leapfrog’ after WhatsApp, PIX, and Mobile Banking.”

In this material, you’ll see in more detail:

The Brazilian paradox in numbers and why Brazil is the world’s 2nd-largest user of Generative AI.
The LATAM window of the next 12 to 18 months and what it means for companies operating in Brazil.
The step where enterprise AI value is lost, and how to cross it.
Real-world cases with documented results in HR, credit, and operations.

Download the material for free and discover how to turn the technical craft behind the 7 lessons into an enterprise AI strategy ready for the LATAM opportunity window:

[Pipefy AI Report] The Leap of Artificial Intelligence in Latin America: The Next ‘Leapfrog’ After WhatsApp, PIX, and Mobile Banking
