We periodically corner three engineers from the agents platform and ask them what's actually going on. This quarter, that's Maya (SDR Agent), Rohan (Replenishment Agent) and Yusuf (the underlying agent runtime). Coffee was provided. Edits were minimal.

What did you ship this quarter?

Maya (SDR Agent): The big one was multilingual outbound. We always knew our customers in the GCC and SEA wanted Arabic and Bahasa, but the easy way — translate English drafts — produced output that read like a translation. We retrained the writing model with native examples per language and per industry. Reply rates in Arabic outbound are now within 8% of English, which is well above the industry baseline of "embarrassing."

Rohan (Replenishment Agent): Mine was supplier lead-time learning. The old replenishment agent assumed each supplier's lead time was the contracted lead time. Suppliers, as it turns out, lie about this. The new version learns the real distribution from actual delivery history and reorders against the 90th percentile, not the contracted promise. Stockouts dropped 31% across the pilot cohort.
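
Editor's sketch: a minimal illustration of what Rohan describes, reordering against the 90th percentile of observed lead times instead of the contracted figure. The function names, the nearest-rank percentile method, and all numbers are assumptions for illustration; this is not the Replenishment Agent's actual code.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest observation with at least
    pct percent of the data at or below it."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def reorder_point(daily_demand, lead_time_days):
    """Stock level at which to reorder so demand is covered during the lead time."""
    return daily_demand * lead_time_days

# Hypothetical delivery history (days from PO to receipt) for one supplier.
observed_lead_times = [7, 8, 8, 9, 9, 10, 10, 11, 12, 15]
contracted_lead_time = 7   # what the supplier promised (assumed)
daily_demand = 20          # units per day (assumed)

p90_lead_time = percentile_nearest_rank(observed_lead_times, 90)
naive_point = reorder_point(daily_demand, contracted_lead_time)
learned_point = reorder_point(daily_demand, p90_lead_time)
```

With these made-up numbers the learned reorder point sits well above the contracted one, which is the gap that shows up as stockouts when the promise is trusted.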

Yusuf (Runtime): Less visible from outside, but the runtime now supports deterministic replays. You can take any production agent decision and replay it in a sandbox with identical inputs, and the agent will produce the same output. This sounds boring. It is the most important thing we've shipped in two years because it makes debugging tractable. The original version was a chaotic mess — when an agent made a weird decision, you couldn't reliably reproduce it.
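
Editor's sketch: one way to picture deterministic replay is to record everything a decision depended on (serialized inputs plus the random seed) and re-run the same function against that record. The toy `decide` function and the `DecisionRecord` shape below are hypothetical and are not the runtime's real interfaces.

```python
import json
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class DecisionRecord:
    inputs_json: str  # inputs exactly as the agent saw them, serialized
    seed: int         # seed for every source of randomness in the decision
    output: str       # what the agent decided in production

def decide(inputs, rng):
    """Toy stand-in for an agent. It depends only on its arguments:
    no wall clock, no global random state, no network calls."""
    options = sorted(inputs["candidate_actions"])
    return rng.choice(options)

def run_and_record(inputs, seed):
    output = decide(inputs, random.Random(seed))
    return DecisionRecord(json.dumps(inputs, sort_keys=True), seed, output)

def replay(record):
    """Re-run the decision in isolation from the recorded inputs and seed."""
    inputs = json.loads(record.inputs_json)
    return decide(inputs, random.Random(record.seed))

record = run_and_record({"candidate_actions": ["email", "wait", "call"]}, seed=42)
replayed_output = replay(record)
```

The sketch only works because `decide` has no hidden inputs. In a real system, presumably the hard part is capturing tool results, timestamps, and model sampling so that nothing leaks in from outside the record.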

What surprised you?

Maya: How much the small choices matter. We A/B-tested whether the SDR agent should send follow-ups on Mondays or Tuesdays. The lift from Tuesday was bigger than the lift from a six-week rewrite of the qualification logic. There's a lot of performance left on the table in small details, and careful agent design is what recovers it.

Rohan: How wrong contracted lead times are. Across our customer base, the median supplier delivers 3 days later than what's in the PO. Some suppliers — about 8% of them — are routinely 14+ days late and have somehow been getting reordered anyway. The agent flagged this in week one of the pilot. The customer's sourcing team had been seeing it but couldn't quantify it. Now they can.

Yusuf: How much of the runtime work is about observability, not capability. Every senior engineer who joins thinks "we need to make the agents more capable." After about six weeks they figure out we need to make the agents more debuggable. Capability without debuggability is a science project.

Capability without debuggability is a science project. Production agents need replay, audit, and a why-log on every decision before they need a more powerful model.

Yusuf, Runtime team

What's hard about agents specifically?

Maya: Evaluation. With a traditional feature, you have unit tests and integration tests and you know if it works. With an agent, "does it work" is a statistical claim. You need a held-out set of real situations, a way to score the agent's responses, and a way to detect when the score drifts. We spend more time on the eval harness than on the agent itself.
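
Editor's sketch: the three pieces Maya lists (a held-out set, a scorer, and a drift check) can be shown in a few lines. The exact-match scorer, the toy agent, the sample situations, and the 0.05 tolerance are all assumptions for illustration, not the team's harness.

```python
from statistics import mean

def evaluate(agent, held_out, score):
    """Mean score of the agent over (situation, expected_response) pairs."""
    return mean(score(agent(situation), expected) for situation, expected in held_out)

def exact_match(response, expected):
    """Simplest possible scorer; real scorers are usually graded, not binary."""
    return 1.0 if response == expected else 0.0

def has_drifted(baseline_score, current_score, tolerance=0.05):
    """Flag a drop from the baseline larger than the tolerance."""
    return baseline_score - current_score > tolerance

# Hypothetical held-out situations with the response we expect.
held_out = [
    ("pricing question", "book_demo"),
    ("unsubscribe", "stop"),
    ("out of office", "retry_later"),
    ("not interested", "stop"),
]

def toy_agent(situation):
    """Stand-in agent that mishandles 'not interested'."""
    known = {
        "pricing question": "book_demo",
        "unsubscribe": "stop",
        "out of office": "retry_later",
    }
    return known.get(situation, "book_demo")

current_score = evaluate(toy_agent, held_out, exact_match)
drifted = has_drifted(baseline_score=0.9, current_score=current_score)
```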

Rohan: Failure modes that look like success. An agent can reorder the wrong thing and the system happily processes it. Stock arrives. Inventory balances. Three weeks later you discover the agent has been quietly building up a 6-month supply of the wrong SKU. The whole thing looks fine until you look at it from the right angle. We've built a lot of "look from the right angle" tooling.
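
Editor's sketch: one possible "right angle" for the overstock case Rohan describes is months of supply, meaning stock on hand plus stock on order divided by monthly demand. The field names, the three-month threshold, and the sample data are assumptions; the team's actual tooling is not described in the interview.

```python
def months_of_supply(on_hand, on_order, monthly_demand):
    """How many months current and incoming stock would last at current demand."""
    total = on_hand + on_order
    if total == 0:
        return 0.0
    if monthly_demand <= 0:
        return float("inf")  # stock with no demand never sells through
    return total / monthly_demand

def flag_overstock(skus, max_months=3.0):
    """Return SKU ids whose months of supply exceed the threshold."""
    return [
        s["sku"]
        for s in skus
        if months_of_supply(s["on_hand"], s["on_order"], s["monthly_demand"]) > max_months
    ]

# Hypothetical inventory snapshot.
snapshot = [
    {"sku": "A-100", "on_hand": 100, "on_order": 50, "monthly_demand": 100},
    {"sku": "B-200", "on_hand": 400, "on_order": 200, "monthly_demand": 100},
    {"sku": "C-300", "on_hand": 10, "on_order": 0, "monthly_demand": 0},
]

flagged = flag_overstock(snapshot)
```

Each individual reorder of B-200 would look valid on its own; only the aggregate ratio shows the six-month pile.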

Yusuf: The temptation to use a bigger model. There's always a more capable model six months away. The trap is using it before you've gotten as much as possible out of the cheaper one. Most of the wins we've shipped this year came from prompt engineering, tool design and policy tuning — not from upgrading the underlying model.

What are you betting on next?

Maya: Voice. Booking demos is becoming a phone conversation again as inboxes get noisier. The SDR Agent will need to take and place calls. That's a 2026 H2 project.

Rohan: Inter-agent collaboration. The Replenishment Agent currently doesn't know what the Pricing Agent is planning, so a price-cut promotion can blow up replenishment forecasts. We're building shared planning state so agents can negotiate before they act.

Yusuf: Live model swaps. We want any agent to be swappable between underlying models without a deploy. That sounds operational, but it's actually about flexibility — when a new model comes out, we should be able to A/B-test agents on it within a day. We can't yet. We will soon.
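
Editor's sketch: swapping models without a deploy usually implies that the model choice lives in data rather than in code. Below is a minimal illustration using a routing table and a stable hash so the same request always lands on the same model. The table layout, the model names, and `pick_model` are hypothetical, not the runtime's design.

```python
import hashlib

def bucket_percent(key):
    """Map a key to a stable integer in [0, 100) using SHA-256."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def pick_model(routing, agent_name, request_id):
    """Choose a model for this request from the routing table.

    routing[agent_name] holds a default model, a candidate model, and the
    percentage of traffic (0 to 100) that should go to the candidate.
    """
    rule = routing[agent_name]
    if bucket_percent(f"{agent_name}:{request_id}") < rule["candidate_percent"]:
        return rule["candidate"]
    return rule["default"]

# Hypothetical routing table; in practice this would be loaded from a config
# store at request time so it can change without redeploying the agent.
routing = {
    "sdr": {"default": "model-a", "candidate": "model-b", "candidate_percent": 10},
    "replenishment": {"default": "model-a", "candidate": "model-b", "candidate_percent": 0},
}

choice = pick_model(routing, "sdr", "req-123")
```

Hashing the request id rather than drawing a random number keeps the assignment reproducible, which also fits the deterministic-replay goal mentioned earlier in the interview.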

What would you say to someone reading this and wondering whether to join?

Maya: Come if you find the system-design problems interesting. The hard parts aren't the model. They're the interfaces. If you want to design how an agent talks to a database, a calendar, a CRM and a human — and have all of those work — you'll find a lot to do.

Rohan: Come if you like quantitative work that ships. We're not doing research; we're doing engineering against measurable customer outcomes. If you want a paper, this is the wrong company. If you want every quarter's bonus tied to whether your agent saved a customer real money, it's the right one.

Yusuf: Come if you've been frustrated that LLM products have been thin on the engineering side. There's a deep stack to build here. Most companies are stopping at "wrap the model." We're building the airframe under it.

We're hiring across the agents platform. See open roles — or just send an email to whoever above sounded most like you.

Tagged: hiring · engineering · agents