How Traversal Prevents Million-Dollar Outages

By Christine Hall

“It’s like finding a needle in a haystack with fake needles everywhere.” – Anish Agarwal, co-founder and CEO of Traversal

Website outages are painful, but in the age of AI-generated code they’re turning existential. Last year, companies including Amazon Web Services, Azure, Cloudflare and Google Cloud all announced major outages, some lasting over 15 hours.

As Traversal co-founder and CEO Anish Agarwal puts it, the oft-quoted “$2 million an hour” cost of downtime is now, unfortunately, just a starting point for large enterprises.

“The problem only gets bigger, the larger the company gets,” he said. “The $2 million might even be small if we’re talking about some of the largest. I’m certain AWS’s recent outage was an order of orders of magnitude bigger than $2 million an hour.”

The stakes aren’t just abstract numbers on a slide. Agarwal has watched outages end careers. For example, Optus CEO Kelly Bayer Rosmarin resigned in 2023 following a 14-hour network outage, and more recently, IndiGo airline CEO Pieter Elbers resigned after an outage led to thousands of flight cancellations.

“CEOs are fired when they’re no longer hitting the agreements that they have contractually obligated to hit with customers,” Agarwal said. “Once you don’t do that, it’s a security problem. You’re in breach of your contract, and that leads to massive fines and reputational damage.”

Why the old model broke

This isn’t a new problem, but the increase in AI use is pouring gasoline on an already burning fire, according to Agarwal.

Even before generative AI, the amount of data produced by software was going up, yet the number of people who can troubleshoot well has stayed flat, Agarwal said. Why? Site reliability engineers (SREs) are scarce and budgets are capped, even as observability has become “the second largest spend, typically, for a company after cloud spend,” he said.

That means that when something breaks in a large system, the status quo can look like a hospital emergency room on a bad night.

“It spreads like an epidemic throughout your entire system,” Agarwal said. This is because each team only understands its own part of the system, so connecting the dots between all these teams with limited context is painful, he said.

Even in a pre-AI world, a major incident could mean 50 to 60 engineers in a “war room” for hours, troubleshooting while millions of dollars are lost.

Now add AI-generated code. More organizations are under pressure “to apply AI to everything,” and one of the areas with the clearest return on investment is software development with tools like Claude or Cursor, Agarwal said.

That pressure is also causing some CIOs to regret their decisions. AI company Dataiku polled 800 CIOs and found 74% of them were under pressure to “deliver measurable business gains from AI within the next two years” or risk their jobs.

That’s leading to some harried decision-making. The same percentage also “regret at least one major AI vendor or platform decision made in the last 18 months.”

The result of all that pressure is a ton of code being written by AI. Large enterprises also give AI systems permissions they would not typically grant, so they can see what the AI can do. One example is “dangerously skip permissions,” a mode in Claude Code that bypasses the need for user approval before the AI performs an action.

The combination of more opaque code, more permissions and less human context means things are breaking in ways not seen before.

“No one has context of the code, and the amount of code is blowing up as well,” Agarwal said. “So the outages are getting way, way worse than they used to be, which was already really bad.”

From causal ML research to AI SRE

All of this became the thesis for Agarwal’s company, Traversal, which launches AI SREs to find the root cause of a network outage before engineers need the war room.

Agarwal didn’t arrive at this problem as a traditional SaaS founder. His research while getting a Ph.D. at MIT and as a current professor at Columbia centered on a niche but powerful area: causal machine learning.

“These AI systems are very good at picking up minute correlations in data and not very good at picking up cause-and-effect relationships,” he said. “My research was how do you get these AI systems to learn cause-and-effect relationships from data automatically?”

That turns out to be exactly what’s missing in today’s incident responses, and what Traversal is solving. In a complex distributed system, an outage looks like “finding a needle in a haystack with fake needles everywhere,” Agarwal said.

The hard question, according to him, is: “When you see an issue, is it a symptom of the problem? Is it just a spurious correlation because something else is wrong in the system, or is it the root cause?”

Agarwal joined with Ahmed Lone, Raaz Dwivedi and Raj Agrawal to research this, and says the light-bulb moment came when he and his co-founders connected that research to the reality of operations. They also played with early AI coding tools and saw the trajectory clearly.

“If AI is going to write all of your code, and no one’s going to understand it, we need AI to fix your code as well,” Anish Agarwal said. “That was really the key moment for us.”

He also felt that some of the most interesting work in AI was happening in companies now, and that a company “with research in its DNA,” tackling a deeply technical problem, was the right expression.

Ending the 2 a.m. emergency calls

Traversal describes itself as an AI SRE agent that “autonomously troubleshoots, remediates and even prevents production incidents.” To understand what that means, Agarwal paints a before-and-after picture.

Before Traversal, Agarwal saw many of those “war room” scenarios play out: an engineer gets paged at “ungodly times of the day” and joins an incident war room in Slack or Zoom to figure out what went wrong. Hours go by until there’s an “aha moment” and the team finally converges on a fix.

“It’s like this heart attack that an organization goes through every time a [critical] incident happens,” Agarwal said.

With Traversal, the workflow looks very different. For example, when there’s an incident, a ticket gets created, and Traversal automatically kicks off. By the time an engineer shows up, Traversal has come back with an answer, Agarwal said.

Traversal returns not only an answer but also tells the engineer who is needed to verify what it has said. “So instead of 50 people, five or six people are needed to verify the answer,” then execute the mitigating steps Traversal proposes, Agarwal said.

“Rather than an average three hours, it becomes something like 15 minutes to get to the root cause of an incident and mitigate it,” he said.
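To put those figures in perspective, here is a rough back-of-the-envelope sketch that combines the “$2 million an hour” figure quoted earlier with the three-hour and 15-minute resolution times. The function name and the assumption that cost scales linearly with downtime are illustrative only; they are not Traversal’s numbers or method.

```python
# Back-of-the-envelope sketch. The hourly cost and the two durations are the
# round figures quoted in the article; linear scaling is an assumption.
HOURLY_DOWNTIME_COST_USD = 2_000_000  # the oft-quoted "$2 million an hour"


def downtime_cost(minutes: float, hourly_cost: float = HOURLY_DOWNTIME_COST_USD) -> float:
    """Estimate the cost of an outage, assuming cost grows linearly with time."""
    return hourly_cost * minutes / 60


before = downtime_cost(180)  # roughly three hours to reach a root cause
after = downtime_cost(15)    # roughly 15 minutes
savings = before - after

print(f"Three-hour incident: ${before:,.0f}")
print(f"15-minute incident:  ${after:,.0f}")
print(f"Difference:          ${savings:,.0f}")
```

Under these assumptions, a single incident drops from about $6 million to about $500,000. Real costs vary widely by company and, as Agarwal notes, can be far higher for the largest enterprises.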

For some customers, Traversal has moved beyond recommendation into action. Those customers trust Traversal to heal their systems autonomously, without a human in the loop. Agarwal called this “self driving production,” where “Traversal finds the issue, tells you the mitigating steps, and then heals the system fully autonomously” without needing to get anyone up at 2 a.m.

Tangible ROI from AI

Over the last nine months, Agarwal has seen observability and reliability having a “ChatGPT moment,” with enterprises actively seeking AI SRE solutions to keep increasingly AI-generated code stable in production.

Agarwal emphasizes that the product is now at a point where it can deliver fast, repeatable time-to-value — often within 30 days — by significantly reducing mean time to resolution.

As a result, Traversal is in go-to-market mode, quadrupling its headcount to more than 70 people and turning on the sales engine after gaining clients including American Express and Pepsi.

The company has moved so aggressively and hired so strategically that one of Agarwal’s friends commented that Traversal has created “the Avengers of enterprise sales.”

In just a few months, Traversal has hired, among others, a vice president of worldwide sales, a vice president of field engineering and a vice president of marketing, all from blue-chip infrastructure and observability companies like AppDynamics, Cribl, SignalFx and Splunk, along with more than 10 sales executives and supporting solutions engineers.

Traversal’s vision also extends well beyond winning customers and responding to incidents. The team is building what Agarwal calls a “production world model,” a rich representation of a company’s production environment analogous to the simulators used in self-driving cars.

This world model doesn’t just power faster root-cause analysis; it can also be surfaced to AI coding tools to help them write more resilient code before it ever reaches production.

“The market for this is massive, and if you start collecting all this data and correlating across all these disparate systems, you can really rethink all of the maintenance of software, and that’s the vision of where we’re going,” Agarwal said.

This article is written by: Nermeen Nabil Khear Abdelmalak
