"It's like finding a needle in a haystack with fake needles everywhere." - Anish Agarwal, co-founder and CEO of Traversal
Website outages are painful, but in the age of AI-generated code they're turning existential. Last year, companies including Amazon Web Services, Azure, Cloudflare and Google Cloud all announced major outages, some lasting over 15 hours.
As Traversal co-founder and CEO Anish Agarwal puts it, the oft-quoted "$2 million an hour" cost of downtime is now, unfortunately, just a starting point for large enterprises.
"The problem only gets bigger, the larger the company gets," he said. "The $2 million might even be small if we're talking about some of the largest. I'm certain AWS's recent outage was an order of orders of magnitude bigger than $2 million an hour."
The stakes aren't just abstract numbers on a slide. Agarwal has watched outages end careers. For example, Optus CEO Kelly Bayer Rosmarin resigned in 2023 following a 14-hour network outage, and more recently, IndiGo airline CEO Pieter Elbers resigned after an outage led to thousands of flight cancellations.
"CEOs are fired when they're no longer hitting the agreements that they have contractually obligated to hit with customers," Agarwal said. "Once you don't do that, it's a security problem. You're in breach of your contract, and that leads to massive fines and reputational damage."
Why the old model broke
This isn't a new problem, but the increase in AI use is pouring gasoline on an already burning fire, according to Agarwal.
Even before generative AI, the amount of data produced from software was going up, yet the number of people who can troubleshoot well has been flat, Agarwal said. Why? Site reliability engineers (SREs) are scarce and budgets are capped even as observability has become "the second largest spend, typically, for a company after cloud spend," he said.
That means the status quo can look like a hospital emergency room on a bad night when something breaks in a large system.
"It spreads like an epidemic throughout your entire system," Agarwal said. This is because each team only understands its own part of the system, so connecting the dots between all these teams with limited context is painful, he said.
In the pre-AI world, a major incident could mean 50 to 60 engineers in a "war room" for hours, troubleshooting while millions of dollars were wasted.
Now add AI-generated code. More organizations are under pressure "to apply AI to everything," and one of the clearest areas of return on investment is software development via tools like Claude or Cursor, Agarwal said.
That pressure also leads some CIOs to regret their decisions. AI company Dataiku polled 800 CIOs and found 74% of them were under pressure to "deliver measurable business gains from AI within the next two years" or risk their jobs.
That's leading to some harried decision-making. The same percentage also "regret at least one major AI vendor or platform decision made in the last 18 months."
The result of all that pressure is a ton of code being written by AI. Large enterprises also give AI systems permissions they would not typically grant, so that they can see what the AI can do. This is known as "dangerously skip permissions," a mode in Claude Code that bypasses the need for user approval before the AI performs an action.
The combination of more opaque code, more permissions and less human context means things are breaking in ways not seen before.
"No one has context of the code, and the amount of code is blowing up as well," Agarwal said. "So the outages are getting way, way worse than they used to be, which was already really bad."
From causal ML research to AI SRE
All of this became the thesis for Agarwal's company, Traversal, which launches AI SREs to find the root cause of a network outage before engineers need the war room.
Agarwal didn't arrive at this problem as a traditional SaaS founder. His research while getting a Ph.D. at MIT and as a current professor at Columbia centered on a niche but powerful area: causal machine learning.
"These AI systems are very good at picking up minute correlations in data and not very good at picking up cause-and-effect relationships," he said. "My research was how do you get these AI systems to learn cause-and-effect relationships from data automatically?"
That turns out to be exactly what's missing in today's incident responses, and what Traversal is solving. In a complex distributed system, an outage looks like "finding a needle in a haystack with fake needles everywhere," Agarwal said.
The hard question, according to him, is: "When you see an issue, is it a symptom of the problem? Is it just a spurious correlation because something else is wrong in the system, or is it the root cause?"
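To make the "fake needles" idea concrete, here is a minimal Python sketch of a spurious correlation. It is only an illustration, not Traversal's method: the metric names (db_latency, checkout_errors, search_errors) and the noise levels are hypothetical, and the check used is a plain partial correlation rather than any causal machine learning technique. Two symptoms are both driven by one hidden root cause, so they look tightly linked until that root cause is accounted for.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares linear fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


def simulate(n=2000, seed=0):
    """Hypothetical system: one root cause drives two unrelated symptoms."""
    rng = random.Random(seed)
    db_latency = [rng.gauss(0, 1) for _ in range(n)]                # root cause
    checkout_errors = [d + rng.gauss(0, 0.5) for d in db_latency]   # symptom 1
    search_errors = [d + rng.gauss(0, 0.5) for d in db_latency]     # symptom 2
    return db_latency, checkout_errors, search_errors


db, checkout, search = simulate()

# The two symptoms never influence each other, yet they correlate strongly.
raw = pearson(checkout, search)

# Partial correlation: compare only what the root cause does not explain.
partial = pearson(residuals(checkout, db), residuals(search, db))

print(f"raw correlation between symptoms: {raw:.2f}")
print(f"after controlling for the root cause: {partial:.2f}")
```

In this toy setup the raw correlation is high and the partial correlation is close to zero, which is the signature of a symptom pair sharing a cause. Real incidents are harder because the root cause is not known in advance and has to be found among many candidate signals.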
Agarwal joined with Ahmed Lone, Raaz Dwivedi and Raj Agrawal to research this, and says the light-bulb moment came when he and his co-founders connected that research to the reality of operations. They also played with early AI coding tools and saw the trajectory clearly.
"If AI is going to write all of your code, and no one's going to understand it, we need AI to fix your code as well," Anish Agarwal said. "That was really the key moment for us."
He also felt that some of the most interesting work in AI was happening in companies now, and that a company "with research in its DNA," tackling a deeply technical problem, was the right expression.
Ending the 2 a.m. emergency calls
Traversal describes itself as an AI SRE agent that "autonomously troubleshoots, remediates and even prevents production incidents." To understand what that means, Agarwal paints a before-and-after picture.
Before Traversal, Agarwal saw a lot of those "war room" scenarios play out where an engineer gets paged at "ungodly times of the day," and joins an incident war room in Slack or Zoom to figure out what went wrong. Hours go by until there's an "aha moment" and the team finally converges on a fix.
"It's like this heart attack that an organization goes through every time a [critical] incident happens," Agarwal said.
With Traversal, the workflow looks very different. For example, when there's an incident, a ticket gets created, and Traversal automatically kicks off. By the time an engineer shows up, Traversal has come back with an answer, Agarwal said.
Traversal returns not only an answer but also tells the engineer who is needed to verify what it has said. So instead of 50 people, five or six people are needed to verify the answer and then execute the mitigating steps Traversal proposes, Agarwal said.
"Rather than an average three hours, it becomes something like 15 minutes to get to the root cause of an incident and mitigate it," he said.
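As a rough back-of-the-envelope check on those figures, the short Python sketch below combines the numbers quoted in this article. It rests on simplifying assumptions that are not from Agarwal: that downtime costs accrue linearly at the "$2 million an hour" rate for the whole incident, that a war room of 50 engineers (the low end of the 50 to 60 range) stays engaged for the full three hours, and that six verifiers stay engaged for the full 15 minutes. The function names are illustrative only.

```python
COST_PER_HOUR = 2_000_000  # the oft-quoted "$2 million an hour" figure


def incident_cost(hours, cost_per_hour=COST_PER_HOUR):
    """Downtime cost, assuming it accrues linearly over the incident."""
    return hours * cost_per_hour


def person_hours(engineers, hours):
    """Engineering time consumed, assuming everyone stays for the duration."""
    return engineers * hours


# Before: about three hours with a 50-person war room (assumed low end).
cost_before = incident_cost(3.0)
effort_before = person_hours(50, 3.0)

# After: about 15 minutes with six people verifying the answer (assumed).
cost_after = incident_cost(0.25)
effort_after = person_hours(6, 0.25)

print(f"downtime cost: ${cost_before:,.0f} -> ${cost_after:,.0f}")
print(f"person-hours:  {effort_before:g} -> {effort_after:g}")
```

Under these assumptions the downtime cost of a single incident falls from $6 million to $500,000, and the engineering time from 150 person-hours to 1.5. Actual savings depend on how costs accrue for a given business, which the article does not specify.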
For some customers, Traversal has moved beyond recommendation into action. They have trusted the organization with autonomously healing their system without a human in the loop. Agarwal called this "self driving production," where "Traversal finds the issue, tells you the mitigating steps, and then heals the system fully autonomously" without needing to get anyone up at 2 a.m.
Tangible ROI from AI
Over the last nine months, Agarwal has seen observability and reliability having a "ChatGPT moment," with enterprises actively seeking AI SRE solutions to keep increasingly AI-generated code stable in production.
Agarwal emphasizes that the product is now at a point where it can deliver fast, repeatable time-to-value, often within 30 days, by significantly reducing mean time to resolution.
As a result, Traversal is in go-to-market mode, growing the company fourfold to more than 70 people and turning on the sales engine after gaining clients, including American Express and Pepsi.
The company has moved so aggressively and hired so strategically that one of Agarwal's friends commented that Traversal has created "the Avengers of enterprise sales."
In just a few months, Traversal has hired a vice president of worldwide sales, a vice president of field engineering and a vice president of marketing, all from blue-chip infrastructure and observability companies like AppDynamics, Cribl, SignalFx and Splunk, along with more than 10 sales executives and supporting solutions engineers.
In addition to securing more customers, Traversal's vision extends well beyond incident response. The team is building what Agarwal calls a "production world model," which is a rich representation of a company's production environment analogous to the simulators used in self-driving cars.
This world model doesn't just power faster root-cause analysis; it can also be surfaced to AI coding tools to help them write more resilient code before it ever reaches production.
"The market for this is massive, and if you start collecting all this data and correlating across all these disparate systems, you can really rethink all of the maintenance of software, and that's the vision of where we're going," Agarwal said.
This article was written by Nermeen Nabil Khear Abdelmalak.
All rights reserved to USAGoldMines, www.usagoldmines.com