Beyond Vibe-Coding: The Reality of Scaling AI in Enterprise Engineering

Like many developers, I find myself in the middle of a large organisation trying to modernise its software development lifecycle (SDLC). The sheer velocity and promise of AI have everyone running a bit scared, though for very different reasons:

Leadership fears falling behind as competitors accelerate, or seeing their business model become irrelevant overnight.
Developers fear being replaced entirely, or seeing the craft they love turn into something they no longer recognise or enjoy.

As a technical lead at an insurance technology company, I am directly responsible for driving AI adoption across our engineering teams. This blog chronicles that journey—sharing the raw learnings, successes, and failures along the way. In this post, I want to look at our early approaches and the real-world challenges we are already experiencing.

The Beginning

I can’t say the majority of our developers were thrilled about using AI at first. Although everyone had access, the early tools were underwhelming. This poor quality discouraged the curious and brought a quiet sense of relief to those who feared being replaced.

The real turning point came when AI agents began writing code autonomously. Slowly, a handful of developers welcomed the shift and began integrating the agents into their daily workflows. We soon realised that the tools’ performance depended entirely on the operator: those who invested the effort to customise and understand the environment began to excel.

I was one of those early adopters. I spent hours every day experimenting, trying to understand how to harness the agent’s capabilities. This interest led to me being asked to help accelerate the rollout of AI tools across our wider engineering organisation.

The Initial Approach

Our initial approach was to build a multi-agent framework around the AI harness, to bridge the steep knowledge gap that kept most developers from using it effectively.

At the time, even when developers successfully passed tasks to the agent, they mostly relied on “vibe-coding”—steer-by-feel prompting to nudge the agent toward the desired outcome. To ground this, we designed our framework to act as a quality gate (see Figure 1). It helped the agent self-correct linting errors, verify passing tests, confirm all task requirements were met, and perform a final code quality review.

Figure 1: Initial multi-agent framework quality gate loop
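
To make the loop concrete, here is a minimal Python sketch of a quality gate of this shape. It is an illustration under assumptions, not our actual framework: the names `Check`, `GateResult`, `run_quality_gate` and `revise` are hypothetical, and each check (linting, tests, requirements, code review) is represented by a plain callable that returns either `None` or a feedback message for the agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical interface: a check inspects the current code and returns
# None on success, or a feedback message describing what to fix.
Check = Callable[[str], Optional[str]]


@dataclass
class GateResult:
    passed: bool
    attempts: int
    code: str
    feedback: List[str]


def run_quality_gate(
    code: str,
    checks: List[Check],
    revise: Callable[[str, List[str]], str],
    max_attempts: int = 3,
) -> GateResult:
    """Run every check; on failure, pass the feedback to the agent and retry."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        feedback = []
        for check in checks:
            message = check(code)
            if message is not None:
                feedback.append(message)
        if not feedback:
            # All gates are green: lint, tests, requirements, review.
            return GateResult(True, attempt, code, [])
        if attempt < max_attempts:
            # `revise` stands in for the agent producing a corrected version.
            code = revise(code, feedback)
    # Out of attempts: return the last state so a human can take over.
    return GateResult(False, max_attempts, code, feedback)
```

In a sketch like this, the `checks` list would hold one callable per gate, and the cap on attempts is a design choice: it stops the agent from looping forever on a failure it cannot resolve and hands the remaining feedback back to the developer.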

While this setup successfully reduced the amount of manual prompting, it didn’t solve everything. Code quality was still at the mercy of three variables: prompt quality, existing codebase health, and project setup.

Each of these factors brought its own complex dynamics:

Prompt Quality

Improving prompt quality is far from straightforward. Despite our efforts, vibe-coding persists as a common fallback. The reality is that if you don’t know exactly what you want or fail to articulate it, you leave the technical and implementation decisions entirely up to the agent—putting you completely at the mercy of its reasoning capabilities. Shifting from loose prompts to structured tasks improved outcomes, but only for developers who were already skilled at breaking down requirements and defining clear constraints. We also explored spec-driven development, but ran into a similar bottleneck: most developers struggled to write robust specifications. Without the experience to judge whether a spec is complete or missing critical edge cases, developers simply end up vibe-coding with extra steps.
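
As a rough illustration of what moving from a loose prompt to a structured task can look like, the Python sketch below models a task as explicit fields and flags the sections that are still empty. The names `TaskSpec`, `missing_parts` and `render_prompt` are hypothetical, and the chosen fields are an assumption about what a useful task contains rather than a standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskSpec:
    """Hypothetical structured task; the fields are illustrative."""
    goal: str
    constraints: List[str] = field(default_factory=list)
    acceptance_criteria: List[str] = field(default_factory=list)
    out_of_scope: List[str] = field(default_factory=list)


def missing_parts(spec: TaskSpec) -> List[str]:
    """Name the required sections that are still empty."""
    gaps = []
    if not spec.goal.strip():
        gaps.append("goal")
    if not spec.constraints:
        gaps.append("constraints")
    if not spec.acceptance_criteria:
        gaps.append("acceptance_criteria")
    return gaps


def render_prompt(spec: TaskSpec) -> str:
    """Turn the spec into a plain-text task description for the agent."""
    lines = [f"Goal: {spec.goal}"]
    sections = [
        ("Constraints", spec.constraints),
        ("Acceptance criteria", spec.acceptance_criteria),
        ("Out of scope", spec.out_of_scope),
    ]
    for title, items in sections:
        if items:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)
```

Note the limit of a check like `missing_parts`: it can tell you a section is empty, but not whether the constraints you did write are the right ones. That judgement still sits with the developer, which is exactly the bottleneck described above.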

Existing Codebase Health

It’s easy to assume a large, mature enterprise codebase is the perfect playground for AI. In theory, millions of lines of existing code should give an agent plenty of context and patterns to mimic. In practice, legacy codebases are often minefields of inconsistent standards, architectural shifts, and technical debt. When you drop an AI agent into this environment, it doesn’t just replicate the good patterns—it happily inherits and propagates the bad ones, amplifying existing mess at machine speed.

Project Setup

A chaotic project setup presents similar roadblocks. When project structures are inconsistent, linting rules and configurations are absent, and overall code organisation is poor, the agent lacks the structural feedback loop it needs to make informed decisions. Without a strict project structure, automated linting, and solid testing frameworks, we found ourselves spending more time babysitting the agent’s output than it would have taken to write the code ourselves.
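
One cheap way to spot this before handing a repository to an agent is a small preflight check for the feedback signals mentioned above. The Python sketch below is only an illustration: the function name `missing_feedback_signals` and the `EXPECTED` table are hypothetical, and the file names in it are examples of common tool configs, not a statement about what any given project should use.

```python
from pathlib import Path
from typing import Dict, List

# Hypothetical expectations: each signal is satisfied if any one of the
# listed files or directories exists. Adjust to your own stack.
EXPECTED: Dict[str, List[str]] = {
    "lint config": [".eslintrc.json", "ruff.toml", ".flake8"],
    "test suite": ["tests", "test"],
    "formatter config": [".prettierrc", ".editorconfig"],
}


def missing_feedback_signals(
    root: Path, expected: Dict[str, List[str]] = EXPECTED
) -> List[str]:
    """Return the labels of signals with no matching file under `root`."""
    missing = []
    for label, candidates in expected.items():
        if not any((root / name).exists() for name in candidates):
            missing.append(label)
    return missing
```

Calling `missing_feedback_signals(Path("."))` from the repository root returns a list of gaps. An empty list does not prove the setup is good, only that the basic scaffolding an agent can lean on is present.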

Individual vs Team Acceleration

When we started this journey, most tools and harnesses focused purely on individual developer productivity. In reality, the context an agent needs to produce meaningful, high-quality output extends far beyond any single developer’s frame of reference. Large enterprise organisations rely on multi-disciplinary teams to build and scale complex applications. There is a structural reason for this: these systems are too complex for any single person to hold in their head, requiring diverse skills across architecture, product management, QA, and security. Crucially, business continuity hinges on team redundancy—if an engineer leaves, it shouldn’t disrupt operations. Consequently, AI acceleration cannot be viewed solely through the lens of individual productivity. If our AI tools only make individual developers write bad code faster, we haven’t actually solved anything.

Team Maturity

My team was chosen as the lighthouse team to pilot these capabilities. However, because early AI tools lacked collaborative features, we initially adopted suboptimal processes. We tried introducing spec-driven development at the individual user story level, but it quickly backfired, adding more friction than value. When the stories themselves were vague and the agent had no understanding of the broader application architecture, the generated specs were often completely off-target, and we had to discard and regenerate them several times.

Eventually, we took a step back and shifted to generating specifications per feature rather than per story. This pivot led to far more cohesive specs and less wasted effort. While this approach worked well for our greenfield project, it was quickly rejected when suggested to teams working on brownfield systems. With well-established backlogs already in place, those teams were reluctant to spend significant time reworking their existing user stories into meaningful context for the AI. Instead, they wanted tools to help them automate and maintain their current backlogs.

I found this problematic.

Context Streams

Simply automating our existing SDLC won’t move the needle. An agent’s performance hinges entirely on the quality and quantity of the context it is given. Relying on Jira or other backlog tools to feed context to an agent is a recipe for failure. Those tools were built for humans. In traditional Agile practice, a backlog item carries the minimal amount of information a human needs to fill in the blanks or to know who to consult. Agents, by contrast, require rich, detailed context from various disciplines, and they need it in a highly structured format.
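
To give a feel for what "structured context from various disciplines" could mean in practice, here is a small Python sketch. Everything in it is an assumption made for illustration: the `DISCIPLINES` tuple, the `ContextEntry` and `FeatureContext` classes, and the JSON layout are hypothetical, not a format we have settled on.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

# Hypothetical set of disciplines that should contribute context.
DISCIPLINES = ("architecture", "product", "qa", "security")


@dataclass
class ContextEntry:
    discipline: str
    summary: str
    details: List[str] = field(default_factory=list)


@dataclass
class FeatureContext:
    feature: str
    entries: List[ContextEntry] = field(default_factory=list)

    def uncovered(self) -> List[str]:
        """List the disciplines that have not contributed anything yet."""
        covered = {entry.discipline for entry in self.entries}
        return [d for d in DISCIPLINES if d not in covered]

    def to_json(self) -> str:
        """Serialise the whole bundle in a form an agent can be given."""
        return json.dumps(asdict(self), indent=2)
```

The useful part is less the serialisation than `uncovered()`: a ticket written for humans can silently omit the security or QA view, while a structure like this makes the gap visible before the agent starts work.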

The organisation as a whole needs to fundamentally rethink its approach to software development.

If we want AI agents to do more than just write simple boilerplate, we have to start treating context as a first-class citizen in our development lifecycle.
