Part 2: AI Native Engineering Flow
AI Native Engineering Flow series
In Part 1 of The AI-Native Engineering Flow, we shared the numbers: one developer, six AI agents, three months, an enterprise loan processing system.
Part 2: Co-Plan with AI is now out! https://medium.com/data-science-at-microsoft/the-ai-native-engineering-flow-part-2-co-plan-with-ai-49e4a5bf43a1
7 insights from spending 22% of engineering time before writing a line of code
One finding kept coming up: the 22% of engineering time spent upfront in co-planning became the single biggest factor in preventing rework.
This article unpacks what happened during that 22%. Here are the key insights:
1. Generic AI agents give generic advice; domain tuning changes everything
Early experiments confirmed this fast. An agent instructed with "You are a product manager. Help with requirements." gave feedback like "Consider user needs" and "Ensure clarity." Technically correct, but not actionable. The fix was tuning to four areas: domain terminology (DTI ratios, credit scoring), technology stack (FastAPI, Azure, Microsoft Agent Framework), team conventions (ADR format, GitHub issue structure), and architectural constraints (multi-agent design, MCP server patterns). A PM agent that understands loan processing asks fundamentally different questions than a generic one.
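To make that concrete, here is a minimal sketch of what the tuning can look like inside a persona definition. The prompt text and constant names below are illustrative assumptions, not the project's actual persona files:

```python
# Hypothetical persona strings illustrating the four tuning areas described
# above. None of this text is taken from the project's real configuration.

GENERIC_PM_PERSONA = "You are a product manager. Help with requirements."

TUNED_PM_PERSONA = """
You are a product manager for an enterprise loan processing system.

Domain terminology:
- Use DTI (debt-to-income) ratios, credit scoring tiers, and underwriting stages precisely.

Technology stack:
- Assume FastAPI services on Azure, orchestrated with the Microsoft Agent Framework.

Team conventions:
- Capture decisions as ADRs; structure work as GitHub issues with the team's epic and label templates.

Architectural constraints:
- Respect the multi-agent design and MCP server boundaries; flag requirements that cut across them.

Before proposing acceptance criteria, ask which loan products, borrower
segments, and compliance checks a feature touches.
""".strip()
```

The difference shows up immediately in the questions the agent asks back.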
2. AI agents can collaborate across disciplines
When the PM agent identified the need for an intake flow, it consulted the UX agent, which proposed structured selections over free-text input. When the Architecture agent reviewed our design, it asked the Code Reviewer to validate prompt injection risks. These cross-agent handoffs produced insights that no single agent would have surfaced alone.
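A rough sketch of how one of these handoffs can be wired, assuming a generic `ask_agent(agent_name, prompt)` callable rather than any specific framework API:

```python
# Hypothetical PM -> UX handoff. `ask_agent` stands in for whatever chat or
# completion call your agent runtime exposes; it is not a real framework API.

from typing import Callable

def draft_intake_requirements(feature: str, ask_agent: Callable[[str, str], str]) -> str:
    """Let the PM agent draft requirements, consulting the UX agent on input design."""
    pm_draft = ask_agent("pm", f"Draft intake-flow requirements for: {feature}")

    # If the draft leans on free-text input, route it through the UX agent.
    if "free-text" in pm_draft.lower() or "text input" in pm_draft.lower():
        ux_feedback = ask_agent(
            "ux",
            "Review this intake flow and prefer structured selections over "
            f"free-text input where it reduces user error:\n{pm_draft}",
        )
        pm_draft = ask_agent(
            "pm", f"Revise the requirements using this UX feedback:\n{ux_feedback}"
        )
    return pm_draft
```

The value is less in the plumbing than in deciding which agent is allowed to pull in which other agent, and when.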
3. ADRs (Architecture Decision Records) became the source of truth, more than the code itself
This was the insight we didn't expect. In AI-assisted development, context loss is a constant problem: sessions get compacted, conversations end, token limits force forgetting. Over three months, 61 Architecture Decision Records became the project's institutional memory. Two months in, when an AI session suggested consolidating MCP servers "for simplicity," pointing to ADR-015 (MCP Server Separation of Concerns) immediately restored the context: reusability across agents, independent scaling, fault isolation. The AI adjusted its recommendation instantly.
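A lightweight way to put ADRs to work as context, sketched with hypothetical field names and paraphrased ADR text:

```python
# Hypothetical ADR structure and context helper. The ADR-015 wording below is
# paraphrased from the article, not the actual record.

from dataclasses import dataclass

@dataclass
class ADR:
    number: int
    title: str
    decision: str
    rationale: str

ADR_015 = ADR(
    number=15,
    title="MCP Server Separation of Concerns",
    decision="Keep MCP servers separate rather than consolidating them.",
    rationale="Reusability across agents, independent scaling, fault isolation.",
)

def with_adr_context(prompt: str, adr: ADR) -> str:
    """Prepend the governing ADR so a fresh AI session recovers lost context."""
    return (
        f"ADR-{adr.number:03d} ({adr.title}): {adr.decision} "
        f"Rationale: {adr.rationale}\n\n{prompt}"
    )
```

Pointing a session at the record, rather than re-arguing the decision, is what restored the context in the MCP example above.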
4. More comprehensive instructions made AI worse, not better
By week four, CLAUDE.md had grown to 3,000+ lines. Each agent persona exceeded 1,500 lines. We had documented every pattern, every edge case, every lesson learned. The result was worse AI assistance. Response times increased. Instructions early in the file got ignored. Recommendations became generic again: the "lost in the middle" effect. Every token in instruction files competes with actual work context.
After pruning CLAUDE.md to ~800 lines and each agent persona to 300-500, quality improved noticeably. There is no formula for the "correct" size, only continuous refinement. This is becoming even more important with newer models like Claude 4.6.
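One way to keep that pruning honest, assuming instruction files live at CLAUDE.md and agents/*.md (example paths, with budgets taken from the numbers above), is a small check run in CI or a pre-commit hook:

```python
# Hypothetical size-budget check for instruction files. The paths are
# examples; the line budgets mirror the figures in the text above.

from pathlib import Path

CLAUDE_BUDGET = 800     # max lines for CLAUDE.md
PERSONA_BUDGET = 500    # max lines per agent persona

def check_instruction_budgets(repo_root: str = ".") -> list[str]:
    """Return a warning for every instruction file over its line budget."""
    root = Path(repo_root)
    warnings = []

    claude = root / "CLAUDE.md"
    if claude.exists() and len(claude.read_text().splitlines()) > CLAUDE_BUDGET:
        warnings.append(f"{claude} exceeds {CLAUDE_BUDGET} lines; prune it.")

    for persona in sorted((root / "agents").glob("*.md")):
        if len(persona.read_text().splitlines()) > PERSONA_BUDGET:
            warnings.append(f"{persona} exceeds {PERSONA_BUDGET} lines; prune it.")

    return warnings

if __name__ == "__main__":
    for warning in check_instruction_budgets():
        print(warning)
```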
6. Agent configuration is code: version it, iterate it, prune it
Week 1: The Architecture agent applied every framework (OWASP Top 10, Well-Architected, Zero Trust, microservices patterns) to every review, regardless of relevance. We added a "Step 0" to categorize systems first (sketched after this list).
Week 3: The PM agent wasn't enforcing GitHub issue structure. We added explicit label requirements and epic templates.
Week 6: Agent personas had grown bloated. We optimized for concise directives, cutting token consumption substantially.
Agent definitions are never "done." They evolve with the project.
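As an illustration of that iteration, here is a hypothetical version of the week-1 "Step 0" fix; the wording is invented, but the mechanism, prepending a triage step to the persona and committing it like any other change, is the point:

```python
# Hypothetical "Step 0" triage prefix for the Architecture agent persona.
# The wording is illustrative, not the project's actual persona text.

STEP_ZERO = """
Step 0 - Categorize the system before reviewing it:
1. Is it internet-facing, an internal tool, or a batch job?
2. Which review frameworks actually apply to that category?
Skip any framework that does not match the category.
""".strip()

def build_architecture_persona(base_persona: str) -> str:
    """Prepend the triage step so reviews stay scoped to relevant frameworks."""
    return f"{STEP_ZERO}\n\n{base_persona}"
```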
7. Human judgment remains the irreplaceable element
Agents generated options and structured analysis. But decisions required human domain knowledge, strategic thinking, and the willingness to redirect when the AI went sideways. The PM agent asked great questions but couldn't inject loan processing expertise. The Architecture agent proposed patterns but couldn't weigh organizational trade-offs. The UX agent accelerated design but couldn't observe real users struggling with an interface.
The 22% wasn't AI planning autonomously; it was a human orchestrating AI collaborators toward aligned decisions.
Coming next: Part 3 (Build by Prompt) covers the 35% spent turning plans into code: active monitoring, the "stop agent" technique, and why full delegation to AI fails for enterprise systems.

