Part 2: AI Native Engineering Flow
AI Native Engineering Flow series
In Part 1 of The AI-Native Engineering Flow, we shared the numbers: one developer, six AI agents, three months, an enterprise loan processing system.
Part 2: Co-Plan with AI is now out! https://medium.com/data-science-at-microsoft/the-ai-native-engineering-flow-part-2-co-plan-with-ai-49e4a5bf43a1
7 insights from spending 22% of engineering time before writing a line of code
One finding kept coming up: the 22% of engineering time spent upfront in co-planning became the single biggest factor in preventing rework.
This article unpacks what happened during that 22%. Here are the key insights:
1. Generic AI agents give generic advice; domain tuning changes everything
Early experiments confirmed this fast. An agent instructed with "You are a product manager. Help with requirements." gave feedback like "Consider user needs" and "Ensure clarity." Technically correct, but not actionable. The fix was tuning to four areas: domain terminology (DTI ratios, credit scoring), technology stack (FastAPI, Azure, Microsoft Agent Framework), team conventions (ADR format, GitHub issue structure), and architectural constraints (multi-agent design, MCP server patterns). A PM agent that understands loan processing asks fundamentally different questions than a generic one.
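To make that concrete, here is a minimal sketch of what the tuning can look like inside a persona definition. The prompt text and constant names below are illustrative assumptions, not the project's actual persona files:

```python
# Hypothetical persona strings illustrating the four tuning areas described
# above. None of this text is taken from the project's real configuration.

GENERIC_PM_PERSONA = "You are a product manager. Help with requirements."

TUNED_PM_PERSONA = """
You are a product manager for an enterprise loan processing system.

Domain terminology:
- Use DTI (debt-to-income) ratios, credit scoring tiers, and underwriting stages precisely.

Technology stack:
- Assume FastAPI services on Azure, orchestrated with the Microsoft Agent Framework.

Team conventions:
- Capture decisions as ADRs; structure work as GitHub issues with the team's epic and label templates.

Architectural constraints:
- Respect the multi-agent design and MCP server boundaries; flag requirements that cut across them.

Before proposing acceptance criteria, ask which loan products, borrower
segments, and compliance checks a feature touches.
""".strip()
```

The difference shows up immediately in the questions the agent asks back.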
2. AI agents can collaborate across disciplines
When the PM agent identified the need for an intake flow, it consulted the UX agent, which proposed structured selections over free-text input. When the Architecture agent reviewed our design, it asked the Code Reviewer to validate prompt injection risks. These cross-agent handoffs produced insights that no single agent would have surfaced alone.
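A rough sketch of how one of these handoffs can be wired, assuming a generic `ask_agent(agent_name, prompt)` callable rather than any specific framework API:

```python
# Hypothetical PM -> UX handoff. `ask_agent` stands in for whatever chat or
# completion call your agent runtime exposes; it is not a real framework API.

from typing import Callable

def draft_intake_requirements(feature: str, ask_agent: Callable[[str, str], str]) -> str:
    """Let the PM agent draft requirements, consulting the UX agent on input design."""
    pm_draft = ask_agent("pm", f"Draft intake-flow requirements for: {feature}")

    # If the draft leans on free-text input, route it through the UX agent.
    if "free-text" in pm_draft.lower() or "text input" in pm_draft.lower():
        ux_feedback = ask_agent(
            "ux",
            "Review this intake flow and prefer structured selections over "
            f"free-text input where it reduces user error:\n{pm_draft}",
        )
        pm_draft = ask_agent(
            "pm", f"Revise the requirements using this UX feedback:\n{ux_feedback}"
        )
    return pm_draft
```

The value is less in the plumbing than in deciding which agent is allowed to pull in which other agent, and when.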
3. ADRs (Architecture Decision Records) became the source of truth, more than the code itself
This was the insight we didn't expect. In AI-assisted development, context loss is a constant problem: sessions get compacted, conversations end, token limits force forgetting. Over three months, 61 Architecture Decision Records became the project's institutional memory. Two months in, when an AI session suggested consolidating MCP servers "for simplicity," pointing to ADR-015 (MCP Server Separation of Concerns) immediately restored the context: reusability across agents, independent scaling, fault isolation. The AI adjusted its recommendation instantly.
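A lightweight way to put ADRs to work as context, sketched with hypothetical field names and paraphrased ADR text:

```python
# Hypothetical ADR structure and context helper. The ADR-015 wording below is
# paraphrased from the article, not the actual record.

from dataclasses import dataclass

@dataclass
class ADR:
    number: int
    title: str
    decision: str
    rationale: str

ADR_015 = ADR(
    number=15,
    title="MCP Server Separation of Concerns",
    decision="Keep MCP servers separate rather than consolidating them.",
    rationale="Reusability across agents, independent scaling, fault isolation.",
)

def with_adr_context(prompt: str, adr: ADR) -> str:
    """Prepend the governing ADR so a fresh AI session recovers lost context."""
    return (
        f"ADR-{adr.number:03d} ({adr.title}): {adr.decision} "
        f"Rationale: {adr.rationale}\n\n{prompt}"
    )
```

Pointing a session at the record, rather than re-arguing the decision, is what restored the context in the MCP example above.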
4. More comprehensive instructions made AI worse, not better
By week four, CLAUDE.md had grown to 3,000+ lines. Each agent persona exceeded 1,500 lines. We had documented every pattern, every edge case, every lesson learned. The result was worse AI assistance. Response times increased. Instructions early in the file got ignored. Recommendations became generic again: the "lost in the middle" effect. Every token in instruction files competes with actual work context.
After pruning CLAUDE.md to ~800 lines and each agent persona to 300-500, quality improved noticeably. There is no formula for the "correct" size, only continuous refinement. This is becoming even more important with newer models like Claude 4.6.
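One way to keep that pruning honest, assuming instruction files live at CLAUDE.md and agents/*.md (example paths, with budgets taken from the numbers above), is a small check run in CI or a pre-commit hook:

```python
# Hypothetical size-budget check for instruction files. The paths are
# examples; the line budgets mirror the figures in the text above.

from pathlib import Path

CLAUDE_BUDGET = 800     # max lines for CLAUDE.md
PERSONA_BUDGET = 500    # max lines per agent persona

def check_instruction_budgets(repo_root: str = ".") -> list[str]:
    """Return a warning for every instruction file over its line budget."""
    root = Path(repo_root)
    warnings = []

    claude = root / "CLAUDE.md"
    if claude.exists() and len(claude.read_text().splitlines()) > CLAUDE_BUDGET:
        warnings.append(f"{claude} exceeds {CLAUDE_BUDGET} lines; prune it.")

    for persona in sorted((root / "agents").glob("*.md")):
        if len(persona.read_text().splitlines()) > PERSONA_BUDGET:
            warnings.append(f"{persona} exceeds {PERSONA_BUDGET} lines; prune it.")

    return warnings

if __name__ == "__main__":
    for warning in check_instruction_budgets():
        print(warning)
```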
6. Agent configuration is code: version it, iterate it, prune it
Week 1: The Architecture agent applied every framework (OWASP Top 10, Well-Architected, Zero Trust, microservices patterns) to every review, regardless of relevance. We added a "Step 0" to categorize systems first (sketched after this list).
Week 3: The PM agent wasn't enforcing GitHub issue structure. We added explicit label requirements and epic templates.
Week 6: Agent personas had grown bloated. We optimized for concise directives, cutting token consumption substantially.
Agent definitions are never "done." They evolve with the project.
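As an illustration of that iteration, here is a hypothetical version of the week-1 "Step 0" fix; the wording is invented, but the mechanism, prepending a triage step to the persona and committing it like any other change, is the point:

```python
# Hypothetical "Step 0" triage prefix for the Architecture agent persona.
# The wording is illustrative, not the project's actual persona text.

STEP_ZERO = """
Step 0 - Categorize the system before reviewing it:
1. Is it internet-facing, an internal tool, or a batch job?
2. Which review frameworks actually apply to that category?
Skip any framework that does not match the category.
""".strip()

def build_architecture_persona(base_persona: str) -> str:
    """Prepend the triage step so reviews stay scoped to relevant frameworks."""
    return f"{STEP_ZERO}\n\n{base_persona}"
```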
7. Human judgment remains the irreplaceable element
Agents generated options and structured analysis. But decisions required human domain knowledge, strategic thinking, and the willingness to redirect when the AI went sideways. The PM agent asked great questions but couldn't inject loan processing expertise. The Architecture agent proposed patterns but couldn't weigh organizational trade-offs. The UX agent accelerated design but couldn't observe real users struggling with an interface.
The 22% wasn't AI planning autonomously; it was a human orchestrating AI collaborators toward aligned decisions.
Coming next: Part 3 (Build by Prompt) covers the 35% spent turning plans into code: active monitoring, the "stop agent" technique, and why full delegation to AI fails for enterprise systems.

