Benchmarking AI Assistants Against Real-World Codebases
There’s plenty of hype around AI writing entire applications from scratch. However, in real-world environments, such as front-end monorepos and microservice-based backends, the bar is much higher.
At RBA, we don’t evaluate AI by how well it builds a basic demo app. We test it inside layered systems with real patterns, custom architecture, and a mix of legacy and modern tooling. In this experiment, I compared two AI copilots—GPT-4o and Claude 3.7—on tasks pulled straight from a production-grade repo. Each model was asked to handle three core scenarios:
- Structuring projects using NX and TypeScript
- Building backend features with Spring Boot and Kotlin
- Generating front-end services and components
Project Structure: NX Setup with TypeScript Modules
GPT-4o: Needed Precision and Patience
GPT-4o struggled to organize NX-based modules properly. Even with instructions outlining shared, web, and mobile folder structures, it produced inconsistent results until I added several refinements:
- More specific example commands in the instruction file
- Clearer documentation of folder structure down to the module level
- A working example module already created in the repo
Another key insight was that GPT-4o does not automatically pull the instruction file into context; I had to attach it manually with every request. Once those adjustments were made, GPT-4o got close to the target structure, though a few configuration files still needed to be updated by hand to get it over the finish line.
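To illustrate what that reference scaffolding might look like, here is a minimal sketch of a shared NX library. The `libs/shared/date-utils` path and the `formatDate` helper are hypothetical placeholders, not modules from our actual repo; the point is that the instruction file documents the layout and the repo contains one concrete module the model can mirror.

```typescript
// Hypothetical reference module: libs/shared/date-utils
//
// Layout documented in the instruction file, down to the module level:
//   libs/shared/date-utils/
//     project.json              <- NX targets for build, lint, and test
//     src/index.ts              <- barrel file re-exporting the public API
//     src/lib/format-date.ts    <- the implementation below
//
// src/lib/format-date.ts

/** Formats a Date as YYYY-MM-DD; deliberately small so it is easy to copy as a pattern. */
export function formatDate(date: Date): string {
  return date.toISOString().slice(0, 10);
}

// src/index.ts would simply re-export the public API:
// export * from './lib/format-date';
```

Having one working module like this gives the model a concrete pattern to imitate instead of relying on a prose description alone.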
Claude 3.7: Better First Pass, Slight Overreach
Claude 3.7 correctly identified the tooling and followed the initial instructions more accurately. It did try to add extra features inferred from module names, which I hadn't asked for, but only one iteration was required to reach working build, lint, and test outputs. No manual edits were needed.
Backend Feature Development: Spring Boot with Kotlin
Claude 3.7: Pattern Recognition and Adaptability
Claude handled backend tasks extremely well. It created services and controllers using the existing project style, identifying and extending established conventions across the codebase.
GPT-4o: Lacked Structural Awareness
GPT-4o did not reliably recognize existing code patterns. It often failed to detect abstractions across modules and pulled in examples from unrelated public repositories. Even when I gave it access to the necessary files, it duplicated logic rather than working within the defined Gradle structure.
Front-End Development: React and TypeScript
For this test, I kept prompts vague to evaluate each model’s default behavior. Both agents received the same instruction file and backend controller as starting context.
GPT-4o: Dependent on Documentation and Easily Thrown Off
GPT-4o often referenced non-existent components or libraries. Including documentation helped a little, though it quickly increased token usage. It could not reliably identify which libraries were in use from the file context alone. Adding a new instruction that pointed to the top-level package.json improved results, but output remained inconsistent. When asked to build a service, it created a single file with mismatched types and an Axios implementation that didn’t align with the endpoint provided.
Claude 3.7: Stronger Alignment After a Few Iterations
Claude picked up on project patterns more efficiently. After four iterations, it adapted its output to match the repo’s preferred style, including React Query hooks, Axios for integrations, Yup validation, and supporting TypeScript types. When I pointed out gaps, Claude was able to quickly analyze and incorporate corrections from sample files.
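For a concrete picture of the style Claude converged on, here is a minimal sketch assuming a hypothetical `/api/projects/{id}` endpoint, a hypothetical `Project` type, and `@tanstack/react-query` as the React Query package. It illustrates the pattern (typed Axios call, Yup validation at the service boundary, a React Query hook on top), not the actual generated code.

```typescript
import axios from 'axios';
import { useQuery } from '@tanstack/react-query';
import * as yup from 'yup';

// Hypothetical domain type; in the real repo these are derived from the backend controller.
export interface Project {
  id: string;
  name: string;
  createdAt: string;
}

// Yup schema used to validate the API response at the boundary.
const projectSchema = yup.object({
  id: yup.string().required(),
  name: yup.string().required(),
  createdAt: yup.string().required(),
});

// Service function: typed Axios call against a hypothetical endpoint.
export async function fetchProject(projectId: string): Promise<Project> {
  const response = await axios.get<Project>(`/api/projects/${projectId}`);
  // validate() throws if the payload doesn't match the expected shape.
  return projectSchema.validate(response.data);
}

// React Query hook wrapping the service, matching the repo's preferred pattern.
export function useProject(projectId: string) {
  return useQuery({
    queryKey: ['project', projectId],
    queryFn: () => fetchProject(projectId),
    enabled: Boolean(projectId),
  });
}
```

Validating the response against an explicit schema is also the kind of detail that helps surface the type mismatches GPT-4o's single-file output ran into.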
Final Take: Claude 3.7 Proved More Effective
Claude 3.7 outperformed GPT-4o across all three categories. It required less input, responded more accurately to instructions, and produced cleaner, more usable code. GPT-4o was capable, but only with a high level of manual steering and repeated refinement.
Key lessons for developers working with AI agents:
- Context is everything. If results are off, step back and evaluate what the model is seeing versus what it needs to know.
- Be explicit. Folder structure, naming patterns, and working examples go a long way.
- Test and verify. Even clean-looking code can fail at runtime.
- Expect to iterate. These tools are assistants, not replacements. Progress comes from interaction, not automation.
AI copilots are a valuable part of the developer toolkit, especially in enterprise settings with complex architecture and evolving patterns. When used thoughtfully, they can help teams move faster and build more consistently; however, human oversight remains essential.