RBA Consulting

Does Iterative Refinement Improve AI Code Generation? We Tested It.

There has been growing discussion on platforms like LinkedIn around recursive strategies such as Recursive Self-Aggregation (RSA) for improving large language model outputs. The idea is compelling: allow a model to generate, critique, and refine its own responses across multiple iterations to produce better results.
But does it actually work in practice?
We set out to test a simplified version of this concept using an iterative refinement loop and compared it directly against standard single-pass generation.

What We Tested

We evaluated two core approaches to AI code generation:
  • Single-pass generation
    One prompt, one response. This serves as the baseline approach most developers use today.
  • Multi-iteration refinement (adversarial loop)
    A recursive process in which the model:
      • Generates code
      • Critiques its own output
      • Refines the response based on that critique
      • Repeats the cycle
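The loop above can be sketched in a few lines of Python. This is an illustrative skeleton only: `generate`, `critique`, and `refine` are hypothetical stand-ins for calls to whatever model API you are using, not functions from our actual test harness.

```python
def generate(prompt: str) -> str:
    """Placeholder: one model call that returns candidate code."""
    return "def solution(): ..."

def critique(code: str) -> str:
    """Placeholder: ask the model to find flaws in its own output."""
    return "no issues found"

def refine(code: str, feedback: str) -> str:
    """Placeholder: ask the model to rewrite the code given the critique."""
    return code

def refinement_loop(prompt: str, max_iterations: int = 3) -> str:
    code = generate(prompt)                # 1. generate
    for _ in range(max_iterations):
        feedback = critique(code)          # 2. critique its own output
        if "no issues found" in feedback.lower():
            break                          # stop early if the critique passes
        code = refine(code, feedback)      # 3. refine based on the critique
    return code                            # 4. repeat until done or capped
```

Note that each extra cycle costs a full critique call plus a full refine call, which is where the token and latency overhead reported below comes from.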
To understand how model size impacts performance, we ran both approaches across:
  • 1.5B parameter model
  • 7B parameter model
We tracked performance across key metrics relevant to both developers and enterprise teams:
  • Accuracy (pass rate)
  • Execution time
  • Token usage
  • Iteration count
  • Model confidence

How We Tested It

To ensure consistency and comparability, we used the HumanEval benchmark, a widely adopted standard for evaluating code generation tasks.
Each model and approach was tested against 50 problems, with the following metrics recorded:
  • Pass Rate: Percentage of correct solutions
  • Avg Time: Seconds per task
  • Tokens: Average tokens used per task
  • Iterations: Average number of refinement cycles
  • Confidence: Model self-reported confidence (0–1)
  Model Size    | Baseline (Single Pass) | Adversarial (Multi-Iteration)
  1.5B (Small)  | 50% pass, 9.3s         | 36% pass, 261.8s
  7B (Large)    | 76% pass, 24.5s        | 74% pass, 645.6s
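HumanEval-style scoring boils down to executing each candidate solution against its unit tests and counting passes. The snippet below is a simplified illustration of that idea, not the official HumanEval runner (which sandboxes execution for safety); the toy `add` problems are invented examples.

```python
def run_candidate(solution_src: str, test_src: str) -> bool:
    """Exec a candidate and its tests in a shared namespace; pass = no exception."""
    namespace: dict = {}
    try:
        exec(solution_src, namespace)   # define the candidate's functions
        exec(test_src, namespace)       # run the assertions against them
        return True
    except Exception:
        return False

def pass_rate(problems: list[tuple[str, str]]) -> float:
    """Fraction of (solution, tests) pairs whose tests all pass."""
    passed = sum(run_candidate(sol, tests) for sol, tests in problems)
    return passed / len(problems)

# Example: one correct and one buggy solution to the same toy problem.
good = ("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")
bad = ("def add(a, b):\n    return a - b", "assert add(2, 3) == 5")
```

With these two candidates, `pass_rate([good, bad])` gives 0.5, i.e. a 50% pass rate.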

What We Expected

Our initial hypothesis was straightforward:

Allowing the model multiple opportunities to critique and refine its output should improve accuracy.

This assumption aligns with how humans approach problem-solving and how iterative workflows are typically framed in AI discussions.

However, the results told a very different story.

  Configuration      | Pass Rate | Avg. Time | Tokens | Iterations | Confidence
  7B + Baseline      | 76%       | 24.5s     | 634    | 1.0        | 0.45
  1.5B + Baseline    | 50%       | 9.3s      | 519    | 1.0        | 0.46
  7B + Adversarial   | 74%       | 645.6s    | 3262   | 2.4        | 0.53
  1.5B + Adversarial | 36%       | 261.8s    | 3063   | 2.5        | 0.59

What We Found

Instead of improving performance, iterative refinement:
  • Reduced accuracy by 2 to 14 percentage points (a 2-point drop for the 7B model, a 14-point drop for the 1.5B model)
  • Increased compute time by roughly 26x

Yes, 26x longer processing time… for worse results.

This represents a clear tradeoff that does not favor recursive refinement in its current form, especially in production or enterprise environments where latency and cost are critical.

Breaking Down the Results

  • The 7B model outperformed the 1.5B model, as expected
    • However, both models degraded under iterative refinement
  • Increased iterations led to:
    • Context bloat
    • Diminishing returns
    • Higher token consumption without improved outcomes
  Model Size   | Baseline Pass Rate | Adversarial Pass Rate | Absolute Change | Relative Change
  7B (Large)   | 76%                | 74%                   | -2 pts          | -2.6%
  1.5B (Small) | 50%                | 36%                   | -14 pts         | -28.0%
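For clarity, this is how the two change columns are derived from the pass rates: absolute change is a simple difference in percentage points, while relative change expresses that difference as a fraction of the baseline.

```python
def change(baseline: float, adversarial: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in %)."""
    absolute = adversarial - baseline          # percentage points
    relative = absolute / baseline * 100       # percent of the baseline rate
    return absolute, relative

abs_7b, rel_7b = change(76, 74)      # -2 points, about -2.6%
abs_15b, rel_15b = change(50, 36)    # -14 points, -28%
```

The relative view makes the gap clearer: the 7B model barely moved, while the 1.5B model lost more than a quarter of its baseline accuracy.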

Why This Matters for Enterprise AI

For organizations investing in AI-assisted development, these findings highlight a key consideration:

More complexity does not always equal better outcomes.

Iterative refinement strategies may sound promising, but without careful implementation, they can:

  • Increase infrastructure costs
  • Introduce latency
  • Degrade output quality
In high-scale environments, this becomes a compounding issue.

What We’re Exploring Next

This isn’t the end of the road for recursive strategies.

There are several areas worth exploring to improve results:

  • Better context management
    Preventing prompt and critique overload between iterations
  • Dynamic temperature tuning
    Varying creativity across refinement cycles
  • Larger model sizes (14B–30B)
    Testing whether more capable models benefit from multi-pass approaches
  • Structured critique frameworks
    Improving how feedback is generated and applied
It is possible that iterative refinement requires more sophistication than a simple loop to be effective.
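Two of the ideas above, context management and dynamic temperature tuning, can be sketched as modifications to the basic loop. This is a hypothetical design sketch, not something we have benchmarked: `call_model` stands in for a real model API, and the linear temperature schedule is one plausible choice among many.

```python
def call_model(prompt: str, temperature: float) -> str:
    """Placeholder for an actual model call."""
    return f"code@T={temperature:.2f}"

def refined_loop(task: str, max_iterations: int = 3,
                 start_temp: float = 0.8, end_temp: float = 0.2) -> str:
    code = call_model(task, start_temp)
    for i in range(1, max_iterations + 1):
        # Dynamic temperature: creative early, conservative in late passes.
        t = start_temp + (end_temp - start_temp) * i / max_iterations
        # Context management: send only the task, the current code, and the
        # newest critique -- no accumulated history, so no context bloat.
        latest_critique = call_model(f"Critique:\n{code}", temperature=0.0)
        code = call_model(f"{task}\n{code}\n{latest_critique}", t)
    return code
```

Whether trimming history and cooling the temperature actually recovers the accuracy lost to context bloat is exactly the open question for the next round of testing.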

Final Thoughts

While recursive refinement is an exciting concept, our testing shows that in its current form, it may not be production-ready for AI code generation.

For now, focused, high-quality single-pass prompting remains the most efficient and effective approach.

Join the Conversation

Have you experimented with iterative AI workflows or recursive prompting strategies?

We’d love to hear what you’re seeing in practice. Share your thoughts and experiences in the comments.