How I Improved AI Code Generation for Go by 6%: Lessons from Fine-Tuning Qwen2.5-14B

How I Trained an AI to Actually Be Better at Go

For enterprise organizations investing in AI, the difference between experimentation and measurable improvement often comes down to discipline in training, data governance, and benchmarking. Fine-tuning models is not just about better outputs; it is about building repeatable, scalable processes that deliver consistent gains.

This is a sequel to my original post, How I Made an AI 100% Worse at Go, where I documented a pretty disastrous first attempt at fine-tuning a model. That experience was packed with hard lessons around data quality, scope, and the importance of starting small.

Well… this time it worked.

Here are the results:

Metric	Qwen2.5-14B (QLoRA)	Qwen2.5-14B (Base)
Problems	164	164
Total Passed	432	334
Overall Pass Rate	26.34%	20.37%
Pass@10	45.12%	31.10%
Time (seconds)	33168.9	90644.8

A 6-percentage-point increase in overall pass rate, and a 14-point increase in Pass@10.

Not massive, but meaningful.
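
For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the overall pass rates from the table. The figure of 10 samples per problem is an assumption inferred from the Pass@10 column; it is consistent with the reported percentages but is not stated outright.

```python
# Figures copied from the results table.
PROBLEMS = 164
SAMPLES_PER_PROBLEM = 10  # assumption: inferred from the Pass@10 column


def overall_pass_rate(total_passed: int) -> float:
    """Fraction of all generated samples that passed their tests."""
    return total_passed / (PROBLEMS * SAMPLES_PER_PROBLEM)


qlora_rate = overall_pass_rate(432)
base_rate = overall_pass_rate(334)

# Absolute gain in percentage points, and gain relative to the base model.
point_gain = (qlora_rate - base_rate) * 100
relative_gain = (qlora_rate - base_rate) / base_rate * 100

print(f"QLoRA: {qlora_rate:.2%}, base: {base_rate:.2%}")
print(f"Gain: {point_gain:.2f} points absolute, {relative_gain:.1f}% relative")
```

The distinction matters when reporting results: about 6 points absolute works out to roughly a 29% relative gain over the base model's pass rate.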

After weeks of training, multiple iterations, and even scrapping an entire run due to improperly licensed data, I finally landed on a model that outperforms the base model. It reinforces something I have been thinking about more lately: there is real value in smaller, specialized “scalpel” models that are highly optimized for a specific task.

What I Learned in the Process

1. Check Your Data

Make sure your training data is properly licensed. Not all datasets allow for model training use, and overlooking this can cost you weeks of work.

I had to throw out a well-performing model because of this exact issue.

It is painful, but it is also a critical reminder. In enterprise environments, this is not just a technical concern; it is a legal and compliance requirement.
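
One way to catch this before training is to gate the dataset on an explicit license allowlist. The following is a rough sketch, not the pipeline used here: the `Sample` fields and the contents of `ALLOWED_LICENSES` are hypothetical placeholders, and which licenses actually permit training use is a question for your legal team.

```python
from dataclasses import dataclass

# Hypothetical allowlist; fill it in from your own legal review.
ALLOWED_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}


@dataclass
class Sample:
    source: str   # where the sample came from (assumed field)
    license: str  # per-sample license tag (assumed field)
    text: str


def split_by_license(samples):
    """Separate samples into (kept, rejected) based on the allowlist.

    Unknown or empty license tags are rejected, so nothing slips
    through just because its metadata is missing.
    """
    kept, rejected = [], []
    for sample in samples:
        tag = sample.license.strip().lower()
        (kept if tag in ALLOWED_LICENSES else rejected).append(sample)
    return kept, rejected


samples = [
    Sample("repo-a", "MIT", "package main"),
    Sample("repo-b", "", "package util"),
    Sample("repo-c", "proprietary", "package secret"),
]
kept, rejected = split_by_license(samples)
print(f"kept {len(kept)}, rejected {len(rejected)}")
```

Rejecting by default is the point of the design: a sample with no license tag is treated the same as one with a disallowed license.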

2. Iterate Quickly

You will not know something is broken until you actually test or benchmark the model.

Start with smaller datasets that train quickly, validate your approach, and then scale. I made the mistake early on of running a large training job with poorly formatted data, which cost me an entire cycle.

Fast iteration is what allows you to fail cheaply and learn quickly.
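
A cheap guard against the formatting mistake described above is to validate every record and run a small trial subset before committing to a long job. This sketch assumes a JSONL dataset with `prompt` and `completion` string fields; that schema is hypothetical and should be swapped for whatever your training format requires.

```python
import json
import random
from typing import Optional

REQUIRED_FIELDS = ("prompt", "completion")  # hypothetical schema


def validate_record(line: str) -> Optional[str]:
    """Return an error message for a bad record, or None if it looks fine."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(record, dict):
        return "record is not a JSON object"
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            return f"missing or empty field: {field}"
    return None


def smoke_subset(lines, size, seed=0):
    """Pick a small, reproducible sample for a fast trial run."""
    rng = random.Random(seed)
    return rng.sample(lines, min(size, len(lines)))


lines = [
    '{"prompt": "Reverse a slice in Go", "completion": "func Reverse..."}',
    '{"prompt": "Sum a slice in Go", "completion": ""}',
    '{oops',
]
errors = {i: err for i, line in enumerate(lines) if (err := validate_record(line))}
print(f"{len(errors)} of {len(lines)} records failed validation")
```

Running the validator over the full dataset takes seconds, while discovering the same problem after a full training cycle takes days.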

3. Automate Everything

This is not a one-and-done process. You will tweak parameters, adjust datasets, and rerun benchmarks constantly.

Automating the full pipeline, from data ingestion to fine-tuning to benchmarking, makes this manageable. It also removes the friction of manually running complex commands every time you want to test a change.

Version control is equally important. Commit after every meaningful change so you can track what actually moved the needle.
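
As an illustration of what "automate everything" can look like, here is a small sketch of a pipeline runner that tags each run with a hash of its configuration and the current git commit. The stage names and config keys are made up for the example; real stages would call your own ingestion, training, and benchmarking tools.

```python
import hashlib
import json
import subprocess


def config_fingerprint(config: dict) -> str:
    """Stable short hash of a config, so identical settings get identical IDs."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def current_commit() -> str:
    """Return the current git commit hash, or 'unknown' outside a repo."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def run_pipeline(config: dict, stages) -> dict:
    """Run each (name, function) stage in order and return a run record."""
    record = {
        "config_id": config_fingerprint(config),
        "commit": current_commit(),
        "stages": [],
    }
    artifact = None
    for name, stage in stages:
        # Each stage receives the config and the previous stage's output.
        artifact = stage(config, artifact)
        record["stages"].append(name)
    record["result"] = artifact
    return record


# Placeholder stages standing in for real ingestion, training, and benchmarking.
stages = [
    ("ingest", lambda cfg, _: ["sample"] * cfg["dataset_size"]),
    ("fine_tune", lambda cfg, data: {"trained_on": len(data)}),
    ("benchmark", lambda cfg, model: {"evaluated": True, **model}),
]
record = run_pipeline({"dataset_size": 3, "epochs": 1}, stages)
print(json.dumps(record, indent=2))
```

Storing the config ID and commit alongside every benchmark result makes it possible to answer "what changed between these two runs?" without relying on memory.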

4. Be Thorough and Consistent with Benchmarking

Benchmark everything.

Benchmark the base model. Benchmark every iteration. Use multiple evaluation methods. The more data you collect, the clearer your progress becomes.

Consistency matters just as much as depth. If you are not testing under the same conditions each time, your results will not be reliable.

For example, do not fine-tune in one environment and benchmark in another with different configurations. That introduces variability that makes your results harder to trust.
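
To make "same conditions each time" concrete, it helps to score every model with one shared function. The sketch below uses the unbiased pass@k estimator that is common in code-generation benchmarks; this post does not say which formula its benchmark harness uses, so treat the estimator choice and the function names as assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_summary(passes_per_problem, n: int, k: int) -> dict:
    """Score one model: overall pass rate plus mean pass@k across problems."""
    problems = len(passes_per_problem)
    return {
        "overall_pass_rate": sum(passes_per_problem) / (n * problems),
        f"pass@{k}": sum(pass_at_k(n, c, k) for c in passes_per_problem) / problems,
    }


# Toy data: number of passing samples (out of 10) for three problems.
summary = benchmark_summary([0, 5, 10], n=10, k=10)
print(summary)
```

Calling the same `benchmark_summary` with the same `n`, `k`, and sampling settings for the base model and for every fine-tuned iteration keeps the numbers comparable across runs.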

Why This Matters for Enterprise AI

A 6-percentage-point improvement might not sound dramatic, but at scale, incremental gains compound quickly. For enterprise organizations, this is the difference between AI that is “interesting” and AI that is operationally valuable.

This process highlights a few broader takeaways:

Specialized models can outperform general-purpose ones in targeted domains

Governance around data and licensing is just as important as model performance

Repeatable pipelines and automation are critical for scaling AI initiatives

Measurement frameworks are required to prove ROI and guide investment

Final Thoughts

I am curious to hear how others are approaching fine-tuning. Have you seen meaningful gains, or run into similar challenges?

At RBA, we help organizations move beyond experimentation into structured, scalable AI implementation. From model strategy and governance to benchmarking and optimization, our focus is on making AI work in real business environments, not just in isolated demos.

If you are exploring how to operationalize AI or improve model performance in your organization, let’s connect.

About the Author

Robby Sarvis

Senior Software Engineer

Robby is a full-stack developer at RBA with a deep passion for crafting mobile applications and enhancing user experiences. With a robust skill set that encompasses both front-end and back-end development, Robby is dedicated to leveraging technology to create solutions that exceed client expectations.

Residing in a small town in Texas, Robby enjoys a balanced life that includes his wife, children, and their charming dogs.
