The Results

Let me start with the punchline:

Metric              Base Phi-3-Mini   After 60 Hours Training   Delta
Pass@10             100%              0%                        -100%
Overall Pass Rate   60%               0%                        -60%
Total Solutions     18/30             0/30                      -18

I did not just fail to improve the model. I took a working Go code generator and systematically destroyed its ability to write even basic functions.

Before anyone panics, this experiment ran entirely on hardware I owned. The only real costs were electricity and a direct hit to my ego.

That is fine though. There is a lot to learn from this, and that is the real value of the experiment.

What Was I Actually Trying to Do?

The original goal was straightforward: test whether a small, domain-specific fine-tuned model could compete with, or even outperform, a larger generalist model for Go code generation.

Could a focused 3B parameter model hold its own against a much larger LLM when trained correctly on Go? That question still matters, and I am not done answering it.

What I did discover instead was a hands-on lesson in the most common and costly mistakes engineers make when attempting to fine-tune their own models.

So What Went Wrong?

Below are the major mistakes I made, along with the lessons learned from each.

First Mistake: Training Data Format Mismatch

I cannot stress this enough: high-quality data that matches the expected format is critical.

The goal was to build a Go expert model. To do that, I assembled a dataset from the top 100 Go repositories on GitHub by stars, plus several hand-selected core Go projects. That part was easy.

I thought my dataset was ready to go.

The problem was the format.

Training Data Format:

// File: golang/go/lib/time/mkzip.go
// Copyright 2022 The Go Authors…
package main

import (
    "archive/zip"
    "bytes"
)

func main() {
    // … implementation
}

However, the model I was fine-tuning expected something very different.

<|user|>Write a Go function to reverse a string<|end|>
<|assistant|>
func ReverseString(s string) string {
    runes := []rune(s)
    for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
        runes[i], runes[j] = runes[j], runes[i]
    }
    return string(runes)
}
<|end|>

These two formats are fundamentally incompatible.

My intention was to train an instruction-tuned chat model that could respond to prompts like, “Write correct Go code to solve this problem.” Instead, I effectively trained a completion model, and not a good one.

As a result, while the model could occasionally produce something resembling code completion, it failed badly on structured evaluations like HumanEval.
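For contrast, here is a minimal sketch of what wrapping training examples into the chat format the model actually expects could look like. The tag strings mirror the chat example above; the function name and prompt text are hypothetical, and real preprocessing would also need to derive a meaningful prompt for each code snippet, which is the hard part:

```go
package main

import "fmt"

// FormatExample wraps a prompt/solution pair in the Phi-3-style chat
// template shown above, so the fine-tune matches the base model's
// instruction format instead of raw file completion.
func FormatExample(prompt, solution string) string {
	return fmt.Sprintf("<|user|>%s<|end|>\n<|assistant|>\n%s\n<|end|>", prompt, solution)
}

func main() {
	fmt.Print(FormatExample(
		"Write a Go function that adds two integers",
		"func Add(a, b int) int { return a + b }",
	))
}
```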

That was mistake number one.

Second Mistake: Training Far Too Long Without Validation

Once I had the dataset and training script in place, I let the model train on all 380,000 examples for 95,000 steps.

Yes, that is thousands.

I spent roughly 60 hours of compute time before running even a basic output test.

That was a painful lesson.

What Should I Have Done Instead?

I should have started with a much smaller subset of data and far fewer training steps. A short training run would have surfaced the issues quickly and saved days of wasted compute.

Early validation is not optional. It is essential.
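As a concrete example, even a tiny syntax-only check built on the standard library's go/parser would have flagged the broken outputs within minutes. This is a sketch of the idea, not the harness I actually used; catching undefined identifiers as well would need go/types or a real go build:

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
)

// parses reports whether src is syntactically valid Go. It catches the
// outright syntax errors the fine-tuned model produced.
func parses(src string) bool {
	fset := token.NewFileSet()
	_, err := parser.ParseFile(fset, "gen.go", src, 0)
	return err == nil
}

func main() {
	good := "package main\nfunc Add(a, b int) int { return a + b }"
	bad := "package main\nfunc Broken( { }"
	fmt.Println(parses(good), parses(bad)) // true false
}
```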

Third Mistake: Trusting the Loss Curve

During training, the loss curve looked excellent. Smooth, stable, and well-behaved.

The outputs were still garbage.

A clean loss curve only tells you the model is learning something. It does not tell you whether it is learning the right thing.

Good training metrics do not guarantee useful results.

Fourth Mistake: Assuming All LLMs Are the Same

I treated the base model as if “an LLM is an LLM is an LLM.”

That assumption is wrong, especially for smaller models.

Instruction-tuned chat models and completion-based models behave very differently. Fine-tuning a model against its original training objective makes the task significantly harder and often counterproductive.

In this case, I was fine-tuning directly against the grain.

Fifth Mistake: More Data Is Not Always Better

I assumed that throwing more data at the model would naturally lead to better results.

That only holds true if the data is correct and well-aligned.

Here, it was not.

Three hundred eighty thousand examples in the wrong format are far worse than one thousand examples in the right format. While this overlaps with the first mistake, it is worth reinforcing: dataset quality and fit matter more than sheer volume.

What The Model Actually Learned

Looking at the outputs, the fine-tuned model consistently learned the wrong patterns.

Pattern 1: Referencing Undefined Variables

func IsPrime(n int) bool {
    return primes[n%12] == 1  // 'primes' doesn't exist
}

The model learned to reference data structures it had seen in context, without learning to define or initialize them first.
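For reference, a correct version defines everything it needs instead of leaning on phantom context. A standard trial-division sketch:

```go
package main

import "fmt"

// IsPrime reports whether n is prime, using trial division up to the
// square root of n. Unlike the model's output, it defines all of its
// own state.
func IsPrime(n int) bool {
	if n < 2 {
		return false
	}
	for i := 2; i*i <= n; i++ {
		if n%i == 0 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(IsPrime(13), IsPrime(12)) // true false
}
```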

Pattern 2: Generating Plausible but Broken Code

func ReverseString(s string) string {
    return s[len(s)-1:] + s[0:len(s)-1]  // rotates by 1, doesn't reverse
}

The syntax was correct. The logic was not.

It learned the shape of string manipulation without understanding the behavior behind it.
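This is exactly why an evaluation harness has to check behavior, not just syntax. Reproducing the model's version under a hypothetical name, a one-line behavioral check exposes the bug that a compile check would happily pass:

```go
package main

import "fmt"

// buggyReverse reproduces the model's output: it moves the last byte to
// the front, rotating the string by one rather than reversing it.
func buggyReverse(s string) string {
	return s[len(s)-1:] + s[:len(s)-1]
}

func main() {
	got := buggyReverse("hello")
	fmt.Println(got, got == "olleh") // ohell false
}
```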

Pattern 3: Giving Up With Comments

func Factorial(n int) int {
    } // factorial(n) = 1 * 2 * … * n
}

In some cases, it generated literal syntax errors alongside comments explaining what the function should do. It felt like a student who does not know the answer but hopes for partial credit.
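Ironically, the comment it wrote describes the fix. A correct iterative version is only a few lines:

```go
package main

import "fmt"

// Factorial returns n! = 1 * 2 * ... * n, with Factorial(0) == 1.
func Factorial(n int) int {
	result := 1
	for i := 2; i <= n; i++ {
		result *= i
	}
	return result
}

func main() {
	fmt.Println(Factorial(5)) // 120
}
```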

What I Actually Learned

  1. Data format is everything: More important than model size, learning rate, or training time
  2. Validate early and often: A five-minute test can save sixty hours of compute
  3. Loss curves can be misleading: Low loss does not equal useful outputs
  4. Know your base model: Match its training objective or expect degraded results
  5. Test the hypothesis you care about: Not just the metrics that are easiest to measure

 

What I Gained from the Experiment

  • ✅ A complete end-to-end training pipeline on a single GPU
  • ✅ A working evaluation harness for Go code generation
  • ✅ A deep understanding of instruction tuning versus completion models
  • ✅ Practical, battle-tested knowledge of what not to do
  • ✅ A genuinely strong story for future technical interviews

Conclusion

Have you run into these pitfalls while training a model? It feels like a rite of passage, even if it stings when so much time and effort ends in failure.

The good news is that this experiment was still a success in the ways that matter most. The next iteration of the Go expert model is already underway, this time with the right data, validation strategy, and expectations in place.

Stay tuned.

About the Author

Robby Sarvis

Senior Software Engineer

Robby is a full-stack developer at RBA with a deep passion for crafting mobile applications and enhancing user experiences. With a robust skill set that encompasses both front-end and back-end development, Robby is dedicated to leveraging technology to create solutions that exceed client expectations.

Residing in a small town in Texas, Robby enjoys a balanced life that includes his wife, children, and their charming dogs.