The Problem: Good Data Takes Time (A Lot of It)
If you’ve read my previous post about accidentally making an AI worse at Go, you already know where this is going. Good data matters. A lot.
As part of fixing that mistake, I needed to score every function in a dataset of roughly 600,000–700,000 entries. And while I’d like to think I’m a better judge of code quality than a model (jury’s still out), I definitely don’t have the time to manually review hundreds of thousands of functions.
So naturally, I followed best practices: I had the AI handle it.
The Bottleneck: Inference Speed on a Single Machine
I kicked things off using a local inference server running on an aging but reliable Nvidia P40. With a Qwen2.5 Coder 7B model, I was processing functions at a rate of about 0.6 functions per second.
Not great.
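For context, each unit of work here is just one HTTP call to the local Ollama server. Here's a minimal sketch of what a single scoring call looks like, assuming the default Ollama port and a placeholder prompt (my actual rubric and script differ):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def score_function(source: str) -> str:
    # Placeholder rubric; the real prompt spells out the scoring criteria.
    prompt = (
        "Rate the quality of the following function from 1 to 10. "
        "Reply with only the number.\n\n" + source
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "qwen2.5-coder:7b", "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()
```

One call per function, at roughly 1.7 seconds each, is where that ~0.6 functions/sec figure comes from.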
My first instinct was to parallelize. Ollama handles requests one at a time by default, but running two workers against it still squeezed out a slight improvement to ~0.64 functions per second, presumably because the second worker keeps the next request queued while the first response is still coming back.
Still not enough. At that rate, I was staring down hundreds of hours of processing time.
The Breakthrough: A Distributed Ollama Cluster
Then came the obvious-in-hindsight realization: I had more machines.
Two MacBook Pros were sitting idle. Not nearly as powerful as the P40 individually, but collectively? That’s a different story.
I updated the script (with a little help from Claude) to distribute inference jobs across all three machines using Ollama endpoints.
And just like that, I had a distributed Ollama cluster.
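The change itself is small: keep a list of Ollama endpoints and give each machine a worker that pulls from a shared job queue. The sketch below captures the idea rather than reproducing the actual script (endpoint addresses, model tag, and prompt are placeholders; the real script is linked further down):

```python
import queue
import threading
import requests

# One entry per machine running `ollama serve`; these addresses are examples.
ENDPOINTS = [
    "http://192.168.1.10:11434",  # P40 server
    "http://192.168.1.11:11434",  # MacBook Pro 1
    "http://192.168.1.12:11434",  # MacBook Pro 2
]

def score_on(endpoint: str, source: str) -> str:
    """Send one scoring request to a specific Ollama endpoint."""
    resp = requests.post(
        f"{endpoint}/api/generate",
        json={
            "model": "qwen2.5-coder:7b",
            "prompt": "Rate this function from 1 to 10. Reply with only the number.\n\n" + source,
            "stream": False,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

def worker(endpoint: str, jobs: queue.Queue, results: list) -> None:
    """One worker per machine; each drains the shared job queue at its own pace."""
    while True:
        try:
            idx, source = jobs.get_nowait()
        except queue.Empty:
            return
        results[idx] = score_on(endpoint, source)

def score_all(functions: list) -> list:
    jobs = queue.Queue()
    for item in enumerate(functions):
        jobs.put(item)
    results = [None] * len(functions)
    threads = [threading.Thread(target=worker, args=(ep, jobs, results)) for ep in ENDPOINTS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Pulling from a shared queue instead of splitting the dataset evenly means the faster P40 naturally takes on more of the work than the MacBooks, so no machine sits idle waiting for the others.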
The Results: Doubling Throughput Without New Hardware
The impact was immediate:
- Before: ~0.6 functions/sec
- After (distributed): 1.2+ functions/sec
That effectively cut total runtime in half. A 700-hour job became a 350-hour job.
Is that still long? Yes.
Is it a meaningful improvement with zero hardware investment? Also yes.
Tradeoffs and Constraints
This setup isn’t perfect, and it’s worth calling out the limitations:
- Hardware constraints matter
The MacBook Pros (16 GB of RAM each) limit model size and context window.
- Best suited for “one-shot” tasks
This approach works well for stateless workloads like scoring, but not for more complex multi-step inference pipelines.
- Throughput is additive, not exponential
You’re still bound by the capabilities of each node, but you can stack incremental gains.
For my use case (function scoring), these tradeoffs were completely acceptable.
Implementation: Simple, Practical, Effective
What stood out most was how easy this was to implement. Ollama abstracts away much of the complexity typically associated with distributed systems.
If you can:
- Run Ollama locally
- Expose those endpoints across machines (see the note after this list)
- Distribute jobs via a script
You can build a lightweight inference cluster.
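The middle step is the only one with a gotcha: Ollama binds to localhost by default, so each worker machine needs OLLAMA_HOST set so the server listens on the network (the Ollama FAQ covers the platform-specific way to set it). Once that's done, a quick reachability check from the coordinating machine looks something like this (addresses are examples):

```python
import requests

# Example addresses; replace with the machines on your network.
ENDPOINTS = [
    "http://192.168.1.10:11434",
    "http://192.168.1.11:11434",
    "http://192.168.1.12:11434",
]

def reachable(endpoint: str) -> bool:
    """A reachable Ollama server answers its version endpoint."""
    try:
        return requests.get(f"{endpoint}/api/version", timeout=5).ok
    except requests.RequestException:
        return False

for ep in ENDPOINTS:
    print(ep, "ok" if reachable(ep) else "unreachable")
```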
For those interested, the script is available here:
https://gist.github.com/rsarv3006/e5cbba4e00799fd399adb940fb2ff2e1
What’s Next: Distributed Benchmarking and Beyond
This approach opened up more than just scoring improvements.
I’m now working on a distributed benchmarking script using the same architecture. Since Ollama makes it easy to pull models straight from Hugging Face (a sketch follows this list), I can:
- Upload fine-tuned models
- Distribute benchmarking jobs across machines
- Reduce multi-day CPU workloads down to something closer to a day
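Getting a fine-tuned model onto every node is one pull per machine, and that pull can be scripted the same way the scoring jobs are. A rough sketch, assuming the model has been pushed to Hugging Face as a GGUF repo (the repo name below is a placeholder):

```python
import requests

ENDPOINTS = [
    "http://192.168.1.10:11434",
    "http://192.168.1.11:11434",
    "http://192.168.1.12:11434",
]

# Placeholder repo; Ollama can pull GGUF models from Hugging Face
# using the hf.co/<user>/<repo> naming scheme.
MODEL = "hf.co/your-username/your-finetuned-model-gguf"

for ep in ENDPOINTS:
    # With stream disabled, the request returns once the pull finishes on that node.
    resp = requests.post(f"{ep}/api/pull", json={"model": MODEL, "stream": False}, timeout=3600)
    resp.raise_for_status()
    print(ep, "pulled", MODEL)
```

After that, the benchmarking workers can reference the same model name on every endpoint.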
At the same time, I can run training workloads on my primary server while secondary machines handle evaluation tasks. That kind of parallelism is where this really starts to pay off.
Where This Goes: Future Opportunities for Distributed Local AI
This experiment has definitely sparked a few ideas:
- Distributed speculative decoding
- Multi-node fine-tuning workflows
- Local AI pipelines that scale horizontally instead of vertically
- Cost-efficient alternatives to cloud-based inference
The big takeaway: you don’t always need better hardware. Sometimes you just need to use what you already have more effectively.
Final Thoughts
This wasn’t about building a perfect distributed system. It was about solving a real bottleneck with minimal friction.
And in that sense, it worked exactly as intended.
If you’re running into limits with local AI workloads, it might be worth asking:
Do you actually need a bigger machine, or just more machines?
About the Author
Robby Sarvis
Senior Software Engineer
Robby is a full-stack developer at RBA with a deep passion for crafting mobile applications and enhancing user experiences. With a robust skill set that encompasses both front-end and back-end development, Robby is dedicated to leveraging technology to create solutions that exceed client expectations.
Residing in a small town in Texas, Robby enjoys a balanced life that includes his wife, children, and their charming dogs.