RBA Consulting

The Problem: Good Data Takes Time (A Lot of It) 

If you’ve read my previous post about accidentally making an AI worse at Go, you already know where this is going. Good data matters. A lot. 

As part of fixing that mistake, I needed to score every function in a dataset of roughly 600,000–700,000 entries. And while I’d like to think I’m a better judge of code quality than a model (jury’s still out), I definitely don’t have the time to manually review hundreds of thousands of functions. 

So naturally, I followed best practices: I had the AI handle it. 

The Bottleneck: Inference Speed on a Single Machine 

I kicked things off using a local inference server running on an aging but reliable Nvidia P40. With a Qwen2.5 Coder 7B model, I was processing functions at a rate of about 0.6 functions per second. 

Not great. 

My first instinct was to parallelize. Ollama serves inference requests sequentially by default, but I was able to run two workers and squeeze out a slight improvement to ~0.64 functions per second. 
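The two-worker setup is a few lines with a thread pool. This is a rough sketch, not the actual script; `score_one` here is a stand-in for whatever function posts a single prompt to the local Ollama endpoint:

```python
from concurrent.futures import ThreadPoolExecutor


def score_in_parallel(functions, score_one, workers=2):
    """Fan scoring jobs out to a small pool of worker threads.

    Each worker blocks on its own HTTP request, so the second worker
    can keep a request queued while the first is still generating.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_one, functions))
```

Because each call spends most of its time waiting on the inference server, threads (rather than processes) are enough here.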

Still not enough. At that rate, I was staring down hundreds of hours of processing time. 

The Breakthrough: A Distributed Ollama Cluster 

Then came the obvious-in-hindsight realization: I had more machines. 

Two MacBook Pros were sitting idle. Not nearly as powerful as the P40 individually, but collectively? That’s a different story. 

I updated the script (with a little help from Claude) to distribute inference jobs across all three machines using Ollama endpoints. 

And just like that, I had a distributed Ollama cluster. 
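The distribution logic itself is small. Here's a minimal sketch of the idea, assuming a round-robin assignment of jobs to nodes; the IP addresses and the scoring prompt are placeholders, not the actual script (which is linked below):

```python
import itertools
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical addresses: the P40 server plus the two MacBook Pros.
ENDPOINTS = [
    "http://192.168.1.10:11434",
    "http://192.168.1.11:11434",
    "http://192.168.1.12:11434",
]


def assign_endpoints(jobs, endpoints):
    """Round-robin each job onto an endpoint."""
    return list(zip(jobs, itertools.cycle(endpoints)))


def score_function(source, endpoint, model="qwen2.5-coder:7b"):
    """Ask one Ollama node to score a single function (blocking call)."""
    payload = json.dumps({
        "model": model,
        "prompt": f"Score this function's quality from 1 to 10:\n\n{source}",
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{endpoint}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]


def score_all(functions):
    # One worker thread per endpoint keeps every node busy.
    pairs = assign_endpoints(functions, ENDPOINTS)
    with ThreadPoolExecutor(max_workers=len(ENDPOINTS)) as pool:
        return list(pool.map(lambda p: score_function(p[0], p[1]), pairs))
```

A smarter scheduler would weight jobs by node speed (the P40 and the MacBooks aren't equal), but for stateless one-shot scoring, plain round-robin gets most of the benefit.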

The Results: Doubling Throughput Without New Hardware 

The impact was immediate: 

  • Before: ~0.6 functions/sec 
  • After (distributed): 1.2+ functions/sec 

That effectively cut total runtime in half. A 700-hour job became a 350-hour job. 

Is that still long? Yes. 

Is it a meaningful improvement with zero hardware investment? Also yes. 

Tradeoffs and Constraints 

This setup isn’t perfect, and it’s worth calling out the limitations: 

  • Hardware constraints matter 

The MacBook Pros (16GB RAM) limit model size and context window. 

  • Best suited for “one-shot” tasks 

This approach works well for stateless workloads like scoring, but not for more complex multi-step inference pipelines. 

  • Throughput is additive, not exponential 

You’re still bound by the capabilities of each node, but you can stack incremental gains. 

For my use case (function scoring), these tradeoffs were completely acceptable. 

Implementation: Simple, Practical, Effective 

What stood out most was how easy this was to implement. Ollama abstracts away much of the complexity typically associated with distributed systems. 

If you can: 

  • Run Ollama locally 
  • Expose endpoints across machines 
  • Distribute jobs via a script 

You can build a lightweight inference cluster. 
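Exposing an endpoint is mostly a matter of binding Ollama to a non-loopback address, since it listens on localhost only by default. A rough setup sketch (the IP address is hypothetical):

```shell
# On each secondary machine: bind Ollama to all interfaces so other
# machines on the network can reach it.
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Pull the model on each node ahead of time so the first request
# doesn't trigger a download mid-run.
ollama pull qwen2.5-coder:7b

# From the coordinating machine: verify each node is reachable.
curl http://192.168.1.11:11434/api/tags
```

Note that binding to `0.0.0.0` exposes the server to anything on your network, so this is best kept to a trusted LAN.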

For those interested, the script is available here: 

https://gist.github.com/rsarv3006/e5cbba4e00799fd399adb940fb2ff2e1 

What’s Next: Distributed Benchmarking and Beyond 

This approach opened up more than just scoring improvements. 

I’m now working on a distributed benchmarking script using the same architecture. Since Ollama makes it easy to pull models from Hugging Face, I can: 

  • Upload fine-tuned models 
  • Distribute benchmarking jobs across machines 
  • Cut multi-day CPU workloads to something closer to a day 
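Ollama can pull GGUF models directly from Hugging Face using the `hf.co/` prefix, which is what makes the upload-then-distribute loop practical. For example (the repository name below is a hypothetical placeholder):

```shell
# Pull a fine-tuned GGUF model straight from a Hugging Face repo.
ollama pull hf.co/your-username/your-finetuned-model-gguf

# Run benchmark prompts against it like any other local model.
ollama run hf.co/your-username/your-finetuned-model-gguf "your prompt here"
```

Each node in the cluster pulls the same model once, then the benchmarking script fans evaluation jobs out exactly as the scoring script does.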

At the same time, I can run training workloads on my primary server while secondary machines handle evaluation tasks. That kind of parallelism is where this really starts to pay off. 

Where This Goes: Future Opportunities for Distributed Local AI 

This experiment has definitely sparked a few ideas: 

  • Distributed speculative decoding 
  • Multi-node fine-tuning workflows 
  • Local AI pipelines that scale horizontally instead of vertically 
  • Cost-efficient alternatives to cloud-based inference 

The big takeaway: you don’t always need better hardware. Sometimes you just need to use what you already have more effectively. 

Final Thoughts 

This wasn’t about building a perfect distributed system. It was about solving a real bottleneck with minimal friction. 

And in that sense, it worked exactly as intended. 

If you’re running into limits with local AI workloads, it might be worth asking: 

Do you actually need a bigger machine, or just more machines? 

About the Author

Robby Sarvis

Senior Software Engineer

Robby is a full-stack developer at RBA with a deep passion for crafting mobile applications and enhancing user experiences. With a robust skill set that encompasses both front-end and back-end development, Robby is dedicated to leveraging technology to create solutions that exceed client expectations.

Residing in a small town in Texas, Robby enjoys a balanced life that includes his wife, children, and their charming dogs.