The Problem: Good Data Takes Time (A Lot of It)
If you’ve read my previous post about accidentally making an AI worse at Go, you already know where this is going. Good data matters. A lot.
As part of fixing that mistake, I needed to score every function in a dataset of roughly 600,000–700,000 entries. And while I’d like to think I’m a better judge of code quality than a model (jury’s still out), I definitely don’t have the time to manually review hundreds of thousands of functions.
So naturally, I followed best practices: I had the AI handle it.
The Bottleneck: Inference Speed on a Single Machine
I kicked things off using a local inference server running on an aging but reliable Nvidia P40. With a Qwen2.5 Coder 7B model, I was processing functions at a rate of about 0.6 functions per second.
Not great.
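For context, each unit of work here is just one HTTP call to the local Ollama server. Here's a minimal sketch of what a single scoring call looks like, assuming the default Ollama port and a placeholder prompt (my actual rubric and script differ):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def score_function(source: str) -> str:
    # Placeholder rubric; the real prompt spells out the scoring criteria.
    prompt = (
        "Rate the quality of the following function from 1 to 10. "
        "Reply with only the number.\n\n" + source
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "qwen2.5-coder:7b", "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()
```

One call per function, at roughly 1.7 seconds each, is where that ~0.6 functions/sec figure comes from.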
My first instinct was to parallelize. Ollama handles requests one at a time by default, but running two workers against it still squeezed out a slight improvement to ~0.64 functions per second, presumably because the second worker keeps the next request queued while the first response is still coming back.
Still not enough. At that rate, I was staring down hundreds of hours of processing time.
The Breakthrough: A Distributed Ollama Cluster
Then came the obvious-in-hindsight realization: I had more machines.
Two MacBook Pros were sitting idle. Not nearly as powerful as the P40 individually, but collectively? That’s a different story.
I updated the script (with a little help from Claude) to distribute inference jobs across all three machines using Ollama endpoints.
And just like that, I had a distributed Ollama cluster.
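The change itself is small: keep a list of Ollama endpoints and give each machine a worker that pulls from a shared job queue. The sketch below captures the idea rather than reproducing the actual script (endpoint addresses, model tag, and prompt are placeholders; the real script is linked further down):

```python
import queue
import threading
import requests

# One entry per machine running `ollama serve`; these addresses are examples.
ENDPOINTS = [
    "http://192.168.1.10:11434",  # P40 server
    "http://192.168.1.11:11434",  # MacBook Pro 1
    "http://192.168.1.12:11434",  # MacBook Pro 2
]

def score_on(endpoint: str, source: str) -> str:
    """Send one scoring request to a specific Ollama endpoint."""
    resp = requests.post(
        f"{endpoint}/api/generate",
        json={
            "model": "qwen2.5-coder:7b",
            "prompt": "Rate this function from 1 to 10. Reply with only the number.\n\n" + source,
            "stream": False,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

def worker(endpoint: str, jobs: queue.Queue, results: list) -> None:
    """One worker per machine; each drains the shared job queue at its own pace."""
    while True:
        try:
            idx, source = jobs.get_nowait()
        except queue.Empty:
            return
        results[idx] = score_on(endpoint, source)

def score_all(functions: list) -> list:
    jobs = queue.Queue()
    for item in enumerate(functions):
        jobs.put(item)
    results = [None] * len(functions)
    threads = [threading.Thread(target=worker, args=(ep, jobs, results)) for ep in ENDPOINTS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Pulling from a shared queue instead of splitting the dataset evenly means the faster P40 naturally takes on more of the work than the MacBooks, so no machine sits idle waiting for the others.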
The Results: Doubling Throughput Without New Hardware
The impact was immediate:
- Before: ~0.6 functions/sec
- After (distributed): 1.2+ functions/sec
That effectively cut total runtime in half. A 700-hour job became a 350-hour job.
Is that still long? Yes.
Is it a meaningful improvement with zero hardware investment? Also yes.
Tradeoffs and Constraints
This setup isn’t perfect, and it’s worth calling out the limitations:
- Hardware constraints matter
The MacBook Pros (16 GB of RAM each) limit model size and context window.
- Best suited for “one-shot” tasks
This approach works well for stateless workloads like scoring, but not for more complex multi-step inference pipelines.
- Throughput is additive, not exponential
You’re still bound by the capabilities of each node, but you can stack incremental gains.
For my use case (function scoring), these tradeoffs were completely acceptable.
Implementation: Simple, Practical, Effective
What stood out most was how easy this was to implement. Ollama abstracts away much of the complexity typically associated with distributed systems.
If you can:
- Run Ollama locally
- Expose those endpoints across machines (see the note after this list)
- Distribute jobs via a script
You can build a lightweight inference cluster.
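The middle step is the only one with a gotcha: Ollama binds to localhost by default, so each worker machine needs OLLAMA_HOST set so the server listens on the network (the Ollama FAQ covers the platform-specific way to set it). Once that's done, a quick reachability check from the coordinating machine looks something like this (addresses are examples):

```python
import requests

# Example addresses; replace with the machines on your network.
ENDPOINTS = [
    "http://192.168.1.10:11434",
    "http://192.168.1.11:11434",
    "http://192.168.1.12:11434",
]

def reachable(endpoint: str) -> bool:
    """A reachable Ollama server answers its version endpoint."""
    try:
        return requests.get(f"{endpoint}/api/version", timeout=5).ok
    except requests.RequestException:
        return False

for ep in ENDPOINTS:
    print(ep, "ok" if reachable(ep) else "unreachable")
```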
For those interested, the script is available here:
https://gist.github.com/rsarv3006/e5cbba4e00799fd399adb940fb2ff2e1
What’s Next: Distributed Benchmarking and Beyond
This approach opened up more than just scoring improvements.
I’m now working on a distributed benchmarking script using the same architecture. Since Ollama makes it easy to pull models straight from Hugging Face (a sketch follows this list), I can:
- Upload fine-tuned models
- Distribute benchmarking jobs across machines
- Reduce multi-day CPU workloads down to something closer to a day
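Getting a fine-tuned model onto every node is one pull per machine, and that pull can be scripted the same way the scoring jobs are. A rough sketch, assuming the model has been pushed to Hugging Face as a GGUF repo (the repo name below is a placeholder):

```python
import requests

ENDPOINTS = [
    "http://192.168.1.10:11434",
    "http://192.168.1.11:11434",
    "http://192.168.1.12:11434",
]

# Placeholder repo; Ollama can pull GGUF models from Hugging Face
# using the hf.co/<user>/<repo> naming scheme.
MODEL = "hf.co/your-username/your-finetuned-model-gguf"

for ep in ENDPOINTS:
    # With stream disabled, the request returns once the pull finishes on that node.
    resp = requests.post(f"{ep}/api/pull", json={"model": MODEL, "stream": False}, timeout=3600)
    resp.raise_for_status()
    print(ep, "pulled", MODEL)
```

After that, the benchmarking workers can reference the same model name on every endpoint.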
At the same time, I can run training workloads on my primary server while secondary machines handle evaluation tasks. That kind of parallelism is where this really starts to pay off.
Where This Goes: Future Opportunities for Distributed Local AI
This experiment has definitely sparked a few ideas:
- Distributed speculative decoding
- Multi-node fine-tuning workflows
- Local AI pipelines that scale horizontally instead of vertically
- Cost-efficient alternatives to cloud-based inference
The big takeaway: you don’t always need better hardware. Sometimes you just need to use what you already have more effectively.
Final Thoughts
This wasn’t about building a perfect distributed system. It was about solving a real bottleneck with minimal friction.
And in that sense, it worked exactly as intended.
If you’re running into limits with local AI workloads, it might be worth asking:
Do you actually need a bigger machine, or just more machines?
About the Author
Robby Sarvis
Senior Software Engineer
Robby is a full-stack developer at RBA with a deep passion for crafting mobile applications and enhancing user experiences. With a robust skill set that encompasses both front-end and back-end development, Robby is dedicated to leveraging technology to create solutions that exceed client expectations.
Residing in a small town in Texas, Robby enjoys a balanced life that includes his wife, children, and their charming dogs.