March 22, 2026
Local LLM Inference with Ollama: What's Actually Usable
My experience running Llama 3.1 locally on an M3 Max. What works, what doesn't, and why I'm still using cloud APIs for production.
6 min read
I’ve been experimenting with local inference for the past two months. The goal: reduce API costs and latency for tasks that don’t require GPT-4 level reasoning. Here’s what I learned.
The setup
- Hardware: MacBook Pro M3 Max, 36GB unified memory
- Model: Llama 3.1 8B via Ollama
- Use case: summarization, basic extraction, draft generation
I chose Ollama because it’s dead simple to install and has good library support in Python. The model downloads in a few minutes and runs without any GPU configuration.
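For reference, a basic call through the ollama Python package looks roughly like the sketch below. It assumes a recent version of the package; the model tag, system prompt, and input file name are placeholders of mine, not a recommended setup.

```python
# Minimal sketch: summarizing a document through the ollama Python package.
# Assumes `pip install ollama` and `ollama pull llama3.1` have already been run;
# the system prompt and the input file name are illustrative placeholders.
import ollama

document = open("quarterly_report.txt").read()  # hypothetical input file

response = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "Summarize the document in 3-5 bullet points."},
        {"role": "user", "content": document},
    ],
)
print(response.message.content)
```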
What works well
1. Summarization
For summarizing 2-3 page documents, Llama 3.1 8B is indistinguishable from GPT-4o-mini. I ran a blind test with 10 documents and couldn’t reliably tell which model produced which summary.
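If you want to repeat the comparison, the blind test is easy to script once both sets of summaries exist. A rough sketch, with placeholder file names:

```python
# Rough sketch of the blind comparison. Assumes both sets of summaries were
# generated ahead of time and saved in the same document order; the JSON file
# names are placeholders.
import json
import random

local = json.load(open("summaries_llama.json"))        # list of strings
cloud = json.load(open("summaries_gpt4o_mini.json"))   # same docs, same order

correct = 0
for a, b in zip(local, cloud):
    pair = [("local", a), ("cloud", b)]
    random.shuffle(pair)  # hide which model produced which summary
    print("\n--- A ---\n", pair[0][1], "\n--- B ---\n", pair[1][1])
    guess = input("Which is the local model (A/B)? ").strip().upper()
    correct += (guess == "A") == (pair[0][0] == "local")

print(f"Guessed the local model correctly {correct}/{len(local)} times")
```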
2. Code explanation
The model is surprisingly good at explaining what a function does. It won’t write complex algorithms from scratch, but for documentation-level explanations, it’s sufficient.
3. Draft generation
For first drafts of blog posts or emails, local models save me from paying for “thinking tokens.” The quality is lower—more repetition, weaker structure—but that’s fine for a starting point.
What doesn’t work
1. Multi-step reasoning
The model fails at tasks requiring more than 2-3 logical steps. For example, “analyze this pricing page and identify where it violates standard B2B SaaS pricing principles” produces superficial answers.
2. Tool calling
Ollama’s tool calling support is improving but still unreliable. The model often hallucinates tool parameters or calls the wrong tool entirely. For agent workflows, this is a dealbreaker.
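For context, this is roughly the shape of a tool-calling request through the ollama package, which accepts OpenAI-style function schemas. The `get_weather` tool and its parameters are invented for illustration; the failures show up as hallucinated arguments that don't match the schema, or a call to a tool that wasn't asked for.

```python
# Sketch of a tool-calling request with the ollama package (recent versions).
# The get_weather tool and its schema are made up for illustration.
import ollama

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)

# Inspect what the model actually asked for; compare against the schema above.
for call in response.message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```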
3. Consistency
I ran the same prompt 20 times and got meaningfully different answers 15 times. For some use cases this is fine; for anything that requires reproducibility, it’s not.
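The check itself takes a few lines to reproduce; here's a sketch with an illustrative classification prompt. Pinning temperature and seed through the `options` dict is the obvious knob to try if you need tighter reproducibility.

```python
# Sketch of the consistency check: run one prompt N times and count how many
# distinct answers come back. The classification prompt is illustrative.
import ollama

PROMPT = "Classify this ticket as billing, technical, or account: 'I was charged twice.'"

answers = set()
for _ in range(20):
    response = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": PROMPT}],
        # options={"temperature": 0, "seed": 42},  # uncomment to test determinism
    )
    answers.add(response.message.content.strip())

print(f"{len(answers)} distinct answers out of 20 runs")
```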
The economics
Based on my usage:
| Task | Cloud (GPT-4o-mini) | Local (Llama 3.1 8B) |
|---|---|---|
| Summarization | $0.02/doc | $0 (hardware amortized) |
| Extraction | $0.05/request | $0 |
| Complex reasoning | $0.15/request | N/A (fails) |
For my workload (~100 summaries/month, ~50 extractions), I’m saving about $5/month. That’s not nothing, but it’s also not transformative. The real value is latency—local inference feels instantaneous compared to even the fastest cloud APIs.
When to go local
My current heuristic:
- Use local: Summarization, classification, draft generation, any task where “good enough” is acceptable
- Use cloud: Multi-step reasoning, tool calling, production outputs, anything where consistency matters
The local stack is improving rapidly. I expect the gap to close over the next 6-12 months, especially as quantization techniques get better. But for now, a hybrid approach makes the most sense.
What’s next
I’m experimenting with model routing—using a smaller model to classify task complexity and routing to cloud only when necessary. Early results show ~40% cost savings with minimal quality degradation.
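As a sketch of the idea: the router is a single local classification call that decides whether to answer locally or fall back to the cloud. The classifier prompt, the SIMPLE/COMPLEX labels, and gpt-4o-mini as the fallback are illustrative assumptions, not the exact setup I'm measuring.

```python
# Sketch of the routing idea: a local model labels the task, and only tasks
# flagged as COMPLEX go to the cloud API. Prompt, labels, and fallback model
# are illustrative. Assumes OPENAI_API_KEY is set in the environment.
import ollama
from openai import OpenAI

cloud = OpenAI()

ROUTER_PROMPT = (
    "Label the following request as SIMPLE (summarization, classification, "
    "drafting) or COMPLEX (multi-step reasoning, tool use). Reply with one word.\n\n"
)

def answer(task: str) -> str:
    # Cheap local call to classify the task.
    label = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": ROUTER_PROMPT + task}],
    ).message.content.strip().upper()

    if "COMPLEX" in label:
        # Fall back to the cloud for anything the router flags as hard.
        result = cloud.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": task}],
        )
        return result.choices[0].message.content

    # Otherwise answer locally.
    return ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": task}],
    ).message.content
```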
The dream: a local model that can handle 80% of tasks, with cloud as fallback for the hard 20%. We’re not there yet, but we’re close.