March 22, 2026
Local LLM Inference with Ollama: What's Actually Usable
My experience running Llama 3.1 locally on an M3 Max. What works, what doesn't, and why I'm still using cloud APIs for production.
6 min read
I’ve been experimenting with local inference for the past two months. The goal: reduce API costs and latency for tasks that don’t require GPT-4 level reasoning. Here’s what I learned.
The setup
- Hardware: MacBook Pro M3 Max, 36GB unified memory
- Model: Llama 3.1 8B via Ollama
- Use case: summarization, basic extraction, draft generation
I chose Ollama because it’s dead simple to install and has good library support in Python. The model downloads in a few minutes and runs without any GPU configuration.
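For reference, a basic call through the ollama Python package looks roughly like the sketch below. It assumes a recent version of the package; the model tag, system prompt, and input file name are placeholders of mine, not a recommended setup.

```python
# Minimal sketch: summarizing a document through the ollama Python package.
# Assumes `pip install ollama` and `ollama pull llama3.1` have already been run;
# the system prompt and the input file name are illustrative placeholders.
import ollama

document = open("quarterly_report.txt").read()  # hypothetical input file

response = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "Summarize the document in 3-5 bullet points."},
        {"role": "user", "content": document},
    ],
)
print(response.message.content)
```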
What works well
1. Summarization
For summarizing 2-3 page documents, Llama 3.1 8B is indistinguishable from GPT-4o-mini. I ran a blind test with 10 documents and couldn’t reliably tell which model produced which summary.
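If you want to repeat the comparison, the blind test is easy to script once both sets of summaries exist. A rough sketch, with placeholder file names:

```python
# Rough sketch of the blind comparison. Assumes both sets of summaries were
# generated ahead of time and saved in the same document order; the JSON file
# names are placeholders.
import json
import random

local = json.load(open("summaries_llama.json"))        # list of strings
cloud = json.load(open("summaries_gpt4o_mini.json"))   # same docs, same order

correct = 0
for a, b in zip(local, cloud):
    pair = [("local", a), ("cloud", b)]
    random.shuffle(pair)  # hide which model produced which summary
    print("\n--- A ---\n", pair[0][1], "\n--- B ---\n", pair[1][1])
    guess = input("Which is the local model (A/B)? ").strip().upper()
    correct += (guess == "A") == (pair[0][0] == "local")

print(f"Guessed the local model correctly {correct}/{len(local)} times")
```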
2. Code explanation
The model is surprisingly good at explaining what a function does. It won’t write complex algorithms from scratch, but for documentation-level explanations, it’s sufficient.
3. Draft generation
For first drafts of blog posts or emails, local models save me from paying for “thinking tokens.” The quality is lower—more repetition, weaker structure—but that’s fine for a starting point.
What doesn’t work
1. Multi-step reasoning
The model fails at tasks requiring more than 2-3 logical steps. For example, “analyze this pricing page and identify where it violates standard B2B SaaS pricing principles” produces superficial answers.
2. Tool calling
Ollama’s tool calling support is improving but still unreliable. The model often hallucinates tool parameters or calls the wrong tool entirely. For agent workflows, this is a dealbreaker.
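For context, this is roughly the shape of a tool-calling request through the ollama package, which accepts OpenAI-style function schemas. The `get_weather` tool and its parameters are invented for illustration; the failures show up as hallucinated arguments that don't match the schema, or a call to a tool that wasn't asked for.

```python
# Sketch of a tool-calling request with the ollama package (recent versions).
# The get_weather tool and its schema are made up for illustration.
import ollama

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)

# Inspect what the model actually asked for; compare against the schema above.
for call in response.message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```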
3. Consistency
I ran the same prompt 20 times and got meaningfully different answers 15 times. For some use cases this is fine; for anything that requires reproducibility, it’s not.
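The check itself takes a few lines to reproduce; here's a sketch with an illustrative classification prompt. Pinning temperature and seed through the `options` dict is the obvious knob to try if you need tighter reproducibility.

```python
# Sketch of the consistency check: run one prompt N times and count how many
# distinct answers come back. The classification prompt is illustrative.
import ollama

PROMPT = "Classify this ticket as billing, technical, or account: 'I was charged twice.'"

answers = set()
for _ in range(20):
    response = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": PROMPT}],
        # options={"temperature": 0, "seed": 42},  # uncomment to test determinism
    )
    answers.add(response.message.content.strip())

print(f"{len(answers)} distinct answers out of 20 runs")
```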
The economics
Based on my usage:
| Task | Cloud (GPT-4o-mini) | Local (Llama 3.1 8B) |
|---|---|---|
| Summarization | $0.02/doc | $0 (hardware amortized) |
| Extraction | $0.05/request | $0 |
| Complex reasoning | $0.15/request | N/A (fails) |
For my workload (~100 summaries/month, ~50 extractions), I’m saving about $5/month. That’s not nothing, but it’s also not transformative. The real value is latency—local inference feels instantaneous compared to even the fastest cloud APIs.
When to go local
My current heuristic:
- Use local: Summarization, classification, draft generation, any task where “good enough” is acceptable
- Use cloud: Multi-step reasoning, tool calling, production outputs, anything where consistency matters
The local stack is improving rapidly. I expect the gap to close over the next 6-12 months, especially as quantization techniques get better. But for now, a hybrid approach makes the most sense.
What’s next
I’m experimenting with model routing—using a smaller model to classify task complexity and routing to cloud only when necessary. Early results show ~40% cost savings with minimal quality degradation.
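As a sketch of the idea: the router is a single local classification call that decides whether to answer locally or fall back to the cloud. The classifier prompt, the SIMPLE/COMPLEX labels, and gpt-4o-mini as the fallback are illustrative assumptions, not the exact setup I'm measuring.

```python
# Sketch of the routing idea: a local model labels the task, and only tasks
# flagged as COMPLEX go to the cloud API. Prompt, labels, and fallback model
# are illustrative. Assumes OPENAI_API_KEY is set in the environment.
import ollama
from openai import OpenAI

cloud = OpenAI()

ROUTER_PROMPT = (
    "Label the following request as SIMPLE (summarization, classification, "
    "drafting) or COMPLEX (multi-step reasoning, tool use). Reply with one word.\n\n"
)

def answer(task: str) -> str:
    # Cheap local call to classify the task.
    label = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": ROUTER_PROMPT + task}],
    ).message.content.strip().upper()

    if "COMPLEX" in label:
        # Fall back to the cloud for anything the router flags as hard.
        result = cloud.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": task}],
        )
        return result.choices[0].message.content

    # Otherwise answer locally.
    return ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": task}],
    ).message.content
```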
The dream: a local model that can handle 80% of tasks, with cloud as fallback for the hard 20%. We’re not there yet, but we’re close.