

“Set your heart upon your work, but never on its reward.”
— Bhagavad Gita
The cloud was supposed to be the permanent home for AI inference. Models were too large, the hardware too specialized, and the ops burden too high for anyone but hyperscalers. That narrative made sense in 2024.
In 2026, three forces are pushing inference back to local hardware:
The result: 2026 is the year self-hosted AI went from niche to mainstream. Changelog declared it the "Year of Self-Hosted Software." GitHub's fastest-growing repos are local-first AI tools. And the tooling has finally caught up to make it practical.
Running AI locally used to mean "buy an NVIDIA A100 and figure it out." The options are broader now:
The RTX 5090 (32GB VRAM) and RTX 5080 (16GB VRAM) are the workhorses of local inference. A 5090 runs a 30B-parameter quantized model at interactive speeds. Two 5090s handle 70B models comfortably.
For most local inference workloads, consumer hardware is sufficient. You don't need data center GPUs unless you're running 100B+ models or serving high concurrency.
The M4 Ultra with 192GB unified memory is genuinely competitive for inference. Unified memory means the entire model fits in one address space without the complexity of GPU VRAM management. Throughput is lower than discrete GPUs, but for single-user or low-concurrency workloads, the simplicity is worth the tradeoff.
The most surprising development: running models directly in the browser via WebGPU. Projects like web-llm and the browser-based GGUF runners have reached the point where a 7B model runs at 15-20 tokens/second in Chrome on a laptop with a decent GPU.
This isn't competitive with native inference for production workloads, but for privacy-sensitive applications where data can't leave the device, it's transformative. No server, no API, no data transmission — the model runs entirely in the user's browser.
The real enabler isn't hardware — it's the software stack that makes local AI usable.
llama.cpp remains the foundation. Its GGUF format is the de facto standard for local model distribution, with quantization options that trade quality for speed:
| Quantization | Size (7B model) | Quality loss | Speed gain |
|---|---|---|---|
| FP16 | 14 GB | None | Baseline |
| Q8_0 | 7.5 GB | Negligible | ~1.5x |
| Q5_K_M | 5.0 GB | Minimal | ~2x |
| Q4_K_M | 4.1 GB | Small | ~2.5x |
| Q3_K_M | 3.3 GB | Moderate | ~3x |
For most applications, Q5_K_M is the sweet spot — barely perceptible quality loss with significant memory and speed improvements.
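The sizes in the table follow directly from bits per weight: roughly params × bits ÷ 8. A quick back-of-the-envelope helper (the bits-per-weight figures below are approximations for llama.cpp's quant types, including per-block scale overhead):

```typescript
// Approximate on-disk/in-memory size of a quantized model.
// Bits-per-weight values are rough estimates, not exact spec numbers.
const bitsPerWeight: Record<string, number> = {
  FP16: 16,
  Q8_0: 8.5, // 8-bit weights plus per-block scales
  Q5_K_M: 5.7,
  Q4_K_M: 4.8,
  Q3_K_M: 3.9,
};

function modelSizeGB(params: number, quant: string): number {
  const bits = bitsPerWeight[quant];
  if (bits === undefined) throw new Error(`unknown quant type: ${quant}`);
  return (params * bits) / 8 / 1e9;
}

console.log(modelSizeGB(7e9, "Q4_K_M").toFixed(1)); // ≈ 4.2 GB for a 7B model
```

The same arithmetic tells you what fits in VRAM before you download anything: a 70B model at Q5_K_M needs roughly 50 GB, which is why it takes two 5090s rather than one.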
vLLM is the choice for higher-throughput serving. If you need to serve multiple users from a local GPU server, vLLM's PagedAttention and continuous batching make much better use of GPU memory than naive inference.
Ollama has become the Docker of local AI. Pull a model, run it, expose an API. The developer experience is polished to the point where getting a model running locally is a single command:
```bash
ollama pull llama3.3:70b-instruct-q5_K_M
ollama serve
```

That's it. You now have a local API endpoint at `localhost:11434` that's compatible with the OpenAI API format.
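Because the endpoint speaks the OpenAI chat-completions format, any OpenAI-compatible client works against it. A minimal fetch-based sketch (the port and model tag are the Ollama defaults from above; `buildChatRequest` is a helper of my own, not part of any SDK):

```typescript
// Build an OpenAI-style chat completion request for the local Ollama endpoint.
const OLLAMA_URL = "http://localhost:11434/v1/chat/completions";

function buildChatRequest(model: string, prompt: string) {
  return {
    method: "POST" as const,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: prompt }],
    }),
  };
}

// Usage (requires `ollama serve` running locally):
// const res = await fetch(OLLAMA_URL, buildChatRequest("llama3.3:70b-instruct-q5_K_M", "Hello"));
// const { choices } = await res.json();
// console.log(choices[0].message.content);
```

Keeping the request shape OpenAI-compatible means swapping between local and cloud backends later is a URL change, not a rewrite.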
RAG doesn't go away just because you're running locally — you still need retrieval for domain-specific knowledge. The local vector database ecosystem has matured:
Zvec is the breakout project of 2026 — a lightweight, in-process vector database designed for local-first applications. No separate server process. It embeds directly in your application:
```typescript
import { Zvec } from "zvec";

const db = new Zvec({
  dimensions: 768,
  metric: "cosine",
  storagePath: "./data/vectors",
});

// Index documents (the map callback must be async to use await inside it)
await db.insert(
  await Promise.all(
    documents.map(async (doc) => ({
      id: doc.id,
      vector: await embed(doc.content),
      metadata: { title: doc.title, source: doc.source },
    }))
  )
);

// Query
const results = await db.search(await embed(query), {
  limit: 10,
  filter: { source: "documentation" },
});
```

For applications that need something heavier, Chroma and Qdrant both support local deployment with Docker.
Running retrieval locally means running embedding models locally too. The current best options:
The embedding model choice matters less than people think. For most retrieval tasks, the difference between a good and great embedding model is smaller than the difference between good and bad chunking strategies.
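Since chunking dominates retrieval quality, it's worth getting right before tuning anything else. A minimal fixed-size chunker with overlap, as one illustrative baseline (the sizes are character counts here for simplicity; tune per corpus, and token-based splitting is usually better in practice):

```typescript
// Split text into overlapping chunks. Overlap ensures sentences that straddle
// a chunk boundary remain retrievable from at least one chunk.
function chunk(text: string, size = 800, overlap = 200): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

Even this naive version beats no overlap at all; the bigger wins come from splitting on semantic boundaries (headings, paragraphs) rather than raw character offsets.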
The pattern I recommend for most local setups: a single gateway service that routes to local models based on task requirements.
```typescript
import { Hono } from "hono";

const app = new Hono();

const models = {
  fast: "http://localhost:11434", // Ollama: small model for classification, extraction
  capable: "http://localhost:11434", // Ollama: large model for reasoning, generation
  embedding: "http://localhost:8080", // Local embedding server
};

app.post("/v1/chat/completions", async (c) => {
  const body = await c.req.json();
  const complexity = estimateComplexity(body.messages);

  // Route to the appropriate model based on task complexity
  const target = complexity > THRESHOLD ? models.capable : models.fast;

  const response = await fetch(`${target}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      ...body,
      model: complexity > THRESHOLD ? "llama3.3:70b" : "llama3.3:8b",
    }),
  });

  return new Response(response.body, { headers: response.headers });
});
```

This gives you a single endpoint that your application talks to, with intelligent routing under the hood. It's the same model-router pattern used in cloud deployments, adapted for local hardware.
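The gateway leaves `estimateComplexity` and `THRESHOLD` undefined. Everything below is a placeholder heuristic of my own, cheap proxies like prompt length and keyword hits, not a real classifier; in production you'd likely train a small model or use the fast LLM itself as the router:

```typescript
interface Message { role: string; content: string }

const THRESHOLD = 0.5;

// Placeholder heuristic: long prompts and "reasoning-shaped" requests
// score high and route to the capable model; everything else stays fast.
function estimateComplexity(messages: Message[]): number {
  const text = messages.map((m) => m.content).join(" ").toLowerCase();
  let score = Math.min(text.length / 4000, 0.5); // length contributes up to 0.5
  const hardWords = ["prove", "analyze", "refactor", "step by step", "plan"];
  if (hardWords.some((w) => text.includes(w))) score += 0.5;
  return score;
}
```

The exact heuristic matters less than having one at all: without routing, every request pays the latency and memory cost of the large model.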
Pure local isn't always practical. Some tasks require capabilities that local models don't have — frontier reasoning, very long context, or specialized knowledge. The pragmatic approach: run most inference locally, fall back to cloud APIs for tasks that exceed local capability.
```typescript
async function inference(
  messages: Message[],
  requirements: TaskRequirements
): Promise<Response> {
  const localCapable =
    requirements.contextLength < 32_000 &&
    requirements.reasoning !== "frontier" &&
    !requirements.imageUnderstanding;

  if (localCapable) {
    return localInference(messages, requirements);
  }

  // Fall back to cloud for tasks that need frontier capabilities
  return cloudInference(messages, requirements);
}
```

The key: design the fallback boundary explicitly. Don't discover it at runtime. Know upfront which tasks go local and which go to the cloud, and make the routing deterministic.
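"Make the routing deterministic" can be taken literally: enumerate the task types up front and map each to a target, so the boundary lives in a reviewable table rather than emerging from runtime checks. A sketch with hypothetical task names:

```typescript
type Target = "local" | "cloud";

// Explicit, reviewable fallback boundary. Task names are illustrative.
const ROUTING: Record<string, Target> = {
  classification: "local",
  extraction: "local",
  summarization: "local",
  "code-generation": "local",
  "frontier-reasoning": "cloud",
  "long-context-analysis": "cloud", // > 32k tokens of context
};

function routeTask(task: string): Target {
  const target = ROUTING[task];
  if (!target) throw new Error(`unrouted task type: ${task}`); // fail loudly
  return target;
}
```

Throwing on an unknown task type is deliberate: a silent default would reintroduce exactly the runtime-discovered boundary this pattern is meant to prevent.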
For applications that need to work without any network connectivity — field deployments, air-gapped environments, embedded systems — the pattern is offline-first with periodic sync:
This is the architecture behind the "Off Grid" project that's gaining traction on Hacker News. It runs a quantized model on mobile devices with no network dependency, then syncs insights when the device reconnects.
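The core of that architecture can be sketched as an outbox: writes always succeed locally with no network dependency, and a flush runs when connectivity returns. This is my own minimal sketch of the pattern, not Off Grid's actual code:

```typescript
// Offline-first outbox: buffer insights locally, sync when the device reconnects.
interface Insight { id: string; payload: unknown }

class Outbox {
  private queue: Insight[] = [];

  // Always succeeds — recording never touches the network.
  record(insight: Insight): void {
    this.queue.push(insight);
  }

  // Called on reconnect. Items the sender rejects stay queued for next time.
  async flush(send: (i: Insight) => Promise<boolean>): Promise<number> {
    const pending = this.queue;
    this.queue = [];
    for (const item of pending) {
      const ok = await send(item);
      if (!ok) this.queue.push(item); // retry in the next sync window
    }
    return this.queue.length; // how many items remain unsent
  }
}
```

In a real deployment the queue would be persisted (SQLite, IndexedDB) so insights survive a restart, but the control flow is the same.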
Self-hosting isn't universally better. Stay on cloud APIs when:
For a sustained workload of ~1M tokens/day (input + output):
| Approach | Monthly cost | Notes |
|---|---|---|
| Cloud API (frontier model) | ~$500-800 | Pay per token, no hardware |
| Cloud API (small model) | ~$50-100 | Lower quality |
| Self-hosted (RTX 5090) | ~$40 electricity | $2,000 upfront for GPU |
| Self-hosted (Mac Studio M4 Ultra) | ~$15 electricity | $6,000 upfront |
The breakeven for self-hosted hardware is typically 2-4 months at sustained workloads. After that, the marginal cost is electricity.
These numbers shift dramatically with scale. At 10M tokens/day, cloud API costs scale linearly while self-hosted costs stay roughly flat (assuming your hardware has the capacity).
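The breakeven claim is simple arithmetic: months to breakeven = upfront hardware cost ÷ (cloud monthly cost − self-hosted monthly cost). Plugging in the table's figures (the $650 is my midpoint of the frontier-API range):

```typescript
// Months until self-hosted hardware pays for itself versus a cloud API.
function breakevenMonths(
  upfront: number,
  cloudMonthly: number,
  selfHostedMonthly: number
): number {
  const monthlySavings = cloudMonthly - selfHostedMonthly;
  if (monthlySavings <= 0) {
    throw new Error("self-hosting never breaks even at this workload");
  }
  return upfront / monthlySavings;
}

// RTX 5090 ($2,000 upfront, ~$40/mo electricity) vs. ~$650/mo frontier API:
console.log(breakevenMonths(2000, 650, 40).toFixed(1)); // ≈ 3.3 months
```

The Mac Studio lands at the slow end of the same range (roughly $6,000 ÷ $635/month ≈ 9 months against a frontier API, faster against heavier workloads), which is why the 2-4 month figure applies to the GPU path specifically.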
If you're considering self-hosting AI inference:
The pendulum is swinging from "everything in the cloud" to "cloud where necessary, local where possible." For AI inference, "where possible" is a much larger territory than most people realize. The tools are ready. The economics work. The only question is whether your use case fits the pattern.
