Short answer — where should you run LLM inference? Run it wherever each workload’s economics and compliance constraints point, which for most companies means more than one place:
- Choose a cloud API when output quality, speed to ship, managed infrastructure, or frontier reasoning matters most.
- Choose a local (self-hosted) LLM when you’re handling sensitive data, running high-volume commodity tasks, or you need controllable, predictable cost.
- Choose a hybrid architecture when your workflows vary by sensitivity, volume, and complexity — the common case for mid-market teams.
The rest of this guide turns that into a routing decision across cost, compliance, and capability, scored against your actual workloads. Start with the comparison table, then work through each dimension.
| Dimension | Local LLM (self-hosted) | Cloud API | Hybrid |
|---|---|---|---|
| Cost | High fixed cost, low marginal cost; tends to pay off only at sustained volume | Pay-per-token, no fixed cost; scales directly with usage | Local for high-volume commodity tokens, cloud for the rest |
| Compliance | Data stays on infrastructure you control; strongest fit for PHI, NPI, and privileged data | Requires BAAs or enterprise terms; data traverses a third party | Route regulated data local, route everything else to cloud |
| Capability | Open models now match frontier models on many tasks; gaps remain on complex agents | Frontier reasoning, multimodal, long-horizon tool use | Match the best-available model to each task |
| Latency | Predictable in-network latency; no external rate limits | Low latency at small scale, but subject to provider limits and outages | Keep latency-sensitive paths wherever they perform best |
| Maintenance | You own GPUs, ops, and model updates (ongoing MLOps time) | Vendor-managed; minimal ops burden | Added routing-layer complexity plus partial ops burden |
| Best-fit workloads | Classification, extraction, summarization, high-volume batch | Complex multi-step agents, frontier reasoning, low-volume or spiky demand | Mixed workloads that span sensitivity, volume, and complexity |
Every token you generate costs money. With a cloud API, you pay per token directly — GPT-4o at $10 per million output tokens, Claude Sonnet at $15. With local inference, you pay indirectly — hardware, electricity, and engineering time amortized across every token the machine produces. So the more useful question is usually not which approach is cheaper, but which tokens should come from where.
A fintech company running $47,000 per month in API costs dropped to $8,000 by routing bulk summarization to a self-hosted 7B model on spot H100s while keeping complex reasoning on cloud APIs. An 83% reduction, four-month payback. Meanwhile, 76% of companies using LLMs have adopted open-source models (Databricks 2025), but open-source models account for roughly 13% of enterprise model usage, down from 19% mid-year (Menlo Ventures 2025). The gap between experimentation and production tells the real story: running your own models is viable, but most companies haven’t figured out which workloads justify it.
The output of scoring your workloads on cost, compliance, and capability is a routing decision: what runs locally, what stays on cloud APIs, and what runs through both. Most teams land on some version of the hybrid column above rather than an all-or-nothing answer.
This guide cites specific hardware prices, model benchmark scores, and cloud API rates. Those figures move quickly in this category — treat them as a point-in-time snapshot and confirm current numbers before making a purchase or architecture decision.
The Cost Math That Actually Matters
Cloud API pricing is transparent. Self-hosting costs are not. The comparison requires honest accounting on both sides.
Cloud API reference prices (March 2026): GPT-4o runs $2.50 input / $10.00 output per million tokens. Claude Sonnet 4.6 runs $3.00 / $15.00. But those are list prices: Anthropic’s prompt caching cuts repeat-context costs by 90%, and its batch API halves the price of asynchronous workloads. Combined, the two discounts multiply to 5% of list (0.10 × 0.50), so a cacheable batch workflow on Claude Haiku drops from $5.00 to roughly $0.25 per million output tokens. The effective price depends entirely on your usage pattern.
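A minimal sketch of that discount arithmetic, assuming the caching and batch discounts multiply on whatever share of tokens qualifies for caching. The function name, the `cacheable_share` parameter, and the default discount values are illustrative, and which tokens actually qualify depends on the provider's current terms.

```python
def effective_price(
    list_price: float,
    cacheable_share: float,
    cache_discount: float = 0.90,   # assumed: 90% off cache-eligible tokens
    batch_discount: float = 0.50,   # assumed: 50% off batch (async) requests
    use_batch: bool = True,
) -> float:
    """Blended price per million tokens after caching and batch discounts."""
    if not 0.0 <= cacheable_share <= 1.0:
        raise ValueError("cacheable_share must be between 0 and 1")
    # Cache-eligible tokens pay the discounted rate; the rest pay list price.
    cached = list_price * (1 - cache_discount) * cacheable_share
    uncached = list_price * (1 - cacheable_share)
    blended = cached + uncached
    # The batch discount applies to the whole request when it runs async.
    if use_batch:
        blended *= 1 - batch_discount
    return blended


if __name__ == "__main__":
    for share in (0.0, 0.5, 1.0):
        print(f"cacheable share {share:.0%}: ${effective_price(5.00, share):.2f}")
```

With a fully cacheable batch workload the $5.00 list price lands near $0.25, while a batch workload with nothing cacheable only halves to $2.50. Measuring your real cacheable share is what turns the list price into a usable estimate.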
Self-hosted TCO for a 70B model on 2x H100: Hardware amortized over three years runs $28,000–$40,000 annually. Power and cooling add $10,000–$25,000. A quarter of one MLOps engineer’s time — the minimum for production reliability — adds $30,000–$50,000. Total: $68,000–$115,000 per year before generating a single token. At that cost, you need sustained volume to break even.
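The total above is the sum of the three line items, and it can be turned into a cost per million tokens once you know your annual volume. The sketch below is illustrative: `CostRange` and `cost_per_million_tokens` are names made up for this example, and the 10-billion-token volume is a hypothetical input, not a figure from this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostRange:
    """Low and high annual cost estimate, in dollars."""
    low: int
    high: int

    def __add__(self, other: "CostRange") -> "CostRange":
        return CostRange(self.low + other.low, self.high + other.high)


# Annual line items for a 70B model on 2x H100, as quoted in the text.
hardware = CostRange(28_000, 40_000)
power_and_cooling = CostRange(10_000, 25_000)
mlops_time = CostRange(30_000, 50_000)

total = hardware + power_and_cooling + mlops_time


def cost_per_million_tokens(annual_cost: float, tokens_per_year: float) -> float:
    """Fixed annual cost spread across every token the machine produces."""
    if tokens_per_year <= 0:
        raise ValueError("tokens_per_year must be positive")
    return annual_cost / (tokens_per_year / 1_000_000)


if __name__ == "__main__":
    print(f"Annual TCO: ${total.low:,} to ${total.high:,}")
    hypothetical_volume = 10_000_000_000  # assumed tokens per year
    low = cost_per_million_tokens(total.low, hypothetical_volume)
    high = cost_per_million_tokens(total.high, hypothetical_volume)
    print(f"Per million tokens at that volume: ${low:.2f} to ${high:.2f}")
```

Because the cost is fixed, the per-token figure falls in direct proportion to volume, which is why utilization dominates the comparison.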
The break-even thresholds: Below $50,000 per year in API spend, self-hosting costs more than it saves — engineering overhead alone exceeds the API bill. Between $50,000 and $500,000, the answer is hybrid: route high-volume commodity tasks to local models and keep complex or low-volume work on cloud APIs. Above $500,000, the economics of owned infrastructure start to dominate. In practice, payback periods vary from under two years to a decade or more, with the variance driven almost entirely by GPU utilization rates. An idle GPU is among the most expensive hardware you can own.
The break-even isn’t about total API spend — it’s about how many tokens you can route to a local model that’s good enough for the task. A $200,000 API bill where 80% is frontier reasoning work won’t benefit from self-hosting. A $60,000 bill where 80% is classification and extraction will.
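One way to make this concrete is to compute routable spend (the part of the bill a local model could absorb) before comparing it with any self-hosting cost. This is a simplified sketch: the function names are made up for illustration, and the self-hosting cost is something you supply from your own hardware and staffing estimate.

```python
def routable_spend(annual_api_spend: float, routable_share: float) -> float:
    """Dollars per year spent on tasks a local model could handle."""
    if not 0.0 <= routable_share <= 1.0:
        raise ValueError("routable_share must be between 0 and 1")
    return annual_api_spend * routable_share


def net_annual_savings(
    annual_api_spend: float, routable_share: float, self_host_cost: float
) -> float:
    """Routable spend minus the annual cost of serving it locally.

    Assumes the local model fully replaces the routable API calls and
    ignores any API discounts; a negative result means self-hosting loses.
    """
    return routable_spend(annual_api_spend, routable_share) - self_host_cost


if __name__ == "__main__":
    # The two bills from the text: 80% frontier work leaves 20% routable.
    frontier_heavy = routable_spend(200_000, 0.20)
    commodity_heavy = routable_spend(60_000, 0.80)
    print(f"$200k bill, 20% routable: ${frontier_heavy:,.0f}")
    print(f"$60k bill, 80% routable:  ${commodity_heavy:,.0f}")
```

The smaller bill has more routable spend than the larger one, which is the point of the comparison: the share of commodity tokens matters more than the size of the invoice.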
The Compliance Dimension Cloud Vendors Don’t Emphasize
Cost is the obvious axis. Compliance is the one that forces the decision.
Healthcare: Sending protected health information in a cloud LLM prompt makes that provider a HIPAA Business Associate. Without a signed BAA, every API call containing PHI is an unauthorized disclosure — not a technical risk, a legal violation. Standard API tiers from OpenAI and Anthropic do not include BAAs. You need enterprise agreements, and even then, OpenAI retains API data for 30 days by default. HHS OCR issued seven enforcement actions against business associates in the first six months of their 2024 Risk Analysis Initiative, doubling all prior BA enforcement since 2013. OCR HIPAA settlements have ranged from tens of thousands of dollars to over $3 million per case.
Financial services: Federal Reserve SR 11-7 required regulated institutions to validate model behavior; it was rescinded in April 2026 and replaced with updated interagency model-risk guidance, but the core validation obligation carries forward. Proprietary cloud LLMs don’t expose weights or training data. Regulators acknowledged in 2024 that “the exact nature of the training data used may not be available to a banking organization in the case of proprietary third-party models.” A local model running open weights lets the institution inspect and test the model directly, which supports those validation requirements in ways a cloud API structurally cannot.
Legal: ABA Formal Opinion 512 (July 2024) states that inputting confidential client information into a public GenAI platform without an enterprise agreement may implicate a lawyer’s confidentiality obligations and risk exposing client information to a third party. Multiple state bars — Florida, California, New York, New Jersey, Pennsylvania — issued parallel guidance.
The Samsung precedent: Three documented data leaks within twenty days of authorizing ChatGPT — semiconductor source code, defect-detection program code, and internal meeting transcripts. Irrecoverable. This is the case every compliance officer cites, and it remains relevant because the fundamental architecture hasn’t changed: cloud inference means your data traverses someone else’s infrastructure.
For healthcare, financial services, legal, and insurance companies, the compliance question often overrides the cost question. A local model processing sensitive data isn’t a performance optimization — it’s frequently a regulatory requirement. Whichever way you route, the underlying obligation is the same: document which data classes go where and why, which is the kind of control an AI governance framework is meant to capture.
When Open-Source Models Are Good Enough
The quality gap between open-source and proprietary models has collapsed in the past twelve months. Whether it matters depends on the task.
Where open-source matches or beats proprietary (March 2026): DeepSeek R1 scores 90.8% on MMLU versus GPT-4o’s 87–88%. On MATH-500, DeepSeek R1 hits 97.3%. Qwen3-30B-A3B scores 91.0 on ArenaHard versus GPT-4o’s 85.3. Coding is an exception: on the standard HumanEval benchmark, GPT-4o scores about 90% versus DeepSeek V3’s 78%, so proprietary models retain a clear lead there. For classification, extraction, summarization, and structured reasoning, a well-chosen open model running locally delivers equivalent output at marginal cost approaching zero.
Where cloud APIs still lead: Complex multi-step agents, long-horizon planning with tool use, multimodal document understanding, and real-time audio processing. These are the tasks where reliability across edge cases matters more than benchmark scores — and where proprietary models maintain a meaningful advantage. If your critical workflow involves an autonomous agent making a chain of decisions over 50 steps, cloud frontier models fail less often in ways that matter.
The practical model for mid-market (March 2026): Qwen3-30B-A3B is the strongest GPT-4o-class model deployable on a single machine. Mistral Small 3 at 24B parameters offers the best efficiency in its size class — MMLU above 81%, 150 tokens per second on modest hardware. Llama 4 Scout runs on a single H100 with a 10-million-token context window and native vision, though EU companies face licensing restrictions. The model you choose matters less than matching model capability to task complexity — the same evaluation logic from the SaaS Replacement Scorecard applies here.
The Hardware Decision for a 200-Person Company
Datacenter GPUs are not the only option. The mid-market hardware landscape has shifted.
Mac Studio as inference server: A Mac Studio M3 Ultra with 256GB unified memory costs roughly $5,999, draws 160–270W depending on model size and concurrency (versus 450W for a single RTX 4090), runs silently in an office, and can load a 70B model unquantized at 16-bit precision (about 140GB of weights). Qwen3-30B delivers strong local throughput in 4-bit quantization. The unified memory architecture eliminates the VRAM bottleneck that makes NVIDIA GPUs expensive: you’re paying for memory capacity, not specialized GPU memory at 5x the price per gigabyte. For teams running fewer than five concurrent inference users, a Mac Studio handles the workload at a fraction of datacenter cost.
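A rough way to check whether a model fits in a given amount of memory is to multiply parameter count by bytes per parameter. The sketch below covers weights only; the 20% headroom for KV cache and runtime overhead is an assumed placeholder rather than a measured figure, and the function names are illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes."""
    if params_billion <= 0 or bits_per_param <= 0:
        raise ValueError("params_billion and bits_per_param must be positive")
    # 1 billion parameters at 8 bits each is roughly 1 GB.
    return params_billion * bits_per_param / 8


def fits_in_memory(
    params_billion: float,
    bits_per_param: int,
    memory_gb: float,
    overhead: float = 0.20,  # assumed headroom for KV cache and runtime
) -> bool:
    """True if weights plus the assumed overhead fit in memory_gb."""
    needed = weight_memory_gb(params_billion, bits_per_param) * (1 + overhead)
    return needed <= memory_gb


if __name__ == "__main__":
    for name, params, bits in [
        ("70B at 16-bit", 70, 16),
        ("70B at 4-bit", 70, 4),
        ("30B at 4-bit", 30, 4),
    ]:
        gb = weight_memory_gb(params, bits)
        ok = fits_in_memory(params, bits, memory_gb=256)
        print(f"{name}: {gb:.0f} GB of weights, fits in 256 GB: {ok}")
```

Real headroom depends on context length and concurrency, so treat the result as a first filter before benchmarking on the actual machine.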
Mac Studio clusters: macOS 26.2 added RDMA over Thunderbolt in December 2025. EXO Labs’ open-source framework enables distributed inference across multiple Mac Studios. Two to three units, at an entry cost of roughly $10,000, handle 100B+ parameter models. A few units, at a total cost in the low tens of thousands of dollars, can run very large models locally: no per-token cost, no data egress, deployable in a closet. The software is early-stage, but the architecture is real: office-grade hardware running models that previously required datacenter infrastructure.
NVIDIA for higher concurrency: Once you need 10+ concurrent users or sustained high-throughput batch processing, NVIDIA hardware delivers better aggregate throughput. An RTX 4090 at $2,500–$3,500+ (now end-of-life, with prices trending up) handles 7B–13B models. The L40S at $8,000–$10,000 with 48GB VRAM runs 30B–70B models for light multi-user production. The H100 at $25,000–$40,000 is the production workhorse — but at that price point, you’re comparing against cloud GPU rental at $2.00–$3.90 per GPU-hour on providers like Vast.ai and Lambda Labs.
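The buy-versus-rent comparison for an H100 reduces to how many GPU-hours of rental the purchase price would cover. The sketch below uses the price and rental ranges quoted above; it deliberately ignores power, hosting, staff time, and resale value, and the function names and the 730-hours-per-month constant are simplifying assumptions.

```python
HOURS_PER_MONTH = 730  # assumed average (8,760 hours per year / 12)


def breakeven_hours(purchase_price: float, rental_rate_per_hour: float) -> float:
    """GPU-hours of rental that cost as much as buying the card outright."""
    if rental_rate_per_hour <= 0:
        raise ValueError("rental_rate_per_hour must be positive")
    return purchase_price / rental_rate_per_hour


def breakeven_months(
    purchase_price: float, rental_rate_per_hour: float, utilization: float
) -> float:
    """Months until purchase beats rental at a given utilization (0 to 1]."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    hours = breakeven_hours(purchase_price, rental_rate_per_hour)
    return hours / (HOURS_PER_MONTH * utilization)


if __name__ == "__main__":
    for price in (25_000, 40_000):
        for rate in (2.00, 3.90):
            for utilization in (1.0, 0.25):
                months = breakeven_months(price, rate, utilization)
                print(
                    f"${price:,} card vs ${rate:.2f}/hr at "
                    f"{utilization:.0%} utilization: {months:.1f} months"
                )
```

Halving utilization doubles the break-even time, so a card that sits idle most of the day can take years to beat rental.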
The Hybrid Architecture That Actually Works
The right answer for most mid-market companies is not local or cloud — it’s a routing layer that sends each request to the right destination.
Sensitivity-aware routing: Classify data at the prompt level. Regulated data — PHI, nonpublic financial information, attorney-client privileged content, trade secrets — routes to local models only. Internal but non-sensitive work routes to local models when capacity is available, cloud when it’s not. Public-facing, non-sensitive tasks use cloud APIs. Netflix, Lemonade, and RocketMoney run this pattern in production using LiteLLM, an open-source proxy that provides a unified OpenAI-compatible API across both local and cloud models with centralized audit logging.
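The three routing rules above can be sketched as a small decision function. This is an illustration of the policy only, not LiteLLM configuration: the enum names and the `route` function are hypothetical, and it assumes each prompt has already been classified upstream.

```python
from enum import Enum


class Sensitivity(Enum):
    REGULATED = "regulated"  # PHI, nonpublic financial data, privileged content
    INTERNAL = "internal"    # internal but non-sensitive work
    PUBLIC = "public"        # public-facing, non-sensitive tasks


class Destination(Enum):
    LOCAL = "local"
    CLOUD = "cloud"


def route(sensitivity: Sensitivity, local_capacity_available: bool) -> Destination:
    """Pick a destination from the data class and current local capacity."""
    if sensitivity is Sensitivity.REGULATED:
        # Regulated data never leaves local infrastructure, even when the
        # local model is saturated; such requests should queue instead.
        return Destination.LOCAL
    if sensitivity is Sensitivity.INTERNAL:
        # Prefer local when there is headroom, overflow to cloud otherwise.
        return Destination.LOCAL if local_capacity_available else Destination.CLOUD
    return Destination.CLOUD


if __name__ == "__main__":
    for level in Sensitivity:
        for capacity in (True, False):
            print(level.value, capacity, "->", route(level, capacity).value)
```

Keeping the regulated branch free of any capacity check is the design choice that matters here: a fallback to cloud under load would turn a capacity problem into a compliance problem.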
Confidence-based routing: A lightweight local model handles the majority of requests. When the model’s confidence falls below a threshold — measurable via output probability scores — the request escalates to a cloud frontier model. Published research shows this pattern reduces cloud API usage by 60% or more while maintaining output quality. RouteLLM, developed by UC Berkeley, Anyscale, and Canva (published at ICLR 2025), achieves 85% cost reduction while preserving 95% of GPT-4-level quality through trained routing classifiers.
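A simple version of the threshold check can be built from per-token log-probabilities, assuming the local runtime returns them. The sketch below uses the geometric mean of token probabilities as the confidence score; the function names and the 0.80 threshold are illustrative assumptions, and this is not how RouteLLM works, since RouteLLM relies on trained routing classifiers.

```python
import math
from typing import Sequence


def mean_token_probability(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities from natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_escalate(token_logprobs: Sequence[float], threshold: float = 0.80) -> bool:
    """True when the local answer looks uncertain enough to retry on cloud.

    The threshold is a placeholder; tune it on labeled examples from your
    own workload before trusting it.
    """
    return mean_token_probability(token_logprobs) < threshold


if __name__ == "__main__":
    confident = [-0.01, -0.02, -0.01]  # hypothetical log-probs, high confidence
    uncertain = [-0.7, -0.5]           # hypothetical log-probs, low confidence
    print("confident answer escalates:", should_escalate(confident))
    print("uncertain answer escalates:", should_escalate(uncertain))
```

Token probability is a rough proxy for correctness, so it is worth validating the threshold against human-graded samples before relying on it to gate escalation.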
The implementation sequence: Start with cloud APIs for everything — time to first query is under an hour. Identify your highest-volume, lowest-complexity workloads after 30 days of production data. Deploy a local model for those specific tasks. Route sensitive data to local from day one if compliance requires it. Expand local coverage as utilization justifies additional hardware. The goal is not to eliminate cloud APIs. The goal is to stop paying frontier-model prices for tasks that a 30B open-source model can handle just as well. Teams without the in-house MLOps capacity to stand up and maintain that routing layer increasingly hand the build-and-run to managed AI operations rather than absorb it internally.