Every token you generate costs money. With a cloud API, you pay per token directly — GPT-4o at $10 per million output tokens, Claude Sonnet at $15. With local inference, you pay indirectly — hardware, electricity, and engineering time amortized across every token the machine produces. The question isn’t which approach is cheaper. The question is which tokens should come from where.
A fintech company running $47,000 per month in API costs dropped to $8,000 by routing bulk summarization to a self-hosted 7B model on spot H100s while keeping complex reasoning on cloud APIs. An 83% reduction, four-month payback. Meanwhile, 76% of companies using LLMs have adopted open-source models (Databricks 2025), but only 11% of production inference traffic actually runs on self-hosted infrastructure (Menlo Ventures 2025). The gap between experimentation and production tells the real story: running your own models is viable, but most companies haven’t figured out which workloads justify it.
This framework answers that. Three dimensions — cost, compliance, and capability — scored against your actual workloads. The output is a routing decision: what runs locally, what stays on cloud APIs, and what runs through both.
The Cost Math That Actually Matters
Cloud API pricing is transparent. Self-hosting costs are not. The comparison requires honest accounting on both sides.
Cloud API reference prices (March 2026): GPT-4o runs $2.50 input / $10.00 output per million tokens. Claude Sonnet 4.6 runs $3.00 / $15.00. But those are list prices. Anthropic’s prompt caching cuts the cost of repeated input context by 90%, and its batch API halves the price of async workloads. Combined, cached input in a batch workflow falls to roughly 5% of list price, while output tokens get only the batch discount (Claude Haiku output drops from $5.00 to $2.50 per million). The effective price depends entirely on your usage pattern.
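The discount stacking is easy to misjudge, so here is a minimal sketch of the arithmetic. It assumes the caching discount applies only to the cached share of input tokens and that the batch discount applies to everything. The function name and parameters are illustrative, not part of any vendor SDK.

```python
def effective_price(list_price, cached_fraction=0.0, batch=False,
                    cache_discount=0.90, batch_discount=0.50):
    """Blended price per million tokens after caching and batch discounts.

    cached_fraction is the share of tokens served from the prompt cache
    (an assumption you must measure; use 0.0 for output tokens).
    """
    cached_part = list_price * cached_fraction * (1 - cache_discount)
    uncached_part = list_price * (1 - cached_fraction)
    price = cached_part + uncached_part
    if batch:
        price *= 1 - batch_discount
    return round(price, 4)


# Claude Sonnet list prices quoted above: $3.00 input, $15.00 output.
sonnet_input_cached_batch = effective_price(3.00, cached_fraction=1.0, batch=True)
sonnet_output_batch = effective_price(15.00, batch=True)
print(sonnet_input_cached_batch, sonnet_output_batch)
```

Under these assumptions, fully cached input in a batch job lands at 5% of list, while output is only halved. A realistic cached_fraction sits somewhere between 0 and 1, which is why the effective price has to be measured per workload.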
Self-hosted TCO for a 70B model on 2x H100: Hardware amortized over three years runs $28,000–$40,000 annually. Power and cooling add $10,000–$25,000. A quarter of one MLOps engineer’s time — the minimum for production reliability — adds $30,000–$50,000. Total: $68,000–$115,000 per year before generating a single token. At that cost, you need sustained volume to break even.
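The TCO range above is a straight sum of the three components, and the cost per token follows once you assume a throughput and a utilization rate. The sketch below reproduces the sum from the figures in this section. The throughput and utilization arguments are hypothetical inputs, not measurements.

```python
# Annual cost ranges (low, high) in USD, taken from the paragraph above.
TCO_COMPONENTS = {
    "hardware_amortized": (28_000, 40_000),
    "power_and_cooling": (10_000, 25_000),
    "mlops_quarter_fte": (30_000, 50_000),
}

SECONDS_PER_YEAR = 365 * 24 * 3600


def annual_tco(components):
    """Return the (low, high) annual total across all cost components."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


def cost_per_million_tokens(annual_cost, tokens_per_second, utilization):
    """Self-hosted cost per million tokens at a given utilization (0 to 1].

    tokens_per_second is the aggregate throughput of the deployment,
    which you have to benchmark on your own hardware and model.
    """
    tokens_per_year = tokens_per_second * utilization * SECONDS_PER_YEAR
    return annual_cost / (tokens_per_year / 1_000_000)


low, high = annual_tco(TCO_COMPONENTS)
print(low, high)  # 68000 115000
```

Because the annual cost is fixed, halving utilization doubles the cost per token. That is the mechanism behind the break-even thresholds.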
The break-even thresholds: Below $50,000 per year in API spend, self-hosting costs more than it saves — engineering overhead alone exceeds the API bill. Between $50,000 and $500,000, the answer is hybrid: route high-volume commodity tasks to local models and keep complex or low-volume work on cloud APIs. Above $500,000, the economics of owned infrastructure start to dominate. A peer-reviewed analysis of 54 deployment scenarios found payback periods ranging from 18 months to 9 years, with the variance driven almost entirely by GPU utilization rates (arXiv 2024). An idle GPU is the most expensive hardware in your office.
The break-even isn’t only about total API spend. It also depends on how many tokens you can route to a local model that’s good enough for the task. A $200,000 API bill where 80% is frontier reasoning work won’t benefit from self-hosting. A $60,000 bill where 80% is classification and extraction will.
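A rough way to apply both tests at once is to bucket the bill by the thresholds above and then look at the routable portion rather than the total. The sketch below uses the two example bills from this section. The function names are illustrative, and the commodity share is something you would estimate from your own request logs.

```python
def spend_tier(annual_api_spend):
    """Bucket an annual API bill using the thresholds quoted above."""
    if annual_api_spend < 50_000:
        return "stay on cloud APIs"
    if annual_api_spend <= 500_000:
        return "hybrid"
    return "owned infrastructure"


def routable_spend(annual_api_spend, commodity_share):
    """Dollars per year spent on tasks a local model could take over.

    commodity_share is the fraction (0 to 1) of spend that goes to
    classification, extraction, and similar commodity work.
    """
    return round(annual_api_spend * commodity_share, 2)


big_bill = routable_spend(200_000, 0.20)   # 80% frontier reasoning
small_bill = routable_spend(60_000, 0.80)  # 80% classification/extraction
print(spend_tier(200_000), big_bill)
print(spend_tier(60_000), small_bill)
```

Both bills fall in the hybrid tier, but the smaller one has more spend that can actually move. That routable figure is the number to compare against the annual cost of whatever local hardware you are considering.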
The Compliance Dimension Cloud Vendors Don’t Emphasize
Cost is the obvious axis. Compliance is the one that forces the decision.
Healthcare: Sending protected health information in a cloud LLM prompt makes that provider a HIPAA Business Associate. Without a signed BAA, every API call containing PHI is an unauthorized disclosure: a legal violation, not a technical risk. Standard API tiers from OpenAI and Anthropic do not include BAAs. You need enterprise agreements, and even then, OpenAI retains API data for 30 days by default. HHS OCR issued seven enforcement actions against business associates in the first six months of its 2024 Risk Analysis Initiative, doubling all prior BA enforcement since 2013. Fines start at $80,000 per incident.
Financial services: Federal Reserve SR 11-7 requires regulated institutions to validate model behavior. Proprietary cloud LLMs don’t expose weights or training data; regulators acknowledged in 2024 that “the exact nature of the training data used may not be available to a banking organization in the case of proprietary third-party models.” A local model running open weights can be inspected and validated under SR 11-7 in ways a cloud API structurally cannot.
Legal: ABA Formal Opinion 512 (July 2024) states that inputting confidential client information into a public GenAI platform without an enterprise agreement may constitute disclosure to a third party, resulting in waiver of attorney-client privilege. Multiple state bars — Florida, California, New York, New Jersey, Pennsylvania — issued parallel guidance.
The Samsung precedent: Three documented data leaks within three weeks of authorizing ChatGPT — semiconductor source code, defect-detection program code, and internal meeting transcripts. Irrecoverable. This is the case every compliance officer cites, and it remains relevant because the fundamental architecture hasn’t changed: cloud inference means your data traverses someone else’s infrastructure.
For healthcare, financial services, legal, and insurance companies, the compliance question often overrides the cost question. A local model processing sensitive data isn’t a performance optimization — it’s a regulatory requirement.
When Open-Source Models Are Good Enough
The quality gap between open-source and proprietary models has collapsed in the past twelve months. Whether it matters depends on the task.
Where open-source matches or beats proprietary (March 2026): DeepSeek R1 scores 90.8% on MMLU versus GPT-4o’s 87–88%. On MATH-500, DeepSeek R1 hits 97.3%. On HumanEval coding benchmarks, DeepSeek V3 scores 82.6 versus GPT-4o’s 80.5. Qwen3-30B-A3B scores 91.0 on ArenaHard versus GPT-4o’s 85.3. For classification, extraction, summarization, and structured reasoning, a well-chosen open model running locally delivers equivalent output at marginal cost approaching zero.
Where cloud APIs still lead: Complex multi-step agents, long-horizon planning with tool use, multimodal document understanding, and real-time audio processing. These are the tasks where reliability across edge cases matters more than benchmark scores — and where proprietary models maintain a meaningful advantage. If your critical workflow involves an autonomous agent making a chain of decisions over 50 steps, cloud frontier models fail less often in ways that matter.
The practical model for mid-market (March 2026): Qwen3-30B-A3B is the strongest GPT-4o-class model deployable on a single machine. Mistral Small 3 at 24B parameters offers the best efficiency in its size class — MMLU above 81%, 150 tokens per second on modest hardware. Llama 4 Scout runs on a single H100 with a 10-million-token context window and native vision, though EU companies face licensing restrictions. The model you choose matters less than matching model capability to task complexity — the same evaluation logic from the SaaS Replacement Scorecard applies here.
The Hardware Decision for a 200-Person Company
Datacenter GPUs are not the only option. The mid-market hardware landscape has shifted.
Mac Studio as inference server: A Mac Studio M3 Ultra with 256GB unified memory costs $5,499, draws 215W under load (versus 450W for a single RTX 4090), runs silently in an office, and can load a 70B model at 16-bit precision (roughly 140GB of weights). Qwen3-30B runs at roughly 2,320 tokens per second in 4-bit quantization. The unified memory architecture eliminates the VRAM bottleneck that makes NVIDIA GPUs expensive: you’re paying for memory capacity, not specialized GPU memory at 5x the price per gigabyte. For teams running fewer than five concurrent inference users, a Mac Studio handles the workload at a fraction of datacenter cost.
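Whether a model fits in memory comes down to parameter count times bytes per parameter, plus headroom for the KV cache and runtime. The sketch below is a back-of-the-envelope check. The 20% overhead figure is an assumption for illustration, and real headroom depends on context length and batch size.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billions, precision):
    """Approximate size of the model weights in gigabytes."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions, precision, memory_gb, overhead=0.20):
    """True if weights plus an assumed overhead margin fit in memory_gb."""
    needed = weights_gb(params_billions, precision) * (1 + overhead)
    return needed <= memory_gb


for precision in ("fp32", "fp16", "int4"):
    print(precision, weights_gb(70, precision), fits(70, precision, 256))
```

On these numbers a 70B model fits in 256GB at 16-bit precision but not at 32-bit, and a 4-bit quantized copy leaves most of the memory free for context.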
Mac Studio clusters: macOS 26.2 added RDMA over Thunderbolt in December 2025. EXO Labs’ open-source framework enables distributed inference across multiple Mac Studios. Two to three units at roughly $10,000 entry cost handle 100B+ parameter models. Four units at $40,000–$50,000 run trillion-parameter models locally — no per-token cost, no data egress, deployable in a closet. The software is early-stage, but the architecture is real: office-grade hardware running models that previously required datacenter infrastructure.
NVIDIA for higher concurrency: Once you need 10+ concurrent users or sustained high-throughput batch processing, NVIDIA hardware delivers better aggregate throughput. An RTX 4090 at $1,600–$2,000 handles 7B–13B models. The L40S at $8,000–$10,000 with 48GB VRAM runs 30B–70B models for light multi-user production. The H100 at $25,000–$40,000 is the production workhorse — but at that price point, you’re comparing against cloud GPU rental at $1.49–$3.90 per GPU-hour on providers like Vast.ai and Lambda Labs.
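The buy-versus-rent comparison reduces to how many GPU-hours you would have to rent before the rental bill equals the purchase price. A minimal sketch follows, using the H100 prices and rental rates quoted above. It deliberately ignores power, hosting, resale value, and engineering time, so treat the result as a lower bound on the real break-even.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month


def break_even_gpu_hours(purchase_price, rental_rate_per_hour):
    """GPU-hours of rental that cost as much as buying the card."""
    return purchase_price / rental_rate_per_hour


def break_even_months(purchase_price, rental_rate_per_hour, utilization):
    """Months until purchase beats rental at a given utilization (0 to 1]."""
    hours = break_even_gpu_hours(purchase_price, rental_rate_per_hour)
    return hours / (HOURS_PER_MONTH * utilization)


# Best case for buying: cheapest card against the most expensive rental.
best = break_even_months(25_000, 3.90, utilization=1.0)
# Worst case for buying: priciest card against the cheapest rental.
worst = break_even_months(40_000, 1.49, utilization=1.0)
print(round(best, 1), round(worst, 1))
```

Even at full utilization the spread between the two cases is wide, and any drop in utilization stretches both proportionally.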
The Hybrid Architecture That Actually Works
The right answer for most mid-market companies is not local or cloud — it’s a routing layer that sends each request to the right destination.
Sensitivity-aware routing: Classify data at the prompt level. Regulated data — PHI, nonpublic financial information, attorney-client privileged content, trade secrets — routes to local models only. Internal but non-sensitive work routes to local models when capacity is available, cloud when it’s not. Public-facing, non-sensitive tasks use cloud APIs. Netflix, Lemonade, and RocketMoney run this pattern in production using LiteLLM, an open-source proxy that provides a unified OpenAI-compatible API across both local and cloud models with centralized audit logging.
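The routing rule itself is small enough to sketch. The version below is a toy, not LiteLLM's API: the regex patterns stand in for a real data classifier, and the names Sensitivity, classify, and route are hypothetical. The part that matters is the policy, where regulated data never falls back to cloud even when local capacity is exhausted.

```python
import re
from enum import Enum


class Sensitivity(Enum):
    REGULATED = "regulated"
    INTERNAL = "internal"
    PUBLIC = "public"


# Placeholder patterns only. A production classifier needs far more than this.
REGULATED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN-shaped number
    re.compile(r"\b(patient|diagnosis|privileged)\b", re.IGNORECASE),
]


def classify(prompt, is_internal):
    """Tag a prompt with a sensitivity level before it leaves the app."""
    if any(p.search(prompt) for p in REGULATED_PATTERNS):
        return Sensitivity.REGULATED
    return Sensitivity.INTERNAL if is_internal else Sensitivity.PUBLIC


def route(sensitivity, local_has_capacity):
    """Return 'local' or 'cloud' for a request."""
    if sensitivity is Sensitivity.REGULATED:
        return "local"  # queue locally rather than send to cloud
    if sensitivity is Sensitivity.INTERNAL:
        return "local" if local_has_capacity else "cloud"
    return "cloud"


level = classify("Summarize the patient intake notes", is_internal=True)
print(level, route(level, local_has_capacity=False))
```

In practice the destination string would map to a model name behind the proxy, and every routing decision would be written to the audit log.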
Confidence-based routing: A lightweight local model handles the majority of requests. When the model’s confidence falls below a threshold — measurable via output probability scores — the request escalates to a cloud frontier model. Published research shows this pattern reduces cloud API usage by 60% or more while maintaining output quality. RouteLLM, developed by UC Berkeley, Anyscale, and Canva (published at ICLR 2025), achieves 85% cost reduction while preserving 95% of GPT-4-level quality through trained routing classifiers.
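One simple way to measure confidence from output probabilities is the geometric mean of the per-token probabilities. The sketch below assumes the local model returns token log-probabilities alongside its text. The callables local_model and cloud_model and the 0.80 threshold are hypothetical placeholders, and this is not how RouteLLM works, since it uses trained routing classifiers.

```python
import math


def geometric_mean_probability(token_logprobs):
    """Geometric mean of token probabilities, in the range (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer(prompt, local_model, cloud_model, threshold=0.80):
    """Try the local model first and escalate to cloud on low confidence.

    local_model(prompt) must return (text, list_of_token_logprobs).
    cloud_model(prompt) must return text.
    """
    text, logprobs = local_model(prompt)
    if logprobs and geometric_mean_probability(logprobs) >= threshold:
        return text, "local"
    return cloud_model(prompt), "cloud"


# Stub models for illustration only.
def confident_local(prompt):
    return "invoice", [-0.05, -0.10]


def unsure_local(prompt):
    return "maybe?", [-1.0, -2.0]


def cloud(prompt):
    return "cloud answer"


print(answer("Classify this document", confident_local, cloud))
print(answer("Plan a 50-step migration", unsure_local, cloud))
```

The threshold needs calibrating against labeled examples from your own workload, because raw token probabilities are not well calibrated across models or tasks.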
The implementation sequence: Start with cloud APIs for everything — time to first query is under an hour. Identify your highest-volume, lowest-complexity workloads after 30 days of production data. Deploy a local model for those specific tasks. Route sensitive data to local from day one if compliance requires it. Expand local coverage as utilization justifies additional hardware. The goal is not to eliminate cloud APIs. The goal is to stop paying frontier-model prices for tasks that a 30B open-source model handles identically.
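The second step of that sequence, finding high-volume and low-complexity workloads, can be sketched as a pass over 30 days of request logs. The record fields (task, tokens, complexity) and the 1-to-5 complexity scale are assumptions for illustration. Your logging schema will differ.

```python
from collections import defaultdict


def rank_local_candidates(request_log, max_complexity=2):
    """Rank task types by token volume, keeping only low-complexity ones.

    Each log record is a dict with hypothetical keys:
      task (str), tokens (int), complexity (int, 1 = trivial, 5 = frontier).
    A task is excluded if any of its requests exceeded max_complexity.
    """
    totals = defaultdict(int)
    worst_complexity = {}
    for rec in request_log:
        task = rec["task"]
        totals[task] += rec["tokens"]
        worst_complexity[task] = max(worst_complexity.get(task, 0),
                                     rec["complexity"])
    candidates = [(task, tokens) for task, tokens in totals.items()
                  if worst_complexity[task] <= max_complexity]
    return sorted(candidates, key=lambda item: item[1], reverse=True)


sample_log = [
    {"task": "ticket_classification", "tokens": 900_000, "complexity": 1},
    {"task": "contract_review_agent", "tokens": 400_000, "complexity": 5},
    {"task": "invoice_extraction", "tokens": 250_000, "complexity": 2},
    {"task": "ticket_classification", "tokens": 600_000, "complexity": 1},
]
print(rank_local_candidates(sample_log))
```

The top of that list is where a first local deployment pays off fastest. Anything filtered out stays on cloud APIs.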