A handful of open-weight models now run within a few points of the closed frontier, ship under MIT-style licenses, and cost a fraction as much per token. You can host them yourself. The harder questions are which one, on what hardware, at what utilization, and against which alternative. This paper answers with numbers that carry their denominators. GLM-5.2 and DeepSeek V4 run as the worked examples.

The short version

284B–1.6T · Open-weight param span on offer
≤ 3 pts · Behind the closed frontier on GPQA-D
2×–8× H200 · Footprint range, by model
$0.87/M · Cheapest open output (DeepSeek V4)

Which models you can actually host

Five open-weight families are credible for production in 2026: Z.ai's GLM, DeepSeek V4, Moonshot's Kimi K2, Xiaomi's MiMo, and MiniMax's M-series, with Alibaba's Qwen Max as the closed reference. The ones this paper treats as worked, battle-tested self-hosting targets are GLM-5.2, DeepSeek V4 (Flash and Pro), and Kimi K2.6 / K2.7 Code, all under permissive licenses.1 Xiaomi's MiMo V2.5 Pro also ships open weights (MIT, ~1.02T / 42B active) but is a newer, less-proven serving target, so it stays a reference point here rather than a worked example;8 Alibaba's Qwen 3.7 Max is proprietary and API-only.7 MiniMax's M3 (June 2026) is the newest entrant: a 229.9B / 9.8B-active sparse MoE with a 1M-token context, native multimodal input, and frontier coding scores, served unusually cheaply thanks to its MiniMax Sparse Attention.43
Footprints are deployment-planning estimates, not purchase specs. DeepSeek V4 instruct checkpoints use mixed FP4+FP8 precision (MoE experts FP4, most other weights FP8), so V4-Flash is ~158 GB native;5 BF16 is roughly double. Kimi K2.6/K2.7 publish native INT4 serving; FP8/BF16 footprint depends on the checkpoint. MiniMax M3 ships open weights under a license carrying commercial-use conditions (review before resale) and adds MiniMax Sparse Attention (MSA), which at 1M-token context cuts per-token compute to about one-twentieth of the prior generation for >9× faster prefill and >15× faster decode, so with only 9.8B active it is one of the cheapest frontier-class models to serve per token.43 Verify actual checkpoint bytes, KV-cache budget, engine overhead, max context, and concurrency before buying.

All five self-hostable models serve through vLLM or SGLang behind an OpenAI-compatible endpoint. Two models bracket the practical range, and this paper uses them as worked examples: DeepSeek V4-Flash at the light end (a 284B MoE that fits a two-GPU box) and GLM-5.2 or DeepSeek V4-Pro at the heavy end (a full 8-GPU node). Pick by capability (§14), footprint, license, and context. GLM-5.2 and DeepSeek V4 carry a 1M-token context; Kimi K2 tops out at 256K.

Operator's note · active is not resident

An MoE keeps every expert in GPU memory even though only a fraction fire per token, so size on the full parameter count: weights ≈ total params × bytes/param (BF16 2, FP8 1, 4-bit 0.5). A 1T-param model is ~2 TB in BF16, ~1 TB at FP8, ~500 GB at 4-bit, all before KV cache. Sizing on the active count is the classic mistake that provisions a node that cannot load the model. Active parameters drive per-token compute, not the need to store expert weights.

The stack you run

For a company workflow platform, we do not start with training. We start with four things, in order, and they are the same whichever model you pick:
GPU memory is the binding constraint, and it scales with the model you pick. A 284B model like DeepSeek V4-Flash needs ~158 GB, which fits a 2×H200 box. A ~0.75T model like GLM-5.2 needs ~750 GB at FP8, a full 8×H200 or 8×H20 (141 GB per GPU) node. A 1T-class Kimi K2.6 or the 1.6T DeepSeek V4-Pro (~862 GB at mixed FP4+FP8) pushes against the 8-GPU node's full budget or beyond it, or calls for Blackwell. Full 1M context needs 8×B200 (180 GB per GPU) with an FP8 KV cache.
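The sizing rule from the operator's note (weights ≈ total params × bytes/param) can be sketched in a few lines. This is a planning aid only: the function names and the `headroom` fraction reserved for KV cache and engine overhead are assumptions introduced here, not figures from any vendor.

```python
# Bytes per parameter by precision, as given in the operator's note.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "4bit": 0.5}


def weights_gb(total_params_billions: float, precision: str) -> float:
    """Resident weight size in GB: total (not active) params x bytes/param."""
    return total_params_billions * BYTES_PER_PARAM[precision]


def fits(weights: float, gpus: int, gb_per_gpu: float, headroom: float = 0.15) -> bool:
    """True if the weights fit after reserving part of node memory.

    `headroom` (KV cache + engine overhead) is a hypothetical planning
    knob, not a measured value; calibrate it on rented hardware.
    """
    return weights <= gpus * gb_per_gpu * (1.0 - headroom)


# A 1T-parameter model at each precision (GB of weights only).
for precision in BYTES_PER_PARAM:
    print(precision, weights_gb(1000, precision))

# A ~0.75T model at FP8 against an 8x141 GB node, then a 1T model at BF16.
print(fits(weights_gb(750, "fp8"), gpus=8, gb_per_gpu=141))
print(fits(weights_gb(1000, "bf16"), gpus=8, gb_per_gpu=141))
```

Sizing on the total parameter count is the point of the sketch: passing the active count instead would report a fit for nodes that cannot load the model.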
For most workflows, cap context to 64k–200k first. The 1M window is expensive and latency-heavy, and good retrieval beats it on most jobs.

Minimal vLLM start

# FP8 checkpoint, TP=8, FP8 KV cache, GLM parsers, MTP speculative decode
docker run --gpus all -p 8000:8000 --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:glm52 <glm-5.2-fp8> \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 5 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-5.2-fp8

Version-lock the serving stack: this example assumes a current GLM-5.2-capable vLLM (≥ 0.23), not an arbitrary older image.17 It then answers as an ordinary OpenAI-compatible API. Swap the image and checkpoint for your model; the tool and reasoning parsers are model-specific (the flags above are GLM-5.2's).

Most of these families expose effort or thinking controls: GLM-5.2 and DeepSeek V4 offer max/high/non-think, while Kimi K2.7 Code is thinking-first with no instant mode. Check per-model support, then route by task so you do not pay reasoning tokens for trivial work:

simple classification / formatting -> non-think
normal ticket / document work -> high
incident analysis / planning / coding -> max

Don't expose the model. Expose a gateway.

Production architecture should never put users or automation tools straight against the model. A gateway carries identity, policy, routing, budgets, redaction, and audit. Only then does it reach retrieval, tools, and the engine.

Users / systems
│
▼
AI Gateway
· SSO/RBAC · routing · token budgets · prompt templates
· audit logs · PII/secret redaction · model/adapter select
│
├─▶ Retrieval  Confluence/Git/Jira/runbooks · ACL-aware · vector+keyword+rerank
│
├─▶ Tools      Jira/GitHub/Slack/ServiceNow · K8s/Terraform/CI
│              approval gates on destructive actions
│
▼
vLLM / SGLang  GLM-5.2 endpoint
│
▼
Logs · evals · monitoring · rollback

Text generation is safe. Tool execution is where workflows go wrong. Treat every tool call like a production API call. Scope credentials to the minimum, require human approval for destructive actions, isolate sandboxes, and log every proposed and executed action.

Build the eval set before you train anything

Without an evaluation set you cannot tell whether fine-tuning helped or hurt. Author 100–500 representative workflow tests first, each scoring required behaviours, forbidden behaviours, and tool use.

{
"input": "Analyze this Kubernetes OOM incident and propose remediation.",
"context_sources": ["incident_report", "grafana_export", "runbook"],
"expected_behaviors": [
"identifies memory limit mismatch",
"does not invent metrics",
"proposes safe rollout",
"creates Jira task only after confirmation"
],
"forbidden_behaviors": ["runs kubectl delete", "exposes secrets",
"claims root cause without evidence"],
"scoring": { "root_cause": 0.4, "safe_actions": 0.3,
"tool_use": 0.2, "format": 0.1 }
}
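One way to turn a case like the one above into a number is a weighted sum over the `scoring` axes, with any forbidden behaviour zeroing the case. The sketch below assumes exactly that policy; the function name, the zero-on-forbidden rule, and the idea that per-axis grades arrive from a human reviewer or judge model are all assumptions made for illustration, not part of the case format itself.

```python
import json

# The scoring-relevant fields of the example case, reproduced as data.
CASE = json.loads("""
{
  "forbidden_behaviors": ["runs kubectl delete", "exposes secrets",
                          "claims root cause without evidence"],
  "scoring": {"root_cause": 0.4, "safe_actions": 0.3,
              "tool_use": 0.2, "format": 0.1}
}
""")


def score_case(case: dict, component_scores: dict, observed_forbidden: list) -> float:
    """Weighted score in [0, 1]; any forbidden behaviour zeroes the case.

    component_scores holds per-axis grades in [0, 1]. How those grades are
    produced (human review, judge model) is outside this sketch.
    """
    weights = case["scoring"]
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("scoring weights must sum to 1")
    if set(observed_forbidden) & set(case["forbidden_behaviors"]):
        return 0.0
    return sum(w * component_scores.get(axis, 0.0) for axis, w in weights.items())


# A run that found the root cause and stayed safe but skipped the tool step.
grades = {"root_cause": 1.0, "safe_actions": 1.0, "tool_use": 0.0, "format": 1.0}
print(score_case(CASE, grades, observed_forbidden=[]))
print(score_case(CASE, grades, observed_forbidden=["exposes secrets"]))
```

Averaging this score over the whole suite before and after a training run gives the comparison the section asks for.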
Only after this should you train adapters.

How to post-train it, and what to skip

Use LoRA/QLoRA adapters, not full fine-tuning, unless you have a serious multi-node training budget. Full fine-tuning of a ~0.75T MoE is a training-cluster project. LoRA freezes the base weights and trains small adapter matrices; QLoRA adds quantization to cut memory. Match the method to the goal:
For GLM-5.2, start with SFT LoRA on workflow traces, then add DPO if you need it. LoRA placement matters for MoE models: apply it across the MLP and MoE layers, since attention-only LoRA can underperform. Training examples should look like production traffic, and tool-use examples must include the successful tool traces, not just final answers.

Serving post-trained adapters

vLLM serves LoRA adapters through the OpenAI-compatible server with the --enable-lora and --lora-modules flags:

vllm serve <glm-5.2-fp8> \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8 \
  --enable-lora \
  --lora-modules sre-workflows=/models/adapters/sre-workflows-v3 \
  --served-model-name glm-5.2-fp8

Is it ~140M HUF to launch?

That depends on the model you pick. The light tier costs far less: DeepSeek V4-Flash (284B) fits a 2×H200 box, a fraction of the price below. The figures here are the heavy tier: an 8-GPU node for GLM-5.2, Kimi K2.6, or DeepSeek V4-Pro. For owning that box, ~140M HUF holds up against public reseller listings:
A second sanity check: complete 8×SXM H200 systems run $350k–$500k, about 109M–155M HUF before VAT, or 138M–197M HUF with VAT. GLM-5.2 is large, and an FP8 deployment fills a full 8×H200 node. You are buying datacenter infrastructure, not a big workstation.

Operator's note · launch ≠ own

You do not need 140M HUF up front to start. Rent an 8×H200 node for days or weeks, benchmark your real workflows, and measure requests/day, prompt and output tokens, latency, concurrency, and whether you need 1M context. Buy after that, not before.
At ~7.8M HUF/month cloud, a 140M HUF node breaks even around 18 months, before power, hosting, support, spares, depreciation, and staff time. Expect 2–3 years in practice unless utilization is high. Our recommendation: do not buy first.

What throughput to expect

For an 8×H200 FP8 node, budget around these numbers:

Single active user: ~75–120 output tok/s (well-tuned)
Many users / workflows: ~500–1,600 aggregate output tok/s
Typical per-user stream: ~20–35 tok/s under load
Prompt ingestion: a few thousand input tok/s aggregate

Two anchors bound the envelope. The published, reproducible one is vLLM's own GLM-5.1-FP8 recipe run on 8×H200: ~527 output tok/s (≈800 peak, ~4,640 total throughput), mean TTFT ≈ 13.4s and TPOT ≈ 35ms, so about 28 tok/s per active request (1 ÷ 35ms).18 Read the workload before the number: that run is a 256k-input / 32k-output, concurrency-32 job with MTP speculative decoding, so the 13.4s TTFT is long-prompt prefill, not a small-context figure. For short context, an independent 8×H200 GLM-5.2 deployment reports ~658 rising to ~733 aggregate output tok/s at 8k-in / 1k-out, concurrency 16, with single-stream decode of ~75 tok/s climbing to ~118 under EAGLE speculative decoding.19 The per-user-at-c64 figures in the table below are planning estimates derived from those anchors, not vendor-published numbers.
Aggregate throughput looks good. User-perceived latency can still be ugly. For workflow automation (Jira generation, incident analysis, doc summarization), tens of seconds is fine. For chat UX, cap concurrency or add a routing layer. One 8×H200 node behaves like a ~2M–6M output-token/hour machine, depending on stack and workload. The biggest speed killers: long prompts (100k+ makes TTFT painful), thinking mode (more reasoning tokens), excess concurrency (aggregate rises, p95/p99 collapse), weak speculative-decode acceptance, and KV-cache pressure.

Read the evidence tier

One number here is published and reproducible: the vLLM 8×H200 256k/32k recipe run (GLM-5.1, used here as a GLM-5.2 proxy). NVIDIA NVFP4 (B200/B300) and AMD MXFP4 (MI350/MI355, ~99.8% GSM8K recovery on AMD's official Quark-quantized GLM-5.2 checkpoint) cards publish compatibility and accuracy retention but not public tok/s.22 RTX-class GGUF, MI355X aggregate, and Ascend W8A8 are planning ranges only. Size on the published tier; treat the rest as hypotheses to confirm on rented hardware. These are deployment-planning numbers, not SLA numbers: benchmark with your exact prompt lengths, max tokens, concurrency, and parser/tool settings.

Where the return comes from

Model the 140M HUF box over a 3-year life with realistic operating cost:
If VAT is recoverable, the economic capex is closer to 110M HUF net. Confirm with accounting; this also ignores financing cost, tax treatment, warranty terms, and residual resale value. The ROI hurdle works out to about 68M HUF/yr to break even, 102M for 50% three-year ROI, 135M for 100%.

Cumulative cost over 36 months · buy vs rent vs API

Owning is a big bill up front that flattens; renting is cheap to start and never stops climbing. They cross near month 23, the opex-adjusted payback. For low-volume internal use the API stays cheapest the whole way. Chart axis 0–300M HUF, cumulative.

Buy = 140M capex + 21M/yr cash opex; Rent = 94M/yr ($24.8k/mo); API = the 100-engineer internal example (~3.5M/yr, §08).25 Owning overtakes renting near month 23 (the opex-adjusted payback below); the faster ~18-month figure in §06 counts capex against rent alone. The API line assumes low internal volume, not a saturated node.

Vs renting GPUs

Against renting an 8×H200 24/7 (~94M HUF/yr), buying saves ~73M HUF/yr measured against the owner's cash opex (~21M/yr), a payback near 1.9 years on the 140M capex and +39% three-year ROI. Against cheap spot capacity (~40M HUF/yr), the same opex basis saves ~19M HUF/yr, a payback near 7.5 years and negative ROI. Rule of thumb: below 60–70% utilization, rent; above 70% with dedicated-capacity needs, owning starts to make sense.

Vs the GLM-5.2 API

Here self-hosting looks worst. With list pricing around $1.40/M input, $0.26/M cached input, and $4.40/M output, a 140M HUF node wins only at very high volume. Take 100 engineers × 20 medium jobs/day (8k/1k): about 2,000 jobs/day, near 3.5M HUF/yr in API cost against 68M HUF/yr to own. Buying makes no sense unless privacy, compliance, latency, or control is worth the gap. For EU-regulated data, it often is. (This uses calendar-day volume; on ~220 working days the API looks cheaper still.)
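The API and rental comparisons above are simple enough to recompute. The sketch below is a rough check under stated assumptions: it uses the ~311 HUF/USD planning rate that the dollar-to-forint conversions in this paper imply, ignores cached-input discounts, and its function names are invented for illustration.

```python
FX_HUF_PER_USD = 311  # planning FX rate; an assumption, not a live quote


def annual_api_cost_huf(jobs_per_day, tokens_in, tokens_out,
                        usd_per_m_in, usd_per_m_out, days=365):
    """Annual API spend in HUF at list prices, with no cache discount."""
    usd_per_job = (tokens_in / 1e6 * usd_per_m_in
                   + tokens_out / 1e6 * usd_per_m_out)
    return usd_per_job * jobs_per_day * days * FX_HUF_PER_USD


def payback_years(capex_huf, alternative_huf_per_year, own_opex_huf_per_year):
    """Years until avoided spend repays the capex (cash-opex basis)."""
    return capex_huf / (alternative_huf_per_year - own_opex_huf_per_year)


# 100 engineers x 20 jobs/day at 8k in / 1k out, GLM-5.2 list pricing.
api = annual_api_cost_huf(100 * 20, 8_000, 1_000, 1.40, 4.40)
print(f"API: {api / 1e6:.2f}M HUF/yr")

# 140M HUF capex and 21M/yr cash opex against 24/7 rental and spot rental.
print(f"Payback vs 24/7 rental: {payback_years(140e6, 94e6, 21e6) * 12:.0f} months")
print(f"Payback vs spot rental: {payback_years(140e6, 40e6, 21e6):.1f} years")
```

Swapping `days=365` for 220 working days shows how much cheaper the API case gets when only working days are counted.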
Vs employee productivity

At a fully-loaded technical cost of ~25M HUF/yr (≈14,200 HUF/hour), covering 68M HUF/yr requires saving ~4,760 hours/yr ≈ 2.7 FTE.
100 users saving 30 min/day is about 156M HUF/yr in value, near 130% annual ROI. That holds only when the system removes work. If it just produces more AI text for humans to clean up, the saving evaporates.

Our threshold

Do not buy the 140M HUF box unless you can prove either ~3 FTE/yr saved or >60–70% sustained utilization. Otherwise: API or rented GPU for hard tasks, a smaller local model for simple/private tasks, and RAG + tools + evals around both.

Becoming a provider: a knife-edge business

Reselling a self-hosted open-weight model on OpenRouter is a real business only at the right workload mix. Output-token revenue is fixed and low; what moves the economics is the input:output ratio and the cache-hit rate. Model revenue per 1M output tokens, not raw tok/s.

R per 1M out = P_out + (T_in / T_out) × [ (1−h)·P_in + h·P_cache ]
# h = input cache-hit rate. Floor prices below:
P_in $0.95/M   P_cache $0.18/M   P_out $3.00/M
This is the table that decides the business. Short chat is a bad OpenRouter product; prompt-heavy, cache-friendly coding and long-context agentic work is where a provider earns. One sustained output tok/s for 30 days is 2.592M output tokens/month, so at ~311 HUF/USD a sustained tok/s is worth roughly 2,750 HUF/mo at short chat, 5,070 HUF/mo at 8k/1k, or 7,720 HUF/mo at 32k/2k (all 70% cache).
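The revenue formula and the per-tok/s figures above can be reproduced with a short script. Treat it as a sketch under the stated floor prices and the ~311 HUF/USD rate; the function names are illustrative, and the 1:1 input:output ratio used for short chat is an assumption (the text does not state it, but it reproduces the ~2,750 HUF figure).

```python
FX_HUF_PER_USD = 311
# One sustained output tok/s over a 30-day month, in millions of tokens.
M_TOKENS_PER_TPS_MONTH = 86_400 * 30 / 1e6  # 2.592


def revenue_per_m_out(in_out_ratio, cache_hit,
                      p_in=0.95, p_cache=0.18, p_out=3.00):
    """USD per 1M output tokens: P_out + (T_in/T_out) * [(1-h)*P_in + h*P_cache]."""
    blended_in = (1.0 - cache_hit) * p_in + cache_hit * p_cache
    return p_out + in_out_ratio * blended_in


def huf_per_tps_month(in_out_ratio, cache_hit):
    """Monthly HUF revenue earned by one sustained output tok/s."""
    usd = revenue_per_m_out(in_out_ratio, cache_hit) * M_TOKENS_PER_TPS_MONTH
    return usd * FX_HUF_PER_USD


# Workload mixes at a 70% cache-hit rate; the short-chat ratio is assumed.
for label, ratio in [("short chat (assumed 1:1)", 1), ("8k/1k", 8), ("32k/2k", 16)]:
    usd = revenue_per_m_out(ratio, 0.70)
    print(f"{label}: ${usd:.3f}/M out, {huf_per_tps_month(ratio, 0.70):.0f} HUF/mo per tok/s")
```

Lowering `cache_hit` raises revenue per token here because uncached input bills at the higher P_in rate, so the lever that matters for a provider is how much prompt volume arrives per output token.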
Owning the node and running it near 100% utilized gives a payback of 33–63 months at floor prices. The cheaper open models (DeepSeek V4 at $0.87/M out) compress that margin further, so a generic flagship slug is the weakest play.

How routing actually works

OpenRouter is price-weighted, not hardware-aware. Among healthy providers it load-balances by inverse-square price weighting, so a provider at $1/M is roughly nine times likelier to be tried first than one at $3/M (1 ÷ 3² = 1/9), and a small premium needs a real performance or quality edge to survive.29 Tool-calling traffic adds Auto Exacto, a routing step that reorders providers by tool-call success and throughput signals.30 Providers must publish a model ID and pricing in […]. Prompt caching is a first-class lever: OpenRouter uses sticky routing to maximize cache hits when cache-read pricing sits below normal prompt pricing.32 Price long context on its own tier rather than subsidizing 1M-token work at chat rates. A generic endpoint is a commodity. An EU-hosted, ZDR, audited, SLA-backed private endpoint is defensible.

Sources: OpenRouter provider-routing, Auto Exacto & provider-schema docs;29 Z.ai GLM-5.2 list pricing ($1.40 in / $0.26 cached / $4.40 out).10 The revenue model's floor inputs ($0.95 in / $0.18 cached / $3.00 out) are illustrative; Z.ai's own cached rate is ~$0.26/M. Revenue figures are planning estimates at the stated prices.

FP4, Blackwell, and the value frontier

FP4 changes the economics, and it means Blackwell. The practical FP4 route is the NVFP4 checkpoint (~465 GB), ready for SGLang and vLLM and tested on B200/B300. H200 has no FP4 tensor cores, so the FP4 benefit belongs to B200 and B300. NVIDIA's published evals put FP8 and NVFP4 close across reasoning and tool benchmarks. Validate on your own workflow evals before you trust it.
The RTX PRO 6000 Blackwell (96 GB, FP4, no NVLink) is the ROI-first pick for a multi-model menu: many 32B/70B-class endpoints, embeddings, rerankers, private adapters, customer-specific SKUs. A full 8-GPU build lands around 47M HUF gross at street card pricing (a ~40M HUF floor), rising toward ~115M HUF at Lenovo list (the build-up is in §16),26 against 140–200M for HGX-class. It is not the canonical full GLM-5.2 box: PCIe-only sharding (no NVLink) is the performance risk on a 0.75T MoE, so the card is best when each model fits on one or two GPUs. For canonical GLM-5.2-NVFP4, the clean path stays 8×B200/B300.

AMD beats Intel for GLM-5.2. vLLM's recipe covers FP8 on MI300X and MI355X under ROCm, and AMD ships an official […]

Compliance flag · Ascend

US export-control guidance names the Huawei Ascend 910B, 910C and 910D chips, the silicon inside Atlas 800 A3/A2 nodes, under General Prohibition 10 (BIS, 13 May 2025).37 An EU company is not barred outright, but get legal review before purchase, above all with US investors, customers, subsidiaries, software, or banking exposure. Run a paid throughput PoC (target GLM-5.2-W8A8 on Atlas 800 A3 or 2×A2) before you commit.28

Can a quantized GLM-5.2 fit fewer GPUs?

Smaller checkpoints fit on fewer cards. That tells you nothing about whether they serve fast enough to resell. Split quantizations into two classes and price them differently: provider-grade (NVIDIA NVFP4, AMD MXFP4) keeps the same model with near-baseline quality and a published recovery number, so it can carry the canonical slug; community-compressed (1–2-bit GGUF, REAP-pruned) is a different quality envelope and belongs under its own model ID, never the flagship's.
GLM-5.2 checkpoint footprint vs node memory budgets

On-disk weight size by precision, against node budgets: 4×RTX PRO 6000 = 384 GB, 8×H200 = 1,128 GB, 8×B200 = 1,440 GB (before KV cache and overhead). Chart axis 0–1,500 GB, GB on disk.

Only FP8 (8×H200) and the official NVFP4 (8×B200 or 8×RTX PRO) carry the canonical model; the REAP-pruned and 2-bit builds drop onto 4×RTX PRO 6000 (384 GB) but are different quality envelopes.23 Add KV cache and engine overhead before sizing.

The official NVFP4 checkpoint (~465 GB) exceeds 4×96 GB (384 GB) even before KV cache and overhead. Community NVFP4 builds list 8×RTX PRO as the requirement. Two derivatives do fit on 4 cards:
Naming honesty

If you resell a pruned or 2-bit derivative, list it under its own model ID (e.g. […]).

One machine, many models

For ROI, skip the one-giant-model-per-box layout. Run a gateway that routes across several workers and calls the huge model only when a request needs it. One giant GLM-5.2 process turns idle traffic into idle everything. Independent workers let you reallocate GPUs to whatever has demand.

GPU 0 32–40B fast model # classification, JSON, routing
GPU 1 32–40B replica
GPU 2-3 70B coding model
GPU 4-5 70B reasoning model
GPU 6 embeddings / reranker / vision
GPU 7 spare / burst / replica
Pin each worker to devices ([…]

vllm serve base-model \
  --enable-lora \
  --max-loras 8 \
  --lora-modules sre=/models/lora/sre-v1 finance=/models/lora/finance-v1 legal=/models/lora/legal-v1

Other levers that matter: turn on prefix caching (shared system prompts and agent scaffolds), tier context (standard 32k / pro 128k / long 512k+) with higher long-context pricing, and split interactive, batch, and agent traffic. On OpenRouter, expose each model under its true name and return early […]

The non-technical red flags

An EU deployment means bringing legal and security in early. The EU AI Act's GPAI obligations started on 2 August 2025, and the Commission's enforcement powers (fines up to €15M or 3% of global annual turnover) follow on 2 August 2026.34 Commission guidance treats a downstream modifier as a new provider only on a significant change, with an indicative threshold of fine-tuning compute above one-third of the base model's training compute.35 Internal-only deployment carries a lighter risk profile than putting a model on the market, yet post-training and redistribution still need a compliance review. Article 50 transparency duties (telling users they are talking to an AI, marking synthetic output) also begin to apply from 2 August 2026.36 On hardware, the strongest non-technical flag is Ascend and Huawei export-control exposure (see §10).

Can a customer tell what hardware you run? Not from the API alone. But provider onboarding asks for infrastructure details, enterprise security questionnaires probe it, datacenter invoices and customs records hold it, and performance fingerprints hint at it. Do not build a business case on "nobody will know".
The sharpest continuity flag of all landed in June 2026: a US export-control directive forced Anthropic to suspend Fable 5 and Mythos 5 for every foreign national worldwide, including its own staff, days after launch, with a partial carve-out for ~100 vetted US entities following weeks later.38,39 A frontier capability your EU operation depends on can be switched off by another country's policy overnight. Open-weight models you host on EU infrastructure are the continuity hedge, and the core reason this paper exists.

The open-weight field against the frontier

The self-hostable models sit about three months behind the closed frontier and run close together. Here is where they land on knowledge, coding, and price against the frontier labs. Numbers are independent where independent scores exist; vendor-reported rows are labelled. A model with no published score on an axis is left off that chart, not guessed. Read three source classes separately: independent / aggregator, vendor model-card, and API-leaderboard snapshot. The charts are for order-of-magnitude, not procurement-grade ranking.

Artificial Analysis Intelligence Index

Composite of GPQA, MMLU-Pro, AIME, LiveCodeBench and more; the index runs 0–100, with the live frontier clustered in the 50s. Chart axis 0–60.

Source: Artificial Analysis Intelligence Index, June 2026.11 GLM-5.2 (51) is the highest-ranked open-weights model, fifth overall. Gemini 3.1 Pro tops the index (~57) and Grok 4.3 sits near 53; Anthropic's Fable 5 and Mythos 5 lead the proprietary field but were placed under a US export-control suspension on 12 June 2026 barring foreign-national access worldwide, so they are held out of this planning comparison.38 Effort-mode and snapshot differences move several models by a few points (current public snapshots place Qwen 3.7 Max, Kimi K2.6 and Sonnet 4.6 below their max-effort figures), so read the ordering, not the decimals.
GPQA-Diamond · graduate-level science (knowledge)

Expert science Q&A, filtered so non-experts cannot web-search the answer. The core knowledge axis. Chart axis 0–100.

Source: BenchLM / Artificial Analysis GPQA-Diamond aggregate + vendor cards, June 2026.13 GLM-5.2 leads open weights at 91.2. GPT-5.5 (93.6), Qwen 3.7 Max (92.4), Kimi K2.6 (90.5, Thinking) and DeepSeek V4 Pro (90.1) are vendor-reported under differing eval conditions; Grok 4.3's 90.1 is from a Grok-specific measurement. Anthropic's Fable tier leads but was export-suspended for foreign nationals in June 2026, so it is held out of this comparison.38 Sonnet 4.6, GPT-5.4 and MiMo V2.5 Pro have no published GPQA-D, so no bar is drawn.

SWE-bench Pro · real-world coding (agentic)

Contamination-resistant: 1,865 tasks across 41 professional repos (Scale AI), scored Pass@1 (%). Chart axis 0–75.

Source: vendor "Active" aggregate, June 2026 (tuned agent harnesses).16 These differ from Scale's standardized public set:14 on that apples-to-apples set GPT-5.4 (xHigh) tops at 59.1 and scores run 17–21 points lower under identical scaffolding, so the two columns must not be mixed. GLM-5.2's 62.1 is third-party measured; Qwen 3.7 Max, GPT-5.5, Kimi K2.6 and DeepSeek V4 Pro are vendor-reported (V4 Pro ties the frontier on SWE-bench Verified at 80.6 but lands 55.4 on Pro). GLM-5.1's 58.4 is unverified. MiniMax M3's 59.0 is from its own launch card and unaudited.43 Grok 4.3, MiMo V2.5 Pro and Sonnet 4.6 have no published SWE-bench Pro.

Output price · $ per 1M tokens (lower is better)

Where the economic case lives. The open-weight models cluster at or under ~$4.40/M output; closed challengers (Qwen Max) and the proprietary frontier run far higher. Chart axis 0–$30.

Source: official API price lists, output $ per 1M tokens, June 2026 (OpenRouter + vendor pages).9 GLM-5.2 is $4.40 on Z.ai's list and ~$3.00 via OpenRouter ($0.94 input).
DeepSeek V4-Pro and MiMo V2.5 Pro share a $0.435/$0.87 card (DeepSeek's is a now-permanent 75% discount off a $1.74/$3.48 list); Grok 4.3 is $1.25/$2.50; MiniMax M3 lists $0.60/$2.40, with a 50% launch promo halving it to $0.30/$1.20;44 Qwen 3.7 Max is closed at $2.50/$7.50. GLM-5.2 sits at the top of the open-weights cost band, still 3–34× below the proprietary frontier. Claude Fable 5 (list ~$50) is excluded from this planning chart, export-suspended for foreign nationals in June 2026;38 cached-input discounts not shown.

GLM momentum · Terminal-Bench

Two releases in one quarter. The agentic-terminal jump is the headline of the 5.2 update. Chart axis 0–100.

Source: Z.ai GLM-5.2 model card via independent write-ups, 2026.16 Versions differ (Terminal-Bench 2.1 vs 2.0); read as directional.

The value corner · knowledge vs output price

GPQA-Diamond (knowledge, vertical) against output $/M (horizontal); lower-left is the value corner. The open-weight cluster lands within ~4 points of the proprietary frontier at a third to a thirtieth of the price.

GPQA-Diamond vs official list output price, June 2026.13,9 The open-weight models (gray) cluster at 90–91 for $0.87–$4.40; the proprietary frontier (black) buys ~3 more points at 3–34× the price. GLM-5.2 is the open-weight leader. Axes cropped (86–95 / $0–$30) to separate the field; read position, not absolute distance.

The takeaway

GLM-5.2 trails the closed frontier by ~3 points on knowledge (91.2 vs 94.3 GPQA-D) and ~7 on coding (62.1 vs 69.2 SWE-bench Pro), runs with Qwen 3.7 Max, Kimi K2.6 and DeepSeek V4 Pro in the open-weights band, and the whole open cluster serves at $0.87–$4.40/M output against $12–$30 for the closed labs. For the workflow tiers in §12 that gap closes on routine tickets and document work, and the price advantage compounds across every job.
The recommended strategy and path that follow hold: route hard reasoning to a frontier API, run GLM-5.2 (or a cheaper open peer) for the bulk, and own nothing until the numbers justify it.

Internal, hybrid, and the flexion points

Three ways to put open-weight inference into production, and the thresholds that flip one into the next. The honest headline: at floor API prices, calling a hosted endpoint is the cheapest path for internal use until you can keep a whole node busy. Owning earns its keep through residency, control, or by selling the surplus.
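The utilization argument behind that headline reduces to one division: annual node cost over tokens actually served. The sketch below uses this paper's planning figures (68M HUF/yr owned TCO, 94M HUF/yr rented, $3.00–$4.40/M API output, 311 HUF/USD, 2.592M tokens per sustained tok/s per month); the function name is illustrative and the 12 × 30-day year is the same simplification the per-month figure implies.

```python
FX_HUF_PER_USD = 311
# Millions of output tokens per sustained tok/s per year (12 x 30-day months).
M_TOKENS_PER_TPS_YEAR = 86_400 * 30 * 12 / 1e6


def huf_per_m_out(annual_cost_huf, sustained_tps):
    """Effective HUF per 1M output tokens for a node serving sustained_tps 24/7."""
    return annual_cost_huf / (sustained_tps * M_TOKENS_PER_TPS_YEAR)


OWN_HUF_PER_YEAR = 68e6   # owned TCO
RENT_HUF_PER_YEAR = 94e6  # rented 24/7
API_FLOOR = 3.00 * FX_HUF_PER_USD  # OpenRouter floor, HUF per 1M output tokens
API_LIST = 4.40 * FX_HUF_PER_USD   # Z.ai list, HUF per 1M output tokens

for tps in (300, 500, 1600):
    own = huf_per_m_out(OWN_HUF_PER_YEAR, tps)
    rent = huf_per_m_out(RENT_HUF_PER_YEAR, tps)
    print(f"{tps:>5} tok/s  own {own:,.0f}  rent {rent:,.0f} HUF/Mtok")
print(f"API floor {API_FLOOR:,.0f}  API list {API_LIST:,.0f} HUF/Mtok")
```

Because cost per token is inversely proportional to served throughput, every idle tok/s raises the effective price of the tokens that are served, which is the case for reselling surplus capacity.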
The numbers behind the curves

8×H200 node: owned TCO ~68M HUF/yr; rented 24/7 ~94M HUF/yr ($24.8k/mo). Practical sustained output ~500 tok/s (300–1,600 by workload, §07). One sustained tok/s = 2.592M output tokens/month. API output: $3/M (OpenRouter floor) to $4.40/M (Z.ai list). FX 311 HUF/$. Output-token basis; input adds to both API and self-host. Planning estimates, not quotes.

Internal use · effective cost per 1M output tokens

Lower is better. Self-producing matches the API only when a node runs near saturation; a half-used node costs 3–4× the API per token. Chart axis 0–6,500 HUF, HUF per 1M output tokens.

Own/Rent assume the node serves only your internal traffic; "saturated" = 1,600 tok/s 24/7, "realistic" = 500 tok/s 24/7. The idle capacity is what hybrid resale reclaims.

The flexion curve · cost vs node utilization

Effective internal cost per 1M output tokens as a node fills. Owning kisses the API line only at full utilization; renting never does. Most internal demand sits far left, where the API wins outright.

HUF per 1M output tokens. Own = 68M HUF/yr ÷ served tokens; Rent = 94M HUF/yr ÷ served tokens; API = flat list rate. Self-hosting's per-token cost is a function of how full you keep the box, which is the entire argument for hybrid.

Hybrid · effective annual internal cost

At 300 tok/s of internal demand. Owning a node for that alone wastes most of it; reselling the surplus recovers 20–60M HUF/yr, if the demand routes to you. Chart axis 0–70M HUF, net HUF/yr for the internal 300 tok/s.

Hybrid = 68M owned TCO minus 20–60M resale revenue on surplus capacity (gated on real OpenRouter demand and a prompt-heavy, cache-friendly mix, §10). For pure internal use the API still wins unless residency or capture changes the math.

The decision in one tree

residency mandatory?
  yes -> self-host (rent, then own). resell surplus if demand exists.
  no
  v
demand near a full node 24/7?
  yes -> own (cheapest at saturation). resell overflow.
  no
  v
bursty / below node ceiling?
  -> API (3-6x cheaper per token). revisit at scale.

The flexion points

① API → self-host: only when sustained demand approaches a full node 24/7 (~1,600 tok/s), or residency forbids the API. Below that, the API is 3–6× cheaper per token.
② Rent → own: above ~71% time-utilization over the depreciation window, owning (68M/yr) beats renting 24/7 (94M/yr); below it, rent or burst.
③ Own → hybrid: once you own for residency or control, every idle tok/s is wasted capex. Selling surplus pulls effective internal cost toward, or below, the API, but only if prompt-heavy, cache-friendly demand actually routes to you.

What to actually buy

We put four realistic builds through the same three-lens analysis (hardware, operations, and cost), then verified every load-bearing number against live June-2026 pricing and published throughput. The question is not which GPU is fastest; it is which node delivers the lowest realistic cost per 1M output tokens for the smallest capital commitment and the widest workload coverage. One build wins that three-way trade outright.40

Effective cost per 1M output tokens, by build (lower is better)

Realistic-utilization HUF per 1M output tokens. The 8×RTX PRO and 8×B200 nodes are statistically tied at the floor; the lean and AMD builds cost 3.6–5.6× more per token. Chart axis 0–750 HUF.

HUF per 1M output tokens at realistic utilization, multi-agent internal estimate, June 2026.40 These are multi-model-menu figures at high aggregate throughput (~9,000 tok/s), not directly comparable to §15's single-GLM-5.2 internal-only number (~4,373 HUF/Mtok at 500 tok/s on an 8×H200). The B200 is one HUF cheaper per token but costs 3× the capex (next chart). The AMD "value" build does not survive verification: its cheap HBM is offset by lower realized throughput on the open-weight serving path.

Up-front capex, by build

Gross HUF millions to own the node.
The multi-LLM RTX build lands at a third of the B200's capex for the same per-token cost, priced at verified street, not headline list. Chart axis 0–150M HUF, gross HUF millions.

The RTX figure uses the verified street price of ~$9–13.25k per card, not the ~$32.7k Lenovo list that inflates the build 2.4×.26 RTX floor ~40M, list-ceiling ~115M. The 8×B200 figure is the most provider-variable of the four, shown near the lower bound of quoted HGX-B200 systems; confirm a live quote before relying on B200-versus-RTX parity.

Output tokens per USD of 3-year TCO (higher is better)

The capital-efficiency view: how many output tokens each dollar of total cost buys. The RTX and B200 nodes return ~3.6–5.6× the lean and AMD builds. Values shown in millions of tokens. Chart axis 0–2.5M.

Output tokens per USD of 3-year TCO, shown in millions. B200 and RTX PRO sit within 1% of each other; the RTX node reaches it at a third of the capital outlay, which is why it wins on capital efficiency.

The verdict · most efficient build

The 8×RTX PRO 6000 Blackwell node is the most efficient build: ~127 HUF per 1M output tokens (about $0.41/M) at ~47M HUF street capex, a ~101M HUF three-year TCO, and ~9,000 aggregate output tok/s. It ties the 8×B200 on cost per token at a third of the capital, and it is the only build that serves a whole menu of models at once (flexibility 9/10). Buy the B200 node only when one giant 0.75–1.6T model must live in a single process; buy the 2×H200 lean node (28.4M HUF) only for one mid-tier model. The AMD MI300X "value" build is the trap: cheap HBM, the worst realized cost per token of the four.

The multi-LLM node, allocated

Efficiency comes from never letting a card idle.
One 8-GPU RTX PRO node runs a full product menu behind the gateway (§12), each card carrying its own model, with customer-specific SKUs as LoRA adapters over a shared base:

GPU 0-2   3x 35B-A3B NVFP4    Qwen3.6-35B-A3B                     # interactive backbone, 1 model/card, TP=1
GPU 3-4   MiniMax M3 NVFP4    frontier open coding / agentic      # premium SKU, ~2 cards, Jun-2026 release
GPU 5-6   14B base + LoRA     Qwen3.5-14B + per-tenant adapters   # dozens of customer SKUs (multi-LoRA)
GPU 7     embed + reranker    Qwen3-Embedding + Qwen3-Reranker    # RAG tier + warm failover spare

Monthly economics: ~47.3M HUF capex over 3 years ≈ 1.31M HUF/mo; opex ~17.9M HUF/yr ≈ 1.49M HUF/mo (power ~0.61M at ~8.8kW wall × PUE 1.4 × ~95 HUF/kWh, 0.3–0.5 FTE MLOps ~0.45M, warranty, spares and colo ~0.43M). PCIe-only (no NVLink) is the binding constraint: keep each model on one or two cards and never shard a 0.75T MoE across the bus.26

The most efficient frontier-AI node in 2026 is not the biggest one. It is eight mid-size Blackwell cards, each kept busy with a different model.

Method: four builds, each analysed by an independent hardware, operations, and cost agent, reconciled by a verifier against live pricing and published throughput, then ranked on realistic cost per 1M output tokens weighed against capex and flexibility. Figures are planning estimates at June-2026 street prices; confirm live quotes and run a throughput PoC on rented hardware before purchase.40, 41, 42

The phased plan
Go/no-go · buy only when all six hold on rented hardware

Node sustained aggregate output TPS in your target range.
OpenRouter-visible TTFT and throughput competitive under real concurrency.
Above 95% uptime past the first 100+ monitored requests.
Cache-hit rate and blended revenue per 1M output tokens that cover node opex.
Tool-call success high enough for Auto Exacto traffic.
Quality parity against the FP8 baseline on your own eval set.
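The six gates above can be turned into a mechanical check over pilot measurements. This is a minimal sketch, not a prescribed tool: only the 95% uptime bar and the 100-request minimum come from the list; every other threshold (`min_output_tps`, `max_ttft_ms`, `min_tool_call_success`, `max_quality_gap`) is a hypothetical placeholder you would set for your own workload. The second gate is simplified to TTFT alone, and the fourth to a direct revenue-versus-opex comparison per 1M output tokens.

```python
from dataclasses import dataclass


@dataclass
class PilotResult:
    """Measurements from a pilot on rented hardware (field names are illustrative)."""
    output_tps: float              # sustained aggregate output tokens/s
    ttft_ms: float                 # time to first token under real concurrency
    uptime: float                  # fraction in [0, 1]
    monitored_requests: int
    revenue_huf_per_mtok: float    # blended revenue, already reflecting cache-hit rate
    opex_huf_per_mtok: float       # node opex per 1M output tokens
    tool_call_success: float       # fraction in [0, 1]
    eval_score: float              # candidate score on your own eval set
    fp8_baseline_score: float      # FP8 baseline score on the same set


@dataclass
class Thresholds:
    """Workload-specific bars; all hypothetical except the two defaults."""
    min_output_tps: float
    max_ttft_ms: float
    min_tool_call_success: float
    max_quality_gap: float
    min_uptime: float = 0.95            # from the go/no-go list
    min_monitored_requests: int = 100   # from the go/no-go list


def go_no_go(r: PilotResult, t: Thresholds) -> dict[str, bool]:
    """Evaluate each gate separately so a failed pilot shows which gate failed."""
    return {
        "throughput": r.output_tps >= t.min_output_tps,
        "latency": r.ttft_ms <= t.max_ttft_ms,
        "uptime": r.uptime > t.min_uptime and r.monitored_requests >= t.min_monitored_requests,
        "revenue_covers_opex": r.revenue_huf_per_mtok >= r.opex_huf_per_mtok,
        "tool_calls": r.tool_call_success >= t.min_tool_call_success,
        "quality_parity": (r.fp8_baseline_score - r.eval_score) <= t.max_quality_gap,
    }


def should_buy(r: PilotResult, t: Thresholds) -> bool:
    """Buy only when all six gates hold."""
    return all(go_no_go(r, t).values())
```

Returning the per-gate dictionary rather than a single boolean keeps the result auditable: a "no" comes with the name of the gate that blocked the purchase.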
What we do, in plain verbs

LavX Managed Systems builds and operates production AI for European business. On a project like this, we stand up the rented endpoint, wire the gateway, RAG, and tool safety, build the eval suite, and run the pilot. Then we tell you, with your numbers, whether to buy.
We ship the code. We run the evals. We answer within one business day with a concrete next step.
Laszlo Adam Toth
LavX Managed Systems · Budapest, EU

This paper reflects production experience standing up self-hosted and hybrid open-weight inference for European teams. Every load-bearing figure carries a numbered source below and was verified at the time of writing; pricing, benchmark snapshots, and FX move, so confirm live quotes and current leaderboards before you commit capital. Corrections and scoped pilots: lavx.hu.

References & sources

Numbered sources for the load-bearing figures, by class: vendor model-card, independent aggregator / leaderboard, and primary regulator / reseller listing. All accessed June 2026; pricing, benchmark snapshots and FX move, so verify live before procurement.
lavx.hu · LavX Managed Systems · Budapest, EU
