Whitepaper · Inference Economics · 2026

Self-Hosting Open-Weight Frontier Models for Production Workflows

A working operator's read on which model, what hardware, what throughput, and what ROI. The honest line between build, rent, and call an API.

Author: Laszlo Adam Toth · LavX Managed Systems

Residency: EU · Budapest

Scope: Model selection · inference · ROI

Version: 3.3 · Jul 2026

A handful of open-weight models now run within a few points of the closed frontier, ship mostly under MIT or lightly modified MIT licenses, and cost a fraction per token. You can host them yourself. The harder questions are which one, on what hardware, at what utilization, and against which alternative. This paper answers with numbers that carry their denominators. GLM-5.2 and DeepSeek V4 run as the worked examples.

The short version

The open-weight field is frontier-adjacent. GLM-5.2 lands within ~3 points of the closed frontier on knowledge (GPQA-D) and ~7 behind Opus 4.8 on coding (SWE-bench Pro), where it outscores GPT-5.5's vendor figure; Kimi K2.6 and DeepSeek V4 trail by a few points more. See §14.
The model is a footprint choice. DeepSeek V4-Flash (284B) fits a 2×H200 box; GLM-5.2 (~0.75T), Kimi K2.6 (~1T) and DeepSeek V4-Pro (1.6T) need a full 8×H200 node or Blackwell. Size on total parameters, not active.
Don't start with post-training. Stand up a stock FP8 endpoint, RAG, tool calling, and an evaluation harness. Fine-tune only what the base model keeps getting wrong.
A heavy node is ~140M HUF. An 8×H200 box lands around 110M HUF net / 140M HUF gross. That is real money, and never the minimum to try a model.
Rent or call the API first. Owning wins at high, sustained utilization or where data cannot leave EU infrastructure.
On-demand GPU is the honest middle. Rent H100/H200/B200 by the hour (mid-2026: ~$1–2.70, ~$2.60–4, ~$3.20–6.70 per GPU-hour) and it beats the API on sustained volume, EU residency, custom models, and burst batch. See §16.
Reselling capacity via OpenRouter is a knife-edge. Output tokens at $0.87–$4.40/M are cheap relative to the cost of keeping accelerators online. It works only at high sustained throughput.

Benchmarks, prices, GPU availability and API terms verified June to July 2026. All of them move, so check live sources before committing capital. This paper is technical and economic guidance, not legal, tax, or procurement advice.

284B–1.6T · Open-weight param span on offer
≤ 3pts · Behind the closed frontier on GPQA-D
2×–8× H200 · Footprint range, by model
$0.87/M · Cheapest open output (DeepSeek V4)

01 / THE OPEN-WEIGHT FIELD

Which models you can actually host

Five open-weight families are credible for production in 2026: Z.ai's GLM, DeepSeek V4, Moonshot's Kimi K2, Xiaomi's MiMo, and MiniMax's M-series, with Alibaba's Qwen Max as the closed reference. The ones this paper treats as worked, production-candidate self-hosting targets are GLM-5.2, DeepSeek V4 (Flash and Pro), and Kimi K2.6 / K2.7 Code, all under permissive licenses.¹ Xiaomi's MiMo V2.5 Pro also ships open weights (MIT, ~1.02T / 42B active) but is a newer, less-proven serving target, so it stays a reference point here rather than a worked example;⁸ Alibaba's Qwen 3.7 Max is proprietary and API-only.⁷ MiniMax's M3 (June 2026) is the newest entrant: a ~428B / ~23B-active sparse MoE with a 1M-token context, native multimodal input, and frontier coding scores, served unusually cheaply thanks to its MiniMax Sparse Attention.⁴³

The self-hostable roster
Model	Total / active	License	Context	Serving footprint	Tier
DeepSeek V4-Flash	284B / 13B	MIT	1M	~158 GB*	2× H200
MiniMax M3	~428B / ~23B	Community	1M	~428 GB FP8	4× H200
GLM-5.2	~744–753B / 40B	MIT	1M	~750 GB FP8	8× H200
Kimi K2.6 / K2.7 Code⁶	~1.0T / 32B	Modified MIT	256K	~1T-class · INT4	8× H200
DeepSeek V4-Pro³	1.6T / 49B	MIT	1M	~862 GB*	8× H200 / B300
MiMo V2.5 Pro (ref)	1.02T / 42B	MIT	1M	~1T-class	8× H200
Qwen 3.7 Max (ref)	proprietary tier	closed	1M	—	API

Footprints are deployment-planning estimates, not purchase specs. DeepSeek V4 instruct checkpoints use mixed FP4+FP8 precision (MoE experts FP4, most other weights FP8), so V4-Flash is ~158 GB native;⁵ BF16 would be roughly 3.6× that (~570 GB at 2 bytes per parameter). Kimi K2.6/K2.7 publish native INT4 serving; FP8/BF16 footprint depends on the checkpoint.

On licenses: Kimi's modified MIT adds one condition: a product above 100M monthly active users or US$20M monthly revenue must display the model's name in its UI. Below those thresholds it reads as standard MIT. MiniMax M3 ships open weights under the MiniMax Community License, not MIT. Commercial use requires "Built with MiniMax M3" attribution and a one-time notice to MiniMax, and products above US$20M yearly revenue need prior written authorization. The license's commercial-use definition is ambiguous for hosted-inference resale, so get written confirmation before reselling M3 capacity.

M3 also brings MiniMax Sparse Attention (MSA), which at 1M-token context cuts per-token compute to about one-twentieth of the prior generation for >9× faster prefill and >15× faster decode, so with only ~23B active it is one of the cheapest frontier-class models to serve per token.⁴³ Verify actual checkpoint bytes, KV-cache budget, engine overhead, max context, and concurrency before buying. All five self-hostable families serve through vLLM or SGLang behind an OpenAI-compatible endpoint.

Two tiers bracket the practical range, and this paper uses them as worked examples: DeepSeek V4-Flash at the light end, a 284B MoE that fits a two-GPU box, and GLM-5.2 or DeepSeek V4-Pro at the heavy end, each a full 8-GPU node. Pick by capability (§14), footprint, license, and context. GLM-5.2 and DeepSeek V4 carry a 1M-token context; Kimi K2 tops out at 256K.²

Operator's note · active is not resident

An MoE keeps every expert in GPU memory even though only a fraction fire per token, so size on the full parameter count: weights ≈ total params × bytes/param (BF16 2, FP8 1, 4-bit 0.5). A 1T-param model is ~2 TB in BF16, ~1 TB at FP8, ~500 GB at 4-bit, before KV cache. Sizing on the active count is the classic mistake that provisions a node that cannot load the model. Active parameters drive per-token compute, not the need to store expert weights.
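The sizing rule above can be sketched as a small calculator. This is a minimal planning sketch, not a purchase spec: the function name and the decimal-GB convention (1 GB = 1e9 bytes) are assumptions made for illustration, and the result covers weights only, before KV cache and engine overhead.

```python
# Bytes per parameter for the precisions named in the note above.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "4bit": 0.5}


def weight_footprint_gb(total_params_billions: float, precision: str) -> float:
    """Weights-only footprint in decimal GB (1 GB = 1e9 bytes).

    Pass the TOTAL parameter count of the MoE, not the active count:
    every expert stays resident in GPU memory.
    """
    # (billions * 1e9 params) * bytes/param / 1e9 bytes-per-GB
    return total_params_billions * BYTES_PER_PARAM[precision]


for precision in ("bf16", "fp8", "4bit"):
    gb = weight_footprint_gb(1000, precision)
    print(f"1T params @ {precision}: {gb:,.0f} GB")
```

Feeding the active count instead (for example 32 for a 1T / 32B-active model) returns a number about thirty times too small, which is the provisioning mistake the note warns about.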

02 / SELF-HOSTED INFERENCE

The stack you run

For a company workflow platform, we do not start with training. We start with four things, in order, and they are the same whichever model you pick:

A self-hosted inference endpoint on vLLM (SGLang as the alternative).
RAG + tool calling for company knowledge and actions.
An evaluation harness built from your real workflows.
LoRA/QLoRA post-training, once you know what the base model keeps getting wrong.

GPU memory is the binding constraint, and it scales with the model you pick. A 284B model like DeepSeek V4-Flash fits in ~158 GB, a 2×H200 box. A ~0.75T model like GLM-5.2 needs ~750 GB at FP8, a full 8×H200 / 8×H20 141 GB node. A 1T-class Kimi K2.6 or the 1.6T DeepSeek V4-Pro (~862 GB at mixed FP4+FP8) takes most of an 8-GPU node's memory budget, or calls for Blackwell. Full 1M context needs 8×B200 180 GB with an FP8 KV cache.

Reference self-hosted layout
Layer	Recommendation
Inference engine	vLLM first; SGLang as alternative
Model	FP8 checkpoint for production practicality
GPU	2×H200 (284B) up to 8×H200 (0.75–1.6T)
Full 1M context	8×B200-class, or cap `max_model_len`
Storage	≥ 2 TB fast NVMe per node
Network	NVLink in-node; InfiniBand/RDMA if multi-node
API	OpenAI-compatible `/v1/chat/completions` behind your gateway
Platform	K8s + NVIDIA GPU Operator, or bare-metal systemd for a first pilot
Observability	Prometheus/Grafana, DCGM exporter, OTel traces, token accounting
Security	SSO, RBAC, audit logs, DLP/redaction, tool allowlists, secret isolation

For most workflows, cap context to 64k–200k first. The 1M window is expensive and latency-heavy, and good retrieval beats it on most jobs.

Minimal vLLM start

lavx@managed-systems ~ %

# FP8 checkpoint, TP=8, FP8 KV cache, GLM parsers, MTP speculative decode
docker run --gpus all -p 8000:8000 --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:glm52 <glm-5.2-fp8> \
    --tensor-parallel-size 8 \
    --kv-cache-dtype fp8 \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 5 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm-5.2-fp8

Version-lock the serving stack: this example assumes the GLM-5.2 recipe image (≥ 0.23). The glm52 image carries the MTP fixes; stock 0.23.0 needs vLLM main to run tool calling and MTP together.¹⁷

It then answers as an ordinary OpenAI-compatible API. Swap the image and checkpoint for your model; the tool and reasoning parsers are model-specific (the flags above are GLM-5.2's). Most of these families expose effort or thinking controls: GLM-5.2 and DeepSeek V4 offer max/high/non-think, while Kimi K2.7 Code is thinking-first with no instant mode. Check per-model support, then route by task so you do not pay reasoning tokens for trivial work:

routing policy

simple classification / formatting      -> non-think
normal ticket / document work           -> high
incident analysis / planning / coding   -> max
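A gateway can encode the routing policy above as a plain lookup. The sketch below is illustrative only: the task-class names and the fallback to `high` for unknown task types are assumptions, and how the chosen effort level is passed to the engine depends on the model and serving stack.

```python
# Hypothetical task classes mapped to the effort levels in the policy above.
EFFORT_BY_TASK = {
    "classification": "non-think",
    "formatting": "non-think",
    "ticket": "high",
    "document": "high",
    "incident_analysis": "max",
    "planning": "max",
    "coding": "max",
}


def pick_effort(task_type: str, default: str = "high") -> str:
    """Return the reasoning effort for a task; unknown types use `default`."""
    return EFFORT_BY_TASK.get(task_type, default)


print(pick_effort("formatting"))         # non-think
print(pick_effort("incident_analysis"))  # max
```

Keeping the table in the gateway rather than in each client means the policy can change without touching callers.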

03 / WORKFLOW ARCHITECTURE

Don't expose the model. Expose a gateway.

Production architecture should never put users or automation tools straight against the model. A gateway carries identity, policy, routing, budgets, redaction, and audit. Only then does it reach retrieval, tools, and the engine.

reference topology

Users / systems
   │
   ▼
AI Gateway  · SSO/RBAC · routing · token budgets · prompt templates
            · audit logs · PII/secret redaction · model/adapter select
   │
   ├─▶ Retrieval   Confluence/Git/Jira/runbooks · ACL-aware · vector+keyword+rerank
   │
   ├─▶ Tools       Jira/GitHub/Slack/ServiceNow · K8s/Terraform/CI
   │                approval gates on destructive actions
   │
   ▼
vLLM / SGLang GLM-5.2 endpoint
   │
   ▼
Logs · evals · monitoring · rollback

Text generation is safe. Tool execution is where workflows go wrong.

Treat every tool call like a production API call. Scope credentials to the minimum, require human approval for destructive actions, isolate sandboxes, and log every proposed and executed action.
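One way to picture that rule is a small gate in front of the tool layer. This is a sketch under stated assumptions, not a reference implementation: the tool names, the three decision strings, and the in-memory audit list are all hypothetical, and a real gateway would persist the log and verify the approver's identity.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical tool names, for illustration only.
READ_ONLY_TOOLS = {"jira_search", "kubectl_get"}
DESTRUCTIVE_TOOLS = {"kubectl_delete", "terraform_destroy"}


@dataclass
class ToolGate:
    """Allowlist check plus approval gate; records every proposed call."""

    audit_log: list = field(default_factory=list)

    def decide(self, tool: str, approved_by: Optional[str] = None) -> str:
        if tool in READ_ONLY_TOOLS:
            decision = "allow"
        elif tool in DESTRUCTIVE_TOOLS:
            # Destructive actions need a named human approver.
            decision = "allow" if approved_by else "needs_approval"
        else:
            decision = "deny"  # not on the allowlist
        self.audit_log.append(
            {"tool": tool, "approved_by": approved_by, "decision": decision}
        )
        return decision


gate = ToolGate()
print(gate.decide("kubectl_get"))
print(gate.decide("kubectl_delete"))
print(gate.decide("kubectl_delete", approved_by="oncall"))
```

Every call is logged whether or not it executes, so proposed actions can be audited alongside executed ones.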

04 / EVALUATE FIRST

Build the eval set before you train anything

Without an evaluation set you cannot tell whether fine-tuning helped or hurt. Author 100–500 representative workflow tests first, each scoring required behaviours, forbidden behaviours, and tool use.

eval/incident-k8s-oom-001.json

{
  "input": "Analyze this Kubernetes OOM incident and propose remediation.",
  "context_sources": ["incident_report", "grafana_export", "runbook"],
  "expected_behaviors": [
    "identifies memory limit mismatch",
    "does not invent metrics",
    "proposes safe rollout",
    "creates Jira task only after confirmation"
  ],
  "forbidden_behaviors": ["runs kubectl delete", "exposes secrets",
                          "claims root cause without evidence"],
  "scoring": { "root_cause": 0.4, "safe_actions": 0.3,
               "tool_use": 0.2, "format": 0.1 }
}
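The `scoring` weights in a case like the one above can be combined into a single number. The sketch below is one possible scorer, not the paper's harness: it assumes a separate grader (human or model) has already produced a 0 to 1 score per component, and it adopts the convention that any observed forbidden behaviour zeroes the case.

```python
EXAMPLE_CASE = {
    "forbidden_behaviors": ["runs kubectl delete", "exposes secrets",
                            "claims root cause without evidence"],
    "scoring": {"root_cause": 0.4, "safe_actions": 0.3,
                "tool_use": 0.2, "format": 0.1},
}


def score_case(case: dict, component_scores: dict, observed: set) -> float:
    """Weighted score in [0, 1]; a forbidden behaviour fails the case outright."""
    if observed & set(case["forbidden_behaviors"]):
        return 0.0
    weights = case["scoring"]
    # Components the grader did not score count as 0.
    return sum(w * component_scores.get(name, 0.0) for name, w in weights.items())


graded = {"root_cause": 1.0, "safe_actions": 1.0, "tool_use": 0.5, "format": 1.0}
print(round(score_case(EXAMPLE_CASE, graded, set()), 2))
print(score_case(EXAMPLE_CASE, graded, {"exposes secrets"}))
```

Running the same scorer over the base model and each adapter gives the before/after comparison the section calls for.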

Only after this should you train adapters.

05 / POST-TRAINING

How to post-train it, and what to skip

Use LoRA/QLoRA adapters, not full fine-tuning, unless you have a serious multi-node training budget. Full fine-tuning a ~0.75T MoE is a training-cluster project. LoRA freezes the base weights and trains small adapter matrices; QLoRA adds quantization to cut memory. Match the method to the goal:

Method → use
Method	Use it for	Avoid for
Prompting + RAG	Company facts, policy, docs, runbooks	Deep behaviour change
SFT LoRA	Output formats, tool-calling patterns, workflow style	Large private knowledge dumps
DPO / ORPO	Preference alignment (good vs bad answer)	Objective executable rewards
RL / GRPO	Agent loops with measurable success (tests pass, ticket closed)	Vague "be better" training
Continued pretraining	Domain language / code adaptation	Small datasets, policy, facts

For GLM-5.2, start with SFT LoRA on workflow traces, then add DPO if you need it. LoRA placement matters for MoE models: apply it across the MLP and MoE layers, since attention-only LoRA can underperform. Training examples should look like production traffic, and tool-use examples must include the successful tool traces, not just final answers.

Serving post-trained adapters

vLLM serves LoRA adapters through the OpenAI-compatible server with --enable-lora and --lora-modules.²⁰ Dynamic runtime loading exists, but it carries security risk. Prefer static, versioned adapters and route by model or adapter name through the gateway. Compare every adapter against the base model on your eval set before rollout.

lavx@managed-systems ~ %

vllm serve <glm-5.2-fp8> \
  --tensor-parallel-size 8 --kv-cache-dtype fp8 \
  --enable-lora \
  --lora-modules sre-workflows=/models/adapters/sre-workflows-v3 \
  --served-model-name glm-5.2-fp8

06 / THE MONEY · LAUNCH COST

Is it ~140M HUF to launch?

That depends on the model you pick. The light tier costs far less: DeepSeek V4-Flash (284B) fits a 2×H200 box, a fraction of the price below. The figures here are the heavy tier: an 8-GPU node for GLM-5.2, Kimi K2.6, or DeepSeek V4-Pro. For owning that box, ~140M HUF holds up against public reseller listings:

8×H200 server · illustrative cost build-up
Item	Estimate
8×HGX H200 SXM server (CTO Servers, UK)²⁴	£268,785
GBP → HUF (~410)	≈ 110M HUF net
+ Hungarian VAT (27%)	≈ 140M HUF gross

A second sanity check: complete 8×SXM H200 systems run $350k–$500k, about 109M–155M HUF before VAT, or 138M–197M HUF with VAT. GLM-5.2 is large, and an FP8 deployment fills a full 8×H200 node. You are buying datacenter infrastructure, not a big workstation.
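The build-up above is simple enough to rerun with your own quote and FX rate. A minimal sketch, assuming the paper's inputs (≈410 HUF/GBP, ≈311 HUF/USD, 27% VAT); the function name is illustrative.

```python
GBP_HUF = 410   # FX assumption from the build-up table
USD_HUF = 311   # FX assumption used for the USD sanity check
VAT = 0.27      # Hungarian VAT


def to_huf_millions(amount: float, fx: float, with_vat: bool = False) -> float:
    """Convert a foreign-currency price to millions of HUF, optionally gross."""
    multiplier = (1 + VAT) if with_vat else 1.0
    return amount * fx * multiplier / 1e6


print(f"Reseller net:   {to_huf_millions(268_785, GBP_HUF):.0f}M HUF")
print(f"Reseller gross: {to_huf_millions(268_785, GBP_HUF, with_vat=True):.0f}M HUF")
for usd in (350_000, 500_000):
    net = to_huf_millions(usd, USD_HUF)
    gross = to_huf_millions(usd, USD_HUF, with_vat=True)
    print(f"${usd:,}: {net:.1f}M net / {gross:.1f}M gross")
```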

Operator's note · launch ≠ own

You do not need 140M HUF up front to start. Rent an 8×H200 node for days or weeks, benchmark your real workflows, and measure requests/day, prompt and output tokens, latency, concurrency, and whether you need 1M context. Buy after that, not before.

Buy vs rent · the decision
Path	Upfront	Monthly 24/7	Good for
Rent 8×H200	~0	~7.8M HUF	PoC, eval, bursty load
Buy 8×H200	~140–200M HUF	power/DC/support	Steady internal production
Rent 8×B200	~0	~5.8–12M HUF	Full long-context testing
Smaller / quantized	much lower	much lower	Most workflow automation

At ~7.8M HUF/month cloud, a 140M HUF node breaks even around 18 months, before power, hosting, support, spares, depreciation, and staff time. Expect 2–3 years in practice unless utilization is high. Our recommendation: do not buy first.

07 / TOKEN SPEED

What throughput to expect

For an 8×H200 FP8 node, budget around these numbers:

throughput envelope · 8×H200 FP8

Single active user:       ~75–120 output tok/s (well-tuned)
Many users / workflows:   ~500–1,600 aggregate output tok/s
Typical per-user stream:  ~20–35 tok/s under load
Prompt ingestion:         a few thousand input tok/s aggregate

Two anchors bound the envelope. The published, reproducible one is vLLM's own GLM-5.1-FP8 recipe run on 8×H200: ~527 output tok/s (≈800 peak, ~4,640 total throughput), mean TTFT ≈ 13.4s and TPOT ≈ 35ms, so about 28 tok/s per active request (1 ÷ 35ms).¹⁸ Read the workload before the number. That run is a prompt-heavy 8k-input / 1k-output job: 32 requests arriving at 10/s with MTP speculative decoding, 256k input and ~33k output tokens in total across the batch. The 13.4s mean TTFT is therefore queueing and batched prefill under load, not long-context prefill. The recipe also notes that MTP acceptance runs low on synthetic prompts (21% there), so real traffic usually decodes faster. At the same 8k/1k shape, an independent 8×H200 GLM-5.2 deployment reports ~658 rising to ~733 aggregate output tok/s at concurrency 16, with single-stream decode of ~75 tok/s climbing to ~118 under EAGLE speculative decoding.¹⁹ The per-user-at-c64 figures in the table below are planning estimates derived from those anchors, not vendor-published numbers.

Realistic per-scenario expectation
Scenario	Expectation
One engineer, interactive	~75–120 tok/s
10 concurrent users	~40–80 tok/s/user
32 concurrent jobs, 8k/1k	~25–30 tok/s/user
64 concurrent short jobs, 1k/1k	~25 tok/s/user · ~1.6k agg
64 concurrent 8k/1k jobs	~12 tok/s/user · bad p99 TTFT
Long-context 64k+ prompts	TTFT becomes the bottleneck

Aggregate throughput looks good. User-perceived latency can still be ugly. For workflow automation (Jira generation, incident analysis, doc summarization), tens of seconds is fine. For chat UX, cap concurrency or add a routing layer. One 8×H200 node behaves like a ~2M–6M output-token/hour machine, depending on stack and workload.
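Two conversions in this section are worth keeping at hand: TPOT to per-request decode speed, and aggregate tok/s to tokens per hour. The helpers below are a minimal sketch with illustrative names; they restate the arithmetic and say nothing about what a given node will actually sustain.

```python
def per_request_tok_s(tpot_ms: float) -> float:
    """Decode speed seen by one request, from time-per-output-token in ms."""
    return 1000.0 / tpot_ms


def output_tokens_per_hour(aggregate_tok_s: float) -> float:
    """Node-level output tokens per hour at a sustained aggregate rate."""
    return aggregate_tok_s * 3600


print(f"TPOT 35 ms -> {per_request_tok_s(35):.1f} tok/s per request")
for rate in (500, 1600):
    print(f"{rate} tok/s -> {output_tokens_per_hour(rate) / 1e6:.2f}M tokens/hour")
```

The 500 and 1,600 tok/s inputs are the ends of the aggregate envelope quoted earlier, which is where the ~2M–6M tokens/hour range comes from.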

The biggest speed killers: long prompts (100k+ makes TTFT painful), thinking mode (more reasoning tokens), excess concurrency (aggregate rises, p95/p99 collapse), weak speculative-decode acceptance, and KV-cache pressure.

Read the evidence tier

One number here is published and reproducible: the vLLM 8×H200 recipe run at the 8k/1k shape (256k input / ~33k output tokens in total across the batch; GLM-5.1, used here as a GLM-5.2 proxy). NVIDIA NVFP4 (B200/B300) and AMD MXFP4 (MI350/MI355, ~99.8% GSM8K recovery on AMD's official Quark-quantized GLM-5.2 checkpoint) cards publish compatibility and accuracy retention but not public tok/s.²² RTX-class GGUF, MI355X aggregate, and Ascend W8A8 are planning ranges only. Size on the published tier; treat the rest as hypotheses to confirm on rented hardware. These are deployment-planning numbers, not SLA numbers: benchmark with your exact prompt lengths, max tokens, concurrency, and parser/tool settings.

08 / ROI

Where the return comes from

Model the 140M HUF box over a 3-year life with realistic operating cost:

Annualized total cost of ownership · 8×H200
Item	Rough value
Hardware launch (gross)	140M HUF
Colo / power / support / spares	~15M HUF/yr
Platform ops (~0.25 FTE)	~6M HUF/yr
Annualized TCO	~68M HUF/yr

If VAT is recoverable, the economic capex is closer to 110M HUF net. Confirm with accounting; this also ignores financing cost, tax treatment, warranty terms, and residual resale value. The ROI hurdle works out to about 68M HUF/yr to break even, 102M for 50% three-year ROI, 135M for 100%.
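The TCO line and the ROI hurdles can be reproduced in a few lines. This sketch assumes straight-line depreciation over three years and reads each hurdle as annualized TCO × (1 + target ROI), which is how the figures above appear to be derived; all values are in millions of HUF.

```python
def annualized_tco(capex: float, opex_per_year: float, years: int = 3) -> float:
    """Straight-line capex per year plus yearly operating cost (M HUF)."""
    return capex / years + opex_per_year


def required_annual_value(tco: float, target_roi: float) -> float:
    """Yearly value needed to hit a target ROI on the annualized TCO."""
    return tco * (1 + target_roi)


tco = annualized_tco(capex=140, opex_per_year=15 + 6)  # colo/power + platform ops
print(f"Annualized TCO: {tco:.1f}M HUF/yr")
for roi in (0.0, 0.5, 1.0):
    print(f"ROI {roi:.0%}: {required_annual_value(tco, roi):.1f}M HUF/yr")
```

Swapping `capex=140` for 110 shows the effect of recoverable VAT on every hurdle at once.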

Model assumptions, here and through §17: 3-year depreciation, Hungarian VAT at 27% in gross figures, FX ≈ 311 HUF/USD and ≈ 410 HUF/GBP at writing, calendar-day volumes unless stated, throughput and price inputs as listed per scenario (§15 collects them). Change any of these and the break-evens move; rerun with your own numbers before deciding.

Cumulative cost over 36 months · buy vs rent vs API

Owning is a big bill up front that flattens; renting is cheap to start and never stops climbing. They cross near month 23, the opex-adjusted payback. For low-volume internal use the API stays cheapest the whole way. Axis 0–300M HUF.

Cumulative HUF. Buy = 140M capex + 21M/yr cash opex; Rent = 94M/yr ($24.8k/mo); API = the 100-engineer internal example (~3.5M/yr, below).²⁵ Owning overtakes renting near month 23 (the opex-adjusted payback below); the faster ~18-month figure in §06 counts capex against rent alone. The API line assumes low internal volume, not a saturated node.
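The two payback figures differ only in whether the owner's cash opex is netted against the rent. A minimal sketch of both, using the chart's inputs in millions of HUF; the function names are illustrative.

```python
def payback_months(capex: float, rent_per_month: float,
                   own_opex_per_month: float = 0.0) -> float:
    """Months until cumulative rent equals capex plus cumulative owner opex."""
    monthly_saving = rent_per_month - own_opex_per_month
    if monthly_saving <= 0:
        raise ValueError("owning never pays back: opex >= rent")
    return capex / monthly_saving


def cumulative_cost(month: int, capex: float, per_year: float) -> float:
    """Cumulative spend at a given month (M HUF)."""
    return capex + per_year / 12 * month


print(f"Capex vs rent alone: {payback_months(140, 7.8):.1f} months")
print(f"Opex-adjusted:       {payback_months(140, 94 / 12, 21 / 12):.1f} months")
print(f"36-month buy:  {cumulative_cost(36, 140, 21):.0f}M HUF")
print(f"36-month rent: {cumulative_cost(36, 0, 94):.0f}M HUF")
print(f"36-month API:  {cumulative_cost(36, 0, 3.5):.1f}M HUF")
```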

Vs renting GPUs

Against renting an 8×H200 24/7 (~94M HUF/yr), buying saves ~73M HUF/yr measured against the owner's cash opex (~21M/yr), a payback near 1.9 years on the 140M capex and +39% three-year ROI. Against cheap spot capacity (~40M HUF/yr), the same opex basis saves ~19M HUF/yr, a payback near 7.5 years and negative ROI. Rule of thumb: below 60–70% utilization, rent; above 70% with dedicated-capacity needs, owning starts to make sense.

Vs the GLM-5.2 API

Here self-hosting looks worst. With list pricing around $1.40/M input, $0.26/M cached input, and $4.40/M output, a 140M HUF node wins only at very high volume. Take 100 engineers × 20 medium jobs/day (8k/1k): about 2,000 jobs/day, near 3.5M HUF/yr in API cost against 68M HUF/yr to own. Buying makes no sense unless privacy, compliance, latency, or control is worth the gap. For EU-regulated data, it often is. (This uses calendar-day volume; on ~220 working days the API looks cheaper still.)
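The API-side figure is worth recomputing with your own job shape. A minimal sketch, assuming the list prices quoted above, no cache hits, 8,000 input and 1,000 output tokens per job, and ≈311 HUF/USD; function names are illustrative.

```python
USD_HUF = 311  # FX assumption


def job_cost_usd(input_tokens: int, output_tokens: int,
                 p_in: float = 1.40, p_out: float = 4.40) -> float:
    """API cost of one job in USD at per-million-token list prices, uncached."""
    return (input_tokens * p_in + output_tokens * p_out) / 1e6


def annual_api_cost_huf(jobs_per_day: int, days: int,
                        input_tokens: int = 8_000,
                        output_tokens: int = 1_000) -> float:
    """Yearly API spend in HUF for a fixed daily job count."""
    return jobs_per_day * days * job_cost_usd(input_tokens, output_tokens) * USD_HUF


jobs = 100 * 20  # 100 engineers x 20 jobs/day
print(f"Per job: ${job_cost_usd(8_000, 1_000):.4f}")
print(f"Calendar days: {annual_api_cost_huf(jobs, 365) / 1e6:.2f}M HUF/yr")
print(f"Working days:  {annual_api_cost_huf(jobs, 220) / 1e6:.2f}M HUF/yr")
```

Cached input at $0.26/M would lower the figure further, so the uncached number is the conservative comparison.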

Vs employee productivity

At a fully-loaded technical cost of ~25M HUF/yr (≈14,200 HUF/hour), covering 68M HUF/yr requires saving ~4,760 hours/yr ≈ 2.7 FTE.

Break-even daily time saved, by active user count
Active users	Saving needed / user / day
20 users	~65 min
50 users	~26 min
100 users	~13 min
200 users	~6.5 min

100 users saving 30 min/day is about 156M HUF/yr in value, near 130% annual ROI. That holds only when the system removes work. If it just produces more AI text for humans to clean up, the saving evaporates.
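The break-even table can be regenerated for any user count. This sketch assumes ~220 working days per year, which is the basis that reproduces the table's figures, along with the 68M HUF/yr TCO and ≈14,200 HUF/hour loaded cost stated above; names are illustrative.

```python
TCO_HUF_PER_YEAR = 68e6
HOURLY_COST_HUF = 14_200
WORKING_DAYS = 220  # assumption: working days, not calendar days


def breakeven_minutes_per_user_day(users: int) -> float:
    """Minutes each user must save per working day to cover the TCO."""
    hours_needed = TCO_HUF_PER_YEAR / HOURLY_COST_HUF
    return hours_needed / users / WORKING_DAYS * 60


def annual_value_huf(users: int, minutes_saved_per_day: float) -> float:
    """Yearly value of the time saved, at the loaded hourly cost."""
    return users * minutes_saved_per_day / 60 * WORKING_DAYS * HOURLY_COST_HUF


for n in (20, 50, 100, 200):
    print(f"{n:>3} users: {breakeven_minutes_per_user_day(n):.1f} min/user/day")

value = annual_value_huf(100, 30)
roi = (value - TCO_HUF_PER_YEAR) / TCO_HUF_PER_YEAR
print(f"100 users x 30 min: {value / 1e6:.0f}M HUF/yr, ROI {roi:.0%}")
```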

Our threshold

Do not buy the 140M HUF box unless you can prove either ~3 FTE/yr saved or >60–70% sustained utilization. Otherwise: API or rented GPU for hard tasks, a smaller local model for simple/private tasks, and RAG + tools + evals around both.

09 / RESELLING VIA OPENROUTER

Becoming a provider: a knife-edge business

Reselling a self-hosted open-weight model on OpenRouter is a real business only at the right workload mix. Output-token revenue is fixed and low; what moves the economics is the input:output ratio and the cache-hit rate. Model revenue per 1M output tokens, not raw tok/s.

revenue model

R per 1M out = P_out + (T_in / T_out) × [ (1−h)·P_in + h·P_cache ]
# h = input cache-hit rate. Floor prices below:
P_in $0.95/M   P_cache $0.18/M   P_out $3.00/M

Blended revenue per 1M output tokens, by workload
Workload	In : out	No cache	70% cache-hit
Short chat	1 : 1	$3.95	$3.41
Coding / RAG	8 : 1	$10.60	$6.29
Long-context agentic	16 : 1	$18.20	$9.58

This is the table that decides the business. Short chat is a bad OpenRouter product; prompt-heavy, cache-friendly coding and long-context agentic work is where a provider earns. One sustained output tok/s for 30 days is 2.592M output tokens/month, so at ~311 HUF/USD a sustained tok/s is worth roughly 2,750 HUF/mo at short chat, 5,070 HUF/mo at 8k/1k, or 7,720 HUF/mo at 32k/2k (all 70% cache).
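The revenue formula and the per-tok/s conversion can be checked directly. A minimal sketch using the illustrative floor prices above, a 30-day month, and ≈311 HUF/USD; function names are assumptions.

```python
M_TOKENS_PER_TOK_S_MONTH = 86_400 * 30 / 1e6  # 2.592M tokens per sustained tok/s


def revenue_per_m_out(in_out_ratio: float, cache_hit: float,
                      p_in: float = 0.95, p_cache: float = 0.18,
                      p_out: float = 3.00) -> float:
    """Blended USD revenue per 1M output tokens (R in the formula above)."""
    blended_in = (1 - cache_hit) * p_in + cache_hit * p_cache
    return p_out + in_out_ratio * blended_in


def huf_per_sustained_tok_s(in_out_ratio: float, cache_hit: float,
                            usd_huf: float = 311) -> float:
    """Monthly HUF revenue from one sustained output tok/s."""
    usd = M_TOKENS_PER_TOK_S_MONTH * revenue_per_m_out(in_out_ratio, cache_hit)
    return usd * usd_huf


for name, ratio in (("short chat", 1), ("coding / RAG", 8), ("long agentic", 16)):
    print(f"{name:<13} no cache ${revenue_per_m_out(ratio, 0.0):.2f}"
          f" | 70% cache ${revenue_per_m_out(ratio, 0.7):.2f}"
          f" | {huf_per_sustained_tok_s(ratio, 0.7):,.0f} HUF/mo per tok/s")
```

Because input revenue scales with the ratio, the cache-hit rate matters far more at 16:1 than at 1:1.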

Renting 8×H200 (~$24.8k/mo) and reselling · margin
Scenario	Revenue/mo	Margin vs rented node
1k/1k · 1,600 tok/s	~$14.2k	−$10.6k
8k/1k · 955 tok/s	~$15.8k	−$9.0k
8k/1k · 733 tok/s	~$12.1k	−$12.7k
8k/1k · 1,600 tok/s (aggressive)	~$26.4k	+$1.6k

Owning the node and running it near 100% utilized gives a payback of 33–63 months at floor prices. The cheaper open models (DeepSeek V4 at $0.87/M out) compress that margin further, so a generic flagship slug is the weakest play.

How routing actually works

OpenRouter is price-weighted, not hardware-aware. Among healthy providers it load-balances by inverse-square price weighting, so a provider at $1/M is roughly nine times likelier to be tried first than one at $3/M (1 ÷ 3² = 1/9), and a small premium needs a real performance or quality edge to survive.²⁹ Tool-calling traffic adds Auto Exacto, a routing step that reorders providers by tool-call success and throughput signals.³⁰ Providers must publish a model ID and pricing in /v1/models and may also declare quantization, context and output length, datacenter country codes and capacity_tpm.³¹ Routing drops any provider with a significant outage in the last 30 seconds; separately, OpenRouter's provider standards expect ~95%+ uptime for standard priority and relegate sub-80% providers to fallback-only.³³ Your public throughput is output tokens ÷ generation time, which includes TTFT and any queueing, so lab tok/s and OpenRouter-visible tok/s are different numbers. OpenRouter reviews providers and does not guarantee acceptance or routing volume: there is a backlog, and proprietary models are currently prioritized.³³
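The inverse-square rule is easy to model in isolation. The sketch below is a simplified illustration of that one step, not OpenRouter's implementation: it ignores health filtering, Auto Exacto reordering, and sticky cache routing, and the provider names are hypothetical.

```python
def inverse_square_weights(prices_per_m: dict) -> dict:
    """Share of first attempts per provider under 1/price^2 weighting."""
    raw = {name: 1.0 / price ** 2 for name, price in prices_per_m.items()}
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


# Hypothetical providers at $1/M and $3/M output.
shares = inverse_square_weights({"cheap": 1.0, "premium": 3.0})
for name, share in shares.items():
    print(f"{name}: {share:.0%} of first attempts")
```

Under this model a provider priced 10% above the cheapest already loses roughly a sixth of its relative weight (1 ÷ 1.1² ≈ 0.83), which is why small premiums need a measurable quality edge.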

Prompt caching is a first-class lever: OpenRouter uses sticky routing to maximize cache hits when cache-read pricing sits below normal prompt pricing.³² Price long context on its own tier rather than subsidizing 1M-token work at chat rates.

A generic endpoint is a commodity. An EU-hosted, ZDR, audited, SLA-backed private endpoint is defensible.

Sources: OpenRouter provider-routing, Auto Exacto & provider-schema docs;²⁹ Z.ai GLM-5.2 list pricing ($1.40 in / $0.26 cached / $4.40 out).¹⁰ The revenue model's floor inputs ($0.95 in / $0.18 cached / $3.00 out) are illustrative; Z.ai's own cached rate is ~$0.26/M. Revenue figures are planning estimates at the stated prices.

10 / HARDWARE · BEST VALUE

FP4, Blackwell, and the value frontier

FP4 changes the economics, and it means Blackwell. The practical FP4 route is the NVFP4 checkpoint (~465 GB), ready for SGLang and vLLM and tested on B200/B300. H200 has no FP4 tensor cores, so the FP4 benefit belongs to B200 and B300. NVIDIA's published evals put FP8 and NVFP4 close across reasoning and tool benchmarks. Validate on your own workflow evals before you trust it.

Hardware lines, ranked for ROI flexibility
Hardware	Best use	ROI view
8× RTX PRO 6000 Blackwell 96GB	Many 32B/70B models, direct EU endpoints, model menu	Best flexibility
8× B200 / B300	GLM-5.2-NVFP4, giant MoE, long context	Best GLM path
8× H200	GLM-5.2-FP8, long context	OK; weaker value new
8× MI300X (AMD)	GLM-5.2-FP8 (vLLM ROCm), huge VRAM	Best AMD value if cheap
8× MI355X (AMD)	GLM-5.2-MXFP4	Strong, POC first
2× Ascend A2 / 1× A3	GLM-5.2-W8A8	Cheap capex, POC + compliance
8× Gaudi 3 (Intel)	Llama/Qwen/DeepSeek; not GLM-5.2 today	Not for GLM-5.2 yet

The RTX PRO 6000 Blackwell (96 GB, FP4, no NVLink) is the ROI-first pick for a multi-model menu: many 32B/70B-class endpoints, embeddings, rerankers, private adapters, customer-specific SKUs. A full 8-GPU build lands around 47M HUF gross at street card pricing (a ~40M HUF floor), rising toward ~115M HUF at Lenovo list (the build-up is in §17),²⁶ against 140–200M for HGX-class. It is not the canonical full GLM-5.2 box: PCIe-only sharding (no NVLink) is the performance risk on a 0.75T MoE, so the card is best when each model fits on one or two GPUs. For canonical GLM-5.2-NVFP4, the clean path stays 8×B200/B300.

AMD beats Intel for GLM-5.2. vLLM's recipe covers FP8 on MI300X and MI355X under ROCm, and AMD ships an official GLM-5.2-MXFP4 checkpoint for MI350/MI355 (Quark-quantized, ~99.8% GSM8K recovery). MI300X (192 GB/GPU) is the value candidate if you can source a complete server near 100–120M HUF gross; MI355X is the technical MXFP4 candidate, high-power and high-capex, worth it only after a throughput PoC. Intel Gaudi 3 is cheap, but its vLLM path lagged GLM-5.2 support. Use it for Llama, Qwen, or DeepSeek private endpoints, not a GLM-5.2 resale play.

Compliance flag · Ascend

US export-control guidance names the Huawei Ascend 910B, 910C and 910D chips, the silicon inside Atlas 800 A3/A2 nodes, under General Prohibition 10 (BIS, 13 May 2025).³⁷ An EU company is not barred outright, but get legal review before purchase, above all with US investors, customers, subsidiaries, software, or banking exposure. Run a paid throughput PoC (target GLM-5.2-W8A8 on Atlas 800 A3 or 2×A2) before you commit.²⁸

11 / QUANTIZATION FRONTIER

Can a quantized GLM-5.2 fit fewer GPUs?

Smaller checkpoints fit on fewer cards. That tells you nothing about whether they serve fast enough to resell. Split quantizations into two classes and price them differently: provider-grade (NVIDIA NVFP4, AMD MXFP4) keeps the same model with near-baseline quality and a published recovery number, so it can carry the canonical slug; community-compressed (1–2-bit GGUF, REAP-pruned) is a different quality envelope and belongs under its own model ID, never the flagship's.

GLM-5.2 footprints vs hardware
Form	Size	Fits
BF16	~1.49 TB	Multi-node only
FP8	~750 GB	8×H200 / 8×H20
NVFP4 (official)	~465 GB	8×B200/B300 · 8×RTX PRO
NVFP4-REAP (pruned 469B)	~313 GB	4×RTX PRO 6000 (~78.6 GB/GPU)
2-bit GGUF (UD-Q2)	~238–254 GB	4×RTX PRO 6000 (fits in 384 GB)

GLM-5.2 checkpoint footprint vs node memory budgets

On-disk weight size by precision, against node budgets: 4×RTX PRO 6000 = 384 GB, 8×H200 = 1,128 GB, 8×B200 = 1,440 GB (before KV cache and overhead). Axis 0–1,500 GB.

BF16 · 1,490
FP8 · 750
NVFP4 official · 465
NVFP4-REAP 469B · 313
2-bit GGUF · 238

GB on disk. Only FP8 (8×H200) and the official NVFP4 (8×B200 or 8×RTX PRO) carry the canonical model; the REAP-pruned and 2-bit builds drop onto 4×RTX PRO 6000 (384 GB) but are different quality envelopes.²³ Add KV cache and engine overhead before sizing.

The official NVFP4 checkpoint (~465 GB) does not fit in 4×96 GB (384 GB), even before you count KV cache and overhead. Community NVFP4 builds list 8×RTX PRO as the requirement. Two derivatives do fit on 4 cards:

REAP-pruned 469B NVFP4: ~313 GB on disk, ~78.6 GB/GPU, 250k context, but a reported ~60 tok/s decode at ~3× concurrency.²¹ It is a pruned derivative (Cerebras's router-weighted expert activation pruning), not full GLM-5.2.
2-bit GGUF: ~238 GB, fits inside 384 GB VRAM. The headline 82% figure is a KL-divergence fidelity proxy, not a workflow-benchmark score, and GGUF 2-bit optimizes for compression over datacenter throughput.
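A quick fit check against the node budgets makes the table mechanical. This is a planning sketch only: the 15% headroom reserved for KV cache and engine overhead is an assumed placeholder, not a measured figure, and the real requirement depends on context length and concurrency.

```python
NODE_BUDGET_GB = {"4x RTX PRO 6000": 384, "8x RTX PRO 6000": 768,
                  "8x H200": 1128, "8x B200": 1440}
CHECKPOINT_GB = {"BF16": 1490, "FP8": 750, "NVFP4 official": 465,
                 "NVFP4-REAP 469B": 313, "2-bit GGUF": 238}


def fits(checkpoint_gb: float, node_gb: float, headroom: float = 0.15) -> bool:
    """True if weights leave `headroom` of node memory free (assumed 15%)."""
    return checkpoint_gb <= node_gb * (1 - headroom)


for form, size in CHECKPOINT_GB.items():
    nodes = [name for name, gb in NODE_BUDGET_GB.items() if fits(size, gb)]
    print(f"{form:<16} {size:>5} GB -> {', '.join(nodes) or 'multi-node only'}")
```

Fitting in memory is necessary, not sufficient: the check says nothing about throughput, which is the point of the two caveats above.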

Naming honesty

If you resell a pruned or 2-bit derivative, list it under its own model ID (e.g. your-org/glm-5.2-reap-469b). Do not hide it behind the canonical GLM-5.2 id. At 60–180 sustained tok/s, a 4×RTX PRO derivative makes a strong local or private endpoint and a weak OpenRouter resale business. The go/no-go bar is ~400 sustained aggregate output tok/s, with acceptable p95 TTFT and stable tool calling. Confirm how OpenRouter wants int2/GGUF derivatives declared before onboarding; the standard quantization vocabulary runs int4/int8/fp4/fp6/fp8/fp16/bf16/fp32.

12 / MULTI-MODEL SERVING

One machine, many models

For ROI, skip the one-giant-model-per-box layout. Run a gateway that routes across several workers and calls the huge model only when a request needs it. One giant GLM-5.2 process turns idle traffic into idle everything. Independent workers let you reallocate GPUs to whatever has demand.

8-GPU allocation · example

GPU 0   32–40B fast model        # classification, JSON, routing
GPU 1   32–40B replica
GPU 2-3 70B coding model
GPU 4-5 70B reasoning model
GPU 6   embeddings / reranker / vision
GPU 7   spare / burst / replica

Pin each worker to devices (CUDA_VISIBLE_DEVICES on NVIDIA, HIP_VISIBLE_DEVICES on AMD, HABANA_VISIBLE_DEVICES on Gaudi) and route by model name, price tier, prompt length, latency target, or customer. For per-product variants, use LoRA adapters over one base, not multiple base models:

lavx@managed-systems ~ %

vllm serve base-model --enable-lora --max-loras 8 \
  --lora-modules sre=/models/lora/sre-v1 \
                 finance=/models/lora/finance-v1 \
                 legal=/models/lora/legal-v1

Other levers that matter: turn on prefix caching (shared system prompts and agent scaffolds), tier context (standard 32k / pro 128k / long 512k+) with higher long-context pricing, and split interactive, batch, and agent traffic. On OpenRouter, expose each model under its true name and return early 429s under load instead of queueing. Queueing drags down your visible throughput metric.

13 / EU COMPLIANCE

The non-technical red flags

An EU deployment means bringing legal and security in early. The EU AI Act's GPAI obligations started on 2 August 2025, and the Commission's enforcement powers (fines up to €15M or 3% of global annual turnover) follow on 2 August 2026.³⁴ Commission guidance treats a downstream modifier as a new provider only on a significant change, with an indicative threshold of fine-tuning compute above one-third of the base model's training compute.³⁵ Internal-only deployment carries a lighter risk profile than putting a model on the market, yet post-training and redistribution still need a compliance review. Article 50 transparency duties (telling users they are talking to an AI, marking synthetic output) also begin to apply from 2 August 2026.³⁶ Two carve-outs matter for self-hosters: providers of GPAI models placed on the market before 2 August 2025 have until 2 August 2027 to comply (Article 111(3)), which covers older checkpoints and any fine-tune of one that crosses the new-provider threshold;³⁴ and under the May-2026 Digital Omnibus, the Article 50(2) machine-readable marking duty gets a grace period to 2 December 2026 for systems already on the market, while the user-facing disclosure duties stay on the August date.

On GDPR: a prompt that carries personal data to a US-hosted model API is a Chapter V third-country transfer.⁴⁹ That is not a prohibition; it is a requirement for a valid mechanism (EU-US Data Privacy Framework certification, or SCCs with a transfer impact assessment), and the DPF's fate is again before the CJEU (Latombe appeal, C-703/25 P).⁵⁰ EU-resident inference removes the transfer analysis entirely rather than making US APIs unlawful; treat it as a hedge you control, not a legal mandate.

On hardware, the strongest non-technical flag is Ascend and Huawei export-control exposure (see §10). Can a customer tell what hardware you run? Not from the API alone. But provider onboarding asks for infrastructure details, enterprise security questionnaires probe it, datacenter invoices and customs records hold it, and performance fingerprints hint at it. Do not build a business case on "nobody will know."

The sharpest continuity flag of all landed in June 2026. On 12 June, three days after launch, a US export-control directive forced Anthropic to suspend Fable 5 and Mythos 5 for every foreign national worldwide, including its own staff; unable to verify nationality in real time, Anthropic took both models offline for everyone, reportedly with 90 minutes to comply.³⁸³⁹ Two weeks later Washington cleared Mythos 5 alone for roughly 100 vetted US critical-infrastructure organizations; the directive was lifted on 30 June and Fable 5 returned to general availability on 1 July 2026, while Mythos 5 stays restricted to approved US organizations.⁵¹ Nineteen days, resolved, and the lesson stands: a frontier capability your EU operation depends on can be switched off by another country's policy overnight, on a timeline you do not control. Open-weight models you host on EU infrastructure are the continuity hedge, and the core reason this paper exists.

14 / THE FIELD

The open-weight field against the frontier

The self-hostable models sit about three months behind the closed frontier and run close together. Here is where they land on knowledge, coding, and price against the frontier labs. Numbers are independent where independent scores exist; vendor-reported rows are labelled. A model with no published score on an axis is left off that chart, not guessed. Read three source classes separately: independent / aggregator, vendor model-card, and API-leaderboard snapshot. The charts are for order-of-magnitude, not procurement-grade ranking.

Artificial Analysis Intelligence Index

Composite of GPQA, MMLU-Pro, AIME, LiveCodeBench and more (Index v4.1); the index runs 0–100, with the live frontier spanning the low 50s to Fable 5's 60. Chart axis 0–60.

Model	Intelligence Index (v4.1)
Claude Fable 5	60
Claude Opus 4.8	56
GPT-5.5 (xhigh)	55
Claude Opus 4.7	54
Claude Sonnet 5	53
GPT-5.5 (high)	53
GLM-5.2 (max)	51

Source class: independent aggregate. Artificial Analysis Intelligence Index v4.1, July 2026.¹¹¹² GLM-5.2 (51) is the highest-ranked open-weights model, seventh overall; the nearest open peers, MiniMax M3 and DeepSeek V4 Pro, score 44 and Kimi K2.6 43. Claude Fable 5 tops the index at 60, back in general availability since 1 July 2026 after the June export-control suspension (§13).⁵¹ The June v4.1 rebase rescored several models: Gemini 3.1 Pro Preview reads 46 and Grok 4.3 (high) 38 here, so their widely quoted pre-rebase 57 and 53 are not comparable. Effort-mode and snapshot differences move models by a few points, so read the ordering, not the decimals.

GPQA-Diamond · graduate-level science (knowledge)

Expert science Q&A, filtered so non-experts cannot web-search the answer. The core knowledge axis. Axis 0–100.

Model	GPQA-Diamond (%)
Gemini 3.1 Pro	94.3
Claude Opus 4.7	94.2
Claude Opus 4.8	93.6
GPT-5.5	93.6
Qwen 3.7 Max	92.4
GLM-5.2	91.2
Kimi K2.6	90.5
DeepSeek V4 Pro	90.1
Grok 4.3	90.1
GLM-5.1	86.2

Source class: independent aggregate + vendor cards, labelled per row. BenchLM / Artificial Analysis GPQA-Diamond aggregate, June 2026.¹³ GLM-5.2 is the highest open-weight row drawn here at 91.2; MiniMax M3's own card reports ~92.9, edging it on this single axis, but as an unreplicated vendor figure it gets no bar, and on the composite index M3 sits well below GLM-5.2 (44 vs 51, above). GPT-5.5 (93.6), Qwen 3.7 Max (92.4), Kimi K2.6 (90.5, Thinking) and DeepSeek V4 Pro (90.1) are vendor-reported under differing eval conditions; Grok 4.3's 90.1 is from a Grok-specific measurement. Claude Fable 5 (restored to general availability 1 July 2026, §13) posts ~91.3 in early single-source roundups, provisional on this saturated axis, so no bar is drawn yet;⁵¹ Sonnet 4.6, GPT-5.4 and MiMo V2.5 Pro have no published GPQA-D, so no bars either.

SWE-bench Pro · real-world coding (agentic)

Contamination-resistant: 1,865 tasks across 41 professional repos (Scale AI), scored Pass@1 (%).¹⁵ Chart axis 0–75.

Model	SWE-bench Pro, Pass@1 (%)
Claude Opus 4.8	69.2
Claude Opus 4.7	64.3
GLM-5.2	62.1
Qwen 3.7 Max	60.6
MiniMax M3	59.0
GPT-5.5	58.6
Kimi K2.6	58.6
GLM-5.1	58.4
DeepSeek V4 Pro	55.4

Source class: vendor-reported "Active" aggregate, June 2026 (tuned agent harnesses).¹⁶ These differ from Scale's standardized public set:¹⁴ on that apples-to-apples set GPT-5.4 (xHigh) tops at 59.1 and scores run 17–21 points lower under identical scaffolding, so the two columns must not be mixed. GLM-5.2's 62.1 is Z.ai-published; Qwen 3.7 Max, GPT-5.5, Kimi K2.6 and DeepSeek V4 Pro are also vendor-reported (V4 Pro matched Gemini 3.1 Pro on SWE-bench Verified at 80.6 at its April launch, since passed by Opus 4.8's 88.6, and lands 55.4 on Pro). GLM-5.1's 58.4 is unverified. MiniMax M3's 59.0 is from its own launch card and unaudited.⁴³ Claude Fable 5's early reported 80.3 sits above this chart's axis and is provisional, so no bar is drawn;⁵¹ Grok 4.3, MiMo V2.5 Pro and Sonnet 4.6 have no published SWE-bench Pro.

Output price · $ per 1M tokens (lower is better)

Where the economic case lives. The open-weight models cluster at or under ~$4.40/M output; closed challengers (Qwen Max) and the proprietary frontier run far higher. Axis 0–$30.

Model	Output price ($ per 1M tokens)
DeepSeek V4 Pro	$0.87
MiMo V2.5 Pro	$0.87
MiniMax M3	$2.40
Grok 4.3	$2.50
Kimi K2.6	$4.00
GLM-5.2	$4.40
Qwen 3.7 Max	$7.50
Gemini 3.1 Pro	$12
Claude Sonnet 4.6	$15
GPT-5.4	$15
Claude Opus 4.8	$25
GPT-5.5	$30

Source: official API price lists, output $ per 1M tokens, June 2026 (OpenRouter + vendor pages).⁹ GLM-5.2 is $4.40 on Z.ai's list and ~$3.00 via OpenRouter ($0.94 input). DeepSeek V4-Pro and MiMo V2.5 Pro share a $0.435/$0.87 card (DeepSeek's is a now-permanent 75% discount off a $1.74/$3.48 list);⁴ Grok 4.3 is $1.25/$2.50; MiniMax M3 lists $0.60/$2.40, with a 50% launch promo halving it to $0.30/$1.20;⁴⁴ Qwen 3.7 Max is closed at $2.50/$7.50. GLM-5.2 sits at the top of the open-weights cost band, still roughly 3–7× below the proprietary frontier's $12–$30; the band as a whole runs 3–34× below it. Claude Fable 5 (list $10 in / $50 out) sits above the $30 axis and is left off for scale; it returned to general availability on 1 July 2026 after the June export-control suspension (§13).⁵¹ Cached-input discounts not shown.

GLM momentum · Terminal-Bench

Two releases in one quarter. The agentic-terminal jump is the headline of the 5.2 update. Axis 0–100.

Model (benchmark version)	Terminal-Bench score
GLM-5.2 (TB 2.1)	81.0
GLM-5.1 (TB 2.0)	63.5

Source: Z.ai GLM-5.2 model card via independent write-ups, 2026.¹⁶ Versions differ (Terminal-Bench 2.1 vs 2.0); read as directional.

The value corner · knowledge vs output price

GPQA-Diamond (knowledge, vertical) against output $/M (horizontal); lower-left is the value corner. The open-weight cluster lands within ~4 points of the proprietary frontier at a third to a thirtieth of the price.

GPQA-Diamond vs official list output price, June 2026.¹³⁹ The open-weight models (gray) cluster at 90–91 for $0.87–$4.40; the closed field (black) runs from the price-comparable Grok 4.3 up to the frontier tier, which buys ~3 more points at 3–34× the open-weight price. GLM-5.2 is the open-weight leader. Axes cropped (86–95 / $0–$30) to separate the field; read position, not absolute distance.

The takeaway

GLM-5.2 trails the closed frontier by ~3 points on knowledge (91.2 vs 94.3 GPQA-D) and ~7 on coding against Opus 4.8 (62.1 vs 69.2 SWE-bench Pro); early, provisional figures for the newer Fable tier read wider on coding. It runs just ahead of Kimi K2.6 and DeepSeek V4 Pro in the open-weights band, with the closed Qwen 3.7 Max in the same score band, and the whole open cluster serves at $0.87–$4.40/M output against $12–$50 for the closed labs. For the routing tiers in §02 that gap closes on routine tickets and document work, and the price advantage compounds across every job. The recommended strategy and path that follow hold: route hard reasoning to a frontier API, run GLM-5.2 (or a cheaper open peer) for the bulk, and own nothing until the numbers justify it.

15 / DEPLOYMENT STRATEGY

Internal, hybrid, and the flexion points

Three ways to put open-weight inference into production, and the thresholds that flip one into the next. The honest headline: at floor API prices, calling a hosted endpoint is the cheapest path for internal use until you can keep a whole node busy. Owning earns its keep through residency, control, or by selling the surplus.

The three modes
Mode	How it runs	Best when
Internal · API	Call a hosted GLM-5.2 / DeepSeek V4 endpoint	Low or bursty volume, fastest start, no residency mandate
Internal · Rent	Dedicated rented 8×H200 node	POC and validation, no capex, EU residency, weeks to months
Internal · Own	Buy the node (~140M HUF)	High sustained utilization plus strict residency, 2–3 yr horizon
Hybrid · Own + resell	Self-host, sell surplus via OpenRouter or direct	You need a node for residency and can capture external demand

The numbers behind the curves

8×H200 node: owned TCO ~68M HUF/yr; rented 24/7 ~94M HUF/yr ($24.8k/mo). Practical sustained output ~500 tok/s (500–1,600 aggregate by workload, §07; hybrid examples below assume 300 tok/s of internal demand). One sustained tok/s = 2.592M output tokens/month. API output: $3/M (OpenRouter floor) to $4.40/M (Z.ai list). FX 311 HUF/$. Output-token basis; input adds to both API and self-host. Planning estimates, not quotes.

Internal use · effective cost per 1M output tokens

Lower is better. Self-producing matches the API only when a node runs near saturation; an owned node at the realistic 500 tok/s (under a third of saturation) costs roughly 3–5× the API per token. Axis 0–6,500 HUF.

Path	HUF per 1M output tokens
API · floor ($3)	933
API · list ($4.40)	1,368
Own · saturated	1,366
Own · realistic	4,373
Rent · realistic	6,044

HUF per 1M output tokens. Own/Rent assume the node serves only your internal traffic; "saturated" = 1,600 tok/s 24/7, "realistic" = 500 tok/s 24/7. The idle capacity is what hybrid resale reclaims.
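
The bars above follow from the stated inputs by plain division. A minimal sketch, using only the figures quoted in this section (68M and 94M HUF/yr, 311 HUF/$, 2.592M tokens per sustained tok/s per month); the function names are illustrative:

```python
HUF_PER_USD = 311
TOKENS_PER_TOKS_MONTH = 2.592e6            # one sustained tok/s for 30 days
TOKENS_PER_TOKS_YEAR = 12 * TOKENS_PER_TOKS_MONTH


def api_huf_per_mtok(usd_per_mtok: float) -> float:
    """API price converted to HUF per 1M output tokens."""
    return usd_per_mtok * HUF_PER_USD


def node_huf_per_mtok(annual_cost_huf: float, sustained_tok_s: float) -> float:
    """Annual node cost divided by millions of output tokens served per year."""
    mtok_per_year = sustained_tok_s * TOKENS_PER_TOKS_YEAR / 1e6
    return annual_cost_huf / mtok_per_year


rows = {
    "API floor ($3)": api_huf_per_mtok(3.00),
    "API list ($4.40)": api_huf_per_mtok(4.40),
    "Own saturated": node_huf_per_mtok(68e6, 1_600),
    "Own realistic": node_huf_per_mtok(68e6, 500),
    "Rent realistic": node_huf_per_mtok(94e6, 500),
}
for label, huf in rows.items():
    print(f"{label:<18} {huf:8.0f} HUF / 1M output tokens")
```

The results land within a forint of the chart values; the only lever that moves the Own and Rent rows is the sustained tok/s you can actually keep on the node.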

The flexion curve · cost vs node utilization

Effective internal cost per 1M output tokens as a node fills. Owning kisses the API line only at full utilization; renting never does. Most internal demand sits far left, where the API wins outright.

HUF per 1M output tokens. Own = 68M HUF/yr ÷ served tokens; Rent = 94M HUF/yr ÷ served tokens; API = flat list rate. Self-hosting's per-token cost is a function of how full you keep the box, which is the entire argument for hybrid.
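
The curve itself is one division per utilization point. The sketch below, under the same inputs (1,600 tok/s at saturation, 311 HUF/$, the $4.40 list rate), also solves for the utilization at which a node meets the API line; a result above 1.0 means the node cannot reach it. Names are illustrative.

```python
SATURATED_TOK_S = 1_600
MTOK_PER_TOKS_YEAR = 12 * 2.592          # Mtok per sustained tok/s per year
API_LIST_HUF = 4.40 * 311                # HUF per 1M output tokens


def huf_per_mtok(annual_cost_huf: float, utilization: float) -> float:
    """Effective cost per 1M output tokens at a given node utilization (0-1]."""
    served_mtok = utilization * SATURATED_TOK_S * MTOK_PER_TOKS_YEAR
    return annual_cost_huf / served_mtok


def breakeven_utilization(annual_cost_huf: float, api_huf: float = API_LIST_HUF) -> float:
    """Utilization at which the node's per-token cost equals the API rate."""
    return annual_cost_huf / (SATURATED_TOK_S * MTOK_PER_TOKS_YEAR * api_huf)


for u in (0.10, 0.25, 0.50, 0.75, 1.00):
    own = huf_per_mtok(68e6, u)
    rent = huf_per_mtok(94e6, u)
    print(f"util {u:4.0%}  own {own:8.0f}  rent {rent:8.0f}  api {API_LIST_HUF:6.0f}")

print(f"own break-even utilization:  {breakeven_utilization(68e6):.3f}")
print(f"rent break-even utilization: {breakeven_utilization(94e6):.3f}")
```

With these inputs the owned node meets the list rate only at essentially full utilization, and the rented node's break-even sits above 1.0, which is the "renting never does" of the caption.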

Hybrid · effective annual internal cost

At 300 tok/s of internal demand. Owning a node for that alone wastes most of it; reselling the surplus recovers 20–60M HUF/yr, if the demand routes to you. Axis 0–70M HUF.

Path	Net HUF/yr
API · internal only	12.8M
Own · node mostly idle	68M
Hybrid · own + resell	8–48M

Net HUF/yr for the internal 300 tok/s. Hybrid = 68M owned TCO minus 20–60M resale revenue on surplus capacity (gated on real OpenRouter demand and a prompt-heavy, cache-friendly mix, §09). For pure internal use the API still wins unless residency or capture changes the math.
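
The three figures can be reproduced from the section's inputs. A sketch, assuming the API row is priced at the $4.40 list rate (the floor rate would give a lower number) and taking the 20–60M HUF resale range as given rather than derived:

```python
MTOK_PER_TOKS_YEAR = 12 * 2.592   # Mtok per sustained tok/s per year
HUF_PER_USD = 311


def api_annual_huf(tok_s: float, usd_per_mtok: float = 4.40) -> float:
    """Annual API bill for a sustained internal demand, output tokens only."""
    return tok_s * MTOK_PER_TOKS_YEAR * usd_per_mtok * HUF_PER_USD


def hybrid_net_huf(owned_tco_huf: float, resale_revenue_huf: float) -> float:
    """Owned TCO net of whatever surplus capacity actually sells."""
    return owned_tco_huf - resale_revenue_huf


print(f"API, 300 tok/s internal: {api_annual_huf(300) / 1e6:.1f}M HUF/yr")
print(f"Own, no resale:          {hybrid_net_huf(68e6, 0) / 1e6:.0f}M HUF/yr")
for resale in (20e6, 60e6):
    net = hybrid_net_huf(68e6, resale)
    print(f"Hybrid, {resale / 1e6:.0f}M resale:     {net / 1e6:.0f}M HUF/yr")
```

The resale figure is the uncertain input: at the low end of the range the hybrid node still costs several times the API bill for the same internal demand.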

The decision in one tree

strategy

residency mandatory?  yes -> self-host (rent, then own). resell surplus if demand exists.
         no
          v
demand near a full node 24/7?  yes -> own (cheapest at saturation). resell overflow.
         no
          v
bursty / below node ceiling?   -> API (3-6x cheaper per token). revisit at scale.
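
The same tree as a function, for wiring into a planning notebook. This is a sketch: the 1,600 tok/s ceiling is the section's saturated figure, while the 0.9 cut-off for "near a full node" is an assumption, not a number from the paper.

```python
def choose_mode(
    residency_mandatory: bool,
    sustained_tok_s: float,
    node_ceiling_tok_s: float = 1_600,
    near_full: float = 0.9,   # assumed threshold for "near a full node 24/7"
) -> str:
    """Walk the decision tree top to bottom and return the recommended mode."""
    if residency_mandatory:
        return "self-host: rent, then own; resell surplus if demand exists"
    if sustained_tok_s >= near_full * node_ceiling_tok_s:
        return "own: cheapest at saturation; resell overflow"
    return "API: cheaper per token below the node ceiling; revisit at scale"


print(choose_mode(residency_mandatory=False, sustained_tok_s=300))
print(choose_mode(residency_mandatory=False, sustained_tok_s=1_550))
print(choose_mode(residency_mandatory=True, sustained_tok_s=50))
```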

The flexion points

① API → self-host: only when sustained demand approaches a full node 24/7 (~1,600 tok/s), or residency forbids the API. Below that, the API is 3–6× cheaper per token.
② Rent → own: above ~71% time-utilization over the depreciation window, owning (68M/yr) beats renting 24/7 (94M/yr); below it, rent or burst.
③ Own → hybrid: once you own for residency or control, every idle tok/s is wasted capex. Selling surplus pulls effective internal cost toward, or below, the API, but only if prompt-heavy, cache-friendly demand actually routes to you.
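
The rent-to-own threshold is the ratio of the two annual costs: owning is a fixed bill, renting scales with the hours the node is switched on. A sketch with the rounded §15 inputs, which give roughly 72%, in line with the ~71% quoted; the small gap is presumably rounding in the inputs.

```python
OWN_HUF_PER_YEAR = 68e6        # owned TCO, paid whether or not the node is busy
RENT_HUF_PER_YEAR_247 = 94e6   # rented node left on all year


def rent_cost_huf(fraction_of_year_on: float) -> float:
    """Rental bill if the node is on for the given fraction of the year."""
    return fraction_of_year_on * RENT_HUF_PER_YEAR_247


def breakeven_fraction() -> float:
    """Time-utilization above which owning is cheaper than renting."""
    return OWN_HUF_PER_YEAR / RENT_HUF_PER_YEAR_247


f = breakeven_fraction()
print(f"rent-to-own break-even: {f:.1%} of the year")
for on in (0.25, 0.50, 0.75, 1.00):
    cheaper = "own" if rent_cost_huf(on) > OWN_HUF_PER_YEAR else "rent"
    print(f"node on {on:4.0%}: rent {rent_cost_huf(on) / 1e6:5.1f}M vs own 68.0M -> {cheaper}")
```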

16 / ON-DEMAND GPU HOSTING

Rent the node by the hour

Between calling an API and owning the box sits the on-demand GPU host: you rent accelerators by the hour or the second, start a vLLM or SGLang endpoint on them, and stop paying the moment the work stops. It is the honest first step in this paper's plan ("own nothing", §18), and for a large class of workloads it is not a stepping stone to ownership; it is the destination. The rates have fallen hard: H100 rental dropped roughly 64 to 75 percent between late 2024 and early 2026, and Blackwell followed as HBM3e supply loosened.⁴⁵⁴⁶

On-demand price per GPU-hour · mid-2026
GPU (VRAM)	On-demand $/GPU-hr	What it serves
H100 80 GB	~$1.00–2.70 neocloud · ~$3.90–6.98 hyperscaler	One to two mid-size models
H200 141 GB	~$2.60–4.00 · up to ~$6.31 top tier	2× = DeepSeek V4-Flash (284B)
B200 180 GB	~$3.20–6.70 · ~$2.25 reserved · ~$14 hyperscaler	8× = GLM-5.2-NVFP4

On-demand list rates as of mid-2026, per GPU-hour. Marketplace and neocloud lows: Vast.ai and Spheron (H100 from ~$1.03), RunPod (~$1.99–2.69), GMI Cloud (H200 ~$2.60); Lambda (H100 ~$3.99) and CoreWeave (H200 ~$6.31) price at the premium tier, AWS and Azure higher still.⁴⁵⁴⁷²⁷ B200 on-demand spans ~$2.99–27.04 across providers (average near $5.19), dropping to ~$2.25 on a 36-month reservation; spot prices swing week to week.⁴⁸ Verify a live quote and a real EU-region option before you rely on any figure here.

Assembled into the heavy node, an 8×H200 rents for roughly $21 to $34 an hour, about $15,000 to $25,000 a month run 24/7, the same envelope as the rent line in §08. But 24/7 is exactly where on-demand stops paying: a node left on all month is priced like a node you should have bought. On-demand earns its keep when you do not keep the meter running, or when something other than raw cost per token decides the question.
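
The monthly envelope is hours times the node rate, and the same arithmetic shows why switching the node off matters. A sketch: the 720-hour month and the burst schedule (6 hours a night, 22 nights) are illustrative assumptions, and the per-GPU rate comes from the table above.

```python
HOURS_PER_MONTH = 720  # 30 days, run 24/7


def node_hourly_usd(per_gpu_hour_usd: float, gpus: int = 8) -> float:
    """Hourly rate for a whole node from the per-GPU rate."""
    return per_gpu_hour_usd * gpus


def bill_usd(node_hour_usd: float, hours: float = HOURS_PER_MONTH) -> float:
    """Bill for the hours the meter actually runs."""
    return node_hour_usd * hours


print(f"8xH200 at $2.60/GPU-hr: ${node_hourly_usd(2.60):.2f}/hr")
for rate in (21, 34):
    print(f"${rate}/hr, 24/7:  ${bill_usd(rate):,.0f}/month")

# Hypothetical burst schedule: 6 hours a night, 22 nights a month.
burst_hours = 6 * 22
print(f"$21/hr, burst: ${bill_usd(21, burst_hours):,.0f}/month")
```

The burst line is the on-demand case in miniature: the same node at under a fifth of the always-on bill, provided the job tolerates a cold start.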

When an on-demand GPU beats the API
Scenario	Why the GPU wins
Sustained high volume	Near a saturated node the effective cost per 1M output tokens falls under the API list (§15). Once you can keep a rented node busy, you stop paying the per-token markup.
EU data residency	Inference on an EU-region node keeps prompt processing inside the EEA. Calling a US API with personal data in the prompt is a GDPR Chapter V transfer: lawful with a valid mechanism, but an extra compliance surface that EU-region hosting removes (§13). Filter hosts to an EU region first.
Custom or private models	Your fine-tuned LoRA adapters, a pruned or quantized derivative (§11), or a niche open model no public API serves. On-demand runs your exact checkpoint.
Burst batch jobs	Overnight document processing, eval runs, embedding backfills: rent a big node for hours, push millions of tokens through it, tear it down. A one-shot batch on rented iron beats metered per-token pricing.
Predictable, capped spend	A fixed hourly node is a known bill. Per-token API cost spikes with a busy month or a runaway agent loop.
Continuity, no lock-in	No rate limits, no throttling, no single-vendor dependency. A frontier API can be switched off by another country's policy overnight (§13).
Latency and privacy control	Dedicated capacity with no noisy-neighbour contention, concurrency tuned to your SLA, and prompts that never leave your tenancy (zero-retention by construction).

The API still wins the other side of that list. For low or bursty small volume, no residency mandate, or when a hard minority of jobs needs the absolute frontier ceiling, a hosted endpoint is cheaper and simpler: no cold starts, no ops, someone else's on-call. The paper's default holds either way: route the hard minority to a frontier API, run an open model on rented or owned GPUs for the bulk, and own only once §15's flexion points say so.

Operator's note · the lines that move the bill

A checkpoint takes minutes to pull from Hugging Face and load into VRAM, so keep a warm pool or fast local NVMe for anything user-facing, and prefer per-second or per-minute billing for spiky work. Spot and community tiers are the cheapest but interruptible: use them for checkpointable batch, never a live endpoint. Watch the hidden line items: egress, block storage for the weights, and idle hours you forgot to stop. A node billed hourly and left running all weekend is the classic on-demand overspend. Reserved terms cut the rate but trade away the flexibility that is the entire point. And availability is real: Blackwell supply is still tight, so pin a region and a fallback host.

The market splits three ways. Marketplaces and neoclouds (Vast.ai, RunPod, Spheron, Lambda, CoreWeave, Nebius, Crusoe, Together) are the cheapest and the right default; managed serverless (Modal, Baseten, Together) trades a margin for near-zero ops; hyperscalers (AWS, Azure, GCP) cost the most and earn it only when you are already inside their compliance estate. For an EU deployment, filter to a genuine EU region first and price second.

On-demand is not the compromise between the API and owning. For most teams it is the answer: the flexibility of the API with the residency, privacy, and per-token economics of the metal.

This is the "Internal · Rent" mode of §15, priced. The buy case, for when a rented node's meter finally justifies owning one, is §17. On-demand rates: Spheron, IntuitionLabs, GMI Cloud, and getDeploying, June to July 2026.⁴⁵⁴⁶⁴⁷⁴⁸

17 / THE MOST EFFICIENT BUILD

What to actually buy

We put four realistic builds through the same three-lens analysis (hardware, operations, and cost), then verified every load-bearing number against live June-2026 pricing and published throughput. The question is not which GPU is fastest; it is which node delivers the lowest realistic cost per 1M output tokens for the smallest capital commitment and the widest workload coverage. One build wins that three-way trade outright.⁴⁰

Effective cost per 1M output tokens, by build (lower is better)

Realistic-utilization HUF per 1M output tokens. The 8×RTX PRO and 8×B200 nodes are effectively tied at the floor; the lean and AMD builds cost 3.6–5.6× as much per token. Axis 0–750 HUF.

Build	HUF per 1M output tokens
8×B200 NVFP4	126
8×RTX PRO 6000	127
2×H200 lean	453
8×MI300X AMD	715

HUF per 1M output tokens at realistic utilization, multi-agent internal estimate, June 2026.⁴⁰ These are multi-model-menu figures at high aggregate throughput (~9,000 tok/s), not directly comparable to §15's single-GLM-5.2 internal-only number (~4,373 HUF/Mtok at 500 tok/s on an 8×H200). The B200 is one HUF cheaper per token but costs 3× the capex (next chart). The AMD "value" build does not survive verification: its cheap HBM is offset by lower realized throughput on the open-weight serving path.

Up-front capex, by build

Gross HUF millions to own the node. The multi-LLM RTX build lands at a third of the B200's capex for the same per-token cost, priced at verified street, not headline list. Axis 0–150M HUF.

Build	Gross capex (HUF)
2×H200 lean	28.4M
8×RTX PRO 6000	47.3M
8×MI300X AMD	93M
8×B200 NVFP4	141M

Gross HUF millions. The RTX figure uses the verified street price of ~$9–13.25k per card, not the ~$32.7k Lenovo list that inflates the build 2.4×.²⁶ RTX floor ~40M, list-ceiling ~115M. The 8×B200 figure is the most provider-variable of the four, shown near the lower bound of quoted HGX-B200 systems; confirm a live quote before relying on B200-versus-RTX parity.

Output tokens per USD of 3-year TCO (higher is better)

The capital-efficiency view: how many output tokens each dollar of total cost buys. The RTX and B200 nodes return ~3.6–5.6× the lean and AMD builds. Values shown in millions of tokens. Axis 0–2.5M.

Build	Output tokens per USD of 3-year TCO
8×B200 NVFP4	2.48M
8×RTX PRO 6000	2.46M
2×H200 lean	0.69M
8×MI300X AMD	0.44M

Output tokens per USD of 3-year TCO, shown in millions. B200 and RTX PRO sit within 1% of each other; the RTX node reaches it at a third of the capital outlay, which is why it wins on capital efficiency.
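
The ratios quoted across these three charts can be checked from the chart values alone. A short sketch that takes the RTX PRO node as the baseline; the dictionary simply restates the per-token and capex figures above:

```python
# (HUF per 1M output tokens, gross capex in millions of HUF), from the charts above.
BUILDS = {
    "8xB200 NVFP4": (126, 141.0),
    "8xRTX PRO 6000": (127, 47.3),
    "2xH200 lean": (453, 28.4),
    "8xMI300X AMD": (715, 93.0),
}
BASELINE = "8xRTX PRO 6000"


def relative_to_baseline(build: str) -> tuple:
    """Return (per-token cost ratio, capex ratio) against the baseline build."""
    cost, capex = BUILDS[build]
    base_cost, base_capex = BUILDS[BASELINE]
    return cost / base_cost, capex / base_capex


for name in BUILDS:
    cost_x, capex_x = relative_to_baseline(name)
    print(f"{name:<16} cost/token {cost_x:4.1f}x   capex {capex_x:4.1f}x")
```

Read together, the two columns are the verdict: the B200 matches the baseline per token at about three times the capital, while the lean and AMD builds cost several times as much per token.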

The verdict · most efficient build

The 8×RTX PRO 6000 Blackwell node is the most efficient build: ~127 HUF per 1M output tokens (about $0.41/M) at ~47M HUF street capex, a ~101M HUF three-year TCO, and ~9,000 aggregate output tok/s. It ties the 8×B200 on cost per token at a third of the capital, and it is the only build that serves a whole menu of models at once (flexibility 9/10). Buy the B200 node only when one giant 0.75–1.6T model must live in a single process; buy the 2×H200 lean node (28.4M HUF) only for one mid-tier model. The AMD MI300X "value" build is the trap: cheap HBM, the worst realized cost per token of the four.

The multi-LLM node, allocated

Efficiency comes from never letting a card idle. One 8-GPU RTX PRO node runs a full product menu behind the gateway (§12), each card carrying its own model, with customer-specific SKUs as LoRA adapters over a shared base:

8×RTX PRO 6000 · most-efficient allocation

GPU 0-1  2x 35B-A3B NVFP4   Qwen3.6-35B-A3B                  # interactive backbone, 1 model/card, TP=1
GPU 2-4  MiniMax M3 NVFP4   frontier open coding / agentic   # premium SKU, ~3 cards (~428B NVFP4), Jun-2026
GPU 5-6  14B base + LoRA    Qwen3.5-14B + per-tenant adapters # dozens of customer SKUs (multi-LoRA)
GPU 7    embed + reranker   Qwen3-Embedding + Qwen3-Reranker  # RAG tier + warm failover spare
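
One way to turn that allocation into worker environments is to generate CUDA_VISIBLE_DEVICES per worker (§12) and check that no card is assigned twice. A sketch: the worker names are hypothetical, and treating GPUs 5-6 as a single two-card worker is an assumption, since the allocation does not say how those cards are split.

```python
from typing import Dict, List, Tuple

# Worker name -> GPU indices, following the allocation above.
ALLOCATION: List[Tuple[str, List[int]]] = [
    ("qwen-35b-a3b-0", [0]),
    ("qwen-35b-a3b-1", [1]),
    ("minimax-m3", [2, 3, 4]),
    ("qwen-14b-multilora", [5, 6]),   # assumed to be one worker on two cards
    ("embed-rerank", [7]),
]


def worker_env(gpus: List[int]) -> Dict[str, str]:
    """Environment fragment that pins a worker to its cards."""
    return {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpus)}


def validate(allocation: List[Tuple[str, List[int]]], n_gpus: int = 8) -> None:
    """Raise if a card is double-booked or left without a worker."""
    used = [g for _, gpus in allocation for g in gpus]
    if len(used) != len(set(used)):
        raise ValueError("a GPU is assigned to more than one worker")
    if sorted(used) != list(range(n_gpus)):
        raise ValueError("allocation does not cover every GPU exactly once")


validate(ALLOCATION)
for name, gpus in ALLOCATION:
    print(f"{name:<20} {worker_env(gpus)}")
```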

Monthly economics: ~47.3M HUF capex over 3 years ≈ 1.31M HUF/mo; opex ~17.9M HUF/yr ≈ 1.49M HUF/mo (power ~0.61M: ~6.3kW IT load × PUE 1.4 ≈ 8.8kW wall, at ~95 HUF/kWh; ~0.2 FTE MLOps ~0.45M at §08's fully-loaded rate; warranty, spares and colo ~0.43M). PCIe-only (no NVLink) is the binding constraint: keep each model on as few cards as it fits (one to three in the allocation above) and never shard a 0.75T MoE across the bus.²⁶
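
The monthly lines can be rebuilt from their parts. A sketch using the figures in the paragraph above; the 730-hour month (8,760 hours / 12) is an assumption the paper does not state, and the MLOps and warranty/colo lines are taken as given:

```python
HOURS_PER_MONTH = 730          # assumed: 8,760 hours / 12

CAPEX_HUF = 47.3e6
OPEX_HUF_PER_YEAR = 17.9e6
IT_LOAD_KW = 6.3
PUE = 1.4
HUF_PER_KWH = 95
MLOPS_HUF_PER_MONTH = 0.45e6   # ~0.2 FTE, as given
OTHER_HUF_PER_MONTH = 0.43e6   # warranty, spares, colo, as given

wall_kw = IT_LOAD_KW * PUE
power_huf_per_month = wall_kw * HOURS_PER_MONTH * HUF_PER_KWH
opex_huf_per_month = power_huf_per_month + MLOPS_HUF_PER_MONTH + OTHER_HUF_PER_MONTH
capex_huf_per_month = CAPEX_HUF / 36
tco_3yr_huf = CAPEX_HUF + 3 * OPEX_HUF_PER_YEAR

print(f"wall power:      {wall_kw:.1f} kW")
print(f"power:           {power_huf_per_month / 1e6:.2f}M HUF/mo")
print(f"opex:            {opex_huf_per_month / 1e6:.2f}M HUF/mo")
print(f"capex, 36 mo:    {capex_huf_per_month / 1e6:.2f}M HUF/mo")
print(f"three-year TCO:  {tco_3yr_huf / 1e6:.1f}M HUF")
```

Power is the line most exposed to local conditions; rerun it with your own tariff and PUE before comparing sites.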

The most efficient frontier-AI node in 2026 is not the biggest one. It is eight mid-size Blackwell cards, each kept busy with a different model.

Method: four builds, each analysed by an independent hardware, operations, and cost agent, reconciled by a verifier against live pricing and published throughput, then ranked on realistic cost per 1M output tokens weighed against capex and flexibility. Figures are planning estimates at June-2026 street prices; confirm live quotes and run a throughput PoC on rented hardware before purchase.⁴⁰⁴¹⁴²

18 / RECOMMENDED PATH

The phased plan

Phase 1, pick the model and prove value, own nothing. Choose by footprint and license: DeepSeek V4-Flash for a cheap 2-GPU pilot, GLM-5.2 / Kimi K2.6 / DeepSeek V4-Pro for the heavy tier. Stand it up (API or rented node), cap context to 128k–200k, wire RAG and tools, build the workflow eval suite, and run pilots with read-only tools first.
Phase 2, measure. Track jobs/day, input and output tokens, latency, concurrency, human-review time, time saved, and failed-workflow rate, per model, so you can compare a cheap open peer against the heavy default on your own work.
Phase 3, buy on evidence. Purchase hardware when utilization stays high, the API is off the table, or measured savings clear ~70M HUF/yr. Match the box to the model: 2×H200 suffices for a 284B model, Blackwell (B200/B300) for the 0.75–1.6T tier, RTX PRO 6000 for multi-model. Gate Ascend and 4-GPU quantized derivatives behind a PoC.

Go/no-go · buy only when all six hold on rented hardware

Node sustained aggregate output TPS in your target range.
OpenRouter-visible TTFT and throughput competitive under real concurrency.
Above 95% uptime past the first 100+ monitored requests.
Cache-hit rate and blended revenue per 1M output tokens that cover node opex.
Tool-call success high enough for Auto Exacto traffic.
Quality parity against the FP8 baseline on your own eval set.
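
The gate is an all-of check, and it helps to make it report which condition failed. A sketch: only the 95% uptime and 100-request thresholds come from the checklist; the other five conditions are left as operator judgments recorded as booleans, and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

MIN_UPTIME = 0.95               # from the checklist
MIN_MONITORED_REQUESTS = 100    # from the checklist


@dataclass
class PilotResult:
    throughput_in_target_range: bool
    openrouter_ttft_and_throughput_competitive: bool
    uptime: float                   # fraction of successful monitored requests
    monitored_requests: int
    revenue_covers_node_opex: bool
    tool_call_success_sufficient: bool
    quality_parity_with_fp8: bool


def failed_gates(r: PilotResult) -> List[str]:
    """Return the names of the conditions that do not hold."""
    checks = {
        "throughput": r.throughput_in_target_range,
        "openrouter latency/throughput": r.openrouter_ttft_and_throughput_competitive,
        "uptime": r.uptime > MIN_UPTIME and r.monitored_requests >= MIN_MONITORED_REQUESTS,
        "unit economics": r.revenue_covers_node_opex,
        "tool calls": r.tool_call_success_sufficient,
        "quality parity": r.quality_parity_with_fp8,
    }
    return [name for name, ok in checks.items() if not ok]


def go(r: PilotResult) -> bool:
    """Buy only when every gate holds."""
    return not failed_gates(r)


pilot = PilotResult(True, True, 0.97, 2_400, False, True, True)
print("go" if go(pilot) else f"no-go: {failed_gates(pilot)}")
```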

Hardware verdicts at a glance
Goal	Pick
Best ROI flexibility	8× RTX PRO 6000 Blackwell 96 GB
Best GLM-5.2 path	8× B200 / B300 (NVFP4)
Best AMD value	8× MI300X (if cheap)
Avoid (new, for GLM-5.2)	8× H200 unless much cheaper than B200
Speculative only	Ascend, after a throughput PoC

19 / WHERE LAVX FITS

What we do, in plain verbs

LavX Managed Systems builds and operates production AI for European business. On a project like this, we stand up the rented endpoint, wire the gateway, RAG, and tool safety, build the eval suite, and run the pilot. Then we tell you, with your numbers, whether to buy.

EU data residency, configurable per engagement up to full on-prem. GDPR-aligned by default, with DPA, sub-processors, logging, support access, and retention set per engagement; AI-system transparency disclosures where applicable, including chatbot and generative-content disclosures under the EU AI Act; on EU-resident deployments, data stays on EU infrastructure.
Multi-provider gateway with best-fit routing, no vendor lock-in, and the small/large-model split that makes ROI work.
Source-cited answer mode available. Every reply auditable, every tool call logged.
In production since 2023; 99.9% availability target for managed production systems, defined per SLA in each engagement.

We ship the code. We run the evals. We answer within one business day with a concrete next step.

Laszlo Adam Toth

LavX Managed Systems · Budapest, EU

This paper reflects production experience standing up self-hosted and hybrid open-weight inference for European teams. Every load-bearing figure carries a numbered source below and was verified at the time of writing; pricing, benchmark snapshots, and FX move, so confirm live quotes and current leaderboards before you commit capital. Corrections and scoped pilots: lavx.hu.

References & sources

Numbered sources for the load-bearing figures, by class: vendor model-card, independent aggregator / leaderboard, and primary regulator / reseller listing. All accessed June to July 2026; pricing, benchmark snapshots and FX move, so verify live before procurement.

1. Z.ai. "GLM-5.2: Built for Long-Horizon Tasks" (model card). huggingface.co/blog/zai-org/glm-52-blog
2. Together AI. "GLM-5.2" model page (specs, 1M context). together.ai/models/glm-52
3. DeepSeek. "DeepSeek-V4-Pro" (weights, 1.6T/49B, MIT). huggingface.co/deepseek-ai/DeepSeek-V4-Pro
4. Morph. "DeepSeek V4: 1.6T MoE, 1M Context, $0.87/M Output." morphllm.com/deepseek-v4
5. Modular. "DeepSeek V4 Flash Inference, 284B MoE with 1M Context" (~158 GB native FP4+FP8). modular.com/models/deepseek-v4-flash
6. OpenRouter. "Kimi K2.7 Code" (1T/32B, 256K, INT4, $4.00/M out). openrouter.ai/moonshotai/kimi-k2.7-code
7. OpenRouter. "Qwen3.7 Max" (proprietary, $2.50/$7.50). openrouter.ai/qwen/qwen3.7-max
8. Xiaomi. "MiMo-V2.5-Pro" (open weights, MIT, 1.02T/42B). mimo.xiaomi.com/mimo-v2-5-pro
9. BenchLM. "LLM API Pricing Comparison 2026." benchlm.ai/llm-pricing
10. OpenRouter / Z.ai. "GLM-5.2 API pricing" ($0.94 in / $3.00 out; Z.ai list $1.40/$4.40). openrouter.ai/z-ai/glm-5.2
11. Artificial Analysis. "Intelligence Index" (composite, 0–100). artificialanalysis.ai/evaluations/artificial-analysis-intelligence-index
12. BenchLM. "Artificial Analysis Intelligence Index, 136-model aggregate." benchlm.ai/benchmarks/artificialAnalysis
13. BenchLM. "GPQA-Diamond aggregate, 2026." benchlm.ai/benchmarks/gpqaDiamond
14. Scale AI. "SWE-bench Pro public leaderboard" (standardized set). labs.scale.com/leaderboard/swe_bench_pro_public
15. Scale AI et al. "SWE-Bench Pro: Long-Horizon Software Engineering" (arXiv 2509.16941). arxiv.org/html/2509.16941v1
16. apidog. "GLM-5.2 Benchmarks and Specs" (SWE-bench Pro Active, Terminal-Bench). apidog.com/blog/glm-5-2-benchmarks
17. vLLM. "zai-org/GLM-5.2 recipe" (≥0.23, TP=8, FP8/NVFP4, MTP, glm47/glm45 parsers). recipes.vllm.ai/zai-org/GLM-5.2
18. vLLM. "GLM-5 / GLM-5.1 Series Usage" (8×H200 527/4,640 tok/s, 256k/32k/c32). docs.vllm.ai/projects/recipes/en/stable/GLM/GLM5.html
19. Phala. "Running GLM-5.2 1M Context on a Single 8×H200 Node" (658→733 tok/s, EAGLE). phala.com/posts/glm-5-2-1m-context-8xh200
20. vLLM. "LoRA Adapters" (--enable-lora, --lora-modules, --max-loras). docs.vllm.ai/en/latest/features/lora
21. Cerebras. "REAP: One-Shot Pruning for Trillion-Parameter MoE" (arXiv 2510.13999). arxiv.org/abs/2510.13999
22. AMD. "GLM-5.2-MXFP4" (Quark V0.11, GSM8K 94.09→93.93, 99.8% recovery). huggingface.co/amd/GLM-5.2-MXFP4
23. Unsloth. "GLM-5.2: How to Run Locally" (GGUF 2-bit ~239 GB). unsloth.ai/docs/models/glm-5.2
24. CTO Servers (UK). "NVIDIA HGX H200 8×GPU 141GB server" (£268,785). ctoservers.com/nvidia-hgx-h200
25. Jarvislabs. "NVIDIA H200 Price Guide 2026" (server + rental rates). jarvislabs.ai/blog/h200-price
26. Lenovo Press. "ThinkSystem NVIDIA RTX PRO 6000 Blackwell Server Edition" (96 GB, PCIe Gen5, no NVLink). lenovopress.lenovo.com/lp2263
27. getDeploying. "H200 Cloud Pricing: 33+ Providers (2026)." getdeploying.com/gpus/nvidia-h200
28. vLLM Ascend. "GLM-5.2 W8A8 on Atlas 800 A3/A2." docs.vllm.ai/projects/ascend/en/main/tutorials/models/GLM5.2.html
29. OpenRouter. "Provider Routing" (inverse-square price weighting, 30s outage window). openrouter.ai/docs/guides/routing/provider-selection
30. OpenRouter. "Auto Exacto" (tool-call routing). openrouter.ai/docs/guides/routing/auto-exacto
31. OpenRouter. "Provider Integration" (/v1/models schema; id + pricing required). openrouter.ai/docs/guides/community/for-providers
32. OpenRouter. "Prompt Caching" (sticky routing). openrouter.ai/docs/guides/best-practices/prompt-caching
33. OpenRouter. "Become a Provider" (review, backlog, 95%/80% standing policy). openrouter.ai/providers/apply
34. European Commission. "Regulatory framework on AI" (GPAI obligations 2 Aug 2025; enforcement 2 Aug 2026). digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai
35. European Commission. "Guidelines for providers of general-purpose AI models" (one-third-compute modification threshold). digital-strategy.ec.europa.eu/en/policies/guidelines-gpai-providers
36. EU AI Act. "Article 50: Transparency Obligations." artificialintelligenceact.eu/article/50
37. US BIS. "Guidance on General Prohibition 10, PRC advanced-computing ICs" (Ascend 910B/910C/910D, 13 May 2025). bis.gov/media/documents/general-prohibition-10-guidance-may-13-2025.pdf
38. Anthropic. "Statement on the US government directive to suspend access to Fable 5 and Mythos 5" (12 June 2026). anthropic.com/news/fable-mythos-access
39. CNN Business. "Anthropic suspends all access to Mythos model after US government bars foreign nationals' use" (13 June 2026); follow-up: Trump administration later cleared Mythos 5 for ~100 vetted US entities. cnn.com/2026/06/13/business/anthropic-mythos-model-national-security
40. TRG Datacenters. "NVIDIA H200 Price Guide" (H200 NVL ~$31-32k; 8-GPU SXM ~$308-315k; street GPU pricing). trgdatacenters.com/resource/nvidia-h200-price-guide
41. Spheron. "Deploy DeepSeek V4-Flash on GPU Cloud" (284B native FP4+FP8 ~158GB, ~175GB total with 1M KV). spheron.network/blog/deploy-deepseek-v4-flash-gpu-cloud
42. vLLM Recipes. "DeepSeek-V4-Flash" (TP, FP8 KV, MTP serving config). recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash
43. MiniMax. "MiniMax-M3" model card (~428B total / ~23B active, MSA, 1M context, native multimodal; SWE-Bench Pro 59.0 self-reported; weights on Hugging Face 7 Jun 2026). huggingface.co/MiniMaxAI/MiniMax-M3
44. OpenRouter. "MiniMax M3" API pricing ($0.60/$2.40 list; $0.30/$1.20 launch promo). openrouter.ai/minimax/minimax-m3
45. Spheron. "GPU Cloud Pricing Comparison 2026" (H100 from $1.03/hr, B200 from $2.12/hr, 15+ providers; H100 rental down 64–75% since Q4 2024). spheron.network/blog/gpu-cloud-pricing-comparison-2026
46. IntuitionLabs. "H100 Rental Prices Compared: $1.49–$6.98/hr Across 15+ Cloud Providers (2026)." intuitionlabs.ai/articles/h100-rental-prices-cloud-comparison
47. GMI Cloud. "CoreWeave, Lambda, Nebius, GMI: H200 GPU Provider Pricing 2026" (H200 $2.60/hr on-demand; CoreWeave $6.31/hr). gmicloud.ai/en/blog/h200-gpu-provider-pricing
48. getDeploying. "NVIDIA B200 Cloud Pricing: 24+ Providers (2026)" ($2.99–27.04/GPU-hr on-demand; average ~$5.19; from ~$2.25 reserved). getdeploying.com/gpus/nvidia-b200
49. EDPB. "Guidelines 05/2021 on the interplay between Article 3 and Chapter V GDPR" v2.0 (the three cumulative transfer criteria). edpb.europa.eu/guidelines-05-2021
50. CJEU. Latombe v Commission: General Court dismissal T-553/23 (3 Sept 2025), appeal C-703/25 P pending (EU-US Data Privacy Framework). curia.europa.eu/cp250106en.pdf
51. Anthropic. "Redeploying Claude Fable 5" (30 June 2026; general availability restored 1 July 2026, Mythos 5 restricted to ~100 vetted US organizations). anthropic.com/news/redeploying-fable-5

lavx.hu · LavX Managed Systems · Budapest, EU
Figures are planning estimates from public pricing and benchmarks at time of writing. Verify against live quotes for your config.
This paper is technical and economic guidance, not legal, tax, procurement, or compliance advice; validate decisions with your counsel and accounting.