“CJK languages (Chinese / Japanese / Korean) save tokens compared to English” is a widely circulated claim. The intuition is reasonable — Chinese characters pack high information density, one Hanzi can encode the meaning of several English letters, so tokenizers should treat CJK better.
How well does that intuition hold up? We benchmarked six task types (short instructions, business prose, translation, proposal, code comment, creative copy) with OpenAI’s official tiktoken library. The result:
For equivalent semantic content, Chinese consumes 1.1–1.6x as many tokens as English on modern LLMs. Japanese is worse, at 1.3–2.2x.
This article lays out the full measurements, per-vendor tokenizer behavior, why the intuition fails, and practical guidance.
🏆 TL;DR
🔤 Which language consumes more tokens for the same meaning? (measured on o200k_base / GPT-4o, GPT-5)
| Rank | Language | Multiplier vs English |
|---|---|---|
| 🥇 | English | 1.0x (baseline) |
| 🥈 | Chinese | 1.34x |
| 🥉 | Japanese | 1.73x |
| 4 | Korean | ~2.0x |

One-line summary: CJK always consumes more tokens: Chinese ~30% more, Japanese ~70% more. Token counts translate directly to API bills. Prompt caching and the Batch API compress the absolute cost but don’t reverse this ranking. To actually cut Chinese costs, jump to the cross-vendor cheat sheet: DeepSeek / Qwen are currently the only tokenizers approaching 1.0x parity for Chinese.
5 Key Empirical Findings
- CJK consumes more tokens than English. On modern o200k_base (GPT-4o / GPT-5): Chinese 1.1–1.6x, Japanese 1.3–2.2x. On older cl100k_base (GPT-4 / Claude 3): Chinese 1.6–2.5x, Japanese 1.7–2.9x.
- Task type matters a lot. Short commands are nearly tied; long business prose and proposals inflate to 1.5–2.2x.
- Generational tokenizer improvement is real. o200k_base cuts the average CJK multiplier by roughly 25–36% versus cl100k_base (Chinese 2.08x → 1.34x, Japanese 2.32x → 1.73x).
- Japanese is consistently worse than Chinese — hiragana and katakana drag the average.
- Only DeepSeek / Qwen flip the result — their Chinese-optimized tokenizers reach ~1.0x parity with English.
🔍 Why the “CJK Saves Tokens” Intuition Sounds Right
Before we show why it’s wrong, let’s give the intuition its due — there are three legitimate-looking reasons to believe it.
Three clues that make CJK look more efficient
1. Hanzi genuinely carry more information per character
One “學” character encodes the meaning scaffold behind “learn,” “study,” and “scholarship.” Per-character information density in CJK really is higher than in alphabetic scripts. The intuitive leap: a good tokenizer should reward that.
2. Character counts are genuinely lower
The Chinese version of an article is typically 35–45% of the English character count. When you eyeball length, Chinese looks “much shorter” — an easy step to “must be fewer tokens too.”
3. Cherry-picked short-sentence comparisons can favor Chinese
If you test a couple of handpicked examples:
EN: "Summarize this article in bullets" (6 tokens)
ZH: 「條列摘要這篇文章」(5 tokens)
Chinese looks like a win — and the story spreads.
Why the intuition still breaks
The gap between actual tokenizer behavior and intuitive information density is where it goes wrong:
- Hanzi density doesn’t translate into fewer tokens. Tokenizers typically assign 1 token per character for CJK. The semantic density shows up in how much meaning fits per context slot, not in fewer tokens charged. It helps your context window, not your API bill.
- Character count ≠ token count. “learn” is 5 characters but usually 1 token (merged as a common word). “學” is 1 character and also 1 token. English tokenizers compress far more aggressively than Chinese ones.
- Cherry-picked short sentences aren’t the average. A handful of examples can hit tokenizer-specific optimizations for common phrases, but cross-task averages tell the real story.
Supporting evidence: 2M-sentence study
A large-scale study using 2 million professionally human-translated sentences measured the token ratio between English and each target language:
| Language (vs. English baseline) | Average token multiplier |
|---|---|
| English | 1.0x (baseline) |
| Chinese (Mandarin) | 1.76x |
| Cantonese | 2.10x |
| Japanese | 2.12x |
| Korean | 2.36x |
Every CJK language uses more tokens. No exceptions.
🧪 Six-Task Benchmark with tiktoken
All measurements below use OpenAI’s official tiktoken library across two tokenizers actually used by current models:
- cl100k_base: GPT-4 and GPT-3.5 Turbo; treated in this article as an approximation for the Claude 3 series, whose tokenizer is not public
- o200k_base: GPT-4o, GPT-5 series, latest OpenAI mainline
Task 1: Short instruction prompt
EN: "Summarize this article and list the top 5 key takeaways in bullet points."
ZH: 「摘要這篇文章,並以條列式列出前 5 個重點。」
JA: 「この記事を要約し、重要なポイントを5つ箇条書きで挙げてください。」
| Tokenizer | EN | ZH | JA | ZH/EN | JA/EN |
|---|---|---|---|---|---|
| cl100k_base | 18 | 29 | 31 | 1.61x | 1.72x |
| o200k_base | 18 | 19 | 24 | 1.06x | 1.33x |
Short instructions are the friendliest case for CJK — near parity on o200k.
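A measurement like Task 1 can be reproduced with a few lines of Python. This is a minimal sketch, not the exact script behind the table: the helper names `tiktoken_encoder` and `token_ratios` are made up for illustration, and whether the original counts included the surrounding 「」 quotation marks is an assumption left open here, so your numbers may differ slightly.

```python
from typing import Callable, Dict, List

Encoder = Callable[[str], List[int]]


def tiktoken_encoder(name: str) -> Encoder:
    """Return the encode function of a tiktoken encoding such as "o200k_base"."""
    import tiktoken  # third-party package: pip install tiktoken

    return tiktoken.get_encoding(name).encode


def token_ratios(texts: Dict[str, str], encode: Encoder, baseline: str = "EN") -> Dict[str, float]:
    """Token count of each text divided by the baseline language's token count."""
    counts = {lang: len(encode(text)) for lang, text in texts.items()}
    return {lang: round(n / counts[baseline], 2) for lang, n in counts.items()}


# Task 1 prompts, copied from above without the 「」 marks.
TASK_1 = {
    "EN": "Summarize this article and list the top 5 key takeaways in bullet points.",
    "ZH": "摘要這篇文章,並以條列式列出前 5 個重點。",
    "JA": "この記事を要約し、重要なポイントを5つ箇条書きで挙げてください。",
}
```

Calling `token_ratios(TASK_1, tiktoken_encoder("o200k_base"))` gives the ZH/EN and JA/EN columns for one tokenizer; swap in `"cl100k_base"` for the other row. Passing the encoder as a plain function keeps the ratio logic usable with any tokenizer you can wrap.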
Task 2: Business prose paragraph
A 5-sentence paragraph on LLM impact on enterprise text processing.
| Tokenizer | EN | ZH | JA | ZH/EN | JA/EN |
|---|---|---|---|---|---|
| cl100k_base | 69 | 174 | 202 | 2.52x | 2.93x |
| o200k_base | 69 | 107 | 150 | 1.55x | 2.17x |
Worst case — long business content on cl100k_base reaches nearly 3x.
Task 3: Translation task (sales report)
| Tokenizer | EN | ZH | JA | ZH/EN | JA/EN |
|---|---|---|---|---|---|
| cl100k_base | 56 | 120 | 139 | 2.14x | 2.48x |
| o200k_base | 56 | 79 | 101 | 1.41x | 1.80x |
Task 4: Project proposal
| Tokenizer | EN | ZH | JA | ZH/EN | JA/EN |
|---|---|---|---|---|---|
| cl100k_base | 90 | 177 | 197 | 1.97x | 2.19x |
| o200k_base | 89 | 119 | 145 | 1.34x | 1.63x |
Task 5: Code comment
| Tokenizer | EN | ZH | JA | ZH/EN | JA/EN |
|---|---|---|---|---|---|
| cl100k_base | 22 | 46 | 54 | 2.09x | 2.45x |
| o200k_base | 22 | 29 | 41 | 1.32x | 1.86x |
Task 6: Creative copy
| Tokenizer | EN | ZH | JA | ZH/EN | JA/EN |
|---|---|---|---|---|---|
| cl100k_base | 46 | 99 | 100 | 2.15x | 2.17x |
| o200k_base | 44 | 60 | 70 | 1.36x | 1.59x |
📊 Summary Across All Tasks
| Metric | cl100k_base ZH | cl100k_base JA | o200k_base ZH | o200k_base JA |
|---|---|---|---|---|
| Lowest ratio | 1.61x | 1.72x | 1.06x | 1.33x |
| Highest ratio | 2.52x | 2.93x | 1.55x | 2.17x |
| Average | 2.08x | 2.32x | 1.34x | 1.73x |
Rule-of-thumb to memorize:
- On GPT-4 / Claude 3: Chinese ≈ 2x English tokens, Japanese ≈ 2.3x
- On GPT-4o / GPT-5: Chinese ≈ 1.3x English tokens, Japanese ≈ 1.7x
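The summary table can be recomputed from the six per-task tables. The sketch below copies the (EN, ZH, JA) token counts from those tables and assumes the averages are means of the per-task ratios after rounding to two decimals, which matches the published figures.

```python
from statistics import mean

# (EN, ZH, JA) token counts for Tasks 1-6, copied from the tables above.
COUNTS = {
    "cl100k_base": [(18, 29, 31), (69, 174, 202), (56, 120, 139),
                    (90, 177, 197), (22, 46, 54), (46, 99, 100)],
    "o200k_base": [(18, 19, 24), (69, 107, 150), (56, 79, 101),
                   (89, 119, 145), (22, 29, 41), (44, 60, 70)],
}


def summarize(rows):
    """Lowest, highest, and average ratio vs English for ZH and JA."""
    ratios = {
        "ZH": [round(zh / en, 2) for en, zh, _ in rows],
        "JA": [round(ja / en, 2) for en, _, ja in rows],
    }
    return {
        lang: {"min": min(r), "max": max(r), "avg": round(mean(r), 2)}
        for lang, r in ratios.items()
    }


SUMMARY = {name: summarize(rows) for name, rows in COUNTS.items()}
```

`SUMMARY["o200k_base"]["ZH"]` comes out as a low of 1.06, a high of 1.55, and an average of 1.34, in line with the table.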
🧮 Measurement Detail: Fewer Chinese Characters, but MORE Tokens
The most counter-intuitive finding — verified in Task 2 and Task 4:
| Task | EN chars | ZH chars | Char ratio | o200k token ratio |
|---|---|---|---|---|
| Task 2 business prose | 371 | 136 | 0.37x | 1.55x |
| Task 4 proposal | 342 | 141 | 0.41x | 1.34x |
Chinese uses only 37–41% of the character count, but 34–55% more tokens. The mechanism:
English tokenizers merge common words. Tokens like “the,” “of,” “transformation,” “enterprise” each get 1–2 tokens despite their character length. English averages ~5.4 characters per token.
Chinese tokenizers merge far less. Most Hanzi are their own token, and some less-common Hanzi even split into 2 byte fragments. In the measurements here, Chinese lands between roughly 0.8 and 1.3 characters per token (Task 2: 136 / 174 ≈ 0.78 on cl100k_base, 136 / 107 ≈ 1.27 on o200k_base).
That roughly 4–7x compression differential overwhelms the “Chinese has fewer characters” advantage, and you end up with more tokens.
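The mechanism can be written as a one-line identity: token ratio = character ratio × (English characters per token ÷ Chinese characters per token). A small sketch using the Task 2 figures on o200k_base from the tables above; the variable names are illustrative only.

```python
def chars_per_token(chars: int, tokens: int) -> float:
    """Average number of characters covered by one token."""
    return chars / tokens


# Task 2 (business prose) on o200k_base, from the tables in this article.
en_chars, en_tokens = 371, 69
zh_chars, zh_tokens = 136, 107

en_cpt = chars_per_token(en_chars, en_tokens)  # about 5.4
zh_cpt = chars_per_token(zh_chars, zh_tokens)  # about 1.3 for this passage
char_ratio = zh_chars / en_chars               # about 0.37

# Fewer characters, multiplied by a much weaker compression rate.
token_ratio = char_ratio * (en_cpt / zh_cpt)   # about 1.55
```

The character advantage (0.37x) is multiplied by a compression gap of more than 4x for this passage, which is how a shorter text ends up at 1.55x the tokens.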
Watch out for non-equivalent translations
A common unfair example circulating online:
EN: "Summarize the following article and extract the top 5 key points in bullet format" (15 tokens)
ZH: 「摘要以下文章,並以條列式提取前 5 個重點」(9 tokens)
Chinese looks 40% cheaper — but the Chinese version omitted “in bullet format” and compressed the verb. That’s a translation-style difference, not a tokenizer-efficiency difference. Every Task in this article uses the tightest semantically equivalent translation to keep the comparison fair.
📈 Tokenizer Generations: o200k_base Is a Real Improvement
From cl100k_base to o200k_base, OpenAI cut the average CJK multiplier by roughly 25–36%:
| Language | cl100k average | o200k average | Improvement |
|---|---|---|---|
| English | 1.00x | 1.00x | — |
| Chinese | 2.08x | 1.34x | -36% |
| Japanese | 2.32x | 1.73x | -25% |
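The Improvement column is a plain relative change between the two averages. A short sketch, with `relative_change` as an illustrative helper name:

```python
def relative_change(old: float, new: float) -> float:
    """Percentage change from old to new; negative when the multiplier shrinks."""
    return (new - old) / old * 100


zh_change = relative_change(2.08, 1.34)  # about -36
ja_change = relative_change(2.32, 1.73)  # about -25
```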
⚠️ Counter-case: Claude Opus 4.7
Released 2026/4/16, Claude Opus 4.7 ships a new tokenizer that Anthropic disclosed makes CJK consume 15–35% more tokens than the previous generation. Tokenizer upgrades don’t always move in the right direction, so don’t assume “newer = better for CJK.”
🆚 Cross-Vendor Tokenizer Cheat Sheet (April 2026)
CJK inefficiency is a shared problem, but vendor differences remain significant. Measured medians:
Metric note
This table uses characters-per-token (CPT) for within-language vendor comparison (which vendor is friendliest to your language). The ratio tables above use cross-language token ratios for equivalent meaning (which language costs more for the same content). Both are valid but measure different things — don’t mix them.
| Tokenizer | EN CPT | ZH CPT | JA CPT | Chinese rank |
|---|---|---|---|---|
| DeepSeek V3 / V4 | ~5.0 | ~1.1–1.2 | ~0.9 | 🥇 Best for Chinese |
| Qwen family | ~4.8 | ~1.1–1.3 | ~0.9 | 🥇 Best for Chinese |
| GPT-4o / o200k_base | ~5.4 | ~1.0–1.3 | ~0.9–1.3 | 🥈 |
| Gemini 3.x | ~5.5 | ~1.0 | ~0.82 | 🥉 |
| Llama 3 family | ~5.3 | ~1.0 | ~0.8 | 🥉 |
| Claude Opus 4.6 | ~5.5 | ~1.0–1.1 | ~0.9 | 🥉 |
| GPT-4 / cl100k_base | ~5.4 | ~0.7–0.8 | ~0.9–1.0 | ❌ Worst for Chinese |
| Claude Opus 4.7 | ~5.5 | ~0.75–0.9 | ~0.7–0.8 | ❌ Worst for Chinese |
Key observations:
- 🇨🇳 Chinese: only DeepSeek and Qwen stay consistently above CPT 1.0 (meaning one Chinese character costs less than one token on average). Everyone else sits at roughly one character per token or worse.
- 🇯🇵 Japanese: tight cluster across vendors, nobody has seriously optimized for it
- 🇺🇸 English: all vendors in the 4.8–5.5 band, differences are within noise
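To turn CPT into something you can budget with, divide a character count by it. The sketch below copies a few Chinese CPT ranges from the cheat sheet and estimates tokens for a 1,000-character text; the function name and the choice of 1,000 characters are illustrative assumptions.

```python
# Chinese CPT ranges (low, high), copied from the cheat sheet above.
ZH_CPT = {
    "DeepSeek V3 / V4": (1.1, 1.2),
    "Qwen family": (1.1, 1.3),
    "GPT-4o / o200k_base": (1.0, 1.3),
    "GPT-4 / cl100k_base": (0.7, 0.8),
    "Claude Opus 4.7": (0.75, 0.9),
}


def estimated_tokens(chars: int, cpt_range):
    """Return (best case, worst case) token counts for a text of `chars` characters."""
    low_cpt, high_cpt = cpt_range
    # A higher CPT means fewer tokens, so the high end of the range is the best case.
    return round(chars / high_cpt), round(chars / low_cpt)


ESTIMATES = {name: estimated_tokens(1000, rng) for name, rng in ZH_CPT.items()}
```

Under these ranges, 1,000 Chinese characters cost about 833–909 tokens on DeepSeek and about 1,250–1,429 on cl100k_base.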
🎯 Three Ways CJK Still Has an Edge
More tokens doesn’t mean CJK users get nothing. Three legitimate advantages remain:
1. Context windows hold more “meaning”
Even though token counts are higher per equivalent meaning, character counts are lower. When you stuff a whole book or contract into context, Chinese consumes less character space — the density advantage still exists at the context-window level even if not at the pricing level.
Example: Claude Opus 4.7’s 1M-token context
| Language | ~Chars per token | Chars fitting in 1M tokens |
|---|---|---|
| English | ~5.4 | ~5.4M characters |
| Chinese | ~0.85 | ~850K characters |
850K characters of Chinese ≈ 3–4 standard novels. On an information-content basis, the Chinese version fits roughly as much meaning as the 5.4M-character English version.
2. Short commands nearly tie
Task 1 shows Chinese only 6% worse, Japanese only 33% worse — for apps dominated by short interactions (customer service, search, simple tool calls), the CJK cost penalty is mild.
3. Chinese-specialized models flip the result
DeepSeek V4’s Chinese CPT sits at 1.1–1.2 — already close to its own English density. Combined with pricing at 1/10 to 1/50 of Western models, total cost for a Chinese application can actually be lower than GPT-4o.
💰 Real Cost Delta: Monthly Bill Estimation
Scenario: Chinese customer-support app, 10M conversations/month, avg 200 Chinese chars input, 150 Chinese chars output per conversation.
| Model | ZH input tokens (per call) | ZH output tokens (per call) | Monthly total tokens | Unit price ($/M in / out) | Monthly cost |
|---|---|---|---|---|---|
| GPT-4o | ~250 | ~190 | 2.5B in / 1.9B out | $2.5 / $10 | $25,250 |
| Claude Opus 4.7 | ~280 | ~220 | 2.8B in / 2.2B out | $5 / $25 | $69,000 |
| DeepSeek V4 | ~180 | ~140 | 1.8B in / 1.4B out | $0.28 / $0.42 | $1,092 |
| Qwen-Max | ~180 | ~140 | 1.8B in / 1.4B out | ~$0.5 / $2 | $3,700 |
Same Chinese workload, DeepSeek costs ~60x less than Claude Opus 4.7. Quality isn’t equivalent of course — this just illustrates how “tokenizer efficiency × unit price” stacks up.
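The Monthly cost column is monthly tokens times unit price, summed over input and output. A minimal sketch that takes the monthly token totals and prices straight from the table; `monthly_cost` is an illustrative name, and the prices are the article's figures, not live pricing.

```python
def monthly_cost(tokens_in: float, tokens_out: float,
                 price_in: float, price_out: float) -> float:
    """Cost in USD, given monthly token totals and prices in USD per million tokens."""
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out


gpt_4o = monthly_cost(2.5e9, 1.9e9, 2.5, 10)        # about 25,250
opus_4_7 = monthly_cost(2.8e9, 2.2e9, 5, 25)        # about 69,000
deepseek = monthly_cost(1.8e9, 1.4e9, 0.28, 0.42)   # about 1,092

opus_vs_deepseek = opus_4_7 / deepseek               # about 63x
```

Plug in your own token totals from the API `usage` field to replace the estimates in the table.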
🧠 The Clever-Looking Trap: “Translate-Both-Ways”
Once you learn English uses fewer tokens, there’s an obvious-sounding strategy:
“Let AI translate my Chinese prompt into English → run the task in English → translate the output back to Chinese.”
It sounds smart. In practice it’s almost always more expensive. Reason: you pay for translation twice, and translation itself costs tokens.
Numbers: Long document analysis
Analyzing a 50,000-character Chinese report:
| Step | Native Chinese flow | English-bridge flow |
|---|---|---|
| 1. Read document | 55K Chinese tokens | 55K in → 37K English translation out = 92K |
| 2. Run analysis | 55K in + 5.5K out = 60.5K | 37K in + 3.7K out = 40.7K |
| 3. Translate back | — | 3.7K in + 5K Chinese out = 8.7K |
| Total | 60.5K | 141.4K |
The English bridge burns 2.3x more tokens — a net loss.
Numbers: Short intent + medium output (code generation)
For a short-intent, long-output task like “build me a React todo app”:
| Step | Native Chinese | English-bridge (write prompt in English directly) |
|---|---|---|
| 1. Prompt | 100 tokens | 75 tokens |
| 2. Execute + think | 30K | 25K |
| 3. Output | Already Chinese | Translate explanation to Chinese: 3K |
| Total | 30K | 28K |
~7% savings — but only if you write the prompt in English directly, skipping the first translation hop.
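Both scenarios reduce to adding up per-step token totals and comparing the two flows. A sketch using the step totals from the two tables above, in thousands of tokens; `overhead` is an illustrative helper.

```python
# Per-step totals in thousands of tokens, copied from the two tables above.
doc_native = [60.5]               # read + analyze in Chinese
doc_bridge = [92, 40.7, 8.7]      # translate in, analyze in English, translate back
code_native = [0.1, 30]           # prompt, execute + think
code_bridge = [0.075, 25, 3]      # English prompt, execute + think, translate explanation


def overhead(native, bridge) -> float:
    """Bridge total divided by native total; above 1.0 means the bridge costs more."""
    return sum(bridge) / sum(native)


doc_overhead = overhead(doc_native, doc_bridge)     # about 2.34
code_overhead = overhead(code_native, code_bridge)  # about 0.93
```

An overhead of about 2.34 for the document case is a clear loss, while about 0.93 for the code-generation case corresponds to the roughly 7% saving.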
When the strategy actually saves (three conditions, all required)
- ✅ Prompt is short or written in English directly (skip the first translation)
- ✅ Execution is heavy (reasoning-intensive tasks with lots of thinking tokens)
- ✅ Final output is short or doesn’t need back-translation (code, tech docs can stay English)
What actually works for CJK cost savings
Skip the double translation; use a hybrid strategy:
| Tactic | Savings | Trade-off |
|---|---|---|
| English system prompts (user content stays in CJK) | Fixed per-call saving, compounds at scale | Models understand English system prompts perfectly |
| English few-shot examples | Same × example count | Requires preparing examples |
| Keep output in English (tech tasks) | Skips the entire translation-back step | User must read English |
| Prompt caching | 50–90% discount on input | Requires repeated prompts to trigger |
| Use DeepSeek / Qwen for Chinese | -20–30% tokens + 10–50x cheaper unit price | Slight quality gap vs. Claude / GPT |
| Tighter Chinese prompts | 10–20% | Time investment to polish |
One-liner: Don’t translate full content in both directions — almost always more expensive. Anglicize the static parts, keep dynamic parts in CJK, and lean on caching — that’s the strategy that actually wins.
🛠️ Practical Guidance
When estimating cost
- Never extrapolate English token counts to Chinese bills — multiply by 1.3–1.6x (o200k) or 1.6–2.5x (cl100k)
- Re-benchmark every model upgrade — Opus 4.6 → 4.7 CJK regression is the cautionary tale
- System prompts and few-shot examples are repeated across every call — Chinese token inflation compounds; prompt caching helps offset
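For a quick budget before you have real measurements, the multipliers in the first bullet can be applied as a range. A rough sketch; `chinese_token_budget` is a hypothetical helper, and the result is only as good as the multipliers, so replace it with measured counts once you have them.

```python
# Chinese-vs-English multiplier ranges from the first bullet above.
MULTIPLIERS = {
    "o200k_base": (1.3, 1.6),
    "cl100k_base": (1.6, 2.5),
}


def chinese_token_budget(english_tokens: int, tokenizer: str):
    """Return a (low, high) estimate of Chinese tokens for the same content."""
    low, high = MULTIPLIERS[tokenizer]
    return round(english_tokens * low), round(english_tokens * high)
```

For example, `chinese_token_budget(10_000, "o200k_base")` returns `(13000, 16000)`.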
Vendor selection
| Situation | Recommendation |
|---|---|
| Chinese-heavy, cost-sensitive (customer service, high volume) | DeepSeek V4 / Qwen |
| Chinese-heavy, quality-first (legal / finance / coding) | Claude Opus 4.7 or GPT-5 |
| Mixed-language / global apps | GPT-4o / Gemini 3 |
| Short, high-frequency commands | Any o200k-class tokenizer works |
| Long-document batch processing | Benchmark tokenizer efficiency alongside unit price |
Token compression tactics
Since CJK inflation is a fact, reduce from the content side:
- Write tighter Chinese — avoid filler (“的話”, “的時候”, “做一個 X 的動作”)
- Prefer structure over prose — bullets, tables, key-value pairs beat long sentences
- Consider English system prompts — when user content is Chinese, an English system prompt doesn’t hurt model comprehension but saves tokens
- Use prompt caching aggressively — cached Chinese system prompts recover much of the lost efficiency
🧭 Bottom Line: Trust Measurement, Not Intuition
“Chinese saves tokens” is a reasonable-sounding intuition that doesn’t survive contact with real measurements. Hanzi density shows up in the context window, not the token bill.
The real picture:
- 📉 CJK always uses more tokens on modern LLMs — Chinese 1.1–1.6x, Japanese 1.3–2.2x
- 📈 Tokenizer updates can go either way (GPT-4 → 4o improved 36%; Opus 4.6 → 4.7 regressed 15–35%)
- 💡 DeepSeek / Qwen are the genuine Chinese tokenizer optimizers, and combined with low pricing they offer the best Chinese cost-performance
- 📊 CJK’s character-density advantage still shows up at the context-window level — just don’t mistake it for token-level efficiency
Don’t trust the intuition “Chinese saves money with AI.” Trust your own measurement. Run 100 representative prompts through tiktoken or your API’s usage field — that number is worth more than any public benchmark.
❓ FAQ
Is "Chinese saves tokens" ever actually true?
On cherry-picked short sentences, occasionally yes — but not as a general rule.
cl100k_base (the GPT-4 era) has specific optimizations for certain high-frequency Chinese phrases, so on very short, very common sentences Chinese occasionally ties or slightly beats English. But scale up the sample and write tighter translations and the gap flips — the 2023 arxiv 2305.15425 study across 2M sentences averages Chinese at 1.76x English, consistent with the tiktoken measurements in this article.
Rule to memorize: on modern LLMs, CJK always consumes more tokens. The only question is how much more.
Then why does the context window sometimes "feel" bigger for Chinese?
Because you’re feeling character count, not token count. A 1M-token context fits ~850K Chinese characters or ~5.4M English characters. At first glance English looks to fit more — but measured by “how many books of meaning,” the high information density of Chinese lets 850K characters hold 3–4 novels worth of content.
Conclusion: context windows are a wash for CJK (maybe a slight advantage) but API bills are worse for CJK — keep these two mental models separate.
Are DeepSeek / Qwen really more efficient at Chinese than GPT / Claude?
Yes, and they’re the only clear exception in Chinese tokenization. DeepSeek and Qwen train on corpora weighted heavily toward Chinese, and their vocabulary allocation reflects it — CPT reaches 1.1–1.3 versus the 1.0-and-below typical of Western models.
Practical impact: for pure Chinese content, DeepSeek / Qwen consume roughly 70–85% of GPT-4o’s tokens. Combined with pricing far below Western models, total Chinese-app cost can be 1/10 to 1/60 of Western models.
I write my prompts in Chinese. Should I switch to English?
Mostly no. Reasons:
- A prompt you write well in Chinese beats a clumsy English one — expression precision matters far more than saving 20% tokens
- Prompt quality > token efficiency — a clear prompt that works first try beats a cheap one that fails three times
- The CJK penalty isn’t as big as it feels — o200k’s 1.3x Chinese multiplier rarely changes project viability
When is English worth considering?
- Repeated system prompts and few-shot examples (the penalty compounds over millions of calls)
- Technical domains where English terminology is more precise (AI, legal, medical)
- High-frequency, cost-sensitive API apps
User input and model output can stay in Chinese — only fixed prompt templates are worth translating.
Why is Japanese worse than Chinese?
Japanese mixes three writing systems:
- Kanji: high information density, similar to Chinese
- Hiragana / Katakana: each character encodes only one syllable but still consumes ~1 token — low information-per-token ratio
- Katakana for loanwords (e.g., “コンピュータ” = computer): one English word stretched into 6 katakana = 6 tokens
So even though Japanese contains high-efficiency kanji, the average ends up worse than pure-Hanzi Chinese. This is why every single Task above shows Japanese worse than Chinese.
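To see how much of a Japanese prompt is kana rather than kanji, you can classify characters by Unicode block. This is a rough sketch: the ranges cover the main Hiragana, Katakana, and CJK Unified Ideographs blocks only, and it says nothing about how a given tokenizer actually splits the text.

```python
from collections import Counter


def script_of(ch: str) -> str:
    """Classify one character by its Unicode block (main blocks only)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def script_mix(text: str) -> Counter:
    """Count characters per script in a string."""
    return Counter(script_of(ch) for ch in text)
```

`script_mix("コンピュータ")` counts six katakana characters for a single English word. Running it on your own Japanese prompts shows how kana-heavy they are, which is a useful first signal of where the extra tokens come from.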
How do I measure my own app's token consumption?
Three options, easiest to most precise:
- gptforwork.com/tools/tokenizer: paste text, get counts for GPT / Claude / Gemini / Grok
- OpenAI `tiktoken` (Python):

  ```python
  import tiktoken

  enc = tiktoken.encoding_for_model("gpt-4o")
  print(len(enc.encode("your text")))
  ```
- Real API calls: read the `usage` field. The input_tokens / output_tokens values in the response are the only 100% accurate source (especially for Claude / Gemini, whose tokenizers aren’t public)
Recommended: run 100 representative prompts and average. That number beats any public benchmark for your specific app.
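Averaging logged usage is a few lines of code. The sketch below assumes you have collected one usage record per call as a dict with `input_tokens` and `output_tokens` keys; field names vary by vendor and API version, and the sample records are made-up placeholders, not measurements.

```python
from statistics import mean


def average_usage(usages):
    """Average input and output tokens over a list of usage records."""
    return {
        "input_tokens": mean(u["input_tokens"] for u in usages),
        "output_tokens": mean(u["output_tokens"] for u in usages),
    }


# Hypothetical records standing in for what you would log from real API responses.
SAMPLE = [
    {"input_tokens": 240, "output_tokens": 180},
    {"input_tokens": 260, "output_tokens": 200},
]

AVERAGE = average_usage(SAMPLE)
```

With 100 real records in place of `SAMPLE`, `AVERAGE` is the per-call figure to multiply by your monthly call volume.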