
CJK Token Myth Busted — Measured Data

CJK languages use 1.1–2.9x MORE tokens than English on modern LLMs — not less. Six-task tiktoken benchmark and vendor comparison.

“CJK languages (Chinese / Japanese / Korean) save tokens compared to English” is a widely circulated claim. The intuition is reasonable — Chinese characters pack high information density, one Hanzi can encode the meaning of several English letters, so tokenizers should treat CJK better.

How well does that intuition hold up? We benchmarked six task types with OpenAI’s official tiktoken library (short instructions, business prose, translation, proposal, code comment, creative copy). The result:

For equivalent semantic content, Chinese consumes 1.1–1.6x MORE tokens than English on modern LLMs. Japanese is worse, at 1.3–2.2x.

This article lays out the full measurements, per-vendor tokenizer behavior, why the intuition fails, and practical guidance.

🏆 TL;DR

🔤 Which language consumes more tokens for the same meaning? (measured on o200k_base / GPT-4o, GPT-5)

| Rank | Language | Multiplier vs English |
| --- | --- | --- |
| 🥇 | English | 1.0x (baseline) |
| 🥈 | Chinese | 1.34x |
| 🥉 | Japanese | 1.73x |
| 4 | Korean | ~2.0x |

One-line summary: CJK always consumes more tokens — Chinese ~30% more, Japanese ~70% more. Token counts translate directly to API bills. Prompt caching and Batch API compress the absolute cost but don’t reverse this ranking. To actually cut Chinese costs, jump to the cross-vendor cheat sheet — DeepSeek / Qwen are currently the only tokenizers approaching 1.0x parity for Chinese.

5 Key Empirical Findings

  1. CJK consumes more tokens than English. On modern o200k_base (GPT-4o / GPT-5): Chinese 1.1–1.6x, Japanese 1.3–2.2x. On older cl100k_base (GPT-4 / Claude 3): Chinese 1.6–2.5x, Japanese 1.7–2.9x.
  2. Task type matters a lot. Short commands are nearly tied; long business prose and proposals inflate to 1.5–2.2x.
  3. Generational tokenizer improvement is real. o200k_base cuts the average CJK multiplier by roughly 25–36% versus cl100k_base.
  4. Japanese is consistently worse than Chinese — hiragana and katakana drag the average.
  5. Only DeepSeek / Qwen flip the result — their Chinese-optimized tokenizers reach ~1.0x parity with English.

🔍 Why the “CJK Saves Tokens” Intuition Sounds Right

Before we show why it’s wrong, let’s give the intuition its due — there are three legitimate-looking reasons to believe it.

Three clues that make CJK look more efficient

1. Hanzi genuinely carry more information per character

One “學” character encodes the meaning scaffold behind “learn,” “study,” and “scholarship.” Per-character information density in CJK really is higher than in alphabetic scripts. The intuitive leap: a good tokenizer should reward that.

2. Character counts are genuinely lower

The Chinese version of an article is typically 35–45% of the English character count. When you eyeball length, Chinese looks “much shorter” — an easy step to “must be fewer tokens too.”

3. Cherry-picked short-sentence comparisons can favor Chinese

If you test a couple of handpicked examples:

EN: "Summarize this article in bullets" (6 tokens)
ZH: 「條列摘要這篇文章」(5 tokens)

Chinese looks like a win — and the story spreads.

Why the intuition still breaks

The gap between actual tokenizer behavior and intuitive information density is where it goes wrong:

  1. Hanzi density doesn’t translate into fewer tokens. Tokenizers typically assign 1 token per character for CJK. The semantic density shows up in how much meaning fits per context slot, not in fewer tokens charged. It helps your context window, not your API bill.

  2. Character count ≠ token count. “learn” is 5 characters but usually 1 token (merged as a common word). “學” is 1 character and also 1 token. The same tokenizer compresses English far more aggressively than it compresses Chinese.

  3. Cherry-picked short sentences aren’t the average. A handful of examples can hit tokenizer-specific optimizations for common phrases, but cross-task averages tell the real story.

Supporting evidence: 2M-sentence study

A large-scale study using 2 million professionally human-translated sentences measured the token ratio between English and each target language:

| Language (vs. English baseline) | Average token multiplier |
| --- | --- |
| English | 1.0x (baseline) |
| Chinese (Mandarin) | 1.76x |
| Cantonese | 2.10x |
| Japanese | 2.12x |
| Korean | 2.36x |

Every CJK language uses more tokens. No exceptions.


🧪 Six-Task Benchmark with tiktoken

All measurements below use OpenAI’s official tiktoken library across two tokenizers actually used by current models:

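  • cl100k_base: GPT-4 and GPT-3.5 Turbo. Claude 3 uses Anthropic’s own tokenizer, which tiktoken does not include, so the cl100k_base figures are only a rough proxy for it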
  • o200k_base: GPT-4o, GPT-5 series, latest OpenAI mainline

Task 1: Short instruction prompt

EN: "Summarize this article and list the top 5 key takeaways in bullet points."
ZH: 「摘要這篇文章,並以條列式列出前 5 個重點。」
JA: 「この記事を要約し、重要なポイントを5つ箇条書きで挙げてください。」
| Tokenizer | EN | ZH | JA | ZH/EN | JA/EN |
| --- | --- | --- | --- | --- | --- |
| cl100k_base | 18 | 29 | 31 | 1.61x | 1.72x |
| o200k_base | 18 | 19 | 24 | 1.06x | 1.33x |

Short instructions are the friendliest case for CJK — near parity on o200k.
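
The Task 1 comparison can be reproduced with a short script. The sketch below is a minimal version, assuming the `tiktoken` package is installed. The helper names `token_ratios` and `measure_task1` are illustrative, not from the original benchmark. The encoder is passed in as a plain callable, so the same function works with any tokenizer.

```python
from typing import Callable, Dict, Sequence

# Task 1 prompts, copied from the article.
TASK1 = {
    "EN": "Summarize this article and list the top 5 key takeaways in bullet points.",
    "ZH": "摘要這篇文章,並以條列式列出前 5 個重點。",
    "JA": "この記事を要約し、重要なポイントを5つ箇条書きで挙げてください。",
}


def token_ratios(
    samples: Dict[str, str],
    encode: Callable[[str], Sequence],
    baseline: str = "EN",
) -> Dict[str, float]:
    """Return each language's token count divided by the baseline's."""
    counts = {lang: len(encode(text)) for lang, text in samples.items()}
    return {lang: n / counts[baseline] for lang, n in counts.items()}


def measure_task1() -> None:
    """Print ZH/EN and JA/EN ratios for both encodings (needs tiktoken)."""
    import tiktoken  # third-party: pip install tiktoken

    for name in ("cl100k_base", "o200k_base"):
        enc = tiktoken.get_encoding(name)
        ratios = token_ratios(TASK1, enc.encode)
        print(name, {lang: round(r, 2) for lang, r in ratios.items()})
```

Calling `measure_task1()` prints one line per encoding. Exact counts can shift with punctuation choices (full-width versus half-width commas, for example), so treat small differences from the table as expected.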

Task 2: Business prose paragraph

A 5-sentence paragraph on LLM impact on enterprise text processing.

| Tokenizer | EN | ZH | JA | ZH/EN | JA/EN |
| --- | --- | --- | --- | --- | --- |
| cl100k_base | 69 | 174 | 202 | 2.52x | 2.93x |
| o200k_base | 69 | 107 | 150 | 1.55x | 2.17x |

Worst case — long business content on cl100k_base reaches nearly 3x.

Task 3: Translation task (sales report)

| Tokenizer | EN | ZH | JA | ZH/EN | JA/EN |
| --- | --- | --- | --- | --- | --- |
| cl100k_base | 56 | 120 | 139 | 2.14x | 2.48x |
| o200k_base | 56 | 79 | 101 | 1.41x | 1.80x |

Task 4: Project proposal

| Tokenizer | EN | ZH | JA | ZH/EN | JA/EN |
| --- | --- | --- | --- | --- | --- |
| cl100k_base | 90 | 177 | 197 | 1.97x | 2.19x |
| o200k_base | 89 | 119 | 145 | 1.34x | 1.63x |

Task 5: Code comment

| Tokenizer | EN | ZH | JA | ZH/EN | JA/EN |
| --- | --- | --- | --- | --- | --- |
| cl100k_base | 22 | 46 | 54 | 2.09x | 2.45x |
| o200k_base | 22 | 29 | 41 | 1.32x | 1.86x |

Task 6: Creative copy

| Tokenizer | EN | ZH | JA | ZH/EN | JA/EN |
| --- | --- | --- | --- | --- | --- |
| cl100k_base | 46 | 99 | 100 | 2.15x | 2.17x |
| o200k_base | 44 | 60 | 70 | 1.36x | 1.59x |

📊 Summary Across All Tasks

| Metric | cl100k_base ZH | cl100k_base JA | o200k_base ZH | o200k_base JA |
| --- | --- | --- | --- | --- |
| Lowest ratio | 1.61x | 1.72x | 1.06x | 1.33x |
| Highest ratio | 2.52x | 2.93x | 1.55x | 2.17x |
| Average | 2.08x | 2.32x | 1.34x | 1.73x |

Rule-of-thumb to memorize:

  • On GPT-4 / Claude 3: Chinese ≈ 2x English tokens, Japanese ≈ 2.3x
  • On GPT-4o / GPT-5: Chinese ≈ 1.3x English tokens, Japanese ≈ 1.7x
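
The summary figures follow directly from the per-task token counts. As a check, the sketch below recomputes the lowest, highest, and average ratios from the six task tables. The data layout and the `summarize` helper are illustrative.

```python
from statistics import mean

# (EN, ZH, JA) token counts per task, copied from the six task tables.
COUNTS = {
    "cl100k_base": [(18, 29, 31), (69, 174, 202), (56, 120, 139),
                    (90, 177, 197), (22, 46, 54), (46, 99, 100)],
    "o200k_base": [(18, 19, 24), (69, 107, 150), (56, 79, 101),
                   (89, 119, 145), (22, 29, 41), (44, 60, 70)],
}


def summarize(rows):
    """Return {lang: (lowest, highest, average)} of per-task ratios vs English."""
    out = {}
    for lang, idx in (("ZH", 1), ("JA", 2)):
        ratios = [row[idx] / row[0] for row in rows]
        stats = (min(ratios), max(ratios), mean(ratios))
        out[lang] = tuple(round(v, 2) for v in stats)
    return out


for tokenizer, rows in COUNTS.items():
    print(tokenizer, summarize(rows))
```

The averages here are means of the six per-task ratios, not ratios of pooled token totals. Pooling would weight the long tasks more heavily.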

🧮 Measurement Detail: Fewer Chinese Characters, but MORE Tokens

The most counter-intuitive finding — verified in Task 2 and Task 4:

| Task | EN chars | ZH chars | Char ratio | o200k token ratio |
| --- | --- | --- | --- | --- |
| Task 2 business prose | 371 | 136 | 0.37x | 1.55x |
| Task 4 proposal | 342 | 141 | 0.41x | 1.34x |

Chinese uses only 37–41% of the character count, but 34–55% more tokens. The mechanism:

Tokenizers merge common English words. Words like “the,” “of,” “transformation,” “enterprise” each get 1–2 tokens despite their character length. English averages ~5.4 characters per token.

They merge far less for Chinese. Each Hanzi is essentially its own token; some less-common Hanzi even split into 2 byte fragments. Chinese averages roughly 1 character per token: about 0.8 on cl100k_base and 1.2–1.3 on o200k_base in Tasks 2 and 4.

That roughly 4–5x compression differential (5.4 against 1.0–1.3) overwhelms the “Chinese has fewer characters” advantage, and you end up with more tokens.
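
The arithmetic can be written out explicitly: the token ratio is the character ratio multiplied by how much better English compresses. Here is a minimal sketch using the Task 2 and Task 4 character counts and o200k_base token counts from the tables above. The function and field names are illustrative.

```python
# Character counts and o200k_base token counts, copied from the tables above.
TASKS = {
    "Task 2 business prose": {"en_chars": 371, "zh_chars": 136,
                              "en_tokens": 69, "zh_tokens": 107},
    "Task 4 proposal": {"en_chars": 342, "zh_chars": 141,
                        "en_tokens": 89, "zh_tokens": 119},
}


def decompose(t):
    """Split the ZH/EN token ratio into a character ratio and a compression ratio."""
    char_ratio = t["zh_chars"] / t["en_chars"]    # Chinese needs fewer characters
    en_cpt = t["en_chars"] / t["en_tokens"]       # characters per token, English
    zh_cpt = t["zh_chars"] / t["zh_tokens"]       # characters per token, Chinese
    token_ratio = char_ratio * (en_cpt / zh_cpt)  # equals zh_tokens / en_tokens
    return char_ratio, en_cpt, zh_cpt, token_ratio


for name, t in TASKS.items():
    char_ratio, en_cpt, zh_cpt, token_ratio = decompose(t)
    print(f"{name}: chars {char_ratio:.2f}x, EN {en_cpt:.2f} cpt, "
          f"ZH {zh_cpt:.2f} cpt, tokens {token_ratio:.2f}x")
```

Whenever English packs more characters into each token than the character-count saving gives back, the token ratio ends up above 1.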

Watch out for non-equivalent translations

A common unfair example circulating online:

EN: "Summarize the following article and extract the top 5 key points in bullet format" (15 tokens)
ZH: 「摘要以下文章,並以條列式提取前 5 個重點」(9 tokens)

Chinese looks 40% cheaper — but the Chinese version omitted “in bullet format” and compressed the verb. That’s a translation-style difference, not a tokenizer-efficiency difference. Every Task in this article uses the tightest semantically equivalent translation to keep the comparison fair.


📈 Tokenizer Generations: o200k_base Is a Real Improvement

From cl100k_base to o200k_base, OpenAI cut the average CJK multiplier by 25–36%:

| Language | cl100k average | o200k average | Improvement |
| --- | --- | --- | --- |
| English | 1.00x | 1.00x | n/a |
| Chinese | 2.08x | 1.34x | -36% |
| Japanese | 2.32x | 1.73x | -25% |
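
The improvement column is the relative change in the average multiplier. The sketch below recomputes it. It also shows a second reading, added here as an assumption for comparison: the change in the overhead above the 1.0x English baseline. That number is larger because it ignores the part of the multiplier that English pays too.

```python
def relative_change(old: float, new: float) -> float:
    """Percentage change from old to new; -36.0 means a 36% reduction."""
    return (new - old) / old * 100


# (cl100k_base average, o200k_base average) from the table above.
AVERAGES = {"Chinese": (2.08, 1.34), "Japanese": (2.32, 1.73)}

for lang, (cl100k, o200k) in AVERAGES.items():
    multiplier = relative_change(cl100k, o200k)
    overhead = relative_change(cl100k - 1, o200k - 1)  # part above the baseline
    print(f"{lang}: multiplier {multiplier:.0f}%, overhead {overhead:.0f}%")
```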

⚠️ Counter-case: Claude Opus 4.7

Released 2026/4/16, Claude Opus 4.7 ships a new tokenizer that, according to Anthropic’s disclosure, makes CJK consume 15–35% more tokens than the previous generation. Tokenizer upgrades don’t always move in the right direction, so don’t assume “newer = better for CJK.”


🆚 Cross-Vendor Tokenizer Cheat Sheet (April 2026)

CJK inefficiency is a shared problem, but vendor differences remain significant. Measured medians:

Metric note

This table uses characters-per-token (CPT) for within-language vendor comparison (which vendor is friendliest to your language). The ratio tables above use cross-language token ratios for equivalent meaning (which language costs more for the same content). Both are valid but measure different things — don’t mix them.

| Tokenizer | EN CPT | ZH CPT | JA CPT | Chinese rank |
| --- | --- | --- | --- | --- |
| DeepSeek V3 / V4 | ~5.0 | ~1.1–1.2 | ~0.9 | 🥇 Best for Chinese |
| Qwen family | ~4.8 | ~1.1–1.3 | ~0.9 | 🥇 Best for Chinese |
| GPT-4o / o200k_base | ~5.4 | ~1.0–1.3 | ~0.9–1.3 | 🥈 |
| Gemini 3.x | ~5.5 | ~1.0 | ~0.82 | 🥉 |
| Llama 3 family | ~5.3 | ~1.0 | ~0.8 | 🥉 |
| Claude Opus 4.6 | ~5.5 | ~1.0–1.1 | ~0.9 | 🥉 |
| GPT-4 / cl100k_base | ~5.4 | ~0.7–0.8 | ~0.9–1.0 | ❌ Worst for Chinese |
| Claude Opus 4.7 | ~5.5 | ~0.75–0.9 | ~0.7–0.8 | ❌ Worst for Chinese |

Key observations:

  • 🇨🇳 Chinese: only DeepSeek and Qwen stay consistently above CPT 1.0 (meaning one Chinese character consumes less than one token). Everyone else sits at roughly one character per token, or worse, at the low end of their range.
  • 🇯🇵 Japanese: tight cluster across vendors, nobody has seriously optimized for it
  • 🇺🇸 English: all vendors in the 4.8–5.5 band, differences are within noise
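
A CPT figure converts to a token estimate by simple division. The sketch below turns a few of the Chinese CPT ranges from the table into a token range for a given character count. The helper name and the selection of rows are illustrative.

```python
from typing import Tuple

# (low, high) Chinese characters-per-token, copied from the table above.
ZH_CPT = {
    "DeepSeek V3 / V4": (1.1, 1.2),
    "Qwen family": (1.1, 1.3),
    "GPT-4o / o200k_base": (1.0, 1.3),
    "Claude Opus 4.7": (0.75, 0.9),
}


def token_range(n_chars: int, cpt: Tuple[float, float]) -> Tuple[int, int]:
    """Estimated (fewest, most) tokens for n_chars, given a CPT range."""
    low_cpt, high_cpt = cpt
    # Higher CPT means fewer tokens, so the bounds swap.
    return round(n_chars / high_cpt), round(n_chars / low_cpt)


for vendor, cpt in ZH_CPT.items():
    print(vendor, token_range(1000, cpt))
```

Because a higher CPT means fewer tokens, a vendor’s best-case token count comes from the top of its CPT range.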

🎯 Three Ways CJK Still Has an Edge

More tokens doesn’t mean CJK users get nothing. Three legitimate advantages remain:

1. Context windows hold more “meaning”

Even though token counts are higher per equivalent meaning, character counts are lower. When you stuff a whole book or contract into context, Chinese consumes less character space — the density advantage still exists at the context-window level even if not at the pricing level.

Example: Claude Opus 4.7’s 1M-token context

| Language | ~Chars per token | Chars fitting in 1M tokens |
| --- | --- | --- |
| English | ~5.4 | ~5.4M characters |
| Chinese | ~0.85 | ~850K characters |

850K characters of Chinese ≈ 3–4 standard novels. On an information-content basis, the Chinese version fits roughly as much meaning as the 5.4M-character English version.

2. Short commands nearly tie

Task 1 shows Chinese only 6% worse, Japanese only 33% worse — for apps dominated by short interactions (customer service, search, simple tool calls), the CJK cost penalty is mild.

3. Chinese-specialized models flip the result

DeepSeek V4’s Chinese CPT sits at 1.1–1.2 — already close to its own English density. Combined with pricing at 1/10 to 1/50 of Western models, total cost for a Chinese application can actually be lower than GPT-4o.


💰 Real Cost Delta: Monthly Bill Estimation

Scenario: Chinese customer-support app, 10M conversations/month, avg 200 Chinese chars input, 150 Chinese chars output per conversation.

| Model | ZH input tokens (per call) | ZH output tokens (per call) | Monthly total tokens | Unit price ($/M in / out) | Monthly cost |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | ~250 | ~190 | 2.5B in / 1.9B out | $2.5 / $10 | $25,250 |
| Claude Opus 4.7 | ~280 | ~220 | 2.8B in / 2.2B out | $5 / $25 | $69,000 |
| DeepSeek V4 | ~180 | ~140 | 1.8B in / 1.4B out | $0.28 / $0.42 | $1,092 |
| Qwen-Max | ~180 | ~140 | 1.8B in / 1.4B out | ~$0.5 / $2 | ~$3,700 |

Same Chinese workload, DeepSeek costs ~60x less than Claude Opus 4.7. Quality isn’t equivalent of course — this just illustrates how “tokenizer efficiency × unit price” stacks up.
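
The bill is tokens per call, times call volume, times unit price. The sketch below is a minimal estimator using the per-call token counts and prices from the table. It assumes 10M calls per month, which is the volume the table’s monthly token totals imply (2.5B input tokens at ~250 per call). The function name is illustrative.

```python
CALLS_PER_MONTH = 10_000_000  # assumption: volume implied by the monthly totals

# model: (input tokens/call, output tokens/call, $ per 1M input, $ per 1M output)
MODELS = {
    "GPT-4o": (250, 190, 2.5, 10),
    "Claude Opus 4.7": (280, 220, 5, 25),
    "DeepSeek V4": (180, 140, 0.28, 0.42),
    "Qwen-Max": (180, 140, 0.5, 2),
}


def monthly_cost(in_tok, out_tok, price_in, price_out, calls=CALLS_PER_MONTH):
    """Monthly bill in dollars; prices are per million tokens."""
    per_call_micro_dollars = in_tok * price_in + out_tok * price_out
    return calls * per_call_micro_dollars / 1_000_000


for model, spec in MODELS.items():
    print(f"{model}: ${monthly_cost(*spec):,.0f}")
```

Swapping in your own per-call token counts (from the API usage field) and current list prices gives a first-order estimate. Caching and batch discounts are not modeled here.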


🧠 The Clever-Looking Trap: “Translate-Both-Ways”

Once you learn English uses fewer tokens, there’s an obvious-sounding strategy:

“Let AI translate my Chinese prompt into English → run the task in English → translate the output back to Chinese.”

It sounds smart. In practice it’s almost always more expensive. Reason: you pay for translation twice, and translation itself costs tokens.

Numbers: Long document analysis

Analyzing a 50,000-character Chinese report:

| Step | Native Chinese flow | English-bridge flow |
| --- | --- | --- |
| 1. Read document | 55K Chinese tokens | 55K in → 37K English translation out = 92K |
| 2. Run analysis | 55K in + 5.5K out = 60.5K | 37K in + 3.7K out = 40.7K |
| 3. Translate back | n/a | 3.7K in + 5K Chinese out = 8.7K |
| Total | 60.5K | 141.4K |

The English bridge burns 2.3x more tokens — a net loss.
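
The totals are easy to check by summing each flow’s input and output tokens. A minimal sketch, with the figures copied from the table (in thousands of tokens) and illustrative dictionary keys:

```python
# Token counts in thousands (K), copied from the table above.
native = {"analysis_in": 55, "analysis_out": 5.5}
bridge = {
    "translate_in": 55, "translate_out": 37,   # step 1: Chinese -> English
    "analysis_in": 37, "analysis_out": 3.7,    # step 2: analysis in English
    "back_in": 3.7, "back_out": 5,             # step 3: English -> Chinese
}

native_total = sum(native.values())
bridge_total = sum(bridge.values())
ratio = bridge_total / native_total

print(f"native {native_total:.1f}K, bridge {bridge_total:.1f}K, ratio {ratio:.1f}x")
```

The bridge pays for the document three times: once to read it in Chinese, once to write it in English, and once to read the English back in for analysis.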

Numbers: Short intent + medium output (code generation)

For a short-intent, long-output task like “build me a React todo app”:

| Step | Native Chinese | English-bridge (write prompt in English directly) |
| --- | --- | --- |
| 1. Prompt | 100 tokens | 75 tokens |
| 2. Execute + think | 30K | 25K |
| 3. Output | Already Chinese | Translate explanation to Chinese: 3K |
| Total | 30K | 28K |

~7% savings — but only if you write the prompt in English directly, skipping the first translation hop.

When the strategy actually saves (three conditions, all required)

  1. ✅ Prompt is short or written in English directly (skip the first translation)
  2. ✅ Execution is heavy (reasoning-intensive tasks with lots of thinking tokens)
  3. ✅ Final output is short or doesn’t need back-translation (code, tech docs can stay English)

What actually works for CJK cost savings

Skip the double translation; use a hybrid strategy:

| Tactic | Savings | Trade-off |
| --- | --- | --- |
| English system prompts (user content stays in CJK) | Fixed per-call saving, compounds at scale | Models understand English system prompts perfectly |
| English few-shot examples | Same × example count | Requires preparing examples |
| Keep output in English (tech tasks) | Skips the entire translation-back step | User must read English |
| Prompt caching | 50–90% discount on input | Requires repeated prompts to trigger |
| Use DeepSeek / Qwen for Chinese | 20–30% fewer tokens + 10–50x cheaper unit price | Slight quality gap vs. Claude / GPT |
| Tighter Chinese prompts | 10–20% | Time investment to polish |

One-liner: Don’t translate full content in both directions — almost always more expensive. Anglicize the static parts, keep dynamic parts in CJK, and lean on caching — that’s the strategy that actually wins.
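
To see how a fixed per-call saving compounds, the sketch below estimates the input tokens saved by keeping a static system prompt in English. The workload is hypothetical: a 500-token English system prompt and 1M calls per month. Only the 1.34x multiplier comes from the article (the o200k_base Chinese average).

```python
def tokens_saved(en_prompt_tokens: int, multiplier: float, calls: int) -> float:
    """Input tokens saved by sending a static prompt in English, not Chinese."""
    zh_prompt_tokens = en_prompt_tokens * multiplier  # same prompt in Chinese
    return (zh_prompt_tokens - en_prompt_tokens) * calls


# Hypothetical workload; 1.34 is the article's o200k_base Chinese average.
saved = tokens_saved(en_prompt_tokens=500, multiplier=1.34, calls=1_000_000)
print(f"{saved / 1e6:.0f}M input tokens saved per month")
```

This ignores prompt caching, which discounts repeated input in either language and shrinks the absolute saving.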


🛠️ Practical Guidance

When estimating cost

  1. Never extrapolate English token counts to Chinese bills — multiply by 1.3–1.6x (o200k) or 1.6–2.5x (cl100k)
  2. Re-benchmark every model upgrade — Opus 4.6 → 4.7 CJK regression is the cautionary tale
  3. System prompts and few-shot examples are repeated across every call — Chinese token inflation compounds; prompt caching helps offset
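
For a quick budget estimate, an English token count can be scaled by the measured ratios. The sketch below uses the lowest, average, and highest ratios from the summary table to produce a range. The lookup table layout and function name are illustrative, and the result is only as good as the six-task sample behind it.

```python
# (lowest, average, highest) ratios vs English, from the summary table.
MULTIPLIERS = {
    ("o200k_base", "ZH"): (1.06, 1.34, 1.55),
    ("o200k_base", "JA"): (1.33, 1.73, 2.17),
    ("cl100k_base", "ZH"): (1.61, 2.08, 2.52),
    ("cl100k_base", "JA"): (1.72, 2.32, 2.93),
}


def estimate_cjk_tokens(en_tokens: int, tokenizer: str, lang: str):
    """Scale an English token estimate to a (low, typical, high) CJK range."""
    return tuple(round(en_tokens * m) for m in MULTIPLIERS[(tokenizer, lang)])


print(estimate_cjk_tokens(1000, "o200k_base", "ZH"))
print(estimate_cjk_tokens(1000, "cl100k_base", "JA"))
```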

Vendor selection

| Situation | Recommendation |
| --- | --- |
| Chinese-heavy, cost-sensitive (customer service, high volume) | DeepSeek V4 / Qwen |
| Chinese-heavy, quality-first (legal / finance / coding) | Claude Opus 4.7 or GPT-5 |
| Mixed-language / global apps | GPT-4o / Gemini 3 |
| Short, high-frequency commands | Any o200k-class tokenizer works |
| Long-document batch processing | Benchmark tokenizer efficiency alongside unit price |

Token compression tactics

Since CJK inflation is a fact, reduce from the content side:

  • Write tighter Chinese — avoid filler (“的話”, “的時候”, “做一個 X 的動作”)
  • Prefer structure over prose — bullets, tables, key-value pairs beat long sentences
  • Consider English system prompts — when user content is Chinese, an English system prompt doesn’t hurt model comprehension but saves tokens
  • Use prompt caching aggressively — cached Chinese system prompts recover much of the lost efficiency

🧭 Bottom Line: Trust Measurement, Not Intuition

“Chinese saves tokens” is a reasonable-sounding intuition that doesn’t survive contact with real measurements. Hanzi density shows up in the context window, not the token bill.

The real picture:

  • 📉 CJK always uses more tokens on modern LLMs — Chinese 1.1–1.6x, Japanese 1.3–2.2x
  • 📈 Tokenizer updates can go either way (GPT-4 → 4o improved 36%; Opus 4.6 → 4.7 regressed 15–35%)
  • 💡 DeepSeek / Qwen are the genuine Chinese tokenizer optimizers, and combined with low pricing they offer the best Chinese cost-performance
  • 📊 CJK’s character-density advantage still shows up at the context-window level — just don’t mistake it for token-level efficiency

Don’t trust the intuition “Chinese saves money with AI.” Trust your own measurement. Run 100 representative prompts through tiktoken or your API’s usage field — that number is worth more than any public benchmark.


❓ FAQ

Is "Chinese saves tokens" ever actually true?

On cherry-picked short sentences, occasionally yes — but not as a general rule.

cl100k_base (the GPT-4 era) has specific optimizations for certain high-frequency Chinese phrases, so on very short, very common sentences Chinese occasionally ties or slightly beats English. But scale up the sample and write tighter translations and the gap flips: the 2023 arXiv study (2305.15425) across 2M sentences averages Chinese at 1.76x English, the same direction as the tiktoken measurements in this article.

Rule to memorize: on modern LLMs, CJK always consumes more tokens. The only question is how much more.

Then why does the context window sometimes "feel" bigger for Chinese?

Because you’re feeling character count, not token count. A 1M-token context fits ~850K Chinese characters or ~5.4M English characters. At first glance English looks to fit more — but measured by “how many books of meaning,” the high information density of Chinese lets 850K characters hold 3–4 novels worth of content.

Conclusion: context windows are a wash for CJK (maybe a slight advantage) but API bills are worse for CJK — keep these two mental models separate.

Are DeepSeek / Qwen really more efficient at Chinese than GPT / Claude?

Yes, and they’re the only clear exception in Chinese tokenization. DeepSeek and Qwen train on corpora weighted heavily toward Chinese, and their vocabulary allocation reflects it — CPT reaches 1.1–1.3 versus the 1.0-and-below typical of Western models.

Practical impact: for pure Chinese content, DeepSeek / Qwen consume roughly 70–85% of GPT-4o’s tokens. Combined with pricing far below Western models, total Chinese-app cost can be 1/10 to 1/60 of Western models.

I write my prompts in Chinese. Should I switch to English?

Mostly no. Reasons:

  1. A prompt you write well in Chinese beats a clumsy English one — expression precision matters far more than saving 20% tokens
  2. Prompt quality > token efficiency — a clear prompt that works first try beats a cheap one that fails three times
  3. The CJK penalty isn’t as big as it feels — o200k’s 1.3x Chinese multiplier rarely changes project viability

When is English worth considering?

  • Repeated system prompts and few-shot examples (the penalty compounds over millions of calls)
  • Technical domains where English terminology is more precise (AI, legal, medical)
  • High-frequency, cost-sensitive API apps

User input and model output can stay in Chinese — only fixed prompt templates are worth translating.

Why is Japanese worse than Chinese?

Japanese mixes three writing systems:

  • Kanji: high information density, similar to Chinese
  • Hiragana / Katakana: each character encodes only one syllable but still consumes ~1 token — low information-per-token ratio
  • Katakana for loanwords (e.g., “コンピュータ” = computer): one English word stretched into 6 katakana, or roughly 6 tokens at ~1 token per character

So even though Japanese contains high-efficiency kanji, the average ends up worse than pure-Hanzi Chinese. This is why every single Task above shows Japanese worse than Chinese.
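
One way to see how much of a Japanese text is kana is to count characters by script. The sketch below classifies characters using the basic Unicode blocks for hiragana, katakana, and CJK unified ideographs. It is a rough illustration, not a full script detector: extension blocks and half-width kana are lumped into "other".

```python
from collections import Counter


def script_of(ch: str) -> str:
    """Classify one character by Unicode block (basic ranges only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"   # includes the long-vowel mark U+30FC
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def script_counts(text: str) -> Counter:
    """Count characters per script in a piece of text."""
    return Counter(script_of(ch) for ch in text)


print(script_counts("コンピュータ"))
print(script_counts("この記事を要約し、重要なポイントを5つ箇条書きで挙げてください。"))
```

A text with a high share of kana carries less meaning per character, which at roughly one token per character is the effect described above.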

How do I measure my own app's token consumption?

Three options, easiest to most precise:

  1. gptforwork.com/tools/tokenizer — paste text, get counts for GPT / Claude / Gemini / Grok
  2. OpenAI tiktoken (Python):
    import tiktoken
    # encoding_for_model maps a model name to its encoding (o200k_base for gpt-4o)
    enc = tiktoken.encoding_for_model("gpt-4o")
    print(len(enc.encode("your text")))  # number of tokens in the text
  3. Real API calls: read the usage field. The input_tokens / output_tokens values in the response are the only 100% accurate source (especially for Claude / Gemini, whose tokenizers aren’t public)

Recommended: run 100 representative prompts and average. That number beats any public benchmark for your specific app.
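
Averaging can be done with a few lines once the usage records are collected. The sketch below assumes each record is a dict with `input_tokens` and `output_tokens` keys, mirroring the field names mentioned above. The exact response shape varies by vendor and SDK, so adapt the key names to what your API returns.

```python
from statistics import mean
from typing import Dict, Iterable


def average_usage(usages: Iterable[Dict[str, int]]) -> Dict[str, float]:
    """Average input/output tokens over per-call usage records."""
    records = list(usages)  # materialize so we can iterate twice
    return {
        "input_tokens": mean(u["input_tokens"] for u in records),
        "output_tokens": mean(u["output_tokens"] for u in records),
    }


# Hypothetical records, e.g. collected from 100 representative prompts.
sample = [
    {"input_tokens": 250, "output_tokens": 190},
    {"input_tokens": 270, "output_tokens": 210},
]
print(average_usage(sample))
```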

