Is Chinese More Token-Efficient Than English? Tested

Published: 2026-04-14 | Updated: 2026-07-14

Content checked: 2026-07-14 Sources checked: 2026-07-14

Is Chinese more token-efficient than English? In our six paired tests, no. With OpenAI’s o200k_base tokenizer, equivalent Chinese prompts used 1.06–1.55x as many tokens as English, averaging 1.34x. With the older cl100k_base, the Chinese average was 2.08x.

That answer is specific to the test set and OpenAI tokenizers. It is not a universal law for every model: tokenization differs by vendor, model generation and wording. The reliable way to estimate cost is to measure representative production prompts with the tokenizer or API usage data for the exact model you plan to use.

Below you can inspect all six prompt pairs, see why character count is a poor proxy for token count, and decide whether changing language would actually reduce your bill.

TL;DR

🔤 Which language consumes more tokens for the same meaning? (measured on o200k_base / GPT-4o, GPT-5)

Rank	Language	Multiplier vs English
🥇	English	1.0x (baseline)
🥈	Chinese	1.34x
🥉	Japanese	1.73x
4	Korean	~2.0x

One-line summary: In this six-task o200k_base benchmark, Chinese used about 34% more tokens and Japanese about 73% more than equivalent English. Measure your own prompts before changing language or vendor.

5 Key Empirical Findings

CJK consumed more tokens in all six tested pairs. On o200k_base: Chinese 1.06–1.55x, Japanese 1.33–2.17x. On cl100k_base: Chinese 1.61–2.52x, Japanese 1.72–2.93x.
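Task type matters a lot. On o200k_base, short commands are nearly tied (Chinese 1.06x), while long business prose reaches 1.55x for Chinese and 2.17x for Japanese.
Generational tokenizer improvement is real. Moving from cl100k_base to o200k_base lowered the average ratio by 36% for Chinese (2.08x to 1.34x) and 25% for Japanese (2.32x to 1.73x).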
Japanese is consistently worse than Chinese — hiragana and katakana drag the average.
Vendor results cannot be inferred from tiktoken. Test the exact model and tokenizer rather than applying OpenAI ratios to Claude, Gemini, DeepSeek or Qwen.

🔍 Why the “CJK Saves Tokens” Intuition Sounds Right

Before we show why it’s wrong, let’s give the intuition its due — there are three legitimate-looking reasons to believe it.

Three clues that make CJK look more efficient

1. Hanzi genuinely carry more information per character

One “學” character encodes the meaning scaffold behind “learn,” “study,” and “scholarship.” Per-character information density in CJK really is higher than in alphabetic scripts. The intuitive leap: a good tokenizer should reward that.

2. Character counts are genuinely lower

The Chinese version of an article is typically 35–45% of the English character count. When you eyeball length, Chinese looks “much shorter” — an easy step to “must be fewer tokens too.”

3. Cherry-picked short-sentence comparisons can favor Chinese

If you test a couple of handpicked examples:

EN: "Summarize this article in bullets" (6 tokens)
ZH: 「條列摘要這篇文章」(5 tokens)

Chinese looks like a win — and the story spreads.

Why the intuition still breaks

The gap between actual tokenizer behavior and intuitive information density is where it goes wrong:

Hanzi density doesn’t translate into fewer tokens. Tokenizers often spend about one token per CJK character, so the semantic density shows up as fewer characters on screen, not fewer tokens. Context limits and API bills are both counted in tokens, so density alone helps neither.
Character count ≠ token count. “learn” is 5 characters but usually 1 token (merged as a common word). “學” is 1 character and also 1 token. BPE vocabularies compress common English words far more aggressively than Chinese text.
Cherry-picked short sentences aren’t the average. A handful of examples can hit tokenizer-specific optimizations for common phrases, but cross-task averages tell the real story.

Supporting evidence: 2M-sentence study

A large-scale study using 2 million professionally human-translated sentences measured the token ratio between English and each target language:

The paper reported English as the 1.0x baseline, Mandarin Chinese at 1.76x, Cantonese at 2.10x, Japanese at 2.12x, and Korean at 2.36x for its dataset and evaluated tokenizers.

In that study, every reported CJK language used more tokens than the English baseline. That supports the direction of this smaller benchmark, but it still does not prove that every tokenizer and every sentence behaves the same way.

🧪 Six-Task Benchmark with tiktoken

All measurements below use OpenAI’s official tiktoken library across two tokenizers actually used by current models:

cl100k_base: older OpenAI models including GPT-4 and some GPT-3.5 variants
o200k_base: GPT-4o, GPT-5 series, latest OpenAI mainline

Task 1: Short instruction prompt

EN: "Summarize this article and list the top 5 key takeaways in bullet points."
ZH: 「摘要這篇文章，並以條列式列出前 5 個重點。」
JA: 「この記事を要約し、重要なポイントを5つ箇条書きで挙げてください。」

cl100k_base: EN 18, ZH 29 (1.61x), JA 31 (1.72x).

o200k_base: EN 18, ZH 19 (1.06x), JA 24 (1.33x).

Short instructions are the friendliest case for CJK — near parity on o200k.
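The Task 1 comparison can be reproduced with a sketch along these lines. It assumes `tiktoken` is installed and that the prompts were measured without their surrounding quotation marks; the helper names `ratios_vs_english` and `tiktoken_counter` are illustrative, not part of the library.

```python
from typing import Callable, Dict

# Task 1 prompts, copied from the article (quotation marks excluded by assumption).
TASK_1 = {
    "EN": "Summarize this article and list the top 5 key takeaways in bullet points.",
    "ZH": "摘要這篇文章，並以條列式列出前 5 個重點。",
    "JA": "この記事を要約し、重要なポイントを5つ箇条書きで挙げてください。",
}


def ratios_vs_english(texts: Dict[str, str],
                      count_tokens: Callable[[str], int]) -> Dict[str, float]:
    """Return each language's token count divided by the English count."""
    baseline = count_tokens(texts["EN"])
    return {lang: round(count_tokens(text) / baseline, 2)
            for lang, text in texts.items()}


def tiktoken_counter(encoding_name: str) -> Callable[[str], int]:
    """Build a token counter for one encoding (tiktoken is imported lazily)."""
    import tiktoken  # assumed installed: pip install tiktoken
    enc = tiktoken.get_encoding(encoding_name)
    return lambda text: len(enc.encode(text))
```

Calling `ratios_vs_english(TASK_1, tiktoken_counter("o200k_base"))` and then the same with `"cl100k_base"` should give ratios close to the ones above, provided the text matches what was measured. Passing the counter in as an argument keeps the ratio logic reusable with another vendor's tokenizer.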

Task 2: Business prose paragraph

A 5-sentence paragraph on LLM impact on enterprise text processing.

cl100k_base: EN 69, ZH 174 (2.52x), JA 202 (2.93x).

o200k_base: EN 69, ZH 107 (1.55x), JA 150 (2.17x).

Worst case — long business content on cl100k_base reaches nearly 3x.

Task 3: Translation task (sales report)

cl100k_base: EN 56, ZH 120 (2.14x), JA 139 (2.48x).

o200k_base: EN 56, ZH 79 (1.41x), JA 101 (1.80x).

Task 4: Project proposal

cl100k_base: EN 90, ZH 177 (1.97x), JA 197 (2.19x).

o200k_base: EN 89, ZH 119 (1.34x), JA 145 (1.63x).

Task 5: Code comment

cl100k_base: EN 22, ZH 46 (2.09x), JA 54 (2.45x).

o200k_base: EN 22, ZH 29 (1.32x), JA 41 (1.86x).

Task 6: Creative copy

cl100k_base: EN 46, ZH 99 (2.15x), JA 100 (2.17x).

o200k_base: EN 44, ZH 60 (1.36x), JA 70 (1.59x).

📊 Summary Across All Tasks

Metric	cl100k_base ZH	cl100k_base JA	o200k_base ZH	o200k_base JA
Lowest ratio	1.61x	1.72x	1.06x	1.33x
Highest ratio	2.52x	2.93x	1.55x	2.17x
Average	2.08x	2.32x	1.34x	1.73x

Rule-of-thumb to memorize:

On GPT-4-era cl100k_base: Chinese ≈ 2x English tokens, Japanese ≈ 2.3x
On GPT-4o / GPT-5: Chinese ≈ 1.3x English tokens, Japanese ≈ 1.7x
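The summary table can be re-derived from the per-task counts. The sketch below copies the token counts from the six tasks and assumes the article's "Average" is the mean of the per-task ratios rather than the ratio of summed tokens; the function name `summarize` is illustrative.

```python
# Token counts copied from the six tasks above: (EN, ZH, JA) per task.
CL100K = [(18, 29, 31), (69, 174, 202), (56, 120, 139),
          (90, 177, 197), (22, 46, 54), (46, 99, 100)]
O200K = [(18, 19, 24), (69, 107, 150), (56, 79, 101),
         (89, 119, 145), (22, 29, 41), (44, 60, 70)]


def summarize(rows):
    """Return (lowest, highest, average) ratio vs English for ZH and JA."""
    out = {}
    for name, idx in (("ZH", 1), ("JA", 2)):
        ratios = [row[idx] / row[0] for row in rows]
        out[name] = tuple(round(x, 2) for x in
                          (min(ratios), max(ratios), sum(ratios) / len(ratios)))
    return out


print("cl100k_base", summarize(CL100K))
print("o200k_base", summarize(O200K))
```

Swapping in counts from your own prompt pairs gives a workload-specific version of the same table.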

🧮 Measurement Detail: Fewer Chinese Characters, but MORE Tokens

The most counter-intuitive finding — verified in Task 2 and Task 4:

Task 2 used 371 English characters versus 136 Chinese characters (0.37x by characters), yet the Chinese o200k_base token count was 1.55x. Task 4 used 342 English versus 141 Chinese characters (0.41x), while the Chinese token count was 1.34x.

Chinese uses only 37–41% of the character count, but 34–55% more tokens. The mechanism:

The tokenizer merges common English words. Words like “the,” “of,” “transformation,” and “enterprise” each take only 1–2 tokens despite their character length. In Task 2, English works out to about 5.4 characters per token (371 ÷ 69).

Chinese text gets far less merging. Most Hanzi end up as their own token, and some less-common Hanzi split into multiple byte-level fragments. In Task 2, Chinese works out to about 1.3 characters per token on o200k_base (136 ÷ 107).

That roughly 4x compression differential overwhelms the “Chinese has fewer characters” advantage: 0.37x the characters at about a quarter of the compression still comes out to 1.55x the tokens.
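The mechanism can be checked as arithmetic on the Task 2 and Task 4 figures reported above. This sketch splits the token ratio into a character ratio multiplied by a compression ratio; the dictionary layout and the name `breakdown` are illustrative.

```python
# Character and o200k_base token counts reported above for Task 2 and Task 4.
TASKS = {
    "Task 2": {"en_chars": 371, "zh_chars": 136, "en_tokens": 69, "zh_tokens": 107},
    "Task 4": {"en_chars": 342, "zh_chars": 141, "en_tokens": 89, "zh_tokens": 119},
}


def breakdown(t):
    """Split the token ratio into a character ratio and a compression ratio."""
    en_cpt = t["en_chars"] / t["en_tokens"]       # English characters per token
    zh_cpt = t["zh_chars"] / t["zh_tokens"]       # Chinese characters per token
    char_ratio = t["zh_chars"] / t["en_chars"]    # how much shorter Chinese looks
    token_ratio = char_ratio * (en_cpt / zh_cpt)  # equals zh_tokens / en_tokens
    return en_cpt, zh_cpt, char_ratio, token_ratio


for name, t in TASKS.items():
    en_cpt, zh_cpt, char_ratio, token_ratio = breakdown(t)
    print(f"{name}: EN {en_cpt:.2f} chars/token, ZH {zh_cpt:.2f} chars/token, "
          f"chars {char_ratio:.2f}x, tokens {token_ratio:.2f}x")
```

Whenever the compression ratio (`en_cpt / zh_cpt`) is larger than the inverse of the character ratio, the shorter-looking text costs more tokens.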

Watch out for non-equivalent translations

A common unfair example circulating online:

EN: "Summarize the following article and extract the top 5 key points in bullet format" (15 tokens)
ZH: 「摘要以下文章，並以條列式提取前 5 個重點」(9 tokens)

Chinese looks 40% cheaper, but check a pair like this before trusting it. If one version drops a constraint or compresses the verb, the gap reflects translation style rather than tokenizer efficiency, and quoted token counts are worth re-running with the tokenizer you actually use. Every Task in this article uses the tightest semantically equivalent translation to keep the comparison fair.

📈 Tokenizer Generations: o200k_base Is a Real Improvement

From cl100k_base to o200k_base, the average CJK-to-English ratio in this benchmark fell by 25–36%:

English stays at the 1.0x comparison baseline. The Chinese average fell from 2.08x on cl100k_base to 1.34x on o200k_base, a 36% reduction in the ratio. Japanese fell from 2.32x to 1.73x, a 25% reduction.
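The percentages above follow from the two averages. A minimal sketch, with `ratio_reduction` as an illustrative name:

```python
def ratio_reduction(old_ratio: float, new_ratio: float) -> float:
    """Percentage drop in the tokens-vs-English ratio between two tokenizers."""
    return (old_ratio - new_ratio) / old_ratio * 100


# Average ratios reported in this article (cl100k_base -> o200k_base).
print(f"Chinese:  {ratio_reduction(2.08, 1.34):.0f}%")
print(f"Japanese: {ratio_reduction(2.32, 1.73):.0f}%")
```

This measures the drop in the ratio itself. Measuring the drop in the excess over the 1.0x baseline gives a different percentage, so state which one you mean when quoting an improvement.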

This comparison says nothing about a non-OpenAI model. A newer model can use a different tokenizer, and providers may count hidden reasoning, cached input or tool traffic differently. Re-run the workload whenever you change the model ID.

How to Compare Token Efficiency Across Vendors

Do not compare vendors with a single characters-per-token table. A fair test needs the same semantic workload and each provider’s own token counts.

Collect at least 100 representative requests, including system prompts, user text, tool definitions and expected output length.
Prepare equivalent Chinese and English versions. Have a bilingual reviewer check that neither version silently omits constraints.
Send both sets to the exact model IDs you are evaluating and record input, cached input, output and reasoning tokens where reported.
Compare task success before cost. A cheaper response that needs a second attempt is not cheaper in production.
Multiply observed tokens by the provider’s current prices, then include caching discounts and batch pricing only when your workflow actually qualifies.

Use characters per token only to describe compression within one language. Use tokens for equivalent meaning to compare languages. They answer different questions.
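One way to apply the steps above is to log each request's reported usage and compare cost per successful task rather than cost per token. The sketch below is an assumption-heavy illustration: the `RunRecord` schema, its field names and the prices passed in are hypothetical, not any provider's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunRecord:
    """One request's usage as reported by a provider (hypothetical schema)."""
    model: str
    language: str        # e.g. "zh" or "en"
    input_tokens: int
    output_tokens: int
    succeeded: bool      # passed your own task-success check


def cost_per_success(records: List[RunRecord], input_price: float,
                     output_price: float) -> float:
    """Total cost divided by successful requests; prices are per 1M tokens."""
    if not records:
        raise ValueError("no records to evaluate")
    total = sum(r.input_tokens * input_price + r.output_tokens * output_price
                for r in records) / 1_000_000
    successes = sum(r.succeeded for r in records)
    if successes == 0:
        return float("inf")  # nothing worked, so no price makes it cheap
    return total / successes
```

Run it once per model and language over the same evaluation set. Dividing by successes rather than by requests is what makes a cheap model that needs retries show up as expensive.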

Character Density Is Not a Context-Window Shortcut

Chinese usually expresses equivalent meaning with fewer visible characters, but context windows are measured in tokens, not characters. If a tokenizer produces more tokens for the Chinese version, the model does not gain extra context capacity merely because the text looks shorter on screen.

Estimate context capacity with complete documents from your workload. Headings, tables, code, punctuation and proper names can shift the ratio substantially.

Short commands can nearly tie

On o200k_base, Task 1 shows Chinese using only 6% more tokens and Japanese only 33% more. For apps dominated by short interactions (customer service, search, simple tool calls), the CJK cost penalty is mild.

That is why a short chatbot command may show little difference while a long proposal produces a larger gap. Workload mix matters more than a global language multiplier.

Calculate the Monthly Cost with Your Own Usage

For each model, use observed averages rather than character-count guesses:

monthly input cost  = requests × average input tokens  ÷ 1,000,000 × input price per 1M tokens
monthly output cost = requests × average output tokens ÷ 1,000,000 × output price per 1M tokens
monthly total       = monthly input cost + monthly output cost + tool or storage charges

Keep cached input separate because it may have a different price. Also compare the percentage of requests completed correctly on the first attempt. Token efficiency is only one part of total cost.
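The formula above translates directly into a small function. In this sketch the request volume, token averages and prices are made-up placeholders, and prices are assumed to be quoted per 1,000,000 tokens; substitute your observed averages and your provider's current rates.

```python
def monthly_cost(requests: int, avg_input_tokens: float, avg_output_tokens: float,
                 input_price: float, output_price: float,
                 other_charges: float = 0.0) -> float:
    """Monthly total; prices are assumed to be per 1,000,000 tokens."""
    input_cost = requests * avg_input_tokens / 1_000_000 * input_price
    output_cost = requests * avg_output_tokens / 1_000_000 * output_price
    return input_cost + output_cost + other_charges


# Hypothetical workload: same requests, Chinese input 1.34x the English input.
english = monthly_cost(100_000, 1_000, 300, input_price=2.0, output_price=8.0)
chinese = monthly_cost(100_000, 1_340, 300, input_price=2.0, output_price=8.0)
print(f"EN ${english:,.2f} vs ZH ${chinese:,.2f}")
```

Only the input side changes in this example. If the output is also written in Chinese, scale `avg_output_tokens` from measured usage as well.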

🧠 The Clever-Looking Trap: “Translate-Both-Ways”

Once you learn English uses fewer tokens, there’s an obvious-sounding strategy:

“Let AI translate my Chinese prompt into English → run the task in English → translate the output back to Chinese.”

It can be more expensive because you pay for translation twice, and translation itself consumes tokens. Test the full workflow before adopting it.

Numbers: Long document analysis

Analyzing a 50,000-character Chinese report:

In this illustrative long-document scenario, the native Chinese flow is 55K input plus 5.5K output, or 60.5K tokens. Translating first uses 55K Chinese input plus 37K English output; analysis then uses 37K input plus 3.7K output, and translation back adds 3.7K input plus 5K output. The bridge totals 141.4K. These are scenario assumptions, so replace them with observed usage before making a product decision.

In this scenario the English bridge uses about 2.3x as many tokens as the native flow (141.4K versus 60.5K), a net loss.
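The scenario totals can be checked by summing every hop. The figures below are the scenario assumptions stated above, in thousands of tokens; the step names are illustrative.

```python
# Scenario assumptions from the paragraph above, in thousands of tokens.
native = {"analysis": (55.0, 5.5)}  # (input, output)
bridge = {
    "translate_zh_to_en": (55.0, 37.0),
    "analysis_in_english": (37.0, 3.7),
    "translate_en_to_zh": (3.7, 5.0),
}


def total_tokens(steps):
    """Sum input and output tokens over every hop in a workflow."""
    return sum(inp + out for inp, out in steps.values())


native_total = total_tokens(native)
bridge_total = total_tokens(bridge)
print(f"native {native_total:.1f}K, bridge {bridge_total:.1f}K, "
      f"ratio {bridge_total / native_total:.1f}x")
```

Replacing the tuples with usage observed from a real run shows whether the bridge pays off for your documents.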

Numbers: Short intent + medium output (code generation)

For a short-intent, long-output task like “build me a React todo app”, most of the tokens are generated code, so writing the prompt directly in English avoids the first translation hop. One illustrative run might be 30K tokens natively versus 28K with a short English prompt and a brief Chinese explanation.

That is a saving of about 7%, and only if you write the prompt in English yourself. A difference this small is easily erased by retries, so measure task success as well as tokens.

When the strategy actually saves (three conditions, all required)

✅ Prompt is short or written in English directly (skip the first translation)
✅ Execution is heavy (reasoning-intensive tasks with lots of thinking tokens)
✅ Final output is short or doesn’t need back-translation (code, tech docs can stay English)

What actually works for CJK cost savings

Skip the double translation; use a hybrid strategy:

Useful experiments include shortening repeated system instructions, removing redundant few-shot examples, keeping code and technical identifiers in their natural language, enabling prompt caching when repeated prefixes qualify, and testing a second model with the same evaluation set. Do not assume a savings percentage until the API usage confirms it.

One-line rule: do not translate full content in both directions without measuring the extra hops. For repeated workloads, test whether shorter static prompts and prompt caching provide savings without changing the user’s language.

🛠️ Practical Guidance

When estimating cost

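Never apply English token counts directly to Chinese bills. As a first estimate from this benchmark, multiply by 1.3–1.6x (o200k_base) or 1.6–2.5x (cl100k_base), then replace the estimate with measured usage.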
Re-benchmark every model upgrade — Opus 4.6 → 4.7 CJK regression is the cautionary tale
System prompts and few-shot examples are repeated across every call — Chinese token inflation compounds; prompt caching helps offset

Vendor selection

For high-volume Chinese customer service, compare at least two models on first-attempt resolution and total cost. For legal, finance or coding tasks, quality review matters more than tokenizer compression alone. Mixed-language products need a multilingual test set, while long-document batch processing should include context limits, caching and unit price in the same benchmark.

Token compression tactics

Since CJK inflation is a fact, reduce from the content side:

Write tighter Chinese — avoid filler (“的話”, “的時候”, “做一個 X 的動作”)
Prefer structure over prose — bullets, tables, key-value pairs beat long sentences
Consider English system prompts. When user content is Chinese, an English system prompt usually saves tokens; confirm on your own evaluation set that answer quality does not drop.
Use prompt caching where repeated prefixes qualify. Cached Chinese system prompts can recover part of the lost efficiency.

🧭 Bottom Line: Trust Measurement, Not Intuition

“Chinese saves tokens” is a reasonable-sounding intuition that doesn’t survive contact with real measurements. Hanzi density shows up in the character count, not in the token count.

The measured picture:

All six OpenAI-tokenizer pairs used more CJK tokens: Chinese 1.06–1.55x and Japanese 1.33–2.17x on o200k_base.
Tokenizer updates can change the ratio: moving from cl100k_base to o200k_base reduced the average gap in this test.
Vendor comparisons need vendor data: do not use tiktoken to predict Claude, Gemini, DeepSeek or Qwen usage.
Character density is not a billing metric: context limits and API bills are based on tokens.

Don’t trust the intuition “Chinese saves money with AI.” Trust your own measurement. Run 100 representative prompts through tiktoken or your API’s usage field — that number is worth more than any public benchmark.

❓ FAQ

Is "Chinese saves tokens" ever actually true?

On cherry-picked short sentences, occasionally yes — but not as a general rule.

cl100k_base (the GPT-4 era) has specific optimizations for certain high-frequency Chinese phrases, so on very short, very common sentences Chinese occasionally ties or slightly beats English. But scale up the sample and write tighter translations and the gap flips — the 2023 arxiv 2305.15425 study across 2M sentences averages Chinese at 1.76x English, consistent with the tiktoken measurements in this article.

Rule to use: the six tested pairs used more CJK tokens, but your exact ratio depends on the wording and tokenizer. Measure before estimating a bill.

Then why does the context window sometimes "feel" bigger for Chinese?

Because you’re feeling character count, not token count. Chinese text looks much shorter on screen, but context windows are measured in tokens. In this benchmark, equivalent Chinese content used about 1.34x as many o200k_base tokens as English, so a window of the same size holds less equivalent content in Chinese, not more.

Conclusion: context limits and API bills are both counted in tokens, so the CJK penalty applies to both. Character count is the wrong mental model for either.

Are DeepSeek / Qwen really more efficient at Chinese than GPT / Claude?

They may tokenize Chinese differently, but this article’s tiktoken benchmark cannot establish that result because tiktoken is for OpenAI models. For a fair comparison, run the same representative prompts through each provider’s official tokenizer or API, record input and output tokens separately, then multiply by the current unit prices. Model quality and caching also affect the real cost.

I write my prompts in Chinese. Should I switch to English?

Mostly no. Reasons:

A prompt you write well in Chinese beats a clumsy English one — expression precision matters far more than saving 20% tokens
Prompt quality > token efficiency — a clear prompt that works first try beats a cheap one that fails three times
The CJK penalty isn’t as big as it feels — o200k’s 1.3x Chinese multiplier rarely changes project viability

When is English worth considering?

Repeated system prompts and few-shot examples (the penalty compounds over millions of calls)
Technical domains where English terminology is more precise (AI, legal, medical)
High-frequency, cost-sensitive API apps

User input and model output can stay in Chinese — only fixed prompt templates are worth translating.

Why is Japanese worse than Chinese?

Japanese mixes three writing systems:

Kanji: high information density, similar to Chinese
Hiragana / Katakana: each character encodes only one syllable but still consumes ~1 token — low information-per-token ratio
Katakana for loanwords (e.g., “コンピュータ” = computer): one English word stretched into 6 katakana = 6 tokens

So even though Japanese contains high-efficiency kanji, the average ends up worse than pure-Hanzi Chinese. This is why every single Task above shows Japanese worse than Chinese.

How do I measure my own app's token consumption?

Two options, easiest to most precise:

OpenAI tiktoken (Python):

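import tiktoken
# encoding_for_model maps a model name to its tokenizer (o200k_base for gpt-4o)
enc = tiktoken.encoding_for_model("gpt-4o")
# encode() returns a list of token IDs; its length is the token count
print(len(enc.encode("your text")))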

Real API calls — read the usage field — input and output token counts from the exact model are the most reliable source for production estimates.

Recommended: run 100 representative prompts and average. That number beats any public benchmark for your specific app.
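Averaging logged usage takes only a few lines. In this sketch the keys `input_tokens` and `output_tokens` are placeholders, since providers name these fields differently; map them to whatever your API's usage object actually reports.

```python
from statistics import mean
from typing import Dict, List


def average_usage(usages: List[Dict[str, int]]) -> Dict[str, float]:
    """Average input and output tokens across logged API responses.

    The keys "input_tokens" and "output_tokens" are placeholders; map them
    to whatever your provider's usage object actually calls these fields.
    """
    if not usages:
        raise ValueError("need at least one usage record")
    return {
        "avg_input": mean(u["input_tokens"] for u in usages),
        "avg_output": mean(u["output_tokens"] for u in usages),
    }
```

Keeping input and output averages separate matters because the two are usually priced differently.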

Sources and method

OpenAI tiktoken repository — the BPE tokenizer library used for this article’s cl100k_base and o200k_base counts.
Petrov et al., Language Model Tokenizers Introduce Unfairness Between Languages — the NeurIPS 2023 multilingual tokenizer study cited above.

The six task pairs in this article are a small editorial benchmark, not a universal model leaderboard. Reproduce the counts with the listed text before using them in a cost forecast.
