I wanted to write this post to critique the idea that reaching HSK6 level gets you halfway to fluent reading in Chinese.
This idea comes from the famous post HSK 6 Gets You Halfway on the site Chinese The Hard Way. I love the site, and I like the spirit of the article, but it is factually incorrect. I recommend you read it before my critique.
The author has squeezed the numbers (and over-relied on Junda) to make a point. The reality is much more banal, yet also much more hopeful.
Let me start with some surface-level critique before I destroy this myth at the root.
The Author Has Rounded Wrong
One of the main arguments I'll make is that the Junda Frequency List can't be compared to HSK lists. But let's assume for now that it can.
The author's idea revolves around the observation that knowing 2600 characters gives 96-97% coverage of all characters in the texts cited.
For one thing, this isn't true. You can see in the ranges the author provides: knowing the HSK6 characters gives 96-97.7% coverage of all characters in the text. They have rounded down to 96-97%. Let's do proper maths and round up, so that the range is 96-98%.
This means between 2 and 4 characters (not necessarily unique) in every 100 will be unknown, or one every 25-50 (not every 20-30 as they suggest).
If a typical novel has 500-600 characters per page, as the author suggests, then that's anywhere between 10 and 24 new characters per page, not the 20 the author suggests.
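As a quick check on the arithmetic above, here is a minimal sketch that turns a coverage percentage into unknown characters per page. The function name unknown_per_page is my own, and the 96-98% and 500-600 figures are simply the ones quoted in this post.

```python
def unknown_per_page(coverage_pct, chars_per_page):
    """Expected number of unrecognised characters on one page."""
    return chars_per_page * (100 - coverage_pct) / 100

# Best case: highest coverage on the shortest page.
low = unknown_per_page(98, 500)
# Worst case: lowest coverage on the longest page.
high = unknown_per_page(96, 600)

print(f"{low:.0f} to {high:.0f} unknown characters per page")
```

The same subtraction gives the gap between unknown characters: 100 / 2 = 50 at 98% coverage and 100 / 4 = 25 at 96%.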
What's more, on a given page these could very well be duplicates. Think place names, character names, onomatopoeia, and so on, some of which you have no need to commit to long-term memory.
So that's one thing: the lack of coverage of HSK6 isn't quite as daunting as it seems.
2600 is not 50% of 4400
There's another very simple flaw in the argument.
Let's for a moment assume that the author is correct: you become a fluent reader when you're able to understand all the characters on a page of a novel except one, i.e. you need to know the top 4400 characters in the list.
When the author says that knowing 2600 hanzi gets you halfway, it sounds like you have another 2600 characters to go.
That is just patently, obviously false. You clearly have 4400-2600=1800 left to learn, which is 800 characters fewer than what they suggest.
To look at it differently, you are actually 2600/4400*100 = 59% of the way there.
It doesn't get you halfway. It gets you within 1800 characters, or 59% of the way there!
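For anyone who wants to verify the sums, a small sketch follows. The variable names are mine, and 2600 and 4400 are the figures taken from the original post rather than numbers I have measured.

```python
known = 2600   # approximate HSK 6 character count used in the original post
target = 4400  # characters the original post says fluent reading requires

remaining = target - known        # characters still to learn
progress = known / target * 100   # share of the target already covered

print(remaining)           # 1800
print(f"{progress:.1f}%")  # 59.1%
```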
Comparing Apples with Oranges
Another basic error is that the author is comparing the HSK6 list, which is specifically designed for foreign learners in the modern day, to a different character list altogether.
See the Junda study here and the frequency list for modern characters here.
You can see the mismatch by looking at the 2600th entry in the Junda frequencies. According to that list, you get 98.65% coverage with the 2600 most frequent characters, not 96-98%. That's a much more hopeful picture.
Besides, even the modern Junda list incorporates "imaginative" material. You can't generalise novels to non-fiction reading, especially mainstream material.
I've read the Harry Potter series multiple times in English (my native language), and there are many words there that I never use in any other context, not to mention all of JK Rowling's invented vocabulary. I'd be wrong to compute my English level based on how often I use Harry Potter-type language.
Put all that together, and I'm willing to bet that HSK6 gets you more than 59% of the way there.
A Way Forward
So if this comparison is problematic, and the reality is much more hopeful than the author suggests, how could we go about getting a more accurate percentage?
The way to reach a reasonable comparison would be the following:
- repeat the author's analysis for several character frequency lists (e.g. the ones from chinesetoolbox.com),
- extract several counts for 99.8% coverage and state the minimum and maximum amounts required.
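The two steps above could be sketched roughly as follows. This is only an illustration under my own assumptions: chars_needed is a hypothetical helper, each list is assumed to be reducible to raw occurrence counts per character, and the toy counts are invented for the demo rather than taken from any real frequency list.

```python
from itertools import accumulate

def chars_needed(counts, target_pct):
    """Smallest number of top-ranked characters whose combined
    occurrences reach target_pct percent of the corpus."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    for rank, running in enumerate(accumulate(ordered), start=1):
        if running * 100 >= target_pct * total:
            return rank
    return len(ordered)

# Invented toy counts, NOT real data: one entry per character.
toy_lists = {
    "list_a": [50, 30, 10, 5, 3, 1, 1],
    "list_b": [40, 25, 15, 10, 4, 3, 2, 1],
}

# A real run would use the 99.8 target discussed above; 95 suits the toy data.
needed = {name: chars_needed(counts, 95) for name, counts in toy_lists.items()}
print(min(needed.values()), max(needed.values()))
```

Reporting the minimum and maximum across lists gives a range instead of a single figure that depends on one corpus.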
My View
I'm also skeptical that 99.8% recognition leads to fluent reading, as though it were a threshold to a magical door leading to Narnia.
Let's scale back a bit, so that we recognise 99.5% of characters we come across. That's 995 out of 1000, meaning we don't know 5 out of every 1000, or 1 in every 200. That's a nice, round number.
The characters we don't know will have a context we understand, likely a semantic part, and could easily be repeated several times close by, making them easy to decode. And many of the characters that appear at these frequencies you'll never see again. Let's not measure our level of Chinese by how many of them we know.
Let's see how many we need to get to 99.5% comprehension. I'm skeptical of the JunDa list, but let's run with the author and say it's a good measure.
To reach that level of recognition (1 every 200), you need to know the top 3424 characters in that list. That puts you way above the threshold for the Advanced band of the new HSK, and means HSK 6 gets you 75.9% of the way there.
Hanzi Disease
Chinese is my second foreign language, and I realise there is a certain disease in Chinese-learning communities: people are obsessed with how many characters they know. Shall we call it "Hanzi Disease"?
It's quite strange, really. Do you have any idea how many English words you know? Does it even matter to you? Would knowing an extra 1000 really make a difference to your life?
Listen, I do count my characters. I know roughly how many I know. But at the end of the day, it's a pretty arbitrary measure. It doesn't even measure reading ability very well, because knowledge of a character doesn't mean knowledge of all the vocab that uses it, let alone speaking and listening.
Besides, character counting past a certain point becomes silly. Let's go back to that magic number of 3424 characters.
If you look at the JunDa table, it shows you that the 3424th character only appears 1114 times out of 193,504,018 total characters in the modern sample, meaning it appears roughly once every 173,702 characters.
If a book has 500 characters per page on average, it would need about 347 pages for that character to appear once.
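Here is the calculation spelled out, as a minimal sketch. The counts are the ones quoted above from the JunDa table, and the 500 characters per page is the same rough assumption used earlier in this post.

```python
occurrences = 1114           # times the 3424th character appears in the sample
total_chars = 193_504_018    # total characters in the modern sample
chars_per_page = 500         # assumed average page length

chars_between = total_chars / occurrences        # characters per occurrence
pages_between = chars_between / chars_per_page   # pages per occurrence

print(round(chars_between))  # 173702
print(round(pages_between))  # 347
```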
Why even bother counting characters that are so infrequent? Knowing such infrequent characters isn't going to make a noticeable difference to your level. You might only see them a handful of times in your Chinese journey.
I like having a way to count and measure, but we can't become obsessed with it. The same goes for HSK levels. All of it distracts us from skills, which are what actually matter. You don't rely on your character count or an HSK certificate to communicate with people or read things on the go. You rely on your skills.
Conclusion
Why HSK 6 doesn't get you halfway:
- there are several elementary maths errors involved in the calculation (97.7% doesn't round down; 2600 is not half of 4400),
- the author is comparing apples with oranges,
- 99.8% coverage isn't necessary for reading comprehension,
- hanzi count alone isn't a reliable measure of ability,
- past 99.5% coverage, characters are so infrequent that they'll appear on average once every 347 pages of a standard book.