Article describing four advanced KEY 5.1 features introduced in 2008

Computer-assisted teaching of Pinyin orthography

Peter Leimbigler, Ph.D.

Making it easier for teachers and learners to comply with Pinyin orthography (中文拼音正词法) is a main task for Chinese language teachers and software designers. This article introduces algorithmic, self-correcting software solutions for Pinyin input with non-standard spacing. The focus is on four aspects of Pinyin orthography in a computerized learning environment: 1) spacing rules for modal particles, 2) orthography of four-syllable fixed expressions, 3) number/measure-word combinations, and 4) an option for reflecting tone changes ("sandhi") in the Pinyin rendering.


Compliance with a writing standard is a major effort for native speakers and learners of any language. In the case of Chinese, especially for textbooks and other publications that require a phonetic transcription, the task is even more difficult, because writing the same text in Chinese characters (Hanzi) and in the standard phonetic transcription (Hanyu Pinyin) challenges the writer in totally different ways. In this presentation I focus on the orthography rules for Hanyu Pinyin, in particular on the question of how we can improve implementation of the Pinyin orthography standard that was proclaimed in 1996 as Zhōngwén Pīnyīn zhèngcífǎ jīběn guīzé 中文拼音正词法基本规则 by the China State Bureau of Quality and Technical Supervision of the State Council of China. The original 8-page document with the rules is available online. For the software solutions discussed in this article, the latest (May 2008) release, version 5.1 of the Chinese text system KEY5, is used. The research for our software development, including the linguistic basis for the software algorithms, was mainly based on the works mentioned in the references below, with many contributions in theory and practice from my colleagues worldwide and from my R&D team in Ottawa, Canada.

1. Spacing rules for modal particles

While the general spacing rules of Pinyin orthography are well defined in books like Yin & Felley (1990) and in modern standard dictionaries (like the Xiàndài Hànyǔ Cídiǎn 现代汉语词典, 2002), they are not always implemented in teaching. For instance, when non-standard Pinyin input (such as monosyllabic input, or continuous input without spacing) is used on the computer to teach Chinese, this may facilitate text entry in certain instances, but in the long run it might prove detrimental to the learners' Pinyin orthography skills.

More complicated than the general vocabulary-level spacing rules of Pinyin are the spacing rules on the syntactic level, which might even seem inconsistent at first sight. These syntax-based rules hold a lot of potential for confusion for teachers and learners alike, because they cannot be readily looked up in a dictionary. For example, sooner or later every Chinese language teacher and student is confronted with the Pinyin spacing rules governing modal particles like le 了, zhe 着, and guo 过, which in most cases (but not in all cases!) are supposed to be appended directly to the preceding verb, without a space in between. We can observe the complexity of these spacing rules in some example sentences containing the particle le 了. The following three sentences all comply with the Pinyin spacing rules, but would you, as an author of learning material, write the Pinyin text like this? Or imagine you have to explain to your students the logic behind the Pinyin renderings, in particular the le 了 spacing:

  1. 走进来了两位客人。 Zǒu jìnlái le liǎng wèi kèren.
  2. 来了两位客人。 Láile liǎng wèi kèren.
  3. 客人来了。 Kèren lái le.

Background explanation of the above le 了 spacing:

  1. "Zǒu jìnlái" is a verb + complement construction. The rule says that if the construction is written as two units, then le 了 is written separately from it (Yin & Felley, 1990, pp. 303-304). Therefore the standard way of writing is "Zǒu jìnlái le ..."
  2. The tense-marking particle le 了 is ordinarily written as one unit with the verb it follows (Yin & Felley, 1990, p. 276). Therefore the standard way of writing is "Láile ..."
  3. In this sentence, the standard defines le 了 not as a tense-marking particle but as a mood-marking particle at the end of a sentence, and sets out the following rule: the particle le 了, appearing at the end of a sentence or clause, is written by itself (Yin & Felley, 1990, p. 278). Therefore the standard way of writing is "... lái le."
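The three rules above can be written down as a small decision function. The following is only a minimal sketch, not the algorithm used in KEY: the function name `attach_le` and its parameters are hypothetical, and it assumes that the caller already knows whether the verb phrase is written as one or two units and whether le 了 stands at the end of a sentence or clause.

```python
def attach_le(verb_units, sentence_final):
    """Return the Pinyin of a verb phrase followed by the particle le.

    verb_units     -- list of Pinyin units of the verb phrase, e.g. ["lái"]
                      or ["zǒu", "jìnlái"] (hypothetical input format)
    sentence_final -- True if le ends the sentence or clause
    """
    verb = " ".join(verb_units)
    if sentence_final:
        # Rule 3: mood-marking le at the end of a sentence is written by itself.
        return verb + " le"
    if len(verb_units) > 1:
        # Rule 1: verb + complement written as two units, so le stays separate.
        return verb + " le"
    # Rule 2: tense-marking le is written as one unit with the verb.
    return verb + "le"


print(attach_le(["zǒu", "jìnlái"], sentence_final=False))
print(attach_le(["lái"], sentence_final=False))
print(attach_le(["lái"], sentence_final=True))
```

The hard part in practice is deciding which of the three cases applies, which requires syntactic analysis of the sentence; the sketch deliberately leaves that step to the caller.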

To facilitate compliance through software algorithms, in the interest of those who are writing and learning Chinese on the computer, two questions come to mind:

  1. In a Chinese software system with Pinyin entry, can we make Pinyin input with non-standard spacing convert correctly to the Chinese-character version?
  2. Can we, through back-conversion from Hànzì 汉字 to Pinyin or in two-line "Hanzi with Pinyin" mode, show or teach the writer/student the standard Pinyin orthography?

The following examples show that the answer to both questions is affirmative, as our team has just (2008) implemented the respective self-correcting algorithms.

  1. Non-standard input "zoujinlaile liang wei keren." converts correctly into 走进来了两位客人。 This back-converts to the standard form "Zǒu jìnlái le liǎng wèi kèren.", thus providing the orthography teaching effect.
  2. Non-standard input "lai le liang wei keren." converts correctly into 来了两位客人。 This back-converts to the standard form "Láile liǎng wèi kèren.", thus providing the orthography teaching effect.
  3. Non-standard input "keren laile." converts correctly into 客人来了。 This back-converts to the standard form "Kèren lái le.", thus providing the orthography teaching effect.
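The principle behind these three examples, that spacing in the input is ignored for conversion while the standard spacing is supplied on back-conversion, can be illustrated with a toy lookup. This is a sketch under strong assumptions and not the KEY implementation: the table `LEXICON` and the functions `normalize`, `to_hanzi`, and `to_standard_pinyin` are hypothetical, and whole sentences are stored as entries only to keep the example short. A real system would segment the input into words and apply grammatical rules.

```python
import unicodedata

# Hypothetical toy table: spacing-free, toneless key -> (Hanzi, standard Pinyin).
LEXICON = {
    "zoujinlaileliangweikeren": ("走进来了两位客人。", "Zǒu jìnlái le liǎng wèi kèren."),
    "laileliangweikeren": ("来了两位客人。", "Láile liǎng wèi kèren."),
    "kerenlaile": ("客人来了。", "Kèren lái le."),
}


def normalize(pinyin):
    """Reduce Pinyin input to a key without spaces, punctuation, case, or tone marks."""
    decomposed = unicodedata.normalize("NFD", pinyin.lower())
    # After NFD, tone marks are separate combining characters, which are not
    # alphabetic, so keeping only alphabetic characters drops them as well.
    return "".join(ch for ch in decomposed if ch.isalpha())


def to_hanzi(pinyin):
    """Convert Pinyin input with any spacing to Hanzi (None if unknown)."""
    entry = LEXICON.get(normalize(pinyin))
    return entry[0] if entry else None


def to_standard_pinyin(pinyin):
    """Return the standard-orthography Pinyin for the same input (None if unknown)."""
    entry = LEXICON.get(normalize(pinyin))
    return entry[1] if entry else None


for typed in ["zoujinlaile liang wei keren.", "lai le liang wei keren.", "keren laile."]:
    print(typed, "->", to_hanzi(typed), "->", to_standard_pinyin(typed))
```

Because the lookup key discards spacing, differently spaced inputs of the same sentence reach the same entry, and the learner always sees the standard form on back-conversion.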

2. Orthography of four-syllable fixed expressions

A further problem area in Hanyu Pinyin orthography is the Pinyin rendering of chéngyǔ 成语 "fixed idioms". These are set four-character expressions, and despite the standardization efforts and detailed hyphenation/spacing rules (Yin & Felley, 1990, pp. 457-489), we find many inconsistencies in spacing and in the use of the hyphen for such expressions in current dictionaries.

We observe these inconsistencies in a large number of idioms. As one of thousands of similar borderline cases, we take a closer look at the idiom hè lì jīqún 鹤立鸡群 "stand like a crane in a flock of chickens" (stand out from the crowd, be exceptional). Like many four-character idioms, the expression has a wényán 文言 (classical Chinese) infrastructure, according to which the logical semantic grouping should be either "hè lì jīqún" or "hè lì jī qún". But we find three different ways of writing it. Which one is right?

  1. In the Xiàndài Hànyǔ Cídiǎn 现代汉语词典 (2002) we find the Pinyin form "hè lì jī qún", which reflects the wényán 文言 infrastructure;
  2. In the Xīn Shídài Hàn-Yīng Dà Cídiǎn 新时代汉英大词典 (2001) we find the Pinyin version "hèlì-jīqún", which is obviously inspired by the hyphenation rules;
  3. In the ABC Hàn-Yīng Dà Cídiǎn 汉英大词典 (2003) this idiom is rendered in Pinyin as one long string, "hèlìjīqún", as the editors did not see enough evidence of symmetry.

In such cases, the software solution we implemented combines two different approaches. For input purposes, any combination of the four syllables (with or without spaces or hyphens) converts correctly to 鹤立鸡群. In back-conversion from Hànzì 汉字 to Pinyin, or in two-line "Hanzi with Pinyin" mode, 鹤立鸡群 by default back-converts to the version with the best semantic segmentation for an infrastructure-based understanding, "hè lì jīqún". However, if the student has set the system to the "standard of 1996", which suggests rendering non-symmetrical, non-hyphenated expressions as one string, the back-conversion result will be "hèlìjīqún".
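The two approaches can be sketched as a tolerant input key combined with a mode-dependent output form. The following is an illustration only, not the KEY code: the table `IDIOMS`, the mode names `"semantic"` and `"standard_1996"`, and the function names are hypothetical, and the input is assumed to be typed without tone marks.

```python
import re

# Hypothetical idiom table: one entry per idiom, keyed by its toneless,
# unspaced, unhyphenated Pinyin.
IDIOMS = {
    "helijiqun": {
        "hanzi": "鹤立鸡群",
        "semantic": "hè lì jīqún",     # default: best semantic segmentation
        "standard_1996": "hèlìjīqún",  # one string, per the 1996 standard setting
    },
}


def idiom_key(typed):
    """Strip spaces and hyphens so that every spacing variant yields the same key."""
    return re.sub(r"[\s\-]+", "", typed.lower())


def idiom_to_hanzi(typed):
    """Convert any spacing/hyphenation variant of an idiom to Hanzi (None if unknown)."""
    entry = IDIOMS.get(idiom_key(typed))
    return entry["hanzi"] if entry else None


def idiom_to_pinyin(hanzi, mode="semantic"):
    """Back-convert an idiom to Pinyin in the selected mode (None if unknown)."""
    for entry in IDIOMS.values():
        if entry["hanzi"] == hanzi:
            return entry[mode]
    return None


for typed in ["he li ji qun", "heli-jiqun", "helijiqun"]:
    print(typed, "->", idiom_to_hanzi(typed))
print(idiom_to_pinyin("鹤立鸡群"))
print(idiom_to_pinyin("鹤立鸡群", mode="standard_1996"))
```

Keeping the accepted input forms separate from the displayed form lets the writer type freely while the system consistently shows one rendering per setting.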

continued under the next link:

"number/measure-word combinations with automatic translation"