Facilitating compliance with Pinyin orthography (中文拼音正词法) for teachers and learners is a main task for Chinese language teachers and software designers. This article introduces algorithmic, self-correcting software solutions for Pinyin input with non-standard spacing. The focus is on four aspects of Pinyin orthography in a computerized learning environment: 1) spacing rules for modal particles, 2) orthography of four-syllable fixed expressions, 3) number/measure-word combinations, and 4) an option for reflecting tone changes ("sandhi") in the Pinyin rendering.
Compliance with the standard of writing, both in Hanzi and in Pinyin, is a major effort for native speakers and learners of any language. In the case of Chinese, especially when textbooks and other publications requiring a phonetic transcription are involved, the task is even more difficult, as writing the same text in Chinese characters (Hanzi) and in the standard phonetic transcription (Hanyu Pinyin) challenges the writer in totally different ways. In this presentation I focus on the orthography rules for Hanyu Pinyin, in particular on the question of how we can improve implementation of the Pinyin orthography standard that was proclaimed in 1996 as Zhōngwén Pīnyīn zhèngcífǎ jīběn guīzé 中文拼音正词法基本规则 by the China State Bureau of Quality and Technical Supervision of the State Council of China. The original 8-page document with the rules is available at http://www.china-language.gov.cn/gfbz/shanghi/025.htm. For the software solutions discussed in this article, the latest (May 2008) release, version 5.1 of the Chinese text system KEY5 at http://www.cjkware.com, is used. The research for our software development, including the linguistic basis for the software algorithms, was mainly based on the works mentioned in the references below, with many contributions in theory and practice from my colleagues worldwide and from my R&D team in Ottawa, Canada.
1. Spacing rules for modal particles
While the general spacing rules of Pinyin orthography are well defined in books like Yin & Felley (1990) and in modern standard dictionaries (like the Xiàndài Hànyǔ Cídiǎn 现代汉语词典, 2002), they are not always implemented in teaching. For instance, when non-standard Pinyin input is used on the computer to teach Chinese – such as monosyllabic input, or continuous input without spacing – this may facilitate text entry in certain instances, but in the long run might prove detrimental to the learners’ Pinyin orthography skills.
More complicated than the general vocabulary-level spacing rules of Pinyin are the spacing rules on the syntactic level, which might even seem inconsistent at first sight. These syntax-based rules carry a lot of potential for confusion for teachers and learners alike, because they cannot be readily looked up in a dictionary. For example, sooner or later every Chinese language teacher and student is confronted with the Pinyin spacing rules governing modal particles like le 了, zhe 着, and guo 过, which in most cases (but not in all cases!) are supposed to be appended directly to the preceding verb, without a space in between. We can observe the complexity of these spacing rules in some example sentences containing the particle le 了. The following three sentences all comply with the Pinyin spacing rules, but would you, as an author of learning material, write the Pinyin text like this? Or imagine you have to explain to your students the logic behind the Pinyin renderings, in particular the le 了 spacing:
Background explanation of the above le 了 spacing:
To facilitate compliance through software algorithms, in the interest of those who are writing and learning Chinese on the computer, two questions come to mind:
2. Orthography of four-syllable fixed expressions
A further problem area in Hanyu Pinyin orthography is the Pinyin rendering of chéngyǔ 成语 "fixed idioms". These are set four-character expressions, and despite the standardization efforts and detailed hyphenation/spacing rules (Yin & Felley 1990, pp. 457-489), we find many inconsistencies in spacing and in the use of the hyphen in such expressions in current dictionaries.
We observe these inconsistencies in a large number of idioms. As one of thousands of similar borderline cases, we take a closer look at the idiom hè lì jīqún 鹤立鸡群 "crane-like stand in a flock of chickens" (stand out from the crowd, be exceptional). Like many four-character idioms, the expression has a wényán 文言 (classical Chinese) infrastructure, according to which the logical semantic groupings should be either "hè lì jīqún" or "hè lì jī qún". But we find three different ways of writing it. Which one is right?
In such cases, the software solution we implemented combines two different approaches. For input purposes, any combination of the four syllables (with or without spaces or hyphens) converts correctly to 鹤立鸡群. In back-conversion from Hànzì 汉字 to Pinyin, or in two-line "Hanzi with Pinyin" mode, 鹤立鸡群 by default back-converts to the version with the best semantic segmentation for an infrastructure-based understanding, "hè lì jīqún". However, if the student has set the system to the "standard of 1996", which suggests rendering non-symmetrical, non-hyphenated expressions as one string, the back-conversion result will be "hèlìjīqún".
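The two approaches can be illustrated with a minimal Python sketch. It is not the KEY5 implementation: the one-entry lexicon, the function names (normalize_key, to_hanzi, to_pinyin) and the standard_1996 flag are all hypothetical, and the sketch makes no attempt to test whether an expression is symmetrical. It only assumes that each idiom is stored with its default semantic segmentation and that stored forms containing a hyphen are left unchanged.

```python
import re
import unicodedata

# Hypothetical one-entry lexicon: Hanzi -> default semantic segmentation.
# A real system would hold thousands of idioms; this entry comes from the text.
LEXICON = {
    "鹤立鸡群": "hè lì jīqún",
}


def normalize_key(pinyin: str) -> str:
    """Reduce any spacing/hyphenation variant to one lookup key."""
    # NFC makes precomposed and combining tone marks compare equal.
    text = unicodedata.normalize("NFC", pinyin).lower()
    # Remove all whitespace and hyphens between syllables.
    return re.sub(r"[\s\-]+", "", text)


# Index built once, so that input lookup ignores spacing and hyphens.
INPUT_INDEX = {normalize_key(p): h for h, p in LEXICON.items()}


def to_hanzi(pinyin: str):
    """Input direction: return the Hanzi for any spacing variant, or None."""
    return INPUT_INDEX.get(normalize_key(pinyin))


def to_pinyin(hanzi: str, standard_1996: bool = False) -> str:
    """Back-conversion: default segmentation, or one string in 1996 mode."""
    segmented = LEXICON[hanzi]
    # Assumption: hyphenated stored forms are not joined in 1996 mode.
    if standard_1996 and "-" not in segmented:
        return segmented.replace(" ", "")
    return segmented
```

With this sketch, to_hanzi("hè lì jī qún"), to_hanzi("hè-lì-jī-qún") and to_hanzi("hèlìjīqún") all return 鹤立鸡群, while to_pinyin("鹤立鸡群") returns "hè lì jīqún" and to_pinyin("鹤立鸡群", standard_1996=True) returns "hèlìjīqún". Keeping the input key separate from the stored rendering is what lets tolerant input coexist with a single, consistent output form.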