I have a file, file1.txt, containing text in English, Chinese, Japanese, and Korean. For use in ConTeXt, I need to mark each region of text in the file according to its language, except for English, and write the result to a new file. For example, here is a sample line:
The 恐龙 ate 鱼.
Because it contains Chinese characters, it should be marked like this:
The \language[cn]{恐龙} ate \language[cn]{鱼}.
- The document is saved as UTF-8.
- Text in Chinese should be marked \language[cn]{*}.
- Text in Japanese should be marked \language[ja]{*}.
- Text in Korean should be marked \language[ko]{*}.
- The content never continues from one line to the next.
- If the code is ever in doubt about whether something is Chinese, Japanese, or Korean, it is best if it defaults to Chinese.
How can I mark the text according to the language present?
A crude algorithm:
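As a minimal sketch of what such a crude approach could look like, here in Python: find each run of CJK characters, call it Korean if it contains Hangul, Japanese if it contains kana, and Chinese otherwise. The character ranges are simplified assumptions rather than a complete list of Unicode scripts, and the names `tag_line` and `tag_file` are made up for this illustration.

```python
import re

# Simplified, non-exhaustive Unicode ranges (an assumption of this sketch).
HANGUL = "\u1100-\u11ff\u3130-\u318f\uac00-\ud7af"  # Jamo, compatibility Jamo, syllables
KANA = "\u3040-\u30ff"                              # Hiragana and Katakana blocks
HAN = "\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff"     # CJK ideographs (BMP only)

CJK_RUN = re.compile(f"[{HANGUL}{KANA}{HAN}]+")
HAS_HANGUL = re.compile(f"[{HANGUL}]")
HAS_KANA = re.compile(f"[{KANA}]")


def tag_line(line: str) -> str:
    """Wrap each run of CJK characters in a ConTeXt language command."""
    def wrap(match):
        run = match.group(0)
        if HAS_HANGUL.search(run):
            lang = "ko"
        elif HAS_KANA.search(run):
            lang = "ja"
        else:
            lang = "cn"  # Han-only runs default to Chinese
        return "\\language[" + lang + "]{" + run + "}"
    return CJK_RUN.sub(wrap, line)


def tag_file(src: str, dst: str) -> None:
    """Tag src line by line and write the result to dst, both as UTF-8."""
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(tag_line(line))
```

With this sketch, `tag_line("The 恐龙 ate 鱼.")` returns `The \language[cn]{恐龙} ate \language[cn]{鱼}.`, and `tag_file("file1.txt", "file2.txt")` would process the whole file (the output name `file2.txt` is only a placeholder). Because each line is handled independently, this relies on the guarantee that content never continues from one line to the next.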
(Also see "Detect chinese character using perl?")
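There are problems with that. As Daenyth comments, 恐竜 (Japanese for "dinosaur", written entirely in kanji) is misidentified as Chinese, because Han characters alone do not say which language they belong to. I find it unlikely that you are really working with mixed English-CJK text; more likely you are just giving bad example text. Perform a lexical analysis first to differentiate Chinese from Japanese.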
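A real lexical analysis would need a dictionary or a morphological analyzer, which is beyond a short example. As a much smaller stand-in, the sketch below checks Han-only runs against a tiny, hand-picked set of character forms that are standard in Japanese but not normally used in modern Chinese text. The set `JAPANESE_ONLY` and the function names are hypothetical, and the list is illustrative, nowhere near complete.

```python
import re

# Illustrative only: a few Japanese shinjitai forms that differ from both
# the traditional and the simplified Chinese forms of the same character.
JAPANESE_ONLY = set("竜桜駅広気図")

# Simplified range for Han ideographs (BMP only), an assumption of this sketch.
HAN_RUN = re.compile("[\u3400-\u4dbf\u4e00-\u9fff]+")


def guess_han_language(run: str) -> str:
    """Return 'ja' if the run contains a Japanese-only form, else 'cn'."""
    if any(ch in JAPANESE_ONLY for ch in run):
        return "ja"
    return "cn"  # when in doubt, default to Chinese


def tag_han_runs(line: str) -> str:
    """Wrap each Han-only run with the guessed language."""
    def wrap(match):
        run = match.group(0)
        return "\\language[" + guess_han_language(run) + "]{" + run + "}"
    return HAN_RUN.sub(wrap, line)
```

Here `guess_han_language("恐竜")` returns `"ja"` while `guess_han_language("恐龙")` returns `"cn"`. Words written with characters common to both languages would still fall through to Chinese, which is why a proper lexicon is the more robust option.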