HKSCS 2004 conversion script
This Perl script has a single, simple function so far: converting Hong Kong characters in text or HTML to the newest version of the standard (HKSCS-2004).
In the previous ISO 10646 standard (ISO 10646:2001), many Hong Kong characters were allocated in the Private Use Area (PUA) for compatibility. In the ISO 10646:2003 standard, which corresponds to Unicode 4.1, they were moved to the Supplementary Ideographic Plane (SIP, U+20000 - U+2FFFF) or to CJK Extension A/B.
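The move out of the PUA amounts to a lookup from one code point to another. The following is a minimal sketch of that idea in Python, not the script's own Perl code. The names `PLACEHOLDER_TABLE`, `is_pua`, `convert_pua` and `remaining_pua` are hypothetical, and the two table entries are placeholders rather than real HKSCS pairs; a real table would have to be built from the published HKSCS-2004 mapping data.

```python
# Placeholder table: these pairs are NOT real HKSCS mappings. They only show
# the shape of the data (PUA code point -> non-PUA code point).
PLACEHOLDER_TABLE = {
    0xE000: 0x20000,  # hypothetical: BMP PUA -> a code point in the SIP
    0xE001: 0x3400,   # hypothetical: BMP PUA -> a code point in CJK Extension A
}


def is_pua(ch):
    """Return True if ch lies in one of Unicode's three private use ranges."""
    cp = ord(ch)
    return (0xE000 <= cp <= 0xF8FF           # BMP Private Use Area
            or 0xF0000 <= cp <= 0xFFFFD      # plane 15 private use
            or 0x100000 <= cp <= 0x10FFFD)   # plane 16 private use


def convert_pua(text, table):
    """Replace every code point that has a table entry; leave the rest alone."""
    return text.translate(table)


def remaining_pua(text):
    """List PUA characters still present, i.e. those without a table entry."""
    return [ch for ch in text if is_pua(ch)]
```

Checking the output with something like `remaining_pua` may be useful in practice, since any PUA character left after conversion has no entry in the table being used.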
The main function of this script is to convert PUA characters to their non-PUA equivalents. It has two operation modes:
- Takes a text or HTML/XML file encoded in either Big5-HKSCS or UTF-8 and outputs a UTF-8 encoded file.
- Reads standard input and prints the result on standard output, i.e. acts as a filter.
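The two modes can be sketched roughly as follows. This is an illustration in Python, not the script's actual Perl code or command-line interface. The function names, the argument order (input path, then output path) and the "try UTF-8 first, then fall back to Big5-HKSCS" guess are all assumptions; the source does not say how the script detects the input encoding.

```python
import sys


def decode_input(raw):
    """Assumed detection rule: try UTF-8, then fall back to Big5-HKSCS."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("big5hkscs")


def run(argv, convert, stdin=None, stdout=None):
    """File mode when two paths are given, filter mode otherwise.

    convert is any str -> str function (for example a PUA converter).
    Output is always written as UTF-8.
    """
    if len(argv) >= 2:
        # File mode: argv[0] is the input path, argv[1] the output path.
        with open(argv[0], "rb") as f:
            text = decode_input(f.read())
        with open(argv[1], "wb") as f:
            f.write(convert(text).encode("utf-8"))
    else:
        # Filter mode: bytes from standard input, UTF-8 to standard output.
        if stdin is None:
            stdin = sys.stdin.buffer
        if stdout is None:
            stdout = sys.stdout.buffer
        stdout.write(convert(decode_input(stdin.read())).encode("utf-8"))

# Possible entry point: run(sys.argv[1:], some_converter)
```

Working on bytes and decoding explicitly keeps the sketch independent of the terminal's locale, which matters when the input may be Big5-HKSCS.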
Besides converting characters, it can also convert HTML entities (e.g. a numeric character reference such as &#x4E00;, which denotes 一).
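Entity conversion can be pictured as the same table lookup applied to the number inside each numeric character reference. Below is a minimal Python sketch under stated assumptions: `convert_refs` is a hypothetical name, the table passed in is a placeholder, and emitting a hexadecimal reference for the new code point is an assumption (the script may equally well emit the character itself).

```python
import re

# Matches decimal (&#57344;) and hexadecimal (&#xE000;) character references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def convert_refs(html, table):
    """Rewrite numeric references whose code point has an entry in table."""
    def repl(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        new_cp = table.get(cp)
        if new_cp is None:
            return match.group(0)      # not in the table: leave untouched
        return "&#x%X;" % new_cp       # assumed output form: hex reference
    return NUMERIC_REF.sub(repl, html)
```

Named entities such as `&amp;` do not match the pattern and pass through unchanged.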
Known problem:
- When converting an HTML/XML file, it does not recognize the HTML meta tag or the XML encoding declaration, so it does not update them to match the UTF-8 output.
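Until the script handles this itself, the declared encoding could be fixed in a separate pass. The following Python sketch shows one possible approach and is not part of the script. `declare_utf8` and both patterns are hypothetical, and regular expressions like these only cover the common spellings of the two declarations, not every form a parser would accept.

```python
import re

# <?xml version="1.0" encoding="..."?>
XML_DECL = re.compile(r'(<\?xml[^>]*?encoding\s*=\s*["\'])[^"\']*(["\'])')
# <meta charset="..."> and <meta ... content="text/html; charset=...">
META_CHARSET = re.compile(r'(<meta[^>]*?charset\s*=\s*["\']?)[\w.:-]+',
                          re.IGNORECASE)


def declare_utf8(markup):
    """Point the first XML declaration and meta charset at UTF-8."""
    markup = XML_DECL.sub(r"\1UTF-8\2", markup, count=1)
    return META_CHARSET.sub(r"\1UTF-8", markup, count=1)
```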
Todo:
- Recognize RTF and OpenDocument files.
- Convert a whole directory recursively, not just a single file.










