HKSCS 2004 conversion script

This Perl script has a single, simple function so far: converting Hong Kong characters in text or HTML to the newest version of the standard (HKSCS-2004).

In the previous ISO 10646 standard (ISO 10646:2001), many Hong Kong characters were allocated in the Private Use Area (PUA) for compatibility. In ISO 10646:2003, which corresponds to Unicode 4.1, they were moved to the Supplementary Ideographic Plane (SIP, U+20000 - U+2FFFF) or to CJK Extension A/B.

The main job of this script is to convert PUA characters to their non-PUA equivalents. It has two modes of operation:

  1. Takes a text file or HTML/XML file in either Big5-HKSCS or UTF-8 encoding, and outputs a UTF-8 encoded file.
  2. Reads standard input and prints the result on standard output, i.e. acts as a filter.
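
To make the two modes concrete, here is a minimal sketch in Python (not the script's own Perl code). The table entries are placeholders and are NOT real HKSCS-2004 data; a real table would be loaded from the official mapping file. The function names and the "try UTF-8 first, then fall back to Big5-HKSCS" detection are assumptions made for illustration only.

```python
# Hypothetical placeholder table: PUA code point -> replacement code point.
# These entries are NOT real HKSCS-2004 mappings.
PUA_TO_STANDARD = {
    0xE000: 0x20000,  # placeholder: PUA -> a code point in the SIP
    0xE001: 0x3400,   # placeholder: PUA -> a code point in CJK Extension A
}


def decode_input(raw):
    """Decode bytes as UTF-8, falling back to Big5-HKSCS (assumed detection)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("big5hkscs")


def convert_pua(text, table=PUA_TO_STANDARD):
    """Replace every mapped PUA character and leave everything else untouched."""
    return text.translate(table)


def run_filter(stdin, stdout):
    """Filter mode: read bytes from stdin, write converted UTF-8 to stdout."""
    text = decode_input(stdin.read())
    stdout.write(convert_pua(text).encode("utf-8"))
```

In filter mode this would be called as `run_filter(sys.stdin.buffer, sys.stdout.buffer)`; file mode would open the input and output files in binary mode and pass those instead.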

Besides converting characters, it can also convert HTML entities (e.g. the numeric character reference for 一).
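
Entity conversion could work roughly as in the Python sketch below. It assumes that "HTML entities" means numeric character references, and that a converted reference keeps its decimal or hexadecimal form; the post does not say either way. The table entry is a placeholder, not real HKSCS-2004 data, and `convert_entities` is a hypothetical name.

```python
import re

# Hypothetical placeholder table, NOT real HKSCS-2004 data.
PUA_TO_STANDARD = {0xE000: 0x20000}

# Matches hexadecimal (&#xE000;) and decimal (&#57344;) character references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def convert_entities(html, table=PUA_TO_STANDARD):
    """Rewrite numeric character references that point at mapped PUA code points."""
    def replace(match):
        is_hex = match.group(1) is not None
        code_point = int(match.group(1), 16) if is_hex else int(match.group(2))
        new_code_point = table.get(code_point)
        if new_code_point is None:
            return match.group(0)  # not a mapped PUA reference: keep as-is
        if is_hex:
            return f"&#x{new_code_point:X};"
        return f"&#{new_code_point};"

    return _NCR.sub(replace, html)
```

References to code points that are not in the table, such as `&#x4E00;`, pass through unchanged.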

Known problem:

  • When converting an HTML/XML file, it doesn’t recognize the HTML meta tag or the XML encoding attribute, so it does not update them to match the UTF-8 output.
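
One possible way to address this, sketched in Python under stated assumptions: rewrite the first XML encoding declaration and the first meta charset found in the document so that they say UTF-8. This is not something the script does today; `declare_utf8` is a hypothetical helper, and the regular expressions only cover the common spellings of these declarations.

```python
import re

# <?xml version="1.0" encoding="..."?>
_XML_DECL = re.compile(r'(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2', re.IGNORECASE)

# <meta ... charset=...> in both the http-equiv form and the short HTML5 form.
_META = re.compile(r'(<meta[^>]*?charset=)(["\']?)[A-Za-z0-9_\-]+', re.IGNORECASE)


def declare_utf8(markup):
    """Point the first XML declaration and the first meta charset at UTF-8."""
    markup = _XML_DECL.sub(r"\1\2UTF-8\2", markup, count=1)
    markup = _META.sub(r"\1\2UTF-8", markup, count=1)
    return markup
```

A regular expression is a shortcut here. A real fix would more likely parse the document, since declarations can appear in comments or be formatted in unusual ways.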

Todo:

  • Recognize RTF and OpenDocument files.
  • Convert a whole directory recursively, not just a single file.
