Posts tagged ‘CJK’

Fedora 6 release notes

2008-01-16

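While looking for a Fedora CD download, I searched for the release notes. Perhaps out of boredom, and partly by accident, I ended up reading the release notes of an old version. I would have been better off not reading them: they brought back some of the feelings I had back then. Mainly this passage: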

Read the rest of content »

WP-GownFull

2007-07-17

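Last month Ben Lau told me about a project called GownFull (功夫輸入法). One interesting thing about it is that it is run by two people from Hong Kong. Leaving the people aside, the project itself is a Chinese input method (it can handle other languages too) that works inside a web page. No input method needs to be installed on the system; a browser is all you need. In other words, you can type Chinese in the browser even on an English Windows. Soon after seeing the demo I wondered whether it could be used inside WordPress, so I wrote a little something.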

[Screenshot: GownFull in use in the WordPress comment field]

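It did not take much time. Most of it went into working out how GownFull reads its settings, and into thinking about how to simplify those settings and store them in the WordPress database. GownFull’s original configuration method is rather primitive: you have to edit config.php by hand. That is no longer needed, and you can now choose which input methods to enable from the WordPress admin interface. For now the input method is enabled only in the blog comment form; I will consider enabling it in the post editing interface later.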

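I actually can’t think of any more features to add. The most troublesome part from now on will probably be keeping up with GownFull’s source code: in just one month a mimeTeX input method has been added, so their development pace is not slow. I would also like to update its Cangjie (倉頡) and Quick (速成) code tables: many characters seem to be mapped to the PUA (Private Use Area), which is bad for Unicode adoption.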

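If you are interested in trying it, leave a comment here or ask me privately, because I don’t want Google to find my subversion repository yet. I have also installed it on my own site (look at the top-left corner when writing a comment). I was supposed to register for a subversion account at WordPress, but WordPress is gradually showing signs of commercialization: anything that doesn’t make money gets put off as long as possible, and some people have had no reply for a month. Maybe I will just open a project on Google Code.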

pros and cons of ABC notation

2007-02-25

Basically, existing implementations of ABC music notation are almost usable. There are still some critical problems that prevent them from being production-ready, but they are already much better than GUIDO notation; or, to put it more accurately, it is the rendering that matters. One picture is better than a thousand words:

Read the rest of content »

Multibyte character? No, not considered.

2007-01-13

Lately I’ve been struggling with one of the Recent Comments plugins for WordPress. As the title says, it belongs to the vast category of software that does not support multi-byte characters. It is not especially bad; it is just that this plugin is too visible on my blog. All software that does not support multi-byte strings is equally evil or ignorant.

This plugin has a slight advantage over other similar but simpler plugins: it allows breaking long ‘words’ (like URLs), so the blog layout isn’t damaged by an extremely long URL. However, it makes use of PHP functions that are not multibyte-safe, like substr(), strlen(), and especially wordwrap(). The first two at least have multibyte-safe equivalents (mb_substr() and mb_strlen()); wordwrap() has none. The net result is that some comments get a line break inserted in the middle of a multibyte character!
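
To make the failure concrete, here is a minimal sketch in Python (standing in for PHP; the helper names are hypothetical and are not taken from the plugin). It cuts the same UTF-8 comment once by bytes, which is how a byte-oriented function sees it, and once by characters:

```python
# Illustration only: Python stands in for PHP here, and byte_wrap/char_wrap
# are hypothetical helpers, not functions from the plugin.

def byte_wrap(text, width):
    """Cut the UTF-8 encoding of `text` every `width` bytes, the way a
    byte-oriented string function sees the data."""
    data = text.encode("utf-8")
    return [data[i:i + width] for i in range(0, len(data), width)]


def char_wrap(text, width):
    """Cut `text` every `width` characters (code points) instead."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def is_valid_utf8(chunk):
    """Return True if `chunk` decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


comment = "中文留言"                  # 4 characters, 3 bytes each in UTF-8
byte_pieces = byte_wrap(comment, 4)   # every cut falls inside a character
char_pieces = char_wrap(comment, 2)   # every cut falls between characters

print([is_valid_utf8(p) for p in byte_pieces])
print(char_pieces)
```

A break inserted at any of the byte cuts leaves broken sequences on both sides, which is what shows up as garbage in the rendered comment; the character-based cuts never do.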

The most obvious thing to do is to replace wordwrap() with something saner. After attempting something stupid (like trying to write my own function, which could only end in giving up), it was time to turn to my savior (read: Google) for help. Finally, this htmlwrap() script written by Brian Huisman caught my attention. Quoting its author:

Built for use in the Orca Forum and Blog, the htmlwrap() function safely wraps HTML formatted text by breaking strings of characters over a certain length. It’s great for use anywhere where generated HTML output is built from user input.

A BIG plus: it is UTF-8 safe! So I simply replaced all instances of wordwrap() with htmlwrap(), and half of the problem was solved. The remaining half is actually two problems:

  1. While the plugin claims it can chop off the comment after a certain number of characters, it actually means that many bytes minus the length of the comment author’s name. That’s certainly not what a blog admin would expect, though I doubt many people would really count the characters.
  2. A ‘word’ is only defined as a bunch of bytes separated by spaces. However, the word-wrap rule is vastly different for Asian languages, especially CJK: no space is inserted between characters at all; a whole sentence, or a whole paragraph, can contain zero white space. Line breaks can occur before any character except (most) punctuation marks.

Checking for punctuation to avoid bad line breaks is more time-consuming, but the rest tends to be easy. Thus the remaining part consists of replacing string functions with their multibyte-safe equivalents, plus some preg_match() calls to make sure that only an incomplete English word at the end of the text is trimmed. Here is the comparison before and after modification:

Before: [screenshot: WordPress comment with bad line breaking]
After: [screenshot: WordPress comment with good line breaking]
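
The trimming step described above can be sketched roughly as follows. This is a Python illustration, not the modified plugin’s PHP code; the function name, the limit and the regular expressions are assumptions made for the example. The idea is to count characters rather than bytes, and to drop a trailing fragment only when the cut lands inside an English word:

```python
import re

# Hypothetical helper: the name, the limit and the patterns are assumptions
# for illustration, not the plugin's actual code.
_TRAILING_LATIN_WORD = re.compile(r"[A-Za-z0-9]+\Z")
_LATIN_CHAR = re.compile(r"[A-Za-z0-9]")


def excerpt(text, limit):
    """Return at most `limit` characters of `text`, counted as code points.

    If the cut lands inside a run of ASCII letters or digits, that partial
    word is dropped. A cut between CJK characters is left alone, because an
    excerpt may end after almost any CJK character.
    """
    if len(text) <= limit:
        return text
    head = text[:limit]
    # Only trim when the cut really splits a Latin word in two.
    if _LATIN_CHAR.match(text[limit]):
        match = _TRAILING_LATIN_WORD.search(head)
        # Keep the chopped head if trimming would leave nothing at all.
        if match and match.start() > 0:
            head = head[:match.start()]
    return head.rstrip()


print(excerpt("這是一個測試留言", 4))
print(excerpt("hello wonderful world", 10))
print(excerpt("留言 testing", 6))
```

Punctuation is deliberately ignored here, matching the post’s remark that checking for it costs more effort than the rest.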

However, I have assumed that people are using UTF-8 encoding. Most people should already be using it, but still, it is better to warn now than to regret later. I’m not sure whether submitting this change upstream is a good idea, since it is not generic enough to cope with any multi-byte encoding or any language; so far it only handles CJK comments in UTF-8.

Anyway, if you want to try, save the content of this file and rename the file to get-recent-comments.php. Place this file into the WordPress plugin directory and pray, er, have fun!

CJK font testing

2006-08-14

These days I’ve been spending lots of time testing CJK (Chinese, Japanese, Korean) fonts in browsers and on the Linux desktop, especially the mystery of how fontconfig chooses which font to use for each glyph. CJK fonts are notorious for fighting each other when the system searches for a font to display CJK unified glyphs. Since Keith Packard (author of fontconfig) has not provided any in-depth explanation of how it works, everybody has resorted to guesswork. (Not even when asked directly by email: his usual reply is that it automagically works, which is no useful information at all.)

I haven’t achieved my goals yet. They are:

  1. Pick font according to language: for Chinese, Japanese and Korean web pages, if the lang attribute is specified, then the corresponding font for each language is used. Do NOT always override all other fonts with Chinese ones.
  2. Pick uniformly: for all non-CJK web pages, don’t pick a random font for each glyph individually. Always prefer a single Chinese font (because I’m in a Chinese environment), unless some glyphs are Japanese or Korean Han characters. In that case…
  3. Mix and match: attempt to match the font faces of each language properly based on their stroke style. For example, uming (“AR PL ShanHeiSun Uni”) should be mixed with some specific Japanese fonts (“Sazanami Mincho”, “Kochi Mincho”) and a Korean font (“UnBatang”), because they have very similar stroke styles (Song/Ming style, or in Chinese, 宋體/明體).
  4. Use English for English: when displaying Latin glyphs, always use Latin fonts like DejaVu and Bitstream Vera. Latin glyphs in Chinese fonts are unconditionally crappy. They are blurred at smaller sizes (though Chinese users would not set their display to such small sizes anyway, as CJK glyphs would become unreadable), unlike normal Latin fonts, which have solid outlines. I’m not sure how freetype 2.2.x fares, but even if its anti-aliasing has improved compared with 2.0.x, it will be a long time before everybody switches to 2.2.
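
As a way of pinning down what the four goals ask for, here is a toy model in Python. It only describes the desired outcome; it is not how fontconfig works internally, and the function names, the code point ranges used for script detection, and the choice of “DejaVu Serif” as the Latin face are assumptions made for the example:

```python
# Toy model of goals 1-4. Everything here is illustrative: fontconfig does
# not work this way internally, and the names below are hypothetical.

LATIN_FONT = "DejaVu Serif"   # assumed Latin face from the DejaVu family

# Goal 3: one stroke style (Song/Ming) matched across the three languages.
MING_FONTS = {
    "zh": "AR PL ShanHeiSun Uni",
    "ja": "Sazanami Mincho",
    "ko": "UnBatang",
}


def script_lang(ch):
    """Return 'ja' or 'ko' for glyphs that belong to one language only."""
    code = ord(ch)
    if 0x3040 <= code <= 0x30FF:      # Hiragana and Katakana
        return "ja"
    if 0xAC00 <= code <= 0xD7A3:      # Hangul syllables
        return "ko"
    return None


def pick_font(ch, lang=None, default_lang="zh"):
    """Choose a font for one character, given the page's lang attribute."""
    if ord(ch) < 0x0250:              # goal 4: Latin glyphs use a Latin font
        return LATIN_FONT
    specific = script_lang(ch)
    if specific:                      # goal 2: kana/hangul keep their own font
        return MING_FONTS[specific]
    primary = (lang or "").split("-")[0].lower()
    if primary in MING_FONTS:         # goal 1: honour the lang attribute
        return MING_FONTS[primary]
    return MING_FONTS[default_lang]   # goal 2: otherwise one uniform font


print(pick_font("漢", lang="ja"))
print(pick_font("漢"))
print(pick_font("A", lang="zh"))
```

Unified Han characters are the only case where the lang attribute matters in this model, which is exactly where the real fonts fight each other.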

However, there are lots of problems in the setup:

  1. Mix and mismatch: while Song/Ming style can be matched with Serif, there is no match for Sans (or “unmodulated”, the better term internationally). The best match for an unmodulated font face is Hei style (黑體, meaning “black style”). A Japanese font (“Sazanami Gothic”) and a Korean one (“UnDotum”) that closely resemble Hei style are publicly available, but no Chinese one is. Kai style (楷體) is simply another style, not a substitute for Hei style.
  2. Hard to use English fonts for Latin glyphs: the Latin glyphs inside both uming and ukai are actually in a monospaced Serif face. However, the punctuation marks and other symbols are not of uniform width; they are proportional. What should I call that? Anyway, so far my attempts to override them with DejaVu have been unsuccessful.
  3. Browser issue: each browser may or may not be able to support CSS properly and pick glyphs according to language. Look at the 3 screenshots below:
    [Screenshot: font on Firefox]
    [Screenshot: font on Konqueror]
    [Screenshot: font on Opera]

    Among Firefox, Opera and Konqueror, only Firefox manages to obey the lang attribute (as in <span lang="xxx"></span>) and pick the correct font among the CJK ones. Konqueror only picks the font for the current locale when the Sans, Serif or Monospace aliases are used. Opera doesn’t even obey font-family in CSS (I guess it is using uming with no anti-aliasing). That means I have to stick to Firefox for font testing in the browser.

  4. How about other fonts? I’m pretty sure things can go wild if other, untested fonts are used, which is likely among Chinese users, where borrowing Windows TTF fonts (mingliu and simsun) and other commercial fonts is common practice. Making fontconfig behave properly with these fonts is tedious work.

Rearrange my priority — forget Ubuntu for now

2006-05-01

It looks like Ubuntu isn’t really the future that most Asian users hoped for. Yes, after the localization sprint more Asian users took part in making Ubuntu more suitable for Asian users, but most internal Ubuntu staff still don’t really think the Asian market is a necessity for them. Things like fixing typos are more important than fixing the UI in Asian locales; including Python development packages is more important than including basic input methods for Asian languages. This is, in some sense, a logical consequence of Ubuntu having no Asian developers.

But I don’t think I can put much more time into fighting this uphill battle. I have to spend my time on covering basic living requirements. They will probably employ some knowledgeable people later, but I can’t care too much about it now. I hope they succeed later.

People say that contributing is a hobby for those who have their basic living covered and are looking for something more (except for a lucky few). I have always felt that way, but never as strongly as now. Yes, at this very moment.

Looking back: Ubuntu days 4-5

2006-03-23

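My memory of those two days is getting hazy. One thing is certain, though: work progressed slowly (mine especially), and things felt a bit lazy. Meanwhile, I started buying more things in the neighbourhood.

What I remember best is going out for dinner on the fourth evening. Nobody could settle on a place, and in the end one of the developers picked a nearby Chinese restaurant (the sign at the door was in Chinese characters, in something like seal script). At first, apart from a few customers eating and drinking, not a soul was in sight; only later did the lady boss (?) slowly come out. I said “four cups of tea please” to her, and she shouted 「四杯茶」 (“four cups of tea”). Heh, the restaurant is run by Hong Kong people.

Besides the older lady boss, the waiting staff were a young man and a young woman. I assumed they were one family, but after asking one of them I learned that the young ones were just employees. The so-called “Imperial Fried Rice” I ordered looked very much like Fuzhou fried rice (福州炒飯), but the portion was huge and I could only finish half.

Listening to the developers chat, I learned that Castro is extremely popular in Cuba, much like Lee Kuan Yew in Singapore: he rules with an iron fist, yet for some reason the people love him. Also, the United States has always isolated Cuba, but things are different now. Cuba formed an alliance with another country (I forget which one) for mutual economic benefit, and afterwards almost all of South America joined; even countries that used to be friendly with the US have nearly all switched sides. It looks like a third major power will emerge besides North America and the EU.

On the last day two of the developers had to attend another meeting, leaving only one; together with me and the Thai helper that made just three people, so things got even lazier. The Thai helper had done the homework well: Ubuntu is basically the first to fully support Thai input and display. I did far worse, and still had to rely on atie, freeflying, minghua and others as the main force, while I only helped a little on the side.

In the evening Mark had booked a place for dinner near the headquarters, though not at the same table: Mark already had an appointment to dine with other people, so the three of us ate with the two developers who had left earlier. The duck breast tasted good, much better than that Christmas 「滲血的石頭」 (“bleeding stone”).

Before dinner we visited the headquarters. It is surprisingly small, about 500 square feet at most, and only seats the management staff; all the developers work from home. Before leaving I took a CD or two (not Ubuntu ones), but sadly there was no time to get Mark to sign them.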