Multibyte characters? No, not considered.
Lately I’ve been struggling with one of the Recent Comments WordPress plugins. As the title says, it belongs to the vast category of software that does not support multibyte characters. It is not especially bad; it is just that this plugin is too visible on my blog. All software that does not support multibyte strings is equally evil or ignorant.
This plugin has a slight advantage over other similar but simpler plugins: it allows breaking long ‘words’ (like URLs), so the blog layout isn’t damaged by an extremely long URL. However, it makes use of multibyte-unsafe PHP functions like substr(), strlen(), and especially wordwrap(). Unlike substr() and strlen(), which have mb_substr() and mb_strlen(), wordwrap() has no multibyte-safe equivalent. The net result is that some comments get a line break inserted in the middle of a multibyte character!
The most obvious thing to do is to replace wordwrap() with a saner function. After attempting something stupid (like trying to write my own function; the only possible resolution was to give up), it was time to turn to my savior (read: Google) for help. Finally, the htmlwrap() script written by Brian Huisman got my attention. Quoting its author:
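To make the failure concrete, here is a minimal sketch, assuming UTF-8 text and that PHP’s mbstring extension is loaded. The sample string and variable names are made up for illustration and are not taken from the plugin.

```php
<?php
// Hypothetical sample comment: five characters, three bytes each in UTF-8.
// (This source file itself must be saved as UTF-8.)
$comment = "日本語です";

$byteLength = strlen($comment);              // counts bytes: 15
$charLength = mb_strlen($comment, 'UTF-8');  // counts characters: 5

// Cutting at byte 4 keeps "日" plus only the first byte of "本".
$byteCut = substr($comment, 0, 4);
$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8');  // false: broken character

// Cutting at character 4 keeps four whole characters.
$charCut = mb_substr($comment, 0, 4, 'UTF-8');  // "日本語で"

var_dump($byteLength, $charLength, $byteCutIsValid, $charCut);
```

A byte-based function produces the same kind of broken character whenever its cut point falls inside a multibyte sequence.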
Built for use in the Orca Forum and Blog, the htmlwrap() function safely wraps HTML formatted text by breaking strings of characters over a certain length. It’s great for use anywhere where generated HTML output is built from user input.
A BIG plus: it is UTF-8 safe! So I simply replaced all instances of wordwrap() with htmlwrap(), and half of the problem was solved. The remaining half is actually two problems:
- While the plugin claims it can chop off the comment after a certain number of characters, it actually means that many bytes minus the length of the comment author’s name. That’s certainly not what a blog admin would expect, though I doubt many people would really count the characters.
- A word is defined only as a bunch of bytes separated by spaces. However, the ‘word wrap’ rules are vastly different for Asian languages, especially CJK: no space is inserted between characters, so a whole sentence, or even a whole paragraph, can contain no whitespace at all. Line breaks can occur before any character except (most) punctuation marks.
Checking for punctuation to avoid bad line breaks is more time-consuming, but the rest tends to be easy. Thus the remaining work consists of replacing string functions with their multibyte-safe equivalents, plus some preg_match() calls to make sure only an incomplete English word at the end of the text is trimmed. Here is the comparison before and after modification:
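As a rough illustration of breaking on character boundaries rather than byte boundaries, here is a small sketch. The helper name break_long_runs() is hypothetical; it is neither part of the plugin nor how htmlwrap() works internally. It assumes UTF-8 input and deliberately ignores the punctuation rule mentioned above.

```php
<?php
// Hypothetical helper: insert $break after every $width characters inside
// any run of non-space characters that is longer than $width.
function break_long_runs(string $text, int $width, string $break = "\n"): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \S matches one whole character instead of one byte.
    // The lookahead avoids adding a break at the very end of a run.
    $pattern = '/(\S{' . $width . '})(?=\S)/u';
    $result = preg_replace($pattern, '$1' . $break, $text);

    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // in that case fall back to the unmodified text.
    return $result ?? $text;
}

echo break_long_runs("日本語です", 2), "\n";
```

Short space-separated words pass through untouched, while a long CJK run or URL is split between whole characters. A real implementation would also need to avoid breaking before punctuation marks.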
| Before | After |
|---|---|
| (screenshot missing) | (screenshot missing) |
However, I have assumed that people are using UTF-8 encoding. Most people should already be using it, but still, being forewarned is better than regretting it later. I am not sure whether submitting this change upstream is a good idea, since it is not generic enough to cope with any multibyte encoding and/or any language; so far it handles only CJK comments in UTF-8.
Anyway, if you want to try, save the content of this file and rename the file to get-recent-comments.php. Place this file into the WordPress plugin directory and ~~pray~~ have fun!
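The truncation step described above can be sketched roughly as follows. This is not the actual modified plugin code: truncate_comment() is a hypothetical name, the input is assumed to be UTF-8, and only ASCII letters and digits are treated as ‘English word’ characters.

```php
<?php
// Hypothetical helper: keep at most $limit characters (not bytes) of $text.
// If the cut landed inside an ASCII word, drop that partial word.
function truncate_comment(string $text, int $limit): string
{
    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return $text;  // short enough, nothing to do
    }

    $cut  = mb_substr($text, 0, $limit, 'UTF-8');
    $next = mb_substr($text, $limit, 1, 'UTF-8');  // first dropped character

    // Trim only when a word was really split: the character right after
    // the cut is an ASCII letter or digit, and the kept text ends in one
    // or more of them preceded by some other character.
    if (preg_match('/^[A-Za-z0-9]$/', $next)
        && preg_match('/^(.*[^A-Za-z0-9])[A-Za-z0-9]+$/us', $cut, $m)) {
        $cut = $m[1];
    }

    // rtrim() strips only ASCII whitespace bytes, which never occur
    // inside a multibyte UTF-8 sequence, so it is safe here.
    return rtrim($cut);
}

echo truncate_comment("hello wonderful world", 10), "\n";
echo truncate_comment("日本語です", 3), "\n";
```

CJK text is cut exactly at the character limit, while a cut that falls in the middle of an English word backs up to the previous non-word character.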


It even chops after the given number of bytes minus every *visible* element. The intention is to always occupy the same space in the sidebar, even if someone uses incredibly long author names, trackback titles, or whatever.
Your feedback is very much appreciated. PHP’s support for multibyte languages is still pretty poor, and we really need a clean international word-wrapping function. My problem is that I am unable to judge the results of the functions you can find on the internet. I simply cannot say whether they work right or not. So if htmlwrap() does the right thing, I will use it.