How to test a string for CJK characters

CJK languages (and others, such as Thai) cause problems for us developers, for instance when counting words or fitting text into a nice layout. That is why I was looking for a way to find out whether a string contains CJK characters.

The solution lies in the block ranges defined by Unicode. The CJK Unified Ideographs block starts at U+4E00 and ends at U+9FFF. My general idea is to match the tested string against that range. The following partial in func/isCJK.html will do that:

{{/* Returns true if the passed string contains at least one character from U+4E00 to U+9FFF. */}}
{{ $isCJK := false }}
{{ $matches := findRE "[\u4E00-\u9FFF]" . }}
{{ if gt (len $matches) 0 }}
  {{ $isCJK = true }}
{{ end }}
{{ return $isCJK }}
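
As a possible alternative to a hard-coded code point range: findRE uses Go's regular expression syntax, which also understands Unicode script classes such as \p{Han}. The following is only a sketch of how the same partial could look with script classes. The choice of scripts is my assumption about what should count as "CJK", not something the range above covers.

```go-html-template
{{/* Hypothetical variant of func/isCJK.html using script classes instead of a range. */}}
{{/* Backticks make this a raw string, so \p reaches the regex engine unchanged. */}}
{{/* The limit of 1 stops the search after the first match. */}}
{{ $matches := findRE `[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]` . 1 }}
{{ return gt (len $matches) 0 }}
```

Unlike the plain U+4E00 to U+9FFF range, this variant would also report Japanese kana and Korean Hangul as CJK.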

Testing:

{{ $isCJK := partialCached "func/isCJK.html" "丹为" "丹为" }}
{{ $isCJK }} <-- true
{{ $isCJK := partialCached "func/isCJK.html" "blafasel" "blafasel" }}
{{ $isCJK }} <-- false

Then it can be used, for instance, like this:

{{ .WordCount }}{{ if partialCached "func/isCJK.html" .Content .Content }} Characters{{ else }} Words{{ end }}

Things to keep in mind:

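  • The expression matches any string that contains at least one CJK character, so a mostly Latin text with a single ideograph also counts as CJK. The range covers only the CJK Unified Ideographs block; Japanese kana and Korean Hangul lie outside it.
  • The Thai Unicode block is at U+0E00 to U+0E7F, so a parameter that sets which ranges to test against would be a nice extension, or maybe a dict that maps languages to ranges so we can test against a language code…
  • The partial should return the boolean true rather than the string "true", because return hands the value back with its type. If in doubt: typecasting to the rescue!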
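
The dict idea from the list above could be sketched roughly like this. The partial name func/hasScript.html, the "text" and "lang" keys, and the language codes are all hypothetical, and which blocks belong to which language is an illustrative assumption. The sketch also assumes that both keys are always passed.

```go-html-template
{{/* Hypothetical partial func/hasScript.html. */}}
{{/* Assumed input: a dict with "text" (string to test) and "lang" (language code). */}}
{{/* Illustrative mapping of language codes to Unicode ranges. */}}
{{ $ranges := dict "zh" "[\u4E00-\u9FFF]" "ja" "[\u3040-\u30FF\u4E00-\u9FFF]" "ko" "[\uAC00-\uD7AF]" "th" "[\u0E00-\u0E7F]" }}
{{ $pattern := index $ranges .lang }}
{{ $found := false }}
{{/* Unknown language codes have no pattern and therefore return false. */}}
{{ if $pattern }}
  {{ $found = gt (len (findRE $pattern .text 1)) 0 }}
{{ end }}
{{ return $found }}
```

It would then be called with a dict instead of a plain string:

```go-html-template
{{ if partial "func/hasScript.html" (dict "text" .Title "lang" "th") }}Thai{{ end }}
```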