How to test a string for CJK characters

davidsneighbour · January 22, 2022, 4:25am

CJK languages (and others, such as Thai) cause some problems for us devs: their text does not word-count well, and it can be hard to fit into a nice layout. That is why I was thinking about how to find out whether a string contains CJK characters.

The solution lies within the code point ranges of the Unicode definition. There is a block for CJK (CJK Unified Ideographs) that starts at U+4E00 and ends at U+9FFF. It covers the common Han characters; kana and Hangul live in other blocks. My general idea is to match the tested string against that range. The following partial layout in func/isCJK.html will do that:

{{ $isCJK := false }}
{{ $matches := findRE "[\u4E00-\u9FFF]" . }}
{{ if gt (len $matches) 0 }}
  {{ $isCJK = true }}
{{ end }}
{{ return $isCJK }}
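
A possible variant, as a rough sketch only: Go's regular expression engine, which findRE uses, also understands Unicode script classes such as \p{Han}, so the explicit range could be swapped for a script class. Two assumptions here: the pattern is written in backticks so the backslash reaches the regex engine untouched, and the optional third argument of findRE is used to stop after the first match. Note that \p{Han} covers more than the U+4E00 to U+9FFF block, so the results can differ from the version above.

{{/* Variant of func/isCJK.html: script class instead of a code point range */}}
{{/* The limit of 1 stops the search after the first match */}}
{{ $matches := findRE `\p{Han}` . 1 }}
{{ return gt (len $matches) 0 }}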

Testing:

{{ $isCJK := partialCached "func/isCJK.html" "丹为" "丹为" }}
{{ $isCJK }} <-- true
{{ $isCJK := partialCached "func/isCJK.html" "blafasel" "blafasel" }}
{{ $isCJK }} <-- false
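
If the type of the returned value matters, a quick check along these lines should settle it. This is only a sketch: printf with the %T verb prints the Go type of its argument, so it would show bool for a real boolean and string for the text "true".

{{ $isCJK := partialCached "func/isCJK.html" "丹为" "丹为" }}
{{/* Prints the Go type of the returned value */}}
{{ printf "%T" $isCJK }}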

It can then be used like this, for instance:

{{ .WordCount }}{{ if partialCached "func/isCJK.html" .Content .Content }} Characters{{ else }} Words{{ end }}

Things to keep in mind:

The expression matches any string that contains at least one CJK character.
The Thai Unicode block is at U+0E00 to U+0E7F, so adding a parameter to set what ranges to test against is a nice extension, or maybe a dict that connects languages to ranges so we can test against a language code…
Not sure if this partial returns "true" or true - typecasting to the rescue!
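
The dict idea could look roughly like the sketch below. Everything in it is an assumption for illustration: the partial name func/hasRange.html, the "text" and "lang" keys, and the choice of language codes. The zh and th ranges are the ones mentioned above; the ja entry adds the Hiragana and Katakana blocks (U+3040 to U+30FF) and the ko entry uses the Hangul Syllables block (U+AC00 to U+D7AF). An unknown language code simply yields false.

{{/* Hypothetical partial: func/hasRange.html */}}
{{/* Expects a dict with the keys "text" and "lang" */}}
{{ $ranges := dict "zh" "[\u4E00-\u9FFF]" "ja" "[\u3040-\u30FF\u4E00-\u9FFF]" "ko" "[\uAC00-\uD7AF]" "th" "[\u0E00-\u0E7F]" }}
{{ $pattern := index $ranges .lang }}
{{ $found := false }}
{{ if $pattern }}
  {{/* One match is enough, so limit the search to 1 */}}
  {{ $found = gt (len (findRE $pattern .text 1)) 0 }}
{{ end }}
{{ return $found }}

Testing:

{{ partial "func/hasRange.html" (dict "text" "สวัสดี" "lang" "th") }} <-- true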
