Cyrillic-aware slugify function

Mihara · August 14, 2020, 8:57am

I’m trying to cook up a slugify function that would treat Cyrillic characters in a specific way… But first, a problem definition:

While you are technically permitted to use Cyrillic in a URL, you usually end up with a percent-encoded mess when people copy the URL out of the browser to paste it somewhere else. This is not readable with the naked eye and annoys the very people you rely on to spread your URL around. The traditional solution, which I used before Hugo, is transliteration: every Cyrillic letter has a commonly accepted Latin letter, or combination of letters, that sounds roughly like the original, close enough that someone glancing at the URL can tell what it was meant to say. This way “привет” turns into “privet” and everyone’s happy.
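To make the problem concrete, here is a minimal Go sketch (standard library only; the helper name escapedSlug is made up for illustration and is not something Hugo provides) of what a Cyrillic path segment looks like once it has been percent-encoded:

```go
package main

import (
	"fmt"
	"net/url"
)

// escapedSlug returns the percent-encoded form of a path segment,
// i.e. roughly what ends up on the clipboard when the URL is copied.
// Hypothetical helper, for illustration only.
func escapedSlug(s string) string {
	return url.PathEscape(s)
}

func main() {
	fmt.Println(escapedSlug("привет")) // %D0%BF%D1%80%D0%B8%D0%B2%D0%B5%D1%82
	fmt.Println(escapedSlug("privet")) // privet
}
```

Six letters become 36 characters of escapes, while the transliterated form passes through untouched.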

Now, obviously, natively, Hugo does nothing of the sort – the anchorize function converts Cyrillic to lowercase, but leaves Cyrillic letters in, which defeats the point, as these still turn into %D0%B0%D0%B1%D0%B2%D0%B3%D0%B4%D0%B5%D1… I need an alternate solution, and while I could, in theory, go fix the anchorize function itself and recompile, I don’t want to maintain my own fork of Hugo and I doubt this will be generally accepted as a pull request. And I can’t call anything external to do this for me either. I therefore have to somehow do this with Go templates only.

After some mucking around, I’ve been able to produce this horror, which I can’t help but think looks unbelievably crude:

{{- . | anchorize | replaceRE "а" "a" | replaceRE "б" "b" | replaceRE "в" "v" | replaceRE "г" "g" | replaceRE "д" "d" | replaceRE "е" "e" | replaceRE "ё" "yo" | replaceRE "ж" "zh" | replaceRE "з" "z" | replaceRE "и" "i" | replaceRE "й" "j" | replaceRE "к" "k" | replaceRE "л" "l" | replaceRE "м" "m" | replaceRE "н" "n" | replaceRE "о" "o" | replaceRE "п" "p" | replaceRE "р" "r" | replaceRE "с" "s" | replaceRE "т" "t" | replaceRE "у" "u" | replaceRE "ф" "f" | replaceRE "х" "kh" | replaceRE "ц" "ts" | replaceRE "ч" "ch" | replaceRE "ш" "sh" | replaceRE "щ" "shh" | replaceRE "(ъ|ь)" "" | replaceRE "ы" "y" | replaceRE "э" "ee" | replaceRE "ю" "yu" | replaceRE "я" "ya" -}}

I can’t even spread the code across multiple lines.

Now, could anyone more familiar with the way regexps work around here tell me, is it possible to do it in fewer regexps? Is there perhaps a way to iterate over the string instead?..

P.S. Yes, I realized I could use replace instead of replaceRE and it would at least be faster, but that’s not the point of this question.

pointyfar · August 14, 2020, 9:36am

Not really fewer regex, but you could “hide” the mess in a partial? Then put your transliterate pairs into a data file.

So you could have a TOML file with the pairs in your site's data directory:

# data/transliterate.toml
"а" = "a"
"б" = "b"
"в" = "v"
...

and then in your partial:

<!-- string to array -->
{{ $chars := split . "" }}

<!-- transliterate pairs -->
{{ $t := site.Data.transliterate }}

<!-- string to return -->
{{ $new := "" }}

{{ range $i, $e := $chars }} <!-- range over chars -->
    {{ if isset $t $e }}     <!-- if char exists as a key in toml -->
        {{ $new = print $new (index $t $e ) }} <!-- use that key's value -->
    {{ else }}               <!-- otherwise use 'old' character -->
        {{ $new = print $new $e }}
    {{ end }}
{{ end }}

<!-- comment out return line to test -->
{{ . }} = {{ $new }}

<!-- return new string -->
{{ return $new }}

{{ partial "transliterate.html" "привет" }} => privet
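For readers who think more easily in Go than in templates, the same per-character lookup can be sketched as below. This is only an illustration of the logic, not something that can be dropped into Hugo; the names translit and transliterate are assumptions, and the map holds just a handful of pairs from the table.

```go
package main

import (
	"fmt"
	"strings"
)

// translit holds a few of the pairs from the thread; a real table
// would list every letter. Hypothetical name, for illustration only.
var translit = map[rune]string{
	'п': "p", 'р': "r", 'и': "i", 'в': "v", 'е': "e", 'т': "t",
	'щ': "shh", 'ь': "",
}

// transliterate walks the string one character at a time, emitting the
// mapped Latin text when the character is a key in the table and the
// original character otherwise.
func transliterate(s string) string {
	var b strings.Builder
	for _, r := range s {
		if latin, ok := translit[r]; ok {
			b.WriteString(latin)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(transliterate("привет")) // privet
}
```

Characters that are not in the table, such as Latin letters, digits and hyphens, are copied through unchanged, which matches the else branch of the partial.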

Mihara · August 14, 2020, 9:44am

Considering that I would already have to stick it in a partial to use it (wait, can _markup templates even call partials? I need to test that…), I doubt it is an improvement.

Thanks for pointing out split, though.

Now I wish I could call this for slugs that are auto-generated from titles.

Mihara · August 14, 2020, 10:09am

There we go, I think I have a satisfactory solution. In case anyone comes looking…

With the data file translit.toml:

"а" = "a"
"б" = "b"
"в" = "v"
"г" = "g"
"д" = "d"
"е" = "e"
"ё" = "yo"
"ж" = "zh"
"з" = "z"
"и" = "i"
"й" = "j"
"к" = "k"
"л" = "l"
"м" = "m"
"н" = "n"
"о" = "o"
"п" = "p"
"р" = "r"
"с" = "s"
"т" = "t"
"у" = "u"
"ф" = "f"
"х" = "kh"
"ц" = "ts"
"ч" = "ch"
"ш" = "sh"
"щ" = "shh"
"ъ" = ""
"ы" = "y"
"ь" = ""
"э" = "ee"
"ю" = "yu"
"я" = "ya"

the partial looks like this:

{{ $r := anchorize . }}
{{ range $from, $to := site.Data.translit }}
  {{ $r = replace $r $from $to }}
{{ end }}
{{ return $r }}

And yes, it can be called from _markup/render-heading.html:

<h{{ .Level }} id="{{ partial "slugify.html" .Anchor | safeURL }}">{{ .Text | safeHTML }}</h{{ .Level }}>

I think I can live with that.
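One thing worth checking about the replace loop is whether the order of the pairs matters. It should not: every key is a single Cyrillic letter and every replacement is Latin-only, so no substitution can create or destroy a match for another key. A rough Go sketch of the same loop, where slugify is a hypothetical name and strings.ToLower is only a stand-in for anchorize (which also deals with spaces and punctuation):

```go
package main

import (
	"fmt"
	"strings"
)

// slugify lowercases the input (a simplified stand-in for Hugo's
// anchorize, which does more) and then applies every pair in turn.
// Go iterates over maps in random order, which is harmless here:
// keys are single Cyrillic letters, values contain only Latin letters.
func slugify(s string, pairs map[string]string) string {
	r := strings.ToLower(s)
	for from, to := range pairs {
		r = strings.ReplaceAll(r, from, to)
	}
	return r
}

func main() {
	// A few pairs from the table above, enough for the example word.
	pairs := map[string]string{
		"п": "p", "р": "r", "и": "i", "в": "v", "е": "e", "т": "t",
	}
	fmt.Println(slugify("Привет", pairs)) // privet
}
```

If the table ever gained multi-letter keys, or values containing Cyrillic, the order would start to matter and a single-pass approach would be safer.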

P.S. For the record, here’s another variation without a data file:

{{ $r := anchorize . }}
{{ $pairs := (dict "а" "a" "б" "b" "в" "v" "г" "g" "д" "d" "е" "e" "ё" "yo" "ж" "zh" "з" "z" "и" "i" "й" "j" "к" "k" "л" "l" "м" "m" "н" "n" "о" "o" "п" "p" "р" "r" "с" "s" "т" "t" "у" "u" "ф" "f" "х" "kh" "ц" "ts" "ч" "ch" "ш" "sh" "щ" "shh" "ъ" "" "ы" "y" "ь" "" "э" "ee" "ю" "yu" "я" "ya") }}
{{ range $from, $to := $pairs }}
  {{ $r = replace $r $from $to }}
{{ end }}
{{ return $r }}

The advantage is that it’s self-contained, the disadvantage is that it’s still a silly long line.

system · August 16, 2020, 10:09am

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.

alexandros · December 13, 2021, 7:03am

Since Hugo v0.81.0 (the first release built with Go 1.16), Go template actions can span multiple lines. Therefore the above dict can be written like so:

{{ $pairs := (dict
"а" "a"
"б" "b"
...
) }}
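The multiline form can be tried outside Hugo with plain text/template, assuming a Go toolchain of 1.16 or later, where newlines are permitted inside actions. In the sketch below, dict and replace are simplified stand-ins written for this example (they are not Hugo's implementations), and render is a hypothetical helper:

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// dict is a simplified stand-in for Hugo's dict: it builds a map from
// alternating key/value string arguments.
func dict(kv ...string) (map[string]string, error) {
	if len(kv)%2 != 0 {
		return nil, fmt.Errorf("dict: odd number of arguments")
	}
	m := make(map[string]string, len(kv)/2)
	for i := 0; i < len(kv); i += 2 {
		m[kv[i]] = kv[i+1]
	}
	return m, nil
}

// replace mirrors the argument order used in the thread: input, old, new.
func replace(s, from, to string) string {
	return strings.ReplaceAll(s, from, to)
}

// The dict call spans several lines inside a single action.
const src = `{{- $r := . -}}
{{- $pairs := (dict
  "п" "p"
  "р" "r"
  "и" "i"
  "в" "v"
  "е" "e"
  "т" "t"
) -}}
{{- range $from, $to := $pairs }}{{ $r = replace $r $from $to }}{{ end -}}
{{- $r -}}`

// render executes the template against the input string.
func render(input string) (string, error) {
	funcs := template.FuncMap{"dict": dict, "replace": replace}
	t, err := template.New("slug").Funcs(funcs).Parse(src)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	if err := t.Execute(&b, input); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	out, err := render("привет")
	fmt.Println(out, err) // privet <nil>
}
```

On a toolchain older than 1.16 the same template fails to parse, which is consistent with the earlier complaint about not being able to split the line.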

P.S. This topic covers transliterating non-Latin scripts to the Latin script within Hugo, so it is worth knowing that the solution above can live in a self-contained partial and still be readable.
