Problem with taxonomies in foreign languages

guayom · August 14, 2017, 12:56am

When you add taxonomy terms with foreign characters (say “Hidráulica”) to your pages, the method Hugo uses to “urlize” the file or directory it generates doesn’t normalize the foreign characters. “Hidráulica” creates a /hidráulica/ folder. It should actually create a /hidraulica/ folder, changing the á to an a.
But then, the permalink looks like this: /hidr%C3%A1ulica/.
So, of course, the permalink doesn’t match the actual page.

I think the only problem is when the folder name is generated.

I tried to look for that in the source and fix it… but I don’t know Go and I felt completely overwhelmed. LOL.

I posted this here instead of on GitHub in case it is not a bug but only a configuration option that I’m missing on my side.

guayom · August 14, 2017, 1:33am

Actually, it seems like the urlize method could use that improvement.

{{ "Hidráulica" | urlize }} prints hidr%C3%A1ulica

For reference, Ruby’s parameterize does this pretty well.

maiki · August 14, 2017, 2:23am

See the section about preserving taxonomy values in the taxonomy doc.

urlize is doing what it is supposed to do, by the way. It is encoding the characters for use in a URL. It just isn’t working as you need it.

guayom · August 14, 2017, 3:14am

@maiki, thanks for your answer!

Ok, so, I enabled preserveTaxonomyNames
preserveTaxonomyNames = true in my config.toml

In some pages I use the taxonomy term “Hidráulica”. So hugo generates a /<taxonomy>/hidráulica/index.html file for me.

When I range through taxonomies

{{ range ($.Site.GetPage "taxonomyTerm" "familias").Pages }}
  <a href="{{ .Permalink}}">{{ .Permalink }} - {{ .Title }}</a>
{{ end }}

I get: http://localhost:1313/familias/hidr%C3%A1ulica/ - Hidráulica

If I click on it, the url changes to: http://localhost:1313/familias/hidráulica/ in the search bar.

At least on my local environment, this works fine. I can navigate through the whole site without problems. But in production I will need urls to be normalized, and then all the rest of things that urlize does (lowercase and other characters replacement).

So, I have 2 questions:

1. Is there a way to get normalized URLs instead of URL-encoded URLs? http://localhost:1313/familias/hidraulica/ instead of http://localhost:1313/familias/hidr%C3%A1ulica/ ?
2. Although I know that urlize is working as intended, is there a specific reason to avoid replacing non-ASCII characters with an ASCII approximation? If there is not, urlize could be improved by adding that functionality. I don’t know the dev team’s position on that, which is why I’m asking.

Finally, I can always just remove the non-ASCII characters from my taxonomy terms… or find some workaround. But I would like to find a way to avoid that.

EDIT -

I understand that. And you’re right. The purpose of urlize is to URL-encode strings. And what I’m looking for is not to URL-encode my taxonomy terms. I need to create readable URL paths, which is something different. I’m looking for something like Ruby’s parameterize, which first replaces all non-ASCII characters with an ASCII approximation and then replaces spaces and lowercases everything.

So my second question is answered. But is there a method like parameterize in hugo or in go?
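For illustration, the kind of behaviour described above could be sketched in Go roughly like this. Everything here is an assumption for the example: parameterize and asciiFold are hypothetical names, the mapping table is deliberately tiny, and it expects precomposed characters such as á. It is neither Ruby's implementation nor anything Hugo ships; a fuller version would probably rely on Unicode normalization (for example via golang.org/x/text) rather than a hand-written table.

```go
package main

import (
	"fmt"
	"strings"
)

// asciiFold maps a few accented Latin letters to ASCII approximations.
// Illustrative only: the table is far from complete.
var asciiFold = map[rune]rune{
	'á': 'a', 'é': 'e', 'í': 'i', 'ó': 'o', 'ú': 'u', 'ü': 'u', 'ñ': 'n',
}

// parameterize is a hypothetical helper: it lowercases the input, folds the
// mapped accented letters to ASCII, and joins runs of letters and digits
// with single hyphens. Characters it cannot map act as separators.
func parameterize(s string) string {
	var b strings.Builder
	pendingHyphen := false
	for _, r := range strings.ToLower(s) {
		if folded, ok := asciiFold[r]; ok {
			r = folded
		}
		isAlnum := (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9')
		if !isAlnum {
			pendingHyphen = true
			continue
		}
		// Emit a hyphen only between two alphanumeric runs, never at the start.
		if pendingHyphen && b.Len() > 0 {
			b.WriteByte('-')
		}
		pendingHyphen = false
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Println(parameterize("Hidráulica"))    // hidraulica
	fmt.Println(parameterize("Energía Solar")) // energia-solar
}
```

Note the trade-off: folding loses information, so two different terms (say “año” and “ano”) would end up on the same path.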

maiki · August 14, 2017, 7:55am

Encoded URLs are valid. I get that you want to replace the characters, but I want to reiterate that having accents and other encoded characters is okay.

What about your production environment prohibits those?

maiki · August 14, 2017, 8:01am

Are you using the most recent version of Hugo? I would expect the taxonomies to work as you would prefer by default, so I am curious as to if the default behavior has changed, or my assumptions are wrong.

guayom · August 14, 2017, 12:24pm

Nothing. But that’s not human readable. I want the users to be able to read the URL. hidr%C3%A1ulica is perfectly readable for a computer or a web developer. But a normal user will see that as an error. In Latin America, replacing á with a in URLs is the default. It’s so normal that hidr%C3%A1ulica in a URL is actually seen as an error, and from a client’s perspective, it’s unacceptable. Normal users will never see that URL and think: “oh, that’s fine… that’s just an accented a”. You know what I mean? It’s not a technical requirement.

Also, I may have explained poorly. I meant that http://localhost:1313/familias/hidráulica/ wouldn’t be a valid url. Not that http://localhost:1313/familias/hidr%C3%A1ulica is not valid. That’s perfectly valid because it is url encoded. I meant that parameterize and urlize don’t share the same purpose.

0.26. I would also expect that behavior as default. That’s why I see this as a possible improvement that would benefit many developers.

Again, thanks for your time and your help. I really appreciate it.

guayom · August 14, 2017, 3:46pm

At first I was doubtful of myself. But now I’ve done a little more research.

Urlize appears to be intended to work as I expect it to work. In the docs for urlize, it says that it sanitizes the string and then replaces spaces with hyphens. So this is definitely not a URL-encoding method.

Also, Go templates already have their own URL-encoding function: urlquery. So, when I thought that maybe urlize was intended to URL-encode, I was wrong. There is another function that takes care of that.
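A small standalone sketch of urlquery, using Go's text/template directly rather than Hugo. The render helper is a hypothetical name introduced only for this example.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// render is a hypothetical helper: it parses a template string and
// executes it with no data, returning the rendered text.
func render(src string) (string, error) {
	tmpl, err := template.New("demo").Parse(src)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, nil); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := render(`{{ "Hidráulica" | urlquery }}`)
	if err != nil {
		panic(err)
	}
	// urlquery percent-encodes for a query string; it does not lowercase.
	fmt.Println(out) // Hidr%C3%A1ulica
}
```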

But I still wonder if there is a deliberate reason to URL-encode non-ASCII characters instead of replacing them with an ASCII approximation. It definitely looks like something that could be improved.

maiki · August 14, 2017, 9:41pm

I’ve got nothing left to contribute, but I am surprised by your stance. I expect urlize to sanitize a URL by encoding it.

I am not discounting your experience, but it is not a foregone conclusion that no one wants URL encoding, or that people look down on it as unprofessional. Some of the largest sites in the world handle URLs in this manner (such as Wikipedia).

I see conflict in the docs between urlize and the explicit instructions around preserveTaxonomyNames, but I do not agree urlize is not working as expected.

I was about to suggest looking in the GitHub issues, when I found two that concern you:

Special characters in taxonomy and slugs - explains why some languages require accents and encoding to work properly
Remove Accented Characters from URLs - suggested the undocumented config option: removePathAccents = true

So I suggest you try that, and if it works, add it to the configuration doc.

guayom · August 14, 2017, 10:15pm

Thanks man! You helped me in many ways. From the first issue you shared, I could understand that there is a deliberate reason for this behavior. And it’s perfectly fine. 2 years ago, this problem was already discovered and the necessary improvements were made. That answers one of my questions. (I knew I couldn’t possibly be the only one in need of that.)

And by using RemovePathAccents = true I get the configuration I need. That answers my other question.

Also, I need to search the GitHub issues more, and not only here.

I’m not on my computer right now. But I’ll test this tonight and come back with the result.

guayom · August 15, 2017, 12:02am

That’s it! RemovePathAccents = true is all I needed. Thanks @maiki.

I’ll check the docs again. This might be useful in the documentation.
