Problem with taxonomies in foreign languages

When you add taxonomy terms with foreign characters (say “Hidráulica”) to your pages, the method Hugo uses to “urlize” the generated file or directory name doesn’t normalize those characters. “Hidráulica” creates a /hidráulica/ folder. It should actually create a /hidraulica/ folder, changing the á to an a.
But then, the permalink looks like this: /hidr%C3%A1ulica/.
So, of course, the permalink doesn’t match the actual page.

I think the only problem is when the folder name is generated.

I tried to find that in the source and fix it… but I don’t know Go and I felt completely overwhelmed. LOL.

I posted this here instead of on GitHub in case it is not a bug but only a configuration setting that I’m missing on my side.

Actually, it seems like the urlize method could use that improvement.

{{ "Hidráulica" | urlize }} prints hidr%C3%A1ulica
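For anyone wondering where %C3%A1 comes from: it is the two UTF-8 bytes of á, percent-encoded. A minimal sketch in Python, assuming that `urllib.parse.quote` is a fair stand-in for the encoding step (this is not Hugo's code):

```python
from urllib.parse import quote

term = "hidr\u00e1ulica"  # "hidráulica", with á written as U+00E1

# á is not ASCII, so UTF-8 stores it as two bytes: 0xC3 0xA1
utf8_bytes = term.encode("utf-8")

# Percent-encoding writes each non-ASCII byte as %XX
encoded = quote(term)

print(utf8_bytes)
print(encoded)
```

The ASCII letters pass through unchanged; only the accented character is expanded.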

For reference, Ruby’s parameterize (from Rails’ ActiveSupport) does this pretty well.

See the section about preserving taxonomy values in the taxonomy doc.

urlize is doing what it is supposed to do, by the way. It is encoding the characters for use in a URL. It just isn’t working as you need it. :slight_smile:

@maiki, thanks for your answer!

Ok, so I enabled preserveTaxonomyNames by setting
preserveTaxonomyNames = true in my config.toml

In some pages I use the taxonomy term “Hidráulica”. So Hugo generates a /<taxonomy>/hidráulica/index.html file for me.

When I range through the taxonomy terms:

{{ range ($.Site.GetPage "taxonomyTerm" "familias").Pages }}
  <a href="{{ .Permalink}}">{{ .Permalink }} - {{ .Title }}</a>
{{ end }}

I get: http://localhost:1313/familias/hidr%C3%A1ulica/ - Hidráulica

If I click on it, the URL changes to: http://localhost:1313/familias/hidráulica/ in the address bar.
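Those two addresses are the same path written two ways; the browser only displays the decoded form. A small sketch in Python, assuming `urllib.parse.unquote` approximates what the address bar does for display (browsers have their own rules, so this is only an illustration):

```python
from urllib.parse import unquote

# The permalink as it appears in the generated HTML
link = "http://localhost:1313/familias/hidr%C3%A1ulica/"

# Decoding the percent-escapes gives the human-readable form
shown = unquote(link)

print(shown)
```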

At least on my local environment, this works fine. I can navigate through the whole site without problems. But in production I will need URLs to be normalized, along with everything else urlize does (lowercasing and replacing other characters).

So, I have 2 questions:

  1. Is there a way to get normalized URLs instead of URL-encoded URLs? http://localhost:1313/familias/hidraulica/ instead of http://localhost:1313/familias/hidr%C3%A1ulica/ ?
  2. Although I know that urlize is working as intended, is there a specific reason to avoid replacing non-ASCII characters with an ASCII approximation? If there is not, urlize could be improved by adding that functionality. I don’t know the dev team’s position on that, which is why I’m asking.

Finally, I can always just remove non-ASCII characters from my taxonomy terms… or find some workaround. But I would like to find a way to avoid that.

EDIT:

I understand that. And you’re right. The purpose of urlize is to URL encode strings. And what I’m looking for is not to URL encode my taxonomy terms. I need to create readable URL paths, which is something different. I’m looking for something like Ruby’s parameterize, which first replaces all non-ASCII characters with an ASCII approximation and then replaces spaces and lowercases everything.
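To make the difference from URL encoding concrete, here is a rough sketch of that kind of transformation in Python. The function name `parameterize` and its `sep` argument are my own stand-ins; this is not Rails' implementation and not anything in Hugo. It assumes that stripping combining marks after Unicode decomposition is a good enough "ASCII approximation", which holds for Spanish accents but not for every script:

```python
import re
import unicodedata


def parameterize(text: str, sep: str = "-") -> str:
    """Hypothetical helper: turn a term into a lowercase ASCII slug."""
    # Split accented letters into base letter + combining mark (á -> a + ́)
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so only the base letters remain
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Discard whatever is still non-ASCII (no transliteration table here)
    ascii_only = stripped.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of non-alphanumeric characters into the separator
    slug = re.sub(r"[^A-Za-z0-9]+", sep, ascii_only).strip(sep)
    return slug.lower()


print(parameterize("Hidráulica"))
print(parameterize("Energía Solar"))
```

Characters with no decomposition into an ASCII base letter are simply dropped by this sketch, so a real implementation would need a transliteration table for them.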

So my second question is answered. But is there a method like parameterize in Hugo or in Go?

Encoded URLs are valid. I get that you want to replace the characters, but I want to reiterate that having accents and other encoded characters is okay.

What about your production environment prohibits those?

Are you using the most recent version of Hugo? I would expect the taxonomies to work as you would prefer by default, so I am curious whether the default behavior has changed or my assumptions are wrong.

Nothing. But that’s not human readable. I want the users to be able to read the URL. hidr%C3%A1ulica is perfectly readable for a computer or a web developer. But a normal user will see that as an error. In Latin America, replacing á with a in URLs is the default. It’s so normal that hidr%C3%A1ulica in a URL is actually seen as an error, and from a client’s perspective, it’s unacceptable. Normal users will never see that URL and think: “oh, that’s fine… that’s just an accented a”. :laughing: You know what I mean? It’s not a technical requirement.

Also, I may have explained this poorly. I meant that http://localhost:1313/familias/hidráulica/ wouldn’t be a valid URL. Not that http://localhost:1313/familias/hidr%C3%A1ulica is not valid. That’s perfectly valid because it is URL encoded. I meant that parameterize and urlize don’t share the same purpose.

I’m on 0.26. I would also expect that behavior by default. That’s why I see this as a possible improvement that would benefit many developers.

Again, thanks for your time and your help. I really appreciate it.

At first I was doubtful of myself. But now I’ve done a little more research.

urlize appears to be intended to work as I expect it to work. The docs for urlize say that it sanitizes the string and then replaces spaces with hyphens. So this is definitely not a URL encoding method.

Also, Go templates already have their own URL encoding function: urlquery. So, when I thought that maybe urlize was intended to URL encode, I was wrong. There is another function that takes care of that.

But I still wonder if there is a deliberate reason to URL encode non-ASCII characters instead of replacing them with an ASCII approximation. It definitely looks like something that could be improved.

I’ve got nothing left to contribute, but I am surprised by your stance. I expect urlize to sanitize a URL by encoding it.

I am not discounting your experience, but it is not a foregone conclusion that no one wants URL encoding, or that people look down on it as unprofessional. Some of the largest sites in the world handle URLs in this manner (such as Wikipedia).

I see conflict in the docs between urlize and the explicit instructions around preserveTaxonomyNames, but I do not agree urlize is not working as expected. :slight_smile:

I was about to suggest looking in the GitHub issues, when I found two that concern you:

So I suggest you try that, and if it works, add it to the configuration doc. :slight_smile:

Thanks man! You helped me in many ways. From the first issue you shared, I could understand that there is a deliberate reason for this behavior. And it’s perfectly fine. This problem was already discovered 2 years ago and the necessary improvements were made. That answers one of my questions. (I knew I couldn’t possibly be the only one in need of that.)

And by using RemovePathAccents = true I get the configuration I need. That answers my other question.

Also, I need to search the GitHub issues more, and not only here.

I’m not on my computer right now. But I’ll test this tonight and come back with the result.

That’s it! RemovePathAccents = true is all I needed. Thanks @maiki.

I’ll check the docs again. This might be useful in the documentation.
