Migrating from WordPress » non-ascii chars in permalinks » 404

Hi all, I’m interested in possibly migrating my WordPress blog to Hugo. I’ve had preliminary success migrating my content into Hugo’s structure, but I’ve encountered an issue for which Google yields no solution: some of my posts include emoji and/or Hebrew text in their titles, and so far I can’t get these posts to be successfully served by hugo serve or python -m SimpleHTTPServer or an old version of httpd running on my personal server.

To elaborate:

I started with

  • A WordPress blog with 3,269 posts that I’ve been running since 2004 on various versions of PHP, MySQL, Linux, etc
  • Currently running WordPress 4.7 on Ubuntu 14.04.1 LTS with PHP 5.5.9-1ubuntu4.5 and MySQL 14.14 Distrib 5.5.40 (whatever that means)
  • My MySQL DB is so old and has been migrated so many times that it’s a wonder it kinda-sorta still works. I would be shocked if it used a sensible character encoding setting, etc. It’s probably a mess.

I did

  1. Used the Jekyll Exporter WordPress plugin to generate and download my site in Jekyll’s structure
  2. Discovered via jekyll serve that a bunch of my earlier posts included “invalid” byte sequences (that is, byte sequences that are not valid Unicode)
  3. Used a rudimentary shell script invoking iconv to remove all the invalid byte sequences
  4. Used hugo import jekyll to import the site from the Jekyll structure to the Hugo structure
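
For readers following along, step 3 can be approximated in Python. This is only a sketch of the general idea (decode as UTF-8 and drop whatever does not decode); the actual shell script and iconv flags used are not shown in this post, and the function name and sample bytes here are made up for illustration.

```python
def strip_invalid_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, silently dropping bytes that are not valid UTF-8."""
    # errors="ignore" discards undecodable bytes instead of raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="ignore")


# Hypothetical sample: valid UTF-8 ("café") followed by two stray bytes.
sample = b"caf\xc3\xa9 \xff\xfe ok"
cleaned = strip_invalid_utf8(sample)
```

Dropping bytes is lossy, so it is probably worth keeping a backup of the exported files before running anything like this over them.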

Now I’ve got

  • Most of the posts look great, work great.
  • The posts with emoji/Hebrew in their titles render correctly in the index
    • Except for that … showing up at the end — I have no idea what that is. Many of the posts have that at the end — it seems to be more common with more recent posts
      • Maybe it’s related to the plugin I’ve been using to import tweets
  • But I can’t navigate to those pages. I get a 404
    • As I wrote above, I tried this with various webservers, using the dynamic hugo serve and also with a published version of the site
  • The URL for that post above, for example, is http://localhost:1313/post/🤔-if-one-works-as-a-member-of-a-collaborative-team/
    • I mean the URL as generated in the index page
  • The path of that post in my Hugo site — as per ls via Bash in my MacOS terminal — is content/post/2016-10-19-%f0%9f%a4%94-if-one-works-as-a-member-of-a-collaborative-team.md
  • Within that file, as per cat, the line with title reads: title: "\U0001F914 if one works as a member of a collaborative team…"
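
One detail in the list above may be relevant: the file name holds literal percent escapes (%f0%9f%a4%94), while the link in the index holds the emoji itself. A minimal Python sketch, using only the standard library's urllib.parse, shows that the two spell the same four UTF-8 bytes but are different strings, so a lookup by exact path could plausibly miss. Whether this is what actually causes the 404 here is an assumption, not something established in this thread.

```python
from urllib.parse import quote, unquote

# Path segment as it appears in the file name (literal percent escapes, lower-case hex).
file_segment = "%f0%9f%a4%94-if-one-works-as-a-member-of-a-collaborative-team"

# Decoding the escapes yields the emoji form that the index page links to.
decoded = unquote(file_segment)

# Re-encoding the emoji form yields upper-case hex, so even the escaped
# spellings differ unless they are compared case-insensitively.
re_encoded = quote(decoded)

print(decoded)
print(re_encoded)
```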

So…

I’m kinda out of time writing this up… I’m not sure how to resolve this — although I’m a software developer, I’m out of my depth here. I just don’t know what’s going on or how I might resolve this.

I’d very much appreciate any suggestions!

Thank you!

This is an interesting problem.

Two things. Could you post the frontmatter for the post you have used here as an example?

Also, could you post your config file?

These two bits of data should make it possible to join a few more dots.

Instead of trying to get emojis etc. working in URLs I would suggest not to fight it, but to work around it:

  • Have a look at the permalinks section in the docs
  • Especially the part about the slug. Set the slug for the problematic pages.

Uh-oh. :wink:

Sure, the entire file is:

---
author: Avi
categories:
- none
date: 2016-10-19T19:29:18Z
format: status
guid: http://twitter-788884759283892229-post
id: 13355
tags:
- micro
- tweet
title: "\U0001F914 if one works as a member of a collaborative team…"
url: /post/%f0%9f%a4%94-if-one-works-as-a-member-of-a-collaborative-team/
---

🤔 if one works as a member of a collaborative team, perhaps “individual contributor” is a misnomer? 🤔

Sure, this is the whole thing:

baseurl: http://aviflax.com
disablePathToLower: true
languageCode: en-us
title: My New Hugo Site
...

I think it was auto-generated by hugo import jekyll.

Thanks for the help!

I appreciate the pragmatic suggestion, but I’m a bit ideological about URLs. I cling stubbornly to the outmoded idea that Cool URIs don’t change.

Also: if WordPress (on Apache) can successfully serve my posts with emoji in the slugs, then pretty much any other system should be able to as well. (I like WordPress but I think it’s succeeded in spite of PHP, not because of it.)

Well, then we don’t share the same definition of “cool URIs”.

This is the Unicode character for a Horizontal Ellipsis (guessing you probably know that), so it is possibly something the titles on the old platform already had, or it could come from a plugin. Either way, a simple sed/regex command should be able to remove it from all posts.
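
If sed is awkward for this, the same cleanup could be sketched in Python. This is a rough sketch, assuming the ellipsis should only be removed from the title line of the front matter and left alone in the body; the function name is made up for illustration.

```python
ELLIPSIS = "\u2026"  # HORIZONTAL ELLIPSIS


def strip_title_ellipsis(text: str) -> str:
    """Remove U+2026 from lines starting with 'title:', leaving other lines untouched."""
    out = []
    for line in text.splitlines(keepends=True):
        if line.startswith("title:"):
            line = line.replace(ELLIPSIS, "")
        out.append(line)
    return "".join(out)


# Hypothetical sample: one title line and one body line that both contain the character.
sample = 'title: "thinking\u2026"\nbody keeps \u2026 here\n'
result = strip_title_ellipsis(sample)
```

Running something like this over every file in content/post would still need a loop that reads and writes each file, which is left out here.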

It’s been a few days since you last posted so before diving in - have you made any further progress?

Yeah, I did that. This wasn’t my main problem, anyway.

I ended up getting Jekyll to successfully build and serve my site (as exported from WordPress by the WordPress to Jekyll Exporter (WP2JE)) with a single fairly simple patch: I added a call to Ruby’s String#scrub method to a variable containing a path for each file — and that fixed my issues with Jekyll. So for now I’m publishing my site with Jekyll… it’s working well enough :grin:
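
For anyone who has not used it, Ruby's String#scrub replaces invalid byte sequences with a replacement character (U+FFFD by default). The patch itself is not shown here, so the following is only a rough Python analogue of that behaviour, with a made-up path for illustration.

```python
def scrub(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing invalid bytes with U+FFFD."""
    # errors="replace" substitutes the Unicode replacement character
    # for anything that cannot be decoded, instead of raising an error.
    return raw.decode("utf-8", errors="replace")


# Hypothetical file path containing one stray byte.
scrubbed = scrub(b"_posts/2005-01-01-caf\xff.md")
```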

Thanks for the help!