How to ensure that file names and paths are preserved?

NovemberPain · October 2, 2020, 9:17pm

Hi,

I would like to convert three existing web sites of small non-profit organizations to Hugo. It is very important that the URLs of the HTML files and Markdown files are preserved, as they have been stable respectively for more than 20 years (*.html) and 10 years (*.md) and several deep links appear in printed documents. However, I am unable to find a way to configure Hugo so that it does not rename files or move them around. After spending several days trying different configuration options, reading the documentation and many messages in this community forum, I come here to ask for help or for a confirmation that Hugo is not suitable for importing existing web sites.

Here is a simplified view of the directory structure of one of the sites that I would like to convert to Hugo:

.
├── index.html
├── index.md
├── legal.html
├── legal.md
├── news/
│      ├── 1995-01-13-foobar.html
│      ├── 1995-01-13-foobar.md
...
│      ├── 2020-10-01-quux.html
│      ├── 2020-10-01-quux.md
│      ├── index.html
│      ├── index.md
│      ├── team.html
│      └── team.md
├── reports/
│      ├── 2005-01-31/
│      │      ├── index.html
│      │      ├── index.md
│      │      ├── graph1.png
│      │      └── graph2.png
....
│      └── 2020-08-31/
│             ├── index.html
│             └── index.md
...
├── topic1/
│      ├── article1.html
│      ├── article1.md
│      ├── article2.html
│      ├── article2.md
│      ├── index.html
│      ├── index.md
│      └── sub-topic12/
│             ├── image1.jpg
│             ├── index.html
│             ├── index.md
│             └── sub-sub-topic123/
...
├── topic2/
...

Copying the *.md files next to the *.html files generated by Hugo in its output directory is a minor issue that can easily be solved. But preserving the existing directory structure seems to be impossible, with or without uglyURLs and other settings. I have read many articles here describing similar problems, but most of the answers can be summarized as: adapt your directory structure to Hugo instead of trying to adapt Hugo to your web site.

If necessary, I can rename some index.md files to _index.md when running Hugo and then publish them as index.md. But this is not sufficient because no matter how I configure Hugo, this does not work: Hugo generates the *.html files in a different path than their source *.md files. For example, /news/_index.md generates /news.html instead of /news/index.html, while /news/index.md appears to work but blocks the generation of other contents under /news/ because Hugo treats it as a leaf and then ignores /news/team.md. There are several other unexpected renames, such as Hugo stripping the leading date from some file names or moving some files one level up in the directory structure. I also tried to fix the date-stripping issue by playing with the [permalinks] configuration but then some files under /news/ or /reports/ such as /news/team.md get an unwanted date prepended to their HTML output.

This is rather frustrating. I wish there was a setting preserveOriginalURLs = true or preserveFilePaths = true.

Here is my last attempt at creating a config.toml but that still does not work:

baseURL = "(...)"
title = "(...)"
theme = "(...)"
uglyURLs = true
disablePathToLower = true
disableKinds = [ "taxonomy", "term", "RSS", "sitemap", "robotsTXT", "404" ]
[taxonomies]
[frontmatter]
  date  = [":filename", "date", "publishDate", ":fileModTime", ":default"]
  lastmod = [":git", ":fileModTime", ":default"]
...
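
A possible explanation for the date stripping, offered as a hedged note rather than a confirmed diagnosis: the Hugo documentation for the `:filename` date handler says that, besides extracting the date from a name such as `2018-02-22-mypage.md`, it also uses the remainder (`mypage`) as the slug when no slug is set. A minimal sketch of the same `[frontmatter]` block without that handler is shown here. The trade-off, which is an assumption to verify on a real site, is that dates would then have to come from front matter, git, or file modification times.

```toml
# Sketch only: the [frontmatter] block from the config above, minus ":filename",
# so that Hugo no longer derives a slug from the date-prefixed file name.
[frontmatter]
  date    = ["date", "publishDate", ":fileModTime", ":default"]
  lastmod = [":git", ":fileModTime", ":default"]
```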

A bit of history: the oldest of these three web sites was created in late 1993, when the web was still very young and was competing with FTP and Gopher. Its domain name, directory structure and URLs have remained stable since 1996. It started as a set of static HTML files, then after a few years it had its contents managed by the now-obsolete Website Meta Language (WML), then around 2005 its contents were converted to Markdown with some Makefiles and scary Perl scripts to convert these Markdown files back to HTML again, update the indexes and lists, etc. It was decided to publish the Markdown files to make it easier for third parties to extract and convert the contents of the site. The other web sites that I mentioned are a bit more recent but have a similar history and similar directory structure containing both *.md and *.html files. The custom Perl scripts that generate these web sites are difficult to maintain and inconvenient for Windows users (the majority of the current contributors to these sites) so I would like to simplify this system and replace it by Hugo. Alas, it looks like Hugo is unable to preserve the existing directory structure.

Because of that history and because of the way these web sites are currently managed, I have the following constraints:

URLs must not change. The *.md files are published alongside the *.html files that they generate.
I cannot configure the web servers to rewrite URLs.
I cannot override the URLs in the front matter of the Markdown files. In fact, I would like to avoid having to set anything in the front matter, for two reasons: on the one hand, some of these files are edited by people with very little knowledge of computers and who could accidentally make some content unavailable if they copy-paste or edit the front matter without understanding it. And on the other hand, some of the Markdown files are also fetched and processed by (old) third-party tools that are unable to understand or skip the front matter.
If any weird tricks are needed, they can be in config.toml, in layouts (I also played with that, without much success), in a custom theme or in other files that are preferably outside the content tree.

I like the single-binary approach of Hugo because it is easier to use for Windows users who do not have to install a whole language framework such as Perl, Python or Javascript/Node.js and who do not have to care about script dependencies. However, I am about to give up because it looks like Hugo wants to force web sites to be organized in specific ways instead of preserving the existing file names. So this cry for help is my last attempt before switching to another static site generator…

Thanks for reading my long rant. Any help about how to configure Hugo would be appreciated, or a confirmation that Hugo is not suitable for this task.

pointyfar · October 3, 2020, 1:33am

There are many ways to manage URLs. The most straightforward would be to just “force” the URL value in front matter, as described on the URL management page of the Hugo docs.
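
As a rough illustration of forcing the URL, here is what the front matter of a single file might contain, assuming TOML front matter (placed between `+++` delimiters at the top of the file). The file `news/_index.md` and the target path are taken from the directory tree in the question. Whether Hugo writes exactly this path when uglyURLs is also enabled should be checked against the Hugo version in use.

```toml
# Hypothetical front matter for news/_index.md (goes between +++ delimiters).
# "url" overrides the whole output path for this one page.
url = "/news/index.html"
```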

Rather than copying the md files over, set up md as an output format. This should also address the concern about third-party tools that cannot understand or skip front matter.
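
A rough sketch of the output-format idea in config.toml follows. The format name `SourceMD` is hypothetical, and the setup assumes a matching plain-text template (for example `layouts/_default/single.sourcemd.md` containing only `{{ .RawContent }}`) that emits the Markdown body. Treat it as a starting point under those assumptions, not a tested configuration.

```toml
# Sketch only: publish a Markdown rendition next to each HTML page.
# "SourceMD" is a made-up name; any name not already used by Hugo should work.
[mediaTypes."text/markdown"]
  suffixes = ["md"]

[outputFormats.SourceMD]
  mediaType   = "text/markdown"
  isPlainText = true   # use text templates, no HTML escaping

[outputs]
  home    = ["HTML", "SourceMD"]
  section = ["HTML", "SourceMD"]
  page    = ["HTML", "SourceMD"]
```

Note that a file produced this way is generated from a template, so it is not guaranteed to be byte-for-byte identical to the source file.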

I would suggest you have a look at using a headless CMS for your users instead of having them touch the markdown files directly.

I get the feeling from your post that you just want to hear “Hugo is not for you” so if that’s what you are looking for, then: Hugo is not for you.

If you really do want to try it, here are a few points:

Read Requesting Help
It’s easier to ask for help for specific questions.
It’s easier to help if you had your code in a repo we can reproduce.
You probably need to read the docs a few times.

NovemberPain · October 3, 2020, 3:22pm

Thanks a lot for taking the time to read my long message and reply to it.

I could do what you suggest, but this would imply that the published *.md files would be different from the real source files. Or to put it differently: it would be difficult (not impossible, but difficult) to rebuild the web site if the private repo disappears and the only files available are the ones that were published. This unfortunately happened fifteen years ago with one of the web sites, which was still using WML at the time: the only two copies of the source files were accidentally destroyed during the same week, before another backup could be made (there was no git at that time, and even svn was still new). A few years later, this almost happened to one of the other web sites when the volunteer who maintained it passed away and nobody could access his encrypted drives. Fortunately, someone else had a recent copy, which was unexpected in this small non-profit association in which very few volunteers know about git and proper file management. The first incident was one of the main triggers for moving from WML to the Markdown format and for publishing the source files next to the corresponding HTML files, so that it would be easier to recover them in case of disaster or if the maintainer disappeared.

In the 20+ years of existence of these web sites, many tools and frameworks have become popular and then disappeared, but simple file formats such as Markdown have remained relatively stable thanks to a large number of independently developed tools that can process them. The Markdown format seems to be a good bet for the long term: 20 years from now, even if some of the current tools working with Markdown files have disappeared, there will still be enough of them left to ensure that the files can be read and converted as necessary. But this may not be true for variants of Markdown that rely on uncommon extensions in the document or on a specific header. This is why I would like to avoid relying too much on the front matter. Also, using a CMS that would hide the real source files from the editors would not solve the problem.

That being said (sorry for rambling again like a grandma), your reply made me think that instead of trying to find a solution in which Hugo works correctly for all files, it might be sufficient to find a set of options that allow Hugo to put most files in the right place and treat the other ones as exceptions that require a special treatment.

So I will try to find a solution (combination of uglyURLs and permalinks settings for some sections in config.toml) that works for most of the files and I will only add a front matter with the URL override in the files that are not edited often and that Hugo renames incorrectly. This is not a perfect solution, but if I can limit the URL overrides to a few _index.md files while the majority of the files would be generated correctly without any special instructions, then I could keep on using Hugo instead of having to switch to another static site generator.
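
To make the idea concrete, here is an untested sketch of what such a config.toml fragment might look like for the flat /news/ section. It assumes that the `:filename` permalink token expands to the source file name without its extension (date prefix included) and that section pages such as /news/_index.md are not affected by `[permalinks]` and would still need a URL override. Both points should be verified against the Hugo version in use.

```toml
# Sketch only: keep /news/<original-file-name>.html for regular pages.
uglyURLs = true            # /news/team.md -> /news/team.html
disablePathToLower = true  # do not lowercase paths

[permalinks]
  # Intended results (unverified):
  #   news/1995-01-13-foobar.md -> /news/1995-01-13-foobar.html
  #   news/team.md              -> /news/team.html
  news = "/news/:filename"
```

Nested sections such as /topic1/sub-topic12/ are deliberately left out of this sketch, because a permalink pattern keyed on a top-level section could flatten the files below it.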

A better solution would probably require some changes in Hugo, such as the addition of an option like preserveFilePaths = true as I suggested above. This would make it much easier for existing web sites to migrate to Hugo without having to modify most of their source files. But in the meantime, I will try the workaround that I just described and I hope that it will allow me to migrate these three web sites to Hugo without having to modify hundreds of *.md files.

Thanks again for your reply. It was not the solution that I was hoping for, but it made me think about a workaround that will probably be good enough.

