Can we avoid copying static files unnecessarily?

I’ve built a couple of websites with Hugo, and I wanted to learn Go, so I went looking for open issues. One that stood out to me was about not copying static files unnecessarily when building a site.

It seems like there are two independent problems here.

The first is that when testing that your site is functionally correct (e.g. in CI), you don’t actually need to get as far as building a full public/ folder with all the static files in it (I think this is what snicolai-blog was talking about). Something like a --skip-static-files option might be enough here: similar to the --dry-run option you get with other tools, except that it would obviously still generate .html files etc.

The second is that if you’re going to build your site into a scratch folder and then e.g. rsync the resulting contents to a production server, you don’t want to copy gigabytes of files from one folder to another unnecessarily (I think this is Berny23’s position). So we should try to hard-link the originals to where they end up in the public/ directory or, possibly, symbolically link them if the platform we’re on doesn’t support hard links. Maybe you’d call the option --link-static-files, --link-static-copies, --link-published-static-files or something.

But beyond that, it feels like we have two different piles of stuff when we build Hugo sites. On one hand (1) we take a whole bunch of .md and e.g. .toml files, and apply all sorts of funky logic and templates that ultimately generate .html files, and that’s all they’re for. They serve no purpose after that, and there’s no reason why they’d matter for the generated website. They feel like source code.

But we also (2) have a bunch of .gif, .png, .jpg, .mov etc. files hanging around that ultimately belong in the generated website. Sure, we can peek at their filenames as part of (1), even peek at their contents so we can e.g. generate thumbnails, or maybe minify other static stuff like JavaScript or CSS, but that’s just a consequence of Hugo being clever. There’s no requirement that you should have to do anything like this. The static files are other files that go into the website, and arguably that’s where they belong. We’re just borrowing them from the future.

So while the basic model is that you have a directory structure like this (ignoring assets because I think they follow the same model as content/: they’re supposed to be things that Hugo chews on):

hugo/
    content/
    static/
    public/
        .gitignore # These files are all derived and less important

I wonder whether it would sometimes make sense to invert this slightly and have something like this instead?

website/
    .gitignore # Only the .html files etc.
    images/
hugo/
    content/
    static/
        images -> ../../website/images
    public -> ../website

The file-syncing code that Hugo uses currently isn’t aware of symbolic or hard links, unfortunately; I’ve got a patch for that which I’ve kept in draft to avoid bothering people before I posted this message. But I think that once implemented it should speed up re-“generating” sites where most of the static content is already available and doesn’t need to be copied again.

Beyond that, I wonder if there’s a conversation worth having about potentially having four types of directory (which is obviously a lot more complicated than the current setup):

static-source/
    images/
dynamic-source/
    content/
    static/
        images -> ../../static-source/images/
    public -> ../generated-content
generated-content/
    index.html # and friends; from Hugo
    images/
        bigimage_thumbnail.jpg # generated by Hugo
website/
    index.html -> ../generated-content/index.html
    images/
        bigimage.jpg -> ../../static-source/images/bigimage.jpg
        bigimage_thumbnail.jpg -> ../../generated-content/images/bigimage_thumbnail.jpg

I’ve used symbolic links here because they make it clear where files are coming from, but you could use hard links on file systems that support them.

The benefit of having a two-stage (under the hood) process where you (a) first generate files from Hugo and then (b) combine them with static source files, is that you can blow away the generated-content and website directories without worrying, because everything will be rebuilt when you run hugo build. This doesn’t work if your website/images directory is both the place where large images live and the place where Hugo writes thumbnails to.

And for each of the four logical component directories, you can decide whether to check them into version control or not, and/or do funky things with S3 or Cloudflare or what have you.

From preliminary reading it feels like this might be something like mounts, but for outputs rather than inputs?

It looks like skingston has misunderstood what I want. There are not two independent problems. I have the second problem along with Berny23. The first problem is a misunderstanding that skingston made up. I do not think --skip-static-files is a useful option.

I don’t want to skip copying the files into public/. The files SHOULD appear in public/. I just want the public version hard linked to the source version to save disk space. I want --hardlink-static-files.

I don’t want the proposal in the second half of the post, moving all images into a symbolically linked image directory. In my use case, I’m using Hugo’s page bundle feature (https://gohugo.io/content-management/page-bundles/), where the images are for a specific post and I want them next to the index.md file for that post. Some editing tools have nice previews for Markdown files that use simple image references like this. Moving the images off to their own area would break these tools.

Oh yeah, if you’re using page bundles then you’re effectively abandoning the special static/ or assets/ directories; and from the point of view of version history you want to have a post and its images appear in basically the same commit or set of commits.

My point about skipping copying static files was purely because you’d mentioned that you build a site in CI and then throw it away. If you’re going to do that, then we may as well skip all of the copying – whether creating new copies of the files or hard-linking them – because the only way any of that can go wrong is presumably if the file system fills up, rather than a markup or logic error in your Hugo files that you want to be told about.

Unless your CI process is also a CD process, i.e. if everything goes fine you also copy the resulting files to a production server (or, more likely, rsync them so you don’t have to copy stuff over the wire unnecessarily)?

Yes, it’s a CI/CD system like Netlify, so I don’t have much control over the CI/CD server. The service does a git clone of the repository then calls a build script that I supply in the sandboxed CI/CD server. The disk space is limited on that server. With the current hugo build process of copying the static files to public there are effectively three copies of each JPG on that disk.

  1. In the .git repository
  2. The checked out source in /content/posts
  3. Copied into /public by the hugo build process

Using hard links instead of copying into /public will remove one of the three copies.

For example, the service I’m using will deploy up to 10GB and has 22GB of free space on the disk used for builds. My site is 90% or more images. With the current build process I can only build a site that’s a little smaller than 7GB rather than the full 10GB. If hugo supports hard links, I should be able to build the full 10GB site.
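As a back-of-the-envelope check of those numbers, here is the arithmetic as a small Go program. It assumes every copy on disk is roughly the size of the site (JPGs barely compress inside .git) and ignores everything else on the disk; maxSiteGB is just a name I made up for the illustration.

```go
package main

import "fmt"

// maxSiteGB estimates the largest site that fits on the build disk when
// `copies` full-size copies of it must coexist, capped at the deploy limit.
// Simplifying assumption: every copy is the same size as the site itself.
func maxSiteGB(freeGB, deployLimitGB float64, copies int) float64 {
	fit := freeGB / float64(copies)
	if fit > deployLimitGB {
		return deployLimitGB
	}
	return fit
}

func main() {
	// Three copies today: .git, the checkout, and public/.
	fmt.Printf("copying:    %.1f GB\n", maxSiteGB(22, 10, 3)) // 22/3, about 7.3
	// With hard links, public/ no longer takes extra space.
	fmt.Printf("hard links: %.1f GB\n", maxSiteGB(22, 10, 2)) // 22/2 = 11, capped at 10
}
```

The 7.3GB figure is an upper bound; once you allow for the generated HTML and the build tooling itself, you end up a little under 7GB in practice.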

There are also time limits for the build process; I suspect that hard links will be faster because of the reduced I/O.

Currently they won’t, because the file synchronisation code checks files byte-for-byte to determine whether they’re the same. If we fix it so it knows that two paths which are hard- or symbolically linked refer to the same file, you’ll get the performance improvement you were hoping for.

Remember that this is a brand new environment, /public starts out empty for the build process, so there is nothing to compare.

Duh. Yes, of course.

Right off the bat, without much thought about what all gets cleaned up or cached, or the memory or performance footprint:

What about an option --mountStatic which would make the files accessible to hugo and hugo server?

then just publish the site and the static folder

Maybe technical nonsense :slight_smile: