I think I'm out of memory, again. 50k site build failing on v115.4 to v118.2 with 300GB of memory used

I’m somewhat at a loss presently as to how to proceed with this.

I have an internal Hugo-built site (using a lightly modified Docsy theme) that built successfully on the current hardware a few weeks ago. After some changes I thought were inconsequential - moving /content/ to /docs/content/en/ and disabling Lunr offline search - I now find that I can't build the site any more: after 50-80 minutes of processing, the Hugo process is killed (presumably by the OS).

I’ve talked about memory issues before (What is --printMemoryUsage detailling about? - #6 by bep - and I could really have used that output counter while gathering stats today, Bep :wink: ) and thought I had this problem solved, given that the site was building a few weeks back.

I tried three runs on my machine using the latest 118.2 build, including one after a clean restart with no user apps started, on my iMac 3.8 GHz 8-Core Intel Core i7 with 128 GB 2667 MHz DDR4 (macOS 13.5.1). Each one failed, with memory usage nearing 300GB. The latest example showed:


… and took around 50 minutes to run to failure. I ran most tests with the following options: hugo -v --debug --logLevel debug --printPathWarnings --templateMetrics --templateMetricsHints --printMemoryUsage (and note that not a lot of information comes out during the build even with all of these turned on), but I didn't get the final results as it ended with:

[1] 15288 killed hugo --printPathWarnings --templateMetrics --templateMetricsHints

… as did all the other attempts. However, I wanted to understand a bit about where and when the memory was being used (hoping the debug info might point me at which templates to review), so I ended up graphing the memory outputs like so:

The X axis is a counter for each set of memory outputs printed to the terminal, and the Y axis is a log scale of values in MB. Unfortunately I don't have any better timing metric than the count of memory statements to help understand what's happening here. For about 60% of the run, memory usage is low - less than 100 MB of system memory. (I've come to ignore the TotalAlloc values and look at Sys - which tends to track what Activity Monitor reports as Virtual Memory Size - and at Alloc, which tracks Real Memory Size.) I ran and graphed it again and got slightly different results, but a similar pattern:

This run lasted longer before it failed - I can’t explain why - but fail it did.
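A side thought on capturing data from these killed runs: since the process dies before the final metrics print, the periodic --printMemoryUsage lines can be teed to a file as they appear, and the OS can report peak resident size even when the build is killed. This is only a sketch - the log filename is arbitrary and I'm assuming the stock BSD time that macOS ships at /usr/bin/time:

```
# Preserve the incremental memory output even if the process is killed,
# and have the OS print max resident set size and page-fault counts at exit.
/usr/bin/time -l hugo --printPathWarnings --printMemoryUsage 2>&1 \
  | tee "build-$(date +%Y%m%d-%H%M%S).log"
```

That at least leaves a timestamped record per run to graph from, rather than relying on terminal scrollback.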

Since I thought I'd had success with earlier releases, I attempted to run the same build on 116.0 and 115.4, as I recalled these had previously been used on this system (Homebrew updates being intermittent here, I haven't caught every Hugo release, and Homebrew only installs the latest version each time).

116 came back with these outputs:


and a plot of:

Whilst 115.4 gave me:


and a plot of:

116 had the longest run time to failure - lasting almost a third longer than 118; 115.4 also lasted longer than 118, but started to gobble up memory much earlier than either of the later versions.

In all cases, the jump in memory use from <1GB to >100GB happens very quickly, and from there onwards things get noticeably slower as virtual memory kicks in hard and page faults rise rapidly as data is paged back in… but I don't understand what's happening in the build to explain this pattern, or what I can do with this knowledge, so I'm hoping someone here can advise please?

I haven’t had the time yet to revert the content (5.8GB of Markdown text with no images) back to the last known working build to test against these failures - but I’m not aware of any reason why this current content shouldn’t be able to build successfully on this hardware either.
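When I do get to that revert, the idea is simply to rebuild against the content repo as it stood at the last known-good build. A rough sketch, assuming the generated content lives in its own git repository under docs/content/en (the path and the commit hash are placeholders):

```
# List recent commits of the content repo to find the last one that built OK.
git -C docs/content/en log --oneline

# Check the whole tree out at that commit, rebuild, then return to the branch tip.
git -C docs/content/en checkout <known-good-sha>
hugo
git -C docs/content/en checkout -
```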

Insight and suggestions welcome! Thank you for your time and efforts - I hope mine weren’t in vain, as this reporting has taken over 6 hours to do today.

Carl

Does anyone have a guide on the process pipeline for the build?

I can't spot one in the published docs, and I'd rather not have to work my way through the codebase to determine one myself. I'm ultimately looking for clues as to where my memory explosion occurs, so that I can then dive into how to fix it.

Is this a Mac-only issue? It looks to me like Go is failing to recognize that it is near the memory limit (once it is) and keeps adding goroutines (Hugo uses goroutines/parallelization extensively), exceeding physical memory. This may be more of an OS-specific issue in Go than in Hugo as such.
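If it is the runtime over-committing, one cheap experiment (assuming your Hugo binary was built with Go 1.19 or newer - hugo env reports the Go version) would be to give the Go runtime a soft memory ceiling and a more aggressive GC via its standard environment variables, and see whether the build respects it or blows straight past it:

```
# GOMEMLIMIT (Go 1.19+) is a soft cap on the Go runtime's total memory;
# GOGC below the default of 100 makes the garbage collector run more often.
# 96GiB and 50 are illustrative values - pick something well under physical RAM.
GOGC=50 GOMEMLIMIT=96GiB hugo --printMemoryUsage
```

If memory still climbs far beyond the limit, that would suggest the live heap genuinely is that large (everything being held for rendering), rather than the collector just being slow to return memory to the OS.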

Interesting… maybe? I'm trying to rustle up a Linux box with sufficient RAM+VM so that I can test there, but my general experience to date has been that macOS memory management is more effective than Linux memory management. It's worth testing, though it will take me a wee while to accomplish.

I don't have access to a suitable Windows setup to try as a third case, though :frowning:

I would try the build without the templateMetricsHints flag. It has doubled build times for me.

I have seen memory skyrocket similarly to your graphs, but have just killed the build. It has always been a bad template on my part - infinite recursion or something.

You mention 5.8GB of markdown - what exactly is in these files? And what kind of templating is going on?

The build fails with zero flags enabled - I have been trying lately to use the hints and memory flags to give me some clues as to why the build just fails :slight_smile: I can confirm that the time to failure isn’t markedly quicker with the flags off though.

I wish there was a way to get even partial or intermediate outputs from these flags (like how the memory one outputs periodically) to give me some clues. After all, this has worked before, and I can’t see any material or obvious reason why recent attempts have failed.

As to what generates this…

This site is an example of an automated documentation setup for system configuration data. A (growing) number of scheduled scripts run to extract configuration data from a key system (in this example, a patient record system), which is held in that system's database. The table data is then parsed and annotated by the scripts to generate a Markdown-formatted report of the current configuration. As we have multiple independent instances of this system running, comparing configuration between systems and tracking changes is a constant operational issue - and one that isn't supported by the vendor - so we have these processes logging to a git repository, which gives us a change history of the Markdown files.

I added one of the largest (and key) configuration tables (essentially the core lookup table for reference values used throughout the platform) and this caused the site size to balloon from a couple of hundred MB to nearly 6 GB.

To give some numbers: there are approximately 410k records in what is essentially a KeyValue table (though with several extra columns for indexing, activation status, and update version metadata), and 4.2k grouping records in a related table, for a single system domain. We're tracking 4 domains presently (with another 4 to be added eventually). A typical KV record could be 211 bytes of raw data (direct from the DBMS), which for 4 domains means 0.82 KB of data (then multiply that by 410k). After transforming into Markdown text with some basic analysis, this turns into (5,758 bytes) 5.62 KB of text data.

Ignoring how the files are grouped (e.g. there being 100 records per "page"), the equivalent HTML output for that specific record data, after Hugo has combined it with the modified Docsy theme (linked in the OP), comes to (14,921 bytes) 14.57 KB of HTML data.

If you did the math (5.62 KB × 410k records ≈ 2.2 GB), you'd have guessed that the raw MD files would take up about 2.25 GB of space - but because that's just a sample record, it isn't quite as representative as I might like. In reality the KV data takes up 5.5 GB of space, and the grouping records another 200 MB.

For the time being we’ve had to pull this part of the scan out of the build process so that we can work on adding other categories of configuration data (and allow the other ones we’ve already built to be published and updated again).

In what little spare time I (don't) have, I'm looking at:

  • Alternative themes again - not a pleasant prospect, as the migration from Material for MkDocs to Hugo + Docsy wasn't trivial: we rely on a range of extensions in other builds for diagram support, formatting callouts, image galleries, etc., which means the Markdown isn't fully portable (the joys of extensible Markdown)

  • Trying to find the upper limit on the amount of data we can include from the current build before things fail (not easy given the data is autogenerated and there are a lot of hyperlinks between pages - a rough bisection sketch is below this list)

  • Trying to coerce our build system into letting me push this via a specific runner that has been spec'd up to a similar level as my desktop machine (i.e. the Linux memory test suggested above)

  • Restructuring the build for a fourth time to use fewer include/tmp files but increase the number of pages/files in a given directory - problematic, as it breaks data history in git, changes URLs, and causes directory browsing issues once there are >10,000 files in a folder.

  • And if I do stay committed to Docsy, then I'll eventually have to do some kind of deep dive to understand how the template process truly works, and do a manual walk-through and debug to see if there are efficiencies I can find that the Docsy community hasn't already found (I don't expect I'll win that one)
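For the second bullet (finding the upper limit), the rough plan is a crude bisection: add the generated sections back one directory at a time and note which addition first tips the build over. Purely a sketch - "staging" and "docs/content/en" below are placeholders for wherever the generated output and the content tree actually live:

```
#!/usr/bin/env bash
# Crude bisection sketch: re-add generated content one directory at a time
# and record which addition first makes the build fail.
# "staging" and "docs/content/en" are placeholder paths.
set -u
for dir in staging/*/; do
  cp -R "$dir" docs/content/en/
  count=$(find docs/content/en -name '*.md' | wc -l)
  echo "=== building with $count markdown files (after adding $dir) ==="
  if ! /usr/bin/time -l hugo --quiet; then
    echo "Build failed after adding $dir"
    break
  fi
done
```

It won't pin down an exact page count, but it should at least bracket how much of the KV data the current setup can swallow.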