25.000+ docs, 22 taxonomies, 20 minutes

Is Hugo capable of generating a 25.000+ docs website???
Well. Until this evening I was not convinced. But now…
Before converting to Hugo last year, I found the limit in Hexo to be approx. 1200 documents. Furthermore, it took about 1.5 to 2 hours before Hexo crashed.
And one of my friends was a little sceptical too. He’s using Jekyll, and he was quite sure that, if Jekyll should ever manage to spit out a 25.000 docs website, it would need several hours to do so. Hugo is likely to be the only static site generator with a chance to succeed in this, he comforted me…
So. After having extracted 25.953 docs across 4 websites from Lotus Notes and cleaned up a lot of show-stopper characters from the taxonomies, I hit hugo -d /w/uv/public and crossed my fingers.
Please refer to the headline for the result.
Really impressive. I’m short of words.
Static site generators are most suitable for personal blogs… Get outta here.
Hugo is scalable beyond my wildest expectations. It’s CMS big time. It’s numero uno, and it’s a pleasure to work with as well.
Thank you so much to everyone who has contributed to this potent masterpiece.

Jan@JLKM1 MINGW64 /f/data/uv
$ hugo -d /w/uv/public
Started building sites ...
Built site for language da:
0 draft content
0 future content
0 expired content
25953 regular pages created
62720 other pages created
0 non-page files copied
51449 paginator pages created
2495 landenoegleordpar created
1019 landenoegleord created
2918 landeemneord created
146 emneord created
189 maaned created
6546 skribenter created
18 aar created
2495 landenoegleordparomvendt created
6 dokumenttype created
13 aspekter created
556 byer created
252 regioner created
3402 personer created
16 noegleord created
4 kategorier created
438 kilder created
188 lande created
2292 steder created
8163 relaterede created
3 originalsprog created
175 noegleordpar created
total in 1178375 ms

Note that “other pages” includes all of the taxonomy pages, twice (HTML and RSS). I’m surprised it takes so long to run, though, even with that many taxonomies; it’s only 3.5 times the size of my site but takes 70 times as long to render. I wonder if the bottleneck is memory or I/O. (assuming reasonably comparable hardware, in my case a 3-year-old MacBook)
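As a rough sanity check of that, the per-taxonomy counts in the output above sum to roughly 31,300 terms; doubled for HTML + RSS, that is about 62,700, which accounts for nearly all of the 62,720 “other pages created” (the small remainder is presumably the home, section, and taxonomy list pages).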

Also, that’s a hell of a lot of paginator pages. :slightly_smiling:

If you don’t need RSS feeds for each taxonomy term, you can speed things up by adding this to config.toml:

[outputs]
home = ["HTML","RSS"]
taxonomy = ["HTML"]
taxonomyTerm = ["HTML"]

Update: Oh, I see where the runtime comes from now. I used my random-hugo-blog script to generate enough entries to reach your scale, including the large number of distinct terms in the taxonomies, and watched hugo chew up 6 GB of RAM and 3 cores for A Very Long Time. 25,000 MD files is no big deal; 6000+ terms in a taxonomy is.

-j

That’s performance right there. :wink:

You should write up a case study.

I hadn’t seen that coming… I found 20 minutes for 25.000+ docs to be fast…

Perhaps my modest expectations stem from having some rather old hardware:

My setup:
A five-year-old ASUS laptop (i7 though...) with Windows 10.
RAM memory: 16 GB
Available RAM memory: 13,3 GB
Virtual memory: 42,1 GB
Available virtual memory: 38,3 GB
Pagefile: 26,1 GB

Please let me know, anyone, if I’ve missed something in those settings, or if a RAM upgrade to 32 GB is worth trying.

Thank you also for bringing my attention to the new custom output formats. I’ll definitely give them a shot.

You should get some perspective. Those numbers are very, very impressive any way you look at it. With this amount of data Hugo is, I’m guessing, mostly memory constrained, so some kind of streaming approach would have to be implemented if we want to improve.

It looks like it’s the number of distinct terms in each taxonomy that does it. I can render 25,000 regular pages in a minute or so, but once I added a few randomly-generated taxonomies with around a thousand terms each, Hugo was using 6GB of RAM and pounding on three cores for 23 minutes.

For your amusement, here’s what it looked like when it finished:

% time hugo
Started building sites ...
Built site for language en:
0 of 2 drafts rendered
0 future content
0 expired content
25298 regular pages created
14788 other pages created
4 non-page files copied
516634 paginator pages created
10 rushed created
10 gruel created
10 unfattable created
10 agglomerator created
17 wordplay created
1200 unqualifiedness created
10 pneumological created
10 assentatious created
10 kerogen created
10 trichinous created
10 philosophism created
10 bemist created
8 pecunious created
10 biconcave created
10 vulsinite created
1200 conspiratress created
10 paraaminobenzoic created
43 categories created
10 hydrachnid created
5 antrophore created
1200 unfreely created
10 archdepredator created
10 semirare created
1200 postwoman created
10 clovene created
6 series created
10 nonbilabiate created
10 urohematin created
659 tags created
10 diosmotic created
800 tenderness created
800 cothy created
total in 1365380 ms

real	22m48.707s
user	70m49.571s
sys	3m18.912s

(yes, it created 559,269 files in public; thank goodness for SSDs!)

-j

Remind me to tell you sometime about collecting a half-billion lines of syslog data per day from 750,000 hosts, all sent to a single IP address. And storing several years of it in a searchable format.

My Hugo site has just under 7,500 content pages, and builds in 13 seconds. As you’ll see in my other reply, I dug into exactly why his build is so much slower, and the answer appears to be “number of distinct terms in the taxonomies”. That builds a much larger in-RAM data structure, and spends a lot of time searching through it.

-j

Thanks for those numbers.

Two questions:

  1. “516634 paginator pages created” looks odd. What is your paginate setting?
  2. Did you add the taxonomy term to all or just some pages?

Of course, tagging pages at “random” will drive up the page count, but we may have an unneeded quadratic loop in there somewhere that gets really visible with these numbers. I’ll have a look into it.

  1. I had paginate=10 for my test site.

  2. The number of pages is huge because I added the random-taxonomy generator to my wikiblog script in a hurry, and for a taxonomy with N terms, it added random(N-1) of them to each article, so many MD files had hundreds of terms in their front matter. And I only generated 1,000 distinct articles, which I then copied 17 times into sub-directories and added to my real blog entries. As a result, despite having far fewer total terms than his site, I had a lot more articles per term, generating a lot of pages (rough math below).
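For anyone wondering how the paginator count gets that big: with the default pagination setting, i.e.

paginate = 10

in config.toml, a theme that paginates its taxonomy lists splits every term’s list into pages of 10 entries, so the paginator total is roughly the sum of ceil(articles-per-term / 10) over all terms (plus the paginated home and section lists). Tagging each article with hundreds of random terms multiplies the article/term memberships, which is presumably how it reached 516,634 paginator pages from only ~25,000 regular pages.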

I suspect his relaterede and skribenter taxonomies have only a few terms per article, but the sheer size of them makes the results comparable to my hacked-up example.

-j

OK, 550,000 pages rendered in 23 minutes isn’t bad. There is room for improvement, of course, but from my perspective, comparing it to the other static site generators, it’s pretty darn good.

I’ve done some more structured testing (with default pagination). I created 1,000 random articles with simple tags/categories, and copied them into 20 different sections, for a total of 20,000 content files. Then I wrote a standalone script that would generate X taxonomies each containing Y terms, and insert 1-X random taxonomies into each article, each with 1-5 random terms.

Without the random taxonomies, build time was 268956 ms.

With one 1,000-term taxonomy, it was 290217 ms.

With ten 1,000-term taxonomies, it was 332273 ms.

With one 10,000-term taxonomy, it was 316232 ms.

With four 10,000-term taxonomies, it was 427258 ms.

With four 10,000-term taxonomies and 6-10 terms/taxonomy, it was still only 511538 ms:

0 draft content
0 future content
0 expired content
20000 regular pages created
80088 other pages created
20 non-page files copied
75566 paginator pages created
10000 drove created
9999 pneumonographic created
10 categories created
10 tags created
9999 pleasingness created
9999 smidgen created
total in 511538 ms

So I decided to go for broke, and generated 20 10,000-term taxonomies with 1-5 terms/taxonomy. The build has been running for 50 minutes so far, using only a single core and 4.2 GB of RAM, not spinning up the fans, and has only written out the static files.

If I’m reading the stack trace correctly, only one thread is active, and it’s spending all of its time in assemble().

Update: Final results after 68 minutes:

0 draft content
0 future content
0 expired content
20000 regular pages created
371332 other pages created
0 non-page files copied
207794 paginator pages created
9253 formalesque created
9211 ankyloglossia created
9285 accidie created
9294 cholestanol created
9291 hala created
9280 undisgraced created
9273 brocho created
9270 subsist created
9252 featherless created
9275 turner created
9290 unawfully created
9280 overwalk created
9300 dicker created
9246 electoral created
9302 antalkali created
9296 overdaintily created
9284 tomeful created
9316 extrafloral created
9322 coruscation created
10 categories created
9283 scranny created
10 tags created
total in 4073392 ms

So, 5x the number of taxonomies/terms, 10x the runtime, and most of that was spent in a single-threaded routine that was neither reading from nor writing to the disk.

-j

@jgreely, would you care to share your test generator script?

Sure: taxonomies.pl.

Usage is simple: feed it a bunch of filenames on STDIN, and it will add random taxonomies to their TOML front matter. So, to create 3 taxonomies with 1000 terms each, and then add 1-3 of them with 1-5 randomly-selected terms to each article:

find content -name '*.md' | taxonomies.pl -T 3 -t 1000 -m 5
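For illustration, an article’s front matter might then end up looking something like this (the taxonomy names are from the build output above; the term values and other fields are made up):

+++
title = "Sample article"
date = "2017-05-01T12:00:00Z"
tags = ["example"]
unqualifiedness = ["term0042", "term0977"]
conspiratress = ["term0013"]
+++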

The thousand content files that I generated with my wikiblog.sh script are here (5MB tarball). I just copied them repeatedly into different sections to increase the article count, and then ran the taxonomy-adder on the results.

-j

Holy shnikes. Hugo is blisteringly fast.

@jgreely I’d be curious to see something more real-world-ish; e.g., 10k pages, 10 sections, 5 taxonomies with maybe 50 terms each (this would be a formidable group of metadata to manage). Also, what’s the templating like?

I seem to recall @budparr was saying he was working on a decent-sized site with some complex templating. Maybe he can add some insight into an example in the wild.

Done.

Without random taxonomies:

0 draft content
0 future content
0 expired content
10000 regular pages created
66 other pages created
0 non-page files copied
11128 paginator pages created
10 categories created
10 tags created
total in 81768 ms

Adding 5 taxonomies of 50 terms, 1-5 tax/article with 1-5 terms/tax:

0 draft content
0 future content
0 expired content
10000 regular pages created
576 other pages created
0 non-page files copied
18669 paginator pages created
50 psychological created
50 loudish created
50 bullbaiting created
10 categories created
10 tags created
50 pseudomodest created
50 unerrable created
total in 91408 ms

config.toml:

languageCode = "en-us"
title = "5 random taxonomies"
baseURL = "https://example.com/"
theme = "mt2-theme"

[taxonomies]
category = "categories"
tag = "tags"
pseudomodest = "pseudomodest"
unerrable = "unerrable"
psychological = "psychological"
loudish = "loudish"
bullbaiting = "bullbaiting"

I used the (as-yet-unpublished) theme for my blog, because it paginates sections and taxonomies. If you have a specific theme you think would work well for testing, I can try it. It’s a bit painful to wade through the gallery looking for features like pagination (for instance, I tried hugo-octopress, but all it generated for taxonomies was RSS feeds, so it only created 100 paginator pages and finished in 20 seconds).

-j

For the record, on an Amazon r3.2xlarge (64 GB of RAM, 8 CPUs):

Built site for language en: 
0 draft content 
0 future content 
0 expired content 
220341 regular pages created 
24 other pages created 
0 non-page files copied 
10 paginator pages created 
6 tags created 
0 categories created 
total in 209277 ms

And these are not lorem ipsum pages but real pages (with real content); some pages are built from a 300 KB FML JSON file.

@jonathanulco that is really cool, and it would be really interesting if you could elaborate a little about what kind of project this is.

I work for a little startup that creates a proximity social network; these are the external pages created by users of the service.

P.S.: To improve my service, I’m waiting for incremental builds in Hugo :wink:

I received the bill from Amazon: $0.67

What’s the bill for?