[SOLVED] The build process is slower on my production server?

Hey guys. I’ve finally relaunched my project and it’s running on Hugo now - https://www.kukuruku.co/. It supports authentication, as well as a static commenting system.

There is a problem with the build time. This is what I get when I build locally.

Started building sites ...
Built site for language en:
0 of 36 drafts rendered
0 future content
0 expired content
193 regular pages created
488 other pages created
0 non-page files copied
65 paginator pages created
41 hubs created
430 tags created
total in 460 ms

I’m on a MacBook Pro: 2.5 GHz Intel Core i7, 16 GB 1600 MHz DDR3.

The production server has 2 GB of memory and an Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz, running Ubuntu 16.04.1 x64. The build time there is 1.2-1.3 seconds for the same set of posts/pages, which is almost 3 times slower.


This is the output from top on the production server.

KiB Mem :  2048276 total,   961724 free,   118676 used,   967876 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  1664348 avail Mem 

I would guess that it is mostly an I/O/disk issue.

But I would say that the numbers you get on your MacBook are mighty impressive.


Hm… I did some disk benchmarks, but the results are similar on both machines.

$ time dd if=/dev/zero bs=1024k of=tstfile count=1024 2>&1 | grep sec | awk '{print $1 / 1024 / 1024 / $5, "MB/sec" }'
706.042 MB/sec

1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.57467 s, 682 MB/s

Not a huge difference. Unless I’m doing something wrong. Going to do a few more tests.
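One caveat worth noting: dd measures sequential throughput on one big file, while a Hugo build writes hundreds of small files. A rough sketch of a small-file write test (the paths and sizes here are my own, not from this thread):

```shell
# Rough sketch: time 1000 small (8 KiB) file writes, which is closer to
# Hugo's output pattern than one large sequential dd write.
# /tmp/smallfile-test is an arbitrary scratch directory.
mkdir -p /tmp/smallfile-test
time sh -c 'for i in $(seq 1 1000); do head -c 8192 /dev/zero > /tmp/smallfile-test/f$i; done; sync'
ls /tmp/smallfile-test | wc -l    # number of files written, should be 1000
rm -rf /tmp/smallfile-test
```

On a VPS with slow or contended storage, this kind of test tends to diverge from the dd number much more than sequential throughput does.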

This is off-topic, but I’d actually love to hear what you’re doing for your static commenting system!

I leveraged the Hugo Data Files. Essentially, comments are in JSON files.
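Not OP’s exact setup, but a minimal sketch of the idea: each comment is a small JSON file under data/, grouped by post slug (the layout and field names here are hypothetical):

```shell
# Hypothetical layout: data/comments/<post-slug>/<comment-id>.json
# Everything under data/ becomes available to templates via .Site.Data.
mkdir -p data/comments/my-first-post
cat > data/comments/my-first-post/1.json <<'EOF'
{
  "author": "jane",
  "date": "2017-01-15T10:30:00Z",
  "body": "Great post!"
}
EOF
```

A partial can then range over the entries in .Site.Data.comments for the current post’s slug and render each comment’s author, date, and body.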

What does hugo --stepAnalysis show? And what does hugo version report?

To take disk write IO out of the equation, use the --renderToMemory option.
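A quick way to compare the two, sketched here as a shell snippet (the guard is my addition, in case hugo is not on the PATH):

```shell
# Compare a normal build against an in-memory render to isolate disk-write cost.
if command -v hugo >/dev/null 2>&1; then
    time hugo                    # normal build, writes public/ to disk
    time hugo --renderToMemory   # same build, rendered output stays in memory
else
    echo "hugo not installed"
fi
```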

$ hugo version
Hugo Static Site Generator v0.18.1

$ hugo --stepAnalysis 
Started building sites ...
Go initialization:
	6.5932ms (7.298323ms)	    0.71 MB 	8936 Allocs
initialize & template prep:
	11.695578ms (19.27632ms)	    1.08 MB 	15412 Allocs
load data:
	8.088502ms (27.594029ms)	    1.61 MB 	11548 Allocs
load i18n:
	65.65µs (27.708249ms)	    0.00 MB 	71 Allocs
read pages from source:
	54.645406ms (82.665565ms)	   19.26 MB 	91456 Allocs
convert source:
	78.146151ms (161.252195ms)	   44.12 MB 	207421 Allocs
build Site meta:
	32.567869ms (193.991574ms)	    2.58 MB 	106266 Allocs
prepare pages:
	101.938414ms (296.387193ms)	   18.09 MB 	63797 Allocs
render and write aliases:
	25.002µs (296.997514ms)	    0.00 MB 	0 Allocs
render and write pages:
	718.678644ms (1.016573599s)	  198.60 MB 	1616360 Allocs
render and write Sitemap:
	22.611793ms (1.039752773s)	    1.49 MB 	44289 Allocs
render and write robots.txt:
	10.231µs (1.040043935s)	    0.00 MB 	8 Allocs
render and write 404:
	818.047µs (1.041139946s)	    0.07 MB 	1506 Allocs

Looks like “render and write pages” is where it spends most of the time. Locally, this step takes ~300-350 ms.

--renderToMemory didn’t make much of a difference.

Here is what I get locally

$ hugo --stepAnalysis
Started building sites ...
Go initialization:
	8.298969ms (8.738369ms)	    1.12 MB 	12442 Allocs
initialize & template prep:
	7.400898ms (16.215953ms)	    1.09 MB 	15422 Allocs
load data:
	4.683847ms (20.935201ms)	    1.61 MB 	11556 Allocs
load i18n:
	49.219µs (21.011377ms)	    0.00 MB 	71 Allocs
read pages from source:
	26.509234ms (47.590804ms)	   19.52 MB 	92391 Allocs
convert source:
	27.206736ms (74.887976ms)	   45.24 MB 	211633 Allocs
build Site meta:
	24.819254ms (99.778819ms)	    2.60 MB 	107165 Allocs
prepare pages:
	52.911194ms (152.77138ms)	   19.15 MB 	63972 Allocs
render and write aliases:
	9.792µs (152.841306ms)	    0.00 MB 	0 Allocs
render and write pages:
	316.866188ms (469.873753ms)	  210.21 MB 	1625366 Allocs
render and write Sitemap:
	15.96715ms (485.990984ms)	    1.51 MB 	44504 Allocs
render and write robots.txt:
	11.2µs (486.084351ms)	    0.00 MB 	8 Allocs
render and write 404:
	536.745µs (486.697263ms)	    0.07 MB 	1498 Allocs

Obvious questions:

  1. I see in the docs that Hugo is multi-threaded. This CPU has 12 cores, but don’t they have slower single-thread performance than i7s (at least the latest ones)? Does Hugo make use of all 12 cores?

  2. Is this server dedicated, i.e. does it have guaranteed reserved resources?

Hard to say.

In general, Hugo spreads most of its work to (4 x GOMAXPROCS) goroutines.

GOMAXPROCS seems to be empty on my macOS (will have to investigate this), so I get 4 goroutines, which seems fairly optimal on my MacBook.

But what you can try is to set GOMAXPROCS to something higher on startup, say:

env GOMAXPROCS=3 hugo

Or something similar, which should get your cores pretty busy.
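For instance, one way to match GOMAXPROCS to whatever the box actually reports (the nproc/sysctl detection is my addition, not something from this thread):

```shell
# Sketch: detect the visible CPU count and hand it to hugo via GOMAXPROCS.
# nproc is the Linux tool; sysctl -n hw.ncpu is the macOS fallback.
cores=$(nproc 2>/dev/null || sysctl -n hw.ncpu)
echo "visible CPUs: $cores"
# Guarded so the build step is skipped where hugo is not installed.
if command -v hugo >/dev/null 2>&1; then
    env GOMAXPROCS="$cores" hugo
fi
```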

This is a DigitalOcean VPS, in my case on a 2 CPU plan.

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz
Stepping:              2
CPU MHz:               1799.998
BogoMIPS:              3599.99
Virtualization:        VT-x
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0,1
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm vnmi ept fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat

That’s your answer. DO uses the KVM hypervisor to segment resources for each droplet. You don’t get any kind of guaranteed 100% use of resources and can suffer from ‘noisy neighbours’.

The way they word it, “2 CPU plan”, is a little misleading, but that’s industry standard.

Apologies if you know this already, but basically they’re virtual cores, meaning you have theoretical access to the equivalent of 2 physical cores on the underlying bare metal. When you need some processing done, the hypervisor allocates your droplet’s request for CPU time to any available free CPU. It can take a little time for that request to be fulfilled, and a little more to get the ‘answer’ back from the CPU.

So the hypervisor adds a bit of overhead, and the other droplets on the same underlying bare metal take up some CPU resources too, so you won’t get the same performance as you would on your own dedicated hardware with no hypervisor or other droplets.

Add to that the fact that your laptop is likely very good, and that just about covers why you’re getting the results you’re getting.

This makes total sense, thanks for clarifying it all. Agreed, not much I can do here. Thanks everyone.