Configurable cache TTL for remote content?

cmal · January 27, 2018, 2:22am

The docs do not specify how long the cache is kept for data-driven content. In my experience (backed by reading the source code), it seems the cache never expires.

Right now I’m adjusting my cron script to refresh the cache every now and then (with --ignoreCache), but I feel it would be a neat feature to have a cache timeout configurable from the site config file.

My current use case is building a website regularly to display upcoming events in the sidebar, but I’m guessing many people making API calls using get{JSON,CSV} have plenty of different use cases for a configurable cache TTL.

Thanks for maintaining Hugo! It’s a really cool thing to work with

PS: In case anyone’s wondering why not use a fancy JS/WASM applet to make the API call on the client side, my reason is twofold. First, sparing the API server some useless CPU and network load. Second, being able to display the information with JavaScript disabled in the browser.

cmal · March 17, 2018, 6:42pm

So, since the post didn’t gather a constructive discussion, I tried to implement something like this on my own.

What I did was implement template functions called getCachedJSON and getCachedCSV that act as wrappers around getJSON and getCSV and take an extra integer as their first parameter.

This first parameter is the time-to-live in seconds for the data cache. That means upon loading the content for the first time, it will be stored for ttl seconds. When you try to access it again, if ttl seconds have passed, the cached entry will be purged and getJSON/getCSV will be called to fetch fresh content.

Of course, if --ignoreCache is passed to hugo, cache will be dismissed in any case.
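A minimal sketch of the expiry rule described above, written in Go. The names isFresh, cached and getCached are hypothetical and are not taken from the actual patch or from Hugo itself; an in-memory map stands in for the on-disk cache, and the clock is passed in so that the behaviour is easy to check.

```go
package main

import (
	"fmt"
	"time"
)

// isFresh reports whether an entry of the given age may still be served.
// With a TTL of 0 nothing is ever fresh, so content is refetched every time.
func isFresh(age, ttl time.Duration) bool {
	return age < ttl
}

// cached is a hypothetical cache entry: the payload and when it was fetched.
type cached struct {
	data      string
	fetchedAt time.Time
}

// getCached serves the stored copy while it is fresh; otherwise it calls
// fetch and stores the result. ignoreCache mimics the --ignoreCache flag.
func getCached(cache map[string]cached, url string, ttl time.Duration,
	ignoreCache bool, now time.Time, fetch func(string) string) string {
	if e, ok := cache[url]; ok && !ignoreCache && isFresh(now.Sub(e.fetchedAt), ttl) {
		return e.data
	}
	data := fetch(url)
	cache[url] = cached{data: data, fetchedAt: now}
	return data
}

func main() {
	calls := 0
	fetch := func(url string) string { // stand-in for the real network request
		calls++
		return fmt.Sprintf("payload #%d", calls)
	}
	cache := map[string]cached{}
	url := "https://example.com/events.json" // placeholder URL
	start := time.Now()

	getCached(cache, url, time.Hour, false, start, fetch)                     // miss: fetches
	getCached(cache, url, time.Hour, false, start.Add(30*time.Minute), fetch) // still fresh
	getCached(cache, url, time.Hour, false, start.Add(2*time.Hour), fetch)    // expired: fetches
	fmt.Println("network fetches:", calls) // prints "network fetches: 2"
}
```

Only the first and third calls reach the fetch function, which is the behaviour the TTL is meant to give across repeated builds.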

I believe this option is useful when you rebuild your site often and don’t necessarily want to fetch external content every time (which could trigger anti-spam/anti-DDoS measures).

I’m currently unable to run the tests I carelessly wrote. It says Skip Check on go1.8.1 although I upgraded my golang setup and go version says go version go1.10 linux/amd64. Sorry about that, but it’s my first time dealing with go and I’m not really comfortable with the toolset just yet. If you have any idea how to address this issue, let me know

However, I did some template-side testing with Wireshark in the background and it appeared to have the desired effect of sending network requests only after ttl seconds have passed (after the first query has been completed, of course).

Example usage:

{{ $days := getCachedJSON 3600 $url }}

This would keep the content of $url for one hour throughout as many builds as you can imagine.

What do you think about this? Is this a desired feature? Am I implementing it properly? All comments welcome

kaushalmodi · March 26, 2018, 2:40pm

I’d like to have this integrated into hugo. @bep What is the procedure? Should @cmal open a “proposal” issue on the hugo repo to get this rolling?

Without a solution like this, getJSON does not fetch new content while testing it out on localhost (hugo server).

Jonathan_Griffin · March 26, 2018, 2:45pm

I agree this would be useful. At the moment I have to clear the Cache folder every time I want to update the remote csv data.

cmal · March 26, 2018, 4:32pm

Well then we should start arguing about how to implement this feature

My original proposal was to have the cache TTL configurable from the site config. However, I quite like the idea I implemented with getCachedJSON that different types of API calls may have their own TTL.

Actually, I think both are complementary ideas. In my opinion, we should be able to say from the templates for how long we’d like the content to be kept, but a setting in the site config should be able to override this, providing a maximum TTL (“refresh my cache every X seconds” type of setting) for both getCachedJSON and getJSON.

Please note that in my implementation, giving getCachedJSON a TTL of 0 means the file is always refetched, because by the time you call it again, more than 0 seconds have passed since it arrived in the cache.

That means if the maximum cache TTL setting is enabled, we can use getJSON URL to cache the content for a certain time, or getCachedJSON 0 URL to have it always fresh. And in any case, if --ignoreCache is passed, fresh content is always fetched.

I think such a system would allow for enough flexibility: not caching content indefinitely, and allowing per-template fresh/cached content.

Actually, I’m wondering if caching “indefinitely” (that is, until --ignoreCache is passed) is a bug or a feature. Maybe that would be worth another function like getPermanentJSON for when we know the content is not going to change?

Jonathan_Griffin · March 26, 2018, 7:17pm

Yes, but it is not efficient. For example, with my site extensively using remotely hosted CSV, building with --ignoreCache takes about 60 seconds. If I just delete the cache folder and build, it takes just 9 seconds. Presumably the latter is more efficient because it builds the cache at the start of the build, and then uses that cache throughout the build process.

I would love for something like --refreshCache that simply deletes the cache at the start and then builds the site using the new version of the cache that is built. However, in my use case this is only ever needed in local development, as when I deploy via Wercker it builds a fresh cache on every build anyway.

It seems your use case @cmal is a lot more complex than anything I would want, but I can see the logic in it.

bep · March 27, 2018, 8:02pm

Above I see someone mention this in relation to “when I have some changed content I have to delete cache” – which a TTL will not help with. You need some kind of manual trigger.

We have a --gc flag (garbage collect). I added that flag thinking that it could be used for other stuff, too (not just stale images). Cache invalidation may be slightly out of scope, but I hate adding even more flags/commands for this very similar thing.

cmal · March 30, 2018, 2:30pm

I think what you’re looking for is along the lines of adding a single rm to your build script like so:

rm -r /tmp/hugo_cache

In most cases that’s where your cache folder would be (following POSIX norms), unless you have either the $TMPDIR environment variable set, or manually passed a --cacheDir setting to your hugo command.
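The lookup order described above can be sketched in Go as follows. This is an illustration only: resolveCacheDir and clearCache are hypothetical helpers, and the hugo_cache folder name is taken from this post rather than checked against any particular Hugo version. On Unix systems, os.TempDir returns $TMPDIR if it is set and /tmp otherwise.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolveCacheDir returns an explicit --cacheDir value if one was given,
// otherwise a hugo_cache folder inside the system temp directory.
func resolveCacheDir(cacheDirFlag string) string {
	if cacheDirFlag != "" {
		return cacheDirFlag
	}
	return filepath.Join(os.TempDir(), "hugo_cache")
}

// clearCache removes the directory and everything in it. os.RemoveAll
// returns nil when the path does not exist, so a missing cache is fine.
func clearCache(dir string) error {
	return os.RemoveAll(dir)
}

func main() {
	fmt.Println("default cache dir:", resolveCacheDir(""))

	// Demonstrate the removal on a throwaway directory, not the real cache.
	demo, err := os.MkdirTemp("", "hugo_cache_demo")
	if err != nil {
		fmt.Println("could not create demo directory:", err)
		return
	}
	if err := clearCache(demo); err != nil {
		fmt.Println("could not clear demo directory:", err)
		return
	}
	_, statErr := os.Stat(demo)
	fmt.Println("demo directory removed:", os.IsNotExist(statErr))
}
```

Running a step like this before the build gives roughly the "delete the cache at the start, then build" behaviour discussed earlier in the thread.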

Apart from this specific cache refresh problem, do you have opinions on whether and how to implement some form of cache TTL? In my proposal I was trying to maximize the possibilities while remaining 100% backwards-compatible (with regard to how GetJSON/GetCSV and --ignoreCache currently work).

In summary, I was proposing:

- a maxCacheTTL config that, if set, would be the maximum time an item spends in the cache, for the whole site
- GetCachedJSON/GetCachedCSV functions that would take an additional parameter for this item’s specific TTL (not exceeding maxCacheTTL if set), or 0 to force a download of fresh content
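One way to read the two points above is as a single rule that combines the per-call TTL with the site-wide cap. The Go sketch below is an assumption about how that could look, not code from the patch: effectiveTTL and the noLimit sentinel are hypothetical names, and the sketch also applies the cap to calls without a TTL of their own, which this thread leaves open.

```go
package main

import (
	"fmt"
	"time"
)

// noLimit is a hypothetical sentinel meaning "no TTL configured".
const noLimit time.Duration = -1

// effectiveTTL combines a per-call TTL with the site-wide maxCacheTTL.
// A per-call TTL of 0 always wins, forcing fresh content on every call.
func effectiveTTL(itemTTL, maxCacheTTL time.Duration) time.Duration {
	switch {
	case maxCacheTTL == noLimit:
		// No site-wide cap: use the per-call value, which may itself be
		// noLimit (cache until --ignoreCache is passed).
		return itemTTL
	case itemTTL == noLimit || itemTTL > maxCacheTTL:
		// No per-call value, or one above the cap: the cap applies.
		return maxCacheTTL
	default:
		return itemTTL
	}
}

func main() {
	hour := time.Hour
	fmt.Println(effectiveTTL(2*hour, hour) == hour)                     // capped by the site setting
	fmt.Println(effectiveTTL(10*time.Minute, hour) == 10*time.Minute)   // below the cap, kept as-is
	fmt.Println(effectiveTTL(0, hour) == 0)                             // always refetch
	fmt.Println(effectiveTTL(noLimit, hour) == hour)                    // plain call under a cap
	fmt.Println(effectiveTTL(noLimit, noLimit) == noLimit)              // current indefinite caching
}
```

Every line in main prints true. With no cap and no per-call TTL the result stays noLimit, which keeps today’s behaviour and so remains backwards-compatible.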

In retrospect, I think the current behaviour of GetJSON/CSV of keeping content cached indefinitely could still be desirable even with a maxCacheTTL set for GetCachedJSON/CSV. I will think this through some more in the coming days.

I would appreciate comments and feedback from more people. Should I just open an issue on Github?

kaushalmodi · November 5, 2018, 8:37pm

Looks like this issue has gotten some attention:
