Configurable cache TTL for remote content?


#1

The docs do not specify how long the cache is kept for data-driven content. In my experience (backed by reading the source code), the cache never expires.

Right now I’m adjusting my cron script to refresh the cache every now and then (with --ignoreCache), but I feel it would be a neat addition to have a cache timeout configurable from the site config file.

My current use case is building a website regularly to display upcoming events in the sidebar, but I’m guessing many people making API calls using get{JSON,CSV} have plenty of other use cases for a configurable cache TTL.

Thanks for maintaining Hugo! It’s a really cool thing to work with :smiley:

PS: In case anyone’s wondering why I don’t use a fancy JS/WASM applet to make the API call on the client side, my reason is twofold. First, it spares the API server some useless CPU and network load. Second, the information can be displayed with JavaScript disabled in the browser.


#2

So, since the post didn’t gather a constructive discussion, I tried to implement something like this on my own.

I implemented template functions called getCachedJSON and getCachedCSV that act as wrappers around getJSON and getCSV and take an extra integer as their first parameter.

This first parameter is the time-to-live (TTL) in seconds for the data cache. That means that when the content is loaded for the first time, it is stored for ttl seconds. When you try to access it again, if ttl seconds have passed, the cached entry is purged and getJSON/getCSV is called to fetch fresh content.

Of course, if --ignoreCache is passed to hugo, the cache is bypassed in any case.

I believe this option is useful when you rebuild your site often and don’t necessarily want to fetch external content every time (which could trigger anti-spam/anti-DDoS measures).

I’m currently unable to run the tests I carelessly wrote. It says Skip Check on go1.8.1 although I upgraded my Go setup and go version says go version go1.10 linux/amd64. Sorry about that, but it’s my first time dealing with Go and I’m not really comfortable with the toolset just yet. If you have any idea how to address this issue, let me know :wink:

However, I did some template-side testing with Wireshark in the background, and it appeared to have the desired effect: network requests were only sent once ttl seconds had passed (after the first query had completed, of course).

Example usage:

{{ $days := getCachedJSON 3600 $url }}

This would keep the content of $url for one hour throughout as many builds as you can imagine.

What do you think about this? Is this a desired feature? Am I implementing it properly? All comments welcome :slight_smile:


#3

I’d like to have this integrated into Hugo. @bep What is the procedure? Should @cmal open a “proposal” issue on the Hugo repo to get this rolling?

Without a solution like this, getJSON does not fetch new content while testing it out on localhost (hugo server).


#4

I agree this would be useful. At the moment I have to clear the Cache folder every time I want to update the remote csv data.


#5

Well then we should start arguing about how to implement this feature :stuck_out_tongue:

My original proposal was to have the cache TTL configurable from the site config. However, I quite like the idea I implemented with getCachedJSON, where different types of API calls may have their own TTL.

Actually, I think both are complementary ideas. In my opinion, we should be able to say from the templates for how long we’d like the content to be kept, but a setting in the site config should be able to override this, providing a maximum TTL (“refresh my cache every X seconds” type of setting) for both getCachedJSON and getJSON.

Please note that in my implementation, giving getCachedJSON a TTL of 0 means the file is always refetched, because by the time you call it again, more than 0 seconds have passed since the content entered the cache.

That means if the maximum cache TTL setting is enabled, we can use getJSON URL to cache the content for a certain time, or getCachedJSON 0 URL to have it always fresh. And in any case, if --ignoreCache is passed, fresh content is always fetched.

I think such a system would allow for enough flexibility: content is not cached indefinitely, and each template can choose between fresh and cached content.

Actually, I’m wondering if caching “indefinitely” (that is, until --ignoreCache is passed) is a bug or a feature. Maybe that would be worth another function like getPermanentJSON for when we know the content is not going to change?


#6

Yes, but it is not efficient. For example, my site makes extensive use of remotely hosted CSV, and building it with --ignoreCache takes about 60 seconds. If I just delete the cache folder and build, it takes only 9 seconds. Presumably the latter is faster because it populates the cache at the start of the build and then uses that cache throughout the build process.

I would love for something like --refreshCache that simply deletes the cache at the start and then builds the site using the new version of the cache that is built. However, in my use case this is only ever needed in local development, as when I deploy via Wercker it builds a fresh cache on every build anyway.

It seems your use case @cmal is a lot more complex than anything I would want, but I can see the logic in it.


#7

Above I see someone mention this in relation to “when I have some changed content I have to delete cache” – which a TTL will not help with. You need some kind of manual trigger.

We have a `--gc` flag (garbage collect). I added that flag thinking that it could be used for other stuff, too (not just stale images). Cache invalidation may be slightly out of scope, but I hate adding even more flags/commands for this very similar thing.


#8

I think what you’re looking for is along the lines of adding a single rm to your build script like so:

rm -r /tmp/hugo_cache

In most cases that’s where your cache folder would be (following POSIX norms), unless you have either the $TMPDIR environment variable set, or manually passed a --cacheDir setting to your hugo command.

Apart from this specific cache refresh problem, do you have opinions on whether and how to implement some form of cache TTL? In my proposal I was trying to maximize the possibilities while remaining 100% backwards-compatible (with regard to how GetJSON/GetCSV and --ignoreCache currently work).

In summary, I was proposing:

  • a maxCacheTTL config setting that, if set, would be the maximum time any item spends in the cache, for the whole site
  • GetCachedJSON/GetCachedCSV functions that would take an additional parameter for this item’s specific TTL (not exceeding maxCacheTTL if set), or 0 to force a fresh download

In retrospect, I think the current behaviour of GetJSON/GetCSV, keeping content in the cache indefinitely, could still be desirable even with a maxCacheTTL set for GetCachedJSON/GetCachedCSV. I will think this through some more in the coming days :slight_smile:

I would appreciate comments and feedback from more people. Should I just open an issue on GitHub?


#9

Looks like this issue has gotten some attention: