Keeping Algolia up-to-date automatically


#1

I’m not sure if this will be useful to anyone or not, but here it is anyways…

I just added Algolia to a Hugo site I’ve been working on, and I thought my solution for automatically keeping Algolia informed about new content could be useful for others to see.

Basically, I generate some special JSON, then use Algolia’s REST batch API to update Algolia after each build.

Since I already have JSON output for something else, I ended up creating an /algolia.json by adding this to my config.toml:

[outputFormats.Algolia]
    mediaType = "application/json"
    baseName = "algolia"
    isPlainText = true

Then I just had to create a template (themes/$theme/layouts/algolia.json) which outputs data which looks like:

{
  "requests": [
    { "action": "updateObject",
      "body": {
         ...,
         "objectID": "/foo/bar" } },
    { "action": "updateObject",
      "body": {
         ...,
         "objectID": "/foo/baz" } }
  ]
}

Where ... is your data. I found it convenient to just use the URL as my objectID, but anything that uniquely identifies a record works. My template is at https://github.com/factopolis/factopolis/blob/master/themes/factopolis/layouts/algolia.json, but it’s pretty specific to my site so I’m not sure how helpful it would be as a guide.

A simple example to get you started:

{
  "requests": [
    {{ range $idx, $page := $.Site.Pages }}
      {{ if $idx }},{{ end }}
      { "action": "updateObject",
        "body": {
          "title": {{ $page.Title | jsonify }},
          "objectID": {{ $page.URL | jsonify }}
        }
      }
    {{ end }}
  ]
}
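
Because the template assembles JSON by hand, a stray comma or a duplicated URL is easy to miss. Here is a minimal sketch of a sanity check that could run in CI before the upload step. The names validate_batch and load_and_validate are hypothetical, and the check assumes every request is an updateObject, as in the template above.

```python
import json


def validate_batch(payload):
    """Return a list of problems found in an Algolia-style batch payload."""
    requests = payload.get("requests") if isinstance(payload, dict) else None
    if not isinstance(requests, list):
        return ["top-level 'requests' must be a list"]

    problems = []
    seen = set()
    for i, req in enumerate(requests):
        if not isinstance(req, dict):
            problems.append(f"request {i}: not an object")
            continue
        if req.get("action") != "updateObject":
            problems.append(f"request {i}: unexpected action {req.get('action')!r}")
        body = req.get("body")
        object_id = body.get("objectID") if isinstance(body, dict) else None
        if not object_id:
            problems.append(f"request {i}: missing objectID")
        elif object_id in seen:
            problems.append(f"request {i}: duplicate objectID {object_id!r}")
        else:
            seen.add(object_id)
    return problems


def load_and_validate(path):
    """Parse the generated file (raises on invalid JSON) and validate it."""
    with open(path, encoding="utf-8") as f:
        return validate_batch(json.load(f))
```

Calling load_and_validate on the generated file and failing the build when the returned list is non-empty would catch most template mistakes before anything is sent to Algolia.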

Note that if you remove content it won’t be deleted from Algolia. I think you could prepend a { "action": "clear" } to the requests, but removals shouldn’t be very common for me so I’ve left it off for now, and I’ll just clear the index manually when necessary.
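
For anyone who does want the clear step, here is a rough sketch of prepending it to the generated payload before upload. The helper name prepend_clear is made up for illustration, and this assumes the batch endpoint applies operations in the order given, which is worth confirming against Algolia’s documentation first.

```python
def prepend_clear(payload):
    """Return a new batch payload that empties the index before re-adding records.

    Assumption: operations in a batch are applied in order, so the "clear"
    runs before the updateObject requests that follow it.
    """
    return {"requests": [{"action": "clear"}] + list(payload["requests"])}
```

The original payload is left untouched, so the same generated file can still be uploaded without the clear step.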

Then add a few environment variables to your CI, and upload the file with something like:

curl -X POST \
  -H "X-Algolia-API-Key: ${ALGOLIA_API_KEY}" \
  -H "X-Algolia-Application-Id: ${ALGOLIA_APPLICATION_ID}" \
  --data-binary @web/algolia.json \
  "https://${ALGOLIA_APPLICATION_ID}.algolia.net/1/indexes/${ALGOLIA_INDEX_NAME}/batch"
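
If curl isn’t available in the CI image, the same request can be sketched with Python’s standard library. The function names here are hypothetical. The environment variable names and the web/algolia.json path are the ones from the curl command, and the explicit Content-Type header is an addition that the curl version does not send.

```python
import os
import urllib.request


def build_batch_request(app_id, api_key, index_name, body):
    """Build (but do not send) the POST request for the batch endpoint."""
    url = f"https://{app_id}.algolia.net/1/indexes/{index_name}/batch"
    headers = {
        "X-Algolia-API-Key": api_key,
        "X-Algolia-Application-Id": app_id,
        "Content-Type": "application/json",  # not in the curl example
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def upload_batch(path="web/algolia.json"):
    """Read the generated file and send it, using the CI environment variables."""
    with open(path, "rb") as f:
        body = f.read()
    req = build_batch_request(
        os.environ["ALGOLIA_APPLICATION_ID"],
        os.environ["ALGOLIA_API_KEY"],
        os.environ["ALGOLIA_INDEX_NAME"],
        body,
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx, which fails the CI step.
    with urllib.request.urlopen(req) as resp:
        return resp.status, resp.read().decode("utf-8")
```

Calling upload_batch() at the end of the build would then replace the curl step.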

This does hit Algolia a bit harder than necessary, but I don’t think I’ll run into their free limits for quite some time (you basically get 100,000 updateObject actions per month, so you’d need a pretty large and active site to hit the limit even for the free tier). If it does become a problem I’ll probably create a second git repository, have CI for the first repo update a file in the second one, then look at the diff to see what really needs to change and only send Algolia the necessary operations.
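
The diff idea could look roughly like the sketch below. It has not been tried against a real index. It compares the previous build’s payload with the current one and keeps only the records that changed. The function name diff_batches is hypothetical, and using deleteObject for removed pages is an assumption about the batch API rather than something the setup above does.

```python
def diff_batches(old_payload, new_payload):
    """Return a batch payload containing only the operations needed to go
    from the previously uploaded payload to the newly generated one."""
    old = {r["body"]["objectID"]: r["body"] for r in old_payload["requests"]}
    new = {r["body"]["objectID"]: r["body"] for r in new_payload["requests"]}

    ops = []
    # New or modified records: send them again.
    for object_id, body in new.items():
        if old.get(object_id) != body:
            ops.append({"action": "updateObject", "body": body})
    # Records that disappeared from the site: ask Algolia to delete them.
    for object_id in sorted(old.keys() - new.keys()):
        ops.append({"action": "deleteObject", "body": {"objectID": object_id}})
    return {"requests": ops}
```

With the previous algolia.json kept somewhere CI can read it (the second repository, for example), only the result of diff_batches would need to be uploaded, and a build that changes nothing produces an empty request list.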

From there it’s just the standard front-end integration stuff for Algolia, which isn’t too difficult.

The commit for all this on my site is 3a980173a8fa220f1aa534393f57b6e32ded1067, which could be useful if anyone runs into trouble.


#2

A cleaner solution would be to just use the built-in json media type and set the baseName on the output format.


#3

Can you still have a “normal” output JSON if you do that? I need an /index.json, and an additional Algolia-specific JSON somewhere (right now it’s /index.algolia.json, but I don’t care about that).


#4

Set baseName to algolia or whatever (=> algolia.json). The challenge is to avoid naming conflicts – you can mix and match media types and output format definitions in any way you want. And it is, of course, perfectly fine to have multiple output formats sharing one media type.


#5

Edited to reflect that change. Thanks :slight_smile:


#6

Was gonna hack together Lunr with Google Site Search – Then I found out Google is discontinuing it!

Thankfully, Algolia seems promising…


#7

Algolia is pretty good, and one of the most popular solutions nowadays for many docs sites.

Curious, what’s the problem? You can still use Lunr.js. What do you need Google Site Search for?


#8

Although I think Lunr is well suited to finding keywords in large texts, the index file will easily get bloated for large sites. For these types of sites, my opinion is that you need an index engine such as the ones Google and Algolia offer.

The idea was to use Lunr for “suggestions” (running only through page titles) and GSS after hitting enter or pressing the search button. So speed and UX are the reasoning behind this.


#9

Oh so using Lunr just for the “as you type” search and Google for the full deal? I never thought of using them as a combo like that.

Personally, I sometimes hesitate to use Algolia because then you can’t search offline. For example, the CircleCI Docs used to use Lunr for search but then we switched to Algolia. While it’s fast and works, whenever we’re working on the site locally in dev, the search bar returns live results and URLs. So that can’t really be tested anymore.

I wonder what is considered to be “too large for Lunr”. Might be an interesting write up. I agree a site with thousands of pages likely wouldn’t want to use Lunr due to speed.


#10

I think that combo would be optimal. You let Lunr provide suggestions by sifting through tens of thousands of page titles, and a bot-dependent search engine provide results if the user doesn’t get a matching suggestion. Perhaps it could be called a dual-layer search engine – layer one is shallow (titles and/or descriptions only). The second layer searches through full text/pages/articles.

So I agree with you, as a developer you pretty much know what you want to search for offline. In this case Lunr is enough. For non dev-users and larger quantities of text you’d want the second layer.

I have an index with 1000+ titles and URLs. I would guess it would start slowing down at these levels if the full article texts were added to the index.

So, combine Lunr with Algolia? Lunr for realtime suggestions while the user types, and then Algolia after the user hits enter. What do you think?


#11

I really haven’t had a problem with search being restricted to online. Since we’re talking about Hugo specifically here it’s not like the content is in a database, so a simple grep (or git grep) is usually plenty for me.


#13

The scenario I was talking about was a docs site. As we add new docs and information locally, I typically like to render the site locally, and try everything out including making sure the new information comes up in search as I would expect.

So far, with Algolia, I haven’t been able to do that. Rick’s dual-search system seems to help with that situation though.


#14

Putting it in the pipeline for a larger site I’m mastering, keep you posted :+1:


#15

Haven’t gotten to this yet :roll_eyes: but it struck me that hitting the enter key could also fire a site: search on Google.

Example:

  1. The idea is implemented in gohugo.io.
  2. When the user types letters in the search field she receives suggestions.
  3. When hitting enter she ends up on a google search site:gohugo.io QUERY

Hope to get to this soon, site: or Algolia – both are powerful extensions of Lunr.
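
A tiny sketch of step 3, building the URL that the enter key would send the user to. On a real site this would live in the front-end JavaScript; Python is used here only to show the shape of the URL. The function name site_search_url is made up, and the www.google.com/search?q= form is an assumption about Google’s public search URL.

```python
from urllib.parse import urlencode


def site_search_url(domain, query):
    """Build a Google search URL restricted to one site via the site: operator."""
    return "https://www.google.com/search?" + urlencode({"q": f"site:{domain} {query}"})
```

For example, site_search_url("gohugo.io", "output formats") restricts the Google results to pages on gohugo.io.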