Splitting post content for Algolia search index

Hi all.

I’m developing a full-text site search for one of my websites and I need some help formatting my search index. I’m using Hugo’s custom output formats to generate a JSON file which includes the full text of all posts, as described here. This will be pushed to Algolia’s REST batch API, and I’ll then use their free tier to power my search functionality.

Algolia’s free tier requires each individual record to be smaller than 10 KB. Their docs suggest splitting larger records (such as longer blog posts) into smaller chunks, then deduplicating the search results using the distinct parameter.

Here’s a simplified version of my index template, which currently outputs one record per page containing that page’s full text. I’ve used the post URL as my objectID:

{
  "requests": [
    {{ range $idx, $page := $.Site.Pages }}
      {{ if $idx }},{{ end }}
      { "action": "updateObject",
        "body": {
          "objectID": {{ $page.URL | jsonify }},
          "title": {{ $page.Title | jsonify }},
          "href": {{ $page.Permalink | jsonify }},
          "content": {{ $page.Plain | jsonify }}
        }
      }
    {{ end }}
  ]
}

I need this template to split longer posts into multiple records. I’d prefer to avoid splitting mid-word, so using substr on .Plain probably isn’t workable. I suspect I’ll need to use .PlainWords, using range to output multiple records for the longer posts, splitting by word count. I’m struggling to do this, and can’t find any threads with relevant examples.

Here’s an example of the desired output for a longer post. The objectID needs to be unique, so I’ve simply appended a number reflecting the record’s position in the sequence. order provides an integer for sequencing, and title and href stay the same across records. content should hold a fixed number of words.

{
  "requests": [
      { "action": "updateObject",
        "body": {
          "objectID": "/example-post-url/_1",
          "order": 1,
          "title": "My example post title",
          "href": "https://www.mydomain.co.uk/blog/example-post-url/",
          "content": "Hugo is one of the most popular open-source static site generators. With its amazing speed and flexibility, Hugo makes building websites fun again."
        }
      },
      { "action": "updateObject",
        "body": {
          "objectID": "/example-post-url/_2",
          "order": 2,
          "title": "My example post title",
          "href": "https://www.mydomain.co.uk/blog/example-post-url/",
          "content": "We love the beautiful simplicity of markdown’s syntax, but there are times when we want more flexibility. Hugo shortcodes allow for both beauty and flexibility."
        }
      }
    
  ]
}

Hope that makes sense. Any pointers would be hugely appreciated.

Thanks!

Hi all. In case it’s helpful to anyone else who’s struggling with this, my solution is below.

This gives the desired output.

{
  "requests": [
    {{ range $idx, $page := $.Site.Pages -}}
      {{ if $idx -}},{{- end }}
      { "action": "updateObject",
        "body": {
          "objectID": {{ print $page.RelPermalink "_1" | jsonify }},
          "order": 1,
          "title": {{ $page.Title | jsonify }},
          "href": {{ $page.Permalink | jsonify }},
          "content": {{ delimit (first 1000 $page.PlainWords) " " | jsonify }}
        }
      }{{ if gt (len $page.PlainWords) 1000 }},
      { "action": "updateObject",
        "body": {
          "objectID": {{ print $page.RelPermalink "_2" | jsonify }},
          "order": 2,
          "title": {{ $page.Title | jsonify }},
          "href": {{ $page.Permalink | jsonify }},
          "content": {{ delimit (first 1000 (after 1000 $page.PlainWords)) " " | jsonify }}
        }
      }{{- end -}}
    {{ end }}
  ]
}

It might be a horrible way of doing it, but I’ve added an if statement inside the loop which checks the word count and creates an additional record for longer posts. Each record contains up to 1000 words. I’ve capped it at two records (2000 words), so anything beyond that is left out of the index, but if your posts are longer you can just add extra ifs. Object IDs are unique (as required by Algolia), deduping can be done on title or href, and order lets you sequence the records.
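For anyone who would rather not hard-code one if per chunk, the same idea can be written as a loop over the word list. The sketch below is Python rather than a Hugo template, purely to show the logic (it could run as a post-processing step on the generated JSON). The function name chunk_records and its parameters are made up for illustration; the field names just mirror the records above.

```python
def chunk_records(rel_permalink, title, href, words, chunk_size=1000):
    """Split a post's words into records of at most chunk_size words each.

    Hypothetical helper: the record layout copies the template above.
    A post with no words produces no records at all.
    """
    records = []
    # Step through the word list one chunk at a time, so no word is cut in half.
    for start in range(0, len(words), chunk_size):
        order = start // chunk_size + 1  # 1-based position in the sequence
        records.append({
            "action": "updateObject",
            "body": {
                "objectID": f"{rel_permalink}_{order}",  # unique per chunk
                "order": order,
                "title": title,
                "href": href,
                "content": " ".join(words[start:start + chunk_size]),
            },
        })
    return records
```

Calling chunk_records("/example-post-url/", "My example post title", "https://www.mydomain.co.uk/blog/example-post-url/", words) with a 2500-word list would give three records, with the last one holding the remaining 500 words.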

POSTing this to Algolia as per this thread works as expected.
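Since 1000 words is not a fixed number of bytes, it may be worth checking the record sizes before POSTing. Here is a rough Python sketch. Both the 10,000-byte limit and the way size is measured (UTF-8 length of the JSON-encoded body) are my assumptions, and Algolia may count slightly differently, so treat the result as an estimate. oversized_records is a made-up name.

```python
import json

# Assumption: adjust this to whatever limit applies to your Algolia plan.
MAX_RECORD_BYTES = 10_000


def oversized_records(requests, limit=MAX_RECORD_BYTES):
    """Return the objectIDs of records whose JSON-encoded body exceeds limit bytes."""
    too_big = []
    for request in requests:
        body = request["body"]
        # Approximate the record size as the UTF-8 byte length of its JSON form.
        size = len(json.dumps(body, ensure_ascii=False).encode("utf-8"))
        if size > limit:
            too_big.append(body["objectID"])
    return too_big
```

Run it over the "requests" list from the generated JSON file. If any IDs come back, lowering the words-per-record count should bring them under the limit.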
