[SOLVED] Search keywords derived from Markdown content source files

We use lunr.js to implement search. We have a Gulp task that runs JavaScript code to create a JSON file of search keywords for each content page before running Hugo. The problem is that the search keywords are determined by parsing the Markdown content source files, and not the output files. As a result,
(a) Text, or even complete pages, that are excluded from the output by using comments, conditional logic implemented via shortcodes, page parameters, and build flags, are included in the search results.
(b) Text that doesn’t directly appear in the source MD file but is included in the output page via site parameters (variables) and templates (including the use of data files and shortcodes), is not included in the search results for the page, and shortcode calls in the content are identified as regular text.

Which leads me to wonder — is it possible for the search-keywords gathering code to run on the HTML output, or on intermediate build files (if such files exist), and if so, is there a reason to parse the content sources and not the output? Or, is there another way to bypass the problems with the content-sources parsing?

The developer who implemented the search feature pointed out that the HTML files include HTML code that we don’t want in the search keywords. But it seems to me that there should already be solutions out there for this, and that even writing custom code to strip the HTML would be simpler than handling all of the issues with parsing the MD sources. Is that right?

Currently, our plan is to write code that excludes, for example, text within HTML comments or the content of calls to our custom internal-comment shortcode, and that also checks for draft/future/expired content front-matter configurations together with the use of related Hugo runtime flags (such as --buildDrafts). But not only does this add complexity and overhead, it still won’t handle code reuse and conditional logic via the use of variables and templates/shortcodes, which I believe would be much more complex to handle.
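
To make the limitation concrete, here is a minimal sketch of what such source-side filtering might look like in JavaScript. The shortcode name `internal-comment` and the function name are hypothetical stand-ins for illustration, not our actual code, and the draft/future/expired checks are left out.

```javascript
// Hypothetical sketch: strip text that should never reach the search index
// from a raw Markdown source string. `internal-comment` is an assumed
// shortcode name used only for illustration.
function stripHiddenSourceText(markdown) {
  return markdown
    // Remove HTML comments, including multi-line ones.
    .replace(/<!--[\s\S]*?-->/g, "")
    // Remove paired calls to the internal-comment shortcode, in either
    // the {{< ... >}} or the {{% ... %}} form, together with their body.
    .replace(
      /\{\{[<%]\s*internal-comment\b[^}]*[>%]\}\}[\s\S]*?\{\{[<%]\s*\/internal-comment\s*[>%]\}\}/g,
      ""
    );
}
```

A call to any other shortcode, for example one that inserts a product name from a site parameter, passes through unchanged, so the index would still contain the raw call rather than the rendered text. That is the part that filtering the sources cannot easily solve.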

You can use a custom output format to create the JSON index. The index is then built from the processed content rather than the raw source files, so shortcodes will already have run, and you can customize the template however you want, for example to remove HTML comments.

There is a fair amount of prior work for your specific use case.

See https://github.com/bep/bepsays.com/blob/master/layouts/index.json – which could be a good foundation for a Lunr index. Note that you are allowed to be creative; if the content gets too big, maybe use .TableOfContents | plainify … keywords etc…

Thanks @maiki, @bep. I’m not a developer and I’m not implementing the search functionality myself, so I hope you can bear with me. I searched the discourse threads before posting this thread but the information, including in the links that you referred to, is more technical than my current knowledge and I couldn’t deduce from a quick review whether the suggested implementations operate on the original MD sources or something else to create the search JSON index file.

@maiki, regarding the option of using a JSON output format, do you mean that we can use Hugo to build the sources as JSON files to be used for the search indexing, in which case the Hugo build will already handle all the templating and conditional logic, so we will end up with JSON files that include only text that matches the equivalent HTML build output (i.e., this will handle the issues I have with the MD-based search-keywords indexing)?

Hugo processes your markdown source files and converts them to a json format file. You have control over what information from source ends up in the json data. This is what custom outputs can do for you. Here is an example of where I take “Event” posts and turn them into a json file that can be read by a JavaScript calendar on the site.

I convert event source files in Markdown into a json file.

The json file then provides the input for the ‘Monthly’ javascript calendar I am using.

The result is here.

Hope this helps.

Thanks zivbk1.

So, we could choose to render our MD sources into both HTML and JSON output formats, use JSON templates to influence the JSON output, and use the JSON files as the basis for our search indexing. That would allow us to create JSON search files that correctly handle the use of conditions, variables, and shortcodes in the MD sources, correct?

Does this mean that we will also need to write JSON equivalents for our HTML shortcodes?

(This will also undoubtedly increase the build times, but I guess it will be in place of our current implementation, which iterates the MD files to create a search-keywords JSON file.)

You can do a “multipass Hugo”: (I invented the term now)

  1. Create the JSON file for search only (you may not have to create this every time …)
  2. Create the regular (HTML) site

You would then have separate config.toml for 1 and 2 (and maybe use the option to pass multiple config files into Hugo).
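
As a rough sketch of how the two passes could be driven from a Node/Gulp script: the file name `config-search.toml`, the destination directories, and the function names below are assumptions for illustration only. The sketch relies on Hugo's `--config` flag accepting a comma-separated list of files and on the `--destination` flag.

```javascript
const { spawnSync } = require("child_process");

// Build the command-line arguments for one Hugo pass. Hugo's --config flag
// accepts a comma-separated list of configuration files.
function hugoArgs(pass) {
  return ["--config", pass.configs.join(","), "--destination", pass.destination];
}

// Run each pass in order; stop at the first one that fails.
// `run` defaults to spawnSync and can be swapped out for testing.
function runPasses(passes, run = spawnSync) {
  for (const pass of passes) {
    const result = run("hugo", hugoArgs(pass), { stdio: "inherit" });
    if (result.status !== 0) {
      throw new Error(`Hugo pass "${pass.name}" failed`);
    }
  }
}

// Hypothetical passes: the file and directory names are assumptions.
const passes = [
  { name: "search-index", configs: ["config.toml", "config-search.toml"], destination: "build/search" },
  { name: "site", configs: ["config.toml"], destination: "public" },
];

// runPasses(passes); // uncomment to run both passes with a local hugo binary
```

Keeping the search pass separate means it could be skipped in local builds where the index isn't needed.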

But, if your site isn’t huge, I think you will be surprised how fast Hugo can create a JSON in addition to the rest.

Correct.

Nope. Normally a single field in the JSON document will contain the entirety of the content, so all your shortcodes will have already run. Normally this is sanitized, but you’ve noted in multiple places you aren’t a developer. You can point your devs to the functions, and they will be able to generate the file as needed.

@sharonl I add custom outputs all the time, hugo doesn’t sweat. Of course your own situation will be different, but if you are using a workflow where you have a server building and deploying your site, it isn’t even something you see. And you ought to be doing that. :slight_smile:

Argh! I’ve got no prior art! Okay, you get this one, Bjørn. :unamused:

Thanks guys. Regarding the build times, currently there’s only some test content, and we only have a small amount of documentation to port to the site; it will take time to accumulate a substantial volume, but I’m trying to make everything scalable.
As I mentioned, our current implementation already iterates all MD files on every run to create the search keywords, so using Hugo to produce the JSON files instead probably won’t add much overhead + I’ll end up with a search index that actually matches the output … .

I also considered the option of avoiding the search indexing in some local scenarios, should the need occur. (I actually temporarily eliminated our current search-indexing gulp task during some local tests to avoid the extra verbiage that it currently logs to stdout.) Our published docs site should ultimately be updated mainly on new product releases, and I indeed plan to run the publication builds on a server and automate this procedure.

@bep, regarding the option to use multiple configuration files, I came across it today in the release notes and I think it might come in handy for us for another scenario, which up until now I assumed would require a scripting solution to avoid duplicating large parts of the configuration file in multiple files. I hope to test it soon. I was also happy to find the new cond function, although it seems it’s not documented yet; I just used it in my shortcode :-).

UPDATE — SUCCESS: Thanks to your help, the developer changed the implementation to generate a JSON index file with Hugo and use this file as the basis for the search indexing, so now the search keywords match the actual generated doc output (+ as an added bonus, this eliminated the previous log messages for our Gulp search-keywords-generation JS code, which muddled the command line).

This is the theme layouts//index.json file that we now use to generate the output index.json file:

{{- $.Scratch.Add "index" slice -}}
{{- range where .Site.Pages "Type" "not in"  (slice "page" "json") -}}
  {{- $.Scratch.Add "index" (dict "path" .Permalink "title" .Title "content" .Plain "keywords" .Params.keywords) -}}
{{- end -}}
{{- $.Scratch.Get "index" | jsonify -}}
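
For anyone wiring this up, here is a hedged sketch of how the generated index.json could feed lunr.js. It assumes lunr 2.x (where the index is configured inside a function whose `this` is the builder) and that the caller passes in the lunr module; the function names are illustrative and this is not our actual code.

```javascript
// Normalize one entry of the Hugo-generated index.json. A page without a
// `keywords` front-matter parameter is expected to yield null, and a list
// of keywords arrives as an array, so both are flattened to a plain string.
function normalizeEntry(entry) {
  const keywords = Array.isArray(entry.keywords)
    ? entry.keywords.join(" ")
    : entry.keywords || "";
  return {
    path: entry.path,
    title: entry.title || "",
    content: entry.content || "",
    keywords,
  };
}

// Build a lunr index from the parsed index.json array.
// `lunr` is the lunr.js module (assumed 2.x), supplied by the caller.
function buildSearchIndex(lunr, entries) {
  return lunr(function () {
    this.ref("path"); // search results are identified by the page URL
    this.field("title");
    this.field("content");
    this.field("keywords");
    entries.map(normalizeEntry).forEach((doc) => this.add(doc));
  });
}
```

The field names match the keys written by the template above, so each search hit's ref is the page's permalink.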

–> I’m marking the issue as [SOLVED].
