Filter out stopwords (or anything!) in a string or content

I had a little experiment for myself the other day and I’ve been able to complete it. The job was to have a simple routine to filter out stop words, either for a slug or any other reason. A search on this forum turned up only one other relevant post about stopwords. It was interesting and a little illuminating, but not really what I wanted because it required hardcoding the filter words.

I wanted a solution that would be scalable, extendable, and easy to understand and use.

Problem


I want to filter certain words from a slug in an archetype or from content used to populate a search index.

This doesn’t have to be just stopwords as traditionally defined (words that don’t contribute to the meaning or specificity of a search query) but could be any set of words or characters for any reason.

Solution


What I settled on was a simple function partial (a partial that returns a value to the calling template) and a structured list of words or filterable strings.

The Function Partial

Location: /layouts/partials/functions/filter-stopwords.html
Contents:

{{/*  Input arg is a dict with:  */}}
{{/*  .String (input string requiring filtering)  */}}
{{/*  .Delimiter (string used to recombine words into output string)  */}}

{{/*  Unescape any HTML-escaped characters, convert to lowercase,
      replace the rendered apostrophe (right single quote) with a standard one,
      then replace sentence punctuation with spaces via regex (keeps dash, plus, and apostrophe)  */}}
{{ $filteredString := trim (replaceRE "[^\\w+'-]+" " " (replaceRE "’" "'" (.String | htmlUnescape | lower))) " " }}

{{/*  Split the filtered string into a slice array using whitespace  */}}
{{ $stringArray := split $filteredString " " }}

{{/*  Load list of stopwords as a slice array  */}}
{{ $sw := site.Data.stopwords }}

{{/*  Compare string words to stopwords and return only non-matches  */}}
{{ $filteredArray :=  complement $sw $stringArray }}

{{/*  Return a string of the filtered words joined by the delimiter  */}}
{{ return delimit $filteredArray .Delimiter }}

The regex can be explored here.
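
Since the list can be any set of words, one possible variation is to let the caller supply its own list. The sketch below assumes a hypothetical optional `Words` key in the dict (it is not part of the partial above) and falls back to the data file when that key is absent:

```go-html-template
{{/*  In the partial: hypothetical replacement for the $sw line.
      Uses .Words if the caller passed it, otherwise /data/stopwords.json  */}}
{{ $sw := .Words | default site.Data.stopwords }}

{{/*  In a calling template: pass a custom list through the assumed "Words" key  */}}
{{ $custom := slice "borg" "insanity" }}
{{ $out := partial "functions/filter-stopwords.html" (dict "String" .Title "Delimiter" " " "Words" $custom) }}
```

Note that `default` treats an empty slice as unset, so passing an empty list would also fall back to the data file.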

The Structured List of Words

Location: /data/stopwords.json
Contents:

[
    "able",
    "about",
    "above",
    "abroad",
    "according",
    "accordingly",
    "across",
    "actually",
    "adj",
    "after",
    ... Skip a
    ... Few ...
    "whos",
    "widely",
    "words",
    "world",
    "youd",
    "youre"
]

Template Usage

Use in whatever template file you choose. Some examples below:

HTML Template Example 1

<!-- Pretend raw page content = "The Federation's gone; the Borg is everywhere! You bet I'm agitated! I may be surrounded by insanity, but I am not insane." -->
{{ $newString := partial "functions/filter-stopwords.html" (dict "String" .Plain "Delimiter" " ") }}
<!-- $newString is now: "federation's borg bet agitated surrounded insanity insane" -->

This takes the current page’s rendered content with the HTML removed and returns a string with all the stopwords (and sentence punctuation) removed. Notice the lower function inside the partial. All the words in the filter list are lowercase, so uppercase words won’t match. To make them match, the input must be forced to lowercase.

NOTE: The page variable .Plain takes a page’s rendered content and removes all HTML. In doing so, it also converts the following punctuation into HTML-escaped entities: “ -> &ldquo; , ” -> &rdquo; , ’ -> &rsquo; , > -> &gt; , < -> &lt; , & -> &amp; This is why the string must be passed through htmlUnescape in the function. Once it is unescaped, the first regex replacement can swap any ’ for ' so it matches the JSON data entries.
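
To make the normalization line in the partial easier to follow, here is the same chain broken into separate steps. This is only an illustrative sketch: the `$raw` sample string and the `$step` variable names are made up for the example, and the functions are the same ones the partial already uses.

```go-html-template
{{/*  Sample input containing an escaped right single quote  */}}
{{ $raw := "You bet I&rsquo;m agitated!" }}

{{/*  1. Unescape entities and lowercase: you bet i’m agitated!  */}}
{{ $step1 := $raw | htmlUnescape | lower }}

{{/*  2. Swap the curly apostrophe for a straight one: you bet i'm agitated!  */}}
{{ $step2 := replaceRE "’" "'" $step1 }}

{{/*  3. Replace runs of anything that is not a word character, plus,
         apostrophe, or dash with a single space  */}}
{{ $step3 := replaceRE "[^\\w+'-]+" " " $step2 }}

{{/*  4. Trim the leading/trailing spaces left behind: you bet i'm agitated  */}}
{{ $step4 := trim $step3 " " }}
```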

HTML Template Example 2

{{ $text := "The Federation's gone; the Borg is everywhere! You bet I'm agitated! I may be surrounded by insanity, but I am not insane." }}
{{ $newString := partial "functions/filter-stopwords.html" (dict "String" $text "Delimiter" "#") }}
<!-- $newString is now: "federation's#borg#bet#agitated#surrounded#insanity#insane" -->
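
Search Index Example (sketch)

The Problem section mentions filtering content used to populate a search index. A rough sketch of how that might look is below. It assumes a template that is rendered through a JSON output format you have configured yourself, and the keys `title`, `url`, and `content` are hypothetical names chosen for the example, not something a particular search library requires.

```go-html-template
{{- /*  Build one entry per regular page, with stopwords stripped from the body text  */ -}}
{{- $index := slice -}}
{{- range site.RegularPages -}}
  {{- $filtered := partial "functions/filter-stopwords.html" (dict "String" .Plain "Delimiter" " ") -}}
  {{- $index = $index | append (dict "title" .Title "url" .RelPermalink "content" $filtered) -}}
{{- end -}}
{{- /*  Emit the whole list as a JSON array  */ -}}
{{- $index | jsonify -}}
```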

Archetype Example

Use the following in an archetype file to filter stopwords from a content filename/title for a slug:

---
title: "{{ replace .Name "-" " " | title }}"
date: {{ .Date }}
draft: true
slug: {{ partial "functions/filter-stopwords.html" (dict "String" (replace .Name "-" " ") "Delimiter" "-") | urlize }}
---

New post command:

hugo new posts/This-is-a-story-all-about-How-my-life-got-twist-turned-upside-down.md

Location: /content/posts/This-is-a-story-all-about-How-my-life-got-twist-turned-upside-down.md
Content:

---
title: "This Is a Story All About How My Life Got Twist Turned Upside Down"
date: 2022-08-14T15:16:01-04:00
draft: true
slug: story-life-twist-upside
---

The Structured List of Filter Words

As I said before, the list can be anything. In my particular case, I wanted stopwords. I found online lists for English at the following links (I downloaded them all, combined the list of words, sorted them alphabetically, removed duplicates, and converted into JSON… just to really cover all the bases. I ended up with 973 unique words):

List 1 - 544 words - Text file

List 2 - 428 words - Text file

List 3 - 851 words - Has many languages available and output as JSON

For lists that start as text files (not YAML or JSON), you will have to convert them into a data format that Hugo can read, such as JSON or YAML. It’s pretty easy with some search/replace in applications like VSCode, Notepad++, Sublime, etc. It’s just a basic top level list.

For example, the beginning of the JSON list given above looks like this in YAML:

YAML

---
- able
- about
- above
- abroad
- according
- accordingly
- across
- actually
- adj
- after
- afterwards
- again
- against
- ago
- ahead
- ain't
- all
- allow
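
Since List 3 is available in many languages, a multilingual site could keep one list per language. This is a sketch only: it assumes hypothetical files such as /data/stopwords/en.json and /data/stopwords/fr.json (one top level list each, named after the site language code), which is a different layout from the single /data/stopwords.json used above.

```go-html-template
{{/*  Hypothetical replacement for the $sw line in the partial:
      pick the list matching the current site language,
      or use an empty list (filter nothing) if no file exists for it  */}}
{{ $sw := index site.Data.stopwords site.Language.Lang | default (slice) }}
```

With this layout the rest of the partial stays the same, because `complement` with an empty list simply returns every word.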

Conclusion


I hope you enjoyed this, maybe learned a little something (I know I did!), and can make this work (or change it to work) for your own use cases!