Related content

There is an open issue about “related content” in Hugo – a way to link to similar articles; Joe Armstrong (Erlang) named this the Sherlock Problem (aka Sherlock Holmes’ last problem).

In the site generator world there are two ways to approach this:

By looking at

  • Some user generated extract (tags, categories)
  • The full content

Or some combination of the above.

I have collected some related links below:

I have tested a quick and dirty implementation of simhash (using https://github.com/mfonda/simhash) in Hugo. While it is very fast even untuned, the results vary a lot (I have done “visual testing” with some of the Hugo blogs in the open). The simhash implementation used weighs all words the same, and that might explain some of the variation. Simhash is a fascinating algorithm, and I might get something workable, but if working with the full content it would probably be language dependent.

I’m leaning towards the “tag approach” as “good enough”, but it should be sorted/weighted: articles with more tags in common are “more similar” than others. Using simhash on the tags could be one approach.
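A minimal sketch of the sorted tag approach, for illustration only. The `post` type and the `sharedTags` and `related` helpers are hypothetical names, not Hugo code; the ranking is simply "more shared tags first".

```go
package main

import (
	"fmt"
	"sort"
)

// post is a hypothetical stand-in for a page that carries tags.
type post struct {
	Title string
	Tags  []string
}

// sharedTags counts the distinct tags that a and b have in common.
func sharedTags(a, b post) int {
	set := make(map[string]bool, len(a.Tags))
	for _, t := range a.Tags {
		set[t] = true
	}
	n := 0
	for _, t := range b.Tags {
		if set[t] {
			n++
			set[t] = false // do not count a repeated tag twice
		}
	}
	return n
}

// related returns the other posts sharing at least one tag with p,
// ordered so that posts with more tags in common come first.
func related(p post, all []post) []post {
	var out []post
	for _, q := range all {
		if q.Title != p.Title && sharedTags(p, q) > 0 {
			out = append(out, q)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return sharedTags(p, out[i]) > sharedTags(p, out[j])
	})
	return out
}

func main() {
	posts := []post{
		{"Intro to Hugo", []string{"hugo", "go", "static"}},
		{"Go templates", []string{"go"}},
		{"Hugo themes", []string{"hugo", "static", "css"}},
		{"Gardening", []string{"plants"}},
	}
	for _, r := range related(posts[0], posts) {
		fmt.Println(r.Title)
	}
}
```

Posts with no tag in common are dropped rather than ranked last; whether that is the right default is a design choice.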

Just tested: Adding a stop-word filter makes the simhash much more sane. But I’m still not convinced …
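For readers who have not met simhash, here is a minimal sketch of the idea with a stop-word filter in front. It is not the API of the mfonda/simhash package; the `simhash` and `distance` functions, the FNV hash, the equal word weights and the tiny stop-word list are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// stopWords is a tiny illustrative list; a real filter would be per language.
var stopWords = map[string]bool{"the": true, "a": true, "is": true, "of": true, "and": true}

// simhash computes a 64-bit fingerprint of text. Every word that is not a
// stop word votes +1 or -1 on each bit, so all words weigh the same.
func simhash(text string) uint64 {
	var votes [64]int
	for _, w := range strings.Fields(strings.ToLower(text)) {
		if stopWords[w] {
			continue
		}
		h := fnv.New64a()
		h.Write([]byte(w))
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}
	var fp uint64
	for i, v := range votes {
		if v > 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}

// distance is the Hamming distance between two fingerprints;
// a smaller distance suggests more similar texts.
func distance(a, b uint64) int { return bits.OnesCount64(a ^ b) }

func main() {
	a := simhash("the history of the Go programming language")
	b := simhash("a history of Go programming")
	fmt.Println(distance(a, b))
}
```

Replacing the constant vote of 1 with a per-word weight is where a weighting scheme would plug in.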

@bjornerik, Thanks so much for taking an interest. You’ve pulled together some great ideas and effort around this.

My thoughts on this were to take a simple approach much like you have here. I’d suggest a single tweak that may help improve the results.

So the problem you have to deal with isn’t that easy. Different people use tags (taxonomies) and content in different ways. I think it’s pretty common for a website to have a lot of content assigned to the same tags. People often write about the same things over and over. Unfortunately, a taxonomy-only approach will have trouble differentiating things here, and I feel it would come up with subpar results.

I can think of a few different things to consider.

  1. Taxonomies
  2. Date
  3. Content
  4. Title

I believe that’s enough to cover all bases. I’ve thought through a few scenarios and all would be covered by those 4.

The change I would make would be to compare these 4 things, but add a weight to each. I would also allow each taxonomy to have its own weight. All weights are relative, higher == more important.

For example, the default may be series (a taxonomy) 8, date 3, title 0, content 0. This would likely give me items in the same series and with the dates closest to my post.

Another example may be category 10, tags 8, content 4, date 1. This would likely give me content with the same category and similar tags, but with similar content as well.

The nice thing about this approach is that we can provide sane defaults and allow people to easily tweak it through a few settings in the config file.
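To make the weighting concrete, here is a rough sketch of one way such a score could be computed. Everything in it is an assumption for illustration: the `page` and `weights` types, the `score` function, and in particular the date term, which decays as `weight / (1 + days apart)`. Title and content are left out because their weight is 0 in the first example.

```go
package main

import (
	"fmt"
	"time"
)

// page is a hypothetical stand-in for a Hugo page.
type page struct {
	Title      string
	Date       time.Time
	Taxonomies map[string][]string // e.g. "series" -> ["hugo-internals"]
}

// weights holds relative weights; higher means more important.
type weights struct {
	Taxonomies map[string]float64
	Date       float64
}

// day returns the n-th day of January 2015 (helper for the example).
func day(n int) time.Time { return time.Date(2015, 1, n, 0, 0, 0, 0, time.UTC) }

// overlap counts the terms of b that also occur in a.
func overlap(a, b []string) int {
	set := make(map[string]bool, len(a))
	for _, t := range a {
		set[t] = true
	}
	n := 0
	for _, t := range b {
		if set[t] {
			n++
		}
	}
	return n
}

// score adds weight*overlap for each taxonomy, plus a date term that
// shrinks as the two pages get further apart in time.
func score(a, b page, w weights) float64 {
	var s float64
	for name, weight := range w.Taxonomies {
		s += weight * float64(overlap(a.Taxonomies[name], b.Taxonomies[name]))
	}
	days := a.Date.Sub(b.Date).Hours() / 24
	if days < 0 {
		days = -days
	}
	return s + w.Date/(1+days)
}

func main() {
	// The first example from the post: series 8, date 3.
	w := weights{Taxonomies: map[string]float64{"series": 8}, Date: 3}
	me := page{"Part 2", day(10), map[string][]string{"series": {"internals"}}}
	sameSeries := page{"Part 1", day(3), map[string][]string{"series": {"internals"}}}
	unrelated := page{"Misc", day(9), map[string][]string{"series": {"other"}}}
	fmt.Println(score(me, sameSeries, w) > score(me, unrelated, w))
}
```

In a config file these weights would just be a handful of numbers; a weight of 0 switches a criterion off.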

I think I will revisit this now. I totally agree with your four point list, @spf13 - and it should be weighted.

I have had this in the back of my mind, and hopefully that deep, unconscious thinking has brought me closer to a good solution.

Some time ago I even created a Git repo for this:

Go sherlock. I saw the “Sherlock Problem” presented by Joe Armstrong, the guy behind Erlang.

How easy it will be to generalize this into a lib and then pull it into Hugo, we’ll see.

Any input on this, ideas, tech tips etc… Please share!

Has any progress been made on this?

Not from me.

Hi,

I’m the author of the standalone prototype. I was busy on another project, so my Go is a bit rusty.

I’ve tried to patch Hugo as follows:

  • Add a private fingerprint field to the Page struct.
  • Expose a new field listing related pages (RelatedPages []*Page).
  • Add a func on the Page struct that lists the related pages (findRelatedPages), plus Less/Swap funcs for the sort package.
  • In func (s *Site) CreatePages(), after the pages have been built, call findRelatedPages for each page.

I don’t know if it is ugly or not. It may be incomplete, because one might be interested in building this list from a node in addition to the entire site (these functions are fast enough for both use cases). But it works and is sufficient for me. From the single template, I just need to list the pages as follows:

{{ range $i, $val := .RelatedPages }}
    {{if lt $i 5}}<li>{{ $val.Title }} {{ $val.Permalink }}</li>{{end}}
{{ end }}
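The patch itself is not shown in this thread. As a rough sketch of what the Go side described in the list above could look like: the `Page` struct here is a cut-down stand-in, and the `byDistance` type and the use of Hamming distance between fingerprints are assumptions, not the actual patch.

```go
package main

import (
	"fmt"
	"math/bits"
	"sort"
)

// Page is a cut-down, hypothetical stand-in for Hugo's Page struct.
type Page struct {
	Title        string
	fingerprint  uint64  // private, e.g. a simhash of the content
	RelatedPages []*Page // exposed to templates
}

// byDistance implements sort.Interface, ordering pages by Hamming
// distance between their fingerprint and a reference fingerprint.
type byDistance struct {
	ref   uint64
	pages []*Page
}

func (s byDistance) Len() int      { return len(s.pages) }
func (s byDistance) Swap(i, j int) { s.pages[i], s.pages[j] = s.pages[j], s.pages[i] }
func (s byDistance) Less(i, j int) bool {
	return bits.OnesCount64(s.ref^s.pages[i].fingerprint) <
		bits.OnesCount64(s.ref^s.pages[j].fingerprint)
}

// findRelatedPages fills p.RelatedPages with all other pages,
// closest fingerprint first.
func (p *Page) findRelatedPages(all []*Page) {
	others := make([]*Page, 0, len(all))
	for _, q := range all {
		if q != p {
			others = append(others, q)
		}
	}
	sort.Stable(byDistance{p.fingerprint, others})
	p.RelatedPages = others
}

func main() {
	pages := []*Page{
		{Title: "a", fingerprint: 0x0},
		{Title: "b", fingerprint: 0xF},
		{Title: "c", fingerprint: 0x1},
	}
	// In the patch this loop would run once all pages are built.
	for _, p := range pages {
		p.findRelatedPages(pages)
	}
	for _, r := range pages[0].RelatedPages {
		fmt.Println(r.Title)
	}
}
```

Each page ends up with a full sorted list, so the template decides how many entries to show.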

Are you still interested in this feature, or in a PR?

Definitely still interested in the feature.

I took into account @bep’s remark about stop words. I created a package, https://github.com/bbalet/stopwords, that removes stop words in many languages. It improves the accuracy of the SimHash algorithm.

I’ve got a question. As far as I understand, the language code is not duplicated on the Page struct. In multilingual sites, though, it would be logical to have one language code per page and to initialize this field with the website language by default.

As this is fairly performance and memory critical (at least for Hugo), I would have provided a lookup func instead, maybe backed by a Bloom filter. Then you can let the caller decide what “a word is”.

I’m not sure I understand; please elaborate on what “a word is”.

It wasn’t the most important part of what I said, but you have defined a word as something divided by " " – this isn’t the case for CJK languages.

In my opinion, this is the most important part (I use the Khmer language, so I do know this problem). I may be wrong, but I used the regular expression [\pL-_’]+ to break words. It means that a word is composed of any Unicode letter (spaces and word breaks are not letters). The space character " " is only used to explicitly separate words in the generated text content (we can use any word separator with SimHash).
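A small sketch of what that regular expression does in Go, for illustration. The `words` helper is a hypothetical name, the hyphen is escaped to make the intent explicit, and the ASCII apostrophe is added alongside the typographic one; neither change is taken from the stopwords package.

```go
package main

import (
	"fmt"
	"regexp"
)

// wordRE matches runs of Unicode letters, hyphens, underscores and
// apostrophes. \pL is the Unicode "letter" category.
var wordRE = regexp.MustCompile(`[\pL\-_’']+`)

// words returns every match of wordRE in s, in order.
func words(s string) []string {
	return wordRE.FindAllString(s, -1)
}

func main() {
	fmt.Println(words("Hello, world! It’s well-known."))
	// A run of CJK letters without separators comes back as one token.
	fmt.Println(len(words("日本語のテキスト")))
}
```

The second call shows the concern raised above: when a script does not separate words, a letter-based expression cannot find the word boundaries by itself.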

I’ll benchmark the bloom filter approach.

My main point is that your approach has room for improvement in the performance department.

I’ve benchmarked the Bloom filter approach, but it doesn’t improve the performance (tested with go test -bench with the default value of 100). I think it’s due to two factors:

  • Bloom filters are fast when it comes to searching a large set, but they cost a lot to build (even if it is built only once). So the overall gain is small where m is about 300.
  • In the first version, I used maps, and Go builds a hash map to look up keys quickly.

Regarding memory consumption, I could init the map when it is used for the first time, but the gain is minimal (each set is around 300 keys) and amounts to a few KB.
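For illustration, a minimal sketch of a map-backed lookup func whose map is built on first use. The `isStopWord` name and the five-word list are assumptions; this is not the API of the stopwords package.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	initOnce  sync.Once
	stopWords map[string]struct{} // nil until the first lookup
)

// isStopWord reports whether w is in the (tiny, illustrative) English
// list. The map is built once, on the first call, and is safe to use
// from several goroutines afterwards.
func isStopWord(w string) bool {
	initOnce.Do(func() {
		stopWords = make(map[string]struct{})
		for _, s := range []string{"a", "and", "is", "of", "the"} {
			stopWords[s] = struct{}{}
		}
	})
	_, ok := stopWords[w]
	return ok
}

func main() {
	fmt.Println(isStopWord("the"), isStopWord("hugo"))
}
```

With only a few hundred keys per language the lazy init saves little memory, as noted above; its main appeal is that unused languages are never built.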

On my computer (i3-2120 @ 3.30 GHz), the overall process takes 2 s for 100 articles. If you don’t have any other suggestions, we could make it optional (as is the case for Jekyll and its lsi option).

I was just pointing out that the stopword API is inefficient from Hugo’s point of view (the Bloom filter remark derailed the discussion, and you are right about that being a bad choice here). So, if you have a solution that covers the requirements given earlier in this thread, a pull request is the next step.

I am trying a trie approach suggested by a SW user. It reduces the time spent in the stopwords API by 60%. A stopwords API (whichever API you choose) is needed if you want to improve the accuracy of any natural-language text processing algorithm.

The only suggestion I have is stopwords + simhash, and I can reduce the overall processing time to 1 second on my reference computer (i3-2120 @ 3.30 GHz).

I’ve got some unanswered questions for the PR:

  • langCode on Page struct.
  • What is the best place to call the feature?
  • Should we make it optional as in Jekyll for LSI?

As I see it:

  • You have a byte slice and want word tokens
  • In your current implementation you convert the byte slice to a string (memory alloc) then do several regexp passes on that string to split the words, filter out the stop words, then concatenate the words into a string again
  • The string is then … split?
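One possible alternative to the pipeline described above, sketched under assumptions: walk the byte slice once and hand each non-stop word to a callback, so no intermediate string is built and re-split. The `eachWord`, `notStop` and `collect` names are hypothetical, and "word" here simply means a run of Unicode letters.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// eachWord scans b once and calls fn for every run of letters that keep
// accepts. The slices passed to fn alias b; nothing is copied.
func eachWord(b []byte, keep func(word []byte) bool, fn func(word []byte)) {
	start := -1
	emit := func(end int) {
		if start >= 0 {
			if w := b[start:end]; keep(w) {
				fn(w)
			}
			start = -1
		}
	}
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		if unicode.IsLetter(r) {
			if start < 0 {
				start = i
			}
		} else {
			emit(i)
		}
		i += size
	}
	emit(len(b))
}

// stop is a tiny illustrative stop-word set.
var stop = map[string]bool{"the": true, "a": true}

// notStop is the lookup func; the caller decides what to filter.
func notStop(w []byte) bool { return !stop[string(w)] }

// collect gathers the words as strings; only used for the demo.
func collect(text string) []string {
	var out []string
	eachWord([]byte(text), notStop, func(w []byte) { out = append(out, string(w)) })
	return out
}

func main() {
	fmt.Println(collect("the quick, brown fox"))
}
```

A hashing step such as simhash could consume the words directly in the callback instead of collecting them.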

Keep it sitewide for now. nn-NO would be my value: Norwegian Nynorsk.

Not sure, but since your algorithm is O(n^2), I guess running it on more CPUs in parallel would be good.
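A rough sketch of how the pairwise comparison could be spread over goroutines, assuming each page already has a 64-bit fingerprint. The `nearest` function is a hypothetical name; one goroutine per page is the simplest split, not a tuned one.

```go
package main

import (
	"fmt"
	"math/bits"
	"sync"
)

// nearest returns, for each fingerprint, the index of the closest other
// fingerprint by Hamming distance (-1 if there is no other). The O(n^2)
// comparisons are split into one goroutine per row; each goroutine
// writes only to its own slot of the result, so no locking is needed.
func nearest(fps []uint64) []int {
	out := make([]int, len(fps))
	var wg sync.WaitGroup
	for i := range fps {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			best, bestDist := -1, 65 // 65 is larger than any 64-bit distance
			for j := range fps {
				if j == i {
					continue
				}
				if d := bits.OnesCount64(fps[i] ^ fps[j]); d < bestDist {
					best, bestDist = j, d
				}
			}
			out[i] = best
		}(i)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(nearest([]uint64{0x0, 0x1, 0xF, 0xE}))
}
```

For a large site a fixed pool of workers would probably be kinder to the scheduler than one goroutine per page.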

It should be configurable (the different weights), and should probably default to all off.

I will stop here. Bye and good luck for your project.

I was wondering what the status of this enhancement is. Has there been progress?