Related content

bep · October 3, 2014, 8:08am

There is an open issue about “related content” in Hugo – a way to link to similar articles; Joe Armstrong (Erlang) named this the Sherlock Problem (aka Sherlock Holmes’ last problem).

In the site generator world there are two ways to approach this:

By looking at:

- Some user-generated extract (tags, categories)
- The full content

Or some combination of the above.

I have collected some related links below:

https://github.com/spf13/hugo/issues/98
- Standalone prototype using simhash: https://github.com/bbalet/gorelated
https://github.com/mfonda/simhash
http://matpalm.com/resemblance/simhash/ - which indicates it might be possible to get O(n log n) runtime with simhash.
Using tags:
- In, intersect: https://github.com/spf13/hugo/pull/537
- https://github.com/spf13/hugo/issues/525

I have tested a quick and dirty implementation of simhash (using https://github.com/mfonda/simhash) in Hugo. It is very fast even untuned, but the results vary a lot (I have done “visual testing” with some of the publicly available Hugo blogs). The simhash implementation I used weighs all words the same, which might explain some of the variation. Simhash is a fascinating algorithm, and I might get something workable, but working with the full content would probably make it language dependent.
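For readers unfamiliar with simhash, here is a minimal sketch of the unweighted variant described above. It is not the mfonda/simhash code: the function names, the FNV hash and the whitespace tokenizer are assumptions chosen to keep the example self-contained.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// simhash returns a 64-bit fingerprint of text. Every word carries the same
// weight (1). Splitting on whitespace is an assumption of this sketch and is
// one of the places where the result becomes language dependent.
func simhash(text string) uint64 {
	var counts [64]int
	for _, word := range strings.Fields(strings.ToLower(text)) {
		h := fnv.New64a()
		h.Write([]byte(word))
		sum := h.Sum64()
		// Each bit of the word hash votes for or against that bit
		// in the final fingerprint.
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				counts[i]++
			} else {
				counts[i]--
			}
		}
	}
	var fp uint64
	for i, c := range counts {
		if c > 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}

// distance is the Hamming distance between two fingerprints:
// the number of bits in which they differ.
func distance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	a := simhash("Hugo is a fast static site generator written in Go")
	b := simhash("Hugo is a fast static site generator built with Go")
	c := simhash("Sherlock Holmes solved his last problem in London")
	fmt.Println(distance(a, b), distance(a, c))
}
```

Smaller distances mean more similar texts. Because every word votes with the same weight, frequent filler words influence the fingerprint as much as distinctive ones.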

I’m leaning towards the “tag approach” as “good enough”, but it should be sorted/weighted: articles with more tags in common are “more similar” than others. Using simhash on the tags could be one approach.
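A minimal sketch of the sorted tag approach, ranking pages by the number of tags they share. The Page type and both function names are hypothetical and not part of Hugo; pages are identified by title purely for brevity.

```go
package main

import (
	"fmt"
	"sort"
)

// Page is a hypothetical, minimal stand-in for a site page.
type Page struct {
	Title string
	Tags  []string
}

// sharedTags counts the tags that a and b have in common.
func sharedTags(a, b Page) int {
	seen := make(map[string]bool, len(a.Tags))
	for _, t := range a.Tags {
		seen[t] = true
	}
	n := 0
	for _, t := range b.Tags {
		if seen[t] {
			n++
			seen[t] = false // count a repeated tag only once
		}
	}
	return n
}

// relatedByTags returns the pages sharing at least one tag with p,
// most shared tags first.
func relatedByTags(p Page, all []Page) []Page {
	var related []Page
	for _, q := range all {
		if q.Title != p.Title && sharedTags(p, q) > 0 {
			related = append(related, q)
		}
	}
	sort.SliceStable(related, func(i, j int) bool {
		return sharedTags(p, related[i]) > sharedTags(p, related[j])
	})
	return related
}

func main() {
	all := []Page{
		{Title: "A", Tags: []string{"go", "hugo", "simhash"}},
		{Title: "B", Tags: []string{"go"}},
		{Title: "C", Tags: []string{"go", "hugo"}},
		{Title: "D", Tags: []string{"erlang"}},
	}
	for _, p := range relatedByTags(all[0], all) {
		fmt.Println(p.Title)
	}
}
```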

bep · October 3, 2014, 10:12am

Just tested: Adding a stop-word filter makes the simhash much more sane. But I’m still not convinced …

spf13 · October 3, 2014, 4:57pm

@bjornerik, Thanks so much for taking an interest. You’ve pulled together some great ideas and effort around this.

My thoughts on this were to take a simple approach much like you have here. I’d suggest a single tweak that may help improve the results.

So the problem you have to deal with isn’t that easy. Different people use tags (taxonomies) and content in different ways. I think it’s pretty common for a website to have a lot of content assigned to the same tags. People often write about the same things over and over. Unfortunately, a taxonomy-only approach will have trouble differentiating things here, and I feel it would come up with subpar results.

I can think of a few different things to consider.

- Taxonomies
- Date
- Content
- Title

I believe that’s enough to cover all bases. I’ve thought through a few scenarios and all would be covered by those 4.

The change I would make would be to compare these 4 things, but add a weight to each. I would also allow each taxonomy to have its own weight. All weights are relative; higher == more important.

For example, the default may be series (a taxonomy) 8, date 3, title 0, content 0. This would likely give me items in the same series and with the dates closest to my post.

Another example may be category 10, tags 8, content 4, date 1. This would likely give me content with the same category and similar tags, but with similar content as well.

The nice thing about this approach is that we can provide sane defaults and allow people to easily tweak it through a few settings in the config file.
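A minimal sketch of how such relative weights could combine into a single score. Everything here is an assumption for illustration: the Page and Weights types, the overlap measure and the date decay are not Hugo APIs, and the title and content terms are left out (they would be further weighted terms in the same sum).

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Page is a hypothetical, minimal stand-in for a site page.
type Page struct {
	Title      string
	Date       time.Time
	Taxonomies map[string][]string // e.g. "series" -> ["simhash"]
}

// Weights holds the relative weights; higher means more important.
type Weights struct {
	Taxonomies map[string]float64 // one weight per taxonomy
	Date       float64
}

// overlap returns the share of a's terms that also occur in b, in [0, 1].
func overlap(a, b []string) float64 {
	if len(a) == 0 {
		return 0
	}
	set := make(map[string]bool, len(b))
	for _, t := range b {
		set[t] = true
	}
	n := 0
	for _, t := range a {
		if set[t] {
			n++
		}
	}
	return float64(n) / float64(len(a))
}

// dateCloseness is 1 for identical dates and decays towards 0
// as the dates drift apart.
func dateCloseness(a, b time.Time) float64 {
	days := math.Abs(a.Sub(b).Hours()) / 24
	return 1 / (1 + days)
}

// score sums the weighted similarity of b to a.
func score(a, b Page, w Weights) float64 {
	s := w.Date * dateCloseness(a.Date, b.Date)
	for name, weight := range w.Taxonomies {
		s += weight * overlap(a.Taxonomies[name], b.Taxonomies[name])
	}
	return s
}

func main() {
	day := func(d int) time.Time {
		return time.Date(2014, time.October, d, 0, 0, 0, 0, time.UTC)
	}
	// The "series 8, date 3" defaults from the example above.
	w := Weights{Taxonomies: map[string]float64{"series": 8}, Date: 3}
	series := map[string][]string{"series": {"simhash"}}
	post := Page{Title: "Part 2", Date: day(3), Taxonomies: series}
	sameSeries := Page{Title: "Part 1", Date: day(1), Taxonomies: series}
	sameDay := Page{Title: "Unrelated", Date: day(3)}
	fmt.Printf("%.2f %.2f\n", score(post, sameSeries, w), score(post, sameDay, w))
}
```

With these weights, the earlier post from the same series scores higher than an unrelated post published on the same day.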

bep · November 14, 2014, 7:45pm

I think I will revisit this now. I totally agree with your four point list, @spf13 - and it should be weighted.

I have had this in the back of my mind, and hopefully that deep, unconscious thinking has brought me closer to a good solution.

Some time ago I even created a Git repo for this:

Go sherlock. I saw the “Sherlock Problem” presented by Joe Armstrong, the guy behind Erlang.

How easy it is to generalize this into a lib and then pull it into Hugo remains to be seen.

Any input on this, ideas, tech tips etc… Please share!

DerekPerkins · April 24, 2015, 6:43pm

Has any progress been made on this?

bep · April 24, 2015, 7:38pm

Not from me.

bbalet · October 14, 2015, 1:01pm

Hi,

I’m the author of the standalone prototype. I was busy on another project, so my Go is a bit rusty.

I’ve tried to patch Hugo as follows:

- Add a private fingerprint field to the Page struct.
- Expose a new field listing related pages (RelatedPages []*Page).
- Add a func on the Page struct that lists the related pages (findRelatedPages), plus Less/Swap funcs for the sort package.
- In func (s *Site) CreatePages(), after the pages have been built, call findRelatedPages for each page.
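The steps above can be sketched roughly as follows. This is not the actual patch: Page is a stripped-down stand-in for Hugo's struct, the fingerprints are assumed to be precomputed, and byDistance is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"math/bits"
	"sort"
)

// Page is a hypothetical, stripped-down stand-in for Hugo's Page struct.
type Page struct {
	Title        string
	fingerprint  uint64  // private simhash of the content, assumed precomputed
	RelatedPages []*Page // exposed to templates
}

// byDistance sorts pages by Hamming distance to a reference fingerprint.
type byDistance struct {
	ref   uint64
	pages []*Page
}

func (b byDistance) Len() int      { return len(b.pages) }
func (b byDistance) Swap(i, j int) { b.pages[i], b.pages[j] = b.pages[j], b.pages[i] }
func (b byDistance) Less(i, j int) bool {
	return bits.OnesCount64(b.ref^b.pages[i].fingerprint) <
		bits.OnesCount64(b.ref^b.pages[j].fingerprint)
}

// findRelatedPages fills p.RelatedPages with every other page, closest first.
func (p *Page) findRelatedPages(all []*Page) {
	p.RelatedPages = p.RelatedPages[:0]
	for _, q := range all {
		if q != p {
			p.RelatedPages = append(p.RelatedPages, q)
		}
	}
	sort.Stable(byDistance{ref: p.fingerprint, pages: p.RelatedPages})
}

func main() {
	pages := []*Page{
		{Title: "A", fingerprint: 0xF0},
		{Title: "B", fingerprint: 0xF1}, // 1 bit away from A
		{Title: "C", fingerprint: 0x0F}, // 8 bits away from A
	}
	// Stands in for the loop run once all pages have been built.
	for _, p := range pages {
		p.findRelatedPages(pages)
	}
	for _, r := range pages[0].RelatedPages {
		fmt.Println(r.Title)
	}
}
```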

I don’t know if it is ugly or not. It may be incomplete, because one may be interested in building this list from a node as well as from the entire site (these functions are fast enough for both use cases). But it is working and sufficient for me. In the single template, I just need to list the pages as follows:

{{ range $i, $val := .RelatedPages }}
    {{if lt $i 5}}<li>{{ $val.Title }} {{ $val.Permalink }}</li>{{end}}
{{ end }}

Are you still interested in this feature, or in a PR?

spf13 · October 15, 2015, 4:00am

Definitely still interested in the feature.

bbalet · October 19, 2015, 9:24am

I took into account @bep’s remark about stop words. I created a package, https://github.com/bbalet/stopwords, that removes stop words in many languages. It improves the accuracy of the SimHash algorithm.

I’ve got a question. As far as I understand, the language code is not duplicated on the Page struct. In multilingual sites, however, it would be logical to have one language code per page and to initialize this field with the website language by default.

bep · October 19, 2015, 10:03am

As this is fairly performance and memory critical (at least for Hugo), I would have provided a lookup func instead, maybe backed by a Bloom filter. Then you can let the caller decide what “a word is”.
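One possible reading of the lookup-func suggestion, as a minimal sketch. The IsStopWord name, its signature and the tiny word list are assumptions for illustration, and a plain map stands in for whatever backing structure is chosen.

```go
package main

import "fmt"

// stopWords is a tiny illustrative set, keyed by language code.
// It is not a real stop-word list.
var stopWords = map[string]map[string]struct{}{
	"en": {"the": {}, "a": {}, "is": {}, "of": {}},
}

// IsStopWord reports whether word is a stop word in the given language.
// Tokenizing the text is left entirely to the caller.
func IsStopWord(lang, word string) bool {
	_, ok := stopWords[lang][word]
	return ok
}

func main() {
	// The caller decides what a word is; here the tokens are given directly.
	tokens := []string{"the", "sherlock", "problem", "is", "hard"}
	var kept []string
	for _, t := range tokens {
		if !IsStopWord("en", t) {
			kept = append(kept, t)
		}
	}
	fmt.Println(kept)
}
```

With this shape, the library only answers membership questions, so no intermediate strings need to be built on behalf of the caller.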

bbalet · October 19, 2015, 10:36am

I’m not sure I understand; please elaborate on what “a word is”.

bep · October 19, 2015, 10:41am

It wasn’t the most important part of what I said, but you have defined a word as something divided by " " – this isn’t the case for CJK languages.

bbalet · October 19, 2015, 11:05am

In my opinion, this is the most important part (I use the Khmer language, so I do know this problem). I may be wrong, but I used the regular expression [\pL-_']+ to break words. It means that a word is composed of Unicode letters, hyphens, underscores and apostrophes (spaces and word breaks are not letters). The space character " " is used to explicitly separate words in the generated text content (we can use any word separator with SimHash).
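A minimal sketch of word splitting with a regular expression of this kind, using Go's regexp package. In this sketch the hyphen is moved to the end of the character class so that it is unambiguously a literal; the words helper is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// wordRE matches runs of Unicode letters, underscores, apostrophes
// and hyphens. Digits, spaces and other punctuation end a word.
var wordRE = regexp.MustCompile(`[\pL_'-]+`)

// words returns the word tokens found in text, in order.
func words(text string) []string {
	return wordRE.FindAllString(text, -1)
}

func main() {
	for _, w := range words("Hugo's related-content idée, 2015!") {
		fmt.Println(w)
	}
}
```

Note that a run of letters with no separator, as in unspaced CJK text, comes back as a single token, which is the concern raised above.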

I’ll benchmark the bloom filter approach.

bep · October 19, 2015, 12:09pm

My main point is that your approach has room for improvement in terms of performance.

bbalet · October 19, 2015, 1:22pm

I’ve benchmarked the Bloom filter approach, but it doesn’t improve performance (tested with go test bench with the default value of 100). I think this is due to two factors:

- Bloom filters are fast for lookups in a large set, but they are costly to build (even if built only once). So the overall gain is small when m is about 300.
- In the first version I used maps, and Go builds a hash map for faster key lookup.

Regarding memory consumption, I could initialize the map the first time it is used, but the gain is minimal (each set is around 300 keys) and amounts to a few KB.

On my computer (i3-2120 @ 3.30GHz), the overall process takes 2s for 100 articles. If you don’t have any other suggestions, we could make it optional (as is the case for Jekyll and its lsi option).

bep · October 19, 2015, 3:25pm

I was just pointing out that the stopword API is inefficient from Hugo’s point of view (the Bloom filter remark derailed the discussion, and you are right about that being a bad choice here). So, if you have a solution that covers the requirements given earlier in this thread, a pull request is the next step.

bbalet · October 19, 2015, 4:07pm

I am trying a trie approach suggested by a SW user. It reduces the time spent in the stopwords API by 60%. A stopwords API (whichever API you choose) is needed if you want to improve the accuracy of any natural text processing algorithm.

The only suggestion I have is stopwords + simhash, and I can reduce the overall processing time to 1 second on my reference computer (i3-2120 @ 3.30GHz).

I’ve got some unanswered questions for the PR:

- langCode on the Page struct.
- What is the best place to call the feature?
- Should we make it optional, as in Jekyll for LSI?

bep · October 19, 2015, 5:56pm

As I see it:

- You have a byte slice and want word tokens.
- In your current implementation you convert the byte slice to a string (memory alloc), then do several regexp passes on that string to split the words, filter out the stop words, then concatenate the words into a string again.
- The string is then … split?

Keep it sitewide for now. nn-NO would be my value: Norwegian Nynorsk.

Not sure, but since your algorithm is O(n^2), I guess more CPUs in parallel would be good.

It should be configurable (the different weights), and everything should probably be off by default.
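On the parallelism point, a minimal sketch of spreading the quadratic pairwise comparison over all CPUs with goroutines. The nearest function is hypothetical and works on precomputed 64-bit fingerprints; it is not taken from the prototype.

```go
package main

import (
	"fmt"
	"math/bits"
	"runtime"
	"sync"
)

// nearest returns, for each fingerprint, the index of the closest other
// fingerprint by Hamming distance (-1 if there is no other fingerprint).
// The total work is still O(n^2), but the outer loop is shared by workers.
func nearest(fps []uint64) []int {
	out := make([]int, len(fps))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				best, bestDist := -1, 65 // 65 exceeds any 64-bit distance
				for j := range fps {
					d := bits.OnesCount64(fps[i] ^ fps[j])
					if j != i && d < bestDist {
						best, bestDist = j, d
					}
				}
				out[i] = best // each index is written by exactly one worker
			}
		}()
	}
	for i := range fps {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	fmt.Println(nearest([]uint64{0xF0, 0xF1, 0x0F, 0x0E}))
}
```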

bbalet · October 19, 2015, 6:27pm

I will stop here. Bye and good luck for your project.

Jura · December 8, 2015, 9:42am

I was wondering what the status of this enhancement is. Has there been progress?
