WordCount includes HTML and Markdown tags as words. Is there a way to count the words after stripping all tags?

I noticed that {{ .WordCount }} is including tags as words.

For example this post reports back 5 words:

### Hello world

### Interesting

It’s counting each ### as 1 word. I noticed it does the same for other tags too.

Is this worth opening up a bug report and is there a way to configure Hugo to parse the content into HTML, then strip all of the tags and perform a word count on that result? That could give a more accurate count.

Weirdly enough, {{ .Content | plainify | countwords }} reports 5, even though plainify is supposed to remove HTML tags. It transforms ### into # but doesn’t remove it. I couldn’t find anything in the docs about converting Markdown to HTML through a function call, or about stripping Markdown syntax with a separate function.

It doesn’t. What word count does (this is a little simplified as we have special handling of CJK languages etc.) is:

  1. Remove any markup from the generated HTML (using transform.Plainify)
  2. Split the result into words.
  3. Count the words.

It doesn’t.

What mechanism is allowing the word count to be 5 in the above example when there are 3 words and 2 headings? If I remove both ### then it produces 3 words. If step 1 of your 3-step workflow happens, then ### should be rendered as <h3></h3> and then removed, right?

I am unable to reproduce the problem as described. Try it:

git clone --single-branch -b hugo-forum-topic-50748 https://github.com/jmooring/hugo-testing hugo-forum-topic-50748
cd hugo-forum-topic-50748
hugo server

Perhaps you’re doing something different.

When I clone your example and run it, it reports 3 words not 5.

I was able to determine what might be causing this though. I was able to get it to report 5 in your example with a custom hook I added for headings:

In your example, add this in layouts/_default/markup/render-heading.html:

{{ if eq .Level 3 }}
  <a name="#{{ .Anchor | safeURL }}"></a>
  <h{{ .Level }} id="{{ .Anchor | safeURL }}">
    <a href="#{{ .Anchor | safeURL }}" style="position: absolute; left: -12px;">#</a>
    {{ .Text | htmlUnescape | safeHTML }}
  </h{{ .Level }}>
{{ else }}
  <h{{ .Level }}>{{ .Text | htmlUnescape | safeHTML }}</h{{ .Level }}>
{{ end }}

The question now is, how can we get things to report the correct word count while using this hook? This is while using v0.128.2 (linux amd64) for reference.

The render hook fires after the count. For your example, you can subtract the number of headings to get the desired result:

<p>Word count: {{ sub .WordCount (.Fragments.HeadingsMap | len) }}</p>

You might prefer to use something like https://github.com/bryanbraun/anchorjs.

Or ignore the difference; it’s noise.

If the hook fires after the count, what makes it affect the count?

For your example, you can subtract the number of headings to get the desired result

If I do that with the hook included above which only applies to H3 and this content:

## Hello world

### Hello world

### Cool

#### Hmm

It reports a count of 4 words but the expected count would be 6.

Sorry, I wrote that backwards.

See my previous recommendations (JS or ignore).

Ah ok. Is that intended or a potential edge case oversight about how hooks apply to WordCount? I’m wondering if I should open an issue.

I was able to update your example to {{ sub .WordCount (findRE `(?s)<h3.*?>.*?` .Content | len) }}, which works for my specific case since I only add the named anchors to H3 headings. Do you happen to know if there’s a more efficient way to do that, since this requires scanning all of .Content again?

Maybe there is some way to make .Fragments.HeadingsMap return only H3s rather than all headings?

I’d like to avoid JS here.

It’s not something I’d spend any time on. It’s noise.

Having an accurate count is important to me.

I ended up going with:

  {{ $h3Count := 0 }}
  {{ range .Fragments.HeadingsMap }}
    {{ if (eq .Level 3) }}
      {{ $h3Count = add $h3Count 1 }}
    {{ end }}
  {{ end }}

  {{ sub .WordCount $h3Count }}

Given I only ever add linked anchors to H3 I suppose it works even though it’s not too pretty.

Whatever “accurate” means: I think the words that matter are not the ones in the Markdown but the ones in the target HTML.

I would not even say it’s a bug.

So the job is to count the words in the output (in your case a hidden #, but normally it would be actual words).

So in your last example I would expect 8, not your 4 or 6.

All one could do is argue that ‘#’ is not a word :wink:

A suggestion, in single.html or baseof.html:

{{ define "main" }}
  {{ page.Scratch.Set "h3"  0 }}
  {{ .Content }}
  <p>Word count: {{ sub .WordCount (page.Scratch.Get "h3")}}</p>
{{ end }}

And in your hook:

{{ if eq .Level 3 }}
   {{ page.Scratch.Add "h3" 1}}
...

Thanks, are scratch pads a more performant way to implement this sort of thing?

I’m not familiar with the Go implementation details. I guess that has to be measured. As usual, it will depend on the sources.

It replaces one loop over all headings and a duplicate H3 test with the handling of storage shared across templates.

Number of pages * number of headings… no idea.
