WordCount includes HTML and Markdown tags as words, is there a way to count the words after stripping all tags?

nickjanetakis · July 17, 2024, 12:58pm

I noticed that {{ .WordCount }} is including tags as words.

For example this post reports back 5 words:

### Hello world

### Interesting

It’s counting each ### as 1 word. I noticed it does the same for other tags too.

Is this worth opening up a bug report and is there a way to configure Hugo to parse the content into HTML, then strip all of the tags and perform a word count on that result? That could give a more accurate count.

Weirdly enough {{ .Content | plainify | countwords }} reports 5 but plainify is supposed to remove HTML tags. It transforms ### into # but doesn’t remove it and I couldn’t find a reference in the docs to convert Markdown to HTML through a function call or a way to strip Markdown syntax as a different function call.

bep · July 17, 2024, 1:36pm

It doesn’t. What word count does (this is a little simplified as we have special handling of CJK languages etc.) is:

1. Remove any markup from the generated HTML (using transform.Plainify).
2. Split the result into words.
3. Count the words.

nickjanetakis · July 17, 2024, 4:25pm

It doesn’t.

What mechanism is allowing the word count to be 5 in the above example when there’s 3 words and 2 headings? If I remove both ### then it produces 3 words. If step 1 happens from your 3 step workflow then ### should get generated into <h3></h3> and then removed right?

jmooring · July 17, 2024, 5:27pm

I am unable to reproduce the problem as described. Try it:

git clone --single-branch -b hugo-forum-topic-50748 https://github.com/jmooring/hugo-testing hugo-forum-topic-50748
cd hugo-forum-topic-50748
hugo server

Perhaps you’re doing something different.

nickjanetakis · July 17, 2024, 5:38pm

When I clone your example and run it, it reports 3 words not 5.

I was able to determine what might be causing this though. I was able to get it to report 5 in your example with a custom hook I added for headings:

In your example, add this in layouts/_default/markup/render-heading.html:

{{ if eq .Level 3 }}
  <a name="#{{ .Anchor | safeURL }}"></a>
  <h{{ .Level }} id="{{ .Anchor | safeURL }}">
    <a href="#{{ .Anchor | safeURL }}" style="position: absolute; left: -12px;">#</a>
    {{ .Text | htmlUnescape | safeHTML }}
  </h{{ .Level }}>
{{ else }}
  <h{{ .Level }}>{{ .Text | htmlUnescape | safeHTML }}</h{{ .Level }}>
{{ end }}

The question now is, how can we get things to report the correct word count while using this hook? This is while using v0.128.2 (linux amd64) for reference.

jmooring · July 17, 2024, 5:55pm

The render hook fires after the count. For your example, you can subtract the number of headings to get the desired result:

<p>Word count: {{ sub .WordCount (.Fragments.HeadingsMap | len) }}</p>

You might prefer to use something like https://github.com/bryanbraun/anchorjs.

Or ignore the difference; it’s noise.

nickjanetakis · July 17, 2024, 6:07pm

If the hook fires after the count, what makes it affect the count?

For your example, you can subtract the number of headings to get the desired result

If I do that with the hook included above which only applies to H3 and this content:

## Hello world

### Hello world

### Cool

#### Hmm

It reports a count of 4 words but the expected count would be 6.

jmooring · July 17, 2024, 6:09pm

Sorry, I wrote that backwards.

See my previous recommendations (JS or ignore).

nickjanetakis · July 17, 2024, 6:12pm

Ah ok. Is that intended or a potential edge case oversight about how hooks apply to WordCount? I’m wondering if I should open an issue.

I was able to update your example to {{ sub .WordCount (findRE `(?s)<h3.*?>.*?</h3>` .Content | len) }} which works for my specific case since I only add the named anchors to H3 headings. Do you happen to know if there’s a more efficient way to do that since this requires scanning all of the .Content again?

Maybe some way to augment .Fragments.HeadingsMap to only return H3s not all of headings?

I’d like to avoid JS here.

jmooring · July 17, 2024, 6:12pm

It’s not something I’d spend any time on. It’s noise.

nickjanetakis · July 17, 2024, 6:23pm

Having an accurate count is important to me.

I ended up going with:

  {{ $h3Count := 0 }}
  {{ range .Fragments.HeadingsMap }}
    {{ if (eq .Level 3) }}
      {{ $h3Count = add $h3Count 1 }}
    {{ end }}
  {{ end }}

  {{ sub .WordCount $h3Count }}

Given I only ever add linked anchors to H3 I suppose it works even though it’s not too pretty.

irkode · July 17, 2024, 6:27pm

Whatever “accurate” means, I think the words that matter are not the ones in the Markdown but the ones in the target HTML.

I would not even say it’s a bug:

So the job is to count the words (in your case a hidden #, but normally it would be words).

So in your last example I would expect 8, and neither your 4 nor your 6.

All one could do is argue that ‘#’ is not a word.

Suggestion, in single.html or baseof:

{{ define "main" }}
  {{ page.Scratch.Set "h3"  0 }}
  {{ .Content }}
  <p>Word count: {{ sub .WordCount (page.Scratch.Get "h3")}}</p>
{{ end }}

in your hook

{{ if eq .Level 3 }}
   {{ page.Scratch.Add "h3" 1}}
...

nickjanetakis · July 17, 2024, 9:58pm

Thanks, are scratch pads a more performant way to implement this sort of thing?

irkode · July 18, 2024, 5:30am

I’m not familiar with the Go implementation details. I guess that has to be measured. As usual, it will depend on the sources.

It replaces one loop over all headings and a duplicate H3 test with handling of cross-partial storage.

Number of pages * number of headings… no idea.
