Disable unnecessary entity references

I find most HTML entities unnecessary on my site because the following three statements are true:

  1. My website has files encoded with UTF-8
  2. My server sends charset=utf-8 in the content-type HTTP header for all text/* mimetypes
  3. The first <meta> tag in my HTML sets charset="utf-8".

Aside from characters that are reserved in HTML (e.g. angle brackets), I’d like to disable the creation of entity references. Does anybody know if it’s possible to do this with Hugo + Goldmark?

Perhaps you need to use the htmlUnescape function for your purposes. See:

htmlUnescape looks like an all-or-nothing approach. I’m looking to keep entity references that are required like & or <; however, I’d like to have Hugo stop generating unnecessary references for characters like curly quotes.

I could get a part of the way there if there was a way to override the substitutions made by Goldmark’s Typographer extension from Hugo.

I think my only options at the moment are:

  1. Submit a patch to Goldmark that adds an option to disable unnecessary HTML entities, and then submit a patch to Hugo to use that option
  2. Make my own non-upstreamed patched Goldmark with different behavior and patch Hugo to use that instead.

I’m thinking about going with option 2 in the short run before considering option 1, unless there’s another way.

Eliminating unnecessary HTML entities is also part of the Google HTML/CSS style guide, FWIW.

config.toml

[markup.goldmark.extensions]
  typographer = false

Again, I want the Typographer extension to do things like insert
curly quotes. I just don’t want it to do so using entity references.

I apologize. I completely misunderstood.

Although this is an extra step (post-build), perhaps it would suffice while waiting for upstream changes…

The html-xml-utils package contains the hxunent utility which “replaces HTML predefined character entities by UTF-8.” If you run it with the -b option then the &lt; &gt; &quot; &apos; &amp; entities are retained.

bash script
#!/usr/bin/env bash

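main() {
  declare file
  declare files
  declare publish_dir=public
  declare temp_dir
  # Scratch directory for converted output, so no file is read and written at the same time.
  temp_dir=$(mktemp -d)
  # Collect HTML paths relative to public/, NUL-delimited (-printf is specific to GNU find).
  readarray -d '' files < <(find "${publish_dir}" -type f -name "*.html" -printf "%P\0")
  for file in "${files[@]}"; do
    mkdir -p "${temp_dir}/$(dirname "${file}")"
    # -b leaves &lt; &gt; &quot; &apos; &amp; untouched.
    hxunent -b "${publish_dir}/${file}" > "${temp_dir}/${file}"
    mv "${temp_dir}/${file}" "${publish_dir}/${file}"
  done
  rm -rf "${temp_dir}"
}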

set -euo pipefail
main "$@"

Awesome; I’ll take a look and maybe patch my site’s Makefile.

bash script
#!/usr/bin/env bash

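main() {
  declare file
  declare files
  declare publish_dir=public
  declare temp_dir

  # Build into a scratch directory so public/ only ever receives processed output.
  temp_dir=$(mktemp -d)
  hugo --destination "${temp_dir}"
  # Copy the whole build first; otherwise non-HTML assets (CSS, images, feeds) never reach public/.
  mkdir -p "${publish_dir}"
  cp -a "${temp_dir}/." "${publish_dir}/"
  # Overwrite each HTML file with its decoded version; -b leaves &lt; &gt; &quot; &apos; &amp; untouched.
  readarray -d '' files < <(find "${temp_dir}" -type f -name "*.html" -printf "%P\0")
  for file in "${files[@]}"; do
    mkdir -p "${publish_dir}/$(dirname "${file}")"
    hxunent -b "${temp_dir}/${file}" > "${publish_dir}/${file}"
  done
  rm -rf "${temp_dir}"
}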

set -euo pipefail
main "$@"

Thanks; this looks like a good starting point, though I’d rather use
POSIX sh ;).
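
For what it’s worth, a POSIX sh variant of the in-place approach might look something like the sketch below. It avoids readarray, process substitution, and the GNU-only -printf by letting find hand the file names to a small inline script. The function name unent_tree is made up for illustration, and the sketch assumes hxunent is on the PATH and that the site has already been built into the directory passed as the first argument (public by default).

```shell
#!/bin/sh
# Sketch only: unent_tree is a hypothetical helper name, not part of
# html-xml-utils or Hugo. Assumes hxunent is installed and on the PATH.

# Decode unnecessary entities in every *.html file under a directory.
# $1: directory to process (defaults to "public").
unent_tree() {
  dir=${1:-public}
  # "-exec ... {} +" passes batches of file names to the inline script,
  # which keeps names containing spaces intact without bash arrays.
  find "$dir" -type f -name '*.html' -exec sh -c '
    for f do
      # Write next to the original, then rename over it on success.
      tmp="$f.tmp.$$"
      if hxunent -b "$f" > "$tmp"; then
        mv "$tmp" "$f"
      else
        rm -f "$tmp"
        exit 1
      fi
    done
  ' sh {} +
}
```

Calling unent_tree public after running hugo would then rewrite the HTML files in place. The temporary file name ends in a process ID rather than .html, so find will not pick it up as a page if it happens to see it.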