Issues with Template Parsing: HTML element opening/closing tags misread as HTML entities in several situations

Rumperuu · July 21, 2023, 4:35pm

I have a shortcode for citations, which just calls a partial and removes leading and trailing spacing:

{{- /**/ -}}{{- partial "cite.html" . -}}{{- /**/ -}}

The partial itself looks like this (simplified):

<cite>{{- .Params.title | markdownify | safeHTML -}}</cite>

And then I have another shortcode for blockquotes (again simplified):

<blockquote>
    {{ .Inner | markdownify | safeHTML }}
</blockquote>

However, if I try to call the citation shortcode within the blockquote .Inner, the resulting HTML gets escaped, and then wrapped into a <pre>.

Here’s an example of the Markdown content file:

{{< blockquote >}}
  This is a **Markdown** test.
  
  This is an <b>HTML</b> test.
  
  This is a {{< q >}}shortcode with closing shortcode{{< /q >}} test.
  
  This is a normal shortcode test: {{< abbr "html" >}}.

  This is a shortcode-with-a-partial test: {{< cite title="test" >}}.
{{< /blockquote >}}

…but the resulting HTML looks like:

  <blockquote>
    <p>This is a <strong>Markdown</strong> test.</p>
<p>This is an <b>HTML</b> test.</p>
<p>This is a <q>shortcode with closing shortcode</q> test.</p>
<p>This is a normal shortcode test: <abbr title="Hypertext Markup Language">HTML</abbr>.</p>
<p>This is a shortcode-with-a-partial test: &lt;cite class=&ldquo;cite&rdquo;</p>
<pre><code>      itemscope 
      itemprop=&quot;citation&quot;
      itemtype=&quot;https://schema.org/CreativeWork&quot;
      &gt;&lt;span itemprop=&quot;name&quot;&gt;test&lt;/span&gt;&lt;/cite&gt;.
</code></pre>

  </blockquote>

If I change the blockquote shortcode to just .Inner | safeHTML, all of the shortcodes render correctly but the first line with the Markdown doesn’t. If I swap the order (.Inner | safeHTML | markdownify) the Markdown line renders correctly again, but the shortcode-with-a-partial gets escaped again. If I replace the whole pipeline with .Inner | .Page.RenderString I get the same result.

Hugo version: hugo v0.115.4-dc9524521270f81d1c038ebbb200f0cfa3427cc5+extended linux/amd64 BuildDate=2023-07-20T06:49:57Z VendorInfo=snap:0.115.4

jmooring · July 21, 2023, 5:33pm

Markdown (including any HTML within the markdown) indented by four or more characters is an indented code block, and will be wrapped with pre and code tags per the CommonMark specification.

Rumperuu · July 21, 2023, 6:38pm

Yes, and I’ve been caught out by that several other times, but in this case it’s only happening because the indented content isn’t being recognised as part of inline HTML, because the HTML tags have all been escaped.

The full opening tag of my citation partial:

<cite class="cite{{ with .Params.citeStyle }} cite--{{ . }}{{ end }}" 
          {{ with .Params.cite -}}cite="{{ . }}"{{- end }}
          itemscope 
          itemprop="citation" 
          itemtype="https://schema.org/
            {{- with .Params.schemaType -}}
                {{- . -}}
            {{- else -}}
                CreativeWork
            {{- end -}}"
          {{ if .Params.titleLang -}}
          lang="{{- .Params.titleLang -}}"
          title="{{- .Params.titleTr -}}"
          {{- end -}}
          >

By changing the second line to {{- with .Params.cite -}} cite="{{ . }}" {{- end -}}, I get the following HTML output:

  <blockquote 
    class="blockquote__body">
    <p>This is a <strong>Markdown</strong> test.</p>
<p>This is an <b>HTML</b> test.</p>
<p>This is a <q>shortcode with closing shortcode</q> test.</p>
<p>This is a normal shortcode test: <abbr
class="abbr"
title="Hypertext Markup Language">HTML</abbr>.</p>
<p>This is a shortcode-with-a-partial test: &lt;cite class=&ldquo;cite&quot;itemscope
itemprop=&ldquo;citation&rdquo;
itemtype=&ldquo;<a href="https://schema.org/CreativeWork%22">https://schema.org/CreativeWork&quot;</a>
&gt;<span itemprop="name">test</span></cite>.</p>

  </blockquote>

Still incorrect, but now without the <pre> red herring.

jmooring · July 21, 2023, 6:42pm

Can you boil this down to a minimal reproducible example?

Rumperuu · July 21, 2023, 9:13pm

I’ve narrowed down the issue, and it seems like it might relate to two other issues I’ve encountered before regarding how Hugo handles single <s and >s. Here is the MRE (run hugo serve and go to [your localhost]/test to view the example).

This particular issue is because the HTML tag is closed by a sole > on a new line, which the parser reads as Markdown blockquote syntax.

This can be worked around by using a no-op template construct (such as {{- /**/ -}}, per your example elsewhere), but for some reason that construct throws the following error if either of the -s is missing (e.g., {{ /**/ -}}):

parse failed unexpected "/" in command

If I really need the spacing on one side I can use a longer construct (e.g., {{ if 1 -}}{{- /**/ -}}{{ end -}}), but this makes my templates pretty messy.
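The comment behaviour described above can be reproduced outside Hugo with Go's standard text/template package, on which Hugo's template engine is based. The sketch below assumes Hugo's fork lexes comments the same way as the standard library. The helper names parses and render are hypothetical, and the {{ "" -}} construct is only a suggested shorter no-op that trims on one side, not something taken from the thread.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// parses reports whether Go's text/template accepts src.
func parses(src string) bool {
	_, err := template.New("t").Parse(src)
	return err == nil
}

// render parses and executes src with no data and returns the output.
func render(src string) string {
	var b strings.Builder
	t := template.Must(template.New("t").Parse(src))
	if err := t.Execute(&b, nil); err != nil {
		panic(err)
	}
	return b.String()
}

func main() {
	// A comment must sit flush against the delimiters...
	fmt.Println(parses("a{{/**/}}b"))
	// ...or against trim markers on both sides.
	fmt.Println(parses("a{{- /**/ -}}b"))
	// A plain space after "{{" puts the lexer in action mode, where "/" is invalid.
	fmt.Println(parses("a{{ /**/ -}}b"))
	// An empty string literal is an ordinary action, so it can trim on one side only.
	fmt.Printf("%q\n", render(`a {{ "" -}}   b`))
}
```

If the empty-string action behaves the same way in Hugo, it may be a tidier replacement for the {{ if 1 -}}{{- /**/ -}}{{ end -}} construct, since it keeps the whitespace on its left and trims only on its right.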

But, without knowing the details of how the parsing algorithm works, I would have expected the parser to already be in ‘HTML element’ mode because it will have already hit the opening tag, so it should be anticipating a closing > and not be reading it (new line or not) as an unrelated symbol. So this seems to be a bug, though I don’t know if it’s a Hugo, Goldmark or Go issue.

The other two issues seem to be because the parser struggles with < on their own, or immediately followed by {{, which both result in an &lt; instead of a <.

This means you can’t seem to have conditional element names within a single tag (as opposed to conditional tags; see layouts/shortcodes/foo.html). It does render the correct names, but within &lt; and &gt; entities rather than as HTML elements.
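The escaping of a < that is immediately followed by {{ can be reproduced with Go's standard html/template package alone, which suggests it comes from the template escaper rather than from Markdown. The sketch below assumes Hugo's fork behaves like the standard library here. The safeHTML function is a hypothetical stand-in for Hugo's function of the same name, and the data key Tag is invented for illustration.

```go
package main

import (
	"fmt"
	"html/template"
	"strings"
)

// funcs provides a stand-in for Hugo's safeHTML (an assumption for this sketch):
// it marks a string as trusted HTML so the escaper passes it through unchanged.
var funcs = template.FuncMap{
	"safeHTML": func(s string) template.HTML { return template.HTML(s) },
}

// render parses src with html/template and executes it against data.
func render(src string, data interface{}) string {
	var b strings.Builder
	t := template.Must(template.New("t").Funcs(funcs).Parse(src))
	if err := t.Execute(&b, data); err != nil {
		panic(err)
	}
	return b.String()
}

func main() {
	data := map[string]string{"Tag": "div"}

	// The literal "<" is not followed by a tag name in the template text,
	// so the escaper treats it as plain text and rewrites it.
	fmt.Println(render(`<{{ .Tag }}>hi</{{ .Tag }}>`, data))
	// Output: &lt;div>hi&lt;/div>

	// Building each whole tag as one trusted value avoids the bare "<".
	fmt.Println(render(`{{ printf "<%s>" .Tag | safeHTML }}hi{{ printf "</%s>" .Tag | safeHTML }}`, data))
	// Output: <div>hi</div>
}
```

In a Hugo template the second form would be written with the built-in safeHTML, for example {{ printf "<%s>" $tag | safeHTML }}, where $tag is whatever variable holds the element name.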

It also means you get misrendered comment blocks (see layouts/partials/copying.html). This doesn’t seem to have anything to do with Markdown, as it happens the same when I include the copyright comment block partial directly in the head of my page template. This issue also only replaces the opening < with &lt;, leaving the closing > as it is.

FWIW having spacing between the name of an element and its closing > seems standards-compliant, and whilst having a space between the opening < and the element name isn’t, in foo.html you can see that I’ve used -s to remove all whitespace.

jmooring · July 21, 2023, 9:18pm

I may have time to get into this over the weekend, but have you tested the markdown rendering with the reference implementation? https://spec.commonmark.org/dingus/

Rumperuu · July 21, 2023, 9:42pm

The issue does seem to arise when applying .Page.RenderString to the .Inner within the blockquote shortcode (it doesn’t occur within a Markdown-formatted blockquote). However, the only way I could replicate the nested blockquote result in that tool was with two >s. That would imply that the Hugo template parser is not only mistaking the closing > of the HTML element for an unrelated > to leave for the Markdown parser, but is also inserting an additional > from somewhere.

Here is my experimenting.

jmooring · July 24, 2023, 4:17am

TLDR

git clone --single-branch -b hugo-forum-topic-45391 https://github.com/jmooring/hugo-testing hugo-forum-topic-45391
cd hugo-forum-topic-45391
hugo server

There are a number of issues to address.

1. Call the outer shortcode using the {{% %}} notation. Call the inner shortcode using the {{< >}} notation.
2. The cite element may contain only global HTML attributes, and the cite attribute is not global. Including a cite attribute within a cite element is invalid HTML.
3. HTML validators expect the itemscope and itemprop attributes to be defined on a parent element. Wrap the cite element within a span element.

I fixed the “cite” shortcode, removing white space where needed regardless of which parameters you provide when calling the shortcode. But this approach gets very messy, very quickly. Again, per items 2 and 3 above, this shortcode generates invalid HTML.

In the “good-cite” shortcode, I have rectified the problems with HTML validation, and used a cleaner approach to generating the cite element by iterating over a map of attributes. For example:

  <cite
    {{- range $k, $v := $attrs }}
      {{- if $v }}
        {{- printf " %s=%q" $k $v | safeHTMLAttr}}
      {{- end }}
    {{- end -}}
  >

This way you don’t need to worry about white space removal or a bunch of conditional blocks when setting element attributes.
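The attribute-map approach above can be tried in isolation with Go's standard html/template package. This is a minimal sketch, assuming Hugo's fork treats tag contexts the same way as the standard library. The safeHTMLAttr function is a hypothetical stand-in for Hugo's function of the same name, and the helper renderCite and the sample attribute values are invented for illustration. The template is written on one line so that no whitespace trimming is needed.

```go
package main

import (
	"fmt"
	"html/template"
	"strings"
)

// funcs provides a stand-in for Hugo's safeHTMLAttr (an assumption for this sketch):
// it marks a string as a trusted attribute so the escaper emits it unchanged.
var funcs = template.FuncMap{
	"safeHTMLAttr": func(s string) template.HTMLAttr { return template.HTMLAttr(s) },
}

// citeTmpl ranges over a map of attributes and skips any with an empty value.
const citeTmpl = `<cite{{ range $k, $v := . }}{{ if $v }}{{ printf " %s=%q" $k $v | safeHTMLAttr }}{{ end }}{{ end }}>test</cite>`

// renderCite renders a cite element with the given attributes.
func renderCite(attrs map[string]string) string {
	var b strings.Builder
	t := template.Must(template.New("cite").Funcs(funcs).Parse(citeTmpl))
	if err := t.Execute(&b, attrs); err != nil {
		panic(err)
	}
	return b.String()
}

func main() {
	// Map keys are visited in sorted order, so the attribute order is stable.
	fmt.Println(renderCite(map[string]string{
		"class": "cite",
		"lang":  "", // empty values are omitted
		"title": "A test",
	}))
	// Output: <cite class="cite" title="A test">test</cite>
}
```

Because every attribute is emitted with its own leading space and the closing > follows directly, the opening tag ends up on a single line with no stray whitespace for the Markdown renderer to misread.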

While fixing this I did not encounter any bugs or unexpected behavior. Mixing HTML with markdown is tricky—there are seven different start/end conditions for detecting/rendering HTML blocks embedded within markdown. See https://spec.commonmark.org/0.30/#html-blocks. GitHub Flavored Markdown (GFM) is based on this specification as well.

Rumperuu · July 31, 2023, 7:09pm

Thanks very much for the tips on the cite shortcode.

I’ve done some more research with the CommonMark spec in hand, though, and it doesn’t seem to explain any of the three HTML rendering bugs I’ve detailed in my MRE; I’ve added the results of my subsequent investigations to the repo (and tidied things up a bit), but as far as I can tell two of the three issues seem to be Markdown behaviour contrary to the spec.

Here’s a screenshot of my MRE test page in case anybody reading doesn’t want to clone the repo:

Issues

Issue #1: < escaped inside HTML comment block (I don’t think this is Markdown-related)
Issue #2: Conditionally-named HTML element escaped
Issue #3: HTML element with newline before closing > escaped, but in such a chaotic way that it seems to even break Markdown’s <pre> rendering
