Anyone have a good smartypants-for-markdown solution?

With Blackfriday officially deprecated, I’m looking to convert my 4200-entry blog over to Goldmark, but the “typographer” extension is still terrible. Post-processing the site with one of the standard “smartypants” scripts takes at least twice as long as the Hugo run, so my current plan is to pre-smarten all of the Markdown source files.

I’ve hacked together a script that mostly works, but I was wondering if anyone else has come up with a more robust, reliable solution than this:

#!/usr/bin/env bash
#
# add smart quotes to Hugo Markdown source files, using the 
# reference implementation of CommonMark's CLI tool:
#     https://github.com/commonmark/commonmark-spec
# Notes:
#   - assumes TOML front matter
#   - converts footnote-style links to inline
#   - normalizes ordered/unordered list formatting
#
# WARNING: possible site-breaking changes:
#   ! rarely, cmark breaks *italic* and **bold** by backslashing
#     the asterisks
#   ! breaks description/definition-list formatting by reflowing it
#   - adds blank line before shortcode that starts a line
#   - adds blank line after shortcode that ends a line
#   - adds usually-gratuitous backslashes to [, ], !, etc.
#   - converts HTML entities (&nbsp;, &rsquo;, &#8203;, etc.) into Unicode literals
#   - probably won't handle a "+++" line in body content

CMARK="cmark --to commonmark --width 70 --smart --unsafe"

for file in "$@"; do
    cat "$file" |
    # convert front matter to HTML comment, so it all gets ignored
    sed -e '1 s/^\+\+\+$/<!-- _FMPLUS_/' \
        -e 's/^\+\+\+$/_FMPLUS_ -->/' |
    # convert shortcodes to HTML comments, to keep it from
    # escaping their arguments
    sed -e 's/{{</<!-- _SC1OPEN_/g' \
        -e 's/>}}/_SC1CLOSE -->/g' \
        -e 's/{{%/<!-- _SC2OPEN_/g' \
        -e 's/%}}/_SC2CLOSE -->/g' |
    # pass through commonmark
    $CMARK |
    # restore shortcodes
    sed -e 's/<!-- _SC1OPEN_/{{</g' \
        -e 's/_SC1CLOSE -->/>}}/g' \
        -e 's/<!-- _SC2OPEN_/{{%/g' \
        -e 's/_SC2CLOSE -->/%}}/g' |
    # restore front matter
    sed -e 's/^.*_FMPLUS_.*$/+++/' > "$file.new"
    # overwrite original (you have source control, right?)
    mv "$file.new" "$file"
done
exit 0

This script takes about 35 seconds to run on my laptop, and as per my comments, there are some issues I’ll have to correct by hand. Not so bad when you have a few dozen blog posts, but the diff for my site runs to 184,000 lines!
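
On the last warning in the header (a "+++" line in body content), one possible way around it is to let awk convert only the opening and closing front-matter delimiters. This is a minimal sketch, not tested against a real site; `protect_front_matter` is a made-up name, and the placeholder strings are the ones the script already uses:

```shell
# Wrap TOML front matter in an HTML comment, touching only the "+++" on
# line 1 and the next "+++" after it. Later "+++" lines pass through as-is.
protect_front_matter() {
    awk '
        NR == 1 && $0 == "+++" { in_fm = 1; print "<!-- _FMPLUS_"; next }
        in_fm && $0 == "+++"   { in_fm = 0; print "_FMPLUS_ -->"; next }
        { print }
    '
}

# Demo: the third "+++" belongs to the body and is left alone.
printf '%s\n' '+++' 'title = "x"' '+++' 'body' '+++' | protect_front_matter
```

It would stand in for the first sed in the pipeline. The restore step should work unchanged, since a body "+++" never receives the _FMPLUS_ marker. What cmark then does to a bare "+++" line in the body is a separate question.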

-j

You probably already know Pandoc, but have you tried it with extensions that handle those issues? You would convert from Markdown to Markdown, and you could even write some Lua filters for cases specific to your content.

Pandoc is very fast. Besides other, more academic uses, I regularly use it to convert a folder of Markdown-like annotations and personal texts to PDF. As in your case, this needs an intermediate step that handles specific formatting with filters. The Pandoc part usually takes a fraction of a second.

Big fan of Pandoc, but unfortunately its smart extension removes smart quotes when the output format is Markdown.
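
For what it's worth, the Pandoc manual describes `smart` as having the reverse effect on the Markdown writer (curly quotes come out straight), so turning it off on the output side may be enough to keep them. This is a rough sketch, assuming pandoc 2.0 or later; `smarten_with_pandoc` is just an illustrative name:

```shell
# Read Markdown with smart punctuation on, write Markdown with the `smart`
# extension off, so quotes are emitted as Unicode characters.
# --wrap=preserve keeps the source line breaks instead of reflowing.
smarten_with_pandoc() {
    pandoc --from markdown+smart --to markdown-smart --wrap=preserve
}

# Demo on a single line; skipped when pandoc is not installed.
if command -v pandoc >/dev/null 2>&1; then
    printf '%s\n' '"Hello," she said.' | smarten_with_pandoc
fi
```

Pandoc knows nothing about TOML front matter or Hugo shortcodes, so those would presumably need the same placeholder treatment as in the cmark script, and it normalizes other Markdown constructs as well.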

If I’m not mistaken, Pandoc represents smart quotes internally as a specific element type. You could handle that type in a Lua filter, emitting the quote characters as literal strings in place of the element.

Yes, that’s right: it’s the Quoted type, and it even stores the kind of quote (single or double). You could then write a simple filter like:

-- I DIDN'T TEST THIS CODE
function Quoted (quotedText)
  -- quotetype is 'SingleQuote' or 'DoubleQuote'
  local quoteType = quotedText.quotetype
  local startQuote = '‘'
  local endQuote = '’'
  if quoteType == 'DoubleQuote' then
    startQuote = '“'
    endQuote = '”'
  end
  local content = quotedText.content
  -- wrap the quoted inlines in literal quote characters
  table.insert(content, 1, pandoc.RawInline('markdown', startQuote))
  table.insert(content, pandoc.RawInline('markdown', endQuote))
  return pandoc.Span(content)
end

You could improve that code by storing the quote literals in a map where keys are SingleQuote and DoubleQuote (possible values of quotedText.quotetype).

Just a quick update. Since there doesn’t seem to be a good canned solution, and no hint that Goldmark will ever get it right, I’ve started the tedious process of converting with my script, the latest version of which is here. 999 down, 3,248 to go!

Since my site is under source control, I’ve been doing it 100 at a time and diffing the results, which has allowed me to improve the script and fix the few errors introduced.
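
For anyone who wants to try the same batch-and-diff routine, here is roughly how the batching could be set up. It is only a sketch: `make_batches`, `all.txt`, the `batch.` prefix and `./smarten.sh` (standing in for the conversion script) are all made-up names, and it assumes no newlines in file names.

```shell
# Split a newline-separated list of files ($1) into lists of $2 lines each,
# written as ${3}aa, ${3}ab, and so on.
make_batches() {
    split -l "$2" "$1" "$3"
}

# Possible workflow, one batch at a time:
#   find content -name '*.md' | sort > all.txt
#   make_batches all.txt 100 batch.
#   tr '\n' '\0' < batch.aa | xargs -0 ./smarten.sh
#   git diff        # review, fix by hand, commit, then move on to batch.ab
```

Keeping the batch lists on disk makes it easy to see which files are already done and to rerun a single batch after tweaking the script.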

-j
