Generating custom text output from markdown with regular expressions

I’m working on a project where we want to make some of the content available in multiple formats (HTML, PDF, and plain text). I have the plain text version working ok but would like to manage it better.

Here is a draft article https://startwords.cdh.princeton.edu/issues/1/data-beyond-vision/ and here is the plain text version of the same article: https://startwords.cdh.princeton.edu/issues/1/data-beyond-vision/index.txt

The solution I came up with (after trying a few different options) is to apply a series of regular expressions to the markdown text for the page, since the markdown is the closest to the output we want. (I first tried applying it to the plain text content, but that lost important formatting, particularly with lists.)

Here’s the relevant text output template: https://github.com/Princeton-CDH/startwords/blob/master/themes/startwords/layouts/article/single.txt

I’m ok with this approach, but was hoping for a better way to manage the regular expressions. I thought I ought to be able to put them in a data file (where I could add comments to document them and more easily add new ones), but when I tried that I ran into two problems:

  1. The regular expressions that worked in the template didn’t seem to match anymore; is the escaping different?
  2. I think I need to use Scratch to store and update the modified content after each regex is applied, but that didn’t seem to work (although it was difficult to tell, since my regexes were not matching).

Thanks in advance for any input or advice.

Just my two cents.

I would try defining custom render hook templates:

{{ with  .OutputFormats.Get "html" -}}
- format for HTML
{{end}}
{{ with  .OutputFormats.Get "txt" -}}
- format for TXT
{{end}}

That way you can easily transform headings, links, and images for your text output.

data/replace.toml

# Arguments for replaceRE, applied iteratively to .RawContent.
# Patterns with backslashes must be encapsulated within single quotes.
# Use single backslashes to escape characters. Do not use double backslashes.
# Set skip = true to skip a particular replacement.

[[args]]
pattern = "How do we represent tangible objects"
replacement = "Hugo Forum Topic 27707"
comment = "Testing a simple string replacement."
skip = true

[[args]]
pattern = '!\[([^\]]+)\]\([^\)]+\)'
replacement = "[IMAGE: $1]"
comment = "Replace Markdown images with a placeholder containing their alt text."

[[args]]
pattern = '{{\<[^\>]+>}}'
replacement = ""
comment = "Remove shortcode calls that use angle-bracket delimiters."

[[args]]
pattern = '(?m)^#+ '
replacement = ""
comment = "Remove leading heading markers."

[[args]]
pattern = '(?m)(<\/?[^>]+>)'
replacement = ""
comment = "Remove inline HTML tags."

[[args]]
pattern = '(?m)^\[\^(\d+)\]: '
replacement = "$1. "
comment = "Turn footnote definitions into numbered notes."

[[args]]
pattern = '\[([^\]]+)\]\(([^\)]+)\)'
replacement = "$1 [URL: $2]"
comment = "Replace Markdown links with the link text followed by the URL."

[[args]]
pattern = '\*\*([^*]+)\*\*'
replacement = "$1"
comment = "Remove bold markers, keeping the text."

[[args]]
pattern = '{#[a-z-0-9]+}'
replacement = ""
comment = "Remove heading ID attributes."

[[args]]
pattern = '\[(.+)\]\(#.*\)'
replacement = "$1"
comment = "Replace links to in-page anchors with their link text."

themes/startwords/layouts/article/single.txt

⩩-----------------------------------------------------------------------------------⟩
|
|    ▄▄▓
| ]▓▓▀
|  ╙▀▓▄   ▀▀▀▓     ╫▓    ╙▀▀▀▓⌐^▀▀▓▌    ▓⌐            ▀█▄,  ╚▀▀▀▓  ╙▀▀▀█▄   ▄█▀
|     ▀▓    ▐▓    ╫▓▀▓   ,,,╓▓▀   ╟▌   j▓⌐ ▄▓µ    ,      ▓▄ ,,,╓▓▀      ╙▓⌐ ▓▄,
|      ▐    ▐▓   ╟▓  ▀▌  ▓▌└╟▄    ╟▌   j▓▄▓▀ ▀▓   ▓▌     ▓▀ ▓▀└▓▄       ]▓   ^╙▀▓
| «▄▄╗╩"    ▐▓  ]▓    ▓▌ ▓∩  ▓b   ╫▌    ▓▀    └▓M  ╙█▓▓█▀╙  ▓∩  ▓∩ ╗▄▄▓█▀`    ▄▓▀
|
⩩-----------------------------------------------------------------------------------⟩
|
|  {{ .Title }}
|
|  Authors:
|  {{ range .Params.Authors }}
|     {{ . }}{{ end }}
|
|  {{ .Permalink }}
|{{ if .Params.doi }}
|  doi:{{ .Params.doi }}{{ end }}
|{{ if .Params.tags }}
|  {{ range .Params.tags }}#{{ . }}{{ end }}{{ end }}
|
⩩-----------------------------------------------------------------------------------⟩
|
|  Issue {{ .Parent.Params.number }}: {{ upper .Parent.Params.theme }}
|  {{ .Parent.Permalink }}
|  {{ .Parent.Date.Format "January 2006" }}
|
⩩-----------------------------------------------------------------------------------⟩
{{- $rawContent := .RawContent -}}
{{- range .Site.Data.replace.args -}}
  {{- if not .skip -}}
    {{- $rawContent = $rawContent | replaceRE .pattern .replacement -}}
  {{- end -}}
{{- end -}}
{{ $rawContent }}

See https://github.com/Princeton-CDH/startwords/pull/115

Thanks for the suggestion! I wasn’t aware of custom render hook templates. They look pretty powerful, but they don’t support all of the features we need to handle.

Thanks for the solution and the pull request! This is exactly what I was hoping for.

Is there somewhere I should have looked to determine the right syntax for quotes, backslashes, and escaping characters?

Not really. There were some hints here and here, but it took me a while to zero in on the right combination.
