I recently became aware of the proposed llms.txt standard (https://llmstxt.org/). It looks like it has gained a lot of traction already - here's a directory of sites that support it (https://llmstxt.site/).
The idea is that your site would have an /llms.txt file in the root, formatted as markdown, and that each page of the site would also be available as markdown alongside the HTML. The point is that AI agents can parse the markdown much faster than a full-featured web page (the https://llmstxt.org/ site explains it better - take a look).
Anyway, has anyone thought of a way for Hugo to generate the required output for this? Could Hugo produce an llms.txt in the root, in markdown format, in a similar manner to how it generates sitemap.xml - and also output plain markdown files (perhaps with the front matter removed?) in the same folders as the HTML?
This is trivial with a couple of custom output formats, and maybe a front matter field for section inclusion. Wrap it up in a module and anyone could use it.
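For reference, here's a rough, untested sketch of that output-formats approach. The format names llms and markdown below are placeholders I chose, and the exact template lookup names can vary with your Hugo version.
hugo.toml (or config.toml)
[mediaTypes."text/markdown"]
suffixes = ["md"]
[outputFormats.llms]
mediaType = "text/plain"
baseName = "llms"
isPlainText = true
[outputFormats.markdown]
mediaType = "text/markdown"
isPlainText = true
[outputs]
home = ["html", "rss", "llms"]
page = ["html", "markdown"]
A layouts/index.llms.txt template would then render the markdown index to /llms.txt, and a minimal layouts/_default/single.markdown.md containing just {{ .RawContent }} would publish each page's raw markdown, front matter stripped, next to its HTML (e.g. /docs/install/index.md rather than the proposal's index.html.md, but close). A front matter field checked in those templates, or a cascade, could handle the section-inclusion part.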
I generate an llms.txt file from within the robots.txt layout, using a custom llms-txt: https://example.com/llms.txt line similar to the Sitemap: https://example.com/sitemap.xml definition. I use a front matter value for pages called sitemap_exclude, which can be set to true to exclude a page from the sitemap, and I exclude those pages from llms.txt as well.
The primary reason for generating it from within the robots.txt template is to get at the published resource's .Permalink, which lets the llms.txt template live in the assets directory instead of static.
The following is an example that produces 3 levels of sections and includes titles, permalinks, and descriptions for each page.
config/_default/params.toml
[robots]
llmsTXT = true
layouts/robots.txt
User-agent: *
Sitemap: {{ "sitemap.xml" | absURL }}
{{ range where .Pages "Params.sitemap_exclude" "eq" true }}
Disallow: {{ .RelPermalink }}{{ end }}
{{/* LLMS */}}
{{- $llmsGoTXT := resources.Get "llms.go.txt" -}}
{{- if and $llmsGoTXT .Site.Params.robots.llmsTXT -}}
{{- $llmsTXT := $llmsGoTXT | resources.ExecuteAsTemplate "llms.txt" . -}}
llms-txt: {{ $llmsTXT.Permalink }}
{{- end -}}
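For a hypothetical site at example.com with one excluded page, the rendered robots.txt comes out roughly like this:
User-agent: *
Sitemap: https://example.com/sitemap.xml
Disallow: /secret-page/
llms-txt: https://example.com/llms.txt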
assets/llms.go.txt
{{ with .Site.Title -}}
# {{ . }}
{{- end }}
{{ with .Site.Params.Description -}}
> {{ . }}
{{- end }}
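{{/* Top-level pages */}}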
{{ range (where (sort ((.Site.GetPage "/").Pages) "Weight" "asc" "Date" "desc" "Lastmod" "desc") "Params.sitemap_exclude" "ne" true) -}}
- [{{ .Title }}]({{ .Permalink }}){{ with .Description }}: {{ . }}{{ end }}
{{ end -}}
{{/* Sections */}}
{{ range (where (sort ((.Site.GetPage "/").Sections) "Weight" "asc" "Date" "desc" "Lastmod" "desc") "Params.sitemap_exclude" "ne" true) -}}
{{ with .Title -}}
## {{ . }}
{{- end }}
{{ with .Description -}}
> {{ . }}
{{- end }}
{{ range (where (sort .Pages "Weight" "asc" "Date" "desc" "Lastmod" "desc") "Params.sitemap_exclude" "ne" true) -}}
{{ if .Title -}}
- [{{ .Title }}]({{ .Permalink }}){{ with .Description }}: {{ . }}{{ end }}
{{- end }}
{{ end -}}
{{/* Sub-Sections */}}
{{ range (where (sort .Sections "Weight" "asc" "Date" "desc" "Lastmod" "desc") "Params.sitemap_exclude" "ne" true) -}}
{{ with .Title -}}
### {{ . }}
{{- end }}
{{ with .Description -}}
> {{ . }}
{{- end }}
{{ range (where (sort .Pages "Weight" "asc" "Date" "desc" "Lastmod" "desc") "Params.sitemap_exclude" "ne" true) -}}
{{ if .Title -}}
- [{{ .Title }}]({{ .Permalink }}){{ with .Description }}: {{ . }}{{ end }}
{{- end }}
{{ end }}
{{ end -}}
{{ end -}}
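For a hypothetical site, the published llms.txt then looks along these lines:
# Example Site
> A short description of the site.
- [About](https://example.com/about/): Who we are.
## Docs
> Documentation for the project.
- [Install](https://example.com/docs/install/): How to install it.
- [Configure](https://example.com/docs/configure/): Configuration reference.
### Tutorials
- [Quick start](https://example.com/docs/tutorials/quick-start/): A five-minute tour.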
Based on the rendered HTML response in Developer Tools > Network > [document file] > Response, and the markdown support for Hugo sites in browsers, the content already seems pretty software-digestible. At least, I haven't been inclined to host a markdown copy of every page. I am not an expert, so add salt.
This isn’t exactly what they recommend, but it seems to work. I included the proposal quote below for anyone else reading this.
We furthermore propose that pages on websites that have information that might be useful for LLMs to read provide a clean markdown version of those pages at the same URL as the original page, but with .md appended. (URLs without file names should append index.html.md instead.)