Intro
The docs have an example of how to block every page using robots.txt, but I wanted to make a robots.txt template that blocks only pages with a certain frontmatter param.
As others have pointed out in the comments and elsewhere, robots.txt is not a sure thing. It is advisory: web crawlers have to be set to honor it, and sometimes they are not. Additionally, any page that is disallowed in robots.txt should also be excluded from the sitemap.
Here’s what to do:
Set Up the robots.txt Template
Turn on robots.txt generation in your config.toml:
...
enableRobotsTXT = true
...
In your layouts/robots.txt, put:
User-agent: *
Disallow: /some-path/
{{ range where .Data.Pages "Params.robotsdisallow" true }}
Disallow: {{ .RelPermalink }}
{{ end }}
(You can have manually-specified paths above or below the range statement.)
Then in the frontmatter of the pages you want to disallow (i.e. block from being indexed by search crawlers), do:
---
...
robotsdisallow: true
...
---
Assuming you set that param in, say, your devnotes2019.md and internallog1.md markdown content files, your robots.txt will be automatically generated. If you run hugo server and browse http://localhost:1313/robots.txt (use whatever port you set), you should see the disallow statements:
User-agent: *
Disallow: /devnotes2019/
Disallow: /internallog1/
Disallow: 404.html
Disallow: /404/
Disallow: /search/
...
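Since robots.txt only works for crawlers that choose to honor it, it can help to read the generated rules the way a compliant crawler would. Here is a minimal sketch using Python's standard urllib.robotparser. The rules are a trimmed copy of the output above, and /posts/hello/ is a made-up path standing in for any page that does not set the param.

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the generated robots.txt shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /devnotes2019/
Disallow: /internallog1/
Disallow: /search/
"""


def is_allowed(robots_txt, url, agent="*"):
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# /posts/hello/ is a hypothetical page without robotsdisallow set.
for path in ("/devnotes2019/", "/posts/hello/"):
    print(path, is_allowed(ROBOTS_TXT, "http://localhost:1313" + path))
```

A crawler that ignores robots.txt never runs a check like this, which is why the noindex meta and the sitemap exclusion are worth adding too.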
robots.txt can also tell crawlers where the sitemap lives. You can do this in your robots.txt template:
User-agent: *
Disallow: /some-path/
{{ range where .Data.Pages "Params.robotsdisallow" true }}
Disallow: {{ .RelPermalink }}
{{ end }}
Sitemap: {{ "sitemap.xml" | absLangURL }}
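To check that the Sitemap line comes out in a form crawlers can pick up, the rendered file can be parsed with the same standard-library parser. This is only a sketch: the sample text assumes the site is served by hugo server on the default port, so the absolute URL shown here is an assumption, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Assumed rendering of the template above when served on localhost:1313.
RENDERED_ROBOTS_TXT = """\
User-agent: *
Disallow: /some-path/
Disallow: /devnotes2019/
Sitemap: http://localhost:1313/sitemap.xml
"""


def sitemap_urls(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


print(sitemap_urls(RENDERED_ROBOTS_TXT))
```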
Set Up the sitemap.xml Template
Next, we need the same “robotsdisallow” param to affect the sitemap as well: if it is set, the page should not be listed in the sitemap.
Wherever your <head> is set (e.g. baseof.html), you can add a meta tag to indicate that no indexing should be performed on the page.
{{ with .Params.robotsdisallow }}<meta name="robots" content="noindex, nofollow, noarchive">{{ end }}
If you want to specify the opposite case, then use an “else”:
{{ with .Params.robotsdisallow }}<meta name="robots" content="noindex, nofollow, noarchive">{{ else }}<meta name="robots" content="index, follow, archive">{{ end }}
I understand that content="index, follow" is the default, so you could leave off the “else” in that case.
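If you would rather check the rendered HTML with a script than by eye, the meta tag can be picked out with Python's built-in html.parser. This is a rough sketch, not part of Hugo; the class and function names are made up for illustration, and the sample page is a stand-in for whatever your baseof.html renders.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives of a <meta name="robots"> tag, if one exists."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives = [d.strip().lower() for d in content.split(",") if d.strip()]


def robots_directives(html):
    """Return the robots meta directives found in an HTML document."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


# Stand-in for a rendered page that has robotsdisallow set.
SAMPLE_PAGE = (
    '<html><head><meta name="robots" content="noindex, nofollow, noarchive">'
    "</head><body></body></html>"
)
print(robots_directives(SAMPLE_PAGE))
```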
Then, add a custom layouts/sitemap.xml template:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
{{ range .Data.Pages }}{{ if not .Params.robotsdisallow }}
<url>
<loc>{{ .Permalink }}</loc>{{ if not .Lastmod.IsZero }}
<lastmod>{{ safeHTML ( .Lastmod.Format "2006-01-02T15:04:05-07:00" ) }}</lastmod>{{ end }}{{ with .Sitemap.ChangeFreq }}
<changefreq>{{ . }}</changefreq>{{ end }}{{ if ge .Sitemap.Priority 0.0 }}
<priority>{{ .Sitemap.Priority }}</priority>{{ end }}{{ if .IsTranslated }}{{ range .Translations }}
<xhtml:link
rel="alternate"
hreflang="{{ .Lang }}"
href="{{ .Permalink }}"
/>{{ end }}
<xhtml:link
rel="alternate"
hreflang="{{ .Lang }}"
href="{{ .Permalink }}"
/>{{ end }}
<xhtml:link
rel="alternate"
hreflang="x-default"
href="{{ .Permalink }}"
/>
</url>
{{ end }}{{ end }}
</urlset>
This uses {{ if not .Params.robotsdisallow }} to check whether the page has that param set and, if so, leaves the page out of the sitemap. Change the param name if you used a different one, like “hidden”. The template also assumes you want hreflang entries.
Now confirm:
- Using hugo server again, visit http://localhost:1313/sitemap.xml (or whatever port you set) to see that your disallowed pages are not present.
- View source on an excluded page, and you should see the noindex meta in <head>.
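The sitemap check can also be scripted. Below is a minimal sketch that compares the paths listed in a sitemap against the Disallow lines of a robots.txt and reports any overlap. It uses inline sample data rather than fetching from the server, it ignores User-agent grouping for simplicity, and the sample sitemap deliberately still lists a disallowed page to show what a failure looks like. All function names are illustrative.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Sample data: this sitemap wrongly still lists /devnotes2019/.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://localhost:1313/posts/hello/</loc></url>
  <url><loc>http://localhost:1313/devnotes2019/</loc></url>
</urlset>
"""
ROBOTS_TXT = """\
User-agent: *
Disallow: /devnotes2019/
Disallow: /internallog1/
"""


def sitemap_paths(sitemap_xml):
    """Return the set of URL paths listed in <loc> elements."""
    root = ET.fromstring(sitemap_xml)
    return {
        urlparse(loc.text.strip()).path
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def disallowed_paths(robots_txt):
    """Return every non-empty Disallow value (User-agent groups are ignored)."""
    paths = set()
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip():
            paths.add(value.strip())
    return paths


def leaked_paths(sitemap_xml, robots_txt):
    """Return sitemap paths that fall under a disallowed prefix."""
    blocked = disallowed_paths(robots_txt)
    return sorted(
        p for p in sitemap_paths(sitemap_xml) if any(p.startswith(b) for b in blocked)
    )


print(leaked_paths(SITEMAP_XML, ROBOTS_TXT))
```

An empty list means the sitemap and robots.txt agree; anything else is a page that the sitemap template failed to exclude.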
Set sitemap priority and change frequency in frontmatter
You may have seen that the custom sitemap template accommodates priority and changefreq. If you have pages you want to indicate change frequency or priority for (noting that these settings are more of a suggestion to search engines, not hard-and-fast rules), you can set them like this in your frontmatter:
TOML:
[sitemap]
ChangeFreq = "daily"
Priority = 1.0
YAML:
sitemap:
ChangeFreq: weekly
Priority: .7
Per the sitemaps.org protocol, the default priority of a page is 0.5.
Read about the values you can use here: https://www.sitemaps.org/protocol.html
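As a quick reference for those values: the protocol page lists seven changefreq keywords and a priority range of 0.0 to 1.0. The sketch below checks a pair of settings against those limits. The function name is made up, and this is a standalone sanity check, not something Hugo runs for you.

```python
# changefreq keywords listed by the sitemaps.org protocol.
VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def sitemap_setting_errors(changefreq=None, priority=None):
    """Return a list of problems with the given sitemap settings (empty if none)."""
    errors = []
    if changefreq is not None and str(changefreq).lower() not in VALID_CHANGEFREQ:
        errors.append("unknown changefreq: %r" % (changefreq,))
    if priority is not None:
        try:
            value = float(priority)
        except (TypeError, ValueError):
            errors.append("priority is not a number: %r" % (priority,))
        else:
            if not 0.0 <= value <= 1.0:
                errors.append("priority out of range 0.0-1.0: %r" % (priority,))
    return errors


# The two frontmatter examples above, followed by a deliberately bad pair.
print(sitemap_setting_errors("daily", 1))
print(sitemap_setting_errors("weekly", 0.7))
print(sitemap_setting_errors("fortnightly", 1.5))
```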