Custom Robots.txt and sitemap.xml Templates


#1

Intro

The docs have an example of how to block every page using robots.txt, but I wanted to make a robots.txt template that blocks only pages with a certain frontmatter param.

As others have pointed out in the comments and elsewhere, robots.txt is not a sure thing. Web crawlers have to be set to honor it, and sometimes they are not. Additionally, for any page that is disallowed in robots.txt, the same page should be excluded from the sitemap.

Here’s what to do:

Set Up the robots.txt Template

Turn on robots.txt generation in your config.toml:

...
enableRobotsTXT = true
...

In your layouts/robots.txt put:

User-agent: *
Disallow: /some-path/

{{ range where .Data.Pages "Params.robotsdisallow" true }}
Disallow: {{ .RelPermalink }}
{{ end }}

(You can have manually-specified paths above or below the range statement.)

Then in the frontmatter of the pages you want to disallow (i.e. block from being indexed by search crawlers), do:

---
...
robotsdisallow: true
...
---

Assuming you set that param in, say, your devnotes2019.md and internallog1.md markdown content files, your robots.txt will be automatically generated. If you run hugo server and browse http://localhost:1313/robots.txt (use whatever port you set), you should see the disallow statements.

User-agent: *
Disallow: /devnotes2019/
Disallow: /internallog1/
Disallow: 404.html
Disallow: /404/
Disallow: /search/
...
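
To sanity-check how a crawler that honors robots.txt would read rules like these, here is a minimal Python sketch using the standard library's urllib.robotparser. It is not part of Hugo. The sample rules mirror the first lines of the output above; the /about/ path and the is_allowed helper name are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample rules mirroring the generated robots.txt (trimmed for brevity).
ROBOTS_TXT = """\
User-agent: *
Disallow: /devnotes2019/
Disallow: /internallog1/
"""


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # /about/ is a hypothetical page that carries no robotsdisallow param.
    for path in ("/devnotes2019/", "/internallog1/", "/about/"):
        print(path, is_allowed(ROBOTS_TXT, "http://localhost:1313" + path))
```

This only models a polite crawler; anything that ignores robots.txt can still fetch the pages.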

robots.txt is also commonly used to point crawlers at the sitemap. You can do this in your robots.txt template:

User-agent: *
Disallow: /some-path/

{{ range where .Data.Pages "Params.robotsdisallow" true }}
Disallow: {{ .RelPermalink }}
{{ end }}

Sitemap: {{ "sitemap.xml" | absLangURL }}

Set Up the sitemap.xml Template

Next, we need the same “robotsdisallow” param to affect the sitemap as well: if it is set, the page should not be listed in the sitemap.

Wherever your <head> is set (e.g. baseof.html), you can add a meta tag to indicate that no indexing should be performed on the page.

{{ with .Params.robotsdisallow }}<meta name="robots" content="noindex, nofollow, noarchive">{{ end }}

If you want to specify the opposite case, then use an “else”:

{{ with .Params.robotsdisallow }}<meta name="robots" content="noindex, nofollow, noarchive">{{ else }}<meta name="robots" content="index, follow, archive">{{ end }}

I understand that content="index, follow" is the default, so you could leave off the “else” in that case.
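
If you would rather check the rendered tag with a script than by viewing source, a rough Python sketch with the standard library's html.parser could look like the following. The RobotsMetaFinder class, the robots_directives helper and the sample HTML are all hypothetical names and data for illustration, not anything Hugo provides.

```python
from html.parser import HTMLParser
from typing import List


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and (attr_map.get("name") or "").lower() == "robots":
            content = attr_map.get("content") or ""
            # Split "noindex, nofollow" into separate lowercase directives.
            self.directives.extend(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> List[str]:
    """Return the robots directives found in an HTML document."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    finder.close()
    return finder.directives


if __name__ == "__main__":
    # Hypothetical rendered output for a page with robotsdisallow: true.
    sample = (
        '<html><head><meta name="robots" '
        'content="noindex, nofollow, noarchive"></head><body></body></html>'
    )
    print(robots_directives(sample))
```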

Then, add a custom layouts/sitemap.xml template:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  {{ range .Data.Pages }}{{ if not .Params.robotsdisallow }}
  <url>
    <loc>{{ .Permalink }}</loc>{{ if not .Lastmod.IsZero }}
    <lastmod>{{ safeHTML ( .Lastmod.Format "2006-01-02T15:04:05-07:00" ) }}</lastmod>{{ end }}{{ with .Sitemap.ChangeFreq }}
    <changefreq>{{ . }}</changefreq>{{ end }}{{ if ge .Sitemap.Priority 0.0 }}
    <priority>{{ .Sitemap.Priority }}</priority>{{ end }}{{ if .IsTranslated }}{{ range .Translations }}
    <xhtml:link
                rel="alternate"
                hreflang="{{ .Lang }}"
                href="{{ .Permalink }}"
                />{{ end }}
    <xhtml:link
                rel="alternate"
                hreflang="{{ .Lang }}"
                href="{{ .Permalink }}"
                />{{ end }}
    <xhtml:link
                rel="alternate"
                hreflang="x-default"
                href="{{ .Permalink }}"
                />
  </url>
  {{ end }}{{ end }}
</urlset>

This uses {{ if not .Params.robotsdisallow }} to check whether the page has that param set and, if so, leaves the page out of the sitemap. Change the param name if you used a different one, like “hidden”. It also assumes you want hreflang entries.

Now confirm:

  1. Using hugo server again, visit http://localhost:1313/sitemap.xml (or whatever port you set) to see that your disallowed pages are not present.
  2. View source on an excluded page, and you should see the noindex meta in <head>.
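
The first check can also be scripted. Below is a minimal Python sketch, using only the standard library, that lists any sitemap URL that robots.txt disallows. The sample sitemap deliberately includes one blocked page to show what a mismatch looks like; the helper name, the /about/ page and the sample data are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.robotparser import RobotFileParser

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample data; in practice you would paste in your own output.
ROBOTS_TXT = """\
User-agent: *
Disallow: /devnotes2019/
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://localhost:1313/about/</loc></url>
  <url><loc>http://localhost:1313/devnotes2019/</loc></url>
</urlset>
"""


def blocked_urls_in_sitemap(sitemap_xml: str, robots_txt: str) -> List[str]:
    """Return sitemap URLs that robots.txt disallows for all user agents."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    locs = [
        el.text.strip()
        for el in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if el.text
    ]
    return [url for url in locs if not parser.can_fetch("*", url)]


if __name__ == "__main__":
    # An empty list means the sitemap and robots.txt agree.
    print(blocked_urls_in_sitemap(SITEMAP_XML, ROBOTS_TXT))
```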

Set sitemap priority and change frequency in frontmatter

You may have seen that the custom sitemap template accommodates priority and changefreq. If you have pages you want to indicate change frequency or priority for (noting that these settings are more of a suggestion to search engines, not hard-and-fast), you can set them like this in your frontmatter:

TOML:

[sitemap]
  ChangeFreq = "daily"
  Priority = 1.0

YAML:

sitemap:
  ChangeFreq: weekly
  Priority: 0.7

Per the sitemaps.org protocol, the default priority of a page is 0.5.
Read about the values you can use here: https://www.sitemaps.org/protocol.html
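
As a quick reference, the protocol page lists seven changefreq values and a priority range of 0.0 to 1.0. A small Python sketch of a check for those two settings might look like this; the sitemap_setting_problems helper is a made-up name, and it validates plain values rather than reading your frontmatter.

```python
from typing import List

# changefreq values listed in the sitemaps.org protocol.
VALID_CHANGEFREQ = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}


def sitemap_setting_problems(changefreq: str, priority: float) -> List[str]:
    """Return a list of problems with the given sitemap settings."""
    problems = []
    if changefreq.lower() not in VALID_CHANGEFREQ:
        problems.append(f"unknown changefreq: {changefreq!r}")
    if not 0.0 <= priority <= 1.0:
        problems.append(f"priority outside 0.0-1.0: {priority}")
    return problems


if __name__ == "__main__":
    print(sitemap_setting_problems("daily", 1.0))
    print(sitemap_setting_problems("fortnightly", 1.5))
```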


#2

Very useful. Thanks @RickCogley.


#3

For those who may not be familiar with robots.txt: please remember that it does not actually prevent access to information. It only discourages well-behaved search engines from indexing your data. There are lots of badly behaved search engines and even more bad actors who actively hunt through robots.txt files for “interesting” information.

You should also make sure that those pages are excluded from your sitemap.

If you have pages that you don’t want anyone to see, make sure they are not referenced anywhere.


#4

Good point. Looks like you’d need to do something like:


#5

It is important to keep robots.txt and sitemap.xml in sync. If a page is blocked in robots.txt but is present in sitemap.xml, Google Webmasters raises a warning.


#6

Thanks @TotallyInformation, @Mikhail, @onedrawingperday, @bep for the various bits of info that went into it.

Let me know if I’m missing something and I’ll edit the mini tutorial. Hopefully if this is good enough, I can get it into the proper docs, as I think this is a pretty universal need.


#7

Please forgive me for this one, but I HAVE to ask this question.

Why not refrain from putting the information on the website, entirely - if it’s not something you want indexed/searched/found?

Is robots.txt used so that you can have a few Easter eggs on your site?


#8

Well in my case I am hiding a Thank You HTML page that is displayed once a user subscribes to a newsletter through a form.

Also I am hiding Terms & Conditions and Privacy Policy pages from search engines. There is really no point in having these indexed.

So this is not so much about Easter eggs but more about clean search results.


#9


Lol! That’s the last thing on my mind, currently.

You might want those terms and conditions pages to be indexed/searched, though.
I know I would, if I were to sign up for something.
I forget those policies and don’t always know what it is I’ve signed up for.
It’d be nice if they’d pop up before I used your results.


#10

Actually, the links to these pages are displayed prominently on the Newsletter subscription page; the user is required to agree to the terms if she wishes to subscribe to the Newsletter.

Also there are links to these pages in the footer of every HTML page of a Hugo site I manage.

So the user can find them quite easily, if she wants to read them.

Keeping these pages not listed in search results is just a matter of taste.


#11

That’s a perfect example of the use of the robots.txt file, and I appreciate that I now know not only its purpose but also how and when to use it.

I can also choose not to use it and if I do enable the setting, I’ll still need to add some configurations to get it working properly.


#12

Thank you for this great tutorial. When I wrote
{{ with .Params.robotsdisallow }}<meta name="robots" content="noindex">{{ end }}

in head.html, I get the meta tag on the pages that have the line enableRobotsTXT = "true" in the frontmatter:
<meta name="robots" content="noindex">

But all other pages need a different one:
<meta name="robots" content="index, follow, archive">


#13

@Joerg, extending my {{ with ...}}, I would do:

{{ with .Params.robotsdisallow }}<meta name="robots" content="noindex, nofollow, noarchive">{{ else }}<meta name="robots" content="index, follow, archive">{{ end }}

Just tested and it works. If you just need content="index, follow", then just leave the “else” off, because that is the default as I understand it.

(updated the tutorial, thanks @Joerg :smile: )


#14

Extended the tutorial a bit with a section on setting sitemap changefreq and priority in frontmatter. See above.


#15

This would still only work for well-behaved search crawlers, just as a warning to the unfamiliar. People can still download anything they want for free from the command line using wget and others, or paid apps like Screaming Frog. To the previous point, if you want it truly hidden, you’ll need to hide it behind a login…


#16

Yeah, all these things are “polite requests” only. :wink:


#17

Extended the tutorial by updating the custom sitemap template with a default “x-default” hreflang entry, as recommended by Google and Yoast.


#18

Hi, Rick. I’m confused about my robots.txt. I’m trying to follow the article Custom Robots.txt and sitemap.xml Templates, but something is going wrong. I cannot find my robots.txt, and I have three layouts folders (take a look at the screenshot). Is that OK?

Every time, I get a 404 page instead.


#19

No, there can be a layouts folder in the root of your project and in the theme, but it appears you have one in content as well. I have not tested whether having one in content has a negative effect, but it is not the usual way.

If you want specific help, please see Requesting Help and share your repo etc with the community in a new thread.


#20

Thanks a lot @RickCogley. I followed your tutorial but I have 2 issues:

  • I don’t understand the following part: “Assuming you set that param in your 404”. What am I supposed to put in my 404, and how?
  • When I visit http://localhost:1313/sitemap.xml, I have the following error:

    It’s listing anything that is coming from my blog page. Would you know why it happens and how I can fix it?
    Thanks