Custom Robots.txt and sitemap.xml Templates

#6

Thanks @TotallyInformation, @Mikhail, @onedrawingperday, @bep for the various bits of info that went into it.

Let me know if I’m missing something and I’ll edit the mini tutorial. Hopefully if this is good enough, I can get it into the proper docs, as I think this is a pretty universal need.

1 Like
#7

Please forgive me for this one, but I HAVE to ask this question.

Why not refrain from putting the information on the website, entirely - if it’s not something you want indexed/searched/found?

Is robots.txt used so that you can have a few Easter eggs on your site?

#8

Well in my case I am hiding a Thank You HTML page that is displayed once a user subscribes to a newsletter through a form.

Also I am hiding Terms & Conditions and Privacy Policy pages from search engines. There is really no point in having these indexed.

So this is not so much about Easter eggs but more about clean search results.

2 Likes
#9


Lol! That’s the last thing on my mind, currently.

You might want those terms and conditions pages to be indexed/searched, though.
I know I would, if I were to sign up for something.
I forget those policies and don’t always know what it is I’ve signed up for.
It’d be nice if they’d pop up before I used your results.

#10

Actually, the links to these pages are displayed prominently on the newsletter subscription page; the user is required to agree to the terms if she wishes to subscribe to the newsletter.

Also there are links to these pages in the footer of every HTML page of a Hugo site I manage.

So the user can find them quite easily, if she wants to read them.

Keeping these pages out of search results is just a matter of taste.

#11

That’s a perfect example of the use of the robots.txt file, and I appreciate it; now I not only know the purpose of it, I also know how and when to use it.

I can also choose not to use it, and if I do enable the setting, I’ll still need to add some configuration to get it working properly.

1 Like
#12

Thank you for this great tutorial. When I wrote
{{ with .Params.robotsdisallow }}<meta name="robots" content="noindex">{{ end }}

in head.html, I get the meta tag on pages that have the line enableRobotsTXT = "true" in the front matter:
<meta name="robots" content="noindex">

But all other pages need another
<meta name="robots" content="index, follow, archive">

1 Like
#13

@Joerg, extending my {{ with ...}}, I would do:

{{ with .Params.robotsdisallow }}<meta name="robots" content="noindex, nofollow, noarchive">{{ else }}<meta name="robots" content="index, follow, archive">{{ end }}

Just tested and it works. If you only need content="index, follow", then leave the “else” off, because that is the default, as I understand it.
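
For reference, here is a minimal sketch of the front matter that triggers the “noindex” branch above, assuming TOML front matter and the robotsdisallow param name from this tutorial (adjust to your own setup):

+++
title = "Thank You"
robotsdisallow = true
+++

Any page without that param falls through to the “else” branch and gets the normal index/follow tag.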

(updated the tutorial, thanks @Joerg :smile: )

1 Like
#14

Extended the tutorial a bit with a section on setting sitemap changefreq and priority in frontmatter. See above.
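
For anyone following along, a rough sketch of what that looks like with Hugo’s built-in per-page sitemap settings, assuming TOML front matter (key names per the Hugo docs; double-check against your version):

+++
title = "Some Page"
[sitemap]
  changefreq = "monthly"
  priority = 0.5
+++

The custom sitemap template can then read these per page via {{ .Sitemap.ChangeFreq }} and {{ .Sitemap.Priority }}.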

#15

Just as a warning to the unfamiliar: this would still only work for well-behaved search crawlers. People can still download anything they want from the command line using free tools like wget, or with paid apps like Screaming Frog. To the previous point, if you want something truly hidden, you’ll need to put it behind a login…

1 Like
#16

Yeah, all these things are “polite requests” only. :wink:

1 Like
#17

Extended the tutorial, updating the custom sitemap template to add a default “x-default” hreflang entry, as recommended by Google and Yoast.
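
For reference, the gist of that change is one extra alternate link per url entry in the custom sitemap template, roughly like this (a sketch; merge it into your own template alongside the per-translation hreflang links):

<xhtml:link rel="alternate" hreflang="x-default" href="{{ .Permalink }}" />

Google treats x-default as the fallback URL to show when none of the listed languages match the visitor.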

#18

Hi, Rick. I’ve gotten confused with my robots.txt. I’m trying to follow the article Custom Robots.txt and sitemap.xml Templates, but something is going wrong. I cannot find my robots.txt, and I have three layouts directories (take a look at the scan). Is that OK?
layouts_scan Every time I get a 404 page instead.

#19

No, there can be a layouts directory in the root of your project and in the theme, but it appears you have one in content as well. I have not tested whether having one in content has a negative effect, but it is not the usual way.

If you want specific help, please see Requesting Help and share your repo etc with the community in a new thread.

#20

Thanks a lot @RickCogley. I followed your tutorial but I have 2 issues:

  • I don’t understand the following part: "Assuming you set that param in your 404". What am I supposed to put in my 404, and how?
  • When I visit http://localhost:1313/sitemap.xml, I have the following error:

    It’s listing anything that is coming from my blog page. Would you know why it happens and how I can fix it?
    Thanks
#21

Hi - that was confusing. I rewrote it. It can be any page you want to exclude. Once you add the param, it should show up in the robots.txt.
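
In other words, nothing special goes into the 404 page beyond the param itself; the custom robots.txt template ranges over the site’s pages and emits a Disallow line for any page that sets it. A rough sketch, assuming .Site.Pages and the robotsdisallow param name (your template’s range may differ):

User-agent: *
{{ range .Site.Pages }}
{{- if .Params.robotsdisallow }}
Disallow: {{ .RelPermalink }}
{{- end }}
{{- end }}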

Regarding the other question, could you make a new post, please? From what you pasted it is hard to tell, so please have a look at Requesting Help and provide some more details in the new post. Thanks.

#22

@RickCogley
Loving your code, thanks.

I noticed an issue: Multilingual pages, like example.ru.md, do not get added to robots.txt.

I have enableRobotsTXT = "true" in front matter of both files example.md and example.ru.md.

I expect to get two entries in robots.txt:


/example/
/ru/example/

But I only get one: /example/

What do you think is the problem?

#23

Hmm, I imagine it is something with the range statement. I don’t have time to set up a test, but can you try {{ .Permalink }} instead? Or maybe it needs absLangURL.
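
Concretely, that would mean changing just the Disallow line inside the range of the robots.txt template, e.g. (an untested sketch of the suggestion, not a confirmed fix):

Disallow: {{ .Permalink }}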

1 Like
#24

Thanks, will give it a try later.

#25

Tried using {{ .Permalink }} instead of {{ .RelPermalink }}. It did not work; it just gave me an absolute URL output.

Then I tried using {{ .RelPermalink | relLangURL }} and absLangURL. That did not work either.

Did some more research but cannot figure it out yet. Any ideas?