When I build and deploy my Hugo website, some XML files are generated and deployed along with it. Google Search Console shows me that these XML files can't be indexed properly (the error is 'Crawled - currently not indexed').
This isn't a big problem, but I think it means that something isn't configured properly.
Does it make sense for the XML files to be generated and deployed?
Is the right solution to disallow crawling of XML files in the robots.txt file?
Or maybe to add a noindex directive to those XML files?
What is and is not indexed by Google is strictly up to Google's algorithms to decide. You can generate XML files to tell Google about your links, but it is at the algorithm's discretion whether to index them or not.
Currently, for Google, content is what matters. Your website does not have much content yet. Keep writing and you will see that change.
But,
why do you have this in your robots.txt?
Disallow: /*.xml$
Treat robots.txt as a guidance file, not a mandate that bots must obey. Many crawlers simply ignore it if they find it unhelpful.
Why do you want to disallow XML files in robots.txt, or set noindex?
I don’t think that “generated XML files cause crawling errors in Google Search Console”.
There is nothing wrong with any of these files (index.xml or /en/sitemap.xml, for example). Just add them to Search Console, keep working on your content, and see if this changes over the next month or so. It has nothing to do with Hugo itself.
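For reference, index.xml is Hugo's built-in RSS feed and sitemap.xml is its built-in sitemap; both are generated by default, so their presence is expected. If you really did not want the RSS output, a minimal sketch of the relevant config (assuming TOML; the file may be hugo.toml or config.toml depending on your setup) would look like this:

```toml
# Minimal sketch: limit output formats so the RSS feeds (index.xml) are not built.
# The defaults are ["html", "rss"] for both home and section pages.
[outputs]
  home    = ["html"]
  section = ["html"]

# The sitemap is a separate built-in output; it could be turned off with
# disableKinds = ["sitemap"] at the top level, but keeping it helps Google
# discover your pages.
```

That said, keeping the defaults is perfectly fine; these files do not hurt anything.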
You are forbidding access to all XML files, including sitemap.xml. I think Google is decent about respecting things in robots.txt, so I suggest you change it.
If you really want to block access to some XML files, at least add an Allow statement for your sitemaps.
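Something along these lines (just a sketch; the Sitemap URL is a placeholder, and note that Google generally applies the most specific matching rule while other crawlers may handle wildcards differently):

```
User-agent: *
# Explicitly allow the sitemaps...
Allow: /sitemap.xml
Allow: /en/sitemap.xml
# ...while blocking every other URL ending in .xml
Disallow: /*.xml$

# Tell crawlers where the sitemap lives (placeholder domain)
Sitemap: https://example.com/sitemap.xml
```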
The reason I have Disallow: /*.xml$ in my robots.txt is that this is how I'm trying to 'fix' the issue. The crawling error is from before I added this to the robots.txt file.
The reason I raise this is that I'd expect things to work with the default configuration. The fact that I got the 'Crawled - currently not indexed' warning made me think that something isn't configured right.
I'm fine with reverting my change to the robots.txt file - I'm trying to understand what the best course of action would be here, so I can 'play nice' with Google.
Your Disallow rule will not fix anything, as there is nothing to fix in the first place. It will only cause more issues.
Things work (from Hugo's perspective) and there is nothing that needs fixing.
'Crawled - currently not indexed' is strictly at the discretion of Google's algorithms. If they find your links useful and your content desirable, they will be indexed. Indexing takes time as well.
Everything on your site (apart from the robots.txt thing) is configured right.
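If you do revert, one clean option (a sketch, assuming you set enableRobotsTXT = true in your Hugo config) is a template at layouts/robots.txt that allows everything and simply points crawlers at the sitemap:

```
User-agent: *

Sitemap: {{ "sitemap.xml" | absURL }}
```

Hugo builds the Sitemap line from your baseURL, so the generated robots.txt stays correct if the domain ever changes.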