GreenGeeks Googlebot Restriction

This post explains how to avoid GreenGeeks restricting Googlebot’s access to your website. If Googlebot attempts too many connections to GreenGeeks servers, even the mighty Google will face restrictions so that the servers are not overloaded.

If you configure your own websites starting with an empty root directory, or are migrating to GreenGeeks from another server without these restrictions, this post will help you avoid indexing restrictions with Google.

Unreachable: robots.txt

Google cannot index sitewide issue

The problem I came across with GreenGeeks hosting was that the robots.txt file on several websites hosted there was noted as being unreachable by Googlebot. This was according to Google Search Console’s URL Inspection tool, so straight from the horse’s mouth so to speak.

I put in a support request ‘wondering if there were site specific things like maybe Google tried to index a lot of content from those sites quickly and got blocked.’ For reference, I included the help page Google provided, which explicitly states ‘your server returned a 5xx (unreachable) error when we tried to retrieve your robots.txt file.’
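
For a quick check from your own machine, a minimal Python sketch like the one below builds a site's robots.txt address and reports the HTTP status the server returns for it. It is only an approximation: the request comes from your IP address rather than one of Google's, so a 200 here doesn't prove Googlebot sees the same thing. The function names, the User-Agent string and the example.com address are my own placeholders, not anything GreenGeeks or Google provide.

```python
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def robots_url(site_url: str) -> str:
    """Return the robots.txt URL at the root of the given site."""
    return urljoin(site_url, "/robots.txt")


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server gives for url.

    Connection failures (DNS errors, timeouts) are not handled here
    and will raise an exception.
    """
    # Hypothetical User-Agent string, used only to identify this check.
    request = Request(url, headers={"User-Agent": "robots-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # urllib raises 4xx and 5xx responses as exceptions; keep the code.
        return error.code
```

Calling fetch_status(robots_url("https://example.com/")) would return the status as an integer, so a value in the 500s would match the kind of error Google describes.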

Finding A Solution

A solution to the issue wasn’t immediately obvious. This was the first time Google had had this issue indexing any of my sites, including those on other hosting providers I use personally or with clients. The reply to my support request was that ‘Google IPs are globally whitelisted on all [GreenGeeks] shared servers.’

I was told that to pursue a solution, I’d need to provide ‘a step-by-step instruction to reproduce the issue on [their] end, including the involved details like login information and URLs, for [GreenGeeks] to be able to test it live and advise [me] further.’ To reproduce this error, one would need to log into my Google Search Console account (or be granted access), then inspect the website’s pages again.

As an aside, I won’t be giving out passwords for Search Console since that would grant a person I don’t even know access to Gmail, Analytics, Google Drive, Google Sheets and on and on. Given a support request, hosts are able to assist and access relevant parts of your account to solve issues. No passwords should ever be given out to any support staff.

Of course the error would be reproduced if the login credentials were granted, but one would only be referred to the Google Search Console help file referenced above. We’d still not have a solution because the answer would once again be ‘Google IPs are globally whitelisted on all [their] shared servers.’

The problem was definitely on their end since I’d never had this issue before and Google have no reason to give false information about their ability to index websites. I had to give this more thought and come up with a solution myself.

Troubleshooting Googlebot Restriction

Troubleshooting issues is a big part of marketing consultancy and web development is no different. You might even hire a Web Producer to deal with website related issues like this. It is something I’m very used to from my time working in the SaaS industry. The key is approaching the problem with an open mind, looking at things logically to come up with a solution. Start with something you know (a known known if you will), and go from there.

Restricted Server Access

Although Google IPs may be whitelisted by GreenGeeks, I remembered their servers do restrict repeated connection attempts. While testing things like how well Google Analytics 4 is working, I’d use a VPN to access sites hosted on their servers. I found such visits could result in the need to solve a reCAPTCHA (or similar).

greengeeks firewall message

That got me thinking along the lines of ‘okay, that explains the 5xx error reported by Google’. 5xx server errors are returned when a server fails to fulfil a request, whether because it is unable to or, as I’d say in this case, because it refuses to.

Here are some possible errors Google was receiving:

500 Internal Server Error
This is a generic error message, given when an unexpected condition is encountered and no more specific message is suitable. It’s less likely Google was given this very generic error in this case.

503 Service Unavailable
Given when the server cannot handle a request because it is overloaded or down for maintenance. It’s highly likely that this was the error given to Google when trying to access the robots.txt files.

509 Bandwidth Limit Exceeded
This is a non-standard code used by some hosting setups. It’s very unlikely Google was given this message since the account in question did not have bandwidth limitations.
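
The three candidates above all belong to the same 5xx family, which is what Search Console lumps together as ‘unreachable’. As a rough sketch, assuming only the standard grouping of HTTP status codes by their first digit, a helper like this sorts any status code into its family. The function name and labels are mine.

```python
# Names for the specific codes discussed above.
CANDIDATE_ERRORS = {
    500: "Internal Server Error",
    503: "Service Unavailable",
    509: "Bandwidth Limit Exceeded",  # non-standard code
}


def describe_status(code: int) -> str:
    """Group an HTTP status code into its broad family."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


for code, name in CANDIDATE_ERRORS.items():
    print(code, name, "->", describe_status(code))
```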

Too Many Server Requests

Looking at Search Console’s failure percentage helped me work out that the error was related to the number of pages I was asking Search Console to check in a short period of time. I could correlate the increase in failures reported by Search Console with my own activity, requesting that Google inspect pages and add them to the index.

googlebot fail rate

Since the hosting in question was shared, it’s logical to assume other users on the same server would also have visits from Googlebot which adds to the load.

Default Files

My next task was to work out what could be different with sites hosted on GreenGeeks as opposed to other website hosting providers. That’s where my ‘site specific things’ support ticket question was coming from. What specifically was different with the sites that had the robots.txt access error compared to others hosted in the same account that did not have that error?

The errors occurred on websites that all worked without issue before being transferred to GreenGeeks’ servers. So I took a look at the root directory of sites that had been created specifically on GreenGeeks’ servers to see what, if any differences there were.

I found GreenGeeks included a default robots.txt and that file included a crawl delay. So there we have it: the GreenGeeks server is set up in a way that, despite ‘whitelisting’ Google IPs, it could still restrict access if too many requests were being made in a given period of time. The crawl delay added to the default robots.txt confirmed this, instructing robots to observe a delay of 60 seconds.
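
To put that 60 second figure in perspective, here is a small back-of-the-envelope sketch. It assumes a crawler that honours the delay and makes one request at a time; the 500 page site is a made-up example, not one of my sites.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_fetches_per_day(crawl_delay_seconds: int) -> int:
    """Most requests one obedient crawler can make in a day."""
    return SECONDS_PER_DAY // crawl_delay_seconds


def hours_to_crawl(page_count: int, crawl_delay_seconds: int) -> float:
    """Rough time for one obedient crawler to fetch every page once."""
    return page_count * crawl_delay_seconds / 3600


print(max_fetches_per_day(60))             # 1440
print(round(hours_to_crawl(500, 60), 1))   # 8.3
```

In other words, a 60 second delay caps each well-behaved crawler at 1,440 requests a day, which is a strong hint about how much traffic the server expects to tolerate from any single visitor.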

What I then discovered after more research was that Google does not support the crawl-delay directive and Googlebot ignores it, so I can only treat it as a possible solution for now and monitor the situation. Have a look at a video that explains how Google adjusts its behaviour according to the server’s ability to handle requests.

Unreachable: robots.txt Solution

To avoid an Unreachable: robots.txt error in Google Search Console, the best option I’ve found is to include the following lines in your robots.txt:

user-agent: *
crawl-delay: 60

The key line is ‘crawl-delay: 60’, which asks search engines and other spiders that honour the directive to pause for 60 seconds between server requests so as not to overload the server. Google may not honour it, as noted above.
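
If you want to confirm the file says what you think it says, Python’s standard library includes a robots.txt parser. This is a minimal sketch using the two lines above; ‘ExampleBot’ is a made-up crawler name, and reading the default (*) delay this way relies on a reasonably recent Python 3. It only shows how a compliant parser reads the file, not whether any particular crawler obeys it.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
user-agent: *
crawl-delay: 60
"""


def crawl_delay_for(robots_txt: str, user_agent: str):
    """Return the crawl delay a compliant parser reads for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


# 'ExampleBot' is hypothetical; it falls under the catch-all * rule.
print(crawl_delay_for(ROBOTS_TXT, "ExampleBot"))
```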

Unfortunately, I can’t guarantee this will solve the problem, but it’s the only option I have for now. Hopefully a note has been made of this thanks to GreenGeeks’ standard operating procedures so the next customer who asks about an Unreachable: robots.txt error can resolve their issue in a swift manner.

My conclusion is that GreenGeeks run their servers to be very secure. This is evidenced by the way they handle customer log-in, the small things they restrict their users from doing and the way their servers restrict multiple visits, which should help prevent things such as distributed denial of service (DDoS) attacks.