The Website Gatekeeper – Robots.txt

I got a call from a friend who was upset that his website traffic had dwindled and sales were on a steep decline. My first thought was that he had gotten caught up in some link scheme that got his site penalized. It turns out the problem was a simple mistake that many people have made, including me.

In this case, the problem started when the website was redesigned. The web developers wanted to ensure that their site wouldn’t be crawled by the search engines before it was launched. To accomplish this, they added a robots.txt file that disallowed the search engines. The code looked like this:

User-agent: *
Disallow: /

The sample code above is perfect for telling all web crawlers to go away, which is what they wanted for the development site. The problem was that when the team transferred all the files from the development server to the production server, they overwrote the production server's existing robots.txt file. After a while, the site's pages started dropping from the search engine results.
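One way to keep this from happening again is a quick check of the robots.txt file you are about to publish. The sketch below is my own illustration rather than anything the team actually used: it is a deliberately simplified parser, and the default "robots.txt" path is an assumption you would adjust to your deployment process.

# A pre-deployment sanity check (a simplified sketch): refuse to publish a
# robots.txt that, like the development file above, blocks every crawler
# from the entire site. The default "robots.txt" path is an assumption.
import sys

def blocks_whole_site(path):
    """Return True if a 'User-agent: *' group contains 'Disallow: /'."""
    current_agents = set()   # user-agents named in the group being read
    in_rules = False         # True once a rule line has been seen for the group
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
            if not line:
                continue
            field, _, value = line.partition(":")
            field, value = field.strip().lower(), value.strip()
            if field == "user-agent":
                if in_rules:                      # a new group starts here
                    current_agents, in_rules = set(), False
                current_agents.add(value)
            else:
                in_rules = True
                if field == "disallow" and value == "/" and "*" in current_agents:
                    return True
    return False

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "robots.txt"
    if blocks_whole_site(path):
        sys.exit(f"Do not deploy {path}: it blocks the entire site for all crawlers.")
    print(f"{path} does not block the whole site.")

Run against the development file shown above, a check like this would refuse to deploy; run against a normal production file, it stays quiet.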

The robots.txt file is an optional file that provides guidance to web crawlers, or spiders, about your domain. Reputable bots, such as those run by the major search engines, use this file to determine which content to crawl and which to ignore.
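If you are curious how a reputable bot actually applies these rules, Python's standard urllib.robotparser module implements a simplified version of the same logic. The domain, paths, and user-agent below are placeholders, and the module's matching is looser than Google's, so treat this as a rough illustration rather than the real crawler's behavior.

# A sketch of how a polite crawler consults robots.txt before fetching pages.
# The domain and paths are placeholders; point set_url at your own site.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")   # replace with your domain
rp.read()                                          # downloads and parses the file

for path in ("/", "/products/widget.html", "/private/report.pdf"):
    url = "https://www.example.com" + path
    verdict = "crawl" if rp.can_fetch("Googlebot", url) else "skip"
    print(f"{verdict}: {url}")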

For us non-technical people, looking at either Google’s or Bing’s instruction file can be intimidating. The bottom line is if you’re not careful with this file, you could block your “money” or “important” pages from the search engines.

Use Google Webmaster Tools to Test Robots.txt

Instead of guessing whether your web pages will get blocked, I like to have Google tell me who passes and who fails before their first crawl. They have a test tool where you can paste in a URL list and see the results. This does require a Webmaster Tools account, but the service is free.

You can also tell it which of Google's five bots to use for testing. These are:

  • Googlebot: crawls pages for the web index and Google News
  • Googlebot-Mobile: crawls pages for the mobile index
  • Googlebot-Image: crawls pages for the image index
  • Mediapartners-Google: crawls pages to determine AdSense content
  • AdsBot-Google: crawls pages to measure AdWords landing page quality
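Which bot you pick matters because a robots.txt file can single out individual user-agents. The snippet below is a hypothetical example, not a recommended configuration: it blocks only Googlebot-Image from an images directory and checks the result with urllib.robotparser, whose matching is simpler than Google's own rules.

# Hypothetical rules that treat two Google bots differently, parsed in memory
# with urllib.robotparser; the directory and URL are made-up examples.
from urllib import robotparser

RULES = """\
User-agent: Googlebot-Image
Disallow: /photos/

User-agent: *
Disallow:
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(RULES)

url = "https://www.example.com/photos/widget.jpg"
for agent in ("Googlebot", "Googlebot-Image"):
    verdict = "allowed" if rp.can_fetch(agent, url) else "blocked"
    print(f"{agent}: {verdict}")
# Expected output: Googlebot is allowed, Googlebot-Image is blocked.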

Create the Test URL List

The first step is to create a text list of the URLs you wish to test. This might be your entire site or just your important pages. Alternatively, you can enter anything, including non-existent URLs, to learn how the system works. For example, I needed to test a series of URLs for a site that was redoing its URL structure.

If you want to test with real URLs from your site, Google has assembled a list of tools at https://code.google.com/p/sitemap-generators/wiki/SitemapGenerators. While these tools are designed to create sitemaps, some tools such as GSiteCrawler also create a simple URL text file.
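If the tool you choose only outputs a sitemap.xml, a short script can turn it into the plain URL list the next step needs, which also saves the spreadsheet cleanup described below. The file names here (sitemap.xml, urls.txt) are assumptions; rename them to match your own files.

# Extract the <loc> URLs from a sitemap.xml into a plain text list (a sketch;
# "sitemap.xml" and "urls.txt" are placeholder file names).
import xml.etree.ElementTree as ET

# The sitemap protocol's XML namespace (tags look like {namespace}loc internally).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

tree = ET.parse("sitemap.xml")
urls = [loc.text.strip() for loc in tree.iter(SITEMAP_NS + "loc") if loc.text]

with open("urls.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(urls) + "\n")

print(f"Wrote {len(urls)} URLs to urls.txt")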

If the program you select doesn't create a simple URL list, you can export to a CSV or tab-delimited file and remove any columns you don't need using Excel or another spreadsheet program. In a worst-case scenario, you can test one URL at a time, but I find that takes too long and leads to boredom.

Testing Your List with Google Webmaster Tools

  1. Log into your account at https://google.com/webmasters/tools
  2. From the left column, click Health and then Blocked URLs.
[Screenshot: menu option for the test tool]
  3. The screen will show your current robots.txt and the last time it was downloaded. You can also edit your robots.txt file within this window. I've outlined my current file in red.

If you don't have a robots.txt file, the search engines will crawl all pages they have discovered that don't require a password. For example, the search engines may have discovered pages on your site through links from other websites you don't control.

[Screenshot: current robots.txt file]
  4. Scroll down the page to the URLs section, labeled "Specify the URLs and user-agents to test against," and paste in your URL list.
[Screenshot: URLs pasted into the test box]

Note: I’ve been able to paste in 200 URLs at a time in this text field.

  5. Select which User-agents you wish to emulate. You can rerun the test and choose different agents. This test only indicates whether the crawler is allowed access.
[Screenshot: Google user-agents list]
  6. Click Test.
  7. You should see your URL list and the status for each User Agent.
  8. Note any errors such as URLs that were blocked. In the example below, you can see that Googlebot was blocked because of line 2 in my robots.txt file.
[Screenshot: example of a blocked URL]
  9. Edit your robots.txt to fix any errors and test again.
  10. If you changed your robots.txt, make sure to copy the file to your production web server.
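If you would rather run the same kind of check from the command line, for example before the file ever reaches the production server, the sketch below approximates what the Blocked URLs test does. It is my own illustration, not Google's tool: urllib.robotparser's matching is simpler than Googlebot's real rules, and the file names and agent list are assumptions you can change.

# A local approximation of the Blocked URLs test (a sketch, not Google's tool):
# reads a plain URL list and reports which Google user-agents may fetch each URL.
# "urls.txt", "robots.txt", and the agent list are assumptions you can change.
from urllib import robotparser

AGENTS = ["Googlebot", "Googlebot-Mobile", "Googlebot-Image",
          "Mediapartners-Google", "AdsBot-Google"]

# Parse the local robots.txt you are about to deploy (no network call needed).
rp = robotparser.RobotFileParser()
with open("robots.txt", encoding="utf-8") as f:
    rp.parse(f.read().splitlines())

with open("urls.txt", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    blocked = [agent for agent in AGENTS if not rp.can_fetch(agent, url)]
    if blocked:
        print(f"BLOCKED for {', '.join(blocked)}: {url}")
    else:
        print(f"allowed for all agents: {url}")

Unlike the text box in Webmaster Tools, a script like this has no practical limit on how many URLs you can test at once.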