
Thursday, January 26, 2006

Blocking Search Engines from indexing pages

It may sound bizarre, but occasionally, even when promoting your site to search engines, there are certain files or directories you do not want search engine spiders to scan and index, such as configuration files or any personal files and directories you are using your webspace to store.

The most common approach is to place a 'robots.txt' file in the root directory of your site.
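For example, if your site were hosted at www.example.com (a placeholder domain, not a real one), spiders would look for the file at:

http://www.example.com/robots.txt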

This file is normally created using a simple text editor that does not insert any formatting, such as Microsoft's Notepad. A record is made up of a User-agent line followed by one or more Disallow lines, with each line taking the form:

[field]:[value]

There are two field types: 'User-agent', which specifies which spider(s) the following rule(s) apply to, and 'Disallow', which specifies content those spiders must not retrieve. The * symbol can be used as a wildcard in the User-agent field to match every robot; the Disallow field does not take wildcards, just path prefixes.

The following allows all robots to visit all files: the wildcard "*" matches every robot, and the empty Disallow value prohibits nothing.
User-agent: *
Disallow:
This one keeps all robots out.
User-agent: *
Disallow: /
The next one bars all robots from the cgi-bin and images directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
This one bans Roverdog from all files on the server:
User-agent: Roverdog
Disallow: /
This one keeps googlebot from getting at the cheese.htm file:
User-agent: googlebot
Disallow: /cheese.htm
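
Putting it together, a robots.txt for the situation described at the start of this post might look like the following. The 'config' and 'private' directory names are only examples; substitute whatever directories you actually want hidden:
User-agent: *
Disallow: /config/
Disallow: /private/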


Why not visit my main site for more tips and hints?
