Friday 9 March 2012

How Web Search Engines Work


Crawler-based search engines use automated software agents (called crawlers) that visit a Web site, read the information on the site itself, read the site's meta tags, and follow the links the site contains, indexing all linked Web sites as well. The crawler returns all of that information to a central repository, where the data is indexed. The crawler periodically returns to the sites to check for any information that has changed. How often this happens is determined by the administrators of the search engine.
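As a rough illustration of the crawl loop described above, here is a minimal Python sketch. It is not how any real search engine is implemented: the FAKE_WEB dictionary, the page paths, and the crawl function are all made up for the example, and a dictionary lookup stands in for actually fetching pages over the network.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the Web: URL -> HTML source.
FAKE_WEB = {
    "/home": '<a href="/about">About</a> <a href="/news">News</a>',
    "/about": '<a href="/home">Home</a>',
    "/news": '<a href="/about">About</a> <a href="/missing">Gone</a>',
}


def crawl(start, fetch=FAKE_WEB.get):
    """Breadth-first crawl; returns {url: html} for every page reached."""
    repository = {}          # plays the role of the central repository
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:     # page does not exist, nothing to store
            continue
        repository[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return repository


pages = crawl("/home")
print(sorted(pages))  # ['/about', '/home', '/news']
```

The seen set is what keeps the crawler from looping forever when pages link back to each other.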
 
Human-powered search engines rely on humans to submit information that is subsequently indexed and catalogued. Only information that is submitted is put into the index.

In both cases, when you query a search engine to locate information, you are actually searching through the index that the search engine has created; you are not searching the Web itself. These indices are giant databases of information that is collected, stored and subsequently searched. This explains why a search on a commercial search engine, such as Yahoo! or Google, will sometimes return results that are, in fact, dead links. Since the search results are based on the index, if the index hasn't been updated since a Web page became invalid, the search engine still treats the page as an active link even though it no longer is. It will remain that way until the index is updated.
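The dead-link effect can be reproduced with a toy index. The sketch below is only an illustration under simple assumptions: the page paths and texts are invented, and a plain word-to-URL mapping stands in for a real search index.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, word):
    """Answer a query from the index only; the live pages are never read."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical snapshot taken by the crawler.
live_pages = {
    "/a": "search engines index pages",
    "/b": "crawlers follow links between pages",
}
index = build_index(live_pages)

# The page is later removed from the site, but the index is not rebuilt yet.
del live_pages["/b"]
stale = search(index, "pages")    # still lists /b, which is now a dead link

index = build_index(live_pages)   # the next crawl refreshes the index
fresh = search(index, "pages")
print(stale, fresh)  # ['/a', '/b'] ['/a']
```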

Major search engines

Google
Yahoo
MSN/Bing

Robots

Google: Googlebot
MSN / Bing: MSNBOT/0.1
Yahoo: Yahoo! Slurp
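A site can recognise these robots by looking for their names in the User-Agent header of a request. The Python sketch below assumes a simple substring match; the identify_robot function and the KNOWN_ROBOTS table are hypothetical names for this example, and the sample header strings are only illustrative. Keep in mind that a User-Agent header can be faked, so this is a hint rather than proof.

```python
# Tokens taken from the list above, lowercased for comparison.
KNOWN_ROBOTS = {
    "googlebot": "Google",
    "msnbot": "MSN / Bing",
    "yahoo! slurp": "Yahoo",
}


def identify_robot(user_agent):
    """Return the search engine name for a User-Agent header, or None."""
    ua = user_agent.lower()
    for token, engine in KNOWN_ROBOTS.items():
        if token in ua:
            return engine
    return None


# Illustrative header values, not an exhaustive or current list.
print(identify_robot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(identify_robot("MSNBOT/0.1"))
print(identify_robot("Mozilla/5.0 (X11; Linux x86_64)"))
```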

 

Robots.txt file

Robots.txt is a file that tells search engine spiders which pages of a website they may or may not crawl. It is normally used to keep the spiders of search engines from indexing unfinished pages of a website during its development phase. Many webmasters also use this file to help avoid spamming. The creation and uses of the robots.txt file are listed below:

Robots.txt Creation:

To keep all robots out
User-agent: *
Disallow: /

To block a page from all crawlers
User-agent: *
Disallow: /page name/

To block a page from a specific crawler
User-agent: Googlebot
Disallow: /page name/

To block images from a specific crawler
User-agent: Googlebot-Image
Disallow: /

To allow all robots
User-agent: *
Disallow:

Finally, some crawlers, most notably Google's, now support an additional field called "Allow:".

To disallow all crawlers from your site EXCEPT Google:
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
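To check what a set of rules like this actually permits, Python's standard library includes urllib.robotparser. The sketch below feeds it the "everyone except Google" rules from above. The can_crawl helper, the ExampleBot name and the /index.html path are made up for the example, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The "disallow everyone except Google" rules from above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""


def can_crawl(robots_txt, user_agent, path):
    """Parse the rules from a string and ask whether the agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


google_ok = can_crawl(ROBOTS_TXT, "Googlebot", "/index.html")
other_ok = can_crawl(ROBOTS_TXT, "ExampleBot", "/index.html")
print(google_ok, other_ok)  # True False
```

The same helper can be pointed at any of the other rule sets in this section by changing the string that is passed in.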


"Robots" Meta Tag

If you want a page indexed but do not want any of the links on the page to be followed, you can use the following:
<meta name="robots" content="index,nofollow"/>

If you don't want a page indexed but want all links on the page to be followed, you can use the following instead:
<meta name="robots" content="noindex,follow"/>

If you want a page indexed and all the links on the page to be followed, you can use the following instead:
<meta name="robots" content="index,follow"/>

If you want a page neither indexed nor its links followed, you can use the following instead:
<meta name="robots" content="noindex,nofollow"/>

To invite robots to index the page and follow all of its links
<meta name="robots" content="all"/>

To stop robots from indexing the page or following any of its links
<meta name="robots" content="none"/>
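To see how a crawler might read these values, here is a small Python sketch that pulls the robots meta tag out of a page and turns it into two yes/no answers. The RobotsMetaParser class and robots_policy function are hypothetical names for this example. It assumes that a page without the tag is treated as index,follow, and it only handles the values listed above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records the content of the first <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if self.content is None:
                self.content = attrs.get("content") or ""


def robots_policy(html):
    """Return (index, follow) booleans; both default to True."""
    parser = RobotsMetaParser()
    parser.feed(html)
    index = follow = True
    for value in (parser.content or "").lower().split(","):
        value = value.strip()
        if value in ("noindex", "none"):
            index = False
        if value in ("nofollow", "none"):
            follow = False
    return index, follow


print(robots_policy('<meta name="robots" content="index,nofollow"/>'))  # (True, False)
print(robots_policy('<meta name="robots" content="none"/>'))            # (False, False)
```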
