Crawler-based search engines use automated software agents (called crawlers) that visit a Web site, read the information on the site itself, read the site's meta tags, and follow the links the site connects to, indexing all of those linked Web sites as well. The crawler returns all of this information to a central repository, where the data is indexed. The crawler periodically returns to the sites to check for any information that has changed; how often this happens is determined by the administrators of the search engine.
Human-powered search engines rely on humans to submit information, which is subsequently indexed and catalogued; only information that is submitted is put into the index. In both cases, when you query a search engine to locate information, you are actually searching through the index that the search engine has created, not the Web itself. These indices are giant databases of information that is collected, stored, and subsequently searched. This explains why a search on a commercial search engine, such as Yahoo! or Google, will sometimes return results that are, in fact, dead links. Because the results are based on the index, if the index hasn't been updated since a Web page became invalid, the search engine treats the page as still active even though it no longer is. It will remain that way until the index is updated.
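The idea above can be sketched in a few lines of Python. This is a toy illustration, not how any real engine is implemented: pages are captured once at crawl time into an inverted index, and queries consult only that snapshot, which is why a page deleted after the crawl can still show up as a dead link. The URLs and page text are made up.

```python
# Hypothetical pages as captured at crawl time (URL -> text).
crawled_pages = {
    "http://example.com/a": "search engines use crawlers to index pages",
    "http://example.com/b": "this page was deleted after the last crawl",
}

# Build the inverted index: word -> sorted list of URLs containing it.
index = {}
for url, text in crawled_pages.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(url)

def search(word):
    """Look up the index only; no live check, so dead links can appear."""
    return sorted(index.get(word.lower(), set()))

print(search("index"))
```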
Major search engines:
Google
Yahoo!
MSN/Bing
Robots:
Google: Googlebot
MSN/Bing: MSNBot/0.1
Yahoo: Yahoo! Slurp
robots.txt file
robots.txt is a file that gives instructions to search engine spiders about whether to index or follow certain pages of a website. This file is normally used to prevent search engine spiders from indexing unfinished pages of a website during its development phase. Many webmasters also use this file to avoid spamming. The creation and uses of the robots.txt file are listed below:
robots.txt creation:
To exclude all robots from the entire site:
User-agent: *
Disallow: /
To prevent certain pages from all crawlers:
User-agent: *
Disallow: /page name/
To prevent certain pages from a specific crawler:
User-agent: Googlebot
Disallow: /page name/
To prevent images from a specific crawler:
User-agent: Googlebot-Image
Disallow: /
To allow all robots:
User-agent: *
Disallow:
Finally, some crawlers, most notably Google's, now support an additional field called "Allow:".
To disallow all crawlers from your site except Google:
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
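You can test rules like these with Python's standard-library urllib.robotparser, which implements the robots.txt matching logic. A minimal sketch, parsing the rule text directly rather than fetching a live robots.txt; the "SomeBot" agent and example URL are illustrative:

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above: block everyone, then allow Googlebot.
rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Any other crawler falls through to the "*" entry and is blocked;
# Googlebot matches its own entry and is explicitly allowed.
print(parser.can_fetch("SomeBot", "http://example.com/page.html"))    # False
print(parser.can_fetch("Googlebot", "http://example.com/page.html"))  # True
```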
"Robots" Meta Tag
If you want a page indexed but do not want any of the links on the page to be followed, you can use the following:
<meta name="robots" content="index,nofollow"/>
If you don't want a page indexed but want all the links on the page to be followed, you can use the following:
<meta name="robots" content="noindex,follow"/>
If you want a page indexed and all the links on the page to be followed, you can use the following:
<meta name="robots" content="index,follow"/>
If you want neither the page indexed nor its links followed, you can use the following:
<meta name="robots" content="noindex,nofollow"/>
To invite robots to index the page and follow all its links:
<meta name="robots" content="all"/>
To stop robots from indexing the page or following its links:
<meta name="robots" content="none"/>
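A crawler reads this tag when it parses the page's HTML. A minimal sketch of that step using Python's standard-library HTMLParser; the sample HTML document is made up for illustration:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from a <meta name="robots"> tag, if present."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            # Split e.g. "noindex,follow" into individual directives.
            self.directives = [d.strip().lower()
                               for d in attrs.get("content", "").split(",")]

html = '<html><head><meta name="robots" content="noindex,follow"/></head></html>'
parser = RobotsMetaParser()
parser.feed(html)
print(parser.directives)  # ['noindex', 'follow']
```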