A Web search engine is a tool designed to search for information on the World Wide Web. The results may consist of web pages, images, and other types of files.
SEO Expert
Whenever you enter a query in a search engine and hit 'enter', you get a list of web results that contain that query term. SEO is a set of techniques that helps search engines find and rank your site higher than the millions of other sites in response to a search query. SEO thus helps you get traffic from search engines.
Friday 9 March 2012
What is a spider?
A spider is a program that automatically fetches Web pages. Spiders are used to feed pages to search engines. It's called a spider because it crawls over the Web. Another term for these programs is webcrawler. Because most Web pages contain links to other pages, a spider can start almost anywhere. As soon as it sees a link to another page, it goes off and fetches it. Large search engines, like AltaVista, have many spiders working in parallel.
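The crawl loop described above can be sketched in a few lines of Python. The page graph and URLs below are invented purely for illustration; a real spider would fetch pages over HTTP and honor robots.txt:

```python
from collections import deque

# A toy "web": page URL -> list of linked URLs (made up for illustration).
PAGES = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": ["http://a.example/"],  # cycle back to the start
}

def crawl(start):
    """Breadth-first crawl: fetch a page, then queue every link not yet seen."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)                 # "feed" the fetched page to the indexer
        for link in PAGES.get(url, []):   # follow each outgoing link
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Starting from any one page, the spider reaches all three, fetching each exactly once; the `seen` set is what keeps the cycle from looping forever.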
How Web Search Engines Work
Crawler-based search engines are those that use automated software agents (called crawlers) that visit a Web site, read the information on the actual site, read the site's meta tags, and follow the links that the site connects to, performing indexing on all linked Web sites as well. The crawler returns all that information to a central repository, where the data is indexed. The crawler will periodically return to the sites to check for any information that has changed. The frequency with which this happens is determined by the administrators of the search engine.
Human-powered search engines rely on humans to submit information that is subsequently indexed and catalogued. Only information that is submitted is put into the index. In both cases, when you query a search engine to locate information, you're actually searching through the index that the search engine has created; you are not actually searching the Web. These indices are giant databases of information that is collected, stored, and subsequently searched. This explains why a search on a commercial search engine, such as Yahoo! or Google, will sometimes return results that are, in fact, dead links. Since the search results are based on the index, if the index hasn't been updated since a Web page became invalid, the search engine treats the page as still active even though it no longer is. It will remain that way until the index is updated.
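The "giant database" described above is, at its core, an inverted index: a map from each word to the pages that contain it. A minimal sketch in Python, with made-up page text, shows why queries hit the index rather than the live Web:

```python
# Tiny corpus of "crawled" pages; the text is invented for illustration.
DOCS = {
    "http://a.example/": "seo helps search engines rank pages",
    "http://b.example/": "search engines crawl and index pages",
}

def build_index(docs):
    """Build an inverted index: word -> set of page URLs containing it."""
    index = {}
    for url, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, word):
    """Answer a one-word query from the stored index, not the live Web."""
    return sorted(index.get(word, set()))
```

If a page in `DOCS` went dead after indexing, `search` would still return its URL until the index was rebuilt, which is exactly the dead-link behavior the paragraph describes.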
Major search engines
Google
Yahoo
MSN/Bing
Robots
Google: Googlebot
MSN/Bing: MSNBOT/0.1
Yahoo: Yahoo! Slurp
Robots.txt file
Robots.txt is a file that gives instructions to search engine spiders about which page or pages of a website to index or follow. This file is normally used to disallow search engine spiders from indexing unfinished pages of a website during its development phase. Many webmasters also use this file to avoid spamming. The creation and uses of the robots.txt file are listed below:
Robots.txt creation:
To keep all robots out:
User-agent: *
Disallow: /
To block a page from all crawlers:
User-agent: *
Disallow: /page-name/
To block a page from a specific crawler:
User-agent: Googlebot
Disallow: /page-name/
To block images from a specific crawler:
User-agent: Googlebot-Image
Disallow: /
To allow all robots:
User-agent: *
Disallow:
Finally, some crawlers, most notably Google, now support an additional field called "Allow:". To disallow all crawlers from your site EXCEPT Google:
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
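You can check how crawlers will interpret rules like these with Python's standard-library robots.txt parser. A small sketch, testing the "everyone except Google" file above (the URL and bot name are placeholders):

```python
import urllib.robotparser

# robots.txt supplied as lines of text, so no network fetch is needed.
rules = [
    "User-agent: *",
    "Disallow: /",
    "",
    "User-agent: Googlebot",
    "Allow: /",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# A generic crawler is blocked everywhere; Googlebot is allowed.
generic_ok = rp.can_fetch("SomeBot", "http://example.com/page.html")    # False
googlebot_ok = rp.can_fetch("Googlebot", "http://example.com/page.html")  # True
```

`can_fetch` matches the most specific User-agent group first, so the `Googlebot` entry's `Allow: /` wins over the wildcard's `Disallow: /`.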
"Robots" Meta Tag
If you want a page indexed but do not want any of the links on the page to be followed, you can use the following:
<meta name="robots" content="index,nofollow"/>
If you don't want a page indexed but want all links on the page to be followed, you can use the following:
<meta name="robots" content="noindex,follow"/>
If you want a page indexed and all the links on the page to be followed, you can use the following:
<meta name="robots" content="index,follow"/>
If you want a page neither indexed nor its links followed, you can use the following:
<meta name="robots" content="noindex,nofollow"/>
To invite robots to index the page and follow all its links:
<meta name="robots" content="all"/>
To stop robots from indexing the page or following any of its links:
<meta name="robots" content="none"/>
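A quick way to see which of these directives a page declares is to pull the robots meta tag out of its HTML. A sketch using Python's built-in html.parser, run against a made-up snippet:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives += [d.strip() for d in a.get("content", "").split(",")]

page = '<html><head><meta name="robots" content="noindex,follow"/></head><body></body></html>'
parser = RobotsMetaParser()
parser.feed(page)
# parser.directives is now ['noindex', 'follow']
```

html.parser routes self-closing tags like this one through `handle_starttag` as well, so the same class works for both `<meta ...>` and `<meta .../>` forms.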
Robots.txt vs. Robots Meta Tag
Robots.txt
While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.
In order to use a robots.txt file, you'll need to have access to the root of your domain (if you're not sure, check with your web host). If you don't have access to the root of a domain, you can restrict access using the robots meta tag.
Robots Meta Tag
To entirely prevent a page's contents from being listed in the Google web index even if other sites link to it, use a noindex meta tag. As long as Googlebot fetches the page, it will see the noindex meta tag and prevent that page from showing up in the web index.
When we see the noindex meta tag on a page, Google will completely drop the page from our search results, even if other pages link to it. Other search engines, however, may interpret this directive differently. As a result, a link to the page can still appear in their search results.
Note that because we have to crawl your page in order to see the noindex meta tag, there's a small chance that Googlebot won't see and respect the noindex meta tag. If your page is still appearing in results, it's probably because we haven't crawled your site since you added the tag. (Also, if you've used your robots.txt file to block this page, we won't be able to see the tag either.)
If the content is currently in our index, we will remove it after the next time we crawl it. To expedite removal, use the URL removal request tool in Google Webmaster Tools.
Validate Your Code
There are several ways to validate the accuracy of your website's source code. The four most important, in my opinion, are validating your search engine optimization, HTML, and CSS, and ensuring that you have no broken links or images. Start by analyzing broken links. One of the W3C's top SEO tips would be for you to use their tool to validate links. If you have a lot of links on your website, this could take a while.
Next, revisit the W3C to analyze HTML and CSS. Here is a link to the W3C's HTML Validation Tool and to their CSS Validation Tool.
The final step in the last of my top SEO tips is to validate your search engine optimization. Without having to purchase software, the best online tool I've used is ScrubTheWeb's Analyze Your HTML tool. STW has built an extremely extensive online application that you'll wonder how you've lived without. One of my favorite features of STW's SEO tool is its attempt to mimic a search engine. In other words, the results of the analysis will show you (theoretically) how search engine spiders may see the website.
Install a sitemap.xml for Google
Though you may feel like it is impossible to get listed high on Google's search engine results page, believe it or not, that isn't Google's intention. They simply want to ensure that their viewers get the most relevant results possible. In fact, they've even created a program just for webmasters to help ensure that your pages get cached in their index as quickly as possible. They call the program Google Sitemaps. In this tool, you'll also find a great new linking tool to help discover who is linking to your website.
For Google, these two pieces of the top SEO tips would be to read the tutorial entitled How Do I Create a Sitemap File and to create your own. To view the one on this website, simply right-click this SEO Tips Sitemap.xml file and save it to your desktop. Open the file with a text editor such as Notepad.
Effective November 2006, Google, Yahoo!, and MSN use one shared standard for sitemaps. Below is a snippet of the standard code as listed at Sitemaps.org. The lastmod, changefreq, and priority fields are optional.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
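If your site has many pages, a sitemap like the one above is easier to generate than to hand-write. A sketch using Python's xml.etree.ElementTree; the URL list is a placeholder, and only the optional changefreq field is filled in:

```python
import xml.etree.ElementTree as ET

def make_sitemap(urls):
    """Build a minimal sitemaps.org-style urlset for a list of page URLs."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for page in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page
        ET.SubElement(url, "changefreq").text = "monthly"  # optional field
    return ET.tostring(urlset, encoding="unicode")

sitemap_xml = make_sitemap(["http://www.example.com/"])
```

Write the returned string to a sitemap.xml file at your site root (prepending the `<?xml ...?>` declaration) and submit it through Google Sitemaps.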
The equivalent of the sitemap.xml file for Yahoo! is the urllist.txt file. Technically you can call the file whatever you want, but all it really contains is a list of every page on your website, one URL per line.
Include a robots.txt File
By far the easiest of these top SEO tips as it relates to search engine optimization is to include a robots.txt file at the root of your website. Open up a text editor such as Notepad, type "User-agent: *" on the first line and "Disallow:" (with nothing after it) on the second, then save the file as robots.txt and upload it to the root directory of your domain. This file tells any spider that hits your website to "please feel free to crawl every page of my website".
Here's one of my best top SEO tips: because the search engine analyzes everything it indexes to determine what your website is all about, it might be a good idea to block folders and files that have nothing to do with the content we want analyzed. You can prevent unrelated files from being read by adding "Disallow: /folder_name/" or "Disallow: /filename.html". Here is an example of the robots.txt file on this site:
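The original screenshot of this site's robots.txt isn't reproduced here; the sketch below shows the general shape such a file might take, with invented folder and file names standing in for the real ones:

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /drafts/
Disallow: /print-version.html
```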