How Search Engines Read Web Pages

To properly optimize web page content for the search engines you need to understand how the search engines read web pages.

When the Search Engine Arrives at the Website

When the search engine arrives at the website, it looks in the root (main) folder of the site for a file called robots.txt.

In the robots.txt file it looks for what directories and files it is allowed to look at and index. Some search engine spiders (web crawlers, robots) ignore these instructions.
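
The robots.txt check can be sketched with Python's standard urllib.robotparser module. This is only an illustration: the robots.txt content, the ExampleBot name and the www.example.com addresses are made up for the example, and real search engines may interpret the file differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/notes.html
"""

def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# Ask whether a made-up crawler may fetch each (made-up) address.
for path in ("/index.html", "/private/report.html", "/drafts/notes.html"):
    allowed = parser.can_fetch("ExampleBot", "https://www.example.com" + path)
    print(path, "allowed" if allowed else "blocked")
```

A well-behaved spider runs a check like this before requesting a page. A spider that ignores robots.txt simply skips the check, which is why the file is a request rather than a lock.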

Web Page Head Section

Once a web page is found by the search engine spider (robot, web crawler) it takes a look at the head section (the tags between the <head> and </head> tags) of the web page for:

  1. The title of the page
  2. The keyword and description meta tags
  3. The robots meta tag. Some site owners use the robots meta tag to override the instructions in the robots.txt file. Those who cannot create a robots.txt file place their instructions for the search engine spiders (robots, web crawlers) here instead.

    If there is no robots.txt file in place and no robots meta tag in the page(s) it finds, it will follow and index all the links found.
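
To make the head section items concrete, here is a minimal sketch that pulls the title and the three meta tags out of a page using Python's standard html.parser module. The HeadReader class, the robots_directives helper and the sample page are hypothetical names made up for this example. It is not how any particular search engine works.

```python
from html.parser import HTMLParser

class HeadReader(HTMLParser):
    """Collect the title and selected meta tags from the head section."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            self.in_head = True
        elif tag == "title" and self.in_head:
            self.in_title = True
        elif tag == "meta" and self.in_head:
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description", "robots"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def robots_directives(content):
    """Split a robots meta tag value such as 'noindex, follow' into a set."""
    return {part.strip().lower() for part in content.split(",") if part.strip()}

# Hypothetical page, for illustration only.
PAGE = """<html><head>
<title>Example Widgets</title>
<meta name="keywords" content="widgets, blue widgets">
<meta name="description" content="A hypothetical page about widgets.">
<meta name="robots" content="noindex, follow">
</head><body><p>Hello</p></body></html>"""

reader = HeadReader()
reader.feed(PAGE)
directives = robots_directives(reader.meta.get("robots", ""))

print("Title:", reader.title)
print("Description:", reader.meta.get("description"))
print("May index:", "noindex" not in directives)
print("May follow links:", "nofollow" not in directives)
```

When the robots meta tag is missing, the directives set is empty, so both checks come out as allowed. That matches the default behaviour described above.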

To learn more about meta tags, see our Meta Tags and Basic Meta Tags articles.

Web Page Content

Web page content is everything between the <body> and </body> tags.

This is where the search engine spider (robot, web crawler) looks for the keywords and phrases you have in the page's keywords and description meta tags.

It will also find and follow links within the web page content.

The search engine spider (robot, web crawler) reads the content in the order that it appears in the page code, from the top down.

Some search engines use the first few words they find in the web page coding as the description under your listing in the search engine results. This is something you should keep in mind if you use HTML tables for your web page layout. With a table layout, a navigation column often comes first in the code, ahead of your main content.
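
The effect of source order can be seen with a small sketch, again using Python's standard html.parser module. The BodyReader class, the first_words helper and the sample page are made up for this example. The sample uses a table layout with the navigation in the first cell.

```python
from html.parser import HTMLParser

class BodyReader(HTMLParser):
    """Collect body text and links in the order they appear in the code."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0   # greater than zero while inside <script> or <style>
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "a" and self.in_body:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            chunk = " ".join(data.split())
            if chunk:
                self.text.append(chunk)

def first_words(chunks, count):
    """Return the first `count` words of the collected text."""
    return " ".join(" ".join(chunks).split()[:count])

# Hypothetical table layout: navigation cell first, main content second.
PAGE = """<html><head><title>Example</title></head><body>
<table><tr>
<td><a href="/home.html">Home</a> <a href="/about.html">About</a></td>
<td><h1>Blue Widgets</h1><p>We sell blue widgets.</p></td>
</tr></table>
</body></html>"""

reader = BodyReader()
reader.feed(PAGE)
print("Links found:", reader.links)
print("First words:", first_words(reader.text, 4))
```

In this sample the first words come from the navigation cell, not from the heading or the sales copy. A search engine that builds its description from the first few words would pick up the menu text.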

If your website uses frames, JavaScript or Flash, you should be aware that search engine spiders may not read content delivered this way.

A website that uses a doorway page (splash page) won't be too popular with the search engines either. If you are using one or considering one, it would be advisable to reconsider.

Proper heading structure is also important when preparing your web page for the search engines.
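
One way to review heading structure is to list the headings in source order and flag any that skip a level. The sketch below assumes that "proper" means starting at h1 and never jumping more than one level deeper at a time. That is a common convention, not a rule stated by the search engines. The HeadingOutline class, the skipped_levels helper and the sample markup are made up for the example.

```python
from html.parser import HTMLParser

class HeadingOutline(HTMLParser):
    """Collect (level, text) pairs for h1 to h6 in source order."""

    def __init__(self):
        super().__init__()
        self.current = None   # [level, text] while inside a heading
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.current = [int(tag[1]), ""]

    def handle_endtag(self, tag):
        if self.current and tag == "h%d" % self.current[0]:
            level, text = self.current
            self.headings.append((level, " ".join(text.split())))
            self.current = None

    def handle_data(self, data):
        if self.current:
            self.current[1] += data

def skipped_levels(headings):
    """Return headings that sit more than one level below the previous heading."""
    problems = []
    previous = 0
    for level, text in headings:
        if level > previous + 1:
            problems.append((level, text))
        previous = level
    return problems

# Hypothetical markup, for illustration only.
PAGE = "<body><h1>Blue Widgets</h1><h3>Sizes</h3><h2>Prices</h2></body>"

outline = HeadingOutline()
outline.feed(PAGE)
for level, text in outline.headings:
    print("  " * (level - 1) + "h%d: %s" % (level, text))
print("Skipped levels:", skipped_levels(outline.headings))
```

In the sample, the h3 is flagged because it follows an h1 with no h2 in between.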

Determine What Order the Search Engines See Your Web Page Content

There are a few ways you can determine what order the search engine bots are reading your web page content:

  1. Lynx Viewer

    The Lynx Viewer will linearize the content of your web page. This means it produces a report with the content in the same order the search engine bots will see it.

    Enter the web page address of the page you wish to test then click the view report button.

    This report has an added benefit: it shows you the number of links you have on the page and whether they are visible or not. This is a great way to find hidden links, which are a no-no if you want to keep the search engines happy.

  2. Web Accessibility Toolbar

    The Web Accessibility Toolbar for Internet Explorer provides a wealth of tools even if you are not into the study of website accessibility. This toolbar is also available for Opera users.

    Hint: Even if you are not an Internet Explorer or Opera user you should have these browsers on your computer for testing purposes. Each page needs to be cross-browser compatible if you want to take advantage of your site being accessible to all visitors.

    With the toolbar installed, go to the page you wish to test, then you have 2 options:

    1. Left click Check and then Lynx Viewer to get the same report as explained above.
    2. Left click CSS then Disable CSS. This will linearize your content right in the browser. Repeat the process to get things looking correctly again.

  3. Google Cache

    The third way to see how your content appears to the search engine bots is to check the cache of the page in question that Google has.

    1. Go to Google and look for one of your pages.
    2. Hover over the listing of your page in the results and a >> symbol will appear. Hover over that symbol for the "quick view" to appear on the right.
    3. Near the top of the quick view is a link called Cached. Click it.
    4. The page you are taken to has a grey bar across the top. At the bottom right of the bar is a link called Text-only version. Click that link.

    Like the other two methods, your page is linearized in the browser.

    The added bonus of looking at the Google cached version is that in the grey bar across the top it states the date and time this snapshot was taken.

    FYI: You can also view the web page cache via the Google Toolbar if you have it installed.

There you go. Three ways you can determine how the search engine bots are reading your web pages with bonus information depending on which method you use.
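
If you want a rough local version of the link count and visibility check mentioned under the Lynx Viewer, a sketch like the following can help. It is not how the Lynx Viewer works. It is a simple approximation that only looks at inline style attributes and empty link text, so links hidden through an external stylesheet or a script will be missed. The LinkReport class and the sample markup are made up for the example.

```python
from html.parser import HTMLParser

# Inline style rules that hide an element (spaces removed before matching).
HIDING_STYLES = ("display:none", "visibility:hidden")

class LinkReport(HTMLParser):
    """List each link with a rough guess at whether a visitor can see it."""

    def __init__(self):
        super().__init__()
        self.links = []       # dicts with href, text and hidden keys
        self.current = None   # the link being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and self.current is not None:
            # An image inside a link counts as visible content.
            self.current["text"] += attrs.get("alt") or "[image]"
        elif tag == "a" and attrs.get("href"):
            style = (attrs.get("style") or "").replace(" ", "").lower()
            hidden = any(rule in style for rule in HIDING_STYLES)
            self.current = {"href": attrs["href"], "text": "", "hidden": hidden}

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            link = self.current
            link["text"] = " ".join(link["text"].split())
            if not link["text"]:
                # No text and no image: nothing for a visitor to click.
                link["hidden"] = True
            self.links.append(link)
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data

# Hypothetical markup, for illustration only.
PAGE = """<body>
<a href="/home.html">Home</a>
<a href="/partners.html" style="display: none">Partners</a>
<a href="/tracker.html"></a>
</body>"""

report = LinkReport()
report.feed(PAGE)
print("Links on page:", len(report.links))
for link in report.links:
    status = "hidden?" if link["hidden"] else "visible"
    print(status, link["href"], link["text"])
```

Treat anything marked as hidden as a prompt to look at the page code yourself, not as a verdict.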

If you found this web page a useful resource for your own website please link as follows:

HTML Basic Tutor - www.htmlbasictutor.ca/
