May 15, 2007

Selective Page Indexing Directives

If you can control what parts of an HTML page are indexed by a search engine, you can really improve the quality of search results. Unfortunately, there is no standard way to do this, and Yahoo! has just added one more proprietary set of directives.

Some sections of an HTML page are the core content and some are navigation, ads, decoration, or site-related. If a search engine can index just the core content, it will have cleaner data in the search index. I think of this as “gold in, gold out” instead of the more common “garbage in, garbage out”.

Search engines support different, incompatible ways of marking sections which should and should not be indexed. This page includes a list of every selective indexing implementation I’ve found. The original list was compiled when I was designing the Ultraseek Page Expert feature back in 2004.

Yahoo is the only WWW search engine to implement any of these schemes, and they invented their own this month. They claim that they “did a little homework”. I’ll believe the “a little” part since they are reinventing the wheel, and it only took me one day to find and document all these tags. Their customers are already complaining about the “don’t index” sense, which has been a usability problem with many of the other directives over the past ten years.

Do not confuse these directives with the robots meta tag, which provides hints for indexing the entire page. These directives are for sections of a page.

Ultraseek Page Expert: Instead of a fixed directive, Page Expert allows you to configure which parts of the existing markup to index. The pages do not need to be changed to include new directives. It comes pre-configured for the MonArch and Hypermail mail archivers and for Javadoc. A visual preview highlights the parts of the page which will be indexed. The page types (sets of filters) can be applied to specific servers or sets of URLs. Page Expert filters can include multiple markup patterns with both index and noindex actions. See the Page Expert info at ultraseek.com for more details. This was introduced in Ultraseek 5.3 (September 2004).

Implemented in: Ultraseek

<noindex></noindex> This is the most widely implemented but has some problems. It isn’t legal HTML 4 or XHTML, and documents with this tag will fail validation. The noindex sections need to be entire blocks of structure, that is, you can’t do <noindex><p>a</noindex>b</p>. On the plus side, it is easy to see how the start and end match, and some HTML editors will help match them for you.

Implemented in: Verity/Autonomy K2, Ultraseek, Atomz, FDSE (with customization)

<!--stopindex--><!--startindex--> Legal in HTML or XHTML, but the sense of the directives confuses some users. People seem to expect to start and stop the noindex section, not the index section. One advantage is that these do not need to match or nest, so there can be multiple stopindex directives in different templates or SSIs, and indexing will still start at a startindex directive. These were proposed at Infoseek and implemented in Ultraseek Server (http://www.ultraseek.com/support/faqs/1001.html) in 1997.

Implemented in: Verity Ultraseek
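
The "do not need to match or nest" behavior is easy to see in code. The following is only a minimal Python sketch of the idea, not Ultraseek's implementation: the function name strip_stopindex is hypothetical, and tolerating whitespace or mixed case inside the comment is an assumption. Each directive acts as a switch rather than a bracket, so a repeated stopindex is harmless.

```python
import re

# Matches either directive. Allowing whitespace and mixed case inside
# the comment is an assumption, not documented behavior.
DIRECTIVE = re.compile(r"<!--\s*(stopindex|startindex)\s*-->", re.IGNORECASE)


def strip_stopindex(html: str) -> str:
    """Return the parts of html that fall outside stopindex...startindex spans."""
    kept = []
    indexing = True  # indexing is on until the first stopindex
    pos = 0
    for match in DIRECTIVE.finditer(html):
        if indexing:
            kept.append(html[pos:match.start()])
        # A directive sets the state outright; there is no nesting depth.
        indexing = match.group(1).lower() == "startindex"
        pos = match.end()
    if indexing:
        kept.append(html[pos:])
    return "".join(kept)
```

For example, a page assembled from two templates that each emit their own stopindex still resumes indexing at the single startindex: strip_stopindex("A<!--stopindex-->nav<!--stopindex-->more<!--startindex-->B") keeps only "A" and "B".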

<!--googleoff: all--><!--googleon: all--> A more complicated version of the stopindex structured comments accepted only by the Google Search Appliance. Instead of all, you may use anchor, snippet, or index. It isn’t exactly clear what happens when different directives are mixed or repeated, though some people think that googleon: all will enable all of the attributes. These are documented in Google’s appliance docs which are not publicly available. This description of googleon/googleoff matches what I’ve learned about them. These directives are ignored by Google’s WWW search engine.

Implemented in: Google Search Appliance

<p class="robots-nocontent"> A class which can be applied to any HTML element that allows the class attribute, the robots-nocontent class was introduced by Yahoo for WWW search in May 2007. This is the only selective indexing directive I know of for any WWW search engine. Like the stopindex and googleoff directives above, the inverted sense of this directive seems to confuse many users.

Implemented in: Yahoo! Web Search
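
Because this directive is an attribute on an element rather than a pair of markers, an indexer has to track element nesting to know where the excluded section ends. Here is a rough Python sketch of that idea using the standard library's html.parser. It is an illustration only, not Yahoo's implementation: the names NoContentFilter and strip_nocontent are hypothetical, and the sketch assumes every non-void element has an explicit end tag.

```python
from html.parser import HTMLParser

# Void elements never have an end tag, so they must not affect the depth count.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class NoContentFilter(HTMLParser):
    """Collects text that is not inside a class="robots-nocontent" element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # greater than 0 while inside an excluded element
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested element inside an excluded section
            return
        classes = (dict(attrs).get("class") or "").split()
        if "robots-nocontent" in classes:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.text.append(data)


def strip_nocontent(html: str) -> str:
    parser = NoContentFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.text)
```

The class attribute is split on whitespace, so an element such as <div class="ad robots-nocontent"> is excluded along with everything nested inside it. Markup that relies on implied end tags (an unclosed <p>, for instance) would throw off the depth count in this sketch.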

<alkaline skip></alkaline> The Alkaline search engine has a product-specific tag. “skip” is one of the options for that, and causes the contained content to be skipped by the indexer. This has the same disadvantages as <noindex>.

Implemented in: Alkaline

<!-- robots content="noindex" --><!-- /robots --> This structured comment borrows the ROBOTS meta tag format. It isn’t clear what happens if the start and end directives are not matched. Is that an error? Does it work like the <!--stopindex--> directives? This is used in two Perl-based search engines.

Implemented in: Fluid Dynamics Search Engine (FDSE), Darryl Burgdorf’s WebSearch

<!-- robots:noindex --><!-- /robots:noindex --> Proposed by Avi Rappoport of searchtools.com. This uses an XML namespace style in a structured comment. Not implemented by any search engines, as far as I know.

Please let me know of any other tags like this.

Posted by Walter Underwood (wunder@best.com) at May 15, 2007 09:10 AM