IBM Improves Similar Sites Search

IBM has produced a lot of significant research for information retrieval science, including Web search. Their HITS and CLEVER algorithms, for example, were incorporated into Teoma’s algorithm (Teoma is now owned by Ask). While it may seem strange that IBM has never gotten directly involved in the Web search market, they have put out some respectable science through the years.

Just imagine what IBM could do if its business model included manipulating and influencing search results. They could render most of the SEO industry obsolete within a matter of months.

Since I’ve recently been bitten by the patent bug again, I thought I would see what new ideas have been granted patents. Of course, given how long the process of obtaining patent protection really is, these ideas are not so much “new” as simply new to (you and) me.

So maybe you’ve played with Google’s Similar Pages feature and wondered, “What am I supposed to do with THESE results?” I know I have.

A search for pages related to SEO Theory leaves me scratching my head. Okay, some of the sites sort of make sense. Ms. Danielle, for example, does include an interview with me about the SEO Theory blog (which you can read here).

However, I’m pretty sure the W3 Schools’ page on XHTML falls short of the mark. Why does Google think these two sites are related or similar? (And why can they not use the same word, instead of using two really mutually-irrelevant words?) Here is what Google has to say about the related: query operator:
The query [related:] will list web pages that are “similar” to a specified web page. For instance, [related:www.google.com] will list web pages that are similar to the Google homepage. Note there can be no space between the “related:” and the web page url.

This functionality is also accessible by clicking on the “Similar Pages” link on Google’s main results page, and from the Advanced Search page, under Page Specific Search > Similar.

The only example they provide is for their own home page and those results SO look like a handjob they are not even laughable. Are we seriously supposed to assume that Lycos, Metacrawler, Hotbot, and Altavista are more similar to Google’s home page than Ask, Live, and AOL?

It would be helpful to search users if Google explained what the purpose of the operator is, who should benefit from it, and how we can improve our queries. Apparently, adding any parameters deactivates the query operator ‘related’ and converts the query to a text search. That’s not very useful, Google.

But never fear! Smith — er, IBM is here!

Their patent for a system and method to customize search engine results by picking documents proposes to enhance the related query function.

Now, here’s the bad news: so far as I know, IBM isn’t working with either Google or Yahoo! to implement this method on those search services.

As a searcher I’m still not entirely sure of why I would want to find similar pages. If I had designed the function, I would have used it to find similar sites (and I would have called the query operator similar:, not related:). The difference between a site and a page is an order of magnitude. It’s a waste of my time to be taken to an SEO article on a science fiction Web site (okay, maybe I should not have put an SEO article on my science fiction site, but I did, so get over it).

If I’m looking at SEO Theory and I decide I like it, I want to find other sites that specialize in (or at least have a significant investment of content in) the topic. Although I don’t know of any other pure theory SEO blogs, I do know of several where you can find a fair number of posts that fall into the SEO theory category. Some people have even tried to optimize for the query space.

Frankly, I find it to be a real pain in the you-know-where to find SEO theory-style blogs because I seem to have cluttered the name space with blogs about this blog. Shame on me. Or shame on you for writing about this blog and linking to it. It’s all very confusing.

IBM’s method proposes to clear up some of the confusion and, quite honestly, I think it would make a great SEO tool. They describe their proposed system thus:
When a normal query is run through a search engine, a user enters some keywords and then the user obtains a list of documents (web pages, PDFs, etc.) as search results. The result is listed in the order that the search engine believes best match with the keywords entered. However a user may want to generate a set of keywords such that a specific set of documents are returned. Then according to the present invention, a search engine takes a set of documents and generates the keywords from the set of documents such that entering the keywords into the search engine will yield the set of documents. The search results may produce many other documents and indeed the required documents may be split over many pages of results. The present invention further describes an algorithm that finds the ideal set of keywords such that the documents required appear close to the top of the search results.

Emphasis is mine

They continue:
A user launches a search engine to perform an Internet search. The user narrows down a query until the search engine returns the links the user is interested in. For example, a user has three web pages they are interested in. Then the user may enter the URLs of these web pages into a system and be given back a set of search keywords. The screen shot in FIG. 1 shows a convenient way to choose web pages based on selecting web pages in a browser using a check box 104 next to each link. Notice the check box 104 next to each search result. When the user has selected links from each result page, they click the “Find Matching Search” button 102 as shown in FIG. 1.

The search engine then generates keywords that best reflect the selected pages. This is achieved by performing a content analysis of the pages.
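The patent excerpt does not spell out how the content analysis works, so here is only a minimal sketch of one way it could be done. It assumes a TF-IDF style score (a technique swapped in for illustration, not IBM's stated method), and the names `tokenize`, `suggest_keywords`, and `STOPWORDS`, along with the sample pages, are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def suggest_keywords(selected, background, top_n=3):
    """Rank terms that are frequent in the selected pages and rare overall.

    Assumption: a TF-IDF style score stands in for the patent's
    unspecified "content analysis".
    """
    all_docs = selected + background
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in all_docs for term in set(tokenize(doc)))
    # Term frequency, counted only inside the pages the user selected.
    tf = Counter(term for doc in selected for term in tokenize(doc))
    n = len(all_docs)
    scores = {t: count * math.log((1 + n) / (1 + df[t])) for t, count in tf.items()}
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


selected_pages = [
    "search engine optimization theory and ranking algorithms",
    "ranking theory for search engine optimization",
]
other_pages = [
    "science fiction novels and search for new worlds",
    "cooking recipes with fresh ingredients",
]
print(suggest_keywords(selected_pages, other_pages))
```

In this toy data the word "search" scores low because it also turns up on the science fiction page, which is the general idea: the suggested keywords should describe the selected pages and not much else.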

So we give their method a query, and the query produces results on a search engine like Google or Yahoo!. The system analyzes the pages returned by the search engine and extracts the keywords from those documents that are most relevant to the query. You can adjust the query to filter out documents that are improperly ranked because of anchor text.

IBM’s proposed search enhancement method helps you identify the keywords that are common to a class of documents and then use those keywords to find similar documents. The documents you select from your original queries are used as benchmarks. You can tell the system to construct queries that place those documents anywhere in the new results: at the top, in the middle, at the bottom.
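To illustrate the idea of steering benchmark documents toward a chosen position, here is a rough sketch under heavy assumptions. The "search engine" is a toy that scores pages by counting matching query terms, the keyword search is plain brute force over small combinations, and every name here (`toy_rank`, `best_query`, the example URLs) is made up for the example. The patent describes its own algorithm, which this does not attempt to reproduce.

```python
from itertools import combinations


def toy_rank(query_terms, corpus):
    """Order document ids by how many query terms each document contains.

    Assumption: a stand-in for a real engine's ranking. Ties are broken
    by document id so the output is deterministic.
    """
    def score(doc_id):
        words = set(corpus[doc_id].lower().split())
        return sum(term in words for term in query_terms)

    return sorted(corpus, key=lambda doc_id: (-score(doc_id), doc_id))


def best_query(candidates, benchmarks, corpus, target_rank=1, max_terms=2):
    """Brute-force the keyword combination that puts the benchmark
    documents closest, on average, to target_rank (1-based)."""
    best = None
    for size in range(1, max_terms + 1):
        for combo in combinations(sorted(candidates), size):
            ranking = toy_rank(combo, corpus)
            # Mean distance between each benchmark's rank and the target.
            cost = sum(
                abs(ranking.index(b) + 1 - target_rank) for b in benchmarks
            ) / len(benchmarks)
            if best is None or cost < best[0]:
                best = (cost, combo)
    return best[1]


corpus = {
    "a.example/fiction": "science fiction seo article",
    "b.example/tools": "seo tools and link building",
    "c.example/theory": "seo theory ranking signals",
    "d.example/analysis": "seo theory link analysis",
}
benchmarks = ["c.example/theory", "d.example/analysis"]
keywords = {"seo", "theory", "link", "ranking"}

print(best_query(keywords, benchmarks, corpus, target_rank=1))  # ('theory',)
print(best_query(keywords, benchmarks, corpus, target_rank=3))  # ('seo',)
```

Asking for the top of the results picks the term only the two benchmark pages share, while asking for the middle picks a broader term that lets other pages rank ahead of them.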

This can help you identify documents that are more relevant than yours for specific topics with which certain lexicons are closely associated. This can also help you identify documents that are closely associated with each other through frequent mutual reference. This can help you identify documents that use your scraped content. This can help you identify documents in a Web neighborhood you didn’t know existed.

The potential benefit Web searchers (and optimizers) could realize from this methodology is substantial. We would have a self-defining, user-configurable similar pages process. Much as I would still like to find similar sites, I could live with this system, I think (unless Google has already implemented it, in which case it doesn’t work very well after all).

I can see the backlink research stars in your eyes. Yes, this kind of system could be used to find potential new linking partners. What a waste of a great resource that would be, but if we ever get this kind of capability I suppose people would eventually figure out it’s good for a lot more than searching for backlink sources.

A system like this could return relevance to the forefront of search and help users bypass some of the crappy sites that Google promotes in its search results (I suspect a robust Ickipedia article would still show up in many query sets).

The only way I would want to modify this methodology would be to allow the user to block domains (or sub-portions of domains) from appearing in the result sets. Unfortunately, even Google’s crude “similar sites” function won’t let you do that.
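A blocking feature like that could be as simple as a filter applied to the result list. The sketch below is only an assumption about how it might look: the rule format (`domain` or `domain/path-prefix`), the function names `is_blocked` and `filter_results`, and the example URLs are all hypothetical.

```python
from urllib.parse import urlparse


def is_blocked(url, rules):
    """Return True if the URL matches any blocking rule.

    A rule is either a bare domain ("example.org"), which also covers its
    subdomains, or a domain plus a path prefix ("example.com/forum").
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = (parsed.path or "/").lower()
    for rule in rules:
        rule_host, _, rule_path = rule.lower().partition("/")
        host_match = host == rule_host or host.endswith("." + rule_host)
        # An empty rule path becomes "/", which every path starts with.
        path_match = path.startswith("/" + rule_path)
        if host_match and path_match:
            return True
    return False


def filter_results(urls, rules):
    """Keep only the result URLs that no rule blocks, preserving order."""
    return [url for url in urls if not is_blocked(url, rules)]


results = [
    "https://wiki.example.org/wiki/SEO",
    "https://www.seo-theory.com/",
    "https://example.com/forum/thread1",
    "https://example.com/blog/post",
]
blocked = ["example.org", "example.com/forum"]
print(filter_results(results, blocked))
```

With those two rules the whole wiki domain and the forum section drop out, while the blog section of the same domain stays in the results.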

But one can always hope that this area of search will improve. At least someone took some time to give thought to the matter.

www.seo-theory.com

published @ September 25, 2008
