
Analyzing Cuil


While clueless and/or impatient people are beating up on Cuil for not being the Google killer that Cuil never claimed to be, I’ve been quietly checking out Cuil’s search results. Cuil designer Tom Costello told Tech Ticker that he expects Cuil to be a competitive player within five years.

Of course, many people were disappointed in Cuil’s search results when the search engine launched. Costello said in one of the video segments with Tech Ticker that the number 1 complaint they received from Webmasters was that their sites did not rank first for specific queries. One Las Vegas dentist made such a complaint in the comments section.

Another complaint many people made was that search results were repeated across many results pages. I noticed similar duplication of results on the first day of testing, and perhaps for a while after that, but I’ve had trouble replicating the problem recently.

Because it’s such a new search engine, Cuil lacks a lot of the bells and whistles that Google has championed. People have tried all sorts of queries on Cuil that Cuil never claimed to support. For example, in a comment for a review of Cuil by the Wall Street Journal’s Numbers Guy someone wrote:
Cuil doesn’t do conversions or math.
When you search “24+86=” at Google, it gives you the answer: 110. When you do that at Cuil, it gives “No results”. When you search “86f to c”, Google converts Fahrenheit to Celsius and gives the answer: 30 C. Cuil doesn’t do that.

While it makes sense to test Cuil to see what query formats it supports (they should have published documentation to help users), expressing disappointment and frustration at its lack of support for esoteric queries makes no sense. You might as well complain that Google doesn’t put air in your tires; since Google never promised to do that, you have just as much reason to complain about Google as about Cuil.

The encumbrance of false expectations and inappropriate measurement has held people back from understanding what Cuil represents in terms of the evolution of the search experience. In fact, until recently I did not understand that Cuil is banking mostly on its ability to index content at substantially lower cost than Google and the other major search engines. That is the key to unraveling Cuil’s mysteries.

Tom Costello apparently has devised a new way to mathematically index Web pages that requires fewer resources than traditional search engines use. The Numbers Guy tried to deduce some numbers about Cuil and Google based on information Vince Solitto, Cuil’s VP of Communications, provided:
But is it still that size? Last week, Google announced that it was processing one trillion unique links online. The Google index doesn’t include all of the pages found at these links. Mr. Solitto said Cuil’s research has found that the average Web page has almost 20 links, which would suggest that there are more than 50 billion pages in Google’s index. Google also removed an undisclosed number of duplicate pages.
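
Just to sanity-check the arithmetic behind that estimate (using only the rough figures quoted above), here is the back-of-envelope calculation:

```python
# Back-of-envelope check of the Numbers Guy's estimate, using the figures from the quote above:
# roughly 1 trillion unique links reported by Google, and Cuil's claim of almost 20 links
# on an average Web page. These are approximations, not measured values.
total_links = 1_000_000_000_000   # unique links Google reported processing
links_per_page = 20               # Cuil's estimate of links on an average page

estimated_pages = total_links / links_per_page
print(f"Estimated pages in the index: {estimated_pages:,.0f}")  # roughly 50 billion
```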

We know several things about Google’s infrastructure that may provide insight into why Google requires so many resources.

First, Google operates multiple data centers. It started out on one server and quickly added a few more, but eventually Google had to start building data centers. Cuil doesn’t yet have the resources to create several data centers, so it has to make do with a small set of servers. Google has the advantage of operating concurrent crawls and ranking algorithms, as well as implementing worldwide load balancing. Cuil will have to grow that capacity gradually, just as Google and other major search engines did.

Second, Google divides its search engine into sub-portions. Maybe they’re still using their shard technology, maybe not. Through shards, Google was able to segment its database (and file system) so that portions of the system could be taken offline without significantly impacting performance. My previous attempts to deduce how Google uses shards have been challenged by Googlers (i.e., they said my guesswork was wrong), but it’s my understanding that sharding provides a means to duplicate data without having to replicate ALL data equally.
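
To illustrate what I mean by shards providing duplication without replicating everything equally, here is a minimal sketch. This is generic guesswork about how any sharded index might spread documents across machines, not a description of Google’s actual system; the shard count, replica count, and server names are made up.

```python
# A minimal sketch of a sharded index: documents are hashed to shards, and each shard
# lives on more than one server so part of the system can go offline without losing data.
# Illustrative only; not Google's (or Cuil's) actual design.
import hashlib

NUM_SHARDS = 8
REPLICAS_PER_SHARD = 2  # hypothetical: heavily used shards could get more replicas than others

def shard_for(doc_id: str) -> int:
    """Map a document ID to a shard with a stable hash."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replicas_for(doc_id: str) -> list[str]:
    """Name the servers that hold copies of this document's shard."""
    shard = shard_for(doc_id)
    return [f"index-server-{shard}-{r}" for r in range(REPLICAS_PER_SHARD)]

print(replicas_for("http://www.example.com/page.html"))
```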

Third, Google uses links for discovery, to determine Web site structure, and to assess trust (between sites), relevance (through anchor text), and popularity (through PageRank). Cuil claims it doesn’t put much emphasis on links (although Danny Sullivan suggested it may be using some sort of link popularity). We can be sure that Cuil has to use links for discovery; we cannot be sure that Cuil pays much attention to links in any other way.
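
For readers who want a concrete picture of “link popularity,” here is a toy PageRank-style calculation: each page repeatedly spreads its score across its outgoing links. It is purely illustrative; neither Google nor Cuil has published code like this, and it ignores refinements such as dangling pages.

```python
# Toy link-popularity calculation in the PageRank style (simplified; dangling pages ignored).
def pagerank(links, damping=0.85, iterations=20):
    """links: dict mapping page -> list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)  # spread this page's score over its links
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

demo = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(demo))
```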

Fourth, Google started out building a database that indexes every word on the Web, associating as many documents with those word indexes as possible. Now, after two changes in wiring (the December-2005 Bigdaddy rewrite and the May-2007 Searchology rewiring), no one outside of Google can really be sure of how they index words. Except that we can be certain Google no longer indexes every word in all documents, because Google’s supplemental index is still going strong, perhaps stronger. Nonetheless, Google’s word-document database has to be huge because it now handles query substitutions, linking relationships, and some phrase-like parsing.
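
The “index every word” model is easier to reason about with a tiny example. The sketch below builds a bare-bones inverted index (word to documents); it is a textbook illustration, not a claim about how Google or Cuil actually store their data.

```python
# A bare-bones inverted index: every word maps to the set of documents that contain it.
from collections import defaultdict

def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    "page1": "cuil launched a new search engine",
    "page2": "google runs a very large search index",
}
index = build_index(docs)
print(index["search"])   # both pages contain the word "search"
```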

In fact, Cuil seems to have no phrase-like parsing yet. I have been unable to use more than 3-4 words to pull up relevant results in Cuil. Phrases and proximity-based queries will have to wait, and perhaps, just as PageRank proved impossible for Google to manage across the entire Web, phrases may be the Achilles’ heel of Cuil’s mathematics-driven index.
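
Phrase and proximity support is expensive precisely because the index has to remember where each word appears, not just which documents contain it. The sketch below shows one conventional way to do it with positional postings; I am assuming nothing about Cuil’s internals here.

```python
# A positional index: word -> document -> list of positions, which makes exact-phrase
# matching possible. A conventional textbook approach, not any particular engine's code.
from collections import defaultdict

def build_positional_index(docs):
    index = defaultdict(lambda: defaultdict(list))  # word -> doc -> [positions]
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_match(index, phrase):
    """Return documents where the words of `phrase` appear consecutively."""
    words = phrase.lower().split()
    candidates = set(index[words[0]].keys())
    for w in words[1:]:
        candidates &= set(index[w].keys())
    results = []
    for doc in candidates:
        starts = index[words[0]][doc]
        if any(all(p + i in index[w][doc] for i, w in enumerate(words)) for p in starts):
            results.append(doc)
    return results

docs = {"a": "seo theory covers search engine analysis", "b": "engine search is not the same phrase"}
idx = build_positional_index(docs)
print(phrase_match(idx, "search engine"))   # only document "a" contains the exact phrase
```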

Still, what is it that Cuil is doing that so frustrates some people? I think you can boil it down to: “Not indexing and ranking content the way that Google does.”

So, Mr. Costello and his wife (former Googler) Anna Patterson are telling everyone that they have no intention of organizing and presenting data the way Google does. After all, why would anyone want another search engine that just serves up Google-like results? That makes absolutely no sense whatsoever. I get it. They get it. But how many other people get it? This seems to be the barn-sized point that has escaped many people in the oh-so-quick-to-judge SEO and Webmaster communities.

On the other hand, except for the comment about challenging Google in five years, Cuil hasn’t really said much about what it thinks it can do. They have been extremely coy about disclosing how they crawl and index the Web (although in 2007 a number of Webmasters with large sites struggled to block Twiceler, Cuil’s robot). We know there is some sort of mathematical approach — one that was, apparently, inspired by the way kindergartners learned mathematical logic faster than Stanford Ph.Ds.

Not to take anything away from Stanford Ph.D.s and all that they have achieved, but sometimes you need a different perspective if you want to change the way things work. A lot of people are satisfied with Google as their primary search engine. Despite the fact that you cannot effectively use Google for site search, despite the fact that you have to page past all the crap sites Google promotes to the top of its search results, despite the fact that Google is so preoccupied with fighting Web spam and introducing new services and gadgets that it doesn’t focus on relevance, and despite the fact that Google devalues links and Web sites several times a year (sending thousands of Webmasters and SEOs scrambling for yet more links), approximately 1/4 of search engine users seem satisfied to continue using Google.

Cuil is not about to unseat Google, and anyone who hoped it might do so this year was just extremely naive. You cannot roll out a search engine of that magnitude overnight; it has to be built from the ground up, continuously, over a period of several years. If Cuil’s new mathematical approach to indexing content is really scalable, it will have to scale in more than one way: it needs to support large, complex queries, and the sooner it adds that support, the better.

Cuil needs to provide a site search and it needs to index more words per page than Google. Based on what I’ve seen so far, if I had to make a guess, I would surmise that Cuil is looking at statistically prominent words. That is, the more frequently a word occurs in a document in a non-structural position, the more likely Cuil is to associate that document with the word. Repetition alone doesn’t seem to be the answer. And numbers don’t necessarily tell us the entire story. For example, if you search for ‘the’ on Cuil you’ll see that Cuil reports about 121 billion results.
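
To make the “statistically prominent words” guess concrete, here is a rough sketch of what such a filter might look like. The choice of which regions count as “structural” is entirely my assumption; Cuil has published nothing about this.

```python
# A rough sketch of "statistical prominence": count how often each word appears in body
# copy while skipping structural regions such as navigation and footers. The set of
# structural region names below is an assumption for illustration only.
from collections import Counter

STRUCTURAL_MARKERS = {"nav", "menu", "footer", "sidebar"}  # assumed structural regions

def prominent_words(sections, top_n=5):
    """sections: list of (section_name, text) tuples for one page."""
    counts = Counter()
    for name, text in sections:
        if name in STRUCTURAL_MARKERS:
            continue  # ignore words that only appear in structural positions
        counts.update(w for w in text.lower().split() if len(w) > 3)  # crude stopword filter
    return counts.most_common(top_n)

page = [
    ("nav", "home about contact sitemap"),
    ("body", "seo theory discusses search engines and how search engines index pages"),
    ("footer", "copyright 2008"),
]
print(prominent_words(page))
```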

Curiously, I found a very odd descriptor text for the USPS Web site. The text accurately described the United States Post Office but it occurred nowhere on the USPS site. Where did it come from? Apparently from a Wikipedia scraper site — or maybe from an older Wikipedia page. In a search for “cover of time magazine” I found a listing for one of the Time covers — provided on time.com — with a block of text that is not found on that page. However, the text IS made available on a search results page where the same time.com page is listed.

That is, in both cases, Cuil grabbed reasonably relevant text from alternative sources to describe a page that lacked descriptive text of its own.

We’ve seen this behavior before: Google, Yahoo!, and other major search engines have been historically notorious for substituting their directory descriptions (or DMOZ directory descriptions) for page meta descriptions or snippets of text from pages. Cuil seems to be seeking relevant descriptive text on pages that describe Web sites, but it has not limited its substitution source to DMOZ and the Yahoo! directory.

In fact, I found appropriate meta description text on both sites, but Cuil ignored the meta tag. Hence, I think we have to conclude that Cuil is looking for on-page descriptive text, and the text does not have to be explicit. That is, if it’s statistically obvious that a page is about Michael Martinez, Cuil seems inclined to grab text from the page to display in a query for “michael martinez”. I was able to find several pages listed in multiple queries that are relevant to me and SEO Theory, so Cuil does not appear to have limited pages to being found for only one query.
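
If my reading is right, the snippet-selection logic would look something like the fallback below. The field names and the ordering are my own guess at the behavior described above, not anything Cuil has documented.

```python
# A hedged guess at the snippet-selection behavior described above: favor descriptive
# on-page text, fall back to third-party descriptions, and only then use the meta
# description (which traditional engines would usually try first). Illustrative only.
def choose_snippet(page):
    """page: dict with optional keys 'on_page_text', 'third_party_text', 'meta_description'."""
    if page.get("on_page_text"):
        return page["on_page_text"]
    if page.get("third_party_text"):      # e.g. text from a directory or Wikipedia-like source
        return page["third_party_text"]
    if page.get("meta_description"):      # ignored by Cuil in the two cases described above
        return page["meta_description"]
    return "No description available."

usps = {
    "meta_description": "The official site of the United States Postal Service.",
    "on_page_text": "",
    "third_party_text": "USPS delivers mail and packages across the United States.",
}
print(choose_snippet(usps))
```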

Nonetheless, I have to wonder just how much indexable text is being extracted from each page and how much Cuil might be favoring sources of information about Web sites. There are probably several thousand Web sites that publish information about other Web sites. The most well-known examples include major directories like DMOZ, Yahoo!, and JoeAnt; consumer-generated content sources like Wikipedia, its imitators and scrapers, and hybrid directories; perhaps search scraper sites; and maybe informational sites like news archives.

If the mathematical algorithm relies upon third-party sources of information about Web sites, then Cuil is going to face some pretty steep criticism from Webmasters who have made the effort to inform search engines about their sites. Tom Costello lamented the fact that Yahoo! stopped updating its directory in the video I cited above. He seemed to feel like Mahalo has a chance to replace Yahoo!’s directory but not to challenge Google.

I’m not sure, however, that relying upon the Wikipedia Principle of providing minimally relevant content in the least expensive way possible is a viable means of building a search service to challenge Google. For all its faults, Google still pays closer attention to what Web sites say about themselves. That’s value I don’t think any search engine should intentionally discard.

Time will tell whether Cuil takes too many shortcuts; but if it turns out they have, they’ll have to show that they are more flexible than Google. Google, by luck or quality, has reached the top of the pyramid. It can afford to be inflexible as long as no one unseats it. In five years, however, Microsoft may be the search engine to beat. They’ve already surpassed Yahoo!

www.seo-theory.com

published @ September 20, 2008

