
The Google Sandbox Patent


Bill Slawski is the search engine patent guru of the SEO industry. I occasionally make forays through the patent applications but I hate having to figure out what they are trying to say in their overly specific patent application jargon. Bill, who has spent far more time deciphering patent applications than I, does a better job than anyone else in our field (in my humble opinion) of figuring out the possible implications and ramifications of these documents.

On August 28, 2008, Bill Slawski began an analysis of the Google Sandbox Patent. The Google Sandbox Patent is Patent Number 7,346,839. Although it was granted on March 18, 2008, Google filed for the patent on December 31, 2003. And notice that Matt Cutts’ name is listed second in the list of inventors. (NOTE: See the end of this article for a recap of how the SEO community discovered this patent.)

This patent has “anti-spam metric” stamped all over it.

There is another Google patent application that was (wrongly) identified as the Google Sandbox Patent a couple of years ago. That would be DOCUMENT SCORING BASED ON DOCUMENT INCEPTION DATE, which Google filed in 2006.

Those of you who suffered through the Year of Hell (March 2004 to May 2005) will recall that SEO forums lit up with complaints about new Web sites not appearing in the Google search results, even for their own names. Why anyone in the SEO industry would conclude that a patent application filed in 2006 would explain the Sandbox Effect is beyond me, but I’ve often questioned the analyses offered by many SEO pundits so we won’t dwell on this further.

In February 2006 I came across a reference to something John Scott had posted about the Sandbox Effect in 2004. I shared my idea that John’s claims about the Sandbox and link age had been validated, although I ended with the admission that no smoking gun had yet appeared. As I reported on SEOmoz at the time, “after two years of wading through Google Sandbox discussions, Google patents, patent discussions, and examining thousands of Web sites (including the backlink footprint for several hundred of them in detail) I’m still scratching my head. But I’ve come to believe that John was probably reporting the right thing all along.”

I am pretty sure that Google Patent 7,346,839 is the smoking gun people hunted so high and low for through the years. The patent fits all the clues that I was able to gather in the Spider-food post.

Although I’ve been managing Web site search results for about 10 years now, I only worked with one site that was sandboxed: my personal domain, michael-martinez.com. I registered the domain in January 2004. I took it live in May 2004. I spent the next 8 months looking for it in Google’s search results. At the time, “Michael Martinez” was not as popular a Web site name as it has now become (due to many other guys named Michael Martinez who have since started carving their niches on the Web).

I did point links at the domain from my other sites (including Xenite.org) but I only pointed a small number of links. Eventually, my domain hit the top ten results for my name and as more people linked to it, it settled in at the number 1 position on Google. Still, the experience puzzled me because I was well aware of all the SEO forum discussions about the Sandbox Effect. There were some pretty intense discussions back then. And there were some horrifically wild, bad ideas being passed around (mostly concerning the Hilltop algorithm that Krishna Bharat developed for Google News in 2002).

Hilltop never even came close to explaining half the things SEOs have blamed it for, but the idea that links were responsible for sites not being trusted enough was both interesting and disturbing. Clearly, as the Google-bombing craze that started in 2003 demonstrated, Google’s search results were easily subjected to substantial link-based manipulation. In fact, Google has always been vulnerable to link-based manipulation.

But the idea of putting links on probation seemed pretty radical to me. I no longer feel that way, as I have seen links appear to pass no value for several weeks or months and then suddenly they start to have an impact. I generally assume I need to wait [CENSORED BY VISIBLE TECHNOLOGIES SEO RESEARCH TEAM] before I can expect to see links on new sites begin to have an impact on search results.

How, I wondered (as have many other people, I am sure), does one “put links on probation”? There is an intriguing sentence in the Google Sandbox Patent that pretty much explains everything:
According to another aspect, a method for scoring documents is provided. The method may include determining an age of linkage data associated with a linked document and ranking the linked document based on a decaying function of the age of the linkage data.

All the emphasis is mine.

They offer a little more detail further on:
In one implementation, search engine 125 may modify the link-based score of a document as follows: H=L/log(F+2), where H may refer to the history-adjusted link score, L may refer to the link score given to the document, which can be derived using any known link scoring technique (e.g., the scoring technique described in U.S. Pat. No. 6,285,999) that assigns a score to a document based on links to/from the document, and F may refer to elapsed time measured from the inception date associated with the document (or a window within this period).

For some queries, older documents may be more favorable than newer ones. As a result, it may be beneficial to adjust the score of a document based on the difference (in age) from the average age of the result set. In other words, search engine 125 may determine the age of each of the documents in a result set (e.g., using their inception dates), determine the average age of the documents, and modify the scores of the documents (either positively or negatively) based on a difference between the documents’ age and the average age.

The patent application doesn’t specify a base for the Log() function, which could mean someone left out an important mathematical element somewhere in the process of getting the patent application onto the Web, or it could mean that the patent is using an indefinite logarithm, for which it is unnecessary to know what the base is.

If a base is required for the computation, the Log() function would be a definite logarithm (for which bases are usually set to 2, 10, or e). The most common base in statistics, mathematical analysis, economics, and engineering is the natural logarithm, log_e(value). The equation X = log_e(Y) can be rewritten as e^X = Y, or Y = (~2.71828)^X.

Confused? You should be, because there is not enough information in the patent to make it clear (to me, at least) whether or not a base is required for the equation. Does that matter?

Well, we’re dealing with three values:

  1. H is the history-adjusted link score
  2. L is the link score given to the document (let’s pretend it’s internal PageRank)
  3. F is the elapsed time since the document’s inception date (its recorded age within the database)

So we’re computing the logarithm (indefinite or natural) of the document’s age plus 2 and then dividing the link score (possibly PageRank) by that value to obtain the history-adjusted link score. An older document would have a larger value for F than a younger document, so the divisor grows, and the adjusted score shrinks, as the document ages.
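For readers who want to see how the numbers behave, here is a minimal sketch of the patent’s example formula in Python. The function name, the unit of F, and the sample link score of 1.0 are my own assumptions; the patent specifies none of them.

import math

def history_adjusted_score(link_score, age, base=math.e):
    # The patent's example formula: H = L / log(F + 2).
    # The base of the logarithm is not specified in the patent,
    # so it is left as a parameter here.
    return link_score / math.log(age + 2, base)

# A link score of 1.0 (a stand-in for internal PageRank) applied to
# documents of increasing age:
for age in (0, 1, 6, 12, 36):
    h_e = history_adjusted_score(1.0, age)            # natural log
    h_10 = history_adjusted_score(1.0, age, base=10)  # base 10
    print("F=%3d  H(base e)=%.3f  H(base 10)=%.3f" % (age, h_e, h_10))

Two things fall out of running this. Switching bases multiplies every H by the same constant (log_b(x) = ln(x)/ln(b)), so the choice of base changes the scale of the adjusted scores but not their relative order. And in either base, H shrinks as F grows.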

Bill’s first review of the patent sums it up thus:
The historical data patent goes on to explain that sometimes some “older documents may be more favorable than newer ones” and that some sets of results can be fairly mature. The scores of documents can be influenced (positively or negatively) by the difference between the document’s age, and the average age of documents resulting from a query.

So, a fairly new site that appears amongst a set of results that are, on the average fairly old, may find it being negatively influenced by that difference in age.

All the bold emphasis is mine.

This patent proposes a complex series of computations and valuations, providing for considerable flexibility that Bill goes into some depth about, so I’ll spare you my analysis. The point here is that the document suggests that the value links can pass to other documents is being tweaked by a set of computations which may reward or penalize documents based on their age relative to the average age of all documents associated with a query or set of queries.

In other words, back in 2004, when the Sandbox Effect was first noticed, Google may have been looking at more than just link age. Rather, Google may also have been looking at the age of the document sets associated with queries — calibrating its link weights by query space, a principle wholly consistent with the Theorem of Query Space Optimization, which I summarized thus:
Hence, the Theorem of Query Space Optimization tells us that queries remain productive only as long as there is search interest in them, and only as long as relevant content is promoted for those query spaces. In other words, a query space exists only as long as there is interesting content populating the query space, which means that a query space can actually become self-sustaining if and only if the query space produces continually new and interesting content.

In my article I suggested that a query space can die because people lose interest in the topic — no one creates new content and no one is interested in old content. The Google Sandbox Patent (and other patents) suggests that Google distinguishes between several types of query spaces:

  1. Static query spaces - there is no need for new content, but interest in the old content remains steady
  2. Fresh query spaces - the need for new content is continual, regardless of interest in old content
  3. Inactive query spaces - no one is searching for the query but there are documents that once populated an active query space

Under this patent, it might be more difficult for new content to break into a static query space because the older established sites provide sufficient information about the query. The average age of the query space would demand that a document “age” in order to compete with older documents, both in ranking and in passing value through its links to other documents.

On the other hand, a new document would be able to rank better than old documents in a fresh query space because the age-based adjustment would favor younger documents. Documents would literally age out of a query space. This could happen with product pages, review pages, news articles, blog posts, and press releases.
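Here is a toy sketch in Python of how an average-age adjustment might play out in the two kinds of query spaces just described. The linear adjustment, the weight values, the sign conventions, and the age units (call them months) are all my own assumptions; the patent only says that scores may be modified positively or negatively based on the difference from the average age of the result set.

from statistics import mean

def age_adjusted_scores(results, weight):
    # Nudge each document's score according to how far its age sits from
    # the average age of the result set. A positive weight rewards age,
    # a negative weight rewards youth.
    avg_age = mean(age for _, age, _ in results)
    adjusted = [
        (doc, score * max(0.0, 1 + weight * (age - avg_age)))
        for doc, age, score in results
    ]
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)

# Static query space: old, established documents. With a positive weight,
# the newcomer's slightly higher base score is not enough to rank.
static_space = [("old-a", 48, 0.80), ("old-b", 60, 0.78), ("newcomer", 2, 0.82)]
print(age_adjusted_scores(static_space, weight=0.01))

# Fresh query space: mostly young documents. With a negative weight, the
# older page "ages out" even though its base score is the highest.
fresh_space = [("news-a", 1, 0.80), ("news-b", 2, 0.78), ("evergreen", 30, 0.82)]
print(age_adjusted_scores(fresh_space, weight=-0.01))

The point of the sketch is only that the same average-age signal can push in opposite directions depending on the query space, which is consistent with the static versus fresh distinction above.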

An inactive query space might behave peculiarly under this patent: for example, suppose new links were embedded in old, established documents. Those links’ weightings might work better for new documents in static query spaces. The patent does provide for measuring the relationship of a document to a query space, but that relationship can change over time.

In fact, this patent might explain why chasing long tail queries before tackling the big queries works so well. As a document’s performance for relevant queries improves, its chances of scoring high improve. It’s similar to the filthy linking rich principle, but it’s more like the success breeds success principle. The more often your document scores well for a query, the more likely your document is to score well for a relevant query.

One of the methods people used to obtain new search traffic when confronted with the Google Sandbox Effect was to chase long tail queries where competition was relatively non-existent. A site that could not rank for its own name might still appear in other queries for which it was more relevant (in fact, even today most SEOs don’t practice on-page brand strengthening as much as they could).

This patent application — if it accurately represents what Google unleashed in 2004 — is responsible for many of today’s “tried and true” SEO strategies. Based on analysis of this document, a lot of SEOs implemented both content-building and link-building strategies that still work today — to a certain extent. The problem, as I see it, is that a lot of SEOs just follow the strategies blindly without understanding why they do what they do. They are not aware of the contexts for the various rules, which may or may not still be in place (in fact, we know that Google has added new rules since then).

You cannot shape a relevant SEO strategy today just by looking at this document, but it does help explain why some techniques developed in 2004 and 2005 are still useful today. Those techniques emphasized aging and earning trust. An aged domain doesn’t always outrank every other domain, but a trusted domain can certainly pass link value, and the foundation of many SEO strategies is the passing of value through links from document to document.

When the SEO Community First Discovered This Patent
I think it’s a tossup between Bill Hartzer’s Live Journal post and msgraph’s SEW Forum post, both of which appeared on March 31, 2005. I can find no earlier references to the patent application, which appears to have been published on that date (Bill pointed out the date in the forum discussion).

People in the heart of the SEO industry discussed the patent application for a few weeks after that time on both blogs and forums. It is curious, therefore, that some SEOs continued to look to other patents to explain the Sandbox Effect, which Googlers conceded was an unintended effect of whatever Google did in 2004.

And yet, here we are, still discussing this patent in September 2008 — more than 3 years after it was published, 4 years after the first reports of emergence from the Sandbox began trickling across the Web.

This document defined the boundaries of practices that will still be in use at least two years from now, if not longer. It may still accurately describe some parts of the Google algorithm, for all we know.

It could very well be that the algorithmic changes this patent proposes made it the single most influential document in the history of search engine optimization. At least, up until now.

www.seo-theory.com

published @ September 24, 2008
