Linkscape Index Update Doubles Size!

The long-awaited Linkscape index update is here.  We've gotten a lot of feedback, we've heard about a few success stories, and we have a few thoughts from the development side to share with you.


First, we've included about 38 billion URLs, from about 230 million sub-domains (e.g. twopieceset.blogspot.com) inside about 48 million second level domains (e.g. *.blogspot.com).  As Live's Nate Buggia recently pointed out, there's a Netcraft survey which suggests that there are ~75 million "active" domains.  So we're certainly reaching a scale which gives us a comprehensive view of the web.  38 billion URLs is not double the previous number of URLs in the index (nearly 30 billion); rather, the increase reflects deeper crawling of URLs and domains we had already indexed.  Of more interest is that we now have about 450 billion links, more than double the approximately 170 billion links in our previous index.


We're also making the top 3000 links available per URL and domain in advanced reports.  These links are also filtered so that no more than 10 links from any domain are shown.  This dramatically increased volume and diversity of links gives you the opportunity to see many more of the top links along many dimensions (mozRank, mozTrust, etc.).  And the anchor text analysis is much more representative of your presence on the web as a whole.
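
As a rough illustration of the filter described above, here is a minimal Python sketch of capping a ranked link list at ten links per domain. It is not Linkscape's actual implementation: the function name `cap_links_per_domain`, the `(url, score)` input shape, and the use of the URL's hostname as the "domain" are all assumptions made for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cap_links_per_domain(links, per_domain=10, limit=3000):
    """Keep the highest-scoring links, at most `per_domain` from any one host.

    `links` is an iterable of (url, score) pairs. The score stands in for
    whatever metric the report is sorted by (hypothetical input format).
    """
    kept = []
    seen = defaultdict(int)
    # Walk links from highest to lowest score so each host keeps its best ones.
    for url, score in sorted(links, key=lambda pair: pair[1], reverse=True):
        # Assumption: the hostname is used as the grouping key for "domain".
        host = urlsplit(url).hostname or ""
        if seen[host] >= per_domain:
            continue
        seen[host] += 1
        kept.append((url, score))
        if len(kept) == limit:
            break
    return kept
```

Because no host can contribute more than ten entries, a report that fills all 3000 slots under this rule has to span at least 3000 / 10 = 300 distinct hosts.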

To illustrate the variations in our link counts, consider these sites and pages.  You can see that, almost across the board, we know about substantially more links for any site and page, and we have used this broader view of the web to update mozRank.  The small general decline in mozRank is an indication that we've spread mozRank across more pages.  In general, we've found that our latest data correlates more closely with Google's toolbar PageRank when penalized sites are excluded.



You should note that, because our index has grown substantially, these additional links and the changes to mozRank do not reflect growth in the number of links on the web, but rather growth in the number of links we've discovered.  It would be unwise to compare link counts from the old index with counts from the new one.  Instead, comparisons should be confined to metrics for sites and pages drawn from this latest update.  This artificial update effect will diminish as we refine our processes and reach the end of the beta period.

How does this benefit SEOs?

  • A bigger and fresher index means:
    • Greater accuracy in link counts and domains
    • Greater representation of what the search engines see and how they might interpret and use the data
    • More accuracy in mozRank & mozTrust, leading to better data comparisons & analysis
    • More fresh data that helps understand what's happened in the recent past
  • Up to 3000 links per URL in the report means:
    • Know about more links that point to you, so you can request anchor text changes, conduct better self analysis, or fix links that are broken
    • Reverse-engineer competitors' strategies more comprehensively to analyze how they're winning
    • Find links that you could possibly acquire from your competitors
    • Get better anchor text distribution data
  • URL normalization means:
    • Link counts aren't biased by sites and pages that create duplicates
    • Our data is more like that of the major search engines, which also do this stripping and canonicalization
  • Limiting to ten links per domain means:
    • You can see a wider variety of links from different domains
    • 3,000 links will show you at least 300 unique linking domains (often many more)

Here's a quick list of some of the things people have used Linkscape reports for:

  • Analyze their link counts, mozRank, and mozTrust against those of pages ranking above them in the SERPs
  • Look at anchor text distribution and numbers to see why a site might be ranking where it is for a given term/phrase
  • Reverse-engineer competitor links to find sources they can get themselves
  • Look at the relative value of particular links based on the juice they pass and the quality of their domains/pages
  • Compare mozRank to PageRank to see if there is a large discrepancy (often indicating a penalty if mR is much greater than PR)
  • Use link counts in conjunction with traffic data from sources like Compete, Hitwise, Quantcast, etc. to see how link numbers, mozRank, mozTrust, etc. map to traffic
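
The mozRank versus PageRank comparison in the list above can be sketched in a few lines of Python. This is only an illustration: the function name `flag_possible_penalties`, the `(url, mozrank, pagerank)` input format, and the `min_gap` threshold of 2.0 are assumptions, and the sketch assumes both metrics are reported on a comparable 0 to 10 scale. A large gap is a hint worth investigating, not proof of a penalty.

```python
def flag_possible_penalties(pages, min_gap=2.0):
    """Return URLs whose mozRank exceeds toolbar PageRank by at least `min_gap`.

    `pages` is an iterable of (url, mozrank, pagerank) tuples. The threshold
    is a hypothetical default; tune it against pages you know are healthy.
    """
    flagged = []
    for url, mozrank, pagerank in pages:
        # Only mR much greater than PR is of interest, so the gap is one-sided.
        if mozrank - pagerank >= min_gap:
            flagged.append(url)
    return flagged
```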

Speaking of link counts, there are a lot of ways to interpret links, especially when you're adding them up.  Here are a few thoughts about how we count links:

  • We do not double count duplicate links from and to the same page.  For instance we don't consider the two links to our homepage in our header and footer as separate links.
  • We do a great deal of URL normalization.  We strip common URL parameters (e.g. SESSID, jsessionid, redirect, etc.) and remove any resulting duplicate links.
  • We do not collapse the source and target of 301s, 302s, meta refreshes, etc. for the purposes of link counts.  Of course, we do pass the properties (e.g. mozRank) of the source to the target.
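
To make the first two counting rules concrete, here is a minimal Python sketch that strips session-style parameters and then counts unique source and target pairs. It is an illustration under stated assumptions rather than the actual Linkscape pipeline: the names `STRIPPED_PARAMS`, `normalize_url`, and `count_links` are hypothetical, and the parameter list contains only the three examples named above. In keeping with the third rule, the sketch makes no attempt to follow or collapse redirects.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Only the examples named in the text; the full list used in practice is not known.
STRIPPED_PARAMS = {"sessid", "jsessionid", "redirect"}


def normalize_url(url):
    """Remove session-style parameters from a URL, leaving everything else alone."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in STRIPPED_PARAMS
    ]
    # Some servers append ";jsessionid=..." to the path instead of the query string.
    path = parts.path.split(";jsessionid=", 1)[0]
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), parts.fragment))


def count_links(edges):
    """Count unique (source, target) pairs after normalizing both ends.

    Using a set means repeated links between the same two pages, such as a
    header link and a footer link to the same homepage, are counted once.
    """
    unique = {(normalize_url(src), normalize_url(dst)) for src, dst in edges}
    return len(unique)
```

Normalizing before deduplicating matters here: two links that differ only in a session ID collapse into one pair, so they do not inflate the count.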

That last point has been a controversial design decision and has led to some confusion.  To get a full view of links to a page you should run reports for several versions (e.g. www and non-www).  However, one advantage to this approach is that it lets you analyze your link profile at a very fine granularity.  For instance, we can see who's linking to "http://www.seomoz.org/web2.0", "http://web2.0awards.org", and "http://web2.0awards.com", each separately from the others.  This helps us understand our marketing efforts and quantifies the contribution of each of these different URLs, which all point to the same content.  Also, if we wanted to remove the 301 and rebrand one of these domains, we would have some idea of where we would be starting from.  We do list 301s as single links in advanced reports for the destination of the redirect.

Unfortunately, this makes some of our link counts look smaller than you might see from some other tools.  Because we're consistent within our tool, you can compare the numbers you see for different pages to get a relative sense of popularity.  But you can't, unfortunately, directly compare our numbers to other tools.  I suppose this is the sort of thing you come to discover in any beta ;)

It turns out that most of the technological challenges with the back-end revolve not around scaling our data collection, but rather around processing and serving data.  So we back-end developers have been very busy rewriting our processing pipeline and completely distributing our API architecture, which is why this update took so long to get out the door.  You probably care about this work because of the substantially improved performance of our PRO toolbar, which we're also publicly announcing today!  I'll let Danny tell you more about the toolbar, but both of these back-end changes should support our API product and help us to provide you with much more frequent index updates.

We'll probably see quite a few other changes to the product both visually and in terms of the data throughout Linkscape's beta period.  Obviously we'd like to continue to improve our coverage of the web while keeping the quality, relevance, and freshness of our data equally impressive.  If you have any feedback, feel free to post comments on our feedback thread.  We always appreciate it, and I hope some of you can see that some of your feedback has made its way into this update.

www.seomoz.org

published @ December 8, 2008
