Linkscape Index Update - With Focus on Quality

Hey gang, I know a lot of you are making awesome use of Linkscape.  We've seen impressive use of the Link Intelligence Report, and since the launch of the Free mozRank API a couple of weeks ago we've served millions of mozRank values to widgets, linkbait, and internal reporting pipelines. So I'm pleased to announce the third Linkscape index update since our launch five months ago.
Being in beta, we're still making a lot of improvements.  This time around we made some serious changes to our crawling methodology to include more quality content and to avoid junk.  After talking with search engineers, thought leaders, and our users, and after taking a look at the data, we think we've made plenty of progress on the raw size of our index (between 30 and 40 billion pages).  Now we want to focus on index quality.  With that said, here are the latest index stats:

  • URLs: 36,651,796,236
  • Subdomains: 214,625,541
  • Root Domains: 50,734,663
  • Links: 409,127,041,842

These numbers reflect the work we've done over the last several indexes to weed out junk, include deeper pages, and get an accurate, actionable link profile, which reflects the structure of the web as a whole and search engine relevance.  Check out the Top 500 Sites and Pages for another look at our index.

The current goals for the Linkscape team are:

  • Freshness: data that is no more than three to eight weeks old
  • Coverage: include key deep pages and influential posts that might not be referenced from other sites
  • Visibility: uncover actionable data and trends across all our data, rather than sorting or filtering just 3,000 links
  • Measurability: provide data which is comparable index-to-index, and track those trends
  • Quality: provide data which reflects the structure of the relevant web

Last month we did an index quality assessment.  Overall we saw a strong statistical correlation between our link counts and those from Yahoo! Site Explorer (Y!SE).  That is to say, if we say one site has more links than another, then with high probability, Y!SE will agree.  And we believe other search engines will follow suit in their rankings.
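To make the "if we say one site has more links than another, Y!SE will agree" idea concrete, here's a minimal sketch of one way to measure it: the fraction of site pairs that both sources order the same way.  This is an illustration only, not necessarily the statistic we used in the assessment; the function name and the link counts are made up.

```python
from itertools import combinations


def pairwise_agreement(ours, theirs):
    """Fraction of site pairs that both sources rank in the same order.

    `ours` and `theirs` map a site to its link count. Pairs tied in
    either source are skipped. Hypothetical helper, for illustration.
    """
    agree = 0
    total = 0
    shared_sites = sorted(ours.keys() & theirs.keys())
    for a, b in combinations(shared_sites, 2):
        diff_ours = ours[a] - ours[b]
        diff_theirs = theirs[a] - theirs[b]
        if diff_ours == 0 or diff_theirs == 0:
            continue  # a tie carries no ordering information
        total += 1
        if (diff_ours > 0) == (diff_theirs > 0):
            agree += 1
    return agree / total if total else float("nan")


# Made-up link counts, not real Linkscape or Y!SE data.
linkscape_counts = {"a.example": 900, "b.example": 400, "c.example": 120, "d.example": 50}
yse_counts = {"a.example": 1000, "b.example": 450, "c.example": 100, "d.example": 130}

agreement = pairwise_agreement(linkscape_counts, yse_counts)
print(f"{agreement:.2f}")
```

In this toy data only one pair (c.example vs d.example) is ordered differently by the two sources, so five of the six pairs agree.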

When we ask how many external links point to a homepage, we report, on average, 40% of what Y!SE does.  Of course, we discount nofollows, while Y!SE (in our experience) includes them.  Given our latest crawl, we see that 3% of links are nofollowed (up from 1.8% when we originally launched).  This probably reflects our crawl choices rather than a large trend.  It's entirely possible that homepage links are nofollowed more often than links to deep pages (think about those comments you leave on blogs with a link back home).  We also keep link counts to sources of 301s separate from their targets for reporting reasons.  Don't worry, we still pass link juice through 301s.

When we ask how many external links point to a whole site, we report, on average, 90% of what Y!SE does.  This number is quite dramatic IMHO.  Here we're probably seeing less of the nofollow bias.  This might also reflect a crawl bias and some canonicalization differences.  But again, a strong statistical correlation suggests that we are giving an accurate site-wide link profile.

This graph is quite striking to me.  We pulled all the links we had for a variety of pages and went back one month later to see how many of those pages were still linking.  We want to assure you that the links we're reporting reflect the current state of the web, rather than the stale web from months or even years ago.  As it turns out, we have a 91% success rate.  When we ask for Y!SE's 1,000 links we see a 97% success rate, making us quite competitive in this regard.
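The "success rate" here is just the share of reported links that are still in place on the re-check.  A minimal sketch, assuming each link is a (source URL, target URL) pair and that the re-crawl has already produced the set of links still live (the function name and sample URLs are hypothetical):

```python
def retention_rate(reported_links, still_live_links):
    """Share of previously reported links that were still found on re-check."""
    reported = set(reported_links)
    if not reported:
        raise ValueError("need at least one reported link")
    return len(reported & set(still_live_links)) / len(reported)


# Made-up (source, target) pairs; the re-crawl itself is not shown.
reported = [
    ("http://blog.example/post-1", "http://target.example/"),
    ("http://blog.example/post-2", "http://target.example/"),
    ("http://news.example/story", "http://target.example/about"),
    ("http://forum.example/thread", "http://target.example/"),
]
still_live = reported[:3]  # pretend the forum link disappeared

rate = retention_rate(reported, still_live)
print(f"{rate:.0%}")
```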

A few other stats measuring our index quality:

  • Pages mentioned in DMOZ also in Linkscape: 96%
  • Domains mentioned in DMOZ also in Linkscape: 99%
  • Average error of mozRank against Google Toolbar PR: 0.56 (best possible is 0.25 due to round-off)
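
One way to see where that 0.25 floor could come from, under assumptions not spelled out above: suppose the toolbar shows PageRank rounded to the nearest integer, the fractional part of the underlying score is uniformly distributed, and "average error" means mean absolute error.  Then even a perfect estimate differs from the displayed value by a round-off e that is uniform on [-1/2, 1/2], and its expected size is:

```latex
\mathbb{E}\,|e| \;=\; \int_{-1/2}^{1/2} |e| \, de \;=\; 2 \int_{0}^{1/2} e \, de \;=\; 2 \cdot \frac{1}{8} \;=\; \frac{1}{4}
```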

This index quality assessment is an ongoing process, and we're happy to share our methodology or collaborate in this effort.  Feel free to PM me or drop a note in the comments.  For an independent look at the last index's mozRank comparison to PageRank (now out of date) check out The Google Cache's mozRank study, powered by the free mozRank API.

Oh yeah, I just discovered this awesome social media tool!  It's called... Twitter! (tongue-in-cheek)

But seriously, that's another great way to provide feedback, get questions answered, and in general keep up-to-date with what's going on behind the scenes here at the mozPlex.

published @ March 3, 2009
