
Announcing Blogscape and SEOmoz Labs

Introduction

Hi, I'm Chas, a developer at SEOmoz. I've been working here since last summer, and though I've written many lines of code, this is my very first blog post. Now, I'm happy to have the chance to reveal the super-secret project I've been working on: Blogscape!


Blogscape

What are the hottest topics on the web right now? How much does your web presence change day-to-day, or in response to an advertising campaign? How many links did your site receive from the blogosphere this week?

Blogscape is a data source built to answer these questions and more. It’s an ‘Information Feed Aggregator’ that has been monitoring about 10 million feeds since December 2007. These feeds can come from any website that offers syndication, but most of them come from blogs and news sites.

While Linkscape crawls the web at large, Blogscape is focused on the 'fast-moving web'. It stores the full content of syndication feeds (including link data!) and makes it searchable, and the newest data is made available several times every day.

Data from Blogscape will appear in various forms in upcoming SEOmoz products, but we’ve decided to release a version of our internal testing tool for it in...

SEOmoz Labs!

So, this graph shows the number of posts which mention each actor’s name for every day in February. The spike in mentions for all queries on Feb 23rd corresponds with the actual date of the Academy Awards. As I’m sure you know, Sean Penn took the Oscar for this one, and this is clearly reflected in the fact that he received twice as many hits as any other query on that day.

Viewing Posts

Blogscape stores the snippets of text found in each feed, and you can click any data point to view this information. For example, if you clicked on the line for Sean Penn on Feb 23rd, you’d see something like this:
 

This view shows snippets from each post satisfying that query on that day. Posts are ordered by Blogrank, an internally calculated ranking metric. (Any feed can 'vote' for any other feed by linking to the website the feed comes from. Feeds with more votes have higher Blogrank.)

Each post has the following information:
•    The original title of the post (clicking here takes you to the actual post)
•    A snippet of the description of the post
•    The title of the feed the post came from (clicking here takes you to the main page of the feed's source)
•    The feed’s Blogrank
•    The URL for the feed itself
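
To make the voting idea behind Blogrank a little more concrete, here is a minimal Python sketch. It only counts how many distinct feeds link to the website a feed comes from. How Blogscape actually calculates Blogrank isn't described in this post, so treat the function name, the data layout, and the example.com feeds as assumptions made up for illustration.

```python
def count_votes(feed_sites, feed_links):
    """Count one 'vote' per distinct feed that links to a feed's website.

    feed_sites: feed URL -> host of the website the feed comes from
    feed_links: feed URL -> set of website hosts that feed links to
    (Hypothetical layout; not Blogscape's real schema or formula.)
    """
    # Collect, for every host, the set of feeds that link to it.
    voters = {}
    for voter, linked_hosts in feed_links.items():
        for host in linked_hosts:
            voters.setdefault(host, set()).add(voter)

    scores = {}
    for feed, host in feed_sites.items():
        # Assumption: a feed linking to its own website does not count.
        scores[feed] = len(voters.get(host, set()) - {feed})
    return scores


feed_sites = {
    "http://a.example.com/feed": "a.example.com",
    "http://b.example.com/feed": "b.example.com",
    "http://c.example.com/feed": "c.example.com",
}
feed_links = {
    "http://a.example.com/feed": {"b.example.com"},
    "http://b.example.com/feed": {"a.example.com", "b.example.com"},
    "http://c.example.com/feed": {"b.example.com"},
}
scores = count_votes(feed_sites, feed_links)
# Feeds ordered the way the Posts view orders them: most votes first.
ranked = sorted(scores, key=scores.get, reverse=True)
```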

Advanced Queries

Beyond single terms or phrases in quotes, there are advanced query operators available. For example, you could search for posts containing the word ‘oscar’ or ‘oscars’ with the query:
oscar | oscars (open this query)
There are also query operators for finding posts which link to specific URLs, root domains, or subdomains. For example, you could search for posts which link to any URL at the root domain ‘oscar.com’ with the query:
rd:oscar.com (open this query)
A list of all available query operators can be found at the Blogscape help page.
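
If it helps to see what those two operators mean, here is a small Python sketch that checks a single post against a query. It handles only the `|` and `rd:` operators shown above, it is not Blogscape's real query parser, and the function names and the simple "host ends with the root domain" rule are assumptions for illustration.

```python
import re
from urllib.parse import urlsplit


def links_to_root_domain(links, root_domain):
    """True if any link points at the root domain or one of its subdomains."""
    root = root_domain.lower()
    for url in links:
        host = (urlsplit(url).hostname or "").lower()
        if host == root or host.endswith("." + root):
            return True
    return False


def matches(query, text, links):
    """Check one post (its text and outgoing links) against a query.

    Alternatives separated by '|' are OR-ed together. An alternative that
    starts with 'rd:' matches on links; anything else matches a whole word.
    """
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    for alternative in query.split("|"):
        term = alternative.strip().lower()
        if not term:
            continue
        if term.startswith("rd:"):
            if links_to_root_domain(links, term[3:]):
                return True
        elif term in words:
            return True
    return False
```

With this sketch, `matches("oscar | oscars", "The Oscars were great", [])` is true because the post contains the word 'oscars', and `matches("rd:oscar.com", "", ["http://www.oscar.com/winners"])` is true because the link's host sits under oscar.com.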

Finally, each graph has the option of being weighted by Blogrank (see the checkbox on the right of the labs page). This makes the graph more of a measure of the ‘popularity’ of a query for any given day, instead of the raw number of matches for it. (Feeds with high Blogrank have many incoming links from other feeds, and tend to come from sources which are viewed by lots of people.)
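
Here is a rough Python sketch of the difference between the two graph modes. It assumes that "weighted by Blogrank" means adding up the Blogrank of each matching post's feed instead of counting posts. The exact weighting Blogscape uses isn't spelled out here, and the dates and rank values below are invented sample data.

```python
from collections import defaultdict
from datetime import date


def daily_series(posts, weighted=False):
    """Build one graph line from (day, feed_blogrank) pairs of matching posts.

    weighted=False counts posts per day (raw matches).
    weighted=True sums Blogrank per day (assumed meaning of the checkbox).
    """
    totals = defaultdict(float)
    for day, blogrank in posts:
        totals[day] += blogrank if weighted else 1
    return dict(sorted(totals.items()))


# Invented sample: one matching post on the 22nd, two on the 23rd.
posts = [
    (date(2009, 2, 22), 3.0),
    (date(2009, 2, 23), 5.0),
    (date(2009, 2, 23), 1.0),
]
raw = daily_series(posts)
popular = daily_series(posts, weighted=True)
```

In the raw series the 23rd has twice as many matches as the 22nd. In the weighted series the gap depends on which feeds did the posting, which is what makes it read more like popularity.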

Data Duplication

You may notice a message at the bottom of the ‘Posts’ view stating that “Posts very similar to these have been filtered from this list.” We’ve worked hard to battle data duplication in Blogscape by carefully canonicalizing feeds (many sites have several URLs for the same data) and posts within a feed. Nonetheless, there are situations where duplicate data is almost impossible to eliminate in advance (for example, some large sites have many feeds with content that occasionally overlaps).

To battle this problem, Blogscape does additional filtering of posts at query time. This filtering ensures that you see only the most relevant version of a post that occurs in Blogscape’s data stores multiple times. For this reason, some queries will have higher post counts on the frequency graphs than when viewing the Posts themselves. If you really want to view every post Blogscape has, you can click on the link at the bottom of the page to turn this feature off.
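
The two stages described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the canonicalization rules (ignore 'www.', the port, trailing slashes, and fragments) are guesses at typical cases, 'very similar' is reduced to 'identical after stripping case and punctuation', and 'most relevant' is assumed to mean highest Blogrank. None of the names below come from Blogscape itself.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def canonical_feed_url(url):
    """Collapse common URL variants of one feed into a single form.

    Assumed rules: lower-case the host, drop a leading 'www.', drop the
    port, drop trailing slashes, and drop the fragment.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def fingerprint(text):
    """Lower-case the text and keep only letters and digits, space-separated."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def filter_duplicates(posts):
    """Query-time filter: keep one post per fingerprint.

    posts: list of dicts with 'title', 'snippet' and 'blogrank' keys.
    Returns (kept posts ordered by Blogrank, number of posts hidden).
    """
    best = {}
    for post in posts:
        key = fingerprint(post["title"] + " " + post["snippet"])
        if key not in best or post["blogrank"] > best[key]["blogrank"]:
            best[key] = post
    kept = sorted(best.values(), key=lambda p: p["blogrank"], reverse=True)
    return kept, len(posts) - len(kept)
```

The hidden count returned by `filter_duplicates` is the kind of number that would explain why a frequency graph can show more posts than the Posts view does.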

Data Quality

As I mentioned before, Blogscape has been monitoring a sizable portion of the blogosphere for over a year. Nonetheless, we are striving to improve the quality of data within Blogscape, and we’re very excited about two major upcoming improvements:
1.    Monitoring of more high-quality feeds

We’ve added Feed Auto-Discovery logic to our processing of Linkscape crawl data, and will be using the results to make sure Blogscape always monitors the most important blogs from across the web.

2.    Crawling of source pages

Based on our research, about half of syndication feeds don’t publish the entire content of their posts in their feeds – instead, they publish a truncated section of their content (or occasionally a hand-written summary of it). Most sites that do this also strip HTML from their feeds.
It’s important for data quality to ensure that queries for a term return all posts mentioning that term, and it’s important for SEO that all link information is present. For these reasons, we’re adding functionality to Blogscape that will follow links from syndication feeds and store the actual source content for future searches. (Of course, the upcoming crawler will politely ignore sites which block it using robots.txt. Details on this will be released when the crawler goes live.)
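
For readers curious what these two improvements involve, here is a short Python sketch using only the standard library. The first function looks for the usual feed auto-discovery markup (a `<link rel="alternate">` tag with an RSS or Atom type) in a crawled page, and the second drops URLs that a site's robots.txt disallows. This is not the Linkscape or Blogscape code, and the user agent string `example-crawler` is a placeholder, since the real crawler's name hasn't been announced.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect feed URLs advertised via <link rel="alternate" ...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and attr.get("href"):
            # Resolve relative hrefs against the page's own URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html, base_url):
    """Return the feed URLs a crawled HTML page advertises."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds


def allowed_urls(robots_txt, user_agent, urls):
    """Keep only the URLs that the given robots.txt lets user_agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

A crawler built along these lines would fetch each site's robots.txt once, pass the post links from a feed through `allowed_urls`, and only then request the source pages.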

Movers and Shakers

Finally, an interesting use of the mountains of data stored by Blogscape is the search for hot trends, or ‘Movers and Shakers’. You can see the results of this process for several categories by clicking on the ‘Movers and Shakers’ links in the upper right-hand corner of the Blogscape Labs page.

They tend to be most interesting (and stable) in weekly increments – you can view the top ‘mover and shaker’ phrases for this week here. On the day of writing this post, the top phrase is “Safari 4 beta,” which rose 26,632.4% this week (percent change over rank-weighted graphs). Right behind it is “Gary Locke,” which rose 25,101.7% over last week. On the labs page for this feature, you can click through and view the graph for each individual ‘mover and shaker’.
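
The percent change itself is simple arithmetic. Here is a small Python sketch of it, assuming each phrase has a rank-weighted total for last week and for this week. The function names and the sample totals are invented, and phrases with no mentions last week are skipped because their percent change is undefined.

```python
def percent_change(previous, current):
    """Percent change from last week's total to this week's total."""
    if previous == 0:
        return None  # undefined: there is nothing to compare against
    return (current - previous) / previous * 100


def top_movers(weekly_totals):
    """Rank phrases by percent change, biggest riser first.

    weekly_totals: phrase -> (last week's total, this week's total)
    """
    changes = []
    for phrase, (previous, current) in weekly_totals.items():
        change = percent_change(previous, current)
        if change is not None:
            changes.append((phrase, change))
    return sorted(changes, key=lambda item: item[1], reverse=True)


# Invented sample totals for three phrases.
movers = top_movers({
    "phrase a": (10, 30),
    "phrase b": (20, 30),
    "phrase c": (0, 5),
})
```

Read the other way around, a rise of 26,632.4% means this week's weighted total is roughly 267 times last week's.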

Conclusion

We’re excited to launch this feature, and even more excited about the data quality improvements we’ll be making on it in the next few months. If you’re PRO, check it out, and send your comments our way!

www.seomoz.org

published @ March 3, 2009
