Using data to fight webspam
As the head of the webspam team at Google, I'm in charge of making sure your search results are as relevant and informative as possible. Webspam, in case you've never heard of it, is the junk that shows up when websites cheat their way into higher positions in the search results or otherwise violate search engine quality guidelines. If you've never seen webspam, here's a good example of the sort of page you might land on after clicking a spam link in the search results.
You can see how unhelpful such a page would be: it has almost no original content, and it's padded out with irrelevant links and information that is of little use to anyone. Imagine how annoyed you would be if you clicked on a link in a Google search result and ended up on a page like this. We work hard to ensure you rarely see results like it.
Searchers don't often see blatant, outright spam like this in their results today, but webspam was much more of an issue before Google became popular and before we were able to build effective anti-spam methods. In some cases webspam is merely an annoyance, such as when a search on your own name returns links to porn pages. But for searches where getting relevant information is critical, spam is a serious problem: a search for prostate cancer that returns spam instead of relevant links greatly diminishes the value of a search engine as a helpful tool.
Data from search logs is one tool we use to fight webspam and return cleaner, more relevant results. Logs data such as IP addresses and cookie information makes it possible to create and use metrics that measure different aspects of our search quality, such as index size and coverage, result "freshness," and spam.
Whenever we create a new metric, it's essential to be able to go back over our logs data and compute that metric on previous queries and results. In effect, we use our search logs to travel "back in time" and see how well Google did on queries from months before. When we create a metric that measures a new type of spam more accurately, we not only start tracking our success against that spam going forward; we also use logs data to see how we were doing on it in previous months and years.
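To make that concrete, here's a minimal sketch of what replaying archived logs against a new spam metric could look like. This is not Google's actual pipeline: the monthly JSONL log files, the record fields, and the is_spam() classifier are all hypothetical stand-ins.

```python
# A minimal sketch of "going back in time" over search logs, assuming a
# hypothetical archive of monthly log files with one JSON record per query.
# Field names and the is_spam() check are illustrative, not Google's format.
import json
from pathlib import Path

def is_spam(result_url: str) -> bool:
    """Stand-in for a new spam classifier; a real one would inspect
    page content, links, redirects, and many other signals."""
    return result_url.endswith(".example-spam.cn")

def monthly_spam_rate(log_file: Path) -> float:
    """Fraction of logged results that the new classifier flags as spam."""
    total = flagged = 0
    with log_file.open() as f:
        for line in f:
            record = json.loads(line)      # one search event per line
            for url in record["results"]:  # URLs shown for this query
                total += 1
                flagged += is_spam(url)
    return flagged / total if total else 0.0

# Replaying archived logs lets us chart how prevalent this type of spam
# was even in months before the new metric existed.
for month_log in sorted(Path("logs").glob("2008-*.jsonl")):
    print(month_log.stem, monthly_spam_rate(month_log))
```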
The IP and cookie information is important for helping us apply this method only to searches from legitimate users, as opposed to queries generated by bots and other automated sources. For example, if a bot sends the same query to Google over and over again, those queries should be discarded before we measure how much spam our users see. All of this (logs data, IP addresses, and cookie information) helps make your search results cleaner and more relevant.
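As an illustration, a simple version of that bot filter might drop query events that repeat suspiciously often from the same client before any spam metric is computed. The (ip, cookie, query) keying and the threshold below are assumptions for the sketch, not Google's actual heuristics.

```python
# A minimal sketch of filtering bot traffic before measuring spam:
# if one client sends the same query over and over, those events
# shouldn't count toward what real users see. The keying and the
# cutoff value are illustrative assumptions.
from collections import Counter

REPEAT_THRESHOLD = 50  # hypothetical cutoff for "same query, same client"

def filter_bot_queries(events):
    """Drop events whose (ip, cookie, query) triple repeats too often."""
    counts = Counter((e["ip"], e["cookie"], e["query"]) for e in events)
    return [e for e in events
            if counts[(e["ip"], e["cookie"], e["query"])] < REPEAT_THRESHOLD]

events = [
    {"ip": "203.0.113.7", "cookie": "abc", "query": "prostate cancer"},
    # a bot repeating one query thousands of times would be removed here
]
clean_events = filter_bot_queries(events)
```

A real system would also look at timing, user agents, and many other signals, but even a crude frequency cutoff like this shows why IP and cookie information matters for measuring what legitimate users actually see.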
If you think webspam is a solved problem, think again. Last year Google faced a rash of webspam on Chinese domains in our index: spammers were buying large numbers of cheap .cn domains and stuffing them with misspellings and porn phrases. Savvy users may remember reading a few blog posts about it, but most regular users never noticed. The reason a typical searcher didn't see the odd results is that Google identified the .cn spam and responded with a fast-tracked engineering project to counteract that type of attack. Without our logs data to help us gauge the speed and scope of the problem, many more Google users might have been affected.
In an ideal world, the vast majority of our users wouldn't even need to know that Google has a webspam team. If we do our job well, you may still see low-quality results from time to time, but you won't have to face sneaky JavaScript redirects, unwanted porn, gibberish-stuffed pages, or other types of webspam. Our logs data helps ensure that Google detects new spam trends and has a chance to counteract them before they lower the quality of your search experience.
googleblog.blogspot.com
published @ September 1, 2008