
Clickable transcript of my Canonical Link Element talk


Recently I’ve been playing with linking to specific parts of a video and incorporating YouTube subtitles. Then I realized that you could do a neat trick. YouTube allows you to create closed captioning with a simple text file that looks like this:

00:00:07.000
Hi everybody. Welcome back to another video. We’re doing this thing where when we speak at a conference

00:00:12.180
and we talk about something substantial, not just questions and answers, we talk through our presentation later

and it only takes a little bit of Unix command-line magic to turn that into a file like this:

Hi everybody. Welcome back to another video. We’re doing this thing where when we speak at a conference
and we talk about something substantial, not just questions and answers, we talk through our presentation later

If you run that over your entire caption file, boom, you have a clickable transcript of your video. For the text below, click on any phrase you’re interested in and you’ll be whisked away to YouTube in approximately the right place to hear me say that phrase.
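The post does not show the command-line recipe itself, so here is a minimal Python sketch of the same idea rather than the author's actual commands. It assumes the caption format shown above (a timestamp line followed by one or more text lines). The function name `captions_to_links`, the `VIDEO_ID` placeholder, and the `#t=<seconds>s` link fragment are all assumptions made for illustration; adjust the link format to whatever your video host accepts.

```python
import html
import re

# Matches a caption timestamp such as 00:00:12.180 on a line by itself.
TIMESTAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d{3})$")


def captions_to_links(caption_text, video_url):
    """Turn 'timestamp line, then text line(s)' captions into HTML links."""
    links = []
    seconds = None
    phrase = []

    def flush():
        # Emit one link for the phrase collected since the last timestamp.
        if seconds is not None and phrase:
            text = html.escape(" ".join(phrase))
            links.append(f'<a href="{video_url}#t={seconds}s">{text}</a>')

    for line in caption_text.splitlines():
        line = line.strip()
        match = TIMESTAMP.match(line)
        if match:
            flush()
            hours, minutes, secs, _millis = (int(g) for g in match.groups())
            seconds = hours * 3600 + minutes * 60 + secs
            phrase = []
        elif line:
            phrase.append(line)
    flush()
    return "\n".join(links)


sample = "00:00:07.000\nHi everybody.\n\n00:00:12.180\nand we talk\n"
# VIDEO_ID is a placeholder, not the real video.
print(captions_to_links(sample, "https://www.youtube.com/watch?v=VIDEO_ID"))
```

Milliseconds are dropped on purpose: the links only need to land in approximately the right place.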

Hi everybody. Welcome back to another video. We’re doing this thing where when we speak at a conference and we talk about something substantial, not just questions and answers, we talk through our presentation later and put it up so people can follow along, watch the slides, and hopefully learn a little bit. So today I wanted to talk about the canonical link element. And that’s something that Google, Yahoo!, and Microsoft all announced that they will support in the future at SMX West. So, the date that we had this announcement was February 12, 2009, and the funny thing about it is that Charles Darwin was born exactly 200 years ago that day.

So I started out with a slide where I made a corny joke and I said, whether you think the web was intelligently designed by Tim Berners-Lee, or whether you think the web needs to evolve, either way this is an open standard which helps people improve the web. And so we sort of said, what is a big problem that faces people today, webmasters, SEOs, site owners on the web? And it’s pretty clear that duplicate content is one of the things that people care about the most. So what is duplicate content? Well, I’ve got a slide here where I show I think eight different URLs, you know every single one of these URLs could return completely different content. In practice, we as humans whenever we look at www.example.com or just regular example.com or /index or home.asp, we think of it as the same page. And in practice, it usually is the same page. So technically it doesn’t have to be, but almost always web servers will return the same content for like these eight different versions of the URL. So, that can cause a lot of problems in search engines if rather than having your backlinks all go to one page, instead it’s split between a www and a non-www version. And it’s a really big headache. How do people solve this?

How do people fix this? Well, it turns out, and I’ll dwell on this slide for just a few minutes, there are a lot of ways to fix it. So, some people have joked that this canonical link element is kind of like, you know, Spackle that fixes over the appearance of all the cracks in the wall. And the fact is there are a lot of ways that you can fix things first and foremost, from the beginning, upstream where you don’t need to fix it downstream later on. There was a really funny quote by Jill Whalen at the conference where she said, “Developers keep SEOs in business.” Right? And so whether you’re a developer or an SEO there are some best practices that can make things a little bit easier for your system so that you don’t have to worry about this issue of duplicate content at all. So, one is to try to make sure that your URLs are standardized, Microsoft sometimes calls them normalized, in essence there’s only one way to get to the content. If your content management system always generates consistent URLs, and they’re completely uniform, and you don’t have to worry about having eight different versions in the first place, that just saves you a lot of trouble. You don’t have to worry about the issue coming up at all. So one way to do that is to fix your content management system or your software so that you only generate these URLs in a very consistent way. Another thing to do is to think about your site. Suppose you have www.example.com and non-www, just plain old example.com. Well if you link to www sometimes and non-www sometimes, it’s natural that search engines might get a little bit confused. So linking consistently, saying okay, my homepage is going to be www.example.com/. Nothing else, that’s it. And then making sure that all of your internal linking is consistent, that alone can make a really big difference, so that you don’t end up with two, three, four copies of each page.

If you do have, you know, home.asp or index.html, you can rewrite such that all those other URLs are 301 redirects to a single URL. So, it’s great if you can fix it at the beginning, it’s great if you can link consistently so the issue never comes up, but if duplicate URLs do occur, then you can use a 301, a permanent redirect as we refer to it, to sort of standardize and glom together all of those URLs. And search engines will follow that 301 redirect, and typically group them all together. Google also does a couple of extra things that some search engines don’t do. So, in our Webmaster Tools, our webmaster console, which is totally free, doesn’t cost anything at all, you can specify, for example my site is mattcutts.com, you can specify if you prefer www.mattcutts.com or non-www, so just mattcutts.com. That’s a very easy setting, and that solves a lot of duplicate content issues right there. And a little-known fact, not everybody realizes this, is that whenever you submit your URLs in what we call a Sitemap, which is another standard that’s supported by many major search engines, and it’s a very simple file, it can be as simple as a list of URLs, we take that list of URLs that you submit, and we say to ourselves, oh, if we see a URL in that list, and then we see another version of it that’s not in the list, we will prefer URLs in the list that you gave us. So we sort of use it to break ties whenever you submit URLs from a Sitemap. So there’s at least a couple ways that you can give Google hints that try to help out with duplicate content.
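To make the 301 idea concrete, here is a small Python sketch of the decision a site could make for each incoming URL: either it is already the preferred form, or the visitor should be permanently redirected to that form. This is an illustration only, not how any particular web server implements it. The names `PREFERRED_HOST`, `HOST_ALIASES`, `INDEX_PAGES`, and `redirect_target` are hypothetical, and the choice of the www hostname as the preferred one is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit

PREFERRED_HOST = "www.example.com"         # assumption: the www form is preferred
HOST_ALIASES = {"example.com", "www.example.com"}
INDEX_PAGES = {"index.html", "home.asp"}   # default documents that duplicate "/"


def redirect_target(url):
    """Return the URL to 301-redirect to, or None if url is already preferred."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path or "/"

    if host in HOST_ALIASES:
        host = PREFERRED_HOST

    # Collapse /index.html, /home.asp and similar onto the directory URL.
    directory, _, filename = path.rpartition("/")
    if filename.lower() in INDEX_PAGES:
        path = directory + "/"

    preferred = urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))
    return None if preferred == url else preferred


for candidate in ("http://example.com/index.html", "http://www.example.com/"):
    print(candidate, "->", redirect_target(candidate))
```

A server would answer with status 301 and a `Location` header set to the returned URL whenever the result is not `None`.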

But, that said, there will probably always be duplicate content issues that you can’t fix. So, just to run through a few example ones. Sometimes, you can’t generate a permanent or 301 redirect. For example, at my old school account, cs.unc.edu, I don’t run the web server there. So I’d have to open a ticket or drop an email to the people that administer that system and say hey, can you add a 301 redirect from this page to that page. A lot of free hosts, you might not be able to generate a 301 redirect. And you can’t help how people link to you. So for example, you know, even if you link consistently to just the www version of your website, some other people might link to the non-www version. And you can’t really control that at all. Uppercase versus lowercase paths. Microsoft IIS will support showing pages whether you link to home.asp capitalized or lowercase, and sometimes even mixed case. And so if people link to different versions that are uppercase and lowercase mixed, that can cause some issues. Session IDs are another really big factor. So I have seen, at least in some search engines, a site with a one-page privacy policy. And that privacy policy was indexed three thousand times, each time with a different session ID, because the privacy policy was slightly different each time.

So, you know, session IDs in general if you can avoid them are great. But sometimes you as the search engine optimizer or the person who is responsible for the site can’t get rid of them entirely. Tracking codes, you know, if you’re buying ads. Analytics, you know the UTM parameter, landing pages where they have to be different landing pages for different ads, those are the sort of things that you sometimes can’t get rid of. And if you run an e-commerce site, suppose you have different products. You might have sort by descending price or sort by ascending price, and sometimes you need to have different facets, different views of your data, and conceptually it’s really the same thing, it’s just a different way to slice and dice it.
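One way to picture why these URLs are "conceptually the same thing" is to strip the parameters that do not change what the page is about and see what remains. The Python sketch below does that. The parameter names (`sessionid`, `sort`, and anything starting with `utm_`) are assumptions chosen for illustration, and `strip_noise` is a hypothetical helper; use whatever parameters your own site actually emits.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: illustrative parameter names, not a definitive list.
IGNORED_PARAMS = {"sessionid", "sort"}
IGNORED_PREFIXES = ("utm_",)


def strip_noise(url):
    """Drop session, tracking, and sort parameters; keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
        and not key.lower().startswith(IGNORED_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


noisy = "http://www.example.com/tents?color=red&sort=price_desc&utm_source=ad&sessionid=abc123"
print(strip_noise(noisy))
```

Every variant that maps to the same stripped URL is a candidate for sharing one preferred URL.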

Finally, there’s breadcrumbs. So breadcrumbs are how did I get to this page? Am I coming to this red tent example via tents, or am I coming to it via colors, or did I come to it because I was interested in accessories? How did I land on this page? Even Google’s own webmaster help documentation sometimes has a CTX parameter that says here’s how we got to this page. And that day, it was kind of funny, the Queen had just launched a new website: royal.gov.uk. And so I wish the Queen the best, I want her to live long, and I wish the British monarchy the best, however, someone at the Telegraph, telegraph.co.uk, had done an SEO audit of this site, and they had found duplicate content issues. So you can see right here, just slash, royal.gov.uk/Home.aspx, and then at the very bottom I almost made a ransom note style where I mixed uppercase and lowercase. And the royal website returned the same page for all three of those URLs. So that was just a very simple example to illustrate that anybody can have these sorts of issues.

So what’s the answer? Let’s, you know, I’ve buried the lead enough, how do people solve this particular problem? Well, assuming you can’t solve it any other way, and absolutely I encourage you to try to fix it upstream, to try to link consistently. This is not something that you should just say, oh, now all my problems are solved, I don’t have to worry about anything else. But, if you can’t solve your problems in other ways, there’s a very simple element, link element, where you can say my canonical, and that’s a long word that means you know, my preferred, or the primary, or the clean, the pretty version of the URL that I want to use, is not this ugly URL with a tracking code or a session ID, it’s this pretty URL right over here. And all you have to do is in the head element of this document say you know what, even though this has a weird session ID, the pretty version, the canonical version of this URL, is over here. And that’s literally all it is. It’s a very simple open standard. It’s one simple element that you add to the head of your document. Some interesting little tidbits. This is the director’s cut so you get a little bit of extra info. Is this a tag?

Well, it’s kind of, the technical name I believe is “element.” But we’re all friends here, nobody’s going to abuse you or you know make fun of you if you call it a canonical link tag versus a canonical link element. People often speak about meta tags, right? And so meta tags are things that go in the head of the document as well. And so, if a meta tag has a value that is a hyperlink, I think the most correct thing is not for it to be meta, but for it to be called “link.” And so that’s why you see link rel="canonical" href= and the value. So now you know the official name, but nobody’s going to care if you just call it the canonical link tag. One thing that’s kind of interesting about this tag, let’s just talk about a few high-order bits.
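For reference, the element described above can be produced by a one-line template. The Python sketch below is a minimal illustration, assuming a hypothetical helper named `canonical_link` and an example.com URL; the only part taken from the talk is the `rel="canonical"` and `href` attributes on a `link` element placed in the document head.

```python
import html


def canonical_link(canonical_url):
    """Build the element to place inside <head> of each duplicate page."""
    # Escape the URL so characters like & and " stay valid inside the attribute.
    return f'<link rel="canonical" href="{html.escape(canonical_url, quote=True)}">'


print(canonical_link("http://www.example.com/tents"))
```

The same string goes into the head of every variant of the page (session ID, tracking code, sort order), each one pointing at the single preferred URL.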

We don’t promise we’re going to abide by this 100%. Right? You know, if we see a webmaster and they’ve accidentally shot themselves in the foot, you know maybe they’ve created an infinite loop, and it’s very easy to create an infinite loop, we reserve the right to do what we think is best. At least at Google, we are going to treat this as a very strong hint. So unless we see some weird corner case or something where you’re probably hurting your own site, we probably would expect to respect this tag. So I think that in most cases, it will work quite well. But we do have to reserve the final, sort of bottom-line ability to say no, we don’t think this is what’s best for the users.

Again, if you can fix it yourself upstream, that’s much better. So look at all the other alternatives, the other choices before you use this tag. Don’t just say, oh, I can just slap everything with a canonical link tag and boom, I’m done. If you’re a regular user, just like a mom-and-pop and you use WordPress or you use some shopping cart software, it’s probably best not to just roll up your sleeves and go digging into it and trying to fix it all yourself, at least not quite yet. Wait a little while, because I think plugins will come out, people are talking about hey, is WordPress able to add this to the core software, so maybe you don’t even need a plugin? So if you’re just a regular user and you wait a few months, things should be fine. You know it’s a brand-new element, so there’s time for you to sit down and cautiously deliberate and say okay, what kinds of duplicate content do I have, how can I fix it? Take a little bit of time. Don’t just jump right in and start, oh I’m going to point everywhere, I’m going to do everything. There’s enough time where this will be supported so you can plan ahead a little bit. And as always, if we see people abusing it, we do reserve the right to change how we treat the tag, or to not respect the tag. There is a nice way that we try to prevent abuse. We allow things within the same domain, but we don’t allow things to cross domains. So with 301s, there’s always been this notion of can I hijack a site by doing weird 301s, and can I steal the reputation of some other site? And at least right now, this element is not really subject to that because you can only use it within the same domain. Now a natural question right after that, is well, what about subdomains? Can I, you know, do things across different hostnames?

And the answer is yes, you can. So, I was talking to Tony Hsieh from Zappos, and they were talking about duplicate content. And they have a server called zeta.zappos.com, which is sort of their staging software and might be the next version. And they were saying, well, can I send my canonicalness, can I splat it from zeta.zappos.com to www.zappos.com? And the answer is yes, you absolutely can. Can you use it from https and send that to http? Totally, works great for that. It’s on the same domain, so it’s no problem at all, at least within Google to use it for that purpose. And then what’s the difference between this and a 301 or a permanent redirect? There’s really not that much, other than this is restricted to one domain. So 301s can cross domains; this is all within the same domain.
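The same-domain rule as described in the talk (subdomains and https-to-http are fine, other domains are not) can be sketched as a simple check. This is only an illustration of the rule as stated here, not Google's implementation. The helpers `site_domain` and `same_domain` are hypothetical, and the "last two labels of the hostname" shortcut is a loud assumption: it works for names like zappos.com but would be wrong for suffixes such as .co.uk, which need a public-suffix list.

```python
from urllib.parse import urlsplit


def site_domain(url):
    """Naive domain: the last two labels of the hostname.

    Assumption: adequate for names like zappos.com only; suffixes such as
    .co.uk need a public-suffix list, which this sketch leaves out.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.lower().split(".")[-2:])


def same_domain(page_url, canonical_url):
    """True when both URLs share a domain, ignoring scheme and subdomain."""
    return site_domain(page_url) == site_domain(canonical_url)


print(same_domain("http://zeta.zappos.com/shoes", "http://www.zappos.com/shoes"))
print(same_domain("https://www.zappos.com/cart", "http://www.zappos.com/cart"))
print(same_domain("http://www.example.com/", "http://www.zappos.com/"))
```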

In fact, whenever I think about it, the mental model that I have is that this is essentially like a little mini 301 redirect that you can generate with this link element. So, you know, if you think about how Google handles 301s, that’s probably a pretty good guess of how we’ll handle this particular element. So, a few more questions, since you’ve got the time, you’re watching the video. Do the pages have to be identical? Bit for bit identical? No, they do not. Think again about this case where you have a catalog page and you can sort by increasing price or decreasing price, those are conceptually pretty close to the same page. So if you want to say map this to the same URL, and don’t worry about the sort by parameter, you’re more than welcome to do that. They should be similar. You know, if we see, this is the only thing I can think of where there could be abuse, is if you’ve got a cartoon page over here, and you’ve got something that’s completely irrelevant to cartoons over here and you try to combine them together. And you’re not really gaining any advantage because you had PageRank on this page and on that page. So it really doesn’t make sense to combine them, but we do recommend that you use them for similar pages. They don’t have to be identical, but they should be similar.

A few sort of niggly bits. How about relative URLs versus absolute URLs? The answer to that is you can use either one. We recommend absolute URLs. And there’s a very simple reason. When you have relative URLs, you can move a URL and everything stays the same relative to that URL. So essentially, you know the homepage can say /images or images. And that will move it relative to that particular page. But it’s better to have an absolute URL because this is a powerful tool, and you really want to say this URL goes to exactly this URL. So you want to specify that. Whereas if it’s relative, if you mess it up here, then you might mess it up somewhere else as well. Can you follow a chain of canonical tags, or canonical elements, just like you can follow a chain of 301 redirects? Yes, but again I don’t recommend that, because if you have a big site and you have a big chain of 301 redirects, it’s easy for something to break. So, it’s similar, something can break and you don’t intend to have the consequences that you wanted to, so what I would recommend is absolute URLs, and going from the old URL to the new URL, one hop and that’s all you do. It’s just simpler that way, and you know you want to play it safe. You don’t want to accidentally shoot yourself in the foot. So what are some ways you can shoot yourself in the foot? Well, what if you say my canonical is over here, and that’s a 404 page? Right, the page might not exist. What if you had an infinite loop? This is canonical. No, this is canonical. And we’ve all seen those happen, you know, what is the Civil War? Look up the War Between the States. What is the War Between the States? Look up the Civil War. You know, and now you have to put the dictionary down and your head hurts. So try to avoid infinite loops.
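The relative-URL, chain, and loop cases above can be checked on your own site before a search engine ever sees them. The Python sketch below is one way to do it, under stated assumptions: `resolve_canonical` is a hypothetical helper, `canonical_of` is a plain dictionary you would build from your own pages, and the five-hop limit is an arbitrary choice, not a documented search-engine limit. It does not check whether the target page exists, so the 404 case is out of scope.

```python
from urllib.parse import urljoin


def resolve_canonical(start_url, canonical_of, max_hops=5):
    """Follow canonical hints from start_url; return (final_url, problem).

    canonical_of maps a page URL to the href of its canonical link element
    (possibly relative); pages without the element are absent from the map.
    """
    seen = {start_url}
    current = start_url
    for _ in range(max_hops):
        href = canonical_of.get(current)
        if href is None:
            return current, None
        target = urljoin(current, href)   # relative hrefs resolve against the page
        if target == current:
            return current, None          # a page naming itself is harmless
        if target in seen:
            return start_url, "loop"
        seen.add(target)
        current = target
    return start_url, "chain too long"


pages = {
    "http://www.example.com/a?sessionid=1": "/a",
    "http://www.example.com/x": "/y",
    "http://www.example.com/y": "/x",
}
print(resolve_canonical("http://www.example.com/a?sessionid=1", pages))
print(resolve_canonical("http://www.example.com/x", pages))
```

Anything that comes back with a problem, or that takes more than one hop to resolve, is worth fixing so that each page points straight at its final URL.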

What if I point to a URL that hasn’t been crawled? You know, we’ll try to crawl that URL, but that corner case, what if I told in the webmaster console, oh yeah, everything should be www.example.com, but then you specify your canonicals as non-www, or without the www. So you can do all these sorts of things to almost shoot yourself in the foot, and the answer is we will try to handle all of these corner cases in a reasonable way. The slide has some Ghostbusters because there’s the old saying, “Don’t cross the streams,” right? So think about this, take some time, don’t just throw canonical tags on willy-nilly on your site, you know, try to plan it out a little bit so that you don’t run into these corner cases. So we’re getting towards the end of the presentation. I just really wanted to send a shout out to Joachim, who is the Google engineer who really did all the implementation, all the heavy lifting on this. Made sure that it worked very nicely within a 301, and thought about all the corner cases. So, for example, someone said, well what if I have a canonical, and I point to myself? Does that work? Yep, that works fine. What if I have a canonical and my href is empty? Well, it turns out that parses as an error, which turns out to point to itself. So all this stuff still works because Joachim did a really good design, but again, try to make sure that it’s all absolute URLs and everything’s specified well. Also, I’d love to send a shout out to Greg Grothaus. It turns out when you dig into this, a lot of people have proposed similar ideas. I saw at least one post out on the general web after we’d started exploring this that said, hey, why don’t you do this kind of a proposal? But Greg was really one of the people who sparked the discussion at Google, who really pushed for it and had a great idea, and so I sort of think of him as at least within Google, he really got the ball rolling and really sparked the wave of work on this, so I really appreciate that. And of course all the people, you know, from Maile and Wysz and Adam and Riona who have worked on the messaging and reached out to different people. At Yahoo!, Priyank, and a ton of people at Microsoft, Nathan Buggia and a bunch of other people as well. My hope is that lots of search engines will support this. So, Yahoo! and Microsoft have announced that they will support it, let’s keep our fingers crossed for Ask, I’d love for them to join in as well. Wikia, so Artur at Wikia had emailed us and sort of asked about doing canonical tags anyway. And so it was really great that they could test it out while we were trying it out ourselves. And then a ton of webmasters who always give us this sort of feedback on what they’d like to see.

On this last slide, I just list a bunch of resources, so Google, Yahoo!, and Microsoft all did blog posts about it. There’s an official Help Center documentation page. And, what we saw was, as people would come and have duplicate content questions, Joost had come and sort of asked about an interesting corner case, we just said, hey, you know what? We’ve got this thing coming out that might help with this. And so it was a very nice way to just do a sort of very quiet beta test and see how well it worked. So, Joost happened to email just a few days before we were ready to announce support, and so we gave him a heads-up about the possibility of this, and he turned around plugins not just for WordPress, but also for Magento, which is an e-commerce shopping software, and Drupal, which is another open-source content management system, which I think the White House just rolled out using Drupal. So really appreciate the work that he’s done as well. And in general, you know, be careful, be cautious, plan out how you want to use this tag. But we don’t intend to make any money off of it, we think it’s just good for the web, it’ll lead to less duplicate content. It’s an open standard, so any search engine that crawls the web can use this information to help, you know, make the web more relevant and increase the relevancy of their search results. And now you know as much as the audience knows when they attended SMX West. Thanks very much for listening, and talk to you soon.

Pretty fun, right? With a little more effort, you might be able to get the links to update an in-page embedded video instead of using static hyperlinks. Specifically, there’s a “seekTo” function in the YouTube JavaScript Player API. But right now I’m too lazy to dig into it.

www.mattcutts.com

published @ March 11, 2009
