The Internet Marketing Driver: Glenn Gabe's goal is to help marketers build powerful and measurable web marketing strategies.

Tuesday, October 13, 2009

SEO and AJAX: Taking a Closer Look at Google’s Proposal to Crawl AJAX


Last week at SMX, Google announced a proposal to crawl AJAX. Although it was great to hear the official announcement, you had to know it was coming. Too many web applications are using AJAX for Google to ignore it! After the news was released, I received a lot of questions about what the proposal actually means, how it works, and what the impact could be. There seemed to be a lot of confusion, even among people in the Search industry. And I can understand why. If you don’t have a technical background, then Google’s blog post detailing the proposal to crawl AJAX can be a bit confusing. The mention of URL fragments, stateful pages, and headless browsers can throw a lot of people off, to say the least. And if you’ve never heard of a headless browser, fear not! Since it’s close to Halloween and I grew up near Sleepy Hollow, I’ll spend some time in this post talking about what a headless browser is.

So based on my observations over the past week or so, I decided to write this post to take a closer look at what Google is proposing. My hope is to clear up some of the confusion so you can be prepared to have your AJAX crawled. And to reference AJAX’s original slogan, let’s find out if this proposal is truly Stronger Than Dirt. :)

Some Background Information About SEO and AJAX:
So why all the fuss about AJAX and SEO? AJAX stands for Asynchronous JavaScript and XML, and when used properly, it can create extremely engaging web applications. In a nutshell, a webpage using AJAX can load additional data from the server on demand without the page needing to refresh. For example, if you were viewing product information for a line of new computers, the page could dynamically load the information for each computer when you want to learn more. That might sound unimpressive, but instead of triggering a new page and having to wait as the page loads all of the necessary images, files, etc., the page uses AJAX to dynamically (and quickly) supply the information. As a user, you could quickly see everything you need without an additional page refresh. Ten or more pages of content can now be viewed on one page. This is great for functionality, but not so great for SEO. More on that below.

Needless to say, this type of functionality has become very popular with developers wanting to streamline the user experience for visitors. Unfortunately, the search engines haven’t been so nice to AJAX-based sites. Until this proposal, most AJAX-based content was not crawlable. The original content that loaded on the page was crawlable, but you had to use a technique like HIJAX to make sure the bots could find all of your dynamically loaded content. Or, you had to create alternative pages that didn’t use AJAX (which added a lot of rework). Either way, it took careful planning and extra work by your team. On that note, I’ve yet to be part of a project where AJAX developers jump up and down with joy about having to do this extra work. There just had to be a better solution. Based on what I explained above, Google’s proposal is an important step forward.

What is Google’s Proposal to Crawl AJAX?
When hearing about the proposal, I think experienced SEOs and developers knew there would be challenges ahead. It probably wasn’t going to be a simple solution. And for the most part, we were right. The proposal is definitely a step forward, but webmasters need to cooperate (and share the burden of making sure their AJAX can be crawled). In a nutshell, Google wants webmasters to process AJAX content on the server and provide the search engines with a snapshot of what the page would look like with the AJAX content loaded. Then Google can crawl and index that snapshot and provide it in the search results as a stateful URL (a URL that visitors can access directly to see the page with the AJAX-loaded content).

If the last line threw you off, don’t worry. We are going to take a closer look at the process that’s being proposed below.

Getting Your AJAX Crawled: Taking a closer look at the steps involved:

1. Adding a token to your URL:
Let’s say you are using AJAX on your site to provide additional information about a new line of products. A URL might look like:

example.com/productid.aspx#productname

Google is proposing that you use a token (in this case an exclamation point !) to make sure Google knows that it’s an AJAX page that should be crawled. So, your new URL would look like:

example.com/productid.aspx#!productname

When Google comes across this URL using the token, it would recognize that it’s an AJAX page and take further action.
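
To make the token check concrete, here is a minimal sketch in Python of how a crawler (or your own testing script) might decide whether a URL carries the token. The function name has_ajax_token is made up for illustration, and this is only my reading of the step described above, not Google's actual code.

```python
from urllib.parse import urlsplit

AJAX_TOKEN = "!"  # the token described in this post


def has_ajax_token(url: str) -> bool:
    """Return True when the URL fragment (the part after #) starts with the token."""
    fragment = urlsplit(url).fragment
    return fragment.startswith(AJAX_TOKEN)


if __name__ == "__main__":
    for url in (
        "http://example.com/productid.aspx#productname",
        "http://example.com/productid.aspx#!productname",
    ):
        print(url, has_ajax_token(url))
```

Only the second URL is flagged, because the check looks at the fragment alone and ignores the rest of the URL.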

2. The Headless Browser (Scary name, but important functionality.)
Now that Google recognizes you are using AJAX, we need to make sure it can access the AJAX page (and the dynamically loaded content). That’s where the headless browser comes in. Now if you just said, “What the heck is a headless browser?”, you’re not alone. That’s probably the top question I’ve received since Google announced its proposal. A headless browser is a GUI-less browser (a browser with no graphical user interface) that will run on your server. The headless browser will process the request for the dynamic version of the webpage in question. In the blog post announcing this proposal, Google referenced a headless browser called HtmlUnit, and you can read more about it on the HtmlUnit website.

Why would Google require this? Well, Google knows that it would take enormous amounts of power and resources to execute and crawl all of the JavaScript being used today on the web. So, if webmasters help out and process the AJAX for Google, then it will cut down on the amount of resources needed and provide a quick way to make sure the page gets properly crawled.

To continue our example from above, let’s say you already provided a token in your URL so Google will recognize that it’s an AJAX page. Google would then request the AJAX page from the headless browser on your server by escaping the state. Basically, URL fragments (the anchor and any additional information after the # at the end of a URL) are not sent with requests to the server. Therefore, Google needs to change that URL to request the AJAX page from the headless browser (see below).

Google would end up requesting the page like this:
example.com/productid.aspx?_escaped_fragment=productname
Note: It would make this request only after it finds a URL using the token explained above (the exclamation point !)
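
The rewrite above can be sketched in a few lines of Python. This is a hedged illustration, not Google's implementation: the helper name to_crawler_url is hypothetical, and the parameter name is copied from the example URL in this post, so check it against Google's own documentation before relying on it.

```python
from urllib.parse import quote, urlsplit, urlunsplit

ESCAPED_PARAM = "_escaped_fragment"  # spelled as in this post's example URL


def to_crawler_url(url: str) -> str:
    """Move the state after '#!' into a query parameter the server can see."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # no token, so nothing is rewritten
    state = quote(parts.fragment[1:], safe="")  # escape the state for use in a query
    extra = f"{ESCAPED_PARAM}={state}"
    query = f"{parts.query}&{extra}" if parts.query else extra
    # The fragment is dropped because it never reaches the server anyway.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    print(to_crawler_url("http://example.com/productid.aspx#!productname"))
```

If the URL already has a query string, the sketch appends the new parameter with an ampersand rather than replacing what is there.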

This would tell the server to use the headless browser to process the page and return HTML to Google (or any search engine that chooses to participate). That’s why the token is important. If you don’t use the token, the page will be processed normally (AJAX-style). In that case, the headless browser will not be triggered and Google will not request additional information from the server.
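
On the server side, the decision described above boils down to a branch on one query parameter. Here is a rough Python sketch of that branch. Everything in it is an assumption for illustration: handle_request, render_snapshot, and serve_normal are hypothetical names, and the headless browser is represented by a stand-in function rather than a real HtmlUnit call.

```python
from typing import Callable
from urllib.parse import parse_qs, urlsplit

ESCAPED_PARAM = "_escaped_fragment"  # spelled as in this post's example URL


def handle_request(
    request_target: str,
    render_snapshot: Callable[[str, str], str],
    serve_normal: Callable[[str], str],
) -> str:
    """Send crawler-style requests to the snapshot renderer, all others to the normal page."""
    parts = urlsplit(request_target)
    params = parse_qs(parts.query, keep_blank_values=True)
    if ESCAPED_PARAM in params:
        state = params[ESCAPED_PARAM][0]  # e.g. "productname"
        return render_snapshot(parts.path, state)  # headless browser would run here
    return serve_normal(parts.path)  # regular visitors get the AJAX page


if __name__ == "__main__":
    # Stand-ins for the headless browser and the normal page handler.
    snapshot = lambda path, state: f"<html>snapshot of {path} in state {state}</html>"
    normal = lambda path: f"<html>AJAX page at {path}</html>"
    print(handle_request("/productid.aspx?_escaped_fragment=productname", snapshot, normal))
    print(handle_request("/productid.aspx", snapshot, normal))
```

Passing the two handlers in as arguments keeps the sketch independent of any particular web framework or headless browser.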

3. Stateful AJAX Pages Displayed in the Search Results
Now that you have provided Google a way to crawl your AJAX content (using the process above), Google could provide that URL in the search results. The page that Google displays in the SERPs will enable visitors to see the same content as if they were traversing your AJAX content on your site. That is, they will access the AJAX version of the page versus the default content (which is what would normally be crawled). And since there is now a stateful URL that contains the AJAX content, Google can check to ensure that the indexable content matches what is returned to users.

Using our example from above, here is what the process would look like:
Your original URL:
example.com/productid.aspx#productname

You would change the URL to include a token:
example.com/productid.aspx#!productname

Google would recognize this as an AJAX page and request the following:
example.com/productid.aspx?_escaped_fragment=productname

The headless browser (on your server) would process this request and return a snapshot of the AJAX page. The engines would then provide the content at the stateful URL in the search results:
example.com/productid.aspx#!productname
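
The last step of the walkthrough, going from the URL the crawler requested back to the stateful URL shown in the search results, can be sketched as the reverse mapping. As before, this is only an illustration under my own assumptions: to_stateful_url is a hypothetical name and the parameter spelling is taken from this post's example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

ESCAPED_PARAM = "_escaped_fragment"  # spelled as in this post's example URL


def to_stateful_url(crawler_url: str) -> str:
    """Turn the crawler's request URL back into the '#!' URL visitors would use."""
    parts = urlsplit(crawler_url)
    state = None
    kept = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == ESCAPED_PARAM and state is None:
            state = value  # parse_qsl has already unescaped the value
        else:
            kept.append((key, value))  # leave every other parameter alone
    if state is None:
        return crawler_url  # not a crawler-style URL
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "!" + state)
    )


if __name__ == "__main__":
    print(to_stateful_url("http://example.com/productid.aspx?_escaped_fragment=productname"))
```

Run on the request URL from the example, this returns the same token URL you published in the first place, which is the whole point of the round trip.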

Barriers to Acceptance
This all sounds great, right? It is, but there are some potential obstacles. I’m glad Google has offered this proposal, but I’m worried about how widely it’s going to be adopted. Putting some of the workload on webmasters presents some serious challenges. When you ask webmasters to add something like a headless browser to their setup, you never know how many will actually agree to participate.

As an example, I’ve helped a lot of clients with Flash SEO, which typically involves using SWFObject 2.x to provide alternative and crawlable content for your Flash movies. This is a relatively straightforward process and doesn’t require any server-based changes. It’s all client side. However, it does require some additional work from developers and designers. Even though it’s relatively painless to implement, I still see a lot of unoptimized Flash content out there… And again, it doesn’t require setting up a headless browser on the server! There are some web architects I’ve worked with over the years who would have my head for requesting to add anything to their setup, no pun intended. :) To be honest, the fact that I even had to write this post is a bad sign… So again, I’m sure there are challenges ahead.

But there is an upside for those webmasters who take the necessary steps to make sure their AJAX is crawlable. It’s called a competitive advantage! Take the time to provide Google what it wants, and you just might reap the benefits. That leads to my final point about what you should do now.

Wrapping Up: So What Should You Do?
Prepare. I would spend some time getting ready to test this out. Speak with your technical team, bring this up during meetings, and start thinking about ways to test it out without spending enormous amounts of time and energy. As an example, one of my clients agreed to wear a name tag that says, “Is Your AJAX Crawlable?” to gain attention as he walks the halls of his company. It sounds funny, but he said it has sparked a few conversations about the topic. My recommendation is to not blindside people at your company when you need this done. Lay the groundwork now, and it will be easier to implement when you need to.

Regarding actual implementation, I’m not sure when this will start happening. However, if you use AJAX on your website (or plan to), then this is an important advancement for you to consider. If nothing else, you now have a great idea for a Halloween costume: The Headless Browser. {And don’t blame me if nobody understands what you are supposed to be… Just make sure there are plenty of SEOs at the Halloween party.} :)

GG

Related Posts:
The Critical Last Mile for SEO: Your Copywriters, Designers and Developers
Using SWFObject 2.0 to Embed Flash While Providing SEO Friendly Alternative Content
6 Questions You Should Ask During a Website Redesign That Can Save Your Search Engine Rankings
SEO, Forms, and Hidden Content - The Danger of Coding Yourself Into Search Obscurity
