Net Leading: September 2011

Thursday, September 29, 2011

Work smarter, not harder, with site health

Webmaster level: All

We consistently hear from webmasters that they have to prioritize their time. Some manage dozens or hundreds of clients’ sites; others run their own business and may only have an hour to spend on website maintenance in between managing finances and inventory. To help you prioritize your efforts, Webmaster Tools is introducing the idea of “site health,” and we’ve redesigned the Webmaster Tools home page to highlight your sites with health problems. This should allow you to easily see what needs your attention the most, without having to click through all of the reports in Webmaster Tools for every site you manage.

Here’s what the new home page looks like:

You can see that sites with health problems are shown at the top of the list. (If you prefer, you can always switch back to listing your sites alphabetically.) To see the specific issues we detected on a site, click the site health icon or the “Check site health” link next to that site:

This new home page is currently only available if you have 100 or fewer sites in your Webmaster Tools account (either verified or unverified). We’re working on making it available to all accounts in the future. If you have more than 100 sites, you can see site health information at the top of the Dashboard for each of your sites.

Right now we include three issues in your site’s health check:

- Have we detected malware on the site?
- Have any important pages been removed via our URL removal tool?
- Are any of your important pages blocked from crawling in robots.txt?

You can click on any of these items to get more details about what we detected on your site. If the site health icon and the “Check site health” link don’t appear next to a site, it means that we didn’t detect any of these issues on that site (congratulations!).

A word about “important pages:” as you know, you can get a comprehensive list of all URLs that have been removed by going to Site configuration > Crawler access > Remove URL; and you can see all the URLs that we couldn’t crawl because of robots.txt by going to Diagnostics > Crawl errors > Restricted by robots.txt. But since webmasters often block or remove content on purpose, we only wanted to indicate a potential site health issue if we think you may have blocked or removed a page you didn’t mean to, which is why we’re focusing on “important pages.” Right now we’re looking at the number of clicks pages get (which you can see in Your site on the web > Search queries) to determine importance, and we may incorporate other factors in the future as our site health checks evolve.
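
As a rough local approximation of the robots.txt check described above, the sketch below uses Python's standard urllib.robotparser to flag pages that receive clicks but are disallowed for Googlebot. The robots.txt rules, the URLs, the click counts, and the min_clicks threshold are all made-up assumptions; the post does not say what threshold Webmaster Tools uses, and Python's parser may not match Googlebot's rule matching in every case.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules and click counts, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /products/
"""

CLICKS_BY_URL = {
    "http://www.example.com/products/widget.html": 1200,
    "http://www.example.com/private/notes.html": 0,
    "http://www.example.com/about.html": 300,
}


def blocked_important_pages(robots_txt, clicks_by_url, min_clicks=100, agent="Googlebot"):
    """Return URLs with at least min_clicks clicks that are disallowed for agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return sorted(
        url
        for url, clicks in clicks_by_url.items()
        if clicks >= min_clicks and not parser.can_fetch(agent, url)
    )


# Only the product page is both clicked on and blocked.
print(blocked_important_pages(ROBOTS_TXT, CLICKS_BY_URL))
```

A page that is blocked on purpose and gets no clicks (the /private/ URL here) is left out, which mirrors the idea of only flagging pages you probably did not mean to block.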

Obviously these three issues—malware, removed URLs, and blocked URLs—aren’t the only things that can make a website “unhealthy;” in the future we’re hoping to expand the checks we use to determine a site’s health, and of course there’s no substitute for your own good judgment and knowledge of what’s going on with your site. But we hope that these changes make it easier for you to quickly spot major problems with your sites without having to dig down into all the data and reports.

After you’ve resolved any site health issues we’ve flagged, it will usually take several days for the warning to disappear from your Webmaster Tools account, since we have to recrawl the site, see the changes you’ve made, and then process that information through our Web Search and Webmaster Tools pipelines. If you continue to see a site health warning for that site after a week or so, the issue may not have been resolved. Feel free to ask for help tracking it down in our Webmaster Help Forum... and let us know what you think!

Posted by Susan Moskwa, Webmaster Trends Analyst

Thursday, September 15, 2011

View-all in search results

Webmaster level: Intermediate to Advanced

User testing has taught us that searchers much prefer the view-all, single-page version of content over a component page containing only a portion of the same information with arbitrary page breaks (which cause the user to click “next” and load another URL).

Searchers often prefer the view-all vs. paginated content with arbitrary page breaks and worse latency.

Therefore, to improve the user experience, when we detect that a content series (e.g. page-1.html, page-2.html, etc.) also contains a single-page version (e.g. page-all.html), we’re now making a larger effort to return the single-page version in search results. If your site has a view-all option, there’s nothing you need to do; we’ll work to do it on your behalf. Also, indexing properties, like links, will be consolidated from the component pages in the series to the view-all page.

However, high latency can make the view-all less preferred

Interestingly, the cases when users didn’t prefer the view-all page were correlated with high latency (e.g., when the view-all page took a while to load, say, because it contained many images). This makes sense because we know users are less satisfied with slow results. So while a view-all page is commonly desired, as a webmaster it’s important to balance this preference with the page’s load time and overall user experience.

Best practices for a series of content

If your site includes view-all pages

We aim to detect the view-all version of your content and, if available, its associated component pages. There’s nothing more you need to do! However, if you’d like to make it more explicit to us, you can include rel=”canonical” from your component pages to your view-all to increase the likelihood that we detect your series of pages appropriately.

rel=”canonical” can specify the superset of content (i.e. the view-all page, in this case page-all.html) from the same information in a series of URLs.

Why does this work?
In the diagram, page-2.html of a series may specify the canonical target as page-all.html because page-all.html is a superset of page-2.html's content. When a user searches for a query term and page-all.html is selected in search results, even if the query is most related to page-2.html, we know the user will still see page-2.html’s relevant information within page-all.html.

On the other hand, page-2.html shouldn’t designate page-1.html as the canonical because page-2.html’s content isn’t included on page-1.html. It’s possible that a user’s search query is relevant to content on page-2.html, but if page-2.html’s canonical is set to page-1.html, the user could then select page-1.html in search results and find herself in a position where she has to further navigate to a different page to arrive at the desired information. That’s a poor experience for the user, a suboptimal result from us, and it could also bring poorly targeted traffic to your site.
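
The superset rule above can be sketched as a small check. This is only an illustration: the content model (each page as a list of paragraph labels) and the function names are hypothetical, and nothing here reflects how Google actually compares pages.

```python
# Hypothetical content model: each page is a list of paragraph labels.
PAGES = {
    "page-1.html": ["intro", "history"],
    "page-2.html": ["methods", "results"],
    "page-all.html": ["intro", "history", "methods", "results"],
}


def is_superset_target(source, target, pages=PAGES):
    """True if every paragraph on the source page also appears on the target page."""
    return set(pages[source]) <= set(pages[target])


def canonical_link(target_url):
    """Build the <link> element a component page would place in its <head>."""
    return '<link rel="canonical" href="%s" />' % target_url


# page-all.html contains everything on page-2.html, so it is a sensible target.
print(is_superset_target("page-2.html", "page-all.html"))
# page-1.html does not contain page-2.html's content, so it is not.
print(is_superset_target("page-2.html", "page-1.html"))
print(canonical_link("page-all.html"))
```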

However, if you strongly desire your view-all page not to appear in search results: 1) make sure the component pages in the series don’t include rel=”canonical” to the view-all page, and 2) mark the view-all page as “noindex” using any of the standard methods.

If you’d like to surface individual, component pages (or there’s no view-all available)

It may be the case that one or both of the situations below apply to your site:
- The view-all page is undesirable as a search result (e.g., load time too high or too difficult for users to navigate).
- Your users prefer the multi-page experience and to be directed to a component page in search results, rather than the view-all page.
If so, you can use standard HTML rel=”next” and rel=”prev” elements to specify a relationship between the component pages in your series of content. If done correctly, Google will generally strive to:
- Consolidate indexing properties, such as links, between the component pages/URLs.
- Send users to the most relevant page/URL from the component pages. Typically, the most relevant page is the first page of your content, but our algorithms may point users to one of the component pages in the series.

It’s not uncommon for webmasters to incorrectly use rel=”canonical” from component pages to the first page of their series (e.g. page-2.html with rel=”canonical” to page-1.html). We recommend against this implementation because the component pages don’t actually contain duplicate content. Using rel=”next” and rel=”prev” is far more appropriate.

Summary

Because users generally prefer the view-all option in search results, we’re making more of an effort to properly detect and serve this version to searchers. If you have a series of content, there’s nothing more you need to do. If you’d like to hint more to Google how best to serve users your information:

- To better optimize your view-all page, you can use rel=”canonical” from component pages to the single-page version; otherwise,
- If a view-all page doesn’t provide a good user experience for your site, you can use the rel=”next” and rel=”prev” attributes as a strong hint for Google to identify the series of pages and still surface a component page in results.

Questions?

As always, feel free to ask in our Webmaster Help Forum.

Written by Benjia Li & Joachim Kupke, Software Engineers, Indexing Team

Pagination with rel=“next” and rel=“prev”

Webmaster level: Intermediate to Advanced

Much like rel=”canonical” acts as a strong hint for duplicate content, you can now use the HTML link elements rel=”next” and rel=”prev” to indicate the relationship between component URLs in a paginated series. Throughout the web, a paginated series of content may take many shapes: it can be an article divided into several component pages, a product category with items spread across several pages, or a forum thread divided into a sequence of URLs. Now, if you choose to include rel=”next” and rel=”prev” markup on the component pages within a series, you’re giving Google a strong hint that you’d like us to:

- Consolidate indexing properties, such as links, from the component pages/URLs to the series as a whole (i.e., links should not remain dispersed between page-1.html, page-2.html, etc., but be grouped with the sequence).
- Send users to the most relevant page/URL, typically the first page of the series.

The relationship between component URLs in a series can now be indicated to Google through rel=”next” and rel=”prev”.

There’s an exception to the rel=”prev” and rel=”next” implementation: If, alongside your series of content, you also offer users a view-all page, or if you’re considering a view-all page, please see our post on View-all in search results for more information. Because view-all pages are most commonly preferred by searchers, we do our best to surface this version when appropriate in results rather than a component page (component pages are more likely to surface with rel=”next” and rel=”prev”).

If you don’t have a view-all page or you’d like to override Google returning a view-all page, you can use rel="next" and rel="prev" as described in this post.

For information on paginated configurations that include a view-all page, please see our post on View-all in search results.

Outlining your options

Here are three options for a series:

1. Leave whatever you have exactly as-is. Paginated content exists throughout the web and we’ll continue to strive to give searchers the best result, regardless of the page’s rel=”next”/rel=”prev” HTML markup, or lack thereof.
2. If you have a view-all page, or are considering a view-all page, see our post on View-all in search results.
3. Hint to Google the relationship between the component URLs of your series with rel=”next” and rel=”prev”. This helps us more accurately index your content and serve to users the most relevant page (commonly the first page). Implementation details below.

Implementing rel=”next” and rel=”prev”

If you prefer option 3 (above) for your site, let’s get started! Let’s say you have content paginated into the URLs:

http://www.example.com/article?story=abc&page=1
http://www.example.com/article?story=abc&page=2
http://www.example.com/article?story=abc&page=3
http://www.example.com/article?story=abc&page=4

On the first page, http://www.example.com/article?story=abc&page=1, you’d include in the <head> section:
<link rel="next" href="http://www.example.com/article?story=abc&page=2" />

On the second page, http://www.example.com/article?story=abc&page=2:
<link rel="prev" href="http://www.example.com/article?story=abc&page=1" />
<link rel="next" href="http://www.example.com/article?story=abc&page=3" />

On the third page, http://www.example.com/article?story=abc&page=3:
<link rel="prev" href="http://www.example.com/article?story=abc&page=2" />
<link rel="next" href="http://www.example.com/article?story=abc&page=4" />

And on the last page, http://www.example.com/article?story=abc&page=4:
<link rel="prev" href="http://www.example.com/article?story=abc&page=3" />

A few points to mention:

- The first page only contains rel=”next” and no rel=”prev” markup.
- Pages two to the second-to-last page should be doubly-linked with both rel=”next” and rel=”prev” markup.
- The last page only contains markup for rel=”prev”, not rel=”next”.
- rel=”next” and rel=”prev” values can be either relative or absolute URLs (as allowed by the <link> tag). And, if you include a <base> link in your document, relative paths will resolve according to the base URL.
- rel=”next” and rel=”prev” only need to be declared within the <head> section, not within the document <body>.
- We allow rel=”previous” as a syntactic variant of rel=”prev” links.
- rel="next" and rel="previous" on the one hand and rel="canonical" on the other constitute independent concepts. Both declarations can be included in the same page. For example, http://www.example.com/article?story=abc&page=2&sessionid=123 may contain:

  <link rel="canonical" href="http://www.example.com/article?story=abc&page=2" />
  <link rel="prev" href="http://www.example.com/article?story=abc&page=1&sessionid=123" />
  <link rel="next" href="http://www.example.com/article?story=abc&page=3&sessionid=123" />

- rel=”prev” and rel=”next” act as hints to Google, not absolute directives.
- If the markup is implemented incorrectly, such as by omitting an expected rel="prev" or rel="next" designation in the series, we'll continue to index the page(s) and rely on our own heuristics to understand your content.
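
The first-page, middle-page, and last-page rules above can be sketched as a small generator for the link elements. This is a minimal illustration using the example URLs from this post; the function names and the story/page parameters are assumptions, not part of any Google tooling.

```python
from urllib.parse import urlencode

BASE = "http://www.example.com/article"  # example URL from this post


def page_url(story, page):
    """Absolute URL of one component page in the series."""
    return BASE + "?" + urlencode({"story": story, "page": page})


def pagination_links(story, page, last_page):
    """Return the <link> elements for the <head> of the given component page."""
    if not 1 <= page <= last_page:
        raise ValueError("page must be between 1 and last_page")
    links = []
    if page > 1:  # every page except the first points back
        links.append('<link rel="prev" href="%s" />' % page_url(story, page - 1))
    if page < last_page:  # every page except the last points forward
        links.append('<link rel="next" href="%s" />' % page_url(story, page + 1))
    return links


for number in range(1, 5):
    print(number, pagination_links("abc", number, 4))
```

Generating the markup from the page number and the series length makes it harder to accidentally omit an expected rel="prev" or rel="next" in the middle of a series.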

Questions?
More information can be found in our Help Center, or join the conversation in our Webmaster Help Forum!

Written by Benjia Li & Joachim Kupke, Software Engineers, Indexing Team

Reconsideration requests get more transparent

Webmaster level: All

If your site isn't appearing in Google search results, or it's performing more poorly than it once did (and you believe that it does not violate our Webmaster Guidelines), you can ask Google to reconsider your site. Over time, we’ve worked to improve the reconsideration process for webmasters. A couple of years ago, in addition to confirming that we had received the request, we started sending a second message to webmasters confirming that we had processed their request. This was a huge step for webmasters who were anxiously awaiting results. Since then, we’ve received feedback that webmasters wanted to know the outcome of their requests. Earlier this year, we started experimenting with sending more detailed reconsideration request responses and the feedback we’ve gotten has been very positive!

Now, if your site is affected by a manual spam action, we may let you know if we were able to revoke that manual action based on your reconsideration request. Or, we could tell you if your site is still in violation of our guidelines. This might be a discouraging thing to hear, but once you know that there is still a problem, it will help you diagnose the issue.

If your site is not actually affected by any manual action (this is the most common scenario), we may let you know that as well. Perhaps your site isn’t being ranked highly by our algorithms, in which case our systems will respond to improvements on the site as changes are made, without your needing to submit a reconsideration request. Or maybe your site has access issues that are preventing Googlebot from crawling and indexing it. For more help debugging ranking issues, read our article about why a site may not be showing up in Google search results.

We’ve made a lot of progress on making the entire reconsideration request process more transparent. We aren’t able to reply to individual requests with specific feedback, but now many webmasters will be able to find out if their site has been affected by a manual action and they’ll know the outcome of the reconsideration review. In an ideal world, Google could be completely transparent about how every part of our rankings work. However, we have to maintain a delicate balance: trying to give as much information to webmasters as we can without letting spammers probe how to do more harm to users. We're happy that Google has set the standard on tools, transparency, and communication with site owners, but we'll keep looking for ways to do even better.

Posted by Tiffany Oberoi and Michael Wyszomierski, Search Quality Team

Wednesday, September 14, 2011

Introducing: Application Rich Snippets

Webmaster level: All

Rich snippets help users determine more quickly if a particular web page has the information they're interested in. We've previously introduced rich snippets for shopping, recipes, reviews, video, and events, and most recently music.

Before installing a software application, users might want to check out what others think about it and how much it costs. We are pleased to announce that starting today, you’ll be able to get this information right in search results.

Here's an example of what an application snippet looks like.

Image of application snippet

You can see application snippets from several marketplaces and review sites, including Android Market, Apple iTunes, and CNET. For information on how to add app markup on your site, please refer to our Webmaster central article and send any questions to our discussion help forum.

Posted by Alejandro Goyen, Product Manager

Sunday, September 11, 2011

Îñţérñåţîöñåļîžåţîöñ

Webmaster level: Intermediate

So you’re going global, and you need your website to follow. Should be a simple case of getting the text translated and you’re good to go, right? Probably not. The Google Webmaster Team frequently builds sites that are localized into over 40 languages, so here are some things that we take into account when launching our pages in both other languages and regions.

(Even if you think you might be immune to these issues because you only offer content in English, it could be that non-English language visitors are using tools like Google Translate to view your content in their language. This traffic should show up in your analytics dashboard, so you can get an idea of how many visitors are not viewing your site in the way it’s intended.)

More languages != more HTML templates

We can’t recommend this enough: reuse the same template for all language versions, and always try to keep the HTML of your template simple.

Keeping the HTML code the same for all languages has its advantages when it comes to maintenance. Hacking around with the HTML code for each language to fix bugs doesn’t scale–keep your page code as clean as possible and deal with any styling issues in the CSS. To name just one benefit of clean code: most translation tools will parse out the translatable content strings from the HTML document and that job is made much easier when the HTML is well-structured and valid.

How long is a piece of string?

If your design relies on text playing nicely with fixed-size elements, then translating your text might wreak havoc. For example, your left-hand side navigation text is likely to translate into much longer strings of text in several languages–check out the difference in string lengths between some English and Dutch language navigation for the same content. Be prepared for navigation titles that might wrap onto more than one line by figuring out your line height to accommodate this (also worth considering when you create your navigation text in English in the first place).

Variable word lengths cause particular issues in form labels and controls. If your form layout displays labels on the left and fields on the right, for example, longer text strings can flow over into two lines, whereas shorter text strings do not seem associated with their form input fields–both scenarios ruin the design and impede the readability of the form. Also consider the extra styling you’ll need for right-to-left (RTL) layouts (more on that later). For these reasons we design forms with labels above fields, for easy readability and styling that will translate well across languages.

Screenshots of Chinese and German versions of web forms

Also avoid fixed-height columns–if you’re attempting to neaten up your layout with box backgrounds that match in height, chances are when your text is translated, the text will overrun areas that were only tall enough to contain your English content. Think about whether the UI elements you’re planning to use in your design will work when there is more or less text–for instance, horizontal vs. vertical tabs.

On the flip side

Source editing for bidirectional HTML can be problematic because many editors have not been built to support the Unicode bidirectional algorithm (more research on the problems and solutions). In short, the way your markup is displayed might get garbled:

<p>ابةتث <img src="foo.jpg" alt=" جحخد"< ذرزسش!</p>

Our own day-to-day usage has shown the following editors to currently provide decent solutions for bidirectional editing: particularly Coda, and also Dreamweaver, IntelliJ IDEA and JEditX.

When designing for RTL languages you can build most of the support you need into the core CSS and use the directional attribute of the html element (for backwards compatibility) in combination with a class on the body element. As always, keeping all styles in one core stylesheet makes for better maintainability.

Some key styling issues to watch out for: any elements floated right will need to be floated left and vice versa; extra padding or margin widths applied to one side of an element will need to be overridden and switched, and any text-align attributes should be reversed.

We generally use the following approach, including using a class on the body tag rather than a html[dir=rtl] CSS selector because this is compatible with older browsers:

Elements:

<body class="rtl">
<h1><a href="http://www.blogger.com/"><img alt="Google" src="http://www.google.com/images/logos/google_logo.png" /></a> Heading</h1>

Left-to-right (default) styling:

h1 {
  height: 55px;
  line-height: 2.05;
  margin: 0 0 25px;
  overflow: hidden;
}
h1 img {
  float: left;
  margin: 0 43px 0 0;
  position: relative;
}

Right-to-left styling:

body.rtl {
  direction: rtl;
}
body.rtl h1 img {
  float: right;
  margin: 0 0 0 43px;
}

(See this in action in English and Arabic.)
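
The overrides described above (swap floats, swap one-sided margins or padding, reverse text-align) are mechanical enough to sketch in code. The helper below is a hypothetical illustration that handles only float, text-align, and four-value margin/padding shorthands; it is not a complete CSS flipping tool.

```python
SWAP = {"left": "right", "right": "left"}


def flip_declaration(prop, value):
    """Return the right-to-left counterpart of one CSS declaration."""
    if prop in ("float", "text-align"):
        return prop, SWAP.get(value, value)
    if prop in ("margin", "padding"):
        parts = value.split()
        if len(parts) == 4:  # shorthand order is top right bottom left
            top, right, bottom, left = parts
            return prop, " ".join([top, left, bottom, right])
    return prop, value  # symmetric or unsupported values pass through unchanged


# The h1 img rules from the example above:
print(flip_declaration("float", "left"))
print(flip_declaration("margin", "0 43px 0 0"))
# Three-value shorthands have equal left and right values, so nothing changes.
print(flip_declaration("margin", "0 0 25px"))
```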

One final note on this subject: most of the time your content destined for right-to-left language pages will be bidirectional rather than purely RTL, because some strings will probably need to retain their LTR direction–for example, company names in Latin script or telephone numbers. The way to make sure the browser handles this correctly in a primarily RTL document is to wrap the embedded text strings with an inline element using an attribute to set direction, like this:

<h2>‫עוד ב- <span dir="ltr">Google</span>‬</h2>

In cases where you don’t have an HTML container to hook the dir attribute into, such as title elements or JavaScript-generated source code for message prompts, you can use this equivalent to set direction, where &#x202B; and &#x202C; are the Unicode control characters that open and close a right-to-left embedding:

<title>&#x202B;‫הפוך את Google לדף הבית שלך‬&#x202C;</title>

Example usage in JavaScript code:

var ffError = '\u202B' +'כדי להגדיר את Google כדף הבית שלך ב\x2DFirefox, לחץ על הקישור \x22הפוך את Google לדף הבית שלי\x22, וגרור אותו אל סמל ה\x22בית\x22 בדפדפן שלך.'+ '\u202C';

(For more detail, see the W3C’s articles on creating HTML for Arabic, Hebrew and other right-to-left scripts and authoring right-to-left scripts.)
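
For strings assembled on the server side, the same wrapping can be sketched in Python. The helper names are hypothetical; the two control characters are the ones used in the examples above, and the standard unicodedata module is used only to confirm what they are.

```python
import unicodedata

RLE = "\u202B"  # U+202B, opens a right-to-left embedding
PDF = "\u202C"  # U+202C, closes the most recent embedding


def wrap_rtl(text):
    """Wrap text so that it is laid out as a right-to-left run."""
    return RLE + text + PDF


def as_html_entity(char):
    """Spell a single character as a hexadecimal numeric character reference."""
    return "&#x%04X;" % ord(char)


# A Hebrew word followed by a Latin-script name, written with escapes.
message = wrap_rtl("\u05E2\u05D5\u05D3 Google")

print(unicodedata.name(RLE))
print(unicodedata.name(PDF))
print(as_html_entity(RLE), as_html_entity(PDF))
```

Writing the control characters as escapes in source code, rather than pasting them literally, keeps them visible in editors that do not display bidirectional formatting characters.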

It’s all Greek to me…

If you’ve never worked with non-Latin character sets before (Cyrillic, Greek, and a myriad of Asian and Indic), you might find that both your editor and browser do not display content as intended.

Check that your editor and browser encodings are set to UTF-8 (recommended) and consider adding a <meta charset="UTF-8"> element and the lang attribute of the html element to your HTML template so browsers know what to expect when rendering your page. This has the added benefit of ensuring that all Unicode characters are displayed correctly, so using HTML entities such as &eacute; (é) will not be necessary, saving valuable bytes! Check the W3C’s tutorial on character encoding if you’re having trouble; it contains in-depth explanations of the issues.
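
To put a number on the byte savings, the short sketch below compares a word written with a literal accented character in UTF-8 against the same word written with a named HTML entity. The sample word is an arbitrary choice for illustration.

```python
import html

literal_form = "caf\u00e9"     # the word with a literal e-acute character
entity_form = "caf&eacute;"    # the same word using a named HTML entity

# Both forms decode to the same text in a browser.
assert html.unescape(entity_form) == literal_form

literal_bytes = len(literal_form.encode("utf-8"))
entity_bytes = len(entity_form.encode("utf-8"))

print(literal_bytes, entity_bytes, entity_bytes - literal_bytes)
```

The accented letter takes two bytes in UTF-8, while the entity takes eight, so each replaced entity of this kind saves six bytes.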

A word on naming

Lastly, a practical tip on naming conventions when creating several language versions. Using a standard such as the ISO 639-1 language codes for naming helps when you start to deal with several language versions of the same document.

Using a conventional standard will help users understand your site’s structure and make the site more maintainable for all webmasters who might develop it. Using the language codes in the names of other site assets (logo images, PDF documents) also makes those files quick to identify.
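
One possible way to apply such a convention to asset names is sketched below. The "_<code>" suffix pattern and the function name are hypothetical choices, and the check only verifies that the code has the two-lowercase-letter shape of an ISO 639-1 code, not that it is actually assigned.

```python
import re


def localized_name(filename, lang):
    """Insert a two-letter language code before the file extension."""
    if not re.fullmatch(r"[a-z]{2}", lang):
        raise ValueError("expected a two-letter lowercase language code")
    stem, dot, extension = filename.rpartition(".")
    if not dot:  # no extension: just append the code
        return "%s_%s" % (filename, lang)
    return "%s_%s.%s" % (stem, lang, extension)


for code in ("en", "nl", "ar"):
    print(localized_name("logo.png", code), localized_name("terms.pdf", code))
```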

See previous Webmaster Central posts for advice about URL structures and other issues surrounding working with multi-regional websites and working with multilingual websites.

That’s a summary of the main challenges we wrestle with on a daily basis; but we can vouch for the fact that putting in the planning and work up front towards well-structured HTML and robust CSS pays dividends during localization!

Posted by Kathryn Cullen, Google Webmaster Team

Wednesday, September 7, 2011

Recognizing Top Contributors in Google's Help Forums

The communities around Google products and services have been growing tremendously over the last couple of years. It is inspiring and motivating for us to see how many users like you contribute to Google Forums. For some time, we’ve been thinking of ways to thank our Top Contributors (TCs), our most passionate, helpful, friendly, and active users. These TCs have demonstrated incredible commitment to our communities and continue to share their profound knowledge by answering user questions within the forums.

TCs from all over the world will attend our first global summit in California.

We decided to give the online world a break for a moment and meet in real life to celebrate our past success and work on future endeavours. Google Forum Guides, Googlers that participate in the forums, and Top Contributors will convene for the first global Top Contributor Summit on September 13th and 14th in Santa Clara and Mountain View, California. During the Google-organized two-day event, Top Contributors will meet guides, engineers and product managers in order to get to know each other, provide feedback and share new ideas. We’ll be sharing some of the insights and takeaways after the event too, so stay tuned. And if you would like to follow the events online, look out for the #TCsummit tag on Twitter and our updates on Google+.

Posted by Esperanza Navas and Kaspar Szymanski, Search Quality Team

Thursday, September 1, 2011

PDFs in Google search results

Webmaster level: All

Our mission is to organize the world’s information and make it universally accessible and useful. During this ambitious quest, we sometimes encounter non-HTML files such as PDFs, spreadsheets, and presentations. Our algorithms don’t let different filetypes slow them down; we work hard to extract the relevant content and to index it appropriately for our search results. But how do we actually index these filetypes, and—since they often differ so much from standard HTML—what guidelines apply to these files? What if a webmaster doesn’t want us to index them?

Google first started indexing PDF files in 2001 and currently has hundreds of millions of PDF files indexed. We’ve collected the most often-asked questions about PDF indexing; here are the answers:

Q: Can Google index any type of PDF file?
A: Generally we can index textual content (written in any language) from PDF files that use various kinds of character encodings, provided they’re not password protected or encrypted. If the text is embedded as images, we may process the images with OCR algorithms to extract the text. The general rule of thumb is that if you can copy and paste the text from a PDF document into a standard text document, we should be able to index that text.

Q: What happens with the images in PDF files?
A: Currently the images are not indexed. In order for us to index your images, you should create HTML pages for them. To increase the likelihood of us returning your images in our search results, please read the tips in our Help Center.

Q: How are links treated in PDF documents?
A: Generally links in PDF files are treated similarly to links in HTML: they can pass PageRank and other indexing signals, and we may follow them after we have crawled the PDF file. It’s currently not possible to "nofollow" links within a PDF document.

Q: How can I prevent my PDF files from appearing in search results; or if they already do, how can I remove them?
A: The simplest way to prevent PDF documents from appearing in search results is to add an X-Robots-Tag: noindex in the HTTP header used to serve the file. If they’re already indexed, they’ll drop out over time if you use the X-Robots-Tag with the noindex directive. For faster removals, you can use the URL removal tool in Google Webmaster Tools.
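
How the header gets added depends on your web server. As one minimal sketch, assuming files are served by Python's built-in http.server (an assumption made purely for illustration; the handler and function names are hypothetical), the header can be attached to every PDF response like this:

```python
from http.server import SimpleHTTPRequestHandler
from urllib.parse import urlparse


def robots_headers(request_path):
    """Extra response headers for a request path; PDF files get noindex."""
    if urlparse(request_path).path.lower().endswith(".pdf"):
        return {"X-Robots-Tag": "noindex"}
    return {}


class NoIndexPdfHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds X-Robots-Tag: noindex to PDF responses."""

    def end_headers(self):
        for name, value in robots_headers(self.path).items():
            self.send_header(name, value)
        super().end_headers()


# To try it locally (not run here):
#   from http.server import HTTPServer
#   HTTPServer(("", 8000), NoIndexPdfHandler).serve_forever()
print(robots_headers("/docs/report.pdf"))
print(robots_headers("/index.html"))
```

Production servers usually set the same header through their own configuration rather than application code.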

Q: Can PDF files rank highly in the search results?
A: Sure! They’ll generally rank similarly to other webpages. For example, at the time of this post, [mortgage market review], [irs form 2011] or [paracetamol expert report] all return PDF documents that manage to rank highly in our search results, thanks to their content and the way they’re embedded and linked from other webpages.

Q: Is it considered duplicate content if I have a copy of my pages in both HTML and PDF?
A: Whenever possible, we recommend serving a single copy of your content. If this isn’t possible, make sure you indicate your preferred version by, for example, including the preferred URL in your Sitemap or by specifying the canonical version in the HTML or in the HTTP headers of the PDF resource. For more tips, read our Help Center article about canonicalization.
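
Because a PDF has no <head> section, the canonical hint for a PDF travels in an HTTP Link response header instead. The sketch below only builds the header name and value; the URLs and the function name are hypothetical, and attaching the header to the response is left to your server.

```python
def canonical_link_header(canonical_url):
    """Return the (name, value) pair of a Link header carrying rel="canonical"."""
    if "<" in canonical_url or ">" in canonical_url:
        raise ValueError("URL must not contain angle brackets")
    return ("Link", '<%s>; rel="canonical"' % canonical_url)


# Hypothetical case: the HTML page is the preferred version of a PDF copy.
name, value = canonical_link_header("http://www.example.com/white-paper.html")
print("%s: %s" % (name, value))
```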

Q: How can I influence the title shown in search results for my PDF document?
A: We use two main elements to determine the title shown: the title metadata within the file, and the anchor text of links pointing to the PDF file. To give our algorithms a strong signal about the proper title to use, we recommend updating both.

If you want to learn more, watch Matt Cutts’s video about PDF files’ optimization for search, and visit our Help Center for information about the content types we’re able to index. If you have feedback or suggestions, please let us know in the Webmaster Help Forum.

Posted by Gary Illyes, Webmaster Trends Analyst
