Wednesday, May 18, 2011

Troubleshooting Instant Previews in Webmaster Tools

Webmaster level: All

In November, we launched Instant Previews to help users better understand whether a particular result is relevant to their search query. Since launch, our Instant Previews team has been keeping an eye on common complaints and problems related to how pages are rendered for Instant Previews.

When we see issues with preview images, they are frequently due to:
  • Blocked resources due to a robots.txt entry (see the quick check below)
  • Cloaking: Erroneous content being served to the Googlebot user-agent
  • Poor alternative content when Flash is unavailable
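
For the first item on that list, you can do a quick self-check before reaching for the tool using Python's built-in robots.txt parser; the resource URLs below are placeholders for your own stylesheets and images.

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("http://www.example.com/robots.txt")   # your own robots.txt
robots.read()

for resource in ["http://www.example.com/css/main.css",
                 "http://www.example.com/images/header.png"]:
    if not robots.can_fetch("Googlebot", resource):
        print("Blocked for Googlebot:", resource)
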
To help webmasters diagnose these problems, we have a new Instant Preview tool in the Labs section of Webmaster Tools (in English only for now).



Here, you can input the URL of any page on your site. We will then fetch the page from your site and try to render it both as it would display in Chrome and through our Instant Preview renderer. Please keep in mind that both of these renders are done using a recent build of WebKit, which does not include plugins such as Flash or Silverlight, so it's important to consider the value of providing alternative content for these situations. Alternative content can be helpful to search engines, and visitors to your site without the plugin would benefit as well.

Below the renders, you’ll also see automated feedback on problems our system can detect such as missing or roboted resources. And, in the future, we plan to add more informative and timely feedback to help improve your Instant Previews!

Please direct your questions and feedback to the Webmaster Forum.

Tuesday, May 17, 2011

Easier URL removals for site owners

Webmaster Level: All

We recently made a change to the Remove URL tool in Webmaster Tools to eliminate the requirement that the webpage's URL must first be blocked by a site owner before the page can be removed from Google's search results. Because you've already verified ownership of the site, we can eliminate this requirement to make it easier for you, as the site owner, to remove unwanted pages (e.g. pages accidentally made public) from Google's search results.

Removals persist for at least 90 days
When a page’s URL is requested for removal, the request is temporary and persists for at least 90 days. We may continue to crawl the page during the 90-day period but we will not display it in the search results. You can still revoke the removal request at any time during those 90 days. After the 90-day period, the page can reappear in our search results, assuming you haven’t made any other changes that could impact the page’s availability.

Permanent removal
In order to permanently remove a URL, you must ensure that one of the following page blocking methods is implemented for the URL of the page that you want removed:
  • The page returns a 404 (Not found) or 410 (Gone) HTTP response code
  • The page is blocked from crawling by your robots.txt file
  • The page contains a noindex robots meta tag
This will ensure that the page stays out of Google's search results for as long as the page is blocked. If at any time in the future you remove the page blocking method, we may re-crawl and index the page again. For immediate and permanent removal, you can request that a page be removed using the Remove URL tool and then permanently block the page’s URL before the 90-day expiration of the removal request.
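
If you go the 404/410 route, here is a minimal sketch in Python (the paths are hypothetical) of how a server might answer for content you've removed for good:

REMOVED_PATHS = {"/old-press-release", "/accidentally-public-draft"}   # hypothetical paths

def status_for(path):
    # 410 (Gone) signals an intentional, permanent removal; 404 (Not found) works too
    return 410 if path in REMOVED_PATHS else 200

print(status_for("/old-press-release"))   # 410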



For more information about URL removals, see our “URL removal explained” blog series covering this topic. If you still have questions about this change or about URL removal requests in general, please post in our Webmaster Help Forum.

Thursday, May 12, 2011

Website Security for Webmasters

Webmaster level: Intermediate to Advanced

Users are taught to protect themselves from malicious programs by installing sophisticated antivirus software, but often they may also entrust their private information to websites like yours, in which case it’s important to protect their data. It’s also very important to protect your own data; if you have an online store, you don’t want to be robbed.

Over the years companies and webmasters have learned—often the hard way—that web application security is not a joke; we’ve seen user passwords leaked due to SQL injection attacks, cookies stolen with XSS, and websites taken over by hackers due to negligent input validation.

Today we’ll show you some examples of how a web application can be exploited so you can learn from them; for this we’ll use Gruyere, an intentionally vulnerable application we use for security training internally, too. Do not probe others’ websites for vulnerabilities without permission as it may be perceived as hacking; but you’re welcome—nay, encouraged—to run tests on Gruyere.


Client state manipulation - What will happen if I alter the URL?

Let’s say you have an image hosting site and you’re using a PHP script to display the images users have uploaded:

http://www.example.com/showimage.php?imgloc=/garyillyes/kitten.jpg

So what will the application do if I alter the URL to something like this and userpasswords.txt is an actual file?

http://www.example.com/showimage.php?imgloc=/../../userpasswords.txt

Will I get the content of userpasswords.txt?
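
A rough Python sketch of a safer version of that image handler (the directory and function names are made up for illustration): resolve the requested location and refuse anything that escapes the directory you actually serve images from.

import os

IMAGE_ROOT = "/var/www/user-images"   # hypothetical upload directory

def load_image(imgloc):
    full_path = os.path.realpath(os.path.join(IMAGE_ROOT, imgloc.lstrip("/")))
    if not full_path.startswith(IMAGE_ROOT + os.sep):
        # "/../../userpasswords.txt" resolves outside IMAGE_ROOT and is rejected
        raise ValueError("invalid image location")
    with open(full_path, "rb") as f:
        return f.read()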

Another example of client state manipulation is when form fields are not validated. For instance, let’s say you have this form:



It seems that the username of the submitter is stored in a hidden input field. Well, that’s great! Does that mean that if I change the value of that field to another username, I can submit the form as that user? It may very well happen; the user input is apparently not authenticated with, for example, a token which can be verified on the server.
Imagine the situation if that form were part of your shopping cart and I modified the price of a $1000 item to $1, and then placed the order.

Protecting your application against this kind of attack is not easy; take a look at the third part of Gruyere to learn a few tips about how to defend your app.
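
As a starting point, here is a rough Python sketch of two common defenses (all names are made up): keep authoritative values like prices on the server, and sign anything you must round-trip through a hidden field so tampering can be detected.

import hashlib
import hmac

SECRET_KEY = b"keep-this-on-the-server"      # hypothetical server-side secret
PRICES = {"fancy-widget": 1000}              # authoritative prices never come from the form

def order_total(item_id, quantity):
    return PRICES[item_id] * int(quantity)   # ignore any price field the client sent

def sign(value):
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def verify(value, signature):
    # reject the form if the hidden field was edited after we rendered it
    return hmac.compare_digest(sign(value), signature)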

Cross-site scripting (XSS) - User input can’t be trusted



A simple, harmless URL:
http://google-gruyere.appspot.com/611788451095/%3Cscript%3Ealert('0wn3d')%3C/script%3E
But is it truly harmless? If I decode the percent-encoded characters, I get:
<script>alert('0wn3d')</script>

Gruyere, just like many sites with custom error pages, is designed to include the path component in the HTML page. This can introduce security bugs, like XSS, as it introduces user input directly into the rendered HTML page of the web application. You might say, “It’s just an alert box, so what?” The thing is, if I can inject an alert box, I can most likely inject something else, too, and maybe steal your cookies which I could use to sign in to your site as you.
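
The underlying fix is to escape user-controlled text before it goes into HTML. A minimal Python sketch (the markup and function name are made up) for an error page that echoes the requested path:

import html

def not_found_page(requested_path):
    # html.escape turns <script> into &lt;script&gt;, so the browser shows it as text
    return ("<html><body><h1>404 Not found</h1><p>No page at "
            + html.escape(requested_path) + "</p></body></html>")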

Another example is when the stored user input isn’t sanitized. Let’s say I write a comment on your blog; the comment is simple:
<a href="javascript:alert('0wn3d')">Click here to see a kitten</a>

If other users click on my innocent link, I have their cookies:



You can learn how to find XSS vulnerabilities in your own web app and how to fix them in the second part of Gruyere; or, if you’re an advanced developer, take a look at the automatic escaping features in template systems we blogged about on our Online Security blog.
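
For stored input like the blog comment above, the same escaping applies, plus an allowlist for link targets so a javascript: URL never reaches your readers. A rough sketch (the function name is made up):

import html
from urllib.parse import urlparse

def safe_comment_link(url, link_text):
    if urlparse(url).scheme not in ("http", "https"):
        return html.escape(link_text)            # drop the link, keep the text
    return '<a href="%s">%s</a>' % (html.escape(url), html.escape(link_text))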

Cross-site request forgery (XSRF) - Should I trust requests from evil.com?

Oops, a broken picture. It can’t be dangerous (it’s broken, after all), which means the URL of the image returns a 404 or is simply malformed. But is that true in every case?

No, it’s not! You can specify any URL as an image source, regardless of its content type. It can be an HTML page, a JavaScript file, or some other potentially malicious resource. In this case the image source was a simple page’s URL:



That page will only work if I’m logged in and I have some cookies set. Since I was actually logged in to the application, when the browser tried to fetch the image by accessing the image source URL, it also deleted my first snippet. This doesn’t sound particularly dangerous, but if I’m a bit familiar with the app, I could also invoke a URL which deletes a user’s profile or lets admins grant permissions for other users.

To protect your app against XSRF, you should not allow state-changing actions to be triggered via GET; the POST method was invented for this kind of state-changing request. That change alone would have mitigated the attack above, but it's usually not enough on its own: you also need to include an unpredictable value (a token) in all state-changing requests to prevent XSRF. Please head to Gruyere if you want to learn more about XSRF.
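
Here is a rough Python sketch of that unpredictable value (session handling is assumed to exist, and the names are made up): issue a per-session token, embed it in every state-changing form, and reject any POST that doesn't echo it back.

import hmac
import secrets

def issue_csrf_token(session):
    session["csrf_token"] = secrets.token_hex(32)    # embed this in every state-changing form
    return session["csrf_token"]

def check_csrf_token(session, submitted_token):
    expected = session.get("csrf_token", "")
    if not submitted_token or not hmac.compare_digest(expected, submitted_token):
        raise PermissionError("possible XSRF: token missing or wrong")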

Cross-site script inclusion (XSSI) - All your script are belong to us

Many sites today can dynamically update a page's content via asynchronous JavaScript requests that return JSON data. Sometimes, JSON can contain sensitive data, and if the correct precautions are not in place, it may be possible for an attacker to steal this sensitive information.

Let’s imagine the following scenario: I have created a standard HTML page and send you the link; since you trust me, you visit the link I sent you. The page contains only a few lines:
<script>function _feed(s) {alert("Your private snippet is: " + s['private_snippet']);}</script><script src="http://google-gruyere.appspot.com/611788451095/feed.gtl"></script>


Since you’re signed in to Gruyere and you have a private snippet, you’ll see an alert box on my page informing you about the contents of your snippet. As always, if I managed to fire up an alert box, I can do whatever else I want; in this case it was a simple snippet, but it could have been your biggest secret, too.

It’s not too hard to defend your app against XSSI, but it still requires careful thinking. You can use tokens as explained in the XSRF section, set your script to answer only POST requests, or start the JSON response with a prefix that prevents it from being executed when it's pulled in via a <script> tag.
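
A rough Python sketch of the prefix defense (the prefix shown is a common convention, not the only valid choice): the feed endpoint prepends a string that breaks <script src=...> inclusion, and your own client code strips it before parsing.

import json

XSSI_PREFIX = ")]}',\n"   # any prefix that breaks execution as JavaScript will do

def render_feed(private_snippet):
    return XSSI_PREFIX + json.dumps({"private_snippet": private_snippet})

def parse_feed(raw_response):
    if raw_response.startswith(XSSI_PREFIX):
        raw_response = raw_response[len(XSSI_PREFIX):]
    return json.loads(raw_response)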

SQL Injection - Still think user input is safe?

What will happen if I try to sign in to your app with a username like
JohnDoe'; DROP TABLE members;--

While this specific example won’t expose user data, it can cause great headaches because it has the potential to completely remove the SQL table where your app stores information about members.

Generally, you can protect your app from SQL injection with proactive thinking and input validation. First, are you sure the SQL user needs to have permission to execute “DROP TABLE members”? Wouldn’t it be enough to grant only SELECT rights? By setting the SQL user’s permissions carefully, you can avoid painful experiences and a lot of trouble. You might also want to configure error reporting in such a way that the database and its tables’ names aren’t exposed in the case of a failed query.
Second, as we learned in the XSS case, never trust user input: what looks like a login form to you looks like a potential doorway to an attacker. Always sanitize and quote-safe the input that will be stored in a database, and whenever possible use prepared (parameterized) statements, which are available in most database programming interfaces.
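
For example, with Python's built-in sqlite3 module (any DB-API driver works the same way), the username travels as a bound parameter rather than as part of the SQL text, so it is treated as data, not as a command:

import sqlite3

def find_member(conn, username):
    cur = conn.execute("SELECT id, username FROM members WHERE username = ?",
                       (username,))
    return cur.fetchone()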

Knowing how web applications can be exploited is the first step in understanding how to defend them. In light of this, we encourage you to take the Gruyere course, take other web security courses from the Google Code University and check out skipfish if you're looking for an automated web application security testing tool. If you have more questions please post them in our Webmaster Help Forum.

Wednesday, May 11, 2011

Introducing the Google Webmaster Team

We’re pleased to introduce the Google Webmaster Team as contributors to the Webmaster Central Blog. As the team responsible for tens of thousands of Google’s informational web pages, they’re here to offer tips and advice based on their experiences as hands-on webmasters.

Back in the 1990s, anyone who maintained a website called themselves a “webmaster” regardless of whether they were a designer, developer, author, system administrator, or someone who had just stumbled across GeoCities and created their first web page. As the technologies changed over the years, so did the roles and skills of those managing websites.

Around 20 years after the word was first used, we still refer to ourselves as the Google Webmaster Team because it’s the only term that really covers the wide variety of roles that we have on our team. Although most of us have solid knowledge of HTML, CSS, JavaScript and other web technologies, we also have specialists in design, development, user experience, information architecture, system administration, and project management.


Part of the Google Webmaster Team, Mountain View

In contrast to the Google Webmaster Central Team—which mainly focuses on helping webmasters outside of Google understand web search and how things like crawling and indexing affect their sites—our team is responsible for designing, implementing, optimizing and maintaining Google’s corporate pages, informational product pages, landing pages for marketing campaigns, and our error page. Our team also develops internal tools to increase our productivity and help to maintain the thousands of HTML pages that we own.

We’re working hard to follow, challenge and evolve best practices and web standards to ensure that all our new pages are produced to the highest quality and provide the best user experience, and we’re constantly evaluating and updating our legacy pages to ensure their deprecated HTML isn’t just left to rot.

We want to share our work and experiences with other webmasters, so we recently launched our @GoogleWebTeam account on Twitter to keep our followers updated on the latest news about our projects, web standards, and anything else which may be of interest to other webmasters, web designers and web developers. We’ll be posting here on the Webmaster Central Blog when we want to share anything longer than 140 characters.

Before we share more details about our processes and experiences, please let us know if there’s anything you’d like us to specifically cover by leaving a comment here or by tweeting @GoogleWebTeam.

Page Speed Online has a shiny new API

Webmaster level: intermediate

A few weeks ago, we introduced Page Speed Online, a web-based performance analysis tool that gives developers optimization suggestions. Almost immediately, developers asked us to make an API available to integrate into other tools and their regression testing suites. We were happy to oblige.

Today, as part of Google I/O, we are excited to introduce the Page Speed Online API as part of the Google APIs. With this API, developers now have the ability to integrate performance analysis very simply in their command-line tools and web performance dashboards.

We have provided a getting started guide that helps you get up and running quickly, understand the API, and start monitoring the performance improvements you make to your web pages. In the request, you can also specify whether you’d like to see mobile or desktop analysis, and get Page Speed suggestions in any of the 40 languages we support, giving the vast majority of developers API access in their native or preferred language.
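
To give a feel for the shape of a request, here is an illustrative Python sketch only; the exact endpoint, parameter names and response fields are spelled out in the getting started guide, so treat everything below as placeholders rather than the authoritative interface.

import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "url": "http://www.example.com/",   # the page to analyze
    "strategy": "mobile",               # mobile or desktop analysis
    "locale": "en",                     # one of the supported languages
    "key": "YOUR_API_KEY",              # your Google APIs key
})
endpoint = "https://www.googleapis.com/pagespeedonline/v1/runPagespeed"   # confirm in the guide
with urllib.request.urlopen(endpoint + "?" + params) as response:
    result = json.load(response)
print(json.dumps(result, indent=2)[:500])   # peek at the returned analysis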

We’re also pleased to share that the WordPress plugin W3 Total Cache now uses the Page Speed Online API to provide Page Speed suggestions to WordPress users, right in the WordPress dashboard. “The Page Speed tool itself provides extremely pointed and valuable insight into performance pitfalls. Providing that tool via an API has allowed me to directly correlate that feedback with actionable solutions that W3 Total Cache provides,” said Frederick Townes, CTO of Mashable and author of W3 Total Cache.

Take the Page Speed Online API for a spin and send us feedback on our mailing list. We’d love to hear your experience integrating the new Page Speed Online API.

Andrew Oates is a Software Engineer on the Page Speed Team in Google's Cambridge, Massachusetts office. You can find him in the credits for the Pixar film Up.

Richard Rabbat is the Product Management Lead on the "Make the Web Faster" initiative. He has launched Page Speed, mod_pagespeed and WebP. At Google since 2006, Richard works with engineering teams across the world.

Friday, May 6, 2011

More guidance on building high-quality sites

Webmaster level: All

In recent months we’ve been especially focused on helping people find high-quality sites in Google’s search results. The “Panda” algorithm change has improved rankings for a large number of high-quality websites, so most of you reading have nothing to be concerned about. However, for the sites that may have been affected by Panda we wanted to provide additional guidance on how Google searches for high-quality sites.

Our advice for publishers continues to be to focus on delivering the best possible user experience on your websites and not to focus too much on what you think are Google’s current ranking algorithms or signals. Some publishers have fixated on our prior Panda algorithm change, but Panda was just one of roughly 500 search improvements we expect to roll out this year. In fact, since we launched Panda, we've rolled out over a dozen additional tweaks to our ranking algorithms, and some sites have incorrectly assumed that changes in their rankings were related to Panda. Search is a complicated and evolving art and science, so rather than focusing on specific algorithmic tweaks, we encourage you to focus on delivering the best possible experience for users.

What counts as a high-quality site?

Our site quality algorithms are aimed at helping people find "high-quality" sites by reducing the rankings of low-quality content. The recent "Panda" change tackles the difficult task of algorithmically assessing website quality. Taking a step back, we wanted to explain some of the ideas and research that drive the development of our algorithms.

Below are some questions that one could use to assess the "quality" of a page or an article. These are the kinds of questions we ask ourselves as we write algorithms that attempt to assess site quality. Think of it as our take on encoding what we think our users want.

Of course, we aren't disclosing the actual ranking signals used in our algorithms because we don't want folks to game our search results; but if you want to step into Google's mindset, the questions below provide some guidance on how we've been looking at the issue:

  • Would you trust the information presented in this article?
  • Is this article written by an expert or enthusiast who knows the topic well, or is it more shallow in nature?
  • Does the site have duplicate, overlapping, or redundant articles on the same or similar topics with slightly different keyword variations?
  • Would you be comfortable giving your credit card information to this site?
  • Does this article have spelling, stylistic, or factual errors?
  • Are the topics driven by genuine interests of readers of the site, or does the site generate content by attempting to guess what might rank well in search engines?
  • Does the article provide original content or information, original reporting, original research, or original analysis?
  • Does the page provide substantial value when compared to other pages in search results?
  • How much quality control is done on content?
  • Does the article describe both sides of a story?
  • Is the site a recognized authority on its topic?
  • Is the content mass-produced by or outsourced to a large number of creators, or spread across a large network of sites, so that individual pages or sites don’t get as much attention or care?
  • Was the article edited well, or does it appear sloppy or hastily produced?
  • For a health related query, would you trust information from this site?
  • Would you recognize this site as an authoritative source when mentioned by name?
  • Does this article provide a complete or comprehensive description of the topic?
  • Does this article contain insightful analysis or interesting information that is beyond obvious?
  • Is this the sort of page you’d want to bookmark, share with a friend, or recommend?
  • Does this article have an excessive amount of ads that distract from or interfere with the main content?
  • Would you expect to see this article in a printed magazine, encyclopedia or book?
  • Are the articles short, unsubstantial, or otherwise lacking in helpful specifics?
  • Are the pages produced with great care and attention to detail vs. less attention to detail?
  • Would users complain when they see pages from this site?

Writing an algorithm to assess page or site quality is a much harder task, but we hope the questions above give some insight into how we try to write algorithms that distinguish higher-quality sites from lower-quality sites.

What you can do

We've been hearing from many of you that you want more guidance on what you can do to improve your rankings on Google, particularly if you think you've been impacted by the Panda update. We encourage you to keep questions like the ones above in mind as you focus on developing high-quality content rather than trying to optimize for any particular Google algorithm.

One other specific piece of guidance we've offered is that low-quality content on some parts of a website can impact the whole site’s rankings, and thus removing low quality pages, merging or improving the content of individual shallow pages into more useful pages, or moving low quality pages to a different domain could eventually help the rankings of your higher-quality content.

We're continuing to work on additional algorithmic iterations to help webmasters operating high-quality sites get more traffic from search. As you continue to improve your sites, rather than focusing on one particular algorithmic tweak, we encourage you to ask yourself the same sorts of questions we ask when looking at the big picture. This way your site will be more likely to rank well for the long-term. In the meantime, if you have feedback, please tell us through our Webmaster Forum. We continue to monitor threads on the forum and pass site info on to the search quality team as we work on future iterations of our ranking algorithms.

Flash support in Instant Previews

Webmaster level: All

With Instant Previews, users can see a snapshot of a search result before clicking on it. We’ve made a number of improvements to the feature since its introduction last November, and if you own a site, one of the most relevant changes for you is that Instant Previews now supports Flash.



An Instant Preview with rich content rendered


In most cases, when the preview for a page is generated through our regular crawl, we will now render a snapshot of any Flash components on the page. This will replace the "puzzle piece" icon that previously appeared to indicate Flash components, and should improve the accuracy of the previews.

However, for pages that are fetched on demand by the "Google Web Preview" user-agent, we will generate a preview without Flash in order to minimize latency. In these cases the preview will appear as if the page were visited by someone using a browser without Flash enabled, and "Install Flash" messages may appear in the preview, depending on how your website handles users without Flash.

To improve your previews for these on-demand renders, here are some guidelines for using Flash on your site:
  • Make sure that your site has a reasonable, seamless experience for visitors without Flash. This may involve creating HTML-only equivalents for your Flash-based content that will automatically be shown to visitors who can't view Flash. Providing a good experience for this case will improve your preview and make your visitors happier.

  • If Flash components are rendering but appear as loading screens instead of actual content, try reducing the loading time for the component. This makes it more likely we'll render it properly.

  • If you have Flash videos on your site, consider submitting a Video Sitemap which helps us to generate thumbnails for your videos in Instant Previews.

  • If most of the page is rendering properly but you still see puzzle pieces appearing for some smaller components, these may be fixed in future crawls of your page.
If you have additional questions, please feel free to post them in our Webmaster Help Forum.

As always, we'll keep you updated as we continue to make improvements to Instant Previews.

Monday, May 2, 2011

Do 404s hurt my site?

Webmaster level: Beginner/Intermediate

So there you are, minding your own business, using Webmaster Tools to check out how awesome your site is... but, wait! The Crawl errors page is full of 404 (Not found) errors! Is disaster imminent??


Fear not, my young padawan. Let’s take a look at 404s and how they do (or do not) affect your site:

Q: Do the 404 errors reported in Webmaster Tools affect my site’s ranking?
A:
404s are a perfectly normal part of the web; the Internet is always changing, new content is born, old content dies, and when it dies it (ideally) returns a 404 HTTP response code. Search engines are aware of this; we have 404 errors on our own sites, as you can see above, and we find them all over the web. In fact, we actually prefer that, when you get rid of a page on your site, you make sure that it returns a proper 404 or 410 response code (rather than a “soft 404”). Keep in mind that in order for our crawler to see the HTTP response code of a URL, it has to be able to crawl that URL—if the URL is blocked by your robots.txt file we won’t be able to crawl it and see its response code. The fact that some URLs on your site no longer exist / return 404s does not affect how your site’s other URLs (the ones that return 200 (Successful)) perform in our search results.

Q: So 404s don’t hurt my website at all?
A:
If some URLs on your site 404, this fact alone does not hurt you or count against you in Google’s search results. However, there may be other reasons that you’d want to address certain types of 404s. For example, if some of the pages that 404 are pages you actually care about, you should look into why we’re seeing 404s when we crawl them! If you see a misspelling of a legitimate URL (www.example.com/awsome instead of www.example.com/awesome), it’s likely that someone intended to link to you and simply made a typo. Instead of returning a 404, you could 301 redirect the misspelled URL to the correct URL and capture the intended traffic from that link. You can also make sure that, when users do land on a 404 page on your site, you help them find what they were looking for rather than just saying “404 Not found."
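
A minimal sketch of capturing that typo (the paths come from the example above, the handler shape is made up): map the misspelled URL to the real one with a 301.

REDIRECTS = {"/awsome": "/awesome"}   # misspelled URL -> correct URL

def handle_request(path, serve_page):
    if path in REDIRECTS:
        # 301 (Moved Permanently) sends users, links and crawlers to the right place
        return 301, {"Location": REDIRECTS[path]}, b""
    return serve_page(path)           # serve_page: your normal handler, assumed to exist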

Q: Tell me more about “soft 404s.”
A:
A soft 404 is when a web server returns a response code other than 404 (or 410) for a URL that doesn’t exist. A common example is when a site owner wants to return a pretty 404 page with helpful information for his users, and thinks that in order to serve content to users he has to return a 200 response code. Not so! You can return a 404 response code while serving whatever content you want. Another example is when a site redirects any unknown URLs to their homepage instead of returning 404s. Both of these cases can have negative effects on our understanding and indexing of your site, so we recommend making sure your server returns the proper response codes for nonexistent content. Keep in mind that just because a page says “404 Not Found,” doesn’t mean it’s actually returning a 404 HTTP response code—use the Fetch as Googlebot feature in Webmaster Tools to double-check. If you don’t know how to configure your server to return the right response codes, check out your web host’s help documentation.
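
In other words, a friendly error page and a proper status code are not mutually exclusive. A minimal sketch (the handler shape and markup are made up):

def not_found():
    body = ("<html><body><h1>Sorry, that page doesn't exist</h1>"
            "<p>Try the <a href='/sitemap'>site map</a> or the search box.</p>"
            "</body></html>")
    # helpful content for people, and still a real 404 status code for crawlers
    return 404, {"Content-Type": "text/html"}, body.encode("utf-8")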

Q: How do I know whether a URL should 404, or 301, or 410?
A:
When you remove a page from your site, think about whether that content is moving somewhere else, or whether you no longer plan to have that type of content on your site. If you’re moving that content to a new URL, you should 301 redirect the old URL to the new URL—that way when users come to the old URL looking for that content, they’ll be automatically redirected to something relevant to what they were looking for. If you’re getting rid of that content entirely and don’t have anything on your site that would fill the same user need, then the old URL should return a 404 or 410. Currently Google treats 410s (Gone) the same as 404s (Not found), so it’s immaterial to us whether you return one or the other.

Q: Most of my 404s are for bizarro URLs that never existed on my site. What’s up with that? Where did they come from?
A:
If Google finds a link somewhere on the web that points to a URL on your domain, it may try to crawl that link, whether any content actually exists there or not; and when it does, your server should return a 404 if there’s nothing there to find. These links could be caused by someone making a typo when linking to you, some type of misconfiguration (if the links are automatically generated, e.g. by a CMS), or by Google’s increased efforts to recognize and crawl links embedded in JavaScript or other embedded content; or they may be part of a quick check from our side to see how your server handles unknown URLs, to name just a few. If you see 404s reported in Webmaster Tools for URLs that don’t exist on your site, you can safely ignore them. We don’t know which URLs are important to you vs. which are supposed to 404, so we show you all the 404s we found on your site and let you decide which, if any, require your attention.

Q: Someone has scraped my site and caused a bunch of 404s in the process. They’re all “real” URLs with other code tacked on, like http://www.example.com/images/kittens.jpg" width="100" height="300" alt="kittens"/></a... Will this hurt my site?
A:
Generally you don’t need to worry about “broken links” like this hurting your site. We understand that site owners have little to no control over people who scrape their site, or who link to them in strange ways. If you’re a whiz with the regex, you could consider redirecting these URLs as described here, but generally it’s not worth worrying about. Remember that you can also file a takedown request when you believe someone is stealing original content from your website.

Q: Last week I fixed all the 404s that Webmaster Tools reported, but they’re still listed in my account. Does this mean I didn’t fix them correctly? How long will it take for them to disappear?
A:
Take a look at the ‘Detected’ column on the Crawl errors page—this is the most recent date on which we detected each error. If the date(s) in that column are from before the time you fixed the errors, that means we haven’t encountered these errors since that date. If the dates are more recent, it means we’re continuing to see these 404s when we crawl.

After implementing a fix, you can check whether our crawler is seeing the new response code by using Fetch as Googlebot. Test a few URLs and, if they look good, these errors should soon start to disappear from your list of Crawl errors.
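
If you'd like to check from your own machine as well, a few lines of Python will show the raw status code a URL returns; Fetch as Googlebot then confirms what our crawler sees.

import http.client
from urllib.parse import urlparse

def status_of(url):
    parts = urlparse(url)
    conn = http.client.HTTPConnection(parts.netloc)
    conn.request("HEAD", parts.path or "/")   # HEAD is enough to read the status line
    return conn.getresponse().status

print(status_of("http://www.example.com/removed-page"))   # e.g. 404 or 410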

Q: Can I use Google’s URL removal tool to make 404 errors disappear from my account faster?
A:
No; the URL removal tool removes URLs from Google’s search results, not from your Webmaster Tools account. It’s designed for urgent removal requests only, and using it isn’t necessary when a URL already returns a 404, as such a URL will drop out of our search results naturally over time. See the bottom half of this blog post for more details on what the URL removal tool can and can’t do for you.

Still want to know more about 404s? Check out 404 week from our blog, or drop by our Webmaster Help Forum.