Wednesday, October 25, 2006

Update to our webmaster guidelines

As the web continues to change and evolve, our algorithms change right along with it. Recently, as a result of one of those algorithmic changes, we've modified our webmaster guidelines. Previously, these stated:


Don't use "&id=" as a parameter in your URLs, as we don't include these pages in our index.

However, we've recently removed that technical guideline, and we now index URLs that contain that parameter. So if your site uses a dynamic structure that generates that parameter, don't worry about rewriting it -- we'll accept it just fine as is. Keep in mind, however, that dynamic URLs with a large number of parameters may be problematic for search engine crawlers in general, so rewriting dynamic URLs into user-friendly versions is always a good practice when that option is available to you. If you can, keep the number of URL parameters to one or two; that may make it more likely that search engines will crawl your dynamic URLs.

Thursday, October 19, 2006

Googlebot activity reports

The webmaster tools team has a very exciting mission: we dig into our logs, find as much useful information as possible, and pass it on to you, the webmasters. Our reward is that you more easily understand what Google sees, and why some pages don't make it to the index.

The latest batch of information that we've put together for you is the amount of traffic between Google and a given site. We show you the number of requests, the number of kilobytes downloaded (yes, yes, I know that tech-savvy webmasters can usually dig this out, but our new charts make it really easy to see at a glance), and the average document download time. You can see this information in chart form, as well as in hard numbers (the maximum, minimum, and average).

For instance, here's the number of pages Googlebot has crawled on the Webmaster Central blog over the last 90 days. The maximum number of pages Googlebot has crawled in one day is 24 and the minimum is 2. That makes sense, because the blog was launched less than 90 days ago, and the chart shows that the number of pages crawled per day has increased over time. The number of pages crawled is sometimes more than the total number of pages in the site -- especially if the same page can be accessed via several URLs. For example, http://googlewebmastercentral.blogspot.com/2006/10/learn-more-about-googlebots-crawl-of.html and http://googlewebmastercentral.blogspot.com/2006/10/learn-more-about-googlebots-crawl-of.html#links are different URLs, but they point to the same page (the second points to an anchor within the page).


And here's the average number of kilobytes downloaded from this blog each day. As you can see, as the site has grown over the last two and a half months, the average number of kilobytes downloaded per day has increased as well.


The first two reports can help you diagnose the impact that changes in your site may have on its coverage. If you overhaul your site and dramatically reduce the number of pages, you'll likely notice a drop in the number of pages that Googlebot accesses.

The average document download time can help pinpoint subtle networking problems. If the average time spikes, you might have network slowdowns or bottlenecks that you should investigate. Here's the report for this blog, which shows that we had a short spike in early September (the maximum time was 1057 ms), but it quickly returned to a normal level, so things look OK now.

In general, the load time of a page doesn't affect its ranking, but we wanted to provide this information because it can help you spot problems. We hope you find this data as useful as we do!

Tuesday, October 17, 2006

Learn more about Googlebot's crawl of your site and more!

We've added a few new features to webmaster tools and invite you to check them out.

Googlebot activity reports
Check out these cool charts! We show you the number of pages Googlebot's crawled from your site per day, the number of kilobytes of data Googlebot's downloaded per day, and the average time it took Googlebot to download pages. We show each of these for the last 90 days. Stay tuned for more information about this data and how you can use it to pinpoint issues with your site.

Crawl rate control
Googlebot uses sophisticated algorithms that determine how much to crawl each site. Our goal is to crawl as many pages from your site as we can on each visit without overwhelming your server's bandwidth.

We've been conducting a limited test of a new feature that enables you to provide us with information about how we crawl your site. Today, we're making this tool available to everyone. You can access this tool from the Diagnostic tab. If you'd like Googlebot to slow down its crawl of your site, simply choose the Slower option.

If we feel your server could handle the additional bandwidth, and we can crawl your site more, we'll let you know and offer the option for a faster crawl.

If you request a changed crawl rate, the change will last for 90 days. If you'd like to keep the changed rate after that, you can simply return to webmaster tools and make the change again.


Enhanced image search
You can now opt in to enhanced image search for the images on your site, which enables tools such as Google Image Labeler to associate the images on your site with labels, improving the indexing and search quality of those images. After you've opted in, you can opt out at any time.

Number of URLs submitted
Recently at SES San Jose, a webmaster asked me if we could show the number of URLs we find in a Sitemap. He said that he generates his Sitemaps automatically and he'd like confirmation that the number he thinks he generated is the same number we received. We thought this was a great idea. Simply access the Sitemaps tab to see the number of URLs we found in each Sitemap you've submitted.
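As an aside, if you generate your Sitemaps automatically, you can do the same sanity check locally before comparing against the number we report. Here's a minimal Python sketch (our own illustration, not an official tool) that counts the <loc> entries in a Sitemap, assuming it uses the Sitemap 0.84 namespace and a hypothetical local file name:

import xml.etree.ElementTree as ET

# Namespace used by the Sitemap 0.84 schema.
NS = "{http://www.google.com/schemas/sitemap/0.84}"

def count_sitemap_urls(path):
    """Count the URL entries in a local Sitemap file."""
    root = ET.parse(path).getroot()
    return len(root.findall(NS + "url/" + NS + "loc"))

print(count_sitemap_urls("sitemap.xml"))  # hypothetical file name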

As always, we hope you find these updates useful and look forward to hearing what you think.

Wednesday, October 11, 2006

Got a website? Get gadgets.

Google Gadgets are miniature objects that offer cool and dynamic content -- they can be games, news clips, weather reports, maps, or most anything you can dream up. They've been around for a while, but their reach got a lot broader last week when we made it possible for anyone to add gadgets to their own webpages. For instance, here's a flight status tracker that can be placed on any page on the web for free.

Anyone can search for gadgets to add to their own webpage for free in our directory of gadgets for your webpage. To put a gadget on your page, just pick the gadget you like, set your preferences, and copy-and-paste the HTML that is generated for you onto your own page.

Creating gadgets for others isn't hard, either, and it can be a great way to get your content in front of people while they're visiting Google or other sites. Here are a few suggestions for distributing your own content on the Google homepage or other pages across the web:

* Create a Google Gadget for distribution across the web. Gadgets can be most anything, from simple HTML to complex applications. It's easy to experiment with gadgets -- anyone with even a little bit of web design experience can make a simple one (even me!), and more advanced programmers can create really snazzy, complex ones. But remember, it's also quick and easy for people to delete gadgets or add new ones to their own pages. To help you make sure your gadget will be popular across the web, we provide a few guidelines you can use to create gadgets. The more often folks find your content to be useful, the longer they'll keep your gadget on their pages, and the more often they'll visit your site.

* If your website has a feed, visitors can put snippets of your content on their own Google homepages quickly and easily, and you don't even need to develop a gadget. However, you will be able to customize their experience much more fully with a gadget than with a feed.

* By putting the "Add to Google" button in a prominent spot on your site, you can increase the reach of your content, because visitors who click to add your gadget or feed to Google can see your content each time they visit the Google homepage. Promoting your own gadget or feed can also increase its popularity, which contributes to a higher ranking in the directory of gadgets for the Google personalized homepage.

Monday, October 9, 2006

Multiple Sitemaps in the same directory

We've gotten a few questions about whether you can put multiple Sitemaps in the same directory. Yes, you can!

You might want to have multiple Sitemap files in a single directory for a number of reasons. For instance, if you have an auction site, you might want to have a daily Sitemap with new auction offers and a weekly Sitemap with less time-sensitive URLs. Or you could generate a new Sitemap every day with new offers, so that the list of Sitemaps grows over time. Either of these solutions works just fine.

Or, here's another sample scenario: Suppose you're a provider that supports multiple web shops, and they share a similar URL structure differentiated by a parameter. For example:

http://example.com/stores/home?id=1
http://example.com/stores/home?id=2
http://example.com/stores/home?id=3

Since they're all in the same directory, it's fine by our rules to put the URLs for all of the stores into a single Sitemap, under http://example.com/ or http://example.com/stores/. However, some webmasters may prefer to have separate Sitemaps for each store, such as:

http://example.com/stores/store1_sitemap.xml
http://example.com/stores/store2_sitemap.xml
http://example.com/stores/store3_sitemap.xml

As long as all URLs listed in the Sitemap are at the same location as the Sitemap or in a subdirectory of it (in the above example, http://example.com/stores/ or perhaps http://example.com/stores/catalog), it's fine for multiple Sitemaps to live in the same directory (as many as you want!). The important thing is that Sitemaps not contain URLs from parent directories or completely different directories -- if that happens, we can't be sure that the submitter controls the URL's directory, so we can't trust the metadata.
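To make the scope rule concrete, here's a small Python sketch (just an illustration, not how our pipeline is implemented) that tests whether a URL is eligible for a given Sitemap:

import posixpath
from urllib.parse import urlsplit

def in_sitemap_scope(sitemap_url, page_url):
    """True if page_url is in the Sitemap's directory or a subdirectory of it."""
    s, p = urlsplit(sitemap_url), urlsplit(page_url)
    if (s.scheme, s.netloc) != (p.scheme, p.netloc):
        return False  # different protocol or host is out of scope
    sitemap_dir = posixpath.dirname(s.path) + "/"
    return p.path.startswith(sitemap_dir)

sitemap = "http://example.com/stores/store1_sitemap.xml"
print(in_sitemap_scope(sitemap, "http://example.com/stores/home?id=1"))  # True
print(in_sitemap_scope(sitemap, "http://example.com/about.html"))        # False: parent directory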

The above Sitemaps could also be collected into a single Sitemap index file and easily be submitted via Google webmaster tools. For example, you could create http://example.com/stores/sitemap_index.xml as follows:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
  <sitemap>
    <loc>http://example.com/stores/store1_sitemap.xml</loc>
    <lastmod>2006-10-01T18:23:17+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://example.com/stores/store2_sitemap.xml</loc>
    <lastmod>2006-10-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://example.com/stores/store3_sitemap.xml</loc>
    <lastmod>2006-10-05</lastmod>
  </sitemap>
</sitemapindex>

Then simply add the index file to your account, and you'll be able to see any errors for each of the child Sitemaps.

If each store includes more than 50,000 URLs (the maximum number for a single Sitemap), you would need to have multiple Sitemaps for each store. In that case, you may want to create a Sitemap index file for each store that lists the Sitemaps for that store. For instance:

http://example.com/stores/store1_sitemapindex.xml
http://example.com/stores/store2_sitemapindex.xml
http://example.com/stores/store3_sitemapindex.xml

Since Sitemap index files can't contain other index files, you would need to submit each Sitemap index file to your account separately.
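If you script your Sitemap generation, the splitting logic is straightforward. Here's a rough Python sketch under the assumptions above; the file names, the example URL list, and the http://example.com/stores/ prefix are illustrative only:

import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"
MAX_URLS = 50000  # maximum number of URLs allowed in a single Sitemap

def write_store_sitemaps(store, urls):
    """Write one or more Sitemaps for a store, plus an index listing them."""
    names = []
    for i in range(0, len(urls), MAX_URLS):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in urls[i:i + MAX_URLS]:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        name = "%s_sitemap%d.xml" % (store, i // MAX_URLS + 1)
        ET.ElementTree(urlset).write(name, encoding="UTF-8", xml_declaration=True)
        names.append(name)
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for name in names:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = \
            "http://example.com/stores/" + name
    ET.ElementTree(index).write(store + "_sitemapindex.xml",
                                encoding="UTF-8", xml_declaration=True)

# 120,000 URLs -> three Sitemaps plus one index file for this store.
write_store_sitemaps("store1", ["http://example.com/stores/home?id=1&page=%d" % n
                                for n in range(120000)])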

Whether you list all URLs in a single Sitemap or in multiple Sitemaps (in the same directory or in different directories) is simply a matter of what's easiest for you to maintain. We treat the URLs equally under each of these methods of organization.

Thursday, September 28, 2006

Fresher query stats

Query stats in webmaster tools provide information about the search queries that most often return your site in the results. You can view this information by a variety of search types (such as web search, mobile search, or image search) and countries. We show you the top search types and locations for your site. You can access these stats by selecting a verified site in your account and then choosing Query stats from the Statistics tab.


If you've checked your site's query stats lately, you may have noticed that they're changing more often than they used to. This is because we recently changed how frequently we calculate them. Previously, we showed data averaged over a period of three weeks; now, we show data averaged over a period of one week. This gives you fresher stats that more accurately reflect the current queries that return your site in the results. We update these stats weekly, generally each Monday, so if you'd like to keep a history of the top queries for your site week by week, you can simply download the data each week.

How we calculate query stats
Some of you have asked how we calculate query stats.

These stats are based on the results that searchers actually see. For instance, say a search for [Britney Spears] returns your site at position 21, which is on the third page of the results. And say 1000 people searched for [Britney Spears] during the course of a week (in reality, a few more people than that search for her name, but just go with me for this example). 600 of those people only looked at the first page of results, and the other 400 browsed to at least the third page. That means your site was seen by 400 searchers. Even though your site was at position 21 for all 1000 searchers, only those 400 are counted for purposes of this calculation.

Both top search queries and top search query clicks are ordered by these absolute numbers -- that is, by the queries that most often returned your site where searchers could see it. For instance, going back to that familiar [Britney Spears] query -- 400 searchers saw your site in the results. Now, maybe your site isn't really about Britney Spears -- it's more about Buffy the Vampire Slayer. And say Google received 50 queries for [Buffy the Vampire Slayer] in the same week, and your site was returned in the results at position 2, so all 50 of those searchers saw your site in the results. In this example, Britney Spears would show as a top search query above Buffy the Vampire Slayer (because your site was seen by 400 searchers for Britney but only 50 for Buffy).

The same is true of top search query clicks. If 100 of the Britney searchers clicked on your site in the search results and all 50 of the Buffy searchers clicked on your site in the search results, Britney would show as a top search query click above Buffy.

At times, this may cause some of the query stats we show you to seem unusual. If your site is returned for a very high-traffic query, then even if only a small percentage of those searchers click on your site, the total number of clicks may still be higher for that query than for a lower-traffic query with a much higher clickthrough rate.
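To put it another way, the ordering is by absolute counts, not by percentages. Here's a toy Python illustration using the numbers from the example above:

# Toy bookkeeping for the example above: "seen" is how many searchers
# browsed far enough to see your site; "clicks" is how many clicked it.
stats = {
    "britney spears": {"seen": 400, "clicks": 100},
    "buffy the vampire slayer": {"seen": 50, "clicks": 50},
}

top_queries = sorted(stats, key=lambda q: stats[q]["seen"], reverse=True)
top_clicks = sorted(stats, key=lambda q: stats[q]["clicks"], reverse=True)

print(top_queries)  # Britney first: seen by 400 searchers vs. 50
print(top_clicks)   # Britney first again: 100 clicks beat 50, even though
                    # Buffy's clickthrough rate (100%) is far higher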

The average top position for top search queries is the position of the page on your site that ranks most highly for the query. The average top position for top search query clicks is the position of the page on your site that searchers clicked on (even if a different page ranked more highly for the query). We show you the average position for this top page across all data centers over the course of the week.

A variety of download options are available. You can:
  • download individual tables of data by clicking the Download this table link.
  • download stats for all subfolders on your site (for all search types and locations) by clicking the Download all query stats for this site (including subfolders) link.
  • download all stats (including query stats) for all verified sites in your account by choosing Tools from the My Sites page, then choosing Download data for all sites and then Download statistics for all sites.

Wednesday, September 27, 2006

Introducing Google Checkout

For those of you who manage sites that sell online, we'd like to introduce you to one of our newest products, Google Checkout. Google Checkout is a checkout process that you integrate with your site(s), enabling your customers to quickly buy from you by providing only a single username and password. From there, you can use Checkout to charge your customers' credit cards and process their orders.

Users of Google's search advertising program, AdWords, get the added benefit of the Google Checkout badge and free transaction processing. The Google Checkout badge is an icon that appears on your AdWords ads and improves the effectiveness of your advertising by letting searchers know that you accept Checkout. Also, for every $1 you spend on AdWords, you can process $10 of Checkout sales for free. Even if you don't use AdWords, you can still process sales for a low fee of 2% plus $0.20 per transaction. So if you're interested in implementing Google Checkout, we encourage you to learn more.
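To make that arithmetic concrete, here's a toy Python calculation of the standard rate (purely illustrative; it doesn't model any other billing details):

def checkout_fee(sale_dollars):
    """Processing fee for one transaction at the standard 2% + $0.20 rate."""
    return 0.02 * sale_dollars + 0.20

print(checkout_fee(25.00))  # a $25.00 sale costs $0.70 to process
print(10 * 300)             # $300 of AdWords spend covers $3,000 in free sales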

If you're managing the sites of other sellers, you might want to sign up for our merchant referral program where you can earn cash for helping your sellers get up and running with Google Checkout. You can earn $25 for every merchant you refer that processes at least 3 unique customer transactions and $500 in Checkout sales. And you can earn $5 for every $1,000 of Checkout sales processed by the merchants you refer. If you're interested, apply here.

Wednesday, September 20, 2006

How to verify Googlebot

Lately I've heard a couple of smart people ask that search engines provide a way to know that a bot is authentic. After all, any spammer could name their bot "Googlebot" and claim to be Google, so which bots do you trust and which do you block?

The common request we hear is to post a list of Googlebot IP addresses in some public place. The problem with that is that if/when the IP ranges of our crawlers change, not everyone will know to check. In fact, the crawl team migrated Googlebot IPs a couple years ago and it was a real hassle alerting webmasters who had hard-coded an IP range. So the crawl folks have provided another way to authenticate Googlebot. Here's an answer from one of the crawl people (quoted with their permission):


Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; e.g.:

> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.


This answer has also been provided to our help-desk, so I'd consider it an official way to authenticate Googlebot. In order to fetch from the "official" Googlebot IP range, the bot has to respect robots.txt and our internal hostload conventions so that Google doesn't crawl you too hard.
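If you'd like to script this check, here's a minimal Python sketch of the same reverse-then-forward lookup (the helper name is ours, and production code would want caching and stricter error handling):

import socket

def is_real_googlebot(ip):
    """Reverse DNS, check the domain, then confirm with a forward lookup."""
    try:
        host = socket.gethostbyaddr(ip)[0]              # reverse: IP -> name
    except socket.herror:
        return False
    if not host.endswith(".googlebot.com"):
        return False                                    # wrong domain: reject
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward: name -> IPs
    except socket.gaierror:
        return False
    return ip in forward_ips  # the round trip must land back on the same IP

print(is_real_googlebot("66.249.66.1"))  # the example IP from above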

(Thanks to N. and J. for help on this answer from the crawl side of things.)

Tuesday, September 19, 2006

Debugging blocked URLs

Vanessa's been posting a lot lately, and I'm starting to feel left out. So here's my tidbit of wisdom for you: I've noticed a couple of webmasters confused by "blocked by robots.txt" errors, and I wanted to share the steps I take when debugging robots.txt problems:

A handy checklist for debugging a blocked URL

Let's assume you are looking at crawl errors for your website and notice a URL restricted by robots.txt that you weren't intending to block:
http://www.example.com/amanda.html URL restricted by robots.txt Sep 3, 2006

Check the robots.txt analysis tool
The first thing you should do is go to the robots.txt analysis tool for that site. Make sure you are looking at the correct site for that URL, and that you are looking at the right protocol and subdomain. (Subdomains and protocols may each have their own robots.txt file, so https://www.example.com/robots.txt may be different from http://example.com/robots.txt, which may be different from http://amanda.example.com/robots.txt.) Paste the blocked URL into the "Test URLs against this robots.txt file" box. If the tool reports that it is blocked, you've found your problem. If the tool reports that it's allowed, we need to investigate further.

At the top of the robots.txt analysis tool, take a look at the HTTP status code. If we report anything other than a 200 (Success) or a 404 (Not found), then we may not be able to reach your robots.txt file, which stops our crawling process. (Note that you can see the last time we downloaded your robots.txt file at the top of this tool. If you make changes to your file, check this date and time to see if your changes were made after our last download.)

Check for changes in your robots.txt file
If these look fine, check whether your robots.txt file has changed since the error occurred by looking at the date when it was last modified. If it was modified after the date given for the error in the crawl errors, it might be that someone changed the file so that the new version no longer blocks this URL.

Check for redirects of the URL
If you can be certain that this URL isn't blocked, check to see if the URL redirects to another page. When Googlebot fetches a URL, it checks the robots.txt file to make sure it is allowed to access the URL. If the robots.txt file allows access to the URL, but the URL returns a redirect, Googlebot checks the robots.txt file again to see if the destination URL is accessible. If at any point Googlebot is redirected to a blocked URL, it reports that it could not get the content of the original URL because it was blocked by robots.txt.
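Here's a rough Python sketch of that redirect-following check (our own debugging aid, not Googlebot's actual implementation), which walks a redirect chain and tests each hop against the relevant robots.txt:

from urllib import error, parse, request, robotparser

def is_allowed(url, user_agent="Googlebot"):
    """Fetch the robots.txt for url's host and test url against it."""
    parts = parse.urlsplit(url)
    rp = robotparser.RobotFileParser(
        parts.scheme + "://" + parts.netloc + "/robots.txt")
    rp.read()  # a missing robots.txt (404) is treated as allowing everything
    return rp.can_fetch(user_agent, url)

class NoRedirect(request.HTTPRedirectHandler):
    def redirect_request(self, *args, **kwargs):
        return None  # surface 3xx responses instead of silently following them

def first_blocked_hop(url, max_hops=5):
    """Follow redirects manually; return the first robots.txt-blocked hop."""
    opener = request.build_opener(NoRedirect)
    for _ in range(max_hops):
        if not is_allowed(url):
            return url  # this hop triggers "URL restricted by robots.txt"
        try:
            opener.open(url)
            return None  # fetched successfully with no blocked hop
        except error.HTTPError as e:
            if e.code in (301, 302, 303, 307, 308) and "Location" in e.headers:
                url = parse.urljoin(url, e.headers["Location"])  # next hop
            else:
                return None
    return None

print(first_blocked_hop("http://www.example.com/amanda.html"))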

Sometimes this behavior is easy to spot because a particular URL always redirects to another one. But sometimes this can be tricky to figure out. For instance:
  • Your site may not have a robots.txt file at all (and therefore, allows access to all pages), but a URL on the site may redirect to a different site, which does have a robots.txt file. In this case, you may see URLs blocked by robots.txt for your site (even though you don't have a robots.txt file).
  • Your site may prompt for registration after a certain number of page views. You may have the registration page blocked by a robots.txt file. In this case, the URL itself may not redirect, but if Googlebot triggers the registration prompt when accessing the URL, it will be redirected to the blocked registration page, and the original URL will be listed in the crawl errors page as blocked by robots.txt.

Ask for help
Finally, if you still can't pinpoint the problem, you might want to post on our forum for help. Be sure to include the blocked URL in your message. Sometimes it's easier for other people to notice oversights you may have missed.

Good luck debugging! And by the way -- unrelated to robots.txt -- make sure that you don't have "noindex" meta tags in the head of your web pages; those also prevent pages from appearing in our index.
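For reference, the tag in question looks like this in the <head> of a page:

<meta name="robots" content="noindex">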

Friday, September 15, 2006

For Those Wondering About Public Service Search

Update: The described product or service is no longer available.

We recently learned of a security issue with our Public Service Search service and temporarily disabled login functionality to protect Public Service Search users while we work to fix the problem. We are not aware of any malicious exploits of this problem, and this service represents an extremely small portion of searches.

We currently have a temporary fix in place that prevents exploitation of this problem, and we will have a permanent solution in place shortly. Unfortunately, the temporary fix may inconvenience a small number of Public Service Search users in the following ways:

* Public Service Search is currently not open to new signups.
* If you use Public Service Search on your site, you are currently unable to log in to make changes, but rest assured that Public Service Search continues to function properly on your site.
* The template system is currently disabled, so search results will appear in a standard Google search results format, rather than customized to match the look and feel of your site. However, the search results themselves are not being modified.


If you are a Public Service Search user and are having trouble logging in right now, please sit tight. As soon as the permanent solution is in place, the service will be back on its feet again. In the meantime, you will still be able to provide site-specific searches on your site as usual.

Google introduced this service several years ago to support universities and non-profit organizations by offering ad-free search capabilities for their sites. Our non-profit and university users are extremely important to us and we apologize for any inconvenience this may cause.

Please post any questions or concerns in our webmaster discussion forum and we'll try our best to answer any questions you may have.