GoogleBot overloads the server by spidering very frequently - seo

My website has about 500,000 pages. I created sitemap.xml and listed all pages in it (I know about the limit of 50,000 links per file, so I have 10 sitemaps). I submitted the sitemaps in Webmaster Tools and everything seems OK (no errors, and I can see the submitted and indexed links). However, I have a problem with how frequently the site is spidered. GoogleBot spiders the same page 4 times per day, even though in sitemap.xml I say that the page changes yearly.
This is an example:
<url>
<loc>http://www.domain.com/destitution</loc>
<lastmod>2015-01-01T16:59:23+02:00</lastmod>
<changefreq>yearly</changefreq>
<priority>0.1</priority>
</url>
1) How can I tell GoogleBot not to spider so frequently, as it overloads my server?
2) The website has several pages like http://www.domain.com/destitution1, http://www.domain.com/destitution2 ... and I set their canonical URL to http://www.domain.com/destitution. Could that be the reason for the repeated spidering?

You can report this to the Google crawling team. See here:
In general, specific Googlebot crawling-problems like this are best
handled through Webmaster Tools directly. I'd go through the Site
Settings for your main domain, Crawl Rate, and then use the "Report a
problem with Googlebot" form there. The submissions through this form
go to our Googlebot team, who can work out what (or if anything) needs
to be changed on our side. They generally won't be able to reply, and
won't be able to process anything other than crawling issues, but they
sure know Googlebot and can help tweak what it does.
https://www.seroundtable.com/google-crawl-report-problem-19894.html

The crawling will slow down progressively. Bots are likely revisiting your pages because there are internal links between them.
In general, canonicals tend to reduce crawl rates. But at the beginning, Google's bots need to crawl both the source and the target page. You will see the benefit later.
Google's bots don't necessarily take lastmod and changefreq information into account. But if they establish that the content has not been modified, they will come back less often. It is a matter of time. Every URL has a scheduler for revisits.
Bots adapt to the capacity of the server (see crawling summary I maintain for more details). If load is an issue, you can temporarily slow bots down by returning them an HTTP 500 error code. They will stop and come back later.
I don't believe there is a crawling issue with your site. What you see is normal behavior. When several sitemaps are submitted at once, the crawling rates can be temporarily raised.
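The suggestion above to slow bots down with an HTTP error code can be sketched as a small WSGI wrapper. This is only an illustration under stated assumptions: the names `throttle_crawlers`, `is_crawler`, `CRAWLER_TOKENS` and `MAX_LOAD` are hypothetical, the load threshold is arbitrary, and the sketch answers with 503 Service Unavailable plus a Retry-After header instead of the bare 500 mentioned above, since 503 is the status code HTTP defines for temporary overload.

```python
import os

# Hypothetical values for illustration; tune them for your own server.
CRAWLER_TOKENS = ("googlebot", "bingbot")
MAX_LOAD = 4.0


def is_crawler(user_agent):
    """Return True if the User-Agent header looks like a known crawler."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def _default_load():
    """Return the 1-minute load average (available on Unix only)."""
    return os.getloadavg()[0]


def throttle_crawlers(app, get_load=_default_load, max_load=MAX_LOAD,
                      retry_after=3600):
    """Wrap a WSGI app so crawlers get a 503 while the load is too high."""

    def wrapped(environ, start_response):
        if is_crawler(environ.get("HTTP_USER_AGENT")) and get_load() > max_load:
            body = b"Service temporarily unavailable\n"
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
                ("Retry-After", str(retry_after)),
            ])
            return [body]
        # Regular visitors, and crawlers on a quiet server, pass through.
        return app(environ, start_response)

    return wrapped
```

Wrapping an existing application would then look like `application = throttle_crawlers(application)`. Serving errors for a long period can get pages dropped from the index, so a wrapper like this is meant as a temporary measure only.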

Related

How does Google see content that is updated daily?

I have a website whose content is updated daily. I have two questions:
How does Google see this content? Do I need an SEO in this case?
Does the 404 error page have an influence on the search engine ranking? (I do not have a static page)
So that Google can "know" the content is supposed to be updated daily, it may be useful (if you don't do it yet) to implement a sitemap (and update it dynamically, if necessary). In this sitemap, you can specify the update period for each page.
This is not binding on Google, but it can help adjust how often the indexing robots visit.
If you do this, you must be "honest" with Google about update times. If Google realizes that the frequency defined in the sitemap does not correspond to the actual frequency, it can be bad for your rankings.
404 errors (and other HTTP errors) can indirectly have an adverse effect on the ranking of the site. Of course, if the robot cannot access content at a given moment, that content cannot be indexed. More importantly, if web crawlers encounter too many problems while visiting your site, Google will reduce the frequency of its visits.
You can get some personalized advice and monitor the process of indexing your site using Google Webmaster Tools (and to a lesser extent, Analytics or any other tool that could monitor the web crawlers visits).
You can see the date and time when Google last visited your page, so you can check whether Google has picked up your updated content. If your website's content is updated daily, you can ping search engines about your website and also submit your site to Google.
You can make a sitemap for only those URLs that have daily updated content and submit it to Google Webmaster Tools. You can give the date and time when the URL was last modified in the <lastmod> tag. You can also give a hint about how frequently the page is likely to change in the <changefreq> tag. You can set a high priority for the pages that are modified daily in the <priority> tag.
If you have 404 error (file not found) pages, then put them in one directory and disallow it in your robots.txt file. Google will then not crawl those pages, so they will not be indexed, and they will not influence your SERP ranking.
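The sitemap advice in this answer (last-modified date, change-frequency hint and priority per URL) could be generated along these lines. It is a minimal sketch using Python's standard library; the function name `build_sitemap`, the dict keys and the example URL are assumptions made for illustration, while the element names and the namespace come from the Sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML as bytes.

    Each entry is a dict with a "loc" key and, optionally,
    "lastmod", "changefreq" and "priority" keys.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # Emit the child elements in the order the protocol lists them.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    daily_pages = [
        {
            "loc": "http://www.example.com/news",  # hypothetical URL
            "lastmod": "2015-01-01",
            "changefreq": "daily",
            "priority": 0.8,
        },
    ]
    print(build_sitemap(daily_pages).decode("utf-8"))
```

Regenerating the file whenever content changes keeps `lastmod` accurate, which matters given the point made earlier about being "honest" with the declared update times.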

Should a sitemap have *every* url

I have a site with a huge number (well, thousands or tens of thousands) of dynamic URLs, plus a few static URLs.
In theory, due to some cunning SEO linkage on the homepage, it should be possible for any spider to crawl the site and discover all the dynamic urls via a spider-friendly search.
Given this, do I really need to worry about expending the effort to produce a dynamic sitemap index that includes all these URLs, or should I simply ensure that all the main static URLs are in there?
The actual way in which I would generate this isn't a concern - I'm just questioning the need to actually do it.
Indeed, the Google FAQ (and yes, I know they're not the only search engine!) about this recommends including URLs in the sitemap that might not be discovered by a crawl; based on that fact, then, if every URL in your site is reachable from another, surely the only URL you really need as a baseline in your sitemap for a well-designed site is your homepage?
If there is more than one way to get to a page, you should pick a main URL for each page that contains the actual content, and put those URLs in the site map. I.e. the site map should contain links to the actual content, not every possible URL to get to the same content.
Also consider putting canonical link tags (rel="canonical") in the pages, pointing to this main URL, so that spiders can recognise a page even if it's reachable through different dynamic URLs.
Spiders only spend a limited time searching each site, so you should make it easy to find the actual content as soon as possible. A site map can be a great help as you can use it to point directly to the actual content so that the spider doesn't have to look for it.
We have had pretty good results using these methods, and Google now indexes 80-90% of our dynamic content. :)
In an SO podcast they talked about limitations on the number of links you could include/submit in a sitemap (around 500 per page with a page limit based on pagerank?) and how you would need to break them over multiple pages.
Given this, do I really need to worry
about expending the effort to produce
a dynamic sitemap index that includes
all these URLs, or should I simply
ensure that all the main static URLs
are in there?
I was under the impression that the sitemap wasn't necessarily about disconnected pages but rather about increasing the crawling of existing pages. In my experience when a site includes a sitemap, minor pages even when prominently linked to are more likely to appear on Google results. Depending on the pagerank/inbound links etc. of your site this may be less of an issue.

Use of sitemaps

I've recently been involved in the redevelopment of a website (a search engine for health professionals: http://www.tripdatabase.com), and one of the goals was to make it more search engine "friendly", not through any black magic, but through better xhtml compliance, more keyword-rich urls, and a comprehensive sitemap (>500k documents).
Unfortunately, shortly after launching the new version of the site in October 2009, we saw site visits (primarily via organic searches from Google) drop substantially to 30% of their former glory, which wasn't the intention :)
We've brought in a number of SEO experts to help, but none have been able to satisfactorily explain the immediate drop in traffic, and we've heard conflicting advice on various aspects, which I'm hoping someone can help us with.
My questions are thus:
do pages present in sitemaps also need to be spiderable from other pages? We had thought the point of a sitemap was specifically to help spiders get to content not already "visible". But now we're getting the advice to make sure every page is also linked to from another page. Which prompts the question... why bother with sitemaps?
some months on, and only 1% of the sitemap (well-formatted, according to webmaster tools) seems to have been spidered - is this usual?
Thanks in advance,
Phil Murphy
An XML sitemap helps search engine spiders index all the pages of your site.
A sitemap is very useful if you frequently publish many pages, but it does not replace a proper internal linking structure: every document must be linked from another related page.
Your site is very large, so pay attention to the number of URLs published in the sitemap, because there is a limit of 50,000 URLs per XML file.
The full documentation is available at Sitemaps.org
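For a site of this size, the 50,000-URL limit means splitting the URL list across several sitemap files and listing those files in a sitemap index. The sketch below shows one way to do that with Python's standard library; `chunk_urls`, `build_sitemap_index` and the file-naming scheme in the usage note are hypothetical, while the `sitemapindex`, `sitemap` and `loc` element names come from the Sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # per-file limit mentioned above


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into consecutive lists of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls):
    """Return sitemap index XML (bytes) pointing at each sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = sitemap_url
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)
```

With 500,000 URLs, `chunk_urls` yields ten lists; each one would be written to its own file (for example `sitemap-1.xml` to `sitemap-10.xml`, a naming choice assumed here), and the public URLs of those files are what gets passed to `build_sitemap_index`.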
re: do pages present in sitemaps also need to be spiderable from other pages?
Yes, in fact this should be one of the first things you do. Make your website more usable for users before the search engines, and the search engines will love you for it. Heavy internal linking between pages is a must as a first step. Most of the time you can do this with internal sitemap pages, category pages, etc.
re: why bother with sitemaps?
Yes! A sitemap helps you set priorities for certain content on your site (like the homepage) and tell the search engines what to look at more often. NOTE: Do not set all your pages to the highest priority; it confuses Google and doesn't help you.
re: some months on, and only 1% of the sitemap seems to have been spidered - is this usual?
Yes! I have a website with 100k+ pages. Google has never indexed them all in a single month; it takes small chunks of about 20k at a time each month. If you use the priority setting, you can tell the spider which pages it should re-index on each visit.
As Rinzi mentioned, more documentation is available at Sitemaps.org
Try to build more backlinks and "trust" (links from quality sources).
That may help speed up indexing further :)

About Isolated Page In My Web Site

I produced a page that I do not want search engines to find and crawl.
The advisable solution is robots.txt, but it is not applicable in my situation.
So I isolated this page from my site by removing all links to it from other pages, and I never put its URL on external sites.
Logically, then, it is impossible for search engines to find this page. That means that no matter how many outbound links this page contains, the PR of the site is safe.
Am I right?
Thank you very much!
Hope this question is programming related!
No, there's still a chance your page can be found by search engine crawlers. For example, it's been speculated that data from the Google Toolbar can be used to alert Googlebot to the presence of a page. And there's still a chance others might link to your page from external sites if the URL becomes known.
Your best bet is to add a robots meta tag to your page; this will prevent it from being indexed and prevent crawlers from following any links:
<meta name="robots" content="noindex,nofollow" />
If it is on the internet and not restricted, it will be found. It may make it harder to find, but it is still possible a crawler may happen across it.
What is the link so I can check? ;)
If you have outbound links on this "isolated" page then your page will probably show up as a referrer in the logs of the linked-to page. Depending on how much the owners of the linked-to page track their stats, then they may find your page.
I've seen httpd log files turn up in Google searches. This in turn may lead others to find your page, including crawlers and other robots.
The easiest solution might be to password protect the page?

Possible to prevent search engine spiders from infinitely crawling paging links on search results?

Our SEO team would like to open up our main dynamic search results page to spiders and remove the 'nofollow' from the meta tags. It is currently accessible to spiders via allowing the path in robots.txt, but with a 'nofollow' clause in the meta tag which prevents spiders from going beyond the first page.
<meta name="robots" content="index,nofollow">
I am concerned that if we remove the 'nofollow', the impact to our search system will be catastrophic, as spiders will start crawling through all pages in the result set. I would appreciate advice as to:
1) Is there a way to remove the 'nofollow' from the meta tag, but prevent spiders from following only certain links on the page? I have read mixed opinions on rel="nofollow", is this a viable option?
<a rel="nofollow" href="http://www.mysite.com/paginglink" >Next Page</a>
2) Is there a way to control the 'depth' of how far spiders will go? It wouldn't be so bad if they hit a few pages, then stopped.
3) Our search results pages have the standard next/previous links, which would in theory cause spiders to hit pages recursively to infinity, what is the effect of this on SEO?
I understand that different spiders behave differently, but am mainly concerned with the big players, such as Google, Yahoo, MSN.
Note our Search results pages and paging links are not bot-friendly, in that they are not re-written and have a ?name=value query string, but from what I've seen spiders no longer just abort when they see the '?' as the results pages ARE getting indexed with decent page rank.
To be honest, you are looking at nofollow wrong. Chances are the search spiders, especially Google, Yahoo, and MSN, are already crawling the nofollow pages, because they still have to hit those pages to see if they have a noindex.
The real problem is that nofollow doesn't actually mean "don't follow"; it just means "don't pass my reputation on to this link". So unless you are aggressively blocking bots, which it doesn't sound like you are, changing the ROBOTS meta tag and robot commands on links will not affect performance, because they are already hitting your site. To confirm this, just look at your HTTP server log.
So my vote is that you will not see any problem with removing the robot limits.
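Checking the HTTP server log, as suggested above, can be done with a short script. The sketch below assumes the log uses the common "combined" format (request line, status, size, referrer and User-Agent in that order); `count_crawler_hits` and `LOG_LINE` are hypothetical names, and the log path in the usage note is only an example. Matching on the User-Agent string alone does not prove a request really came from Google, since that header can be spoofed.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer and User-Agent fields
# of a combined-format access log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def count_crawler_hits(lines, token="Googlebot"):
    """Count requests per path whose User-Agent contains `token`."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and token.lower() in match.group("agent").lower():
            hits[match.group("path")] += 1
    return hits
```

For example, `with open("/var/log/apache2/access.log") as f: print(count_crawler_hits(f).most_common(10))` would list the ten paths the crawler requests most often, which shows whether the paging links are already being hit.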
I've seen Google index a calendar system that had relative links on each page through the end of time (Jan 19, 2038 - see: http://en.wikipedia.org/wiki/Year_2038_problem). We didn't notice the load on our servers until it exposed a bug in the source code dealing with dates in 2038.
I don't know about the other search engines, but Google offers a number of helpful tools for controlling how much the googlebot impacts your server infrastructure. See http://www.google.com/webmasters/.
There is an option in webmaster tools to set the crawl rate for your site.
Google bots are pretty intelligent about not traversing an entire database of dynamically-generated pages, as long as the URLs give some hint that they are dynamic (i.e. file extension of .asp or .jsp, etc. and numeric ids as query parameters). If you use rewrite rules to make your URLs "friendly", then the bots have a harder time determining whether or not it's a static page they are reading or a dynamically generated page. See this Google article for more information about dynamic vs. static URLs.
You may also want to consider creating a Google Sitemap to give the bots a better idea about what pages on your site can be indexed and which cannot.