Why is Google crawling pages blocked by my robots.txt?

I have a "double" question about the number of pages crawled by Google, its possible relation to duplicate content (or not), and the impact on SEO.
Facts on my number of pages and pages crawled by Google
I launched a new website two months ago. Today it has close to 150 pages (and that number increases every day); at least, that is the number of pages in my sitemap.
If I look at "Crawl stats" in Google Webmaster Tools, I can see that the number of pages crawled by Google every day is much bigger (see image below).
I'm not sure that's a good thing, because not only does it make my server a bit busier (5.6 MB downloaded for 903 pages in one day), but I'm also afraid it creates duplicate content.
I have checked on Google (site:mysite.com) and it gives me 1,290 pages, but only 191 are shown unless I click on "repeat the search with the omitted results included". Let's suppose the 191 are the ones in my sitemap (I think I have a duplicate content problem with around 40 pages, but I have just updated the website to fix that).
Facts on my robots.txt
I use a robots.txt file to disallow all crawlers from accessing pages with parameters (see the robots.txt below) and also the "tags" pages.
User-Agent: *
Disallow: /administrator
Disallow: *?s
Disallow: *?r
Disallow: *?c
Disallow: *?viewmode
Disallow: */tags/*
Disallow: *?page=1
Disallow: */user/*
The most important rule is the one for tags. Tag URLs look like this:
www.mysite.com/tags/Advertising/writing
It is blocked by the robots.txt (I've checked with Google Webmaster Tools), but it is still present in Google's search results (you need to click on "repeat the search with the omitted results included" to see it).
I don't want those pages to be crawled, as they are duplicate content (each one is essentially a search on a keyword), which is why I put them in robots.txt.
Finally, my questions are:
Why is Google crawling the pages that I blocked in robots.txt?
Why is Google indexing pages that I have blocked? Are those pages considered by Google as duplicate content? If so, I guess that's bad for SEO.
EDIT: I'm NOT asking how to remove the pages indexed in Google (I know the answer already).

Why is Google crawling the pages that I blocked in robots.txt? Why is Google indexing pages that I have blocked?
They may have crawled those pages before you blocked them. You have to wait until they read your updated robots.txt file and then update their index accordingly. There is no set timetable for this, but it typically takes longer for newer websites.
Are those pages considered as duplicate content?
You tell us. Duplicate content is when two or more pages have identical or nearly identical content. Is that happening on your site?
Blocking duplicate content is not the way to solve that problem. You should be using canonical URLs. Blocking pages means you're linking to "black holes" in your website, which hurts your SEO efforts. Canonical URLs prevent this, and the canonical URL gets full credit for its related terms and for all the links pointing to any of the duplicated pages.
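For example, a tag page could point to the page that holds the original content with a canonical link in its <head> (illustrative snippet only; the target URL here is hypothetical):
<!-- on the duplicate page: tell search engines which URL should get the credit -->
<link rel="canonical" href="https://www.mysite.com/articles/advertising-writing">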

Related

Robots.txt pattern-based matching using data driven results

Is there a way to create a pattern-based rule in the robots.txt file covering the pages we want search engines to index?
New York 100
New York 101
New York 102
...
Atlanta 100
Atlanta 101
Atlanta 102
...
Our website has millions of records that we'd like search engines to index.
The indexing should be based on data-driven results, following a simple pattern: City + Lot Number.
The page that loads shows the city lot and related info.
Unfortunately, there are too many records to simply list them in the robots.txt file (it would be over 21 MB), and Google has a 500 KB limit on robots.txt files.
The default permissions from robots.txt are that bots are allowed to crawl (and index) everything unless you exclude it. You shouldn't need any rules at all. You could have no robots.txt file or it could be as simple as this one that allows all crawling (disallows nothing):
User-agent: *
Disallow:
Robots.txt rules are all "Starts with" rules. So if you did want to disallow a specific city, you could do it like this:
User-agent: *
Disallow: /atlanta
Which would disallow all the following URLs:
/atlanta-100
/atlanta-101
/atlanta-102
But allow crawling for all other cities, including New York.
As an aside, it is a big ask for search engines to index millions of pages from a site. Search engines will only do so if the content is high quality (lots of text, unique, well written), your site has plenty of reputation (links from lots of other sites), and your site has good information architecture (several usable navigation links to and from each page). Your next question is likely to be Why aren't search engines indexing my content?
You probably want to create XML sitemaps with all of your URLs. Unlike robots.txt, you can list each of your URLs in a sitemap to tell search engines about them. A sitemap's power is limited, however. Just listing a URL in the sitemap is almost never enough to get it to rank well, or even to get it indexed at all. At best, sitemaps get search engine bots to crawl your whole site, give you extra information in webmaster tools, and let you tell search engines about your preferred URLs. See The Sitemap Paradox for more information.
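With millions of URLs you would split the list across many sitemap files (at most 50,000 URLs each) and reference them from a sitemap index, roughly like this sketch (the file names are hypothetical):
<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap index referencing the individual sitemap files -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-lots-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-lots-2.xml</loc>
  </sitemap>
</sitemapindex>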

Google still indexing unique URLs

I have a robots.txt file set up like this:
User-agent: *
Disallow: /*
This is for a site that is entirely based on unique URLs, sort of like https://jsfiddle.net/: when you save a new fiddle, it gets a unique URL. I want all of my unique URLs to be invisible to Google. No indexing.
Google has indexed all of my unique URLs, even though it says "A description for this result is not available because of the site's robots.txt file. - learn more"
But that still sucks, because all the URLs are there and clickable, so all the data behind them is reachable. What can I do to 1) get these out of Google and 2) stop Google from indexing these URLs?
Robots.txt tells search engines not to crawl the page, but it does not stop them from indexing the page, especially if there are links to the page from other sites. If your main goal is to guarantee that these pages never wind up in search results, you should use robots meta tags instead. A robots meta tag with 'noindex' means "Do not index this page at all". Blocking the page in robots.txt means "Do not request this page from the server."
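For instance, each unique-URL page could include something like this in its <head> (illustrative snippet, not tied to any particular platform):
<!-- tells crawlers not to put this page in their index -->
<meta name="robots" content="noindex">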
After you have added the robots meta tags, you will need to change your robots.txt file to no longer disallow the pages. Otherwise, the robots.txt file would prevent the crawler from loading the pages, which would prevent it from seeing the meta tags. In your case, you can just change the robots.txt file to:
User-agent: *
Disallow:
(or just remove the robots.txt file entirely)
If robots meta tags are not an option for some reason, you can also use the X-Robots-Tag header to accomplish the same thing.
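As a rough sketch, on an Apache server the header could be added for those pages in an .htaccess file (requires mod_headers; how you scope it to the right URLs is up to you):
<IfModule mod_headers.c>
  # send the noindex directive as an HTTP response header
  Header set X-Robots-Tag "noindex"
</IfModule>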

GoogleBot overloads the server by spidering very frequently

My website has about 500,000 pages. I made sitemap.xml files and listed all pages in them (I know about the limitation of 50,000 links per file, so I have 10 sitemaps). Anyway, I submitted the sitemaps in Webmaster Tools and everything seems OK (no errors, and I can see the submitted and indexed links). However, I have a problem with how frequently the site is spidered: GoogleBot spiders the same page 4 times per day, even though in sitemap.xml I say the page changes yearly.
This is an example:
<url>
<loc>http://www.domain.com/destitution</loc>
<lastmod>2015-01-01T16:59:23+02:00</lastmod>
<changefreq>yearly</changefreq>
<priority>0.1</priority>
</url>
1) How can I tell GoogleBot not to spider so frequently, as it overloads my server?
2) The website has several pages like http://www.domain.com/destitution1, http://www.domain.com/destitution2 ... and I set the canonical URL to http://www.domain.com/destitution. Might that be the reason for the repeated spidering?
You can report this to Google's crawling team; see here:
In general, specific Googlebot crawling-problems like this are best
handled through Webmaster Tools directly. I'd go through the Site
Settings for your main domain, Crawl Rate, and then use the "Report a
problem with Googlebot" form there. The submissions through this form
go to our Googlebot team, who can work out what (or if anything) needs
to be changed on our side. They generally won't be able to reply, and
won't be able to process anything other than crawling issues, but they
sure know Googlebot and can help tweak what it does.
https://www.seroundtable.com/google-crawl-report-problem-19894.html
The crawling will slow down progressively. Bots are likely revisiting your pages because there are internal links between your pages.
In general, canonicals tend to reduce crawling rates, but at the beginning Google bots need to crawl both the source and the target page. You will see the benefit later.
Google bots don't necessarily take lastmod and changefreq information into account. But if they establish that the content is not modified, they will come back less often. It is a matter of time; every URL has a scheduler for revisits.
Bots adapt to the capacity of the server (see the crawling summary I maintain for more details). You can temporarily slow bots down by returning HTTP error code 500 to them if load is an issue. They will stop and come back later.
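One rough way to do that on Apache is an .htaccess rule like the following (illustrative sketch only, assumes mod_rewrite is enabled; remove it as soon as the load problem is gone):
RewriteEngine On
# temporarily answer Googlebot requests with a 500 so it backs off and retries later
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule .* - [R=500,L]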
I don't believe there is a crawling issue with your site. What you see is normal behavior. When several sitemaps are submitted at once, the crawling rates can be temporarily raised.

How to remove duplicate title and meta description tags if Google indexed them

So, I have been building an ecommerce site for a small company.
The url structure is : www.example.com/product_category/product_name and the site has around 1000 products.
I've checked Google Webmaster Tools, and the HTML Improvements section shows that I have duplicate title and meta description tags for all the product pages. They all appear two times, because both:
-www.example.com/product_category/product_name
and
-www.example.com/product_category/product_name/ (with slash in the end)
got indexed as separate pages.
I've added a 301 redirect from every www.example.com/product_category/product_name/ to www.example.com/product_category/product_name, but that was almost two weeks ago. I have resubmitted my sitemap and asked Google to fetch the pages a few times. Nothing has changed; GWT still shows the pages as having duplicate tags.
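(For reference, a trailing-slash redirect of this kind is usually a rewrite rule along these lines; this is an illustrative Apache .htaccess sketch, not necessarily the exact rule in use here:)
RewriteEngine On
# 301-redirect any URL ending in a slash to the same URL without it
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)/$ /$1 [R=301,L]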
I did not get any manual action message.
So I have two questions:
-how can I accelerate the re-indexing process, if that's possible?
-and do these duplicate tags hurt my organic search results? I've googled it, and some say they do and some say they don't.
An option is to set a canonical link on both URLs (with and without the trailing /), pointing to the URL without the /. Little by little, Google will stop complaining. Keep in mind that Google Webmaster Tools is slow to react, especially when you don't have much traffic or many backlinks.
And yes, duplicate tags can influence your rankings negatively because users won't have proper and specific information for each page.
Setting a canonical link on both URLs is a solution, but in my experience it takes time.
The fastest way is to block the old URLs in the robots.txt file:
Disallow: /old_url
The canonical tag is an option, but why aren't you adding a different title and description for every page?
You can set up dynamic meta tags once and they will be generated automatically for all pages, so you don't have to worry about duplication.

Will limiting dynamic URLs with robots.txt improve my SEO ranking?

My website has about 200 useful articles. Because the website has an internal search function with lots of parameters, the search engines end up spidering URLs with all possible permutations of additional parameters such as tags, search phrases, versions, dates, etc. Most of these pages are simply a list of search results with some snippets of the original articles.
According to Google Webmaster Tools, Google has spidered only about 150 of the 200 entries in the XML sitemap. It looks as if Google has not yet seen all of the content, years after it went online.
I plan to add a few "Disallow:" lines to robots.txt so that the search engines no longer spider those dynamic URLs. In addition, I plan to disable some URL parameters in the Webmaster Tools "website configuration" --> "URL parameters" section.
Will that improve or hurt my current SEO ranking? It will look as if my website is losing thousands of content pages.
This is exactly what canonical URLs are for. If one page (e.g. an article) can be reached by more than one URL, then you need to specify the primary URL using a canonical URL. This prevents duplicate content issues and tells Google which URL to display in its search results.
So do not block any of your articles and you don't need to enter any parameters, either. Just use canonical URLs and you'll be fine.
As nn4l pointed out, canonical is not a good solution for search pages.
The first thing you should do is have the search results pages include a robots meta tag saying noindex. This will help get them removed from Google's index and let Google focus on your real content. Google should slowly remove them as they get re-crawled.
Other measures:
In GWMT, tell Google to ignore all those search parameters. It's just a band-aid, but it may help speed up the recovery.
Don't block the search pages in the robots.txt file, as that would stop the robots from crawling them and cleanly removing the pages that are already indexed. Wait till the index is clear before doing a full block like that.
Your search system must be based on links (<a> tags) or GET-based forms rather than POST-based forms; that is why those pages got indexed. Switching to POST-based forms should stop robots from discovering those pages in the first place. JavaScript or AJAX is another way to do it (see the sketch below).
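As an illustration, the difference is just the form method; a hypothetical search form (the /search action and q parameter are made up) could look like this:
<!-- GET form: produces crawlable URLs like /search?q=term -->
<form action="/search" method="get">
  <input type="text" name="q">
  <button type="submit">Search</button>
</form>

<!-- POST form: the query travels in the request body, so there is no distinct URL to index -->
<form action="/search" method="post">
  <input type="text" name="q">
  <button type="submit">Search</button>
</form>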