Robots.txt pattern-based matching using data-driven results - indexing

Is there a way to create a pattern-based rule in the robots.txt file so that search engines can index our pages?
New York 100
New York 101
New York 102
...
Atlanta 100
Atlanta 101
Atlanta 102
...
Our website has millions of records that we'd like search engines to index.
The indexing should be based on data-driven results, following a simple pattern: City + Lot Number.
The page that loads shows the city lot and related info.
Unfortunately, there are too many records to simply list them all in robots.txt (the file would be over 21 MB), and Google imposes a 500 KB limit on robots.txt files.

The default permissions from robots.txt are that bots are allowed to crawl (and index) everything unless you exclude it. You shouldn't need any rules at all. You could have no robots.txt file or it could be as simple as this one that allows all crawling (disallows nothing):
User-agent: *
Disallow:
Robots.txt rules are all "Starts with" rules. So if you did want to disallow a specific city, you could do it like this:
User-agent: *
Disallow: /atlanta
Which would disallow all the following URLs:
/atlanta-100
/atlanta-101
/atlanta-102
But allow crawling for all other cities, including New York.
As an aside, it is a big ask for search engines to index millions of pages from a site. Search engines will only do so if the content is high quality (lots of unique, well-written text), your site has plenty of reputation (links from lots of other sites), and your site has good information architecture (several usable navigation links to and from each page). Your next question is likely to be "Why aren't search engines indexing my content?"
You probably want to create XML sitemaps with all of your URLs. Unlike robots.txt, you can list each of your URLs in a sitemap to tell search engines about them. A sitemap's power is limited, however. Just listing a URL in the sitemap is almost never enough to get it to rank well, or even to get it indexed at all. At best, sitemaps can get search engine bots to crawl your whole site, give you extra information in webmaster tools, and act as a way of telling search engines about your preferred URLs. See The Sitemap Paradox for more information.
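For a site of that size, the usual pattern is a sitemap index that points at many smaller sitemap files, since the sitemap protocol caps each file at 50,000 URLs. A minimal sketch, with made-up file names and the city/lot URLs from your example:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.com/sitemaps/lots-000.xml</loc></sitemap>
  <sitemap><loc>http://example.com/sitemaps/lots-001.xml</loc></sitemap>
  <!-- ...one entry per 50,000-URL file... -->
</sitemapindex>

Each referenced file then lists the actual pages:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/new-york-100</loc></url>
  <url><loc>http://example.com/new-york-101</loc></url>
  <!-- ...up to 50,000 <url> entries per file... -->
</urlset>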

Related

SEO And AJAX Sites

Is it possible to help search engines by giving them a list of URLs to crawl? It can be hard to make a site SEO-friendly when it relies on heavy AJAX logic. Let's say the user chooses a category, then a sub-category, and then a product. It seems unnecessary to give categories and sub-categories URLs, but giving each product a URL makes sense. When I see the URL for a product, I can make the application navigate to that product. So, is it possible to use robots.txt or some other method to direct search engines to the URLs I designate?
I am open to other suggestions if this somehow does not make sense.
Yes. What you're describing is called a sitemap -- it's a list of pages on your site which search engines can use to help them crawl your web site.
There are a couple ways of formatting a sitemap, but by far the easiest is to just list out all the URLs in a text file available on your web site -- one per line -- and reference it in robots.txt like so:
Sitemap: http://example.com/sitemap.txt
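The text file itself is just one absolute URL per line (UTF-8 encoded), for example with made-up product URLs:

http://example.com/products/1001
http://example.com/products/1002
http://example.com/products/1003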
Here's Google's documentation on the topic: https://support.google.com/webmasters/answer/183668?hl=en

SEO Search Only content

We have a ton of content on our website which a user can get to by performing a search on the website. For example, we have data for all Public companies, in the form of individual pages per company. So think like 10,000 pages in total. Now in order to get to these pages, a user needs to search for the company name and from the search results, click on the company name they are interested in.
How would a search bot find these pages? There is no page on the website which links to these 10,000 pages. Think Amazon: you need to search for your product and then, from the search results, click on the product you are interested in to get to it.
The closest solution I could find was sitemap.xml; is that it? Is there anything which doesn't require adding 10,000 links to an XML file?
You need to link to a page, and ideally keep it close to the homepage, for it to stand a decent chance of getting indexed by Google.
A sitemap helps, sure, but a page still needs to exist in the menu / site structure. A sitemap reference alone does not guarantee a resource will be indexed.
Google - Webmaster Support on Sitemaps: "Google doesn't guarantee that we'll crawl or index all of your URLs. However, we use the data in your Sitemap to learn about your site's structure, which will allow us to improve our crawler schedule and do a better job crawling your site in the future. In most cases, webmasters will benefit from Sitemap submission, and in no case will you be penalized for it."
If you browse Amazon, you can find 99% of the products available. Amazon do a lot of interesting stuff in their faceted navigation; you could write a book on it.
Speak to an SEO or a usability / CRO expert - they will be able to tell you what you need to do - which is basically to create a user-friendly site with categories & links to all your products.
An XML sitemap pretty much is your only on-site option if you do not or cannot link to these products on your website. You could link to these pages from other websites but that doesn't seem like a likely scenario.
Adding 10,000 products to an XML sitemap is easy to do. Your sitemap can be dynamic just like your web pages are. Just generate it on the fly when requested like you would a regular web page and include whatever products you want to be found and indexed.
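As a rough sketch of what "generate it on the fly" can look like, the handler behind /sitemap.xml could build the XML from whatever your product database returns; fetch_product_urls() below is a hypothetical stand-in for that query:

from xml.sax.saxutils import escape

def fetch_product_urls():
    # Hypothetical placeholder: in practice, query your product database here.
    return ["http://example.com/company/acme",
            "http://example.com/company/globex"]

def build_sitemap_xml(urls):
    # Assemble a standard XML sitemap: one <url><loc> entry per page.
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url in urls:
        lines.append("  <url><loc>%s</loc></url>" % escape(url))
    lines.append("</urlset>")
    return "\n".join(lines)

# Serve build_sitemap_xml(fetch_product_urls()) as the response for /sitemap.xml,
# or write it to disk on a schedule, and reference it from robots.txt.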

Why is Google crawling pages blocked by my robots.txt?

I have a “double” question about the number of pages crawled by Google, its possible relation to duplicate content, and the impact on SEO.
Facts on my number of pages and pages crawled by Google
I launched a new website two months ago. Today, it has close to 150 pages (it's increasing every day). This is the number of pages in my sitemap anyway.
If I look at "Crawl stats" in Google Webmaster Tools, I can see that the number of pages crawled by Google every day is much bigger (see image below).
I'm not sure that's a good thing, because not only does it make my server a bit busier (5.6 MB of downloads for 903 pages in a day), but I'm afraid it creates some duplicate content as well.
I have checked on Google (site:mysite.com) and it gives me 1290 pages (but only 191 are shown unless I click on "repeat the search with the omitted results included"). Let's suppose the 191 are the ones in my sitemap (I think I have a duplicate-content problem with around 40 pages, but I have just updated the website to address that).
Facts on my robots.txt
I use a robots.txt file to disallow all crawlers from accessing pages with parameters (see the robots.txt below) and also “tag” pages.
User-Agent: *
Disallow: /administrator
Disallow: *?s
Disallow: *?r
Disallow: *?c
Disallow: *?viewmode
Disallow: */tags/*
Disallow: *?page=1
Disallow: */user/*
The most important one is tags. They appear in my URLs as follows:
www.mysite.com/tags/Advertising/writing
These pages are blocked by the robots.txt (I've checked with Google Webmaster Tools), but they are still present in Google search (though you need to click on “repeat the search with the omitted results included”).
I don't want those pages to be crawled, as they are duplicate content (each is a kind of search on a keyword); that's why I put them in robots.txt.
Finally, my questions are:
Why is Google crawling the pages that I blocked in robots.txt?
Why is Google indexing pages that I have blocked? Are those pages considered by Google to be duplicate content? If so, I guess that's bad for SEO.
EDIT: I'm NOT asking how to remove the pages indexed in Google (I know the answer already).
Why is Google crawling the pages that I blocked in robots.txt? Why is Google indexing pages that I have blocked?
They may have crawled it before you blocked it. You have to wait until they read your updated robots.txt file and then update their index accordingly. There is no set timetable for this but it is typically longer for newer websites.
Are those pages considered as duplicate content?
You tell us. Duplicate content is when identical or nearly identical content appears on two or more pages. Is that happening on your site?
Blocking duplicate content is not the way to solve that problem. You should be using canonical URLs. Blocking pages means you're linking to "black holes" in your website, which hurts your SEO efforts. Canonical URLs prevent this and give the canonical URL full credit for its related terms and for all the links pointing to any of the duplicated pages.
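For example, a page that duplicates another one can declare its preferred URL in its <head> like this (the href here is only illustrative):

<link rel="canonical" href="http://www.mysite.com/preferred-page">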

will limiting dynamic urls with robots.txt improve my SEO ranking?

My website has about 200 useful articles. Because the website has an internal search function with lots of parameters, the search engines end up spidering URLs with all possible permutations of additional parameters such as tags, search phrases, versions, dates, etc. Most of these pages are simply a list of search results with some snippets of the original articles.
According to Google's Webmaster-tools Google spidered only about 150 of the 200 entries in the xml sitemap. It looks as if Google has not yet seen all of the content years after it went online.
I plan to add a few "Disallow:" lines to robots.txt so that the search engines no longer spider those dynamic URLs. In addition, I plan to disable some URL parameters in the Webmaster Tools "website configuration" --> "URL parameters" section.
Will that improve or hurt my current SEO ranking? It will look as if my website is losing thousands of content pages.
This is exactly what canonical URLs are for. If one page (e.g. an article) can be reached by more than one URL, then you need to specify the primary URL using a canonical URL. This prevents duplicate content issues and tells Google which URL to display in its search results.
So do not block any of your articles and you don't need to enter any parameters, either. Just use canonical URLs and you'll be fine.
As nn4l pointed out, canonical is not a good solution for search pages.
The first thing you should do is have search results pages include a robots meta tag saying noindex. This will help get them removed from the index and let Google focus on your real content. Google should slowly remove them as they get re-crawled.
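For example, every search results page would carry this standard robots meta tag in its <head>:

<meta name="robots" content="noindex, follow">

The "follow" part lets Google keep following links from those pages to your real content while dropping the pages themselves from the index.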
Other measures:
In GWMT, tell Google to ignore all those search parameters. This is just a band-aid, but it may help speed up the recovery.
Don't block the search page in the robots.txt file as this will block the robots from crawling and cleanly removing those pages already indexed. Wait till your index is clear before doing a full block like that.
Your search system must be based on links (<a> tags) or GET-based forms rather than POST-based forms; that is why those pages got indexed. Switching to POST-based forms should stop robots from trying to index those pages in the first place. JavaScript or AJAX is another way to do it.
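To illustrate that last point with a hypothetical /search endpoint, the only difference is the form's method:

<form action="/search" method="get">   <!-- produces URLs like /search?q=term that bots can crawl -->
<form action="/search" method="post">  <!-- submissions don't map to distinct URLs, so there is nothing to index -->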

sitemap generation strategy

I have a huge site, with more than 5 million URLs.
We already have PageRank 7/10. The problem is that, because of the 5 million URLs and because we add/remove URLs daily (we add ±900 and remove ±300), Google is not fast enough to index all of them. We have a huge and intensive Perl module to generate this sitemap, which is normally split across 6 sitemap files. Google simply can't keep up with adding all the URLs, especially because we normally recreate all of those sitemaps daily and resubmit them to Google. My question is: what would be a better approach? Should I really bother sending 5 million URLs to Google daily even if I know Google won't be able to process them? Or should I send just the permalinks that won't change and let the Google crawler find the rest, so that at least I have a concise index at Google (today I have fewer than 200 of the 5,000,000 URLs indexed)?
What is the point of having a lot of indexed pages which are removed right away?
Temporary pages are worthless to search engines and their users once they are gone. So I would let the search engine crawlers decide whether a page is worth indexing. Just tell them the URLs that will persist... and implement some list pages (if there aren't any yet), which will allow your pages to be crawled more easily.
Side note: 6 sitemap files for 5 million URLs? AFAIK, a sitemap file may not contain more than 50,000 URLs, so 5 million URLs would need at least 100 sitemap files, referenced from a sitemap index.
When URLs change, make sure you handle them properly with a 301 status (permanent redirect).
Edit (refinement):
Still, you should try to keep your URL patterns stable. You can use 301s for redirects, but maintaining a lot of redirect rules is cumbersome.
Why don't you just compare your sitemap to the previous one each time, and only send Google the URLs that have changed?
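A minimal sketch of that idea, assuming yesterday's and today's URL lists are available as plain text files with one URL per line (the file names are made up):

def read_urls(path):
    # Load one URL per line into a set for fast comparison.
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

old_urls = read_urls("sitemap-urls-yesterday.txt")
new_urls = read_urls("sitemap-urls-today.txt")

added = sorted(new_urls - old_urls)    # only these go into today's delta sitemap
removed = sorted(old_urls - new_urls)  # serve 404/410 for these, or 301 them

with open("sitemap-delta.txt", "w") as out:
    out.write("\n".join(added) + "\n")

Submitting only the delta keeps the daily sitemap small, while the full sitemap index can still be regenerated on a slower schedule.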