Google Webmaster Tools won't index my site - indexing

I discovered that the robots.txt file on my site is preventing Google's Webmaster Tools from indexing my site properly. I removed just about everything from the file (I'm using WordPress, so it will still generate one), but I keep getting the same error in their panel:
"Severe status problems were found on the site. - Check site status". When I click on the site status, it tells me that robots.txt is blocking my main page, which it is not.
http://saturate.co/robots.txt - any ideas?
Edit: Marking this as solved, as Webmaster Tools has now accepted the site and is showing no errors.

You should try adding an empty Disallow: line to the end of your file, which tells crawlers that nothing is blocked. It should look like this:
User-agent: *
Disallow:

Related

we were unable to access your site's robots.txt file

I verified my site using Google Webmaster Tools. I built the website in WordPress and also added a robots.txt file. Google now shows a green tick mark on DNS and Server Connectivity, but a yellow warning mark on robots.txt fetch.
My robots.txt file looks like this:
robots file
Also, when I run the robots.txt test in Webmaster Tools, it gives an "allowed" result. My site does not even show up in Google searches.
When I first submitted my site in Webmaster Tools it did not show an error, but now it does.
Please help me solve this problem.
If you built your website with WordPress, it will automatically generate a robots.txt file for you.
Why didn't you use that one?

How to remove folder and Its child pages from Google Search Index

I am redesigning my site, and the new version is located in a subfolder of the website directory. Google has indexed the new site from the subfolder, which is affecting the search results for my live site.
Is there a specific way to remove the subfolder from Google's search index and search results?
For example, my live site is www.xyz.com and
I am redesigning at www.xyz.com/newsite
Is there any way to remove /newsite from Google's search index and results?
Refer to http://www.robotstxt.org/robotstxt.html
Add this robots.txt file:
User-agent: *
Disallow: /newsite/
Or, better suited, get access to Google Webmaster Tools and use the URL removal tool:
https://www.google.com/webmasters/tools/url-removal?hl=en&siteUrl=
Add your website URL after the =
For example:
https://www.google.com/webmasters/tools/url-removal?hl=en&siteUrl=http://www.techplayce.com/
Yes, by uploading a robots.txt file to your site's root directory:
User-agent: *
Disallow: /newsite/
Add this code. If you have a WordPress site, install a plugin for robots.txt.

robots.txt for telling Google not to follow specific URLs

I have included a robots.txt file in the root directory of my application to tell Googlebot not to follow the URL http://www.test.com/example.aspx?id=x&date=10/12/2014, or any URL with the same path but different query string values. For that I have used the following rules:
User-agent: *
Disallow:
Disallow: /example.aspx/
But I found in Webmaster Tools that Google is still following this page and has cached a number of URLs with that path. Could the query strings be causing the problem? As far as I know, Google does not care about query strings, but just in case. Am I using robots.txt correctly, or does something else need to be done to achieve this?
Your rule is wrong:
Disallow: /example.aspx/
This blocks all URLs in the directory /example.aspx/
If you want to block all URLs for the file /example.aspx, use this rule:
Disallow: /example.aspx
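Because rules are matched as path prefixes, this also covers every query-string variant of the page. You can test it with Google Webmaster Tools.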
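The trailing-slash difference can be checked offline. Below is a minimal sketch using Python's standard `urllib.robotparser`, which does plain prefix matching; it is not Google's parser, so treat it as an approximation of how a standards-following crawler reads the rule. The helper name `is_allowed` is made up for this example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rule: str, url: str) -> bool:
    """Return True if a crawler may fetch `url` under a single Disallow rule."""
    parser = RobotFileParser()
    # Build a one-rule robots.txt in memory instead of fetching it over HTTP.
    parser.parse(["User-agent: *", f"Disallow: {rule}"])
    return parser.can_fetch("*", url)


url = "http://www.test.com/example.aspx?id=x&date=10/12/2014"

# The rule with a trailing slash is not a prefix of the URL's path,
# so the page stays crawlable.
with_slash = is_allowed("/example.aspx/", url)

# Without the trailing slash the rule is a prefix of the path,
# so the page and its query-string variants are blocked.
without_slash = is_allowed("/example.aspx", url)

print(with_slash, without_slash)
```

With the standard library parser, the first check reports the URL as allowed and the second as blocked, which matches the explanation in the answer.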

robots.txt - exclude any URL that contains "/node/"

How do I tell crawlers / bots not to index any URL that contains the /node/ pattern?
The following rule has been in place since day one, but I noticed that Google has still indexed a lot of URLs that have
/node/ in them, e.g. www.mywebsite.com/node/123/32
Disallow: /node/
Is there a rule that says not to index any URL that contains /node/?
Should I write something like the following:
Disallow: /node/*
Update:
The real problem is that despite:
Disallow: /node/
in robots.txt, Google has indexed pages under this URL, e.g. www.mywebsite.com/node/123/32
/node/ is not a physical directory; this is how Drupal 6 shows its content. I guess my problem is that node is not a directory, merely part of the URLs Drupal generates for the content. How do I handle this? Will this work?
Disallow: /*node
Thanks
Disallow: /node/ will disallow any URL whose path starts with /node/ (after the host). The asterisk is not required.
So it will block www.mysite.com/node/bar.html, but will not block www.mysite.com/foo/node/bar.html.
If you want to block anything that contains /node/, you have to write Disallow: */node/
Note also that Googlebot can cache robots.txt for up to 7 days. So if you make a change to your robots.txt today, it might be a week before Googlebot updates its copy of your robots.txt. During that time, it will be using its cached copy.
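The prefix and wildcard behaviour described above can be sketched in a few lines of Python. This is an illustrative approximation of Google-style matching (`*` matches any run of characters, a trailing `$` anchors the end of the path), not Googlebot's actual implementation; `rule_matches` is a hypothetical helper written for this example.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Approximate Google-style matching of a robots.txt rule against a URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, which gives prefix semantics.
    return re.match(regex, path) is not None


# A plain prefix rule blocks paths that start with /node/ ...
print(rule_matches("/node/", "/node/123/32"))
# ... but not paths that merely contain it further along.
print(rule_matches("/node/", "/foo/node/bar.html"))
# A leading wildcard catches /node/ anywhere in the path.
print(rule_matches("*/node/", "/foo/node/bar.html"))
# A trailing wildcard adds nothing to a prefix rule.
print(rule_matches("/node/*", "/node/123/32"))
```

Under this sketch, `/node/` and `/node/*` behave identically, which is why the trailing asterisk is not required.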
Disallow: /node/* is exactly what you want to do. Search engines support wildcards in their robots.txt notation, and the * character means "any characters". See Google's notes on robots.txt for more.
update
An alternative way to make sure search engines stay out of a directory, and all directories below it, is to block them with the robots HTTP header. This can be done by placing the following in an .htaccess file in your node directory:
Header set x-robots-tag: noindex
Your original Disallow was fine. Jim Mischel's comment seemed spot on and would cause me to wonder if it was just taking time for Googlebot to fetch the updated robots.txt and then unindex relevant pages.
A couple additional thoughts:
Your page URLs may appear in Google search results even if you've included them in robots.txt. See: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449 ("...While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web."). To many people, this is counter-intuitive.
Second, I'd highly recommend verifying ownership of your site in Google Webmaster Tools (https://www.google.com/webmasters/tools/home?hl=en), then using tools such as Health->"Fetch as Google" to see real time diagnostics related to retrieving your page. (Does that result indicate that robots.txt is preventing crawling?)
I haven't used it, but Bing has a similar tool: http://www.bing.com/webmaster/help/fetch-as-bingbot-fe18fa0d . It seems well worthwhile to use diagnostic tools provided by Google, Bing, etc. to perform real-time diagnostics on the site.
This question is a bit old, so I hope you've solved the original problem.

SEO: Is it recommended to upload and put live a beta / unfinished version of a web site?

I'm working on this big website and I want to put it online before it's fully finished...
I'm working locally and the database is getting really big, so I wanted to upload the website and continue working on it on the server, while allowing people in so I can test.
The question is whether this is good for SEO. I mean, a lot of SEO-related things are incomplete. For example: there are no friendly URLs, no sitemap, no .htaccess file, lots of 'under construction' sections...
Will Google penalize me forever? How does it work? Does Google index the site and get its structure just once, or is it constantly updating and checking for changes? Will using User-agent: * Disallow: in robots.txt fully stop Google from indexing it? Can I change the robots.txt file later and have Google index it again? What do you recommend?
Sure, just put a robots.txt file in your root so you can be sure that Google doesn't start indexing it.
Like this:
User-agent: *
Disallow: /
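Note that `Disallow: /` (block everything) and an empty `Disallow:` (block nothing) differ by a single character. A minimal sketch with Python's standard `urllib.robotparser` shows how a standards-following crawler would read each file; the `can_crawl` helper and the example.com URL are assumptions made for illustration, and real search engines may apply additional rules of their own.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # Parse the rules from a string rather than fetching them over HTTP.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# File to serve while the site is still in beta: blocks every path.
STAGING = "User-agent: *\nDisallow: /\n"
# File to serve once the site is ready: an empty Disallow blocks nothing.
LAUNCH = "User-agent: *\nDisallow:\n"

page = "http://www.example.com/products/1"  # hypothetical URL

blocked_during_beta = not can_crawl(STAGING, page)
open_after_launch = can_crawl(LAUNCH, page)
print(blocked_during_beta, open_after_launch)
```

Swapping the staging file for the launch file later is enough to let crawlers back in on their next visit.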
This is how I understand the issue:
Google will reach your website if someone has submitted its URL at http://www.google.com/addurl/ or if there is a link to it from another website that is already indexed.
When Google reaches your website, it will look at robots.txt and see what rules are there. If you disallow indexing with rules like the following, Google will not index your website for the time being.
User-agent: *
Disallow: /
But Google will visit your website again, perhaps after some days, and do the same as the first time. If it does not find a robots.txt file, or finds rules that allow indexing like the following, it will start indexing the website's pages and content.
User-agent: *
Allow: /
As for whether to put the website online now: if you disallow Google from indexing it using robots.txt, there is no difference, so go with whichever is better for you.
Note:
I am not 100% sure about the rules I mentioned in this answer, as Google is always changing its indexing techniques.
Also, what I said about Google applies to other search engines such as Yahoo and Bing, but it is a common convention rather than a rule every search engine must follow, so another search engine might index all your website's links even though your robots.txt disallows indexing.
I used to put a staging version of my websites online to test in the live environment before launching the real version, with a robots.txt in place, and I never found any of those staging links in Google, Bing or Yahoo.
As long as your security is not beta quality, it's a good idea to get your site online as early as possible.
Google indexes your site periodically, and will index more frequently as it detects more frequent changes and/or your pagerank increases.