robots.txt - which folders to disallow for SEO?

I am currently writing my robots.txt file and have some trouble deciding whether I should allow or disallow some folders for SEO purposes.
Here are the folders I have:
/css/ (css)
/js/ (javascript)
/img/ (images i use for the website)
/php/ (PHP which will return a blank page such as for example checkemail.php which checks an email address or register.php which puts data into a SQL database and sends an email)
/error/ (my error 401,403,404,406,500 html pages)
/include/ (header.html and footer.html I include)
I was thinking about disallowing only the PHP pages and allowing the rest.
What do you think?
Thanks a lot
Laurent

/css and /js -- CSS and JavaScript files will probably be crawled by Googlebot whether or not you disallow them in robots.txt. Google uses them to render your pages for site preview. Google has asked nicely that you not block them in robots.txt.
/img -- Googlebot may crawl this even when it is disallowed in robots.txt, the same way as CSS and JavaScript. Disallowing your images in robots.txt generally prevents them from being indexed in Google image search. Google image search may be a source of visitors to your site, so you may wish to be indexed there.
/php -- It sounds like you don't want spiders hitting the URLs that perform actions. Good call to disallow these in robots.txt.
/error -- If your site is set up correctly, the spiders will probably never know what directory your error pages are served from. They generally get served at the URL that has the error, and the spider never sees their actual URL. This isn't the case if you redirect to them, which isn't recommended practice anyway. As such, I would say there is no need to put them in robots.txt.
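If it helps, here is a minimal sketch of how you could sanity-check such a file locally with Python's standard urllib.robotparser before deploying it. The robots.txt content, the example.com URLs and the is_allowed helper are my own assumptions for illustration, not something from your site.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: only the action scripts under /php/ are disallowed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /php/
"""


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if a crawler honoring robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/php/register.php",
        "https://example.com/css/site.css",
        "https://example.com/img/logo.png",
    ):
        print(url, is_allowed(ROBOTS_TXT, url))
```

Keep in mind that urllib.robotparser only does plain prefix matching, so it is fine for a rule like this but will not tell you how Googlebot treats wildcard patterns.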

Related

Should I remove meta-robots (index, follow) when I have a robots.txt?

I'm a bit confused whether I should remove the robots meta tag, if I want search engines to follow my robots.txt rules.
If the robots meta-tag (index, follow) exists on the page, will search engines then ignore my robots.txt file and index the specified disallowed URLs in my robots.txt anyway?
The reason I'm asking is that search engines (mainly Google) still index disallowed pages from my website.
If a search engine’s bot honors your robots.txt, and you disallow crawling of /foo, then the bot will never crawl pages whose URL paths start with /foo. Hence the bot will never know that there are meta-robots elements.
Conversely, this means that if you want to disallow indexing of a page (by specifying meta-robots with noindex), you should not disallow crawling of this page in your robots.txt. Otherwise the noindex is never read, and the bot only knows that crawling is forbidden, not indexing.
With the robots.txt you can tell search engines not to crawl certain pages - but it wouldn't stop them from indexing the pages. If a page which is disallowed in the robots.txt is found by the crawler through an external link it can be indexed. That can be prevented through the meta-tag.
Thus, the robots.txt and the meta-tag do work differently.
https://developers.google.com/search/reference/robots_meta_tag?hl=en#combining-crawling-with-indexing--serving-directives
Robots meta tags and X-Robots-Tag HTTP headers are discovered when a URL is crawled. If a page is disallowed from crawling through the robots.txt file, then any information about indexing or serving directives will not be found and will therefore be ignored. If indexing or serving directives must be followed, the URLs containing those directives cannot be disallowed from crawling.
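The interaction described in these answers can be written down as a tiny decision function. This is only a simplified model of a crawler that honors robots.txt; the function name and its two boolean inputs are made up for illustration and do not correspond to any real API.

```python
def may_appear_in_index(disallowed_in_robots_txt: bool, has_noindex: bool) -> bool:
    """Simplified model of a well-behaved crawler's indexing decision."""
    if disallowed_in_robots_txt:
        # The page is never fetched, so a noindex directive on it is never
        # seen. The bare URL can still be indexed if other pages link to it.
        return True
    # The page is crawled, so the noindex directive is read and obeyed.
    return not has_noindex


if __name__ == "__main__":
    for disallowed in (False, True):
        for noindex in (False, True):
            print(disallowed, noindex, may_appear_in_index(disallowed, noindex))
```

The case to notice is the one where both flags are set: the Disallow rule hides the noindex from the crawler, so the URL can still end up in the index.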

Single robots.txt file for all subdomains

I have a site (example.com) and have my robots.txt set up in the root directory. I also have multiple subdomains (foo.example.com, bar.example.com, and more to come in the future) whose robots.txt will all be identical to that of example.com. I know that I can place a file at the root of each subdomain, but I'm wondering if it's possible to redirect the crawlers searching for robots.txt on any subdomain to example.com/robots.txt?
Sending a redirect header for your robots.txt file is not advised, nor is it officially supported.
Google's documentation specifically states:
Handling of robots.txt redirects to disallowed URLs is undefined and discouraged.
But the documentation does say that redirects "will be generally followed". If you add your subdomains to Google Webmaster Tools and go to "Crawl > Blocked URLs", you can test the robots.txt files on your subdomains that 301 redirect. The test should report them as working.
However, with that said, I would strongly suggest that you just symlink the files into place so that each robots.txt file responds with a 200 OK at the appropriate URL. This is much more in line with the original robots.txt specification, as well as Google's documentation, and who knows exactly how Bing / Yahoo will handle redirects over time.
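A rough sketch of the symlink approach, written in Python so it can be rerun whenever a subdomain is added. The function name and the directory layout are assumptions; your document roots will differ, and the web server has to be configured to follow symlinks for this to work.

```python
from pathlib import Path
from typing import Iterable


def link_shared_robots(shared_file: Path, docroots: Iterable[Path]) -> None:
    """Point robots.txt in every document root at one shared file."""
    for docroot in docroots:
        target = docroot / "robots.txt"
        # Replace any existing file or stale link so the call is repeatable.
        if target.is_symlink() or target.exists():
            target.unlink()
        target.symlink_to(shared_file)


if __name__ == "__main__":
    # Hypothetical layout: one directory per host under /var/www.
    link_shared_robots(
        Path("/var/www/example.com/robots.txt"),
        [Path("/var/www/foo.example.com"), Path("/var/www/bar.example.com")],
    )
```

Each subdomain then answers /robots.txt directly with a 200, and there is still only one file to maintain.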

How to Disallow Landing Pages Using robots.txt file?

I'd like to start using specific landing pages in a marketing campaign. A quick search on Google shows how to disallow specific pages and/or directories using a robots.txt file.
If I don't want the search engines to index these landing pages, should I put single-page entries in the robots.txt file, or should I put the pages in specific directories and disallow the directories?
My concern is that anybody can read a robots.txt file and if the actual page names are visible within the robots.txt file it defeats the purpose.
"It defeats the purpose." How so? The purpose of robots.txt is to prevent crawlers from reading particular files or groups of files. Whether you exclude the individual files or put them all in a directory and exclude that directory is irrelevant as far as the crawler's behavior is concerned.
The benefit to putting them all in directories is that your robots.txt file is smaller and easier to manage. You don't have to add a new entry every time you create a new landing page.
You're right that putting a file name in robots.txt lets anybody who reads the file know that the file is there. That shouldn't be a problem. If you have sensitive information that you don't want others to see then it shouldn't be accessible, regardless of whether it's mentioned in robots.txt. Because if the file is publicly accessible, then a bot is going to find it even if you don't mention it in robots.txt.
robots.txt is just a guideline. The existence of a disallow line in robots.txt doesn't prevent an unfriendly crawler from looking at those pages. It just tells the crawler that you don't want them looking at those pages. But crawlers can ignore robots.txt. They shouldn't, and you can block them if they do, but robots.txt itself is more like a stop sign than a road block.
You should be able to simply use the NOINDEX META tag in the HEAD of your page.
http://www.robotstxt.org/meta.html

robots.txt - exclude any URL that contains "/node/"

How do I tell crawlers / bots not to index any URL that has /node/ pattern?
The following rule has been in my robots.txt since day one, but I noticed that Google has still indexed a lot of URLs that have /node/ in them, e.g. www.mywebsite.com/node/123/32:
Disallow: /node/
Is there a directive that tells crawlers not to index any URL that contains /node/?
Should I write something like the following?
Disallow: /node/*
Update:
The real problem is despite:
Disallow: /node/
in robots.txt, Google has indexed pages under this URL e.g. www.mywebsite.com/node/123/32
/node/ is not a physical directory; this is how Drupal 6 shows its content. I guess my problem is that node is not a directory, merely part of the URLs that Drupal generates for the content. How do I handle this? Will this work?
Disallow: /*node
Thanks
Disallow: /node/ will disallow any url that starts with /node/ (after the host). The asterisk is not required.
So it will block www.mysite.com/node/bar.html, but will not block www.mysite.com/foo/node/bar.html.
If you want to block anything that contains /node/, you have to write Disallow: */node/
Note also that Googlebot caches robots.txt. Google's documentation says the file is generally cached for up to 24 hours, and possibly longer when a fresh copy cannot be fetched. So if you make a change to your robots.txt today, Googlebot may keep using its cached copy for a while before it picks up the new rules.
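To make the difference between prefix rules and wildcard rules concrete, here is a small sketch of the matching logic as the major search engines document it: a rule matches from the start of the path, * matches any run of characters, and a trailing $ anchors the end. The rule_matches function is my own illustration, not a real crawler's code, and it ignores details such as percent-encoding and Allow/Disallow precedence.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and let each '*' match any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which gives the prefix behaviour.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    for pattern, path in (
        ("/node/", "/node/123/32"),
        ("/node/", "/foo/node/bar.html"),
        ("/node/*", "/node/123/32"),
        ("*/node/", "/foo/node/bar.html"),
    ):
        print(pattern, path, rule_matches(pattern, path))
```

With this model, /node/ and /node/* behave identically, which is why the trailing asterisk is not required.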
Disallow: /node/* is exactly what you want to do. The major search engines support wildcards in their robots.txt notation, and the * character means "any sequence of characters". See Google's notes on robots.txt for more.
update
An alternative way to keep search engines from indexing a directory, and all directories below it, is to send the X-Robots-Tag HTTP header. Assuming Apache with mod_headers enabled, this can be done by placing the following in an .htaccess file in your node directory:
Header set x-robots-tag: noindex
Your original Disallow was fine. Jim Mischel's comment seemed spot on and would cause me to wonder if it was just taking time for Googlebot to fetch the updated robots.txt and then unindex relevant pages.
A couple additional thoughts:
Your page URLs may appear in Google search results even if you've disallowed them in robots.txt. See: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449 ("...While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web."). To many people, this is counter-intuitive.
Second, I'd highly recommend verifying ownership of your site in Google Webmaster Tools (https://www.google.com/webmasters/tools/home?hl=en), then using tools such as Health->"Fetch as Google" to see real time diagnostics related to retrieving your page. (Does that result indicate that robots.txt is preventing crawling?)
I haven't used it, but Bing has a similar tool: http://www.bing.com/webmaster/help/fetch-as-bingbot-fe18fa0d . It seems well worthwhile to use diagnostic tools provided by Google, Bing, etc. to perform real-time diagnostics on the site.
This question is a bit old, so I hope you've solved the original problem.

How to check if googlebot will index a given url?

We're building a white-label site, which must not be indexed by Google.
Does anyone know a tool to check whether Googlebot will index a given URL?
I've put <meta name="robots" content="noindex" /> on all pages, so it shouldn't be indexed - however I'd rather be 110% certain by testing it.
I know I could use robots.txt, however the problem with robots.txt is as follows:
Our mainsite should be indexed, and it's the same application on the IIS (ASP.Net) as the whitelabel site - the only difference is the url.
I cannot modify the robots.txt depending on the incoming url, but I can add a meta tag to all pages from my code-behind.
You should add a Robots.txt to your site.
However, the only perfect way to prevent search engines from indexing a site is to require authentication. (Some spiders ignore Robots.txt)
EDIT: You need to add a handler for Robots.txt that serves different files depending on the Host header.
You'll need to configure IIS to send the Robots.txt request through ASP.Net; the exact instructions depend on the IIS version.
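The idea is not specific to ASP.Net. As a language-neutral illustration, here is roughly what such a handler could look like as a small Python WSGI app. The host name whitelabel.example.com and all of the names below are placeholders I made up; in your case the same logic would live in an ASP.Net handler.

```python
# Hypothetical host name of the white-label site; replace with the real one.
NOINDEX_HOSTS = {"whitelabel.example.com"}

ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def robots_for_host(host: str) -> str:
    """Pick the robots.txt body based on the request's Host header."""
    hostname = host.split(":")[0].lower()  # drop an optional port
    return BLOCK_ALL if hostname in NOINDEX_HOSTS else ALLOW_ALL


def robots_app(environ, start_response):
    """WSGI callable meant to be mounted at /robots.txt."""
    body = robots_for_host(environ.get("HTTP_HOST", "")).encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

The main site keeps an open robots.txt while the white-label host gets Disallow: /, even though both are served by the same application.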
Google Webmaster Tools (google.com/webmasters/tools) will, besides letting you upload a sitemap, do a test crawl of your site and tell you what was crawled, how it rates for certain queries, and what will and will not be crawled.
The test crawl isn't automatically included in Google results. In any case, if you're trying to hide sensitive data from the prying eyes of Google, you cannot count on that alone: put some authentication in front of it, no matter what.
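As a complement to those tools, a rough local check that a page really carries a noindex directive could look like the sketch below. It only inspects the HTML and headers you hand it and says nothing about what Googlebot will actually do; the class and function names are my own, and how you fetch the page is up to you.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> / "googlebot" tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def page_has_noindex(html: str, headers=None) -> bool:
    """True if the HTML or an X-Robots-Tag header asks not to be indexed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = set(parser.directives)
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            found.update(d.strip().lower() for d in value.split(","))
    # "none" is shorthand for "noindex, nofollow".
    return "noindex" in found or "none" in found


if __name__ == "__main__":
    sample = '<html><head><meta name="robots" content="noindex" /></head></html>'
    print(page_has_noindex(sample))
```

You could run this over every page template of the white-label site as part of your tests, but it does not replace authentication if the content is sensitive.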