Should I remove meta-robots (index, follow) when I have a robots.txt? - seo

I'm a bit confused about whether I should remove the robots meta tag if I want search engines to follow my robots.txt rules.
If the robots meta tag (index, follow) exists on a page, will search engines then ignore my robots.txt file and index the URLs I have disallowed there anyway?
The reason I'm asking is that search engines (mainly Google) still index pages that are disallowed in my robots.txt.

If a search engine's bot honors your robots.txt, and you disallow crawling of /foo, then the bot will never crawl pages whose URL paths start with /foo. Hence the bot will never see any meta-robots elements on those pages.
Conversely, this means that if you want to disallow indexing of a page (by specifying meta-robots with noindex), you should not disallow crawling of that page in your robots.txt. Otherwise the noindex is never fetched, and all the bot knows is that crawling is forbidden, not indexing.
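For example, this combination defeats itself (a minimal sketch, using a hypothetical /private/ path): a page at /private/page.html carrying
<meta name="robots" content="noindex">
will never have that tag read if robots.txt contains
User-agent: *
Disallow: /private/
because the Disallow stops the bot before it can fetch the page. To get the page de-indexed, drop the Disallow line and keep the meta tag.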

With robots.txt you can tell search engines not to crawl certain pages, but that won't stop them from indexing those pages. If a page that is disallowed in robots.txt is found by the crawler through an external link, it can still be indexed. That can only be prevented with the meta tag.
Thus, robots.txt and the meta tag work differently.
https://developers.google.com/search/reference/robots_meta_tag?hl=en#combining-crawling-with-indexing--serving-directives
Robots meta tags and X-Robots-Tag HTTP headers are discovered when a URL is crawled. If a page is disallowed from crawling through the robots.txt file, then any information about indexing or serving directives will not be found and will therefore be ignored. If indexing or serving directives must be followed, the URLs containing those directives cannot be disallowed from crawling.

Related

Prevent search engines from indexing my api

I have my API at api.website.com, which requires no authentication.
I am looking for a way to prevent Google from indexing my API.
Is there a way to do so?
I already have the disallow in my robots.txt at api.website.com/robots.txt,
but that just prevents Google from crawling it.
User-agent: *
Disallow: /
The usual way would be to remove the Disallow and add a noindex meta tag, but it's an API, so there is no HTML to put a meta tag in.
Is there any other way to do that?
It seems there is a way to add noindex to API responses.
See here https://webmasters.stackexchange.com/questions/24569/why-do-google-search-results-include-pages-disallowed-in-robots-txt/24571#24571
The solution recommended on both of those pages is to add a noindex meta tag to the pages you don't want indexed. (The X-Robots-Tag HTTP header should also work for non-HTML pages. I'm not sure if it works on redirects, though.) Paradoxically, this means that you have to allow Googlebot to crawl those pages (either by removing them from robots.txt entirely, or by adding a separate, more permissive set of rules for Googlebot), since otherwise it can't see the meta tag in the first place.
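If the API sits behind Apache, one way to attach such a header to every API response is mod_headers (a hedged sketch, assuming mod_headers is enabled), placed in the subdomain's .htaccess or vhost config:
Header set X-Robots-Tag "noindex"
Combined with removing the Disallow, Googlebot can then crawl the responses, see the header, and drop them from its index.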
It is strange Google is ignoring your /robots.txt file. Try dropping an index.html file in the root web directory and adding the following between the <head>...</head> tags of the web page.
<meta name="robots" content="noindex, nofollow">

Prevent robots from indexing a restricted access sub domain

I have a subdomain set up for which I return a 403 for all but one IP.
I also want to avoid the site being indexed by search engines, which is why I added a robots.txt to the root of my sub domain.
However, since I return a 403 on every request to that subdomain, crawlers will also receive a 403 when they request the robots.txt file.
According to Google, if a robots.txt returns a 403, it will still try to crawl the site.
Is there any way around this? Keen to hear your thoughts.
With robots.txt you can disallow crawling, not indexing.
You can disallow indexing (but not crawling) with the HTML meta-robots or the corresponding HTTP header X-Robots-Tag.
So you have three options:
Whitelist /robots.txt so that it answers with 200 (a config sketch for this option follows the list). Conforming bots won't crawl anything on your host (except the robots.txt itself), but they may index URLs if they find them somehow (e.g., if linked from another site).
User-agent: *
Disallow: /
Add a meta-robots element to each page. Conforming bots may crawl, but they won't index. However, this only works for HTML documents.
<meta name="robots" content="noindex" />
Send an X-Robots-Tag header for each document. Conforming bots may crawl, but they won't index.
X-Robots-Tag: noindex
(Sending 403 for each request may in itself be a strong signal that there’s nothing interesting to see; but what to make of it would depend on the bot, of course.)
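A hedged Apache 2.4 sketch of the first option, assuming the 403 comes from an IP allow-list in the subdomain's vhost config (203.0.113.7 stands in for your one permitted IP):
<Location "/">
    Require ip 203.0.113.7
</Location>
<Location "/robots.txt">
    Require all granted
</Location>
If the 403 is produced somewhere else (application code, a proxy), the idea is the same: exempt the /robots.txt path so it returns 200 alongside the Disallow: / rules above.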

How to Disallow Landing Pages Using robots.txt file?

I'd like to start using specific landing pages in a marketing campaign. A quick search on Google shows how to disallow specific pages and/or directories using a robots.txt file. (link)
If I don't want the search engines to index these landing pages, should I put single-page entries in the robots.txt file, or should I put them in specific directories and disallow the directories?
My concern is that anybody can read a robots.txt file, and if the actual page names are visible within it, that defeats the purpose.
"It defeats the purpose." How so? The purpose of robots.txt is to prevent crawlers from reading particular files or groups of files. Whether you exclude the individual files or put them all in a directory and exclude that directory is irrelevant as far as the crawler's behavior is concerned.
The benefit to putting them all in directories is that your robots.txt file is smaller and easier to manage. You don't have to add a new entry every time you create a new landing page.
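For example, keeping every campaign page under one directory (a hypothetical /landing/ path) lets a single rule cover current and future pages:
User-agent: *
Disallow: /landing/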
You're right that putting a file name in robots.txt lets anybody who reads the file know that the file is there. That shouldn't be a problem. If you have sensitive information that you don't want others to see then it shouldn't be accessible, regardless of whether it's mentioned in robots.txt. Because if the file is publicly accessible, then a bot is going to find it even if you don't mention it in robots.txt.
robots.txt is just a guideline. The existence of a disallow line in robots.txt doesn't prevent an unfriendly crawler from looking at those pages. It just tells the crawler that you don't want them looking at those pages. But crawlers can ignore robots.txt. They shouldn't, and you can block them if they do, but robots.txt itself is more like a stop sign than a road block.
You should be able to simply use the NOINDEX META tag in the HEAD of your page.
http://www.robotstxt.org/meta.html

robots.txt which folders to disallow - SEO?

I am currently writing my robots.txt file and have some trouble deciding whether I should allow or disallow some folders for SEO purposes.
Here are the folders I have:
/css/ (css)
/js/ (javascript)
/img/ (images i use for the website)
/php/ (PHP scripts that return a blank page, for example checkemail.php, which checks an email address, or register.php, which puts data into a SQL database and sends an email)
/error/ (my 401, 403, 404, 406, and 500 error HTML pages)
/include/ (header.html and footer.html that I include)
I was thinking about disallowing only the PHP pages and allowing the rest.
What do you think?
Thanks a lot
Laurent
/css and /js -- CSS and JavaScript files will probably be crawled by Googlebot whether or not you have them in robots.txt. Google uses them to render your pages for site preview, and Google has asked nicely that you not put them in robots.txt.
/img -- Googlebot may crawl this even when it is in robots.txt, the same way as CSS and JavaScript. Putting your images in robots.txt generally prevents them from being indexed in Google image search. Google image search may be a source of visitors to your site, so you may wish to be indexed there.
/php -- It sounds like you don't want spiders hitting the URLs that perform actions. Good call to use robots.txt here.
/error -- If your site is set up correctly, the spiders will probably never know what directory your error pages are served from. They generally get served at the URL that has the error, and the spider never sees their actual URL. This isn't the case if you redirect to them, which isn't recommended practice anyway. As such, I would say there is no need to put them in robots.txt.
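Putting that advice together, a minimal robots.txt for this layout might be just the following (a sketch based on the folder names above):
User-agent: *
Disallow: /php/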

robots.txt - exclude any URL that contains "/node/"

How do I tell crawlers/bots not to index any URL that contains the /node/ pattern?
The following has been in place since day one, but I noticed that Google has still indexed a lot of URLs that have /node/ in them, e.g. www.mywebsite.com/node/123/32
Disallow: /node/
Is there any directive that says: do not index any URL that contains /node/?
Should I write something like following:
Disallow: /node/*
Update:
The real problem is that, despite having
Disallow: /node/
in robots.txt, Google has indexed pages under this path, e.g. www.mywebsite.com/node/123/32
/node/ is not a physical directory; this is how Drupal 6 shows its content. I guess this is my problem: node is not a directory, merely part of the URLs being generated by Drupal for the content. How do I handle this? Will this work?
Disallow: /*node
Thanks
Disallow: /node/ will disallow any URL that starts with /node/ (after the host). The asterisk is not required.
So it will block www.mysite.com/node/bar.html, but will not block www.mysite.com/foo/node/bar.html.
If you want to block anything that contains /node/, you have to write Disallow: */node/
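To illustrate with the URLs from the question (a sketch; wildcard support varies by search engine):
Disallow: /node/
blocks www.mywebsite.com/node/123/32 but not www.mywebsite.com/foo/node/123, whereas
Disallow: */node/
blocks both.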
Note also that Googlebot can cache robots.txt for up to 7 days. So if you make a change to your robots.txt today, it might be a week before Googlebot updates its copy of your robots.txt. During that time, it will be using its cached copy.
Disallow: /node/* is exactly what you want to do. Search engines support wildcards in their robots.txt notation, and the * character means "any characters". See Google's notes on robots.txt for more.
Update:
An alternative way to keep a directory, and everything below it, out of search engine indexes is to send the X-Robots-Tag HTTP header. This can be done by placing the following in an .htaccess file in your node directory:
Header set X-Robots-Tag "noindex"
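Since /node/ is not a physical directory in Drupal 6 (as noted in the question), there is no node folder to drop an .htaccess into; a variation is to set the header conditionally from the site root's .htaccess (a hedged sketch, assuming Apache with mod_headers and mod_setenvif enabled):
SetEnvIf Request_URI "^/node/" NOINDEX_NODE
Header set X-Robots-Tag "noindex" env=NOINDEX_NODE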
Your original Disallow was fine. Jim Mischel's comment seemed spot on and would cause me to wonder if it was just taking time for Googlebot to fetch the updated robots.txt and then unindex relevant pages.
A couple additional thoughts:
Your page URLs may appear in Google search results even if you've included them in robots.txt. See: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449 ("...While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web."). To many people, this is counter-intuitive.
Second, I'd highly recommend verifying ownership of your site in Google Webmaster Tools (https://www.google.com/webmasters/tools/home?hl=en), then using tools such as Health->"Fetch as Google" to see real time diagnostics related to retrieving your page. (Does that result indicate that robots.txt is preventing crawling?)
I haven't used it, but Bing has a similar tool: http://www.bing.com/webmaster/help/fetch-as-bingbot-fe18fa0d . It seems well worthwhile to use diagnostic tools provided by Google, Bing, etc. to perform real-time diagnostics on the site.
This question is a bit old, so I hope you've solved the original problem.