Prevent robots from indexing a restricted-access subdomain

I have a subdomain set up for which I return a 403 for all but one IP.
I also want to avoid the site being indexed by search engines, which is why I added a robots.txt to the root of my subdomain.
However, since I return a 403 on every request to that subdomain, the crawler will also receive a 403 when it requests the robots.txt file.
According to Google, if a robots.txt returns a 403, it will still try to crawl the site.
Is there any way around this? Keen to hear your thoughts.

With robots.txt you can disallow crawling, not indexing.
You can disallow indexing (but not crawling) with the HTML meta-robots or the corresponding HTTP header X-Robots-Tag.
So you have three options:
Whitelist /robots.txt so that it answers with 200. Conforming bots won’t crawl anything on your host (except the robots.txt), but they may index URLs if they find them somehow (e.g., if linked from another site).
User-agent: *
Disallow: /
Add a meta-robots element to each page. Conforming bots may crawl, but they won’t index. However, this only works for HTML documents.
<meta name="robots" content="noindex" />
Send an X-Robots-Tag header for each document. Conforming bots may crawl, but they won’t index.
X-Robots-Tag: noindex
(Sending 403 for each request may in itself be a strong signal that there’s nothing interesting to see; but what to make of it would depend on the bot, of course.)
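For the first option on an Apache server, whitelisting robots.txt while keeping the IP restriction might look like the following .htaccess sketch (assumes Apache 2.4 with mod_authz_core and mod_headers; the IP address is a placeholder):

```apache
# Return 403 for everyone except one allowed IP (placeholder address).
Require ip 203.0.113.5

# But let anyone fetch robots.txt, so bots get a 200 for it.
<Files "robots.txt">
    Require all granted
</Files>

# Belt and braces: also tell conforming bots not to index anything they do see.
Header set X-Robots-Tag "noindex"
```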

Related

Prevent search engines from indexing my API

I have my API at api.website.com, which requires no authentication.
I am looking for a way to disallow Google from indexing my API.
Is there a way to do so?
I already have the disallow in my robots.txt at api.website.com/robots.txt,
but that just prevents Google from crawling it.
User-agent: *
Disallow: /
The usual way would be to remove the Disallow and add a noindex meta tag, but it's an API, hence no meta tags or anything.
Is there any other way to do that?
It seems like there is a way to add a noindex on API calls.
See here https://webmasters.stackexchange.com/questions/24569/why-do-google-search-results-include-pages-disallowed-in-robots-txt/24571#24571
The solution recommended on both of those pages is to add a noindex meta tag to the pages you don't want indexed. (The X-Robots-Tag HTTP header should also work for non-HTML pages. I'm not sure if it works on redirects, though.) Paradoxically, this means that you have to allow Googlebot to crawl those pages (either by removing them from robots.txt entirely, or by adding a separate, more permissive set of rules for Googlebot), since otherwise it can't see the meta tag in the first place.
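If the API is served through Apache, the X-Robots-Tag approach could be sketched like this in the site config or .htaccess for api.website.com (assumes mod_headers is enabled; remember to remove the Disallow so Googlebot can actually fetch the responses and see the header):

```apache
# Send a noindex directive on every response from the API host.
# Unlike a meta tag, this works for JSON and other non-HTML content types.
Header set X-Robots-Tag "noindex"
```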
It is strange that Google is ignoring your /robots.txt file. Try dropping an index.html file in the root web directory and adding the following between the <head>...</head> tags of the web page.
<meta name="robots" content="noindex, nofollow">

Using noodp meta tag in a robots.txt file

Is it possible to add SEO tags such as 'noodp' to a robots.txt file instead of using <meta> tags? I am trying to avoid messing with our CMS template, although I suspect that I may have to...
Could I try something similar to this...
User-Agent: *
Disallow: /hidden
Sitemap: www.example.com
noodp:
I think robots.txt takes precedence over meta tags? With noindex, for instance, the crawler will never even see the page in question. For something like noodp, however, is this still the case?
You can't do this with robots.txt, but you can get the same effect using the X-Robots-Tag response header.
Add something like this to the appropriate part of your .htaccess file:
Header set X-Robots-Tag "noodp"
This tells the server to include the following line in the response headers:
X-Robots-Tag: noodp
Search engines (that support X-Robots-Tag) will interpret this header line exactly the same way they would interpret 'noodp' in a robots meta tag. In general, you can put anything in an X-Robots-Tag header that you can put in a robots meta tag. Note that the page must not be blocked by robots.txt, otherwise the crawler will never request the page, and will therefore never see the header.
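The header can also be scoped to particular resources rather than set site-wide; for instance, a sketch applying noodp only to PDF files (assumes mod_headers; the file pattern is just an example):

```apache
# Apply the directive only to matching files, not the whole site.
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noodp"
</FilesMatch>
```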
No.
The goal of the robots.txt file is to give robots information about crawling, not about what they should do with what they have crawled.
<meta> robots (or X-Robots-Tag instructions) and robots.txt instructions are two very distinct things.
Google gives good information about this in its article Learn about robots.txt files:
robots.txt should only be used to control crawling traffic
If you want to add some robots instructions without messing with your CMS, the X-Robots-Tag HTTP header might be a good solution. You can try adding it through your server config.

Any way to both NoIndex and Prevent Crawling?

I created a new website and I do not want it to be crawled by search engines, nor to appear in search results.
I already created a robots.txt
User-agent: *
Disallow: /
I have an HTML page. I wanted to use
<meta name="robots" content="noindex">
but Google's documentation says it should only be used when a page is not blocked by robots.txt, since a blocked crawler will never see the noindex tag at all.
Is there any way I can use both noindex as well as robots.txt?
There are two solutions, neither of which are elegant.
You are correct that even if you Disallow: /, your URLs might still appear in the search results, though likely without a meta description and with a Google-generated title.
Assuming you are only doing this temporarily, the recommended approach is to put basic HTTP auth in front of your site. This isn't great, since users will have to enter a username and password, but it will prevent your site from being crawled and indexed.
If you can't or don't want to put basic auth in front of your site, the alternative is to still Disallow: / in your Robots.txt file, and use Google Search Console to regularly purge the Google index by requesting the site be removed from the index.
This is inelegant in multiple ways.
You'll have to monitor the search results to see if URLs get indexed
You'll have to manually request the removal in the Google Search Console
Google really didn't intend for the removal feature to be used in this fashion, and who knows if they'll start ignoring your requests over time. But I'd imagine it would actually continue to work even though they'd prefer you didn't use it that way.
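On Apache, the basic-auth approach could be sketched in .htaccess like this (assumes mod_auth_basic; the .htpasswd path is a placeholder, and the credentials file would be created beforehand with the htpasswd utility):

```apache
# Require a login for the whole site; create the credentials file with:
#   htpasswd -c /var/www/.htpasswd someuser
AuthType Basic
AuthName "Restricted site"
AuthUserFile /var/www/.htpasswd
Require valid-user
```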

Should I remove meta-robots (index, follow) when I have a robots.txt?

I'm a bit confused whether I should remove the robots meta tag, if I want search engines to follow my robots.txt rules.
If the robots meta-tag (index, follow) exists on the page, will search engines then ignore my robots.txt file and index the specified disallowed URLs in my robots.txt anyway?
The reason why I'm asking about this, is that search engines (Google mainly) still indexes disallowed pages from my website.
If a search engine’s bot honors your robots.txt, and you disallow crawling of /foo, then the bot will never crawl pages whose URL paths start with /foo. Hence the bot will never know that there are meta-robots elements.
Conversely, this means that if you want to disallow indexing of a page (by specifying meta-robots with noindex), you should not disallow crawling of that page in your robots.txt. Otherwise the noindex is never seen, and the bot assumes that crawling is forbidden, not indexing.
With robots.txt you can tell search engines not to crawl certain pages, but that won't stop them from indexing those pages. If a page disallowed in robots.txt is found by the crawler through an external link, it can still be indexed. That can be prevented through the meta tag.
Thus, robots.txt and the meta tag work differently.
https://developers.google.com/search/reference/robots_meta_tag?hl=en#combining-crawling-with-indexing--serving-directives
Robots meta tags and X-Robots-Tag HTTP headers are discovered when a URL is crawled. If a page is disallowed from crawling through the robots.txt file, then any information about indexing or serving directives will not be found and will therefore be ignored. If indexing or serving directives must be followed, the URLs containing those directives cannot be disallowed from crawling.

How to check if googlebot will index a given url?

We're doing a whitelabel site, which must not be indexed by Google.
Does anyone know of a tool to check whether Googlebot will index a given URL?
I've put <meta name="robots" content="noindex" /> on all pages, so it shouldn't be indexed - however I'd rather be 110% certain by testing it.
I know I could use robots.txt, however the problem with robots.txt is as follows:
Our main site should be indexed, and it's the same IIS (ASP.Net) application as the whitelabel site; the only difference is the URL.
I cannot modify the robots.txt depending on the incoming url, but I can add a meta tag to all pages from my code-behind.
You should add a Robots.txt to your site.
However, the only perfect way to prevent search engines from indexing a site is to require authentication. (Some spiders ignore Robots.txt)
EDIT: You need to add a handler for Robots.txt to serve different files depending on the Host header.
You'll need to configure IIS to send the Robots.txt request through ASP.Net; the exact instructions depend on the IIS version.
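On IIS this means custom handler code, but the underlying idea of serving a different robots.txt per Host header can be illustrated with an Apache mod_rewrite sketch (hostnames and filenames are placeholders):

```apache
RewriteEngine On
# Serve a blocking robots file only when the whitelabel hostname is requested;
# requests to any other host fall through to the normal robots.txt.
RewriteCond %{HTTP_HOST} ^whitelabel\.example\.com$ [NC]
RewriteRule ^robots\.txt$ /robots-disallow.txt [L]
```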
Google Webmaster Tools (google.com/webmasters/tools) will, besides letting you upload a sitemap, do a test crawl of your site and tell you what it crawled, how it rates for certain queries, and what will and will not be crawled.
The test crawl isn't automatically included in Google results. In any case, if you're trying to hide sensitive data from the prying eyes of Google, you cannot count on that alone: put some authentication in the line of fire, no matter what.