preventing google from indexing/caching - indexing

Is there a reliable way to prevent Google from crawling/indexing/caching a page?
I am thinking about creating a product where users could temporarily share information, using temporary URLs.
The information is not very confidential, but I'd definitely like to not see it show up in some cache or even in search results.
What's the most reliable way of doing this, and what are the possible pitfalls?

Make a robots.txt file. See http://www.robotstxt.org/ for information.
You can also use <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> in the <HEAD> of your index.html.
Google's specifications for their robots.txt handling are here.
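To sanity-check a robots.txt rule before deploying it, something like the following sketch may help. It uses Python's standard urllib.robotparser; the /share/ prefix and the example.com URLs are assumptions made up for illustration, not part of the question. Note that this only tells you whether a well-behaved crawler would be allowed to fetch a URL, not whether the URL can end up in an index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep every crawler away from the temporary-share URLs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /share/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/share/abc123", "https://example.com/about"):
        print(url, is_crawl_allowed(ROBOTS_TXT, "Googlebot", url))
```

Crawlers that ignore robots.txt are not affected by any of this, so treat it as a check of your own rules rather than a guarantee.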

Related

Prevent search engines from indexing my api

I have my api at api.website.com which requires no authentication.
I am looking for a way to disallow google from indexing my api.
Is there a way to do so?
I already have the disallow in my robots at api.website.com/robots.txt
but that just prevents google from crawling it.
User-agent: *
Disallow: /
The usual way would be to remove the Disallow and add a noindex meta tag but it's an API hence no meta tags or anything.
Is there any other way to do that?
It seems like there is a way to add a noindex on api calls.
See here https://webmasters.stackexchange.com/questions/24569/why-do-google-search-results-include-pages-disallowed-in-robots-txt/24571#24571
The solution recommended on both of those pages is to add a noindex meta tag to the pages you don't want indexed. (The X-Robots-Tag HTTP header should also work for non-HTML pages. I'm not sure if it works on redirects, though.) Paradoxically, this means that you have to allow Googlebot to crawl those pages (either by removing them from robots.txt entirely, or by adding a separate, more permissive set of rules for Googlebot), since otherwise it can't see the meta tag in the first place.
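For an API that returns JSON rather than HTML, the X-Robots-Tag header mentioned above can be attached to every response. Here is a minimal sketch as a plain WSGI app; the api_app name and the JSON payload are hypothetical, and a real API would set the header in its own framework or web server config instead.

```python
import json


def api_app(environ, start_response):
    """Tiny WSGI app whose every response carries X-Robots-Tag: noindex."""
    body = json.dumps({"status": "ok"}).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
        # The crawler must be allowed to fetch the URL to see this header,
        # so the path must not be disallowed in robots.txt.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally you could pass api_app to wsgiref.simple_server.make_server from the standard library and inspect the response headers with curl -I.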
It is strange Google is ignoring your /robots.txt file. Try dropping an index.html file in the root web directory and adding the following between the <head>...</head> tags of the web page.
<meta name="robots" content="noindex, nofollow">

Any way to both NoIndex and Prevent Crawling?

I created a new website and I do not want it to be crawled by search engines as well as not appear in search results.
I already created a robots.txt
User-agent: *
Disallow: /
I have a html page. I wanted to use
<meta name="robots" content="noindex">
but Google page says it should be used when a page is not blocked by robots.txt as robots.txt will not see noindex tag at all.
Is there any way I can use both noindex as well as robots.txt?
There are two solutions, neither of which is elegant.
You are correct that even if you Disallow: /, your URLs might still appear in the search results, just likely without a meta description and with a Google-generated title.
Assuming you are only doing this temporarily, the recommended approach is to put basic HTTP auth in front of your site. This isn't great since users will have to enter a username and password, but it will prevent your site from getting crawled and indexed.
If you can't or don't want to put basic auth in front of your site, the alternative is to still Disallow: / in your robots.txt file, and use Google Search Console to regularly purge the Google index by requesting that the site be removed from the index.
This is inelegant in multiple ways.
You'll have to monitor the search results to see if URLs get indexed
You'll have to manually request the removal in the Google Search Console
Google really didn't intend for the removal feature to be used in this fashion, and who knows if they'll start ignoring your requests over time. But I'd imagine it would actually continue to work even though they'd prefer you didn't use it that way.
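The basic auth option is normally a few lines of web server configuration, but to show what it amounts to, here is a rough sketch as WSGI middleware. The credentials, the realm name and the function names are all made up for illustration; in practice you would not hard-code a password like this.

```python
import base64
import hmac

# Hypothetical credentials, for illustration only.
USERNAME = "preview"
PASSWORD = "change-me"


def is_authorized(auth_header):
    """Check an Authorization header value against the credentials above."""
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        raw = base64.b64decode(auth_header[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # compare_digest avoids leaking information through timing differences.
    user_ok = hmac.compare_digest(user.encode(), USERNAME.encode())
    pass_ok = hmac.compare_digest(password.encode(), PASSWORD.encode())
    return user_ok and pass_ok


def require_basic_auth(app):
    """Wrap a WSGI app so that unauthenticated requests get a 401."""
    def wrapped(environ, start_response):
        if is_authorized(environ.get("HTTP_AUTHORIZATION")):
            return app(environ, start_response)
        body = b"Authentication required\n"
        start_response("401 Unauthorized", [
            ("WWW-Authenticate", 'Basic realm="preview"'),
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return wrapped
```

A crawler that never sends credentials only ever sees the 401 response, so there is no page content for it to index.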

Precedence of X-Robots-Tag header vs robots meta tag

I've placed the following Header in my vhost config:
Header set X-Robots-Tag "noindex, nofollow"
The goal here is to just disable search engines from indexing my testing environment. The site is Wordpress and there is a plugin installed to manage per-page the meta robots settings. For example:
<meta name="robots" content="index, follow" />
So my question is, which directive will take precedence over the other since both are being set on every page?
I am not sure if a definitive answer can be given to the question, as the behavior may be implementation-dependent (on the robot side).
However, I think there is reasonable evidence that X-Robots-Tag will take precedence over <meta name="robots" .... See :
One significant difference between the X-Robots-Tag and the robots meta directive is:
X-Robots-Tag is part of the HTTP protocol header.
<meta name="robots" ... is part of the HTML document header.
Therefore, X-Robots-Tag belongs to the HTTP protocol layer, while <meta name="robots" ... belongs to the HTML document layer.
As they belong to a different protocol layer, they will not be parsed simultaneously by the (robot) client getting the page: The HTTP layer will be parsed first, and the HTML in a later step.
(Also, it should be noted that X-Robots-Tag and <meta name="robots" ... are not supported by all robots. Google and Yahoo/Bing support both, but according to this some support only <meta name="robots" ..., others support neither.)
Summary :
if supported by the robot, X-Robots-Tag will be processed first ; restrictions (noindex, nofollow) apply (and <meta name="robots" ... is ignored).
else, <meta name="robots" ... directive applies.
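The summary above can be written down as a small function. To be clear, this is only a model of the reasoning in this answer, not of what any particular crawler actually implements; the function names and the robot_supports_header flag are invented for the sketch.

```python
def parse_directives(value):
    """Split a value like 'noindex, nofollow' into a set of lowercase tokens."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def effective_directives(header_value, meta_value, robot_supports_header=True):
    """Model of the summary: the header wins if the robot supports it,
    otherwise the meta tag applies."""
    if robot_supports_header and header_value:
        return parse_directives(header_value)
    if meta_value:
        return parse_directives(meta_value)
    # Assumed default when nothing is specified: index and follow.
    return {"index", "follow"}


if __name__ == "__main__":
    print(effective_directives("noindex, nofollow", "index, follow"))
    print(effective_directives("noindex, nofollow", "index, follow",
                               robot_supports_header=False))
```

With the header and meta tag from the question, the first call yields the noindex/nofollow set and the second, for a robot that ignores the header, falls back to the meta tag.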
Just an update to Dan's experience, I also have both the
Header set X-Robots-Tag "noindex, nofollow"
and
<meta name="robots" content="index, follow" />
on one of my WordPress sites, and a check in Google Search Console confirmed that the noindex in X-Robots-Tag is taking precedence, as the pages have been crawled but aren't indexed. So the logic in the answer above is indeed correct.
In my recent experience, when Google sees mixed messages it prefers positive action by default, i.e. it favours indexing, and will meanwhile flag the issue as a critical error/warning in your webmaster tools console if you have one.
See your site's status in Google here: https://www.google.com/webmasters/
See your site's status in Bing here: http://www.bing.com/toolbox/webmaster (note that Yahoo search is now powered by Bing)
Google takes this positive-by-default action because lots of site owners unwittingly have a dodgy CMS semi-blocking robots, and we know how Google loves to accumulate as much data as it can. Any excuse!
If the technical settings are erroneous they're liable to be totally disregarded, and we know that search engines index and follow by default when no settings are specified.

How to hide robot.txt from visitors?

I have seen sites hide the robot.txt file.
If you enter the name of the site as
http://www.mysite.com/robot.txt
you will not receive the robot.txt.
I also want to hide the robot.txt file from visitors. How can I do it?
Is there a connection with these lines?
<meta name="ROBOTS" content="NOODP">
<meta name="Slurp" content="NOYDIR">
I do not understand the meaning of this code.
Thank you!
I'm not sure exactly what you're asking, but couldn't you do that with URL rewrites? You might be able to display the robots.txt file for visitors with the User-Agent string of a crawler (for instance, "Googlebot"), and then redirect to a 404 if it's a non-crawler UA.
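As a rough illustration of that idea (written as a small WSGI app rather than rewrite rules), the sketch below serves /robots.txt only when the User-Agent looks like a crawler and answers 404 otherwise. The token list and the robots.txt body are assumptions. Keep in mind that anyone can send a crawler's User-Agent string, so this hides the file from casual visitors only.

```python
# Assumed, non-exhaustive list of substrings that identify crawler User-Agents.
CRAWLER_TOKENS = ("googlebot", "bingbot")

# Hypothetical robots.txt content.
ROBOTS_BODY = b"User-agent: *\nDisallow: /\n"


def robots_app(environ, start_response):
    """Serve /robots.txt to crawler User-Agents only; answer 404 otherwise."""
    user_agent = environ.get("HTTP_USER_AGENT", "").lower()
    wants_robots = environ.get("PATH_INFO") == "/robots.txt"
    looks_like_crawler = any(token in user_agent for token in CRAWLER_TOKENS)

    if wants_robots and looks_like_crawler:
        body, status = ROBOTS_BODY, "200 OK"
    else:
        body, status = b"Not Found\n", "404 Not Found"

    start_response(status, [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```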

How to check if googlebot will index a given url?

We're doing a whitelabel site, which mustn't be google indexed.
Does anyone know a tool to check if the googlebot will index a given url ?
I've put <meta name="robots" content="noindex" /> on all pages, so it shouldn't be indexed - however I'd rather be 110% certain by testing it.
I know I could use robots.txt, however the problem with robots.txt is as follows:
Our mainsite should be indexed, and it's the same application on the IIS (ASP.Net) as the whitelabel site - the only difference is the url.
I cannot modify the robots.txt depending on the incoming url, but I can add a meta tag to all pages from my code-behind.
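One way to at least verify that the tag is really being emitted is to inspect the page markup (and, if present, the X-Robots-Tag response header) yourself. The sketch below does that with Python's standard html.parser; the has_noindex name is made up, and it only checks what your server sends, it cannot prove how Googlebot will act on it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> / <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for part in attr.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def has_noindex(html, x_robots_tag=None):
    """Return True if the HTML or the optional header value contains noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = set(parser.directives)
    if x_robots_tag:
        directives |= {p.strip().lower() for p in x_robots_tag.split(",") if p.strip()}
    return "noindex" in directives
```

You would feed it the body and headers of a real response, fetched for example with urllib.request, once for the whitelabel host and once for the main site, and expect True only for the former.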
You should add a robots.txt to your site.
However, the only perfect way to prevent search engines from indexing a site is to require authentication. (Some spiders ignore robots.txt.)
EDIT: You need to add a handler for robots.txt to serve different files depending on the Host header.
You'll need to configure IIS to send the robots.txt request through ASP.Net; the exact instructions depend on the IIS version.
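The handler logic itself is small. Here it is sketched in Python rather than ASP.Net, just to show the idea of picking the robots.txt body from the Host header; the host names are placeholders, and falling back to a block-everything file for unknown hosts is my own assumption.

```python
# Placeholder host names; replace with the real main and whitelabel hosts.
ROBOTS_BY_HOST = {
    "www.mainsite.example": "User-agent: *\nDisallow:\n",     # allow everything
    "whitelabel.example": "User-agent: *\nDisallow: /\n",     # block everything
}

# Assumed conservative default for hosts not listed above.
DEFAULT_ROBOTS = "User-agent: *\nDisallow: /\n"


def robots_for_host(host_header):
    """Pick the robots.txt body for a Host header value such as 'example.com:443'."""
    host = (host_header or "").split(":")[0].strip().lower()
    return ROBOTS_BY_HOST.get(host, DEFAULT_ROBOTS)


if __name__ == "__main__":
    print(robots_for_host("www.mainsite.example"))
    print(robots_for_host("whitelabel.example:443"))
```

The same lookup would go into whatever handler IIS routes /robots.txt requests to.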
Google Webmaster Tools (google.com/webmasters/tools) will, besides letting you upload a sitemap, do a test crawl of your site and tell you what they crawled, how it rates for certain queries, and what they will and will not crawl.
The test crawl isn't automatically included in Google results. In any case, if you're trying to hide sensitive data from the prying eyes of Google you cannot count on that alone: put some authentication in front of it, no matter what.