Precedence of X-Robots-Tag header vs robots meta tag

I've placed the following Header in my vhost config:
Header set X-Robots-Tag "noindex, nofollow"
The goal here is simply to stop search engines from indexing my testing environment. The site runs WordPress, and a plugin is installed to manage the robots meta settings per page. For example:
<meta name="robots" content="index, follow" />
So my question is: which directive will take precedence over the other, since both are being set on every page?

I am not sure a definitive answer can be given, as the behaviour may be implementation-dependent (on the robot side).
However, I think there is reasonable evidence that X-Robots-Tag will take precedence over <meta name="robots" .... Consider:
One significant difference between the X-Robots-Tag and the robots meta directive is:
X-Robots-Tag is part of the HTTP protocol header.
<meta name="robots" ... is part of the HTML document header.
Therefore X-Robots-Tag belongs to the HTTP protocol layer, while <meta name="robots" ... belongs to the HTML document layer.
As they belong to a different protocol layer, they will not be parsed simultaneously by the (robot) client getting the page: The HTTP layer will be parsed first, and the HTML in a later step.
(Also, it should be noted that X-Robots-Tag and <meta name="robots" ... are not supported by all robots. Google and Yahoo/Bing support both, but according to this, some support only <meta name="robots" ... and others support neither.)
Summary:
If supported by the robot, X-Robots-Tag is processed first; its restrictions (noindex, nofollow) apply, and <meta name="robots" ... is ignored.
Otherwise, the <meta name="robots" ... directive applies.
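
For reference, here is a minimal sketch of the staging vhost, assuming Apache with mod_headers enabled (the staging.example.com hostname is a placeholder):

<VirtualHost *:80>
    ServerName staging.example.com
    # Send noindex/nofollow at the HTTP layer, so robots see it
    # before any HTML is parsed
    Header set X-Robots-Tag "noindex, nofollow"
</VirtualHost>

You can confirm the header is actually being sent with curl -I http://staging.example.com/; the X-Robots-Tag line should appear in the response regardless of what the WordPress plugin writes into the page.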

Just an update to Dan's experience: I also have both the
Header set X-Robots-Tag "noindex, nofollow"
and
<meta name="robots" content="index, follow" />
on one of my WordPress sites, and a check in Google Search Console confirmed that the noindex in X-Robots-Tag is taking precedence: the pages have been crawled but aren't indexed. So the logic in the accepted answer is indeed correct.

In my recent experience, when Google sees mixed messages it prefers positive action by default, i.e. it favours indexing, and it will flag the issue as a critical error/warning in your webmaster tools console if you have one.
see your site's status in Google here: https://www.google.com/webmasters/
see your site's status in Bing here: http://www.bing.com/toolbox/webmaster (note that Yahoo search is now powered by Bing)
Google takes this positive-by-default approach because lots of site owners unwittingly have a dodgy CMS semi-blocking robots, and we know how Google loves to accumulate as much data as it can. Any excuse!
If the technical settings are erroneous, they're liable to be totally disregarded, and we know how search engines index and follow by default when no settings are specified.

Related

Why is the referer from my server always null?

I am trying to work out why my referrer from my server always seems to be blank. I have knocked together the following to test it:
<html>
<head>
<meta http-equiv="Refresh" content="0; url='https://www.whatismyreferer.com/'" />
<meta name="referrer" content="origin" />
</head>
<body>
</body>
</html>
When I go to this page, the reported referrer is blank.
Is this something that is being set at a server level in Apache? I have a case where I need to pass the referrer so finding out what is controlling this would be good.
The referrer header (with the famous referer spelling) is sent by the browser. If the browser decides not to send it (e.g. for privacy reasons), it simply won't. You should never rely on the header being there, even if you find configurations that currently work: the request is valid with or without this header, and browsers might change their behaviour at any time (they did: the header used to be omnipresent, now it's less present).
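
If the goal is to encourage browsers to send a referrer, the standard mechanism is the Referrer-Policy response header. A minimal sketch, assuming Apache with mod_headers (the policy value is only a suggestion; browsers remain free to ignore it):

Header set Referrer-Policy "origin-when-cross-origin"

This is the HTTP-layer equivalent of the <meta name="referrer" ...> tag in the test page. Note also that some browsers omit the referrer entirely on meta-refresh redirects, so a server-side redirect may behave differently.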

Prevent search engines from indexing my API

I have my API at api.website.com, which requires no authentication.
I am looking for a way to stop Google from indexing my API.
Is there a way to do so?
I already have the disallow rules in my robots.txt at api.website.com/robots.txt,
but that just prevents google from crawling it.
User-agent: *
Disallow: /
The usual way would be to remove the Disallow and add a noindex meta tag, but it's an API, hence no meta tags or anything.
Is there any other way to do that?
It seems there is a way to add a noindex on API responses.
See here https://webmasters.stackexchange.com/questions/24569/why-do-google-search-results-include-pages-disallowed-in-robots-txt/24571#24571
The solution recommended on both of those pages is to add a noindex meta tag to the pages you don't want indexed. (The X-Robots-Tag HTTP header should also work for non-HTML pages. I'm not sure if it works on redirects, though.) Paradoxically, this means that you have to allow Googlebot to crawl those pages (either by removing them from robots.txt entirely, or by adding a separate, more permissive set of rules for Googlebot), since otherwise it can't see the meta tag in the first place.
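
Putting that together for the api.website.com case, a minimal sketch of the vhost, assuming Apache with mod_headers (the hostname is taken from the question):

<VirtualHost *:80>
    ServerName api.website.com
    # The noindex travels in the HTTP response, so it also works
    # for JSON and other non-HTML responses that have no <head>
    Header set X-Robots-Tag "noindex"
</VirtualHost>

Remember to remove the Disallow: / rule (or relax it for Googlebot) so the crawler can actually fetch the responses and see the header.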
It is strange that Google is ignoring your /robots.txt file. Try dropping an index.html file in the root web directory and adding the following between the <head>...</head> tags of the web page.
<meta name="robots" content="noindex, nofollow">

Add Expires headers for Cloudflare (Google, Disqus)

I am using Cloudflare for DNS management.
My site is built with Jekyll and hosted on GitHub Pages.
When I analyze my site with GTmetrix, I get the "Add Expires headers" error.
How can I fix this error? The flagged resources are:
https://www.googletagmanager.com/gtm.js?id=GTM-KX5WC3P
https://cse.google.com/cse.js?cx=partner
https://ahmet123.disqus.com/
https://www.google-analytics.com/analytics.js
https://www.google.com/cse/static/style/look/v2/default.css
https://cse.google.com/adsense/search/async-ads.js
https://www.google.com/uds/css/v2/clear.png
Check out the first and second answers here: GitHub Pages, HTTP headers
You can't change the HTTP headers, but you can do something like:
<meta http-equiv="Expires" content="600" />
inside the <head> section of the layout you use for your pages.
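
For a Jekyll site that means editing the layout file. A minimal sketch, assuming the conventional _layouts/default.html path (the 600-second value is just the example from above):

<!-- _layouts/default.html -->
<html>
<head>
  <!-- Hint that the page may be cached for 600 seconds -->
  <meta http-equiv="Expires" content="600" />
</head>
<body>
  {{ content }}
</body>
</html>

Note, however, that most of the flagged resources are third-party scripts (Google Tag Manager, Disqus, Google Analytics, and so on) whose headers you cannot control, so GTmetrix will keep flagging those regardless.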
Hello, how do I fix this problem? After running GTmetrix, the performance report says:
Add Expires headers
There are 9 static components without a far-future expiration date.
https://www.googletagmanager.com/gtag/js?id=UA-195633608-1
https://fonts.googleapis.com/css?family=Roboto+Slab%3A100%2C100italic%2C200%2C200italic%2C300%2C300italic%2C400%2C400italic%2C500%2C500italic%2C600%2C600italic%2C700%2C700italic%2C800%2C800italic%2C900%2C900italic%7CRoboto%3A100%2C100italic%2C200%2C200italic%2C300%2C300italic%2C400%2C400italic%2C500%2C500italic%2C600%2C600italic%2C700%2C700italic%2C800%2C800italic%2C900%2C900italic&display=auto&ver=5.9.2
https://www.googletagmanager.com/gtag/js?id=G-69EDJ9C5J7
https://static.cloudflareinsights.com/beacon.min.js
https://harhalakis.net/cdn-cgi/scripts/5c5dd728/cloudflare-static/email-decode.min.js
https://www.google-analytics.com/analytics.js
https://www.googletagmanager.com/gtag/js?id=G-69EDJ9C5J7&l=dataLayer&cx=c
https://www.googletagmanager.com/gtm.js?id=GTM-W3NQ5KW
https://www.google-analytics.com/plugins/ua/linkid.js
Thank you.
Konstantinos Harhalakis

How can I force a hard refresh if page has been visited before

Is it possible to check if the client has a cached version of a website, and if so, force his browser to apply a hard refresh once?
You can't force a browser to do anything, because you don't know how rigidly a remote client is observing the rules of HTTP.
However you can set HTTP headers which the browser is supposed to obey.
One such header is Cache-Control. There are a number of values that may meet your needs, including no-cache and max-age. There is also the Expires header, which specifies a wall-clock expiration time.
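
As a sketch of what that might look like in an Apache config, assuming mod_headers is enabled (the exact values are illustrative):

# Ask clients and proxies to revalidate on every request
Header set Cache-Control "no-cache, max-age=0, must-revalidate"
# Belt and braces for older clients that only honour Expires
Header set Expires "Thu, 01 Jan 1970 00:00:00 GMT"

With these in place, a returning visitor's browser should revalidate the page rather than serve its cached copy, which is usually the closest you can get to forcing a refresh.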
It is not readily apparent whether the client has a cached version. To tell the client not to use its cache, you can use these meta tags:
<head>
<title>---</title>
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="-1">
</head>

Preventing Google from indexing/caching

Is there a reliable way to prevent Google from crawling/indexing/caching a page?
I am thinking about creating a product where users could temporarily share information, using temporary URLs.
The information is not very confidential, but I'd definitely prefer that it not show up in a cache or in search results.
What's the most reliable way of doing this, and what are the possible pitfalls?
Make a robots.txt file. See http://www.robotstxt.org/ for information.
You can also use <meta name="robots" content="noindex, nofollow"> in the <head> of your index.html.
Google's specifications for their robots.txt handling are here.
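
For the caching half of the question specifically, one option (a sketch, assuming an Apache server with mod_headers; the /share path is hypothetical) is to send noindex together with noarchive, which asks search engines not to keep a cached copy:

# Apply only to the temporary-share URLs
<Location "/share">
    Header set X-Robots-Tag "noindex, noarchive"
</Location>

The main pitfall is the one noted in the earlier thread: if robots.txt blocks crawling of those URLs, the crawler never sees the header (or the meta tag), so leave them crawlable.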