.htaccess redirects according to the query string contents - apache

I have an online archive of images, some of which reside on Cloud Storage. The archive is hierarchical with four levels, and the appropriate level is accessed using query strings:
a.php?level=image&collection=a&document=b&item=72
The level can be archive, collection, document, or image.
I want to prevent robots from accessing the actual images, primarily to minimise traffic on the cloud storage. So the idea is if they issue a request where the query string level parameter is image ("?level=image"), that request is diverted.
The .htaccess code below is intended to check the query string for a request from a foreign referrer, and if the request is for an image, direct the request elsewhere:
RewriteEngine On
RewriteCond %{HTTP_HOST}##%{HTTP_REFERER} !^([^#]*)##https?://\1
RewriteCond %{QUERY_STRING} ^level=image$
RewriteRule (.*) https://a.co.uk/blank.htm [NC,R,L]
My code appears to have no obvious effect. Can anybody see what I am doing wrong? I do not pretend to have a lot of confidence with .htaccess code, normally relying on snippets produced by people cleverer than me.

RewriteCond %{QUERY_STRING} ^level=image$
This checks that the query string is exactly equal to level=image, whereas in your example the level URL parameter is just one of many (the first one).
To check that the URL parameter level=image appears anywhere in the query string then modify the above condition like so:
RewriteCond %{QUERY_STRING} (^|&)level=image($|&)
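The difference between the two patterns can be checked outside Apache. The following is a minimal sketch using Python's re module, on the assumption that its syntax agrees with Apache's regex engine for patterns this simple. The helper name matches and the sample query strings are illustrative and not part of the original configuration.

```python
import re

# Pattern from the question: matches only if the whole query string is "level=image".
EXACT = re.compile(r"^level=image$")
# Pattern from the answer: matches level=image as a complete parameter anywhere.
ANYWHERE = re.compile(r"(^|&)level=image($|&)")


def matches(pattern, query_string):
    """Return True if the pattern is found (unanchored search, like a RewriteCond regex)."""
    return pattern.search(query_string) is not None


samples = [
    "level=image",
    "level=image&collection=a&document=b&item=72",
    "collection=a&level=image",
    "level=images",
    "xlevel=image",
]

for qs in samples:
    print(f"{qs!r}: exact={matches(EXACT, qs)} anywhere={matches(ANYWHERE, qs)}")
```

The (^|&) and ($|&) alternations are what stop look-alike parameters such as xlevel=image or level=images from matching.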
RewriteCond %{HTTP_HOST}##%{HTTP_REFERER} !^([^#]*)##https?://\1
Minor issue, but this would allow referrers where the requested hostname (eg. example.com) occurs only as a subdomain of the referrer. eg. example.com.referrer.com. To resolve this, modify the CondPattern to include a trailing slash or end-of-string anchor. For example:
RewriteCond %{HTTP_HOST}##%{HTTP_REFERER} !^([^#]*)##https?://\1(/|$)
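To see why the trailing (/|$) matters, here is a small sketch in Python that applies both regexes to the same TestString. The function name same_site and the sample hostnames are assumptions made for illustration. The leading ! operator is left out, so True here means "the Referer looks like the same site".

```python
import re

# Regex from the question's condition (without the leading "!" operator).
LOOSE = re.compile(r"^([^#]*)##https?://\1")
# Tightened regex: the hostname must be followed by "/" or the end of the string.
STRICT = re.compile(r"^([^#]*)##https?://\1(/|$)")


def same_site(pattern, host, referer):
    """Build the TestString "HOST##REFERER" and report whether the regex matches."""
    return pattern.search(f"{host}##{referer}") is not None


for referer in ("https://example.com/page", "https://example.com.referrer.com/"):
    print(referer,
          "loose:", same_site(LOOSE, "example.com", referer),
          "strict:", same_site(STRICT, "example.com", referer))
```

With the loose regex, example.com.referrer.com is treated as the same site because the backreference only has to match a prefix of the referring hostname.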
RewriteRule (.*) https://a.co.uk/blank.htm [NC,R,L]
There's no need for the capturing subpattern. If you only need the rule to be successful for any URL-path then use just ^ to avoid traversing the URL-path. But in your example, the request is for a.php, not "any URL"?
But why "redirect", rather than simply block the request? As you say, this is for "robots" after all. For example, to send a 403 Forbidden:
RewriteRule ^a\.php$ - [F]
In summary:
RewriteCond %{HTTP_HOST}##%{HTTP_REFERER} !^([^#]*)##https?://\1(/|$)
RewriteCond %{QUERY_STRING} (^|&)level=image($|&)
RewriteRule ^a\.php$ - [F]
Note, however, that search engine "bots" generally don't send a Referer header at all. And it is trivial for arbitrary bots to fake the Referer header and circumvent your block.

Related

.htaccess, prevent hotlinking, allow big bots, allow some access and allow my own domain without adding it statically

1. I want to allow image crawling on my site from a couple of different bots and exclude all others.
2. I want to allow images in at least one folder to not be blocked for any request.
3. I don't want to block image requests from visitors on my own site.
4. I don't want to include my domain name in the .htaccess file for portability.
The reason I ask this here and don't simply test the following code myself is that I work on my own and have no colleagues to ask or external resources to test from. I think what I've got is correct but I find .htaccess rules extremely confusing, and I don't know what I don't even know at this point.
RewriteCond %{HTTP_REFERER} !^$ [OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?bing\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?facebook\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?google\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?instagram\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?linkedin\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?reddit\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?twitter\..+$ [NC,OR]
RewriteCond %{REQUEST_URI} !^/cross-origin-resources/ [NC,OR]
RewriteCond %{HTTP_HOST}##%{HTTP_REFERER} !^([^#]*)##https?://\1/.* [NC]
RewriteRule \.(bmp|gif|jpe?g|png|webp)$ - [F,L,NC]
I've tested it on htaccess tester, and looks good but does complain about the second last line when tested using the following URL: http://www.example.co.uk/poignant/foo.webp
You have the logic in reverse. As written these conditions (RewriteCond directives) will always be successful and the request will always be blocked.
You have a series of negated conditions that are OR'd. These would only fail (ie. not block the request) if all the conditions match, which is impossible. (eg. The Referer header cannot be bing and facebook.)
You need to remove the OR flag on all your RewriteCond directives, so they are implicitly AND'd.
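The logic error can be reproduced with plain boolean expressions. The sketch below is a simplified model in Python, trimmed to two of the referrer patterns for brevity. The function names and the sample referrers are assumptions for illustration only.

```python
import re

# Two of the allowed referrer patterns from the question (trimmed for brevity).
ALLOWED = [r"^https?://(www\.)?bing\.", r"^https?://(www\.)?google\."]


def blocked_with_or(referer):
    """Negated conditions joined with OR: blocked if ANY pattern fails to match."""
    return any(re.search(p, referer, re.I) is None for p in ALLOWED)


def blocked_with_and(referer):
    """Negated conditions joined with AND: blocked only if NO pattern matches."""
    return all(re.search(p, referer, re.I) is None for p in ALLOWED)


for referer in ("https://www.bing.com/", "https://evil.example/"):
    print(referer, "OR:", blocked_with_or(referer), "AND:", blocked_with_and(referer))
```

A referrer can match at most one of the patterns, so with OR at least one negated condition always succeeds and every request is blocked.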
Incidentally, the suggestion in comments from @StephenOstermiller to combine the HTTP_REFERER checks into one (which is a good one) is equivalent to having the individual conditions AND'd, not OR'd (as you have posted initially).
I want to allow image crawling on my site from a couple of different bots and exclude all others.
Once you've corrected the OR/AND as stated above, this rule will likely allow ALL bots to crawl your site images because bots generally do not send a Referer header. These directives are not really about "crawling", they allow certain websites to display your images on their domain (ie. hotlinking). This is probably the intention, however, it's not what you are stating in point #1.
(To block bots from crawling your site you would need to check the User-Agent request header, ie. HTTP_USER_AGENT - which would probably be better done in a separate rule.)
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?bing\..+$
Minor point, but the .+$ at the end of the regex is superfluous. There's no need to match the entire Referer when you are only interested in the hostname. These sites probably set a Referrer-Policy that prevents the URL-path being sent (by the browser) in the Referer header anyway, but the trailing pattern is unnecessary either way.
RewriteCond %{HTTP_HOST}##%{HTTP_REFERER} !^([^#]*)##https?://\1/.* [NC]
In comments, you were asking what this line does. This satisfies points #3 and #4 in your list, so it is certainly needed. It ensures that the requested Host header (HTTP_HOST) matches the hostname in the Referer. So the request is coming from the same site.
The alternative is to hardcode your domain in the condition, which you are trying to avoid.
(Again, the trailing .* on the regex is unnecessary and should be removed.)
This is achieved by using an internal backreference \1 in the regex against the HTTP_REFERER that matches HTTP_HOST in the TestString (first argument). The ## string is just an arbitrary string that does not occur in the HTTP_HOST or HTTP_REFERER server variables.
This is clearer if you expand the TestString to see what is being matched. For example, if you make an internal request to https://example.com/myimage.jpg from your homepage (ie. https://example.com/) then the TestString in the RewriteCond directive is:
example.com##https://example.com/
This is then matched against the regex ^([^#]*)##https?://\1/ (the ! prefix on the CondPattern is an operator and is part of the argument, not the regex).
1. ([^#]*) - the first capturing group captures example.com (the value of HTTP_HOST).
2. ##https?:// - simply matches ##https:// in the TestString (part of the HTTP_REFERER).
3. \1 - this is an internal backreference, so it must match the value captured by the first capturing group (#1 above). In this example, it must match example.com. And it does, so there is a successful match.
4. The ! prefix on the CondPattern (not strictly part of the regex) negates the whole expression, so the condition is successful when the regex does not match.
So, in the above example, the regex matches and so the condition fails (because it's negated), so the rule is not triggered and the request is not blocked.
However, if a request is made to https://example.com/myimage.jpg from an external site, eg. https://external-site.example/ then the TestString in the RewriteCond directive is:
example.com##https://external-site.example/
Following the steps above, the regex fails to match (because external-site.example does not match example.com). The negated condition is therefore successful and the rule is triggered, so the request is blocked. (Unless one of the other conditions failed.)
Note that with the condition as written, www.example.com is different to example.com. For example, if you were on example.com and you used an absolute URL to your image using www.example.com then the regex will fail to match and the request will be blocked. This could perhaps be incorporated into the regex, to allow for this. But this is very much an edge case and can be avoided with a canonical 301 redirect earlier in the config.
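The walk-through above can be mirrored in a few lines of Python. This is only a sketch: the function name condition_succeeds is hypothetical, and the ! operator is modelled by negating the match result in code and not inside the regex.

```python
import re

# Regex part of the CondPattern; the leading "!" is handled separately below.
PATTERN = re.compile(r"^([^#]*)##https?://\1/")


def condition_succeeds(host, referer):
    """Mimic: RewriteCond %{HTTP_HOST}##%{HTTP_REFERER} !^([^#]*)##https?://\\1/"""
    test_string = f"{host}##{referer}"
    regex_matches = PATTERN.search(test_string) is not None
    return not regex_matches  # the "!" prefix negates the result


cases = [
    ("example.com", "https://example.com/"),            # same site
    ("example.com", "https://external-site.example/"),  # external site
    ("example.com", "https://www.example.com/"),        # www variant of the same site
]
for host, referer in cases:
    print(f"{host}##{referer} -> condition succeeds: {condition_succeeds(host, referer)}")
```

Only the first case makes the condition fail, so only that request escapes the block. The www variant is treated as external, as noted above.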
RewriteCond %{HTTP_REFERER} !^$
This allows an empty (or not present) Referer header. You "probably" do need this. It allows bots to crawl your images. It permits direct requests to images. It also allows users who have chosen to suppress the Referer header to be able to view your images on your site.
HOWEVER, it's also possible these days for a site to set a Referrer-Policy that completely suppresses the Referer header being sent (by the browser) and so bypasses your hotlink protection.
RewriteRule \.(bmp|gif|jpe?g|png|webp)$ - [F,L,NC]
Minor point, but the L flag is not required when the F flag is used (it is implied).
Are you really serving .bmp images?!
Aside: Sites don't necessarily "hotlink"
Some of these external sites (bing, Facebook, Google, Instagram, LinkedIn, Reddit, twitter, etc.) don't necessarily "hotlink" images anyway. They often make their own (resized/compressed) "copy" of the image instead (a bot makes the initial request to retrieve the image - with no Referer - so the request is not blocked).
So, explicitly permitting some of these sites in your "hotlink-protection" script might not be necessary anyway.
Summary
Taking the above points into consideration, the directives should look more like this:
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?(bing|facebook|google|instagram|linkedin|reddit|twitter)\.
RewriteCond %{REQUEST_URI} !^/cross-origin-resources/ [NC]
RewriteCond %{HTTP_HOST}##%{HTTP_REFERER} !^([^#]*)##https?://\1/
RewriteRule \.(gif|jpe?g|png|webp|bmp)$ - [F,NC]

Log image filename that's cached by external cdn using htaccess

I want to keep a log of image file names whenever a specific cdn caches our images but I can't quite get it. Right now, my code looks something like:
RewriteCond %{HTTP_USER_AGENT} Photon/1.0
RewriteRule ^(.*)$ log.php?image=$1 [L]
The above always logs the image as being "log.php" even if I'm making the cdn cache "example.jpg" and I thoroughly don't understand why.
Because in .htaccess the rewrite engine loops until the URL passes through unchanged (despite the presence of the L flag) and your rule also matches log.php (your rule matches everything) - so this is the "image" that is ultimately logged. The L flag simply stops the current pass through the rewrite engine.
For example:
1. Request /example.jpg
2. Request is rewritten to log.php?image=example.jpg
3. Rewrite engine starts over, passing /log.php?image=example.jpg to the start of the second pass.
4. Request is rewritten to log.php?image=log.php by the same RewriteRule directive.
5. Rewrite engine starts over, passing /log.php?image=log.php to the start of the third pass.
6. Request is rewritten to log.php?image=log.php (again).
7. URL has not changed in the last pass - processing stops.
You need to make an exception so that log.php itself is not processed. Or, state that all non-.php files are processed (instead of everything). Or, if only images are meant to be processed then only check for images.
For example:
# Log images only
RewriteCond %{HTTP_USER_AGENT} Photon/1\.0
RewriteRule ^(.+\.(?:png|jpg|webp|gif))$ log.php?image=$1 [L]
Remember to backslash-escape literal dots in the regex.
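The looping behaviour, and the effect of restricting the rule to images, can be simulated with a simplified model in Python. This is a sketch of the pass-by-pass description only, not of Apache's actual implementation. The function name simulate and the max_passes guard are assumptions.

```python
import re

MATCH_ALL = r"^(.*)$"                          # pattern from the question's rule
IMAGES_ONLY = r"^(.+\.(?:png|jpg|webp|gif))$"  # pattern from the "images only" rule


def simulate(path, pattern, max_passes=10):
    """Apply `RewriteRule <pattern> log.php?image=$1 [L]` until nothing changes."""
    query = ""
    for _ in range(max_passes):
        match = re.search(pattern, path)
        if match is None:
            break  # the rule does not apply, so the URL passes through unchanged
        new_path, new_query = "log.php", f"image={match.group(1)}"
        if (new_path, new_query) == (path, query):
            break  # the URL was not changed by this pass, so processing stops
        path, query = new_path, new_query
    return f"{path}?{query}" if query else path


print(simulate("example.jpg", MATCH_ALL))
print(simulate("example.jpg", IMAGES_ONLY))
```

With the match-everything pattern the second pass overwrites the image name with log.php. With the image-only pattern the second pass finds no match and the first rewrite is kept.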
Or,
# Log Everything except log.php itself
RewriteCond %{HTTP_USER_AGENT} Photon/1\.0
RewriteCond %{REQUEST_URI} ^/(.+)
RewriteRule !^log\.php$ log.php?image=%1 [L]
In the last example, %1 refers to the captured subpattern in the preceding CondPattern. I only did it this way, rather than using REQUEST_URI directly since you are excluding the slash prefix in your original logging directive (ie. you are passing image.jpg to your script when /image.jpg is requested). If you want to log the slash prefix as well, then you can omit the 2nd condition and pass REQUEST_URI directly. For example:
# Log Everything except log.php itself (include slash prefix)
RewriteCond %{HTTP_USER_AGENT} Photon/1\.0
RewriteRule !^log\.php$ log.php?image=%{REQUEST_URI} [L]
Alternatively, on Apache 2.4+ you can use the END flag instead of L to force the rewrite engine to stop and prevent further passes through the rewrite engine. For example:
RewriteCond %{HTTP_USER_AGENT} Photon/1\.0
RewriteRule (.+) log.php?image=$1 [END]

.htaccess - Internal friendly URL rewrite for one parameter only

I am having trouble using .htaccess to internally rewrite a URL where only one parameter is prettified and the rest of the request parameters are still appended. (By "internally rewrite" I mean using the requested URL to form an internal request whose response is served to the client, who still only sees the originally requested URL.) Other posts on Stack Overflow either concern just one relevant parameter, or wish to redirect every parameter.
That is,
- https://new.mysite.com/overhoringen/open/7 should internally request https://new.mysite.com/overhoringen/open?testId=7
- https://new.mysite.com/overhoringen/open/9?other=param&more=param should internally request https://new.mysite.com/overhoringen/open?testId=9&other=param&more=param
I can do this for the first bullet, a single parameter rewrite:
RewriteEngine on
RewriteBase /
#Prettify test
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^/?overhoringen/open/([^/]+)/?$ /overhoringen/open.php?testId=$1 [L]
However, I am unsure how to capture the request query at the end and then append it to the internal redirect (if present at all), without the ? still in front (to avoid open?testId=9?other=param&more=param), etc.
Help with this would just be really cool. :]
Change this line:
RewriteRule ^/?overhoringen/open/([^/]+)/?$ /overhoringen/open.php?testId=$1 [L]
to:
RewriteRule ^/?overhoringen/open/([^/]+)/?$ /overhoringen/open.php?testId=$1 [L,QSA]
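Adding the QSA (Query String Append) flag appends the query string of the original request to the one in the substitution, so /overhoringen/open/9?other=param&more=param is internally rewritten to /overhoringen/open.php?testId=9&other=param&more=param. Without QSA, a substitution that contains its own query string replaces the original one.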
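As a rough illustration of what the flag changes, the sketch below models the resulting query string in Python. The function rewrite_query is hypothetical. It only covers the case in this rule, where the substitution already carries its own query string.

```python
def rewrite_query(substitution_query, original_query, qsa):
    """Return the query string after a rewrite whose substitution has a query string.

    Without QSA the original query string is dropped.
    With QSA it is appended after the substitution's own parameters.
    """
    if qsa and original_query:
        return f"{substitution_query}&{original_query}"
    return substitution_query


# Request: /overhoringen/open/9?other=param&more=param
print(rewrite_query("testId=9", "other=param&more=param", qsa=False))
print(rewrite_query("testId=9", "other=param&more=param", qsa=True))
```

Because the parameters are joined with &, there is no stray second ? in the rewritten URL.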

Apache rewrite rule that ignores query/parameters, always redirects based on path

I'm trying to create a rewrite rule that will ignore any additional URL query/parameters and just redirect based on the path.
My company has a Wifi Hotspot service that does some DNS routing trick to force people to login before they can use it. Unfortunately when folks get disconnected from the WiFi and dropped back to their normal cell data service sometimes a URL request is still sent to our host, and it shows up as:
www.ourwebsite.com/login?dst=http://www.google.com/m?client=ms-android-verizon&source=android-home
I already wrote a set of rules to take care of base paths of /login and /login/ to redirect to our homepage,
RewriteCond %{THE_REQUEST} ^.*\/login/\ HTTP/
RewriteRule ^(.*)login/?$ "/$1" [R=301,L]
RewriteCond %{THE_REQUEST} ^.*\/login\ HTTP/
RewriteRule ^(.*)login?$ "/$1" [R=301,L]
but I am having trouble coming up with an appropriate string to ALWAYS redirect based solely on the path, and ignore any query parameters that may or may not come after.
Any help would be appreciated! Thanks in advance.
If I understood right, something like this should do it:
Options +FollowSymLinks -MultiViews
RewriteEngine On
RewriteBase /
RewriteRule ^login /? [R=301,L]
This rule-set will redirect to root as long as the incoming URL is something like:
http://www.ourwebsite.com/login?any_query
From Apache 2.4.0 on you can apply the QSD-flag to the rule.
When the requested URI contains a query string, and the target URI does not, the default behavior of RewriteRule is to copy that query string to the target URI. Using the [QSD] flag causes the query string to be discarded.
-- https://httpd.apache.org/docs/2.4/rewrite/flags.html#flag_qsd
Using this flag on earlier Apache versions causes a 500 Internal Server Error.

apache mod_rewrite redirect between hostnames (except one directory)

I've got two hostnames (e.g. www.site1.com and www.site2.com) mapped to the same apache virtual host. I'd like to redirect all traffic from site1.com over to site2.com, except for all POST requests from a particular folder. (That folder contains URLs which are used by older clients which are not capable of handling redirects on POST requests.)
Here's the rule I came up with. I'm a newbie to rewrite rules, so wanted to make sure I wasn't missing anything obvious.
RewriteCond %{HTTP_HOST} ^www.site1.com$
RewriteCond %{HTTP_METHOD} ^POST
RewriteCond %{REQUEST_URI} !^/dontredirectme/
RewriteRule /(.*) http://www.site2.com/$1 [R=301,L]
Is this a correct rule to use for this job? And if yes, are there any efficiency optimizations I should be considering?
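Your rule set looks relatively correct, but you need to modify your second RewriteCond a little to reflect your goal. Note also that the mod_rewrite variable holding the request method is REQUEST_METHOD; HTTP_METHOD is not a variable it defines, so it expands to an empty string:
RewriteCond %{REQUEST_METHOD} !^POST [OR]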
This will allow you to redirect if the request type is not POST, or if it is POST but the requested URI is not under /dontredirectme/, which effectively results in a redirect for everything that isn't a POST request to /dontredirectme/.
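The way the three conditions combine can be written out as a boolean expression. Below is a minimal Python sketch of that logic. It relies on a condition flagged OR being grouped with the condition that follows it, with the groups then implicitly AND'd. The function name should_redirect and the sample requests are assumptions for illustration.

```python
def should_redirect(host, method, uri):
    """host is www.site1.com AND (method is not POST OR uri is outside /dontredirectme/)."""
    host_ok = host == "www.site1.com"
    not_post = not method.startswith("POST")             # models !^POST
    not_folder = not uri.startswith("/dontredirectme/")  # models !^/dontredirectme/
    return host_ok and (not_post or not_folder)


requests = [
    ("www.site1.com", "GET", "/page"),
    ("www.site1.com", "POST", "/other"),
    ("www.site1.com", "GET", "/dontredirectme/api"),
    ("www.site1.com", "POST", "/dontredirectme/api"),
]
for host, method, uri in requests:
    print(method, uri, "->", "redirect" if should_redirect(host, method, uri) else "serve")
```

Only the POST request inside /dontredirectme/ is served without a redirect.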
Additionally, the input to the RewriteRule will not have a leading forward slash if you're defining it in a per-directory context (in a .htaccess file or in a <Directory> section). If you are defining it directly in the <VirtualHost> (a per-server context), then the input will have a leading slash, so your rule would be fine as-is.
As far as efficiency goes, rules defined in the server configuration have the benefit of only having to be parsed one time. On the other hand, a .htaccess file must be parsed for each request, a process which involves the additional (albeit small) overhead of reading the file and compiling the regular expressions.
If you really want to squeeze efficiency out of it, you could make the following changes:
RewriteCond %{HTTP_HOST} =www.example.com
RewriteCond %{REQUEST_METHOD} !=POST [NC,OR]
RewriteCond %{REQUEST_URI} !^/dontredirectme/
RewriteRule ^ http://www.example.net%{REQUEST_URI} [R=301,L]
I doubt the difference is really appreciable in all but the most extreme cases, but this removes two regular expressions in favour of a direct text comparison. Also, since you just want to redirect the request to the new host verbatim, you can "simplify" the regular expression involved in the RewriteRule and just use %{REQUEST_URI} directly in the replacement.