.htaccess, prevent hotlinking, allow big bots, allow some access and allow my own domain without adding it statically - apache

I want to allow image crawling on my site from a couple of different bots and exclude all others.
I want to allow images in at least one folder to not be blocked for any request.
I don't want to block image requests from visitors on my own site.
I don't want to include my domain name in the .htaccess file for portability.
The reason I ask this here, and don't simply test the following code myself, is that I work on my own and have no colleagues to ask or external resources to test from. I think what I've got is correct, but I find .htaccess rules extremely confusing, and at this point I don't even know what I don't know.
RewriteCond %{HTTP_REFERER} !^$ [OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?bing\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?facebook\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?google\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?instagram\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?linkedin\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?reddit\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?twitter\..+$ [NC,OR]
RewriteCond %{REQUEST_URI} !^/cross-origin-resources/ [NC,OR]
RewriteCond %{HTTP_HOST}##%{HTTP_REFERER} !^([^#]*)##https?://\1/.* [NC]
RewriteRule \.(bmp|gif|jpe?g|png|webp)$ - [F,L,NC]
I've tested it on an htaccess tester and it looks good, but it does complain about the second-to-last line when tested using the following URL: http://www.example.co.uk/poignant/foo.webp

You have the logic in reverse. As written these conditions (RewriteCond directives) will always be successful and the request will always be blocked.
You have a series of negated conditions that are OR'd. These would only fail (ie. not block the request) if every one of the negated patterns matched, which is impossible. (eg. The Referer header cannot be both bing and facebook at the same time.)
You need to remove the OR flag on all your RewriteCond directives, so they are implicitly AND'd.
Incidentally, the suggestion in comments from @StephenOstermiller to combine the HTTP_REFERER checks into one (which is a good one) is equivalent to having the individual conditions AND'd, not OR'd (as you initially posted).
I want to allow image crawling on my site from a couple of different bots and exclude all others.
Once you've corrected the OR/AND as stated above, this rule will likely allow ALL bots to crawl your site images, because bots generally do not send a Referer header. These directives are not really about "crawling"; they allow certain websites to display your images on their pages (ie. hotlinking). This is probably the intention; however, it's not what you are stating in point #1.
(To block bots from crawling your site you would need to check the User-Agent request header, ie. HTTP_USER_AGENT - which would probably be better done in a separate rule.)
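For completeness, a separate User-Agent based rule might look something like the following (a sketch only; the bot tokens here are illustrative placeholders, not recommendations):

RewriteCond %{HTTP_USER_AGENT} (SomeBadBot|AnotherScraper) [NC]
RewriteRule \.(bmp|gif|jpe?g|png|webp)$ - [F]

Bear in mind that blocking by User-Agent is only advisory, since the header is trivially spoofed.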
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?bing\..+$
Minor point, but the +$ at the end of the regex is superfluous. There's no need to match the entire Referer when you are only interested in the hostname. These sites probably set a Referrer-Policy that prevents the URL-path being sent (by the browser) in the Referer header anyway, but either way the trailing pattern is unnecessary.
RewriteCond %{HTTP_HOST}##%{HTTP_REFERER} !^([^#]*)##https?://\1/.* [NC]
In comments, you were asking what this line does. This satisfies points #3 and #4 in your list, so it is certainly needed. It ensures that the requested Host header (HTTP_HOST) matches the hostname in the Referer. So the request is coming from the same site.
The alternative is to hardcode your domain in the condition, which you are trying to avoid.
(Again, the trailing .* on the regex is unnecessary and should be removed.)
This is achieved by using an internal backreference \1 in the regex against the HTTP_REFERER that matches HTTP_HOST in the TestString (first argument). The ## string is just an arbitrary string that does not occur in the HTTP_HOST or HTTP_REFERER server variables.
This is clearer if you expand the TestString to see what is being matched. For example, if you make an internal request to https://example.com/myimage.jpg from your homepage (ie. https://example.com/) then the TestString in the RewriteCond directive is:
example.com##https://example.com/
This is then matched against the regex ^([^#]*)##https?://\1/ (the ! prefix on the CondPattern is an operator and is part of the argument, not the regex).
([^#]*) - the first capturing group captures example.com (The value of HTTP_HOST).
##https?:// - simply matches ##https:// in the TestString (part of the HTTP_REFERER).
\1 - this is an internal backreference. So this must match the value captured from the first capturing group (#1 above). In this example, it must match example.com. And it does, so there is a successful match.
The ! prefix on the CondPattern (not strictly part of the regex), negates the whole expression, so the condition is successful when the regex does not match.
So, in the above example, the regex matches and so the condition fails (because it's negated), so the rule is not triggered and the request is not blocked.
However, if a request is made to https://example.com/myimage.jpg from an external site, eg. https://external-site.example/ then the TestString in the RewriteCond directive is:
example.com##https://external-site.example/
Following the steps above, the regex fails to match (because external-site.example does not match example.com). The negated condition is therefore successful and the rule is triggered, so the request is blocked. (Unless one of the other conditions failed.)
Note that, with the condition as written, www.example.com is different to example.com. For example, if you were on example.com and you referenced your image with an absolute URL using www.example.com, then the regex would fail to match and the request would be blocked. This could perhaps be incorporated into the regex, but it is very much an edge case and can be avoided with a canonical 301 redirect earlier in the config.
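If you did want to allow for the www. difference, one possible variant (a sketch, not tested against your setup) makes the www. prefix optional on both sides of the comparison:

RewriteCond %{HTTP_HOST}##%{HTTP_REFERER} !^(?:www\.)?([^#]*)##https?://(?:www\.)?\1(/|$)

This captures the hostname without any www. prefix and likewise skips a www. prefix in the Referer, so example.com and www.example.com are treated as the same site.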
RewriteCond %{HTTP_REFERER} !^$
This allows an empty (or not present) Referer header. You "probably" do need this. It allows bots to crawl your images. It permits direct requests to images. It also allows users who have chosen to suppress the Referer header to be able to view your images on your site.
HOWEVER, it's also possible these days for a site to set a Referrer-Policy that completely suppresses the Referer header being sent (by the browser) and so bypasses your hotlink protection.
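For example, all it takes to defeat the Referer check is for the hotlinking page to be served with a header like the following (or the equivalent <meta name="referrer"> tag) - browsers will then omit the Referer header entirely when requesting your images:

Header set Referrer-Policy "no-referrer"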
RewriteRule \.(bmp|gif|jpe?g|png|webp)$ - [F,L,NC]
Minor point, but the L flag is not required when the F flag is used (it is implied).
Are you really serving .bmp images?!
Aside: Sites don't necessarily "hotlink"
Some of these external sites (bing, Facebook, Google, Instagram, LinkedIn, Reddit, twitter, etc.) don't necessarily "hotlink" images anyway. They often make their own (resized/compressed) "copy" of the image instead (a bot makes the initial request to retrieve the image - with no Referer - so the request is not blocked).
So, explicitly permitting some of these sites in your "hotlink-protection" script might not be necessary anyway.
Summary
Taking the above points into consideration, the directives should look more like this:
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?(bing|facebook|google|instagram|linkedin|reddit|twitter)\.
RewriteCond %{REQUEST_URI} !^/cross-origin-resources/ [NC]
RewriteCond %{HTTP_HOST}##%{HTTP_REFERER} !^([^#]*)##https?://\1/
RewriteRule \.(gif|jpe?g|png|webp|bmp)$ - [F,NC]

Related

.htaccess redirects according to the query string contents

I have an online archive of images, some of which reside on Cloud Storage. The archive is hierarchical with four levels, and the appropriate level is accessed using query strings:
a.php?level=image&collection=a&document=b&item=72
The level can be archive, collection, document, or image.
I want to prevent robots from accessing the actual images, primarily to minimise traffic on the cloud storage. So the idea is if they issue a request where the query string level parameter is image ("?level=image"), that request is diverted.
The .htaccess code below is intended to check the query string for a request from a foreign referrer, and if the request is for an image, direct the request elsewhere:
RewriteEngine On
RewriteCond %{HTTP_HOST}##%{HTTP_REFERER} !^([^#]*)##https?://\1
RewriteCond %{QUERY_STRING} ^level=image$
RewriteRule (.*) https://a.co.uk/blank.htm [NC,R,L]
My code appears to have no obvious effect. Can anybody see what I am doing wrong? I do not pretend to have a lot of confidence with .htaccess code, normally relying on snippets produced by people cleverer than me.
RewriteCond %{QUERY_STRING} ^level=image$
This checks that the query string is exactly equal to level=image, whereas in your example the level URL parameter is just one of many (the first one).
To check that the URL parameter level=image appears anywhere in the query string then modify the above condition like so:
RewriteCond %{QUERY_STRING} (^|&)level=image($|&)
RewriteCond %{HTTP_HOST}##%{HTTP_REFERER} !^([^#]*)##https?://\1
Minor issue, but this would allow referrers where the requested hostname (eg. example.com) occurs only at the start of the referrer's hostname, eg. example.com.referrer.com. To resolve this, modify the CondPattern to include a trailing slash or end-of-string anchor. For example:
RewriteCond %{HTTP_HOST}##%{HTTP_REFERER} !^([^#]*)##https?://\1(/|$)
RewriteRule (.*) https://a.co.uk/blank.htm [NC,R,L]
There's no need for the capturing subpattern. If you only need the rule to be successful for any URL-path then use just ^ to avoid traversing the URL-path. But in your example, the request is for a.php, not "any URL"?
But why "redirect", rather than simply block the request? As you say, this is for "robots" after all. For example, to send a 403 Forbidden:
RewriteRule ^a\.php$ - [F]
In summary:
RewriteCond %{HTTP_HOST}##%{HTTP_REFERER} !^([^#]*)##https?://\1(/|$)
RewriteCond %{QUERY_STRING} (^|&)level=image($|&)
RewriteRule ^a\.php$ - [F]
Note, however, that search engine "bots" generally don't send a Referer header at all. And it is trivial for arbitrary bots to fake the Referer header and circumvent your block.

htaccess redirect outcome not as expected

I hope someone can help me with the following problem.
I have a multiple language site with the language as a folder like
example.com/se/post
I want to get the language separated by domain like example.se.
So far no problem with a DNS alias and WPML plugin.
The problem I have is that I want to redirect example.com/se/post to example.se/post. I tried to use this rule in the .htaccess file, but it changes the URL to example.se/se, with the /se that I do not need. I'm not very familiar with the rewrite engine in the .htaccess file.
<IfModule mod_headers.c>
RewriteEngine on
RewriteCond %{HTTP_HOST} ^(www\.)?example\.nl$ [NC]
RewriteCond %{REQUEST_URI} ^/sv(/.*)?$ [NC]
RewriteRule ^(.*)$ http://www.example.se%{REQUEST_URI} [L,R=301]
</IfModule>
RewriteCond %{HTTP_HOST} ^(www\.)?example\.nl$ [NC]
RewriteCond %{REQUEST_URI} ^/sv(/.*)?$ [NC]
RewriteRule ^(.*)$ http://www.example.se%{REQUEST_URI} [L,R=301]
This is close... you are capturing the URL-path (/post part) in the preceding condition but not using it in the substitution string. Instead, you are using REQUEST_URI which contains the full root-relative URL-path.
You are also matching sv in the URL-path, but redirecting to se in the TLD. The following should resolve the issue (with minimal changes):
RewriteCond %{REQUEST_URI} ^/se(/.*)?$ [NC]
RewriteRule ^(.*)$ http://www.example.se%1 [L,R=301]
Where %1 is a backreference to the captured subpattern in the preceding condition (the /post part).
However, you don't need the second (or even the first) condition, as it can all be done in the RewriteRule directive. There wouldn't seem to be a need to check the requested hostname: if the language subdirectory is in the URL-path then it would seem you should redirect anyway to remove it.
For example, the following should be sufficient to redirect a single language code:
# Language "se"
RewriteRule ^se(?:/(.*))?$ https://www.example.se/$1 [R=301,L]
The non-capturing group that contains the slash delimiter ensures that we always have a trailing slash on the target URL (after the hostname). Without it, a request for /se alone would produce a redirect that omits the slash after the hostname, leaving the user-agent to "correct" the response (which it does, but it's better not to rely on that).
For multiple languages you can modify the same rule with regex alternation. For example:
# All supported languages
RewriteRule ^(se|uk|us|au|id)(?:/(.*))?$ https://www.example.$1/$2 [R=301,L]
This assumes all language codes map to a TLD using the same code. If not then you can implement a "mapping" (lang code -> TLD) in the rule itself or use a RewriteMap if you have access to the server config. This could also provide a "default" TLD.
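As a sketch of the RewriteMap approach (server or virtual host config only - RewriteMap cannot be declared in .htaccess; the map file path and its contents here are hypothetical):

RewriteMap lang2tld txt:/etc/apache2/lang2tld.map

With /etc/apache2/lang2tld.map containing key/value pairs such as:

se se
uk co.uk
us com

The rule can then look up the TLD, with com as the default for unknown language codes:

RewriteRule ^([a-z]{2})(?:/(.*))?$ https://www.example.${lang2tld:$1|com}/$2 [R=301,L]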
You could be more generic and allow any two-character language code in the regex. eg. ^([a-z]{2})(?:/(.*))?$. And simply pass this through to the TLD. However, a request for an unknown language (eg. /xx/post) - which might have resulted from an error on your site - will now result in either a malformed redirect (since the domain won't resolve) or worse, a redirect to a competitor lying in wait. And this might go undetected unless you run an analysis of your redirects. So, being more restrictive with the regex/rule may be advisable.

Exception to this .htaccess rule so a specific PDF link is allowed

I added this .htaccess rule to a WordPress website
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteCond %{REQUEST_FILENAME} (.*)
RewriteCond %{HTTP_COOKIE} !wordpress_logged_in_([a-zA-Z0-9_]*) [NC]
RewriteRule \.(zip|doc|docx|pdf)$ - [NC,F,L]
</IfModule>
This rule works, but I want to make an exception.
https://example.com/wp-content/uploads/2022/11/name-of-pdf.pdf
must be visible even if you're not logged in.
RewriteCond %{REQUEST_FILENAME} (.*)
Change this condition to:
RewriteCond %{REQUEST_URI} !=/wp-content/uploads/2022/11/name-of-pdf.pdf
The ! prefix negates the expression, so it is successful when it does not match.
The = prefix operator makes this an exact string match, not a regex, so just use the complete URL-path as-is.
must be visible even if you're not logged in
Just to clarify (concern raised in comments)... this code does not check that the user is "logged in" (it does not authenticate the WP auth token). This simply checks for the existence of a cookie (ie. a Cookie HTTP request header that contains the value wordpress_logged_in_).
This might stop the casual user, but it is easily circumvented by the determined user so cannot be used to protect sensitive media.

Log image filename that's cached by external cdn using htaccess

I want to keep a log of image file names whenever a specific cdn caches our images but I can't quite get it. Right now, my code looks something like:
RewriteCond %{HTTP_USER_AGENT} Photon/1.0
RewriteRule ^(.*)$ log.php?image=$1 [L]
The above always logs the image as being "log.php" even if I'm making the cdn cache "example.jpg" and I thoroughly don't understand why.
The above always logs the image as being "log.php" even if I'm making the cdn cache "example.jpg" and I thoroughly don't understand why.
Because in .htaccess the rewrite engine loops until the URL passes through unchanged (despite the presence of the L flag) and your rule also matches log.php (your rule matches everything) - so this is the "image" that is ultimately logged. The L flag simply stops the current pass through the rewrite engine.
For example:
Request /example.jpg
Request is rewritten to log.php?image=example.jpg
Rewrite engine starts over, passing /log.php?image=example.jpg to the start of the second pass.
Request is rewritten to log.php?image=log.php by the same RewriteRule directive.
Rewrite engine starts over, passing /log.php?image=log.php to the start of the third pass.
Request is rewritten to log.php?image=log.php (again).
URL has not changed in the last pass - processing stops.
You need to make an exception so that log.php itself is not processed. Or, state that all non-.php files are processed (instead of everything). Or, if only images are meant to be processed then only check for images.
For example:
# Log images only
RewriteCond %{HTTP_USER_AGENT} Photon/1\.0
RewriteRule ^(.+\.(?:png|jpg|webp|gif))$ log.php?image=$1 [L]
Remember to backslash-escape literal dots in the regex.
Or,
# Log Everything except log.php itself
RewriteCond %{HTTP_USER_AGENT} Photon/1\.0
RewriteCond %{REQUEST_URI} ^/(.+)
RewriteRule !^log\.php$ log.php?image=%1 [L]
In the last example, %1 refers to the captured subpattern in the preceding CondPattern. I only did it this way, rather than using REQUEST_URI directly, because you are excluding the slash prefix in your original logging directive (ie. you are passing image.jpg to your script when /image.jpg is requested). If you want to log the slash prefix as well, then you can omit the second condition and pass REQUEST_URI directly. For example:
# Log Everything except log.php itself (include slash prefix)
RewriteCond %{HTTP_USER_AGENT} Photon/1\.0
RewriteRule !^log\.php$ log.php?image=%{REQUEST_URI} [L]
Alternatively, on Apache 2.4+ you can use the END flag instead of L to force the rewrite engine to stop and prevent further passes through the rewrite engine. For example:
RewriteCond %{HTTP_USER_AGENT} Photon/1\.0
RewriteRule (.+) log.php?image=$1 [END]

apache mod_rewrite redirect between hostnames (except one directory)

I've got two hostnames (e.g. www.site1.com and www.site2.com) mapped to the same apache virtual host. I'd like to redirect all traffic from site1.com over to site2.com, except for all POST requests from a particular folder. (That folder contains URLs which are used by older clients which are not capable of handling redirects on POST requests.)
Here's the rule I came up with. I'm a newbie to rewrite rules, so wanted to make sure I wasn't missing anything obvious.
RewriteCond %{HTTP_HOST} ^www.site1.com$
RewriteCond %{HTTP_METHOD} ^POST
RewriteCond %{REQUEST_URI} !^/dontredirectme/
RewriteRule /(.*) http://www.site2.com/$1 [R=301,L]
Is this a correct rule to use for this job? And if yes, are there any efficiency optimizations I should be considering?
Your rule set looks relatively correct, but you need to modify your second RewriteCond a little to reflect your goal (note also that the server variable is REQUEST_METHOD; HTTP_METHOD is not a valid mod_rewrite variable):
RewriteCond %{REQUEST_METHOD} !^POST [OR]
This will allow you to redirect if the request method is not POST, or if it is POST but the requested URI is not under /dontredirectme/, which effectively results in a redirect for everything except POST requests to /dontredirectme/.
Additionally, the input to the RewriteRule will not have a leading forward slash if you're defining it in a per-directory context (in a .htaccess file or in a <Directory> section). If you are defining it directly in the <VirtualHost> (a per-server context), then the input will have a leading slash, so your rule would be fine as-is.
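For example, in a per-directory (.htaccess) context the rule's pattern would simply drop the leading slash (only this line changes; the conditions stay the same):

RewriteRule (.*) http://www.site2.com/$1 [R=301,L]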
As far as efficiency goes, rules defined in the server configuration have the benefit of only having to be parsed one time. On the other hand, a .htaccess file must be parsed for each request, a process which involves the additional (albeit small) overhead of reading the file and compiling the regular expressions.
If you really want to squeeze efficiency out of it, you could make the following changes:
RewriteCond %{HTTP_HOST} =www.site1.com
RewriteCond %{REQUEST_METHOD} !=POST [NC,OR]
RewriteCond %{REQUEST_URI} !^/dontredirectme/
RewriteRule ^ http://www.site2.com%{REQUEST_URI} [R=301,L]
I doubt the difference is really appreciable in all but the most extreme cases, but this removes two regular expressions in favour of a direct text comparison. Also, since you just want to redirect the request to the new host verbatim, you can "simplify" the regular expression involved in the RewriteRule and just use %{REQUEST_URI} directly in the replacement.