Why doesn't $1 store the complete URL when the pattern is ^(.*)$? - apache

There are several topics already about this, but I haven't found an answer or I still don't understand it correctly. I know that $1 represents the match from the first set of parentheses in the RewriteRule regex. $1 also stores this value. But if there is only ^(.*)$, then it seems to work differently?
Example:
URL: http://www.example.com/
RewriteBase /
RewriteCond %{HTTP_HOST} ^www\.(.*)$ [NC,OR]
RewriteCond %{HTTPS_HOST} ^www\.(.*)$ [NC]
RewriteRule ^(.*)$ https://%1/$1 [R=301,L]
What I understand:
1. http://www.example.com/ matches with RewriteCond %{HTTP_HOST} ^www\.(.*)$ [NC] and stores the match in %1 (=example.com/).
2. go to RewriteRule because the URL matched in step 1
3. RewriteRule gets the string http://www.example.com/. Because of ^(.*)$, http://www.example.com/ matches completely and is stored in $1.
4. I think this URL should appear: https://example.com/http://www.example.com/
What actually appears: https://example.com/
Why does $1 have an empty string? It's all matched, isn't it?

There are quite a few misconceptions here that I'll try to address...
RewriteBase /
RewriteCond %{HTTP_HOST} ^www\.(.*)$ [NC,OR]
RewriteCond %{HTTPS_HOST} ^www\.(.*)$ [NC]
RewriteRule ^(.*)$ https://%1/$1 [R=301,L]
I'll ignore the RewriteBase directive and the second RewriteCond directive...
The RewriteBase directive does not apply here, since there are no relative path substitution strings (the 2nd argument to the RewriteRule directive).
There is no HTTPS_HOST server variable, only HTTP_HOST. See the following question on ServerFault: https://serverfault.com/questions/953020/what-is-the-difference-between-http-host-and-https-host-in-apache-htaccess-file
I think HTTPS_HOST has spread around the internet due to a few typos/misconceptions that have been blindly copied and pasted.
HTTP_HOST contains the value of the Host HTTP request header (the hostname) eg. www.example.com or example.com, depending on what was requested. Hence the name HTTP_ + HOST. This is the same naming convention used for all HTTP request headers. A corresponding server variable is created for each.
So, this becomes (removing the OR flag from the first condition):
RewriteCond %{HTTP_HOST} ^www\.(.*)$ [NC]
RewriteRule ^(.*)$ https://%1/$1 [R=301,L]
The RewriteRule pattern (eg. ^(.*)$)
But if there is only ^(.*)$, then it seems to work differently?
No, it works the same. The confusion would seem to be what the RewriteRule pattern actually matches against.
The RewriteRule pattern matches against the URL-path only.
The URL-path is the part of the URL after the scheme + hostname and before the query string. eg. Given a request for http://example.com/ then the URL-path is simply /. Or request http://example.com/foo/bar?param=1 - the URL-path is /foo/bar.
HOWEVER, in a per-directory context like .htaccess (as opposed to a server or virtualhost context) the directory-prefix is first removed from the URL-path before the match occurs. (Because .htaccess is processed after the request is mapped to the filesystem and strictly speaking matches against a file-path.) The directory-prefix is the absolute filesystem path of the directory that contains the .htaccess file and notably ends with a slash. eg. When the .htaccess file is located in the document root, then the directory-prefix will be something like /var/www/user/public_html/ (the filesystem path to the document root).
So, given a request for http://example.com/ then the URL-path that is matched by the RewriteRule pattern in .htaccess is simply "" (empty string). Or request http://example.com/foo/bar?param=1 - the URL-path that is matched is foo/bar - no slash prefix.
This is more significant when the .htaccess file is located in a subdirectory off the document root. For example, if the .htaccess file is located in the /subdir subdirectory and there is a request of the form http://example.com/subdir/foo/bar, the RewriteRule pattern will again match against just foo/bar (not subdir/foo/bar or /subdir/foo/bar). This is a significant difference to when RewriteRule directives are used in a server (or virtualhost) context. In a server context, the RewriteRule pattern always matches against the full URL-path, starting with a slash - there is no concept of a directory-prefix when used in a server context, since the directives are processed before the request is mapped to the filesystem.
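As a rough illustration of this difference, the same rewrite could be written as follows in each context. This is only a sketch: /target.php is a made-up destination, and the ^/? form is a commonly used idiom for patterns that need to work in both contexts, not something taken from the question.

```apache
# .htaccess in the document root (per-directory context):
# the pattern is matched against "foo/bar" (no leading slash).
RewriteEngine On
RewriteRule ^foo/bar$ /target.php [L]

# Server or <VirtualHost> context: the pattern is matched against
# "/foo/bar" (leading slash included), so the rule would instead be:
#   RewriteRule ^/foo/bar$ /target.php [L]

# Pattern that tolerates both contexts by making the slash optional:
#   RewriteRule ^/?foo/bar$ /target.php [L]
```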
What I understand:
http://www.example.com/ matches with RewriteCond %{HTTP_HOST} ^www\.(.*)$ [NC] and stores the match in %1 (=example.com/).
go to RewriteRule because the URL matched in step 1
RewriteRule gets the string http://www.example.com/. Because of ^(.*)$, http://www.example.com/ matches completely and is stored in $1.
I think this URL should appear : https://example.com/http://www.example.com/
You've got the order of processing wrong. It's actually the RewriteRule pattern that is processed first. Only if the RewriteRule pattern matches are the preceding RewriteCond directives (conditions) processed. If all the conditions are successful then the RewriteRule substitution (2nd argument) occurs.
So, in order, given a request for http://www.example.com/:
RewriteRule ^(.*)$ - The resulting URL-path "" (empty string) matches the RewriteRule pattern ^(.*)$. The $1 backreference then holds an empty string (as does the $0 backreference - which stores the match of the entire pattern - the same in this case)
RewriteCond %{HTTP_HOST} ^www\.(.*)$ - If the RewriteRule pattern matched in step #1 (it does in this case) then the preceding RewriteCond directive is processed. This matches the Host header eg. www.example.com (no http://) against the regex ^www\.(.*)$. If this is successful then the %1 backreference holds the value of the first captured group, ie. example.com in this example.
RewriteRule ^(.*)$ https://%1/$1 [R=301,L] - If the preceding condition(s) is successful then the substitution (ie. https://%1/$1) in the RewriteRule directive occurs. ie. https://example.com/ - %1 is example.com from the captured group in the last matched CondPattern and $1 is an empty string, from the captured group in the RewriteRule pattern.
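The same order can be annotated directly on the rule block. The sketch below assumes a request for http://www.example.com/foo (a hypothetical path, chosen so that $1 is not empty) and a .htaccess file in the document root.

```apache
# Request: http://www.example.com/foo

# Step 2: evaluated only after the RewriteRule pattern below has matched.
#         The Host header "www.example.com" matches, so %1 = "example.com".
RewriteCond %{HTTP_HOST} ^www\.(.*)$ [NC]

# Step 1: the pattern is tested first, against "foo", so $1 = "foo".
# Step 3: the substitution expands to https://example.com/foo
#         and a 301 redirect is sent.
RewriteRule ^(.*)$ https://%1/$1 [R=301,L]
```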
Other notes:
Due to the order of processing, it is naturally more efficient to do as much pattern matching in the RewriteRule pattern as possible, instead of relying on preceding RewriteCond directives. (It is a common misconception that RewriteCond directives are processed first - that is not the case.)
Due to the order of processing, you can use $n backreferences in the TestString (first) argument of the preceding RewriteCond directives. (This wouldn't be possible if the directives were literally processed top-down.)
The %n back references are only from the last matched CondPattern. This is important to consider if you have multiple conditions.
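To illustrate the second note, here is a minimal sketch in which a condition reuses $1 from the rule that follows it. The /cache directory and the .html naming scheme are assumptions made up for this example.

```apache
RewriteEngine On

# The RewriteRule pattern is matched first, so $1 is already set when
# the TestString of this condition is expanded.
# -f succeeds only if <document root>/cache/<requested path>.html
# exists as a regular file.
RewriteCond %{DOCUMENT_ROOT}/cache/$1.html -f
RewriteRule ^(.+)$ /cache/$1.html [L]
```

If the directives really were processed top-down, $1 would not yet have a value when the condition is evaluated.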

Related

htaccess redirect outcome not as expected

I hope someone can help me with the following problem.
I have a multiple language site with the language as a folder like
example.com/se/post
I want to get the language separated by domain like example.se.
So far no problem with a DNS alias and WPML plugin.
The problem I have is that I want to redirect example.com/se/post to example.se/post. I tried to use this rule in the .htaccess file but it changes the URL to example.se/se with the /se that I do not need. I'm not very familiar with the rewrite engine in the .htaccess file.
<IfModule mod_headers.c>
RewriteEngine on
RewriteCond %{HTTP_HOST} ^(www\.)?nofairytales\.nl$ [NC]
RewriteCond %{REQUEST_URI} ^/sv(/.*)?$ [NC]
RewriteRule ^(.*)$ http://www.nofairytales.se%{REQUEST_URI} [L,R=301]
</IfModule>
RewriteCond %{HTTP_HOST} ^(www\.)?nofairytales\.nl$ [NC]
RewriteCond %{REQUEST_URI} ^/sv(/.*)?$ [NC]
RewriteRule ^(.*)$ http://www.example.se%{REQUEST_URI} [L,R=301]
This is close... you are capturing the URL-path (/post part) in the preceding condition but not using it in the substitution string. Instead, you are using REQUEST_URI which contains the full root-relative URL-path.
You are also matching sv in the URL-path, but redirecting to se in the TLD. The following should resolve the issue (with minimal changes):
RewriteCond %{REQUEST_URI} ^/se(/.*)?$ [NC]
RewriteRule ^(.*)$ http://www.example.se%1 [L,R=301]
Where %1 is a backreference to the captured subpattern in the preceding condition (the /post part).
However, you don't need the second (or even the first) condition, as it can all be done in the RewriteRule directive. There wouldn't seem to be a need to check the requested hostname, since if the language subdirectory is in the URL-path then it would seem you should redirect anyway to remove it.
For example, the following should be sufficient to redirect a single language code:
# Language "se"
RewriteRule ^se(?:/(.*))?$ https://www.example.se/$1 [R=301,L]
The non-capturing group that contains the slash delimiter ensures that we always have a trailing slash on the target URL (after the hostname). The first rule above requires the user-agent to "correct" the redirect response when the slash after the hostname is omitted (which it does).
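The effect of the optional non-capturing group can be seen by listing a few requests next to the redirect target they would produce. This is a sketch that assumes the .htaccess file is in the document root; the request paths are illustrative.

```apache
# Request      Matched against   $1       Redirect target
# /se          "se"              (empty)  https://www.example.se/
# /se/         "se/"             (empty)  https://www.example.se/
# /se/post     "se/post"         "post"   https://www.example.se/post
# /search      "search"          -        (no match, rule is skipped)
RewriteRule ^se(?:/(.*))?$ https://www.example.se/$1 [R=301,L]
```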
For multiple languages you can modify the same rule with regex alternation. For example:
# All supported languages
RewriteRule ^(se|uk|us|au|id)(?:/(.*))?$ https://www.example.$1/$2 [R=301,L]
This assumes all language codes map to a TLD using the same code. If not then you can implement a "mapping" (lang code -> TLD) in the rule itself or use a RewriteMap if you have access to the server config. This could also provide a "default" TLD.
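A mapping "in the rule itself" could be as simple as a dedicated rule for each code that does not match its TLD, placed before the generic rule. The uk -> .co.uk pairing below is a hypothetical example, not something stated in the question.

```apache
# Hypothetical exception: language code "uk" lives on the .co.uk domain
RewriteRule ^uk(?:/(.*))?$ https://www.example.co.uk/$1 [R=301,L]

# Remaining codes map directly to a TLD of the same name
RewriteRule ^(se|us|au|id)(?:/(.*))?$ https://www.example.$1/$2 [R=301,L]
```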
You could be more generic and allow any two-character language code in the regex. eg. ^([a-z]{2})(?:/(.*))?$. And simply pass this through to the TLD. However, a request for an unknown language (eg. /xx/post) - which might have resulted from an error on your site - will now result in either a malformed redirect (since the domain won't resolve) or worse, a redirect to a competitor lying in wait. And this might go undetected unless you run an analysis of your redirects. So, being more restrictive with the regex/rule may be advisable.

htaccess send 404 if query string contains keyword

I'm seeing a lot of traffic which I suspect is probing for a flaw or exploit with the request format of
https://example.com/?testword
I figured while I look into this more I could save resources and disrupt or discourage these requests with a 404 or 500 response
I have tried
RewriteEngine On
RewriteCond %{QUERY_STRING} !(^|&)testword($|&) [NC]
RewriteRule https://example.com/ [L,R=404]
And some other variations on the query string match, but none seem to return 404 when testing. Other questions I have found look for query string values/pairs and rewrite them, but no examples seem to exist for just a single value.
RewriteCond %{QUERY_STRING} !(^|&)testword($|&) [NC]
RewriteRule https://example.com/ [L,R=404]
There are a few issues here:
The CondPattern in your condition is negated (! prefix), so it's only successful when the testword is not present in the query string.
The RewriteRule directive is missing the pattern (first) argument (or substitution (second) argument depending on how you look at it). The RewriteRule directive matches against the URL-path only.
When you specify a non-3xx status code for the R flag, the substitution is ignored. You should specify a single hyphen (-) to indicate no substitution.
To test that the whole-word "testword" exists anywhere in the query string, you can use the regex \btestword\b - where \b are word boundaries. Or maybe you simply want the regex testword - to match "testword" literally anywhere, including when it appears as part of another word? In comparison, the regex (^|&)testword($|&) would miss instances where "testword" appears as a URL parameter name.
Try the following instead:
RewriteCond %{QUERY_STRING} \btestword\b [NC]
RewriteRule ^$ - [R=404]
This matches the homepage only (ie. empty URL-path). The L flag is not required when specifying a non-3xx return status, since it is implied.
The - (second argument) indicates no substitution. As mentioned above, when specifying a non-3xx HTTP status, the substitution string is ignored anyway.
To test any URL-path then simply remove the $ (end-of-string anchor) on the RewriteRule pattern. For example:
RewriteCond %{QUERY_STRING} \btestword\b [NC]
RewriteRule ^ - [R=404]
If your homepage doesn't accept any query string parameters then you could simply reject the request (ie. 404 Not Found) when a query string is present. For example:
RewriteCond %{QUERY_STRING} .
RewriteRule ^$ - [R=404]
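For comparison, the three candidate patterns discussed in this answer behave roughly as follows. The sample query strings are made up for illustration.

```apache
# Query string     (^|&)testword($|&)   \btestword\b   testword
# testword         match                match          match
# a=1&testword     match                match          match
# testword=1       no match             match          match
# mytestwords      no match             no match       match

# Whole-word match anywhere in the query string, for any URL-path
RewriteCond %{QUERY_STRING} \btestword\b [NC]
RewriteRule ^ - [R=404]
```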

Rewriting subdirectory to query string parameter

I have two requirements;
That, for example, /product/12345 is internally redirected to /product/product.php?product=12345.
That if the user tries to access /product/product.php in the URL bar, it is redirected to /product/ for tidiness.
Separate, they both work correctly, but together it results in an infinite loop - I know that I'm redirecting from /product/ to /product.php and back again, but the difference is internal vs external and I'm not sure how to distinguish between them.
RewriteEngine On
RewriteRule ^product/product.php /product/ [NC,R=307,END]
RewriteCond %{REQUEST_URI} !^/product/product.php [NC]
RewriteRule ^product/(.*) /product/product.php?product=$1 [NC]
There probably exist other solutions, but it works if you change two things:
Add a condition to the first RewriteRule that checks if the query string is empty, i.e. product/product.php without query string redirects to /product/.
Change (.*) in the second RewriteRule to (.+) or ([0-9]+) to only rewrite requests containing a product id (requests to /product/ are not rewritten).
RewriteEngine On
RewriteCond %{QUERY_STRING} ="" [NC]
RewriteRule ^product/product\.php$ /product/ [NC,R=307,END]
RewriteCond %{REQUEST_URI} !^/product/product\.php [NC]
RewriteRule ^product/(.+) /product/product.php?product=$1 [NC]
access /product/product.php in the URL bar, it is redirected to /product/ for tidiness
You might as well also redirect /product/product.php?product=12345 to the corresponding canonical URL (ie. /product/12345) - which you can do all in the same rule. If the product ID is numeric only then you should restrict your regex accordingly - this will also avoid the need for an additional condition.
For example:
# Canonical redirect
RewriteCond %{ENV:REDIRECT_STATUS} ^$
RewriteCond %{QUERY_STRING} ^(?:product=(\d*))?$ [NC]
RewriteRule ^product/product\.php$ /product/%1 [NC,R=307,QSD,L]
# Rewrite requests from "pretty" URL to underlying filesystem path
RewriteRule ^product/(\d*)$ /product/product.php?product=$1 [L]
The condition that checks against the REDIRECT_STATUS environment variable is necessary to prevent a redirect loop in this instance since the query string is entirely optional. The QSD flag (Apache 2.4+) discards the original query string on the redirect; without it, the product=... parameter would be passed through and appended to the canonical URL.
By restricting the match to digits only and anchoring the pattern with $, we avoid the need for an additional condition on the internal rewrite, since product.php won't match. If the product id can contain letters then restrict the pattern to avoid dots (.), eg. ([^./]*).
Only include a NC flag on the internal rewrite if this is strictly necessary, otherwise this potentially creates a duplicate content issue.
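If product ids can contain letters, the same pair of rules might look like the sketch below. It assumes that ids never contain a dot, slash or ampersand, and that Apache 2.4+ is in use (for the QSD flag).

```apache
RewriteEngine On

# Canonical redirect: only for direct requests (REDIRECT_STATUS is empty),
# not for requests that have already been internally rewritten.
RewriteCond %{ENV:REDIRECT_STATUS} ^$
RewriteCond %{QUERY_STRING} ^(?:product=([^./&]*))?$ [NC]
# QSD drops the original query string from the redirect target.
RewriteRule ^product/product\.php$ /product/%1 [NC,R=307,QSD,L]

# Internal rewrite: "product.php" contains a dot, so it cannot match here.
RewriteRule ^product/([^./]*)$ /product/product.php?product=$1 [L]
```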

.htaccess skip all rules if url matches

I want to skip all rewrite URLs when specific URL matches. I want to open this page:
https://www.example.com/.well-known/pki-validation/godaddy.html
If godaddy.html matches the URL. Here is what i am doing:
RewriteCond "%{REQUEST_URI}" "==/godaddy.html"
RewriteRule ^(.*)$ https://www.example.com/.well-known/pki-validation/godaddy.html [L]
RewriteCond %{HTTP_HOST} ^example.com
RewriteRule ^(.*)$ https://www.example.com/index.php
but it does not work. I have also tried the [END] flag, but when I write flag [END] it gives me 500 internal server error.
If you want to stop rewriting when the requested URL ends with godaddy.html, you can use a dash (-) as the substitution.
Substitution of a rewrite rule is the string that replaces the original URL-path that was matched by Pattern. The Substitution may be a:
...
- (dash)
A dash indicates that no substitution should be performed (the existing path is passed through untouched). This is used when a flag (see below) needs to be applied without changing the path.
RewriteRule godaddy\.html$ - [L]
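For this to have any effect, the pass-through rule has to appear before the rules it is meant to skip. A minimal sketch of the ordering, assuming the .htaccess file is in the document root and reusing the second rule from the question as-is:

```apache
RewriteEngine On

# Leave the validation file untouched and stop processing further rules.
# Must come first, otherwise the catch-all rule below wins.
RewriteRule ^\.well-known/pki-validation/godaddy\.html$ - [L]

# Existing rule from the question (only reached for other URLs)
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule ^(.*)$ https://www.example.com/index.php
```

Anchoring the pattern to the full path is stricter than matching any URL that ends in godaddy.html; which one is appropriate depends on the site.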

htaccess rewrite tld without changing subdomain or dirs

I am trying to rewrite the following url:
The subdomain should match any subdomain, and the same goes for the TLD.
Both http://car.example.com/ and http://cat.example.co.uk should be rewritten:
http://subdomain.example.com/some/dir
to
http://subdomain.example.nl/some/dir
and
http://example.com/some/dir
to
http://example.nl/some/dir
(also with the www. address)
but my knowledge of htaccess and rewrite rules in general isn't good enough for this :(
I hope one of you knows the solution.
ps. I did try a search ;)
The challenge comes with having to detect and account for four different possible domain patterns:
example.com → example.nl
example.co.uk → example.nl
sub.example.com → sub.example.nl
sub.example.co.uk → sub.example.nl
So, this ruleset checks that the TLD is not .nl (preventing a loop from occurring), then pulls the subdomain, www or not, off the front (read as "capture anything other than a dot, followed by a dot", all of it optional), followed by the base domain, followed by a dot. We don't have to match the entire hostname, since we aren't keeping the TLD.
RewriteEngine On
RewriteCond %{HTTP_HOST} !example\.nl$
RewriteCond %{HTTP_HOST} ^([^.]+\.)?example\.
RewriteRule ^ http://%1example.nl%{REQUEST_URI} [NC,L,R=301]
The RewriteRule's ^ matches any URL, then inserts the contents of the first set of parens in the preceding RewriteCond (the subdomain) with %1, and completes the rewriting by appending the requested path and flags to ignore case, make this the last rule, and redirect with a search-engine-friendly 301, ensuring the rewritten URL appears in the user's browser. Any query string (text appearing after a ? in the URL) is automatically included by default.
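Traced for a single request, the ruleset would behave roughly as annotated below. The request URL is a made-up example.

```apache
# Request: http://sub.example.co.uk/some/dir?x=1
RewriteEngine On

# The host does not end in "example.nl", so this negated condition succeeds.
RewriteCond %{HTTP_HOST} !example\.nl$

# Matches "sub.example."; %1 = "sub."
# (for a bare example.com request, %1 would be empty).
RewriteCond %{HTTP_HOST} ^([^.]+\.)?example\.

# REQUEST_URI = "/some/dir"; the query string x=1 is carried over by default.
# Result: 301 redirect to http://sub.example.nl/some/dir?x=1
RewriteRule ^ http://%1example.nl%{REQUEST_URI} [NC,L,R=301]
```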
Try this:
EDIT: See changes to subdomain, using %1 to capture from RewriteCond
RewriteEngine On
# Check if the hostname requested is subdomain.example.com or empty
# Now we attempt to capture the subdomain with (.*)? and reuse with %1
RewriteCond %{HTTP_HOST} ^(.*)?example\.com$ [NC]
RewriteCond %{HTTP_HOST} !^$
# Rewrite it as subdomain.example.nl and redirect the browser
# (in .htaccess, $1 has no leading slash, so one is added after the hostname)
RewriteRule ^(.*) http://%1example.nl/$1 [L,R,NE,QSA]
# Note: With the above edit for %1, this part should no longer be necessary.
# Then do the same for example.com, with or without the www
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com$ [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteRule ^(.*) http://www.example.nl/$1 [L,R,NE,QSA]