How to prevent AEM/Sling from adding a trailing slash to extensionless URLs?

All extensionless URLs on the site that resolve to actual nodes are being redirected (with a 301 code) to the same URL with a trailing slash added. This doubles the number of requests to the frontend web server, so we would like to fix it.
We do use Apache mod_rewrite to rewrite all incoming URLs (with or without slash) to their .html equivalents in order to make dispatcher caching consistent, but the actual processing is a bit weird.
In general, we have three cases:
1. URL has an extension (i.e. /content/xxx/yyy.html): it is processed right away, with no redirects.
2. URL has a trailing slash (/content/xxx/yyy/): it is processed by mod_rewrite and rewritten to /content/xxx/yyy.html successfully, with no redirects.
3. Extensionless URL (/content/xxx/yyy): it is processed by mod_rewrite, rewritten to /content/xxx/yyy.html, and immediately redirected to /content/xxx/yyy/, which then goes through the routine from point 2 above.
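For illustration, the rewriting described in these three cases could come from a rule along the following lines. This is only a guess at the kind of configuration involved, not the actual rules in use and not a fix for the redirect. The pattern and the assumption that page paths never contain a dot are hypothetical.

```apache
# Hypothetical virtual-host rule, assuming mod_rewrite is loaded and that
# page paths under /content never contain a dot.
RewriteEngine On

# /content/xxx/yyy/ and /content/xxx/yyy both become /content/xxx/yyy.html.
# URLs that already carry an extension contain a dot and are left untouched.
# PT passes the rewritten URI back to the URL mapping stage so that other
# modules see the .html form.
RewriteRule ^/(content/[^.]+?)/?$ /$1.html [PT,L]
```

With a rule of this shape, any 301 that still appears for case 3 is issued after the rewrite, which is consistent with the suspicion that something behind Apache is responsible.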
To rule out redirects originating in Apache, we disabled almost all modules that could cause them through content negotiation or directory indexing (mod_dir, mod_negotiation, mod_autoindex, etc.), but requests are still being redirected.
Our app doesn't contain any redirects based on the URL, so I'm wondering whether there is an OSGi service or a hidden configuration setting that triggers such redirects.
We also have a set of shortcuts on the site, Apache rewrites them to actual URLs and they are NOT being redirected.
For example, if the requested URL is /aboutus, it is successfully mapped to /content/xxx/yyy/operations/aboutus.html and processed in one pass without any additional redirects. The problem described above occurs only when the request is extensionless and there is an actual corresponding node in the JCR.

Related

RedirectMatch without last part of URL

I have this RedirectMatch:
RedirectMatch 301 ^/en/products/(.*)/(.*)/(.*)$ https://www.example.com/en/collections/$2/
If I visit
https://www.example.com/en/products/sofas/greyson/greyson-sofa
I'm redirected to
https://www.example.com/en/collections/greyson/greyson-sofa
What I want is
https://www.example.com/en/collections/greyson/
How do I accomplish this?
There's nothing obvious in what you have posted that would produce the specific output you are seeing, however, there are other errors in the directives and you may be seeing a cached response. 301s are cached persistently by the browser, so any errors are also cached.
The Redirect directive is prefix-matching and everything after the match is copied onto the end of the target URL. So, the redirect you are seeing would be produced by a directive something like this:
Redirect 301 /en/products/sofas/greyson https://www.example.com/en/collections/greyson
When you request /en/products/sofas/greyson/greyson-sofa, the part after the match, i.e. /greyson-sofa, is copied onto the end of the target URL to produce /en/collections/greyson/greyson-sofa.
You can resolve most of these issues by reordering your rules (but also watch the trailing slashes). You need to have the most specific redirects first. RedirectMatch before Redirect. For example, take the following two redirects:
Redirect 301 /en/products/accessories https://www.example.com/en/products/complements/
Redirect 301 /en/products/accessories/bush/ https://www.example.com/en/collections/bush-on/
Since the Redirect directive is prefix-matching, a request for /en/products/accessories/bush/ will actually be caught by the first rule, not the second, and end up redirecting to /en/products/complements//bush/. Note the erroneous double slash, which comes from the mismatch of trailing slashes between the source and target URLs.
You need to reverse these two rules. (But also watch the trailing slash.)
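As a rough sketch of that ordering, assuming all of these directives live in the same context (for example, one VirtualHost), the rules could be arranged like this. The tightened RedirectMatch pattern is an illustrative variant, not the asker's directive.

```apache
# Pattern-based rule first. [^/]+ matches exactly one path segment, so the
# three segments cannot bleed into one another.
RedirectMatch 301 ^/en/products/[^/]+/([^/]+)/[^/]+$ https://www.example.com/en/collections/$1/

# Then prefix redirects, longest (most specific) prefix first. Source and
# target both omit the trailing slash, so whatever follows the matched
# prefix (including a lone "/") is copied over without doubling slashes.
Redirect 301 /en/products/accessories/bush https://www.example.com/en/collections/bush-on
Redirect 301 /en/products/accessories https://www.example.com/en/products/complements
```

After changing the rules, test with an empty browser cache or a command-line client, since an earlier 301 may still be cached.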
The same applies to the Redirect directives that follow. You also have some duplication, i.e. you have two rules for /en/products/chairs-and-bar-stools/piper/.

Apache 2.4 rewriting directory URLs without trailing slash to https://default_site/dir/ instead of preserving domain

This is a relatively recent behavioral change and appears to affect only requests that include an "Upgrade-Insecure-Requests: 1" request header.
Apache has started rewriting such requests for sites which are HTTP-only to an HTTPS URL using the default site name instead of just adding the / at the end of the requested URL.
Example: URL submitted in browser: http://www.example.com/blah
Intended redirect: 301 to http://www.example.com/blah/
Instead redirects: 301 to https://default.site.configured/blah/
This happens whether it's a named virtual on the same address as the default server or a virtual using a separate address with separate Listen directives.
I understand all the arguments in favor of the idea that everything should always be encrypted and I don't want to get into a debate about that. This site doesn't consider the tradeoffs desirable at this time.
The default site does have SSL and is configured to redirect HTTP->HTTPS, but the www.example.com site is not configured that way and does not wish to implement SSL at this time.
Is there any way to get Apache 2.4 to disregard that "Upgrade" header and simply rewrite the URL as desired rather than altering the domain name?
After banging on this some more, I finally found the source of my woes.
This happens when you have IP based virtual hosts and did not configure a name for them using the "ServerName" directive.
tl;dr: If you are having this problem, try adding a "ServerName www.example.com" directive within the VirtualHost definition for the site and that should resolve it.
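A minimal sketch of such a definition is shown here. The IP address and DocumentRoot are placeholders chosen for illustration. Only the ServerName line is the actual fix described above.

```apache
# IP-based virtual host; 192.0.2.10 and the DocumentRoot path are hypothetical.
<VirtualHost 192.0.2.10:80>
    # Without this line, self-referential redirects (such as the trailing-slash
    # redirect) may be built from the default server's name.
    ServerName www.example.com
    DocumentRoot "/var/www/example"
</VirtualHost>
```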
Details:
It does not happen until you encounter a URL that requires a rewrite other than adding a trailing /. (i.e. if you get a request that doesn't contain the "Upgrade-Insecure-Requests: 1" header, it only gets the trailing / added, but if you get one with that header, it also tries to rewrite the protocol to https which triggers the full URL rewrite).
In my case, the default host name had an SSL configuration, so it didn't fall back to HTTP after the rewrite or reject the rewrite as invalid.
YMMV, I did not continue to do an exhaustive test of all permutations once I found the solution.

mod_rewrite behaviour when no rewriteBase

Just want to confirm something. From what I gather of how mod_rewrite works, Apache receives a URL and mod_rewrite immediately applies the (non-<Directory>) rules in httpd.conf, then the per-directory rules go to work, and then the process restarts with the new URL if any changes were made.
@JonLin's great answer to this question first says that when your per-directory rule specifies an absolute replacement (i.e. starting with a slash), it's assumed to be relative to the DocumentRoot, which I get. But of relative replacements (no slash) Jon then says:
it's based on the directory that the rule is in. So if
RewriteRule ^foo$ bar.php [L]
is in the "root" and you go to http://example.com/foo, you get served http://example.com/bar.php. But if that rule is in the "subdir1" directory, and you go to http://example.com/subdir1/foo, you get served http://example.com/subdir1/bar.php. etc. This sometimes works and sometimes doesn't, as the documentation says, it's supposed to be required for relative paths, but most of the time it seems to work. Except when you are redirecting (using the R flag, or implicitly because you have http://host in your rule's target). That means this rule:
RewriteRule ^foo$ bar.php [L,R]
if it's in the "subdir2" directory, and you go to http://example.com/subdir2/foo, mod_rewrite will mistake the relative path as a file-path instead of a URL-path and because of the R flag, you'll end up getting redirected to something like: http://example.com/var/www/localhost/htdocs/subdir2/bar.php.
As Jon explains in the last bit, when a redirect will occur and when there's no rewriteBase, a string intended as filepath gets appended to the site's base address to create a phony URL. But just to confirm, even in the former case Jon mentions, ie. not an actual redirect, the substituted string does get sent back to Apache's URL-reception code, restarting the whole process, correct? The diagram on this page of the spec seems to imply that until no rules make a change, the process keeps restarting. These non-redirect cases would seem to be the time when it WOULD make sense to tack the filepath right from the file system root to the htaccess directory onto the beginning of the substitution. But how does that get turned into a proper URL as expected by the URL-reception code - does http://localhost get prepended? I think that would make everything relative to the documentroot, not the actual file system root.
Thanks!
Been doing some more reading and think I've got this explained, for anyone who's interested.
Regarding my question about how a file system absolute path gets turned into a valid url for the internal redirect, I was thinking that the URI in an HTTP request contained "http://hostname", but this has been cut off ie. the URI is like /this/is/a/path. The host name is in a separate "Host" header field, and is no longer a vital piece of information by the time mod_rewrite is running, as Apache's initial Post Read Request phase has already noticed the GET request on the port and, if Name-Based Virtual Hosting is in use, interpreted things like the DocumentRoot from the Host header field, and finally called the URI Translation Phase where mod_rewrite executes. So any time mod_rewrite is running, there could be only one host name that got us here.
So to summarize, what I had called the "URL-reception" part of Apache always deals with /paths/like/this/without/hostname, not just after internal redirects. The spec does say that rewriteCond/rewriteRule match against such paths, but I figured the host name was there initially and got removed. So then all that's left is to ensure our rules are prepared for cases where they are running in an internal redirect spawned by an earlier runthrough of themselves, and not do something inadvertent when they see a file system absolute path caused by a replacement that didn't start with a slash. What a mouthful.
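One way to avoid the file-system-path surprise with relative substitutions is to declare the URL base explicitly. The following is a sketch for an .htaccess file in the "subdir2" directory from the quoted example, reusing its foo and bar.php names.

```apache
# .htaccess in the subdir2 directory (names taken from the quoted example)
RewriteEngine On

# Tell mod_rewrite which URL-path this directory corresponds to, so relative
# substitutions are resolved against /subdir2/ rather than the file-system path.
RewriteBase /subdir2/

# A request for /subdir2/foo is redirected to /subdir2/bar.php.
RewriteRule ^foo$ bar.php [L,R]
```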

Level of obscurity of destination URLs via mod_rewrite

To achieve a single layer of content delivery security, I'm looking into the possibility of obscuring a resource URL via an .htaccess RewriteRule:
RewriteEngine on
RewriteBase /js/
RewriteRule obscure-alias\.js http://example.com/sensitive.js
It would of course be implemented as:
<script type="text/javascript" src="obscure-alias.js"></script>
Because this is not a 301 redirect, but rather a routing scenario similar to that of many of the frameworks we use today, would it be safe to say that this RewriteRule adequately obfuscates the actual URL where this resource is located, or:
Can the destination URL still be found out via some HTTP header sniffing utility?
Might a web browser be able to reveal the "Download URL"?
I'm going to pre-answer my own questions by saying no to both since the "internal proxy" is taking place on the server-side and not on the client side if I understand it correctly: http://httpd.apache.org/docs/current/mod/mod_rewrite.html. I just wanted to confirm that when Apache goes to serve the destination URL, that it also isn't passing along information to the user agent what the URL was that it rewrote the original request as.
It depends on how you specify the redirect target.
If your http://example.com/ is running on the same server, there will be an internal redirect that is invisible to the client. From the manual:
Absolute URL
If an absolute URL is specified, mod_rewrite checks to see whether the hostname matches the current host. If it does, the scheme and hostname are stripped out and the resulting path is treated as a URL-path. Otherwise, an external redirect is performed for the given URL. To force an external redirect back to the current host, see the [R] flag below.
If the absolute URL points to a remote domain, a header redirect will be performed. A header redirect is visible to the client and will reveal the sensitive location.
To make sure no external redirect takes place, specify a relative URL like
RewriteRule obscure-alias\.js sensitive.js
Note that the sensitive JS file's URL can still be guessed.
To find out whether a request results in a header redirect, log in to a terminal (e.g. on a Linux server) and run
wget --server-response http://www.example.com
If the first HTTP/... status line (there may be more than one) carries a 3xx code, like
HTTP request sent, awaiting response...
HTTP/1.1 302 Moved Temporarily
you are looking at a header redirect.
This is possible using mod_rewrite's proxy passthrough (the [P] flag).
See http://httpd.apache.org/docs/2.4/rewrite/proxy.html
Also alluded to here as well: mod_rewrite not working as internal proxy
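As a sketch of that approach, assuming mod_proxy and mod_proxy_http are loaded and the rule sits in the .htaccess file of the /js/ directory, the alias could be proxied like this. The host backend.example.com is a hypothetical stand-in for wherever the real file lives.

```apache
# .htaccess in /js/ ; backend.example.com is a hypothetical remote host.
RewriteEngine On

# [P] makes Apache fetch the target itself and relay the response, so the
# client only ever sees obscure-alias.js and no Location header is sent.
RewriteRule ^obscure-alias\.js$ http://backend.example.com/sensitive.js [P]
```

The real URL stays hidden from response headers, although anyone who can guess it can still request it directly unless access to it is restricted.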

Apache rewrite rule - prevent rewritten URL appearing in browser URL bar

I have a rewrite rule which looks for a particular URI. When it matches, it rewrites the URI to the proper file path so the required content can be found. It then changes the protocol to HTTPS and allows the request to pass through.
I have two problems:
I don't want the rewritten path to appear in the user's browser. I want to maintain the vanity URL.
I do want the HTTPS protocol to appear, indicating to the user that they are accessing the site over a secured connection.
I have tried a couple of options but with no success. If I include the [R] flag the URL and protocol remain unchanged but that is not the desired effect.
Any suggestions on how I can achieve this?
This is my rule:
RewriteMap redirectsIfSecure txt:/myserver/content/secure_urls.txt
RewriteCond ${lowercase:%{REQUEST_URI}} ^/(.+)$
RewriteCond ${redirectsIfSecure:%1|NOT_FOUND} !NOT_FOUND
RewriteRule ^(.*)$ https://myserver.com${redirectsIfSecure:%1} [PT]
From the mod_rewrite documentation:
If an absolute URL is specified, mod_rewrite checks to see whether the
hostname matches the current host. If it does, the scheme and hostname
are stripped out and the resulting path is treated as a URL-path.
Otherwise, an external redirect is performed for the given URL. To
force an external redirect back to the current host, see the [R] flag
below.
If you rewrite the request to a fully qualified URL (that is, anything starting with http://, https://, etc) that doesn't match your ServerName, then mod_rewrite will issue an HTTP redirect, which will cause the client browser to request the resource from the new location.
If you're not trying to switch between http and https you can use a proxy rule (the P flag) to have Apache make the request on behalf of the client and return the result, thus masking the rewritten URL.
However, if you're trying to upgrade from http to https (or the other way around), this will always require a client redirect.
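One possible way to combine the two requirements is to split the work. The HTTP virtual host redirects the vanity URL to HTTPS unchanged, and the HTTPS virtual host then maps it to the real path internally. The sketch below assumes that the `lowercase` map is defined with `int:tolower` (its definition is not shown in the question), that the map values start with a slash, and that both virtual hosts exist. The 302 status is likewise an arbitrary choice.

```apache
<VirtualHost *:80>
    ServerName myserver.com
    RewriteEngine On
    RewriteMap lowercase int:tolower
    RewriteMap redirectsIfSecure "txt:/myserver/content/secure_urls.txt"

    # If the vanity URL is listed in the map, send the client to the same
    # URL over HTTPS. The path in the address bar does not change.
    RewriteCond ${lowercase:%{REQUEST_URI}} ^/(.+)$
    RewriteCond ${redirectsIfSecure:%1|NOT_FOUND} !NOT_FOUND
    RewriteRule ^ https://myserver.com%{REQUEST_URI} [R=302,L]
</VirtualHost>

<VirtualHost *:443>
    ServerName myserver.com
    # SSLEngine and certificate directives omitted.
    RewriteEngine On
    RewriteMap lowercase int:tolower
    RewriteMap redirectsIfSecure "txt:/myserver/content/secure_urls.txt"

    # Internal rewrite only: no scheme or host in the substitution, so no
    # redirect is issued and the vanity URL stays visible to the user.
    RewriteCond ${lowercase:%{REQUEST_URI}} ^/(.+)$
    RewriteCond ${redirectsIfSecure:%1|NOT_FOUND} !NOT_FOUND
    RewriteRule ^ ${redirectsIfSecure:%1} [PT,L]
</VirtualHost>
```

The client still sees one redirect (HTTP to HTTPS), but it never sees the rewritten file path.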