SEO URLs with ColdFusion controller? - apache

quick ref: area = portal type page.
I would like old urls http://domain.com/long/rubbish/url/blah/blah/index.cfm?id=12345
to redirect to http://domain.com/area/12345-short-title
http://domain.com/area/12345-short-title should display the content.
So far I have worked out that I could use Apache to rewrite all URLs to
http://domain.com/index.cfm/long/rubbish/url/blah/blah/index.cfm?id=12345
and
http://domain.com/index.cfm/area/12345-short-title
The index.cfm will either serve the content or apply a permanent redirect, but it will need to get the title and area information from the database first.
There are 50,000 pages on this website. I also have other ideas for subdomain redirects, and permanent subdomains and controlling how they act through the index.cfm.
Infrastructure are keen to do as much through Apache rewrite as possible, since we suspect it would be faster. However, I'm not sure we have that choice if we need to get the area and title information for each page.
Has anyone got some experience with this that can provide input?
--
Something to note, I'm assuming we'll have to keep all the internal URLs used on the website in the old format. It would be a mega job to change them all.
This means all internal URLs will have to use a permanent redirect every time.

Rather than redirecting both groups of URLs to the same script, why not simply send them to two distinct scripts?
Simply like this:
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^\w+/\d+-[\w-]+$ /content.cfm/$0 [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^.* /redirect.cfm/$0 [L,QSA]
Then, the redirect.cfm can lookup the replacement URL and do the 301 redirect, whilst content.cfm simply serves the content.
(You haven't specified how your CF is setup; you may need to update the Jrun/Tomcat/other config to support /content.cfm/* and /redirect.cfm/* - it'll be done the same as it's done for index.cfm)
For performance reasons, you still want to avoid the database hits for redirecting if you can, and you can do that by generating rewrite rules for each page that performs the 301 redirect on the Apache side. This can be as simple as appending a line to the .htaccess file, like so:
<cfset NewLine = 'RewriteRule #ReEscape(OldUrl)# #NewUrl# [L,QSA,R=301]' />
<cffile action="append" file="#ExpandPath('./.htaccess')#" output="#NewLine#" />
(Where OldUrl and NewUrl have been looked-up from the database.)
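As a rough illustration, here is what one generated entry might look like for the example URLs in the question. A RewriteRule pattern is matched against the URL path only, so if the old pages are distinguished by the id query parameter, the generated line would need a RewriteCond on QUERY_STRING in front of it. The path and id are taken from the question; the exact layout is an assumption, not something the answer specifies.

```apache
# Sketch of one generated entry (assumes .htaccess in the document root).
# The id is matched in the query string, because RewriteRule patterns
# only see the URL path.
RewriteCond %{QUERY_STRING} (?:^|&)id=12345(?:&|$)
# The trailing "?" on the target drops the old query string.
RewriteRule ^long/rubbish/url/blah/blah/index\.cfm$ /area/12345-short-title? [R=301,L]
```

Each condition applies only to the rule that immediately follows it, so a generator would have to emit the two lines as a pair.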
You might also want to investigate using mod_alias redirect instead of mod_rewrite RewriteRule, where the syntax would be Redirect permanent #OldUrl# #NewUrl# - since the OldUrl is an exact path match it would likely be faster.
Note that these rules will need to be checked before the above redirect.cfm redirect is done - if they are in the same .htaccess you can't simply do an append, but if they are in the site's general Apache config files then the .htaccess rules will be checked first.
Also, as per Sharon's comment, you should verify if your Apache will handle 50k rules - whilst I've seen it reported that "thousands" of regex-based Apache rewrites are perfectly fine, there may well be some limit (or at least the need to split across multiple files).

Using apache rewrites would only be faster if they were static rewrites, or if they all followed some rule that you could write in regex within the .htaccess file. If you're having to touch the database for these redirects, then it may not make sense to do it in .htaccess.
Another approach is the one used by most CMSs for handling virtual directories and redirects. An index.cfm file at the root of the site handles all incoming requests and returns the correct pages and pathing. MURA CMS uses this approach (as well as Joomla and most of the others.)
Basically you're using the CGI.path_info variable on an incoming request, searching for it in your DB, and doing a redirect to the new path. As usual, Ben Nadel has a good write-up of how to use this approach: Ben Nadel: Using IIS URL Rewriting And CGI.PATH_INFO With IIS MOD-Rewrite
You can, however, use the .htaccess to remove the "index.cfm" from the url string entirely if you want by redirecting all incoming requests to the root URL with something that looks like this in your .htaccess:
RewriteEngine On
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-d
RewriteRule ^([a-zA-Z0-9-]{1,})/([a-zA-Z0-9/-]+)$ /$1/index.cfm/$2 [PT]
Basically this would internally rewrite something like http://www.yourdomain.com/your-new-url/ to http://www.yourdomain.com/index.cfm/your-new-url/ where it could be processed as described by the blog post above. Because it is an internal rewrite rather than an external redirect, the user would never see the index.cfm.
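For comparison, a common way to write the catch-all form described here is sketched below. It assumes the .htaccess file sits in the document root and that index.cfm is mapped to ColdFusion with path-info support; neither is confirmed by the answer above.

```apache
RewriteEngine On
# Leave real files and directories (images, CSS, existing .cfm pages) alone.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# Do not rewrite requests that already point at the front controller.
RewriteCond %{REQUEST_URI} !^/index\.cfm
# /your-new-url/ is handled internally as /index.cfm/your-new-url/
RewriteRule ^(.*)$ /index.cfm/$1 [PT,L]
```

Inside index.cfm the original path would then be available through CGI.path_info, as in the approach described above.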

Related

.htaccess redirects if the condition does not match / negative condition

I am modifying the .htaccess file of a legacy PHP web application and am not familiar with Apache .htaccess syntax. I found this tutorial. I am trying to redirect all requests to one URL path unless the request is already for that path. For example, every request to the website should be redirected to localhost/my-custom-page unless the request URL is localhost/my-custom-page.
I know how to redirect mapping 1 to 1 as follows:
RewriteEngine on
RewriteRule ^my-old-url.html$ /my-new-url.html [R=301,L]
What I want instead is to redirect all requests to the specific page unless the request is for that page. Even the home page should be redirected to that page. How can I do that?
When I tried the following solution
RewriteEngine on
RewriteCond %{REQUEST_URI} !/my-new-url\.html
RewriteRule ^ /my-new-url.html [R=301]
I get the error
I want to check using OR condition as well. For example, if the path is not path-one or path-two, redirect all the requests to path-one.
Your question is a bit vague, due to your wording. But I assume this is what you are actually looking for:
RewriteEngine on
RewriteCond %{REQUEST_URI} !/my-new-url\.html
RewriteRule ^ /my-new-url.html [R=301]
In case you receive an internal server error (HTTP status 500) using the rule above, chances are that you operate a very old version of the Apache HTTP server. In that case you will see a definite hint to an unsupported [END] flag in your HTTP server's error log file. You can either try to upgrade or use the older [L] flag; it will probably work the same in this situation, though that depends a bit on your setup.
It is a good idea to start out with a 302 temporary redirection and only change that to a 301 permanent redirection later, once you are certain everything is correctly set up. That prevents caching issues while trying things out...
This rule will work likewise in the HTTP server's host configuration or inside a dynamic configuration file (".htaccess" file). Obviously the rewriting module needs to be loaded inside the HTTP server and enabled in the HTTP host. In case you use a dynamic configuration file, you need to take care that its interpretation is enabled at all in the host configuration and that it is located in the host's DOCUMENT_ROOT folder.
And a general remark: you should always prefer to place such rules in the HTTP server's host configuration instead of using dynamic configuration files (".htaccess"). Those dynamic configuration files add complexity, are often a cause of unexpected behavior, are hard to debug, and they really slow down the HTTP server. They are only provided as a last option for situations where you do not have access to the real HTTP server's host configuration (read: really cheap service providers) or for applications insisting on writing their own rules (which is an obvious security nightmare).
RewriteCond %{REQUEST_URI} !/my-new-url\.html
RewriteRule ^ /my-new-url.html [R=301]
There are a few potential issues with this, particularly since you hint in a comment that you are perhaps using a front-controller to "route" the URL.
This redirect satisfies the conditions outlined in the question, but does assume that you have no other rewrites, have an essentially "static site" and are not linking to any static resources.
You are missing an L (last) flag, so processing will continue through the file and possibly be rewritten if you have later rewrites.
If you are rewriting the URL to a front-controller in order to route the URL (as you suggest in comments) then this redirect will break, as it will redirect away from the front-controller. You need to only redirect direct requests, ie. when the REDIRECT_STATUS environment variable is empty.
If you are linking to any static resources in the same file space then these will also be redirected. You need to create an exception for any static resources you are using, either by file extension (eg. (css|js|jpg|png)) or by location (eg. /static).
So, try the following instead:
RewriteCond %{ENV:REDIRECT_STATUS} ^$
RewriteCond %{REQUEST_URI} !\.(js|css|jpg|png)$
RewriteRule !^my-custom-url$ /my-custom-url [R=302,L]
You don't need a separate condition to implement the exception for the URL you are redirecting to. It is more efficient to do this directly in the RewriteRule pattern.
The first condition ensures we are only redirecting direct requests and not rewritten requests to your front-controller.
The second condition avoids any static resources also being redirected. You could alternatively check the filesystem path if all your resources are stored under a common root. Or, as a last resort, implement filesystem checks (ie. RewriteCond %{REQUEST_FILENAME} !-f) if your static resources are too varied - but note that this is less efficient.
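The location-based variant mentioned above might look like this, assuming (hypothetically) that all static resources live under /static/:

```apache
RewriteEngine On
# Only act on direct requests, not on internally rewritten ones.
RewriteCond %{ENV:REDIRECT_STATUS} ^$
# Skip everything under the assumed /static/ directory.
RewriteCond %{REQUEST_URI} !^/static/
RewriteRule !^my-custom-url$ /my-custom-url [R=302,L]
```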
You will need to clear your browser cache before testing, since any earlier (erroneous) 301s are cached persistently by the browser.
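The question also asks about excluding more than one path. As a sketch under the same assumptions as the rule above, and using the question's placeholder names path-one and path-two, "not path-one and not path-two" can be expressed with an alternation inside the negated pattern:

```apache
RewriteEngine On
RewriteCond %{ENV:REDIRECT_STATUS} ^$
RewriteCond %{REQUEST_URI} !\.(js|css|jpg|png)$
# Redirect everything except /path-one and /path-two to /path-one.
RewriteRule !^(path-one|path-two)$ /path-one [R=302,L]
```

Stacking several negated RewriteCond lines would work as well, since consecutive conditions are combined with AND by default.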

Does REQUEST_URI hide or ignore some filenames in .htaccess?

I'm having some difficulty with a super simple htaccess redirect.
All I want to do is rewrite absolutely everything, except a couple files.
htaccess looks like this:
RewriteEngine On
RewriteCond %{REQUEST_URI} !sitemap
RewriteCond %{REQUEST_URI} !robots
RewriteRule ^(.*)$ http://example.com/$1 [L,R=301]
The part that works is that everything gets redirected to new domain as it should be. And I can also access robots.txt without being forwarded, but not with sitemap.xml. If I try to go to sitemap.xml, the domain forwards along anyway and opens the sitemap file on the new domain.
I have this exact same issue when trying to "ignore" index.html. I can ignore robots, I can ignore alternate html or php files, but if I want to ignore index.html, the regex fails.
Since I can't actually SEE what is in the REQUEST_URI variable, my guess is that somehow index.html and sitemap.xml are some kind of "special" files that don't end up in REQUEST_URI? I know this because of a stupid test. If I choose to ignore index.html like this:
RewriteCond %{REQUEST_URI} !index.html
Then if I type example.com/index.html I will be forwarded. But if I just type example.com/ the ignore actually works and it shows the content of index.html without forwarding!
How is it that when I choose to ignore the regex "index.html", it only works when "index.html" is not actually typed in the address bar!?!
And it gets even weirder! Should I type something like example.com/index.html?option=value, then the ignore rule works and I do NOT get forwarded when there are attributes like this. But index.html by itself doesn't work, and then just having the slash root, the rule works again.
I'm completely confused! Why does it seem like REQUEST_URI is not able to see some filenames like index.html and sitemap.xml? I've been Googling for 2 days and not only can I not find out if this is true, but I can't seem to find any websites which actually give examples of what these htaccess server variables actually contain!
Thanks!
my guess is that somehow index.html and sitemap.xml are some kind of "special" files that don't end up in REQUEST_URI?
This is not true. There is no such special treatment of any requested URL. The REQUEST_URI server variable contains the URL-path (only) of the request. This notably excludes the scheme + hostname and any query string (which are available in their own variables).
However, if there are any other mod_rewrite directives that precede this (including the server config) that rewrite the URL then the REQUEST_URI server variable is also updated to reflect the rewritten URL.
index.html (Directory Index)
index.html is possibly a special case. Although, if you are explicitly requesting index.html as part of the URL itself (as you appear to be doing) then this does not apply.
If, on the other hand, you are requesting a directory, eg. http://example.com/subdir/ and relying on mod_dir issuing an internal subrequest for the directory index (ie. index.html), then the REQUEST_URI variable may or may not contain index.html - depending on the version of Apache (2.2 vs 2.4) you are on. On Apache 2.2 mod_dir executes first, so you would need to check for /subdir/index.html. However, on Apache 2.4, mod_rewrite executes first, so you simply check for the requested URL: /subdir/. It's safer to check for both, particularly if you have other rewrites and there is possibility of a second pass through the rewrite engine.
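Checking for both forms could look like the sketch below, which adapts the rule from the question. Anchoring on the exact file names sitemap.xml and robots.txt is an assumption about what is meant to be excluded, and a 302 is used while testing.

```apache
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/sitemap\.xml$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
# Matches both "/" and "/index.html", so it covers a directory request
# as well as an explicit request (or mod_dir subrequest) for the index file.
RewriteCond %{REQUEST_URI} !^/(index\.html)?$
RewriteRule ^(.*)$ http://example.com/$1 [R=302,L]
```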
Caching problems
However, the most probable cause in this scenario is simply a caching issue. If the 301 redirect has previously been in place without these exceptions then it's possible these redirections have been cached by the browser. 301 (permanent) redirects are cached persistently by the browser and can cause issues with testing (as well as your users that also have these redirects cached - there is little you can do about that unfortunately).
RewriteCond %{REQUEST_URI} !(sitemap|index|alternate|alt) [NC]
RewriteRule .* alternate.html [R,L]
The example you presented in comments further suggests a caching issue, since you are now getting different results for sitemap than those posted in your question. (It appears to be working as intended in your second example).
Examining Apache server variables
@zzzaaabbb mentioned one method to examine the value of the Apache server variable. (Note that the Apache server variable REQUEST_URI is different to the PHP variable of the same name.) You can also assign the value of an Apache server variable to an environment variable, which is then readable in your application code.
For example:
RewriteRule ^ - [E=APACHE_REQUEST_URI:%{REQUEST_URI}]
You can then examine the value of the APACHE_REQUEST_URI environment variable in your server-side code. Note that if you have any other rewrites that cause the rewriting process to start over, you could get multiple env vars, each prefixed with REDIRECT_.
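If reading the variable from application code is inconvenient, one possible alternative is to echo it back in a response header and inspect it in the browser's developer tools. This is only a sketch: it assumes mod_headers is enabled, and the header name X-Debug-Request-Uri is made up for illustration.

```apache
RewriteEngine On
# Copy the Apache server variable into an environment variable.
RewriteRule ^ - [E=APACHE_REQUEST_URI:%{REQUEST_URI}]
# Expose it as a response header; "always" means it is also emitted
# on non-2xx responses. Remove this again after debugging.
Header always set X-Debug-Request-Uri "%{APACHE_REQUEST_URI}e"
```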
With the index.html problem, you probably just need to escape the dot (index\.html). You are in the regex pattern-matching area on the right-hand side of RewriteCond. With the un-escaped dot in there, there would need to be a character at that spot in the request, to match, and there isn't, so you're not matching and are getting the unwanted forward.
For the sitemap not matching problem, you could check to see what REQUEST_URI actually contains, by just creating an empty dummy file (to avoid 404 throwing) and then do a redirect at top of .htaccess. Then, in browser URL, type in anything you want to see the REQUEST_URI for -- it will show in address bar.
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^ /test.php?var=%{REQUEST_URI} [NE,R,L]
Credit MrWhite with that easy test method.
Hopefully that will show that sitemap in URL ends up as something else, so will at least partially explain why it's not pattern-matching and preventing redirect, when it should be pattern-matching and preventing redirect.
I would also test by being sure that the server isn't stepping in front of things with custom 301 directive that for whatever reason makes sitemap behave unexpectedly. Put this at the top of your .htaccess for that test.
ErrorDocument 301 default

Domain handling with a controller

I'm running an MVC-based application on my main site, and I have 2 other domains (for the sake of an example, www.a.com & www.b.com).
I'd like to be able to handle all a.com's requests with mainsite.com/a/ and similarly b.com with mainsite.com/b/
However I do not want the url to be redirected/changed in the browser.
I've been trying with mod_rewrite, however it seems to be clashing with my existing .htaccess rules set for mainsite.com
this is my existing .htaccess
Could anyone please suggest the best way to do this?
In the existing .htaccess, I don't see any rules redirecting the domains a.com or b.com. To do that is pretty straightforward, though.
A condition for selecting the proper host www.a.com or a.com
RewriteCond %{HTTP_HOST} ^(?:www\.)?a\.com$
prevent an endless loop
RewriteCond %{REQUEST_URI} !^/a/
and do the actual rewrite
RewriteRule ^ /a%{REQUEST_URI} [L]
As long as you don't use the R flag, the URL shouldn't change in the browser.
The rule for host b.com is analogous.
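Spelled out, the analogous block for b.com would presumably be the following, mirroring the three lines above with /b/ as the target directory from the question:

```apache
# Select requests for www.b.com or b.com
RewriteCond %{HTTP_HOST} ^(?:www\.)?b\.com$
# Prevent an endless loop once the path already starts with /b/
RewriteCond %{REQUEST_URI} !^/b/
# Internal rewrite (no R flag), so the browser URL stays unchanged
RewriteRule ^ /b%{REQUEST_URI} [L]
```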
Update:
Since you already have a very large .htaccess file, the performance impact shouldn't matter too much. If you want to know for sure, there's no substitute for measuring.
If you want to reduce the performance hit nevertheless, you have two options
Move the directives in the .htaccess file to your main config or virtual config file, see When (not) to use .htaccess files for an explanation.
Do some custom rewriting with PHP in your front controller. This depends on the framework or routing mechanism you use, of course.

Apache Mod_Rewrite Scenario

I was wondering how I would do a complex mod_rewrite. Below is basically how I want it done.
If the user goes to:
-http://files.stuff.example.txt.r.site.com/doc.txt
Then the server would rewrite the url to:
-http://r.site.com/index.php?type=txt&username=example&dir=files.stuff&file=doc.txt
Better picture:
-http://[dir3-dir2-dir1].[username].[type].r.site.com/[file]
Rewrites to:
-http://r.site.com/index.php?type=[type]&username=[username]&dir=[dir3.dir2.dir1]&file=[file]
I created a colour coded image to clearly show what I mean:
(can't embed images) look here:
http://i.stack.imgur.com/24H8j.png
The first subdomains are a directory structure (shown in red), so the amount of subdomains can change.
I hope someone can provide me with a solution. Either using mod_rewrite or maybe another method. Thanks.
Provided that you have configured your DNS so that requested URL hits server where your application is (maybe wildcard DNS on your domain: *.site.com -> 123.45.67.89, if supported by your DNS server/hosting), you can create more or less complicated rewrite rule. I'd do it this way:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^(.*)\.r\.site\.com$
RewriteCond $1 !^index\.php
RewriteRule (.*) index.php?subdomain_part=%1&file_part=$1
So in index.php you get $_GET['subdomain_part'] and $_GET['file_part'], which you can parse further to extract parameters according to your convention.
Of course, you can write a more complicated regex to get the URL parts extracted by mod_rewrite (I'm not such a regex expert myself). However, doing the parsing in PHP would be much easier, and you can do better error handling (e.g. if the URL is not formed properly).
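For reference, a regex that does the whole split in mod_rewrite might look like the sketch below. It assumes the host always ends in [username].[type].r.site.com with at least one directory label in front, and that the rules live in an .htaccess file in the document root; the parameter names are the ones from the question.

```apache
RewriteEngine On
# Avoid rewriting the already rewritten request again.
RewriteCond %{REQUEST_URI} !^/index\.php$
# %1 = directory labels (e.g. files.stuff), %2 = username, %3 = type
RewriteCond %{HTTP_HOST} ^(.+)\.([^.]+)\.([^.]+)\.r\.site\.com$ [NC]
RewriteRule ^(.*)$ /index.php?type=%3&username=%2&dir=%1&file=$1 [L,QSA]
```

The capturing condition is placed last on purpose, because the %N backreferences refer to the last matched RewriteCond.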

Apache mod_rewrite not doing anything (?)

I'm having some trouble with Apache's mod_rewrite. One of the things I'm trying to get it to do is hide some of my implementation details, so that, for example, the user sees the URL http://www.mysite.com/login but Apache responds with the page at http://www.mysite.com/doc_root/login.php instead (preferably without showing the user that it's a PHP file or the directory structure). Here's what I have in my .htaccess file:
RewriteEngine on
RewriteCond %{HTTP_HOST} ^(www.)?mysite.com*
RewriteRule ^/(\w+) /doc_root/$1.php [L]
#Redirect http://www.mysite.com to the login page
RewriteRule ^/?$ https://www.mysite.com/doc_root/login.php
But when I go to http://www.mysite.com/login, I get a 404 error even though the page exists. I clearly don't have a great understanding of how the mod_rewrite conditionals and rules work, so can anyone please tell me what I'm doing wrong? Thanks.
Take doc_root out of all the stuff you have it in. That will give you the result you're asking for. However I'm not sure if it's desired or not. How are you going to force someone to login if they manually type http://www.mysite.com/index.php?
Also if you're trying to force all traffic to SSL it's better to use a second VirtualHost and Redirect instead of mod_rewrite. Those are all questions probably better suited for ServerFault
Unless your site has a bunch of different domain names, and you only want mysite.com to do the rewriting, you don't need the RewriteCond. (Potential problem. Apache likes to dick around with the domain name unless you set UseCanonicalName off. If the name isn't what it's expecting, the rewrite won't happen.)
In RewriteCond (and RewriteRule) patterns, . matches any character. Add a backslash before them. (Minor bug. Shouldn't cause rewrites to fail, but they would match stuff like "mysite-com" as well.)
mod_rewrite is actually a URL-to-filename filter. Though it is often used to rewrite URLs to other URLs, sometimes it will misbehave if what you're rewriting to is a URL and it can't tell. (Especially if what it's rewriting to would be an alias, or would otherwise not translate directly to a real filename.) If you add a [PT] flag onto your rule, though, it will consider the rewritten thing a URL and pass it along to the other filters (including the ones that turn URLs into filenames).
Do you really need "/doc_root"? The document root should already be set up in Apache using the DocumentRoot directive, and shouldn't need to be part of the URL unless you have multiple apps on the same domain (in which case it's the app root; the document root doesn't change).
UPDATE:
Another thing I just thought about: Rewrite rules work differently in .htaccess files. Apache likes to strip off the leading slash. So you will probably want to get rid of the first slash in your patterns, or at least make it optional (^/?login instead of ^/login).
^/?(\w+) will match /doc_root/login.php, and cause a rewrite to /doc_root/doc_root.php. You should probably have a $ at the end of your pattern.
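Pulling the suggestions above together, the rules might end up looking like the following sketch. It assumes the .htaccess file is in the document root and that the /doc_root directory from the question is kept. Redirecting the root to /login rather than to /doc_root/login.php is a choice made here to keep the directory hidden, and the 302 is a placeholder while testing.

```apache
RewriteEngine On
# Send the bare site root to the login page (external redirect to HTTPS).
RewriteRule ^/?$ https://www.mysite.com/login [R=302,L]
# /login is served internally from /doc_root/login.php.
# The trailing $ keeps this from matching /doc_root/login.php itself.
RewriteRule ^/?(\w+)$ /doc_root/$1.php [L]
```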