htaccess to redirect dead links to 404 - apache

We recently migrated from one domain to another. We successfully redirected all valid URLs to their counterpart on the new site. However, there are quite a few links that were valid on the old domain that simply don't exist on the new domain. (e.g. pages/links that were outdated so we didn't migrate them)
For example, we had a blog component on the old domain that generated a lot of dynamic links like /blog/category/abc and /blog/tag/xyz. We no longer have this blog component on the new domain.
Using htaccess, what is the best way to make sure Google and other search engines are correctly aware that these pages/links no longer exist?

The correct HTTP status code to send is 410 Gone. To quote RFC 2616:
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval. If the server does not know, or has no facility to determine, whether or not the condition is permanent, the status code 404 (Not Found) SHOULD be used instead. This response is cacheable unless indicated otherwise.
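As a minimal sketch of how this could look in .htaccess (assuming mod_alias is enabled and that the dead URLs all live under the /blog/category/ and /blog/tag/ prefixes mentioned in the question), a single pattern can answer 410 for the whole retired section:

```apache
# Answer "410 Gone" for the retired blog listing URLs.
# RedirectMatch comes from mod_alias; with the "gone" status no target URL is given.
RedirectMatch gone "^/blog/(category|tag)/"
```

If mod_rewrite is already in use, `RewriteRule ^blog/(category|tag)/ - [G]` should have the same effect. URLs that match neither pattern still fall through to the normal 404 handling.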

Related

removing cookies on another domain using mod-rewrite and apache

I have built a cookie consent module that is used on many sites, all using the same server architecture, on the same cluster. Visitors of these sites can administer their cookie settings (e.g. no advertising cookies, but allow analytics cookies) on a central domain that keeps track of the user preferences (and the sites that were visited).
When they change their settings, all sites that the visitor has been to that are using my module (the list is kept in a cookie) are contacted by loading them with a parameter in hidden iframes. I tried the same with images.
On these sites a rewrite rule is in place that detects that parameter and then retracts the cookie (set the date in the past) and redirects to a page on the module site (or an image on the module site).
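A rule of this kind might look roughly like the sketch below. The parameter name `retract_consent`, the cookie name `cookie_consent`, and both host names are hypothetical placeholders, and the negative lifetime in the CO flag is assumed to produce an expiry date in the past (worth verifying against the Apache version in use):

```apache
RewriteEngine On
# Hypothetical trigger: ?retract_consent=1 appended by the central consent domain
RewriteCond %{QUERY_STRING} (^|&)retract_consent=1(&|$)
# CO=name:value:domain:lifetime-in-minutes:path (negative lifetime assumed to expire the cookie)
# The trailing "?" drops the original query string from the redirect target
RewriteRule ^ http://consent.example.com/retracted? [CO=cookie_consent:retracted:www.example.org:-1:/,R=302,L]
```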
This scheme works in all browsers except IE, which needs a P3P header (the reason it does not work for images is probably similar).
I also tried loading a non-existent image on the source domain (that is, the domain that is using the module) through an image tag, obviously resulting in a 404. This works on all browsers, except Safari, which doesn't set cookies on 404's (at least, that is my conclusion).
My question is, how would it be possible to retract the cookie consent cookie on the connected domains, given that all I can change are the rewrite rules?
I hope that I have explained the problem well enough for you guys to give an answer, and that a solution is possible...
I am still not able to resolve this question, but when looking at it the other way around there is a solution. Using JSONP (for an example, see: Basic example of using .ajax() with JSONP?), the client domain can load information from the master server and compare that to local information.
Based on that, the client site can retract the cookie (or even replace it) and force a reload which will trigger the rewrite rules...
A drawback of this solution is that it will hit the server for every pageview, and in my case, that's a real problem. Only testing that every x minutes or so (by setting a temporary cookie) would provide a solution.
Another, even simpler solution would be to expire all the cookies on the client site every x hours. This will force a revisit of the main domain as well.

404 vs 403 when directory index is missing

This is mostly a philosophical question about the best way to interpret the HTTP spec. Should a directory with no directory index (e.g. index.html) return 404 or 403? (403 is the default in Apache.)
For example, suppose the following URLs exist and are accessible:
http://example.com/files/file_1/
http://example.com/files/file_2/
But there's nothing at:
http://example.com/files/
(Assume we're using 301s to force trailing slashes for all URLs.)
I think several things should be taken into account:
By default, Apache returns 403 in this scenario. That's significant to me. They've thought about this stuff, and they made the decision to use 403.
According to W3C, 403 means "The server understood the request, but is refusing to fulfill it." I take that to mean you should return 403 if the URL is meaningful but nonetheless forbidden.
403 might result in information disclosure if the client correctly guesses that the URL maps to a real directory on disk.
http://example.com/files/ isn't a resource, and the fact that it internally maps to a directory shouldn't be relevant to the status code.
If you interpret the URL scheme as defining a directory structure from the client's perspective, the internal implementation is still irrelevant, but perhaps the outward appearance should indeed have some bearing on the status codes. Maybe, even if you created the same URL structure without using directories internally, you should still use 403s, because it's about the client's perception of a directory structure.
In the balance, what do you think is the best approach? Should we just say "a resource is a resource, and if it doesn't exist, it's a 404?" Or should we say, "if it has slashes, it looks like a directory to the client, and therefore it's a 403 if there's no index?"
If you're in the 403 camp, do you think you should go out of your way to return 403s even if the internal implementation doesn't use directories? Suppose, for example, that you have a dynamic web app with this URL: http://example.com/users/joe, which maps to some code that generates the profile page for Joe. Assuming you don't write something that lists all users, should http://example.com/users/ return 403? (Many if not all web frameworks return 404 in this case.)
The first step to answering this is to refer to RFC 2616: HTTP/1.1. Specifically the sections talking about 403 Forbidden and 404 Not Found.
10.4.4 403 Forbidden
The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead.
10.4.5 404 Not Found
The server has not found anything matching the Request-URI. No indication is given of whether the condition is temporary or permanent. The 410 (Gone) status code SHOULD be used if the server knows, through some internally configurable mechanism, that an old resource is permanently unavailable and has no forwarding address. This status code is commonly used when the server does not wish to reveal exactly why the request has been refused, or when no other response is applicable.
My interpretation of this is that 404 is the more general error code that just says "there's nothing there". 403 says "there's nothing there, don't try again!".
One reason why Apache might return 403 on directories without explicit index files is that auto-indexing (i.e. listing all files in it) is disabled (a.k.a "forbidden"). In that case saying "listing all files in this directory is forbidden" makes more sense than saying "there is no directory".
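To illustrate the configuration side of this (a sketch only, assuming Apache 2.4 with mod_rewrite, rules placed in .htaccess with the necessary AllowOverride permissions, and index.html as the only index file name), auto-indexing can be switched off and index-less directories answered with 404 instead of the default 403:

```apache
# Disable auto-generated listings; on its own this yields 403 for index-less directories
Options -Indexes

RewriteEngine On
# The request maps to a real directory ...
RewriteCond %{REQUEST_FILENAME} -d
# ... that has no index.html (assumed to be the only DirectoryIndex name)
RewriteCond %{REQUEST_FILENAME}/index.html !-f
# Respond 404 instead of letting the server return 403
RewriteRule ^ - [R=404,L]
```

Leaving out the rewrite block keeps Apache's default behaviour, so the choice between the two status codes comes down to these few lines.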
Another argument why 404 is preferable: Google Webmaster Tools.
Indeed, for a 404, Google Webmaster Tools displays the referer (allowing you to clean up the bad link to the directory), whereas for a 403 it doesn't.

Url rewrite without redirect in ASP.NET

We have a CMS system that creates long URLs with many parameters. We would like to change the way they are presented, to make them more friendly.
Since we have many sites already built on this CMS, it's a little difficult to rewrite the CMS to create friendly URLs (although it's a method we're considering if no alternative is found). We're looking for a method whereby, when a user clicks on a long URL, the URL changes into a friendly one in the browser without using Response.Redirect().
In WordPress such a method exists (I'm not sure whether it's done in code or in Apache), and I'm wondering if it could be done in ASP.NET 2.0 too.
Another thing to take into consideration is that the change between the urls has to be done by accessing the DB.
UPDATE: We're using IIS6
If you're using IIS 7, the easiest way to do this is to use the URL Rewrite Module. According to that link you can:
Define powerful rules to transform complex URLs into simple and consistent Web addresses
URL Rewrite allows Web administrators to easily build powerful rules using rewrite providers written in .NET, regular expression pattern matching, and wildcard mapping to examine information in both URLs and other HTTP headers and IIS server variables. Rules can be written to generate URLs that can be easier for users to remember, simple for search engines to index, and allow URLs to follow a consistent and canonical host name format. URL Rewrite further simplifies the rule creation process with support for content rewriting, rule templates, rewrite maps, rule validation, and import of existing mod_rewrite rules.
Otherwise you will have to use the techniques described by Andrew M or use Response.Redirect. In any case I'm fairly certain all of these methods result in an HTTP 301 response. I mention this because it's not clear why you don't want to use Response.Redirect. Is this a coding constraint?
Update
Since you're using IIS 6 you'll need to use another method for URL rewriting.
This article by Scott Mitchell describes in detail how to do it.
Implementing URL Rewriting
URL rewriting can be implemented either with ISAPI filters at the IIS Web server level, or with either HTTP modules or HTTP handlers at the ASP.NET level. This article focuses on implementing URL rewriting with ASP.NET, so we won't be delving into the specifics of implementing URL rewriting with ISAPI filters. There are, however, numerous third-party ISAPI filters available for URL rewriting, such as:
ISAPI Rewrite
IIS Rewrite
PageXChanger
And many others!
The article goes on to describe how to implement HTTP Modules or Handlers.
Performance
A redirect response (HTTP 301) usually contains only a small amount of data (< 1K), so I would be surprised if it was noticeable.
For example, the difference in the page load of these URLs isn't noticeable:
"https://stackoverflow.com/q/4144940/119477"
"https://stackoverflow.com/questions/4144940/url-rewrite-without-redirect-in-asp-net"
(I have confirmed using ieHTTPHeaders that HTTP 301 is what is used for the change in URL)
Page Rank
This is what Google's Webmaster Central site has to say about 301:
If you need to change the URL of a page as it is shown in search engine results, we recommend that you use a server-side 301 redirect. This is the best way to ensure that users and search engines are directed to the correct page.
In response to extra comments, I think what you need to do is bite the bullet and modify the CMS to write the new links out into the pages. You've already said that you have normal URL rewriting which can translate the new URLs to old when they're incoming. If you were to also write out the new URLs in your markup then everything should simply work.
From an SEO point of view, if the pages your CMS produces have the old links, then that's what the search engines will see and index. There's nothing much you can do about that, JavaScript, redirect or otherwise (although a permanent redirect would get you a little way there).
I also think that what you must have been seeing in WordPress was probably a redirect. Without finding an example I can't be sure, though. The thing to do would be to use Fiddler or another HTTP debugger to see what happens when you follow one of these links.
For perfect SEO, once you've got the new URLs working outbound and inbound, what you'd want to do is decide that your new URLs are the definitive URLs. Make the old URLs redirect to the new URLs, and/or use a canonical link tag back to the new URL from the old one.
I'm not certain what you're saying here, but basically a page the user is already reading contains an old, long, URL, and you'd like it to change to the new, short URL, dynamically on the client side, before the browser requests the page from the server?
The only way I think this could be done would be to use JavaScript to change the URL in response to onclick or document.ready, but it would be pointless. You'd need to know the new short URL for the JavaScript to rewrite to, and if you knew that, why not simply render that URL into the link in the first place?
It sounds more like you want URL routing, as included in ASP.NET 4 and 3.5?
Standard URL rewriting modifies the incoming request object on the server, so the client browser submits the new URL, and the downstream page handlers see the old URL. I believe the routing features extend this concept to the outgoing response too, rewriting old URLs in the response page into new URLs before they're sent to the client.
Scott Gu covers the subject here:
http://weblogs.asp.net/scottgu/archive/2009/10/13/url-routing-with-asp-net-4-web-forms-vs-2010-and-net-4-0-series.aspx
Scott Gu also has an older post on normal URL rewriting outlining several different ways to do it. Perhaps you could extend this concept by hooking into Application_PreSendRequestContent and manually modifying all the href values in the response stream, but I wouldn't fancy it myself.
http://weblogs.asp.net/scottgu/archive/2007/02/26/tip-trick-url-rewriting-with-asp-net.aspx

Mod_rewrite - How to tell Google to dynamically delete pages from their index after 7 days

Search engines like to crawl and index webpages or URLs, but what if your webpages/URLs have expired content and you do not want them to be indexed after so many days?
Can you put an expiration in the URL and have mod_rewrite 301 redirect pages after a given expiration date?
Or maybe a cron job to add a 301 redirect header to all expired pages?
Just have the 'expired' pages return a 404? I am pretty sure that when Google encounters a 404, it will remove the page.
Not 404 or 301, but 410 Gone. This is the appropriate HTTP response:
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval. If the server does not know, or has no facility to determine, whether or not the condition is permanent, the status code 404 (Not Found) SHOULD be used instead. This response is cacheable unless indicated otherwise.
The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed. Such an event is common for limited-time, promotional services and for resources belonging to individuals no longer working at the server's site. It is not necessary to mark all permanently unavailable resources as "gone" or to keep the mark for any length of time -- that is left to the discretion of the server owner.
How you provide this response is open to discussion, however. There are many ways.
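As one possible sketch (assuming Apache 2.4 with mod_rewrite in .htaccess; the /promo/winter-sale/ path and the cutoff date are hypothetical), a time condition can switch a set of URLs to 410 once a fixed date has passed:

```apache
RewriteEngine On
# Current server date as YYYYMMDD; ">" is a lexicographic comparison against the cutoff
RewriteCond %{TIME_YEAR}%{TIME_MON}%{TIME_DAY} >20240131
# From 1 Feb 2024 on, answer 410 Gone for the expired pages
RewriteRule ^promo/winter-sale/ - [G]
```

The comparison uses the server's local clock, and each expiring group of pages would need its own condition and rule, so for many pages with individual expiry dates an application-level check is probably easier to maintain.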

Multiple domain names

I have a customer that has been on the web for some time. They bought a domain name that describes their product, and then a second, more up-to-date one. Now the company has evolved into something more general and has bought a third domain, something like:
vegetables.com (2005)
ecolo-vegetables.com (2006)
good-health-eating.com (2009)
Here are my questions:
What is the best way to get all those domains under the new name?
The new name is unknown to search engines and other linking sites. I don't want to lose the ranking, so what is the best way to keep it?
Can I point URLs to the "best" ranked domain?
What happens to the backlinks? Which domain do they link to?
The new domain has a "-" in the name, which is really good for SEO but a little unnatural to type. Should I get the no-dash version too?
N.B. It makes sense to redirect all the domains to a single one, but would you choose the oldest (with mod_rewrite) or the newest, which has no life under its belt (so it doesn't exist anywhere in the search engines)?
Another P.S.: some will tell me to redirect with .htaccess, but should I instead change the DNS to point to the latest .com? Which solution is better?
Are all three sites "Different" or do they point to the same website/content?
Use 301 Redirects to redirect your old domain names to the new domain names. If all domains are pointing to the same website, make sure you also use the Canonical Tag on all your pages.
If you 301 redirect from the old domain names/URLs, your rankings will be transferred to your new domain/pages (the only exception may be any extra points you get from keywords embedded in your old domain names).
You should point old URLs to your "new" URLs/domain. Rankings and link juice should/will be transferred to the new URLs/domain.
Ideally all your backlinks should update their links to the new domain, but it doesn't really matter. If the old domains are 301 redirecting to the new domain anyway, pointing to the old domain is just like pointing to the new domain.
Definitely get the no-dash version of the domain as well and just have it 301 redirect to the actual domain you want to target.
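A rough sketch of such a redirect in .htaccess, assuming mod_rewrite is available, all three domains are served from the same document root, and the new site is reached over plain http (the host names are the examples from the question):

```apache
RewriteEngine On
# Match the two older host names, with or without "www."
RewriteCond %{HTTP_HOST} ^(www\.)?(ecolo-)?vegetables\.com$ [NC]
# Permanently redirect to the same path on the new domain (http scheme is an assumption)
RewriteRule ^(.*)$ http://good-health-eating.com/$1 [R=301,L]
```

A no-dash variant of the new domain could be covered by adding its host name to the condition in the same way.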
I'll give this a go.
1. You could possibly have redirects or just allow the DNS of the domain to point to the new (desired) website.
2. It's not hard to understand SEO (Search Engine Optimization) nowadays - ensuring you have the correct meta tags and other SE info will give you a big helping hand. There isn't any way of transferring SE ranks.
3. That's possible. You could have ABCDEF.COM at number 3 on google, but then set ABCDEF.COM to redirect to GHIJKL.COM.
4. If you set up redirects, and the new site has the same content as the old one, there is the possibility of setting up your DNS and your redirects so that each old page redirects to its new version on the new website.
( I don't think I worded that very well, hope you catch my drift )
5. Out of pure experience I'd say yes, get both. That way you can market to your customer audience as ABCDEF.com, but show to SEs as AB-CD-EF.COM.
Here is the best answer I got, from this link:
302 and 301 Redirects
When a request for a page or URL is made by a browser, agent or spider, the web server where the page is hosted checks a file called '.htaccess'. This file contains instructions on how to handle specific requests and also plays a key role in security. The '.htaccess' file can be modified so that it instructs browsers, agents or spiders that the page has either temporarily moved (302 redirect) or permanently moved (301 redirect). It is usually possible to implement this redirect without messing with the '.htaccess' file directly, using your web host's control panel instead.
From a search engine perspective, 301 redirects are the only acceptable way to redirect URLs. In the case of moved pages, search engines will index only the new URL, but will transfer link popularity from the old URL to the new one so that search engine rankings are not affected. The same behavior occurs when additional domains are set to point to the main domain through a 301 redirect.
And the last word, from this link, which just confirms what I now know:
First off, ensure you're using "301 redirects" rather than "302 redirects" or the link juice (PageRank) won't transfer to the destination URL. You can verify that 301s (not 302s) are in place by using a "server header checker" like this one. Only a 301 tells engines the previous URL has moved permanently and thus forwards the page's link equity to the new location.