Mod_rewrite - How to tell Google to dynamically delete pages from their index after 7 days - apache

Search engines like to crawl and index webpages or URLs, but what if your webpages/URLs have expired content and you do not want them to be indexed after so many days?
Can you put an expiration in the URL and have mod_rewrite 301 redirect pages after a given expiration date?
Or maybe a cron job to add a 301 redirect header to all expired pages?

Just have the 'expired' pages return a 404? I am pretty sure that when Google encounters a 404, it will remove the page.

Not 404 or 301, but 410 Gone. This is the appropriate HTTP response:
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval. If the server does not know, or has no facility to determine, whether or not the condition is permanent, the status code 404 (Not Found) SHOULD be used instead. This response is cacheable unless indicated otherwise.
The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed. Such an event is common for limited-time, promotional services and for resources belonging to individuals no longer working at the server's site. It is not necessary to mark all permanently unavailable resources as "gone" or to keep the mark for any length of time -- that is left to the discretion of the server owner.
How you provide this response is open to discussion, however. There are many ways.
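One of those ways, sketched in Python purely as an illustration: decide the status code from the page's age at request time, so that no cron job or URL rewriting is needed. The function name `status_for`, the `published_at` field, and the default of 7 days are assumptions made for this example; they are not taken from the question's actual setup.

```python
from datetime import datetime, timedelta, timezone

OK = 200
GONE = 410


def status_for(published_at, now, ttl_days=7):
    """Return 410 once the page is ttl_days old or older, otherwise 200.

    published_at and now are datetimes; ttl_days is a hypothetical
    per-site setting (7 matches the question's "after 7 days").
    """
    expires_at = published_at + timedelta(days=ttl_days)
    return GONE if now >= expires_at else OK


if __name__ == "__main__":
    published = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(status_for(published, published + timedelta(days=3)))
    print(status_for(published, published + timedelta(days=8)))
```

Whatever serves the page would call this before rendering and send the returned status; how quickly a search engine acts on the 410 is up to the search engine.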

Related

ending a passbook program - HTTP response to incoming passbook requests?

We attempted a passbook program, but it never made it out of beta. A few passes are still out there and keep phoning home (and throwing errors because the passes are out of sync with existing data). My plan is to 404 any incoming requests, but I'm not sure that is the best way to handle existing passes. Any other ideas, or is 404 the right solution?
There are a few options:
Return an updated pass that has a blank web service url
Return an appropriate error
Remove the DNS entry of the subdomain
Update the web service url
Any of the fields in the pass can be updated, including the web service url. Removing the url will prevent further requests for updates. This is potentially the most effective option, but it would require a bit of development to return the updated pass, and it would need to be maintained until all passes have been "disabled."
Return an appropriate error code
It may be easier to simply return an error code. This could be done through the web server configuration preventing the requests from being processed by your application (and presumably stop the errors in the application). This would allow you to remove the code altogether from your application.
The Passbook Web Service Reference indicates that Passbook will eventually give up when receiving persistent errors.
If a request fails—for example, due to a network connectivity issue—Passbook tries again several times after waiting a period of time. Each time it tries again, it waits longer. If the request continues to fail, it eventually gives up.
The documentation also indicates that standard HTTP status codes should be used in the response from the call to Getting the Latest Version of a Pass (and others).
Response
If request is authorized, return HTTP status 200 with a payload of the pass data.
If the request is not authorized, return HTTP status 401.
Otherwise, return the appropriate standard HTTP status.
Discussion
Support standard HTTP caching on this endpoint: check for the If-Modified-Since header and return HTTP status code 304 if the pass has not changed.
It sounds like the ending of the passbook program is permanent in which case 410 Gone would be an appropriate error code. (From RFC 2616).
410 Gone
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval. If the server does not know, or has no facility to determine, whether or not the condition is permanent, the status code 404 (Not Found) SHOULD be used instead. This response is cacheable unless indicated otherwise.
The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed. Such an event is common for limited-time, promotional services and for resources belonging to individuals no longer working at the server's site. It is not necessary to mark all permanently unavailable resources as "gone" or to keep the mark for any length of time -- that is left to the discretion of the server owner.
Remove subdomain DNS
If your web service url was set up on a separate subdomain (e.g. passbook.example.com) you can simply remove the DNS entry for the subdomain. The requests will never reach the server and Passbook will eventually give up.

htaccess to redirect dead links to 404

We recently migrated from one domain to another. We successfully redirected all valid URLs to their counterpart on the new site. However, there are quite a few links that were valid on the old domain that simply don't exist on the new domain. (e.g. pages/links that were outdated so we didn't migrate them)
For example, we had a blog component on the old domain that generated a lot of dynamic links like /blog/category/abc and /blog/tag/xyz. We no longer have this blog component on the new domain.
Using htaccess, what is the best way to make sure Google and other SE's are correctly aware that these pages/links no longer exist?
The correct http status code to send is the 410 Gone code. To quote RFC2616 (emphasis mine):
The requested resource is no longer available at the server and no
forwarding address is known. This condition is expected to be
considered permanent. Clients with link editing capabilities
SHOULD delete references to the Request-URI after user approval. If
the server does not know, or has no facility to determine, whether or
not the condition is permanent, the status code 404 (Not Found) SHOULD
be used instead. This response is cacheable unless indicated
otherwise.

Asp.net MVC + subdomain areas. How to handle HTTP HEAD request?

We have a website that is used to showcase our various products. The website uses MVC4 and subdomain areas.
product1.website.com
product2.website.com
We use the subdomain to determine which area to route the request.
Lately we have been getting http HEAD requests to our site using the IP only. Without the subdomain we can't know which area to send the request.
What should we do?
Send back a 404
Redirect to our most important area/product
Redirect to our company website
Why not redirect users to an overview page where they see a short list of the products? That way you can redirect them behind the scenes to wherever you want without them knowing, and it also helps when users make a typo in the URL: they are 'guided' to the right product and may even find other ones.
- a 404 usually makes people look elsewhere, since they think they have the wrong IP
- redirecting to the most important product may cause confusion when you change your major product (users tend to bookmark a lot of useless urls)
- redirecting to the company website is in my opinion the lesser of all evils, but users tend to get lost when redirected to a 'general' website.
example: you're looking for Windows 8 download and have the IP bookmarked
- 404 error: oh the page no longer exists
- main product: windows 9 is out but for some reason you still need windows 8: you spend more time looking for what you really need and probably find it elsewhere
- overview page: you see what you need in a list and if the list is short you quickly find it, otherwise a simple search reveals the item also.
So redirecting to an overview page is still the winning option, in my opinion.
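The routing decision described above can be sketched as follows, in Python rather than ASP.NET MVC, just to show the logic. The area names come from the question; `BASE_DOMAIN`, the `FALLBACK` name "overview", and the function `area_for_host` are assumptions for this example.

```python
import ipaddress

KNOWN_AREAS = {"product1", "product2"}  # subdomains named in the question
BASE_DOMAIN = "website.com"
FALLBACK = "overview"                   # hypothetical overview page/area


def area_for_host(host_header):
    """Pick an area from the Host header, or the overview page as fallback."""
    host = (host_header or "").strip().lower()

    # Drop an optional port: "[::1]:8080" or "host:8080".
    if host.startswith("["):
        host = host[1:].split("]", 1)[0]
    elif host.count(":") == 1:
        host = host.split(":", 1)[0]

    # A bare IP address carries no subdomain, so no area can be inferred.
    try:
        ipaddress.ip_address(host)
        return FALLBACK
    except ValueError:
        pass

    suffix = "." + BASE_DOMAIN
    if host.endswith(suffix):
        subdomain = host[: -len(suffix)]
        if subdomain in KNOWN_AREAS:
            return subdomain
    return FALLBACK


if __name__ == "__main__":
    for h in ("product1.website.com", "203.0.113.7", "typo.website.com"):
        print(h, "->", area_for_host(h))
```

Unknown subdomains and IP-only requests both land on the overview page, which covers the typo case as well as the HEAD requests from the question.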

How to track uniques in an Apache2 Access Log

Background:
I'm running a podcast and I'm interested in gathering statistics on the number of times an mp3 is downloaded. The files actually reside over on Amazon S3, I simply do a 301 from a path on my server just so I'm able to catch the request in my logs. The podcast RSS feed is managed by Wordpress.
Problem:
Using [IP + mp3 requested] isn't good enough to determine uniqueness. What if there are several people downloading behind a NAT?
Question 1:
Wordpress doesn't seem to store a cookie when one goes to the feed URL. What would I do to store a unique cookie for the user?
Question 2:
Is there a way - using Apache access logs only - I could log the person's cookie? I'm pretty sure iTunes & NetNewsReader support cookies (they use Safari). I'm not sure about <insert RSS reader of choice>, which might not, for them IP address may be all I have to go on.
You could add a parameter to the end of the url where you 301 redirect and make your link yourserver.com/podcastxyz?userid=someguid.
Now if you have a way of identifying the unique user before you 301 redirect you could add the same guid at the end for the same user.
If you are not familiar with how you could identify uniques you could do it by adding a cookie with the guid with a long expiry date. Whenever you load the page where the 301 redirect takes place check for the presence of the cookie and add the guid value stored in the cookie.
Tracking unique downloads from people who do not visit your website with a browser is impossible.
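A minimal sketch of the cookie-plus-guid idea, in Python for illustration only: reuse the guid from an existing cookie, or mint a new one and hand back a long-lived Set-Cookie value. The cookie name `userid` mirrors the query parameter in the example link; the one-year lifetime and the function name `tagged_link` are assumptions.

```python
import uuid
from http.cookies import SimpleCookie
from urllib.parse import urlencode

COOKIE_NAME = "userid"            # assumed; mirrors the ?userid= parameter
ONE_YEAR = 365 * 24 * 60 * 60     # assumed "long expiry", in seconds


def tagged_link(cookie_header, base_url):
    """Return (url_with_userid, set_cookie_value_or_None).

    cookie_header is the raw Cookie request header (may be None or empty).
    """
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")

    if COOKIE_NAME in cookies:
        # Returning visitor: reuse the stored guid, no new cookie needed.
        guid = cookies[COOKIE_NAME].value
        set_cookie = None
    else:
        # First visit: mint a guid and ask the client to keep it.
        guid = uuid.uuid4().hex
        fresh = SimpleCookie()
        fresh[COOKIE_NAME] = guid
        fresh[COOKIE_NAME]["max-age"] = ONE_YEAR
        fresh[COOKIE_NAME]["path"] = "/"
        set_cookie = fresh[COOKIE_NAME].OutputString()

    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode({"userid": guid}), set_cookie


if __name__ == "__main__":
    print(tagged_link("userid=abc123", "http://yourserver.com/podcastxyz"))
    print(tagged_link(None, "http://yourserver.com/podcastxyz"))
```

Because the guid ends up in the request line, it shows up in a standard access log. For clients that do send cookies, Apache's mod_log_config can also record a cookie value directly with the `%{userid}C` format string in a custom `LogFormat`.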

404 vs 403 when directory index is missing

This is mostly a philosophical question about the best way to interpret the HTTP spec. Should a directory with no directory index (e.g. index.html) return 404 or 403? (403 is the default in Apache.)
For example, suppose the following URLs exist and are accessible:
http://example.com/files/file_1/
http://example.com/files/file_2/
But there's nothing at:
http://example.com/files/
(Assume we're using 301s to force trailing slashes for all URLs.)
I think several things should be taken into account:
By default, Apache returns 403 in this scenario. That's significant to me. They've thought about this stuff, and they made the decision to use 403.
According to W3C, 403 means "The server understood the request, but is refusing to fulfill it." I take that to mean you should return 403 if the URL is meaningful but nonetheless forbidden.
403 might result in information disclosure if the client correctly guesses that the URL maps to a real directory on disk.
http://example.com/files/ isn't a resource, and the fact that it internally maps to a directory shouldn't be relevant to the status code.
If you interpret the URL scheme as defining a directory structure from the client's perspective, the internal implementation is still irrelevant, but perhaps the outward appearance should indeed have some bearing on the status codes. Maybe, even if you created the same URL structure without using directories internally, you should still use 403s, because it's about the client's perception of a directory structure.
In the balance, what do you think is the best approach? Should we just say "a resource is a resource, and if it doesn't exist, it's a 404?" Or should we say, "if it has slashes, it looks like a directory to the client, and therefore it's a 403 if there's no index?"
If you're in the 403 camp, do you think you should go out of your way to return 403s even if the internal implementation doesn't use directories? Suppose, for example, that you have a dynamic web app with this URL: http://example.com/users/joe, which maps to some code that generates the profile page for Joe. Assuming you don't write something that lists all users, should http://example.com/users/ return 403? (Many if not all web frameworks return 404 in this case.)
The first step to answering this is to refer to RFC 2616: HTTP/1.1. Specifically the sections talking about 403 Forbidden and 404 Not Found.
10.4.4 403 Forbidden
The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead.
10.4.5 404 Not Found
The server has not found anything matching the Request-URI. No indication is given of whether the condition is temporary or permanent. The 410 (Gone) status code SHOULD be used if the server knows, through some internally configurable mechanism, that an old resource is permanently unavailable and has no forwarding address. This status code is commonly used when the server does not wish to reveal exactly why the request has been refused, or when no other response is applicable.
My interpretation of this is that 404 is the more general error code that just says "there's nothing there". 403 says "there's nothing there, don't try again!".
One reason why Apache might return 403 on directories without explicit index files is that auto-indexing (i.e. listing all files in it) is disabled (a.k.a "forbidden"). In that case saying "listing all files in this directory is forbidden" makes more sense than saying "there is no directory".
Another argument why 404 is preferable: google webmaster tools.
Indeed, for a 404, Google Webmaster Tools displays the referrer (allowing you to clean up the bad link to the directory), whereas for a 403 it does not.