404 vs 403 when directory index is missing - apache

This is mostly a philosophical question about the best way to interpret the HTTP spec. Should a directory with no directory index (e.g. index.html) return 404 or 403? (403 is the default in Apache.)
For example, suppose the following URLs exist and are accessible:
http://example.com/files/file_1/
http://example.com/files/file_2/
But there's nothing at:
http://example.com/files/
(Assume we're using 301s to force trailing slashes for all URLs.)
I think several things should be taken into account:
By default, Apache returns 403 in this scenario. That's significant to me. They've thought about this stuff, and they made the decision to use 403.
According to W3C, 403 means "The server understood the request, but is refusing to fulfill it." I take that to mean you should return 403 if the URL is meaningful but nonetheless forbidden.
403 might result in information disclosure if the client correctly guesses that the URL maps to a real directory on disk.
http://example.com/files/ isn't a resource, and the fact that it internally maps to a directory shouldn't be relevant to the status code.
If you interpret the URL scheme as defining a directory structure from the client's perspective, the internal implementation is still irrelevant, but perhaps the outward appearance should indeed have some bearing on the status codes. Maybe, even if you created the same URL structure without using directories internally, you should still use 403s, because it's about the client's perception of a directory structure.
In the balance, what do you think is the best approach? Should we just say "a resource is a resource, and if it doesn't exist, it's a 404?" Or should we say, "if it has slashes, it looks like a directory to the client, and therefore it's a 403 if there's no index?"
If you're in the 403 camp, do you think you should go out of your way to return 403s even if the internal implementation doesn't use directories? Suppose, for example, that you have a dynamic web app with this URL: http://example.com/users/joe, which maps to some code that generates the profile page for Joe. Assuming you don't write something that lists all users, should http://example.com/users/ return 403? (Many if not all web frameworks return 404 in this case.)

The first step to answering this is to refer to RFC 2616: HTTP/1.1. Specifically the sections talking about 403 Forbidden and 404 Not Found.
10.4.4 403 Forbidden
The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead.
10.4.5 404 Not Found
The server has not found anything matching the Request-URI. No indication is given of whether the condition is temporary or permanent. The 410 (Gone) status code SHOULD be used if the server knows, through some internally configurable mechanism, that an old resource is permanently unavailable and has no forwarding address. This status code is commonly used when the server does not wish to reveal exactly why the request has been refused, or when no other response is applicable.
My interpretation of this is that 404 is the more general error code that just says "there's nothing there". 403 says "the request was understood but is being refused, so don't try again!".
One reason why Apache might return 403 on directories without explicit index files is that auto-indexing (i.e. listing all files in it) is disabled (a.k.a "forbidden"). In that case saying "listing all files in this directory is forbidden" makes more sense than saying "there is no directory".
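To make that reasoning concrete, here is a minimal sketch of the decision, assuming the URL has already been mapped to a path on disk. The function name, the INDEX_NAMES tuple and the autoindex flag are illustrative assumptions, not Apache's actual implementation.

```python
from http import HTTPStatus
from pathlib import Path

# Assumed list of index file names; a real server makes this configurable.
INDEX_NAMES = ("index.html",)


def status_for_directory(directory: Path, autoindex: bool = False) -> HTTPStatus:
    """Pick a status for a request that maps to `directory` on disk."""
    if not directory.is_dir():
        # Nothing on disk matches the request at all.
        return HTTPStatus.NOT_FOUND
    if any((directory / name).is_file() for name in INDEX_NAMES):
        # An index file exists and would be served.
        return HTTPStatus.OK
    if autoindex:
        # A generated listing would be served instead of an index file.
        return HTTPStatus.OK
    # The directory exists, but listing it is not permitted.
    return HTTPStatus.FORBIDDEN
```

A server that prefers not to reveal that the directory exists could return NOT_FOUND in the last branch, which the quoted RFC text explicitly allows.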

Another argument for why 404 is preferable: Google Webmaster Tools.
For a 404, Google Webmaster Tools displays the referrer (allowing you to clean up the bad link to the directory), whereas for a 403 it doesn't.

Related

Should an optional, statically-defined resource that is unimplemented be defined as 403 or 501 (or something else)?

I have a device for which I am implementing an HTTP API and defining it via OpenAPI 3.0.
The following paths are defined:
/scan/inventory/start
/scan/location/start
/scan/direction/start
This API is designed to run on various devices. All of them implement the inventory feature, but not all of them implement the location or direction features. The available features can be queried by GETing the /scan path.
For a device that does not support location, my team has been waffling on which error code to return if someone attempts to use it. We want to provide useful feedback when this happens, so 404 seems ruled out, especially since the path is documented in our API. After reading and re-reading RFC 7231 and various summaries of it, 501 or 403 seemed like good choices.
At first, "501 Not Implemented" seemed like a good choice, but the 5XX class of errors seems to suggest a more serious server error.
Since we want to provide feedback, "403 Forbidden" seems good and puts the onus on the client for accessing a bad path.
I'm sure part of the problem is that we're attempting to use a specification (HTTP) that was not necessarily designed for arbitrary APIs.
What would you suggest we do?
This is a pretty straightforward 404.
The 404 (Not Found) status code indicates that the origin server did not find a current representation for the target resource or is not willing to disclose that one exists.
403 isn't right, since that indicates that the user isn't "authorized" to access that resource. It implies that the issue lies on the client side. But in this case there simply is no resource.
501 isn't right, since that "indicates that the server does not support the functionality required to fulfill the request. This is the appropriate response when the server does not recognize the request method and is not capable of supporting it for any resource." In this case the server has no problem supporting the request, it's just that the resource doesn't exist.
Note that "server" refers to the web server. The issue isn't whether or not your application as a whole supports some bit of functionality, it's whether the web server is capable of handling the HTTP request it was sent. It's not appropriate to use HTTP status codes to indicate that kind of high-level application state.
Also note that the status code isn't just a private contract between your application and its users. All actors in the web stack—loggers, intermediate caches, browsers, etc.—might change their behavior depending on the status code. That's why it's important to reserve things like 5xx for actual server errors.
To summarize, since the resource at that URI doesn't exist, the best way to provide useful feedback is to return a 404. If you want to distinguish between features that are never supported and those that are simply unsupported for that device you should use a mechanism other than the status code. Fortunately you're off to a good start by listing the available features at /scan.
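One way to return a 404 and still give useful feedback is to put the explanation in the response body. The sketch below is only an illustration: the start_scan function, the shape of the body and the set of supported features are assumptions, not part of the API described in the question.

```python
from http import HTTPStatus


def start_scan(feature: str, supported: set) -> tuple:
    """Return (status, body) for a request to /scan/<feature>/start."""
    if feature not in supported:
        # The resource does not exist on this device: 404, with a body
        # that explains why and points at the feature list under /scan.
        body = {
            "error": f"feature '{feature}' is not available on this device",
            "available": sorted(supported),
            "see": "/scan",
        }
        return HTTPStatus.NOT_FOUND, body
    return HTTPStatus.OK, {"started": feature}
```

For example, start_scan("location", {"inventory"}) yields a 404 whose body lists "inventory" as the only available feature.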

Query string (URL) lead to 403

Please help me with this. I already tried disabling the mod_security module through .htaccess, with no success.
PHP Version 5.6.30
Apache sends the request to a 403 page if I pass the parameter below.
&test[object_type]=0
The name (object_type) leads to 403 page.
eg:http://www.cudec.com.my/?test[object_type]=0 ✖ NOT WORKING LEADS TO 403
eg:http://www.cudec.com.my/?test[object_types]=0 ✓ WORKING
I will update this post to a full answer as soon as I get more information to work with ;)
I tried to call the 403-URL:
You don't have permission to access / on this server.
Additionally, a 403 Forbidden error was encountered while trying to use an ErrorDocument to handle the request.
Have you confirmed that ModSecurity is what returns the 403? It looks more like the folder permissions are insufficient.
Check that your DocumentRoot is at least readable by other users (an 'r' in the last permission triple, i.e. the read bit 4 set in the last octal digit).
If it's really ModSecurity, have a look into /var/log/apache2/modsecurity_audit.log and you should see which rule (by ID) is the one throwing 403 and also the reason (Error-Msg in the rule) why.
Does http://www.cudec.com.my/?test[object_types]=0 return the expected result?
The parameter doesn't seem to be interpreted when using &test[object_type] instead of &test[object_types], and the target resource / seems to have insufficient rights, as do the error pages...
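The permission check suggested above can be sketched as follows. It assumes the web server process falls into the "others" class for the directory, which depends on ownership and group membership. It also tests the execute bit, because a directory must be traversable as well as readable. The function name is illustrative.

```python
import stat


def others_can_read_dir(mode: int) -> bool:
    """True if 'others' have both read and execute (traverse) bits in `mode`."""
    needed = stat.S_IROTH | stat.S_IXOTH
    return (mode & needed) == needed
```

Pass it the st_mode value that os.stat() reports for the DocumentRoot. A mode of 0o755 passes the check, while 0o750 does not.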

htaccess to redirect dead links to 404

We recently migrated from one domain to another. We successfully redirected all valid URLs to their counterpart on the new site. However, there are quite a few links that were valid on the old domain that simply don't exist on the new domain. (e.g. pages/links that were outdated so we didn't migrate them)
For example, we had a blog component on the old domain that generated a lot of dynamic links like /blog/category/abc and /blog/tag/xyz. We no longer have this blog component on the new domain.
Using htaccess, what is the best way to make sure Google and other search engines are correctly aware that these pages/links no longer exist?
The correct HTTP status code to send is 410 Gone. To quote RFC 2616 (emphasis mine):
The requested resource is no longer available at the server and no
forwarding address is known. This condition is expected to be
considered permanent. Clients with link editing capabilities
SHOULD delete references to the Request-URI after user approval. If
the server does not know, or has no facility to determine, whether or
not the condition is permanent, the status code 404 (Not Found) SHOULD
be used instead. This response is cacheable unless indicated
otherwise.
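The decision itself is simple: paths under the retired blog component get 410, and everything else is handled normally. Below is a minimal sketch of that logic, using the two example prefixes from the question. The function name is an assumption. In Apache the same effect is normally achieved in configuration (mod_alias and mod_rewrite can both send a "gone" response) rather than in application code.

```python
from http import HTTPStatus
from typing import Optional

# Prefixes taken from the question's examples of the removed blog component.
RETIRED_PREFIXES = ("/blog/category/", "/blog/tag/")


def status_for_old_path(path: str) -> Optional[HTTPStatus]:
    """Return 410 Gone for retired paths, or None to handle the path normally."""
    if path.startswith(RETIRED_PREFIXES):
        return HTTPStatus.GONE
    return None
```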

Return code for wrong HTTP method in REST API?

Our API user can get the root document (collection list) by sending a GET request to the root API address. If they send a POST, we should return something. The same question applies to other resource paths, e.g. sending PATCH to a query path. Not all methods are meaningful on all paths.
From what I see in the HTTP RFCs, we should return 405 Method Not Allowed and send back the Allow response header with the list of allowed methods.
I see that e.g. GitHub API returns 404: Not found in the case I explained above (sending POST to root).
What would be the proper response? 404 or 405? I see 405 more developer-friendly, so is there any reason not to use it?
The expected behavior in this case, as per the HTTP spec and by REST guidelines, would be to return 405 Method Not Allowed. The resource is there, since a GET works, so a 404 Not Found would be confusing.
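A minimal sketch of that behavior, assuming a hypothetical route table that maps each path to its permitted methods (ROUTES and check_method are illustrative names, not from any framework):

```python
from http import HTTPStatus

# Hypothetical route table: path -> set of methods the resource supports.
ROUTES = {"/": {"GET", "HEAD"}}


def check_method(path: str, method: str) -> tuple:
    """Return (status, headers) for a request, before any handler runs."""
    allowed = ROUTES.get(path)
    if allowed is None:
        # No resource at this path at all.
        return HTTPStatus.NOT_FOUND, {}
    if method not in allowed:
        # The resource exists but does not support this method;
        # the Allow header lists the methods it does support.
        return HTTPStatus.METHOD_NOT_ALLOWED, {"Allow": ", ".join(sorted(allowed))}
    return HTTPStatus.OK, {}
```

With this table, a POST to / yields 405 with the header Allow: GET, HEAD, while a request to an unknown path still yields 404.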
I'm not familiar with the GitHub API, but its documentation notes that in some cases it returns 404 Not Found where 403 Forbidden would otherwise apply:
Requests that require authentication will return 404 Not Found, instead of 403 Forbidden, in some places. This is to prevent the accidental leakage of private repositories to unauthorized users.
Maybe the behavior on the root address is part of a bigger mechanism that handles such cases generally, who knows. Maybe you could ask?

Mod_rewrite - How to tell Google to dynamically delete pages from their index after 7 days

Search engines like to crawl and index webpages or URLs, but what if your webpages/URLs have expired content and you do not want them to be indexed after so many days?
Can you put an expiration in the URL and have mod_rewrite 301 redirect pages after a given expiration date?
Or maybe a cron job to add a 301 redirect header to all expired pages?
Just have the 'expired' pages return a 404? I am pretty sure that when Google encounters a 404, it will remove the page.
Not 404 or 301, but 410 Gone. This is the appropriate HTTP response:
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval. If the server does not know, or has no facility to determine, whether or not the condition is permanent, the status code 404 (Not Found) SHOULD be used instead. This response is cacheable unless indicated otherwise.
The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed. Such an event is common for limited-time, promotional services and for resources belonging to individuals no longer working at the server's site. It is not necessary to mark all permanently unavailable resources as "gone" or to keep the mark for any length of time -- that is left to the discretion of the server owner.
How you provide this response is open to discussion, however. There are many ways.
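One of those ways is to decide in application code, at request time, whether a page has outlived its lifetime. The sketch below assumes each page has a known publication timestamp and uses the seven-day lifetime from the question's title. The names LIFETIME and status_for_page are illustrative.

```python
from datetime import datetime, timedelta, timezone
from http import HTTPStatus
from typing import Optional

# Seven-day lifetime, as in the question; adjust as needed.
LIFETIME = timedelta(days=7)


def status_for_page(published_at: datetime, now: Optional[datetime] = None) -> HTTPStatus:
    """Return 410 Gone once a page is older than LIFETIME, else 200 OK."""
    if now is None:
        now = datetime.now(timezone.utc)
    if now - published_at >= LIFETIME:
        return HTTPStatus.GONE
    return HTTPStatus.OK
```

Both timestamps should be timezone-aware (or both naive), because Python refuses to subtract a naive datetime from an aware one.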