How are ETags generated and configured? - apache

I recently came across the concept of the ETag HTTP header (this). But I still have a question: for a particular HTTP resource, who is responsible for generating the ETag?
In other words, is it the actual application, the container (e.g. Tomcat), or the web server/load balancer (e.g. Apache/Nginx)?
Can anyone please help?

An overview of typical algorithms used in web servers.
Consider a file with:
Size 1047 bytes, i.e. 417 in hex.
MTime (last modification) of Mon, 06 Jan 2020 12:54:56 GMT, which is 1578315296 seconds in Unix time, or 1578315296666771000 nanoseconds.
Inode (the physical file number) 66, i.e. 42 in hex.
Different web servers return ETags in formats like the following (a small sketch reproducing two of them follows the list):
Nginx: "5e132e20-417" i.e. "hex(MTime)-hex(Size)". Not configurable.
BusyBox httpd the same as Nginx
monkey httpd the same as Nginx
Apache/2.2: "42-417-59b782a99f493" i.e. "hex(INode)-hex(Size)-hex(MTime in nanoseconds)". Can be configured but MTime anyway will be in nanos
Apache/2.4: "417-59b782a99f493" i.e. "hex(Size)-hex(MTime in nanoseconds)" i.e. without INode which is friendly for load balancing when identical file have different INode on different servers.
OpenWrt uhttpd: "42-417-5e132e20" i.e. "hex(INode)-hex(Size)-hex(MTime)". Not configurable.
Tomcat 9: W/"1047-1578315296666" i.e. Weak"Size-MTime in milliseconds". This is incorrect ETag because it should be strong as for a static file i.e. octal compatibility.
LightHTTPD: "hashcode(42-1047-1578315296666771000)" i.e. INode-Size-MTime but then reduced to a simple integer by hashcode (dekhash). Can be configured but you can only disable one part (etag.use-inode = "disabled")
MS IIS: it have a form Filetimestamp:ChangeNumber e.g. "53dbd5819f62d61:0". Not documented, not configurable but can be disabled.
Jetty: based on last mod, size and hashed. See Resource.getWeakETag()
Kitura (Swift): "W/hex(Size)-hex(MTime)" StaticFileServer.calculateETag
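As a small sketch, here is how two of the formats above (Nginx and Apache 2.4) can be reproduced from a file's stat() data; this just follows the descriptions in the list and is not the servers' actual source code.

import os

def nginx_style_etag(path: str) -> str:
    st = os.stat(path)
    # "hex(MTime in seconds)-hex(Size)"
    return '"%x-%x"' % (int(st.st_mtime), st.st_size)

def apache24_style_etag(path: str) -> str:
    st = os.stat(path)
    # "hex(Size)-hex(MTime in microseconds)"
    return '"%x-%x"' % (st.st_size, st.st_mtime_ns // 1000)

print(nginx_style_etag("/etc/hosts"))     # e.g. "5e132e20-417"
print(apache24_style_etag("/etc/hosts"))  # e.g. "417-59b782a99f493"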
A few thoughts:
Hex numbers are used here so often because it's cheap to convert a decimal number into a shorter hex string.
The inode, while adding more guarantees, makes load balancing impossible and is very fragile if you simply copy the file during an application redeploy.
MTime with nanosecond precision is not available on all platforms, and such granularity is not needed anyway.
Apache has a bug open about this: https://bz.apache.org/bugzilla/show_bug.cgi?id=55573
The order (MTime-Size vs Size-MTime) also matters, because MTime is the part most likely to change, so comparing the ETag strings may be faster by a dozen CPU cycles.
Even though this is not a full content hash, it is definitely not a weak ETag either: it is enough to indicate that we expect byte-for-byte (octet) equality, e.g. for Range requests.
Apache and Nginx between them serve almost all traffic on the Internet, but most static files are served by Nginx, and its scheme is not configurable.
It looks like Nginx uses the most reasonable scheme, so if you are implementing your own, try to make it the same.
The whole ETag is generated in C with one line:
printf("\"%" PRIx64 "-%" PRIx64 "\"", last_mod, file_size)
My proposal is to take the Nginx scheme and make it the recommended ETag algorithm in the HTTP specification.

As with most aspects of the HTTP specification, the responsibility ultimately lies with whoever is providing the resource.
Of course, it's often the case that we use tools (servers, load balancers, application frameworks, etc.) that help us fulfill those responsibilities. But there isn't any specification defining what a "web server", as opposed to the application, is expected to provide; it's just a practical question of what features are available in the tools you're using.
Now, looking at ETags in particular, a common situation is that the framework or web server can be configured to automatically hash the response (either the body or something else) and put the result in the ETag. Then, on a conditional request, it will generate a response and hash it to see if it has changed, and automatically send the conditional response if it hasn't.
To take two examples that I'm familiar with, nginx can do this with static files at web server level, and Django can do this with dynamic responses at the application level.
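For illustration, here is a minimal sketch of that hash-and-compare flow; the function name, handler shape, and header dictionaries are generic stand-ins rather than any particular framework's API.

import hashlib

def respond_with_etag(request_headers: dict, body: bytes):
    # Strong validator derived from the generated body.
    etag = '"%s"' % hashlib.sha1(body).hexdigest()
    if request_headers.get("If-None-Match") == etag:
        # The body still had to be generated, but we can at least skip sending it.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body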
That approach is common, easy to configure, and works pretty well. In some situations, though, it might not be the best fit for your use case. For example:
To compute a hash to compare to the incoming ETag you first have to have a response. So although the conditional response can save you the overhead of transmitting the response, it can't save you the cost of generating the response. So if generating your response is expensive, and you have an alternative source of ETags (for example, version numbers stored in the database), you can use that to achieve better performance.
If you're planning to use the ETags to prevent accidental overwrites with state-changing methods, you will probably need to add your own application code to make your compare-and-set logic atomic.
So in some situations you might want to create your ETags at the application level. To take Django as an example again, it provides an easy way for you to provide your own function to compute ETags.
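For instance, Django's condition decorator lets you plug in such a function. In the sketch below the Article model and its version field are hypothetical, invented only to illustrate the database-backed ETag idea mentioned above; the decorator itself is real Django API.

from django.http import HttpResponse
from django.views.decorators.http import condition

from myapp.models import Article  # hypothetical app and model

def article_etag(request, article_id):
    # Cheap lookup of a stored version number instead of rendering and hashing.
    version = (Article.objects.filter(pk=article_id)
               .values_list("version", flat=True).first())
    return None if version is None else '"article-%s-v%s"' % (article_id, version)

@condition(etag_func=article_etag)
def article_view(request, article_id):
    # Only runs the expensive rendering path when the ETag doesn't match.
    return HttpResponse(Article.objects.get(pk=article_id).body)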
In sum, it's ultimately your responsibility to provide the ETags for the resources you control, but you may well be able to take advantage of the tools in your software stack to do it for you.

Related

Can caching an API based on the hash of the whole URL be a potential threat?

I am adding caching to an API server. The implementation of the caching system in the framework I am using simply hashes the whole URL of the request and uses it as the cache key (along with some other data, like the language, etc.).
But with this simple system I can add fake query parameters to the URL with arbitrary values, and my system will also cache those requests. For example:
GET https://example.com/apicall # Cached
GET https://example.com/apicall?fake=1 # Also cached
GET https://example.com/apicall?fake=2 # Also cached!
Is this something I should worry about? That people can very easily fill my cache with junk entries that aren't used? Or am I exaggerating the potential impact of this?
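As an illustration of the mechanism being described (the exact framework and key recipe aren't stated, so this is only a hypothetical sketch), a cache key built by hashing the full URL treats every distinct query string as a new entry:

import hashlib

def cache_key(url: str, language: str = "en") -> str:
    # Hash the whole URL plus a few request attributes, as described above.
    return hashlib.md5(f"{url}|{language}".encode()).hexdigest()

print(cache_key("https://example.com/apicall"))         # one entry
print(cache_key("https://example.com/apicall?fake=1"))  # another entry
print(cache_key("https://example.com/apicall?fake=2"))  # yet another entry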

Implementing basic S3 compatible API with akka-http

I'm trying to implement a file storage service with a basic S3-compatible API using akka-http.
I use the S3 Java SDK to test my service's API and ran into a problem with the putObject(...) method: I can't consume the file properly on my akka-http backend. I wrote a simple route for test purposes:
def putFile(bucket: String, file: String) = put {
  extractRequestEntity { ent =>
    val finishedWriting = ent.dataBytes.runWith(FileIO.toPath(new File(s"/tmp/${file}").toPath))
    onComplete(finishedWriting) { ioResult =>
      complete("Finished writing data: " + ioResult)
    }
  }
}
It saves the file, but the file is always corrupted. Looking inside the file, I found lines like this:
"20000;chunk-signature=73c6b865ab5899b5b7596b8c11113a8df439489da42ddb5b8d0c861a0472f8a1".
When I try to PUT the file with any other REST client, it works as expected.
I know S3 uses the "Expect: 100-continue" header, and maybe that is what causes the problems.
I really can't figure out how to deal with it. Any help appreciated.
This isn't exactly corrupted. Your service is not accounting for one of the four¹ ways S3 supports uploads to be sent on the wire, using Content-Encoding: aws-chunked and x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD.
It's a non-standards-based mechanism for streaming an object, and includes chunks that look exactly like this:
string(IntHexBase(chunk-size)) + ";chunk-signature=" + signature + \r\n + chunk-data + \r\n
...where IntHexBase() is pseudocode for a function that formats an integer as a hexadecimal number as a string.
This chunk-based algorithm is similar to, but not compatible with, Transfer-Encoding: chunked, because it embeds checksums in the stream.
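To make the framing concrete, here is a minimal sketch of stripping that aws-chunked framing from a request body. It deliberately ignores signature verification (which a real server must still perform against the chunk signatures) and any trailing headers after the final chunk.

import io

def strip_aws_chunked(stream: io.BufferedIOBase) -> bytes:
    body = bytearray()
    while True:
        header = stream.readline()                    # b"20000;chunk-signature=...\r\n"
        size_hex, _, _sig = header.strip().partition(b";chunk-signature=")
        size = int(size_hex, 16)
        if size == 0:                                 # zero-length chunk marks the end
            break
        body += stream.read(size)                     # the actual object bytes
        stream.read(2)                                # consume the trailing \r\n
    return bytes(body)

# Example: strip_aws_chunked(io.BytesIO(raw_request_body))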
Why did they make up a new HTTP transfer encoding? It's potentially useful on the client side because it eliminates the need to either "read your payload twice or buffer [the entire object payload] in memory [concurrently]" -- one or the other of which is otherwise necessary if you are going to calculate the x-amz-content-sha256 hash before the upload begins, as you otherwise must, since it's required for integrity checking.
I am not overly familiar with the internals of the Java SDK, but this type of upload might be triggered by using .withInputStream(), or it might be standard behavior for files too, or for files over a certain size.
Your minimum workaround would be to throw an HTTP error if you see x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD in the request headers since you appear not to have implemented this in your API, but this would most likely only serve to prevent storing objects uploaded by this method. The fact that this isn't already what happens automatically suggests that you haven't implemented x-amz-content-sha256 handling at all, so you are not doing the server-side payload integrity checks that you need to be doing.
For full compatibility, you'll need to implement the algorithm supported by S3 and assumed to be available by the SDKs, unless the SDKs specifically support a mechanism for disabling this algorithm -- which seems unlikely, since it serves a useful purpose, particularly (it appears) for streams whose length is known but that aren't seekable.
¹ one of four -- the other three are a standard PUT, a web-based html form POST, and the multipart API that is recommended for large files and mandatory for files larger than 5 GB.

Server with the sole purpose of setting cookies

At work we ran up against the problem of setting server-side cookies - a lot of them. Right now we have a PHP script, the sole purpose of which is to set a cookie on the client for our domain. This happens a lot more than 'normal' requests to the server (which is running an app), so we've discussed moving it to its own server. This would be an Apache server, probably dedicated, with one PHP script 3 lines long, just running over and over again.
Surely there must be a faster, better way of doing this, rather than starting up the whole PHP environment. Basically, I need something super simple that can sit around all day/night doing the following:
Check if a certain cookie is set, and
If that cookie is not set, fill it with a random hash (right now it's a simple md5(microtime))
Any suggestions?
You could create a simple HTTP server yourself to accept requests and return the Set-Cookie header with an empty body. This would allow you to move the cookie-generation overhead to wherever you see fit.
I echo the sentiments above, though: unless cookie generation is significantly expensive, I don't think you will gain much by moving away from your current setup.
By way of an example, here is an extremely simple server written with Tornado that simply sets a cookie on GET or HEAD requests to '/'. It includes an async example listening on '/async', which may be of use depending on what you are doing to get your cookie value.
import time
import tornado.ioloop
import tornado.web

class CookieHandler(tornado.web.RequestHandler):
    def get(self):
        cookie_value = str(time.time())
        self.set_cookie('a_nice_cookie', cookie_value, expires_days=10)
        # self.set_secure_cookie('a_double_choc_cookie', cookie_value)
        self.finish()

    def head(self):
        return self.get()

class AsyncCookieHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def get(self):
        self._calculate_cookie_value(self._on_create_cookie)

    @tornado.web.asynchronous
    def head(self):
        self._calculate_cookie_value(self._on_create_cookie)

    def _on_create_cookie(self, cookie_value):
        self.set_cookie('double_choc_cookie', cookie_value, expires_days=10)
        self.finish()

    def _calculate_cookie_value(self, callback):
        ## meaningless async example... just wastes 2 seconds
        def _fake_expensive_op():
            val = str(time.time())
            callback(val)
        tornado.ioloop.IOLoop.instance().add_timeout(time.time() + 2, _fake_expensive_op)

application = tornado.web.Application([
    (r"/", CookieHandler),
    (r"/async", AsyncCookieHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()
Launch this process with Supervisord and you'll have a simple, fast, low-overhead server that sets cookies.
You could try using mod_headers (usually available in the default install) to manually construct a Set-Cookie header and emit it -- no programming needed as long as it's the same cookie every time. Something like this could work in an .htaccess file:
Header add Set-Cookie "foo=bar; Path=/; Domain=.foo.com; Expires=Sun, 06 May 2012 00:00:00 GMT"
However, this won't work for you as-is: there's no code here, just a dumb header. It can't come up with the new random value you'd want, and it can't adjust the expiry date as is standard practice.
This would be an Apache server, probably dedicated, with one PHP script 3 lines long, just running over and over again. [...] Surely there must be a faster, better way of doing this, rather than starting up the whole PHP environment.
Are you using APC or another bytecode cache? If so, there's almost no startup cost. Because you're talking about setting up an entire server just for this, it sounds like you control the server as well. This means that you can turn off apc.stat for even less of a startup hit.
Really though, if all that script is doing is building an MD5 hash and setting a cookie, it should already be blisteringly fast, especially if it's running under mod_php. Do you already know, through benchmarking and testing, that the script isn't performing as well as you'd like? If so, can you share those benchmarks with us?
It would be interesting to know why you think you need an extra server. Do you actually have a bottleneck in generating the cookie, or somewhere else? Is it the log writing, since requests happen a lot? Ajax polling? Client download speed?
At least for starters, I'd look at something more efficient than fetching the time to generate the "random hash". For example, on the Intel i7 laptop I have, generating 999,999 MD5 hashes from microtime takes roughly 4 seconds, and doing the same thing with random numbers is about a second faster (not taking seeding of rand into account).
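As a rough Python analogue of that comparison (the original numbers were for PHP's md5(microtime()) versus rand(); exact timings will vary by machine, this only shows how one might measure it):

import hashlib
import random
import time
import timeit

md5_of_time = lambda: hashlib.md5(str(time.time()).encode()).hexdigest()
plain_random = lambda: str(random.getrandbits(128))

print("md5(time): %.2fs" % timeit.timeit(md5_of_time, number=999_999))
print("random   : %.2fs" % timeit.timeit(plain_random, number=999_999))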
Then, if you take the opening and closing of sockets into account, just moving your script (which is most likely already really fast) to another server may actually end up slowing the requests down. Actually, now that I've re-read your question: is your cookie-setter script already a dedicated page, or do you just include it in the real content served by another PHP script? If not, try that approach. This would also be beneficial if you have default logging rules for Apache: if cookies are set on their own page, Apache will log a row for each of those requests, and on high-load systems this adds up in the total I/O time spent by Apache.
Also, consider that testing whether the cookie is set and then setting it might be slower than just setting it unconditionally, whether it already exists or not.
But overall, without knowing more about how you handle the cookies now, I don't think you need to set up a server just to offload cookie generation, unless you are doing something really nasty.
Apache has a module called mod_usertrack which looks like it might do exactly what you want. There's no need for PHP, and you could likely create a really optimised, lightweight Apache config to serve this.
If you want to go for something even faster and are happy not to use Apache, you could use lighttpd and its mod_usertrack, or nginx's HttpUserId module.

Does the `Expires` HTTP header need to be consistent across multiple cold-cache requests?

I'm implementing a custom web server of a kind, and am looking into adding Expires header support. However, I'm a little unsure of how exactly to implement it.
If multiple cold-cache requests are made to the same unchanged resource on the server, and the server returns a different Expires header each time (say it uses a relative time to calculate the exact value of the Expires date, e.g. +6 hours from the request time), does that invalidate the cache on all the proxy servers in between as well? Or is that impossible (per the spec)?
OK, never mind, I found the relevant information under the "Cache Revalidation and Reload Controls" section of the HTTP spec.
Basically, you can serve all the different validators you want, but you must be aware that in such a case proxies may hold a set of different validators, from their own cache and from the various user agents communicating with the proxy. They may choose to send you one, and it might not be the correct or most optimal one for the end users. However, a "best approach" is suggested in the spec.
I suppose this covers Expires headers as well as ETags, Cache-Control and whatnot.
Here's the relevant excerpt, in case anyone's interested:
When an intermediate cache is forced, by means of a max-age=0 directive, to revalidate its own cache entry, and the client has supplied its own validator in the request, the supplied validator might differ from the validator currently stored with the cache entry. In this case, the cache MAY use either validator in making its own request without affecting semantic transparency. However, the choice of validator might affect performance. The best approach is for the intermediate cache to use its own validator when making its request. If the server replies with 304 (Not Modified), then the cache can return its now validated copy to the client with a 200 (OK) response. If the server replies with a new entity and cache validator, however, the intermediate cache can compare the returned validator with the one provided in the client's request, using the strong comparison function. If the client's validator is equal to the origin server's, then the intermediate cache simply returns 304 (Not Modified). Otherwise, it returns the new entity with a 200 (OK) response. If a request includes the no-cache directive, it SHOULD NOT include min-fresh, max-stale, or max-age.
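For what it's worth, the relative-expiry behaviour described in the question looks roughly like the sketch below (the 6-hour offset is just the example value from the question): every cold-cache request gets "now + 6 hours" formatted as an HTTP-date, so two requests made at different times carry different Expires values.

from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def expires_header(hours: int = 6) -> str:
    # "Now + N hours", formatted as an HTTP-date in GMT.
    return format_datetime(datetime.now(timezone.utc) + timedelta(hours=hours),
                           usegmt=True)

print("Expires:", expires_header())  # e.g. Expires: Mon, 06 Jan 2020 18:54:56 GMT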

How do you make an etag that matches Apache?

I want to make an ETag that matches what Apache produces. How does Apache create its ETags?
Apache uses the standard format of inode-filesize-mtime. The only caveat is that the mtime must be epoch time, padded with zeros so that it is 16 digits. Here is how to do it in PHP:
$fs = stat($file);
header("ETag: " . sprintf('"%x-%x-%s"', $fs['ino'], $fs['size'], base_convert(str_pad($fs['mtime'], 16, "0"), 10, 16)));
One thing to remember about Apache's Etags is that they don't play well in clusters because they include inode information that can—and probably will—vary between machines in the same cluster.
If you're dynamically generating your page though, this probably won't make sense. If you're in PHP, you can pick the inode and file size of the main script, but the modify time won't tell you if your data has changed. Unless you have a good caching process or just generate static pages, etags aren't helpful. If you do have a good caching process, the inode and file size are probably irrelevant.
Edit: For people who don't know what etags are - they're just supposed to be a value that changes when the content has changed, for caching purposes. The browser gets the etag from the web server, compares it to the etag for its cached copy and then fetches the whole page if the etag has changed.
The answer above (from Chris) works well, but it can be simplified using an implicit cast in the sprintf:
sprintf('"%x-%x-%x"', $s['ino'], $s['size'], str_pad($s['mtime'], 16, "0"));
The suggested %016x doesn't work because the padding is applied after the conversion to hex, rather than before.
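For comparison, here is a rough Python sketch of the same pad-then-convert idea; it is purely illustrative and mirrors the PHP above, where str_pad pads the mtime on the right before the hex conversion.

import os

def apache_like_etag(path: str) -> str:
    st = os.stat(path)
    # Pad the epoch mtime to 16 digits *before* converting to hex,
    # mirroring str_pad($fs['mtime'], 16, "0") followed by base_convert(..., 10, 16).
    mtime_padded = str(int(st.st_mtime)).ljust(16, "0")
    return '"%x-%x-%x"' % (st.st_ino, st.st_size, int(mtime_padded))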