Server with the sole purpose of setting cookies - apache

At work we ran up against the problem of setting server-side cookies - a lot of them. Right now we have a PHP script, the sole purpose of which is to set a cookie on the client for our domain. This happens a lot more than 'normal' requests to the server (which is running an app), so we've discussed moving it to its own server. This would be an Apache server, probably dedicated, with one PHP script 3 lines long, just running over and over again.
Surely there must be a faster, better way of doing this, rather than starting up the whole PHP environment. Basically, I need something super simple that can sit around all day/night doing the following:
Check if a certain cookie is set, and
If that cookie is not set, fill it with a random hash (right now it's a simple md5(microtime))
Any suggestions?

You could create a simple HTTP server yourself that accepts requests and returns a Set-Cookie header with an empty body. This would allow you to move the cookie generation overhead to wherever you see fit.
I echo the sentiments above though: unless cookie generation is significantly expensive, I don't think you will gain much by moving from your current setup.
By way of an example, here is an extremely simple server written with Tornado that simply sets a cookie on GET or HEAD requests to '/'. It includes an async example listening for '/async' which may be of use depending on what you are doing to get your cookie value.
    import time
    import tornado.ioloop
    import tornado.web


    class CookieHandler(tornado.web.RequestHandler):
        def get(self):
            cookie_value = str(time.time())
            self.set_cookie('a_nice_cookie', cookie_value, expires_days=10)
            # self.set_secure_cookie('a_double_choc_cookie', cookie_value)
            self.finish()

        def head(self):
            return self.get()


    class AsyncCookieHandler(tornado.web.RequestHandler):
        # Note: the asynchronous decorator only exists in Tornado < 6.
        @tornado.web.asynchronous
        def get(self):
            self._calculate_cookie_value(self._on_create_cookie)

        @tornado.web.asynchronous
        def head(self):
            self._calculate_cookie_value(self._on_create_cookie)

        def _on_create_cookie(self, cookie_value):
            self.set_cookie('double_choc_cookie', cookie_value, expires_days=10)
            self.finish()

        def _calculate_cookie_value(self, callback):
            ## meaningless async example... just wastes 2 seconds
            def _fake_expensive_op():
                val = str(time.time())
                callback(val)
            tornado.ioloop.IOLoop.instance().add_timeout(time.time() + 2, _fake_expensive_op)


    application = tornado.web.Application([
        (r"/", CookieHandler),
        (r"/async", AsyncCookieHandler),
    ])

    if __name__ == "__main__":
        application.listen(8888)
        tornado.ioloop.IOLoop.instance().start()
Launch this process with Supervisord and you'll have a simple, fast, low-overhead server that sets cookies.
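The handlers above always set the cookie. The "check first, set only if missing" step from the question could be sketched with the Python standard library as follows. This is only an illustration: the cookie name visitor_id and the ten-day lifetime are assumptions, and secrets.token_hex is swapped in for the question's md5(microtime).

```python
import secrets
from http.cookies import SimpleCookie


def set_cookie_if_missing(cookie_header, name="visitor_id", max_age=10 * 24 * 3600):
    """Return a Set-Cookie header value, or None if the cookie is already present.

    cookie_header is the raw value of the request's Cookie header (may be empty).
    The cookie name and lifetime are hypothetical defaults.
    """
    jar = SimpleCookie()
    if cookie_header:
        jar.load(cookie_header)
    if name in jar:
        return None  # client already has the cookie, nothing to send

    out = SimpleCookie()
    out[name] = secrets.token_hex(16)  # random 32-character hex value
    out[name]["path"] = "/"
    out[name]["max-age"] = max_age
    return out[name].OutputString()
```

A server would call this once per request and add the returned string as a Set-Cookie header when it is not None.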

You could try using mod_headers (usually available in the default install) to manually construct a Set-Cookie header and emit it -- no programming needed as long as it's the same cookie every time. Something like this could work in an .htaccess file:
Header add Set-Cookie "foo=bar; Path=/; Domain=.foo.com; Expires=Sun, 06 May 2012 00:00:00 GMT"
However, this won't work for you. There's no code here. It's just a stupid header. It can't come up with the new random value you'd want, and it can't adjust the expire date as is standard practice.
"This would be an Apache server, probably dedicated, with one PHP script 3 lines long, just running over and over again. [...] Surely there must be a faster, better way of doing this, rather than starting up the whole PHP environment."
Are you using APC or another bytecode cache? If so, there's almost no startup cost. Because you're talking about setting up an entire server just for this, it sounds like you control the server as well. This means that you can turn off apc.stat for even less of a startup hit.
Really though, if all that script is doing is building an md5 hash and setting a cookie, it should already be blisteringly fast, especially if it's mod_php. Do you already know, through benchmarking and testing, that the script isn't performing as well as you'd like? If so, can you share those benchmarks with us?

It would be interesting to know why you think you need an extra server. Do you actually have a bottleneck in generating the cookie, or is it somewhere else? Is it the log writing, since the requests happen a lot? Ajax polling? Client download speed?
At least for starters, I'd look for something more efficient than fetching the time to generate the "random hash". For example, on this Intel i7 laptop, generating 999999 md5 hashes from microtime takes roughly 4 seconds, and doing the same thing with random numbers is a second faster (not taking the seeding of rand into account).
Then, if you take the opening and closing of a socket into account, just moving your script (which is most likely already really fast, though I don't know how your pages use it) will actually end up slowing down the requests. Actually, now that I've re-read your question, it makes me think that your cookie setter script is already a dedicated page? Or do you just "include" it into real content served by another PHP script? If not, try that approach. This would also be beneficial if you have the default logging rules for Apache: if cookies are set on their own page, Apache will log a row for each of those requests, and on high-load systems this adds up in the total I/O time spent by Apache.
Also, consider that testing whether the cookie is set and then setting it might be slower than just setting it every time, whether or not the cookie exists.
But overall, I don't think you need to set up a server just to offload cookie generation, at least not without knowing more about how you handle the cookies now. Unless you are doing something really nasty.

Apache has a module called mod_usertrack which looks like it might do exactly what you want. There's no need for PHP and you could likely create a really optimised lightweight Apache config to serve this with.
If you want to go for something even faster and are happy to not use Apache, you could use lighttpd and its mod_usertrack, or nginx's HttpUserId module.

Related

Can caching an API based on the hash of the whole URL be a potential threat?

I am adding caching to an API server. The implementation of the caching system of the framework I am using simply hashes the whole URL of the request and uses it as a cache key (as well as some other data like language, etc..).
But with this simple system I can add fake query parameters to the url with arbitrary values and my system will also cache those requests. For example:
GET https://example.com/apicall # Cached
GET https://example.com/apicall?fake=1 # Also cached
GET https://example.com/apicall?fake=2 # Also cached!
Is this something I should worry about? That people can very easily fill my cache with junk entries that aren't used? Or am I exaggerating the potential impact of this?
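One way to picture the concern is to compare hashing the raw URL with hashing a canonical form that keeps only the query parameters the endpoint actually uses. The following is a minimal sketch, not the framework's implementation: the ALLOWED_PARAMS set and the cache_key function are hypothetical names.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical allow-list: the only query parameters this API understands.
ALLOWED_PARAMS = frozenset({"page", "lang"})


def cache_key(url, allowed=ALLOWED_PARAMS):
    """Hash a canonical form of the URL that ignores unknown query parameters."""
    parts = urlsplit(url)
    # Keep only known parameters and sort them so ordering does not matter.
    kept = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k in allowed
    )
    canonical = parts.netloc.lower() + parts.path + "?" + urlencode(kept)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With a key like this, /apicall, /apicall?fake=1 and /apicall?fake=2 all map to the same cache entry, so junk parameters no longer create new entries.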

How ETags are generated and configured?

I recently came across the concept of the ETag HTTP header. But I still have a question: for a particular HTTP resource, who is responsible for generating ETags?
In other words, is it the actual application, the container (e.g. Tomcat), or the web server/load balancer (e.g. Apache/Nginx)?
Can anyone please help?
Overview of typical algorithms used in webservers.
Consider we have a file with
Size 1047 i.e. 417 in hex.
MTime i.e. last modification on Mon, 06 Jan 2020 12:54:56 GMT which
is 1578315296 seconds in unix time or 1578315296666771000 nanoseconds.
Inode which is a physical file number 66 i.e. 42 in hex
Different web servers return ETags like:
Nginx: "5e132e20-417" i.e. "hex(MTime)-hex(Size)". Not configurable.
BusyBox httpd: the same as Nginx.
monkey httpd: the same as Nginx.
Apache/2.2: "42-417-59b782a99f493" i.e. "hex(INode)-hex(Size)-hex(MTime in microseconds)". Can be configured, but MTime will be in microseconds anyway.
Apache/2.4: "417-59b782a99f493" i.e. "hex(Size)-hex(MTime in microseconds)", i.e. without INode, which is friendly for load balancing when identical files have different INodes on different servers.
OpenWrt uhttpd: "42-417-5e132e20" i.e. "hex(INode)-hex(Size)-hex(MTime)". Not configurable.
Tomcat 9: W/"1047-1578315296666" i.e. Weak"Size-MTime in milliseconds". This is an incorrect ETag, because for a static file it should be strong, i.e. guarantee octet-for-octet compatibility.
lighttpd: "hashcode(42-1047-1578315296666771000)" i.e. INode-Size-MTime, but then reduced to a simple integer by hashcode (dekhash). Can be configured, but you can only disable one part (etag.use-inode = "disabled").
MS IIS: it has the form Filetimestamp:ChangeNumber, e.g. "53dbd5819f62d61:0". Not documented, not configurable, but can be disabled.
Jetty: based on last mod, size and hashed. See Resource.getWeakETag()
Kitura (Swift): "W/hex(Size)-hex(MTime)" StaticFileServer.calculateETag
Few thoughts:
Hex numbers are used here so often because it's cheap to convert a decimal number to a shorter hex string.
The inode, while adding more guarantees, makes load balancing impossible and is very fragile if you simply copy the file during an application redeploy.
Sub-second MTime (microseconds or nanoseconds) is not available on all platforms, and such granularity is not needed.
Apache has a bug about this: https://bz.apache.org/bugzilla/show_bug.cgi?id=55573
The order MTime-Size or Size-MTime also matters: MTime is more likely to change, so putting it first may make comparing ETag strings faster by a dozen CPU cycles.
Even though this is not a full checksum hash, it is definitely not a weak ETag. It is enough to show that we expect octet-for-octet compatibility for Range requests.
Apache and Nginx share almost all traffic on the Internet, but most static files are served via Nginx, where the ETag is not configurable.
It looks like Nginx uses the most reasonable scheme, so if you are implementing your own, try to make it the same.
The whole ETag is generated in C with one line:
printf("\"%" PRIx64 "-%" PRIx64 "\"", last_mod, file_size)
My proposition is to take Nginx schema and make it as a recommended ETag algorithm by W3C.
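For comparison, the Nginx-style scheme described above, "hex(MTime)-hex(Size)", could be sketched in Python like this. The function names are made up for illustration; only os.stat is a real library call.

```python
import os


def nginx_style_etag(mtime_seconds, size_bytes):
    """Build an ETag of the form "hex(MTime)-hex(Size)", quotes included."""
    return '"%x-%x"' % (mtime_seconds, size_bytes)


def etag_for_file(path):
    """Compute the same ETag for a file on disk (whole-second MTime)."""
    st = os.stat(path)
    return nginx_style_etag(int(st.st_mtime), st.st_size)
```

For the example file above (MTime 1578315296, size 1047), nginx_style_etag returns "5e132e20-417", matching the Nginx value in the list.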
As with most aspects of the HTTP specification, the responsibility ultimately lies with whoever is providing the resource.
Of course, it's often the case that we use tools (servers, load balancers, application frameworks, etc.) that help us fulfill those responsibilities. But there isn't any specification defining what a "web server", as opposed to the application, is expected to provide; it's just a practical question of what features are available in the tools you're using.
Now, looking at ETags in particular, a common situation is that the framework or web server can be configured to automatically hash the response (either the body or something else) and put the result in the ETag. Then, on a conditional request, it will generate a response and hash it to see if it has changed, and automatically send the conditional response if it hasn't.
To take two examples that I'm familiar with, nginx can do this with static files at web server level, and Django can do this with dynamic responses at the application level.
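The "hash the response, then compare" behaviour described above might look roughly like the following. This is a framework-independent sketch and not how nginx or Django implement it: make_etag and conditional_response are hypothetical names, and the choice of a truncated SHA-256 is an assumption.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from a hash of the response body."""
    return '"' + hashlib.sha256(body).hexdigest()[:32] + '"'


def conditional_response(
    body: bytes, if_none_match: Optional[str]
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body), answering 304 when the client's ETag matches."""
    etag = make_etag(body)
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        # If-None-Match uses weak comparison, so a W/ prefix is ignored.
        normalized = [t[2:] if t.startswith("W/") else t for t in candidates]
        if "*" in normalized or etag in normalized:
            return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body
```

Note that the body still has to be generated before it can be hashed, which is exactly the cost the first caveat below is about.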
That approach is common, easy to configure, and works pretty well. In some situations, though, it might not be the best fit for your use case. For example:
To compute a hash to compare to the incoming ETag you first have to have a response. So although the conditional response can save you the overhead of transmitting the response, it can't save you the cost of generating the response. So if generating your response is expensive, and you have an alternative source of ETags (for example, version numbers stored in the database), you can use that to achieve better performance.
If you're planning to use the ETags to prevent accidental overwrites with state-changing methods, you will probably need to add your own application code to make your compare-and-set logic atomic.
So in some situations you might want to create your ETags at the application level. To take Django as an example again, it provides an easy way for you to provide your own function to compute ETags.
In sum, it's ultimately your responsibility to provide the ETags for the resources you control, but you may well be able to take advantage of the tools in your software stack to do it for you.

"+having+" in $GET/$POST causes server to return 403 Forbidden

One of my clients has a PHP script that kept crashing inexplicably. After hours of research, I determined if you send any PHP script a variable (either through GET or POST) that contains " having t", or escaped for the URL "+having+t", it crashes the script and returns a "403 forbidden error". To test it, I made a sample script with the entire contents:
<?php echo "works";
I put it live (temporarily) here: http://primecarerefer.com/test/test.php
Now if you try sending it some data like: http://primecarerefer.com/test/test.php?x=+having+x
It fails. The last letter can be any letter and it will still crash, but changing any other letter makes the script load fine. What would cause this and how can it be fixed? The link is live for now if anyone wants to try out different combinations.
PS - I found that if I get the 403 error a bunch of times in a row, the server blocks me for 15 minutes.
I had this type of issue on a web server that ran Apache mod_security, which was very poorly configured. mod_security actually has very bad default regex rules, which are very easy to trip with valid POST or GET data.
To be clear, this has nothing to do with PHP or HTML. It's almost certainly about POST and GET data passing through mod_security, and mod_security rejecting the request because it believes it is an SQL injection attempt.
You can edit the rules yourself, depending on your server access, but if it is mod_security, I know you can't do anything via PHP to get around this.
/etc/httpd/conf.d/mod_security.conf (old path, it's changed, but it gives the idea)
Examples of the default rules:
SecFilter "delete[[:space:]]+from"
SecFilter "insert[[:space:]]+into"
SecFilter "select.+from"
These are samples of the rules
https://www.howtoforge.com/apache_mod_security
here they trip the filter:
http://primecarerefer.com/test/test.php?x=%20%22%20%20select%20from%22
Note that the article is very old and the rules are structured quite differently now, but the bad regex remains, i.e. select[any number of characters, no matter how far removed, or close]from will trip it. Any text that matches these loose rules will trip it.
But editing those default files requires access to them, and also assumes they won't be altered in an upgrade of Apache mod_security at some point, so I found it's not a good way to fix the problem. Moving to a better, more professionally set up hoster fixed those issues for us. But it does help, when you talk to the hosting support, to know what the cause of the issue is.
In fact 'having' is not irrelevant at all, it's part of sql injection filters in the regex rules in the security filters run on POST/GET. We used to hit this all the time when admins would edit CMS pages, which would trigger invariably some sql filter, since any string of human words would invariably contain something like 'select.*from' or 'insert.*into' etc.
This mod_security issue used to drive me bonkers trying to debug why backend edit form updates would just hang, until I finally realized it was badly done generic regex patterns in the mod_security file itself.
In a sense, this isn't an answer, because the only fix is going into the server and either editing the rules file, which is pretty easy, or disabling mod_security, or moving to a web hoster that doesn't use those bad generic defaults.
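To see how easily ordinary text trips patterns like the ones quoted above, here is a small Python sketch. It is only an approximation: Python's re module does not support the POSIX class [[:space:]], so \s is used instead, and case-insensitive matching is an assumption about how the filter was applied.

```python
import re

# The three sample rules from above, translated to Python regex syntax.
LOOSE_RULES = [
    re.compile(r"delete\s+from", re.IGNORECASE),
    re.compile(r"insert\s+into", re.IGNORECASE),
    re.compile(r"select.+from", re.IGNORECASE),
]


def tripped_rules(text):
    """Return the patterns of all rules that match the given request data."""
    return [rule.pattern for rule in LOOSE_RULES if rule.search(text)]
```

An innocent sentence such as "Please select a date from the calendar" matches select.+from, which is the kind of false positive described above.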

HTTPS proxy with support for chunked-encoded requests

I'm developing a simple HTTPS proxy (written in Python) which receives POST/GET requests/responses, applies some transformation and finally forwards the result to the recipient.
I need to handle chunked-encoded requests/responses in a "streaming" fashion, meaning that as soon as a chunk is received the proxy transforms it and forwards it to the recipient.
Before deciding to support chunked-encoded requests, I've been using mitmproxy http://mitmproxy.org/ and it worked perfectly. Unfortunately, I noticed that it waits until the entire body is received before letting me handle the response/request.
How can I implement a proxy supporting chunked-encoded requests/responses? Has anyone of you ever done something like this?
Thanks
EDIT: MORE INFO ON MY USE CASE
I need to handle POST requests and GET responses.
In the POST request I receive a JSON object and I have to encrypt some of its values.
In the GET response I receive a JSON object and I have to decrypt some of its values.
Till now, the following code has worked perfectly:
    def handle_request(self, r):
        if r.method == 'POST':
            # encryption of r.get_form_urlencoded()
            pass

    def handle_response(self, r):
        if r.request.method == 'GET':
            # decryption of r.content
            pass
How can I do the same thing with single chunks?
EDIT: UPDATES
After evaluating different solutions, I decided to go for Squid (proxy) + ICAP (content adaptation).
I've successfully configured Squid and the performance is just great. Unfortunately, I can't find a suitable ICAP server (in Python, if possible) for doing content adaptation (modification). I thought this one https://github.com/netom/pyicap could do the job, but it looks like it doesn't read the body of my POST requests.
Do you guys know a Python ICAP server that I can use together with Squid?
Thanks
The answer below is outdated. You can now pass --stream to mitmproxy, whose behaviour is explained in the mitmproxy documentation.
mitmproxy developer here. This is definitely a feature we want for mitmproxy as well, but it's not that trivial and probably not coming very soon. If you really want to implement that yourself, I can recommend two things:
1. If you have a very specific use case, you can employ libmproxy.protocol.http.HTTPRequest.from_stream for parsing the header and do the body processing yourself.
2. If you do not want to modify the request/response body, you may find it sufficient to modify mitmproxy itself. In a nutshell, you would need to read the request/response without content (see 1.), modify it to your needs, pass it to the server and then delegate control to libmproxy.protocol.tcp (see https://github.com/mitmproxy/mitmproxy/blob/master/libmproxy/proxy/server.py#L169)
If you have further questions, don't hesitate to ask here or on mitmproxy's IRC channel.
Re Comment #1:
You can't take too much out of mitmproxy, but at least you get to delegate the header parsing & processing.
# ...accept request, socket.makefile() etc...
req = HTTPRequest.from_stream(client_conn.rfile, include_content=False)
# manually forward to the server (req._assemble_head())
# manually receive request body chunk by chunk and forward it to the server, see
# https://github.com/mitmproxy/netlib/blob/master/netlib/http.py#L98
resp = HTTPResponse.from_stream(server_conn.rfile, include_content=False)
# manually forward headers
# manually process body and forward
That being said, this is a fairly complex topic. Eventually, you're better off hacking that directly into libmproxy.protocol.http.HTTPHandler.
Another option, depending on your use case again: Use mitmproxy, set the conntype to tcp and forward traffic as-is and use regex replacements on the content in libmproxy.protocol.tcp . Probably the easiest way, but the most hacky one.
If you can provide some context, I may guide you further in the right direction.
Re Comment #2:
Before we get to the main part: JSON is a really bad choice for streaming/chunking as long as you don't want to encrypt the complete JSON object and treat it as a single string. You should definitely consider something like tnetstrings if you only want to encrypt parts.
Apart from that, hooking into read_chunk works, but first you need to get to the point where you can actually receive chunks over the line. Then, it's as simple as reading the single chunks, encrypting them and forwarding them.
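The "read a chunk, transform it, forward it" loop could be sketched independently of mitmproxy as follows. This is a minimal illustration of HTTP/1.1 chunked transfer coding over file-like objects, not libmproxy or netlib code: read_chunks, encode_chunk and relay are hypothetical names, and chunk extensions and trailers are simply discarded.

```python
def read_chunks(rfile):
    """Yield the payload of each chunk from a chunked-encoded body."""
    while True:
        size_line = rfile.readline()
        if not size_line:
            raise ValueError("stream ended before the terminating chunk")
        # The size is hexadecimal; anything after ';' is a chunk extension.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            # Last chunk: skip optional trailer lines up to the blank line.
            while rfile.readline() not in (b"\r\n", b"\n", b""):
                pass
            return
        data = rfile.read(size)
        if len(data) != size:
            raise ValueError("stream ended in the middle of a chunk")
        rfile.readline()  # consume the CRLF that follows the chunk data
        yield data


def encode_chunk(data):
    """Frame a payload as a single chunk."""
    return b"%x\r\n%s\r\n" % (len(data), data)


def relay(rfile, wfile, transform):
    """Forward a chunked body chunk by chunk, applying transform to each payload."""
    for chunk in read_chunks(rfile):
        out = transform(chunk)
        if out:  # an empty chunk would prematurely terminate the body
            wfile.write(encode_chunk(out))
    wfile.write(b"0\r\n\r\n")  # terminating chunk, no trailers
```

Here transform would be the encryption or decryption step. As noted above, chunk boundaries do not line up with JSON values, so a per-chunk transform only works if the encoding tolerates arbitrary splits.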

Best way to cache RESTful API results of GET calls

I'm thinking about the best way to create a cache layer in front or as first layer for GET requests to my RESTful API (written in Ruby).
Not every request can be cached, because even for some GET requests the API has to validate the requesting user / application. That means I need to configure which request is cacheable and how long each cached answer is valid. For a few cases I need a very short expiration time of e.g. 15s and below. And I should be able to let cache entries expire by the API application even if the expiration date is not reached yet.
I already thought about many possible solutions, my two best ideas:
first layer of the API (even before the routing), cache logic by myself (to have all configuration options in my hand), answers and expiration date stored to Memcached
a web server proxy (highly configurable), perhaps something like Squid, but I have never used a proxy for a case like this before and I'm absolutely not sure about it
I also thought about a cache solution like Varnish, I used Varnish for "usual" web applications and it's impressive but the configuration is kind of special. But I would use it if it's the fastest solution.
Another thought was to cache to the Solr index, which I'm already using in the data layer to avoid querying the database for most requests.
If someone has a hint or good sources to read about this topic, let me know.
Firstly, build your RESTful API to be RESTful. That means authenticated users can also get cached content: to keep all state in the URL, the URL needs to contain the auth details. Of course the hit rate will be lower here, but it is cacheable.
With a good deal of logged in users it will be very beneficial to have some sort of model cache behind a full page cache as many models are still shared even if some aren't (in a good OOP structure).
Then, for a full page cache, you are best off keeping all the requests off the web server, and especially away from the dynamic processing in the next step (in your case Ruby). The fastest way to cache full pages from a normal web server is always a caching proxy in front of the web servers.
Varnish is in my opinion as good and easy as it gets, but some prefer Squid indeed.
memcached is a great option, and I see you mentioned this already as a possible option. Also Redis seems to be praised a lot as another option at this level.
On an application level, in terms of a more granular approach to cache on a file by file and/or module basis, local storage is always an option for common objects a user may request over and over again, even as simple as just dropping response objects into session so that can be reused vs making another http rest call and coding appropriately.
Now people go back and forth debating about varnish vs squid, and both seem to have their pros and cons, so I can't comment on which one is better but many people say Varnish with a tuned apache server is great for dynamic websites.
Since REST is an HTTP thing, it could be that the best way of caching requests is to use HTTP caching.
Look into using ETags on your responses, checking the ETag in requests to reply with '304 Not Modified' and having Rack::Cache to serve cached data if the ETags are the same. This works great for cache-control 'public' content.
Rack::Cache is best configured to use memcache for its storage needs.
I wrote a blog post last week about the interesting way that Rack::Cache uses ETags to detect and return cached content to new clients: http://blog.craz8.com/articles/2012/12/19/rack-cache-and-etags-for-even-faster-rails
Even if you're not using Rails, the Rack middleware tools are quite good for this stuff.
Redis is the best option. It is an open-source, advanced key-value cache and store.
I’ve used redis successfully this way in my REST view:
    import hashlib
    import json
    import logging

    from django.conf import settings
    from django.utils.encoding import force_bytes
    from redis import StrictRedis
    from rest_framework.generics import ListAPIView
    from rest_framework.permissions import IsAdminUser
    from rest_framework.renderers import JSONRenderer
    from rest_framework.response import Response

    # Event, EventSerializer and get_key_from_request come from the app itself.
    logger = logging.getLogger(__name__)


    def get_redis():
        # get redis connection from RQ config in settings
        rc = settings.RQ_QUEUES['default']
        cache = StrictRedis(host=rc['HOST'], port=rc['PORT'], db=rc['DB'])
        return cache


    class EventList(ListAPIView):
        queryset = Event.objects.all()
        serializer_class = EventSerializer
        renderer_classes = (JSONRenderer, )

        def get(self, request, format=None):
            if IsAdminUser not in self.permission_classes:  # dont cache requests from admins
                # make a key that represents the request results you want to cache
                # your requirements may vary
                key = get_key_from_request()
                # I find it useful to hash the key, when query parms are added
                # I also preface event cache key with a string, so I can clear the cache
                # when events are changed
                key = "todaysevents" + hashlib.md5(force_bytes(key)).hexdigest()
                # I dont want any cache issues (such as not being able to connect to redis)
                # to affect my end users, so I protect this section
                try:
                    cache = get_redis()
                    data = cache.get(key)
                    if not data:
                        # not cached, so perform standard REST functions for this view
                        queryset = self.filter_queryset(self.get_queryset())
                        serializer = self.get_serializer(queryset, many=True)
                        data = serializer.data
                        # cache the data as a string
                        cache.set(key, json.dumps(data))
                        # manage the expiration of the cache
                        expire = 60 * 60 * 2
                        cache.expire(key, expire)
                    else:
                        # this is the place where you save all the time
                        # just return the cached data
                        data = json.loads(data)
                    return Response(data)
                except Exception as e:
                    logger.exception("Error accessing event cache\n %s" % (e))
            # for Admins or exceptions, BAU
            return super(EventList, self).get(request, format)
In my Event model updates, I clear any event caches. This is hardly ever performed (only admins create events, and not that often), so I always clear all event caches:
    class Event(models.Model):
        ...

        def clear_cache(self):
            try:
                cache = get_redis()
                eventkey = "todaysevents"
                for key in cache.scan_iter("%s*" % eventkey):
                    cache.delete(key)
            except Exception as e:
                pass

        def save(self, *args, **kwargs):
            self.clear_cache()
            return super(Event, self).save(*args, **kwargs)