string searching / wild-card matching - file-io

I've been working on a relatively small project for my company to play with. It's basically a proxy in node.js, and its features at the moment are relatively simple:
Caching
Http(s)
Blacklist
Configurable
etc.
I'm at the stage where I'm building the blacklisting system. My blacklist is a plain text file with one blacklisted site per line.
The blacklist should support the following types of values:
google.com
google.com/path
ww2.google.com/path
202.55.66.201
202.55.66.[100-200]
Within node.js, when a request comes in, I have the requested URL from the client. This is looked up in the IP cache file; if it does not exist there, the host gets pinged and I get the IP for that request.
So I have a few bits of information at hand: the domain, the IP, and the port.
Now the problem is finding the fastest way to check these values against the file-based blacklist.
As these values are not direct lookups, I'm not sure whether putting them into an object and doing the following would work:
if(ip in blacklist || domain in blacklist || fullUri in blacklist)
{
//block
}
Even if I did that, it would not really be beneficial, as I can't check IP ranges etc.; it lacks support for the more demanding site blacklisting techniques.
I was thinking of some sort of database system, but that is something I wanted to avoid. So basically what I'm asking is: is there some way to perform wild-card lookups on a data file without causing too much overhead?

I think the most efficient way would be to loop over each line of the file and compare it against your information, which would also allow pattern matching. In pseudocode:
for each line in file
    if line equals ip or line equals domain or line matches 134.567.987.[0-9]{1,3}
        then block and break

You can load the file when your node.js process boots, then process the whole file and separate it into 3 arrays (IPs, domains and ports).
Searching elements in memory is fast.
You can then have a setInterval that reloads the contents of the file into memory to pick up the latest blacklist.
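As a rough illustration of the in-memory approach, here is a minimal sketch in Python (the proxy in the question is node.js; the idea carries over). The class name `Blacklist`, the method `is_blocked` and the matching rules are assumptions made for this example: hosts and single IPs are exact matches, a `domain/path` entry blocks that path and anything below it, and only the `a.b.c.[lo-hi]` range syntax from the question is parsed.

```python
import ipaddress
import re

# Matches entries such as 202.55.66.[100-200] (the range syntax from the question).
RANGE_RE = re.compile(r"^(\d{1,3}\.\d{1,3}\.\d{1,3})\.\[(\d{1,3})-(\d{1,3})\]$")


class Blacklist:
    """Hypothetical in-memory blacklist built from the lines of the file."""

    def __init__(self, lines):
        self.exact = set()     # domains, domain/path entries and single IPs
        self.ip_ranges = []    # (first, last) pairs stored as integers
        for raw in lines:
            entry = raw.strip().lower()
            if not entry:
                continue
            match = RANGE_RE.match(entry)
            if match:
                prefix, low, high = match.groups()
                first = int(ipaddress.IPv4Address(f"{prefix}.{low}"))
                last = int(ipaddress.IPv4Address(f"{prefix}.{high}"))
                self.ip_ranges.append((first, last))
            else:
                self.exact.add(entry.rstrip("/"))

    def is_blocked(self, domain, path, ip):
        domain = domain.lower()
        # Exact host or exact IP: a single set lookup each.
        if domain in self.exact or ip in self.exact:
            return True
        # Assumed rule: "google.com/path" also blocks "google.com/path/sub".
        candidate = domain
        for segment in (s for s in path.lower().split("/") if s):
            candidate += "/" + segment
            if candidate in self.exact:
                return True
        # IP ranges: compare the address as an integer against each range.
        ip_int = int(ipaddress.IPv4Address(ip))
        return any(first <= ip_int <= last for first, last in self.ip_ranges)
```

The object would be built once at startup, for example with `Blacklist(open("blacklist.txt"))` (a hypothetical file name), and rebuilt on a timer to pick up changes. Exact entries cost one set lookup; only the ranges are scanned linearly, which should stay cheap unless there are very many of them.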

Related

Can caching an API based on the hash of the whole URL be a potential threat?

I am adding caching to an API server. The caching system of the framework I am using simply hashes the whole URL of the request and uses it as the cache key (along with some other data such as language, etc.).
But with this simple system I can add fake query parameters to the url with arbitrary values and my system will also cache those requests. For example:
GET https://example.com/apicall # Cached
GET https://example.com/apicall?fake=1 # Also cached
GET https://example.com/apicall?fake=2 # Also cached!
Is this something I should worry about? That people can very easily fill my cache with junk entries that aren't used? Or am I exaggerating the potential impact of this?
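To make the concern concrete, here is a minimal sketch in Python. `naive_cache_key` mimics hashing the whole URL as described; `normalized_cache_key` shows one possible mitigation, which is an assumption of this example and not something the framework is known to provide: only query parameters the endpoint actually understands take part in the key. Both function names and the `allowed_params` argument are hypothetical.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit


def naive_cache_key(url):
    # Hash of the whole URL: every distinct query string gets its own entry.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def normalized_cache_key(url, allowed_params):
    # Keep only known parameters, sorted, so junk parameters cannot create new keys.
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in allowed_params
    )
    canonical = f"{parts.scheme}://{parts.netloc}{parts.path}?{urlencode(kept)}"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With the three requests above, `naive_cache_key` yields three different keys, while `normalized_cache_key(url, {"page"})` yields the same key for all of them, because `fake` is not in the allowed set.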

"+having+" in $GET/$POST causes server to return 403 Forbidden

One of my clients has a PHP script that kept crashing inexplicably. After hours of research, I determined if you send any PHP script a variable (either through GET or POST) that contains " having t", or escaped for the URL "+having+t", it crashes the script and returns a "403 forbidden error". To test it, I made a sample script with the entire contents:
<?php echo "works";
I put it live (temporarily) here: http://primecarerefer.com/test/test.php
Now if you try sending it some data like: http://primecarerefer.com/test/test.php?x=+having+x
It fails. The last letter can be any letter and it will still crash, but changing any other letter makes the script load fine. What would cause this and how can it be fixed? The link is live for now if anyone wants to try out different combinations.
PS - I found that if I get the 403 error a bunch of times in a row, the server blocks me for 15 minutes.
I had this type of issue on a web server that ran Apache mod_security, but it was very poorly configured. In fact, mod_security has very bad default regex rules, which are very easy to trip with valid POST or GET data.
To be clear, this has nothing to do with PHP or HTML. It is almost certainly about POST and GET data passing through mod_security, and mod_security rejecting the request because it believes it is an SQL injection attempt.
You can edit the rules yourself, depending on your server access, but if it is mod_security, I know you can't do anything via PHP to get around this.
/etc/httpd/conf.d/mod_security.conf (old path, it's changed, but it gives the idea)
Examples of the default rules:
SecFilter "delete[[:space:]]+from"
SecFilter "insert[[:space:]]+into"
SecFilter "select.+from"
These are samples of the rules
https://www.howtoforge.com/apache_mod_security
here they trip the filter:
http://primecarerefer.com/test/test.php?x=%20%22%20%20select%20from%22
Note that the article is very old and the rules are structured quite differently now, but the bad regex remains, i.e. select[any number of characters, no matter how far removed or close]from will trip it. Any SQL that matches these loose rules will trip it.
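To see how loose these sample patterns are, here is a small Python sketch that runs them against ordinary text. It only illustrates the regular expressions themselves, not how mod_security evaluates a request; the POSIX class `[[:space:]]` is written as `\s` because Python's `re` module does not support POSIX classes, and the helper name `tripped_rules` is made up for this example.

```python
import re

# Python translations of the three sample rules quoted above.
RULES = [
    re.compile(r"delete\s+from"),
    re.compile(r"insert\s+into"),
    re.compile(r"select.+from"),
]


def tripped_rules(text):
    """Return the patterns that match anywhere in the given text."""
    return [rule.pattern for rule in RULES if rule.search(text)]


# Harmless prose trips the "select.+from" rule because ".+" spans any text.
print(tripped_rules("please select a colour from the list"))
# The decoded value of the test URL above trips the same rule.
print(tripped_rules(' "  select from"'))
```

The first two rules only match when nothing but whitespace separates the keywords, so it is the `.+` in the third one that makes ordinary sentences look like SQL.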
But editing those default files requires access to them, and also assumes they won't be overwritten in a later upgrade of Apache mod_security, so I found it's not a good way to fix the problem. Moving to a better, more professionally set up host fixed those issues for us. But it does help, when you talk to the hosting support, to know what the cause of the issue is.
In fact 'having' is not irrelevant at all; it's part of the SQL injection filters in the regex rules that the security filters run on POST/GET data. We used to hit this all the time when admins would edit CMS pages, which would invariably trigger some SQL filter, since any string of human words would invariably contain something like 'select.*from' or 'insert.*into' etc.
This mod_security issue used to drive me bonkers trying to debug why backend edit form updates would just hang, until I finally realized it was badly done generic regex patterns in the mod_security file itself.
In a sense, this isn't an answer, because the only fix is going into the server and either editing the rules file (which is pretty easy), disabling mod_security, or moving to a web host that doesn't use those bad generic defaults.

returning absolute vs relative URIs in REST API

Suppose the DogManagementPro program is an application written in a client/server architecture, where the customer who buys it is supposed to run the server on their own PC and access it either locally or remotely.
Suppose I want to support a "list all dogs" operation in the DogManagementPro REST API.
So a GET to http://localhost/DogManagerPro/api/dogs should fetch the following response:
<dogs>
<dog>http://localhost/DogManagerPro/api/dogs/ralf</dog>
<dog>http://localhost/DogManagerPro/api/dogs/sparky</dog>
</dogs>
Now suppose I want to access it remotely on my local LAN (the local IP of my machine is 192.168.0.33).
What should a GET to http://192.168.0.33:1234/DogManagerPro/api/dogs fetch?
Should it be:
<dogs>
<dog>http://localhost/DogManagerPro/api/dogs/ralf</dog>
<dog>http://localhost/DogManagerPro/api/dogs/sparky</dog>
</dogs>
or perhaps:
<dogs>
<dog>http://192.168.0.33/DogManagerPro/api/dogs/ralf</dog>
<dog>http://192.168.0.33/DogManagerPro/api/dogs/sparky</dog>
</dogs>
?
Some people argue that I should sidestep the problem altogether by returning just a path element, like so:
<dogs>
<dog>/DogManagerPro/api/dogs/ralf</dog>
<dog>/DogManagerPro/api/dogs/sparky</dog>
</dogs>
What is the best way?
I've personally always used non-absolute URLs. It solves a few other problems as well, such as reverse / caching proxies.
It's a bit more complicated for the client though, and if they want to store the document as-is, it may imply they also need to store the base URL, or expand the inner URLs.
If you do choose to go the full-URL route, I would not recommend using HTTP_HOST; instead, set up multiple vhosts with an environment variable and use that.
This solves the issue if you later need proxies in front of your origin server.
I would say absolute URLs, created based on the Host header that the client sent:
<dogs>
<dog>http://192.168.0.33:1234/DogManagerPro/api/dogs/ralf</dog>
<dog>http://192.168.0.33:1234/DogManagerPro/api/dogs/sparky</dog>
</dogs>
The returned URIs should be something the client is able to resolve.
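The two options discussed above can be sketched in a few lines of Python. The function names `absolute_dog_uris` and `resolve_relative` are made up for this example, and the base path is taken from the URLs in the question; the first builds absolute URIs on the server from the Host header, the second shows how a client would resolve a path-only URI against the URL it requested.

```python
from urllib.parse import quote, urljoin


def absolute_dog_uris(host_header, names, scheme="http",
                      base_path="/DogManagerPro/api/dogs/"):
    # Server side: echo back whatever host (and port) the client used.
    return [f"{scheme}://{host_header}{base_path}{quote(name)}" for name in names]


def resolve_relative(request_url, relative_uri):
    # Client side: resolve a path-only URI against the URL that was requested.
    return urljoin(request_url, relative_uri)


print(absolute_dog_uris("192.168.0.33:1234", ["ralf", "sparky"]))
print(resolve_relative("http://192.168.0.33:1234/DogManagerPro/api/dogs",
                       "/DogManagerPro/api/dogs/ralf"))
```

Both routes end up at the same resolvable URI. The difference is who does the work, and whether a proxy in front of the server can change the host that the client sees.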

Server with the sole purpose of setting cookies

At work we ran up against the problem of setting server-side cookies - a lot of them. Right now we have a PHP script, the sole purpose of which is to set a cookie on the client for our domain. This happens a lot more than 'normal' requests to the server (which is running an app), so we've discussed moving it to its own server. This would be an Apache server, probably dedicated, with one PHP script 3 lines long, just running over and over again.
Surely there must be a faster, better way of doing this, rather than starting up the whole PHP environment. Basically, I need something super simple that can sit around all day/night doing the following:
Check if a certain cookie is set, and
If that cookie is not set, fill it with a random hash (right now it's a simple md5(microtime))
Any suggestions?
You could create a simple HTTP server yourself to accept requests and return the Set-Cookie header with an empty body. This would allow you to move the cookie generation overhead to wherever you see fit.
I echo the sentiments above though: unless cookie generation is significantly expensive, I don't think you will gain much by moving from your current setup.
By way of an example, here is an extremely simple server written with Tornado that simply sets a cookie on GET or HEAD requests to '/'. It includes an async example listening for '/async' which may be of use depending on what you are doing to get your cookie value.
import time

import tornado.gen
import tornado.ioloop
import tornado.web


class CookieHandler(tornado.web.RequestHandler):
    def get(self):
        cookie_value = str(time.time())
        self.set_cookie('a_nice_cookie', cookie_value, expires_days=10)
        # self.set_secure_cookie('a_double_choc_cookie', cookie_value)
        self.finish()

    def head(self):
        return self.get()


class AsyncCookieHandler(tornado.web.RequestHandler):
    # The original used @tornado.web.asynchronous with callbacks. That
    # decorator was removed in Tornado 6, so coroutines are used instead.
    async def get(self):
        cookie_value = await self._calculate_cookie_value()
        self.set_cookie('double_choc_cookie', cookie_value, expires_days=10)
        self.finish()

    async def head(self):
        await self.get()

    async def _calculate_cookie_value(self):
        # Meaningless async example: just waits 2 seconds without blocking.
        await tornado.gen.sleep(2)
        return str(time.time())


application = tornado.web.Application([
    (r"/", CookieHandler),
    (r"/async", AsyncCookieHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.current().start()
Launch this process with Supervisord and you'll have a simple, fast, low-overhead server that sets cookies.
You could try using mod_headers (usually available in the default install) to manually construct a Set-Cookie header and emit it -- no programming needed as long as it's the same cookie every time. Something like this could work in an .htaccess file:
Header add Set-Cookie "foo=bar; Path=/; Domain=.foo.com; Expires=Sun, 06 May 2012 00:00:00 GMT"
However, this won't work for you. There's no code here. It's just a stupid header. It can't come up with the new random value you'd want, and it can't adjust the expire date as is standard practice.
This would be an Apache server, probably dedicated, with one PHP script 3 lines long, just running over and over again. [...] Surely there must be a faster, better way of doing this, rather than starting up the whole PHP environment.
Are you using APC or another bytecode cache? If so, there's almost no startup cost. Because you're talking about setting up an entire server just for this, it sounds like you control the server as well. This means that you can turn off apc.stat for even less of a startup hit.
Really though, if all that script is doing is building an md5 hash and setting a cookie, it should already be blisteringly fast, especially if it's mod_php. Do you already know, through benchmarking and testing, that the script isn't performing as well as you'd like? If so, can you share those benchmarks with us?
It would be interesting to know why you think you need an extra server. Do you actually have a bottleneck in generating the cookie, or is it somewhere else? Is it the log writing, as requests happen a lot? Ajax polling? Client download speed?
At least for starters, I'd look for something more efficient than fetching the time to generate the "random hash". For example, on this Intel i7 laptop I have, generating 999999 md5 hashes from microtime takes roughly 4 seconds, and doing the same thing with random numbers is a second faster (not taking the seeding of rand into account).
Then, if you take the opening and closing of a socket into account, just moving your script (which is most likely already really fast, though I don't know how your pages use it) will actually end up slowing down the requests. Actually, now that I've re-read your question, it makes me think that your cookie setter script is already a dedicated page? Or do you just "include" it in real content served by another PHP script? If not, try that approach. This would also be beneficial if you have default logging rules for Apache: if cookies are set on their own page, Apache will log a row for each such request, and on high-load systems this adds up in the total I/O time spent by Apache.
Also, consider that testing whether the cookie is set and then setting it might be slower than just always setting it, whether or not the cookie exists.
But overall, I don't think you need to set up a server just to offload cookie generation, at least not without knowing more about how you handle the cookies now. Unless you are doing something really nasty.
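The point above about how the value is generated can be tried out with a short Python sketch. This is not the PHP benchmark quoted above, and it uses a swapped-in technique for the random variant: `token_from_time` mirrors md5(microtime), while `token_from_random` draws 16 random bytes from the `secrets` module instead of hashing a random number. Both function names are made up for this example, and no timing result is claimed here.

```python
import hashlib
import secrets
import time
import timeit


def token_from_time():
    # Equivalent in spirit to md5(microtime): predictable, not really random.
    return hashlib.md5(str(time.time()).encode("ascii")).hexdigest()


def token_from_random():
    # 16 random bytes rendered as 32 hex characters, no hashing step needed.
    return secrets.token_hex(16)


if __name__ == "__main__":
    for func in (token_from_time, token_from_random):
        seconds = timeit.timeit(func, number=100_000)
        print(f"{func.__name__}: {seconds:.3f}s for 100000 tokens")
```

Whichever comes out faster on a given machine, the difference per request is likely to be tiny next to the cost of handling the HTTP request itself, which matches the argument above.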
Apache has a module called mod_usertrack which looks like it might do exactly what you want. There's no need for PHP and you could likely create a really optimised lightweight Apache config to serve this with.
If you want to go for something even faster and are happy to not use Apache, you could use lighttpd and its mod_usertrack, or nginx's HttpUserId module.

Why would Apache be URL decoding my query string?

My Web host has refused to help me with this, so I'm coming to the wise folks here for some help "black-box debugging". Here's an edited version of what I sent to them:
I have two (among other) domains at dreamhost:
1) thefigtrees.net
2) shouldivoteformccain.com
I noticed today that when I host a CGI script on #1, that by the time the
CGI script runs, the HTTP GET query string passed to it as the QUERY_STRING
environment variable has already been URL decoded. This is a problem because
it then means that a standard CGI library (such as perl's CGI.pm) will try to
split on ampersands and then decode the string itself. There are two
potential problems with this:
1) the string is doubly-decoded, so if a value is submitted to the script
such as "%2525", it will end up being treated as just "%" (decoded twice)
rather than "%25" (decoded once)
2) (more common) if there is an ampersand in a value submitted, then it
will get (properly) submitted as %26, but the QUERY_STRING env. variable will
have it already decoded into an "&" and then the CGI library will improperly
split the query string at that ampersand. This is a big problem!
The script at http://thefigtrees.net/test.cgi demonstrates this. It echoes back the
environment variables it is called with. Navigating in a browser to:
http://thefigtrees.net/lee/test.cgi?x=y%26z
You can see that REQUEST_URI properly contains x=y%26z (still encoded) but that
QUERY_STRING already has it decoded to x=y&z.
If I repeat the test at domain #2 (
http://www.shouldivoteformccain.com/test.cgi?x=y%26z ) I see that the
QUERY_STRING remains undecoded, so that CGI.pm then splits and decodes
correctly.
I tried disabling my .htaccess files on both to make sure that was not the
problem, and saw no difference.
Could anyone speculate on potential causes of this, since my Web host seems unwilling to help me?
thanks,
Lee
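The two problems described in the message can be reproduced with Python's standard library, used here only to stand in for what a CGI library such as CGI.pm does with QUERY_STRING. The variable names are made up for this example.

```python
from urllib.parse import parse_qs, unquote

# Problem 1: a value that is decoded twice loses one level of encoding.
decoded_once = unquote("%2525")
decoded_twice = unquote(decoded_once)
print(decoded_once, decoded_twice)

# Problem 2: an already-decoded ampersand is treated as a field separator.
correct = parse_qs("x=y%26z", keep_blank_values=True)
broken = parse_qs("x=y&z", keep_blank_values=True)
print(correct)
print(broken)
```

In the correct case the parser sees a single field `x` with the value `y&z`. In the broken case the same request turns into a field `x` with the value `y` plus a stray, empty field `z`.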
I have the same behavior in Apache.
I believe mod_rewrite will automatically decode the URL if it is installed, however, I have seen the auto-decode behavior even without it. I haven't tracked down the other culprit.
A common workaround is to double encode the input parameter (taking advantage of URL decoding being safe when called on an unencoded URL).
Curious. Nothing I can see from here would give us a clue why this would happen... I can only confirm that it is an environment bug and suspect maybe configuration differences like maybe rewrite rules.
Per CGI 1.1, this decoding should only happen to SCRIPT_NAME and PATH_INFO, not QUERY_STRING. It's pointless and annoying that it happens at all, but that's the spec. Using REQUEST_URI instead of those variables where available (e.g. Apache) is a common workaround for places where you want to put out-of-bounds and Unicode characters in path parts, so it might be reasonable to do the same for query strings until some sort of resolution is available from the host.
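A minimal sketch of that workaround in Python, assuming the server sets REQUEST_URI in the CGI environment (Apache does, but it is not part of CGI 1.1). The helper name `raw_query_string` is made up for this example, and it falls back to QUERY_STRING when REQUEST_URI is missing.

```python
from urllib.parse import parse_qs, urlsplit


def raw_query_string(environ):
    """Prefer the still-encoded query from REQUEST_URI over QUERY_STRING."""
    request_uri = environ.get("REQUEST_URI")
    if request_uri is None:
        # REQUEST_URI is a server extension; fall back to the standard variable.
        return environ.get("QUERY_STRING", "")
    return urlsplit(request_uri).query


# Environment as reported by the test script on the misbehaving host.
environ = {"REQUEST_URI": "/lee/test.cgi?x=y%26z", "QUERY_STRING": "x=y&z"}
print(parse_qs(raw_query_string(environ)))
```

In a real CGI script the dictionary would be `os.environ`. The script then parses the query itself instead of handing the pre-decoded QUERY_STRING to the CGI library.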
VPSs are cheap these days...