URL shortener with no database

I'd like to write a URL shortener that doesn't have to use a database. Instead, to have as few moving parts as possible, the script would just create a unique hash for my URL based on an algorithm (like md5, except an md5 would be too long). I'm not really sure how I'd go about doing this. Any advice?
If it matters, I'd prefer to write this in Ruby.

What you need is a way to compress and decompress a string, where the resulting compressed version is a string too. This is nearly impossible, because a URL is already very short. Encoding and lossless compression always add some overhead, which for most URLs results in a string that is longer than the original.
For very long URLs, however, it may work.
So, in the end, you will almost always need a lookup-table in storage (database).
Base64 is the most logical solution. By itself, however, Base64 encoding returns longer strings than the original for short strings (which URLs generally are), mostly due to the padding. So we'll also try zlib, to compress the string.
require "uri"
require "base64"
require "zlib"
shortner_url = URI.parse("https://s.to")
long = "https://stackoverflow.com/questions/4818429/url-shortener-with-no-database"
url = URI.parse(long)
stripped = url.host + url.path
stripped.length #=> 66
# Let's see that Base64 on its own does not shorten the url.
encoded = Base64.encode64(stripped)
encoded.length #=> 90
# So, using zlib. To compress.
compressed = Zlib::Deflate.deflate(stripped)
encoded = Base64.encode64(compressed)
encoded.length #=> 94
# It became worse.
# Now, with a long URL (they can be even longer), in a one-liner; for simplicity we omit the stripping:
long = "http://www.thelongestlistofthelongeststuffatthelongestdomainnameatlonglast.com/wearejustdoingthistobestupidnowsincethiscangoonforeverandeverandeverbutitstilllookskindaneatinthebrowsereventhoughitsabigwasteoftimeandenergyandhasnorealpointbutwehadtodoitanyways.html"
long.length #=> 263
Base64.encode64(Zlib::Deflate.deflate(long)).length #=> 228
# In order to turn this into a valid short URL, however, we need `urlsafe_encode64()`
shortner_url.path = "/" + Base64.urlsafe_encode64(Zlib::Deflate.deflate(long))
shortner_url.to_s #=> "https://s.to/eJxNjkEWwyAIRG-U7HsbElFpEPIE68vti6t2BcwbZn51v1_7PufcvCKrFDRnMtf8u81HzuA_IWkDEoGG4EtiMN9ObftE6Pgey0FSvK6gIx7GTUl0GsmJSz1Biqpk7fjBDpL-xjGcopKYWfWyiySBRBFJABw9UnB9xaWj1LDCQWUGAQYzBVLECPbyxFLBJDqA7-DxSJ5YIbkGnoM8Ex7bqjf-AiodbYM="
shortner_url.to_s.length #=> 237 WE SAVED 26 characters!
Note on stripping: we can remove 'https://'. A real implementation would need to add a piece to the string to determine https or http: '1' + result for https, '0' + result for http. Another "hack" would be to make the URL-shortening service use http for http URLs and https for https URLs.
If you always have the same domain, you can discard the domain part too.
If you have a lot of slashes, or other repeating characters such as a dash, the compression works better.
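Putting those notes together, a rough sketch of the scheme flag plus stripping (using the same requires as above; the '1'/'0' prefix and the method names are just the convention suggested here, nothing standard):
def shorten_path(url)
  uri  = URI.parse(url)
  flag = uri.scheme == "https" ? "1" : "0"   # '1' for https, '0' for http
  Base64.urlsafe_encode64(Zlib::Deflate.deflate(flag + uri.host + uri.path))
end

def expand(path)
  decoded = Zlib::Inflate.inflate(Base64.urlsafe_decode64(path))
  scheme  = decoded[0] == "1" ? "https" : "http"
  scheme + "://" + decoded[1..-1]
end

# expand(shorten_path(long)) round-trips the stripped URL, scheme included.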

You could do this with several of the string manipulation tools available to transform a URL into something obscured; however, as you noted in your question, the URLs you get from doing this would be longer than is typical for a URL shortener.
URLs don't compress very well.

Ultimately if you're after a short link, you simply need to generate a suitably legible unique code (try to omit similar letters/numbers such as zero and 'o', in case some poor bugger actually has to type it in) and associate that code with the original URL in some form of store.
Whilst I can understand why you don't want to use a database, in many ways it's the perfect form of storage, especially if you look at one of the dedicated key/value stores such as Cassandra, Redis, MongoDB, etc. (That said, a simple "traditional" SQL database may be an easy first step if you're in unfamiliar territory.)
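For what it's worth, the store-backed core is tiny. A rough sketch in Ruby, with a plain in-memory Hash standing in for Redis/Cassandra/whatever store you end up using, and an alphabet that drops the look-alike characters mentioned above (the names here are just illustrative):
ALPHABET = ("a".."z").to_a + ("2".."9").to_a - ["l", "o"]  # no 0/1/l/o look-alikes
STORE    = {}                                               # stand-in for a real key/value store

def shorten(url, length = 6)
  code = Array.new(length) { ALPHABET.sample }.join
  return shorten(url, length) if STORE.key?(code)           # retry on the (rare) collision
  STORE[code] = url
  code
end

def resolve(code)
  STORE[code]
end
With 32 characters and 6 positions that's roughly a billion possible codes, which is plenty for a personal shortener.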

You won't be able to resolve the original URL from a hash code without looking it up in some kind of database.
About the only thing you can do without a database is compress the URL and then decompress it when you resolve the URL.
Strictly speaking, I guess you could just hash the URL. But of what possible value would that be if you are not able to resolve it back to the original URL?

Related

Amazon s3 URL + being encoded to %2?

I've got Amazon s3 integrated with my hosting account at WP Engine. Everything works great except when it comes to files with + characters in them.
For example in the following case when a file is named: test+2.pdf
http://support.mcsolutions.com/wp-content/uploads/2011/11/test+2.pdf = does not work.
The following URL is the amazon URL. Notice the + character is encoded. Is there a way to prevent/change this?
http://mcsolutionswpe.s3.amazonaws.com/mcsupport/wp-content/uploads/2011/11/test%2b2.pdf
Other URLs work fine:
Amazon -> http://mcsolutionswpe.s3.amazonaws.com/mcsupport/wp-content/uploads/2011/11/test2.pdf
Website -> http://support.mcsolutions.com/wp-content/uploads/2011/11/test2.pdf
If I understand your question correctly, then no, there is no way to really change this.
The cause appears to be an unfortunate design decision made on S3 many years ago -- which, of course, cannot be fixed, now, because it would break too many other things -- which involves S3 using an incorrect variant of URL-escaping (which includes but is not quite limited to "percent-encoding") in the path part of the URL, where the object's key is sent.
In the query string (the optional part of a URL after ? but before the fragment, if present, which begins with #), the + character is considered equivalent to [SPACE], (ASCII Dec 32, Hex 0x20).
...but in the path of a URL, this is not supposed to be the case.
...but in S3's implementation, it is.
So + doesn't actually mean +, it means [SPACE]... and therefore, + can't also mean +... which means that a different expression is required to convey + -- and that value is %2B, the url-escaped value of + (ASCII Dec 43, Hex 0x2B).
When you upload your files, the + is converted by the code you're using (assuming it understands this quirk, as apparently it does) into the format S3 expects (%2B)... and so it must also be requested using %2B when you download the files.
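To illustrate the encoding rule in Ruby terms (just a sketch; the bucket and key are simply the ones from the question), a path-style percent-encoder produces the %2B form S3 expects:
require "erb"

key = "mcsupport/wp-content/uploads/2011/11/test+2.pdf"
# Percent-encode each path segment; '+' becomes %2B, a space would become %20.
encoded = key.split("/").map { |seg| ERB::Util.url_encode(seg) }.join("/")
# => "mcsupport/wp-content/uploads/2011/11/test%2B2.pdf"
url = "http://mcsolutionswpe.s3.amazonaws.com/" + encoded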
Strangely, but not surprisingly, if you store the file in S3 with a space in the path, you can actually request it with a +, a space, or even %20, and all three of these should fetch the file. So if seeing the + in the path is what you want, you can sort of work around the issue by saving it with a space instead, though this workaround deserves to be described as a "hack" if ever a workaround did. This tactic will not work with libraries that generate pre-signed GET URLs, unless they are specifically designed to ignore the standard behavior of S3 and do what you want instead... but for public links, it should be essentially equivalent.

Possible to use apache mod_autoindex and mod_dir to return directory listing via AJAX?

This probably sounds silly, after all I could generate the file listing via PHP, right?
But I am becoming more and more fascinated with what all can be accomplished with just Apache and JQuery alone. I've been reading documentation and it seems like things are SO VERY close, but I am obviously missing a few things.
First, can I set a directory listing to a "path" or file name, overriding the default "index.html"? In particular, I am trying to configure any request ending in "ndx.mnu" to return the directory listing:
"DirectoryIndex ndx.mnu"
...does not accomplish that. Any ideas?
Second, does anyone know of a way to impose a numerical sort similar to the way in which VersionSort works for files? Right now:
"foo-1, foo-2"
sorts correctly, but what if I want to force:
"foo-1, bar-2"
to be the order returned?
Trying to make something with as few moving parts as possible. Any pointers to read up would be appreciated.
Well, for the second part, you want to sort by the number rather than the letters, correct? You should be able to read the string backwards and sort from end to beginning. Using strrev() to reverse it, you can write a sorting algorithm to do that.
Or, if all the files use the '-#' notation, then $num = explode('-', $string); and sort by $num[1] (which should be the number on the end), though if some file names contain multiple '-' characters you could use regular expressions.
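The same split-on-'-' idea, sketched in Ruby rather than PHP just to show the sort key (the file names are hypothetical):
files = ["foo-1", "bar-2", "baz-10"]
# Sort by the number after the last '-', not by the letters in front of it.
sorted = files.sort_by { |name| name[/-(\d+)\z/, 1].to_i }
# => ["foo-1", "bar-2", "baz-10"]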

Caucho Resin Digest Authentication with CustomAuthenticator, someone please enlighten me

OK, after experimenting a little bit I found out that Resin was calling my AbstractAuthenticator implementation's "authenticate" method that takes an HttpDigestCredentials object instead of DigestCredentials (I still don't know when each one of them is called). The problem is that HttpDigestCredentials doesn't have a getDigest() method; instead it has a getResponse() method, which doesn't return a hash, or at least not a comparable one.
After creating my own hash of [[user:realm:password] [nonce] [method:uri]], the hash is very different; in fact I think getResponse() does not return the digest but maybe the server's response to the browser?
Anyway, this is my debugging log:
USER:user:PASSWORD:password:REALM:resin:METHOD:GET:URI/appe/appe.html:NONCE:HsJzN+j+GQD:CNONCE:b1ad4fa1ba857cac88c202e64528bc0c:CLIENTDIGEST:[B#5dcd8bf7:SERVERDIGEST:I4DkRCh21YG2Mk14iTe+hg==
as you can see, the supposed client nonce is very different from the server-generated nonce; in fact the client nonce doesn't look like an MD5 hash at all.
Please, has someone done this before? Is there something missing in HttpDigestCredentials? I know Digest is barely used.
Please, I know about SSL but I can't have an SSL certificate just yet so don't tell me "Why don't you use SSL". ;)
Update:
Not sure if it was the right thing to do, but as I had read that Resin uses base64 format for hashes, I used Apache commons-codec-1.6's encodeBase64String() method, and now the hashes look alike but they are not the same.
I tried both
passwordDigest.getPasswordDigest(a1+':'+nonce+':'+a2);
passwordDigest.getPasswordDigest(a1+':'+nonce+':'+ncount+':'+cnonce+':'+qop+':'+a2);
and neither of them gives the same hash as the one from HttpDigestCredentials.
OK, I finally made it. Weird subject, huh? Only two views?
First, digest authentication makes use of user, password, realm, nonce, client_nonce, nonce_count, method, qop, and uri. Basically it uses the full digest spec. So in order to calculate the hash, one must calculate it with all the bells and whistles. It's just a matter of calling the get method for each one of the variables from HttpDigestCredentials, except for user and password. The user will come in the form of a Principal, and the password you must look up yourself in your DB (in my case a DB4O database).
Then you must create a PasswordDigest object, which will take care of generating a hash with the getPasswordDigest() method, but first one must set the format to hex with passwordDigestObject.setFormat("hex").
There is one form for HA1, getPasswordDigest(user, password, realm), and there is another getPasswordDigest() method that takes just one string; one can use it to generate the rest of the hashes, both HA2 and, with the previously hashed HA1, the final hash, of course with the nonce, nonce_count, client_nonce and qop, each one separated by a colon.
Then comes the tricky part: although Resin works with base64 encoding for Digest, when you call the getResponse() method from HttpDigestCredentials it returns a byte array (which is weird). So in order to compare it with your hash, what I did was use the Hex.encodeHexString() method from org.apache.commons.codec.binary.Hex and pass it the HttpDigestCredentials getResponse() return value, and that will give a nice hex string to compare.
I was doing it the other way around: I was using the Base64 hash from PasswordDigest and converting the HttpDigestCredentials hash to Base64, and the resulting strings were never the same.
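For anyone following along, the hash chain being described is the standard RFC 2617 digest computation. A rough sketch in Ruby (plain Digest::MD5 rather than Resin's PasswordDigest API; the nc and qop values here are just assumed typical ones, the rest comes from the debug log above):
require "digest"

user, realm, password = "user", "resin", "password"               # from the debug log
method, uri           = "GET", "/appe/appe.html"
nonce                 = "HsJzN+j+GQD"
cnonce                = "b1ad4fa1ba857cac88c202e64528bc0c"
nc, qop               = "00000001", "auth"                        # assumed values

ha1 = Digest::MD5.hexdigest("#{user}:#{realm}:#{password}")       # HA1
ha2 = Digest::MD5.hexdigest("#{method}:#{uri}")                   # HA2
# Every component of the final response is separated by a colon:
response = Digest::MD5.hexdigest("#{ha1}:#{nonce}:#{nc}:#{cnonce}:#{qop}:#{ha2}")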

TSearch2 - dots explosion

Following conversion
SELECT to_tsvector('english', 'Google.com');
returns this:
'google.com':1
Why didn't the TSearch2 engine return something like this?
'google':2, 'com':1
Or how can I make the engine return the exploded string as I wrote above?
I just need "Google.com" to be findable by "google".
Unfortunately, there is no quick and easy solution.
Denis is correct in that the parser is recognizing it as a hostname, which is why it doesn't break it up.
There are 3 other things you can do, off the top of my head.
You can disable the host parsing in the database. See the Postgres documentation for details, e.g. something like:
ALTER TEXT SEARCH CONFIGURATION your_parser_config
DROP MAPPING FOR url, url_path;
You can write your own custom dictionary.
You can pre-parse your data before it's inserted into the database in some manner (maybe splitting all domains before going into the database).
I had a similar issue to you last year and opted for solution (2), above.
My solution was to write a custom dictionary that splits words up on non-word characters. A custom dictionary is a lot easier & quicker to write than a new parser. You still have to write C tho :)
The dictionary I wrote would return something like 'www.facebook.com':4, 'com':3, 'facebook':2, 'www':1' for the 'www.facebook.com' domain (we had a unique-ish scenario, hence the 4 results instead of 3).
The trouble with a custom dictionary is that you will no longer get stemming (ie: www.books.com will come out as www, books and com). I believe there is some work (which may have been completed) to allow chaining of dictionaries which would solve this problem.
First off in case you're not aware, tsearch2 is deprecated in favor of the built-in functionality:
http://www.postgresql.org/docs/9/static/textsearch.html
As for your actual question, google.com gets recognized as a host by the parser:
http://www.postgresql.org/docs/9.0/static/textsearch-parsers.html
If you don't want this to occur, you'll need to pre-process your text accordingly (or use a custom parser).
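For completeness, a small Ruby sketch of the pre-processing idea (splitting host-like strings on non-word characters before they go into to_tsvector; the exact splitting rule is up to you):
text   = "Google.com"
tokens = text.downcase.split(/\W+/)          # => ["google", "com"]
# Index the original value plus the exploded tokens, e.g. feed
# "Google.com google com" to to_tsvector instead of the raw value.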

String version of term_to_binary

I'm trying to write a simple server that talks to clients via tcp. I have it sending messages around just fine, but now I want it to interpret the messages as Erlang data types. For example, pretend it's HTTP-like (it's not) and that I want to send from the client {get, "/foo.html"} and have the server interpret that as a tuple containing an atom and a list, instead of just a big list or binary.
I will probably end up using term_to_binary and binary_to_term, but debugging text-based protocols is so much easier that I was hoping to find a more list-friendly version. Is there one hiding somewhere?
You can parse a string as an expression (similar to file:consult) via:
% InputString = "...",
{ok, Scanned, _} = erl_scan:string(InputString),
{ok, Exprs} = erl_parse:parse_exprs(Scanned),
{value, ParsedValue, _} = erl_eval:exprs(Exprs, [])
(See http://www.trapexit.org/String_Eval)
You should be able to use io_lib:format to convert an expression to a string using the ~w or ~p format codes, such as io_lib:format("~w", [{get, "/foo.html"}]).
I don't think this will be very fast, so if performance is an issue you should probably not use strings like this.
Also note that this is potentially unsafe since you're evaluating arbitrary expressions -- if you go this route, you should probably do some checks on the intermediate output. I'd suggest looking at the result of erl_parse:parse_exprs to make sure it contains the formats you're interested in (i.e., it's always a tuple of {atom(), list()}) with no embedded function calls. You should be able to do this via pattern matching.