Amazon S3 URL + being encoded to %2B - amazon-s3

I've got Amazon S3 integrated with my hosting account at WP Engine. Everything works great except when it comes to files with + characters in them.
For example in the following case when a file is named: test+2.pdf
http://support.mcsolutions.com/wp-content/uploads/2011/11/test+2.pdf = does not work.
The following URL is the Amazon URL. Notice the + character is encoded. Is there a way to prevent/change this?
http://mcsolutionswpe.s3.amazonaws.com/mcsupport/wp-content/uploads/2011/11/test%2b2.pdf
Other URLs work fine:
Amazon -> http://mcsolutionswpe.s3.amazonaws.com/mcsupport/wp-content/uploads/2011/11/test2.pdf
Website -> http://support.mcsolutions.com/wp-content/uploads/2011/11/test2.pdf

If I understand your question correctly, then no, there is no way to really change this.
The cause appears to be an unfortunate design decision made in S3 many years ago, one that cannot be fixed now because it would break too many other things: S3 uses an incorrect variant of URL-escaping (which includes, but is not quite limited to, "percent-encoding") in the path part of the URL, where the object's key is sent.
In the query string (the optional part of a URL after ? but before the fragment, which, if present, begins with #), the + character is considered equivalent to [SPACE] (ASCII Dec 32, Hex 0x20).
...but in the path of a URL, this is not supposed to be the case.
...but in S3's implementation, it is.
So + doesn't actually mean +, it means [SPACE]... and therefore, + can't also mean +... which means that a different expression is required to convey +, and that value is %2B, the URL-escaped value of + (ASCII Dec 43, Hex 0x2B).
When you upload your files, the + is converted by the code you're using (assuming it understands this quirk, as apparently it does) into the format S3 expects (%2B)... and so it must be requested using %2B when you download the files.
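As a minimal illustration (a sketch in Python, using only the standard library, with the bucket and key taken from the question), this is the escaping a client has to apply to the key before requesting it:

from urllib.parse import quote

key = "mcsupport/wp-content/uploads/2011/11/test+2.pdf"
# quote() leaves "/" alone by default but turns "+" into "%2B",
# which is the form S3 expects in the path
url = "http://mcsolutionswpe.s3.amazonaws.com/" + quote(key)
print(url)  # ...test%2B2.pdf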
Strangely, but not surprisingly, if you store the file in S3 with a space in the path, you can actually request it with a +, a space, or even %20, and all three should fetch the file. So if seeing the + in the path is what you want, you can sort of work around the issue by saving the file with a space instead, though this workaround deserves to be called a "hack" if ever a workaround did. This tactic will not work with libraries that generate pre-signed GET URLs, unless they are specifically designed to ignore S3's standard behavior and do what you want instead... but for public links, it should be essentially equivalent.

Related

Twisted.web File directory listing issues

I'm trying to use Twisted in a web-app, and I'm coming across an interesting issue. I'm very new to Twisted, so I'm not sure if I'm seeing a bug in Twisted, or if I just am not using it correctly.
In theory, going by the example, a File resource object can be used both to serve files from a directory and to provide the directory listing. So, assuming I have the variables (port, reportsDir) defined elsewhere before the code snippet, I do the following:
from twisted.internet import reactor
from twisted.web.resource import Resource
from twisted.web.server import Site
from twisted.web.static import File
rootResource = Resource()
rootResource.putChild("reports", File(reportsDir))
reactor.listenTCP(port, Site(rootResource))
reactor.run(installSignalHandlers=False)
Now, when I access '/reports' on my host I get a message "Request did not return bytes" in my browser, with a bunch of stuff that was obviously produced by Twisted, but also a print of a u'.....' string literal, which in fact has the directory listing in it. So the DirectoryLister is obviously creating the listing HTML, but it isn't seen as valid by something in Twisted. It doesn't seem to like the unicode string, which was in fact produced by Twisted itself.
Do I need to set some other configuration item to get it to convert the unicode string to the necessary bytes object (or whatever), or some other approach?
Many thanks,
-D
Well, it seems the issue is that Python promotes the result of a string-formatting operation to unicode if any of the source strings was unicode. In my case, "reportsDir" was unicode because it came from an XML file, and that sent it down the error path.
Changing the above line:
rootResource.putChild("reports", File(reportsDir))
to:
rootResource.putChild("reports", File(reportsDir.encode('ascii', 'ignore')))
fixed the issue. I would, however, suggest that the Twisted developers either check for unicode in the File constructor, or have the DirectoryLister check for unicode and return the ASCII-encoded version when it finds one.
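A minimal defensive version of the same fix, assuming (as above, on Python 2) that reportsDir may arrive as a unicode object from the XML layer:

# coerce a unicode path to bytes before handing it to File()
if isinstance(reportsDir, unicode):
    reportsDir = reportsDir.encode('ascii', 'ignore')
rootResource.putChild("reports", File(reportsDir))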

iOS JSON escaping special characters

I'm working in iOS and trying to pass some content to a web server via an NSURLRequest. On the server I have a PHP script set up to accept the request string and convert it into a JSON object using the Zend_JSON framework. The issue I am having is that whenever the character "ø" is in any part of the request parameters, the request string is cut short by one character.
Request string before going to server.
[{"description":"Blah blah","type":"Russebuss","name":"Roscoe Simulator","appVersion":"1.0.20","osVersion":"IOS 5.1","phone":"5555555","country":"Østfold","udid":"bed164974ea0d436a43f3cdee0e005a1"}]
Request string on server before any parsing
[{"description":"Blah blah","type":"Russebuss","name":"Roscoe Simulator","appVersion":"1.0.20","osVersion":"IOS 5.1","phone":"5555555","country":"Nord-Trøndelag","udid":"bed164974ea0d436a43f3cdee0e005a1"}
Everything looks exactly the same except the final closing ] is missing. I'm thinking it's having an issue when converting the string to UTF-8, but not sure the correct way to fix this issue.
Does anyone have any ideas why this is happening?
First of all, do not trust the Xcode console in such cases; you never know which encoding the console is actually using.
Second, escape the invalid characters before you build your JSON string. The easiest way is probably to make sure you are using the same Unicode representation, such as UTF-8, throughout.
Third, if there are still invalid characters, use a JSON library with a parser (one that handles the encoding). Validate the output by parsing it back into, e.g., an NSString, or validate it manually with a web form like http://jsonformatter.curiousconcept.com/
The crudest way is to replace the individual characters in the string, build your JSON, and convert back. One way to do this is to replace, e.g., a German ä with its Unicode representation U+00E4 (http://www.utf8-chartable.de/).
That's the way I do it. I am glad that I never needed to go further than step three, and that is the step you should do anyway to keep your code simple.
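To make the second and third points concrete, here is a sketch in Python (not Objective-C, but the idea carries over): a proper JSON library can emit a pure-ASCII body, so characters like "ø" survive any transport re-encoding. The payload below is trimmed from the question:

import json

payload = [{"country": "Østfold", "type": "Russebuss"}]
# ensure_ascii=True (the default) escapes "Ø" as \u00d8, so the request
# body contains only ASCII and cannot be truncated by a bad re-encoding
body = json.dumps(payload)
print(body)                            # [{"country": "\u00d8stfold", "type": "Russebuss"}]
print(json.loads(body)[0]["country"])  # round-trips back to "Østfold"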
Please try using Zend's built-in JSON encoder:
Zend_Json::$useBuiltinEncoderDecoder = true;
That should fix your issue.

S3's BOTO is returning NoSuchKey while trying to copy an existing key

I've created a key on S3.
mykey.exists() returns true
mykey.get_contents_to_filename() generates a file that is correct
But:
mykey.copy('bucket', '/backup/file')
returns:
NoSuchKey
The Specified key does not exist.
Key = mykey
It looks like I'm using boto 2.0b4
If the key exists, why am I getting a NoSuchKey error?
What am I missing?
Edit: changed backslashes in the key name to the forward slashes that I am actually using
I have a theory: because Amazon S3 is eventually consistent, one request could see the key (.exists() == True) while another request lands on a different S3 server which does not yet know about the new key. This is an inconsistent read, the classic difficulty with eventually consistent data stores; it is known behavior for S3 with a PUT followed by a HEAD/GET, and I expect it to hold for COPY as well. After a usually short (but indefinite) period of time, all requests will see your key; normally this takes only a second or two. Put a 30-second delay in your code between the exists() check and the copy. Does it still happen?
The issue is described here: https://forums.aws.amazon.com/thread.jspa?threadID=21634&tstart=0
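A rough sketch of that experiment with boto 2.x calls (the bucket and key names are placeholders):

import time
from boto.s3.connection import S3Connection

conn = S3Connection()                     # credentials from the environment
bucket = conn.get_bucket('bucket')
mykey = bucket.get_key('mykey')

if mykey is not None and mykey.exists():
    time.sleep(30)                        # wait out eventual consistency
    mykey.copy('bucket', '/backup/file')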
I think you may be running into an issue with your key name. The backslash characters in the string '\backup\file' are actually interpreted as string escapes, so '\b' is replaced with the ASCII backspace character and '\f' is interpreted as the ASCII formfeed (see this for more details). While that probably isn't what you intended, it really should still work; however, there was a bug in the escaping of key names in boto 2.0b4 (now fixed in github master) that is preventing this from working.
If you actually want your key name to be "\backup\file", try specifying it as r'\backup\file' in Python. This treats it as a raw string and no escape processing will occur.
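A quick demonstration of the difference:

# '\b' and '\f' are escape sequences, so each pair collapses to one control character
print(len('\backup\file'))    # 10: contains backspace and formfeed
print(len(r'\backup\file'))   # 12: backslashes preserved literally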

URL shortener with no database

I'd like to write a URL shortener that doesn't have to use a database. Instead, to have as few moving parts as possible, the script would just create a unique hash for my URL based on an algorithm (like md5, except an md5 would be too long). I'm not really sure how I'd go about doing this. Any advice?
If it matters, I'd prefer to write this in Ruby.
What you need is a way to compress and decompress a string, where the compressed version is a string too. This is nearly impossible, because a URL is already very short: encoding and lossless compression always add some minimal overhead, which for most URLs results in a string that is larger than the original.
For very long URLs, however, it may work.
So, in the end, you will almost always need a lookup-table in storage (database).
Base64 is the most logical solution. On its own, however, Base64 encoding returns strings longer than the original for short strings (which URLs generally are), mostly due to the padding. So we'll also try zlib, to compress the string.
require "uri"
require "base64"
require "zlib"
shortner_url = URI.parse("https://s.to")
long = "https://stackoverflow.com/questions/4818429/url-shortener-with-no-database"
url = URI.parse(long)
stripped = url.host + url.path
stripped.length #=> 66
# Let's see that Base64 on its own does not shorten the url.
encoded = Base64.encode64(stripped)
encoded.length #=> 90
# So, using zlib. To compress.
compressed = Zlib::Deflate.deflate(stripped)
encoded = Base64.encode64(compressed)
encoded.length #=> 94
# It became worse.
# Now, with a long URL (they can be much longer, even); to simplify, we omit the stripping part:
long = "http://www.thelongestlistofthelongeststuffatthelongestdomainnameatlonglast.com/wearejustdoingthistobestupidnowsincethiscangoonforeverandeverandeverbutitstilllookskindaneatinthebrowsereventhoughitsabigwasteoftimeandenergyandhasnorealpointbutwehadtodoitanyways.html"
long.length #=> 263
Base64.encode64(Zlib::Deflate.deflate(long)).length #=> 228
# In order to turn this into a valid short URL, however, we need `urlsafe_encode64()`
shortner_url.path = "/" + Base64.urlsafe_encode64(Zlib::Deflate.deflate(long))
shortner_url.to_s #=> "https://s.to/eJxNjkEWwyAIRG-U7HsbElFpEPIE68vti6t2BcwbZn51v1_7PufcvCKrFDRnMtf8u81HzuA_IWkDEoGG4EtiMN9ObftE6Pgey0FSvK6gIx7GTUl0GsmJSz1Biqpk7fjBDpL-xjGcopKYWfWyiySBRBFJABw9UnB9xaWj1LDCQWUGAQYzBVLECPbyxFLBJDqA7-DxSJ5YIbkGnoM8Ex7bqjf-AiodbYM="
shortner_url.to_s.length #=> 237 WE SAVED 26 characters!
Note on stripping: we can remove 'https://'. A real implementation would need to add a piece to the string to record whether the original was https or http, e.g. '1' + result for https, '0' + result for http (see the sketch below). Another "hack" would be to make the URL-shortening service use http for http URLs and https for https URLs.
If you always use the same domain, you can discard the domain part too.
If you have a lot of slashes, or other repeating characters such as a dash, the compression works better.
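A tiny sketch of that scheme-flag idea (in Python, for illustration; the function names are made up):

def strip_scheme(url):
    # record the scheme in one character instead of seven or eight
    if url.startswith("https://"):
        return "1" + url[len("https://"):]
    return "0" + url[len("http://"):]

def restore_scheme(s):
    scheme = "https://" if s[0] == "1" else "http://"
    return scheme + s[1:]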
You could use any of several string-manipulation tools to transform a URL into something obscured; however, as you noted in your question, the URLs you get from doing this would be longer than is typical for a URL shortener.
URLs don't compress very well.
Ultimately, if you're after a short link, you simply need to generate a suitably legible unique code (try to omit easily confused letters/numbers such as zero and 'o', in case some poor bugger actually has to type it in) and associate that code with the original URL in some form of store.
Whilst I can understand why you don't want to use a database, in many ways it's the perfect form of storage, especially if you look at one of the dedicated key/value stores such as Cassandra, Redis, MongoDB, etc. (That said, a simple "traditional" SQL database may be an easy first step if you're in unfamiliar territory.)
You won't be able to resolve the original URL from a hash code without looking it up in some kind of database.
About the only thing you can do without a database is compress the URL and then decompress it when you resolve the URL.
Strictly speaking, I guess you could just hash the URL. But of what possible value would that be if you are not able to resolve it back to the original URL?
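For completeness, here is the compress-then-decompress round trip mentioned above, sketched in Python (the same idea as the Ruby snippet earlier):

import base64
import zlib

def shorten(url):
    return base64.urlsafe_b64encode(zlib.compress(url.encode("utf-8"))).decode("ascii")

def resolve(token):
    return zlib.decompress(base64.urlsafe_b64decode(token)).decode("utf-8")

long_url = "https://stackoverflow.com/questions/4818429/url-shortener-with-no-database"
assert resolve(shorten(long_url)) == long_url
# note: for short URLs the "shortened" token is typically longer than the original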

What is the impact of escaped characters on SEO-friendly URLs?

I have a site that displays products - in the simplest sense the url of the page for a particular product is:
site.com/products/manufacturer_model - so for example if I was displaying a Dell Latitude D700 laptop my URL would look like:
site.com/products/dell_latitude_d700
I have a number of products whose names contain characters that I would need to URL-escape - for example a Dell Latitude 12?34. Obviously I cannot include the '?' character in the URL as-is. For the purpose of being SEO-friendly, should I ignore that character? e.g.
site.com/products/dell_latitude_1234
Or should I escape it? e.g.
site.com/products/dell_latitude_12%3F34
Seems like escaping it would be the most logical approach - but do crawlers understand this?
Well, using "_" is not so friendly to users, so I think using "-" is better (check seoMOZ beginners guide).
Also, you would like to check what characters really need escaping on RFC 3986. If you are using PHP, check out urlencode function page at php.net. I wrote a function to make this updated conversion a few months ago ;)
But getting back to your main question, do use escaped (when needed per RFC 3986) for writing your URLs. It is the safe path to not getting stuck or penalized.
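As an illustration (a Python sketch; the helper name is made up), one way to build a slug that follows both pieces of advice, hyphens plus RFC 3986 escaping only where needed:

from urllib.parse import quote

def product_slug(name):
    # lowercase, hyphens instead of spaces/underscores, then percent-escape
    # anything RFC 3986 does not allow unescaped in a path segment
    slug = name.lower().replace(" ", "-").replace("_", "-")
    return quote(slug, safe="-")

print(product_slug("Dell Latitude 12?34"))  # dell-latitude-12%3F34
print(product_slug("Dell Latitude D700"))   # dell-latitude-d700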