In my S3 bucket, will there be any problem accessing objects over the browser if the objects are named with Arabic characters?
See Object key naming guidelines for specifics and recommendations on what to avoid.
Theoretically:
You can use any UTF-8 character in an object key name. However, using certain characters in key names can cause problems with some applications and protocols. The following guidelines help you maximize compliance with DNS, web-safe characters, XML parsers, and other APIs.
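In practice, the main thing to check is URL encoding: an Arabic key is perfectly valid UTF-8, but in a browser URL it has to be percent-encoded. A minimal sketch in Python (the bucket name and key here are made up):

from urllib.parse import quote

# quote() leaves '/' alone by default, which is what we want for
# keys that contain prefixes ("folders").
key = "مستندات/تقرير.pdf"  # an Arabic object key
url = "https://my-example-bucket.s3.amazonaws.com/" + quote(key)
print(url)
# https://my-example-bucket.s3.amazonaws.com/%D9%85%D8%B3%D8%AA%D9%86%D8%AF%D8%A7%D8%AA/%D8%AA%D9%82%D8%B1%D9%8A%D8%B1.pdf

Browsers and SDKs generally do this encoding for you, but hand-built URLs need it.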
In an application, I am allowing (potentially untrusted) users to store an arbitrary string (limited to 255 characters) in the value of a PDF custom property.
Are there any unsafe characters that can lead to an exploit?
Are there any unsafe characters that can lead to an exploit?
Not per se.
Every PDF viewer implementation should get along with 255 arbitrary valid characters, and I'm not aware of any specific exploits.
But of course there is always a chance that some character sequences trigger bugs in some PDF viewers, particularly in exotic, less-tested ones.
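If you want some defence in depth anyway, you can normalize the value before storing it. A minimal sketch (the length limit and the choice to strip control characters are my assumptions, not a fix for any known exploit):

import unicodedata

def sanitize_property_value(value, max_len=255):
    # Drop C0/C1 control characters (Unicode category 'Cc'),
    # then enforce the length limit.
    cleaned = "".join(ch for ch in value if unicodedata.category(ch) != "Cc")
    return cleaned[:max_len]

print(sanitize_property_value("report\x00name\x07" + "x" * 300))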
I am posing this use case as a reason to enable support for the CDATA section of XML documents on SQL Server, in response to the opinion of Michael Rys.
He states that
"There is no semantic difference in the data that you store."
I am a software controls engineer, and we use a supervised distributed system: we generally have a Windows-based server and database for supervisory functions as well as high-speed machine control applications. We use any number of PLCs to compose our distributed control system, and we keep a copy of the PLC program on the server. The PLC program is in L5X format, which calls for the CDATA section per the specification (see page 40 for more info).
The CDATA section is used for component descriptions due to invalid XML characters being present in some of them and the need to preserve them:
"Component descriptions are brought into the project without being processed by
the XML parser for markup language. The description text is contained in a
CDATA element, a standard in the XML specification. A CDATA element
begins with the character sequence <![CDATA[ and ends with the character
sequence ]]>. None of the text within the CDATA element is interpreted by the
XML parser. The CDATA element preserves formatting so there is no need to use
control characters to enter formatted descriptions."
Here, I think at least, is an entirely valid reason for the existence and use of the CDATA section, in contrast to the opinion of Microsoft.
Buggy tools.
You may find it more or less convenient, but the only technical reason is if you have buggy tools that don't follow the XML rules in some way.
The CDATA section is used for component descriptions due to invalid XML characters being present in some of them and the need to preserve them.
Either you mean characters that are invalid in XML unless escaped, in which case they could simply be escaped, or you mean characters that are not valid in XML at all, in which case they are not valid inside CDATA sections either. In the first case, if your tools can't work with escaped characters, they are buggy. In the second case, if your tools require you to work with such characters, they are buggy. Either way, this falls into the buggy-tools category.
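To make the two cases concrete, here is a small sketch using Python's standard-library XML parser: '<' is only invalid unescaped, while a NUL character is not valid in XML at all, CDATA or not.

import xml.etree.ElementTree as ET

print(ET.fromstring('<d>a &lt; b</d>').text)           # 'a < b'
print(ET.fromstring('<d><![CDATA[a < b]]></d>').text)  # 'a < b'

try:
    ET.fromstring('<d><![CDATA[a \x00 b]]></d>')
except ET.ParseError as e:
    print('rejected:', e)                              # NUL is rejected even in CDATA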
The general consensus in the XML community is that the following three forms are semantically equivalent:
<x>&#xB1;</x>
<x>±</x>
<x><![CDATA[±]]></x>
and that XML processing software therefore does not need to preserve the distinction between them. This is why entity references (or numeric character references) and CDATA sections are not part of the data model used by XPath, XSLT, and XQuery (known as XDM).
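This equivalence is easy to check with any conforming parser; a minimal sketch with Python's xml.etree:

import xml.etree.ElementTree as ET

# All three forms yield the same text node once parsed, so the
# distinction between them is not preserved by the data model.
forms = [
    '<x>&#xB1;</x>',               # numeric character reference
    '<x>\u00b1</x>',               # the literal character
    '<x><![CDATA[\u00b1]]></x>',   # CDATA section
]
print({ET.fromstring(f).text for f in forms})  # {'±'}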
Unfortunately the XML specification itself does not define a data model and is rather weak on saying which constructs are information-bearing and which are not, so there will always be people coming up with arguments like this, just as there will be people arguing that the order of attributes should be preserved.
It is well documented that Amazon S3 only uses lowercase for bucket names and object names, which makes it difficult to represent file names that may have contained uppercase letters when, for instance, backing up a file to S3. I thought I would put the 'true' mixed-case filename in the metadata, only to discover that the metadata also has a restriction of all lowercase (!!!).
Has anyone established a best practice or technique for representing mixed-case filenames? I can think of methods like storing additional metadata that indicates which letters in the key or in other metadata should be uppercase, but this is a mess.
Any recommendations or common practices?
I'm not really sure what you mean by lowercase-only for objects. Yes, bucket names are lowercase, but objects can have mixed-case names. Check the example below:
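A minimal sketch with boto3 (bucket name made up, credentials assumed to be configured). One caveat worth knowing: user-defined metadata keys do come back lowercased, but metadata values keep their case, so the original filename can safely live in a value.

import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-example-bucket",        # bucket names must be lowercase
    Key="Backups/MyFile.TXT",          # object keys are case-sensitive
    Body=b"hello",
    Metadata={"original-name": "MyFile.TXT"},  # value case is preserved
)

head = s3.head_object(Bucket="my-example-bucket", Key="Backups/MyFile.TXT")
print(head["Metadata"]["original-name"])  # -> 'MyFile.TXT'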
To prevent the casual distribution of a PDF document, is there any way to do so, such as embedding a serial number in the file?
My idea is to embed an ID bound to the user, making it possible to find out who distributed the file.
I know this doesn't prevent distribution, but it may discourage casual distribution to a certain degree.
Any other solution is also welcome.
Thanks!
The common way is to place metadata, but it can easily be removed.
Let's search for hideouts (most of them low-level)!
Non-marking text
Text under overlapping objects
Objects from older document versions (not noticed by the reader, but still there, carrying redundant information)
Marks in streams between BX and EX operators (with information that looks weird from a reader's point of view)
Information before the %PDF-X header
Information after the %%EOF marker
Substitution of names for some elements (like font names)
Steganography
Manipulation of the fonts used
Whitespacing
Images with steganography
My favorites are steganography and a BX-EX block within a stream; with proper compression and/or encryption it is hard to find (if you don't know where it is). To make the search harder, wrap some normal blocks in BX-EX as well.
Some of these are easy to remove, some harder, but a determined attacker will be able to find and sanitize them all. Think about copy-pasting the text or printing through a PDF printer.
You can render transparent text. You can write text outside the media box of a page. You can add a custom document property. There are plenty of ways to do this; a sketch of the last one follows.
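Here is how the custom-property approach might look with pypdf (the library choice and the /SerialNumber property name are just examples):

from pypdf import PdfReader, PdfWriter

def tag_pdf(src_path, dst_path, user_id):
    # Copy all pages, then stamp a per-user ID into a custom
    # document property. Trivial to strip, but enough to
    # discourage casual sharing.
    writer = PdfWriter()
    writer.append(PdfReader(src_path))
    writer.add_metadata({"/SerialNumber": user_id})
    with open(dst_path, "wb") as f:
        writer.write(f)

tag_pdf("report.pdf", "report-tagged.pdf", "user-4711")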
Why not create a digital ID on the documents?
Our CMS accepts files with national characters in their names and stores them on the server without a problem. But how bad is such an approach in the long run? For example, is it possible to store files with filenames in Hebrew, Arabic, or any other language with a non-Latin alphabet? Is there an established, standard way to handle these?
A standard way would be to generate unique names yourself and store the original file name somewhere else. Typically, even if your underlying OS and file system allow arbitrary Unicode characters in the file name, you don't want users to decide about file names on your server. Doing so may impose certain risks and lead to problems, e.g. those caused by too-long names or file system collisions. Examples of sites that do this would be Facebook, flickr, and many others.
For generating the unique file names, GUID values would be a good choice.
Store the original filename in a database of some sort, in case you ever need to use it.
Then rename the file using a unique alphanumeric ID, keeping the original file extension.
If you expect many files, then you should create directories to group the files. Using the year, month, day, hour, and minute is usually enough for most cases. For example:
.../2010/12/02/10/28/1a2b3c4d5e.mp3
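A minimal sketch of that scheme in Python (paths and the storage root are made up):

import os
import uuid
from datetime import datetime

def make_storage_path(root, original_name):
    ext = os.path.splitext(original_name)[1]  # keep the original extension
    subdir = datetime.now().strftime("%Y/%m/%d/%H/%M")
    directory = os.path.join(root, subdir)
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, uuid.uuid4().hex + ext)

# The original (possibly non-Latin) name goes into the database;
# only the generated name touches the file system.
print(make_storage_path("/srv/uploads", "dal szöveg.mp3"))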
Yes, I've had experience with massive MP3 collections, which are notorious for being named in the language of the country where the song originates; this can cause trouble in several places.
It's fine as long as you detect the charset it's in from the headers in the request, and use a consistent charset (such as UTF-8) internally.
On a Unix server, it's technically feasible and easy to accept any Unicode character in the filename and then convert filenames to UTF-8 before saving them. However, there might be bugs in the conversion (in the HTML templating engine or web framework you are using, or in the user's web browser), so it is possible that some users will complain that files they have uploaded have disappeared. The root cause might be a buggy filename conversion. If all the characters in the filename are non-Latin and you (as a software developer) don't speak that foreign language, then good luck figuring out what happened to the file.
It is an excellent idea. Being Hungarian, I'm pretty annoyed when I'm not allowed to use characters like áÉŰÖÜúÓÚŰÉÍí :)
There is a lot of software out there that has bugs regarding dealing with such file names, especially on Windows.
Update:
Example: I couldn't use the Android SDK (without creating a new user), because I had an é in my user name. I also ran into a similar problem with the Intel C++ compiler.
Software usually isn't tested properly with such file names. The Windows API still offers "ANSI"-encoded versions of its functions, and many developers don't seem to understand the potential problems. I also keep coming across webpages that mess up my name.
I'm not saying you shouldn't allow such file names; in fact, in the 21st century I would expect to be able to use such characters everywhere. But be prepared that you may run into problems.