How bad an idea is it to let users upload and store files with national characters in the filename? - file-upload

Our CMS accepts files with national characters in their names and stores them on the server without a problem. But how bad is such an approach in the long run? For example, is it possible to store files with filenames in Hebrew, Arabic, or any other language with a non-Latin alphabet? Is there an established, standard way to handle these?

A standard way would be to generate unique names yourself and store the original file name somewhere else. Typically, even if your underlying OS and file system allow arbitrary Unicode characters in file names, you don't want users to decide the file names on your server. Doing so carries certain risks and can lead to problems, e.g. over-long names or file-system collisions. Facebook, Flickr, and many other sites do exactly this.
For generating the unique file name, GUID values would be a good choice.

Store the original filename in a database of some sort, in case you ever need to use it.
Then rename the file using a unique alphanumeric id, keeping the original file extension.
If you expect many files, create directories to group them. Using the year, month, day, hour, and minute is usually enough for most purposes. For example:
.../2010/12/02/10/28/1a2b3c4d5e.mp3
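A minimal sketch of that scheme in Python (make_storage_path is a made-up helper name; uuid4 stands in for the GUID suggested above):

    import os
    import uuid
    from datetime import datetime, timezone

    def make_storage_path(original_name, root="uploads"):
        """Build a collision-free server path; the original name goes in the database."""
        ext = os.path.splitext(original_name)[1].lower()   # keep the original extension
        unique_id = uuid.uuid4().hex                       # GUID-based unique name
        stamp = datetime.now(timezone.utc).strftime("%Y/%m/%d/%H/%M")
        return os.path.join(root, stamp, unique_id + ext)

    # e.g. uploads/2010/12/02/10/28/1a2b3c4d5e....mp3
    print(make_storage_path("שיר.mp3"))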
Yes, I've had experience with massive mp3 collections, which are notorious for being named in the language of the country the song comes from, and that can cause trouble in several places.

It's fine as long as you detect the charset from the request headers and use one consistent charset (such as UTF-8) internally.
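A hedged sketch of that normalization in Python (normalize_filename is a hypothetical helper; declared_charset would come from the request's Content-Type or Content-Disposition header):

    def normalize_filename(raw_bytes, declared_charset=None):
        """Decode an uploaded filename to str, preferring the declared charset."""
        for enc in filter(None, (declared_charset, "utf-8")):
            try:
                return raw_bytes.decode(enc)
            except (UnicodeDecodeError, LookupError):
                continue
        # Last resort: replace undecodable bytes so the upload never fails outright.
        return raw_bytes.decode("utf-8", errors="replace")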

On a Unix server it's technically feasible and easy to accept any Unicode character in the filename and convert filenames to UTF-8 before saving them. However, there might be bugs in the conversion (in the HTML templating engine or web framework you are using, or in the user's web browser), so some users may complain that files they have uploaded disappeared, the root cause being buggy filename conversion. If all characters in the filename are non-Latin and you (as a software developer) don't speak that language, then good luck figuring out what happened to the file.

It is an excellent idea. Being Hungarian, I'm pretty annoyed when I'm not allowed to use characters like áÉŰÖÜúÓÚŰÉÍí :)

There is a lot of software out there that has bugs when dealing with such file names, especially on Windows.
Update:
Example: I couldn't use the Android SDK (without creating a new user) because I had an é in my user name. I also ran into a similar problem with the Intel C++ compiler.
Software usually isn't tested properly with such file names. The Windows API still offers "ANSI"-encoded versions of its functions, and many developers don't seem to understand the potential problems. I also keep coming across webpages that mess up my name.
I'm not saying you shouldn't allow such file names; in fact, in the 21st century I would expect to be able to use such characters everywhere. But be prepared to run into problems.

Related

Documenting File Naming Conventions

I'm writing some documentation on the file naming of configuration files. I don't know whether this question belongs here, as it's not code-related, but here goes.
So I want to describe how to name files and I have something along the lines of:
{client_name}-{product_name}-{season}-{id}.yaml
Now, my question: since client_name, product_name, season, and id are not necessarily variable names, just descriptions of which string goes where, should multi-word descriptions use underscores, or should I keep the words separated with spaces, like:
{client name}-{product name}-{season}-{id}.yaml
My concern is that because of the underscores, someone may think they are variables and look for them somewhere in the code of the projects, instead of in other linked documentation.
Indeed, if the names your system produces do not vary enough, it is evidently failing at its purpose.
I can hardly find a standard on this, but a widespread practice is to use the following structure, with more or less modification:
[Project Name]-[Department]-[Doc.Type]-[Number]-[Revision]
The last kind of structure works well for definite, unchangeable work structures. Check the CERN LHC Document Name Structure as an example for this.
On the other hand, for more dynamic but flat environments, with documents generated day by day, or even auto-generated by systems like cameras or data loggers, something exposing the specific timestamp is more useful: the date format YYYYMMDD or YYYY-MM-DD based on the ISO date, or, for more automated processes, timestamps extended to hours, minutes, or seconds (YYYYMMDDhhmmss):
[Date]-[Document Name]
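For example, a tiny Python sketch of that pattern (timestamped_name is a made-up helper):

    from datetime import datetime

    def timestamped_name(document_name):
        # An ISO-style date prefix keeps names chronologically sortable.
        return f"{datetime.now():%Y%m%d}-{document_name}"

    print(timestamped_name("inspection-report.pdf"))  # e.g. 20240301-inspection-report.pdf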
Unfortunately there is no standard naming convention, neither in ISO 7001 nor in any other standard related to naming systems; it is largely left to the quality management system to specify the structure to use.

How to create your own package for interaction with Word, PDF, etc.

I know that there are a lot of packages around which allow you to create or read e.g. PDF, Word and other files.
What I'm interested in (and never learned at university) is how you create such a package. Are you always relying on source code provided by the original company (such as Adobe or Microsoft), or is there another, clever way of working around it? Should I analyze the individual bytes I see in, e.g., PDF files?
It varies.
Some companies provide an SDK ("Software Development Kit") for their own data format; others provide only a specification (e.g., Adobe for PDF, Microsoft for Word), and it's up to the software developer to write a correct implementation.
Since that can be a lot of work – the PDF specification, for example, runs to over 700 pages and doesn't go deep into practically required material such as LZW, JPEG/JPEG2000, color theory, and math transformations – and you need a huge set of data to test against, it's way easier to use the work that others have done on it.
If you are interested in writing a support library for a certain file format which

- is not legally protected,
- has no, or only sparse, (official) documentation,
- and is not already under deconstruction elsewhere,[a]

then yes, you need to:

1. gather as many different files as possible;
2. from as many different sources as possible (ideally, you should have at least one program that can both read and create the files);
3. inspect them on the byte level;
4. create a 'reader' which works on all of the test files;
5. if possible, interesting, and/or required, create a 'writer' that can create a new file in that format from scratch, or convert data from another format to this one.
There is 'cleverness' involved, mainly in #3, as you need to be very well versed in how data representation works in general. You should be able to tell code from data, string data from floating point, and UTF-8-encoded strings from MacRoman-encoded strings (and so on).
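As an illustration of the byte-level step, here is a minimal hex-dump sketch in Python; the MAGIC table of well-known signatures is illustrative only, not exhaustive:

    def hexdump(path, length=64):
        """Print the first bytes of a file as hex + ASCII, the usual first step."""
        with open(path, "rb") as f:
            data = f.read(length)
        for offset in range(0, len(data), 16):
            chunk = data[offset:offset + 16]
            hexed = " ".join(f"{b:02x}" for b in chunk)
            text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
            print(f"{offset:08x}  {hexed:<47}  {text}")

    # A few well-known magic numbers help classify a file at a glance.
    MAGIC = {b"%PDF": "PDF", b"PK\x03\x04": "ZIP (also docx/xlsx)", b"\x89PNG": "PNG"}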
I've done this a couple of times, primarily to inspect the data of various games, mainly because it's huge fun! (Fair warning: it can also be incredibly frustrating.) See Reverse Engineering's "Reverse engineering file containing sprites" for an example approach; notably, at the bottom of my answer there I admit defeat and start using the phrases "possibly", "may", and "probably", which is an indication I did not get any further on that.
[a] Not necessarily, of course. You can cooperate with others whose expertise lies elsewhere, or even do "grunt work" for existing projects – finding out and codifying fairly trivial subcases.
There are also advantages to working independently of existing projects. For example, with the experience of my own PDF reader (written from scratch), I was able to point out a bug in PDFBox.

JSON vs classic schema design [duplicate]

The Project
I've been asked to work on an interesting project -- what amounts to a basic Web CMS -- that uses HTML/CSS/jQuery with PHP. However, one requirement is that there won't be a database to house the data (they want flat files for the documents/pages, preferably in JSON format).
In a very basic sense, it'll be used to generate HTML pages via a very "non-techie" interface. Each installation would only have around 20 pages, but a few may get up to 100. It has to be fairly easy to drop onto a PHP capable server and run, with very little setup needed.
What's Out There
There are tons of CMS options and quite a few flat-file versions. But an OSS or other existing CMS is not an option. They need a simple proprietary system.
Initial Thoughts
So flat files it is... but I'd really like to get some feedback on the drawbacks, and if it is worth the effort to try and convince them to use something like MySQL (SQLite or CouchDB are out since none of the servers can be configured to run them at the present time).
Of course the document files are pretty straightforward, but we're also talking about login info for 1 or 2 admins per installation, a few lists, as well as configs/settings (which also can easily be stored in a file with protection).
The Dilemma
If there are benefits to using MySQL rather than JSON-formatted files and some arrays in a simple project like this -- beyond my own preconceived notions :) -- I'll be sure to argue for them.
But honestly I can't see any that outweigh their need to not have a database system.
I'd appreciate your insight and opinions.
If you can't cite a specific need for relational table design, then you're fine with flat files. Build as specified. The moment you can cite a specific need, let them know; upgrading isn't that hard if your perception is timely (that is, if you aren't in the position of having to normalize data that should have been integrated earlier).
It's a shame you can't use CouchDB, this seems like the perfect application for it. Keep in mind that using flat-files severely constrains your architecture and, especially, scalability.
What's the best case scenario for your CMS app? It's successful and people want to use it more? If you're using flat-files it'll be harder to service and improve your system (e.g. make it more robust, and add new features for future versions) and performance will not scale well. So "success" in this case is at best short-lived, as success translates into more and more work for less and less gains in feature-set and performance.
Then again, if the CMS is designed right, switching from flat files to an RDBMS should be as simple as using a different data-access layer.
Will this be installed on any shared hosting sites? For this to work somewhat safely, a mechanism like suEXEC needs to be set up properly, as the web server will need write permissions to various directories.
What would be cool with a simple site fed via JSON and jQuery is that the site wouldn't need to reload on each click. Just the relevant data would change. You could then use hashes in the location bar to keep track of where you were (e.g. http://localhost/#about).
The problem is that if they are editing the raw JSON file, they can mess it up pretty quickly. I think your admin tools would have to generate the JSON files based on the input so that you can ensure nothing breaks. The admin tools would be more involved than the site (though isn't that always the case with dynamic sites).
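One way to keep generated JSON from ever being half-written (by a crash or a concurrent request) is an atomic write. Sketched in Python for brevity (save_page is a hypothetical helper); the same write-to-temp-then-rename pattern works in PHP with tempnam() and rename():

    import json
    import os
    import tempfile

    def save_page(path, page):
        """Write JSON atomically: a reader sees either the old file or the new one."""
        dir_name = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                json.dump(page, f, ensure_ascii=False, indent=2)
            os.replace(tmp_path, path)  # atomic on POSIX and Windows
        except BaseException:
            os.unlink(tmp_path)
            raise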
What are the predicted data sizes for the CMS?
A large reason for using an RDBMS is quick, specific access to large amounts of data. The data format might not be large, but if there is a lot of data, an RDBMS might be better in the long run.
While an RDBMS may be necessary for a very large CMS, a small one could run off flat files very well. A lot of CMS products out there fall down in that regard, I think, by throwing an RDBMS into the mix when there's no real need.
However, if you are using flat files, there are security issues which others have highlighted. Another issue I've come across is hosting providers using the disable_functions directive in php.ini to disable file I/O functions like fopen() and friends. If you're hosting your CMS on a box you control, you won't have this problem but if you're using a third-party provider, check first.
As the original poster, I wasn't signed in, so I'm following up on the answers so far in an answer (sorry if this is bad form).

- There may be instances where this is on a shared host.
- Though the JSON files can technically be edited, this won't be the case. The admin interface will be robust enough to do all of the creating/editing of pages.
- The size of each install will be relatively small: 1-2 admins, 10-100 pages. A few lists of common items may run longer (snippets of copy, for example).
- Security will be a big issue -- any other suggestions on this specifically?
Well, isn't the problem that they are distrustful of any database system? Isn't the problem more in their thinking than in the technology? Maybe they are afraid of a database because it sounds complex to them. In that case, just present them with a very simple CMS (like CMS Made Simple, which I've heard really is simple, with a fast learning curve); if they see that everything is easy, then maybe they just won't care what's behind it, whether it's a database or something else.
They might also listen to arguments like better maintenance, lower maintenance costs, and a much easier handover to another webmaster than with proprietary solutions (so they are not dependent on you), etc.

Writing binary to a file in Visual Basic with the intent of unreadability

I am working on a program that creates a "license" file. This file is expected to be binary, containing a name, today's date, a warning date, an expiration date, and a preference for Metric or Imperial units of measurement. It essentially authorizes programs to work until the expiration date is reached; before that, the warning date notifies the user that the license will expire. For this to work, the dates must not be easily editable, to prevent people from setting the date to whatever they want and keeping the program.
What I have now writes each field from a String or Integer, letting the BinaryWriter class decide what should be written when I use its "write" method. I have been experimenting with the difference between big- and little-endian encoding, which is selectable in the form.
[code redacted]
If the entered name has no spaces, the file looks a bit unreadable, but not enough. With big-endian, most of the expiration date is still visible; with little-endian, the other two dates are mostly visible. However, using spaces in the entered name changes the format of the output quite a bit, making all characters delimited by a space and therefore incredibly easy to change. My apologies that I cannot actually show you what the files look like.
Is there a better/more accepted way of storing this data? I would like the license files to work with existing FORTRAN programs, which read unformatted files in the general structure I've detailed; but reverse-engineering that sounds a bit difficult from what I've read, and my employer has offered to rewrite the FORTRAN programs to accept this new license-creation program if need be.
Create your license structure as text, containing whatever data you need (XML is a convenient format).
Encrypt that using public-key encryption, with your private key (this is effectively signing the license).
Embed the public key in your app. Decrypt the license file with the public key. Deal with the contents as you need to.
Easy!
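In modern terms, "encrypt with the private key" is a digital signature. A minimal sketch with Python's third-party cryptography package (the license contents are made up; in practice you would generate the key pair once and ship only the public key):

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa

    # One-time setup: keep the private key secret, embed the public key in the app.
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    public_key = private_key.public_key()

    license_text = b"<license><name>Acme</name><expires>2026-01-01</expires></license>"
    pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                      salt_length=padding.PSS.MAX_LENGTH)

    signature = private_key.sign(license_text, pss, hashes.SHA256())

    # In the application: verify() raises InvalidSignature if anything was altered.
    public_key.verify(signature, license_text, pss, hashes.SHA256())
    print("license is authentic")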
Most license managers I've seen show the license information in plain text, followed by a checksum code that the program checks against, most likely a hash of the data plus some secret material. That has the benefit of a human-readable license file that is still hard to change.
Be advised that license managers like these will make casual copying difficult, but someone determined to run your program without a license will still be able to crack it with a disassembler and some time.
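A hedged sketch of that plain-text-plus-checksum layout, using an HMAC in Python (SECRET is hypothetical and, per the caveat above, can be dug out of the binary by a determined attacker):

    import hashlib
    import hmac

    SECRET = b"embedded-application-secret"  # hypothetical; hidden inside the program

    def checksum(license_text):
        return hmac.new(SECRET, license_text.encode("utf-8"), hashlib.sha256).hexdigest()

    record = "name=Acme;warn=2025-12-01;expire=2026-01-01;units=metric"
    print(record)
    print("checksum:", checksum(record))
    # On load, recompute and compare with hmac.compare_digest() to verify.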
The most secure way to do that would probably be to encrypt the license file and have the programs using the licenses decrypt the file and display the info in it as necessary.

How do you test your app for Iñtërnâtiônàlizætiøn? (Internationalization?)

How do you test your app for Iñtërnâtiônàlizætiøn compliance? I tell people to store the Unicode string Iñtërnâtiônàlizætiøn into each field and then see if it is displayed correctly on output, including output as a cell's contents in Excel reports, in RTF for documents, in XML files, etc.
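For instance, a minimal round-trip check in Python, with an in-memory SQLite table standing in for whatever storage your app actually uses:

    import sqlite3

    SAMPLE = "Iñtërnâtiônàlizætiøn"

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE t (s TEXT)")
    db.execute("INSERT INTO t VALUES (?)", (SAMPLE,))
    (stored,) = db.execute("SELECT s FROM t").fetchone()
    assert stored == SAMPLE, f"mangled: {stored!r}"
    print("round trip OK")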
What other tests should be done?
Added idea from @Paddy:
Also try a right-to-left language, e.g. שלום ירושלים ([The] Peace of Jerusalem). It should look like:
[image of the Hebrew string rendered right-to-left] (source: kluger.com)
Note: Stack Overflow renders it correctly. If the text does not match the image, then you have a problem with your browser, OS, or possibly a proxy.
Also note: you should not have to change or "set up" your already-running app to accept either the Western European characters or the Hebrew example. You should be able to just type those characters into your app and have them come back correctly in your output. In case you don't have a Hebrew keyboard lying around, copy and paste the examples from this question into your app.
Pick a culture where the text reads from right to left and set your system up for that - make sure that it reads properly (easier said than done...).
Use one of the three "pseudo-locales" available since Windows Vista:
The three pseudo-locales are for testing three kinds of locales:

Base: The qps-ploc locale is used for English-like pseudo-localizations. Its strings are longer versions of English strings, using non-Latin and accented characters instead of the normal script. Additionally, simple Latin strings should sort in reverse order with this locale.

Mirrored: qps-mirr is used for right-to-left pseudo data, which is another area of interest for testing.

East Asian: qps-asia is intended to utilize the large CJK character repertoire, which is also useful for testing.
Windows will start formatting dates, times, numbers, and currencies in a made-up pseudo-locale that looks enough like English that you can work with it, but makes it obvious when you're not respecting the locale:
[Шěđлеśđαỳ !!!], 8 ōf [Μäŕςћ !!] ōf 2006
There is more to internationalization than unicode handling. You also need to make sure that dates show up localized to the user's timezone, if you know it (and make sure there's a way for people to tell you what their time zone is).
One handy fact for testing timezone handling is that there are two timezones (Pacific/Tongatapu and Pacific/Midway) that are actually 24 hours apart. So if timezones are being handled properly, the dates should never be the same for users in those two timezones for any timestamp. If you use any other timezones in your tests, results may vary depending on the time of day you run your test suite.
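A quick way to see this, sketched with Python 3.9+'s zoneinfo (some platforms also need the tzdata package):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    now = datetime.now(timezone.utc)
    for tz in ("Pacific/Tongatapu", "Pacific/Midway"):
        print(tz, now.astimezone(ZoneInfo(tz)).strftime("%Y-%m-%d %H:%M"))
    # The two local dates always differ, whatever time the test suite runs.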
You also need to make sure dates and times are formatted in a way that makes sense for the user's locale, or failing that, that any potential ambiguity in the rendering of dates is explained (e.g. "05/11/2009 (dd/mm/yyyy)").
"Iñtërnâtiônàlizætiøn" is a really bad string to test with since all the characters in it also appear in ISO-8859-1, so the string can work completely without any Unicode support at all! I've no idea why it's so commonly used when it utterly fails at its primary function!
Even Chinese or Hebrew text isn't a good choice (though right-to-left is a whole can of worms by itself) because it doesn't necessarily contain anything outside 3-byte UTF-8, which curiously was a very large hole in MySQL's default UTF-8 implementation (which is limited to 3-byte chars), until it was fixed by the addition of the utf8mb4 charset in MySQL 5.5. These days one of the more common uses of >3-byte UTF-8 is Emojis like these: [💝🐹🌇⛔]. If you don't see some pretty little coloured pictures between those brackets, congratulations, you just found a hole in your Unicode stack!
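A quick Python check for your test data (has_astral is a made-up helper) that tells you whether a string actually exercises 4-byte UTF-8:

    def has_astral(s):
        """True if s contains characters outside the BMP (4 bytes in UTF-8)."""
        return any(ord(ch) > 0xFFFF for ch in s)

    print(has_astral("Iñtërnâtiônàlizætiøn"))  # False: every char fits in ISO-8859-1
    print(has_astral("💝🐹🌇⛔"))               # True: the first three are astral-plane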
First, learn The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Make sure your application can handle Turkish. It has several quirks that break applications that assume English rules. Because there are four kinds of letter "i" (dotted and dot-less, upper and lower case), applications that assume uppercase(i) => I will break when using Turkish rules, where uppercase(i) => İ.
A common thing to do is check whether the user typed the command "exit" by using lowercase(userInput) == "exit" or uppercase(userInput) == "EXIT". This works as expected under English rules but will fail under Turkish rules, where "exıt" != "exit" and "EXİT" != "EXIT". To do this correctly, one must use the case-insensitive comparison routines built into all modern languages.
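Python makes the pitfall easy to demonstrate, because its default case mapping follows the locale-independent Unicode rules:

    print("exıt" == "exit")    # False: dotless ı (U+0131) is not i
    print("EXİT" == "EXIT")    # False: dotted İ (U+0130) is not I
    print("EXİT".lower())      # 'exi̇t': the lowered İ keeps a combining dot above
    print("Exit".casefold() == "EXIT".casefold())  # True: use casefold for comparisons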
I was thinking about this question from a completely different angle. I can't recall exactly what we did, but on a previous project I think we wound up changing the Regional Settings (in the Regional and Language Options control panel?) to help ensure the localized strings were working.