File naming and ISO 8601 (PDF archive)

I am creating an archive of PDF files (magazine issues) and am currently determining the file naming convention I want to adopt. So far I have thought of Title_Title-YYYY-MM-DD-VVV-NNNN, where Title is the magazine name (possibly two words separated by an underscore), followed by the ISO 8601 date, then the volume number, then the issue number.
My problem is the following: not all magazines have the same types of data. For example, some have volume numbers while others have only issue numbers, and, more relevant to this question, some are issued on a specific day while others only on a month, without a day.
My question: when faced with a magazine for which a field (say DD or VVV) does not apply, should I replace the field with zeros or drop it completely?
And would padding with zeros break the compatibility of my file names with ISO 8601 and any services that work with it?
I am thinking about both human- and machine-readability. These files will be hosted on a website, and my idea is to maximize compatibility (with Google SEO, for example) as well as maintain a convention that facilitates retrieval locally.
Thank you very much,
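Worth noting: ISO 8601 itself allows reduced-precision dates (YYYY-MM, or bare YYYY), so dropping an unknown DD keeps the date portion standard-compliant and still lexicographically sortable, whereas zero-padding (e.g. 2015-06-00) does not denote a valid date. A minimal Python sketch of a name builder along those lines; the V/N prefixes are an invented convention that keeps names unambiguous when fields are dropped:

def build_name(title, year, month=None, day=None, volume=None, issue=None):
    """Build a magazine file name; unknown fields are dropped, not zeroed.

    ISO 8601 permits reduced precision (YYYY-MM or YYYY), so omitting an
    unknown day keeps the date part valid and lexicographically sortable.
    """
    date = f"{year:04d}"
    if month is not None:
        date += f"-{month:02d}"
        if day is not None:
            date += f"-{day:02d}"
    parts = [title.replace(" ", "_"), date]
    if volume is not None:
        parts.append(f"V{volume:03d}")  # 'V' prefix: hypothetical convention
    if issue is not None:
        parts.append(f"N{issue:04d}")   # 'N' prefix: likewise
    return "-".join(parts) + ".pdf"

# build_name("Some Magazine", 2015, 6, volume=12, issue=231)
#   -> 'Some_Magazine-2015-06-V012-N0231.pdf'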

Related

Documenting File Naming Conventions

I'm writing some documentation on the naming of configuration files. I don't know whether this question belongs here, as it's not code-related, but here goes.
So I want to describe how to name files and I have something along the lines of:
{client_name}-{product_name}-{season}-{id}.yaml
Now, my question: since client_name, product_name, season, and id are not necessarily variable names, just descriptions of what string goes where, should multi-word descriptions contain underscores, or should I keep the words separated with spaces, like:
{client name}-{product name}-{season}-{id}.yaml
My concern is that, because of the underscores, someone may think they are variables and look for them in the projects' code instead of in the linked documentation.
Indeed, if your naming system does not vary enough between documents, it is evidently failing in its purpose.
I can hardly find a standard on this, but a widespread practice is to use the following structure, with minor variations:
[Project Name]-[Department]-[Doc.Type]-[Number]-[Revision]
The last kind of structure works well for definite, unchangeable work structures. Check the CERN LHC Document Name Structure as an example for this.
On the other hand, for more dynamic but flat environments, with documents generated day by day or even auto-generated by systems such as cameras or data loggers, a name exposing the specific timestamp is more useful: the date format YYYYMMDD or YYYY-MM-DD, or, for more automated processes, a timestamp extended to hours, minutes, or seconds (YYYYMMDDhhmmss), based on the ISO 8601 date:
[Date]-[Document Name]
Unfortunately, there is no standard naming convention, neither under ISO 7001 nor under any other standard related to naming systems; it is largely left to each Quality Management System to specify the structure to use.
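A small Python sketch of that second, timestamp-first style (the function name and defaults are illustrative):

from datetime import datetime

def timestamped_name(document_name, when=None, with_time=False):
    """Produce a [Date]-[Document Name] file name.

    YYYY-MM-DD sorts chronologically under a plain lexicographic sort;
    auto-generated documents (cameras, data loggers) can extend the stamp
    to seconds with YYYYMMDDhhmmss.
    """
    when = when or datetime.now()
    stamp = when.strftime("%Y%m%d%H%M%S" if with_time else "%Y-%m-%d")
    return f"{stamp}-{document_name}"

# timestamped_name("pressure-log.csv", with_time=True)
#   -> e.g. '20240301101500-pressure-log.csv'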

Issues with Dates in Apache OpenNLP

For a recent project to help me learn NLP, I am working on a number of documents, each of which contains a date. What I would like to do is read the unstructured data, identify the date or dates within, convert them into a numeric format, and possibly write them to the document's metadata. (Note: since the documents being used contain only pseudo-information, the actual metadata of the files being read in is false.)
Recently I have been attempting to use OpenNLP in conjunction with Lucene to do so, and it works to a degree.
However, if the date is written as "13 January 1990" or "2010/01/05", OpenNLP identifies only "January 1990" and "2010" respectively, not the entire date. Other date formats may have issues as well; I have yet to try them all. While I recognise that OpenNLP works on a statistical basis rather than a format basis, I can't help but feel I'm making an elementary mistake.
Am I making a mistake? If not, is there an easy way to rectify this?
I understand that I may be able to construct my own trained model based on a training data set. Is the Apache OpenNLP one freely available, so I may extend it? Are there any others that are freely available?
Is there a better way to do this? I've heard of Apache UIMA; the main reason I went for OpenNLP is its mention in Taming Text by Manning. I should note that the extraction of dates is only the first stage of the project; other data will be extracted later as well.
Many thanks for any response.
I am not an expert in OpenNLP but I know that the problem you are trying to solve is called Temporal Expression Extraction (because I do research in this field :P). Nowadays, there are some systems which can greatly help you in extracting and unambiguously representing the temporal meaning of such expressions.
Here are some references:
ManTIME, online demo, software
HeidelTime, online demo, software
SUTime, online demo, software
If you want a broader overview of the field, please have a look at the results of the last temporal information extraction challenge (TempEval-3, Task A).
I hope this helps. :)
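For contrast with the statistical approach, here is what a purely pattern-based extractor looks like. This is a naive Python sketch, nowhere near what ManTIME, HeidelTime, or SUTime do, but it shows why "13 January 1990" parses whole when the day is part of the rule rather than left to a statistical chunker:

import re
from datetime import datetime

# Two illustrative patterns only; real temporal taggers cover far more forms.
PATTERNS = [
    (re.compile(r"\b(\d{1,2} (?:January|February|March|April|May|June|July|"
                r"August|September|October|November|December) \d{4})\b"),
     "%d %B %Y"),
    (re.compile(r"\b(\d{4}/\d{2}/\d{2})\b"), "%Y/%m/%d"),
]

def extract_dates(text):
    """Return every date-like match in the text, parsed to datetime."""
    found = []
    for pattern, fmt in PATTERNS:
        for match in pattern.finditer(text):
            found.append(datetime.strptime(match.group(1), fmt))
    return found

# extract_dates("Signed on 13 January 1990 and filed 2010/01/05.")
#   -> [datetime(1990, 1, 13, 0, 0), datetime(2010, 1, 5, 0, 0)]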

How bad an idea is letting users upload and store files with national characters in the filename?

Our CMS accepts files with national characters in their names and stores them on the server without a problem. But how bad is such an approach in the long run? For example, is it possible to store files with names in Hebrew, Arabic, or any other language with a non-Latin alphabet? Is there an established, standard way to handle these?
A standard way would be to generate unique names yourself and store the original file name somewhere else. Typically, even if your underlying OS and file system allow arbitrary Unicode characters in file names, you don't want users to decide the file names on your server. Doing so may impose certain risks and lead to problems, e.g. those caused by too-long names or file system collisions. Examples of sites that do this are Facebook, Flickr, and many others.
For generating the unique file name, GUID values would be a good choice.
Store the original filename in a database of some sort, in case you ever need to use it.
Then, rename the filename using a unique alphanumeric id, keeping the original file extension.
If you expect many files then you should create directories to group the files. Using the year, month, day, hour and minute is usually enough for most. For example:
.../2010/12/02/10/28/1a2b3c4d5e.mp3
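A minimal Python sketch of that whole scheme (GUID-style name, original extension kept, date-based directories; the root path and the database step are placeholders):

import uuid
from datetime import datetime
from pathlib import Path

def store_upload(original_name, data, root):
    """Save an upload under a server-generated name.

    The stored path never contains user-supplied characters, which sidesteps
    encoding, length, and collision problems in one go; the original name is
    kept only as data, e.g. in a database column.
    """
    ext = Path(original_name).suffix               # keep the real extension
    directory = root / datetime.now().strftime("%Y/%m/%d/%H/%M")
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / (uuid.uuid4().hex + ext)
    path.write_bytes(data)
    # ...record (str(path), original_name) in your database here...
    return path

# store_upload("песня.mp3", b"...", Path("/srv/uploads"))
#   -> PosixPath('/srv/uploads/2010/12/02/10/28/1a2b3c4d....mp3')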
Yes. I've had experience with massive MP3 collections, which are notorious for being named in the language of the country where the song originates, and that can cause trouble in several places.
It's fine as long as you detect the charset it's in from the headers in the request, and use a consistent charset (such as UTF-8) internally.
On a Unix server, it's technically feasible and easy to accept any Unicode character in a filename and convert filenames to UTF-8 before saving them. However, there might be bugs in the conversion (in the HTML templating engine or web framework you are using, or in the user's web browser), so some users may complain that files they uploaded have disappeared, with buggy filename conversion as the root cause. If all characters in the filename are non-Latin and you (as a software developer) don't speak that language, then good luck figuring out what happened to the file.
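If you do keep user-supplied names, one defensive step is to decode and normalize them up front so at least your own storage is consistent (a sketch; it cannot fix buggy conversion further up the stack):

import unicodedata

def normalize_filename(raw_bytes):
    """Decode an uploaded file name as UTF-8 and normalize it to NFC.

    The same visible name can arrive as different byte sequences ('é' as one
    code point, or as 'e' plus a combining accent); normalizing prevents
    identical-looking duplicate files on disk.
    """
    name = raw_bytes.decode("utf-8")  # raises UnicodeDecodeError if not UTF-8
    return unicodedata.normalize("NFC", name)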
It is an excellent idea. Being Hungarian, I'm pretty annoyed when I'm not allowed to use characters like áÉŰÖÜúÓÚŰÉÍí :)
There is a lot of software out there that has bugs when dealing with such file names, especially on Windows.
Update:
Example: I couldn't use the Android SDK (without creating a new user), because I had an é in my user name. I also ran into a similar problem with the Intel C++ compiler.
Software usually isn't tested properly with such file names. The Windows API still offers "ANSI"-encoded versions of its functions, and many developers don't seem to understand their potential problems. I also keep coming across web pages that mess up my name.
I'm not saying don't allow such file names; in fact, in the 21st century I would expect to be able to use such characters everywhere. But be prepared to run into problems.

How do you test your app for Iñtërnâtiônàlizætiøn? (Internationalization?)

How do you test your app for Iñtërnâtiônàlizætiøn compliance? I tell people to store the Unicode string Iñtërnâtiônàlizætiøn into each field and then see if it is displayed correctly on output.
... including output as a cell's content in Excel reports, in RTF for documents, in XML files, etc.
What other tests should be done?
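That check is easy to automate as a round-trip test. A sketch, where save_field and read_field stand in for whatever storage or rendering layer is under test:

TEST_STRINGS = [
    "Iñtërnâtiônàlizætiøn",   # Western European accents
    "שלום ירושלים",            # right-to-left Hebrew
]

def check_unicode_round_trip(save_field, read_field):
    """Write each test string through the system and compare what comes back."""
    for original in TEST_STRINGS:
        save_field("title", original)
        assert read_field("title") == original, f"corrupted: {original!r}"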
Added idea from @Paddy:
Also try a right-to-left language, e.g. שלום ירושלים ("[The] Peace of Jerusalem"). It should look like:
[image of the Hebrew string rendered right-to-left] (source: kluger.com)
Note: Stack Overflow renders this correctly. If the text does not match the image, then you have a problem with your browser, OS, or possibly a proxy.
Also note: you should not have to change or "set up" your already-running app to accept either the Western European characters or the Hebrew example. You should be able to just type those characters into your app and have them come back correctly in your output. In case you don't have a Hebrew keyboard lying around, copy and paste the examples from this question into your app.
Pick a culture where the text reads from right to left and set your system up for that - make sure that it reads properly (easier said than done...).
Use one of the three "pseudo-locales" available since Windows Vista:
The three pseudo-locales are for testing three kinds of locales:

Base: The qps-ploc locale is used for English-like pseudo-localizations. Its strings are longer versions of English strings, using non-Latin and accented characters instead of the normal script. Additionally, simple Latin strings should sort in reverse order with this locale.

Mirrored: qps-mirr is used for right-to-left pseudo data, which is another area of interest for testing.

East Asian: qps-asia is intended to utilize the large CJK character repertoire, which is also useful for testing.
Windows will start formatting dates, times, numbers, and currencies in a made-up pseudo-locale that looks enough like English that you can work with it, but is obviously wrong wherever you're not respecting the locale:
[Шěđлеśđαỳ !!!], 8 ōf [Μäŕςћ !!] ōf 2006
There is more to internationalization than unicode handling. You also need to make sure that dates show up localized to the user's timezone, if you know it (and make sure there's a way for people to tell you what their time zone is).
One handy fact for testing time zone handling is that there are two time zones (Pacific/Tongatapu and Pacific/Midway) that are actually 24 hours apart. So if time zones are handled properly, the dates should never be the same for users in those two zones, for any timestamp. If you use other time zones in your tests, results may vary depending on the time of day your test suite runs.
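That property is easy to pin down as a test. A sketch using Python's zoneinfo (the zone names are standard IANA identifiers):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def test_dates_differ_across_the_date_line():
    """Pacific/Tongatapu (UTC+13) and Pacific/Midway (UTC-11) are 24 hours
    apart, so any single instant falls on two different calendar dates."""
    instant = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)
    tonga = instant.astimezone(ZoneInfo("Pacific/Tongatapu")).date()
    midway = instant.astimezone(ZoneInfo("Pacific/Midway")).date()
    assert tonga != midway   # 2024-03-02 vs 2024-03-01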
You also need to make sure dates and times are formatted in a way that makes sense for the user's locale, or failing that, that any potential ambiguity in the rendering of dates is explained (e.g. "05/11/2009 (dd/mm/yyyy)").
"Iñtërnâtiônàlizætiøn" is a really bad string to test with since all the characters in it also appear in ISO-8859-1, so the string can work completely without any Unicode support at all! I've no idea why it's so commonly used when it utterly fails at its primary function!
Even Chinese or Hebrew text isn't a good choice (though right-to-left is a whole can of worms by itself) because it doesn't necessarily contain anything outside 3-byte UTF-8, which curiously was a very large hole in MySQL's default UTF-8 implementation (which is limited to 3-byte chars), until it was fixed by the addition of the utf8mb4 charset in MySQL 5.5. These days one of the more common uses of >3-byte UTF-8 is Emojis like these: [💝🐹🌇⛔]. If you don't see some pretty little coloured pictures between those brackets, congratulations, you just found a hole in your Unicode stack!
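You can check which UTF-8 "tier" a test string actually exercises in one line per character (Python shown; any language with explicit encoding works the same way):

# Bytes each character occupies in UTF-8:
len("n".encode("utf-8"))    # 1: plain ASCII, tests nothing
len("ñ".encode("utf-8"))    # 2: also in ISO-8859-1, a weak Unicode test
len("中".encode("utf-8"))   # 3: BMP; still fits MySQL's old 3-byte utf8
len("💝".encode("utf-8"))   # 4: outside the BMP; needs utf8mb4 in MySQL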
First, learn The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Make sure your application can handle Turkish. It has several quirks that break applications that assume English rules. Because there are four kinds of letter "i" (dotted and dot-less, upper and lower case), applications that assume uppercase(i) => I will break when using Turkish rules, where uppercase(i) => İ.
A common thing to do is check whether the user typed the command "exit" by using lowercase(userInput) == "exit" or uppercase(userInput) == "EXIT". This works as expected under English rules but will fail under Turkish rules, where "exıt" != "exit" and "EXİT" != "EXIT". To do this correctly, one must use the case-insensitive comparison routines built into all modern languages.
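A quick Python demonstration of the trap. Note that even casefold(), the built-in caseless-matching routine, applies the default Unicode mappings, so fully correct Turkish handling needs a locale-aware library such as PyICU:

# Naive comparison that passes under English rules:
assert "EXIT".lower() == "exit"

# Turkish input slips straight through it:
assert "EXİT".lower() != "exit"  # 'İ' lowercases to 'i' + combining dot (U+0307)
assert "exıt".lower() != "exit"  # dotless 'ı' is a distinct letter

# casefold() is the right built-in for caseless matching in general:
assert "EXIT".casefold() == "exit".casefold()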
I was thinking about this question from a completely different angle. I can't recall exactly what we did, but on a previous project I think we wound up changing the Regional Settings (in the Regional and Language Options control panel?) to help us ensure the localized strings were working.

Source of data for "official" country/region list

Lately we've started getting issues with an outdated country/region list being presented to users of our web application.
We currently have a few DB tables that store localized country names along with their regions (states). However, as the planet goes, that list is in constant evolution, and it's proving to be a pain to maintain: some regions are deleted, some are merged, and existing data needs to be updated all the time.
What are the best practices, if any exist, when it comes to dealing with multi-locale country/region lists?
Is there an authoritative source or a standard in place? I know of ISO 3166, but its list isn't exactly DB-friendly ... plus it's not fully localized.
An ideal solution would simply allow us to "sync" to it, preferably in multiple languages. It would preferably be free or subscription-based, with a history of what changed so we could update our data (a.k.a. tblAddress).
Thanks!
GeoNames is pretty accurate in this respect, and they update regularly.
http://www.geonames.org/export/
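A sketch of "syncing" from that export in Python; countryInfo.txt is tab-separated with # comment lines, and the column positions below are taken from its header, so re-check them against the current file:

import csv
import urllib.request

URL = "https://download.geonames.org/export/dump/countryInfo.txt"

def fetch_countries():
    """Return {ISO 3166 alpha-2 code: country name} from the geonames dump."""
    with urllib.request.urlopen(URL) as response:
        lines = response.read().decode("utf-8").splitlines()
    rows = csv.reader((line for line in lines if not line.startswith("#")),
                      delimiter="\t")
    # Columns 0 and 4 are the ISO code and country name per the file header.
    return {row[0]: row[4] for row in rows if len(row) > 4}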
There is no such thing. This is a political issue, which you can only solve in the context of your own application. Deciding to use ISO 3166 could be the easiest to defend. I know of issues with at least:
China/Taiwan
Israel/Palestine
China/Tibet
Greece/Macedonia
The ISO lists here are DB friendly, though they only include short names and codes.
This one looks very good: Multiple languages, update option, database independent file format for import, countries/regions/cities information, and some other features you might use or not.
And it's quite affordable if you need it for only one server.
You can try CLDR
http://cldr.unicode.org/
This set of data is maintained by the Unicode organization. It is updated regularly and the data is versioned so it is easy for you to manage the state of your list.
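CLDR is also easy to consume from code; Python's Babel library, for instance, ships the CLDR territory names, and updating the package updates the list (a sketch; the commented values are what current CLDR data is expected to return):

from babel import Locale

# Localized country names straight out of CLDR, per display locale:
print(Locale("en").territories["DE"])   # Germany
print(Locale("fr").territories["DE"])   # Allemagne
print(Locale("he").territories["DE"])   # גרמניה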
Hi! You can find a free dump of all countries with their respective continents at https://gist.github.com/kamermans/1441495; it's very easy to use. Just download the dump and load it into your database.
Well, wait, do you just want an up-to-date list of countries? Or do you need to know that Country X has split into Country Y and Country Z? Because I'm not aware of any automated way to get the latter. Even updates to the ISO databases are distributed as PDFs (you're on your own for implementing the change).
The EU maintains data about Local Administrative Units (LAUs) which can be downloaded as hierarchical XLS files in several languages.
United Nations Statistics Division, Standard country or area codes for statistical use (M49).
Look for "Search and Download: Full View" on page left. That leads here.
Groups countries by continent, sub-continental region, Least Developed Countries, and so on.
If you cannot import the Excel version, note that the CSV has unquoted fields and a comma in one country name that will break your import ("Bonaire, Sint Eustatius and Saba"). Perhaps open it first in LibreOffice or similar, fix the broken country name, and shunt its other right-most columns back into place. Then set all cells to type Text, Save As CSV with [Edit Filter Settings] checked in the Save As dialog, and make sure the string delimiter is set to ", as it should be by default.
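If you would rather repair it in code than in LibreOffice, a Python sketch of the fix; it assumes the only damage is commas inside a name pushing extra columns to the right, and both the expected column count and the name-column index are assumptions to verify against the real file:

EXPECTED_COLS = 15   # assumption: check against the actual M49 header row
NAME_COL = 3         # assumption: index of the country-name column

def repair_row(row):
    """Fold back the fields that an unquoted comma split apart."""
    overflow = len(row) - EXPECTED_COLS
    if overflow > 0:
        merged = ", ".join(row[NAME_COL:NAME_COL + overflow + 1])
        row = row[:NAME_COL] + [merged] + row[NAME_COL + overflow + 1:]
    return row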