I'm working on an internationalized database application that supports multiple locales in a single instance. When international users sort data in the applications built on top of the database, the database theoretically sorts the data using a collation appropriate to the locale associated with the data the user is viewing.
I'm trying to find sorted lists of words that meet two criteria:
the sorted order follows the collation rules for the locale
the words listed will allow me to exercise most / all of the specific collation rules for the locale
I'm having trouble finding such trusted test data. Are such sort-testing datasets currently available, and if so, what / where are they?
"words.en.txt" is an example text file containing American English text:
Andrew
Brian
Chris
Zachary
I am planning on loading the list of words into my database in randomized order, and checking to see if sorting the list conforms to the original input.
Because I am not fluent in any language other than English, I do not know how to create sample datasets like the following one in French (call it "words.fr.txt"):
cote
côte
coté
côté
The French prefer diacritical marks to be ordered from right to left (from the end of the word). If you sorted that list using code-point order, it would likely come out like this (which is an incorrect collation):
cote
coté
côte
côté
Thank you for the help,
Chris
Here's what I found.
The Unicode Common Locale Data Repository (CLDR) is pretty much the authority on collations for international text. I was able to find several lists of words conforming to the rules found in CLDR in the ICU Project's ICU Demonstration - Locale Explorer tool. It turns out that ICU (International Components for Unicode) uses CLDR rules to help solve common internationalization problems. It's a great library; check it out.
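If it helps, the "load in random order, sort, compare" check from the question can be prototyped against ICU directly, since ICU4J's Collator applies those CLDR rules. The sketch below assumes the ICU4J library and the words.fr.txt example file; it uses the fr-CA locale because current CLDR applies the accents-from-the-end rule to Canadian French by default:

import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CollationCheck {
    public static void main(String[] args) throws Exception {
        // Expected order, hand-curated in words.fr.txt (see the question above)
        List<String> expected = Files.readAllLines(Paths.get("words.fr.txt"), StandardCharsets.UTF_8);

        // Shuffle, then sort with the locale's CLDR-based collator
        List<String> shuffled = new ArrayList<>(expected);
        Collections.shuffle(shuffled);
        shuffled.sort(Collator.getInstance(ULocale.CANADA_FRENCH));

        // The database under test should produce the same order as the collator
        System.out.println(shuffled.equals(expected) ? "PASS" : "FAIL: " + shuffled);
    }
}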
In some cases, it was useful to construct some nonsense terms by reverse-engineering the CLDR rules directly. Search engines available in the United States were not suited for finding foreign terms with the case/diacritic/other nuances I was interested in for this testing (in retrospect, I wonder if international search engines would have been better-suited for this task).
As per the documentation:
The generic language support only stems words to themselves, queries in these languages will not include variations of words based on their meanings in the results.
xdmp:document-insert("/lang/test.xml", <el xml:lang="fr">chats</el>)
Considering I do not have a license for the French language, will the word "chats" be stemmed to "chat" in the term list? Or will it be indexed as "chats", i.e. without any variations?
You can easily check this yourself using cts:stem:
cts:stem("chats", "fr")
Without the necessary license it simply returns chats.
HTH!
No license for French means no stemming for French.
I'm actually kind of confused by the question. If it did work without the license, then what would the license get you?
I would like a Lucene document that contains the word cheeseburger to come up when a user searches for burger. I can see that I will probably need a custom analyzer to break this compound word into cheese and burger. However, breaking words up may also bring irrelevant results.
For example: if, when indexing production, we index product and ion as well, then when the user searches for ion, documents containing production will come out, which is not relevant.
So a simple word breaker won't cut it. I need a way of knowing that cheeseburger is associated with burger and cheese, but that production is not associated with ion.
Is there a more intelligent process to achieve this?
Does this have a name, in the same way that reducing words to their root form is called stemming?
Depending on how accurate you want your synonymy to be, you might need to look into approaches such as Latent Semantic Analysis (LSA) and its variants, such as LDA. A simpler approach would be to use an ontology such as WordNet to augment your searches. A WordNet Lucene index is available. However, if your scenario includes domain-specific vocabulary, then you might need to generate a "mapping" ontology.
You should look at DictionaryCompoundWordTokenFilter which uses a brute-force algorithm to split compound nouns based on a dictionary.
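A minimal custom analyzer wiring that filter in might look like the sketch below. It assumes a recent Lucene release (the 3.x/4.x constructors also took a Version argument, and the CharArraySet package has moved between versions); the dictionary entries are purely illustrative:

import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class CompoundAwareAnalyzer extends Analyzer {
    // Only words in this dictionary can be split out of a compound token,
    // so "production" stays whole as long as "ion" is not listed here.
    private final CharArraySet dictionary =
            new CharArraySet(Arrays.asList("cheese", "burger"), true);

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream result = new DictionaryCompoundWordTokenFilter(source, dictionary);
        return new TokenStreamComponents(source, result);
    }
}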
In most cases you can simply use wildcard queries with a leading wildcard, such as *burger. You only have to enable support for leading wildcards on your query parser:
// Assumes the Lucene 3.x QueryParser(matchVersion, field, analyzer) constructor;
// LuceneVersion.getVersion() and searchedAttributes come from the surrounding code.
QueryParser parser = new QueryParser(LuceneVersion.getVersion(), searchedAttributes, analyzer);
parser.setAllowLeadingWildcard(true);
Take care:
Leading wildcards might slow your search down.
If you need a more specific solution, I would suggest going with stemming. It's really a matter of finding the right analyzer.
There are stemming implementations for several languages e.g. the SnowballAnalyzer (http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/snowball/SnowballAnalyzer.html).
Best regards,
Chris
Getting associations by looking at the word is not going to scale to other words. For example, you cannot know "whopper" is associated with burger and "big-mac" is associated with cheese just by looking at the words. To make the search aware of the associations, you probably need a database of associations like "A is a B" or "A contains B". (As Mikos has mentioned, I think WordNet provides such a database.) Then, when you see B in a query, you translate the query so that it also searches for A.
I think the underlying question is -- how big is the collection you are indexing? If you are indexing some collection where all of the synonyms and related words are already known, then the index can just include the synonyms and related words directly, like 'cheeseburger' including the related words 'cheese' and 'burger'. (An approach successfully used in the LOINC standard medical terms Lucene index.)
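One way to sketch that "include the related words directly" idea in Lucene is a hand-maintained SynonymMap applied at index time via SynonymGraphFilter (available in recent Lucene releases). The class and mappings below are illustrative, not part of LOINC or any real terminology list:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

public class RelatedWordsAnalyzer extends Analyzer {
    private final SynonymMap relatedWords;

    public RelatedWordsAnalyzer() throws IOException {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        // "cheeseburger" is indexed as itself plus its known parts;
        // "production" gets no mapping, so a search for "ion" will not match it.
        builder.add(new CharsRef("cheeseburger"), new CharsRef("cheese"), true);
        builder.add(new CharsRef("cheeseburger"), new CharsRef("burger"), true);
        relatedWords = builder.build();
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream result = new SynonymGraphFilter(source, relatedWords, true);
        return new TokenStreamComponents(source, result);
    }
}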
If you are trying to solve the general problem for a whole human language (English, Chinese, etc.) then you have to move to some kind of semantic analysis as mentioned above.
It might be useful to talk with the subject matter experts of the area you are indexing to see how they search for terms -- what synonyms/related words do they use, do they have defined lists of synonyms/related words, do they need/use stemming, etc. This should give you some idea as to which approach (direct synonym/related-word inclusion or semantic analysis) you need to pursue.
I found the Naming Guidelines on MSDN, but is there any naming guideline for SQL Server databases from Microsoft?
The naming conventions used in SQL Server's AdventureWorks database demonstrate many best practices in terms of style.
To summarize:
Object names are easily understood
Table names are not pluralized ("User" table, not "Users")
Abbreviations are few, but allowed (e.g. Qty, Amt)
PascalCase is used exclusively, with the exception of certain column names (e.g. rowguid)
No underscores
Certain keywords are allowed (e.g. Name)
Stored procedures are prefaced with "usp"
Functions are prefaced with "ufn"
You can find more details here:
AdventureWorks Data Dictionary
Stored Procedures in AdventureWorks
Functions in AdventureWorks
One caveat: database naming conventions can be very controversial and most database developers I've met have a personal stake in their style. I've heard heated arguments over whether a table should be named "OrderHeader" or "OrderHeaders."
No, there isn't, but the practices in the link you provided are good to keep in mind.
With respect to naming stored procedures: do not prefix them with "sp_". You can read more about why in this link:
"Do not prefix stored procedures with sp_, because this prefix is reserved for identifying system-stored procedures."
I don't know what "best practices in terms of style" in the answer by @8kb (at the time of writing) means. Certainly some of the listed items ("Table names are not pluralized", "No underscores", etc.) are mere style choices, which are obviously subjective. I would have thought the personal preferences of the documentation team lead would be the greatest factor here.
As regards heuristics in SQL in general (as opposed to proprietary SQL such as T-SQL), there is but one book on the subject: Joe Celko's SQL Programming Style. Many of the choices for SQL Server's AdventureWorks database conflict with Celko's guidelines.
Celko's naming convention is based on the international standard ISO 11179, which, for example, specifies that a delimiting character (such as an underscore) should be used to separate elements in a name. Other style choices are similarly backed up by research, e.g. using exclusively lower-case letters for column names to aid scanning by the human eye. No doubt there are subjective personal preferences in there too, but they are based on many years of experience out in the field.
On the plus side, things have improved in the SQL Server docs in recent years, e.g. SQL keywords capitalized, semicolons to separate statements, etc. AdventureWorks is a vast improvement on Northwind and pubs. Now why can't the scripting feature in Management Studio spit out code that is a little easier on the eye?!
If you were to build a SQL Server naming conventions guide, I recommend starting with Konstantin's document on GitHub.
How do you test your app for Iñtërnâtiônàlizætiøn compliance? I tell people to store the Unicode string Iñtërnâtiônàlizætiøn into each field and then see if it is displayed correctly on output, including output as a cell's content in Excel reports, in RTF format for docs, in XML files, etc.
What other tests should be done?
Added idea from @Paddy:
Also try a right-to-left language. E.g., שלום ירושלים ([The] Peace of Jerusalem). It should look like:
[image of the Hebrew phrase rendered right-to-left] (source: kluger.com)
Note: Stack Overflow is implemented correctly. If the text does not match the image, then you have a problem with your browser, OS, or possibly a proxy.
Also note: You should not have to change or "set up" your already-running app to accept either the Western European characters or the Hebrew example. You should be able to just type those characters into your app and have them come back correctly in your output. In case you don't have a Hebrew keyboard lying around, copy and paste the examples from this question into your app.
Pick a culture where the text reads from right to left and set your system up for that - make sure that it reads properly (easier said than done...).
Use one of the three "pseudo-locales" available since Windows Vista:
The three different pseudo-locales are for testing three kinds of locales:
Base: The qps-ploc locale is used for English-like pseudo-localizations. Its strings are longer versions of English strings, using non-Latin and accented characters instead of the normal script. Additionally, simple Latin strings should sort in reverse order with this locale.
Mirrored: qpa-mirr is used for right-to-left pseudo data, which is another area of interest for testing.
East Asian: qps-asia is intended to utilize the large CJK character repertoire, which is also useful for testing.
Windows will start formatting dates, times, numbers, and currencies in a made-up pseudo-locale that looks enough like English that you can work with it, but is obvious enough when you're not respecting the locale:
[Шěđлеśđαỳ !!!], 8 ōf [Μäŕςћ !!] ōf 2006
There is more to internationalization than Unicode handling. You also need to make sure that dates show up localized to the user's timezone, if you know it (and make sure there's a way for people to tell you what their timezone is).
One handy fact for testing timezone handling is that there are two timezones (Pacific/Tongatapu and Pacific/Midway) that are actually 24 hours apart. So if timezones are being handled properly, the dates should never be the same for users in those two timezones for any timestamp. If you use any other timezones in your tests, results may vary depending on the time of day you run your test suite.
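A quick sanity check of that fact, as a sketch using java.time (the class name here is just for the example):

import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

public class TimeZoneSanityCheck {
    public static void main(String[] args) {
        Instant now = Instant.now();
        LocalDate tongatapu = now.atZone(ZoneId.of("Pacific/Tongatapu")).toLocalDate();
        LocalDate midway = now.atZone(ZoneId.of("Pacific/Midway")).toLocalDate();
        // Being 24 hours apart, the two zones should never report the same calendar date
        System.out.println(tongatapu + " vs " + midway + " -> same date? " + tongatapu.equals(midway));
    }
}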
You also need to make sure dates and times are formatted in a way that makes sense for the user's locale, or failing that, that any potential ambiguity in the rendering of dates is explained (e.g. "05/11/2009 (dd/mm/yyyy)").
"Iñtërnâtiônàlizætiøn" is a really bad string to test with since all the characters in it also appear in ISO-8859-1, so the string can work completely without any Unicode support at all! I've no idea why it's so commonly used when it utterly fails at its primary function!
Even Chinese or Hebrew text isn't a good choice (though right-to-left is a whole can of worms by itself) because it doesn't necessarily contain anything outside 3-byte UTF-8, which curiously was a very large hole in MySQL's default UTF-8 implementation (which is limited to 3-byte chars), until it was fixed by the addition of the utf8mb4 charset in MySQL 5.5. These days one of the more common uses of >3-byte UTF-8 is Emojis like these: [💝🐹🌇⛔]. If you don't see some pretty little coloured pictures between those brackets, congratulations, you just found a hole in your Unicode stack!
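If you want a supplementary-plane test string without depending on your source file's encoding, you can build one from code points. The probe below is only an illustration, using U+1F49D and U+1F439, each of which takes 4 bytes in UTF-8:

import java.nio.charset.StandardCharsets;

public class EmojiProbe {
    public static void main(String[] args) {
        // U+1F49D (heart with ribbon) and U+1F439 (hamster face)
        String probe = new String(Character.toChars(0x1F49D)) + new String(Character.toChars(0x1F439));
        byte[] utf8 = probe.getBytes(StandardCharsets.UTF_8);
        System.out.println(probe + " is " + utf8.length + " UTF-8 bytes"); // expect 8
        // Round-trip the probe through your own storage layer to find 3-byte-only holes;
        // decoding it here just shows the string survives an honest UTF-8 round trip.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(probe));
    }
}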
First, learn The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Make sure your application can handle Turkish. It has several quirks that break applications that assume English rules. Because there are four kinds of letter "i" (dotted and dot-less, upper and lower case), applications that assume uppercase(i) => I will break when using Turkish rules, where uppercase(i) => İ.
A common thing to do is check if the user typed the command "exit" by using lowercase(userInput) == "exit" or uppercase(userInput) == "EXIT". This works as expected under English rules but will fail under Turkish rules where "exıt" != "exit" and "EXİT" != "EXIT". To do this correctly, one must use case-insensitive comparison routines which are built into all modern languages.
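Here is a small illustration of both the breakage and the fix, using standard Java locale handling (the class name is just for the example):

import java.util.Locale;

public class TurkishI {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr", "TR");

        // Under Turkish rules the case round trip changes the letter entirely
        System.out.println("exit".toUpperCase(turkish));                 // EXİT
        System.out.println("EXIT".toLowerCase(turkish));                 // exıt
        System.out.println("EXIT".toLowerCase(turkish).equals("exit"));  // false

        // A locale-independent, case-insensitive comparison still behaves as intended
        System.out.println("EXIT".equalsIgnoreCase("exit"));             // true
    }
}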
I was thinking about this question from a completely different angle. I can't recall exactly what we did, but on a previous project I think we wound up changing the Regional Settings (in the Regional and Language Options control panel?) to help us ensure the localized strings were working.
Lately we've started getting issues with an outdated countries/regions list being presented to users of our web application.
We currently have a few DB tables to store localized country names along with their regions (states). However, as the planet goes, that list is in constant evolution, and it's proving to be a pain to maintain: some regions are deleted, some are merged, and existing data needs to be updated all the time.
What are the best practices, if any exist, when it comes to dealing with multi-locale countries/regions lists?
Is there an authoritative source or standard in place? I know of ISO 3166, but their list isn't exactly DB friendly ... plus it's not fully localized.
An ideal solution would simply allow us to "sync" to it, preferably in multiple languages. The solution would preferably be free or subscription based, with a history of what changed so we could update our data (i.e. tblAddress).
Thanks!
geonames is pretty accurate in this respect, and they update regularly.
http://www.geonames.org/export/
There is no such thing. This is a political issue, which you can only solve in the context of your own application. Deciding to use ISO 3166 could be the easiest to defend. I know of issues with at least:
China/Taiwan
Israel/Palestine
China/Tibet
Greece/Macedonia
The ISO lists here are DB friendly, though they only include short names and codes.
This one looks very good: Multiple languages, update option, database independent file format for import, countries/regions/cities information, and some other features you might use or not.
And it's quite affordable if you need it for only one server.
You can try CLDR
http://cldr.unicode.org/
This set of data is maintained by the Unicode organization. It is updated regularly and the data is versioned so it is easy for you to manage the state of your list.
Hi! You can find a free dump of all countries with their respective continents at https://gist.github.com/kamermans/1441495. It's easy to use: just download the dump and load it into your database.
Well, wait, do you just want an up-to-date list of countries? Or do you need to know that Country X has split into Country Y and Country Z? Because I'm not aware of any automated way to get the latter. Even updates to the ISO databases are distributed as PDFs (you're on your own for implementing the change).
The EU maintains data about Local Administrative Units (LAUs) which can be downloaded as hierarchical XLS files in several languages.
United Nations Statistics Division, Standard country or area codes for statistical use (M49).
Look for "Search and Download: Full View" on page left. That leads here.
Groups countries by continent, sub-continental region, Least Developed Countries, and so on.
If you cannot import the Excel version, note that the CSV has unquoted fields and a comma in one country name that will break your import ("Bonaire, Sint Eustatius and Saba"). Perhaps open it first in LibreOffice or whatever, fix the broken country name, and shunt its other right-most columns back into place. Then set all cells to type Text, save as CSV with [Edit Filter Settings] checked [x] in the Save As dialog, and make sure the string delimiter is set to ", as it should be by default.