Definitive country list for e-commerce applications - sql

I am looking for a list of countries for use in the development of an e-commerce app including:
Country Name,
Country Code,
Language,
etc.
While only the country name (and probably the country code) are really necessary, some of the other info may be nice (as long as there isn't too much!). I used to have a good list but I can't find it anymore.
Thanks.

ISO (the International Organization for Standardization) maintains a list of countries here.

Check ISO 3166-1 for country codes.
There are 3 sets of country codes:
ISO 3166-1 Alpha-2
ISO 3166-1 Alpha-3
ISO 3166-1 Numeric
For the list of all 3 sets, view here: http://en.wikipedia.org/wiki/ISO_3166-1
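As a rough sketch of how the three code sets could sit side by side in a table, here is a small example using Python's built-in sqlite3 module. The table name, column names and helper functions are assumptions made for illustration, and only four sample rows are loaded. The numeric code is stored as text so that leading zeros survive.

```python
import sqlite3

# A few ISO 3166-1 entries for illustration; a real table would hold the full list.
SAMPLE_COUNTRIES = [
    ("CH", "CHE", "756", "Switzerland"),
    ("DE", "DEU", "276", "Germany"),
    ("FR", "FRA", "250", "France"),
    ("US", "USA", "840", "United States of America"),
]

def create_country_table(conn):
    """Create and fill a hypothetical `country` table keyed by the alpha-2 code."""
    conn.execute(
        """
        CREATE TABLE country (
            alpha2_code  TEXT PRIMARY KEY CHECK (length(alpha2_code) = 2),
            alpha3_code  TEXT NOT NULL UNIQUE CHECK (length(alpha3_code) = 3),
            numeric_code TEXT NOT NULL UNIQUE CHECK (length(numeric_code) = 3),
            name         TEXT NOT NULL
        )
        """
    )
    conn.executemany("INSERT INTO country VALUES (?, ?, ?, ?)", SAMPLE_COUNTRIES)

def lookup_name(conn, code):
    """Return the country name for an alpha-2, alpha-3 or numeric code, or None."""
    row = conn.execute(
        "SELECT name FROM country "
        "WHERE alpha2_code = ? OR alpha3_code = ? OR numeric_code = ?",
        (code, code, code),
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
create_country_table(conn)
```

With this in place, lookup_name(conn, "CH"), lookup_name(conn, "CHE") and lookup_name(conn, "756") all resolve to the same row.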

Wikipedia, as usual, has a pretty definitive list. It's not in a readily computer-readable form, but it shouldn't take long to massage it, if you can't find anything else:
ISO_3166-1 (country codes)
List of sovereign states (countries and their languages)

There are several options for "country code", but ISO is probably best.
Language will be difficult because there are often many per country. Take Switzerland, where I am: do you support French/German only? Do you include Italian (many sites don't)? And what about the handful of Romansh speakers...?
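One way to model "many languages per country" is a one-to-many mapping plus a rule for choosing among them. The sketch below is only an illustration: the mapping holds a few sample entries (ISO 639-1 language codes keyed by ISO 3166-1 alpha-2 country codes), and the function name, the ordering and the "en" fallback are assumptions.

```python
# Languages per country as ISO 639-1 codes, keyed by ISO 3166-1 alpha-2 code.
# Only a few illustrative entries; the order is a hypothetical site preference.
COUNTRY_LANGUAGES = {
    "CH": ["de", "fr", "it", "rm"],
    "DE": ["de"],
    "FR": ["fr"],
}

def pick_language(country, supported, default="en"):
    """Return the first language of `country` that the shop supports, else `default`."""
    for lang in COUNTRY_LANGUAGES.get(country, []):
        if lang in supported:
            return lang
    return default
```

For a shop that only supports French and English, pick_language("CH", {"fr", "en"}) returns "fr", while a shop with English only falls back to "en".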

http://cldr.unicode.org/ - the Unicode Common Locale Data Repository, a standard multi-language database that includes country lists and other localizable data.

Here is the curated and regularly updated country list data available in CSV and JSON formats (ISO 3166-1):
https://datahub.io/core/country-list
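A minimal sketch of reading such a CSV into a lookup table with Python's csv module follows. It assumes the file has a header row with "Name" and "Code" columns; check the actual file before relying on that. The sample text is made up for the example and stands in for the downloaded file.

```python
import csv
import io

def parse_country_csv(text):
    """Map country code -> name from CSV text with `Name` and `Code` columns."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Code"]: row["Name"] for row in reader}

# Stand-in for the downloaded file; quoted fields may contain commas.
sample = 'Name,Code\nSwitzerland,CH\n"Korea, Republic of",KR\n'
countries = parse_country_csv(sample)
```

Using csv.DictReader rather than splitting on commas matters here, because some country names contain commas themselves.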
If you need more comprehensive country codes, e.g., ISO3166-1-Alpha-2, ISO3166-1-Alpha-3, ITU dialing codes, ISO 4217 currency codes, and many others:
https://datahub.io/core/country-codes
There is a data set for language codes as well: ISO language codes (639-1 and 639-2) and IETF language types:
https://datahub.io/core/language-codes
For more reference data you can take a look at:
https://datahub.io/awesome/reference-data
Good luck!

Related

SQL DIFFERENCE function with names bringing too many results

I have a function that uses the SQL DIFFERENCE function to see if the name of a client is similar to a client already in the database
SELECT ID FROM People p
WHERE DIFFERENCE(p.FullName, #fullName) = 4
Here #fullName is a variable passed to the function. The issue I'm having is that if I pass "pedro sanchez" as a parameter, the query returns all the Peters in the database, and if I enter "pablo sanchez", it returns the record "PEOPLE'S CREDIT UNION".
As I understand it, the DIFFERENCE function should return 4 only when the two strings are almost identical, but the results I'm getting say otherwise.
Is there a way to make the DIFFERENCE function stricter about resemblance, or is there another approach to finding similar names?
Difference() is based on soundex(), which in turn -- to be frank -- is a lousy system for comparing strings. Let me add a caveat: it is pretty good for the purpose it was designed for, which is matching last names of people in English. You can read about the rules here and you can try it out here. Using the latter link, you can see that "Pablo" and "People" have the same code, P-140 (and likewise "Pedro" and "Peter" are both P-360).
Soundex keeps the first letter and then encodes only the consonants, basically the first three that match the list it cares about, giving a four-character code. (Some languages, such as Hawaiian and other Polynesian languages, are rather light in consonants. One assumes the designers were not thinking about names in such languages.)
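To make the collisions concrete, here is a small sketch of classic American Soundex in Python. It follows the textbook rules and works on a single word; SQL Server's SOUNDEX may differ in edge cases, so treat it as an illustration rather than a replica.

```python
# Consonant groups of classic Soundex; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the four-character Soundex code of a single word."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]
```

With this sketch, soundex("Pedro") and soundex("Peter") both give "P360", and soundex("Pablo") and soundex("People") both give "P140", which matches the kind of false positives described in the question.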
When you are looking for proximity among written strings, Levenshtein distance is a common metric. Unfortunately, SQL Server does not have this functionality built-in, but you can easily find implementations on the web. For most real applications, Levenshtein distance is too slow. Happily, the functionality of the full text search component is usually sufficient for most purposes.
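For reference, a minimal Levenshtein distance sketch in Python, using the usual two-row dynamic programming table. The function name is arbitrary, and this is a plain implementation rather than anything tuned for speed.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

Each comparison costs time proportional to the product of the two string lengths, which is why running it against every row of a large table gets slow.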

Relative search in Lucene (not geo-spatial search)

I have only "Europe" indexed, along with some related data. When someone searches for "Germany", nothing is indexed specifically for Germany, but logically I could return the results under Europe rather than nothing at all. Is there any way to do this? Does Lucene have any supporting libraries which can do this?
I don't want to use geo-spatial search, so how can we achieve this?
I think that would just work out of the box by using a multi-valued field. You can have an indexed field which contains geo information (let's call it "place") such as Munich, Bavaria, Germany, Europe, World or Nice, French Riviera, France, Europe, World. Then, if you are looking for something in Bavaria, just run the query:
+text:something +place:(Bavaria Germany Europe World)
This will make all documents which have "something" in their text field appear in the result set, and boost documents depending on how far they are from Bavaria.
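The list of place terms has to come from somewhere. A minimal sketch of building it from a parent lookup is shown below, in Python. The PARENT table and both function names are hypothetical; only the resulting query string follows the syntax used above.

```python
# Hypothetical parent links; a real application would load these from its own data.
PARENT = {
    "Munich": "Bavaria",
    "Bavaria": "Germany",
    "Germany": "Europe",
    "Nice": "French Riviera",
    "French Riviera": "France",
    "France": "Europe",
    "Europe": "World",
}

def place_chain(place):
    """Return the place followed by all of its ancestors, nearest first."""
    chain = [place]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def build_query(text, place):
    """Build a query string requiring `text` and at least one place term."""
    # Multi-word places are quoted so they are searched as phrases.
    terms = ['"{}"'.format(p) if " " in p else p for p in place_chain(place)]
    return "+text:{} +place:({})".format(text, " ".join(terms))
```

The same place_chain can be used at index time to fill the multi-valued field, so a document tagged only with "Europe" carries Europe and World and still matches a search that starts from Germany.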

String Pattern Matching for Limited, Ltd, Incorporated, Inc, Etc

We're doing a LOT of work towards trying to reconcile about 1,000 duplicate manufacturer names and 1,000,000 duplicate part numbers. One thing that has come up is how to "match" things like "Limited" vs. "Ltd." vs. "Ltd"
The purpose is for the application to reconcile these matched items into a standard format. So:
ACME Ltd.
ACME Limited
ACME Ltd
Should all be reconciled into ACME Ltd.
This will also be used to prevent entering additional duplicates in the future.
Any suggestions on how to accomplish this pattern matching in SQL Server? Any known algorithms to find items with mapped equivalencies, etc...?
Thanks!
Eric.
How about a table that lists what you want in one column and variations in the next?
Standard   Variation
Ltd        Limited
Ltd        Ltd.
St         Street
St         Str.
Then, if you find a match on the second column, you change it to the first. It may take several iterations, as you find other alternatives.
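The lookup idea can be sketched outside the database as well. The Python below is only an illustration: the mapping entries, the choice of "Ltd." and "Inc." as the standard forms (taken from the question) and the function name are assumptions.

```python
# Variation (lower-case, without trailing period) -> standard form.
CANONICAL = {
    "limited": "Ltd.",
    "ltd": "Ltd.",
    "incorporated": "Inc.",
    "inc": "Inc.",
}

def normalize(name):
    """Replace each known variation in `name` with its standard form."""
    out = []
    for token in name.split():
        key = token.rstrip(".").lower()
        out.append(CANONICAL.get(key, token))
    return " ".join(out)
```

With these entries, "ACME Ltd.", "ACME Limited" and "ACME Ltd" all normalize to "ACME Ltd.". Comparing normalized names before an insert is one way to catch future duplicates.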
Using SQL Server Full Text Search you can use synonyms:
"For each full-text language, SQL Server also provides a file in which you can optionally define language-specific synonyms to extend the scope of search queries (a thesaurus file)."
In your case you could add a section like the following:
<expansion>
<sub>Limited</sub>
<sub>Ltd</sub>
<sub>Ltd.</sub>
</expansion>
Here is a link that goes into more detail on how to modify the thesaurus file. This may work for what you are trying to do...
SQL Server also offers some limited pattern matching by using LIKE. I would recommend looking over the options it offers to see if they will be sufficient for your needs.
If LIKE is insufficient, you can always look at creating a CLR stored procedure or UDFs that will allow you to use regular expressions. This will allow you to match MUCH more complex patterns...

Localize dbms demo data

I'm developing a windows mobile application which should work in multiple languages (English, German, French, Russian).
This application is about to be shown to customers (Germans, Russians,...) and we would like to generate data depending on the culture it is set up for.
So: has anybody thought of a way to create data which is then inserted into the dbms at runtime?
For example: the VAT description for the English version reads "VAT 17.5%" with value 17.5, the German version "Mehrwertsteuer 19%" with value 19, the French version "TVA 19.6%" with value 19.6
Thanks in advance
EDIT
I admit I was not very clear. I need a set of data to be localized. I need a mechanism which somehow reads this "prepared" localized data and inserts it into the dbms.
A second thought of mine would be to use an XML file which has the same structure for all the languages (but of course different values), e.g.
datafile.en-US.xml
datafile.de-DE.xml
What do you think about this?
I'm not quite sure what your aim is, so I could be mistaken here. Anyway, if you are planning to distribute your Windows Mobile client application across various countries, with one language version per country, I would suggest using resource files instead of a SQL DB. You could store patterns like "VAT {0}" and "TVA {0}" and format them at runtime, preserving the valid cultural format (the details depend on the programming language; a C#/.NET example follows).
// "P" is the percent format specifier; it multiplies the value by 100, so store 0.175 for 17.5%
var vat = string.Format(vatPatternStringFromResources, vatValueFromResources.ToString("P"));
If you still need the VAT value in SQL for reference, you can simply add one column that holds either a foreign key to a VAT table or the VAT value itself as a decimal.
Update on different VAT values
The problem is that VAT values differ not only by country but also by what you purchase, so they need to be stored in a configurable way. If you want to go with a SQL DB, you could use an additional VAT table with a primary key spanning two columns: CountryID (an FK to the Country table) and RateID (an integer), which together uniquely identify a given VAT rate for the country.
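A rough sketch of that two-column key, using Python's built-in sqlite3 module so it can be run as-is. The table and column names, the use of "GB" for the English example and the rate_id value 1 are assumptions; the descriptions and values are the ones from the question.

```python
import sqlite3

def create_vat_schema(conn):
    """Create hypothetical country and vat_rate tables with sample rows."""
    conn.executescript(
        """
        CREATE TABLE country (
            country_id TEXT PRIMARY KEY,
            name       TEXT NOT NULL
        );
        CREATE TABLE vat_rate (
            country_id   TEXT    NOT NULL REFERENCES country(country_id),
            rate_id      INTEGER NOT NULL,
            description  TEXT    NOT NULL,
            rate_percent REAL    NOT NULL,
            PRIMARY KEY (country_id, rate_id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO country VALUES (?, ?)",
        [("GB", "United Kingdom"), ("DE", "Germany"), ("FR", "France")],
    )
    conn.executemany(
        "INSERT INTO vat_rate VALUES (?, ?, ?, ?)",
        [
            ("GB", 1, "VAT 17.5%", 17.5),
            ("DE", 1, "Mehrwertsteuer 19%", 19.0),
            ("FR", 1, "TVA 19.6%", 19.6),
        ],
    )

def vat_for(conn, country_id, rate_id=1):
    """Return (description, rate_percent) for the given key, or None."""
    return conn.execute(
        "SELECT description, rate_percent FROM vat_rate "
        "WHERE country_id = ? AND rate_id = ?",
        (country_id, rate_id),
    ).fetchone()

conn = sqlite3.connect(":memory:")
create_vat_schema(conn)
```

Further rates for the same country (reduced rates, for example) would simply use other rate_id values, and the composite key rejects a second row with the same pair.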

Searching and sorting by language

I am testing Lucene.NET for our searching requirements, and I've got a couple of questions.
We have documents in XML format. Every document contains multi-language text. The number of languages and the languages themselves vary from document to document. See example below:
<document>This is a sample document, which is describing a <word lang="de">tisch</word>, a <word lang="en">table</word> and a <word lang="en">desk</word>.</document>
The keywords of a document are tagged with a special element and language attribute.
When I am creating lucene index I extract the text content from the XML and pairs of language and keyword (I am not sure if I have to), like this:
This is a sample document, which is describing a tisch, a table and a desk.
de - tisch
en - table
en - desk
I don't know exactly how to create an index that will let me search for, for example:
- all the documents that contain the word tisch in German (and not documents which contain the word tisch in other languages).
I also want to specify sorting at runtime:
I want to sort by user specified language order (depending on a user interface). For example, if we have two documents:
<document>This is a sample document, which is describing a <word lang="de">tisch</word>.</document>
<document>This is a another sample document, which is describing a <word lang="en">table</word>.</document>
and a user on an English interface searches for "tisch OR table", I want to get the second result first.
Any information or advice is appreciated.
Many thanks!
You have a design decision to make, where the options are:
Use a single index, where each document has one field for each language it uses, or
Use M indexes, M being the number of languages in the corpus.
If you use the multi-index approach, it will be easier to restrict a search to a specific language or set of languages: just search the indexes for those languages and skip the others. Sorting by language also becomes easier. Therefore, if you do not need an "AND" search that requires keywords from different languages to appear in the same document, I would suggest the M-index approach.
Based on your example, I assume that the part of the documents not specially tagged is in English. If this is so, you can add the document text to the English index as a separate field; the other indexes need only store a document id, which will make them lighter.
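To illustrate the M-index idea without tying it to a particular Lucene version, here is a toy sketch in Python with one small inverted index per language. The class and method names are made up for the example and this is not the Lucene API. It only shows how restricting and ordering by language falls out of the layout.

```python
from collections import defaultdict

class MultiLanguageIndex:
    """Toy stand-in for one inverted index per language (not the Lucene API)."""

    def __init__(self):
        # language -> keyword -> set of document ids
        self.indexes = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, keywords):
        """Index a document; `keywords` is an iterable of (language, word) pairs."""
        for lang, word in keywords:
            self.indexes[lang][word.lower()].add(doc_id)

    def search(self, words, language_order):
        """Return ids matching any word, ordered by the first language that matched."""
        results = []
        for lang in language_order:
            index = self.indexes.get(lang, {})
            for word in words:
                for doc_id in sorted(index.get(word.lower(), ())):
                    if doc_id not in results:
                        results.append(doc_id)
        return results

idx = MultiLanguageIndex()
idx.add(1, [("de", "tisch")])
idx.add(2, [("en", "table")])
```

For the two sample documents in the question, idx.search(["tisch", "table"], ["en", "de"]) returns [2, 1], so the English match comes first on an English interface, and idx.search(["tisch"], ["en"]) returns nothing because tisch is only indexed as German.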