How to implement support for different language data? - sql-server-express

At the moment my database only supports English text.
For example, I have a title and a description column, and I want them to hold Russian or Hebrew text. But when I enter text in those languages, it gets converted to question marks.
Any solutions?
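Question marks on insert are the classic symptom of Unicode text passing through a non-Unicode (varchar) column or a string literal missing the N prefix: SQL Server converts the text to the column's single-byte code page and replaces every unrepresentable character with "?". The usual fix is to declare the columns as nvarchar and write literals as N'...'. A minimal Python sketch of the lossy conversion itself, using cp1252 as a stand-in for a typical Western default collation's code page:

```python
# Russian text forced through a single-byte Western code page,
# the way a varchar column (or a literal without the N prefix) would store it.
title = "Привет"
lossy = title.encode("cp1252", errors="replace").decode("cp1252")
print(lossy)  # every Cyrillic letter comes back as "?"
```

The same round trip with an nvarchar column (UTF-16 storage) is lossless, which is why switching the column type and prefixing literals with N normally resolves this.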

Related

Wikipedia API Extraction of abstracts in 2 languages

I am trying to connect 2 API queries.
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro=&explaintext=&titles=Albert+Einstein&format=json
Where I search for article descriptions and
https://en.wikipedia.org/w/api.php?action=query&prop=langlinks&format=json&lllang=de&titles=Companion%20dog
Where I retrieve the name of the article in another language (here German).
Is there a way to connect them to retrieve description data both in English and German?
I have tried connecting them via generators, but I can't seem to work out how to apply them here.
I also tried inputting another query after extracting names in 2 languages (searching for descriptions). However, the names are sometimes formatted so that I cannot reuse them in the query.
No. The description is a snippet from the start of the article. If you want a German description, you need to get it from the German Wikipedia (i.e. a different API endpoint).
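Following that answer, the langlinks lookup and the extract can still be chained: one request to the English wiki returns both the English extract and the German title, and a second request to de.wikipedia.org fetches the German extract. A hedged Python sketch of that two-step flow (standard library only, untested against the live API, error handling omitted):

```python
import json
import urllib.parse
import urllib.request

API = "https://{lang}.wikipedia.org/w/api.php"

def build_url(lang, **params):
    """Build a MediaWiki API URL for the given language wiki."""
    params.setdefault("format", "json")
    return API.format(lang=lang) + "?" + urllib.parse.urlencode(params)

def get_json(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def extract_in_two_languages(title, other_lang="de"):
    # Step 1: English extract plus the cross-language link, in one request.
    data = get_json(build_url("en", action="query",
                              prop="extracts|langlinks",
                              exintro="", explaintext="",
                              lllang=other_lang, titles=title))
    page = next(iter(data["query"]["pages"].values()))
    result = {"en": page.get("extract")}
    links = page.get("langlinks", [])
    if links:
        # Step 2: extracts only come from the wiki you query, so ask
        # the other-language wiki directly, using the linked title.
        other_title = links[0]["*"]
        data = get_json(build_url(other_lang, action="query",
                                  prop="extracts", exintro="",
                                  explaintext="", titles=other_title))
        page = next(iter(data["query"]["pages"].values()))
        result[other_lang] = page.get("extract")
    return result
```

Using the linked title from the langlinks response directly (rather than a reformatted name) avoids the title-formatting problem the question mentions.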

MarkLogic generic language support

As per the documentation:
The generic language support only stems words to themselves, queries in these languages will not include variations of words based on their meanings in the results.
xdmp:document-insert("/lang/test.xml", <el xml:lang="fr">chats</el>)
Considering I do not have a license for the French language, will the word
"chats" be stemmed to "chat" in the term list, or will it be indexed as chats, i.e. without any variations?
You can easily check this yourself using cts:stem:
cts:stem("chats", "fr")
Without the necessary license, it simply returns chats.
HTH!
No license for French means no stemming for French.
I'm actually kind of confused by the question. If it did work without the license, then what would the license get you?

WordNet standard database format across many languages

I am trying to find a standard format for the WordNet database to use with many languages (starting with English and French). I was only able to find the French version (WOLF) in the VisDic XML format, and the English one in the original (non-XML) format defined on the Princeton website.
I found the Open Multilingual Wordnet which seems to have most languages in a standard XML format, but they do not contain relationships between terms (e.g. hypernym).
Does the English database exist in the VisDic XML Format?
or
Does the French database exist in the original non-xml format?
or
Is there any other better standard format that I am not aware of?

CMU Sphinx: how to add keywords in addition to existing vocabulary?

CMU Sphinx comes with a large vocabulary of English words. That is fine, but I want to emphasize certain words that I will be using as commands, and some of these words are not English words. How can I make sure that Sphinx understands both my special command keywords and the rest of the English dictionary words? And how can I make sure that my command keywords take precedence over the rest of the English vocabulary?
Using PocketSphinx, there is a call that I have attempted to use for this purpose:
ps_add_word(ps, "OKAY", "OW K EY", 1);
However, none of the words that I add this way appear to be recognized any more frequently than any other word.
It is not possible at runtime at the moment. You have to add the word to a grammar or language model. You can find more details about language models in the CMUSphinx tutorial:
http://cmusphinx.sourceforge.net/wiki/tutoriallm
You can also read the advanced LM tutorial to understand how to update the current language model:
http://cmusphinx.sourceforge.net/wiki/tutoriallmadvanced
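One common approach along those lines is a small JSGF grammar restricted to the command words, used alongside (or instead of) the full language model. A minimal sketch, with placeholder command words (each word must also have a phonetic entry in the pronouncing dictionary, e.g. a line like OKAY  OW K EY):

```jsgf
#JSGF V1.0;

grammar commands;

// Placeholder command words for illustration; replace with your own,
// and add dictionary pronunciations for any non-English words.
public <command> = OKAY | HALT | SHAZAM ;
```

Because the grammar only contains the command words, they cannot lose out to the rest of the English vocabulary while the grammar search is active.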

L10N: Trusted test data for Locale Specific Sorting

I'm working on an internationalized database application that supports multiple locales in a single instance. When international users sort data in the applications built on top of the database, the database theoretically sorts the data using a collation appropriate to the locale associated with the data the user is viewing.
I'm trying to find sorted lists of words that meet two criteria:
the sorted order follows the collation rules for the locale
the words listed will allow me to exercise most / all of the specific collation rules for the locale
I'm having trouble finding such trusted test data. Are such sort-testing datasets currently available, and if so, what / where are they?
"words.en.txt" is an example text file containing American English text:
Andrew
Brian
Chris
Zachary
I am planning to load the list of words into my database in randomized order, then check whether sorting it reproduces the original order.
Because I am not fluent in any language other than English, I do not know how to create sample datasets like the following sample one in French (call it "words.fr.txt"):
cote
côte
coté
côté
French collation orders diacritical marks from right to left, i.e. accent differences nearer the end of the word are compared first. If you sorted that list in code-point order, it would come out like this (an incorrect French collation):
cote
coté
côte
côté
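Both orders can be reproduced with a short Python sketch. The collation key below is a deliberately crude toy, assuming at most one combining mark per letter, that compares base letters first and accents from the end of the word; real applications should use an ICU/CLDR collator instead:

```python
import unicodedata

def french_key(word):
    """Toy French-style collation key: base letters first, then accent
    differences compared from the END of the word (the 'backward' rule)."""
    nfd = [unicodedata.normalize("NFD", c) for c in word]
    base = "".join(d[0] for d in nfd)               # accents stripped
    accents = tuple(len(d) for d in reversed(nfd))  # accent weight, right to left
    return (base, accents)

words = ["côté", "cote", "coté", "côte"]
print(sorted(words))                  # code-point order: cote, coté, côte, côté
print(sorted(words, key=french_key))  # French order:     cote, côte, coté, côté
```

The difference between the two outputs is exactly the collation behavior the test data needs to exercise.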
Thank you for the help,
Chris
Here's what I found.
The Unicode Common Locale Data Repository (CLDR) is pretty much the authority on collations for international text. I was able to find several lists of words conforming to the rules found in CLDR in the ICU Project's ICU Demonstration - Locale Explorer tool. It turns out that ICU (International Components for Unicode) uses CLDR rules to help solve common internationalization problems. It's a great library; check it out.
In some cases, it was useful to construct some nonsense terms by reverse-engineering the CLDR rules directly. Search engines available in the United States were not suited for finding foreign terms with the case/diacritic/other nuances I was interested in for this testing (in retrospect, I wonder if international search engines would have been better-suited for this task).