Handling Grammar / Spelling Issues in Translation Strings - variables

We are currently implementing a Zend Framework project that needs to be translated into 6 different languages. We already have a pretty sophisticated translation system, based on Zend_Translate, which also handles variables in translation keys.
Our project has a new Turkish translator, and we are facing a new issue: grammar, especially Turkish grammar. I noticed that this problem is probably present in every translation system and in most languages, so I posted a question here.
Question: Any ideas how to handle translations like:
Key: I have a[n] {fruit}
Variables: apple, banana
Result: I have an apple. I have a banana.
Key: Stimme für {user}[s] Einsendung (German: "Vote for {user}['s] submission")
Variables: Paul, Markus
Result: Stimme für Pauls Einsendung,
Result: Stimme für Markus Einsendung
Does anybody have a solution or an idea for this? My only guess would be to avoid these constructions by not using translations where such issues occur.
How do other platforms handle this?
Of course the translation system has no idea what type of word it is placing where, or in what type of sentence. It only does some string replacements...
PS: Turkish is even more complicated:
For example, on a profile page, we have "Annie's Network". This should translate as "Annie'nin Ağı". The suffix depends on the name (a rough sketch of these rules follows the list below):
If the first name ends in a vowel, the suffix will start with an n and look like "Annie'nin".
If the first name ends in a consonant, it will not have the first n, and look like "Kris'in".
If the last vowel is an a or ı, it will look like "Dan'ın" or "Seyma'nın".
If the last vowel is an o or u, it will look like "Davud'un" or "Burcu'nun".
If the last vowel is an e or i, it will look like "Erin'in" or "Efe'nin".
If the last vowel is an ö or ü, it will look like "Göz'ün" or "Eminönü'nün".
If the last letter is a k (like the name "Basak"), it will look like "Basağın" or "Eriğin".
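To make those rules concrete, here is a rough, purely illustrative Python sketch of the genitive suffix (it ignores the k-softening rule and other real morphology; the vowel groups and helper name are my own assumptions, not part of our translation system):

VOWELS = set("aeıioöuü")

def genitive(name):
    # Illustrative only: pick the suffix vowel by vowel harmony on the last vowel.
    lower = name.lower()
    last_vowel = next((c for c in reversed(lower) if c in VOWELS), "e")
    if last_vowel in "aı":
        v = "ı"
    elif last_vowel in "ou":
        v = "u"
    elif last_vowel in "öü":
        v = "ü"
    else:
        v = "i"
    # The buffer "n" appears only when the name ends in a vowel.
    buffer = "n" if lower[-1] in VOWELS else ""
    return name + "'" + buffer + v + "n"

for name in ["Annie", "Kris", "Dan", "Seyma", "Davud", "Burcu", "Erin", "Efe", "Göz"]:
    print(genitive(name))  # Annie'nin, Kris'in, Dan'ın, Seyma'nın, ...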

It is actually a very hard problem, as grammar rules differ even among languages from the same family. I don't think you could easily do anything for, let's say, Slavic languages...
However, if you want to solve this problem (because it is extra challenging) and you are looking for creative (cross-inspiring) ways to do that, you might want to look into something called ChoiceFormat (an example would be the one from the ICU project), or you can look up GNU gettext's solution to the plural-forms problem.
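For instance, Python's standard gettext module exposes exactly that plural-forms machinery. A minimal sketch (the "myapp" domain and ./locale directory are hypothetical; fallback=True just echoes the English strings when no catalog is installed):

import gettext

# gettext picks one of the target language's plural forms based on n,
# using the plural expression stored in the compiled catalog.
t = gettext.translation("myapp", localedir="locale", languages=["pl"], fallback=True)
for n in (1, 2, 5):
    print(t.ngettext("{n} apple", "{n} apples", n).format(n=n))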

ICU (mentioned above) has a SelectFormat http://site.icu-project.org/design/formatting/select that may be of help; it's like a ChoiceFormat but with arbitrary keywords. It also has a PluralFormat, which already covers many languages' plural rules.
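The core idea behind SelectFormat is that the translator writes one message variant per keyword and the code only supplies the keyword. A plain-Python sketch of that idea (not the ICU API itself), applied to the a/an example from the question:

# One message variant per keyword; the code picks the keyword,
# the translator decides the wording of each variant.
MESSAGES = {
    "en": {"vowel": "I have an {fruit}", "consonant": "I have a {fruit}"},
}

def have_fruit(lang, fruit):
    keyword = "vowel" if fruit[0].lower() in "aeiou" else "consonant"
    return MESSAGES[lang][keyword].format(fruit=fruit)

print(have_fruit("en", "apple"))   # I have an apple
print(have_fruit("en", "banana"))  # I have a banana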

Related

Portuguese tokenizer: it is breaking “ao” into “a” and “o”

I am using spaCy as a tokenizer for Portuguese documents (the latest version).
But it is making a mistake in the following sentence: 'esta quebrando aonde nao devia, separando a e o em ao e aos' (roughly: "it is breaking where it shouldn't, separating the a and the o in ao and aos").
It is breaking “ao” into “a” and “o”. The same is happening with other words like “aonde” (“a” + “onde”) and others (“aos”, etc.).
Other strange cases: "àquele" into "a" and "quele"; "às" into "à" and "s".
The problem can be shown in the "Test the model live (experimental)" in https://spacy.io/models/pt.
For now, I am adding some known words with tokenizer.add_special_case. But I may not remember all cases.
Is it possible to fix this?
It seems appropriate to me to break the expression "ao" into two functional parts: a preposition and an article. Depending on the application, it would be simple to concatenate these parts back together as required by the official grammar.
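If you prefer to keep those contractions as single tokens, the add_special_case approach you already mention can be scripted for a word list instead of added one by one. A minimal sketch, assuming the pt_core_news_sm model is installed and using only a few sample words:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("pt_core_news_sm")

# Override the default exceptions so these contractions stay as one token.
for contraction in ("ao", "aos", "às", "àquele"):
    nlp.tokenizer.add_special_case(contraction, [{ORTH: contraction}])

doc = nlp("esta quebrando aonde nao devia, separando a e o em ao e aos")
print([token.text for token in doc])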

TTS microsoft.speech, best way to say a sentence fluently with language change

I need to say a sentence with a German name in it. To do so I used Microsoft Speech with English, called the SpeakAsync function to say the first part of the sentence, then changed the language to German, said the name, then went back to English and finished the sentence. This all works well, except that each time I call the SpeakAsync function there is a 1-second pause, so I have a 1-second pause before and after the name. Can this be removed somehow? I would like to have no pause in between.
s.SetOutputToDefaultAudioDevice()
s.SelectVoice(myENGLISHvoice)
s.SpeakAsync("Next on the line is mr. ")
s.SelectVoice(myGERMANvoice)
s.SpeakAsync("Stefan Hanswurst")
s.SelectVoice(myENGLISHvoice)
s.SpeakAsync("Please stand up.")
Update: I have also tried this, with no success; same problem:
pb.AppendSsmlMarkup("<voice xml:lang=""en-EN"">")
pb.AppendText("Next on the line is mr.")
pb.AppendSsmlMarkup("</voice>")
pb.AppendSsmlMarkup("<voice xml:lang=""de-DE"">")
pb.AppendText("Hansjörg Bratwurst ")
pb.AppendSsmlMarkup("</voice>")
pb.AppendSsmlMarkup("<voice xml:lang=""en-EN"">")
pb.AppendText("Please stand up.")
pb.AppendSsmlMarkup("</voice>")
In the context of speech engines you usually avoid switching languages during speech output. Switching is unusual, since humans also simply stick to one pronunciation (see how Americans and Italians pronounce "coffee" or "cappuccino", for example).
Usually this is done by inserting pronunciation hints for "foreign" words into the language you are currently generating output for, just like Germans have to learn how to pronounce "Cappuccino" and it will still always have a German accent to it.
See details for Microsoft's speech API here (search for "pronunciation"; note they have a spelling error on the page):
https://msdn.microsoft.com/en-us/library/hh378454(v=office.14).aspx

How to generate (book) indexes?

I need to create an index for a book. While the task looks easy at first -- group words by their first letter, then sort them -- this obvious solution only works for US English. The real world is, however, more complex. See http://en.wikipedia.org/wiki/Collation :
The difference between computer-style numerical sorting and true alphabetical sorting becomes obvious in languages using an extended Latin alphabet. For example, the 29-letter alphabet of Spanish treats ñ as a basic letter following n, and formerly treated ch and ll as basic letters following c and l, respectively. Ch and ll are still considered letters, but are now alphabetized as two-letter combinations. (The new alphabetization rule was issued by the Royal Spanish Academy in 1994.) On the other hand, the digraph rr follows rqu as expected, both with and without the 1994 alphabetization rule. A numeric sort may order ñ incorrectly following z and treat ch as c + h, also incorrect when using pre-1994 alphabetization.
I tried to find an existing solution.
The DocBook stylesheets do not address the problem.
The best match I found is xindy ( http://xindy.sourceforge.net/ ), but this tool is too closely tied to LaTeX.
Any other suggestions?
Naively, you could examine every word in the text and build a hash, using the words as keys and building up an array of locations (page numbers?) as values.
But indexes are generally a bit more focused than that.
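A throwaway sketch of that naive word-to-pages mapping (the page texts and punctuation handling are simplified placeholders):

from collections import defaultdict

def build_index(pages):
    # Map each word to the sorted list of page numbers it appears on.
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for raw in text.split():
            word = raw.strip(".,;:!?\"'()").lower()
            if word:
                index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}

print(build_index(["The ñandú ran.", "A ñandú is a bird."]))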
Well, after answering the comments, I realized that I don't need a tool to generate indexes, but a library which can sort according to cultures. First experiments show that I'm going to use ICU and its Python bindings, PyICU. For example:
import icu
words = ["liche", "lichée", "lichen", "lichénoïde", "licher", "lichoter"]
collator = icu.Collator.createInstance(icu.Locale.getFrance())
for word in sorted(words, cmp=collator.compare):
    print word.decode("string-escape")

Add spaces between words in spaceless string

I'm on OS X, and in Objective-C I'm trying to convert
for example,
"Bobateagreenapple"
into
"Bob ate a green apple"
Is there any way to do this efficiently? Would something involving a spell checker work?
EDIT: Just some extra information:
I'm attempting to build something that takes misformatted text (for example, text copy-pasted from old PDFs that ends up without spaces, especially from internet archives like JSTOR). Since the misformatted text is probably going to be long... well, I'm just trying to figure out whether this is feasible before I actually attempt to write the system, only to find out it takes 2 hours to fix a paragraph of text.
One possibility, which I will describe in a non-OS-specific manner, is to perform a search through all the possible words that make up the collection of letters.
Basically you chop off the first letter of your letter collection and add it to the current word you are forming. If it makes a word (e.g. a dictionary lookup), then add it to the current sentence. If you manage to use up all the letters in your collection and form words out of all of them, then you have a full sentence. But you don't have to stop there. Instead, you keep running, and eventually you will produce all possible sentences.
Pseudo-code would look something like this:
FindWords(vector<Sentence> sentences, Sentence s, Word w, Letters l)
{
    if (l.empty() and w.empty())
    {
        add s to sentences;
        return;
    }
    if (l.empty())
        return;
    add first letter from l to w;
    if w in dictionary
    {
        add w to s;
        FindWords(sentences, s, empty word, l)
        remove w from s
    }
    FindWords(sentences, s, w, l)
    put last letter from w back onto l
}
There are, of course, a number of optimizations you could perform to make it go fast. For instance checking if the word is the stem of any word in the dictionary. But, this is the basic approach that will give you all possible sentences.
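For concreteness, here is a rough Python rendering of the same search; the tiny dictionary is a placeholder:

def find_words(sentences, sentence, word, letters, dictionary):
    # All letters used and no partial word left over: a complete sentence.
    if not letters and not word:
        sentences.append(" ".join(sentence))
        return
    if not letters:
        return
    # Move the next letter from the pool onto the current partial word.
    word = word + letters[0]
    rest = letters[1:]
    if word in dictionary:
        # Commit the word and start a fresh one...
        find_words(sentences, sentence + [word], "", rest, dictionary)
    # ...but also keep extending it, since a longer word may share this prefix.
    find_words(sentences, sentence, word, rest, dictionary)

dictionary = {"bob", "bo", "bat", "a", "at", "ate", "tea", "green", "apple"}
results = []
find_words(results, [], "", "bobateagreenapple", dictionary)
print(results)  # both "bob ate a green apple" and "bob a tea green apple"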
Solving this problem is much harder than anything you'll find in a framework. Notice that even in your example, there are other "solutions": "Bob a tea green apple," for one.
A very naive (and not very functional) approach might be to use a spell-checker to try to isolate one "real word" at a time in the string; of course, in this example, that would only work because "Bob" happens to be an English word.
This is not to say that there is no way to accomplish what you want, but the way you phrase this question indicates to me that it might be a lot more complicated than what you're expecting. Maybe someone can give you an acceptable solution, but I bet they'll need to know a lot more about what exactly you're trying to do.
Edit: In response to your edit, it would probably take less effort to run some kind of OCR tool on the PDF and correct its output than it would to correct what this system might give you, let alone to program it.
I implemented a solution; the code is available on CodeProject:
http://www.codeproject.com/Tips/704003/How-to-add-spaces-between-spaceless-strings
My idea was to prioritize results that use up most of the characters (preferably all of them) and then favor the ones with the longest words, because 2-, 3- or 4-character words can often come up by chance from leftover characters. Most of the time this provides the correct solution.
To find all possible permutations I used recursion. The code is quite fast even with big dictionaries (tested with 50,000 words).
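A rough sketch of that ranking idea (the candidate splits and key function are purely illustrative): prefer splits that consume more of the input, then splits whose words are longer on average.

def rank_key(words, original):
    # Fraction of the input consumed, then average word length.
    chars_used = sum(len(w) for w in words)
    return (chars_used / len(original), chars_used / len(words) if words else 0)

candidates = [
    ["Bob", "ate", "a", "green", "apple"],
    ["Bob", "ate", "green", "apple"],            # drops a character
    ["Bo", "bat", "e", "a", "green", "apple"],   # shorter words on average
]
print(max(candidates, key=lambda c: rank_key(c, "Bobateagreenapple")))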

What is The Turkey Test?

I came across the term 'The Turkey Test' while learning about code testing. I don't really know what it means.
What is Turkey Test? Why is it called so?
The Turkey problem is related to software internationalization, or more simply to software misbehaving in various language cultures.
In various countries there are different standards, for example for writing dates (14.04.2008 in Turkey and 4/14/2008 in the US), numbers (e.g. 123,45 in Poland and 123.45 in the USA) and rules about character uppercasing (like in Turkey with the letters i, I and ı).
As Jeff Moser pointed out below, one such problem was reported by a Turkish user who found a bug in the ToUpper() function. There are more details in the comments below.
However, the problem is not limited to Turkey or to string conversions.
For example, in Poland and many other countries, dates and numbers are also written in a different manner.
Some links from a Google search for the Turkey Test:
Does Your Code Pass The Turkey Test?
by Jeff Moser
What's Wrong With Turkey?
by Jeff Atwood
Here is how the Turkey Test is described:
Forget about Turkey, this won't even pass in the USA. You need a case-insensitive compare. So you try:
String.Compare(string,string,bool ignoreCase):
....
Do any of these pass "The Turkey Test?"
Not a chance!
Reason: You've been hit with the "Turkish I" problem.
As discussed by lots and lots of people, the "I" in Turkish behaves differently than in most languages. Per the Unicode standard, our lowercase "i" becomes "İ" (U+0130 "Latin Capital Letter I With Dot Above") when it moves to uppercase. Similarly, our uppercase "I" becomes "ı" (U+0131 "Latin Small Letter Dotless I") when it moves to lowercase.
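Python's built-in str.upper() and str.lower() are not locale-aware, but the effect is easy to reproduce with ICU. A small sketch, assuming the PyICU bindings are installed (expected output in the comments):

import icu

turkish, english = icu.Locale("tr_TR"), icu.Locale("en_US")

# Upper-casing "i" is locale dependent: dotted İ in Turkish, plain I elsewhere.
print(str(icu.UnicodeString("istanbul").toUpper(turkish)))  # İSTANBUL
print(str(icu.UnicodeString("istanbul").toUpper(english)))  # ISTANBUL

# Lower-casing "I" gives the dotless ı in Turkish.
print(str(icu.UnicodeString("DIS").toLower(turkish)))       # dıs
print(str(icu.UnicodeString("DIS").toLower(english)))       # dis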
We write dates from smallest to largest, like dd.MM.yyyy: 28.10.2010
We use '.' (dot) as the thousands separator and ',' (comma) as the decimal separator: 4.567,9
We have ö=>Ö, ç=>Ç, ş=>Ş, ğ=>Ğ, ü=>Ü, and most importantly ı=>I and i=>İ; in other words, the lower case of upper-case I is dotless and the upper case of lower-case i is dotted.
People can have a very stressful time because of confusing errors caused by the rules above.
If your code properly runs in Turkey, it'll probably work anywhere.
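Locale-aware formatting libraries already encode these conventions. A quick illustration using the Babel library, assuming it is installed (expected output in the comments):

from datetime import date
from babel.dates import format_date
from babel.numbers import format_decimal

# Turkish conventions: day.month.year dates, dot thousands / comma decimals.
print(format_date(date(2010, 10, 28), format="short", locale="tr_TR"))  # 28.10.2010
print(format_decimal(4567.9, locale="tr_TR"))                           # 4.567,9

# The same values under a US locale, for comparison.
print(format_date(date(2010, 10, 28), format="short", locale="en_US"))  # 10/28/10
print(format_decimal(4567.9, locale="en_US"))                           # 4,567.9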
The so called "Turkey Test" is related to Software internationalization. One problem of globalization/internationalization are that date and time formats in different cultures can differ on many levels (day/month/year order, date separator etc).
Also, Turkey has some special rules for capitalization, which can lead to problems. For example, the Turkish "i" character is a common problem for many programs which capitalize it in a wrong way.
The link provided by @Luixv gives a comprehensive description of the issue.
The summary is that if you're going to test your code on only one non-English locale, test it on Turkish.
This is because Turkish has instances of most edge cases you are likely to encounter with localization, including "unusual" format strings and non-standard characters (such as different capitalization rules for i).
Jeff Atwood has a blog article on the same topic, which is the first place I came across it myself.
In summary, attempting to run your application under a Turkish locale is an excellent test of your I18n.
Here's Jeff's article.