Can Microsoft Bing Speech be configured to return only numbers / letters? - voice-recognition

Can the Microsoft Bing Speech API be configured to only return numbers and letters, as opposed to full words?
The use case is transcribing Canadian postal codes, e.g. M 1 B 0 R 3. Microsoft may return "Em 1 Be 0 Are 3".
Our audio file is 8000 Hz and encoded with "M-ULAW". We have no flexibility in changing the sample rate or encoding. We are using the "SMD" scenario, but I can't find any documentation on what this does. Base request URI:
https://speech.platform.bing.com/recognize?scenarios=smd&appid=D4D52672-91D7-4C74-8AD8-42B1D98141A5&device.os=your_device_os&version=3.0
Is there a way to get a more accurate response from Microsoft for this use case?
Thank you

You could try using Microsoft's Custom Speech Service (previously known as the Custom Recognition Intelligent Service, or CRIS) to create and use a custom language model.
The guidelines for transcription of custom language models say "Common acronyms can be left as a single entity without periods or spaces between the letters, but all other acronyms should be written out in separate letters, with each letter separated by a single space" and include this example:
Original text             After normalization
-----------------------   ---------------------------
play OU812 by Van Halen   play O U 8 1 2 by Van Halen
So following their guidelines, your custom language model will be a file where each line looks something like this:
M 1 B 0 R 3
You can easily generate a file containing thousands of examples of Canadian postal codes based on the structure of the codes, which in regular expression format looks like this:
[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ][0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]
(The above expression is taken from this answer about validating postal codes.)
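For example, a few lines of Python could write out such a file by sampling the regex's character classes. This is only a sketch: the file name, count, and function name below are arbitrary choices of mine, not part of the Custom Speech Service.
import random

# Character classes taken from the postal-code regex above.
FIRST  = "ABCEGHJKLMNPRSTVXY"       # allowed first letters
OTHER  = "ABCEGHJKLMNPRSTVWXYZ"     # allowed later letters
DIGITS = "0123456789"

def random_postal_code():
    chars = [random.choice(FIRST), random.choice(DIGITS), random.choice(OTHER),
             random.choice(DIGITS), random.choice(OTHER), random.choice(DIGITS)]
    return " ".join(chars)          # one space between characters, per the transcription guidelines

with open("postal_codes.txt", "w", encoding="utf-8") as f:
    for _ in range(10000):
        f.write(random_postal_code() + "\n")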
By doing this you're telling the recognizer what sort of things you're expecting people to say, and helping it choose when there are multiple possibilities for a sound (e.g. "U" vs. "you"). I think it will make a huge difference in the results you get.

Related

Use MeCab to separate Japanese sentences into words not morphemes in vb.net

I am using the following code to split Japanese sentences into its words:
Dim parameter = New MeCabParam()
Dim tagger = MeCabTagger.Create(parameter)
For Each node In tagger.ParseToNodes(sentence)
    If node.CharType > 0 Then
        Dim features = node.Feature.Split(",")
        Console.Write(node.Surface)
        Console.WriteLine(" (" & features(7) & ") " & features(1))
    End If
Next
An input of それに応じて大きくになります。 outputs morphemes:
それ (それ) 代名詞
に (に) 格助詞
応じ (おうじ) 自立
て (て) 接続助詞
大きく (おおきく) 自立
に (に) 格助詞
なり (なり) 自立
ます (ます) *
。 (。) 句点
Rather than words like so:
それ
に
応じて
大きく
に
なります
。
Is there a way I can use a parameter to get MeCab to output the latter? I am very new to coding so would appreciate it if you explain simply. Thanks.
This is actually pretty hard to do. MeCab, Kuromoji, Sudachi, KyTea, Rakuten-MA—all of these Japanese parsers and the dictionary databases they consume (IPADIC, UniDic, Neologd, etc.) have chosen to parse morphemes, the smallest units of meaning, instead of what you call "words", which as your example shows often contain multiple morphemes.
There are a few strategies that folks usually combine to improve on this.
Experiment with different dictionaries. I've noticed that UniDic is sometimes more consistent than IPADIC.
Use a bunsetsu chunker like J.DepP, which consumes the output of MeCab to chunk together morphemes into bunsetsu. Per this paper, "We use the notion of a bunsetsu which roughly corresponds to a minimum phrase in English and consists of a content words (basically nouns or verbs) and the functional words surrounding them." The bunsetsu output by J.DepP often correspond to "words". I personally don't think of, say, a noun + particle phrase as a "word" but you might—these two are usually in a single bunsetsu. (J.DepP is also pretttty fancy, in that it also outputs a dependency tree between bunsetsu, so you can see which one modifies or is secondary to which other one. See my example.)
A last technique that you shouldn't overlook is scanning the dictionary (JMdict) for runs of adjacent morphemes; this helps find idioms or set phrases. It can get complicated because the dictionary may have a deconjugated form of a phrase in your sentence, so you might have to search both the literal sentence form and the deconjugated (lemma) form of MeCab output.
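To make that dictionary-run idea concrete, here is a rough Python sketch. It uses the mecab-python3 binding rather than the VB.NET one from the question, word_set is a toy stand-in for a real JMdict lookup, and the printed result assumes the same segmentation shown in the question.
# Parse with MeCab, then greedily merge adjacent morphemes that form a known dictionary entry.
import MeCab

tagger = MeCab.Tagger("-Owakati")                    # surface forms separated by spaces
morphemes = tagger.parse("それに応じて大きくになります。").split()

word_set = {"応じて", "なります"}                     # toy stand-in for a JMdict lookup

merged, i = [], 0
while i < len(morphemes):
    for j in range(len(morphemes), i, -1):           # try the longest run first
        candidate = "".join(morphemes[i:j])
        if j - i == 1 or candidate in word_set:
            merged.append(candidate)
            i = j
            break

print(merged)   # ['それ', 'に', '応じて', '大きく', 'に', 'なります', '。']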
I have an open-source package that combines all of the above called Curtiz: it runs text through MeCab, chunks the morphemes into bunsetsu with J.DepP to find groups that belong together, identifies vocabulary by looking them up in the dictionary, separates particles and conjugated phrases, etc. It is probably not directly useful for you, since I use it to support my own Japanese study and the learning tools I build, but it shows how the above pieces can be combined to get what you need in Japanese NLP.
Hopefully that's helpful. I'm happy to elaborate more on any of the above topics.

How to request a page title in a foreign language using wikipedia API

I am trying to use a simple GET request
"https://en.wikipedia.org/w/api.php?action=query&titles=音乐&prop=langlinks&lllimit=500"
but with Chinese characters in the title, Wikipedia can't find the page even though it exists: https://zh.wikipedia.org/wiki/%E9%9F%B3%E4%B9%90
I have tried URL-encoding the Chinese word, in both simplified and traditional forms. I also tried giving it Unicode escapes in ASCII like "\u97f3\u4e50".
Does anyone know how to do this?
I solved it.
There are a few things to remember when doing this:
You need to use the Wikipedia of your target language, so in this case
zh.wikipedia.org
Chinese Wikipedia displays articles in the character set of the reader's region (simplified for mainland China, traditional for Taiwan), but internally the stored title depends on who wrote the article. The title in your API query must be in the character set the creator used. So for Music, 音樂 will not work and you must use the simplified 音乐; but for Notebook computer, the simplified 笔记本 will not work and you must use 筆記本. You have no choice but to try both. .NET includes a set of methods for converting between the two character sets.
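A rough sketch of the try-both approach in Python with the requests library; the query parameters are the ones from the question, pointed at zh.wikipedia.org, and the helper name is mine.
# Try the simplified and traditional titles in turn until one is found.
import requests

API = "https://zh.wikipedia.org/w/api.php"

def find_langlinks(titles):
    for title in titles:                         # e.g. simplified first, then traditional
        r = requests.get(API, params={
            "action": "query",
            "titles": title,
            "prop": "langlinks",
            "lllimit": 500,
            "format": "json",
        })
        pages = r.json()["query"]["pages"]
        if "-1" not in pages:                    # MediaWiki uses the key "-1" for a missing page
            return pages
    return None

print(find_langlinks(["音乐", "音樂"]))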

Handling Grammar / Spelling Issues in Translation Strings

We are currently implementing a Zend Framework project that needs to be translated into 6 different languages. We already have a pretty sophisticated translation system, based on Zend_Translate, which also handles variables in translation keys.
Our project has a new Turkish translator, and we are facing a new issue: grammar, especially Turkish grammar. I suspect this problem exists in every translation system and in most languages, so I am posting the question here.
Question: Any ideas how to handle translations like:
Key: I have a[n] {fruit}
Variables: apple, banana
Result: I have an apple. I have a banana.
Key: Stimme für {user}[s] Einsendung
Variables: Paul, Markus
Result: Stimme für Pauls Einsendung,
Result: Stimme für Markus Einsendung
Anybody has a solution or idea for this? My only guess would be to avoid this by not using translations where these issues occur.
How do other platforms handle this?
Of course the translation system has no idea what type of word it is placing where, or in what type of sentence; it only does string replacements...
PS: Turkish is even more complicated:
For example, on a profile page, we have "Annie's Network". This should translate as "Annie'nin Aği".
If the first name ends in a vowel, the suffix will start with an n and look like "Annie'nin"
If the first name ends in a consonant, it will not have the first n, and look like "Kris'in"
If the last vowel is an a or ı, it will look like "Dan'ın"; or "Seyma'nın"
If the last vowel is an o or u, it will look like "Davud'un"; or "Burcu'nun"
If the last vowel is an e or i, it will look like "Erin'in"; or "Efe'nin"
If the last vowel is an ö or ü, it will look like "Göz'ün"; or "Eminönü'nün"
If the last letter is a k (like the name "Basak"), it will look like "Basağın"; or "Eriğin"
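To make the pattern concrete, the rules above can at least be codified. This is only a rough Python sketch: the function name and vowel groups are mine, and the consonant-softening case (Basak) and other exceptions are deliberately not handled.
# Pick the Turkish genitive suffix from the rules listed above.
VOWELS = "aeiıoöuü"

def turkish_genitive(name):
    last_vowel = next((ch for ch in reversed(name.lower()) if ch in VOWELS), "e")
    if last_vowel in "aı":
        suffix_vowel = "ı"
    elif last_vowel in "ou":
        suffix_vowel = "u"
    elif last_vowel in "ei":
        suffix_vowel = "i"
    else:                       # ö or ü
        suffix_vowel = "ü"
    buffer_n = "n" if name[-1].lower() in VOWELS else ""   # names ending in a vowel get the extra n
    return name + "'" + buffer_n + suffix_vowel + "n"

for name in ["Annie", "Kris", "Dan", "Seyma", "Davud", "Burcu", "Erin", "Efe"]:
    print(turkish_genitive(name))
# Annie'nin, Kris'in, Dan'ın, Seyma'nın, Davud'un, Burcu'nun, Erin'in, Efe'nin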
It is actually a very hard problem, as grammar rules differ even among languages from the same family. I don't think you could easily do anything for, say, the Slavic languages...
However, if you want to solve this problem (because it is an extra challenge) and you are looking for creative, cross-inspiring ways to do it, you might want to look into something called ChoiceFormat (an example would be the one from the ICU Project), or you can look at GNU gettext's solution to the plural-forms problem.
ICU (mentioned above) has a SelectFormat, http://site.icu-project.org/design/formatting/select, that may be of help; it's like a choice format but with arbitrary keywords. It also has a PluralFormat which already encodes the plural rules of many languages.
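The SelectFormat idea itself is simple enough to illustrate without the ICU API: the translator supplies one complete sentence per keyword instead of fragments for the code to glue together. The message table and helper function below are only illustrative, not part of ICU or Zend_Translate.
# Illustrative only -- not the ICU API. Each keyword maps to a full sentence.
FRUIT_MESSAGES = {
    "apple":  "I have an apple.",
    "banana": "I have a banana.",
    "other":  "I have a {fruit}.",      # fallback for anything not listed
}

def select_format(messages, keyword, **values):
    template = messages.get(keyword, messages["other"])
    return template.format(**values)

print(select_format(FRUIT_MESSAGES, "apple"))               # I have an apple.
print(select_format(FRUIT_MESSAGES, "kiwi", fruit="kiwi"))  # I have a kiwi.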

How to generate (book) indexes?

I need to create an index for a book. While the task looks easy at first -- group words by the first letter, then sort them -- this obvious solution works only for English. The real world is, however, more complex. See http://en.wikipedia.org/wiki/Collation :
The difference between computer-style numerical sorting and true alphabetical sorting becomes obvious in languages using an extended Latin alphabet. For example, the 29-letter alphabet of Spanish treats ñ as a basic letter following n, and formerly treated ch and ll as basic letters following c and l, respectively. Ch and ll are still considered letters, but are now alphabetized as two-letter combinations. (The new alphabetization rule was issued by the Royal Spanish Academy in 1994.) On the other hand, the digraph rr follows rqu as expected, both with and without the 1994 alphabetization rule. A numeric sort may order ñ incorrectly following z and treat ch as c + h, also incorrect when using pre-1994 alphabetization.
I tried to find an existing solution.
The DocBook stylesheets do not address the problem.
The best match I found is xindy ( http://xindy.sourceforge.net/ ), but this tool is too tightly tied to LaTeX.
Any other suggestions?
Naively, you could examine every word in the text and create a hash, using each word as a key and building up an array of locations (page numbers?) as the value.
But indexes are generally a bit more focused than that.
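A minimal version of that naive approach might look like the sketch below, assuming pages is simply an ordered list of page texts; the helper name and sample data are made up.
# Naive word -> pages index.
from collections import defaultdict

def build_index(pages):
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for raw in text.split():
            word = raw.strip(".,;:!?()\"'").lower()
            if word:
                index[word].add(page_number)
    return index

index = build_index(["The quick brown fox", "jumps over the lazy dog"])
for word in sorted(index):          # plain sorted(); culture-aware collation is the next step
    print(word, sorted(index[word]))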
Well, after replying to the comments, I realized that I don't need a tool to generate indexes, but a library which can sort according to culture. First experiments show that I'm going to use ICU and its Python bindings, PyICU. For example:
import functools
import icu

words = ["liche", "lichée", "lichen", "lichénoïde", "licher", "lichoter"]
collator = icu.Collator.createInstance(icu.Locale.getFrance())
# sorted() no longer takes a cmp argument, so wrap the collator's compare function
for word in sorted(words, key=functools.cmp_to_key(collator.compare)):
    print(word)

Data Cleanup, post conversion from ALLCAPS to Title Case

Converting a database of people and addresses from ALL CAPS to Title Case will create a number of improperly capitalized words/names, some examples follow:
MacDonald, PhD, CPA, III
Does anyone know of an existing script that will clean up all the common problem words? Certainly, it will still leave some mistakes behind (less common names with CamelCase-like spellings, e.g. "MacDonalz").
I don't think it matters much, but the data currently resides in MSSQL. Since this is a one-time job, I'd export out to text if a solution requires it.
There is a thread that posed a related question, sometimes touching on this problem, but not addressing this problem specifically. You can see it here:
SQL Server: Make all UPPER case to Proper Case/Title Case
Don't know if this is of any help
private static function ucNames($surname) {
    $replaceValue = ucwords($surname);
    // Upper-case the letter that follows an O', Mc/Mac, Fitz or hyphen prefix.
    return preg_replace_callback('/
        (?: ^ | \\b )             # assertion: beginning of string or a word boundary
        ( O\' | \- | Ma?c | Fitz ) # attempt to match Irish, Scottish and double-barrelled surnames
        ( [^\W\d_] )              # match next char; we exclude digits and _ from \w
        /x',
        function ($matches) {
            return $matches[1] . strtoupper($matches[2]);
        },
        $replaceValue);
}
It's a simple PHP function that I use to set surnames to the correct case. It works for names like O'Connor, McDonald, MacBeth and FitzPatrick, and for double-barrelled names like Hedley-Smythe.
Here is the answer I was looking for:
There is a data company, Melissa Data, which publishes APIs and applications for database cleanup -- geared mostly toward the direct marketing industry.
I was able to use two applications to solve my problem.
StyleList: this app, among other things, converts ALL CAPS to mixed case, and in the process it does not dirty up the data: it leaves titles such as CPA, MD, III, etc. intact, as well as natural, common camel-case names such as McDonalds.
Personator: I used Personator to break the Full Name fields into Prefix, First Name, Middle Name, Last Name, and Suffix. To be honest, it was far from perfect, but the data I gave it was pretty challenging (often no space separating a middle name and a suffix). This app does a number of other useful things as well, including assigning gender to most names. It's available as an API you can call, too.
Here is a link to the solutions offered by Melissa Data:
http://www.melissadata.com/dqt/index.htm
For me, the Melissa Data apps did much of the heavy lifting, and the remaining dirty data was identifiable and fixable in SQL by reporting on LEFT x or RIGHT x counts -- the dirty values typically have the least uniqueness, so their patterns are easily discovered and fixed.
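The same kind of report can be sketched outside SQL as well; here is a rough Python equivalent of the LEFT x / RIGHT x counting, with made-up sample values standing in for the exported column.
# Frequency of leading and trailing fragments, to surface the bad patterns.
from collections import Counter

names = ["Macdonald", "MacDonald", "Phd", "PhD", "Iii", "III"]   # made-up sample rows

print(Counter(name[:3] for name in names).most_common(10))       # "LEFT 3" counts
print(Counter(name[-3:] for name in names).most_common(10))      # "RIGHT 3" counts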