Does MS speech support custom vocabulary - microsoft-speech-api

I have a requirement to write an application which would take an audio file and identify precisely at which points in the file specific words are being spoken. These are not English words, but rather Aramaic words, so they would have to be added as additional vocabulary. Does MS speech recognition support this? Thanks

Yes. There are several options, depending on the specifics of your custom words.
One is simply using phrase list: https://learn.microsoft.com/en-US/azure/cognitive-services/speech-service/improve-accuracy-phrase-list?tabs=terminal&pivots=programming-language-csharp
One is called Custom Speech: https://learn.microsoft.com/en-US/azure/cognitive-services/speech-service/custom-speech-overview
The first one is easier to test and implement, as you will not need audio data for training.
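To make the phrase-list route concrete, here is a minimal sketch with the Python Speech SDK (the linked doc shows the same API in C#). The key, region, file name and the Aramaic example words are placeholders; word-level timestamps are requested so you can locate each recognized word in the audio.

    # Sketch: bias recognition toward custom terms with a phrase list and
    # request word-level timestamps. Assumes the azure-cognitiveservices-speech
    # package and placeholder key/region/file values.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
    speech_config.request_word_level_timestamps()  # per-word offsets in the detailed result

    audio_config = speechsdk.audio.AudioConfig(filename="recording.wav")
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    # Add the custom vocabulary items you expect in the audio (placeholders here).
    phrase_list = speechsdk.PhraseListGrammar.from_recognizer(recognizer)
    for word in ["Abba", "Talitha koum", "Maranatha"]:
        phrase_list.addPhrase(word)

    result = recognizer.recognize_once()
    print(result.text)
    # Word offsets and durations are in the detailed JSON result:
    print(result.json)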

Related

SpaCy different Language Models

I'm making some progress :) developing my little OCR project.
I was wondering if my idea is possible in this case!
After extracting the text from images (OCR), I use NLP (spaCy) to identify two entities (LOC and PER). I write them to a dictionary and later to JSON data. That works well.
Now I'm wondering if I can improve my identified entities.
One way I can imagine is to use the right language model for the text.
I have various texts in German, English, Spanish and French.
At the moment I'm using the
But now I have no idea how to put langdetect into this.
Have a great week!
Greets
Here is a link that you might find useful when it comes to detecting a language (there are multiple options, including langdetect): How to detect language
You can create a dictionary with the languages you plan to detect and match it with langdetect's output. I guess you have the rest sorted out.
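A minimal sketch of how that mapping could look, assuming the four small spaCy models are installed; note that the English pipeline uses PERSON/GPE labels while the German, Spanish and French pipelines use PER/LOC:

    # Sketch: pick a spaCy model based on langdetect's output before running NER.
    # Assumes the listed models are installed (python -m spacy download <model>).
    import spacy
    from langdetect import detect

    MODELS = {
        "de": "de_core_news_sm",
        "en": "en_core_web_sm",
        "es": "es_core_news_sm",
        "fr": "fr_core_news_sm",
    }

    def extract_entities(text):
        lang = detect(text)  # e.g. "de", "en", "es", "fr"
        # Loading a model on every call is slow; cache the loaded pipelines in real use.
        nlp = spacy.load(MODELS.get(lang, "en_core_web_sm"))  # fall back to English
        doc = nlp(text)
        return [(ent.text, ent.label_) for ent in doc.ents
                if ent.label_ in ("LOC", "PER", "PERSON", "GPE")]

    print(extract_entities("Angela Merkel besuchte gestern Berlin."))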

Custom model in Apache Open NLP

I am working currently with custom models which I am training for my own use case. My use case is to classify emails based on whether it is an address change request. If the address change request could be understood from a single sentence, it is working fine without issues. But if the address change request needs to be understood from multiple sentences, it is not working.
Giving a few examples below:
Example 1: THIS IS WORKING
a) Training file:
Guys I wish to <START:contactupdate> change my address <END> .
My new address is 68 Dorset Road, Coventry, West Midlands, CV1 4ED.
Please confirm once you are done.
Thanks.
b) Testing the model with the below sentence:
String input = "Guys I wish to change my address.My new address is 68 Dorset Road, Coventry, West Midlands, CV1 4ED.Please confirm once you are done. Thanks."; //Working
Example 2: THIS IS NOT WORKING
Let's say the address change request can only be deduced from multiple lines.
"My old address is no longer valid. Need to update it."
How do I train my model in this scenario? How do I specify the custom tags for the above?
Can you please help. I am stuck.
Many Thanks
What do you mean by "not working"? That the thing you want to retrieve is not retrieved? Or that the training crashes somewhere when the tags are spread out over multiple lines?
In general, the (by default MaxEnt) model that you are training in this procedure tries to detect common features for the thing you are training for. Typically, these are named entities like persons, organisations, locations. And in many languages, these contain typical features (like the prefix Mr./Mrs., the suffix corp., the morpheme "street", respectively). This can be picked up by the model, and applied in new data, leading to the recognition of whichever it is you want to recognise.

The thing you are trying to do, however, is pretty advanced NLP already. Since the longer the phrase, the larger the possible variation, it becomes more difficult to pick up commonalities. I'd say for your use case, people typically use parsing (either constituency or dependency parsing) or other more sophisticated tools than just this relatively flat pattern recognition. So you may want to look into these instead.

I don't know how much data you have at your disposal, from which you can infer different ways of expressing the desire to change an address in a customer database. If it is reasonable (i.e. not just a couple of sentences), you may want to manually annotate it, parse the corpus, use machine learning on the parse trees/graphs for the sentences of interest and go about it in this way. As mentioned, quite advanced NLP in my opinion, and not something that has an out-of-the-box solution.
If I understand your question correctly, I think you are trying to categorize emails to find out whether each one is an address change request. But the model example looks like one for named entities. In my opinion, it might be better to use the "Document Categorizer" feature of Apache OpenNLP.
You can provide different samples of possible sentences which can be categorized as an address change. "Address_change", "general_inquiry" etc. can be the categories. This way you can add as many different samples as you want, with many variations of sentences. Here is an easy & basic tutorial for document categorization training & usage.
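For illustration only: a Document Categorizer training file is usually one document per line, with the category name first followed by the text. The categories and sentences below are made up, and real training data would need many more examples per category.

    Address_change Hi team my old address is no longer valid please update it to my new one
    Address_change I have moved recently and need my records to show the new address
    general_inquiry Could you tell me the opening hours of your Coventry branch
    general_inquiry I would like to know the status of my last request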

best way to index text consisting of multilingual words in elasticsearch

I'm new to Elasticsearch. The docs on the official site only cover the basics and do not contain specific examples. Since they are a little disorganized in my view, I can't figure out how to get started to achieve my purpose.
I have crawled a lot of torrents; they are published in many different languages.
I see there are analyzers in Elasticsearch to deal with input text, but I don't understand the workflow. From what I have tried, Elasticsearch does not use all analyzers to process input data.
It seems I should assign an analyzer to process a text.
Take a text such as: no game no life 游戏人生 ノーゲーム・ノーライフ, which contains three languages. How can I know which three analyzers I have to use? It also seems too heavy to use every analyzer to process this text.
I have seen an article, Three Principles for Multilingual Indexing in Elasticsearch, that talks about this. However, I am a beginner and a non-native English speaker, and it is hard to understand without an example.
Please give me some guide.
Thank you.
I would probably create two fields (or more, one per expected language) and apply a different, language-specific analyzer to each of them. Then when you search, you would search all of those fields.
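A rough sketch of that approach with the Elasticsearch 8.x Python client, using the built-in english and cjk analyzers as multi-fields on a made-up "torrents" index (a dedicated Japanese analyzer such as kuromoji would need the analysis-kuromoji plugin):

    # Sketch: one text field indexed with several language-specific analyzers as
    # multi-fields, then queried across all of them. Index and field names are invented.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    es.indices.create(
        index="torrents",
        mappings={
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "standard",
                    "fields": {
                        "en": {"type": "text", "analyzer": "english"},
                        "cjk": {"type": "text", "analyzer": "cjk"},
                    },
                }
            }
        },
    )

    es.index(index="torrents", document={"title": "no game no life 游戏人生 ノーゲーム・ノーライフ"})
    es.indices.refresh(index="torrents")

    resp = es.search(
        index="torrents",
        query={"multi_match": {"query": "游戏人生",
                               "fields": ["title", "title.en", "title.cjk"]}},
    )
    print(resp["hits"]["total"])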

optical character recognition of PDFs of parliamentary debates

For contract work, I need to digitize a lot of old, scanned-graphic-only plenary debate protocol PDFs from the Federal Parliament of Germany.
The problem is that most of these files have a two-column format:
Sample Protocol http://sert.homedns.org/img/btp12001.png
I would love to read your answers to the following questions:
How can I split the two columns before feeding them into OCR?
Which commercial or open-source OCR software or framework do you recommend, and why?
Please note that any tool, programming language, framework etc. is fine. Don't hesitate to recommend esoteric products or libraries if you think they are cut out for the job ^__^!!
UPDATE: These documents have already been scanned by the parliament o_O: sample (same as the image above). There are lots of them, and I want to deliver on the contract ASAP, so I can't go fetch print copies of the same documents and cut and scan them myself. There are just too many of them.
Best Regards,
Cetin Sert
Cut the pages down the middle before you scan.
It depends what OCR software you are using. A few years ago I did some work with an OCR API; I can't quite remember the name, but I think there are lots of alternatives. Anyway, this API allowed me to define regions on the page to OCR. If you always know roughly where the columns are, you could use such an SDK to map out parts of the page.
I use OmniPage 17 for such things. It has a batch mode too, where you can put the documents into a folder from which they are grabbed, and have the results put into another folder.
It auto-recognizes the layout, including columns, or you can set the default layout to columns.
You can set many options for how the output should look.
But try a demo first to see if it works correctly. At the moment I have problems with ligatures in some of my documents, so words like "fliegen" come out as "fl iegen" and you have to correct them by hand.
Take a look at http://www.wisetrend.com/wisetrend_ocr_cloud.shtml (an online, REST API for OCR). It is based on the powerful ABBYY OCR engine. You can get a free account and try it with a few of your images to see if it handles the 2-column format (it should be able to do it). Also, there are a bunch of settings you can play with (see API documentation) - you may have to tweak some of them before it will work with 2 columns. Finally, as a solution of last resort, if the 2-column split is always in the same place, you can first create a program that splits the input image into two images (shouldn't be very difficult to write this using some standard image processing library), and then feed the resulting images to the OCR process.
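If you go the split-then-OCR route, a minimal sketch with Pillow and pytesseract (free stand-ins for the commercial engines mentioned here) could look like the following; it assumes the column split sits at the page centre and that Tesseract's German language data (deu) is installed:

    # Sketch: split a two-column scan down the middle and OCR each half separately.
    # The split position is an assumption and may need adjusting per document.
    from PIL import Image
    import pytesseract

    def ocr_two_columns(path, lang="deu"):
        page = Image.open(path)
        width, height = page.size
        left = page.crop((0, 0, width // 2, height))
        right = page.crop((width // 2, 0, width, height))
        # Reading order: full left column first, then full right column.
        return (pytesseract.image_to_string(left, lang=lang)
                + "\n"
                + pytesseract.image_to_string(right, lang=lang))

    print(ocr_two_columns("btp12001.png"))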

What are the things we should consider while writing a Spell Checker?

I want to write a very simple spell checker. The spell checker will try to match the input word with equivalent words from the dictionary.
What can be done to find those "equivalent words"? What analysis can be performed on two words to mark them as equivalent?
Before investing too much in trying to unravel that, I'd first look to already existing implementations like Aspell or netspell, for two main reasons:
Not much point in re-inventing the wheel. Spell checking is much trickier than it first appears and it makes sense to build on work that has already been done
If your interest is finding out how to do it, the source code and community will be a great benefit should you decide to implement your own anyway
Much depends on your use case. For example:
Is your dictionary very small (about twenty words)? In this case it probably is better to precompute all possible nearby mistaken words and use a table/hash lookup.
What is your error model? Aspell has at least two (one for spelling errors caused by nearby letters on the keyboard, and the other for spelling errors caused by the way a word sounds).
How dynamic is your dictionary? Can you afford to do a massive preparation in order to get an efficient retrieval?
You may need a "word equivalence" measure like Double Metaphone, in addition to edit distance.
You can get some feel by reading Peter Norvig's great description of spelling correction.
And, of course, whenever possible, steal code. Do not reinvent the wheel without a reason - a reason could be a very special domain, a special way your users make spelling mistakes, or just to learn how it's done.
Edit Distance is the theory you need to write a spell checker. You also need a dictionary. Most UNIX systems come with a dictionary already installed for your locale.
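For reference, a minimal implementation of the classic Levenshtein edit distance (insertions, deletions, substitutions), which such a spell checker can use to rank dictionary words near the input:

    # Sketch: Levenshtein edit distance via the standard dynamic-programming recurrence.
    def edit_distance(a, b):
        # previous[j] holds the distance between a[:i-1] and b[:j]; current builds row i.
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                current.append(min(previous[j] + 1,          # deletion
                                   current[j - 1] + 1,       # insertion
                                   previous[j - 1] + cost))  # substitution
            previous = current
        return previous[-1]

    print(edit_distance("recieve", "receive"))  # 2 (a transposition counts as two edits here)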
I just finished implementing a spell checker and used a combination of the following to get a list of "suggested" words:
Phonetic hashing of the "misspelled" word to look up real dictionary words with an identical hash (for Java, check out Apache Commons Codec for a suitable library). The phonetic hashes of your dictionary file can be precomputed.
Edit distance between the input and the potentials (this is reasonably expensive, so you need to reduce the list first with something like a phonetic hash, assuming a higher-volume load - in my case, a server-based spell check).
A known list of common misspellings, e.g. recieve vs. receive.
An ordered list of the most common words in the English language.
Essentially I weighted each potential word primarily based on edit distance and commonality. E.g. if word probability is a percentage, then
weight = edit-distance * 100 / probability
(lower weights are better)
But then I also override any result with the known common misspellings (i.e. these always float to the top suggested results).
There may be better ways, but this worked pretty well.
You may also wish to ignore ALL CAPS words, initials etc, so choosing what to ignore is also something to think about.
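A small sketch of the ranking described above, using the jellyfish library (metaphone, levenshtein_distance) in place of Apache Commons Codec; the word frequencies and the misspellings table are invented for illustration:

    # Sketch: restrict candidates via a phonetic hash, weight by edit distance and
    # word frequency, and let a table of known misspellings win outright.
    import jellyfish

    WORD_FREQ = {"receive": 0.012, "recipe": 0.008, "relieve": 0.004}  # probability as a percentage (made up)
    KNOWN_MISSPELLINGS = {"recieve": "receive"}

    # Precompute phonetic hash -> dictionary words (done once for the whole dictionary).
    PHONETIC_INDEX = {}
    for word in WORD_FREQ:
        PHONETIC_INDEX.setdefault(jellyfish.metaphone(word), []).append(word)

    def suggest(misspelled):
        if misspelled in KNOWN_MISSPELLINGS:               # common misspellings float to the top
            return [KNOWN_MISSPELLINGS[misspelled]]
        candidates = PHONETIC_INDEX.get(jellyfish.metaphone(misspelled), [])
        # weight = edit-distance * 100 / probability; lower is better
        return sorted(candidates,
                      key=lambda w: jellyfish.levenshtein_distance(misspelled, w) * 100 / WORD_FREQ[w])

    print(suggest("recieve"))   # via the known-misspellings table
    print(suggest("receeve"))   # ranked by weight among phonetically similar words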
Under Linux/Unix you have ispell. Why reinvent the wheel?