I'm trying to convert English sounds to IPA symbols in code. I came across espeak, which has good collection of words with corresponding sounds for English in en_list.
As per my understanding so far, e-speak uses Kirshenbaum representation.
In en_list, I come across usage of #,
eg: anomaly a#n0m#li
But # is unused character as per Kirshenbaum.
So, wanted to know meaning of # in en_list
Related
I am using the following code to split Japanese sentences into its words:
Dim parameter = New MeCabParam()
Dim tagger = MeCabTagger.Create(parameter)
For Each node In tagger.ParseToNodes(sentence)
If node.CharType > 0 Then
Dim features = node.Feature.Split(",")
Console.Write(node.Surface)
Console.WriteLine(" (" & features(7) & ") " & features(1))
End If
Next
An input of それに応じて大きくになります。 outputs morphemes:
それ (それ) 代名詞
に (に) 格助詞
応じ (おうじ) 自立
て (て) 接続助詞
大きく (おおきく) 自立
に (に) 格助詞
なり (なり) 自立
ます (ます) *
。 (。) 句点
Rather than words like so:
それ
に
応じて
大きく
に
なります
。
Is there a way I can use a parameter to get MeCab to output the latter? I am very new to coding so would appreciate it if you explain simply. Thanks.
This is actually pretty hard to do. MeCab, Kuromoji, Sudachi, KyTea, Rakuten-MA—all of these Japanese parsers and the dictionary databases they consume (IPADIC, UniDic, Neologd, etc.) have chosen to parse morphemes, the smallest units of meaning, instead of what you call "words", which as your example shows often contain multiple morphemes.
There are some strategies that usually folks combine to improve on this.
Experiment with different dictionaries. I've noticed that UniDic is sometimes more consistent than IPADIC.
Use a bunsetsu chunker like J.DepP, which consumes the output of MeCab to chunk together morphemes into bunsetsu. Per this paper, "We use the notion of a bunsetsu which roughly corresponds to a minimum phrase in English and consists of a content words (basically nouns or verbs) and the functional words surrounding them." The bunsetsu output by J.DepP often correspond to "words". I personally don't think of, say, a noun + particle phrase as a "word" but you might—these two are usually in a single bunsetsu. (J.DepP is also pretttty fancy, in that it also outputs a dependency tree between bunsetsu, so you can see which one modifies or is secondary to which other one. See my example.)
A last technique that you shouldn't overlook is scanning the dictionary (JMdict) for runs of adjacent morphemes; this helps find idioms or set phrases. It can get complicated because the dictionary may have a deconjugated form of a phrase in your sentence, so you might have to search both the literal sentence form and the deconjugated (lemma) form of MeCab output.
I have an open-source package that combines all of the above called Curtiz: it runs text through MeCab, chunks them into bunsetsu with J.DepP to find groups of morphemes that belong together, identifies vocabulary by looking them up in the dictionary, separates particles and conjugated phrases, etc. It is likely not going to be useful for you, since I use it to support my activities in learning Japanese and making Japanese learning tools but it shows how the above pieces can be combined to get to what you need in Japanese NLP.
Hopefully that's helpful. I'm happy to elaborate more on any of the above topics.
I'm working with spaCy, version 2.3. I have a not-quite-regular-expression scanner which identifies spans of text which I don't want analyzed any further. I've added a pipe at the beginning of the pipeline, right after the tokenizer, which uses the document retokenizer to make these spans into single tokens. I'd like to remainder of the pipeline to treat these tokens as proper nouns. What's the right way to do this? I've set the POS and TAG attrs in my calls to retokenizer.merge(), and those settings persist in the resulting sentence parse, but the dependency information on these tokens makes me doubt that my settings have had the desired impact. Is there a way to update the vocabulary so that the POS tagger knows that the only POS option for these tokens is PROPN?
Thanks in advance.
The tagger and parser are independent (the parser doesn't use the tags as features), so modifying the tags isn't going to affect the dependency parse.
The tagger doesn't overwrite any existing tags, so if a tag is already set, it doesn't modify it. (The existing tags don't influence its predictions at all, though, so the surrounding words are tagged the same way they would be otherwise.)
Setting TAG and POS in the retokenizer is a good way to set those attributes. If you're not always retokenizing and you want to set the TAG and/or POS based on a regular expression for the token text, then the best way to do this is a custom pipeline component that you add before the tagger that sets tags for certain words.
The transition-based parsing algorithm can't easily deal with partial dependencies in the input, so there isn't a straightforward solution here. I can think of a few things that might help:
The parser does respect pre-set sentence boundaries. If your skipped tokens are between sentences, you can set token.is_sent_start = True for that token and the following token so that the skipped token always ends up in its own sentence. If the skipped tokens are in the middle of a sentence or you want them to be analyzed as nouns in the sentence, then this won't help.
The parser does use the token.norm feature, so if you set the NORM feature in the retokenizer to something extremely PROPN-like, you might have a better chance of getting the intended analysis. For example, if you're using a provided English model like en_core_web_sm, use a word you think would be a frequent similar proper noun in American newspaper text from 20 years ago, so if the skipped token should be like a last name, use "Bush" or "Clinton". It won't guarantee a better parse, but it could help.
If you using a model with vectors like en_core_web_lg, you can also set the vectors for the skipped token to be the same as a similar word (check that the similar word has a vector first). This is how to tell the model to refer to the same row in the vector table for UNKNOWN_SKIPPED as Bush.
The simpler option (that duplicates the vectors in the vector table internally):
nlp.vocab.set_vector("UNKNOWN_SKIPPED", nlp.vocab["Bush"].vector)
The less elegant version that doesn't duplicate vectors underneath:
nlp.vocab.vectors.add("UNKNOWN_SKIPPED", row=nlp.vocab["Bush"].rank)
nlp.vocab["UNKNOWN_SKIPPED"].rank = nlp.vocab["Bush"].rank
(The second line is only necessary to get this to work for a model that's currently loaded. If you save it as a custom model after the first line with nlp.to_disk() and reload it, then only the first line is necessary.)
If you just have a small set of skipped tokens, you could update the parser with some examples containing these tokens, but this can be tricky to do well without affecting the accuracy of the parser for other cases.
The NORM and vector modifications will also influence the tagger, so it's possible if you choose those well, you might get pretty close to the results you want.
I am finding the tokenization code quite complicated and I still couldn't find where in the code the sentences are split.
For example, how does the tokenizer know that
Mr. Smitt stayed at home. He was tired
should not be split in "Mr." and should be split before "He".? And where in the code does the split before "He" happens?
(In fact, I am unsure actually unsure if I am looking at the right place: if I search for sents in tokenizer.pyx I don't find any occurrence)
You access the splits via the doc object, with the generator:
doc.sents
The output of the generator is a series of spans.
As for how the splits are chosen, the document is parsed for dependency relationships. Understanding the parser is not trivial - you'll have to read into it if you want to understand it - it's using a neural network to inform the decision about how to construct the dependency trees; but the splits are those gaps between tokens which are not crossed by dependencies. This is not simply where you find a full-stop, and the method is more robust as a result.
We are currently implementing a Zend Framework Project, that needs to be translated in 6 different languages. We already have a pretty sophisticated translation system, based on Zend_Translate, which also handles variables in translation keys.
Our project has a new Turkish translator, and we are facing a new issue: Grammar, especially Turkish one. I noticed that this problem might be evident in every translation system and in most languages, so I posted a question here.
Question: Any ideas how to handle translations like:
Key: I have a[n] {fruit}
Variables: apple, banana
Result: I have an apple. I have a banana.
Key: Stimme für {user}[s] Einsendung
Variables: Paul, Markus
Result: Stimme für Pauls Einsendung,
Result: Stimme für Markus Einsendung
Anybody has a solution or idea for this? My only guess would be to avoid this by not using translations where these issues occur.
How do other platforms handle this?
Of course the translation system has no idea which type of word it is placing where in which type of Sentence. It only does some string replacements...
PS: Turkish is even more complicated:
For example, on a profile page, we have "Annie's Network". This should translate as "Annie'nin Aği".
If the first name ends in a vowel, the suffix will start with an n and look like "Annie'nin"
If the first name ends in a consonant, it will not have the first n, and look like "Kris'in"
If the last vowel is an a or ı, it will look like "Dan'ın"; or Seyma'nın"
If the last vowel is an o or u, it will look like "Davud'un"; or "Burcu'nun"
If the last vowel is an e or i, it will look like "Erin'in"; or "Efe'nin"
If the last vowel is an ö or ü, it will look like "Göz'ün'; or "Iminönü'nün"
If the last letter is a k (like the name "Basak"), it will look like "Basağın"; or "Eriğin"
It is actually very hard problem, as grammar rules are different even among languages from the same family. I don't think you could easily do anything for let's say Slavic languages...
However, if you want to solve this problem (because this is extra challenging) and you are looking for creative (cross inspiring) ways to do that, you might want to look into something called ChoiceFormat (example would be one from ICU Project) or you can look up GNU Gettext's solution for plural forms problem.
ICU (mentioned above) has a SelectFormat http://site.icu-project.org/design/formatting/select that may be of help- it's like a choice format but with arbitrary keywords. Also, it does have a PluralFormat which already has rules for many language's plural rules.
I'm on OS X, and in objective-c I'm trying to convert
for example,
"Bobateagreenapple"
into
"Bob ate a green apple"
Is there any way to do this efficiently? Would something involving a spell checker work?
EDIT: Just some extra information:
I'm attempting to build something that takes some misformatted text (for example, text copy pasted from old pdfs that end up without spaces, especially from internet archives like JSTOR). Since the misformatted text is probably going to be long... well, I'm just trying to figure out whether this is feasibly possible before I actually attempt to actually write system only to find out it takes 2 hours to fix a paragraph of text.
One possibility, which I will describe this in a non-OS specific manner, is to perform a search through all the possible words that make up the collection of letters.
Basically you chop off the first letter of your letter collection and add it to the current word you are forming. If it makes a word (eg dictionary lookup) then add it to the current sentence. If you manage to use up all the letters in your collection and form words out of all of them, then you have a full sentence. But, you don't have to stop here. Instead, you keep running, and eventually you will produce all possible sentences.
Pseudo-code would look something like this:
FindWords(vector<Sentence> sentences, Sentence s, Word w, Letters l)
{
if (l.empty() and w.empty())
add s to sentences;
return;
if (l.empty())
return;
add first letter from l to w;
if w in dictionary
{
add w to s;
FindWords(sentences, s, empty word, l)
remove w from s
}
FindWords(sentences, s, w, l)
put last letter from w back onto l
}
There are, of course, a number of optimizations you could perform to make it go fast. For instance checking if the word is the stem of any word in the dictionary. But, this is the basic approach that will give you all possible sentences.
Solving this problem is much harder than anything you'll find in a framework. Notice that even in your example, there are other "solutions": "Bob a tea green apple," for one.
A very naive (and not very functional) approach might be to use a spell-checker to try to isolate one "real word" at a time in the string; of course, in this example, that would only work because "Bob" happens to be an English word.
This is not to say that there is no way to accomplish what you want, but the way you phrase this question indicates to me that it might be a lot more complicated than what you're expecting. Maybe someone can give you an acceptable solution, but I bet they'll need to know a lot more about what exactly you're trying to do.
Edit: in response to your edit, it would probably take less effort to run some kind of OCR tool on a PDF and correct its output than it would just to correct what this system might give you, let alone program it
I implemented a solution, the code is avaible on code project:
http://www.codeproject.com/Tips/704003/How-to-add-spaces-between-spaceless-strings
My idea was to prioritize results that use up most of the characters (preferable all of them) then favor the ones with the longest words, because 2,3 or 4 character long words can often come up by chance from leftout characters. Most of the times this provides the correct solution.
To find all possible permutations I used recursion. The code is quite fast even with big dictionaries (tested with 50 000 words).