Decent multi-lingual stemmer or analyzer for Lucene / ElasticSearch? - lucene

I'm curious if there are generic analyzers which do a decent job of stemming / analyzing text which could be in different languages. For certain tasks, doing proper multi-lingual search (e.g. splitting a field name into name.english, name.french, etc.) seems like overkill.
Is there an analyzer which will strip suffixes (e.g. "dogs" --> "dog") and work for more than just English? I don't really care whether it does language detection, etc., and working on just e.g. Romance & Germanic languages would probably be good enough. Or, is the loss of quality serious enough that it's always worth just using language-specific analyzers and language-specific queries?

Your best bet would be to use the ICU analyzers. They are useful for normalization but less useful for things like stemming, which is inherently language-specific.
Additionally, it is possible to use a separate language field and use different analyzers based on the value of that field. So you could combine both approaches: support the languages you care about with specialized analyzers and fall back to the ICU tokenizer for everything else: http://www.elasticsearch.org/guide/reference/mapping/analyzer-field/
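A minimal sketch of that combination, in Python: pick an analyzer name from a document's language field and fall back to an ICU-based analyzer otherwise. The field name `lang`, the function `pick_analyzer`, and the mapping table are hypothetical. `english`, `french`, and `german` are names of built-in Elasticsearch language analyzers, and `icu_analyzer` is assumed to be provided by the ICU analysis plugin in your version.

```python
# Hypothetical mapping from a language code to an analyzer name.
LANGUAGE_ANALYZERS = {
    "en": "english",
    "fr": "french",
    "de": "german",
}

# Assumption: an ICU-based analyzer is installed under this name.
FALLBACK_ANALYZER = "icu_analyzer"


def pick_analyzer(doc, lang_field="lang"):
    """Return the analyzer name for a document, based on its language field."""
    lang = (doc.get(lang_field) or "").lower()
    # Unknown or missing languages fall back to the generic ICU analyzer.
    return LANGUAGE_ANALYZERS.get(lang, FALLBACK_ANALYZER)
```

The same lookup could drive which sub-field (name.english, name.french, ...) a value is indexed into, if you do go the multi-field route.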
You might want to watch this presentation from the recent Berlin Buzzwords conference about multi-language support: http://www.youtube.com/watch?v=QI0XEshXygo. There's a lot of good stuff in there. Jump to the 27th minute for an example of using different analyzers.

Related

Is there a description of the mecab (Japanese word parser) algorithm?

Is there a document somewhere that describes the Mecab algorithm?
Or could someone give a simple one-paragraph or one-page description?
I'm finding it too hard to understand the existing code, and what the databases contain.
I need this functionality in my free website and phone apps for teaching languages (www.jtlanguage.com). I also want to generalize it for other languages, and make use of the conjugation detection mechanism I've already implemented, and I also need it without license encumbrance. Therefore I want to create my own implementation (C#).
I already have a dictionary database derived from EDICT. What else is needed? A frequency-of-usage database?
Thank you.
Some thoughts that are too long to fit in a comment.
§ What license encumbrances? MeCab is dual-licensed including BSD, so that's about as unencumbered as you can get.
§ There's also a Java rewrite of Mecab called Kuromoji that's Apache licensed, also very commercial-friendly.
§ MeCab implements a machine learning technique called conditional random fields for morphological parsing (separating free text into morphemes) and part-of-speech tagging (labeling those morphemes) of Japanese text. It is able to use various dictionaries as training data, which you've seen: IPADIC, UniDic, etc. Those dictionaries are compilations of morphemes and parts of speech, and are the work of many human-years' worth of linguistic research. The linked paper is by the authors of MeCab.
§ Others have applied other powerful machine learning algorithms to the problem of Japanese parsing.
Kytea can apply both support vector machines and logistic regression to the same problem. It is written in C++ and Apache licensed, and the papers are there to read.
Rakuten MA is in JavaScript, also liberally licensed (Apache again), and comes with a regular dictionary and a light-weight one for constrained apps—it won't give you readings of kanji though. You can find the academic papers describing the algorithm there.
§ Given the above, I think you can see that simple dictionaries like EDICT and JMDICT are insufficient to do the advanced analysis that these morphological parsers do. And these algorithms are likely way overkill for other, easier-to-parse languages (i.e., languages with spaces).
If you need the power of these libraries, you're probably better off writing a microservice that runs one of these systems (I wrote a REST frontend to Kuromoji called clj-kuromoji-jmdictfurigana) instead of trying to reimplement them in C#.
Though note that it appears C# bindings to MeCab exist: see this answer.
In several small projects I just shell out to MeCab, then read and parse its output. My TypeScript example using UniDic for Node.js.
§ But maybe you don't need full morphological parsing and part-of-speech tagging? Have you ever used Rikaichamp, the Firefox add-on that uses JMDICT and other low-weight publicly-available resources to put glosses on website text? (A Chrome version also exists.) It uses a much simpler deinflector that quite frankly is awful compared to MeCab et al. but can often get the job done.
§ You had a question about the structure of the dictionaries (you called them "databases"). This note from Kimtaro (the author of Jisho.org) on how to add custom vocabulary to IPADIC may clarify at least how IPADIC works: https://gist.github.com/Kimtaro/ab137870ad4a385b2d79. Other more modern dictionaries (I tend to use UniDic) use different formats, which is why the output of MeCab differs depending on which dictionary you're using.
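As a rough illustration of the "shell out to MeCab, then read and parse its output" approach mentioned above, here is a Python sketch. It assumes a `mecab` binary on the PATH, a UTF-8 dictionary, and MeCab's default output format (one `surface<TAB>comma-separated features` line per morpheme, ending with `EOS`). The function names are hypothetical, and the meaning of each feature column depends on the dictionary in use (IPADIC and UniDic differ).

```python
import subprocess


def parse_mecab_output(output):
    """Parse MeCab's default output into (surface, features) pairs."""
    tokens = []
    for line in output.splitlines():
        # Skip blank lines and the end-of-sentence marker.
        if not line or line == "EOS":
            continue
        surface, _, feature_str = line.partition("\t")
        tokens.append((surface, feature_str.split(",")))
    return tokens


def run_mecab(text, mecab_cmd="mecab"):
    """Run the mecab binary on text and return the parsed tokens."""
    result = subprocess.run(
        [mecab_cmd],
        input=text,
        capture_output=True,
        encoding="utf-8",  # assumption: the installed dictionary is UTF-8
        check=True,
    )
    return parse_mecab_output(result.stdout)
```

Keeping the parser separate from the subprocess call makes it easy to test against saved output without MeCab installed.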

Is there a repository of grammars for CMU Sphinx?

I'm writing an (offline) voice recognition app. I have CMU Sphinx4 set up and working using some of the included demo dictionaries. However, they're of limited scope (e.g. numbers, cities, etc.).
Is there a more comprehensive grammar available? Or maybe a repository of more of these limited grammars? I'm trying to exhaust any other options before creating my own.
Thank you
Grammars are always specific to your particular goal, so it does not make sense to share them. Even such a simple subject as digits can vary between concrete applications: we use "zero" and "oh" to denote "0" in regular speech, whilst scientists also use "nought" for the same purpose.
Sphinx4 supports the JSGF and GRXML formats; you can easily find specifications of both.
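To show how small an application-specific grammar can be, here is a Python sketch that generates a one-rule JSGF grammar from a list of alternatives. The function name `build_jsgf` and the grammar and rule names are hypothetical; the output uses only the basic JSGF header, grammar declaration, and a public rule with `|` alternatives.

```python
def build_jsgf(grammar_name, rule_name, alternatives):
    """Return a JSGF grammar with one public rule listing the alternatives."""
    body = " | ".join(alternatives)
    return (
        "#JSGF V1.0;\n"
        f"grammar {grammar_name};\n"
        f"public <{rule_name}> = {body};\n"
    )


if __name__ == "__main__":
    # Both "zero" and "oh" are accepted as spoken forms of 0.
    print(build_jsgf("digits", "digit", ["zero", "oh", "one", "two"]))
```

Whether "oh" or "nought" belongs in the list is exactly the kind of decision that depends on your application.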
You seem to be confusing grammars with dictionaries. They are completely different things.
Sphinx supports not only grammars, but also n-gram language models. You may find them more versatile. Such a model can be generated automatically and will work well if it is built from a large corpus that reflects real usage.
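To make "generated automatically from a corpus" concrete, here is a toy bigram model in Python. It is only a sketch of the idea: real Sphinx language models are built with dedicated toolkits and use smoothing, which this example deliberately omits. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count word bigrams, padding each sentence with <s> and </s> markers."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts


def bigram_probability(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev is unseen."""
    following = counts.get(prev)
    if not following:
        return 0.0
    return following[word] / sum(following.values())
```

With only a handful of sentences most word pairs get probability zero, which is why the corpus needs to be large and representative.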
As for dictionaries, creating them for English is relatively simple. One could even think about a tool which reads a phonetic word representation from an online dictionary and converts it to Sphinx format. The only input would be a word list.
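The conversion step of such a tool could look like the Python sketch below. It assumes the pronunciations have already been obtained and mapped to the phone set of your acoustic model, and that the target is the CMU-style layout of one `word PHONE PHONE ...` entry per line, with alternate pronunciations written as `word(2)`, `word(3)`, and so on. The function name is hypothetical, and case conventions depend on the dictionary shipped with your model.

```python
def to_sphinx_dict(pronunciations):
    """Format {word: [phoneme lists]} as CMU-style dictionary lines."""
    lines = []
    for word in sorted(pronunciations):
        for index, phones in enumerate(pronunciations[word], start=1):
            # Alternate pronunciations are marked as word(2), word(3), ...
            entry = word if index == 1 else f"{word}({index})"
            lines.append(f"{entry} {' '.join(phones)}")
    return "\n".join(lines) + "\n"
```

The hard part of the tool is the lookup and phone mapping, not this formatting step.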
I believe this paper will come in handy for your effort. It covers creating a grammar and dictionary for a new language, Swahili.

How Do I Design Abstract Semantic Graphs?

Can someone direct me to online resources for designing and implementing abstract semantic graphs (ASG)? I want to create an ASG editor for my language. Being able to edit the ASG directly has a number of advantages:
Only identifiers and literals need to be typed in and identifiers are written only once, when they're defined. Everything else is selected via the mouse.
Since the editor knows the language's grammar, there are no more syntax errors. The editor prevents them from being created in the first place.
Since the editor knows the language's semantics, there are no more semantic errors.
There are some secondary advantages:
Since all the reserved words are easily separable, a program can be written in one locale and viewed in another. On-the-fly changes of locale are possible.
All the text literals are easily separable, so changes of locale are easily made, including on-the-fly changes.
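To make the idea concrete, here is a toy Python sketch of such a graph. Everything in it (node classes, keyword tables, the `render` function) is hypothetical and far simpler than a real language would need. It illustrates two of the points above: uses of an identifier point at its single definition instead of repeating the name, and keywords are stored as kinds and spelled only at display time.

```python
from dataclasses import dataclass

# Hypothetical keyword tables: the graph stores keyword kinds, not spellings.
KEYWORDS = {
    "en": {"if": "if", "then": "then", "assign": "set"},
    "fr": {"if": "si", "then": "alors", "assign": "définir"},
}


@dataclass
class Definition:
    name: str  # typed once; every use points back to this object


@dataclass
class Use:
    target: Definition


@dataclass
class Literal:
    value: int


@dataclass
class Assign:
    target: Use
    value: object


@dataclass
class If:
    condition: object
    body: object


def render(node, locale):
    """Produce a textual view of the graph in the given locale."""
    kw = KEYWORDS[locale]
    if isinstance(node, Use):
        return node.target.name
    if isinstance(node, Literal):
        return str(node.value)
    if isinstance(node, Assign):
        return f"{kw['assign']} {render(node.target, locale)} = {render(node.value, locale)}"
    if isinstance(node, If):
        return f"{kw['if']} {render(node.condition, locale)} {kw['then']} {render(node.body, locale)}"
    raise TypeError(f"unknown node type: {type(node).__name__}")
```

Renaming the `Definition` changes every rendered use at once, and switching `locale` changes only the keywords.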
I'm not aware of a book on the matter, but you'll find the topic discussed in portions of various books on computer language. You'll also find discussions of this surrounding various projects which implement what you describe. For instance, you'll find quite a bit of discussion regarding the design of Scratch. Most workflow engines are also based on scripting in semantic graphs.
Allow me to opine... We've had the technology to manipulate language structurally for basically as long as we've had programming languages. I believe that the reason we still use textual language is a combination of the fact that it is more natural for us as humans, who communicate in natural language, to wield, and the fact that it is sometimes difficult to compose and refactor code when proper structure has to be maintained. If you're not sure what I mean, try building complex expressions in Scratch. Text is easier and a decent IDE gives virtually as much verification of correct structure.*
*I don't mean to take anything away from Scratch, it's a thing of beauty and is perfect for its intended purpose.

How can I start building a WordNet for Turkish to use in sentiment analysis?

Although I have an EE background, I never got the chance to take natural language processing classes.
I would like to build a sentiment analysis tool for Turkish. I think it is better to create a Turkish WordNet database than to translate the text to English and analyze the buggy translation with existing tools. (Is it?)
So what do you recommend? Should I first take NLP classes from an open course website? I really don't know where to start. Could you help me, and maybe provide a step-by-step guide? I know this is an academic project, but I am interested in building skills in this area as a hobby.
Thanks in advance.
Here is the process I have used before (making Japanese, Chinese, German and Arabic semantic networks):
Gather at least two English/Turkish dictionaries. They must be independent, not derived from each other. You can use Wikipedia to auto-generate one of your dictionaries. If you need to publish your network, then you may need open source dictionaries, or license fees, or a lawyer.
Use those dictionaries to translate English Wordnet, producing a confidence rating for each synset.
Keep those with strong confidence, and manually approve or fix those with medium or low confidence.
Finish it off manually.
I expanded on this in the "Automatic Translation Of WordNet" section of my 2008 paper: http://dcook.org/mlsn/about/papers/nlp2008.MLSN_A_Multilingual_Semantic_Network.pdf
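The translate-and-score step might be sketched in Python as follows. The scoring rule here (the fraction of dictionaries that propose a candidate for any lemma of the synset) is an assumption chosen for illustration, not the method from the paper, and the function names and sample entries are hypothetical.

```python
def translate_synset(lemmas, dictionaries):
    """Score candidate translations of one synset.

    lemmas: English words in the synset.
    dictionaries: list of {english_word: set of target-language words}.
    Returns {candidate: confidence}, with confidence between 0 and 1.
    """
    votes = {}
    for dictionary in dictionaries:
        # Collect every candidate this dictionary offers for the synset.
        seen = set()
        for lemma in lemmas:
            seen |= dictionary.get(lemma, set())
        # Each dictionary votes at most once per candidate.
        for candidate in seen:
            votes[candidate] = votes.get(candidate, 0) + 1
    return {c: n / len(dictionaries) for c, n in votes.items()}


def triage(scores, keep_at=1.0):
    """Split candidates into auto-accepted ones and ones needing review."""
    keep = sorted(c for c, s in scores.items() if s >= keep_at)
    review = sorted(c for c, s in scores.items() if s < keep_at)
    return keep, review
```

This is why the dictionaries must be independent: agreement between two dictionaries derived from the same source carries no extra evidence.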
(For your stated goal of a Turkish sentiment dictionary, there are other approaches, not involving a semantic network. E.g. "Sentiment Analysis and Opinion Mining", by Bing Liu, is a good round-up of research. But a semantic network approach will, IMHO, always give better results in the long run, and has so many other uses.)

How to use CFStringTokenizer with Chinese and Japanese?

I'm using the code here to split text into individual words, and it's working great for all languages I have tried, except for Japanese and Chinese.
Is there a way that code can be tweaked to properly tokenize Japanese and Chinese as well? The documentation says those languages are supported, but it does not seem to be breaking words in the proper places. For example, when it tokenizes "新しい" it breaks it into two words "新し" and "い" when it should be one (I don't speak Japanese, so I don't know if that is actually correct, but the sample I have says that those should all be one word). Other times it skips over words.
I did try creating Chinese and Japanese locales, while using kCFStringTokenizerUnitWordBoundary. The results improved, but are still not good enough for what I'm doing (adding hyperlinks to vocabulary words).
I am aware of some other tokenizers that are available, but would rather avoid them if I can just stick with core foundation.
[UPDATE] We ended up using mecab with a specific user dictionary for Japanese for some time, and have now moved over to just doing all of this on the server side. It may not be perfect there, but we have consistent results across all platforms.
If you know that you're parsing a particular language, you should create your CFStringTokenizer with the correct CFLocale (or at the very least, the guess from CFStringTokenizerCopyBestStringLanguage) and use kCFStringTokenizerUnitWordBoundary.
Unfortunately, perfect word segmentation of Chinese and Japanese text remains an open and complex problem, so any segmentation library you use is going to have some failings. For Japanese, CFStringTokenizer uses the MeCab library internally and ICU's Boundary Analysis (only when using kCFStringTokenizerUnitWordBoundary, which is why you're getting a funny break with "新しい" without it).
Also have a look at NSLinguisticTagger.
But by itself it won't give you much more.
Truth be told, these two languages (and some others) are really hard to programmatically tokenize accurately.
You should also see the WWDC videos on LSM (Latent Semantic Mapping). They cover the topic of stemming and lemmas. This is the art and science of more accurately determining how to tokenize meaningfully.
What you want to do is hard. Finding word boundaries alone does not give you enough context to convey accurate meaning. It requires looking at the context and also identifying idioms and phrases that should not be broken up word by word (not to mention grammatical forms).
After that look again at the available libraries, then get a book on Python NLTK to learn what you really need to learn about NLP to understand how much you really want to pursue this.
Larger bodies of text inherently yield better results. There's no accounting for typos and bad grammar. Much of the context needed to drive the logic of the analysis is implicit and not directly written as a word. You get to build rules and train the thing.
Japanese is a particularly tough one and many libraries developed outside of Japan don't come close. You need some knowledge of a language to know if the analysis is working. Even native Japanese people can have a hard time doing the natural analysis without the proper context. There are common scenarios where the language presents two mutually intelligible correct word boundaries.
To give an analogy, it's like doing lots of look ahead and look behind in regular expressions.
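The ambiguity is easy to demonstrate with a brute-force Python sketch that enumerates every way to split a string into words from a vocabulary. The function name and the tiny vocabulary are made up for illustration; real segmenters such as MeCab score the candidate paths rather than listing them all.

```python
def segmentations(text, vocabulary):
    """Return every way to split text into words from vocabulary."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in vocabulary:
            # Extend each segmentation of the remainder with this word.
            for rest in segmentations(text[end:], vocabulary):
                results.append([word] + rest)
    return results


if __name__ == "__main__":
    vocab = {"東京", "都", "東", "京都"}
    for split in segmentations("東京都", vocab):
        print(" / ".join(split))
```

With this vocabulary the string has two dictionary-valid splits, and only context can tell which one is meant.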