Lucene Analyzer

I have worked with Lucene for indexing documents and providing search over them; however, that work was in English. Now I have a project in the Kurdish language, which uses some Arabic Unicode characters along with several others. Here is the Table of Unicode Characters used in Kurdish-Arabic script.
My question is: how do I create an Analyzer for this language, or can I use the Arabic Analyzer for this purpose?

Lucene has a list of other analyzers, including Arabic. I'm afraid there's none that targets Kurdish specifically, but maybe you can extend the Arabic analyzer to fit your needs?
Just bear in mind that all these analyzers come separately from the main Lucene distribution.

To answer your question about how to create a custom Analyzer for a new language: the book "Lucene in Action" covers the creation of custom analyzers in detail. You can reuse a lot of the code found in the other analyzers and change only what you need. Lucene is open source and very extensible, so making these changes is fairly easy.
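As a concrete starting point, one common piece of such an analyzer is a character-normalization step that folds Arabic letter variants onto the forms used in Kurdish-Arabic script. Here is a minimal plain-Java sketch; the mapping table is illustrative, not a complete Kurdish normalizer, and in a real Lucene Analyzer this logic would live in a custom CharFilter or TokenFilter:

```java
import java.util.Map;

public class KurdishNormalizer {

    // Fold Arabic letter variants onto Kurdish canonical forms.
    // U+0643 ARABIC LETTER KAF          -> U+06A9 ARABIC LETTER KEHEH
    // U+064A ARABIC LETTER YEH          -> U+06CC ARABIC LETTER FARSI YEH
    // U+0649 ARABIC LETTER ALEF MAKSURA -> U+06CC ARABIC LETTER FARSI YEH
    private static final Map<Character, Character> FOLD = Map.of(
            '\u0643', '\u06A9',
            '\u064A', '\u06CC',
            '\u0649', '\u06CC');

    public static String normalize(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            sb.append(FOLD.getOrDefault(c, c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Arabic-kaf/yeh spelling folded onto the Kurdish letters.
        System.out.println(normalize("\u0643\u0648\u0631\u062F\u064A"));
    }
}
```

Wrapping this per-character mapping in a Lucene CharFilter, and chaining it before a standard tokenizer and lowercase filter, is essentially the pattern the existing Arabic analyzer follows.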

Related

How to use MorfologikAnalyzer in apache lucene for lemmatization?

I am creating an English search engine using Apache Lucene. Since I need to do lemmatization for that, I am using Stanford CoreNLP, and I know how to do that.
Is it possible to use MorfologikAnalyzer, or something similar that ships out of the box with Apache Lucene, to do the lemmatization?
Unfortunately, MorfologikAnalyzer is only supposed to work with the Polish language, and it provides stemming capabilities rather than lemmatization.
There are no built-in Apache Lucene analyzers that can help you here, so the existing options are the following:
Stanford CoreNLP
OpenNLP lemmatizer
NLTK Lemmatizer (Python)
Of course, there are also several paid lemmatization engines, some of which may be even richer than those above, especially if lemmatization is required for a specific domain (publishing, etc.).
I wouldn't list any of them here, but it shouldn't be hard to find them if needed.
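To illustrate the distinction drawn above: a lemmatizer maps a word to its dictionary form, which suffix-stripping alone cannot do for irregular words. A toy plain-Java sketch (the exception table and the single suffix rule are made up for illustration; a real engine such as CoreNLP or the OpenNLP lemmatizer uses POS tags and a full lexicon):

```java
import java.util.Map;

public class ToyLemmatizer {

    // Irregular forms need a lookup table -- a stemmer cannot recover these.
    private static final Map<String, String> IRREGULAR = Map.of(
            "better", "good",
            "went", "go",
            "mice", "mouse");

    public static String lemma(String word) {
        String w = word.toLowerCase();
        if (IRREGULAR.containsKey(w)) {
            return IRREGULAR.get(w);
        }
        // One regular rule, for illustration: strip plural "-s" (but not "-ss").
        if (w.endsWith("s") && w.length() > 3 && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(lemma("better")); // good
        System.out.println(lemma("dogs"));   // dog
    }
}
```

A Porter-style stemmer would turn "better" into "better" (or "bett"), never into "good", which is why the answer distinguishes stemming from lemmatization.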

Lucene lemmatization

I'm indexing some English texts in a Java application with Lucene, and I need to lemmatize them with Lucene 4.1.0. I've found stemming (PorterStemFilter and SnowballFilter), but it's not enough.
After lemmatization I want to use a thesaurus for query expansion; does Lucene also include a thesaurus?
If it is not possible I will use the StanfordCoreNLP and WordNet instead.
Do you think that lemmatization may influence search when using the Lucene library?
Thanks
As far as I know, you will need to build synonym support yourself (though Lucene's analyzers-common module does ship a SynonymFilter you could build on).

Is there a repository of grammars for CMU Sphinx?

I'm writing an (offline) voice recognition app. I have CMU Sphinx4 set up and working using some of the included demo dictionaries. However, they're of limited scope (e.g. numbers, cities, etc.).
Is there a more comprehensive grammar available? Or maybe a repository of more of these limited grammars? I'm trying to exhaust any other options before creating my own.
Thank you
Grammars are always specific to your particular goal, so it does not make much sense to share them. Even a subject as simple as digits can vary between concrete applications: we use "zero" and "oh" to denote "0" in regular speech, while scientists also use "nought" for the same purpose.
Sphinx4 supports the JSGF and GRXML formats; you can easily find specifications of both.
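For reference, here is what a small digits grammar looks like in JSGF (the grammar name and token set are just an example):

```jsgf
#JSGF V1.0;

grammar digits;

// "zero" and "oh" both denote the digit 0 in regular speech.
public <digit> = zero | oh | one | two | three | four
               | five | six | seven | eight | nine ;
```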
You seem to be confusing grammars with dictionaries. They are completely different things.
Sphinx supports not only grammars but also n-gram language models, which you may find more versatile. Such a model can be generated automatically and will work well if given a large corpus that reflects real usage.
As for dictionaries: creating one for English is relatively simple. One could even imagine a tool that reads a phonetic word representation from an online dictionary and converts it to the Sphinx format; the only input would be a word list.
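That converter idea can be sketched in a few lines: take a word list, look each word up in a pronunciation table, and emit lines in the Sphinx dictionary format (`word PH ON EMES`). The lookup table below is hardcoded and hypothetical; a real tool would query an online pronunciation dictionary or a grapheme-to-phoneme model, and would use proper CMUdict stress-marked phones:

```java
import java.util.List;
import java.util.Map;

public class DictBuilder {

    // Hypothetical pronunciation lookup; a real tool would fetch these
    // from an online dictionary or a g2p model.
    private static final Map<String, String> PHONES = Map.of(
            "hello", "HH AH L OW",
            "world", "W ER L D");

    // One Sphinx dictionary line per known word: "word PH ON EMES".
    public static String toSphinxDict(List<String> words) {
        StringBuilder sb = new StringBuilder();
        for (String w : words) {
            String phones = PHONES.get(w.toLowerCase());
            if (phones != null) {
                sb.append(w.toLowerCase()).append(' ')
                  .append(phones).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toSphinxDict(List.of("hello", "world")));
    }
}
```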
I believe this paper will come in handy for your effort. It describes creating a grammar and dictionary for a new language, Swahili.

Decent multi-lingual stemmer or analyzer for Lucene / ElasticSearch?

I'm curious if there are generic analyzers which do a decent job of stemming / analyzing text which could be in different languages. For certain tasks, doing proper multi-lingual search (e.g. splitting a field name into name.english, name.french, etc.) seems like overkill.
Is there an analyzer which will strip suffixes (e.g. "dogs" --> "dog") and work for more than just English? I don't really care whether it does language detection, etc., and covering just, say, Romance and Germanic languages would probably be good enough. Or is the loss of quality serious enough that it's always worth using language-specific analyzers and language-specific queries?
Your best bet would be to use the ICU analyzers. They are useful for normalizing, but less useful for things like stemming, which is inherently language specific.
Additionally, it is possible to use a separate language field and apply different analyzers based on the value of that field. So you could combine both approaches: fall back to the ICU tokenizer, and support the languages you care about with specialized analyzers: http://www.elasticsearch.org/guide/reference/mapping/analyzer-field/
You might want to watch this presentation from the recent Berlin Buzzwords conference about multi-language support: http://www.youtube.com/watch?v=QI0XEshXygo. There's a lot of good stuff in there. Jump to the 27th minute for an example of using different analyzers.
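A sketch of the combined approach as an Elasticsearch mapping, using multi-fields so the same text is indexed once per analyzer (index and field names are made up; this uses current mapping syntax rather than the older `_analyzer` field the link above describes, and `icu_analyzer` assumes the analysis-icu plugin is installed):

```json
PUT /articles
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "icu_analyzer",
        "fields": {
          "english": { "type": "text", "analyzer": "english" },
          "french":  { "type": "text", "analyzer": "french" }
        }
      }
    }
  }
}
```

At query time you would then search `name` as the language-agnostic fallback and `name.english` / `name.french` when the query language is known.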

How to use CFStringTokenizer with Chinese and Japanese?

I'm using the code here to split text into individual words, and it's working great for all languages I have tried, except for Japanese and Chinese.
Is there a way that code can be tweaked to properly tokenize Japanese and Chinese as well? The documentation says those languages are supported, but it does not seem to be breaking words in the proper places. For example, when it tokenizes "新しい" it breaks it into two words "新し" and "い" when it should be one (I don't speak Japanese, so I don't know if that is actually correct, but the sample I have says that those should all be one word). Other times it skips over words.
I did try creating Chinese and Japanese locales, while using kCFStringTokenizerUnitWordBoundary. The results improved, but are still not good enough for what I'm doing (adding hyperlinks to vocabulary words).
I am aware of some other tokenizers that are available, but would rather avoid them if I can just stick with core foundation.
[UPDATE] We ended up using mecab with a specific user dictionary for Japanese for some time, and have now moved over to just doing all of this on the server side. It may not be perfect there, but we have consistent results across all platforms.
If you know that you're parsing a particular language, you should create your CFStringTokenizer with the correct CFLocale (or at the very least, the guess from CFStringTokenizerCopyBestStringLanguage) and use kCFStringTokenizerUnitWordBoundary.
Unfortunately, perfect word segmentation of Chinese and Japanese text remains an open and complex problem, so any segmentation library you use is going to have some failings. For Japanese, CFStringTokenizer uses the MeCab library internally, plus ICU's boundary analysis (only when using kCFStringTokenizerUnitWordBoundary, which is why you get the odd break in "新しい" without it).
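As a point of comparison, the same kind of locale-aware boundary analysis ICU provides is exposed in Java's standard library as `java.text.BreakIterator`; this is a quick way to see how a given segmenter splits a string (segmentation quality differs by platform and will not necessarily match CFStringTokenizer's, especially for Japanese):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordSplitter {

    // Split text into word-boundary segments for the given locale.
    // The segments always partition the input exactly.
    public static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("新しい日", Locale.JAPANESE));
        System.out.println(split("hello world", Locale.ENGLISH));
    }
}
```

Whatever boundaries it picks, concatenating the segments always reproduces the original string, which makes this handy for eyeballing where a segmenter thinks the words are.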
Also have a look at NSLinguisticTagger, but by itself it won't give you much more.
Truth be told, these two languages (and some others) are really hard to tokenize accurately programmatically.
You should also see the WWDC videos on LSM (Latent Semantic Mapping). They cover the topic of stemming and lemmas: the art and science of determining more accurately how to tokenize meaningfully.
What you want to do is hard. Finding word boundaries alone does not give you enough context to convey accurate meaning. It requires looking at the context, and also identifying idioms and phrases that should not be broken apart (not to mention grammatical forms).
After that, look again at the available libraries; then get a book on the Python NLTK to learn what you really need to know about NLP, and to understand how far you really want to pursue this.
Larger bodies of text inherently yield better results. There's no accounting for typos and bad grammar, and much of the context needed to drive the analysis is implicit, not directly written as words. You get to build rules and train the system.
Japanese is a particularly tough one, and many libraries developed outside of Japan don't come close. You need some knowledge of the language to know whether the analysis is working; even native Japanese speakers can have a hard time doing the analysis without proper context. There are common scenarios where the language presents two equally valid word segmentations.
To give an analogy, it's like doing a lot of look-ahead and look-behind in regular expressions.