How to do stemming of the Telugu language using Java Lucene?
Unfortunately, there is no built-in stemmer for the Telugu language (unlike Hindi, which has one). This means that if you want to do Telugu stemming, you will need to implement this component yourself.
A good starting point would be to take a look at this presentation and incorporate those techniques into a Lucene TokenFilter.
Writing a custom TokenFilter isn't as hard as it looks. You can take a look at examples here.
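For illustration, here is a minimal sketch of such a filter. The class name and the suffix list are hypothetical placeholders; real Telugu stemming rules would come from linguistic resources like the presentation mentioned above.

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    /** Sketch of a suffix-stripping filter; the suffixes below are illustrative only. */
    public final class TeluguStemFilter extends TokenFilter {

      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

      // Hypothetical suffixes for illustration; a real stemmer needs proper rules.
      private static final String[] SUFFIXES = { "లు", "ను", "కి" };

      public TeluguStemFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        String term = termAtt.toString();
        for (String suffix : SUFFIXES) {
          // Strip the suffix only if a non-trivial stem remains.
          if (term.length() > suffix.length() + 1 && term.endsWith(suffix)) {
            termAtt.setLength(term.length() - suffix.length());
            break;
          }
        }
        return true;
      }
    }

The filter would then be chained after a tokenizer inside a custom Analyzer, just like the built-in language filters.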
Related
I'm indexing some English texts in a Java application with Lucene, and I need to lemmatize them with Lucene 4.1.0. I've found stemming (PorterStemFilter and SnowballFilter), but that is not enough.
After lemmatization I want to use a thesaurus for query expansion; does Lucene also include a thesaurus?
If that is not possible, I will use StanfordCoreNLP and WordNet instead.
Do you think that lemmatization may influence searching with the Lucene library?
Thanks
As far as I know, you will need to build synonym support in yourself.
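If you do build it yourself, a hand-constructed SynonymMap fed to Lucene's SynonymFilter is the usual starting point. Below is a minimal sketch against the Lucene 4.1 API; the analyzer class name and the word pair are placeholders, and in practice a real thesaurus (e.g. WordNet) would populate the builder.

    import java.io.IOException;
    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.synonym.SynonymFilter;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.util.CharsRef;
    import org.apache.lucene.util.Version;

    /** Sketch: expand terms with a hand-built synonym map at analysis time. */
    public class SynonymAnalyzer extends Analyzer {

      private final SynonymMap synonyms;

      public SynonymAnalyzer() throws IOException {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        // Map "dog" to "canine", keeping the original term in the stream.
        builder.add(new CharsRef("dog"), new CharsRef("canine"), true);
        this.synonyms = builder.build();
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_41, reader);
        TokenStream stream = new LowerCaseFilter(Version.LUCENE_41, source);
        stream = new SynonymFilter(stream, synonyms, true);
        return new TokenStreamComponents(source, stream);
      }
    }

Using the same analyzer at both index and query time gives you a basic form of query expansion without any extra query rewriting.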
I'm writing an (offline) voice recognition app. I have CMU Sphinx4 set up and working using some of the included demo dictionaries. However, they're of limited scope (e.g. numbers, cities, etc.).
Is there a more comprehensive grammar available? Or maybe a repository of more of these limited grammars? I'm trying to exhaust any other options before creating my own.
Thank you
Grammars are always specific to your particular goal, so it does not make sense to share them. Even such a simple subject as digits can vary between concrete applications: we use "zero" and "oh" to denote "0" in regular speech, whilst scientists also use "nought" for the same purpose.
Sphinx4 supports the JSGF and GRXML formats; you can easily find specifications of both.
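For reference, a JSGF grammar is just a small text file; a minimal, purely illustrative digits grammar looks like this:

    #JSGF V1.0;

    grammar digits;

    public <digit> = zero | oh | one | two | three | four |
                     five | six | seven | eight | nine;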
You seem to be confusing grammars with dictionaries. They are completely different things.
Sphinx supports not only grammars, but also n-gram language models, which you may find more versatile. Such a model can be generated automatically and works well when given a large corpus that reflects real usage sentences.
As for dictionaries: creating them for English is relatively simple. One could even imagine a tool that reads phonetic word representations from an online dictionary and converts them to the Sphinx format. The only input would be a word list.
I believe this paper will come in handy for your effort. It describes creating a grammar and dictionary for a new language, Swahili.
I'm curious if there are generic analyzers which do a decent job of stemming / analyzing text which could be in different languages. For certain tasks, doing proper multi-lingual search (e.g. splitting a field name into name.english, name.french, etc.) seems like overkill.
Is there an analyzer which will strip suffixes (e.g. "dogs" --> "dog") and work for more than just English? I don't really care whether it does language detection, etc., and working on just, say, Romance & Germanic languages would probably be good enough. Or is the loss of quality serious enough that it's always worth using language-specific analyzers and language-specific queries?
Your best bet would be to use the ICU analyzers. They are useful for normalizing, but less useful for things like stemming, which is inherently language-specific.
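Under the hood these are Lucene's ICU analysis module. As a plain-Lucene sketch of what that gives you (the class name is mine; it assumes a recent Lucene version where createComponents takes only the field name and the ICU module is on the classpath):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.icu.ICUFoldingFilter;
    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

    /** Language-neutral analysis: Unicode-aware tokenization plus folding, no stemming. */
    public class GenericIcuAnalyzer extends Analyzer {

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        // ICUTokenizer splits on Unicode word boundaries for any script.
        Tokenizer source = new ICUTokenizer();
        // ICUFoldingFilter folds case, accents and other character variants.
        TokenStream stream = new ICUFoldingFilter(source);
        return new TokenStreamComponents(source, stream);
      }
    }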
Additionally, it is possible to use a separate language field and apply different analyzers based on the value of that field. So you could combine both approaches: fall back to the ICU tokenizer and support the languages you care about with specialized analyzers: http://www.elasticsearch.org/guide/reference/mapping/analyzer-field/
You might want to watch this presentation from the recent Berlin Buzzwords conference about multi-language support: http://www.youtube.com/watch?v=QI0XEshXygo. There's a lot of good stuff in there. Jump to the 27th minute for an example of using different analyzers.
I have worked with Lucene for indexing documents and providing search over them; however, that work was in English. Now I have a project in the Kurdish language. Kurdish uses some Arabic Unicode characters plus several other characters; here is a Table of Unicode Characters used in Kurdish-Arabic script.
My question is: how do I create an Analyzer for this language, or can I use the Arabic Analyzer for this purpose?
Lucene has a list of other analyzers, including Arabic. I'm afraid there is none that targets Kurdish specifically, but maybe you can extend the Arabic analyzer to fit your needs?
Just bear in mind that all these analyzers come separately from the main Lucene distribution.
To answer your question about how to create a custom Analyzer for a new language: the "Lucene in Action" book covers the creation of custom analyzers and is pretty detailed. You can "leverage" a lot of the code found in other analyzers and just change what you need. Lucene is open source and very extensible, so making these changes is pretty easy.
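As a concrete but hypothetical sketch of that approach, you could reuse the Arabic-script normalization filter from the common analyzers module and leave room for Kurdish-specific rules (recent Lucene API assumed; the class name is a placeholder):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.ar.ArabicNormalizationFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    /** Sketch: reuse Arabic-script normalization, then add Kurdish-specific filters. */
    public class KurdishAnalyzer extends Analyzer {

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        // Normalization for the Arabic-script characters shared with Kurdish.
        TokenStream stream = new ArabicNormalizationFilter(source);
        // Kurdish-specific character mappings and stemming would be added here.
        return new TokenStreamComponents(source, stream);
      }
    }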
I'm looking for the ideal tool for publishing technical documentation in English & Arabic (in the same document). Should I use DocBook, or is it better to stick with TeX/LaTeX? I am a complete beginner with both systems, so there's no legacy to worry about. The two most important factors for me are ease of use and support for Arabic. By ease of use I mean that I don't mind setting up XML documents and so on, but for day-to-day writing I'd rather not hand-code XML; a good editor that gives a feel for how the document will look would be ideal. I would like the output to be print-ready PDF as well as HTML.
Well, TeX/LaTeX as shipped on the TeX Live CD/DVD/bundle, in its XeTeX incarnation, is certainly able to deal with Arabic; see these examples. I'm not sure whether all the DocBook utilities (like the editors and things like fop) are up to this.
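For what it's worth, a minimal mixed English/Arabic XeLaTeX document is quite short (this assumes the polyglossia package and an Arabic font such as Amiri are installed):

    % Compile with xelatex
    \documentclass{article}
    \usepackage{polyglossia}
    \setmainlanguage{english}
    \setotherlanguage{arabic}
    \newfontfamily\arabicfont[Script=Arabic]{Amiri}

    \begin{document}
    This paragraph is in English.

    \begin{Arabic}
    هذه الفقرة باللغة العربية.
    \end{Arabic}
    \end{document}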