How do I handle special characters like (#) in OpenSearchServer / Lucene?

I am using OpenSearchServer (community edition) v1.2.4-rc3 - stable - rev 1474 - build 802. I crawl a C# and C++ programming website. When I search for C# or C++, the software strips special characters like # and +, so the results it returns are not exact. How do I handle special characters like (#) in OpenSearchServer / Lucene? Can anyone please suggest an approach? Thanks in advance.

You need to change your indexing strategy to use a custom or semi-custom tokenizer that preserves the special characters you need to represent C# and C++ code terms. You would use this tokenizer both during indexing and during searching.
Off-hand, I would look at org.apache.lucene.analysis.standard and org.apache.lucene.wikipedia.analysis for ideas on how to construct the tokenizer (using a tokenizer (lexical analyzer) generator like JFlex may be preferable to hand-coding the tokenizer).
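For illustration, here is a minimal sketch of such an analyzer, written against a recent Lucene analyzers-common (older Lucene versions expose the same idea through a different Analyzer API). Splitting on whitespace only leaves '#' and '+' attached to their terms, so "C#" and "C++" survive as single tokens; the class name CodeTermAnalyzer is just an illustrative choice. In OpenSearchServer itself the equivalent change would presumably go into the schema's analyzer configuration rather than Java code.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

// Sketch: keep '#' and '+' by splitting on whitespace only, then lowercasing.
// The same analyzer must be used at index time and at query time.
public class CodeTermAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        return new TokenStreamComponents(source, new LowerCaseFilter(source));
    }
}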

Related

Noise Removal using Lucene 4.8

I couldn't find any examples of removing stop-words from text using Lucene 4.8. Can you please advise me on how to use the StopFilter and StopAnalyzer classes to achieve this?
Two of the three StandardAnalyzer constructors allow specifying stopwords; just use one of those. This analyzer uses StopFilter underneath, so you don't have to do anything extra.
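For example, a minimal sketch against Lucene 4.8, where the constructors still take a Version argument; the custom stopword list below is just illustrative:

import java.util.Arrays;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class StopWordsDemo {
    public static void main(String[] args) {
        // Custom stopword set; the boolean enables case-insensitive matching.
        CharArraySet stopWords = new CharArraySet(Version.LUCENE_48,
                Arrays.asList("the", "a", "of", "and"), true);
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_48, stopWords);

        // Or reuse Lucene's built-in English stopword set:
        StandardAnalyzer englishStops =
                new StandardAnalyzer(Version.LUCENE_48, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

        // Pass the analyzer to IndexWriterConfig / the query parser as usual.
        analyzer.close();
        englishStops.close();
    }
}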

ANTLR and content assist in Eclipse

I have a project in Eclipse where I have an editor for a custom language. I am using ANTLR to generate the compiler for it. What I need is to add content assist to the editor.
The input is source code in the custom language, plus the position of the character where the user requested content assist. The source code is incomplete most of the time, as the user can ask for content assist at any point. What I need is to calculate the list of tokens that are valid at the given position.
It is possible to write a custom code to do the calculation, but that code would have to be manually kept in sync with the grammar. I figured the parser is doing something similar. It has to be able to determine at a given context what are the acceptable tokens. Is it possible to "reuse" that? What is the best practice in creating content assist anyway?
Thanks,
Balint
Have a look at Xtext. Xtext uses Antlr3 under the hood and provides content assist for the Antlr based languages. Have a look especially into package org.eclipse.xtext.ui.editor.contentassist.
You may consider redefining your grammar with Xtext, which would provide the content assist out of the box. It is not possible to reuse the ANTLR grammar of a custom language.

I need a string that won't properly convert to ANSI using several code pages

My .NET library has to marshal strings to a C library that expects text encoded using the system's default ANSI code page. Since .NET supports Unicode, this makes it possible for users to pass a string to the library that doesn't properly convert to ANSI. For example, on an English machine, "デスクトップ" will turn into "?????" when passed to the C library.
To address this, I wrote a method that detects when this will happen by comparing the original string to a string converted using the ANSI code page. I'd like to test this method, but I really need a string that's guaranteed not to be encodable. For example, we test our code on English and Japanese machines (among other languages). If I write the test to use the Japanese string above, the test will fail when the Japanese system properly encodes the string. I could write the test to check the current system's encoding, but then I have a maintenance nightmare every time we add/remove a language.
Is there a Unicode character that doesn't encode with any ANSI code page? Failing that, could a string be constructed with characters from enough different code pages to guarantee failure? My first attempt was to use Chinese characters, since we don't cover Chinese, but apparently the Japanese code page can encode the Chinese characters I tried.
Edit: I'm going to accept the answer that proposes a Georgian string for now, but I was really expecting a result with a smattering of characters from different languages. I don't know if we plan on supporting Georgian, so it seems OK for now. Now I have to test it on each language. Joy!
There are quite a few Unicode-only languages. Georgian is one of them. Here's the word 'English' in Georgian: ინგლისური
You can find more in the Georgian file (ka.xml) of the CLDR DB.
If by "ANSI" you mean Windows code pages, I am pretty sure the characters out of BMP are not covered by any Windows code pages.
For instance, try some of the Byzantine Musical Symbols.
There are Windows code pages which cover all Unicode characters (e.g. Cp1200, Cp12000, Cp65000 and Cp65001), so it's not always possible to create a string that is not convertible.
What do you mean by an 'ANSI code page'? On Windows, the code pages are Microsoft, not ANSI. ISO defines the 8859-x series of code sets; Microsoft has Windows code pages analogous to most of these.
Are you thinking of single-byte code sets? If so, you should look for Unicode characters in esoteric languages for which there is less likely to be a non-Unicode, single-byte code set.
You could look at languages such as Devanagari, Ol Chiki, Cherokee, or Ogham.
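As a rough illustration of the round-trip check described in the question (sketched here in Java; the same idea maps to Encoding.GetEncoding in .NET): a string survives a given code page only if encoding and then decoding it yields the original text, because unmappable characters are replaced during encoding.

import java.nio.charset.Charset;

public class AnsiRoundTrip {
    // Returns true only if every character of 'text' maps into the given code page.
    public static boolean encodesLosslessly(String text, String codePage) {
        Charset cs = Charset.forName(codePage);            // e.g. "windows-1252"
        String roundTripped = new String(text.getBytes(cs), cs);
        return roundTripped.equals(text);
    }

    public static void main(String[] args) {
        // The Georgian word for 'English' from the answer above.
        System.out.println(encodesLosslessly("ინგლისური", "windows-1252")); // false
        System.out.println(encodesLosslessly("hello", "windows-1252"));     // true
    }
}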

Is there a free library for morphological analysis of the German language?

I'm looking for a library which can perform a morphological analysis on German words, i.e. one that converts any word into its root form and provides meta information about the analysed word.
For example:
gegessen -> essen
wurde [...] gefasst -> fassen
Häuser -> Haus
Hunde -> Hund
My wishlist:
It has to work with both nouns and verbs.
I'm aware that this is a very hard task given the complexity of the German language, so I'm also looking for libraries which provide only approximations or may only be 80% accurate.
I'd prefer libraries which don't work with dictionaries, but again I'm open to compromise given the circumstances.
I'd also prefer C/C++/Delphi Windows libraries, because that would make them easier to integrate but .NET, Java, ... will also do.
It has to be a free library. (L)GPL, MPL, ...
EDIT: I'm aware that there is no way to perform a morphological analysis without any dictionary at all, because of the irregular words.
When I say I prefer a library without a dictionary, I mean those full-blown dictionaries which map each and every word form:
arbeite -> arbeiten
arbeitest -> arbeiten
arbeitet -> arbeiten
arbeitete -> arbeiten
arbeitetest -> arbeiten
arbeiteten -> arbeiten
arbeitetet -> arbeiten
gearbeitet -> arbeiten
arbeite -> arbeiten
...
Those dictionaries have several drawbacks, including the huge size and the inability to process unknown words.
Of course all exceptions can only be handled with a dictionary:
esse -> essen
isst -> essen
eßt -> essen
aß -> essen
aßt -> essen
aßen -> essen
...
(My mind is spinning right now :) )
I think you are looking for a "stemming algorithm".
Martin Porter's approach is well known among linguists. The Porter stemmer is basically an affix stripping algorithm, combined with a few substitution rules for those special cases.
Most stemmers deliver stems that are linguistically "incorrect". For example: both "beautiful" and "beauty" can result in the stem "beauti", which, of course, is not a real word. This doesn't matter, though, if you're using those stems to improve search results in information retrieval systems. Lucene comes with support for the Porter stemmer, for instance.
Porter also devised a simple programming language for developing stemmers, called Snowball.
There are also stemmers for German available in Snowball. A C version, generated from the Snowball source, is also available on the website, along with a plain text explanation of the algorithm.
Here's the German stemmer in Snowball: http://snowball.tartarus.org/algorithms/german/stemmer.html
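Since Lucene was already mentioned, here is a minimal sketch that runs the Snowball German stemmer over a few words (written against a recent Lucene analyzers-common; class locations move between major versions). As noted above, the output is a stem, not necessarily a dictionary form:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class GermanStemDemo {
    public static void main(String[] args) throws Exception {
        // The Snowball stemmer expects lowercased input; a real analysis chain
        // would place a lowercase filter in front of it.
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("häuser hunde gegessen"));
        TokenStream stems = new SnowballFilter(tokenizer, "German");
        CharTermAttribute term = stems.addAttribute(CharTermAttribute.class);
        stems.reset();
        while (stems.incrementToken()) {
            System.out.println(term.toString());   // stems, not dictionary forms
        }
        stems.end();
        stems.close();
    }
}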
If you're looking for the corresponding stem of a word as you would find it in a dictionary, along with information on the part of speech, you should Google for "lemmatization".
(Disclaimer: I'm linking my own Open Source projects here)
This data, in the form of a word list, is available at http://www.danielnaber.de/morphologie/. It could be combined with a word-splitter library (like jwordsplitter) to cover compound nouns not in the list.
Or just use LanguageTool from Java, which has the word list embedded in the form of a compact finite-state machine (and it also includes compound splitting).
You asked this a while ago, but you might still give it a try with morphisto.
Here's an example on how to do it in Ubuntu:
Install the Stuttgart finite-state transducer tools
$ sudo apt-get install sfst
Download the morphisto morphology, e.g. morphisto-02022011.a
Compact it, e.g.
$ fst-compact morphisto-02022011.a morphisto-02022011.ac
Use it! Here are some examples:
$ echo Hochzeit | fst-proc morphisto-02022011.ac
^Hochzeit/hohZeit<+NN>/hohZeit<+NN>/hohZeit<+NN>/hohZeit<+NN>/HochZeit<+NN>/HochZeit<+NN>/HochZeit<+NN>/HochZeit<+NN>/Hochzeit<+NN>/Hochzeit<+NN>/Hochzeit<+NN>/Hochzeit<+NN>$
$ echo gearbeitet | fst-proc morphisto-02022011.ac
^gearbeitet/arbeiten<+ADJ>/arbeiten<+ADJ>/arbeiten<+V>$
Have a look at LemmaGen (http://lemmatise.ijs.si/), a project that aims to provide a standardized open-source multilingual platform for lemmatisation. It does exactly what you want.
I don't think that this can be done without a dictionary.
Rules-based approaches will invariably trip over things like
gegessen -> essen
gegangen -> angen
(note to people who don't speak German: the correct solution in the second case is "gehen").
Have a look at Leo.
They offer the data which you are after; maybe it gives you some ideas.
One can use morphisto with ParZu (https://github.com/rsennrich/parzu). ParZu is a dependency parser for German.
This means that ParZu also disambiguates the output from morphisto.
There are some tools out there which you could use, like the morphology component in the Matetools, Morphisto etc. But the pain is integrating them into your tool chain. A very good wrapper around quite a lot of these linguistic tools is DKPro (https://dkpro.github.io/dkpro-core/), a framework using UIMA. It allows you to write your own preprocessing pipeline using different linguistic tools from different resources, which are all downloaded automatically onto your computer and talk to each other. You can use Java, Groovy or even Jython to drive it. DKPro provides easy access to two morphological analyzers, MateMorphTagger and SfstAnnotator.
You don't want to use a stemmer like Porter; it will reduce the word form in a way which does not make any sense linguistically and does not have the behaviour you describe. If you only want to find the basic form, which for a verb is the infinitive and for a noun the nominative singular, then you should use a lemmatizer. You can find a list of German lemmatizers here. Treetagger is widely used. You can also use a more complex analysis provided by a morphological analyzer like SMORS. It will give you something like this (example from the SMORS website):
And here is the analysis of "unübersetzbarstes" showing prefixation, suffixation and gradation:
un<PREF>übersetzen<V>bar<SUFF><+ADJ><Sup><Neut><Nom><Sg><St>

Which is it Perl or perl, TIF or TIFF, ant or Ant, ClearCase or Clear Case?

In one sentence I have managed to create 16 possible variations on how I present information. Does it matter as long as the context is clear? Do any common mistakes irritate you?
regarding Perl: How should I capitalize Perl?
TIFF stands for Tagged Image File Format, whereas the extension of files using that format is often ".tif".
That is for the purpose of compatibility with 8.3 filenames, I believe.
I generally like the Perl way of capitalizing when used as a proper noun, but lowercasing when referring to the command itself (assuming the command is lowercase to begin with).
Well, Perl and TIFF have already been answered, so I'll add the last two
the Apache Foundation writes "Apache Ant".
Rational ClearCase (or sometimes "IBM Rational ClearCase") is written as such at its web site.
Even though Perl was originally an acronym for Practical Extraction and Report Language, it is written Perl.
These things don't 'bother' me so much as provide insights into the level of knowledge of the speaker/author. You see, we work in an industry that requires precision, so precision in language does matter, as it affects the understanding of the consumer.
The one that really seems to bother me is when people fully upper case JAVA as though it was an acronym.