Hyphens in Lucene

I'm playing around with Lucene and noticed that the use of a hyphen (e.g. "semi-final") will result in two words ("semi" and "final") in the index. How is this supposed to match if a user searches for "semifinal", in one word?
Edit: I'm actually just playing around with the StandardTokenizer class; maybe that is why? Am I missing a filter?
Thanks!
(Edit)
My code looks like this:
StandardAnalyzer sa = new StandardAnalyzer();
TokenStream ts = sa.TokenStream("field", new StringReader("semi-final"));
// The token text lives in the stream's TermAttribute; calling ToString()
// on the stream itself does not return the token.
TermAttribute termAttr = (TermAttribute)ts.GetAttribute(typeof(TermAttribute));
while (ts.IncrementToken())
{
    Console.WriteLine("Token: " + termAttr.Term());
}

This is the explanation for the StandardTokenizer in Lucene:
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
Found here.
This explains why it is splitting your word.
This is probably the hardest thing to correct: human error. If an individual types in "semifinal", this is theoretically not the same as searching "semi-final". So if you were to have numerous words that could be typed in different ways, e.g.:
St-Constant
Saint Constant
Saint-Constant
you're stuck with the task of verifying both "st" and "saint", as well as the hyphenated and non-hyphenated forms. Your tokens would be huge, and each word would need to be compared to see if they matched.
I'm still looking to see if there is a good way of approaching this. Otherwise, if you don't have a lot of words you wish to use, then have all the possibilities stored and tested; or have a loop that splits the word starting at the first letter, moving through each letter and splitting the string in half to form two words, testing the whole way through to see if it matches. But again, who's to say you only have two words? If you are verifying more than two words, then you have the problem of splitting the word into multiple sections.
For example:
saint-jean-sur-richelieu
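Here is a rough Java sketch of the split-and-test loop described above; the vocabulary set is a stand-in for whatever word list or index lookup you actually have, and it only covers the two-word case:

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SplitTester {
    // Try every split point; keep the splits where both halves are known words.
    static List<String[]> candidateSplits(String word, Set<String> vocabulary) {
        List<String[]> results = new ArrayList<String[]>();
        for (int i = 1; i < word.length(); i++) {
            String left = word.substring(0, i);
            String right = word.substring(i);
            if (vocabulary.contains(left) && vocabulary.contains(right)) {
                results.add(new String[] { left, right });
            }
        }
        return results;
    }
}

A compound like the one above would need this applied recursively to handle more than two parts.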
If I come up with anything else, I will let you know.

I would recommend you use the WordDelimiterFilter from Solr (you can use it in just your Lucene application as a TokenFilter added to your analyzer; just go get the Java source file for this filter from Solr and add it to your application).
This filter is designed to handle cases just like this: with the catenateWords option enabled, "semi-final" produces the tokens "semi", "final" and "semifinal", so a one-word search for "semifinal" will match. See:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
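Wiring it into an analyzer might look roughly like this Java sketch; the analyzer class name is made up, and the exact WordDelimiterFilter constructor depends on the Solr version you copy the file from:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.solr.analysis.WordDelimiterFilter;

public class HyphenFriendlyAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        // Flags per the Solr wiki: generateWordParts, generateNumberParts,
        // catenateWords, catenateNumbers, catenateAll. catenateWords=1 emits
        // "semifinal" alongside "semi" and "final", so both forms match.
        ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
        ts = new LowerCaseFilter(ts);
        return ts;
    }
}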

If you're looking for a port of the WordDelimiterFilter, then I advise googling WordDelimiter.cs; I found such a port here:
http://osdir.com/ml/attachments/txt9jqypXvbSE.txt
I then created a very basic WordDelimiterAnalyzer:
public class WordDelimiterAnalyzer : Analyzer
{
    #region Overrides of Analyzer

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        TokenStream result = new WhitespaceTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(true, result, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        // Flags (per the ported constructor): generateWordParts, generateNumberParts,
        // catenateWords, catenateNumbers, catenateAll.
        result = new WordDelimiterFilter(result, 1, 1, 1, 1, 0);
        return result;
    }

    #endregion
}
I said it was basic :)
If anyone has an implementation I would be keen to see it!

You can write your own tokenizer which, for hyphenated words, will produce all possible combinations of tokens, like this:
semifinal
semi
final
You will need to set proper token offsets and position increments to tell Lucene that "semi" and "semifinal" actually start at the same place in the document.
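A minimal sketch of such a filter in Java (the class name HyphenExpansionFilter is made up; it assumes the Lucene 3.x attribute API and an upstream tokenizer, such as WhitespaceTokenizer, that keeps hyphenated words intact; it stacks the parts at the same position and leaves offsets untouched, which a production version would also adjust):

import java.io.IOException;
import java.util.LinkedList;
import java.util.Queue;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class HyphenExpansionFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
    private final Queue<String> pending = new LinkedList<String>();

    public HyphenExpansionFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
            // Emit a queued part stacked at the same position as the
            // concatenated form, so "semi" and "semifinal" align.
            termAtt.setEmpty().append(pending.poll());
            posIncAtt.setPositionIncrement(0);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.toString();
        if (term.indexOf('-') >= 0) {
            for (String part : term.split("-")) {
                pending.add(part);                                // "semi", "final"
            }
            termAtt.setEmpty().append(term.replace("-", ""));     // "semifinal"
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending.clear();
    }
}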

The rule (for the classic analyzer) is written in JFlex:
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM = ({ALPHANUM} {P} {HAS_DIGIT}
| {HAS_DIGIT} {P} {ALPHANUM}
| {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
| {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
| {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
| {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
// punctuation
P = ("_"|"-"|"/"|"."|",")
Since "-" is one of the {P} punctuation characters, a hyphenated token is only kept whole when at least one segment contains a digit (e.g. "WHO-S-09-0003"); "semi-final" has no digits, so it is split.

Related

Seeking the right token filters for my requirements and getting desperate

I'm indexing documents which contain normal text, programming code and other non-linguistic fragments. For reasons which aren't particularly relevant I am trying to tokenise the content into lowercased strings of normal language, and single character symbols.
Thus the input
a few words. Cost*count
should generate the tokens
[a] [few] [words] [.] [cost] [*] [count]
Thus far, extremely straightforward. But I want to handle "compound" words too, because the content can include words like order_date and object-oriented and class.method as well.
I'm following the principle that any of [-], [_] and [.] should be treated as a compound word conjunction rather than a symbol IF they are between two word characters, and should be treated as a separate symbol character if they are adjacent to a space, another symbol character, or the beginning or end of a string. I can handle all of this with a PatternTokenizer, like so:
public static final String tokenRgx = "(([A-Za-z0-9]+[-_.])*[A-Za-z0-9]+)|[^A-Za-z0-9\\s]{1}";

protected TokenStreamComponents createComponents(String fieldName) {
    PatternTokenizer src = new PatternTokenizer(Pattern.compile(tokenRgx), 0);
    TokenStream result = new LowerCaseFilter(src);
    return new TokenStreamComponents(src, result);
}
This successfully distinguishes between full stops at the end of sentences and full stops in compounds, between hyphens introducing negative numbers and hyphenated words, etc. So in the above analyzer, the input:
a few words. class.simple_method_name. dd-mm-yyyy.
produces the tokens
[a] [few] [words] [.] [class.simple_method_name] [.] [dd-mm-yyyy] [.]
We're almost there, but not quite. Finally I want to split the compound terms into their parts RETAINING the trailing hyphen/underscore/stop character in each part. So I think I need to introduce another filter step to my analyzer so that the final set of tokens I end up with is this:
[a] [few] [words] [.] [class.] [simple_] [method_] [name] [.] [dd-] [mm-] [yyyy] [.]
And this is the piece that I am having trouble with. I presume that some kind of PatternCaptureGroupTokenFilter is required here but I haven't been able to find the right set of expressions to get the exact tokens I want emerging from the analyzer.
I know it must be possible, but I seem to have walked into a regular expression wall that blocks me. I need a flash of insight or a hint, if anyone can offer me one.
Thanks,
T
Edit:
Thanks to @rici for pointing me towards the solution.
The string which works (including support for decimal numbers) is:
String tokenRegex = "-?[0-9]+\\.[0-9]+|[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]";
Seems to me like it would be easier to do the whole thing in one scan, using a regex like:
[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]
That uses a zero-width forward assertion in order to only add [-._] to the preceding word if it is immediately followed by a letter or digit. (Because (?=…) is an assertion, it doesn't include the following character in the match.)
To my mind, that won't correctly handle decimal numbers; -3.14159 will be three tokens rather than a single number token. But it depends on your precise needs.
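For illustration, here is a quick standalone check of that pattern with plain java.util.regex, outside any Lucene analyzer (the class name is arbitrary; lowercasing is left to the LowerCaseFilter in the real analyzer):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenRegexDemo {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]");
        Matcher m = p.matcher("a few words. class.simple_method_name. dd-mm-yyyy.");
        while (m.find()) {
            System.out.print("[" + m.group() + "] ");
        }
        // Prints: [a] [few] [words] [.] [class.] [simple_] [method_] [name] [.] [dd-] [mm-] [yyyy] [.]
    }
}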

Strange behavior of Lucene SpanishAnalyzer class with accented words

I'm using the SpanishAnalyzer class in Lucene 3.4. When I parse accented words, I get a strange result. If I parse, for example, these two words: "comunicación" and "comunicacion", the stems I'm getting are "comun" and "comunicacion". If I instead parse "maratón" and "maraton", I'm getting the same stem for both words ("maraton").
So, at least in my opinion, it's very strange that the same word, "comunicación", gives different results depending on whether it is accented or not. If I search the word "comunicacion", I should get the same result regardless of whether it's accented or not.
The code I'm using is the following:
SpanishAnalyzer sa = new SpanishAnalyzer(Version.LUCENE_34);
QueryParser parser = new QueryParser(Version.LUCENE_34, "content", sa);
String str = "comunicación";
String str2 = "comunicacion";
System.out.println("first: " + parser.parse(str));   // stem = comun
System.out.println("second: " + parser.parse(str2)); // stem = comunicacion
The solution I've found to be able to get every single word that shares the stem of "comunicacion", accented or not, is to strip the accents in a first step and then parse it with the Analyzer, but I don't know if it's the right way.
Please, can anyone help me?
Did you check what tokenizer and token filters SpanishAnalyzer uses? There is something called ASCIIFoldingFilter; try placing it before the stem filter. It will remove the accents.
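A minimal sketch of that idea against the Lucene 3.4 API (the class name is made up, and the chain only approximates SpanishAnalyzer's stock one; the stock analyzer also supports a stem-exclusion set, omitted here):

import java.io.Reader;
import org.apache.lucene.analysis.ASCIIFoldingFilter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.es.SpanishAnalyzer;
import org.apache.lucene.analysis.es.SpanishLightStemFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class FoldingSpanishAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new StandardTokenizer(Version.LUCENE_34, reader);
        ts = new StandardFilter(Version.LUCENE_34, ts);
        ts = new LowerCaseFilter(Version.LUCENE_34, ts);
        ts = new StopFilter(Version.LUCENE_34, ts, SpanishAnalyzer.getDefaultStopSet());
        // Fold "comunicación" -> "comunicacion" BEFORE stemming, so both
        // spellings reach the stemmer in the same form.
        ts = new ASCIIFoldingFilter(ts);
        ts = new SpanishLightStemFilter(ts);
        return ts;
    }
}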

Using Hibernate Search (Lucene), I Need to Be Able to Search a Code With or Without Dashes

This is really the same as it would be for a social security #.
If I have a code with this format:
WHO-S-09-0003
I want to be able to do:
query = qb.keyword().onFields("key").matching("WHOS090003").createQuery();
I tried using a WhitespaceAnalyzer.
Using StandardAnalyzer or WhitespaceAnalyzer, you have the same problem: they will index 'WHO-S-09-0003' as is, which means that a search will only work if the search term contains the hyphens too.
One solution to your problem would be to implement your own TokenFilter which detects the format of your codes and removes the hyphens during indexing. You can use @AnalyzerDef to build a chain of token filters and an overall custom analyzer. Of course, you will have to use the same analyzer when searching, but the Hibernate Search query DSL will take care of that.
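A minimal sketch of such a filter (Lucene 3.x attribute API; the class name is made up, and it strips hyphens from every token rather than detecting your exact code format):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class HyphenStrippingFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public HyphenStrippingFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.toString();
        if (term.indexOf('-') >= 0) {
            // "WHO-S-09-0003" -> "WHOS090003"
            termAtt.setEmpty().append(term.replace("-", ""));
        }
        return true;
    }
}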
Actually, you can implement your own method like this one:
private String specialCharacters(String keyword) {
    String[] specialChars = {"-", "!", "?"};
    for (int i = 0; i < specialChars.length; i++) {
        if (keyword.indexOf(specialChars[i]) > -1) {
            keyword = keyword.replace(specialChars[i], "\\" + specialChars[i]);
        }
    }
    return keyword;
}
As you know, Lucene's query syntax has special characters, so if you want to escape them you should insert a backslash before each one (written as a double backslash in Java source)...

Parse a search string (into NHibernate Criterias )

I would like to implement an advanced search for my project.
The search right now uses all the strings the user enters and makes one big disjunction with criteria API.
This works fine, but now I would like to implement more features: AND, OR and brackets ().
I'm having a hard time parsing the string and building criteria from it. I have found this Stack Overflow question, but it didn't really help (he didn't make it clear what he wanted).
I found another article, but this supports much more and spits out sql statements.
Another thing I've heard mentioned a lot is Lucene, but I'm not sure if it would really help me.
I've been searching around a little bit and I've found the Lucene.Net WhitespaceAnalyzer and the QueryParser.
It changes the search A AND B OR C into something like +A +B C, which is a good step in the correct direction (plus it handles brackets).
The next step would be to get the converted string into a set of conjunctions and disjunctions.
The Java example I found was using the query builder which I couldn't find in NHibernate.
Any more ideas ?
Guess you haven't heard about NHibernate Search till now.
NHibernate Search uses Lucene underneath and gives you all the options of using AND, OR and grammar.
All you have to do is attribute your entities for indexing, and NHibernate will index them at a predefined location.
Next time you can search this index with the power that Lucene exposes and then get your domain-level entity objects in return.
using (IFullTextSession s = Search.CreateFullTextSession(sf.OpenSession(new SearchInterceptor())))
{
    QueryParser qp = new QueryParser("id", new StopAnalyzer());
    IQuery NHQuery = s.CreateFullTextQuery(qp.Parse("Summary:series"), typeof(Book));
    IList result = NHQuery.List();
}
Powerful, isn’t it?
What I am basically doing right now is parsing the input string with the Lucene.Net parse API.
This gives me a uniform and simplified syntax. (Pseudocode)
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

void Search(string queryString)
{
    Analyzer analyzer = new WhitespaceAnalyzer();
    QueryParser luceneParser = new QueryParser("name", analyzer);
    Query luceneQuery = luceneParser.Parse(queryString);
    // Lucene normalizes "A AND B OR C" to "+A +B C"; split the normalized
    // form and handle each clause manually.
    string[] words = luceneQuery.ToString().Split(' ');
    foreach (string word in words)
    {
        // Parse the Lucene.Net clause here.
    }
}
After that I am parsing this string manually, creating the disjunctions and conjunctions.

Using MultiFieldQueryParser

I am using MultiFieldQueryParser for parsing strings like a.a., b.b., etc.
But after parsing, it's removing the dots in the string.
What am I missing here?
Thanks.
I'm not sure the MultiFieldQueryParser does what you think it does. Also...I'm not sure I know what you're trying to do.
I do know that with any query parser, strings like 'a.a.' and 'b.b.' will have the periods stripped out because, at least with the default Analyzer, all punctuation is treated as white space.
As far as the MultiFieldQueryParser goes, that's just a QueryParser that allows you to specify multiple default fields to search. So with the query
title:"Of Mice and Men" "John Steinbeck"
The string "John Steinbeck" will be looked for in all of your default fields whereas "Of Mice and Men" will only be searched for in the title field.
What analyzer is your parser using? If it's StopAnalyzer, then the dot could be a stop word and is thus ignored. Same thing if it's StandardAnalyzer, which cleans up input (including removing dots).
(Repeating my answer from the dupe. One of these should be deleted).
The StandardAnalyzer specifically handles acronyms, and converts C.F.A. (for example) to cfa. This means you should be able to do the search, as long as you make sure you use the same analyzer for the indexing and for the query parsing.
I would suggest you run some more basic test cases to eliminate other factors. Try to use an ordinary QueryParser instead of a multi-field one.
Here's some code I wrote to play with the StandardAnalyzer:
StringReader testReader = new StringReader("C.F.A. C.F.A word");
StandardAnalyzer analyzer = new StandardAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("title", testReader);
System.out.println(tokenStream.next());
System.out.println(tokenStream.next());
System.out.println(tokenStream.next());
The output for this, by the way, was:
(cfa,0,6,type=<ACRONYM>)
(c.f.a,7,12,type=<HOST>)
(word,13,17,type=<ALPHANUM>)
Note, for example, that if the acronym doesn't end with a dot then the analyzer assumes it's an internet host name, so searching for "C.F.A" will not match "C.F.A." in the text.