How exact phrase search is performed by a Search Engine? - lucene

I am using Lucene to search in a Data-set, I need to now how "" search (I mean exact phrase search) mechanism has been implemented?
I want to make it able to result all "little cat" hits when the user enters "littlecat". I now that I should manipulate the indexing code, but at least I should now how the "" search works.

I want to make it able to result all "little cat" hits when the user enters "littlecat"
This might sound easy but it is very tough to implement. For a human being little and cat are two different words but for a computer it does not know little and cat seperately from littlecat, unless you have a dictionary and your code check those two words in dictionary. On the other hand searching for "little cat" can easily search for "littlecat" aswell. And i believe that this goes beyong the concept of an exact phrase search. Exact phrase search will only return littlecat if you search for "littlecat" and vice versa. Even google seemingly (expectedly too), doesnt return "little cat" on littlecat search

A way to implement this is Dynamic programming - using a dictionary/corpus to compare your individual words against(and also the left over words after you have parsed the text into strings).
Think of it like you were writing a custom spell-checker or likewise. In this, there's also a scenario when more than one combination of words may be left over eg -"walkingmydoginrain" - here you could break the 1st word as "walk", or as "walking" , and this is the beauty of DP - since you know (from your corpus) that you can't form legitimate words from "ingmydoginrain" (ie rest of the string - you have just discovered that in this context - you should pick the segmented word as "Walking" and NOT walk.
Also think of it like not being able to find a match is adding to a COST function that you define, so you should get optimal results - meaning you can be sure that your text(un-separated with white spaces) will for sure be broken into legitimate words- though there may be MORE than one possible word sequences in that line(and hence, possibly also intent of the person seeking this)
You should be able to find pretty good base implementations over the web for your use case (read also : How does Google implement - "Did you mean" )
For now, see also -
How to split text without spaces into list of words?

Related

Use MeCab to separate Japanese sentences into words not morphemes in vb.net

I am using the following code to split Japanese sentences into its words:
Dim parameter = New MeCabParam()
Dim tagger = MeCabTagger.Create(parameter)
For Each node In tagger.ParseToNodes(sentence)
If node.CharType > 0 Then
Dim features = node.Feature.Split(",")
Console.Write(node.Surface)
Console.WriteLine(" (" & features(7) & ") " & features(1))
End If
Next
An input of それに応じて大きくになります。 outputs morphemes:
それ (それ) 代名詞
に (に) 格助詞
応じ (おうじ) 自立
て (て) 接続助詞
大きく (おおきく) 自立
に (に) 格助詞
なり (なり) 自立
ます (ます) *
。 (。) 句点
Rather than words like so:
それ
に
応じて
大きく
に
なります
。
Is there a way I can use a parameter to get MeCab to output the latter? I am very new to coding so would appreciate it if you explain simply. Thanks.
This is actually pretty hard to do. MeCab, Kuromoji, Sudachi, KyTea, Rakuten-MA—all of these Japanese parsers and the dictionary databases they consume (IPADIC, UniDic, Neologd, etc.) have chosen to parse morphemes, the smallest units of meaning, instead of what you call "words", which as your example shows often contain multiple morphemes.
There are some strategies that usually folks combine to improve on this.
Experiment with different dictionaries. I've noticed that UniDic is sometimes more consistent than IPADIC.
Use a bunsetsu chunker like J.DepP, which consumes the output of MeCab to chunk together morphemes into bunsetsu. Per this paper, "We use the notion of a bunsetsu which roughly corresponds to a minimum phrase in English and consists of a content words (basically nouns or verbs) and the functional words surrounding them." The bunsetsu output by J.DepP often correspond to "words". I personally don't think of, say, a noun + particle phrase as a "word" but you might—these two are usually in a single bunsetsu. (J.DepP is also pretttty fancy, in that it also outputs a dependency tree between bunsetsu, so you can see which one modifies or is secondary to which other one. See my example.)
A last technique that you shouldn't overlook is scanning the dictionary (JMdict) for runs of adjacent morphemes; this helps find idioms or set phrases. It can get complicated because the dictionary may have a deconjugated form of a phrase in your sentence, so you might have to search both the literal sentence form and the deconjugated (lemma) form of MeCab output.
I have an open-source package that combines all of the above called Curtiz: it runs text through MeCab, chunks them into bunsetsu with J.DepP to find groups of morphemes that belong together, identifies vocabulary by looking them up in the dictionary, separates particles and conjugated phrases, etc. It is likely not going to be useful for you, since I use it to support my activities in learning Japanese and making Japanese learning tools but it shows how the above pieces can be combined to get to what you need in Japanese NLP.
Hopefully that's helpful. I'm happy to elaborate more on any of the above topics.

Lucene part:123 vs part 123 vs part123

I'm not too familiar with Lucene so my apologies if this isn't clear or I'm getting my terms/nomenclature mixed up.
So I have a requirement where a field containing text (example part:123) should be able to be found via the following:
part:123
part 123
part123
Now my understanding of StandardAnalyzer is that it will break the word "part:123" into terms "part" and "123".
So with that, I'm able to search with part:123 or part 123, but because they're two different terms "part123" won't work.
It seems to me like I'd also need to get the indexer to add another term where both are combined, so it'd be "part", "123", "part123".
Is this the right way to accomplish this -- and does anyone know how I'd go about implementing this?
Thanks!

IEnumString searching substrings - possible?

I've implemented auto completion to a combobox like this article shows. Is it possible to make it search for substrings instead of just the beginning of the words?
http://www.codeproject.com/Articles/2371/IAutoComplete-and-custom-IEnumString-implementatio
I haven't found any way to customize how IEnumString/IAutoComplete compares the strings. Is it possible?
The built in search options help a bit but it is complete chaos. To find instring matches you need to set flag AcoWordFilter. But this will prevent from numbers being matched!! However, there is a trick to get the numbers to match: preced with a double-quote as in "3 to find a string containing or starting with "3". Some more chaos? In the AcoWordFilter you also need to prefix other characters not considered part of a "word", eg. you need to prefix parentheses with a " but then you will not find parentheses at the first position!
So the solution is either to create your own implementation of IAutoComplete or offer the user to switch between the modes (a bit awkward).
I dont think that the MS engineers are especially proud of such chaos. How about one more option: AcoSearchAnwhere?
After retrieving the Edit control's IAutoComplete interface, query it for an IAutoComplete2 interface. Calling its SetOptions member you can disable prefix filtering by specifying the ACO_NOPREFIXFILTERING AUTOCOMPLETEOPTIONS.
This is available on Windows Vista and later. If you need a solution that works with pre-Vista versions, you'll have to write your own.

How can I perform a fuzzy search for all words provided in a Lucene.net search

I am trying to teach myself Lucene.Net to implement on my site. I understand how to do almost everything I need except for one issue. I am trying to figure out how to allow a fuzzy search for all search terms in a search string.
So for example if I have a document with the string The big red fox, I am trying to get bag fix to match it.
The problem is, it seems like in order to perform fuzzy searches, I have to add ~ to every search term the user enters. I am unsure of the best way to go about this. Right now I am attempting this by
string queryString = "bag rad";
queryString = queryString.Replace("~", string.Empty).Replace(" ", "~ ") + "~";
The first replace is due to Lucene.Net throwing an exception if the search string has a ~ already, apparently it can't handle ~~ in a phrase. This method works, but it seems like it will get messy if I start adding fuzzy weight values.
Is there a better way to default all words to allow for fuzzyness?
You might want to index your documents as bi-grams or tri-grams. Take a look at the CJKAnalyzer to see how they do it. You will want to download the source and look at the source.

Add spaces between words in spaceless string

I'm on OS X, and in objective-c I'm trying to convert
for example,
"Bobateagreenapple"
into
"Bob ate a green apple"
Is there any way to do this efficiently? Would something involving a spell checker work?
EDIT: Just some extra information:
I'm attempting to build something that takes some misformatted text (for example, text copy pasted from old pdfs that end up without spaces, especially from internet archives like JSTOR). Since the misformatted text is probably going to be long... well, I'm just trying to figure out whether this is feasibly possible before I actually attempt to actually write system only to find out it takes 2 hours to fix a paragraph of text.
One possibility, which I will describe this in a non-OS specific manner, is to perform a search through all the possible words that make up the collection of letters.
Basically you chop off the first letter of your letter collection and add it to the current word you are forming. If it makes a word (eg dictionary lookup) then add it to the current sentence. If you manage to use up all the letters in your collection and form words out of all of them, then you have a full sentence. But, you don't have to stop here. Instead, you keep running, and eventually you will produce all possible sentences.
Pseudo-code would look something like this:
FindWords(vector<Sentence> sentences, Sentence s, Word w, Letters l)
{
if (l.empty() and w.empty())
add s to sentences;
return;
if (l.empty())
return;
add first letter from l to w;
if w in dictionary
{
add w to s;
FindWords(sentences, s, empty word, l)
remove w from s
}
FindWords(sentences, s, w, l)
put last letter from w back onto l
}
There are, of course, a number of optimizations you could perform to make it go fast. For instance checking if the word is the stem of any word in the dictionary. But, this is the basic approach that will give you all possible sentences.
Solving this problem is much harder than anything you'll find in a framework. Notice that even in your example, there are other "solutions": "Bob a tea green apple," for one.
A very naive (and not very functional) approach might be to use a spell-checker to try to isolate one "real word" at a time in the string; of course, in this example, that would only work because "Bob" happens to be an English word.
This is not to say that there is no way to accomplish what you want, but the way you phrase this question indicates to me that it might be a lot more complicated than what you're expecting. Maybe someone can give you an acceptable solution, but I bet they'll need to know a lot more about what exactly you're trying to do.
Edit: in response to your edit, it would probably take less effort to run some kind of OCR tool on a PDF and correct its output than it would just to correct what this system might give you, let alone program it
I implemented a solution, the code is avaible on code project:
http://www.codeproject.com/Tips/704003/How-to-add-spaces-between-spaceless-strings
My idea was to prioritize results that use up most of the characters (preferable all of them) then favor the ones with the longest words, because 2,3 or 4 character long words can often come up by chance from leftout characters. Most of the times this provides the correct solution.
To find all possible permutations I used recursion. The code is quite fast even with big dictionaries (tested with 50 000 words).