I use Sphinx to index HTML pages, giving different weights to the title, description, etc. I'm looking for a way to find out where in the page each search word was found, based on the results I get back from Sphinx.
For example, if the word set is "stack overflow program" and 5 documents match, each of them matched because it contains at least one word from the word set.
The question is: how do I know where each word was found in a document? For example, I want to know whether document 1 was returned because it contained "overflow" in the title and "stack" in the description.
I see that each result comes back with a certain weight (3780, for example), but I can't tell from that which word was found where.
Thanks a lot!
You'll have to (somehow) get the results back programmatically, and then you can call BuildExcerpts on the contents of each field. Sphinx will then give you back an HTML snippet with the matched words highlighted, which shows where in the text they were found.
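If all you need is "which word was in which field", a plain check over the field contents you fetched may be enough. The following is a minimal Python sketch under that assumption; `locate_words` and the `doc` dictionary are hypothetical names, not part of the Sphinx API, and the sketch does exact whole-word matching only, so it will not reproduce any stemming or morphology your Sphinx index applies.

```python
import re

def locate_words(query, fields):
    """Map each query word to the names of the fields that contain it.

    `fields` is a hypothetical dict of field name -> text, fetched from
    your own storage for one matching document.
    """
    words = re.findall(r"\w+", query.lower())
    hits = {}
    for word in words:
        # Whole-word, case-insensitive match; no stemming is attempted.
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        hits[word] = [name for name, text in fields.items() if pattern.search(text)]
    return hits

doc = {
    "title": "Buffer overflow basics",
    "description": "How the call stack is corrupted",
}
print(locate_words("stack overflow program", doc))
# {'stack': ['description'], 'overflow': ['title'], 'program': []}
```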
Related
I have made a process that reads PDFs and scrapes their text in UiPath. I am struggling to come up with a regular expression that I can use to search for a PO Number. The text that comes from the scrape is fairly unstructured so my best bet is to search for a set of numbers that starts with a 'PO' with no space. For example, "PO1234567890". I will be setting a variable so the system knows that no PO number was found if the string doesn't come up with anything. Any reference material would be welcome as I am a beginner to VB. Thanks!
I have researched and cannot find a way to do the type of search I would like to do.
I expect to be able to search for "PO1234567890" and not let something like a bare "PO" match. So I need to search for "PO" followed by at least two digits, plus any further digits, with no whitespace in between.
Just try the following:
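' Build the pattern once, then collect every match in the scraped text.
Dim poRegex As New System.Text.RegularExpressions.Regex("PO[0-9]+")
Dim poMatches As System.Text.RegularExpressions.MatchCollection = poRegex.Matches(SearchString)
' poMatches.Count = 0 means no PO number was found.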
The regex string PO[0-9]+ means:
PO followed by at least one digit.
If you want to require more digits, for example at least three, use PO[0-9]{3}[0-9]* (equivalently PO[0-9]{3,}), which means:
PO followed by three digits and then as many further digits as it can match.
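For illustration, here is the same idea as a small Python sketch rather than VB. It requires at least two digits, as the question asks, and returns None when no PO number is present. The `\b` word boundary and the names `PO_PATTERN` and `find_po_number` are my own additions, not part of the pattern above.

```python
import re

# "PO" immediately followed by at least two digits. The \b keeps "PO" from
# matching in the middle of a longer word such as "EXPO12".
PO_PATTERN = re.compile(r"\bPO[0-9]{2,}")

def find_po_number(text):
    """Return the first PO number in text, or None when nothing matches."""
    match = PO_PATTERN.search(text)
    return match.group(0) if match else None

print(find_po_number("Ref PO1234567890 due"))  # PO1234567890
print(find_po_number("PO only"))               # None
```

The None return value plays the role of the "no PO number was found" variable mentioned in the question.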
If you need help using regex matches just ask.
Hope it helps!
I am using Lucene to search a data set, and I need to know how "" search (I mean exact phrase search) is implemented.
I want it to return all "little cat" hits when the user enters "littlecat". I know that I will have to change the indexing code, but first I need to understand how "" search works.
I want to make it able to result all "little cat" hits when the user enters "littlecat"
This might sound easy, but it is very tough to implement. To a human being, little and cat are two different words, but a computer cannot tell little and cat apart from littlecat unless you have a dictionary and your code checks those two words against it. The reverse direction is much easier: a search for "little cat" can be made to match "littlecat" as well. I also believe this goes beyond the concept of an exact phrase search. An exact phrase search will only return littlecat if you search for "littlecat", and vice versa. Even Google seemingly (and as expected) doesn't return "little cat" for a littlecat search.
One way to implement this is dynamic programming: use a dictionary/corpus to check your candidate words against (and also the words left over after you have split the text into strings).
Think of it as writing a custom spell-checker or something similar. There is also the scenario where more than one split is possible, e.g. "walkingmydoginrain": here you could take the first word as "walk" or as "walking". This is the beauty of DP: since you know (from your corpus) that you can't form legitimate words from "ingmydoginrain" (i.e. the rest of the string after "walk"), you have discovered that in this context you should pick "walking" as the segmented word and NOT "walk".
Also think of a failure to find a match as adding to a COST function that you define, so that minimising the cost gives you the optimal result. If a full segmentation exists, your text (not separated by white space) will be broken into legitimate words, though there may be MORE than one possible word sequence for that line (and hence, possibly, more than one reading of what the person was looking for).
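A minimal Python sketch of the dynamic-programming idea, assuming a small hand-written dictionary (`WORDS` is made up for the example). It only checks whether a full segmentation exists and returns the first one it finds; it does not implement the cost function, which you would need in order to rank several valid segmentations.

```python
def segment(text, dictionary):
    """Split text into dictionary words, or return None if impossible.

    best[i] holds one valid segmentation of text[:i] (a list of words),
    or None when no segmentation of that prefix exists.
    """
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            # Extend a prefix that is already segmentable with one more word.
            if best[start] is not None and word in dictionary:
                best[end] = best[start] + [word]
                break
    return best[len(text)]

WORDS = {"walk", "walking", "my", "dog", "in", "rain", "little", "cat"}
print(segment("walkingmydoginrain", WORDS))  # ['walking', 'my', 'dog', 'in', 'rain']
print(segment("littlecat", WORDS))           # ['little', 'cat']
```

Note how "walk" is tried and then abandoned: no later prefix can be built on top of it, so only the "walking" branch reaches the end of the string.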
You should be able to find pretty good base implementations on the web for your use case (read also: How does Google implement "Did you mean"?).
For now, see also:
How to split text without spaces into list of words?
I am trying to teach myself Lucene.Net to implement on my site. I understand how to do almost everything I need except for one issue. I am trying to figure out how to allow a fuzzy search for all search terms in a search string.
So for example if I have a document with the string The big red fox, I am trying to get bag fix to match it.
The problem is, it seems like in order to perform fuzzy searches, I have to add ~ to every search term the user enters. I am unsure of the best way to go about this. Right now I am attempting this by
string queryString = "bag rad";
queryString = queryString.Replace("~", string.Empty).Replace(" ", "~ ") + "~";
The first replace is there because Lucene.Net throws an exception if the search string already contains a ~; apparently it can't handle ~~ in a phrase. This method works, but it seems like it will get messy if I start adding fuzzy weight values.
Is there a better way to make all words fuzzy by default?
You might want to index your documents as bi-grams or tri-grams. Take a look at the CJKAnalyzer to see how it does this; you will want to download the source code and read through it.
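To make the n-gram idea concrete, here is a small Python sketch of character n-grams and a simple overlap score. It is an illustration only, not how CJKAnalyzer or any Lucene tokenizer is implemented, and `char_ngrams` and `ngram_overlap` are hypothetical helper names.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of text, in order."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_overlap(a, b, n=2):
    """Fraction of a's distinct n-grams that also occur in b."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return len(grams_a & grams_b) / len(grams_a) if grams_a else 0.0

print(char_ngrams("big", 2))                  # ['bi', 'ig']
print(ngram_overlap("overflow", "overflaw"))  # 5 of 7 bi-grams shared
print(ngram_overlap("bag", "big"))            # 0.0
```

The last line shows a limitation worth knowing before going this route: three-letter words that differ in the middle letter, like bag and big, share no bi-grams at all, so n-grams help most with longer terms.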
I have indexed a list of phrases such as 'just saw', 'just passed', 'just met'. I have a list of sentences, and I want to extract only those sentences that contain these keywords, e.g.
'I just saw a movie', but not sentences like 'I was in US and met Obama'. I want only those sentences in which the keywords appear consecutively. How can I do that using Lucene?
Proximity Search in Lucene
Lucene supports finding words that are within a specific distance of each other. To do a proximity search, use the tilde symbol, "~", at the end of a phrase. For example, to search for "apache" and "jakarta" within 10 words of each other in a document, use the search:
"jakarta apache"~10
There is also SpanQuery which gives good control over the order of the terms.
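For the "consecutive keywords" case in the question, a quoted phrase without any ~ (for example "just saw") only matches when the terms are adjacent and in order. The Python sketch below mimics that check on plain strings so the behaviour is easy to see; it is an illustration with a hypothetical helper `contains_phrase`, not Lucene's implementation, and it uses a naive word tokenizer in place of a real analyzer.

```python
import re

def contains_phrase(sentence, phrase):
    """True if the words of phrase occur consecutively, in order, in sentence."""
    tokens = re.findall(r"\w+", sentence.lower())
    target = re.findall(r"\w+", phrase.lower())
    if not target:
        return False
    n = len(target)
    # Compare every window of n adjacent tokens against the phrase.
    return any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))

KEYWORDS = ["just saw", "just passed", "just met"]
sentences = ["I just saw a movie", "I was in US and met Obama"]
matches = [s for s in sentences if any(contains_phrase(s, k) for k in KEYWORDS)]
print(matches)  # ['I just saw a movie']
```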
I am searching for strings indexed in Lucene as documents. Now I give it a long string to match.
Example:
Search string: "iamrohitbanga is a stackoverflow user"
Documents:
Document 1: field value: rohit
Document 2: field value: banga
Now I use fuzzy matching to find the search strings in the documents.
Both documents match. I want to retrieve the position at which the string rohit occurs in the search string. How can I do that using the Lucene Java API?
Also note that fuzzy matching will produce inexact matches as well, but I am still interested in the position of the matched word in the search string.
The answer to
Finding the position of search hits from Lucene
refers to a website that requires us to download some files from http://www.iq-computing.de, and this page does not load.
So could you provide a solution?
Probably this should help:
http://lucene.apache.org/java/2_9_1/api/contrib-highlighter/index.html