Lucene Based Searching - lucene

I have a problem with Lucene-based searching. I have designed a document with five fields. Consider the document to be an Address with addressline1, addressline2, city, state and pin. A search has to be performed across all the fields, so I'm using boolean term queries, and the results are retrieved as expected. Now I also have to return not only the results but also the matching field. For example, if the city field matches the search, then I should report that city matched, along with the actual search result. Is there any Lucene API to accommodate this?

AFAIK there's no simple solution to find out which field matched the query.
Your options are:
1) try using the hit highlighter (it knows where the match occurred, but it's noticeably slow on large result sets)
2) fiddle with IndexSearcher's explain method
3) build your custom solution
Hit highlighter experience and workaround findings.
IMHO it shouldn't be hard to implement that yourself, since Lucene at some point surely knows which field yielded a match, but it discards that information as unnecessary weight by the time it composes your response.
I stumbled upon this custom approach.
Try to find more resources on search-lucene.com, the best Lucene/Solr related search engine.
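To illustrate the "build your custom solution" option, here is a minimal sketch that re-checks each hit against every field after the search has run. It is plain Python, not Lucene API: the `tokenize` helper is a stand-in for an analyzer, and the sample address is made up. In Lucene the same idea would mean checking each field separately for the documents you already retrieved.

```python
import re

FIELDS = ("addressline1", "addressline2", "city", "state", "pin")

def tokenize(text):
    """Lowercase and split on non-alphanumerics (a stand-in for an analyzer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matching_fields(document, query):
    """Return the names of the fields that contain at least one query term."""
    terms = tokenize(query)
    return [f for f in FIELDS if terms & tokenize(document.get(f, ""))]

# Hypothetical document, used only for illustration
address = {
    "addressline1": "12 Lake View Road",
    "addressline2": "Near City Mall",
    "city": "Chennai",
    "state": "Tamil Nadu",
    "pin": "600001",
}

print(matching_fields(address, "chennai"))
```

Since this only runs over the hits you are about to display, the extra cost stays proportional to the page size rather than the index size.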

Related

What exactly differs fuzzy search from Full Text Search?

In my project, I have been asked to implement a text query service on the database we are using, PostgreSQL. I have used the PostgreSQL Full Text Search features, which work fairly well in terms of time. One problem with full text search is that it does not have fuzzy search abilities. On the other hand, there is an extension named pg_trgm that provides functions and operators for determining the similarity of alphanumeric text. There are also several examples of text search using pg_trgm, like:
select actor
from products
where actor % 'tomy';
And here is an example of PostgreSQL FTS:
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
So, the main question is, what is the difference between these two search strategies? Which one is more appropriate way for searching texts? Is it possible to mix them? I also need to say that performance is an important concern as well. Thanks in advance!
They do completely different things. About the only thing that is not different between them is that they operate on text and can benefit from the use of indexes. From your question, it seems like you already have a good sense of the differences. The appropriate one is the one that does what you want. If one of them was always appropriate, we probably wouldn't have created the other one.
You can mix them, but you will need a different index for each one; they cannot share an index. You probably need different tables as well, as full text search is more appropriate for sentences or paragraphs, while trigram search is more appropriate for individual words or short phrases.
One way to mix them would be to have one table of full texts, and another table which lists only each distinct word present in any of the full texts. The 2nd table could be used to detect probable typos in the query, and then once those are fixed by suggestions from trigram searching, run the fixed query against the 1st table.
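The two-step idea above can be sketched roughly as follows. This is plain Python that approximates pg_trgm's trigram similarity for single words (lowercasing, two spaces of padding before and one after, shared trigrams divided by the union); it is not the extension itself. The `vocabulary` set stands in for the second table of distinct words, and 0.3 mirrors pg_trgm's default similarity threshold.

```python
def trigrams(word):
    """Trigram set with pg_trgm-style padding: two spaces before, one after."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Shared trigrams divided by the total number of distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def correct_query(query, vocabulary, threshold=0.3):
    """Replace each unknown word with its most similar known word, if any."""
    fixed = []
    for word in query.lower().split():
        if word in vocabulary:
            fixed.append(word)
            continue
        # sorted() keeps the choice deterministic when similarities tie
        best = max(sorted(vocabulary), key=lambda v: similarity(word, v))
        fixed.append(best if similarity(word, best) >= threshold else word)
    return " ".join(fixed)

# Hypothetical word list standing in for the distinct-word table
vocabulary = {"tommy", "hanks", "friend"}
print(correct_query("tomy frend", vocabulary))
```

The corrected string would then be passed to the full text query against the first table.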
The difference is quite large: in fuzzy search you are searching for similar results, while in full-text search you are searching for the exact same words. Which one is more appropriate depends on the use case.
If you don't need fuzziness, don't use it. It adds a large performance overhead, because it has to match the text not only exactly but also against other, similar combinations.

find indexed terms in non-indexed document/string

Sorry if I'm using the wrong terminology here, I'm new to Lucene :D
Assume that I have indexed all titles of the English Wikipedia in Lucene.
Let's say I'm visiting a news website. Within the article I want to convert all phrases (that match a title in the Wikipedia) into a link to the Wikipedia page.
To clarify: I don't want to put the news article into the Lucene index, but rather use the indexed WP titles to find matches within a given string (the article). We also don't want to bother with the JS/HTML stuff, just focus on Lucene for now.
I'd also like to match greedily: i.e. if the text contains "Stack Overflow", I'd like to link to SO, rather than to "Stack" and "Overflow". But if I can get shorter matches as well, that would be neat, too. (I kind of want to do both, but I'll settle for either one if having both is difficult).
Naive solution: I can see that I'd be able to query for single words iteratively and, whenever I get a hit in the index, try to find the current word plus the next word until I miss. Then convert the last match into a link and continue after that, until I'm through the complete document.
But, that seems really awkward and I have a suspicion that Lucene might have some functionality that could support me here (or at least I hope so :D), but I have no clue what I'd be looking for.
Lucene's inverted index should make this a pretty fast operation, so I guess this might perform reasonably well, even with the naive approach.
Any pointers? I'm stuck :3
There is such a thing, it's called the Tagger Handler:
Given a dictionary (a Solr index) with a name-like field, you can post text to this request handler and it will return every occurrence of one of those names with offsets and other document metadata desired. It’s used for named entity recognition (NER).
It seems a bit fiddly to set up, but it's exactly what I wanted :D
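For comparison, the greedy longest-match idea from the question can be sketched without Lucene or Solr at all. This is a minimal in-memory illustration, not how the Tagger Handler works internally; the title list is made up, and a real Wikipedia title set would need a more compact structure than a Python dict.

```python
import re

def build_title_index(titles):
    """Map each title's lowercase token tuple to the original title."""
    index = {tuple(t.lower().split()): t for t in titles}
    max_len = max(len(key) for key in index)
    return index, max_len

def link_titles(text, index, max_len):
    """Greedy longest match; returns (start_token, end_token, title) spans."""
    tokens = re.findall(r"\w+", text.lower())
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shrink it
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in index:
                spans.append((i, i + n, index[key]))
                i += n
                break
        else:
            i += 1  # no title starts at this token
    return spans

# Hypothetical titles, for illustration only
index, max_len = build_title_index(
    ["Stack Overflow", "Stack", "Overflow", "Apache Lucene"]
)
text = "I asked about Apache Lucene on Stack Overflow, not Stack Exchange."
print(link_titles(text, index, max_len))
```

Dropping the `break` and the `i += n` jump would also report the shorter overlapping matches, at the cost of more lookups.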

Azure Search - issues with Phonetic Analyzer

Our clients query our Azure Search index, mostly for people's names. We are using the Lucene analyzer for all of our fields. We build the query string by making the client's input name into a phrase and adding a proximity value of 3. Because we search using a phrase, we cannot use the fuzzy search capability of the Lucene query syntax, as it only works on single words.
We were therefore in search of a solution for being able to bring back results with names that weren't spelled exactly as the client input them. We came across the phonetic analyzer, and have just implemented the Metaphone algorithm into our index. We've run some tests and while it gets us closer to what we need, we still see some issues:
1) The analyzer's scope is so wide that it's bringing back a lot of false positives. For example, when searching on Kenneth Gooden, it brings back Kenneth Cotton. That's just a little too far to be considered phonetically similar, in our opinion. Can the sensitivity be tweaked in any way, or can something be done to boost some other parameter to remedy this?
2) When doing a search on Barry Soper, the first and highest-scored result that comes back is "Barry Spear." The second result, scored lower, is "Soper, Barry Russell." To a certain extent, I can maybe see why it's scored that way (b/c of the 2nd one being last name first) but then... not really. The 2nd result contains both exact terms within the required proximity. Maybe Azure Search gives priority to the order of words in the phrase before applying the analyzer? Still doesn't make sense to me. (Side note - this query also brings back "Barh Super" - see issue #1 above)
I would like to know if someone could offer suggestions to tweak Azure Search's behavior to work more along the lines of what we need, OR perhaps suggest an alternative to the phonetic analyzer. We haven't tried any of the other available phonetic algorithms yet, only b/c it seems Metaphone is the best and most commonly used. But we're open to suggestions regarding the other algorithms as well.
Thanks.
You are correct that the fuzzy operator only works on single terms. In this case, you can use a custom analyzer (with a phonetic token filter) or the Synonyms feature (in preview). I am not sure what you meant by "we have just implemented the Metaphone algorithm into our index", but there are several phonetic token filters you can choose from in the Azure Search custom analysis stack. Synonyms is a newer feature, only available in preview; you can take a look here. For synonyms, you will need to define synonym rules, say 'Nate, Nathan, Nathaniel' for example, and at query time, searching for one automatically includes the results for the others.
Okay, then how should you use these building blocks to control relevance for your search? One way to model this is to use a separate field for each expansion strategy. For example, instead of a single field for the name, you can have three fields, say 'name', 'name_synonym', and 'name_phonetic'. The first field, 'name', is for exact matches; 'name_synonym' has synonyms enabled; and the third uses a phonetic analyzer and broadens the search the most. You can then use a scoring profile to boost scores from matches in each field. You can give a boost value of 10 for exact matches, 5 for synonyms and 1 for phonetic expansions, for example. Your search will be issued against these three internal fields.
Regarding your question as to why 'Soper, Barry Russell' is ranked lower than 'Barry Spear': after phonetic analysis, the words 'soper' and 'spear' reduce to the same form at both indexing and query time and are treated as if they were identical terms. In computing the score and ranking, the search engine uses the analyzed form of the terms, and phonetic similarity has no influence on the score. That’s why secondary factors, like field length, play a more significant role in the relevance score.
Hope this helps. I provided one example to model this, but you could also take a look at term boosting in the full Lucene query syntax.
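To make the effect of the three-field model concrete, here is a toy sketch in plain Python. It is not Azure Search scoring: `phonetic_key` is a deliberately crude stand-in (keep the first letter, drop later vowels), not Metaphone, and the score is just a weighted count of matching query terms per field using the example boosts of 10, 5 and 1.

```python
import re

WEIGHTS = {"name": 10, "name_synonym": 5, "name_phonetic": 1}
SYNONYMS = [{"nate", "nathan", "nathaniel"}]  # example rule from the text

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def phonetic_key(word):
    """Toy phonetic code: first letter plus later non-vowels. NOT Metaphone."""
    return word[0] + "".join(c for c in word[1:] if c not in "aeiou")

def expand_synonyms(word):
    for group in SYNONYMS:
        if word in group:
            return group
    return {word}

def score(query, name):
    """Weighted count of query terms matched in each of the three 'fields'."""
    q, d = tokens(query), set(tokens(name))
    d_syn = set().union(*(expand_synonyms(w) for w in d))
    d_phon = {phonetic_key(w) for w in d}
    exact = sum(w in d for w in q)
    synonym = sum(w in d_syn for w in q)
    phonetic = sum(phonetic_key(w) in d_phon for w in q)
    return (WEIGHTS["name"] * exact
            + WEIGHTS["name_synonym"] * synonym
            + WEIGHTS["name_phonetic"] * phonetic)

print(score("Barry Soper", "Soper, Barry Russell"))
print(score("Barry Soper", "Barry Spear"))
```

With the phonetic field alone, 'soper' and 'spear' both collapse to the same key and the two names tie on matched terms; the boosted exact-match field is what separates them in this sketch.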
Let me know if you have any additional questions.
Nate

Question Answering with Lucene

For a toy project, I want to implement an automated question answering system with Lucene and I'm trying to figure out a reasonable way to implement it. The basic operation is as follows:
1) The user will enter a question.
2) The system will identify the keywords in the question.
3) The keywords will be searched in a large knowledgebase and matching sentences will be shown as answers.
My knowledgebase (i.e., corpus) is not structured. It is just a large, continuous text (say, a user manual without any chapters). I mean that the only structure is that sentences and paragraphs are identified.
I plan to treat each sentence or paragraph as a separate document. To present the answer in context, I may consider keeping one sentence/paragraph before/after the indexed one as payload. I would like to know if that makes sense. Also, I'm wondering if there are other tried and well-known approaches for that kind of system. As an example, another approach that comes to mind is to index large chunks of the corpus as documents with the token positions, then process the vicinity of found keywords to construct my answers.
I would appreciate direct recommendations based on experience or intuition, but also tutorials or introductory materials to question-answering systems with Lucene in mind.
Thanks.
It's not an unreasonable approach to take.
One enhancement you might consider is incorporating learning feedback, so that you can continually improve the scoring of content vs search terms. To do this you would ask users to rate the answers that come back ('helpful vs unhelpful'), that way you can start to rank documents against keywords based on the historical data. You could classify potential documents as helpful/unhelpful for given keywords by using a simple Bayesian classifier.
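As a rough sketch of that feedback idea, the classifier below is a small naive Bayes model with add-one smoothing over query keywords. The design is an assumption on my part: one classifier instance per candidate document, trained on the keywords of past queries and the rating users gave that document. The class and method names are hypothetical.

```python
import math
from collections import Counter, defaultdict

LABELS = ("helpful", "unhelpful")

class FeedbackClassifier:
    """Naive Bayes over query keywords for one candidate document."""

    def __init__(self):
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()

    def record(self, keywords, label):
        """Store one user rating for a query made of the given keywords."""
        self.label_counts[label] += 1
        for word in keywords:
            self.word_counts[label][word] += 1
            self.vocabulary.add(word)

    def log_score(self, keywords, label):
        total = sum(self.label_counts.values())
        # Smoothed log prior over the two labels
        score = math.log((self.label_counts[label] + 1) / (total + len(LABELS)))
        total_words = sum(self.word_counts[label].values())
        for word in keywords:
            count = self.word_counts[label][word]
            # Add-one smoothing, with one extra slot for unseen words
            score += math.log((count + 1) / (total_words + len(self.vocabulary) + 1))
        return score

    def classify(self, keywords):
        return max(LABELS, key=lambda label: self.log_score(keywords, label))

clf = FeedbackClassifier()
clf.record(["reset", "password"], "helpful")
clf.record(["reset", "password"], "helpful")
clf.record(["install", "driver"], "unhelpful")
print(clf.classify(["reset", "password"]))
```

The predicted label (or the difference between the two log scores) could then be used to adjust the document's rank for similar queries.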
Indexing each sentence as a document will give you some problems. You've pointed out one: you would need to store the surrounding text as payloads. That means you'll need to store each sentence three times (before, during and after), and you'll have to manually get into the payload.
If you want to go the route of each sentence being a document, I would recommend coming up with an ID for each sentence and storing that as a separate field. Then you can display [ID-1, ID, ID+1] in each result.
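A minimal sketch of the ID approach, in plain Python rather than Lucene: each sentence gets its position as an ID, and a hit is displayed together with its neighbours [ID-1, ID, ID+1]. The sentence splitter and the sample manual text are simplistic stand-ins.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def search_with_context(sentences, keywords):
    """Return (id, [previous, hit, next]) for sentences containing all keywords."""
    wanted = {k.lower() for k in keywords}
    results = []
    for sid, sentence in enumerate(sentences):
        words = set(re.findall(r"\w+", sentence.lower()))
        if wanted <= words:
            # Slicing clamps at both ends, so the first and last sentences work too
            results.append((sid, sentences[max(sid - 1, 0):sid + 2]))
    return results

# Made-up manual text, for illustration only
manual = ("Open the lid. Press the reset button for five seconds. "
          "The light turns green. Close the lid.")
sentences = split_sentences(manual)
print(search_with_context(sentences, ["reset", "button"]))
```

Each sentence is stored only once this way; the context is assembled at display time from the neighbouring IDs.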
The bigger question though is: how should you break up the text into documents? Identifying semantically related areas seems difficult, so doing it by sentence/paragraph might be the only way to go. A better way would be if you could find which text is the header of a section, and then put everything in that section as a document.
You might also want to use the index (if your corpus has one). The terms there could be boosted, as they are presumably more important.
Instead of Lucene, which does text indexing, search and retrieval, I think something like Apache Mahout would help with this. Mahout treats text as knowledge, which makes answering the question better than plain text matching. Mahout is a machine learning and data mining framework, which fits this domain better. Just a very high-level thought.
--Sai

Relevant Search Results Across Multiple Databases

I have three databases that all have the contents of several web pages in them. What would be the best way to go about searching all three and having the most relevant web page at the top of the search results?
The only way I can think of is break down content by word count and/or creating a complex set of search rules to give one content priority over another. This might be more trouble than what it's worth, but I was wondering if anybody knows a way or product out there that would be able to help me.
To further support Ivan's answer above, Lucene is the way to go. You haven't mentioned what platform you're on, so I'll point out that you can use a .NET port of this too.
If you do use Lucene there is a very good book from Manning on the subject which I recommend you look at.
When it comes to populating your index, you have a couple of choices. For starters you can just dump all of your text into the index and allow the engine to just search on it. However, I'd recommend adding fixed fields to your index which will allow you to support things such as partitioned searches or searches against those fields only.
To explain, let's say you have a field for the website. Then you can partition your index by restricting the index search to those documents that have that website in that field.
The other process is to extract points of interest from your document and allow searches on those without searching the entire index entry. Your mileage may vary with this, as the Lucene engine is very well written, so it may simply allow you to collect your searches into more logical units, which helps you with your solution.
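The partitioning idea can be pictured with a small in-memory sketch. This is not Lucene code; the `site`, `url` and `body` field names and the sample documents are assumptions for illustration. In Lucene the restriction would be expressed as an additional required clause on the website field.

```python
def search(docs, term, site=None):
    """Return URLs of documents containing the term, optionally for one site."""
    term = term.lower()
    return [
        d["url"]
        for d in docs
        if (site is None or d["site"] == site)
        and term in d["body"].lower().split()
    ]

# Hypothetical pages gathered from the three databases
docs = [
    {"site": "a.example", "url": "a.example/1", "body": "Pricing and plans"},
    {"site": "b.example", "url": "b.example/7", "body": "Pricing for partners"},
    {"site": "b.example", "url": "b.example/9", "body": "Contact the team"},
]

print(search(docs, "pricing"))
print(search(docs, "pricing", site="b.example"))
```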
I've done this myself and it helps when answering management questions about what exactly is searched and indexed.
HTH!
If you're using MS SQL Server then the full text search can return a ranking for you. I haven't used it, so you'll need to check the documentation or online for specifics.