How to get the results which has all the strings specified in the search query - lucene

I am a beginner in Lucene. I am writing a search engine to search our code base for certain key words. I have a requirement for which I need your help. Say I am searching for a word "Apple computers", I would like Lucene to throw only the lines which have case insesitive "apple computers". But what I see is I see lines having Apple computers, lines having only apple and lines having only computers. How do I filter it to get only the lines having apple and computer.

How do you query Lucene? Basically what you are asking about is covered by building a query using BooleanClause.Occur.MUST.
Exactly how to do this is dependent on your query construction: For the default query parser you should use something like
+Apple +computers
While if you are building queries programmatically you should use MUST for every term.

As Yuval suggested, it's important to know how do you use Lucene.
If you use it through lucene-java and need exact phrase results (docs that contain only "apple computers" together) you can use PhraseQuery.
The example of how to compose it.

Related

SQL like '%term%' except without letters

I'm searching against a table of news articles. The 2 relevant columns are ArticleTitle and ArticleText. When I want to search an article for a particular term, i started out with
column LIKE '%term%'.
However that gave me a lot of articles with the term inside anchor links, for example <a href="example.com/*term*> which would potentially return an irrelevant article.
So then I switched to
column LIKE '% term %'.
The problem with this query is it didn't find articles who's title or text began/ended with the term. Also it didn't match against things like term- or term's, which I do want.
It seems like the query i want should be able to do something like this
'%[^a-z]term[^a-z]%
This should exclude terms within anchor links, but everything else. I think this query still excludes strings that begin/end with the term. Is there a better solution? Does SQL-Server's FULL TEXT INDEXING solve this problem?
Additionally, would it be a good idea to store ArticleTitle and ArticleText as HTML-free columns? Then i could use '%term%' without getting anchor links. These would be 2 extra columns though, because eventually i will need the original HTML for formatting purposes.
Thanks.
SQL Server's LIKE allows you to define Regex-like patterns like you described.
A better option is to use fulltext search:
WHERE CONTAINS(ArticleTitle, 'term')
exploits the index properly (the LIKE '%term%' query is slow), and provides other benefit in the search algorithm.
Additionally, you might benefit from storing a plaintext version of the article alongside the HTML version, and run your search queries on it.
SQL is not designed to interpret HTML strings. As such, you'd only be able to postpone the problem till a more difficult issue arrives (for example, a comment node that contains your search terms as part of a plain sentence).
You can still utilize FULL TEXT as a prefilter and then run an HTML analysis on the application layer to further filter your result set.

Lucene Proximity Search with multiple words

I am trying to build a query to search an Lucene index of names with name variants.
The index was built with Lucene.NET version 2.9.2
The user enters for example "Margaret White".
Without a name variant option, my query becomes "Margaret White"~1 and it works.
Now, I can look up name variants against both firstname and surname to produce an extended list.
eg. in this case (and I only include some as an example, since the list can be 100 or more sometimes) we can have
Margaret / Margrett White / Whyte
The query "margrett white"~1 OR "margaret white"~1 OR "margrett whyte"~1 OR "margaret whyte"~1
gives me the correct result but given a possible 100 x 100 variant combinations, the query string woudl be cumbersome to say the least.
I have tried various ways to achieve a more compact query but nothing seems to work.
Can anyone give me any pointers or alternative approach. I have control over the index creation process and wonder if there is something I can do at that stage?
Thanks for looking
Roger
Do the synonym filter in your indexing process instead of at query time. Just map "white", "whyte", ... to some single word; say "white". Same for "Margaret."
Then your query will just be "margaret white"~1
I faced a similar problem and solved it by writing my own query parser and manually instantiating query primitives. Writing your own query parser is not necessarily easy, but it gives you a lot of flexibility. In my new query language, I use within/N to specify a proximity query. With it, the following complex queries become possible:
margaret within/3 white
margaret within/3 (white or whyte)
Or even more complex queries
("first name" within/3 margaret) within/10 ("last name" within/3 (white or whyte))

Retrieving per keyword/field match position in Lucene Solr -- possible?

Is there any way to retrieve the match field/position for each keyword for each matching document from solr?
For example, if the document has title "Retrieving per keyword/field match position in Lucene Solr -- possible?" and the query is "solr keyword", I'd like to get, in addition to the doc-id (I normally only want the doc-id, not the full document), something that can tell me the matches are at:
solr:
title: 9
keyword:
title: 3
I'm pretty sure such info is computing during query execution (for phrase queries), but is it possible to return these to the application?
Thanks!
Debugging Relevance Issues in Search suggest using Solr analysis, which you can get to from the admin URL, using something like http://localhost:8983/solr/admin/analysis.jsp?highlight=on .
This highlights matching terms and gives their position.
AFAIK there is no way to do that directly, but you can use hit highlighting to implement it.

All of these words feature

I have a "description" field indexed in Lucene.This field contains a book's description.
How do i achieve "All of these words" functionality on this field using BooleanQuery class?
For example if a user types in "top selling book" then it should return books which have all of these words in its description.
Thanks!
There are two pieces to get this to work:
You need the incoming documents to be analysed properly, so that individual words are tokenised and indexed separately
The user query needs to be tokenised, and the tokens combined with the AND operator.
For #1, there are a number of Analyzers and Tokenizers that come with Lucene - have a look in the org.apache.lucene.analysis package. There are options for many different languages, stemming, stopwords and so on.
For #2, there are again a lot of query parsers that come with Lucene, mainly in the org.apache.lucene.queryParser packagage. MultiFieldQueryParser might be good for you: to require every term to be present, just call
QueryParser.setDefaultOperator(QueryParser.AND_OPERATOR)
Lucene in Action, although a few versions old, is still accurate and extremely useful for more information on analysis and query parsing.
I believe if you add all query parts (one per term) via
BooleanQuery.add(Query, BooleanClause.Occur)
and set that second parameter to the constant BooleanClause.Occur.MUST, then you should get what you want. The equivalent query syntax would be "+term1+term2 +term3 ...".

Best way to implement a stored procedure with full text search

I would like to run a search with MSSQL Full text engine where given the following user input:
"Hollywood square"
I want the results to have both Hollywood and square[s] in them.
I can create a method on the web server (C#, ASP.NET) to dynamically produce a sql statement like this:
SELECT TITLE
FROM MOVIES
WHERE CONTAINS(TITLE,'"hollywood*"')
AND CONTAINS(TITLE, '"square*"')
Easy enough. HOWEVER, I would like this in a stored procedure for added speed benefit and security for adding parameters.
Can I have my cake and eat it too?
I agreed with above, look into AND clauses
SELECT TITLE
FROM MOVIES
WHERE CONTAINS(TITLE,'"hollywood*" AND "square*"')
However you shouldn't have to split the input sentences, you can use variable
SELECT TITLE
FROM MOVIES
WHERE CONTAINS(TITLE,#parameter)
by the way
search for the exact term (contains)
search for any term in the phrase (freetext)
The last time I had to do this (with MSSQL Server 2005) I ended up moving the whole search functionality over to Lucene (the Java version, though Lucene.Net now exists I believe). I had high hopes of the full text search but this specific problem annoyed me so much I gave up.
Have you tried using the AND logical operator in your string? I pass in a raw string to my sproc and stuff 'AND' between the words.
http://msdn.microsoft.com/en-us/library/ms187787.aspx