How to improve search query efficiency in Lucene?

I'm building a search for my application. For the entered search term (foo),
1) I look for an exact match (foo); if it returns nothing,
2) I use a fuzzy search (foo~); if that returns nothing,
3) I use a wildcard (foo*).
Is this an efficient approach? Or is there a Lucene method that does all of this?

There is no built-in way of doing this in Lucene. However, this case is usually handled outside of Lucene, on the client side. And yes, in my experience it is very efficient, since it usually provides high-precision results. Some sources on the internet call it a staged search.
E.g. you create a query for the exact match, say TermQuery("field", "foo"); if this query returns nothing, then you use a FuzzyQuery, and last a PrefixQuery (which I would recommend over WildcardQuery for the last case you want to do, since a trailing wildcard like foo* is exactly a prefix match).
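A minimal sketch of this staged search in Java, assuming an existing IndexSearcher; the field name "field" and the result count of 10 are illustrative:

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class StagedSearch {
    // Try exact, then fuzzy, then prefix; stop at the first stage that has hits.
    static TopDocs stagedSearch(IndexSearcher searcher, String text) throws IOException {
        Term term = new Term("field", text);
        TopDocs hits = searcher.search(new TermQuery(term), 10);   // stage 1: exact match
        if (hits.scoreDocs.length > 0) return hits;
        hits = searcher.search(new FuzzyQuery(term), 10);          // stage 2: fuzzy match
        if (hits.scoreDocs.length > 0) return hits;
        return searcher.search(new PrefixQuery(term), 10);         // stage 3: prefix match
    }
}

Since each later stage only runs when the previous one comes back empty, the common case (an exact hit) stays as cheap as a single TermQuery.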

Related

PostgreSQL Full Text Search with substrings

I'm trying to find the fastest way to search millions (80+ million) of records in PostgreSQL (version 9.4), over multiple columns.
I would like to stick with standard PostgreSQL rather than Solr etc.
I'm currently testing Full Text Search, following https://blog.lateral.io/2015/05/full-text-search-in-milliseconds-with-postgresql/.
It works, but I would like a more flexible way to search.
Currently, if I have a column containing e.g. "Volvo" and one containing "Blue", I am able to find the record with the search string "volvo blue", but I would also like to find the record using "volvo blu", as if I had used LIKE with '%blu%'.
Is that possible with full text search?
The only option for something like this is the pg_trgm contrib module.
This enables you to create a GIN or GiST index that indexes all sequences of three characters, which can be used for a search with the similarity operator %.
Two notes:
Using the % operator may return “false positive” results, so be sure to add a second condition (e.g. with LIKE) that eliminates those.
A trigram search works well with longer search strings, but performs badly with short search strings because of the many false positive results.
If that is not good enough for your purposes, you'll have to resort to a third-party solution.
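For illustration, a hedged sketch of the trigram setup over JDBC; the cars table and color column are hypothetical stand-ins for the "Volvo"/"Blue" columns in the question:

import java.sql.*;

public class TrigramSearch {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "pass")) {
            try (Statement st = conn.createStatement()) {
                // one-time setup: enable pg_trgm and index the column's trigrams
                st.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm");
                st.execute("CREATE INDEX cars_color_trgm_idx "
                        + "ON cars USING gin (color gin_trgm_ops)");
            }
            // % finds similar strings via the index; ILIKE weeds out false positives
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT color FROM cars WHERE color % ? AND color ILIKE ?")) {
                ps.setString(1, "blu");
                ps.setString(2, "%blu%");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) System.out.println(rs.getString("color"));
                }
            }
        }
    }
}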

Cloudant Search: Why are my wildcards not working?

I have a Cloudant database with a search index. In the search index I index the titles of my documents. For instance, search for 'rijkspersoneel':
http://wetten.cloudant.com/regelingen/_design/RegelingInfo/_search/regeling?q=title:rijkspersoneel
Returns 48 rows.
However, when I replace the 'o' with a ? wildcard:
http://wetten.cloudant.com/regelingen/_design/RegelingInfo/_search/regeling?q=title:rijkspers?neel
I get 0 results. Why is that? The Cloudant docs say that this should match 'rijkspersoneel' as well!
My previous answer was definitely mistaken. Internal wildcards do appear to be supported. Try:
title:rijkspe*on*
title:rijksper?on*
Fairly sure what is happening here is an analysis issue: you are likely using a stemming analyzer. I'm not all that familiar with Cloudant and their implementation of this, but in Lucene, wildcard queries are not subject to the same analysis as term queries. I'm guessing that your analysis of this field includes a stemmer, in which case "rijkspersoneel" is actually indexed as "rijkspersone".
So when you search for
rijkspersonee*
or
rijksper?oneel
the "el" is missing from the end of the indexed form, and you find no matches. When you just search for rijkspersoneel, the query does get analyzed, so you search for the stemmed form of the word and will find matches.
Stemming and wildcards just don't get along.
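A small Java sketch of the effect (Cloudant search is Lucene-based; the title field and the EnglishAnalyzer here are illustrative assumptions): the plain term is stemmed by the query parser, while the wildcard term bypasses stemming entirely:

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;

public class StemmingVsWildcard {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("title", new EnglishAnalyzer());
        System.out.println(parser.parse("running"));   // title:run      (analyzed, stemmed)
        System.out.println(parser.parse("running*"));  // title:running* (wildcard, not stemmed)
    }
}

The wildcard query therefore only matches if you write it against the stemmed form that is actually in the index.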

NOT query in Lucene

I need to do NOT queries on my Lucene index. Lucene currently allows NOT only when there are two or more terms in the query.
So I can do something like:
country:canada not sweden
but I can't run a query like:
country:not sweden
Could you please let me know if there is an efficient solution for this problem?
Thanks
A very late reply, but it might be useful for somebody else later:
*:* AND NOT country:sweden
If I'm not mistaken, this does a logical AND between all documents (*:*) and the documents whose country is different from "sweden".
Try the following query in the search box:
NOT message:"warning"
where message is the search field.
Please check the answer to a similar question; the solution is to use MatchAllDocsQuery.
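A minimal sketch of the MatchAllDocsQuery approach in Java, assuming a recent Lucene with BooleanQuery.Builder (the field name and value are taken from the question):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class NotQuery {
    // Programmatic equivalent of "*:* AND NOT country:sweden": match every
    // document, then exclude those whose country field is "sweden".
    static Query notSweden() {
        return new BooleanQuery.Builder()
                .add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("country", "sweden")), BooleanClause.Occur.MUST_NOT)
                .build();
    }
}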
The short answer is that this is not possible using standard Lucene.
Lucene does not allow NOT queries with a single term for the same reason it does not allow leading-wildcard queries: to perform either, the engine would have to look through each document to ascertain whether the document is or is not a hit. It has to look through each document because it cannot use the search term as a key to look up documents in the inverted index (the structure used to store the indexed documents).
To take your case as an example:
To search for not sweden, the simplest (and possibly most efficient) approach would be to search for sweden and then "invert" the result set to return all documents that are not in it. Doing this would require finding all the required (i.e. not in the result set) documents in the index, but without a key to look them up by. This would be done by iterating over the documents in the index, a task the index is not optimised for, and hence speed would suffer.
If you really need this functionality, you could maintain your own list of items when indexing, so that a not sweden search becomes a sweden search using Lucene, followed by an inversion of the results using your set of items.
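For illustration, a hedged sketch of that inversion in Java over the whole index; it ignores deleted documents for brevity and, per the caveat above, is linear in index size:

import java.io.IOException;
import java.util.*;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.*;

public class InvertedResult {
    // Run the positive query (e.g. country:sweden), then return every other doc ID.
    static List<Integer> invert(IndexSearcher searcher, Query positive) throws IOException {
        IndexReader reader = searcher.getIndexReader();
        Set<Integer> matched = new HashSet<>();
        for (ScoreDoc sd : searcher.search(positive, Math.max(1, reader.maxDoc())).scoreDocs)
            matched.add(sd.doc);
        List<Integer> inverted = new ArrayList<>();
        for (int docId = 0; docId < reader.maxDoc(); docId++)
            if (!matched.contains(docId)) inverted.add(docId);   // keep docs NOT in the result set
        return inverted;
    }
}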
OK, I see what you are trying to do.
You can use it as a query refinement, since there are no unary boolean operators in Lucene. Despite the answers above, I believe this is a better and more straightforward approach (note the space before the wildcard):
&query= *&qf=-country:Canada

Lucene: search within search using FuzzyQuery

I need to make a FuzzyQuery using an index that contains around 8 million lines. That kind of query is pretty slow, needing about 20 seconds for every match. The fact is that I can narrow down the results using another field to about 5000 hits before doing the fuzzy search. For this to work, I should be able to make a search by the "narrower" field first, and then use the fuzzy search within those results.
According to the Lucene FAQ, the only thing I have to do is a BooleanQuery where the "narrower" is required (BooleanClause.Occur.MUST in Lucene 3).
Now I have tried two different approaches:
a) Using the Query Parser, with an input like:
narrower:+narrowing_text fuzzy:fuzzy_text~0.9
b) Constructing a BooleanQuery with a TermQuery and a FuzzyQuery
Neither worked; I'm getting about the same times as when the narrower is not used.
Also, to check that the times would indeed be much better if the narrower were working, I reindexed only the 5000 items that match the narrower, and the search went fast as hell.
In case anyone wonders, I'm using pylucene 3.0.2.
Doppleganger, you can probably use a Filter, specifically a QueryWrapperFilter.
Follow the example from Lucene in Action. You may have to make some modifications for use in python, but otherwise it should be simple:
Create the query that narrows this down to 5000 hits.
Use it to build a QueryWrapperFilter.
Use the filter in a search involving the fuzzy query.
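A minimal sketch of those three steps using the classic Lucene 3.x Java API (matching pylucene 3.0.2, whose calls mirror these); the field names and values are the question's own placeholders:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class FilteredFuzzy {
    static TopDocs search(IndexSearcher searcher) throws Exception {
        // 1. the query that narrows the index down to ~5000 candidate documents
        Query narrower = new TermQuery(new Term("narrower", "narrowing_text"));
        // 2. wrap it in a QueryWrapperFilter
        Filter filter = new QueryWrapperFilter(narrower);
        // 3. run the fuzzy query with the filter applied
        Query fuzzy = new FuzzyQuery(new Term("fuzzy", "fuzzy_text"), 0.9f);
        return searcher.search(fuzzy, filter, 10);
    }
}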

SQL with Regular Expressions vs Indexes with Logical Merging Functions

I am trying to develop a complex textual search engine.
I have thousands of textual pages from many books.
I need to search for pages that satisfy complex logical criteria.
These criteria can contain virtually any combination of the following:
A: Full words.
B: Word roots (similar to stems; i.e. all words with certain key letters).
C: Word templates (in some languages, roots are filled into certain templates to form various parts of speech, such as adjectives or past/present verbs).
D: Logical connectives: AND/OR/XOR/NOT/IF/IFF, and parentheses to state priorities.
Now, would it be faster to keep the pages' full text in the database (not indexed) and search through all of them using SQL and regular expressions?
Or would it be better to construct indexes of word/root/template-to-page-location tuples?
That way we can speed up searching for individual words/roots/templates.
However, it gets tricky once we introduce logical connectives into the queries.
I thought of doing the following steps in such cases:
1: Separately search for each individual word/root/template in the specified query.
2: On a priority basis, merge two result lists (from step 1) at a time, depending on the logical connective (see the merge sketch after the example below).
For example, if we are searching for "he AND (is OR was)":
1: We search for "he", "is" and "was" separately and get a result list for each word.
2: Merge the result lists of "is" and "was" using the merging function OR-MERGE.
3: Merge the output of the OR-MERGE with the result list of "he" using the merging function AND-MERGE.
The result of step 3 is then returned as the result of the specified query.
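For illustration, a hedged sketch of the two merge functions in Java, assuming each term's result list is a sorted list of page IDs (this is the classic sorted posting-list merge, linear in the list lengths):

import java.util.*;

public class PostingMerge {
    // OR-MERGE: union of two sorted posting lists, each ID emitted once.
    static List<Integer> orMerge(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            if (j == b.size() || (i < a.size() && a.get(i) < b.get(j))) out.add(a.get(i++));
            else if (i == a.size() || b.get(j) < a.get(i)) out.add(b.get(j++));
            else { out.add(a.get(i)); i++; j++; }   // present in both lists
        }
        return out;
    }

    // AND-MERGE: intersection of two sorted posting lists.
    static List<Integer> andMerge(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a.get(i) < b.get(j)) i++;
            else if (a.get(i) > b.get(j)) j++;
            else { out.add(a.get(i)); i++; j++; }
        }
        return out;
    }
}

For the example query, the final result list would be andMerge(he, orMerge(is, was)).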
What do you think, gurus? Which is faster? Any better ideas?
Thank you all in advance.
There are plenty of off-the-shelf solutions to this kind of problem. I would strongly recommend you use one of those instead of developing your own.
You don't say what database solution you're using. If it's Microsoft SQL Server, you could use its Full Text Search features. If it's MySQL, take a look at its Full-Text Search Functions. I'm sure Oracle, DB2 and any other major DBMS will have similar functionality.
Alternatively, take a look at Apache's Lucene for Java or Lucene for .NET. This will allow you to index documents without needing to use a DBMS.