I want to search for names using a custom phonetic analyzer and the built-in SimpleAnalyzer.
The results should be ordered so that matches from the SimpleAnalyzer come first, followed by matches from the phonetic analyzer.
The only way I have found is to run two queries and sort the results on the RavenDB client.
Is there another way to query and sort on the server side, with some kind of combined analyzer?
I'm trying to find the fastest way to search millions (80+ million) of records in a PostgreSQL (version 9.4) database, over multiple columns.
I would like to try and use standard PostgreSQL, and not Solr etc.
I'm currently testing Full Text Search, following https://blog.lateral.io/2015/05/full-text-search-in-milliseconds-with-postgresql/.
It works, but I would like some more flexible way to search.
Currently, if I have one column containing e.g. "Volvo" and another containing "Blue", I can find the record with the search string "volvo blue", but I would also like to find it using "volvo blu", as if I had used LIKE with '%blu%'.
Is that possible with full text search?
The only option for something like this is the pg_trgm contrib module.
This enables you to create a GIN or GiST index that indexes all sequences of three characters, which can be used for a search with the similarity operator %.
Two notes:
Using the % operator may return “false positive” results, so be sure to add a second condition (e.g. with LIKE) that eliminates those.
A trigram search works well with longer search strings, but performs badly with short search strings because of the many false positive results.
If that is not good enough for your purposes, you'll have to resort to a third-party solution.
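To make the two notes above concrete, here is a minimal plain-Python sketch of how trigram similarity behaves. It is an approximation for illustration, not pg_trgm's actual code: it assumes each word is lowercased and padded with two leading spaces and one trailing space, and it uses a threshold of 0.3, which I believe is the module's default. The function names are hypothetical.

```python
import re


def trigrams(text):
    """Return the set of trigrams of text, roughly the way pg_trgm extracts them."""
    grams = set()
    # Assumption: lowercase everything and treat non-alphanumerics as word separators.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def candidates(rows, needle, threshold=0.3):
    """First stage: everything the similarity test lets through."""
    return [r for r in rows if similarity(r, needle) >= threshold]


def search(rows, needle, threshold=0.3):
    """Second stage: recheck the candidates with a plain substring test (like LIKE)."""
    needle_lc = needle.lower()
    return [r for r in candidates(rows, needle, threshold) if needle_lc in r.lower()]


rows = ["Blue", "Blur", "Black"]
print(candidates(rows, "blue"))  # similarity alone also lets "Blur" through
print(search(rows, "blue"))      # the substring recheck removes it
```

The shorter the search string, the fewer trigrams it has, so a single shared or missing trigram changes the score a lot. That is why the recheck condition matters most for short inputs.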
I am a little confused about when to use a filter, a tokenizer, or a query. I can select an ngram filter or tokenizer during indexing (through an analyzer), and I can also use multi_field to store different variations of the same field for different kinds of query, so I should not have concerns about the flexibility of this approach, as mentioned here: http://jontai.me/blog/2013/02/adding-autocomplete-to-an-elasticsearch-search-application/
When I used an ngram filter during analysis, I got the same results as when I used a fuzzy query (even better results, because of the edgeNGram option, which is not available for fuzzy queries).
So when should I use a fuzzy query (through the fuzziness option, the fuzzy_like_this query, etc.) if using a filter during indexing with a simple match query gets better results and, as I have read, is more scalable?
And when should I use the ngram tokenizer instead of the ngram filter?
I have a system where the search queries multiple fields with different boost values. It is running on Lucene.NET 2.9.4 because it's an Umbraco (6.x) site, and that's the version of Lucene.NET the CMS uses.
My client asked me if I could add stemming, so I wrote a custom analyzer that does Standard / Lowercase / Stop / PorterStemmer. The stemming filter seems to work fine.
But now, when I try to use my new analyzer with the MultiFieldQueryParser, it's not finding anything.
The MultiFieldQueryParser is returning a query containing stemmed words - e.g. if I search for "the figure", what I get as part of the query it returns is:
keywords:figur^4.0 Title:figur^3.0 Collection:figur^2.0
i.e. it's searching the correct fields and applying the correct boosts, but trying to do an exact search on stemmed terms on indexes that contained unstemmed words.
I think what's actually needed is for the MultiFieldQueryParser to return a list of clauses of type PrefixQuery, so that it outputs a query like
keywords:figur*^4.0 Title:figur*^3.0 Collection:figur*^2.0
If I just add a wildcard to the end of the term and feed that into the parser, the stemmer doesn't kick in, i.e. it builds a query that looks for "figure*".
Is there any way to combine MultiFieldQueryParser boosting and prefix queries?
You need to reindex using your custom analyzer. Applying a stemmer only at query time is useless. You might kludge together something using wildcards, but it would remain an ugly, unreliable kludge.
I'm using Examine in Umbraco to query Lucene index of content nodes. I have a field "completeNodeText" that is the concatenation of all the node properties (to keep things simple and not search across multiple fields).
I'm accepting user-submitted search terms. When the search term is multiple words (i.e. "firstterm secondterm"), I want the resulting query to be an OR query: bring me back results where completeNodeText contains firstterm OR secondterm.
I want:
{+completeNodeText:"firstterm ? secondterm"}
but instead, I'm getting:
{+completeNodeText:"firstterm secondterm"}
If I search for "firstterm OR secondterm" instead of "firstterm secondterm", then the generated query is correctly: {+completeNodeText:"firstterm ? secondterm"}
I'm using the following API calls:
var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var searchCriteria = searcher.CreateSearchCriteria();
var query = searchCriteria.Field("completeNodeText", term).Compile();
Is there an easy way to force Examine to generate this "OR" query? Or do I have to bypass the entire Examine fluent query API and construct the raw query manually, by calling the StandardAnalyzer to tokenize the user input and concatenating a query together from the tokens?
I don't think that question mark means what you think it means.
It looks like you are generating a PhraseQuery, but you want two disjoint TermQueries. In Lucene query syntax, a phrase query is enclosed in quotes.
"firstterm secondterm"
A phrase query looks for precisely that phrase, with the two terms appearing consecutively and in order. Placing an OR within a phrase query does not perform any sort of boolean logic, but rather treats it as the word "OR". The question mark is a placeholder used in PhraseQuery.toString() to represent a removed stop word (see #Lucene-1396). You are still performing a PhraseQuery, but now it expects a three-word phrase: firstterm, followed by a removed stop word, followed by secondterm.
To simply search for two separate terms, get rid of the quotes.
firstterm secondterm
This will search for any document with either of those terms (with a higher score given to documents with both).
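If you do end up building the raw query string yourself, the idea can be sketched as follows. This is plain Python for illustration only, not Examine or Lucene.NET code; the helper name build_or_query is hypothetical, the tokenization is a naive whitespace split rather than a real analyzer, and the list of escaped characters is my reading of the Lucene query syntax special characters.

```python
import re

# Characters with special meaning in Lucene query syntax (assumed list).
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')
# Bare operator words would break the query if passed through as terms.
_OPERATORS = {"AND", "OR", "NOT"}


def build_or_query(field, user_input):
    """Turn 'firstterm secondterm' into 'field:(firstterm OR secondterm)'."""
    terms = [
        _SPECIAL.sub(r"\\\1", token)      # backslash-escape special characters
        for token in user_input.split()   # naive whitespace tokenization
        if token not in _OPERATORS
    ]
    if not terms:
        return ""
    return "{}:({})".format(field, " OR ".join(terms))


print(build_or_query("completeNodeText", "firstterm secondterm"))
```

Because no quotes are added around the input, the result is a group of separate term clauses rather than a single phrase.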
I need to make a FuzzyQuery using an index that contains around 8 million lines. That kind of query is pretty slow, needing about 20 seconds for every match. The fact is that I can narrow down the results using another field to about 5000 hits before doing the fuzzy search. For this to work, I should be able to make a search by the "narrower" field first, and then use the fuzzy search within those results.
According to the Lucene FAQ, the only thing I have to do is build a BooleanQuery in which the "narrower" clause is required (BooleanClause.Occur.MUST in Lucene 3).
Now I have tried two different approaches:
a) Using the Query Parser, with an input like:
narrower:+narrowing_text fuzzy:fuzzy_text~0.9
b) Constructing a BooleanQuery with a TermQuery and a FuzzyQuery
Neither worked; I'm getting about the same times as when the narrower is not used.
Also, just to check that the times would be much better if the narrower were working, I reindexed only the 5000 items that match the narrower, and the search went fast as hell.
In case anyone wonders, I'm using PyLucene 3.0.2.
Doppleganger, you can probably use a Filter, specifically a QueryWrapperFilter.
Follow the example from Lucene in Action. You may have to make some modifications for use in Python, but otherwise it should be simple:
Create the query that narrows this down to 5000 hits.
Use it to build a QueryWrapperFilter.
Use the filter in a search involving the fuzzy query.
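The reason this ordering helps can be sketched in plain Python. This is not PyLucene code and does not use QueryWrapperFilter; it only illustrates narrowing with a cheap exact match first and running the expensive fuzzy comparison on the survivors. The field names "narrower" and "fuzzy" are hypothetical, and difflib's ratio stands in for Lucene's own fuzzy similarity, which is computed differently.

```python
from difflib import SequenceMatcher


def filtered_fuzzy_search(docs, narrower_value, fuzzy_text, min_ratio=0.9):
    """Narrow by exact match first, then fuzzy-match only the remaining docs."""
    # Step 1: cheap exact filter (the role the filter plays in the steps above).
    candidates = [d for d in docs if d["narrower"] == narrower_value]

    # Step 2: expensive fuzzy comparison, run only on the narrowed set.
    hits = []
    for doc in candidates:
        ratio = SequenceMatcher(None, doc["fuzzy"], fuzzy_text).ratio()
        if ratio >= min_ratio:
            hits.append((ratio, doc))

    # Best matches first.
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [doc for _, doc in hits]


docs = [
    {"narrower": "a", "fuzzy": "lucene"},
    {"narrower": "b", "fuzzy": "lucene"},
    {"narrower": "a", "fuzzy": "solr"},
]
print(filtered_fuzzy_search(docs, "a", "lucent", min_ratio=0.8))
```

With 8 million documents and a narrower that leaves about 5000, the fuzzy comparison in step 2 runs on a tiny fraction of the data.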