Lucene - Which field contains search term? - lucene

I have developed a search application with Lucene. I have created the basic search. Basically, my app works as follows:
My index has many fields. (Around 40)
User can enter query to multiple fields i.e: +NAME:John +SURNAME:Doe
Queries can contain wildcards such as ? and * i.e: +NAME:J?hn +SURNAME:Do*
Queries can also contain fuzzy i.e: +NAME:Jahn~0.5
Now, I want to find, which field(s) contains my search term(s). As I am using wildcard and fuzzy, I cannot just make string comparison. How can I do it?

If you need it for debugging purposes, you could use IndexSearcher.explain.
Otherwise, this problem looks like highlighting, so you should be able to find out the fields that matched by:
re-analyzing your document,
or using its term vectors.

Related

How to treat the whole sentence as one token in Azure Search

We are facing a problem of performing exact matching and ignore case-sensitivity by using Azure Search.
For example, we have a field called Description and it can be a small sentence or long sentence (For example: Welcome to Azure Search). We are trying to treat the whole sentence as one token such that when user search "Welcome to" it won't return the result back because we have to search "Welcome to Azure Search" to do a exactly matching. Another requirement is that we want the capability to search case-insensitive such that searching "welcome TO Azure SEARCH" will return the result.
I have used Keyword Analyzer to treat the whole field as a single token but this will prevent search case-insensitive from working.
I am also trying to define custom analyzer with keyword_v2 tokenizer and lowercase token filter. Looks like this will solve my problem however there is a 300 maximum token length limitation. In some of the cases, the Description field will be a long sentence more than 300 characters.
I also thought about duplicating an index field to be lowercase and using OData syntax $filter=Description eq 'welcome to azure search'. For example, there will be two fields: "Description" and "DescriptionLowerCase", when do the searching, search on "DescriptionLowerCase" and when returning result, returning "Description". But this will double the size of index storage.
Is there a better way to solve my problem?
you have pretty much covered all the options available options. At the moment there is no workaround the size limitation as without that the search will suffer performance. Now exactly why would you need exact match on the whole string more than 300 characters is beyond me. Have you tried using quotes around your search?

PostgreSQL Full Text Search with substrings

I'm trying to create the fastest way to search millions (80+ mio) of records in a PostgreSQL (version 9.4), over multiple columns.
I would like to try and use standard PostgreSQL, and not Solr etc.
I'm currently testing Full Text Search followed https://blog.lateral.io/2015/05/full-text-search-in-milliseconds-with-postgresql/.
It works, but I would like some more flexible way to search.
Currently, if I have a column containing ex. "Volvo" and one containing "Blue" I am able to find the record with the search string "volvo blue", but I would like to also find the record using "volvo blu" as if I used LIKE and "%blu%'.
Is that possible with full text search?
The only option to something like this is by using the pg_trgm contrib module.
This enables you to create a GIN or GiST index that indexes all sequences of three characters, which can be used for a search with the similarity operator %.
Two notes:
Using the % operator may return “false positive” results, so be sure to add a second condition (e.g. with LIKE) that eliminates those.
A trigram search works well with longer search strings, but performs badly with short search strings because of the many false positive results.
If that is not good enough for your purposes, you'll have to resort to an third-party solution.

Examine lucene.net custom query after analyzer tokenizes

I'm using Examine in Umbraco to query Lucene index of content nodes. I have a field "completeNodeText" that is the concatenation of all the node properties (to keep things simple and not search across multiple fields).
I'm accepting user-submitted search terms. When the search term is multiple words (ie, "firstterm secondterm"), I want the resulting query to be an OR query: Bring me back results where fullNodeText is firstterm OR secondterm.
I want:
{+completeNodeText:"firstterm ? secondterm"}
but instead, I'm getting:
{+completeNodeText:"firstterm secondterm"}
If I search for "firstterm OR secondterm" instead of "firstterm secondterm", then the generated query is correctly: {+completeNodeText:"firstterm ? secondterm"}
I'm using the following API calls:
var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var searchCriteria = searcher.CreateSearchCriteria();
var query = searchCriteria.Field("completeNodeText", term).Compile();
Is there an easy way to force Examine to generate this "OR" query? Or do I have to manually construct the raw query by calling the StandardAnalyzer to tokenize the user input and concatenating together a query by iterating through the tokens? And bypassing the entire Examine fluent query API?
I don't think that question mark means what you think it means.
It looks like you are generating a PhraseQuery, but you want two disjoint TermQueries. In Lucene query syntax, a phrase query is enclosed in quotes.
"firstterm secondterm"
A phrase query is looking for precisely that phrase, with the two terms appearing consecutively, and in order. Placing an OR within a phrase query does not perform any sort of boolean logic, but rather treats it as the word "OR". The question mark is a placeholder using in PhraseQuery.toString() to represent a removed stop word (See #Lucene-1396). You are still performing a phrasequery, but now it is expecting a three word phrase firstterm, followed by a removed stop word, followed by secondterm
To simply search for two separate terms, get rid of the quotes.
firstterm secondterm
Will search for any document with either of those terms (with higher score given to documents with both).

How do i include other fields in a lucene search?

Lets use emails for an example as a document. You have your subject, body, the person who its from and lets say we can also tag them (as gmail does)
From my understanding of QueryParser i give it ONE field and the parser type. If a user enter text the user only searches whatever i set. I notice it will look in the subject or body field if i wrote fieldName: text to search however how do i make a regular query such as "funny SO question unicorn" find result(s) with some of those strings in the subject, the others in the body? ATM because i knew it would be easy i made a field called ALL and combined all the other fields into that but i would like to know how i can do it in a proper way. Especially since my next app is text search dependent
Use MultiFieldQueryParser. You can specify list of fields to be searched using following constructor.
MultiFieldQueryParser(Version matchVersion, String[] fields, Analyzer analyzer)
This will generate a query as if you have created multiple queries on different fields. This partially addresses your problem. This, still, will not match one term matching in field1 and another matching in field2. For this, as you have rightly pointed out, you will need to combine all the fields in one single field and search in that field. Nevertheless, you will find MultiFieldQueryParser useful when query terms do not cross the field boundaries.

In Lucene, using a Standard Analyzer, I want to make fields with spaces searchable

In Lucene, using a Standard Analyzer, I want to make fields with space searchable.
I set Field.Index.NOT_ANALYZED and Field.Store.YES using the StandardAnalyzer
When I look at my index in LUKE, the fields are as I expected, a field and a value such as:
location -> 'New York'.
Here I found that I can use the KeywordAnalyzer to find this value using the query:
location:"New York".
But I want to add another term to the query. Let's say a have a body field which contains the normalized and analyzed terms created by the StandardAnalyzer. Using the KeywordAnalyzer for this field I get different results than when I use the StandardAnalyzer.
How do I combine two Analyzers in one QueryParser, where one Analyzer works for some fields and another one for another fields. I though of creating my own Analyzer which could behave differently depending on the field, but I have no clue how to do it.
PerFieldAnalyzerWrapper lets you apply different analyzers for different fields.