In Lucene, using a Standard Analyzer, I want to make fields with spaces searchable - lucene

In Lucene, using a StandardAnalyzer, I want to make fields with spaces searchable.
I index the field with Field.Index.NOT_ANALYZED and Field.Store.YES, using the StandardAnalyzer.
When I look at my index in LUKE, the fields are as I expected, a field and a value such as:
location -> 'New York'.
Here I found that I can use the KeywordAnalyzer to find this value using the query:
location:"New York".
But I want to add another term to the query. Let's say I have a body field which contains the normalized and analyzed terms created by the StandardAnalyzer. Using the KeywordAnalyzer for this field, I get different results than when I use the StandardAnalyzer.
How do I combine two Analyzers in one QueryParser, where one Analyzer works for some fields and another one for other fields? I thought of creating my own Analyzer which could behave differently depending on the field, but I have no clue how to do it.

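PerFieldAnalyzerWrapper lets you apply different analyzers to different fields. Construct it with a default analyzer (here the StandardAnalyzer), register the KeywordAnalyzer for the location field, and pass the wrapper to the QueryParser. The body field is then analyzed as before, while location:"New York" is handled by the KeywordAnalyzer.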
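The idea of choosing an analyzer by field name, with a default for everything else, can be modeled in a few lines. This is a plain-Java sketch and not Lucene code: the class PerFieldDispatchSketch, its methods, and the two stand-in "analyzers" are hypothetical, and SIMPLE is only a crude approximation of what the StandardAnalyzer does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of per-field analyzer dispatch; not part of Lucene.
final class PerFieldDispatchSketch {
    // Stand-in for a keyword analyzer: the whole value is one term.
    static final Function<String, List<String>> KEYWORD =
            value -> Collections.singletonList(value);

    // Crude stand-in for a standard analyzer: lowercase, split on whitespace.
    static final Function<String, List<String>> SIMPLE =
            value -> Arrays.asList(value.toLowerCase(Locale.ROOT).trim().split("\\s+"));

    private final Function<String, List<String>> fallback;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldDispatchSketch(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    // Registers a dedicated analyzer for one field name.
    PerFieldDispatchSketch use(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
        return this;
    }

    // Picks the analyzer registered for the field, or the default one.
    List<String> analyze(String field, String value) {
        return perField.getOrDefault(field, fallback).apply(value);
    }

    public static void main(String[] args) {
        PerFieldDispatchSketch wrapper =
                new PerFieldDispatchSketch(SIMPLE).use("location", KEYWORD);
        System.out.println(wrapper.analyze("location", "New York"));
        System.out.println(wrapper.analyze("body", "New York"));
    }
}
```

With this setup the location value stays one term while the body value is split and lowercased, which is the behavior the question asks for.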

Related

PostgreSQL Full Text Search with substrings

I'm trying to find the fastest way to search millions (80+ million) of records in PostgreSQL (version 9.4), over multiple columns.
I would like to use standard PostgreSQL, and not Solr etc.
I'm currently testing Full Text Search, following https://blog.lateral.io/2015/05/full-text-search-in-milliseconds-with-postgresql/.
It works, but I would like a more flexible way to search.
Currently, if I have one column containing e.g. "Volvo" and one containing "Blue", I am able to find the record with the search string "volvo blue", but I would also like to find the record using "volvo blu", as if I had used LIKE with '%blu%'.
Is that possible with full text search?
The only way to do something like this is to use the pg_trgm contrib module.
This enables you to create a GIN or GiST index that indexes all sequences of three characters, which can be used for a search with the similarity operator %.
Two notes:
Using the % operator may return “false positive” results, so be sure to add a second condition (e.g. with LIKE) that eliminates those.
A trigram search works well with longer search strings, but performs badly with short search strings because of the many false positive results.
If that is not good enough for your purposes, you'll have to resort to a third-party solution.
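To make the trigram idea concrete, here is a minimal sketch of how a string is cut into three-character sequences and how two trigram sets can be compared. It is plain Java and not PostgreSQL code; the class TrigramSketch is hypothetical. It assumes the padding rule described in the pg_trgm documentation (each word gets two spaces prepended and one appended) and takes similarity as shared trigrams divided by all distinct trigrams, so the numbers only approximate what the database computes.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of trigram extraction and similarity; not pg_trgm itself.
final class TrigramSketch {
    // Lowercases the text, splits it into words, pads each word with two
    // leading spaces and one trailing space, and collects every 3-char window.
    static Set<String> trigrams(String text) {
        Set<String> result = new HashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            String padded = "  " + word + " ";
            for (int i = 0; i + 3 <= padded.length(); i++) {
                result.add(padded.substring(i, i + 3));
            }
        }
        return result;
    }

    // Ratio of shared trigrams to all distinct trigrams of both strings.
    static double similarity(String a, String b) {
        Set<String> ta = trigrams(a);
        Set<String> tb = trigrams(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(ta);
        shared.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) shared.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(trigrams("blu"));
        System.out.println(similarity("blu", "blue"));       // 3 shared of 6 distinct
        System.out.println(similarity("blu", "volvo blue")); // 3 shared of 12 distinct
    }
}
```

The second comparison shows why short search strings are a problem: "blu" contributes only four trigrams, so its score against a longer value drops quickly and many unrelated values end up with similar scores.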

How to use Lucene FastVectorHighlighter on multiple fields?

I've got a basic search working, and I'm highlighting using FastVectorHighlighter. When you ask the highlighter for a "best fragment" you have a few overloads of getBestFragment(s) to choose from, documented here. I'm now using the simplest one, like this:
highlightedText = highlighter.getBestFragment(fieldQuery, searcher.getIndexReader(),
scoreDoc.doc, "description", 100)
So I'm highlighting the match from the "description" field. My query however searches another field, "notes". How do I include that in the highlighting? There is an overload that takes a Set<String> matchedFields and one String storedField, but I don't understand the docs. The doc for the method says:
it is advisable that all matchedFields share the same source as storedField or are at least a prefix of it.
What does that mean? How do I index the "notes" and "description" Strings, and what do I pass for matchedFields and storedField?
That call, I believe, is intended to highlight against multiple indexed forms of the same content. That is, you have one stored full-text content field, but you have indexed it in a number of different ways to expand how you can search it. Perhaps you have one indexed field that uses standard analysis, another with language-specific stemming, another that uses ngrams, and another indexing metaphones.
If you want to highlight two different stored fields, make two calls to getBestFragment, one per field. Or you could use a different highlighter that allows multiple stored fields to be highlighted at the same time, PostingsHighlighter, for instance.

Lucene - Which field contains search term?

I have developed a search application with Lucene. I have created the basic search. Basically, my app works as follows:
My index has many fields (around 40).
Users can query multiple fields, e.g. +NAME:John +SURNAME:Doe
Queries can contain wildcards such as ? and *, e.g. +NAME:J?hn +SURNAME:Do*
Queries can also contain fuzzy terms, e.g. +NAME:Jahn~0.5
Now I want to find out which field(s) contain my search term(s). As I am using wildcard and fuzzy queries, I cannot just do a string comparison. How can I do it?
If you need it for debugging purposes, you could use IndexSearcher.explain.
Otherwise, this problem looks like highlighting, so you should be able to find out the fields that matched by:
re-analyzing your document,
or using its term vectors.
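A rough sketch of the re-analysis route, for wildcard terms only: translate the wildcard term into a regular expression and test it against the tokens of each stored field value. This is plain Java rather than Lucene code. The class MatchedFieldsSketch and its methods are hypothetical, the tokenization (lowercase, split on whitespace) is an assumption standing in for whatever analyzer the index really uses, and fuzzy terms are not covered.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper that reports which stored fields match a wildcard term.
final class MatchedFieldsSketch {
    // Translates a wildcard term (? = one character, * = any run of
    // characters) into a regex; all other characters are matched literally.
    static Pattern wildcardToRegex(String term) {
        StringBuilder regex = new StringBuilder();
        for (char c : term.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Returns the names of the fields with at least one token matching the term.
    static List<String> matchedFields(Map<String, String> storedFields, String term) {
        Pattern pattern = wildcardToRegex(term);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> field : storedFields.entrySet()) {
            // Assumed tokenization: lowercase and split on whitespace.
            for (String token : field.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (pattern.matcher(token).matches()) {
                    hits.add(field.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("NAME", "John");
        doc.put("SURNAME", "Doe");
        System.out.println(matchedFields(doc, "J?hn"));
        System.out.println(matchedFields(doc, "Do*"));
    }
}
```

For results to agree with the search itself, the tokenization here would have to mirror the analyzer used at index time.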

How do I include other fields in a Lucene search?

Let's use emails as an example of a document. You have the subject, the body, the sender, and, let's say, tags (as Gmail does).
From my understanding of QueryParser, I give it ONE field and the parser type, so when a user enters text, only the field I set is searched. I notice it will look in the subject or body field if I write fieldName:text. However, how do I make a regular query such as "funny SO question unicorn" find results with some of those strings in the subject and the others in the body? At the moment, because I knew it would be easy, I made a field called ALL and combined all the other fields into it, but I would like to know how to do it properly, especially since my next app depends on text search.
Use MultiFieldQueryParser. You can specify the list of fields to be searched using the following constructor.
MultiFieldQueryParser(Version matchVersion, String[] fields, Analyzer analyzer)
This generates a query as if each term had been searched in every listed field. For two terms and the fields title and body, the parsed query is (title:term1 body:term1) (title:term2 body:term2). Because each term is expanded separately, a document can match with one term in one field and another term in a different field, so this addresses your problem without a combined field. A single combined field, as you have pointed out, is still the way to go when one phrase has to match text that crosses field boundaries.

Searching for multiple terms in a field

I want to do an AND query, say 'foo AND bar', in Lucene.NET. I have a WholeIndex field which has the whole document indexed, and I want Lucene to search in the whole document.
Up to here it's quite easy, but there's a constraint.
I want both terms 'foo' and 'bar' to be in the same field.
Is there an easy way to do this without querying the index for the full list of fields and searching in every field?
Edit: What I want to know is if there is a way to tell Lucene to perform a search in every field, without having to know all the fields in my index. An automated way to search the following:
"field1:(+foo +bar) field2:(+foo +bar) ... fieldN:(+foo +bar)"
You can use GetFieldNames to get all the field names, then iterate over the list programmatically and generate a query like the one you wrote, using BooleanQuery.
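As a small illustration of that loop, the sketch below builds the query string from the question given a list of field names. It is plain Java and only produces the textual form; the class PerFieldQuerySketch and its method are hypothetical, the field names are assumed to have been fetched from the index beforehand, and the terms are assumed to need no escaping. The answer's suggestion of assembling a BooleanQuery directly would follow the same iteration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper that requires all terms to occur in the same field.
final class PerFieldQuerySketch {
    // Builds "field1:(+t1 +t2) field2:(+t1 +t2) ..." for the given fields.
    static String allTermsInSameField(List<String> fieldNames, List<String> terms) {
        // Every term is mandatory inside a field group.
        StringJoiner required = new StringJoiner(" ");
        for (String term : terms) {
            required.add("+" + term);
        }
        // The groups themselves are optional, so any single field may match.
        StringJoiner query = new StringJoiner(" ");
        for (String field : fieldNames) {
            query.add(field + ":(" + required + ")");
        }
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(allTermsInSameField(
                Arrays.asList("field1", "field2"), Arrays.asList("foo", "bar")));
    }
}
```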