I am using the following code block:
public void MultiField(string fieldValue, string[] fieldList)
{
    List<Occur> occurs = new List<Occur>();
    foreach (string field in fieldList)
    {
        occurs.Add(Occur.SHOULD);
    }
    MultiFieldQueryParser parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, fieldList, analyzer);
    parser.AllowLeadingWildcard = true;
    Query qry = parser.Parse(fieldValue.ToLower());
    booleanQuery.Add(qry, Occur.MUST);
}
where fieldValue is user input and fieldList is a fixed list of fields. I am using a StandardAnalyzer.
I need to be able to search multiple words with wildcards enabled. In its current state, when a user enters a search term (for instance "search"), the logic in my application adds * to either side, making it "*search*". This brings back the expected results.
If, however, a user enters "search s", it searches all fields for "*search" and then all fields again for "s*", returning far more than the desired results. I have tried escaping special characters and whitespace, but this also removes the wildcard search, as "*" is itself a special character. I've tried this using the Escape method and by adding "\"" into the fieldValue string. Is there a way to encapsulate the whole phrase to search and append asterisks to the start and end of the search term?
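One direction worth sketching (untested, and reusing fieldValue, parser, and booleanQuery from the code above) is to split the input on whitespace and wrap each term in asterisks individually, so that every word must match in at least one field:
// Rough sketch only: wrap each whitespace-separated term in asterisks and
// require all of them to match. "parser" and "booleanQuery" are the same
// objects as in the method above; QueryParser.Escape is assumed to be the
// escape method mentioned in the question.
string[] terms = fieldValue.ToLower().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string term in terms)
{
    // Each term becomes "*term*"; the asterisks stay unescaped so the wildcard survives.
    Query termQuery = parser.Parse("*" + QueryParser.Escape(term) + "*");
    booleanQuery.Add(termQuery, Occur.MUST);
}
The leading wildcard still relies on AllowLeadingWildcard being set to true, as in the original method.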
With Lucene.NET I would like to get the term vectors as described in this Stack Overflow question.
The problem is, the index is already generated with the field indexed and stored, but without term vectors.
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true);
type.setStoreTermVectors(false);
Theoretically, it should be possible to re-calculate the term vectors for each document and then store it in the index.
Do you know how this could be possible, without deleting the complete Lucene index?
As mentioned in my comments in the question, you can generate term vector data on-the-fly, which may help you to avoid a complete rebuild of your indexed data.
In my scenario, I want to find the offset positions of my search term in the matched document.
I don't want to oversell this approach - it's absolutely not a substitute for re-indexing - but if your queries are basic, it may help.
Step 1: Perform whatever query you are currently performing.
For each document in the list of hits, you will then need to re-process the relevant field from that document - so, either you already have the field data stored in your existing index, or you will need to retrieve it from its original source.
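For illustration, step 1 might look roughly like the following in Lucene.NET 4.8. This is only a sketch: searcher, query, and the field name "body" are illustrative stand-ins for whatever your application already uses, and Doc(...).Get(...) only returns content for stored fields.
using Lucene.Net.Documents;
using Lucene.Net.Search;
// Sketch only: "searcher" and "query" stand in for your existing IndexSearcher
// and the query you already run; "body" is an illustrative stored field name.
TopDocs hits = searcher.Search(query, 10);
foreach (ScoreDoc scoreDoc in hits.ScoreDocs)
{
    Document doc = searcher.Doc(scoreDoc.Doc);
    string fieldContent = doc.Get("body"); // null if the field was not stored
    // fieldContent is what gets re-analyzed in step 2 below.
}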
Step 2: For each such field, you can re-use the same analyzer to build a token stream on-the-fly. The token stream can be configured with different attributes, such as:
token attributes
offset attributes
and others (see here)
Example:
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;
const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;
String? fieldName = null;
String fieldContent = "Foo Bar Baz Bar Bat";
String searchTerm = "bar";
var analyzer = new StandardAnalyzer(AppLuceneVersion);
var ts = analyzer.GetTokenStream(fieldName, fieldContent);
var charTermAttr = ts.AddAttribute<ICharTermAttribute>();
var offsetAttr = ts.AddAttribute<IOffsetAttribute>();
try
{
    ts.Reset();
    Console.WriteLine("");
    Console.WriteLine("Token: " + searchTerm);
    while (ts.IncrementToken())
    {
        // Compare each analyzed token against the (already lower-cased) search term.
        if (searchTerm.Equals(charTermAttr.ToString()))
        {
            var start = offsetAttr.StartOffset;
            var end = offsetAttr.EndOffset;
            Console.WriteLine(String.Format(" > offset: {0}-{1}", start, end));
        }
    }
    ts.End();
}
finally
{
    // Always release the token stream so the analyzer can be reused.
    ts.Dispose();
}
The above example assumes one of the hits from step 1 was a field containing "Foo Bar Baz Bar Bat" - with a search term of bar.
The output generated is:
Token: bar
> offset: 4-7
> offset: 12-15
So, as you can see, you are not re-executing a query - you are just re-processing a token stream. The more complex the original search term is, the harder it will be to make this approach work the way you probably need it to.
I have two columns in a table, first_name and last_name (PostgreSQL).
On the front end, I have an input that allows users to search for people. It is an auto-complete field that calls a web service for searching people by first and/or last name.
Currently, I have made a query (using my query builder):
$searches = preg_split('/\s+/', $search);
if (!empty($search)) {
    $orX = $query->expr()->orX();
    $i = 0;
    foreach ($searches as $value) {
        $orX->add($query->expr()->eq('c.firstName', ':name'.$i));
        $orX->add($query->expr()->eq('c.lastName', ':name'.$i));
        $query->setParameter('name'.$i, $value);
        $i++;
    }
    $query->andWhere($orX);
}
But this query is not as precise as required: it uses OR for every word, so if I am looking for "Rasmus Lerdorf" it also gives me "Rasmus Adams" and "Adel Lerdorf". It works only if I enter a single word ("Rasmus" for example); in this case it gives me all people with "Rasmus" as first_name or last_name.
I read about MATCH AGAINST, but I am using PostgreSQL. I also heard about the full text search feature in PostgreSQL as the equivalent of MATCH AGAINST, but I wonder whether implementing full text search would be overkill for this objective (especially since the number of words in both columns wouldn't exceed 4).
Any advice would be appreciated. Thanks
You don't need fulltext search.
Just add the different search terms with AND instead of OR:
$i = 0;
foreach ($searches as $value) {
    $orX = $query->expr()->orX();
    $orX->add($query->expr()->eq('c.firstName', ':name'.$i));
    $orX->add($query->expr()->eq('c.lastName', ':name'.$i));
    $query->setParameter('name'.$i, $value);
    $i++;
    $query->andWhere($orX);
}
I would also suggest using LIKE instead of an equality comparison (add '%' to the start and end of the user's search term), and probably also making everything case-insensitive by adding $query->expr()->lower() appropriately.
In order to present highlighted matching words in the documents returned by Lucene queries, I need the Lucene search result to contain the words that caused each document to match my request.
For example :
Lucene query: "dog cat"
Result: ["dogs are nice", "dog and cats are friends"]
How can I achieve this with Lucene? I can't manually handle "cats" vs. "cat", "dogs" vs. "dog", or any other difference between the requested words and the returned words.
Use Lucene's Highlighter. Something like this:
//By default, this formatter will wrap highlights with <b>, but that is configurable.
Formatter formatter = new SimpleHTMLFormatter();
QueryScorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(formatter, scorer);
//You can set a fragmenter as well, by default it will split into fragments 100 chars in size, using SimpleFragmenter.
String highlightedSnippet = highlighter.getBestFragment(myAnalyzer, fieldName, fieldContent);
How do I get a search hit in a scenario like the one below using Lucene search?
Example:
Hi Hello world
In the above example, if I enter "Hello wo", or "Hel", or "Hello", I need to get a hit.
That means if the entered phrase or characters exist in the indexed string, I need to get a hit.
Here is my code to get hits:
QueryParser parser = null;
Query query = null;
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT, new HashSet());
BooleanQuery.setMaxClauseCount(32767);
parser = new QueryParser("fieldname", analyzer);
parser.setAllowLeadingWildcard(true);
query = parser.parse("searchString");
TopDocs topResultDocs = searcher.search(query, null, 20);
The simplest way would be to create a prefix query by adding a wildcard (*) to the end of the search string, like:
query = parser.parse("hel*");
Alternatively, you might also use an ngram filter in your analyzer to split tokens into smaller chunks.
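To illustrate the n-gram idea, here is a minimal sketch in Lucene.NET 4.8 terms (the Java classes carry the same names); the analyzer below is purely illustrative and would normally be used at index time only, with a plain analyzer at query time:
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.NGram;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;
const LuceneVersion version = LuceneVersion.LUCENE_48;
// Illustrative index-time analyzer: each token is lower-cased and expanded into
// all of its prefixes (1 to 20 chars), so "Hello" is indexed as "h", "he", "hel", ...
// A plain term query for "hel" will then hit without any wildcards.
Analyzer ngramAnalyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
{
    var tokenizer = new StandardTokenizer(version, reader);
    TokenStream stream = new LowerCaseFilter(version, tokenizer);
    stream = new EdgeNGramTokenFilter(version, stream, 1, 20);
    return new TokenStreamComponents(tokenizer, stream);
});
With an index built this way, a search for "hel" or for "hello wo" (each word as its own term) hits "Hi Hello world" without any wildcard or leading-wildcard handling.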
We have an Endeca index configured across multiple fields of email content - subject and body. But we only want searches to be performed on the subject lines. Endeca is returning matches within the bodies too. How do you limit the search to the subject?
You can search a specific field or fields by specifying it (or them) with the Ntk parameter (typically paired with Ntt for the search terms).
Or, if you wish to search a specific group of fields frequently, you can set up a search interface (also specified with the Ntk parameter) that includes that group of fields.
This is how you can do it using the presentation API:
final ENEQuery query = new ENEQuery();
final DimValIdList dimValIdList = new DimValIdList("0");
query.setNavDescriptors(dimValIdList);
final ERecSearchList searches = new ERecSearchList();
final StringBuilder builder = new StringBuilder();
for (final String productId : productIds) {
    builder.append(productId);
    builder.append(" ");
}
final ERecSearch eRecSearch = new ERecSearch("product.id", builder.toString().trim(), "mode matchany");
searches.add(eRecSearch);
query.setNavERecSearches(searches);
Please see this post for a complete example.
Use Search Interfaces in Developer Studio.
Refer - http://docs.oracle.com/cd/E28912_01/DeveloperStudio.612/pdf/DevStudioHelp.pdf#page=209