Searching for multiple terms in a field - lucene

I want to do an AND query, say 'foo AND bar', in Lucene.NET. I have a WholeIndex field which has the whole document indexed, and I want Lucene to search in the whole document.
Up to here it's quite easy, but there's a constraint.
I want both terms 'foo' and 'bar' to be in the same field.
Is there an easy way to do this without querying the index for the full list of fields and searching in every field?
Edit: What I want to know is if there is a way to tell Lucene to perform a search in every field, without having to know all the fields in my index. An automated way to search the following:
"field1:(+foo +bar) field2:(+foo +bar) ... fieldN:(+foo +bar)"

You can use GetFieldNames to get all the field names, then iterate over the list programmatically and generate a query like the one you wrote, using BooleanQuery.
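As a rough illustration of that loop, here is a minimal sketch in Python that only builds the query string from the question; it does not call Lucene. The function name `per_field_and_query` and the sample field names are hypothetical, and the field list is assumed to have come from GetFieldNames already.

```python
def per_field_and_query(field_names, terms):
    """Build 'f1:(+a +b) f2:(+a +b)': every term is required within one field."""
    # "+term" marks a term as required inside the field's group.
    required = " ".join(f"+{term}" for term in terms)
    # The per-field groups are joined as optional clauses, so a document
    # matches if at least one field contains all the terms.
    return " ".join(f"{name}:({required})" for name in field_names)


# Hypothetical field names; in practice these would be read from the index.
fields = ["field1", "field2", "field3"]
print(per_field_and_query(fields, ["foo", "bar"]))
```

The sketch does not escape query-syntax characters in the terms, so real input would need escaping before being handed to a query parser. Building the equivalent BooleanQuery objects directly avoids that step.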

Related

SQL Server Efficient Search for LIKE '%str%'

In SQL Server, I have a table containing 46 million rows.
I want to search the "Title" column of this table. The search word may appear at any position within the field value.
For example:
Value in table: BROTHERS COMPANY
Search string: ROTHER
I want this search to match the given record. This is exactly what LIKE '%ROTHER%' does. However, LIKE '%...%' should not be used on large tables because of performance issues. How can I achieve this?
Though I don't know your requirements, your best approach may be to challenge them. Middle-of-the-string searches are usually not very practical. If you can get your users to perform prefix searches (broth%) then you can easily use Full Text's wildcard search (CONTAINS(*, '"broth*"')). Full Text can also handle suffix searches (%rothers) with a little extra work.
But when it comes to middle-of-the-string searches with SQL Server, you're stuck using LIKE. However you may be able to improve performance of LIKE by using a binary collation as explained in this article. (I hate to post a link without including its content but it is way too long of an article to post here and I don't understand the approach enough to sum it up.)
If that doesn't help and if middle-of-the-string searches are that important of a requirement then you should consider using a different search solution like Lucene.
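The answer above does not say what the "little extra work" for suffix searches is. One common reading, and it is only an assumption here, is to store a reversed copy of the value so that a suffix search becomes a prefix search on the reversed text. A minimal Python sketch of that idea, with hypothetical names and sample data:

```python
def reverse_key(title):
    """Value that would be stored in an extra, reversed column (assumption)."""
    return title[::-1]


# Hypothetical sample rows: original title -> reversed copy.
titles = ["BROTHERS COMPANY", "MOTHERS UNION", "BROTHER"]
reversed_column = {title: reverse_key(title) for title in titles}


def ends_with(suffix):
    """Suffix search expressed as a prefix search on the reversed column."""
    prefix = suffix[::-1]
    return [t for t, rev in reversed_column.items() if rev.startswith(prefix)]


print(ends_with("COMPANY"))
print(ends_with("ROTHER"))
```

Note that this only covers suffix matches. The middle-of-the-string case from the question (ROTHER inside BROTHERS COMPANY) is still not found, which is the limitation the answer describes.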
Add Full-Text index if you want.
You can search the table using CONTAINS:
SELECT *
FROM YourTable
WHERE CONTAINS(TableColumnName, 'SearchItem')

Lucene - Which field contains search term?

I have developed a search application with Lucene. I have created the basic search. Basically, my app works as follows:
My index has many fields. (Around 40)
The user can query multiple fields, e.g. +NAME:John +SURNAME:Doe
Queries can contain wildcards such as ? and *, e.g. +NAME:J?hn +SURNAME:Do*
Queries can also contain fuzzy terms, e.g. +NAME:Jahn~0.5
Now I want to find out which field(s) contain my search term(s). As I am using wildcard and fuzzy queries, I cannot just do a string comparison. How can I do it?
If you need it for debugging purposes, you could use IndexSearcher.explain.
Otherwise, this problem looks like highlighting, so you should be able to find out the fields that matched by:
re-analyzing your document,
or using its term vectors.
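To make the "re-analyzing your document" option more concrete, here is a minimal Python sketch of the idea, not of the Lucene API. It assumes the stored field values are available as a dictionary, uses a naive lowercase/whitespace tokenizer as a stand-in for the real analyzer, and covers only the ? and * wildcards (fuzzy matching is left out). All names are hypothetical.

```python
from fnmatch import fnmatchcase


def simple_tokens(text):
    """Stand-in for the real analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def matching_fields(document, pattern):
    """Return the fields that have at least one token matching the wildcard."""
    pattern = pattern.lower()
    return [
        field
        for field, text in document.items()
        if any(fnmatchcase(token, pattern) for token in simple_tokens(text))
    ]


# Hypothetical stored document.
doc = {"NAME": "John", "SURNAME": "Doe", "CITY": "Dover"}
print(matching_fields(doc, "J?hn"))
print(matching_fields(doc, "Do*"))
```

`fnmatchcase` also treats square brackets as special characters, which Lucene's wildcard syntax does not, so patterns containing them would need extra handling.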

In Lucene, using a Standard Analyzer, I want to make fields with spaces searchable

In Lucene, using a Standard Analyzer, I want to make fields with spaces searchable.
I set Field.Index.NOT_ANALYZED and Field.Store.YES, using the StandardAnalyzer.
When I look at my index in LUKE, the fields are as I expected, a field and a value such as:
location -> 'New York'.
Here I found that I can use the KeywordAnalyzer to find this value using the query:
location:"New York".
But I want to add another term to the query. Let's say I have a body field which contains the normalized and analyzed terms created by the StandardAnalyzer. Using the KeywordAnalyzer for this field, I get different results than when I use the StandardAnalyzer.
How do I combine two Analyzers in one QueryParser, so that one Analyzer works for some fields and another one for other fields? I thought of creating my own Analyzer which could behave differently depending on the field, but I have no clue how to do it.
PerFieldAnalyzerWrapper lets you apply different analyzers for different fields.
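The idea behind such a wrapper is a lookup from field name to analyzer, with a default for every other field. The following Python sketch illustrates that dispatch only; it is not the Lucene class, and the two analyzer functions are simplified, hypothetical stand-ins for KeywordAnalyzer and StandardAnalyzer.

```python
def keyword_analyzer(text):
    """Stand-in for KeywordAnalyzer: the whole value is a single token."""
    return [text]


def standard_like_analyzer(text):
    """Very rough stand-in for StandardAnalyzer: lowercase, split on spaces."""
    return text.lower().split()


class PerFieldAnalyzer:
    """Picks an analyzer by field name and falls back to a default."""

    def __init__(self, default, per_field):
        self.default = default
        self.per_field = dict(per_field)

    def analyze(self, field, text):
        return self.per_field.get(field, self.default)(text)


analyzer = PerFieldAnalyzer(standard_like_analyzer, {"location": keyword_analyzer})
print(analyzer.analyze("location", "New York"))
print(analyzer.analyze("body", "New York"))
```

With this arrangement the location field keeps "New York" as one term while body is still split into normalized terms, which is the behaviour the question asks for. The same wrapper would be used both when indexing and when parsing queries.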

Make lucene treat all terms in a field as a single term

In my Lucene documents I have a field "company" where the company name is tokenized.
I need the tokenization for a certain part of my application.
But for this query, I need to be able to create a PrefixQuery over the whole company field.
Example (each company name followed by its tokens):
My Brand: my, brand
brahmin farm: brahmin, farm
A regular query for "bra" would return both documents, because they both have a term starting with bra.
The result I want, though, would contain only the last entry, because its first term starts with bra.
Any suggestions?
Create another indexed field, where the company name is not tokenized. When necessary, search on that field rather than the tokenized company name field.
If you want fast searches, you need to have index entries that point directly at the records of interest. There might be something that you can do with the proximity data to filter records, but it will be slow. I see the problem as: how can a "contains" query over a complete field be performed efficiently?
You might be able to minimize the increase in index size by creating (for each current field) a "first term" field and "remaining terms" field. This would eliminate duplication of the first term in two fields. For "normal" queries, you look for query terms in either of these fields. For "startswith" queries, you search only the "first term" field. But this seems like more trouble than it's worth.
Use a SpanQuery to only search the first term position. A PrefixQuery wrapped by SpanMultiTermQueryWrapper wrapped by SpanPositionRangeQuery:
<SpanPositionRangeQuery: spanPosRange(SpanMultiTermQueryWrapper(company:bra*), 0, 1)>
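The difference between an ordinary prefix query and one restricted to the first term position can be sketched in plain Python. This only models the matching logic on the question's two example names; it does not use Lucene, and the function names and the lowercase/whitespace tokenizer are assumptions.

```python
def tokens(company):
    """Naive tokenizer standing in for the field's analyzer."""
    return company.lower().split()


def any_term_prefix(company, prefix):
    """Ordinary prefix query: any token may start with the prefix."""
    return any(token.startswith(prefix) for token in tokens(company))


def first_term_prefix(company, prefix):
    """Position-restricted prefix query: only the token at position 0 counts."""
    toks = tokens(company)
    return bool(toks) and toks[0].startswith(prefix)


companies = ["My Brand", "brahmin farm"]
print([c for c in companies if any_term_prefix(c, "bra")])
print([c for c in companies if first_term_prefix(c, "bra")])
```

The first list contains both companies and the second only "brahmin farm", which matches the result the question asks for.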

How to sort by Lucene.Net field and ignore common stop words such as 'a' and 'the'?

I've found how to sort query results by a given field in a Lucene.Net index instead of by score; all it takes is a field that is indexed but not tokenized. However, what I haven't been able to figure out is how to sort that field while ignoring stop words such as "a" and "the", so that the following book titles, for example, would sort in ascending order like so:
The Cat in the Hat
Horton Hears a Who
Is such a thing possible, and if yes, how?
I'm using Lucene.Net 2.3.1.2.
I wrap the results returned by Lucene into my own collection of custom objects. Then I can populate it with extra info/context information (and use things like the highlighter class to pull out a snippet of the matches), plus add paging. If you took a similar route you could create a "result" class/object, add something like a SortBy property and grab whatever field you wanted to sort by, strip out any stop words, then save it in this property. Now just sort the collection based on that property instead.
When you create your index, create a field that only contains the words you wish to sort on, then when retrieving, sort on that field but display the full title.
It's been a while since I used Lucene but my guess would be to add an extra field for sorting and storing the value in there with the stop words already stripped. You can probably use the same analyzers to generate this value.
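A minimal sketch of how the value for such an extra sort field could be derived, written in Python rather than Lucene.Net. The stop-word set and the function name `sort_key` are assumptions for illustration; the resulting string would be stored in the index as a single untokenized value, so it can be sorted on.

```python
# Assumed stop-word list; a real one would mirror the analyzer's list.
STOP_WORDS = {"a", "an", "the"}


def sort_key(title):
    """Lowercase the title and drop stop words, keeping one sortable string."""
    words = title.lower().split()
    kept = [word for word in words if word not in STOP_WORDS]
    return " ".join(kept)


titles = ["Horton Hears a Who", "The Cat in the Hat"]
for title in sorted(titles, key=sort_key):
    print(title)
```

The titles are still displayed in full; only the derived key ignores "a" and "the". Punctuation is not stripped in this sketch and would need its own handling.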
There seems to be a catch-22 in that you must tokenize a field with an analyzer in order to strip punctuation and stop words, but you can't sort on tokenized fields. How then to strip the stop words without tokenizing?
For searching, I found this link about searching a Lucene.NET index with a sort option helpful for solving your problem.