Lucene query - "Match exactly one of x, y, z" - lucene

I have a Lucene index whose documents have a "type" field, which can take one of three values: "article", "forum" or "blog". I want the user to be able to search within these types (there is a checkbox for each document type).
How do I create a Lucene query dependent on which types the user has selected?
A couple of prerequisites are:
If the user doesn't select one of the types, I want no results from that type.
The ordering of the results should not be affected by restricting the type field.
For reference if I were to write this in SQL (for a "blog or forum search") I'd write:
SELECT * FROM Docs
WHERE [type] in ('blog', 'forum')

For reference, should anyone else come across this problem, here is my solution:
IList<string> ALL_TYPES = new[] { "article", "blog", "forum" };
string q = ...; // The user's search string
IList<string> includeTypes = ...; // List of types to include
Query searchQuery = parser.Parse(q);
// Declared as BooleanQuery (not Query) so that Add is available
BooleanQuery parentQuery = new BooleanQuery();
parentQuery.Add(searchQuery, BooleanClause.Occur.SHOULD);
// Invert the logic, exclude the other types
foreach (var type in ALL_TYPES.Except(includeTypes))
{
    parentQuery.Add(
        new TermQuery(new Term("type", type)),
        BooleanClause.Occur.MUST_NOT
    );
}
searchQuery = parentQuery;
I inverted the logic (i.e. excluded the types the user had not selected) because otherwise the ordering of the results is lost. I'm not sure why, though! It's a shame, as it makes the code less clear and maintainable, but at least it works.

Add a constraint to reject documents of the types that weren't selected. For example, if only "article" was checked, the constraint would be
-(type:forum type:blog)

While erickson's suggestion seems fine, you could use a positive constraint ANDed with your search term, such as text:foo AND type:article for the case where only "article" was checked,
or text:foo AND (type:article OR type:forum) for the case where both "article" and "forum" were checked.
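A minimal sketch of how that positive constraint could be assembled from the checked boxes, written in plain Java with no Lucene dependency. The class name TypeFilterQuery and the method build are made up for illustration, and the sketch assumes the resulting string is handed to the query parser afterwards. Note that a required type clause may contribute to the relevance score, which could be why the asker saw the ordering change with the positive form.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper, not part of Lucene: builds the query string only.
class TypeFilterQuery {

    /** Builds "(userQuery) AND (type:a OR type:b)" for the selected types. */
    static String build(String userQuery, List<String> selectedTypes) {
        if (selectedTypes.isEmpty()) {
            // Nothing checked means nothing should be returned, so the
            // caller is expected to skip the search entirely.
            throw new IllegalArgumentException("no types selected");
        }
        StringJoiner types = new StringJoiner(" OR ", "(", ")");
        for (String type : selectedTypes) {
            types.add("type:" + type);
        }
        return "(" + userQuery + ") AND " + types;
    }

    public static void main(String[] args) {
        System.out.println(build("text:foo", Arrays.asList("article", "forum")));
    }
}
```

The type values here are fixed strings from the checkboxes, so no escaping is done; user-supplied values would need escaping before being placed in a query string.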

Related

Using lucene to search data differently for different users conditionally

Consider that the entities that I need to perform text search are as following
Sample{
int ID, //Unique ID
string Name,//Searchable field
string Description //Searchable field
}
Now, I have several such entities which are commonly shared by all the users but each user can associate different tags, Notes etc to any of these entities. For simplicity lets say a user can add tags to a Sample entity.
UserSampleData{
int ID, //Sample ID
int UserID, //For condition
string tags //Searchable field
}
When a user performs search, I want to search for the given string in the fields Name, Description and tags associated to that Sample by the current user. I am pretty new to using lucene indexing and I am not able to figure how can I design a index and also the queries for such a situation. I need the results sorted on the relevance with the search query. Following approaches crossed my mind, but I have a feeling there could be better solutions:
Separately query 2 different entities Samples and UserSampleData and somehow mix the 2 results. For the results that intersect, we need to combine the match scores by may be averaging.
Flatten out the data by combining both the entities => multiple entries for same ID.
You could use Lucene's JoinUtil class, but you must rename the "ID" field of the UserSampleData document to SAMPLE_ID (or any other name different from "ID").
Below is an example:
IndexReader r = DirectoryReader.open(dir); // dir is the Directory holding your index
final Version version = Version.LUCENE_47; // Your lucene version
final IndexSearcher searcher = new IndexSearcher(r);
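// "From" side: the UserSampleData documents matched by fromQuery (they carry UserID)
final String fromField = "SAMPLE_ID";
final boolean multipleValuesPerDocument = false;
// "To" side: the Sample documents returned by the join
final String toField = "ID";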
String querystr = "UserID:xxxx AND yourQueryString"; //the userID condition and your query String
Query fromQuery = new QueryParser(version, "NAME", new WhitespaceAnalyzer(version)).parse(querystr);
final Query joinQuery = JoinUtil.createJoinQuery(fromField, multipleValuesPerDocument, toField, fromQuery, searcher, ScoreMode.None);
final TopDocs topDocs = searcher.search(joinQuery, 10);
Check the bug https://issues.apache.org/jira/browse/LUCENE-4824. I don't know whether it has been fixed in the current version of Lucene; if it hasn't, I think you must convert the type of your ID fields to String.
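To make the direction of the join easier to follow, here is a rough plain-Java sketch of what a query-time join does conceptually. It does not use Lucene at all; the class JoinSketch, its nested UserSampleData class and the method joinedSampleIds are invented for illustration, and a simple substring test stands in for the real text query.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Conceptual stand-in for a query-time join; not Lucene code.
class JoinSketch {

    static class UserSampleData {
        final int sampleId; // plays the role of SAMPLE_ID
        final int userId;
        final String tags;

        UserSampleData(int sampleId, int userId, String tags) {
            this.sampleId = sampleId;
            this.userId = userId;
            this.tags = tags;
        }
    }

    /** Returns the IDs of samples that the given user tagged with the term. */
    static List<Integer> joinedSampleIds(List<Integer> sampleIds,
                                         List<UserSampleData> userData,
                                         int userId, String term) {
        // "From" side: collect the join values of the rows matching the query.
        Set<Integer> matched = new HashSet<>();
        for (UserSampleData row : userData) {
            if (row.userId == userId && row.tags.contains(term)) {
                matched.add(row.sampleId);
            }
        }
        // "To" side: keep the samples whose ID is among the collected values.
        List<Integer> result = new ArrayList<>();
        for (int id : sampleIds) {
            if (matched.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }
}
```

Seen this way, the query with the UserID condition runs against the user-specific documents, and only their join values are carried over to the shared Sample documents.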
I think that you need Relational Data. Handling relational data is not simple with Lucene. This is a useful blog post for.

Orchard Search multiple fields with same term

I am trying to create a custom search module based on Orchard.Search. I have created a custom field called keywords, which I have successfully added to the index. I want to match content where the title, body or keywords match. Adding these using .WithField, or passing a string array of fields, requires every field to match the term, whereas I need content to be returned if there is a match in any of the fields. I have included examples of how I am using both methods below.
Examples of how I am using the search builder:
var searchBuilder = Search()
.WithField("type", "Cell").Mandatory().ExactMatch()
.WithField("body", query)
.WithField("title", query)
.WithField("cell-keywords", query);
String Array FieldNames:
string[] searchFields = new[] { "body", "title", "cell-keywords" };
var searchBuilder = Search().WithField("type", "Cell").Mandatory().ExactMatch().Parse(searchFields, query, false);
If anyone could point me in the right direction that would be fantastic :)
A colleague wrote an article on this on his blog, should prove helpful http://breakoutdeveloper.com/orchard-cms/creating-an-advanced-search
I have resolved my issue!
The problem was in how I was adding my keywords field to the index in the part handler. Some content items had a NULL value, which was causing an error that I had missed!

RavenDB - using "Analyzers" in "Dynamic Query"

This is contrary to another post I asked about how to NOT use dynamic queries, but in preparation for the thought that I may need to, I am attempting to learn more.
I have situations where sometimes I want a field to be analyzed, and sometimes I do not. I am solving this right now by having two separate indexes.
/// <summary>
/// Query for an entity by its identity or name, using full indexing to search for specific parts of either.
/// </summary>
public class [ENTITY]__ByName : AbstractIndexCreationTask<[ENTITY]> {
public [ENTITY]__ByName() {
Map = results => from result in results
select new {
Id = result.Id,
Name = result.Name
};
Index(n => n.Name, FieldIndexing.Analyzed);
}
}
/// <summary>
/// Query for an entity by its full name, or full identity without any analyzed results, forcing
/// all matches to be absolutely identical.
/// </summary>
public class [ENTITY]__ByFullName : AbstractIndexCreationTask<[ENTITY]> {
public [ENTITY]__ByFullName() {
Map = results => from result in results
select new {
Id = result.Id,
Name = result.Name
};
}
}
However, I am being told that I should be using "dynamic indexes" (which to me defeats the purpose of having indexes, but this comment came from a senior developer that I greatly respect, so I am entertaining it)
So, I need to figure out how to pass my preference in analyzer to a dynamic query. Right now my query looks something along the lines of ...
RavenQueryStatistics statistics;
var query = RavenSession
.Query<[ENTITY], [INDEX]>()
.Customize(c => c.WaitForNonStaleResultsAsOfNow())
.Statistics(out statistics)
.Search(r => r.Name, [name variable])
.Skip((request.page - 1) * request.PageSize)
.Take(request.PageSize)
.ToList();
var totalResults = statistics.TotalResults;
Alright, so since I am informed having so many indexes isn't what I should do, I need to go to dynamic queries? So it would be more like this ... ?
RavenQueryStatistics statistics;
var query = RavenSession
.Query<[ENTITY]>()
.Customize(c => c.WaitForNonStaleResultsAsOfNow())
.Statistics(out statistics)
.Search(r => r.Name, [name variable])
.Skip((request.page - 1) * request.PageSize)
.Take(request.PageSize)
.ToList();
var totalResults = statistics.TotalResults;
But the problem is that sometimes I want an Analyzer - and sometimes I don't. For example...
On a grid of 6000 results, if the user does a "search" for One, I want it to find everything that has One anywhere in the name. The analyzer allows for this.
On a validator that is designed to ensure that the user does not add a new entity with the exact same name as another, I do not want such flexibility. If the user is typing in Item Number as the name, and Item Number 2 exists, I do not want it to match because one or two of the words match. I want it to match if they type in exactly the same word.
So, is there a way to incorporate this into dynamic queries? or am I smart to just keep using different queries?
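To illustrate the two behaviours described above, here is a small plain-Java sketch. It is not RavenDB or Lucene code: the class AnalyzedVsExact and its methods are invented for the example, and lower-casing plus splitting on whitespace is only a crude stand-in for what a real analyzer does.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: contrasts token-based matching with whole-value matching.
class AnalyzedVsExact {

    /** Crude analyzer stand-in: lower-case the text and split on whitespace. */
    static Set<String> tokens(String text) {
        String normalized = text.toLowerCase(Locale.ROOT).trim();
        return new HashSet<>(Arrays.asList(normalized.split("\\s+")));
    }

    /** Analyzed-style match: any token of the query occurs in the name. */
    static boolean analyzedMatch(String name, String query) {
        Set<String> nameTokens = tokens(name);
        for (String token : tokens(query)) {
            if (nameTokens.contains(token)) {
                return true;
            }
        }
        return false;
    }

    /** Exact-style match: the whole value is compared as one term. */
    static boolean exactMatch(String name, String query) {
        return name.equals(query);
    }
}
```

With these definitions, analyzedMatch("Item Number 2", "Item Number") is true because the tokens overlap, while exactMatch("Item Number 2", "Item Number") is false, which is the behaviour wanted for the duplicate-name validator.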

Sitecore: Exclude items during lucene search

How can I use the Advanced Database Crawler (ADC) during a Lucene search to exclude unwanted items? (I have a few million items.)
The unwanted items change from time to time, so I cannot exclude them through the config file.
From what I understand you want to be able to manually set some of the items as excluded from appearing in search results.
The simplest solution would be to add some Exclude boolean flag to the base template and check for this flag while searching for the items.
The other solution is to create some settings page with multilist field for items excluded in the search and then pass ids of the selected items to the search query excluding them from the search.
Below is a pretty extensive overview of what you'll need to do to get this going. What it does is it prevents items that have a checkbox field checked in sitecore from ever even getting indexed. Sorry it's not easier!
Requirements: Advanced Database Crawler: http://marketplace.sitecore.net/en/Modules/Search_Contrib.aspx
1) Add a checkbox field to the base template in sitecore, titled "Exclude from Search" or whatever.
2) Create your custom index crawler that will index the new field.
namespace YourNamespace
{
class MyIndexCrawler : Sitecore.SharedSource.SearchCrawler.Crawlers.AdvancedDatabaseCrawler
{
protected override void AddSpecialFields(Lucene.Net.Documents.Document document, Sitecore.Data.Items.Item item)
{
base.AddSpecialFields(document, item);
// Index "1" when the checkbox is checked, "0" otherwise
document.Add(CreateValueField("exclude from search",
string.IsNullOrEmpty(item["Exclude From Search"])
? "0"
: "1"));
}
}
}
3) Configure Lucene to use a new custom index crawler (Web.config if you're not using includes)
<configuration>
<indexes hint="list:AddIndex">
...
<locations hint="list:AddCrawler">
<master type="YourNamespace.MyIndexCrawler,YourNamespace">
<Database>web</Database>
<Root>/sitecore/content</Root>
<IndexAllFields>true</IndexAllFields>
4) Configure your search query
var excludeQuery = new BooleanQuery();
Query exclude = new TermQuery(new Term("exclude from search", "0"));
excludeQuery.Add(exclude, BooleanClause.Occur.MUST);
5) Get your search hits
var db = Sitecore.Context.Database;
var index = SearchManager.GetIndex("name_of_your_index"); // I use db.Name.ToLower() for my master/web indexes
var context = index.CreateSearchContext();
var searchContext = new SearchContext(db.GetItem(rootItem));
var hits = context.Search(excludeQuery, searchContext);
Note: You can obviously use a combined query here to get more flexibility on your searches!

Best design approach to query documents for 'labels'

I am storing documents - and each document has a collection of 'labels' - like this. Labels are user defined, and could be any plain text.
{
"FeedOwner": "4ca44f7d-b3e0-4831-b0c7-59fd9e5bd30d",
"MessageBody": "blablabla",
"Labels": [
{
"IsUser": false,
"Text": "Mine"
},
{
"IsUser": false,
"Text": "Incomplete"
}
],
"CreationDate": "2012-04-30T15:35:20.8588704"
}
I need to allow the user to query for any combination of labels, i.e.
"Mine" OR "Incomplete"
"Incomplete" only
or
"Mine" AND NOT "Incomplete"
This results in Raven queries like this:
Query: (FeedOwner:25eb541c\-b04a\-4f08\-b468\-65714f259ac2) AND (Labels,Text:Mine) AND (Labels,Text:Incomplete)
I realise that Raven will generate a 'dynamic index' for queries it has not seen before. I can see with this, this could result in a lot of indexes.
What would be the best approach to achieving this functionality with Raven?
[EDIT]
This is my Linq, but I get an error from Raven "All is not supported"
var result = from candidateAnnouncement in session.Query<FeedAnnouncement>()
where listOfRequiredLabels.All(
requiredLabel => candidateAnnouncement.Labels.Any(
candidateLabel => candidateLabel.Text == requiredLabel))
select candidateAnnouncement;
[EDIT]
I had a similar question, and the answer for that resolved both questions: Raven query returns 0 results for collection contains
Please note that if FeedOwner is a unique property of your documents, the query doesn't make much sense. In that case, you should do the filtering on the client using standard LINQ to Objects.
Now, given that FeedOwner is not something unique, your query is basically correct. However, depending on what you actually want to return, you may need to create a static index instead:
If you're using the dynamically generated indexes, then you will always get the documents as the return value and you can't get the particular labels which matched the query. If this is OK for you, then just go with that approach and let the query optimizer do its job (build the index upfront only if you have a really large number of documents).
In the other case, where you want to use the actual labels as the query result, you have to build a simple map index upfront which covers the fields you want to query on; in your sample this would be FeedOwner and the Text of every label. You will have to use FieldStorage.Yes on the fields you want to return from a query, so enable that on the Text property of your labels. However, there's no need to do so with the FeedOwner property, because it is part of the actual document, which Raven will give you as part of any query results. Please refer to Raven's documentation to see how you can build a static index and use field storage.