How can I get Examine to perform a phrase query in Umbraco 7? - lucene

I am trying to build some custom search logic in Umbraco 7 (7.3.6) which will search for multiple terms supplied by a user, where those terms may include phrases enclosed in quotes.
I have the following code, which takes the supplied term, uses a regex to split it into individual terms (whilst keeping those enclosed in quotes intact), then uses a series of GroupedOr calls to search against multiple fields:
var searcher = Examine.ExamineManager.Instance.SearchProviderCollection[this.searchConfig.SiteSearchProviderName];
var searchCriteria = searcher.CreateSearchCriteria(Examine.SearchCriteria.BooleanOperation.Or);
var splitTerms = Regex.Matches(term, @"[\""].+?[\""]|[^ ]+")
.Cast<Match>()
.Select(m => m.Value)
.ToArray();
var query = searchCriteria.GroupedOr(
new[] { BaseContent.FIELD_NodeName },
this.GetValues(splitTerms, 3, 0.8F))
.Or()
.GroupedOr(
new[] { this.searchConfig.ContentFieldName },
this.GetValues(splitTerms, 1, 0.8F));
This is the GetValues method:
private IExamineValue[] GetValues(string[] terms, float boost, float fuzziness)
{
return terms
.Select(t => t.Boost(boost).Value.Fuzzy(fuzziness))
.ToArray();
}
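As a side note, the quote-aware split can be verified on its own. Below is a minimal, self-contained Java sketch of an equivalent regex (the question's code is C#; the class and method names here are purely illustrative, not part of Examine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhraseSplit {
    // Splits a search string into terms, keeping quoted phrases intact:
    // either a run of characters between double quotes, or a run of non-spaces.
    static List<String> split(String input) {
        Pattern p = Pattern.compile("\"[^\"]+\"|\\S+");
        List<String> terms = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            terms.add(m.group());
        }
        return terms;
    }

    public static void main(String[] args) {
        // Quoted phrase survives as a single term, other words split on spaces.
        System.out.println(split("\"brown fox\" jumps lazy"));
    }
}
```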
I have a document in my index which contains the term "The quick brown fox jumps over the lazy dog". If I pass the string "\"brown fox\"" through the above logic and then inspect my query object, I can see it contains the following Lucene query:
(nodeName:"brown fox"~0.8) (_content:"brown fox"~0.8)
However, when I use this to build a search query as follows, I get no results.
var searchQuery = searcher
.Search(query.Compile(), 100)
.OrderByDescending(x => x.Score)
.TakeWhile(x => x.Score > 0.05f);
But if I run the exact same Lucene query in Luke, I get the result I was expecting.
Is anyone able to help me understand this? Extra marks if you can explain why my boost values aren't being added to the Lucene query!

Related

how to search special characters in hibernate search?

I'm new to Hibernate's Lucene-based search. For the past few days I have been working on searching for keywords containing special characters. I am using MultiFieldQueryParser for exact phrase matching as well as Boolean searches, but I am unable to get results for search keywords like "Having 1+ years of experience"; if I remove the quotes around the keyword, I do get results. What I observed while executing the Lucene query is that it escapes the special symbol (+). I am using StandardAnalyzer. I think that if I used WhitespaceAnalyzer it would not escape the special characters, but it might affect Boolean searches like +java +php (i.e. java AND php), because the operators would be treated as plain text. Please suggest a way forward.
The following is my snippet:
Session session = getSession();
FullTextSession fullTextSession = Search.getFullTextSession(session);
MultiFieldQueryParser parser = new MultiFieldQueryParser(new String[] { "student.skills.skill",
"studentProfileSummary.profileTitle", "studentProfileSummary.currentDesignation" },
new StandardAnalyzer());
parser.setDefaultOperator(Operator.OR);
org.apache.lucene.search.Query luceneQuery = null;
QueryBuilder qb = fullTextSession.getSearchFactory().buildQueryBuilder().forEntity(Student.class).get();
BooleanQuery boolQuery = new BooleanQuery();
if (!StringUtils.isBlank(zipcode)) {
boolQuery.add(
qb.keyword().onField("personal.locations.postalCode").matching(zipcode).createQuery(),
BooleanClause.Occur.MUST);
}
if (!StringUtils.isBlank(query)) {
try {
luceneQuery = parser.parse(query.toUpperCase());
} catch (ParseException e) {
luceneQuery = parser.parse(parser.escape(query.toUpperCase()));
}
boolQuery.add(luceneQuery, BooleanClause.Occur.MUST);
}
boolQuery.add(qb.keyword().onField("vStatus").matching(1).createQuery(), BooleanClause.Occur.MUST);
boolQuery.add(qb.keyword().onField("status").matching(1).createQuery(), BooleanClause.Occur.MUST);
boolQuery.add(qb.range().onField("studentProfileSummary.profilePercentage").from(80).to(100).createQuery(),
BooleanClause.Occur.MUST);
FullTextQuery createFullTextQuery = fullTextSession.createFullTextQuery(boolQuery, Student.class);
createFullTextQuery.setProjection("id", "studentProfileSummary.profileTitle", "firstName","lastName");
if (!isEmptyFilter) {
createFullTextQuery.setFirstResult((int) pageNumber);
createFullTextQuery.setMaxResults((int) end);
}
return createFullTextQuery.list();
The key to controlling such effects is indeed in the Analyzers you choose to use. As you noticed, the StandardAnalyzer removes or ignores some symbols, as they are commonly not searched for.
Since the StandardAnalyzer works well for most English natural language, but you also want to handle special symbols, the typical solution is to index the text into multiple fields and assign a different Analyzer to each field. You can then generate queries targeting both fields and combine the scores obtained from each. You can even customize the weight each field should have, and experiment with different Similarity implementations to obtain various effects.
But in your specific example of "1+ years", you might want to consider what you expect it to find. Should it match the string "6 years"?
If so, you probably want to implement a custom analyzer which specifically looks for such patterns and generates multiple matching tokens, like the sequence {"1 year", "2 years", "3 years", ...}. That will be effective, but it only matches that specific set of terms, so you may want to look for more advanced extensions from the Lucene community, as Lucene lets you plug in many more.
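As a rough illustration of that token-expansion idea (this is just the string-level logic, not a Lucene TokenFilter; the class name, method name, and the year cap are all assumptions for the sketch):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExperienceExpander {
    // Expands "N+ years" into explicit year tokens up to a cap, so a query
    // for "1+ years" can also match documents that say "3 years".
    static List<String> expand(String text, int maxYears) {
        Matcher m = Pattern.compile("(\\d+)\\+\\s*years?").matcher(text);
        List<String> tokens = new ArrayList<>();
        if (m.find()) {
            int from = Integer.parseInt(m.group(1));
            for (int y = from; y <= maxYears; y++) {
                tokens.add(y + (y == 1 ? " year" : " years"));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(expand("Having 1+ years of experience", 3));
    }
}
```

A real implementation would emit these as analyzer tokens at the same position; this only shows the pattern matching and expansion step.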

Sitecore Search Predicate Builder multiple keyword search with boosting not working as desired

I have sitecore pages / lucene documents with the following fields:
Title
Filename
Content
File Contents
I'm creating a search for these and have the following requirements:
Hits containing the whole phrase in the title field should be returned first.
Hits containing the whole phrase in the filename field should be returned second.
Hits containing the whole phrase in the content should be returned third
Hits containing the whole phrase in the file contents should be returned fourth
Hits containing all of the keywords (in any order) in the title field should be returned fifth
Hits containing all of the keywords (in any order) in the filename field should be returned sixth
Hits containing all of the keywords (in any order) in the content should be returned seventh.
Hits containing all of the keywords (in any order) in the file contents should be returned eighth.
Here is what I've got:
public static Expression<Func<T, bool>> GetSearchTermPredicate<T>(string searchTerm)
where T : ISearchableItem
{
var actualPhrasePredicate = PredicateBuilder.True<T>()
.Or(r => r.Title.Contains(searchTerm).Boost(2f))
.Or(r => r.FileName.Contains(searchTerm).Boost(1.5f))
.Or(r => r.Content.Contains(searchTerm))
.Or(r => r.DocumentContents.Contains(searchTerm));
var individualWordsPredicate = PredicateBuilder.False<T>();
foreach (var term in searchTerm.Split(' '))
{
individualWordsPredicate
= individualWordsPredicate.And(r =>
r.Title.Contains(term).Boost(2f)
|| r.FileName.Contains(term).Boost(1.5f)
|| r.Content.Contains(term)
|| r.DocumentContents.Contains(term));
}
return PredicateBuilder.Or(actualPhrasePredicate.Boost(2f),
individualWordsPredicate);
}
The actual phrase part seems to work well. Hits with the full phrase in the title are returned first. However, if I remove a word from the middle of the phrase, no results are returned.
i.e., I have a page with the title "The England football team are dreadful", but when I search for "The football team are dreadful", it doesn't find the page.
Note: pages can have documents attached to them, so I want to boost the filenames too but not as highly as the page title.
I managed to get this to work with the following:
public static Expression<Func<T, bool>> GetSearchTermPredicate<T>(string searchTerm)
where T : ISearchableItem
{
var actualPhraseInTitlePredicate = PredicateBuilder.True<T>()
.And(r => r.Title.Contains(searchTerm));
var actualPhraseInFileNamePredicate = PredicateBuilder.True<T>()
.And(r => r.FileName.Contains(searchTerm));
var actualPhraseInContentPredicate = PredicateBuilder.True<T>()
.And(r => r.Content.Contains(searchTerm));
var actualPhraseInDocumentPredicate = PredicateBuilder.True<T>()
.And(r => r.DocumentContents.Contains(searchTerm));
var terms = searchTerm.Split(' ');
var titleContainsAllTermsPredicate = PredicateBuilder.True<T>();
foreach (var term in terms)
titleContainsAllTermsPredicate
= titleContainsAllTermsPredicate.And(r => r.Title.Contains(term).Boost(2f));
var fileNameAllTermsContains = PredicateBuilder.True<T>();
foreach (var term in terms)
fileNameAllTermsContains
= fileNameAllTermsContains.And(r => r.FileName.Contains(term));
var contentContainsAllTermsPredicate = PredicateBuilder.True<T>();
foreach (var term in terms)
contentContainsAllTermsPredicate
= contentContainsAllTermsPredicate.And(r => r.Content.Contains(term));
var documentContainsAllTermsPredicate = PredicateBuilder.True<T>();
foreach (var term in terms)
documentContainsAllTermsPredicate
= documentContainsAllTermsPredicate.And(r => r.DocumentContents.Contains(term));
var predicate = actualPhraseInTitlePredicate.Boost(3f)
.Or(actualPhraseInFileNamePredicate.Boost(2.5f))
.Or(actualPhraseInContentPredicate.Boost(2f))
.Or(actualPhraseInDocumentPredicate.Boost(1.5f))
.Or(titleContainsAllTermsPredicate.Boost(1.2f))
.Or(fileNameAllTermsContains.Boost(1.2f))
.Or(contentContainsAllTermsPredicate)
.Or(documentContainsAllTermsPredicate);
return predicate;
}
It's obviously quite a bit more code, but I think separating the predicates makes more sense for boosting to work effectively.
The main issue with the previous code was twofold:
PredicateBuilder.Or(actualPhrasePredicate.Boost(2f), individualWordsPredicate) doesn't seem to include the predicate being Or'd in. Calling .ToString() on the resulting joined predicate showed that the expression contained nothing for individualWordsPredicate.
Even after fixing that it still didn't work, because I was seeding individualWordsPredicate with PredicateBuilder.False<T>(). Looking at the expression, it was effectively producing (False AND Field.Contains(keyword)), which of course can never evaluate to true. Using .True<T>() fixed this.
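The seed issue can be demonstrated outside Sitecore with plain predicates. This Java sketch (illustrative names; java.util.function.Predicate standing in for PredicateBuilder) shows why a False seed in an AND chain can never match, while a True seed is the correct starting point:

```java
import java.util.function.Predicate;

public class SeedDemo {
    // Combine a seed predicate with a "field contains term" clause via AND,
    // mimicking how PredicateBuilder chains .And(...) onto a seed.
    static boolean testWithSeed(Predicate<String> seed, String input) {
        Predicate<String> clause = s -> s.contains("football");
        return seed.and(clause).test(input);
    }

    public static void main(String[] args) {
        // A False seed ANDed with anything is always false, even on a match:
        System.out.println(testWithSeed(s -> false, "football team"));
        // A True seed ANDed with a clause behaves as the clause alone:
        System.out.println(testWithSeed(s -> true, "football team"));
    }
}
```

The same reasoning explains why OR chains should be seeded with False: False OR x reduces to x, whereas True OR x always matches.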

How to delete Documents from a Lucene Index using Term or QueryParser

I am trying to delete documents from Lucene Index.
I want to delete only the specified file from lucene index .
My program below deletes documents from the index that can be found using KeywordAnalyzer, but the filename I need can only be found using StandardAnalyzer. Is there any way to set a StandardAnalyzer on my Term, or, instead of a Term, how can I use QueryParser to delete the documents from the Lucene index?
try{
File INDEX_DIR= new File("D:\\merge lucene\\abc\\");
Directory directory = FSDirectory.open(INDEX_DIR);
IndexReader indexReader = IndexReader.open(directory,false);
Term term= new Term("path","fileindex23005.htm");
int l= indexReader.deleteDocuments(term);
indexReader.close();
System.out.println("documents deleted");
}
catch(Exception x){x.printStackTrace();}
I assume you are using Lucene 3.6 or earlier; otherwise IndexReader.deleteDocuments no longer exists. Either way, you should be using IndexWriter instead.
If you can only find the document using the query parser, then just run a normal query, iterate through the documents returned, and delete them by document number, along these lines:
Query query = queryParser.parse("My Query!");
ScoreDoc[] docs = searcher.search(query, 100).scoreDocs;
for (ScoreDoc doc : docs) {
indexReader.deleteDocument(doc.doc);
}
Or better yet (simpler, uses non-defunct, non-deprecated functionality), just use an IndexWriter, and pass it the query directly:
Query query = queryParser.parse("My Query!");
writer.deleteDocuments(query);
Adding this for future reference, for someone like me on a version where deleteDocuments is on IndexWriter: you may use
indexWriter.deleteDocuments(Term... terms)
instead of the deleteDocuments(query) overload, which is less hassle if you only need to match a single field. Be aware that this method treats multiple terms as an OR condition: it will match any of the terms and delete all matching records. The code below matches STATE=Tx in the stored documents and deletes the matching records.
indexWriter.deleteDocuments(
new Term("STATE", "Tx")
);
For combining different fields with AND condition, we may use following code:
BooleanQuery.Builder builder = new BooleanQuery.Builder();
// Note: year is stored as an int, not as a String, when the document is created.
// A Term here would require "2016" as a String, which would not match documents stored with year as an int.
Query yearQuery = IntPoint.newExactQuery("year", 2016);
Query stateQuery = new TermQuery(new Term("STATE", "TX"));
Query cityQuery = new TermQuery(new Term("CITY", "CITY NAME"));
builder.add(yearQuery, BooleanClause.Occur.MUST);
builder.add(stateQuery, BooleanClause.Occur.MUST);
builder.add(cityQuery, BooleanClause.Occur.MUST);
indexWriter.deleteDocuments(builder.build());
As @dillippattnaik pointed out, multiple terms result in OR. I have updated his code to make it an AND using a BooleanQuery (Lucene.NET syntax):
BooleanQuery query = new BooleanQuery
{
{ new TermQuery( new Term( "year", "2016" ) ), Occur.MUST },
{ new TermQuery( new Term( "STATE", "TX" ) ), Occur.MUST },
{ new TermQuery( new Term( "CITY", "CITY NAME" ) ), Occur.MUST }
};
indexWriter.DeleteDocuments( query );

Ordering a query by the string length of one of the fields

In RavenDB (build 2330) I'm trying to order my results by the string length of one of the indexed terms.
var result = session.Query<Entity, IndexDefinition>()
.Where(condition)
.OrderBy(x => x.Token.Length);
However the results look to be un-sorted. Is this possible in RavenDB (or via a Lucene query) and if so what is the syntax?
You need to add a field to the IndexDefinition to order by, and set its SortOption to Int (or something more appropriate; you don't want String, which is the default).
If you want to use the LINQ API as in your example, you need to add a field named Token_Length to the index's Map function (see Matt's comment):
from doc in docs
select new
{
...
Token_Length = doc.TokenLength
}
And then you can query using the Linq API:
var result = session.Query<Entity, IndexDefinition>()
.Where(condition)
.OrderBy(x => x.Token.Length);
Or if you really want the field to be called TokenLength (or something other than Token_Length) you can use a LuceneQuery:
from doc in docs
select new
{
...
TokenLength = doc.Token.Length
}
And you'd query like this:
var result = session.Advanced.LuceneQuery<Entity, IndexDefinition>()
.Where(condition)
.OrderBy("TokenLength");

Proper Way to Retrieve More than 128 Documents with RavenDB

I know variants of this question have been asked before (even by me), but I still don't understand a thing or two about this...
It was my understanding that one could retrieve more documents than the 128 default setting by doing this:
session.Advanced.MaxNumberOfRequestsPerSession = int.MaxValue;
And I've learned that a WHERE clause should be an expression tree (Expression<Func<T, bool>>) instead of a Func, so that it's treated as IQueryable instead of IEnumerable. So I thought this should work:
public static List<T> GetObjectList<T>(Expression<Func<T, bool>> whereClause)
{
using (IDocumentSession session = GetRavenSession())
{
return session.Query<T>().Where(whereClause).ToList();
}
}
However, that only returns 128 documents. Why?
Note, here is the code that calls the above method:
RavenDataAccessComponent.GetObjectList<Ccm>(x => x.TimeStamp > lastReadTime);
If I add Take(n), then I can get as many documents as I like. For example, this returns 200 documents:
return session.Query<T>().Where(whereClause).Take(200).ToList();
Based on all of this, it would seem that the appropriate way to retrieve thousands of documents is to set MaxNumberOfRequestsPerSession and use Take() in the query. Is that right? If not, how should it be done?
For my app, I need to retrieve thousands of documents (each containing very little data). We keep these documents in memory and use them as the data source for charts.
** EDIT **
I tried using int.MaxValue in my Take():
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
And that returns 1024. Argh. How do I get more than 1024?
** EDIT 2 - Sample document showing data **
{
"Header_ID": 3525880,
"Sub_ID": "120403261139",
"TimeStamp": "2012-04-05T15:14:13.9870000",
"Equipment_ID": "PBG11A-CCM",
"AverageAbsorber1": "284.451",
"AverageAbsorber2": "108.442",
"AverageAbsorber3": "886.523",
"AverageAbsorber4": "176.773"
}
It is worth noting that since version 2.5, RavenDB has an "unbounded results API" to allow streaming. The example from the docs shows how to use this:
var query = session.Query<User>("Users/ByActive").Where(x => x.Active);
using (var enumerator = session.Advanced.Stream(query))
{
while (enumerator.MoveNext())
{
User activeUser = enumerator.Current.Document;
}
}
There is support for standard RavenDB queries and Lucene queries, and there is also async support.
The documentation can be found here. Ayende's introductory blog article can be found here.
The Take(n) function will only give you up to 1024 by default. However, you can change this default in Raven.Server.exe.config:
<add key="Raven/MaxPageSize" value="5000"/>
For more info, see: http://ravendb.net/docs/intro/safe-by-default
The Take(n) function will only give you up to 1024 results by default. However, you can use it together with Skip(n) to retrieve everything:
var points = new List<T>();
var nextGroupOfPoints = new List<T>();
const int ElementTakeCount = 1024;
int i = 0;
int skipResults = 0;
do
{
nextGroupOfPoints = session.Query<T>().Statistics(out stats).Where(whereClause).Skip(i * ElementTakeCount + skipResults).Take(ElementTakeCount).ToList();
i++;
skipResults += stats.SkippedResults;
points = points.Concat(nextGroupOfPoints).ToList();
}
while (nextGroupOfPoints.Count == ElementTakeCount);
return points;
RavenDB Paging
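The termination logic of the loop above can be sketched with plain collections (Java here for illustration; fetchPage stands in for the store call, and the class, method names, and page size are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

public class PagingDemo {
    static final int PAGE_SIZE = 3;

    // Stand-in for a data-store call whose results are capped at PAGE_SIZE.
    static List<Integer> fetchPage(List<Integer> source, int skip) {
        if (skip >= source.size()) {
            return List.of();
        }
        int to = Math.min(skip + PAGE_SIZE, source.size());
        return source.subList(skip, to);
    }

    // Keep requesting pages until one comes back shorter than the cap,
    // which signals that the store has no more results.
    static List<Integer> fetchAll(List<Integer> source) {
        List<Integer> all = new ArrayList<>();
        List<Integer> page;
        int skip = 0;
        do {
            page = fetchPage(source, skip);
            all.addAll(page);
            skip += page.size();
        } while (page.size() == PAGE_SIZE);
        return all;
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(List.of(1, 2, 3, 4, 5, 6, 7)));
    }
}
```

Note that when the total is an exact multiple of the page size, the loop issues one final empty request before stopping, which is also why the original snippet tracks SkippedResults separately.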
The number of requests per session is a separate concept from the number of documents retrieved per call. Sessions are short-lived and are expected to issue only a few calls.
If you are pulling more than 10 of anything from the store (even fewer than the default 128) for direct human consumption, then either something is wrong, or your problem requires different thinking than hauling a truckload of documents out of the data store.
RavenDB indexing is quite sophisticated. There is a good article about indexing here, and about facets here.
If you have need to perform data aggregation, create map/reduce index which results in aggregated data e.g.:
Index:
Map:
from post in docs.Posts
select new { post.Author, Count = 1 }
Reduce:
from result in results
group result by result.Author into g
select new
{
Author = g.Key,
Count = g.Sum(x => x.Count)
}
Query:
var stats = session.Query<AuthorPostStats>("Posts/ByUser/Count").ToList();
You can also use a predefined index with the Stream method. You may use a Where clause on indexed fields.
var query = session.Query<User, MyUserIndex>();
var query = session.Query<User, MyUserIndex>().Where(x => !x.IsDeleted);
using (var enumerator = session.Advanced.Stream<User>(query))
{
while (enumerator.MoveNext())
{
var user = enumerator.Current.Document;
// do something
}
}
Example index:
public class MyUserIndex: AbstractIndexCreationTask<User>
{
public MyUserIndex()
{
this.Map = users =>
from u in users
select new
{
u.IsDeleted,
u.Username,
};
}
}
Documentation: What are indexes?
Session : Querying : How to stream query results?
Important note: the Stream method will NOT track objects. If you change objects obtained from this method, SaveChanges() will not be aware of any change.
Other note: you may get the following exception if you do not specify the index to use.
InvalidOperationException: StreamQuery does not support querying dynamic indexes. It is designed to be used with large data-sets and is unlikely to return all data-set after 15 sec of indexing, like Query() does.