Lucene: facet range depends on the returned results - lucene

I have a working search set up where I give the facet range, and get the correct results back.
The problem is that for prices facet, I need to be depending on the returned result, so I can't know the ranges beforehand.
Example 1: the search found 4 products with the following prices: 20, 30, 40, 55. So I expect the facets be something like this:
0 - 20 (1)
21 - 40 (2)
41 - 60 (1)
Example 2: the search found 2 products with the following prices: 200, 400, So I expect the facets be something like this:
100 - 200 (1)
300 - 400 (1)
Is there somewhere in Lucene where I can specify that I want the ranges be based on a field from the search results?
Thank you

After many searching I did not find a way to do it out of the box.
The way to do it, is to extend DrillSideways, and at the beginning of the BuildFacetsResult method, we extract the field with the necessary value. Something like this:
public class DrillSidewaysHelper : DrillSideways
{
private TopFieldCollector _localCollector;
private List<decimal> _prices;
protected override LuceneFacets BuildFacetsResult(FacetsCollector drillDownCollector, FacetsCollector[] drillSidewaysCollectors,
string[] drillSidewaysFacets)
{
ExtractResults(); //here we extract the methods from the _localCollector and populate _prices
....here we can generate the facets range based on the _prices
}
public FacetResultSet<T> OurSearch(DrillDownQuery drillDownQuery, Sort sort)
{
int limit = m_searcher.IndexReader.MaxDoc;
_localCollector = TopFieldCollector.Create(sort, limit, null, true, true, false, true);
var drillSidewaysResult = Search(drillDownQuery, _localCollector);
var facets = ExtractFacets(drillSidewaysResult.Facets);
return new FacetResultSet<T>(_results, (uint) _localCollector.TotalHits, facets);
}
}
and usage:
var drillSideways = new DrillSidewaysHelper(searcher, facetsConfig, taxonomyReader);
return drillSideways.OurSearch(drillDownQuery, sort);

Related

RavenDB : Use "Search" only if given within one query

I have a situation where my user is presented with a grid, and it, by default, will just get the first 15 results. However they may type in a name and search for an item across all pages.
Alone, either of these works fine, but I am trying to figure out how to make them work as a single query. This is basically what it looks like...
// find a filter if the user is searching
var filters = request.Filters.ToFilters();
// determine the name to search by
var name = filters.Count > 0 ? filters[0] : null;
// we need to be able to catch some query statistics to make sure that the
// grid view is complete and accurate
RavenQueryStatistics statistics;
// try to query the items listing as quickly as we can, getting only the
// page we want out of it
var items = RavenSession
.Query<Models.Items.Item>()
.Statistics(out statistics) // output our query statistics
.Search(n => n.Name, name)
.Skip((request.Page - 1) * request.PageSize)
.Take(request.PageSize)
.ToArray();
// we need to store the total results so that we can keep the grid up to date
var totalResults = statistics.TotalResults;
return Json(new { data = items, total = totalResults }, JsonRequestBehavior.AllowGet);
The problem is that if no name is given, it does not return anything; Which is not the desired result. (Searching by 'null' doesn't work, obviously.)
Do something like this:
var q= RavenSession
.Query<Models.Items.Item>()
.Statistics(out statistics); // output our query statistics
if(string.IsNullOrEmpty(name) == false)
q = q.Search(n => n.Name, name);
var items = q.Skip((request.Page - 1) * request.PageSize)
.Take(request.PageSize)
.ToArray();

Datatables sorting varchar

I have a SQL query that is pulling back results ordered correctly when I try the query in MSSQL Studio.
I am using Datatables from Datatables.net and everything is working great apart from a sorting issue. I have some properties in the first column and I would like to order these like this:
1
1a
1b
2
3
4a
5
etc
However what comes back is something like this:
1
10
100
11
11a
I have looked though various posts but nothing seems to work and I believe that this must be something I should trigger from the datatables plugin but cannot find anything.
Could someone advise?
Your data contain numbers and characters, so they will be sorted as string by default. You should write your own plugin for sorting your data type. Have a look at here and here
to see how to write a plugin and how to use it with your table.
Edit: got some time today to work with the datatable stuff. If you still need a solution, here you go:
//Sorting plug-in
jQuery.extend( jQuery.fn.dataTableExt.oSort, {
//pre-processing
"numchar-pre": function(str){
var patt = /^([0-9]+)([a-zA-Z]+)$/; //match data like 1a, 2b, 1ab, 100k etc.
var matches = patt.exec($.trim(str));
var number = parseInt(matches[1]); //extract the number part
var str = matches[2].toLowerCase(); //extract the "character" part and make it case-insensitive
var dec = 0;
for (i=0; i<str.length; i++)
{
dec += (str.charCodeAt(i)-96)*Math.pow(26, -(i+1)); //deal with the character as a base-26 number
}
return number + dec; //combine the two parts
},
//sort ascending
"numchar-asc": function(a, b){
return a-b;
},
//sort descending
"numchar-desc": function(a, b){
return b-a;
}
});
//Automatic type detection plug-in
jQuery.fn.dataTableExt.aTypes.unshift(
function(sData)
{
var patt = /^([0-9]+)([a-zA-Z]+)$/;
var trimmed = $.trim(sData);
if (patt.test(trimmed))
{
return 'numchar';
}
return null;
}
);
You can use the automatic type detection function to let the data type automatically detected or you can set the data type for the column
"aoColumns": [{"sType": "numchar"}]

Get a value from array based on the value of others arrays (VB.Net)

Supposed that I have two arrays:
Dim RoomName() As String = {(RoomA), (RoomB), (RoomC), (RoomD), (RoomE)}
Dim RoomType() As Integer = {1, 2, 2, 2, 1}
I want to get a value from the "RoomName" array based on a criteria of "RoomType" array. For example, I want to get a "RoomName" with "RoomType = 2", so the algorithm should randomize the index of the array that the "RoomType" is "2", and get a single value range from index "1-3" only.
Is there any possible ways to solve the problem using array, or is there any better ways to do this? Thank you very much for your time :)
Note: Code examples below using C# but hopefully you can read the intent for vb.net
Well, a simpler way would be to have a structure/class that contained both name and type properties e.g.:
public class Room
{
public string Name { get; set; }
public int Type { get; set; }
public Room(string name, int type)
{
Name = name;
Type = type;
}
}
Then given a set of rooms you can find those of a given type using a simple linq expression:
var match = rooms.Where(r => r.Type == 2).Select(r => r.Name).ToList();
Then you can find a random entry from within the set of matching room names (see below)
However assuming you want to stick with the parallel arrays, one way is to find the matching index values from the type array, then find the matching names and then find one of the matching values using a random function.
var matchingTypeIndexes = new List<int>();
int matchingTypeIndex = -1;
do
{
matchingTypeIndex = Array.IndexOf(roomType, 2, matchingTypeIndex + 1);
if (matchingTypeIndex > -1)
{
matchingTypeIndexes.Add(matchingTypeIndex);
}
} while (matchingTypeIndex > -1);
List<string> matchingRoomNames = matchingTypeIndexes.Select(typeIndex => roomName[typeIndex]).ToList();
Then to find a random entry of those that match (from one of the lists generated above):
var posn = new Random().Next(matchingRoomNames.Count);
Console.WriteLine(matchingRoomNames[posn]);

Proper Way to Retrieve More than 128 Documents with RavenDB

I know variants of this question have been asked before (even by me), but I still don't understand a thing or two about this...
It was my understanding that one could retrieve more documents than the 128 default setting by doing this:
session.Advanced.MaxNumberOfRequestsPerSession = int.MaxValue;
And I've learned that a WHERE clause should be an ExpressionTree instead of a Func, so that it's treated as Queryable instead of Enumerable. So I thought this should work:
public static List<T> GetObjectList<T>(Expression<Func<T, bool>> whereClause)
{
using (IDocumentSession session = GetRavenSession())
{
return session.Query<T>().Where(whereClause).ToList();
}
}
However, that only returns 128 documents. Why?
Note, here is the code that calls the above method:
RavenDataAccessComponent.GetObjectList<Ccm>(x => x.TimeStamp > lastReadTime);
If I add Take(n), then I can get as many documents as I like. For example, this returns 200 documents:
return session.Query<T>().Where(whereClause).Take(200).ToList();
Based on all of this, it would seem that the appropriate way to retrieve thousands of documents is to set MaxNumberOfRequestsPerSession and use Take() in the query. Is that right? If not, how should it be done?
For my app, I need to retrieve thousands of documents (that have very little data in them). We keep these documents in memory and used as the data source for charts.
** EDIT **
I tried using int.MaxValue in my Take():
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
And that returns 1024. Argh. How do I get more than 1024?
** EDIT 2 - Sample document showing data **
{
"Header_ID": 3525880,
"Sub_ID": "120403261139",
"TimeStamp": "2012-04-05T15:14:13.9870000",
"Equipment_ID": "PBG11A-CCM",
"AverageAbsorber1": "284.451",
"AverageAbsorber2": "108.442",
"AverageAbsorber3": "886.523",
"AverageAbsorber4": "176.773"
}
It is worth noting that since version 2.5, RavenDB has an "unbounded results API" to allow streaming. The example from the docs shows how to use this:
var query = session.Query<User>("Users/ByActive").Where(x => x.Active);
using (var enumerator = session.Advanced.Stream(query))
{
while (enumerator.MoveNext())
{
User activeUser = enumerator.Current.Document;
}
}
There is support for standard RavenDB queries, Lucence queries and there is also async support.
The documentation can be found here. Ayende's introductory blog article can be found here.
The Take(n) function will only give you up to 1024 by default. However, you can change this default in Raven.Server.exe.config:
<add key="Raven/MaxPageSize" value="5000"/>
For more info, see: http://ravendb.net/docs/intro/safe-by-default
The Take(n) function will only give you up to 1024 by default. However, you can use it in pair with Skip(n) to get all
var points = new List<T>();
var nextGroupOfPoints = new List<T>();
const int ElementTakeCount = 1024;
int i = 0;
int skipResults = 0;
do
{
nextGroupOfPoints = session.Query<T>().Statistics(out stats).Where(whereClause).Skip(i * ElementTakeCount + skipResults).Take(ElementTakeCount).ToList();
i++;
skipResults += stats.SkippedResults;
points = points.Concat(nextGroupOfPoints).ToList();
}
while (nextGroupOfPoints.Count == ElementTakeCount);
return points;
RavenDB Paging
Number of request per session is a separate concept then number of documents retrieved per call. Sessions are short lived and are expected to have few calls issued over them.
If you are getting more then 10 of anything from the store (even less then default 128) for human consumption then something is wrong or your problem is requiring different thinking then truck load of documents coming from the data store.
RavenDB indexing is quite sophisticated. Good article about indexing here and facets here.
If you have need to perform data aggregation, create map/reduce index which results in aggregated data e.g.:
Index:
from post in docs.Posts
select new { post.Author, Count = 1 }
from result in results
group result by result.Author into g
select new
{
Author = g.Key,
Count = g.Sum(x=>x.Count)
}
Query:
session.Query<AuthorPostStats>("Posts/ByUser/Count")(x=>x.Author)();
You can also use a predefined index with the Stream method. You may use a Where clause on indexed fields.
var query = session.Query<User, MyUserIndex>();
var query = session.Query<User, MyUserIndex>().Where(x => !x.IsDeleted);
using (var enumerator = session.Advanced.Stream<User>(query))
{
while (enumerator.MoveNext())
{
var user = enumerator.Current.Document;
// do something
}
}
Example index:
public class MyUserIndex: AbstractIndexCreationTask<User>
{
public MyUserIndex()
{
this.Map = users =>
from u in users
select new
{
u.IsDeleted,
u.Username,
};
}
}
Documentation: What are indexes?
Session : Querying : How to stream query results?
Important note: the Stream method will NOT track objects. If you change objects obtained from this method, SaveChanges() will not be aware of any change.
Other note: you may get the following exception if you do not specify the index to use.
InvalidOperationException: StreamQuery does not support querying dynamic indexes. It is designed to be used with large data-sets and is unlikely to return all data-set after 15 sec of indexing, like Query() does.

Hibernate Criteria - Restricting Data Based on Field in One-to-Many Relationship

I need some hibernate/SQL help, please. I'm trying to generate a report against an accounting database. A commission order can have multiple account entries against it.
class CommissionOrderDAO {
int id
String purchaseOrder
double bookedAmount
Date customerInvoicedDate
String state
static hasMany = [accountEntries: AccountEntryDAO]
SortedSet accountEntries
static mapping = {
version false
cache usage: 'read-only'
table 'commission_order'
id column:'id', type:'integer'
purchaseOrder column: 'externalId'
bookedAmount column: 'bookedAmount'
customerInvoicedDate column: 'customerInvoicedDate'
state column : 'state'
accountEntries sort : 'id', order : 'desc'
}
...
}
class AccountEntryDAO implements Comparable<AccountEntryDAO> {
int id
Date eventDate
CommissionOrderDAO commissionOrder
String entryType
String description
double remainingPotentialCommission
static belongsTo = [commissionOrder : CommissionOrderDAO]
static mapping = {
version false
cache usage: 'read-only'
table 'account_entry'
id column:'id', type:'integer'
eventDate column: 'eventDate'
commissionOrder column: 'commissionOrder'
entryType column: 'entryType'
description column: 'description'
remainingPotentialCommission formula : SQLFormulaUtils.AccountEntrySQL.REMAININGPOTENTIALCOMMISSION_FORMULA
}
....
}
The criteria for the report is that the commissionOrder.state==open and the commissionOrder.customerInvoicedDate is not null. And the account entries in the report should be between the startDate and the endDate and with remainingPotentialCommission > 0.
I'm looking to display information on the CommissionOrder mainly (and to display account entries on that commission order between the dates), but when I use the following projection:
def results = accountEntryCriteria.list {
projections {
like ("entryType", "comm%")
ge("eventDate", beginDate)
le("eventDate", endDate)
gt("remainingPotentialCommission", 0.0099d)
and {
commissionOrder {
eq("state", "open")
isNotNull("customerInvoicedDate")
}
}
}
order("id", "asc")
}
I get the correct accountEntries with the proper commissionOrders, but I'm going in backwards: I have loads of accountEntries which can reference the same commissionOrder. Aut when I look at the commissionOrders that I've retrieved, each one has ALL its accountEntries not just the accountEntries between the dates.
I then loop through the results, get the commissionOrder from the accountEntriesList, and remove accountEntries on that commissionOrder after the end date to get the "snapshot" in time that I need.
def getCommissionOrderListByRemainingPotentialCommissionFromResults(results, endDate) {
log.debug("begin getCommissionOrderListByRemainingPotentialCommissionFromResults")
int count = 0;
List<CommissionOrderDAO> commissionOrderList = new ArrayList<CommissionOrderDAO>()
if (results) {
CommissionOrderDAO[] commissionOrderArray = new CommissionOrderDAO[results?.size()];
Set<CommissionOrderDAO> coDuplicateCheck = new TreeSet<CommissionOrderDAO>()
for (ae in results) {
if (!coDuplicateCheck.contains(ae?.commissionOrder?.purchaseOrder) && ae?.remainingPotentialCommission > 0.0099d) {
CommissionOrderDAO co = ae?.commissionOrder
CommissionOrderDAO culledCO = removeAccountEntriesPastDate(co, endDate)
def lastAccountEntry = culledCO?.accountEntries?.last()
if (lastAccountEntry?.remainingPotentialCommission > 0.0099d) {
commissionOrderArray[count++] = culledCO
}
coDuplicateCheck.add(ae?.commissionOrder?.purchaseOrder)
}
}
log.debug("Count after clean is ${count}")
if (count > 0) {
commissionOrderList = Arrays.asList(ArrayUtils.subarray(commissionOrderArray, 0, count))
log.debug("commissionOrderList size = ${commissionOrderList?.size()}")
}
}
log.debug("end getCommissionOrderListByRemainingPotentialCommissionFromResults")
return commissionOrderList
}
Please don't think I'm under the impression that this isn't a Charlie Foxtrot. The query itself doesn't take very long, but the cull process takes over 35 minutes. Right now, it's "manageable" because I only have to run the report once a month.
I need to let the database handle this processing (I think), but I couldn't figure out how to manipulate hibernate to get the results I want. How can I change my criteria?
Try to narrow down the bottle neck of that process. If you have a lot of data, then maybe this check could be time expensive.
coDuplicateCheck.contains(ae?.commissionOrder?.purchaseOrder)
in Set contains have O(n) complexity. You can use i.e. Map to store keys that you would check and then search for "ae?.commissionOrder?.purchaseOrder" as key in the map.
The second thought is that maybe when you're getting ae?.commissionOrder?.purchaseOrder it is always loaded from db by lazy mechanism. Try to turn on query logging and check that you don't have dozens of queries inside this processing function.
Finally and again I would suggest to narrow down where is the most expensive part and time waste.
This plugin maybe helpful.