Lucene.Net: How to grab the next 10 results after a document that has a certain field value? - lucene

Suppose I have a query in Lucene.Net 3.0.3 that matches 1,000,000 documents, and each document has a field named ProductID with a unique value. How can I grab the next 10 items immediately after a specific ProductID?
For instance, grab me the next 10 items after ProductID 4264423.
The ProductID could be anywhere within the 1,000,000 matches, and sorted however I wish.
One brute force solution is to loop through all the ScoreDocs, and use the FieldCache to find the correct ProductID, then grab the next 10. However, that seems inefficient, since we'd need to populate a huge ScoreDocs array.
Another idea is to use a custom Collector, along with FieldCache to look for the correct ProductID, but as far as I know, Collector's aren't sorted.
Perhaps a solution is to use a combination of a custom Collector with a PriorityQueue, use the FieldCache to find the correct ProductID, note the Score of that document, then grab the next 10 items based on Score. (Although, if there are similar Score values, how is that handled?)
Please provide code samples, as I'm a Lucene.Net newbie. (Sample code preferably in C#.)
If a custom Collector + PriorityQueue is a viable option, here is some sample code to assist: https://stackoverflow.com/a/7938433/1145177

Related

Elastic Search, Nest. functional sorting

I'm building a filter page, with facets etc, which works as it should.
Now the our customer has a request to, basically "Be able to decide which sorting the items comes out in".
Each product is decorated with a Product Display Order, and is in a Product Line.
We got these example Product Display Orders:
1. Featured Item
2. Core Item
3. Spare Part
4. Utility
And these Product Lines:
1. Hammers
2. Saw
3. Wood
and the sorting is like this:
Sorting should firstly be based on Product Display Orders, secondly by product lines, thirdly Alphabetically.
So all products which is a Featured Item is listed first, and all these Featured Items is then sorted by their product line, and if some product are in the same Featured Item and Product Line, then its alphabetically.
The challenge is: I can't just get the sorting of Product Display order items and product lines as a number on the product, i only got a name/id.
We've thought of Boosting based on if the product are in the different categories, but it seems a bit messy.
OR
See if it possible to have some logic in the Sorting.
Sort by productDisplayOrder:
1. featured, 2. core Item ...
Then by ProductLines:
1. Hammers, 2. Saw ...
Then by Name DESC.
Which way is the best way to have this sorting, is it possible to give this logic to elastic, if it is a match and then sort it. Or are we needed to twist the boosts of product?
Hopefully this makes sense for you.
Thanks in advance! :)
Option 1). Quickest/Best performing solution would be to create new/separate integer fields for productDisplayOrder and ProductLine and then use those in your sort criteria as described (after reindexing and validating the the data is indexed as expected).
Option 2) If you want more nuance than described (eg higher scoring matches can 'break through' the ordering ceiling described) then you can explore using a Function Score Query to implement a custom scoring strategy that takes productDisplayOrder and ProductLine into consideration in generating an overall match score.
Option 3). If you can't change the mapping and reindexing your data, you can use Script-Based Sorting to generate sorting values from the currently indexed productDisplayOrder/ProductLine text using a script (eg Groovy). Keep in mind that query performance will be worse than the first two options.

Secondary sorting in Magento

I have come across a problem which my developer says there is no solution to.
I have an ecommerce site www.lovefashion.pk roughly 800 products to date. It has categories with collections of products from a particular brand and New Arrivals and Sale pages that show a number of categories. The issue I am faced with is the way Magento displays these products on catalog pages.
We create a custom sort 'BY date' and called it 'Latest' so I would choose a date for a collection and the catalog pages would show the collection with newest sort by date first and so on. Issue is with secondary sorting now. I have products with design codes say Brand A code 1-A, Brand A code 1-B, Brand A code 2-A and so on. When creating these products, we go alphabetically like 1-A,1-B, 2-A etc.
Say a collection has 15 products and i have given all its products a sort by date of 15th April. Magento does the date sorting OK but secondary sorting i.e by product ID (oldest to newest) is what we cant find a way to implement. It shows products now with latest ID first, i.e 20-B, 20-A, 19-B, and so on till 1-A e.g http://www.lovefashion.pk/shop-by-price/
Is there a way to solve this problem? or can SQL manipulation do the trick? Will aprreicate any help
This is a very interesting question. Can you tell us a little more about what you are trying to do? I think you want the products on each page reversed, but the pages to keep the order they have, yes? or do you want the page to go 'Asim Jofa Luxury Lawn Collection 2014 / 1-A' then 'Orient Lawn Kurti Collection 2014 / 01-B' then 'Orient Lawn Kurti Collection 2014 / 01-A' and so on?
Anyway, I think all you need to do is code a secondary sorting of the collection for each page which could be coded in the list.phtml file.
Psuedo code:
get the collection sorted by date and paginated
reverse the collection
pass the collection into the existing foreach(){echo(<li>product info</li>)}
So 'reverse the collection' might be tricky. And strictly you want to sort by sku on 1A, 1B, 2A, 2B - but I'm sure your developer can figure it out.
$reversed_collection = array_reverse($current_page_collection->toArray());
might work but I think the toArray will loose all the class types so it's not the same as reversing the object.
Collections are like arrays so end($current_page_collection) and prev($current_page_collection) might work but you would have to try it (and report back if it works). In that case a simple for loop to iterate over each item of the colection.
If you are making a custom filter then really it is just a case of putting each item of the collection into an array with an appropriate key and sorting with ksort().
So it is my opinion that it can be sorted that way, but it should be done in two stages.

How to keep a list of 'used' data per user

I'm currently working on a project in MongoDB where I want to get a random sampling of new products from the DB. But my problem is not MongoDB specific, I think it's a general database question.
The scenario:
Let's say we have a collection (or table) of products. And we also have a collection (or table) of users. Every time a user logs in, they are presented with 10 products. These products are selected randomly from the collection/table. Easy enough, but the catch is that every time the user logs in, they must be presented with 10 products that they have NEVER SEEN BEFORE. The two obvious ways that I can think of solving this problem are:
Every user begins with their own private list of all products. Each time they get one of these products, the product is removed from their private list. The result is that the next time products are chosen from this previously trimmed list, it already contains only new items.
Every user has a private list of previously viewed products. When a user logs in, they select 10 random products from the master list, compare the id of each against their list of previously viewed products, and if the item appears on the previously viewed list, the application throws this one away selects a new one, and iterates until there are 10 new items, which it then adds to the previously viewed list for next time.
The problem with #1 is it seems like a tremendous waste. You would basically be duplicating the list data for n number of users. Also removing/adding new items to the system would be a nightmare since it would have to iterate through all users. #2 seems preferable, but it too has issues. You could end up making a lot of extra and unnecessary calls to the DB in order to guarantee 10 new products. As a user goes through more and more products, there are less new ones to choose from, so the chances of having to throw one away and get new one from the DB greatly increases.
Is there an alternative solution? My first and primary concern is performance. I will give up disk space in order to optimize performance.
Those 2 ways are a complete waste of both primary and secondary memory.
You want to show 2 never before seen products, but is this a real must?
If you have a lot of products 10 random ones have a high chance of being unique.
3 . You could list 10 random products, even though not as easy as in MySQL, still less complicated than 1 and 2.
If you don't care how random the sequence of id's is you could do this:
Create a single randomized table of just product id's and a sequential integer surrogate key column. Start each customer at a random point in the list on first login and cycle through the list ordered by that key. If you reach the end, start again from the top.
The customer record would contain a single value for the last product they saw (the surrogate from the randomized list, not the actual id). You'd then pull the next ten on login and do a single update to the customer. It wouldn't really be random, of course. But this kind of table-seed strategy is how a lot of simpler pseudo-random number generators work.
The only problem I see is if your product list grows more quickly than your users log in. Then they'd never see the portions of the list which appear before wherever they started. Even so, with a large list of products and very active users this should scale much better than storing everything they've seen. So if it doesn't matter that products appear in a set psuedo-random sequence, this might be a good fit for you.
Edit:
If you stored the first record they started with as well, you could still generate the list of all things seen. It would be everything between that value and last viewed.
How about doing this: crate a collection prodUser where you will have just the id of the product and the list of customersID, (who have seen these products) .
{
prodID : 1,
userID : []
}
when a customer logs in you find the 10 prodID which has not been assigned to that user
db.prodUser.find({
userID : {
$nin : [yourUser]
}
})
(For some reason $not is not working :-(. I do not have time to figure out why. If you will - plz let me know.). After showing the person his products - you can update his prodUser collection. To mitigate mongos inability to find random elements - you can insert elements randomly and just find first 10.
Everything should work really fast.

Solr: Search in multiple fields BUT STOP if documents match was found

I want to search in multiple fields in Solr.
(In know the concept of the copy-fields and I know the (e)dismax search handler.)
So I have an orderd list of fields, I want the terms to be searched against.
1.) SKU
2.) Name
3.) Description
4.) Summary
and so on.
Now, when the query matches a term, let's say in the SKU field, I want this match and no further searches in the proceeding fields.
Only, if there are NO matches at all in the first field (SKU field), the second field (in this case "name") should be used and so on.
Is this possible with Solr?
Do I have to implement my own Lucene Search Handler for this?
Any advice is welcome!
Thank you,
Bernhard
I think your case requires executing 4 different searches. If you implement you very own SearchHandler you could avoid penalty of search result accumulation in 4 different request. Which means, you would send one query, and custom SearchHandler would execute 4 searches and prepare one result set.
If my guess is right you want to rank the results based on the order of the fields. If so then you can just use standard query like
q=sku:(query)^4 OR name:(query)^3 OR description:(query)^2 OR summary:(query)
this will rank the results by the order of the fields.
Hope is helps.

Want an efficient approach to retrieving records from a database when the retrieval is weighted and balanced

Im working on something incredibly unique..... a property listings website. ;)
It displays a list of properties. For each property a teaser image and some caption data is displayed. If the teaser image and caption takes a site visitors interest, they can click on it and get a full property profile. All very standard.
The customer wants to be able to allow property owners to add multiple teaser images and to be able to track which teaser images got the most click throughs. No worries there.
But they also want to allow the property owner to weight each teaser image to control when it is shown. So for 3 images with weightings of 2, 6, 2, the 2nd image would be shown 6/10 times. This needs to be balanced. If the first 6 times the 2nd image is shown, it cant be shown again until the 1st and 3rd images have be shown twice each.
So I need to both increment how often an image has been retrieved and also retrieve images in a balanced way. Forget about actual image handling, Im actually just talking about Urls.
Note incrementing how often it has been retrieved is a different animal to incrementing how often it has captured a click through.
So i can think of a few different ways to approach the problem using database triggers or maybe some LINQ2SQL, etc but it strikes me that someone out there will know of a solution that could be orders fo magnitude faster than what i might come up with.
My first rough idea is to have a schema like so:
TeaseImage(PropId, ImageId, ImageUrl, Weighting, RetrievedCount, PropTotalRetrievedCount)
and then
select ImageRanks.*
from (Select t.ImageID,
t.ImageUrl,
rank() over (partition by t.RetrievedCount order by sum(t.RetrievedCount) desc) as IMG_Rank
from TeaseImage t
where t.RetrievedCount<t.Weighting
group by t.PropID) ImageRanks
where ImageRanks.IMG_Rank <= 1
And then
1. for each ImageId in the result set increment RetrievedCount by 1 and then
2. for each PropId in ResultSet increment PropTotalRetrievedCount by 1 and then
3. for each PropId in ResultSet check if PropTotalRetrievedCount ==10 and if so reset it to PropTotalRetrievedCount = 0 and RetrievedCount=0 for each associated ImageId
Which frankly sounds awful :(
So any ideas?
Note: if I have to step out of the datalayer I'd be using C# / .Net. Thanks.
If you want to do this entirely in your database, you could split your table in two:
Image(ImageId, ImageUrl)
TeaseImage(TeaseImageId, PropId, ImageId, DateLastAccessed)
The TeaseImage table manages weightings by storing additional (redundant) copies of each property-image pair. So an image with a weight of six would get six records.
Then the following query gives you the least-recently used record.
select top 1 ti.TeaseImageId, i.ImageUrl
from TeaseImage ti
join Image i
on i.ImageId = ti.ImageId
where ti.PropId = #PropId
order by ti.DateLastAccessed
Following the select, just update the record's DateLastAccessed. (Or even update it as part of the select procedure, depending on how fault-tolerant you need to be.)
Using this technique would give you fine-grained control over the order of image delivery, (by seeding their DateLastAccessed values appropriately) and you could easily modify the ratios if need be.
Of course, as the table grows, the additional records would degrade query performance earlier than other approaches, but depending on the cost of the query relative to everything else that's going on that may not be significant.