Cost comparison using Solr - lucene

I plan to build something like pricegrabber.com/google product search.
Assume I already have the data available in a huge table. I plan to submit this all to Solr. This solves the problem of search. However I am not sure how to do comparison. I can do a group by query(on UPC/SKU) for the products returned by Solr on the DB. However, I dont want to do that. I want to somehow get product comparison data returned to me along with search from Solr itself.
How do you think should my schema be? Do you think this use-case can be solved all by Solr/Sphinx?

You need 'result grouping' or 'field collapsing' support to properly handle it.
In Solr, the feature is not available in any release version and is still under development. If you are willing to use an unreleased version of Solr, then get the details here.
Sphinx supports result grouping and I had used it a long time ago in a similar project. You can get more details here.
An alternative strategy could be to preprocess your data so that only a single record per UPC/SKU gets inserted in the index. Each record can have a separate field containing the ids of all the items with the same UPC/SKU.
Doing a database GROUP BY on the products returned by Solr may not be enough. For example, if products A and B have the same UPC and a certain query matches A but not B, then you will not get both A and B in your result set.

Related

How can I fetch (via GET) all JIRA issues? Do I go to the Search node?

It looks like /api/2/project easily returns all projects in a JIRA instance in JSON format.
I'd like to do the same for issues, but this does not appear to exist.
Is /api/2/search the standard way to do a mass-dump like this? And what is the best way to regularly update this to a database? Would I do something like search (update date > [last entry in database]) and then go through the pagination? Surely I can't be the first person attempting this, though I see no similar guide anywhere online to this (I checked Jira's own docs, no mass-issue-export guide really).
EDIT: Okay it looks like search really is the "issue dump" and not the issue node which, contrary to their documentation, does not default to a collection but really for creating issues or listing one at at time. I'll probably go the route of updated > [whatever last date is in the DB]
Unless you have very few issues, you can't fetch all of them at once.
What you can do is to execute the search step by step.
For example, lets say you have 1324 JIRA issues. In order to retrive all of them you have to execute a search similar to this several times:
/rest/api/2/search?&maxResults=100&startAt=0
This will retrive the first 100 JIRA issues starting from 0.
How to get the others?
When you execute the search, a field named total is returned. That field is the number of the total JIRA issues in your system (1324 issues).
The next query will be:
/rest/api/2/search?&maxResults=100&startAt=100
Repeat this operation, incrementing the value of startAt by 100 every time, until all the issues are returned.

Querying Apache Solr based on score values

I am working on an image retrieval task. I have a dataset of wikipedia images with their textual description in xml files (1 xml file per image). I have indexed those xmls in Solr. Now while retrieving those, I want to maintain some threshold for Score values, so that docs with less score will not come in the result (because they are not of much importance). For example I want to retrieve all documents having similarity score greater than or equal to 2.0. I have already tried range queries like score:[2.0 TO *] but can't get it working. Does anyone have any idea how can I do that?
What's the motivation for wanting to do this? The reason I ask, is
score is a relative thing determined by Lucene based on your index
statistics. It is only meaningful for comparing the results of a
specific query with a specific instance of the index. In other words,
it isn't useful to filter on b/c there is no way of knowing what a
good cutoff value would be.
http://lucene.472066.n3.nabble.com/score-filter-td493438.html
Also, take a look here - http://wiki.apache.org/lucene-java/ScoresAsPercentages
So, in general it's bad to cut off by some value, because you'll never know which threshold value is best. In good query it could be score=2, in bad query score=0.5, etc.
These two links should explain you why you DONT want to do it.
P.S. If you still want to do it take a look here - https://stackoverflow.com/a/15765203/2663985
P.P.S. I recommend you to fix your search queries, so they will search better with high precision (http://en.wikipedia.org/wiki/Precision_and_recall)

Solr sort different criteria for each subset

We are using Apache SOLR for full text search. We have specific requirement for sorting the search results - basically when querying for data, we need 2 sets of data - A and B, but each set should have its own sorting criteria and we cannot make 2 different calls. We can get 2 sets by using an OR condition, but how do we sort each set differently ? To illustrate, if :
Set A = {3,1,2}
Set B = {8,5,9}
So, the expected response can have set A returned in ascending order {1,2,3} but the set B can be returned in descending order {9,8,5}
I believe the default sort in SOLR will sort the entire results sets. Any suggestions or if the question is not clear,let me know.
You can possibly achieve this using FieldCollapsing
You might need to do a little more work - i.e. have a display order field(could be an integer) so that Solr knows one field that it needs to sort by.
Next you could use a query like this -
&q=*&group=true&group.field=set&group.sort=display_order
I would recommend keeping the logic such as this out of Solr, it isn't meant to be a substitute for Relational Databases, and getting it to do complex SQL like operation (while some are possible) is going to be tricky.
By the way there is an open issue in Solr's JIRA that addresses batch processing of multiple queries. Which means, when it is merged into a release, you could fire n different queries to fetch these sets in one call to Solr.
If you are keen to have SOLR perform this task for you, the patch is available in the JIRA card, you could create a build for yourself and let us all know how it goes :)

django objects...values() select only some fields

I'm optimizing the memory load (~2GB, offline accounting and analysis routine) of this line:
l2 = Photograph.objects.filter(**(movie.get_selectors())).values()
Is there a way to convince django to skip certain columns when fetching values()?
Specifically, the routine obtains all rows of the table matching certain criteria (db is optimized and performs it very quickly), but it is a bit too much for python to handle - there is a long string referenced in each row, storing the urls for thumbnails.
I only really need three fields from each row, but, if all the fields are included, it suddenly consumes about 5kB/row which sadly pushes the RAM to the limit.
The values(*fields) function allows you to specify which fields you want.
Check out the QuerySet method, only. When you declare that you only want certain fields to be loaded immediately, the QuerySet manager will not pull in the other fields in your object, till you try to access them.
If you have to deal with ForeignKeys, that must also be pre-fetched, then also check out select_related
The two links above to the Django documentation have good examples, that should clarify their use.
Take a look at Django Debug Toolbar it comes with a debugsqlshell management command that allows you to see the SQL queries being generated, along with the time taken, as you play around with your models on a django/python shell.

How are the Apache Solr search results sequenced?

I have Apache Solr 6.x-1.0-rc3 module installed in my site and it works fine. I wanted to know how are the Apache Solr search results sequenced. I have tried a few things and have concluded that it's not alphabetical or according to the recently updated node.
How are the search results sequenced? I mean in what order or logic.
Basically, it's scored based on the strength of the match with a few boosting factors added. There's a great breakdown on the algorithm here:
http://www.supermind.org/blog/378/lucene-scoring-for-dummies
Yeah, Solr results are sorted by relevance score by default. The fields within a single result should be in the same order in which they were committed (at least within multi-valued fields...should be true for all fields, though), so you should be able to preserve some field sorting that you might do in the processing script. If you just show all results, I think the results are ordered by the order they were added to the index.
The Solr Wiki page (http://wiki.apache.org/solr/CommonQueryParameters) says the default sort order is score desc.