Advanced search with Apache SOLR

I have a problem filtering data.
I already have data that contains a size per product:
{"product_name":"new jeans","product_size_url":"27"},
{"product_name":"new sporty shoes ","product_size_url":"39"},
{"product_name":"new shoes ","product_size_url":"45"}
How do I build the query so it shows only the data with size 27 or 45?
I really need help with this case.
Thanks.

You already have a query that returns the size per product? If you don't want the size restriction to affect the relevancy scores of your query, use a filter query (fq).
"This parameter can be used to specify a query that can be used to restrict the super set of documents that can be returned, without influencing score. It can be very useful for speeding up complex queries since the queries specified with fq are cached independently from the main query."
https://wiki.apache.org/solr/CommonQueryParameters#fq
So you would query like this
q=product_name:new&fq=product_size_url:(27 OR 45)
That would find all products with the word "new" in the name, and then restrict that superset by applying the filter query product_size_url:(27 OR 45).
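As a full request it might look like this (a sketch assuming a core named products on a default local install; the spaces in the fq value are URL-encoded as +):
http://localhost:8983/solr/products/select?q=product_name:new&fq=product_size_url:(27+OR+45)&wt=json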

Related

Find out the amount of space each field takes in Google BigQuery

I want to optimize the space of my BigQuery and Google Storage tables. Is there a way to easily find out the cumulative space that each field in a table takes? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (and not running) the query below, changing <column_name> to the field you are interested in:
SELECT <column_name>
FROM YourTable
and then looking at the validation message, which reports the respective size.
Important: you do not need to run the query. Just check the validation message for bytesProcessed; this will be the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such column profiling for many tables, or for a table with many columns, you can script it in your preferred language: use the Tables.get API to get the table schema, then loop through all the fields, build the respective SELECT statement for each one, dry-run it (within the loop, for each column), and read totalBytesProcessed, which, as you already know, is the size of the respective column.
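A minimal sketch of that loop using the google-cloud-bigquery Python client (the project, dataset and table names are placeholders, not something from the question):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table id; replace with your own project.dataset.table
table_id = "my-project.my_dataset.my_table"
table = client.get_table(table_id)  # Tables.get: fetches the table schema

# Dry run: the query is validated and priced but never executed, so it is free
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

for field in table.schema:
    sql = "SELECT {} FROM `{}`".format(field.name, table_id)
    job = client.query(sql, job_config=job_config)
    # For a single-column scan, total_bytes_processed is that column's size
    print(field.name, job.total_bytes_processed)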
I don't think this is exposed in any of the metadata.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
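For example, BigQuery counts an INT64 or FLOAT64 value as 8 bytes and a STRING value as 2 bytes plus its UTF-8 length (see the data size calculation section of the pricing page above), so a numeric column over 10 million rows works out to roughly 80 MB.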
For types such as STRING, you could get the average length by querying, for example, the first 1000 values, and use that in your storage calculations.
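A sketch of such a sampling query in BigQuery standard SQL (the column and table names are placeholders):

SELECT AVG(LENGTH(my_string_column)) AS avg_length
FROM (
  SELECT my_string_column
  FROM `my-project.my_dataset.my_table`
  LIMIT 1000
)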

Filter the fields returned by Elasticsearch hits to enhance performance (source filtering)

We indexed documents with around 70 fields. Some of them have store=yes but are not indexed, and others have store=no but are indexed (some analyzed, some not analyzed). When querying, our .NET client (the one talking to the ES cluster for search) retrieves the complete documents that match the search.
We want to enhance performance, but we don't need all the fields of the indexed documents (the fields required vary from query to query and are passed as view columns).
At the query level (JSON query body), what is the best way to do this filtering? Source filtering, maybe? I am not sure; I googled, but the documentation is very immature. Is there a way to specify in the query that for this search request body I want only these fields?
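For reference, source filtering is one way to do this: list the fields you want under _source in the search request body, and only those fields come back in each hit's _source. A minimal sketch (the index and field names are placeholders):

POST /my_index/_search
{
  "_source": ["field_one", "field_two"],
  "query": {
    "match": { "field_one": "search terms" }
  }
}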

SOLR search filter by relevancy score

So each SOLR search result has its own relevancy score:
https://wiki.apache.org/solr/SolrRelevancyFAQ
"How can I see the relevancy scores for search results
Request that the pseudo-field named "score" be returned by adding it
to the fl (field list) parameter. The "score" will then appear along
with the stored fields in returned documents. q=Justice
League&fl=*,score"
My question is: is it possible to filter SOLR results by this relevancy score?
E.g. perform a query along the lines of the following:
Search for keyword "LOL" and only fetch documents whose relevancy score > 50
If it's possible, how would you go about specifying this query syntactically?
You can specify a maximum number of results to return. The results will appear in descending order by score, so you could stop processing at a specific point in the result set.
solr/search/select?q=LOL&start=0&rows=10&fl=*%2Cscore
See the following article for a discussion about setting a minimum score: Is it possible to set a Solr Score threshold 'reasonably', independent of results returned? (i.e. Is Solr Scoring standardized in any way)
I spent hours trying to filter out values with a relevance score of 0. I couldn't find any straightforward way to do this. I ended up accomplishing it with a workaround that assigns the query function to a local param. I call this local param in both the query ("q=") and the filter query ("fq=").
Example
Let's say you have a query like:
q={!func}sum(*your arguments*)
First, make the function component its own parameter:
q={!func}$localParam
&localParam={!func}sum(*your arguments*)
Now, to return only results with scores between 1 and 10, simply add a filter query on that localParam:
q={!func}$localParam
&localParam={!func}sum(*your arguments*)
&fq={!frange l=1 u=10 inclusive=true}$localParam
Solr 6.6:
Add a Solr filter query (fq):
q=SEARCH_PHRASE
&fq={!frange l=50.0}query($q,0)
In this case Solr will return only the results with "score" >= 50.0.
$q in query($q,0) acts as a reference to the q parameter (the second argument, 0, is the score used for documents that do not match it).
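Putting this together with the example from the question (the endpoint and the fl/start/rows parameters follow the earlier answer; note the threshold is a raw Lucene score, not a percentage):
solr/search/select?q=LOL&fq={!frange l=50.0}query($q,0)&fl=*,score&start=0&rows=10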

Solr sort different criteria for each subset

We are using Apache SOLR for full-text search. We have a specific requirement for sorting the search results: basically, when querying for data, we need 2 sets of data, A and B, but each set should have its own sorting criteria, and we cannot make 2 different calls. We can get the 2 sets by using an OR condition, but how do we sort each set differently? To illustrate, if:
Set A = {3,1,2}
Set B = {8,5,9}
So the expected response can have set A returned in ascending order, {1,2,3}, but set B returned in descending order, {9,8,5}.
I believe the default sort in SOLR will sort the entire result set. Any suggestions? Or if the question is not clear, let me know.
You can possibly achieve this using FieldCollapsing
You might need to do a little more work, i.e. have a display-order field (it could be an integer) so that Solr has one field it can sort by.
Next you could use a query like this -
&q=*:*&group=true&group.field=set&group.sort=display_order asc
I would recommend keeping logic such as this out of Solr; it isn't meant to be a substitute for relational databases, and getting it to do complex SQL-like operations (while some are possible) is going to be tricky.
By the way, there is an open issue in Solr's JIRA that addresses batch processing of multiple queries, which means that once it is merged into a release, you could fire n different queries to fetch these sets in one call to Solr.
If you are keen to have SOLR perform this task for you, the patch is available in the JIRA card; you could create a build for yourself and let us all know how it goes :)

Cost comparison using Solr

I plan to build something like pricegrabber.com / Google Product Search.
Assume I already have the data available in a huge table. I plan to submit it all to Solr, which solves the problem of search. However, I am not sure how to do the comparison. I could do a GROUP BY query (on UPC/SKU) in the DB for the products returned by Solr, but I don't want to do that. I want to somehow get the product comparison data returned to me along with the search results from Solr itself.
What do you think my schema should be? Do you think this use case can be handled entirely by Solr/Sphinx?
You need 'result grouping' or 'field collapsing' support to properly handle it.
In Solr, the feature is not available in any release version and is still under development. If you are willing to use an unreleased version of Solr, then get the details here.
Sphinx supports result grouping and I had used it a long time ago in a similar project. You can get more details here.
An alternative strategy could be to preprocess your data so that only a single record per UPC/SKU gets inserted in the index. Each record can have a separate field containing the ids of all the items with the same UPC/SKU.
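For illustration, a pre-grouped document along those lines might look like this (the field names are made up, not a required schema):
{"upc":"012345678905","product_name":"new jeans","item_ids":["17","42","93"]}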
Doing a database GROUP BY on the products returned by Solr may not be enough. For example, if products A and B have the same UPC and a certain query matches A but not B, then you will not get both A and B in your result set.