Searching by weight, length, depth, etc. using the Best Buy API

The Best Buy Search API allows searching for products by weight, depth, and length using the < and > operators. For example:
weight<4
However, it seems that the API uses alphabetical comparison rather than actual numeric comparison for those fields and operators.
For example, it will consider that 11 < 4.
Is there any way to force the API to properly evaluate these conditions?

The "weight" attribute in the API is a string field that contains both the value and the unit of measure. The unit of measure may differ per product. So it is not possible for the API to evaluate the "weight" attribute numerically in searches and sorting.
However - there is also a "shippingWeight" attribute in the API, which IS evaluated numerically and the unit of measure is pounds. When searching and sorting by "shippingWeight", the API will treat the value numerically. Of course, "shippingWeight" is not the same as the raw "weight" attribute - but I hope it will be sufficient for your use case.
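For example (the exact URL shape is an assumption here and depends on your endpoint and API key, but this follows the usual products() filter style), a filter like:
shippingWeight<4
e.g. /v1/products(shippingWeight<4)?format=json&apiKey=YOUR_API_KEY
should be compared numerically, so an 11 lb item is no longer considered "less than" a 4 lb one.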

Related

How does score multiplier operator in Watson Discovery service work?

I have a set of JSON documents uploaded to my WDS instance. I want to understand the importance of the score multiplier operator (^). The documentation just says, "Increases the score value of the search term". I tried a simple query on one field, and it multiplies the score by the number specified.
If I specify two fields and want Watson Discovery to know which of the two is more important for the search, is the score multiplier applicable in this case? With two fields and a score multiplier applied to one, I could not identify any difference. Also, which data types is this allowed on? It didn't work with a number.
I found this out through some more experiments. The score multiplier is used when you want to increase the relative importance of fields in the query. So, for example, if you want to give more importance to Name.LastName, as in the example below:
Name.FirstName:"ABC",Name.LastName:"DEF"^3
Here, LastName is given more relevance, and the search results are ordered accordingly.
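For instance (the field names are placeholders and the multiplier values are arbitrary, following the same syntax as above), you could boost both fields and still make LastName count more:
Name.FirstName:"ABC"^2,Name.LastName:"DEF"^5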
Could be useful for someone.

Similarity matching algorithm

I have products with different details in different attributes, and I need to develop an algorithm to find the products most similar to the one I'm looking for.
For example, if a product has:
Weight: 100lb
Color: Black, Brown, White
Height: 10in
Conditions: new
Other products can have different colors, weights, etc. I then need to do a search where the most similar products are returned first. For example, if everything matches but a product's colors are only Black and White (not Brown), it is a better match than another product that is only Black (not White or Brown).
I'm open to suggestions as the project is just starting.
One approach I could take, for example, is to restrict each attribute (weight, color, size) to a limited set of options, so I can build a binary representation. Then I would have something like this for each product:
Colors        Weight     Height     Condition
00011011000   10110110   10001100   01
Then if I do an XOR between the product's binary representation and my search, I can calculate the number of set bits to see how similar they are (all zeros would mean exact match).
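As a rough sketch of that comparison in Java (assuming the attribute-to-bit encoding has already been done; the masks below are made-up values):

// Hamming-style similarity: XOR the two masks and count the set bits.
// A result of 0 means an exact match on every encoded attribute.
public final class BitmaskSimilarity {

    public static int distance(long productMask, long queryMask) {
        return Long.bitCount(productMask ^ queryMask);
    }

    public static void main(String[] args) {
        long query    = 0b00011011_10110110L; // hypothetical encoded search
        long productA = 0b00011001_10110110L; // differs in one bit
        long productB = 0b00010001_10110100L; // differs in three bits
        System.out.println(distance(productA, query)); // 1 -> closer match
        System.out.println(distance(productB, query)); // 3
    }
}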
The problem with this approach is that I cannot index that on a database, so I would need to read all the products to make the comparison.
Any suggestions on how I can approach this? Ideally I would like to have something I can index on a database so it's fast to query.
A further question: it would also be great if I could use different weights for each attribute.
You basically need to come up with a distance metric to define the distance between two objects. Calculate the distance from the object in question to each other object, then you can either sort by minimum distance or just select the best.
Without some highly specialized algorithm based on the full data set, the best you can do is a linear time distance comparison with every other item.
You can estimate the nearest by keeping sorted lists of certain fields such as Height and Weight and cap the distance at a threshold (like in broad phase collision detection), then limit full distance comparisons to only those items that meet the thresholds.
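For illustration, a minimal sketch of such a distance metric in Java, including the per-attribute weights asked about in the question (the Product fields and the weight values are hypothetical and would need tuning):

import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A hand-rolled weighted distance: smaller means more similar.
record Product(double weightLb, double heightIn, Set<String> colors, String condition) {}

class ProductDistance {
    // Relative importance of each attribute (arbitrary example values).
    static final double W_WEIGHT = 1.0, W_HEIGHT = 1.0, W_COLOR = 2.0, W_CONDITION = 0.5;

    static double distance(Product a, Product b) {
        double d = 0;
        d += W_WEIGHT * Math.abs(a.weightLb() - b.weightLb()) / 100.0; // crude normalisation
        d += W_HEIGHT * Math.abs(a.heightIn() - b.heightIn()) / 10.0;
        // Colour term: fraction of colours not shared (1 - Jaccard similarity).
        Set<String> union = new HashSet<>(a.colors());
        union.addAll(b.colors());
        Set<String> inter = new HashSet<>(a.colors());
        inter.retainAll(b.colors());
        d += W_COLOR * (union.isEmpty() ? 0 : 1.0 - (double) inter.size() / union.size());
        d += W_CONDITION * (a.condition().equals(b.condition()) ? 0 : 1);
        return d;
    }

    // Linear scan: sort every candidate by its distance to the query product.
    static List<Product> mostSimilarFirst(Product query, List<Product> candidates) {
        return candidates.stream()
                .sorted(Comparator.comparingDouble(p -> distance(query, p)))
                .toList();
    }
}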
What you want to do is a perfect use case for Elasticsearch and other similar search-oriented databases. I don't think you need to hack around with bitmasks and the like.
You would typically maintain your primary data in your existing database (SQL/Cassandra/Mongo/etc.; anything works) and copy the things that need searching to Elasticsearch.
What you are talking about is very similar to BK-trees. A BK-tree is a search tree built around a metric defined on its keys. The most common use of this tree is string correction with the Levenshtein or Damerau-Levenshtein distance. It is not a static data structure, so it supports future insertion of elements.
When you search for an exact element (or insert an element), you walk down the tree, at each node following the link whose weight equals the distance between that node's key and your element. If you want to find similar objects, you descend into several children at once, namely those whose edge weights fall within your distance constraint. (It might even be possible to use A* to quickly find the single most similar object.)
A simple example of a BK-tree (from the second link), where each edge label is the distance between the parent key and the child key:
BOOK
|-- (1) BOOKS
|       `-- (2) BOO
`-- (4) CAKE
        |-- (1) CAPE
        `-- (2) CART
Your metric would be the Hamming distance (the count of differing bits between the binary representations of two objects).
But is it a good idea to compare two integers by the number of differing bits in their representations? With Hamming distance, HD(10000, 00000) == HD(10000, 10001); that is, the difference between 16 and 0 is treated the same as the difference between 16 and 17. Is that really what you need?
BK-tree with details:
https://hamberg.no/erlend/posts/2012-01-17-BK-trees.html
https://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/
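To make the BK-tree idea concrete for the bitmask representation discussed in this question, here is a compact illustrative sketch in Java (not taken from the linked posts), using Hamming distance as the metric:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// BK-tree keyed by 64-bit bitmasks; each child edge is labelled with the
// Hamming distance between the child's key and its parent's key.
class BkTree {
    private static final class Node {
        final long key;
        final Map<Integer, Node> children = new HashMap<>();
        Node(long key) { this.key = key; }
    }

    private Node root;

    private static int hamming(long a, long b) { return Long.bitCount(a ^ b); }

    public void insert(long key) {
        if (root == null) { root = new Node(key); return; }
        Node node = root;
        while (true) {
            int d = hamming(key, node.key);
            if (d == 0) return;                              // key already present
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(key)); return; }
            node = child;
        }
    }

    // Returns all stored keys within maxDistance of the query key.
    public List<Long> search(long query, int maxDistance) {
        List<Long> result = new ArrayList<>();
        if (root != null) search(root, query, maxDistance, result);
        return result;
    }

    private void search(Node node, long query, int maxDistance, List<Long> out) {
        int d = hamming(query, node.key);
        if (d <= maxDistance) out.add(node.key);
        // By the triangle inequality, only subtrees whose edge weight lies in
        // [d - maxDistance, d + maxDistance] can contain matches; prune the rest.
        for (int w = Math.max(0, d - maxDistance); w <= d + maxDistance; w++) {
            Node child = node.children.get(w);
            if (child != null) search(child, query, maxDistance, out);
        }
    }
}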

Search API - HTTP Query Argument Format

I've created a search API for a site that I work on. For example, some of the queries it supports are:
/api/search - returns popular search results
/api/search?q=car - returns results matching the term "car"
/api/search?start=50&limit=50 - returns 50 results starting at offset 50
/api/search?user_id=3987 - returns results owned by the user with ID 3987
These query arguments can be mixed and matched. It's implemented under the hood using Solr's faceted search.
I'm working on adding query arguments that can filter results based on a numeric attribute. For example, I might want to only return results where the view count is greater than 100. I'm wondering what the best practice is for specifying this.
Solr does it this way:
/api/search?views:[100 TO *]
Google seems to do something like this:
/api/search?viewsisgt:100
Neither of these seem very appealing to me. Is there a best practice for specifying this kind of query term? Any suggestions?
Simply use ',' as the separator for from/to; it reads best and, in my view, is intuitive:
# fixed from/to
/search?views=4,2
# upper wildcard
/search?views=4,
# lower wildcard
/search?views=,4
I treat the values as inclusive. In most circumstances you won't need additional syntax sugar for exclusive/inclusive bounds.
Binding even works very well out of the box in some frameworks (like Spring MVC), which bind ','-separated values to an array of values. You could then wrap the internal array with specific accessors (getMin(), getMax()).
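As an illustration of that wrapper idea in Java (the Range class and its methods are made up; a framework like Spring MVC can do the comma splitting for you):

// Parses "from,to" with empty sides acting as wildcards, e.g. "100,200",
// "100," (no upper bound) or ",200" (no lower bound). Bounds are inclusive.
public final class Range {
    private final Integer min; // null = no lower bound
    private final Integer max; // null = no upper bound

    private Range(Integer min, Integer max) { this.min = min; this.max = max; }

    public static Range parse(String raw) {
        String[] parts = raw.split(",", -1);   // -1 keeps a trailing empty part
        Integer lo = parts[0].isEmpty() ? null : Integer.valueOf(parts[0]);
        Integer hi = parts.length < 2 || parts[1].isEmpty() ? null : Integer.valueOf(parts[1]);
        return new Range(lo, hi);
    }

    public Integer getMin() { return min; }
    public Integer getMax() { return max; }
}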
Google's approach is good; why is it not appealing?
Here is my suggestion:
/api/search?viewsgt=100
I think the mathematical notation for intervals is suitable.
[x - inclusive lower bound: the value must be at least x
x] - inclusive upper bound: the value must be at most x
(x - exclusive lower bound: the value must be strictly greater than x
x) - exclusive upper bound: the value must be strictly less than x
Hence,
q=cats&range=(100,200) - the results from 100 to 200, but not including 100 and 200
q=cats&range=[100,200) - the results from 100 to 200, including 100 but not including 200
q=cats&range=[100 - any number from 100 onwards
q=cats&range=(100 - any number greater than 100
q=cats&range=100,200 - default, same as [100,200]
Sure, its aesthetics are still questionable, but it seems (IMO) the most intuitive for the human eye, and the parser is still easy.
As per http://en.wikipedia.org/wiki/Percent-encoding, the characters =, &, [, ], ( and ) are reserved.
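For what it's worth, a sketch in Java of how a server might decode that notation (the class and field names are hypothetical, and the brackets would arrive percent-encoded in a real URL):

// Parses "[100,200)", "(100", "100,200", etc. A missing bracket defaults to an
// inclusive bound, and a missing number means the range is unbounded on that side.
public record NumericRange(Integer min, boolean minInclusive,
                           Integer max, boolean maxInclusive) {

    public static NumericRange parse(String raw) {
        boolean minInc = !raw.startsWith("(");       // '[' or no bracket -> inclusive
        boolean maxInc = !raw.endsWith(")");         // ']' or no bracket -> inclusive
        String stripped = raw.replaceAll("^[\\[(]|[\\])]$", "");
        String[] parts = stripped.split(",", -1);
        Integer lo = parts[0].isEmpty() ? null : Integer.valueOf(parts[0]);
        Integer hi = parts.length < 2 || parts[1].isEmpty() ? null : Integer.valueOf(parts[1]);
        return new NumericRange(lo, minInc, hi, maxInc);
    }
}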

Is it possible to affect a Lucene rank based on a numeric value?

I have content with various numeric values, and a higher value indicates (theoretically) more valuable content, which I want to rank higher.
For instance:
Average rating (0 - 5)
Number of comments (0 - whatever)
Number of inbound link references from other pages (0 - whatever)
Some arbitrary number I apply to indicate how important I feel the content is (1 - whatever)
These can be indexed by Lucene as numeric values, but how can I tell Lucene to use these values in its ranking algorithm?
You can set this value using "Field.SetBoost" while indexing.
Depending on how exactly you want to proceed, you can set the boost while indexing, as suggested by #L.B, or, if you want to make it dynamic (i.e. at search time rather than indexing time), you can use ValueSourceQuery and CustomScoreQuery.
You can see example in the question I asked some time ago:
Lucene custom scoring for numeric fields (the example was tested with Lucene 3.0).
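As a rough sketch of the second (search-time) approach, assuming the Lucene 3.x function package and a hypothetical numeric field called "rating":

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.FieldScoreQuery;

public class PopularityBoost {
    // Wraps a text query so that the stored numeric "rating" field is
    // multiplied into each document's relevance score (the default
    // CustomScoreQuery behaviour in Lucene 3.x).
    public static Query boostByRating(Query textQuery) {
        FieldScoreQuery rating = new FieldScoreQuery("rating", FieldScoreQuery.Type.FLOAT);
        return new CustomScoreQuery(textQuery, rating);
    }

    public static void main(String[] args) {
        Query base = new TermQuery(new Term("body", "lucene"));
        System.out.println(boostByRating(base));
    }
}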

How to store an approximate number? (number is too small to be measured)

I have a table representing standards of alloys. The standard is partly based on the chemical composition of the alloys. The composition is presented in percentages. The percentage is determined by a chemical composition test. Sample data.
But sometimes, the lab cannot measure below a certain percentage. So they indicate that the element is present, but the percentage is less than they can measure.
I was unsure how to accurately store such a number in an SQL database. I thought of storing the number with a negative sign: no element can have a negative composition, of course, so I can interpret a negative value as meaning "less than the specified value". Another option is to add an extra column for each element, which I really don't like.
Any other ideas? It's a small issue if you think about it, but I think a crowd is always wiser. Somebody might have a neater solution.
Question updated:
Thanks for all the replies.
The test results come from different labs, so there is no common lower bound.
When the percentage of Titanium is less than 0.0004, for example, the number is still important; only the formula differs slightly in that case.
Hence the value cannot be stored as NULL, and I don't know the lower bound for all values.
Tricky one.
Another possibility I thought of is to store it as a string. Any other ideas?
What you're talking about is a sentinel value. It's a common technique; C-style strings, after all, use 0 as a sentinel end-of-string value. You can do that. You just need to find a number that makes sense and isn't used for anything else. Many string functions will return -1 to indicate that what you're looking for isn't there.
0 might work because, if the element isn't there, there shouldn't even be a record; but it has the problem that it might be mistaken for actually meaning 0. -1 is another option, and it obviously doesn't have that problem.
Another column to indicate whether the amount is measurable or not is also a viable option. The case for this one becomes stronger if you need to store different categories of trace elements (e.g. <1%, <0.1%, <0.01%, etc.). Storing the negative of those numbers seems a bit hacky to me.
You could just store it as NULL, meaning that the value exists but is undefined.
Any arithmetic operation with a NULL yields a NULL.
Division by NULL is safe.
NULLs are ignored by the aggregate functions, so queries like this:
SELECT   SUM(metal_percent), COUNT(metal_percent)
FROM     alloys
GROUP BY metal
will give you the sum and the count of the actual, defined values, not taking the unfilled values into account.
I would use a threshold value which is at least one significant digit smaller than your smallest expected value. This way you can logically say that any value less than, say, 0.01 can be presented to your application as a "trace" amount. This remains easy to understand and gives you flexibility in determining where your threshold should lie.
Since the constraints on the values are well defined (an element cannot have a negative composition), I would go for the "negative value to indicate less-than" approach. As long as this use of sentinel values is sufficiently documented, it should be reasonably easy to implement and maintain.
An alternative but similar method would be to add 100 to the values, assuming that you can't get more than 100%. So <0.001 becomes 100.001.
I would have a table modeling the certificate, in a one-to-many relation with another table storing the values for the elements. The elements table would contain the value in one column and a "less than" flag in a separate column.
Draft:
create table CERTIFICATES
(
  PK_ID  integer,
  NAME   varchar(128)
)

create table ELEMENTS
(
  ELEMENT_ID      varchar(2),   -- element identifier (e.g. the chemical symbol)
  CERTIFICATE_ID  integer,      -- references CERTIFICATES.PK_ID
  CONCENTRATION   number,
  MEASURABLE      integer       -- flag: 0 = only a trace ("less than") amount present
)
Depending on the database engine you're using, the types of the columns may vary.
Why not add another column to store whether or not it's a trace amount?
This will also allow you to save the amount that the trace is less than.
Since there is no common lowest threshold value and NULL is not acceptable, the cleanest solution now is to have a marker column which indicates whether there is a quantifiable amount or a trace amount present. A value of "Trace" would indicate to anybody reading the raw data that only a trace amount was present. A value of "Quantity" would indicate that you should check an amount column to find the actual quantity present.
I would have to warn against storing numerical values as strings. It will inevitably add additional pain, since you now lose the assertions a strong type definition gives you. When your application consumes the values in that column, it has to read the string to determine whether it's a sentinel value, a numeric value or simply some other string it can't interpret. Trying to handle data conversion errors at this point in your application is something I'm sure you don't want to be doing.
Another field seems like the way to go; call it 'MinMeasurablePercent'.