Search API - HTTP Query Argument Format

I've created a search API for a site that I work on. For example, some of the queries it supports are:
/api/search - returns popular search results
/api/search?q=car - returns results matching the term "car"
/api/search?start=50&limit=50 - returns 50 results starting at offset 50
/api/search?user_id=3987 - returns results owned by the user with ID 3987
These query arguments can be mixed and matched. It's implemented under the hood using Solr's faceted search.
I'm working on adding query arguments that can filter results based on a numeric attribute. For example, I might want to only return results where the view count is greater than 100. I'm wondering what the best practice is for specifying this.
Solr does it this way:
/api/search?views:[100 TO *]
Google seems to do something like this:
/api/search?viewsisgt:100
Neither of these seem very appealing to me. Is there a best practice for specifying this kind of query term? Any suggestions?

Simply use ',' as the separator for from/to; it reads best and, in my view, is intuitive:
# fixed from/to
/search?views=2,4
# upper wildcard
/search?views=4,
# lower wildcard
/search?views=,4
I treat the values as inclusive. In most circumstances you won't need extra syntactic sugar for exclusive/inclusive bounds.
Binding even works out of the box in some frameworks (like Spring MVC), which bind comma-separated values to an array of values. You could then wrap the internal array with specific accessors (getMin(), getMax()).
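For illustration, a minimal sketch of such a wrapper (the IntRange class and its parsing are my own example, not from any framework; with Spring MVC you could bind the raw "views" parameter to a String and build the wrapper from it):

// Minimal sketch: parse "min,max" with optional wildcards ("100," / ",200")
// into an inclusive range. Names are illustrative.
public final class IntRange {
    private final Integer min; // null = unbounded below
    private final Integer max; // null = unbounded above

    private IntRange(Integer min, Integer max) {
        this.min = min;
        this.max = max;
    }

    public static IntRange parse(String raw) {
        String[] parts = raw.split(",", -1);      // keep empty parts, e.g. "100," -> ["100", ""]
        Integer lo = parts[0].isEmpty() ? null : Integer.valueOf(parts[0]);
        Integer hi = (parts.length < 2 || parts[1].isEmpty()) ? null : Integer.valueOf(parts[1]);
        return new IntRange(lo, hi);
    }

    public Integer getMin() { return min; }
    public Integer getMax() { return max; }
}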

Google's approach is good; why isn't it appealing to you?
Here is my suggestion:
/api/search?viewsgt=100

I think the mathematical notation for intervals is suitable:
[x - the lower limit is inclusive (at least x)
x] - the upper limit is inclusive (at most x)
(x - the lower limit is exclusive (strictly greater than x)
x) - the upper limit is exclusive (strictly less than x)
Hence,
q=cats&range=(100,200) - results from 100 to 200, excluding both 100 and 200
q=cats&range=[100,200) - results from 100 to 200, including 100 but excluding 200
q=cats&range=[100 - any number from 100 onwards
q=cats&range=(100 - any number greater than 100
q=cats&range=100,200 - default, same as [100,200]
Sure, its aesthetics are still questionable, but it seems (IMO) the most intuitive for the human eye, and the parser is still easy.
Per http://en.wikipedia.org/wiki/Percent-encoding, the characters =, &, [, ], ( and ) are reserved, so they may need to be percent-encoded in the URL.
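A rough sketch of what a parser for this notation could look like (the class and field names are made up for illustration, and it assumes the brackets arrive already percent-decoded):

// Rough sketch of parsing the bracket notation above, e.g. "(100,200)", "[100,200)",
// "[100", "100,200". Defaults are inclusive, matching the answer.
public final class RangeParam {
    public Double lower, upper;                    // null = unbounded
    public boolean lowerInclusive = true, upperInclusive = true;

    public static RangeParam parse(String raw) {
        RangeParam r = new RangeParam();
        raw = raw.trim();
        if (raw.startsWith("[")) { r.lowerInclusive = true;  raw = raw.substring(1); }
        else if (raw.startsWith("(")) { r.lowerInclusive = false; raw = raw.substring(1); }
        if (raw.endsWith("]")) { r.upperInclusive = true;  raw = raw.substring(0, raw.length() - 1); }
        else if (raw.endsWith(")")) { r.upperInclusive = false; raw = raw.substring(0, raw.length() - 1); }
        String[] parts = raw.split(",", -1);
        if (!parts[0].isEmpty()) r.lower = Double.valueOf(parts[0]);
        if (parts.length > 1 && !parts[1].isEmpty()) r.upper = Double.valueOf(parts[1]);
        return r;
    }
}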

Related

SQL Edit Distance: How have you handled Fuzzy String Matching using SQL in the past?

I have always wanted to ask for your views on this topic, so here we go:
My team just provided me with a list of customer accounts that we need to match against other databases. The main challenge we face is that our list is non-standardized, so the same accounts are named similarly to, but differently from, how they appear in our databases. For example:
My_List.Customers_Name    Customers_Database.Customers_Name
----------------------    ---------------------------------
Charles Schwab            Charles Schwab Corporation
So, for example, I use the Jaro-Winkler similarity function and edit distance to gather a list of similar strings, and then manually match the accounts if needed. My question is:
Which rules/filters do you apply to the results of the fuzzy data matching in order to reduce the amount of manual match?
I am referring to rules like:
If the string has more than 20 characters and the edit distance is <= 1, then it is probably the same account, so consider it a match. If the string has fewer than 4 characters and the edit distance is > 0, then it is probably not the same account, so consider it a mismatch.
The rules I apply are completely made up on my side. I am wondering whether there is some standard convention for applying fuzzy string matching so that it retrieves only useful results and reduces the manual matching workload.
If there is not, could you share your experience and how you have handled this before?
Thank you so much
I've done this a few times. It's hugely dependent on the data sets, and the rules change every time.
My process is:
select a random set of sample records to check my rule set - large enough to be representative, small enough to be able to scan visually.
create a "match" table with "original", "match" and "confidence score" columns.
write the rules, as "insert" or "update" statements to create records in the "match" table
run the rules on my sample data set
evaluate the matches on the samples. Tweak, add, configure the rules.
rinse & repeat
The "rules" depend hugely on the data set. Commonly I use the following:
strip out punctuation
apply common substitutions (e.g. "Corp" becomes "Corporation")
split into separate words; score the fraction of exact word matches out of 10 (so "Charles Schwab" matching "Charles Schwab Corporation" would be 2/3 ≈ 7 points, and "HSBC" matching "HSBC" is 1/1 = 10 points); see the sketch below
split into separate words; score the fraction of close word matches out of 5 (so "Chls Schwab" matching "Charles Schwab Corporation" would be 2/3 ≈ 3 points, and "HSBC" matching "HSCB" is 1/1 = 5 points)
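A minimal sketch of the exact-word-overlap rule above (the normalization and the 0-10 scaling are illustrative assumptions, not a standard formula):

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Minimal sketch of the "split into words, score exact overlap out of 10" rule.
public final class WordOverlapScore {

    public static int exactWordScore(String a, String b) {
        Set<String> wordsA = words(a);
        Set<String> wordsB = words(b);
        int total = Math.max(wordsA.size(), wordsB.size());
        if (total == 0) return 0;
        wordsA.retainAll(wordsB);                          // words present in both names
        return (int) Math.round(10.0 * wordsA.size() / total);
    }

    private static Set<String> words(String s) {
        // strip punctuation, lower-case, split on whitespace
        String cleaned = s.toLowerCase().replaceAll("[^a-z0-9 ]", " ");
        Set<String> out = new LinkedHashSet<>(Arrays.asList(cleaned.trim().split("\\s+")));
        out.remove("");
        return out;
    }

    public static void main(String[] args) {
        // "Charles Schwab" vs "Charles Schwab Corporation": 2 of 3 words match, ~7 points
        System.out.println(exactWordScore("Charles Schwab", "Charles Schwab Corporation"));
    }
}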

Does ZRANGEBYLEX support contains query?

How can I query my sorted set to get all keys containing some characters?
"Starts with" works fine but I need "contains".
I am using the query below for "starts with", which works fine:
zrangebylex zset [2110 "[2110\xff" LIMIT 0 10
Is there any way we can do a "contains" query?
No. The lexicographical range for Redis' Sorted Sets can only be used for prefix searches.
Note that by using another Sorted Set that stores the reverse of the values you can also perform a suffix search on the values. However, even combining these two approaches will not provide the functionality you need.
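For illustration, a sketch of that reversed-values idea with the Jedis client (key names and the exact Jedis method signatures are assumptions; a suffix search on the original values becomes a prefix search on the reversed ones):

import redis.clients.jedis.Jedis;

// Sketch: maintain a second sorted set holding each member reversed, then do the same
// prefix trick as in the question, but against the reversed set.
public final class SuffixSearch {

    public static void index(Jedis jedis, String member) {
        jedis.zadd("zset", 0, member);
        jedis.zadd("zset:rev", 0, new StringBuilder(member).reverse().toString());
    }

    public static void printMembersEndingWith(Jedis jedis, String suffix) {
        String prefix = new StringBuilder(suffix).reverse().toString();
        for (String hit : jedis.zrangeByLex("zset:rev", "[" + prefix, "[" + prefix + "\u00ff")) {
            System.out.println(new StringBuilder(hit).reverse().toString());  // undo the reversal
        }
    }
}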
Alternatively, you could perform a prefix search and then filter the results using a Lua script. Depending on your queries and data, this may or may not be an effective approach.
You could also consider implementing a full-text indexing mechanism on top of Redis, but that would be overkill in most cases; besides, there are existing, tested technologies that already do that.
But you can use ZSCAN with a glob-style pattern, for example to get all the strings that contain the characters "s" and/or "a":
ZSCAN key 0 MATCH *[sa]*
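A sketch of how that scan could look from Java with the Jedis client (it assumes a reasonably recent Jedis version, so exact method names may differ, and note that it walks the whole sorted set, so it is O(N)):

import java.util.ArrayList;
import java.util.List;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.ScanParams;
import redis.clients.jedis.ScanResult;
import redis.clients.jedis.Tuple;

// Sketch: collect all members of a sorted set whose value contains "needle".
public final class ZscanContains {

    public static List<String> membersContaining(Jedis jedis, String key, String needle) {
        List<String> hits = new ArrayList<>();
        ScanParams params = new ScanParams().match("*" + needle + "*").count(1000);
        String cursor = ScanParams.SCAN_POINTER_START;        // "0"
        do {
            ScanResult<Tuple> page = jedis.zscan(key, cursor, params);
            for (Tuple t : page.getResult()) {
                hits.add(t.getElement());
            }
            cursor = page.getCursor();                         // "0" again when the scan is done
        } while (!"0".equals(cursor));
        return hits;
    }
}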
From the original ZRANGEBYLEX documentation (see also the zzlCompareElements function implementation in the source code):
The elements are considered to be ordered from lower to higher strings as compared byte-by-byte using the memcmp() C function. Longer strings are considered greater than shorter strings if the common part is identical.
From memcmp documentation:
memcmp compares the first num bytes of the block of memory pointed by ptr1 to the first num bytes pointed by ptr2, returning zero if they all match or a value different from zero representing which is greater if they do not.
So you can't use ZRANGEBYLEX for a contains query, and I'm afraid there is no "lite" workaround, where "lite" means without patching the Redis source.

Search a Lucene NumericField for the maximum value

I know there is a NumericRangeQuery in Lucene, but is it possible to have Lucene simply return the maximum value stored in a NumericField? I could use a range query over the entire known range and then sort, but this is extremely cumbersome, and it may return a huge number of results if there are a lot of records.
The second parameter of IndexSearcher.search(Query query, int n, Sort sort) lets you limit the result to the top n hits (in your case 1), which, if you sort correctly, returns only the desired result. There are other overloaded methods that achieve the same thing.
Can't argue about the cumbersomeness though :)
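A sketch of that approach against a Lucene 3.x-era API (the field name "views" is an assumption, as is that the value was stored so it can be read back):

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

// Sketch: sort everything by the numeric field in descending order and keep only the top hit.
public final class MaxValueSearch {

    public static String findMaxViews(IndexSearcher searcher) throws Exception {
        TopDocs top = searcher.search(
                new MatchAllDocsQuery(),
                1,                                                         // n = 1: just the best hit
                new Sort(new SortField("views", SortField.LONG, true)));   // true = reverse (descending)
        if (top.totalHits == 0) return null;
        Document doc = searcher.doc(top.scoreDocs[0].doc);
        return doc.get("views");                                           // stored value, if stored
    }
}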
You could use a TermEnum to iterate through your index. Unfortunately, I don't think the terms are sorted in a way that makes finding the maximum instantaneous, but at least you won't have to do an actual search to find it. You will need to use NumericUtils to convert from Lucene's internal representation to a normal number.
This thread contains an example.

In Lucene, what is the accepted way to return an approximate result count without impacting performance?

I want my users' search results to include some idea of how many matches there were for a given search query.
However, after a bit of research and observation of my users' search logs, I've noticed a direct correlation between logged query speed and the number of total results and have determined that this is because I am accessing the totalHits property, which apparently has to iterate over the entire result set in order to return a value.
I would be happy to simply return an approximate value, maybe even just an order of magnitude indicating roughly how many results are available, but I can't see any good way to calculate this without noticeably affecting performance. I don't really want to dump a seemingly bottomless result set in front of the user without giving them any rough idea of how many results their search matched.
Any suggestions?
With boolean queries you can try to approximate:
|A or B| / |D| = ((|A| / |D|) + (|B| / |D|)) / 2
|A and B| / |D| = (|A| / |D|) * (|B| / |D|)
Where A and B are two terms, and |D| is the total number of documents. This is basically making an assumption of independence.
You can use the rewrite method to rewrite any query to a boolean query.
There isn't really a better way of doing this, but I've found that this assumption isn't too bad in practice. If you have a very small number of docs it might give bad answers though.
EDIT: as jpountz points out, my calculation for OR is wrong. Should be:
P(A U B) = 1 - P(~(AUB))
= 1 - P((~A) & (~B))
= 1 - P(~A)P(~B)
= 1 - (1 - P(A))(1 - P(B))
= 1 - (1 - P(A) - P(B) + P(A)P(B))
= P(A) + P(B) - P(A)P(B)
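For illustration, a small sketch of this estimate using document frequencies (Lucene 3.x-era IndexReader API; it bakes in the same independence assumption discussed above):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Sketch: approximate result counts for two-term OR / AND queries from docFreq alone.
public final class CountEstimate {

    public static int estimateOr(IndexReader reader, Term a, Term b) throws Exception {
        double d  = reader.numDocs();
        double pa = reader.docFreq(a) / d;                 // P(A)
        double pb = reader.docFreq(b) / d;                 // P(B)
        return (int) Math.round((pa + pb - pa * pb) * d);  // P(A or B) = P(A) + P(B) - P(A)P(B)
    }

    public static int estimateAnd(IndexReader reader, Term a, Term b) throws Exception {
        double d  = reader.numDocs();
        double pa = reader.docFreq(a) / d;
        double pb = reader.docFreq(b) / d;
        return (int) Math.round(pa * pb * d);              // P(A and B) = P(A)P(B)
    }
}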
Recent versions of Lucene have a collector dedicated to computing counts called TotalHitCountCollector.
It is usually faster than other collectors because:
it accepts documents out of order,
it doesn't need to compute scores,
it doesn't build the array of the top matches.
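A minimal sketch of using it (the collector and its methods are standard Lucene, but treat the snippet as illustrative):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TotalHitCountCollector;

// Sketch: count matches without scoring or collecting any documents.
public final class CountOnly {

    public static int count(IndexSearcher searcher, Query query) throws Exception {
        TotalHitCountCollector collector = new TotalHitCountCollector();
        searcher.search(query, collector);
        return collector.getTotalHits();
    }
}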
First, we should know what kind of query you want to do this for. For example, there is a very fast way to find out how many documents contain a particular term (the term's docFreq). So, say you have a conjunction of three terms: you can approximate the count with the smallest of the three docFreqs.
Regarding totalHits: this is just a value set by Lucene after completing the search. Accessing the property does no extra work and certainly doesn't iterate over all results.
Lucene always sets this (and knows how many results there are in total) when doing a search. It needs to do that in order to give you the requested top-N results (by score or sort field, depending on what you specified).
So it is actually the search itself that is slow in certain situations.
Have you checked what kind of queries are slow?
The combination of slow queries and many results can indicate some sort of wildcard/fuzzy queries.
General information for improving search speed can be found at http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
Based on your lucene-2.9.2 tag, I would first recommend that you try upgrading to the latest version, if possible, and measure again. There have been a lot of changes and improvements since 2.9.2.

Is it possible to affect a Lucene rank based on a numeric value?

I have content with various numeric values, and a higher value indicates (theoretically) more valuable content, which I want to rank higher.
For instance:
Average rating (0 - 5)
Number of comments (0 - whatever)
Number of inbound link references from other pages (0 - whatever)
Some arbitrary number I apply to indicate how important I feel the content is (1 - whatever)
These can be indexed by Lucene as a numeric value, but how can I tell Lucene to use this value in its ranking algorithm?
You can set this value using Field.setBoost while indexing.
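A minimal sketch of that index-time approach (Lucene 3.x-era Field API; the field name and the boost scaling are illustrative assumptions):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch: derive the field boost from the numeric attribute when indexing the document.
public final class BoostAtIndexTime {

    public static Document buildDoc(String titleText, float averageRating) {
        Document doc = new Document();
        Field title = new Field("title", titleText, Field.Store.YES, Field.Index.ANALYZED);
        title.setBoost(1.0f + 0.1f * averageRating);   // rating 0-5 maps to boost 1.0-1.5
        doc.add(title);
        return doc;
    }
}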
Depending on how exactly you want to proceed, you can set the boost while indexing, as suggested by #L.B, or, if you want to make it dynamic (i.e. applied at search time rather than indexing time), you can use ValueSourceQuery and CustomScoreQuery.
You can see an example in a question I asked some time ago:
Lucene custom scoring for numeric fields (the example was tested with Lucene 3.0).
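For illustration, a sketch of the search-time variant with the Lucene 3.0-era classes from the org.apache.lucene.search.function package (the field name "rating" is an assumption; by default CustomScoreQuery multiplies the text score by the field's value):

import org.apache.lucene.search.Query;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.FieldScoreQuery;

// Sketch: combine a normal text query with a per-document numeric value at search time.
public final class BoostAtSearchTime {

    public static Query boostedBy(Query textQuery) {
        FieldScoreQuery rating = new FieldScoreQuery("rating", FieldScoreQuery.Type.FLOAT);
        return new CustomScoreQuery(textQuery, rating);
    }
}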