Lucene group by

Hi, I have an index of simple documents with 2 fields:
1. profileId as long
2. profileAttribute as long
I need to know how many profileIds have a certain set of attributes.
For example, I index:
doc1: profileId: 1, profileAttribute: 55
doc2: profileId: 1, profileAttribute: 57
doc3: profileId: 2, profileAttribute: 55
and I want to know how many profiles have both attribute 55 and attribute 57. In this example the answer is 1, because only profile 1 has both attributes.
Thanks in advance for your help.

You can search for profileAttribute:(57 OR 55) and then iterate over the results and put their profileId property in a set in order to count the total number of unique profileIds.
But be aware that Lucene will perform poorly at this compared to, say, an RDBMS. This is because Lucene is an inverted index, meaning it is very good at retrieving the top documents that match a query. It is, however, not very good at iterating over the stored fields of a large number of documents.
However, if profileId is single-valued and indexed, you can get its values using Lucene's FieldCache, which saves you from performing costly disk accesses. The drawback is that this FieldCache will use a lot of memory (depending on the size of your index) and takes time to load every time you (re-)open your index.
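To illustrate the counting step, here is a minimal Python sketch; the hits list is only a stand-in for the stored fields you would read from the query results, and keeping the set of matched attributes per profileId lets you count only the profiles that have every requested attribute:
def count_profiles_with_all(hits, required):
    # hits: stored fields of the documents matched by profileAttribute:(55 OR 57)
    required = set(required)
    seen = {}  # profileId -> attributes matched so far
    for doc in hits:
        seen.setdefault(doc["profileId"], set()).add(doc["profileAttribute"])
    # a profile counts only if it matched every requested attribute
    return sum(1 for attrs in seen.values() if required <= attrs)

hits = [
    {"profileId": 1, "profileAttribute": 55},
    {"profileId": 1, "profileAttribute": 57},
    {"profileId": 2, "profileAttribute": 55},
]
print(count_profiles_with_all(hits, [55, 57]))  # prints 1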
If changing the index format is acceptable, this solution can be improved by making profileIds unique; your index would then have the following format:
doc1: profileId: [1], profileAttribute: [55, 57]
doc2: profileId: [2], profileAttribute: [55]
The difference is that profileIds are unique and profileAttribute is now a multi-valued field. To count the number of profileIds for a given set of profileAttributes, you now only need to query for those attributes (requiring all of them, e.g. profileAttribute:55 AND profileAttribute:57) and use a TotalHitCountCollector: the hit count is your answer.

Related

Why does select result fields double data scanned in BigQuery

I have a table with 2 integer fields x,y and few millions of rows.
The fields are created with the following code:
Field.newBuilder("x", LegacySQLTypeName.INTEGER).setMode(Field.Mode.NULLABLE).build();
If I run the following from the web:
SELECT x,y FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: "Valid: This query will process 64.9 MB when run."
compared to:
SELECT x FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: " Valid: This query will process 32.4 MB when run."
Selecting the second field doubles the data scanned.
I would expect it to first find the relevant rows based on the where clause and then bring in the extra field without scanning that entire second column.
Any input on why it doubles the data scanned and how to avoid it would be appreciated.
In my application I have hundreds of possible fields which I need to fetch for a very small number of rows (50) that answer the query.
Does this mean I will need to process the data of all those fields?
* I'm aware of how columnar databases work, but I wasn't aware of the huge price when you want to bring back lots of fields based on a very specific where clause.
The following link provides a very clear answer:
best-practices-performance-input
BigQuery does not have a concept of an index or anything like that. When you query a column, BigQuery will scan through all the values of that column and then perform the operations you want (for a deeper understanding, they have some pretty cool posts about the inner workings of BQ).
That means that when you select x and y where x = 1, BQ will read through all values of x and y and then find where x = 1.
This ends up being an amazing feature of BQ: you just load your data there and it just works. It does force you to be aware of how much data you retrieve from each query. Queries of the type select * from table should be used only if you really need all columns.
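If you want to see the cost before paying for it, the Python client can do a dry run and report the bytes that would be processed without actually running the query. A small sketch, using the standard-SQL form of the table name from the question:
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT x, y FROM `myproject.Test.Test` WHERE x = 1 LIMIT 50",
    job_config=job_config,
)
# no bytes are billed for a dry run; this just reports the scan size
print("This query would process %d bytes" % job.total_bytes_processed)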

Should I reverse order a queryset before slicing the first N records, or count it to slice the last N records?

Let's say I want to get the last 50 records of a query that returns around 10k records, in a table with 1M records. I could do (at the computational cost of ordering):
data = MyModel.objects.filter(criteria=something).order_by('-pk')[:50]
I could also do (at the cost of 2 database hits):
# assume I don't care about new records being added between
# the two queries being executed
index = MyModel.objects.filter(criteria=something).count()
data = MyModel.objects.filter(criteria=something)[index-50:]
Which is better for just an ordinary relational database with no indexing on the criteria (eg postgres in my case; no columnar storage or anything fancy)? Most importantly, why?
Does the answer change if the table or queryset is significantly bigger (eg 100k records from a 10M row table)?
This one is going to be very slow
data = MyModel.objects.filter(criteria=something)[index-50:]
Why? Because it translates into
SELECT * FROM myapp_mymodel OFFSET (index-50)
You are not enforcing any ordering here, so the server is going to have to calculate the whole result set and jump to the end of it; that involves a lot of reading and will be very slow. Let us also not forget that count() queries aren't all that hot either.
OTOH, this one is going to be fast
data = MyModel.objects.filter(criteria=something).order_by('-pk')[:50]
You are reverse ordering on the primary key and taking the first 50 of that, which are the last 50 records. And the first 50 records you can fetch equally quickly with
data = MyModel.objects.filter(criteria=something).order_by('pk')[:50]
So this is what you really should be doing
data1 = MyModel.objects.filter(criteria=something).order_by('-pk')[:50]
data2 = MyModel.objects.filter(criteria=something).order_by('pk')[:50]
The cost of ordering on the primary key is very low.
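If you want to see what each variant compiles to, you can print the SQL Django generates. This is a quick sketch reusing the model and variable names from the question, with the expected shape of the SQL noted in the comments:
qs_fast = MyModel.objects.filter(criteria=something).order_by('-pk')[:50]
print(qs_fast.query)   # ... ORDER BY "myapp_mymodel"."id" DESC LIMIT 50

qs_slow = MyModel.objects.filter(criteria=something)[index - 50:]
print(qs_slow.query)   # ... OFFSET (index - 50), with no ORDER BY or LIMIT

# if you need those last 50 rows back in ascending order, reverse them in Python:
last_50_ascending = list(qs_fast)[::-1]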

PostgreSQL, find strings that differ by n characters

Suppose I have a table like this
id data
1 0001
2 1000
3 2010
4 0120
5 0020
6 0002
sql fiddle demo
id is the primary key, data is a fixed-length string where the characters can be 0, 1 or 2.
Is there a way to build an index so I can quickly find strings which differ by n characters from a given string? For example, for the string 0001 and n = 1 I want to get row 6.
Thanks.
There is the levenshtein() function, provided by the additional module fuzzystrmatch. It does exactly what you are asking for:
SELECT *
FROM a
WHERE levenshtein(data, '1110') = 1;
SQL Fiddle.
But it is not very fast: it is slow with big tables, because it can't use an index.
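For completeness, calling it from application code might look like this (a sketch with psycopg2; the connection string is a placeholder and the table name a is taken from the fiddle):
import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder connection string
with conn, conn.cursor() as cur:
    # fuzzystrmatch must be installed: CREATE EXTENSION fuzzystrmatch;
    cur.execute("SELECT id, data FROM a WHERE levenshtein(data, %s) = %s", ("1110", 1))
    rows = cur.fetchall()
print(rows)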
You might get somewhere with the similarity or distance operators provided by the additional module pg_trgm. Those can use a trigram index as detailed in the linked manual pages. I did not get anywhere with them, though; the module uses a different definition of "similarity".
Generally the problem seems to fit in the KNN ("k nearest neighbours") search pattern.
If your case is as simple as the example in the question, you can use LIKE in combination with a trigram GIN index, which should be reasonably fast with big tables:
SELECT *
FROM a
WHERE data <> '1110'
AND   (data LIKE '_110' OR
       data LIKE '1_10' OR
       data LIKE '11_0' OR
       data LIKE '111_');
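If you build that query from application code, the list of patterns can be generated mechanically. A small Python sketch, independent of any particular driver:
def one_off_patterns(s):
    # replace each position in turn with the LIKE single-character wildcard
    return [s[:i] + "_" + s[i + 1:] for i in range(len(s))]

print(one_off_patterns("1110"))  # ['_110', '1_10', '11_0', '111_']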
Obviously, this technique quickly becomes unfeasible with longer strings and more than 1 difference.
However, since the string is so short, any query will match a rather big percentage of the base table. Therefore, index support will hardly buy you anything. Most of the time it will be faster for Postgres to scan sequentially.
I tested with 10k and 100k rows with and without a trigram GIN index. Since ~ 19% match the criteria for the given test case, a sequential scan is faster and levenshtein() still wins. For more selective queries matching less than around 5 % of the rows (depends), a query using an index is (much) faster.

How should I model this in Redis?

FYI: Redis n00b.
I need to store search terms in my web app.
Each term will have two attributes: "search_count" (integer) and "last_searched_at" (time)
Example I've tried:
Redis.hset("search_terms", term, {count: 1, last_searched_at: Time.now})
I can think of a few different ways to store them, but no good ways to query on the data. The report I need to generate is a "top search terms in last 30 days". In SQL this would be a where clause and an order by.
How would I do that in Redis? Should I be using a different data type?
Thanks in advance!
I would consider two ordered sets.
When a search term is submitted, get the current timestamp and:
zadd timestamps timestamp term
zincrby counts 1 term
The above two operations should be atomic.
Then to find all terms in the given time interval timestamp_from, timestamp_to:
zrangebyscore timestamps timestamp_from timestamp_to
After you get these, loop over them and fetch each term's count from counts.
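In Python with redis-py, that idea could look roughly like this (key names as above; the two writes are kept atomic with a MULTI/EXEC pipeline):
import time
import redis

r = redis.Redis()

def record_search(term):
    pipe = r.pipeline()               # transactional (MULTI/EXEC) by default
    pipe.zadd("timestamps", {term: time.time()})
    pipe.zincrby("counts", 1, term)
    pipe.execute()

def counts_between(ts_from, ts_to):
    terms = r.zrangebyscore("timestamps", ts_from, ts_to)
    return {term: int(r.zscore("counts", term)) for term in terms}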
Alternatively, I am curious whether you can use zunionstore. Here is my test in Ruby:
require 'redis'

KEYS = %w(counts timestamps results)
TERMS = %w(test0 keyword1 test0 test1 keyword1 test0 keyword0 keyword1 test0)

def redis
  @redis ||= Redis.new
end

def timestamp
  (Time.now.to_f * 1000).to_i
end

redis.del KEYS

TERMS.each {|term|
  redis.multi {|r|
    r.zadd 'timestamps', timestamp, term
    r.zincrby 'counts', 1, term
  }
  sleep rand
}

redis.zunionstore 'results', ['timestamps', 'counts'], weights: [1, 1e15]

KEYS.each {|key|
  p [key, redis.zrange(key, 0, -1, withscores: true)]
}

# top 2 terms
p redis.zrevrangebyscore 'results', '+inf', '-inf', limit: [0, 2]
EDIT: at some point you would need to clear the counts set. Something similar to what @Eli proposed (https://stackoverflow.com/a/16618932/410102).
Depends on what you want to optimize for. Assuming you want to be able to run that query very quickly and don't mind expending some memory, I'd do this as follows.
Keep a key for every second you see some search (you can go more or less granular if you like). The key should point to a hash of $search_term -> $count where $count is the number of times $search_term was seen in that second.
Keep another key for every time interval (we'll call this $time_int_key) over which you want data (in your case, this is just one key where your interval is the last 30 days). This should point to a sorted set where the items in the set are all of your search terms seen over the last 30 days, and the score they're sorted by is the number of times they were seen in the last 30 days.
Have a background worker that, every second, grabs the key for the second that occurred exactly 30 days ago and loops through the hash attached to it. For every $search_term in that key, it should subtract the $count from the score associated with that $search_term in $time_int_key.
This way, you can just use ZREVRANGE $time_int_key 0 $m to grab the m top searches ([WITHSCORES] if you want the number of times they were searched) in O(log(N)+m) time. That's more than cheap enough to be able to run as frequently as you want in Redis for just about any reasonable m, and to always have that data updated in real time.
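A rough redis-py sketch of that scheme (all key names here are made up for the example, and the worker is shown as a plain function you would run once per second):
import time
import redis

r = redis.Redis()
WINDOW = 30 * 24 * 3600  # 30 days, in seconds

def record_search(term):
    now = int(time.time())
    pipe = r.pipeline()
    pipe.hincrby("searches:%d" % now, term, 1)      # per-second hash: term -> count
    pipe.expire("searches:%d" % now, WINDOW + 60)   # housekeeping so stale buckets vanish
    pipe.zincrby("searches:last30d", 1, term)       # rolling 30-day totals
    pipe.execute()

def roll_window():
    # background worker: subtract the bucket that just left the 30-day window
    old_key = "searches:%d" % (int(time.time()) - WINDOW)
    for term, count in r.hgetall(old_key).items():
        r.zincrby("searches:last30d", -int(count), term)
    r.delete(old_key)

def top_searches(m=10):
    return r.zrevrange("searches:last30d", 0, m - 1, withscores=True)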

How to create a sorted set with "field1 desc, field2 asc" order in Redis?

I am trying to build leaderboards in Redis and be able to get top X scores and retrieve a rank of user Y.
Sorted sets in Redis look like an easy fit except for one problem - I need scores to be sorted not only by the actual score, but also by date (so whoever got the same score earlier will be on top). The SQL query would be:
select * from scores order by score desc, date asc
Running zrevrange on a sorted set in Redis is equivalent to something like:
select * from scores order by score desc, key desc
Which would put users with lexicographically bigger keys above.
One solution I can think of is making some manipulations with a score field inside a sorted set to produce a combined number that consists of a score and a timestamp.
For example for a score 555 with a timestamp 111222333 the final score could be something like 555.111222333 which would put newer scores above older ones (not exactly what I need but could be adjusted further).
This would work, but only on small numbers, as a score in a sorted set has only 16 significant digits, so 10 of them will be wasted on a timestamp right away leaving not much room for an actual score.
Any ideas how to make a sorted set arrange values in a correct order? I would really want an end result to be a sorted set (to easily retrieve user's rank), even if it requires some temporary structures and sorts to build such set.
Actually, all my previous answers are terrible. Disregard all my previous answers (although I'm going to leave them around for the benefit of others).
This is how you should actually do it:
Store only the scores in the zset
Separately store a list of each time a player achieved that score.
For example:
score_key = <whatever unique key you want to use for this score>
redis('ZADD scores-sorted %s %s' %(score, score))
redis('RPUSH score-%s %s' %(score, score_key))
Then to read the scores:
top_score_keys = []
for score in redis('ZRANGE scores-sorted 0 10'):
    for score_key in redis('LRANGE score-%s 0 -1' %(score, )):
        top_score_keys.append(score_key)
Obviously you'd want to do some optimizations there (e.g., only reading chunks of the score-* lists instead of reading each entire list).
But this is definitely the way to do it.
User rank would be straightforward: for each user, keep track of their high score:
redis('SET highscores-%s %s' %(user_id, user_high_score))
Then determine their rank using:
user_high_score = redis('GET highscores-%s' %(user_id, ))
# rank of that score among the distinct scores in the zset (0 = highest)
score_rank = int(redis('ZREVRANK scores-sorted %s' %(user_high_score, )))
# plus this user's position within that score's list
# (LPOS requires Redis >= 6.0.6 and the user's score_key from above)
score_rank += int(redis('LPOS score-%s %s' %(user_high_score, score_key)))
It's not really the perfect solution, but if you use a custom epoch that is closer to the current time, then you need fewer digits to represent the timestamp.
For instance if you use January 1, 2012 for your epoch you would (currently) only need 8 digits to represent the timestamp.
Here's an example in ruby:
(Time.now - Time.new(2012, 1, 1, 0, 0, 0)).to_i
This would give you about 3 years before the timestamp would require 9 digits, at which time you could perform some maintenance to move the custom epoch forward again.
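For example, the packing could look like this in Python (the constants are invented for the sketch; the elapsed time is inverted so that, under ZREVRANGE, an earlier submission of the same score sorts first):
import time

CUSTOM_EPOCH = 1325376000   # 2012-01-01 00:00:00 UTC, as suggested above
MAX_ELAPSED = 10 ** 8       # 8 digits, roughly 3 years past the custom epoch

def packed_score(score, ts=None):
    elapsed = int((ts if ts is not None else time.time()) - CUSTOM_EPOCH)
    # integer part ranks by score; the remainder ranks ties by age (older = larger)
    return score * MAX_ELAPSED + (MAX_ELAPSED - elapsed)

# an earlier timestamp packs to a larger value for the same score:
assert packed_score(555, CUSTOM_EPOCH + 100) > packed_score(555, CUSTOM_EPOCH + 200)
# and any higher score still beats a lower one:
assert packed_score(556, CUSTOM_EPOCH + 999) > packed_score(555, CUSTOM_EPOCH + 1)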
I would however love to hear if anyone has a better idea, since I have the exact same problem.
(Note: this answer is almost certainly suboptimal; see https://stackoverflow.com/a/10575370/71522)
Instead of using a timestamp in the score, you could use a global counter. For example:
score_key = <whatever unique key you want to use for this score>
score_number = redis('INCR global-score-counter')
redis('ZADD sorted-scores %s.%s %s' %(score, score_number, score_key))
And to sort them in descending order, pick a large score count (1<<24, say), use that as the initial value of global-score-counter, then use DECR instead of INCR.
(this would also apply if you are using a timestamp)
Alternatively, if you are really, incredibly worried about the number of players, you could use a per-score counter:
score_key = <whatever unique key you want to use for this score>
score_number = redis('HINCRBY score-counter %s 1' %(score, ))
redis('ZADD sorted-scores %s.%s %s' %(score, score_number, score_key))
(Note: this answer is almost certainly suboptimal; see https://stackoverflow.com/a/10575370/71522)
A couple thoughts:
You could make some assumptions about the timestamps to make them smaller. For example, instead of storing Unix timestamps, you could store "number of minutes since May 13, 2012" (for example). In exchange for seven significant digits, this would let you store times for the next 19 years.
Similarly, you could reduce the number of significant digits in the scores. For example, if you expect scores to be in the 7-digit range, you could divide them by 10, 100, or 1000 when storing them in the sorted list, then use the results of the sorted list to access the actual scores, sorting those at the application level.
For example, using both of the above (in potentially buggy pseudo-code):
score_small = int(score / 1000)
time_small = int((time - 1336942269) / 60)
score_key = uuid()
redis('SET full-score-%s "%s %s"' %(score_key, score, time))
redis('ZADD sorted-scores %s.%s %s' %(score_small, time_small, score_key))
Then to load them (approximately):
top_scores = []
for score_key in redis('ZRANGE sorted-scores 0 10'):
    score_str, time_str = redis('GET full-score-%s' %(score_key, )).split(" ")
    top_scores.append((int(score_str), int(time_str)))
top_scores.sort()
This operation could even be done entirely inside Redis (avoid the network overhead of the O(n) GET operations) using the EVAL command (although I don't know enough Lua to confidently provide an example implementation).
Finally, if you expect a truly huge range of scores (for example, you expect that there will be a large number of scores below 10,000 and an equally large number of scores over 1,000,000), then you could use two sorted sets: scores-below-100000 and scores-above-100000.