How to retrieve keys in large Redis databases using SCAN

I have a large Redis database where I query keys with SCAN, using the syntax:
SCAN 0 MATCH *something* COUNT 50
I get the result
1) "500000"
2) (empty list or set)
but the key is there. If I keep calling SCAN with the cursor returned in 1), at some point I do get the result.
I was under the impression MATCH would return matching keys up to the maximum number specified by COUNT, but it seems Redis scans COUNT keys per call and returns only the ones that match.
Am I missing something? How can I say: "give me the first (count) keys that match the pattern"?
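For reference, a minimal sketch of the usual workaround: keep calling SCAN with the returned cursor until the cursor comes back as 0 or enough matches have been collected. The Python/redis-py client and the helper name are assumptions (the question itself is client-agnostic), and COUNT remains only a hint for how many keys each call examines.
import redis  # redis-py; assumed here, any client drives SCAN the same way

r = redis.Redis(decode_responses=True)

def first_matching_keys(pattern, wanted, count=1000):
    """Hypothetical helper: loop SCAN until `wanted` matches are found or the cursor wraps to 0."""
    cursor, found = 0, []
    while True:
        # Each call examines roughly `count` keys and may return zero matches,
        # which is exactly what the question observes.
        cursor, keys = r.scan(cursor=cursor, match=pattern, count=count)
        found.extend(keys)
        if len(found) >= wanted or cursor == 0:
            return found[:wanted]

keys = first_matching_keys("*something*", wanted=50)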

Related

Redis Secondary Indexes and Performance Question

I know that Redis doesn't really have the concept of secondary indexes, but that you can use the Z* commands to simulate one. I have a question about the best way to handle the following scenario.
We are using Redis to keep track of orders. But we also want to be able to find those orders by phone number or email ID. So here is our data:
> set 123 7245551212:dlw#email.com
> set 456 7245551212:dlw#email.com
> set 789 7245559999:kdw#email.com
> zadd phone-index 0 7245551212:123:dlw#email.com
> zadd phone-index 0 7245551212:456:dlw#email.com
> zadd phone-index 0 7245559999:789:kdw#email.com
I can see all the orders for a phone number via the following (is there a better way to get the range other than adding a 'Z' to the end?):
> zrangebylex phone-index [7245551212 (7245551212Z
1) "7245551212:123:dlw#dcsg.com"
2) "7245551212:456:dlw#dcsg.com"
My question is, is this going to perform well? Or should we just create a list that is keyed by phone number, and add an order ID to that list instead?
> rpush phone:7245551212 123
> rpush phone:7245551212 456
> rpush phone:7245559999 789
> lrange phone:7245551212 0 -1
1) "123"
2) "456"
Which would be the preferred method, especially related to performance?
RE: is there a better way to get the range other than adding a 'Z' to the end?
Yes, use the next immediate character instead of adding Z:
zrangebylex phone-index [7245551212 (7245551213
But certainly the second approach offers better performance.
Using a sorted set for lexicographical indexing, you need to consider that:
The addition of elements, ZADD, is O(log(N))
The query, ZRANGEBYLEX, is O(log(N)+M) with N being the number of elements in the sorted set and M the number of elements being returned
In contrast, using lists:
The addition, RPUSH, is O(1)
The query, LRANGE, is O(S+N); since you start at offset zero, it is effectively O(N), with N being the number of elements returned.
You can also use sets (SADD and SMEMBERS); the difference is that lists allow duplicates and preserve insertion order, while sets ensure uniqueness and do not preserve insertion order.
A ZSET uses a skip list for ordering and a dict for member lookup. When you add all elements with the same score, the skip list effectively becomes a balanced-tree-like structure ordered by member, which gives O(log N) time complexity for a lexicographical search.
So if you don't need range queries over phone numbers, use a list of orders keyed by phone number for exact lookups. The same works for email (and you can use a hash per order to tie the two lists together), as in the sketch below. Query performance this way will be much better than with a ZSET.
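For reference, a minimal sketch of that layout: one list of order IDs per phone number, a second one per email, and a hash per order tying the two together. The Python/redis-py client and the key names are illustrative assumptions; the thread itself uses raw commands.
import redis  # redis-py; any client works, the commands are the same

r = redis.Redis(decode_responses=True)

def add_order(order_id, phone, email):
    """Index an order under both its phone number and its email (key names are illustrative)."""
    pipe = r.pipeline()
    pipe.hset(f"order:{order_id}", mapping={"phone": phone, "email": email})
    pipe.rpush(f"phone:{phone}", order_id)  # RPUSH is O(1)
    pipe.rpush(f"email:{email}", order_id)  # second index, same idea
    pipe.execute()

def orders_by_phone(phone):
    return r.lrange(f"phone:{phone}", 0, -1)  # O(N) in the number of IDs returned

add_order("123", "7245551212", "dlw@example.com")
add_order("456", "7245551212", "dlw@example.com")
print(orders_by_phone("7245551212"))  # ['123', '456']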

Results DataSet from DynamoDB Query using GSI is not returning correct results

I have a DynamoDB table where I am currently storing all the events that happen in my system for every product. On the main table the primary key's hash is a combination of productid, eventtype and eventcategory, with CreationTime as the sort key. The table was created and data was added to it.
Later I added a new GSI to the table, with SecondaryHash (the combination of eventcategory and eventtype, excluding productid) as the hash key and CreationTime as the sort key. This was added so that I can query across multiple products at once.
The GSI seems to work fine. However, only later did I realize that the data being returned is incorrect.
Here is the scenario. (I am running all these queries against the newly created index)
I was querying for products within the last 30 days and the query returned 312 records. However, when I run the same query for the last 90 days, it returns only 128 records (which is wrong; it should be at least equal to or greater than the 30-day count).
I already have pagination logic embedded in my code, so LastEvaluatedKey is checked every time in order to loop and fetch the next set of records, and after the loop all the results are combined.
Not sure if I am missing something.
Any suggestions would be appreciated.
var limitPtr *int64
if limit > 0 {
    limit64 := int64(limit)
    limitPtr = &limit64
}
input := dynamodb.QueryInput{
    ExpressionAttributeNames: map[string]*string{
        "#sch": aws.String("SecondaryHash"),
        "#pkr": aws.String("CreationTime"),
    },
    ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
        ":sch": {
            S: aws.String(eventHash),
        },
        ":pkr1": {
            N: aws.String(strconv.FormatInt(startTime, 10)),
        },
        ":pkr2": {
            N: aws.String(strconv.FormatInt(endTime, 10)),
        },
    },
    KeyConditionExpression: aws.String("#sch = :sch AND #pkr BETWEEN :pkr1 AND :pkr2"),
    ScanIndexForward:       &scanForward,
    Limit:                  limitPtr,
    TableName:              aws.String(ddbTableName),
    IndexName:              aws.String(ddbIndexName),
}
You have reached the maximum amount of data a single Query call will evaluate (not necessarily the number of matching items); the limit is 1 MB of data read per request.
The response then contains a LastEvaluatedKey attribute, which is the key of the last item evaluated. You have to perform a new query with an extra ExclusiveStartKey parameter (ExclusiveStartKey should be set to the LastEvaluatedKey value from the previous response).
When no LastEvaluatedKey is returned, you have reached the end of the result set.
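For reference, the loop that pattern implies, sketched here with boto3 (Python) purely to show the shape; the Go SDK's QueryOutput exposes the same Items and LastEvaluatedKey fields. The table and index names below are placeholders, not taken from the question's code.
import boto3
from boto3.dynamodb.conditions import Key

# Attribute names come from the question; the table and index names are placeholders.
table = boto3.resource("dynamodb").Table("product-events")

def query_all_pages(event_hash, start_time, end_time):
    items = []
    kwargs = {
        "IndexName": "SecondaryHash-CreationTime-index",
        "KeyConditionExpression": Key("SecondaryHash").eq(event_hash)
        & Key("CreationTime").between(start_time, end_time),
    }
    while True:
        resp = table.query(**kwargs)
        items.extend(resp["Items"])
        last_key = resp.get("LastEvaluatedKey")
        if not last_key:
            return items  # no more pages: every matching item has been read
        kwargs["ExclusiveStartKey"] = last_key  # resume exactly where this page stopped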

Should I reverse order a queryset before slicing the first N records, or count it to slice the last N records?

Let's say I want to get the last 50 records of a query that returns around 10k records, in a table with 1M records. I could do (at the computational cost of ordering):
data = MyModel.objects.filter(criteria=something).order_by('-pk')[:50]
I could also do (at the cost of 2 database hits):
# assume I don't care about new records being added between
# the two queries being executed
index = MyModel.objects.filter(criteria=something).count()
data = MyModel.objects.filter(criteria=something)[index-50:]
Which is better for just an ordinary relational database with no indexing on the criteria (eg postgres in my case; no columnar storage or anything fancy)? Most importantly, why?
Does the answer change if the table or queryset is significantly bigger (eg 100k records from a 10M row table)?
This one is going to be very slow
data = MyModel.objects.filter(criteria=something)[index-50:]
Why? Because it translates into
SELECT * FROM myapp_mymodel OFFSET (index-50)
You are not enforcing any ordering here, so the server will have to calculate the whole result set and then jump to the end of it; that involves a lot of reading and will be very slow. Let us not forget that count() queries aren't all that fast either.
On the other hand, this one is going to be fast
data = MyModel.objects.filter(criteria=something).order_by('-pk')[:50]
You are reverse-ordering on the primary key and taking the newest 50 rows. The first 50 in ascending order can be fetched just as quickly with
data = MyModel.objects.filter(criteria=something).order_by('pk')[:50]
So this is what you really should be doing
data1 = MyModel.objects.filter(criteria=something).order_by('-pk')[:50]
data2 = MyModel.objects.filter(criteria=something).order_by('pk')[:50]
The cost of ordering on the primary key is very low.
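If those last 50 rows are needed in ascending pk order (the order the OFFSET version would have produced), you can flip the sliced result in Python rather than asking the database to do more work; a small sketch, reusing the placeholder names from the question:
# Fetch the newest 50 matching rows with one indexed query, then reverse them in memory.
last_50_desc = MyModel.objects.filter(criteria=something).order_by('-pk')[:50]
last_50_asc = list(last_50_desc)[::-1]  # evaluates the queryset, then flips the 50 rows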

Selecting data from two different tables

I have 2 tables.
Table TSTRSN
[P]Client
[P]Year
[P]Rule_Nbr
Type_Code
Table TSTOCK
[P]Client
[P]Year
TimeStamp
EndOfFiscalYear
( [P] means Primary Key)
The request is twofold:
1) List a count of each Rule_Nbr within a given time range (based on TimeStamp).
...then User chooses a specific Rule_Nbr...
2) List all Client, Year, EndOfFiscalYear for that specific Rule_Nbr
So for Part 1) I have to take the Rule_Nbr, take the matching Client and Year, and use those to look up the TimeStamp. If it falls within the right time range, increment the count by 1... and so on.
Then for Part 2) I could either save the data from Part 1 (I don't know if this is feasible given the size of the tables) or redo query 1) for just one Rule_Nbr.
I'm very new to SQL/DB2, so how do I go about doing this? My first thought was to build an array, store TSTRSN.Client/Year/Rule_Nbr in it, and then prune it by comparing it to TSTOCK.Client/Year/TimeStamp, but I wonder if there's a better way (I'm not even sure arrays exist in DB2!).
Any tips?
What you're looking for is the JOIN keyword.
http://www.gatebase.toucansurf.com/db2examples13.html
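For reference, a hedged sketch of what those joins could look like, using the table and column names from the question; the ibm_db_dbi driver, connection string, time-window values and variable names are assumptions rather than anything from the thread.
import ibm_db_dbi  # IBM's DB-API driver for DB2; assumed here

conn = ibm_db_dbi.connect(
    "DATABASE=mydb;HOSTNAME=myhost;PORT=50000;PROTOCOL=TCPIP;UID=user;PWD=secret", "", "")
cur = conn.cursor()

start_ts, end_ts = "2023-01-01-00.00.00", "2023-12-31-23.59.59"  # placeholder time window

# Part 1: count each Rule_Nbr whose matching TSTOCK row falls in the time window.
cur.execute("""
    SELECT r.Rule_Nbr, COUNT(*) AS rule_count
      FROM TSTRSN r
      JOIN TSTOCK s ON s.Client = r.Client AND s.Year = r.Year
     WHERE s.TimeStamp BETWEEN ? AND ?
     GROUP BY r.Rule_Nbr
""", (start_ts, end_ts))
rule_counts = cur.fetchall()

# Part 2: after the user picks a Rule_Nbr, list its rows directly -- no need to
# keep the intermediate result from Part 1 in an array.
chosen_rule_nbr = rule_counts[0][0]  # placeholder choice
cur.execute("""
    SELECT s.Client, s.Year, s.EndOfFiscalYear
      FROM TSTRSN r
      JOIN TSTOCK s ON s.Client = r.Client AND s.Year = r.Year
     WHERE r.Rule_Nbr = ? AND s.TimeStamp BETWEEN ? AND ?
""", (chosen_rule_nbr, start_ts, end_ts))
details = cur.fetchall()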

How should I model this in Redis?

FYI: Redis n00b.
I need to store search terms in my web app.
Each term will have two attributes: "search_count" (integer) and "last_searched_at" (time)
Example I've tried:
Redis.hset("search_terms", term, {count: 1, last_searched_at: Time.now})
I can think of a few different ways to store them, but no good ways to query on the data. The report I need to generate is a "top search terms in last 30 days". In SQL this would be a where clause and an order by.
How would I do that in Redis? Should I be using a different data type?
Thanks in advance!
I would consider two sorted sets.
When a search term is submitted, get the current timestamp and:
zadd timestamps timestamp term
zincrby counts 1 term
The above two operations should be atomic.
Then to find all terms in the given time interval timestamp_from, timestamp_to:
zrangebyscore timestamps timestamp_from timestamp_to
After you get these, loop over them and fetch their counts from counts.
Alternatively, I am curious whether you can use zunionstore. Here is my test in Ruby:
require 'redis'
KEYS = %w(counts timestamps results)
TERMS = %w(test0 keyword1 test0 test1 keyword1 test0 keyword0 keyword1 test0)
def redis
  @redis ||= Redis.new
end

def timestamp
  (Time.now.to_f * 1000).to_i
end

redis.del KEYS

TERMS.each {|term|
  redis.multi {|r|
    r.zadd 'timestamps', timestamp, term
    r.zincrby 'counts', 1, term
  }
  sleep rand
}

redis.zunionstore 'results', ['timestamps', 'counts'], weights: [1, 1e15]

KEYS.each {|key|
  p [key, redis.zrange(key, 0, -1, withscores: true)]
}
# top 2 terms
p redis.zrevrangebyscore 'results', '+inf', '-inf', limit: [0, 2]
EDIT: at some point you would need to clear the counts set. Something similar to what @Eli proposed (https://stackoverflow.com/a/16618932/410102).
Depends on what you want to optimize for. Assuming you want to be able to run that query very quickly and don't mind expending some memory, I'd do this as follows.
Keep a key for every second you see some search (you can go more or less granular if you like). The key should point to a hash of $search_term -> $count where $count is the number of times $search_term was seen in that second.
Keep another key for every time interval (we'll call this $time_int_key) over which you want data (in your case, this is just one key where your interval is the last 30 days). This should point to a sorted set where the items in the set are all of your search terms seen over the last 30 days, and the score they're sorted by is the number of times they were seen in the last 30 days.
Have a background worker that every second grabs the key for the second that occurred exactly 30 days ago and loops through the hash attached to it. For every $search_term in that key, it should subtract the $count from the score associated with that $search_term in $time_int_key.
This way, you can just use ZREVRANGE $time_int_key 0 $m to grab the m top searches ([WITHSCORES] if you want the number of times they were searched) in O(log(N)+m) time. That's more than cheap enough to run as frequently as you want in Redis for just about any reasonable m, and to always have that data updated in real time.
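For reference, a minimal sketch of that design; Python/redis-py and every key name here are assumptions (the earlier answer in this thread uses Ruby, and the commands are identical from either client).
import time
import redis  # redis-py; assumed client

r = redis.Redis(decode_responses=True)
WINDOW = 30 * 24 * 3600  # 30 days, in seconds

def record_search(term, now=None):
    """Count the term both in this second's bucket and in the rolling 30-day sorted set."""
    now = int(now or time.time())
    pipe = r.pipeline()
    pipe.hincrby(f"searches:{now}", term, 1)     # per-second hash: term -> count
    pipe.expire(f"searches:{now}", WINDOW + 60)  # stale buckets clean themselves up
    pipe.zincrby("searches:30d", 1, term)        # rolling-window scores
    pipe.execute()

def expire_oldest_bucket(now=None):
    """Background-worker step: subtract the bucket that just fell out of the window."""
    now = int(now or time.time())
    old_key = f"searches:{now - WINDOW}"
    for term, count in r.hgetall(old_key).items():
        r.zincrby("searches:30d", -int(count), term)
    r.delete(old_key)

def top_searches(m=10):
    # Highest scores first, i.e. the most-searched terms of the last 30 days.
    return r.zrevrange("searches:30d", 0, m - 1, withscores=True)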