sdiff - limit the result set to X items - redis

I want to get the diff of two sets in redis, but I don't need to return the entire array, just 10 items for example. Is there any way to limit the results?
I was thinking something like this:
SDIFF set1 set2 LIMIT 10
If not, are there any other options to achieve this in a performant way, considering that set1 can contain millions of objects while set2 is much, much smaller (hundreds)?

More info on what you want to achieve would be helpful. Something like this might require you to duplicate your data, though I don't know if that's something you want.
An option is chunking them.
1. Create a set with a uniquely generated id that can hold a max of 10 items.
2. Create a sorted set like so:
zadd(key, timestamp, chunkid)
where the timestamp is a Unix time and the chunkid is the key that connects to the chunk set. The key can be whatever name you would like it to be, or another uniquely generated id.
3. Use zrange to grab a specific chunk.
(Repeat steps 1-3 for the second set.)
Once you have one chunk from each of the two sorted sets ("zset"), you can do your SDIFF using the chunkids.
Note that there are advantages and disadvantages to doing this, such as higher connection consumption (if calling from a client) and, obviously, a little more processing. It will help immensely if you put this in a Lua script.
Hope this helps, or at least gives you an idea of how to model your data. If this is critical data, you might need an automated script of some sort to move your data around to meet the modeling requirement.
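A rough redis-py sketch of this chunking idea; the key names chunk:<id> and chunks:<name> are made up for illustration:
import time
import uuid
import redis

r = redis.Redis(decode_responses=True)

CHUNK_SIZE = 10  # max items per chunk set

def add_item(name, item):
    # Add `item` to the newest chunk of collection `name`, starting a new chunk when full.
    newest = r.zrange(f"chunks:{name}", -1, -1)  # highest-scored chunkid = newest chunk
    chunk_id = newest[0] if newest else None
    if chunk_id is None or r.scard(f"chunk:{chunk_id}") >= CHUNK_SIZE:
        chunk_id = uuid.uuid4().hex
        r.zadd(f"chunks:{name}", {chunk_id: time.time()})
    r.sadd(f"chunk:{chunk_id}", item)

def diff_newest_chunks(name_a, name_b):
    # SDIFF the newest chunk of each collection: at most CHUNK_SIZE results.
    a = r.zrange(f"chunks:{name_a}", -1, -1)
    b = r.zrange(f"chunks:{name_b}", -1, -1)
    if not a or not b:
        return set()
    return r.sdiff(f"chunk:{a[0]}", f"chunk:{b[0]}")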

Related

Redis - Sorted set, find item by property value

In redis I store objects in a sorted set.
In my solution, it's important to be able to run a ranged query by dates, so I store the items with the score being the timestamp of each items, for example:
# Score Value
0 1443476076 {"Id":"92","Ref":"7ADT","DTime":1443476076,"ATime":1443901554,"ExTime":0,"SPName":"7ADT33CFSAU6","StPName":"7ADT33CFSAU6"}
1 1443482969 {"Id":"11","Ref":"DAJT","DTime":1443482969,"ATime":1443901326,"ExTime":0,"SPName":"DAJTJTT4T02O","StPName":"DAJTJTT4T02O"}
However, in other situations I need to find a single item in the set based on its ID.
I know I can't just query this data structure as if it were a NoSQL db, but I tried using ZSCAN, which didn't work:
ZSCAN MySet 0 MATCH Id:92 count 1
It returns: "empty list or set"
Maybe I need to serialize differently?
I have serialized using Json.Net.
How, if possible, can I achieve this: using dates as the score and still being able to look up an item by its ID?
Many thanks,
Lars
Edit:
Assume it's not possible, but any thoughts or inputs are welcome:
Ref: http://openmymind.net/2011/11/8/Redis-Zero-To-Master-In-30-Minutes-Part-1/
In Redis, data can only be queried by its key. Even if we use a hash,
we can't say get me the keys wherever the field race is equal to
sayan.
Edit 2:
I tried to do:
127.0.0.1:6379> ZSCAN MySet 0 MATCH *87*
1) "192"
2) 1) "{\"Id\":\"64\",\"Ref\":\"XQH4\",\"DTime\":1443837798,\"ATime\":1444187707,\"ExTime\":0,\"SPName\":\"XQH4BPGW47FM\",\"StPName\":\"XQH4BPGW47FM\"}"
2) "1443837798"
3) "{\"Id\":\"87\",\"Ref\":\"5CY6\",\"DTime\":1443519199,\"ATime\":1444172326,\"ExTime\":0,\"SPName\":\"5CY6DHP23RXB\",\"StPName\":\"5CY6DHP23RXB\"}"
4) "1443519199"
And it finds the desired item, but it also finds another one with an occurrence of 87 in the property ATime. Longer, more unique IDs might make this workable, though I would still have to filter the results in code to find the one with the exact value in its property.
Still open for suggestions.
I think it's very simple.
Solution 1 (inferior, not recommended)
Your ZSCAN MySet 0 MATCH Id:92 count 1 didn't work because the stored string is "{\"Id\":\"92\"..., not "{\"Id:92\".... The string has been serialized into another format, and the MATCH glob must match the whole member, so try something like MATCH *Id\":\"92* to match the JSON-serialized data in Redis. I'm not familiar with Json.NET, so the actual string is left for you to discover.
By the way, I have to ask: did ZSCAN MySet 0 MATCH Id:92 count 1 return a cursor? I suspect you used ZSCAN in the wrong way.
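If you do go this route, here is a small redis-py sketch of the scan; the glob pattern assumes the Json.NET output shown in the question:
import redis

r = redis.Redis(decode_responses=True)

# zscan_iter drives the ZSCAN cursor loop for you and yields (member, score)
# pairs; the glob must match the member as a whole, hence the surrounding '*'s.
matches = [member for member, score in r.zscan_iter("MySet", match='*"Id":"92"*')]
print(matches)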
Solution 2 (better, strongly recommended)
ZSCAN is fine when your sorted set is not large and you know how to save network round trips with Redis' Lua scripting, but it still makes the "look up by ID" operation O(n). Therefore, a better solution is to change your data model in the following way:
change sorted set
from
# Score Value
0 1443476076 {"Id":"92","Ref":"7ADT","DTime":1443476076,"ATime":1443901554,"ExTime":0,"SPName":"7ADT33CFSAU6","StPName":"7ADT33CFSAU6"}
1 1443482969 {"Id":"11","Ref":"DAJT","DTime":1443482969,"ATime":1443901326,"ExTime":0,"SPName":"DAJTJTT4T02O","StPName":"DAJTJTT4T02O"}
to
# Score Value
0 1443476076 Id:92
1 1443482969 Id:11
Move the rest of the detailed data into a separate group of hash-type keys:
# Key field-value field-value ...
0 Id:92 Ref-7ADT DTime-1443476076 ...
1 Id:11 Ref-DAJT DTime-1443482969 ...
Then you can locate an item by id by doing HGETALL Id:92. As for the ranged query by date, you need to do ZRANGEBYSCORE sortedset mindate maxdate and then HGETALL every returned id one by one. You'd better use Lua to wrap these commands into one script, and it will still be super fast!
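A rough redis-py sketch of this remodel (in production you would bundle the range query and the per-id lookups into the Lua script mentioned above):
import redis

r = redis.Redis(decode_responses=True)

def save_item(item):
    # Index the member "Id:<id>" by date, and store the details in a hash of the same name.
    member = f"Id:{item['Id']}"
    r.zadd("MySet", {member: item["DTime"]})  # date index (score = DTime)
    r.hset(member, mapping=item)              # detail hash, queried via HGETALL Id:92

def get_by_id(item_id):
    # O(1) lookup by ID via the hash.
    return r.hgetall(f"Id:{item_id}")

def get_by_date_range(min_ts, max_ts):
    # ZRANGEBYSCORE for the date range, then HGETALL each returned id.
    return [r.hgetall(m) for m in r.zrangebyscore("MySet", min_ts, max_ts)]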
Data in a NoSQL database often needs to be organized redundantly like the above. This may make some common operations involve more than one command and round trip, but that can be tackled with Redis' Lua feature, which I strongly recommend: it wraps the commands into one network round trip, all executed on the redis-server side, atomically and very fast!
Reply if there's anything unclear.

Alphabetical index with millions of rows in redis

For my application, I need an alphabetical index on a set with millions of rows.
When I use a sorted set, and give all members the same score, the result looks perfect.
Performance is also great: with a test set of 2 million rows, the last third does not perform noticeably worse than the first third of the set.
However, I need to query those results. For example, get the first (max) 100 items that start with "goo". I played around with ZSCAN and SORT, but they did not give me a working, performant result.
Since redis is very fast when inserting a new member into the sorted set, it must be technically possible to immediately (well, very quickly) find the right memory location. I suppose redis uses some kind of skip-list mechanism to accomplish this.
But.. I don't seem to get the result when I just want to query the data, and not write to it.
We use replicated slaves for read actions, and we prefer the (default) read-only config switch. So creating a dummy key and deleting it afterwards (however inelegant) is not really an option.
I'm stuck a bit, and I'm thinking about writing a ZLEX command in redis-server itself. Which I could use like this:
HELP "ZLEX" -> (ZLEX set score startswith)
-- Query the lexicographical index of a sorted set, supplying a 'startswith' string.
127.0.0.1:12345> ZLEX myset 0 goo LIMIT 0 100
1) goo
2) goof
3) goons
4) goozer
What are your thoughts? Am I missing something in the standard redis commands?
We're using Redis 2.8.4 x64 on Debian.
Kind regards, TW
Edits:
Note: related issue: indexing-using-redis-sorted-sets -> At least the name I gave to ZLEX seems to conform with Antirez' (Salvatore's) standards. As of 24-1-2014, I'm working on implementing ZLEX. It seems to be the easiest and most straightforward solution for this use case, and Antirez could merge it into the main branch for everyone's benefit.
I've implemented ZLEX.
Here are the full specs.
You can grab the new functionality from here: github tw-bert
I also posted a pull request to Antirez here.
Kind regards, TW
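Note: Redis itself later gained lexicographic range queries via ZRANGEBYLEX (available since 2.8.9), which covers exactly this prefix use case. A quick redis-py sketch, assuming an index zset named myset where every member has score 0:
import redis

r = redis.Redis(decode_responses=True)

# Same-score members sort lexicographically, which is what makes this work.
for word in ("goo", "goof", "goons", "goozer", "gopher"):
    r.zadd("myset", {word: 0})

# "[goo" is an inclusive lower bound; appending 0xff yields an upper bound just
# past anything starting with "goo". LIMIT works much like the ZLEX proposal.
print(r.zrangebylex("myset", b"[goo", b"[goo\xff", start=0, num=100))
# ['goo', 'goof', 'goons', 'goozer']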
Have you had a look at this?
It can be useful depending on the length of the field by which you sort; this method requires b*(a^2) keys, where a is the length of the field and b is the number of rows for this field.

Redis Sorted Set ... store data in "member"?

I am learning Redis and using an existing app (e.g. converting pieces of it) for practice.
I'm really struggling to understand first IF and then (if applicable) HOW to use Redis in one particular use-case ... apologies if this is super basic, but I'm so new that I'm not even sure if I'm asking correctly :/
Scenario:
Images are received by a server, and info like time_taken and resolution is saved in a database entry. Images are then associated with (e.g. "belong to") one Event ... all very straightforward for an RDBMS.
I'd like to use a Redis to maintain a list of the 50 most-recently-uploaded image objects for each Event, to be delivered to the client when requested. I'm thinking that a Sorted Set might be appropriate, but here are my concerns:
First, I'm not sure if a Sorted Set can/should be used in this associative manner? Can it reference other objects in Redis? Or is there just a better way to do this altogether?
Secondly, I need the ability to delete elements that are greater than X minutes old. I know about the EXPIRE command for keys, but I can't use this because not all images need to expire at the same periodicity, etc.
This second part seems more like a query on a field, which makes me think that Redis cannot be used ... but then I've read that I could maybe use the Sorted Set score to store a timestamp and find "older than X" in that way.
Can someone provide some clarity on these two issues? Thank you very much!
UPDATE
Knowing that the amount of data I need to store for each image is small and will be delivered to the client's browser, is there anything wrong with storing it in the member "field" of a sorted set?
For example Sorted Set => event:14:pictures <time_taken> "{id:3,url:/images/3.png,lat:22.8573}"
This saves the data I need and creates a rapidly-updatable list of the last X pictures for a given event with the ability to, if needed, identify pictures that are greater than X minutes old ...
First, I'm not sure if a Sorted Set can/should be used in this
associative manner? Can it reference other objects in Redis?
Why do you need to reference other objects? An event may have n image objects, each with a time_taken and image data; a sorted set is perfect for this. The event is the key (e.g. event:14:pictures), the score is time_taken, and the member is the image data as JSON/XML, whatever; you're good to go there.
Secondly, I need the ability to delete elements that are greater than
X minutes old
If you want to delete elements greater than X minutes old, use ZREMRANGEBYSCORE:
ZREMRANGEBYSCORE event:14:pictures -inf (currentTime - X minutes)
-inf is just another way of saying the oldest member, without having to know the oldest member's time; for the top of the range you need to calculate it from the current time before issuing this command (the above is just an example).
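A short redis-py sketch of this pattern, keeping the 50 most recent pictures and expiring by age (key names follow the question's example):
import json
import time
import redis

r = redis.Redis(decode_responses=True)

def add_picture(event_id, picture, time_taken):
    # The member holds the picture data itself; the score is time_taken.
    key = f"event:{event_id}:pictures"
    r.zadd(key, {json.dumps(picture): time_taken})
    r.zremrangebyrank(key, 0, -51)  # cap the set at the 50 most recent members

def remove_older_than(event_id, minutes):
    # ZREMRANGEBYSCORE from -inf up to (exclusive) now minus X minutes.
    cutoff = time.time() - minutes * 60
    r.zremrangebyscore(f"event:{event_id}:pictures", "-inf", f"({cutoff}")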

When would you ever set the increment value on a database identity field?

Given the table:
CREATE TABLE Table1
(
UniqueID int IDENTITY(1,1)
...etc
)
Now why would you ever set the increment to something other than 1?
I can understand setting the initial seed value differently. For example if, say, you're creating one database table per month of data (e.g. Table1_082009, Table1_092009) and want to start the UniqueID of the new table where the old one left off. (I probably wouldn't use that strategy myself, but hey, I can see people doing it).
But for the increment? I can only imagine it being of any use in really odd situations, for example:
after the initial data is inserted, maybe later someone will want to turn identity insert on and insert new rows in the gaps, but, for efficient lookup on the index, will want the rows to be close to each other?
if you're looking up ids based directly off a URL, and want to make it harder for people to arbitrarily access the other items (for example, instead of the user being able to work out that changing the URL suffix from /GetData?id=1000 to /GetData?id=1001, you set an increment of 437 so that the next url is actually /GetData?id=1437)? Of course if this is your "security" then you're probably already in trouble...
I can't think of anything else. Has anyone used an increment that wasn't 1, and why? I'm really just curious.
One idea might be using this to facilitate partitioning of data (though there might be more "automated" ways to do that):
Considering you have two servers:
On one server, you start at 1 and increment by 2
On the other server, you start at 2 and increment by 2.
Then, from your application, you send half the inserts to one server and the other half to the second server
(some kind of software load-balancing)
This way, you still have the ability to identify your entries: the "UniqueID" is still unique, even if the data is split across two servers / tables.
But that's only a wild idea -- there are probably some other uses to that...
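A toy Python illustration of that seed/step interleaving (plain objects, no database; the class name is made up):
class IdentityColumn:
    # Mimics IDENTITY(seed, step): each server hands out a disjoint stream of ids.
    def __init__(self, seed, step):
        self.next_id = seed
        self.step = step

    def take(self):
        value = self.next_id
        self.next_id += self.step
        return value

server_a = IdentityColumn(seed=1, step=2)  # 1, 3, 5, ...
server_b = IdentityColumn(seed=2, step=2)  # 2, 4, 6, ...
print([server_a.take() for _ in range(3)])  # [1, 3, 5]
print([server_b.take() for _ in range(3)])  # [2, 4, 6]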
Once, for pure fun (oh yeah, we have a wild side to us), we decided to use a negative increment. It was strange to see the numbers grow in size and shrink in value at the same time.
I could hardly sit still in my chair.
edit (afterthought):
You think the creator of IDENTITY was in love with FOR loops? You know...
for (i = 0; i<=99; i+=17)
or, for those non-semicolon folks out there,
For i = 0 to 100 step 17
Only for entertainment. And you have to be REALLY bored.

Represent Ordering in a Relational Database

I have a collection of objects in a database. Images in a photo gallery, products in a catalog, chapters in a book, etc. Each object is represented as a row. I want to be able to arbitrarily order these images, storing that ordering in the database so when I display the objects, they will be in the right order.
For example, let's say I'm writing a book, and each chapter is an object. I write my book, and put the chapters in the following order:
Introduction, Accessibility, Form vs. Function, Errors, Consistency, Conclusion, Index
It goes to the editor, and comes back with the following suggested order:
Introduction, Form, Function, Accessibility, Consistency, Errors, Conclusion, Index
How can I store this ordering in the database in a robust, efficient way?
I've had the following ideas, but I'm not thrilled with any of them:
Array. Each row has an ordering ID; when the order is changed (via a removal followed by an insertion), the ordering IDs are updated. This makes retrieval easy, since it's just an ORDER BY, but it seems easy to break.
// REMOVAL
UPDATE ... SET orderingID=NULL WHERE orderingID=removedID
UPDATE ... SET orderingID=orderingID-1 WHERE orderingID > removedID
// INSERTION
UPDATE ... SET orderingID=orderingID+1 WHERE orderingID > insertionID
UPDATE ... SET orderingID=insertionID WHERE ID=addedID
Linked list. Each row has a column for the id of the next row in the ordering. Traversal seems costly here, though there may be some way to use ORDER BY that I'm not thinking of.
Spaced array. Set the orderingID (as used in #1) to be large, so the first object is 100, the second is 200, etc. Then when an insertion happens, you just place it at (objectBefore + objectAfter)/2. Of course, this would need to be rebalanced occasionally, so you don't have things too close together (even with floats, you'd eventually run into rounding errors).
None of these seem particularly elegant to me. Does anyone have a better way to do it?
Another alternative would be (if your RDBMS supports it) to use columns of type array. While this breaks the normalization rules, it can be useful in situations like this. One database that I know has arrays is PostgreSQL.
The acts_as_list mixin in Rails handles this basically the way you outlined in #1. It looks for an INTEGER column called position (whose name you can override, of course) and uses that to do an ORDER BY. When you want to re-order things, you update the positions. It has served me just fine every time I've used it.
As a side note, you can remove the need to always re-position on INSERTs/DELETEs by using sparse numbering -- kind of like BASIC line numbers back in the day... you can number your positions 10, 20, 30, etc., and if you need to insert something between 10 and 20 you just insert it with a position of 15. Likewise, when deleting, you can just delete the row and leave the gap. You only need to re-number when you actually change the order, or when you try to insert and there is no appropriate gap to insert into.
Of course, depending on your particular situation (e.g. whether you have the other rows already loaded into memory or not), it may or may not make sense to use the gap approach.
If the objects aren't heavily keyed by other tables, and the lists are short, deleting everything in the domain and just re-inserting the correct list is the easiest. But that's not practical if the lists are large and you have lots of constraints to slow down the delete. I think your first method is really the cleanest. If you run it in a transaction you can be sure nothing odd happens while you're in the middle of the update to screw up the order.
Just a thought considering option #1 vs #3: doesn't the spaced array option (#3) only postpone the problem of the normal array (#1)? Whatever algorithm you choose, either it's broken, and you'll run into problems with #3 later, or it works, and then #1 should work just as well.
I did this in my last project, but it was for a table that only occasionally needed to be specifically ordered and wasn't accessed too often. I think the spaced array would be the best option, because reordering would be cheapest in the average case, involving just a change to one value and a query on two.
Also, I would imagine ORDER BY is pretty heavily optimized by database vendors, so leveraging that function would be advantageous for performance, as opposed to the linked-list implementation.
Use a floating point number to represent the position of each item:
Item 1 -> 0.0
Item 2 -> 1.0
Item 3 -> 2.0
Item 4 -> 3.0
You can place any item between any other two items by simple bisection:
Item 1 -> 0.0
Item 4 -> 0.5
Item 2 -> 1.0
Item 3 -> 2.0
(Moved item 4 between items 1 and 2).
The bisection process can continue almost indefinitely due to the way floating point numbers are encoded in a computer system.
Item 4 -> 0.5
Item 1 -> 0.75
Item 2 -> 1.0
Item 3 -> 2.0
(Move item 1 to the position just after Item 4)
Since I've mostly run into this with Django, I've found this solution to be the most workable. It seems that there isn't any "right way" to do this in a relational database.
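A toy Python sketch of the bisection arithmetic above (in-memory only; in a real schema the positions would live in a column and the move would be a single UPDATE):
def position_between(before, after):
    # Pick a sort key strictly between two existing keys (None = open end).
    if before is None and after is None:
        return 0.0
    if before is None:
        return after - 1.0
    if after is None:
        return before + 1.0
    return (before + after) / 2.0

# Items keyed by name -> position, mirroring the walkthrough above.
positions = {"item1": 0.0, "item2": 1.0, "item3": 2.0, "item4": 3.0}

# Move item 4 between items 1 and 2.
positions["item4"] = position_between(positions["item1"], positions["item2"])
print(sorted(positions, key=positions.get))  # ['item1', 'item4', 'item2', 'item3']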
I'd do a consecutive number, with a trigger on the table that "makes room" for a priority if it already exists.
I had this problem as well. I was under heavy time pressure (aren't we all) and I went with option #1, and only updated rows that changed.
If you swap item 1 with item 10, just do two updates to swap their order numbers. I know it is algorithmically simple, and it is O(n) in the worst case, but that worst case is a total permutation of the list. How often is that going to happen? That's for you to answer.
I had the same issue and probably spent at least a week concerning myself with the proper data modeling, but I think I've finally got it. Using the array datatype in PostgreSQL, you can store the primary key of each ordered item in an array and update that array with insertions or deletions when your order changes. Referencing a single row lets you map all your objects based on the ordering in the array column.
It's still a bit of a choppy solution, but it will likely work better than option #1, since option #1 requires updating the order number of all the other rows when the ordering changes.
Scheme #1 and Scheme #3 have the same complexity in every operation except INSERT writes. Scheme #1 has O(n) writes on INSERT and Scheme #3 has O(1) writes on INSERT.
For every other database operation, the complexity is the same.
Scheme #2 should not even be considered because its DELETE requires O(n) reads and writes. Scheme #1 and Scheme #3 have O(1) DELETE for both read and write.
New method
If your elements have a distinct parent element (i.e. they share a foreign key to the same row), then you can try the following ...
Django offers a database-agnostic way of storing a list of integers within a CharField(). One drawback is that the length of the stored string can't be greater than max_length, which is DB-dependent.
In terms of complexity, this would give Scheme #1 O(1) writes for INSERT, because the ordering information would be stored as a single field in the parent element's row.
Another drawback is that a JOIN to the parent row is now required to update the ordering.
https://docs.djangoproject.com/en/dev/ref/validators/#django.core.validators.validate_comma_separated_integer_list
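A minimal Django sketch of this parent-side ordering; the Book/Chapter models are hypothetical, and a configured Django project is assumed:
from django.core.validators import validate_comma_separated_integer_list
from django.db import models

class Book(models.Model):
    # Child ordering stored on the parent as e.g. "3,1,2" (Chapter primary keys).
    chapter_order = models.CharField(
        max_length=255,
        validators=[validate_comma_separated_integer_list],
        blank=True,
    )

class Chapter(models.Model):
    book = models.ForeignKey(Book, on_delete=models.CASCADE)
    title = models.CharField(max_length=200)

def ordered_chapters(book):
    # Fetch the children once, then sort them by the parent's stored order.
    order = [int(pk) for pk in book.chapter_order.split(",") if pk]
    chapters = {c.pk: c for c in book.chapter_set.all()}
    return [chapters[pk] for pk in order if pk in chapters]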