I want to divide a large Redis set into batches by scanning through it, but the cursor becomes 0 after iterating only some of the elements. For example, with a set of 250000 elements, SSCAN paginates through about 70000 of them and then the iteration ends.
Does anyone know why?
As @Niloct said, there are some conditions that guarantee a full iteration:
https://redis.io/commands/scan#scan-guarantees
A full iteration always retrieves all the elements that were present in the collection from the start to the end of a full iteration
When you iterate a large set that has an expiration time, you should control the speed of the iteration with the COUNT option. If you pass a small COUNT, the iteration takes much longer, and in that time some elements in the set may expire before you reach them, so you miss them.
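For illustration, here is a minimal batching loop using redis-py; the set name 'bigset', the COUNT of 5000 and the process_batch helper are placeholders of mine, not from the question:

import redis

r = redis.Redis()

def process_batch(members):
    print(len(members))  # placeholder for your per-batch processing

cursor = 0
while True:
    # COUNT is only a hint, but larger values mean fewer round trips and a
    # shorter overall iteration, leaving less time for data to expire mid-scan.
    cursor, members = r.sscan('bigset', cursor=cursor, count=5000)
    process_batch(members)
    if cursor == 0:  # a returned cursor of 0 means the full iteration is done
        break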
I am trying to perform an analysis on some data; however, it needs to run much faster!
These are the steps I follow. Please recommend any solutions that you think might speed up the processing time.
ts is a datetime object and the "time" column in Data is in epoch time. Note that Data might include up to 500000 records.
Data = pd.DataFrame(RawData) # (RawData is a list of lists)
Data.loc[:, 'time'] = pd.to_datetime(Data.loc[:, 'time'], unit='s')  # epoch seconds -> datetime64
I find the index of the first row in Data which has a time object greater than my ts as follows:
StartIndex = Data.loc[:, 'time'].searchsorted(ts)
StartIndex is usually very low and is found within a few records from the beginning; however, I have no idea whether the size of Data affects finding this index.
Now we get to the hard part: within Data there is a column called "PDNumber", and I have two other variables called Max_1 and Min_1. I have to find the index of the first row in which the "PDNumber" value goes above Max_1 or falls below Min_1, whichever happens first; this search starts from StartIndex and runs through the end of the dataframe, and the index found is called SecondStartIndex. Then, with another two variables called Max_2 and Min_2, we search the "PDNumber" column again, from SecondStartIndex onward, for the index of the first row that goes above Max_2 or falls below Min_2; this index is called ThirdIndex.
Right now, I use a for loop that advances an index by 1 at each step until it reaches SecondStartIndex, and from there a while loop (with a counter) through the rest of the dataframe to find ThirdIndex.
Any suggestions on speeding up the process time?
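One way to avoid the Python-level loops entirely is to vectorize the threshold search with boolean masks. A minimal sketch, assuming Data, ts, Max_1/Min_1 and Max_2/Min_2 are defined as described above; np.argmax on a boolean array returns the position of the first True:

import numpy as np

StartIndex = Data.loc[:, 'time'].searchsorted(ts)
pd_vals = Data['PDNumber'].to_numpy()

# First row at or after StartIndex where PDNumber leaves the [Min_1, Max_1] band
mask1 = (pd_vals[StartIndex:] > Max_1) | (pd_vals[StartIndex:] < Min_1)
SecondStartIndex = (StartIndex + int(np.argmax(mask1))) if mask1.any() else None

# First row at or after SecondStartIndex where PDNumber leaves the [Min_2, Max_2] band
if SecondStartIndex is not None:
    mask2 = (pd_vals[SecondStartIndex:] > Max_2) | (pd_vals[SecondStartIndex:] < Min_2)
    ThirdIndex = (SecondStartIndex + int(np.argmax(mask2))) if mask2.any() else None

Because the comparisons run in compiled code rather than a Python loop, this should be dramatically faster on 500000 rows.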
I have a redis database with a few million keys. Sometimes I need to query keys by a pattern, e.g. 2016-04-28:*, for which I use SCAN. The first call should be
scan 0 match 2016-04-28:*
It then returns a batch of keys and the next cursor, or 0 if the search is complete.
However, if I run a query and there are no matching keys, SCAN still returns a non-zero cursor but an empty set of keys. This happens on every successive call, so the search does not seem to end for a really long time.
Redis docs say that
SCAN family functions do not guarantee that the number of elements returned per call are in a given range. The commands are also allowed to return zero elements, and the client should not consider the iteration complete as long as the returned cursor is not zero.
So I can't just stop when I get an empty set of keys.
Is there a way I can speed things up?
You'll always need to complete the scan (i.e. get cursor == 0) to be sure there are no matches. You can, however, use the COUNT option to reduce the number of iterations. The default value of 10 is fast; if empty replies are a common scenario with your match pattern, start increasing COUNT (e.g. doubling it, or using powers of two, with a max cap just in case) with every empty reply, to make Redis "search harder" for keys. By doing so you'll be saving on network round trips, so it should "speed things up".
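A sketch of that idea with redis-py; the function name, the doubling policy and the cap are my own choices:

import redis

r = redis.Redis()

def scan_match(pattern, start_count=10, max_count=16384):
    cursor, count = 0, start_count
    while True:
        cursor, keys = r.scan(cursor=cursor, match=pattern, count=count)
        if keys:
            count = start_count  # productive reply: reset the hint
            yield from keys
        else:
            count = min(count * 2, max_count)  # empty reply: make Redis search harder
        if cursor == 0:  # full iteration complete, safe to stop
            return

for key in scan_match('2016-04-28:*'):
    print(key)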
Redis has a SCAN command that may be used to iterate keys matching a pattern etc.
Redis SCAN doc
You start by giving a cursor value of 0; each call returns a new cursor value which you pass into the next SCAN call. A value of 0 indicates the iteration is finished. Supposedly no server or client state is needed (except for the cursor value).
I'm wondering how Redis implements the scanning algorithm-wise?
You can find the answer in Redis's dict.c source file; I'll quote the relevant part of it below.
Iterating works the following way:
1) Initially you call the function using a cursor (v) value of 0.
2) The function performs one step of the iteration, and returns the new cursor value you must use in the next call.
3) When the returned cursor is 0, the iteration is complete.
The function guarantees all elements present in the dictionary get returned between the start and end of the iteration. However, it is possible some elements get returned multiple times. For every element returned, the callback argument 'fn' is called with 'privdata' as first argument and the dictionary entry 'de' as second argument.
How it works
The iteration algorithm was designed by Pieter Noordhuis. The main idea is to increment a cursor starting from the higher order bits. That is, instead of incrementing the cursor normally, the bits of the cursor are reversed, then the cursor is incremented, and finally the bits are reversed again.
This strategy is needed because the hash table may be resized between iteration calls. dict.c hash tables are always power of two in size, and they use chaining, so the position of an element in a given table is given by computing the bitwise AND between Hash(key) and SIZE-1 (where SIZE-1 is always the mask that is equivalent to taking the remainder of the division between the Hash of the key and SIZE).
For example if the current hash table size is 16, the mask is (in binary) 1111. The position of a key in the hash table will always be the last four bits of the hash output, and so forth.
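A toy rendition of that reversed-bit increment (my own Python illustration, not the actual dict.c code) makes the visiting order visible for a size-16 table:

def rev(v, bits):
    # Reverse the lowest `bits` bits of v.
    out = 0
    for _ in range(bits):
        out = (out << 1) | (v & 1)
        v >>= 1
    return out

def next_cursor(cursor, bits=4):
    # One scan step: reverse the bits, increment, reverse back.
    return rev(rev(cursor, bits) + 1, bits)

cursor, order = 0, []
while True:
    order.append(cursor)
    cursor = next_cursor(cursor)
    if cursor == 0:  # wrapped around: every bucket visited exactly once
        break
print(order)  # [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]

Note how the high-order bit toggles fastest: that is what lets already-visited low-bit suffixes stay visited when the table grows.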
What happens if the table changes in size?
If the hash table grows, elements can go anywhere in one multiple of the old bucket: for example let's say we already iterated with a 4 bit cursor 1100 (the mask is 1111 because hash table size = 16).
If the hash table will be resized to 64 elements, then the new mask will be 111111. The new buckets you obtain by substituting in ??1100 with either 0 or 1 can be targeted only by keys we already visited when scanning the bucket 1100 in the smaller hash table.
By iterating the higher bits first, because of the inverted counter, the cursor does not need to restart if the table size gets bigger. It will continue iterating using cursors without '1100' at the end, and also without any other combination of the final 4 bits already explored.
Similarly when the table size shrinks over time, for example going from 16 to 8, if a combination of the lower three bits (the mask for size 8 is 111) were already completely explored, it would not be visited again because we are sure we tried, for example, both 0111 and 1111 (all the variations of the higher bit) so we don't need to test it again.
Wait... You have TWO tables during rehashing!
Yes, this is true, but we always iterate the smaller table first, then we test all the expansions of the current cursor into the larger table. For example if the current cursor is 101 and we also have a larger table of size 16, we also test (0)101 and (1)101 inside the larger table. This reduces the problem back to having only one table, where the larger one, if it exists, is just an expansion of the smaller one.
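Continuing the toy illustration above (again my own sketch, not dict.c itself), the expansions of a small-table cursor into the larger table are just every combination of the extra high-order bits:

def expansions(cursor, small_bits, large_bits):
    # All large-table buckets whose low `small_bits` bits equal `cursor`.
    return [cursor | (i << small_bits) for i in range(1 << (large_bits - small_bits))]

print([bin(b) for b in expansions(0b101, 3, 4)])  # ['0b101', '0b1101'], i.e. (0)101 and (1)101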
Limitations
This iterator is completely stateless, and this is a huge advantage, including no additional memory used.
The disadvantages resulting from this design are:
1) It is possible we return elements more than once. However, this is usually easy to deal with at the application level.
2) The iterator must return multiple elements per call, as it needs to always return all the keys chained in a given bucket, and all the expansions, so we are sure we don't miss keys moving during rehashing.
3) The reverse cursor is somewhat hard to understand at first, but this comment is supposed to help.
My code needs to frequently get the top score member from a sorted set of Redis.
The time complexity for zrangebyscore is O(logN): http://redis.io/commands/zrangebyscore.
Since I only want to get the top score one, will Redis optimize it to return top score member in O(1) time?
If you're trying to get the top score so frequently that ZRANGE's complexity is an issue, cache the top score independently of the sorted set and you'll be able to get to it in O(1).
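A minimal sketch of that caching idea with redis-py; the key names and the single write path are assumptions of mine:

import redis

r = redis.Redis()

def add_score(member, score):
    r.zadd('scores', {member: score})
    # Refresh the cached top after each write: one O(log N) lookup per write...
    top_member, top_score = r.zrevrange('scores', 0, 0, withscores=True)[0]
    r.set('scores:top', top_member)

def get_top_member():
    # ...so that every read is a plain O(1) GET.
    return r.get('scores:top')

If several writers touch the sorted set, you'd want to make the update atomic (e.g. via a Lua script), but the principle is the same.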
The Redis documentation doesn't describe such an optimization. The page you linked to for ZRANGEBYSCORE states (emphasis added):
Time complexity: O(log(N)+M) with N being the number of elements in the sorted set and M the number of elements being returned. If M is constant (e.g. always asking for the first 10 elements with LIMIT), you can consider it O(log(N)).
Given this, it seems that the time complexity will not be O(1), unless of course your sorted set contains only one element. Rather, the time complexity depends on the number of elements in the sorted set and will still be O(log(N)).
I'm using Redis 2.6 and I've run into strange behavior of the ZRANGEBYSCORE command.
I have a sorted set with a few million elements.
Something like this:
10 marry
15 john
25 bob
...
So compare these two queries:
ZRANGEBYSCORE longset 25 50 LIMIT 0 20 works like a charm; it takes milliseconds.
ZRANGEBYSCORE longset 25 50 hangs for minutes!
All the elements I'm interested in are within the first hundred of the set. I think there should be no need to scan elements with a score greater than 50, because it is a SORTED set.
Please explain how Redis scans sorted sets and why there is such a big difference between these two queries.
One of the best things about Redis, IMO, is that you can check the time complexity of each command in the docs. The docs for zrangebyscore specifies:
Time complexity: O(log(N)+M) with N being the number of elements in the sorted set and M the number of elements being returned. If M is constant (e.g. always asking for the first 10 elements with LIMIT), you can consider it O(log(N)).
[...]
Keep in mind that if offset is large, the sorted set needs to be traversed for offset elements before getting to the elements to return, which can add up to O(N) time complexity.
This means that if you know you only need a certain number of items and specify LIMIT offset count with an offset at (or close to) 0, you can consider it O(log(N)); but if the number of returned items is high, or the offset is high, it can approach O(N).
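In redis-py terms, the two queries from the question differ only in whether M is capped (assuming the longset key above):

import redis

r = redis.Redis()

# M capped at 20 with offset 0: O(log(N) + 20), effectively O(log(N))
fast = r.zrangebyscore('longset', 25, 50, start=0, num=20)

# M = every element scoring in [25, 50]; the traversal and the reply payload
# both grow with the number of matches, so this can approach O(N)
slow = r.zrangebyscore('longset', 25, 50)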