Time complexity of zadd when value has score greater than highest score present in the targeted sorted set - redis

If every value one adds to a sorted set (redis) is one with the highest score, will the time complexity be O(log(N)) for each zadd?
Or does Redis optimize for this edge case (e.g. when the score is higher than the highest score already in the set, simply place the new member at the top spot)?
Practically, I ask because I keep a global sorted set in my app where values are zadded with time since epoch as the score. And I'm wondering whether this will still be O(log(N)), or would it be faster?

Once a Sorted Set has grown over the thresholds set by the zset-max-ziplist-* configuration directives, it is encoded as a skip list. Optimizing insertion for this edge case seems impossible due to the need to maintain the skip list's upper levels. A cursory review of the source code shows that, as expected, this isn't handled in any special way.
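For intuition, here is a rough way to check this empirically (a sketch, not part of the answer above): using Python with the redis-py client against a local Redis instance, time batches of ZADD calls whose scores are always the new maximum and watch the per-call cost grow only very slowly as the set grows. The key name "events" and the batch sizes are made up for the example.

    import time
    import redis

    r = redis.Redis()           # assumes a local Redis instance
    r.delete("events")

    batch = 100_000
    for round_no in range(1, 6):
        start = time.perf_counter()
        pipe = r.pipeline(transaction=False)
        for i in range(batch):
            # score = time since epoch, so every insert lands at the top of the set
            pipe.zadd("events", {f"m:{round_no}:{i}": time.time()})
        pipe.execute()
        elapsed = time.perf_counter() - start
        print(f"after {r.zcard('events'):>7} members: {elapsed / batch * 1e6:.1f} us per ZADD")

The absolute numbers mostly measure pipeline and network overhead; the point is only that the per-ZADD cost barely changes as N grows, which is consistent with O(log(N)) insertion even when every new score is the maximum.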

Related

Infinite scroll algorithm for random items with different weight (probability to show to the user)

I have a web / mobile application that should display an infinite scroll view (the continuation of the list of items is loaded periodically and dynamically) of items, where each item has a weight: the bigger an item's weight compared to the weights of the other items, the higher its chance of being loaded and displayed in the list. The items should be loaded randomly; only their chances of appearing in the list should differ.
I am searching for an efficient algorithm / solution or at least hints that would help me achieve that.
Some points worth mentioning:
the weight has these bounds: 0 <= w < infinity.
the weight is not a static value, it can change over time based on some item properties.
every item with a weight higher than 0 should have a chance to be displayed to the user even if the weight is significantly lower than the weight of other items.
when the user scrolls and performs multiple requests to the API, they should not see duplicate items, or at least the chance of that should be low.
I use a SQL database (PostgreSQL) for storing the items, so the solution should be efficient for this type of database. (It doesn't have to be a purely SQL solution.)
Hope I didn't miss anything important. Let me know if I did.
The following are some ideas to implement the solution:
The database table should have a column where each entry is a number generated as follows:
log(R) / W,
where—
W is the record's weight, greater than 0 (stored in its own column), and
R is a per-record uniform random number in (0, 1)
(see also Arratia, R., "On the amount of dependence in the prime factorization of a uniform random integer", 2002). Then take the records with the highest values of that column as the need arises.
However, note that SQL has no standard way to generate random numbers; DBMSs that implement SQL have their own ways to do so (such as RANDOM() for PostgreSQL), but how they work depends on the DBMS (for example, compare MySQL's RAND() with T-SQL's NEWID()).
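As a concrete illustration of the keying scheme above, here is a minimal Python sketch. It assumes a simple in-memory list of (item, weight) pairs rather than a table; in PostgreSQL the same key could be computed with ln(random()) / weight and the page fetched with ORDER BY ... DESC LIMIT.

    import math
    import random

    items = [("a", 1.0), ("b", 5.0), ("c", 0.5)]  # (item, weight > 0) -- made-up data

    def weighted_page(items, k):
        """Return k items; higher weight => higher chance of appearing early."""
        # 1.0 - random.random() lies in (0, 1], which keeps log() away from log(0)
        keyed = [(math.log(1.0 - random.random()) / w, item) for item, w in items]
        keyed.sort(reverse=True)  # records with the highest key values come first
        return [item for _, item in keyed[:k]]

    print(weighted_page(items, 2))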
Peter O's answer is a good idea, but it has some issues. I would expand on it a bit to allow better, user-specific shuffling, at the cost of more database space:
Use a single column, but store multiple fields in it. I recommend the Postgres JSONB type (which stores the data as JSON that can be indexed and queried). Use several fields, each holding its own log(R) / W value; roughly log(U) + log(P) of them, where U is the number of users and P is the number of items, with a minimum of probably 5 fields. Add an index over all the fields within the JSONB. Add more fields as the number of users/items gets high enough.
Have a background process that regularly rotates the numbers in #1. This can cause duplication, but if you are only rotating a small subset of the items at a time (such as O(sqrt(P)) of them), the odds of the user noticing are low. Especially if you are querying for data backwards and forwards and stitching/deduping it together before displaying the next row(s). Careful use of manual pagination adjustments helps a lot here if it becomes an issue.
Before displaying items, randomly pick one of the index fields and sort the data on it. This means you have roughly a 1 in log(P) + log(U) chance of showing the user the same ordering again. Ideally the user would pick a random subset of those index fields (to avoid seeing the same order twice) and use that as the order, but I can't think of a way to make that work and still be practical. Though a random shuffle of the index fields and sorting by that might be practical if the randomized weights are normalized, such that the sort order matters.
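A hedged sketch of this multi-key variant, with plain Python standing in for the JSONB column; the field names k0..k4, the item data, and the SQL in the comments are illustrative assumptions, not a tested schema.

    import json
    import math
    import random

    NUM_KEYS = 5  # roughly log(U) + log(P), with a floor of about 5, per the answer

    def shuffle_keys(weight):
        """Pre-computed sort keys to store in the item's JSONB column."""
        return {f"k{i}": math.log(1.0 - random.random()) / weight for i in range(NUM_KEYS)}

    items = {name: {"weight": w, "keys": shuffle_keys(w)}
             for name, w in [("a", 1.0), ("b", 5.0), ("c", 0.5)]}
    print(json.dumps(items["a"]["keys"]))  # what would be written into the JSONB column

    # Serving a page: pick one key field at random and order by it, descending.
    field = f"k{random.randrange(NUM_KEYS)}"
    page = sorted(items, key=lambda name: items[name]["keys"][field], reverse=True)
    print(field, page)

    # In PostgreSQL the equivalent read might look like (hypothetical schema):
    #   SELECT id FROM items ORDER BY (shuffle->>'k3')::float8 DESC LIMIT 20;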

Compensating for laggy positive feedback

I'm trying to make a program run as accurately as possible while staying at a fixed frame rate. How do you do this?
Formally, I have some parameter b in [0,1] that I can set to determine how accurate my computations are (where 0 is least accurate, 0.5 is fairly accurate, and 1 is very accurate). The higher this is, the lower frame rate I will get.
However, there is a "lag", where after changing this parameter, the frame rate won't change until d milliseconds afterwards, where d can vary and is unknown.
Is there a way to change this parameter in a way that prevents "wiggling"? The problem is that if I am experiencing a low frame rate and I adjust the parameter and measure again, the frame rate will only have changed slightly (because of the lag), so I adjust the parameter further; by the time the change takes full effect the frame rate has overshot, so I have to adjust the parameter back the other way, and I get this oscillating behavior. Is there a way to prevent this? I need to be as reactive as possible, because changing too slowly will leave the frame rate wrong for too long.
Looks like you need an adaptive feedback dampener. Trying an electrical circuit analogy :)
I'd first try to get more info about what the circuit's input signal and responsiveness look like. So I'd first make the algorithm update b not with the desired values, but with the previous value plus or minus (as needed, towards the desired value) a small fixed increment, say .01 (ignore the sloppy response time for now). While doing so I'd collect and plot/analyze the "desired" b values, looking for:
the general shape of the changes: smooth or rather "steppy" or "spiky"? (spiky would require a stronger dampening to prevent oscillations, steppy would require a weaker dampening to prevent lagging)
the maximum/typical/minimum changes in values from sample to sample
the distribution of the changes in values from sample to sample (I'd plan for the algorithm to react best to changes in a typical range, say the 20-80% range, and consider lagging acceptable for changes higher than that, or oscillations for changes lower than that)
The end goal is to be able to obtain parameters for operating alternately in 2 modes:
a high-speed tracking mode (also the system's initial mode)
a normal tracking mode
In high-speed tracking mode the b value updates can be either:
not dampened - the update value is the full desired value - only if the shape of the changes is not spiky, and only on the first b update after entering the high-speed tracking mode. This would help reduce lagging.
dampened - the update delta is just a fraction (the dampening factor) of the desired delta, reflecting the fact that the effect of the previous b update might not yet be fully reflected in the current frame rate due to d. Dampening helps prevent oscillations at the expense of potentially increasing lag (always conflicting requirements). The dampening factor would be higher for a smooth shape and smaller for a spiky one.
Switching from high-speed tracking mode to normal tracking mode can be done when the delta between b's previous value and its desired value falls below a certain mode change threshold value (possibly required to hold for a minimum number of consecutive samples). The mode change threshold value would be initially estimated from the info collected above and/or determined empirically.
In normal tracking mode the delta between b's previous value and its desired value remains below the mode change threshold value and is either ignored (no b update) or an update is made, either with the desired value or some averaged one - tiny course corrections, keeping the frame rate practically constant: no dampening, no lagging, no oscillations.
When, in normal tracking mode, the delta between b's previous value and its desired value goes above the mode change threshold value, the system switches back to the high-speed tracking mode.
I would also try to get a general idea of what the d response time looks like. To do that, I'd change the algorithm to update b with the desired values not at every iteration, but every n iterations (maybe even re-try with several n values). This should indicate how many sample periods a b value change generally takes to become fully effective, and should be reflected in the dampening factor: the longer it takes for a change to take effect, the stronger the dampening should be to prevent oscillations.
Of course, this is just the general idea, a lot of experimental trial/adjustment iterations may be required to reach a satisfactory solution.
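To make the two-mode idea concrete, here is a minimal Python sketch; the dampening factor, mode-change threshold and sample counts are placeholders that would come out of the measurements described above, not values from the answer.

    HIGH_SPEED, NORMAL = "high-speed", "normal"

    class AccuracyController:
        def __init__(self, damping=0.3, threshold=0.05, settle_samples=3):
            self.damping = damping            # fraction of the desired delta applied per update
            self.threshold = threshold        # mode-change threshold on |desired - current|
            self.settle_samples = settle_samples
            self.mode = HIGH_SPEED            # initial mode, per the answer
            self.b = 0.5
            self._settled = 0

        def update(self, desired_b):
            """Feed in the b you would ideally want each frame; returns the b to actually use."""
            delta = desired_b - self.b
            if self.mode == HIGH_SPEED:
                self.b += self.damping * delta    # dampened step to avoid oscillation
                if abs(delta) < self.threshold:
                    self._settled += 1
                    if self._settled >= self.settle_samples:
                        self.mode, self._settled = NORMAL, 0
                else:
                    self._settled = 0
            else:  # NORMAL: ignore tiny deltas, re-enter high-speed tracking on big ones
                if abs(delta) >= self.threshold:
                    self.mode = HIGH_SPEED
                    self.b += self.damping * delta
            self.b = min(1.0, max(0.0, self.b))
            return self.b

Each frame you would compute a "desired" b from the measured frame rate and pass it through update(); the controller decides how much of that request is actually applied.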

Redis: Maximum score size for sorted sets? Score + Unique ids = Unique Scores?

I'm using timestamps as the score. I want to prevent duplicates by appending a unique object-id to the score. Currently, this id is a 6 digit number (the highest id right now is 221849), but it is expected to increase over a million. So, the score will be something like
1407971846221849 (timestamp:1407971846 id:221849) and will eventually reach 14079718461000001 (timestamp:1407971846 id:1000001).
My concern is not being able to store scores because they've reached the max allowed.
I've read the docs, but I'm a bit confused. I know, basic math. But bear with me, I want to get this right.
Redis sorted sets use a double 64-bit floating point number to represent the score. In all the architectures we support, this is represented as an IEEE 754 floating point number, that is able to represent precisely integer numbers between -(2^53) and +(2^53) included. In more practical terms, all the integers between -9007199254740992 and 9007199254740992 are perfectly representable. Larger integers, or fractions, are internally represented in exponential form, so it is possible that you get only an approximation of the decimal number, or of the very big integer, that you set as score.
There's another thing bothering me right now. Would the increase in ids break the chronological sort sequence?
I will appreciate any insights, suggestions, different perspectives, or flat-out being told that what I'm trying to do is nonsense.
Thanks for any help.
No, it won't break the "chronological" order, but you may lose the precision of the last digits, so two members may end up having the same score (i.e. non-unique).
There is no problem with duplicate scores. It is just maintaining a sorted set in memory. Members are unique but the scores may be the same. If you want chronological processing I would just rely on the timestamp without adding an id to it.
Appending an id would break the chronological sort if your ids are mixed such that you could have timestamps 1, 2, 3 (simple example) and ids 100, 10, 1; then you won't get the correct sort. If your ids are always added monotonically, then you should just use the id as the score.
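A quick check of the precision-loss point, using the numbers from the question (plain Python; IEEE 754 doubles behave the same way as Redis sorted-set scores):

    MAX_EXACT = 2 ** 53                      # 9007199254740992, from the Redis docs quoted above
    score = 14079718461000001                # timestamp 1407971846 followed by id 1000001

    print(score > MAX_EXACT)                 # True: beyond the exact-integer range of a double
    print(float(score) == float(score - 1))  # True: two distinct scores collapse to one double

So once the composed score passes 2^53, neighbouring ids can map to the same score, which is exactly the last-digit precision loss described above.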

Redis: Is it still O(logN) to get top score member from sorted set?

My code needs to frequently get the top score member from a sorted set of Redis.
The time complexity for zrangebyscore is O(logN): http://redis.io/commands/zrangebyscore.
Since I only want to get the top score one, will Redis optimize it to return top score member in O(1) time?
If you're trying to get the top score so frequently that ZRANGE's complexity is an issue, cache the top score independently of the sorted set and you'll be able to get to it with O(1).
The Redis documentation doesn't describe such an optimization. The page you linked to for ZRANGEBYSCORE states (emphasis added):
Time complexity: O(log(N)+M) with N being the number of elements in the sorted set and M the number of elements being returned. If M is constant (e.g. always asking for the first 10 elements with LIMIT), you can consider it O(log(N)).
Given this, it seems that the time complexity will not be O(1), unless of course your sorted set contains only one element. Rather, the time complexity will be dependent on the number of elements in the sorted set and will still be O(log(N)).
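For completeness, here is a sketch of both options in Python with the redis-py client; the key names "leaderboard" and "leaderboard:top" are made up for the example.

    import redis

    ZSET_KEY, TOP_KEY = "leaderboard", "leaderboard:top"
    r = redis.Redis(decode_responses=True)   # assumes a local Redis instance

    # Option 1: ask the sorted set directly -- the O(log(N)+M) path with M = 1.
    print(r.zrevrange(ZSET_KEY, 0, 0, withscores=True))

    # Option 2: keep a small hash cache of the current top next to every ZADD,
    # then read it back with HGETALL, which is O(1) for a fixed-size hash.
    def add_member(member, score):
        r.zadd(ZSET_KEY, {member: score})
        cached = r.hget(TOP_KEY, "score")
        if cached is None or score >= float(cached):
            r.hset(TOP_KEY, mapping={"member": member, "score": score})
        # Note: the read-then-write above is not atomic; concurrent writers would
        # need WATCH/MULTI or a small Lua script to keep the cache consistent.

    add_member("alice", 42.0)
    print(r.hgetall(TOP_KEY))   # {'member': 'alice', 'score': '42.0'}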

Algorithm to decide if true based on 0% - 100% frequency threshold

Sorry if this is a duplicate question. I did a search but wasn't sure exactly what to search for.
I'm writing an app that performs a scan. When the scan is complete we need to decide if an item was found or not. Whether or not the item is found is decided by a threshold that the user can set: 0% of the time, 25% of the time, 50% of the time, 75% of the time or 100% of the time.
Obviously if the user chooses 0% or 100% we can just use false/true, but I'm drawing a blank on how this should work for the other thresholds.
I assume I'd need to store and increase some value every time a monster is found.
Thanks for any help in advance!
As #nix points out, it sounds like you want to generate a random number and threshold it based on the percentage of the time you wish to have 'found' something.
You need to be careful that the range you select, and the way you threshold it, achieve the desired result, and also that the random number generator you use is suitably distributed. When dealing in percentages, an obvious approach is to generate one of 100 uniformly distributed options and threshold appropriately, e.g. generate a number in 0-99 and check that it is less than your percentage.
A quick check shows that you will never get a number less than 0, so 0% achieves the expected result; you will always get a number less than 100, so 100% achieves the expected result; and there are 50 options (0-49) less than 50 out of 100 options (0-99), so 50% achieves the expected result as well.
A subtly different approach, given that the user can only choose ranges in 25% increments, would be to generate numbers in the range 0-3 and return True if the number is less than the percentage / 25. If you were to store the user-selection as a number from 0-4 (0: 0%, 1: 25% .. 4: 100%) this might be even simpler.
Various approaches to pseudo-random number generation in Objective-C are discussed here: Generating random numbers in Objective-C.
Note that mention is made of the uniformity of the random numbers potentially being sensitive to the range depending on the method you go with.
To be confident, you can always do some straightforward testing by calling your function a large number of times, keeping track of the number of times it returns true, and comparing this to the desired percentage.
Generate a random number between 0 and 99. If the number is less than the threshold, an item is found. Otherwise, no item is found.
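A sketch of the uniform 0-99 check, in Python for brevity (the question is about Objective-C, where arc4random_uniform(100) would play the role of randrange(100) here):

    import random

    def item_found(threshold_percent):
        """True roughly threshold_percent% of the time; exact for 0 and 100."""
        return random.randrange(100) < threshold_percent

    # The straightforward self-test suggested above:
    trials = 100_000
    hits = sum(item_found(25) for _ in range(trials))
    print(hits / trials)   # approximately 0.25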