Compute the difference between two sets (sorted and simple) - redis

Is there a way to compute the difference between two sorted sets (zset) or do I have to use simple sets for this?
Problem:
Set F contains a sorted list of IDs (sorted set, full list)
Set K contains a list of IDs (simple set, subset of F)
I want to retrieve every entry in F, in order, that's not in K.
Is this possible using Redis alone or do I have to do the computation on the application? If yes, what is the best way?
EDIT: SDIFF does not suit this purpose as it doesn't allow sorted sets.

Make a copy of F as a simple set. Let's call it G. Now perform the SDIFF.
Or...
Make a copy of F as a sorted set. Let's call it G. Iterate through K and remove each element from G.
SDIFF really should work on sorted sets, regular sets, or combinations. But, at this time, it does not.
Also, if F is very large, you may see some performance hits when you make a copy of it. In this case, create a set G in your Redis DB that is updated whenever K is updated. That is, F and G are initially equal; as you add elements to K, remove them from G.
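For the second approach, here is a minimal redis-py sketch, assuming the key names F, K and G from the question:
import redis

r = redis.Redis()

# Copy the sorted set F to G. ZUNIONSTORE over a single source key
# is an easy way to duplicate a sorted set, scores included.
r.zunionstore('G', {'F': 1})

# Remove every member of the plain set K from the sorted copy.
for member in r.sscan_iter('K'):
    r.zrem('G', member)

# G now holds every entry of F that is not in K, still in score order.
result = r.zrange('G', 0, -1)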

Related

Searching for groups of objects given a reduction function

I have a few questions about a type of search.
First, is there a name and if so what is the name of the following type of search? I want to search for subsets of objects from some collection such that a reduction and filter function applied to the subset is true. For example, say I have the following objects, each of which contains an id and a value.
[A,10]
[B,10]
[C,10]
[D,9]
[E,11]
I want to search for "all the sets of objects whose summed values equal 30" and I would expect the output to be, {{A,B,C}, {A,D,E}, {B,D,E}, {C,D,E}}.
Second, is the only strategy to perform this search brute-force? Is there some type of general-purpose algorithm for this? Or are search optimizations dependent on the reduction function?
Third, if you came across this problem, what tools would you use to solve it in a general way? Assume the reduction and filter functions could be anything and are not necessarily the sum function. Does SQL provide a good API for this type of search? What about Prolog? Any interesting tips and tricks would be appreciated.
Thanks.
I cannot comment on the problem in general, but a brute-force search can easily be done in Prolog.
w(a,10).
w(b,10).
w(c,10).
w(d,9).
w(e,11).
solve(0, [], _).
solve(N, [X], [X|_]) :- w(X, N).
solve(N, [X|Xs], [X|Bs]) :-
w(X, W),
W < N,
N1 is N - W,
solve(N1, Xs, Bs).
solve(N, [X|Xs], [_|Bs]) :- % skip element if previous clause fails
solve(N, [X|Xs], Bs).
Which gives
| ?- solve(30, X, [a, b, c, d, e]).
X = [a,b,c] ? ;
X = [a,d,e] ? ;
X = [b,d,e] ? ;
X = [c,d,e] ? ;
(1 ms) no
SQL is TERRIBLE at this kind of problem. Until recently there was no way to get 'all combinations' of row elements. Now you can do so with recursive common table expressions, but you are forced by their limitations to retain all partial results alongside the final ones, which you then have to filter out at the end. About the only benefit you get from SQL's recursive procedure is that you can stop evaluating possible combinations once a sub-path exceeds 30, your target total. That makes it slightly less ugly than an 'evaluate all 2^N combinations' brute-force solution (unless every combination sums to less than the target total).
To solve this with SQL you would be running an algorithm that can be described as:
Seed your result set with all table entries less than your target total and their value as a running sum.
Iteratively join your prior results with all table entries that were not already used in that result row and whose value, added to the running sum, is less than or equal to the target total. The running sum becomes the old running sum plus the value, and the ID is appended to the ID list. Union this new result with the old results. Iterate until no more records qualify.
Make a final pass of the result set to filter out the partial sums that do not total to your target.
Oh, and unless you make special provisions, solutions {A,B,C}, {C,B,A}, and {A,C,B} all look like different solutions (order is significant).
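Just to make that expand-and-prune idea concrete, here is the same algorithm sketched in Python rather than SQL (the function name and the restriction to later-indexed elements are my own additions; that index bound is what keeps {A,B,C} and {C,B,A} from showing up as separate solutions):
def subset_sums(items, target):
    # items: list of (id, value) pairs. Grow partial selections and prune
    # any branch whose running sum already exceeds the target.
    partials = [([], 0, 0)]  # (chosen ids, running sum, next index to try)
    results = []
    while partials:
        chosen, total, start = partials.pop()
        if total == target:
            results.append(chosen)
            continue
        for i in range(start, len(items)):
            ident, value = items[i]
            if total + value <= target:
                # Only extend with later elements, so each combination is
                # generated exactly once regardless of order.
                partials.append((chosen + [ident], total + value, i + 1))
    return results

print(subset_sums([('a', 10), ('b', 10), ('c', 10), ('d', 9), ('e', 11)], 30))
# prints the four expected combinations (their order is an implementation detail)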

Natural way of indexing elements in Flink

Is there a built-in way to index and access indices of individual elements of DataStream/DataSet collection?
Like in typical Java collections, where you know that e.g. the 3rd element of an ArrayList can be obtained by ArrayList.get(2) and vice versa ArrayList.indexOf(elem) gives us the index of (the first occurrence of) the specified element. (I'm not asking about extracting elements out of the stream.)
More specifically, when joining DataStreams/DataSets, is there a "natural"/easy way to join elements that came (were created) first, second, etc.?
I know there is a zipWithIndex transformation that assigns sequential indices to elements. I suspect the indices always start with 0? But I also suspect that they aren't necessarily assigned in the order the elements were created in (i.e. by their Event Time). (It also exists only for DataSets.)
This is what I currently tried:
DataSet<Tuple2<Long, Double>> tempsJoIndexed = DataSetUtils.zipWithIndex(tempsJo);
DataSet<Tuple2<Long, Double>> predsLinJoIndexed = DataSetUtils.zipWithIndex(predsLinJo);
DataSet<Tuple3<Double, Double, Double>> joinedTempsJo = tempsJoIndexed
.join(predsLinJoIndexed).where(0).equalTo(0)...
And it seems to create wrong pairs.
I see some possible approaches, but they're either non-Flink or not very nice:
1. I could of course assign an index to each element upon the stream's creation and have e.g. a stream of Tuples.
2. Work with event-time timestamps. (I suspect there isn't a way to key by timestamps, and even if there was, it wouldn't be useful for joining multiple streams like this unless the timestamps are actually assigned as indices.)
3. We could try "collecting" the stream first, but then we wouldn't be using Flink anymore.
The 1. approach seems like the most viable one, but it also seems redundant given that the stream should by definition be a sequential collection and as such, the elements should have a sense of orderliness (e.g. `I'm the 36th element because 35 elements already came before me.`).
I think you're going to have to assign index values to elements, so that you can partition the data sets by this index, and thus ensure that two records which need to be joined are being processed by the same sub-task. Once you've done that, a simple groupBy(index) and reduce() would work.
But assigning increasing ids without gaps isn't trivial, if you want to be reading your source data with parallelism > 1. In that case I'd create a RichMapFunction that uses the runtimeContext sub-task id and number of sub-tasks to calculate non-overlapping and monotonic indexes.
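This is not Flink API code, just a plain-Python illustration of that numbering scheme, assuming each parallel sub-task knows its own id and the total parallelism (which is what the RichMapFunction's runtime context gives you):
def local_indexes(subtask_id, num_subtasks, records):
    # Sub-task 0 of 4 hands out 0, 4, 8, ...; sub-task 1 hands out 1, 5, 9, ...
    # The sequences never overlap and each one is monotonically increasing.
    for i, record in enumerate(records):
        yield (subtask_id + i * num_subtasks, record)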

Limiting the number of rows returned by `.where(...)` in pytables

I am dealing with tables with up to a few billion rows and I do a lot of "where(numexpr_condition)" lookups using pytables.
We managed to optimise the HDF5 format so that a simple where-query over 600 million rows finishes in under 20 s (we are still struggling to find out how to make this faster, but that's another story).
However, since this is still too slow for playing around, I need a way to limit the number of results of a query like this simple example (the foo column is of course indexed):
[row['bar'] for row in table.where('(foo == 234)')]
This would return, let's say, 100 million entries and take 18 s, which is way too slow for prototyping and playing around.
How would you limit the result to, say, 10,000 rows?
The database like equivalent query would be roughly:
SELECT bar FROM row WHERE foo==234 LIMIT 10000
Using the stop= attribute is not the way, since it simply takes the first n rows and applies the condition to them. So in the worst case, if the condition is not fulfilled in those rows, I get an empty array:
[row['bar'] for row in table.where('(foo == 234)', stop=10000)]
Using a slice on the list comprehension is also not the right way, since it first creates the whole list and then applies the slice, which of course gives no speed gain at all:
[row['bar'] for row in table.where('(foo == 234)')][:10000]
However, since the iterator produces rows lazily rather than materialising the whole result first, there should surely be a way to stop it after n matches. I just could not find a suitable way of doing that.
Btw, I also tried using zip and range to force a StopIteration:
[row['bar'] for _, row in zip(range(10000), table.where('(foo == 234)'))]
But this just gave me the value of the same row repeated over and over.
Since table.where() returns an iterator that produces rows on demand, you should be able to cut it off early with itertools.islice. Pull the column value out while iterating, though; PyTables reuses the same Row object on each step, which would also explain why the zip/range attempt kept giving you the same row back:
import itertools
bar_values = [row['bar'] for row in itertools.islice(table.where('(foo == 234)'), 10000)]

Redis Sorted Sets: How do I get the first intersecting element?

I have a number of large sorted sets (5m-25m members) in Redis and I want to get the first element that appears in a combination of those sets.
e.g. I have 20 sets and want to take sets 1, 5, 7 and 12 and get only the first element of the intersection of just those sets.
It would seem that a ZINTERSTORE followed by a "ZRANGE foo 0 0" would do a lot more work than I require, as it would calculate the full intersection and then return just the first element. Is there an alternative solution that does not need to calculate the whole intersection?
There is no direct, native alternative, although I'd suggest this:
Create a hash whose fields are your elements. Upon each addition to one of your sorted sets, increment the relevant field (using HINCRBY). Of course, you should only make the increment after checking that the element does not already exist in the sorted set you are adding to.
That way, you can quickly know which elements appear in 4 sets.
UPDATE: Now that I think about it again, it might be too expensive to query your hash for items with a value of 4 (O(n)). Another option would be to create another sorted set whose members are your elements and whose scores get incremented (as I described before, but using ZINCRBY); then you can quickly pull all elements with score 4 (using ZRANGEBYSCORE).
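A minimal redis-py sketch of that updated idea (the helper name and the 'appearance_counts' key are placeholders, not anything Redis provides):
import redis

r = redis.Redis()

def add_member(set_key, member, score):
    # ZADD with nx=True only adds members that are not already present;
    # it returns the number of members actually added.
    added = r.zadd(set_key, {member: score}, nx=True)
    if added:
        # Count how many of the tracked sets this member appears in.
        r.zincrby('appearance_counts', 1, member)

# All members whose counter is 4, i.e. that appear in four of the tracked sets.
in_four_sets = r.zrangebyscore('appearance_counts', 4, 4)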

Representing multiply-linked lists in SQL

I have a data structure consisting of a set of objects which are arranged into a multiply-linked list which is also (isomorphically) a valid DAG. It could be viewed as one single multiply-linked list, or as a series of n doubly-linked lists which may share members. (This is the same data structure from Algorithm for quickly obtaining a partial ordering over multiple linked lists, for those of you following my questions.)
I am looking for a general technique, in no specific SQL dialect, for expressing this multiply-linked list/DAG in SQL, such that it's easy to take a given node and obtain:
The previous and next links in the DAG, given a topological ordering of the DAG
The previous and next links in each doubly-linked list to which this node belongs
Using the example data from that other question:
first = [a, b, d, f, h, i];
second = [a, b, c, f, g, i];
third = [a, e, f, g, h, i];
I'd want to be able to, given node f, obtain [(c|d|e), g] from the overall DAG's topology and also {first: [d, h], second: [c, g], third: [e, g]} from each of the lists orderings.
Here's the fun part: n, the number of doubly-linked lists, is not fixed and may grow at any time. I'd rather not redo the schema each time that happens.
All of the algorithms I've come up with so far either (a) stuff a big pickle into the DB and pull it out in order to calculate orderings, or (b) require that the lists be explicitly enumerated as recursive relations in the DB.
I'll go with an option in (b) if I can't find something better but I'm hoping that there's something magical out there to make this easier.
Pre:
This is a question-and-answer forum, not a 'let's sit down, group-think for a bit, and solve the whole problem' forum.
I think what you want to investigate is a technique called 'modified preorder tree traversal', a mouthful I know, but it allows storing hierarchical data in a flat database as individual entries. Sadly, you do have to do some rewriting on inserts, but the selects can be done in a single query, so it's best for 'many views / few changes' situations like a website. Luckily, you rarely have to rewrite the whole dataset (only the parts you changed and those hierarchically after them).
I remember a good article on the basics of it (from a couple of years ago) but can't find the bookmark at the moment, so start with just a Google search.
EDIT/UPDATE:
link: http://www.sitepoint.com/hierarchical-data-database/
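To make the numbering concrete, here is a small illustrative Python sketch (not tied to any database or schema) of the left/right values a preorder walk assigns; with those two values stored per row, selecting a whole subtree becomes a single range query, which is why reads are cheap while inserts need some renumbering:
def number_tree(node, children, counter=None):
    # Assign nested-set (left, right) values by a preorder walk.
    # `children` maps a node to the list of its child nodes.
    # A descendant's interval always lies strictly inside its ancestor's.
    if counter is None:
        counter = [1]
    bounds = {}
    left = counter[0]
    counter[0] += 1
    for child in children.get(node, []):
        bounds.update(number_tree(child, children, counter))
    right = counter[0]
    counter[0] += 1
    bounds[node] = (left, right)
    return bounds

print(number_tree('root', {'root': ['a', 'b'], 'a': ['c']}))
# {'c': (3, 4), 'a': (2, 5), 'b': (6, 7), 'root': (1, 8)}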
No matter what, from dealing with this issue extensively, you will have to choose where to put the brunt of the work: on view, or on change. Depending on the size of the 'master' tree, you may (like me) decide to break the tree up into parts and use a tree of trees, limiting the update cost.