Store targets as collections that handle logic operation - sql

I think my title is kinda unclear, but I don't know how else to put it.
My problem is:
We have users that belong to groups; there are many types of groups, and every user belongs to exactly one group of each type.
Example: With group types A, B and C, containing respectively the groups (A1; A2; A3), (B1; B2) and (C1; C2; C3)
Every user must have a list of groups like [A1, B1, C1] or [A1, B2, C3], but never [A1, A2, B1] or [A1, C2]
We have messages that target certain groups, but not just as a union; the targeting can involve more complex collection operations
Example: we can have a message intended for [A1, B1, C3], [A1, *, *], [A1|A2, *, *], or even ([A1, B1, C2] | [A2, B2, C1])
(* = any group of the type, | = or)
Messages are stored in a SQL DB, and users can retrieve all messages intended for their groups.
How can I store messages and write my query to reproduce this behavior?

An option could be to encode both the user groups and the message targets in a (big) integer built on the powers of 2, and then base your query on a bitwise AND between user group code and message target code.
The idea is, group 1 is 1, group 2 is 2, group 3 is 4 and so on.
Level 1:
Assumptions:
you know in advance how many group types you have, and you have very few of them
you don't have more than 64 groups per type (assuming you work with 64-bit integers)
the message has only one target: A1|A2,B..,C... is ok, A*,B...,C... is ok, (A1,B1,C1)|(A2,B2,C2) is not.
Solution:
Encode each user group as the corresponding power of 2
Encode each message target as the sum of the allowed values: if groups 1 and 3 are allowed (A1|A3) the code will be 1+4=5, if all groups are allowed (A*) the code will be 2**64-1
you will have a User table and a Message table, and both will have one field for each group type code
The query will be WHERE (u.g1 & m.g1) * (u.g2 & m.g2) * ... * (u.gN & m.gN) <> 0
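A minimal sketch of this encoding in pure Python (group names and the single-type check are illustrative; the actual WHERE clause depends on your engine's bitwise syntax):

def encode(selection, groups):
    """Encode '*' or a '|'-separated selection over `groups` as a bitmask."""
    if selection == '*':
        return (1 << len(groups)) - 1          # all bits set: any group matches
    return sum(1 << groups.index(g) for g in selection.split('|'))

groups_a = ['A1', 'A2', 'A3']
user_a = encode('A1', groups_a)                # 0b001 = 1
target_a = encode('A1|A2', groups_a)           # 0b011 = 3

# One factor of the WHERE clause above: a non-zero AND means this type matches.
assert user_a & target_a != 0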
Level 2:
Assumptions:
you have some more group types, and/or you don't know in advance how many there are or how they are composed
you don't have more than 64 groups in total (e.g. 10 for the first type, 12 for the second, ...)
the message still has only one target as above
Solution:
encode each user group and each message target as a single integer, taking care of the offset: if the first type has 10 groups they will be encoded from 1 to 1023 (2**10-1); then if the second type has 12 groups they will go from 1024 (2**10) to 4194303 (2**(10+12)-1), and so on
you will still have a User table and a Message table, and both will have one single field for the cumulative code
you will need to define a function which is able to check the user group vs the message target separately by each range; this can be difficult to do in SQL, and depends on which engine you are using
The following is a Python implementation of both the encoding and the check:
class IdEncoder:
    def __init__(self, sizes):
        # sizes[i] = number of groups in type i; each type gets its own bit range
        self.sizes = sizes
        self.grouplimits = {}
        offset = 0
        for i, size in enumerate(sizes):
            # (lowest bit value, highest cumulative value) of type i's range
            self.grouplimits[i] = (2**offset, 2**(offset + size) - 1)
            offset += size

    def encode(self, vals):
        # vals[i] is a group number, 'a|b' for alternatives, or '*' for any
        n = 0
        for i, val in enumerate(vals):
            if val == '*':
                # set every bit of this type's range
                g = self.grouplimits[i][1] - self.grouplimits[i][0] + 1
            else:
                svals = val.split('|')
                g = 0
                for sval in svals:
                    g += 2**(int(sval) - 1)
                if i > 0:
                    # shift the bits into this type's range
                    g *= self.grouplimits[i][0]
            n += g
        return n

    def check(self, user, message):
        # test the user code against the message code one bit range at a time
        res = False
        for i, size in enumerate(self.sizes):
            if (user % 2**size) & (message % 2**size) == 0:
                break  # no overlap for this type: no match
            if i < len(self.sizes) - 1:
                user >>= size
                message >>= size
            else:
                res = True  # every type overlapped
        return res
c = IdEncoder([10, 12, 10])
m3 = c.encode(['1|2', '*', '*'])
u1 = c.encode(['1', '1', '1'])
c.check(u1, m3)   # True
u2 = c.encode(['4', '1', '1'])
c.check(u2, m3)   # False
Level 3:
Assumptions:
you adopt one of the above solutions, but you need multiple targets for each message
Solution:
You will need a third table, MessageTarget, containing the target code fields as above and a FK linking to the message
The query will search for all the MessageTarget rows compatible with the User group code(s) and show the related Message data
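Illustratively, assuming the Level 1 per-type code columns and hypothetical table names, the query could look like this:

# Hypothetical schema: message, message_target (FK message_id), app_user.
LEVEL3_QUERY = """
SELECT DISTINCT m.*
FROM message m
JOIN message_target t ON t.message_id = m.id
JOIN app_user u ON (u.g1 & t.g1) * (u.g2 & t.g2) * (u.g3 & t.g3) <> 0
WHERE u.id = :user_id
"""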

So you have 3 main tables:
Messages
Users
Groups
You then create 2 relationship tables:
Message-Group
User-Group
If you want to limit users to have access to just "their" messages then you join:
User > User-Group > Message-Group > Message
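A sketch of that join in SQL (table and column names are hypothetical):

# Hypothetical join: User > User-Group > Message-Group > Message.
MESSAGES_FOR_USER = """
SELECT m.*
FROM users u
JOIN user_group ug ON ug.user_id = u.id
JOIN message_group mg ON mg.group_id = ug.group_id
JOIN messages m ON m.id = mg.message_id
WHERE u.id = :user_id
"""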

fast row-wise boolean operations in sparse matrices

I have a ~4.4M-row dataframe with purchase orders. I'm interested in a column that indicates the presence of certain items in each purchase order. It is structured like this:
df['item_arr'].head()
1 [a1, a2, a5]
2 [b1, b2, c3...
3 [b3]
4 [a2]
There are 4k different items, and in each row there is always at least one. I have generated another 4.4M x 4k dataframe df_te_sub with a sparse structure indicating the same array in terms of booleans, i.e.
c = df_te_sub.columns[[10, 20, 30]]
df_te_sub[c].head()
>>a10 b8 c1
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
The names of the columns are not important, although they are in alphabetical order, for what it's worth.
Given a subset g of items, I am trying to extract the orders (rows) for two different cases:
At least one of the items is present in the row
The items present in the row are all a subset of g
For the first rule, the best way I found was:
c = df_te_sub[g].sparse.to_coo()
rows = pd.unique(c.row)
The second rule presented a challenge. I tried different things but they are all slow:
# using set
s = set(g)
df['item_arr'].apply(s.issuperset)
# using the "non selected items"
c = df_te_sub[df_te_sub.columns[~df_te_sub.columns.isin(g)]].sparse.to_coo()
x = np.ones(len(df_te_sub), dtype='bool')
x[c.row] = False
# mix
s = set(g)
c = df_te_sub[g].sparse.to_coo()
rows = pd.unique(c.row)
df['item_arr'].iloc[rows].apply(s.issuperset)
Any ideas to improve performance? I need to do this for several subsets.
The output can be given either in rows (e.g. [0, 2, 3]) or as a boolean mask (e.g. True False True True ....), as both will work to slice the order dataframe.
I feel like you're overthinking this. If you have a boolean array of membership you've already done 90% of the work.
from scipy.sparse import csc_matrix
# Turn your sparse data into a sparse array
arr = csc_matrix(df_te_sub.sparse.to_coo())
# Get the number of items per row
row_len = arr.sum(axis=1).A.flatten()
# Get the column indices for each item in g and slice your array
arr_col_idx = [df_te_sub.columns.get_loc(g_val) for g_val in g]
# Sum the number of items in g in the slice per row
arr_g = arr[:, arr_col_idx].sum(axis=1).A.flatten()
# Find all the rows with at least one thing in g
arr_one_g = arr_g > 0
# Find all the things in the rows which are subsets of G
# This assumes row_len is always greater than 0; if it isn't, add a test for that
arr_subset_g = (row_len - arr_g) == 0
arr_one_g and arr_subset_g are 1-d boolean arrays that should index the rows you want.
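A tiny self-contained demo of this recipe, with hypothetical data (three orders over items a1, a2, b1):

import pandas as pd
from scipy.sparse import csc_matrix

# Hypothetical boolean membership table, stored sparsely.
df_demo = pd.DataFrame(
    {'a1': [1, 0, 1], 'a2': [1, 1, 0], 'b1': [0, 1, 0]}
).astype(pd.SparseDtype('int', 0))
g = ['a1', 'a2']

arr = csc_matrix(df_demo.sparse.to_coo())
row_len = arr.sum(axis=1).A.flatten()
col_idx = [df_demo.columns.get_loc(v) for v in g]
arr_g = arr[:, col_idx].sum(axis=1).A.flatten()

print(arr_g > 0)               # at least one item of g:  [ True  True  True]
print((row_len - arr_g) == 0)  # row is fully inside g:   [ True False  True]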

How to get same rank for same scores in Redis' ZRANK?

If I have 5 members with scores as follows
a - 1
b - 2
c - 3
d - 3
e - 5
ZRANK of c returns 2, ZRANK of d returns 3
Is there a way to get the same rank for the same scores?
Example: ZRANK c = 2, d = 2, e = 3
If yes, then how to implement that in spring-data-redis?
Any real solution needs to fit the requirements, which are kind of missing in the original question. My first answer assumed a small dataset, but that approach does not scale, as dense ranking is done (e.g. via Lua) in O(N) at least.
So, assuming that there are a lot of users with scores, the direction that for_stack suggested is better, in which multiple data structures are combined. I believe this is the gist of his last remark.
To store users' scores you can use a Hash. While conceptually you can use a single key to store a Hash of all users scores, in practice you'd want to hash the Hash so it will scale. To keep this example simple, I'll ignore Hash scaling.
This is how you'd add (update) a user's score in Lua:
local hscores_key = KEYS[1]
local user = ARGV[1]
local increment = ARGV[2]
local new_score = redis.call('HINCRBY', hscores_key, user, increment)
Next, we want to track the current count of users per discrete score value, so we keep another hash for that:
local old_score = new_score - increment
local hcounts_key = KEYS[2]
local old_count = redis.call('HINCRBY', hcounts_key, old_score, -1)
local new_count = redis.call('HINCRBY', hcounts_key, new_score, 1)
Now, the last thing we need to maintain is the per score rank, with a sorted set. Every new score is added as a member in the zset, and scores that have no more users are removed:
local zdranks_key = KEYS[3]
if new_count == 1 then
    redis.call('ZADD', zdranks_key, new_score, new_score)
end
if old_count == 0 then
    redis.call('ZREM', zdranks_key, old_score)
end
This three-piece script's complexity is O(log N) due to the use of the Sorted Set, but note that N is the number of discrete score values, not the number of users in the system. Getting a user's dense ranking is done with another, shorter and simpler script:
local hscores_key = KEYS[1]
local zdranks_key = KEYS[2]
local user = ARGV[1]
local score = redis.call('HGET', hscores_key, user)
return redis.call('ZRANK', zdranks_key, score)
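To make the flow concrete, here is a sketch of wiring the scripts above together with the redis-py client (key names are hypothetical; the update script is simply the three fragments above combined):

import redis  # assumes the redis-py client

r = redis.Redis()

# The three Lua fragments above, combined into a single update script.
UPDATE_LUA = """
local new_score = redis.call('HINCRBY', KEYS[1], ARGV[1], ARGV[2])
local old_score = new_score - ARGV[2]
local old_count = redis.call('HINCRBY', KEYS[2], old_score, -1)
local new_count = redis.call('HINCRBY', KEYS[2], new_score, 1)
if new_count == 1 then redis.call('ZADD', KEYS[3], new_score, new_score) end
if old_count == 0 then redis.call('ZREM', KEYS[3], old_score) end
"""

# The dense-rank lookup script from above.
RANK_LUA = """
local score = redis.call('HGET', KEYS[1], ARGV[1])
return redis.call('ZRANK', KEYS[2], score)
"""

update = r.register_script(UPDATE_LUA)
rank = r.register_script(RANK_LUA)

update(keys=['hscores', 'hcounts', 'zdranks'], args=['alice', 10])
print(rank(keys=['hscores', 'zdranks'], args=['alice']))  # 0-based dense rank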
You can achieve the goal with two Sorted Sets: one for the member-to-score mapping, and one for the score-to-rank mapping.
Add
Add items to member to score mapping: ZADD mem_2_score 1 a 2 b 3 c 3 d 5 e
Add the scores to score to rank mapping: ZADD score_2_rank 1 1 2 2 3 3 5 5
Search
Get score first: ZSCORE mem_2_score c, this should return the score, i.e. 3.
Get the rank for the score: ZRANK score_2_rank 3, this should return the dense ranking, i.e. 2.
In order to run it atomically, wrap the Add and Search operations in two Lua scripts.
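For example, with redis-py (script-wrapping omitted; key names follow the commands above):

import redis

r = redis.Redis(decode_responses=True)

# Member -> score mapping, and score -> dense-rank mapping.
r.zadd('mem_2_score', {'a': 1, 'b': 2, 'c': 3, 'd': 3, 'e': 5})
r.zadd('score_2_rank', {'1': 1, '2': 2, '3': 3, '5': 5})

score = r.zscore('mem_2_score', 'c')             # 3.0
print(r.zrank('score_2_rank', str(int(score))))  # 2 (the dense rank)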
Then there's this Pull Request - https://github.com/antirez/redis/pull/2011 - which is dead, but appears to compute dense rankings on the fly. The original issue/feature request (https://github.com/antirez/redis/issues/943) got some interest, so perhaps it is worth reviving /cc @antirez :)
The rank is unique in a sorted set, and elements with the same score are ordered (ranked) lexically.
There is no Redis command that does this "dense ranking"
You could, however, use a Lua script that fetches a range from a sorted set and reduces it to your requested form. This could work on small data sets, but you'd have to devise something more complex to make it scale.
unsigned long zslGetRank(zskiplist *zsl, double score, sds ele) {
    zskiplistNode *x;
    unsigned long rank = 0;
    int i;

    x = zsl->header;
    for (i = zsl->level-1; i >= 0; i--) {
        while (x->level[i].forward &&
               (x->level[i].forward->score < score ||
                (x->level[i].forward->score == score &&
                 sdscmp(x->level[i].forward->ele,ele) <= 0))) {
            rank += x->level[i].span;
            x = x->level[i].forward;
        }
        /* x might be equal to zsl->header, so test if obj is non-NULL */
        if (x->ele && x->score == score && sdscmp(x->ele,ele) == 0) {
            return rank;
        }
    }
    return 0;
}
https://github.com/redis/redis/blob/b375f5919ea7458ecf453cbe58f05a6085a954f0/src/t_zset.c#L475
This is the piece of code Redis uses to compute the rank in sorted sets. Right now, it just gives the rank based on the position in the skiplist (which is sorted by score).
What does the skiplistnode variable "span" mean in redis.h? (what is span?)

How can I add an array of values to Google ortools versus a lower and upper bound?

In the documentation and all the examples I can find... in terms of nurse scheduling at least, everyone just declares shift values within a search space of {1,4}, let's say, for shifts 1, 2, 3, 4....
solver = pywrapcp.Solver("schedule_shifts")
num_nurses = 4
num_shifts = 4  # Nurse assigned to shift 0 means not working that day.
num_days = 7
# [START]
# Create shift variables.
shifts = {}
for j in range(num_nurses):
    for i in range(num_days):
        shifts[(j, i)] = solver.IntVar(0, num_shifts - 1, "shifts(%i,%i)" % (j, i))
shifts_flat = [shifts[(j, i)] for j in range(num_nurses) for i in range(num_days)]
# Create nurse variables.
nurses = {}
for j in range(num_shifts):
    for i in range(num_days):
        nurses[(j, i)] = solver.IntVar(0, num_nurses - 1, "shift%d day%d" % (j, i))
I want to avoid using a range of values when I call solver.IntVar(lowerbound, upperbound, ...)
I want solver.IntVar([available values you can choose], ...)
I created a matrix of all shifts as columns, flowing from the first day to the last. My row indexes don't matter, but in each day/shift column I have the index values of nurses in descending order of who bid the highest for that shift. I then want to create a constraint where, if I choose a nurse, I choose the maximum bid allowed via other constraints from the column. However, I don't know how to do that given the limited documentation ortools has for the Python IntVar.
Can you try
solver.IntVar([values...], 'name')
It should work.
See https://github.com/google/or-tools/blob/master/examples/python/einav_puzzle2.py
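A minimal sketch of that overload with the classic CP solver (the allowed_bids values are hypothetical):

from ortools.constraint_solver import pywrapcp

solver = pywrapcp.Solver("bid_domains")
# Hypothetical bid amounts: the variable may only take one of these values,
# instead of every integer between a lower and an upper bound.
allowed_bids = [0, 50, 75, 120]
bid = solver.IntVar(allowed_bids, "bid")

# Enumerate solutions, trying the largest allowed value first.
db = solver.Phase([bid], solver.CHOOSE_FIRST_UNBOUND, solver.ASSIGN_MAX_VALUE)
solver.NewSearch(db)
while solver.NextSolution():
    print(bid.Value())  # prints each of the allowed values
solver.EndSearch()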

Create unique TextBox control names

I have tried to create unique TextBox control names for later use, i.e., to assign corresponding values from datatables to each of the textboxes. I have 128 survey questions in a datatable, divided into groups. For example, AMgroup.count = 4, meaning there are 4 questions for group AM; ACgroup.count = 18; and so on, with a total of 128 rows. Questions are categorized as L (likeness), Y (yes/no), or C (comment). L has 4 possible votes (strongly agree, agree, disagree, strongly disagree). I plan to create the textbox control names for the AM group as follows:
datatable.count = 4
Q1: txtbx1.Name = AM100
L type: txtbx2.Name = AM100a; txtbx3.Name = AM100b; txtbx4.Name = AM100c; txtbx5.Name = AM100d
Q2: txtbx6.Name = AM101
Y type: txtbx7.Name = AM101y; txtbx8.Name = AM101n
Q3: similar to Q1
Q4 / C type: txtbx13.Name = AM103c
My question is: should I create the textbox controls separately for each group as described above, or create something that dynamically generates all the textbox control names? Either way, can you please show me how to achieve it? Thank you.
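To make the naming scheme concrete, here is a small sketch of generating such names from (group, question, type) data. It is illustrative only: Python rather than WinForms code, and the suffix rules are inferred from the examples above.

# Suffixes per question type, inferred from the examples above:
# L = likeness (4 vote boxes), Y = yes/no, C = comment.
SUFFIXES = {'L': ['a', 'b', 'c', 'd'], 'Y': ['y', 'n'], 'C': ['c']}

def control_names(group, first_q, qtypes):
    """Yield a base name per question plus one suffixed name per answer box."""
    for offset, qtype in enumerate(qtypes):
        base = f"{group}{first_q + offset}"
        yield base                      # the question's own textbox
        for s in SUFFIXES[qtype]:
            yield base + s              # one box per possible answer

print(list(control_names('AM', 100, ['L', 'Y', 'L', 'C'])))
# ['AM100', 'AM100a', ..., 'AM101', 'AM101y', 'AM101n', ..., 'AM103', 'AM103c']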

Boolean to int conversion in Pig

I have a data set that looks like this:
foo,R
foo,Y
bar,C
foo,R
baz,Y
foo,R
baz,Y
baz,R
...
I'd like to generate a report that sums up the number of 'R', 'Y' and 'C' records for each unique value in the first column. For this data set, it would look like:
foo,3,1,0
bar,0,0,1
baz,1,2,0
Where the 2nd column is the number of 'R' records, the third is the number of 'Y' records and the last is the number of 'C' records.
I know I can first filter by record type, group and aggregate, but that leads to an expensive join of the three sub-reports. I would much rather group once and GENERATE each of the {R, Y, C} columns in my group.
How can I convert the Boolean result of comparing the second column in my data set to 'R', 'Y' or 'C' to a numeric value I can aggregate? Ideally I want 1 for a match and 0 for a non-match for each of the three columns.
Apache Pig is perfectly adapted to this type of problem. It can be solved with one GROUP BY and one nested FOREACH:
inpt = load '~/pig/data/group_pivot.csv' using PigStorage(',') as (val : chararray, cat : chararray);
grp = group inpt by (val);
final = foreach grp {
    rBag = filter inpt by cat == 'R';
    yBag = filter inpt by cat == 'Y';
    cBag = filter inpt by cat == 'C';
    generate flatten(group) as val, SIZE(rBag) as R, SIZE(yBag) as Y, SIZE(cBag) as C;
};
dump final;
--(bar,0,0,1)
--(baz,1,2,0)
--(foo,3,1,0)
bool = foreach final generate val, (R == 0 ? 0 : 1) as R, (Y == 0 ? 0 : 1) as Y, (C == 0 ? 0 : 1) as C;
dump bool;
--(bar,0,0,1)
--(baz,1,1,0)
--(foo,1,1,0)
I have tried it on your example and got the expected result. The idea is that after the GROUP BY, each value has a BAG containing all of its rows with the R, Y, C categories. Using FILTER within FOREACH, we create 3 separate BAGs (one per category), and SIZE(bag) in GENERATE counts the number of rows in each bag.
The only problem you might encounter is when there are too many rows with the same value in the val column, as the nested FOREACH relies on in-memory operations and the resulting intermediate BAGs could get quite large. If you start getting memory-related exceptions, you can take inspiration from How to handle spill memory in pig. The idea would be to use 2 GROUP BY operations: the first to get counts per (val, cat), and the second to pivot R, Y, C around val, thus avoiding an expensive JOIN operation (see Pivoting in Pig).
Regarding the question about BOOLEAN: I used the bincond operator.
If you do not need the counts, you could use IsEmpty(bag) instead of SIZE(bag); it would be slightly faster, combined with bincond to get your 0 and 1 conversions.