Plotting CDF for ranking distribution - pandas

I have a panda dataframe that looks like this, this is generated with the groupby command and then sorted by # of users to give me user count for top X feature combination.
count_28day, Feature1, Feature2, Feature3
5000 a1 b1 c1
1000 a2 b2 c2
50 a3 b3 c3
I'm trying to plot cdf of user distribution. I don't need to know the features. I just want to show the top X feature combinations that will give me 90% of total user.
I'm doing this in a very hacky way.
topx = table.count_28day.sort_values(ascending=False).cumsum()/sum(table.count_28day)
ser_cdf = pd.Series(topx.tolist()[1:100], index=pd.Series(range(1,100)))
ser_cdf.plot(drawstyle='steps')
Is there a more elegant way to do this using histogram or ecdf or something?

Related

how to make pandas code faster or using dask dataframe or how to use vectorization for this type of problem?

import pandas as pd
# list of name, degree, score
label1 = ["a1", "a1", "a1","a1", "a2","a2","a2","a2", "b1","b1","b1","b1", "b2","b2","b2","b2"]
label2 = ["a1", "a2", "b1","b2", "a1","a2","b1","b2", "a1","a2","b1","b2", "a1","a2","b1","b2"]
m1 = [ 0, 3, 2, 7, 3, 0, 5, 8, 2, 5, 0, 9, 7, 8, 9, 0]
# dictionary of lists
dict = {'label1': label1, 'label2': label2,'m1':m1}
df = pd.DataFrame(dict)
df
output of this dataframe:
label1 label2 m1
0 a1 a1 0
1 a1 a2 3
2 a1 b1 2
3 a1 b2 7
4 a2 a1 3
5 a2 a2 0
6 a2 b1 5
7 a2 b2 8
8 b1 a1 2
9 b1 a2 5
10 b1 b1 0
11 b1 b2 9
12 b2 a1 7
13 b2 a2 8
14 b2 b1 9
15 b2 b2 0
I want to write a function that will take strings (samp1)a, (samp2)b, and a (df) data frame as input. We have to preprocess those two input strings so that we can get desired strings in our data frame. Then we need to access some particular rows' (like (a1,b1) or (a2,b2)) indices of the data frame to get their corresponding 'm1' value. Next, we will make some (addition) operations for those m1 values and store them in two variables and after that, it will return the minimum of two variables. [looking at coding snippet may be easier to understand]
The following is the code for this function:
def min_4line(samp1,samp2,df):
k=['1','2']
#k and samp are helping to generate variable along with number
#for example it will take a,b and can create a1,a2,b1,b2.....
samp1_1=samp1+k[0]
samp1_2=samp1+k[1]
samp2_1=samp2+k[0]
samp2_2=samp2+k[1]
#print(samp1_1)#a1
#print(samp1_2)#a2
#print(samp2_1)#b1
#print(samp2_2)#b2
"""
#As we are interested about particular rows to get comb1 variable, we need those row's
#indexes
#for comb1 we want to sum (a1,b1)[which located at ind1] and (a2,b2)[which located at ind2]
#same types of thing for comb2
"""
ind1=df.index[(df['label1']==samp1_1) & (df['label2']==samp2_1)].tolist()
ind2=df.index[(df['label1']==samp1_2) & (df['label2']==samp2_2)].tolist()
#print(ind1)#[2]
#print(ind2)#[7]
comb1=int(df.loc[ind1,'m1'])+int(df.loc[ind2,'m1'])
#print('comb1: ',comb1)#comb1: 10
ind3=df.index[(df['label1']==samp1_2) & (df['label2']==samp2_1)].tolist()
ind4=df.index[(df['label1']==samp1_1) & (df['label2']==samp2_2)].tolist()
#print(ind3)#[6]
#print(ind4) #[3]
comb2=int(df.loc[ind3,'m1'])+int(df.loc[ind4,'m1'])
#print('comb2: ',comb2)#comb2: 12
return min(comb1,comb2)#10
To append unique char like a,b from the dataframe we need to do a list operation:
#this list is needed so that I can compare how many unique values are there...
#it could get a,b,c,d.... and make comparison
#like (a,b), (a,c),(a,d), (b,c),(b,d),(c,d) for the function
list_line=list(df['label1'].unique())
string_test=[a[:-1] for a in list_line]
#string_test will exclude number portion of character
list_img=sorted(list(set(string_test)))
#print(list_img)#['a', 'b']
#print(len(list_img))#2
Now we need to create a data frame that will go over the 'list_img' and call the min4line function to get value like (a,b), (a,c) and corresponding output of the function. Here a nested loop is necessary as suppose list consist [a,b,c,d]. it will go like(a,b),(a,c),(a,d),(b,c),(b,d),(c,d). So that we can have unique pair. The code for this is:
%%time
d=[]
for i in range(len(list_img)):
for j in range(i+1,len(list_img)):
a=min_4line(list_img[i],list_img[j],df)
print(a)
d.append({'label1':str(list_img[i]),'label2':str(list_img[j]), 'metric': str(a)})
dataf=pd.DataFrame(d)
dataf.head(5)
output is:
label1label2metric
0 a b 10
Is there any way to make the code faster? I broke down the problem into small parts. this operation is needed for 16 million rows. I am interested in using dask for this. But when I have asked this type of question previously, many people failed to understand as I was not able to state the problem clearly. Hope this time I broke it down in easier format. You can copy those code cell and run in jupyter notebook to check the output and suggest me any good way to make the program faster.
[updated]
Can anyone suggest, how can I get those particular indices of those rows using numpy or any kind of vectorized operation?

Turning one-hot encoded table into 2D table of counts

I think I can solve this problem without too much difficulty but suspect that any solution I come up with will be sub-optimal, so am interested in how the real pandas experts would do it; I'm sure I could learn something from that.
I have a table of data that is one-hot encoded, something like:
Index. A1. A2. A3. B1. B2. C1. C2. C3. C4.
0. True. False. True. True. True. False. True. False. False.
...
So every entry is a Boolean and my columns consist of several groups of categories (the A's, B's and C's).
What I want to create is new DataFrames where I pick any two categories and get a table of counts of how many people are in the pair of categories corresponding to that row/column. So, if I was looking at categories A and B, I would generate a table:
Index. A1. A2. A3. None Total
B1. x11. x12. x13. x1N x1T
B2. x21. x22. x23. x2N. x2T
None. xN1. xN2. xN3. xNN xNT
Total. xT1. xT2. xT3. xTN xTT
where x11 is the count of rows in the original table that have both A1 and B1 True, x12 is the count of those rows that have A1 and B2 True, and so on.
I'm also interested in the counts of those entries where all the A values were False and/or all the B values were false, which are accounted for in the None columns.
Finally, I would also like the totals of rows where any of the columns in the corresponding category were True. So x1T would be the number of rows where B1 was True and any of A1, A2 or A3 were True, and so on (note that this is not just the sum of x11, x12 and x13 as the categories are not always mutually exclusive; a row could have both A1 True and A2 True for example). xNN is the number of rows that have all false values for A1, A2, A3, B1, B2, and xTT is the number of rows that have at least one true value for any of A1, A2, A3, B1 and B2, so xNN + xTT would equal the total number of rows in the original table.
Thanks
Graham
This is my approach:
def get_table(data, prefix):
'''
get the columns in the respective category
and assign `none` and `Total` columns
'''
return (df.filter(like=prefix)
.assign(none=lambda x: (1-x).prod(1),
Total=lambda x: x.any(1))
)
# sample data
np.random.seed(1)
df = pd.DataFrame(np.random.choice([True,False], size=(5,9), p=(0.6,0.4)),
columns=[f'{x}{y}'for x in 'ABC' for y in '123'])
# the output
df = df.astype(int)
get_table(df, 'B').T # get_table(df, 'A')
Output:
A1 A2 A3 none Total
B1 3 2 1 0 3
B2 3 2 1 0 3
B3 2 1 1 0 2
none 0 0 1 0 1
Total 4 3 2 0 5
Here I don't understand as why (none, Total) must be zero. Since none corresponds to all False in B and Total corresponds to some True in A.

using 'loop' or 'for' with table data to pull each row data and use the pulled data for two parameters in gams

I am new to GAMS and I have a table data which has 3 rows and 6 columns. I want to pull each row and use its data for two parameters(pull each row which has 6 element and use the first three elements for one parameter and the other three elements for the second parameter) using loop or for statement. i tried to use both of them but for the loop i received zero value for my parameter which is incorrect and for the for statement i received some errors.
this is my code for the first row which both 'loop' and 'for' are used (i used them separately each time but for show what was my code i just wrote them together).
Please help me.
Thanks
scalars j;
sets
o /red,green,blue/
p /b1,b2,b3,p1,p2,p3/
k /1*3/;
Table sup(*,*)
b1 b2 b3 p1 p2 p3
red 12 15 20 200 50 50
green 16 17 0 150 50 0
blue 13 18 0 100 50 0 ;
parameters Bid_Red(k),Pmax_Red(k),t;
*for statement***************
for(j= 1 to 3,
t=card(o)+j;
Bid_Red(k)$( ord(k) = j )=sup('red',j);
Pmax_Red(k)$( ord(k) = j )=sup('red',t);
);
*loop statement***************
t=card(o);
loop(k,
Bid_Red(k)=sup('red',k);
Pmax_Red(k)=sup('red',k+t);
);
display Bid_red, Pmax_Red
One of the core features of GAMS is how it deals with set structures and indexing. I'd recommend looking at the excellent documentation, for example on set definition https://www.gams.com/latest/docs/UG_SetDefinition.html, to really get a feel for how to get the best out of it.
In your case, you can proceed as follows. p is a set. Create some subsets of it p_ and b_, given by the syntax subset_name(set_name).
sets p_(p) / p1, p2, p3 /,
b_(p) / b1, b2, b3 /;
Create parameters over appropriate dimensions (i.e. the full set), and define them over the subset you are interested in:
parameters bid_red(o,p),pmax_red(o,p);
bid_red(o,b_) = sup(o,b_);
pmax_red(o,p_) = sup(o,p_);
Then display bid_red, pmax_red; gives:
---- 21 PARAMETER bid_red
b1 b2 b3
red 12.000 15.000 20.000
green 16.000 17.000
blue 13.000 18.000
---- 21 PARAMETER pmax_red
p1 p2 p3
red 200.000 50.000 50.000
green 150.000 50.000
blue 100.000 50.000
If you do want to select individual rows, you can use e.g. pmax_red('red',p_) in your code. This is essentially just a special case of subsetting in which the subset is of size 1.

Database transactions theory

No book seems to be able to answer this.
Suppose I have two transactions:
T1: Lock A, Lock B, Unlock A
T2: Lock B, Unlock B, Lock A, Unlock A
Q1. How many ways are there to plan these transactions? (Is it just a simple graph and the result is 3! * 4! ?)
Q2. How many of these ways are serializable?
I would really like to know what is the thinking process, how do you get to the answer?
Q1 is 7.
Proof: First of all, we have to merge the set 'Lock A', 'Lock B', 'Unlock A' (I denote the items as A1, A2, A3) into the set 'Lock B',..,'Unlock A' (I denote them B1..B4), that is to put 3 items into 5 places (between B's) with repetitions allowed, that is binomial coeff. choose 3 from (5-1+3). It is equal to 7!/(3!*4!) = 35.
Next, we have to drop 'bad' solutions (the ones prevented by locking conditions). It's where A1 stands between B3 and B4 (3 solutions) and A2 between B1 and B2 (2*4 = 8). Also, we have to exclude the solutions with B3 between A1 and A3 too. There are 3*3=9 with B3 between A1 and A2, and 6*2=12 with B3 between A2 and A3. Thus, we have 35-3-8-9-12=3. But we should also satisfy inclusion-exclusion principle: add solutions which violates two rules simultaneously. They could be only like this: B1 A2 B2 B3 B4, with A1 in either of two left positions, and A3 in either of two right ones. 4 in total. So, we have the final answer 35 - 3 - 8 - 9 - 12 + 4 = 7.

Redis and linked hashes

everyone
I would like to ask community of help to find a way of how to cache our huge plain table by splitting it to the multiple hashes or otherwise.
The sample of table, as an example for structure:
A1 B1 C1 D1 E1 X1
A1 B1 C1 D1 E1 X2
A7 B5 C2 D1 E2 X3
A8 B1 C1 D1 E2 X4
A1 B6 C3 D2 E2 X5
A1 B1 C1 D2 E1 X6
This our denormalized data, we don't have any ability to normalize it.
So currently we must perform 'group by' to get required items, for instance to get all D* we perform 'data.GroupBy(A1).GroupBy(B1).GroupBy(C1)' and it takes a lot of time.
Temporarily we had found workaround for this by creating composite a string key:
A1 -> 'list of lines begin A1'
A1:B1 -> 'list of lines begin A1:B1'
A1:B1:C1 -> 'list of lines begin A1:B1:C1'
...
as a cache of results of grouping operations.
The question is how it can be stored efficiently?
Estimated number of lines in denormalized data around 10M records and as in my an example there is 6 columns it will be 60M entries in hash. So, I'm looking an approach to lookup values in O(N) if it's possible.
Thanks.