Create a new column on a pandas DataFrame in which the entries are randomly selected entries from another column

I have a DataFrame with the following structure.
df = pd.DataFrame({'tenant_id': [1, 1, 1, 2, 2, 2, 3, 3, 7, 7],
                   'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
                   'text': ['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})
df = df[['tenant_id', 'user_id', 'text']]
This gives the following output:
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [ytn]
7 ytn pop [yth]
Here, random IDs from the user_id column have been selected; these IDs can repeat across rows, as "fvg" does for tenant_id=2. I would like a threshold of not more than ten IDs. This data is just a sample and has only 10 IDs to start with, so in general the sample size would be much less than the total number of user_ids; in this case, say one less than the number of user_ids belonging to a tenant.
I first tried figuring out how to select a random subset of varying length with
df.sample
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am lost after this; assigning it to my df results in NaNs, probably because the samples are of variable length. Please help.
Thanks.

Per my comment:
Your 'new column' is not a new column, it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column, and apply the cell computation to it.
df['new column'] = df['user_id'].apply(lambda x: df.user_id.sample(n=np.random.randint(1, 10)).tolist())
It doesn't really matter which column you use for the apply, since the row value is not used in the computation.
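Note that the apply above samples from the whole user_id column. If the random IDs should come only from the same tenant's group, one way is to build a per-tenant pool first. A minimal sketch, assuming (based on the example output) that a row's own user_id is excluded, that the sample within a row has no duplicates, and that the size is capped at ten and at the pool size:
import numpy as np
import pandas as pd

df = pd.DataFrame({'tenant_id': [1, 1, 1, 2, 2, 2, 3, 3, 7, 7],
                   'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
                   'text': ['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})

# Map each tenant to the list of user_ids in that tenant's group.
pools = df.groupby('tenant_id')['user_id'].apply(list).to_dict()

def pick_ids(tenant, uid):
    # Candidates are the other user_ids of the same tenant.
    pool = [u for u in pools[tenant] if u != uid]
    # Random sample size: at least 1, capped at 10 and at the pool size (assumptions).
    k = np.random.randint(1, min(10, len(pool)) + 1)
    return list(np.random.choice(pool, size=k, replace=False))

df['new_column'] = [pick_ids(t, u) for t, u in zip(df['tenant_id'], df['user_id'])]
print(df)
IDs can still repeat across different rows, which matches the example where "fvg" appears in two lists for tenant_id=2.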

Related

Pandas cumulative sum over 1 index but not the other 3

I have a dataframe with 4 variables, DIVISION, QTR, MODEL_SCORE, and MONTH, with the sum of variable X aggregated by those 4.
I would like to effectively partition the data by DIVISION, QTR, and MODEL_SCORE and keep a running total ordered by the MONTH field, smallest to largest. The idea is that the running total would reset whenever it reaches a new permutation of the other 3 columns.
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I'm trying
df['cumsum'] = df.groupby(level=3)['X'].cumsum()
having tried every number I can think of in the level argument. It seems to work every way except the one I want.
EDIT: I know the below isn't formatted ideally, but basically, as long as the only variable changing is MONTH the cumulative sum should continue; a change in any other variable should cause it to reset.
DIVISION QTR MODEL MONTHS  X CUMSUM
A        1   1     1      10     10
A        1   1     2      20     30
A        1   2     1       5      5
I'm sorry for all the trouble; I believe the answer was way simpler than I was making it out to be.
After
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I was supposed to reset the index, since I did not want a MultiIndex, and this appears to have worked.
df = df.reset_index()
df['cumsum'] = df.groupby(['DIVISION','MODEL','QTR'])['X'].cumsum()
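A minimal end-to-end sketch using the three rows from the edit as input (assuming the data starts in long, un-aggregated form):
import pandas as pd

df = pd.DataFrame({'DIVISION': ['A', 'A', 'A'],
                   'QTR':      [1, 1, 1],
                   'MODEL':    [1, 1, 2],
                   'MONTHS':   [1, 2, 1],
                   'X':        [10, 20, 5]})

# Aggregate, flatten the MultiIndex back into columns, then take the running
# total within each (DIVISION, MODEL, QTR) partition, ordered by MONTHS.
df = df.groupby(['DIVISION', 'MODEL', 'QTR', 'MONTHS'])['X'].sum().reset_index()
df = df.sort_values(['DIVISION', 'MODEL', 'QTR', 'MONTHS'])
df['CUMSUM'] = df.groupby(['DIVISION', 'MODEL', 'QTR'])['X'].cumsum()
print(df)
This reproduces the CUMSUM column shown in the edit: 10, 30, then a reset to 5 when MODEL changes.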

Groupby two columns in pandas, and perform operations over totals for each group

The code below:
df = pd.read_csv('./filename.csv', header='infer').dropna()
df.groupby(['category_code','event_type']).event_type.count().head(20)
Returns a table of counts per (category_code, event_type) pair (posted as an image in the question).
How can I obtain, for all the sub groups under event_type that have both "purchase" and "view", the ratio between the total of "purchase" and the total of "view"?
In this specific case, for instance, I need a function that returns:
1/57
1/232
3/249
Eventually, I will need to plot such result.
I have been trying for a day, without success. I am still new to pandas, and I searched across every possible forum without finding anything useful.
Next time, please consider adding a sample of your data as text instead of as an image; it makes testing easier.
Anyway, in your case you can combine different dataframe methods, such as groupby, as you have already done, and pivot_table. I used this data just as an example:
category_code event_type
0 A purchase
1 A view
2 B view
3 B view
4 C view
5 D purchase
6 D view
7 D view
You can create a new column from your groupby:
df['event_count'] = df.groupby(['category_code', 'event_type'])\
                      .event_type.transform('count')
Then create a pivot_table:
my_table = df.pivot_table(values='event_count',
                          index='category_code',
                          columns='event_type',
                          fill_value=0)
Then, finally, you can calculate the purchase_ratio directly:
my_table['purchase_ratio'] = my_table['purchase'] / my_table['view']
Which results in the following DataFrame:
event_type purchase view purchase_ratio
category_code
A 1 1 1.0
B 0 2 0.0
C 0 1 0.0
D 1 2 0.5
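The question asks for ratios only where a category has both event types (1/57, 1/232, 3/249). One way to get just those, assuming my_table from above, is to filter out rows with a zero count before dividing; the resulting Series can then be plotted (matplotlib is assumed to be available for .plot):
both = my_table[(my_table['purchase'] > 0) & (my_table['view'] > 0)]
purchase_ratio = both['purchase'] / both['view']
print(purchase_ratio)
purchase_ratio.plot(kind='bar')  # for the plot mentioned in the question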

How to combine certain column values together in Python and make values in the other column be the means of the values combined?

I have a pandas dataframe where one column ('sequence') is a sequence of numbers, many of them repeating, and the other column ('binary variable') has values that are either 1 or 0.
I have grouped by the values in the 'sequence' column and made the 'binary variable' values the percentage of entries that are non-zero in each group.
I now want to combine entries in the 'sequence' column with the same values together and make the 'binary variable' value the mean over the rows that were combined.
So my data frame looks like this:
df = pd.DataFrame({'sequence': [1, 1, 4, 4, 4, 6], 'binary variable': [1, 0, 0, 1, 0, 1]})
I have then used this code to group together the same values in 'sequence':
df.groupby("sequence")['binary variable'].apply(lambda x: (x != 0).sum() / x.count() * 100)
I am left with the 'sequence' column holding non-repeating values and the 'binary variable' column now being the percentage of non-zeros.
But now I want to group some of the 'sequence' values together (so for this toy example, the 1 and 4 values) and have the 'binary variable' column hold the mean of the percentages for the grouped values, say those for 1 and 4.
This isn't terribly well worded, as I'm finding it awkward to describe. I've tried to look online and had many failed attempts with code of my own, but it just is not working.
Any help would be greatly appreciated.
It seems like you want to group the table twice and take the mean each time. For the second grouping, you need to create a new column to indicate the group.
Try this code:
import pandas as pd

# sequence groups for the final average
grps = {(1, 4): [1, 4],
        (5, 6): [5, 6]}

# initial data
df = pd.DataFrame({'sequence': [1, 1, 4, 4, 4, 5, 5, 6], 'binvar': [1, 0, 0, 1, 0, 1, 0, 1]})

# per-sequence mean of the 0/1 column (i.e. the fraction of non-zero entries)
gb = df.groupby(["sequence"])['binvar'].mean().reset_index()

def getgrp(x):  # look up which group a sequence value belongs to
    for k in grps:
        if x in grps[k]:
            return k

print(df.to_string(index=False))
gb['group'] = gb.apply(lambda r: getgrp(r['sequence']), axis=1)
gb = gb.reset_index()
print(gb.to_string(index=False))
gb = gb[['group', 'binvar']].groupby("group")['binvar'].mean().reset_index()
print(gb.to_string(index=False))
Output
sequence binvar
1 1
1 0
4 0
4 1
4 0
5 1
5 0
6 1
index sequence binvar group
0 1 0.500000 (1, 4)
1 4 0.333333 (1, 4)
2 5 0.500000 (5, 6)
3 6 1.000000 (5, 6)
group binvar
(1, 4) 0.416667
(5, 6) 0.750000
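As a small design note, the row-wise apply used to attach the group labels can also be written as a plain Series.map of the same lookup function, which avoids indexing into each row (a sketch, reusing gb and getgrp from above):
gb['group'] = gb['sequence'].map(getgrp)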

How to create new columns using groupby based on logical expressions

I have this CSV file
http://www.sharecsv.com/s/2503dd7fb735a773b8edfc968c6ae906/whatt2.csv
I want to create three columns, 'MT_Value', 'M_Value', and 'T_Data'. The first should hold the mean of the data grouped by year and month, which I accomplished by doing this:
data.groupby(['Year','Month']).mean()
But for M_Value I need the mean of only the values different from zero, and for T_Data I need the count of the values that are zero divided by the total number of values. I guess that for the last one I need to divide the number of zero values by the size of each group, but honestly I am a bit lost. I looked on Google and people mention something about transform, but I didn't understand it very well.
Thank you.
You could do something like this:
(data.assign(M_Value=data.Valor.where(data.Valor != 0),
             T_Data=data.Valor.eq(0))
     .groupby(['Year', 'Month'])
     [['Valor', 'M_Value', 'T_Data']]
     .mean()
)
Explanation: assign will create new columns with respective names. Now
data.Valor.where(data.Valor!=0) will replace 0 values with nan, which will be ignored when we call mean().
data.Valor.eq(0) gives True (1) where Valor is 0 and False (0) elsewhere, so when you take mean() you compute count(Valor==0)/total_count().
Output:
Valor M_Value T_Data
Year Month
1970 1 2.306452 6.500000 0.645161
2 1.507143 4.688889 0.678571
3 2.064516 7.111111 0.709677
4 11.816667 13.634615 0.133333
5 7.974194 11.236364 0.290323
... ... ... ...
1997 10 3.745161 7.740000 0.516129
11 11.626667 21.800000 0.466667
12 0.564516 4.375000 0.870968
1998 1 2.000000 15.500000 0.870968
2 1.545455 5.666667 0.727273
[331 rows x 3 columns]
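For intuition, here is the same assign/groupby/mean pattern on a tiny invented frame (the column names mirror the question; the numbers are made up purely for illustration):
import pandas as pd

data = pd.DataFrame({'Year':  [1970, 1970, 1970, 1970],
                     'Month': [1, 1, 1, 2],
                     'Valor': [0, 4, 8, 0]})

out = (data.assign(M_Value=data.Valor.where(data.Valor != 0),
                   T_Data=data.Valor.eq(0))
           .groupby(['Year', 'Month'])
           [['Valor', 'M_Value', 'T_Data']]
           .mean())
print(out)
# For (1970, 1): Valor mean is 4.0, M_Value (mean of the non-zero values) is 6.0,
# and T_Data (share of zeros) is about 0.33.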

using list as an argument in groupby() in pandas and none of the key elements match column or index names

So I have a dataframe of random values, as below, and a book I am studying uses a list as the groupby key (key_list). How is the dataframe grouped in this case, since none of the list values match column or index names? The last two lines are the ones confusing me.
people = pd.DataFrame(np.random.randn(5,5), columns = ['a','b','c','d','e'], index=['Joe','Steve','Wes','Jim','Travis'])
key_list = ['one','one','one','two','two']
people.groupby(key_list).min()
people.groupby([len, key_list]).min()
Thank you in advance!
The user guide on groupby explains a lot and I suggest you have a look at it. I'll explain as much as I understand for your use case.
You can verify the groups created using the groups attribute:
people.groupby(key_list).groups
{'one': Index(['Joe', 'Steve', 'Wes'], dtype='object'),
'two': Index(['Jim', 'Travis'], dtype='object')}
You have your dictionary with the keys 'one' and 'two' being the groups from the key_list list. As such, when you ask for min, it looks at each group and picks out the minimum of each column. Let's inspect group 'one' using the get_group method:
people.groupby(key_list).get_group('one')
a b c d e
Joe -0.702122 0.277164 1.017261 -1.664974 -1.852730
Steve -0.866450 -0.373737 1.964857 -1.123291 1.251595
Wes -0.043835 -0.011108 0.214802 0.065022 -1.335713
You can see that Steve has the lowest value in column 'a'. When you run the next line, it gives the minimum of each column for that group:
people.groupby(key_list).get_group('one').min()
a -0.866450
b -0.373737
c 0.214802
d -1.664974
e -1.852730
dtype: float64
The same concept applies when you run it on the second group, 'two'. As such, when you run the first part of your groupby code:
people.groupby(key_list).min()
You get the column-wise minimums for each group:
a b c d e
one -0.866450 -0.373737 0.214802 -1.664974 -1.852730
two -1.074355 -0.098190 -0.595726 -2.194481 0.232505
The second part of your code, which involves len, applies the same grouping concept. In this case, it groups the dataframe according to the length of the strings in its index: (Jim, Joe, Wes) - 3 letters, (Steve) - 5 letters, (Travis) - 6 letters, and then combines that with the key_list grouping to give the final output:
a b c d e
3 one -0.702122 -0.011108 0.214802 -1.664974 -1.852730
two -0.928987 -0.098190 3.025985 0.702471 0.232505
5 one -0.866450 -0.373737 1.964857 -1.123291 1.251595
6 two -1.074355 1.110879 -0.595726 -2.194481 0.394216
Note that 3 shows both 'one' and 'two' because 'Joe' and 'Wes' are three-letter names in group 'one' (the minimum of each column is taken across them), while 'Jim' is the only three-letter name in group 'two'. The same concept applies to the 5-letter and 6-letter names.
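To see those combined groups explicitly, you can inspect them the same way as with key_list alone (a sketch; the group keys below follow from the name lengths and key_list, while the actual row values are random):
people.groupby([len, key_list]).groups
# {(3, 'one'): Index(['Joe', 'Wes'], dtype='object'),
#  (3, 'two'): Index(['Jim'], dtype='object'),
#  (5, 'one'): Index(['Steve'], dtype='object'),
#  (6, 'two'): Index(['Travis'], dtype='object')}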