Cumulative value counts of categorical data, with group by - pandas

In my data frame, I have a text column group holding the group name, and a column drop_week holding a categorical value in the range [1, 4]. I want to store, for each group, the cumulative count of values 1 to 4 of drop_week. I'm doing this:
drop_data = all_data[['group', 'drop_week']].groupby('group')['drop_week'] \
.value_counts().unstack().transpose().fillna(0).cumsum().transpose()
and it works. But since it took me 2 hours of googling to come up with this solution, I was wondering if there is a better way to do it.

You could use pd.crosstab to create the frequency table. Then use cumsum(axis=1) to compute the cumulative sum across each row:
pd.crosstab(index=all_data['group'], columns=all_data['drop_week']).cumsum(axis=1)
# drop_week   1   2   3   4
# group
# 0          12  17  21  27
# 1           7  13  18  25
# 2           9  14  22  26
# 3           5  11  16  22
which agrees with
drop_data = (all_data[['group', 'drop_week']].groupby('group')['drop_week']
.value_counts().unstack().transpose().fillna(0).cumsum().transpose())
# drop_week   1   2   3   4
# group
# 0          12  17  21  27
# 1           7  13  18  25
# 2           9  14  22  26
# 3           5  11  16  22
The setup I used for this was:
import numpy as np
import pandas as pd
np.random.seed(2019)
N = 100
all_data = pd.DataFrame({'group': np.random.randint(4, size=N),
                         'drop_week': np.random.randint(1, 5, size=N)})
drop_data = (all_data[['group', 'drop_week']].groupby('group')['drop_week']
.value_counts().unstack().transpose().fillna(0).cumsum().transpose())
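As an aside, the transpose pair in the original chain can be dropped: unstack takes a fill_value and cumsum takes an axis. A sketch of the shorter equivalent (same counts; dtypes may stay integer rather than float):
drop_data = (all_data.groupby('group')['drop_week']
             .value_counts()
             .unstack(fill_value=0)
             .cumsum(axis=1))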

Related

Print Pandas Unique Rows by Column Condition

I am trying to print the rows where a condition is met in a pandas DataFrame, based on the unique values in the DataFrame. For example, I have data that looks like this:
DF:
site  temp  month  day
A       15      7   18
A       11      6   12
A       22      9    3
B        9      4   23
B        3      2   11
B       -1      5   18
I need to print the rows where the per-site maximum in the 'temp' column occurs, such as this for the final result:
A 22
B 9
I have tried this but it is not working correctly:
for i in DF['site'].unique():
    print(DF.temp.max())
I get the same answer of:
22
22
but the answer should be:
site  temp  month  day
A       22      9    3
B        9      4   23
thank you!
A possible solution:
df.groupby('site', as_index=False)['temp'].max()
Output:
  site  temp
0    A    22
1    B     9
In case you want to use a for loop:
for i in df['site'].unique():
    print(df.loc[df['site'].eq(i), 'temp'].max())
Output:
22
9
df.groupby('site').max()
Output:
      temp  month  day
site
A       22      9   18
B        9      5   23
Note that max is taken per column independently here, so month and day need not come from the row with the maximal temp.
Let us do sort_values + drop_duplicates
df = df.sort_values('temp', ascending=False).drop_duplicates('site')
Out[190]:
  site  temp  month  day
2    A    22      9    3
3    B     9      4   23
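If the full row of each per-group maximum is what is wanted, another standard pandas pattern (a sketch, not from the answers above, assuming temp has no NaN values) uses idxmax:
# Select the complete row where temp is maximal within each site group
df.loc[df.groupby('site')['temp'].idxmax()]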

Adding n extra rows of a certain value at the end of a dataframe

I have a dataframe with currently 22 rows
index  value
0         23
1         22
2         19
...
21        20
To this dataframe, I want to add 78 rows to make the dataframe exactly 100 rows. So I need to fill loc[22:99] with a certain value, let's say 100.
I tried something like this:
uncon_dstn_2021['balance'].loc[22:99] = 100
but it did not work. Any ideas?
You can use reindex:
out = df.reindex(df.index.tolist() + list(range(22, 99 + 1)), fill_value=100)
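Assuming the existing index is the default RangeIndex (0 through 21 here), the same idea can be written more compactly; a minimal sketch:
# Reindex to 0..99: existing rows are kept, new positions are filled with 100
out = df.reindex(range(100), fill_value=100)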
You can also use pd.concat:
df1 = pd.concat([df, pd.DataFrame({'balance': [100]*(100-len(df))})], ignore_index=True)
print(df1)
# Output
    balance
0         1
1        14
2        11
3        11
4        10
..      ...
96      100
97      100
98      100
99      100
[100 rows x 1 columns]

How to split numbers in pandas column into deciles?

I have a column in a pandas dataset of random values ranging between 100 and 500.
I need to create a new column 'deciles' out of it, like a ranking, for a total of 20 bins. I need to assign a rank number out of 20 based on the value:
10 to 20 is the first bin, number 1
20 to 30 is the second bin, number 2
x = np.random.randint(100, 501, size=1000)  # 1000 values ranging between 100 and 500
df['credit_score'] = x
df['credit_decile_rank'] = df['credit_score'].map(lambda x: int(x / 20))
df.head()
Use integer division by 10:
df = pd.DataFrame({'credit_score': [4, 15, 24, 55, 77, 81]})
df['credit_decile_rank'] = df['credit_score'] // 10
print(df)
   credit_score  credit_decile_rank
0             4                   0
1            15                   1
2            24                   2
3            55                   5
4            77                   7
5            81                   8
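If the goal is equal-frequency bins (true quantiles) rather than fixed-width ranges, pd.qcut is the usual tool. A sketch against the 1000-row credit_score column from the question; credit_rank_20 is a hypothetical column name:
# 20 equal-frequency bins labelled 1..20; assumes enough distinct values to cut
df['credit_rank_20'] = pd.qcut(df['credit_score'], q=20, labels=False) + 1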

How to substitute a column in a pandas dataframe with a series?

Let's say we have a dataframe df and a series s1 in pandas:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10000,1000))
s1 = pd.Series(range(0,10000))
How can I modify df so that column 42 becomes equal to s1?
How can I modify df so that the columns between 42 and 442 become equal to s1?
I would like to know the simplest way to do that but also a way to do that in place.
First, you need a Series with the same length as the DataFrame, here 20:
np.random.seed(456)
df = pd.DataFrame(np.random.randn(20,10))
#print (df)
s1 = pd.Series(range(0,20))
#print (s1)
#set column by Series
df[8] = s1
#set Series to range of columns
cols = df.loc[:, 3:5].columns
df[cols] = pd.concat([s1] * len(cols), axis=1)
print (df)
0 1 2 3 4 5 6 7 8 9
0 -0.668129 -0.498210 0.618576 0 0 0 0.301966 0.449483 0 -0.315231
1 -2.015971 -1.130231 -1.111846 1 1 1 1.915676 0.920348 1 1.157552
2 -0.106208 -0.088752 -0.971485 2 2 2 -0.366948 -0.301085 2 1.141635
3 -1.309529 -0.274381 0.864837 3 3 3 0.670294 0.086347 3 -1.212503
4 0.120359 -0.358880 1.199936 4 4 4 0.389167 1.201631 4 0.445432
5 -1.031109 0.067133 -1.213451 5 5 5 -0.636896 0.013802 5 1.726135
6 -0.491877 0.254206 -0.268168 6 6 6 0.671070 -0.633645 6 1.813671
7 0.080433 -0.882443 1.152671 7 7 7 0.249225 1.385407 7 1.010374
8 0.307274 0.806150 0.071719 8 8 8 1.133853 -0.789922 8 -0.286098
9 -0.767206 1.094445 1.603907 9 9 9 0.083149 2.322640 9 0.396845
10 -0.740018 -0.853377 -2.039522 10 10 10 0.764962 -0.472048 10 -0.071255
11 -0.238565 1.077573 2.143252 11 11 11 1.542892 2.572560 11 -0.803516
12 -0.139521 -0.992107 -0.892619 12 12 12 0.259612 -0.661760 12 -1.508976
13 -1.077001 0.381962 0.205388 13 13 13 -0.023986 -1.293080 13 1.846402
14 -0.714792 -0.728496 -0.127079 14 14 14 0.606065 -2.320500 14 -0.992798
15 -0.127113 -0.563313 -0.101387 15 15 15 0.647325 -0.816023 15 -0.309938
16 -1.151304 -1.673719 0.074930 16 16 16 -0.392157 0.736714 16 1.142983
17 -1.247396 -0.471524 1.173713 17 17 17 -0.005391 0.426134 17 0.781832
18 -0.325111 0.579248 0.040363 18 18 18 0.361926 0.036871 18 0.581314
19 -1.057501 -1.814500 0.109628 19 19 19 -1.738658 -0.061883 19 0.989456
Timings
Other solutions, though the concat solution seems to be the fastest:
np.random.seed(456)
df = pd.DataFrame(np.random.randn(1000,1000))
#print (df)
s1 = pd.Series(range(0,1000))
#print (s1)
#set column by Series
df[8] = s1
#set Series to range of columns
cols = df.loc[:, 42:442].columns
print (df)
In [310]: %timeit df[cols] = np.broadcast_to(s1.values[:, np.newaxis], (len(df),len(cols)))
1 loop, best of 3: 202 ms per loop
In [311]: %timeit df[cols] = np.repeat(s1.values[:, np.newaxis], len(cols), axis=1)
1 loop, best of 3: 208 ms per loop
In [312]: %timeit df[cols] = np.array([s1.values]*len(cols)).transpose()
10 loops, best of 3: 175 ms per loop
In [313]: %timeit df[cols] = pd.concat([s1] * len(cols), axis=1)
10 loops, best of 3: 53.8 ms per loop
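For the single-column case from the question (make column 42 equal to s1, in place), plain assignment is enough, mirroring the df[8] = s1 line above; a minimal sketch:
# In place: overwrite column 42 with s1 (index-aligned; lengths match here)
df[42] = s1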

Add 2nd column index to pandas dataframe using another dataframe column

I have these 2 dataframes
dfmonthtradedays
Out[57]:
Instrument  AAPL.O  AMZN.O  FB.OQ  GOOG.OQ  GOOGL.OQ  BHP.AX  JPM.N  MSFT.O  \
Date
2016-04-30      21      21     21       21        21      21     21      21
2016-05-31      21      21     21       21        21      21     21      21
2016-06-30      22      22     22       22        22      22     22      22
and
rics
Out[60]:
        0   1
0  AAPL.O  US
1  MSFT.O  US
2  AMZN.O  US
3  BHP.AX  AU
I am trying to add a second column index (level) to df1 using column 1 of df2, so that the AAPL.O column would also have 'US' as a column index, BHP.AX would have 'AU', etc. I am new to Python and programming, but have tried for some time to get this working without luck.
I have tried:
dfmonthtradedays.columns = pd.MultiIndex.from_arrays(dfmonthtradedays.columns, rics[1].tolist())
(The number of columns in df1 equals the number of rows in df2.)
Regards
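No answer is recorded here, but a sketch of one possible fix: pd.MultiIndex.from_arrays expects a single list of array-likes, while the attempt above passes two separate arguments, and the country codes also need to be aligned to the existing column order. Assuming rics maps every ticker that appears in dfmonthtradedays.columns:
# Map each ticker to its country code, preserving the current column order
country = dfmonthtradedays.columns.map(dict(zip(rics[0], rics[1])))
# Build a two-level column index: (ticker, country)
dfmonthtradedays.columns = pd.MultiIndex.from_arrays([dfmonthtradedays.columns, country])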