How to aggregate across columns in pandas?

There are 5 members contributing the value of something for every [E,M,S] as below:
E,M,S,Mem1,Mem2,Mem3,Mem4,Mem5
1,365,-10,15,21,18,16,,
1,365,10,23,34,,45,65
365,365,-20,34,45,43,32,23
365,365,20,56,45,,32,38
730,365,-5,82,64,13,63,27
730,365,15,24,68,,79,78
Notice that there are missing contributions (the consecutive commas ,,). I want to know the number of contributions for each [E,M,S]. For this example the output is:
1,365,-10,4
1,365,10,4
365,365,-20,5
365,365,20,4
730,365,-5,5
730,365,15,4
Grouping by ['E','M','S'] and then aggregating (counting) or applying a function across axis=1 should do it. How is that done? Or is there another idiomatic way to do this?

The answer posted by @Wen is brilliant and definitely seems like the easiest way to do this.
If you wanted another way to do this, you could use .melt to view the groups in the DF, then use groupby and aggregate within each group of the melted DF. You just need to ignore the NaNs when you aggregate, and one way to do this is to follow the approach in this SO post - .notnull() applied to groups.
Input DF
print(df)
E M S Mem1 Mem2 Mem3 Mem4 Mem5
0 1 365 -10 15 21 18.0 16 NaN
1 1 365 10 23 34 NaN 45 65.0
2 365 365 -20 34 45 43.0 32 23.0
3 365 365 20 56 45 NaN 32 38.0
4 730 365 -5 82 64 13.0 63 27.0
5 730 365 15 24 68 NaN 79 78.0
Here is the approach
# Apply melt to view groups
dfm = pd.melt(df, id_vars=['E','M','S'])
print(dfm.head(10))
E M S variable value
0 1 365 -10 Mem1 15.0
1 1 365 10 Mem1 23.0
2 365 365 -20 Mem1 34.0
3 365 365 20 Mem1 56.0
4 730 365 -5 Mem1 82.0
5 730 365 15 Mem1 24.0
6 1 365 -10 Mem2 21.0
7 1 365 10 Mem2 34.0
8 365 365 -20 Mem2 45.0
9 365 365 20 Mem2 45.0
# GROUP BY
grouped = dfm.groupby(['E','M','S'])
# Aggregate within each group, while ignoring NaNs
gtotals = grouped['value'].apply(lambda x: x.notnull().sum())
# (Optional) Reset grouped DF index
gtotals = gtotals.reset_index(drop=False)
print(gtotals)
E M S value
0 1 365 -10 4
1 1 365 10 4
2 365 365 -20 5
3 365 365 20 4
4 730 365 -5 5
5 730 365 15 4
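If you prefer to count across axis=1 without melting, as hinted in the question, here is a minimal sketch (assuming the member columns are exactly Mem1..Mem5 and that 'count' is just an illustrative column name):
# Count non-null member contributions per row, keeping the group keys
mem_cols = ['Mem1', 'Mem2', 'Mem3', 'Mem4', 'Mem5']
out = df[['E', 'M', 'S']].copy()
out['count'] = df[mem_cols].notnull().sum(axis=1)
print(out)
Since each [E,M,S] combination appears on a single row in this sample, the result matches the desired output directly.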


Pandas - creating new column based on data from other records

I have a pandas dataframe which has the following columns:
Day, Month, Year, City, Temperature.
I would like to have a new column that has the average (mean) temperature on the same date (day/month) of all previous years.
Can someone please assist?
Thanks :-)
Try:
import numpy as np
import pandas as pd

dti = pd.date_range('2000-1-1', '2021-12-1', freq='D')
temp = np.random.randint(10, 20, len(dti))
df = pd.DataFrame({'Day': dti.day, 'Month': dti.month, 'Year': dti.year,
                   'City': 'Nice', 'Temperature': temp})
out = df.set_index('Year').groupby(['City', 'Month', 'Day']) \
        .expanding()['Temperature'].mean().reset_index()
Output:
>>> out
Day Month Year City Temperature
0 1 1 2000 Nice 12.000000
1 1 1 2001 Nice 12.000000
2 1 1 2002 Nice 11.333333
3 1 1 2003 Nice 12.250000
4 1 1 2004 Nice 11.800000
... ... ... ... ... ...
8001 31 12 2016 Nice 15.647059
8002 31 12 2017 Nice 15.555556
8003 31 12 2018 Nice 15.631579
8004 31 12 2019 Nice 15.750000
8005 31 12 2020 Nice 15.666667
[8006 rows x 5 columns]
Focus on 1st January of the dataset:
>>> df[df['Day'].eq(1) & df['Month'].eq(1)]
Day Month Year City Temperature # Mean
0 1 1 2000 Nice 12 # 12
366 1 1 2001 Nice 12 # 12
731 1 1 2002 Nice 10 # 11.33
1096 1 1 2003 Nice 15 # 12.25
1461 1 1 2004 Nice 10 # 11.80
1827 1 1 2005 Nice 12 # and so on
2192 1 1 2006 Nice 17
2557 1 1 2007 Nice 16
2922 1 1 2008 Nice 19
3288 1 1 2009 Nice 12
3653 1 1 2010 Nice 10
4018 1 1 2011 Nice 16
4383 1 1 2012 Nice 13
4749 1 1 2013 Nice 15
5114 1 1 2014 Nice 14
5479 1 1 2015 Nice 13
5844 1 1 2016 Nice 15
6210 1 1 2017 Nice 13
6575 1 1 2018 Nice 15
6940 1 1 2019 Nice 18
7305 1 1 2020 Nice 11
7671 1 1 2021 Nice 14
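Note that the expanding mean above includes the current year, while the question asks for previous years only. If that distinction matters, one possible variant (an assumption on my part, not part of the original answer; 'PrevMean' is just an illustrative name) shifts within each group before averaging:
# Sort so years are in order within each (City, Month, Day) group,
# then take the expanding mean and shift it so the current year is excluded
df = df.sort_values(['City', 'Month', 'Day', 'Year'])
df['PrevMean'] = (df.groupby(['City', 'Month', 'Day'])['Temperature']
                    .transform(lambda s: s.expanding().mean().shift()))
The first year of each date then gets NaN, since there are no previous years to average.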

python pandas find percentile for a group in column

I would like to find the percentile of each column, add it to the df data frame, and also label each value:
top 20 percent (value > 80th percentile) then 'strong'
bottom 20 percent (value < 20th percentile) then 'weak'
else 'average'
Below is my dataframe
df = pd.DataFrame({'month': ['1','1','1','1','1','2','2','2','2','2','2','2'],
                   'X1': [30,42,25,32,12,10,4,6,5,10,24,21],
                   'X2': [10,76,100,23,65,94,67,24,67,54,87,81],
                   'X3': [23,78,95,52,60,76,68,92,34,76,34,12]})
df
Below is what I tried:
df['X1_percentile'] = df.X1.rank(pct=True)
df['X1_segment'] = np.where(df['X1_percentile'] > 0.8, 'Strong',
                            np.where(df['X1_percentile'] < 0.20, 'Weak', 'Average'))
But I would like to do this for each month and for each column. If possible, could this be automated by a function for any number of columns, creating colname+"_per" and colname+"_segment" columns for each one?
Thanks
We can use groupby + rank with the optional parameter pct=True to calculate the ranking expressed as a percentile rank, then use np.select to bin/categorize the percentile values into discrete labels.
p = df.groupby('month').rank(pct=True)
df[p.columns + '_per'] = p
df[p.columns + '_seg'] = np.select([p.gt(.8), p.lt(.2)], ['strong', 'weak'], 'average')
month X1 X2 X3 X1_per X2_per X3_per X1_seg X2_seg X3_seg
0 1 30 10 23 0.600000 0.200000 0.200000 average average average
1 1 42 76 78 1.000000 0.800000 0.800000 strong average average
2 1 25 100 95 0.400000 1.000000 1.000000 average strong strong
3 1 32 23 52 0.800000 0.400000 0.400000 average average average
4 1 12 65 60 0.200000 0.600000 0.600000 average average average
5 2 10 94 76 0.642857 1.000000 0.785714 average strong average
6 2 4 67 68 0.142857 0.500000 0.571429 weak average average
7 2 6 24 92 0.428571 0.142857 1.000000 average weak strong
8 2 5 67 34 0.285714 0.500000 0.357143 average average average
9 2 10 54 76 0.642857 0.285714 0.785714 average average average
10 2 24 87 34 1.000000 0.857143 0.357143 strong strong average
11 2 21 81 12 0.857143 0.714286 0.142857 strong average weak
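If you want the reusable helper the question asks for, a sketch along these lines should work (label_percentiles is a hypothetical name; the thresholds are parameters):
import numpy as np

def label_percentiles(df, group_col='month', hi=0.8, lo=0.2):
    # Percentile rank of every numeric column within each group
    p = df.groupby(group_col).rank(pct=True)
    out = df.copy()
    out[p.columns + '_per'] = p
    out[p.columns + '_seg'] = np.select([p.gt(hi), p.lt(lo)], ['strong', 'weak'], 'average')
    return out

result = label_percentiles(df)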

pandas filter df by aggregate failing

My df:
Plate Route Speed Dif Latitude Longitude
1 724TL054M RUTA 23 0 32.0 19.489872 -99.183970
2 0350021 RUTA 35 0 33.0 19.303572 -99.083700
3 0120480 RUTA 12 0 32.0 19.356400 -99.125694
4 1000106 RUTA 100 0 32.0 19.212614 -99.131874
5 0030719 RUTA 3 0 36.0 19.522831 -99.258500
... ... ... ... ... ... ...
1617762 923CH113M RUTA 104 0 33.0 19.334467 -99.016880
1617763 0120077 RUTA 12 0 32.0 19.302448 -99.084530
1617764 0470053 RUTA 47 0 33.0 19.399706 -99.209190
1617765 0400070 CETRAM 0 33.0 19.265041 -99.163290
1617766 0760175 RUTA 76 0 33.0 19.274513 -99.240150
I want to keep only those plates whose summed Dif (hence the groupby) is bigger than 3600 (1 hour, since Dif is in seconds), and discard the rest.
I tried (after a post from here):
df.groupby('Plate').filter(lambda x: x['Dif'].sum() > 3600)
But I still get about 60 plates with under 3600 as sum:
df.groupby('Plate').agg({'Dif':'sum'}).reset_index().nsmallest(60, 'Dif')
Plate Dif
952 655NZ035M 268.0
1122 949CH002C 814.0
446 0440220 1318.0
1124 949CH005C 1334.0
1042 698NZ011M 1434.0
1038 697NZ011M 1474.0
1 0010193 1509.0
282 0270302 1513.0
909 614NZ021M 1554.0
156 0140236 1570.0
425 0430092 1577.0
603 0620123 1586.0
510 0530029 1624.0
213 0180682 1651.0
736 0800126 1670.0
I have spent some hours on this and I can't solve it. Any help is appreciated.
Assign it back; groupby.filter returns a new DataFrame rather than modifying df in place:
df = df.groupby('Plate').filter(lambda x: x['Dif'].sum() > 3600)
Then
df.groupby('Plate').agg({'Dif':'sum'}).reset_index().nsmallest(60, 'Dif')
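An equivalent boolean-mask alternative (a sketch, not from the original answer; often faster than filter on large frames) uses transform instead:
# Keep only rows whose plate's total Dif exceeds 3600 seconds
df = df[df.groupby('Plate')['Dif'].transform('sum') > 3600]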

pandas complex groupby, count and apply a cap

Sample dataframe
location_day payType Name ratePay elapsedSeconds
1 2019-12-10 PRE Amy 12.25 199
2 2019-12-12 PRE Amy 12.25 7
3 2019-12-17 PRE Amy 12.25 68
4 2019-12-17 RP Amy 8.75 62
For each day, sum elapsedSeconds and calculate a new column with the total toPay (elapsedSeconds * ratePay), but apply a "cap" on elapsedSeconds of 120. For any single day that has only 1 payType, apply the cap so that only 120 is used to calculate the "toPay" column.
But...
Also, group by payType so that if there are 2 unique payTypes on a single day, sum the elapsedSeconds to determine whether it's over the cap (120) and, if so, subtract from the last payType's elapsedSeconds to make the sum equal to 120.
So I desire this output:
location_day payType Name ratePay elapsedSeconds
1 2019-12-10 PRE Amy 12.25 120
2 2019-12-12 PRE Amy 12.25 7
3 2019-12-17 PRE Amy 12.25 68
4 2019-12-17 RP Amy 8.75 52
I'm not quite sure how to approach this one; I have really only done some very basic grouping and tested calculating new columns with conditional statements such as
finDfcalc1 = finDf.sort_values('location_day').groupby(flds)['elapsedSeconds'].sum().reset_index()
finDfcalc1.loc[finDfcalc1['elapsedSeconds'] < 120, 'elapsedSecondsOverage'] = finDfcalc1['elapsedSeconds'] * 1
finDfcalc1.loc[finDfcalc1['elapsedSeconds'] > 120, 'elapsedSecondsOverage'] = finDfcalc1['elapsedSeconds'] - 120
finDfcalc1['toPay'] = finDfcalc1['ratePay'] * finDfcalc1['elapsedSecondsOverage']
None of this has to be a one-liner; I would be perfectly happy just working out the logic. All suggestions and ideas are greatly appreciated.
We need to group on the day, calculate the cumulative sum of 'elapsedSeconds', then clip the total within a day at 120 seconds and back-calculate the correct number of seconds for each row.
Here's a longer sample dataset to show how it behaves for an additional day with many rows that need to get changed.
location_day payType Name ratePay elapsedSeconds
2019-12-10 PRE Amy 12.25 199
2019-12-12 PRE Amy 12.25 7
2019-12-17 PRE Amy 12.25 68
2019-12-17 RP Amy 8.75 62
2019-12-18 PRE Amy 12.25 50
2019-12-18 RP Amy 8.75 60
2019-12-18 RA Amy 8.75 20
2019-12-18 RE Amy 8.75 10
2019-12-18 XX Amy 8.75 123
Code:
# Will become the seconds you want in the end
df['real_sec'] = df.groupby('location_day').elapsedSeconds.cumsum()
# Calculate a difference
m = df['real_sec'] - df['elapsedSeconds']
# Cap at the magic number 120 and subtract seconds already used earlier in the day
df['real_sec'] = (df['real_sec'].clip(upper=120) # 120 at most
- m.where(m.gt(0)).fillna(0) # only change rows where diff is positive
).clip(lower=0) # Negative results -> 0
location_day payType Name ratePay elapsedSeconds real_sec
0 2019-12-10 PRE Amy 12.25 199 120.0
1 2019-12-12 PRE Amy 12.25 7 7.0
2 2019-12-17 PRE Amy 12.25 68 68.0
3 2019-12-17 RP Amy 8.75 62 52.0
4 2019-12-18 PRE Amy 12.25 50 50.0
5 2019-12-18 RP Amy 8.75 60 60.0
6 2019-12-18 RA Amy 8.75 20 10.0
7 2019-12-18 RE Amy 8.75 10 0.0
8 2019-12-18 XX Amy 8.75 123 0.0
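From there, the toPay column described in the question is just the capped seconds times the rate; a short follow-up sketch using the real_sec helper column created above:
# Pay owed per row, based on the capped seconds
df['toPay'] = df['ratePay'] * df['real_sec']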

Trouble with NaNs: set_index().reset_index() corrupts data

I read that NaNs are problematic, but the following causes an actual corruption of my data, rather than an error. Is this a bug? Have I missed something basic in the documentation?
I would like the second command to give an error or to give the same response as the first command:
ipdb> df
year PRuid QC data
18 2007 nonQC 0 8.014261
19 2008 nonQC 0 7.859152
20 2010 nonQC 0 7.468260
21 1985 10 NaN 0.861403
22 1985 11 NaN 0.878531
23 1985 12 NaN 0.842704
24 1985 13 NaN 0.785877
25 1985 24 1 0.730625
26 1985 35 NaN 0.816686
27 1985 46 NaN 0.819271
28 1985 47 NaN 0.807050
ipdb> df.set_index(['year','PRuid','QC']).reset_index()
year PRuid QC data
0 2007 nonQC 0 8.014261
1 2008 nonQC 0 7.859152
2 2010 nonQC 0 7.468260
3 1985 10 1 0.861403
4 1985 11 1 0.878531
5 1985 12 1 0.842704
6 1985 13 1 0.785877
7 1985 24 1 0.730625
8 1985 35 1 0.816686
9 1985 46 1 0.819271
10 1985 47 1 0.807050
The value of "QC" is actually changed to 1 from NaN where it should be NaN.
Btw, for symmetry I added the ".reset_index()", but the data corruption is introduced by set_index.
And in case this is interesting, the version is:
pd.version
<module 'pandas.version' from '/usr/lib/python2.6/site-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/version.pyc'>
So this was a bug. By the end of May 2013, pandas 0.11.1 should be released with the bug fix (see comments on this question).
In the meantime, I avoided using a column with NaNs in any MultiIndex, for instance by using some other flag value (-99) for the NaNs in the column 'QC'.
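For illustration, that flag-value workaround might look like the sketch below (assuming -99 never occurs as a real QC value):
# Replace NaN with a sentinel before building the MultiIndex, then restore it afterwards
df['QC'] = df['QC'].fillna(-99)
df = df.set_index(['year', 'PRuid', 'QC']).reset_index()
df['QC'] = df['QC'].replace(-99, float('nan'))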