pandas: filter df by aggregate failing

My df:
             Plate     Route  Speed   Dif   Latitude  Longitude
1        724TL054M   RUTA 23      0  32.0  19.489872 -99.183970
2          0350021   RUTA 35      0  33.0  19.303572 -99.083700
3          0120480   RUTA 12      0  32.0  19.356400 -99.125694
4          1000106  RUTA 100      0  32.0  19.212614 -99.131874
5          0030719    RUTA 3      0  36.0  19.522831 -99.258500
...            ...       ...    ...   ...        ...        ...
1617762  923CH113M  RUTA 104      0  33.0  19.334467 -99.016880
1617763    0120077   RUTA 12      0  32.0  19.302448 -99.084530
1617764    0470053   RUTA 47      0  33.0  19.399706 -99.209190
1617765    0400070    CETRAM      0  33.0  19.265041 -99.163290
1617766    0760175   RUTA 76      0  33.0  19.274513 -99.240150
I want to keep only those plates whose summed Dif (hence the groupby) is greater than 3600 (1 hour, since Dif is in seconds) and discard the rest.
I tried (following a post from here):
df.groupby('Plate').filter(lambda x: x['Dif'].sum() > 3600)
But I still get about 60 plates whose summed Dif is under 3600:
df.groupby('Plate').agg({'Dif':'sum'}).reset_index().nsmallest(60, 'Dif')
Plate Dif
952 655NZ035M 268.0
1122 949CH002C 814.0
446 0440220 1318.0
1124 949CH005C 1334.0
1042 698NZ011M 1434.0
1038 697NZ011M 1474.0
1 0010193 1509.0
282 0270302 1513.0
909 614NZ021M 1554.0
156 0140236 1570.0
425 0430092 1577.0
603 0620123 1586.0
510 0530029 1624.0
213 0180682 1651.0
736 0800126 1670.0
I have spent some hours on this and I can't solve it. Any help is appreciated.

Assign it back: groupby().filter() returns a new DataFrame rather than modifying df in place.
df = df.groupby('Plate').filter(lambda x: x['Dif'].sum() > 3600)
Then re-run the check:
df.groupby('Plate').agg({'Dif':'sum'}).reset_index().nsmallest(60, 'Dif')
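An equivalent way to get the same result, if you prefer a boolean mask (a sketch using transform, assuming the same df and column names as above):
# keep only rows whose per-plate Dif total exceeds 3600 seconds
mask = df.groupby('Plate')['Dif'].transform('sum') > 3600
df = df[mask]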

Related

Creating a Lookup Matrix in Microsoft Access

I have the matrix below in Excel and want to import it into Access (2016) to then use in queries. The aim is to be able to look up values based on the row and column. E.g. lookup criteria of 10 and 117 should return 98.1.
Is this possible? I'm an Access novice and don't know where to start.
        10     9     8     7     6     5     4     3     2    1    0
120  100.0  96.8  92.6  86.7  78.8  68.2  54.4  37.5  21.3  8.3  0.0
119   99.4  96.2  92.0  86.2  78.5  67.9  54.3  37.5  21.3  8.3  0.0
118   98.7  95.6  91.5  85.8  78.1  67.7  54.1  37.4  21.2  8.3  0.0
117   98.1  95.1  90.9  85.3  77.8  67.4  54.0  37.4  21.2  8.3  0.0
116   97.4  94.5  90.3  84.8  77.4  67.1  53.8  37.4  21.1  8.3  0.0
115   96.8  93.9  89.8  84.4  77.1  66.9  53.7  37.3  21.1  8.3  0.0
Consider creating a table with 3 columns to store this data:
Value1 - numeric
Value2 - numeric
LookupValue - currency
You can then use DLookup to get the value required:
?DLookup("LookupValue","LookupData","Value1=117 AND Value2=10")
If you have the values stored in variables, then you need to concatenate them in:
lngValue1=117
lngValue2=10
Debug.Print DLookup("LookupValue","LookupData","Value1=" & lngValue1 & " AND Value2=" & lngValue2)
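If it helps to flatten the Excel matrix into that 3-column shape before importing it into Access, here is a rough pandas sketch (the file name and layout assumptions are mine; it assumes the row labels 115-120 sit in the first column and the column labels 0-10 form the header row):
import pandas as pd

# hypothetical file name; first column = row labels, header row = column labels
wide = pd.read_excel('lookup_matrix.xlsx', index_col=0)
wide.index.name = 'Value1'

# melt the matrix into one (Value1, Value2, LookupValue) row per cell
long = (wide.reset_index()
            .melt(id_vars='Value1', var_name='Value2', value_name='LookupValue'))
long.to_csv('LookupData.csv', index=False)  # import this CSV into Access as LookupData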

python pandas find percentile for a group in column

I would like to find the percentile of each column, add it to the df data frame, and also label each value:
top 20 percent (value > 80th percentile): 'strong'
bottom 20 percent (value < 20th percentile): 'weak'
else: 'average'
Below is my dataframe
import pandas as pd
import numpy as np

df = pd.DataFrame({'month': ['1','1','1','1','1','2','2','2','2','2','2','2'],
                   'X1': [30,42,25,32,12,10,4,6,5,10,24,21],
                   'X2': [10,76,100,23,65,94,67,24,67,54,87,81],
                   'X3': [23,78,95,52,60,76,68,92,34,76,34,12]})
df
Below is what I tried:
df['X1_percentile'] = df.X1.rank(pct=True)
df['X1_segment'] = np.where(df['X1_percentile'] > 0.8, 'Strong',
                            np.where(df['X1_percentile'] < 0.20, 'Weak', 'Average'))
But I would like to do this for each month and for each column. And if possible, this could be automated by a function for any number of columns, writing colname+"_per" and colname+"_segment" for each column?
Thanks
We can use groupby + rank with the optional parameter pct=True to calculate the ranking expressed as a percentile rank, then use np.select to bin/categorize the percentile values into discrete labels.
p = df.groupby('month').rank(pct=True)
df[p.columns + '_per'] = p
df[p.columns + '_seg'] = np.select([p.gt(.8), p.lt(.2)], ['strong', 'weak'], 'average')
month X1 X2 X3 X1_per X2_per X3_per X1_seg X2_seg X3_seg
0 1 30 10 23 0.600000 0.200000 0.200000 average average average
1 1 42 76 78 1.000000 0.800000 0.800000 strong average average
2 1 25 100 95 0.400000 1.000000 1.000000 average strong strong
3 1 32 23 52 0.800000 0.400000 0.400000 average average average
4 1 12 65 60 0.200000 0.600000 0.600000 average average average
5 2 10 94 76 0.642857 1.000000 0.785714 average strong average
6 2 4 67 68 0.142857 0.500000 0.571429 weak average average
7 2 6 24 92 0.428571 0.142857 1.000000 average weak strong
8 2 5 67 34 0.285714 0.500000 0.357143 average average average
9 2 10 54 76 0.642857 0.285714 0.785714 average average average
10 2 24 87 34 1.000000 0.857143 0.357143 strong strong average
11 2 21 81 12 0.857143 0.714286 0.142857 strong average weak
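If you want this wrapped in a reusable function, here is a sketch (the function name, thresholds and column suffixes are my own choices):
def add_percentile_segments(frame, group_col='month', hi=0.8, lo=0.2):
    # percentile rank of every non-grouping column within each group
    p = frame.groupby(group_col).rank(pct=True)
    out = frame.copy()
    out[p.columns + '_per'] = p
    out[p.columns + '_seg'] = np.select([p.gt(hi), p.lt(lo)], ['strong', 'weak'], 'average')
    return out

df = add_percentile_segments(df)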

ID rows containing values greater than corresponding values based on a criteria from another row

I have a grouped dataframe. I have created a flag that identifies whether values in a row are less than the group maximums. This works fine. However, I want to unflag rows where the value contained in a third column is greater than the value in the same (third) column within each group. I have a feeling there should be an elegant and pythonic way to do this but I can't figure it out.
The flag shown in the code compares the maximum value of tour_duration within each hh_id to the corresponding value of comp_expr and assigns 0 to the flag column where that maximum is less than comp_expr (1 otherwise). However, I want the flag to be 0 whenever min(arrivaltime) for a subgroup tour_id is greater than max(arrivaltime) of the tour_id whose tour_duration is the maximum within each hh_id. For example, in the given data, tour_id 16300 has the highest value of tour_duration, but tour_id 16200 has min arrivaltime 1080, which is greater than max(arrivaltime) for tour_id 16300 (960), so the flag for all tour_id 16200 rows should be 0.
Kindly assist.
import pandas as pd
import numpy as np
stops_data = pd.DataFrame({
    'hh_id': [20044,20044,20044,20044,20044,20044,20044,20044,20044,20044,20044,
              20122,20122,20122,20122,20122,20122,20122,20122,20122,20122,20122,20122,20122],
    'tour_id': [16300,16300,16100,16100,16100,16100,16200,16200,16200,16000,16000,
                38100,38100,37900,37900,37900,38000,38000,38000,38000,38000,38000,37800,37800],
    'arrivaltime': [360,960,900,900,900,960,1080,1140,1140,420,840,
                    300,960,780,720,960,1080,1080,1080,1080,1140,1140,480,900],
    'tour_duration': [600,600,60,60,60,60,60,60,60,420,420,
                      660,660,240,240,240,60,60,60,60,60,60,420,420],
    'comp_expr': [1350,1350,268,268,268,268,406,406,406,974,974,
                  1568,1568,606,606,606,298,298,298,298,298,298,840,840]})

stops_data['flag'] = np.where(
    stops_data.groupby(['hh_id'])['tour_duration'].transform('max') < stops_data['comp_expr'], 0, 1)
This is my current output: [image: current dataset and output]
This is my desired output, please see the flag column: [image: desired output, changed flag values in bold]
>>> stops_data.loc[stops_data.tour_id
.isin(stops_data.loc[stops_data.loc[stops_data
.groupby(['hh_id','tour_id'])['arrivaltime'].idxmin()]
.groupby('hh_id')['arrivaltime'].idxmax()]['tour_id']), 'flag'] = 0
>>> stops_data
hh_id tour_id arrivaltime tour_duration comp_expr flag
0 20044 16300 360 600 1350 0
1 20044 16300 960 600 1350 0
2 20044 16100 900 60 268 1
3 20044 16100 900 60 268 1
4 20044 16100 900 60 268 1
5 20044 16100 960 60 268 1
6 20044 16200 1080 60 406 0
7 20044 16200 1140 60 406 0
8 20044 16200 1140 60 406 0
9 20044 16000 420 420 974 0
10 20044 16000 840 420 974 0
11 20122 38100 300 660 1568 0
12 20122 38100 960 660 1568 0
13 20122 37900 780 240 606 1
14 20122 37900 720 240 606 1
15 20122 37900 960 240 606 1
16 20122 38000 1080 60 298 0
17 20122 38000 1080 60 298 0
18 20122 38000 1080 60 298 0
19 20122 38000 1080 60 298 0
20 20122 38000 1140 60 298 0
21 20122 38000 1140 60 298 0
22 20122 37800 480 420 840 0
23 20122 37800 900 420 840 0
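For readability, the chained .loc above can be unpacked into steps; this sketch does the same thing (the intermediate names are mine):
# 1. index of the minimum-arrivaltime row within each (hh_id, tour_id)
min_rows = stops_data.groupby(['hh_id', 'tour_id'])['arrivaltime'].idxmin()

# 2. among those per-tour minima, the row with the largest arrivaltime per hh_id
latest_min_rows = stops_data.loc[min_rows].groupby('hh_id')['arrivaltime'].idxmax()

# 3. unflag every row belonging to those tour_ids
late_tours = stops_data.loc[latest_min_rows, 'tour_id']
stops_data.loc[stops_data['tour_id'].isin(late_tours), 'flag'] = 0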

How to aggregate across columns in pandas?

There are 5 members contributing the value of something for every [E,M,S] as below:
E,M,S,Mem1,Mem2,Mem3,Mem4,Mem5
1,365,-10,15,21,18,16,,
1,365,10,23,34,,45,65
365,365,-20,34,45,43,32,23
365,365,20,56,45,,32,38
730,365,-5,82,64,13,63,27
730,365,15,24,68,,79,78
Notice that there are missing contributions (the empty ,, fields). I want to know the number of contributions for each [E,M,S]. For this example the output is:
1,365,-10,4
1,365,10,4
365,365,-20,5
365,365,20,4
730,365,-5,5
730,365,15,4
Grouping by ['E','M','S'] and then aggregating (counting) or applying a function, but across axis=1, should do it. How is that done? Or is there another idiomatic way to do this?
The answer posted by #Wen is brilliant and definitely seems like the easiest way to do this.
If you wanted another way to do this, then you could use .melt to view the groups in the DF. Then, use groupby with a .sum() aggregation within each group in the melted DF. You just need to ignore the NaNs when you aggregate, and one way to do this is by following the approach in this SO post - .notnull() applied to groups.
Input DF
print(df)
E M S Mem1 Mem2 Mem3 Mem4 Mem5
0 1 365 -10 15 21 18.0 16 NaN
1 1 365 10 23 34 NaN 45 65.0
2 365 365 -20 34 45 43.0 32 23.0
3 365 365 20 56 45 NaN 32 38.0
4 730 365 -5 82 64 13.0 63 27.0
5 730 365 15 24 68 NaN 79 78.0
Here is the approach
# Apply melt to view groups
dfm = pd.melt(df, id_vars=['E','M','S'])
print(dfm.head(10))
E M S variable value
0 1 365 -10 Mem1 15.0
1 1 365 10 Mem1 23.0
2 365 365 -20 Mem1 34.0
3 365 365 20 Mem1 56.0
4 730 365 -5 Mem1 82.0
5 730 365 15 Mem1 24.0
6 1 365 -10 Mem2 21.0
7 1 365 10 Mem2 34.0
8 365 365 -20 Mem2 45.0
9 365 365 20 Mem2 45.0
# GROUP BY
grouped = dfm.groupby(['E','M','S'])
# Aggregate within each group, while ignoring NaNs
gtotals = grouped['value'].apply(lambda x: x.notnull().sum())
# (Optional) Reset grouped DF index
gtotals = gtotals.reset_index(drop=False)
print(gtotals)
E M S value
0 1 365 -10 4
1 1 365 10 4
2 365 365 -20 5
3 365 365 20 4
4 730 365 -5 5
5 730 365 15 4
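For completeness, if each [E,M,S] combination occupies a single row (as in the sample), a row-wise count across the member columns is another option (a sketch, not necessarily the answer #Wen posted):
# count non-missing member contributions in each row
mem_cols = ['Mem1', 'Mem2', 'Mem3', 'Mem4', 'Mem5']
counts = df[['E', 'M', 'S']].copy()
counts['n'] = df[mem_cols].notnull().sum(axis=1)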

MySQL: How to select the UTC offset and DST for all timezones?

I want a list of all timezones in the mysql timezone tables, and need to select:
1) Their current offset from GMT
2) Whether DST is used by that timezone (not whether it's currently in use, just whether DST is considered at some point in the year for that timezone)
Reason:
I need to build a web form and match the users time zone information (which I can generate from javascript) to the correct time zone stored in the mysql DB. I can find UTC offset and get a DST flag from javascript functions.
Try this query. The offsettime is the offset in hours (Offset / 60 / 60):
SELECT tzname.`Time_zone_id`,(`Offset`/60/60) AS `offsettime`,`Is_DST`,`Name`,`Transition_type_id`,`Abbreviation`
FROM `time_zone_transition_type` AS `transition`, `time_zone_name` AS `tzname`
WHERE transition.`Time_zone_id`=tzname.`Time_zone_id`
ORDER BY transition.`Offset` ASC;
The results are:
501 -12.00000000 0 0 PHOT Pacific/Enderbury
369 -12.00000000 0 0 GMT+12 Etc/GMT+12
513 -12.00000000 0 1 KWAT Pacific/Kwajalein
483 -12.00000000 0 1 KWAT Kwajalein
518 -11.50000000 0 1 NUT Pacific/Niue
496 -11.50000000 0 1 SAMT Pacific/Apia
528 -11.50000000 0 1 SAMT Pacific/Samoa
555 -11.50000000 0 1 SAMT US/Samoa
521 -11.50000000 0 1 SAMT Pacific/Pago_Pago
496 -11.44888889 0 0 LMT Pacific/Apia
528 -11.38000000 0 0 LMT Pacific/Samoa
555 -11.38000000 0 0 LMT US/Samoa
521 -11.38000000 0 0 LMT Pacific/Pago_Pago
518 -11.33333333 0 0 NUT Pacific/Niue
544 -11.00000000 0 3 BST US/Aleutian
163 -11.00000000 0 3 BST America/Nome
518 -11.00000000 0 2 NUT Pacific/Niue
496 -11.00000000 0 2 WST Pacific/Apia
544 -11.00000000 0 0 NST US/Aleutian
163 -11.00000000 0 0 NST America/Nome
528 -11.00000000 0 4 SST Pacific/Samoa
528 -11.00000000 0 3 BST Pacific/Samoa