I have the matrix below in Excel and want to import it into Access (2016) to then use in queries. The aim is to be able to look up values based on the row and column. E.g. lookup criteria of 10 and 117 should return 98.1.
Is this possible? I'm an Access novice and don't know where to start.
       10     9     8     7     6     5     4     3     2     1     0
120  100.0  96.8  92.6  86.7  78.8  68.2  54.4  37.5  21.3  8.3   0.0
119   99.4  96.2  92.0  86.2  78.5  67.9  54.3  37.5  21.3  8.3   0.0
118   98.7  95.6  91.5  85.8  78.1  67.7  54.1  37.4  21.2  8.3   0.0
117   98.1  95.1  90.9  85.3  77.8  67.4  54.0  37.4  21.2  8.3   0.0
116   97.4  94.5  90.3  84.8  77.4  67.1  53.8  37.4  21.1  8.3   0.0
115   96.8  93.9  89.8  84.4  77.1  66.9  53.7  37.3  21.1  8.3   0.0
Consider creating a table with 3 columns to store this data:
Value1 - numeric
Value2 - numeric
LookupValue - currency
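If the matrix is still in Excel, one way to get it into that 3-column shape is to unpivot it before importing. Below is a minimal sketch using Python/pandas (the file name matrix.xlsx is a placeholder, and this reshaping step is an assumption, not part of the answer above); the resulting CSV can then be imported into Access as the LookupData table:
import pandas as pd

# Read the matrix; the first Excel column (120, 119, ...) becomes the index
wide = pd.read_excel("matrix.xlsx", index_col=0)
wide.index.name = "Value1"

# Unpivot into one row per (Value1, Value2, LookupValue)
long_form = wide.reset_index().melt(id_vars="Value1", var_name="Value2", value_name="LookupValue")

# Export for import into Access as the LookupData table
long_form.to_csv("LookupData.csv", index=False)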
You can then use DLookup to get the value required:
?DLookup("LookupValue","LookupData","Value1=117 AND Value2=10")
If you have the values stored in variables, then you need to concatenate them into the criteria:
Dim lngValue1 As Long, lngValue2 As Long
lngValue1 = 117
lngValue2 = 10
Debug.Print DLookup("LookupValue", "LookupData", "Value1=" & lngValue1 & " AND Value2=" & lngValue2)
I would like to find the percentile of each column, add it to the df data frame, and also label each value:
if the value is in the top 20 percent (value > 80th percentile) then 'strong'
if it is in the bottom 20 percent (value < 20th percentile) then 'weak'
else 'average'
Below is my dataframe
import pandas as pd

df = pd.DataFrame({'month': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2'],
                   'X1': [30, 42, 25, 32, 12, 10, 4, 6, 5, 10, 24, 21],
                   'X2': [10, 76, 100, 23, 65, 94, 67, 24, 67, 54, 87, 81],
                   'X3': [23, 78, 95, 52, 60, 76, 68, 92, 34, 76, 34, 12]})
df
Below is what I tried:
import numpy as np

df['X1_percentile'] = df.X1.rank(pct=True)
df['X1_segment'] = np.where(df['X1_percentile'] > 0.8, 'Strong',
                            np.where(df['X1_percentile'] < 0.20, 'Weak', 'Average'))
But I would like to do this for each month and for each column. If possible, could this be automated by a function that works for any number of columns and creates colname+"_per" and colname+"_segment" columns for each?
Thanks
We can use groupby + rank with the optional parameter pct=True to calculate the ranking expressed as a percentile rank, then use np.select to bin/categorize the percentile values into discrete labels.
p = df.groupby('month').rank(pct=True)
df[p.columns + '_per'] = p
df[p.columns + '_seg'] = np.select([p.gt(.8), p.lt(.2)], ['strong', 'weak'], 'average')
month X1 X2 X3 X1_per X2_per X3_per X1_seg X2_seg X3_seg
0 1 30 10 23 0.600000 0.200000 0.200000 average average average
1 1 42 76 78 1.000000 0.800000 0.800000 strong average average
2 1 25 100 95 0.400000 1.000000 1.000000 average strong strong
3 1 32 23 52 0.800000 0.400000 0.400000 average average average
4 1 12 65 60 0.200000 0.600000 0.600000 average average average
5 2 10 94 76 0.642857 1.000000 0.785714 average strong average
6 2 4 67 68 0.142857 0.500000 0.571429 weak average average
7 2 6 24 92 0.428571 0.142857 1.000000 average weak strong
8 2 5 67 34 0.285714 0.500000 0.357143 average average average
9 2 10 54 76 0.642857 0.285714 0.785714 average average average
10 2 24 87 34 1.000000 0.857143 0.357143 strong strong average
11 2 21 81 12 0.857143 0.714286 0.142857 strong average weak
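If you want the above wrapped in a reusable function, as asked, here is a rough sketch; the name add_percentile_segments, the cols/low/high parameters and the '_per'/'_seg' suffixes are illustrative choices, not something defined in the answer above:
def add_percentile_segments(df, group_col='month', cols=None, low=0.2, high=0.8):
    # Rank each requested column within its group as a percentile
    out = df.copy()
    cols = cols or out.columns.drop(group_col).tolist()
    p = out.groupby(group_col)[cols].rank(pct=True)
    out[p.columns + '_per'] = p
    # Label values above `high` as strong, below `low` as weak, the rest as average
    out[p.columns + '_seg'] = np.select([p.gt(high), p.lt(low)], ['strong', 'weak'], 'average')
    return out

df2 = add_percentile_segments(df, cols=['X1', 'X2', 'X3'])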
I have a grouped dataframe. I have created a flag that identifies if values in a row are less than the group maximums. This works fine. However, I want to unflag rows where the value contained in a third column is greater than the value in the same (third) column within each group. I have a feeling there should be an elegant and pythonic way to do this, but I can't figure it out.
The flag shown in the code compares the maximum value of tour_duration within each hh_id to the corresponding value of comp_expr and assigns 0 to the flag column where that maximum is less than comp_expr, and 1 otherwise. However, I want the flag to be 0 for a subgroup tour_id whenever min(arrivaltime) for that tour_id > max(arrivaltime) of the tour_id whose tour_duration is the maximum within the hh_id. For example, in the given data, tour_id 16300 has the highest value of tour_duration. But tour_id 16200 has min arrivaltime 1080, which is > max(arrivaltime) for tour_id 16300 (960). So the flag for all tour_id 16200 rows should be 0.
Kindly assist.
import pandas as pd
import numpy as np
stops_data = pd.DataFrame({
    'hh_id': [20044, 20044, 20044, 20044, 20044, 20044, 20044, 20044, 20044, 20044, 20044, 20122, 20122, 20122, 20122, 20122, 20122, 20122, 20122, 20122, 20122, 20122, 20122, 20122],
    'tour_id': [16300, 16300, 16100, 16100, 16100, 16100, 16200, 16200, 16200, 16000, 16000, 38100, 38100, 37900, 37900, 37900, 38000, 38000, 38000, 38000, 38000, 38000, 37800, 37800],
    'arrivaltime': [360, 960, 900, 900, 900, 960, 1080, 1140, 1140, 420, 840, 300, 960, 780, 720, 960, 1080, 1080, 1080, 1080, 1140, 1140, 480, 900],
    'tour_duration': [600, 600, 60, 60, 60, 60, 60, 60, 60, 420, 420, 660, 660, 240, 240, 240, 60, 60, 60, 60, 60, 60, 420, 420],
    'comp_expr': [1350, 1350, 268, 268, 268, 268, 406, 406, 406, 974, 974, 1568, 1568, 606, 606, 606, 298, 298, 298, 298, 298, 298, 840, 840]})

# flag = 0 where the household's maximum tour_duration is less than comp_expr, else 1
stops_data['flag'] = np.where(
    stops_data.groupby(['hh_id'])['tour_duration'].transform('max') < stops_data['comp_expr'], 0, 1)
This is my current output: (image) Current dataset and output
This is my desired output, please see the flag column: (image) Desired output, see changed flag values in bold
>>> stops_data.loc[stops_data.tour_id
.isin(stops_data.loc[stops_data.loc[stops_data
.groupby(['hh_id','tour_id'])['arrivaltime'].idxmin()]
.groupby('hh_id')['arrivaltime'].idxmax()]['tour_id']), 'flag'] = 0
>>> stops_data
hh_id tour_id arrivaltime tour_duration comp_expr flag
0 20044 16300 360 600 1350 0
1 20044 16300 960 600 1350 0
2 20044 16100 900 60 268 1
3 20044 16100 900 60 268 1
4 20044 16100 900 60 268 1
5 20044 16100 960 60 268 1
6 20044 16200 1080 60 406 0
7 20044 16200 1140 60 406 0
8 20044 16200 1140 60 406 0
9 20044 16000 420 420 974 0
10 20044 16000 840 420 974 0
11 20122 38100 300 660 1568 0
12 20122 38100 960 660 1568 0
13 20122 37900 780 240 606 1
14 20122 37900 720 240 606 1
15 20122 37900 960 240 606 1
16 20122 38000 1080 60 298 0
17 20122 38000 1080 60 298 0
18 20122 38000 1080 60 298 0
19 20122 38000 1080 60 298 0
20 20122 38000 1140 60 298 0
21 20122 38000 1140 60 298 0
22 20122 37800 480 420 840 0
23 20122 37800 900 420 840 0
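For readability, the chained indexing above can be split into named steps; this is an equivalent sketch of what the one-liner does (the intermediate variable names are mine):
# Row index of the earliest arrival within each (hh_id, tour_id)
first_stop_idx = stops_data.groupby(['hh_id', 'tour_id'])['arrivaltime'].idxmin()

# Among those earliest stops, the row with the latest arrival per hh_id
latest_idx = stops_data.loc[first_stop_idx].groupby('hh_id')['arrivaltime'].idxmax()

# Reset the flag to 0 for every row belonging to those tours
late_tours = stops_data.loc[latest_idx, 'tour_id']
stops_data.loc[stops_data['tour_id'].isin(late_tours), 'flag'] = 0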
There are 5 members, each contributing a value for every [E,M,S], as below:
E,M,S,Mem1,Mem2,Mem3,Mem4,Mem5
1,365,-10,15,21,18,16,,
1,365,10,23,34,,45,65
365,365,-20,34,45,43,32,23
365,365,20,56,45,,32,38
730,365,-5,82,64,13,63,27
730,365,15,24,68,,79,78
Notice that there are missing contributions (the empty ,, fields). I want to know the number of contributions for each [E,M,S]. For this example the output is:
1,365,-10,4
1,365,10,4
365,365,-20,5
365,365,20,4
730,365,-5,5
730,365,15,4
Grouping by ['E','M','S'] and then aggregating (counting), or applying a function across axis=1, would do. How is that done? Or is there another idiomatic way to do this?
The answer posted by @Wen is brilliant and definitely seems like the easiest way to do this.
If you wanted another way to do this, you could use .melt to view the groups in the DF, then use groupby with a .sum() aggregation within each group of the melted DF. You just need to ignore the NaNs when you aggregate; one way to do this is to follow the approach in this SO post - .notnull() applied to groups.
Input DF
print(df)
E M S Mem1 Mem2 Mem3 Mem4 Mem5
0 1 365 -10 15 21 18.0 16 NaN
1 1 365 10 23 34 NaN 45 65.0
2 365 365 -20 34 45 43.0 32 23.0
3 365 365 20 56 45 NaN 32 38.0
4 730 365 -5 82 64 13.0 63 27.0
5 730 365 15 24 68 NaN 79 78.0
Here is the approach
# Apply melt to view groups
dfm = pd.melt(df, id_vars=['E','M','S'])
print(dfm.head(10))
E M S variable value
0 1 365 -10 Mem1 15.0
1 1 365 10 Mem1 23.0
2 365 365 -20 Mem1 34.0
3 365 365 20 Mem1 56.0
4 730 365 -5 Mem1 82.0
5 730 365 15 Mem1 24.0
6 1 365 -10 Mem2 21.0
7 1 365 10 Mem2 34.0
8 365 365 -20 Mem2 45.0
9 365 365 20 Mem2 45.0
# GROUP BY
grouped = dfm.groupby(['E','M','S'])
# Aggregate within each group, while ignoring NaNs
gtotals = grouped['value'].apply(lambda x: x.notnull().sum())
# (Optional) Reset grouped DF index
gtotals = gtotals.reset_index(drop=False)
print(gtotals)
E M S value
0 1 365 -10 4
1 1 365 10 4
2 365 365 -20 5
3 365 365 20 4
4 730 365 -5 5
5 730 365 15 4
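For comparison, a compact row-wise count (just an illustration of the axis=1 idea mentioned in the question, not necessarily the code from @Wen's answer) gives the same totals without melting:
# Count non-missing member contributions in each row
counts = df[['E', 'M', 'S']].copy()
counts['value'] = df.filter(like='Mem').notna().sum(axis=1)
print(counts)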
I want a list of all time zones in the MySQL time zone tables, and need to select:
1) Their current offset from GMT
2) Whether DST is used by that timezone (not whether it's currently in use, just whether DST is considered at some point in the year for that timezone)
Reason:
I need to build a web form and match the user's time zone information (which I can generate from JavaScript) to the correct time zone stored in the MySQL DB. I can find the UTC offset and get a DST flag from JavaScript functions.
Try this query. The offsettime column is the Offset converted to hours (Offset / 60 / 60).
SELECT tzname.`Time_zone_id`, (`Offset` / 60 / 60) AS `offsettime`, `Is_DST`, `Name`, `Transition_type_id`, `Abbreviation`
FROM `time_zone_transition_type` AS `transition`
INNER JOIN `time_zone_name` AS `tzname`
    ON transition.`Time_zone_id` = tzname.`Time_zone_id`
ORDER BY transition.`Offset` ASC;
The results are
501 -12.00000000 0 0 PHOT Pacific/Enderbury
369 -12.00000000 0 0 GMT+12 Etc/GMT+12
513 -12.00000000 0 1 KWAT Pacific/Kwajalein
483 -12.00000000 0 1 KWAT Kwajalein
518 -11.50000000 0 1 NUT Pacific/Niue
496 -11.50000000 0 1 SAMT Pacific/Apia
528 -11.50000000 0 1 SAMT Pacific/Samoa
555 -11.50000000 0 1 SAMT US/Samoa
521 -11.50000000 0 1 SAMT Pacific/Pago_Pago
496 -11.44888889 0 0 LMT Pacific/Apia
528 -11.38000000 0 0 LMT Pacific/Samoa
555 -11.38000000 0 0 LMT US/Samoa
521 -11.38000000 0 0 LMT Pacific/Pago_Pago
518 -11.33333333 0 0 NUT Pacific/Niue
544 -11.00000000 0 3 BST US/Aleutian
163 -11.00000000 0 3 BST America/Nome
518 -11.00000000 0 2 NUT Pacific/Niue
496 -11.00000000 0 2 WST Pacific/Apia
544 -11.00000000 0 0 NST US/Aleutian
163 -11.00000000 0 0 NST America/Nome
528 -11.00000000 0 4 SST Pacific/Samoa
528 -11.00000000 0 3 BST Pacific/Samoa