How to create a multiple-line graph using seaborn and find a rate? - pandas

I need help creating a multiple-line graph from the DataFrame below:
num user_id first_result second_result result date point1 point2 point3 point4
0 0 1480R clear clear pass 9/19/2016 clear consider clear consider
1 1 419M consider consider fail 5/18/2016 consider consider clear clear
2 2 416N consider consider fail 11/15/2016 consider consider consider consider
3 3 1913I consider consider fail 11/25/2016 consider consider consider clear
4 4 1938T clear clear pass 8/1/2016 clear consider clear clear
5 5 1530C clear clear pass 6/22/2016 clear clear consider clear
6 6 1075L consider consider fail 9/13/2016 consider consider clear consider
7 7 1466N consider clear fail 6/21/2016 consider clear clear consider
8 8 662V consider consider fail 11/1/2016 consider consider clear consider
9 9 1187Y consider consider fail 9/13/2016 consider consider clear clear
10 10 138T consider consider fail 9/19/2016 consider clear consider consider
11 11 1461Z consider clear fail 7/18/2016 consider consider clear consider
12 12 807N consider clear fail 8/16/2016 consider consider clear clear
13 13 416Y consider consider fail 10/2/2016 consider clear clear clear
14 14 638A consider clear fail 6/21/2016 consider clear consider clear
Data file link: data.xlsx, or the data as a dict:
data = {'num': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 7,
8: 8,
9: 9,
10: 10,
11: 11,
12: 12,
13: 13,
14: 14},
'user_id': {0: '1480R',
1: '419M',
2: '416N',
3: '1913I',
4: '1938T',
5: '1530C',
6: '1075L',
7: '1466N',
8: '662V',
9: '1187Y',
10: '138T',
11: '1461Z',
12: '807N',
13: '416Y',
14: '638A'},
'first_result': {0: 'clear',
1: 'consider',
2: 'consider',
3: 'consider',
4: 'clear',
5: 'clear',
6: 'consider',
7: 'consider',
8: 'consider',
9: 'consider',
10: 'consider',
11: 'consider',
12: 'consider',
13: 'consider',
14: 'consider'},
'second_result': {0: 'clear',
1: 'consider',
2: 'consider',
3: 'consider',
4: 'clear',
5: 'clear',
6: 'consider',
7: 'clear',
8: 'consider',
9: 'consider',
10: 'consider',
11: 'clear',
12: 'clear',
13: 'consider',
14: 'clear'},
'result': {0: 'pass',
1: 'fail',
2: 'fail',
3: 'fail',
4: 'pass',
5: 'pass',
6: 'fail',
7: 'fail',
8: 'fail',
9: 'fail',
10: 'fail',
11: 'fail',
12: 'fail',
13: 'fail',
14: 'fail'},
'date': {0: '9/19/2016',
1: '5/18/2016',
2: '11/15/2016',
3: '11/25/2016',
4: '8/1/2016',
5: '6/22/2016',
6: '9/13/2016',
7: '6/21/2016',
8: '11/1/2016',
9: '9/13/2016',
10: '9/19/2016',
11: '7/18/2016',
12: '8/16/2016',
13: '10/2/2016',
14: '6/21/2016'},
'point1': {0: 'clear',
1: 'consider',
2: 'consider',
3: 'consider',
4: 'clear',
5: 'clear',
6: 'consider',
7: 'consider',
8: 'consider',
9: 'consider',
10: 'consider',
11: 'consider',
12: 'consider',
13: 'consider',
14: 'consider'},
'point2': {0: 'consider',
1: 'consider',
2: 'consider',
3: 'consider',
4: 'consider',
5: 'clear',
6: 'consider',
7: 'clear',
8: 'consider',
9: 'consider',
10: 'clear',
11: 'consider',
12: 'consider',
13: 'clear',
14: 'clear'},
'point3': {0: 'clear',
1: 'clear',
2: 'consider',
3: 'consider',
4: 'clear',
5: 'consider',
6: 'clear',
7: 'clear',
8: 'clear',
9: 'clear',
10: 'consider',
11: 'clear',
12: 'clear',
13: 'clear',
14: 'consider'},
'point4': {0: 'consider',
1: 'clear',
2: 'consider',
3: 'clear',
4: 'clear',
5: 'clear',
6: 'consider',
7: 'consider',
8: 'consider',
9: 'clear',
10: 'consider',
11: 'consider',
12: 'clear',
13: 'clear',
14: 'clear'}
}
I need to create a bar graph and a line graph. I have created the bar graph using point1, where x = consider, clear and y = the counts of consider and clear,
but I have no idea how to create a line graph for this scenario:
x = date
y = pass rate (%)
Pass rate = clear / (consider + clear)
Graph the rate for first_result, second_result, and result, all on the same graph,
and the graph should look like the image below.
Please comment or answer on how I can do it. Even an idea of how to group the dates and compute the ratio would be great.

Here's my idea of how to do it:
# first convert all `clear`, `consider` to 1,0
tmp_df = df[['first_result', 'second_result']].apply(lambda x: x.eq('clear').astype(int))
# convert `pass`, `fail` to 1,0
tmp_df['result'] = df.result.eq('pass').astype(int)
# copy the date
tmp_df['date'] = df['date']
# groupby and compute mean, i.e. number_pass/total_count
tmp_df = tmp_df.groupby('date').mean()
tmp_df.plot()
Output for this dataset
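Since the question specifically mentions seaborn, here is a minimal sketch of the same idea with sns.lineplot, assuming the tmp_df built above and that seaborn/matplotlib are installed; the per-date rates are melted into long form first, and the dates are parsed so the x-axis is chronological:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# one row per (date, metric, rate) so hue can split the lines
long_df = tmp_df.reset_index().melt(id_vars='date', var_name='metric', value_name='rate')
long_df['date'] = pd.to_datetime(long_df['date'])  # keep the x-axis in date order
sns.lineplot(data=long_df, x='date', y='rate', hue='metric')
plt.ylabel('pass rate')
plt.show()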

Related

How to Melt a column into another melted column within Pandas?

I have a file consisting of sales for different items. I have a column of model predictions named 'output' and a column named 'model' indicating which model produced those predictions. I need to take the current production model's predictions, which are in a column named 'FCAST_QTY', and make them part of the 'output' column, with the string 'FCAST_QTY' being put into the 'model' column. So essentially, I am melting the 'FCAST_QTY' column into the 'output' and 'model' columns, so the current production model is in the same columns as the multiple models that are in development. This will make it easier to compare/contrast. I'm not sure how to do this. Example data below.
import pandas as pd
from pandas import Timestamp
sales_dict = {'MO_YR': {0: Timestamp('2021-01-01 00:00:00'),
1: Timestamp('2021-02-01 00:00:00'),
2: Timestamp('2021-03-01 00:00:00'),
3: Timestamp('2021-04-01 00:00:00'),
4: Timestamp('2021-05-01 00:00:00'),
5: Timestamp('2021-06-01 00:00:00'),
6: Timestamp('2021-07-01 00:00:00'),
7: Timestamp('2021-08-01 00:00:00'),
8: Timestamp('2021-09-01 00:00:00'),
9: Timestamp('2021-10-01 00:00:00'),
10: Timestamp('2021-11-01 00:00:00'),
11: Timestamp('2021-12-01 00:00:00'),
12: Timestamp('2021-01-01 00:00:00'),
13: Timestamp('2021-02-01 00:00:00'),
14: Timestamp('2021-03-01 00:00:00')},
'ITEM_BASE': {0: '289461K',
1: '289461K',
2: '289461K',
3: '289461K',
4: '289461K',
5: '289461K',
6: '289461K',
7: '289461K',
8: '289461K',
9: '289461K',
10: '289461K',
11: '289461K',
12: '400520J',
13: '400520J',
14: '400520J'},
'eaches': {0: 2592,
1: 3844,
2: 759,
3: 825,
4: 663,
5: 2025,
6: 471,
7: 1160,
8: 5987,
9: 679,
10: 469,
11: 907,
12: 64,
13: 48,
14: 63},
'FCAST_QTY': {0: 2800.0,
1: 5200.0,
2: 550.0,
3: 475.0,
4: 575.0,
5: 475.0,
6: 650.0,
7: 550.0,
8: 7900.0,
9: 1187.0,
10: 1187.0,
11: 1900.0,
12: 51.0,
13: 55.0,
14: 59.0},
'log_eaches': {0: 7.860185057472165,
1: 8.254268770090183,
2: 6.63200177739563,
3: 6.715383386334682,
4: 6.496774990185863,
5: 7.61332497954064,
6: 6.154858094016418,
7: 7.05617528410041,
8: 8.697345730925353,
9: 6.520621127558696,
10: 6.15060276844628,
11: 6.810142450115136,
12: 4.158883083359672,
13: 3.871201010907891,
14: 4.143134726391533},
'output': {0: 8.646015798513993,
1: 8.378197900630752,
2: 7.045235414873291,
3: 5.117058321275769,
4: 9.082928370640056,
5: 5.225648643174155,
6: 7.446383013291042,
7: 6.307484284901181,
8: 7.752673979530179,
9: 9.02189934080111,
10: 4.677594714421006,
11: 7.218749101888444,
12: 4.04018241973268,
13: 3.940978322900716,
14: 3.962359464699719},
'model': {0: 'LR_output',
1: 'LR_output',
2: 'LR_output',
3: 'LR_output',
4: 'LR_output',
5: 'LR_output',
6: 'LR_output',
7: 'LR_output',
8: 'LR_output',
9: 'LR_output',
10: 'LR_output',
11: 'LR_output',
12: 'AR(12)',
13: 'AR(12)',
14: 'AR(12)'}}
df = pd.DataFrame.from_dict(sales_dict)
Expected Output Added:
expected_dict = {'MO_YR': {0: Timestamp('2021-01-01 00:00:00'),
1: Timestamp('2021-02-01 00:00:00'),
2: Timestamp('2021-03-01 00:00:00'),
3: Timestamp('2021-04-01 00:00:00'),
4: Timestamp('2021-05-01 00:00:00'),
5: Timestamp('2021-06-01 00:00:00'),
6: Timestamp('2021-07-01 00:00:00'),
7: Timestamp('2021-08-01 00:00:00'),
8: Timestamp('2021-09-01 00:00:00'),
9: Timestamp('2021-10-01 00:00:00'),
10: Timestamp('2021-11-01 00:00:00'),
11: Timestamp('2021-12-01 00:00:00'),
12: Timestamp('2021-01-01 00:00:00'),
13: Timestamp('2021-02-01 00:00:00'),
14: Timestamp('2021-03-01 00:00:00'),
15: Timestamp('2021-01-01 00:00:00'),
16: Timestamp('2021-02-01 00:00:00'),
17: Timestamp('2021-03-01 00:00:00'),
18: Timestamp('2021-04-01 00:00:00'),
19: Timestamp('2021-05-01 00:00:00'),
20: Timestamp('2021-06-01 00:00:00'),
21: Timestamp('2021-07-01 00:00:00'),
22: Timestamp('2021-08-01 00:00:00'),
23: Timestamp('2021-09-01 00:00:00'),
24: Timestamp('2021-10-01 00:00:00'),
25: Timestamp('2021-11-01 00:00:00'),
26: Timestamp('2021-12-01 00:00:00'),
27: Timestamp('2021-01-01 00:00:00'),
28: Timestamp('2021-02-01 00:00:00'),
29: Timestamp('2021-03-01 00:00:00')},
'ITEM_BASE': {0: '289461K',
1: '289461K',
2: '289461K',
3: '289461K',
4: '289461K',
5: '289461K',
6: '289461K',
7: '289461K',
8: '289461K',
9: '289461K',
10: '289461K',
11: '289461K',
12: '400520J',
13: '400520J',
14: '400520J',
15: '289461K',
16: '289461K',
17: '289461K',
18: '289461K',
19: '289461K',
20: '289461K',
21: '289461K',
22: '289461K',
23: '289461K',
24: '289461K',
25: '289461K',
26: '289461K',
27: '400520J',
28: '400520J',
29: '400520J'},
'eaches': {0: 2592,
1: 3844,
2: 759,
3: 825,
4: 663,
5: 2025,
6: 471,
7: 1160,
8: 5987,
9: 679,
10: 469,
11: 907,
12: 64,
13: 48,
14: 63,
15: 2592,
16: 3844,
17: 759,
18: 825,
19: 663,
20: 2025,
21: 471,
22: 1160,
23: 5987,
24: 679,
25: 469,
26: 907,
27: 64,
28: 48,
29: 63},
'log_eaches': {0: 7.860185057472165,
1: 8.254268770090183,
2: 6.63200177739563,
3: 6.715383386334682,
4: 6.496774990185863,
5: 7.61332497954064,
6: 6.154858094016418,
7: 7.05617528410041,
8: 8.697345730925353,
9: 6.520621127558696,
10: 6.15060276844628,
11: 6.810142450115136,
12: 4.158883083359672,
13: 3.871201010907891,
14: 4.143134726391533,
15: 7.860185057472165,
16: 8.254268770090183,
17: 6.63200177739563,
18: 6.715383386334682,
19: 6.496774990185863,
20: 7.61332497954064,
21: 6.154858094016418,
22: 7.05617528410041,
23: 8.697345730925353,
24: 6.520621127558696,
25: 6.15060276844628,
26: 6.810142450115136,
27: 4.158883083359672,
28: 3.871201010907891,
29: 4.143134726391533,},
'output': {0: 8.646015798513993,
1: 8.378197900630752,
2: 7.045235414873291,
3: 5.117058321275769,
4: 9.082928370640056,
5: 5.225648643174155,
6: 7.446383013291042,
7: 6.307484284901181,
8: 7.752673979530179,
9: 9.02189934080111,
10: 4.677594714421006,
11: 7.218749101888444,
12: 4.04018241973268,
13: 3.940978322900716,
14: 3.962359464699719,
15: 2800.0,
16: 5200.0,
17: 550.0,
18: 475.0,
19: 575.0,
20: 475.0,
21: 650.0,
22: 550.0,
23: 7900.0,
24: 1187.0,
25: 1187.0,
26: 1900.0,
27: 51.0,
28: 55.0,
29: 59.0},
'model': {0: 'LR_output',
1: 'LR_output',
2: 'LR_output',
3: 'LR_output',
4: 'LR_output',
5: 'LR_output',
6: 'LR_output',
7: 'LR_output',
8: 'LR_output',
9: 'LR_output',
10: 'LR_output',
11: 'LR_output',
12: 'AR(12)',
13: 'AR(12)',
14: 'AR(12)',
15:'FCAST_QTY',
16:'FCAST_QTY',
17:'FCAST_QTY',
18:'FCAST_QTY',
19:'FCAST_QTY',
20:'FCAST_QTY',
21:'FCAST_QTY',
22:'FCAST_QTY',
23:'FCAST_QTY',
24:'FCAST_QTY',
25:'FCAST_QTY',
26:'FCAST_QTY',
27:'FCAST_QTY',
28:'FCAST_QTY',
29:'FCAST_QTY'}}
expected_df = pd.DataFrame.from_dict(expected_dict)
Create a new dataframe with your logic and append it to the original dataframe:
fcast_qty = (df
             .drop(columns=['output', 'model'])
             .rename(columns={'FCAST_QTY': 'output'})
             .assign(model='FCAST_QTY')
             )
pd.concat([df.drop(columns='FCAST_QTY'), fcast_qty], ignore_index=True)
MO_YR ITEM_BASE eaches log_eaches output model
0 2021-01-01 289461K 2592 7.860185 8.646016 LR_output
1 2021-02-01 289461K 3844 8.254269 8.378198 LR_output
2 2021-03-01 289461K 759 6.632002 7.045235 LR_output
3 2021-04-01 289461K 825 6.715383 5.117058 LR_output
4 2021-05-01 289461K 663 6.496775 9.082928 LR_output
5 2021-06-01 289461K 2025 7.613325 5.225649 LR_output
6 2021-07-01 289461K 471 6.154858 7.446383 LR_output
7 2021-08-01 289461K 1160 7.056175 6.307484 LR_output
8 2021-09-01 289461K 5987 8.697346 7.752674 LR_output
9 2021-10-01 289461K 679 6.520621 9.021899 LR_output
10 2021-11-01 289461K 469 6.150603 4.677595 LR_output
11 2021-12-01 289461K 907 6.810142 7.218749 LR_output
12 2021-01-01 400520J 64 4.158883 4.040182 AR(12)
13 2021-02-01 400520J 48 3.871201 3.940978 AR(12)
14 2021-03-01 400520J 63 4.143135 3.962359 AR(12)
15 2021-01-01 289461K 2592 7.860185 2800.000000 FCAST_QTY
16 2021-02-01 289461K 3844 8.254269 5200.000000 FCAST_QTY
17 2021-03-01 289461K 759 6.632002 550.000000 FCAST_QTY
18 2021-04-01 289461K 825 6.715383 475.000000 FCAST_QTY
19 2021-05-01 289461K 663 6.496775 575.000000 FCAST_QTY
20 2021-06-01 289461K 2025 7.613325 475.000000 FCAST_QTY
21 2021-07-01 289461K 471 6.154858 650.000000 FCAST_QTY
22 2021-08-01 289461K 1160 7.056175 550.000000 FCAST_QTY
23 2021-09-01 289461K 5987 8.697346 7900.000000 FCAST_QTY
24 2021-10-01 289461K 679 6.520621 1187.000000 FCAST_QTY
25 2021-11-01 289461K 469 6.150603 1187.000000 FCAST_QTY
26 2021-12-01 289461K 907 6.810142 1900.000000 FCAST_QTY
27 2021-01-01 400520J 64 4.158883 51.000000 FCAST_QTY
28 2021-02-01 400520J 48 3.871201 55.000000 FCAST_QTY
29 2021-03-01 400520J 63 4.143135 59.000000 FCAST_QTY
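As the title asks about melt: the same reshape can be sketched with DataFrame.melt, assuming the original df from the question. melt stacks 'output' and 'FCAST_QTY' into a single value column and records each value's source column, which then only needs to be folded into the model labels:
melted = df.melt(
    id_vars=['MO_YR', 'ITEM_BASE', 'eaches', 'log_eaches', 'model'],
    value_vars=['output', 'FCAST_QTY'],
    var_name='source', value_name='value')
# rows that came from FCAST_QTY get 'FCAST_QTY' as their model label
melted['model'] = melted['model'].where(melted['source'] == 'output', 'FCAST_QTY')
melted = melted.drop(columns='source').rename(columns={'value': 'output'})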

Using the shift function along with the max function in Pandas

I am attempting to create a technical indicator ('Supertrend') using Pandas. The formula for this column is recursive.
(For people familiar with Pinescript, this column will replicate the result of this Pinescript function):
df['st_trendup'] = np.select(df['Close'].shift() > df['st_trendup'].shift(),df[['st_up','st_trendup'.shift()]].max(axis=1),df['st_up'])
The problem occurs in the true part of the np.select() because I cannot call .shift() on a string.
Normally, I would make a new column that uses .shift() beforehand, but since this is recursive, I have to do it all in one line.
If possible, I'd like to avoid using loops for speed; I prefer solutions using native pandas or numpy functions.
What I am looking for:
A way to apply the max function that can accommodate a .shift() call.
Columns that are used:
def tr(high, low, close1):
    return max(high - low, abs(high - close1), abs(low - close1))

df['st_closeprev'] = df['Close'].shift()
df['st_hl2'] = (df['High'] + df['Low']) / 2
df['st_tr'] = df.apply(lambda row: tr(row['High'], row['Low'], row['st_closeprev']), axis=1)
df['st_atr'] = df['st_tr'].ewm(alpha=1/pd, adjust=False, min_periods=pd).mean()  # here pd is the ATR period length (an int), not the pandas module
df['st_up'] = df['st_hl2'] - factor * df['st_atr']  # factor is the Supertrend multiplier
df['st_dn'] = df['st_hl2'] + factor * df['st_atr']
df['st_trendup'] = np.select(df['Close'].shift() > df['st_trendup'].shift(),df[['st_up','st_trendup'.shift()]].max(axis=1),df['st_up'])
Sample data, obtained via df.to_dict():
{'Date': {0: Timestamp('2021-01-01 09:15:00'),
1: Timestamp('2021-01-01 09:30:00'),
2: Timestamp('2021-01-01 09:45:00'),
3: Timestamp('2021-01-01 10:00:00'),
4: Timestamp('2021-01-01 10:15:00'),
5: Timestamp('2021-01-01 10:30:00'),
6: Timestamp('2021-01-01 10:45:00'),
7: Timestamp('2021-01-01 11:00:00'),
8: Timestamp('2021-01-01 11:15:00'),
9: Timestamp('2021-01-01 11:30:00'),
10: Timestamp('2021-01-01 11:45:00'),
11: Timestamp('2021-01-01 12:00:00'),
12: Timestamp('2021-01-01 12:15:00'),
13: Timestamp('2021-01-01 12:30:00'),
14: Timestamp('2021-01-01 12:45:00'),
15: Timestamp('2021-01-01 13:00:00'),
16: Timestamp('2021-01-01 13:15:00'),
17: Timestamp('2021-01-01 13:30:00'),
18: Timestamp('2021-01-01 13:45:00'),
19: Timestamp('2021-01-01 14:00:00'),
20: Timestamp('2021-01-01 14:15:00'),
21: Timestamp('2021-01-01 14:30:00'),
22: Timestamp('2021-01-01 14:45:00'),
23: Timestamp('2021-01-01 15:00:00'),
24: Timestamp('2021-01-01 15:15:00'),
25: Timestamp('2021-01-04 09:15:00')},
'Open': {0: 31250.0,
1: 31376.0,
2: 31405.0,
3: 31389.4,
4: 31377.5,
5: 31347.8,
6: 31310.8,
7: 31343.4,
8: 31349.5,
9: 31349.9,
10: 31325.1,
11: 31310.9,
12: 31329.0,
13: 31376.0,
14: 31375.5,
15: 31357.4,
16: 31325.0,
17: 31341.1,
18: 31300.0,
19: 31324.5,
20: 31353.3,
21: 31350.0,
22: 31346.9,
23: 31330.0,
24: 31314.3,
25: 31450.2},
'High': {0: 31407.0,
1: 31425.0,
2: 31411.95,
3: 31389.45,
4: 31382.0,
5: 31350.0,
6: 31354.6,
7: 31359.0,
8: 31370.0,
9: 31364.7,
10: 31350.0,
11: 31337.9,
12: 31378.9,
13: 31419.5,
14: 31377.75,
15: 31360.0,
16: 31367.15,
17: 31345.2,
18: 31340.0,
19: 31367.0,
20: 31375.0,
21: 31370.0,
22: 31350.0,
23: 31334.6,
24: 31329.6,
25: 31599.0},
'Low': {0: 31250.0,
1: 31367.95,
2: 31352.5,
3: 31331.65,
4: 31301.4,
5: 31303.05,
6: 31310.0,
7: 31325.05,
8: 31335.35,
9: 31315.35,
10: 31281.9,
11: 31292.0,
12: 31316.25,
13: 31352.05,
14: 31335.0,
15: 31322.0,
16: 31318.25,
17: 31261.55,
18: 31283.3,
19: 31324.5,
20: 31322.0,
21: 31332.15,
22: 31324.1,
23: 31300.15,
24: 31280.0,
25: 31430.0},
'Close': {0: 31375.0,
1: 31398.3,
2: 31386.0,
3: 31377.0,
4: 31342.3,
5: 31311.7,
6: 31345.0,
7: 31349.0,
8: 31344.2,
9: 31327.6,
10: 31311.3,
11: 31325.6,
12: 31373.0,
13: 31375.0,
14: 31357.4,
15: 31326.0,
16: 31345.9,
17: 31300.6,
18: 31324.4,
19: 31353.8,
20: 31345.6,
21: 31341.6,
22: 31332.5,
23: 31311.0,
24: 31285.0,
25: 31558.4},
'Volume': {0: 259952,
1: 163775,
2: 105900,
3: 99725,
4: 115175,
5: 78625,
6: 67675,
7: 46575,
8: 53350,
9: 54175,
10: 96975,
11: 80925,
12: 79475,
13: 147775,
14: 38900,
15: 64925,
16: 52425,
17: 142175,
18: 81800,
19: 74950,
20: 68550,
21: 40350,
22: 47150,
23: 119200,
24: 222875,
25: 524625}}
Change:
df[['st_up','st_trendup'.shift()]].max(axis=1)
to:
df[['st_up','st_trendup']].assign(st_trendup = df['st_trendup'].shift()).max(axis=1)
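A minimal sketch of why this works, with made-up numbers: assign overwrites st_trendup with its shifted version inside the temporary two-column frame, so max(axis=1) compares st_up against the previous row's st_trendup (the NaN in the first row is skipped):
import pandas as pd

tmp = pd.DataFrame({'st_up': [1.0, 3.0, 2.0], 'st_trendup': [5.0, 0.0, 4.0]})
# shifted st_trendup is [NaN, 5.0, 0.0]; row-wise max skips the NaN
print(tmp[['st_up', 'st_trendup']]
      .assign(st_trendup=tmp['st_trendup'].shift())
      .max(axis=1))
# 0    1.0
# 1    5.0
# 2    2.0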

median imputation by groups in pandas (handling group medians that are NaN)

I have the following DataFrame train:
import pandas as pd
from numpy import nan

train = {'NAME_EDUCATION_TYPE': {5: 'Secondary / secondary special',
6: 'Higher education',
7: 'Higher education',
8: 'Secondary / secondary special',
9: 'Secondary / secondary special',
10: 'Higher education',
11: 'Secondary / secondary special',
12: 'Secondary / secondary special',
13: 'Secondary / secondary special',
14: 'Secondary / secondary special'},
'OCCUPATION_TYPE': {5: 'Laborers',
6: 'Accountants',
7: 'Managers',
8: nan,
9: 'Laborers',
10: 'Core staff',
11: nan,
12: 'Laborers',
13: 'Drivers',
14: 'Laborers'},
'AGE_GROUP': {5: '45-60',
6: '21-45',
7: '45-60',
8: '45-60',
9: '21-45',
10: '21-45',
11: '45-60',
12: '21-45',
13: '21-45',
14: '21-45'},
'DAYS_EMPLOYED': {5: -1588.0,
6: -3130.0,
7: -449.0,
8: nan,
9: -2019.0,
10: -679.0,
11: nan,
12: -2717.0,
13: -3028.0,
14: -203.0},
'DAYS_EMPLOYED_ANOM': {5: False,
6: False,
7: False,
8: True,
9: False,
10: False,
11: True,
12: False,
13: False,
14: False},
'DAYS_LAST_PHONE_CHANGE': {5: -2536.0,
6: -1562.0,
7: -1070.0,
8: 0.0,
9: -1673.0,
10: -844.0,
11: -2396.0,
12: -2370.0,
13: -4.0,
14: -188.0}}
train = pd.DataFrame(train)  # the dict above, as the DataFrame the question refers to
I have a few NaN in the column DAYS_EMPLOYED. They are flagged as True in the column DAYS_EMPLOYED_ANOM.
I want to impute these NaN using the median of DAYS_EMPLOYED by the following group of columns: NAME_EDUCATION_TYPE, OCCUPATION_TYPE and AGE_GROUP.
I believe this can be done in a few lines in pandas, but I could not figure it out. I have tried to apply the following code, which I found in an example of mean imputation for a Series, but the NaN values do not get imputed.
fill_median = lambda g: g.fillna(g.median())
train.loc[train['DAYS_EMPLOYED_ANOM'] == True, 'DAYS_EMPLOYED'] = train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'])['DAYS_EMPLOYED'].apply(fill_median)
I also tried to apply the code from this post without success:
How can I impute values to outlier cells based on groups?
You could do:
train['DAYS_EMPLOYED'] = (train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'],
                                        dropna=False)
                          ['DAYS_EMPLOYED']
                          .apply(lambda x: x.fillna(x.median()))
                          )
However, note that this will not work on your particular dataset, as you need at least one non-NaN value per group to be able to calculate the median.
You could use the population median instead:
train['DAYS_EMPLOYED'] = (train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'],
                                        dropna=False)
                          ['DAYS_EMPLOYED']
                          .apply(lambda x: x.fillna(train['DAYS_EMPLOYED'].median()))
                          )
Here is a hybrid approach that tries the group median and falls back to the population one otherwise:
import numpy as np

def median(s):
    m = s.median()
    if np.isnan(m):
        m = train['DAYS_EMPLOYED'].median()
    return m

train['DAYS_EMPLOYED'] = (train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'],
                                        dropna=False)
                          ['DAYS_EMPLOYED']
                          .apply(lambda x: x.fillna(median(x)))
                          )
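An equivalent sketch without apply, assuming the same train DataFrame: transform('median') broadcasts each group's median back to the original row positions, so two chained fillna calls cover the group median and the population fallback:
group_median = (train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'],
                              dropna=False)['DAYS_EMPLOYED']
                .transform('median'))
# fill with the group median first, then the overall median for all-NaN groups
train['DAYS_EMPLOYED'] = (train['DAYS_EMPLOYED']
                          .fillna(group_median)
                          .fillna(train['DAYS_EMPLOYED'].median()))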

Matching Buy/Sell entries from two dataframes and creating a new one (Python 3.8 / W10)

Python / Pandas.
Matching Buy and Sell entries row by row.
BuyDF and SellDF are obtained from one Excel file and sorted by ascending Time (column L).
The image shows how the matching has to be done.
Match Buy and Sell entries by Name, following the first-in-first-out method.
Take the very first entry (Name AAA) from BuyDF, match it with the topmost entry (Name AAA) from SellDF, move the matching row from SellDF in front of the corresponding row of BuyDF, and delete that row from SellDF.
Go back to the second entry of BuyDF, match it against SellDF, move the matching row from SellDF in front of the corresponding row of BuyDF, and delete it from SellDF ... and so on.
For names which do not match, leave the matching rows blank.
The ascending order (Time / column L) should not be changed, to maintain first in, first out.
I tried using merge but it didn't work for me.
How do I proceed?
BuyDF
{'Date': {0: '2019-04-01', 1: '2019-04-01', 2: '2019-04-01', 3: '2019-04-01', 4: '2019-04-02', 5: '2019-04-02', 6: '2019-04-02', 7: '2019-04-02', 8: '2019-04-05'}, 'Name': {0: 'AAA', 1: 'AAA', 2: 'AAA', 3: 'AAA', 4: 'BBB', 5: 'CCC', 6: 'CCC', 7: 'BBB', 8: 'AAA'}, 'Ref': {0: 1, 1: 1, 2: 1, 3: 1, 4: 5, 5: 7, 6: 7, 7: 6, 8: 1}, 'Seg': {0: 'S', 1: 'S', 2: 'S', 3: 'S', 4: 'L', 5: 'XL', 6: 'XL', 7: 'L', 8: 'S'}, 'Trans': {0: 'buy', 1: 'buy', 2: 'buy', 3: 'buy', 4: 'buy', 5: 'buy', 6: 'buy', 7: 'buy', 8: 'buy'}, 'Qty': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}, 'Price': {0: 225, 1: 225, 2: 225, 3: 225, 4: 210, 5: 210, 6: 210, 7: 210, 8: 225}, 'Order ID': {0: 8249, 1: 111, 2: 654, 3: 111, 4: 888, 5: 444, 6: 444, 7: 888, 8: 111}, 'Trade ID': {0: 1010, 1: 1010, 2: 1010, 3: 1010, 4: 4645, 5: 132, 6: 132, 7: 4700, 8: 1010}, 'Time': {0: '2019-04-01 11:05:18', 1: '2019-04-01 13:05:18', 2: '2019-04-01 13:05:18', 3: '2019-04-01 13:05:59', 4: '2019-04-02 13:20:05', 5: '2019-04-02 13:35:02', 6: '2019-04-02 13:35:02', 7: '2019-04-02 14:20:12', 8: '2019-04-05 13:05:18'}}
SellDF
{'Date': {5: '2019-04-01', 6: '2019-04-02', 7: '2019-04-02', 8: '2019-04-02', 13: '2019-04-03', 14: '2019-04-05', 15: '2019-04-05'}, 'Name': {5: 'AAA', 6: 'BBB', 7: 'BBB', 8: 'BBB', 13: 'DDD', 14: 'AAA', 15: 'AAA'}, 'Ref': {5: 3, 6: 2, 7: 2, 8: 2, 13: 8, 14: 4, 15: 4}, 'Seg': {5: 'L', 6: 'X', 7: 'X', 8: 'X', 13: 'XS', 14: 'L', 15: 'L'}, 'Trans': {5: 'sell', 6: 'sell', 7: 'sell', 8: 'sell', 13: 'sell', 14: 'sell', 15: 'sell'}, 'Qty': {5: 1, 6: 1, 7: 1, 8: 1, 13: 1, 14: 1, 15: 1}, 'Price': {5: 210, 6: 210, 7: 210, 8: 210, 13: 210, 14: 210, 15: 210}, 'Order ID': {5: 555, 6: 222, 7: 222, 8: 222, 13: 999, 14: 555, 15: 555}, 'Trade ID': {5: 1640, 6: 1532, 7: 1532, 8: 1532, 13: 14623, 14: 1645, 15: 1645}, 'Time': {5: '2019-04-01 14:13:40', 6: '2019-04-02 13:10:32', 7: '2019-04-02 13:10:32', 8: '2019-04-02 13:10:32', 13: '2019-04-03 15:25:50', 14: '2019-04-05 14:41:45', 15: '2019-04-05 14:41:45'}}
Image posted for ease of understanding.
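A sketch of one way to do the FIFO matching, assuming BuyDF and SellDF are built from the dicts above (the buy_dict / sell_dict names are hypothetical): number the occurrences of each Name on both sides with groupby().cumcount(), then left-merge on (Name, occurrence), so each buy lines up with the n-th sell of the same name and unmatched buys get blank (NaN) sell columns:
import pandas as pd

buy = pd.DataFrame(buy_dict)    # the BuyDF dict shown above
sell = pd.DataFrame(sell_dict)  # the SellDF dict shown above

# the n-th buy of each Name matches the n-th sell of the same Name (FIFO),
# since both frames are already sorted by ascending Time
buy['occ'] = buy.groupby('Name').cumcount()
sell['occ'] = sell.groupby('Name').cumcount()

matched = buy.merge(sell, on=['Name', 'occ'], how='left',
                    suffixes=('_buy', '_sell')).drop(columns='occ')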

Creating a Dropdown menu in Plotly from Pandas

I've had a look at the following link, but it's not very clear: https://plot.ly/pandas/dropdowns/.
I have the following figure generated in plotly, but would like a dropdown menu (of A, B and C) to select and display only the respective line.
import pandas as pd
import plotly
plotly.offline.init_notebook_mode()
import plotly.offline as py
from plotly.graph_objs import *
df = pd.DataFrame({'freq': {0: 0.01, 1: 0.02, 2: 0.029999999999999999, 3: 0.040000000000000001, 4: 0.050000000000000003, 5: 0.059999999999999998, 6: 0.070000000000000007, 7: 0.080000000000000002, 8: 0.089999999999999997, 9: 0.10000000000000001, 10: 0.01, 11: 0.02, 12: 0.029999999999999999, 13: 0.040000000000000001, 14: 0.050000000000000003, 15: 0.059999999999999998, 16: 0.070000000000000007, 17: 0.080000000000000002, 18: 0.089999999999999997, 19: 0.10000000000000001, 20: 0.01, 21: 0.02, 22: 0.029999999999999999, 23: 0.040000000000000001, 24: 0.050000000000000003, 25: 0.059999999999999998, 26: 0.070000000000000007, 27: 0.080000000000000002, 28: 0.089999999999999997, 29: 0.10000000000000001}, 'kit': {0: 'B', 1: 'B', 2: 'B', 3: 'B', 4: 'B', 5: 'B', 6: 'B', 7: 'B', 8: 'B', 9: 'B', 10: 'A', 11: 'A', 12: 'A', 13: 'A', 14: 'A', 15: 'A', 16: 'A', 17: 'A', 18: 'A', 19: 'A', 20: 'C', 21: 'C', 22: 'C', 23: 'C', 24: 'C', 25: 'C', 26: 'C', 27: 'C', 28: 'C', 29: 'C'}, 'SNS': {0: 91.198979591799997, 1: 90.263605442199989, 2: 88.818027210899999, 3: 85.671768707499993, 4: 76.23299319729999, 5: 61.0969387755, 6: 45.1530612245, 7: 36.267006802700003, 8: 33.0782312925, 9: 30.739795918400002, 10: 90.646258503400006, 11: 90.306122449, 12: 90.178571428600009, 13: 89.498299319699996, 14: 88.435374149599994, 15: 83.588435374200003, 16: 75.212585034, 17: 60.969387755100001, 18: 47.278911564600001, 19: 37.627551020399999, 20: 90.986394557800011, 21: 90.136054421799997, 22: 89.540816326499993, 23: 88.690476190499993, 24: 86.479591836799997, 25: 82.397959183699996, 26: 73.809523809499993, 27: 63.180272108800004, 28: 50.935374149700003, 29: 41.241496598699996}, 'FPR': {0: 1.0953616823100001, 1: 0.24489252678500001, 2: 0.15106142277199999, 3: 0.104478605177, 4: 0.089172822253300005, 5: 0.079856258734300009, 6: 0.065881413455800009, 7: 0.059892194050699996, 8: 0.059892194050699996, 9: 0.0578957875824, 10: 0.94097291541899997, 11: 0.208291741532, 12: 0.14773407865800001, 13: 0.107805949291, 14: 0.093165635189999998, 15: 0.082518134025399995, 16: 0.074532508152000007, 17: 0.065881413455800009, 18: 0.062554069341799995, 19: 0.061888600519100001, 20: 0.85313103081100006, 21: 0.18899314567100001, 22: 0.14107939043000001, 23: 0.110467824582, 24: 0.099820323417899995, 25: 0.085180009316599997, 26: 0.078525321088700001, 27: 0.073201570506399985, 28: 0.071870632860800004, 29: 0.0705396952153}})
fig = {
'data': [
{
'x': df[df['kit']==kit]['FPR'],
'y': df[df['kit']==kit]['SNS'],
'name': kit,
} for kit in ['A', 'B', 'C']
],
}
py.iplot(fig)
I'm not sure how to do this directly from plotly; however, you can use the interact function from the ipywidgets library.
In your case it will be the following:
from ipywidgets import interact
df = pd.DataFrame({'freq': {0: 0.01, 1: 0.02, 2: 0.029999999999999999, 3: 0.040000000000000001, 4: 0.050000000000000003, 5: 0.059999999999999998, 6: 0.070000000000000007, 7: 0.080000000000000002, 8: 0.089999999999999997, 9: 0.10000000000000001, 10: 0.01, 11: 0.02, 12: 0.029999999999999999, 13: 0.040000000000000001, 14: 0.050000000000000003, 15: 0.059999999999999998, 16: 0.070000000000000007, 17: 0.080000000000000002, 18: 0.089999999999999997, 19: 0.10000000000000001, 20: 0.01, 21: 0.02, 22: 0.029999999999999999, 23: 0.040000000000000001, 24: 0.050000000000000003, 25: 0.059999999999999998, 26: 0.070000000000000007, 27: 0.080000000000000002, 28: 0.089999999999999997, 29: 0.10000000000000001}, 'kit': {0: 'B', 1: 'B', 2: 'B', 3: 'B', 4: 'B', 5: 'B', 6: 'B', 7: 'B', 8: 'B', 9: 'B', 10: 'A', 11: 'A', 12: 'A', 13: 'A', 14: 'A', 15: 'A', 16: 'A', 17: 'A', 18: 'A', 19: 'A', 20: 'C', 21: 'C', 22: 'C', 23: 'C', 24: 'C', 25: 'C', 26: 'C', 27: 'C', 28: 'C', 29: 'C'}, 'SNS': {0: 91.198979591799997, 1: 90.263605442199989, 2: 88.818027210899999, 3: 85.671768707499993, 4: 76.23299319729999, 5: 61.0969387755, 6: 45.1530612245, 7: 36.267006802700003, 8: 33.0782312925, 9: 30.739795918400002, 10: 90.646258503400006, 11: 90.306122449, 12: 90.178571428600009, 13: 89.498299319699996, 14: 88.435374149599994, 15: 83.588435374200003, 16: 75.212585034, 17: 60.969387755100001, 18: 47.278911564600001, 19: 37.627551020399999, 20: 90.986394557800011, 21: 90.136054421799997, 22: 89.540816326499993, 23: 88.690476190499993, 24: 86.479591836799997, 25: 82.397959183699996, 26: 73.809523809499993, 27: 63.180272108800004, 28: 50.935374149700003, 29: 41.241496598699996}, 'FPR': {0: 1.0953616823100001, 1: 0.24489252678500001, 2: 0.15106142277199999, 3: 0.104478605177, 4: 0.089172822253300005, 5: 0.079856258734300009, 6: 0.065881413455800009, 7: 0.059892194050699996, 8: 0.059892194050699996, 9: 0.0578957875824, 10: 0.94097291541899997, 11: 0.208291741532, 12: 0.14773407865800001, 13: 0.107805949291, 14: 0.093165635189999998, 15: 0.082518134025399995, 16: 0.074532508152000007, 17: 0.065881413455800009, 18: 0.062554069341799995, 19: 0.061888600519100001, 20: 0.85313103081100006, 21: 0.18899314567100001, 22: 0.14107939043000001, 23: 0.110467824582, 24: 0.099820323417899995, 25: 0.085180009316599997, 26: 0.078525321088700001, 27: 0.073201570506399985, 28: 0.071870632860800004, 29: 0.0705396952153}})
def plot_it(kit):
    fig = {
        'data': [
            {
                'x': df[df['kit'] == kit]['FPR'],
                'y': df[df['kit'] == kit]['SNS'],
                'name': kit
            }
        ]
    }
    py.iplot(fig)
interact(plot_it, kit=('A', 'B', 'C'))
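For a dropdown inside the figure itself (no notebook widgets), here is a sketch using plotly's updatemenus, assuming the df and imports from the question; each button uses the 'update' method to toggle which trace is visible:
traces = [{'x': df[df['kit'] == kit]['FPR'],
           'y': df[df['kit'] == kit]['SNS'],
           'name': kit,
           'visible': kit == 'A'}  # start with only A shown
          for kit in ['A', 'B', 'C']]

buttons = [{'label': kit,
            'method': 'update',
            'args': [{'visible': [k == kit for k in ['A', 'B', 'C']]}]}
           for kit in ['A', 'B', 'C']]

fig = {'data': traces,
       'layout': {'updatemenus': [{'buttons': buttons}]}}
py.iplot(fig)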