I've had a look at the following link, but it's not very clear: https://plot.ly/pandas/dropdowns/.
I have the following figure generated in plotly, but I would like a dropdown menu (of A, B, and C) to select and display only the respective line.
import pandas as pd
import plotly
plotly.offline.init_notebook_mode()
import plotly.offline as py
from plotly.graph_objs import *
df = pd.DataFrame({'freq': {0: 0.01, 1: 0.02, 2: 0.029999999999999999, 3: 0.040000000000000001, 4: 0.050000000000000003, 5: 0.059999999999999998, 6: 0.070000000000000007, 7: 0.080000000000000002, 8: 0.089999999999999997, 9: 0.10000000000000001, 10: 0.01, 11: 0.02, 12: 0.029999999999999999, 13: 0.040000000000000001, 14: 0.050000000000000003, 15: 0.059999999999999998, 16: 0.070000000000000007, 17: 0.080000000000000002, 18: 0.089999999999999997, 19: 0.10000000000000001, 20: 0.01, 21: 0.02, 22: 0.029999999999999999, 23: 0.040000000000000001, 24: 0.050000000000000003, 25: 0.059999999999999998, 26: 0.070000000000000007, 27: 0.080000000000000002, 28: 0.089999999999999997, 29: 0.10000000000000001}, 'kit': {0: 'B', 1: 'B', 2: 'B', 3: 'B', 4: 'B', 5: 'B', 6: 'B', 7: 'B', 8: 'B', 9: 'B', 10: 'A', 11: 'A', 12: 'A', 13: 'A', 14: 'A', 15: 'A', 16: 'A', 17: 'A', 18: 'A', 19: 'A', 20: 'C', 21: 'C', 22: 'C', 23: 'C', 24: 'C', 25: 'C', 26: 'C', 27: 'C', 28: 'C', 29: 'C'}, 'SNS': {0: 91.198979591799997, 1: 90.263605442199989, 2: 88.818027210899999, 3: 85.671768707499993, 4: 76.23299319729999, 5: 61.0969387755, 6: 45.1530612245, 7: 36.267006802700003, 8: 33.0782312925, 9: 30.739795918400002, 10: 90.646258503400006, 11: 90.306122449, 12: 90.178571428600009, 13: 89.498299319699996, 14: 88.435374149599994, 15: 83.588435374200003, 16: 75.212585034, 17: 60.969387755100001, 18: 47.278911564600001, 19: 37.627551020399999, 20: 90.986394557800011, 21: 90.136054421799997, 22: 89.540816326499993, 23: 88.690476190499993, 24: 86.479591836799997, 25: 82.397959183699996, 26: 73.809523809499993, 27: 63.180272108800004, 28: 50.935374149700003, 29: 41.241496598699996}, 'FPR': {0: 1.0953616823100001, 1: 0.24489252678500001, 2: 0.15106142277199999, 3: 0.104478605177, 4: 0.089172822253300005, 5: 0.079856258734300009, 6: 0.065881413455800009, 7: 0.059892194050699996, 8: 0.059892194050699996, 9: 0.0578957875824, 10: 0.94097291541899997, 11: 0.208291741532, 12: 0.14773407865800001, 13: 0.107805949291, 14: 0.093165635189999998, 15: 0.082518134025399995, 16: 0.074532508152000007, 17: 0.065881413455800009, 18: 0.062554069341799995, 19: 0.061888600519100001, 20: 0.85313103081100006, 21: 0.18899314567100001, 22: 0.14107939043000001, 23: 0.110467824582, 24: 0.099820323417899995, 25: 0.085180009316599997, 26: 0.078525321088700001, 27: 0.073201570506399985, 28: 0.071870632860800004, 29: 0.0705396952153}})
fig = {
    'data': [
        {
            'x': df[df['kit'] == kit]['FPR'],
            'y': df[df['kit'] == kit]['SNS'],
            'name': kit,
        } for kit in ['A', 'B', 'C']
    ],
}
py.iplot(fig)
I'm not sure how to do this directly in plotly; however, you can use the interact function from the ipywidgets library.
In your case it will be the following:
from ipywidgets import interact
# df is the same DataFrame defined in the question above
def plot_it(kit):
    fig = {
        'data': [
            {
                'x': df[df['kit'] == kit]['FPR'],
                'y': df[df['kit'] == kit]['SNS'],
                'name': kit
            }
        ]
    }
    py.iplot(fig)

interact(plot_it, kit=('A', 'B', 'C'))
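For completeness, plotly can also do this natively via layout.updatemenus, which is what the dropdown docs linked in the question build on. Below is a minimal sketch reusing the fig dict from the question; the button layout follows the current plotly docs, so treat it as an assumption for older offline versions:
buttons = [
    {
        'label': kit,
        'method': 'update',
        # visibility mask: show only the trace whose name matches the button
        'args': [{'visible': [k == kit for k in ['A', 'B', 'C']]}],
    }
    for kit in ['A', 'B', 'C']
]
fig['layout'] = {'updatemenus': [{'buttons': buttons}]}
py.iplot(fig)
Each button toggles the visible flag of the three traces, so selecting A hides B and C rather than re-plotting.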
Related
I have a file consisting of sales for different items. I have a column of model predictions named output, and a column named model indicating which model produced those predictions. I need to take the current (in-production) model's predictions, which live in a column named 'FCAST_QTY', and make them part of the output column, with the string 'FCAST_QTY' placed in the model column. So essentially, I want to melt the 'FCAST_QTY' column into the output and model columns, so the current production model sits in the same columns as the multiple models that are in development. This will make it easier to compare/contrast. I'm not sure how to do this. Example data below.
import pandas as pd
from pandas import Timestamp
sales_dict = {'MO_YR': {0: Timestamp('2021-01-01 00:00:00'),
1: Timestamp('2021-02-01 00:00:00'),
2: Timestamp('2021-03-01 00:00:00'),
3: Timestamp('2021-04-01 00:00:00'),
4: Timestamp('2021-05-01 00:00:00'),
5: Timestamp('2021-06-01 00:00:00'),
6: Timestamp('2021-07-01 00:00:00'),
7: Timestamp('2021-08-01 00:00:00'),
8: Timestamp('2021-09-01 00:00:00'),
9: Timestamp('2021-10-01 00:00:00'),
10: Timestamp('2021-11-01 00:00:00'),
11: Timestamp('2021-12-01 00:00:00'),
12: Timestamp('2021-01-01 00:00:00'),
13: Timestamp('2021-02-01 00:00:00'),
14: Timestamp('2021-03-01 00:00:00')},
'ITEM_BASE': {0: '289461K',
1: '289461K',
2: '289461K',
3: '289461K',
4: '289461K',
5: '289461K',
6: '289461K',
7: '289461K',
8: '289461K',
9: '289461K',
10: '289461K',
11: '289461K',
12: '400520J',
13: '400520J',
14: '400520J'},
'eaches': {0: 2592,
1: 3844,
2: 759,
3: 825,
4: 663,
5: 2025,
6: 471,
7: 1160,
8: 5987,
9: 679,
10: 469,
11: 907,
12: 64,
13: 48,
14: 63},
'FCAST_QTY': {0: 2800.0,
1: 5200.0,
2: 550.0,
3: 475.0,
4: 575.0,
5: 475.0,
6: 650.0,
7: 550.0,
8: 7900.0,
9: 1187.0,
10: 1187.0,
11: 1900.0,
12: 51.0,
13: 55.0,
14: 59.0},
'log_eaches': {0: 7.860185057472165,
1: 8.254268770090183,
2: 6.63200177739563,
3: 6.715383386334682,
4: 6.496774990185863,
5: 7.61332497954064,
6: 6.154858094016418,
7: 7.05617528410041,
8: 8.697345730925353,
9: 6.520621127558696,
10: 6.15060276844628,
11: 6.810142450115136,
12: 4.158883083359672,
13: 3.871201010907891,
14: 4.143134726391533},
'output': {0: 8.646015798513993,
1: 8.378197900630752,
2: 7.045235414873291,
3: 5.117058321275769,
4: 9.082928370640056,
5: 5.225648643174155,
6: 7.446383013291042,
7: 6.307484284901181,
8: 7.752673979530179,
9: 9.02189934080111,
10: 4.677594714421006,
11: 7.218749101888444,
12: 4.04018241973268,
13: 3.940978322900716,
14: 3.962359464699719},
'model': {0: 'LR_output',
1: 'LR_output',
2: 'LR_output',
3: 'LR_output',
4: 'LR_output',
5: 'LR_output',
6: 'LR_output',
7: 'LR_output',
8: 'LR_output',
9: 'LR_output',
10: 'LR_output',
11: 'LR_output',
12: 'AR(12)',
13: 'AR(12)',
14: 'AR(12)'}}
df = pd.DataFrame.from_dict(sales_dict)
Expected output:
expected_dict = {'MO_YR': {0: Timestamp('2021-01-01 00:00:00'),
1: Timestamp('2021-02-01 00:00:00'),
2: Timestamp('2021-03-01 00:00:00'),
3: Timestamp('2021-04-01 00:00:00'),
4: Timestamp('2021-05-01 00:00:00'),
5: Timestamp('2021-06-01 00:00:00'),
6: Timestamp('2021-07-01 00:00:00'),
7: Timestamp('2021-08-01 00:00:00'),
8: Timestamp('2021-09-01 00:00:00'),
9: Timestamp('2021-10-01 00:00:00'),
10: Timestamp('2021-11-01 00:00:00'),
11: Timestamp('2021-12-01 00:00:00'),
12: Timestamp('2021-01-01 00:00:00'),
13: Timestamp('2021-02-01 00:00:00'),
14: Timestamp('2021-03-01 00:00:00'),
15: Timestamp('2021-01-01 00:00:00'),
16: Timestamp('2021-02-01 00:00:00'),
17: Timestamp('2021-03-01 00:00:00'),
18: Timestamp('2021-04-01 00:00:00'),
19: Timestamp('2021-05-01 00:00:00'),
20: Timestamp('2021-06-01 00:00:00'),
21: Timestamp('2021-07-01 00:00:00'),
22: Timestamp('2021-08-01 00:00:00'),
23: Timestamp('2021-09-01 00:00:00'),
24: Timestamp('2021-10-01 00:00:00'),
25: Timestamp('2021-11-01 00:00:00'),
26: Timestamp('2021-12-01 00:00:00'),
27: Timestamp('2021-01-01 00:00:00'),
28: Timestamp('2021-02-01 00:00:00'),
29: Timestamp('2021-03-01 00:00:00')},
'ITEM_BASE': {0: '289461K',
1: '289461K',
2: '289461K',
3: '289461K',
4: '289461K',
5: '289461K',
6: '289461K',
7: '289461K',
8: '289461K',
9: '289461K',
10: '289461K',
11: '289461K',
12: '400520J',
13: '400520J',
14: '400520J',
15: '289461K',
16: '289461K',
17: '289461K',
18: '289461K',
19: '289461K',
20: '289461K',
21: '289461K',
22: '289461K',
23: '289461K',
24: '289461K',
25: '289461K',
26: '289461K',
27: '400520J',
28: '400520J',
29: '400520J'},
'eaches': {0: 2592,
1: 3844,
2: 759,
3: 825,
4: 663,
5: 2025,
6: 471,
7: 1160,
8: 5987,
9: 679,
10: 469,
11: 907,
12: 64,
13: 48,
14: 63,
15: 2592,
16: 3844,
17: 759,
18: 825,
19: 663,
20: 2025,
21: 471,
22: 1160,
23: 5987,
24: 679,
25: 469,
26: 907,
27: 64,
28: 48,
29: 63},
'log_eaches': {0: 7.860185057472165,
1: 8.254268770090183,
2: 6.63200177739563,
3: 6.715383386334682,
4: 6.496774990185863,
5: 7.61332497954064,
6: 6.154858094016418,
7: 7.05617528410041,
8: 8.697345730925353,
9: 6.520621127558696,
10: 6.15060276844628,
11: 6.810142450115136,
12: 4.158883083359672,
13: 3.871201010907891,
14: 4.143134726391533,
15: 7.860185057472165,
16: 8.254268770090183,
17: 6.63200177739563,
18: 6.715383386334682,
19: 6.496774990185863,
20: 7.61332497954064,
21: 6.154858094016418,
22: 7.05617528410041,
23: 8.697345730925353,
24: 6.520621127558696,
25: 6.15060276844628,
26: 6.810142450115136,
27: 4.158883083359672,
28: 3.871201010907891,
29: 4.143134726391533,},
'output': {0: 8.646015798513993,
1: 8.378197900630752,
2: 7.045235414873291,
3: 5.117058321275769,
4: 9.082928370640056,
5: 5.225648643174155,
6: 7.446383013291042,
7: 6.307484284901181,
8: 7.752673979530179,
9: 9.02189934080111,
10: 4.677594714421006,
11: 7.218749101888444,
12: 4.04018241973268,
13: 3.940978322900716,
14: 3.962359464699719,
15: 2800.0,
16: 5200.0,
17: 550.0,
18: 475.0,
19: 575.0,
20: 475.0,
21: 650.0,
22: 550.0,
23: 7900.0,
24: 1187.0,
25: 1187.0,
26: 1900.0,
27: 51.0,
28: 55.0,
29: 59.0},
'model': {0: 'LR_output',
1: 'LR_output',
2: 'LR_output',
3: 'LR_output',
4: 'LR_output',
5: 'LR_output',
6: 'LR_output',
7: 'LR_output',
8: 'LR_output',
9: 'LR_output',
10: 'LR_output',
11: 'LR_output',
12: 'AR(12)',
13: 'AR(12)',
14: 'AR(12)',
15:'FCAST_QTY',
16:'FCAST_QTY',
17:'FCAST_QTY',
18:'FCAST_QTY',
19:'FCAST_QTY',
20:'FCAST_QTY',
21:'FCAST_QTY',
22:'FCAST_QTY',
23:'FCAST_QTY',
24:'FCAST_QTY',
25:'FCAST_QTY',
26:'FCAST_QTY',
27:'FCAST_QTY',
28:'FCAST_QTY',
29:'FCAST_QTY'}}
expected_df = pd.DataFrame.from_dict(expected_dict)  # kept separate so df above is not overwritten
Create a new dataframe with your logic and concatenate it with the original dataframe:
fcast_qty = (df
.drop(columns = ['output', 'model'])
.rename(columns={"FCAST_QTY":"output"})
.assign(model="FCAST_QTY")
)
pd.concat([df.drop(columns='FCAST_QTY'), fcast_qty], ignore_index = True)
MO_YR ITEM_BASE eaches log_eaches output model
0 2021-01-01 289461K 2592 7.860185 8.646016 LR_output
1 2021-02-01 289461K 3844 8.254269 8.378198 LR_output
2 2021-03-01 289461K 759 6.632002 7.045235 LR_output
3 2021-04-01 289461K 825 6.715383 5.117058 LR_output
4 2021-05-01 289461K 663 6.496775 9.082928 LR_output
5 2021-06-01 289461K 2025 7.613325 5.225649 LR_output
6 2021-07-01 289461K 471 6.154858 7.446383 LR_output
7 2021-08-01 289461K 1160 7.056175 6.307484 LR_output
8 2021-09-01 289461K 5987 8.697346 7.752674 LR_output
9 2021-10-01 289461K 679 6.520621 9.021899 LR_output
10 2021-11-01 289461K 469 6.150603 4.677595 LR_output
11 2021-12-01 289461K 907 6.810142 7.218749 LR_output
12 2021-01-01 400520J 64 4.158883 4.040182 AR(12)
13 2021-02-01 400520J 48 3.871201 3.940978 AR(12)
14 2021-03-01 400520J 63 4.143135 3.962359 AR(12)
15 2021-01-01 289461K 2592 7.860185 2800.000000 FCAST_QTY
16 2021-02-01 289461K 3844 8.254269 5200.000000 FCAST_QTY
17 2021-03-01 289461K 759 6.632002 550.000000 FCAST_QTY
18 2021-04-01 289461K 825 6.715383 475.000000 FCAST_QTY
19 2021-05-01 289461K 663 6.496775 575.000000 FCAST_QTY
20 2021-06-01 289461K 2025 7.613325 475.000000 FCAST_QTY
21 2021-07-01 289461K 471 6.154858 650.000000 FCAST_QTY
22 2021-08-01 289461K 1160 7.056175 550.000000 FCAST_QTY
23 2021-09-01 289461K 5987 8.697346 7900.000000 FCAST_QTY
24 2021-10-01 289461K 679 6.520621 1187.000000 FCAST_QTY
25 2021-11-01 289461K 469 6.150603 1187.000000 FCAST_QTY
26 2021-12-01 289461K 907 6.810142 1900.000000 FCAST_QTY
27 2021-01-01 400520J 64 4.158883 51.000000 FCAST_QTY
28 2021-02-01 400520J 48 3.871201 55.000000 FCAST_QTY
29 2021-03-01 400520J 63 4.143135 59.000000 FCAST_QTY
I am attempting to create a technical indicator ('Supertrend') using Pandas. The formula for this column is recursive.
(For people familiar with Pinescript, this column will replicate the result of this Pinescript function):
df['st_trendup'] = np.select(df['Close'].shift() > df['st_trendup'].shift(),df[['st_up','st_trendup'.shift()]].max(axis=1),df['st_up'])
The problem occurs in the true part of the np.select() because I cannot call .shift() on a string.
Normally, I would make a new column that uses .shift() beforehand, but since this is recursive, I have to do it all in one line.
If possible, I'd like to avoid loops for speed; I prefer solutions using native pandas or numpy functions.
What I am looking for:
A way to take the row-wise max that can accommodate a .shift() call.
Columns that are used:
def tr(high, low, close1):
    return max(high - low, abs(high - close1), abs(low - close1))

# `period` and `factor` are the Supertrend parameters (ATR length and band multiplier)
df['st_closeprev'] = df['Close'].shift()
df['st_hl2'] = (df['High'] + df['Low']) / 2
df['st_tr'] = df.apply(lambda row: tr(row['High'], row['Low'], row['st_closeprev']), axis=1)
df['st_atr'] = df['st_tr'].ewm(alpha=1/period, adjust=False, min_periods=period).mean()
df['st_up'] = df['st_hl2'] - factor * df['st_atr']
df['st_dn'] = df['st_hl2'] + factor * df['st_atr']
df['st_trendup'] = np.select(df['Close'].shift() > df['st_trendup'].shift(), df[['st_up', 'st_trendup'.shift()]].max(axis=1), df['st_up'])
Sample data obtained via df.to_dict():
{'Date': {0: Timestamp('2021-01-01 09:15:00'),
1: Timestamp('2021-01-01 09:30:00'),
2: Timestamp('2021-01-01 09:45:00'),
3: Timestamp('2021-01-01 10:00:00'),
4: Timestamp('2021-01-01 10:15:00'),
5: Timestamp('2021-01-01 10:30:00'),
6: Timestamp('2021-01-01 10:45:00'),
7: Timestamp('2021-01-01 11:00:00'),
8: Timestamp('2021-01-01 11:15:00'),
9: Timestamp('2021-01-01 11:30:00'),
10: Timestamp('2021-01-01 11:45:00'),
11: Timestamp('2021-01-01 12:00:00'),
12: Timestamp('2021-01-01 12:15:00'),
13: Timestamp('2021-01-01 12:30:00'),
14: Timestamp('2021-01-01 12:45:00'),
15: Timestamp('2021-01-01 13:00:00'),
16: Timestamp('2021-01-01 13:15:00'),
17: Timestamp('2021-01-01 13:30:00'),
18: Timestamp('2021-01-01 13:45:00'),
19: Timestamp('2021-01-01 14:00:00'),
20: Timestamp('2021-01-01 14:15:00'),
21: Timestamp('2021-01-01 14:30:00'),
22: Timestamp('2021-01-01 14:45:00'),
23: Timestamp('2021-01-01 15:00:00'),
24: Timestamp('2021-01-01 15:15:00'),
25: Timestamp('2021-01-04 09:15:00')},
'Open': {0: 31250.0,
1: 31376.0,
2: 31405.0,
3: 31389.4,
4: 31377.5,
5: 31347.8,
6: 31310.8,
7: 31343.4,
8: 31349.5,
9: 31349.9,
10: 31325.1,
11: 31310.9,
12: 31329.0,
13: 31376.0,
14: 31375.5,
15: 31357.4,
16: 31325.0,
17: 31341.1,
18: 31300.0,
19: 31324.5,
20: 31353.3,
21: 31350.0,
22: 31346.9,
23: 31330.0,
24: 31314.3,
25: 31450.2},
'High': {0: 31407.0,
1: 31425.0,
2: 31411.95,
3: 31389.45,
4: 31382.0,
5: 31350.0,
6: 31354.6,
7: 31359.0,
8: 31370.0,
9: 31364.7,
10: 31350.0,
11: 31337.9,
12: 31378.9,
13: 31419.5,
14: 31377.75,
15: 31360.0,
16: 31367.15,
17: 31345.2,
18: 31340.0,
19: 31367.0,
20: 31375.0,
21: 31370.0,
22: 31350.0,
23: 31334.6,
24: 31329.6,
25: 31599.0},
'Low': {0: 31250.0,
1: 31367.95,
2: 31352.5,
3: 31331.65,
4: 31301.4,
5: 31303.05,
6: 31310.0,
7: 31325.05,
8: 31335.35,
9: 31315.35,
10: 31281.9,
11: 31292.0,
12: 31316.25,
13: 31352.05,
14: 31335.0,
15: 31322.0,
16: 31318.25,
17: 31261.55,
18: 31283.3,
19: 31324.5,
20: 31322.0,
21: 31332.15,
22: 31324.1,
23: 31300.15,
24: 31280.0,
25: 31430.0},
'Close': {0: 31375.0,
1: 31398.3,
2: 31386.0,
3: 31377.0,
4: 31342.3,
5: 31311.7,
6: 31345.0,
7: 31349.0,
8: 31344.2,
9: 31327.6,
10: 31311.3,
11: 31325.6,
12: 31373.0,
13: 31375.0,
14: 31357.4,
15: 31326.0,
16: 31345.9,
17: 31300.6,
18: 31324.4,
19: 31353.8,
20: 31345.6,
21: 31341.6,
22: 31332.5,
23: 31311.0,
24: 31285.0,
25: 31558.4},
'Volume': {0: 259952,
1: 163775,
2: 105900,
3: 99725,
4: 115175,
5: 78625,
6: 67675,
7: 46575,
8: 53350,
9: 54175,
10: 96975,
11: 80925,
12: 79475,
13: 147775,
14: 38900,
15: 64925,
16: 52425,
17: 142175,
18: 81800,
19: 74950,
20: 68550,
21: 40350,
22: 47150,
23: 119200,
24: 222875,
25: 524625}}
Change:
df[['st_up','st_trendup'.shift()]].max(axis=1)
to:
df[['st_up','st_trendup']].assign(st_trendup = df['st_trendup'].shift()).max(axis=1)
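As an aside, np.select expects lists of conditions and choices; for a single condition/choice pair, np.where is the direct two-branch form. With the fix applied, the full line would read as below (the definition is still recursive, so st_trendup must already hold initial values before this runs):
df['st_trendup'] = np.where(
    df['Close'].shift() > df['st_trendup'].shift(),
    df[['st_up', 'st_trendup']].assign(st_trendup=df['st_trendup'].shift()).max(axis=1),
    df['st_up'],
)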
I have two columns of data as indicated below:
data = {'labels': {0: '00',
1: '08',
2: '00',
3: '08',
4: '5',
5: '04',
6: '08',
7: '04',
8: '08',
9: '5',
10: '5',
11: '04',
12: '5',
13: '00',
14: '08',
15: '5',
16: '00',
17: '00',
18: '04',
19: '04'},
'scores': {0: 0.0023585677121699122,
1: 0.056371229170055104,
2: 0.005376756883710199,
3: 0.05694460526172507,
4: 0.1049131006122696,
5: 0.008102266910447686,
6: 0.09154342979296892,
7: -0.03761723194472211,
8: 0.010718527281161072,
9: 0.11988838522095685,
10: 0.09070139731152083,
11: 0.02994813107318378,
12: 0.09277903598030868,
13: 0.062223925985664286,
14: 0.1377963110579728,
15: 0.11898618005936024,
16: -0.021227109409528988,
17: 0.008938944493717238,
18: 0.03413068403999525,
19: 0.058688416622356965}}
df = pd.DataFrame(data)
I am trying to plot the values in the scores column and color them according to the labels. I have tried
sns.scatterplot(data=df, x='labels', y='scores');
This works, but it doesn't show the clusters (each x value is separated), as shown here.
I want the points to be in the same space and colored differently according to the df['labels'].
Pass the labels column to the hue parameter:
sns.scatterplot(data=df, x='labels', y='scores', hue='labels')
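If you also want every point in the same x-space rather than one strip per label, one option (an assumption about the look you are after) is to plot against the row index while keeping the label-based coloring:
# reset_index() exposes the row index as a column named 'index'
sns.scatterplot(data=df.reset_index(), x='index', y='scores', hue='labels')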
Context:
I have two very large pandas dataframes to join that barely fit in memory (8GB each, millions of rows), and I face the challenge of performing a performant join using combinations of both indexed and non-indexed columns. Fuzzy joining is out of scope.
Variables in order of cardinality:
dataset_1 has these variables:
postcode, street_name, secondary_number, primary_number, unique_id
dataset_2 has these variables:
postcode, street_name, house_number, house_name, sub_building_name, different_unique_id
postcode and street_name are shared keys, and multiindexing seems the correct choice to improve joining performance in pandas:
dataset_1 = dataset_1.set_index(['postcode', 'street', "unique_id"]).sort_index()
dataset_2 = dataset_2.set_index(['postcode', 'street', "different_unique_id"]).sort_index()
Processing:
At this stage I can compute in pandas if memory allows. If not, I would use Dask; however, it can't handle multi-indexes. Even if that were possible (or unnecessary), the sorting would still need to be handled in pandas, as Dask cannot manage it. If Dask were an option, this is how I would convert:
dd1 = dd.from_pandas(dataset_1, npartitions=1)  # large left dataframe
del dataset_1  # to release the memory
dd2 = dd.from_pandas(dataset_2, npartitions=3)  # partitioned right dataframe for performance
del dataset_2  # to release the memory
Problem:
The challenge is performing an inner join on non-null variables using the indexes ("postcode" and "street"), alongside non-indexed columns. Combinations of the non-indexed variables will be iterated in a for loop.
Solution Sketch:
This gives an idea of what I would like to do to keep the performance gains from the indexing, but it is of course not syntactically valid:
output = pd.merge(df1, df2, how='inner', left_on=["postcode", "street_name", "secondary_number", "primary_number"], right_on=["postcode", "street_name", "house_name", "house_number"], left_index=[True,True,False,False], right_index=[True,True,False,False])
Summary:
My understanding is that DataFrame.join can handle a mix of indexed and non-indexed columns, whereas pd.merge cannot. As a result, I'm unsure how to achieve this join where there is a combination of both indexed and non-indexed columns (see the sketch after the example data below).
Example of intersects:
{'different_unique_id': {27: '{582D0636-8DEF-8F22-E053-6C04A8C01BAC}',
41: '{D9E869FE-7B55-4C36-AC43-695B9033A13B}',
33: '{93E6821E-554E-40FD-E053-6B04A8C0C1DF}',
1: '{288DCE29-0589-E510-E050-A8C06205480E}',
48: '{3A23DDD5-A0E8-41D2-A514-5B09385C301F}',
52: '{CEB16957-F7FA-4D1B-B45F-A390214735BC}',
13: '{404A5AF3-9B20-CD2B-E050-A8C063055C7B}',
16: '{64342BFD-FD07-422C-E053-6C04A8C0FB8A}',
57: '{29A8E769-8A10-4477-9494-FF55EF5FAE4B}',
10: '{404A5AF3-0B58-CD2B-E050-A8C063055C7B}',
21: '{55BDCAE6-0C10-521D-E053-6B04A8C0DD7A}',
31: '{5C676A02-1781-4152-950C-6E5CA2CBC487}',
7: '{68FEB20B-142E-38DA-E053-6C04A8C051AE}',
45: '{8F1B26BD-673F-53DB-E053-6C04A8C03649}',
12: '{2F115F7A-8F81-4124-9FD4-FB76E742B2C1}',
36: '{344AB2D7-4B59-4AB4-8F52-75B29BE8C509}',
20: '{965B6D91-D4B6-95E4-E053-6C04A8C07729}',
56: '{59872FD9-F39D-4BB9-95F6-91E002D948B1}',
22: '{6141DFF0-973F-4FEC-A582-7F310B566031}'},
'unique_id': {27: 10002277489,
41: 64023255,
33: 10007367447,
1: 22229221,
48: 10033235735,
52: 100062162615,
13: 50103744,
16: 10022903998,
57: 12015624,
10: 12154940,
21: 10024247587,
31: 100041193990,
7: 10008230730,
45: 10091640210,
12: 202107394,
36: 5062293,
20: 48114659,
56: 10001311242,
22: 10000443154},
'street': {27: 'thewharf',
41: 'parkroad',
33: 'oldmillclose',
1: 'thirdavenue',
48: 'woolnersway',
52: 'sumnerroad',
13: 'cliftongardens',
16: 'windhamroad',
57: 'westparkroad',
10: 'grangeroad',
21: 'staplersroad',
31: 'strand',
7: 'amhurstroad',
45: 'eatonroad',
12: 'northendroad',
36: 'belsizegrove',
20: 'watermillway',
56: 'orchardplace',
22: 'thurlowparkroad'},
'postcode': {27: 'lu72la',
41: 'cf626nt',
33: 'hr40aq',
1: 'bn32pd',
48: 'sg13ae',
52: 'gu97jx',
13: 'ct202ef',
16: 'bh14rn',
57: 'ub24af',
10: 'w55bu',
21: 'po302dp',
31: 'tq148aq',
7: 'e82ag',
45: 'ch47ew',
12: 'ha90ae',
36: 'nw34tt',
20: 'sw192rw',
56: 'so143hw',
22: 'se218hp'},
'secondary_number': {27: '76',
41: 'flat6',
33: '49',
1: 'flat10',
48: '145',
52: '31',
13: 'flat19',
16: 'flat7',
57: '76',
10: 'flat1',
21: 'flat1',
31: 'flat43',
7: 'flata',
45: '8',
12: '42',
36: 'flat9',
20: 'flat43',
56: 'flat156',
22: 'flat2'},
'primary_number': {27: 'eastdock',
41: 'courtlands',
33: 'watkinscourt',
1: 'ascothouse',
48: 'monumentcourt',
52: 'sumnercourt',
13: '22-24',
16: '77',
57: 'osterleyviews',
10: '55-59',
21: '138',
31: 'leandercourt',
7: '130',
45: 'greenbankhall',
12: 'danescourt',
36: 'holmefieldcourt',
20: 'bennetscourtyard',
56: 'oceanaboulevard',
22: '124f'},
'building_name': {27: 'eastdock',
41: 'courtlands',
33: 'watkinscourt',
1: 'ascothouse',
48: 'monumentcourt',
52: 'sumnercourt',
13: None,
16: None,
57: 'osterleyviews',
10: None,
21: None,
31: 'leandercourt',
7: None,
45: 'greenbankhall',
12: 'danescourt',
36: 'holmefieldcourt',
20: 'bennetscourtyard',
56: 'oceanaboulevard',
22: None},
'building_number': {27: None,
41: None,
33: None,
1: '18-20',
48: None,
52: None,
13: '22-24',
16: '77',
57: None,
10: '55-59',
21: '138',
31: None,
7: '130',
45: None,
12: None,
36: None,
20: None,
56: None,
22: '124f'},
'sub_building': {27: '76',
41: 'flat6',
33: '49',
1: 'flat10',
48: '145',
52: '31',
13: 'flat19',
16: 'flat7',
57: '76',
10: 'flat1',
21: 'flat1',
31: 'flat43',
7: 'flata',
45: '8',
12: '42',
36: 'flat9',
20: 'flat43',
56: 'flat156',
22: 'flat2'}}
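For what it's worth, a hedged sketch of one possible approach: since pandas 0.23, pd.merge can reference named index levels in left_on/right_on alongside ordinary columns, so each key combination in the loop becomes one merge call. The right-hand column names below follow the sample dict above (the prose's house_name/house_number appear there as building_name/building_number), and whether this preserves the sorted-index speedup is not guaranteed:
# merge keys mix the ('postcode', 'street') index levels with regular columns;
# pandas resolves index level names given in left_on/right_on
output = pd.merge(
    dataset_1, dataset_2,
    how='inner',
    left_on=['postcode', 'street', 'secondary_number', 'primary_number'],
    right_on=['postcode', 'street', 'sub_building', 'building_name'],
)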
I need help creating a multiple-line graph using the DataFrame below.
num user_id first_result second_result result date point1 point2 point3 point4
0 0 1480R clear clear pass 9/19/2016 clear consider clear consider
1 1 419M consider consider fail 5/18/2016 consider consider clear clear
2 2 416N consider consider fail 11/15/2016 consider consider consider consider
3 3 1913I consider consider fail 11/25/2016 consider consider consider clear
4 4 1938T clear clear pass 8/1/2016 clear consider clear clear
5 5 1530C clear clear pass 6/22/2016 clear clear consider clear
6 6 1075L consider consider fail 9/13/2016 consider consider clear consider
7 7 1466N consider clear fail 6/21/2016 consider clear clear consider
8 8 662V consider consider fail 11/1/2016 consider consider clear consider
9 9 1187Y consider consider fail 9/13/2016 consider consider clear clear
10 10 138T consider consider fail 9/19/2016 consider clear consider consider
11 11 1461Z consider clear fail 7/18/2016 consider consider clear consider
12 12 807N consider clear fail 8/16/2016 consider consider clear clear
13 13 416Y consider consider fail 10/2/2016 consider clear clear clear
14 14 638A consider clear fail 6/21/2016 consider clear consider clear
Data file link: data.xlsx, or data as a dict:
data = {'num': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 7,
8: 8,
9: 9,
10: 10,
11: 11,
12: 12,
13: 13,
14: 14},
'user_id': {0: '1480R',
1: '419M',
2: '416N',
3: '1913I',
4: '1938T',
5: '1530C',
6: '1075L',
7: '1466N',
8: '662V',
9: '1187Y',
10: '138T',
11: '1461Z',
12: '807N',
13: '416Y',
14: '638A'},
'first_result': {0: 'clear',
1: 'consider',
2: 'consider',
3: 'consider',
4: 'clear',
5: 'clear',
6: 'consider',
7: 'consider',
8: 'consider',
9: 'consider',
10: 'consider',
11: 'consider',
12: 'consider',
13: 'consider',
14: 'consider'},
'second_result': {0: 'clear',
1: 'consider',
2: 'consider',
3: 'consider',
4: 'clear',
5: 'clear',
6: 'consider',
7: 'clear',
8: 'consider',
9: 'consider',
10: 'consider',
11: 'clear',
12: 'clear',
13: 'consider',
14: 'clear'},
'result': {0: 'pass',
1: 'fail',
2: 'fail',
3: 'fail',
4: 'pass',
5: 'pass',
6: 'fail',
7: 'fail',
8: 'fail',
9: 'fail',
10: 'fail',
11: 'fail',
12: 'fail',
13: 'fail',
14: 'fail'},
'date': {0: '9/19/2016',
1: '5/18/2016',
2: '11/15/2016',
3: '11/25/2016',
4: '8/1/2016',
5: '6/22/2016',
6: '9/13/2016',
7: '6/21/2016',
8: '11/1/2016',
9: '9/13/2016',
10: '9/19/2016',
11: '7/18/2016',
12: '8/16/2016',
13: '10/2/2016',
14: '6/21/2016'},
'point1': {0: 'clear',
1: 'consider',
2: 'consider',
3: 'consider',
4: 'clear',
5: 'clear',
6: 'consider',
7: 'consider',
8: 'consider',
9: 'consider',
10: 'consider',
11: 'consider',
12: 'consider',
13: 'consider',
14: 'consider'},
'point2': {0: 'consider',
1: 'consider',
2: 'consider',
3: 'consider',
4: 'consider',
5: 'clear',
6: 'consider',
7: 'clear',
8: 'consider',
9: 'consider',
10: 'clear',
11: 'consider',
12: 'consider',
13: 'clear',
14: 'clear'},
'point3': {0: 'clear',
1: 'clear',
2: 'consider',
3: 'consider',
4: 'clear',
5: 'consider',
6: 'clear',
7: 'clear',
8: 'clear',
9: 'clear',
10: 'consider',
11: 'clear',
12: 'clear',
13: 'clear',
14: 'consider'},
'point4': {0: 'consider',
1: 'clear',
2: 'consider',
3: 'clear',
4: 'clear',
5: 'clear',
6: 'consider',
7: 'consider',
8: 'consider',
9: 'clear',
10: 'consider',
11: 'consider',
12: 'clear',
13: 'clear',
14: 'clear'}
}
I need to create a bar graph and a line graph. I have created the bar graph using point1, where x = consider, clear and y = the count of consider and clear,
but I have no idea how to create a line graph for this scenario:
x = date
y = pass rate (%)
Pass rate is the number of clear / (consider + clear).
Graph the rate for first_result, second_result, and result, all on the same graph,
and the graph should look like the one below.
Please comment or answer on how I can do this. Even an idea of how to group the dates and get the ratio would be great.
Here's my idea of how to do it:
# first convert all `clear`, `consider` to 1,0
tmp_df = df[['first_result', 'second_result']].apply(lambda x: x.eq('clear').astype(int))
# convert `pass`, `fail` to 1,0
tmp_df['result'] = df.result.eq('pass').astype(int)
# copy the date (parsed to datetime so the dates group and sort chronologically)
tmp_df['date'] = pd.to_datetime(df['date'])
# groupby and compute mean, i.e. number_pass/total_count
tmp_df = tmp_df.groupby('date').mean()
tmp_df.plot()
Output for this dataset: a line plot of the first_result, second_result, and result pass rates by date.