Related
Shape of the original dataset is 82580×30 with multiple string columns. Example dataset:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
df = pd.DataFrame({'Nationality': {0: 'DEU', 1: 'PRT', 2: 'PRT', 3: 'PRT', 4: 'FRA', 5: 'DEU', 6: 'CHE', 7: 'DEU', 8: 'GBR', 9: 'AUT', 10: 'PRT', 11: 'FRA', 12: 'OTR', 13: 'GBR', 14: 'ESP', 15: 'PRT', 16: 'OTR', 17: 'PRT', 18: 'ESP', 19: 'AUT'},
'Age': {0: 27.0, 1: 45.46, 2: 45.46, 3: 58.0, 4: 57.0, 5: 27.0, 6: 49.0, 7: 62.0, 8: 44.0, 9: 61.0, 10: 54.0, 11: 53.0, 12: 50.0, 13: 30.0, 14: 51.0, 15: 45.46, 16: 40.0, 17: 49.0, 18: 49.0, 19: 14.0},
'DaysSinceCreation': {0: 370, 1: 213, 2: 206, 3: 1018, 4: 835, 5: 52, 6: 597, 7: 217, 8: 999, 9: 1004, 10: 402, 11: 879, 12: 393, 13: 923, 14: 249, 15: 52, 16: 159, 17: 929, 18: 49, 19: 131},
'BookingsCheckedIn': {0: 1, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 2, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 0, 16: 0, 17: 1, 18: 1, 19: 0}})
# Encoding Variables
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')
transformed = transformer.fit_transform(df)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
transformed_df.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([transformed_df, df], axis=1)
# Remove old columns
df.drop(['Nationality'], axis = 1, inplace = True)
print('The shape after encoding: {}'.format(df.shape))
print(df.columns.unique())
The shape after encoding: (20, 14)
Index(['onehotencoder__Nationality_AUT', 'onehotencoder__Nationality_CHE',
'onehotencoder__Nationality_DEU', 'onehotencoder__Nationality_ESP',
'onehotencoder__Nationality_FRA', 'onehotencoder__Nationality_GBR',
'onehotencoder__Nationality_OTR', 'onehotencoder__Nationality_PRT',
'remainder__Age', 'remainder__DaysSinceCreation',
'remainder__BookingsCheckedIn', 'Age', 'DaysSinceCreation',
'BookingsCheckedIn'],
dtype='object')
After modeling, trying to test on a completely different test set:
df = pd.DataFrame({'Nationality': {0: 'CAN', 1: 'DEU', 2: 'PRT', 3: 'PRT', 4: 'FRA'},
'Age': {0: 27.0, 1: 29.0, 2: 24.0, 3: 24.0, 4: 46.0},
'DaysSinceCreation': {0: 222, 1: 988, 2: 212, 3: 685, 4: 1052},
'BookingsCheckedIn': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}})
# Encoding Variables
transformer = make_column_transformer(
(OneHotEncoder(sparse=False), ['Nationality']),
remainder='passthrough')
transformed = transformer.fit_transform(df)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
transformed_df.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([transformed_df, df], axis=1)
# Remove old columns
df.drop(['Nationality'], axis = 1, inplace = True)
print('The shape after encoding: {}'.format(df.shape))
print(df.columns.unique())
The shape after encoding: (5, 10)
Index(['onehotencoder__Nationality_CAN', 'onehotencoder__Nationality_DEU',
'onehotencoder__Nationality_FRA', 'onehotencoder__Nationality_PRT',
'remainder__Age', 'remainder__DaysSinceCreation',
'remainder__BookingsCheckedIn', 'Age', 'DaysSinceCreation',
'BookingsCheckedIn'],
dtype='object')
As can be seen, testing dataset has some features that were not present in the original training set and many features of training set are not present in test set. If I only use .values of X_train, y_train, X_test, y_test, I can run from logistic regression to Neural Net with >99% accuracy, but that feels like cheating and is not working out with Decision Trees. How do we deal with this?
I would like to contribute 2 inputs:
(1) the test set should be a subset of the training set, so the unknown Nationality 'CAN' is not allowed. Either: try to include the new 'CAN' in the training data, or try to replace it with 'GBR' instead in the test data.
(2) you should not do fit_transform() separately on training and test set. The right way is to fit on training set, then... transform on training set and transform on test set. To illustrate:
# Encoding Variables
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')
####transformed = transformer.fit_transform(df) #delete this
transformer.fit(df) #use this instead
transformed = transformer.transform(df) #use this instead
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
<truncated>
print('The shape after encoding: {}'.format(df.shape))
The shape after encoding: (20, 14)
Second part, note that I have replaced 'CAN' with 'GBR'. And only use the previously fitted transformer to transform the test set:
df = pd.DataFrame({'Nationality': {0: 'GBR', 1: 'DEU', 2: 'PRT', 3: 'PRT', 4: 'FRA'},
'Age': {0: 27.0, 1: 29.0, 2: 24.0, 3: 24.0, 4: 46.0},
'DaysSinceCreation': {0: 222, 1: 988, 2: 212, 3: 685, 4: 1052},
'BookingsCheckedIn': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}})
# Encoding Variables
####transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough') #do not repeat, use the previous fitted model
####transformed = transformer.fit_transform(df) #delete this, NO fitting on test set
transformed = transformer.transform(df) #only do transform on test set
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
<truncated>
print('The shape after encoding: {}'.format(df.shape))
The shape after encoding: (5, 14)
So the number of columns (14) are the same for both training set and test set
I saw variations of this question asked several times, but I don't think any of the variations I saw fixes it (other than "use matplotlib for combo-plots", but I'd appreciate help understanding why should I do that).
df1 = pd.DataFrame({'height': {0: 161, 1: 173, 2: 168, 3: 185, 4: 163},
'year': {0: 2015, 1: 2016, 2: 2017, 3: 2018, 4: 2019}})
df2 = pd.DataFrame({'year': {0: 2015, 1: 2015, 2: 2016, 3: 2016, 4: 2017,
5: 2017, 6: 2018, 7: 2018, 8: 2019, 9: 2019},
'weight': {0: 64, 1: 81, 2: 82, 3: 83, 4: 66,
5: 71, 6: 84, 7: 91, 8: 99, 9: 94},
'sex': {0: 'M', 1: 'F', 2: 'M', 3: 'F', 4: 'M',
5: 'F', 6: 'M', 7: 'F', 8: 'M', 9: 'F'}})
ax = sns.barplot(x='year', y='weight', hue='sex', data=df2)
ax2 = ax.twinx()
sns.lineplot(x='year', y='height', data=df1, ax=ax2)
I expected this to be a textbook example of a comboplot, but the result is:
Why is that? Shouldn't the X axes simply converge and make a nice plot? Of course, each plot renders fine individually.
If you plot them separately and check their xlim, you can see seaborn shifts the bar plot's x values down to 0 (the years are displayed separately via xticklabels):
ax = sns.barplot(x='year', y='weight', hue='sex', data=df2)
print(ax.get_xlim())
print(ax.get_xticklabels())
# (-0.5, 4.5)
# [Text(0, 0, '2015'), Text(1, 0, '2016'), Text(2, 0, '2017'), Text(3, 0, '2018'), Text(4, 0, '2019')]
The line plot does not shift the x values and plots the years in the 2000s range:
ax = sns.lineplot(x='year', y='height', data=df1)
print(ax.get_xlim())
# (2014.8, 2019.2)
One workaround is to use reset_index() on the line plot's data and use x='index' to manually shift its x values to 0 to align with the bar plot:
g = sns.lineplot(x='index', y='height', data=df1.reset_index(), ax=ax2)
Python / Pandas.
Matching Buy and Sell entries row by row.
BuyDF and SellDF are obtained from one excel file and sorted as per ascending Time (column L).
The image shows how the matching has to be done.
Match Buy and Sell entries by Name following first in first out method.
Take very first entry (Name AAA) from BuyDF and match with very first / Top most entry (Name AAA) from SellDF and move the matching row from SellDF in front of corrosponding row of BuyDF and delete the row Sell DF.
Go back to BuyDF second entry and match SellDF entry and move the matching row from SellDF and move the matching row from SellDF in front of corrosponding row of BuyDF and the row is deleted from Sell DF ...... and so on.
For names which do not match leave the matching rows Blank.
The ascending order (Time / Column L) should not be changed to maintain first in first out.
Tried using merge but didn't work for me.
How to proceed ?
BuyDF
{'Date': {0: '2019-04-01', 1: '2019-04-01', 2: '2019-04-01', 3: '2019-04-01', 4: '2019-04-02', 5: '2019-04-02', 6: '2019-04-02', 7: '2019-04-02', 8: '2019-04-05'}, 'Name': {0: 'AAA', 1: 'AAA', 2: 'AAA', 3: 'AAA', 4: 'BBB', 5: 'CCC', 6: 'CCC', 7: 'BBB', 8: 'AAA'}, 'Ref': {0: 1, 1: 1, 2: 1, 3: 1, 4: 5, 5: 7, 6: 7, 7: 6, 8: 1}, 'Seg': {0: 'S', 1: 'S', 2: 'S', 3: 'S', 4: 'L', 5: 'XL', 6: 'XL', 7: 'L', 8: 'S'}, 'Trans': {0: 'buy', 1: 'buy', 2: 'buy', 3: 'buy', 4: 'buy', 5: 'buy', 6: 'buy', 7: 'buy', 8: 'buy'}, 'Qty': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}, 'Price': {0: 225, 1: 225, 2: 225, 3: 225, 4: 210, 5: 210, 6: 210, 7: 210, 8: 225}, 'Order ID': {0: 8249, 1: 111, 2: 654, 3: 111, 4: 888, 5: 444, 6: 444, 7: 888, 8: 111}, 'Trade ID': {0: 1010, 1: 1010, 2: 1010, 3: 1010, 4: 4645, 5: 132, 6: 132, 7: 4700, 8: 1010}, 'Time': {0: '2019-04-01 11:05:18', 1: '2019-04-01 13:05:18', 2: '2019-04-01 13:05:18', 3: '2019-04-01 13:05:59', 4: '2019-04-02 13:20:05', 5: '2019-04-02 13:35:02', 6: '2019-04-02 13:35:02', 7: '2019-04-02 14:20:12', 8: '2019-04-05 13:05:18'}}
SellDF
{'Date': {5: '2019-04-01', 6: '2019-04-02', 7: '2019-04-02', 8: '2019-04-02', 13: '2019-04-03', 14: '2019-04-05', 15: '2019-04-05'}, 'Name': {5: 'AAA', 6: 'BBB', 7: 'BBB', 8: 'BBB', 13: 'DDD', 14: 'AAA', 15: 'AAA'}, 'Ref': {5: 3, 6: 2, 7: 2, 8: 2, 13: 8, 14: 4, 15: 4}, 'Seg': {5: 'L', 6: 'X', 7: 'X', 8: 'X', 13: 'XS', 14: 'L', 15: 'L'}, 'Trans': {5: 'sell', 6: 'sell', 7: 'sell', 8: 'sell', 13: 'sell', 14: 'sell', 15: 'sell'}, 'Qty': {5: 1, 6: 1, 7: 1, 8: 1, 13: 1, 14: 1, 15: 1}, 'Price': {5: 210, 6: 210, 7: 210, 8: 210, 13: 210, 14: 210, 15: 210}, 'Order ID': {5: 555, 6: 222, 7: 222, 8: 222, 13: 999, 14: 555, 15: 555}, 'Trade ID': {5: 1640, 6: 1532, 7: 1532, 8: 1532, 13: 14623, 14: 1645, 15: 1645}, 'Time': {5: '2019-04-01 14:13:40', 6: '2019-04-02 13:10:32', 7: '2019-04-02 13:10:32', 8: '2019-04-02 13:10:32', 13: '2019-04-03 15:25:50', 14: '2019-04-05 14:41:45', 15: '2019-04-05 14:41:45'}}
Image posted for ease of understanding.
I am trying to calculate weighted average prices using pandas pivot table.
I have tried passing in a dictionary using aggfunc.
This does not work when passed into aggfunc, although it should calculate the correct weighted average.
'Price': lambda x: np.average(x, weights=df['Balance'])
I have also tried using a manual groupby:
df.groupby('Product').agg({
'Balance': sum,
'Price': lambda x : np.average(x, weights='Balance'),
'Value': sum
})
This also yields the error:
TypeError: Axis must be specified when shapes of a and weights differ.
Here is sample data
import pandas as pd
import numpy as np
price_dict = {'Product': {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'A',
5: 'B',
6: 'B',
7: 'B',
8: 'B',
9: 'B',
10: 'C',
11: 'C',
12: 'C',
13: 'C',
14: 'C'},
'Balance': {0: 10,
1: 20,
2: 30,
3: 40,
4: 50,
5: 60,
6: 70,
7: 80,
8: 90,
9: 100,
10: 110,
11: 120,
12: 130,
13: 140,
14: 150},
'Price': {0: 1,
1: 2,
2: 3,
3: 4,
4: 5,
5: 6,
6: 7,
7: 8,
8: 9,
9: 10,
10: 11,
11: 12,
12: 13,
13: 14,
14: 15},
'Value': {0: 10,
1: 40,
2: 90,
3: 160,
4: 250,
5: 360,
6: 490,
7: 640,
8: 810,
9: 1000,
10: 1210,
11: 1440,
12: 1690,
13: 1960,
14: 2250}}
Try to calculate weighted average by passing dict into aggfunc:
df = pd.DataFrame(price_dict)
df.pivot_table(
index='Product',
aggfunc = {
'Balance': sum,
'Price': np.mean,
'Value': sum
}
)
Output:
Balance Price Value
Product
A 150 3 550
B 400 8 3300
C 650 13 8550
The expected outcome should be :
Balance Price Value
Product
A 150 3.66 550
B 400 8.25 3300
C 650 13.15 8550
Here is one way using apply
df.groupby('Product').apply(lambda x : pd.Series(
{'Balance': x['Balance'].sum(),
'Price': np.average(x['Price'], weights=x['Balance']),
'Value': x['Value'].sum()}))
Out[57]:
Balance Price Value
Product
A 150.0 3.666667 550.0
B 400.0 8.250000 3300.0
C 650.0 13.153846 8550.0
I need help to create a multiple line graph using below DataFrame
num user_id first_result second_result result date point1 point2 point3 point4
0 0 1480R clear clear pass 9/19/2016 clear consider clear consider
1 1 419M consider consider fail 5/18/2016 consider consider clear clear
2 2 416N consider consider fail 11/15/2016 consider consider consider consider
3 3 1913I consider consider fail 11/25/2016 consider consider consider clear
4 4 1938T clear clear pass 8/1/2016 clear consider clear clear
5 5 1530C clear clear pass 6/22/2016 clear clear consider clear
6 6 1075L consider consider fail 9/13/2016 consider consider clear consider
7 7 1466N consider clear fail 6/21/2016 consider clear clear consider
8 8 662V consider consider fail 11/1/2016 consider consider clear consider
9 9 1187Y consider consider fail 9/13/2016 consider consider clear clear
10 10 138T consider consider fail 9/19/2016 consider clear consider consider
11 11 1461Z consider clear fail 7/18/2016 consider consider clear consider
12 12 807N consider clear fail 8/16/2016 consider consider clear clear
13 13 416Y consider consider fail 10/2/2016 consider clear clear clear
14 14 638A consider clear fail 6/21/2016 consider clear consider clear
data file linke data.xlsx or data as dict
data = {'num': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 7,
8: 8,
9: 9,
10: 10,
11: 11,
12: 12,
13: 13,
14: 14},
'user_id': {0: '1480R',
1: '419M',
2: '416N',
3: '1913I',
4: '1938T',
5: '1530C',
6: '1075L',
7: '1466N',
8: '662V',
9: '1187Y',
10: '138T',
11: '1461Z',
12: '807N',
13: '416Y',
14: '638A'},
'first_result': {0: 'clear',
1: 'consider',
2: 'consider',
3: 'consider',
4: 'clear',
5: 'clear',
6: 'consider',
7: 'consider',
8: 'consider',
9: 'consider',
10: 'consider',
11: 'consider',
12: 'consider',
13: 'consider',
14: 'consider'},
'second_result': {0: 'clear',
1: 'consider',
2: 'consider',
3: 'consider',
4: 'clear',
5: 'clear',
6: 'consider',
7: 'clear',
8: 'consider',
9: 'consider',
10: 'consider',
11: 'clear',
12: 'clear',
13: 'consider',
14: 'clear'},
'result': {0: 'pass',
1: 'fail',
2: 'fail',
3: 'fail',
4: 'pass',
5: 'pass',
6: 'fail',
7: 'fail',
8: 'fail',
9: 'fail',
10: 'fail',
11: 'fail',
12: 'fail',
13: 'fail',
14: 'fail'},
'date': {0: '9/19/2016',
1: '5/18/2016',
2: '11/15/2016',
3: '11/25/2016',
4: '8/1/2016',
5: '6/22/2016',
6: '9/13/2016',
7: '6/21/2016',
8: '11/1/2016',
9: '9/13/2016',
10: '9/19/2016',
11: '7/18/2016',
12: '8/16/2016',
13: '10/2/2016',
14: '6/21/2016'},
'point1': {0: 'clear',
1: 'consider',
2: 'consider',
3: 'consider',
4: 'clear',
5: 'clear',
6: 'consider',
7: 'consider',
8: 'consider',
9: 'consider',
10: 'consider',
11: 'consider',
12: 'consider',
13: 'consider',
14: 'consider'},
'point2': {0: 'consider',
1: 'consider',
2: 'consider',
3: 'consider',
4: 'consider',
5: 'clear',
6: 'consider',
7: 'clear',
8: 'consider',
9: 'consider',
10: 'clear',
11: 'consider',
12: 'consider',
13: 'clear',
14: 'clear'},
'point3': {0: 'clear',
1: 'clear',
2: 'consider',
3: 'consider',
4: 'clear',
5: 'consider',
6: 'clear',
7: 'clear',
8: 'clear',
9: 'clear',
10: 'consider',
11: 'clear',
12: 'clear',
13: 'clear',
14: 'consider'},
'point4': {0: 'consider',
1: 'clear',
2: 'consider',
3: 'clear',
4: 'clear',
5: 'clear',
6: 'consider',
7: 'consider',
8: 'consider',
9: 'clear',
10: 'consider',
11: 'consider',
12: 'clear',
13: 'clear',
14: 'clear'}
}
I need to create a bar graph and a line graph, I have created the bar graph using point1 where x = consider, clear and y = count of consider and clear
but I have no idea how to create a line graph by this scenario
x = date
y = pass rate (%)
Pass Rate is a number of clear/(consider + clear)
graph the rate for first_result, second_result, result all on the same graph
and the graph should look like below
please comment or answer how can I do it. if I can get an idea of grouping dates and getting the ratio then also great.
Here's my idea how to do it:
# first convert all `clear`, `consider` to 1,0
tmp_df = df[['first_result', 'second_result']].apply(lambda x: x.eq('clear').astype(int))
# convert `pass`, `fail` to 1,0
tmp_df['result'] = df.result.eq('pass').astype(int)
# copy the date
tmp_df['date'] = df['date']
# groupby and compute mean, i.e. number_pass/total_count
tmp_df = tmp_df.groupby('date').mean()
tmp_df.plot()
Output for this dataset