DataFrame to dictionary - pandas

I can create a new dataframe based on the list of dicts. But how do I get the same list back from dataframe?
mylist = [{'points': 50, 'time': '5:00', 'year': 2010},
          {'points': 25, 'time': '6:00', 'month': 'february'},
          {'points': 90, 'time': '9:00', 'month': 'january'},
          {'points_h1': 20, 'month': 'june'}]
import pandas as pd
df = pd.DataFrame(mylist)
The following returns a dictionary keyed by column, not by row as in the example above:
In [18]: df.to_dict()
Out[18]:
{'month': {0: nan, 1: 'february', 2: 'january', 3: 'june'},
'points': {0: 50.0, 1: 25.0, 2: 90.0, 3: nan},
'points_h1': {0: nan, 1: nan, 2: nan, 3: 20.0},
'time': {0: '5:00', 1: '6:00', 2: '9:00', 3: nan},
'year': {0: 2010.0, 1: nan, 2: nan, 3: nan}}

df.to_dict(orient='records')
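For reference, on the DataFrame above this gives back one dict per row (a sketch of the output; key order may differ, and the missing entries come back as NaN):
In [19]: df.to_dict(orient='records')
Out[19]:
[{'points': 50.0, 'time': '5:00', 'year': 2010.0, 'month': nan, 'points_h1': nan},
 {'points': 25.0, 'time': '6:00', 'year': nan, 'month': 'february', 'points_h1': nan},
 {'points': 90.0, 'time': '9:00', 'year': nan, 'month': 'january', 'points_h1': nan},
 {'points': nan, 'time': nan, 'year': nan, 'month': 'june', 'points_h1': 20.0}]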
Answer is from: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html

Related

How to keep the number and names of columns in training and test dataset equal after one hot encoding?

The shape of the original dataset is 82580×30, with multiple string columns. Example dataset:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
df = pd.DataFrame({'Nationality': {0: 'DEU', 1: 'PRT', 2: 'PRT', 3: 'PRT', 4: 'FRA', 5: 'DEU', 6: 'CHE', 7: 'DEU', 8: 'GBR', 9: 'AUT', 10: 'PRT', 11: 'FRA', 12: 'OTR', 13: 'GBR', 14: 'ESP', 15: 'PRT', 16: 'OTR', 17: 'PRT', 18: 'ESP', 19: 'AUT'},
'Age': {0: 27.0, 1: 45.46, 2: 45.46, 3: 58.0, 4: 57.0, 5: 27.0, 6: 49.0, 7: 62.0, 8: 44.0, 9: 61.0, 10: 54.0, 11: 53.0, 12: 50.0, 13: 30.0, 14: 51.0, 15: 45.46, 16: 40.0, 17: 49.0, 18: 49.0, 19: 14.0},
'DaysSinceCreation': {0: 370, 1: 213, 2: 206, 3: 1018, 4: 835, 5: 52, 6: 597, 7: 217, 8: 999, 9: 1004, 10: 402, 11: 879, 12: 393, 13: 923, 14: 249, 15: 52, 16: 159, 17: 929, 18: 49, 19: 131},
'BookingsCheckedIn': {0: 1, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 2, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 0, 16: 0, 17: 1, 18: 1, 19: 0}})
# Encoding Variables
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')
transformed = transformer.fit_transform(df)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
transformed_df.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([transformed_df, df], axis=1)
# Remove old columns
df.drop(['Nationality'], axis = 1, inplace = True)
print('The shape after encoding: {}'.format(df.shape))
print(df.columns.unique())
The shape after encoding: (20, 14)
Index(['onehotencoder__Nationality_AUT', 'onehotencoder__Nationality_CHE',
'onehotencoder__Nationality_DEU', 'onehotencoder__Nationality_ESP',
'onehotencoder__Nationality_FRA', 'onehotencoder__Nationality_GBR',
'onehotencoder__Nationality_OTR', 'onehotencoder__Nationality_PRT',
'remainder__Age', 'remainder__DaysSinceCreation',
'remainder__BookingsCheckedIn', 'Age', 'DaysSinceCreation',
'BookingsCheckedIn'],
dtype='object')
After modeling, I try to test on a completely different test set:
df = pd.DataFrame({'Nationality': {0: 'CAN', 1: 'DEU', 2: 'PRT', 3: 'PRT', 4: 'FRA'},
'Age': {0: 27.0, 1: 29.0, 2: 24.0, 3: 24.0, 4: 46.0},
'DaysSinceCreation': {0: 222, 1: 988, 2: 212, 3: 685, 4: 1052},
'BookingsCheckedIn': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}})
# Encoding Variables
transformer = make_column_transformer(
(OneHotEncoder(sparse=False), ['Nationality']),
remainder='passthrough')
transformed = transformer.fit_transform(df)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
transformed_df.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([transformed_df, df], axis=1)
# Remove old columns
df.drop(['Nationality'], axis = 1, inplace = True)
print('The shape after encoding: {}'.format(df.shape))
print(df.columns.unique())
The shape after encoding: (5, 10)
Index(['onehotencoder__Nationality_CAN', 'onehotencoder__Nationality_DEU',
'onehotencoder__Nationality_FRA', 'onehotencoder__Nationality_PRT',
'remainder__Age', 'remainder__DaysSinceCreation',
'remainder__BookingsCheckedIn', 'Age', 'DaysSinceCreation',
'BookingsCheckedIn'],
dtype='object')
As can be seen, the test dataset has features that were not present in the original training set, and many features of the training set are not present in the test set. If I only use the .values of X_train, y_train, X_test, y_test, I can run anything from logistic regression to a neural net with >99% accuracy, but that feels like cheating and is not working out with decision trees. How do we deal with this?
I would like to contribute two suggestions:
(1) The categories in the test set should be a subset of those in the training set, so the unknown Nationality 'CAN' is not allowed. Either include 'CAN' in the training data, or replace it with 'GBR' in the test data.
(2) You should not call fit_transform() separately on the training and test sets. The right way is to fit on the training set, then transform both the training set and the test set. To illustrate:
# Encoding Variables
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')
####transformed = transformer.fit_transform(df) #delete this
transformer.fit(df) #use this instead
transformed = transformer.transform(df) #use this instead
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
<truncated>
print('The shape after encoding: {}'.format(df.shape))
The shape after encoding: (20, 14)
For the second part, note that I have replaced 'CAN' with 'GBR', and only use the previously fitted transformer to transform the test set:
df = pd.DataFrame({'Nationality': {0: 'GBR', 1: 'DEU', 2: 'PRT', 3: 'PRT', 4: 'FRA'},
'Age': {0: 27.0, 1: 29.0, 2: 24.0, 3: 24.0, 4: 46.0},
'DaysSinceCreation': {0: 222, 1: 988, 2: 212, 3: 685, 4: 1052},
'BookingsCheckedIn': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}})
# Encoding Variables
####transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough') #do not repeat, use the previous fitted model
####transformed = transformer.fit_transform(df) #delete this, NO fitting on test set
transformed = transformer.transform(df) #only do transform on test set
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
<truncated>
print('The shape after encoding: {}'.format(df.shape))
The shape after encoding: (5, 14)
So the number of columns (14) is the same for both the training set and the test set.
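To make this pattern reusable, here is a minimal sketch (train_df and test_df are hypothetical frames shaped like the two above) that fits the transformer once on the training data and then only transforms any later data:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

def encode(df, transformer, fit=False):
    # Fit only on the training data; afterwards, reuse the already-fitted transformer
    values = transformer.fit_transform(df) if fit else transformer.transform(df)
    return pd.DataFrame(values, columns=transformer.get_feature_names_out())

transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']),
                                      remainder='passthrough')
train_encoded = encode(train_df, transformer, fit=True)  # fit + transform
test_encoded = encode(test_df, transformer)              # transform only, same columns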

How to (correctly) merge 2 Pandas DataFrames and scatter-plot

Thank you in advance for your answers.
My end goal is to produce a scatter plot with corruption as the explanatory variable (x axis, from a DataFrame 'corr') and inequality as the dependent variable (y axis, from a DataFrame 'inq').
A hint on how to produce an informative table (DataFrame) by joining these two DataFrames would be much appreciated.
I have a DataFrame 'inq' for country inequality (GINI index) and another one 'corr' for a country corruption index.
import pandas as pd
from numpy import nan

inq = pd.DataFrame(
    {
        "country": {0: "Angola", 1: "Albania", 2: "United Arab Emirates"},
        "1975": {0: nan, 1: nan, 2: nan},
        "1976": {0: nan, 1: nan, 2: nan},
        "2017": {0: nan, 1: 33.2, 2: nan},
        "2018": {0: 51.3, 1: nan, 2: nan},
    }
)
corr = pd.DataFrame(
    {
        "country": {0: "Afghanistan", 1: "Angola", 2: "Albania"},
        "1975": {0: 44.8, 1: 48.1, 2: 75.1},
        "1976": {0: 44.8, 1: 48.1, 2: 75.1},
        "2018": {0: 24.2, 1: 40.4, 2: 28.4},
        "2019": {0: 40.5, 1: 37.6, 2: 35.9},
    }
)
I concatenate and manipulate them and get
cm = pd.concat([inq, corr], axis=0, keys=["Inequality", "Corruption"]).reset_index(
level=1, drop=True
)
a new DataFrame:
pd.DataFrame(
{
"indicator": {0: "Inequality", 1: "Inequality", 2: "Inequality"},
"country": {0: "Angola", 1: "Albania", 2: "United Arab Emirates"},
"1967": {0: nan, 1: nan, 2: nan},
"1969": {0: nan, 1: nan, 2: nan},
"2018": {0: 51.3, 1: nan, 2: nan},
"2019": {0: nan, 1: nan, 2: nan},
}
)
You should concatenate your dataframes in a different way:
df = (pd.concat([inq.set_index('country'),
corr.set_index('country')],
axis=1,
keys=["Inequality", "Corruption"]
)
.stack(level=1)
)
Inequality Corruption
country
Angola 1975 NaN 48.1
1976 NaN 48.1
2018 51.3 40.4
2019 NaN 37.6
Albania 1975 NaN 75.1
1976 NaN 75.1
2017 33.2 NaN
2018 NaN 28.4
2019 NaN 35.9
Afghanistan 1975 NaN 44.8
1976 NaN 44.8
2018 NaN 24.2
2019 NaN 40.5
Then to plot:
df.plot.scatter(x='Corruption', y='Inequality')
NB: there is only one point, as most of your data is NaN.
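To check how many usable points there are before plotting, a quick sketch on the df built above:
df.dropna()  # keeps only the rows where both Inequality and Corruption are present
On this sample data, only the Angola 2018 row (Inequality 51.3, Corruption 40.4) survives, which is why the scatter plot shows a single point.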

pandas groupby + multiple aggregate/apply with multiple columns

I have this minimal sample data:
import numpy as np
import pandas as pd
from pandas import Timestamp
data = pd.DataFrame({'Client': {0: "Client_1", 1: "Client_2", 2: "Client_2", 3: "Client_3", 4: "Client_3", 5: "Client_3", 6: "Client_4", 7: "Client_4"},
'Id_Card': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8},
'Type': {0: 'A', 1: 'B', 2: 'C', 3: np.nan, 4: 'A', 5: 'B', 6: np.nan, 7: 'B'},
'Loc': {0: 'ADW', 1: 'ZCW', 2: 'EWC', 3: "VWQ", 4: "OKS", 5: 'EQW', 6: "PKA", 7: 'CSA'},
'Amount': {0: 10.0, 1: 15.0, 2: 17.0, 3: 32.0, 4: np.nan, 5: 51.0, 6: 38.0, 7: -20.0},
'Net': {0: 30.0, 1: 42.0, 2: -10.0, 3: 15.0, 4: 98, 5: np.nan, 6: 23.0, 7: -10.0},
'Date': {0: Timestamp('2018-09-29 00:00:00'), 1: Timestamp('1996-08-02 00:00:00'), 2: np.nan, 3: Timestamp('2020-11-02 00:00:00'), 4: Timestamp('2008-12-27 00:00:00'), 5: Timestamp('2004-12-21 00:00:00'), 6: np.nan, 7: Timestamp('2010-08-25 00:00:00')}})
data
I'm trying to aggregate this data by grouping on the Client column: counting the Id_Card per client, concatenating Type and Loc separated by ';' (e.g. A;B and ZCW;EWC for Client_2, NOT A;ZCW B;EWC), summing Amount and Net per client, and getting the minimum Date per client. However, I'm facing some problems:
These functions work perfectly individually, but I can't find a way to mix the aggregate and apply functions:
Code example:
data.groupby("Client").agg({"Id_Card": "count", "Amount":"sum", "Date": "min"})
data.groupby('Client')['Loc'].apply(';'.join).reset_index()
The apply function doesn't work for columns with missing values:
Code example:
data.groupby('Client')['Type'].apply(';'.join).reset_index()
TypeError: sequence item 0: expected str instance, float found
The aggregate and apply functions don't allow me to put multiple columns for one transformation:
Code example:
cols_to_sum = ["Amount", "Net"]
data.groupby("Client").agg({"Id_Card": "count", cols_to_sum:"sum", "Date": "min"})
cols_to_join = ["Type", "Loc"]
data.groupby('Client')[cols_to_join].apply(';'.join).reset_index()
In (3) I only put Amount and Net, and I could list them separately in the aggregate function, but I'm looking for a more efficient way, as I'm working with plenty of columns. The expected output is the same dataframe, aggregated according to the conditions outlined at the beginning.
For the join you have to filter out the NaN values. Since the join is applied in two places, I created a separate function:
def join_non_nan_values(elements):
    # elem == elem is False for NaN, which filters the NaN values out
    return ";".join([elem for elem in elements if elem == elem])

data.groupby("Client").agg({"Id_Card": "count", "Type": join_non_nan_values,
                            "Loc": join_non_nan_values, "Amount": "sum",
                            "Net": "sum", "Date": "min"})
Go step by step and prepare three different dataframes, to merge them later.
The first dataframe is for simple functions like count, sum, and min:
df1 = data.groupby("Client").agg({"Id_Card": "count", "Amount": "sum", "Net": "sum", "Date": "min"}).reset_index()
Next, handle the Type and Loc joins; fillna('') deals with the NaN values:
df2 = data[['Client', 'Type']].fillna('').groupby("Client")['Type'].apply(';'.join).reset_index()
df3 = data[['Client', 'Loc']].fillna('').groupby("Client")['Loc'].apply(';'.join).reset_index()
And finally you merge the results together:
data_new = df1.merge(df2, on='Client').merge(df3, on='Client')
data_new output:
     Client  Id_Card  Amount    Net       Date  Type          Loc
0  Client_1        1    10.0   30.0 2018-09-29     A          ADW
1  Client_2        2    32.0   32.0 1996-08-02   B;C      ZCW;EWC
2  Client_3        3    83.0  113.0 2004-12-21  ;A;B  VWQ;OKS;EQW
3  Client_4        2    18.0   13.0 2010-08-25    ;B      PKA;CSA
(note the leading ';' where Type was NaN, a side effect of fillna(''))

Why is my if condition not working in my Python unnesting function?

I have written an unnest function in Python 3.6 as below:
def unnest(df, col, col2, reset_index=False):
    for item in df[col]:
        if len(item) == 0:
            item = item.append('')
    col_flat = pd.DataFrame([[i, x]
                             for i, y in df[col].apply(list).iteritems()
                             for x in y], columns=['I', col])
    col_flat = col_flat.set_index('I')
    df = df.drop(col, 1)
    df = df.merge(col_flat, left_index=True, right_index=True)
    if reset_index:
        df = df.reset_index(drop=True)
    if df[col] is None:
        merchant_product_code = df['Product code']
    else:
        merchant_product_code = df['Product code'] + '-' + df[col]
    df['item_group_id'] = df['Product code']
    df['Product code'] = merchant_product_code
    return df

full_df = unnest(full_df, 'Options', 'product code')
The problem I am facing: when the Options value is [], it replaces the [] with an empty string, and in the Product code column it adds a hyphen (-) after the product code.
I have converted my dataframe full_df to a dictionary here so that you can test it:
{'Product code': {0: 'BBMTG', 1: 'BBDBPSD', 2: 'BBDBPEL', 3: 'BBDBPDR', 4: 'BBFTDR', 5: 'BBFTPBG', 6: 'BBFTPBS', 7: 'BBFTEY'}, 'Category': {0: 'Essentials', 1: 'Bedding /Bamboo Blanket', 2: 'Bedding /Bamboo Blanket', 3: 'Bedding /Bamboo Blanket', 4: 'Apparel', 5: 'Apparel', 6: 'Apparel', 7: 'Apparel'}, 'List price': {0: 8.9, 1: 45.0, 2: 45.0, 3: 45.0, 4: 28.0, 5: 28.0, 6: 28.0, 7: 28.0}, 'Price': {0: 8.9, 1: 45.0, 2: 45.0, 3: 45.0, 4: 28.0, 5: 28.0, 6: 28.0, 7: 28.0}, 'Options': {0: '[]', 1: '[]', 2: '[]', 3: '[]', 4: "['0-3m', '3-6m', '6-12m']", 5: "['0-3m', '3-6m', '6-12m']", 6: "['0-3m', '3-6m', '6-12m']", 7: "['0-3m', '3-6m', '6-12m']"}}
Can anyone look into this and help me make it work?
Finally, I was able to solve my issue by changing the lines below:
item.append('[]')
and
df['Product code'] += df[col].apply(lambda val: '' if val == "[]" else '-' + val)
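To illustrate the fix in isolation, here is a minimal sketch (on a hypothetical two-row frame) showing how the lambda skips the hyphen when Options is the empty '[]' string:
import pandas as pd

df = pd.DataFrame({'Product code': ['BBMTG', 'BBFTDR'],
                   'Options': ['[]', '0-3m']})
# '[]' rows keep the bare product code; real options get '-<option>' appended
df['Product code'] += df['Options'].apply(lambda val: '' if val == '[]' else '-' + val)
print(df['Product code'].tolist())  # ['BBMTG', 'BBFTDR-0-3m']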

Pivoting pandas df - turn column values into column names

I have a df:
pd.DataFrame({'time_period': {0: pd.Timestamp('2017-04-01 00:00:00'),
                              1: pd.Timestamp('2017-04-01 00:00:00'),
                              2: pd.Timestamp('2017-03-01 00:00:00'),
                              3: pd.Timestamp('2017-03-01 00:00:00')},
              'cost1': {0: 142.62999999999994, 1: 131.97000000000003,
                        2: 142.62999999999994, 3: 131.97000000000003},
              'revenue1': {0: 56, 1: 113.14999999999998, 2: 177, 3: 99},
              'cost2': {0: 309.85000000000002, 1: 258.25,
                        2: 309.85000000000002, 3: 258.25},
              'revenue2': {0: 4.5, 1: 299.63, 2: 309.85, 3: 258.25},
              'City': {0: 'Boston', 1: 'New York', 2: 'Boston', 3: 'New York'}})
I want to restructure this df so that, for revenue and cost separately, it looks like this:
pd.DataFrame({'City': {0: 'Boston', 1: 'New York'},
'Apr-17 revenue1': {0: 56.0, 1: 113.15000000000001},
'Apr-17 revenue2': {0: 4.5, 1: 299.63},
'Mar-17 revenue1': {0: 177, 1: 99},
'Mar-17 revenue2': {0: 309.85000000000002, 1: 258.25}})
And a similar df for costs.
Basically, turn the time_period column values into column names like Apr-17, Mar-17 with revenue/cost string as appropriate and values of revenue1/revenue2 and cost1/cost2 respectively.
I've been playing around with pd.pivot_table with some success but I can't get exactly what I want.
Use set_index and unstack:
import datetime as dt
df['time_period'] = df['time_period'].apply(lambda x: dt.datetime.strftime(x, '%b-%Y'))
df = df.set_index(['City', 'time_period'])[['revenue1', 'revenue2']].unstack().reset_index()
df.columns = df.columns.map(' '.join).str.strip()

       City  revenue1 Apr-2017  revenue1 Mar-2017  revenue2 Apr-2017  revenue2 Mar-2017
0    Boston              56.00              177.0               4.50             309.85
1  New York             113.15               99.0             299.63             258.25
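The cost columns can be pivoted the same way (a sketch; it starts again from the original df, since the code above overwrote time_period and df):
import datetime as dt

df['time_period'] = df['time_period'].apply(lambda x: dt.datetime.strftime(x, '%b-%Y'))
costs = df.set_index(['City', 'time_period'])[['cost1', 'cost2']].unstack().reset_index()
costs.columns = costs.columns.map(' '.join).str.strip()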