Related
Shape of the original dataset is 82580×30 with multiple string columns. Example dataset:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
df = pd.DataFrame({'Nationality': {0: 'DEU', 1: 'PRT', 2: 'PRT', 3: 'PRT', 4: 'FRA', 5: 'DEU', 6: 'CHE', 7: 'DEU', 8: 'GBR', 9: 'AUT', 10: 'PRT', 11: 'FRA', 12: 'OTR', 13: 'GBR', 14: 'ESP', 15: 'PRT', 16: 'OTR', 17: 'PRT', 18: 'ESP', 19: 'AUT'},
'Age': {0: 27.0, 1: 45.46, 2: 45.46, 3: 58.0, 4: 57.0, 5: 27.0, 6: 49.0, 7: 62.0, 8: 44.0, 9: 61.0, 10: 54.0, 11: 53.0, 12: 50.0, 13: 30.0, 14: 51.0, 15: 45.46, 16: 40.0, 17: 49.0, 18: 49.0, 19: 14.0},
'DaysSinceCreation': {0: 370, 1: 213, 2: 206, 3: 1018, 4: 835, 5: 52, 6: 597, 7: 217, 8: 999, 9: 1004, 10: 402, 11: 879, 12: 393, 13: 923, 14: 249, 15: 52, 16: 159, 17: 929, 18: 49, 19: 131},
'BookingsCheckedIn': {0: 1, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 2, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 0, 16: 0, 17: 1, 18: 1, 19: 0}})
# Encoding Variables
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')
transformed = transformer.fit_transform(df)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
transformed_df.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([transformed_df, df], axis=1)
# Remove old columns
df.drop(['Nationality'], axis = 1, inplace = True)
print('The shape after encoding: {}'.format(df.shape))
print(df.columns.unique())
The shape after encoding: (20, 14)
Index(['onehotencoder__Nationality_AUT', 'onehotencoder__Nationality_CHE',
'onehotencoder__Nationality_DEU', 'onehotencoder__Nationality_ESP',
'onehotencoder__Nationality_FRA', 'onehotencoder__Nationality_GBR',
'onehotencoder__Nationality_OTR', 'onehotencoder__Nationality_PRT',
'remainder__Age', 'remainder__DaysSinceCreation',
'remainder__BookingsCheckedIn', 'Age', 'DaysSinceCreation',
'BookingsCheckedIn'],
dtype='object')
After modeling, trying to test on a completely different test set:
df = pd.DataFrame({'Nationality': {0: 'CAN', 1: 'DEU', 2: 'PRT', 3: 'PRT', 4: 'FRA'},
'Age': {0: 27.0, 1: 29.0, 2: 24.0, 3: 24.0, 4: 46.0},
'DaysSinceCreation': {0: 222, 1: 988, 2: 212, 3: 685, 4: 1052},
'BookingsCheckedIn': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}})
# Encoding Variables
transformer = make_column_transformer(
(OneHotEncoder(sparse=False), ['Nationality']),
remainder='passthrough')
transformed = transformer.fit_transform(df)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
transformed_df.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([transformed_df, df], axis=1)
# Remove old columns
df.drop(['Nationality'], axis = 1, inplace = True)
print('The shape after encoding: {}'.format(df.shape))
print(df.columns.unique())
The shape after encoding: (5, 10)
Index(['onehotencoder__Nationality_CAN', 'onehotencoder__Nationality_DEU',
'onehotencoder__Nationality_FRA', 'onehotencoder__Nationality_PRT',
'remainder__Age', 'remainder__DaysSinceCreation',
'remainder__BookingsCheckedIn', 'Age', 'DaysSinceCreation',
'BookingsCheckedIn'],
dtype='object')
As can be seen, testing dataset has some features that were not present in the original training set and many features of training set are not present in test set. If I only use .values of X_train, y_train, X_test, y_test, I can run from logistic regression to Neural Net with >99% accuracy, but that feels like cheating and is not working out with Decision Trees. How do we deal with this?
I would like to contribute 2 inputs:
(1) the test set should be a subset of the training set, so the unknown Nationality 'CAN' is not allowed. Either: try to include the new 'CAN' in the training data, or try to replace it with 'GBR' instead in the test data.
(2) you should not do fit_transform() separately on training and test set. The right way is to fit on training set, then... transform on training set and transform on test set. To illustrate:
# Encoding Variables
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')
####transformed = transformer.fit_transform(df) #delete this
transformer.fit(df) #use this instead
transformed = transformer.transform(df) #use this instead
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
<truncated>
print('The shape after encoding: {}'.format(df.shape))
The shape after encoding: (20, 14)
Second part, note that I have replaced 'CAN' with 'GBR'. And only use the previously fitted transformer to transform the test set:
df = pd.DataFrame({'Nationality': {0: 'GBR', 1: 'DEU', 2: 'PRT', 3: 'PRT', 4: 'FRA'},
'Age': {0: 27.0, 1: 29.0, 2: 24.0, 3: 24.0, 4: 46.0},
'DaysSinceCreation': {0: 222, 1: 988, 2: 212, 3: 685, 4: 1052},
'BookingsCheckedIn': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}})
# Encoding Variables
####transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough') #do not repeat, use the previous fitted model
####transformed = transformer.fit_transform(df) #delete this, NO fitting on test set
transformed = transformer.transform(df) #only do transform on test set
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
<truncated>
print('The shape after encoding: {}'.format(df.shape))
The shape after encoding: (5, 14)
So the number of columns (14) are the same for both training set and test set
I merged 3 dataframes mrna, meth, and cna. I want to remove any duplicate rows that either have the same Hugo_Symbol column value or have the same values across all the remaining columns (i.e., columns starting with "TCGA-").
import re
import pandas as pd
dfs = [mrna, meth, cna]
common = pd.concat(dfs, join='inner')
common["Hugo_Symbol"] = [re.sub(r'\|.+', "", str(i)) for i in common["Hugo_Symbol"]] # In Hugo_Symbol column, remove everything after the pipe except newline
common = common.drop_duplicates(subset="Hugo_Symbol") # Remove row if Hugo_Symbol is the same
common
A snippet of the dataframe:
common_dict = common.iloc[1:10,1:10].to_dict()
common_dict
{'TCGA-02-0001-01': {1: -0.9099,
2: -2.3351,
3: 0.2216,
4: 0.6798,
5: -2.48,
6: 0.7912,
7: -1.4578,
8: -3.8009,
9: 3.4868},
'TCGA-02-0003-01': {1: 0.0896,
2: -1.17,
3: 0.1255,
4: 0.2374,
5: -3.2629,
6: 1.2846,
7: -1.474,
8: -2.9891,
9: -0.1511},
'TCGA-02-0007-01': {1: -5.6511,
2: -2.8365,
3: 2.0026,
4: -0.6326,
5: -1.3741,
6: -3.437,
7: -1.047,
8: -4.185,
9: 2.1816},
'TCGA-02-0009-01': {1: 0.9795,
2: -0.5464,
3: 1.1115,
4: -0.2128,
5: -3.3461,
6: 1.3576,
7: -1.0782,
8: -3.4734,
9: -0.8985},
'TCGA-02-0010-01': {1: -0.7122,
2: 0.7651,
3: 2.4691,
4: 0.7222,
5: -1.7822,
6: -3.3403,
7: -1.6397,
8: 0.3424,
9: 1.7337},
'TCGA-02-0011-01': {1: -6.8649,
2: -0.4178,
3: 0.1858,
4: -0.0863,
5: -2.9486,
6: -3.843,
7: -0.9275,
8: -5.0462,
9: 0.9702},
'TCGA-02-0014-01': {1: -1.9439,
2: 0.3727,
3: -0.5368,
4: -0.1501,
5: 0.8977,
6: 0.5138,
7: -1.688,
8: 0.1778,
9: 1.7975},
'TCGA-02-0021-01': {1: -0.8761,
2: -0.2532,
3: 2.0574,
4: -0.9708,
5: -1.0883,
6: -1.0698,
7: -0.8684,
8: -5.3854,
9: 1.2353},
'TCGA-02-0024-01': {1: 1.6237,
2: -0.717,
3: -0.4517,
4: -0.5276,
5: -2.3993,
6: -4.3485,
7: 0.0811,
8: -2.5217,
9: 0.1883}}
Now, I want to drop any duplicate rows by subsetting all the columns beginning with "TCGA-" (i.e., all except the Hugo_Symbol column). How do I do it?
common = common.drop_duplicates(subset=[1:,], keep="first", inplace=False, ignore_index=False)
Here is the example data to reproduce the problem. It needed some changes to the data from dict of OP to have duplicates.
df = pd.DataFrame({
'Hugo_Symbol': ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'ABC', 'GHI', 'XYZ', 'DEF', 'BBB', 'CCC'],
'TCGA-02-0001-01': [-0.9099, -2.3351, 0.2216, 0.6798, -2.48, 0.7912, -1.4578, -3.8009, 3.4868, -2.48, 3.4868],
'TCGA-02-0003-01': [0.0896, -1.17, 0.1255, 0.2374, -3.2629, 1.2846, -1.474, -2.9891, -0.1511, -3.2629, -0.1511],
'TCGA-02-0007-01': [-5.6511, -2.8365, 2.0026, -0.6326, -1.3741, -3.437, -1.047, -4.185, 2.1816, -1.3741, 2.1816],
'TCGA-02-0009-01': [0.9795, -0.5464, 1.1115, -0.2128, -3.3461, 1.3576, -1.0782, -3.4734, -0.8985, -3.3461, -0.8985],
'TCGA-02-0010-01': [-0.7122, 0.7651, 2.4691, 0.7222, -1.7822, -3.3403, -1.6397, 0.3424, 1.7337, -1.7822, 1.7337],
'TCGA-02-0011-01': [-6.8649, -0.4178, 0.1858, -0.0863, -2.9486, -3.843, -0.9275, -5.0462, 0.9702, -2.9486, 0.9702],
'TCGA-02-0014-01': [-1.9439, 0.3727, -0.5368, -0.1501, 0.8977, 0.5138, -1.688, 0.1778, 1.7975, 0.8977, 1.7975],
'TCGA-02-0021-01': [-0.8761, -0.2532, 2.0574, -0.9708, -1.0883, -1.0698, -0.8684, -5.3854, 1.2353, -1.0883, 1.2353],
'TCGA-02-0024-01': [1.6237, -0.717, -0.4517, -0.5276, -2.3993, -4.3485, 0.0811, -2.5217, 0.1883, -2.3993, 0.1883]})
We have some duplicates in the "Hugo_Symbol" column and the last two rows (different hugo symbol) have exactly same data as the rows at position 5 and 9.
With the ideas of #Code Different I created a mask and used it on the DataFrame.
tcga_cols = df.columns[df.columns.str.startswith("TCGA-")].to_list()
mask = df.duplicated("Hugo_Symbol") | df.duplicated(tcga_cols)
print(mask)
False False False False False True True False True True True
result = df[~mask]
print(result)
Hugo_Symbol TCGA-02-0001-01 TCGA-02-0003-01 TCGA-02-0007-01 TCGA-02-0009-01 TCGA-02-0010-01 TCGA-02-0011-01 TCGA-02-0014-01 TCGA-02-0021-01 TCGA-02-0024-01
0 ABC -0.9099 0.0896 -5.6511 0.9795 -0.7122 -6.8649 -1.9439 -0.8761 1.6237
1 DEF -2.3351 -1.1700 -2.8365 -0.5464 0.7651 -0.4178 0.3727 -0.2532 -0.7170
2 GHI 0.2216 0.1255 2.0026 1.1115 2.4691 0.1858 -0.5368 2.0574 -0.4517
3 JKL 0.6798 0.2374 -0.6326 -0.2128 0.7222 -0.0863 -0.1501 -0.9708 -0.5276
4 MNO -2.4800 -3.2629 -1.3741 -3.3461 -1.7822 -2.9486 0.8977 -1.0883 -2.3993
7 XYZ -3.8009 -2.9891 -4.1850 -3.4734 0.3424 -5.0462 0.1778 -5.3854 -2.5217
As you can see result only contains rows where the mask was False
EDIT:
I tested the logic on several cases and it seems to work just fine (for this example data) so I guess your real data has some format which causes problems.
For example if your columns have leading whitespaces str.startswith won't work properly.
As a workaround, do ALL your columns start with TCGA except the "hugo" column? Then you could just replace the tcga_cols line with:
tcga_cols = df.columns[1:]
I have this minimal sample data:
import pandas as pd
from pandas import Timestamp
data = pd.DataFrame({'Client': {0: "Client_1", 1: "Client_2", 2: "Client_2", 3: "Client_3", 4: "Client_3", 5: "Client_3", 6: "Client_4", 7: "Client_4"},
'Id_Card': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8},
'Type': {0: 'A', 1: 'B', 2: 'C', 3: np.nan, 4: 'A', 5: 'B', 6: np.nan, 7: 'B'},
'Loc': {0: 'ADW', 1: 'ZCW', 2: 'EWC', 3: "VWQ", 4: "OKS", 5: 'EQW', 6: "PKA", 7: 'CSA'},
'Amount': {0: 10.0, 1: 15.0, 2: 17.0, 3: 32.0, 4: np.nan, 5: 51.0, 6: 38.0, 7: -20.0},
'Net': {0: 30.0, 1: 42.0, 2: -10.0, 3: 15.0, 4: 98, 5: np.nan, 6: 23.0, 7: -10.0},
'Date': {0: Timestamp('2018-09-29 00:00:00'), 1: Timestamp('1996-08-02 00:00:00'), 2: np.nan, 3: Timestamp('2020-11-02 00:00:00'), 4: Timestamp('2008-12-27 00:00:00'), 5: Timestamp('2004-12-21 00:00:00'), 6: np.nan, 7: Timestamp('2010-08-25 00:00:00')}})
data
I'm trying to aggregate this data grouping by Client column. Counting the Id_Card per client, concatenating Type, Loc, separated by ; (e.g. A;B and ZCW;EWC values for Client_2, NOT A;ZCW B;EWC), sum the Amount, Net, per client, and getting the minimum Date per client. However, I'm facing some problems:
These functions works perfectly individually, but I can't find a way to mix the aggregate function and apply function:
Code example:
data.groupby("Client").agg({"Id_Card": "count", "Amount":"sum", "Date": "min"})
data.groupby('Client')['Loc'].apply(';'.join).reset_index()
The apply function doesn't work for columns with missing values:
Code example:
data.groupby('Client')['Type'].apply(';'.join).reset_index()
TypeError: sequence item 0: expected str instance, float found
The aggregate and apply functions don't allow me to put multiple columns for one transformation:
Code example:
cols_to_sum = ["Amount", "Net"]
data.groupby("Client").agg({"Id_Card": "count", cols_to_sum:"sum", "Date": "min"})
cols_to_join = ["Type", "Loc"]
data.groupby('Client')[cols_to_join].apply(';'.join).reset_index()
In (3) I only put Amount and Net and I could put them separately in the aggregate function, but I'm looking to a more efficient way as I'm working with plenty of columns.
The output expected is the same dataframe, but aggregated with the conditions outlined at the beggining.
For doing a join, you would have to filter out the NaN values. As join you have to apply at two places, I have created a separate function
def join_non_nan_values(elements):
return ";".join([elem for elem in elements if elem == elem]) # elem == elem will fail for Nan values
data.groupby("Client").agg({"Id_Card": "count", "Type": join_non_nan_values,
"Loc": join_non_nan_values, "Amount":"sum", "Net": "sum", "Date": "min"})
Go step by step, and prepare three different data frames to merge them later.
First dataframe is for simple functions like count,sum,mean
df1 = data.groupby("Client").agg({"Id_Card": "count", "Amount":"sum", "Net":sum, "Date": "min"}).reset_index()
Next you deal with Type and Loc join, we use fill na to deal with nan values
df2=data[['Client', 'Type']].fillna('').groupby("Client")['Type'].apply(
';'.join).reset_index()
df3=data[['Client', 'Loc']].fillna('').groupby("Client")['Loc'].apply(
';'.join).reset_index()
And finally you merge the results together:
data_new = df1.merge(df2, on='Client').merge(df3, on='Client')
data_new output:
The below mentioned are the loss values generated in the file 'log'(the iterations are actually more than this what I listed below). Attached the screenshot of the contents of the log file for ref. How to plot the Iteration (x-axis) vs Loss (y-axis) from these contents in the 'log' file ?
0: combined_hm_loss: 0.17613089
1: combined_hm_loss: 0.20243575
2: combined_hm_loss: 0.07203530
3: combined_hm_loss: 0.03444689
4: combined_hm_loss: 0.02623464
5: combined_hm_loss: 0.02061908
6: combined_hm_loss: 0.01562270
7: combined_hm_loss: 0.01253260
8: combined_hm_loss: 0.01102418
9: combined_hm_loss: 0.00958306
10: combined_hm_loss: 0.00824807
11: combined_hm_loss: 0.00694697
12: combined_hm_loss: 0.00640630
13: combined_hm_loss: 0.00593691
14: combined_hm_loss: 0.00521284
15: combined_hm_loss: 0.00445185
16: combined_hm_loss: 0.00408901
17: combined_hm_loss: 0.00377806
18: combined_hm_loss: 0.00314004
19: combined_hm_loss: 0.00287649
enter image description here
try this:
import pandas as pd
import numpy as np
import io
data = '''
index combined_hm_loss
0: 0.17613089
1: 0.20243575
2: 0.07203530
3: 0.03444689
4: 0.02623464
5: 0.02061908
6: 0.01562270
7: 0.01253260
8: 0.01102418
9: 0.00958306
10: 0.00824807
11: 0.00694697
12: 0.00640630
13: 0.00593691
14: 0.00521284
15: 0.00445185
16: 0.00408901
17: 0.00377806
18: 0.00314004
19: 0.00287649
'''
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
ax = df.plot.area(y='combined_hm_loss')
ax.invert_yaxis()
I have written an unnest function in Python3.6 as below-
full_df = unnest(full_df,'Options','product code')
def unnest(df, col, col2,reset_index=False):
for item in df[col]:
if len(item)==0:
item=item.append('')
col_flat = pd.DataFrame([[i, x]
for i, y in df[col].apply(list).iteritems()
for x in y ], columns=['I', col]
)
col_flat = col_flat.set_index('I')
df = df.drop(col, 1)
df = df.merge(col_flat, left_index=True, right_index=True)
if reset_index:
df = df.reset_index(drop=True)
if df[col] is None:
merchant_product_code = df['Product code']
else:
merchant_product_code = df['Product code'] + '-' + df[col]
df['item_group_id'] = df['Product code']
df['Product code'] = merchant_product_code
return df
Problem I am facing here is, in case of Options value as []; it is removing the [] with empty and in Product Code column it is adding a hyphen(-) after product code.
I have converted my dataframe full_df as dictionary here so that you can test it.
{'Product code': {0: 'BBMTG', 1: 'BBDBPSD', 2: 'BBDBPEL', 3: 'BBDBPDR', 4: 'BBFTDR', 5: 'BBFTPBG', 6: 'BBFTPBS', 7: 'BBFTEY'}, 'Category': {0: 'Essentials', 1: 'Bedding /Bamboo Blanket', 2: 'Bedding /Bamboo Blanket', 3: 'Bedding /Bamboo Blanket', 4: 'Apparel', 5: 'Apparel', 6: 'Apparel', 7: 'Apparel'}, 'List price': {0: 8.9, 1: 45.0, 2: 45.0, 3: 45.0, 4: 28.0, 5: 28.0, 6: 28.0, 7: 28.0}, 'Price': {0: 8.9, 1: 45.0, 2: 45.0, 3: 45.0, 4: 28.0, 5: 28.0, 6: 28.0, 7: 28.0}, 'Options': {0: '[]', 1: '[]', 2: '[]', 3: '[]', 4: "['0-3m', '3-6m', '6-12m']", 5: "['0-3m', '3-6m', '6-12m']", 6: "['0-3m', '3-6m', '6-12m']", 7: "['0-3m', '3-6m', '6-12m']"}}
Can anyone look into this and help to make it work.
Finally I am able to solve my issue by changing below line-
item.append('[]')
and
df['Product code'] += df[col].apply(lambda val: '' if val == "[]" else '-' + val)