Shape of the original dataset is 82580×30 with multiple string columns. Example dataset:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
df = pd.DataFrame({'Nationality': {0: 'DEU', 1: 'PRT', 2: 'PRT', 3: 'PRT', 4: 'FRA', 5: 'DEU', 6: 'CHE', 7: 'DEU', 8: 'GBR', 9: 'AUT', 10: 'PRT', 11: 'FRA', 12: 'OTR', 13: 'GBR', 14: 'ESP', 15: 'PRT', 16: 'OTR', 17: 'PRT', 18: 'ESP', 19: 'AUT'},
'Age': {0: 27.0, 1: 45.46, 2: 45.46, 3: 58.0, 4: 57.0, 5: 27.0, 6: 49.0, 7: 62.0, 8: 44.0, 9: 61.0, 10: 54.0, 11: 53.0, 12: 50.0, 13: 30.0, 14: 51.0, 15: 45.46, 16: 40.0, 17: 49.0, 18: 49.0, 19: 14.0},
'DaysSinceCreation': {0: 370, 1: 213, 2: 206, 3: 1018, 4: 835, 5: 52, 6: 597, 7: 217, 8: 999, 9: 1004, 10: 402, 11: 879, 12: 393, 13: 923, 14: 249, 15: 52, 16: 159, 17: 929, 18: 49, 19: 131},
'BookingsCheckedIn': {0: 1, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 2, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 0, 16: 0, 17: 1, 18: 1, 19: 0}})
# Encoding Variables
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')
transformed = transformer.fit_transform(df)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
transformed_df.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([transformed_df, df], axis=1)
# Remove old columns
df.drop(['Nationality'], axis = 1, inplace = True)
print('The shape after encoding: {}'.format(df.shape))
print(df.columns.unique())
The shape after encoding: (20, 14)
Index(['onehotencoder__Nationality_AUT', 'onehotencoder__Nationality_CHE',
'onehotencoder__Nationality_DEU', 'onehotencoder__Nationality_ESP',
'onehotencoder__Nationality_FRA', 'onehotencoder__Nationality_GBR',
'onehotencoder__Nationality_OTR', 'onehotencoder__Nationality_PRT',
'remainder__Age', 'remainder__DaysSinceCreation',
'remainder__BookingsCheckedIn', 'Age', 'DaysSinceCreation',
'BookingsCheckedIn'],
dtype='object')
After modeling, I try to test on a completely different test set:
df = pd.DataFrame({'Nationality': {0: 'CAN', 1: 'DEU', 2: 'PRT', 3: 'PRT', 4: 'FRA'},
'Age': {0: 27.0, 1: 29.0, 2: 24.0, 3: 24.0, 4: 46.0},
'DaysSinceCreation': {0: 222, 1: 988, 2: 212, 3: 685, 4: 1052},
'BookingsCheckedIn': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}})
# Encoding Variables
transformer = make_column_transformer(
(OneHotEncoder(sparse=False), ['Nationality']),
remainder='passthrough')
transformed = transformer.fit_transform(df)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
transformed_df.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([transformed_df, df], axis=1)
# Remove old columns
df.drop(['Nationality'], axis = 1, inplace = True)
print('The shape after encoding: {}'.format(df.shape))
print(df.columns.unique())
The shape after encoding: (5, 10)
Index(['onehotencoder__Nationality_CAN', 'onehotencoder__Nationality_DEU',
'onehotencoder__Nationality_FRA', 'onehotencoder__Nationality_PRT',
'remainder__Age', 'remainder__DaysSinceCreation',
'remainder__BookingsCheckedIn', 'Age', 'DaysSinceCreation',
'BookingsCheckedIn'],
dtype='object')
As can be seen, the test set has a feature (onehotencoder__Nationality_CAN) that was not present in the original training set, and many features of the training set are missing from the test set. If I only use .values of X_train, y_train, X_test, y_test, I can run everything from logistic regression to a neural net with >99% accuracy, but that feels like cheating and does not work with decision trees. How do we deal with this?
I would like to contribute two points:
(1) The categories in the test set should be a subset of those in the training set, so the unknown Nationality 'CAN' is not allowed. Either include 'CAN' in the training data, or replace it in the test data with a known category such as 'GBR'.
(2) You should not call fit_transform() separately on the training set and the test set. The right way is to fit on the training set only, then transform both the training set and the test set with that fitted transformer. To illustrate:
# Encoding Variables
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')
####transformed = transformer.fit_transform(df) #delete this
transformer.fit(df) #use this instead
transformed = transformer.transform(df) #use this instead
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
<truncated>
print('The shape after encoding: {}'.format(df.shape))
The shape after encoding: (20, 14)
For the second part, note that I have replaced 'CAN' with 'GBR', and that only the previously fitted transformer is used to transform the test set:
df = pd.DataFrame({'Nationality': {0: 'GBR', 1: 'DEU', 2: 'PRT', 3: 'PRT', 4: 'FRA'},
'Age': {0: 27.0, 1: 29.0, 2: 24.0, 3: 24.0, 4: 46.0},
'DaysSinceCreation': {0: 222, 1: 988, 2: 212, 3: 685, 4: 1052},
'BookingsCheckedIn': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}})
# Encoding Variables
####transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough') #do not repeat, use the previous fitted model
####transformed = transformer.fit_transform(df) #delete this, NO fitting on test set
transformed = transformer.transform(df) #only do transform on test set
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# Concat the two tables
<truncated>
print('The shape after encoding: {}'.format(df.shape))
The shape after encoding: (5, 14)
So the number of columns (14) is the same for both the training set and the test set.
I merged 3 dataframes mrna, meth, and cna. I want to remove any duplicate rows that either have the same Hugo_Symbol column value or have the same values across all the remaining columns (i.e., columns starting with "TCGA-").
import re
import pandas as pd
dfs = [mrna, meth, cna]
common = pd.concat(dfs, join='inner')
common["Hugo_Symbol"] = [re.sub(r'\|.+', "", str(i)) for i in common["Hugo_Symbol"]] # In Hugo_Symbol column, remove everything after the pipe except newline
common = common.drop_duplicates(subset="Hugo_Symbol") # Remove row if Hugo_Symbol is the same
common
A snippet of the dataframe:
common_dict = common.iloc[1:10,1:10].to_dict()
common_dict
{'TCGA-02-0001-01': {1: -0.9099,
2: -2.3351,
3: 0.2216,
4: 0.6798,
5: -2.48,
6: 0.7912,
7: -1.4578,
8: -3.8009,
9: 3.4868},
'TCGA-02-0003-01': {1: 0.0896,
2: -1.17,
3: 0.1255,
4: 0.2374,
5: -3.2629,
6: 1.2846,
7: -1.474,
8: -2.9891,
9: -0.1511},
'TCGA-02-0007-01': {1: -5.6511,
2: -2.8365,
3: 2.0026,
4: -0.6326,
5: -1.3741,
6: -3.437,
7: -1.047,
8: -4.185,
9: 2.1816},
'TCGA-02-0009-01': {1: 0.9795,
2: -0.5464,
3: 1.1115,
4: -0.2128,
5: -3.3461,
6: 1.3576,
7: -1.0782,
8: -3.4734,
9: -0.8985},
'TCGA-02-0010-01': {1: -0.7122,
2: 0.7651,
3: 2.4691,
4: 0.7222,
5: -1.7822,
6: -3.3403,
7: -1.6397,
8: 0.3424,
9: 1.7337},
'TCGA-02-0011-01': {1: -6.8649,
2: -0.4178,
3: 0.1858,
4: -0.0863,
5: -2.9486,
6: -3.843,
7: -0.9275,
8: -5.0462,
9: 0.9702},
'TCGA-02-0014-01': {1: -1.9439,
2: 0.3727,
3: -0.5368,
4: -0.1501,
5: 0.8977,
6: 0.5138,
7: -1.688,
8: 0.1778,
9: 1.7975},
'TCGA-02-0021-01': {1: -0.8761,
2: -0.2532,
3: 2.0574,
4: -0.9708,
5: -1.0883,
6: -1.0698,
7: -0.8684,
8: -5.3854,
9: 1.2353},
'TCGA-02-0024-01': {1: 1.6237,
2: -0.717,
3: -0.4517,
4: -0.5276,
5: -2.3993,
6: -4.3485,
7: 0.0811,
8: -2.5217,
9: 0.1883}}
Now, I want to drop any duplicate rows based on all the columns beginning with "TCGA-" (i.e., all except the Hugo_Symbol column). How do I do it? My attempt below is not valid Python, because a slice cannot be written inside a list literal:
common = common.drop_duplicates(subset=[1:,], keep="first", inplace=False, ignore_index=False)
Here is example data to reproduce the problem. I had to change the data from the OP's dict so that it contains duplicates.
df = pd.DataFrame({
'Hugo_Symbol': ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'ABC', 'GHI', 'XYZ', 'DEF', 'BBB', 'CCC'],
'TCGA-02-0001-01': [-0.9099, -2.3351, 0.2216, 0.6798, -2.48, 0.7912, -1.4578, -3.8009, 3.4868, -2.48, 3.4868],
'TCGA-02-0003-01': [0.0896, -1.17, 0.1255, 0.2374, -3.2629, 1.2846, -1.474, -2.9891, -0.1511, -3.2629, -0.1511],
'TCGA-02-0007-01': [-5.6511, -2.8365, 2.0026, -0.6326, -1.3741, -3.437, -1.047, -4.185, 2.1816, -1.3741, 2.1816],
'TCGA-02-0009-01': [0.9795, -0.5464, 1.1115, -0.2128, -3.3461, 1.3576, -1.0782, -3.4734, -0.8985, -3.3461, -0.8985],
'TCGA-02-0010-01': [-0.7122, 0.7651, 2.4691, 0.7222, -1.7822, -3.3403, -1.6397, 0.3424, 1.7337, -1.7822, 1.7337],
'TCGA-02-0011-01': [-6.8649, -0.4178, 0.1858, -0.0863, -2.9486, -3.843, -0.9275, -5.0462, 0.9702, -2.9486, 0.9702],
'TCGA-02-0014-01': [-1.9439, 0.3727, -0.5368, -0.1501, 0.8977, 0.5138, -1.688, 0.1778, 1.7975, 0.8977, 1.7975],
'TCGA-02-0021-01': [-0.8761, -0.2532, 2.0574, -0.9708, -1.0883, -1.0698, -0.8684, -5.3854, 1.2353, -1.0883, 1.2353],
'TCGA-02-0024-01': [1.6237, -0.717, -0.4517, -0.5276, -2.3993, -4.3485, 0.0811, -2.5217, 0.1883, -2.3993, 0.1883]})
We have some duplicates in the "Hugo_Symbol" column, and the last two rows (which have different Hugo symbols) contain exactly the same data as the rows at index 4 and 8 (the 5th and 9th rows).
Following the ideas of @Code Different, I created a mask and applied it to the DataFrame.
tcga_cols = df.columns[df.columns.str.startswith("TCGA-")].to_list()
mask = df.duplicated("Hugo_Symbol") | df.duplicated(tcga_cols)
print(mask)
0     False
1     False
2     False
3     False
4     False
5      True
6      True
7     False
8      True
9      True
10     True
dtype: bool
result = df[~mask]
print(result)
Hugo_Symbol TCGA-02-0001-01 TCGA-02-0003-01 TCGA-02-0007-01 TCGA-02-0009-01 TCGA-02-0010-01 TCGA-02-0011-01 TCGA-02-0014-01 TCGA-02-0021-01 TCGA-02-0024-01
0 ABC -0.9099 0.0896 -5.6511 0.9795 -0.7122 -6.8649 -1.9439 -0.8761 1.6237
1 DEF -2.3351 -1.1700 -2.8365 -0.5464 0.7651 -0.4178 0.3727 -0.2532 -0.7170
2 GHI 0.2216 0.1255 2.0026 1.1115 2.4691 0.1858 -0.5368 2.0574 -0.4517
3 JKL 0.6798 0.2374 -0.6326 -0.2128 0.7222 -0.0863 -0.1501 -0.9708 -0.5276
4 MNO -2.4800 -3.2629 -1.3741 -3.3461 -1.7822 -2.9486 0.8977 -1.0883 -2.3993
7 XYZ -3.8009 -2.9891 -4.1850 -3.4734 0.3424 -5.0462 0.1778 -5.3854 -2.5217
As you can see, result only contains the rows where the mask was False.
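A short sketch of why a combined mask is used instead of two chained drop_duplicates calls. The three-row frame is made up for illustration. With chaining, the second call only sees the rows that survived the first, so a row can be kept even though it duplicates a row that was already removed.

```python
import pandas as pd

# Hypothetical toy data: row 1 repeats the symbol of row 0,
# and row 2 repeats the TCGA value of row 1
df = pd.DataFrame({'Hugo_Symbol': ['A', 'A', 'B'],
                   'TCGA-1': [1.0, 2.0, 2.0]})

# Combined mask: both checks are evaluated against the full DataFrame
mask = df.duplicated('Hugo_Symbol') | df.duplicated(['TCGA-1'])
masked = df[~mask]

# Chained calls: the second check runs after row 1 is already gone
sequential = df.drop_duplicates('Hugo_Symbol').drop_duplicates(['TCGA-1'])

print(masked.index.tolist())
print(sequential.index.tolist())
```

Which of the two behaviours is wanted depends on the data, so it is worth deciding explicitly.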
EDIT:
I tested the logic on several cases and it works fine for this example data, so I guess your real data has some formatting that causes problems.
For example, if your column names have leading whitespace, str.startswith("TCGA-") will not match them.
As a workaround: do ALL your columns start with "TCGA-" except the "Hugo_Symbol" column? Then you could just replace the tcga_cols line with:
tcga_cols = df.columns[1:]
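If stray whitespace in the column names turns out to be the cause, a minimal sketch of a fix is to strip the names before selecting the columns. The frame below, with its padded column names, is invented for illustration.

```python
import pandas as pd

# Hypothetical frame whose column names carry stray whitespace
df = pd.DataFrame({' Hugo_Symbol': ['ABC', 'DEF'],
                   ' TCGA-02-0001-01': [0.1, 0.2],
                   'TCGA-02-0003-01 ': [0.3, 0.4]})

# repr() makes leading and trailing spaces visible
print([repr(c) for c in df.columns])

# Strip whitespace from every column name, then select by prefix
df.columns = df.columns.str.strip()
tcga_cols = df.columns[df.columns.str.startswith("TCGA-")].to_list()
print(tcga_cols)
```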
I'm trying to add data labels so that you can see the value of each stack when looking at the graph. I added the text option and passed the column I want displayed, but the values only show up in the hover information and are not displayed on the graph itself. How do I change this?
import pandas as pd
import plotly.graph_objects as go
df2 = pd.DataFrame.from_dict({'Country': {0: 'Europe',
1: 'America',
2: 'Asia',
3: 'Europe',
4: 'America',
5: 'Asia',
6: 'Europe',
7: 'America',
8: 'Asia',
9: 'Europe',
10: 'America',
11: 'Asia'},
'Year': {0: 2014,
1: 2014,
2: 2014,
3: 2015,
4: 2015,
5: 2015,
6: 2016,
7: 2016,
8: 2016,
9: 2017,
10: 2017,
11: 2017},
'Amount': {0: 1600,
1: 410,
2: 150,
3: 1300,
4: 300,
5: 170,
6: 1000,
7: 500,
8: 200,
9: 900,
10: 500,
11: 210}})
fig = go.Figure()
x = []
for i in df2['Year'].unique():
    x.append(str(i))
for c in df2['Country'].unique():
    df3 = df2[df2['Country'] == c]
    fig.add_trace(go.Bar(x=x, y=df3['Amount'], name=c, text=df3['Amount']))
fig.update_layout(title="Personnel at Work",
barmode='stack',
title_x=.5,
yaxis={
'showgrid':False,
'visible':False
},
xaxis=dict(
tick0=0,
dtick=1,
),
plot_bgcolor='rgba(0,0,0,0)')
fig.show()
I had a similar problem and this block of code helped me. I'm not sure whether it will help in your case, but give it a try. Inside %{...}, put the variable you want displayed (text here, since you already pass text=df3['Amount']; y would also work):
fig.update_traces(texttemplate='%{text:.1f}', textposition='outside')
All the use cases are covered here:
https://plotly.com/python/text-and-annotations/
Below are the loss values written to the file 'log' (there are actually more iterations than the ones listed here). A screenshot of the contents of the log file is attached for reference. How can I plot iteration (x-axis) vs. loss (y-axis) from the contents of the 'log' file?
0: combined_hm_loss: 0.17613089
1: combined_hm_loss: 0.20243575
2: combined_hm_loss: 0.07203530
3: combined_hm_loss: 0.03444689
4: combined_hm_loss: 0.02623464
5: combined_hm_loss: 0.02061908
6: combined_hm_loss: 0.01562270
7: combined_hm_loss: 0.01253260
8: combined_hm_loss: 0.01102418
9: combined_hm_loss: 0.00958306
10: combined_hm_loss: 0.00824807
11: combined_hm_loss: 0.00694697
12: combined_hm_loss: 0.00640630
13: combined_hm_loss: 0.00593691
14: combined_hm_loss: 0.00521284
15: combined_hm_loss: 0.00445185
16: combined_hm_loss: 0.00408901
17: combined_hm_loss: 0.00377806
18: combined_hm_loss: 0.00314004
19: combined_hm_loss: 0.00287649
Try this:
import pandas as pd
import numpy as np
import io
data = '''
index combined_hm_loss
0: 0.17613089
1: 0.20243575
2: 0.07203530
3: 0.03444689
4: 0.02623464
5: 0.02061908
6: 0.01562270
7: 0.01253260
8: 0.01102418
9: 0.00958306
10: 0.00824807
11: 0.00694697
12: 0.00640630
13: 0.00593691
14: 0.00521284
15: 0.00445185
16: 0.00408901
17: 0.00377806
18: 0.00314004
19: 0.00287649
'''
# sep=r'\s+' splits on whitespace; delim_whitespace=True is deprecated in recent pandas
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
ax = df.plot.area(y='combined_hm_loss')
ax.invert_yaxis()
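The snippet above pastes the values into a string by hand. A sketch that parses the original line format directly and draws a plain line plot could look like this. The helper names parse_loss_log and plot_losses are made up for illustration, and the file is assumed to be called 'log' in the working directory.

```python
import re

# Matches lines such as "12: combined_hm_loss: 0.00640630"
LINE_PATTERN = re.compile(r'^\s*(\d+):\s*combined_hm_loss:\s*([0-9.eE+-]+)\s*$')

def parse_loss_log(lines):
    """Return (iterations, losses) from an iterable of log lines."""
    iterations, losses = [], []
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:  # lines in any other format are skipped
            iterations.append(int(match.group(1)))
            losses.append(float(match.group(2)))
    return iterations, losses

def plot_losses(iterations, losses):
    """Plot loss against iteration as a line chart."""
    import matplotlib.pyplot as plt  # imported here so parsing works without matplotlib
    plt.plot(iterations, losses)
    plt.xlabel('Iteration')
    plt.ylabel('combined_hm_loss')
    plt.show()

sample = [
    '0: combined_hm_loss: 0.17613089',
    '1: combined_hm_loss: 0.20243575',
    '2: combined_hm_loss: 0.07203530',
]
iterations, losses = parse_loss_log(sample)
print(iterations, losses)

# With the real file (assumed name 'log'):
# with open('log') as f:
#     plot_losses(*parse_loss_log(f))
```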
When I create a bar plot from a pandas DataFrame, the canvas comes out blank (i.e., no bars are shown). I tried on two different computers running the same pandas version (v0.20.3): one works and the other doesn't. This code reproduces the problem:
import pandas as pd
df = pd.DataFrame( {0: {0: 15.966058232618138,
1: 2.1807683719000992,
2: 0.87035229502695233,
3: 0.34367909767875798,
4: 0.18218519090896321},
1: {0: 11.118024492865494,
1: 0.69351230042284107,
2: 0.43197780592175244,
3: 0.076875254138056778,
4: 0.090691059750999822},
2: {0: 10.59611816777141,
1: 1.0043841242178624,
2: 0.66999680161427466,
3: 0.032357377554541628,
4: 0.18821105178736078},
3: {0: 0.19480519480519479,
1: 17.036783213824904,
2: 5.2625018367047067,
3: 1.5041249436616959,
4: 0.14895013123359582},
4: {0: 0.86666666666666659,
1: 53.71924947880472,
2: 99.890829694323145,
3: 10.031712688463491,
4: 4.6052631578947372},
5: {0: 1.8914728682170541,
1: 3554.8711656441715,
2: 573.03649635036504,
3: 0.72058823529411753,
4: 0.93846153846153835},
6: {0: 3.8978637334734652,
1: 0.19517839782493598,
2: 0.14753506501156222,
3: 0.021084786319386508,
4: 0.029238890916504161},
7: {0: 4.7377049180327866,
1: 0.056476683937823832,
2: 0.034086444007858548,
3: 0.99022801302931596,
4: 0.92809364548494977},
8: {0: 0.0058997050147492625,
1: 0.0,
2: 0.0,
3: 1.2954206878683853e-05,
4: 0.025023084025854108},
9: {0: 0.041333014548300184,
1: 0.23146322426025379,
2: 0.11579453571122432,
3: 0.3291825442962299,
4: 0.022578918480011249}} )
df.plot.bar( logy=True )
Trying to replicate the issue. The plot is shown above
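Since the pandas version is the same on both computers, it may be worth comparing the versions of the other plotting-related packages as well. That the difference lies there is an assumption, not something confirmed in this thread. A small sketch to run on each machine:

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version of a package, or a marker if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return 'not installed'

# Packages involved in DataFrame.plot; the list is a guess at what matters
versions = {name: installed_version(name) for name in ('pandas', 'matplotlib', 'numpy')}
for name, version in versions.items():
    print(f'{name}: {version}')
```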