How to set (1) to max elements in pandas dataframe and (0) to everything else? - pandas

Let's say I have a pandas DataFrame.
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df:
a b c d e f
0 -1.238393 -0.755117 -0.228638 -0.077966 0.412947 0.887955
1 -0.342087 0.296171 0.177956 0.701668 -0.481744 -1.564719
2 0.610141 0.963873 -0.943182 -0.341902 0.326416 0.818899
3 -0.561572 0.063588 -0.195256 -1.637753 0.622627 0.845801
4 -2.506322 -1.631023 0.506860 0.368958 1.833260 0.623055
5 -1.313919 -1.758250 -1.082072 1.266158 0.427079 -1.018416
6 -0.781842 1.270133 -0.510879 -1.438487 -1.101213 -0.922821
7 -0.456999 0.234084 1.602635 0.611378 -1.147994 1.204318
8 0.497074 0.412695 -0.458227 0.431758 0.514382 -0.479150
9 -1.289392 -0.218624 0.122060 2.000832 -1.694544 0.773330
how to I get set 1 to rowwise max and 0 to other elements?
I came up with:
>>> for i in range(len(df)):
... df.loc[i][df.loc[i].idxmax(axis=1)] = 1
... df.loc[i][df.loc[i] != 1] = 0
generates
df:
a b c d e f
0 0 0 0 0 0 1
1 0 0 0 1 0 0
2 0 1 0 0 0 0
3 0 0 0 0 0 1
4 0 0 0 0 1 0
5 0 0 0 1 0 0
6 0 1 0 0 0 0
7 0 0 1 0 0 0
8 0 0 0 0 1 0
9 0 0 0 1 0 0
Does anyone has a better way of doing it? May be by getting rid of the for loop or applying lambda?

Use max and check for equality using eq and cast the boolean df to int using astype, this will convert True and False to 1 and 0:
In [21]:
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df
Out[21]:
a b c d e f
0 0.797000 0.762125 -0.330518 1.117972 0.817524 0.041670
1 0.517940 0.357369 -1.493552 -0.947396 3.082828 0.578126
2 1.784856 0.672902 -1.359771 -0.090880 -0.093100 1.099017
3 -0.493976 -0.390801 -0.521017 1.221517 -1.303020 1.196718
4 0.687499 -2.371322 -2.474101 -0.397071 0.132205 0.034631
5 0.573694 -0.206627 -0.106312 -0.661391 -0.257711 -0.875501
6 -0.415331 1.185901 1.173457 0.317577 -0.408544 -1.055770
7 -1.564962 -0.408390 -1.372104 -1.117561 -1.262086 -1.664516
8 -0.987306 0.738833 -1.207124 0.738084 1.118205 -0.899086
9 0.282800 -1.226499 1.658416 -0.381222 1.067296 -1.249829
In [22]:
df = df.eq(df.max(axis=1), axis=0).astype(int)
df
Out[22]:
a b c d e f
0 0 0 0 1 0 0
1 0 0 0 0 1 0
2 1 0 0 0 0 0
3 0 0 0 1 0 0
4 1 0 0 0 0 0
5 1 0 0 0 0 0
6 0 1 0 0 0 0
7 0 1 0 0 0 0
8 0 0 0 0 1 0
9 0 0 1 0 0 0
Timings
In [24]:
# #Raihan Masud's method
%timeit df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)
# mine
%timeit df.eq(df.max(axis=1), axis=0).astype(int)
100 loops, best of 3: 7.94 ms per loop
1000 loops, best of 3: 640 µs per loop
In [25]:
# #Nader Hisham's method
%%timeit
def max_binary(df):
binary = np.where( df == df.max() , 1 , 0 )
return binary
​
df.apply( max_binary , axis = 1)
100 loops, best of 3: 9.63 ms per loop
You can see that my method is over 12X faster than #Raihan's method
In [4]:
%%timeit
for i in range(len(df)):
df.loc[i][df.loc[i].idxmax(axis=1)] = 1
df.loc[i][df.loc[i] != 1] = 0
10 loops, best of 3: 21.1 ms per loop
The for loop is also significantly slower

import numpy as np
def max_binary(df):
binary = np.where( df == df.max() , 1 , 0 )
return binary
df.apply( max_binary , axis = 1)

Following Nader's pattern, this is a shorter version:
df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)

Related

Data formatting for grouped boxplot using seaborn or matplotlib

I have 3 dataframes where column names and number of rows are exactly the same in all 3 data frames. I want to plot all the columns from all three dataframes as a grouped boxplot into one image using seaborn or matplotlib. But I am having difficulties in combining and formating the data so that I can plot them as grouped box plot.
df=
A B C D E F G H I J
0 0.031810 0.000556 0.007798 0.000741 0 0 0 0.000180 0.002105 0
1 0.028687 0.000571 0.009356 0.000000 0 0 0 0.000183 0.001250 0
2 0.029635 0.001111 0.009121 0.000000 0 0 0 0.000194 0.001111 0
3 0.030579 0.002424 0.007672 0.000000 0 0 0 0.000194 0.001176 0
4 0.028544 0.002667 0.007973 0.000000 0 0 0 0.000179 0.001333 0
5 0.027286 0.003226 0.006881 0.000000 0 0 0 0.000196 0.001111 0
6 0.031597 0.003030 0.006695 0.000000 0 0 0 0.000180 0.002353 0
7 0.034226 0.003030 0.010804 0.000667 0 0 0 0.000179 0.003333 0
8 0.035105 0.002941 0.010176 0.000645 0 0 0 0.000364 0.003529 0
9 0.035171 0.003125 0.012666 0.001250 0 0 0 0.000612 0.005556 0
df1 =
A B C D E F G H I J
0 0.034898 0.003750 0.014091 0.001290 0 0 0 0.001488 0.005333 0
1 0.042847 0.003243 0.011559 0.000625 0 0 0 0.002272 0.010769 0
2 0.046087 0.005455 0.013101 0.000588 0 0 0 0.002147 0.008750 0
3 0.042719 0.003684 0.010496 0.001333 0 0 0 0.002627 0.004444 0
4 0.042410 0.004211 0.011580 0.000645 0 0 0 0.003007 0.006250 0
5 0.044515 0.003500 0.013990 0.000000 0 0 0 0.003954 0.007000 0
6 0.046062 0.004865 0.013278 0.000714 0 0 0 0.004035 0.011111 0
7 0.043666 0.004444 0.013460 0.000625 0 0 0 0.003826 0.010000 0
8 0.039888 0.006857 0.014351 0.000690 0 0 0 0.004314 0.011474 0
9 0.048203 0.006667 0.016338 0.000741 0 0 0 0.005294 0.013603 0
df3 =
A B C D E F G H I J
0 0.048576 0.006471 0.020130 0.002667 0 0 0 0.005536 0.015179 0
1 0.056270 0.007179 0.021519 0.001429 0 0 0 0.005524 0.012333 0
2 0.054020 0.008235 0.024464 0.001538 0 0 0 0.005926 0.010445 0
3 0.047297 0.008649 0.026650 0.002198 0 0 0 0.005870 0.010000 0
4 0.049347 0.009412 0.022808 0.002838 0 0 0 0.006541 0.012222 0
5 0.052026 0.010000 0.019935 0.002714 0 0 0 0.005062 0.012222 0
6 0.055124 0.010625 0.022950 0.003499 0 0 0 0.005954 0.008964 0
7 0.044411 0.010909 0.019129 0.005709 0 0 0 0.005209 0.007222 0
8 0.047697 0.010270 0.017234 0.008800 0 0 0 0.004808 0.008355 0
9 0.048562 0.010857 0.020219 0.008504 0 0 0 0.005665 0.004862 0
I can do single boxplots by using the following:
g = sns.boxplot(data=df, color = 'white', fliersize=1, linewidth=2, meanline = True, showmeans=True)
But how to get all three in one figure seems a bit difficult. I see I need to re-arrange the whole data and use hue in order to get every thing from combined data frame, but how exactly should I format the data is a question. Any help?
You can do all in one sns.boxplot run by concatenate the dataframes and passing hue:
tmp = (pd.concat([d.assign(data=i) # assign adds the column `data` with values i
for i,d in enumerate([df,df1,df3])] # enumerate gives you a generator of pairs (0,df), (1,df1), (2,df2)
)
.melt(id_vars='data') # melt basically turns `id_vars` columns into index,
# and stacks other columns
)
sns.boxplot(data=tmp, x='variable', hue='data', y='value')
Output:

Get_dummies produces more columns than its supposed to

I'm using get_dummies on a column of data that has zeroes or 'D' or "E". Instead of producing 2 columns it produces 5 - C, D, E, N, O. I'm not sure what they are and how to make it do just 2 as its supposed to.
When I just pull that column shows 0's and D and E, but when I put it in get_dummies adds extra columns
data[[2]]
0
0
D
0
0
0
0
D
0
0
When I do this:
dummy = pd.get_dummies(data[2], dummy_na = False)
dummy.head()
I get
0 C D E N O PreferredContactTime
0 0 0 0 0 0 1
1 0 0 0 0 0 0
1 0 0 0 0 0 0
0 0 1 0 0 0 0
1 0 0 0 0 0 0
What are C , N and O? I don't understand what it is displaying at all.
Setup
dtype = pd.CategoricalDtype([0, 'C', 'D', 'E', 'N', 'O', 'PreferredContactTime'])
data = pd.DataFrame({2: [
'PreferredContactTime', 0, 0, 'D', 0, 0, 0, 0, 'D', 0, 0
]}).astype(dtype)
Your result
dummy = pd.get_dummies(data[2], dummy_na=False )
dummy.head()
0 C D E N O PreferredContactTime
0 0 0 0 0 0 0 1
1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 0
3 0 0 1 0 0 0 0
4 1 0 0 0 0 0 0

How to speed up Pandas' "iterrows"

I have a Pandas dataframe which I want to transform in the following way: I have some sensor data from an intelligent floor which is in column "CAPACITANCE" (split by ",") and that data comes from the device indicated in column "DEVICE". Now I want to have one row with a column per sensor - each device has 8 sensors, so I want to have devices x 8 columns and in that column I want the sensor data from exactly that sensor.
But my code seems to be super slow since I have about 90.000 rows in that dataframe! Does anyone have a suggestion how to speed it up?
BEFORE:
CAPACITANCE DEVICE TIMESTAMP \
0 0.00,-1.00,0.00,1.00,1.00,-2.00,13.00,1.00 01,07 2017/11/15 12:24:42
1 0.00,0.00,-1.00,-1.00,-1.00,0.00,-1.00,0.00 01,07 2017/11/15 12:24:42
2 0.00,-1.00,-2.00,0.00,0.00,1.00,0.00,-2.00 01,07 2017/11/15 12:24:43
3 2.00,0.00,-2.00,-1.00,0.00,0.00,1.00,-2.00 01,07 2017/11/15 12:24:43
4 1.00,0.00,-2.00,1.00,1.00,-3.00,5.00,1.00 01,07 2017/11/15 12:24:44
AFTER:
01,01-0 01,01-1 01,01-2 01,01-3 01,01-4 01,01-5 01,01-6 01,01-7 \
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
01,02-0 01,02-1 ... 05,07-1 05,07-2 05,07-3 05,07-4 05,07-5 \
0 0 0 ... 0 0 0 0 0
1 0 0 ... 0 0 0 0 0
2 0 0 ... 0 0 0 0 0
3 0 0 ... 0 0 0 0 0
4 0 0 ... 0 0 0 0 0
05,07-6 05,07-7 TIMESTAMP 01,07-8
0 0 0 2017-11-15 12:24:42 1.00
1 0 0 2017-11-15 12:24:42 0.00
2 0 0 2017-11-15 12:24:43 -2.00
3 0 0 2017-11-15 12:24:43 -2.00
4 0 0 2017-11-15 12:24:44 1.00
# creating new dataframe based on the old one
floor_df_resampled = floor_df.copy()
floor_device = ["01,01", "01,02", "01,03", "01,04", "01,05", "01,06", "01,07", "01,08", "01,09", "01,10",
"02,01", "02,02", "02,03", "02,04", "02,05", "02,06", "02,07", "02,08", "02,09", "02,10",
"03,01", "03,02", "03,03", "03,04", "03,05", "03,06", "03,07", "03,08", "03,09",
"04,01", "04,02", "04,03", "04,04", "04,05", "04,06", "04,07", "04,08", "04,09",
"05,06", "05,07"]
# creating new columns
floor_objects = []
for device in floor_device:
for sensor in range(8):
floor_objects.append(device + "-" + str(sensor))
# merging new columns
floor_df_resampled = pd.concat([floor_df_resampled, pd.DataFrame(columns=floor_objects)], ignore_index=True, sort=True)
# part that takes loads of time
for index, row in floor_df_resampled.iterrows():
obj = row["DEVICE"]
sensor_data = row["CAPACITANCE"].split(',')
for idx, val in enumerate(sensor_data):
col = obj + "-" + str(idx + 1)
floor_df_resampled.loc[index, col] = val
floor_df_resampled.drop(["DEVICE"], axis=1, inplace=True)
floor_df_resampled.drop(["CAPACITANCE"], axis=1, inplace=True)
Like commented, I'm not sure why you want that many columns, but the new columns can be created as follows:
def explode(x):
dev_name = x.DEVICE.iloc[0]
ret_df = x.CAPACITANCE.str.split(',', expand=True).astype(float)
ret_df.columns = [f'{dev_name}-{col}' for col in ret_df.columns]
return ret_df
new_df = df.groupby('DEVICE').apply(explode).fillna(0)
and then you can merge this with the old data frame:
df = df.join(new_df)

how to convert pandas dataframe to libsvm format?

I have pandas data frame like below.
df
Out[50]:
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 \
0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
1 0 1 1 1 0 0 1 1 1 1 ... 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
4 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
5 1 0 0 1 1 1 1 0 0 0 ... 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
7 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
[8 rows x 100 columns]
I have target variable as an array as below.
[1, -1, -1, 1, 1, -1, 1, 1]
How can I map this target variable to a data frame and convert it into lib SVM format?.
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
df["labels"] = df.index.map[(equi)]
d = df[np.setdiff1d(df.columns,['indx','labels'])]
e = df.label
dump_svmlight_file(d,e,'D:/result/smvlight2.dat')er code here
ERROR:
File "D:/spyder/april.py", line 54, in <module>
df["labels"] = df.index.map[(equi)]
TypeError: 'method' object is not subscriptable
When I use
df["labels"] = df.index.list(map[(equi)])
ERROR:
AttributeError: 'RangeIndex' object has no attribute 'list'
Please help me to solve those errors.
I think you need convert index to_series and then call map:
df["labels"] = df.index.to_series().map(equi)
Or use rename of index:
df["labels"] = df.rename(index=equi).index
All together:
For difference of columns pandas has difference:
from sklearn.datasets import dump_svmlight_file
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
df["labels"] = df.rename(index=equi).index
e = df["labels"]
d = df[df.columns.difference(['indx','labels'])]
dump_svmlight_file(d,e,'C:/result/smvlight2.dat')
Also it seems label column is not necessary:
from sklearn.datasets import dump_svmlight_file
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
e = df.rename(index=equi).index
d = df[df.columns.difference(['indx'])]
dump_svmlight_file(d,e,'C:/result/smvlight2.dat')

Complex Excel Formula in Pandas

Excel Formulas I am trying to replicate in pandas:
Click here to download workbook
* Look at columns D, E and F
entsig and exsig are manual and can be changed. In real life they would be derived from the value of another column or a comparison of two other columns
ent = 1 if entsig previous = 1 and in = 0
in = 1 if ent previous = 1 or (in previous = 1 and ex = 0)
ex = 1 if exsig previous = 1 and in previous = 1
so either ent, in, or ex will always be = 1 but never more than one of them
import pandas as pd
df = pd.DataFrame(
[[0,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [0,0,0,0,0],
[0,1,0,0,0], [0,1,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [0,0,0,0,0],
[0,0,0,0,0], [0,0,0,0,0], [0,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0],
[0,0,0,0,0], [0,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0],
[1,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0]],
columns=['entsig', 'exsig','ent', 'in', 'ex'])
for i in df.index:
df['ent'][(df.entsig.shift(1)==1) & (df['ent'].shift(1) == 0) & (df['in'].shift(1) == 0)]=1
df['ex'][(df.exsig.shift(1)==1) & (df['in'].shift(1)==1)]=1
df['in'][(df.ent.shift(1)==1) | ((df['in'].shift(1)==1) & (df['ex']==0))]=1
for j in df.index:
df['ent'][df['in'] == 1]=0
df['in'][df['ex']==1]=0
df['ex'][df['ex'].shift(1)==1]=0
df
results in
entsig exsig ent in ex
0 0 0 0 0 0
1 1 0 0 0 0
2 1 0 1 0 0
3 1 0 0 1 0
4 0 0 0 1 0
5 0 1 0 1 0
6 0 1 0 0 1
7 1 0 0 0 0
8 1 0 1 0 0
9 0 0 0 1 0
10 0 0 0 1 0
11 0 0 0 1 0
12 0 1 0 1 0
13 0 1 0 0 1
14 0 1 0 0 0
15 0 0 0 0 0
16 0 0 0 0 0
17 1 0 0 0 0
18 1 0 1 0 0
19 1 0 0 1 0
20 1 1 0 1 0
21 0 1 0 0 1
22 0 1 0 0 0
23 0 1 0 0 0
Question
How can I make this code faster? It runs slow because it's a loop but I have not been able to come up with a solution that does not use loops. Any ideas or comments are appreciated.
If we can assume every group of 1's in entsig is followed by at least one 1 in
exsig, then you could compute ent, ex and in like this:
def ent_in_ex(df):
entsig_mask = (df['entsig'].diff().shift(1) == 1)
exsig_mask = (df['exsig'].diff().shift(1) == 1)
df.loc[entsig_mask, 'ent'] = 1
df.loc[exsig_mask, 'ex'] = 1
df['in'] = df['ent'].shift(1).cumsum().subtract(df['ex'].cumsum(), fill_value=0)
return df
If we can make this assumption, then ent_in_ex is significantly faster:
In [5]: %timeit orig(df)
10 loops, best of 3: 185 ms per loop
In [6]: %timeit ent_in_ex(df)
100 loops, best of 3: 2.23 ms per loop
In [95]: orig(df).equals(ent_in_ex(df))
Out[95]: True
where orig is the original code:
def orig(df):
for i in df.index:
df['ent'][(df.entsig.shift(1)==1) & (df['ent'].shift(1) == 0) & (df['in'].shift(1) == 0)]=1
df['ex'][(df.exsig.shift(1)==1) & (df['in'].shift(1)==1)]=1
df['in'][(df.ent.shift(1)==1) | ((df['in'].shift(1)==1) & (df['ex']==0))]=1
for j in df.index:
df['ent'][df['in'] == 1]=0
df['in'][df['ex']==1]=0
df['ex'][df['ex'].shift(1)==1]=0
return df