replacing dictionary values into a dataframe - pandas

I have the following df on one side:
ACCOR SA ADMIRAL ADECCO BANKIA BANKINTER
ADMIRAL 0 0 0 0 0
ADECCO 0 0 0 0 0
BANKIA 0 0 0 0 0
and the following dict on the other:
{'ADMIRAL': 1, 'ADECCO': -1, 'BANKIA': -1}
where the df.index values correspond to the dict keys.
I would like to write the dict values into the df, one value per row, placing each value in the column named after its key, to obtain this output:
ACCOR SA ADMIRAL ADECCO BANKIA BANKINTER
ADMIRAL 0 1 0 0 0
ADECCO 0 0 -1 0 0
BANKIA 0 0 0 -1 0

Loop over the dict items and set each value with at:
d = {'ADMIRAL': 1, 'ADECCO': -1, 'BANKIA': -1}
for k, v in d.items():
    df.at[k, k] = v
    # alternative
    # df.loc[k, k] = v
print(df)
ACCOR SA ADMIRAL ADECCO BANKIA BANKINTER
ADMIRAL 0 1 0 0 0
ADECCO 0 0 -1 0 0
BANKIA 0 0 0 -1 0
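For a self-contained check of the loop approach, here is a minimal sketch. The way the frame is built (`pd.DataFrame(0, ...)`) is an assumption about the question's data; the assignment itself is the `at` loop from the answer.

```python
import pandas as pd

# Assumed reconstruction of the frame from the question: all zeros.
cols = ["ACCOR SA", "ADMIRAL", "ADECCO", "BANKIA", "BANKINTER"]
df = pd.DataFrame(0, index=["ADMIRAL", "ADECCO", "BANKIA"], columns=cols)
d = {"ADMIRAL": 1, "ADECCO": -1, "BANKIA": -1}

# Each key names both the row and the column of the cell to set.
for k, v in d.items():
    df.at[k, k] = v

print(df)
```

Because `at` addresses a single cell by label, the loop touches exactly one cell per dict entry and leaves the row and column order of the frame unchanged.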
Another solution is to build a DataFrame from the dict with MultiIndex.from_arrays and unstack:
s = pd.Series(list(d.values()), index=pd.MultiIndex.from_arrays([list(d), list(d)]))
df1 = s.unstack()
print (df1)
ADECCO ADMIRAL BANKIA
ADECCO -1.0 NaN NaN
ADMIRAL NaN 1.0 NaN
BANKIA NaN NaN -1.0
Then fill the NaNs from the original df with combine_first, so that the non-NaN values of df1 take priority:
df = df1.combine_first(df)
print (df)
ACCOR SA ADECCO ADMIRAL BANKIA BANKINTER
ADECCO 0.0 -1.0 0.0 0.0 0.0
ADMIRAL 0.0 0.0 1.0 0.0 0.0
BANKIA 0.0 0.0 0.0 -1.0 0.0
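As the output above shows, combine_first returns the rows and columns sorted and the values as floats. If the original order and integer dtype matter, one possible follow-up is to reindex against the original frame and cast back. This is a sketch under the assumption that the original frame is all-integer and has no missing values; `out` and the frame construction are illustrative names, not part of the answer.

```python
import pandas as pd

# Assumed reconstruction of the frame from the question: all zeros.
cols = ["ACCOR SA", "ADMIRAL", "ADECCO", "BANKIA", "BANKINTER"]
df = pd.DataFrame(0, index=["ADMIRAL", "ADECCO", "BANKIA"], columns=cols)
d = {"ADMIRAL": 1, "ADECCO": -1, "BANKIA": -1}

# Series indexed by (key, key) pairs, unstacked into a square frame.
s = pd.Series(list(d.values()), index=pd.MultiIndex.from_arrays([list(d), list(d)]))
df1 = s.unstack()

# Non-NaN cells of df1 win; everything else comes from df.
out = df1.combine_first(df)

# Restore the original row/column order and the integer dtype.
out = out.reindex(index=df.index, columns=df.columns).astype(int)
print(out)
```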

Related

Masked array assignment

I have an NxN array A, an NxN array B and an NxN mask (BitMatrix) M. Now I want to copy / assign the values of B to A only for the indices for which M is true. What is the best way to do that?
You can use logical indexing:
julia> A = zeros(5,5); B = ones(5,5); M = rand(Bool, 5, 5)
5×5 Matrix{Bool}:
1 0 1 1 0
1 0 1 1 0
1 0 1 1 1
0 0 0 1 0
0 0 0 0 1
julia> A[M] = B[M]; A
5×5 Matrix{Float64}:
1.0 0.0 1.0 1.0 0.0
1.0 0.0 1.0 1.0 0.0
1.0 0.0 1.0 1.0 1.0
0.0 0.0 0.0 1.0 0.0
0.0 0.0 0.0 0.0 1.0
or simply write a loop:
julia> for i in eachindex(A, B, M)
           if M[i]
               A[i] = B[i]
           end
       end
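Since the rest of this page is about pandas, here is a rough NumPy analog of the same masked assignment. It is not the Julia code above, only a sketch of the equivalent idea; the array names follow the question and the random mask is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.zeros((5, 5))
B = np.ones((5, 5))
M = rng.random((5, 5)) < 0.5  # boolean mask, True where B should be copied

# Boolean-mask assignment, the direct counterpart of A[M] = B[M] in Julia.
A[M] = B[M]

# In-place copy restricted to the mask.
A2 = np.zeros((5, 5))
np.copyto(A2, B, where=M)

# Out-of-place variant: take B where M is True, otherwise keep the zeros.
A3 = np.where(M, B, np.zeros((5, 5)))
```

The first two forms modify the target in place, while `np.where` allocates a new array.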

Data formatting for grouped boxplot using seaborn or matplotlib

I have 3 dataframes where the column names and number of rows are exactly the same in all 3. I want to plot all the columns from all three dataframes as a grouped boxplot in one image using seaborn or matplotlib, but I am having difficulty combining and formatting the data so that I can plot it as a grouped boxplot.
df=
A B C D E F G H I J
0 0.031810 0.000556 0.007798 0.000741 0 0 0 0.000180 0.002105 0
1 0.028687 0.000571 0.009356 0.000000 0 0 0 0.000183 0.001250 0
2 0.029635 0.001111 0.009121 0.000000 0 0 0 0.000194 0.001111 0
3 0.030579 0.002424 0.007672 0.000000 0 0 0 0.000194 0.001176 0
4 0.028544 0.002667 0.007973 0.000000 0 0 0 0.000179 0.001333 0
5 0.027286 0.003226 0.006881 0.000000 0 0 0 0.000196 0.001111 0
6 0.031597 0.003030 0.006695 0.000000 0 0 0 0.000180 0.002353 0
7 0.034226 0.003030 0.010804 0.000667 0 0 0 0.000179 0.003333 0
8 0.035105 0.002941 0.010176 0.000645 0 0 0 0.000364 0.003529 0
9 0.035171 0.003125 0.012666 0.001250 0 0 0 0.000612 0.005556 0
df1 =
A B C D E F G H I J
0 0.034898 0.003750 0.014091 0.001290 0 0 0 0.001488 0.005333 0
1 0.042847 0.003243 0.011559 0.000625 0 0 0 0.002272 0.010769 0
2 0.046087 0.005455 0.013101 0.000588 0 0 0 0.002147 0.008750 0
3 0.042719 0.003684 0.010496 0.001333 0 0 0 0.002627 0.004444 0
4 0.042410 0.004211 0.011580 0.000645 0 0 0 0.003007 0.006250 0
5 0.044515 0.003500 0.013990 0.000000 0 0 0 0.003954 0.007000 0
6 0.046062 0.004865 0.013278 0.000714 0 0 0 0.004035 0.011111 0
7 0.043666 0.004444 0.013460 0.000625 0 0 0 0.003826 0.010000 0
8 0.039888 0.006857 0.014351 0.000690 0 0 0 0.004314 0.011474 0
9 0.048203 0.006667 0.016338 0.000741 0 0 0 0.005294 0.013603 0
df3 =
A B C D E F G H I J
0 0.048576 0.006471 0.020130 0.002667 0 0 0 0.005536 0.015179 0
1 0.056270 0.007179 0.021519 0.001429 0 0 0 0.005524 0.012333 0
2 0.054020 0.008235 0.024464 0.001538 0 0 0 0.005926 0.010445 0
3 0.047297 0.008649 0.026650 0.002198 0 0 0 0.005870 0.010000 0
4 0.049347 0.009412 0.022808 0.002838 0 0 0 0.006541 0.012222 0
5 0.052026 0.010000 0.019935 0.002714 0 0 0 0.005062 0.012222 0
6 0.055124 0.010625 0.022950 0.003499 0 0 0 0.005954 0.008964 0
7 0.044411 0.010909 0.019129 0.005709 0 0 0 0.005209 0.007222 0
8 0.047697 0.010270 0.017234 0.008800 0 0 0 0.004808 0.008355 0
9 0.048562 0.010857 0.020219 0.008504 0 0 0 0.005665 0.004862 0
I can do single boxplots by using the following:
g = sns.boxplot(data=df, color = 'white', fliersize=1, linewidth=2, meanline = True, showmeans=True)
But getting all three into one figure seems a bit difficult. I see that I need to rearrange the data and use hue to plot everything from a combined data frame, but how exactly should I format the data? Any help?
You can do it all in one sns.boxplot call by concatenating the dataframes and passing hue:
tmp = (
    pd.concat(
        # assign adds a column `data` holding the frame's position i
        [d.assign(data=i)
         # enumerate yields the pairs (0, df), (1, df1), (2, df3)
         for i, d in enumerate([df, df1, df3])]
    )
    # melt keeps `data` as an identifier column and stacks the other
    # columns into `variable`/`value` pairs
    .melt(id_vars='data')
)
sns.boxplot(data=tmp, x='variable', hue='data', y='value')
Output: (figure not included; a boxplot of `value` per `variable`, grouped by `data`)
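To make the long format concrete, the sketch below builds three small stand-in frames (random data with three columns, an assumption made only to keep the example short) and applies the same concat/assign/melt steps. It labels the frames by name instead of position, which is a small variation on the answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = list("ABC")
# Three stand-in frames with identical columns and row counts.
df, df1, df3 = (pd.DataFrame(rng.random((10, 3)), columns=cols) for _ in range(3))

# Tag every row with the frame it came from, then stack the frames.
tagged = [d.assign(data=name) for name, d in zip(["df", "df1", "df3"], [df, df1, df3])]
tmp = pd.concat(tagged, ignore_index=True).melt(id_vars="data")

print(tmp.head())
# Plotting call from the answer, left as a comment so the sketch runs without seaborn:
# sns.boxplot(data=tmp, x="variable", y="value", hue="data")
```

Each row of `tmp` is one observation: `variable` holds the original column name, `value` the measurement, and `data` the source frame, which is the shape that `hue` expects.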

How to speed up Pandas' "iterrows"

I have a Pandas dataframe which I want to transform in the following way: I have some sensor data from an intelligent floor which is in column "CAPACITANCE" (split by ",") and that data comes from the device indicated in column "DEVICE". Now I want to have one row with a column per sensor - each device has 8 sensors, so I want to have devices x 8 columns and in that column I want the sensor data from exactly that sensor.
But my code is very slow, since I have about 90,000 rows in that dataframe! Does anyone have a suggestion for how to speed it up?
BEFORE:
CAPACITANCE DEVICE TIMESTAMP \
0 0.00,-1.00,0.00,1.00,1.00,-2.00,13.00,1.00 01,07 2017/11/15 12:24:42
1 0.00,0.00,-1.00,-1.00,-1.00,0.00,-1.00,0.00 01,07 2017/11/15 12:24:42
2 0.00,-1.00,-2.00,0.00,0.00,1.00,0.00,-2.00 01,07 2017/11/15 12:24:43
3 2.00,0.00,-2.00,-1.00,0.00,0.00,1.00,-2.00 01,07 2017/11/15 12:24:43
4 1.00,0.00,-2.00,1.00,1.00,-3.00,5.00,1.00 01,07 2017/11/15 12:24:44
AFTER:
01,01-0 01,01-1 01,01-2 01,01-3 01,01-4 01,01-5 01,01-6 01,01-7 \
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
01,02-0 01,02-1 ... 05,07-1 05,07-2 05,07-3 05,07-4 05,07-5 \
0 0 0 ... 0 0 0 0 0
1 0 0 ... 0 0 0 0 0
2 0 0 ... 0 0 0 0 0
3 0 0 ... 0 0 0 0 0
4 0 0 ... 0 0 0 0 0
05,07-6 05,07-7 TIMESTAMP 01,07-8
0 0 0 2017-11-15 12:24:42 1.00
1 0 0 2017-11-15 12:24:42 0.00
2 0 0 2017-11-15 12:24:43 -2.00
3 0 0 2017-11-15 12:24:43 -2.00
4 0 0 2017-11-15 12:24:44 1.00
# creating new dataframe based on the old one
floor_df_resampled = floor_df.copy()
floor_device = ["01,01", "01,02", "01,03", "01,04", "01,05", "01,06", "01,07", "01,08", "01,09", "01,10",
                "02,01", "02,02", "02,03", "02,04", "02,05", "02,06", "02,07", "02,08", "02,09", "02,10",
                "03,01", "03,02", "03,03", "03,04", "03,05", "03,06", "03,07", "03,08", "03,09",
                "04,01", "04,02", "04,03", "04,04", "04,05", "04,06", "04,07", "04,08", "04,09",
                "05,06", "05,07"]
# creating new columns
floor_objects = []
for device in floor_device:
    for sensor in range(8):
        floor_objects.append(device + "-" + str(sensor))
# merging new columns
floor_df_resampled = pd.concat([floor_df_resampled, pd.DataFrame(columns=floor_objects)], ignore_index=True, sort=True)
# part that takes loads of time
for index, row in floor_df_resampled.iterrows():
    obj = row["DEVICE"]
    sensor_data = row["CAPACITANCE"].split(',')
    for idx, val in enumerate(sensor_data):
        col = obj + "-" + str(idx + 1)
        floor_df_resampled.loc[index, col] = val
floor_df_resampled.drop(["DEVICE"], axis=1, inplace=True)
floor_df_resampled.drop(["CAPACITANCE"], axis=1, inplace=True)
As noted in the comments, I'm not sure why you want that many columns, but the new columns can be created as follows:
def explode(x):
    # all rows in a group share the same device name
    dev_name = x.DEVICE.iloc[0]
    # one float column per sensor reading
    ret_df = x.CAPACITANCE.str.split(',', expand=True).astype(float)
    ret_df.columns = [f'{dev_name}-{col}' for col in ret_df.columns]
    return ret_df
# group_keys=False keeps the original row index so the join below aligns
new_df = df.groupby('DEVICE', group_keys=False).apply(explode).fillna(0)
and then you can join this with the old data frame:
df = df.join(new_df)
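A small end-to-end sketch of the same idea follows. The sample rows are made up (three readings per device instead of eight) and the groups are iterated explicitly and concatenated instead of going through `groupby(...).apply`, a variation chosen here so the result index does not depend on the pandas version. `parts` and `result` are illustrative names.

```python
import pandas as pd

# Made-up sample in the shape of the question's data (3 sensors per device).
df = pd.DataFrame({
    "CAPACITANCE": ["0.00,-1.00,13.00", "2.00,0.00,-2.00", "1.00,5.00,1.00"],
    "DEVICE": ["01,07", "01,07", "02,03"],
    "TIMESTAMP": ["2017/11/15 12:24:42", "2017/11/15 12:24:43", "2017/11/15 12:24:44"],
})

def explode(group):
    # All rows in a group share one device name.
    dev_name = group["DEVICE"].iloc[0]
    # Split the comma-separated readings into one float column per sensor.
    wide = group["CAPACITANCE"].str.split(",", expand=True).astype(float)
    wide.columns = [f"{dev_name}-{i}" for i in wide.columns]
    return wide

# One wide block per device; concat aligns on columns and leaves NaN
# where a row belongs to a different device, which fillna turns into 0.
parts = [explode(g) for _, g in df.groupby("DEVICE")]
new_df = pd.concat(parts).fillna(0).sort_index()

result = df.drop(columns=["CAPACITANCE", "DEVICE"]).join(new_df)
print(result)
```

The string splitting runs once per device on a whole column, so no Python-level loop over the rows is needed.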

How to set (1) to max elements in pandas dataframe and (0) to everything else?

Let's say I have a pandas DataFrame.
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df:
a b c d e f
0 -1.238393 -0.755117 -0.228638 -0.077966 0.412947 0.887955
1 -0.342087 0.296171 0.177956 0.701668 -0.481744 -1.564719
2 0.610141 0.963873 -0.943182 -0.341902 0.326416 0.818899
3 -0.561572 0.063588 -0.195256 -1.637753 0.622627 0.845801
4 -2.506322 -1.631023 0.506860 0.368958 1.833260 0.623055
5 -1.313919 -1.758250 -1.082072 1.266158 0.427079 -1.018416
6 -0.781842 1.270133 -0.510879 -1.438487 -1.101213 -0.922821
7 -0.456999 0.234084 1.602635 0.611378 -1.147994 1.204318
8 0.497074 0.412695 -0.458227 0.431758 0.514382 -0.479150
9 -1.289392 -0.218624 0.122060 2.000832 -1.694544 0.773330
How do I set the row-wise max to 1 and all other elements to 0?
I came up with:
>>> for i in range(len(df)):
...     df.loc[i][df.loc[i].idxmax(axis=1)] = 1
...     df.loc[i][df.loc[i] != 1] = 0
generates
df:
a b c d e f
0 0 0 0 0 0 1
1 0 0 0 1 0 0
2 0 1 0 0 0 0
3 0 0 0 0 0 1
4 0 0 0 0 1 0
5 0 0 0 1 0 0
6 0 1 0 0 0 0
7 0 0 1 0 0 0
8 0 0 0 0 1 0
9 0 0 0 1 0 0
Does anyone have a better way of doing it? Maybe by getting rid of the for loop or applying a lambda?
Take the row-wise max, compare every row against it with eq, and cast the resulting boolean df to int with astype, which converts True and False to 1 and 0:
In [21]:
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df
Out[21]:
a b c d e f
0 0.797000 0.762125 -0.330518 1.117972 0.817524 0.041670
1 0.517940 0.357369 -1.493552 -0.947396 3.082828 0.578126
2 1.784856 0.672902 -1.359771 -0.090880 -0.093100 1.099017
3 -0.493976 -0.390801 -0.521017 1.221517 -1.303020 1.196718
4 0.687499 -2.371322 -2.474101 -0.397071 0.132205 0.034631
5 0.573694 -0.206627 -0.106312 -0.661391 -0.257711 -0.875501
6 -0.415331 1.185901 1.173457 0.317577 -0.408544 -1.055770
7 -1.564962 -0.408390 -1.372104 -1.117561 -1.262086 -1.664516
8 -0.987306 0.738833 -1.207124 0.738084 1.118205 -0.899086
9 0.282800 -1.226499 1.658416 -0.381222 1.067296 -1.249829
In [22]:
df = df.eq(df.max(axis=1), axis=0).astype(int)
df
Out[22]:
a b c d e f
0 0 0 0 1 0 0
1 0 0 0 0 1 0
2 1 0 0 0 0 0
3 0 0 0 1 0 0
4 1 0 0 0 0 0
5 1 0 0 0 0 0
6 0 1 0 0 0 0
7 0 1 0 0 0 0
8 0 0 0 0 1 0
9 0 0 1 0 0 0
Timings
In [24]:
# @Raihan Masud's method
%timeit df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)
# mine
%timeit df.eq(df.max(axis=1), axis=0).astype(int)
100 loops, best of 3: 7.94 ms per loop
1000 loops, best of 3: 640 µs per loop
In [25]:
# @Nader Hisham's method
%%timeit
def max_binary(df):
    binary = np.where( df == df.max() , 1 , 0 )
    return binary

df.apply( max_binary , axis = 1)
100 loops, best of 3: 9.63 ms per loop
You can see that my method is over 12X faster than @Raihan's method.
In [4]:
%%timeit
for i in range(len(df)):
    df.loc[i][df.loc[i].idxmax(axis=1)] = 1
    df.loc[i][df.loc[i] != 1] = 0
10 loops, best of 3: 21.1 ms per loop
The for loop is also significantly slower.
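One behaviour of the eq/max approach worth checking on a small fixed frame is ties: every element equal to the row max becomes 1. The sketch below shows that, plus one possible way to mark only the first max per row with NumPy's argmax. The sample values and the names `onehot` and `first` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Row 1 contains a tie for the maximum (5.0 appears twice).
df = pd.DataFrame({"a": [1.0, 5.0, 2.0], "b": [3.0, 5.0, 0.0], "c": [2.0, 1.0, -1.0]})

# Compare each row against its own max; ties all become 1.
onehot = df.eq(df.max(axis=1), axis=0).astype(int)

# Mark only the first maximum in each row.
first_max = np.zeros(df.shape, dtype=int)
first_max[np.arange(len(df)), df.to_numpy().argmax(axis=1)] = 1
first = pd.DataFrame(first_max, index=df.index, columns=df.columns)

print(onehot)
print(first)
```

With continuous random data as in the question, ties are unlikely, so both variants give the same result there.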
import numpy as np
def max_binary(df):
    binary = np.where( df == df.max() , 1 , 0 )
    return binary
df.apply( max_binary , axis = 1)
Following Nader's pattern, this is a shorter version:
df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)

Does np.array's astype prevent future edits in DataFrames?

I can change the first entry of the DataFrame initially:
In [6]: df = pd.DataFrame(np.random.rand(5,2))
In [7]: df
Out[7]:
0 1
0 0.514592 0.459589
1 0.329704 0.409099
2 0.061246 0.966191
3 0.336747 0.908513
4 0.169220 0.468437
In [8]: df.ix[0][0] = 1
In [9]: df
Out[9]:
0 1
0 1.000000 0.459589
1 0.329704 0.409099
2 0.061246 0.966191
3 0.336747 0.908513
4 0.169220 0.468437
But after I do this:
In [10]: df[0] = np.floor(df.index / 10).astype(int) * 10
In [11]: df
Out[11]:
0 1
0 0 0.459589
1 0 0.409099
2 0 0.966191
3 0 0.908513
4 0 0.468437
I can't find a way to change it.
In [12]: df.ix[0][0] = 1
In [13]: df
Out[13]:
0 1
0 0 0.459589
1 0 0.409099
2 0 0.966191
3 0 0.908513
4 0 0.468437
And I can't even change elements from other columns
In [16]: df.ix[0][1] = 1
In [17]: df
Out[17]:
0 1
0 0 0.459589
1 0 0.409099
2 0 0.966191
3 0 0.908513
4 0 0.468437
What's up with this?
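You are editing a copy. df.ix[0][0] is chained indexing: df.ix[0] first returns row 0, and once the columns hold different dtypes (int and float here) that row is a copy, so the assignment never reaches df. Select the row and column in a single indexing call instead: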
In [3]: df = pd.DataFrame(np.random.rand(5,2))
In [4]: df[0] = np.floor(df.index / 10).astype(int) * 10
In [5]: df
Out[5]:
0 1
0 0 0.201611
1 0 0.390364
2 0 0.727422
3 0 0.941035
4 0 0.036764
In [6]: df.ix[0,1] = 1
In [7]: df
Out[7]:
0 1
0 0 1.000000
1 0 0.390364
2 0 0.727422
3 0 0.941035
4 0 0.036764
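Note that `.ix` was removed in pandas 1.0, so on current versions the same single-call assignment is written with `.loc` (label based) or `.iloc` (position based). A minimal sketch of the session above using `.loc`; the `to_numpy()` call is an addition here to keep the column assignment a plain array operation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 2))

# Replace column 0 with integers, so the frame now has mixed dtypes.
df[0] = np.floor(df.index.to_numpy() / 10).astype(int) * 10

# One indexing call with row and column together writes into df itself.
df.loc[0, 1] = 1.0
df.loc[0, 0] = 1

print(df)
```

Here the row and column labels happen to be the integers 0 and 1, so `df.iloc[0, 1] = 1.0` would address the same cell by position.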