Pandas Dataframe Manipulation logic - pandas

Can you please help with the problem below:
Given two dataframes df1 and df2, I need to get something like the result dataframe.
import pandas as pd
import numpy as np
feature_list = [ str(i) for i in range(6)]
df1 = pd.DataFrame( {'value' : [0,3,0,4,2,5]})
df2 = pd.DataFrame(0, index=np.arange(6), columns=feature_list)
Expected dataframe:
The result needs to be driven by comparing the values in df1 with the column names (features) in df2. If they match, we put 1 in resultDf.
Here's the expected output (resultDf):

I think you need:
# dtype=int keeps the 0/1 output shown below (pandas 2.x defaults to bool dummies)
(pd.get_dummies(df1['value'], dtype=int)
   .rename(columns=str)
   .reindex(columns=df2.columns,
            index=df2.index,
            fill_value=0))
   0  1  2  3  4  5
0  1  0  0  0  0  0
1  0  0  0  1  0  0
2  1  0  0  0  0  0
3  0  0  0  0  1  0
4  0  0  1  0  0  0
5  0  0  0  0  0  1
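As a cross-check of the comparison described in the question, here is a minimal sketch that compares each value in df1 directly with every column name in df2 using NumPy broadcasting. The names `values`, `columns` and `result` are illustrative and not part of the original answer; this is an alternative under the same setup, not a replacement for `get_dummies`.

```python
import numpy as np
import pandas as pd

feature_list = [str(i) for i in range(6)]
df1 = pd.DataFrame({'value': [0, 3, 0, 4, 2, 5]})
df2 = pd.DataFrame(0, index=np.arange(6), columns=feature_list)

# Column names are strings, so convert the values to strings before comparing.
values = df1['value'].astype(str).to_numpy(dtype=object)[:, None]   # shape (6, 1)
columns = df2.columns.to_numpy(dtype=object)[None, :]               # shape (1, 6)

# Broadcasting yields a (6, 6) boolean grid: True where value == column name.
result = pd.DataFrame((values == columns).astype(int),
                      index=df2.index, columns=df2.columns)
print(result)
```

Because the value 1 never occurs in df1, column '1' stays all zeros, just as `reindex(..., fill_value=0)` leaves it in the answer above.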

Related

Dask - concatenate two same-column dataframes doesn't work

I have two dataframes without a header line, both with the same comma-separated columns.
I tried to read them into one dataframe with
dfoutputs = dd.read_csv(['outputsfile.csv', 'outputsfile2.csv'], names=colnames, header=None, dtype={'firstnr': 'Int64', 'secondnr': 'Int64', 'thirdnr': 'Int64', 'fourthnr': 'Int64'})
but this dataframe only contained outputsfile.csv rows.
Similar problem for reading and concat:
colnames=['firstnr', 'secondnr', 'thirdnr', 'fourthnr']
dfoutputs = dd.read_csv('outputsfile.csv', names=colnames, header=None, dtype={'firstnr': 'Int64', 'secondnr': 'Int64', 'thirdnr': 'Int64', 'fourthnr': 'Int64'})
print(dfoutputs.head(10))
dfoutputs2 = dd.read_csv('outputsfile2.csv', names=colnames, header=None, dtype={'firstnr': 'Int64', 'secondnr': 'Int64', 'thirdnr': 'Int64', 'fourthnr': 'Int64'})
print(dfoutputs2.head(10))
dfnew = dd.concat([dfoutputs, dfoutputs2])
print(dfnew.head(10))
Output:
   firstnr  secondnr  thirdnr    fourthnr
0        0         0        0  5000000000
1        1         0        0  5000000000
2        2         0        0  5000000000
3        3         0        0  5000000000
4        4         0        0  5000000000
5        5         0        0  5000000000
   firstnr  secondnr  thirdnr    fourthnr
0       11         0        0  5000000000
1       12         0        0  5000000000
   firstnr  secondnr  thirdnr    fourthnr
0        0         0        0  5000000000
1        1         0        0  5000000000
2        2         0        0  5000000000
3        3         0        0  5000000000
4        4         0        0  5000000000
5        5         0        0  5000000000
How can I combine both CSV files into the same Dask dataframe?
As suggested by TennisTechBoy in the comments:
# Append the contents of outputsfile2.csv to outputsfile.csv, line by line
with open("outputsfile.csv", "a") as f, open("outputsfile2.csv", "r") as f2:
    for line in f2:
        f.write(line)
A way to do this in Dask might be needed from a memory perspective.
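If the file-append workaround is used, memory can be kept bounded by copying in fixed-size chunks instead of reading all lines at once. The sketch below uses only the standard library; `append_file` and its `chunk_size` parameter are hypothetical names introduced here, and it assumes the destination file already ends with a newline.

```python
import shutil

def append_file(src_path, dst_path, chunk_size=1024 * 1024):
    """Append the bytes of src_path to dst_path, chunk_size bytes at a time."""
    # Binary mode avoids any newline translation while copying.
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        shutil.copyfileobj(src, dst, chunk_size)

# Example call, assuming both files exist:
# append_file("outputsfile2.csv", "outputsfile.csv")
```

Separately, it may be worth checking `len(dfnew)` before concluding that rows are missing: Dask's `head` inspects only the first partition by default, so `head(10)` can show just the rows of the first file even when both files were loaded. Passing `npartitions=-1` to `head` makes it look across all partitions.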

Python Pandas Dataframe cell value split

I am lost on how to split the binary values so that each (0 or 1) value takes up its own column of the data frame.
You can use concat with apply list:
df = pd.DataFrame({0:[1,2,3], 1:['1010','1100','0101']})
print (df)
   0     1
0  1  1010
1  2  1100
2  3  0101
df = pd.concat([df[0],
                df[1].apply(lambda x: pd.Series(list(x))).astype(int)],
               axis=1, ignore_index=True)
print (df)
   0  1  2  3  4
0  1  1  0  1  0
1  2  1  1  0  0
2  3  0  1  0  1
Another solution with DataFrame constructor:
df = pd.concat([df[0],
                pd.DataFrame(df[1].apply(list).values.tolist()).astype(int)],
               axis=1, ignore_index=True)
print (df)
   0  1  2  3  4
0  1  1  0  1  0
1  2  1  1  0  0
2  3  0  1  0  1
EDIT:
df = pd.DataFrame({0:['1010','1100','0101']})
df1 = pd.DataFrame(df[0].apply(list).values.tolist()).astype(int)
print (df1)
   0  1  2  3
0  1  0  1  0
1  1  1  0  0
2  0  1  0  1
But if you need lists:
df[0] = df[0].apply(lambda x: [int(y) for y in list(x)])
print (df)
0
0 [1, 0, 1, 0]
1 [1, 1, 0, 0]
2 [0, 1, 0, 1]
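For comparison, the same split can be sketched with a plain list comprehension instead of `apply`. The names `bits` and `out` are illustrative, and the sketch assumes every string contains only the characters 0 and 1 and that all strings have the same length.

```python
import pandas as pd

df = pd.DataFrame({0: [1, 2, 3], 1: ['1010', '1100', '0101']})

# One row of integer digits per string, aligned to the original index.
bits = pd.DataFrame([[int(ch) for ch in s] for s in df[1]], index=df.index)

# Put the id column in front; ignore_index renumbers the columns 0..4.
out = pd.concat([df[0], bits], axis=1, ignore_index=True)
print(out)
```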

Pandas Multilevel index for rows

This should be a simple thing, but after a few hours of searching, I'm still at a loss for what I'm doing wrong.
I've tried different methods using MultiIndex.from_* and multiple other things, but I just can't get this right.
I need something like:
But instead I get:
What am I doing wrong?
import pandas as pd
list_of_customers = ['Client1', 'Client2', 'Client3']
stat_index = ['max', 'current', 'min']
list_of_historic_timeframes = ['16:10', '16:20', '16:30']
timeblock = pd.DataFrame(index=([list_of_customers, stat_index]), columns=list_of_historic_timeframes)
timeblock.fillna(0, inplace=True)
print(timeblock)
list_of_customers = ['Client1', 'Client2', 'Client3']
stat_index = ['max', 'current', 'min']
list_of_historic_timeframes = ['16:10', '16:20', '16:30']
# Build the row index from every (customer, stat) combination
timeblock = pd.DataFrame(
    0,
    pd.MultiIndex.from_product(
        [list_of_customers, stat_index],
        names=['Customer', 'Stat']
    ),
    list_of_historic_timeframes
)
print(timeblock)
                  16:10  16:20  16:30
Customer Stat
Client1  max          0      0      0
         current      0      0      0
         min          0      0      0
Client2  max          0      0      0
         current      0      0      0
         min          0      0      0
Client3  max          0      0      0
         current      0      0      0
         min          0      0      0
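To illustrate why `from_product` gives the nine rows wanted here, the sketch below contrasts it with `from_arrays`, which pairs the lists element by element. As far as I can tell, passing a bare list of lists as `index` behaves like the `from_arrays` case, which would explain the three-row result in the question, but treat that as an assumption. The names `paired`, `crossed`, `client2` and `all_max` are illustrative.

```python
import pandas as pd

list_of_customers = ['Client1', 'Client2', 'Client3']
stat_index = ['max', 'current', 'min']
list_of_historic_timeframes = ['16:10', '16:20', '16:30']

# from_arrays zips the lists: (Client1, max), (Client2, current), (Client3, min).
paired = pd.MultiIndex.from_arrays([list_of_customers, stat_index])

# from_product builds every (customer, stat) combination.
crossed = pd.MultiIndex.from_product([list_of_customers, stat_index],
                                     names=['Customer', 'Stat'])

timeblock = pd.DataFrame(0, index=crossed, columns=list_of_historic_timeframes)

# Set one cell with a full (Customer, Stat) key, then read data back.
timeblock.loc[('Client2', 'current'), '16:20'] = 7
client2 = timeblock.loc['Client2']            # all stats for one customer
all_max = timeblock.xs('max', level='Stat')   # the 'max' row of every customer
print(len(paired), len(crossed))
```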

Copy numpy array into Panda multiindex (same size)

I have two matrices: a NumPy square matrix and a pandas multi-indexed square matrix. They are the same size. The idea is to copy the values from the NumPy matrix into the multi-indexed pandas matrix so the data is easier to navigate.
My matrices are around 100,000 x 100,000.
And my pandas matrix has three index levels.
tuples = [('1','A','a'), ('1','A','b'), ('1','A','c'), ('1','B','a'), ('1','B','b'), ('1','B','c'), ('2','A','a'), ('2','A','b'), ('2','B','a')]
index = pd.MultiIndex.from_tuples(tuples, names=['geography', 'product','activity'])
df = pd.DataFrame(index=index, columns=index)
geography                   1                 2
product                     A        B        A     B
activity                    a  b  c  a  b  c  a  b  a
geography product activity
1         A       a         0  0  0  0  0  0  0  0  0
                  b         0  0  0  0  0  0  0  0  0
                  c         0  0  0  0  0  0  0  0  0
          B       a         0  0  0  0  0  0  0  0  0
                  b         0  0  0  0  0  0  0  0  0
                  c         0  0  0  0  0  0  0  0  0
2         A       a         0  0  0  0  0  0  0  0  0
                  b         0  0  0  0  0  0  0  0  0
          B       a         0  0  0  0  0  0  0  0  0
np.random.rand(9,9)
array([[ 0.27302806, 0.33926193, 0.01489047, 0.71959889, 0.43500806,
0.03607795, 0.03747561, 0.43000199, 0.8091691 ],
[ 0.96626878, 0.37613022, 0.7739084 , 0.16724657, 0.01144436,
0.0107722 , 0.73513494, 0.13305542, 0.2910334 ],
[ 0.00622779, 0.93699165, 0.62725798, 0.25009469, 0.14010666,
0.61826728, 0.72060106, 0.58864557, 0.29375779],
[ 0.14937979, 0.45269751, 0.68450964, 0.15986812, 0.69879559,
0.06573519, 0.57504452, 0.49540882, 0.77283616],
[ 0.60933817, 0.2701683 , 0.69067959, 0.22806386, 0.79456502,
0.75107457, 0.2805325 , 0.27659171, 0.33446821],
[ 0.82860687, 0.27055835, 0.37684942, 0.18962783, 0.59885119,
0.31246936, 0.94522335, 0.53487273, 0.00611481],
[ 0.27683582, 0.23653112, 0.41250374, 0.5024068 , 0.27621212,
0.81379001, 0.6704781 , 0.87521485, 0.04577144],
[ 0.95516958, 0.21844023, 0.86558273, 0.52300142, 0.91328259,
0.7587479 , 0.15201837, 0.15376074, 0.12092142],
[ 0.36835891, 0.0381736 , 0.36473176, 0.30510363, 0.19433639,
0.43431018, 0.00112607, 0.35334684, 0.82307449]])
How can I put the values of the NumPy matrix into the pandas multi-index matrix? By construction the two matrices have the same structure, i.e. the NumPy matrix is the pandas one without the index labels.
I found a dozen examples of transforming a multi-index df into a NumPy array, but none going the other way. There was only one example with a 3-dimensional NumPy array, but mine is not a 3-D array.
Thanks to Divakar.
Simply assigning df[:] = np.random.rand(9,9) does the job.
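An alternative sketch, not taken from the thread: when the array already exists, it can be handed to the DataFrame constructor together with the MultiIndex for both axes, which avoids creating the empty frame first. The names `arr`, `row` and `value` are illustrative, and a small 9 x 9 random array stands in for the real data.

```python
import numpy as np
import pandas as pd

tuples = [('1', 'A', 'a'), ('1', 'A', 'b'), ('1', 'A', 'c'),
          ('1', 'B', 'a'), ('1', 'B', 'b'), ('1', 'B', 'c'),
          ('2', 'A', 'a'), ('2', 'A', 'b'), ('2', 'B', 'a')]
index = pd.MultiIndex.from_tuples(tuples, names=['geography', 'product', 'activity'])

# Stand-in for the real square matrix.
arr = np.random.default_rng(0).random((9, 9))

# Label rows and columns with the same MultiIndex; values keep their positions.
df = pd.DataFrame(arr, index=index, columns=index)

# Look up one cell by labels: row ('1','A','a'), column ('2','B','a').
row = df.loc[('1', 'A', 'a')]
value = row[('2', 'B', 'a')]
print(value)
```

Positions are preserved, so `value` corresponds to `arr[0, 8]` in this example.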

How to set (1) to max elements in pandas dataframe and (0) to everything else?

Let's say I have a pandas DataFrame.
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df:
a b c d e f
0 -1.238393 -0.755117 -0.228638 -0.077966 0.412947 0.887955
1 -0.342087 0.296171 0.177956 0.701668 -0.481744 -1.564719
2 0.610141 0.963873 -0.943182 -0.341902 0.326416 0.818899
3 -0.561572 0.063588 -0.195256 -1.637753 0.622627 0.845801
4 -2.506322 -1.631023 0.506860 0.368958 1.833260 0.623055
5 -1.313919 -1.758250 -1.082072 1.266158 0.427079 -1.018416
6 -0.781842 1.270133 -0.510879 -1.438487 -1.101213 -0.922821
7 -0.456999 0.234084 1.602635 0.611378 -1.147994 1.204318
8 0.497074 0.412695 -0.458227 0.431758 0.514382 -0.479150
9 -1.289392 -0.218624 0.122060 2.000832 -1.694544 0.773330
How do I set the row-wise max to 1 and all other elements to 0?
I came up with:
>>> for i in range(len(df)):
...     df.loc[i][df.loc[i].idxmax(axis=1)] = 1
...     df.loc[i][df.loc[i] != 1] = 0
generates
df:
a b c d e f
0 0 0 0 0 0 1
1 0 0 0 1 0 0
2 0 1 0 0 0 0
3 0 0 0 0 0 1
4 0 0 0 0 1 0
5 0 0 0 1 0 0
6 0 1 0 0 0 0
7 0 0 1 0 0 0
8 0 0 0 0 1 0
9 0 0 0 1 0 0
Does anyone have a better way of doing it? Maybe by getting rid of the for loop or applying a lambda?
Use max to get each row's maximum, compare against it with eq, and cast the resulting boolean df to int with astype; this converts True and False to 1 and 0:
In [21]:
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df
Out[21]:
a b c d e f
0 0.797000 0.762125 -0.330518 1.117972 0.817524 0.041670
1 0.517940 0.357369 -1.493552 -0.947396 3.082828 0.578126
2 1.784856 0.672902 -1.359771 -0.090880 -0.093100 1.099017
3 -0.493976 -0.390801 -0.521017 1.221517 -1.303020 1.196718
4 0.687499 -2.371322 -2.474101 -0.397071 0.132205 0.034631
5 0.573694 -0.206627 -0.106312 -0.661391 -0.257711 -0.875501
6 -0.415331 1.185901 1.173457 0.317577 -0.408544 -1.055770
7 -1.564962 -0.408390 -1.372104 -1.117561 -1.262086 -1.664516
8 -0.987306 0.738833 -1.207124 0.738084 1.118205 -0.899086
9 0.282800 -1.226499 1.658416 -0.381222 1.067296 -1.249829
In [22]:
df = df.eq(df.max(axis=1), axis=0).astype(int)
df
Out[22]:
a b c d e f
0 0 0 0 1 0 0
1 0 0 0 0 1 0
2 1 0 0 0 0 0
3 0 0 0 1 0 0
4 1 0 0 0 0 0
5 1 0 0 0 0 0
6 0 1 0 0 0 0
7 0 1 0 0 0 0
8 0 0 0 0 1 0
9 0 0 1 0 0 0
Timings
In [24]:
# @Raihan Masud's method
%timeit df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)
# mine
%timeit df.eq(df.max(axis=1), axis=0).astype(int)
100 loops, best of 3: 7.94 ms per loop
1000 loops, best of 3: 640 µs per loop
In [25]:
# @Nader Hisham's method
%%timeit
def max_binary(df):
    binary = np.where( df == df.max() , 1 , 0 )
    return binary

df.apply( max_binary , axis = 1)
100 loops, best of 3: 9.63 ms per loop
You can see that my method is over 12X faster than @Raihan's method
In [4]:
%%timeit
for i in range(len(df)):
    df.loc[i][df.loc[i].idxmax(axis=1)] = 1
    df.loc[i][df.loc[i] != 1] = 0
10 loops, best of 3: 21.1 ms per loop
The for loop is also significantly slower
import numpy as np
def max_binary(df):
    binary = np.where( df == df.max() , 1 , 0 )
    return binary

df.apply( max_binary , axis = 1)
Following Nader's pattern, this is a shorter version:
df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)
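For anyone who wants to reproduce the vectorized `eq`/`max` approach from the earlier answer, here is a small self-contained sketch with a fixed seed, plus a plain NumPy equivalent. The names `flags`, `arr` and `flags_np` are illustrative. Note that if a row contains its maximum more than once, every tied position is marked with 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 6)), columns=list('abcdef'))

# Compare each row against its own maximum (axis=0 aligns the max Series on the index).
flags = df.eq(df.max(axis=1), axis=0).astype(int)

# The same idea in NumPy: keepdims makes the row maxima broadcast across columns.
arr = df.to_numpy()
flags_np = (arr == arr.max(axis=1, keepdims=True)).astype(int)

print(flags)
```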