Does np.array's astype prevent future edits in DataFrames? - numpy

I can change the first entry of the DataFrame initially:
In [6]: df = pd.DataFrame(np.random.rand(5,2))
In [7]: df
Out[7]:
0 1
0 0.514592 0.459589
1 0.329704 0.409099
2 0.061246 0.966191
3 0.336747 0.908513
4 0.169220 0.468437
In [8]: df.ix[0][0] = 1
In [9]: df
Out[9]:
0 1
0 1.000000 0.459589
1 0.329704 0.409099
2 0.061246 0.966191
3 0.336747 0.908513
4 0.169220 0.468437
But after I do this:
In [10]: df[0] = np.floor(df.index / 10).astype(int) * 10
In [11]: df
Out[11]:
0 1
0 0 0.459589
1 0 0.409099
2 0 0.966191
3 0 0.908513
4 0 0.468437
I can't find a way to change it.
In [12]: df.ix[0][0] = 1
In [13]: df
Out[13]:
0 1
0 0 0.459589
1 0 0.409099
2 0 0.966191
3 0 0.908513
4 0 0.468437
And I can't even change elements in other columns:
In [16]: df.ix[0][1] = 1
In [17]: df
Out[17]:
0 1
0 0 0.459589
1 0 0.409099
2 0 0.966191
3 0 0.908513
4 0 0.468437
What's up with this?

You are editing a copy. df.ix[0][0] is chained indexing: once column 0 has a different dtype than column 1, df.ix[0] must assemble the row into a new object, so the assignment lands on that copy and never reaches df. Use a single indexer instead:
In [3]: df = pd.DataFrame(np.random.rand(5,2))
In [4]: df[0] = np.floor(df.index / 10).astype(int) * 10
In [5]: df
Out[5]:
0 1
0 0 0.201611
1 0 0.390364
2 0 0.727422
3 0 0.941035
4 0 0.036764
In [6]: df.ix[0,1] = 1
In [7]: df
Out[7]:
0 1
0 0 1.000000
1 0 0.390364
2 0 0.727422
3 0 0.941035
4 0 0.036764
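Note that .ix has since been deprecated and removed from pandas. A sketch of the same fix with the current accessors (.loc label-based, .iat position-based), assuming only the setup from the question; newer pandas versions warn on chained assignment, and with copy-on-write ignore it entirely:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 2))
df[0] = np.floor(df.index / 10).astype(int) * 10  # column 0 is now integer

# Chained indexing (df.loc[0][0] = ...) may assign into a temporary copy
# once the columns have different dtypes; a single indexer writes to df itself.
df.loc[0, 1] = 1.0   # label-based
df.iat[0, 0] = 1     # position-based
```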

Related

Dask - concatenate two same-column dataframes doesn't work

I have two dataframes without a header line, both with the same comma-separated columns.
I tried to read them into one dataframe with
dfoutputs = dd.read_csv(['outputsfile.csv', 'outputsfile2.csv'], names=colnames, header=None, dtype={'firstnr': 'Int64', 'secondnr': 'Int64', 'thirdnr': 'Int64', 'fourthnr': 'Int64'})
but this dataframe only contained outputsfile.csv rows.
The same problem occurs when reading the files separately and concatenating:
colnames=['firstnr', 'secondnr', 'thirdnr', 'fourthnr']
dfoutputs = dd.read_csv('outputsfile.csv', names=colnames, header=None, dtype={'firstnr': 'Int64', 'secondnr': 'Int64', 'thirdnr': 'Int64', 'fourthnr': 'Int64'})
print(dfoutputs.head(10))
dfoutputs2 = dd.read_csv('outputsfile2.csv', names=colnames, header=None, dtype={'firstnr': 'Int64', 'secondnr': 'Int64', 'thirdnr': 'Int64', 'fourthnr': 'Int64'})
print(dfoutputs2.head(10))
dfnew = dd.concat([dfoutputs, dfoutputs2])
print(dfnew.head(10))
Output:
firstnr secondnr thirdnr fourthnr
0 0 0 0 5000000000
1 1 0 0 5000000000
2 2 0 0 5000000000
3 3 0 0 5000000000
4 4 0 0 5000000000
5 5 0 0 5000000000
firstnr secondnr thirdnr fourthnr
0 11 0 0 5000000000
1 12 0 0 5000000000
firstnr secondnr thirdnr fourthnr
0 0 0 0 5000000000
1 1 0 0 5000000000
2 2 0 0 5000000000
3 3 0 0 5000000000
4 4 0 0 5000000000
5 5 0 0 5000000000
How can I combine both csv's to the same Dask dataframe?
As suggested by TennisTechBoy in the comments:
f = open("outputsfile.csv", "a")
f2 = open("outputsfile2.csv", "r")
f2content = f2.readlines()
for i in range(len(f2content)):
    f.write(f2content[i])
f.close()
f2.close()
A way to do this in Dask might be needed from a memory perspective.
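The file-level append above loads the whole second file into memory via readlines(). As a hedged alternative (stdlib only, not Dask), the copy can be streamed in fixed-size chunks; append_file is a hypothetical helper name, and the filenames from the question would be passed as append_file('outputsfile.csv', 'outputsfile2.csv'):

```python
import shutil

def append_file(dst_path, src_path, chunk_size=1 << 20):
    # Stream src onto the end of dst in 1 MiB chunks,
    # never holding more than one chunk in memory.
    with open(dst_path, "ab") as dst, open(src_path, "rb") as src:
        shutil.copyfileobj(src, dst, chunk_size)
```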

Pandas Dataframe Manipulation logic

Can you please help with the problem below?
Given two dataframes df1 and df2, I need to produce something like the result dataframe.
import pandas as pd
import numpy as np
feature_list = [ str(i) for i in range(6)]
df1 = pd.DataFrame( {'value' : [0,3,0,4,2,5]})
df2 = pd.DataFrame(0, index=np.arange(6), columns=feature_list)
Expected DataFrame: the result should be driven by comparing the values from df1 with the column names (features) in df2; where they match, we put 1 in resultDf.
I think you need:
(pd.get_dummies(df1['value'])
   .rename(columns=str)
   .reindex(columns=df2.columns,
            index=df2.index,
            fill_value=0))
0 1 2 3 4 5
0 1 0 0 0 0 0
1 0 0 0 1 0 0
2 1 0 0 0 0 0
3 0 0 0 0 1 0
4 0 0 1 0 0 0
5 0 0 0 0 0 1
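Putting the answer together as a self-contained sketch, using the question's setup. The trailing .astype(int) is my addition, since newer pandas versions return booleans from get_dummies:

```python
import numpy as np
import pandas as pd

feature_list = [str(i) for i in range(6)]
df1 = pd.DataFrame({'value': [0, 3, 0, 4, 2, 5]})
df2 = pd.DataFrame(0, index=np.arange(6), columns=feature_list)

# One dummy column per distinct value, renamed to string labels,
# then aligned onto df2's full column/index grid with 0-fill.
result = (pd.get_dummies(df1['value'])
          .rename(columns=str)
          .reindex(columns=df2.columns, index=df2.index, fill_value=0)
          .astype(int))
```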

Python Pandas Dataframe cell value split

I am lost on how to split the binary values so that each 0/1 digit takes up its own column of the DataFrame.
You can use concat with apply, converting each string to a list:
df = pd.DataFrame({0:[1,2,3], 1:['1010','1100','0101']})
print (df)
0 1
0 1 1010
1 2 1100
2 3 0101
df = pd.concat([df[0],
                df[1].apply(lambda x: pd.Series(list(x))).astype(int)],
               axis=1, ignore_index=True)
print (df)
0 1 2 3 4
0 1 1 0 1 0
1 2 1 1 0 0
2 3 0 1 0 1
Another solution with DataFrame constructor:
df = pd.concat([df[0],
                pd.DataFrame(df[1].apply(list).values.tolist()).astype(int)],
               axis=1, ignore_index=True)
print (df)
0 1 2 3 4
0 1 1 0 1 0
1 2 1 1 0 0
2 3 0 1 0 1
EDIT:
df = pd.DataFrame({0:['1010','1100','0101']})
df1 = pd.DataFrame(df[0].apply(list).values.tolist()).astype(int)
print (df1)
0 1 2 3
0 1 0 1 0
1 1 1 0 0
2 0 1 0 1
But if need lists:
df[0] = df[0].apply(lambda x: [int(y) for y in list(x)])
print (df)
0
0 [1, 0, 1, 0]
1 [1, 1, 0, 0]
2 [0, 1, 0, 1]
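For completeness, the constructor-based variant above as one runnable sketch, on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({0: [1, 2, 3], 1: ['1010', '1100', '0101']})

# Each string becomes a list of characters, the lists become the rows
# of a new frame, and the digit characters are cast to int.
bits = pd.DataFrame(df[1].apply(list).tolist(), index=df.index).astype(int)
out = pd.concat([df[0], bits], axis=1, ignore_index=True)
```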

copy_blanks(df,column) should copy the value in the original column to the last column for all values where the original column is blank

def copy_blanks(df, column):
    ...
I need something like the stub above; please suggest an implementation.
Input:
e-mail,number
n#gmail.com,0
p#gmail.com,1
h#gmail.com,0
s#gmail.com,0
l#gmail.com,1
v#gmail.com,0
,0
There is also a default_value option that can take any value; when it is used, that value is inserted for blanks, like below:
e-mail,number
n#gmail.com,0
p#gmail.com,1
h#gmail.com,0
s#gmail.com,0
l#gmail.com,1
v#gmail.com,0
NA,0
The function should support both default_value and skip_blanks options: when skip_blanks is True, default_value should not be applied; when skip_blanks is False, it should be.
my output:
e-mail,number,e-mail_clean
n#gmail.com,0,n#gmail.com
p#gmail.com,1,p#gmail.com
h#gmail.com,0,h#gmail.com
s#gmail.com,0,s#gmail.com
l#gmail.com,1,l#gmail.com
v#gmail.com,0,v#gmail.com
,0,
Consider your sample df:
df = pd.DataFrame([
    ['n#gmail.com', 0],
    ['p#gmail.com', 1],
    ['h#gmail.com', 0],
    ['s#gmail.com', 0],
    ['l#gmail.com', 1],
    ['v#gmail.com', 0],
    ['', 0]
], columns=['e-mail', 'number'])
print(df)
e-mail number
0 n#gmail.com 0
1 p#gmail.com 1
2 h#gmail.com 0
3 s#gmail.com 0
4 l#gmail.com 1
5 v#gmail.com 0
6 0
If I understand you correctly:
def copy_blanks(df, column, skip_blanks=False, default_value='NA'):
    df = df.copy()
    s = df[column]
    if not skip_blanks:
        s = s.replace('', default_value)
    df['{}_clean'.format(column)] = s
    return df
copy_blanks(df, 'e-mail', skip_blanks=False)
e-mail number e-mail_clean
0 n#gmail.com 0 n#gmail.com
1 p#gmail.com 1 p#gmail.com
2 h#gmail.com 0 h#gmail.com
3 s#gmail.com 0 s#gmail.com
4 l#gmail.com 1 l#gmail.com
5 v#gmail.com 0 v#gmail.com
6 0 NA
copy_blanks(df, 'e-mail', skip_blanks=True)
e-mail number e-mail_clean
0 n#gmail.com 0 n#gmail.com
1 p#gmail.com 1 p#gmail.com
2 h#gmail.com 0 h#gmail.com
3 s#gmail.com 0 s#gmail.com
4 l#gmail.com 1 l#gmail.com
5 v#gmail.com 0 v#gmail.com
6 0
copy_blanks(df, 'e-mail', skip_blanks=False, default_value='new#gmail.com')
e-mail number e-mail_clean
0 n#gmail.com 0 n#gmail.com
1 p#gmail.com 1 p#gmail.com
2 h#gmail.com 0 h#gmail.com
3 s#gmail.com 0 s#gmail.com
4 l#gmail.com 1 l#gmail.com
5 v#gmail.com 0 v#gmail.com
6 0 new#gmail.com
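The answer's function end-to-end as a runnable sketch, with both switches exercised on a two-row sample taken from the question's data:

```python
import pandas as pd

def copy_blanks(df, column, skip_blanks=False, default_value='NA'):
    # Copy `column` into `<column>_clean`; blank cells receive
    # default_value unless skip_blanks is True.
    df = df.copy()
    s = df[column]
    if not skip_blanks:
        s = s.replace('', default_value)
    df['{}_clean'.format(column)] = s
    return df

df = pd.DataFrame({'e-mail': ['n#gmail.com', ''], 'number': [0, 0]})
filled = copy_blanks(df, 'e-mail')                  # blanks -> 'NA'
kept = copy_blanks(df, 'e-mail', skip_blanks=True)  # blanks left as-is
```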

How to set (1) to max elements in pandas dataframe and (0) to everything else?

Let's say I have a pandas DataFrame.
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df:
a b c d e f
0 -1.238393 -0.755117 -0.228638 -0.077966 0.412947 0.887955
1 -0.342087 0.296171 0.177956 0.701668 -0.481744 -1.564719
2 0.610141 0.963873 -0.943182 -0.341902 0.326416 0.818899
3 -0.561572 0.063588 -0.195256 -1.637753 0.622627 0.845801
4 -2.506322 -1.631023 0.506860 0.368958 1.833260 0.623055
5 -1.313919 -1.758250 -1.082072 1.266158 0.427079 -1.018416
6 -0.781842 1.270133 -0.510879 -1.438487 -1.101213 -0.922821
7 -0.456999 0.234084 1.602635 0.611378 -1.147994 1.204318
8 0.497074 0.412695 -0.458227 0.431758 0.514382 -0.479150
9 -1.289392 -0.218624 0.122060 2.000832 -1.694544 0.773330
How do I set 1 at the row-wise max and 0 for the other elements?
I came up with:
>>> for i in range(len(df)):
...     df.loc[i][df.loc[i].idxmax(axis=1)] = 1
...     df.loc[i][df.loc[i] != 1] = 0
generates
df:
a b c d e f
0 0 0 0 0 0 1
1 0 0 0 1 0 0
2 0 1 0 0 0 0
3 0 0 0 0 0 1
4 0 0 0 0 1 0
5 0 0 0 1 0 0
6 0 1 0 0 0 0
7 0 0 1 0 0 0
8 0 0 0 0 1 0
9 0 0 0 1 0 0
Does anyone have a better way of doing it, maybe by getting rid of the for loop or by applying a lambda?
Use max, check for equality using eq, and cast the boolean df to int using astype; this converts True and False to 1 and 0:
In [21]:
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df
Out[21]:
a b c d e f
0 0.797000 0.762125 -0.330518 1.117972 0.817524 0.041670
1 0.517940 0.357369 -1.493552 -0.947396 3.082828 0.578126
2 1.784856 0.672902 -1.359771 -0.090880 -0.093100 1.099017
3 -0.493976 -0.390801 -0.521017 1.221517 -1.303020 1.196718
4 0.687499 -2.371322 -2.474101 -0.397071 0.132205 0.034631
5 0.573694 -0.206627 -0.106312 -0.661391 -0.257711 -0.875501
6 -0.415331 1.185901 1.173457 0.317577 -0.408544 -1.055770
7 -1.564962 -0.408390 -1.372104 -1.117561 -1.262086 -1.664516
8 -0.987306 0.738833 -1.207124 0.738084 1.118205 -0.899086
9 0.282800 -1.226499 1.658416 -0.381222 1.067296 -1.249829
In [22]:
df = df.eq(df.max(axis=1), axis=0).astype(int)
df
Out[22]:
a b c d e f
0 0 0 0 1 0 0
1 0 0 0 0 1 0
2 1 0 0 0 0 0
3 0 0 0 1 0 0
4 1 0 0 0 0 0
5 1 0 0 0 0 0
6 0 1 0 0 0 0
7 0 1 0 0 0 0
8 0 0 0 0 1 0
9 0 0 1 0 0 0
Timings
In [24]:
# @Raihan Masud's method
%timeit df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)
# mine
%timeit df.eq(df.max(axis=1), axis=0).astype(int)
100 loops, best of 3: 7.94 ms per loop
1000 loops, best of 3: 640 µs per loop
In [25]:
# @Nader Hisham's method
%%timeit
def max_binary(df):
    binary = np.where(df == df.max(), 1, 0)
    return binary

df.apply(max_binary, axis=1)
100 loops, best of 3: 9.63 ms per loop
You can see that my method is over 12X faster than @Raihan's method.
In [4]:
%%timeit
for i in range(len(df)):
    df.loc[i][df.loc[i].idxmax(axis=1)] = 1
    df.loc[i][df.loc[i] != 1] = 0
10 loops, best of 3: 21.1 ms per loop
The for loop is also significantly slower
import numpy as np

def max_binary(df):
    binary = np.where(df == df.max(), 1, 0)
    return binary

df.apply(max_binary, axis=1)
Following Nader's pattern, this is a shorter version:
df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)
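One caveat worth noting (my addition, not from the answers): the eq-based mask marks every cell that ties for the row maximum, while the idxmax-based loop marks only the first. A sketch of both behaviors on a frame with a tie:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [1.0, 0.5]})

# eq-based: every cell equal to the row max gets a 1 (ties -> several 1s)
mask_all = df.eq(df.max(axis=1), axis=0).astype(int)

# idxmax-based: exactly one 1 per row (first column wins on a tie)
arr = np.zeros(df.shape, dtype=int)
arr[np.arange(len(df)), df.columns.get_indexer(df.idxmax(axis=1))] = 1
mask_one = pd.DataFrame(arr, index=df.index, columns=df.columns)
```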