Python Pandas DataFrame cell value split

I am lost on how to split the binary strings so that each digit (0 or 1) takes up its own column of the data frame.

You can use concat together with apply, turning each string into a list of characters:
import pandas as pd
df = pd.DataFrame({0:[1,2,3], 1:['1010','1100','0101']})
print (df)
0 1
0 1 1010
1 2 1100
2 3 0101
df = pd.concat([df[0],
df[1].apply(lambda x: pd.Series(list(x))).astype(int)],
axis=1, ignore_index=True)
print (df)
0 1 2 3 4
0 1 1 0 1 0
1 2 1 1 0 0
2 3 0 1 0 1
Another solution uses the DataFrame constructor (starting again from the original df, not the reassigned one):
df = pd.concat([df[0],
pd.DataFrame(df[1].apply(list).values.tolist()).astype(int)],
axis=1, ignore_index=True)
print (df)
0 1 2 3 4
0 1 1 0 1 0
1 2 1 1 0 0
2 3 0 1 0 1
EDIT: if the frame holds only the column of binary strings:
df = pd.DataFrame({0:['1010','1100','0101']})
df1 = pd.DataFrame(df[0].apply(list).values.tolist()).astype(int)
print (df1)
0 1 2 3
0 1 0 1 0
1 1 1 0 0
2 0 1 0 1
But if you need lists instead:
df[0] = df[0].apply(lambda x: [int(y) for y in x])
print (df)
0
0 [1, 0, 1, 0]
1 [1, 1, 0, 0]
2 [0, 1, 0, 1]
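For reference, here is a minimal self-contained sketch of the same idea with named columns. The column names id and bits and the bit_ prefix are my own assumptions, not part of the question, and it assumes every string has the same length (ragged strings would leave missing values and astype(int) would fail).

```python
import pandas as pd

# Hypothetical column names; the question's frame uses 0 and 1 instead.
df = pd.DataFrame({"id": [1, 2, 3], "bits": ["1010", "1100", "0101"]})

# One list of characters per row, then one integer column per character.
bit_columns = pd.DataFrame(
    [list(s) for s in df["bits"]],
    index=df.index,
).astype(int).add_prefix("bit_")

# Keep the id column and attach the new bit columns by index.
result = df[["id"]].join(bit_columns)
print(result)
```

Passing index=df.index keeps the join aligned even when the original frame does not have a default 0..n-1 index.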

Related

Pandas index clause across multiple columns in a multi-column header

I have a data frame with multi-column headers.
import pandas as pd
headers = pd.MultiIndex.from_tuples([("A", "u"), ("A", "v"), ("B", "x"), ("B", "y")])
f = pd.DataFrame([[1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]], columns = headers)
f
A B
u v x y
0 1 1 0 1
1 1 0 0 0
2 0 0 1 1
3 1 0 1 0
I want to select the rows in which at least one of the A columns is true, or likewise at least one of the B columns.
I can do so explicitly.
f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]
A B
u v x y
0 1 1 0 1
1 1 0 0 0
3 1 0 1 0
f[f["B"]["x"].astype(bool) | f["B"]["y"].astype(bool)]
A B
u v x y
0 1 1 0 1
2 0 0 1 1
3 1 0 1 0
I want to write a function select(f, top_level_name) where the indexing clause applies to all the columns under the same top level name such that
select(f, "A") == f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]
select(f, "B") == f[f["B"]["x"].astype(bool) | f["B"]["y"].astype(bool)]
I want this function to work with arbitrary numbers of sub-columns with arbitrary names.
How do I write select?
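One possible sketch of select, not a definitive answer: selecting the top-level name returns the sub-frame of all its sub-columns, so a row-wise any covers an arbitrary number of them. The explicit masks in the question combine sub-columns with |, which is why any(axis=1) is used here; swap in all(axis=1) if every sub-column must be true.

```python
import pandas as pd

headers = pd.MultiIndex.from_tuples([("A", "u"), ("A", "v"), ("B", "x"), ("B", "y")])
f = pd.DataFrame(
    [[1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]],
    columns=headers,
)

def select(frame, top_level_name):
    # frame[top_level_name] is the sub-frame of every column under that name.
    mask = frame[top_level_name].astype(bool).any(axis=1)
    return frame[mask]

print(select(f, "A"))
print(select(f, "B"))
```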

a list as a sublist of a list from group into list

I have a dataframe, which has 2 columns,
a b
0 1 2
1 1 1
2 1 1
3 1 2
4 1 1
5 2 0
6 2 1
7 2 1
8 2 2
9 2 2
10 2 1
11 2 1
12 2 2
Is there a direct way to make a third column as below
a b c
0 1 2 0
1 1 1 1
2 1 1 0
3 1 2 1
4 1 1 0
5 2 0 0
6 2 1 1
7 2 1 0
8 2 2 1
9 2 2 0
10 2 1 0
11 2 1 0
12 2 2 0
Here the target [1, 2] is a sublist (gaps allowed) of each group's list from df.groupby('a').b.apply(list), and c should mark the first 2 rows in every group that match the target in order.
df.groupby('a').b.apply(list) gives
1 [2, 1, 1, 2, 1]
2 [0, 1, 1, 2, 2, 1, 1, 2]
[1, 2] is a sublist (with gaps allowed) of both [2, 1, 1, 2, 1] and [0, 1, 1, 2, 2, 1, 1, 2].
So far, I have this function:
def is_sub_with_gap(sub, lst):
    '''
    check if sub is a sublist of lst
    '''
    ln, j = len(sub), 0
    ans = []
    for i, ele in enumerate(lst):
        if ele == sub[j]:
            j += 1
            ans.append(i)
            if j == ln:
                return True, ans
    return False, []
A test of the function:
In [55]: is_sub_with_gap([1,2], [2, 1, 1, 2, 1])
Out[55]: (True, [1, 3])
You can change the custom function so that it returns the index values of the matched rows in each group, flatten the result with Series.explode, and then test the DataFrame's index against those values with Index.isin:
L = [1, 2]
def is_sub_with_gap(sub, lst):
    '''
    check if sub is a sublist of lst
    '''
    ln, j = len(sub), 0
    ans = []
    for i, ele in enumerate(lst):
        if ele == sub[j]:
            j += 1
            ans.append(i)
            if j == ln:
                # lst is the group's Series, so map positions back to index labels
                return lst.index[ans]
    return []
idx = df.groupby('a').b.apply(lambda x: is_sub_with_gap(L, x)).explode()
df['c'] = df.index.isin(idx).view('i1')
print (df)
a b c
0 1 2 0
1 1 1 1
2 1 1 0
3 1 2 1
4 1 1 0
5 2 0 0
6 2 1 1
7 2 1 0
8 2 2 1
9 2 2 0
10 2 1 0
11 2 1 0
12 2 2 0
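If you prefer to avoid explode, the same result can be sketched with a plain loop over the groups. The helper name first_subsequence_labels is my own, and this is only one way to write it, under the assumption that the target is non-empty.

```python
import pandas as pd

def first_subsequence_labels(values, target):
    """Index labels of the first in-order match of target in values, or []."""
    labels, j = [], 0
    for label, value in values.items():
        if value == target[j]:
            labels.append(label)
            j += 1
            if j == len(target):
                return labels
    return []  # the target was not fully matched in this group

df = pd.DataFrame({
    "a": [1] * 5 + [2] * 8,
    "b": [2, 1, 1, 2, 1, 0, 1, 1, 2, 2, 1, 1, 2],
})

# Collect the matched labels group by group, then flag them in one pass.
matched = []
for _, group in df.groupby("a"):
    matched.extend(first_subsequence_labels(group["b"], [1, 2]))

df["c"] = df.index.isin(matched).astype(int)
print(df)
```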

copy_blanks(df,column) should copy the value in the original column to the last column for all values where the original column is blank

def copy_blanks(df, column):
I need a function with this signature. Please suggest how to write it.
Input:
e-mail,number
n#gmail.com,0
p#gmail.com,1
h#gmail.com,0
s#gmail.com,0
l#gmail.com,1
v#gmail.com,0
,0
There is also a default_value option, which can be set to any value. When it is used, blank cells are filled with that value, as below:
e-mail,number
n#gmail.com,0
p#gmail.com,1
h#gmail.com,0
s#gmail.com,0
l#gmail.com,1
v#gmail.com,0
NA,0
The function needs both the default_value and skip_blank options. When skip_blank is true, the default value should not be applied; when skip_blank is false, the default value should be applied.
My output:
e-mail,number,e-mail_clean
n#gmail.com,0,n#gmail.com
p#gmail.com,1,p#gmail.com
h#gmail.com,0,h#gmail.com
s#gmail.com,0,s#gmail.com
l#gmail.com,1,l#gmail.com
v#gmail.com,0,v#gmail.com
,0,
Consider your sample df:
df = pd.DataFrame([
['n#gmail.com', 0],
['p#gmail.com', 1],
['h#gmail.com', 0],
['s#gmail.com', 0],
['l#gmail.com', 1],
['v#gmail.com', 0],
['', 0]
], columns=['e-mail','number'])
print(df)
e-mail number
0 n#gmail.com 0
1 p#gmail.com 1
2 h#gmail.com 0
3 s#gmail.com 0
4 l#gmail.com 1
5 v#gmail.com 0
6 0
If I understand you correctly:
def copy_blanks(df, column, skip_blanks=False, default_value='NA'):
    df = df.copy()
    s = df[column]
    if not skip_blanks:
        s = s.replace('', default_value)
    df['{}_clean'.format(column)] = s
    return df
copy_blanks(df, 'e-mail', skip_blanks=False)
e-mail number e-mail_clean
0 n#gmail.com 0 n#gmail.com
1 p#gmail.com 1 p#gmail.com
2 h#gmail.com 0 h#gmail.com
3 s#gmail.com 0 s#gmail.com
4 l#gmail.com 1 l#gmail.com
5 v#gmail.com 0 v#gmail.com
6 0 NA
copy_blanks(df, 'e-mail', skip_blanks=True)
e-mail number e-mail_clean
0 n#gmail.com 0 n#gmail.com
1 p#gmail.com 1 p#gmail.com
2 h#gmail.com 0 h#gmail.com
3 s#gmail.com 0 s#gmail.com
4 l#gmail.com 1 l#gmail.com
5 v#gmail.com 0 v#gmail.com
6 0
copy_blanks(df, 'e-mail', skip_blanks=False, default_value='new#gmail.com')
e-mail number e-mail_clean
0 n#gmail.com 0 n#gmail.com
1 p#gmail.com 1 p#gmail.com
2 h#gmail.com 0 h#gmail.com
3 s#gmail.com 0 s#gmail.com
4 l#gmail.com 1 l#gmail.com
5 v#gmail.com 0 v#gmail.com
6 0 new#gmail.com
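One caveat worth sketching: if the data comes from pd.read_csv rather than a hand-built frame, an empty field is read as NaN by default, not as an empty string, so replace('', ...) alone would not touch it. The variant below, with the hypothetical name copy_blanks_csv, handles both cases; treat it as an assumption about how your data is loaded.

```python
import io
import pandas as pd

def copy_blanks_csv(df, column, skip_blanks=False, default_value="NA"):
    # Hypothetical variant of copy_blanks that also treats NaN as blank.
    out = df.copy()
    cleaned = out[column]
    if not skip_blanks:
        cleaned = cleaned.fillna(default_value).replace("", default_value)
    out[f"{column}_clean"] = cleaned
    return out

csv_text = "e-mail,number\nn#gmail.com,0\n,0\n"
raw = pd.read_csv(io.StringIO(csv_text))  # the empty e-mail becomes NaN

filled = copy_blanks_csv(raw, "e-mail")
kept = copy_blanks_csv(raw, "e-mail", skip_blanks=True)
print(filled)
print(kept)
```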

How to set (1) to max elements in pandas dataframe and (0) to everything else?

Let's say I have a pandas DataFrame.
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df:
a b c d e f
0 -1.238393 -0.755117 -0.228638 -0.077966 0.412947 0.887955
1 -0.342087 0.296171 0.177956 0.701668 -0.481744 -1.564719
2 0.610141 0.963873 -0.943182 -0.341902 0.326416 0.818899
3 -0.561572 0.063588 -0.195256 -1.637753 0.622627 0.845801
4 -2.506322 -1.631023 0.506860 0.368958 1.833260 0.623055
5 -1.313919 -1.758250 -1.082072 1.266158 0.427079 -1.018416
6 -0.781842 1.270133 -0.510879 -1.438487 -1.101213 -0.922821
7 -0.456999 0.234084 1.602635 0.611378 -1.147994 1.204318
8 0.497074 0.412695 -0.458227 0.431758 0.514382 -0.479150
9 -1.289392 -0.218624 0.122060 2.000832 -1.694544 0.773330
How do I set the row-wise max to 1 and all other elements to 0?
I came up with:
>>> for i in range(len(df)):
...     df.loc[i][df.loc[i].idxmax(axis=1)] = 1
...     df.loc[i][df.loc[i] != 1] = 0
generates
df:
a b c d e f
0 0 0 0 0 0 1
1 0 0 0 1 0 0
2 0 1 0 0 0 0
3 0 0 0 0 0 1
4 0 0 0 0 1 0
5 0 0 0 1 0 0
6 0 1 0 0 0 0
7 0 0 1 0 0 0
8 0 0 0 0 1 0
9 0 0 0 1 0 0
Does anyone have a better way of doing it? Maybe by getting rid of the for loop or applying a lambda?
Use max to get each row's maximum, compare against it with eq, and cast the resulting boolean df to int with astype. This converts True and False to 1 and 0:
In [21]:
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df
Out[21]:
a b c d e f
0 0.797000 0.762125 -0.330518 1.117972 0.817524 0.041670
1 0.517940 0.357369 -1.493552 -0.947396 3.082828 0.578126
2 1.784856 0.672902 -1.359771 -0.090880 -0.093100 1.099017
3 -0.493976 -0.390801 -0.521017 1.221517 -1.303020 1.196718
4 0.687499 -2.371322 -2.474101 -0.397071 0.132205 0.034631
5 0.573694 -0.206627 -0.106312 -0.661391 -0.257711 -0.875501
6 -0.415331 1.185901 1.173457 0.317577 -0.408544 -1.055770
7 -1.564962 -0.408390 -1.372104 -1.117561 -1.262086 -1.664516
8 -0.987306 0.738833 -1.207124 0.738084 1.118205 -0.899086
9 0.282800 -1.226499 1.658416 -0.381222 1.067296 -1.249829
In [22]:
df = df.eq(df.max(axis=1), axis=0).astype(int)
df
Out[22]:
a b c d e f
0 0 0 0 1 0 0
1 0 0 0 0 1 0
2 1 0 0 0 0 0
3 0 0 0 1 0 0
4 1 0 0 0 0 0
5 1 0 0 0 0 0
6 0 1 0 0 0 0
7 0 1 0 0 0 0
8 0 0 0 0 1 0
9 0 0 1 0 0 0
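One thing to be aware of with the eq approach: if a row contains its maximum more than once, every tied cell gets a 1. With random floats that is unlikely, but with integer data it can matter. A small sketch on made-up data (the frame below is my own, not from the question) showing both behaviours, with a NumPy argmax variant that flags only the first maximum per row:

```python
import numpy as np
import pandas as pd

# Made-up data with ties in rows 1 and 2.
df = pd.DataFrame({"a": [1.0, 5.0, 2.0], "b": [3.0, 5.0, 0.0], "c": [2.0, 1.0, 2.0]})

# Every cell equal to its row maximum becomes 1 (ties all flagged).
all_max = df.eq(df.max(axis=1), axis=0).astype(int)

# Only the first maximum in each row becomes 1.
values = df.to_numpy()
first_only = np.zeros(values.shape, dtype=int)
first_only[np.arange(len(df)), values.argmax(axis=1)] = 1
first_max = pd.DataFrame(first_only, index=df.index, columns=df.columns)

print(all_max)
print(first_max)
```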
Timings
In [24]:
# @Raihan Masud's method
%timeit df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)
# mine
%timeit df.eq(df.max(axis=1), axis=0).astype(int)
100 loops, best of 3: 7.94 ms per loop
1000 loops, best of 3: 640 µs per loop
In [25]:
# @Nader Hisham's method
%%timeit
def max_binary(df):
    binary = np.where( df == df.max() , 1 , 0 )
    return binary

df.apply( max_binary , axis = 1)
100 loops, best of 3: 9.63 ms per loop
You can see that my method (640 µs) is over 12X faster than @Raihan's method (7.94 ms).
In [4]:
%%timeit
for i in range(len(df)):
    df.loc[i][df.loc[i].idxmax(axis=1)] = 1
    df.loc[i][df.loc[i] != 1] = 0
10 loops, best of 3: 21.1 ms per loop
The for loop is slower still, at 21.1 ms per loop.
import numpy as np
def max_binary(df):
    binary = np.where( df == df.max() , 1 , 0 )
    return binary
df.apply( max_binary , axis = 1)
Following Nader's pattern, this is a shorter version:
df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)

Does np.array's astype prevent future edits in DataFrames?

I can change the first entry of the DataFrame initially:
In [6]: df = pd.DataFrame(np.random.rand(5,2))
In [7]: df
Out[7]:
0 1
0 0.514592 0.459589
1 0.329704 0.409099
2 0.061246 0.966191
3 0.336747 0.908513
4 0.169220 0.468437
In [8]: df.ix[0][0] = 1
In [9]: df
Out[9]:
0 1
0 1.000000 0.459589
1 0.329704 0.409099
2 0.061246 0.966191
3 0.336747 0.908513
4 0.169220 0.468437
But after I do this:
In [10]: df[0] = np.floor(df.index / 10).astype(int) * 10
In [11]: df
Out[11]:
0 1
0 0 0.459589
1 0 0.409099
2 0 0.966191
3 0 0.908513
4 0 0.468437
I can't find a way to change it.
In [12]: df.ix[0][0] = 1
In [13]: df
Out[13]:
0 1
0 0 0.459589
1 0 0.409099
2 0 0.966191
3 0 0.908513
4 0 0.468437
And I can't even change elements from other columns
In [16]: df.ix[0][1] = 1
In [17]: df
Out[17]:
0 1
0 0 0.459589
1 0 0.409099
2 0 0.966191
3 0 0.908513
4 0 0.468437
What's up with this?
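You are editing a copy. df.ix[0] returns the row first, and once the columns have different dtypes (int and float) that row is a copy, so the chained assignment df.ix[0][0] = 1 never reaches df. Select the row and column in a single indexing call instead: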
In [3]: df = pd.DataFrame(np.random.rand(5,2))
In [4]: df[0] = np.floor(df.index / 10).astype(int) * 10
In [5]: df
Out[5]:
0 1
0 0 0.201611
1 0 0.390364
2 0 0.727422
3 0 0.941035
4 0 0.036764
In [6]: df.ix[0,1] = 1
In [7]: df
Out[7]:
0 1
0 0 1.000000
1 0 0.390364
2 0 0.727422
3 0 0.941035
4 0 0.036764
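Note that .ix was deprecated and then removed in pandas 1.0, so on a current install the same single-call idea would be written with .loc (labels) or .iloc (positions). A minimal sketch, assuming the default integer row and column labels from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 2))
# Column 0 becomes integer, so the frame now has mixed dtypes.
df[0] = np.floor(df.index.to_numpy() / 10).astype(int) * 10

# Row and column in ONE indexing call: the assignment goes to df itself.
df.loc[0, 1] = 1.0   # label-based: row label 0, column label 1
df.iloc[0, 0] = 1    # position-based: first row, first column

print(df)
```

The chained form df.loc[0][1] = 1 has the same problem as the original df.ix[0][1] = 1, so it is the single call that matters, not the choice of indexer.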