Replacing values in column not working in pandas

My sample data set:
import pandas as pd
import numpy as np
df = {'ID': ['A', 0, 0, 1, 'A', 1],
      'ID1': ['Yes', 'Yes', 'No', 'No', 'Yes', 'Yes']}
df = pd.DataFrame(df)
My real data set is read in from an Excel file; the column 'ID1' contains 'Yes' or 'No', and the column 'ID' contains 1, 0 and 'A'.
I want to:
For column 'ID1', replace 'Yes' with 1 and 'No' with 0.
For column 'ID', replace 'A' with 0.
I tried the following ways:
# The values didn't change
df['ID1'] = df['ID1'].replace(['Yes', 'No'], [1, 0])
# Or, the values didn't change
df['ID1'] = df['ID1'].replace(['Yes', 'No'], [1, 0], inplace='ignore')
# Or, it turns 'A' into NaN
df['ID'] = df['ID'].map({1: 1, 0: 0, 'A': 0})
# Or, it turns 'A' into NaN
df['ID'] = df['ID'].map({1: 1, 0: 0, 'A': 0}, na_action=None)
My code works perfectly if you run the sample data set code above to build the DataFrame, but it doesn't work with my real data set, which I read in from an Excel file. I searched online but couldn't figure out why. These columns in my real data set are object dtype; I tried converting them to string but it still doesn't work.
Edit:
My code for reading my real data set:
import os

os.chdir(r"S:\path")
df1 = pd.read_excel('data.xlsx', skiprows=[0])
df1['ID'] = df1['ID'].str.strip()
df1['ID'] = df1['ID'].map({'1': 1, '0': 0, 'A': 0}, na_action=None)
df1['ID1'] = df1['ID1'].str.strip()
df1['ID1'] = df1['ID1'].replace(['Yes', 'No'], [1, 0])
df1.head()
Out[55]:
   ID1   ID
0    1  NaN
1    1  NaN
2    1  NaN
3    1  0.0
4    1  NaN
I have uploaded my file online, please check this link: https://filebin.ca/3UAh5051Psnv/test.xlsx

Try cleaning up the ID1 and ID columns:
df['ID'] = df['ID'].astype(str).str.strip().map({'1': 1, '0': 0, 'A': 0}, na_action=None)
df['ID1'] = df['ID1'].str.strip().replace(['Yes', 'No'], [1, 0])
Result:
In [234]: df
Out[234]:
ID1 ID
0 1 1
1 1 1
2 1 1
3 1 0
4 1 1
5 1 1
6 1 0
7 1 1
8 1 1
9 1 1
10 1 1
11 1 1
12 1 1
13 1 0
14 1 1
15 1 1
16 1 0
17 1 1
18 1 1
19 1 1
20 1 1
21 1 1
22 1 1
23 1 1
24 1 1
25 1 1
26 1 1
27 1 1
28 1 1
29 1 1
30 1 1
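Why the original code most likely failed: after reading the Excel file, the ID column holds real integers (1, 0) mixed with the string 'A'. The .str accessor returns NaN for every non-string element, so df1['ID'].str.strip() wipes out the numeric entries before map ever sees them; that matches the NaN-heavy output above. A minimal sketch reproducing this, assuming the mixed dtypes described in the question:

import pandas as pd

s = pd.Series([1, 0, 'A '])   # object dtype: real ints mixed with a padded string
print(s.str.strip())          # ints become NaN -- only 'A' survives the .str accessor
print(s.astype(str).str.strip().map({'1': 1, '0': 0, 'A': 0}))   # 1, 0, 0

Casting with astype(str) first, as in the answer above, keeps every value intact.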

Related

pandas sort_values with condition

I have a dataframe that I'd like to sort on columns time and b, where the sort on b is conditional on the value of a: if a == 1, sort from highest to lowest, and if a == -1, sort from lowest to highest. I would normally do something like df.sort_values(by=['time', 'b']), but I think that always sorts b from lowest to highest.
df = pd.DataFrame({'time': [0, 3, 2, 2, 1], 'a': [1, -1, 1, 1, -1], 'b': [4, 5, 1, 6, 2]})
time a b
0 0 1 4
1 3 -1 5
2 2 1 1
3 2 1 6
4 1 -1 2
desired output
time a b
0 0 1 4
1 1 -1 2
2 2 1 6
3 2 1 1
4 3 -1 5
Multiply a and b and use the product as the sorting key; since a is always 1 or -1, a*b is b when a == 1 and -b when a == -1:
df['sort'] = df['a']*df['b']
df.sort_values(by=['time', 'sort'], ascending=[True, False]).drop('sort', axis=1)
output:
time a b
0 0 1 4
4 1 -1 2
3 2 1 6
2 2 1 1
1 3 -1 5
Alternative, sorting everything ascending with -a*b as the key:
df['sort'] = -df['a']*df['b']
df.sort_values(by=['time', 'sort']).drop('sort', axis=1)
Pass ascending after creating an additional column for sorting:
out = df.assign(key=df.a*df.b).sort_values(['time', 'key'], ascending=[True, False]).drop(columns='key')
Out[59]:
time a b
0 0 1 4
4 1 -1 2
3 2 1 6
2 2 1 1
1 3 -1 5
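On pandas 1.1+ you can also skip the helper column entirely with the key argument of sort_values. A sketch, assuming as above that a is always 1 or -1; the key function is applied to each sort column separately, so we only transform b:

out = df.sort_values(
    by=['time', 'b'],
    # leave `time` alone; replace `b` by -a*b so that plain ascending
    # order means "b descending when a == 1, b ascending when a == -1"
    key=lambda s: s.mul(-df['a']) if s.name == 'b' else s,
)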

Delete rows in dataframe based on info from Series

I would like to delete all rows in the DataFrame that have number of appearances = 10 and status = 1.
Example of Dataframe X is
ID Status
0 366804 0
1 371391 1
2 383537 1
3 383538 0
4 383539 0
...
First I found all rows with status = 1 that appear 10 times:
exclude = X[X.Status == 1].groupby('ID')['Status'].value_counts().loc[lambda x: x == 10].index
exclude is a MultiIndex:
MultiIndex([( 371391, 1),
( 383537, 1),
...
Is it possible to delete rows in DataFrame X based on the ID info from this index?
If your original DataFrame looks something like this:
print(df)
ID Status
0 366804 0
1 371391 1
2 383537 1
3 383538 0
4 383539 0
5 371391 1
6 371391 1
7 371391 1
8 371391 1
9 371391 1
10 371391 1
11 371391 1
12 371391 1
13 371391 1
And you group IDs and statuses together to find the IDs you want to exclude:
df2 = df.groupby(['ID', 'Status']).size().to_frame('size').reset_index()
print(df2)
ID Status size
0 366804 0 1
1 371391 1 10
2 383537 1 1
3 383538 0 1
4 383539 0 1
excludes = df2.loc[(df2['size'] == 10) & (df2['Status'] == 1), 'ID']
print(excludes)
1 371391
Name: ID, dtype: int64
Then you could use Series.isin and invert the resulting boolean mask with ~:
df = df[~df['ID'].isin(excludes)]
print(df)
ID Status
0 366804 0
2 383537 1
3 383538 0
4 383539 0
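Alternatively, you can reuse the exclude MultiIndex from the question directly; its first level already holds the IDs to drop. A sketch, assuming exclude was built exactly as in the question:

# level 0 of the (ID, Status) MultiIndex holds the IDs to exclude
X = X[~X['ID'].isin(exclude.get_level_values(0))]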

Need to loop over pandas series to find indices of variable

I have a dataframe and a list. I would like to iterate over the elements in the list, find their locations in the dataframe, and store these in a new dataframe.
my_list = ['1','2','3','4','5']
df1 = pd.DataFrame(my_list, columns=['Num'])
dataframe : df1
Num
0 1
1 2
2 3
3 4
4 5
dataframe : df2
0 1 2 3 4
0 9 12 8 6 7
1 11 1 4 10 13
2 5 14 2 0 3
I've tried something similar to this, but it doesn't work:
for x in my_list:
    i, j = np.array(np.where(df == x)).tolist()
    df2['X'] = df.append(i)
    df2['Y'] = df.append(j)
So I'm looking for a result like this:
dataframe : df1 updated
Num X Y
0 1 1 1
1 2 2 2
2 3 2 4
3 4 1 2
4 5 2 0
Any hints or ideas would be appreciated.
Instead of trying to find the values in df2, why not just make df2 a flat dataframe?
df2 = df2.reset_index().melt(id_vars='index')   # keep the row number through the melt
df2.columns = ['X', 'Y', 'Num']
so now your df2 looks like this, with X the row and Y the column of each value:
   X  Y  Num
0  0  0    9
1  1  0   11
2  2  0    5
3  0  1   12
4  1  1    1
5  2  1   14
You can of course sort by Num, and if you just want the values from your list you can further filter df2 (note that my_list holds strings while the melted values are ints, so cast one side):
df2 = df2[df2.Num.astype(str).isin(my_list)]
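To end up with the exact layout in the question (df1 with added X and Y columns), one possible last step is a merge back onto df1. A sketch, assuming df2 was built with the melt code above:

df2['Num'] = df2['Num'].astype(str)          # match the string values in my_list
df1 = df1.merge(df2[['Num', 'X', 'Y']], on='Num')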

Pandas: Grouping by values when a column is a list

I have a DataFrame like this one:
df = pd.DataFrame({'type':[[1,3],[1,2,3],[2,3]], 'value':[4,5,6]})
type  | value
------+------
1,3   |     4
1,2,3 |     5
2,3   |     6
I would like to group by the different values in the 'type' column so for example the sum of value would be:
type | sum
-----+----
   1 |   9
   2 |  11
   3 |  15
Thanks for your help!
First reshape the DataFrame by column type with the DataFrame constructor, stack and reset_index. Then cast column type to int and finally groupby with sum aggregation:
df1 = pd.DataFrame(df['type'].values.tolist(), index=df['value']) \
        .stack() \
        .reset_index(name='type')
df1.type = df1.type.astype(int)
print (df1)
value level_1 type
0 4 0 1
1 4 1 3
2 5 0 1
3 5 1 2
4 5 2 3
5 6 0 2
6 6 1 3
print (df1.groupby('type', as_index=False)['value'].sum())
type value
0 1 9
1 2 11
2 3 15
Another solution with join:
df1 = pd.DataFrame(df['type'].values.tolist()) \
        .stack() \
        .reset_index(level=1, drop=True) \
        .rename('type') \
        .astype(int)
print (df1)
0 1
0 3
1 1
1 2
1 3
2 2
2 3
Name: type, dtype: int32
df2 = df[['value']].join(df1)
print (df2)
value type
0 4 1
0 4 3
1 5 1
1 5 2
1 5 3
2 6 2
2 6 3
print (df2.groupby('type', as_index=False)['value'].sum())
type value
0 1 9
1 2 11
2 3 15
A version working with the Series: select the first level of the index with get_level_values, convert it to a Series with to_series, and aggregate with sum. Finally reset_index and rename the column index to type:
df1 = pd.DataFrame(df['type'].values.tolist(), index=df['value']).stack().astype(int)
print (df1)
value
4  0    1
   1    3
5  0    1
   1    2
   2    3
6  0    2
   1    3
dtype: int32
print (df1.index.get_level_values(0)
          .to_series()
          .groupby(df1.values)
          .sum()
          .reset_index()
          .rename(columns={'index':'type'}))
type value
0 1 9
1 2 11
2 3 15
Edit, following a comment: a slightly modified second solution using DataFrame.pop:
df = pd.DataFrame({'type':[[1,3],[1,2,3],[2,3]],
                   'value1':[4,5,6],
                   'value2':[1,2,3],
                   'value3':[4,6,1]})
print (df)
type value1 value2 value3
0 [1, 3] 4 1 4
1 [1, 2, 3] 5 2 6
2 [2, 3] 6 3 1
df1 = pd.DataFrame(df.pop('type').values.tolist()) \
        .stack() \
        .reset_index(level=1, drop=True) \
        .rename('type') \
        .astype(int)
print (df1)
0 1
0 3
1 1
1 2
1 3
2 2
2 3
Name: type, dtype: int32
print (df.join(df1).groupby('type', as_index=False).sum())
type value1 value2 value3
0 1 9 3 10
1 2 11 5 7
2 3 15 6 11
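On pandas 0.25+ there is a much shorter route: DataFrame.explode turns each list element into its own row, after which a plain groupby works. A sketch on the original single-value frame:

out = (df.explode('type')            # one row per list element
         .astype({'type': int})      # exploded values keep object dtype
         .groupby('type', as_index=False)['value']
         .sum())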

create a dataframe from a list of length-unequal lists

I'm trying to convert a list like this:
l = [[1, 2, 3, 17], [4, 19], [5]]
into a dataframe with each number as index and the position of its list as value.
For example, 19 is in the second list, so I expect to get a row with 19 as index and 1 as value, and so on.
I managed to get it (see the boilerplate below), but I guess there is something simpler:
>>> df=pd.DataFrame(l)
>>> df=df.unstack().reset_index(level=0,drop=True)
>>> df=df[df.notnull()==True] # remove NaN rows
>>> df=pd.DataFrame(df)
>>> df = df.reset_index().set_index(0)
>>> print df
    index
0
1       0
4       1
5       2
2       0
19      1
3       0
17      0
Thanks in advance.
In [52]: pd.DataFrame([(item, i) for i, seq in enumerate(l)
    ...:                for item in seq]).set_index(0)
Out[52]:
    1
0
1   0
2   0
3   0
17  0
4   1
19  1
5   2
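If duplicates across the sublists are not a concern, the same mapping can be built even more directly as a dict comprehension fed to the Series constructor (a number appearing in several lists would keep only its last position):

s = pd.Series({item: i for i, seq in enumerate(l) for item in seq})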