Sum columns in pandas having string and number

I need to sum column a and column b, which contain a string in the 1st row
>>> df
a b
0 c d
1 1 2
2 3 4
>>> df['sum'] = df.sum(1)
>>> df
a b sum
0 c d cd
1 1 2 3
2 3 4 7
I only need to add numeric values and get an output like
>>> df
a b sum
0 c d "dummyString/NaN"
1 1 2 3
2 3 4 7
I need to add only some columns:
df['sum']=df['a']+df['b']

Solution if the data are mixed (numeric with strings):
I think the simplest is to convert non-numeric values to NaN after the sum, using to_numeric:
df['sum'] = pd.to_numeric(df[['a','b']].sum(1), errors='coerce')
Or:
df['sum'] = pd.to_numeric(df['a']+df['b'], errors='coerce')
print (df)
a b sum
0 c d NaN
1 1 2 3.0
2 3 4 7.0
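Why this works even though the first row holds strings: adding two object columns concatenates the strings ('c' + 'd' gives 'cd'), and to_numeric with errors='coerce' then turns that non-numeric result into NaN. A self-contained sketch reproducing the example above:
import pandas as pd

# Mixed object columns: strings in the first row, numbers below
df = pd.DataFrame({'a': ['c', 1, 3], 'b': ['d', 2, 4]})

# 'c' + 'd' -> 'cd', which errors='coerce' converts to NaN
df['sum'] = pd.to_numeric(df['a'] + df['b'], errors='coerce')
print(df)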
EDIT:
Solutions if the numbers are string representations: first convert to numeric, then sum:
df['sum'] = pd.to_numeric(df['a'], errors='coerce') + pd.to_numeric(df['b'], errors='coerce')
print (df)
a b sum
0 c d NaN
1 1 2 3.0
2 3 4 7.0
Or:
df['sum'] = (df[['a', 'b']].apply(lambda x: pd.to_numeric(x, errors='coerce'))
.sum(axis=1, min_count=1))
print (df)
a b sum
0 c d NaN
1 1 2 3.0
2 3 4 7.0
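A note on min_count=1 (not spelled out in the original answer): without it, a row whose values all coerce to NaN would sum to 0 rather than NaN. A quick sketch of the difference:
import numpy as np
import pandas as pd

s = pd.DataFrame({'a': [np.nan], 'b': [np.nan]})
print(s.sum(axis=1))               # 0.0 - an all-NaN row sums to 0 by default
print(s.sum(axis=1, min_count=1))  # NaN - requires at least one valid value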

Related

What is the difference between the 'set' operation using loc vs iloc?
df.iloc[2, df.columns.get_loc('ColName')] = 3
#vs#
df.loc[2, 'ColName'] = 3
Why does the iloc documentation page (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) not have any set examples like those shown on the loc page (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)? Is loc the preferred way?
There isn't much of a difference; it comes down to what you have at hand.
If you have the index label and the column name (which is most of the time), use the loc (location) operator to assign values.
If, as in an ordinary matrix, you only have the integer positions of the row and column, use iloc (integer-based location) for assignment.
A pandas DataFrame supports both positional (integer-based) and label-based indexing.
The ambiguity arises when the index (rows or columns) itself consists of integers rather than strings, so the two operators exist to make explicit whether the user means position-based or label-based indexing.
The main difference is that iloc sets values by position, loc by label.
Here are some examples.
Sample:
Non-default index (with loc, if label 2 exists the cell is overwritten; otherwise a new row with that label is appended):
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=['A','B','C'], index=[2,1,8])
print (df)
A B C
2 2 2 6
1 1 3 9
8 6 1 0
df.iloc[2, df.columns.get_loc('A')] = 30
print (df)
A B C
2 2 2 6
1 1 3 9
8 30 1 0
Appended new row with label 0 (no row has label 0):
df.loc[0, 'A'] = 70
print (df)
A B C
2 2.0 2.0 6.0
1 1.0 3.0 9.0
8 30.0 1.0 0.0
0 70.0 NaN NaN
Overwritten label 2 (again starting from the DataFrame as it was after the iloc step):
df.loc[2, 'A'] = 50
print (df)
A B C
2 50 2 6
1 1 3 9
8 30 1 0
Default index (both work the same here, because the row at position 2 also has label 2):
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=['A','B','C'])
print (df)
A B C
0 2 2 6
1 1 3 9
2 6 1 0
df.iloc[2, df.columns.get_loc('A')] = 30
print (df)
A B C
0 2 2 6
1 1 3 9
2 30 1 0
df.loc[2, 'A'] = 50
print (df)
A B C
0 2 2 6
1 1 3 9
2 50 1 0
Non-integer index (setting by position works; setting by label 2 appends a new row):
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=['A','B','C'], index=list('abc'))
print (df)
A B C
a 2 2 6
b 1 3 9
c 6 1 0
df.iloc[2, df.columns.get_loc('A')] = 30
print (df)
A B C
a 2 2 6
b 1 3 9
c 30 1 0
df.loc[2, 'A'] = 50
print (df)
A B C
a 2.0 2.0 6.0
b 1.0 3.0 9.0
c 30.0 1.0 0.0
2 50.0 NaN NaN
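As an aside (beyond the original answer): for single-cell assignment like the examples above, at and iat are the scalar counterparts of loc and iloc, with the same label-vs-position semantics:
df.at['a', 'A'] = 7   # label-based scalar assignment, like loc
df.iat[0, 0] = 7      # position-based scalar assignment, like iloc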

Setting values whose value_counts is lower than a threshold to "other"

I want to set items with count <= 1 to "other". Code for the input table:
import pandas as pd
df=pd.DataFrame({"item":['a','a','a','b','b','c','d']})
input table:
item
0 a
1 a
2 a
3 b
4 b
5 c
6 d
expected output:
item result
0 a a
1 a a
2 a a
3 b b
4 b b
5 c other
6 d other
How could I achieve that?
Use Series.where with a check for duplicated values via Series.duplicated with keep=False:
df['result'] = df.item.where(df.item.duplicated(keep=False), 'other')
Or use GroupBy.transform with a greater-than-1 check via Series.gt:
df['result'] = df.item.where(df.groupby('item')['item'].transform('size').gt(1), 'other')
Or use Series.map with Series.value_counts:
df['result'] = df.item.where(df['item'].map(df['item'].value_counts()).gt(1), 'other')
print (df)
item result
0 a a
1 a a
2 a a
3 b b
4 b b
5 c other
6 d other
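All three variants share the same idea: build a boolean mask that is True where an item occurs more than once, and let Series.where keep those values and substitute 'other' elsewhere. A minimal sketch showing the intermediate mask:
import pandas as pd

df = pd.DataFrame({"item": ['a','a','a','b','b','c','d']})

# True where the item appears more than once anywhere in the column
mask = df['item'].duplicated(keep=False)
print(mask.tolist())  # [True, True, True, True, True, False, False]

# where keeps values where the mask is True, else the fallback 'other'
df['result'] = df['item'].where(mask, 'other')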
Use numpy.where with GroupBy.transform and Series.le:
In [926]: import numpy as np
In [927]: df['result'] = np.where(df.groupby('item')['item'].transform('count').le(1), 'other', df.item)
In [928]: df
Out[928]:
item result
0 a a
1 a a
2 a a
3 b b
4 b b
5 c other
6 d other
Or use GroupBy.size with merge:
In [917]: x = df.groupby('item').size().reset_index()
In [919]: ans = df.merge(x)
In [921]: ans['result'] = np.where(ans[0].le(1), 'other', ans.item)
In [923]: ans = ans.drop(columns=0)
In [924]: ans
Out[924]:
item result
0 a a
1 a a
2 a a
3 b b
4 b b
5 c other
6 d other
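A slightly cleaner variant of the merge approach (a sketch, not from the original answer) names the size column explicitly instead of relying on the default 0 label:
counts = df.groupby('item').size().rename('n').reset_index()
ans = df.merge(counts)
ans['result'] = np.where(ans['n'].le(1), 'other', ans['item'])
ans = ans.drop(columns='n')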

Pandas - groupby one column and get mean of all other columns

I have a dataframe, with columns:
cols = ['A', 'B', 'C']
If I groupby one column, say, 'A', like so:
df.groupby('A')['B'].mean()
It works.
But I need to groupby one column and then get the mean of all other columns. I've tried:
df[cols].groupby('A').mean()
But I get the error:
KeyError: 'A'
What am I missing?
Please try:
df.groupby('A').agg('mean')
Sample data:
B C A
0 1 4 K
1 2 6 S
2 4 7 K
3 6 3 K
4 2 1 S
5 7 3 K
6 8 9 K
7 9 3 K
print(df.groupby('A').agg('mean'))
B C
A
K 5.833333 4.833333
S 2.000000 3.500000
You can use df.groupby('col').mean(). For example, to calculate the mean for columns 'A', 'B' and 'C':
A B C D
0 1 NaN 1 1
1 1 2.0 2 1
2 2 3.0 1 1
3 1 4.0 1 1
4 2 5.0 2 1
df[['A', 'B', 'C']].groupby('A').mean()
or
df.groupby('A')[['B', 'C']].mean()
Output:
B C
A
1 3.0 1.333333
2 4.0 1.500000
If you need mean for all columns:
df.groupby('A').mean()
Output:
B C D
A
1 3.0 1.333333 1.0
2 4.0 1.500000 1.0
Perhaps the missing column is a string rather than numeric?
df = pd.DataFrame({
'A': ['big', 'small','small', 'small'],
'B': [1,0,0,0],
'C': [1,1,1,0],
'D': ['1','0','0','0']
})
df.groupby(['A']).mean()
Output:
B C
A
big 1.0 1.000000
small 0.0 0.666667
Here, converting the column to a numeric type such as int or float produces the desired result:
df.D = df.D.astype(int)
df.groupby(['A']).mean()
Output:
B C D
A
big 1.0 1.000000 1.0
small 0.0 0.666667 0.0
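One caveat worth adding: in recent pandas versions (2.0 and later), GroupBy.mean raises a TypeError when non-numeric columns are present instead of silently dropping them, so the string column D in the first call above would make it fail. Passing numeric_only=True restores the dropping behaviour:
# pandas >= 2.0: explicitly skip non-numeric columns such as the string D
df.groupby(['A']).mean(numeric_only=True)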

Splitting string columns when the first part before the splitting pattern is missing

I'm trying to split a string column into different columns and tried How to split a column into two columns?
The pattern of the strings looks like the following:
import pandas as pd
import numpy as np
>>> data = {'ab': ['a - b', 'a - b', 'b', 'c', 'whatever']}
>>> df = pd.DataFrame(data=data)
ab
0 a - b
1 a - b
2 b
3 c
4 whatever
>>> df['a'], df['b'] = df['ab'].str.split('-', n=1).str
ab a b
0 a - b a b
1 a - b a b
2 b b NaN
3 c c NaN
4 whatever whatever NaN
The expected result is
ab a b
0 a - b a b
1 a - b a b
2 b NaN b
3 c NaN c
4 whatever NaN whatever
The method I came up with is
df.loc[~ df.ab.str.contains(' - '), 'b'] = df['ab']
df.loc[~ df.ab.str.contains(' - '), 'a'] = np.nan
Is there more generic/efficient way to do this task?
We can use extract as long as we know the specific strings to extract:
df.ab.str.extract(r"(a)?(?:\s-\s)?(b)?")
Out[47]:
0 1
0 a b
1 a b
2 NaN b
3 a NaN
data used:
data = {'ab': ['a - b', 'a - b', 'b','a']}
df = pd.DataFrame(data=data)
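To unpack the pattern: it is three optional pieces, of which only the outer two are captured; a quick check with the standard re module (purely illustrative):
import re

# (a)?        optional first token, captured into column 0
# (?:\s-\s)?  optional ' - ' separator, not captured
# (b)?        optional second token, captured into column 1
pattern = re.compile(r"(a)?(?:\s-\s)?(b)?")
print(pattern.match("b").groups())      # (None, 'b')
print(pattern.match("a - b").groups())  # ('a', 'b')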
With your edit, it seems your aim is to put anything that appears by itself into the second column. You could do:
df.ab.str.extract(r"(\S*)(?:\s-\s)?(\b\S+)")
Out[59]:
0 1
0 a b
1 a b
2 b
3 c
4 whatever
I would use get_dummies:
s=df['ab'].str.get_dummies(' - ')
s=s.mask(s.eq(1),s.columns.tolist()).mask(s.eq(0))
s
Out[7]:
a b
0 a b
1 a b
2 NaN b
Update
df.ab.str.split(' - ',expand=True).apply(lambda x : pd.Series(sorted(x,key=pd.notnull)),axis=1)
Out[22]:
0 1
0 a b
1 a b
2 None b
3 None c
4 None whatever
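The sorting trick works because pd.notnull returns False for None and True for real values, and False sorts before True, so missing parts are pushed into the leftmost columns. The same idea spelled out (a sketch using the question's data):
import pandas as pd

df = pd.DataFrame({'ab': ['a - b', 'a - b', 'b', 'c', 'whatever']})

parts = df['ab'].str.split(' - ', expand=True)
# key=pd.notnull sorts None (False) before values (True),
# so lone values land in the last column of each row
df[['a', 'b']] = parts.apply(lambda r: pd.Series(sorted(r, key=pd.notnull)), axis=1)
print(df)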

Pandas: Delete duplicated items in a specific column

I have a pandas DataFrame (represented here using Excel):
Now I would like to delete all duplicates (1) in a specific column (B).
How can I do it?
For this example, the result would look like this:
You can use duplicated for a boolean mask and then set NaN with loc, mask, or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan,df['B'])
Alternatively, if you need to remove duplicate rows by column B:
df = df.drop_duplicates(subset=['B'])
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 7, 9],
                   'B': [1, 2, 1, 3]})
print (df)
A B
0 1 1
1 5 2
2 7 1
3 9 3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print (df)
A B
0 1 1.0
1 5 2.0
2 7 NaN
3 9 3.0
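The mask and np.where one-liners give the same result; a quick check on a fresh copy of the sample (since the loc version above already overwrote B):
df = pd.DataFrame({'A': [1, 5, 7, 9], 'B': [1, 2, 1, 3]})
print(df['B'].mask(df['B'].duplicated()))            # 1.0, 2.0, NaN, 3.0
print(np.where(df['B'].duplicated(), np.nan, df['B']))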
df = df.drop_duplicates(subset=['B'])
print (df)
A B
0 1 1
1 5 2
3 9 3