adding two series with missing data - pandas

I would like to add series a and b together. I know that the keys in series b are always present in a, but there will be keys in a that are not in b. I don't want NaNs to appear in the result.
For example:
In[114] a
Out[114]:
a 1
b 3
c 2
d 5
dtype: int64
In[115] b
Out[115]:
b 3
c 2
dtype: int64
If I just use the add function, I will get NaNs in the missing locations.
In[116] a.add(b)
Out[116]:
a NaN
b 6
c 4
d NaN
dtype: float64
The following is the result I desire:
In[117] c
Out[117]:
a 1
b 6
c 4
d 5
dtype: int64
Is there a clever way to do this?

Use update and a nested add; this should give you the desired output. I have tested it in IPython and it works.
d = {'d': 5, 'b': 3, 'c': 2, 'a': 1}
e = {'b': 3, 'c': 2}
Convert to Series:
ds = pd.Series(d)
es = pd.Series(e)
ds.update(ds.add(es))
ds
Out[49]:
a 1
b 6
c 4
d 5
dtype: int64

>>> a.add(b, fill_value=0)
a 1
b 6
c 4
d 5
dtype: float64
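Note that fill_value leaves you with float64 (visible in the Out above, where the question wanted int64); a small sketch of casting back afterwards:

```python
import pandas as pd

a = pd.Series({'a': 1, 'b': 3, 'c': 2, 'd': 5})
b = pd.Series({'b': 3, 'c': 2})

# fill_value=0 substitutes 0 for labels missing on either side,
# but the result comes back as float64; cast to restore int64.
c = a.add(b, fill_value=0).astype('int64')
print(c)
```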


Find names of n largest values in each row of dataframe

I can find the n largest values in each row of a numpy array (link) but doing so loses the column information which is what I want. Say I have some data:
import pandas as pd
import numpy as np
np.random.seed(42)
data = np.random.rand(5,5)
data = pd.DataFrame(data, columns = list('abcde'))
data
a b c d e
0 0.374540 0.950714 0.731994 0.598658 0.156019
1 0.155995 0.058084 0.866176 0.601115 0.708073
2 0.020584 0.969910 0.832443 0.212339 0.181825
3 0.183405 0.304242 0.524756 0.431945 0.291229
4 0.611853 0.139494 0.292145 0.366362 0.456070
I want the names of the largest contributors in each row. So for n = 2 the output would be:
0 b c
1 c e
2 b c
3 c d
4 a e
I can do it by looping over the dataframe but that would be inefficient. Is there a more pythonic way?
With the pandas.Series.nlargest function (the frame here is named data, not df):
data.apply(lambda x: x.nlargest(2).index.values, axis=1)
0 [b, c]
1 [c, e]
2 [b, c]
3 [c, d]
4 [a, e]
Another option using numpy.argpartition to find the top n index per row and then extract column names by index:
import numpy as np
n = 2
nlargest_index = np.argpartition(data.values, data.shape[1] - n)[:, -n:]
data.columns.values[nlargest_index]
#array([['c', 'b'],
# ['e', 'c'],
# ['c', 'b'],
# ['d', 'c'],
# ['e', 'a']], dtype=object)
Can a dense ranking be used for this?
N = 2
threshold = len(data.columns) - N
nlargest = data[data.rank(method="dense", axis=1) > threshold]
>>> nlargest
a b c d e
0 NaN 0.950714 0.731994 NaN NaN
1 NaN NaN 0.866176 NaN 0.708073
2 NaN 0.969910 0.832443 NaN NaN
3 NaN NaN 0.524756 0.431945 NaN
4 0.611853 NaN NaN NaN 0.456070
>>> nlargest.stack()
0 b 0.950714
c 0.731994
1 c 0.866176
e 0.708073
2 b 0.969910
c 0.832443
3 c 0.524756
d 0.431945
4 a 0.611853
e 0.456070
dtype: float64
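If you also want the names ordered largest-first (nlargest returns them sorted; argpartition makes no ordering guarantee within the top n), a plain argsort sketch on the same data:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
data = pd.DataFrame(np.random.rand(5, 5), columns=list('abcde'))

n = 2
# argsort orders each row ascending, so the last n positions hold the
# top-n column indices; reversing puts the largest contributor first.
order = np.argsort(data.values, axis=1)[:, -n:][:, ::-1]
names = pd.DataFrame(data.columns.values[order], index=data.index)
print(names)
```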

splitting string columns while the first part before the splitting pattern is missing

I'm trying to split a string column into different columns and tried
How to split a column into two columns?
The pattern of the strings look like the following:
import pandas as pd
import numpy as np
>>> data = {'ab': ['a - b', 'a - b', 'b', 'c', 'whatever']}
>>> df = pd.DataFrame(data=data)
ab
0 a - b
1 a - b
2 b
3 c
4 whatever
>>> df['a'], df['b'] = df['ab'].str.split('-', n=1).str
ab a b
0 a - b a b
1 a - b a b
2 b b NaN
3 c c NaN
4 whatever whatever NaN
The expected result is
ab a b
0 a - b a b
1 a - b a b
2 b NaN b
3 c NaN c
4 whatever NaN whatever
The method I came up with is
df.loc[~ df.ab.str.contains(' - '), 'b'] = df['ab']
df.loc[~ df.ab.str.contains(' - '), 'a'] = np.nan
Is there more generic/efficient way to do this task?
We can use extract, as long as we know the specific strings to extract:
df.ab.str.extract(r"(a)?(?:\s-\s)?(b)?")
Out[47]:
0 1
0 a b
1 a b
2 NaN b
3 a NaN
data used:
data = {'ab': ['a - b', 'a - b', 'b','a']}
df = pd.DataFrame(data=data)
With your edit, it seems your aim is to put anything that stands by itself into the second column. You could do:
df.ab.str.extract(r"(\S*)(?:\s-\s)?(\b\S+)")
Out[59]:
0 1
0 a b
1 a b
2 b
3 c
4 whatever
I will use get_dummies:
s = df['ab'].str.get_dummies(' - ')
s = s.mask(s.eq(1), s.columns.tolist()).mask(s.eq(0))
s
Out[7]:
a b
0 a b
1 a b
2 NaN b
Update
df.ab.str.split(' - ', expand=True).apply(lambda x: pd.Series(sorted(x, key=pd.notnull)), axis=1)
Out[22]:
0 1
0 a b
1 a b
2 None b
3 None c
4 None whatever
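Another sketch of my own (not from the answers above): split on the separator, then left-pad each row's parts with NaN, so a lone value always lands in the second column no matter what the strings contain:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ab': ['a - b', 'a - b', 'b', 'c', 'whatever']})

# Split on ' - '; rows without the separator yield a single part.
parts = df['ab'].str.split(' - ')
# Left-pad each row's list with NaN so lone values fall into the
# last column (here width is 2 for this data).
width = parts.str.len().max()
padded = parts.apply(lambda p: [np.nan] * (width - len(p)) + p)
df = df.join(pd.DataFrame(padded.tolist(), columns=['a', 'b'],
                          index=df.index))
print(df)
```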

How to join pandas dataframes which have a multiindex

Problem Description
I have a dataframe with a multi-index that is three levels deep (0, 1, 2) and I'd like to join this dataframe with another dataframe which is indexed by level 2 of my original dataframe.
In code, I'd like to turn:
pd.DataFrame(['a', 'b', 'c', 'd']).transpose().set_index([0, 1, 2])
and
pd.DataFrame(['c', 'e']).transpose().set_index(0)
into
pd.DataFrame(['a', 'b', 'c', 'd', 'e']).transpose().set_index([0, 1, 2])
What I've tried
I've tried using swaplevel and then join. That didn't work, though some of the error messages suggested that if only I could set the on parameter properly, it might.
I also tried concat, but couldn't get it to work either. I'm not sure it can't work, though...
Notes:
I have seen this question in which the answer seems to dodge the question (while solving the problem).
pandas will naturally do this for you if the names of the index levels line up. You can rename the index of the second dataframe and join accordingly.
d1 = pd.DataFrame(['a', 'b', 'c', 'd']).transpose().set_index([0, 1, 2])
d2 = pd.DataFrame(['c', 'e']).transpose().set_index(0)
d1.join(d2.rename_axis(2))
3 1
0 1 2
a b c d e
More Comprehensive Example
d1 = pd.DataFrame([
[1, 2],
[3, 4],
[5, 6],
[7, 8]
], pd.MultiIndex.from_product([['A', 'B'], ['X', 'Y']], names=['One', 'Two']))
d2 = pd.DataFrame([
list('abcdefg')
], ['Y'], columns=list('ABCDEFG'))
d3 = pd.DataFrame([
list('hij')
], ['A'], columns=list('HIJ'))
d1.join(d2.rename_axis('Two')).join(d3.rename_axis('One'))
0 1 A B C D E F G H I J
One Two
A X 1 2 NaN NaN NaN NaN NaN NaN NaN h i j
Y 3 4 a b c d e f g h i j
B X 5 6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Y 7 8 a b c d e f g NaN NaN NaN
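If you'd rather not rename anything, merge also accepts index level names in left_on (since pandas 0.23), which I believe produces the same matching; a sketch with the frames above:

```python
import pandas as pd

d1 = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                  pd.MultiIndex.from_product([['A', 'B'], ['X', 'Y']],
                                             names=['One', 'Two']))
d2 = pd.DataFrame([list('abcdefg')], ['Y'], columns=list('ABCDEFG'))

# left_on may name an index level of d1; right_index=True matches
# against d2's plain index, how='left' keeps all of d1's rows.
out = d1.merge(d2, left_on='Two', right_index=True, how='left')
print(out)
```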

Two unaligned Pandas Series: concat raises error, adding does not but it returns a weird answer

Note: I am working with a rather old-ish Pandas 0.16.2, in Python 2.7.11.
My simplistic conceptual model for adding two Series was that it involves an index-matching step similar to what goes on in pd.concat(..., axis=1), i.e. the Series indexes are lined up and then the values are added.
Therefore (modulo NaN handling, I guess), I would expect u + v to work if, and only if, concat([u, v], axis=1) works.
In the example below I build two Series with 'unalignable' indexes. My confusion is that concat does raise an error (as expected) but the adding does not -- and even more confusing is that the result of adding comes back with everything duplicated.
First I create a couple of series which have equal indexes (containing duplicates):
import string, pandas as pd
# Create a series with an index that has duplicates
u = pd.Series(range(5), index=list(string.ascii_lowercase)[:5])
u = pd.concat([u, u])
# Create another, same index but values reversed
v = pd.Series(range(5)[::-1], index=list(string.ascii_lowercase)[:5])
v = pd.concat([v, v])
Here they are:
In [2]: u
Out[2]:
a 0
b 1
c 2
d 3
e 4
a 0
b 1
c 2
d 3
e 4
dtype: int64
In [3]: v
Out[3]:
a 4
b 3
c 2
d 1
e 0
a 4
b 3
c 2
d 1
e 0
dtype: int64
They can be added since the indices are equal:
In [4]: u+v
Out[4]:
a 4
b 4
c 4
d 4
e 4
a 4
b 4
c 4
d 4
e 4
dtype: int64
If we sort v, its index gets reordered; since there is no longer an obvious way to line v up with u, it is not surprising that concat raises an error:
In [5]: v.sort()
In [6]: v
Out[6]:
e 0
e 0
d 1
d 1
c 2
c 2
b 3
b 3
a 4
a 4
dtype: int64
In [7]: pd.concat([u, v], axis=1)
....
ValueError: cannot reindex from a duplicate axis
However, adding still works but bizarrely returns a longer series:
In [8]: u+v
Out[8]:
a 4
a 4
a 4
a 4
b 4
b 4
b 4
b 4
c 4
c 4
c 4
c 4
d 4
d 4
d 4
d 4
e 4
e 4
e 4
e 4
dtype: int64
What happened here?
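A miniature sketch of what appears to be happening: when the indexes are not identical and duplicate labels force a reindex, alignment pairs every left occurrence of a label with every right occurrence, i.e. a per-label cartesian product (2 × 2 = 4 rows per label above, times 5 labels = 20 rows):

```python
import pandas as pd

# 'a' occurs twice on each side, 'b' once on each side. The indexes
# are not identical, so pandas aligns them, pairing each left 'a'
# with each right 'a': 2 * 2 = 4 rows for 'a', 1 * 1 = 1 row for 'b'.
u = pd.Series([0, 0, 1], index=['a', 'a', 'b'])
v = pd.Series([4, 4, 3], index=['a', 'b', 'a'])
s = u + v
print(s)
```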

pandas set_index with NA and None values seems not to be working

I am trying to index a pandas DataFrame using columns with occasional NA and None in them. This seems to be failing. In the example below, df0 has (None,e) combination on index 3, but df1 has (NaN,e). Any suggestions?
import pandas as pd
import numpy as np
df0 = pd.DataFrame({'k1':['4',np.NaN,'6',None,np.NaN], 'k2':['a','d',np.NaN,'e',np.NaN], 'v':[1,2,3,4,5]})
df1 = df0.copy().set_index(['k1','k2'])
>>> df0
Out[3]:
k1 k2 v
0 4 a 1
1 NaN d 2
2 6 NaN 3
3 None e 4
4 NaN NaN 5
>>> df1
Out[4]:
v
k1 k2
4 a 1
NaN d 2
6 NaN 3
NaN e 4
NaN 5
Edit: I see the point--so this is the expected behavior.
This is expected behaviour: the None value is converted to NaN, and because repeated index values are suppressed in the display, it isn't shown:
In [31]:
df1.index
Out[31]:
MultiIndex(levels=[['4', '6'], ['a', 'd', 'e']],
labels=[[0, -1, 1, -1, -1], [0, 1, -1, 2, -1]],
names=['k1', 'k2'])
From the above you can see that -1 is used to mark NaN values. As for the display, if your df were like the following, it would show the same behaviour:
In [34]:
df0 = pd.DataFrame({'k1':['4',np.NaN,'6',1,1], 'k2':['a','d',np.NaN,'e',np.NaN], 'v':[1,2,3,4,5]})
df1 = df0.copy().set_index(['k1','k2'])
df1
Out[34]:
v
k1 k2
4 a 1
NaN d 2
6 NaN 3
1 e 4
NaN 5
You can see that 1 is the k1 value for both of the last two rows, but the repeat is suppressed in the display.
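To see the conversion directly, you can turn off the sparse MultiIndex display (the display.multi_sparse option), which prints every index value; a quick sketch rebuilding the original frame:

```python
import numpy as np
import pandas as pd

df0 = pd.DataFrame({'k1': ['4', np.nan, '6', None, np.nan],
                    'k2': ['a', 'd', np.nan, 'e', np.nan],
                    'v': [1, 2, 3, 4, 5]})
df1 = df0.set_index(['k1', 'k2'])

# With sparsification off, repeated/NaN index values are all printed,
# making it visible that None became NaN rather than vanishing.
with pd.option_context('display.multi_sparse', False):
    print(df1)
```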