Find names of n largest values in each row of dataframe - pandas

I can find the n largest values in each row of a numpy array (link) but doing so loses the column information which is what I want. Say I have some data:
import pandas as pd
import numpy as np
np.random.seed(42)
data = np.random.rand(5,5)
data = pd.DataFrame(data, columns = list('abcde'))
data
a b c d e
0 0.374540 0.950714 0.731994 0.598658 0.156019
1 0.155995 0.058084 0.866176 0.601115 0.708073
2 0.020584 0.969910 0.832443 0.212339 0.181825
3 0.183405 0.304242 0.524756 0.431945 0.291229
4 0.611853 0.139494 0.292145 0.366362 0.456070
I want the names of the largest contributors in each row. So for n = 2 the output would be:
0 b c
1 c e
2 b c
3 c d
4 a e
I can do it by looping over the dataframe but that would be inefficient. Is there a more pythonic way?

With the pandas.Series.nlargest function:
data.apply(lambda x: x.nlargest(2).index.values, axis=1)
0 [b, c]
1 [c, e]
2 [b, c]
3 [c, d]
4 [a, e]
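If you would rather have the names in separate columns (like the desired output above) instead of arrays, one option is to have the lambda return a Series; a small sketch, where the column names top1/top2 are purely illustrative:
# returning a Series makes apply expand the result into columns
data.apply(lambda x: pd.Series(x.nlargest(2).index.values, index=['top1', 'top2']), axis=1)
#   top1 top2
# 0    b    c
# 1    c    e
# 2    b    c
# 3    c    d
# 4    a    e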

Another option using numpy.argpartition to find the top n index per row and then extract column names by index:
import numpy as np
n = 2
nlargest_index = np.argpartition(data.values, data.shape[1] - n)[:, -n:]
data.columns.values[nlargest_index]
#array([['c', 'b'],
# ['e', 'c'],
# ['c', 'b'],
# ['d', 'c'],
# ['e', 'a']], dtype=object)
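One caveat: argpartition only guarantees that the n largest values land in the last n slots, not that they are sorted among themselves. If the order matters, a sketch using a full argsort per row (a bit slower, O(m log m) instead of O(m) per row):
n = 2
order = np.argsort(data.values, axis=1)[:, -n:]   # indices of the top n, ascending by value
data.columns.values[order[:, ::-1]]               # reverse so the largest column comes first
#array([['b', 'c'],
#       ['c', 'e'],
#       ['b', 'c'],
#       ['c', 'd'],
#       ['a', 'e']], dtype=object)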

Can a dense ranking be used for this?
N = 2
threshold = len(data.columns) - N
nlargest = data[data.rank(method="dense", axis=1) > threshold]
>>> nlargest
a b c d e
0 NaN 0.950714 0.731994 NaN NaN
1 NaN NaN 0.866176 NaN 0.708073
2 NaN 0.969910 0.832443 NaN NaN
3 NaN NaN 0.524756 0.431945 NaN
4 0.611853 NaN NaN NaN 0.456070
>>> nlargest.stack()
0 b 0.950714
c 0.731994
1 c 0.866176
e 0.708073
2 b 0.969910
c 0.832443
3 c 0.524756
d 0.431945
4 a 0.611853
e 0.456070
dtype: float64
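To get from the stacked Series back to one row of column names per original row (matching the question's desired output), a sketch using groupby on the first index level; note that with tied values the dense ranks are compressed, so the number of selected columns per row is not guaranteed to be exactly N:
nlargest.stack().groupby(level=0).apply(lambda s: list(s.index.get_level_values(1)))
# 0    [b, c]
# 1    [c, e]
# 2    [b, c]
# 3    [c, d]
# 4    [a, e]
# dtype: object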

Related

Construct DataFrame from list of dicts

Trying to construct pandas DataFrame from list of dicts
List of dicts:
a = [{'1': 'A'},
{'2': 'B'},
{'3': 'C'}]
Pass list of dicts into pd.DataFrame():
df = pd.DataFrame(a)
Actual results:
1 2 3
0 A NaN NaN
1 NaN B NaN
2 NaN NaN C
Passing column names explicitly doesn't help either:
pd.DataFrame(a, columns=['Key', 'Value'])
Actual results:
Key Value
0 NaN NaN
1 NaN NaN
2 NaN NaN
Expected results:
Key Value
0 1 A
1 2 B
2 3 C
Try this:
from collections import ChainMap
data = dict(ChainMap(*a))
pd.DataFrame(list(data.items()), columns=['Key', 'Value'])
Output:
Key Value
0 1 A
1 2 B
2 3 C
Something like this with a list comprehension:
pd.DataFrame([(x, y) for i in a for x, y in i.items()], columns=['Key', 'Value'])
Key Value
0 1 A
1 2 B
2 3 C
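Another way to look at it, sketched under the assumption that each dict holds exactly one key/value pair: the NaN-filled frame from the first attempt already contains everything needed, and stacking it drops the NaNs:
s = pd.DataFrame(a).stack()            # one (row, key) pair per dict, NaNs dropped
pd.DataFrame({'Key': s.index.get_level_values(1), 'Value': s.values})
#   Key Value
# 0   1     A
# 1   2     B
# 2   3     C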

How to join pandas dataframes which have a multiindex

Problem Description
I have a dataframe with a multi-index that is three levels deep (0, 1, 2) and I'd like to join this dataframe with another dataframe which is indexed by level 2 of my original dataframe.
In code, I'd like to turn:
pd.DataFrame(['a', 'b', 'c', 'd']).transpose().set_index([0, 1, 2])
and
pd.DataFrame(['c', 'e']).transpose().set_index(0)
into
pd.DataFrame(['a', 'b', 'c', 'd', 'e']).transpose().set_index([0, 1, 2])
What I've tried
I've tried using swaplevel and then join. That didn't work, though some of the error messages suggested that if only I could set the on parameter properly this might work.
I tried concat, but couldn't get this to work either. Not sure it can't work though...
Notes:
I have seen this question in which the answer seems to dodge the question (while solving the problem).
pandas will naturally do this for you if the names of the index levels line up. You can rename the index of the second dataframe and join accordingly.
d1 = pd.DataFrame(['a', 'b', 'c', 'd']).transpose().set_index([0, 1, 2])
d2 = pd.DataFrame(['c', 'e']).transpose().set_index(0)
d1.join(d2.rename_axis(2))
3 1
0 1 2
a b c d e
More Comprehensive Example
d1 = pd.DataFrame([
[1, 2],
[3, 4],
[5, 6],
[7, 8]
], pd.MultiIndex.from_product([['A', 'B'], ['X', 'Y']], names=['One', 'Two']))
d2 = pd.DataFrame([
list('abcdefg')
], ['Y'], columns=list('ABCDEFG'))
d3 = pd.DataFrame([
list('hij')
], ['A'], columns=list('HIJ'))
d1.join(d2.rename_axis('Two')).join(d3.rename_axis('One'))
0 1 A B C D E F G H I J
One Two
A X 1 2 NaN NaN NaN NaN NaN NaN NaN h i j
Y 3 4 a b c d e f g h i j
B X 5 6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Y 7 8 a b c d e f g NaN NaN NaN
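The trick is only about the index name: rename_axis does not touch the index values, it just names the index so that it matches a level of d1's MultiIndex, which is what join aligns on. A quick check (sketch):
d2.index                       # Index(['Y'], dtype='object')           -> unnamed
d2.rename_axis('Two').index    # Index(['Y'], dtype='object', name='Two')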

filter dataframe rows based on length of column values

I have a pandas dataframe as follows:
df = pd.DataFrame([[1, 2], [np.NaN, 1], ['test string1', 5]], columns=['A', 'B'])
df
A B
0 1 2
1 NaN 1
2 test string1 5
I am using pandas 0.20. What is the most efficient way to remove any row where any of the column values has length > 10?
len('test string1')
12
So for the above e.g., I am expecting an output as follows:
df
A B
0 1 2
1 NaN 1
If based on column A
In [865]: df[~(df.A.str.len() > 10)]
Out[865]:
A B
0 1 2
1 NaN 1
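A sketch of why the NaN row (and the numeric row) survive here: .str.len() returns NaN for non-string entries, and NaN > 10 evaluates to False, so the negation keeps those rows:
df.A.str.len()
# 0     NaN    <- 1 is not a string, so .str.len() gives NaN
# 1     NaN
# 2    12.0
# dtype: float64
df.A.str.len() > 10
# 0    False   <- NaN > 10 is False, so ~(...) keeps rows 0 and 1
# 1    False
# 2     True
# dtype: bool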
If based on all columns
In [866]: df[~df.applymap(lambda x: len(str(x)) > 10).any(axis=1)]
Out[866]:
A B
0 1 2
1 NaN 1
I had to cast to a string for Diego's answer to work:
df = df[df['A'].apply(lambda x: len(str(x)) <= 10)]
In [42]: df
Out[42]:
A B C D
0 1 2 2 2017-01-01
1 NaN 1 NaN 2017-01-02
2 test string1 5 test string1test string1 2017-01-03
In [43]: df.dtypes
Out[43]:
A object
B int64
C object
D datetime64[ns]
dtype: object
In [44]: df.loc[~df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)]
Out[44]:
A B C D
0 1 2 2 2017-01-01
1 NaN 1 NaN 2017-01-02
Explanation:
df.select_dtypes(['object']) selects only columns of object (str) dtype:
In [45]: df.select_dtypes(['object'])
Out[45]:
A C
0 1 2
1 NaN NaN
2 test string1 test string1test string1
In [46]: df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10))
Out[46]:
A C
0 False False
1 False False
2 True True
Now we can "aggregate" it as follows:
In [47]: df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)
Out[47]:
0 False
1 False
2 True
dtype: bool
Finally, we can select only those rows where the value is False:
In [48]: df.loc[~df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)]
Out[48]:
A B C D
0 1 2 2 2017-01-01
1 NaN 1 NaN 2017-01-02
Use the apply function of the Series in order to keep the rows you want:
df = df[df['A'].apply(lambda x: len(x) <= 10)]

pandas set_index with NA and None values seem to be not working

I am trying to index a pandas DataFrame using columns with occasional NA and None in them. This seems to be failing. In the example below, df0 has (None,e) combination on index 3, but df1 has (NaN,e). Any suggestions?
import pandas as pd
import numpy as np
df0 = pd.DataFrame({'k1':['4',np.NaN,'6',None,np.NaN], 'k2':['a','d',np.NaN,'e',np.NaN], 'v':[1,2,3,4,5]})
df1 = df0.copy().set_index(['k1','k2'])
>>> df0
Out[3]:
k1 k2 v
0 4 a 1
1 NaN d 2
2 6 NaN 3
3 None e 4
4 NaN NaN 5
>>> df1
Out[4]:
v
k1 k2
4 a 1
NaN d 2
6 NaN 3
NaN e 4
NaN 5
Edit: I see the point, so this is the expected behavior.
This is expected behaviour: the None value is being converted to NaN, and because the value is duplicated it isn't being shown in the repr:
In [31]:
df1.index
Out[31]:
MultiIndex(levels=[['4', '6'], ['a', 'd', 'e']],
labels=[[0, -1, 1, -1, -1], [0, 1, -1, 2, -1]],
names=['k1', 'k2'])
From the above you can see that -1 is used as the label for NaN values. As for the display, if your df were like the following, the output would show the same behaviour:
In [34]:
df0 = pd.DataFrame({'k1':['4',np.NaN,'6',1,1], 'k2':['a','d',np.NaN,'e',np.NaN], 'v':[1,2,3,4,5]})
df1 = df0.copy().set_index(['k1','k2'])
df1
Out[34]:
v
k1 k2
4 a 1
NaN d 2
6 NaN 3
1 e 4
NaN 5
You can see that 1 is repeated for the last two rows.
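A quick way to confirm the values really are in the index and only the display is collapsing them (a sketch; the exact repr can vary with the pandas version):
df1.index.get_level_values('k1')
# Index(['4', nan, '6', 1, 1], dtype='object', name='k1')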

adding two series with missing data

I would like to add series a and b together. I know that the keys in series b are always in a, but there will be keys in a that are not in b. I don't want nans to appear in the result.
For example:
In[114] a
Out[114]:
a 1
b 3
c 2
d 5
dtype: int64
In[115] b
Out[115]:
b 3
c 2
dtype: int64
If I just use the add function, I will get nans in the missing locations.
In[116] a.add(b)
Out[116]:
a NaN
b 6
c 4
d NaN
dtype: float64
The following is the result I desire:
In[117] c
Out[117]:
a 1
b 6
c 4
d 5
dtype: int64
Is there a clever way to do this?
Use update and a nested add; this should give you the desired output. I have tested it in IPython and it works.
d = {'d': 5, 'b': 3, 'c': 2, 'a': 1}
e = {'b': 3, 'c': 2}
Convert to Series:
ds = pd.Series(d)
es = pd.Series(e)
ds.update(ds.add(es))
ds
Out[49]:
a 1
b 6
c 4
d 5
dtype: int64
>>> a.add(b, fill_value=0)
a 1
b 6
c 4
d 5
dtype: float64
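Note that the alignment step introduces NaN before fill_value is applied, which is why the result above comes back as float64. If you want the int64 dtype from the desired output, one way (a sketch, safe here because no NaN remain after the add) is to cast afterwards:
a.add(b, fill_value=0).astype(int)
# a    1
# b    6
# c    4
# d    5
# dtype: int64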