Pandas apply to index

I have a df that has some columns and a multi-index whose labels are bytes. To clean the columns I can do
for c in df.columns:
    df[c] = df[c].apply(lambda x: x.decode('UTF-8'))
and for a single-level index this works:
df.index.map(lambda x: x.decode('UTF-8'))
But it appears to fail with a multi-index. Is there anything similar I can do for the multi-index?
EDIT:
Example:
pd.DataFrame.from_dict({'val': {(b'A', b'a'): 1,
                                (b'A', b'b'): 2,
                                (b'B', b'a'): 3,
                                (b'B', b'b'): 4,
                                (b'B', b'c'): 5}})
and the desired output is
pd.DataFrame.from_dict({'val': {('A', 'a'): 1,
                                ('A', 'b'): 2,
                                ('B', 'a'): 3,
                                ('B', 'b'): 4,
                                ('B', 'c'): 5}})

Method 1:
df.index = pd.MultiIndex.from_tuples([(x[0].decode('utf-8'), x[1].decode('utf-8')) for x in df.index])
%timeit result: 1000 loops, best of 3: 573 µs per loop
Method 2:
df.reset_index().set_index('val').applymap(lambda x: x.decode('utf-8')).reset_index().set_index(['level_0', 'level_1'])
%timeit result: 100 loops, best of 3: 4.17 ms per loop

df.index = df.index.set_levels([level.map(lambda x: x.decode('UTF-8')) for level in df.index.levels])
OUTPUT:
     val
A a    1
  b    2
B a    3
  b    4
  c    5
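In recent pandas versions, Index.map also works directly on a MultiIndex when the mapper returns tuples, so the whole index can be decoded in one call. A minimal sketch of that idea (assuming the example frame from the question):

import pandas as pd

df = pd.DataFrame.from_dict({'val': {(b'A', b'a'): 1,
                                     (b'A', b'b'): 2,
                                     (b'B', b'a'): 3,
                                     (b'B', b'b'): 4,
                                     (b'B', b'c'): 5}})

# Each label is passed to the mapper as a tuple; returning tuples keeps the MultiIndex.
df.index = df.index.map(lambda t: tuple(x.decode('UTF-8') for x in t))
print(df)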

Sort DataFrame asc/desc based on column value

I have this DataFrame
df = pd.DataFrame({'A': [100, 100, 300, 200, 200, 200], 'B': [60, 55, 12, 32, 15, 44], 'C': ['x', 'x', 'y', 'y', 'y', 'y']})
and I want to sort it by columns "A" and "B". "A" is always ascending. "B" should be ascending where "C" == "x" and descending where "C" == "y". So it would end up like this:
df_sorted = pd.DataFrame({'A': [100, 100, 200, 200, 200, 300], 'B': [55, 60, 44, 32, 15, 12], 'C': ['x', 'x', 'y', 'y', 'y', 'y']})
I would split the DataFrame into two DataFrames based on the value of "C":
df_x = df.loc[df['C'] == 'x']
df_y = df.loc[df['C'] == 'y']
and then use "sort_values" like so:
df_x.sort_values(by=['A', 'B'], inplace=True)
Sorting df_y is different since you want one column ascending and the other descending. Since a stable sort preserves the order of ties, you can sort by "B" descending first and then by "A" ascending with kind='mergesort':
df_y.sort_values(by=['B'], inplace=True, ascending=False)
df_y.sort_values(by=['A'], inplace=True, kind='mergesort')
You can then concatenate the DataFrames back together and sort again by "A" with a stable kind, and the order within each group is preserved.
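Putting that together, a minimal end-to-end sketch of the split/sort/concatenate idea (it uses the ascending=[True, False] shortcut for df_y, and pd.concat plus a stable mergesort for the final pass):

import pandas as pd

df = pd.DataFrame({'A': [100, 100, 300, 200, 200, 200],
                   'B': [60, 55, 12, 32, 15, 44],
                   'C': ['x', 'x', 'y', 'y', 'y', 'y']})

# Split by the value of C and sort each part in its own direction.
df_x = df.loc[df['C'] == 'x'].sort_values(by=['A', 'B'])
df_y = df.loc[df['C'] == 'y'].sort_values(by=['A', 'B'], ascending=[True, False])

# Concatenate and re-sort by A with a stable algorithm so the
# within-group order of B is kept.
df_sorted = (pd.concat([df_x, df_y])
               .sort_values(by='A', kind='mergesort')
               .reset_index(drop=True))
print(df_sorted)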
You can set up a temporary column that negates the values of "B" when "C" equals "y", sort, and drop the column:
(df.assign(B2=df['B'] * df['C'].eq('x').mul(2).sub(1))  # +B where C == 'x', -B where C == 'y'
   .sort_values(by=['A', 'B2'])
   .drop('B2', axis=1)
)
def function1(dd: pd.DataFrame):
    return dd.sort_values(['A', 'B']) if dd.name == 'x' else dd.sort_values(['A', 'B'], ascending=[True, False])

df.groupby('C').apply(function1).reset_index(drop=True)
     A   B  C
0  100  55  x
1  100  60  x
2  200  44  y
3  200  32  y
4  200  15  y
5  300  12  y

Second-level aggregation in pandas

I have a simple example:
DF = pd.DataFrame(
    {"F1": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
     "F2": [1, 2, 1, 2, 2, 3, 1, 2, 3, 2],
     "F3": ['xx', 'yy', 'zz', 'zz', 'zz', 'xx', 'yy', 'zz', 'zz', 'zz']})
DF
How can I improve the code so that the F3-unique column shows, in addition to the unique values of F3 in each group, the number of times each value appears in the group?
Use .groupby() + .sum() + value_counts() + .agg():
df2 = DF.groupby('F1')['F2'].sum()
df3 = (DF.groupby(['F1', 'F3'])['F3']
         .value_counts()
         .reset_index([2], name='count')
         .apply(lambda x: x['F3'] + '-' + str(x['count']), axis=1)
      )
df4 = df3.groupby(level=0).agg(' '.join)
df4.name = 'F3'
df_out = pd.concat([df2, df4], axis=1).reset_index()
Result:
print(df_out)
F1 F2 F3
0 A 4 xx-1 yy-1 zz-1
1 B 7 xx-1 zz-2
2 C 8 yy-1 zz-3
Seems like groupby + aggregate with Python's collections.Counter could work well here:
from collections import Counter

df2 = DF.groupby('F1', as_index=False).aggregate({
    'F2': 'sum',
    'F3': lambda g: ' '.join([f'{k}-{v}' for k, v in Counter(g).items()])
})
df2:
F1 F2 F3
0 A 4 xx-1 yy-1 zz-1
1 B 7 zz-2 xx-1
2 C 8 yy-1 zz-3
Aggregating with Counter turns each group into a dictionary that maps each unique value to its count:
df2 = DF.groupby('F1', as_index=False).aggregate({
    'F2': 'sum',
    'F3': Counter
})
F1 F2 F3
0 A 4 {'xx': 1, 'yy': 1, 'zz': 1}
1 B 7 {'zz': 2, 'xx': 1}
2 C 8 {'yy': 1, 'zz': 3}
The surrounding comprehension then formats each Counter for display:
Sample with 1 row:
' '.join([f'{k}-{v}' for k, v in Counter({'xx': 1, 'yy': 1, 'zz': 1}).items()])
xx-1 yy-1 zz-1
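The same result can also be written with pandas' named aggregation syntax (agg with keyword=(column, function) pairs, available since pandas 0.25); a minimal sketch assuming the DF from the question:

from collections import Counter

df2 = DF.groupby('F1', as_index=False).agg(
    F2=('F2', 'sum'),
    F3=('F3', lambda g: ' '.join(f'{k}-{v}' for k, v in Counter(g).items())),
)
print(df2)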

Remove blank key entries in Pandas pd.to_dict

Pandas has a very nice feature to export our dataframes to a list of dicts via DataFrame.to_dict('records').
For example:
d = pd.DataFrame({'a':[1,2,3], 'b':['a', 'b', None]})
d.to_dict('records')
returns
[{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': None}]
For my use case, I would prefer the following output:
[{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3}]
where you can see the key 'b' is removed from the third entry. This is the default behavior in R when using jsonlite, and I am wondering how I would remove keys with missing values from each entry.
We can do it with stack:
l=d.stack().groupby(level=0).agg(lambda x : x.reset_index(level=0,drop=True).to_dict()).tolist()
Out[142]: [{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3}]
Update: using a list comprehension over itertuples with a nested dict comprehension. It is the fastest:
l = [{k: v for k, v in tup._asdict().items() if v is not None}
     for tup in d.itertuples(index=False)]
Out[74]: [{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3}]
Timing:
d1 = pd.concat([d]*5000, ignore_index=True)
In [76]: %timeit [{k: v for k, v in tup._asdict().items() if v is not None} for
...: tup in d1.itertuples(index=False)]
442 ms ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Another way is a list comprehension with iterrows:
l = [row.dropna().to_dict() for k, row in d.iterrows()]
Out[33]: [{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3}]
iterrows has a reputation for slow performance. I tested on a sample of 15,000 rows to compare against stack:
In [49]: d1 = pd.concat([d]*5000, ignore_index=True)
In [50]: %timeit d1.stack().groupby(level=0).agg(lambda x : x.reset_index(level
...: =0,drop=True).to_dict()).tolist()
7.52 s ± 370 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [51]: %timeit [row.dropna().to_dict() for k, row in d1.iterrows()]
6.45 s ± 60.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Interesting result. However, I think that with bigger data it will be slower than stack.
Dump everything into dictionaries with to_dict('records'); it should be faster, and easier with a list comprehension:
[{key: value
  for key, value in entry.items()
  if value is not None}
 for entry in d.to_dict('records')]
[{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3}]
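Note that the "v is not None" check only catches None; if the missing values are float NaN (which is what numeric columns typically hold), a pd.notna test can be used instead. A small sketch of that variant, built on the same d as above:

import pandas as pd

d = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', None]})

# pd.notna is False for both None and NaN, so either kind of missing value is dropped.
l = [{k: v for k, v in row.items() if pd.notna(v)}
     for row in d.to_dict('records')]
print(l)  # [{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3}]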
Here is another approach (if performance is not an issue) with apply and Series.dropna():
d.apply(lambda x: x.dropna().to_dict(),axis=1).tolist()
[{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3}]

How to find elements of series containing a list?

I can find Series cells matching tuples...
>>> s = pd.Series([(1,2,3),(4,5,6)], index=[1,2])
>>> print s[s==(1,2,3)]
1 (1, 2, 3)
dtype: object
How do I do the same for lists:
>>> s = pd.Series([[1,2,3],[4,5,6]], index=[1,2])
>>> print s[s==[1,2,3]]
ValueError: Arrays were different lengths: 2 vs 3
Easy Approach
s[s.apply(tuple) == (1, 2, 3)]
1 [1, 2, 3]
dtype: object
Less Easy
Assumes all sub-lists are the same length
import numpy as np

def contains_list(s, l):
    a = np.array(s.values.tolist())  # requires all sub-lists to have the same length
    return (a == l).all(1)
s[contains_list(s, [1, 2, 3])]
1 [1, 2, 3]
dtype: object
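If the sub-lists are not all the same length, the array trick above will not work, but an element-wise comparison with apply still does; a minimal sketch:

import pandas as pd

s = pd.Series([[1, 2, 3], [4, 5, 6, 7]], index=[1, 2])

# Comparing two Python lists with == gives a plain bool, so the result is a boolean mask.
mask = s.apply(lambda x: x == [1, 2, 3])
print(s[mask])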
Timing
Assume a larger series
s = pd.Series([[1,2,3],[4,5,6]] * 1000)
%timeit s[pd.DataFrame(s.values.tolist(), index=s.index).isin([1,2,3]).all(1)]
100 loops, best of 3: 2.22 ms per loop
%timeit s[contains_list(s, [1, 2, 3])]
1000 loops, best of 3: 1.01 ms per loop
%timeit s[s.apply(tuple) == (1, 2, 3)]
1000 loops, best of 3: 1.07 ms per loop
alternative solution:
In [352]: s[pd.DataFrame(s.values.tolist(), index=s.index).isin([1,2,3]).all(1)]
Out[352]:
1 [1, 2, 3]
dtype: object
step-by-step:
In [353]: pd.DataFrame(s.values.tolist(), index=s.index)
Out[353]:
0 1 2
1 1 2 3
2 4 5 6
In [354]: pd.DataFrame(s.values.tolist(), index=s.index).isin([1,2,3])
Out[354]:
0 1 2
1 True True True
2 False False False
In [355]: pd.DataFrame(s.values.tolist(), index=s.index).isin([1,2,3]).all(1)
Out[355]:
1 True
2 False
dtype: bool

Pandas groupby(dictionary) not returning intended result

I'm trying to group the following data:
>>> a=[{'A': 1, 'B': 2, 'C': 3, 'D':4, 'E':5, 'F':6},{'A': 2, 'B': 3, 'C': 4, 'D':5, 'E':6, 'F':7},{'A': 3, 'B': 4, 'C': 5, 'D':6, 'E':7, 'F':8}]
>>> df = pd.DataFrame(a)
>>> df
A B C D E F
0 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
With the following dictionary:
groupDict = {'A': 1, 'B': 1, 'C': 1, 'D': 2, 'E': 2, 'F': 2}
such that
df.groupby(groupDict).groups
will output
{1: ['A', 'B', 'C'], 2: ['D', 'E', 'F']}
Needed to add the axis argument to groupby:
>>> grouped = df.groupby(groupDict,axis=1)
>>> grouped.groups
{1: ['A', 'B', 'C'], 2: ['D', 'E', 'F']}
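Note that grouping columns with axis=1 is deprecated in recent pandas (2.1+); grouping the transposed frame should give the same mapping. A minimal sketch, assuming the df and groupDict from above:

# Equivalent without axis=1: the original column labels become the row index being grouped.
grouped = df.T.groupby(groupDict)
print(grouped.groups)  # {1: ['A', 'B', 'C'], 2: ['D', 'E', 'F']}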