Pandas groupby(dictionary) not returning intended result

I'm trying to group the following data:
>>> a=[{'A': 1, 'B': 2, 'C': 3, 'D':4, 'E':5, 'F':6},{'A': 2, 'B': 3, 'C': 4, 'D':5, 'E':6, 'F':7},{'A': 3, 'B': 4, 'C': 5, 'D':6, 'E':7, 'F':8}]
>>> df = pd.DataFrame(a)
>>> df
A B C D E F
0 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
with the following dictionary:
groupDict={'A':1,'B':1,'C':1,'D':2,'E':2,'F':2}
such that
df.groupby(groupDict).groups
will output
{1:['A','B','C'],2:['D','E','F']}

You need to pass the axis argument to groupby so that the dictionary keys are matched against the column labels instead of the row index:
>>> grouped = df.groupby(groupDict,axis=1)
>>> grouped.groups
{1: ['A', 'B', 'C'], 2: ['D', 'E', 'F']}
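
As far as I know, newer pandas releases (2.1 onward) deprecate the axis argument of groupby. A minimal sketch of an equivalent approach, assuming it is acceptable to transpose the frame so that the column labels become the index (the name group_map is just illustrative):

```python
import pandas as pd

a = [{'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6},
     {'A': 2, 'B': 3, 'C': 4, 'D': 5, 'E': 6, 'F': 7},
     {'A': 3, 'B': 4, 'C': 5, 'D': 6, 'E': 7, 'F': 8}]
df = pd.DataFrame(a)

# Maps each column label to the group it belongs to.
group_map = {'A': 1, 'B': 1, 'C': 1, 'D': 2, 'E': 2, 'F': 2}

# After transposing, the column labels are the index, so a plain
# row-wise groupby can look them up in the dictionary.
groups = df.T.groupby(group_map).groups

# .groups maps each group key to an Index of labels; convert to lists.
column_groups = {key: list(labels) for key, labels in groups.items()}
print(column_groups)
```

To aggregate per group, something like df.T.groupby(group_map).sum().T should give one column per group key.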

Related

Sort a dictionary in a column in pandas

I have a dataframe as shown below.
user_id Recommended_modules Remaining_modules
1 {A:[5,11], B:[4]} {A:2, B:1}
2 {A:[8,4,2], B:[5], C:[6,8]} {A:7, B:1, C:2}
3 {A:[2,3,9], B:[8]} {A:5, B:1}
4 {A:[8,4,2], B:[5,1,2], C:[6]} {A:3, B:4, C:1}
Brief about the dataframe:
In the column Recommended_modules A, B and C are courses and the numbers inside the list are modules.
Key(Remaining_modules) = Course name
value(Remaining_modules) = Number of modules remaining in that course
From the above I would like to reorder the recommended_modules column based on the values in the Remaining_modules as shown below.
Expected Output:
user_id Ordered_Recommended_modules Ordered_Remaining_modules
1 {B:[4], A:[5,11]} {B:1, A:2}
2 {B:[5], C:[6,8], A:[8,4,2]} {B:1, C:2, A:7}
3 {B:[8], A:[2,3,9]} {B:1, A:5}
4 {C:[6], A:[8,4,2], B:[5,1,2]} {C:1, A:3, B:4}
Explanation:
For user_id = 2, Remaining_modules = {A:7, B:1, C:2}, sort like this {B:1, C:2, A:7}
similarly arrange Recommended_modules also in the same order as shown below
{B:[5], C:[6,8], A:[8,4,2]}.

This is possible, but it relies on dictionaries preserving insertion order, so you need Python 3.6+ (the ordering is guaranteed by the language from 3.7):
def f(x):
    # sort Remaining_modules by value: https://stackoverflow.com/a/613218/2901002
    d1 = {k: v for k, v in sorted(x['Remaining_modules'].items(), key=lambda item: item[1])}
    L = d1.keys()
    # reorder Recommended_modules by the sorted keys: https://stackoverflow.com/a/21773891/2901002
    d2 = {key: x['Recommended_modules'][key] for key in L if key in x['Recommended_modules']}
    x['Remaining_modules'] = d1
    x['Recommended_modules'] = d2
    return x

df = df.apply(f, axis=1)
print (df)
user_id Recommended_modules \
0 1 {'B': [4], 'A': [5, 11]}
1 2 {'B': [5], 'C': [6, 8], 'A': [8, 4, 2]}
2 3 {'B': [8], 'A': [2, 3, 9]}
3 4 {'C': [6], 'A': [8, 4, 2], 'B': [5, 1, 2]}
Remaining_modules
0 {'B': 1, 'A': 2}
1 {'B': 1, 'C': 2, 'A': 7}
2 {'B': 1, 'A': 5}
3 {'C': 1, 'A': 3, 'B': 4}
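
To see what happens to a single row, here is a small sketch of the same two steps on plain dictionaries, using the values for user_id = 2 from the question (the variable names are only illustrative):

```python
# Values for user_id = 2 from the question.
remaining = {'A': 7, 'B': 1, 'C': 2}
recommended = {'A': [8, 4, 2], 'B': [5], 'C': [6, 8]}

# Step 1: rebuild the dict with items sorted by value (ascending).
ordered_remaining = dict(sorted(remaining.items(), key=lambda item: item[1]))

# Step 2: rebuild the second dict following the key order of the first,
# skipping any course that has no recommended modules.
ordered_recommended = {
    course: recommended[course]
    for course in ordered_remaining
    if course in recommended
}

print(ordered_remaining)    # keys in order B, C, A
print(ordered_recommended)  # keys in order B, C, A
```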

Compare two columns in different data frames in pandas

I have two tables, as shown below.
user table:
user_id courses attended_modules
1 [A] {A:[1,2,3,4,5,6]}
2 [A,B,C] {A:[8], B:[5], C:[6]}
3 [A,B] {A:[2,3,9], B:[10]}
4 [A] {A:[3]}
5 [B] {B:[5]}
6 [A] {A:[3]}
7 [B] {B:[5]}
8 [A] {A:[4]}
Course table:
course_id modules
A [1,2,3,4,5,6,8,9]
B [5,8]
C [6,10]
From the above, compare attended_modules in the user table with modules in the course table, and create a new column Remaining_modules in the user table as explained below.
Example: user_id = 1 attended course A and completed 6 of its modules; the course has 8 modules, so Remaining_modules = {A:2}.
Similarly, for user_id = 2, Remaining_modules = {A:7, B:1, C:1}.
And so on...
Expected Output:
user_id attended_modules #Remaining_modules
1 {A:[1,2,3,4,5,6]} {A:2}
2 {A:[8], B:[5], C:[6]} {A:7, B:1, C:1}
3 {A:[2,3,9], B:[8]} {A:5, B:1}
4 {A:[3]} {A:7}
5 {B:[5]} {B:1}
6 {A:[3]} {A:7}
7 {B:[5]} {B:1}
8 {A:[4]} {A:7}

The idea is to count, for each course, the modules the user has not attended, by summing the True values produced by a generator expression:
df2 = df2.set_index('course_id')
mo = df2['modules'].to_dict()
#print (mo)
def f(x):
    return {k: sum(i not in v for i in mo[k]) for k, v in x.items()}

df1['Remaining_modules'] = df1['attended_modules'].apply(f)
print (df1)
user_id courses attended_modules Remaining_modules
0 1 [A] {'A': [1, 2, 3, 4, 5, 6]} {'A': 2}
1 2 [A,B,C] {'A': [8], 'B': [5], 'C': [6]} {'A': 7, 'B': 1, 'C': 1}
2 3 [A,B] {'A': [2, 3, 9], 'B': [10]} {'A': 5, 'B': 2}
3 4 [A] {'A': [3]} {'A': 7}
4 5 [B] {'B': [5]} {'B': 1}
5 6 [A] {'A': [3]} {'A': 7}
6 7 [B] {'B': [5]} {'B': 1}
7 8 [A] {'A': [4]} {'A': 7}
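
A self-contained sketch of the same counting approach on the first three users from the question; the names users, courses, modules_by_course and remaining are assumptions made for this example:

```python
import pandas as pd

courses = pd.DataFrame({
    'course_id': ['A', 'B', 'C'],
    'modules': [[1, 2, 3, 4, 5, 6, 8, 9], [5, 8], [6, 10]],
})
users = pd.DataFrame({
    'user_id': [1, 2, 3],
    'attended_modules': [
        {'A': [1, 2, 3, 4, 5, 6]},
        {'A': [8], 'B': [5], 'C': [6]},
        {'A': [2, 3, 9], 'B': [10]},
    ],
})

# Lookup table: course id -> list of all modules in that course.
modules_by_course = courses.set_index('course_id')['modules'].to_dict()

def remaining(attended):
    # For each course, count the course modules missing from the attended list.
    # True counts as 1 when summed.
    return {
        course: sum(module not in done for module in modules_by_course[course])
        for course, done in attended.items()
    }

users['Remaining_modules'] = users['attended_modules'].apply(remaining)
print(users)
```

Note that an attended module that does not exist in the course (module 10 in course B for user_id = 3) is simply ignored, which is why that user gets {'A': 5, 'B': 2}.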

Multi-column label-encoding: Print mappings

The following code can be used to transform strings into categorical labels:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame([['A','B','C','D','E','F','G','I','K','H'],
['A','E','H','F','G','I','K','','',''],
['A','C','I','F','H','G','','','','']],
columns=['A1', 'A2', 'A3','A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10'])
pd.DataFrame(columns=df.columns, data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape))
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
0 1 2 3 4 5 6 7 9 10 8
1 1 5 8 6 7 9 10 0 0 0
2 1 3 9 6 8 7 0 0 0 0
Question:
How can I query the mappings (it appears they are sorted alphabetically)?
I.e. a list like:
A: 1
B: 2
C: 3
...
I: 9
K: 10
Thank you!

Yes, it's possible if you define the LabelEncoder separately and query its classes_ attribute later.
import numpy as np
le = LabelEncoder()
data = le.fit_transform(df.values.flatten())
dict(zip(le.classes_[1:], np.arange(1, len(le.classes_))))
{'A': 1,
'B': 2,
'C': 3,
'D': 4,
'E': 5,
'F': 6,
'G': 7,
'H': 8,
'I': 9,
'K': 10}
The classes_ attribute stores the unique classes in sorted order, and each class is encoded as its position in that array.
le.classes_
array(['', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K'], dtype=object)
So the first element (the empty string here) is encoded as 0, the second ('A') as 1, and so on.
To reverse encodings, use le.inverse_transform.
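
A short sketch tying these pieces together on a small made-up list of labels (the data and the names values, mapping and decoded are assumptions for illustration): the position of each entry in classes_ is its code, and inverse_transform maps codes back to labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample labels, including empty strings like the question's data.
values = ['A', 'B', 'C', '', 'A', 'K', '']

le = LabelEncoder()
codes = le.fit_transform(values)

# classes_ is sorted, and the index of each class is its encoded value.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# Round trip: codes back to the original labels.
decoded = le.inverse_transform(codes)
print(list(decoded))
```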

You can also build the mapping with LabelEncoder.transform:
le=LabelEncoder()
le.fit(df.values.flatten())
dict(zip(df.values.flatten(),le.transform(df.values.flatten()) ))
Out[137]:
{'': 0,
'A': 1,
'B': 2,
'C': 3,
'D': 4,
'E': 5,
'F': 6,
'G': 7,
'H': 8,
'I': 9,
'K': 10}

Pandas dataframe to json with key

I have a dataframe with columns ['a', 'b', 'c']
and would like to export it as a dictionary as follows:
{ 'value of a' : { 'b': 3, 'c': 7},
'value2 of a' : { 'b': 7, 'c': 9}
}

I believe you need set_index with DataFrame.to_dict:
df = pd.DataFrame({'a':list('ABC'),
'b':[4,5,4],
'c':[7,8,9]})
print (df)
a b c
0 A 4 7
1 B 5 8
2 C 4 9
d = df.set_index('a').to_dict('index')
print (d)
{'A': {'b': 4, 'c': 7}, 'B': {'b': 5, 'c': 8}, 'C': {'b': 4, 'c': 9}}
And for JSON, use DataFrame.to_json:
j = df.set_index('a').to_json(orient='index')
print (j)
{"A":{"b":4,"c":7},"B":{"b":5,"c":8},"C":{"b":4,"c":9}}
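
As a quick check that the two exports describe the same structure, the JSON string can be parsed back and compared with the dictionary. This is only a sketch on the sample frame above; the name indexed is illustrative:

```python
import json

import pandas as pd

df = pd.DataFrame({'a': list('ABC'),
                   'b': [4, 5, 4],
                   'c': [7, 8, 9]})

# Column 'a' becomes the index, so its values become the outer keys.
indexed = df.set_index('a')

d = indexed.to_dict(orient='index')
j = indexed.to_json(orient='index')

# Parsing the JSON text gives back the same nested mapping.
parsed = json.loads(j)
print(parsed == d)
```

This assumes the values in column 'a' are unique; to my knowledge, to_dict with orient='index' raises an error on a non-unique index.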

How to merge columns using mask

I am trying to merge two columns (phone1 and phone2).
Here is my fake data:
import pandas as pd
employee = {'EmployeeID' : [0, 1, 2, 3, 4, 5, 6, 7],
'LastName' : ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
'Name' : ['w', 'x', 'y', 'z', None, None, None, None],
'phone1' : [1, 1, 2, 2, 4, 5, 6, 6],
'phone2' : [None, None, 3, 3, None, None, 7, 7],
'level_15' : [0, 1, 0, 1, 0, 0, 0, 1]}
df2 = pd.DataFrame(employee)
and I want the 'phone' column to be
'phone' : [1, 2, 3, 4, 5, 7, 9, 10]
At the beginning of my code, I split the names on '/', and the code below creates a column of 0s and 1s which I use as a mask for other tasks throughout my code.
df2 = (df2.set_index(cols)['name'].str.split('/',expand=True).stack().reset_index(name='Name'))
m = df2['level_15'].eq(0)
print (m)
#remove column level_15
df2 = df2.drop(['level_15'], axis=1)
#add last name for select first letter by condition, replace NaNs by forward fill
df2['last_name'] = df2['name'].str[:2].where(m).ffill()
df2['name'] = df2['name'].mask(m, df2['name'].str[2:])
I feel like there is a way to merge phone1 and phone2 using the 0s and 1s, but I can't figure it out. Thank you.

First, start by filling in the NaNs:
import numpy as np
df2['phone2'] = df2.phone2.fillna(df2.phone1)
# Alternatively, based on your latest update
# df2['phone2'] = df2.phone2.mask(df2.phone2.eq(0)).fillna(df2.phone1)
You can then use np.where to merge the columns on odd/even indices:
df2['phone'] = np.where(np.arange(len(df2)) % 2 == 0, df2.phone1, df2.phone2)
df2 = df2.drop(['phone1', 'phone2'], axis=1)
df2
df2
EmployeeID LastName Name phone
0 0 a w 1
1 1 b x 2
2 2 c y 3
3 3 d z 4
4 4 e None 5
5 5 f None 6
6 6 g None 7
7 7 h None 8
Or, with Series.where/mask:
df2['phone'] = df2.pop('phone1').where(
    np.arange(len(df2)) % 2 == 0, df2.pop('phone2')
)
Or,
df2['phone'] = df2.pop('phone1').mask(
    np.arange(len(df2)) % 2 != 0, df2.pop('phone2')
)
df2
EmployeeID LastName Name phone
0 0 a w 1
1 1 b x 2
2 2 c y 3
3 3 d z 4
4 4 e None 5
5 5 f None 6
6 6 g None 7
7 7 h None 8
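
For reference, a self-contained sketch of the fill-then-select approach, run on the phone1/phone2 values exactly as given in the question (only the columns needed here are included; phone2_filled and even_row are illustrative names). With that data the merged column comes out as 1, 1, 2, 3, 4, 5, 6, 7, which differs from the output shown above, presumably because the question's data was edited after the answer was written.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'EmployeeID': [0, 1, 2, 3, 4, 5, 6, 7],
    'phone1': [1, 1, 2, 2, 4, 5, 6, 6],
    'phone2': [None, None, 3, 3, None, None, 7, 7],
})

# Fill the gaps in phone2 with the value from phone1 on the same row.
phone2_filled = df2['phone2'].fillna(df2['phone1'])

# Boolean mask by row position: True on rows 0, 2, 4, ...
even_row = np.arange(len(df2)) % 2 == 0

# Take phone1 on even rows and the filled phone2 on odd rows.
df2['phone'] = np.where(even_row, df2['phone1'], phone2_filled)
df2 = df2.drop(columns=['phone1', 'phone2'])
print(df2)
```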
