Compare the two column in different data frame in pandas - pandas

I have two table as shown below
user table:
user_id courses attended_modules
1 [A] {A:[1,2,3,4,5,6]}
2 [A,B,C] {A:[8], B:[5], C:[6]}
3 [A,B] {A:[2,3,9], B:[10]}
4 [A] {A:[3]}
5 [B] {B:[5]}
6 [A] {A:[3]}
7 [B] {B:[5]}
8 [A] {A:[4]}
Course table:
course_id modules
A [1,2,3,4,5,6,8,9]
B [5,8]
C [6,10]
From the above compare the attended_module in user table with modules in course table. Create a new column in user table Remaining_module as explained below.
Example: user_id = 1, attended the course A, and attended 6 modules, there are 8 modules in course so Remaining_module = {A:2}
Similarly for user_id = 2, Remaining_module = {A:7, B:1, C:1}
And So on...
Expected Output:
user_id attended_modules #Remaining_modules
1 {A:[1,2,3,4,5,6]} {A:2}
2 {A:[8], B:[5], C:[6]} {A:7, B:1, C:1}
3 {A:[2,3,9], B:[8]} {A:5, B:1}
4 {A:[3]} {A:7}
5 {B:[5]} {B:1}
6 {A:[3]} {A:7}
7 {B:[5]} {B:1}
8 {A:[4]} {A:7}

Idea is compare matched values of generator and sum True values:
df2 = df2.set_index('course_id')
mo = df2['modules'].to_dict()
#print (mo)
def f(x):
return {k: sum(i not in v for i in mo[k]) for k, v in x.items()}
df1['Remaining_modules'] = df1['attended_modules'].apply(f)
print (df1)
user_id courses attended_modules Remaining_modules
0 1 [A] {'A': [1, 2, 3, 4, 5, 6]} {'A': 2}
1 2 [A,B,C] {'A': [8], 'B': [5], 'C': [6]} {'A': 7, 'B': 1, 'C': 1}
2 3 [A,B] {'A': [2, 3, 9], 'B': [10]} {'A': 5, 'B': 2}
3 4 [A] {'A': [3]} {'A': 7}
4 5 [B] {'B': [5]} {'B': 1}
5 6 [A] {'A': [3]} {'A': 7}
6 7 [B] {'B': [5]} {'B': 1}
7 8 [A] {'A': [4]} {'A': 7}

Related

how to groupby and create list of strings

I have:
df=pd.DataFrame({'a':[1,1,2],'b':[[1,2,3],[2,5],[3]],'c':['f','df','ere']})
df
a b c
0 1 [1, 2, 3] f
1 1 [2, 5] df
2 2 [3] ere
I want to concatenate and create a list on each element:
pd.DataFrame({'a':[1,2],'b':[[1,2,3,2,5],[3]],'c':[['f', 'df'],['ere']]})
a b c
0 1 [1, 2, 3, 2, 5] [f, df]
1 2 [3] [ere]
I tried:
df.groupby('a').agg({'b': 'sum', 'c': lambda x: list(''.join(x))})
a b c
1 [1, 2, 3, 2, 5] [f, d, f]
2 [3] [e, r, e]
But it is not quite right.
Any suggesetions?
You almost get it right:
df.groupby('a', as_index=False).agg({
'b': 'sum',
'c': list # no join needed
})
Output:
a b c
0 1 [1, 2, 3, 2, 5] [f, df]
1 2 [3] [ere]

Sort a dictionary in a column in pandas

I have a dataframe as shown below.
user_id Recommended_modules Remaining_modules
1 {A:[5,11], B:[4]} {A:2, B:1}
2 {A:[8,4,2], B:[5], C:[6,8]} {A:7, B:1, C:2}
3 {A:[2,3,9], B:[8]} {A:5, B:1}
4 {A:[8,4,2], B:[5,1,2], C:[6]} {A:3, B:4, C:1}
Brief about the dataframe:
In the column Recommended_modules A, B and C are courses and the numbers inside the list are modules.
Key(Remaining_modules) = Course name
value(Remaining_modules) = Number of modules remaining in that course
From the above I would like to reorder the recommended_modules column based on the values in the Remaining_modules as shown below.
Expected Output:
user_id Ordered_Recommended_modules Ordered_Remaining_modules
1 {B:[4], A:[5,11]} {B:1, A:2}
2 {B:[5], C:[6,8], A:[8,4,2]} {B:1, C:2, A:7}
3 {B:[8], A:[2,3,9]} {B:1, A:5}
4 {C:[6], A:[8,4,2], B:[5,1,2]} {C:1, A:3, B:4}
Explanation:
For user_id = 2, Remaining_modules = {A:7, B:1, C:2}, sort like this {B:1, C:2, A:7}
similarly arrange Recommended_modules also in the same order as shown below
{B:[5], C:[6,8], A:[8,4,2]}.
It is possible, only need python 3.6+:
def f(x):
#https://stackoverflow.com/a/613218/2901002
d1 = {k: v for k, v in sorted(x['Remaining_modules'].items(), key=lambda item: item[1])}
L = d1.keys()
#https://stackoverflow.com/a/21773891/2901002
d2 = {key:x['Recommended_modules'][key] for key in L if key in x['Recommended_modules']}
x['Remaining_modules'] = d1
x['Recommended_modules'] = d2
return x
df = df.apply(f, axis=1)
print (df)
user_id Recommended_modules \
0 1 {'B': [4], 'A': [5, 11]}
1 2 {'B': [5], 'C': [6, 8], 'A': [8, 4, 2]}
2 3 {'B': [8], 'A': [2, 3, 9]}
3 4 {'C': [6], 'A': [8, 4, 2], 'B': [5, 1, 2]}
Remaining_modules
0 {'B': 1, 'A': 2}
1 {'B': 1, 'C': 2, 'A': 7}
2 {'B': 1, 'A': 5}
3 {'C': 1, 'A': 3, 'B': 4}

How to convert dictionary with list to dataframe with default index and column names

How to convert dictionary to dataframe with default index and column names
dictionary d = {0: [1, 'Sports', 222], 1: [2, 'Tools', 11], 2: [3, 'Clothing', 23]}
df
id type value
0 1 Sports 222
1 2 Tools 11
2 3 Clothing 23
Use DataFrame.from_dict with orient='index' parameter:
d = {0: [1, 'Sports', 222], 1: [2, 'Tools', 11], 2: [3, 'Clothing', 23]}
df = pd.DataFrame.from_dict(d, orient='index', columns=['id','type','value'])
print (df)
id type value
0 1 Sports 222
1 2 Tools 11
2 3 Clothing 23

How to merge columns using mask

I am trying to merge two columns (Phone 1 and 2)
Here is my fake data:
import pandas as pd
employee = {'EmployeeID' : [0, 1, 2, 3, 4, 5, 6, 7],
'LastName' : ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
'Name' : ['w', 'x', 'y', 'z', None, None, None, None],
'phone1' : [1, 1, 2, 2, 4, 5, 6, 6],
'phone2' : [None, None, 3, 3, None, None, 7, 7],
'level_15' : [0, 1, 0, 1, 0, 0, 0, 1]}
df2 = pd.DataFrame(employee)
and I want the 'phone' column to be
'phone' : [1, 2, 3, 4, 5, 7, 9, 10]
In the beginning of my code, i split the names based on '/' and this code below creates a column with 0s and 1s which I used as mask to do other tasks through out my code.
df2 = (df2.set_index(cols)['name'].str.split('/',expand=True).stack().reset_index(name='Name'))
m = df2['level_15'].eq(0)
print (m)
#remove column level_15
df2 = df2.drop(['level_15'], axis=1)
#add last name for select first letter by condition, replace NaNs by forward fill
df2['last_name'] = df2['name'].str[:2].where(m).ffill()
df2['name'] = df2['name'].mask(m, df2['name'].str[2:])
I feel like there is a way to merge phone1 and phone2 using the 0s and 1s, but I can't figure out. Thank you.
First, start by filling in NaNs;
df2['phone2'] = df2.phone2.fillna(df2.phone1)
# Alternatively, based on your latest update
# df2['phone2'] = df2.phone2.mask(df2.phone2.eq(0)).fillna(df2.phone1)
You can just use np.where to merge columns on odd/even indices:
df2['phone'] = np.where(np.arange(len(df2)) % 2 == 0, df2.phone1, df2.phone2)
df2 = df2.drop(['phone1', 'phone2'], 1)
df2
EmployeeID LastName Name phone
0 0 a w 1
1 1 b x 2
2 2 c y 3
3 3 d z 4
4 4 e None 5
5 5 f None 6
6 6 g None 7
7 7 h None 8
Or, with Series.where/mask:
df2['phone'] = df2.pop('phone1').where(
np.arange(len(df2)) % 2 == 0, df2.pop('phone2')
)
Or,
df2['phone'] = df2.pop('phone1').mask(
np.arange(len(df2)) % 2 != 0, df2.pop('phone2)
)
df2
EmployeeID LastName Name phone
0 0 a w 1
1 1 b x 2
2 2 c y 3
3 3 d z 4
4 4 e None 5
5 5 f None 6
6 6 g None 7
7 7 h None 8

Pandas groupby(dictionary) not returning intended result

I'm trying to group the following data:
>>> a=[{'A': 1, 'B': 2, 'C': 3, 'D':4, 'E':5, 'F':6},{'A': 2, 'B': 3, 'C': 4, 'D':5, 'E':6, 'F':7},{'A': 3, 'B': 4, 'C': 5, 'D':6, 'E':7, 'F':8}]
>>> df = pd.DataFrame(a)
>>> df
A B C D E F
0 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
With the Following Dictionary:
dict={'A':1,'B':1,'C':1,'D':2,'E':2,'F':2}
such that
df.groupby(dict).groups
Will output
{1:['A','B','C'],2:['D','E','F']}
Needed to add the axis argument to groupby:
>>> grouped = df.groupby(groupDict,axis=1)
>>> grouped.groups
{1: ['A', 'B', 'C'], 2: ['D', 'E', 'F']}