In Pandas I can merge two dataframes like so:
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})
df1.merge(df2, how='left', left_on='lkey', right_on='rkey')
lkey value_x rkey value_y
0 foo 1 foo 5
1 foo 1 foo 8
2 bar 2 bar 6
3 baz 3 baz 7
4 foo 5 foo 5
5 foo 5 foo 8
What would the equivalent of this be in pyspark? A left join?
You can apply a join in PySpark as follows:
df = df1.join(df2, df1.lkey==df2.rkey, 'left_outer')
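For a self-contained sketch of the same example, assuming an active SparkSession bound to the name spark:

# A minimal sketch, assuming an active SparkSession named `spark`
df1 = spark.createDataFrame([('foo', 1), ('bar', 2), ('baz', 3), ('foo', 5)], ['lkey', 'value'])
df2 = spark.createDataFrame([('foo', 5), ('bar', 6), ('baz', 7), ('foo', 8)], ['rkey', 'value'])

# 'left_outer' is equivalent to 'left'; both match pandas' how='left'
df1.join(df2, df1.lkey == df2.rkey, 'left_outer').show()

Note that, unlike pandas, Spark will not suffix the duplicate value columns with _x/_y; renaming one beforehand (e.g. with withColumnRenamed) avoids ambiguity when selecting it later.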
I have a simple example:
DF = pd.DataFrame(
    {"F1": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
     "F2": [1, 2, 1, 2, 2, 3, 1, 2, 3, 2],
     "F3": ['xx', 'yy', 'zz', 'zz', 'zz', 'xx', 'yy', 'zz', 'zz', 'zz']})
DF
How can I improve the code so that the F3-unique column shows, in addition to the unique values of F3 in each group, the number of times each of those values appears in the group?
Use .groupby() + .sum() + .value_counts() + .agg():
# Sum F2 within each F1 group
df2 = DF.groupby('F1')['F2'].sum()

# Count each F3 value per group, then format each row as 'value-count'
df3 = (DF.groupby(['F1', 'F3'])['F3']
         .value_counts()
         .reset_index([2], name='count')
         .apply(lambda x: x['F3'] + '-' + str(x['count']), axis=1))

# Join the formatted strings within each F1 group
df4 = df3.groupby(level=0).agg(' '.join)
df4.name = 'F3'

df_out = pd.concat([df2, df4], axis=1).reset_index()
Result:
print(df_out)
F1 F2 F3
0 A 4 xx-1 yy-1 zz-1
1 B 7 xx-1 zz-2
2 C 8 yy-1 zz-3
Groupby's aggregate plus Python's collections.Counter could work well here:
from collections import Counter
df2 = DF.groupby('F1', as_index=False).aggregate({
    'F2': 'sum',
    'F3': lambda g: ' '.join([f'{k}-{v}' for k, v in Counter(g).items()])
})
df2:
F1 F2 F3
0 A 4 xx-1 yy-1 zz-1
1 B 7 zz-2 xx-1
2 C 8 yy-1 zz-3
Aggregating to a Counter turns each group into a dictionary mapping each unique value to its count:
df2 = DF.groupby('F1', as_index=False).aggregate({
    'F2': 'sum',
    'F3': Counter
})
F1 F2 F3
0 A 4 {'xx': 1, 'yy': 1, 'zz': 1}
1 B 7 {'zz': 2, 'xx': 1}
2 C 8 {'yy': 1, 'zz': 3}
The surrounding comprehension then reformats each dictionary for display:
Sample with one row:
' '.join([f'{k}-{v}' for k, v in Counter({'xx': 1, 'yy': 1, 'zz': 1}).items()])
'xx-1 yy-1 zz-1'
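For completeness, the same aggregation can also be written with pandas' named-aggregation syntax, which avoids the dict-of-lambdas form; a sketch:

from collections import Counter

df2 = DF.groupby('F1', as_index=False).agg(
    F2=('F2', 'sum'),
    F3=('F3', lambda g: ' '.join(f'{k}-{v}' for k, v in Counter(g).items()))
)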
I have:
df = pd.DataFrame({'a': [1, 1, 2], 'b': [[1, 2, 3], [2, 5], [3]], 'c': ['f', 'df', 'ere']})
df
a b c
0 1 [1, 2, 3] f
1 1 [2, 5] df
2 2 [3] ere
I want to group by a, concatenating the lists in b and collecting the values of c into lists:
pd.DataFrame({'a': [1, 2], 'b': [[1, 2, 3, 2, 5], [3]], 'c': [['f', 'df'], ['ere']]})
a b c
0 1 [1, 2, 3, 2, 5] [f, df]
1 2 [3] [ere]
I tried:
df.groupby('a').agg({'b': 'sum', 'c': lambda x: list(''.join(x))})
a b c
1 [1, 2, 3, 2, 5] [f, d, f]
2 [3] [e, r, e]
But it is not quite right.
Any suggestions?
You almost got it right:
df.groupby('a', as_index=False).agg({
    'b': 'sum',
    'c': list  # no join needed
})
Output:
a b c
0 1 [1, 2, 3, 2, 5] [f, df]
1 2 [3] [ere]
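One caveat: aggregating the lists with 'sum' concatenates them pairwise, which gets quadratic on large groups. If that matters, flattening via itertools.chain is a linear-time sketch:

from itertools import chain

df.groupby('a', as_index=False).agg({
    'b': lambda x: list(chain.from_iterable(x)),  # flatten the lists in one pass
    'c': list
})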
I have two tables as shown below.
User table:
user_id courses attended_modules
1 [A] {A:[1,2,3,4,5,6]}
2 [A,B,C] {A:[8], B:[5], C:[6]}
3 [A,B] {A:[2,3,9], B:[10]}
4 [A] {A:[3]}
5 [B] {B:[5]}
6 [A] {A:[3]}
7 [B] {B:[5]}
8 [A] {A:[4]}
Course table:
course_id modules
A [1,2,3,4,5,6,8,9]
B [5,8]
C [6,10]
From the above, compare attended_modules in the user table with modules in the course table, and create a new column Remaining_modules in the user table, as explained below.
Example: user_id = 1 attended course A and completed 6 of its modules; the course has 8 modules, so Remaining_modules = {A: 2}.
Similarly, for user_id = 2, Remaining_modules = {A: 7, B: 1, C: 1}.
And so on...
Expected Output:
user_id attended_modules Remaining_modules
1 {A:[1,2,3,4,5,6]} {A:2}
2 {A:[8], B:[5], C:[6]} {A:7, B:1, C:1}
3 {A:[2,3,9], B:[10]} {A:5, B:2}
4 {A:[3]} {A:7}
5 {B:[5]} {B:1}
6 {A:[3]} {A:7}
7 {B:[5]} {B:1}
8 {A:[4]} {A:7}
The idea is to compare each attended course's modules against the full course list and sum the True values of a generator expression:
# Map each course_id to its full module list
df2 = df2.set_index('course_id')
mo = df2['modules'].to_dict()
# print(mo)

def f(x):
    # For each attended course k, count the course modules missing from the attended list v
    return {k: sum(i not in v for i in mo[k]) for k, v in x.items()}

df1['Remaining_modules'] = df1['attended_modules'].apply(f)
print(df1)
user_id courses attended_modules Remaining_modules
0 1 [A] {'A': [1, 2, 3, 4, 5, 6]} {'A': 2}
1 2 [A,B,C] {'A': [8], 'B': [5], 'C': [6]} {'A': 7, 'B': 1, 'C': 1}
2 3 [A,B] {'A': [2, 3, 9], 'B': [10]} {'A': 5, 'B': 2}
3 4 [A] {'A': [3]} {'A': 7}
4 5 [B] {'B': [5]} {'B': 1}
5 6 [A] {'A': [3]} {'A': 7}
6 7 [B] {'B': [5]} {'B': 1}
7 8 [A] {'A': [4]} {'A': 7}
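For reference, a minimal sketch constructing the two input frames the snippet above assumes (lists and dicts stored as plain Python objects):

import pandas as pd

df1 = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'courses': [['A'], ['A', 'B', 'C'], ['A', 'B'], ['A'], ['B'], ['A'], ['B'], ['A']],
    'attended_modules': [{'A': [1, 2, 3, 4, 5, 6]}, {'A': [8], 'B': [5], 'C': [6]},
                         {'A': [2, 3, 9], 'B': [10]}, {'A': [3]}, {'B': [5]},
                         {'A': [3]}, {'B': [5]}, {'A': [4]}]
})
df2 = pd.DataFrame({
    'course_id': ['A', 'B', 'C'],
    'modules': [[1, 2, 3, 4, 5, 6, 8, 9], [5, 8], [6, 10]]
})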
I need to select column names where the count is greater than 2. I have this dataset:
Index | col_1 | col_2 | col_3 | col_4
-------------------------------------
0 | 5 | NaN | 4 | 2
1 | 2 | 2 | NaN | 2
2 | NaN | 3 | NaN | 1
3 | 3 | NaN | NaN | 1
The expected result is a list: ['col_1', 'col_4']
When I use
df.count() > 2
I get
col_1 True
col_2 False
col_3 False
col_4 True
Length: 4, dtype: bool
Here is the code for testing:
import pandas as pd
import numpy as np
data = {'col_1': [5, 2, np.nan, 3],
        'col_2': [np.nan, 2, 3, np.nan],
        'col_3': [4, np.nan, np.nan, np.nan],
        'col_4': [2, 2, 1, 1]}
frame = pd.DataFrame(data)
frame.count() > 2
You can do it this way:
import pandas as pd
import numpy as np
data = {'col_1': [5, 2, np.nan, 3],
        'col_2': [np.nan, 2, 3, np.nan],
        'col_3': [4, np.nan, np.nan, np.nan],
        'col_4': [2, 2, 1, 1]}
frame = pd.DataFrame(data)
expected_list = []
for col in frame.columns:
    if frame[col].count() > 2:
        expected_list.append(col)
Using a dict can easily solve this:
frame[[key for key, value in dict(frame.count() > 2).items() if value]]
Try:
frame.columns[(frame.count() > 2).values].to_list()
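A more direct variant of the same idea: the boolean mask can index the columns directly, with no dict or .values detour (assuming the frame is named frame, as in the test code):

frame.columns[frame.count() > 2].tolist()
# ['col_1', 'col_4']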
I'm trying to group the following data:
>>> a = [{'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6},
...      {'A': 2, 'B': 3, 'C': 4, 'D': 5, 'E': 6, 'F': 7},
...      {'A': 3, 'B': 4, 'C': 5, 'D': 6, 'E': 7, 'F': 8}]
>>> df = pd.DataFrame(a)
>>> df
A B C D E F
0 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
With the following dictionary (named groupDict to avoid shadowing the built-in dict):
groupDict = {'A': 1, 'B': 1, 'C': 1, 'D': 2, 'E': 2, 'F': 2}
such that
df.groupby(groupDict).groups
will output
{1:['A','B','C'],2:['D','E','F']}
You need to add the axis argument to groupby:
>>> grouped = df.groupby(groupDict,axis=1)
>>> grouped.groups
{1: ['A', 'B', 'C'], 2: ['D', 'E', 'F']}
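Note that the axis argument to groupby is deprecated in recent pandas versions; grouping the transposed frame's index is a forward-compatible sketch of the same thing:

# Transpose so the columns become the index, then group that index
df.T.groupby(groupDict).groups
# {1: ['A', 'B', 'C'], 2: ['D', 'E', 'F']}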