In Pandas I can merge two dataframes like so:
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})
df1.merge(df2, how='left', left_on='lkey', right_on='rkey')
lkey value_x rkey value_y
0 foo 1 foo 5
1 foo 1 foo 8
2 bar 2 bar 6
3 baz 3 baz 7
4 foo 5 foo 5
5 foo 5 foo 8
What would the equivalent of this be in pyspark? A left join?
You can apply a join in PySpark as follows:
df = df1.join(df2, df1.lkey==df2.rkey, 'left_outer')
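For a self-contained sketch of the same example, assuming an active SparkSession bound to the name spark:

# A minimal sketch, assuming an active SparkSession named `spark`
df1 = spark.createDataFrame([('foo', 1), ('bar', 2), ('baz', 3), ('foo', 5)], ['lkey', 'value'])
df2 = spark.createDataFrame([('foo', 5), ('bar', 6), ('baz', 7), ('foo', 8)], ['rkey', 'value'])

# 'left_outer' is equivalent to 'left'; both match pandas' how='left'
df1.join(df2, df1.lkey == df2.rkey, 'left_outer').show()

Note that, unlike pandas, Spark will not suffix the duplicate value columns with _x/_y; renaming one beforehand (e.g. with withColumnRenamed) avoids ambiguity when selecting it later.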
I have a simple example:
DF = pd.DataFrame(
    {"F1": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
     "F2": [1, 2, 1, 2, 2, 3, 1, 2, 3, 2],
     "F3": ['xx', 'yy', 'zz', 'zz', 'zz', 'xx', 'yy', 'zz', 'zz', 'zz']})
DF
How can I improve the code so that the F3-unique column shows, in addition to the unique values of F3 in each group, the number of times each of those values appears in the group?
Use .groupby() + .sum() + .value_counts() + .agg():
# Sum F2 within each F1 group
df2 = DF.groupby('F1')['F2'].sum()

# Count each F3 value per group, then format each row as 'value-count'
df3 = (DF.groupby(['F1', 'F3'])['F3']
         .value_counts()
         .reset_index([2], name='count')
         .apply(lambda x: x['F3'] + '-' + str(x['count']), axis=1))

# Join the formatted strings within each F1 group
df4 = df3.groupby(level=0).agg(' '.join)
df4.name = 'F3'

df_out = pd.concat([df2, df4], axis=1).reset_index()
Result:
print(df_out)
F1 F2 F3
0 A 4 xx-1 yy-1 zz-1
1 B 7 xx-1 zz-2
2 C 8 yy-1 zz-3
Groupby's aggregate plus Python's collections.Counter could work well here:
from collections import Counter
df2 = DF.groupby('F1', as_index=False).aggregate({
    'F2': 'sum',
    'F3': lambda g: ' '.join([f'{k}-{v}' for k, v in Counter(g).items()])
})
df2:
F1 F2 F3
0 A 4 xx-1 yy-1 zz-1
1 B 7 zz-2 xx-1
2 C 8 yy-1 zz-3
Aggregating to a Counter turns each group into a dictionary mapping each unique value to its count:
df2 = DF.groupby('F1', as_index=False).aggregate({
    'F2': 'sum',
    'F3': Counter
})
F1 F2 F3
0 A 4 {'xx': 1, 'yy': 1, 'zz': 1}
1 B 7 {'zz': 2, 'xx': 1}
2 C 8 {'yy': 1, 'zz': 3}
The surrounding comprehension then reformats each dictionary for display:
Sample with one row:
' '.join([f'{k}-{v}' for k, v in Counter({'xx': 1, 'yy': 1, 'zz': 1}).items()])
'xx-1 yy-1 zz-1'
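For completeness, the same aggregation can also be written with pandas' named-aggregation syntax, which avoids the dict-of-lambdas form; a sketch:

from collections import Counter

df2 = DF.groupby('F1', as_index=False).agg(
    F2=('F2', 'sum'),
    F3=('F3', lambda g: ' '.join(f'{k}-{v}' for k, v in Counter(g).items()))
)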
I have:
df = pd.DataFrame({'a': [1, 1, 2], 'b': [[1, 2, 3], [2, 5], [3]], 'c': ['f', 'df', 'ere']})
df
a b c
0 1 [1, 2, 3] f
1 1 [2, 5] df
2 2 [3] ere
I want to group by a, concatenating the lists in b and collecting the values of c into lists:
pd.DataFrame({'a': [1, 2], 'b': [[1, 2, 3, 2, 5], [3]], 'c': [['f', 'df'], ['ere']]})
a b c
0 1 [1, 2, 3, 2, 5] [f, df]
1 2 [3] [ere]
I tried:
df.groupby('a').agg({'b': 'sum', 'c': lambda x: list(''.join(x))})
a b c
1 [1, 2, 3, 2, 5] [f, d, f]
2 [3] [e, r, e]
But it is not quite right.
Any suggestions?
You almost got it right:
df.groupby('a', as_index=False).agg({
    'b': 'sum',
    'c': list  # no join needed
})
Output:
a b c
0 1 [1, 2, 3, 2, 5] [f, df]
1 2 [3] [ere]
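One caveat: aggregating the lists with 'sum' concatenates them pairwise, which gets quadratic on large groups. If that matters, flattening via itertools.chain is a linear-time sketch:

from itertools import chain

df.groupby('a', as_index=False).agg({
    'b': lambda x: list(chain.from_iterable(x)),  # flatten the lists in one pass
    'c': list
})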
I have two tables as shown below.
User table:
user_id courses attended_modules
1 [A] {A:[1,2,3,4,5,6]}
2 [A,B,C] {A:[8], B:[5], C:[6]}
3 [A,B] {A:[2,3,9], B:[10]}
4 [A] {A:[3]}
5 [B] {B:[5]}
6 [A] {A:[3]}
7 [B] {B:[5]}
8 [A] {A:[4]}
Course table:
course_id modules
A [1,2,3,4,5,6,8,9]
B [5,8]
C [6,10]
From the above, compare attended_modules in the user table with modules in the course table, and create a new column Remaining_modules in the user table, as explained below.
Example: user_id = 1 attended course A and completed 6 of its modules; the course has 8 modules, so Remaining_modules = {A: 2}.
Similarly, for user_id = 2, Remaining_modules = {A: 7, B: 1, C: 1}.
And so on...
Expected Output:
user_id attended_modules Remaining_modules
1 {A:[1,2,3,4,5,6]} {A:2}
2 {A:[8], B:[5], C:[6]} {A:7, B:1, C:1}
3 {A:[2,3,9], B:[10]} {A:5, B:2}
4 {A:[3]} {A:7}
5 {B:[5]} {B:1}
6 {A:[3]} {A:7}
7 {B:[5]} {B:1}
8 {A:[4]} {A:7}
The idea is to compare each attended course's modules against the full course list and sum the True values of a generator expression:
# Map each course_id to its full module list
df2 = df2.set_index('course_id')
mo = df2['modules'].to_dict()
# print(mo)

def f(x):
    # For each attended course k, count the course modules missing from the attended list v
    return {k: sum(i not in v for i in mo[k]) for k, v in x.items()}

df1['Remaining_modules'] = df1['attended_modules'].apply(f)
print(df1)
user_id courses attended_modules Remaining_modules
0 1 [A] {'A': [1, 2, 3, 4, 5, 6]} {'A': 2}
1 2 [A,B,C] {'A': [8], 'B': [5], 'C': [6]} {'A': 7, 'B': 1, 'C': 1}
2 3 [A,B] {'A': [2, 3, 9], 'B': [10]} {'A': 5, 'B': 2}
3 4 [A] {'A': [3]} {'A': 7}
4 5 [B] {'B': [5]} {'B': 1}
5 6 [A] {'A': [3]} {'A': 7}
6 7 [B] {'B': [5]} {'B': 1}
7 8 [A] {'A': [4]} {'A': 7}
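For reference, a minimal sketch constructing the two input frames the snippet above assumes (lists and dicts stored as plain Python objects):

import pandas as pd

df1 = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'courses': [['A'], ['A', 'B', 'C'], ['A', 'B'], ['A'], ['B'], ['A'], ['B'], ['A']],
    'attended_modules': [{'A': [1, 2, 3, 4, 5, 6]}, {'A': [8], 'B': [5], 'C': [6]},
                         {'A': [2, 3, 9], 'B': [10]}, {'A': [3]}, {'B': [5]},
                         {'A': [3]}, {'B': [5]}, {'A': [4]}]
})
df2 = pd.DataFrame({
    'course_id': ['A', 'B', 'C'],
    'modules': [[1, 2, 3, 4, 5, 6, 8, 9], [5, 8], [6, 10]]
})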
I need to select column names where the count is greater than 2. I have this dataset:
Index | col_1 | col_2 | col_3 | col_4
-------------------------------------
0 | 5 | NaN | 4 | 2
1 | 2 | 2 | NaN | 2
2 | NaN | 3 | NaN | 1
3 | 3 | NaN | NaN | 1
The expected result is a list: ['col_1', 'col_4']
When I use
df.count() > 2
I get
col_1 True
col_2 False
col_3 False
col_4 True
Length: 4, dtype: bool
Here is the code for testing:
import pandas as pd
import numpy as np
data = {'col_1': [5, 2, np.nan, 3],
        'col_2': [np.nan, 2, 3, np.nan],
        'col_3': [4, np.nan, np.nan, np.nan],
        'col_4': [2, 2, 1, 1]}
frame = pd.DataFrame(data)
frame.count() > 2
You can do it this way:
import pandas as pd
import numpy as np
data = {'col_1': [5, 2, np.nan, 3],
        'col_2': [np.nan, 2, 3, np.nan],
        'col_3': [4, np.nan, np.nan, np.nan],
        'col_4': [2, 2, 1, 1]}
frame = pd.DataFrame(data)
expected_list = []
for col in frame.columns:
    if frame[col].count() > 2:
        expected_list.append(col)
Using a dict can easily solve this:
frame[[key for key, value in dict(frame.count() > 2).items() if value]]
Try:
frame.columns[(frame.count() > 2).values].to_list()
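A more direct variant of the same idea: the boolean mask can index the columns directly, with no dict or .values detour (assuming the frame is named frame, as in the test code):

frame.columns[frame.count() > 2].tolist()
# ['col_1', 'col_4']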
I'm trying to group the following data:
>>> a = [{'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6},
...      {'A': 2, 'B': 3, 'C': 4, 'D': 5, 'E': 6, 'F': 7},
...      {'A': 3, 'B': 4, 'C': 5, 'D': 6, 'E': 7, 'F': 8}]
>>> df = pd.DataFrame(a)
>>> df
A B C D E F
0 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
With the following dictionary (named groupDict to avoid shadowing the built-in dict):
groupDict = {'A': 1, 'B': 1, 'C': 1, 'D': 2, 'E': 2, 'F': 2}
such that
df.groupby(groupDict).groups
will output
{1:['A','B','C'],2:['D','E','F']}
You need to add the axis argument to groupby:
>>> grouped = df.groupby(groupDict,axis=1)
>>> grouped.groups
{1: ['A', 'B', 'C'], 2: ['D', 'E', 'F']}
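Note that the axis argument to groupby is deprecated in recent pandas versions; grouping the transposed frame's index is a forward-compatible sketch of the same thing:

# Transpose so the columns become the index, then group that index
df.T.groupby(groupDict).groups
# {1: ['A', 'B', 'C'], 2: ['D', 'E', 'F']}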