I am doing a simple left outer join in PySpark and it is not giving correct results. Please see below. Value 5 (in column A) is between 1 (col B) and 10 (col C), so B and C should appear in the first row of the output table, but I'm getting nulls. I've tried this in 3 different RDBMSs (MS SQL, PostgreSQL, and SQLite) and all give the correct results. Possible bug in Spark?
Table x
+---+
| A|
+---+
| 5|
| 15|
| 20|
| 50|
+---+
Table y
+----+----+---+
| B| C| D|
+----+----+---+
| 1| 10|abc|
| 21| 30|xyz|
|null|null| mn|
| 11| 20| o|
+----+----+---+
SELECT x.a, y.b, y.c, y.d
FROM x LEFT OUTER JOIN
y
ON x.a >= y.b AND x.a <= y.c
+---+----+----+----+
| a| b| c| d|
+---+----+----+----+
| 5|null|null|null|
| 15| 11| 20| o|
| 20| 11| 20| o|
| 50|null|null|null|
+---+----+----+----+
The general LEFT JOIN syntax is:
SELECT column1, column2 ...
FROM table_A
LEFT JOIN table_B ON join_condition
WHERE row_condition
Maybe this will help you:
SELECT x.a, y.*
FROM x LEFT JOIN y ON x.id = y.xID
WHERE x.a >= y.b AND x.a <= y.c
The problem was that Spark had loaded the columns as strings, not ints, so the >= and <= comparisons were done on strings; that's why the results were off.
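One quick way to confirm this kind of issue (a diagnostic sketch, not part of the original answer) is to inspect the schemas before joining:
# If A, B, and C show up as string here, the >= / <= comparisons are lexicographic.
x.printSchema()
y.printSchema()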
Casting the A, B, and C columns to int resolved the problem:
x = x.withColumn('A', x['A'].cast('int'))
y = y.withColumn('B', y['B'].cast('int'))
y = y.withColumn('C', y['C'].cast('int'))
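If the data comes from CSV files (an assumption; the question does not say how x and y were loaded, and the file names below are hypothetical), the string columns can also be avoided at read time by supplying an explicit schema:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical file names -- adjust to the actual sources.
x_schema = StructType([StructField('A', IntegerType())])
y_schema = StructType([
    StructField('B', IntegerType()),
    StructField('C', IntegerType()),
    StructField('D', StringType()),
])
x = spark.read.csv('x.csv', header=True, schema=x_schema)
y = spark.read.csv('y.csv', header=True, schema=y_schema)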
Given a table for a specific day with different hex_ids, I would like to aggregate the data such that the total distinct users for hex_id A is the sum of distinct users across hex_ids [A, B, C].
+----------+-------+------+---------+
| date_id|user_id|hex_id| hex_map|
+----------+-------+------+---------+
|2016-11-01| 100| A|[A, B, C]|
|2016-11-01| 300| B| [B]|
|2016-11-01| 400| B| [B]|
|2016-11-01| 100| C| [B, C]|
|2016-11-01| 200| C| [B, C]|
|2016-11-01| 300| C| [B, C]|
+----------+-------+------+---------+
I would like to aggregate the table on hex_id so that this intermediate result
+------+---------+---+
|hex_id| hex_map|cnt|
+------+---------+---+
| A|[A, B, C]| 1|
| B| [B]| 2|
| C| [B, C]| 3|
+------+---------+---+
becomes the following, with the letter lists replaced by numbers:
+------+---------+---+
|hex_id| hex_map|cnt|
+------+---------+---+
| A| 6 | 1|
| B| 2 | 2|
| C| 5 | 3|
+------+---------+---+
This is run on Spark SQL 2.4.0 and I am stumped on how to achieve this. The value 6 comes from [A + B + C], i.e. 1 + 2 + 3. My best attempt is:
query="""
with cte as (select hex_id, hex_map, count(distinct user_id) cnt from tab group by hex_id, hex_map),
subq as (select hex_id as hex, cnt as cnts, explode(hex_map) xxt from cte),
sss as (select * from subq a left join cte b on a.xxt = b.hex_id)
select hex, sum(cnt) from sss group by hex
"""
spark.sql(query).show()
Since you did not specify the behavior of your aggregation, I decided to use first, but you can adapt it to your needs.
The idea is to convert each character to its ASCII representation; you can do that with the code below:
val df1 = spark.sql("select hex_id, first(hex_map) as first_hex_map from test group by hex_id")
df1.createOrReplaceTempView("df1")
val df2 = spark.sql("select hex_id, transform(first_hex_map, a -> ascii(a) - 64) as aggr from df1")
df2.createOrReplaceTempView("df2")
val df3 = spark.sql("select hex_id, aggr, aggregate(aggr, 0, (acc, x) -> acc + x) as final from df2")
final result:
+------+---------+-----+
|hex_id|aggr |final|
+------+---------+-----+
|A |[1, 2, 3]|6 |
|B |[2] |2 |
|C |[2, 3] |5 |
+------+---------+-----+
or using Dataset API:
df.groupBy("hex_id").agg(first("hex_map").as("first_hex_map"))
.withColumn("transformed", transform(col("first_hex_map"), a => ascii(a).minus(64)))
.withColumn("hex_map", aggregate(col("transformed"), lit(0), (acc, x) => acc.plus(x)))
Good luck!
I have 2 tables:
| Product |
|:----: |
| product_id |
| source_id|
| Source |
|:----:|
| source_id |
| priority |
Sometimes one product_id can have several sources, and my task is to select the row with the minimum priority for each product. For example:
| product_id | source_id| priority|
|:----: |:------:| :-----:|
| 10| 2| 9|
| 10| 4| 2|
| 20| 2| 9|
| 20| 4| 2|
| 30| 2| 9|
| 30| 4| 2|
The correct result should be:
| product_id | source_id| priority|
|:----: |:------:| :-----:|
| 10| 4| 2|
| 20| 4| 2|
| 30| 4| 2|
I am using this query:
SELECT p.product_id, p.source_id, s.priority FROM Product p
INNER JOIN Source s on s.source_id = p.source_id
WHERE s.priority = (SELECT Min(s1.priority) OVER (PARTITION BY p.product_id) FROM Source s1)
but it returns the error "this type of correlated subquery pattern is not supported yet", so as I understand it, I can't use this approach in Redshift. How should it be solved? Are there any other ways?
You just need to move the ranking into a derived table, and the easiest flag for min priority is the ROW_NUMBER() window function. As written, you're asking Redshift to rerun the window function for each JOIN ON test, which creates a lot of inefficiency in a clustered database. Try the following (untested):
SELECT product_id, source_id, priority
FROM (
    SELECT p.product_id,
           p.source_id,
           s.priority,
           ROW_NUMBER() OVER (PARTITION BY p.product_id ORDER BY s.priority) AS row_num
    FROM Product p
    INNER JOIN Source s ON s.source_id = p.source_id
) ranked
WHERE row_num = 1
Now the window function only runs once. You can also move the subquery to a CTE if that improves readability for your full case.
I already found the best solution for this case:
SELECT product_id, source_id, priority
FROM (
    SELECT p.product_id,
           p.source_id,
           s.priority,
           MIN(s.priority) OVER (PARTITION BY p.product_id) AS min_priority
    FROM Product p
    INNER JOIN Source s ON s.source_id = p.source_id
) t
WHERE priority = min_priority
I have two tables (orders, agents) that I'm trying to join on either of two columns, but not both. Some records in orders have both columns populated, and those come back duplicated.
orders:
|id|order|agent_id|username|
|--+-----+--------+--------|
| 1| ord1| 5| user1|
| 2| ord2| 6| user2|
| 3| ord3| 7| user3|
agents:
|id|agent|username|FName|LName|
|--+-----+--------+-----+-----|
| 5|agnt5| user2|FNam5|LNam5|
| 6|agnt6| user3|FNam6|LNam6|
| 7|agnt7| user4|FNam7|LNam7|
I tried joining with an OR clause
select o.id, o.order, o.agent_id, o.username, a.FName, a.LName
from orders o
left join agents a
on a.id = o.agent_id or a.username = o.username
I'm getting the following results
|id|order|agent_id|username|Fname|LName|
|--+-----+--------+--------+-----+-----|
| 1| ord1| 5| user2|FNam5|LNam5|
| 1| ord1| 5| user2|FNam5|LNam5|
| 2| ord2| 6| user3|FNam6|LNam7|
| 2| ord2| 6| user3|FNam6|LNam7|
| 3| ord3| 7| user4|FNam5|LNam5|
Expected Results
|id|order|agent_id|username|Fname|LName|
|--+-----+--------+--------+-----+-----|
| 1| ord1| 5| user2|FNam5|LNam5|
| 2| ord2| 6| user3|FNam6|LNam7|
| 3| ord3| 7| user4|FNam5|LNam5|
It looks like, when both the agent_id and the username match, it matches both and duplicates the row in my results. Is there a way to prevent the username match when the agent_id match is present?
You can left join twice, with a condition that skips the second join when the first one matches:
select
o.id,
o.order,
o.agent_id,
o.username,
coalesce(a1.fname, a2.fname) as fname,
coalesce(a1.lname, a2.lname) as lname
from orders o
left join agents a1 on a1.id = o.agent_id
left join agents a2 on a1.id is null and a2.username = o.username
Assuming the ID match takes precedence, you need to add an AND as follows:
select o.id, o.order, o.agent_id, o.username, a.FName, a.LName
from orders o
left join agents a
on a.id = o.agent_id or (a.username = o.username and a.id <> o.agent_id)
I do not know if I understood your question well enough; if not, please clarify it for me.
To avoid these duplicate results you could use the DISTINCT clause (see the MySQL DISTINCT documentation).
select distinct o.id, o.order, o.agent_id, o.username, a.FName, a.LName
from orders o
left join agents a
on a.id = o.agent_id or a.username = o.username
Another option would be to join on only one of the columns, but as I do not know the data this database will contain, I cannot recommend that with full confidence.
I am trying to join two datasets in Spark (version 2.1):
SELECT *
FROM Tb1
INNER JOIN Tb2
ON Tb1.key1=Tb2.key1
OR Tb1.key2=Tb2.Key2
But it results in a cross join. How can I join the two tables and get only matching records?
I have also tried a left outer join, but it is also forcing me to change to a cross join instead.
Try this method
from pyspark.sql import SQLContext as SQC
sqc = SQC(sc)
x = [(1,2,3), (4,5,6), (7,8,9), (10,11,12), (13,14,15)]
y = [(1,4,5), (4,5,6), (10,11,16),(34,23,31), (56,14,89)]
x_df = sqc.createDataFrame(x,["x","y","z"])
y_df = sqc.createDataFrame(y,["x","y","z"])
cond = [(x_df.x == y_df.x) | ( x_df.y == y_df.y)]
x_df.join(y_df,cond, "inner").show()
output
+---+---+---+---+---+---+
| x| y| z| x| y| z|
+---+---+---+---+---+---+
| 1| 2| 3| 1| 4| 5|
| 4| 5| 6| 4| 5| 6|
| 10| 11| 12| 10| 11| 16|
| 13| 14| 15| 56| 14| 89|
+---+---+---+---+---+---+
By joining it twice:
select *
from Tb1
inner join Tb2
on Tb1.key1=Tb2.key1
inner join Tb2 as Tb22
on Tb1.key2=Tb22.Key2
Or Left joining both:
select *
from Tb1
left join Tb2
on Tb1.key1=Tb2.key1
left join Tb2 as Tb22
on Tb1.key2=Tb22.Key2
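Another option in PySpark is to run the two joins separately and union the results (a sketch; tb1 and tb2 are illustrative DataFrame names for the question's tables, with columns key1 and key2):
# Rows matching on key1, and rows matching on key2.
join_on_key1 = tb1.join(tb2, tb1.key1 == tb2.key1, 'inner')
join_on_key2 = tb1.join(tb2, tb1.key2 == tb2.key2, 'inner')

# Both joins have the same schema, so a plain union works; distinct()
# drops the duplicates produced when a row matches on both keys.
result = join_on_key1.union(join_on_key2).distinct()
result.show()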
Let's say I have a spark data frame df1, with several columns (among which the column id) and data frame df2 with two columns, id and other.
Is there a way to replicate the following command:
sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")
by using only pyspark functions such as join(), select() and the like?
I have to implement this join in a function and I don't want to be forced to have sqlContext as a function parameter.
The asterisk (*) works with an alias. Example:
from pyspark.sql.functions import *
df1 = df1.alias('df1')
df2 = df2.alias('df2')
df1.join(df2, df1.id == df2.id).select('df1.*')
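To also bring in the extra column from df2 that the question asks for, the select can be extended (a sketch using the same aliases as above; other is the column name from the question):
df1.join(df2, df1.id == df2.id).select('df1.*', 'df2.other')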
Not sure if the most efficient way, but this worked for me:
from pyspark.sql.functions import col
df1.alias('a').join(df2.alias('b'),col('b.id') == col('a.id')).select([col('a.'+xx) for xx in a.columns] + [col('b.other1'),col('b.other2')])
The trick is in:
[col('a.'+xx) for xx in a.columns] : all columns in a
[col('b.other1'),col('b.other2')] : some columns of b
Without using an alias:
df1.join(df2, df1.id == df2.id).select(df1["*"],df2["other"])
Here is a solution that does not require a SQL context, but maintains the metadata of a DataFrame.
a = sc.parallelize([['a', 'foo'], ['b', 'hem'], ['c', 'haw']]).toDF(['a_id', 'extra'])
b = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']]).toDF(["other", "b_id"])
c = a.join(b, a.a_id == b.b_id)
Then, c.show() yields:
+----+-----+-----+----+
|a_id|extra|other|b_id|
+----+-----+-----+----+
| a| foo| p1| a|
| b| hem| p2| b|
| c| haw| p3| c|
+----+-----+-----+----+
I believe that this would be the easiest and most intuitive way:
final = (df1.alias('df1').join(df2.alias('df2'),
on = df1['id'] == df2['id'],
how = 'inner')
.select('df1.*',
'df2.other')
)
To drop the duplicate b_id column:
c = a.join(b, a.a_id == b.b_id).drop(b.b_id)
Here is a code snippet that does the inner join, selects the columns from both dataframes, and aliases the duplicated column to a different name.
emp_df = spark.read.csv('Employees.csv', header=True)
dept_df = spark.read.csv('dept.csv', header=True)
emp_dept_df = emp_df.join(dept_df,'DeptID').select(emp_df['*'], dept_df['Name'].alias('DName'))
emp_df.show()
dept_df.show()
emp_dept_df.show()
Output for 'emp_df.show()':
+---+---------+------+------+
| ID| Name|Salary|DeptID|
+---+---------+------+------+
| 1| John| 20000| 1|
| 2| Rohit| 15000| 2|
| 3| Parth| 14600| 3|
| 4| Rishabh| 20500| 1|
| 5| Daisy| 34000| 2|
| 6| Annie| 23000| 1|
| 7| Sushmita| 50000| 3|
| 8| Kaivalya| 20000| 1|
| 9| Varun| 70000| 3|
| 10|Shambhavi| 21500| 2|
| 11| Johnson| 25500| 3|
| 12| Riya| 17000| 2|
| 13| Krish| 17000| 1|
| 14| Akanksha| 20000| 2|
| 15| Rutuja| 21000| 3|
+---+---------+------+------+
Output for 'dept_df.show()':
+------+----------+
|DeptID| Name|
+------+----------+
| 1| Sales|
| 2|Accounting|
| 3| Marketing|
+------+----------+
Join Output:
+---+---------+------+------+----------+
| ID| Name|Salary|DeptID| DName|
+---+---------+------+------+----------+
| 1| John| 20000| 1| Sales|
| 2| Rohit| 15000| 2|Accounting|
| 3| Parth| 14600| 3| Marketing|
| 4| Rishabh| 20500| 1| Sales|
| 5| Daisy| 34000| 2|Accounting|
| 6| Annie| 23000| 1| Sales|
| 7| Sushmita| 50000| 3| Marketing|
| 8| Kaivalya| 20000| 1| Sales|
| 9| Varun| 70000| 3| Marketing|
| 10|Shambhavi| 21500| 2|Accounting|
| 11| Johnson| 25500| 3| Marketing|
| 12| Riya| 17000| 2|Accounting|
| 13| Krish| 17000| 1| Sales|
| 14| Akanksha| 20000| 2|Accounting|
| 15| Rutuja| 21000| 3| Marketing|
+---+---------+------+------+----------+
I got an error: 'a not found' using the suggested code:
from pyspark.sql.functions import col
df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')).select([col('a.'+xx) for xx in a.columns] + [col('b.other1'), col('b.other2')])
I changed a.columns to df1.columns and it worked out.
Here is a function to drop duplicate columns after joining. Check it out:
def dropDupeDfCols(df):
    # Collect unique column names and the positions of the duplicates.
    newcols = []
    dupcols = []
    for i in range(len(df.columns)):
        if df.columns[i] not in newcols:
            newcols.append(df.columns[i])
        else:
            dupcols.append(i)
    # Temporarily rename every column to its positional index so the
    # duplicate positions can be dropped unambiguously.
    df = df.toDF(*[str(i) for i in range(len(df.columns))])
    for dupcol in dupcols:
        df = df.drop(str(dupcol))
    # Restore the original (now unique) column names.
    return df.toDF(*newcols)
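A quick usage sketch (assuming df1 and df2 both carry an id column, so the joined frame ends up with two columns named id):
joined = df1.join(df2, df1.id == df2.id)
deduped = dropDupeDfCols(joined)  # keeps the first occurrence of each duplicated name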
I just dropped the columns I didn't need from df2 and joined:
sliced_df = df2.select(columns_of_interest)
df1.join(sliced_df, on=['id'], how='left')
Note that `id` should be included in `columns_of_interest`, though.
df1.join(df2, ['id']).drop(df2.id)
If you need multiple columns from the other PySpark dataframe, you can use this.
Based on a single join condition:
x.join(y, x.id == y.id,"left").select(x["*"],y["col1"],y["col2"],y["col3"])
Based on multiple join conditions:
x.join(y, (x.id == y.id) & (x.no == y.no),"left").select(x["*"],y["col1"],y["col2"],y["col3"])
I very much like Xehron's answer above, and I suspect it's mechanically identical to my solution. This works in Databricks, and presumably works in a typical Spark environment (replacing the keyword "spark" with "sqlContext"):
df.createOrReplaceTempView('t1') #temp table t1
df2.createOrReplaceTempView('t2') #temp table t2
output = (
spark.sql("""
select
t1.*
,t2.desired_field(s)
from
t1
left (or inner) join t2 on t1.id = t2.id
"""
)
)
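A concrete, runnable version of the same pattern, filled in with the df1/df2 names and the id and other columns from the question:
df1.createOrReplaceTempView('t1')
df2.createOrReplaceTempView('t2')

output = spark.sql("""
    select t1.*, t2.other
    from t1
    join t2 on t1.id = t2.id
""")
output.show()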
You could just do the join and select the wanted columns afterwards: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe%20join#pyspark.sql.DataFrame.join