Or conditions on join result on cross join

Or conditions on join result on cross join - sql

I am trying to join two dataset on spark, I am using spark version 2.1,
SELECT *
FROM Tb1
INNER JOIN Tb2
ON Tb1.key1=Tb2.key1
OR Tb1.key2=Tb2.Key2
But it results on cross join, how can I join two tables and get only matching records?
I also have tried left outer join, but also it is forcing me to change to cross join instead ??

Try this method
from pyspark.sql import SQLContext as SQC
sqc = SQC(sc)
x = [(1,2,3), (4,5,6), (7,8,9), (10,11,12), (13,14,15)]
y = [(1,4,5), (4,5,6), (10,11,16),(34,23,31), (56,14,89)]
x_df = sqc.createDataFrame(x,["x","y","z"])
y_df = sqc.createDataFrame(y,["x","y","z"])
cond = [(x_df.x == y_df.x) | ( x_df.y == y_df.y)]
x_df.join(y_df,cond, "inner").show()
output
+---+---+---+---+---+---+
| x| y| z| x| y| z|
+---+---+---+---+---+---+
| 1| 2| 3| 1| 4| 5|
| 4| 5| 6| 4| 5| 6|
| 10| 11| 12| 10| 11| 16|
| 13| 14| 15| 56| 14| 89|
+---+---+---+---+---+---+

By joining it twice:
select *
from Tb1
inner join Tb2
on Tb1.key1=Tb2.key1
inner join Tb2 as Tb22
on Tb1.key2=Tb22.Key2
Or Left joining both:
select *
from Tb1
left join Tb2
on Tb1.key1=Tb2.key1
left join Tb2 as Tb22
on Tb1.key2=Tb22.Key2

Related

group by on the multiple inner join in postgres

i have 3 tables the first table "A" is the master table
id_grp|group_name |created_on |status|
------+--------------+-----------------------+------+
17|Teller |2022-09-09 16:00:44.842| 1|
18|Combined Group|2022-09-09 10:16:42.473| 1|
16|admnistrator |2022-09-08 10:11:14.313| 1|
Then i have another table table "b"
id_config|id_grp|id_utilis|
---------+------+---------+
159| 16| 1|
161| 16| 54|
164| 17| 55|
438| 17| 88|
166| 18| 39|
167| 18| 20|
439| 16| 89|
198| 18| 51|
Then i have the last table "C"
id_config|id_grp|id_pol|
---------+------+------+
46| 16| 7|
48| 17| 8|
51| 18| 8|
52| 18| 7|
84| 18| 9|
113| 17| 9|
but when i using group by with multiple join as follows
SELECT
a.id_grp,
a.group_name,
a.created_on,
a.status,
count(b.id_utilis) AS users,
count(c.id_pol) AS policy
FROM a
inner JOIN b on a.id_grp = b.id_grp
inner JOIN c on a.id_grp = c.id_grp
GROUP BY a.id_grp, a.group_name, a.created_on, a.status,
but i am getting wront result there both the count are creating matrix and multiplying each other
id_grp|group_name |created_on |status|users|policy|
------+--------------+-----------------------+------+-----+------+
17|Teller |2022-09-09 16:00:44.842| 1| 10| 10|
16|admnistrator |2022-09-08 10:11:14.313| 1| 3| 3|
18|Combined Group|2022-09-09 10:16:42.473| 1| 18| 18|

select *
from a
join (select id_grp, count(*) as users from b group by id_grp) b using(id_grp)
join (select id_grp, count(*) as policy from c group by id_grp) c using(id_grp)
id_grp
group_name
created_on
status
users
policy
17
Teller
2022-09-09 16:00:44
1
2
2
18
Combined Group
2022-09-09 10:16:42
1
3
3
16
admnistrator
2022-09-08 10:11:14
1
3
1
Fiddle

Spark is Giving Incorrect Results with Left Outer Join

I am doing a simple left outer join in PySpark and it is not giving correct results. Please see bellow. Value 5 (in column A) is between 1 (col B) and 10 (col C) that's why B and C should be in the output table in the first row. But I'm getting nulls. I've tried this in 3 different RDBMs MS SQL, PostGres, and SQLite all giving the correct results. Possible bug in Spark??
Table x
+---+
| A|
+---+
| 5|
| 15|
| 20|
| 50|
+---+
Table y
+----+----+---+
| B| C| D|
+----+----+---+
| 1| 10|abc|
| 21| 30|xyz|
|null|null| mn|
| 11| 20| o|
+----+----+---+
SELECT x.a, y.b, y.c, y.d
FROM x LEFT OUTER JOIN
y
ON x.a >= y.b AND x.a <= y.c
+---+----+----+----+
| a| b| c| d|
+---+----+----+----+
| 5|null|null|null|
| 15| 11| 20| o|
| 20| 11| 20| o|
| 50|null|null|null|
+---+----+----+----+

syntax LEFT JOIN
enter link description here
SELECT column1, column2 ...
FROM table_A
LEFT JOIN table_B ON join_condition
WHERE row_condition
Maybe it will help you
SELECT x.a, y.*
FROM x LEFT JOIN y ON x.id = y.xID
WHERE x.a >= y.b AND x.a <= y.c

The problem was that Spark loaded the columns as strings, not ints. Spark was doing the >= and <= comparison on strings that's why results were off.
Casting the A,B,C columns to int resolved the problem.
x=x.withColumn('A',x['A'].cast('int'))
y=y.withColumn('B',x['B'].cast('int'))
y=y.withColumn('C',x['C'].cast('int'))

SQL query to find an output table

I have three dimension tables and a fact table and i need to write a query in such way that i join all the dimension columns with fact table to find out top 10 ATMs where most transactions are in the ’inactive’ state.I try the query with cartesian join but i dont know if this is the right way to join the tables.
select a.atm_number,a.atm_manufacturer,b.location,count(c.trans_id) as total_transaction_count,count(c.atm_status) as inactive_count
from dimen_atm a,dimen_location b,fact_atm_trans c
where a.atm_id = c.atm_id and b.location = c.location
order by inactive_count desc limit 10;
dimen_card_type
+------------+---------+
|card_type_id|card_type|
+------------+---------+
| 1| CIRRUS|
| 2| Dankort|
dimen_atm
+------+----------+----------------+---------------+
|atm_id|atm_number|atm_manufacturer|atm_location_id|
+------+----------+----------------+---------------+
| 1| 1| NCR| 16|
| 2| 2| NCR| 64|
+------+----------+----------------+---------------+
dimen_location
+-----------+--------------------+----------------+-------------+-------+------+------+
|location_id| location| streetname|street_number|zipcode| lat| lon|
+-----------+--------------------+----------------+-------------+-------+------+------+
| 1|Intern KÃƒÂ¸benhavn|RÃƒÂ¥dhuspladsen| 75| 1550|55.676|12.571|
| 2| KÃƒÂ¸benhavn| Regnbuepladsen| 5| 1550|55.676|12.571|
+-----------+--------------------+----------------+-------------+-------+------+------+
fact_atm_trans
+--------+------+--------------+-------+------------+----------+--------+----------+------------------+------------+------------+-------+----------+----------+------------+-------------------+
|trans_id|atm_id|weather_loc_id|date_id|card_type_id|atm_status|currency| service|transaction_amount|message_code|message_text|rain_3h|clouds_all|weather_id|weather_main|weather_description|
+--------+------+--------------+-------+------------+----------+--------+----------+------------------+------------+------------+-------+----------+----------+------------+-------------------+
| 1| 1| 16| 5229| 3| Active| DKK|Withdrawal| 5980| null| null| 0.0| 80| 803| Clouds| broken cloudsr|
| 2| 1| 16| 4090| 10| Active| DKK|Withdrawal| 3992| null| null| 0.0| 32| 802| Clouds| scattered cloudsr|
+--------+------+--------------+-------+------------+----------+--------+----------+------------------+------------+-----------

Joining tables and finding difference

I have a table which contains the following schema:
Table1
+------------------+--------------------+-------------------+-------------+-------------+
|student_id|project_id|name|project_name|approved|evaluation_type|grade| cohort_number|
I have another table with the following:
Table2
+-------------+----------+
|cohort_number|project_id|
My problem is: I want to get for each student_id the projects that he has not completed (no rows). The way i know all the projects he should have done is by checking the cohort_number. Basically I need the "diference" between the 2 tables. I want to fill table 1 with the missing entries, by comparing with table 2 project_id for that cohort_number.
I am not sure if I was clear.
I tried using LEFT JOIN, but I only get records where it matches. (I need the opposite)
Example:
Table1
|student_id|project_id|name| project_name| approved|evaluation_type| grade|cohort_number|
+----------+----------+--------------------+------+--------------------+--------+---------------+------------------
| 13| 18|Name| project/sd-03-bloc...| true| standard| 1.0| 3|
| 13| 7|Name| project/sd-03-bloc...| true| standard| 1.0| 3|
| 13| 27|Name| project/sd-03-bloc...| true| standard| 1.0| 3|
Table2
+-------------+----------+
|cohort_number|project_id|
+-------------+----------+
| 3| 18|
| 3| 27|
| 4| 15|
| 3| 7|
| 3| 35|
I want:
|student_id|project_id|name| project_name| approved|evaluation_type| grade|cohort_number|
+----------+----------+--------------------+------+--------------------+--------+---------------+------------------
| 13| 18|Name| project/sd-03-bloc...| true| standard| 1.0| 3|
| 13| 7|Name| project/sd-03-bloc...| true| standard| 1.0| 3|
| 13| 27|Name| project/sd-03-bloc...| true| standard| 1.0| 3|
| 13| 35|Name| project/sd-03-bloc...| false| standard| 0| 3|
Thanks in advance

If I followed you correctly, you can get all distinct (student_id, cohort_number, name) tuples from table1, and then bring all corresponding rows from table2. This basically gives you one row for each project that a student should have completed.
You can then bring table1 with a left join. "Missing" projects are identified by null values in columns project_name, approved, evaluation_type, grade.
select
s.student_id,
t2.project_id,
s.name,
t1.project_name,
t1.approved,
t1.evaluation_type,
t1.grade,
s.cohort_number
from (select distinct student_id, cohort_number, name from table1) s
inner join table2 t2
on t2.cohort_number = s.cohort_number
left join table1 t1
on t1.student_id = s.student_id
and t1.project_id = t.project_id

Join two data frames, select all columns from one and some columns from the other

Let's say I have a spark data frame df1, with several columns (among which the column id) and data frame df2 with two columns, id and other.
Is there a way to replicate the following command:
sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")
by using only pyspark functions such as join(), select() and the like?
I have to implement this join in a function and I don't want to be forced to have sqlContext as a function parameter.

Asterisk (*) works with alias. Ex:
from pyspark.sql.functions import *
df1 = df1.alias('df1')
df2 = df2.alias('df2')
df1.join(df2, df1.id == df2.id).select('df1.*')

Not sure if the most efficient way, but this worked for me:
from pyspark.sql.functions import col
df1.alias('a').join(df2.alias('b'),col('b.id') == col('a.id')).select([col('a.'+xx) for xx in a.columns] + [col('b.other1'),col('b.other2')])
The trick is in:
[col('a.'+xx) for xx in a.columns] : all columns in a
[col('b.other1'),col('b.other2')] : some columns of b

Without using alias.
df1.join(df2, df1.id == df2.id).select(df1["*"],df2["other"])

Here is a solution that does not require a SQL context, but maintains the metadata of a DataFrame.
a = sc.parallelize([['a', 'foo'], ['b', 'hem'], ['c', 'haw']]).toDF(['a_id', 'extra'])
b = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']]).toDF(["other", "b_id"])
c = a.join(b, a.a_id == b.b_id)
Then, c.show() yields:
+----+-----+-----+----+
|a_id|extra|other|b_id|
+----+-----+-----+----+
| a| foo| p1| a|
| b| hem| p2| b|
| c| haw| p3| c|
+----+-----+-----+----+

I believe that this would be the easiest and most intuitive way:
final = (df1.alias('df1').join(df2.alias('df2'),
on = df1['id'] == df2['id'],
how = 'inner')
.select('df1.*',
'df2.other')
)

drop duplicate b_id
c = a.join(b, a.a_id == b.b_id).drop(b.b_id)

Here is the code snippet that does the inner join and select the columns from both dataframe and alias the same column to different column name.
emp_df = spark.read.csv('Employees.csv', header =True);
dept_df = spark.read.csv('dept.csv', header =True)
emp_dept_df = emp_df.join(dept_df,'DeptID').select(emp_df['*'], dept_df['Name'].alias('DName'))
emp_df.show()
dept_df.show()
emp_dept_df.show()
Output for 'emp_df.show()':
+---+---------+------+------+
| ID| Name|Salary|DeptID|
+---+---------+------+------+
| 1| John| 20000| 1|
| 2| Rohit| 15000| 2|
| 3| Parth| 14600| 3|
| 4| Rishabh| 20500| 1|
| 5| Daisy| 34000| 2|
| 6| Annie| 23000| 1|
| 7| Sushmita| 50000| 3|
| 8| Kaivalya| 20000| 1|
| 9| Varun| 70000| 3|
| 10|Shambhavi| 21500| 2|
| 11| Johnson| 25500| 3|
| 12| Riya| 17000| 2|
| 13| Krish| 17000| 1|
| 14| Akanksha| 20000| 2|
| 15| Rutuja| 21000| 3|
+---+---------+------+------+
Output for 'dept_df.show()':
+------+----------+
|DeptID| Name|
+------+----------+
| 1| Sales|
| 2|Accounting|
| 3| Marketing|
+------+----------+
Join Output:
+---+---------+------+------+----------+
| ID| Name|Salary|DeptID| DName|
+---+---------+------+------+----------+
| 1| John| 20000| 1| Sales|
| 2| Rohit| 15000| 2|Accounting|
| 3| Parth| 14600| 3| Marketing|
| 4| Rishabh| 20500| 1| Sales|
| 5| Daisy| 34000| 2|Accounting|
| 6| Annie| 23000| 1| Sales|
| 7| Sushmita| 50000| 3| Marketing|
| 8| Kaivalya| 20000| 1| Sales|
| 9| Varun| 70000| 3| Marketing|
| 10|Shambhavi| 21500| 2|Accounting|
| 11| Johnson| 25500| 3| Marketing|
| 12| Riya| 17000| 2|Accounting|
| 13| Krish| 17000| 1| Sales|
| 14| Akanksha| 20000| 2|Accounting|
| 15| Rutuja| 21000| 3| Marketing|
+---+---------+------+------+----------+

I got an error: 'a not found' using the suggested code:
from pyspark.sql.functions import col df1.alias('a').join(df2.alias('b'),col('b.id') == col('a.id')).select([col('a.'+xx) for xx in a.columns] + [col('b.other1'),col('b.other2')])
I changed a.columns to df1.columns and it worked out.

function to drop duplicate columns after joining.
check it
def dropDupeDfCols(df):
newcols = []
dupcols = []
for i in range(len(df.columns)):
if df.columns[i] not in newcols:
newcols.append(df.columns[i])
else:
dupcols.append(i)
df = df.toDF(*[str(i) for i in range(len(df.columns))])
for dupcol in dupcols:
df = df.drop(str(dupcol))
return df.toDF(*newcols)

I just dropped the columns I didn't need from df2 and joined:
sliced_df = df2.select(columns_of_interest)
df1.join(sliced_df, on=['id'], how='left')
**id should be in `columns_of_interest` tho

df1.join(df2, ['id']).drop(df2.id)

If you need multiple columns from other pyspark dataframe then you can use this
based on single join condition
x.join(y, x.id == y.id,"left").select(x["*"],y["col1"],y["col2"],y["col3"])
based on multiple join condition
x.join(y, (x.id == y.id) & (x.no == y.no),"left").select(x["*"],y["col1"],y["col2"],y["col3"])

I very much like Xehron's answer above, and I suspect it's mechanically identical to my solution. This works in databricks, and presumably works in a typical spark environment (replacing keyword "spark" with "sqlcontext"):
df.createOrReplaceTempView('t1') #temp table t1
df2.createOrReplaceTempView('t2') #temp table t2
output = (
spark.sql("""
select
t1.*
,t2.desired_field(s)
from
t1
left (or inner) join t2 on t1.id = t2.id
"""
)
)

You could just make the join and after that select the wanted columns https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe%20join#pyspark.sql.DataFrame.join

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Or conditions on join result on cross join - sql

By joining it twice: select * from Tb1 inner join Tb2 on Tb1.key1=Tb2.key1 inner join Tb2 as Tb22 on Tb1.key2=Tb22.Key2 Or Left joining both: select * from Tb1 left join Tb2 on Tb1.key1=Tb2.key1 left join Tb2 as Tb22 on Tb1.key2=Tb22.Key2

Related

group by on the multiple inner join in postgres

Spark is Giving Incorrect Results with Left Outer Join

SQL query to find an output table

Joining tables and finding difference

Join two data frames, select all columns from one and some columns from the other

Categories

Resources