SPARK SQL query for match output - apache-spark-sql

I have 2 ds as below
ds1:
CustId Name Street1 City
=================================
1 Ron 1 Mn strt Hyd
2 Ashok westend av Delhi
3 Rajesh 5th Cross Mumbai
4 Venki 2nd Main NY
ds2:
Id CustName CustAddr1 City
=========================================
11 Ron 1 Mn Street Hyd
12 Ron eastend avn Patna
13 Rajesh 2nd Main Mumbai
14 Girish 100ft rd BLR
15 Dinesh 60ft Mum
16 Rajesh 1st Cross Mumbai
I am trying to find an exact match like ds1.Name --> ds2.CustName, ds1.city --> ds2.city
Output:
GrpID Rec_Id Count ds1.cond Rec_Id Count ds2.cond
======================================================================
1 1 1 Ron + Hyd 1001 1 Ron + Hyd
2 2 1 Rajesh + Mumbai 1002 2 Rajesh + Mumbai
How to write (SPARK) SQL query for it?
I tried
final Dataset<Row> rslt = spark.sql("select * from ds1 JOIN ds2 ON ds1.Name==ds2.CustName");
(using only name)
but it gives output of mXn for m matching rows in ds1 with n matching rows in ds2.
My first work on this. Any suggestion?

Related

Use inner join and left join with more than 2 tables? [duplicate]

This question already exists:
Use left join and inner join with more than 2 tables
Closed 4 months ago.
Here you can see I have three tables
Student table
Stud_id Name Br_id Email City_id
1001 Ankit 101 ankit#bmail.com 1
1002 Pranav 105 pranav#bmail.com 2
1003 Raj 102 raj#bmail.com 2
1004 Shyam 112 shyam#bmail.com 4
1005 Duke 112 duke#bmail.com 2
1006 Jhon 102 jhon#bmail.com 3
1007 Aman 101 aman#bmail.com 4
1008 Pavan 111 pavan#bmail.com 13
1009 Virat 112 virat#bmail.com 12
Branch Table
Br_id Br_name HOD Contact
101 CSE SH Rao 22345
102 MECH AP Sharma 28210
103 EXTC VK Reddy 34152
104 CHEM SK Mehta 45612
105 IT VL Shelke 22521
106 AI KH Verma 12332
107 PROD PG Kakde 90876
Address Table
City_id City Pincode
1 Mumbai 400121
2 Pune 450011
3 Lucknow 553001
4 Delhi 443221
5 Kolkata 213445
6 Chennai 345432
7 Nagpur 323451
8 Sri Nagar 321321
I am using here three tables first one is a student table and the second one is a Branch table and the third one is an Address table
So I am writing a query like this here you can see below
select [Name], Br_name, City
from student
inner join Branch on student.Br_id = Branch.Br_id
left join [Address] on student.City_id = [Address].City_id
I have three tables I want to show here the student's name and branch name city name, but I want to show the student who has their branch only. I also want to show the student who has their city as well as show a student who does not have any city.
I wrote the query above but here I am not getting the result. Here student's name who does not have any city what's wrong here in my above join SQL query?
Here as you can see I am getting this result which I do not want to show
Name Br_name City
Ankit CSE Mumbai
Pranav IT Pune
Raj MECH Pune
Jhon MECH Lucknow
Aman CSE Delhi
Why can I not get the students who do not have a city and who have a city? What's wrong here in my join query what I am missing here?
Please let me know what's wrong in my above join query what should I do to get proper result? I have given three tables above
You can see that and why I am not able to get the result what I want to show what is the wrong in the above my join query please let me know
I hope someone will help me. Thank you so much.

I have 3 tables Flight_schedule, Flights and third is Route I need a stored procedure in SQL which give the cheapest flight on a given date

I have 3 tables Flight_schedule, Flights and third is Route I need a stored procedure in SQL which give the cheapest flight on a given date.
When the parameter date is passed to procedure suppose 2 February so the result would be the lowest fare flight on the 2 Feb.
Here is the code where I have joined the tables and passed the parameter to stored procedure but when I am confused in the condition part.
Create proc spCheapestFlight
#FLIGHT_DATE DATE
AS
BEGIN
SELECT Flight_schedule.FlightDate,Flight_schedule.Departure,route.source, route.destination,
Flight_schedule.Arrival
,Flight_schedule.Fare,Flights.Flight_name
FROM Flight_schedule
inner join route ON Flight_schedule.Route_id=route.Route_id inner join
Flights on Flights.Flight_id=Flight_schedule.Flight_id
where Flight_schedule.FlightDate = #FLIGHT_DATE
END
Flight_schedule
F_Id
Flight_id
Total_Fare
FlightDate
Departure
Arrival
Route_Id
Fare
10000
1
1001
2022-02-01
09:30:30.0000000
11:20:45.0000000
100
7500.8
10001
2
1002
2022-02-02
09:45:30.0000000
11:55:45.0000000
101
7000.9
10002
3
1003
2022-02-03
10:30:30.0000000
12:20:45.0000000
102
5111.5
10003
4
1004
2022-02-04
11:30:30.0000000
14:20:45.0000000
103
5500.9
10004
5
1005
2022-02-05
12:30:30.0000000
15:20:45.0000000
104
9000.7
10005
1
1006
2022-02-06
13:30:30.0000000
16:20:45.0000000
105
8675.5
10006
2
1007
2022-02-07
14:30:30.0000000
17:20:45.0000000
106
4000.5
10007
3
1008
2022-02-08
15:30:30.0000000
18:20:45.0000000
107
4100.5
10008
4
1009
2022-02-09
16:30:30.0000000
19:20:45.0000000
101
4000.3
10009
2
1006
2022-02-10
06:30:30.0000000
08:20:45.0000000
108
4000.3
Flights
Flight_id
Flight_name
Capacity
1
Vistara
30
2
Indigo
30
3
SpiceJet
30
4
Go_First
30
5
Air India
30
Route
route_id
source
destination
100
Mumbai
Delhi
101
Delhi
Ahmedabad
102
Ahmedabad
Delhi
103
Ahmedabad
Mumbai
104
Chennai
Mumbai
105
Chennai
Goa
106
Chennai
Delhi
107
Goa
Delhi
108
Bangalore
Delhi
109
Hyderabad
Delhi
your data does not show more than 1 flight per day, in addition there is no need for table Flights and table Route to get result, you should join table FlightSchedule with itself in order to get the minimum fare. your problem can be solve with table-value function(TVF). you should change you PROCEDURE as follows:
CREATE PROCEDURE spCheapest_Costliest_Flight (#FlightDate DATE)
AS
SELECT FS.FlightDate,FS.Departure,route.source, route.destination,
FS.Arrival ,T.Fare,Flights.Flight_name
FROM Flight_schedule FS
inner join route ON FS.Route_id=route.Route_id
inner join Flights on Flights.Flight_id=FS.Flight_id
inner join (select FlightDate FlightDate,MIN(Fare) Fare
from Flight_schedule GROUP BY FlightDate) T
ON T.FlightDate=FS.FlightDate and T.FlightDate=#FlightDate and FS.FlightDate=#FlightDate
GO

3 Tables, JOIN query and alphabetic order

I am currently working with three tables where I am trying to figure out how to use a join to display once the title_id of any book with Dennis McCann as an editor. The tables have in common title_id and editor_id. Cant find a way to piece it all together. How to display once the title_id of any book with Dennis McCann as an editor?
SELECT * FROM title_editors;
EDITOR_ID TITLE_ EDITOR_ORDER
----------- ------ ------------
826-11-9034 Bu2075 2
826-11-9034 PS2091 2
826-11-9034 Ps2106 2
826-11-9034 PS3333 2
826-11-9034 PS7777 2
826-11-9034 pS1372 2
885-23-9140 MC2222 2
885-23-9140 MC3021 2
885-23-9140 Tc3281 2
885-23-9140 TC4203 2
885-23-9140 TC7777 2
321-55-8906 bU1032 2
321-55-8906 BU1111 2
321-55-8906 BU7832 2
321-55-8906 PC1035 2
321-55-8906 PC8888 2
321-55-8906 BU2075 3
777-02-9831 pc1035 3
777-02-9831 PC8888 3
943-88-7920 BU1032 1
943-88-7920 bu1111 1
943-88-7920 BU2075 1
943-88-7920 BU7832 1
943-88-7920 PC1035 1
943-88-7920 pc8888 1
993-86-0420 PS1372 1
993-86-0420 PS2091 1
993-86-0420 PS2106 1
993-86-0420 PS3333 1
993-86-0420 pS7777 1
993-86-0420 MC2222 1
993-86-0420 MC3021 1
993-86-0420 Tc3218 1
993-86-0420 TC4203 1
993-86-0420 TC7777 1
35 rows selected.
SQL> SELECT * FROM title_authors;
AUTHOR_ID TITLE_ AUTHOR_ORDER ROYALTY_SHARE
----------- ------ ------------ -------------
409-56-7008 Bu1032 1 .6
486-29-1786 PS7777 1 1
486-29-1786 pC9999 1 1
712-45-1867 MC2222 1 1
172-32-1176 Ps3333 1 1
213-46-8915 BU1032 2 .4
238-95-7766 PC1035 1 1
213-46-8915 Bu2075 1 1
998-72-3567 pS2091 1 .5
899-46-2035 PS2091 2 .5
998-72-3567 PS2106 1 1
722-51-5454 mc3021 1 .75
899-46-2035 MC3021 2 .25
807-91-6654 tC3218 1 1
274-80-9391 BU7832 1 1
427-17-2319 pC8888 1 .5
846-92-7186 PC8888 2 .5
756-30-7391 PS1372 1 .75
724-80-9391 PS1372 2 .25
724-80-9391 bu1111 1 .6
267-41-2394 bU1111 2 .4
672-71-3249 TC7777 1 .4
267-41-2394 TC7777 2 .3
472-27-2349 Tc7777 3 .3
648-92-1872 TC4203 1 1
25 rows selected.
SQL> SELECT * FROM editors;
EDITOR_ID EDITOR_LNAME EDITOR_FNAME EDITOR_POSITION PHONE ADDRESS CITY ST ZIP
----------- ----------------- ------------- --------------- ------------ -------------------- ------------ -- ------
321-55-8906 DeLongue Martinella Project 415 843-2222 3000 6th St. BERKELEY Ca 94710
723-48-9010 Sparks MANfred cOPY 303 721-3388 15 Sail DENVER Co 80237
777-02-9831 Samuelson Bernard proJect 415 843-6990 27 Yosemite OAKLAND Ca 94609
777-66-9902 Almond Alfred copy 312 699-4177 1010 E. DeVON CHICAGO Il 60018
826-11-9034 Himmel Eleanore pRoject 617 423-0552 97 Bleaker BOSTON Ma 02210
885-23-9140 Rutherford-Hayes Hannah PROJECT 301 468-3909 32 Rockbill Pike ROCKBILL MD 20852
993-86-0420 McCann Dennis acQuisition 301 468-3909 32 Rockbill Pike ROCKBill MD 20852
943-88-7920 Kaspchek Christof acquisitiOn 415 549-3909 18 Severe Rd. BERKELEY CA 94710
234-88-9720 Hunter Amanda acquisition 617 432-5586 18 Dowdy Ln. BOSTON MA 02210
You can try join on the table Editors and Ttile_Editors using the Editor_ID that will give you the matching records and you can filter out Only for the ' Dennis McCann ' using either multiple conditions in join or the where clause as,
WITHOUT WHERE
SELECT DISTINCT te.title_id,ed.EDITOR_ID,ed.EDITOR_LNAME,ed.EDITOR_FNAME
FROM
title_editors te JOIN editors ed
ON te.EDITOR_ID = ed.EDITOR_ID
AND ed.EDITOR_LNAME = 'McCann'
AND ed.EDITOR_FNAME = 'Dennis'
ORDER BY te.title_id
USing WHERE
SELECT DISTINCT te.title_id,ed.EDITOR_ID,ed.EDITOR_LNAME,ed.EDITOR_FNAME
FROM
title_editors te JOIN editors ed
ON te.EDITOR_ID = ed.EDITOR_ID
WHERE
ed.EDITOR_LNAME = 'McCann'
AND ed.EDITOR_FNAME = 'Dennis'
ORDER BY te.title_id
It would be easier with the in operator:
SELECT DISTINCT title_id
FROM title_editors
WHERE editor_id IN (SELECT editor_id
FROM editors
WHERE editor_fname = 'Dennis' AND
editor_lname = 'McCann')
ORDER BY title_id ASC

Returning only records that have matching fields in other records

I have a query that returns a list of customers and their addresses.
ID FName LName Address1 City Postcode
--------------------------------------------------------
1 James Smith 1 Bank Street London W1C 1AA
2 Sarah Jones 45 Moor Ave London SW1 1YH
3 Mary Smith 1 Bank Street London W1C 1AA
4 Sean Baker 17 White Blvd London SE3 7TH
5 Bob Patel 58B Canal St London NW2 2TT
6 Seeta Patel 58B Canal St London NW2 2TT
7 David Hound 4 Main St London E11 8AB
I'm trying to produce another query from this data that selects a list of customers who are related/living together.The criteria for this would be the same Address 1 and Postcode fields.
My question is how I can produce a query that only selects records that have at least 1 other record with matching [Address1] and [Postcode]? ie; in the above example return only records 1, 3, 5 and 6.
Select * From
Customers c JOIN
(SELECT Address1, PostCode FROM Customer GROUP BY Address1, PostCode HAVING Count(1) > 1) c2
ON c.Address1 = c2.Address1 AND c.PostCode = c2.PostCode

SQL Server: Merge Data Rows in single table in output

I have a SQL Server table with the following fields and sample data:
ID Name Address Age
23052-PF Peter Timbuktu 25
23052-D1 Jane Paris 22
23052-D2 David London 24
23050-PF Sam Beijing 22
23051-PF Nancy NYC 26
23051-D1 Carson Cali 22
23056-PF Grace LA 28
23056-D1 Smith Boston 23
23056-D2 Mark Adelaide 26
23056-D3 Hose Mexico 25
23056-D4 Mandy Victoria 24
Each ID with -PF is unique in the table.
Each ID with the -Dx is related to the same ID with the -PF.
Each ID with -PF may have 0 or more IDs with -Dx.
The maximum number of -Dx rows for a given -PF is 9.
i.e. an ID 11111-PF can have 11111-D1, 11111-D2, 11111-D3 up to 11111-D9.
Output expected for above sample data:
ID ID (without suffix) PF_Name PF_Address PF_Age D_Name D_Address D_Age
23052-PF 23052 Peter Timbuktu 25 Jane Paris 22
23052-PF 23052 Peter Timbuktu 25 David London 24
23050-PF 23050 Sam Beijing 22 NULL NULL NULL
23051-PF 23051 Nancy NYC 26 Carson Cali 22
23056-PF 23056 Grace LA 28 Smith Boston 23
23056-PF 23056 Grace LA 28 Mark Adelaide 26
23056-PF 23056 Grace LA 28 Hose Mexico 25
23056-PF 23056 Grace LA 28 Mandy Victoria 24
I need to be able to join the -PF and -Dx as above.
If a -PF has 0 Dx rows, then D_Name, D_Address and D_Age columns in the output should return NULL.
If a -PF has one or more Dx rows, then PF_Name, PF_Address and PF_Age should repeat for each row in the output and D_Name, D_Address and D_Age should contain the values from each related Dx row.
Need to use MSSQL.
Query should not use views or create additional tables.
Thanks for all your help!
select
pf.ID,
pf.IDNum,
pf.Name as PF_Name,
pf.Address as PF_Address,
pf.Age as PF_Age,
dx.Name as D_Name,
dx.Address as D_Address,
dx.Age as D_Age
from
(
select
ID, left(ID, 5) as IDNum, Name, Address, Age
from
mytable
where
right(ID, 3) = '-PF'
) pf
left outer join
(
select
ID, left(ID, 5) as IDNum, Name, Address, Age
from
mytable
where
right(ID, 3) != '-PF'
) dx
on pf.IDNum = dx.IDNum
SqlFiddle demo: http://sqlfiddle.com/#!6/dfdbb/1
SELECT t1.ID, LEFT(t1.ID,5) "ID (without Suffix)",
t1.Name "PF_Name", t1.Address "PF_Address", t1.Age "PF_Age",
t2.Name "D_Name", t2.Address "D_Address", t2.Age "D_Age"
FROM PFTable t1
LEFT JOIN PFTable t2 on LEFT(t1.ID,5) = LEFT(t2.ID,5)
WHERE RIGHT(t1.ID,2) = 'PF'