About sparkSql optimization - left join between small and large table

spark-sql --master yarn --driver-memory 10G --executor-memory 20G --executor-cores 20 --num-executors 20
(These are the resources I requested. Forgive me for not sizing them proportionally...)
The SQL is as follows:
select 'ADDRESS',count(a.pid)
FROM (SELECT pa.pid
FROM dmgr.ex_p10ids_address pa
WHERE pa.pt IN ('20200227')
AND pa.src_sys = 'APP0001'
AND pa.endtime = '99991231999'
AND pa.idtype NOT IN ('00')
AND certificate_type(pa.idtype, 'P10IDS') <> '0') a
LEFT JOIN (SELECT pr.apid, pr.pid
FROM p10ids_riskcon pr
WHERE pr.classcode NOT IN
('26371100', '26371200', '26371300', '13770100', '26376000')) b
ON a.pid = b.apid
OR a.pid = b.pid;
-- takes 6 hrs, 33 mins, 15 sec.
The two source tables differ by several orders of magnitude:
select count(1) from (SELECT count(pa.pid)
FROM dmgr.ex_p10ids_address pa
WHERE pa.pt IN ('20200227')
AND pa.src_sys = 'APP0001'
AND pa.endtime = '99991231999'
AND pa.idtype NOT IN ('00')
AND certificate_type(pa.idtype, 'P10IDS') <> '0' --group by pa.pid
) t; -- 46,644 rows / with group by: 45,094
select count(1) from(
SELECT pr.apid, pr.pid
FROM p10ids_riskcon pr
WHERE pr.classcode NOT IN
('26371100', '26371200', '26371300', '13770100', '26376000')
--group by pr.apid, pr.pid
) t ; -- 1,493,862,737 rows / with group by: 489,730,113
How can I reduce the running time? Is something wrong with my spark-sql resource settings?
Any comments are welcome.

Related

Long Running Query - Recommendations to improve performance in Redshift

SELECT
A.load,
A.sender,
A.latlong,
COUNT(distinct B.load) as load_count,
COUNT(distinct B.sender) as sender_count
FROM TABLE_A A
JOIN TABLE_B B ON
A.sender <> B.sender AND
(
A.latlong = B.latlong
or
(
lower(A.address_line1) = lower(B.address_line1)
and lower(A.city) = lower(B.city)
and lower(A.state) = lower(B.state)
and lower(A.country) = lower(B.country)
)
)
GROUP BY A.load, A.sender, A.latlong ;
I am trying to run a query like the sample above. It runs far longer than expected (approx. 2 hrs). I tried splitting the query and combining the parts with UNION, but the result sets do not match.
Can you please help with options to improve this query's performance, or alternative ways to achieve this in AWS?
There are approximately 1.5 million records.

I would suggest removing the lower() calls from the join and sanitizing the data to lowercase beforehand:
select
A.load, A.sender, A.latlong,
count(distinct B.load) as load_count,
count(distinct B.sender) as sender_count
from
TABLE_A A
join
TABLE_B B
on
A.sender <> B.sender and
(
A.latlong = B.latlong
or
(
A.address_line1 = B.address_line1
and A.city = B.city
and A.state = B.state
and A.country = B.country
))
group by
A.load, A.sender, A.latlong ;
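A rough sketch of the "sanitize first" idea, in case it helps. This uses SQLite through Python purely as a stand-in for Redshift, and the two tiny tables and their rows are made up for illustration. The point is only that the lowercasing happens once, in an UPDATE, so the join can compare the columns directly:

```python
import sqlite3

# In-memory SQLite database standing in for the real warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, cut-down versions of TABLE_A and TABLE_B.
    CREATE TABLE table_a (sender TEXT, city TEXT);
    CREATE TABLE table_b (sender TEXT, city TEXT);
    INSERT INTO table_a VALUES ('s1', 'Boston'), ('s2', 'AUSTIN');
    INSERT INTO table_b VALUES ('s3', 'boston'), ('s4', 'Austin'), ('s5', 'Denver');

    -- One-time clean-up: store the text in lowercase.
    UPDATE table_a SET city = lower(city);
    UPDATE table_b SET city = lower(city);
""")

# The join no longer needs lower() on either side.
matches = conn.execute("""
    SELECT a.sender, b.sender
    FROM table_a a
    JOIN table_b b
      ON a.sender <> b.sender
     AND a.city = b.city
    ORDER BY a.sender
""").fetchall()

print(matches)
```

On a real table you would probably do the clean-up at load time rather than with an UPDATE, but that depends on how the data arrives.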

How to use Order By and With clause

I am using following SQL query to fetch set of records.
;WITH SFPIPELINE AS (
SELECT
PIPELINE_STRING,
PACKET_NUMBER,
PIPELINE_NUMBER
FROM
[RTMASTER].[DBO].[SF_PIPELINE]
WHERE
PIPELINE_STRING IN (
'SOLUTION_TEST',
'2018.01_SVC_SANDBOX',
'2018.01_SVC_ENG'
)
AND PACKET_NUMBER IN (98, 1090, 1092)
),
PROJ_INST_PIPELINE AS (
SELECT
DISTINCT PIP.PROJECT_INSTANCE_PIPELINE_ID,
PIP.PROJECT_INSTANCE_ID,
PIP.PACKET_NUMBER,
PIP.PROJECT_NUMBER,
PIP.SOURCE_SET_INSTANCE,
SFP.PIPELINE_STRING
FROM
PROJECT_INSTANCE_PIPELINE PIP
INNER JOIN SFPIPELINE SFP ON PIP.PACKET_NUMBER = SFP.PACKET_NUMBER
AND PIP.PIPELINE_NUMBER = SFP.PIPELINE_NUMBER
AND PIP.ACTIVE = 1
AND PIP.PROJECT_INSTANCE_PIPELINE_ID >= 20481038
),
PROJ_INST_BASE AS (
SELECT
PIP.PROJECT_INSTANCE_PIPELINE_ID,
PIP.PROJECT_NUMBER,
PIP.PACKET_NUMBER,
PIP.PIPELINE_STRING,
PIP.SOURCE_SET_INSTANCE,
PIP.PROJECT_INSTANCE_ID,
PIB.ORIGINAL_PROMOTER,
PIB.DEV_INSTANCE,
PROJECT_TYPE_NUMBER,
PIB.SUBVERSION_PROJECT_REVISION,
PIB.SUBVERSION_PROJECT_URL,
PIB.Front_End,
PIB.Back_End
FROM
PROJECT_INSTANCE_BASE PIB
INNER JOIN PROJ_INST_PIPELINE PIP ON PIB.PROJECT_INSTANCE_ID = PIP.PROJECT_INSTANCE_ID
AND PIP.PROJECT_NUMBER = PIB.PROJECT_NUMBER
AND PIB.PROJECT_TYPE_NUMBER IN (5, 105, 106)
),
SF_PROJ AS (
SELECT
PJTINST.PROJECT_INSTANCE_PIPELINE_ID,
PJTINST.PROJECT_INSTANCE_ID,
PJTINST.PROJECT_NUMBER,
PJTINST.PIPELINE_STRING,
PJTINST.ORIGINAL_PROMOTER,
PJTINST.SOURCE_SET_INSTANCE,
PJTINST.PROJECT_TYPE_NUMBER,
PJTINST.PACKET_NUMBER,
SFP.PROJECT_NAME,
PJTINST.SUBVERSION_PROJECT_REVISION,
PJTINST.SUBVERSION_PROJECT_URL,
PJTINST.Front_End,
PJTINST.Back_End
FROM
DBO.SF_PROJECT SFP
INNER JOIN PROJ_INST_BASE PJTINST ON SFP.PROJECT_NUMBER = PJTINST.PROJECT_NUMBER
),
USER_DETAIL AS (
SELECT
SFP.PROJECT_NAME,
SFP.PROJECT_NUMBER,
SFP.PROJECT_TYPE_NUMBER,
SFP.SOURCE_SET_INSTANCE,
SFP.PACKET_NUMBER,
SFP.PIPELINE_STRING,
SFP.SUBVERSION_PROJECT_REVISION,
SFP.SUBVERSION_PROJECT_URL,
SFP.PROJECT_INSTANCE_PIPELINE_ID,
SFP.PROJECT_INSTANCE_ID,
AIAA.EMAIL_ADDRESS,
SFP.Front_End,
SFP.Back_End
FROM
SF_ASSOCIATE_INFO_ALL_ASSOCIATES AIAA
INNER JOIN SF_PROJ SFP ON AIAA.OPER_ID = SFP.ORIGINAL_PROMOTER
),
FINAL AS (
SELECT
UD.PROJECT_NAME,
FP.Feature_Number,
UD.PROJECT_NUMBER,
UD.PROJECT_TYPE_NUMBER,
UD.SOURCE_SET_INSTANCE,
UD.PACKET_NUMBER,
UD.PIPELINE_STRING,
UD.SUBVERSION_PROJECT_REVISION,
UD.SUBVERSION_PROJECT_URL,
UD.PROJECT_INSTANCE_PIPELINE_ID,
UD.PROJECT_INSTANCE_ID,
UD.EMAIL_ADDRESS,
UD.Front_End,
UD.Back_End
FROM
[RTMaster].[dbo].[Feature_Projects_History] FP
INNER JOIN USER_DETAIL UD ON FP.Project_Instance_Pipeline_ID = UD.PROJECT_INSTANCE_PIPELINE_ID
)
SELECT
*
FROM
FINAL
The query works fine; the only problem is that the records are not sorted.
I want to use ORDER BY on PROJECT_INSTANCE_PIPELINE_ID so that all the rows are sorted. When I add an ORDER BY clause, I see the following error.
Error:
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.
I am not sure how to use ORDER BY together with a WITH clause.
Any thoughts?

Try it like this: put the ORDER BY on the final SELECT, outside the CTEs. I just used PROJECT_NAME in the ORDER BY.
WITH SFPIPELINE AS
(SELECT PIPELINE_STRING, PACKET_NUMBER, PIPELINE_NUMBER FROM [RTMASTER].[DBO].[SF_PIPELINE]
WHERE PIPELINE_STRING IN ( 'SOLUTION_TEST', '2018.01_SVC_SANDBOX', '2018.01_SVC_ENG')
AND PACKET_NUMBER IN (98, 1090, 1092)),
PROJ_INST_PIPELINE AS
(SELECT DISTINCT PIP.PROJECT_INSTANCE_PIPELINE_ID, PIP.PROJECT_INSTANCE_ID, PIP.PACKET_NUMBER, PIP.PROJECT_NUMBER, PIP.SOURCE_SET_INSTANCE, SFP.PIPELINE_STRING FROM PROJECT_INSTANCE_PIPELINE PIP
INNER JOIN SFPIPELINE SFP ON PIP.PACKET_NUMBER = SFP.PACKET_NUMBER AND PIP.PIPELINE_NUMBER = SFP.PIPELINE_NUMBER AND PIP.ACTIVE = 1
AND PIP.PROJECT_INSTANCE_PIPELINE_ID >= 20481038),
PROJ_INST_BASE AS
(SELECT PIP.PROJECT_INSTANCE_PIPELINE_ID, PIP.PROJECT_NUMBER, PIP.PACKET_NUMBER, PIP.PIPELINE_STRING, PIP.SOURCE_SET_INSTANCE, PIP.PROJECT_INSTANCE_ID, PIB.ORIGINAL_PROMOTER, PIB.DEV_INSTANCE,PROJECT_TYPE_NUMBER, PIB.SUBVERSION_PROJECT_REVISION, PIB.SUBVERSION_PROJECT_URL,
PIB.Front_End, PIB.Back_End FROM PROJECT_INSTANCE_BASE PIB INNER JOIN PROJ_INST_PIPELINE PIP ON PIB.PROJECT_INSTANCE_ID = PIP.PROJECT_INSTANCE_ID AND PIP.PROJECT_NUMBER= PIB.PROJECT_NUMBER AND PIB.PROJECT_TYPE_NUMBER IN (5,105, 106)),
SF_PROJ AS
(SELECT PJTINST.PROJECT_INSTANCE_PIPELINE_ID, PJTINST.PROJECT_INSTANCE_ID, PJTINST.PROJECT_NUMBER, PJTINST.PIPELINE_STRING, PJTINST.ORIGINAL_PROMOTER, PJTINST.SOURCE_SET_INSTANCE, PJTINST.PROJECT_TYPE_NUMBER, PJTINST.PACKET_NUMBER, SFP.PROJECT_NAME,
PJTINST.SUBVERSION_PROJECT_REVISION, PJTINST.SUBVERSION_PROJECT_URL, PJTINST.Front_End, PJTINST.Back_End FROM DBO.SF_PROJECT SFP INNER JOIN PROJ_INST_BASE PJTINST ON SFP.PROJECT_NUMBER = PJTINST.PROJECT_NUMBER),
USER_DETAIL AS
(SELECT SFP.PROJECT_NAME, SFP.PROJECT_NUMBER, SFP.PROJECT_TYPE_NUMBER, SFP.SOURCE_SET_INSTANCE, SFP.PACKET_NUMBER, SFP.PIPELINE_STRING, SFP.SUBVERSION_PROJECT_REVISION, SFP.SUBVERSION_PROJECT_URL, SFP.PROJECT_INSTANCE_PIPELINE_ID, SFP.PROJECT_INSTANCE_ID, AIAA.EMAIL_ADDRESS, SFP.Front_End, SFP.Back_End
FROM SF_ASSOCIATE_INFO_ALL_ASSOCIATES AIAA INNER JOIN SF_PROJ SFP ON AIAA.OPER_ID = SFP.ORIGINAL_PROMOTER),
FINAL AS
(SELECT UD.PROJECT_NAME, FP.Feature_Number, UD.PROJECT_NUMBER, UD.PROJECT_TYPE_NUMBER, UD.SOURCE_SET_INSTANCE, UD.PACKET_NUMBER, UD.PIPELINE_STRING, UD.SUBVERSION_PROJECT_REVISION, UD.SUBVERSION_PROJECT_URL,
UD.PROJECT_INSTANCE_PIPELINE_ID, UD.PROJECT_INSTANCE_ID, UD.EMAIL_ADDRESS, UD.Front_End, UD.Back_End FROM [RTMaster].[dbo].[Feature_Projects_History] FP
INNER JOIN USER_DETAIL UD ON FP.Project_Instance_Pipeline_ID = UD.PROJECT_INSTANCE_PIPELINE_ID)
SELECT * FROM FINAL ORDER BY PROJECT_NAME -- use whichever column you need here, e.g. PROJECT_INSTANCE_PIPELINE_ID
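For reference, here is the same placement boiled down to a toy example. It runs on SQLite through Python, not SQL Server, so it will not reproduce the error message; the table `pipeline` and its rows are invented. It only shows where the ORDER BY goes: after the outer SELECT, never inside the CTE body.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-in for the real pipeline tables.
    CREATE TABLE pipeline (pipeline_id INTEGER, project_name TEXT);
    INSERT INTO pipeline VALUES (30, 'gamma'), (10, 'alpha'), (20, 'beta');
""")

rows = conn.execute("""
    WITH final AS (
        -- No ORDER BY in here: a CTE is just a named result set.
        SELECT pipeline_id, project_name
        FROM pipeline
        WHERE pipeline_id >= 10
    )
    SELECT *
    FROM final
    ORDER BY pipeline_id   -- sorting belongs to the outermost query
""").fetchall()

print(rows)
```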

PIVOT on multiple columns SQL server (Aspen Relay Database)

I'm using a vendor-supplied relay database (Aspen), which runs on MS SQL Server. I'm attempting to write a pivot query that needs to pivot on 2 columns.
I created a temporary result set (a CTE) since the data is spread across multiple tables.
WITH TEMP_TABLE AS (
SELECT
R.LOCATIONID LLOCATIONID, R.ID RID, s.groupname SGROUPNAME,t.settingname TSETTINGNAME, s.setting SSETTING
from tsetting1 s
inner join tsettype1 t on t.relaytype=s.relaytype and t.groupname = s.groupname and t.rownumber = s.rownumber
INNER JOIN TREQUEST Q ON S.REQUESTID = Q.ID
INNER JOIN TRELAY R ON R.ID = Q.RELAYID
INNER JOIN TLOCATION L ON L.ID = R.LOCATIONID
where s.requestid=29117
)
select * from TEMP_TABLE
That SELECT * from TEMP_TABLE returns 38 rows of data; a subset is shown here:
RID    SGROUPNAME  TSETTINGNAME     SSETTING
31297  LOAD1       ENABLE           TRUE
31297  LOAD1       ANGLE            60
31297  LOAD2       CALCULATED_LOAD  12269
ETC....
I added this pivot, which gets me close:
PIVOT (MAX(SSETTING) FOR TSETTINGNAME IN (ENABLE, REACH, ANGLE, CALCULATED_LOADABILITY, ZLE, CTR, PTR, KVNOM, PICKUP, PERCENTAGE)) P
Returned result from the pivot:
RID    SGROUPNAME  ENABLE  REACH  ANGLE  CALCULATED_LOADABILITY
31297  LOAD1       TRUE    15     60     9444
31297  LOAD2       TRUE    10     30     12269
31297  LOAD3       TRUE    20     60     14167
ETC...
I would like to have the data as 1 record for RID 31297, where LOAD1-ENABLE, LOAD2-ENABLE, LOAD3-ENABLE, LOAD1-REACH, ETC. are all headers.
I've tried multiple pivots and cross apply, but I can't seem to get the data to display correctly.
Let me know if anything is unclear or if you need more information. Any help will be greatly appreciated.
Thanks,
Joe C.

It may be easier with a bunch of CASE expressions and a GROUP BY on RID. By the way, this is the "original" method of pivoting, from before PIVOT was implemented.
select RID
, Load1_Enable = MAX(case when SGROUPNAME = 'Load1' then enable else null end)
, Load2_Enable = MAX(case when SGROUPNAME = 'Load2' then enable else null end)
from [YourTable]
group by RID
I would apply it directly to the first CTE, though, testing both the group name and the setting name:
WITH TEMP_TABLE AS (
SELECT
R.LOCATIONID LLOCATIONID, R.ID RID, s.groupname SGROUPNAME,t.settingname TSETTINGNAME, s.setting SSETTING
from tsetting1 s
inner join tsettype1 t on t.relaytype=s.relaytype and t.groupname = s.groupname and t.rownumber = s.rownumber
INNER JOIN TREQUEST Q ON S.REQUESTID = Q.ID
INNER JOIN TRELAY R ON R.ID = Q.RELAYID
INNER JOIN TLOCATION L ON L.ID = R.LOCATIONID
where s.requestid=29117
)
select RID
,Load1_Enable = MAX(case when SGROUPNAME = 'Load1' and TSETTINGNAME = 'Enable' then SSETTING else null end)
from TEMP_TABLE
group by RID
Note that MAX is there only because GROUP BY needs an aggregate; each CASE should match exactly one record per RID.
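With ten settings and several load groups, typing every CASE by hand gets tedious, so here is a sketch that generates them. It assumes SQLite via Python instead of SQL Server, a flattened hypothetical table `settings` in place of TEMP_TABLE, and only the four sample values that appear in the question. The group and setting names are fixed constants in the script, which is why formatting them into the SQL text is acceptable here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical flattened copy of the TEMP_TABLE rows.
    CREATE TABLE settings (rid INTEGER, groupname TEXT, settingname TEXT, setting TEXT);
    INSERT INTO settings VALUES
        (31297, 'LOAD1', 'ENABLE', 'TRUE'),
        (31297, 'LOAD1', 'ANGLE',  '60'),
        (31297, 'LOAD2', 'ENABLE', 'TRUE'),
        (31297, 'LOAD2', 'ANGLE',  '30');
""")

# Trusted, hard-coded lists; one output column per (group, setting) pair.
groups = ["LOAD1", "LOAD2"]
names = ["ENABLE", "ANGLE"]
columns = ",\n".join(
    f"MAX(CASE WHEN groupname = '{g}' AND settingname = '{n}' "
    f"THEN setting END) AS {g}_{n}"
    for g in groups
    for n in names
)

# One row per RID, with LOAD1_ENABLE, LOAD1_ANGLE, LOAD2_ENABLE, LOAD2_ANGLE.
query = f"SELECT rid,\n{columns}\nFROM settings\nGROUP BY rid"
row = conn.execute(query).fetchone()

print(row)
```

The generated text can also just be printed and pasted into the SQL Server query if you would rather not build it at run time.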

How to use alias of a subquery to get the running total?

I have a UNION of 3 tables for calculating a balance, and I need to get the running SUM of that balance. I can't use a window function (SUM ... OVER with PARTITION BY), because the SQL query must work in Access.
My problem is that I cannot JOIN on an aliased subquery; it won't work.
How can I use the alias in a JOIN to get the running total?
Or is there any other way to get the SUM without OVER / PARTITION BY, since that does not exist in Access?
This is my code so far:
SELECT korisnik_id, imePrezime, datum, Dug, Pot, (Dug - Pot) AS Balance
FROM (
SELECT korisnik_id, k.imePrezime, r.datum, SUM(IIF(u.jedinstven = 1, r.cena, k.kvadratura * r.cena)) AS Dug, '0' AS Pot
FROM Racun r
INNER JOIN Usluge u ON r.usluga_id = u.ID
INNER JOIN Korisnik k ON r.korisnik_id = k.ID
WHERE korisnik_id = 1
AND r.zgrada_id = 1
AND r.mesec = 1
AND r.godina = 2017
GROUP BY korisnik_id, k.imePrezime, r.datum
UNION ALL
SELECT korisnik_id, k.imePrezime, rp.datum, SUM(IIF(u.jedinstven = 1, rp.cena, k.kvadratura * rp.cena)) AS Dug, '0' AS Pot
FROM RacunP rp
INNER JOIN Usluge u ON rp.usluga_id = u.ID
INNER JOIN Korisnik k ON rp.korisnik_id = k.ID
WHERE korisnik_id = 1
AND rp.zgrada_id = 1
AND rp.mesec = 1
AND rp.godina = 2017
GROUP BY korisnik_id, k.imePrezime, rp.datum
UNION ALL
SELECT uu.korisnik_id, k.imePrezime, uu.datum, '0' AS Dug, SUM(uu.iznos) AS Pot
FROM UnosUplata uu
INNER JOIN Korisnik k ON uu.korisnik_id = k.ID
WHERE korisnik_id = 1
GROUP BY uu.korisnik_id, k.imePrezime, uu.datum
) AS a
ORDER BY korisnik_id

You can save a query (let's name it Query1) for the UNION of the 3 tables and then create another query that returns each row in the first query and calculates the sum of the rows that are before it (optionally checking that they are in the same group).
It should be something like this:
SELECT *, (
SELECT SUM(Value) FROM Query1 AS b
WHERE b.GroupNumber=a.GroupNumber
AND b.Position<=a.Position
) AS RunningSum
FROM Query1 AS a
However, it's more efficient to do that in the report.
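To make the correlated-subquery idea concrete, here is a small sketch. It runs on SQLite through Python rather than Access, and the table `query1` with its `group_number`, `position` and `value` columns is a made-up stand-in for the saved UNION query (in the real case, the position would probably be the date and the value would be Dug - Pot).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-in for the saved query Query1.
    CREATE TABLE query1 (group_number INTEGER, position INTEGER, value INTEGER);
    INSERT INTO query1 VALUES (1, 1, 100), (1, 2, -40), (1, 3, 25), (2, 1, 10);
""")

rows = conn.execute("""
    SELECT a.group_number, a.position, a.value,
           (SELECT SUM(b.value)
            FROM query1 AS b
            WHERE b.group_number = a.group_number   -- same group only
              AND b.position <= a.position          -- this row and all earlier ones
           ) AS running_sum
    FROM query1 AS a
    ORDER BY a.group_number, a.position
""").fetchall()

for r in rows:
    print(r)
```

The subquery is re-evaluated for every outer row, so the cost grows roughly with the square of the group size; that is the reason doing it in the report can be cheaper.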

Null/0 Answers Not Appearing

Another Question for Today :)
Wrote a query that works perfectly - except I want it to show NULL/0 values as 0 - and - that isn't happening. I tried approaching this two ways:
First I used isnull()
Select isnull(Count(*),0) as Total ,
z.zname
From STable s ,
SLTable sl ,
ZTable z ,
SETable se ,
SEETable see ,
SEGTable seg
Where s.sID = sl.sID
and sl.zID = z.zID
and s.sID = se.sID
and se.etID = see.etID
and see.segID = seg.segID
and see.segID = 3
Group By z.zname
order by z.zname
It doesn't seem to give me the NULL/0 values.
Then I tried using a SUM/CASE approach:
Select sum(case when see.segID <> 3 then 0 else 1 end) as Total ,
z.zname
From STable s ,
SLTable sl ,
ZTable z ,
SETable se ,
SEETable see ,
SEGTable seg
Where s.sID = sl.sID
and sl.zID = z.zID
and s.sID = se.sID
and se.etID = see.etID
and see.segID = seg.segID
and see.segID = 3
Group By z.zname
order by z.zname
And still no 0 values - so now I'm stumped :(

Well, it's unclear what you're actually trying to find. However, a big part of your problem is that you're using old-school, pre-ISO/ANSI join syntax. If you refactor your join to use modern join syntax, you'll get a query that looks something (a lot, actually) like this:
select zName = z.zname ,
Total = count(*)
From ZTable z
join SLTable sl on sl.zID = z.zID
join STable s on s.sID = sl.sID
join SETable se on se.sID = s.sID
join SEETable see on see.etID = se.etID
and see.segID = 3
join SEGTable seg on seg.segID = see.segID
Group By z.zname
order by z.zname
I suspect that what you want is a list of all zNames and their respective counts of rows having segID = 3. Since you are using inner joins, you'll only ever see zNames that have a match. What you can do is something like this:
select zName = z.zname ,
Total = sum(case see.segID when 3 then 1 else 0 end)
from ZTable z
left join SLTable sl on sl.zID = z.zID
left join STable s on s.sID = sl.sID
left join SETable se on se.sID = s.sID
left join SEETable see on see.etID = se.etID
group By z.zname
order by z.zname
The above will return every row from ZTable at least once, with NULL values for the columns of any table for which no match was found. Then we group by zname and count only the rows where segID is 3, so a zname with no such rows gets a total of 0.
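Here is a cut-down illustration of the difference. It assumes SQLite via Python and collapses the chain of joins into a single hypothetical `events` table, so it is a sketch of the principle and not your actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified schema: one lookup table, one detail table.
    CREATE TABLE ztable (zid INTEGER, zname TEXT);
    CREATE TABLE events (zid INTEGER, segid INTEGER);
    INSERT INTO ztable VALUES (1, 'north'), (2, 'south'), (3, 'west');
    INSERT INTO events VALUES (1, 3), (1, 3), (1, 5), (2, 5);
""")

# Inner join: znames without a segid = 3 row vanish from the result.
inner_rows = conn.execute("""
    SELECT z.zname, COUNT(*)
    FROM ztable z
    JOIN events e ON e.zid = z.zid AND e.segid = 3
    GROUP BY z.zname
    ORDER BY z.zname
""").fetchall()

# Left join + conditional sum: every zname is kept, unmatched ones total 0.
left_rows = conn.execute("""
    SELECT z.zname, SUM(CASE e.segid WHEN 3 THEN 1 ELSE 0 END)
    FROM ztable z
    LEFT JOIN events e ON e.zid = z.zid
    GROUP BY z.zname
    ORDER BY z.zname
""").fetchall()

print(inner_rows)
print(left_rows)
```

'south' has events but none with segid 3, and 'west' has no events at all; both show up with 0 only in the left-join version.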

Actually, just to semi-answer my own question: I just realized that the WHERE clause has see.segID = 3, so it will only return results with segID = 3. So it can never produce a row that does NOT have 3, right?
But I'm specifically looking for segIDs equal to 3; I just want it to display 0 if there is nothing there.

Here's what I am assuming you need changed
Select
Count(*) as Total,
z.zname
From STable s, SLTable sl, ZTable z, SETable se, SEETable see, SEGTable seg
Where s.sID=sl.sID
and sl.zID=z.zID
and s.sID=se.sID
and se.etID=see.etID
and see.segID=seg.segID
and see.segID=3
AND TABLE.YOURCOLUMN IS NOT NULL -- THESE ARE
AND TABLE.YOURCOLUMN <> 0 -- NEW
GROUP BY z.zname
ORDER BY z.zname

In your query:
isnull(Count(*),0) as Total,
COUNT will never return NULL; it returns a numeric value ranging from 0 to n.
So your ISNULL condition will never be satisfied.
You can simply write select count(*) as total ... instead of isnull(Count(*),0), and it will return 0 whenever it finds no rows in the table.
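A quick way to check this yourself, sketched with SQLite through Python and an empty made-up table `t`. One caveat it also shows: the 0 only appears when there is no GROUP BY. With GROUP BY, an empty input produces no groups and therefore no rows at all, which is what happens in the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical empty table, just to see what COUNT does with no rows.
conn.executescript("CREATE TABLE t (zname TEXT);")

# Without GROUP BY: exactly one row, and the count is 0 (never NULL).
total = conn.execute("SELECT COUNT(*) FROM t").fetchone()

# With GROUP BY: no groups exist, so no rows come back.
grouped = conn.execute("SELECT zname, COUNT(*) FROM t GROUP BY zname").fetchall()

print(total)
print(grouped)
```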
