I am reverse engineering some legacy SQL algorithms to move them to Apache Spark.
I have encountered a CROSS APPLY, which I understand is T-SQL specific and has no direct equivalent in ANSI SQL or Spark SQL.
The sanitized algorithm is:
SELECT
Id_P ,
Monthindex ,
(
SELECT
100 * (STDEV(ResEligible.num_valid) / AVG(ResEligible.num_valid)) AS Pre_Coef_Var
FROM
tbl_p a CROSS APPLY
(
SELECT
e.Monthindex ,
e.num AS num_valid
FROM
dbo.tbl_p e
WHERE
e.Monthindex = a.MonthIndex
AND e.Id_P = a.Id_P
UNION ALL
SELECT DISTINCT
B1.[MonthIndex ] ,
Tr.num AS num_valid
FROM
#tbl_pr B1
INNER JOIN
#tbl_pr B2
ON
B1.[Id_P] = B2.[Id_P]
AND B2.Rang - B1.Rang BETWEEN 0 AND 2
INNER JOIN
dbo.tbl_p Tr
ON
Tr.Id_P = B1.Id_P
AND Tr.Monthindex = B1.Monthindex
WHERE
a.Id_P = B1.[Id_P]
AND B2.[MonthIndex] =
(
SELECT
MAX([MonthIndex])
FROM
#tbl_pr
WHERE
[MonthIndex] < a.MonthIndex
AND [Id_P] = a.Id_P) ) AS ResEligible
WHERE
a.Id_P = result.Id_P
AND a.MonthIndex = result.MonthIndex) AS Coeff
FROM
tbl_p AS result
WHERE
1 = 1
AND MonthIndex = #CurrentMonth
GROUP BY
Id_P ,
Monthindex) AS CC
So for every row in alias a, we cross apply to the inner queries.
Is it possible to rewrite the CROSS APPLY in terms of join operations (or otherwise) so I can re-implement it in Spark SQL?
Cheers
Terry
Seems like you could rewrite your query as the below:
SELECT T1.col1,
T1.col2,
sq.col3Sum
FROM tbl1 T1
CROSS JOIN (SELECT SUM(T1sq.Col3) AS col3Sum
FROM tbl1 T1sq
JOIN tbl2 T2 ON T1sq.Col1 = T2.Col2
JOIN tbl3 T3 ON T2.col1 = T3.Col1) sq;
Seems odd, however, that there was no JOIN criteria between the 2 references to tbl1.
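As a minimal sketch of the general idea (not a translation of the sanitized query above): when the applied subquery only filters and aggregates rows correlated to the outer row, a CROSS APPLY can usually be expressed as an inner join plus GROUP BY, which Spark SQL supports. The example uses Python's built-in sqlite3 module purely as a stand-in engine. The table p and its columns id, m and num are hypothetical, and the rewrite assumes (id, m) is unique and that every outer row has at least one match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE p (id INTEGER, m INTEGER, num REAL)")
conn.executemany(
    "INSERT INTO p VALUES (?, ?, ?)",
    [(1, 1, 10.0), (1, 2, 20.0), (1, 3, 60.0), (2, 1, 5.0)],
)

# T-SQL shape being replaced (not valid in SQLite or Spark SQL):
#   SELECT a.id, a.m, x.avg_num
#   FROM p a
#   CROSS APPLY (SELECT AVG(e.num) AS avg_num
#                FROM p e
#                WHERE e.id = a.id AND e.m <= a.m) x
#
# Join form: the correlation predicates move into the ON clause and the
# aggregate is grouped by the outer row's key (hypothetical key: id, m).
join_sql = """
SELECT a.id, a.m, AVG(e.num) AS avg_num
FROM p a
JOIN p e ON e.id = a.id AND e.m <= a.m
GROUP BY a.id, a.m
ORDER BY a.id, a.m
"""
rows = conn.execute(join_sql).fetchall()
print(rows)
```

If an outer row might have no matching inner rows, a LEFT JOIN would be needed to keep it, because an aggregate inside CROSS APPLY still returns one row in that case.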
I have a WHERE clause in which I need to check whether values exist in a table, and I'm doing that with a subquery. The problem is that the check has to depend on a column's value, 'FIX' or 'VAR'. Depending on the value, we need to check against a different table (subquery). To achieve that, I'm using a CASE WHEN expression in the WHERE clause, as shown below:
select *
FROM T1
where
(upper(trim(ITAXAVAR)) = 'S'
and
(
upper(trim(CTIPAMOR)) not in ('A','U','F')
)
)
and
--problem starts here.....
(case ucase(trim(CTIPTXFX)) --Values 'FIX';'VAR';'PUR'
WHEN 'FIX'
THEN
(concat(trim(CPRZTXFX),trim(CTAXAREF)) not in
(select trim(A.tayd91c0_celemtab)
from cd_estruturais.tat91_tabelas A
where A.tayd91c0_ctabela = 'W03' and
--data_date_part = '${Data_ref}' and --sometimes there is no TAT91 update for the same data_ref as the tables
A.data_date_part = (select max(B.data_date_part)
from cd_estruturais.tat91_tabelas B
where A.tayd91c0_ctabela = B.tayd91c0_ctabela and
B.data_date_part > date_add(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),-5)
)
and length(nvl(trim(A.tayd91c0_celemtab),'')) <> 0
)
)
WHEN 'VAR'
THEN
(concat(trim(CTAXAREF),trim(CPERRVTX)) not in
(select concat(trim(A.CTXREF),trim(A.CPERRVTX))
from land_estruturais.cat01_taxref A
where A.data_date_part > date_add(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),-5)
and length(nvl(concat(trim(A.CTXREF),trim(A.CPERRVTX)),'')) <> 0
)
)
END
)
;
Below is a simplified view of the same query:
select *
FROM T1
where
(--first criteria
)
and
--problem starts here.....
(case ucase(trim(CTIPTXFX)) --Values 'FIX';'VAR';'PUR'
WHEN 'FIX'
THEN
(field1 not in
(subquery 1)
)
WHEN 'VAR'
THEN
(field1 not in
(subquery 2)
)
END
)
;
Can anyone tell me what I'm doing wrong, please?
It seems to me that Impala does not support subqueries inside a CASE WHEN expression.
Thank you.
Impala doesn't support subqueries in the select list.
So you need to rewrite the SQL as below:
Use LEFT ANTI JOIN in place of NOT IN () to link the subqueries to T1.
To handle the CASE WHEN, use UNION ALL for the different conditions.
SELECT * FROM T1
LEFT ANTI JOIN subqry1 y ON T1.id = y.id
WHERE col='FIX'
UNION ALL
SELECT * FROM T1
LEFT ANTI JOIN subqry2 y ON T1.id = y.id
WHERE col='VAR'
I tried to adapt the simple SQL you posted above. The main SQL is too complex and needs table setup and data to prove the logic.
Here is my version of your simple SQL:
select * FROM T1
LEFT ANTI JOIN subquery1 ON subquery1.column = T1.field1
where (--first criteria )
and ucase(trim(CTIPTXFX))='FIX'
UNION ALL
select * FROM T1
LEFT ANTI JOIN subquery2 ON subquery2.column = T1.field1
where (--first criteria )
and ucase(trim(CTIPTXFX))='VAR'
Please note that anti joins and UNION ALL can be expensive, so if your tables are huge, tune them accordingly.
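The anti-join plus UNION ALL pattern can be tried out on toy data. The sketch below uses Python's built-in sqlite3 module, which has no LEFT ANTI JOIN, so the anti join is emulated with LEFT JOIN ... IS NULL; that substitution, the tables t1, sub1 and sub2, and their columns are all assumptions made for illustration. Note that an anti join ignores NULLs on the subquery side, whereas NOT IN returns no rows at all when the subquery yields a NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (id INTEGER, kind TEXT, field1 TEXT);
CREATE TABLE sub1 (val TEXT);  -- stands in for "subquery 1"
CREATE TABLE sub2 (val TEXT);  -- stands in for "subquery 2"
INSERT INTO t1 VALUES
  (1, 'FIX', 'a'), (2, 'FIX', 'b'),
  (3, 'VAR', 'a'), (4, 'VAR', 'c'),
  (5, 'PUR', 'a');
INSERT INTO sub1 VALUES ('a');
INSERT INTO sub2 VALUES ('c');
""")

# One branch per CASE arm; each branch keeps only rows with no match
# in its own lookup table (LEFT JOIN ... IS NULL emulates the anti join).
anti_join_sql = """
SELECT t.id AS id
FROM t1 t LEFT JOIN sub1 s ON s.val = t.field1
WHERE s.val IS NULL AND t.kind = 'FIX'
UNION ALL
SELECT t.id AS id
FROM t1 t LEFT JOIN sub2 s ON s.val = t.field1
WHERE s.val IS NULL AND t.kind = 'VAR'
ORDER BY id
"""
kept_ids = [row[0] for row in conn.execute(anti_join_sql)]
print(kept_ids)
```

Rows of kind 'PUR' fall into neither branch, which matches the original CASE expression returning NULL (and so filtering the row out) for values other than 'FIX' and 'VAR'.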
Below is my current code. I'm not sure what the best way is to amend this to give me the results I need.
SELECT
T1.SC,
T1.AN,
T1.DOFS_DATE,
T2.M_ID,
T3.OPDT,
T4.MARKER,
T5.E_DTE,
T5.E_TME,
T5.E_PST_DTE,
T5.E_AMT,
T5.E_NAR_O,
T5.E_NAR_T
FROM E_Base.AR_MyTable T1
LEFT JOIN E_Base.Translation T2
ON T1.SC = T2.SC
AND T1.AN = T2.AN
LEFT JOIN E_Base.BA T3
ON T2.M_ID = T3.M_ID
LEFT JOIN E_Base.APF T4
ON T3.M_ID = T4.M_ID
AND MARKER = 54
LEFT JOIN U_DB.TEH_201804 T5
ON T2.M_ID = T5.M_ID
AND T1.DOFS_DATE = T5.E_PST_DTE
QUALIFY ROW_NUMBER() OVER (PARTITION BY T2.M_ID ORDER BY T2.ID_END_DATE DESC, T3.E_END_DATE DESC) = 1
The above code works. However, it is the final left join on T5 where I need help.
In T1, each M_ID is assigned its own DOFS_DATE, which could be any date within the year, and I want the data from T5 (U_DB.TEH_201804) for the matching date. However, U_DB.TEH_201804 relates only to April 2018. There are 12 tables in the same database (201804, 201805, 201806, etc.) that all have exactly the same columns but relate to different months of the year.
Ideally, I want to left join the columns from T5 once but search all 12 tables in the database to bring back the data where the dates correspond.
I was thinking of UNION but am unsure how to work this in.
Any help would be greatly appreciated!
Thanks
You could replace the join to table T5 with a left join on a subquery that selects the UNION ALL of all the tables you need (I have named the subquery TT):
SELECT
T1.SC,
T1.AN,
T1.DOFS_DATE,
T2.M_ID,
T3.OPDT,
T4.MARKER,
TT.E_DTE,
TT.E_TME,
TT.E_PST_DTE,
TT.E_AMT,
TT.E_NAR_O,
TT.E_NAR_T
FROM E_Base.AR_MyTable T1
LEFT JOIN E_Base.Translation T2
ON T1.SC = T2.SC
AND T1.AN = T2.AN
LEFT JOIN E_Base.BA T3
ON T2.M_ID = T3.M_ID
LEFT JOIN E_Base.APF T4
ON T3.M_ID = T4.M_ID
AND MARKER = 54
LEFT JOIN (
select *
FROM U_DB.TEH_201804
UNION ALL
select *
FROM U_DB.TEH_201805
UNION ALL
select *
FROM U_DB.TEH_201806
UNION ALL
select *
FROM U_DB.TEH_201807
UNION ALL
.....
) TT ON T2.M_ID = TT.M_ID
AND T1.DOFS_DATE = TT.E_PST_DTE
QUALIFY ROW_NUMBER() OVER (PARTITION BY T2.M_ID ORDER BY T2.ID_END_DATE DESC, T3.E_END_DATE DESC) = 1
It's hard to tell without additional details like explain and QueryLog step data.
Based on @scaisEdge's answer:
You can try to move the first two joins into a Derived Table to apply the ROW_NUMBER early (possible because you do Outer Joins only):
SELECT
dt.*,
T4.MARKER,
TT.E_DTE,
TT.E_TME,
TT.E_PST_DTE,
TT.E_AMT,
TT.E_NAR_O,
TT.E_NAR_T
FROM
(
SELECT
T1.SC,
T1.AN,
T1.DOFS_DATE,
T2.M_ID,
T3.OPDT
FROM E_Base.AR_MyTable T1
LEFT JOIN E_Base.Translation T2
ON T1.SC = T2.SC
AND T1.AN = T2.AN
LEFT JOIN E_Base.BA T3
ON T2.M_ID = T3.M_ID
QUALIFY Row_Number()
Over (PARTITION BY T2.M_ID
ORDER BY T2.ID_END_DATE DESC, T3.E_END_DATE DESC) = 1
) AS dt
LEFT JOIN E_Base.APF T4
ON dt.M_ID = T4.M_ID
AND MARKER = 54
LEFT JOIN
(
SELECT *
FROM U_DB.TEH_201804
UNION ALL
SELECT *
FROM U_DB.TEH_201805
UNION ALL
SELECT *
FROM U_DB.TEH_201806
UNION ALL
SELECT *
FROM U_DB.TEH_201807
UNION ALL
.....
) TT
ON dt.M_ID = TT.M_ID
AND dt.DOFS_DATE = TT.E_PST_DTE
It might also help the optimizer to provide additional information about the data ranges. Those tables should have CHECK constraints to tell the optimizer that they contain only data from a single month; if the constraints don't exist, try adding a WHERE condition to each SELECT, e.g. WHERE E_PST_DTE BETWEEN DATE '2018-04-01' AND DATE '2018-04-30'.
Of course, always check Explain if the plan actually changes...
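Because the monthly tables differ only by their suffix, the UNION ALL derived table can be generated rather than typed out. The following is a rough sketch using Python's built-in sqlite3 module as a stand-in for the real database; the names main_t, teh_201804 and teh_201805, their columns and the two-month list are hypothetical and much simpler than the tables in the question.

```python
import sqlite3

MONTHS = ["201804", "201805"]  # hypothetical subset of the twelve monthly tables

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main_t (m_id INTEGER, dofs_date TEXT)")
conn.executemany(
    "INSERT INTO main_t VALUES (?, ?)",
    [(1, "2018-04-10"), (2, "2018-05-03"), (3, "2018-06-01")],
)
for month in MONTHS:
    conn.execute(
        f"CREATE TABLE teh_{month} (m_id INTEGER, e_pst_dte TEXT, e_amt REAL)"
    )
conn.execute("INSERT INTO teh_201804 VALUES (1, '2018-04-10', 100.0)")
conn.execute("INSERT INTO teh_201805 VALUES (2, '2018-05-03', 50.0)")

# Build one SELECT per monthly table and glue them together with UNION ALL.
union_sql = "\nUNION ALL\n".join(
    f"SELECT m_id, e_pst_dte, e_amt FROM teh_{month}" for month in MONTHS
)
query = f"""
SELECT t.m_id, tt.e_amt
FROM main_t t
LEFT JOIN ({union_sql}) tt
  ON tt.m_id = t.m_id AND tt.e_pst_dte = t.dofs_date
ORDER BY t.m_id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The row for m_id 3 comes back with a NULL amount because its date falls in a month that has no table in this toy list, which is the usual LEFT JOIN behaviour.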
I made this view in SQL Server to combine the values of two records across multiple columns. The problem with this solution is that you need a CONCAT for every column in table2. I would like to know whether it is possible to do the CONCAT part with a loop and a dynamic variable for the column numbers of table2 (its columns are called 1, 2, 3, 4, 5, ...).
SELECT
dbo.table1.lot_id AS lot,
dbo.table1.hybird_id AS hybrid,
concat(
LEFT( (SELECT dbo.table2.[1] FROM dbo.table2 WHERE dbo.table2.parentals_id = dbo.table1.parental_male_id AND dbo.table2.lot_id = dbo.table1.lot_id) , 1),
LEFT( (SELECT dbo.table2.[1] FROM dbo.table2 WHERE dbo.table2.parentals_id = dbo.table1.parental_female_id AND dbo.table2.lot_id = dbo.table1.lot_id) , 1)
) AS '1',
--above concat x31 times more
FROM dbo.table2
INNER JOIN dbo.table1 ON dbo.table2.lot_id = dbo.table1.lot_id
GROUP BY dbo.table1.lot_id, dbo.table1.hybird_id,
dbo.table1.parental_male_id,
dbo.table1.parental_female_id
I tried a few things but nothing worked, any ideas?
Try to simplify it a bit, something like this:
SELECT lot, hybrid, parental_male_id, parental_female_id,
concat(Left(m.[1],1), left(f.[1], 1)) AS [1]
--,..
FROM (
SELECT dbo.table1.lot_id AS lot
, dbo.table1.hybird_id AS hybrid
, dbo.table1.parental_male_id
, dbo.table1.parental_female_id
FROM dbo.table2
INNER JOIN dbo.table1 ON dbo.table2.lot_id = dbo.table1.lot_id
GROUP BY dbo.table1.lot_id, dbo.table1.hybird_id,
dbo.table1.parental_male_id,
dbo.table1.parental_female_id
) t
JOIN dbo.table2 m ON m.parentals_id = t.parental_male_id AND m.lot_id = t.lot
JOIN dbo.table2 f ON f.parentals_id = t.parental_female_id AND f.lot_id = t.lot
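On the "loop" part of the question: a view body is static SQL, so one option (an assumption here, not something either post prescribes) is to generate the repetitive select-list entries outside the database and paste them into the view definition. The sketch below does that in Python; the helper name build_concat_columns is hypothetical, and the aliases m and f match the simplified query above.

```python
def build_concat_columns(column_count):
    """Return T-SQL select-list entries for columns [1]..[column_count]."""
    parts = [
        f"CONCAT(LEFT(m.[{i}], 1), LEFT(f.[{i}], 1)) AS [{i}]"
        for i in range(1, column_count + 1)
    ]
    # One entry per line, comma-separated, ready to paste after the key columns.
    return ",\n    ".join(parts)


select_list = build_concat_columns(3)
print(select_list)
```

Calling build_concat_columns(32) would cover the first column plus the 31 repetitions mentioned in the question.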
Please have a look at the query below. I am getting an "invalid identifier" error for t1.oid in the inner query,
even though the column oid exists in iclr_request (t1).
select t1.requestNo
, t2.routeDistance
, (
select WM_CONCAT(crc7) as "TravCirc7s"
from (
select (
select crc7
from dim_afi_dnld_stn_v1
where stn_sys_nbr = t3.stn_sys_nbr
and rownum=1
) as crc7
from iclr_trav_circ7 t3
where request_oid = t1.oid -- invalid identifier reported here
and sub_route_index=0
and station_type_oid = 1
order by sequence
)
)
from iclr_request t1
, iclr_summary_results t2
where t1.oid = t2.request_oid
You can try this:
select t1.requestNo , t2.routeDistance,
WM_CONCAT((select crc7 from dim_afi_dnld_stn_v1 where stn_sys_nbr = t3.stn_sys_nbr and rownum=1)) as "TravCirc7s"
from iclr_request t1
join iclr_summary_results t2 on t1.oid = t2.request_oid
left join iclr_trav_circ7 t3 on t3.request_oid = t1.oid
and t3.sub_route_index=0
and t3.station_type_oid = 1
group by t1.requestNo , t2.routeDistance;
Correlated subqueries may refer to their parents only one level above (although some Oracle documentation says it's unlimited).
EDIT: This doesn't preserve the ORDER BY sequence in WM_CONCAT. You may need to wrap it in a parent query and then apply WM_CONCAT.
This query takes a long time to run on an MS SQL Server 2008 database with 70 GB of data.
If I run the two WHERE clauses separately, it takes a lot less time.
EDIT - I need to change the 'select *' to 'delete' afterwards, so please keep that in mind when answering. Thanks :)
select *
From computers
Where Name in
(
select T2.Name
from
(
select Name
from computers
group by Name
having COUNT(*) > 1
) T3
join computers T2 on T3.Name = T2.Name
left join policyassociations PA on T2.PK = PA.EntityId
where (T2.EncryptionStatus = 0 or T2.EncryptionStatus is NULL) and
(PA.EntityType <> 1 or PA.EntityType is NULL)
)
OR
ClientId in
(
select substring(ClientID,11,100)
from computers
)
Swapping IN for EXISTS will help.
Also, as per Gordon's answer: UNION can out-perform OR.
SELECT computers.*
FROM computers
LEFT
JOIN policyassociations
ON policyassociations.entityid = computers.pk
WHERE (
computers.encryptionstatus = 0
OR computers.encryptionstatus IS NULL
)
AND (
policyassociations.entitytype <> 1
OR policyassociations.entitytype IS NULL
)
AND EXISTS (
SELECT name
FROM (
SELECT name
FROM computers
GROUP
BY name
HAVING Count(*) > 1
) As duplicate_computers
WHERE name = computers.name
)
UNION
SELECT *
FROM computers As c
WHERE EXISTS (
SELECT SubString(clientid, 11, 100)
FROM computers
WHERE SubString(clientid, 11, 100) = c.clientid
)
You've now updated your question asking to make this a delete.
Well the good news is that instead of the "OR" you just make two DELETE statements:
DELETE computers
FROM computers
LEFT
JOIN policyassociations
ON policyassociations.entityid = computers.pk
WHERE (
computers.encryptionstatus = 0
OR computers.encryptionstatus IS NULL
)
AND (
policyassociations.entitytype <> 1
OR policyassociations.entitytype IS NULL
)
AND EXISTS (
SELECT name
FROM (
SELECT name
FROM computers
GROUP
BY name
HAVING Count(*) > 1
) As duplicate_computers
WHERE name = computers.name
)
;
DELETE c
FROM computers As c
WHERE EXISTS (
SELECT SubString(clientid, 11, 100)
FROM computers
WHERE SubString(clientid, 11, 100) = c.clientid
)
;
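One caveat worth checking before splitting the delete: the second DELETE runs against the table as the first one left it, so it can remove fewer rows than the single OR'd condition would. The sketch below illustrates this on toy data with Python's built-in sqlite3 module; the rows are invented, the encryption and policy filters are left out for brevity, and substr stands in for T-SQL's SubString. Staging the keys first is one possible way to keep the original semantics.

```python
import sqlite3

ROWS = [(1, "pc-a", "0123456789abc"), (2, "pc-a", "xyz"),
        (3, "pc-b", "abc"), (4, "pc-c", "zzz")]
DUP_NAME = "name IN (SELECT name FROM computers GROUP BY name HAVING COUNT(*) > 1)"
SUFFIX = "clientid IN (SELECT substr(clientid, 11, 100) FROM computers)"


def fresh():
    """Return a new in-memory database holding the toy computers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE computers (pk INTEGER PRIMARY KEY, name TEXT, clientid TEXT)"
    )
    conn.executemany("INSERT INTO computers VALUES (?, ?, ?)", ROWS)
    return conn


def pks(conn, sql):
    return [row[0] for row in conn.execute(sql)]


# Rows matched by the original single query with OR.
base = fresh()
matched = pks(base, f"SELECT pk FROM computers WHERE {DUP_NAME} OR {SUFFIX} ORDER BY pk")

# Two sequential DELETEs: the second no longer sees rows removed by the first.
split = fresh()
split.execute(f"DELETE FROM computers WHERE {DUP_NAME}")
split.execute(f"DELETE FROM computers WHERE {SUFFIX}")
left_after_split = pks(split, "SELECT pk FROM computers ORDER BY pk")

# Staging the keys first, then deleting by key, keeps the OR semantics.
staged = fresh()
staged.execute(
    f"CREATE TEMP TABLE doomed AS SELECT pk FROM computers WHERE {DUP_NAME} OR {SUFFIX}"
)
staged.execute("DELETE FROM computers WHERE pk IN (SELECT pk FROM doomed)")
left_after_staged = pks(staged, "SELECT pk FROM computers ORDER BY pk")

print(matched, left_after_split, left_after_staged)
```

Here row 3 survives the split version because the row whose ClientID suffix matched it was already deleted by the first statement.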
Some things I would look at are
1. are indexes in place?
2. 'IN' will slow your query, try replacing it with joins,
3. you should use column name, I guess 'Name' in this case, while using count(*),
4. try selecting required data only, by selecting particular columns.
Hope this helps!
OR can be poorly optimized sometimes. In this case, you can just split the query into two queries and combine them using UNION:
select *
From computers
Where Name in
(
select T2.Name
from
(
select Name
from computers
group by Name
having COUNT(*) > 1
) T3
join computers T2 on T3.Name = T2.Name
left join policyassociations PA on T2.PK = PA.EntityId
where (T2.EncryptionStatus = 0 or T2.EncryptionStatus is NULL) and
(PA.EntityType <> 1 or PA.EntityType is NULL)
)
UNION
select *
From computers
WHERE ClientId in
(
select substring(ClientID,11,100)
from computers
);
You might also be able to improve performance by replacing the subqueries with explicit joins. However, this seems like the shortest route to better performance.
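To check that the UNION form returns the same rows as the OR form, here is a small sketch on invented data using Python's built-in sqlite3 module (substr replaces T-SQL's substring, and the encryption and policy filters are omitted to keep it short). The equivalence relies on pk being unique, since UNION removes duplicate rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE computers (pk INTEGER PRIMARY KEY, name TEXT, clientid TEXT)"
)
conn.executemany(
    "INSERT INTO computers VALUES (?, ?, ?)",
    [(1, "pc-a", "0123456789abc"), (2, "pc-a", "xyz"),
     (3, "pc-b", "abc"), (4, "pc-c", "zzz")],
)

dup_name = "name IN (SELECT name FROM computers GROUP BY name HAVING COUNT(*) > 1)"
suffix = "clientid IN (SELECT substr(clientid, 11, 100) FROM computers)"

# Original shape: one scan with two IN conditions joined by OR.
or_rows = conn.execute(
    f"SELECT * FROM computers WHERE {dup_name} OR {suffix} ORDER BY pk"
).fetchall()

# Rewritten shape: one query per condition, combined with UNION.
union_rows = conn.execute(
    f"SELECT * FROM computers WHERE {dup_name} "
    f"UNION SELECT * FROM computers WHERE {suffix} ORDER BY pk"
).fetchall()

print(or_rows == union_rows, [row[0] for row in union_rows])
```

Whether the UNION form is actually faster depends on the engine and indexes, so comparing the execution plans on the real data is still needed.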
EDIT:
I think the version with joins is:
select c.*
From computers c left outer join
(select c.Name
from (select c.*, count(*) over (partition by Name) as cnt
from computers c
) c left join
policyassociations PA
on c.PK = PA.EntityId and PA.EntityType <> 1
where (c.EncryptionStatus = 0 or c.EncryptionStatus is NULL) and
c.cnt > 1
) cpa
on c.Name = cpa.Name left outer join
(select substring(ClientID, 11, 100) as name
from computers
) csub
on c.Name = csub.name
Where cpa.Name is not null or csub.Name is not null;