I have a huge table with 8,500+ partitions, and it is joined to a small table.
I have a problem with the Hive limit of 3k partitions per scan and with execution time.
I use it like this:
select
/* broadcast(a) */
/* streamtable(b) */
/* broadcast(c) */
/* broadcast(d) */
a.fields, b.field, c.field, d.field
from <small_table_1> a
left join (select * from <huge_table> where date_part > 20150100
union all
select * from <huge_table> where date_part between 20070100 and 20150100
union all
select * from <huge_table> where date_part < 20070100
) b on a.field = b.field
left join <small_table_2> on .....
left join <small_table_2> on .....
Is there a better way to optimize my query?
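For reference, here is a minimal sketch of one common restructuring, assuming plain Hive: hints only take effect in the /*+ ... */ form, the map-side-join hint there is MAPJOIN (broadcast is the Spark SQL name), and projecting only the columns the outer query needs from the huge table, instead of select *, cuts the data moved per scan. The aliases c and d and the single projected column are illustrative assumptions, not taken from the real schema:
select /*+ MAPJOIN(a, c, d) */
       a.fields, b.field, c.field, d.field
from <small_table_1> a
left join (
    -- keep the UNION ALL split so each branch stays under the partition limit,
    -- but read only the column(s) actually used outside
    select field from <huge_table> where date_part > 20150100
    union all
    select field from <huge_table> where date_part between 20070100 and 20150100
    union all
    select field from <huge_table> where date_part < 20070100
) b on a.field = b.field
left join <small_table_2> c on .....
left join <small_table_2> d on .....
With the small tables map-joined there is no reduce-side join left to stream, so the streamtable hint can usually be dropped; alternatively, set hive.auto.convert.join=true and let Hive choose the map joins itself.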
Related
I have two tables:
The first with client_id and shop_id: each client has several shop_ids that they have visited.
The second with all shop_ids.
I need to get a random shop_id that the client has visited from table_1 (it may be min(shop_id) from table_1),
and a random shop_id that the client has NOT visited from table_2.
It seems that a cross join could help:
proc sql;
select a.client_id, min(a.shop_id) as id_1, min(b.shop_id) as id_2
from table_1 a, table_2 b
where a.shop_id <> b.shop_id
group by 1
;quit;
But the problem is that the tables are huge, and this approach would take practically forever.
Can you help?
Here is a method using left join:
select min(cs.shop_id) as visited_shop_id,
min(case when cs.shop_id is null then a.shop_id end) as not_visited_shop_id
from all_shops a left join
client_shops cs
on cs.shop_id = a.shop_id and
cs.client = ?
Here's a method using the Except operator, to subtract the set of visited shops from the set of all client/shop pairs (assuming you also have a clients table). If you want to exclude clients who haven't visited any shops or who have visited all shops, simply change the two left joins to regular joins.
proc sql;
create table unvisited_shops_updated as
select c.client_id,
u1.first_unvisited_shop,
v1.first_visited_shop
from clients c
left join ( /* For each client, get the first shop_id they haven't visited */
select u.client_id,
MIN(u.shop_id) as first_unvisited_shop
from (
select c.client_id, /* Get list of all client/shop combinations */
s.shop_id
from clients c
cross join shops s
except /* Remove client/shop combinations that have been visited */
select v.client_id,
v.shop_id
from client_shop_visits v
) u
group by u.client_id
) u1
on u1.client_id = c.client_id
left join ( /* For each client, get the first shop_id they have visited */
select v.client_id,
MIN(v.shop_id) as first_visited_shop
from client_shop_visits v
group by v.client_id
) v1
on v1.client_id = c.client_id
order by c.client_id
;
quit;
Here's the performance I get on my PC using the below test script:
Original query execution time: 32.53 seconds CPU time
Updated query execution time: 0.10 seconds CPU time
Full test script is below.
%let shop_count = 1000;
%let client_count = 100;
%let visit_count = 50000;
data shops;
do shop_id = 1 to &shop_count;
output;
end;
run;
data clients;
do client_id = 1 to &client_count;
output;
end;
run;
data client_shop_visits;
do visit_id = 1 to &visit_count;
client_id = rand("Integer", 1, &client_count);
shop_id = rand("Integer", 1, &shop_count);
output;
end;
run;
proc sql;
create table unvisited_shops_original as
select a.client_id, min(a.shop_id) as id_1, min(b.shop_id) as id_2
from client_shop_visits a, shops b
where a.shop_id <> b.shop_id
group by 1
;
quit;
proc sql;
create table unvisited_shops_updated as
select c.client_id,
u1.first_unvisited_shop,
v1.first_visited_shop
from clients c
left join ( /* For each client, get the first shop_id they haven't visited */
select u.client_id,
MIN(u.shop_id) as first_unvisited_shop
from (
select c.client_id, /* Get list of all client/shop combinations */
s.shop_id
from clients c
cross join shops s
except /* Remove client/shop combinations that have been visited */
select v.client_id,
v.shop_id
from client_shop_visits v
) u
group by u.client_id
) u1
on u1.client_id = c.client_id
left join ( /* For each client, get the first shop_id they have visited */
select v.client_id,
MIN(v.shop_id) as first_visited_shop
from client_shop_visits v
group by v.client_id
) v1
on v1.client_id = c.client_id
order by c.client_id
;
quit;
Another option is to filter out the shops that clients didn't visit, then run
monotonic()
which will enumerate the shops that customers never visited. Then do the same for clients, and simply join the two.
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_FISH AS
SELECT DISTINCT t1.Species,
/* birds_monotonic */
(monotonic()) AS birds_monotonic
FROM SASHELP.FISH t1;
CREATE TABLE WORK.QUERY_FOR_CARS AS
SELECT DISTINCT t1.Make,
t1.Model,
t1.Type,
/* cars_monotonic */
(monotonic()) AS cars_monotonic
FROM SASHELP.CARS t1;
CREATE TABLE WORK.QUERY_FOR_FISH_0000 AS
SELECT DISTINCT t1.Species,
t1.birds_monotonic,
t2.Make,
t2.Model,
t2.Type,
t2.cars_monotonic
FROM WORK.QUERY_FOR_FISH t1
LEFT JOIN WORK.QUERY_FOR_CARS t2 ON (t1.birds_monotonic = t2.cars_monotonic);
QUIT;
I have a question on SQL design.
Context:
I have a table called t_master and 13 other tables (let's call them a, b, c, ... for simplicity) that it needs to be compared against.
Logic:
t_master will be compared to table 'a' where t_master.gen_val =
a.value.
If the record exists in t_master, retrieve the t_master record; otherwise retrieve the 'a' record.
I do not need to retrieve records that exist in both tables (t_master and a): an XOR condition.
Repeat this comparison with the remaining 12 tables.
I have some idea of how to do this: use WITH to subquery the non-master tables (a, b, c, ...) first, each with its respective WHERE clause,
then use an XOR-style condition to retrieve the records.
Something like
WITH a AS (SELECT ...),
b AS (SELECT ...)
SELECT field1,field2...
FROM t_master FULL OUTER JOIN a FULL OUTER JOIN b FULL OUTER JOIN c...
ON t_master.gen_value = a.value
WHERE ((field1 = x OR field2 = y ) AND NOT (field1 = x AND field2 = y))
AND ....
Seeing that I have 13 tables that I need to full outer join, is there a better way/design to handle this?
Otherwise I would have at least 2*13 lines of WHERE clause, and I'm not sure whether that will have an impact on performance, as t_master is essentially a log table.
**Assume I can't change any schema.
I'm not sure yet whether this SQL will even work correctly, so I'm hoping someone can guide me in the right direction.
Update based on used_by_already's suggestion:
This is what I'm trying to do (a comparison between 2 tables first, before I add more), but I am unable to get values from ATP_R.TBL_HI_HDR HI_HDR as it is inside the NOT EXISTS subquery.
How do I overcome this?
SELECT LOG_REPO.UNIQ_ID,
LOG_REPO.REQUEST_PAYLOAD,
LOG_REPO.GEN_VAL,
LOG_REPO.CREATED_BY,
TO_CHAR(LOG_REPO.CREATED_DT,'DD/MM/YYYY') AS CREATED_DT,
HI_HDR.HI_NO R_VALUE,
HI_HDR.CREATED_BY R_CREATED_BY,
TO_CHAR(HI_HDR.CREATED_DT,'DD/MM/YYYY') AS R_CREATED_DT
FROM ATP_COMMON.VW_CMN_LOG_GEN_REPO LOG_REPO JOIN ATP_R.TBL_HI_HDR HI_HDR ON LOG_REPO.GEN_VAL = HI_HDR.HI_NO
WHERE NOT EXISTS
(SELECT NULL
FROM ATP_R.TBL_HI_HDR HI_HDR
WHERE LOG_REPO.GEN_VAL = HI_HDR.HI_NO
)
UNION ALL
SELECT LOG_REPO.UNIQ_ID,
LOG_REPO.REQUEST_PAYLOAD,
LOG_REPO.GEN_VAL,
LOG_REPO.CREATED_BY,
TO_CHAR(LOG_REPO.CREATED_DT,'DD/MM/YYYY') AS CREATED_DT,
HI_HDR.HI_NO R_VALUE,
HI_HDR.CREATED_BY R_CREATED_BY,
TO_CHAR(HI_HDR.CREATED_DT,'DD/MM/YYYY') AS R_CREATED_DT
FROM ATP_R.TBL_HI_HDR HI_HDR JOIN ATP_COMMON.VW_CMN_LOG_GEN_REPO LOG_REPO ON HI_HDR.HI_NO = LOG_REPO.GEN_VAL
WHERE NOT EXISTS
(SELECT NULL
FROM ATP_COMMON.VW_CMN_LOG_GEN_REPO LOG_REPO
WHERE HI_HDR.HI_NO = LOG_REPO.GEN_VAL
)
Full outer joins used to exclude all matching rows can make for an expensive query. You don't supply much detail, but perhaps using NOT EXISTS would be simpler, and maybe it will produce a better explain plan. Something along these lines:
select
cola,colb,colc
from t_master m
where not exists (
select null from a where m.keycol = a.fk_to_m
)
and not exists (
select null from b where m.keycol = b.fk_to_m
)
and not exists (
select null from c where m.keycol = c.fk_to_m
)
union all
select
cola,colb,colc from a
where not exists (
select null from t_master m where a.fk_to_m = m.keycol
)
union all
select
cola,colb,colc from b
where not exists (
select null from t_master m where b.fk_to_m = m.keycol
)
union all
select
cola,colb,colc from c
where not exists (
select null from t_master m where c.fk_to_m = m.keycol
)
You could union the 13 tables a, b, c, ... to simplify the coding, but that may not perform so well.
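A minimal sketch of that unioned variant, under the same assumptions as above (cola, colb, colc, keycol, and fk_to_m are placeholder column names, and the 13 tables are assumed to be union-compatible):
select m.cola, m.colb, m.colc
from t_master m
where not exists (
    select null
    from (
        select fk_to_m from a
        union all
        select fk_to_m from b
        union all
        select fk_to_m from c
        -- ... repeat for the remaining tables
    ) x
    where m.keycol = x.fk_to_m
)
union all
select x.cola, x.colb, x.colc
from (
    select cola, colb, colc, fk_to_m from a
    union all
    select cola, colb, colc, fk_to_m from b
    union all
    select cola, colb, colc, fk_to_m from c
    -- ... repeat for the remaining tables
) x
where not exists (
    select null from t_master m where x.fk_to_m = m.keycol
)
The single unioned scan is shorter to write, but the optimizer loses the per-table statistics and access paths it had in the 13 separate NOT EXISTS branches, which is why it may not perform as well.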
I have scrapped my previous question as I did not do a good job explaining. Maybe this will be simpler.
I have the following query.
Select * from comp_eval_hdr, comp_eval_pi_xref, core_pi, comp_eval_dtl
where comp_eval_hdr.START_DATE between TO_DATE('01-JAN-16' , 'DD-MON-YY')
and TO_DATE('12-DEC-17' , 'DD-MON-YY')
and comp_eval_hdr.COMP_EVAL_ID = comp_eval_dtl.COMP_EVAL_ID
and comp_eval_hdr.COMP_EVAL_ID = comp_eval_pi_xref.COMP_EVAL_ID
and core_pi.PI_ID = comp_eval_pi_xref.PI_ID
and core_pi.PROGRAM_CODE = 'PS'
Now, if I only want a random 100 rows from the comp_eval_hdr table to join with the other tables, how would I go about it? If it makes it easier, you can disregard the comp_eval_dtl table.
I think you are pretty much there. You just need subqueries, table aliases, and JOIN conditions:
SELECT . . .
FROM (SELECT a.*
FROM (SELECT a.*
FROM a
WHERE a.START_DATE BETWEEN DATE '2016-01-01' AND DATE '2017-12-12'
ORDER BY DBMS_RANDOM.VALUE
) a
WHERE ROWNUM <= 100
) a JOIN
mapping m
ON a.? = m.? JOIN
b
ON m.? = b.?;
The ? is just a placeholder for the join columns.
It's a bit of a stretch to know what you want with the question as written but here's my attempt.
WITH rand_list AS
(SELECT * FROM comp_eval_hdr
WHERE comp_eval_hdr.START_DATE BETWEEN TO_DATE('01-JAN-16' , 'DD-MON-YY') AND TO_DATE('12-DEC-17' , 'DD-MON-YY')
ORDER BY DBMS_RANDOM.VALUE),
first_100 AS
(SELECT *
FROM rand_list
WHERE ROWNUM <=100)
SELECT md.col_1, t3.col_a
FROM first_100 md
INNER JOIN
table2 t2 ON md.id_column = t2.fk_comp_eval_hdr_id
INNER JOIN
table3 t3 ON t3.id_column = t2.fk_table3_id
You haven't given any indication of how the tables join, or their names, and obviously I haven't run this against any mock tables.
You've got a list of randomised records with RAND_LIST which you could, if you wanted, combine with the FIRST_100 query (your choice).
The main query then just joins that through your mapping table (T2) to your 'multiples' table (T3).
What does table 2 look like? Let me put together an example using a person table and an orders table:
select *
from ( select *
       from person ps, orders ord
       where ps.city = 'mumbai'
         and ps.id = ord.purchasedby ) porder
where rownum <= 100
I have not tested it, but it will look something like this.
I am writing a query that requires a self-join of a large table (> 1 million rows).
I'm only interested in the rows that were created today, which I can filter using a recording_time column that contains the epoch time.
However, I'm not certain that the query below actually limits the tables BEFORE doing the join.
SELECT B.ani
FROM [app].[dbo].[recordings] B
INNER JOIN [app].[dbo].[recordings] A
ON B.callid = A.callid AND B.dnis = A.ani
where A.filename LIKE '%680627.wav'
AND B.recording_time > 1485340000
Filter the rows that were created today into a derived table first, and join using that:
SELECT B.ani
FROM ( SELECT * FROM [app].[dbo].[recordings] where recording_time > 1485340000 ) B
INNER JOIN ( SELECT * FROM [app].[dbo].[recordings] where recording_time > 1485340000 ) A
ON B.callid = A.callid AND B.dnis = A.ani
where A.filename LIKE '%680627.wav'
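Equivalently, here is a minimal sketch using a CTE (assuming SQL Server, which the bracketed three-part names suggest), so the date filter is written only once:
WITH todays_recordings AS (
    SELECT *
    FROM [app].[dbo].[recordings]
    WHERE recording_time > 1485340000  -- rows created today (epoch seconds)
)
SELECT B.ani
FROM todays_recordings B
INNER JOIN todays_recordings A
    ON B.callid = A.callid AND B.dnis = A.ani
WHERE A.filename LIKE '%680627.wav';
Either way it is worth checking the execution plan: with a simple sargable predicate like recording_time > 1485340000, the optimizer will usually push the filter below the join on its own, so the rewrite is mainly about readability.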
I have a query like the following
select *
from (
select *
from callTableFunction(@paramPrev)
.....< a whole load of other joins, wheres , etc >........
) prevValues
full join
(
select *
from callTableFunction(@paramCurr)
.....< a whole load of other joins, wheres , etc >........
) currValues on prevValues.Field1 = currValues.Field1
....<other joins with the same subselect as the above two, with different parameters passed in>
where ........
group by ....
The following subselect is common to all the subselects in the query, apart from the @param passed to the table function.
select *
from callTableFunction(@param)
.....< a whole load of other joins, wheres , etc >........
One option is to convert this into a function and call that, but I don't like this, as I may be changing the
subselect query quite often. I am wondering if there is an alternative using a CTE,
like:
with sometable(@param1) as
(
select *
from callTableFunction(@param1)
.....< a whole load of other joins, wheres , etc >........
)
select ....
from sometable(@paramPrev) prevValues
full join sometable(@currPrev) currValues on prevValues.Field1 = currValues.Field1
where ........
group by ....
Is there any syntax or technique like this that I can use?
This is in SQL Server 2008 R2.
Thanks.
What you're trying to do is not supported syntax: CTEs cannot be parameterised in this way.
See Books Online: http://msdn.microsoft.com/en-us/library/ms175972.aspx.
(The values in brackets after a CTE name are an optional list of output column names.)
If there are only two parameter values (paramPrev and currPrev), you might be able to make the code a little easier to read by splitting them into two CTEs - something like this:
with prevCTE as (
select *
from callTableFunction(@paramPrev)
.....< a whole load of other joins, wheres , etc >........
)
,curCTE as (
select *
from callTableFunction(@currPrev)
.....< a whole load of other joins, wheres , etc >........
)
select ....
from prevCTE prevValues
full join curCTE currValues on
prevValues.Field1 = currValues.Field1
where ........
group by ....
You should be able to wrap the subqueries up as parameterized inline table-valued functions, and then use them with an OUTER JOIN:
CREATE FUNCTION wrapped_subquery(@param int) -- assuming it's an int type, change if necessary...
RETURNS TABLE
AS
RETURN
SELECT * FROM callTableFunction(@param)
.....< a whole load of other joins, wheres , etc ........
GO
SELECT *
FROM
wrapped_subquery(@paramPrev) prevValues
FULL OUTER JOIN wrapped_subquery(@currPrev) currValues ON prevValues.Field1 = currValues.Field1
WHERE ........
GROUP BY ....
After failing to assign scalar variables before the WITH, I finally got a working solution using a stored procedure and a temp table:
create proc hours_absent(@wid nvarchar(30), @start date, @end date)
as
with T1 as(
select c from t
),
T2 as(
select c from T1
)
select c from T2
order by 1, 2
OPTION(MAXRECURSION 365)
Calling the stored procedure:
if object_id('tempdb..#t') is not null drop table #t
create table #t([month] date, hours float)
insert into #t exec hours_absent '9001', '2014-01-01', '2015-01-01'
select * from #t
There may be some differences between my example and what you want, depending on how your subsequent ON clauses are formulated. Since you didn't specify, I assumed that all the subsequent joins were against the first table.
In my example I used literals rather than @prev/@current, but you can easily substitute variables in place of the literals to achieve what you want.
-- Standin function for your table function to create working example.
CREATE FUNCTION TestMe(
@parm int)
RETURNS TABLE
AS
RETURN
(SELECT @parm AS N, 'a' AS V UNION ALL
SELECT @parm + 1, 'b' UNION ALL
SELECT @parm + 2, 'c' UNION ALL
SELECT @parm + 2, 'd' UNION ALL
SELECT @parm + 3, 'e');
go
-- This calls TestMe first with 2 then 4 then 6... (what you don't want)
-- Compare these results with those below
SELECT t1.N AS AN, t1.V as AV,
t2.N AS BN, t2.V as BV,
t3.N AS CN, t3.V as CV
FROM TestMe(2)AS t1
FULL JOIN TestMe(4)AS t2 ON t1.N = t2.N
FULL JOIN TestMe(6)AS t3 ON t1.N = t3.N;
-- Put your @vars in place of 2,4,6 adding select statements as needed
WITH params
AS (SELECT 2 AS p UNION ALL
SELECT 4 AS p UNION ALL
SELECT 6 AS p)
-- This CTE encapsulates the call to TestMe (and any other joins)
,AllData
AS (SELECT *
FROM params AS p
OUTER APPLY TestMe(p.p)) -- See! only coded once
-- Add any other necessary joins here
-- Select needs to deal with all the columns with identical names
SELECT d1.N AS AN, d1.V as AV,
d2.N AS BN, d2.V as BV,
d3.N AS CN, d3.V as CV
-- d1 gets limited to values where p = 2 in the where clause below
FROM AllData AS d1
-- Outer joins require the ANDs to restrict row multiplication
FULL JOIN AllData AS d2 ON d1.N = d2.N
AND d1.p = 2 AND d2.p = 4
FULL JOIN AllData AS d3 ON d1.N = d3.N
AND d1.p = 2 AND d2.p = 4 AND d3.p = 6
-- Since AllData actually contains all the rows we must limit the results
WHERE(d1.p = 2 OR d1.p IS NULL)
AND (d2.p = 4 OR d2.p IS NULL)
AND (d3.p = 6 OR d3.p IS NULL);
What you want to do is akin to a pivot, and so the complexity of the needed query is similar to creating a pivot result without using the PIVOT statement.
Were you to use PIVOT, duplicate rows (such as the ones I included in this example) would be aggregated. This is also a solution for doing a pivot where aggregation is unwanted.
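For contrast, here is a sketch of what the PIVOT form could look like over the same params/AllData CTEs; note how the aggregate (MIN here) is forced onto the duplicate N values that TestMe produces (the 'c'/'d' rows), which is exactly the aggregation the query above avoids:
WITH params
AS (SELECT 2 AS p UNION ALL
    SELECT 4 UNION ALL
    SELECT 6)
,AllData
AS (SELECT *
    FROM params AS p
    OUTER APPLY TestMe(p.p))
SELECT N,
       [2] AS AV, [4] AS BV, [6] AS CV
FROM (SELECT p, N, V FROM AllData) AS src
PIVOT (MIN(V) FOR p IN ([2], [4], [6])) AS pvt
ORDER BY N;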