Convert from JOIN on ROWID in Netezza to Redshift

I'm converting ETL queries written for Netezza to Redshift. I'm having trouble with ROWID, because Redshift does not support it. As a workaround, I tried putting the key columns from which the ROWID is derived into the predicates, but I'm not sure which columns to use when there are multiple join operations. I also tried the ROW_NUMBER() OVER () function, but that doesn't work either, because the row ids won't be unique for all rows. Can anyone help me convert the query?
Here are the queries from Netezza:
Query #1
CREATE TEMP TABLE TMPRY_DELTA_UPD_1000 AS
SELECT
nvl(PT.HOST_CRRNCY_SRRGT_KEY,-1) as HOST_CRRNCY_SRRGT_KEY,
delta1.ROWID ROW_ID
FROM TMPRY_POS_TX_1000 PT
LEFT JOIN TMPRY_TX_CSTMR_1000 TC ON PT.TX_SRRGT_KEY = TC.TX_SRRGT_KEY AND PT.UPDT_TMSTMP > '2017-01-01'
AND PT.INS_TMSTMP < '2017-01-01' AND PT.DVSN_NBR = 70
JOIN INS_EDW_CP.DM_TX_LINE_FCT delta1 ON PT.TX_SRRGT_KEY = delta1.TX_SRRGT_KEY
WHERE
(
delta1.HOST_CRRNCY_SRRGT_KEY <> PT.HOST_CRRNCY_SRRGT_KEY OR
)
AND PT.DVSN_NBR = 70;
Query #2
UPDATE INS_EDW_CP..DM_TX_LINE_FCT base
SET
base.HOST_CRRNCY_SRRGT_KEY = delta1.HOST_CRRNCY_SRRGT_KEY,
)
FROM TMPRY_DELTA_UPD_1000 delta1
WHERE base.ROWID = delta1.ROW_ID;
How can I convert query #2?

Most of the time when I have seen joins on ROWID, they were there as a performance optimization, but in some cases there is no unique combination of columns in the table.
Please talk to the people who own this data, run your own analysis of different key combinations, and then get back to us.
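One way to run the key-combination analysis suggested above is to count how many combinations of a candidate key occur more than once. The sketch below uses Python's built-in sqlite3 module as a stand-in for Redshift, with a cut-down copy of DM_TX_LINE_FCT. The helper name duplicate_key_count and the TX_LINE_NBR column are hypothetical, made up for illustration only; the GROUP BY ... HAVING query itself is plain SQL and should carry over.

```python
import sqlite3


def duplicate_key_count(conn, table, key_cols):
    """Return how many key combinations occur more than once in `table`."""
    cols = ", ".join(key_cols)
    sql = (
        f"SELECT COUNT(*) FROM ("
        f"SELECT {cols} FROM {table} GROUP BY {cols} HAVING COUNT(*) > 1"
        f") AS dup"
    )
    return conn.execute(sql).fetchone()[0]


conn = sqlite3.connect(":memory:")
# Cut-down stand-in for DM_TX_LINE_FCT; tx_line_nbr is a hypothetical column.
conn.execute(
    "CREATE TABLE dm_tx_line_fct ("
    "tx_srrgt_key INTEGER, tx_line_nbr INTEGER, host_crrncy_srrgt_key INTEGER)"
)
conn.executemany(
    "INSERT INTO dm_tx_line_fct VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 2, 10), (2, 1, 20)],
)

# tx_srrgt_key alone repeats, so it cannot replace ROWID by itself.
single = duplicate_key_count(conn, "dm_tx_line_fct", ["tx_srrgt_key"])
# Together with the (hypothetical) line number, every row is unique.
combined = duplicate_key_count(
    conn, "dm_tx_line_fct", ["tx_srrgt_key", "tx_line_nbr"]
)
print(single, combined)
```

If some combination comes back with zero duplicates, those columns could be selected into the temp table in place of delta1.ROWID and compared in the WHERE clause of the UPDATE.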

Related

Tuning Oracle Query for slow select

I'm working on an Oracle query that does a select on a huge table; however, the joins with other tables seem to cost a lot of processing time.
I'm looking for tips on how to improve the performance of this query.
I'm attaching a version of the query and its explain plan.
Query
SELECT
l.gl_date,
l.REST_OF_TABLES,
(
SELECT
MAX(tt.task_id)
FROM
bbb.jeg_pa_tasks tt
WHERE
l.project_id = tt.project_id
AND l.task_number = tt.task_number
) task_id
FROM
aaa.jeg_labor_history l,
bbb.jeg_pa_projects_all p
WHERE
p.org_id = 2165
AND l.project_id = p.project_id
AND p.project_status_code = '1000'
Something to mention:
This query takes data from Oracle and sends it to a SQL Server database, so I need it to be this big; I can't narrow the scope of the query.
The purpose is to run it periodically as a SQL Server job with SSIS.

One obvious suggestion is not to use a subquery in the SELECT clause.
Instead, you can try joining the tables.
SELECT
l.gl_date,
l.REST_OF_TABLES,
t.task_id
FROM
aaa.jeg_labor_history l
Join bbb.jeg_pa_projects_all p
On (l.project_id = p.project_id)
Left join (SELECT
tt.project_id,
tt.task_number,
MAX(tt.task_id) task_id
FROM
bbb.jeg_pa_tasks tt
Group by tt.project_id, tt.task_number) t
On (l.project_id = t.project_id
AND l.task_number = t.task_number)
WHERE
p.org_id = 2165
AND p.project_status_code = '1000';
Cheers!!

I don't know exactly how many rows this query returns or how many rows the table/view has, but here are a few simple tips that might help query performance:
Check Indexes. There should be indexes on all fields used in the WHERE and JOIN portions of the SQL statement.
Limit the size of your working data set.
Only select columns you need.
Remove unnecessary tables.
Remove calculated columns in JOIN and WHERE clauses.
Use inner join, instead of outer join if possible.
Your view contains a lot of data, so you can also break it down and limit the query to only the information you need from it.

SQL Query Performance Issues Using Subquery

I am having issues with my query run time. I want the query to automatically pull the max id for a column, because the table is indexed on that column. If I punch in the number manually, it runs in seconds, but I want the query to be more dynamic if possible.
I've tried placing the subquery in different places, with no luck.
SELECT *
FROM TABLE A
JOIN TABLE B
ON A.SLD_MENU_ITM_ID = B.SLD_MENU_ITM_ID
AND B.ACTV_FLG = 1
WHERE A.WK_END_THU_ID_NU >= (SELECT DISTINCT MAX (WK_END_THU_ID_NU) FROM TABLE A)
AND A.WK_END_THU_END_YR_NU = YEAR(GETDATE())
AND A.LGCY_NATL_STR_NU IN (7731)
AND B.SLD_MENU_ITM_ID = 4314
I just want this to run faster. Maybe there is a different approach I should be taking?

I would move the subquery to the FROM clause and change the WHERE clause to only refer to A:
SELECT *
FROM A JOIN
(SELECT MAX(WK_END_THU_ID_NU) as max_wet
FROM A
) am
ON a.WK_END_THU_ID_NU = max_wet JOIN
B
ON A.SLD_MENU_ITM_ID = B.SLD_MENU_ITM_ID AND
B.ACTV_FLG = 1
WHERE A.WK_END_THU_END_YR_NU = YEAR(GETDATE()) AND
A.LGCY_NATL_STR_NU IN (7731) AND
A.SLD_MENU_ITM_ID = 4314; -- is the same as B
Then you want indexes. I'm pretty sure you want indexes on:
A(SLD_MENU_ITM_ID, WK_END_THU_END_YR_NU, LGCY_NATL_STR_NU, WK_END_THU_ID_NU)
B(SLD_MENU_ITM_ID, ACTV_FLG)
I will note that moving the subquery to the FROM clause probably does not affect performance, because SQL Server is smart enough to only execute it once. However, I prefer table references in the FROM clause when reasonable. I don't think a window function would actually help in this case.

Alternate solution for the query - Used INTERSECT function in oracle plsql

I am working on a query. I have two tables: a detail table, where no grouping happens and all the values are included, and a line table, which holds the important columns from the detail table grouped together.
I want to show all the columns from the line table and some columns from the detail table.
I am using the query below to fetch my records:
SELECT ab.*,
cd.phone_number,
cd.id
FROM xxx_line ab,
xxx_detail cd
WHERE cd.reference_number = ab.reference_number
AND cd.org_id = ab.org_id
AND cd.request_id = ab.request_id
AND ab.request_id = 13414224
INTERSECT
SELECT ab.*,
cd.phone_number,
cd.id
FROM xxx_line ab,
xxx_detail cd
WHERE cd.reference_number = ab.reference_number
AND cd.org_id = ab.org_id
AND cd.request_id = ab.request_id
AND ab.request_id = 13414224
The query works fine...
But I want to know whether there is any other way to get the same result without using INTERSECT.
My purpose is to find out all the possible ways to get the same output.

The INTERSECT operator returns the unique set of rows returned by both queries. Since the two queries here are identical, the code can be rewritten with DISTINCT to make the meaning clearer:
SELECT DISTINCT
xxx_line.*,
xxx_detail.phone_number,
xxx_detail.id
FROM xxx_line
JOIN xxx_detail
ON xxx_line.reference_number = xxx_detail.reference_number
AND xxx_line.org_id = xxx_detail.org_id
AND xxx_line.request_id = xxx_detail.request_id
WHERE xxx_line.request_id = 13414224
I also replaced the old-fashioned join syntax with the newer ANSI join syntax (which makes relationships clearer by forcing the join tables and conditions to be listed close to each other) and removed the meaningless table aliases (because code complexity is more directly related to the number of variables than the number of characters).
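To check the claim that intersecting a query with itself behaves like SELECT DISTINCT, here is a small sketch that uses Python's built-in sqlite3 module in place of Oracle. The table layouts are cut down to the columns used in the question, and the sample rows (including the deliberately duplicated line row) are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE xxx_line (reference_number INTEGER, org_id INTEGER,
                           request_id INTEGER);
    CREATE TABLE xxx_detail (id INTEGER, reference_number INTEGER,
                             org_id INTEGER, request_id INTEGER,
                             phone_number TEXT);
    -- The same line row twice, so the plain join yields duplicates.
    INSERT INTO xxx_line VALUES (1, 10, 13414224), (1, 10, 13414224);
    INSERT INTO xxx_detail VALUES (100, 1, 10, 13414224, '555-0100');
""")

JOINED = """
    SELECT ab.*, cd.phone_number, cd.id
    FROM xxx_line ab
    JOIN xxx_detail cd
      ON cd.reference_number = ab.reference_number
     AND cd.org_id = ab.org_id
     AND cd.request_id = ab.request_id
    WHERE ab.request_id = 13414224
"""

# Plain join: duplicates are kept.
plain = conn.execute(JOINED).fetchall()
# The question's form: the query intersected with itself.
intersected = conn.execute(JOINED + " INTERSECT " + JOINED).fetchall()
# The answer's form: the same query with DISTINCT.
distinct = conn.execute(
    JOINED.replace("SELECT", "SELECT DISTINCT", 1)
).fetchall()

print(len(plain), len(intersected), len(distinct))
```

On this sample data the plain join keeps both copies of the row, while the INTERSECT and DISTINCT forms each return it once.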

Oracle : Indexes not being used

I have a query which is not using my indexes. Can someone say why?
explain plan set statement_id = 'bad8' for
select
g1.g1id,a.a1id from atable a,
(
select
phone,address,g1id from gtable g
where
g.active = 0 and
(g.name is not null) AND
(SYSDATE - g.CTIME <= 2*365)
) g1
where
(
(a.phone.ph1 = g1.phone.ph1 and
a.phone.ph2 = g1.phone.ph2 and
a.phone.ph3 = g1.phone.ph3
)
OR
(a.address.ad1 = g1.address.ad1 and a.address.ad2 = g1.address.ad2)
)
In both tables (atable and gtable) I have these indexes:
1. On phone.ph1,phone.ph2,phone.ph3
2. On address.ad1,address.ad2
phone,address are of custom data types.
Using Oracle 11g.
Here is the explain plan query and output :
SELECT cardinality "Rows",
lpad(' ',level-1)||operation||' '||
options||' '||object_name "Plan"
FROM PLAN_TABLE
CONNECT BY prior id = parent_id
AND prior statement_id = statement_id
START WITH id = 0
AND statement_id = 'bad8'
ORDER BY id;
Result:
      Rows  Plan
 490191190  SELECT STATEMENT
      null   CONCATENATION
 490190502    HASH JOIN
    511841     TABLE ACCESS FULL gtable
  41332965     PARTITION LIST ALL
  41332965      TABLE ACCESS FULL atable
       688    HASH JOIN
    376893     TABLE ACCESS FULL gtable
  41332965     PARTITION LIST ALL
  41332965      TABLE ACCESS FULL atable
Both atable,gtable have more than 10 million rows each.
Most values in columns phone and address don't repeat.

Which indices Oracle chooses depends on many factors, including the number of rows in the table, the frequency of values within a column, and whether you have separate or combined indices when more than one column is indexed.
Having said that, I suppose the main reasons your indices aren't used are:
You don't join directly with GTABLE. Instead you join with an inline view that has three additional WHERE conditions that aren't part of the index, which makes the index less effective here.
The JOIN condition includes an OR, which makes it difficult to use indices.
Update:
If Oracle used your indices to do the join (which is already very difficult due to the OR condition), it would end up with a huge number of ROWIDs. For each ROWID, it would then have to fetch the full row. Since a full table scan can easily be up to 50 times faster than a fetch by ROWID (I don't know what value Oracle uses), it will only use the indices if it reliably knows that the join will reduce the number of rows to fetch by a factor of 50.
In your case, there are the remaining WHERE conditions (g.active = 0, g.name is not null, SYSDATE - g.CTIME <= 2*365), which aren't represented in the indices. So they have to be applied after the join and after the GTABLE rows have been fetched. This makes it even more difficult to reach a result set 50 times smaller than a full table scan.
So I'm pretty sure the Oracle cost estimate is correct, i.e. using the indices would result in a more expensive query and an even longer execution time.

We can say your query does not use your indexes because it does not need them. A hash join is better. To use your indexes, Oracle would need to full-scan them (4 indexes), make two joins, combine the ROWIDs with an OR, and after that probably read many blocks from the tables. If it believes that the result has many rows, the CBO chooses the full scans, because they are faster.
There are no conditions that reduce the number of rows taken from the tables. There is no range scan. It must do full scans.
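The CONCATENATION step in the plan above is how Oracle reports that it split the OR predicate into two separate join branches. The sketch below illustrates the same idea by hand: a join on "phone OR address" gives the same id pairs as the UNION of two plain equi-joins. It uses Python's built-in sqlite3 module rather than Oracle, and it flattens the custom phone and address types into single text columns; both simplifications, and all sample rows, are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE atable (a1id INTEGER PRIMARY KEY, phone TEXT, address TEXT);
    CREATE TABLE gtable (g1id INTEGER PRIMARY KEY, phone TEXT, address TEXT);
    INSERT INTO atable VALUES
        (1, '555-0100', 'A St'), (2, '555-0101', 'B St'), (3, '555-0102', 'C St');
    INSERT INTO gtable VALUES
        (10, '555-0100', 'Z St'),  -- matches a1id 1 on phone only
        (20, '555-9999', 'B St'),  -- matches a1id 2 on address only
        (30, '555-0102', 'C St');  -- matches a1id 3 on both
""")

# One join with an OR in the join condition, as in the question.
or_join = conn.execute("""
    SELECT g.g1id, a.a1id
    FROM atable a
    JOIN gtable g ON a.phone = g.phone OR a.address = g.address
""").fetchall()

# Two equi-joins combined with UNION; UNION removes the pair that
# matches on both phone and address.
union_join = conn.execute("""
    SELECT g.g1id, a.a1id FROM atable a JOIN gtable g ON a.phone = g.phone
    UNION
    SELECT g.g1id, a.a1id FROM atable a JOIN gtable g ON a.address = g.address
""").fetchall()

print(sorted(or_join) == sorted(union_join))
```

Each branch of the UNION has a single equality condition, which is the shape that hash joins and index lookups handle well. Note that UNION removes duplicate pairs, so the two forms only agree when the id columns are unique, as they are here.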

Slow SQL query, not sure how to optimize

So I have to deal with a database that has no indexes (not my design, it frustrates the hell out of me). I'm running a query that takes approximately three seconds to return, and I need it to be faster.
Here are the relevant tables and columns:
gs_pass_data          au_entry            ground_station
-gs_pass_data_id      -au_id              -ground_station_id
-start_time           -gs_pass_data_id    -ground_station_name
-end_time             -comments
-ground_station_id
And my query is:
SELECT DISTINCT gs_pass_data_id,start_time,end_time,
ground_station_name FROM gs_pass_data
JOIN ground_station
ON gs_pass_data.ground_station_id =
ground_station.ground_station_id
JOIN au_entry ON au_entry.gs_pass_data_id =
gs_pass_data.gs_pass_data_id
WHERE (start_time BETWEEN @prevTime AND @nextTime)
AND comments = 'AU is identified.'
ORDER BY start_time
I've tried using EXISTS instead of DISTINCT with no improvements. I've read everything I can about SQL optimization but I cannot seem to get this query down to a reasonable time (reasonable being < 0.5 seconds). Any ideas would be greatly appreciated.

Without indexes, you're hosed. The DB engine will have to do full table scans, each time, every time. Fiddling with the queries is just rearranging deck chairs on the Titanic. Fix the DB now, before it gets even worse as data piles up.
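To make the point concrete, the sketch below asks a database for its query plan before and after an index is added on start_time. It uses Python's built-in sqlite3 module as a stand-in, so the plan text is SQLite's and not what the question's database would print; the index name and the date literals are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gs_pass_data ("
    "gs_pass_data_id INTEGER, start_time TEXT, end_time TEXT, "
    "ground_station_id INTEGER)"
)

QUERY = "SELECT * FROM gs_pass_data WHERE start_time BETWEEN ? AND ?"
ARGS = ("2020-01-01", "2020-01-02")


def plan(conn):
    """Return SQLite's plan description for QUERY as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, ARGS).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


before = plan(conn)  # no index yet: the whole table is scanned
conn.execute(
    "CREATE INDEX ix_gs_pass_data_start_time ON gs_pass_data (start_time)"
)
after = plan(conn)   # the range filter can now use the index

print(before)
print(after)
```

The first plan line reports a scan of the whole table, and the second reports a search using the new index. The exact wording depends on the SQLite version.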

The query can also be written without the DISTINCT and with a GROUP BY instead. It'll probably make no difference at all, though. My standard advice is the same as everyone else's: add indexes and drop the ORDER BY, so +1 to @Marc B.
SELECT gs_pass_data_id,start_time,end_time,ground_station_name
FROM gs_pass_data
JOIN ground_station
ON gs_pass_data.ground_station_id = ground_station.ground_station_id
JOIN au_entry
ON au_entry.gs_pass_data_id = gs_pass_data.gs_pass_data_id
WHERE (start_time BETWEEN @prevTime AND @nextTime)
AND comments = 'AU is identified.'
GROUP BY gs_pass_data_id,start_time,end_time,ground_station_name
ORDER BY start_time

Since you can't create indexes on the tables... do you have the authority to create indexed views?
SQL 2005 - http://technet.microsoft.com/en-us/library/cc917715.aspx
SQL 2008 - http://msdn.microsoft.com/en-us/library/dd171921(v=sql.100).aspx
That would give you the benefit of indexes, but not alter the original tables...

You can try the following. I don't know what else you can do, or whether this will make it any faster at all :/
SELECT DISTINCT t.gs_pass_data_id,start_time,end_time,ground_station_name
FROM
(
-- My idea is to make this first table as small as possible first, which will then make the joins quicker (TM)
SELECT *
FROM gs_pass_data
WHERE (start_time BETWEEN @prevTime AND @nextTime)
) t
INNER JOIN ground_station ON t.ground_station_id = ground_station.ground_station_id
INNER JOIN
(
-- Same as above
SELECT *
FROM au_entry
WHERE comments = N'AU is identified.' -- Make sure comments is the same type as the text string. You said nvarchar, so make the string you're searching by nvarchar
) t2 ON t2.gs_pass_data_id = t.gs_pass_data_id
ORDER BY start_time
-- OR TRY THIS
SELECT DISTINCT t.gs_pass_data_id,start_time,end_time,ground_station_name
FROM
(
-- My idea is to make this first table as small as possible first, which will then make the joins quicker (TM)
SELECT *
FROM gs_pass_data
WHERE (start_time BETWEEN @prevTime AND @nextTime)
) t
INNER JOIN ground_station ON t.ground_station_id = ground_station.ground_station_id
INNER JOIN au_entry ON au_entry.gs_pass_data_id = t.gs_pass_data_id
WHERE comments = N'AU is identified.' -- Make sure comments is the same type as the text string. You said nvarchar, so make the string you're searching by nvarchar
ORDER BY start_time
