I have many Hive scripts (around 20-25), each containing multiple queries. I want to run these scripts using Spark so the process runs faster; since MapReduce jobs in Hive take a long time, executing the queries from Spark should be much faster. Below is the code I have written. It works for 3-4 files, but when given multiple files with multiple queries it fails.
Please help me optimize it if possible.
val spark = SparkSession.builder.master("yarn").appName("my app").enableHiveSupport().getOrCreate()
val filename = new java.io.File("/mapr/tmp/validation_script/").listFiles.filter(_.getName.endsWith(".hql")).toList
for (i <- 0 to filename.length - 1) {
  val filename1 = filename(i)
  scala.io.Source.fromFile(filename1).getLines()
    .filterNot(_.isEmpty)               // filter out empty lines
    .foreach(query => spark.sql(query))
}
Some of the errors I am getting look like this:
ERROR SparkSubmit: Job aborted.
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 12 (sql at validationtest.scala:67) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 byte(s) of direct memory (used: 1023410176, max: 1029177344) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528)
I get many different types of errors when I run the same code multiple times.
Below is what one of the HQL files looks like. Its name is xyz.hql and it contains:
drop table pontis_analyst.daydiff_log_sms_distribution
create table pontis_analyst.daydiff_log_sms_distribution as select round(datediff(date_sub(current_date(),cast(date_format(CURRENT_DATE ,'u') as int) ),cast(subscriberActivationDate as date))/7,4) as daydiff,subscriberkey as key from pontis_analytics.prepaidsubscriptionauditlog
drop table pontis_analyst.weekly_sms_usage_distribution
create table pontis_analyst.weekly_sms_usage_distribution as select sum(event_count_ge) as eventsum,subscriber_key from pontis_analytics.factadhprepaidsubscriptionsmsevent where effective_date_ge_prt < date_sub(current_date(),cast(date_format(CURRENT_DATE ,'u') as int) - 1 ) and effective_date_ge_prt >= date_sub(date_sub(current_date(),cast(date_format(CURRENT_DATE ,'u') as int) ),84) group by subscriber_key;
drop table pontis_analyst.daydiff_sms_distribution
create table pontis_analyst.daydiff_sms_distribution as select day.daydiff,sms.subscriber_key,sms.eventsum from pontis_analyst.daydiff_log_sms_distribution day inner join pontis_analyst.weekly_sms_usage_distribution sms on day.key=sms.subscriber_key
drop table pontis_analyst.weekly_sms_usage_final_distribution
create table pontis_analyst.weekly_sms_usage_final_distribution as select spp.subscriberkey as key, case when spp.tenure < 3 then round((lb.eventsum )/dayDiff,4) when spp.tenure >= 3 then round(lb.eventsum/12,4)end as result from pontis_analyst.daydiff_sms_distribution lb inner join pontis_analytics.prepaidsubscriptionsubscriberprofilepanel spp on spp.subscriberkey = lb.subscriber_key
INSERT INTO TABLE pontis_analyst.validatedfinalResult select 'prepaidsubscriptionsubscriberprofilepanel' as fileName, 'average_weekly_sms_last_12_weeks' as attributeName, tbl1_1.isEqual as isEqual, tbl1_1.isEqualCount as isEqualCount, tbl1_2.countAll as countAll, (tbl1_1.isEqualCount/tbl1_2.countAll)* 100 as percentage from (select tbl1_0.isEqual as isEqual, count(isEqual) as isEqualCount from (select case when round(aal.result) = round(srctbl.average_weekly_sms_last_12_weeks) then 1 when aal.result is null then 1 when aal.result = 'NULL' and srctbl.average_weekly_sms_last_12_weeks = '' then 1 when aal.result = '' and srctbl.average_weekly_sms_last_12_weeks = '' then 1 when aal.result is null and srctbl.average_weekly_sms_last_12_weeks = '' then 1 when aal.result is null and srctbl.average_weekly_sms_last_12_weeks is null then 1 else 0 end as isEqual from pontis_analytics.prepaidsubscriptionsubscriberprofilepanel srctbl left join pontis_analyst.weekly_sms_usage_final_distribution aal on srctbl.subscriberkey = aal.key) tbl1_0 group by tbl1_0.isEqual) tbl1_1 inner join (select count(*) as countAll from pontis_analytics.prepaidsubscriptionsubscriberprofilepanel) tbl1_2 on 1=1
Your issue is that your code is running out of memory, as shown below:
failed to allocate 16777216 byte(s) of direct memory (used: 1023410176, max: 1029177344)
What you are trying to do is not the optimal way of doing things in Spark, but I would recommend that you remove the in-memory serialization (caching), as it will not help in any way. You should cache data only if it is going to be used in multiple transformations; if it is used only once, there is no reason to put the data in the cache.
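As a rough Spark SQL illustration (it reuses the table and column names from your first script where possible; the derived names tmp_weekly_usage, usage_above_threshold and usage_below_threshold are made up), caching only pays off when more than one later statement reads the same intermediate result, and the cache should be released afterwards:
CACHE TABLE tmp_weekly_usage AS
SELECT subscriber_key, SUM(event_count_ge) AS eventsum
FROM pontis_analytics.factadhprepaidsubscriptionsmsevent
GROUP BY subscriber_key;
-- two later statements reuse the cached result, so caching is worthwhile here
CREATE TABLE pontis_analyst.usage_above_threshold AS
SELECT * FROM tmp_weekly_usage WHERE eventsum > 100;
CREATE TABLE pontis_analyst.usage_below_threshold AS
SELECT * FROM tmp_weekly_usage WHERE eventsum <= 100;
UNCACHE TABLE tmp_weekly_usage;
-- if a result is read only once (as in your scripts), skip the CACHE step entirely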
We are facing a performance issue while executing a stored procedure. It usually takes between 10 and 15 minutes to run, but sometimes it takes more than 30 minutes to execute.
We captured visualized-plan execution files for the normal-run and long-run cases.
By checking the visualized plan we found that one particular INSERT block takes extra time in the long run. And by checking
"EXPLAIN PLAN FOR SQL PLAN CACHE ENTRY <plan_id>"
we found that the order of execution differs in the long run.
This is the block that sometimes takes extra time to run:
INSERT INTO #TMP_DATI_SALDI_LORDI_BASE (
"COD_SCENARIO","COD_PERIODO","COD_CONTO","COD_DEST1","COD_DEST2","COD_DEST3","COD_DEST4","COD_DEST5"
,"IMPORTO","COD_VALUTA","IMPORTO_VALUTA_ORIGINARIA","COD_VALUTA_ORIGINARIA","NOTE"
)
( SELECT
SCEN_P.SCENARIO
,SCEN_P.PERIOD
,ACCOUT_ADJ.ATTRIBUTO1 AS "COD_CONTO"
,DATAS_rev.COD_DEST1
,DATAS_rev.COD_DEST2
,DATAS_rev.COD_DEST3
,__typed_NString__($1, 50)
,'RPT_NON'
,SUM(
CASE WHEN INFO.INCOT = 'FOB' THEN
CASE ACCOUT_rev.ATTRIBUTO1 WHEN 'CalcInsurance' THEN
0
ELSE
DATAS_rev.IMPORTO
END
ELSE
DATAS_rev.IMPORTO
END
* (DATAS_ADJ.IMPORTO - DATAS.IMPORTO)
)
,DATAS_rev.COD_VALUTA
,SUM(
CASE WHEN INFO.INCOT = 'FOB' THEN
CASE ACCOUT_rev.ATTRIBUTO1 WHEN 'CalcInsurance' THEN
0
ELSE
DATAS_rev.IMPORTO_VALUTA_ORIGINARIA
END
ELSE
DATAS_rev.IMPORTO_VALUTA_ORIGINARIA
END
* (DATAS_ADJ.IMPORTO_VALUTA_ORIGINARIA - DATAS.IMPORTO_VALUTA_ORIGINARIA)
)
,DATAS_rev.COD_VALUTA_ORIGINARIA
,'CPM_SP_CACL_FY_E3 Parts Option ADJ'
FROM #TMP_TAGERT_SCEN_P SCEN_P
INNER JOIN #TMP_DATI_SALDI_LORDI_BASE DATAS_rev
ON DATAS_rev.COD_SCENARIO = SCEN_P.SCENARIO
AND DATAS_rev.COD_PERIODO = SCEN_P.PERIOD
AND LEFT(DATAS_rev.COD_DEST3, 1) = 'O'
INNER JOIN CONTO ACCOUT_rev
ON ACCOUT_rev.COD_CONTO = DATAS_rev.COD_CONTO
AND ACCOUT_rev.ATTRIBUTO1 IN ('CalcFOB','CalcInsurance') --FOB,Insurance(Ocean freight is Nothing by Option)
INNER JOIN #DSL DATAS
ON DATAS.COD_SCENARIO = 'LAUNCH'
AND DATAS.COD_PERIODO = 12
AND DATAS.COD_DEST1 = 'NC'
AND DATAS.COD_DEST2 = 'NC'
AND DATAS.COD_DEST3 = 'F001'
AND DATAS.COD_DEST4 = DATAS_rev.COD_DEST4
AND DATAS.COD_DEST5 = 'INP'
INNER JOIN CONTO ACCOUT
ON ACCOUT.COD_CONTO = DATAS.COD_CONTO
AND ACCOUT.ATTRIBUTO2 = 'E3'
INNER JOIN CONTO ACCOUT_ADJ
ON ACCOUT_ADJ.ATTRIBUTO3 = DATAS.COD_CONTO
AND ACCOUT_ADJ.ATTRIBUTO2 = 'HE3'
INNER JOIN #DSL DATAS_ADJ
ON LEFT(DATAS_ADJ.COD_SCENARIO,4) = LEFT(SCEN_P.SCENARIO,4)
AND DATAS_ADJ.COD_PERIODO = 12
AND DATAS_ADJ.COD_DEST1 = DATAS.COD_DEST1
AND DATAS_ADJ.COD_DEST2 = DATAS.COD_DEST2
AND DATAS_ADJ.COD_DEST3 = DATAS.COD_DEST3
AND DATAS_ADJ.COD_DEST4 = DATAS.COD_DEST4
AND DATAS_ADJ.COD_DEST5 = DATAS.COD_DEST5
AND DATAS_ADJ.COD_CONTO = ACCOUT_ADJ.COD_CONTO
LEFT OUTER JOIN #TMP_KDPWT_INCOTERMS INFO
ON INFO.P_CODE = DATAS.COD_DEST4
GROUP BY
SCEN_P.SCENARIO,SCEN_P.PERIOD,ACCOUT_ADJ.ATTRIBUTO1,DATAS_rev.COD_DEST1,DATAS_rev.COD_DEST2
,DATAS_rev.COD_DEST3, DATAS.COD_DEST4,DATAS_rev.COD_VALUTA,DATAS_rev.COD_VALUTA_ORIGINARIA,INFO.INCOT
)
I will also share the order-of-execution details for the normal and long-run cases.
Could someone please help us overcome this issue? We also don't know how to fix the order of the join execution. Is there any way to fix the join execution order? Please guide us.
Thanks in advance
Vinothkumar
Without a lot more detailed information, there is no way to tell exactly why your INSERT statement shows this alternating runtime behaviour.
Based on my experience, such an analysis can take quite some time and there are only a few people available who are capable of performing it. If you can get someone like that to look at this, make sure to understand and learn from it.
What I can tell from the information shared is this:
using temporary tables to structure a multi-stage data flow is the wrong thing to do on SAP HANA. Instead, use table variables in SQLScript (see the sketch after this list).
if you insist on using the temporary tables, at least make them column tables; this will help avoid the need for some internal data materialisation.
when using joins, make sure that the joined columns are of the same data type. The explain plan is full of TO_INT(), TO_DECIMAL(), and other conversion functions; those take time and memory, and make it hard for the optimiser(s) to estimate cardinalities.
as the statement uses a lot of temporary tables, the different join orders can easily result from different volumes of data that was present when the SQL was parsed, prepared and optimised. One option to avoid this is to have HANA ignore any cached plans for the statement. The documentation has the HINTS for that.
And that is about what I can say about this with the available information.
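As a minimal sketch of the table-variable approach (the table and column names below are made up, not taken from your procedure, and this assumes the statements live inside an SQLScript procedure body):
-- intermediate result kept in a table variable instead of a local temporary table
lt_saldi_base = SELECT cod_scenario, cod_periodo, cod_conto, importo
                  FROM dati_saldi_lordi
                 WHERE cod_dest1 = 'NC';
-- later steps read the variable with a leading colon; no #TMP table, no extra materialisation
lt_aggregated = SELECT cod_scenario, cod_periodo, SUM(importo) AS importo
                  FROM :lt_saldi_base
                 GROUP BY cod_scenario, cod_periodo;
INSERT INTO dati_saldi_lordi_finale
SELECT cod_scenario, cod_periodo, importo FROM :lt_aggregated;
For the plan-cache point, if I remember the hint name correctly it is IGNORE_PLAN_CACHE (used as ... WITH HINT (IGNORE_PLAN_CACHE)), but please verify the exact hint in the documentation for your HANA revision.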
I have the following query, which I am executing directly in my code and putting the result into a DataTable. The problem is that it takes more than 10 minutes to execute. The main part that is taking the time is the NOT EXISTS.
SELECT
[t0].[PayrollEmployeeId],
[t0].[InOutDate],
[t0].[InOutFlag],
[t0].[InOutTime]
FROM [dbo].[MachineLog] AS [t0]
WHERE
([t0].[CompanyId] = 1)
AND ([t0].[InOutDate] >= '2016-12-13')
AND ([t0].[InOutDate] <= '2016-12-14')
AND
( NOT (EXISTS(
SELECT NULL AS [EMPTY]
FROM [dbo].[TO_Entry] AS [t1]
WHERE
([t1].[EmployeeId] = [t0].[PayrollEmployeeId])
AND ([t1]. [CompanyId] = 1)
AND ([t0].[PayrollEmployeeId] = [t1].[EmployeeId])
AND (([t0].[InOutDate]) = [t1].[Entry_Date])
AND ([t1].[Entry_Method] = 'M')
))
)
ORDER BY
[t0].[PayrollEmployeeId], [t0].[InOutDate]
Is there any way I can optimize this query? What is the workaround for this? It is taking too much time.
It seems that you can convert the NOT EXISTS into a LEFT JOIN query, with the second table returning NULL values.
Please check the following SELECT and modify it if required to fulfil your requirements:
SELECT
[t0].[PayrollEmployeeId], [t0].[InOutDate], [t0].[InOutFlag], [t0].[InOutTime]
FROM [dbo].[MachineLog] AS [t0]
LEFT JOIN [dbo].[TO_Entry] AS [t1]
ON [t1].[EmployeeId] = [t0].[PayrollEmployeeId]
AND [t0].[PayrollEmployeeId] = [t1].[EmployeeId]
AND [t0].[InOutDate] = [t1].[Entry_Date]
AND [t1]. [CompanyId] = 1
AND [t1].[Entry_Method] = 'M'
WHERE
([t0].[CompanyId] = 1)
AND ([t0].[InOutDate] >= '2016-12-13')
AND ([t0].[InOutDate] <= '2016-12-14')
AND [t1].[EmployeeId] IS NULL
ORDER BY
[t0].[PayrollEmployeeId], [t0].[InOutDate]
You will notice that there is an informative message on the execution plan for your query:
It indicates that there is a missing clustered index, with an estimated 30% effect on the execution time.
It seems that the transaction data is driven by date fields such as the entry time.
Date fields, especially in your case, are strong candidates for clustered indexes. You can create an index on the Entry_Date column.
I guess you already have some index on InOutDate.
You can try indexing this field as well; a rough sketch follows.
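As a rough sketch only (the index names are invented, and the exact key/INCLUDE columns should be validated against your actual execution plan and workload):
CREATE NONCLUSTERED INDEX IX_TO_Entry_EntryDate
    ON [dbo].[TO_Entry] ([Entry_Date], [EmployeeId], [CompanyId], [Entry_Method]);
CREATE NONCLUSTERED INDEX IX_MachineLog_Company_InOutDate
    ON [dbo].[MachineLog] ([CompanyId], [InOutDate])
    INCLUDE ([PayrollEmployeeId], [InOutFlag], [InOutTime]);
The first index covers all the columns the NOT EXISTS (or LEFT JOIN) probe touches, so the anti-join can be resolved from the index alone; the second supports the range filter on InOutDate without key lookups.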
I am running a SQL query in an Oracle 12.1.0 database, and it takes drastically different amounts of time between the first and subsequent invocations.
Our second and subsequent invocations take almost 10 times as long as the first invocation (too bad that I don't have a cold backup).
According to the developer, no changes have been made in the application or underlying tables.
We are truncating and analyzing (using DBMS_STATS) all relevant tables before each invocation, but we are unable to replicate the performance seen in the first invocation.
My gut feeling is that it has something to do with parsing and a bad query plan.
The statement is as follows:
SELECT CASE input_val
WHEN 'DEL'
THEN
(SELECT a_key
FROM d_ad, ad_xr
WHERE d_ad.m_id = ad_xr.m_id
AND d_ad.source = ad_xr.source
AND ad_xr.c_id = :1
AND ad_xr.source = :3 )
WHEN 'WRITE'
THEN
(SELECT w_key
FROM wp_xr x, d_wl w
WHERE w.m_id = x.m_id
AND w.source = x.source
AND x.client_id = :4
AND x.source = :5 )
WHEN 'APPEND'
THEN
(SELECT a_key
FROM F0_A
WHERE a_id = :5 AND source = :7 )
END
FROM DUAL
OK, I have a rather unique situation, and I can't believe there is not a better way of doing this than my solution.
Requirements:
Table 2 (EpmTask_UserView_RM) is a subset of Table 1 (MSP_EpmTask_UserView), so while all the fields match, Table 1 has many more rows than Table 2.
Table 2 needs to be updated from Table 1 based on the date a task has changed (we can't do a drop and replace). There are three cases:
Task updates where something has changed about the task (We will know based on the task date stamp)
Task Deletes where a task has been deleted
Task Adds where a new task exists
I have 3 different queries that do this and am thinking there is a better way.
/**** DELETE Tasks from ZZZ_TEST_OF_UPDATE_MSP_EpmTask_UserView_RM table if no longer present in Production ****/
USE [ProjectWebApp]
GO
DELETE FROM [dbo].[ZZZ_TEST_OF_UPDATE_MSP_EpmTask_UserView_RM]
WHERE [dbo].[ZZZ_TEST_OF_UPDATE_MSP_EpmTask_UserView_RM].TaskUID IN
(SELECT
/*Subquery to select all records in ZZZ_TEST_OF_UPDATE_MSP_EpmTask_UserView_RM NOT found in MSP_EpmTask_UserView_RM */
[ProjectWebApp].[dbo].[ZZZ_TEST_OF_UPDATE_MSP_EpmTask_UserView_RM].[TaskUID]
FROM [ProjectWebApp].[dbo].[ZZZ_TEST_OF_UPDATE_MSP_EpmTask_UserView_RM]
LEFT JOIN [MSPSPRO].[ProjectWebApp].[dbo].[MSP_EpmTask_UserView] as Prod
on Prod.TaskUID = [ProjectWebApp].[dbo].[ZZZ_TEST_OF_UPDATE_MSP_EpmTask_UserView_RM].TASKuid
where Prod.TaskUID is NULL)
Query 2, the update:
UPDATE [dbo].[ZZZ_TEST_OF_UPDATE_MSP_EpmTask_UserView_RM]
SET
[ProjectUID] = Source.[ProjectUID]
,[TaskUID] = Source.[TaskUID]
,[TaskName] = Source.[TaskName]
,[TaskIndex] = Source.[TaskIndex]
,[TaskOutlineLevel] = Source.[TaskOutlineLevel]
,[TaskOutlineNumber] = Source.[TaskOutlineNumber]
,[TaskStartDate] = Source.[TaskStartDate]
,[TaskFinishDate] = Source.[TaskFinishDate]
,[TaskActualStartDate] = Source.[TaskActualStartDate]
,[TaskActualFinishDate] = Source.[TaskActualFinishDate]
,[TaskPercentCompleted] = Source.[TaskPercentCompleted]
,[Health] = Source.[Health]
,[Milestone Significance Level] = Source.[Milestone Significance Level]
,[TaskModifiedDate] = Source.[TaskModifiedDate]
,[TaskBaseline1StartDate] = Source.[TaskBaseline1StartDate]
,[TaskBaseline1FinishDate] = Source.[TaskBaseline1FinishDate]
,[TaskBaseline1Duration] = Source.[TaskBaseline1Duration]
,[QueryTimestamp] = GetDate()
FROM [MSPSPRO].[ProjectWebApp].[dbo].[MSP_EpmTask_UserView] AS Source
WHERE Source.TaskUID = [dbo].[ZZZ_TEST_OF_UPDATE_MSP_EpmTask_UserView_RM].TaskUID
AND GetDate() - Source.TaskModifiedDate <= .01 -- Update any task changed in last 14 minutes (14 minutes = 1% of a full day, ie '.01')
GO
Query 3, the add:
SELECT
[MSP_EpmProject_UserView].[ProjectUID]
,[TaskUID]
,[TaskName]
,[TaskIndex]
,[TaskOutlineLevel]
,[TaskOutlineNumber]
,[TaskStartDate]
,[TaskFinishDate]
,[TaskActualStartDate]
,[TaskActualFinishDate]
,[TaskPercentCompleted]
,[Health]
,[Milestone Significance Level]
,[TaskModifiedDate]
,[TaskBaseline1StartDate]
,[TaskBaseline1FinishDate]
,[TaskBaseline1Duration]
,GetDate() as QueryTimestamp
INTO [ProjectWebApp].[dbo].[MSP_EpmTask_UserView_RM]
FROM [MSPSPRO].[ProjectWebApp].[dbo].[MSP_EpmTask_UserView]
Inner Join [MSPSPRO].[ProjectWebApp].[dbo].[MSP_EpmProject_UserView]
on [MSP_EpmProject_UserView].projectUID = [MSP_EpmTask_UserView].ProjectUID
WHERE [SMO Programs] = 'SMO Day 1 Release Management'
AND [Milestone Significance Level] is not null
/*AND [TaskModifiedDate] > (getdate() - 1)*/
Thoughts?
This looks like an ideal situation for a MERGE statement. If you haven't used them much or at all, I'd strongly suggest this site as a primer.
A MERGE can carry out INSERT, UPDATE, and DELETE in one shot, in the right conditions. The basic idea is that you compare rows in two tables, your source and destination, and from that comparison (and potentially other conditions) you then take the appropriate action.
MERGE can perform very well because it carries out these actions in bulk - but do test it out. Sometimes people have found it slower than the separate statements in some situations. Indexing correctly (Microsoft suggests indexing the columns used to join in both tables) can help immensely. Writing the MERGE statement correctly and well is important in terms of both getting the right result and getting good performance, so definitely do your reading up if you haven't used it before. The above link is a good starter, but there are plenty of other articles around.
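As a rough, untested sketch against your tables (the column list is abbreviated to a few fields, I am guessing that [SMO Programs] comes from the project view and [Milestone Significance Level] from the task view as in your third query, and your "changed in the last 14 minutes" rule is replaced by a simple modified-date comparison), it could look something like this:
MERGE [dbo].[ZZZ_TEST_OF_UPDATE_MSP_EpmTask_UserView_RM] AS Target
USING (
    SELECT P.[ProjectUID], T.[TaskUID], T.[TaskName], T.[TaskModifiedDate]
    FROM [MSPSPRO].[ProjectWebApp].[dbo].[MSP_EpmTask_UserView] AS T
    INNER JOIN [MSPSPRO].[ProjectWebApp].[dbo].[MSP_EpmProject_UserView] AS P
        ON P.[ProjectUID] = T.[ProjectUID]
    WHERE P.[SMO Programs] = 'SMO Day 1 Release Management'
      AND T.[Milestone Significance Level] IS NOT NULL
) AS Source
ON Target.[TaskUID] = Source.[TaskUID]
WHEN MATCHED AND Source.[TaskModifiedDate] > Target.[TaskModifiedDate] THEN
    UPDATE SET [TaskName]         = Source.[TaskName],
               [TaskModifiedDate] = Source.[TaskModifiedDate],
               [QueryTimestamp]   = GETDATE()
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([ProjectUID], [TaskUID], [TaskName], [TaskModifiedDate], [QueryTimestamp])
    VALUES (Source.[ProjectUID], Source.[TaskUID], Source.[TaskName], Source.[TaskModifiedDate], GETDATE())
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
You would extend the column lists to the full set from your UPDATE query, and you may want to add a condition to the WHEN NOT MATCHED BY SOURCE branch so it only deletes rows that are in the same scope as the source query.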
I want to create a materialized view (MV); please see the SQL query below. When I try to create the materialized view, my temp tablespace (~128 GB) gets completely used and I get the error below:
SQL Error: ORA-12801: error signaled in parallel query server P007
ORA-01652: unable to extend temp segment by 64 in tablespace TEMP1
12801. 00000 - "error signaled in parallel query server %s"
Then I checked in OEM and saw that it used a parallelism degree of 8, so I disabled parallelism using an ALTER statement (ALTER SESSION DISABLE PARALLEL QUERY). The MV then ran for a long time, took several hours, and got created. Please suggest whether there are any approaches to create it without using so much temp space. The row count of the SELECT query for this MV is around 55 million. Any suggestions are really appreciated.
DB: Oracle 11gR2
CREATE MATERIALIZED VIEW TEST NOLOGGING
REFRESH FORCE ON DEMAND
ENABLE QUERY REWRITE
AS
select
table4.num as "Number",table4.num as "SNum",
table4.status as "S_status",
'Open' as "NLP",
create_table2.fmonth as "SMN",
table6.wgrp as "SOW",
(table2.end_dt - create_table2.dt) as "elp",
table6.d_c as "SDC",
create_table2.fiscal_quarter_name as "SQN",
'TS' as "SSL",
table3.table3_id as "SR Owner CEC ID",
table4.sev as "ssev",
SUBSTR(table8.stech,1,INSTR(table8.stech,'=>')-1) as "srtech",
SUBSTR(table8.stech,INSTR(table8.stech,'=>')+2) as "srstech",
table5.sr_type as "SR Type",
table5.problem_code as "SR Problem Code",
--null as "SR Entry Channel",
--null as "SR Time in Status (Days)",
table6.center,
table6.th1col,
table6.master_theater,
table6.rol_3,
table7.center hier_17_center,
table7.rol_1,
table7.rol_2,
table7.rol_3 wg,
table2.dt as "SBD",
table2.wk_n as "SBFW",
table2.fmonth as "SBFM",
table3.defect_indicator as "Has Defect",
table2.sofw,
table2.sofm
from
A table1
join B table2 on (table1.date_id = table2.dw_date_key)
join C table3 on (table1.date_id = table3.date_id and table1.incident_id = table3.incident_id)
join D table4 on (table3.incident_id = table4.incident_id and table4.key_d <= table3.date_id and table3.table3_id = table4.current_owner_table3_id)
join E table5 on table4.incident_id = table5.incident_id
join B create_table2 on (table5.creation_dw_date_key = create_table2.dw_date_key)
join F table6 on (table1.objectnumber=table6.DW_WORKGROUP_KEY)
join G table7 on (table1.objectnumber=table7.DW_WORKGROUP_KEY)
left outer JOIN H table8 ON (table8.natural_key= table5.UPDATED_COT_TECH_KEY)
where
table4.bl_incident_key in (select max(bl_incident_key) from D b
where b.incident_id=table3.incident_id and b.key_d <= table3.date_id and b.current_owner_table3_id = table3.table3_id)
and table2.fiscal_year_name in ('FY2013','FY2014')
Without knowing your system, tables or data, I assume that
some of the 8 tables have many rows (>> 55 million)
join-predicates and filters will not reduce the amount of data significantly
so nearly all the data will be written to the MV.
The execution plan will probably use some hash operations and/or sort aggregations.
This hashing and sorting cannot be done in memory if the hash and sort segments are too big, so it will be done in temp.
Eight parallel slots will probably use more temp than one session, so this can be the reason for the ORA error.
You can:
accept the several hours; normally such operations are done at night or at the weekend, and it doesn't matter whether it takes 4 hours or 1
increase the temp tablespace
try to scale the degree of parallelism with a hint: create .... as select /*+ parallel(n) */ table4.num... (use 2, 4 or 8 for n to get 2, 4 or 8 slots)
try some indexes on the joined columns, e.g.
TABLE1(DATE_ID, INCIDENT_ID)
TABLE1(OBJECTNUMBER)
TABLE2(DW_DATE_KEY)
TABLE2(FISCAL_YEAR_NAME)
TABLE3(DATE_ID, INCIDENT_ID, TABLE3_ID)
TABLE3(INCIDENT_ID, TABLE3_ID, DATE_ID)
TABLE4(INCIDENT_ID, CURRENT_OWNER_TABLE3_ID, KEY_D, BL_INCIDENT_KEY)
TABLE5(INCIDENT_ID)
TABLE5(CREATION_DW_DATE_KEY)
TABLE5(UPDATED_COT_TECH_KEY)
TABLE6(DW_WORKGROUP_KEY)
TABLE7(DW_WORKGROUP_KEY)
TABLE8(NATURAL_KEY)
And use EXPLAIN PLAN for the different SQL statements to see which plan Oracle will generate; a small example follows.
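A minimal way to do that with the standard DBMS_XPLAN package (the cut-down query here is only a placeholder; run it against the MV's actual SELECT or any variant you want to compare):
EXPLAIN PLAN FOR
SELECT table2.dt, table2.wk_n
FROM A table1
JOIN B table2 ON (table1.date_id = table2.dw_date_key)
WHERE table2.fiscal_year_name IN ('FY2013','FY2014');
-- display the plan that was just explained
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
Comparing the plans with and without the parallel hint, and before and after adding the indexes, will show where the big hash joins and sorts (the temp consumers) sit.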