Optimize joining two big tables ORACLE 19C - sql

how can I optimize the query below:
SELECT A.CNACT, A.FACML, A.LCACT, H.CAECH, H.CMECH, H.MCCMP, H.DAHIS,
       RANK() OVER (PARTITION BY H.CNACT, H.CAECH, H.CMECH ORDER BY H.DAHIS DESC) RK
FROM NATACF A, HISTER H
WHERE A.CNACT = H.CNACT;
select count (*) FROM NATACF; -->74794
select count (*) FROM HISTER; -->2100720
You'll find the execution plan in the attachment.
Thank you.
As you can see, the window sort and hash join are not performing well. What is the best way to optimise this?
The screenshot below is from the prod database:

Long story short, you want ALL data from *both* tables - no filtering in place.
Oracle reads the whole smaller (driving) table into a hash map, using the join column CNACT as the key.
It then reads the whole bigger table and performs a lookup in the hash map for each row read.
The complexity is O(N+M); each row is read only once.
There is no way to evaluate such a query any faster (aside from dirty tricks like putting both tables into a CLUSTER, pinning the tables in the buffer cache, ...).
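If you want to confirm what the optimizer is doing, you can check the plan yourself; a minimal sketch using the standard DBMS_XPLAN package (the USE_HASH hint only documents the join method Oracle would pick here anyway):
EXPLAIN PLAN FOR
SELECT /*+ USE_HASH(H) */ A.CNACT, A.FACML, A.LCACT, H.CAECH, H.CMECH, H.MCCMP, H.DAHIS,
       RANK() OVER (PARTITION BY H.CNACT, H.CAECH, H.CMECH ORDER BY H.DAHIS DESC) RK
FROM NATACF A, HISTER H
WHERE A.CNACT = H.CNACT;
-- display the plan just explained
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);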
PS: it is strange that the explain plan shows 2 seconds - neither table is actually that big - while the prod DB says 5 hours.
Try to execute the query using:
set timing on
set echo off
set pagesize 0
set termout off
set feedback off
set pause off
set verify off
set heading off
Basically, read the whole result, then discard it and print only the execution time. And you will see.
Maybe it is the app (or the network) that has a problem transferring the whole big result set. In such a case you would see the "SQL*Net message to client" wait event in AWR - the database waiting for the application to accept more data. You are sending about 14 GB of data into the application.
For example, Java may have problems with GC, or each row may trigger a costly Java object creation.
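To check whether the client really is the bottleneck, you can look at the cumulative wait statistics; a quick sketch against the standard v$system_event view (column names as documented, time_waited in centiseconds):
select event, total_waits, time_waited
from v$system_event
where event like 'SQL*Net%'
order by time_waited desc;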

we resolved the problem using the WITH clause:
WITH Z AS (
  SELECT X.DATRA, X.COINI, X.COINT, X.NUCPT, N.COBCN, X.CNACT, N.LCACT, N.CNACR,
         DECODE(X.CSOPT, NULL, X.CAECH, O.CAECR) CAECH,
         DECODE(X.CSOPT, NULL, X.CMECH, O.CMECR) CMECH,
         X.CSOPT, X.MTSNA, N.COTSJ, X.CSENS, X.QTCCP, D.TXCHA, R.MAINT, X.CODEV,
         D.CDVRF, C.TYEDI, C.NUSES
  FROM CUMPOR X,
       MRXIDE C, .....
The WITH clause: the materialized subquery data is persisted for the duration of the query.
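For reference, Oracle decides on its own whether to materialize a WITH subquery; the widely used (though formally undocumented) MATERIALIZE hint forces it. A minimal illustrative sketch, not the production query above:
WITH Z AS (
  SELECT /*+ MATERIALIZE */ H.CNACT, MAX(H.DAHIS) LAST_DAHIS
  FROM HISTER H
  GROUP BY H.CNACT
)
SELECT A.CNACT, A.LCACT, Z.LAST_DAHIS
FROM NATACF A
JOIN Z ON Z.CNACT = A.CNACT;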

Related

Is SELECT TOP always the fastest way to get a preview of a query you want on SQL?

I have the following query that I've run:
SELECT TOP 100 certs.CertId, COUNT(cluster.BGTJobId) C
FROM [CentralDB_US_33].[dbo].[JobSkillClusterIndex] cluster
INNER JOIN [Eagle].[raw].[certs] certs
ON certs.BGTJobId = cluster.BGTJobId
GROUP BY cluster.skillClusterId, certs.CertId
Ultimately, I want to get the full results and not just the top 100, but for previewing purposes, is this the fastest way to go?
Since you've mentioned this is for preview purposes, I'm assuming you just want data out of the query and you want it to run FAST regardless of the data it returns. Seeing that the query takes 14 minutes to execute, a quick 'hack-fix' would be to use something like below:
SELECT
    certs.CertId
  , COUNT(cluster.BGTJobId)
FROM
    (SELECT TOP 100
         certs.CertId
       , certs.BGTJobId   -- the join column must be exposed by the derived table
     FROM [Eagle].[raw].[certs] certs) certs
INNER JOIN [CentralDB_US_33].[dbo].[JobSkillClusterIndex] cluster
    ON certs.BGTJobId = cluster.BGTJobId
GROUP BY cluster.skillClusterId, certs.CertId
Aggregating data (in your case COUNT) is a very expensive operation and should be done only at the last part of the query, on as little data as possible. That is why, for "preview" purposes, I have selected only the first 100 certificates and run the COUNT on that data.
However, because you mentioned that the query takes 14 minutes to run, the problems are elsewhere and usually this is due to design (query design, index design or even table design).
You should ask yourself if you really want to go over all of the data in the tables and get all of the matching rows from both tables, and aren't you possibly missing a WHERE clause?
If you do decide that a WHERE clause is needed, are there any indexes to help filter the data based on the conditions of your WHERE clause (and even the join columns - certs.BGTJobId and cluster.BGTJobId)?
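If the join columns aren't indexed yet, indexes along these lines may help; the index names are made up, and whether the INCLUDE columns pay off depends on your workload:
-- hypothetical covering indexes on the join columns
CREATE NONCLUSTERED INDEX IX_certs_BGTJobId
    ON [Eagle].[raw].[certs] (BGTJobId)
    INCLUDE (CertId);
CREATE NONCLUSTERED INDEX IX_cluster_BGTJobId
    ON [CentralDB_US_33].[dbo].[JobSkillClusterIndex] (BGTJobId)
    INCLUDE (skillClusterId);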
Yes, a TOP select query is fastest for preview purposes - that is why it is also offered in the Management Studio GUI right-click menu. But if you are running a custom query, check that the WHERE clause / grouping columns are covered by the clustered index.

Amazon Redshift queries mysteriously dying

Why is my Amazon Redshift query sometimes working, sometimes getting killed, and sometimes running out of memory?
This is a simple query:
dev=# EXPLAIN SELECT row_number, browser_cookie, "timestamp", request_path,
status, outcome, duration, referrer
FROM annotated_apache_logs
WHERE date = '2015-09-15';
QUERY PLAN
------------------------------------------------------------------------------------
XN Seq Scan on annotated_apache_logs (cost=0.00..114376.71 rows=9150137 width=207)
Filter: (date = '2015-09-15'::date)
Pulling about 9 million rows:
dev=# SELECT count(*) FROM annotated_apache_logs WHERE date = '2015-09-15';
count
---------
9150137
(1 row)
And choking:
dev=# SELECT row_number, browser_cookie, "timestamp", request_path,
status, outcome, duration, referrer
FROM annotated_apache_logs
WHERE date = '2015-09-15';
out of memory
Sometimes the sql says Killed. Sometimes it works. Sometimes I get out of memory. No idea why. The table looks like this (I've removed columns not in the above query):
CREATE TABLE IF NOT EXISTS annotated_apache_logs (
row_number double precision,
browser_cookie character varying(240),
timestamp integer,
request_path character varying(2500),
status character varying(12),
outcome character varying(128),
duration integer,
referrer character varying(2500)
)
DISTKEY (date)
SORTKEY (browser_cookie);
And I've worked very hard to get all of those columns as small as I can to reduce memory usage. What do I look for now? If I read the EXPLAIN output correctly, this might return a couple of gigs of data. Not much data, no joins, nothing fancy. For a "petabyte scale data warehouse", that's trivial, so I'm assuming I'm missing something fundamental here.
You should use cursors to fetch the result set in chunks. See http://docs.aws.amazon.com/redshift/latest/dg/declare.html
If your client application uses an ODBC connection and your query creates a result set that is too large to fit in memory, you can stream the result set to your client application by using a cursor. When you use a cursor, the entire result set is materialized on the leader node, and then your client can fetch the results incrementally.
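A minimal sketch of that cursor pattern for this query (the cursor name is made up; in Redshift, DECLARE must run inside a transaction):
BEGIN;
DECLARE logs_cur CURSOR FOR
    SELECT row_number, browser_cookie, "timestamp", request_path,
           status, outcome, duration, referrer
    FROM annotated_apache_logs
    WHERE date = '2015-09-15';
-- repeat the FETCH until it returns no rows
FETCH FORWARD 10000 FROM logs_cur;
CLOSE logs_cur;
COMMIT;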
Edit:
Assuming that you want the entire result set rather than filtering using where/limit.
If your query is actually running out of memory, check the concurrency of the WLM queue under which it runs. Try to increase the available memory for this queue or reduce its concurrency; this will give your query more memory.
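One documented way to let a single query claim more of its queue's memory is to raise its slot count for the session; a sketch (3 is an arbitrary example value, bounded by the queue's concurrency):
set wlm_query_slot_count to 3;   -- claim 3 slots' worth of the queue's memory
-- run the large query here
set wlm_query_slot_count to 1;   -- reset to the default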
P.S.:
When it says "petabyte scale", it does not mean it has a petabyte of RAM for you. There are a lot of factors which decide how much memory your query actually gets during execution:
What is the node type you are using?
How many nodes?
What other queries are running when you are running this query?

Need for long and dynamic select query/view sqlite

I have a need to generate a long select query of potentially thousands of where conditions like (table1.a = ? OR table1.a = ? OR ...) AND (table2.b = ? OR table2.b = ? ...) AND....
I initially started building a class to make this more bearable, but have since stopped to wonder whether this will work well. This query is going to be hammering a table of potentially tens of millions of rows, joined with two more tables of thousands of rows.
A number of concerns are stemming from this:
1.) I wanted to use these statements to generate a temp view so I could easily port the existing code base. The point is that I want to filter down the data I have for analysis based on parameters selected in a GUI, so how poorly will a view do in this scenario?
2.) Can sqlite even parse a query with thousands of binds?
3.) Isn't there a framework that can make generating this query easier other than with string concatenation?
4.) Is the better solution to dump all of the WHERE variables into hash sets in memory, and then just write a wrapper for my DB query object that calls next() until a row is encountered that satisfies all my conditions? My concern here is that the application generates graphs procedurally on scrolls, so waiting to draw while calling query.next() 100,000 times might cause an annoying delay. Ideally I don't want to wait more than 30 ms at a time for the next row that satisfies everything.
edit:
New issue: it came to my attention that sqlite3 is limited to 999 bind values (host parameters) at compile time.
So it seems the only way to accomplish what I had originally intended is to
1.) Generate the entire query via string concatenation (my biggest concern being that I don't know how slow parsing all that data inside sqlite3 will be)
or
2.) Do the blanket query method (select * from * where index > ? limit ?) and call next() until I hit valid data in my compiled code (including updating the index variable and re-querying repeatedly).
I did end up writing a wrapper around the QSqlQuery object that "walks" the table using an index > ? condition with a LIMIT.
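For reference, the walking query behind such a wrapper looks roughly like this (table and column names are placeholders); each batch resumes after the last id seen, so no OFFSET scan is needed:
SELECT *
FROM big_table            -- placeholder name
WHERE id > ?              -- bind the last id from the previous batch; start below the minimum
ORDER BY id
LIMIT 1000;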
Consider dumping the joined results without filters (denormalized) into a flat file and index it with Fastbit, a bitmap index engine.

mysterious oracle query

if a query in Oracle takes 11 minutes the first time it is executed, and 25 seconds the next time, with the buffers being flushed in between, what is the possible cause? Could it be that the query is written in a bad way?
set timing on;
set echo on
set lines 999;
insert into elegrouptmp select idcll,idgrpl,0 from elegroup where idgrpl = 109999990;
insert into SLIMONTMP (idpartes, indi, grecptseqs, devs, idcll, idclrelpayl)
select rel.idpartes, rel.indi, rel.idgres,rel.iddevs,vpers.idcll,nvl(cdsptc.idcll,vpers.idcll)
from
relbqe rel,
elegrouptmp ele,
vrdlpers vpers
left join cdsptc cdsptc on
(cdsptc.idclptcl = vpers.idcll and
cdsptc.cdptcs = 'NOS')
where
rel.idtits = '10BCPGE ' and
vpers.idbqes = rel.idpartes and
vpers.cdqltptfc = 'N' and
vpers.idcll = ele.idelegrpl and
ele.idgrpl = 109999990;
alter system flush shared_pool;
alter system flush buffer_cache;
alter system flush global context;
select /* original */ mvtcta_part_SLIMONtmp.idpartes,mvtcta_part_SLIMONtmp.indi,mvtcta_part_SLIMONtmp.grecptseqs,mvtcta_part_SLIMONtmp.devs,
mvtcta_part_SLIMONtmp.idcll,mvtcta_part_SLIMONtmp.idclrelpayl,mvtcta_part_vrdlpers1.idcll,mvtcta_part_vrdlpers1.shnas,mvtcta_part_vrdlpers1.cdqltptfc,
mvtcta_part_vrdlpers1.idbqes,mvtcta_part_compte1.idcll,mvtcta_part_compte1.grecpts,mvtcta_part_compte1.seqc,mvtcta_part_compte1.devs,mvtcta_part_compte1.sldminud,
mvtcta.idcll,mvtcta.grecptseqs,mvtcta.devs,mvtcta.termel,mvtcta.dtcptl,mvtcta.nusesi,mvtcta.fiches,mvtcta.indl,mvtcta.nuecrs,mvtcta.dtexel,mvtcta.dtvall,
mvtcta.dtpayl,mvtcta.ioi,mvtcta.mtd,mvtcta.cdlibs,mvtcta.libcps,mvtcta.sldinitd,mvtcta.flagtypei,mvtcta.flagetati,mvtcta.flagwarnl,mvtcta.flagdonei,mvtcta.oriindl,
mvtcta.idportfl,mvtcta.extnuecrs
from SLIMONtmp mvtcta_part_SLIMONtmp
left join vrdlpers mvtcta_part_vrdlpers1 on
(
mvtcta_part_vrdlpers1.idbqes = mvtcta_part_SLIMONtmp.idpartes
and mvtcta_part_vrdlpers1.cdqltptfc = 'N'
and mvtcta_part_vrdlpers1.idcll = mvtcta_part_SLIMONtmp.idcll
)
left join compte mvtcta_part_compte1 on
(
mvtcta_part_compte1.idcll = mvtcta_part_vrdlpers1.idcll
and mvtcta_part_compte1.grecpts = substr (mvtcta_part_SLIMONtmp.grecptseqs, 1, 2 )
and mvtcta_part_compte1.seqc = substr (mvtcta_part_SLIMONtmp.grecptseqs, -1 )
and mvtcta_part_compte1.devs = mvtcta_part_SLIMONtmp.devs
and (mvtcta_part_compte1.devs = ' ' or ' ' = ' ')
and mvtcta_part_compte1.cdpartc not in ( 'L' , 'R' )
)
left join mvtcta mvtcta on
(
mvtcta.idcll = mvtcta_part_SLIMONtmp.idclrelpayl
and mvtcta.devs = mvtcta_part_SLIMONtmp.devs
and mvtcta.grecptseqs = mvtcta_part_SLIMONtmp.grecptseqs
and mvtcta.flagdonei <> 0
and mvtcta.devs = mvtcta_part_compte1.devs
and mvtcta.dtvall > 20101206
)
where 1=1
order by mvtcta_part_compte1.devs,
mvtcta_part_SLIMONtmp.idpartes,
mvtcta_part_SLIMONtmp.idclrelpayl,
mvtcta_part_SLIMONtmp.grecptseqs,
mvtcta.dtvall;
"if a query in oracle takes the first
time it is executed 11 minutes, and
the next time, the same query 25
seconds, with the buffer being
flushed, what is the possible cause?"
The thing is, flushing the database caches, like this ...
alter system flush shared_pool
/
alter system flush buffer_cache
/
... wipes Oracle's own cached data and plans, but there are other places where data gets cached. For instance, the chances are your OS caches its file reads.
EXPLAIN PLAN is good as a general guide to how the database thinks it will execute a query, but it is only a prediction. It can be thrown out by poor statistics or ambient conditions. It is not good at explaining why a specific instance of a query took as much time as it did.
So, if you really want to understand what occurs when the database executes a specific query you need to get down and dirty, and learn how to use the Wait Interface. This is a very powerful tracing mechanism, which allows us to see the individual events that happen over the course of a single query execution. Each version of Oracle has extended the utility and richness of the Wait Interface, but it has been essential to proper tuning since Oracle 9i (if not earlier).
Find out more by reading Roger Schrag's very good overview.
In your case you'll want to run the trace multiple times. In order to make it easier to compare results you should use a separate session for each execution, setting the 10046 event each time.
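For reference, the classic way to set the 10046 event in your own session (level 8 includes wait events in the trace):
alter session set events '10046 trace name context forever, level 8';
-- run the query under investigation, then switch tracing off
alter session set events '10046 trace name context off';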
What else was happening on the box when you ran these? You can get different timings depending on other processes chewing resources. Also, with a lot of joins, performance will depend on memory usage (hash_area_size, sort_area_size, etc.) and availability, so perhaps you are paging (check temp space size/usage as well). In short, try sql_trace and tkprof to dig deeper.
Sometimes a block is written to disk before its transaction commits (a dirty block). When that block is read later, Oracle sees that the change was uncommitted at the time of the write. It checks the transaction and, if the transaction is no longer open, it knows the change was committed, so it writes the block back as a clean block. This is called delayed block cleanout.
That is one possible reason why reading blocks for the first time can be slower than a subsequent re-read.
Could be that the second time around the execution plan is already known. Maybe the optimizer has a very hard time finding an execution plan for some reason.
Try setting
alter session set optimizer_max_permutations=100;
and rerun the query. See if that makes any difference.
could it be that the query is written in a bad way?
"bad" is a rather emotional expression - but broadly speaking, yes, if a query performs significantly faster the second time it's run, it usually means there are ways to optimize the query.
Actually optimizing the query is - as APC says - rather a question of "down and dirty". Obvious candidate in your examply might be the substring - if the table is huge, and the substring misses the index, I'd imagine that would take a bit of time, and I'd imagine the result of all those substrin operations are cached somewhere.
Here's Tom Kyte's take on flushing Oracle buffers as a testing practice. Suffice it to say he's not a fan. He favours attempting to emulate your production load with your test data ("real life") and tossing out the first and last runs. APC's point about OS caching is Tom's point - to get rid of that (non-trivial!) effect you'd need to bounce the server, not just the database.

long running queries: observing partial results?

As part of a data analysis project, I will be issuing some long running queries on a mysql database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In the general case, a partial result cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, then all the data must be analysed before the first row is returned. A LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can give concrete data and the SQL query?
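For example (table and column names made up), nothing can be returned here until every row has been scanned, because any unread row could still change some group's count; the LIMIT merely trims the finished output:
SELECT user_id, COUNT(*) AS cnt
FROM events
GROUP BY user_id
LIMIT 10;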
One thing you may consider is sampling your tables down. This is good practice in data analysis in general to get your iteration speed up when you're writing code.
For example, if you have table-create privileges and you have some mega-huge table X with key unique_id and some data data_value.
If unique_id is numeric, in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a random sample of the data to work your queries out, and you can inner join your sample_table against the other tables to improve the speed of testing / query results. Thanks to the sampling, your query results should be roughly representative of what you will get. Note that the number you're modding by has to be prime, otherwise it won't give a correct sample. The example above shrinks your table down to about 0.1% of the original size (0.0987% to be exact).
Most databases also have better sampling and random-number methods than just using mod. Check the documentation to see what's available for your version.
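In MySQL specifically, a rough alternative to the mod trick is RAND(); a sketch - note the sample differs on every run unless you materialize it, as here:
CREATE TABLE sample_table AS
SELECT unique_id, data_value
FROM X
WHERE RAND() < 0.001;   -- keeps roughly 0.1% of rows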
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs the whole result set before producing output - such as might happen for queries with GROUP BY, ORDER BY, or HAVING clauses - then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted by using "mysql_use_result" as an attribute of the database handle rather than the default "mysql_store_result". This is true for the Perl and Java interfaces; in the C interface, you fetch unbuffered by calling mysql_use_result() instead of mysql_store_result().