I have a query I would like to optimize. This is the query:
SELECT "c"."NETSW_ACQEREF" AS "BANK",
count("c"."NETSW_ACQEREF") AS "QTY",
sum("c"."TRAN_AMNT") / 100 AS "AMOUNT",
count(distinct "c"."TERM_ID") as "terminals"
FROM "CSCLWH"."CLWH_COMMON_DATA" "c"
WHERE ("c"."TRAN_DATE" between 20201101 AND 20201111)
AND ("TRAN_TYPE" IN
('00', '01', '10', '12', '19', '20', '26', '29', '50', '51', '52'))
AND ("RESP_CODE" IN ('0', '00', '000', '400'))
AND ("MTI" IN ('1100', '1200', '1240', '1400', '1420'))
GROUP BY "c"."NETSW_ACQEREF"
ORDER BY "BANK"
These are the explain plan results with huge cost:
Cost 5102095 Time 00:03:20
The date range covers about 3 million rows. I've created an index on the GROUP BY column, but it is of little help. Can you please show me a way to get the cost down?
The aggregation operations COUNT and SUM can't be optimized much, and also there is no HAVING clause, so your best bet here would probably be to add a multi-column index covering the entire WHERE clause:
CREATE INDEX idx ON "CSCLWH"."CLWH_COMMON_DATA" (TRAN_DATE, TRAN_TYPE, RESP_CODE, MTI);
This index, if used, would at least allow Oracle to discard many records not matching the where filter. The exact order of the columns used in the index would depend on the cardinality of the data in each column. Typically, you want to put columns first which are more restrictive, placing less restrictive columns last.
I can see two potential sources of slowness in your query. You can run a couple of tests to see which is worse. There is an easy way to fix one of them; I don't think you can do much about the other.
You don't only have the group by aggregation at the overall query level; you also have a count(distinct {something}). That count distinct is a nested aggregation which is expensive. What happens if you remove the word "distinct" there? Meaning, how does the execution time change? (Of course, that will not give you the result you need; but it will tell you HOW EXPENSIVE the "distinct" is.)
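For instance, a sketch of that timing test: the same query with only the "terminals" expression changed (the result will be wrong; this is only to measure how expensive the DISTINCT is):
SELECT "c"."NETSW_ACQEREF" AS "BANK",
count("c"."NETSW_ACQEREF") AS "QTY",
sum("c"."TRAN_AMNT") / 100 AS "AMOUNT",
count("c"."TERM_ID") as "terminals" -- DISTINCT removed, timing test only
FROM "CSCLWH"."CLWH_COMMON_DATA" "c"
WHERE ("c"."TRAN_DATE" between 20201101 AND 20201111)
AND ("TRAN_TYPE" IN
('00', '01', '10', '12', '19', '20', '26', '29', '50', '51', '52'))
AND ("RESP_CODE" IN ('0', '00', '000', '400'))
AND ("MTI" IN ('1100', '1200', '1240', '1400', '1420'))
GROUP BY "c"."NETSW_ACQEREF"
ORDER BY "BANK"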
Unfortunately, if THAT is the biggest bottleneck, there is nothing you can do about it.
The other source of slowness is the ORDER BY clause at the end of the query. A bit of background: there are essentially two ways to GROUP BY. One is to sort the rows by the expressions you "group by"; the other is to hash them. In the old days, Oracle used "sort" group by - which is expensive. As a side effect, results were ordered by the GROUP BY expressions even without an explicit ORDER BY clause; that is how developers acquired very poor habits.
At some point Oracle "learned" that "hash" group by is faster. However, it fell into a trap: when you have GROUP BY followed by ORDER BY on the same expressions, Oracle thinks (incorrectly in most cases) that it can save time by doing both in one shot, simply using the old "sort" group by. This is very wasteful when 3 million input rows result in perhaps 300 groups. It is better to hash group by the 3 million rows and then have the (additional, but trivial) step of ordering the 300 output rows. Why Oracle is so dumb as not to see this, I don't know - it's just how it is.
This problem, though, has a very simple solution. You can force hash group by with the use_hash_aggregation hint. (First, you can simply remove the ORDER BY clause from your query to see if that's the problem; if you see no improvement, then adding the hint about hash aggregation will not help.)
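A sketch of where the hint would go (USE_HASH_AGGREGATION is a real Oracle hint, but whether the optimizer honours it depends on your version and on the plan it has chosen):
SELECT /*+ USE_HASH_AGGREGATION */
"c"."NETSW_ACQEREF" AS "BANK",
count("c"."NETSW_ACQEREF") AS "QTY",
sum("c"."TRAN_AMNT") / 100 AS "AMOUNT",
count(distinct "c"."TERM_ID") as "terminals"
FROM "CSCLWH"."CLWH_COMMON_DATA" "c"
WHERE ("c"."TRAN_DATE" between 20201101 AND 20201111)
AND ("TRAN_TYPE" IN
('00', '01', '10', '12', '19', '20', '26', '29', '50', '51', '52'))
AND ("RESP_CODE" IN ('0', '00', '000', '400'))
AND ("MTI" IN ('1100', '1200', '1240', '1400', '1420'))
GROUP BY "c"."NETSW_ACQEREF"
ORDER BY "BANK"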
I have no idea which of the two problems I described is worse. And if it's the "sort group by" (the only one you can do something about), don't expect miracles. You may see the execution time drop from 3 minutes and 20 seconds to 2 minutes or 2 minutes and 30 seconds or whatnot; not an order of magnitude of improvement.
I wonder if two levels of aggregation with appropriate indexes would help:
SELECT bank, SUM(qty) as qty, SUM(amount) as amount,
count(*) as terminals
FROM (SELECT "c"."NETSW_ACQEREF" AS bank, "c"."TERM_ID",
count(*) AS qty,
sum("c"."TRAN_AMNT") / 100 AS "AMOUNT",
FROM "CSCLWH"."CLWH_COMMON_DATA" "c"
WHERE "c"."TRAN_DATE" between 20201101 AND 20201111 AND
"TRAN_TYPE" IN ('00', '01', '10', '12', '19', '20', '26', '29', '50', '51', '52') AND
"RESP_CODE" IN ('0', '00', '000', '400') AND
"MTI" IN ('1100', '1200', '1240', '1400', '1420')
GROUP BY "c"."NETSW_ACQEREF", "c"."TERM_ID"
) c
GROUP BY bank
ORDER BY BANK;
This assumes that tran_type, resp_code, and MTI are all strings. If they are numbers, then change the comparisons to use numbers.
Then you want an index for the WHERE clause. It is quite unclear what the best column order is, but something like (tran_date, mti, tran_type, resp_code), with the most selective columns first.
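A sketch of such an index (the index name is arbitrary, and the column order should be adjusted to whatever is most selective in your data):
CREATE INDEX idx_clwh_where ON "CSCLWH"."CLWH_COMMON_DATA" (TRAN_DATE, MTI, TRAN_TYPE, RESP_CODE);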
SQL optimization problem, which of the two solutions below is the most efficient?
I have the table shown in the image. I need to group the data by CPF and date and know whether each CPF had at least one login_ok = true on a specific date. Both solutions below satisfy my need, but the goal is to find the best query.
We can have multiple login_ok = true and login_ok = false rows for a CPF on a specific date; I just need to know if there was at least one login_ok = true.
I already have two solutions; I want to discuss how to make another, more efficient one.
Maybe this would work for your problem:
SELECT
t2.CPF,
t2.data
FROM (
SELECT
CPF,
date(data) AS data
from db_risco.site_rn_login
WHERE login_ok
) t2
GROUP BY 1,2
ORDER BY t2.data
DISTINCT would also work, and I doubt it would pose any performance threat in your case. Usually it evaluates expressions (like date(data)) before checking for uniqueness.
By using a subquery, in this case, you can select upfront which CPFs to include and then extract the date. Finally, you'd group by a much smaller number of rows, since they were filtered beforehand.
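A minimal sketch of that DISTINCT variant, using the same table and filter as above:
SELECT DISTINCT
    CPF,
    date(data) AS data
FROM db_risco.site_rn_login
WHERE login_ok
ORDER BY data;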
PostgreSQL has the function BOOL_OR to check whether the expression is true for at least one row. It is likely to be optimised for this kind of task.
select cpf, date(data) as data, bool_or(login_ok) as status_login
from db_risco.site_rn_login
group by cpf, date(data);
An index on (cpf, date(data)) or even on (cpf, date(data), login_ok) could help speed up the query.
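For example, a possible expression index (the name is illustrative; note that date(data) can only be indexed if data is a timestamp without time zone, because index expressions must be immutable):
CREATE INDEX idx_site_rn_login_cpf_date
    ON db_risco.site_rn_login (cpf, (date(data)), login_ok);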
On a side note: you may also want to order your results with ORDER BY. Don't rely on GROUP BY doing this. The order of the rows resulting from a query is arbitrary without an ORDER BY clause.
I am working in Spark 3.0 and I have the below sql query which results in an output of 188 rows (pictured below). If it matters, fly4 is a TempView:
SELECT kv.ident, kv.speed, kv.alt FROM fly4
WHERE kv.alt >30000 AND kv.lat IS NOT null AND kv.ident IS NOT null
AND kv.speed >590
SORT BY kv.ident, kv.speed DESC
[output from the above query]
I would like to add something that will only return unique idents, the highest speed for each unique ident, and the respective alt for those unique idents (3 columns total: ident, speed, alt). Seems fairly simple in concept but I haven't been able to figure it out. Any help or direction is much appreciated!
You can do a group by and aggregation to remove duplicates and get the highest speed:
SELECT
kv.ident,
max(struct(kv.speed as speed, kv.alt as alt))['speed'] as speed,
max(struct(kv.speed as speed, kv.alt as alt))['alt'] as alt
FROM fly4
WHERE kv.alt >30000 AND kv.lat IS NOT null AND kv.ident IS NOT null AND kv.speed >590
GROUP BY kv.ident
SORT BY kv.ident, speed DESC
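This works because Spark compares structs field by field from left to right, so max(struct(speed, alt)) picks the row with the highest speed and carries that row's alt along with it.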
Query is
SELECT DISTINCT A.X1, A.X2, A.X3, TO_DATE(A.EVNT_SCHED_DATE,'DD-Mon-YYYY') AS EVNT_SCHED_DATE,
A.X4, A.MOVEMENT_TYPE, TRIM(A.EFFECTIVE_STATUS) AS STATUS, A.STATUS_TIME, A.TYPE,
A.LEG_NUMBER,
CASE WHEN A.EFFECTIVE_STATUS='BT' THEN 'NLT'
WHEN A.EFFECTIVE_STATUS='NLT' THEN 'NLT'
WHEN A.EFFECTIVE_STATUS='MKUP' THEN 'MKUP'
END AS STATUS
FROM PHASE1.DY_STATUS_ZONE A
WHERE A.LAST_LEG_FLAG='Y'
AND SCHLD_DATE>='01-Apr-2019'--TO_DATE(''||MNTH_DATE||'','DD-Mon-YYYY')
AND SCHLD_DATE<='20-Feb-2020'--TO_DATE(''||TILL_DATE||'','DD-Mon-YYYY')
AND A.MOVEMENT_TYPE IN ('P')
AND (EXCEPTIONAL_FLAG='N' OR EXCEPTION_TYPE='5') ---------SS
PHASE1.DY_STATUS_ZONE has 710,246 records in it. Please guide me on whether this query can be optimized.
You could try adding an index which covers the WHERE clause:
CREATE INDEX idx ON PHASE1.DY_STATUS_ZONE (LAST_LEG_FLAG, SCHLD_DATE, MOVEMENT_TYPE,
EXCEPTIONAL_FLAG, EXCEPTION_TYPE);
Depending on the cardinality of your data, the above index may or may not be used.
The problem might be the select distinct. This can be hard to optimize because it removes duplicates. Even if no rows are duplicated, Oracle still does the work. If it is not needed remove it.
For your particular query, I would write it as:
WHERE A.LAST_LEG_FLAG = 'Y' AND
SCHLD_DATE >= DATE '2019-04-01' AND
SCHLD_DATE <= DATE '2020-02-20' AND
A.MOVEMENT_TYPE = 'P' AND
(EXCEPTIONAL_FLAG = 'N' OR EXCEPTION_TYPE = '5')
The date formats don't affect performance. Just readability and maintainability.
For this query, the optimal index is probably: (LAST_LEG_FLAG, MOVEMENT_TYPE, SCHLD_DATE, EXCEPTIONAL_FLAG). The last two columns might be switched, if EXCEPTIONAL_FLAG is more selective than SCHLD_DATE.
However, if this returns many rows, then the SELECT DISTINCT will be the gating factor for the query. And that is much more difficult to optimize.
I am trying to take a count of records from a table which has 194 million records. I used parallel hints and an index fast full scan, but it is still slow. Please suggest any alternative or improvement ideas for the query attached.
SELECT
/*+ parallel(cs_salestransaction 8)
index_ffs(cs_salestransaction CS_SALESTRANSACTION_COMPDATE)
index_ffs(cs_salestransaction CS_SALESTRANSACTION_AK1) */
COUNT(1)
FROM cs_salestransaction
WHERE processingunitseq=38280596832649217
AND (compensationdate BETWEEN DATE '2017-07-28' AND DATE '2018-01-26'
OR eventtypeseq IN (16607023626823731, 16607023626823732, 16607023626823733, 16607023626823734));
Here is Execution plan:
[execution plan attached as an image]
The query returned a result, but it took 2 hours to count the 194 million rows.
Edits:
Code edited to add DATE per suggestion by Littlefoot.
Code edited with actual column names.
I am new to stack overflow, hence have attached plan as image.
Also, if compensationdate is DATE datatype, don't compare it to strings (because '28-JUL-17' is a string) and force Oracle to perform implicit conversion & spend time over nothing. Switch to
compensationdate BETWEEN date '2017-07-28' and date '2018-01-26'
Having an OR condition in the WHERE clause can prevent the use of an index in the query. You should get rid of the OR condition. There can be multiple ways to do that; one method is:
SELECT /*+ parallel(sales 8)
index_ffs(sales ,sales_COMPDATE)
index_ffs(sales , sales_eventtypeseq )*/
COUNT(1)
FROM sales
WHERE processingunitseq=38
AND compensationdate BETWEEN DATE '2017-07-28' AND DATE '2018-01-26'
UNION ALL
SELECT /*+ parallel(sales 8)
index_ffs(sales ,sales_COMPDATE)
index_ffs(sales , sales_eventtypeseq )*/
COUNT(1)
FROM sales
WHERE processingunitseq=38
AND compensationdate NOT BETWEEN DATE '2017-07-28' AND DATE '2018-01-26' -- To avoid duplicates
AND eventtypeseq IN (1, 2, 3, 4);
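Note that the UNION ALL above returns two rows, one count per branch, so you would still sum them to get a single total. A minimal sketch of that wrapping (hints omitted for brevity, same placeholder names as above):
SELECT SUM(cnt) AS total_cnt
FROM (SELECT COUNT(1) AS cnt
      FROM sales
      WHERE processingunitseq = 38
        AND compensationdate BETWEEN DATE '2017-07-28' AND DATE '2018-01-26'
      UNION ALL
      SELECT COUNT(1) AS cnt
      FROM sales
      WHERE processingunitseq = 38
        AND compensationdate NOT BETWEEN DATE '2017-07-28' AND DATE '2018-01-26'
        AND eventtypeseq IN (1, 2, 3, 4));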
For other suggestions, please post the execution plan of the query.
Suppose I have an MDX query like this:
SELECT Measure1, Measure2, Measure3 ON COLUMNS,
[Region].[Region].[Region] ON ROWS
FROM TheCube
If I wanted to exclude rows where ALL THREE measures are empty, I would use SELECT NON EMPTY, which works fast. But I actually need to exclude rows where both Measure1 and Measure2 are empty, even if Measure3 has a value - because in this particular cube Measure3 always has a value, so NON EMPTY has no effect at all.
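That fast version would look something like this (but because Measure3 always has a value, NON EMPTY removes nothing here):
SELECT Measure1, Measure2, Measure3 ON COLUMNS,
NON EMPTY [Region].[Region].[Region] ON ROWS
FROM TheCube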
I could do
SELECT Measure1, Measure2, Measure3 ON COLUMNS,
FILTER ([Region].[Region].[Region],
NOT (IsEmpty(Measure1) AND IsEmpty(Measure2))) ON ROWS
FROM TheCube
and it even works, but it takes forever: an order of magnitude longer than the NON EMPTY query above. In fact, even if I filter by an expression that is always true, like FILTER(..., 1=1), it also takes a lot of time.
Is there a more efficient way to filter out rows where both Measure1 and Measure2 are empty?
I think you are looking for the similar function NonEmpty.
http://msdn.microsoft.com/en-us/library/ms145988.aspx
Here is a good explanation of the difference between them: http://thatmsftbiguy.com/nonemptymdx/
Just retyping the resulting query in a more readable manner:
SELECT Measure1, Measure2, Measure3 ON COLUMNS,
NonEmpty([Region].[Region].[Region],
{ [Measure1], [Measure2] }) ON ROWS
FROM TheCube
WHERE -- some filter
If you don't use WHERE, you must be very careful to check what exactly your NonEmpty() runs on.