Optimization by replacing a subquery with a named subquery having the INLINE hint - sql

Let's have these two tables:
create table table_x (
    x_id varchar2(100) primary key
);
create table table_y (
    x_id   varchar2(100) references table_x(x_id),
    stream varchar2(10),
    val_a  number,
    val_b  number
);
create index table_y_idx on table_y (x_id, stream);
Suppose we have millions of rows in each table, and table_y contains 0 to 10 rows per x_id.
The queries in the following examples return 200 rows with the filter substr(x_id, 2, 1) = 'B'.
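For anyone who wants to reproduce the plans, here is a hypothetical data generator (not from the original post; the volumes and the x_id format are made up to roughly match the description, so scale level <= N up as needed):
-- populate table_x; 1 in 50 keys has 'B' in position 2, so the filter returns ~200 of 10000 rows
insert into table_x (x_id)
select 'X' || case when mod(level, 50) = 0 then 'B' else 'A' end || lpad(level, 8, '0')
from dual
connect by level <= 10000;

-- populate table_y with 0..10 child rows per x_id, driven by a hash of the key
insert into table_y (x_id, stream, val_a, val_b)
select x.x_id, 'S' || n.l, dbms_random.value * 100, dbms_random.value * 100
from table_x x
join (select level as l from dual connect by level <= 10) n
  on n.l <= mod(ora_hash(x.x_id), 11);

commit;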
The task is to optimize this query:
QUERY 1
select
    x.x_id,
    y.val_a,
    y.val_b
from table_x x
left join (select
               x_id,
               max(val_a) KEEP (DENSE_RANK FIRST ORDER BY stream) as val_a,
               max(val_b) KEEP (DENSE_RANK FIRST ORDER BY stream) as val_b
           from table_y
           group by x_id
          ) y on x.x_id = y.x_id
where substr(x.x_id, 2, 1) = 'B'; -- intentionally not using the primary key filter
------
PLAN 1
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10000 | 2400000 | 22698 | 00:04:33 |
| * 1 | HASH JOIN OUTER | | 10000 | 2400000 | 22698 | 00:04:33 |
| * 2 | TABLE ACCESS FULL | TABLE_X | 10000 | 120000 | 669 | 00:00:09 |
| 3 | VIEW | | 10692 | 2437776 | 22029 | 00:04:25 |
| 4 | SORT GROUP BY | | 10692 | 245916 | 22029 | 00:04:25 |
| 5 | TABLE ACCESS FULL | TABLE_Y | 1069200 | 24591600 | 19359 | 00:03:53 |
----------------------------------------------------------------------------------
* 1 - access("X"."X_ID"="Y"."X_ID"(+))
* 2 - filter(SUBSTR("X"."X_ID", 2, 1)='B')
There is a way to optimize this significantly: QUERY 2 returns rows 2-3 times faster than QUERY 1. The INLINE hint is critically important; without it the second query performs as slowly as the first.
QUERY 2
with table_y_total as (
    select --+ INLINE
        x_id,
        max(val_a) KEEP (DENSE_RANK FIRST ORDER BY stream) as val_a,
        max(val_b) KEEP (DENSE_RANK FIRST ORDER BY stream) as val_b
    from table_y
    group by x_id
)
select
    x.x_id,
    (select val_a from table_y_total y where y.x_id = x.x_id) as val_a,
    (select val_b from table_y_total y where y.x_id = x.x_id) as val_b
from table_x x
where substr(x.x_id, 2, 1) = 'B';
------
PLAN 2
-----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10000 | 120000 | 669 | 00:00:09 |
| 1 | SORT GROUP BY NOSORT | | 1 | 19 | 103 | 00:00:02 |
| 2 | TABLE ACCESS BY INDEX ROWID | TABLE_Y | 100 | 1900 | 103 | 00:00:02 |
| * 3 | INDEX RANGE SCAN | TABLE_Y_IDX | 100 | | 3 | 00:00:01 |
| 4 | SORT GROUP BY NOSORT | | 1 | 20 | 103 | 00:00:02 |
| 5 | TABLE ACCESS BY INDEX ROWID | TABLE_Y | 100 | 2000 | 103 | 00:00:02 |
| * 6 | INDEX RANGE SCAN | TABLE_Y_IDX | 100 | | 3 | 00:00:01 |
| * 7 | TABLE ACCESS FULL | TABLE_X | 10000 | 120000 | 669 | 00:00:09 |
-----------------------------------------------------------------------------------------
* 3 - access("X_ID"=:B1)
* 6 - access("X_ID"=:B1)
* 7 - filter(SUBSTR("X"."X_ID", 2, 1)='B')
Since the first query has less code duplication, I would prefer to keep it.
Is there a hint or another trick that satisfies both of the following conditions?
keep the code of the first query (QUERY 1)
force the optimizer to use the second plan (PLAN 2)

Perhaps you have oversimplified your code, but doesn't this do what you want:
select y.x_id,
       max(y.val_a) KEEP (DENSE_RANK FIRST ORDER BY stream) as val_a,
       max(y.val_b) KEEP (DENSE_RANK FIRST ORDER BY stream) as val_b
from table_y y
where substr(y.x_id, 2, 1) = 'B'
group by x_id;
I don't think the join to table_x is necessary, as you have framed the question.

Use an index hint:
select /*+ index(t index_name) */ * from table_name t;

Since the full scan of table_x is the cheapest part of the plan, there is an approach that filters it before joining table_y. Although the optimizer chooses a full scan of table_y by default, hinting with index(y) reduces the run time to about 110% of QUERY 2's.
with table_x_filtered as (
    select x_id
    from table_x
    where substr(x_id, 2, 1) = 'B'
)
select /*+ index(y table_y_idx) */
    x.x_id,
    max(val_a) KEEP (DENSE_RANK FIRST ORDER BY stream) as val_a,
    max(val_b) KEEP (DENSE_RANK FIRST ORDER BY stream) as val_b
from table_x_filtered x
left join table_y y on y.x_id = x.x_id
group by x.x_id;
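If it helps while experimenting: whether the hint was honored can be checked with the standard EXPLAIN PLAN / DBMS_XPLAN tooling (the query is the one above, unchanged):
explain plan for
with table_x_filtered as (
    select x_id
    from table_x
    where substr(x_id, 2, 1) = 'B'
)
select /*+ index(y table_y_idx) */
    x.x_id,
    max(val_a) KEEP (DENSE_RANK FIRST ORDER BY stream) as val_a,
    max(val_b) KEEP (DENSE_RANK FIRST ORDER BY stream) as val_b
from table_x_filtered x
left join table_y y on y.x_id = x.x_id
group by x.x_id;

select * from table(dbms_xplan.display);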

How can I define IIf parameters across different records in a table?

I've defined a query that filters out records that are null in a specific field. I'd like to also calculate a query field that returns the type of record that follows the record that was filtered out, if it matches the parameters. The way I thought to do this was with an IIf statement with multiple parameters:
Preparing: IIf([tblCustomers!OrderId]=([tblCustomers!OrderId]+1)
AND [tblCustomers!OrderStatus]="Preparing","Preparing","")
This didn't work as I hoped, but I wasn't too surprised, as it would have to return data from the field initially tested. So, the argument that adds 1 is actually doing nothing.
Is there a way to target the next record in the table, test if it matches one of two or three strings, then return which one it is?
Edit: Following @mazoula's solution, it seems a correlated subquery is indeed the answer here. Following the guide on allenbrowne.com (linked by June7), I seemed to be on the right track. Here is my code for retrieving the status of the next record:
SELECT tblCustomers.AccountId,
tblCustomers.OrderId,
tblCustomers.OrderStatus,
tblCustomers.OrderShipped,
tblCustomers.OrderNotes,
(SELECT TOP 1 Dupe.OrderStatus
FROM tblCustomers AS Dupe
WHERE Dupe.AccountId = tblCustomers.AccountId
AND Dupe.OrderId > tblCustomers.OrderId
ORDER BY Dupe.AccountId DESC, Dupe.OrderId) AS NextStatus
FROM tblCustomers
WHERE (((tblCustomers.OrderShipped)="N") AND
((tblCustomers.OrderNotes) Is Null))
ORDER BY tblCustomers.AccountId DESC;
Unfortunately, I am met with the following error:
At most one record can be returned by this subquery
Doing a little more research, I found that incorporating an INNER JOIN expression should solve this.
...
FROM tblCustomers
INNER JOIN OrderStatus Dupe ON Dupe.AccountId = tblCustomers.AccountId
WHERE ...
This is where I've hit another roadblock and, when the syntax is at least correct, I receive the error:
Join expression not supported.
Is this a simple syntax issue, or have I misunderstood the role of a Join expression?
In Access 2016 I do this in two parts, because Access throws the error "must use an updateable query" when I try to update based on a subquery. For instance, suppose I want to replace the null values in TableA.Field3 with 'a' wherever the next record's Field3 is 'a'.
TableA:
+----+--------+--------+--------+
| ID | Field1 | Field2 | Field3 |
+----+--------+--------+--------+
|  1 | a      |      1 |        |
|  2 | b      |      2 |        |
|  3 | c      |      3 | a      |
|  4 | d      |      4 | b      |
|  5 | e      |      5 |        |
|  6 | f      |      6 | b      |
+----+--------+--------+--------+
I make a table on which to base the update query. The calculated field is:
Replacement: (SELECT TOP 1 Dupe.Field3 FROM [TableA] AS Dupe WHERE Dupe.ID > [TableA].[ID])
SQL pane:
SELECT TableA.ID, TableA.Field1, TableA.Field2, TableA.Field3,
       (SELECT TOP 1 Dupe.Field3
        FROM [TableA] AS Dupe
        WHERE Dupe.ID > [TableA].[ID]) AS Replacement
INTO TempTable
FROM TableA;
TempTable:
+----+--------+--------+--------+-------------+
| ID | Field1 | Field2 | Field3 | Replacement |
+----+--------+--------+--------+-------------+
|  1 | a      |      1 |        |             |
|  2 | b      |      2 |        | a           |
|  3 | c      |      3 | a      | b           |
|  4 | d      |      4 | b      |             |
|  5 | e      |      5 |        | b           |
|  6 | f      |      6 | b      |             |
+----+--------+--------+--------+-------------+
Finally, do the update:
UPDATE TempTable INNER JOIN TableA ON TempTable.ID = TableA.ID
SET TableA.Field3 = [TempTable].[Replacement]
WHERE (((TempTable.Replacement)='a'));
TableA after the update:
+----+--------+--------+--------+
| ID | Field1 | Field2 | Field3 |
+----+--------+--------+--------+
|  1 | a      |      1 |        |
|  2 | b      |      2 | a      |
|  3 | c      |      3 | a      |
|  4 | d      |      4 | b      |
|  5 | e      |      5 |        |
|  6 | f      |      6 | b      |
+----+--------+--------+--------+
Notes: in the make-table query, remember to sort TableA and Dupe the same way. Here we use the default sort of increasing ID for TableA, then grab the first record with a higher ID using the default sort again. The only reason I filtered to 'a' in the update query is that it made the make-table query simpler.
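For what it's worth, a possible single-step alternative is a domain function: unlike a subquery, DLookUp keeps the UPDATE updateable in Access. This is an untested sketch, and it leans on the IDs being consecutive, as they are in the example above:
UPDATE TableA
SET TableA.Field3 = DLookUp("Field3", "TableA", "ID = " & ([ID] + 1))
WHERE TableA.Field3 Is Null
  AND DLookUp("Field3", "TableA", "ID = " & ([ID] + 1)) = 'a';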

Oracle SQL: ANY comparison with a subquery raises "missing right parenthesis" error

The following query, using the ANY operator with a list of values, works fine:
SELECT Name, ID
from tblABC
where ID = ANY (1, 2, 3, 4, 5)
But when a subquery is used in the ANY comparison, a "missing right parenthesis" error occurs:
SELECT Name, ID
from tblABC
where ID = ANY (select ID from tblXYZ where ROWNUM <= 10 order by ID desc )
The subquery is just meant to return the 10 most recent ID entries from the table. Should there be a conversion to number, or is there a missing condition in this query?
The reason is the ORDER BY, which is not necessary anyway, since it is evaluated after COUNT STOPKEY (the operation that implements rownum < <constant>). The demonstration below shows the plan of a similar rownum-plus-order-by query:
select *
from table(dbms_xplan.display_cursor(format => 'BASIC +PREDICATE'));
| PLAN_TABLE_OUTPUT |
| :----------------------------------------------------------------------- |
| EXPLAINED SQL STATEMENT: |
| ------------------------ |
| select /*+ gather_plan_statistics */ * from t where rownum < 5 order by |
| 1 asc |
| |
| Plan hash value: 846588679 |
| |
| ------------------------------------ |
| | Id | Operation | Name | |
| ------------------------------------ |
| | 0 | SELECT STATEMENT | | |
| | 1 | SORT ORDER BY | | |
| |* 2 | COUNT STOPKEY | | |
| | 3 | TABLE ACCESS FULL| T | |
| ------------------------------------ |
| |
| Predicate Information (identified by operation id): |
| --------------------------------------------------- |
| |
| 2 - filter(ROWNUM<5) |
| |
If you are on Oracle 12c or later, you may use FETCH FIRST:
select *
from dual
where 1 = any(select l from t order by 1 asc fetch first 4 rows only)
| DUMMY |
| :---- |
| X |
Or row_number() for older versions:
select *
from dual
where 1 = any (
select l
from (
select l, row_number() over(order by l asc) as rn
from t
)
where rn < 5
)
| DUMMY |
| :---- |
| X |
db<>fiddle here
It is the ORDER BY part; it is not supported within subqueries like this.
Just remove it. You don't need it for the comparison anyway.
SELECT Name, ID
from tblABC
where ID = ANY (select ID from tblXYZ where ROWNUM <= 10 )
You can use FETCH FIRST <n> ROWS ONLY instead of using the old ROWNUM in the subquery.
For example:
SELECT Name, ID
from tblABC
where ID = ANY (select ID
from tblXYZ
order by ID desc
fetch first 10 rows only)
See running example at db<>fiddle.

Associate usage records with the corresponding usage plan in BigQuery

Customer's resource usage:
+-------+-------------+-----------------------+
| usage | customer_id | timestamp |
+-------+-------------+-----------------------+
| 10 | 1 | 2019-01-12T01:00:00 |
| 16 | 1 | 2019-02-12T02:00:00 |
| 26 | 1 | 2019-03-12T03:00:00 |
| 24 | 1 | 2019-04-12T04:00:00 |
| 4 | 1 | 2019-05-15T01:00:00 |
+-------+-------------+-----------------------+
This table shows the usage reported every hour for every customer. Minutes and seconds are always zero.
Customer's plan change log:
+--------+-------------+-----------------------+
| plan | customer_id | timestamp |
+--------+-------------+-----------------------+
| A | 1 | 2018-12-12T01:24:00 |
| B | 1 | 2019-01-12T02:31:00 |
| C | 1 | 2019-03-12T03:53:00 |
+--------+-------------+-----------------------+
When a customer changes his usage plan, the action is stored in a change log.
Desired result: each usage record associated with the corresponding usage plan.
+-------+-------------+--------+-----------------------+
| usage | customer_id | plan | timestamp |
+-------+-------------+--------+-----------------------+
| 10 | 1 | A | 2019-01-05T01:00:00 |
| 16 | 1 | B | 2019-02-12T02:00:00 |
| 26 | 1 | C | 2019-03-10T03:00:00 |
| 24 | 1 | C | 2019-04-12T04:00:00 |
| 4 | 1 | C | 2019-05-15T01:00:00 |
+-------+-------------+--------+-----------------------+
What I have tried: to determine the plan for a specific usage record, I take the timestamp of that record and look for the most recent plan change record in the usage plan log:
SELECT
  customer_id,
  plan,
  timestamp,
  ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY timestamp DESC) seqnum
FROM
  `project.dataset.table`
WHERE seqnum = 1 AND timestamp <= timestamp_of_the_usage_record
However, I am not sure how to combine that with the usage table. I tried:
WITH log AS (
  SELECT
    customer_id,
    plan,
    timestamp,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY timestamp DESC) seqnum
  FROM
    `project.dataset.plan_change_log`
)
SELECT
  t1.customer_id,
  log.plan,
  t1.usage,
  t1.timestamp
FROM
  `project.dataset.usage` t1
FULL JOIN log
  ON log.customer_id = t1.customer_id AND log.timestamp <= t1.timestamp AND seqnum = 1
The result table has fewer rows than the original usage table because of the join condition, but the number of rows should stay the same. Any ideas how to solve this?
You were on the right track, although the data in your example is a bit off for the first and third lines of the end result.
with data as (
  SELECT
    t1.customer_id,
    log.plan,
    t1.usage,
    t1.timestamp,
    log.timestamp as logt,
    ROW_NUMBER() OVER (PARTITION BY t1.customer_id, t1.timestamp ORDER BY log.timestamp DESC) seqnum
  FROM
    resource t1
  FULL JOIN log
    ON log.customer_id = t1.customer_id AND log.timestamp <= t1.timestamp
)
select * from data where seqnum = 1
You want to create the sequence number on the result of the join, not before.
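A small variant of the same idea (a sketch, using the table names from the question): a LEFT JOIN guarantees exactly one output row per usage row, even for a customer with no plan-change record before their first usage (plan comes out NULL in that case):
with data as (
  select
    u.usage,
    u.customer_id,
    l.plan,
    u.timestamp,
    row_number() over (partition by u.customer_id, u.timestamp
                       order by l.timestamp desc) as seqnum
  from `project.dataset.usage` u
  left join `project.dataset.plan_change_log` l
    on l.customer_id = u.customer_id and l.timestamp <= u.timestamp
)
select * except(seqnum) from data where seqnum = 1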

Calculating consecutive ranges of dates with a value in Hive

I want to know if it is possible to calculate the consecutive ranges of a specific value for a group of IDs and return the calculated value(s) for each one.
Given the following data:
+----+----------+--------+
| ID | DATE_KEY | CREDIT |
+----+----------+--------+
| 1 | 8091 | 0.9 |
| 1 | 8092 | 20 |
| 1 | 8095 | 0.22 |
| 1 | 8096 | 0.23 |
| 1 | 8098 | 0.23 |
| 2 | 8095 | 12 |
| 2 | 8096 | 18 |
| 2 | 8097 | 3 |
| 2 | 8098 | 0.25 |
+----+----------+--------+
I want the following output:
+----+-------------------------------+
| ID | RANGE_DAYS_CREDIT_LESS_THAN_1 |
+----+-------------------------------+
| 1 | 1 |
| 1 | 2 |
| 1 | 1 |
| 2 | 1 |
+----+-------------------------------+
In this case, the ranges are the consecutive days with credit less than 1. If there is a gap in the date_key column, the range must not extend across it to the next value, as with ID 1 between date keys 8096 and 8098.
Is it possible to do this with windowing functions in Hive?
Thanks in advance!
You can do this with a running sum that classifies rows into groups, incrementing by 1 every time a credit >= 1 row is found (in date_key order). After that it is just a group by.
select id, count(*) as range_days_credit_lt_1
from (select t.*,
             sum(case when credit < 1 then 0 else 1 end)
                 over (partition by id order by date_key) as grp
      from tbl t
     ) t
where credit < 1
group by id, grp
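Note that the running sum above only starts a new group when a credit >= 1 row appears; it does not break a range at a gap in date_key (ID 1, 8096 -> 8098), which the question requires. A variant sketch using the classic date-minus-row_number trick handles both (table and column names as above):
select id, count(*) as range_days_credit_lt_1
from (select id, date_key,
             -- constant within each run of consecutive date_keys
             date_key - row_number() over (partition by id order by date_key) as grp
      from tbl
      where credit < 1
     ) t
group by id, grp;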
The key is to collapse each consecutive sequence and compute its length. I struggled to achieve this in a relatively clumsy way:
with t_test as
(
    select num, row_number() over (order by num) as rn
    from
    (
        select explode(array(1,3,4,5,6,9,10,15)) as num
    ) nums  -- Hive requires an alias on derived tables
)
select length(sign) + 1 as run_length
from
(
    select explode(continue_sign) as sign
    from
    (
        -- mark gaps (d > 1) with 'v', concatenate, then split on 'v';
        -- note: collect_list does not guarantee ordering across reducers
        select split(concat_ws('', collect_list(if(d > 1, 'v', cast(d as string)))), 'v') as continue_sign
        from
        (
            select t0.num - t1.num as d
            from t_test t0
            join t_test t1 on t0.rn = t1.rn + 1
        ) diffs
    ) signs
) runs
Get the previous number b in the sequence for each original a;
Check whether a-b > 1, which indicates a gap, and mark gaps with 'v';
Merge all the a-b values into one string, split it on 'v', and compute the length of each piece.
To carry the ID column through, another string encoding the id would have to be devised.

Impala query - optimize a query to get the unique users for a given key

I'm looking for ways to count the unique users that have a specific pkey, and also the unique users who don't have that pkey.
Here is a sample table:
userid | pkey | pvalue
-------+------+-------
U1     | x    | vx
U1     | y    | vy
U1     | z    | vz
U2     | y    | vy
U3     | z    | vz
U4     | null | null
This query gets me the expected results for the unique users who have pkey='y' and those who don't, but it turns out to be expensive:
WITH all_rows AS
( SELECT userid,
IF(pkey = 'y', pvalue, 'none') AS val,
SUM( IF(pkey='y',1,0) ) AS has_key
FROM some_table
GROUP BY userid, val)
SELECT val,
count(distinct(userid)) uniqs
FROM all_rows
WHERE has_key=1
GROUP BY val
UNION ALL
SELECT 'no_key_set' val,
count(distinct(userid)) uniqs
FROM all_rows a1 LEFT ANTI JOIN
all_rows a2 on (a1.userid = a2.userid and a2.has_key=1)
GROUP BY val;
Results:
val | uniqs
--------------------
vy | 2
no_key_set | 2
I'm looking to avoid using any temp tables, so is there a better way to achieve this?
Thanks!
Using EXPLAIN, you can observe that most of the cost in your original query is spent on the excessive GROUP BY aggregations rather than on the subqueries.
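For reference, prefixing any statement with EXPLAIN makes Impala print the plan without running it, e.g.:
EXPLAIN SELECT COUNT(DISTINCT userid) FROM some_table;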
Here is a straightforward implementation:
WITH t1 AS (
  SELECT pkey, COUNT(DISTINCT userid) AS cnt
  FROM some_table
  WHERE pkey IS NOT NULL
  GROUP BY pkey
), t2 AS (
  SELECT COUNT(DISTINCT userid) AS total_cnt
  FROM some_table
)
SELECT
  CONCAT('no_', pkey) AS pkey,
  (total_cnt - cnt) AS cnt
FROM t1, t2
UNION ALL
SELECT * FROM t1
t1 gets a table of unique user counts per pkey:
+------+-----+
| pkey | cnt |
+------+-----+
| x | 1 |
| z | 2 |
| y | 2 |
+------+-----+
t2 gets the total number of unique users:
+-----------+
| total_cnt |
+-----------+
| 4 |
+-----------+
We can use the result from t2 to get the complement table of t1:
+------+-----+
| pkey | cnt |
+------+-----+
| no_x | 3 |
| no_z | 2 |
| no_y | 2 |
+------+-----+
A final union of the two tables gives:
+------+-----+
| pkey | cnt |
+------+-----+
| no_x | 3 |
| no_z | 2 |
| no_y | 2 |
| x | 1 |
| z | 2 |
| y | 2 |
+------+-----+
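If only one key matters, as in the original question's expected output for pkey = 'y', the same idea collapses to a sketch like this (the 'vy' label is hardcoded for brevity):
WITH y_users AS (
  SELECT COUNT(DISTINCT userid) AS cnt
  FROM some_table
  WHERE pkey = 'y'
),
all_users AS (
  SELECT COUNT(DISTINCT userid) AS total_cnt
  FROM some_table
)
SELECT 'vy' AS val, cnt AS uniqs FROM y_users
UNION ALL
SELECT 'no_key_set' AS val, total_cnt - cnt AS uniqs
FROM y_users CROSS JOIN all_users;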