Oracle SQL query optimization

I have written a query on bdb to get, for the newest ACT, only the rows whose PRICE changed for their (DAY, INST) combination.
I created a table like this:
CREATE TABLE bdb(
  ACT   NUMBER(8)  NOT NULL,
  INST  NUMBER(8)  NOT NULL,
  DAY   DATE       NOT NULL,
  PRICE VARCHAR2(3),
  CURR  NUMBER(8,2),
  PRIMARY KEY (ACT, INST, DAY)
);
and used this block to populate it:
DECLARE
  t_day bdb.day%TYPE := '1-JAN-16'; -- implicit conversion; depends on NLS_DATE_FORMAT
  n     PLS_INTEGER;
BEGIN
  << act_loop >>
  FOR i IN 1..3 LOOP            -- number of ACT values: i
    << inst_loop >>
    FOR j IN 1..1000 LOOP       -- number of INST values: j
      t_day := '3-JAN-16';      -- restart the date range for each INST
      << day_loop >>
      FOR k IN 1..260 LOOP      -- number of days: k
        n := dbms_random.value(1,3);
        INSERT INTO bdb (ACT, INST, DAY, PRICE, CURR) VALUES (i, j, t_day, n, 10.3);
        t_day := t_day + 1;
      END LOOP day_loop;
    END LOOP inst_loop;
  END LOOP act_loop;
END;
/
Using this query I get only DAY, INST and PRICE:
select day, inst, price from bdb where act = (select max(act) from bdb)
minus
select day, inst, price from bdb where act = (select max(act) - 1 from bdb);
The one above is fast, but I want to get all the fields in an efficient way. The one I came up with is a bit slow:
select e1.*
from
  (select *
   from bdb
   where act = (select max(act) from bdb)
  ) e1,
  (select day, inst, price from bdb where act = (select max(act) from bdb)
   minus
   select day, inst, price from bdb where act = (select max(act) - 1 from bdb)
  ) e2
where e1.day = e2.day
  and e1.inst = e2.inst;
Can anyone suggest how to optimize this any further, or how to get the required output without joining the two subqueries? Simply, what I need is:
ACT INST DAY PRI CURR
------------------------------------
3 890 05-MAR-16 3 10.3
3 890 06-MAR-16 2 10.3
3 890 07-MAR-16 2 10.3
3 891 05-MAR-16 2 10.3
3 891 06-MAR-16 1 10.3
3 891 07-MAR-16 2 10.3
4 890 05-MAR-16 3 10.3
4 890 06-MAR-16 2 10.3
4 890 07-MAR-16 1 10.3
4 891 05-MAR-16 2 10.3
4 891 06-MAR-16 2 10.3
4 891 07-MAR-16 1 10.3
Here, for act=3, the rows
(890, 05-MAR-16) (890, 06-MAR-16) (890, 07-MAR-16)
(891, 05-MAR-16) (891, 06-MAR-16) (891, 07-MAR-16)
have prices
3, 2, 2
2, 1, 2
but when act=4 happens, the price values for
(890, 07-MAR-16)
(891, 06-MAR-16)
(891, 07-MAR-16)
change from what they were in act=3; the others do not change.
Ultimately, what I need is:
ACT INST DAY PRI CURR
------------------------------------
4 890 07-MAR-16 1 10.3
4 891 06-MAR-16 2 10.3
4 891 07-MAR-16 1 10.3

It looks like you're after the (day, inst, price) combinations that have a row where the act column holds the maximum act value of the whole table, but no corresponding row where the act column is one less than that maximum.
You could try this:
SELECT day,
inst,
price
FROM (SELECT day,
inst,
price,
act,
MAX(act) OVER () max_overall_act
FROM bdb)
WHERE act IN (max_overall_act, max_overall_act -1)
GROUP BY day, inst, price
HAVING MAX(CASE WHEN act = max_overall_act THEN 1 END) = 1
AND MAX(CASE WHEN act = max_overall_act - 1 THEN 1 END) IS NULL;
First of all, the subquery finds the maximum act value across the whole table.
Then we select all rows that have an act value that is the maximum value or one less than that.
After that, we group the rows and find out which ones have an act = max act val, but don't have an act = max act val -1.
However, from what you said in your post:
I have written a query to select * from bdb to get only updated values in PRICE for the combination of DAY,INST in the newest ACT
neither the query you came up with nor the above query in my answer seems to tally with what you are after.
I think instead, you're after something like:
SELECT act,
inst,
DAY,
price,
curr,
prev_price -- if desired
FROM (SELECT act,
inst,
DAY,
price,
curr,
LEAD(price) OVER (PARTITION BY inst, DAY ORDER BY act DESC) prev_price,
row_number() OVER (PARTITION BY inst, DAY ORDER BY act DESC) rn
FROM bdb)
WHERE rn = 1
AND prev_price != price;
What this does is use the LEAD() analytic function (based on descending act order) to find the price of the row with the previous act for each inst and day, along with a row number.
Then, to find the latest act row, we simply select the rows where the row number is 1 and where the previous price doesn't match the current price. You can then display both the current and the previous price, if you want to.
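For what it's worth, an equivalent formulation (just a sketch, untested against your data) uses LAG() in ascending act order plus an analytic MAX() in place of the row number:
SELECT act, inst, day, price, curr
FROM (SELECT act,
             inst,
             day,
             price,
             curr,
             LAG(price) OVER (PARTITION BY inst, day ORDER BY act) prev_price,
             MAX(act) OVER (PARTITION BY inst, day) max_act
      FROM bdb)
WHERE act = max_act
AND prev_price != price;
The filter act = max_act plays the same role as rn = 1, and rows with no earlier act drop out because prev_price is NULL, exactly as in the LEAD() version.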

Related

How to identify rows per group before a certain value gap?

I'd like to update a certain column in a table based on the difference in another column's value between neighboring rows in PostgreSQL.
Here is a test setup:
CREATE TABLE test(
main INTEGER,
sub_id INTEGER,
value_t INTEGER);
INSERT INTO test (main, sub_id, value_t)
VALUES
(1,1,8),
(1,2,7),
(1,3,3),
(1,4,85),
(1,5,40),
(2,1,3),
(2,2,1),
(2,3,1),
(2,4,8),
(2,5,41);
My goal is to determine, in each group main, starting from sub_id 1 and checking in ascending sub_id order, which value in diff exceeds a certain threshold (e.g. > 10 or < -10). Until the threshold is reached, I would like to flag every passed row AND the one row where the condition is FALSE by filling the column newval with a value, e.g. 1.
Should I use a loop or are there smarter solutions?
The task description in pseudocode:
FOR i in GROUP [PARTITION BY main ORDER BY sub_id]:
DO until diff > 10 OR diff <-10
SET newval = 1 AND LEAD(newval) = 1
Basic SELECT
As fast as possible:
SELECT *, bool_and(diff BETWEEN -10 AND 10) OVER (PARTITION BY main ORDER BY sub_id) AS flag
FROM (
SELECT *, value_t - lag(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id) AS diff
FROM test
) sub;
Fine points
Your thought model revolves around the window function lead(). But its counterpart lag() is a bit more efficient for the purpose, since there is no off-by-one error when including the row before the big gap. Alternatively, use lead() with inverted sort order (ORDER BY sub_id DESC); see the sketch after these notes.
To avoid NULL for the first row in the partition, provide value_t as the default in the 3rd parameter, which makes the diff 0 instead of NULL. Both lead() and lag() have that capability.
diff BETWEEN -10 AND 10 is slightly faster than @ diff < 11 (and clearer and more flexible, too). (@ is the "absolute value" operator, equivalent to the abs() function.)
bool_or() or bool_and() in the outer window function is probably the cheapest way to mark all rows up to the big gap.
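A minimal sketch of that lead() alternative, assuming the same test table (for illustration only; the lag() form above remains the recommendation):
SELECT *, bool_and(diff BETWEEN -10 AND 10) OVER (PARTITION BY main ORDER BY sub_id) AS flag
FROM (
   SELECT *, value_t - lead(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id DESC) AS diff
   FROM test
   ) sub;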
Your UPDATE
Until the threshold is reached I would like to flag every passed row AND the one row where the condition is FALSE by filling column newval with a value e.g. 1.
Again, as fast as possible.
UPDATE test AS t
SET newval = 1
FROM (
SELECT main, sub_id
, bool_and(diff BETWEEN -10 AND 10) OVER (PARTITION BY main ORDER BY sub_id) AS flag
FROM (
SELECT main, sub_id
, value_t - lag(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id) AS diff
FROM test
) sub
) u
WHERE (t.main, t.sub_id) = (u.main, u.sub_id)
AND u.flag;
Fine points
Computing all values in a single query is typically substantially faster than a correlated subquery.
The added WHERE condition AND u.flag makes sure we only update rows that actually need an update.
If some of the rows may already have the right value in newval, add another clause to avoid those empty updates, too: AND t.newval IS DISTINCT FROM 1
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
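Spelled out, the final WHERE clause of the UPDATE above would then read (same statement, just with that clause added):
WHERE (t.main, t.sub_id) = (u.main, u.sub_id)
AND u.flag
AND t.newval IS DISTINCT FROM 1;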
SET newval = 1 assigns a constant (even though we could use the actually calculated value in this case), that's a bit cheaper.
db<>fiddle here
Your question was hard to comprehend, the "value_t" column was irrelevant to the question, and you forgot to define the "diff" column in your SQL.
Anyhow, here's your solution:
WITH data AS (
SELECT main, sub_id, value_t
, abs(value_t
- lead(value_t) OVER (PARTITION BY main ORDER BY sub_id)) > 10 is_evil
FROM test
)
SELECT main, sub_id, value_t
, CASE max(is_evil::int)
OVER (PARTITION BY main ORDER BY sub_id
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
WHEN 1 THEN NULL ELSE 1 END newval
FROM data;
I'm using a CTE to prepare the data (computing whether a row is "evil"), and then the "max" window function is used to check if there were any "evil" rows before the current one, per partition.
EXISTS on an aggregating subquery:
UPDATE test u
SET value_t = NULL
WHERE EXISTS (
SELECT * FROM (
SELECT main,sub_id
, value_t , ABS(value_t - lag(value_t)
OVER (PARTITION BY main ORDER BY sub_id) ) AS absdiff
FROM test
) x
WHERE x.main = u.main
AND x.sub_id <= u.sub_id
AND x.absdiff >= 10
)
;
SELECT * FROM test
ORDER BY main, sub_id;
Result:
UPDATE 3
main | sub_id | value_t
------+--------+---------
1 | 1 | 8
1 | 2 | 7
1 | 3 | 3
1 | 4 |
1 | 5 |
2 | 1 | 3
2 | 2 | 1
2 | 3 | 1
2 | 4 | 8
2 | 5 |
(10 rows)

Group by difference from starting point of group

I have a lot of measurements in a Postgres database table and I need to split this set into groups whenever some value gets too far away from the "starting" point of the current group (more than some threshold). Sort order is determined by the id column.
Example: splitting with threshold = 1:
id measurements
---------------
1 1.5
2 1.4
3 1.8
4 2.6
5 3.7
6 3.5
7 3.0
8 2.6
9 2.5
10 2.8
Should be split in groups as follows:
id measurements group
---------------------
1 1.5 0 --- start new group
2 1.4 0
3 1.8 0
4 2.6 1 --- start new group because it is too far from 1.5
5 3.7 2 --- start new group because it is too far from 2.6
6 3.5 2
7 3.0 2
8 2.6 3 --- start new group because it is too far from 3.7
9 2.5 3
10 2.8 3
I can do this by writing a function with a LOOP, but I'm looking for a more efficient way. Performance is very important, as the actual table contains millions of rows.
Is it possible to achieve the goal by using PARTITION OVER, CTE or any other kind of SELECT?
Is it possible to achieve the goal by using PARTITION OVER, CTE or any other kind of SELECT?
This is an inherently procedural problem. Depending on where you start, all later rows can end up in a different group and / or with a different group value. Window functions (using the PARTITION clause) are no good for this.
You can use a recursive CTE:
WITH RECURSIVE rcte AS (
(
SELECT id
, measurement
, measurement - 1 AS grp_min
, measurement + 1 AS grp_max
, 1 AS grp
FROM tbl
ORDER BY id
LIMIT 1
)
UNION ALL
(
SELECT t.id
, t.measurement
, CASE WHEN t.same_grp THEN r.grp_min ELSE t.measurement - 1 END -- AS grp_min
, CASE WHEN t.same_grp THEN r.grp_max ELSE t.measurement + 1 END -- AS grp_max
, CASE WHEN t.same_grp THEN r.grp ELSE r.grp + 1 END -- AS grp
FROM rcte r
CROSS JOIN LATERAL (
SELECT *, t.measurement BETWEEN r.grp_min AND r.grp_max AS same_grp
FROM tbl t
WHERE t.id > r.id
ORDER BY t.id
LIMIT 1
) t
)
)
SELECT id, measurement, grp
FROM rcte;
It's elegant. And decently fast. But only about as fast as - or even slower than - a procedural language function with a single loop over the set - when implemented efficiently:
CREATE OR REPLACE FUNCTION f_measurement_groups(_threshold numeric = 1)
RETURNS TABLE (id int, grp int, measurement numeric) AS
$func$
DECLARE
_grp_min numeric;
_grp_max numeric;
BEGIN
grp := 0; -- init
FOR id, measurement IN
SELECT * FROM tbl t ORDER BY t.id
LOOP
IF measurement BETWEEN _grp_min AND _grp_max THEN
RETURN NEXT;
ELSE
SELECT INTO grp , _grp_min , _grp_max
grp + 1, measurement - _threshold, measurement + _threshold;
RETURN NEXT;
END IF;
END LOOP;
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_measurement_groups(); -- optionally supply different threshold
db<>fiddle here
My money is on the procedural function.
Typically, set-based solutions are faster. But not when solving an inherently procedural problem.
Related:
GROUP BY and aggregate sequential numeric values
You seem to be starting a group when the difference between rows exceeds 0.5. If I assume you have an ordering column, you can use lag() and a cumulative sum to get your groups:
select t.*,
       count(*) filter (where prev_value < value - 0.5)
           over (order by <ordering col>) as grouping
from (select t.*,
             lag(value) over (order by <ordering col>) as prev_value
      from t
     ) t
One way to attack this problem is by using a recursive CTE. This example is written using SQL Server syntax (because I don't work with postgres). It should be straightforward to translate, however.
-- Table #Test:
-- sequenceno measurements
-- ----------- ------------
-- 1 1.5
-- 2 1.4
-- 3 1.8
-- 4 2.6
-- 5 3.7
-- 6 3.5
-- 7 3.0
-- 8 2.6
-- 9 2.5
-- 10 2.8
WITH datapoints
AS
(
SELECT sequenceno,
measurements,
startmeasurement = measurements,
groupno = 0
FROM #Test
WHERE sequenceno = 1
UNION ALL
SELECT sequenceno = A.sequenceno + 1,
measurements = B.measurements,
startmeasurement =
CASE
WHEN abs(B.measurements - A.startmeasurement) >= 1 THEN B.measurements
ELSE A.startmeasurement
END,
groupno =
A.groupno +
CASE
WHEN abs(B.measurements - A.startmeasurement) >= 1 THEN 1
ELSE 0
END
FROM datapoints as A
INNER JOIN #Test as B
ON A.sequenceno + 1 = B.sequenceno
)
SELECT sequenceno,
measurements,
groupno
FROM datapoints
ORDER BY
sequenceno
-- Output:
-- sequenceno measurements groupno
-- ----------- --------------- -------
-- 1 1.5 0
-- 2 1.4 0
-- 3 1.8 0
-- 4 2.6 1
-- 5 3.7 2
-- 6 3.5 2
-- 7 3.0 2
-- 8 2.6 3
-- 9 2.5 3
-- 10 2.8 3
Note that I added a "sequenceno" column in the starting table because relational tables are considered to be unordered sets. Also, if the number of input values is too large (over 90-100), you may have to adjust the MAXRECURSION value (at least in SQL Server).
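For example, appending a query hint to the final SELECT lifts SQL Server's default cap of 100 recursion levels (0 means unlimited):
SELECT sequenceno, measurements, groupno
FROM datapoints
ORDER BY sequenceno
OPTION (MAXRECURSION 0)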
Additional note: Just noticed that the original question mentions that there are millions of records in the input data sets. The CTE approach would only work if that data could be broken up into manageable chunks.

How to merge two tables with one to many relationship

I have two tables, to maintain orders and ordered products.
Table 1: ORDERS
"CREATE TABLE IF NOT EXISTS ORDERS("
"id_order INTEGER PRIMARY KEY AUTOINCREMENT,
"o_date TEXT,"
"o_seller TEXT,"
"o_buyer TEXT,"
"o_shipping INTEGER,"
"d_amount INTEGER,"
"d_comm INTEGER,"
"d_netAmount INTEGER)"
Table 2: ORDERED_PRODUCTS
"CREATE TABLE IF NOT EXISTS dispatch_products("
"id_order INTEGER NOT NULL REFERENCES ORDERS(id_order),"
"product_name INTEGER,"
"quantity INTEGER,"
"rate INTEGER)"
I tried to join these two tables using the following query:
SELECT *
FROM ORDERS a
INNER JOIN ORDERED_PRODUCTS b
ON a.id_order = b.id_order
WHERE a.o_buyer = 'abc'
The issue is with the entries with multiple products in table 2.
The output I'm getting is like below:
order_ID date seller buyer Ship amt comm nAmt Prod Qty Rate
1 A x 5 100 5 115 Scale 10 10
2 B abc 10 100 5 115 pen 5 10
2 B abc 10 100 5 115 paper 10 5
3 C xyz 10 100 5 220 book 5 20
3 C xyz 10 100 5 220 stapl 10 10
expected output:
order_ID date seller buyer Ship amt comm nAmt Prod Qty Rate
1 A x 5 100 5 115 Scale 10 10
2 B abc 10 100 5 115 pen 5 10
Paper 10 5
3 C xyz 10 100 5 220 Book 5 20
Stapl 10 10
Databases don't really work like that; you got what you asked for, and with no duplicates (all the rows are different). You're looking at the columns of data that came from ORDERS and saying "oh, the data is duplicated", but it isn't - it's joined "in context".
Imagine I gave you just one of your sample rows from your expected output:
Paper 10 5
I promise I just copy-pasted that.
What order is it from?
No idea. You've lost the context, so it could be from any order. Rows are individual entities that stand alone as a set of data, without reference to any other row. This is why the same order info needs to appear on each row. A database could be made to produce the expected output you asked for, but it would be really quite complex in a low-end database like SQLite. More important to me is to point out why there's a difference between what you thought the query would give you and what it gave you, as I think that's the real problem: the query gave you what it was supposed to; there's no fault in it. It's more a faulty assumption of what you'd get.
If you're trying to prepare a report that uses the order as some kind of header, select them individually in the front-end app. Select ALL the orders, then one by one (order by order) pull all the item detail out, building the report as you go:
myorders = dbquery("SELECT * FROM ORDERS")
for each(order o in myorders)
print(o.header)
details = dbquery("SELECT * FROM dispatch_products where id_order = ?", o.id)
for each(detail d in details)
print(d.prod, d.qty, d.rate)
Here's a way to make the DB do it, but you'll need a version of SQLite that supports window functions (they arrived in 3.25, so 3.10 doesn't) or another DB (SQL Server 2008+, Oracle 9i+, or another big-name DB from the last 10 or so years, or a very recent MySQL):
SELECT
  CASE WHEN rn = 1 THEN d.o_date END AS o_date,
  CASE WHEN rn = 1 THEN d.o_seller END AS o_seller,
  CASE WHEN rn = 1 THEN d.o_buyer END AS o_buyer,
  CASE WHEN rn = 1 THEN d.o_shipping END AS o_shipping,
  CASE WHEN rn = 1 THEN d.d_amount END AS d_amount,
  CASE WHEN rn = 1 THEN d.d_comm END AS d_comm,
  CASE WHEN rn = 1 THEN d.d_netAmount END AS d_netAmount,
  d.product_name,
  d.quantity,
  d.rate
FROM (
  SELECT o.*, op.product_name, op.quantity, op.rate,
         row_number() OVER (PARTITION BY o.id_order ORDER BY op.product_name, op.quantity, op.rate) rn
  FROM ORDERS o
  INNER JOIN ORDERED_PRODUCTS op
    ON o.id_order = op.id_order
  WHERE o.o_buyer = 'abc'
) d
ORDER BY d.id_order, d.rn
We basically take your query, add on a row number that restarts every time the order id changes, and only show data from the ORDERS table where the row number is 1. If your SQLite doesn't have row_number you can fake it: How to use ROW_NUMBER in sqlite - which I'll leave as an exercise for the reader :)
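For completeness, the usual trick is a correlated count; a sketch only (assuming the join above, and that product_name values are unique within an order, so the count is a valid stand-in for row_number):
SELECT o.*, op.product_name, op.quantity, op.rate,
       (SELECT COUNT(*)
        FROM ORDERED_PRODUCTS op2
        WHERE op2.id_order = op.id_order
          AND op2.product_name <= op.product_name) AS rn
FROM ORDERS o
INNER JOIN ORDERED_PRODUCTS op
  ON o.id_order = op.id_order
WHERE o.o_buyer = 'abc'
The outer CASE WHEN rn = 1 ... query then works on top of this subquery unchanged.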

Update SQL column using Rank() function

I have a table with existing data. For each unique value in the first column of this table, we have a column that is supposed to be in sequential order, but this table has gotten out of order. I want to run a SQL statement that will put this second column back in order. I was able to see the results I want with this SQL:
select FORMULA_ID, ATTRIB_CODE, ATTRIB_VAL, ATTRIB_ORDER,
rank() over (partition by formula_id order by attrib_code, attrib_val) AS WANT_THIS
from ATTRIB
Which yields:
FORMULA_ID ATTRIB_CODE ATTRIB_VAL ATTRIB_ORDER WANT_THIS
----------- -------------------- ---------------- ------------ ---------
2791 C_BRAND ROMAN HOLIDAY 3 1
2791 C_ENDUSE DINNER 4 2
2791 C_ENDUSE SNACK 6 3
2791 C_ENDUSER 10-17 7 4
2791 C_PRODTYPE SALAD 13 5
2791 C_RELIG ANY 14 6
2821 C_ALLERGEN PEANUT 1 1
2821 C_ALLERGEN SOY 2 2
2821 C_BRAND ROMAN HOLIDAY 1 3
2821 C_ENDUSE DINNER 1 4
As you can see, the WANT_THIS column orders the rows and resets to 1 when it gets to a new FORMULA_ID. But I don't know how to convert this into an UPDATE statement that will actually put the value in WANT_THIS into the column ATTRIB_ORDER. Is there a way to convert the SQL above into an UPDATE statement?
This is one way:
WITH CTE AS
(
SELECT FORMULA_ID,
ATTRIB_CODE,
ATTRIB_VAL,
ATTRIB_ORDER,
RANK() OVER (PARTITION BY formula_id
ORDER BY attrib_code, attrib_val) AS WANT_THIS
FROM ATTRIB
)
UPDATE CTE
SET ATTRIB_ORDER = WANT_THIS;
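Note that updating through a CTE like this is SQL Server syntax. On Oracle, a MERGE achieves the same thing; a sketch, assuming (FORMULA_ID, ATTRIB_CODE, ATTRIB_VAL) uniquely identifies a row:
MERGE INTO ATTRIB t
USING (SELECT FORMULA_ID, ATTRIB_CODE, ATTRIB_VAL,
              RANK() OVER (PARTITION BY FORMULA_ID
                           ORDER BY ATTRIB_CODE, ATTRIB_VAL) AS WANT_THIS
       FROM ATTRIB) s
ON (t.FORMULA_ID = s.FORMULA_ID
    AND t.ATTRIB_CODE = s.ATTRIB_CODE
    AND t.ATTRIB_VAL = s.ATTRIB_VAL)
WHEN MATCHED THEN
UPDATE SET t.ATTRIB_ORDER = s.WANT_THIS;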
This should work on MySQL (8.0 or later, since it relies on the rank() window function):
UPDATE attrib
LEFT JOIN (
SELECT formula_id, attrib_code, attrib_val,
rank() over (partition by formula_id order by attrib_code, attrib_val)
want_this FROM attrib
) AS new_values
ON
attrib.formula_id = new_values.formula_id AND
attrib.attrib_code = new_values.attrib_code AND
attrib.attrib_val = new_values.attrib_val
SET
attrib_order = new_values.want_this
Short description: We are updating the attrib table. First we calculate new_values in a subquery. Then we connect (LEFT JOIN) the subquery to the existing attrib table. Once the connection is made, we know exactly which row each want_this value should be applied to. The ON condition is long here; it would be better to use a unique identifier if possible.

Select unique records [duplicate]

This question already has answers here:
Fetch the rows which have the Max value for a column for each distinct value of another column
(35 answers)
Closed 2 years ago.
I'm working with a table that has about 50 columns and 100,000 rows.
One column, call it TypeID, has 10 possible values:
1 through 10.
There can be 10,000 records with TypeID = 1, 10,000 records with TypeID = 2, and so on.
I want to run a SELECT statement that will return 1 record of each distinct TypeID.
So something like
TypeID JobID Language BillingDt etc
------------------------------------------------
1 123 EN 20130103 etc
2 541 FR 20120228 etc
3 133 FR 20110916 etc
4 532 SP 20130822 etc
5 980 EN 20120714 etc
6 189 EN 20131009 etc
7 980 SP 20131227 etc
8 855 EN 20111228 etc
9 035 JP 20130615 etc
10 103 EN 20100218 etc
I've tried:
SELECT DISTINCT TypeID, JobID, Language, BillingDt, etc
But that produces multiple rows with the same TypeID value. I get a whole bunch of '4's, '10's, and so on.
This is an Oracle database that I'm working with.
Any advice would be greatly appreciated; thanks!
You can use ROW_NUMBER() to get the top n per group:
SELECT TypeID,
JobID,
Language,
BillingDt,
etc
FROM ( SELECT TypeID,
JobID,
Language,
BillingDt,
etc,
ROW_NUMBER() OVER(PARTITION BY TypeID ORDER BY JobID) RowNumber
FROM T
) T
WHERE RowNumber = 1;
SQL Fiddle
You may need to change the ORDER BY clause to fit your requirements; as you've not said how to pick one row per TypeID, I had to guess.
WITH RankedQuery AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY TypeID ORDER BY [ordercolumn] DESC) AS rn
FROM [table]
)
SELECT *
FROM RankedQuery
WHERE rn = 1;
This will return the top row for each TypeID; you can add an ORDER BY if you want a specific row, not just any.