How to calculate a previous cumulative product in SQL (detail below) - sql

I'm trying to estimate a new scrap rate (loss factor) in a production line using SQL.
Basically, there are several operations in one machine, with a qty in and a qty out for each operation.
Each operation takes, as its qty in, the qty out of the previous operation.
And this scrap rate (loss factor) needs to be carried over across the operations.
So operation 1 has qty out / qty in from operation 1 = scrap rate 1; operation 2 has its qty out / qty in * scrap rate 1; and so on.
I know I can use "exp(sum(log(column)) OVER (ORDER BY column))" to get the overall, let's say machine-level, scrap rate, but I need the cumulative value at the machine-operation level.
Hope the attached image explains the desired outcome better.
I'm struggling to calculate column G (OutFactorAccumulated) in the image. Hope someone can help me.
Data and expected results example

I think you are most of the way there.
The final step is simply to take the OutFactorAccumulated column and apply a similar windowed function over it to calculate the next column, e.g.
MIN(OutFactorAccumulated) OVER (PARTITION BY Machine).
Note that the other windowed function (the SUM) should also have PARTITION BY Machine in its window, to ensure that each machine only uses its own data.
Here is a db<>fiddle with the example code below in SQL Server/T-SQL.
The last CTE, `MachineData_with_ExpQtyOut`, is the one with the windowed MIN function that does the calculation.
In the Fiddle I have also added a second machine B1 with some data I made up - to demonstrate it works with multiple machines.
(Note lots of CAST AS decimal(14,10) to match your data - there's probably a better way to do this).
CREATE TABLE #MachData (Machine nvarchar(10), Operation int, QtyIn int, QtyOut int, PRIMARY KEY (Machine, Operation));
INSERT INTO #MachData (Machine, Operation, QtyIn, QtyOut) VALUES
(N'A1', 1, 100, 100),
(N'A1', 2, 100, 95),
(N'A1', 3, 95, 95),
(N'A1', 4, 95, 94),
(N'A1', 5, 94, 86),
(N'A1', 6, 86, 66),
(N'A1', 7, 66, 66),
(N'A1', 8, 66, 66),
(N'A1', 9, 66, 66);
WITH MachData_with_Factors AS
(SELECT Machine,
Operation,
QtyIn,
QtyOut,
CAST(1 - CAST(QtyOut AS decimal(14,10))/CAST(QtyIn AS decimal(14,10)) AS decimal(14,10)) AS LossFactor,
CAST(CAST(QtyOut AS decimal(14,10))/CAST(QtyIn AS decimal(14,10)) AS decimal(14,10)) AS OutFactor
FROM #MachData
),
MachineData_with_Acc AS
(SELECT *,
CAST(exp(SUM(log(OutFactor)) OVER (PARTITION BY Machine ORDER BY Operation)) AS decimal(14,10)) AS OutFactorAccumulated
FROM MachData_with_Factors
),
MachineData_with_ExpQtyOut AS
(SELECT *,
CAST(OutFactorAccumulated * 100.0 / MIN(OutFactorAccumulated) OVER (PARTITION BY machine) AS decimal(14,10)) AS NewExpectedQtyOut
FROM MachineData_with_Acc
)
SELECT *
FROM MachineData_with_ExpQtyOut
ORDER BY Machine, Operation;
Results are as below
Machine Operation QtyIn QtyOut LossFactor OutFactor OutFactorAccumulated NewExpectedQtyOut
------- --------- ----- ------ ------------ ------------ -------------------- -----------------
A1 1 100 100 0.0000000000 1.0000000000 1.0000000000 151.5151515152
A1 2 100 95 0.0500000000 0.9500000000 0.9500000000 143.9393939394
A1 3 95 95 0.0000000000 1.0000000000 0.9500000000 143.9393939394
A1 4 95 94 0.0105263158 0.9894736842 0.9400000000 142.4242424242
A1 5 94 86 0.0851063830 0.9148936170 0.8600000000 130.3030303030
A1 6 86 66 0.2325581395 0.7674418605 0.6600000000 100.0000000000
A1 7 66 66 0.0000000000 1.0000000000 0.6600000000 100.0000000000
A1 8 66 66 0.0000000000 1.0000000000 0.6600000000 100.0000000000
A1 9 66 66 0.0000000000 1.0000000000 0.6600000000 100.0000000000
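One caveat worth noting: LOG(0) is undefined, so if an operation can ever scrap everything (QtyOut = 0, hence OutFactor = 0) the EXP(SUM(LOG(...))) trick will raise an error. A common guard (a sketch using the same column names; not needed for the sample data above) is to check the running MIN first:
CAST(CASE WHEN MIN(OutFactor) OVER (PARTITION BY Machine ORDER BY Operation) = 0
          THEN 0
          ELSE EXP(SUM(LOG(NULLIF(OutFactor, 0))) OVER (PARTITION BY Machine ORDER BY Operation))
     END AS decimal(14,10)) AS OutFactorAccumulated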

Related

Return a list of IDs in which geometries connect to each other with SQL (Amazon Redshift)

I have a list of geometry IDs whose geometries are Linestrings. I want to create a query that returns, as a LISTAGG, the IDs of the objects that connect to each other.
Here is an example of the geometries I have, with their IDs:
I want to input all the IDs into SQL and return all geometries that are connected. So I'm expecting something like this:
| IDs |
|:----------------------------------------------------------|
|0001, 0002, 0003, 0004, 0005, 0006, 0007 |
|0008, 0009, 0010, 0011, 0012, 0013, 0014, 0015, 0016, 0017, 0018, 0019|
I have a table with a set of IDs and their respective Linestrings as this:
| ID | Vector |
|:---|:----------------------------------|
|0001|Linestring(1 2, 2 3, 3 4, 4 5, 5 6)|
|0002|Linestring(6 7, 8 9) |
|0003|Linestring(9 10, 11 12, 13 14) |
|0004|Linestring(14 15, 16 17) |
|0005|Linestring(17 18, 18 19, 19 20) |
And so on.
I'm pretty new to Amazon Redshift and I'm struggling to find a way to do this.
I tried using ST_Buffer but then got stuck. Since the geometries connect to each other, maybe there is something that can bring back all the connections. I already have the A-B (start/end) points of each geometry, but now I need a way to get the whole set of links.
This is what I currently have:
CREATE TEMP TABLE geoList
(ID BIGINT,
AB_Coords VARCHAR);
INSERT INTO geoList
SELECT ID,
CONCAT(SPLIT_PART(vector,',',1),')') AS startEndPoint
FROM geometry
WHERE ID IN (0001, 0002, 0003, 0004, 0005, 0006, 0007, 0008, 0009, 0010, 0011, 0012, 0013, 0014, 0015, 0016, 0017, 0018, 0019);
INSERT INTO geoList
SELECT ID,
CONCAT('LINESTRING (',TRIM(SPLIT_PART(vector,',',LEN(vector)-LEN(REPLACE(vector,',',''))+1))) AS startEndPoint
FROM geometry
WHERE ID IN (0001, 0002, 0003, 0004, 0005, 0006, 0007, 0008, 0009, 0010, 0011, 0012, 0013, 0014, 0015, 0016, 0017, 0018, 0019);
SELECT *, g1.ID = g2.ID AS sameGeo FROM geoList g1
LEFT JOIN geoList g2
ON g1.AB_Coords = g2.AB_Coords
I'm stuck here...
Thanks!
The algorithm you need is clustering, specifically DBSCAN. With systems that support it natively, you just call DBSCAN and then aggregate by the cluster ID it returns to get the list of groups.
E.g. with BigQuery (a similar query can be done in PostgreSQL):
with data as (
select 1 id, 'Linestring(1 2, 2 3, 3 4, 4 5, 5 6)' line union all
select 2, 'Linestring(1 3, 4 1)' union all
select 6, 'Linestring(25 10, 30 15)' union all
select 7, 'Linestring(25 15, 30 10)'
)
select array_agg(id)
from (
select id, st_clusterdbscan(st_geogfromtext(line), 0, 1) over() as cluster_id
from data
) group by cluster_id
Result:
[1,2]
[6,7]
It looks like Redshift does not support DBSCAN though. One workaround is to use external code - there are implementations for most languages. Another option is to do it natively in SQL using a recursive CTE; see, e.g., the discussion about implementing DBSCAN this way in this paper: https://db.in.tum.de/~schuele/data/ssdbm2022.pdf
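For this particular case (linestrings that touch at shared endpoints), the grouping is really a connected-components problem, so a plain recursive CTE can do it without a full DBSCAN implementation. A minimal PostgreSQL-style sketch, assuming an edge table that lists every pair of linestring IDs sharing an endpoint, in both directions (e.g. built from the geoList self-join in the question); on Redshift you would need UNION ALL plus a cycle guard, and LISTAGG instead of string_agg:
WITH RECURSIVE walk (id, root) AS (
    SELECT id_a, id_a FROM edge                  -- every node starts as its own root
    UNION                                        -- UNION (not UNION ALL) de-duplicates and stops cycles
    SELECT e.id_b, w.root
    FROM walk w
    JOIN edge e ON e.id_a = w.id
)
SELECT string_agg(id::text, ', ' ORDER BY id) AS ids
FROM (
    SELECT id, MIN(root) AS component            -- smallest reachable ID labels the component
    FROM walk
    GROUP BY id
) labelled
GROUP BY component;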

How to divide integers in Oracle SQL to produce floating point numbers?

I have just started learning Oracle SQL. I am trying to divide two columns of numeric datatype, which I think behaves the same as an integer. I want to create a new column of float datatype in the table, divide an existing numeric column by an integer, and put the result into that float column. I am using this code for the division and update part:
update Student set AVERAGE = TOTAL/3;
Here, TOTAL is the numeric column and AVERAGE is float. But when I print the table using:
select * from Student;
, the AVERAGE column shows rounded results of the division. I tried two solutions that I found on the internet:
update Student set AVERAGE = (TOTAL*1.00)/3;
And:
update Student set AVERAGE = cast(TOTAL as float(2))/3;
But neither is working. What am I doing wrong?
Here is the output I am getting:
ROLL_NO SNAME MATHS CHEM PHY TOTAL AVERAGE
---------- --------------- ---------- ---------- ---------- ---------- ----------
101 abcd 56 68 80 204 70
102 efgh 81 78 70 229 80
103 ijkl 69 73 78 220 70
104 mnop 90 89 92 271 90
105 qrst 80 89 79 248 80
First, you need to understand what the FLOAT datatype in Oracle means.
The Oracle FLOAT data type is a subtype of the NUMBER data type.
Syntax:
FLOAT(p)
p is the precision in binary bits.
The following formula is used to convert between binary and decimal precision: Decimal = 0.30103 * Binary
Now, according to the result you are getting, I think your column (AVERAGE) has datatype FLOAT(1).
If you need more precision, you need to alter your table with a higher binary precision value.
Let's take an example:
CREATE TABLE TEST (
f1 FLOAT,
f2 FLOAT(1),
f3 FLOAT(4),
f4 FLOAT(7)
);
INSERT
INTO
TEST(
f1,
f2,
f3,
f4
)
VALUES(
10 / 3,
10 / 3,
10 / 3,
10 / 3
);
select * from TEST;
Output:
db<>fiddle demo
If you do not provide any precision, Oracle uses the maximum precision (126 binary bits --> 37 decimal digits).
In the above example, the data types of the columns f1, f2, f3, and f4 are FLOAT, FLOAT(1), FLOAT(4), and FLOAT(7).
The corresponding precision in decimal digits of the columns f1, f2 <-- (your case), f3, and f4 is 37 (126 * 0.30103), 1 (1 * 0.30103) <-- (your case), 2 (4 * 0.30103), and 3 (7 * 0.30103).
So, the conclusion is: alter your table and change the precision of the AVERAGE column's FLOAT datatype according to your requirement.
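For example, a minimal sketch (table and column names taken from the question; FLOAT(24) is an arbitrary choice giving roughly 7 decimal digits):
ALTER TABLE Student MODIFY (AVERAGE FLOAT(24));   -- ~24 * 0.30103 = about 7 decimal digits
-- or, alternatively, switch the column to a scaled NUMBER instead of FLOAT:
-- ALTER TABLE Student MODIFY (AVERAGE NUMBER(7,2));
UPDATE Student SET AVERAGE = TOTAL / 3;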
Cheers!!
This is a little long for a comment.
The AVERAGE column is going to be displayed based on the datatype of that column. Oracle converts the "numbers" being divided so the result is accurate - internally, I think, using the NUMBER type.
You can run the following code to see that the division result is always the same:
select cast(10 as int) / cast(3 as int),
cast(10 as numeric) / cast(3 as numeric),
cast(10 as float) / cast(3 as float)
from dual;
So the data type of the operands doesn't make a difference.
On the other hand, the data type of the result does. These produce different results:
select cast(10 / 3 as int),
cast(10 / 3 as float),
cast(10 / 3 as number),
cast(10 / 3 as numeric(5, 1))
from dual;
In Oracle the NUMBER data type is already a floating point type. It's unusual in that it's a base-10 floating point number type so it's safe to use for calculations involving money, but it's still a floating point type. Docs here
It is possible to define a NUMBER which holds only integers by defining a subtype or a particular field as having 0 for the scale component, e.g.
nInt_value NUMBER(10,0);
or
SUBTYPE TEN_DIGIT_INTEGER_TYPE IS NUMBER(10,0);
in which case nInt_value will only be able to hold whole numbers of 10 digits or less.
Note that SUBTYPE is only available in PL/SQL - in other words, you can't define a SUBTYPE in a PL/SQL module and then use it as a database field. Docs here

Find a median among the nearest values using Oracle SQL

I have a set of data which consists of periodically collected values. I want to calculate a median using the 2 left and 2 right neighbors of the current value, for each element of the set.
For example, the set is:
21
22
23
-10
20
22
19
21
100
20
For the first value we pick 21, 22, 23, whose median is 22. So for 21 we have 22. For -10 we take 22, 23, -10, 20, 22; the median is 22.
I use this method to get rid of "deviant" values which are abnormal for this set.
I guess I should somehow use the MEDIAN analytic function. Something like this:
SELECT (SELECT median(d.value)
FROM my_set d
WHERE d.key_val = s.key_val
AND d.order_value BETWEEN s.order_value - 2 AND s.order_value + 2) median_val
,s.key_val
,s.order_value
FROM my_set s
I would be happy to see any other approaches, or improvements to this one.
You did not specify anything about your table structure so I'm just guessing from your SQL what fields there are and what they're supposed to mean, but consider an attempt like this one:
SELECT s1.key_val, s1.order_value, s1.value, MEDIAN(s2.value) as med
FROM my_set s1
LEFT OUTER JOIN my_set s2
ON s2.key_val = s1.key_val
AND (s1.order_value - 2) <= s2.order_value
AND s2.order_value <= (s1.order_value + 2)
GROUP BY s1.key_val, s1.order_value, s1.value
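If order_value can have gaps, one variant (a sketch along the same lines, using the same guessed column names) is to number the rows first and join on the row number, so the window is always exactly two rows on each side:
WITH numbered AS (
    SELECT key_val, order_value, value,
           ROW_NUMBER() OVER (PARTITION BY key_val ORDER BY order_value) AS rn
    FROM my_set
)
SELECT s1.key_val, s1.order_value, s1.value, MEDIAN(s2.value) AS med
FROM numbered s1
JOIN numbered s2
  ON s2.key_val = s1.key_val
 AND s2.rn BETWEEN s1.rn - 2 AND s1.rn + 2
GROUP BY s1.key_val, s1.order_value, s1.value
ORDER BY s1.key_val, s1.order_value;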

Can SQL Server perform an update on rows with a set operation on the aggregate max or min value?

I am a fairly experienced SQL Server developer but this problem has me REALLY stumped.
I have a FUNCTION. The function is referencing a table that is something like this...
PERFORMANCE_ID, JUDGE_ID, JUDGING_CRITERIA, SCORE
--------------------------------------------------
101, 1, 'JUMP_HEIGHT', 8
101, 1, 'DEXTERITY', 7
101, 1, 'SYNCHRONIZATION', 6
101, 1, 'SPEED', 9
101, 2, 'JUMP_HEIGHT', 6
101, 2, 'DEXTERITY', 5
101, 2, 'SYNCHRONIZATION', 8
101, 2, 'SPEED', 9
101, 3, 'JUMP_HEIGHT', 9
101, 3, 'DEXTERITY', 6
101, 3, 'SYNCHRONIZATION', 7
101, 3, 'SPEED', 8
101, 4, 'JUMP_HEIGHT', 7
101, 4, 'DEXTERITY', 6
101, 4, 'SYNCHRONIZATION', 5
101, 4, 'SPEED', 8
In this example there are 4 judges (with IDs 1, 2, 3, and 4) judging a performance (101) against 4 different criteria (JUMP_HEIGHT, DEXTERITY, SYNCHRONIZATION, SPEED).
(Please keep in mind that in my real data there are 10+ criteria and at least 6 judges.)
I want to aggregate the results in a score BY JUDGING_CRITERIA and then aggregate those into a final score by summing...something like this...
SELECT SUM (Avgs) FROM
(SELECT AVG(SCORE) Avgs
FROM PERFORMANCE_SCORES
WHERE PERFORMANCE_ID=101
GROUP BY JUDGING_CRITERIA) result
BUT... that is not quite what I want IN THAT I want to EXCLUDE from the AVG the highest and lowest values for each JUDGING_CRITERIA grouping. That is the part that I can't figure out. The AVG should be applied only to the MIDDLE values of the GROUPING FOR EACH JUDGING_CRITERIA. The HI value and the LO value for JUMP_HEIGHT should not be included in the average. The HI value and the LO value for DEXTERITY should not be included in the average. ETC.
I know this could be accomplished with a cursor to set the hi and lo for each criteria to NULL. But this is a FUNCTION and should be extremely fast.
I am wondering if there is a way to do this as a SET operation but still automatically exclude HI and LO from the aggregation?
Thanks for your help. I have a feeling it can probably be done with some advanced SQL syntax but I don't know it.
One last thing. This example is actually a simplification of the problem I am trying to solve. I have other constraints not mentioned here for the sake of simplicity.
Seth
EDIT:
- Moved the WHERE clause to inside the CTE.
- Removed JudgeID from the partition.
This would be my approach
;WITH Agg1 AS
(
SELECT PERFORMANCE_ID
,JUDGE_ID
,JUDGING_CRITERIA
,SCORE
,MinFind = ROW_NUMBER() OVER ( PARTITION BY PERFORMANCE_ID
,JUDGING_CRITERIA
ORDER BY SCORE ASC )
,MaxFind = ROW_NUMBER() OVER ( PARTITION BY PERFORMANCE_ID
,JUDGING_CRITERIA
ORDER BY SCORE DESC )
FROM PERFORMANCE_SCORES
WHERE PERFORMANCE_ID=101
)
SELECT AVG(Score)
FROM Agg1
WHERE MinFind > 1
AND MaxFind > 1
GROUP BY JUDGING_CRITERIA
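If you still want the single summed final score described in the question, a sketch of the last step (the same CTE, just wrapped in an outer SUM; the decimal cast is my addition so the average isn't truncated to an integer):
;WITH Agg1 AS
(
    SELECT PERFORMANCE_ID,
           JUDGE_ID,
           JUDGING_CRITERIA,
           SCORE,
           MinFind = ROW_NUMBER() OVER ( PARTITION BY PERFORMANCE_ID, JUDGING_CRITERIA ORDER BY SCORE ASC ),
           MaxFind = ROW_NUMBER() OVER ( PARTITION BY PERFORMANCE_ID, JUDGING_CRITERIA ORDER BY SCORE DESC )
    FROM PERFORMANCE_SCORES
    WHERE PERFORMANCE_ID = 101
)
SELECT SUM(AvgScore) AS FinalScore
FROM (
    SELECT JUDGING_CRITERIA,
           AVG(CAST(SCORE AS decimal(10,4))) AS AvgScore  -- cast so the average keeps decimals
    FROM Agg1
    WHERE MinFind > 1   -- drop one lowest score per criteria
      AND MaxFind > 1   -- drop one highest score per criteria
    GROUP BY JUDGING_CRITERIA
) PerCriteria;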

SQL - suppressing duplicate *adjacent* records

I need to run a Select statement (DB2 SQL) that does not pull adjacent row duplicates based on a certain field. In specific, I am trying to find out when data changes, which is made difficult because it might change back to its original value.
That is to say, I have a table that vaguely resembles the below, sorted by Letter and then by Date:
A, 5, 2009-01-01
A, 12, 2009-02-01
A, 12, 2009-03-01
A, 12, 2009-04-01
A, 9, 2009-05-01
A, 9, 2009-06-01
A, 5, 2009-07-01
And I want to get the results:
A, 5, 2009-01-01
A, 12, 2009-02-01
A, 9, 2009-05-01
A, 5, 2009-07-01
discarding adjacent duplicates but keeping the last row (despite it having the same number as the first row). The obvious:
Select Letter, Number, Min(Update_Date) from Table group by Letter, Number
does not work -- it doesn't include the last row.
Edit: As there seems to have been some confusion, I have clarified the month column into a date column. It was meant as a human-parseable short form, not as actual valid data.
Edit: The last row is not important BECAUSE it is the last row, but because it has a "new value" that is also an "old value". Grouping by NUMBER would wrap it in with the first row; it needs to remain a separate entity.
Depending on which DB2 you're on, there are analytic functions which can make this problem easy to solve. An example in Oracle is below, but the select syntax appears to be pretty similar.
create table t1 (c1 char, c2 number, c3 date);
insert into t1 VALUES ('A', 5, DATE '2009-01-01');
insert into t1 VALUES ('A', 12, DATE '2009-02-01');
insert into t1 VALUES ('A', 12, DATE '2009-03-01');
insert into t1 VALUES ('A', 12, DATE '2009-04-01');
insert into t1 VALUES ('A', 9, DATE '2009-05-01');
insert into t1 VALUES ('A', 9, DATE '2009-06-01');
insert into t1 VALUES ('A', 5, DATE '2009-07-01');
SELECT C1, C2, C3
FROM (SELECT C1, C2, C3,
             LAG(C2) OVER (PARTITION BY C1 ORDER BY C3) AS PRIOR_C2,
             LEAD(C2) OVER (PARTITION BY C1 ORDER BY C3) AS NEXT_C2
      FROM T1
     )
WHERE C2 <> PRIOR_C2
   OR PRIOR_C2 IS NULL -- to pick up the first value
ORDER BY C1, C3;
Running this gives:
C C2 C3
- ---------- -------------------
A 5 2009-01-01 00:00:00
A 12 2009-02-01 00:00:00
A 9 2009-05-01 00:00:00
A 5 2009-07-01 00:00:00
This is not possible with set based commands (i.e. group by and such).
You may be able to do this by using cursors.
Personally, I would get the data into my client application and do the filtering there.
The first thing you'd have to do is identify the sequence within which you wish to view/consider the data. Values of "Jan, Feb, Mar" don't help, because the data's not in alphabetical order. And what happens when you flip from Dec to Jan? Step 1: identify a sequence that uniquely defines each row with regards to your problem.
Next, you have to be able to compare item #x with item #x-1, to see if it has changed. If changed, include; if not changed, exclude. Trivial when using procedural code loops (cursors in SQL), but would you want to use those? They tend not to perform too well.
One SQL-based way to do this is to join the table to itself, with the join clause matching each row's SequenceVal to the previous row's SequenceVal (i.e. SequenceVal - 1). Throw in a comparison, make sure you don't toss the very first row of the set (where there is no x-1), and you're done, as in the sketch below. Note that performance may suck if the "SequenceVal" is not indexed.
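A sketch of that idea (DB2 syntax; the table and column names are assumptions based on the question, and ROW_NUMBER supplies the gap-free sequence):
WITH numbered AS (
    SELECT Letter, Number, Update_Date,
           ROW_NUMBER() OVER (PARTITION BY Letter ORDER BY Update_Date) AS Seq
    FROM MyTable
)
SELECT cur.Letter, cur.Number, cur.Update_Date
FROM numbered cur
LEFT JOIN numbered prev
       ON prev.Letter = cur.Letter
      AND prev.Seq = cur.Seq - 1
WHERE prev.Seq IS NULL              -- keep the very first row
   OR prev.Number <> cur.Number     -- keep rows where the value changed
ORDER BY cur.Letter, cur.Update_Date;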
Using an "EXCEPT" clause is one way to do it. See below for the solution. I've included all of my test steps here. First, I created a session table (this will go away after I disconnect from my database).
CREATE TABLE session.sample (
letter CHAR(1),
number INT,
update_date DATE
);
Then I imported your sample data:
IMPORT FROM sample.csv OF DEL INSERT INTO session.sample;
Verified that your sample information is in the database:
SELECT * FROM session.sample;
LETTER NUMBER UPDATE_DATE
------ ----------- -----------
A 5 01/01/2009
A 12 02/01/2009
A 12 03/01/2009
A 12 04/01/2009
A 9 05/01/2009
A 9 06/01/2009
A 5 07/01/2009
7 record(s) selected.
I wrote this with an EXCEPT clause, and used the "WITH" to try to make it clearer. Basically, the WITH (despite its name) selects all rows that are followed one month later by a row with the same letter and number. Then, I exclude all of those rows from a select on the whole table, which keeps the last row of each run.
WITH rows_with_previous AS (
SELECT s.*
FROM session.sample s
JOIN session.sample s2
ON s.letter = s2.letter
AND s.number = s2.number
AND s.update_date = s2.update_date - 1 MONTH
)
SELECT *
FROM session.sample
EXCEPT ALL
SELECT *
FROM rows_with_previous;
Here is the result:
LETTER NUMBER UPDATE_DATE
------ ----------- -----------
A 5 01/01/2009
A 12 04/01/2009
A 9 06/01/2009
A 5 07/01/2009
4 record(s) selected.