SQL Mean Absolute Deviation

I am attempting to calculate the "Mean Absolute Deviation" from the following SQL Table:
| ID | Typical | SMA   | MAD |
|----|---------|-------|-----|
| 1  | 10      |       |     |
| 2  | 20      |       |     |
| 3  | 5       | 11.67 |     |
| 4  | 12      | 12.33 |     |
| 5  | 14      | 10.33 |     |
| 6  | 6       | 10.67 |     |
| 7  | 2       | 7.33  |     |
| 8  | 17      | 8.33  |     |
| 9  | 5       | 8.00  |     |
Calculating the MAD requires:
SUM(ABS(current row's SMA - Typical)) over the previous 2 rows and the current row, divided by 3. So for ID #3 it would be:
MAD = (ABS(11.67 - 5) + ABS(11.67 - 20) + ABS(11.67 - 10)) / 3 = (6.67 + 8.33 + 1.67) / 3 ≈ 5.56
I first did this with dynamic SQL, looping through and creating a LAG for each previous row. This works, but is very inefficient when it's scaled up to a higher lookback period. I then attempted the below, which I really thought would work, but it did not:
DECLARE @sma_current numeric(20,10)

UPDATE PY SET
    @sma_current = [SMA(20)],
    [MAD] = W.[MAD]
FROM (
    SELECT [id],
           (SUM(ABS(@sma_current - [Typical])) OVER (ORDER BY [id] ASC
                ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)) / 3 AS [MAD]
    FROM PY
) W
WHERE PY.[id] = W.[id] AND PY.[id] >= 3
Any help would be greatly appreciated.

Since the ids are sequential and there are no gaps, you can do the update with a double self-join:
update p
set p.mad = round(
(abs(p.sma - p.typical) + abs(p.sma - p1.typical) + abs(p.sma - p2.typical)) / 3.0
, 2)
from py p
inner join py p1 on p1.id + 1 = p.id
inner join py p2 on p2.id + 2 = p.id
Or with the lag() window function:
with cte as (
select *,
lag(typical, 1) over (order by id) typical1,
lag(typical, 2) over (order by id) typical2
from py
)
update cte
set mad = round(
(abs(sma - typical) + abs(sma - typical1) + abs(sma - typical2)) / 3.0
, 2)
from cte
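An aside on why this is legal: in SQL Server you can UPDATE through a CTE as long as the CTE is updatable, i.e. it projects columns from a single base table, so the engine can trace mad back to py. The same rule makes the updatable-CTE approach in the next answer work.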
Results:
| ID | Typical | SMA   | MAD  |
|---:|--------:|------:|-----:|
|  1 |      10 |       |      |
|  2 |      20 |       |      |
|  3 |       5 | 11.67 | 5.56 |
|  4 |      12 | 12.33 | 5.11 |
|  5 |      14 | 10.33 | 3.56 |
|  6 |       6 | 10.67 | 3.11 |
|  7 |       2 |  7.33 | 4.44 |
|  8 |      17 |  8.33 | 5.78 |
|  9 |       5 |  8.00 | 6.00 |

I am guessing that you are using SQL Server. You can use an updatable CTE/subquery to do the calculation there and then set it:
WITH toupdate as (
    SELECT id,
           (sum(abs(@sma_current - [Typical])) OVER (ORDER BY [id] ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) / 3) as new_mad
    FROM PY
)
UPDATE toupdate
SET mad = new_mad;
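The core difficulty is that the window SUM reads @sma_current as a single statement-level value, not as each row's own SMA. A correlated CROSS APPLY sidesteps this and scales to any lookback period without stacking self-joins. A minimal sketch, assuming SQL Server 2012+, a gap-free integer id, and that the SMA column holds the precomputed average (the @N variable is illustrative):

DECLARE @N int = 3;  -- lookback window; 3 reproduces the example above

UPDATE p
SET p.MAD = w.mad
FROM PY AS p
CROSS APPLY (
    -- p.SMA is fixed per outer row, so every deviation is taken
    -- against the current row's SMA, as the formula requires
    SELECT AVG(ABS(p.SMA - q.Typical)) AS mad
    FROM PY AS q
    WHERE q.id BETWEEN p.id - (@N - 1) AND p.id
) AS w
WHERE p.id >= @N;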

Related

Multiply with Previous Value in Oracle SQL

It's easy to multiply (or sum/divide/etc.) with the previous row in an Excel spreadsheet; however, I have not managed to do it so far in Oracle SQL.
| A      | B     | C           |
|--------|-------|-------------|
| 199901 | 3.81  | 51905       |
| 199902 | -6.09 | 48743.9855  |
| 199903 | 4.75  | 51059.32481 |
| 199904 | 6.39  | 54322.01567 |
| 199905 | -2.35 | 53045.4483  |
| 199906 | 2.65  | 54451.15268 |
| 199907 | 1.1   | 55050.11536 |
| 199908 | -1.45 | 54251.88869 |
| 199909 | 0     | 54251.88869 |
| 199910 | 4.37  | 56622.69622 |
Above, column B is static, and column C carries the Excel-style formula, each row multiplying the previous row's C:
((B2/100)+1)*C1
((B3/100)+1)*C2
((B4/100)+1)*C3
Example: 51905 from row 1 multiplied with -6.09 from row 2:
((-6.09/100)+1)*51905 = 48743.9855
I have been trying analytic and window functions, but have not succeeded yet. The LAG function can give the previous row's value in the current row, but it cannot give the previously calculated value.
This can be done with the help of the MODEL clause:
select *
from (
      select t.*,
             row_number() over (order by a) as rn
      from table1 t
     )
model
  dimension by (rn)
  measures (a, b, 0 c)
  rules (
    c[rn = 1] = 51905,  -- seed: the value in the first row
    c[rn > 1] = round(c[cv() - 1] * (b[cv()] / 100 + 1), 6)
  );
Demo: http://sqlfiddle.com/#!4/9756ed/11
| RN | A | B | C |
|----|--------|-------|--------------|
| 1 | 199901 | 3.81 | 51905 |
| 2 | 199902 | -6.09 | 48743.9855 |
| 3 | 199903 | 4.75 | 51059.324811 |
| 4 | 199904 | 6.39 | 54322.015666 |
| 5 | 199905 | -2.35 | 53045.448298 |
| 6 | 199906 | 2.65 | 54451.152678 |
| 7 | 199907 | 1.1 | 55050.115357 |
| 8 | 199908 | -1.45 | 54251.888684 |
| 9 | 199909 | 0 | 54251.888684 |
| 10 | 199910 | 4.37 | 56622.696219 |
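Because the recursion here is purely multiplicative (each C is the previous C times (B/100 + 1)), it can also be unrolled into a running product, and a running product can be phrased as EXP of a running SUM of logarithms. A sketch of that window-function alternative, assuming every factor (B/100 + 1) is positive (LN is undefined otherwise), with the 51905 seed hard-coded as in the MODEL answer:

select a, b,
       round(51905 * exp(sum(case when rn = 1 then 0            -- row 1 is the seed
                                  else ln(b / 100 + 1) end)     -- later rows each contribute a factor
                         over (order by rn)), 6) as c
from (select t.*, row_number() over (order by a) as rn
      from table1 t);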

Record batching on the basis of running total values by a specific number (FileSize-wise batching)

We are dealing with a large recordset and are currently using NTILE() to get ranges of FileIDs, then using the FileID column in a BETWEEN clause to get a specific record set. Using FileID in the BETWEEN clause is a mandatory requirement from the developers. So we cannot have random FileIDs in one batch; it has to be incremental.
As per a new requirement, we have to make the ranges based on the FileSize column, e.g. 100 GB per batch.
For example:
Batch 1: FileID 1 alone is 100 GB, so it is a one-record batch.
Batch 2: FileIDs 2, 3, 4, 5 sum to 80 GB, which is < 100 GB, so FileID 6 (120 GB) has to be taken as well (running total 300 GB).
Batch 3: FileID 7 is > 100 GB on its own, so it is one record only.
And so on…
Below is my sample code, but it is not giving the expected result:
CREATE TABLE zFiles
(
FileId INT
,FileSize INT
)
INSERT INTO dbo.zFiles (
FileId
,FileSize
)
VALUES (1, 100)
,(2, 20)
,(3, 20)
,(4, 30)
,(5, 10)
,(6, 120)
,(7, 400)
,(8, 50)
,(9, 100)
,(10, 60)
,(11, 40)
,(12, 5)
,(13, 20)
,(14, 95)
,(15, 40)
DECLARE @intBatchSize FLOAT = 100;

SELECT y.FileID,
       y.FileSize,
       y.RunningTotal,
       DENSE_RANK() OVER (ORDER BY CEILING(RunningTotal / @intBatchSize)) AS Batch
FROM ( SELECT i.FileID,
              i.FileSize,
              RunningTotal = SUM(i.FileSize) OVER (ORDER BY i.FileID)  -- default frame: RANGE UNBOUNDED PRECEDING
       FROM dbo.zFiles AS i WITH (NOLOCK)
     ) y
ORDER BY y.FileID;
Result:
+--------+----------+--------------+-------+
| FileID | FileSize | RunningTotal | Batch |
+--------+----------+--------------+-------+
| 1 | 100 | 100 | 1 |
| 2 | 20 | 120 | 2 |
| 3 | 20 | 140 | 2 |
| 4 | 30 | 170 | 2 |
| 5 | 10 | 180 | 2 |
| 6 | 120 | 300 | 3 |
| 7 | 400 | 700 | 4 |
| 8 | 50 | 750 | 5 |
| 9 | 100 | 850 | 6 |
| 10 | 60 | 910 | 7 |
| 11 | 40 | 950 | 7 |
| 12 | 5 | 955 | 7 |
| 13 | 20 | 975 | 7 |
| 14 | 95 | 1070 | 8 |
| 15 | 40 | 1110 | 9 |
+--------+----------+--------------+-------+
Expected Result:
+--------+---------------+---------+
| FileID | FileSize (GB) | BatchNo |
+--------+---------------+---------+
| 1 | 100 | 1 |
| 2 | 20 | 2 |
| 3 | 20 | 2 |
| 4 | 30 | 2 |
| 5 | 10 | 2 |
| 6 | 120 | 2 |
| 7 | 400 | 3 |
| 8 | 50 | 4 |
| 9 | 100 | 4 |
| 10 | 60 | 5 |
| 11 | 40 | 5 |
| 12 | 5 | 6 |
| 13 | 20 | 6 |
| 14 | 95 | 6 |
| 15 | 40 | 7 |
+--------+---------------+---------+
We could achieve this if we could somehow reset the running total once it exceeds 100. We could write a loop to get this result, but that means going record by record, which is time-consuming. Could somebody please help us with this?
You need to do this with a recursive CTE:
with cte as (
      select z.fileid, z.filesize, z.filesize as batch_filesize, 1 as batchnum
      from zfiles z
      where z.fileid = 1
      union all
      select z.fileid, z.filesize,
             (case when cte.batch_filesize + z.filesize > @intBatchSize
                   then z.filesize
                   else cte.batch_filesize + z.filesize
              end),
             (case when cte.batch_filesize + z.filesize > @intBatchSize
                   then cte.batchnum + 1
                   else cte.batchnum
              end)
      from cte join
           zfiles z
           on z.fileid = cte.fileid + 1
     )
select *
from cte;
Note: I realize that fileid is probably not gap-free in practice. You can create a gap-free sequence using row_number() in a CTE to make this work.
There is a technical reason why running sums don't work for this: essentially, any given fileid needs to know where every break before it falls, so the batch boundaries have to be computed sequentially.
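A minimal sketch of that renumbering, assuming SQL Server (the zfiles_seq name is illustrative): the recursion walks seq instead of fileid, so gaps in fileid no longer break the join.

declare @intBatchSize int = 100;

with zfiles_seq as (
      -- gap-free row numbers in fileid order
      select row_number() over (order by fileid) as seq,
             fileid, filesize
      from zfiles
),
cte as (
      select z.seq, z.fileid, z.filesize, z.filesize as batch_filesize, 1 as batchnum
      from zfiles_seq z
      where z.seq = 1
      union all
      select z.seq, z.fileid, z.filesize,
             (case when cte.batch_filesize + z.filesize > @intBatchSize
                   then z.filesize
                   else cte.batch_filesize + z.filesize
              end),
             (case when cte.batch_filesize + z.filesize > @intBatchSize
                   then cte.batchnum + 1
                   else cte.batchnum
              end)
      from cte join
           zfiles_seq z
           on z.seq = cte.seq + 1
)
select fileid, filesize, batchnum
from cte;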
A small modification to Gordon Linoff's answer above gives the expected result:
DECLARE @intBatchSize INT = 100;

with cte as (
      select z.fileid, z.filesize, z.filesize as batch_filesize, 1 as batchnum
      from zfiles z
      where z.fileid = 1
      union all
      select z.fileid, z.filesize,
             (case when cte.batch_filesize >= @intBatchSize
                   then z.filesize
                   else cte.batch_filesize + z.filesize
              end),
             (case when cte.batch_filesize >= @intBatchSize
                   then cte.batchnum + 1
                   else cte.batchnum
              end)
      from cte join
           zfiles z
           on z.fileid = cte.fileid + 1
     )
select *
from cte;
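One caveat: the recursion descends one row per level, and SQL Server caps recursive CTEs at 100 levels by default, so on a large recordset the query will fail once it passes 100 rows unless the cap is lifted on the final SELECT:

select *
from cte
option (maxrecursion 0);  -- 0 removes the default 100-level recursion limit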

Postgres width_bucket() not assigning values to buckets correctly

In PostgreSQL 9.5.3 I can't get width_bucket() to work as expected; it appears to be assigning values to the wrong buckets.
Dataset:
1
2
4
32
43
82
104
143
232
295
422
477
Expected output (bucket ranges and zero-count rows added to help analysis):
bucket | bucketmin | bucketmax | Expect | Actual
--------+-----------+-----------+--------+--------
1 | 1 | 48.6 | 5 | 5
2 | 48.6 | 96.2 | 1 | 2
3 | 96.2 | 143.8 | 2 | 1
4 | 143.8 | 191.4 | 0 | 0
5 | 191.4 | 239 | 1 | 1
6 | 239 | 286.6 | 0 | 1
7 | 286.6 | 334.2 | 1 | 0
8 | 334.2 | 381.8 | 0 | 1
9 | 381.8 | 429.4 | 1 | 0
10 | 429.4 | 477 | 1 | 1
Actual output:
wb | count
----+-------
1 | 5
2 | 2
3 | 1
5 | 1
6 | 1
8 | 1
10 | 1
Code to generate actual output:
create temp table metrics (val int);
insert into metrics (val) values(1),(2),(4),(32),(43),(82),(104),(143),(232),(295),(422),(477);
with metric_stats as (
select
cast(min(val) as float) as minV,
cast(max(val) as float) as maxV
from metrics m
),
hist as (
select
width_bucket(val, s.minV, s.maxV, 9) wb,
count(*)
from metrics m, metric_stats s
group by 1 order by 1
)
select * from hist;
Your calculations appear to be off. The following query:
with metric_stats as (
select cast(min(val) as float) as minV,
cast(max(val) as float) as maxV
from metrics m
)
select g.n,
s.minV + ((s.maxV - s.minV) / 9) * (g.n - 1) as bucket_start,
s.minV + ((s.maxV - s.minV) / 9) * g.n as bucket_end
from generate_series(1, 9) g(n) cross join
metric_stats s
order by g.n
Yields the following bins:
 n | bucket_start     | bucket_end
---+------------------+------------------
 1 | 1                | 53.8888888888889
 2 | 53.8888888888889 | 106.777777777778
 3 | 106.777777777778 | 159.666666666667
 4 | 159.666666666667 | 212.555555555556
 5 | 212.555555555556 | 265.444444444444
 6 | 265.444444444444 | 318.333333333333
 7 | 318.333333333333 | 371.222222222222
 8 | 371.222222222222 | 424.111111111111
 9 | 424.111111111111 | 477
I think you intend for the "9" to be a "10": width_bucket()'s last argument is the number of buckets, so 9 yields nine bins of width (477 - 1) / 9 ≈ 52.89 rather than the ten bins of width 47.6 in your expected output.
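A sketch of the corrected query, with one extra wrinkle hedged: width_bucket()'s upper bound is exclusive, so the maximum value (477) would land in overflow bucket 11; least() clamps it back into bucket 10 to match the expected output.

with metric_stats as (
    select
        cast(min(val) as float) as minV,
        cast(max(val) as float) as maxV
    from metrics m
),
hist as (
    select
        least(width_bucket(val, s.minV, s.maxV, 10), 10) wb,  -- 10 buckets, max clamped
        count(*)
    from metrics m, metric_stats s
    group by 1 order by 1
)
select * from hist;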

Window running function except current row

I have a theoretical question, so I'm not interested in alternative solutions. Sorry.
Q: Is it possible to get the window running function values for all previous rows, except current?
For example:
with
t(i,x,y) as (
values
(1,1,1),(2,1,3),(3,1,2),
(4,2,4),(5,2,2),(6,2,8)
)
select
t.*,
sum(y) over (partition by x order by i) - y as sum,
max(y) over (partition by x order by i) as max,
count(*) filter (where y > 2) over (partition by x order by i) as cnt
from
t;
Actual result is
i | x | y | sum | max | cnt
---+---+---+-----+-----+-----
1 | 1 | 1 | 0 | 1 | 0
2 | 1 | 3 | 1 | 3 | 1
3 | 1 | 2 | 4 | 3 | 1
4 | 2 | 4 | 0 | 4 | 1
5 | 2 | 2 | 4 | 4 | 1
6 | 2 | 8 | 6 | 8 | 2
(6 rows)
I want the max and cnt columns to behave like the sum column, so the result should be:
i | x | y | sum | max | cnt
---+---+---+-----+-----+-----
1 | 1 | 1 | 0 | | 0
2 | 1 | 3 | 1 | 1 | 0
3 | 1 | 2 | 4 | 3 | 1
4 | 2 | 4 | 0 | | 0
5 | 2 | 2 | 4 | 4 | 1
6 | 2 | 8 | 6 | 4 | 1
(6 rows)
It can be achieved using a simple subquery like
select t.*, lag(y,1) over (partition by x order by i) as yy from t
but is it possible using only window function syntax, without subqueries?
Yes, you can. This does the trick:
with
t(i,x,y) as (
values
(1,1,1),(2,1,3),(3,1,2),
(4,2,4),(5,2,2),(6,2,8)
)
select
t.*,
sum(y) over w as sum,
max(y) over w as max,
count(*) filter (where y > 2) over w as cnt
from t
window w as (partition by x order by i
rows between unbounded preceding and 1 preceding);
The frame_clause selects just those rows from the window frame that you are interested in.
Note that in the sum column you'll get null rather than 0 because of the frame clause: the first row in the frame has no row before it. You can coalesce() this away if needed.
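For example, writing coalesce(sum(y) over w, 0) as sum restores the 0 shown in the question's expected output for the first row of each partition.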

Access SQL: how to calculate wages (multiplication) and count the number of working days (count distinct) of each staff member between 2 dates

I need to create a form to summarize the wage of each employee according to the Date_start and Date_end that I select.
I have 3 related tables.
tbl_labor
| lb_id | lb_name | lb_OT ($/day) | If_social_sec |
|-------|---------|---------------|---------------|
| 1     | John    | 10            | yes           |
| 2     | Mary    | 10            | no            |
tbl_production
| pdtn_date | lb_id | pdtn_qty (pcs) | pd_making_id |
|-----------|-------|----------------|--------------|
| 5/9/12    | 1     | 200            | 12           |
| 5/9/12    | 1     | 40             | 13           |
| 5/9/12    | 2     | 300            | 12           |
| 7/9/12    | 1     | 48             | 13           |
| 13/9/12   | 2     | 220            | 14           |
| 15/9/12   | 1     | 20             | 12           |
| 20/9/12   | 1     | 33             | 14           |
| 21/9/12   | 2     | 55             | 14           |
| 21/9/12   | 1     | 20             | 12           |
tbl_pdWk_process
| pd_making_id | pd_cost ($/dozen) | pd_id |
|--------------|-------------------|-------|
| 12           | 2                 | 001   |
| 13           | 5                 | 001   |
| 14           | 6                 | 002   |
The outcome will look like this:
| lb_name | No. of working days | Total ($) | OT payment | Social_sec ($) | Net Wage             |
|---------|---------------------|-----------|------------|----------------|----------------------|
| John    | 4                   | 98.83     | 20 (2x10)  | 15             | 103.83 (98.83+20-15) |
| Mary    | 2                   | 160       | 10 (1x10)  | 0              | 170 (160+10-0)       |
My conditions are:
1. Wages are calculated between the 2 dates that I specify, e.g. 5/9/12 - 20/9/12.
2. Wages must be calculated from pd_cost * pdtn_qty. However, pdtn_qty is kept in pieces whereas pd_cost is kept per dozen, so the formula should be (pdtn_qty * pd_cost) / 12.
3. Add OT * the number of OT days that each worker did (e.g. John had 2 OT days, Mary 1 OT day).
4. $15 must be deducted from the total wage if If_social_sec is TRUE.
5. Count the number of working days that each employee worked.
I've tried, but I couldn't merge all these conditions together in one SQL statement, so could you please help me? Thank you.
OK, this is really messy, mainly because Access has no COUNT(DISTINCT) option, so counting the working days is a mess. If you can skip that, you can drop all the pdn1/pdn2/pdn3 stuff. But it does work. A couple of notes:
1. I think your maths is not quite right in the example given.
2. I've used an IIF clause to simulate 2 OT days for John and 1 for Mary. You'll need to change that. Good luck.
select
lab.lb_name,
max(days),
sum(prod.pdtn_qty * pdWk.pd_cost / 12) as Total ,
max(lab.lb_OT * iif(lab.lb_id=1,2,1)) as OTPayment,
max(iif(lab.if_social_sec='yes' , 15,0 ) ) as Social_Sec,
sum(prod.pdtn_qty * pdWk.pd_cost / 12.00) +
max(lab.lb_OT * iif(lab.lb_id=1,2,1)) -
max(iif(lab.if_social_sec='yes' , 15,0 ) ) as NetWage
from
tbl_labor as lab,
tbl_production as prod,
tbl_pdWk_process as pdwk,
(select pdn2.lb_id, count(pdn2.lb_id) as days from
(select lb_id
from tbl_production pdn1
where pdn1.pdtn_date >= #2012-09-05#
and pdn1.pdtn_date <= #2012-09-20#
group by lb_id, pdtn_date ) as pdn2
group by pdn2.lb_id) as pdn3
where prod.pdtn_date >= #2012-09-05#
and prod.pdtn_date <= #2012-09-20#
and prod.lb_id = lab.lb_id
and prod.pd_making_id = pdwk.pd_making_id
and lab.lb_id = pdn3.lb_id
group by lab.lb_name
OK, to add the staff with no rows in the production table over the period, you'll need to append something like this:
Union
select lab.lb_name,
0,
0,
max(lab.lb_OT * iif(lab.lb_id=1,2,1)) ,
max(iif(lab.if_social_sec='yes' , 15,0 ) ),0
from tbl_labor lab
where lb_id not in ( select lb_id from tbl_production where pdtn_date >= #2012-09-05# and pdtn_date <= #2012-09-20# )
group by lab.lb_name
Hope this helps.
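For what it's worth, the pdn1/pdn2/pdn3 dance above is the standard Access workaround for the missing COUNT(DISTINCT): collapse the duplicate (lb_id, pdtn_date) pairs first, then count the collapsed rows. A minimal standalone sketch against the same tbl_production (the alias d is illustrative):

SELECT d.lb_id, COUNT(*) AS working_days
FROM (SELECT DISTINCT lb_id, pdtn_date
      FROM tbl_production
      WHERE pdtn_date BETWEEN #2012-09-05# AND #2012-09-20#) AS d
GROUP BY d.lb_id;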