Postgres width_bucket() not assigning values to buckets correctly

In PostgreSQL 9.5.3 I can't get width_bucket() to work as expected; it appears to be assigning values to the wrong buckets.
Dataset:
1
2
4
32
43
82
104
143
232
295
422
477
Expected output (bucket ranges and zero-count rows added to help analysis):
 bucket | bucketmin | bucketmax | Expect | Actual
--------+-----------+-----------+--------+--------
      1 |         1 |      48.6 |      5 |      5
      2 |      48.6 |      96.2 |      1 |      2
      3 |      96.2 |     143.8 |      2 |      1
      4 |     143.8 |     191.4 |      0 |      0
      5 |     191.4 |       239 |      1 |      1
      6 |       239 |     286.6 |      0 |      1
      7 |     286.6 |     334.2 |      1 |      0
      8 |     334.2 |     381.8 |      0 |      1
      9 |     381.8 |     429.4 |      1 |      0
     10 |     429.4 |       477 |      1 |      1
Actual output:
 wb | count
----+-------
  1 |     5
  2 |     2
  3 |     1
  5 |     1
  6 |     1
  8 |     1
 10 |     1
Code to generate actual output:
create temp table metrics (val int);
insert into metrics (val) values(1),(2),(4),(32),(43),(82),(104),(143),(232),(295),(422),(477);
with metric_stats as (
    select
        cast(min(val) as float) as minV,
        cast(max(val) as float) as maxV
    from metrics m
),
hist as (
    select
        width_bucket(val, s.minV, s.maxV, 9) wb,
        count(*)
    from metrics m, metric_stats s
    group by 1 order by 1
)
select * from hist;

Your calculations appear to be off. The following query:
with metric_stats as (
    select cast(min(val) as float) as minV,
           cast(max(val) as float) as maxV
    from metrics m
)
select g.n,
       s.minV + ((s.maxV - s.minV) / 9) * (g.n - 1) as bucket_start,
       s.minV + ((s.maxV - s.minV) / 9) * g.n as bucket_end
from generate_series(1, 9) g(n) cross join
     metric_stats s
order by g.n
Yields the following bins:
 n |     bucket_start |       bucket_end
---+------------------+------------------
 1 |                1 | 53.8888888888889
 2 | 53.8888888888889 | 106.777777777778
 3 | 106.777777777778 | 159.666666666667
 4 | 159.666666666667 | 212.555555555556
 5 | 212.555555555556 | 265.444444444444
 6 | 265.444444444444 | 318.333333333333
 7 | 318.333333333333 | 371.222222222222
 8 | 371.222222222222 | 424.111111111111
 9 | 424.111111111111 | 477
I think you intend for the "9" to be a "10": the last argument to width_bucket() is the number of buckets, so passing 9 divides the range into nine equal bins, not ten.
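For completeness, a sketch of the corrected query. One caveat: per the Postgres docs, width_bucket() assigns values equal to the upper bound to the overflow bucket (count + 1), so the maximum value (477) would land in bucket 11; clamping with least() is one way to fold it back into bucket 10:
with metric_stats as (
    select
        cast(min(val) as float) as minV,
        cast(max(val) as float) as maxV
    from metrics m
)
select
    -- clamp so val = maxV falls into bucket 10 instead of the overflow bucket 11
    least(width_bucket(val, s.minV, s.maxV, 10), 10) as wb,
    count(*)
from metrics m, metric_stats s
group by 1
order by 1;
This reproduces the nonzero rows of the "Expect" column in the table above.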

How to get columns when using buckets (width_bucket)

I would like to know which rows were assigned to each bucket.
SELECT
    width_bucket(s.score, sl.mins, sl.maxs, 9) as buckets,
    COUNT(*)
FROM scores s
CROSS JOIN scores_limits sl
GROUP BY 1
ORDER BY 1;
My actual return:
 buckets | count
---------+-------
       1 |   182
       2 |    37
       3 |    46
       4 |    15
       5 |    29
       7 |    18
       8 |    22
      10 |    11
         |    20
What I expect to be able to query is something like:
SELECT buckets FROM buckets_table [...] WHERE scores.id = 1;
How can I get, for example, the column 'id' of the scores table?
I believe you can include the id in an array with array_agg. If I recreate your case with
create table test (id serial, score int);
insert into test(score) values (10),(9),(5),(4),(10),(2),(5),(7),(8),(10);
The data is
 id | score
----+-------
  1 |    10
  2 |     9
  3 |     5
  4 |     4
  5 |    10
  6 |     2
  7 |     5
  8 |     7
  9 |     8
 10 |    10
(10 rows)
Using the following and aggregating the id with array_agg
SELECT
    width_bucket(score, 0, 10, 11) as buckets,
    COUNT(*) nr_ids,
    array_agg(id) agg_ids
FROM test s
GROUP BY 1
ORDER BY 1;
You get
 buckets | nr_ids | agg_ids
---------+--------+----------
       3 |      1 | {6}
       5 |      1 | {4}
       6 |      2 | {3,7}
       8 |      1 | {8}
       9 |      1 | {9}
      10 |      1 | {2}
      12 |      3 | {1,5,10}
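If you only need the bucket of a single row, a simpler sketch (using the same assumed bounds 0 and 10 with 11 buckets) computes the bucket per row and filters on id:
SELECT id,
       score,
       width_bucket(score, 0, 10, 11) AS bucket
FROM test
WHERE id = 1;
-- id 1 has score 10, which equals the upper bound, so width_bucket()
-- returns the overflow bucket 12 (one past the requested 11 buckets),
-- matching the {1,5,10} group in the aggregate output above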

Recursive join with SUM

I have data in the following format:
FromStateID  ToStateID  Seconds
          1          2       10
          2          3       20
          3          4       15
          4          5        5
I need the following output:
FromStateID  ToStateID  Seconds
          1          2       10
          2          3       20
          3          4       15
          4          5        5
          1          3       10+20
          1          4       10+20+15
          1          5       10+20+15+5
          2          4       20+15
          2          5       20+15+5
          3          5       15+5
This output shows the total time taken from FromStateId to ToStateId for every combination, in chronological order.
Please help.
I think this calls for a recursive CTE that follows the links:
with cte as (
    select FromStateID, ToStateID, Seconds
    from t
    union all
    select cte.FromStateId, t.ToStateId, cte.Seconds + t.Seconds
    from cte join
         t
         on cte.ToStateId = t.FromStateId
)
select *
from cte;
Here is a db<>fiddle.
@Gordon Linoff's is the better solution. Below is another option to achieve the same.
You can achieve this using CROSS JOIN and CROSS APPLY:
DECLARE @table table(FromStateId int, ToStateId int, seconds int)
insert into @table
values
(1, 2, 10),
(2, 3, 20),
(3, 4, 15),
(4, 5, 5);

;with cte_fromToCombination as
(
    select f.fromStateId, t.toStateId
    from (select distinct fromStateId from @table) as f
    cross join
         (select distinct toStateId from @table) as t
)
select c.FromStateId, c.ToStateId, s.sumseconds as Total_seconds
from cte_fromToCombination as c
CROSS APPLY
(
    SELECT sum(t.seconds)
    from @table as t
    WHERE t.FromStateId >= c.FromStateId
      AND t.ToStateId <= c.ToStateId
) as s(sumseconds)
where c.toStateId > c.fromStateId
order by FromStateId, ToStateId
+-------------+-----------+---------------+
| FromStateId | ToStateId | Total_seconds |
+-------------+-----------+---------------+
|           1 |         2 |            10 |
|           1 |         3 |            30 |
|           1 |         4 |            45 |
|           1 |         5 |            50 |
|           2 |         3 |            20 |
|           2 |         4 |            35 |
|           2 |         5 |            40 |
|           3 |         4 |            15 |
|           3 |         5 |            20 |
|           4 |         5 |             5 |
+-------------+-----------+---------------+

SQL Mean Absolute Deviation

I am attempting to calculate the "Mean Absolute Deviation" from the following SQL Table:
| ID | Typical |  SMA  | MAD |
|----|---------|-------|-----|
|  1 |      10 |       |     |
|  2 |      20 |       |     |
|  3 |       5 | 11.67 |     |
|  4 |      12 | 12.33 |     |
|  5 |      14 | 10.33 |     |
|  6 |       6 | 10.67 |     |
|  7 |       2 |  7.33 |     |
|  8 |      17 |  8.33 |     |
|  9 |       5 |  8.00 |     |
Calculating the MAD requires:
SUM(ABS(current row's SMA - Typical)) over the previous two rows and the current row, divided by 3. So for ID #3 it would be:
MAD = (ABS(11.67 - 5) + ABS(11.67 - 20) + ABS(11.67 - 10)) / 3 = (6.67 + 8.33 + 1.67) / 3 ≈ 5.56
I first did this with dynamic SQL, looping through and creating a LAG for each previous row. This works, but is very inefficient when it is scaled up to a longer lookback period. I then attempted the below, which I really thought would work, but it did not:
DECLARE @sma_current numeric(20,10)

UPDATE PY SET
    @sma_current = [SMA(20)],
    [MAD] = W.[MAD]
FROM (
    SELECT [id],
           ((sum(abs(@sma_current - [Typical])) OVER (ORDER BY [id] ASC ROWS
               BETWEEN 2 PRECEDING AND CURRENT ROW)) / 3) [MAD]
    FROM PY
) W
WHERE PY.[id] = W.[id] AND PY.[id] >= 3
Any help would be greatly appreciated.
Since the ids are sequential and there are no gaps, you can do the update with a double self join:
update p
set p.mad = round(
        (abs(p.sma - p.typical) + abs(p.sma - p1.typical) + abs(p.sma - p2.typical)) / 3.0,
        2)
from py p
inner join py p1 on p1.id + 1 = p.id
inner join py p2 on p2.id + 2 = p.id
See the demo.
Or with lag() window function:
with cte as (
    select *,
           lag(typical, 1) over (order by id) typical1,
           lag(typical, 2) over (order by id) typical2
    from py
)
update cte
set mad = round(
        (abs(sma - typical) + abs(sma - typical1) + abs(sma - typical2)) / 3.0,
        2)
from cte
See the demo.
Results:
 ID | Typical |  SMA  | MAD
----+---------+-------+------
  1 |      10 |       |
  2 |      20 |       |
  3 |       5 | 11.67 | 5.56
  4 |      12 | 12.33 | 5.11
  5 |      14 | 10.33 | 3.56
  6 |       6 | 10.67 | 3.11
  7 |       2 |  7.33 | 4.44
  8 |      17 |  8.33 | 5.78
  9 |       5 |  8    | 6
I am guessing that you are using SQL Server. You can use an updatable CTE/subquery to do the calculation there and then set it:
WITH toupdate as (
    SELECT id,
           (sum(abs(@sma_current - [Typical])) OVER (ORDER BY [id] ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) / 3) as new_mad
    FROM PY
)
UPDATE toupdate
SET mad = new_mad;

Record batching on bases of running total values by specific number (FileSize wise batching)

We are dealing with a large recordset and currently use NTILE() to get ranges of FileIDs, then use the FileID column in a BETWEEN clause to fetch a specific set of records. Using FileID in the BETWEEN clause is a mandatory requirement from the developers, so we cannot have random FileIDs in one batch; they have to be incremental.
As per a new requirement, we have to build the ranges based on the FileSize column, e.g. 100 GB per batch.
For example:
Batch 1: FileId 1 alone has size 100, so it is a batch on its own.
Batch 2: FileIds 2,3,4,5 total 80, which is < 100 GB, so we also have to take FileId 6 of 120 GB (total 200 GB).
Batch 3: FileId 7 alone has > 100, so it is 1 record only.
And so on…
Below is my sample code, but it is not giving the expected result:
CREATE TABLE zFiles
(
    FileId INT
    ,FileSize INT
)

INSERT INTO dbo.zFiles (
    FileId
    ,FileSize
)
VALUES (1, 100)
,(2, 20)
,(3, 20)
,(4, 30)
,(5, 10)
,(6, 120)
,(7, 400)
,(8, 50)
,(9, 100)
,(10, 60)
,(11, 40)
,(12, 5)
,(13, 20)
,(14, 95)
,(15, 40)
DECLARE @intBatchSize FLOAT = 100;

SELECT y.FileID,
       y.FileSize,
       y.RunningTotal,
       DENSE_RANK() OVER (ORDER BY CEILING(RunningTotal / @intBatchSize)) Batch
FROM ( SELECT i.FileID,
              i.FileSize,
              RunningTotal = SUM(i.FileSize) OVER ( ORDER BY i.FileID ) -- RANGE UNBOUNDED PRECEDING
       FROM dbo.zFiles AS i WITH ( NOLOCK )
     ) y
ORDER BY y.FileID;
Result:
+--------+----------+--------------+-------+
| FileID | FileSize | RunningTotal | Batch |
+--------+----------+--------------+-------+
|      1 |      100 |          100 |     1 |
|      2 |       20 |          120 |     2 |
|      3 |       20 |          140 |     2 |
|      4 |       30 |          170 |     2 |
|      5 |       10 |          180 |     2 |
|      6 |      120 |          300 |     3 |
|      7 |      400 |          700 |     4 |
|      8 |       50 |          750 |     5 |
|      9 |      100 |          850 |     6 |
|     10 |       60 |          910 |     7 |
|     11 |       40 |          950 |     7 |
|     12 |        5 |          955 |     7 |
|     13 |       20 |          975 |     7 |
|     14 |       95 |         1070 |     8 |
|     15 |       40 |         1110 |     9 |
+--------+----------+--------------+-------+
Expected Result:
+--------+---------------+---------+
| FileID | FileSize (GB) | BatchNo |
+--------+---------------+---------+
|      1 |           100 |       1 |
|      2 |            20 |       2 |
|      3 |            20 |       2 |
|      4 |            30 |       2 |
|      5 |            10 |       2 |
|      6 |           120 |       2 |
|      7 |           400 |       3 |
|      8 |            50 |       4 |
|      9 |           100 |       4 |
|     10 |            60 |       5 |
|     11 |            40 |       5 |
|     12 |             5 |       6 |
|     13 |            20 |       6 |
|     14 |            95 |       6 |
|     15 |            40 |       7 |
+--------+---------------+---------+
We can achieve this if we can somehow reset the running total once it goes over 100. We could write a loop to get this result, but that would mean going record by record, which is time consuming.
Can somebody please help us with this?
You need to do this with a recursive CTE:
with cte as (
    select z.fileid, z.filesize, z.filesize as batch_filesize, 1 as batchnum
    from zfiles z
    where z.fileid = 1
    union all
    select z.fileid, z.filesize,
           (case when cte.batch_filesize + z.filesize > @intBatchSize
                 then z.filesize
                 else cte.batch_filesize + z.filesize
            end),
           (case when cte.batch_filesize + z.filesize > @intBatchSize
                 then cte.batchnum + 1
                 else cte.batchnum
            end)
    from cte join
         zfiles z
         on z.fileid = cte.fileid + 1
)
select *
from cte;
Note: I realize that fileid is probably not a gapless sequence. You can generate one with row_number() in a CTE to make this work; see the sketch below.
There is a technical reason why running sums don't work for this: to place any given fileid, you need to know where all the batch breaks before it fall.
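A minimal sketch of that row_number() wrapper, assuming the same zfiles table and that @intBatchSize is declared as in the question; the recursion walks the generated seq instead of fileid:
with numbered as (
    -- gap-free sequence to recurse over
    select row_number() over (order by fileid) as seq,
           fileid, filesize
    from zfiles
),
cte as (
    select n.seq, n.fileid, n.filesize, n.filesize as batch_filesize, 1 as batchnum
    from numbered n
    where n.seq = 1
    union all
    select n.seq, n.fileid, n.filesize,
           (case when cte.batch_filesize + n.filesize > @intBatchSize
                 then n.filesize
                 else cte.batch_filesize + n.filesize
            end),
           (case when cte.batch_filesize + n.filesize > @intBatchSize
                 then cte.batchnum + 1
                 else cte.batchnum
            end)
    from cte join
         numbered n
         on n.seq = cte.seq + 1
)
select *
from cte;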
A small modification to Gordon Linoff's answer above gives the expected result:
DECLARE @intBatchSize INT = 100

;WITH cte as (
    select z.fileid, z.filesize, z.filesize as batch_filesize, 1 as batchnum
    from zfiles z
    where z.fileid = 1
    union all
    select z.fileid, z.filesize,
           (case when cte.batch_filesize >= @intBatchSize
                 then z.filesize
                 else cte.batch_filesize + z.filesize
            end),
           (case when cte.batch_filesize >= @intBatchSize
                 then cte.batchnum + 1
                 else cte.batchnum
            end)
    from cte join
         zfiles z
         on z.fileid = cte.fileid + 1
)
select *
from cte;
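One practical note for the large recordset mentioned in the question: SQL Server caps recursive CTEs at 100 recursion levels by default, and this CTE advances one file per level, so for more than about 100 files the final SELECT needs the limit lifted:
select *
from cte
option (maxrecursion 0); -- 0 removes the default limit of 100 recursion levels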

How to return percentage in PostgreSQL?

CREATE TEMPORARY TABLE
CREATE TEMP TABLE percentage(
    gid SERIAL,
    zoom smallint NOT NULL,
    x smallint NOT NULL,
    y smallint NOT NULL
);
INSERT DATA
INSERT INTO percentage(zoom, x, y) VALUES
(0,5,20),
(0,5,21), (0,5,21),
(0,5,22), (0,5,22), (0,5,22),
(0,5,23), (0,5,23), (0,5,23), (0,5,23),
(0,5,24), (0,5,24), (0,5,24), (0,5,24), (0,5,24),
(1,5,20),
(1,5,21), (1,5,21),
(1,5,22), (1,5,22), (1,5,22),
(1,5,23), (1,5,23), (1,5,23), (1,5,23),
(1,5,24), (1,5,24), (1,5,24), (1,5,24), (1,5,24);
How many times a certain tile shows up (a tile is represented by x and y):
SELECT zoom, x, y, count(*) AS amount
FROM percentage
GROUP BY zoom,x,y
ORDER BY zoom, amount;
Result:
 zoom | x | y  | amount
------+---+----+--------
    0 | 5 | 20 |      1
    0 | 5 | 21 |      2
    0 | 5 | 22 |      3
    0 | 5 | 23 |      4
    0 | 5 | 24 |      5
    1 | 5 | 20 |      1
    1 | 5 | 21 |      2
    1 | 5 | 22 |      3
    1 | 5 | 23 |      4
    1 | 5 | 24 |      5
(10 rows)
Question
How do I get back the percentage of each tile (x and y) within a certain zoom, or in other words, what share of all occurrences for that zoom does each tile account for?
Wanted result:
 zoom | x | y  | amount | percentage
------+---+----+--------+------------
    0 | 5 | 20 |      1 |      6.667
    0 | 5 | 21 |      2 |     13.333
    0 | 5 | 22 |      3 |         20
    0 | 5 | 23 |      4 |     26.667
    0 | 5 | 24 |      5 |     33.333
    1 | 5 | 20 |      1 |      6.667
    1 | 5 | 21 |      2 |     13.333
    1 | 5 | 22 |      3 |         20
    1 | 5 | 23 |      4 |     26.667
    1 | 5 | 24 |      5 |     33.333
(10 rows)
*This is just sample data; the percentages are not supposed to come out identical for every zoom, except by pure coincidence!
If I am not wrong, you are looking for this:
SELECT zoom, x, y,
       amount,
       ( amount / Cast(Sum(amount) OVER(partition BY zoom) AS FLOAT) ) * 100 as amt_percentage
FROM (SELECT zoom, x, y,
             Count(*) AS amount
      FROM percentage
      GROUP BY zoom, x, y) a
Or even:
SELECT zoom, x, y,
       Count(*) AS amount,
       ( Count(*) / Cast(Sum(Count(*)) OVER(partition BY zoom) AS FLOAT) ) * 100 AS amt_percentage
FROM percentage
GROUP BY zoom, x, y
Casting the denominator to FLOAT avoids integer division.
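To match the three-decimal percentages in the wanted result, a sketch of one variant: multiplying by a numeric literal also sidesteps integer division, and round(numeric, int) is a standard Postgres function.
SELECT zoom, x, y,
       Count(*) AS amount,
       -- 100.0 forces numeric division; round(..., 3) trims to three decimals
       round(100.0 * Count(*) / Sum(Count(*)) OVER (PARTITION BY zoom), 3) AS percentage
FROM percentage
GROUP BY zoom, x, y
ORDER BY zoom, amount;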