PostgreSQL, Cumulative amount with interval - sql

Hello there, I have this example dataset:
employee_id | amount | cumulative_amount
-------------+------------+-----------------
2 | 100 | 100
6 | 220 | 320
7 | 45 | 365
8 | 50 | 415
9 | 110 | 525
16 | 300 | 825
17 | 250 | 1075
18 | 200 | 1275
And an interval, let's say 300.
I'd like to pick only the rows that match the interval, with this condition:
Pick a row if its cumulative_amount is >= the previously picked value + interval
(e.g. if the start value = 100, the next matching row is the one where cumulative_amount >= 400, and so on):
employee_id | amount | cumulative_amount
-------------+------------+-----------------
2 | 100 | 100 <-- $Start
6 | 220 | 320 - 400
7 | 45 | 365 - 400
8 | 50 | 415 <-- 1
9 | 110 | 525 - 715 (prev value (415)+300)
16 | 300 | 825 <-- 2
17 | 250 | 1075 - 1125 (825+300)
18 | 200 | 1275 <-- 3
So the final result would be:
employee_id | amount | cumulative_amount
-------------+------------+-----------------
2 | 100 | 100
8 | 50 | 415
16 | 300 | 825
18 | 200 | 1275
How can I achieve this in PostgreSQL in the most efficient way?
The column cumulative_amount is a progressive sum of the column amount,
and it's calculated in another query, whose result is the dataset above; the table is ordered by employee_id.
Regards.

Not saying it is the most effective way, but it is probably the easiest:
s=# create table s1(a int, b int, c int);
CREATE TABLE
Time: 10.262 ms
s=# copy s1 from stdin delimiter '|';
...
s=# with g as (select generate_series(100,1300,300) s)
, o as (select *,sum(b) over (order by a) from s1)
, c as (select *, min(sum) over (partition by g.s)
from o
join g on sum >= g.s and sum < g.s + 300
)
select a,b,sum from c
where sum = min
;
a | b | sum
----+-----+------
2 | 100 | 100
8 | 50 | 415
16 | 300 | 825
17 | 250 | 1075
(4 rows)
Here I used ORDER BY a since you said your cumulative sum is ordered by the first column (which reconciles with the third column).
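Note that the fixed grid restarts at multiples of 300, which is why the last row above is 17 | 250 | 1075 while the question's expected output has 18 | 200 | 1275. A sketch that follows the stated rule literally (next pick = first row whose cumulative sum is >= previous pick + 300), using a recursive CTE over the same s1(a, b) table (untested):
WITH RECURSIVE o AS (
   SELECT a, b, sum(b) OVER (ORDER BY a) AS s
   FROM   s1
), r AS (
   (SELECT a, b, s FROM o ORDER BY s LIMIT 1)   -- start at the first row
   UNION ALL
   SELECT nxt.a, nxt.b, nxt.s
   FROM   r
   CROSS  JOIN LATERAL (
      SELECT a, b, s
      FROM   o
      WHERE  o.s >= r.s + 300    -- first row at or past previous pick + interval
      ORDER  BY o.s
      LIMIT  1
   ) nxt
)
SELECT a AS employee_id, b AS amount, s AS cumulative_amount
FROM   r;
This should return 100, 415, 825 and 1275, matching the expected result.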


SQL, Update with most recent data info

I have 2 tables shown below:
Table 1
Student ID - DATE_NO - SCORE
Table 2
STUDENT_ID - DATE_NO - HT - WT
Table 1 has the physical test scores and the date of the test for each student while Table 2 lists their height (HT) and weight (WT) and the date they were measured.
Example Data:
Table 1
Student ID | DATE_NO | SCORE |
125 | 3 | 90 |
572 | 6 | 75 |
687 | 11 | 95 |
Table 2
Student_ID | DATE_NO | HT | WT |
125 | 2 | 70 | 150 |
125 | 3 | 72 | 155 |
125 | 6 | 72 | 160 |
572 | 2 | 70 | 200 |
572 | 5 | 70 | 225 |
572 | 8 | 70 | 215 |
572 | 9 | 70 | 220 |
687 | 4 | 65 | 140 |
687 | 7 | 67 | 150 |
687 | 11 | 70 | 155 |
687 | 12 | 67 | 160 |
I am not guaranteed to have the exact same DATE_NO for both HT/WT and the Test score date. I want the most recent HT and WT for each student when they took their physical test. Based on the example data above, the optimal join would give me the table below:
Modified Table 1
Student ID | DATE_NO | HT | WT |
125 | 3 | 72 | 155 |
572 | 6 | 70 | 225 |
687 | 11 | 70 | 155 |
I'd like to use an UPDATE statement on Table 1, so after altering Table 1 to add HT int and WT int, I attempt the following:
UPDATE T1
SET HT = T2.HT, WT = T2.WT
FROM Table_1 as T1
INNER JOIN Table_2 AS T2 ON T1.STUDENT_ID = T2.STUDENT_ID
WHERE (T1.DATE_NO) >= (T2.DATE_NO)
But the result gives me the FIRST record that meets the criteria. Switching greater-than to less-than (>= to <=) makes the HT/WT for each student the entries for months 6, 8, and 12, when it should be months 3, 5, and 11. Any suggestions?
FYI: Won't be able to apply any solutions till Friday.
Is it something like this you're looking for?
UPDATE Q
SET    T1_HT = T2_HT,
       T1_WT = T2_WT
FROM (
    SELECT
        T1.HT AS T1_HT,
        T1.WT AS T1_WT,
        T2.HT AS T2_HT,
        T2.WT AS T2_WT,
        ROW_NUMBER() OVER (PARTITION BY T1.STUDENT_ID ORDER BY T2.DATE_NO DESC) AS R
    FROM Table_1 T1
    JOIN Table_2 T2
      ON T1.STUDENT_ID = T2.STUDENT_ID
     AND T2.DATE_NO <= T1.DATE_NO
) Q
WHERE R = 1
Alternatively, a plain SELECT with a correlated subquery picking the latest measurement per test:
SELECT ts.student_id,
       ts.date_no,
       hw.ht,
       hw.wt
FROM   test_scores ts,
       ht_wt hw
WHERE  hw.student_id = ts.student_id
AND    hw.date_no <= ts.date_no
AND    hw.date_no =
       (SELECT max(date_no)
        FROM   ht_wt
        WHERE  date_no <= ts.date_no
        AND    student_id = ts.student_id)
sql fiddle here
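If this is SQL Server (the UPDATE ... FROM syntax suggests so), another option is CROSS APPLY with TOP (1), which fetches the most recent measurement at or before each test date directly. A sketch under the same assumptions (tables Table_1/Table_2 and the newly added HT/WT columns on Table_1):
UPDATE T1
SET    HT = x.HT,
       WT = x.WT
FROM   Table_1 AS T1
CROSS APPLY (
    SELECT TOP (1) T2.HT, T2.WT
    FROM   Table_2 AS T2
    WHERE  T2.STUDENT_ID = T1.STUDENT_ID
      AND  T2.DATE_NO   <= T1.DATE_NO
    ORDER  BY T2.DATE_NO DESC   -- most recent measurement wins
) AS x;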

Record batching on bases of running total values by specific number (FileSize wise batching)

We are dealing with a large recordset and are currently using NTILE() to get ranges of FileIDs, then using the FileID column in a BETWEEN clause to get a specific record set. Using FileID in the BETWEEN clause is a mandatory requirement from the developers, so we cannot have random FileIDs in one batch; they have to be incremental.
As per new requirement, we have to make range based on FileSize column, e.g. 100 GB per batch.
For example:
Batch 1: FileID 1 has size 100, so it is a batch on its own (1 record only).
Batch 2: FileIDs 2,3,4,5 total 80, which is < 100 GB, so we also have to take FileID 6 with 120 GB (200 GB in total for the batch).
Batch 3: FileID 7 alone is > 100 GB, so 1 record only.
And so on…
Below is my sample code, but it is not giving the expected result:
CREATE TABLE zFiles
(
FileId INT
,FileSize INT
)
INSERT INTO dbo.zFiles (
FileId
,FileSize
)
VALUES (1, 100)
,(2, 20)
,(3, 20)
,(4, 30)
,(5, 10)
,(6, 120)
,(7, 400)
,(8, 50)
,(9, 100)
,(10, 60)
,(11, 40)
,(12, 5)
,(13, 20)
,(14, 95)
,(15, 40)
DECLARE @intBatchSize FLOAT = 100;
SELECT y.FileID,
       y.FileSize,
       y.RunningTotal,
       DENSE_RANK() OVER (ORDER BY CEILING(RunningTotal / @intBatchSize)) AS Batch
FROM (SELECT i.FileID,
             i.FileSize,
             RunningTotal = SUM(i.FileSize) OVER (ORDER BY i.FileID) -- RANGE UNBOUNDED PRECEDING
      FROM dbo.zFiles AS i WITH (NOLOCK)
     ) y
ORDER BY y.FileID;
Result:
+--------+----------+--------------+-------+
| FileID | FileSize | RunningTotal | Batch |
+--------+----------+--------------+-------+
| 1 | 100 | 100 | 1 |
| 2 | 20 | 120 | 2 |
| 3 | 20 | 140 | 2 |
| 4 | 30 | 170 | 2 |
| 5 | 10 | 180 | 2 |
| 6 | 120 | 300 | 3 |
| 7 | 400 | 700 | 4 |
| 8 | 50 | 750 | 5 |
| 9 | 100 | 850 | 6 |
| 10 | 60 | 910 | 7 |
| 11 | 40 | 950 | 7 |
| 12 | 5 | 955 | 7 |
| 13 | 20 | 975 | 7 |
| 14 | 95 | 1070 | 8 |
| 15 | 40 | 1110 | 9 |
+--------+----------+--------------+-------+
Expected Result:
+--------+---------------+---------+
| FileID | FileSize (GB) | BatchNo |
+--------+---------------+---------+
| 1 | 100 | 1 |
| 2 | 20 | 2 |
| 3 | 20 | 2 |
| 4 | 30 | 2 |
| 5 | 10 | 2 |
| 6 | 120 | 2 |
| 7 | 400 | 3 |
| 8 | 50 | 4 |
| 9 | 100 | 4 |
| 10 | 60 | 5 |
| 11 | 40 | 5 |
| 12 | 5 | 6 |
| 13 | 20 | 6 |
| 14 | 95 | 6 |
| 15 | 40 | 7 |
+--------+---------------+---------+
We can achieve this if somehow we can reset the running total once it gets over 100. We can write a loop to have this result, but for that we need to go record by record, which is time consuming.
Can somebody please help us with this?
You need to do this with a recursive CTE:
with cte as (
select z.fileid, z.filesize, z.filesize as batch_filesize, 1 as batchnum
from zfiles z
where z.fileid = 1
union all
select z.fileid, z.filesize,
(case when cte.batch_filesize + z.filesize > @intBatchSize
then z.filesize
else cte.batch_filesize + z.filesize
end),
(case when cte.batch_filesize + z.filesize > @intBatchSize
then cte.batchnum + 1
else cte.batchnum
end)
from cte join
zfiles z
on z.fileid = cte.fileid + 1
)
select *
from cte;
Note: I realize that fileid probably is not a sequence. You can create a sequence using row_number() in a CTE, to make this work.
There is a technical reason why running sums don't work for this. Essentially, any given fileid needs to know the breaks before it.
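A minimal sketch of the row_number() normalization mentioned in the note above, assuming fileid may have gaps (the zseq name is hypothetical); the recursion would then advance on seq instead of fileid + 1:
with zseq as (
    select row_number() over (order by fileid) as seq,
           fileid,
           filesize
    from zfiles
)
select * from zseq;
-- in the recursive CTE above, join on z.seq = cte.seq + 1 instead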
A small modification to Gordon Linoff's answer above gives the expected result:
DECLARE @intBatchSize INT = 100
;WITH cte as (
select z.fileid, z.filesize, z.filesize as batch_filesize, 1 as batchnum
from zfiles z
where z.fileid = 1
union all
select z.fileid, z.filesize,
(case when cte.batch_filesize >= @intBatchSize
then z.filesize
else cte.batch_filesize + z.filesize
end),
(case when cte.batch_filesize >= @intBatchSize
then cte.batchnum + 1
else cte.batchnum
end)
from cte join
zfiles z
on z.fileid = cte.fileid + 1
)
select *
from cte;
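One caveat for the large recordsets mentioned in the question: SQL Server recursive CTEs stop at 100 recursion levels by default, and this CTE recurses once per row, so the limit has to be lifted when selecting from it:
select *
from cte
option (maxrecursion 0); -- 0 removes the recursion limit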

Select well spread points from a big table

I'm trying to write a stored procedure for selecting X well-spread points in time from a big table.
I have a table points:
"Userid" integer
, "Time" timestamp with time zone
, "Value" integer
It contains hundreds of millions of records, and about a million records per user.
I want to select X points (let's say 50), all well spread from time A to time B. The problem is that the points are not spread equally (if one point is at 6:00:00, the next point may come 15 seconds, 20 seconds, or 4 minutes later, for example).
Selecting all the points for an id can take up to 60 seconds (because there are about a million points).
Is there any way to select exactly the number of points I desire, as well spread as possible, in a fast way?
Sample data:
+----+--------+---------------------+-------+
| i  | UserId | Time                | Value |
+----+--------+---------------------+-------+
1 | 1 | 2017-04-10 14:00:00 | 1 |
2 | 1 | 2017-04-10 14:00:10 | 10 |
3 | 1 | 2017-04-10 14:00:20 | 32 |
4 | 1 | 2017-04-10 14:00:35 | 80 |
5 | 1 | 2017-04-10 14:00:58 | 101 |
6 | 1 | 2017-04-10 14:01:00 | 203 |
7 | 1 | 2017-04-10 14:01:30 | 204 |
8 | 1 | 2017-04-10 14:01:40 | 205 |
9 | 1 | 2017-04-10 14:02:02 | 32 |
10 | 1 | 2017-04-10 14:02:15 | 7 |
11 | 1 | 2017-04-10 14:02:30 | 900 |
12 | 1 | 2017-04-10 14:02:45 | 22 |
13 | 1 | 2017-04-10 14:03:00 | 34 |
14 | 1 | 2017-04-10 14:03:30 | 54 |
15 | 1 | 2017-04-10 14:04:00 | 54 |
16 | 1 | 2017-04-10 14:06:00 | 60 |
17 | 1 | 2017-04-10 14:07:20 | 654 |
18 | 1 | 2017-04-10 14:07:40 | 32 |
19 | 1 | 2017-04-10 14:08:00 | 33 |
20 | 1 | 2017-04-10 14:08:12 | 32 |
21 | 1 | 2017-04-10 14:10:00 | 8 |
+----+--------+---------------------+-------+
I want to select the 11 "best" points from the list above, for the user with id 1,
from 2017-04-10 14:00:00 to 2017-04-10 14:10:00.
Currently it's done on the server, after selecting all the points for the user.
I calculate the "best times" by dividing the time range evenly, getting a list such as 14:00:00, 14:01:00, ..., 14:10:00 (11 "best times", one per point). Then I look for the closest point to each "best time" that has not been selected yet.
The result will be points: 1, 6, 9, 13, 15, 16, 17, 18, 19, 20, 21
Edit:
I'm trying something like this:
SELECT * FROM "points"
WHERE "Userid" = 1 AND
(("Time" =
(SELECT "Time" FROM
"points"
ORDER BY abs(extract(epoch from '2017-04-10 14:00:00' - "Time"))
LIMIT 1)) OR
("Time" =
(SELECT "Time" FROM
"points"
ORDER BY abs(extract(epoch from '2017-04-10 14:01:00' - "Time"))
LIMIT 1)) OR
("Time" =
(SELECT "Time" FROM
"points"
ORDER BY abs(extract(epoch from '2017-04-10 14:02:00' - "Time"))
LIMIT 1)))
The problems here are that:
A) It doesn't take into account points that have already been selected.
B) Because of the ORDER BY, each additional time increases the running time of the query by ~1 second, and for 50 points I'm back at the 1-minute mark.
There is an optimization problem behind your question that's hard to solve with just SQL.
That said, your approximation can be implemented to use an index and show good performance regardless of table size. You need this index if you don't have it already:
CREATE INDEX ON points ("Userid", "Time");
Query:
SELECT *
FROM generate_series(timestamptz '2017-04-10 14:00:00+0'
, timestamptz '2017-04-10 14:09:00+0' -- 1 min *before* end!
, interval '1 minute') grid(t)
LEFT JOIN LATERAL (
SELECT *
FROM points
WHERE "Userid" = 1
AND "Time" >= grid.t
AND "Time" < grid.t + interval '1 minute' -- same interval
ORDER BY "Time"
LIMIT 1
) t ON true;
dbfiddle here
Most importantly, the rewritten query can use the above index and will be very fast, solving problem B).
It also addresses problem A) to some extent as no point is returned more than once. If there is no row between two adjacent points in the grid, you get no row in the result. Using LEFT JOIN .. ON true keeps all grid rows and appends NULL in this case. Eliminate those NULL rows by switching to CROSS JOIN. You may get fewer result rows this way.
So far this only searches ahead of each grid point. You might append a second LATERAL join to also search behind each grid point (just another index scan), and take the closer of the two results (ignoring NULLs). But that introduces two problems:
If one match is behind and the next is ahead, the gap widens.
You need special treatment for the lower and/or upper bounds of the outer interval.
And you need two LATERAL joins with two index scans.
You could use a recursive CTE to search 1 minute ahead of the last time actually found, but then the total number of rows returned varies even more.
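For reference, a sketch of that recursive variant (untested; same points table, index, and time bounds as above). Each step takes the first point at least 1 minute after the previously found one, so the number of returned rows depends on the data:
WITH RECURSIVE walk AS (
   (SELECT "Userid", "Time", "Value"
    FROM   points
    WHERE  "Userid" = 1
    AND    "Time" >= timestamptz '2017-04-10 14:00:00+0'
    ORDER  BY "Time"
    LIMIT  1)
   UNION ALL
   SELECT nxt."Userid", nxt."Time", nxt."Value"
   FROM   walk w
   CROSS  JOIN LATERAL (
      SELECT "Userid", "Time", "Value"
      FROM   points
      WHERE  "Userid" = 1
      AND    "Time" >= w."Time" + interval '1 minute'  -- 1 min after last hit
      AND    "Time" <= timestamptz '2017-04-10 14:10:00+0'
      ORDER  BY "Time"
      LIMIT  1
   ) nxt
)
SELECT * FROM walk;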
It all comes down to an exact definition of what you need, and where compromises are allowed.
Related:
What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
Aggregating the most recent joined records per week
MySQL/Postgres query 5 minutes interval data
Optimize GROUP BY query to retrieve latest row per user
This answer uses generate_series('2017-04-10 14:00:00','2017-04-10 14:10:00','1 minute'::interval) and a join for comparison.
To save others time with the data set:
t=# create table points(i int,"UserId" int,"Time" timestamp(0), "Value" int,b text);
CREATE TABLE
Time: 13.728 ms
t=# copy points from stdin delimiter '|';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1 | 1 | 2017-04-10 14:00:00 | 1 |
>> 2 | 1 | 2017-04-10 14:00:10 | 10 |
>> 3 | 1 | 2017-04-10 14:00:20 | 32 |
>> 4 | 1 | 2017-04-10 14:00:35 | 80 |
>> 5 | 1 | 2017-04-10 14:00:58 | 101 |
>> 6 | 1 | 2017-04-10 14:01:00 | 203 |
>> 7 | 1 | 2017-04-10 14:01:30 | 204 |
>> 8 | 1 | 2017-04-10 14:01:40 | 205 |
>> 9 | 1 | 2017-04-10 14:02:02 | 32 |
>> 10 | 1 | 2017-04-10 14:02:15 | 7 |
>> 11 | 1 | 2017-04-10 14:02:30 | 900 |
>> 12 | 1 | 2017-04-10 14:02:45 | 22 |
>> 13 | 1 | 2017-04-10 14:03:00 | 34 |
>> 14 | 1 | 2017-04-10 14:03:30 | 54 |
>> 15 | 1 | 2017-04-10 14:04:00 | 54 |
>> 16 | 1 | 2017-04-10 14:06:00 | 60 |
>> 17 | 1 | 2017-04-10 14:07:20 | 654 |
>> 18 | 1 | 2017-04-10 14:07:40 | 32 |
>> 19 | 1 | 2017-04-10 14:08:00 | 33 |
>> 20 | 1 | 2017-04-10 14:08:12 | 32 |
>> 21 | 1 | 2017-04-10 14:10:00 | 8 |
>> \.
COPY 21
Time: 7684.259 ms
t=# alter table points rename column "UserId" to "Userid";
ALTER TABLE
Time: 1.013 ms
Frankly, I don't understand the request. This is how I read it from the description, and the results differ from those expected by the OP:
t=# with r as (
with g as (
select generate_series('2017-04-10 14:00:00','2017-04-10 14:10:00','1 minute'::interval) s
)
select *,abs(extract(epoch from '2017-04-10 14:02:00' - "Time"))
from g
join points on g.s = date_trunc('minute',"Time")
order by abs
limit 11
)
select i, "Time","Value",abs
from r
order by i;
i | Time | Value | abs
----+---------------------+-------+-----
4 | 2017-04-10 14:00:35 | 80 | 85
5 | 2017-04-10 14:00:58 | 101 | 62
6 | 2017-04-10 14:01:00 | 203 | 60
7 | 2017-04-10 14:01:30 | 204 | 30
8 | 2017-04-10 14:01:40 | 205 | 20
9 | 2017-04-10 14:02:02 | 32 | 2
10 | 2017-04-10 14:02:15 | 7 | 15
11 | 2017-04-10 14:02:30 | 900 | 30
12 | 2017-04-10 14:02:45 | 22 | 45
13 | 2017-04-10 14:03:00 | 34 | 60
14 | 2017-04-10 14:03:30 | 54 | 90
(11 rows)
I added the abs column to justify why I thought those rows fit the request better.

Hive, ordering lines using a variable lag

I have the following hive table:
product | price
A | 100
B | 102
C | 220
D | 240
E | 242
F | 410
For every line I would like to divide the lower price by the current price. If the result is at least 0.9, I would like to increment a row number; if the result is lower than 0.9, the row number should be reset to 1 for this line, the current price becomes the new lower price, and the iteration continues.
Result should look like:
product | price | row_number
A | 100 | 1
B | 102 | 2
C | 220 | 1
D | 240 | 2
E | 242 | 3
F | 410 | 1
Because:
lower price = 100: product A gets 1 as row_number
100/102 >= 0.9: product B gets 2 as row_number
100/220 < 0.9: product C gets 1 as row_number, lower price = 220
220/240 >= 0.9: product D gets 2 as row_number
220/242 >= 0.9: product E gets 3 as row_number
220/410 < 0.9: product F gets 1 as row_number, lower price = 410
I was thinking about creating a temporary_row_number just ordered by price:
product | price | temp_row_number
A | 100 | 1
B | 102 | 2
C | 220 | 3
D | 240 | 4
E | 242 | 5
F | 410 | 6
And then:
Select
    product,
    price,
    case
        when lag(price, temp_row_number - 1, 0) over () / price >= 0.9
            then lag(price, temp_row_number - 1, 0) over ()
        else price
    end as test
from my_table
This will retrieve:
product | price | test
A | 100 | 100
B | 102 | 100
C | 220 | 220
D | 240 | 240
E | 242 | 242
F | 410 | 410
But ideally I would like to retrieve
product | price | test
A | 100 | 100
B | 102 | 100
C | 220 | 220
D | 240 | 220
E | 242 | 220
F | 410 | 410
So I could compute the row_number using the row_number() function ordered by product and price, and get the expected result.
WITH CTE AS
(
    select product, price,
           (case when price between 100 and 200 then 1
                 when price between 200 and 300 then 2
                 when price between 300 and 400 then 3
            end) AS RN
    FROM #test
)
SELECT Product, Price, ROW_NUMBER() OVER (PARTITION BY RN ORDER BY RN)
FROM CTE
ORDER BY Product
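The anchor-reset logic is inherently sequential: each row's group depends on where the previous reset happened, which a fixed window frame or hardcoded buckets (as in the answer above) cannot express in general. Hive has no recursive CTEs, but for reference, here is a sketch of the exact rule in PostgreSQL syntax (an illustration of the logic only, assuming my_table(product, price) from the question):
WITH RECURSIVE t AS (
    SELECT product, price,
           row_number() OVER (ORDER BY price) AS rn
    FROM my_table
), r AS (
    SELECT product, price, rn,
           price AS anchor,   -- the current "lower price"
           1 AS rnum
    FROM t
    WHERE rn = 1
    UNION ALL
    SELECT t.product, t.price, t.rn,
           CASE WHEN r.anchor / t.price::numeric >= 0.9
                THEN r.anchor ELSE t.price END,  -- reset anchor when ratio < 0.9
           CASE WHEN r.anchor / t.price::numeric >= 0.9
                THEN r.rnum + 1 ELSE 1 END       -- restart numbering on reset
    FROM r
    JOIN t ON t.rn = r.rn + 1
)
SELECT product, price, rnum AS row_number
FROM r;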

SQL Server 2008 - accumulating column

I would like to accumulate my data; as you can see below, there is the origin table, Table 1.
What is the best query to do this?
Is it possible to do this dynamically, for when I add more types of terms?
Table 1
ID | term | value
-----------------------
1 | I | 100
2 | I | 200
3 | II | 100
4 | II | 50
5 | II | 75
6 | III | 50
7 | III | 65
8 | IV | 30
9 | IV | 45
And the result should be like below:
YTD | Acc Value
------------------
I-I | 300
I-II | 525
I-III| 640
I-IV | 715
Thanks
select
    (select min(term) from yourtable) + '-' + term,
    (select sum(value) from yourtable t1 where t1.term <= t.term)
from yourtable t
group by term
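For what it's worth, the question targets SQL Server 2008, which lacks ordered window aggregates; from SQL Server 2012 onward the running total can be written without the correlated subquery. A sketch, assuming the same yourtable(term, value):
SELECT
    (SELECT MIN(term) FROM yourtable) + '-' + term AS YTD,
    SUM(SUM(value)) OVER (ORDER BY term ROWS UNBOUNDED PRECEDING) AS [Acc Value]
FROM yourtable
GROUP BY term
ORDER BY term;
Nothing here is hardcoded per term, so new term types are picked up automatically.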