Optimizing a Vertica SQL query to do running totals

I have a table S with time series data like this:
key day delta
For a given key, it's possible but unlikely that days will be missing.
I'd like to construct a cumulative column from the delta values (positive INTs), for the purposes of inserting this cumulative data into another table. This is what I've got so far:
SELECT key, day,
       SUM(delta) OVER (PARTITION BY key
                        ORDER BY day ASC
                        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
       delta
FROM S
In my SQL flavor, the default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, but I left it in there to be explicit.
This query is really slow: an order of magnitude slower than the old broken query, which just filled in 0s for the cumulative count. Any suggestions for other methods to generate the cumulative numbers?
I did look at the solutions here:
Running total by grouped records in table
The RDBMS I'm using is Vertica. Vertica SQL precludes the first subselect solution there, and its query planner predicts that the second LEFT OUTER JOIN solution is about 100 times more costly than the analytic form I show above.

I think you're essentially there. You may just need to update the syntax a bit:
SELECT s_qty,
       SUM(s_price) OVER (PARTITION BY NULL
                          ORDER BY s_qty ASC
                          ROWS UNBOUNDED PRECEDING) AS "Cumulative Sum"
FROM sample_sales;
Output:
S_QTY | Cumulative Sum
------+---------------
    1 |           1000
  100 |          11000
  150 |          26000
  200 |          28000
  250 |          53000
  300 |          83000
 2000 |         103000
(7 rows)
reference link:
https://dwgeek.com/vertica-cumulative-sum-average-and-example.html/
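Applied to the table S from the question, the same pattern looks like this (a sketch; ROWS rather than RANGE avoids the peer-row handling a RANGE frame implies, which is often cheaper when (key, day) is unique anyway):
-- Same running total as in the question, but with a ROWS frame.
SELECT key, day, delta,
       SUM(delta) OVER (PARTITION BY key
                        ORDER BY day ASC
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_delta
FROM S;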

Sometimes it's faster to just use a correlated subquery. Note the correlation has to be on the same key, taking all days up to the current row's day (the original [bracket] quoting is SQL Server style; standard double quotes are used here):
SELECT
    "key"
  , "day"
  , delta
  , (SELECT SUM(delta)
     FROM S s2
     WHERE s2."key" = t1."key"   -- same series
       AND s2."day" <= t1."day"  -- everything up to and including this day
    ) AS DeltaSum
FROM S t1
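If the analytic form from the question is the real bottleneck, one Vertica-specific thing worth trying is a projection pre-sorted to match the window. This is only a sketch (the projection name is illustrative and segmentation options are omitted; check the Vertica docs before using it):
-- A projection sorted on (key, day) can let Vertica feed the analytic SUM
-- in pre-sorted order instead of sorting at query time.
CREATE PROJECTION s_by_key_day AS
SELECT key, day, delta
FROM S
ORDER BY key, day;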


Partition rows based on percent value of the data range

What will the SELECT query be if I want to rank and partition rows based on a percent range of the partitioning column?
For example, let's say I have the table structure below (the 'Percent_Rank' column is the one that needs to be populated).
I want to rank the rows by score, but only against the rows whose amount is within +/- 10% of the current row's amount. That is, for the first row the amount is 2.3, and +/- 10% of 2.3 is 2.07 - 2.53. So when ranking the first row, I should rank by score considering only the rows whose amount falls in the range 2.07 - 2.53 (here, ids 1, 5, and 11). Based on this logic the percentile rank in the last column is populated, and the rank for the first row is 0.5. The same steps are then performed for each row.
The question is how I can do this with PERCENT_RANK() or RANK() or NTILE() with a partition clause as part of a SELECT query. The original table does not have the last column; that is the column that needs to be populated. I need the percentile ranking of each row based on its score within the 10% amount range.
PRODUCT_ID | Amount | Score | Percent_Rank
-----------+--------+-------+-------------
         1 |    2.3 |    45 |         0.5
         2 |    2.7 |    30 |           0
         3 |    2.0 |    40 |         0.5
         4 |    2.6 |    50 |           1
         5 |    2.2 |    35 |           0
         6 |    5.1 |    25 |           0
         7 |    4.8 |    40 |           1
         8 |    6.1 |    60 |           0
         9 |   22.1 |    70 |        0.33
        10 |    8.2 |    20 |           0
        11 |    2.1 |    50 |           1
        12 |   22.2 |    60 |           0
        13 |   22.3 |    80 |           1
        14 |   22.4 |    75 |        0.66
I tried using PERCENT_RANK() OVER (PARTITION BY ...), but it does not consider the range. I cannot use a RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING frame, because I need the frame to cover only the rows within +/- 10% of the current row's amount.
You may try PERCENT_RANK() with a self-join, like the following:
SELECT PRODUCT_ID, Amount, Score, Percent_Rank
FROM
(
    SELECT A.PRODUCT_ID, A.Amount, A.Score, B.Amount AS B_Amount,
           PERCENT_RANK() OVER (PARTITION BY A.PRODUCT_ID ORDER BY B.Score) AS Percent_Rank
    FROM table_name A
    JOIN table_name B
      ON B.Amount BETWEEN A.Amount - A.Amount*0.1 AND A.Amount + A.Amount*0.1
) T
WHERE Amount = B_Amount
See a demo.
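If injecting a self-join is awkward, the same number can also be computed straight from the PERCENT_RANK() definition, (rank - 1) / (rows - 1), with correlated subqueries. A sketch, reusing the hypothetical table_name from the answer above:
SELECT t.PRODUCT_ID, t.Amount, t.Score,
       -- rows in the +/-10% amount range with a strictly lower score ...
       COALESCE((SELECT COUNT(*) FROM table_name x
                 WHERE x.Amount BETWEEN t.Amount*0.9 AND t.Amount*1.1
                   AND x.Score < t.Score) * 1.0
                -- ... divided by (rows in range - 1); NULLIF guards the
                -- single-row case, which PERCENT_RANK() defines as 0
                / NULLIF((SELECT COUNT(*) FROM table_name x
                          WHERE x.Amount BETWEEN t.Amount*0.9 AND t.Amount*1.1) - 1, 0),
                0) AS Percent_Rank
FROM table_name t;
For the first row (Amount 2.3, Score 45), the in-range rows are ids 1, 5, and 11, one of which has a lower score, giving 1 / 2 = 0.5 as in the expected output.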
I think you can just nest your percent_rank() in a subquery once you have calculated the bucket number based on equally spaced scores.
The trickiest part of this example is actually getting the fixed-width buckets. It would be simpler if we could use width_bucket(), but some databases don't support it, so I had to compute the bucket manually (in the 'bucketed' inline table).
Here is the example. I used Postgres to create the mock-up test table, because it has a very handy generate_series(), but the actual example SQL should run on any database.
create table product_scores as (
select
product_id,
score
from
generate_series(1,2) product_id,
generate_series(1,50) score);
This created a table with two product ids and 50 scores for each one.
with ranges as (
    select
        product_id,
        (max(score) - min(score)) * (1 + 1e-10) as range,
        min(score) as minscore
    from product_scores
    group by product_id),
bucketed as (
    select
        ranges.product_id,
        score,
        floor((score - minscore) * 10.0 / range) as bucket
    from
        ranges
        inner join product_scores
        on ranges.product_id = product_scores.product_id)
select
    product_id,
    score,
    bucket,
    percent_rank() over (partition by product_id, bucket order by score)
from bucketed;
No, the 1e-10 is not a joke. Unfortunately, round-off error would otherwise assign the highest value to a bucket all by itself, so we expand the range by a tiny amount. Once we have a workable range, we can calculate the partition easily enough by checking where each score falls within it.
Then, having the partition (bucket) number, you can do the percent_rank() as usual, as shown.
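For what it's worth, on databases that do have width_bucket() (Postgres among them), the manual range arithmetic above collapses to one call. A sketch; note that width_bucket() numbers buckets from 1 rather than 0, and the tiny widening of the upper bound plays the same role as the 1e-10 above:
select
    product_id,
    score,
    -- 10 equal-width buckets between each product's min and max score
    width_bucket(score,
                 min(score) over w,
                 max(score) over w + 1e-10,
                 10) as bucket
from product_scores
window w as (partition by product_id);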

Calculate average of time difference between consecutive row

What would be the most efficient query to calculate the average of the time differences between consecutive rows in a table? Note that the table has no primary key.
If the table looks like below:
| tran_end_time |
|-----------------------|
|2022-02-08 07:04:46.610|
|2022-02-08 07:09:47.403|
|2022-02-08 07:14:48.100|
|2022-02-08 07:20:03.973|
Then I need the answer to be:
avg('2022-02-08 07:20:03.973' - '2022-02-08 07:14:48.100',
'2022-02-08 07:14:48.100' - '2022-02-08 07:09:47.403',
'2022-02-08 07:09:47.403' - '2022-02-08 07:04:46.610')
We can use DATEDIFF along with LAG:
WITH cte AS (
SELECT tran_end_time,
LAG(tran_end_time) OVER (ORDER BY tran_end_time) AS tran_end_time_lag
FROM yourTable
)
SELECT AVG(DATEDIFF(minute, tran_end_time_lag, tran_end_time)) AS diff_avg
FROM cte
WHERE tran_end_time_lag IS NOT NULL;
Note that the WHERE clause in the final query above ensures that we do not include any diff involving the earliest record.
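Since consecutive differences telescope (their sum is just the last timestamp minus the first), the same average can be computed without a window function at all. A sketch in the same dialect as the answer above, using second granularity to keep sub-minute precision:
-- The gaps sum to MAX - MIN, so the average gap is (MAX - MIN) / (n - 1).
-- NULLIF avoids division by zero on a single-row table.
SELECT DATEDIFF(second, MIN(tran_end_time), MAX(tran_end_time)) * 1.0
       / NULLIF(COUNT(*) - 1, 0) AS avg_diff_seconds
FROM yourTable;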

% of total calculation without subquery in Postgres

I'm trying to create a "Percentage of Total" column and currently using a subquery with no issues:
SELECT ID, COUNT(*),
       COUNT(*) / (SELECT COUNT(*) FROM DATA) AS "% OF TOTAL"
FROM DATA
GROUP BY ID;
| ID | COUNT | % OF TOTAL |
| 1 | 100 | 0.10 |
| 2 | 800 | 0.80 |
| 3 | 100 | 0.10 |
However, for reasons outside the scope of this question, I'm looking to see if there is any way to accomplish this without using a subquery. Essentially, the application uses logic outside of the SQL query to determine the WHERE clause and injects it into the query. That logic does not account for the existence of subqueries like the one above, so before rebuilding all of the existing logic to handle this scenario, I figured I'd see if there's another solution first.
I've tried accomplishing this effect with a window function, but to no avail.
Use window functions:
SELECT ID, COUNT(*),
COUNT(*) / SUM(COUNT(*)) OVER () AS "% OF TOTAL"
FROM DATA
GROUP BY ID;
SELECT id, count(*) AS ct
, round(count(*)::numeric
/ sum(count(*)) OVER (ORDER BY id), 2) AS pct_of_running_total
FROM data
GROUP BY id;
You must add ORDER BY to the window function or the order of rows is arbitrary. It may seem correct at first, but that can change at any time and without warning. It seems you want to order rows by id.
And you obviously don't want integer division, which would truncate fractional digits. I cast to numeric and round the result to two fractional digits like in your result.
Related answer:
Postgres window function and group by exception
Key to understanding why this works is the sequence of events in a SELECT query:
Best way to get result count before LIMIT was applied

SQL statement to match dates that are the closest?

I have the following table, let's call it Names:
Name   Id Date
Dirk   1  27-01-2015
Jan    2  31-01-2015
Thomas 3  21-02-2015
Next I have the another table called Consumption:
Id Date       Consumption
1  26-01-2015 30
1  01-01-2015 20
2  01-01-2015 10
2  05-05-2015 20
I think that doing this in SQL would be fastest, since the table contains about 1.5 million rows.
The problem is as follows: I would like to match each Id from the Names table with the Consumption table such that the difference between the dates is the smallest. So we would have: Dirk consumes on 27-01-2015 about 30. In case there are two dates with the same difference, I would like to calculate the average consumption over those two dates.
While I know how to join, I do not know how to code the difference part.
Thanks.
DBMS is Microsoft SQL Server 2012.
I believe that my question differs from the one mentioned in the comments, because it is more complicated: it involves comparing dates between two tables, rather than having one date and comparing it with the rest of the dates in the same table.
This is how you could do it in SQL Server:
SELECT Id, Name, AVG(Consumption)
FROM (
    SELECT n.Id, Name, Consumption,
           RANK() OVER (PARTITION BY n.Id
                        ORDER BY ABS(DATEDIFF(d, n.[Date], c.[Date]))) AS rnk
    FROM Names AS n
    INNER JOIN Consumption AS c ON n.Id = c.Id ) t
WHERE t.rnk = 1
GROUP BY Id, Name
Using RANK with PARTITION BY n.Id and ORDER BY ABS(DATEDIFF(d, n.[Date], c.[Date])) you can locate all matching records per Id: all records with the smallest difference in days are going to have rnk = 1.
Then, using AVG in the outer query, you are calculating the average value of Consumption between all matching records.
SQL Fiddle Demo
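For the sample data above, this should return the rows below; note that Thomas (Id 3) drops out because he has no Consumption rows at all (a LEFT JOIN instead of the INNER JOIN would keep him with a NULL average):
Id | Name | AvgConsumption
---+------+---------------
 1 | Dirk | 30
 2 | Jan  | 10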

Oracle Running Total

Looking for advice on two different types of subtotals using PL/SQL.
I need to pull a data set with 1) a unique headcount and 2) a total number of credits, as running totals over time.
Raw Data:
This is the transactional data: every time a student registers for a course, a record is inserted with the date, student id, and credits (along with the course number and a bunch of other relevant data). One record per course per student.
STUDENT_ID CREDITS DATE
1          3       01-JAN-12
1          2       02-JAN-12
57         1       03-JAN-12
1          1       03-JAN-12
Processed Data:
This is what the boss needs to see -- it will be used for trending later (to see, for example, how this year's Jan-01 is measuring up against last year's Jan-01, etc.).
UniqueHeadcount SumCredits Date
1               3          01-JAN-12
1               5          02-JAN-12
2               7          03-JAN-12
The brute-force approach is to write a bunch of separate SELECTs (one for each day) and UNION them together. For example:
SELECT
COUNT(DISTINCT STUDENT_ID) as "UniqueHeadcount",
SUM(CREDIT_HR) as "SumCredits",
'01-JAN-12' as "DATE"
FROM
REGISTRATIONS
WHERE
TO_CHAR(DATE,'yyyymmdd') <= '20120101'
GROUP BY
'01-JAN-12'
UNION
SELECT
COUNT(DISTINCT STUDENT_ID) as "UniqueHeadcount",
SUM(CREDIT_HR) as "SumCredits",
'02-JAN-12' as "DATE"
FROM
REGISTRATIONS
WHERE
TO_CHAR(DATE,'yyyymmdd') <= '20120102'
GROUP BY
'02-JAN-12'
UNION
...
And that works -- the results are accurate -- but as you can see, this is nowhere near elegant, and if you have to do it for 365 days... it's a beast. There's got to be a better way to do it.
So far in my search, I've learned about an OVER clause that I can use, like this:
SELECT
COUNT(DISTINCT STUDENT_ID) OVER(ORDER BY TRUNC(RSTS_DATE) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) "UniqueHeadcount",
SUM(CREDIT_HR) OVER(ORDER BY TRUNC(RSTS_DATE) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as "SumCredits",
TRUNC(RSTS_DATE) as "DATE"
FROM
REGISTRATIONS
This query is way, way shorter (yay), but it has two significant problems that I can't find my way around. The first is that it doesn't work (by design, apparently?) with COUNT(DISTINCT ...). So I comment that out for a moment, but then run into the second problem: it appears to ignore the TRUNC() function. The RSTS_DATE, though it looks like just a day/month/year value when you SELECT it, actually holds the time as well, so the result set is not summed simply over the date but over the times too. Instead of one record per day, my processed data returns hundreds of records per day (one for each individual course registration). For example:
UniqueHeadcount SumCredits Date
1               3          01-JAN-12
1               5          02-JAN-12
2               6          03-JAN-12 (hidden time: 07:32:27)
2               7          03-JAN-12 (hidden time: 08:01:33)
Not what I'm after.
So I'm looking for expertise: if what I've explained so far makes sense, is there another way to use the OVER clause, or perhaps another feature of Oracle SQL altogether that I should be using for this? I'm not strong in SQL, if you can't tell, but if anyone can give me some direction -- even just words to google -- I'd appreciate the help.
Thanks
Try this (note that the daily COUNT(DISTINCT ...) has to be restricted to each student's first registration day; otherwise summing the daily counts would re-count a returning student on every later day):
WITH CRdata AS
(
    SELECT TRUNC(RSTS_DATE) AS RSTS_DATE,
           SUM(CREDIT_HR) AS SumCredits,
           -- count a student only on his/her first registration day, so the
           -- running sum below is a true cumulative distinct headcount
           COUNT(CASE WHEN rn = 1 THEN 1 END) AS NewStudents
    FROM (SELECT STUDENT_ID, CREDIT_HR, RSTS_DATE,
                 ROW_NUMBER() OVER (PARTITION BY STUDENT_ID
                                    ORDER BY RSTS_DATE) AS rn
          FROM REGISTRATIONS)
    GROUP BY TRUNC(RSTS_DATE)
)
SELECT SUM(NewStudents) OVER(ORDER BY RSTS_DATE ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS UniqueHeadcount,
       SUM(SumCredits) OVER(ORDER BY RSTS_DATE ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS SumCredits,
       RSTS_DATE
FROM CRdata
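For the sample registrations in the question, this should reproduce the processed data exactly:
UniqueHeadcount SumCredits RSTS_DATE
1               3          01-JAN-12
1               5          02-JAN-12
2               7          03-JAN-12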