Configurable SQL GROUP BY variable - sql

I am currently developing some SQL aggregation queries to calculate data from one source table. The goal is, to have different aggregation granularities in one executable query / function / etc. I am currently developing on PostgreSQL but the code should be ANSI SQL compliant as much as possible to be compatible to most DB variants.
Example:
SELECT
COUNT(a) as amount,
SUM(b) as sum,
c as static_grouping,
#vargr as variable_grouping,
#vardesc as variable_grouping_description
FROM whatever
GROUP BY c, #vargr, #vardesc
#vargr can be date driven like daily, weekly, monthly, ...
#vardesc is the identifier to see aggregation type as text
Having multiple queries with UNION is not an option, since there are multiple grouping statements changing (resulting in 60+ single queries per result set). Is there a way to do this with a function, while loop, etc.?
Thanks for a hint in the right direction, have a good day and stay safe!
Best regards
Christian

If you want to group by varying date granularity, then one option uses date_trunc(). You would typically use one of the supported precision (such as day, week, month and so on) as parameter. Assuming that your date or timestamp column is tscol, you would do:
SELECT
COUNT(a) as amount,
SUM(b) as sum,
c as static_grouping,
DATE_TRUNC($1, tscol) as variable_grouping,
$2 as variable_grouping_description
FROM whatever
GROUP BY c, DATE_TRUNC($1, tscol)

Related

When can aggregate functions be nested in standard SQL?

I know it wasn't allowed in SQL-92. But since then it may have changed, particularly when there's a window applied. Can you explain the changes and give the version (or versions if there were more) in which they were introduced?
Examples
Is SUM(COUNT(votes.option_id)) OVER() valid syntax per standard SQL:2016 (or earlier)?
This is my comment (unanswered, an probably unlikely to in such an old question) in Why can you nest aggregate functions when using a window function in PostgreSQL?.
The Calculating Running Total (SQL) kata at Codewars has as its most upvoted solution (using PostgreSQL 13.0, a highly standard compliant engine, so the code is likely to be standard) this one:
SELECT
CREATED_AT::DATE AS DATE,
COUNT(CREATED_AT) AS COUNT,
SUM(COUNT(CREATED_AT)) OVER (ORDER BY CREATED_AT::DATE ROWS UNBOUNDED PRECEDING)::INT AS TOTAL
FROM
POSTS
GROUP BY
CREATED_AT::DATE
(Which could be simplified to:
SELECT
created_at::DATE date,
COUNT(*) COUNT,
SUM(COUNT(*)) OVER (ORDER BY created_at::DATE)::INT total
FROM posts
GROUP BY created_at::DATE
I assume the ::s are a new syntax for casting I didn't know of. And that casting from TIMESTAMP to DATE is now allowed (in SQL-92 it wasn't).)
As this SO answer explains, Oracle Database allows it even without a window, pulling in the GROUP BY from context. I don't know if the standard allows it.
You already noticed the difference yourself: It's all about the window. COUNT(*) without an OVER clause for instance is an aggregation function. COUNT(*) with an OVER clause is a window function.
By using aggregation functions you condense the original rows you get after the FROM clause and WHERE clause are applied to either the specified group in GROUP BY or to one row in the absence of a GROUP BY clause.
Window functions, aka analytic functions, are applied afterwards. They don't change the number of result rows, but merely add information by looking at all or some rows (the window) of the selected data.
In
SELECT
options.id,
options.option_text,
COUNT(votes.option_id) as vote_count,
COUNT(votes.option_id) / SUM(COUNT(votes.option_id)) OVER() * 100.0 as vote_percentage
FROM options
LEFT JOIN votes on options.id = votes.option_id
GROUP BY options.id;
we first join votes to options and then count the votes per option by aggregating the joined rows down to one result row per option (GROUP BY options.id). We count on a non-nullable column in the votes table (COUNT(votes.option_id), so we get a zero count in case there are no votes, because in an outer joined row this column is set to null.
After aggregating all rows and getting thus one row per option we apply a window function (SUM() OVER) on this result set. We apply the analytic SUM on the vote count (SUM(COUNT(votes.option_id)) by looking at the whole result set (empty OVER clause), thus getting the same total vote count in every row. We use this value for a calculation: option's vote count diveded by total vote count times 100, which is the option's percentage of total votes.
The PostgreSQL query is very similar. We select the number of posts per date (COUNT(created_at) is nothing else than a mere COUNT(*)) along with a running total of these counts (by using a window that looks at all rows up to the current row).
So, while this looks like we are nesting two aggregate functions, this is not really the case, because SUM OVER is not considered an agregation function but an analytic/window function.
Oracle does allow applying an aggregate function directly on another, thus invoking a final aggregation on a previous grouped by aggregation. This allows us to get one result row of, say, the average of sums without having to write a subquery for this. This is not compliant with the SQL standard, however, and very unpopular even among Oracle developers at that.

How does the group by function works in PostgreSQL? (beginner)

I don't know much at all about SQL, I've just toyed with it here and there through the years but never really 'used' it.
I'm trying to get a list of prices / volumes and aggregate them:
CREATE TABLE IF NOT EXISTS test (
ts timestamp without time zone NOT NULL,
price decimal NOT NULL,
volume decimal NOT NULL
);
what I'd like is to extract:
min price
max price
sum volume
sum (price * volume) / sum (volume)
By 1h slices
If I forget about the last line for now, I have:
SELECT MIN(price) min_price, MAX(price) max_price, SUM(volume) sum_vol, date_trunc('hour', ts) ts_group FROM test
GROUP BY ts_group;
My understanding is that 'GROUP BY ts_group' will calculate ts_group, build groups of rows and then apply the MIN / MAX / SUM functions after. Since the syntax doesn't make any sense to me (entries on the select line would be treated differently while being declared together vs. building groups and then declaring operations on the groups), I could be dramatically wrong here.
But that will not return the min_price, max_price and sum_vol results after the grouping; I get ts, price and volume in the results.
If I remove the GROUP BY line to try to see all the output, I get the error:
column "test.ts" must appear in the GROUP BY clause or be used in an aggregate function
Which I don't really understand either...
I looked at:
must appear in the GROUP BY clause or be used in an aggregate function but I don't really get it
and I looked at the doc (https://www.postgresqltutorial.com/postgresql-group-by/) which shows working example, but doesn't really clarify what is wrong with what I'm trying to do here.
While I'd be happy to have a working solution, I'm more looking from an explanation, or pointers toward good resources, that would allow me to understand this.
I have this working solution:
SELECT MIN(price) min_price, MAX(price) max_price, SUM(volume) sum_vol, (SUM(price * volume)/SUM(volume)) vwap FROM test
GROUP BY date_trunc('hour', ts);
but I still don't understand the error message from my question
All of your expressions in SQL must use data elements and functions that are known to PostgreSQL. In your first example, ts_group is neither an element of your table, nor a defined function, so it complained that it did not know how to calculate it.
Your second example works because date_trunc is a known function and ts is defined as a data element of the test table.
It also gets you the correct grouping (by hour intervals) because date_trunc 'blurs' all of those unique timestamps that by themselves would not combine into groups.
Without a GROUP BY, then having any aggregates in your select list means it will aggregate everything down to just one row. But how does it aggregate date_trunc('hour', ts) down to one row, since there is no aggregating function specified for it? If you were using MySQL, it would just pick some arbitrary value for the column from all the seen values and report that as the "aggregate". But PostgreSQL is not so cavalier with your data. If your query is vague in this way, it refuses to run it. If you just want to see some value from the set without caring which one it is, you could use min or max to aggregate it.
Since the syntax doesn't make any sense to me (entries on the select line would be treated differently while being declared together vs. building groups and then declaring operations on the groups),
You are trying to understand SQL as if it were C. But it is very different. Just learn it for what it is, without trying to force it to be something else. The select list is where you define the columns you want to see in the output. They may be computed in different ways, but what they have in common is that you want each of them to show up in the output, so they are listed together in that spot.

Calculating Outliers - Nested Aggregate Error

I am currently working SQL Workbench/J and Amazon Redshift.
I am working on a query with the intent to identify the number of outliers within a data set.
My source data contains one record per day for multiple symbols. I am utilizing 30 days of trailing data. In short, for 30 days there are ten symbols with 30 records each.
I am then utilizing the following query to calculate the mean, standard deviation, and upper/lower control limits for each unique symbol based upon the 30 day data set.
select
symbol,
avg(high) as MEAN,
cast(stddev_samp(high) as dec(14,2)) STDV,
(MEAN+STDV*3) as UCL,
(MEAN-STDV*3) as LCL
from historical
group by symbol
;
My next step will be calculating how many individual values from the 'high' column exceed the upper control limit calculated value. I have tried to add the following count(case...) statement, but it is failing:
select
symbol,
avg(high) as MEAN,
cast(stddev_samp(high) as dec(14,2)) STDV,
(MEAN+STDV*3) as UCL,
(MEAN-STDV*3) as LCL,
count(case when high>avg(high) then 1 else 0 end) as outlier
from historical
group by symbol
;
The specific error is
Amazon Invalid operation: aggregate function calls may not have nested aggregate or window function
Is a count(case..) statement the right method to utilize here, or what would the recommended approach or example be?
There are a number of ways to do this but I think all of them involve a sub-query. This is because you have an aggregate (avg) compared to a per-row value (high) and then summing the the comparison.
I'd go with a sub-query where you perform an avg() window function partitioned by symbol. This will give you the average of the group on every row then just do the query as you have it. Kinda like this:
I am currently working SQL Workbench/J and Amazon Redshift.
I am working on a query with the intent to identify the number of outliers within a data set.
My source data contains one record per day for multiple symbols. I am utilizing 30 days of trailing data. In short, for 30 days there are ten symbols with 30 records each.
I am then utilizing the following query to calculate the mean, standard deviation, and upper/lower control limits for each unique symbol based upon the 30 day data set.
select symbol, avg(high) as MEAN, cast(stddev_samp(high) as dec(14,2)) STDV, (MEAN+STDV3) as UCL, (MEAN-STDV3) as LCL from historical group by symbol ;
My next step will be calculating how many individual values from the 'high' column exceed the upper control limit calculated value. I have tried to add the following count(case...) statement, but it is failing:
select symbol, avg(high) as MEAN, cast(stddev_samp(high) as dec(14,2)) STDV, (MEAN+STDV3) as UCL,
(MEAN-STDV3) as LCL, count(case when high>group_avg then 1 else 0 end) as outlier
from (
select *, avg(high) over (partition by symbol) as group_avg
from historical )
group by symbol ;
(You could also replace "avg(high) as MEAN" with "min(group_avg) as MEAN" since you already computed the average in the window function. Just a possible slight optimization.)
Use window functions to calculate the values for the standard deviation and mean. Then aggregate:
select symbol, mean, STDV,
(MEAN+STDV*3) as UCL, (MEAN-STDV*3) as LCL,
sum( (high > mean)::int) ) as outlier
from (select h.*,
avg(high) over (partition by symbol) as mean,
cast(stddev_samp(high) over (partition by symbol) as dec(14,2)) as STDV
from historical h
) h
group by symbol, mean, STDV;
Your definition of "outlier" is rather strange -- merely being higher than the average is going to happen (very roughly) about half the time. The more typical definition I have seen is outside the range of 2 standard deviations.
As a comment not directly related to the SQL. It seems unusual for me to be using future data to determine outliers. I would expect that a trailing 30 days would be used for that purpose. However, that is not the question you have asked here.

Monthly Moving Average of User Activity in SQL Server Using Window Functions

Let's say I have a table UserActivity in SQL Server 2012 with two columns:
ActivityDateTime
UserID
I want to calculate number of distinct users with any activity in a 30-day period (my monthly active users) on a daily basis. (So I have a 30-day window that increments a day at a time. How do I do this efficiently using window functions in SQL Server?
The output would look like this:
Date,NumberActiveUsersInPrevious30Days
01-01-2010,13567
01-02-2010,14780
01-03-2010,13490
01-04-2010,15231
01-05-2010,15321
01-06-2010,14513
...
SQL Server doesn't support COUNT(DISTINCT ... ) OVER () or a numeric value (30 PRECEDING) in conjunction with RANGE
I wouldn't bother trying to coerce window functions into doing this. Because of the COUNT(DISTINCT UserID) requirement it is always going to have to re-examine the entire 30 day window for each date.
You can create a calendar table with a row for each date and use
SELECT C.Date,
NumberActiveUsersInPrevious30Days
FROM Calendar C
CROSS APPLY (SELECT COUNT(DISTINCT UserID)
FROM UserActivity
WHERE ActivityDateTime >= DATEADD(DAY, -30, C.[Date])
AND ActivityDateTime < C.[Date]) CA(NumberActiveUsersInPrevious30Days)
WHERE C.Date BETWEEN '2010-01-01' AND '2010-01-06'
Option 1: For (while) loop though each day and select 30 days backward for each (obviously quite slow).
Option 2: A separate table with a row for each day and join on the original table (again quite slow).
Option 3: Recursive CTEs or stored procs (still not doing much better).
Option 4: For (while) loop in combination with cursors (efficient, but requires some advanced SQL knowledge). With this solution you will step through each day and each row in order and keep track of the average (you'll need some sort of wrap-around array to know what value to subtract when a day moves out of range).
Option 5: Option 3 in a general-purpose / scripting programming language (C++ / Java / PHP) (easy to do with basic knowledge of one of those languages, efficient).
Some related questions.

SQL query question

I'm trying to do something in a query that I've never done before. it probably requires variables, but i've never done that, and I'm not sure that it does.
What I want is to get a list of sales, grouped first by affiliate, then by it's month.
I can do that, but here's the twist... I don't want the month, but month 1, month 2, month 3...
And those aren't Jan, feb, march, but the number of months since the day of first sale.
Is this possible in a query at all, or do I need to do this in my code.
Oh, mysql 5.1.something...
Sure, just write an expression in SQL that generates the number of months since the first sale (Do you mean the first sale for that afiliate? If so, you'll need a subquery)
And since you say you want a list of sales, I assume you don't really want to "Group By" affilaite and monthcount, you just want to Sort, or Order By those values)
If you wanted the Average sales amount, or the Count of sales, or some other Aggregate function of sales data, then you would be doing a "Group By"...
And I don't think you need to worry about sorting by the number of months, you can simply sort by the difference between each sales date and the rearliest sale date for each affiliate. (If you wanted to apply a third sorting rule, after the sales date sort, then you would need to be more careful.)
Select * From Sales S
Order By Affiliate,
SalesDate - (Select Min(SalesDate)
From Sales
Where Affiliate = S.Affiliate)
Or, if you really need it to be by the difference in months
Select * From Sales S
Order By Affiliate,
Month(SalesDate) -
(Select Month(Min(SalesDate))
From Sales
Where Affiliate = S.Affiliate)
This is possible in standard SQL if you use what I like to call "SQL gymnastics". It can be done with subqueries.
But it looks incredibly ugly, is hard to maintain and it's really not worth it. You're far better off using one of the many programming languages that wrap SQL (such as PL/SQL) or even a general purpose language that can call SQL (such as Python).
The result will be in two languages but will be all the more understandable than the same thing written in just SQL.