postgres - partial column in SELECT/GROUP BY - column must appear in the GROUP BY clause or be used in an aggregate function - sql

Both the following two statements produce an error in Postgres:
SELECT substring(start_time,1,8) AS date, count(*) as total from cdrs group by date;
SELECT substring(start_time,1,8) AS date, count(*) as total from cdrs group by substring(start_time,1,8);
The error is:
column "cdrs.start_time" must appear in the GROUP BY clause or be used
in an aggregate function
My reading of postgres docs is that both SELECT and GROUP BY can use an expression
postgres 8.3 SELECT
The start_time field is a string and has a date/time in form ccyymmddHHMMSS. In mySQL they both produce desired and expected results:
+----------+-------+
| date | total |
+----------+-------+
| 20091028 | 9 |
| 20091029 | 110 |
| 20091120 | 14 |
| 20091121 | 4 |
+----------+-------+
4 rows in set (0.00 sec)
I need to stick with Postgres (heroku). Any suggestions?
p.s. there is lots of other discussion around that talks about missing items in GROUP BY and why mySQL accepts this, why others don't ... strict adherence to SQL spec etc etc, but I think this is sufficiently different to 1062158/converting-mysql-select-to-postgresql and 1769361/postgresql-group-by-different-from-mysql to warrant a separate question.

You did something else that you didn't describe in the question, as both of your queries work just fine. Tested on 8.5 and 8.3.8:
# create table cdrs (start_time text);
CREATE TABLE
# insert into cdrs (start_time) values ('20090101121212'),('20090101131313'),('20090510040603');
INSERT 0 3
# SELECT substring(start_time,1,8) AS date, count(*) as total from cdrs group by date;
date | total
----------+-------
20090510 | 1
20090101 | 2
(2 rows)
# SELECT substring(start_time,1,8) AS date, count(*) as total from cdrs group by substring(start_time,1,8);
date | total
----------+-------
20090510 | 1
20090101 | 2
(2 rows)

Just to summarise, error
column "cdrs.start_time" must appear in the GROUP BY clause or be used in an aggregate function
was caused (in this case) by ORDER BY start_time clause. Full statement needed to be either:
SELECT substring(start_time,1,8) AS date, count(*) as total FROM cdrs GROUP BY substring(start_time,1,8) ORDER BY substring(start_time,1,8);
or
SELECT substring(start_time,1,8) AS date, count(*) as total FROM cdrs GROUP BY date ORDER BY date;

Two simple things you might try:
Upgrade to postgres 8.4.1
Both queries Work Just Fine For Me(tm) under pg841
Group by ordinal position
That is, GROUP BY 1 in this case.

Related

Running sum of unique users in redshift

I have a table with as follows with user visits by day -
| date | user_id |
|:-------- |:-------- |
| 01/31/23 | a |
| 01/31/23 | a |
| 01/31/23 | b |
| 01/30/23 | c |
| 01/30/23 | a |
| 01/29/23 | c |
| 01/28/23 | d |
| 01/28/23 | e |
| 01/01/23 | a |
| 12/31/22 | c |
I am looking to get a running total of unique user_id for the last 30 days . Here is the expected output -
| date | distinct_users|
|:-------- |:-------- |
| 01/31/23 | 5 |
| 01/30/23 | 4 |
.
.
.
Here is the query I tried -
SELECT date
, SUM(COUNT(DISTINCT user_id)) over (order by date rows between 30 preceding and current row) AS unique_users
FROM mytable
GROUP BY date
ORDER BY date DESC
The problem I am running into is that this query not counting the unique user_id - for instance the result I am getting for 01/31/23 is 9 instead of 5 as it is counting user_id 'a' every time it occurs.
Thank you, appreciate your help!
Not the most performant approach, but you could use a correlated subquery to find the distinct count of users over a window of the past 30 days:
SELECT
date,
(SELECT COUNT(DISTINCT t2.user_id)
FROM mytable t2
WHERE t2.date BETWEEN t1.date - INTERVAL '30 day' AND t1.date) AS distinct_users
FROM mytable t1
ORDER BY date;
There are a few things going on here. First window functions run after group by and aggregation. So COUNT(DISTINCT user_id) gives the count of user_ids for each date then the window function runs. Also, window function set up like this work over the past 30 rows, not 30 days so you will need to fill in missing dates to use them.
As to how to do this - I can only think of the "expand to the data so each date and id has a row" method. This will require a CTE to generate the last 2 years of dates plus 30 days so that the look-back window works for the first dates. Then window over the past 30 days for each user_id and date to see which rows have an example of this user_id within the past 30 days, setting the value to NULL if no uses of the user_id are present within the window. Then Count the user_ids counts (non NULL) grouping by just date to get the number of unique user_ids for that date.
This means expanding the data significantly but I see no other way to get truly unique user_ids over the past 30 days. I can help code this up if you need but will look something like:
WITH RECURSIVE CTE to generate the needed dates,
CTE to cross join these dates with a distinct set of all the user_ids in user for the past 2 years,
CTE to join the date/user_id data set with the table of real data for past 2 years and 30 days and window back counting non-NULL user_ids, partition by date and user_id, order by date, and setting any zero counts to NULL with a DECODE() or CASE statement,
SELECT, grouping by just date count the user_ids by date;

Calculating an average of averages in SQL Server

I want to do something very simple but I'm obviously missing a trick! I want to get an average of average values but I want to include the weighting of the original average calculation. I'll use a stripped back version of what I'm attempting to do.
So let's say I have the following table
Product date RunInterval AvgDuration_secs Executions
--------------------------------------------------------------------
A 29/12/19 1 1 100
A 29/12/19 2 2 10
What I want to find out is what the average duration was for Product A on 29/12. All the things I've tried so far are giving me an average of 1.5 secs ie it's adding together the duration of 1 & 2 secs (3) and dividing by number of rows (2) to give 1.5. What I want to get to is to have the average but taking into account how often it runs so (100*1) + (10*2) / 110 = 1.09 secs. I've tried various attempts with GROUP BY statements and CURSORS but not getting there.
I'm evidently tackling it the wrong way! Any help welcome :-)
You can do it like this:
select product, date,
round(1.0 * sum([Executions] * [AvgDuration_secs]) / sum([Executions]), 2) result
from tablename
group by product, date
I'm not sure if you want RunInterval or AvgDuration_secs in the 1st sum.
See the demo.
Results:
> product | date | result
> :------ | :---------| :-----
> A | 29/12/2019| 1.09
If you got those results from a query or view that select from some table, then grouped by Product & date & RunInterval.
Then you could simply run a query that on that table that
groups only by the Product & date.
An example:
--
-- Sample data
--
CREATE TABLE sometable
(
Product varchar(30),
ExecutionDatetime datetime,
RunInterval int
);
WITH RCTE_NUMS AS
(
SELECT 1 AS n
UNION ALL
SELECT n+1
FROM RCTE_NUMS
WHERE n < 110
)
INSERT INTO sometable
(Product, ExecutionDatetime, RunInterval)
SELECT
'A' p,
DATEADD(minute,n*12,'2019-12-29 01:00:00') dt,
IIF(n<=100,1,2) ri
FROM RCTE_NUMS
OPTION (MAXRECURSION 1000);
110 rows affected
select
Product,
cast(ExecutionDatetime as date) as [Date],
AVG(1.0*RunInterval) AS AvgDuration_secs,
COUNT(*) AS Executions
from sometable t
group by
Product,
cast(ExecutionDatetime as date)
ORDER BY Product, [Date]
Product | Date | AvgDuration_secs | Executions
:------ | :------------------ | :--------------- | ---------:
A | 29/12/2019 00:00:00 | 1.090909 | 110
db<>fiddle here

MSSQL Sum of values referenced by another table

I'm attempting to create a report on the total money spent per day.
In the database are these two tables. They are matched using "UID" made at creation.
I've created this query but it results in duplicate dates.
Select LEFT(f.timestamp, 10) timestamp, sum(s.Total) Total
FROM dbo.purchasing AS f
Join (SELECT uid,SUM(CONVERT(DECIMAL(18,2), (CONVERT(DECIMAL(18,4), qty) * price))) Total
FROM dbo.purchasingitems
GROUP BY uid)
AS s ON f.uid = s.uid
GROUP BY TIMESTAMP
purchasing:
+--+---------+------------+--------+---+
|ID| UID | timestamp | contact|...|
+--+---------+------------+--------+---+
| 1|abr92nas9| 01/01/2018 | ROB |...|
| 2|nsa93m187| 02/02/2018 | ROB |...|
+--+---------+------------+--------+---+
purchasingitems:
+--+---------+-----+--------+---+
|ID| UID | QTY | Price |...|
+--+---------+-----+--------+---+
| 1|abr92nas9| 20 | 0.2435 |...|
| 2|abr92nas9| 5 | 0.5 |...|
| 3|nsa93m187| 1 | 100 |...|
| 4|nsa93m187| 4 | 15.5 |...|
+--+---------+-----+--------+---+
You need to group by the expression:
SELECT LEFT(f.timestamp, 10) as timestamp, sum(s.Total) as Total
FROM dbo.purchasing f JOIN
(SELECT uid, SUM(CONVERT(DECIMAL(18,2), (CONVERT(DECIMAL(18,4), qty) * price))) as Total
FROM dbo.purchasingitems
GROUP BY uid
) s
ON f.uid = s.uid
GROUP BY LEFT(f.timestamp, 10);
Notes:
You should not be storing date/time values as strings (unless you have a really good reason). If timestamp is a date, you should use cast(timestamp as date).
You should not be using string functions on date/times.
timestamp is a keyword in SQL Server (although not reserved), so it is not a good choice for a column name.
Your problem is that you think that GROUP BY timestamp refers to the expression in the SELECT. SQL Server does not support column aliases, so it can only refer to the column of that name.
I don't see a reason to convert to decimal for the multiplication. You might have a good reason.
You probably want order by as well, to ensure that the result set is in a sensible order.
Data you posted does NOT produce duplicates
No reason for the sub query
Select LEFT(f.timestamp, 10) timestamp,
SUM(CONVERT(DECIMAL(18,2), (CONVERT(DECIMAL(18,4), s.qty) * s.price))) Total
FROM dbo.purchasing AS f
join dbo.purchasingitems s
ON f.uid = s.uid
GROUP BY f.TIMESTAMP

How to do a sub-select per result entry in postgresql?

Assume I have a table with only two columns: id, maturity. maturity is some date in the future and is representative of until when a specific entry will be available. Thus it's different for different entries but is not necessarily unique. And with time number of entries which have not reached this maturity date changes.
I need to count a number of entries from such a table that were available on a specific date (thus entries that have not reached their maturity). So I basically need to join this two queries:
SELECT generate_series as date FROM generate_series('2015-10-01'::date, now()::date, '1 day');
SELECT COUNT(id) FROM mytable WHERE mytable.maturity > now()::date;
where instead of now()::date I need to put entry from the generated series. I'm sure this has to be simple enough, but I can't quite get around it. I need the resulting solution to remain a query, thus it seems that I can't use for loops.
Sample table entries:
id | maturity
---+-------------------
1 | 2015-10-03
2 | 2015-10-05
3 | 2015-10-11
4 | 2015-10-11
Expected output:
date | count
------------+-------------------
2015-10-01 | 4
2015-10-02 | 4
2015-10-03 | 3
2015-10-04 | 3
2015-10-05 | 2
2015-10-06 | 2
NOTE: This count doesn't constantly decrease, since new entries are added and this count increases.
You have to use fields of outer query in WHERE clause of a sub-query. This can be done if the subquery is in the SELECT clause of the outer query:
SELECT generate_series,
(SELECT COUNT(id)
FROM mytable
WHERE mytable.maturity > generate_series)
FROM generate_series('2015-10-01'::date, now()::date, '1 day');
More info: http://www.techonthenet.com/sql_server/subqueries.php
I think you want to group your data by the maturity Date.
Check this:
select maturity,count(*) as count
from your_table group by maturity;

Trending sum over time

I have a table (in Postgres 9.1) that looks something like this:
CREATE TABLE actions (
user_id: INTEGER,
date: DATE,
action: VARCHAR(255),
count: INTEGER
)
For example:
user_id | date | action | count
---------------+------------+--------------+-------
1 | 2013-01-01 | Email | 1
1 | 2013-01-02 | Call | 3
1 | 2013-01-03 | Email | 3
1 | 2013-01-04 | Call | 2
1 | 2013-01-04 | Voicemail | 2
1 | 2013-01-04 | Email | 2
2 | 2013-01-04 | Email | 2
I would like to be able to view a user's total actions over time for a specific set of actions; for example, Calls + Emails:
user_id | date | count
-----------+-------------+---------
1 | 2013-01-01 | 1
1 | 2013-01-02 | 4
1 | 2013-01-03 | 7
1 | 2013-01-04 | 11
2 | 2013-01-04 | 2
The monstrosity that I've created so far looks like this:
SELECT
date, user_id, SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count
FROM
actions
WHERE
action IN ('Call', 'Email')
GROUP BY
user_id, date, count;
Which works for single actions, but seems to break for multiple actions when they happen on the same day, for example instead of the expected 11 on 2013-01-04, we get 9:
date | user_id | count
------------+--------------+-------
2013-01-01 | 1 | 1
2013-01-02 | 1 | 4
2013-01-03 | 1 | 7
2013-01-04 | 1 | 9 <-- should be 11?
2013-01-04 | 2 | 2
Is it possible to tweak my query to resolve this issue? I tried removing the grouping on count, but Postgres doesn't seem to like that:
column "actions.count" must appear in the GROUP BY clause
or be used in an aggregate function
LINE 2: date, user_id, SUM(count) OVER (PARTITION BY user...
^
This query produces the result you are looking for:
SELECT DISTINCT
date, user_id, SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count
FROM actions
WHERE
action IN ('Call', 'Email');
The default window is already what you want, according to the official docs and the "DISTINCT" eliminates duplicate rows when both Emails and Calls happen on the same day.
See SQL Fiddle.
The table has a column named "count", and the expresion in the SELECT clause is aliased as "count", it is ambiguous.
Read documentation: http://www.postgresql.org/docs/9.0/static/sql-select.html#SQL-GROUPBY
In case of ambiguity, a GROUP BY name will be interpreted as an
input-column name rather than an output column name.
That means, that your query does not group by "count" evaluated in the SELECT clause, but rather it groups by "count" values taken from the table.
This query gives expected results, see SQL Fiddle
SELECT date, user_id, count
from (
Select date, user_id,
SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count
FROM actions
WHERE
action IN ('Call', 'Email')
) alias
GROUP BY
user_id, date, count;
Asserts
It is unclear whether you want to sort by user_id or date
It is also unclear whether you want to include dates in the result list, for which there is no row in the base table. In this case, refer to this closely related answer:
PostgreSQL: running count of rows for a query 'by minute'
Repair names
First off, I am using this test table instead of your problematic table:
CREATE TEMP TABLE actions (
user_id integer,
thedate date,
action text,
ct integer
);
Your use of reserved words and function names as identifiers (column names) is part of the problem.
Repair query
Combine aggregate and window functions
Since aggregate functions are applied first, your original query lumps the two rows found for user_id = 1 and thedate = '2013-01-04' into one. You have to multiply by count(*) to get the actual running count.
You can do this without subquery, since you can combine aggregate functions and window functions. Aggregate functions are applied first. You can even have a window functions over the result of aggregate functions.
SELECT thedate
, user_id
, sum(ct * count(*)) OVER (PARTITION BY user_id
ORDER BY thedate) AS running_ct
FROM actions
WHERE action IN ('Call', 'Email')
GROUP BY user_id, thedate, ct
ORDER BY user_id, thedate;
Or simplify to:
...
, sum(sum(ct)) OVER (PARTITION BY user_id
ORDER BY thedate) AS running_ct
...
This should also be the fastest of the solutions presented.
Here, the inner sum() is an aggregate function, while the outer sum() is a window function - over the result of the aggregate function.
Or use DISTINCT
Another way would to use DISTINCT or DISTINCT ON, since that is applied after window functions:
DISTINCT - this is possible, since running_ct is guaranteed to be the same in this case anyway, since all peers are summed at once for the default frame definition of window functions.
SELECT DISTINCT
thedate
, user_id
, sum(ct) OVER (PARTITION BY user_id ORDER BY thedate) AS running_ct
FROM actions
WHERE action IN ('Call', 'Email')
ORDER BY thedate, user_id;
Or simplify with DISTINCT ON:
SELECT DISTINCT ON (thedate, user_id)
...
->SQLfiddle demonstrating all variants.