Trending sum over time - sql

I have a table (in Postgres 9.1) that looks something like this:
CREATE TABLE actions (
user_id INTEGER,
date DATE,
action VARCHAR(255),
count INTEGER
);
For example:
user_id | date | action | count
---------------+------------+--------------+-------
1 | 2013-01-01 | Email | 1
1 | 2013-01-02 | Call | 3
1 | 2013-01-03 | Email | 3
1 | 2013-01-04 | Call | 2
1 | 2013-01-04 | Voicemail | 2
1 | 2013-01-04 | Email | 2
2 | 2013-01-04 | Email | 2
I would like to be able to view a user's total actions over time for a specific set of actions; for example, Calls + Emails:
user_id | date | count
-----------+-------------+---------
1 | 2013-01-01 | 1
1 | 2013-01-02 | 4
1 | 2013-01-03 | 7
1 | 2013-01-04 | 11
2 | 2013-01-04 | 2
The monstrosity that I've created so far looks like this:
SELECT
date, user_id, SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count
FROM
actions
WHERE
action IN ('Call', 'Email')
GROUP BY
user_id, date, count;
This works when each day has a single action, but breaks when several of the selected actions fall on the same day. For example, instead of the expected 11 on 2013-01-04, we get 9:
date | user_id | count
------------+--------------+-------
2013-01-01 | 1 | 1
2013-01-02 | 1 | 4
2013-01-03 | 1 | 7
2013-01-04 | 1 | 9 <-- should be 11?
2013-01-04 | 2 | 2
Is it possible to tweak my query to resolve this issue? I tried removing the grouping on count, but Postgres doesn't seem to like that:
column "actions.count" must appear in the GROUP BY clause
or be used in an aggregate function
LINE 2: date, user_id, SUM(count) OVER (PARTITION BY user...
^

This query produces the result you are looking for:
SELECT DISTINCT
date, user_id, SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count
FROM actions
WHERE
action IN ('Call', 'Email');
The default window frame is already what you want, according to the official docs, and DISTINCT eliminates the duplicate rows produced when both Emails and Calls happen on the same day.
See SQL Fiddle.
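A minimal sketch of why DISTINCT is enough here, using Python's bundled SQLite (3.25 or later, for window functions) as a stand-in for Postgres. Both engines use the same default frame, so peer rows with the same user and date get the same running total. The helper name run and the quoted column names are choices made for this sketch, not part of the original answer.

```python
import sqlite3

# Sample rows from the question; SQLite stands in for Postgres here.
ROWS = [
    (1, "2013-01-01", "Email", 1),
    (1, "2013-01-02", "Call", 3),
    (1, "2013-01-03", "Email", 3),
    (1, "2013-01-04", "Call", 2),
    (1, "2013-01-04", "Voicemail", 2),
    (1, "2013-01-04", "Email", 2),
    (2, "2013-01-04", "Email", 2),
]

# The answer's query, with the column names quoted and an ORDER BY added
# so that the output order is deterministic.
RUNNING_SUM = """
    SELECT {distinct} "date", user_id,
           SUM("count") OVER (PARTITION BY user_id ORDER BY "date") AS running
    FROM actions
    WHERE action IN ('Call', 'Email')
    ORDER BY user_id, "date"
"""


def run(distinct):
    """Run the running-sum query with or without DISTINCT (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE actions (user_id INTEGER, "date" TEXT, action TEXT, "count" INTEGER)'
    )
    conn.executemany("INSERT INTO actions VALUES (?, ?, ?, ?)", ROWS)
    rows = conn.execute(RUNNING_SUM.format(distinct="DISTINCT" if distinct else "")).fetchall()
    conn.close()
    return rows


without_distinct = run(False)  # both 2013-01-04 rows of user 1 carry the same total
with_distinct = run(True)      # the duplicate peer row is collapsed

for row in with_distinct:
    print(row)
```

Without DISTINCT, the Call and Email rows of 2013-01-04 both appear, each already showing the full total for that day, which is why removing duplicates does not lose any information.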

The table has a column named "count", and the expression in the SELECT clause is also aliased as "count", so the name is ambiguous.
Read documentation: http://www.postgresql.org/docs/9.0/static/sql-select.html#SQL-GROUPBY
In case of ambiguity, a GROUP BY name will be interpreted as an
input-column name rather than an output column name.
That means your query does not group by the "count" computed in the SELECT clause, but by the "count" values taken from the table.
This query gives expected results, see SQL Fiddle
SELECT date, user_id, count
FROM (
SELECT date, user_id,
SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count
FROM actions
WHERE
action IN ('Call', 'Email')
) alias
GROUP BY
user_id, date, count;

Asserts
It is unclear whether you want to sort by user_id or date
It is also unclear whether you want to include dates in the result list, for which there is no row in the base table. In this case, refer to this closely related answer:
PostgreSQL: running count of rows for a query 'by minute'
Repair names
First off, I am using this test table instead of your problematic table:
CREATE TEMP TABLE actions (
user_id integer,
thedate date,
action text,
ct integer
);
Your use of reserved words and function names as identifiers (column names) is part of the problem.
Repair query
Combine aggregate and window functions
Since aggregate functions are applied first, your original query lumps the two rows found for user_id = 1 and thedate = '2013-01-04' into one. You have to multiply by count(*) to get the actual running count.
You can do this without a subquery, since aggregate functions and window functions can be combined. Aggregate functions are applied first, so you can even run a window function over the result of an aggregate function.
SELECT thedate
, user_id
, sum(ct * count(*)) OVER (PARTITION BY user_id
ORDER BY thedate) AS running_ct
FROM actions
WHERE action IN ('Call', 'Email')
GROUP BY user_id, thedate, ct
ORDER BY user_id, thedate;
Or simplify to:
...
, sum(sum(ct)) OVER (PARTITION BY user_id
ORDER BY thedate) AS running_ct
...
This should also be the fastest of the solutions presented.
Here, the inner sum() is an aggregate function, while the outer sum() is a window function - over the result of the aggregate function.
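The two stages can be sketched as follows, again with Python's bundled SQLite (3.25 or later) standing in for Postgres. The first query shows what the inner aggregate produces per user and day. The second runs the window sum over those daily totals. Grouping by user_id and thedate only is an assumption of this sketch, since the simplified query above is elided.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actions (user_id INTEGER, thedate TEXT, action TEXT, ct INTEGER)")
conn.executemany(
    "INSERT INTO actions VALUES (?, ?, ?, ?)",
    [
        (1, "2013-01-01", "Email", 1),
        (1, "2013-01-02", "Call", 3),
        (1, "2013-01-03", "Email", 3),
        (1, "2013-01-04", "Call", 2),
        (1, "2013-01-04", "Voicemail", 2),
        (1, "2013-01-04", "Email", 2),
        (2, "2013-01-04", "Email", 2),
    ],
)

# Stage 1: the aggregate alone yields one row per (user_id, thedate).
daily_totals = conn.execute("""
    SELECT thedate, user_id, sum(ct) AS day_ct
    FROM actions
    WHERE action IN ('Call', 'Email')
    GROUP BY user_id, thedate
    ORDER BY user_id, thedate
""").fetchall()

# Stage 2: the outer sum() is a window function over those daily totals.
running_totals = conn.execute("""
    SELECT thedate, user_id,
           sum(sum(ct)) OVER (PARTITION BY user_id ORDER BY thedate) AS running_ct
    FROM actions
    WHERE action IN ('Call', 'Email')
    GROUP BY user_id, thedate
    ORDER BY user_id, thedate
""").fetchall()
conn.close()

print(daily_totals)
print(running_totals)
```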
Or use DISTINCT
Another way would be to use DISTINCT or DISTINCT ON, since that is applied after window functions:
DISTINCT - this works because running_ct is guaranteed to be the same for all rows of the same user and date: with the default frame definition of window functions, all peers are summed at once.
SELECT DISTINCT
thedate
, user_id
, sum(ct) OVER (PARTITION BY user_id ORDER BY thedate) AS running_ct
FROM actions
WHERE action IN ('Call', 'Email')
ORDER BY thedate, user_id;
Or simplify with DISTINCT ON:
SELECT DISTINCT ON (thedate, user_id)
...
->SQLfiddle demonstrating all variants.

Related

How to select unique sessions per unique dates with SQL?

I'm struggling with my SQL. I want to select all unique sessions on unique dates from a table. I don't get the results I want.
Example of table:
session_id | date
87654321 | 2020-05-22 09:10:10
12345678 | 2020-05-23 10:19:50
12345678 | 2020-05-23 10:20:23
87654321 | 2020-05-23 12:00:10
This is my SQL right now. I select all distinct dates from a datetime column. I also count all distinct session_id's. I group them by date.
SELECT DISTINCT DATE_FORMAT(`date`, '%d-%m-%Y') as 'date', COUNT(DISTINCT `session_id`) as 'count' FROM `logging` GROUP BY 'date'
What I want to see is (with example above):
date | count
22-05-2020 | 1
23-05-2020 | 2
The result I get with my real table (with 354 sessions on 3 different dates) right now is:
date | count
21-05-2020 | 200
Edit
Changes ` to '.
The name of the field and the name of the alias are the same (date). Try a different name for the alias to avoid confusion in the GROUP BY part.
You probably want to group on your date expression:
SELECT DATE_FORMAT(`date`, '%d-%m-%Y') as `date`, COUNT(DISTINCT `session_id`) as `count` FROM `logging` GROUP BY DATE_FORMAT(`date`, '%d-%m-%Y')
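A small sketch of grouping on the date expression itself, run against the sample rows from the question. It uses Python's bundled SQLite rather than MySQL, so strftime replaces DATE_FORMAT. The column name logged_at is a renaming made for this sketch to keep the column and the alias apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# logged_at is a hypothetical name for the question's datetime column.
conn.execute("CREATE TABLE logging (session_id INTEGER, logged_at TEXT)")
conn.executemany(
    "INSERT INTO logging VALUES (?, ?)",
    [
        (87654321, "2020-05-22 09:10:10"),
        (12345678, "2020-05-23 10:19:50"),
        (12345678, "2020-05-23 10:20:23"),
        (87654321, "2020-05-23 12:00:10"),
    ],
)

# Group by the formatted day, not by a string literal or an ambiguous alias.
# Ordering by MIN(logged_at) keeps the days chronological even though the
# dd-mm-yyyy text itself does not sort by date.
sessions_per_day = conn.execute("""
    SELECT strftime('%d-%m-%Y', logged_at) AS day,
           COUNT(DISTINCT session_id) AS sessions
    FROM logging
    GROUP BY strftime('%d-%m-%Y', logged_at)
    ORDER BY MIN(logged_at)
""").fetchall()
conn.close()

print(sessions_per_day)
```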

Query and return user requests if dates are consecutive

I am attempting to group records together by consecutive dates in the request_date column and user field but only return if the count is equal or above a certain number, say 3.
At the moment the columns I have are:
user_id | request_date |
--------|--------------|
3 | 2019-01-01 |
5 | 2019-05-08 |
3 | 2019-01-02 |
4 | 2019-08-09 |
3 | 2019-01-03 |
the query would ideally return something along the lines of:
user_id: 3
num_of_reqs: 3
first_date: 2019-01-01
last_date: 2019-01-03
any insight would be appreciated.
You can use window functions. In particular, if you subtract an increasing sequence (one step per row) from the date column, the difference stays constant for as long as the dates are consecutive, so it can serve as a grouping key.
Something like this:
select user_id, count(*) as num_requests,
min(request_date), max(request_date)
from (select t.*,
row_number() over (partition by user_id order by request_date) as seqnum
from t
) t
group by user_id, (request_date - seqnum)
If you want to limit to a particular number, then add a having clause:
having count(*) >= 3
for instance.
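The technique can be tried end to end with the sample rows from the question. This is a sketch under assumptions: it uses Python's bundled SQLite (3.25 or later), where dates are stored as text, so the date is first converted to a day number with julianday before the row number is subtracted. Other engines can usually subtract directly from a date. The table name t follows the answer. MIN_RUN is a name introduced here.

```python
import sqlite3

MIN_RUN = 3  # minimum number of consecutive days to report (hypothetical parameter)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (user_id INTEGER, request_date TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        (3, "2019-01-01"),
        (5, "2019-05-08"),
        (3, "2019-01-02"),
        (4, "2019-08-09"),
        (3, "2019-01-03"),
    ],
)

# Day number minus row number is constant within a run of consecutive dates,
# so it works as the grouping key for each run.
consecutive_runs = conn.execute(
    """
    SELECT user_id,
           COUNT(*)          AS num_of_reqs,
           MIN(request_date) AS first_date,
           MAX(request_date) AS last_date
    FROM (
        SELECT user_id, request_date,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY request_date) AS seqnum
        FROM t
    ) AS numbered
    GROUP BY user_id, CAST(julianday(request_date) AS INTEGER) - seqnum
    HAVING COUNT(*) >= ?
    ORDER BY user_id, first_date
    """,
    (MIN_RUN,),
).fetchall()
conn.close()

print(consecutive_runs)
```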

sql query to get unique id for a row in oracle based on its continuity

I have a problem that needs to be solved using sql in oracle.
I have a dataset like given below:
value | date
-------------
1 | 01/01/2017
2 | 02/01/2017
3 | 03/01/2017
3 | 04/01/2017
2 | 05/01/2017
2 | 06/01/2017
4 | 07/01/2017
5 | 08/01/2017
I need to show the result in the below format:
value | date | Group
1 | 01/01/2017 | 1
2 | 02/01/2017 | 2
3 | 03/01/2017 | 3
3 | 04/01/2017 | 3
2 | 05/01/2017 | 4
2 | 06/01/2017 | 4
4 | 07/01/2017 | 5
5 | 08/01/2017 | 6
The logic is: whenever the value changes over date, it gets assigned a new group/id, but if it's the same as the previous one, then it's part of the same group.
Here is one method using lag() and cumulative sum:
select t.*,
sum(case when value = prev_value then 0 else 1 end) over (order by date) as grp
from (select t.*,
lag(value) over (order by date) as prev_value
from t
) t;
The logic here is simply to count the number of times the value changes from one row to the next, in date order.
This assumes that date is actually stored as a date and not a string. If it is a string, then the ordering will not be correct. Either convert to a date or use a column that specifies the correct ordering.
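A runnable sketch of the lag() plus cumulative-sum approach, using Python's bundled SQLite (3.25 or later) instead of Oracle. The column is renamed to dt, and the dates are written as ISO strings for eight consecutive days. Both are assumptions of this sketch, since the original date format is ambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (value INTEGER, dt TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        (1, "2017-01-01"),
        (2, "2017-01-02"),
        (3, "2017-01-03"),
        (3, "2017-01-04"),
        (2, "2017-01-05"),
        (2, "2017-01-06"),
        (4, "2017-01-07"),
        (5, "2017-01-08"),
    ],
)

# Inner query: fetch the previous row's value (NULL for the first row).
# Outer query: add 1 each time the value differs from the previous one;
# the running total of those flags is the group number.
grouped = conn.execute("""
    SELECT value, dt,
           SUM(CASE WHEN value = prev_value THEN 0 ELSE 1 END)
               OVER (ORDER BY dt) AS grp
    FROM (
        SELECT value, dt, LAG(value) OVER (ORDER BY dt) AS prev_value
        FROM t
    ) AS with_prev
    ORDER BY dt
""").fetchall()
conn.close()

group_numbers = [grp for _value, _dt, grp in grouped]
print(group_numbers)
```

For the first row the comparison with NULL is not true, so the CASE falls through to 1 and the numbering starts at 1.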
Here is a solution using the MATCH_RECOGNIZE clause, introduced in Oracle 12.*
select value, dt, mn as grp
from inputs
match_recognize (
order by dt
measures match_number() as mn
all rows per match
pattern ( a b* )
define b as value = prev(value)
)
order by dt -- if needed
;
Here is how this works: Other than SELECT, FROM and ORDER BY, the query has only one clause, MATCH_RECOGNIZE. What this clause does is: it takes the rows from inputs and it orders them by dt. Then it searches for patterns: one row, marked as a, with no constraints, followed by zero or more rows b, where b is defined by the condition that the value is the same as for the prev[ious] row. What the clause calculates or measures is the match_number() - first "match" of the pattern, second match etc. We use this match number as the group number (grp) in the outer query - that's all we needed!
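For readers without access to Oracle 12, the matching logic described above can be imitated in a few lines of plain Python. This is only an illustration of the pattern ( a b* ), not of how Oracle evaluates it. The function name match_numbers is made up for this sketch, and the input is assumed to be already ordered by dt.

```python
def match_numbers(values):
    """Number the matches of the pattern (a b*) over an ordered list.

    A row is a "b" row when its value equals the previous row's value;
    any other row is an "a" row and therefore starts a new match.
    """
    numbers = []
    match_no = 0
    for i, value in enumerate(values):
        starts_new_match = i == 0 or value != values[i - 1]
        if starts_new_match:
            match_no += 1
        numbers.append(match_no)
    return numbers


values = [1, 2, 3, 3, 2, 2, 4, 5]  # the value column, ordered by dt
groups = match_numbers(values)
print(list(zip(values, groups)))
```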
*Notes: The existence of solutions like this shows why it is important for posters to state their Oracle version. (Run the statement select * from v$version to find out.) Also: date and group are reserved words in Oracle and shouldn't be used as column names. Not even for posting made-up sample data. (There are workarounds but they aren't needed in this case.) Also, whenever using dates like 03/01/2017 in a post, please indicate whether that is March 1 or January 3, there's no way for "us" to tell. (It wasn't important in this case, but it is in the vast majority of cases.)

Get last element of an ordered set in postgresql

I am trying to get the last element of an ordered set, stored in a database table. The ordering is defined by one of the columns in the table. Also the table contains multiple sets, so I want the last one for each of the sets.
As an example consider the following table:
benchmarks=# select id,sorter from aggtest ;
id | sorter
----+--------
1 | 1
3 | 1
5 | 1
2 | 2
7 | 2
4 | 1
6 | 2
(7 rows)
Sorter 1 and 2 define each of the sets, sets are ordered by the id column. To get the last element of each set, I defined an aggregate function:
CREATE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
SELECT $2;
$$;
CREATE AGGREGATE public.last (
sfunc = public.last_agg,
basetype = anyelement,
stype = anyelement
);
As explained here.
However when I use this I get:
benchmarks=# select last(id),sorter from aggtest group by sorter order by sorter;
last | sorter
------+--------
4 | 1
6 | 2
(2 rows)
However, I want to get (5,1) and (7,2) as these are the last ids (numerically) in the set. Looking at how the aggregate mechanism works, I can see quite well, why the result is not what I want. The items are returned in the order I added them, and then aggregated so that the last one I added is returned.
I tried sorting by ids, so that each group is sorted independently, however that does not work:
benchmarks=# select last(id),sorter from aggtest group by sorter order by sorter,id;
ERROR: column "aggtest.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...(id),sorter from aggtest group by sorter order by sorter,id;
If I wrap the sorting criteria in another aggregate, I get wrong data again:
benchmarks=# select last(id),sorter from aggtest group by sorter order by sorter,last(id);
last | sorter
------+--------
4 | 1
6 | 2
(2 rows)
Also grouping by id in addition to sorter does not work obviously.
Of course there is an easier way to get the last (highest) id for each group: using the max aggregate. However, I am not so much interested in the id as in the data associated with it (i.e. in the same row). Hence I need to sort by id and then aggregate so that the row with the highest id is returned for each group.
What is the best way to accomplish this?
EDIT: Why does max(id) grouped by sorter not work
Assume the following complete table (unsorter represents the additional data I have in the table):
benchmarks=# select * from aggtest ;
id | sorter | unsorter
----+--------+----------
1 | 1 | 1
3 | 1 | 2
5 | 1 | 3
2 | 2 | 4
7 | 2 | 5
4 | 1 | 6
6 | 2 | 7
(7 rows)
I would like to retrieve the lines:
id | sorter | unsorter
----+--------+----------
5 | 1 | 3
7 | 2 | 5
However with max(id) and grouping by sorter I get:
benchmarks=# select max(id),sorter,unsorter from aggtest group by sorter;
ERROR: column "aggtest.unsorter" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select max(id),sorter,unsorter from aggtest group by sorter;
Using a max(unsorter) obviously does not work either:
benchmarks=# select max(id),sorter,max(unsorter) from aggtest group by sorter;
max | sorter | max
-----+--------+-----
5 | 1 | 6
7 | 2 | 7
(2 rows)
However using distinct (the accepted answer) I get:
benchmarks=# select distinct on (sorter) id,sorter,unsorter from aggtest order by sorter, id desc;
id | sorter | unsorter
----+--------+----------
5 | 1 | 3
7 | 2 | 5
(2 rows)
Which has the correct additional data. The join approach also seems to work, but is slightly slower on the test data.
Why not use a window function:
select id, sorter
from (
select id, sorter,
row_number() over (partition by sorter order by id desc) as rn
from aggtest
) t
where rn = 1;
Or use the Postgres DISTINCT ON clause, which is usually faster:
select distinct on (sorter) id, sorter
from aggtest
order by sorter, id desc
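A sketch of the window-function variant against the full sample table from the question, including the unsorter column, to confirm that the whole row with the highest id comes back for each sorter. It runs on Python's bundled SQLite (3.25 or later) as a stand-in for Postgres. DISTINCT ON is Postgres-specific and is therefore not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aggtest (id INTEGER, sorter INTEGER, unsorter INTEGER)")
conn.executemany(
    "INSERT INTO aggtest VALUES (?, ?, ?)",
    [(1, 1, 1), (3, 1, 2), (5, 1, 3), (2, 2, 4), (7, 2, 5), (4, 1, 6), (6, 2, 7)],
)

# Rank the rows of each sorter from highest id to lowest, then keep rank 1.
# Because the filter works on whole rows, unsorter comes from the same row as id.
last_rows = conn.execute("""
    SELECT id, sorter, unsorter
    FROM (
        SELECT id, sorter, unsorter,
               row_number() OVER (PARTITION BY sorter ORDER BY id DESC) AS rn
        FROM aggtest
    ) AS ranked
    WHERE rn = 1
    ORDER BY sorter
""").fetchall()
conn.close()

print(last_rows)
```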
You write:
Of course there is an easier way, to get the last (highest) id for
each group by using the max aggregate. However, I am not so much
interested in the id but as in data associated with it (i.e. in the
same row).
This query will give you the data associated with the highest id of each sorter group.
select a.* from aggtest a
join (
select max(id) max_id, sorter
from aggtest
group by sorter
) b on a.id = b.max_id and a.sorter = b.sorter
select distinct max(id) over (partition by sorter) id,sorter
from aggtest order by 2 asc
returns:
5;1
7;2

postgres - partial column in SELECT/GROUP BY - column must appear in the GROUP BY clause or be used in an aggregate function

Both the following two statements produce an error in Postgres:
SELECT substring(start_time,1,8) AS date, count(*) as total from cdrs group by date;
SELECT substring(start_time,1,8) AS date, count(*) as total from cdrs group by substring(start_time,1,8);
The error is:
column "cdrs.start_time" must appear in the GROUP BY clause or be used
in an aggregate function
My reading of postgres docs is that both SELECT and GROUP BY can use an expression
postgres 8.3 SELECT
The start_time field is a string and has a date/time in the form ccyymmddHHMMSS. In MySQL they both produce the desired and expected results:
+----------+-------+
| date | total |
+----------+-------+
| 20091028 | 9 |
| 20091029 | 110 |
| 20091120 | 14 |
| 20091121 | 4 |
+----------+-------+
4 rows in set (0.00 sec)
I need to stick with Postgres (heroku). Any suggestions?
p.s. there is lots of other discussion around that talks about missing items in GROUP BY and why mySQL accepts this, why others don't ... strict adherence to SQL spec etc etc, but I think this is sufficiently different to 1062158/converting-mysql-select-to-postgresql and 1769361/postgresql-group-by-different-from-mysql to warrant a separate question.
You did something else that you didn't describe in the question, as both of your queries work just fine. Tested on 8.5 and 8.3.8:
# create table cdrs (start_time text);
CREATE TABLE
# insert into cdrs (start_time) values ('20090101121212'),('20090101131313'),('20090510040603');
INSERT 0 3
# SELECT substring(start_time,1,8) AS date, count(*) as total from cdrs group by date;
date | total
----------+-------
20090510 | 1
20090101 | 2
(2 rows)
# SELECT substring(start_time,1,8) AS date, count(*) as total from cdrs group by substring(start_time,1,8);
date | total
----------+-------
20090510 | 1
20090101 | 2
(2 rows)
Just to summarise, error
column "cdrs.start_time" must appear in the GROUP BY clause or be used in an aggregate function
was caused (in this case) by ORDER BY start_time clause. Full statement needed to be either:
SELECT substring(start_time,1,8) AS date, count(*) as total FROM cdrs GROUP BY substring(start_time,1,8) ORDER BY substring(start_time,1,8);
or
SELECT substring(start_time,1,8) AS date, count(*) as total FROM cdrs GROUP BY date ORDER BY date;
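The corrected shape of the statement, with the same expression in SELECT, GROUP BY and ORDER BY, can be checked with a short sketch. It uses Python's bundled SQLite, where substr plays the role of substring. Note that SQLite is more permissive than Postgres and would not raise the error discussed here, so the sketch only demonstrates the working form. The alias day is a choice made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cdrs (start_time TEXT)")
conn.executemany(
    "INSERT INTO cdrs (start_time) VALUES (?)",
    [("20090101121212",), ("20090101131313",), ("20090510040603",)],
)

# Every clause refers to the same expression, never to the raw start_time column.
totals_per_day = conn.execute("""
    SELECT substr(start_time, 1, 8) AS day, count(*) AS total
    FROM cdrs
    GROUP BY substr(start_time, 1, 8)
    ORDER BY substr(start_time, 1, 8)
""").fetchall()
conn.close()

print(totals_per_day)
```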
Two simple things you might try:
Upgrade to postgres 8.4.1
Both queries Work Just Fine For Me(tm) under pg841
Group by ordinal position
That is, GROUP BY 1 in this case.