SQL Grouped Results with Counts and Cumulative Counts

I have the following data in my 'user' table:
user_id | create_timestamp
1 2017-08-01
2 2017-08-01
3 2017-08-02
4 2017-08-03
5 2017-08-03
6 2017-08-03
7 2017-08-04
8 2017-08-04
9 2017-08-04
10 2017-08-04
I want to create a SQL query that has three columns:
1. Grouped results by create_timestamp
2. A count of the results by date
3. A cumulative count as the date goes on.
Here's what the result set should look like:
create_timestamp daily cumulative
2017-08-01 2 2
2017-08-02 1 3
2017-08-03 3 6
2017-08-04 4 10

You would use window functions for this:
select create_timestamp, count(*) as cnt,
sum(count(*)) over (order by create_timestamp) as cumulative
from t
group by create_timestamp
order by create_timestamp;
This functionality is available in SQL Server 2012+.
Note: If create_timestamp includes a time component, you may need to extract the date from it first:
select convert(date, create_timestamp) as dte, count(*) as cnt,
sum(count(*)) over (order by convert(date, create_timestamp)) as cumulative
from t
group by convert(date, create_timestamp)
order by convert(date, create_timestamp);
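The nested SUM(COUNT(*)) OVER (...) pattern above can be tried locally. The following is a minimal sketch, assuming SQLite (via Python's sqlite3 module, SQLite 3.25 or later) rather than SQL Server, and a hypothetical table name `users` loaded with the question's ten rows:

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for SQL Server.
# "users" is a hypothetical table name; the rows mirror the question's data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, create_timestamp TEXT)")
days = ["2017-08-01"] * 2 + ["2017-08-02"] + ["2017-08-03"] * 3 + ["2017-08-04"] * 4
conn.executemany("INSERT INTO users (create_timestamp) VALUES (?)", [(d,) for d in days])

# COUNT(*) is evaluated per group first; the window SUM then runs over
# the grouped rows in date order, producing the running total.
query = """
    SELECT create_timestamp,
           COUNT(*) AS daily,
           SUM(COUNT(*)) OVER (ORDER BY create_timestamp) AS cumulative
    FROM users
    GROUP BY create_timestamp
    ORDER BY create_timestamp
"""
cumulative_rows = conn.execute(query).fetchall()
for row in cumulative_rows:
    print(row)
```

The inner COUNT(*) belongs to the GROUP BY, and the outer SUM is the window function, which is why the two can be nested in a single SELECT.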

You can use this query.
DECLARE @UserLog TABLE (user_id INT, create_timestamp DATE)
INSERT INTO @UserLog
VALUES
(1,'2017-08-01'),
(2,'2017-08-01'),
(3,'2017-08-02'),
(4,'2017-08-03'),
(5,'2017-08-03'),
(6,'2017-08-03'),
(7,'2017-08-04'),
(8,'2017-08-04'),
(9,'2017-08-04'),
(10,'2017-08-04')
;WITH T AS (
SELECT create_timestamp, COUNT(*) daily FROM @UserLog
GROUP BY create_timestamp)
SELECT
create_timestamp,
daily,
SUM(daily) OVER( ORDER BY create_timestamp ASC
ROWS UNBOUNDED PRECEDING ) cumulative
FROM T
Result
create_timestamp daily cumulative
---------------- ----------- -----------
2017-08-01 2 2
2017-08-02 1 3
2017-08-03 3 6
2017-08-04 4 10

Related

Finding the highest after grouping by month

In Postgres, I want to output the person with the highest number of "discussed" requests for each month, irrespective of the year, i.e. there should be 12 rows of output.
ID PERSON REQUEST DATE
4 datanoise opened 2010-09-02
5 marsuboss opened 2010-09-02
6 m3talsmith opened 2010-09-06
7 sferik opened 2010-09-08
8 sferik opened 2010-09-09
8 dtrasbo discussed 2010-09-09
8 brianmario discussed 2010-09-09
8 sferik discussed 2010-09-09
9 rsim opened 2011-09-09
.....more tuples to follow
*This is just a small part of the database. Also assume that the dataset is big enough that all months are represented in the date column.
Test data:
CREATE TEMPORARY TABLE foo( id SERIAL PRIMARY KEY, name INTEGER NOT NULL,
dt DATE NULL, request BOOL NOT NULL );
INSERT INTO foo (name,dt,request) SELECT random()*1000,
'2010-01-01'::DATE+('1 DAY'::INTERVAL)*(random()*3650), random()>0.5
FROM generate_series(1,100000) n;
SELECT * FROM foo LIMIT 10;
id | name | dt | request
----+------+------------+---------
1 | 110 | 2014-11-05 | f
2 | 747 | 2015-03-12 | t
3 | 604 | 2014-09-26 | f
4 | 211 | 2011-12-14 | t
5 | 588 | 2016-12-15 | f
6 | 96 | 2012-02-19 | f
7 | 17 | 2018-09-18 | t
8 | 591 | 2018-02-15 | t
9 | 370 | 2015-07-28 | t
10 | 844 | 2019-05-16 | f
You need the count per name and month, and then the maximum count per month. The maximum alone does not tell you which name reached it, so you have to join back to the per-name counts. To run the GROUP BY only once, the counts are computed in a CTE:
WITH totals AS (
SELECT EXTRACT(month FROM dt) mon, name, count(*) cnt FROM foo
WHERE request=true GROUP BY name,mon
)
SELECT * FROM
(SELECT mon, max(cnt) cnt FROM totals GROUP BY mon) x
JOIN totals USING (mon,cnt);
If several names share the same maximum count, all of them will be returned. To keep only one, you can use DISTINCT ON:
WITH (same as above)
SELECT DISTINCT ON (mon) * FROM
(SELECT mon, max(cnt) cnt FROM totals GROUP BY mon) x
JOIN totals USING (mon,cnt) ORDER BY mon,name;
You can also use DISTINCT ON to keep only one row per month, as determined by the ORDER BY clause, in this case by count descending, so it keeps the highest count.
SELECT DISTINCT ON (mon) * FROM (
SELECT EXTRACT(month FROM dt) mon, name, count(*) cnt FROM foo
WHERE request=true GROUP BY name,mon
)x ORDER BY mon, cnt DESC;
...or you could hack an argmax() function by sticking the primary key into an array passed to max(), which means it will return the id of the row which has the maximum value:
SELECT mon, cntid[1] cnt, name FROM
(SELECT mon, max(ARRAY[cnt,id]) cntid FROM (
SELECT EXTRACT(month FROM dt) mon, name, count(*) cnt, min(id) id FROM foo
WHERE request=true GROUP BY name,mon
) x GROUP BY mon)y
JOIN foo ON (foo.id=cntid[2]);
Which one will be faster?...
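The CTE-plus-join variant can be checked on a small data set. This is a rough sketch, assuming SQLite through Python's sqlite3 module instead of Postgres (so `strftime('%m', ...)` replaces `EXTRACT(month FROM ...)`), with a hypothetical `requests` table and a handful of invented rows:

```python
import sqlite3

# Assumption: SQLite stands in for Postgres; table name and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (person TEXT, request TEXT, dt TEXT)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?)",
    [
        ("sferik", "discussed", "2010-09-09"),
        ("sferik", "discussed", "2011-09-12"),   # same month, different year
        ("dtrasbo", "discussed", "2010-09-09"),
        ("sferik", "opened", "2010-09-08"),      # filtered out: not "discussed"
        ("rsim", "discussed", "2010-10-01"),
        ("dtrasbo", "discussed", "2011-10-03"),  # ties with rsim in October
    ],
)

# Count per (month, person) once, then join the per-month maximum back
# to those counts to recover which person reached it.
query = """
    WITH totals AS (
        SELECT CAST(strftime('%m', dt) AS INTEGER) AS mon, person, COUNT(*) AS cnt
        FROM requests
        WHERE request = 'discussed'
        GROUP BY mon, person
    )
    SELECT mon, person, cnt
    FROM (SELECT mon, MAX(cnt) AS cnt FROM totals GROUP BY mon) AS x
    JOIN totals USING (mon, cnt)
    ORDER BY mon, person
"""
top_per_month = conn.execute(query).fetchall()
for row in top_per_month:
    print(row)
```

With these rows, October has two people tied at the maximum, so the join returns both of them.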
Given that your table is named t01 and the date column is date1 (stored as a string):
create temp table t02 as
select extract(month from CAST(date1 as date)) as month, person, count(*) nb from t01 where request = 'discussed' group by 1, 2 ;
create temp table t03 as
select month, max(nb) max_nb from t02 group by 1 ;
The result is:
select month , person from t02 a natural join t03 b where a.nb = b.max_nb;
Run it here: https://rextester.com/BYMM84335
I would recommend distinct on. If you want to combine all the months into a single "uber-month":
select distinct on (extract(month from date)) person, extract(month from date), count(*) as num_discussed
from t
where request = 'discussed'
group by person, extract(month from date)
order by extract(month from date), num_discussed desc;
DISTINCT ON is a very handy Postgres extension. It returns one row per "group", which is defined by the expressions in parentheses. The row is the "first" one as determined by the ORDER BY clause.
If you instead want, for each month number, the highest single calendar month, whichever year it falls in:
select distinct on (extract(month from date_trunc('month', date))) person, date_trunc('month', date), count(*) as num_discussed
from t
where request = 'discussed'
group by person, date_trunc('month', date)
order by extract(month from date_trunc('month', date)), num_discussed desc;
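DISTINCT ON is specific to Postgres. In engines without it, the same "first row per group" behaviour can be approximated with ROW_NUMBER(). The sketch below assumes SQLite (3.25 or later) via Python's sqlite3 module, a hypothetical `requests` table, and invented rows; the `person` tiebreaker in the ORDER BY is also an assumption, added so the result is deterministic:

```python
import sqlite3

# Assumption: SQLite stands in for Postgres; table name and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (person TEXT, request TEXT, dt TEXT)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?)",
    [
        ("sferik", "discussed", "2010-09-09"),
        ("sferik", "discussed", "2011-09-12"),
        ("dtrasbo", "discussed", "2010-09-09"),
        ("rsim", "discussed", "2010-10-01"),
        ("dtrasbo", "discussed", "2011-10-03"),
    ],
)

# PARTITION BY plays the role of the DISTINCT ON expressions, and the
# window's ORDER BY decides which row counts as "first" in each group.
query = """
    WITH totals AS (
        SELECT CAST(strftime('%m', dt) AS INTEGER) AS mon, person, COUNT(*) AS cnt
        FROM requests
        WHERE request = 'discussed'
        GROUP BY mon, person
    ),
    ranked AS (
        SELECT mon, person, cnt,
               ROW_NUMBER() OVER (PARTITION BY mon ORDER BY cnt DESC, person) AS rn
        FROM totals
    )
    SELECT mon, person, cnt FROM ranked WHERE rn = 1 ORDER BY mon
"""
first_per_month = conn.execute(query).fetchall()
for row in first_per_month:
    print(row)
```

Unlike the join against MAX(cnt), this keeps exactly one row per month even when several people are tied.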

PostgreSQL Select the r.* by MIN() with group-by on two columns

The example schema of a table called results
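id | user_id | activity_id | activity_type_id | start_date_local    | elapsed_time
---+---------+-------------+------------------+---------------------+-------------
1  | 100     | 11111       | 1                | 2014-01-07 04:34:38 | 4444
2  | 100     | 22222       | 1                | 2015-04-14 06:44:42 | 5555
3  | 100     | 33333       | 1                | 2015-04-14 06:44:42 | 7777
4  | 100     | 44444       | 2                | 2014-01-07 04:34:38 | 12345
5  | 200     | 55555       | 1                | 2015-12-22 16:32:56 | 5023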
The problem
Select the results of fastest activities (i.e. minimum elapsed time) of each user by activity_type_id and year.
(Basically, in this simplified example, record ID=3 should be excluded from the selection, because record ID=2 is the fastest for user 100 of the given activity_type_id 1 and the year of 2015)
What I have tried
SELECT user_id,
activity_type_id,
EXTRACT(year FROM start_date_local) AS year,
MIN(elapsed_time) AS fastest_time
FROM results
GROUP BY activity_type_id, user_id, year
ORDER BY activity_type_id, user_id, year;
Actual
Which selects the correct result set I want, but only contains the grouped by columns
user_id | activity_type_id | year | fastest_time
--------+------------------+------+-------------
100     | 1                | 2014 | 4444
100     | 1                | 2015 | 5555
100     | 2                | 2014 | 12345
200     | 1                | 2015 | 5023
Goal
To have the actual full record with all columns. i.e. results.* + year
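id | user_id | activity_id | activity_type_id | start_date_local    | year | elapsed_time
---+---------+-------------+------------------+---------------------+------+-------------
1  | 100     | 11111       | 1                | 2014-01-07 04:34:38 | 2014 | 4444
2  | 100     | 22222       | 1                | 2015-04-14 06:44:42 | 2015 | 5555
4  | 100     | 44444       | 2                | 2014-01-07 04:34:38 | 2014 | 12345
5  | 200     | 55555       | 1                | 2015-12-22 16:32:56 | 2015 | 5023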
I think you want this:
SELECT DISTINCT ON (user_id, activity_type_id, EXTRACT(year FROM start_date_local))
*, EXTRACT(year FROM start_date_local) AS year
FROM results
ORDER BY user_id, activity_type_id, year, elapsed_time;
You can use a window function for this:
select id, user_id, activity_id, activity_type_id, start_date_local, year, elapsed_time
from (
SELECT id,
user_id,
activity_id,
activity_type_id,
start_date_local,
EXTRACT(year FROM start_date_local) AS year,
elapsed_time,
min(elapsed_time) over (partition by user_id, activity_type_id, EXTRACT(year FROM start_date_local)) as fastest_time
FROM results
) t
where elapsed_time = fastest_time
order by activity_type_id, user_id, year;
Alternatively using distinct on ()
select distinct on (activity_type_id, user_id, extract(year from start_date_local))
id,
user_id,
activity_id,
activity_type_id,
extract(year from start_date_local) as year,
elapsed_time
from results
order by activity_type_id, user_id, year, elapsed_time;
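To see the window-function variant return the full rows, here is a small sketch that loads the question's five records. It assumes SQLite (3.25 or later) via Python's sqlite3 module rather than Postgres, so `strftime('%Y', ...)` stands in for `EXTRACT(year FROM ...)`:

```python
import sqlite3

# Assumption: SQLite stands in for Postgres; rows are the question's sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE results (
           id INTEGER PRIMARY KEY, user_id INTEGER, activity_id INTEGER,
           activity_type_id INTEGER, start_date_local TEXT, elapsed_time INTEGER)"""
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 100, 11111, 1, "2014-01-07 04:34:38", 4444),
        (2, 100, 22222, 1, "2015-04-14 06:44:42", 5555),
        (3, 100, 33333, 1, "2015-04-14 06:44:42", 7777),
        (4, 100, 44444, 2, "2014-01-07 04:34:38", 12345),
        (5, 200, 55555, 1, "2015-12-22 16:32:56", 5023),
    ],
)

# Attach the per-(user, type, year) minimum to every row, then keep only
# the rows whose own elapsed_time equals that minimum.
query = """
    SELECT id, user_id, activity_type_id, year, elapsed_time
    FROM (
        SELECT id, user_id, activity_type_id, elapsed_time,
               CAST(strftime('%Y', start_date_local) AS INTEGER) AS year,
               MIN(elapsed_time) OVER (
                   PARTITION BY user_id, activity_type_id, strftime('%Y', start_date_local)
               ) AS fastest_time
        FROM results
    ) AS t
    WHERE elapsed_time = fastest_time
    ORDER BY activity_type_id, user_id, year
"""
fastest_rows = conn.execute(query).fetchall()
for row in fastest_rows:
    print(row)
```

If two activities share the same fastest time within a group, this approach returns both of them, whereas DISTINCT ON keeps a single row.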

How to use SQL to get column count for a previous date?

I have the following table,
id status price date
2 complete 10 2020-01-01 10:10:10
2 complete 20 2020-02-02 10:10:10
2 complete 10 2020-03-03 10:10:10
3 complete 10 2020-04-04 10:10:10
4 complete 10 2020-05-05 10:10:10
Required output,
id status_count price ratio
2 0 0 0
2 1 10 0
2 2 30 0.33
For each row, I want the sum of the prices of the previous rows. Row 1 is 0 because it has no previous row.
I also want the ratio, e.g. 10/30 = 0.33.
You can use the analytic functions ROW_NUMBER and SUM as follows:
SELECT
id,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) - 1 AS status_count,
COALESCE(SUM(price) OVER (PARTITION BY id ORDER BY date), 0) - price as price
FROM yourTable;
I think you want something like this:
SELECT
id,
COUNT(*) OVER (PARTITION BY id ORDER BY date) - 1 AS status_count,
COALESCE(SUM(price) OVER (PARTITION BY id
ORDER BY date ROWS BETWEEN
UNBOUNDED PRECEDING AND 1 PRECEDING), 0) price
FROM yourTable;
Please also check another method:
with cte
as (select *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) - 1 AS status_count,
SUM(price) OVER (PARTITION BY id ORDER BY date) ss from yourTable)
select id, status_count, isnull(ss,0) - price price
from cte
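The "sum of all previous rows" frame can be verified against the question's data. A minimal sketch, assuming SQLite (3.25 or later) through Python's sqlite3 module and the placeholder table name `yourTable` used in the answers; `prev_total` is an illustrative alias:

```python
import sqlite3

# Assumption: SQLite stands in for the asker's database; rows are the question's data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (id INTEGER, status TEXT, price INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?)",
    [
        (2, "complete", 10, "2020-01-01 10:10:10"),
        (2, "complete", 20, "2020-02-02 10:10:10"),
        (2, "complete", 10, "2020-03-03 10:10:10"),
        (3, "complete", 10, "2020-04-04 10:10:10"),
        (4, "complete", 10, "2020-05-05 10:10:10"),
    ],
)

# The frame ends at "1 PRECEDING", so the current row's price is excluded.
# For the first row of each id the frame is empty and SUM yields NULL,
# which COALESCE turns into 0.
query = """
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) - 1 AS status_count,
           COALESCE(SUM(price) OVER (
               PARTITION BY id ORDER BY date
               ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS prev_total
    FROM yourTable
    ORDER BY id, date
"""
previous_totals = conn.execute(query).fetchall()
for row in previous_totals:
    print(row)
```

The first three rows match the required output for id 2 (0, 10, 30); the ratio column is left out here because the question does not fully specify how it is computed.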

SQL query to find continuous local max, min of date based on category column

I have the following data set
Customer_ID Category FROM_DATE TO_DATE
1 5 1/1/2000 12/31/2001
1 6 1/1/2002 12/31/2003
1 5 1/1/2004 12/31/2005
2 7 1/1/2010 12/31/2011
2 7 1/1/2012 12/31/2013
2 5 1/1/2014 12/31/2015
3 7 1/1/2010 12/31/2011
3 7 1/5/2012 12/31/2013
3 5 1/1/2014 12/31/2015
The result I want to achieve is to find continuous local min/max date for Customers with the same category and identify any gap in dates:
Customer_ID FROM_Date TO_Date Category
1 1/1/2000 12/31/2001 5
1 1/1/2002 12/31/2003 6
1 1/1/2004 12/31/2005 5
2 1/1/2010 12/31/2013 7
2 1/1/2014 12/31/2015 5
3 1/1/2010 12/31/2011 7
3 1/5/2012 12/31/2013 7
3 1/1/2014 12/31/2015 5
My code works for customer 1 (returns all 3 rows) and customer 2 (returns 2 rows with the min and max date for each category), but for customer 3 it fails to identify the gap between 12/31/2011 and 1/5/2012 for category 7 and returns:
Customer_ID FROM_Date TO_Date Category
3 1/1/2010 12/31/2013 7
3 1/1/2014 12/31/2015 5
Here is my code:
SELECT Customer_ID, Category, min(From_Date), max(To_Date) FROM
(
SELECT Customer_ID, Category, From_Date,To_Date
,row_number() over (order by member_id, To_Date) - row_number() over (partition by Customer_ID order by Category) as p
FROM FFS_SAMP
) X
group by Customer_ID,Category,p
order by Customer_ID,min(From_Date),Max(To_Date)
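This is a type of gaps-and-islands problem. Probably the safest method is to use a cumulative max() to look for overlaps with previous records. Where there is no overlap, an "island" of records starts. Note that the comparison below only merges periods that actually overlap; to also merge adjacent periods (one ending 12/31/2011 and the next starting 1/1/2012, as for customer 2), compare prev_to_date against from_date minus one day, using your database's date arithmetic. So: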
select customer_id, min(from_date), max(to_date), category
from (select t.*,
sum(case when prev_to_date >= from_date then 0 else 1 end) over
(partition by customer_id, category
order by from_date
) as grp
from (select t.*,
max(to_date) over (partition by customer_id, category
order by from_date
rows between unbounded preceding and 1 preceding
) as prev_to_date
from t
) t
) t
group by customer_id, category, grp;
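The cumulative-max approach can be exercised on the question's data. The sketch below is one possible adaptation, not the answer's exact query: it assumes SQLite (3.25 or later) via Python's sqlite3 module, dates rewritten as ISO strings so they compare correctly as text, and a one-day tolerance (`date(from_date, '-1 day')`) so that back-to-back periods count as continuous:

```python
import sqlite3

# Assumption: SQLite stands in for the asker's database; dates are stored as ISO text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ffs_samp (customer_id INTEGER, category INTEGER, from_date TEXT, to_date TEXT)"
)
conn.executemany(
    "INSERT INTO ffs_samp VALUES (?, ?, ?, ?)",
    [
        (1, 5, "2000-01-01", "2001-12-31"),
        (1, 6, "2002-01-01", "2003-12-31"),
        (1, 5, "2004-01-01", "2005-12-31"),
        (2, 7, "2010-01-01", "2011-12-31"),
        (2, 7, "2012-01-01", "2013-12-31"),
        (2, 5, "2014-01-01", "2015-12-31"),
        (3, 7, "2010-01-01", "2011-12-31"),
        (3, 7, "2012-01-05", "2013-12-31"),
        (3, 5, "2014-01-01", "2015-12-31"),
    ],
)

# Step 1: latest to_date among earlier rows of the same customer and category.
# Step 2: flag a new island when that date does not reach the day before
#         from_date (NULL for the first row also starts an island), and
#         turn the flags into island numbers with a running sum.
# Step 3: collapse each island to its min/max dates.
query = """
    WITH flagged AS (
        SELECT customer_id, category, from_date, to_date,
               MAX(to_date) OVER (
                   PARTITION BY customer_id, category ORDER BY from_date
                   ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_to_date
        FROM ffs_samp
    ),
    grouped AS (
        SELECT customer_id, category, from_date, to_date,
               SUM(CASE WHEN prev_to_date >= date(from_date, '-1 day') THEN 0 ELSE 1 END)
                   OVER (PARTITION BY customer_id, category ORDER BY from_date) AS grp
        FROM flagged
    )
    SELECT customer_id, MIN(from_date), MAX(to_date), category
    FROM grouped
    GROUP BY customer_id, category, grp
    ORDER BY customer_id, MIN(from_date)
"""
islands = conn.execute(query).fetchall()
for row in islands:
    print(row)
```

With the one-day tolerance, customer 2's two category 7 periods merge, while customer 3's periods stay separate because of the gap before 2012-01-05.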
Your attempt is quite close. You just need to fix the over() clause of the window functions:
select customer_id, category, min(from_date), max(to_date)
from (
select
fs.*,
row_number() over (partition by customer_id order by from_date)
- row_number() over (partition by customer_id, category order by from_date) as grp
from ffs_samp fs
) x
group by customer_id, category, grp
order by customer_id, min(from_date)
Note that this method assumes no gaps or overlaps in the periods of a given customer, as shown in your sample data.

Current record with group by function

I am trying to get the most recent aggregate value per session_id for a userid:
session_id 3 has two records, recent agg value is 80.00
session_id 4 has four records, recent agg value is 95.00
session_id 6 has three records, recent agg value is 72.00
Table:session_agg
id session_id userid agg date
-- ---------- ------ ----- -------
1 3 11 60.00 1573561586
4 3 11 80.00 1573561586
6 4 11 35.00 1573561749
7 4 11 50.00 1573561751
8 4 11 70.00 1573561912
10 4 11 95.00 1573561921
11 6 14 40.00 1573561945
12 6 14 67.00 1573561967
13 6 14 72.00 1573561978
select id, session_id, userid, agg, date from session_agg
WHERE date IN (select MAX(date) from session_agg GROUP BY session_id) AND
userid = 11
If you want to stick with your current approach, then you need to correlate the session_id in the subquery which checks for the max date for each session:
SELECT id, session_id, userid, agg, date
FROM session_agg sa1
WHERE
date = (SELECT MAX(date) FROM session_agg sa2 WHERE sa2.session_id = sa1.session_id) AND
userid = 11;
But, if your version of SQL supports analytic functions, ROW_NUMBER is an easier way to do this:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY date DESC) rn
FROM session_agg
)
SELECT id, session_id, userid, agg, date
FROM cte
WHERE rn = 1;
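The ROW_NUMBER approach can be run against the question's rows. This sketch assumes SQLite (3.25 or later) via Python's sqlite3 module. Because session_id 3 has two rows with the same date value, it also adds `id DESC` as a tiebreaker, on the assumption (not stated in the question) that a higher id means a more recent record:

```python
import sqlite3

# Assumption: SQLite stands in for the asker's database; rows are the question's data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE session_agg (id INTEGER, session_id INTEGER, userid INTEGER, agg REAL, date INTEGER)"
)
conn.executemany(
    "INSERT INTO session_agg VALUES (?, ?, ?, ?, ?)",
    [
        (1, 3, 11, 60.00, 1573561586),
        (4, 3, 11, 80.00, 1573561586),   # same date as id 1: needs a tiebreaker
        (6, 4, 11, 35.00, 1573561749),
        (7, 4, 11, 50.00, 1573561751),
        (8, 4, 11, 70.00, 1573561912),
        (10, 4, 11, 95.00, 1573561921),
        (11, 6, 14, 40.00, 1573561945),
        (12, 6, 14, 67.00, 1573561967),
        (13, 6, 14, 72.00, 1573561978),
    ],
)

# Number the rows of each session from newest to oldest and keep number 1.
# "id DESC" is an assumed tiebreaker for rows sharing the same date.
query = """
    WITH cte AS (
        SELECT *, ROW_NUMBER() OVER (
                      PARTITION BY session_id ORDER BY date DESC, id DESC) AS rn
        FROM session_agg
    )
    SELECT id, session_id, userid, agg, date
    FROM cte
    WHERE rn = 1
    ORDER BY session_id
"""
latest_rows = conn.execute(query).fetchall()
for row in latest_rows:
    print(row)
```

Without the tiebreaker, the row chosen for session_id 3 would be arbitrary, and the correlated MAX(date) subquery would return both of its rows.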