PostgreSQL: Select the r.* by MIN() with group-by on two columns

The example schema of a table called results:

 id | user_id | activity_id | activity_type_id | start_date_local    | elapsed_time
----+---------+-------------+------------------+---------------------+--------------
  1 |     100 |       11111 |                1 | 2014-01-07 04:34:38 |         4444
  2 |     100 |       22222 |                1 | 2015-04-14 06:44:42 |         5555
  3 |     100 |       33333 |                1 | 2015-04-14 06:44:42 |         7777
  4 |     100 |       44444 |                2 | 2014-01-07 04:34:38 |        12345
  5 |     200 |       55555 |                1 | 2015-12-22 16:32:56 |         5023
The problem
Select the results of fastest activities (i.e. minimum elapsed time) of each user by activity_type_id and year.
(Basically, in this simplified example, record ID=3 should be excluded from the selection, because record ID=2 is the fastest for user 100 for the given activity_type_id 1 and the year 2015.)
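For anyone who wants to reproduce this, a minimal setup might look like the following (the column types are assumptions, since the question doesn't include the DDL):

CREATE TABLE results (
    id               serial PRIMARY KEY,
    user_id          integer NOT NULL,
    activity_id      integer NOT NULL,
    activity_type_id integer NOT NULL,
    start_date_local timestamp NOT NULL,
    elapsed_time     integer NOT NULL  -- assumed to be seconds
);

INSERT INTO results (user_id, activity_id, activity_type_id, start_date_local, elapsed_time) VALUES
    (100, 11111, 1, '2014-01-07 04:34:38', 4444),
    (100, 22222, 1, '2015-04-14 06:44:42', 5555),
    (100, 33333, 1, '2015-04-14 06:44:42', 7777),
    (100, 44444, 2, '2014-01-07 04:34:38', 12345),
    (200, 55555, 1, '2015-12-22 16:32:56', 5023);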
What I have tried
SELECT user_id,
       activity_type_id,
       EXTRACT(year FROM start_date_local) AS year,
       MIN(elapsed_time) AS fastest_time
FROM results
GROUP BY activity_type_id, user_id, year
ORDER BY activity_type_id, user_id, year;
Actual
This selects the correct result set, but the output contains only the grouped-by columns:
 user_id | activity_type_id | year | fastest_time
---------+------------------+------+--------------
     100 |                1 | 2014 |         4444
     100 |                1 | 2015 |         5555
     100 |                2 | 2014 |        12345
     200 |                1 | 2015 |         5023
Goal
To get the actual full records with all columns, i.e. results.* plus the derived year:
 id | user_id | activity_id | activity_type_id | start_date_local    | year | elapsed_time
----+---------+-------------+------------------+---------------------+------+--------------
  1 |     100 |       11111 |                1 | 2014-01-07 04:34:38 | 2014 |         4444
  2 |     100 |       22222 |                1 | 2015-04-14 06:44:42 | 2015 |         5555
  4 |     100 |       44444 |                2 | 2014-01-07 04:34:38 | 2014 |        12345
  5 |     200 |       55555 |                1 | 2015-12-22 16:32:56 | 2015 |         5023

I think you want this:
SELECT DISTINCT ON (user_id, activity_type_id, EXTRACT(year FROM start_date_local))
       *, EXTRACT(year FROM start_date_local) AS year
FROM results
ORDER BY user_id, activity_type_id, year, elapsed_time;

You can use a window function for this:
select id, user_id, activity_id, activity_type_id, start_date_local, year, elapsed_time
from (
    SELECT id,
           user_id,
           activity_id,
           activity_type_id,
           start_date_local,
           EXTRACT(year FROM start_date_local) AS year,
           elapsed_time,
           min(elapsed_time) over (partition by user_id, activity_type_id,
                                   EXTRACT(year FROM start_date_local)) as fastest_time
    FROM results
) t
where elapsed_time = fastest_time
order by activity_type_id, user_id, year;
Alternatively, using distinct on ():
select distinct on (activity_type_id, user_id, extract(year from start_date_local))
       id,
       user_id,
       activity_id,
       activity_type_id,
       extract(year from start_date_local) as year,
       elapsed_time
from results
order by activity_type_id, user_id, year, elapsed_time;

Related

Finding the highest after grouping by month

In Postgres, I want to output the persons who have the highest number of "discussed" requests for each month, irrespective of the year, i.e. there should be 12 output rows.
ID PERSON REQUEST DATE
4 datanoise opened 2010-09-02
5 marsuboss opened 2010-09-02
6 m3talsmith opened 2010-09-06
7 sferik opened 2010-09-08
8 sferik opened 2010-09-09
8 dtrasbo discussed 2010-09-09
8 brianmario discussed 2010-09-09
8 sferik discussed 2010-09-09
9 rsim opened 2011-09-09
.....more tuples to follow
This is just a small part of the database. Also assume that the dataset is big enough that all months are represented in the date column.
Test data:
CREATE TEMPORARY TABLE foo(
    id      SERIAL PRIMARY KEY,
    name    INTEGER NOT NULL,
    dt      DATE NULL,
    request BOOL NOT NULL
);
INSERT INTO foo (name, dt, request)
SELECT random()*1000,
       '2010-01-01'::DATE + ('1 DAY'::INTERVAL)*(random()*3650),
       random() > 0.5
FROM generate_series(1,100000) n;
SELECT * FROM foo LIMIT 10;
id | name | dt | request
----+------+------------+---------
1 | 110 | 2014-11-05 | f
2 | 747 | 2015-03-12 | t
3 | 604 | 2014-09-26 | f
4 | 211 | 2011-12-14 | t
5 | 588 | 2016-12-15 | f
6 | 96 | 2012-02-19 | f
7 | 17 | 2018-09-18 | t
8 | 591 | 2018-02-15 | t
9 | 370 | 2015-07-28 | t
10 | 844 | 2019-05-16 | f
Now you have to get the count per name and month, then take the maximum count per month. But that alone won't tell you which name has the maximum, so you have to join back against the per-name counts. To do the group by only once, it goes in a CTE:
WITH totals AS (
    SELECT EXTRACT(month FROM dt) mon, name, count(*) cnt
    FROM foo
    WHERE request = true
    GROUP BY name, mon
)
SELECT * FROM
    (SELECT mon, max(cnt) cnt FROM totals GROUP BY mon) x
    JOIN totals USING (mon, cnt);
If several names share the same maximum count, they will all be returned. To keep only one, you can use DISTINCT ON:
WITH totals AS (same as above)
SELECT DISTINCT ON (mon) * FROM
    (SELECT mon, max(cnt) cnt FROM totals GROUP BY mon) x
    JOIN totals USING (mon, cnt)
ORDER BY mon, name;
You can also use DISTINCT ON on its own to keep only one row per month, picked by the ORDER BY clause, in this case by count descending, so it keeps the highest count:
SELECT DISTINCT ON (mon) * FROM (
    SELECT EXTRACT(month FROM dt) mon, name, count(*) cnt
    FROM foo
    WHERE request = true
    GROUP BY name, mon
) x
ORDER BY mon, cnt DESC;
...or you could hack an argmax() function by sticking the primary key into an array passed to max(), which means it will return the id of the row which has the maximum value:
SELECT mon, cntid[1] cnt, name FROM
    (SELECT mon, max(ARRAY[cnt, id]) cntid FROM (
        SELECT EXTRACT(month FROM dt) mon, name, count(*) cnt, min(id) id
        FROM foo
        WHERE request = true
        GROUP BY name, mon
    ) x GROUP BY mon) y
JOIN foo ON (foo.id = cntid[2]);
Which one will be faster?...
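One way to find out on your own data is to run each variant under EXPLAIN ANALYZE and compare the timings, e.g. (PostgreSQL, using the test table above):

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (mon) * FROM (
    SELECT EXTRACT(month FROM dt) mon, name, count(*) cnt
    FROM foo
    WHERE request = true
    GROUP BY name, mon
) x
ORDER BY mon, cnt DESC;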
Given your table is named t01 and the date column is date1 (stored as a string):
create temp table t02 as
select extract(month from CAST(date1 as date)) as month, person, count(*) nb
from t01
where request = 'discussed'
group by 1, 2;

create temp table t03 as
select month, max(nb) max_nb
from t02
group by 1;
The result is then:
select month, person from t02 a natural join t03 b where a.nb = b.max_nb;
Run it here: https://rextester.com/BYMM84335
I would recommend distinct on. If you want to combine all the months into a single "uber-month":
select distinct on (extract(month from date)) person, extract(month from date), count(*) as num_discussed
from t
where request = 'discussed'
group by person, extract(month from date)
order by extract(month from date), num_discussed desc;
Distinct on is a very handy Postgres extension. It returns one row per "group", which is defined by the expressions in parentheses. The row is the "first" one as determined by the order by clause.
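A minimal, self-contained illustration of that behavior (inline values; the names grp and val are made up for the demo):

select distinct on (grp) grp, val
from (values ('a', 3), ('a', 1), ('b', 2)) as t(grp, val)
order by grp, val;
-- one row per grp, the one with the smallest val: ('a', 1) and ('b', 2)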
If you want the highest month regardless of year:
select distinct on (extract(month from date)) person, date_trunc('month', date), count(*) as num_discussed
from t
where request = 'discussed'
group by person, date_trunc('month', date)
order by extract(month from date), num_discussed desc;

SQL query to find continuous local max, min of date based on category column

I have the following data set
Customer_ID  Category  FROM_DATE  TO_DATE
1            5         1/1/2000   12/31/2001
1            6         1/1/2002   12/31/2003
1            5         1/1/2004   12/31/2005
2            7         1/1/2010   12/31/2011
2            7         1/1/2012   12/31/2013
2            5         1/1/2014   12/31/2015
3            7         1/1/2010   12/31/2011
3            7         1/5/2012   12/31/2013
3            5         1/1/2014   12/31/2015
The result I want to achieve is to find continuous local min/max date for Customers with the same category and identify any gap in dates:
Customer_ID  FROM_Date  TO_Date     Category
1            1/1/2000   12/31/2001  5
1            1/1/2002   12/31/2003  6
1            1/1/2004   12/31/2005  5
2            1/1/2010   12/31/2013  7
2            1/1/2014   12/31/2015  5
3            1/1/2010   12/31/2011  7
3            1/5/2012   12/31/2013  7
3            1/1/2014   12/31/2015  5
My code works fine for customer 1 (returns all 3 rows) and customer 2 (returns 2 rows with the min and max date for each category), but for customer 3 it cannot identify the gap between 12/31/2011 and 1/5/2012 for category 7, and returns:
Customer_ID  FROM_Date  TO_Date     Category
3            1/1/2010   12/31/2013  7
3            1/1/2014   12/31/2015  5
Here is my code:
SELECT Customer_ID, Category, min(From_Date), max(To_Date)
FROM (
    SELECT Customer_ID, Category, From_Date, To_Date,
           row_number() over (order by member_id, To_Date)
             - row_number() over (partition by Customer_ID order by Category) as p
    FROM FFS_SAMP
) X
group by Customer_ID, Category, p
order by Customer_ID, min(From_Date), Max(To_Date)
This is a type of gaps-and-islands problem. Probably the safest method is to use a cumulative max() to look for overlaps with previous records. Where there is no overlap, a new "island" of records starts. So:
select customer_id, min(from_date), max(to_date), category
from (select t.*,
             sum(case when prev_to_date >= from_date then 0 else 1 end) over
                 (partition by customer_id, category
                  order by from_date
                 ) as grp
      from (select t.*,
                   max(to_date) over (partition by customer_id, category
                                      order by from_date
                                      rows between unbounded preceding and 1 preceding
                                     ) as prev_to_date
            from t
           ) t
     ) t
group by customer_id, category, grp;
Your attempt is quite close. You just need to fix the over() clause of the window functions:
select customer_id, category, min(from_date), max(to_date)
from (
    select fs.*,
           row_number() over (partition by customer_id order by from_date)
             - row_number() over (partition by customer_id, category order by from_date) as grp
    from ffs_samp fs
) x
group by customer_id, category, grp
order by customer_id, min(from_date)
Note that this method assumes no gaps or overlaps in the periods of a given customer, as shown in your sample data.

3 or more consecutive entries in the last 15 days

I have the following data:
ID   EMP_ID   SALE_DATE
---------------------------------
1    777      5/28/2016
2    777      5/29/2016
3    777      5/30/2016
4    777      5/31/2016
5    888      5/26/2016
6    888      5/28/2016
7    888      5/29/2016
8    999      5/29/2016
9    999      5/30/2016
10   999      5/31/2016
I need to fetch the emp_ids having 3 or more days of consecutive sales in the last 15 days.
Output should be:
777
999
Following is the query:
SELECT TRUNC (sale_date), emp_id
FROM table1
WHERE sale_date >= SYSDATE - 14
GROUP BY TRUNC (sale_date), emp_id
HAVING COUNT (*) >= 3
But this returns consecutive transactions in the last three days only.
Note: This is Oracle.
Assuming you have one row per day, you can use lead():
select distinct emp_id
from (select t1.*,
lead(sale_date, 1) over (partition by emp_id order by sale_date) as sd_1,
lead(sale_date, 2) over (partition by emp_id order by sale_date) as sd_2
from table1 t1
where sale_date >= trunc(sysdate) - 14
) t
where sd_1 = sale_date + 1 and
sd_2 = sale_date + 2;
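If "one row per day" holds but you'd rather not hard-code the run length, a common alternative is the date-minus-row_number gaps-and-islands trick; a sketch in Oracle syntax, using the same table1 columns and assuming at most one sale row per day:

select distinct emp_id
from (
    select emp_id,
           trunc(sale_date)
             - row_number() over (partition by emp_id order by trunc(sale_date)) as grp
    from table1
    where sale_date >= trunc(sysdate) - 14
) t
group by emp_id, grp
having count(*) >= 3;

Consecutive days share the same grp value, so any group with 3 or more rows is a run of 3 or more consecutive sale days.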

SQL Grouped Results with Counts and Cumulative Counts

I have the following data in my 'user' table:
user_id | create_timestamp
      1 | 2017-08-01
      2 | 2017-08-01
      3 | 2017-08-02
      4 | 2017-08-03
      5 | 2017-08-03
      6 | 2017-08-03
      7 | 2017-08-04
      8 | 2017-08-04
      9 | 2017-08-04
     10 | 2017-08-04
I want to create a SQL query that has three columns:
1. Grouped results by create_timestamp
2. A count of the results by date
3. A cumulative count as the date goes on.
Here's what the result set should look like:
create_timestamp  daily  cumulative
2017-08-01        2      2
2017-08-02        1      3
2017-08-03        3      6
2017-08-04        4      10
You would use window functions for this:
select create_timestamp, count(*) as cnt,
sum(count(*)) over (order by create_timestamp) as cumulative
from t
group by create_timestamp
order by create_timestamp;
This functionality is available in SQL Server 2012+.
Note: You may need to extract the date from the time stamp:
select convert(date, create_timestamp) as dte, count(*) as cnt,
sum(count(*)) over (order by convert(date, create_timestamp)) as cumulative
from t
group by convert(date, create_timestamp)
order by convert(date, create_timestamp);
You can use this query:
DECLARE @UserLog TABLE (user_id INT, create_timestamp DATE)

INSERT INTO @UserLog
VALUES
    (1, '2017-08-01'),
    (2, '2017-08-01'),
    (3, '2017-08-02'),
    (4, '2017-08-03'),
    (5, '2017-08-03'),
    (6, '2017-08-03'),
    (7, '2017-08-04'),
    (8, '2017-08-04'),
    (9, '2017-08-04'),
    (10, '2017-08-04')

;WITH T AS (
    SELECT create_timestamp, COUNT(*) daily
    FROM @UserLog
    GROUP BY create_timestamp
)
SELECT create_timestamp,
       daily,
       SUM(daily) OVER (ORDER BY create_timestamp ASC
                        ROWS UNBOUNDED PRECEDING) cumulative
FROM T
Result:
create_timestamp daily cumulative
---------------- ----------- -----------
2017-08-01 2 2
2017-08-02 1 3
2017-08-03 3 6
2017-08-04 4 10

Writing subquery within SUM using values of 1 table

I have a table, and for each book_id I am trying to calculate the total sales in the past 100 days, for every day in the past year.
book_id  location  seller  daily_sales  order_day
ABC      1         XYZ     100          2017-05-05
ABC      1         XYZ     120          2017-05-07
ABC      1         XYZ     40           2017-02-10
...
So the result I expect is:
book_id  order_day   sum
ABC      2017-05-05  100+40
ABC      2017-05-07  100+120+40
ABC      2017-02-10  40
For this I wrote a query like this:
select book_id, to_char(order_day),
SUM(case when order_day between order_day -100 and order_day then daily_sales else 0 end) sum
FROM bookDetailsTable
where location = 1 AND ORDER_DAY BETWEEN TO_DATE('20170725','YYYYMMDD') - 359 AND TO_DATE('20170725','YYYYMMDD')
group by seller, book_id, order_day
I guess I am doing this wrong, and that I should instead write a select statement within the SUM to select data for the past 100 days.
You should get the result with this:
select A.book_id,
       A.order_day,
       ( select sum(B.daily_sales)
         from bookDetailsTable B
         where A.book_id = B.book_id
           and B.order_day between A.order_day - 100 and A.order_day
       )
from bookDetailsTable A
where A.order_day between ADD_MONTHS(trunc(sysdate), -12) and trunc(sysdate)
If you understand the principle of the query, you should be able to add your other restrictions, like seller or location
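For instance, carrying the location restriction into both the outer query and the correlated subquery might look like this (a sketch extending the query above, not tested against your schema):

select A.book_id,
       A.order_day,
       ( select sum(B.daily_sales)
         from bookDetailsTable B
         where A.book_id = B.book_id
           and B.location = A.location  -- extra restriction, added for illustration
           and B.order_day between A.order_day - 100 and A.order_day
       )
from bookDetailsTable A
where A.location = 1
  and A.order_day between ADD_MONTHS(trunc(sysdate), -12) and trunc(sysdate)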
This is a perfect case for using analytic functions, specifically the SUM() analytic function, along with the windowing clause:
WITH bookdetailstable AS (SELECT 'ABC' book_id, 1 LOCATION, 'XYZ' seller, 100 daily_sales, to_date('05/05/2016', 'dd/mm/yyyy') order_day FROM dual UNION ALL
SELECT 'ABC' book_id, 1 LOCATION, 'XYZ' seller, 120 daily_sales, to_date('07/05/2016', 'dd/mm/yyyy') order_day FROM dual UNION ALL
SELECT 'ABC' book_id, 1 LOCATION, 'XYZ' seller, 40 daily_sales, to_date('10/02/2016', 'dd/mm/yyyy') order_day FROM dual UNION ALL
SELECT 'ABC' book_id, 1 LOCATION, 'XYZ' seller, 600 daily_sales, to_date('10/02/2017', 'dd/mm/yyyy') order_day FROM dual)
SELECT book_id,
to_char(order_day, 'yyyy-mm-dd') order_day,
total_sales_last_100_days
FROM (SELECT book_id,
order_day,
SUM(daily_sales) OVER (PARTITION BY book_id ORDER BY order_day
RANGE BETWEEN 100 PRECEDING AND CURRENT ROW) total_sales_last_100_days
FROM bookdetailstable
where order_day >= add_months(trunc(sysdate) - 100, -12))
where order_day >= add_months(trunc(SYSDATE), -12);
BOOK_ID ORDER_DAY TOTAL_SALES_LAST_100_DAYS
------- ---------- -------------------------
ABC 2016-02-10 40
ABC 2016-05-05 140
ABC 2016-05-07 260
ABC 2017-02-10 600
This simply says: get the sum of daily_sales for each book_id (you can think of the partition by clause as being similar to the group by clause - it simply defines the group of rows the function applies over), ordered by order_day, looking back over the preceding 100 days plus the current row. With RANGE and an ORDER BY on a date, the window is measured in days, not in rows.
If you needed to work out the cumulative sum for specific book_ids based on location (and seller and ....), then you would need to include the extra grouping columns in the partition by clause.
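For example (a sketch against the demo data above), adding location to the partition keeps a separate running window per (book_id, location) pair:

SELECT book_id,
       location,
       order_day,
       SUM(daily_sales) OVER (PARTITION BY book_id, location
                              ORDER BY order_day
                              RANGE BETWEEN 100 PRECEDING AND CURRENT ROW) total_sales_last_100_days
FROM bookdetailstable;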
Since you want to restrict the results to the past year, and assuming you want the first row to return the sum for its preceding 100 days as well (rather than starting from zero at the cutoff), you need to include the 100 days prior to a year ago in the inner query. The outer query then restricts the rows to the year's worth of data you're interested in.
That's because analytic functions work across the data after it's been filtered by the where clause, so if you want to include data from outside the current where clause, you're going to have to look for a way to include those rows and then do the additional filtering later.
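One detail worth demonstrating: ROWS and RANGE windows differ as soon as there are gaps between dates. A toy comparison (Oracle syntax, made-up data):

WITH d AS (
    SELECT DATE '2017-01-01' dt, 10 val FROM dual UNION ALL
    SELECT DATE '2017-01-02' dt, 20 val FROM dual UNION ALL
    SELECT DATE '2017-01-05' dt, 30 val FROM dual
)
SELECT dt,
       SUM(val) OVER (ORDER BY dt ROWS  BETWEEN 1 PRECEDING AND CURRENT ROW) rows_sum,
       SUM(val) OVER (ORDER BY dt RANGE BETWEEN 1 PRECEDING AND CURRENT ROW) range_sum
FROM d;
-- For 2017-01-05: rows_sum = 50 (previous row's 20 plus 30),
-- but range_sum = 30, because no other row falls within the preceding 1 day.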