Postgres query different COUNT and ROW_NUMBER() - sql

I have a table messages with the following columns
group_id BIGINT,
user_id BIGINT,
message_date timestamp
For a given user_id I would like to count the total rows with that user_id, the number of distinct groups with that user_id, and, treating the per-user message counts as a leaderboard, that user's position in it.
I tried this query
SELECT main.total_m, main.group_number, main.pos
FROM (
SELECT user_id, COUNT(group_id) AS group_number, COUNT(user_id) AS total_m,
ROW_NUMBER() OVER (
PARTITION BY COUNT(user_id)
ORDER BY COUNT(user_id) DESC
) AS pos
FROM messages
WHERE message_date > date_trunc('week', now())
GROUP BY user_id, group_id
) AS main
WHERE user_id = %s
But I don't get the result I would like to have. Where am I wrong?

The power of "sample data" and an "expected result" is that they enable others to answer efficiently. The following is a complete guess, but perhaps it will prompt you to prepare a "Minimal, Complete, and Verifiable Example" (MCVE).
The details below can be accessed at SQL Fiddle
PostgreSQL 9.6 Schema Setup:
CREATE TABLE Messages
(USER_ID int, GROUP_ID int, MESSAGE_DATE timestamp)
;
INSERT INTO Messages
(USER_ID, GROUP_ID, MESSAGE_DATE)
VALUES
(1, 7, '2017-09-01 10:00:00'),
(1, 6, '2017-09-02 10:00:00'),
(1, 5, '2017-09-03 10:00:00'),
(1, 4, '2017-09-04 10:00:00'),
(1, 7, '2017-09-05 10:00:00'),
(2, 6, '2017-09-01 10:00:00'),
(2, 5, '2017-09-02 10:00:00'),
(2, 7, '2017-09-03 10:00:00'),
(2, 6, '2017-09-04 10:00:00'),
(2, 4, '2017-09-05 10:00:00'),
(2, 8, '2017-09-11 10:00:00')
;
Query 1:
select
user_id
, num_grps
, num_msgs
, dense_rank() over(order by num_grps DESC, num_msgs DESC, max_date DESC, user_id) rnk
from (
select
user_id
, count(distinct group_id) num_grps
, count(*) num_msgs
, max(message_date) max_date
from messages
group by
user_id
) d
Results:
| user_id | num_grps | num_msgs | rnk |
|---------|----------|----------|-----|
| 2 | 5 | 6 | 1 |
| 1 | 4 | 5 | 2 |

Looking at just the inner query, I see this in the select:
SELECT user_id, COUNT(group_id), ...
But this in the GROUP BY:
GROUP BY user_id, group_id
Put those together, and you'll never get a COUNT() result of anything other than 1, because each group_id forms its own group. The same applies to the total_m column.
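To see that grouping by user_id alone (as in Query 1 above) fixes both counts and yields the leaderboard position, the query can be run as-is against the sample data. SQLite ≥ 3.25 supports the same aggregate and window functions, so a quick Python sketch can stand in for Postgres here (the sqlite3 harness is just for illustration, not part of the original answer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (user_id INT, group_id INT, message_date TEXT);
INSERT INTO messages VALUES
(1, 7, '2017-09-01 10:00:00'), (1, 6, '2017-09-02 10:00:00'),
(1, 5, '2017-09-03 10:00:00'), (1, 4, '2017-09-04 10:00:00'),
(1, 7, '2017-09-05 10:00:00'), (2, 6, '2017-09-01 10:00:00'),
(2, 5, '2017-09-02 10:00:00'), (2, 7, '2017-09-03 10:00:00'),
(2, 6, '2017-09-04 10:00:00'), (2, 4, '2017-09-05 10:00:00'),
(2, 8, '2017-09-11 10:00:00');
""")

rows = conn.execute("""
SELECT user_id, num_grps, num_msgs,
       DENSE_RANK() OVER (ORDER BY num_grps DESC, num_msgs DESC,
                          max_date DESC, user_id) AS rnk
FROM (
    SELECT user_id,
           COUNT(DISTINCT group_id) AS num_grps,  -- distinct groups per user
           COUNT(*)                 AS num_msgs,  -- total messages per user
           MAX(message_date)        AS max_date
    FROM messages
    GROUP BY user_id   -- group by user only, not by group_id
) d
ORDER BY rnk
""").fetchall()
print(rows)  # [(2, 5, 6, 1), (1, 4, 5, 2)]
```

The output matches the results table above: user 2 leads with 5 distinct groups and 6 messages.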

Related

Compare values between two tables with over partition criteria

DB-Fiddle
/* Table Campaigns */
CREATE TABLE campaigns (
id SERIAL PRIMARY KEY,
insert_time DATE,
campaign VARCHAR,
tranches VARCHAR,
quantity DECIMAL);
INSERT INTO campaigns
(insert_time, campaign, tranches, quantity)
VALUES
('2021-01-01', 'C001', 't', '500'),
('2021-01-01', 'C002', 't', '600'),
('2021-01-02', 'C001', 't', '500'),
('2021-01-02', 'C002', 't', '600');
/* Table Tranches */
CREATE TABLE tranches (
id SERIAL PRIMARY KEY,
insert_time DATE,
campaign VARCHAR,
tranches VARCHAR,
quantity DECIMAL);
INSERT INTO tranches
(insert_time, campaign, tranches, quantity)
VALUES
('2021-01-01', 'C001', 't1', '200'),
('2021-01-01', 'C001', 't2', '120'),
('2021-01-01', 'C001', 't3', '180'),
('2021-01-01','C002', 't1', '350'),
('2021-01-01','C002', 't2', '250'),
('2021-01-02', 'C001', 't1', '400'),
('2021-01-02', 'C001', 't2', '120'),
('2021-01-02', 'C001', 't3', '180'),
('2021-01-02','C002', 't1', '350'),
('2021-01-02','C002', 't2', '250');
Expected Result:
insert_time | campaign | tranches | quantity_campaigns | quantity_tranches | check
--------------|------------|------------|---------------------|---------------------|-----------
2021-01-01 | C001 | t | 500 | 500 | ok
2021-01-01 | C002 | t | 600 | 600 | ok
--------------|------------|------------|---------------------|---------------------|------------
2021-01-02 | C001 | t | 500 | 700 | error
2021-01-02 | C002 | t | 600 | 600 | ok
I want to compare the total quantity per campaign in table campaigns with the total quantity per campaign in table tranches.
So far I have been able to develop this query:
SELECT
c.insert_time AS insert_time,
c.campaign AS campaign,
c.tranches AS tranches,
c.quantity AS quantity_campaigns,
t.quantity AS quantity_tranches,
(CASE WHEN
MAX(c.quantity) OVER(PARTITION BY c.insert_time, c.campaign) = SUM(t.quantity) OVER(PARTITION BY t.insert_time, t.campaign)
THEN 'ok' ELSE 'error' END) AS check
FROM campaigns c
LEFT JOIN tranches t ON c.campaign = t.campaign
ORDER BY 1,2,3,4,5;
However, it does not give me the expected result.
What do I need to change to make it work?
I think the result you're looking for should be something like this. The problem is that you're aggregating over two groupings after a join, which will either yield too many rows or incorrect calculations. By aggregating in CTEs and joining them only after the aggregation has occurred, you can achieve the results you are looking for. See my example below:
WITH campaign_agg AS(
SELECT c.insert_time, c.campaign, c.tranches, MAX(c.quantity) c_quantity
FROM campaigns c
GROUP BY c.insert_time, c.campaign, c.tranches
), tranch_agg AS(
SELECT t.insert_time, t.campaign, SUM(t.quantity) as t_sum
FROM tranches t
GROUP BY t.insert_time, t.campaign
)
SELECT c.insert_time, c.campaign, c.tranches, c.c_quantity, t.t_sum,
CASE WHEN c.c_quantity = t.t_sum THEN 'ok' ELSE 'error' END as check
FROM campaign_agg c
JOIN
tranch_agg t ON
t.insert_time = c.insert_time
AND t.campaign = c.campaign
ORDER BY c.insert_time, c.campaign
I have a db-fiddle for this as well: https://www.db-fiddle.com/f/33x4upVEcgTMNehiHCKzfN/1
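As a sanity check, the same CTE logic runs unchanged on SQLite ≥ 3.25; the only tweak in this Python sketch (a stand-in for the db-fiddle, not part of the original answer) is renaming the `check` alias to `chk`, since CHECK is a reserved word in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE campaigns (insert_time TEXT, campaign TEXT, tranches TEXT, quantity REAL);
INSERT INTO campaigns VALUES
('2021-01-01', 'C001', 't', 500), ('2021-01-01', 'C002', 't', 600),
('2021-01-02', 'C001', 't', 500), ('2021-01-02', 'C002', 't', 600);
CREATE TABLE tranches (insert_time TEXT, campaign TEXT, tranches TEXT, quantity REAL);
INSERT INTO tranches VALUES
('2021-01-01', 'C001', 't1', 200), ('2021-01-01', 'C001', 't2', 120),
('2021-01-01', 'C001', 't3', 180), ('2021-01-01', 'C002', 't1', 350),
('2021-01-01', 'C002', 't2', 250), ('2021-01-02', 'C001', 't1', 400),
('2021-01-02', 'C001', 't2', 120), ('2021-01-02', 'C001', 't3', 180),
('2021-01-02', 'C002', 't1', 350), ('2021-01-02', 'C002', 't2', 250);
""")

rows = conn.execute("""
WITH campaign_agg AS (
    -- one row per campaign/day: MAX() collapses duplicate campaign rows
    SELECT insert_time, campaign, tranches, MAX(quantity) AS c_quantity
    FROM campaigns GROUP BY insert_time, campaign, tranches
), tranch_agg AS (
    -- total tranche quantity per campaign/day
    SELECT insert_time, campaign, SUM(quantity) AS t_sum
    FROM tranches GROUP BY insert_time, campaign
)
SELECT c.insert_time, c.campaign, c.c_quantity, t.t_sum,
       CASE WHEN c.c_quantity = t.t_sum THEN 'ok' ELSE 'error' END AS chk
FROM campaign_agg c
JOIN tranch_agg t ON t.insert_time = c.insert_time AND t.campaign = c.campaign
ORDER BY c.insert_time, c.campaign
""").fetchall()
for r in rows:
    print(r)
```

Only 2021-01-02 / C001 comes out as 'error' (500 vs 700); the other three campaign-days match.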
DB-Fiddle
SELECT
c.insert_time AS insert_time,
c.campaign AS campaign,
c.tranches AS tranches,
SUM(c.quantity) AS quantity_campaigns,
SUM(t1.quantity) AS quantity_tranches,
(CASE WHEN SUM(c.quantity) <> SUM(t1.quantity) THEN 'error' ELSE 'ok' END) AS check
FROM campaigns c
LEFT JOIN
(SELECT
t.insert_time AS insert_time,
t.campaign AS campaign,
SUM(t.quantity) AS quantity
FROM tranches t
GROUP BY 1,2
ORDER BY 1,2) t1 on t1.insert_time = c.insert_time AND t1.campaign = c.campaign
GROUP BY 1,2,3
ORDER BY 1,2,3;

postgresql How show most frequent value per day date

I've got a problem with a query that is supposed to return the value that occurs most often per date.
+------------+------------------+
| Date | value |
+------------+------------------+
| 2020-01-01 | Programmer |
| 2020-01-02 | Technician |
| 2020-01-03 | Business Analyst |
+------------+------------------+
So far I have done
select count(headline) as asd, publication_date, employer -> 'name' as dsa from jobhunter
group by publication_date,dsa
ORDER BY publication_date DESC
But it shows 2020-12-31 19:06:00 instead of just YYYY-MM-DD
Any idea on how to fix this?
Test data:
create table tbl (
id serial primary key,
row_datetime TIMESTAMP,
row_val VARCHAR(60)
);
insert into tbl (row_datetime, row_val) values ('2021-01-01 00:00:00', 'a');
insert into tbl (row_datetime, row_val) values ('2021-01-01 01:00:00', 'a');
insert into tbl (row_datetime, row_val) values ('2021-01-01 02:00:00', 'b');
insert into tbl (row_datetime, row_val) values ('2021-01-02 00:00:00', 'a');
insert into tbl (row_datetime, row_val) values ('2021-01-02 01:00:00', 'b');
insert into tbl (row_datetime, row_val) values ('2021-01-02 02:00:00', 'b');
Example query:
SELECT dt, val, cnt
FROM (
SELECT dt, val, cnt, ROW_NUMBER() OVER (PARTITION BY dt ORDER BY cnt DESC) AS row_num
FROM (
SELECT dt, val, COUNT(val) AS cnt
FROM (
SELECT DATE(row_datetime) AS dt, row_val AS val FROM tbl
) AS T1 GROUP BY dt, val
) AS T2
) AS T3
WHERE row_num=1
ORDER BY dt ASC
You can additionally customize your query to optimize the performance, get more fields, etc.
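Since SQLite also has DATE() and ROW_NUMBER(), the example query can be checked end-to-end with a short Python sketch (the sqlite3 harness is an illustrative stand-in; in Postgres, `row_datetime::date` or `date_trunc('day', ...)` does the same truncation that DATE() does here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (id INTEGER PRIMARY KEY, row_datetime TEXT, row_val TEXT);
INSERT INTO tbl (row_datetime, row_val) VALUES
('2021-01-01 00:00:00', 'a'), ('2021-01-01 01:00:00', 'a'),
('2021-01-01 02:00:00', 'b'), ('2021-01-02 00:00:00', 'a'),
('2021-01-02 01:00:00', 'b'), ('2021-01-02 02:00:00', 'b');
""")

rows = conn.execute("""
SELECT dt, val, cnt
FROM (
    SELECT dt, val, cnt,
           ROW_NUMBER() OVER (PARTITION BY dt ORDER BY cnt DESC) AS row_num
    FROM (
        SELECT DATE(row_datetime) AS dt,  -- truncate timestamp to YYYY-MM-DD
               row_val AS val,
               COUNT(*) AS cnt
        FROM tbl
        GROUP BY dt, val
    ) AS t1
) AS t2
WHERE row_num = 1   -- keep only the most frequent value per day
ORDER BY dt
""").fetchall()
print(rows)  # [('2021-01-01', 'a', 2), ('2021-01-02', 'b', 2)]
```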

Calculate total time without vacations in postgres

I have a database table that represents activities and for each activity, how long it took.
It looks something like this :
activity_id | name | status | start_date | end_date
=================================================================
1 | name1 | WIP | 2019-07-24 ... | 2019-07-24 ...
start_date and end_date are timestamps. I use a view with a column total_time that is described like that:
date_part('day'::text,
COALESCE(sprint_activity.end_date::timestamp with time zone, CURRENT_TIMESTAMP)
- sprint_activity.start_date::timestamp with time zone
) + date_part('hour'::text,
COALESCE(sprint_activity.end_date::timestamp with time zone, CURRENT_TIMESTAMP)
- sprint_activity.start_date::timestamp with time zone
) / 24::double precision AS total_time
I would like to create a table for vacation or half day vacations that looks like:
date | work_percentage
=================================================
2019-07-24 | 0.4
2019-07-23 | 0.7
And then, I would like to calculate total_time in a way that uses this vacations table such that:
If a date is not in the table, it's considered to have work_percentage==1
For every date that is in the table, reduce the relative percentage from the total_time query.
So let's take an example:
Activity - "Write report" started at 11-July-2019 14:00 and ended at 15-July-2019 19:00 - so the time diff is 4 days and 5 hours.
The 13th and 14th were a weekend, so I'd like the vacations table to hold 2019-07-13 with work_percentage == 0, and the same for the 14th.
Deducting those vacations, the time diff would be 2 days and 5 hours, as the 13th and 14th are not workdays.
Hope this example explains it better.
I think you can take this example and add some modifications based on your database
Just ddl statements to test script
create table activities (
user_id int,
activity_id int,
name text,
status text,
start_date timestamp,
end_date timestamp
);
create table vacations (
user_id int,
date date,
work_percentage numeric
);
insert into activities
values
(1, 1, 'name1', 'WIP', timestamp'2019-07-20 10:00:00', timestamp'2019-07-25 8:00:00'),
(2, 2, 'name2', 'DONE', timestamp'2019-07-28 19:00:00', timestamp'2019-08-01 7:00:00'),
(1, 3, 'name3', 'DONE', timestamp'2019-07-21 12:00:00', timestamp'2019-07-21 15:00:00'),
(-1, 4, 'Write report', 'DONE', timestamp'2019-07-11 14:00:00', timestamp'2019-07-15 19:00:00');
insert into vacations
values
(1, date'2019-07-21', 0.5),
(1, date'2019-07-22', 0),
(1, date'2019-07-23', 0.25),
(2, date'2019-07-29', 0),
(2, date'2019-07-30', 0),
(-1, date'2019-07-13', 0),
(-1, date'2019-07-14', 0);
sql script
with
daily_activity as (
select
*,
date(
generate_series(
date(start_date),
date(end_date),
interval'1 day')
) as date_key
from
activities
),
raw_data as (
select
da.*,
v.work_percentage,
case
when date(start_date) = date(end_date)
then (end_date - start_date) * coalesce(work_percentage, 1)
when date(start_date) = date_key
then (date(start_date) + 1 - start_date) * coalesce(work_percentage, 1)
when date(end_date) = date_key
then (end_date - date(end_date)) * coalesce(work_percentage, 1)
else interval'24 hours' * coalesce(work_percentage, 1)
end as activity_coverage
from
daily_activity as da
left join vacations as v on da.user_id = v.user_id
and da.date_key = v.date
)
select
user_id,
activity_id,
name,
status,
start_date,
end_date,
justify_interval(sum(activity_coverage)) as total_activity_time
from
raw_data
group by
1, 2, 3, 4, 5, 6
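The per-day splitting that the script does with generate_series can be mirrored in plain Python, which makes the "Write report" example easy to verify: 4 days 5 hours of elapsed time minus the two zero-work weekend days leaves 2 days 5 hours. A minimal sketch (dates and percentages taken from the test data above; the function name is my own):

```python
from datetime import date, datetime, timedelta

def worked_time(start, end, vacations):
    """Sum per-day activity coverage, scaling each day by its work percentage.
    Days missing from `vacations` count as a full workday (percentage 1)."""
    total = timedelta()
    day = start.date()
    while day <= end.date():
        day_start = datetime.combine(day, datetime.min.time())
        day_end = day_start + timedelta(days=1)
        # clamp the activity interval to this calendar day
        covered = min(end, day_end) - max(start, day_start)
        total += covered * vacations.get(day, 1)
        day += timedelta(days=1)
    return total

vacations = {date(2019, 7, 13): 0, date(2019, 7, 14): 0}  # weekend: no work
span = worked_time(datetime(2019, 7, 11, 14), datetime(2019, 7, 15, 19), vacations)
print(span)  # 2 days, 5:00:00
```

The three covered slices are 10 hours on the 11th, two full days on the 12th and 15th's 19 hours, with the 13th and 14th multiplied down to zero, matching the expected result.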

select within period plus last before period

Thanks to everyone who took the time to comment and answer.
I have a price history table like that (pseudocode):
table price_history (
product_id,
price,
changed_date
)
in which the historical prices of some products are stored:
1, 1.0, '2017-12-18'
1, 1.2, '2017-12-20'
1, 0.9, '2018-04-20'
1, 1.1, '2018-07-20'
1, 1.3, '2018-07-22'
2, 10.0, '2017-12-15'
2, 11.0, '2017-12-16'
2, 9.9, '2018-01-02'
2, 10.3, '2018-04-04'
Now I want the prices of some products within a certain period. Eg. between 2018-01-01 and now.
The simple approach:
SELECT * FROM price_history
WHERE product_id in (1,2) AND changed_date >= '2018-01-01'
is not ok, since the individual price for each product from 2018-01-01 until the first price change is not included:
1, 0.9, '2018-04-20'
1, 1.1, '2018-07-20'
1, 1.3, '2018-07-22'
2, 9.9, '2018-01-02'
2, 10.3, '2018-04-04'
But it is crucial to know the prices from the start of the period.
So, in addition to the price changes within the period, the last change before it must also be included.
The result should be like so:
1, 1.2, '2017-12-20'
1, 0.9, '2018-04-20'
1, 1.1, '2018-07-20'
1, 1.3, '2018-07-22'
2, 11.0, '2017-12-16'
2, 9.9, '2018-01-02'
2, 10.3, '2018-04-04'
Q: how to specify such a select statement?
Edit:
The test scenario and the solution from Ajay Gupta
CREATE TABLE price_history (
product_id integer,
price float,
changed_date timestamp
);
INSERT INTO price_history (product_id,price,changed_date) VALUES
(1, 1.0, '2017-12-18'),
(1, 1.2, '2017-12-20'),
(1, 0.9, '2018-04-20'),
(1, 1.1, '2018-07-20'),
(1, 1.3, '2018-07-22'),
(2, 10.0, '2017-12-15'),
(2, 11.0, '2017-12-16'),
(2, 9.9, '2018-01-02'),
(2, 10.3, '2018-04-04');
Winning Select:
with cte1 as
(Select *, lag(changed_date,1,'01-01-1900')
over(partition by product_id order by changed_date)
as FromDate from price_history),
cte2 as (Select product_id, max(FromDate)
as changed_date from cte1
where '2018-01-01'
between FromDate and changed_date group by product_id)
Select p.* from price_history p
join cte2 c on p.product_id = c.product_id
where p.changed_date >= c.changed_date
order by product_id,changed_date;
Result:
product_id | price | changed_date
------------+-------+---------------------
1 | 1.2 | 2017-12-20 00:00:00
1 | 0.9 | 2018-04-20 00:00:00
1 | 1.1 | 2018-07-20 00:00:00
1 | 1.3 | 2018-07-22 00:00:00
2 | 11 | 2017-12-16 00:00:00
2 | 9.9 | 2018-01-02 00:00:00
2 | 10.3 | 2018-04-04 00:00:00
I must admit, this is way beyond my limited (PG-)SQL skills.
Using LAG and CTEs
with cte1 as (
Select *,
lag(changed_date,1,'01-01-1900') over(partition by product_id order by changed_date) as FromDate
from price_history
), cte2 as (
Select product_id, max(FromDate) as changed_date
from cte1
where '2018-01-01' between FromDate and changed_date
group by product_id
)
Select p.*
from price_history p
join cte2 c on p.product_id = c.product_id
where p.changed_date >= c.changed_date;
I guess this is what you are looking for
SELECT Top 1 * FROM price_history WHERE product_id in (1,2) AND changed_date < '2018-01-01'
UNION ALL
SELECT * FROM price_history WHERE product_id in (1,2) AND changed_date >= '2018-01-01'
You need the first change date and all other dates >= '2018-01-01'
select product_id,price, changed_date
from
(
select product_id,price, changed_date,
row_number() over(partition by product_id order by changed_date ) as rn
from price_history
) x
where x.rn = 2 and product_id in (1,2)
union all
select product_id, price, changed_date from price_history
where product_id in (1,2) and changed_date >= '2018-01-01'
If you have the option to change your table structure, a different approach would be to store both start_date and end_date in your table; that way your records would not depend on the previous/next row and your query becomes easier to write. See Slowly changing dimension - Type 2.
If you want to solve the problem with the existing structure, in PostgreSQL you can use LIMIT 1 to get the latest record before the period:
SELECT
*
FROM
price_history
WHERE
product_id in (1,2)
AND changed_date >= '2018-01-01'
UNION ALL
-- this would give you the latest price before changed_date
SELECT
*
FROM
price_history
WHERE
product_id in (1,2)
AND changed_date < '2018-01-01'
ORDER BY
changed_date DESC
LIMIT 1
The solution with UNION is still the simplest, but it was not realized correctly in the other answers. So:
SELECT * FROM price_history
WHERE product_id in (1,2) AND changed_date >= '2018-01-01'
union all
(
select distinct on (product_id)
*
from price_history
where product_id in (1,2) AND changed_date < '2018-01-01'
order by product_id, changed_date desc)
order by product_id, changed_date;
Demo
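The LAG/CTE approach from the accepted answer can also be sanity-checked outside Postgres. SQLite ≥ 3.25 supports the three-argument lag(), and ISO-8601 date strings compare correctly as text, so this Python sketch (a stand-in harness, not part of the original answer; the default is written '1900-01-01' to keep the string format consistent) reproduces the expected seven rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE price_history (product_id INT, price REAL, changed_date TEXT);
INSERT INTO price_history VALUES
(1, 1.0, '2017-12-18'), (1, 1.2, '2017-12-20'), (1, 0.9, '2018-04-20'),
(1, 1.1, '2018-07-20'), (1, 1.3, '2018-07-22'), (2, 10.0, '2017-12-15'),
(2, 11.0, '2017-12-16'), (2, 9.9, '2018-01-02'), (2, 10.3, '2018-04-04');
""")

rows = conn.execute("""
WITH cte1 AS (
    SELECT *,
           -- previous change date per product; ISO default for first row
           LAG(changed_date, 1, '1900-01-01')
               OVER (PARTITION BY product_id ORDER BY changed_date) AS from_date
    FROM price_history
), cte2 AS (
    -- the change that was in force on 2018-01-01, per product
    SELECT product_id, MAX(from_date) AS changed_date
    FROM cte1
    WHERE '2018-01-01' BETWEEN from_date AND changed_date
    GROUP BY product_id
)
SELECT p.product_id, p.price, p.changed_date
FROM price_history p
JOIN cte2 c ON p.product_id = c.product_id
WHERE p.changed_date >= c.changed_date
ORDER BY p.product_id, p.changed_date
""").fetchall()
print(len(rows), rows[0])  # 7 (1, 1.2, '2017-12-20')
```

Products 1 and 2 each start with the last pre-2018 price (1.2 and 11.0) followed by every change within the period.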

Calculate total time worked in a day with multiple stops and starts

I can use DATEDIFF to find the difference between one set of dates like this
DATEDIFF(MINUTE, @startdate, @enddate)
but how would I find the total time span between multiple sets of dates? I don't know how many sets (stops and starts) I will have.
The data is on multiple rows with start and stops.
ID TimeStamp StartOrStop TimeCode
----------------------------------------------------------------
1 2017-01-01 07:00:00 Start 1
2 2017-01-01 08:15:00 Stop 2
3 2017-01-01 10:00:00 Start 1
4 2017-01-01 11:00:00 Stop 2
5 2017-01-01 10:30:00 Start 1
6 2017-01-01 12:00:00 Stop 2
This code would work assuming that your table only stores data for one person, and the rows alternate Start/Stop/Start/Stop
WITH StartTime AS (
SELECT
TimeStamp
, ROW_NUMBER() OVER (ORDER BY TimeStamp) RowNum
FROM
<<table>>
WHERE
TimeCode = 1
), StopTime AS (
SELECT
TimeStamp
, ROW_NUMBER() OVER (ORDER BY TimeStamp) RowNum
FROM
<<table>>
WHERE
TimeCode = 2
)
SELECT
SUM (DATEDIFF( MINUTE, StartTime.TimeStamp, StopTime.TimeStamp )) As TotalTime
FROM
StartTime
JOIN StopTime ON StartTime.RowNum = StopTime.RowNum
This will work if your starts and stops are reliable. Your sample has two starts in a row by timestamp - the 10:00 and 10:30 starts. I assume in production you will have an employee id to group on, so I added one to the sample data in place of the identity column.
Also in production, the CTE sets would be reduced by filtering on a date parameter. If there are overnight shifts, you would want your stops CTE to use dateadd(day, 1, @startDate) as the upper bound when retrieving end dates.
Set up sample:
declare #temp table (
EmpId int,
TimeStamp datetime,
StartOrStop varchar(55),
TimeCode int
);
insert into #temp
values
(1, '2017-01-01 07:00:00', 'Start', 1),
(1, '2017-01-01 08:15:00', 'Stop', 2),
(1, '2017-01-01 10:00:00', 'Start', 1),
(1, '2017-01-01 11:00:00', 'Stop', 2),
(2, '2017-01-01 10:30:00', 'Start', 1),
(2, '2017-01-01 12:00:00', 'Stop', 2)
Query:
;with starts as (
select t.EmpId,
t.TimeStamp as StartTime,
row_number() over (partition by t.EmpId order by t.TimeStamp asc) as rn
from #temp t
where Timecode = 1 --Start time code?
),
stops as (
select t.EmpId,
t.TimeStamp as EndTime,
row_number() over (partition by t.EmpId order by t.TimeStamp asc) as rn
from #temp t
where Timecode = 2 --Stop time code?
)
select cast(min(sub.StartTime) as date) as WorkDay,
sub.EmpId as Employee,
min(sub.StartTime) as ClockIn,
min(sub.EndTime) as ClockOut,
sum(sub.MinutesWorked) as MinutesWorked
from
(
select strt.EmpId,
strt.StartTime,
stp.EndTime,
datediff(minute, strt.StartTime, stp.EndTime) as MinutesWorked
from starts strt
inner join stops stp
on strt.EmpId = stp.EmpId
and strt.rn = stp.rn
)sub
group by sub.EmpId
This works assuming your table has an incremental ID and interleaved start/stop records
--Data sample as provided
declare #temp table (
Id int,
TimeStamp datetime,
StartOrStop varchar(55),
TimeCode int
);
insert into #temp
values
(1, '2017-01-01 07:00:00', 'Start', 1),
(2, '2017-01-01 08:15:00', 'Stop', 2),
(3, '2017-01-01 10:00:00', 'Start', 1),
(4, '2017-01-01 11:00:00', 'Stop', 2),
(5, '2017-01-01 10:30:00', 'Start', 1),
(6, '2017-01-01 12:00:00', 'Stop', 2)
--let's see every pair start/stop and discard stop/start
select start.timestamp start, stop.timestamp stop,
datediff(mi,start.timestamp,stop.timestamp) minutes
from #temp start inner join #temp stop
on start.id+1= stop.id and start.timecode=1
--Sum all for required result
select sum(datediff(mi,start.timestamp,stop.timestamp) ) totalMinutes
from #temp start inner join #temp stop
on start.id+1= stop.id and start.timecode=1
Results
+-------------------------+-------------------------+---------+
| start | stop | minutes |
+-------------------------+-------------------------+---------+
| 2017-01-01 07:00:00.000 | 2017-01-01 08:15:00.000 | 75 |
| 2017-01-01 10:00:00.000 | 2017-01-01 11:00:00.000 | 60 |
| 2017-01-01 10:30:00.000 | 2017-01-01 12:00:00.000 | 90 |
+-------------------------+-------------------------+---------+
+--------------+
| totalMinutes |
+--------------+
| 225 |
+--------------+
Maybe the tricky part is the join clause. We need to join #temp with itself, offset by one ID; that is what on start.id+1 = stop.id does.
On the other hand, to exclude the stop/start pairs we use start.timecode=1. In case we don't have a column with this information, something like stop.id%2=0 works just fine.
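The id+1 pairing and the 225-minute total are easy to double-check in plain Python (DATEDIFF(mi, ...) is SQL Server's minute difference, reproduced here with datetime arithmetic on the sample rows):

```python
from datetime import datetime

rows = [  # (id, timestamp, start_or_stop, time_code) as in the sample data
    (1, '2017-01-01 07:00:00', 'Start', 1),
    (2, '2017-01-01 08:15:00', 'Stop', 2),
    (3, '2017-01-01 10:00:00', 'Start', 1),
    (4, '2017-01-01 11:00:00', 'Stop', 2),
    (5, '2017-01-01 10:30:00', 'Start', 1),
    (6, '2017-01-01 12:00:00', 'Stop', 2),
]
by_id = {r[0]: r for r in rows}
fmt = '%Y-%m-%d %H:%M:%S'

minutes = []
for rid, ts, _, code in rows:
    # pair each Start with the row at id+1, like "on start.id+1 = stop.id"
    if code == 1 and rid + 1 in by_id:
        stop_ts = by_id[rid + 1][1]
        delta = datetime.strptime(stop_ts, fmt) - datetime.strptime(ts, fmt)
        minutes.append(int(delta.total_seconds() // 60))

print(minutes, sum(minutes))  # [75, 60, 90] 225
```

This matches the results tables above: 75 + 60 + 90 = 225 total minutes.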