Group dataset two ways in PostgreSQL - sql

I have a large dataset that I need to group in two different ways. My hope is that I will be able to run a query once, so that I won't have to run two separate queries.
I guess this might be possible using ROLLUP or GROUPING SETS, but I must admit I do not fully understand how I can use them for this.
This is a basic example of what I'm trying to do. The two questions I'm trying to answer with one query are:
How many users have been created per company per day?
Which companies have created the most users during the entire period? The top 5 companies would be enough.
CREATE TABLE IF NOT EXISTS tmp_users (
id INTEGER NOT NULL,
name TEXT NOT NULL,
created TIMESTAMP NOT NULL,
companyid INTEGER NOT NULL
);
INSERT INTO tmp_users (id, name, created, companyid)
VALUES
(1, 'Lindsay', '2019-01-01', 1),
(2, 'Michael', '2019-01-02', 1),
(3, 'Stan', '2019-01-04', 3),
(4, 'Gob', '2019-01-04', 1),
(5, 'Buster', '2019-01-01', 1),
(6, 'Lucille', '2019-01-03', 2),
(7, 'Sally', '2019-01-05', 3);
-- Get users created per day, per company
SELECT
DATE_TRUNC('DAY', created) AS created,
companyid,
COUNT(*) AS numberofusers
FROM tmp_users
GROUP BY
DATE_TRUNC('DAY', created),
companyid
ORDER BY DATE_TRUNC('DAY', created) DESC;
-- Users per company, with filter
SELECT
companyid,
COUNT(*) AS numberofusers
FROM tmp_users
GROUP BY
companyid
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;

GROUPING SETS can be used to return multiple levels of aggregation in a single SELECT:
-- Users created per day per company, plus the top 5 companies overall
SELECT *
FROM
(
SELECT
DATE_TRUNC('DAY', created) AS created,
companyid,
Count(*) AS numberofusers,
Row_Number() -- instead of TOP n
Over (PARTITION BY CASE WHEN DATE_TRUNC('DAY', created) IS NULL THEN 0 ELSE 1 END
ORDER BY Count(*) DESC) AS rn
FROM tmp_users
GROUP BY GROUPING SETS
(
(DATE_TRUNC('DAY', created), companyid) -- daily data
,companyid -- company data
)
) AS dt
WHERE created IS NOT NULL -- all daily data
OR rn <= 5 -- plus the TOP 5 companies
ORDER BY created ASC NULLS FIRST;

Related

How to apply partition on the table to get first data of each month?

CREATE TABLE employee
(
joining_date date,
employee_type character varying,
name character varying ,
typeid integer
);
insert into employee VALUES
('2021-08-12','as','hjasghg', 1),
('2021-08-13', 'Rs', 'sa', 1),
('2021-08-14','asktyuk','hjasg', 1),
('2021-09-12','as','hjasghg', 1),
('2021-09-13', 'Rs', 'sa', 1),
('2021-09-14','asktyuk','hjasg', 1),
('2022-08-02','as','hjasghg', 2),
('2022-08-03','as','hjasghg', 2),
('2022-08-04', 'Rs', 'sa', 2),
('2022-08-05','asktyuk','hjasg', 2);
I want to get the first row (earliest joining_date) of each month for each typeid.
I have tried using a window function with PARTITION BY, but I can't seem to extract the correct data.
with cte_a as
(select row_number() over (partition by typeid order by joining_date asc) sno, * from employee)
select * from cte_a where sno = 1;
I expected results from date '2021-09-12','2021-08-12','2022-08-02', but I only got '2022-08-02','2021-08-12' in final result.
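You could use DISTINCT ON in combination with the date_trunc() function:
SELECT DISTINCT ON (typeid, date_trunc('month', joining_date))
*
FROM employee
ORDER BY typeid, date_trunc('month', joining_date), joining_date;
date_trunc('month', ...) "normalizes" the date to the first of its month (or cuts off the day part, if you like). date_part('month', ...) would only return the month number, so the same month in different years would fall into one group.
DISTINCT ON returns the first row of each ordered group, in this case the group of typeid and the normalized date. The trailing joining_date in the ORDER BY makes that first row the earliest date of the month.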
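The same "first row per typeid and month" result can also be sketched with ROW_NUMBER(), which is closer to the attempt in the question: the only change is that the month is added to the PARTITION BY. The snippet below is a minimal check, not the answer's own method. It runs the query in SQLite through Python's sqlite3 module (assuming a bundled SQLite of 3.25 or newer for window functions) and uses SQLite's strftime('%Y-%m', ...) as a stand-in for PostgreSQL's date_trunc('month', ...). The function name and the EMPLOYEES constant are illustrative only.

```python
import sqlite3

# Sample rows copied from the question.
EMPLOYEES = [
    ("2021-08-12", "as", "hjasghg", 1),
    ("2021-08-13", "Rs", "sa", 1),
    ("2021-08-14", "asktyuk", "hjasg", 1),
    ("2021-09-12", "as", "hjasghg", 1),
    ("2021-09-13", "Rs", "sa", 1),
    ("2021-09-14", "asktyuk", "hjasg", 1),
    ("2022-08-02", "as", "hjasghg", 2),
    ("2022-08-03", "as", "hjasghg", 2),
    ("2022-08-04", "Rs", "sa", 2),
    ("2022-08-05", "asktyuk", "hjasg", 2),
]


def first_row_per_month(rows):
    """Return the earliest row for each (typeid, year-month) pair."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE employee ("
        "joining_date TEXT, employee_type TEXT, name TEXT, typeid INTEGER)"
    )
    con.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT joining_date, employee_type, name, typeid
        FROM (
            SELECT e.*,
                   ROW_NUMBER() OVER (
                       -- partition by month as well as typeid
                       PARTITION BY typeid, strftime('%Y-%m', joining_date)
                       ORDER BY joining_date
                   ) AS sno
            FROM employee AS e
        )
        WHERE sno = 1
        ORDER BY typeid, joining_date
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in first_row_per_month(EMPLOYEES):
        print(row)
```

In PostgreSQL the partition expression would be date_trunc('month', joining_date) instead of the strftime call.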

How to select columns that aren't part of an aggregate query using HAVING SUM() in the WHERE and selecting only certain rows on db2

Using AS400 db2 for this.
I have a table of orders. From that table I have to:
Get all orders from a specified list of order IDs and type
Group by the user_id on those orders
Check to make sure the total order amount on the group is greater than $100
Return all orders that belong to a matching group, ungrouped, including order_id, which is not part of the GROUP BY
I got a bit stuck because the AS400 did not like that I was asking to select a field that wasn't part of the group, which I need.
I came up with this query, but it's slow.
-- Create a common temp table we can use in both places
WITH wantedOrders AS (
SELECT order_id FROM orders
WHERE
-- Only orders from the web
order_type = 'web'
-- And only orders that we want to get at this time
AND order_id IN
(
50,
20,
30
)
)
-- Our main select that gets all order information, even the non-grouped stuff
SELECT
t1.order_id,
t1.user_id,
t1.amount,
t2.total_amount,
t2.count
FROM orders AS t1
-- Join in the group data where we can do our query
JOIN (
SELECT
user_id,
SUM(amount) as total_amount,
COUNT(*) AS count
FROM
orders
-- Re use the temp table to get the order numbers
WHERE order_id IN (SELECT order_id FROM wantedOrders)
GROUP BY
user_id
HAVING SUM(amount)>100
) AS t2 ON t2.user_id=t1.user_id
-- Make sure we only use the order numbers
WHERE order_id IN (SELECT order_id FROM wantedOrders)
ORDER BY t1.user_id ASC;
What's the better way to write this query?
Try this:
WITH
wantedOrders (order_id) AS
(
VALUES 1, 2
)
, orders (order_id, user_id, amount) AS
(
VALUES
(1, 1, 50)
, (2, 1, 50)
, (1, 2, 60)
, (2, 2, 60)
, (3, 3, 200)
, (4, 3, 200)
)
-- Our main select that gets all order information, even the non-grouped stuff
SELECT *
FROM
(
SELECT
order_id,
user_id,
amount,
SUM (amount) OVER (PARTITION BY user_id) AS total_amount,
COUNT (*) OVER (PARTITION BY user_id) AS count
FROM orders t
WHERE EXISTS
(
SELECT 1
FROM wantedOrders w
WHERE w.order_id = t.order_id
)
) A
WHERE total_amount > 100
ORDER BY user_id ASC
ORDER_ID | USER_ID | AMOUNT | TOTAL_AMOUNT | COUNT
1 | 2 | 60 | 120 | 2
2 | 2 | 60 | 120 | 2
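The windowed approach above can be tried outside Db2 as well. The following is a minimal sketch that loads the answer's sample data into an in-memory SQLite database through Python's sqlite3 module (assuming SQLite 3.25 or newer for window functions) and applies the same SUM(...) OVER (PARTITION BY user_id) filter. The function name, the order_count alias, and the threshold parameter are illustrative and not part of the original answer.

```python
import sqlite3

# Sample data copied from the answer's VALUES lists.
ORDERS = [(1, 1, 50), (2, 1, 50), (1, 2, 60), (2, 2, 60), (3, 3, 200), (4, 3, 200)]
WANTED = [1, 2]


def orders_in_large_groups(orders, wanted, threshold=100):
    """Return wanted orders whose per-user total exceeds the threshold."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount INTEGER)"
    )
    con.execute("CREATE TABLE wantedOrders (order_id INTEGER)")
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    con.executemany("INSERT INTO wantedOrders VALUES (?)", [(w,) for w in wanted])
    query = """
        SELECT order_id, user_id, amount, total_amount, order_count
        FROM (
            SELECT t.order_id, t.user_id, t.amount,
                   -- per-user aggregates computed without collapsing rows
                   SUM(t.amount) OVER (PARTITION BY t.user_id) AS total_amount,
                   COUNT(*) OVER (PARTITION BY t.user_id) AS order_count
            FROM orders AS t
            WHERE EXISTS (
                SELECT 1 FROM wantedOrders AS w WHERE w.order_id = t.order_id
            )
        )
        WHERE total_amount > ?
        ORDER BY user_id, order_id
    """
    result = con.execute(query, (threshold,)).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in orders_in_large_groups(ORDERS, WANTED):
        print(row)
```

User 1 totals exactly 100 and is excluded by the strict comparison, and user 3 has no wanted orders, so only user 2's two rows remain.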
If order_id is the PK of the table, then just add the columns you need to the wantedOrders query and use it as your "base" (instead of using orders and refiltering it). You should end up joining wantedOrders with itself.
You can do:
select t.*
from orders t
join (
select user_id
from orders t
where order_id in (50, 20, 30)
group by user_id
having sum(amount) > 100
) s on s.user_id = t.user_id
The first table orders as t will produce the data you want. It will be filtered by the second "table expression" s that preselects the groups according to your logic.

postgresql: group by columns/windows function/min-max and complex query

Imagine I have invoices from two branches. I need to select the min and max invoice dates and show the branch id for each of those dates. If several branches have invoices on the min/max date, any of them will do.
CREATE TEMP TABLE invoice (
id int not null,
branch_id int not null,
c_date date not null,
PRIMARY KEY (id)
);
insert into invoice (id, branch_id, c_date) values
(1, 1, '2020-01-01')
,(2, 2, '2020-01-01')
,(3, 1, '2020-01-02')
,(4, 2, '2020-01-02')
,(5, 2, '2020-01-03');
The straightforward solution is below (the max part is omitted to keep the query simple).
select i2.branch_id, i2.c_date from (
select min(i1.id) minid
from (select min(i.c_date) mind, max(i.c_date) maxd
from invoice i
)a
join invoice i1
on a.mind=i1.c_date) b
join invoice i2 on b.minid=i2.id
The window function solution is a bit simpler, but still awkward. Please keep in mind that the actual query is more complex, and I provide only the core part.
select * from (
select a.branch_id, a.c_date from(
select *, rank() over (order by c_date) r from invoice i
) a where a.r=1
limit 1
) mn,
(select a.branch_id, a.c_date from(
select *, rank() over (order by c_date desc) r from invoice i
) a where a.r=1
limit 1
) mx
Any guesses on how to write the query more elegantly?
One method is a trick using arrays:
select min(c_date),
(array_agg(branch_id order by c_date))[1] as first_branch,
max(c_date),
(array_agg(branch_id order by c_date desc))[1] as last_branch
from invoice;
This does aggregate all values into an array, so you wouldn't want to use this if there are too many values in each result row.

Display duplicate row indicator and get only one row when duplicate

I built the schema at http://sqlfiddle.com/#!18/7e9e3
CREATE TABLE BoatOwners
(
BoatID INT,
OwnerDOB DATETIME,
Name VARCHAR(200)
);
INSERT INTO BoatOwners (BoatID, OwnerDOB,Name)
VALUES (1, '2021-04-06', 'Bob1'),
(1, '2020-04-06', 'Bob2'),
(1, '2019-04-06', 'Bob3'),
(2, '2012-04-06', 'Tom'),
(3, '2009-04-06', 'David'),
(4, '2006-04-06', 'Dale1'),
(4, '2009-04-06', 'Dale2'),
(4, '2013-04-06', 'Dale3');
I would like to write a query that would produce the following result characteristics :
Returns only one owner per boat
When multiple owners on a single boat, return the youngest owner.
Display a column to indicate if a boat has multiple owners.
The query should produce a result with these characteristics for the data set above.
I tried
ROW_NUMBER() OVER (PARTITION BY ....
but haven't had much luck so far.
with data as (
select BoatID, OwnerDOB, Name,
row_number() over (partition by BoatID order by OwnerDOB desc) as rn,
count(*) over (partition by BoatID) as cnt
from BoatOwners
)
select BoatID, OwnerDOB, Name,
case when cnt > 1 then 'Yes' else 'No' end as MultipleOwner
from data
where rn = 1
This is just a case of numbering the rows for each BoatId group and also counting the rows in each group, then filtering accordingly:
select BoatId, OwnerDob, Name, Iif(qty=1,'No','Yes') MultipleOwner
from (
select *, Row_Number() over(partition by boatid order by OwnerDOB desc)rn, Count(*) over(partition by boatid) qty
from BoatOwners
)b where rn=1
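To check the numbering-and-counting idea against the question's data, here is a minimal sketch that runs an equivalent query in SQLite through Python's sqlite3 module (assuming SQLite 3.25 or newer for window functions). It uses a portable CASE expression in place of SQL Server's Iif, and the function name and BOAT_OWNERS constant are illustrative only.

```python
import sqlite3

# Sample rows copied from the question.
BOAT_OWNERS = [
    (1, "2021-04-06", "Bob1"),
    (1, "2020-04-06", "Bob2"),
    (1, "2019-04-06", "Bob3"),
    (2, "2012-04-06", "Tom"),
    (3, "2009-04-06", "David"),
    (4, "2006-04-06", "Dale1"),
    (4, "2009-04-06", "Dale2"),
    (4, "2013-04-06", "Dale3"),
]


def youngest_owner_per_boat(rows):
    """Return one row per boat: the youngest owner plus a multi-owner flag."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE BoatOwners (BoatID INTEGER, OwnerDOB TEXT, Name TEXT)")
    con.executemany("INSERT INTO BoatOwners VALUES (?, ?, ?)", rows)
    query = """
        SELECT BoatID, OwnerDOB, Name,
               CASE WHEN cnt > 1 THEN 'Yes' ELSE 'No' END AS MultipleOwner
        FROM (
            SELECT BoatID, OwnerDOB, Name,
                   -- latest date of birth first, so rn = 1 is the youngest owner
                   ROW_NUMBER() OVER (PARTITION BY BoatID ORDER BY OwnerDOB DESC) AS rn,
                   -- number of owners sharing the boat
                   COUNT(*) OVER (PARTITION BY BoatID) AS cnt
            FROM BoatOwners
        )
        WHERE rn = 1
        ORDER BY BoatID
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in youngest_owner_per_boat(BOAT_OWNERS):
        print(row)
```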

SQL Running total grouped by ID

Using this query, I need to populate the NULL column with a running total for each row: the year-to-date sum of paid_amount within the calendar year. This running total should be grouped by member_id.
SELECT id=identity(int,1,1), cast(null as numeric(22,3)) as max_running_total, *
INTO #temp
FROM Customer_DB..Sales_Table
ORDER BY Date_Column asc
UPDATE #temp
SET max_running_total = (SELECT SUM(paid_amount)
FROM #temp
WHERE id <= id
GROUP BY member_id)
Since you have not given the schema, I have taken a sample schema and computed a rolling sum on it. You can use the same SQL window functions to achieve your results.
CREATE TABLE amt
(
id INT,
paid_amount DECIMAL,
running_total DECIMAL
)
insert INTO amt VALUES (1, 100, NULL), (2, 50, NULL), (3, 50, NULL)
SELECT id, paid_amount,
SUM(paid_amount) over(ORDER BY id ROWS BETWEEN unbounded preceding AND CURRENT ROW) AS running_total
FROM amt
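The question also asks for the total to restart per member_id and per calendar year, which the sample above does not cover. One way to sketch that is to add a PARTITION BY to the same window. The snippet below is only an illustration under stated assumptions: the sales table, its paid_date column, and the sample rows are invented for the example (the question does not show the real schema), and it runs in SQLite through Python's sqlite3 module (assuming SQLite 3.25 or newer), with strftime('%Y', ...) standing in for SQL Server's YEAR().

```python
import sqlite3

# Hypothetical rows: (member_id, paid_date, paid_amount).
SALES = [
    (1, "2023-01-10", 100),
    (1, "2023-02-15", 50),
    (2, "2023-01-20", 30),
    (1, "2024-01-05", 25),
    (2, "2023-03-01", 70),
]


def running_total_per_member(rows):
    """Return rows with a year-to-date running total per member."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE sales (member_id INTEGER, paid_date TEXT, paid_amount INTEGER)"
    )
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    query = """
        SELECT member_id, paid_date, paid_amount,
               SUM(paid_amount) OVER (
                   -- restart the total for each member and each calendar year
                   PARTITION BY member_id, strftime('%Y', paid_date)
                   ORDER BY paid_date
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
               ) AS running_total
        FROM sales
        ORDER BY member_id, paid_date
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in running_total_per_member(SALES):
        print(row)
```

In SQL Server the partition would likely read PARTITION BY member_id, YEAR(Date_Column), using the column names from the question.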