PostgreSQL: group by columns / window function / min-max and complex query

Imagine I have invoices from two branches. I need to select the min and max invoice date and, for each of those dates, show the branch id. If several branches have invoices on the min/max date, choose any of them.
CREATE TEMP TABLE invoice (
id int not null,
branch_id int not null,
c_date date not null,
PRIMARY KEY (id)
);
insert into invoice (id, branch_id, c_date) values
(1, 1, '2020-01-01')
,(2, 2, '2020-01-01')
,(3, 1, '2020-01-02')
,(4, 2, '2020-01-02')
,(5, 2, '2020-01-03');
The straightforward solution is this (I skip the max part so as not to overcomplicate the query):
select i2.branch_id, i2.c_date from (
select min(i1.id) minid
from (select min(i.c_date) mind, max(i.c_date) maxd
from invoice i
)a
join invoice i1
on a.mind=i1.c_date) b
join invoice i2 on b.minid=i2.id
A window function solution is a bit simpler, but awkward too. Please keep in mind that the actual query is more complex; I provide only the core part.
select * from (
select a.branch_id, a.c_date from(
select *, rank() over (order by c_date) r from invoice i
) a where a.r=1
limit 1
) mn,
(select a.branch_id, a.c_date from(
select *, rank() over (order by c_date desc) r from invoice i
) a where a.r=1
limit 1
) mx
Any guesses on how to write the query more elegantly?

One method is a trick using arrays:
select min(c_date),
(array_agg(branch_id order by c_date))[1] as first_branch,
max(c_date),
(array_agg(branch_id order by c_date desc))[1] as last_branch
from invoice;
This does aggregate all values into an array, so you wouldn't want to use this if there are too many values in each result row.
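With the sample rows above, that query should return something like the following (both branches have an invoice on the minimum date, so first_branch could come back as either 1 or 2, which the question allows):
min        | first_branch | max        | last_branch
-----------+--------------+------------+------------
2020-01-01 |            1 | 2020-01-03 |           2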

How to select columns that aren't part of an aggregate query using HAVING SUM(), while selecting only certain rows, on DB2

I'm using DB2 on the AS400 for this.
I have a table of orders. From that table I have to:
Get all orders from a specified list of order IDs and type
Group by the user_id on those orders
Check to make sure the total order amount on the group is greater than $100
Return all orders that matched the group, but ungrouped, including order_id, which is not part of the GROUP BY
I got a bit stuck because the AS400 did not like that I was selecting a field that wasn't part of the GROUP BY, which I need.
I came up with this query, but it's slow.
-- Create a common temp table we can use in both places
WITH wantedOrders AS (
SELECT order_id FROM orders
WHERE
-- Only orders from the web
order_type = 'web'
-- And only orders that we want to get at this time
AND order_id IN
(
50,
20,
30
)
)
-- Our main select that gets all order information, even the non-grouped stuff
SELECT
t1.order_id,
t1.user_id,
t1.amount,
t2.total_amount,
t2.count
FROM orders AS t1
-- Join in the group data where we can do our query
JOIN (
SELECT
user_id,
SUM(amount) as total_amount,
COUNT(*) AS count
FROM
orders
-- Re use the temp table to get the order numbers
WHERE order_id IN (SELECT order_id FROM wantedOrders)
GROUP BY
user_id
HAVING SUM(amount)>100
) AS t2 ON t2.user_id=t1.user_id
-- Make sure we only use the order numbers
WHERE order_id IN (SELECT order_id FROM wantedOrders)
ORDER BY t1.user_id ASC;
What's the better way to write this query?
Try this:
WITH
wantedOrders (order_id) AS
(
VALUES 1, 2
)
, orders (order_id, user_id, amount) AS
(
VALUES
(1, 1, 50)
, (2, 1, 50)
, (1, 2, 60)
, (2, 2, 60)
, (3, 3, 200)
, (4, 3, 200)
)
-- Our main select that gets all order information, even the non-grouped stuff
SELECT *
FROM
(
SELECT
order_id,
user_id,
amount,
SUM (amount) OVER (PARTITION BY user_id) AS total_amount,
COUNT (*) OVER (PARTITION BY user_id) AS count
FROM orders t
WHERE EXISTS
(
SELECT 1
FROM wantedOrders w
WHERE w.order_id = t.order_id
)
) A
WHERE total_amount > 100
ORDER BY user_id ASC
ORDER_ID | USER_ID | AMOUNT | TOTAL_AMOUNT | COUNT
---------+---------+--------+--------------+------
       1 |       2 |     60 |          120 |     2
       2 |       2 |     60 |          120 |     2
If order_id is the PK of the table, then just add the columns you need to the wantedOrders query and use it as your "base" (instead of using orders and re-filtering it). You should end up joining wantedOrders with itself.
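A minimal sketch of that suggestion, reusing the column names from the question (order_type, amount) and referencing the CTE twice, could look like this:
WITH wantedOrders AS (
    -- Pull every column the outer query needs, not just the key
    SELECT order_id, user_id, amount
    FROM orders
    WHERE order_type = 'web'
      AND order_id IN (50, 20, 30)
)
SELECT w.order_id, w.user_id, w.amount, g.total_amount, g.count
FROM wantedOrders AS w
JOIN (
    SELECT user_id, SUM(amount) AS total_amount, COUNT(*) AS count
    FROM wantedOrders
    GROUP BY user_id
    HAVING SUM(amount) > 100
) AS g ON g.user_id = w.user_id
ORDER BY w.user_id;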
You can do:
select t.*
from orders t
join (
select user_id
from orders t
where order_id in (50, 20, 30)
group by user_id
having sum(amount) > 100
) s on s.user_id = t.user_id
The first table, orders as t, produces the data you want. It is filtered by the second "table expression", s, which preselects the groups according to your logic.

BigQuery SQL: Sum of first N related items

I would like to know the sum of a value over the first n items in a related table. For example, I want to get the sum of a company's first 6 invoices (the invoices can be sorted by ID ascending).
Current SQL:
SELECT invoices.company_id, SUM(invoices.amount)
FROM invoices
JOIN companies on invoices.company_id = companies.id
GROUP BY invoices.company_id
This seems simple but I can't wrap my head around it.
Consider also the approach below:
select company_id, (
select sum(amount)
from t.amounts amount
) as top_six_invoices_amount
from (
select invoices.company_id,
array_agg(invoices.amount order by invoices.invoice_id limit 6) amounts
from your_table invoices
group by invoices.company_id
) t
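Here array_agg(... order by invoice_id limit 6) keeps only the first six invoice amounts per company, and the correlated subquery's FROM t.amounts amount implicitly unnests that array so its elements can be summed.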
You can assign row numbers to the lines within a partition, ordered by invoice id, and then filter on them, something like this:
with array_table as (
select 'a' field, * from unnest([3, 2, 1 ,4, 6, 3]) id
union all
select 'b' field, * from unnest([1, 2, 1, 7]) id
)
select field, sum(id) from (
select field, id, row_number() over (partition by a.field order by id desc) rownum
from array_table a
)
where rownum < 3
group by field
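Applied to the invoices table from the question, that row-number pattern would look roughly like this (invoice_id and amount are column names assumed from the question's description):
select company_id, sum(amount) as top_six_invoices_amount
from (
    select company_id,
           amount,
           row_number() over (partition by company_id order by invoice_id) as rn
    from invoices
)
where rn <= 6
group by company_id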
More examples of analytic functions here:
https://medium.com/#aliz_ai/analytic-functions-in-google-bigquery-part-1-basics-745d97958fe2
https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts

Display duplicate row indicator and get only one row when duplicate

I built the schema at http://sqlfiddle.com/#!18/7e9e3
CREATE TABLE BoatOwners
(
BoatID INT,
OwnerDOB DATETIME,
Name VARCHAR(200)
);
INSERT INTO BoatOwners (BoatID, OwnerDOB,Name)
VALUES (1, '2021-04-06', 'Bob1'),
(1, '2020-04-06', 'Bob2'),
(1, '2019-04-06', 'Bob3'),
(2, '2012-04-06', 'Tom'),
(3, '2009-04-06', 'David'),
(4, '2006-04-06', 'Dale1'),
(4, '2009-04-06', 'Dale2'),
(4, '2013-04-06', 'Dale3');
I would like to write a query that produces a result with the following characteristics:
Returns only one owner per boat
When there are multiple owners on a single boat, return the youngest owner.
Display a column to indicate if a boat has multiple owners.
So, applying that query to the data set above should produce the desired result.
I tried
ROW_NUMBER() OVER (PARTITION BY ....
but haven't had much luck so far.
with data as (
select BoatID, OwnerDOB, Name,
row_number() over (partition by BoatID order by OwnerDOB desc) as rn,
count(*) over (partition by BoatID) as cnt
from BoatOwners
)
select BoatID, OwnerDOB, Name,
case when cnt > 1 then 'Yes' else 'No' end as MultipleOwner
from data
where rn = 1
This is just a case of numbering the rows for each BoatId group and also counting the rows in each group, then filtering accordingly:
select BoatId, OwnerDob, Name, Iif(qty = 1, 'No', 'Yes') MultipleOwner
from (
    select *,
        Row_Number() over (partition by BoatId order by OwnerDOB desc) rn,
        Count(*) over (partition by BoatId) qty
    from BoatOwners
) b
where rn = 1
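With the sample data, either query should return the youngest owner (the most recent OwnerDOB) per boat, roughly:
BoatID | OwnerDOB   | Name  | MultipleOwner
-------+------------+-------+--------------
     1 | 2021-04-06 | Bob1  | Yes
     2 | 2012-04-06 | Tom   | No
     3 | 2009-04-06 | David | No
     4 | 2013-04-06 | Dale3 | Yes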

How to simplify this multiple-CTE solution to this SQL question?

DDL:
create table transactions
(
product_id int,
store_id int,
quantity int,
price numeric
);
DML:
insert into transactions values
(1, 1, 10, 2),
(2, 1, 5, 2),
(1, 2, 5, 4),
(2, 2, 2, 4),
(2, 3, 1, 20),
(1, 3, 1, 8),
(2, 4, 2, 10),
(1, 5, 2, 5),
(2, 5, 1, 3),
(2, 6, 4, 8);
I'm trying to find the top 3 products of the top 3 stores, both based on sale amount. The solution I have uses CTEs, as below:
with cte as
(
select store_id, rank_store
from
(select
*,
dense_rank() over(order by sale desc) as rank_store
from
(select
store_id, sum(quantity * price) as sale
from transactions
group by 1) t) t2
where
rank_store <= 3
),
cte2 as
(
select
a.store_id, a.product_id,
sum(a.quantity * a.price) as sale_store_product
from
transactions as a
join
cte as b on a.store_id = b.store_id
group by
1, 2
order by
1, 2
),
cte3 as
(
select
*,
dense_rank() over (partition by store_id order by sale_store_product desc) as rank_product
from
cte2
)
select *
from cte3
where rank_product <= 3;
Here is the expected result:
Basically, the first CTE gets the top 3 stores based on sale amount; I use the dense_rank() window function to handle ties. The second CTE gets those top stores' products and their total sale amounts. The last CTE uses dense_rank() again to rank the products in each store by sale amount, and the final query picks the top 3 products in each store.
I'm wondering if this can be improved, since three CTEs feels overly complicated. I'd appreciate any solutions or ideas. Thanks.
I'm trying to find the top 3 products of the top 3 stores
How can this be done without aggregating the data twice -- once for store/products and once for stores? This is possible using window functions along with aggregation:
select sp.*
from (select sp.*,
dense_rank() over (order by store_sales desc, store_id) as store_seqnum
from (select t.store_id, t.product_id,
sum(quantity * price) as sp_sales,
sum(sum(quantity * price)) over (partition by store_id) as store_sales,
row_number() over (partition by t.store_id order by sum(quantity * price) desc) as sp_seqnum
from transactions t
group by t.store_id, t.product_id
) sp
) sp
where store_seqnum <= 3 and sp_seqnum <= 3;
The inner subquery calculates the product/store information. The next level ranks the stores -- note that ties are broken using store_id.
Here is a db<>fiddle.
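If you want to keep the question's dense_rank() tie handling but still avoid three CTEs, a sketch with a single CTE (untested, using the same column names) could look like this:
with store_product as (
    select store_id,
           product_id,
           sum(quantity * price) as sale_store_product,
           -- total sales per store, computed over the grouped rows
           sum(sum(quantity * price)) over (partition by store_id) as sale_store
    from transactions
    group by store_id, product_id
)
select *
from (
    select *,
           dense_rank() over (order by sale_store desc) as rank_store,
           dense_rank() over (partition by store_id order by sale_store_product desc) as rank_product
    from store_product
) t
where rank_store <= 3
  and rank_product <= 3;
Because every row of a store carries the same sale_store value, the dense_rank over sale_store effectively ranks whole stores, and tied stores share a rank just as in the original query.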

Group dataset two ways in PostgreSQL

I have a large dataset that I need to group in two different ways. My hope is that I will be able to run a query once, so that I won't have to run two separate queries.
I guess this might be possible using ROLLUP or GROUPING SETS, but I must admit I do not fully understand how I can use them for this.
This is a basic example of what I'm trying to do. The two questions I'm trying to answer with one query is:
How many users have been created per company per day?
Which companies have created the most users during the entire period? The top 5 companies would be enough.
CREATE TABLE IF NOT EXISTS tmp_users (
id INTEGER NOT NULL,
name TEXT NOT NULL,
created TIMESTAMP NOT NULL,
companyid INTEGER NOT NULL
);
INSERT INTO tmp_users (id, name, created, companyid)
VALUES
(1, 'Lindsay', '2019-01-01', 1),
(2, 'Michael', '2019-01-02', 1),
(3, 'Stan', '2019-01-04', 3),
(4, 'Gob', '2019-01-04', 1),
(5, 'Buster', '2019-01-01', 1),
(6, 'Lucille', '2019-01-03', 2),
(7, 'Sally', '2019-01-05', 3);
-- Get users created per day, per company
SELECT
DATE_TRUNC('DAY', created) AS created,
companyid,
COUNT(*) AS numberofusers
FROM tmp_users
GROUP BY
DATE_TRUNC('DAY', created),
companyid
ORDER BY DATE_TRUNC('DAY', created) DESC;
-- Users per company, with filter
SELECT
companyid,
COUNT(*) AS numberofusers
FROM tmp_users
GROUP BY
companyid
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;
GROUPING SETS can be used to return multiple levels of aggregation in a single SELECT:
-- Get users created per day, per company
SELECT *
FROM
(
SELECT
DATE_TRUNC('DAY', created) AS created,
companyid,
Count(*) AS numberofusers,
Row_Number() -- instead of TOP n
Over (PARTITION BY CASE WHEN DATE_TRUNC('DAY', created) IS NULL THEN 0 ELSE 1 END
ORDER BY Count(*) DESC) AS rn
FROM tmp_users
GROUP BY GROUPING SETS
(
(DATE_TRUNC('DAY', created), companyid) -- daily data
,companyid -- company data
)
) AS dt
WHERE created IS NOT NULL -- all daily data
OR rn <= 5 -- plus the TOP 5 companies
ORDER BY created ASC NULLS FIRST;
See db<>fiddle
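With the sample data, the combined result should look roughly like this: the three company-level rows (where created is NULL) sort first, followed by the daily rows in date order. The rn values among tied counts can come back in any order.
created    | companyid | numberofusers | rn
-----------+-----------+---------------+----
(null)     |         1 |             4 |  1
(null)     |         3 |             2 |  2
(null)     |         2 |             1 |  3
2019-01-01 |         1 |             2 |  1
2019-01-02 |         1 |             1 |  2
2019-01-03 |         2 |             1 |  3
2019-01-04 |         1 |             1 |  4
2019-01-04 |         3 |             1 |  5
2019-01-05 |         3 |             1 |  6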