Select multiple distinct rows - SQL

I have a table with the following data.
id | country | serial | other_column
---+---------+--------+-------------
 1 | us      | 123    | 1
 2 | us      | 456    | 1
 3 | gb      | 123    | 1
 4 | gb      | 456    | 1
 5 | jp      | 777    | 1
 6 | jp      | 888    | 1
 7 | us      | 123    | 2
 8 | us      | 456    | 3
 9 | gb      | 456    | 4
10 | us      | 123    | 1
11 | us      | 123    | 1
Is there a way to fetch 2 rows per unique country and unique serial?
For example, I expect the following results from my query (us,123,1 appears twice because there were three of that kind and I want 2 rows per unique country and serial):
us,123,1
us,123,1
us,456,1
gb,123,1
gb,456,1
jp,777,1
jp,888,1
I can't use:
select distinct country, serial from my_table;
since I want 2 rows per distinct (country, serial) match. Please advise.

Assign DENSE_RANK and ROW_NUMBER to your data set using a CTE or subquery, then return rows with a ROW_NUMBER less than 3 and a DENSE_RANK equal to 1. Also, since you did not specify an ordering, I've added a custom ORDER BY that sorts Country to match your desired output above.
SELECT
    ID,
    Country,
    Serial,
    other_column
FROM
    (SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY Country, Serial ORDER BY Country, Serial, other_column) AS RN,
        DENSE_RANK() OVER (PARTITION BY Country, Serial ORDER BY Country, Serial, other_column) AS DR
    FROM my_table) A
WHERE RN < 3 AND DR = 1
ORDER BY CASE WHEN Country = 'us' THEN 1
              WHEN Country = 'gb' THEN 2
              WHEN Country = 'jp' THEN 3
              ELSE 4
         END ASC, Country, Serial, other_column, ID
Result:
| ID | Country | Serial | other_column |
|----|---------|--------|---------------|
| 1 | us | 123 | 1 |
| 10 | us | 123 | 1 |
| 2 | us | 456 | 1 |
| 3 | gb | 123 | 1 |
| 4 | gb | 456 | 1 |
| 5 | jp | 777 | 1 |
| 6 | jp | 888 | 1 |

Since you didn't specify any logic for which rows it should pick based on the other_column value, I am assuming it doesn't matter to you.
Having said that, my code will always pick two rows per unique country and serial, taking other_column in ascending order. For example, if you have 3 rows:
us, 123, 1
us, 123, 1
us, 123, 2
it will go for the first two, since other_column is ordered ASC; if you want it the other way around, you can change the order by inside the partition clause to DESC.
If there are fewer than 3 rows for the country and serial, it will just pick 1 row.
For example,
us, 456, 1
us, 456, 1
us, 123, 1
us, 123, 2
us, 123, 3
would result in:
us, 456, 1
us, 123, 1
us, 123, 2
with main as (
    select
        *,
        count(*) over (partition by country, serial) as total_occurence,
        row_number() over (partition by country, serial order by other_column) as rank_
    from <table_name>
),
conditions as (
    select *,
        case when total_occurence < 3 and rank_ = 1 then true
             when total_occurence >= 3 and rank_ in (1, 2) then true
             else false
        end as is_relevant
    from main
)
select * from conditions where is_relevant
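The same logic can also be collapsed into a single pass; a minimal sketch, assuming a dialect that lets you filter on window-function results from a derived table (<table_name> is a placeholder as above):

select id, country, serial, other_column
from (
    select *,
        count(*) over (partition by country, serial) as total_occurence,
        row_number() over (partition by country, serial order by other_column) as rank_
    from <table_name>
) x
-- always keep the first row; keep the second only when the group has 3 or more rows
where rank_ = 1 or (total_occurence >= 3 and rank_ = 2)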


Select row A if a condition satisfies else select row B for each group

We have 2 tables, bookings and docs
bookings
booking_id | name
100 | "Val1"
101 | "Val5"
102 | "Val6"
docs
doc_id | booking_id | doc_type_id
6 | 100 | 1
7 | 100 | 2
8 | 101 | 1
9 | 101 | 2
10 | 101 | 2
We need the result like this:
booking_id | doc_id
100 | 7
101 | 10
Essentially, we are trying to get the latest record of doc per booking, but if doc_type_id 2 is present, select the latest record of doc type 2 else select latest record of doc_type_id 1.
Is this possible to achieve with a performance-friendly query, as we need to apply this within a very large query?
You can do it with the FIRST_VALUE() window function by sorting the rows within each booking_id so that rows with doc_type_id = 2 come first:
SELECT DISTINCT booking_id,
       FIRST_VALUE(doc_id) OVER (PARTITION BY booking_id
                                 ORDER BY doc_type_id = 2 DESC, doc_id DESC) AS doc_id
FROM docs;
If you want full rows returned then you could use ROW_NUMBER() window function:
SELECT booking_id, doc_id, doc_type_id
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY booking_id
                              ORDER BY doc_type_id = 2 DESC, doc_id DESC) AS rn
    FROM docs
) t
WHERE rn = 1;
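Ordering directly on the boolean doc_type_id = 2 works in dialects that allow boolean expressions in ORDER BY; where it isn't allowed, a CASE expression conveys the same priority. A sketch against the same docs table:

SELECT booking_id, doc_id, doc_type_id
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY booking_id
                              ORDER BY CASE WHEN doc_type_id = 2 THEN 0 ELSE 1 END,
                                       doc_id DESC) AS rn
    FROM docs
) t
WHERE rn = 1;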

A running summary of totals in SQL Server

I've come up against an issue where I want to summarize results in a query.
Example as follows:
NAME | FRUIT | PRICE
-----+-------+------
JOHN | APPLE | 2
JOHN | APPLE | 2
JOHN | APPLE | 2
JOHN | APPLE | 2
DAVE | GRAPE | 3
DAVE | GRAPE | 3
DAVE | GRAPE | 3
This is my table at the moment; what I need, though, is a summary of John's business, like below:
NAME | FRUIT | PRICE
-----+-------+------
JOHN | APPLE | 2
JOHN | APPLE | 2
JOHN | APPLE | 2
JOHN | APPLE | 2
JOHN | TOTAL | 8
DAVE | GRAPE | 3
DAVE | GRAPE | 3
DAVE | GRAPE | 3
DAVE | TOTAL | 9
I have tried grouping the information, but it does not reflect what I want. Also, if John were to have a different fruit, it would need to sum that up before it sums up the next part, and it needs a total for every value in the NAME field, as there will be a number of customers.
Any advice would be great
EDIT
I have tried using ROLLUP, but I keep getting totals of all values in a separate column, whereas I would like to see it formatted as above.
A solution with UNION ALL and GROUP BY:
;WITH PricesWithTotals AS
(
    SELECT
        Name,
        Fruit,
        Price
    FROM
        YourTable

    UNION ALL

    SELECT
        Name,
        Fruit = 'TOTAL',
        Price = SUM(Price)
    FROM
        YourTable
    GROUP BY
        Name
)
SELECT
    Name,
    Fruit,
    Price
FROM
    PricesWithTotals
ORDER BY
    Name,
    CASE WHEN Fruit <> 'TOTAL' THEN 1 ELSE 999 END ASC,
    Fruit
This will get you a running total per customer per fruit:
create table #Sales([Name] varchar(20), Fruit varchar(20), Price int)

insert into #Sales([Name], Fruit, Price)
values
    ('JOHN', 'APPLE', 2),
    ('JOHN', 'APPLE', 2),
    ('JOHN', 'APPLE', 2),
    ('JOHN', 'APPLE', 2),
    ('DAVE', 'GRAPE', 3),
    ('DAVE', 'GRAPE', 3),
    ('DAVE', 'GRAPE', 3)

select c.*,
       SUM(Price) OVER (PARTITION BY c.[Name], c.[Fruit]
                        ORDER BY c.[Name], c.[Fruit]
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as RunningTotal
from #Sales c
order by c.[Name], c.[Fruit] asc

drop table #Sales
Output (reconstructed from the sample data; the original answer showed a screenshot):

NAME | FRUIT | PRICE | RunningTotal
-----+-------+-------+-------------
DAVE | GRAPE | 3     | 3
DAVE | GRAPE | 3     | 6
DAVE | GRAPE | 3     | 9
JOHN | APPLE | 2     | 2
JOHN | APPLE | 2     | 4
JOHN | APPLE | 2     | 6
JOHN | APPLE | 2     | 8
The solution to your problem is GROUPING SETS. However, your rows are not unique, so this adds a unique value (seqnum) just so you can keep your original rows:
with t as (
      select t.*, row_number() over (order by (select null)) as seqnum
      from t
     )
select name,
       coalesce(fruit, 'TOTAL') as fruit,
       sum(price) as price
from t
group by grouping sets ((name, fruit, seqnum), (name))
order by name,
         (case when fruit is not null then 1 else 2 end);
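Since the question mentions trying ROLLUP: the same grouping sets can be expressed with ROLLUP by treating (fruit, seqnum) as a single composite element, so the only subtotal produced is the per-name one. A sketch reusing the same seqnum trick:

with t as (
      select t.*, row_number() over (order by (select null)) as seqnum
      from t
     )
select name,
       coalesce(fruit, 'TOTAL') as fruit,
       sum(price) as price
from t
group by name, rollup((fruit, seqnum))
order by name,
         (case when fruit is not null then 1 else 2 end);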

Postgres: Deleting rows that are duplicated in one column based on the conditions of another column

I have a PostgreSQL table that stores user details called users as shown below
ID | user name | item | dos        | Charge
---+-----------+------+------------+-------
1  | Ed        | 32   | 01-02-1987 |  1
2  | Taya      | 01   | 05-07-1981 | -1
3  | Damian    | 32   | 22-19-1990 |  1
2  | Taya      | 01   | 05-07-1981 |  1
2  | Taya      | 01   | 05-07-1981 |  1
1  | Ed        | 32   | 01-02-1987 | -1
I want to delete rows that are the same across id, user name, item, and dos and whose charges sum to 0. This means both row 1 and row 6 for Ed get deleted.
With more than 2 occurrences, if the sum of charge is 1, I want one pair of rows with charges -1 and 1 deleted, which means one row with charge 1 will be retained. E.g. for Taya, row 2 and one of rows 4/5 will be deleted.
The output table that I am after is:
ID | user name | item | dos        | Charge
---+-----------+------+------------+-------
3  | Damian    | 32   | 22-19-1990 |  1
2  | Taya      | 01   | 05-07-1981 |  1
Any ideas?
You want the HAVING clause. This will get you the output you want:
select id, user_name, item, dos, sum(charge)
from table
group by id, user_name, item, dos
having sum(charge) != 0
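Against the sample data, this returns Damian's group and Taya's group (each summing to 1), while Ed's group sums to 0 and drops out.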
If you're really trying to delete the records that make it zero:
delete from table
where (id, user_name, item, dos) in (
    select id, user_name, item, dos
    from table
    group by id, user_name, item, dos
    having sum(charge) = 0
)
This does the same thing, and is quite a bit more code, but because it's using a semi-join it might be better for really large datasets:
with delete_me as (
    select id, user_name, item, dos
    from table
    group by id, user_name, item, dos
    having sum(charge) = 0
)
delete from table t
where exists (
    select null
    from delete_me d
    where t.id = d.id
      and t.user_name = d.user_name
      and t.item = d.item
      and t.dos = d.dos
)
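Note that neither DELETE above touches the Taya case, since her group sums to 1 rather than 0. If you also want to cancel matched +1/-1 pairs within a group (keeping the surplus rows), here is a hypothetical sketch, not from the original answers, that pairs rows by row_number and uses Postgres's ctid to address otherwise-identical rows; it assumes charge is only ever 1 or -1:

with numbered as (
    select ctid, id, user_name, item, dos, charge,
           row_number() over (partition by id, user_name, item, dos, charge
                              order by ctid) as rn
    from users
),
doomed as (
    -- pair the n-th +1 row with the n-th -1 row of the same group;
    -- rows without a partner survive
    select p.ctid as pos_ctid, n.ctid as neg_ctid
    from numbered p
    join numbered n
      on  p.id = n.id
      and p.user_name = n.user_name
      and p.item = n.item
      and p.dos = n.dos
      and p.rn = n.rn
    where p.charge = 1
      and n.charge = -1
)
delete from users
where ctid in (select pos_ctid from doomed
               union all
               select neg_ctid from doomed);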

SQL: order by, then select first row with distinct value for multiple columns

As a simplified example, I need to select each instance where a customer had a shipping address that was different from their previous shipping address. So I have a large table with columns such as:
purchase_id | cust_id | date | address | description
-----------------------------------------------------------
1 | 5 | jan | address1 | desc1
2 | 6 | jan | address2 | desc2
3 | 5 | feb | address1 | desc3
4 | 6 | feb | address2 | desc4
5 | 5 | mar | address3 | desc5
6 | 5 | mar | address3 | desc6
7 | 5 | apr | address1 | desc7
8 | 6 | may | address4 | desc8
Note that customers can "move back" to a previous address as customer 5 did in row 7.
What I want to select (as efficiently as possible, since this is a quite large table) is the first row out of every 'block' wherein a customer had subsequent orders shipped to the same address. In this example that would be rows 1, 2, 5, 7, and 8. In all the others, the customer has the same address as their previous order.
So effectively I want to first ORDER BY (cust_id, date), then SELECT purchase_id, cust_id, min(date), address, description.
However, I'm having trouble because SQL usually requires GROUP BY to be done before ORDER BY. I therefore can't figure out how to adapt, e.g., either of the top answers to this question (which I otherwise quite like). It is necessary (conceptually, at least) to order by date before grouping or using aggregate functions like min(); otherwise I would miss instances like row 7 in my example table, where a customer 'moved back' to a previous address.
Note also that two customers can share an address, so I need to effectively group by both cust_id and address after ordering by date.
I'm using Snowflake, which I believe has most of the same commands available as recent versions of PostgreSQL and SQL Server (although I'm fairly new to Snowflake, so I'm not completely sure).
Sorry for a late reply. I meant to react to this post a few days ago.
The "most proper" way I can think of is to use the LAG function.
Take this:
select purchase_id, cust_id, address,
       lag(address, 1) over (partition by cust_id order by purchase_id) as prev_address
from x
order by cust_id, purchase_id;
-------------+---------+----------+--------------+
PURCHASE_ID | CUST_ID | ADDRESS | PREV_ADDRESS |
-------------+---------+----------+--------------+
1 | 5 | address1 | [NULL] |
3 | 5 | address1 | address1 |
5 | 5 | address3 | address1 |
6 | 5 | address3 | address3 |
7 | 5 | address1 | address3 |
2 | 6 | address2 | [NULL] |
4 | 6 | address2 | address2 |
8 | 6 | address4 | address2 |
-------------+---------+----------+--------------+
And then you can easily detect the rows with the events you described:
select purchase_id, cust_id, address, prev_address
from (
    select purchase_id, cust_id, address,
           lag(address, 1) over (partition by cust_id order by purchase_id) as prev_address
    from x
) sub
where not equal_null(address, prev_address)
order by cust_id, purchase_id;
-------------+---------+----------+--------------+
PURCHASE_ID | CUST_ID | ADDRESS | PREV_ADDRESS |
-------------+---------+----------+--------------+
1 | 5 | address1 | [NULL] |
5 | 5 | address3 | address1 |
7 | 5 | address1 | address3 |
2 | 6 | address2 | [NULL] |
8 | 6 | address4 | address2 |
-------------+---------+----------+--------------+
Note that I'm using the EQUAL_NULL function to get NULL = NULL semantics.
Note also that the LAG function can be computationally intensive (though comparable with the ROW_NUMBER approach proposed in another answer).
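A quick illustration of the NULL semantics EQUAL_NULL provides in Snowflake:

select equal_null(null, null) as en,  -- TRUE: NULL-safe comparison
       null = null            as eq;  -- NULL: plain equality is unknown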
You can use the row_number window function to do the trick:
;with cte as (
    select *,
           row_number() over (partition by cust_id, address
                              order by purchase_id) as rn
    from table
)
select * from cte
where rn = 1
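Note, though, that partitioning by (cust_id, address) alone misses the 'moved back' case the question calls out: purchase 7 returns to address1, but rn = 1 for customer 5 at address1 was already claimed by purchase 1.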
Snowflake has introduced CONDITIONAL_CHANGE_EVENT, which ideally solves the described case:
Returns a window event number for each row within a window partition when the value of the argument expr1 in the current row is different from the value of expr1 in the previous row. The window event number starts from 0 and is incremented by 1 to indicate the number of changes so far within that window
Data preparation:
CREATE OR REPLACE TABLE t(purchase_id INT, cust_id INT,
date DATE, address TEXT, description TEXT);
INSERT INTO t(purchase_id, cust_id, date, address, description)
VALUES
( 1, 5, '2021-01-01'::DATE ,'address1','desc1')
,( 2, 6, '2021-01-01'::DATE ,'address2','desc2')
,( 3, 5, '2021-02-01'::DATE ,'address1','desc3')
,( 4, 6, '2021-02-01'::DATE ,'address2','desc4')
,( 5, 5, '2021-03-01'::DATE ,'address3','desc5')
,( 6, 5, '2021-03-01'::DATE ,'address3','desc6')
,( 7, 5, '2021-04-01'::DATE ,'address1','desc7')
,( 8, 6, '2021-05-01'::DATE ,'address4','desc8');
Query:
SELECT *,
CONDITIONAL_CHANGE_EVENT(address) OVER (PARTITION BY CUST_ID ORDER BY DATE) AS CCE
FROM t
ORDER BY purchase_id;
Once the subgroup column CCE is identified, QUALIFY can be used to find the first row per (CUST_ID, CCE).
Full query:
WITH cte AS (
SELECT *,
CONDITIONAL_CHANGE_EVENT(address) OVER (PARTITION BY CUST_ID ORDER BY DATE) AS CCE
FROM t
)
SELECT *
FROM cte
QUALIFY ROW_NUMBER() OVER(PARTITION BY CUST_ID, CCE ORDER BY DATE) = 1
ORDER BY purchase_id;
Output (reconstructed; the original answer showed a screenshot. Purchases 5 and 6 tie on date, so the tie-break between them is arbitrary):

PURCHASE_ID | CUST_ID | DATE       | ADDRESS  | DESCRIPTION | CCE
------------+---------+------------+----------+-------------+----
1           | 5       | 2021-01-01 | address1 | desc1       | 0
2           | 6       | 2021-01-01 | address2 | desc2       | 0
5           | 5       | 2021-03-01 | address3 | desc5       | 1
7           | 5       | 2021-04-01 | address1 | desc7       | 2
8           | 6       | 2021-05-01 | address4 | desc8       | 1
This would probably be best solved by a subquery to get the first purchase for each user, then using IN to filter rows based on that result.
To clarify, purchase_id is an autoincrement column, correct? If so, a purchase with a higher purchase_id must have been created at a later date, and the following should suffice:
SELECT *
FROM purchases
WHERE purchase_id IN (
SELECT MIN(purchase_id) AS first_purchase_id
FROM purchases
GROUP BY cust_id
)
If you only want the first purchase for customers with more than one address, add a HAVING clause to your subquery:
SELECT *
FROM purchases
WHERE purchase_id IN (
SELECT MIN(purchase_id) AS first_purchase_id
FROM purchases
GROUP BY cust_id
HAVING COUNT(DISTINCT address) > 1
)
Fiddle: http://sqlfiddle.com/#!9/12d75/6
However, if purchase_id is NOT an autoincrement column, then SELECT on both cust_id and min(date) on your subquery and use an INNER JOIN on cust_id and min(date):
SELECT *
FROM purchases
INNER JOIN (
SELECT cust_id, MIN(date) AS min_date
FROM purchases
GROUP BY cust_id
HAVING COUNT(DISTINCT address) > 1
) cust_purchase_date
ON purchases.cust_id = cust_purchase_date.cust_id AND purchases.date = cust_purchase_date.min_date
The first query example will probably be faster, however, so use that if your purchase_id is an autoincrement column.
Yet more late options/opinions:
Given this is edge detection, LAG/LEAD (depending on which edge you are looking for) is the simplest tool.
Marcin's LAG option can be moved from a sub-select to a first-level option with QUALIFY.
Where the NOT and EQUAL_NULL add value: for the first row per customer, LAG returns NULL, and EQUAL_NULL(address, NULL) is false rather than NULL, so flipping it with NOT gives true and the row is kept. The NULL-safe compare catches that nicely.
SELECT *
FROM data_table
QUALIFY not equal_null(address, lag(address) over(partition by cust_id order by purchase_id))
ORDER BY 1
giving:
PURCHASE_ID | CUST_ID | DATE       | ADDRESS  | DESCRIPTION
------------+---------+------------+----------+------------
1           | 5       | 2021-01-01 | address1 | desc1
2           | 6       | 2021-01-01 | address2 | desc2
5           | 5       | 2021-03-01 | address3 | desc5
7           | 5       | 2021-04-01 | address1 | desc7
8           | 6       | 2021-05-01 | address4 | desc8
Lukasz's CONDITIONAL_CHANGE_EVENT is a very nice solution, but CONDITIONAL_CHANGE_EVENT is not just finding a change edge, it is enumerating them; so if you were looking for the 5th change, or such, CONDITIONAL_CHANGE_EVENT saves you having to chain a LAG/LEAD with a ROW_NUMBER(). As such, you cannot collapse that solution into a single block like:
ROW_NUMBER() OVER(PARTITION BY CUST_ID, CONDITIONAL_CHANGE_EVENT(address) OVER (PARTITION BY CUST_ID ORDER BY DATE) ORDER BY DATE) = 1
because the implicit row_number inside CONDITIONAL_CHANGE_EVENT generates the error:
Window function x may not be nested inside another window function.
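For reference, the chained two-step emulation looks like this; a sketch in Snowflake syntax (reusing EQUAL_NULL), flagging edges with LAG in one CTE and then running-summing the flags in a second, since window functions cannot be nested. The counter starts at 1 rather than 0 because the first row per customer counts as a change:

WITH edges AS (
    SELECT *,
           CASE WHEN equal_null(address,
                     LAG(address) OVER (PARTITION BY cust_id ORDER BY date))
                THEN 0 ELSE 1 END AS is_change
    FROM t
),
numbered AS (
    SELECT *,
           SUM(is_change) OVER (PARTITION BY cust_id ORDER BY date
                                ROWS UNBOUNDED PRECEDING) AS cce
    FROM edges
)
SELECT * FROM numbered
ORDER BY purchase_id;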

SQL AVG(COUNT(*))?

I'm trying to find out the average number of times a value appears in a column, group it based on another column and then perform a calculation on it.
I have 3 tables, a little like this:
DVD
ID | NAME
1 | 1
2 | 1
3 | 2
4 | 3
COPY
ID | DVDID
1 | 1
2 | 1
3 | 2
4 | 3
5 | 1
LOAN
ID | DVDID | COPYID
1 | 1 | 1
2 | 1 | 2
3 | 2 | 3
4 | 3 | 4
5 | 1 | 5
6 | 1 | 5
7 | 1 | 5
8 | 1 | 2
etc
Basically, I'm trying to find all the copy ids that appear in the loan table LESS times than the average number of times for all copies of that DVD.
So in the example above, copy 5 of dvd 1 appears 3 times, copy 2 twice, and copy 1 once, so the average for that DVD is 2. I want to list all the copies of that (and each other) dvd that appear less than that number in the Loan table.
I hope that makes a bit more sense...
Thanks
Similar to dotjoe's solution, but using an analytic function to avoid the extra join. May be more or less efficient.
with
loan_copy_total as
(
    select dvdid, copyid, count(*) as cnt
    from loan
    group by dvdid, copyid
),
loan_copy_avg as
(
    select dvdid, copyid, cnt,
           avg(cnt) over (partition by dvdid) as copy_avg
    from loan_copy_total
)
select *
from loan_copy_avg lca
where cnt < copy_avg;
This should work in Oracle:
create view dvd_count_view as
select dvdid, count(1) as howmanytimes
from loans
group by dvdid;

select avg(howmanytimes) from dvd_count_view;
Untested...
with
loan_copy_total as
(
    select dvdid, copyid, count(*) as cnt
    from loan
    group by dvdid, copyid
),
loan_copy_avg as
(
    select dvdid, avg(cnt) as copy_avg
    from loan_copy_total
    group by dvdid
)
select lct.*, lca.copy_avg
from loan_copy_avg lca
inner join loan_copy_total lct
    on lca.dvdid = lct.dvdid
   and lct.cnt < lca.copy_avg;
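Sanity check against the sample data: DVD 1's copies were loaned 1, 2, and 3 times (average 2), so a strict less-than returns only copy 1. DVDs 2 and 3 each have a single loaned copy whose count equals its own average, so they are excluded.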