Select distinct rows based on some, but not all columns - sql

I originally ran into this problem while working on SQL queries that select certain aggregate values (min, max etc) from grouped results. For example, select the cheapest fruit, its variety and the price, off each fruit group. The common solution is to first group the fruits along with the cheapest price using MIN, then self join it to get the other column ("variety" in this case).
Now suppose more than one variety of a fruit shares the same price, and that price happens to be the lowest. We then get results like this:
Apple Fuji 5.00
Apple Green 5.00
Orange Valencia 3.00
Pear Bradford 6.00
How do I make it so that only one kind of apple shows up in the final result? It can be any one of the varieties, be it the record that shows up the first, last or random.
So basically I need to eliminate rows based on two of the three columns being equal, and it doesn't matter which rows get eliminated as long as there is one left in the final result set.
Any help would be appreciated.

Try this... I added more fruits. The way to read it is to start from the inner most From clause and work your way out.
create table fruit (
FruitName varchar(50) not null,
FruitVariety varchar(50) not null,
Price decimal(10,2) not null
)
insert into fruit (FruitName, FruitVariety, Price)
values ('Apple','Fuji',5.00)
insert into fruit (FruitName, FruitVariety, Price)
values ('Apple','Green',5.00)
insert into fruit (FruitName, FruitVariety, Price)
values ('Orange','Valencia',3.00)
insert into fruit (FruitName, FruitVariety, Price)
values ('Orange','Navel',5.00)
insert into fruit (FruitName, FruitVariety, Price)
values ('Pear','Bradford',6.00)
insert into fruit (FruitName, FruitVariety, Price)
values ('Pear','Nashi',8.00)
select
rankedCheapFruits.FruitName,
rankedCheapFruits.FruitVariety,
rankedCheapFruits.Price
from (
select
f.FruitName,
f.FruitVariety,
f.Price,
row_number() over(
partition by f.FruitName
order by f.FruitName, f.FruitVariety
) as FruitRank
from (
select
f.FruitName,
min(f.Price) as LowestPrice
from Fruit f
group by
f.FruitName
) as cheapFruits
join Fruit f on cheapFruits.FruitName = f.FruitName
and f.Price = cheapFruits.LowestPrice
) rankedCheapFruits
where rankedCheapFruits.FruitRank = 1

You could also apply MIN to the variety column, which would limit each fruit to a single row (the alphabetically first variety among the cheapest ones).
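A minimal sketch of that idea, assuming SQLite driven through Python's sqlite3 module rather than the asker's database. The table mirrors the one in the answer above, and taking MIN of FruitVariety is only one reading of the suggestion:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fruit (FruitName TEXT, FruitVariety TEXT, Price REAL);
    INSERT INTO fruit VALUES
        ('Apple',  'Fuji',     5.00), ('Apple',  'Green', 5.00),
        ('Orange', 'Valencia', 3.00), ('Orange', 'Navel', 5.00),
        ('Pear',   'Bradford', 6.00), ('Pear',   'Nashi', 8.00);
""")

# Join each fruit to its lowest price, then collapse ties on that price
# by keeping the alphabetically first variety.
min_rows = conn.execute("""
    SELECT f.FruitName, MIN(f.FruitVariety) AS FruitVariety, f.Price
    FROM fruit f
    JOIN (SELECT FruitName, MIN(Price) AS LowestPrice
          FROM fruit
          GROUP BY FruitName) c
      ON c.FruitName = f.FruitName AND c.LowestPrice = f.Price
    GROUP BY f.FruitName, f.Price
    ORDER BY f.FruitName
""").fetchall()

for row in min_rows:
    print(row)
```

This only works cleanly because a single column is left over after grouping; with several leftover columns, a separate MIN per column could mix values from different rows.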

One option is to rank the rows based on some criteria (alphabetical order of fruit variety) and then pick the minimum of the rank.
There is a rank() function in ms-sql for exactly this purpose.
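As a rough illustration of the ranking idea, here is a sketch that assumes SQLite 3.25 or later (for window functions) through Python's sqlite3 module instead of SQL Server. The table and column names follow the earlier answer; ordering by price first and variety second is my assumption about the intended criteria:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fruit (FruitName TEXT, FruitVariety TEXT, Price REAL);
    INSERT INTO fruit VALUES
        ('Apple',  'Fuji',     5.00), ('Apple',  'Green', 5.00),
        ('Orange', 'Valencia', 3.00), ('Orange', 'Navel', 5.00),
        ('Pear',   'Bradford', 6.00), ('Pear',   'Nashi', 8.00);
""")

# Rank the varieties of each fruit: cheapest first, ties broken alphabetically.
ranked_rows = conn.execute("""
    SELECT FruitName, FruitVariety, Price
    FROM (
        SELECT FruitName, FruitVariety, Price,
               RANK() OVER (PARTITION BY FruitName
                            ORDER BY Price, FruitVariety) AS rnk
        FROM fruit
    )
    WHERE rnk = 1
    ORDER BY FruitName
""").fetchall()

for row in ranked_rows:
    print(row)
```

Note that RANK() still returns several rows if two rows agree on every ordering column; ROW_NUMBER() never does, which is why the longer answer uses it.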

Related

Effective way of locating top ranked rows on Oracle DB

I have a large table (millions of records) and I need to write an efficient select statement.
The table looks like this:
create table tab1 (
pt_key number
, cp_key number
, ext_info varchar2(10)
, resp_nm varchar2(20)
, resp_dttm date
, rank number
);
Sample records:
insert into tab1 values (1,1,'info1','OK', to_date('01.03.18 17:00:00','DD.MM.RR HH24:MI:SS'),1);
insert into tab1 values (1,1,'info2','FAILED', to_date('01.03.18 17:00:00','DD.MM.RR HH24:MI:SS'),2);
insert into tab1 values (1,1,'info3','SENT', to_date('01.03.18 17:00:00','DD.MM.RR HH24:MI:SS'),3);
insert into tab1 values (1,1,'info4','SENT', to_date('02.03.18 17:00:00','DD.MM.RR HH24:MI:SS'),3);
insert into tab1 values (1,2,'info5','OK', to_date('05.03.18 17:00:00','DD.MM.RR HH24:MI:SS'),1);
insert into tab1 values (1,2,'info6','OK', to_date('06.03.18 17:00:00','DD.MM.RR HH24:MI:SS'),1);
insert into tab1 values (1,2,'info7','FAILED', to_date('01.03.18 17:00:00','DD.MM.RR HH24:MI:SS'),2);
I would like the query to return, for each combination of pt_key and cp_key (part of the composite primary key; the other columns are not indexed), the record with the highest rank. If several records in a combination share the highest rank, pick the one with the greatest resp_dttm.
The select statement should return only the first four columns.
For the above posted sample data the desired result would be:
1 1 info4 SENT
1 2 info7 FAILED
Thanks for help.
Here's one approach using row_number():
select pt_key, cp_key, ext_info, resp_nm
from (
select t1.*, row_number() over (partition by pt_key, cp_key
order by rank desc, resp_dttm desc) rn
from tab1 t1
) t
where rn = 1
Here's another approach using FIRST aggregate function:
select pt_key,
cp_key,
max(ext_info) keep (dense_rank first order by t.rank desc, t.resp_dttm desc) as ext_info,
max(resp_nm) keep (dense_rank first order by t.rank desc, t.resp_dttm desc) as resp_nm
from tab1 t
group by pt_key, cp_key
Here's how it works on Oracle Live SQL
EDIT 2:
Result:
PT_KEY | CP_KEY | EXT_INFO | RESP_NM
--------+--------+----------+---------
1 | 1 | info4 | SENT
1 | 2 | info7 | FAILED
EDIT 1:
This solution has an important drawback: if, for a certain combination of pt_key and cp_key, there are multiple rows with the same rank and resp_dttm values, it will "combine" those rows and calculate the aggregates for ext_info and resp_nm across them (in my example it takes the max value of each).
You can refine that behavior by adding a tertiary sort criterion to make the ranking distinct (e.g. add all other columns from the primary key).
The answer from @sgeddes is a bit better in that sense: it will use one (arbitrary) row from the equally ranked rows, without combining the data and without having to add sorting criteria. It is also easier to maintain, as it has the ranking criteria in one place, while mine has them in two.
You should probably test performance of both in your specific scenario (e.g. specific indices, specific data profile/statistics).
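To make the "combining" drawback concrete, here is a sketch that assumes SQLite 3.25 or later through Python's sqlite3 module. SQLite has no KEEP (DENSE_RANK FIRST), so the aggregate approach is emulated with DENSE_RANK plus MAX. The info8 row is hypothetical data added to create a tie, dates are stored as ISO strings, and ext_info as the tie-breaker is my choice:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tab1 (pt_key INTEGER, cp_key INTEGER, ext_info TEXT,
                       resp_nm TEXT, resp_dttm TEXT, "rank" INTEGER);
    INSERT INTO tab1 VALUES
        (1, 1, 'info1', 'OK',     '2018-03-01 17:00:00', 1),
        (1, 1, 'info2', 'FAILED', '2018-03-01 17:00:00', 2),
        (1, 1, 'info3', 'SENT',   '2018-03-01 17:00:00', 3),
        (1, 1, 'info4', 'SENT',   '2018-03-02 17:00:00', 3),
        (1, 2, 'info5', 'OK',     '2018-03-05 17:00:00', 1),
        (1, 2, 'info6', 'OK',     '2018-03-06 17:00:00', 1),
        (1, 2, 'info7', 'FAILED', '2018-03-01 17:00:00', 2),
        -- Hypothetical extra row: ties with info7 on rank and resp_dttm.
        (1, 2, 'info8', 'DONE',   '2018-03-01 17:00:00', 2);
""")

RANKED = """
    WITH ranked AS (
        SELECT t.*,
               DENSE_RANK() OVER (PARTITION BY pt_key, cp_key
                                  ORDER BY "rank" DESC, resp_dttm DESC) AS dr,
               ROW_NUMBER() OVER (PARTITION BY pt_key, cp_key
                                  ORDER BY "rank" DESC, resp_dttm DESC,
                                           ext_info DESC) AS rn
        FROM tab1 t
    )
"""

# Aggregate style: MAX is taken per column over all top-ranked rows.
combined = conn.execute(RANKED + """
    SELECT pt_key, cp_key, MAX(ext_info), MAX(resp_nm)
    FROM ranked WHERE dr = 1
    GROUP BY pt_key, cp_key
    ORDER BY pt_key, cp_key
""").fetchall()

# Row style: exactly one existing row per group, thanks to the tie-breaker.
single = conn.execute(RANKED + """
    SELECT pt_key, cp_key, ext_info, resp_nm
    FROM ranked WHERE rn = 1
    ORDER BY pt_key, cp_key
""").fetchall()

print(combined)
print(single)
```

In this sketch the aggregate style yields info8 together with FAILED for the second group, a combination that exists in no single row, while the row style returns the stored info8 row unchanged.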

Get distinct individual column values (not distinct pairs) from two tables in single query

I have two tables like the following. One is for sport talents of some people and second for arts talents. One may not have a sport talent to list and same applies for art talent.
CREATE TABLE SPORT_TALENT(name varchar(10), TALENT varchar(10));
CREATE TABLE ART_TALENT(name varchar(10), TALENT varchar(10));
INSERT INTO SPORT_TALENT(name, TALENT) VALUES
('Steve', 'Footbal')
,('Steve', 'Golf')
,('Bob' , 'Golf')
,('Mary' , 'Tennnis');
INSERT INTO ART_TALENT(name, TALENT) VALUES
('Steve', 'Dancer')
, ('Steve', 'Singer')
, ('Bob' , 'Dancer')
, ('Bob' , 'Singer')
, ('John' , 'Dancer');
Now I want to list down sport talent and art talent of one person. I would like to avoid duplication. But I don't mind if there is a "null" in any output. I tried the following
select distinct sport_talent.talent as s_talent,art_talent.talent as a_talent
from sport_talent
JOIN art_talent on sport_talent.name=art_talent.name
where (sport_talent.name='Steve' or art_talent.name='Steve');
s_talent | a_talent
----------+----------
Footbal | Dancer
Golf | Singer
Footbal | Singer
Golf | Dancer
I would like to avoid redundancy and need something like the following (distinct values of sport talents + distinct values of art talents).
s_talent | a_talent
----------+----------
Footbal | Dancer
Golf | Singer
As mentioned in subject, I am not looking for distinct combinations. But at the same time, it's OK if there are some records with "null" value in one column. I am relatively new to SQL.
Try:
SELECT s_talent, a_talent
FROM (
SELECT distinct on (talent) talent as s_talent,
dense_rank() over (order by talent) as x
FROM SPORT_TALENT
WHERE name='Steve'
) x
FULL OUTER JOIN (
SELECT distinct on (talent) talent as a_talent,
dense_rank() over (order by talent) as x
FROM ART_TALENT
WHERE name='Steve'
) y
ON x.x = y.x
Demo: http://sqlfiddle.com/#!15/66e04/3
There are no duplicates in your result. Each of the four rows your query returns is unique. The result may not be what you want, but the problem is not duplication.
Postgres 9.4
... introduces unnest() with multiple arguments. Does exactly what you want, and should be fast, too. Per documentation:
The special table function UNNEST may be called with any number of
array parameters, and it returns a corresponding number of columns, as
if UNNEST (Section 9.18) had been called on each parameter separately
and combined using the ROWS FROM construct.
About ROWS FROM:
Compare result of two table functions using one column from each
SELECT *
FROM unnest(
ARRAY(SELECT DISTINCT talent FROM sport_talent WHERE name = 'Steve')
, ARRAY(SELECT DISTINCT talent FROM art_talent WHERE name = 'Steve')
) AS t(s_talent, a_talent);
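For readers without Postgres at hand, the positional pairing that multi-argument unnest() performs can be sketched in client code. This assumes SQLite through Python's sqlite3 module; paired_talents is a hypothetical helper name, and sorting each list alphabetically is my choice to keep the output stable:

```python
import sqlite3
from itertools import zip_longest

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sport_talent (name TEXT, talent TEXT);
    CREATE TABLE art_talent   (name TEXT, talent TEXT);
    INSERT INTO sport_talent VALUES
        ('Steve', 'Footbal'), ('Steve', 'Golf'),
        ('Bob', 'Golf'), ('Mary', 'Tennnis');
    INSERT INTO art_talent VALUES
        ('Steve', 'Dancer'), ('Steve', 'Singer'),
        ('Bob', 'Dancer'), ('Bob', 'Singer'), ('John', 'Dancer');
""")


def distinct_talents(table, person):
    # The table name comes from a fixed whitelist below, never from user input.
    sql = f"SELECT DISTINCT talent FROM {table} WHERE name = ? ORDER BY talent"
    return [row[0] for row in conn.execute(sql, (person,))]


def paired_talents(person):
    """Pair the two distinct lists by position, padding the shorter with None."""
    sports = distinct_talents("sport_talent", person)
    arts = distinct_talents("art_talent", person)
    return list(zip_longest(sports, arts))


print(paired_talents("Steve"))
print(paired_talents("Bob"))
```

The None padding plays the role of the NULLs that unnest() produces when the arrays differ in length.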
Postgres 9.3 or older
SELECT s_talent, a_talent
FROM (
SELECT talent AS s_talent, row_number() OVER () AS rn
FROM sport_talent
WHERE name = 'Steve'
GROUP BY 1
) s
FULL JOIN (
SELECT talent AS a_talent, row_number() OVER () AS rn
FROM art_talent
WHERE name = 'Steve'
GROUP BY 1
) a USING (rn);
Similar previous answers with more explanation:
What type of JOIN to use
Sort columns independently, such that all nulls are last per column
This is similar to what @kordirko posted, but uses GROUP BY to get distinct talents, which is evaluated before window functions. So we only need a bare row_number() and not the more expensive dense_rank().
About the sequence of events in a SELECT query:
Best way to get result count before LIMIT was applied
SQL Fiddle.

Combining select distinct with group and ordering

A simplified example for illustration: Consider a table "fruit" with 3 columns: name, count and the date purchased. Need an alphabetical list of the fruits and their count the last time they were bought. I am a bit confused by the order of sorting and how distinct is applied. My attempt -
drop table if exists fruit;
create table fruit (
name varchar(8),
count integer,
dateP datetime
);
insert into fruit (name, count, dateP) values
('apple', 4, '2014-03-18 16:24:37'),
('orange', 2, '2013-12-11 11:20:16'),
('apple', 7, '2014-07-05 08:34:21'),
('banana', 6, '2014-06-20 19:10:15'),
('orange', 6, '2014-07-22 17:41:12'),
('banana', 4, '2014-08-15 21:26:37'), -- last
('orange', 5, '2014-12-11 11:20:16'), -- last
('apple', 3, '2014-09-25 18:54:32'), -- last
('apple', 5, '2014-02-05 18:47:18'),
('apple', 12, '2013-09-25 14:18:57'),
('banana', 5, '2013-04-18 15:59:04'),
('apple', 9, '2014-01-29 11:47:45');
-- Expecting:
-- apple 3
-- banana 4
-- orange 5
select distinct name, count
from fruit
group by name
order by name, dateP;
-- Produces:
-- apple 9
-- banana 5
-- orange 5
Try this:-
select f1.name,f1.count
from
fruit f1
inner join
(select name,max(dateP) date_P from fruit group by name) f2
on f1.name = f2.name and f1.dateP = f2.date_P
order by f1.name
EDITED for the last line :)
Try the following:
SELECT fruit.name, fruit.count, fruit.dateP
FROM fruit
INNER JOIN (
SELECT name, Max(dateP) AS lastPurchased
FROM fruit
GROUP BY name
) AS dt ON (dt.name = fruit.name AND dt.lastPurchased = fruit.dateP )
Here is a demo of this example on SQLFiddle.
When I faced a similar situation before, I resolved it as follows. It requires a primary key; in this case I have added a UID column.
SELECT a.Name,a.Count FROM Fruit a WHERE a.UID IN
(SELECT b.UID FROM Fruit b
WHERE b.Name = a.Name ORDER BY b.DateP Desc,b.UID DESC LIMIT 1)
This also avoids the possibility that the same fruit was purchased twice at the exact same time; unlikely in this example but in a large scale system it is a possibility which could come back to haunt you. It handles this by ordering by UID as well so it will choose the purchase most recently added to the table (assuming incrementing primary key).
Edited to remove the TOP 1 invalid syntax
In SQLite 3.7.11 or later, you can use MAX/MIN to select from which record in a group other values are returned (but this requires that you have that maximum in the result):
SELECT name, count, MAX(dateP)
FROM fruit
GROUP BY name
ORDER BY name
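A quick way to check this behaviour is to run the query through Python's sqlite3 module. This sketch uses the question's data, with dateP stored as text, and relies on the SQLite-specific rule that bare columns in a query with a single MAX() come from the row holding that maximum:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fruit (name TEXT, count INTEGER, dateP TEXT);
    INSERT INTO fruit (name, count, dateP) VALUES
        ('apple',  4,  '2014-03-18 16:24:37'),
        ('orange', 2,  '2013-12-11 11:20:16'),
        ('apple',  7,  '2014-07-05 08:34:21'),
        ('banana', 6,  '2014-06-20 19:10:15'),
        ('orange', 6,  '2014-07-22 17:41:12'),
        ('banana', 4,  '2014-08-15 21:26:37'),
        ('orange', 5,  '2014-12-11 11:20:16'),
        ('apple',  3,  '2014-09-25 18:54:32'),
        ('apple',  5,  '2014-02-05 18:47:18'),
        ('apple',  12, '2013-09-25 14:18:57'),
        ('banana', 5,  '2013-04-18 15:59:04'),
        ('apple',  9,  '2014-01-29 11:47:45');
""")

# "count" is a bare column: SQLite takes it from the row with MAX(dateP).
latest = conn.execute("""
    SELECT name, count, MAX(dateP)
    FROM fruit
    GROUP BY name
    ORDER BY name
""").fetchall()

for row in latest:
    print(row)
```

Other databases either reject this query or return an arbitrary count, so the join against MAX(dateP) shown in the earlier answers is the portable form.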
If the nested SELECT clauses get hard to follow, you can rewrite them as Common Table Expressions. Whether that also improves performance depends on the database, so measure it before relying on it.

Product price comparison in sql

I have a table like the one created below. I add product prices to it daily, under different seller names:
create table Product_Price
(
id int,
dt date,
SellerName varchar(20),
Product varchar(10),
Price money
)
insert into Product_Price values (1, '2012-01-16','Sears','AA', 32)
insert into Product_Price values (2, '2012-01-16','Amazon', 'AA', 40)
insert into Product_Price values (3, '2012-01-16','eBay','AA', 27)
insert into Product_Price values (4, '2012-01-17','Sears','BC', 33.2)
insert into Product_Price values (5, '2012-01-17','Amazon', 'BC',30)
insert into Product_Price values (6, '2012-01-17','eBay', 'BC',51.4)
insert into Product_Price values (7, '2012-01-18','Sears','DE', 13.5)
insert into Product_Price values (8, '2012-01-18','Amazon','DE', 11.1)
insert into Product_Price values (9, '2012-01-18', 'eBay','DE', 9.4)
I want result like this for n number of sellers(As more sellers added in table)
DT PRODUCT Sears[My Site] Amazon Ebay Lowest Price
1/16/2012 AA 32 40 27 Ebay
1/17/2012 BC 33.2 30 51.4 Amazon
1/18/2012 DE 7.5 11.1 9.4 Sears
I think this is what you're looking for.
SQLFiddle
It's kind of ugly, but here's a little breakdown.
This block allows you to get a dynamic list of your values. (Can't remember who I stole this from, but it's awesome. Without this, pivot really isn't any better than a big giant case statement approach to this.)
DECLARE @cols AS VARCHAR(MAX)
DECLARE @query AS NVARCHAR(MAX)
select @cols = STUFF((SELECT distinct ',' +
QUOTENAME(SellerName)
FROM Product_Price
FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)')
, 1, 1, '')
Your @cols variable comes out like so:
[Amazon],[eBay],[Sears]
Then you need to build a string of your entire query:
select @query =
'select piv1.*, tt.sellername from (
select *
from
(select dt, product, SellerName, sum(price) as price from product_price group by dt, product, SellerName) t1
pivot (sum(price) for SellerName in (' + @cols + '))as bob
) piv1
inner join
(select t2.dt,t2.sellername,t1.min_price from
(select dt, min(price) as min_price from product_price group by dt) t1
inner join (select dt,sellername, sum(price) as price from product_price group by dt,sellername) t2 on t1.dt = t2.dt and t1.min_price = t2.price) tt
on piv1.dt = tt.dt
'
The piv1 derived table gets you the pivoted values. The cleverly named tt derived table gets you the seller who has the minimum price for each day.
(Told you it was kind of ugly.)
And finally, you run your query:
execute(@query)
And you get:
DT PRODUCT AMAZON EBAY SEARS SELLERNAME
2012-01-16 AA 40 27 32 eBay
2012-01-17 BC 30 51.4 33.2 Amazon
2012-01-18 DE 11.1 9.4 13.5 eBay
(sorry, can't make that bit line up).
I would think that if you have a reporting tool that can do crosstabs, this would be a heck of a lot easier to do there.
The problem is this requirement:
I want result like this for n number of sellers
If you have a fixed, known number of columns for your results, there are several techniques to PIVOT your data. But if the number of columns is not known, you're in trouble. The SQL language really wants you to be able to describe the exact nature of the result set for the select list in terms of the number and types of columns up front.
It sounds like you can't do that. This leaves you with two options:
Query the data to know how many stores you have and their names, and then use that information to build a dynamic sql statement.
(Preferred option) Perform the pivot in client code.
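A minimal sketch of the client-side option, written in Python purely for illustration. The rows are assumed to have been fetched already with something like SELECT dt, SellerName, Product, Price FROM Product_Price, and pivot_prices is a hypothetical helper name:

```python
from collections import defaultdict

# Rows as (dt, seller, product, price), copied from the question's inserts.
rows = [
    ("2012-01-16", "Sears", "AA", 32), ("2012-01-16", "Amazon", "AA", 40),
    ("2012-01-16", "eBay", "AA", 27), ("2012-01-17", "Sears", "BC", 33.2),
    ("2012-01-17", "Amazon", "BC", 30), ("2012-01-17", "eBay", "BC", 51.4),
    ("2012-01-18", "Sears", "DE", 13.5), ("2012-01-18", "Amazon", "DE", 11.1),
    ("2012-01-18", "eBay", "DE", 9.4),
]


def pivot_prices(rows):
    """Return (header, table) with one price column per seller found in rows."""
    sellers = sorted({seller for _, seller, _, _ in rows})
    by_key = defaultdict(dict)
    for dt, seller, product, price in rows:
        by_key[(dt, product)][seller] = price

    table = []
    for (dt, product), prices in sorted(by_key.items()):
        cheapest = min(prices, key=prices.get)  # seller with the lowest price
        # Sellers with no price for this date/product show up as None.
        table.append((dt, product, *[prices.get(s) for s in sellers], cheapest))
    return ["dt", "product", *sellers, "lowest"], table


header, table = pivot_prices(rows)
print(header)
for line in table:
    print(line)
```

Because the seller columns are discovered from the data, a new seller appears as a new column without any change to the SQL.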
This is something that would probably work well with a PIVOT. Microsoft's docs are actually pretty useful on PIVOT and UNPIVOT.
http://technet.microsoft.com/en-us/library/ms177410(v=sql.105).aspx
Basically it allows you to pick a column, in your case SellerName, and pivot that out so that the elements of the column themselves become columns in the new result. The values that go in the new "Ebay", "Amazon", etc. columns would be an aggregate that you choose - in this case the MAX or MIN or AVG of the price.
For the final "Lowest Price" column you'd likely be best served by doing a subquery in your main query which finds the lowest value per product/date and then joining that back in to get the SellerName. Something like:
SELECT
pp.dt
,pp.Product
,pp.SellerName AS MinimumSellerName
FROM
(SELECT
MIN(Price) AS min_price
,Product
,dt
FROM Product_Price
GROUP BY
Product
,dt) mp
INNER JOIN Product_Price pp
ON mp.Product = pp.Product
AND mp.dt = pp.dt
AND mp.min_price = pp.Price
Then just put the pivot around that and include the MinimumSellerName column, just like you include the date and product.

SQL DISTINCT, GROUP BY OR....?

I have a database with the following columns
SKU | designID | designColor | width | height | price | etc.
SKU number is unique and designID is repeated.
Basically, I want to DISTINCT or GROUP BY designID and get the value of the rest of row even though they are not repeated.
Example:
123 | A-1 | RED | 2 | 3 | $200 | etc.
135 | A-2 | BLU | 8 | 4 | $150 | etc.
In the end, I should be able to sort them by any column. I already tried GROUP BY and DISTINCT, but neither of them returns the rest of the row's values.
Example:
SELECT DISTINCT designID
FROM tbl_name
Which will return
A-1
A-2
and no other data.
GROUP BY example:
SELECT designID, designColor
FROM tbl_name
GROUP BY designID, designColor
Which will return
A-1 | RED
A-2 | BLU
Any idea so I can have DISTINCT result with all the row values?
Thanks in advance.
====================================
Thanks everybody for all your time and tips. Please let me describe more:
basically I need to eliminate the repeated designID values and show just one row for each, and it doesn't matter which one: first, middle or last. What matters is that the row I show has all of its information, like SKU, price, size, etc. I don't know, maybe I should use something other than DISTINCT or GROUP BY.
Here is what I want from database.
Unless I misunderstand, you can SELECT DISTINCT on multiple columns:
SELECT
DISTINCT designID,
designColor,
width,
height,
price
FROM tbl_name
ORDER BY designColor
This will give you all the unique rows. If you have, for example, two designID values across 15 total rows with 2 and 3 different designColor values respectively, this will give you 5 rows.
If you don't care which row will be returned, you could use MAX and a subquery-group by:
create table #test(
SKU int,
designID varchar(10),
designColor varchar(10),
width int,
height int,
price real,
etc varchar(50)
)
insert into #test values(123, 'A-1' ,'RED', 2, 3, '200', 'etc')
insert into #test values(135, 'A-2' ,'BLUE', 8, 4, '150', 'etc')
insert into #test values(128, 'A-2' ,'YELLOW', 6, 9, '300', 'etc')
select t.* FROM #test t INNER JOIN
(
SELECT MAX(SKU) as MaxSKU,designID
FROM #test
GROUP BY designID
) tt
ON t.SKU = tt.MaxSKU;
drop table #test;
Result:
SKU designID designColor width height price etc
123 A-1 RED 2 3 200 etc
135 A-2 BLUE 8 4 150 etc
If they are all guaranteed to be duplicates (100%, i.e. in all columns), then DISTINCT would be your friend, i.e.
SELECT DISTINCT designID, designColor, width, height, price FROM tbl_name
This will give distinct values on everything except SKU (which will always be unique and would foil your DISTINCT).
If you want unique designID values and the other columns differ, then you need to figure out which of the values you want. If you really don't care, you can just arbitrarily pick an aggregate function (say, MIN) and use GROUP BY
i.e.
SELECT designID, MIN(designColor) FROM tbl_name GROUP BY designID
This will give you a unique design id and a value for the other columns.
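One thing to watch with per-column aggregates is that each MIN is computed independently. The sketch below assumes SQLite 3.25 or later through Python's sqlite3 module; the table name designs is hypothetical and the rows are borrowed from the #test example in another answer. It compares the aggregate form with picking one whole row per designID:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE designs (SKU INTEGER, designID TEXT, designColor TEXT,
                          width INTEGER, height INTEGER, price REAL);
    INSERT INTO designs VALUES
        (123, 'A-1', 'RED',    2, 3, 200),
        (135, 'A-2', 'BLUE',   8, 4, 150),
        (128, 'A-2', 'YELLOW', 6, 9, 300);
""")

# One aggregate per column: the values may come from different rows.
mixed = conn.execute("""
    SELECT designID, MIN(designColor), MIN(width)
    FROM designs
    GROUP BY designID
    ORDER BY designID
""").fetchall()

# One whole row per designID: here, the row with the largest SKU.
whole = conn.execute("""
    SELECT designID, designColor, width
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY designID
                                     ORDER BY SKU DESC) AS rn
        FROM designs
    )
    WHERE rn = 1
    ORDER BY designID
""").fetchall()

print(mixed)
print(whole)
```

For A-2 the aggregate form pairs BLUE with width 6, although the BLUE row has width 8; the row-based form keeps the values of a single stored row together.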
If you want the row with the biggest SKU for each designID, you could use a ranking function, i.e.
;WITH rankedSKUs
AS
(
SELECT SKU, ROW_NUMBER() OVER(PARTITION BY designID ORDER BY SKU DESC) as id
FROM tbl_name
)
SELECT *
FROM tbl_name T
WHERE EXISTS(SELECT * FROM rankedSKUs where id = 1 and SKU = T.sku)
This will return all columns for each distinct designID, taking the row with the largest SKU as authoritative for each designID.
If you want to return every field, you might as well remove the distinct (assuming you have an id like you seem to).
Your request is really weird because if you take say,
SELECT DISTINCT designID
FROM tbl_name
you get a list of unique design id's, and if you then look up in the table for all rows with those id's, you'll get every single row in the table.
As a side note, the use of distinct usually means you designed your database badly (ie, not normalized) or that you designed your query badly (ie, you know, really badly). My money is on the former.
If you use LINQ you can use something like this:
get_data_context().my_table.GroupBy( t => t.designID ).Select( t => new { t.Key,
REST = t.Select( u => new { u.SKU , u.designID , u.designColor , u.width ,
u.height , u.price } ) } );