Top-N per group (MSSQL) [duplicate] - sql

I have 10k - 1m goods which are described by the fields product_id, name, category, price. What is the fastest way to fetch the 10 most expensive goods from each category? Previously I checked this answer: https://stackoverflow.com/a/176985/9513268.
My table:
-------------------------------------------
| product_id | name   | category | price  |
-------------------------------------------
| 1          | Phone  | Gadgets  | 599.99 |
-------------------------------------------
| 2          | Jacket | Clothes  | 399.00 |
-------------------------------------------
| ...        | ...    | ...      | ...    |
-------------------------------------------

You can use window functions, as shown in the answer that you linked.
select *
from (
    select t.*, rank() over(partition by category order by price desc) rn
    from mytable t
) t
where rn <= 10
order by category, rn
The key is to properly define the over() clause of the window function. You want the top 10 per category, so category goes in the partition by; you want the most expensive goods, so the order by criterion is price descending.
You can run the subquery separately and look at the rn column to better understand the logic.
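Note that rank() keeps ties, so a category can return more than 10 rows when several goods share the same price. If you need exactly 10 rows per category, a minimal variation (a sketch against the same mytable) is to use row_number() instead:
select *
from (
    select t.*, row_number() over(partition by category order by price desc) rn
    from mytable t
) t
where rn <= 10
order by category, rn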

Related

SQL get just one column from the max value [duplicate]

I have this table
| sale_id | amount |
| ------- | ------ |
| 5       | 3      |
| 1       | 2      |
| 3       | 1      |
And I need to select JUST the sale_id of the max amount: just the id 5, because 3 is the max amount. Sounds simple, but I'm having problems with this.
Can someone help me, please?
In standard SQL, this looks like:
select sale_id
from t
order by amount desc
fetch first 1 row only;
Not all databases support the fetch clause, but all have some mechanism for returning a result set with one row, such as limit or select top.
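For example, PostgreSQL and MySQL use limit (a sketch against the same table t):
select sale_id
from t
order by amount desc
limit 1;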
MS SQL dialect
select top 1 sale_id
from tbl
order by amount desc
In SQL server this can be achieved by:
SELECT TOP 1 sale_id FROM t order by amount desc
In case there are duplicate max amounts and you need to fetch every matching sale_id, you can use a window function, as sketched below.
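A sketch of that approach (assuming the same table t), returning all sale_ids tied for the highest amount:
select sale_id
from (
    select sale_id,
           rank() over (order by amount desc) as rnk
    from t
) ranked
where rnk = 1;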

PostgreSQL: Customers' preferred product and second most preferred product

I'm pretty new to SQL (currently using PostgreSQL, but interested in how this works in any SQL), and am trying to figure out something that I guess should be relatively straightforward.
I have a table containing one row per customer transaction; for each transaction I know what the customer bought. I am interested in finding out which product is each customer's preferred choice, and then their second most preferred choice (and in the end, on a general basis, what the preferred second choice is when the preferred choice is unavailable).
Below is a mock up of what the data could look like:
+-------------+----------------+
| Customer_id | Product bought |
+-------------+----------------+
| 1           | DVD            |
| 1           | DVD            |
| 1           | Blu-ray        |
| 1           | DVD            |
| 2           | DVD            |
| 2           | DVD            |
+-------------+----------------+
The successful results would be something like this:
+-------------+--------------+--------------+
| Customer_id | Preferred #1 | Preferred #2 |
+-------------+--------------+--------------+
| 1           | DVD          | Blu-ray      |
| 2           | DVD          | $NULL$       |
+-------------+--------------+--------------+
(And as mentioned earlier, the final step (most likely done in Python/R and not in SQL) would be to derive general rules such as "If Preferred #1 is DVD, then Preferred #2 is Blu-ray", "If Preferred #1 is Blu-ray, then Preferred #2 is Sandwich"... and so on.)
Cheers
This is a combination of a greatest-n-per-group and a pivot problem (sometimes also referred to as a crosstab).
The first step you need to do is to identify the two preferred products.
In your case you need to combine a group by query with window functions.
The following query counts how often each customer has bought each product:
select customer_id,
       product_bought,
       count(*) as num_products
from sales
group by customer_id, product_bought
order by customer_id;
This can be enhanced to include a rank for the number of times a product was bought:
select customer_id,
       product_bought,
       count(*) as num_products,
       dense_rank() over (partition by customer_id order by count(*) desc) as rnk
from sales
group by customer_id, product_bought
order by customer_id;
This would return the following result (based on your sample data):
customer_id | product_bought | num_products | rnk
------------+----------------+--------------+-----
          1 | DVD            |            3 |   1
          1 | Blu-ray        |            1 |   2
          2 | DVD            |            2 |   1
We cannot apply a where condition on the rnk column directly (window functions are not allowed in the where clause), so we need a derived table for that:
select customer_id, product_bought
from (
    select customer_id,
           product_bought,
           count(*) as num_products,
           dense_rank() over (partition by customer_id order by count(*) desc) as rnk
    from sales
    group by customer_id, product_bought
) t
where rnk <= 2
order by customer_id;
Now we need to convert the two rows for each customer into columns. This could e.g. be done using a common table expression:
with preferred_products as (
    select *
    from (
        select customer_id,
               product_bought,
               count(*) as num_products,
               dense_rank() over (partition by customer_id order by count(*) desc) as rnk
        from sales
        group by customer_id, product_bought
    ) t
    where rnk <= 2
)
select p1.customer_id,
       p1.product_bought as "Product #1",
       p2.product_bought as "Product #2"
from preferred_products p1
  left join preferred_products p2 on p1.customer_id = p2.customer_id and p2.rnk = 2
where p1.rnk = 1
This then returns
customer_id | Product #1 | Product #2
------------+------------+-----------
          1 | DVD        | Blu-ray
          2 | DVD        |
The above is standard SQL and will work on any modern DBMS.
Online example: http://rextester.com/VAID15638
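As an aside, the pivot step can also be done without the self-join, using conditional aggregation. A sketch (assuming PostgreSQL 9.4+ for the FILTER clause and the same sales table):
select customer_id,
       max(product_bought) filter (where rnk = 1) as "Product #1",
       max(product_bought) filter (where rnk = 2) as "Product #2"
from (
    select customer_id,
           product_bought,
           dense_rank() over (partition by customer_id order by count(*) desc) as rnk
    from sales
    group by customer_id, product_bought
) t
group by customer_id
order by customer_id;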

How to find most-correlated X for each Y?

I have a query I can run, which produces rows like this:
 ID | category | property_A | property_B
----+----------+------------+------------
  1 | X        | tall       | old
  2 | X        | short      | old
  3 | X        | tall       | old
  4 | X        | short      | young
  5 | Y        | short      | old
  6 | Y        | short      | old
  7 | Y        | tall       | old
I'd like to find, for each category and property_B, what is the most common property_A, and put that into another table somewhere for later use. So here I'd like to know that in category X, old people tend to be tall and young people short, while in category Y, old people tend to be short.
The domain of each column is finite, and not too large: there are something like 200 categories, and a dozen or so values each of property_A and property_B. So I could write a dumb script on my client that queries the database 200*12*12 times with a limited query, but that seems like it must be the wrong approach, as well as wasteful given that it's expensive to produce this table and then throw most of it away.
But I don't even know what words to look up to find the right approach: "sql find correlated rows" shows how to find integer correlations, but I'm not interested in integers. So what do I do instead?
You can readily do this with aggregation and window/analytic functions. You want the top-ranked property_A by count. The following returns the most popular A:
select category, property_b, property_a as MostPopularA
from (select category, property_b, property_a, count(*) as cnt,
             row_number() over (partition by category, property_b order by count(*) desc) as seqnum
      from tbl t
      group by category, property_b, property_a
     ) t
where seqnum = 1;
If you want to get all values when there is a tie, then use dense_rank() instead of row_number().
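For example (a sketch, identical to the query above except for the ranking function):
select category, property_b, property_a as MostPopularA
from (select category, property_b, property_a, count(*) as cnt,
             dense_rank() over (partition by category, property_b order by count(*) desc) as seqnum
      from tbl t
      group by category, property_b, property_a
     ) t
where seqnum = 1;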
I suggest a combination of GROUP BY and DISTINCT ON, which is faster / simpler / more elegant in Postgres:
SELECT DISTINCT ON (category, property_b)
category, property_b, property_a, count(*) AS ct
FROM tbl
GROUP BY category, property_b, property_a
ORDER BY category, property_b, ct DESC;
Returns:
category | property_b | property_a | ct
---------+------------+------------+----
X        | old        | tall       |  2
X        | young      | short      |  1
Y        | old        | short      |  2
If multiple peers tie for the most common value, only one arbitrary pick is returned.
This works in a single query level without subquery, since aggregation (GROUP BY) is applied before the DISTINCT step. Detailed explanation for DISTINCT ON:
Select first row in each GROUP BY group?
SQL Fiddle.

Calculate mode in SQL

I have seen the answer from a previous post, which works fine but I have a small dilemma.
Taking the same scenario:
A table that lists students' grades per class. I want a result set that looks like:
BIO...B
CHEM...C
Where the "B" and "C" are the modes for the class and want to get the mode for the class.
Once I applied the query below, I got the following output:
Class | Score | Freq | Ranking
2010  | B     |    8 |       1
2010  | C     |    8 |       1
2011  | A     |   10 |       1
2012  | B     |   11 |       1
In 2010, I have two grades with the same frequency. What if I just want to display the highest score, which in this case would be "B"? How can I achieve that? I would need to assign rankings to the letter grades, but I'm not sure how. Please advise.
Thanks.
Prior post:
SQL Server mode SQL
The query I used to retrieve the data was the answer from Peter:
;WITH Ranked AS (
    SELECT
        ClassName, Grade
        , GradeFreq = COUNT(*)
        , Ranking = DENSE_RANK() OVER (PARTITION BY ClassName ORDER BY COUNT(*) DESC)
    FROM Scores
    GROUP BY ClassName, Grade
)
SELECT * FROM Ranked WHERE Ranking = 1
Change:
SELECT * FROM Ranked WHERE Ranking = 1
To:
SELECT ClassName, MIN(Grade) AS HighestGrade, GradeFreq, Ranking
FROM Ranked
WHERE Ranking = 1
GROUP BY ClassName, GradeFreq, Ranking
(The column names here match those defined in the Ranked CTE. MIN(Grade) works because letter grades sort alphabetically, so the alphabetically smallest of the tied grades is the highest score, e.g. "B" before "C".)
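Putting the two pieces together, the full statement would be (a sketch against the same Scores table, assuming single-letter grades):
;WITH Ranked AS (
    SELECT
        ClassName, Grade
        , GradeFreq = COUNT(*)
        , Ranking = DENSE_RANK() OVER (PARTITION BY ClassName ORDER BY COUNT(*) DESC)
    FROM Scores
    GROUP BY ClassName, Grade
)
SELECT ClassName, MIN(Grade) AS HighestGrade, GradeFreq, Ranking
FROM Ranked
WHERE Ranking = 1
GROUP BY ClassName, GradeFreq, Ranking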

Remove redundant SQL price cost records

I have a table costhistory with fields id, invid, vendorid, cost, timestamp, chdeleted. It looks like it was populated with a trigger every time a vendor updated their list of prices.
It has redundant records, since it was populated regardless of whether the price had changed since the last record.
Example:
id | invid | vendorid | cost | timestamp | chdeleted
 1 | 123   | 1        | 100  | 1/1/01    | 0
 2 | 123   | 1        | 100  | 1/2/01    | 0
 3 | 123   | 1        | 100  | 1/3/01    | 0
 4 | 123   | 1        | 500  | 1/4/01    | 0
 5 | 123   | 1        | 500  | 1/5/01    | 0
 6 | 123   | 1        | 100  | 1/6/01    | 0
I would want to remove records with ID 2,3,5 since they do not reflect any change since the last price update.
I'm sure it can be done, though it might take several steps.
Just to be clear, this table has swelled to 100gb and contains 600M rows. I am confident that a proper cleanup will take this table's size down by 90% - 95%.
Thanks!
The approach you take will vary depending on the database you are using. For SQL Server 2005+, the following query should give you the records you want to remove:
select id
from (
    select id, Rank() over (Partition BY invid, vendorid, cost order by timestamp) as Rank
    from costhistory
) tmp
where Rank > 1
You can then delete them like this:
delete from costhistory
where id in (
    select id
    from (
        select id, Rank() over (Partition BY invid, vendorid, cost order by timestamp) as Rank
        from costhistory
    ) tmp
    where Rank > 1
)
I would suggest that you recreate the table using a group by query. Also, I assume the "id" column is not used in any other tables; if it is, then you need to fix those tables as well.
Deleting such a large quantity of records is likely to take a long, long time.
The query would look like:
insert into newversionoftable(invid, vendorid, cost, timestamp, chdeleted)
    select invid, vendorid, cost, timestamp, chdeleted
    from costhistory
    group by invid, vendorid, cost, timestamp, chdeleted
If you do opt for a delete, I would suggest:
(1) Fix the code first, so no duplicates are going in.
(2) Determine the duplicate ids and place them in a separate table.
(3) Delete in batches.
To find the duplicate ids, use something like:
select *
from (select id,
row_number() over (partition by invid, vendorid, cost, timestamp, chdeleted order by timestamp) as seqnum
from table
) t
where seqnum > 1
If you want to keep the most recent version instead, then use "timestamp desc" in the order by clause.
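One caveat with the rank-based queries above: they partition by cost without looking at the order of changes, so row 6 (where the price goes back to 100 after the change to 500) is also treated as a duplicate of rows 1-3, even though you want to keep it. If you only want to drop rows whose cost is unchanged from the immediately preceding record for the same invid/vendorid, a lag()-based sketch (assuming SQL Server 2012+ or PostgreSQL) would look like this:
delete from costhistory
where id in (
    select id
    from (
        select id,
               cost,
               lag(cost) over (partition by invid, vendorid order by timestamp) as prev_cost
        from costhistory
    ) t
    where cost = prev_cost
)
Against the sample data this removes exactly ids 2, 3 and 5, and keeps 1, 4 and 6.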