SQL - How to retrieve the rows with the highest rank on a column with multiple duplicate rows in another one - sql

I'm quite new to SQL, so it might be a noob question:
Let's say my query is:
select [Item], [Answers] from table1
And it retrieves the following table:
Item Answers
------------------
Car Expensive
Car Cheap
Car Medium
Boat Expensive
Boat Very Expensive
Boat Ultra Expensive
Given a certain second table (or dictionary, I don't really know how to do it) {Cheap: 1, Medium: 2, Expensive: 3, Very Expensive:4, Ultra Expensive: 5} - meaning that "Ultra Expensive" is the highest rank and "Cheap" is the lowest rank.
In SQL, on this kind of table with many duplicates in Column A (Item) how do I retrieve the highest ranked value in Column B (Answers) for each unique value in Column A?
In this example, I would like to get:
Item Answers
------------------
Car Expensive
Boat Ultra Expensive
Just one of each duplicated value in column 'Item' and its highest ranked possible value in 'Answers'?

You can use a correlated subquery:
select item, answers
from table1 t
where answers = (
    select top 1 t1.answers
    from table1 t1
    where t1.item = t.item
    order by
        case t1.answers
            when 'Cheap' then 1
            when 'Medium' then 2
            when 'Expensive' then 3
            when 'Very Expensive' then 4
            when 'Ultra Expensive' then 5
        end desc
)
The subquery filters on the answer that has the highest rank for the given item, using a conditional order by clause and top 1.
Another option is to use row_number() for filtering:
select item, answers
from (
    select
        item,
        answers,
        row_number() over(
            partition by item
            order by case answers
                when 'Cheap' then 1
                when 'Medium' then 2
                when 'Expensive' then 3
                when 'Very Expensive' then 4
                when 'Ultra Expensive' then 5
            end desc
        ) rn
    from table1
) t
where rn = 1
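The question also mentions keeping the ranks in a second table. Here is a minimal sketch of that variant, run through Python's sqlite3 module so it can be tried anywhere. The lookup table answer_rank and its column rnk are names made up for this example, and SQLite has no TOP, so only the row_number() form is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (Item TEXT, Answers TEXT);
    INSERT INTO table1 VALUES
        ('Car', 'Expensive'), ('Car', 'Cheap'), ('Car', 'Medium'),
        ('Boat', 'Expensive'), ('Boat', 'Very Expensive'),
        ('Boat', 'Ultra Expensive');

    -- Hypothetical lookup table holding the rank of each answer.
    CREATE TABLE answer_rank (Answers TEXT PRIMARY KEY, rnk INTEGER);
    INSERT INTO answer_rank VALUES
        ('Cheap', 1), ('Medium', 2), ('Expensive', 3),
        ('Very Expensive', 4), ('Ultra Expensive', 5);
""")

# Join each row to its rank, number the rows per item from the highest
# rank down, and keep the first row of every item.
query = """
    SELECT Item, Answers
    FROM (
        SELECT t.Item, t.Answers,
               ROW_NUMBER() OVER (PARTITION BY t.Item
                                  ORDER BY r.rnk DESC) AS rn
        FROM table1 AS t
        JOIN answer_rank AS r ON r.Answers = t.Answers
    ) AS ranked
    WHERE rn = 1
    ORDER BY Item
"""
top_answers = conn.execute(query).fetchall()
print(top_answers)
```

With the ranks in a table, adding or reordering answers only needs a data change, not a change to the CASE expression.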

Related

SQL query which will extract conditionally the values from top categories the first and the 2nd where CATEGORY is OTHER

I have this table. It is just a small example; the real table has many more rows.
id  CATEGORY  AMOUNT
--------------------
1   TECH      120
1   FUN       220
2   OTHER     340
2   PARENTS   220
It shows, per id, the amount spent in each category. I want to select each ID together with the category in which that ID spent the most, but if that category is OTHER, I want the second-highest spending category instead.
I have a constraint: I CANNOT use a subquery that filters with WHERE CATEGORY <> 'OTHER'. It makes my machine run out of memory (I don't know why).
This is what I have tried.
I created a row number with row_number() over (partition by id order by amount desc) rn
and then ran
select id, category from table where rn = 1 group by 1,2
But I don't know how to tell the query: if CATEGORY is OTHER, then take rn = 2.
id  CATEGORY  AMOUNT  ROW NUM
-----------------------------
1   TECH      120     2
1   FUN       220     1
2   OTHER     340     1
2   PARENTS   220     2
Another thing I considered is a QUALIFY clause:
QUALIFY ROW_NUMBER() OVER (PARTITION BY ID ORDER BY AMOUNT DESC) = 1
But here too I only get the first record per ID, which can still be OTHER. I would like to filter within QUALIFY so that a row whose CATEGORY is 'OTHER' is not considered.
I am using Databricks.
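One way the idea above could be sketched is to push OTHER to the end of the window ordering instead of filtering it out, so it only wins when an id has no other category. The example below runs on SQLite through Python's sqlite3 module, with a made-up table name (spending). SQLite has no QUALIFY, so the row number is filtered in an outer query. Databricks SQL does support QUALIFY, so the same ORDER BY expression could probably go straight into QUALIFY ROW_NUMBER() OVER (...) = 1, but that is untested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE spending (id INTEGER, category TEXT, amount INTEGER);
    INSERT INTO spending VALUES
        (1, 'TECH', 120), (1, 'FUN', 220),
        (2, 'OTHER', 340), (2, 'PARENTS', 220);
""")

# The CASE expression sorts every non-OTHER row before any OTHER row;
# within each group the largest amount comes first.
query = """
    SELECT id, category
    FROM (
        SELECT id, category,
               ROW_NUMBER() OVER (
                   PARTITION BY id
                   ORDER BY CASE WHEN category = 'OTHER' THEN 1 ELSE 0 END,
                            amount DESC
               ) AS rn
        FROM spending
    ) AS numbered
    WHERE rn = 1
    ORDER BY id
"""
top_categories = conn.execute(query).fetchall()
print(top_categories)
```

Note that an id whose only category is OTHER would still return OTHER with this ordering.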

how can I count some values for data in a table based on same key in another table in Bigquery?

I have one table like the one below. Each id is unique.
id      times_of_going_out
--------------------------
fef666  2
S335gg  1
9a2c50  1
and another table like this one ↓. In this second table the "id" is not unique, there are different "category_name" for a single id.
id      category_name          city
-----------------------------------
S335gg  Games & Game Supplies  tk
9a2c50  Telephone Companies    os
9a2c50  Recreation Centers     ky
fef666  Recreation Centers     ky
I want to find the difference between the destinations (category_name) of people who go out often (times_of_going_out > 5) and people who don't go out often (times_of_going_out <= 5).
** Both tables are small samples of large tables.
 ・ Where do people who go out twice often go?
 ・ Where do people who go out 6 times often go?
Thank you
The expected result could be something like
less than 5 | more than 5
-------------------------
top ten "category_name" for uids with "times_of_going_out" less than 5 times | top ten "category_name" for uids with "times_of_going_out" more than 5 times
Steps:
1. Combine the data and aggregate the total times_of_going_out per category.
2. Create the buckets you need: less than or equal to 5, and more than 5. If 5 itself should fall in the other bucket, adjust the comparison.
3. Rank the rows within each bucket using dense_rank(), based on the total times_of_going_out, so that the top 10 can be picked.
4. Filter so that only the top 10 values are kept for each bucket.
with main as (
    select
        category_name,
        sum(coalesce(times_of_going_out, 0)) as total_time_per_category
    from table1 as t1
    left join table2 as t2
        on t1.id = t2.id
    group by 1
),
category as (
    select
        *,
        -- strictly greater than 5, so that exactly 5 lands in 'less than equal to 5'
        if(total_time_per_category > 5, 'more than 5', 'less than equal to 5') as is_more_than_5_times
    from main
),
ranking_ as (
    select *,
        case when
            is_more_than_5_times = 'more than 5' then
            dense_rank() over (partition by is_more_than_5_times order by total_time_per_category desc)
            else NULL
        end AS rank_more_than_5,
        case when
            is_more_than_5_times = 'less than equal to 5' then
            dense_rank() over (partition by is_more_than_5_times order by total_time_per_category)
            else NULL
        end AS rank_less_than_equal_5
    from category
)
select
    is_more_than_5_times,
    string_agg(category_name, ',') as list
from ranking_
where rank_less_than_equal_5 <= 10 or rank_more_than_5 <= 10
group by 1
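The query above buckets on the total per category. If the bucket is instead meant to describe each person (as the question's wording suggests), a rough sketch could label every id first and then count categories inside each bucket. This runs on SQLite through Python's sqlite3 module rather than BigQuery, and the id x00001 with 6 outings is a made-up row added only so that the "more than 5" bucket is not empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id TEXT PRIMARY KEY, times_of_going_out INTEGER);
    CREATE TABLE table2 (id TEXT, category_name TEXT, city TEXT);
    INSERT INTO table1 VALUES
        ('fef666', 2), ('S335gg', 1), ('9a2c50', 1),
        ('x00001', 6);  -- made-up row, not in the question
    INSERT INTO table2 VALUES
        ('S335gg', 'Games & Game Supplies', 'tk'),
        ('9a2c50', 'Telephone Companies', 'os'),
        ('9a2c50', 'Recreation Centers', 'ky'),
        ('fef666', 'Recreation Centers', 'ky'),
        ('x00001', 'Recreation Centers', 'ky');  -- made-up row
""")

# 1) label each person, 2) count visits per bucket and category,
# 3) keep at most ten categories per bucket.
query = """
    WITH bucketed AS (
        SELECT t2.category_name,
               CASE WHEN t1.times_of_going_out > 5
                    THEN 'more than 5'
                    ELSE 'less than or equal to 5' END AS bucket
        FROM table1 AS t1
        JOIN table2 AS t2 ON t2.id = t1.id
    ),
    counted AS (
        SELECT bucket, category_name, COUNT(*) AS visitors
        FROM bucketed
        GROUP BY bucket, category_name
    ),
    ranked AS (
        SELECT bucket, category_name, visitors,
               ROW_NUMBER() OVER (PARTITION BY bucket
                                  ORDER BY visitors DESC, category_name) AS rn
        FROM counted
    )
    SELECT bucket, category_name, visitors
    FROM ranked
    WHERE rn <= 10
    ORDER BY bucket, rn
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```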

Case Statement for multiple criteria

I would like to ignore some of the results of my query because, for all intents and purposes, they are duplicates. Because of the way the request was made, we need to use this hierarchy, and although we are seeing different Company_Name values, we need to ignore one of the results.
Query:
SELECT
COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM
some_table AS A12
GROUP BY
2
ORDER BY
3 ASC, 2 ASC
This code omits half a dozen joins and WHERE clauses that are not germane to this question.
Results:
Customer_Name_Count Company_Name Total_Sales
-------------------------------------------------------------
1 3 Blockbuster 1,000
2 6 Jimmy's Bar 1,500
3 6 Jimmy's Restaurant 1,500
4 9 Impala Hotel 2,000
5 12 Sports Drink 2,500
In the above set, we can see that numbers 2 & 3 have the same count and the same total_sales number and similar company names. Is there a way to create a case statement that takes these 3 factors into consideration and then drops one or the other for Jimmy's enterprises? The other issue is that this has to be variable as there are other instances where this happens. And I would only want this to happen if the count and sales number match each other with a similar name in the company name.
Desired result:
Customer_Name_Count Company_Name Total_Sales
--------------------------------------------------------------
1 3 Blockbuster 1,000
2 6 Jimmy's Bar 1,500
3 9 Impala Hotel 2,000
4 12 Sports Drink 2,500
It looks like the other answers are accurate, assuming the Company_IDs are the same for both.
If the Company_IDs are different for Jimmy's Bar and Jimmy's Restaurant, you can use something like this. I suggest you get functional users involved and do some data clean-up, or else you'll be maintaining this every time the issue arises:
SELECT
    COUNT(DISTINCT CASE
        WHEN A12.Company_Name = 'Name2' THEN 'Name1'
        ELSE A12.Company_Name
    END) AS Customer_Name_Count
    ,CASE
        WHEN A12.Company_Name = 'Name2' THEN 'Name1'
        ELSE A12.Company_Name
    END AS Company_Name
    ,SUM(A12.Total_Sales) AS Total_Sales
FROM some_table AS A12 -- alias must match the A12 references above
GROUP BY CASE
    WHEN A12.Company_Name = 'Name2' THEN 'Name1'
    ELSE A12.Company_Name
END
Your problem is that the joins you are using are multiplying the number of rows. Somewhere along the way, multiple names are associated with exactly the same entity (which is why the numbers are the same). You can fix this by aggregating by the right id:
SELECT COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
MAX(Company_Name) as Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM some_table AS A12
GROUP BY Company_id -- I'm guessing the column is something like this
ORDER BY 3 ASC, 2 ASC;
This might actually overstate the sales (I don't know). Better would be fixing the join so it only returned one name. One possibility is that it is a type-2 dimension, meaning that there is a time component for values that change over time. You may need to restrict the join to a single time period.
You need a function that returns a common name for the companies (dbo.GetCommonName below is a user-defined function you would have to write yourself), and then use DISTINCT:
SELECT DISTINCT
Customer_Name_Count,
dbo.GetCommonName(Company_Name) as Company_Name,
Total_Sales
FROM dbo.theTable
You can try ROW_NUMBER as a window function, numbering the rows within each combination of Customer_Name_Count and Total_Sales, and then keep rn = 1:
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY Customer_Name_Count, Total_Sales ORDER BY Company_Name) rn
    FROM (
        SELECT
            COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
            Company_Name,
            SUM(Total_Sales) AS Total_Sales
        FROM
            some_table AS A12
        GROUP BY
            Company_Name
    ) t1
) t2
WHERE rn = 1
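To see what the ROW_NUMBER approach does, here is a small sketch that feeds it the aggregated rows from the question. It runs on SQLite through Python's sqlite3 module, and the table name agg is made up for the example; it stands in for the result of the original GROUP BY query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE agg (Customer_Name_Count INTEGER, "
    "Company_Name TEXT, Total_Sales INTEGER)"
)
conn.executemany(
    "INSERT INTO agg VALUES (?, ?, ?)",
    [
        (3, "Blockbuster", 1000),
        (6, "Jimmy's Bar", 1500),
        (6, "Jimmy's Restaurant", 1500),
        (9, "Impala Hotel", 2000),
        (12, "Sports Drink", 2500),
    ],
)

# Rows sharing the same count and sales form one partition; only the
# alphabetically first company name of each partition is kept.
query = """
    SELECT Customer_Name_Count, Company_Name, Total_Sales
    FROM (
        SELECT *, ROW_NUMBER() OVER (
                      PARTITION BY Customer_Name_Count, Total_Sales
                      ORDER BY Company_Name) AS rn
        FROM agg
    ) AS numbered
    WHERE rn = 1
    ORDER BY Total_Sales, Company_Name
"""
deduped = conn.execute(query).fetchall()
for row in deduped:
    print(row)
```

This version does not look at the names at all, so two unrelated companies that happen to share the same count and sales figure would also be collapsed into one row.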

sql combining 2 queries with different order by group by

I have a query where I am counting the most frequent response in a database and ranking them by highest amount so using group by and order by.
The following shows how to do it for one:
select health, count(health) as count
from [Health].[Questionaire]
group by Health
order by count(Health) desc
which outputs the following:
Health Count
----------- -----
Very Good 6
Good 5
Poor 4
I would like to run a similar query on another column of the same table and get both results from one SQL statement, like the following:
Health Count Diet Count
----------- ----- ----- -----
Very Good 6 Very Good 6
Good 5 Good 4
Poor 4 Poor 3
UPDATE!!
Hello, this is what the table looks like at the moment:
ID Diet Health
----------- ----- -------
101 Very Good Very Good
102 Poor Good
103 Poor Poor
From that table, I would like output like the following, again using one SQL statement:
Health Count Diet Count
----------- ----- ----- -----
Very Good 2 Very Good 1
Poor 1 Good 1
Good 0 Poor 1
Can anyone please help me out with this one?
Can provide further clarification if needed!
Here are 2 different ways of doing it; notice I removed the redundant column:
Test data:
CREATE TABLE #t (Health varchar(20), Diet varchar(20))
INSERT #t values
('Very good', 'Very good'),
('Poor', 'Good'),
('Poor', 'Poor')
Query 1:
;WITH CTE1 as
(
SELECT Health, count(*) CountHealth
FROM #t --[Health].[Questionaire]
GROUP BY health
), CTE2 as
(
SELECT Diet, count(*) CountDiet
FROM #t --[Health].[Questionaire]
GROUP BY Diet
)
SELECT
coalesce(Health, Diet) Grade,
coalesce(CountHealth, 0) CountHealth,
coalesce(CountDiet, 0) CountDiet
FROM CTE1
FULL JOIN
CTE2
ON CTE1.Health = CTE2.Diet
ORDER BY CountHealth DESC
Result 1:
Grade CountHealth CountDiet
Poor 2 1
Very good 1 1
Good 0 1
Mixing the results like that is really not good practice, so here is a different solution
Query 2:
SELECT Health, count(*) Count, 'Health' Grade
FROM #t --[Health].[Questionaire]
GROUP BY health
UNION ALL
SELECT Diet, count(*) CountDiet, 'Diet'
FROM #t --[Health].[Questionaire]
GROUP BY Diet
ORDER BY Grade, Count DESC
Result 2:
Health Count Grade
Good 1 Diet
Poor 1 Diet
Very good 1 Diet
Poor 2 Health
Very good 1 Health
You need to join the table to itself, and (as your sample data shows) you also need to deal with values that have no rows in the actual data.
If you have a table that lists the possible health/diet values:
select
    v.value Status,
    count(distinct a.id) healthCount, -- distinct: the two joins multiply rows
    count(distinct b.id) DietCount
from health_diet_values v
left join Questionaire a on a.health = v.value
left join Questionaire b on b.diet = v.value
group by v.value
or if you don't have such a table, you need to generate the list of values manually and join from that:
select
    v.value Status,
    count(distinct a.id) healthCount, -- distinct: the two joins multiply rows
    count(distinct b.id) DietCount
from (select 'Very Good' value union all
      select 'Good' union all
      select 'Poor') v
left join Questionaire a on a.health = v.value
left join Questionaire b on b.diet = v.value
group by v.value
Both of these queries produce zeroes if there is no matching data for the value.
Note that in your desired output you have a redundant column - you repeat the value column. The above queries produce output that looks like:
Status HealthCount DietCount
-------------------------------
Very Good 2 1
Good 1 1
Poor 0 1
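One thing to watch with the double left join: when a value matches several rows on one side, every health match is paired with every diet match, so a plain COUNT counts pairs rather than rows. The sketch below uses the three rows from the question's update to compare COUNT with COUNT(DISTINCT ...). It runs on SQLite through Python's sqlite3 module, and the output column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Questionaire (id INTEGER, diet TEXT, health TEXT);
    INSERT INTO Questionaire VALUES
        (101, 'Very Good', 'Very Good'),
        (102, 'Poor', 'Good'),
        (103, 'Poor', 'Poor');
""")

# 'Poor' matches one health row and two diet rows, so the joins yield
# two rows for it and a plain COUNT(a.id) reports 2 instead of 1.
query = """
    SELECT v.value,
           COUNT(a.id)          AS health_pairs,
           COUNT(DISTINCT a.id) AS health_count,
           COUNT(DISTINCT b.id) AS diet_count
    FROM (SELECT 'Very Good' AS value UNION ALL
          SELECT 'Good' UNION ALL
          SELECT 'Poor') AS v
    LEFT JOIN Questionaire AS a ON a.health = v.value
    LEFT JOIN Questionaire AS b ON b.diet = v.value
    GROUP BY v.value
    ORDER BY v.value
"""
counts = conn.execute(query).fetchall()
for row in counts:
    print(row)
```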

Joining onto a table that doesn't have ranges, but requires ranges

Trying to find the best way to write this SQL statement.
I have a customer table that has the internal credit score of that customer. Then I have another table with definitions of that credit score. I would like to join these tables together, but the second table doesn't have any way to link it easily.
The score of the customer is an integer between 1-999, and the definition table has these columns:
Score
Description
And these rows:
60 LOW
99 MED
999 HIGH
So basically if a customer has a score between 1 and 60 they are low, 61-99 they are med, and 100-999 they are high.
I can't really INNER JOIN these, because the join would only match customers whose score is exactly 60, 99, or 999, and that would exclude everyone with any other score.
I don't want to do a case statement with the static numbers, because our scores may change in the future and I don't want to have to update my initial query when/if they do. I also cannot create any tables or functions to do this- I need to create a SQL statement to do it for me.
EDIT:
A coworker said this would work, but it's a little crazy. I'm thinking there has to be a better way:
SELECT
    internal_credit_score,
    (
        SELECT
            credit_score_short_desc
        FROM
            cf_internal_credit_score
        WHERE
            internal_credit_score = (
                SELECT
                    max(credit.internal_credit_score)
                FROM
                    cf_internal_credit_score credit
                WHERE
                    cs.internal_credit_score <= credit.internal_credit_score
                    AND credit.internal_credit_score <= (
                        SELECT
                            min(credit2.internal_credit_score)
                        FROM
                            cf_internal_credit_score credit2
                        WHERE
                            cs.internal_credit_score <= credit2.internal_credit_score
                    )
            )
    )
FROM
    customer_statements cs
Try this: change your table to contain the range of the scores:
ScoreTable
-------------
LowScore int
HighScore int
ScoreDescription string
data values
LowScore HighScore ScoreDescription
-------- --------- ----------------
1 60 Low
61 99 Med
100 999 High
query:
Select
    .... , Score.ScoreDescription
FROM YourTable
INNER JOIN ScoreTable Score ON YourTable.Score >= Score.LowScore
    AND YourTable.Score <= Score.HighScore
WHERE ...
Assuming your table is named CreditTable, this is what you want:
select * from
(
select Description, Score
from CreditTable
where Score > 80 /*client's credit*/
order by Score
)
where rownum = 1
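Also, make sure your high score reference value is 1000, even though the client's highest possible score is 999. Note that with the strict > comparison, a client whose score is exactly 60 is matched to the 99 row (MED); if the boundaries are meant to be inclusive, as in the question, use >= instead, and the 999 row can then stay as it is.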
Update
The above SQL gives you the credit record for a given value. If you want to join with, say, Clients table, you'd do something like this:
select
c.Name,
c.Score,
(select Description from
(select Description from CreditTable where Score > c.Score order by Score)
where rownum = 1)
from clients c
I know this is a sub-select that is executed for each returned row, but then again, CreditTable is ridiculously small, and there will be no significant performance loss because of the sub-select.
You can use analytic functions to convert the data in your score description table to ranges (I assume that you meant that 100-999 should map to 'HIGH', not 99-999).
SQL> with x as (
       select 60 score, 'Low' description from dual union all
       select 99, 'Med' from dual union all
       select 999, 'High' from dual
     )
     select description,
            nvl(lag(score) over (order by score),0) + 1 low_range,
            score high_range
     from x;
DESC LOW_RANGE HIGH_RANGE
---- ---------- ----------
Low 1 60
Med 61 99
High 100 999
You can then join this to your CUSTOMER table with something like
SELECT c.*,
sd.*
FROM customer c,
(select description,
nvl(lag(score) over (order by score),0) + 1 low_range,
score high_range
from score_description) sd
WHERE c.credit_score BETWEEN sd.low_range AND sd.high_range
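For anyone who wants to try the LAG idea without an Oracle instance, here is a minimal sketch on SQLite through Python's sqlite3 module. NVL is replaced by COALESCE, the customer rows are made up and chosen to sit on the range boundaries, and the table and column names follow the query above rather than the asker's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE score_description (score INTEGER, description TEXT);
    INSERT INTO score_description VALUES
        (60, 'LOW'), (99, 'MED'), (999, 'HIGH');

    -- Made-up customers sitting on the range boundaries.
    CREATE TABLE customer (name TEXT, credit_score INTEGER);
    INSERT INTO customer VALUES
        ('a', 1), ('b', 60), ('c', 61), ('d', 99), ('e', 100), ('f', 999);
""")

# LAG(score) gives the previous upper bound; adding 1 turns it into the
# lower bound of the current range (0 + 1 for the first row).
query = """
    SELECT c.name, c.credit_score, sd.description
    FROM customer AS c
    JOIN (
        SELECT description,
               COALESCE(LAG(score) OVER (ORDER BY score), 0) + 1 AS low_range,
               score AS high_range
        FROM score_description
    ) AS sd
      ON c.credit_score BETWEEN sd.low_range AND sd.high_range
    ORDER BY c.credit_score
"""
labelled = conn.execute(query).fetchall()
for row in labelled:
    print(row)
```

Because the ranges are derived from the definition table at query time, changing a threshold there changes the result without touching the query.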