Count items that have diff formatting with SQL - sql

I am confused about how to count items that are the same but have different formatting. For example, we want to know how many different fruits each person has, given the following data:
Mary|Apple|
Mary|apple|
Mary|Apple |
Mary|Orange|
Liu|Grape|
Liu|Apple|
I expect the output
Mary|2
Liu|2
But if I do count(distinct fruits) then I get
Mary|4
Liu|2
Is there any way to deal with the formatting in this case?

You could normalize the values before counting them:
Remove leading and trailing spaces with TRIM().
Ignore letter case with LOWER().
As in:
select
name,
count(distinct lower(trim(fruits)))
from my_table
group by name
You could use the same strategy with the name column if it happens to have similar irregularities.
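To see the effect of the normalization, here is a minimal sketch that loads the rows from the question into an in-memory SQLite database via Python's sqlite3 module. The choice of SQLite and the Python wrapper are assumptions made only so the example is self-contained; the table and column names follow the query above.

```python
import sqlite3

# In-memory SQLite database loaded with the rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (name TEXT, fruits TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [
        ("Mary", "Apple"),
        ("Mary", "apple"),
        ("Mary", "Apple "),  # trailing space, as in the question
        ("Mary", "Orange"),
        ("Liu", "Grape"),
        ("Liu", "Apple"),
    ],
)

# Distinct count on the raw values: formatting variants count separately.
raw_counts = conn.execute(
    "SELECT name, COUNT(DISTINCT fruits) "
    "FROM my_table GROUP BY name ORDER BY name"
).fetchall()

# Distinct count after trimming and lowercasing: variants collapse.
normalized_counts = conn.execute(
    "SELECT name, COUNT(DISTINCT LOWER(TRIM(fruits))) "
    "FROM my_table GROUP BY name ORDER BY name"
).fetchall()

print(raw_counts)         # [('Liu', 2), ('Mary', 4)]
print(normalized_counts)  # [('Liu', 2), ('Mary', 2)]
```

Note that TRIM() in this sketch only strips spaces; tabs or other whitespace would need extra handling.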

Take the distinct count of the trimmed, lowercase version of the fruit names:
SELECT
name,
COUNT(DISTINCT LOWER(TRIM(fruit))) AS cnt
FROM yourTable
GROUP BY
name;
Demo
The demo is in MySQL, but the same logic should work in SQLite.

DISTINCT is not the only tool here. You can use GROUP BY with a normalized version of the column text to combine all the different casings and trailing spaces into one. For example:
SELECT name, fruit, count(fruit) AS cnt
FROM test
GROUP BY name, trim(upper(fruit));
gives me
name fruit cnt
---------- ---------- ----------
Liu Apple 1
Liu Grape 1
Mary Apple 3
Mary Orange 1
However, it looks like you want the total number of different types of fruit per person. So...
WITH totals(name, fruit) AS
(SELECT name, fruit
FROM test
GROUP BY name, trim(upper(fruit)))
SELECT name, count(fruit) AS fruits
FROM totals
GROUP BY name;
gives me
name fruits
---------- ----------
Liu 2
Mary 2


How to return all names that appear multiple times in table [duplicate]

Suppose I have the following schema:
student(name, siblings)
The table holds names and siblings. Note that a name appears in as many rows as that individual has siblings. For instance, a table could be as follows:
Jack, Lucy
Jack, Tim
Meaning that Jack has Lucy and Tim as his siblings.
I want to identify an SQL query that reports the names of all students who have 2 or more siblings. My attempt is the following:
select name
from student
where count(name) >= 1;
I'm not sure I'm using count correctly in this SQL query. Can someone please help with identifying the correct SQL query for this?
You're almost there:
select name
from student
group by name
having count(*) > 1;
HAVING is like a WHERE clause that runs after grouping is done. In it you can use things that grouping makes available (like counts and other aggregates). By grouping on the name and filtering on the count (use > 1 if you want two or more, not >= 1, because that would include 1) you get the names you want.
This will just deliver "Jack" as a single result (in the example data from the question). If you then want all the detail, like who Jack's siblings are, you can join your grouped, filtered list of names back to the table:
select *
from
student
INNER JOIN
(
select name
from student
group by name
having count(*) > 1
) morethanone ON morethanone.name = student.name
You can't avoid doing this "joining back" because the grouping has thrown the detail away in order to create the group. The only way to get the detail back is to take the name list the group gave you and use it to filter the original detail data again.
Full disclosure; it's a bit of a lie to say "can't avoid doing this": SQL Server supports something called a window function, which will effectively perform a grouping in the background and join it back to the detail. Such a query would look like:
select student.*, count(*) over(partition by name) n
from student
And for a table like this:
jack, lucy
jack, tim
jane, bill
jane, fred
jane, tom
john, dave
It would produce:
jack, lucy, 2
jack, tim, 2
jane, bill, 3
jane, fred, 3
jane, tom, 3
john, dave, 1
The rows with jack show 2 because there are two jack rows. There are 3 janes and 1 john. You could then wrap all that in a subquery and filter for n > 1, which would remove john:
select *
from
(
select student.*, count(*) over(partition by name) n
from student
) x
where x.n > 1
If SQL Server didn't have window functions, it would look more like:
select *
from
student
INNER JOIN
(
select name, count(*) as n
from student
group by name
) x ON x.name = student.name
The COUNT(*) OVER(PARTITION BY name) is like a mini "group by name and return the count, then auto join back to the main detail using the name as key", i.e. a short form of the latter query.
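The window-function version can be tried out with a small sketch like the one below. It runs against an in-memory SQLite database through Python's sqlite3 module rather than SQL Server, which is an assumption made purely for convenience; it needs an SQLite build with window-function support (3.25 or later). The sample rows are the jack/jane/john table from this answer.

```python
import sqlite3

# In-memory SQLite database with the sample rows from the answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, siblings TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [
        ("jack", "lucy"),
        ("jack", "tim"),
        ("jane", "bill"),
        ("jane", "fred"),
        ("jane", "tom"),
        ("john", "dave"),
    ],
)

# Attach the per-name row count to every detail row, then keep only
# the names that occur more than once.
rows = conn.execute(
    """
    SELECT name, siblings, n
    FROM (
        SELECT name, siblings, COUNT(*) OVER (PARTITION BY name) AS n
        FROM student
    ) AS x
    WHERE n > 1
    ORDER BY name, siblings
    """
).fetchall()

for row in rows:
    print(row)
```

With this data, john's single row is filtered out while the jack and jane rows keep their sibling detail.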
You can do:
select distinct s1.name
from student as s1
where exists (
select 1
from student as s2
where s1.name = s2.name and s1.siblings != s2.siblings
)
I think the best approach is what 'Caius Jard' mentioned. However, here is an additional way if you also want to see how many siblings each name has.
SELECT name, COUNT(*) AS Occurrences
FROM student
GROUP BY name
HAVING (COUNT(*) > 1)
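As a quick check of the GROUP BY / HAVING approach, the following sketch runs the same query against an in-memory SQLite database via Python's sqlite3 module. The engine choice and the sample rows (jack with two siblings, jane with three, john with one) are assumptions for illustration only.

```python
import sqlite3

# In-memory SQLite database with a few illustrative rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, siblings TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [
        ("jack", "lucy"),
        ("jack", "tim"),
        ("jane", "bill"),
        ("jane", "fred"),
        ("jane", "tom"),
        ("john", "dave"),
    ],
)

# Group by name, then keep only the groups with more than one row.
occurrences = conn.execute(
    "SELECT name, COUNT(*) AS occurrences "
    "FROM student GROUP BY name HAVING COUNT(*) > 1 ORDER BY name"
).fetchall()

print(occurrences)  # [('jack', 2), ('jane', 3)]
```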
I wanted to share another solution I came up with:
select distinct s1.name
from student s1, student s2
where s1.name = s2.name and s1.siblings != s2.siblings;

Case Statement for multiple criteria

I would like to ignore some of the results of my query because, for all intents and purposes, they are duplicates. Based on the way the request was made, we need to use this hierarchy, and although we are seeing different Company_Name values, we need to ignore one of the results.
Query:
SELECT
COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM
some_table AS A12
GROUP BY
2
ORDER BY
3 ASC, 2 ASC
This code omits half a dozen joins and WHERE clauses that are not germane to this question.
Results:
Customer_Name_Count Company_Name Total_Sales
-------------------------------------------------------------
1 3 Blockbuster 1,000
2 6 Jimmy's Bar 1,500
3 6 Jimmy's Restaurant 1,500
4 9 Impala Hotel 2,000
5 12 Sports Drink 2,500
In the above set, we can see that numbers 2 & 3 have the same count and the same total_sales number and similar company names. Is there a way to create a case statement that takes these 3 factors into consideration and then drops one or the other for Jimmy's enterprises? The other issue is that this has to be variable as there are other instances where this happens. And I would only want this to happen if the count and sales number match each other with a similar name in the company name.
Desired result:
Customer_Name_Count Company_Name Total_Sales
--------------------------------------------------------------
1 3 Blockbuster 1,000
2 6 Jimmy's Bar 1,500
3 9 Impala Hotel 2,000
4 12 Sports Drink 2,500
It looks like the other answers are accurate under the assumption that the Company_IDs are the same for both.
If the Company_IDs are different for Jimmy's Bar and Jimmy's Restaurant, then you can use something like this. I suggest you get functional users involved and do some data clean-up, or else you'll be maintaining this every time this issue arises:
SELECT
COUNT(DISTINCT CASE
WHEN A12.Company_Name = 'Name2' THEN 'Name1'
ELSE A12.Company_Name
END) AS Customer_Name_Count
,CASE
WHEN A12.Company_Name = 'Name2' THEN 'Name1'
ELSE A12.Company_Name
END AS Company_Name
,SUM(A12.Total_Sales) AS Total_Sales
FROM some_table AS A12
GROUP BY CASE
WHEN A12.Company_Name = 'Name2' THEN 'Name1'
ELSE A12.Company_Name
END
Your problem is that the joins you are using are multiplying the number of rows. Somewhere along the way, multiple names are associated with exactly the same entity (which is why the numbers are the same). You can fix this by aggregating by the right id:
SELECT COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
MAX(Company_Name) as Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM some_table AS A12
GROUP BY Company_id -- I'm guessing the column is something like this
ORDER BY 3 ASC, 2 ASC;
This might actually overstate the sales (I don't know). Better would be fixing the join so it only returned one name. One possibility is that it is a type-2 dimension, meaning that there is a time component for values that change over time. You may need to restrict the join to a single time period.
You need to have a function that returns a common name for the companies (dbo.GetCommonName here is a user-defined function you would have to write) and then use DISTINCT:
SELECT DISTINCT
Customer_Name_Count,
dbo.GetCommonName(Company_Name) as Company_Name,
Total_Sales
FROM dbo.theTable
You can try using the ROW_NUMBER window function to number the rows within each Customer_Name_Count and Total_Sales combination, then keep only the rows where rn = 1:
SELECT * FROM (
SELECT *,ROW_NUMBER() OVER(PARTITION BY Customer_Name_Count,Total_Sales ORDER BY Company_Name) rn
FROM (
SELECT
COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM
some_table AS A12
GROUP BY
Company_Name
)t1
)t2
WHERE rn = 1
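The ROW_NUMBER idea can be sketched as follows. To keep it self-contained, the aggregated result set from the question is stored directly in a hypothetical table named results in an in-memory SQLite database (window functions need SQLite 3.25 or later); the real query would apply the same outer layers to the aggregated subquery instead.

```python
import sqlite3

# Hypothetical table holding the aggregated rows shown in the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results ("
    "customer_name_count INTEGER, company_name TEXT, total_sales INTEGER)"
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        (3, "Blockbuster", 1000),
        (6, "Jimmy's Bar", 1500),
        (6, "Jimmy's Restaurant", 1500),
        (9, "Impala Hotel", 2000),
        (12, "Sports Drink", 2500),
    ],
)

# Number the rows inside each (count, sales) pair alphabetically by
# company name and keep only the first row of each pair.
deduped = conn.execute(
    """
    SELECT customer_name_count, company_name, total_sales
    FROM (
        SELECT r.*,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_name_count, total_sales
                   ORDER BY company_name
               ) AS rn
        FROM results AS r
    ) AS t
    WHERE rn = 1
    ORDER BY total_sales
    """
).fetchall()

for row in deduped:
    print(row)
```

Be aware that this collapses any two rows that share the same count and sales total, whether or not their company names are similar, so unrelated companies with coincidentally equal figures would also be merged.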

SQL: How to get the AVG(MIN(number))?

I am looking for the AVERAGE (overall) of the MINIMUM number (grouped by person).
My table looks like this:
Rank Name
1 Amy
2 Amy
3 Amy
2 Bart
1 Charlie
2 David
5 David
1 Ed
2 Frank
4 Frank
5 Frank
I want to know the AVERAGE of the lowest scores. For these people, the lowest scores are:
Rank Name
1 Amy
2 Bart
1 Charlie
2 David
1 Ed
2 Frank
Giving me a final answer of 1.5 - because three people have a MIN(Rank) of 1 and the other three have a MIN(Rank) of 2. That's what I'm looking for - a single number.
My real data has a couple hundred rows, so it's not terribly big. But I can't figure out how to do this in a single, simple statement. Thank you for any help.
Try this:
;WITH MinScores
AS
(
SELECT
"Rank",
Name,
ROW_NUMBER() OVER(PARTITION BY Name ORDER BY "Rank") row_num
FROM Table1
)
SELECT
CAST(SUM("Rank") AS DECIMAL(10, 2)) /
COUNT("Rank")
FROM MinScores
WHERE row_num = 1;
Selecting the set of minimum values is straightforward. The cast() is necessary to avoid integer division later. You could also avoid integer division by casting to float instead of decimal. (But you should be aware that floats are "useful approximations".)
select name, cast(min(rank) as decimal) as min_rank
from Table1
group by name
Now you can use the minimums as a common table expression, and select from it.
with minimums as (
select name, cast(min(rank) as decimal) as min_rank
from Table1
group by name
)
select avg(min_rank) avg_min_rank
from minimums
If you happen to need to do the same thing on a platform that doesn't support common table expressions, you can a) create a view of minimums, and select from that view, or b) use the minimums as a derived table.
You might try using a derived table to get the minimums, then get the average minimum in the outer query, as in:
-- Get the avg min rank as a decimal
select avg(MinRank * 1.0) as AvgRank
from (
-- Get everyone's min rank
select min([Rank]) as MinRank
from MyTable
group by Name
) as a
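The derived-table approach can be checked against the sample data from the question with a sketch like this one, which uses an in-memory SQLite database through Python's sqlite3 module (an assumption for convenience; the question is about a different engine).

```python
import sqlite3

# In-memory SQLite database with the rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE Table1 ("Rank" INTEGER, Name TEXT)')
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?)",
    [
        (1, "Amy"), (2, "Amy"), (3, "Amy"),
        (2, "Bart"),
        (1, "Charlie"),
        (2, "David"), (5, "David"),
        (1, "Ed"),
        (2, "Frank"), (4, "Frank"), (5, "Frank"),
    ],
)

# Inner query: each person's minimum rank. Outer query: average of those.
avg_min_rank = conn.execute(
    """
    SELECT AVG(min_rank)
    FROM (
        SELECT MIN("Rank") AS min_rank
        FROM Table1
        GROUP BY Name
    ) AS minimums
    """
).fetchone()[0]

print(avg_min_rank)  # 1.5
```

SQLite's AVG() returns a floating-point value even for integer input, so no cast is needed in this sketch. On SQL Server, AVG() over an integer column returns an integer, which is why the answers above cast to decimal or multiply by 1.0.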
I think the easiest one will be
for max
select name , max_rank = max(rank)
from table
group by name;
for average
select name , avg_rank = avg(rank)
from table
group by name;

Finding attribute of record being the MAX of something in GROUP BY statement?

This is a bit of a hard question to formulate (any edits are appreciated) but here goes. Let's say you have two tables: FRUIT and BASKET. Fruits (representing actual items, not fruit species) are grouped by baskets. Here is one example of FRUIT:
FRUIT_ID WEIGHT_IN_GRAMS BASKET_ID FRUIT_TYPE
-------- --------------- --------- ----------
1 100 1 Apple
2 200 1 Orange
3 150 1 Lemon
4 100 2 Apple
5 300 2 Plum
What I want is FRUIT_ID of the heaviest fruit in each basket. In other words:
FRUIT_ID BASKET_ID FRUIT_TYPE
-------- --------- ----------
2 1 Orange
5 2 Plum
Here is the SQL I have come up with to find this:
select fruit_id, basket_id, fruit_type
from fruit f
join (select basket_id, max(weight_in_grams) max_weight
from fruit
group by basket_id) t on f.basket_id = t.basket_id
and f.weight_in_grams = t.max_weight;
This could work except that WEIGHT_IN_GRAMS is not guaranteed to be unique. In the case where there were duplicates, I'd want the one which is alphabetically last.
Any takers?
PS: I know I could wrap my query above in yet another query, but this feels rather messy, so what I'm looking for is optimizing this if at all possible.
You can use dense_rank to assign numbers to each row. partition by creates groups that have their own sequences (in this case each basket). order by specifies the sorting, so order by weight first (descending), and name second.
The first row of each basket gets the number 1. Then, wrap the query to be able to select only the rows that got number 1. You'll need to do this in a subquery, because analytic functions cannot be used in the where clause, and neither in the having clause, I believe.
select
fruit_id, basket_id, fruit_type, weight_in_grams
from
(select
fruit_id, basket_id, fruit_type, weight_in_grams,
dense_rank() over (partition by basket_id order by weight_in_grams desc, fruit_type desc) as rank
from
fruit f)
where
rank = 1
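Here is a minimal sketch of the same ranking idea, run against an in-memory SQLite database via Python's sqlite3 module (SQLite 3.25 or later for window functions). Two things differ from the query above and are assumptions of this sketch: it uses ROW_NUMBER() instead of dense_rank(), which returns exactly one row per basket even if weight and fruit type both tie, and it adds one hypothetical row (fruit 6) to create a weight tie in basket 2.

```python
import sqlite3

# In-memory SQLite database with the FRUIT rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fruit ("
    "fruit_id INTEGER, weight_in_grams INTEGER, "
    "basket_id INTEGER, fruit_type TEXT)"
)
conn.executemany(
    "INSERT INTO fruit VALUES (?, ?, ?, ?)",
    [
        (1, 100, 1, "Apple"),
        (2, 200, 1, "Orange"),
        (3, 150, 1, "Lemon"),
        (4, 100, 2, "Apple"),
        (5, 300, 2, "Plum"),
        (6, 300, 2, "Apple"),  # hypothetical row: ties with Plum on weight
    ],
)

# Within each basket, order by weight (heaviest first) and break ties
# by fruit type, alphabetically last first; keep the top row.
heaviest = conn.execute(
    """
    SELECT fruit_id, basket_id, fruit_type
    FROM (
        SELECT fruit_id, basket_id, fruit_type,
               ROW_NUMBER() OVER (
                   PARTITION BY basket_id
                   ORDER BY weight_in_grams DESC, fruit_type DESC
               ) AS rn
        FROM fruit
    ) AS ranked
    WHERE rn = 1
    ORDER BY basket_id
    """
).fetchall()

print(heaviest)  # [(2, 1, 'Orange'), (5, 2, 'Plum')]
```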

How to get a proper count in sql server when retrieving a lot of fields?

Here is my scenario,
I have query that returns a lot of fields. One of the fields is called ID and I want to group by ID and show a count in descending order. However, since I am bringing back more fields, it becomes harder to show a true count because I have to group by those other fields. Here is an example of what I am trying to do. If I just have 2 fields (ID, color) and I group by color, I may end up with something like this:
ID COLOR COUNT
== ===== =====
2 red 10
3 blue 5
4 green 24
Lets say I add another field which is actually the same person, but they have a different spelling of their name which throws the count off, so I might have something like this:
ID COLOR NAME COUNT
== ===== ====== =====
2 Red Jim 5
2 Red Jimmy 5
3 Red Bob 3
3 Red Robert 2
4 Red Johnny 12
4 Red John 12
I want to be able to bring back ID, Color, Name, and Count, but display the counts like in the first table. Is there a way to do this using the ID?
If you want a single result set, you would have to omit the name, as in your first post
SELECT Id, Color, COUNT(*)
FROM YourTable
GROUP By Id, Color
Now, you could get your desired functionality with a subquery, although it is not elegant:
SELECT Id, Color, Name, (SELECT COUNT(*)
FROM YourTable
Where Id = O.Id
AND Color = O.Color
) AS "Count"
FROM YourTable O
GROUP BY Id, Color, Name
This should work as you desire
Try this:-
SELECT DISTINCT a.ID, a.Color, a.Name, b.[Count]
FROM yourTable a
INNER JOIN (
SELECT ID, Color, Count(1) [Count] FROM yourTable
GROUP BY ID, Color
) b ON a.ID = b.ID AND a.Color = b.Color
ORDER BY [Count] DESC
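The join-back pattern can be sketched like this, using an in-memory SQLite database via Python's sqlite3 module. The table name your_table, the alias cnt, and the underlying detail rows are hypothetical; they are chosen so that ID 3 has five rows split across two spellings of the same person's name.

```python
import sqlite3

# Hypothetical detail rows: the same ID appears under two name spellings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (id INTEGER, color TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?)",
    [(3, "Red", "Bob")] * 3
    + [(3, "Red", "Robert")] * 2
    + [(2, "Red", "Jim"), (2, "Red", "Jimmy")],
)

# Count per (id, color) in a subquery, then join the count back to the
# detail rows so each name carries the total for its ID.
rows = conn.execute(
    """
    SELECT DISTINCT a.id, a.color, a.name, b.cnt
    FROM your_table AS a
    INNER JOIN (
        SELECT id, color, COUNT(*) AS cnt
        FROM your_table
        GROUP BY id, color
    ) AS b ON a.id = b.id AND a.color = b.color
    ORDER BY b.cnt DESC, a.name
    """
).fetchall()

for row in rows:
    print(row)
```

Each spelling still gets its own output row, but the count column reflects the whole ID rather than the individual spelling.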
Try doing a sub query to get the count.
-- MarkusQ