Keeping Unique Rows with Group By Cube

Suppose I have data that includes the SSN of a student, the college campus they attended, and their wages for a given year. Like so...
create table #thetable (SSN int, campus int, wage int);
insert into #thetable(SSN, campus, wage)
values
(111111111,1,100),
(111111111,2,100),
(222222222,1,250),
(222222222,2,250),
(333333333,1,50),
(444444444,2,400);
Now, I want to get the average wage of the students at each campus, and the average wage of students from all campuses put together... So I do something like this:
select campus, avg(wage)
from #thetable
group by cube(campus);
The problem is that I don't want to double-count the students who attended two campuses when I'm grouping the campuses together. This is the output I'm getting (it double-counts students 111111111 and 222222222):
Campus (no column name)
1 133
2 250
NULL 191
My desired output is this (no double counting):
Campus (no column name)
1 133
2 250
NULL 200
Can this be accomplished without using multiple queries and the UNION operator? If so, how? (Incidentally, I realize that this table is not normalized... would normalizing help?)

You can't do this with a single aggregate column. The cube rolls up the values from whatever rows fall into each group, so a row that contributes to a campus group also contributes to the grand total.
You can do this, though, by weighting each row by 1 divided by the number of rows for that student. This "divides" a student equally across their campuses, so that each student's weights add up to 1:
select campus, avg(wage) as avg_wage, sum(wage*weight) / sum(weight) avg_wage_weighted
from (select t.*, (1.0 / count(*) over (partition by SSN)) as weight
from #thetable t
) t
group by cube(campus);
The avg_wage_weighted column on the total row (where campus is NULL) is the value you want. You can then wrap this in another subquery to get a single column:
select campus, (case when campus is null then avg_wage_weighted else avg_wage end)
from (select campus, avg(wage) as avg_wage, sum(wage*weight) / sum(weight) avg_wage_weighted
from (select t.*, (1.0 / count(*) over (partition by SSN)) as weight
from #thetable t
) t
group by cube(campus)
) t
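The arithmetic behind the weighting can be checked by hand. The following is a minimal sketch in plain Python rather than SQL, using the six sample rows from the question; the names rows, weights, weighted_avg and plain_avg are made up for illustration.

```python
from collections import Counter

# Sample rows from the question: (SSN, campus, wage)
rows = [
    (111111111, 1, 100),
    (111111111, 2, 100),
    (222222222, 1, 250),
    (222222222, 2, 250),
    (333333333, 1, 50),
    (444444444, 2, 400),
]

# weight = 1 / (number of rows for that SSN), mirroring
# 1.0 / count(*) over (partition by SSN)
rows_per_ssn = Counter(ssn for ssn, _, _ in rows)
weights = [1.0 / rows_per_ssn[ssn] for ssn, _, _ in rows]

# sum(wage * weight) / sum(weight) over all rows (the NULL row of the cube)
weighted_avg = sum(w * wage for w, (_, _, wage) in zip(weights, rows)) / sum(weights)

# Plain average over all six rows, which counts two students twice
plain_avg = sum(wage for _, _, wage in rows) / len(rows)

print(weighted_avg)  # 200.0
print(plain_avg)     # a little under 192
```

Within a single campus the weighted figure is not the plain campus average (for campus 1 it is 225 / 2 = 112.5 rather than 133), which is why the outer query takes avg_wage for the campus rows and avg_wage_weighted only for the total row.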

Figured it out with a correlated sub-query. Works for me.
select campus,
(
select avg(wage)
from
(
select ssn, campus, wage, row_number() over(partition by SSN order by wage) as RN
from #thetable as inside
where (inside.campus=outside.campus or outside.campus is null)
) as middle
where RN=1
)
from #thetable outside
group by cube(campus);
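As a cross-check of what this query computes, here is a small Python sketch (not SQL) that applies the same idea to the sample rows: for each group, keep one row per SSN and average what is left. The helper avg_unique is a hypothetical name, and like the query it assumes a student's wage is the same on every campus row.

```python
# Sample rows from the question: (SSN, campus, wage)
rows = [
    (111111111, 1, 100),
    (111111111, 2, 100),
    (222222222, 1, 250),
    (222222222, 2, 250),
    (333333333, 1, 50),
    (444444444, 2, 400),
]


def avg_unique(campus=None):
    """Average wage with one row per SSN; campus=None means all campuses."""
    first_wage = {}
    # Sorting by wage mirrors "order by wage" inside row_number()
    for ssn, row_campus, wage in sorted(rows, key=lambda r: r[2]):
        if campus is None or row_campus == campus:
            # setdefault keeps only the first row seen per SSN (RN = 1)
            first_wage.setdefault(ssn, wage)
    return sum(first_wage.values()) / len(first_wage)


# One entry per cube group: campus 1, campus 2, and the grand total (None)
results = {campus: avg_unique(campus) for campus in (1, 2, None)}
print(results)
```

Python keeps the fractional part, so campus 1 comes out as roughly 133.33 where SQL Server's avg over an int column shows 133.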

Related

How to create an additional column with the percentages related to a count distinct statement

I'm trying to query each distinct medical speciality (e.g. oncologist, pediatrician, etc.) in a table and then count the number of times a claim (claim_id) is linked to it, which I've done using this:
select distinct specialization, count(distinct claim_id) AS Claim_Totals
from table1
group by specialization
order by Claim_Totals DESC
However, I also want to include an additional column which lists the % that each speciality makes up in the table (based on the number of claim_id related to it). So for instance, if there were 100 total claims and "cardiologist" had 25 claim_id records related to it, "oncologist" had 15, "general surgeon" had 10, and so forth, I want the output to look like this:
specialization | Claims_Totals | PERCENTAGE
___________________________________________
cardiologist 25 25%
oncologist 15 15%
general surgeon 10 10%

You could do this. I'm not familiar with Barbaros's syntax; if that works, it's more concise and better.
select specialization, count(distinct claim_id) AS Claim_Totals, 100 * count(distinct claim_id) / total_claims AS percentage
from table1
INNER JOIN ( SELECT COUNT(DISTINCT claim_id)*1.0000 AS total_claims
FROM table1 ) TMP
ON 1 = 1
group by specialization, total_claims
order by Claim_Totals DESC
Or, with a scalar subquery instead of the join:
select specialization,
count(distinct claim_id) AS claim_by_spec,
100 * count(distinct claim_id)/
( SELECT COUNT(DISTINCT claim_id)*1.0000
FROM table1 ) AS percentage_calc
from table1
group by specialization
order by claim_by_spec DESC

You can use sum(count(distinct)) over() to get the overall claims and use it in the denominator to get the percentage.
select specialization
,count(distinct claim_id) AS Claim_Totals
,round(100.0*count(distinct claim_id)/sum(count(distinct claim_id)) over(),3) as percentage
from table1
group by specialization

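To show the percent sign, you can add either
,concat_ws('',count(distinct claim_id),'%') as percentage
or
,concat(count(distinct claim_id),'%') as percentage
to the end of the select list. Note that this appends '%' to the raw count, which equals the percentage only when the total is exactly 100 claims, as in the example.
By the way, the distinct before specialization in the select list is redundant, since specialization is already in the group by list.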

Because you are using count(distinct), window functions are less useful. You can try:
select t1.specialization,
count(distinct t1.claim_id) AS Claim_Totals,
count(distinct t1.claim_id) * 100.0 / tt1.num_claims AS percentage
from table1 t1 cross join
(select count(distinct claim_id) as num_claims
from table1
) tt1
group by t1.specialization, tt1.num_claims
order by Claim_Totals DESC
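The cross join approach can be tried end to end with Python's built-in sqlite3 module. This is only a sketch: the five rows are invented for illustration, SQLite stands in for whatever database the question uses, and the 100.0 factor is there because SQLite, like SQL Server, truncates integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (claim_id INTEGER, specialization TEXT)")
# Hypothetical data: 4 distinct claims, one of them stored twice
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        (1, "cardiologist"),
        (2, "cardiologist"),
        (2, "cardiologist"),
        (3, "oncologist"),
        (4, "general surgeon"),
    ],
)

query = """
SELECT t1.specialization,
       COUNT(DISTINCT t1.claim_id) AS claim_totals,
       100.0 * COUNT(DISTINCT t1.claim_id) / tt1.num_claims AS percentage
FROM table1 t1
CROSS JOIN (SELECT COUNT(DISTINCT claim_id) AS num_claims FROM table1) tt1
GROUP BY t1.specialization, tt1.num_claims
ORDER BY claim_totals DESC, t1.specialization
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The duplicated claim is counted once, so cardiologist gets 2 of 4 claims (50.0) and the other two specializations get 25.0 each.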

Convert table into grouped statistics of same table

In MS-SQL, I have a table hasStudied(sid, ccode, grade) (student id, course code, grade) which keeps track of the past courses a student has studied and the grade they've gotten.
As output of my query, I want to return a list of courses, with the percentage of passing (= not 'F') students in the column next to it, in descending order by that percentage, like this:
C1 : 85
C3 : 70
C2 : 67
etc.
So far I have managed to produce two separate tables: one with the course code and the number of people passing the course, and one with the course code and the number of people who have taken the course.
Each comes from a relatively simple statement, but combining them requires a lot of inefficient calculation in Java.
Is there any way to do this in a single query?

Assuming you do not have two entries for the same student under one course, this should do it:
SELECT
ccode,
ROUND((CAST(passed AS numeric(15,2)) / CAST(taken_course AS numeric(15,2))) * 100, 0) AS percentage_passed
FROM(
SELECT
ccode,
sum(CASE WHEN grade <> 'F' THEN 1 ELSE 0 END) AS passed,
count(1) AS taken_course
FROM
hasStudied
GROUP BY ccode
) foo
-- highest pass percentage first
ORDER BY percentage_passed DESC
I think you are looking for a CTE:
create table #temp(StId int, ccode varchar(5), grade varchar(1))
insert into #temp Values (1,'A1','A'),(1,'A1','F'),(2,'A2','B'),(3,'A2','F'),(4,'A2','F'),(4,'A3','F'),(5,'A3','F')
;with cte as (
select ccode
from #temp
group by ccode
)
select cte.ccode,ratioOfPass = cast(sum(case when t.grade <> 'F' then 1.0 else 0.0 end) as float) / count(*)
from cte
inner join #temp t on t.ccode = cte.ccode
group by cte.ccode
When calculating the ratio, use sum with case-when, and make sure the division is not integer division (here the sum is cast to float).
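The sum with case-when pattern can be tried on the sample rows above with Python's built-in sqlite3 module. This is a sketch only: SQLite stands in for SQL Server, the table is named hasStudied as in the question, and the percentage rounding and descending sort are added to match the output the question asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hasStudied (sid INTEGER, ccode TEXT, grade TEXT)")
# Same sample rows as the #temp table above
conn.executemany(
    "INSERT INTO hasStudied VALUES (?, ?, ?)",
    [
        (1, "A1", "A"), (1, "A1", "F"),
        (2, "A2", "B"), (3, "A2", "F"), (4, "A2", "F"),
        (4, "A3", "F"), (5, "A3", "F"),
    ],
)

# 100.0 forces non-integer division before rounding to whole percent
query = """
SELECT ccode,
       ROUND(100.0 * SUM(CASE WHEN grade <> 'F' THEN 1 ELSE 0 END) / COUNT(*), 0)
           AS percentage_passed
FROM hasStudied
GROUP BY ccode
ORDER BY percentage_passed DESC
"""
result = conn.execute(query).fetchall()
for ccode, pct in result:
    print(ccode, ":", pct)
```

With this data A1 passes 1 of 2 rows, A2 passes 1 of 3, and A3 passes none.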

Get top ranked values across multiple fields

Imagine a table that has three fields, a unique user ID and two non-unique traits (eg age, sex, etc): user / traitA / traitB
In this table I want to pull the most frequent value for each trait in a single query. If our traits were School Year / Major then a result could be: Junior / Biology. Note, this does NOT mean Juniors in Biology are the most common combination, just that each value itself is most common in its trait.
This is obviously possible by running a separate query per trait, grouping by a single field and keeping the top-ranked value. But my actual problem has more fields, and running that many separate queries is expensive.

Selecting single most common trait:
SELECT age
FROM table_name
GROUP BY age
ORDER BY COUNT(*) DESC
LIMIT 1
To select the most common values from multiple columns, this query worked in PostgreSQL:
SELECT DISTINCT
FIRST_VALUE(age) OVER (ORDER BY count1 DESC) AS top1,
FIRST_VALUE(sex) OVER (ORDER BY count2 DESC) AS top2
FROM (
SELECT age,
sex,
COUNT(age) OVER (PARTITION BY age) AS count1,
COUNT(sex) OVER (PARTITION BY sex) AS count2
FROM some_table
) some_table
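To make the "most common per trait, not most common combination" point concrete, here is a sketch using Python's built-in sqlite3 module. It uses a different technique from the FIRST_VALUE query above: one scalar subquery per trait, each built from the single-trait query, so it still returns one row but reads the table once per trait. The table students and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (user_id INTEGER, school_year TEXT, major TEXT)")
# Hypothetical data: Junior and Biology each appear 3 times,
# but the most common combination is Senior / Biology
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [
        (1, "Junior", "Biology"),
        (2, "Junior", "Physics"),
        (3, "Junior", "Chemistry"),
        (4, "Senior", "Biology"),
        (5, "Senior", "Biology"),
    ],
)

query = """
SELECT
    (SELECT school_year FROM students
     GROUP BY school_year ORDER BY COUNT(*) DESC LIMIT 1) AS top_year,
    (SELECT major FROM students
     GROUP BY major ORDER BY COUNT(*) DESC LIMIT 1) AS top_major
"""
top_year, top_major = conn.execute(query).fetchone()
print(top_year, "/", top_major)
```

The result pairs Junior with Biology even though only one student is a Junior in Biology, which is exactly the behaviour the question describes.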

Avg Sql Query Always Returns int

I have one column for Farmer Names and one column for Town Names in my table TRY.
I want to find Average_Number_Of_Farmers_In_Each_Town.
Select TownName ,AVG(num)
FROM(Select TownName,Count(*) as num From try Group by TownName) a
group by TownName;
But this query always returns int values. How can I get float values instead?

;WITH [TRY]([Farmer Name], [Town Name])
AS
(
SELECT N'Johny', N'Bucharest' UNION ALL
SELECT N'Miky', N'Bucharest' UNION ALL
SELECT N'Kinky', N'Ploiesti'
)
SELECT AVG(src.Cnt) AS Average
FROM
(
SELECT COUNT(*)*1.00 AS Cnt
FROM [TRY]
GROUP BY [TRY].[Town Name]
) src
Results:
Average
--------
1.500000
Without the * 1.00, the result would be 1: AVG over the integers 2 and 1 is truncated to the integer 1 (see the Return Types section of the AVG documentation).

Your query always returns an int because the average is not doing anything. Both the inner and the outer queries group by town name, so each average is taken over a single value, and that value is the count.
If you are looking for the overall average, then something like:
Select AVG(cast(cnt as float))
FROM (Select TownName, Count(*) as cnt
From try
Group by TownName
) t
You can also do this without the subquery as:
select cast(count(*) as float) /count(distinct TownName)
from try;
EDIT:
The assumption was that each farmer in the town has one row in try. Are you just trying to count the number of distinct farmers in each town? Assuming you have a field like FarmerName that identifies a given farmer, that would be:
select TownName, count(distinct FarmerName)
from try
group by TownName;
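The difference between the truncated and the float average is plain arithmetic, sketched below in Python with the three sample farmers from the first answer. The variable names are illustrative, and Python's // operator is used here as a stand-in for the integer average SQL Server produces over INT values.

```python
# Sample rows from the first answer: (farmer name, town name)
farmers = [
    ("Johny", "Bucharest"),
    ("Miky", "Bucharest"),
    ("Kinky", "Ploiesti"),
]

# Farmers per town, like COUNT(*) ... GROUP BY TownName
counts = {}
for _, town in farmers:
    counts[town] = counts.get(town, 0) + 1

# Integer division stands in for AVG over INT values
truncated = sum(counts.values()) // len(counts)

# True division stands in for AVG over COUNT(*) * 1.00 or a float cast
exact = sum(counts.values()) / len(counts)

# Same value without the per-town step:
# cast(count(*) as float) / count(distinct TownName)
ratio = len(farmers) / len({town for _, town in farmers})

print(truncated, exact, ratio)  # 1 1.5 1.5
```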

sql query finding most often level appear

I have a table Student in SQL Server with these columns:
[ID], [Age], [Level]
I want a query that returns each age value that appears in Student and finds the level value that appears most often for that age. For example, if there are more 'a' level students aged 18 than 'b' or 'c', it should print the pair (18, a).
I am new to SQL Server and I want a simple answer with nested query.

You can do this using window functions:
select t.*
from (select age, level, count(*) as cnt,
row_number() over (partition by age order by count(*) desc) as seqnum
from student s
group by age, level
) t
where seqnum = 1;
The inner query aggregates the data to count the number of students for each age and level. The row_number() enumerates these rows within each age (the partition by), with the largest count first. The where clause then keeps the row with the highest count for each age.
In the case of ties, this returns just one of the values. If you want all of them, use rank() instead of row_number().

One more option uses the ROW_NUMBER ranking function in the ORDER BY clause. WITH TIES returns every row that ties with the last row of the limited result set; here that is every row whose ROW_NUMBER is 1, which gives one row per age.
SELECT TOP 1 WITH TIES age, level
FROM dbo.Student
GROUP BY age, level
ORDER BY ROW_NUMBER() OVER(PARTITION BY age ORDER BY COUNT(*) DESC)
Or a second version of the query, which compares the count for each age and level pair with the maximum such count per age:
SELECT *
FROM (
SELECT age, level, COUNT(*) AS cnt,
MAX(COUNT(*)) OVER(PARTITION BY age) AS mCnt
FROM dbo.Student
GROUP BY age, level
)x
WHERE x.cnt = x.mCnt

Another option, which requires a later version of SQL Server:
;WITH x AS
(
SELECT age,
level,
occurrences = COUNT(*)
FROM Student
GROUP BY age,
level
)
SELECT *
FROM x x
WHERE EXISTS (
SELECT *
FROM x y
WHERE y.age = x.age
AND x.occurrences > y.occurrences
)
I realise it doesn't quite answer the question, as it only returns the age/level combinations where there is more than one level for the age.
Maybe someone can help amend it so that it also includes the single-level ages in the result set: http://sqlfiddle.com/#!3/d597b/9

with combinations as (
select age, level, count(*) occurrences
from Student
group by age, level
)
select age, level
from combinations c
where occurrences = (select max(occurrences)
from combinations
where age = c.age)
This finds every age and level combination in the Student table and counts the number of occurrences of each.
Then, for each age, it keeps the combination whose occurrence count is the highest for that age, and returns the age and level for that row.
This has the advantage of not being tied to SQL Server - it's vanilla SQL. However, a window function like Gordon pointed out may perform better on SQL Server.
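Since this version is plain SQL, it can be tried outside SQL Server. The sketch below runs it in SQLite through Python's built-in sqlite3 module; the six Student rows are invented for illustration, and an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (ID INTEGER, Age INTEGER, Level TEXT)")
# Hypothetical data: age 18 favours 'a', age 19 has a single level,
# age 20 is a tie between 'b' and 'c'
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [(1, 18, "a"), (2, 18, "a"), (3, 18, "b"),
     (4, 19, "c"),
     (5, 20, "b"), (6, 20, "c")],
)

query = """
WITH combinations AS (
    SELECT age, level, COUNT(*) AS occurrences
    FROM Student
    GROUP BY age, level
)
SELECT age, level
FROM combinations c
WHERE occurrences = (SELECT MAX(occurrences)
                     FROM combinations
                     WHERE age = c.age)
ORDER BY age, level
"""
result = conn.execute(query).fetchall()
print(result)
```

Ages with a single level are included, and a tie returns every tied level, which matches the rank() variant rather than the row_number() one.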
