SQL - making several groupings (performance) - sql

I have some SQL query that founds records based on provided parameters. That query is pretty heavy, so I want to execute it less as possible.
After I getting result from that query, I need to perform its breakdown.
For example, consider the following query:
SELECT location, department, industry
FROM data
WHERE ...
After that, I need to perform breakdown of that results, e.g. I need to provide list of all locations where from I have results and counts of each type, same for departments and same for industries.
As I know, in order to get breakdown by locations, I need to perform GROUP BY (location) and then count.
My question is: is it possible, for performance considerations, to perform several groupings/ counts on query result without recalculating it over and over again for each grouping?

Yes, this is possible. Unless I misunderstood you.
You need to use windowed functions. For instance:
SELECT location
, department
, industry
, COUNT(*) OVER(PARTITION BY location, department)
, COUNT(*) OVER(PARTITION BY location, department, industry)
FROM data
WHERE ...;
Keep in mind, that doing a COUNT(DISTINCT column) is not possible.

If I understand correctly, you can do what you want with grouping sets (documented here):
SELECT location, department, industry, count(*)
FROM data
WHERE ...
GROUP BY GROUPING SETS ((location), (department), (industry))
This will return rows like:
location1 NULL NULL 10
. . .
NULL dept1 NULL 17
. . .
If you want to get fancy, and you have no NULL values in any of the columns, you can do:
SELECT (case when location is not null then 'location'
when department is not null then 'department'
when industry is not null then 'industry'
end) as which,
coalesce(location, department, industry) as name, count(*)
FROM data
WHERE ...
GROUP BY GROUPING SETS ((location), (department), (industry))
ORDER BY which;
You can actually do the same thing using the GROUPING() function, if you do have NULL values in the columns, but you have to replace the coalesce() as well.

Related

Confused with the Group By function in SQL

Q1: After using the Group By function, why does it only output one row of each group at most? Does this mean that having is supposed to filter the group rather than filter the records in each group?
Q2: I want to find the records in each group whose ages are greater than the average age of that group. I tried the following, but it returns nothing. How should I fix this?
SELECT *, avg(age) FROM Mytable Group By country Having age > avg(age)
Thanks!!!!
You can calculate the average age for each country in a subquery and join that to your table for filtering:
SELECT mt.*, MtAvg.AvgAge
FROM Mytable mt
inner join
(
select mtavgs.country
, avg(mtavgs.age) as AvgAge
from Mytable mtavgs
group by mtavgs.country
) MTAvg
on mtavg.country=mt.country
and mt.Age > mtavg.AvgAge
GROUP BY returns always 1 row per unique combination of values in the GROUP BY columns listed (provided that they are not removed by a HAVING clause). The subquery in our example (alias: MTAvg) will calculate a single row per country. We will use its results for filtering the main table rows by applying the condition in the INNER JOIN clause; we will also report that average by including the calculated average age.
GROUP BY is a keyword that is called an aggregate function. Check this out here for further reading SQL Group By tutorial
What it does is it lumps all the results together into one row. In your example it would lump all the results with the same country together.
Not quite sure what exactly your query needs to be to solve your exact problem. I would however look into what are called window functions in SQL. I believe what you first need to do is write a window function to find the average age in each group. Then you can write a query to return the results you need
Depending on your dbms type and version, you may be able to use a "window function" that will calculate the average per country and with this approach it makes the calculation available on every row. Once that data is present as a "derived table" you can simply use a where clause to filter for the ages that are greater then the calculated average per country.
SELECT mt.*
FROM (
SELECT *
, avg(age) OVER(PARTITION BY country) AS AvgAge
FROM Mytable
) mt
WHERE mt.Age > mt.AvgAge

Use Common Table Expression to get difference between two sets of data

I'm trying to use a common table expression to find the differences between two queries I wrote. The first query returns how many patients belong to each ROOMID(each ID represent a specific room).
Second query I have is how many patients that belong to each ROOMId have surgery operated on them. PatientID represent each patient.
select roomID, count(distinct patientID) as totalinsurgery
from data with (nolock)
where ptprocess = 'surgery'
group by clientid, batchid
Second query:
select CAroomid, sum(patientsinroom) as patientsinroom
from data
group by caroomid
So the idea behind is try to get the 'difference' in result of the two query. So how many patients in the room went to surgery. What is the best way to use common table expression to get the result?
So how many patients in the room went to surgery.
I suspect you just want conditional aggregation:
select roomId,
count(distinct case when ptprocess = 'surgery' then patientID end) as num_surgery
count(distinct patientID) as total
from data
group by roomId;
Note: I have no idea why you are using count(distinct). Can a patient really occur more than one time in a room?

Beginning SQL group by and AVG

I am trying to pull information from two columns titled clientstate and clientrevenue in my table. I want clientstate to show up as the state, and have only distinct names in it, and under client revenue I want the average revenue per state, and that will only show up if there are at least two clients from that state. I am very new at this, so what I have is pretty iffy:
SELECT clientstate, clientrevenue
FROM client
GROUP BY clientrevenue
HAVING COUNT (*) >=2;
Where am I going wrong here?
SELECT clientstate AS [State]
, AVG(clientrevenue) AS [Average Revenue]
FROM client
GROUP BY clientstate
Grouping by ClientRevenue will try to group similar values and that doesn't have a logical sense.
First, in order to get distinct states, clientstate column needs to be used in the GROUP BY statement.
Thus, the code would be :
SELECT clientstate, AVG(clientrevenue)
FROM Source_Table
GROUP BY clientstate --this would get you distinct states
Now, considering the 2 clients per state, it's rather a condition than a HAVING statement. HAVING statement limits your query results according to the aggregate function you are using. For instance, in the code aforementioned, the aggregate function is AVG(clientrevenue). So, we can only use it in HAVING. we can not add count(*) unless it was used in SELECT.
So, you need to add it as a condition like
SELECT clientstate, AVG(clientrevenue)
FROM Source_Table A
WHERE (SELECT count(DISTINCT client_ID) FROM Source_Table B
WHERE A.clientstate = B.clientstate) >= 2 --Condition
GROUP BY clientstate --this would get you distinct states

How does GROUP BY use COUNT(*)

I have this query which finds the number of properties handled by each staff member along with their branch number:
SELECT s.branchNo, s.staffNo, COUNT(*) AS myCount
FROM Staff s, PropertyForRent p
WHERE s.staffNo=p.staffNo
GROUP BY s.branchNo, s.staffNo
The two relations are:
Staff{staffNo, fName, lName, position, sex, DOB, salary, branchNO}
PropertyToRent{propertyNo, street, city, postcode, type, rooms, rent, ownerNo, staffNo, branchNo}
How does SQL know what COUNT(*) is referring to? Why does it count the number of properties and not (say for example), the number of staff per branch?
This is a bit long for a comment.
COUNT(*) is counting the number of rows in each group. It is not specifically counting any particular column. Instead, what is happening is that the join is producing multiple properties, because the properties are what cause multiple rows for given values of s.branchNo and s.staffNo.
It gets even a little more "confusing" if you include a column name. The following would all typically return the same value:
COUNT(*)
COUNT(s.branchNo)
COUNT(s.staffNo)
COUNT(p.propertyNo)
With a column name, COUNT() determines the number of rows that do not have a NULL value in the column.
And finally, you should learn to use proper, explicit join syntax in your queries. Put join conditions in the on clause, not the where clause:
SELECT s.branchNo, s.staffNo, COUNT(*) AS myCount
FROM Staff s JOIN
PropertyForRent p
ON s.staffNo = p.staffNO
GROUP BY s.branchNo, s.staffNo;
GROUP BY clauses partition your result set. These partitions are all the sql engine needs to know - it simply counts their sizes.
Try your query with only count(*) in the select part.
In particular, COUNT(*) does not produce the number of distinct rows/columns in your result set!
Some people might think that count(*) really count all the columns, however the sql optimizer is smarter than that.
COUNT(*) returns the number of rows in a specified table without getting rid of duplicates. Which mean that you can't use Distinct with count(*)
Count(*) will return the cardinality (elements in table) of the specified mapping.
What you have to remember is that when using count over a specific column, null won't be allowed while count(*) will allow null in the rows as it could be any field.
How does SQL know what COUNT(*) is referring to?
I'm pretty sure, however not 100% sure as I can't find in doc, that the sql optimizer simply do a count on the primary key (not null) instead of trying to handle null in rows.

I'm not sure what is the purpose of "group by" here

I'm struggling to understand what this query is doing:
SELECT branch_name, count(distinct customer_name)
FROM depositor, account
WHERE depositor.account_number = account.account_number
GROUP BY branch_name
What's the need of GROUP BY?
You must use GROUP BY in order to use an aggregate function like COUNT in this manner (using an aggregate function to aggregate data corresponding to one or more values within the table).
The query essentially selects distinct branch_names using that column as the grouping column, then within the group it counts the distinct customer_names.
You couldn't use COUNT to get the number of distinct customer_names per branch_name without the GROUP BY clause (at least not with a simple query specification - you can use other means, joins, subqueries etc...).
It's giving you the total distinct customers for each branch; GROUP BY is used for grouping COUNT function.
It could be written also as:
SELECT branch_name, count(distinct customer_name)
FROM depositor INNER JOIN account
ON depositor.account_number = account.account_number
GROUP BY branch_name
Let's take a step away from SQL for a moment at look at the relational trainging language Tutorial D.
Because the two relations (tables) are joined on the common attribute (column) name account_number, we can use a natural join:
depositor JOIN account
(Because the result is a relation, which by definition has only distinct tuples (rows), we don't need a DISTINCT keyword.)
Now we just need to aggregate using SUMMARIZE..BY:
SUMMARIZE (depositor JOIN account)
BY { branch_name }
ADD ( COUNT ( customer_name ) AS customer_tally )
Back in SQLland, the GROUP BY branch_name is doing the same as SUMMARIZE..BY { branch_name }. Because SQL has a very rigid structure, the branch_name column must be repeated in the SELECT clause.
If you want to COUNT something (see SELECT-Part of the statement), you have to use GROUP BY in order to tell the query what to aggregate. The GROUP BY statement is used in conjunction with the aggregate functions to group the result-set by one or more columns.
Neglecting it will lead to SQL errors in most RDBMS, or senseless results in others.
Useful link:
http://www.w3schools.com/sql/sql_groupby.asp