SQL GROUP BY When Many Tables and Fields are Queried

I've never understood the GROUP BY clause because most of the examples that Google turns up are very simple. I have a real-life example of 982 columns pulling from 6 tables and need to select MAX(iVersion), which requires GROUP BY. A duplicate record identifier can be entered into the SQL database multiple times when it has a newer iVersion number. I need to get the most recent record version, but SSMS keeps screaming at me with the usual
field is not part of an aggregate function.
This is my updated code:
SELECT TOP (100) Certification.dCertifiedDate, ChildGeneral.dChildsDateOfBirth, ChildGeneral.cChildsFirstName, ChildGeneral.cChildsLastName,Father.cFathersFirstName, Father.cFathersLastName,
Mother.cMothersFirstName, Mother.cMothersLastName, ChildGeneral.cChildsID, MAX(ChildGeneral.iVersionID) AS iVersionID, RecordTypes.cRecordCode, ChildGeneralFlag.cStateFileNumber
FROM ChildGeneral
INNER JOIN Father ON ChildGeneral.cChildsID = Father.cChildsID
AND ChildGeneral.iVersionID = Father.iVersionID
INNER JOIN Mother ON ChildGeneral.cChildsID = Mother.cChildsID
AND ChildGeneral.iVersionID = Mother.iVersionID
INNER JOIN ChildGeneralFlag ON ChildGeneral.cChildsID = ChildGeneralFlag.cChildsID
AND ChildGeneral.iVersionID = ChildGeneralFlag.iVersionID
INNER JOIN RecordTypes ON ChildGeneral.cRecordType = RecordTypes.cListItemID
INNER JOIN Certification ON ChildGeneral.cChildsID = Certification.cChildsID
WHERE CAST(CONVERT(VARCHAR, ChildGeneral.dChildsDateOfBirth, 101) AS DATE) >= CAST('01/01/1971' AS DATE)
AND CAST(CONVERT(VARCHAR, ChildGeneral.dChildsDateOfBirth, 101) AS DATE) <= CAST('12/31/2010' AS DATE)
GROUP BY ChildGeneral.iVersionID, Certification.dCertifiedDate, ChildGeneral.dChildsDateOfBirth, cChildsTimeOfBirth,ChildGeneral.cChildsFirstName,
ChildGeneral.cChildsLastName, Father.cFathersFirstName, Father.cFathersLastName, Mother.cMothersFirstName,
Mother.cMothersLastName, ChildGeneral.cChildsID, RecordTypes.cRecordCode, ChildGeneralFlag.cStateFileNumber
ORDER BY ChildGeneralFlag.cStateFileNumber
There should only be one record for each ChildGeneralFlag.cStateFileNumber, the one with the MAX(ChildGeneral.iVersionID), which could be anywhere from 1-99.

So I chopped it down to 12 columns from the 6 tables, and I get this error until I have added every last column to the GROUP BY: Msg 8120, Level 16, State 1, Line 33 Column 'RecordTypes.cRecordCode' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. Then I lose the MAX effect and it gives me all the records.
Yep, that's how it goes. It's logical when you think about it:
Suppose you have rows (it doesn't matter how many tables they come from or how many columns there are; here I present just two):
City, Age
New York, 23
New York, 24
New York, 25
Chicago, 22
You want the max age by city, so you group by city.
SQL looks at all the unique values of City, and in this case essentially sets out two buckets, one labelled "New York" and the other labelled "Chicago". Every row's age goes into one bucket or the other. Then it looks through each bucket, finding the max age in each one. You get 25 for New York and 22 for Chicago.
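To make the bucket picture concrete, here is a minimal sketch that runs the two-column example in an in-memory SQLite database from Python. The table name person and the choice of SQLite are assumptions for illustration only.

```python
import sqlite3

# Hypothetical table "person" holding the four example rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (city TEXT, age INTEGER);
INSERT INTO person VALUES
  ('New York', 23), ('New York', 24), ('New York', 25), ('Chicago', 22);
""")

# One bucket per distinct city; MAX(age) is evaluated inside each bucket.
max_age_by_city = conn.execute(
    "SELECT city, MAX(age) AS max_age FROM person GROUP BY city ORDER BY city"
).fetchall()
print(max_age_by_city)
```

Each distinct city comes back exactly once, paired with the highest age found in its bucket.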
Suppose we add another level
City, District, Age
New York, Queens, 23
New York, Queens, 24
New York, Bronx, 25
Chicago, Central, 22
You can keep your GROUP BY as is, but you can't select the district too unless you group it or max it.
If you group it, SQL sets out 3 buckets this time: "New York/Bronx", "New York/Queens" and "Chicago/Central". The Queens bucket has 2 ages thrown into it; the max age in Bronx is 25, in Queens it is 24. Ultimately you get 3 rows out of your query because of the 3 unique values of city+district.
If you max the district and the age and keep the GROUP BY only on city, you get 2 rows out, but the max of district is Queens (alphabetically "greater" than Bronx) and you get 25, which truly is the max age, but there was never originally a New York/Queens/25 row. Bucket values are allowed to mix up in aggregations. By stating MAX(age) and MAX(district), SQL just pulls those values out of their rows, throws them all in the bucket and then finds the highest. Values in buckets retain their individual identity (whether they're an age value or a district value), but they lose all association with the other values on the row they originally came from.
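Both variants can be tried side by side. This is a sketch under the same assumptions as the example data (a hypothetical person table, in-memory SQLite driven from Python):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (city TEXT, district TEXT, age INTEGER);
INSERT INTO person VALUES
  ('New York', 'Queens', 23), ('New York', 'Queens', 24),
  ('New York', 'Bronx', 25), ('Chicago', 'Central', 22);
""")

# Grouping by city AND district: three buckets, three output rows.
grouped = conn.execute(
    "SELECT city, district, MAX(age) FROM person "
    "GROUP BY city, district ORDER BY city, district"
).fetchall()

# Grouping by city only and taking MAX of both columns: the two maxima
# are computed independently, so they need not come from the same row.
mixed = conn.execute(
    "SELECT city, MAX(district), MAX(age) FROM person "
    "GROUP BY city ORDER BY city"
).fetchall()

print(grouped)
print(mixed)
```

The second result contains New York/Queens/25, a combination that does not exist as a row in the table.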
There isn't a concept of "group by New York, give me the max age, and also give me the district that went with it", because that might force the db to make a choice it isn't empowered to make. Suppose we had this:
City, District, Age
New York, Queens, 23
New York, Queens, 25
New York, Bronx, 25
Chicago, Central, 22
Max is 25, but there are two rows with this max. Should the DB return both? Should it pick one to throw away, because you've asked for GROUP BY city, which means city should be unique in the output?
"Both rows please," you might say. "Just pick one to discard," I might say.
The db won't choose, so instead you'll have to be explicit:
SELECT *
FROM Person p
INNER JOIN (select city, max(age) maxage from person group by city) m
ON p.city = m.city AND p.age = m.maxage
Here we explicitly say "do the group, join it back, thus give me both rows" - the group by query becomes an elaborate 2 column where clause, that filters to only rows having both "New York and 25", or "Chicago and 22". You get both your New York rows.
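Here is that join-back run against the tied data, as a sketch in an in-memory SQLite database from Python (the person table is the hypothetical one used throughout):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (city TEXT, district TEXT, age INTEGER);
INSERT INTO person VALUES
  ('New York', 'Queens', 23), ('New York', 'Queens', 25),
  ('New York', 'Bronx', 25), ('Chicago', 'Central', 22);
""")

# The grouped subquery yields one (city, maxage) pair per city; joining it
# back keeps every original row that matches a pair, ties included.
top_rows = conn.execute("""
    SELECT p.city, p.district, p.age
    FROM person p
    INNER JOIN (SELECT city, MAX(age) AS maxage FROM person GROUP BY city) m
      ON p.city = m.city AND p.age = m.maxage
    ORDER BY p.city, p.district
""").fetchall()
print(top_rows)
```

New York appears twice because both of its 25-year-old rows satisfy the join condition.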
There isn't a specified way of saying "pick one to discard". Instead we find some way of being explicit to break the tie: you can say "I want the city, the max age, the associated district, and if there are ties I want the first one when districts are sorted alphabetically".
And if you end up with a situation where there are ties in district, you have to add another level of sorting (and if that ties, you keep going until there are no ties or you don't care any more).
Often that query looks like this:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY city ORDER BY age DESC, district) rn
FROM Person
) x WHERE rn = 1
ROW_NUMBER essentially does the same job as the query above, where we grouped by city and joined it back in: PARTITION BY is like "group by with auto join back to the current row based on the partitioned values". In essence this establishes a column with an incrementing counter that starts from 1 per city and counts up in the given order. Because we asked for age descending, with a tie breaker on district ascending, the 25,Bronx row gets 1 and the 25,Queens row gets 2. Because we then filter for only rows with 1, we get rid of the Queens row.
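The numbering approach can be checked the same way. This sketch assumes a hypothetical person table and an SQLite build with window functions (3.25 or later), driven from Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (city TEXT, district TEXT, age INTEGER);
INSERT INTO person VALUES
  ('New York', 'Queens', 23), ('New York', 'Queens', 25),
  ('New York', 'Bronx', 25), ('Chicago', 'Central', 22);
""")

# Number the rows inside each city: oldest first, district breaks ties.
# Keeping rn = 1 leaves exactly one complete row per city.
first_per_city = conn.execute("""
    SELECT city, district, age
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY city ORDER BY age DESC, district
        ) AS rn
        FROM person
    ) AS numbered
    WHERE rn = 1
    ORDER BY city
""").fetchall()
print(first_per_city)
```

Unlike the MAX(district) variant, the district returned here really did sit on the same row as the max age.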
If I want a "max this group by that, and also all the other data from that row", I either need to group/max just what I want and join it back to get the detail that was lost during the grouping (and suffer duplication if there are multiple matches), or I PARTITION and NUMBER my rows so the max row is in position 1 and the detail was never lost.
We cannot escape the fact that GROUP BY loses detail or mixes row data.
PS: there are also ways of using numbering/partitioning queries to retain ties. This answer isn't meant to be a comprehensive cover of all windowing functions, just enough to make the point that GROUP BY is a cake that you cannot always have and eat.
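One tie-retaining option is RANK(), which gives tied rows the same number. A sketch, assuming the hypothetical person table again and SQLite 3.25 or later from Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (city TEXT, district TEXT, age INTEGER);
INSERT INTO person VALUES
  ('New York', 'Queens', 23), ('New York', 'Queens', 25),
  ('New York', 'Bronx', 25), ('Chicago', 'Central', 22);
""")

# RANK() assigns 1 to every row that shares the top age within its city,
# so filtering on rk = 1 keeps all tied rows instead of discarding one.
tied_rows = conn.execute("""
    SELECT city, district, age
    FROM (
        SELECT *, RANK() OVER (PARTITION BY city ORDER BY age DESC) AS rk
        FROM person
    ) AS ranked
    WHERE rk = 1
    ORDER BY city, district
""").fetchall()
print(tied_rows)
```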

Related

Count the number of occurences in each bucket Redshift SQL

This might be difficult to explain, but I'm trying to write a Redshift SQL query that counts the organizations falling into different market buckets. There are 50 markets. For example, company x can be found in only 1 market and company y can be found in 3 markets. I should preface this by saying that I have over 10,000 companies to fit into these buckets, so realistically it would be more like, hypothetically, 500 companies found in 3 markets or 7 companies found in 50 markets.
The table would look like:
Market Bucket   Org Count
1 Markets       3
2 Markets       1
3 Markets       0
select count(distinct case when enterprise_account = true and (market_name then organization_id end) as "1 Market" from organization_facts
I was trying to formulate the query from above but I got confused on how to effectively formulate the query
Organization Facts
Market Name   Org ID   Org Name
New York      15683    Company x
Orlando       38478    Company y
Twin Cities   2738     Company z
Twin Cities   15683    Company x
Detroit       99       Company xy

You would need a sub-query that retrieves the number of markets per company, and an outer query that summarises into a count of markets.
Something like:
with markets as (
select
org_name,
count(distinct market_name) as market_count
from organization_facts
group by org_name
)
select
market_count,
count(*) as org_count
from markets
group by market_count
order by market_count

If I follow you correctly, you can do this with two levels of aggregation. Assuming that org_id represents a company in your dataset:
select cnt_markets, count(*) cnt_org_id
from (select count(*) cnt_markets from organization_facts group by org_id) t
group by cnt_markets
The subquery counts the number of markets per company. I assumed there are no duplicate (org_id, market_name) tuples in the table; if that's not the case, then you need count(distinct market_name) instead of count(*) in that spot.
Then the outer query just counts how many times each market count occurs in the subquery, which yields the result that you want.
Note that I left out the enterprise_account column, which appears in your query but not in your data.
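The two levels of aggregation can be tried on the sample rows from the question. This is a sketch in an in-memory SQLite database from Python rather than Redshift, and the column names market_name, org_id and org_name are assumed from the table headers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organization_facts (market_name TEXT, org_id INTEGER, org_name TEXT);
INSERT INTO organization_facts VALUES
  ('New York', 15683, 'Company x'), ('Orlando', 38478, 'Company y'),
  ('Twin Cities', 2738, 'Company z'), ('Twin Cities', 15683, 'Company x'),
  ('Detroit', 99, 'Company xy');
""")

# Inner level: markets per organization. Outer level: organizations per count.
buckets = conn.execute("""
    SELECT cnt_markets, COUNT(*) AS cnt_orgs
    FROM (
        SELECT org_id, COUNT(DISTINCT market_name) AS cnt_markets
        FROM organization_facts
        GROUP BY org_id
    ) AS t
    GROUP BY cnt_markets
    ORDER BY cnt_markets
""").fetchall()
print(buckets)
```

A bucket that no organization falls into (such as "3 Markets" here) produces no row at all, so showing it with a count of 0 would need an outer join against a list of bucket values.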

Group by demands us to include all selected rows, when we need the results grouped by just one row

I am working on this programming exercise, SQL with Street Fighter, whose statement is:
It's time to assess which of the world's greatest fighters are through
to the 6 coveted places in the semi-finals of the Street Fighter World
Fighting Championship. Every fight of the year has been recorded and
each fighter's wins and losses need to be added up.
Each row of the table fighters records, alongside the fighter's name,
whether they won (1) or lost (0), as well as the type of move that
ended the bout.
fighters: id, name, won, lost, move_id
winning_moves: id, move
However, due to new health and safety regulations, all ki blasts have
been outlawed as a potential fire hazard. Any bout that ended with
Hadoken, Shouoken or Kikoken should not be counted in the total wins
and losses.
So, your job:
Return name, won, and lost columns displaying the name, total number of wins and total number of losses. Group by the fighter's
name.
Do not count any wins or losses where the winning move was Hadoken, Shouoken or Kikoken.
Order from most-wins to least
Return the top 6. Don't worry about ties.
How could we group the fighters by their names?
We have tried:
select name, won, lost from fighters inner join winning_moves on fighters.id=winning_moves.id
group by name order by won desc limit 6;
However it displays:
There was an error with the SQL query:
PG::GroupingError: ERROR: column "fighters.won" must appear in the
GROUP BY clause or be used in an aggregate function LINE 3: select
name, won, lost from fighters inner join winning_move...
In addition, we have also tried including all the selected columns in the GROUP BY:
select name, won, lost from fighters inner join winning_moves on fighters.id=winning_moves.id
group by name,won,lost order by won desc limit 6;
But the results differ from the expected.
Expected:
name won lost
Sakura 44 15
Cammy 44 17
Rose 42 19
Karin 42 13
Dhalsim 40 15
Ryu 39 16
Actual:
name won lost
Vega 2 1
Guile 2 1
Ryu 2 1
Rose 1 0
Vega 1 0
Zangief 1 0
Besides we have read:
https://www.w3schools.com/sql/sql_join.asp
MySql Inner Join with WHERE clause
How to limit rows in PostgreSQL SELECT
https://www.w3schools.com/sql/sql_groupby.asp
GROUP BY clause or be used in an aggregate function
PostgreSQL column must appear in the GROUP BY clause or be used in an aggregate function when using case statement
must appear in the GROUP BY clause or be used in an aggregate function

I guess you need SUM() to aggregate each fighter's wins and losses. You also don't need the join just to show the totals, since the move isn't displayed; note that this first query does not yet exclude the banned moves.
select name, sum(won) as won,
sum(lost) as lost
from fighters
group by name order by sum(won)
desc limit 6;
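To also drop the bouts that ended with a banned move, the join has to come back in, this time as a filter. A sketch, assuming fighters.move_id references winning_moves.id (the question's attempts join on fighters.id instead), run in in-memory SQLite from Python with made-up rows; the moves 'Throw' and 'Sweep' and all bout data are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE winning_moves (id INTEGER PRIMARY KEY, move TEXT);
CREATE TABLE fighters (id INTEGER PRIMARY KEY, name TEXT,
                       won INTEGER, lost INTEGER, move_id INTEGER);
INSERT INTO winning_moves VALUES
  (1, 'Hadoken'), (2, 'Shouoken'), (3, 'Kikoken'), (4, 'Throw'), (5, 'Sweep');
INSERT INTO fighters VALUES
  (1, 'Ryu', 1, 0, 4), (2, 'Ryu', 1, 0, 1), (3, 'Ryu', 0, 1, 5),
  (4, 'Ken', 1, 0, 5), (5, 'Ken', 0, 1, 2), (6, 'Ken', 1, 0, 4),
  (7, 'Ken', 1, 0, 5);
""")

# Join each bout to its finishing move, discard the banned ones,
# then total wins and losses per fighter name.
totals = conn.execute("""
    SELECT f.name, SUM(f.won) AS won, SUM(f.lost) AS lost
    FROM fighters f
    JOIN winning_moves m ON m.id = f.move_id
    WHERE m.move NOT IN ('Hadoken', 'Shouoken', 'Kikoken')
    GROUP BY f.name
    ORDER BY SUM(f.won) DESC
    LIMIT 6
""").fetchall()
print(totals)
```

Grouping by name alone works here because won and lost are wrapped in SUM(); they no longer need to appear in the GROUP BY.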

BigQuery GROUP BY function still showing duplicates

I'm doing a query in BigQuery:
SELECT id FROM [table] WHERE city = 'New York City' GROUP BY id
The weird part is it shows duplicate ids, often right next to each other.
There is absolutely nothing different between the ids themselves. There are around 3 million rows total, for ~500k IDs. So there are a lot of duplicates, but that is by design. We figured the filtering would easily eliminate that but noticed discrepancies in totals.
Is there a reason BigQuery's GROUP BY function would work improperly? For what its worth, the dataset has ~3 million rows.
Example of duplicate ID:
56abdb5b9a75d90003001df6
56abdb5b9a75d90003001df6

The most likely explanation is that your id is a STRING and those two ids really are different, because of spaces before or, more likely, after the characters that are visible to the eye.
I recommend adjusting your query as below:
SELECT REPLACE(id, ' ', '')
FROM [table]
WHERE city = 'New York City'
GROUP BY 1
Another option for troubleshooting would be:
SELECT id, LENGTH(id)
FROM [table]
WHERE city = 'New York City'
GROUP BY 1, 2
This lets you see whether those ids have the same length or not. My initial assumption was about spaces, but it can be any other character(s), including non-printable ones.
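The effect is easy to reproduce outside BigQuery. This sketch uses in-memory SQLite from Python purely as an illustration; the table name visits and the trailing space on the second id are assumptions, not something known about the original data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (id TEXT, city TEXT)")

clean_id = "56abdb5b9a75d90003001df6"
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    # The second id carries a hypothetical, invisible trailing space.
    [(clean_id, "New York City"), (clean_id + " ", "New York City")],
)

# Grouping on the raw string keeps both variants; LENGTH exposes the difference.
by_raw_id = conn.execute(
    "SELECT id, LENGTH(id) FROM visits WHERE city = 'New York City' "
    "GROUP BY id ORDER BY LENGTH(id)"
).fetchall()

# Grouping on the trimmed value collapses them into one group.
by_trimmed = conn.execute(
    "SELECT TRIM(id) FROM visits WHERE city = 'New York City' GROUP BY TRIM(id)"
).fetchall()

print(by_raw_id)
print(by_trimmed)
```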

Access 2010 SQL Query with Date Range

I'm new here and quite new to SQL and Access. What I have is a table called 'Apartments' that contains a bunch of rows of information. It has Building, Letter, SSN, LeaseDate, MonthlyRent, MoveinCondition and MoveoutCondition. For my class I have to figure out how many times a specific apartment was leased given all the information in the table and display by Building, Letter and NumberLeased.
What I have so far is this:
SELECT Building, Letter, COUNT(*)
FROM Apartments
GROUP BY Building, Letter;
This displays it almost correctly! However there is a catch. There can be multiple tenants on the lease at the same date, but it only counts as one active lease.
So what I did to check was this:
SELECT Building, Letter, LeaseDate, COUNT(*)
FROM Apartments
GROUP BY Building, Letter, LeaseDate;
Now this in fact does group by the building, letter and the lease date and counts the number of leases on the date.
But how do I display it so that it's not counting these duplicates, and add some sort of where or having statement to specify this.
for example: If apartment 1A was leased on 1/1/14 but by 4 tenants and also 1/1/13 by 3 tenants, it should only show the NumberLeased as 2, not 7.

Start with a query which gives you a single row for each apartment lease term. Per your example, the following query will condense the rows for each of the 4 apartment 1A tenants for the 1/1/14 LeaseDate into a single row:
SELECT DISTINCT Building, Letter, LeaseDate
FROM Apartments
Then use that as a subquery and base the lease counts on its distinct rows:
SELECT sub.Building, sub.Letter, Count(*) AS NumberLeased
FROM
(
SELECT DISTINCT Building, Letter, LeaseDate
FROM Apartments
) AS sub
GROUP BY sub.Building, sub.Letter;
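The example from the question (apartment 1A leased by 4 tenants on one date and 3 on another) can be checked with a small sketch. It runs in in-memory SQLite from Python rather than Access, the tenant identifiers are placeholders, and the second apartment 2B is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Apartments (Building INTEGER, Letter TEXT, SSN TEXT, LeaseDate TEXT)"
)

# One row per tenant on a lease; the SSN values are placeholders.
leases = (
    [(1, "A", f"tenant-{i}", "2014-01-01") for i in range(4)]
    + [(1, "A", f"tenant-{i}", "2013-01-01") for i in range(4, 7)]
    + [(2, "B", "tenant-7", "2014-06-01")]
)
conn.executemany("INSERT INTO Apartments VALUES (?, ?, ?, ?)", leases)

# DISTINCT collapses co-tenants into one row per lease term before counting.
lease_counts = conn.execute("""
    SELECT sub.Building, sub.Letter, COUNT(*) AS NumberLeased
    FROM (
        SELECT DISTINCT Building, Letter, LeaseDate
        FROM Apartments
    ) AS sub
    GROUP BY sub.Building, sub.Letter
    ORDER BY sub.Building, sub.Letter
""").fetchall()
print(lease_counts)
```

Apartment 1A comes out as 2 leases rather than 7 tenant rows.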

MySQL: Getting highest score for a user

I have the following table (highscores),
id gameid userid name score date
1 38 2345 A 100 2009-07-23 16:45:01
2 39 2345 A 500 2009-07-20 16:45:01
3 31 2345 A 100 2009-07-20 16:45:01
4 38 2345 A 200 2009-10-20 16:45:01
5 38 2345 A 50 2009-07-20 16:45:01
6 32 2345 A 120 2009-07-20 16:45:01
7 32 2345 A 100 2009-07-20 16:45:01
In the above structure a user can play a game multiple times, but I want to display the "Games Played" by a specific user, so each game should appear only once in that section. If a user played a game 3 times, then only the attempt with the highest score should be displayed.
I want result data like:
id gameid userid name score date
2 39 2345 A 500 2009-07-20 16:45:01
3 31 2345 A 100 2009-07-20 16:45:01
4 38 2345 A 200 2009-10-20 16:45:01
6 32 2345 A 120 2009-07-20 16:45:01
I tried following query but its not giving me the correct result:
SELECT id,
gameid,
userid,
date,
MAX(score) AS score
FROM highscores
WHERE userid='2345'
GROUP BY gameid
Please tell me what will be the query for this?
Thanks

The requirement is a bit vague/confusing, but would something like this satisfy the need?
(I purposely added various aggregates that may be of interest.)
SELECT gameid,
MIN(date) AS FirstTime,
MAX(date) AS LastTime,
MAX(score) AS TOPscore,
COUNT(*) AS NbOfTimesPlayed
FROM highscores
WHERE userid='2345'
GROUP BY gameid
-- ORDER BY COUNT(*) DESC -- for ex. to have games played most at top
Edit: New question about adding the id column to the SELECT list
The short answer is: "No, id cannot be added, not within this particular construct" (read further to see why). However, if the intent is to have the id of the game with the highest score, the query can be modified, using a sub-query, to achieve that.
As explained by Alex M on this page, all the column names referenced in the SELECT list that are not used in the context of an aggregate function (MAX, MIN, AVG, COUNT and the like) MUST be included in the GROUP BY clause. The reason for this rule of the SQL language is simply that, in gathering the info for the results list, SQL may encounter multiple values for such a column (listed in SELECT but not GROUP BY) and would then not know how to deal with it. Rather than doing anything, possibly useful but possibly silly as well, with these extra rows/values, the SQL standard dictates an error message, so that the user can modify the query and express his/her goals explicitly.
In our specific case, we could add the id in the SELECT and also add it to the GROUP BY list, but in doing so the grouping upon which the aggregation takes place would be different: the results list would include as many rows as we have id + gameid combinations, and the aggregate values for each of these rows would be based on only the records from the table where the id and the gameid have the corresponding values (assuming id is the PK of the table, we'd get a single row per aggregation, making the MAX() and such quite meaningless).
The way to include the id (and possibly other columns) corresponding to the game with the top score is with a sub-query. The idea is that the sub-query selects the game with the TOP score (within a given GROUP BY), and the main query SELECTs any column of these rows, even when the field wasn't (couldn't be) in the sub-query's GROUP BY construct. BTW, do give credit on this page to rexem for showing this type of query first.
SELECT H.id,
H.gameid,
H.userid,
H.name,
H.score,
H.date
FROM highscores H
JOIN (
SELECT H2.gameid, H2.userid, MAX(H2.score) MaxScoreByGameUser
FROM highscores H2
GROUP BY H2.gameid, H2.userid
) AS M
ON M.gameid = H.gameid
AND M.userid = H.userid
AND M.MaxScoreByGameUser = H.score
WHERE H.userid='2345'
A few important remarks about the query above:
Duplicates: if the user reached the same hi-score several times in the same game, the query will produce that many rows for that game.
The GROUP BY of the sub-query may need to change for different uses of the query. If, rather than searching for the game's hi-score on a per-user basis, we wanted the absolute hi-score, we would need to exclude userid from the GROUP BY (that's why I gave the alias of the MAX a long, explicit name).
The userid = '2345' condition may be added in the [now absent] WHERE clause of the sub-query for efficiency (unless MySQL's optimizer is very smart, currently all hi-scores for all game+user combinations get calculated, whereas we only need these for user '2345'); the downside is that the condition is then duplicated, which variables could solve.
There are several ways to deal with the issues mentioned above, but these seem to be out of scope for a [now rather lengthy] explanation about the GROUP BY constructs.

Every field you have in your SELECT (when a GROUP BY clause is present) must be either one of the fields in the GROUP BY clause, or else a group function such as MAX, SUM, AVG, etc. In your code, userid is technically violating that, but in a pretty harmless fashion (you could make your code technically SQL standard compliant with a GROUP BY gameid, userid). The fields id and date are in more serious violation: there will be many ids and dates within one GROUP BY set, and you're not telling how to make a single value out of that set (MySQL picks a more-or-less random one; stricter SQL engines might more helpfully give you an error).
I know you want the id and date corresponding to the maximum score for a given grouping, but that's not explicit in your code. You'll need a subselect or a self-join to make it explicit!

Use:
SELECT t.id,
t.gameid,
t.userid,
t.name,
t.score,
t.date
FROM HIGHSCORES t
JOIN (SELECT hs.gameid,
hs.userid,
MAX(hs.score) AS max_score
FROM HIGHSCORES hs
GROUP BY hs.gameid, hs.userid) mhs ON mhs.gameid = t.gameid
AND mhs.userid = t.userid
AND mhs.max_score = t.score
WHERE t.userid = '2345'
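Run against the seven rows from the question, this join-back pattern returns the expected ids 2, 3, 4 and 6. The sketch below uses in-memory SQLite from Python instead of MySQL, stores the dates as text, and adds an ORDER BY only to make the output order predictable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE highscores (id INTEGER PRIMARY KEY, gameid INTEGER, "
    "userid INTEGER, name TEXT, score INTEGER, date TEXT)"
)
conn.executemany(
    "INSERT INTO highscores VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 38, 2345, "A", 100, "2009-07-23 16:45:01"),
        (2, 39, 2345, "A", 500, "2009-07-20 16:45:01"),
        (3, 31, 2345, "A", 100, "2009-07-20 16:45:01"),
        (4, 38, 2345, "A", 200, "2009-10-20 16:45:01"),
        (5, 38, 2345, "A", 50, "2009-07-20 16:45:01"),
        (6, 32, 2345, "A", 120, "2009-07-20 16:45:01"),
        (7, 32, 2345, "A", 100, "2009-07-20 16:45:01"),
    ],
)

# The subquery finds the top score per (game, user); the join pulls back
# the full row that carries that score.
best = conn.execute("""
    SELECT t.id, t.gameid, t.score
    FROM highscores t
    JOIN (SELECT hs.gameid, hs.userid, MAX(hs.score) AS max_score
          FROM highscores hs
          GROUP BY hs.gameid, hs.userid) mhs
      ON mhs.gameid = t.gameid
     AND mhs.userid = t.userid
     AND mhs.max_score = t.score
    WHERE t.userid = 2345
    ORDER BY t.id
""").fetchall()
print(best)
```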
