BigQuery GROUP BY function still showing duplicates

I'm doing a query in BigQuery:
SELECT id FROM [table] WHERE city = 'New York City' GROUP BY id
The weird part is it shows duplicate ids, often right next to each other.
There is absolutely nothing different between the ids themselves. There are around 3 million rows total, for ~500k IDs. So there are a lot of duplicates, but that is by design. We figured the filtering would easily eliminate that but noticed discrepancies in totals.
Is there a reason BigQuery's GROUP BY function would work improperly? For what it's worth, the dataset has ~3 million rows.
Example of duplicate ID:
56abdb5b9a75d90003001df6
56abdb5b9a75d90003001df6

The most likely explanation is that your id is a STRING and those two ids are in fact different, because of spaces before or (more likely) after the part that is visible to the eye.
I recommend adjusting your query as below:
SELECT REPLACE(id, ' ', '')
FROM [table]
WHERE city = 'New York City'
GROUP BY 1
Another option for troubleshooting is below:
SELECT id, LENGTH(id)
FROM [table]
WHERE city = 'New York City'
GROUP BY 1, 2
This shows whether those ids really have the same length. My initial assumption was a space, but it could be any other character(s), including non-printable ones.
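A minimal sketch of the effect, using Python's built-in sqlite3 as a stand-in (this is not BigQuery, and the table name t and the sample rows are made up for illustration). It shows that an id with a trailing space forms its own group, that LENGTH exposes the difference, and that grouping on the stripped value collapses the two.

```python
import sqlite3

# Hypothetical in-memory table standing in for the BigQuery table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("56abdb5b9a75d90003001df6", "New York City"),
        ("56abdb5b9a75d90003001df6 ", "New York City"),  # trailing space
        ("56abdb5b9a75d90003001df6", "New York City"),
    ],
)

# Plain GROUP BY: the two spellings land in separate groups.
raw_groups = conn.execute(
    "SELECT id, LENGTH(id) FROM t WHERE city = 'New York City' "
    "GROUP BY id ORDER BY LENGTH(id)"
).fetchall()

# Grouping on the space-stripped value collapses them into one group.
clean_groups = conn.execute(
    "SELECT REPLACE(id, ' ', '') AS clean_id FROM t "
    "WHERE city = 'New York City' GROUP BY clean_id"
).fetchall()

print(raw_groups)    # two rows, with lengths 24 and 25
print(clean_groups)  # one row
```

If the lengths match but the ids still group separately, the difference is probably in the characters themselves (for example look-alike or non-printable characters), so REPLACE on a space alone would not be enough.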

Related

How do I display multiple fields when using distinct count?

I am trying to get a count of total different first and last names with the same email address, and I'm not sure where to go from here. Field1 and Field2 are in the same table.
My output should have the concatenated field, field 1, field2
SELECT COUNT(DISTINCT(CONCAT(first_name,last_name)))
FROM `datalake.core.profile_snapshot`
WHERE classic_country = 'US' and
email.personal = 'example#provider.net'
LIMIT 1000
Appreciate any help!
SELECT
first_name
,last_name
,email_address
,count(1) as number
FROM datalake.core.profile_snapshot
GROUP BY
first_name
,last_name
,email_address
If you want to reduce the result set to a particular email address then just add a where clause to do so.
I've used email_address instead of email.personal.
LIMIT in SQL only caps the number of rows returned; it does not filter. You need to use HAVING to filter on your aggregate.
Email with 1000+ Distinct Names
SELECT email
/*Put random pipe character "|" in between first and last name so don't get names that concatenate to same value
Such as Jane Doe and Jan Edoe. Not a realistic example but concatenation could result in same "value" without a separator*/
,COUNT(DISTINCT CONCAT(first_name,'|',last_name)) AS DistinctNames
FROM datalake.core.profile_snapshot
WHERE classic_country = 'US'
AND email.personal = 'example#provider.net' /*Can comment this out if you want to see all email with 1000+ distinct names*/
GROUP BY email
/*HAVING clause = WHERE clause for aggregates*/
HAVING COUNT(DISTINCT CONCAT(first_name,'|',last_name)) > 1000 /*1000 distinct names for each email*/
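Here is a small sketch of the same idea, run against SQLite through Python's sqlite3 rather than BigQuery. The profile table, its flat email column, and the sample rows are assumptions for illustration, the threshold is lowered from 1000 to 1 to suit the tiny data set, and || is SQLite's concatenation operator in place of CONCAT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profile (email TEXT, first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO profile VALUES (?, ?, ?)",
    [
        ("a@example.com", "jane", "doe"),
        ("a@example.com", "jan", "edoe"),  # also concatenates to "janedoe"
        ("a@example.com", "jane", "doe"),  # exact repeat, not a new name
        ("b@example.com", "sam", "lee"),
    ],
)

# HAVING filters on the aggregate; the '|' separator keeps the two
# different names from collapsing into the same string.
with_sep = conn.execute(
    "SELECT email, COUNT(DISTINCT first_name || '|' || last_name) AS distinct_names "
    "FROM profile GROUP BY email "
    "HAVING COUNT(DISTINCT first_name || '|' || last_name) > 1"
).fetchall()

# Without the separator both names become 'janedoe' and count as one.
no_sep = conn.execute(
    "SELECT email, COUNT(DISTINCT first_name || last_name) "
    "FROM profile GROUP BY email ORDER BY email"
).fetchall()

print(with_sep)
print(no_sep)
```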

SQL GROUP BY When Many Tables and Fields are Queried

I've never understood the GROUP BY clause because most of the examples that Google provides are very simple. I have a real-life example of 982 columns pulling from 6 tables and need to select MAX(iVersion), which requires GROUP BY. A duplicate record identifier can be entered into the SQL database multiple times when it has a newer iVersion number. I need to get the most recent record version but SSMS keeps screaming at me with the usual
field is not part of an aggregate function.
This is my updated code
SELECT TOP (100) Certification.dCertifiedDate, ChildGeneral.dChildsDateOfBirth, ChildGeneral.cChildsFirstName, ChildGeneral.cChildsLastName,Father.cFathersFirstName, Father.cFathersLastName,
Mother.cMothersFirstName, Mother.cMothersLastName, ChildGeneral.cChildsID, MAX(ChildGeneral.iVersionID) AS iVersionID, RecordTypes.cRecordCode, ChildGeneralFlag.cStateFileNumber
FROM ChildGeneral
INNER JOIN Father ON ChildGeneral.cChildsID = Father.cChildsID
AND ChildGeneral.iVersionID = Father.iVersionID
INNER JOIN Mother ON ChildGeneral.cChildsID = Mother.cChildsID
AND ChildGeneral.iVersionID = Mother.iVersionID
INNER JOIN ChildGeneralFlag ON ChildGeneral.cChildsID = ChildGeneralFlag.cChildsID
AND ChildGeneral.iVersionID = ChildGeneralFlag.iVersionID
INNER JOIN RecordTypes ON ChildGeneral.cRecordType = RecordTypes.cListItemID
INNER JOIN Certification ON ChildGeneral.cChildsID = Certification.cChildsID
WHERE CAST(CONVERT(VARCHAR, ChildGeneral.dChildsDateOfBirth, 101) AS DATE) >= CAST('01/01/1971' AS DATE)
AND CAST(CONVERT(VARCHAR, ChildGeneral.dChildsDateOfBirth, 101) AS DATE) <= CAST('12/31/2010' AS DATE)
GROUP BY ChildGeneral.iVersionID, Certification.dCertifiedDate, ChildGeneral.dChildsDateOfBirth, cChildsTimeOfBirth,ChildGeneral.cChildsFirstName,
ChildGeneral.cChildsLastName, Father.cFathersFirstName, Father.cFathersLastName, Mother.cMothersFirstName,
Mother.cMothersLastName, ChildGeneral.cChildsID, RecordTypes.cRecordCode, ChildGeneralFlag.cStateFileNumber
ORDER BY ChildGeneralFlag.cStateFileNumber
There should only be one record for each ChildGeneralFlag.cStateFileNumber, the one with the MAX(ChildGeneral.iVersionID), which could be anywhere from 1-99.
So I chopped it down to 12 columns from the 6 tables, and I get this error until I have added every last column to the GROUP BY: Msg 8120, Level 16, State 1, Line 33 Column 'RecordTypes.cRecordCode' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. But then I lose the MAX effect and it gives me all the records.
Yep that's how it goes. It's logical when you think about it:
Suppose you have rows (it doesn't matter how many tables they come from or how many columns there are; here I present just two):
City, Age
New York, 23
New York, 24
New York, 25
Chicago, 22
You want the max age by city. You group by city
SQL looks at all the unique values of City, and in this case essentially sets out two buckets, one labelled "New York" and the other labelled "Chicago". Every row's age goes into one bucket or the other. Then it looks through each bucket finding the max age in each one. You get 25 for NY and 22 for Chicago
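The bucket description can be tried out with a quick sketch. It uses Python's sqlite3 as a convenient stand-in for SQL Server, and the person table name is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (city TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("New York", 23), ("New York", 24), ("New York", 25), ("Chicago", 22)],
)

# One bucket per distinct city; MAX picks the highest age in each bucket.
max_by_city = conn.execute(
    "SELECT city, MAX(age) FROM person GROUP BY city ORDER BY city"
).fetchall()

print(max_by_city)  # [('Chicago', 22), ('New York', 25)]
```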
Suppose we add another level
City, District, Age
New York, Queens, 23
New York, Queens, 24
New York, Bronx, 25
Chicago, Central, 22
You can keep your group by as is, but you can't select the district too unless you group it or max it
If you group it, SQL sets out 3 buckets this time, "New York/Bronx", "New York/Queens" and "Chicago/Central". The Queens bucket has 2 ages thrown into it; the max age in Bronx is 25, in Queens it is 24. Ultimately you get 3 rows out of your query because of the 3 unique values of city+district.
If you max the district and the age and keep the group by only on city, you get 2 rows out, but the max of district is Queens (alphabetically "greater" than Bronx) and you get 25, which truly is the max age, but there was never originally a New York/Queens/25 row. Bucket values are allowed to mix up in aggregations. By stating MAX(age) and MAX(district), SQL just pulls those values out of their rows, throws them all in the bucket and then finds the highest. Values in buckets retain their individual identity (whether they're an age value or a district value) but they lose all association with other values on the same row they were originally from.
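A sketch of both variants, again using Python's sqlite3 as a stand-in with an assumed person table. Grouping by city and district gives three rows, while taking MAX of both columns per city produces a New York/Queens/25 row that never existed in the data.

```python
import sqlite3

rows = [
    ("New York", "Queens", 23),
    ("New York", "Queens", 24),
    ("New York", "Bronx", 25),
    ("Chicago", "Central", 22),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (city TEXT, district TEXT, age INTEGER)")
conn.executemany("INSERT INTO person VALUES (?, ?, ?)", rows)

# Grouping by both columns: one bucket per city/district pair.
grouped = conn.execute(
    "SELECT city, district, MAX(age) FROM person "
    "GROUP BY city, district ORDER BY city, district"
).fetchall()

# Aggregating both columns: each MAX is computed independently per city.
mixed = conn.execute(
    "SELECT city, MAX(district), MAX(age) FROM person "
    "GROUP BY city ORDER BY city"
).fetchall()

print(grouped)
print(mixed)
# The New York result is not one of the original rows.
print(("New York", "Queens", 25) in rows)  # False
```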
There isn't a concept of "group by New York, give me the max age, and also give me the district that went with it" because that might see the db having to make a choice it isn't empowered to make. Suppose we had this:
City, District, Age
New York, Queens, 23
New York, Queens, 25
New York, Bronx, 25
Chicago, Central, 22
Max is 25, but there are two rows with this max - should the DB return both? Should it pick one to throw away because you've asked for GROUP BY city which means city should be unique in the output?
"Both Rows please" you might say. "Just pick one to discard" I might say..
The db won't choose, so instead you'll have to be explicit:
SELECT *
FROM Person p
INNER JOIN (select city, max(age) maxage from person group by city) m
ON p.city = m.city AND p.age = m.maxage
Here we explicitly say "do the group, join it back, thus give me both rows" - the group by query becomes an elaborate 2 column where clause, that filters to only rows having both "New York and 25", or "Chicago and 22". You get both your New York rows.
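The join-back query can be checked with a short sketch (Python's sqlite3 as a stand-in for SQL Server; the person table and its contents follow the tied example above). Both New York rows with age 25 come back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (city TEXT, district TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [
        ("New York", "Queens", 23),
        ("New York", "Queens", 25),
        ("New York", "Bronx", 25),
        ("Chicago", "Central", 22),
    ],
)

# The grouped subquery acts as a two-column filter: (city, max age).
joined_back = conn.execute(
    """
    SELECT p.city, p.district, p.age
    FROM person p
    INNER JOIN (SELECT city, MAX(age) AS maxage FROM person GROUP BY city) m
        ON p.city = m.city AND p.age = m.maxage
    ORDER BY p.city, p.district
    """
).fetchall()

print(joined_back)
```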
There isn't a specified way of saying "pick one to discard". Instead we have to be explicit about how to break the tie: you can say "I want the city, the max age, the associated district, and if there are ties I want the first one when districts are sorted alphabetically".
And if you end up with a situation where there are ties in district, you have to add another level of sorting (and if that ties you keep going until there are no ties or you don't care any more)
Often that query looks like this:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY city ORDER BY age DESC, district) rn
FROM Person
) x WHERE rn = 1
Row number essentially does the same query we did above when we grouped by city and joined it back in - partition by is like "group by with auto join back to current row based on partitioned values". In essence this establishes a column with an incrementing counter that starts from 1 per city and counts up in the given order. Because we asked for age descending, tie breaker on district ascending, the 25,Bronx row gets the 1, the 25,Queens row gets 2. Because we then filter for only rows with 1, we get rid of the queens rows
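A runnable sketch of the numbering approach, using Python's sqlite3 as a stand-in. It assumes the bundled SQLite is version 3.25 or later (window functions), and the person table is the same tied example. Only the Bronx row survives for New York.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (city TEXT, district TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [
        ("New York", "Queens", 23),
        ("New York", "Queens", 25),
        ("New York", "Bronx", 25),
        ("Chicago", "Central", 22),
    ],
)

# Number the rows within each city: oldest first, district as tie breaker.
top_per_city = conn.execute(
    """
    SELECT city, district, age
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY city ORDER BY age DESC, district) AS rn
        FROM person
    ) AS x
    WHERE rn = 1
    ORDER BY city
    """
).fetchall()

print(top_per_city)  # [('Chicago', 'Central', 22), ('New York', 'Bronx', 25)]
```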
If I want a "max this group by that, and also all the other data from that row" I either need to group/max just what I want and join it back to get the detail that was lost during the grouping (and suffer duplication if there are multiple matches), or I PARTITION and NUMBER my rows so the max row is in position 1 and the detail was never lost.
We cannot escape the fact that GROUP BY loses detail or mixes row data
PS: there are also ways of using numbering/partitioning queries to retain ties. This answer isn't meant to be a comprehensive cover of all windowing functions, just enough to cover the point that group by is a cake that you cannot always eat and have.

How can I select the Nth row of a group of fields?

I have a very very small database that I am needing to return a field from a specific row.
My table looks like this (simplified)
Material_Reading Table
pointID Material_Name
123 WoodFloor
456 Carpet
789 Drywall
111 Drywall
222 Carpet
I need to be able to group these together and see the different kinds (WoodFloor, Carpet, and Drywall) and need to be able to select which one I want and have that returned. So my select statement would put the various different types in a list and then I could have a variable which would select one of the rows - 1, 2, 3 for example.
I hope that makes sense. This is a somewhat non-standard implementation because it's a FileMaker database, unfortunately, so instead of one big SQL statement doing all I need, I will have several that will each select an individual row that I indicate.
What I have tried so far:
SELECT DISTINCT Material_Name FROM MATERIAL_READING WHERE Room_KF = $roomVariable
This works and returns a list of all my material names which are in the room indicated by the room variable. But I can't get a specific one by supplying a row number.
I have tried using LIMIT 1 OFFSET 1. Possibly not supported by Filemaker or I am doing it wrong, I tried it like this - it gives an error:
SELECT DISTINCT Material_Name FROM MATERIAL_READING WHERE _Room_KF = $roomVariable ORDER BY Material_Name LIMIT 1 OFFSET 1
I am able to use ORDER BY like this:
SELECT DISTINCT Material_Name FROM MATERIAL_READING WHERE Room_KF = $roomVariable ORDER BY Material_Name
In MSSQL (SQL Server 2012 and later):
SELECT DISTINCT Material_Name
FROM MATERIAL_READING
WHERE _Room_KF = 'roomVariable'
ORDER BY Material_Name
OFFSET N ROWS
FETCH NEXT X ROWS ONLY
where N is the number of rows to skip (the result starts at row N+1)
and X is the number of rows to retrieve from that point.
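The same skip-then-take idea can be sketched in a runnable form. This uses Python's sqlite3, where the equivalent syntax is LIMIT ... OFFSET rather than OFFSET ... FETCH, so it is neither MSSQL nor FileMaker. The helper name nth_material and the Room_KF values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MATERIAL_READING (pointID INTEGER, Material_Name TEXT, Room_KF INTEGER)"
)
conn.executemany(
    "INSERT INTO MATERIAL_READING VALUES (?, ?, ?)",
    [
        (123, "WoodFloor", 1),
        (456, "Carpet", 1),
        (789, "Drywall", 1),
        (111, "Drywall", 1),
        (222, "Carpet", 1),
    ],
)

def nth_material(conn, room, n):
    """Return the n-th (1-based) distinct material name in a room, or None."""
    row = conn.execute(
        "SELECT DISTINCT Material_Name FROM MATERIAL_READING "
        "WHERE Room_KF = ? ORDER BY Material_Name LIMIT 1 OFFSET ?",
        (room, n - 1),  # skip n-1 rows, then take one
    ).fetchone()
    return row[0] if row else None

print(nth_material(conn, 1, 1))  # Carpet
print(nth_material(conn, 1, 2))  # Drywall
print(nth_material(conn, 1, 3))  # WoodFloor
```

The ORDER BY matters here: without it the row order is not defined, so "the 2nd row" would not be a stable thing to ask for.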

How can I get a count of the unique lengths of a string in database rows?

I am using Oracle and I have a table with 1000 rows. There is a last name field and
I want to know the lengths of the name field but I don't want it for every row. I want a count of the various lengths.
Example:
lastname:
smith
smith
Johnson
Johnson
Jackson
Baggins
There are two smiths length of five. Four others, length of seven. I want my query to return
7
5
If there were 1,000 names I expect to get all kinds of lengths.
I tried,
Select count(*) as total, lastname from myNames group by total
It didn't know what total was. Grouping by lastname just groups on each individual name unless it's a different last name, which is as expected but not what I need.
Can this be done in one SQL query?
SELECT Length(lastname)
FROM MyTable
GROUP BY Length(lastname)
select distinct(LENGTH(lastname)) from mynames;
Select count(*), Length(column_name) from table_name group by Length(column_name);
This will work for the different lengths in a single column.
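A quick check of the grouped version against the sample names from the question, using Python's sqlite3 as a stand-in for Oracle (LENGTH and GROUP BY on an expression behave the same way for this case).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mynames (lastname TEXT)")
conn.executemany(
    "INSERT INTO mynames VALUES (?)",
    [("smith",), ("smith",), ("Johnson",), ("Johnson",), ("Jackson",), ("Baggins",)],
)

# Group on the computed length, not on the name itself.
length_counts = conn.execute(
    "SELECT LENGTH(lastname) AS len, COUNT(*) AS total "
    "FROM mynames GROUP BY LENGTH(lastname) ORDER BY len DESC"
).fetchall()

print(length_counts)  # [(7, 4), (5, 2)]
```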

How to concatenate multiple rows?

I have the following query which returns the salary of all employees. This works perfectly, but I need to collect extra data that I will aggregate into one cell (see Result Set 2).
How can I aggregate data into a comma separated list? A little bit like what Sum does, but I need a string in return.
SELECT Employee.Id, SUM(Pay) as Salary
FROM Employee
INNER JOIN PayCheck ON PayCheck.EmployeeId = Employee.Id
GROUP BY Employee.Id
Result Set 1
Employee.Id   Salary
-----------------------------------
1             150
2             250
3             350
I need:
Result Set 2
Employee.Id   Salary   Data
----------------------------------------------------
1             150      One, Two, Three
2             250      Four, Five, Six
3             350      Seven
For SQL Server 2005+, use the STUFF function and FOR XML PATH:
WITH summary_cte AS (
SELECT Employee.Id, SUM(Pay) as Salary
FROM Employee
JOIN PayCheck ON PayCheck.EmployeeId = Employee.Id
GROUP BY Employee.Id)
SELECT sc.id,
sc.salary,
STUFF((SELECT ','+ yt.data
FROM your_table yt
WHERE yt.id = sc.id
GROUP BY yt.data
FOR XML PATH(''), TYPE).value('.','VARCHAR(max)'), 1, 1, '')
FROM summary_cte sc
But you're missing details about where the data you want to turn into a comma delimited string is, and how it relates to an employee record...
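To illustrate the shape of the result, here is a sketch that runs on SQLite through Python's sqlite3. SQLite's GROUP_CONCAT is swapped in for the STUFF / FOR XML PATH technique, so this is not the SQL Server query itself. The EmployeeData table and its link to Employee via EmployeeId are assumptions, since the question does not say where the data lives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Employee (Id INTEGER);
    CREATE TABLE PayCheck (EmployeeId INTEGER, Pay INTEGER);
    CREATE TABLE EmployeeData (EmployeeId INTEGER, Data TEXT);  -- hypothetical
    INSERT INTO Employee VALUES (1), (2), (3);
    INSERT INTO PayCheck VALUES (1, 100), (1, 50), (2, 250), (3, 350);
    INSERT INTO EmployeeData VALUES
        (1, 'One'), (1, 'Two'), (1, 'Three'),
        (2, 'Four'), (2, 'Five'), (2, 'Six'),
        (3, 'Seven');
    """
)

# Sum the pay first, then attach the concatenated data per employee with a
# correlated subquery, so joining the two child tables cannot inflate the sum.
result = conn.execute(
    """
    WITH summary AS (
        SELECT e.Id, SUM(p.Pay) AS Salary
        FROM Employee e
        JOIN PayCheck p ON p.EmployeeId = e.Id
        GROUP BY e.Id
    )
    SELECT s.Id,
           s.Salary,
           (SELECT GROUP_CONCAT(d.Data, ', ')
            FROM EmployeeData d
            WHERE d.EmployeeId = s.Id) AS Data
    FROM summary s
    ORDER BY s.Id
    """
).fetchall()

for row in result:
    print(row)
```

SQLite does not promise any particular order for the concatenated items, so the order inside each list should not be relied on.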
I don't have my code in front of me, or I would show you a quick example, but I would look into writing a CLR aggregate for this. It's very simple. There are a few automatically created methods to fill in: one for collection (add to a List<> object or something), Merge (merging multiple lists created in multiple threads), and an output (take the list and turn it into a string: String.Join(",", list.ToArray())). The only thing to know is that there is a length limit of 8000 characters.