Select a subgroup of records by one distinct column - sql

Sorry if this has been answered before, but all the related questions didn't quite seem to match my purpose.
I have a table that looks like the following:
ID POSS_PHONE CELL_FLAG
=======================
1 111-111-1111 0
2 222-222-2222 0
2 333-333-3333 1
3 444-444-4444 1
I want to select only distinct ID values for an insert, but I don't care which specific ID gets pulled out of the duplicates.
For Example(a valid SELECT would be):
1 111-111-1111 0
2 222-222-2222 0
3 444-444-4444 1
Before I had the CELL_FLAG column, I was just using an aggregate function as so:
SELECT ID, MAX(POSS_PHONE)
FROM TableA
GROUP BY ID
But I can't do:
SELECT ID, MAX(POSS_PHONE), MAX(CELL_FLAG)...
because I would lose integrity within the row, correct?
I've seen some similar examples using CTEs, but once again, nothing that quite fit.
So maybe this is solvable by a CTE or some type of self-join subquery? I'm at a block right now, so I can't see any other solutions.

Just get your aggregation in a subquery and join to it:
SELECT a.ID, sub.Poss_Phone, CELL_FLAG
FROM TableA as a
INNER JOIN (SELECT ID, MAX(POSS_PHONE) as [Poss_Phone]
FROM TableA
GROUP BY ID) Sub
ON Sub.ID = a.ID and SUB.Poss_Phone = A.Poss_Phone
This will keep integrity between your non-aggregated fields but still give you the MAX(Poss_Phone) per ID.

Related

BQ/SQL join two tables in a way that one column fills up with all distinct values from the other table while remaining columns get a null

Hello everyone this is my first question here. I have been browsing thru the questions but couldnt quite find the answer to my problem:
I have a couple of tables which I need to join. The key I join with is non unique(in this case its a date). This is working fine but now I also need to group the results based on another column without getting cross-join like results (meaning each value of this column should only appear once but depending on the table used the column can have different values in each table)
Here is an example of what I have and what I would like to get:
Table1
Date/Key
Group Column
Example Value1
01-01-2022
a
1
01-01-2022
d
2
01-01-2022
e
3
01-01-2022
f
4
Table 2
Date/Key
Group Column
Example Value 2
01-01-2022
a
1
01-01-2022
b
2
01-01-2022
c
3
01-01-2022
d
4
Wanted Result :
Table Result
Date/Key
Group Column
Example Value1
Example Value2
01-01-2022
a
1
1
01-01-2022
b
NULL
2
01-01-2022
c
NULL
3
01-01-2022
d
2
4
01-01-2022
e
3
NULL
01-01-2022
f
4
NULL
I have tryed a couple of approaches but I always get results with values in group column appear multiple times. I am under the impression that full joining and then grouping over the group column shoul work but apparently I am missing something. I also figured I could bruteforce the result by left joining everything with setting the on to table1.date = table2.date AND table1.Groupcolumn = table2.Groupcolumn ect.. and then doing UNIONs of all permutations (so each table was on "the left" once) but this is not only tedious but bigquery doesnt like it since it contains too many sub queries.
I feel kinda bad that my first question is something that I should actually know but I hope someone can help me out!
I do not need a full code solution just a hint to the correct approach would suffice (also incase I missed it: if this was already answered I also appreciate just a link to it!)
Edit:
So one solution I came up with, which appears to work, was to select the group column of each table and union them as a with() and then join this "list" onto the first table like
list as(Select t1.GroupColumn FROM Table_1 t1 WHERE CONDITION1
UNION DISTINCT Select t1.GroupColumn FROM Table_1 t1 WHERE CONDITION2 ... ect)
result as (
SELECT l.GoupColumn, t1.Example_Value1, t2.Example_Value2
FROM Table_1 t1
LEFT JOIN( SELECT * FROM list) s
ON S.GroupColumn = t1.GroupColumn
LEFT JOIN Table_2 t2
on S.GroupColumn = t2.GroupColumn
and t1.key = t2.key
...
)
SELECT * FROM result
I think what you are looking for is a FULL OUTER JOIN and then you can coalesce the date and group columns. It doesn't exactly look like you need to group anything based on the example data you posted:
SELECT
coalesce(table1.date_key, table2.date_key) AS date_key,
coalesce(table1.group_column, table2.group_column) AS group_column,
table1.example_value_1,
table2.example_value_2
FROM
table1
FULL OUTER JOIN
table2
USING
(date_key,
group_column)
ORDER BY
date_key,
group_column;
Consider below simple approach
select * from (
select *, 'example_value1' type from table1 union all
select *, 'example_value2' type from table2
)
pivot (
any_value(example_value1)
for type in ('example_value1', 'example_value2')
)
if applied to sample data in your question - output is

How to include values with count 0?

I have the following table called Trains.
id | name | train_id
1 Carl 1
2 Kat 1
3 Paul 2
4 Adam 4
5 Janet 4
6 James 4
I am trying to count for each name how many other people are in the same train.
Here's what I've gotten so far:
SELECT T1.name, COUNT(T2.name)
FROM Trains T1, Trains T2
WHERE T1.name<>T2.name AND T1.train_id=T2.train_id
GROUP BY T1.name;
However, the result I get is
Janet 2
Adam 2
Kat 1
Carl 1
James 2
but I should also have the name 'Paul' there with count 0. I am new to SQL and I am unsure of how I could change my code to have the zero values here as well.
If you phrase your current logic as a left join, it should work:
SELECT t1.name, COUNT(t2.name) AS cnt
FROM Trains t1
LEFT JOIN Trains t2
ON t1.name <> t2.name AND t1.train_id = t2.train_id
GROUP BY t1.name;
Demo
The problem with your current approach is that it doing an old school implicit inner join, not a left join. This means that first the join happens, then the WHERE clause is filtering off the missing Paul record. By using a left join, all names on the left side of the join are retained.
I don't think a join is needed. Just use window functions:
select name, count(*) over (partition by train_id) - 1
from trains t;
Basically, count(*) over (partition by train_id) counts the number of rows on the train. The - 1 is to subtract the current row.
We use COUNT(*) which counts all of the input rows for a group.
(COUNT() also works with expressions, but it has slightly different behavior.)
Here's how the database executes this query:
FROM train_id: — First, retrieve all of the records from the Trains table.
GROUP BY name — Next, determine the unique name groups.
SELECT ... — Finally, select the name and the count of the number of rows in that group.
We also give this count of rows an alias using AS people_in_same_train to make the output more readable.
SELECT
name, COUNT(*) AS people_in_same_train
FROM Trains
GROUP BY name;

SQL - Count Results of 2 Columns

I have the following table which contains ID's and UserId's.
ID UserID
1111 11
1111 300
1111 51
1122 11
1122 22
1122 3333
1122 45
I'm trying to count the distinct number of 'IDs' so that I get a total, but I also need to get a total of ID's that have also seen the that particular ID as well... To get the ID's, I've had to perform a subquery within another table to get ID's, I then pass this into the main query... Now I just want the results to be displayed as follows.
So I get a Total No for ID and a Total Number for Users ID - Also would like to add another column to get average as well for each ID
TotalID Total_UserID Average
2 7 3.5
If Possible I would also like to get an average as well, but not sure how to calculate that. So I would need to count all the 'UserID's for an ID add them altogether and then find the AVG. (Any Advice on that caluclation would be appreciated.)
Current Query.
SELECT DISTINCT(a.ID)
,COUNT(b.UserID)
FROM a
INNER JOIN b ON someID = someID
WHERE a.ID IN ( SELECT ID FROM c WHERE GROUPID = 9999)
GROUP BY a.ID
Which then Lists all the IDs and COUNT's all the USERID.. I would like a total of both columns. I've tried warpping the query in a
SELECT COUNT(*) FROM (
but this only counts the ID's which is great, but how do I count the USERID column as well
You seem to want this:
SELECT COUNT(DISTINCT a.ID), COUNT(b.UserID),
COUNT(b.UserID) * 1.0 / COUNT(DISTINCT a.ID)
FROM a INNER JOIN
b
ON someID = someID
WHERE a.ID IN ( SELECT ID FROM c WHERE GROUPID = 9999);
Note: DISTINCT is not a function. It applies to the whole row, so it is misleading to put an expression in parentheses after it.
Also, the GROUP BY is unnecessary.
The 1.0 is because SQL Server does integer arithmetic and this is a simple way to convert a number to a decimal form.
You can use
SELECT COUNT(DISTINCT a.ID) ...
to count all distinct values
Read details here
I believe you want this:
select TotalID,
Total_UserID,
sum(Total_UserID+TotalID) as Total,
Total_UserID/TotalID as Average
from (
SELECT (DISTINCT a.ID) as TotalID
,COUNT(b.UserID) as Total_UserID
FROM a
INNER JOIN b ON someID = someID
WHERE a.ID IN ( SELECT ID FROM c WHERE GROUPID = 9999)
) x

Select data from a table where only the first two columns are distinct

Background
I have a table which has six columns. The first three columns create the pk. I'm tasked with removing one of the pk columns.
I selected (using distinct) the data into a temp table (excluding the third column), and tried inserting all of that data back into the original table with the third column being '11' for every row as this is what I was instructed to do. (this column is going to be removed by a DBA after I do this)
However, when I went to insert this data back into the original table I get a pk constraint error. (shocking, I know)
The other three columns are just date columns, so the distinct select didn't create a unique pk for each record. What I'm trying to achieve is just calling a distinct on the first two columns, and then just arbitrarily selecting the three other columns as it doesn't matter which dates I choose (at least not on dev).
What I've tried
I found the following post which seems to achieve what I want:
How do I (or can I) SELECT DISTINCT on multiple columns?
I tried the answers from both Joel,and Erwin.
Attempt 1:
However, with Joels answer the set returned is too large - the inner join isn't doing what I thought it would do. Selecting distinct col1 and col2 there are 400 columns returned, however when I use his solution 600 rows are returned. I checked the data and in fact there were duplicate pk's. Here is my attempt at duplicating Joels answer:
select a.emp_no,
a.eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no, modify_dte,
modify_by_emp_no
from tempdb.guest.temp_part_time_evaluator b
inner join
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
) a
ON b.emp_no = a.emp_no AND b.eec_planning_unit_cde = a.eec_planning_unit_cde
Now, if I execute just the inner select statement 400 rows are returned. If I select the whole query 600 rows are returned? Isn't inner join supposed to only show the intersection of the two sets?
Attempt 2:
I also tried the answer from Erwin. This one has a syntax error and I'm having trouble googling the spec on the where clause (specifically, the trick he is using with (emp_no, eec_planning_unit_cde))
Here is the attempt:
select emp_no,
eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no,
modify_dte,
modify_by_emp_no
where (emp_no, eec_planning_unit_cde) IN
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
)
Now, I realize that the post I referenced is for postgresql. Doesn't T-SQL have something similar? Trying to google parenthesis isn't working too well.
Overview of Questions:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
A select distinct will be based on all columns so it does not guarantee the first two to be distinct
select pk1, pk2, '11', max(c1), max(c2), max(c3)
from table
group by pk1, pk2
You could TRY this:
SELECT a.emp_no,
a.eec_planning_unit_cde,
b.'11' as area,
b.create_dte,
b.create_by_emp_no,
b.modify_dte,
b.modify_by_emp_no
FROM
(
SELECT emp_no, eec_planning_unit_cde
FROM tempdb.guest.temp_part_time_evaluator
GROUP BY emp_no, eec_planning_unit_cde
) a
JOIN tempdb.guest.temp_part_time_evaluator b
ON a.emp_no = b.emp_no AND a.eec_planning_unit_cde = b.eec_planning_unit_cde
That would give you a distinct on those fields but if there is differences in the data between columns you might have to try a more brute force approch.
SELECT a.emp_no,
a.eec_planning_unit_cde,
a.'11' as area,
a.create_dte,
a.create_by_emp_no,
a.modify_dte,
a.modify_by_emp_no
FROM
(
SELECT ROW_NUMBER() OVER(ORDER BY emp_no, eec_planning_unit_cde) rownumber,
a.emp_no,
a.eec_planning_unit_cde,
a.'11' as area,
a.create_dte,
a.create_by_emp_no,
a.modify_dte,
a.modify_by_emp_no
FROM tempdb.guest.temp_part_time_evaluator
) a
WHERE rownumber = 1
I'll reply one by one:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
Inner join don't do an intersection. Le'ts supose this tables:
T1 T2
n s n s
1 A 2 X
2 B 2 Y
2 C
3 D
If you join both tables by numeric column you don't get the intersection (2 rows). You get:
select *
from t1 inner join t2
on t1.n = t2.n;
| N | S |
---------
| 2 | B |
| 2 | B |
| 2 | C |
| 2 | C |
And, your second query approach:
select *
from t1
where t1.n in (select n from t2);
| N | S |
---------
| 2 | B |
| 2 | C |
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
Yes, this subquery:
select *
from t1
where not exists (
select 1
from t2
where t2.n = t1.n
);
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
yes, using #JTC second query.

Sql COALESCE entire rows?

I just learned about COALESCE and I'm wondering if it's possible to COALESCE an entire row of data between two tables? If not, what's the best approach to the following ramblings?
For instance, I have these two tables and assuming that all columns match:
tbl_Employees
Id Name Email Etc
-----------------------------------
1 Sue ... ...
2 Rick ... ...
tbl_Customers
Id Name Email Etc
-----------------------------------
1 Bob ... ...
2 Dan ... ...
3 Mary ... ...
And a table with id's:
tbl_PeopleInCompany
Id CompanyId
-----------------
1 1
2 1
3 1
And I want to query the data in a way that gets rows from the first table with matching id's, but gets from second table if no id is found.
So the resulting query would look like:
Id Name Email Etc
-----------------------------------
1 Sue ... ...
2 Rick ... ...
3 Mary ... ...
Where Sue and Rick was taken from the first table, and Mary from the second.
SELECT Id, Name, Email, Etc FROM tbl_Employees
WHERE Id IN (SELECT ID From tbl_PeopleInID)
UNION ALL
SELECT Id, Name, Email, Etc FROM tbl_Customers
WHERE Id IN (SELECT ID From tbl_PeopleInID) AND
Id NOT IN (SELECT Id FROM tbl_Employees)
Depending on the number of rows, there are several different ways to write these queries (with JOIN and EXISTS), but try this first.
This query first selects all the people from tbl_Employees that have an Id value in your target list (the table tbl_PeopleInID). It then adds to the "bottom" of this bunch of rows the results of the second query. The second query gets all tbl_Customer rows with Ids in your target list but excluding any with Ids that appear in tbl_Employees.
The total list contains the people you want — all Ids from tbl_PeopleInID with preference given to Employees but missing records pulled from Customers.
You can also do this:
1) Outer Join the two tables on tbl_Employees.Id = tbl_Customers.Id. This will give you all the rows from tbl_Employees and leave the tbl_Customers columns null if there is no matching row.
2) Use CASE WHEN to select either the tbl_Employees column or tbl_Customers column, based on whether tbl_Customers.Id IS NULL, like this:
CASE WHEN tbl_Customers.Id IS NULL THEN tbl_Employees.Name ELSE tbl_Customers.Name END AS Name
(My syntax might not be perfect there, but the technique is sound).
This should be pretty performant. It uses a CTE to basically build a small table of Customers that have no matching Employee records, and then it simply UNIONs that result with the Employee records
;WITH FilteredCustomers (Id, Name, Email, Etc)
AS
(
SELECT Id, Name, Email, Etc
FROM tbl_Customers C
INNER JOIN tbl_PeopleInCompany PIC
ON C.Id = PIC.Id
LEFT JOIN tbl_Employees E
ON C.Id = E.Id
WHERE E.Id IS NULL
)
SELECT Id, Name, Email, Etc
FROM tbl_Employees E
INNER JOIN tbl_PeopleInCompany PIC
ON C.Id = PIC.Id
UNION
SELECT Id, Name, Email, Etc
FROM FilteredCustomers
Using the IN Operator can be rather taxing on large queries as it might have to evaluate the subquery for each record being processed.
I don't think the COALESCE function can be used for what you're thinking. COALESCE is similar to ISNULL, except it allows you to pass in multiple columns, and will return the first non-null value:
SELECT Name, Class, Color, ProductNumber,
COALESCE(Class, Color, ProductNumber) AS FirstNotNull
FROM Production.Product
This article should explain it's application:
http://msdn.microsoft.com/en-us/library/ms190349.aspx
It sounds like Larry Lustig's answer is more along the lines of what you need though.