BQ/SQL join two tables in a way that one column fills up with all distinct values from the other table while remaining columns get a null

Hello everyone, this is my first question here. I have been browsing through the questions but couldn't quite find the answer to my problem:
I have a couple of tables which I need to join. The key I join on is non-unique (in this case it's a date). This is working fine, but now I also need to group the results by another column without getting cross-join-like results (meaning each value of this column should appear only once, although the column can hold different values in each table).
Here is an example of what I have and what I would like to get:
Table1
Date/Key    Group Column  Example Value1
01-01-2022  a             1
01-01-2022  d             2
01-01-2022  e             3
01-01-2022  f             4
Table 2
Date/Key    Group Column  Example Value 2
01-01-2022  a             1
01-01-2022  b             2
01-01-2022  c             3
01-01-2022  d             4
Wanted Result :
Table Result
Date/Key    Group Column  Example Value1  Example Value2
01-01-2022  a             1               1
01-01-2022  b             NULL            2
01-01-2022  c             NULL            3
01-01-2022  d             2               4
01-01-2022  e             3               NULL
01-01-2022  f             4               NULL
I have tried a couple of approaches, but I always get results where values in the group column appear multiple times. I am under the impression that full joining and then grouping over the group column should work, but apparently I am missing something. I also figured I could brute-force the result by left joining everything, setting the ON to table1.date = table2.date AND table1.Groupcolumn = table2.Groupcolumn etc., and then doing UNIONs of all permutations (so each table is on "the left" once), but this is not only tedious, BigQuery also doesn't like it since it contains too many subqueries.
I feel kind of bad that my first question is something that I should actually know, but I hope someone can help me out!
I do not need a full code solution; a hint to the correct approach would suffice (also, in case I missed it: if this was already answered, I would appreciate just a link to it!)
Edit:
So one solution I came up with, which appears to work, was to select the group column of each table, union them in a WITH clause, and then join this "list" onto the first table, like:
WITH list AS (
SELECT t1.GroupColumn FROM Table_1 t1 WHERE CONDITION1
UNION DISTINCT
SELECT t1.GroupColumn FROM Table_1 t1 WHERE CONDITION2 ... etc.
),
result AS (
SELECT s.GroupColumn, t1.Example_Value1, t2.Example_Value2
FROM Table_1 t1
LEFT JOIN (SELECT * FROM list) s
ON s.GroupColumn = t1.GroupColumn
LEFT JOIN Table_2 t2
ON s.GroupColumn = t2.GroupColumn
AND t1.key = t2.key
...
)
SELECT * FROM result

I think what you are looking for is a FULL OUTER JOIN and then you can coalesce the date and group columns. It doesn't exactly look like you need to group anything based on the example data you posted:
SELECT
coalesce(table1.date_key, table2.date_key) AS date_key,
coalesce(table1.group_column, table2.group_column) AS group_column,
table1.example_value_1,
table2.example_value_2
FROM
table1
FULL OUTER JOIN
table2
USING
(date_key,
group_column)
ORDER BY
date_key,
group_column;
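
To try this idea on the sample data without a BigQuery project, here is a rough sketch using Python's built-in sqlite3 module. SQLite only gained FULL OUTER JOIN in version 3.39, so the sketch emulates it with a LEFT JOIN plus a UNION ALL of the table2 rows that found no partner; that emulation, the snake_case column names and the date strings are my assumptions, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (date_key TEXT, group_column TEXT, example_value_1 INTEGER);
CREATE TABLE table2 (date_key TEXT, group_column TEXT, example_value_2 INTEGER);
INSERT INTO table1 VALUES
    ('01-01-2022', 'a', 1), ('01-01-2022', 'd', 2),
    ('01-01-2022', 'e', 3), ('01-01-2022', 'f', 4);
INSERT INTO table2 VALUES
    ('01-01-2022', 'a', 1), ('01-01-2022', 'b', 2),
    ('01-01-2022', 'c', 3), ('01-01-2022', 'd', 4);
""")

# Emulated full outer join: every table1 row with its match (if any),
# plus the table2 rows that have no match in table1.
full_join_rows = conn.execute("""
SELECT t1.date_key, t1.group_column, t1.example_value_1, t2.example_value_2
FROM table1 t1
LEFT JOIN table2 t2
  ON t1.date_key = t2.date_key AND t1.group_column = t2.group_column
UNION ALL
SELECT t2.date_key, t2.group_column, NULL, t2.example_value_2
FROM table2 t2
WHERE NOT EXISTS (
    SELECT 1 FROM table1 t1
    WHERE t1.date_key = t2.date_key AND t1.group_column = t2.group_column
)
ORDER BY 1, 2
""").fetchall()

for row in full_join_rows:
    print(row)
```

Joining on both the date and the group column is what keeps each group value on a single row; joining on the date alone is what produces the cross-join-like blow-up described in the question.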

Consider the simple approach below:
select * from (
select *, 'example_value1' type from table1 union all
select *, 'example_value2' type from table2
)
pivot (
any_value(example_value1)
for type in ('example_value1', 'example_value2')
)
if applied to sample data in your question - output is
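
For engines without a PIVOT operator, roughly the same reshaping can be sketched with UNION ALL plus conditional aggregation. The snippet below is only an illustration in Python's sqlite3 module, not BigQuery output; the `value` and `type` column names and the use of MAX in place of any_value are my assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (date_key TEXT, group_column TEXT, example_value1 INTEGER);
CREATE TABLE table2 (date_key TEXT, group_column TEXT, example_value2 INTEGER);
INSERT INTO table1 VALUES
    ('01-01-2022', 'a', 1), ('01-01-2022', 'd', 2),
    ('01-01-2022', 'e', 3), ('01-01-2022', 'f', 4);
INSERT INTO table2 VALUES
    ('01-01-2022', 'a', 1), ('01-01-2022', 'b', 2),
    ('01-01-2022', 'c', 3), ('01-01-2022', 'd', 4);
""")

# Stack both tables, tag each row with its source, then spread the tagged
# values back out into one column per source.
pivot_rows = conn.execute("""
SELECT date_key,
       group_column,
       MAX(CASE WHEN type = 'example_value1' THEN value END) AS example_value1,
       MAX(CASE WHEN type = 'example_value2' THEN value END) AS example_value2
FROM (
    SELECT date_key, group_column, example_value1 AS value,
           'example_value1' AS type
    FROM table1
    UNION ALL
    SELECT date_key, group_column, example_value2 AS value,
           'example_value2' AS type
    FROM table2
)
GROUP BY date_key, group_column
ORDER BY date_key, group_column
""").fetchall()

for row in pivot_rows:
    print(row)
```

A group that exists in only one table has no row tagged with the other source, so its aggregate over that source is NULL.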

Related

Filter values based on another column in SQL

I have a table with purchase orders, the orders' lines and a code for each line
Order_ID  LINE  CODE
A0001     1     aaaa
A0002     1     bbbb
A0002     2     xxxx
A0003     1     cccc
A0004     1     xxxx
A0004     2     dddd
And I need to filter out all the Orders that have at least one line with the code 'xxxx':
Order_ID  LINE  CODE
A0001     1     aaaa
A0003     1     cccc
I thought something like this:
SELECT *
FROM MyTable
WHERE Order_ID not in (SELECT * FROM MyTable WHERE CODE = 'xxxx')
BUT, the big problem here is that I'm working with a pretty big query, so the subquery is also too large and the whole query takes a long time to run. Is there any workaround to avoid the subquery?

Beyond the issue others have pointed out with your syntax, the big problem is the subquery being called for each row. I've done a test on a similar big table with a non-unique field (like your Order_ID), and found very large savings by structuring the query with a CTE:
WITH bad_ids AS (
SELECT DISTINCT Order_ID
FROM MyTable
WHERE CODE = 'xxxx'
)
SELECT *
FROM MyTable m
LEFT JOIN bad_ids b
ON m.Order_ID = b.Order_ID
WHERE b.Order_ID IS NULL;
Here I'm assuming Order_ID is not allowed to be null.
Given the huge difference I saw in performance, I'd be surprised if you didn't see the same, no matter what the differences are between query planners.
Edit: Essentially this works because the above query is searching through the table twice and joining once. Your query is searching some subset of the table as many times as you have rows.
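
As a quick sanity check of the anti-join on the sample orders, here is a small sketch using Python's sqlite3 module as a stand-in engine (my choice for illustration only; it says nothing about performance on the real table). It also runs a NOT EXISTS variant, which I added for comparison, and both should return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MyTable (Order_ID TEXT NOT NULL, LINE INTEGER, CODE TEXT);
INSERT INTO MyTable VALUES
    ('A0001', 1, 'aaaa'),
    ('A0002', 1, 'bbbb'), ('A0002', 2, 'xxxx'),
    ('A0003', 1, 'cccc'),
    ('A0004', 1, 'xxxx'), ('A0004', 2, 'dddd');
""")

# Anti-join: collect the offending orders once, then keep only the rows
# that fail to match any of them.
anti_join_rows = conn.execute("""
WITH bad_ids AS (
    SELECT DISTINCT Order_ID FROM MyTable WHERE CODE = 'xxxx'
)
SELECT m.Order_ID, m.LINE, m.CODE
FROM MyTable m
LEFT JOIN bad_ids b ON m.Order_ID = b.Order_ID
WHERE b.Order_ID IS NULL
ORDER BY m.Order_ID, m.LINE
""").fetchall()

# The same filter written with NOT EXISTS.
not_exists_rows = conn.execute("""
SELECT m.Order_ID, m.LINE, m.CODE
FROM MyTable m
WHERE NOT EXISTS (
    SELECT 1 FROM MyTable x
    WHERE x.Order_ID = m.Order_ID AND x.CODE = 'xxxx'
)
ORDER BY m.Order_ID, m.LINE
""").fetchall()

print(anti_join_rows)
```

Both forms drop the whole order, including lines such as A0002 line 1 whose own code is not 'xxxx'.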

The column Order_ID should be compared with itself, not with any column (*):
SELECT *
FROM MyTable
WHERE Order_ID not in (SELECT Order_ID FROM MyTable WHERE CODE = 'xxxx')

SELECT *
FROM MyTable
WHERE CODE <> 'xxxx'

Count number of repeats in SQL

I tried to solve one problem but without success.
I have two lists of numbers
{1,2,3,4}
{5,6,7,8,9}
And I have a table
ID Number
1 1
1 2
1 7
1 2
1 6
2 8
2 7
2 3
2 9
Now I need to count how many times a number from the second list comes after a number from the first list, but I should count at most one match per ID.
In the example table above the result should be 2:
there are three matching pairs, but because we have only two different IDs the result is 2 instead of 3.
Pairs:
1 2
1 7
1 2
1 6
2 3
2 9
Note: I work with MSSQL.
Edit: There is one more column, Date, which determines the order.
Edit2 - Solution
I wrote this query:
SELECT * FROM table t
left JOIN table tt ON tt.ID = t.ID
AND tt.Date > t.Date
AND t.Number IN (1,2,3,4)
AND tt.Number IN (6,7,8,9)
After this I planned to group by ID and use only one match for each ID, but execution takes a lot of time.

Here is a query that would do it:
select a.id, min(a.number) as a, min(b.number) as b
from mytable a
inner join mytable b
on a.id = b.id
and a.date < b.date
and b.number in (5,6,7,8,9)
where a.number in (1,2,3,4)
group by a.id
Output is:
id a b
1 1 6
2 3 9
So the two pairs are output each on one line, with the value a belonging to the first group of numbers, and the value of column b to the second group.
Here is a fiddle
Comments on attempt (edit 2 to question)
Later you added a query attempt to your question. Some comments about that attempt:
You don't need a left join because you really want to have a match for both values. inner join has in general better performance, so use that.
The condition t.Number IN (1,2,3,4) does not belong in the on clause. In combination with a left join the result will include t records that violate this condition. It should be put in the where clause.
Your concern about performance may be warranted, but can be resolved by adding a useful index on your table, i.e. on (id, number, date) or (id, date, number)
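
To see the self-join and the suggested index together, here is a minimal sketch with Python's sqlite3 module standing in for MSSQL. The question does not show the Date values, so the sketch assumes they simply follow the row order of the example table; the index name is made up as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (id INTEGER, number INTEGER, date INTEGER);
-- date is assumed to follow the row order shown in the question
INSERT INTO mytable VALUES
    (1, 1, 1), (1, 2, 2), (1, 7, 3), (1, 2, 4), (1, 6, 5),
    (2, 8, 1), (2, 7, 2), (2, 3, 3), (2, 9, 4);
-- hypothetical index name; the column order is one of the two suggested above
CREATE INDEX idx_mytable_id_date_number ON mytable (id, date, number);
""")

# One output row per id that has a first-list number followed later
# by a second-list number.
pairs = conn.execute("""
SELECT a.id, MIN(a.number) AS a, MIN(b.number) AS b
FROM mytable a
INNER JOIN mytable b
   ON a.id = b.id
  AND a.date < b.date
  AND b.number IN (5, 6, 7, 8, 9)
WHERE a.number IN (1, 2, 3, 4)
GROUP BY a.id
ORDER BY a.id
""").fetchall()

match_count = len(pairs)  # number of ids with at least one match
print(pairs, match_count)
```

Note that MIN(a.number) and MIN(b.number) are computed independently, so the two values on a row are not guaranteed to come from the same matched pair; for the count per ID that does not matter.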

Select data from a table where only the first two columns are distinct

Background
I have a table which has six columns. The first three columns create the pk. I'm tasked with removing one of the pk columns.
I selected (using distinct) the data into a temp table (excluding the third column), and tried inserting all of that data back into the original table with the third column being '11' for every row as this is what I was instructed to do. (this column is going to be removed by a DBA after I do this)
However, when I went to insert this data back into the original table I get a pk constraint error. (shocking, I know)
The other three columns are just date columns, so the distinct select didn't create a unique pk for each record. What I'm trying to achieve is just calling a distinct on the first two columns, and then just arbitrarily selecting the three other columns as it doesn't matter which dates I choose (at least not on dev).
What I've tried
I found the following post which seems to achieve what I want:
How do I (or can I) SELECT DISTINCT on multiple columns?
I tried the answers from both Joel and Erwin.
Attempt 1:
However, with Joel's answer the set returned is too large - the inner join isn't doing what I thought it would do. Selecting distinct col1 and col2 returns 400 rows; however, when I use his solution 600 rows are returned. I checked the data and there were in fact duplicate pk's. Here is my attempt at duplicating Joel's answer:
select a.emp_no,
a.eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no, modify_dte,
modify_by_emp_no
from tempdb.guest.temp_part_time_evaluator b
inner join
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
) a
ON b.emp_no = a.emp_no AND b.eec_planning_unit_cde = a.eec_planning_unit_cde
Now, if I execute just the inner select statement 400 rows are returned. If I select the whole query 600 rows are returned? Isn't inner join supposed to only show the intersection of the two sets?
Attempt 2:
I also tried the answer from Erwin. This one has a syntax error and I'm having trouble googling the spec on the where clause (specifically, the trick he is using with (emp_no, eec_planning_unit_cde))
Here is the attempt:
select emp_no,
eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no,
modify_dte,
modify_by_emp_no
where (emp_no, eec_planning_unit_cde) IN
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
)
Now, I realize that the post I referenced is for postgresql. Doesn't T-SQL have something similar? Trying to google parenthesis isn't working too well.
Overview of Questions:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?

A select distinct will be based on all columns so it does not guarantee the first two to be distinct
select pk1, pk2, '11', max(c1), max(c2), max(c3)
from table
group by pk1, pk2
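
A small sketch of the difference, using Python's sqlite3 module and a made-up `evaluator` table with shortened column names (all data here is hypothetical): SELECT DISTINCT keeps both rows for a key pair whose date columns differ, while GROUP BY on the two key columns collapses them to one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE evaluator (
    emp_no INTEGER, unit_cde TEXT, area TEXT,
    create_dte TEXT, modify_dte TEXT
);
-- hypothetical rows; (1, 'U1') appears under two different areas
INSERT INTO evaluator VALUES
    (1, 'U1', '10', '2020-01-01', '2020-02-01'),
    (1, 'U1', '12', '2020-03-01', '2020-04-01'),
    (2, 'U1', '10', '2020-05-01', '2020-06-01'),
    (2, 'U2', '10', '2020-07-01', '2020-08-01');
""")

# DISTINCT looks at every selected column, so the differing dates keep
# both (1, 'U1') rows and the new key would collide.
distinct_rows = conn.execute("""
SELECT DISTINCT emp_no, unit_cde, '11' AS area, create_dte, modify_dte
FROM evaluator
""").fetchall()

# GROUP BY on the two key columns yields exactly one row per pair.
grouped_rows = conn.execute("""
SELECT emp_no, unit_cde, '11' AS area, MAX(create_dte), MAX(modify_dte)
FROM evaluator
GROUP BY emp_no, unit_cde
ORDER BY emp_no, unit_cde
""").fetchall()

print(len(distinct_rows), len(grouped_rows))
```

Each MAX is taken per column, so the dates on a grouped row may come from different source rows, which the question says is acceptable here.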

You could TRY this:
SELECT a.emp_no,
a.eec_planning_unit_cde,
'11' as area,
b.create_dte,
b.create_by_emp_no,
b.modify_dte,
b.modify_by_emp_no
FROM
(
SELECT emp_no, eec_planning_unit_cde
FROM tempdb.guest.temp_part_time_evaluator
GROUP BY emp_no, eec_planning_unit_cde
) a
JOIN tempdb.guest.temp_part_time_evaluator b
ON a.emp_no = b.emp_no AND a.eec_planning_unit_cde = b.eec_planning_unit_cde
That would give you a distinct on those fields, but if there are differences in the data between columns you might have to try a more brute-force approach.
SELECT a.emp_no,
a.eec_planning_unit_cde,
a.area,
a.create_dte,
a.create_by_emp_no,
a.modify_dte,
a.modify_by_emp_no
FROM
(
-- number the rows within each (emp_no, eec_planning_unit_cde) pair;
-- ordering by create_dte is an arbitrary choice, any column would do
SELECT ROW_NUMBER() OVER(PARTITION BY emp_no, eec_planning_unit_cde ORDER BY create_dte) rownumber,
emp_no,
eec_planning_unit_cde,
'11' as area,
create_dte,
create_by_emp_no,
modify_dte,
modify_by_emp_no
FROM tempdb.guest.temp_part_time_evaluator
) a
WHERE rownumber = 1

I'll reply one by one:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
An inner join doesn't compute an intersection. Let's suppose we have these tables:
T1      T2
n  s    n  s
1  A    2  X
2  B    2  Y
2  C
3  D
If you join both tables by numeric column you don't get the intersection (2 rows). You get:
select *
from t1 inner join t2
on t1.n = t2.n;
| N | S |
---------
| 2 | B |
| 2 | B |
| 2 | C |
| 2 | C |
And, your second query approach:
select *
from t1
where t1.n in (select n from t2);
| N | S |
---------
| 2 | B |
| 2 | C |
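
The two result sets above can be reproduced with a short sketch in Python's sqlite3 module (used here only as a convenient engine for the illustration): the join emits one row per matching pair, while the IN filter emits each T1 row at most once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (n INTEGER, s TEXT);
CREATE TABLE t2 (n INTEGER, s TEXT);
INSERT INTO t1 VALUES (1, 'A'), (2, 'B'), (2, 'C'), (3, 'D');
INSERT INTO t2 VALUES (2, 'X'), (2, 'Y');
""")

# Inner join: every t1 row is repeated once per matching t2 row.
join_rows = conn.execute("""
SELECT t1.n, t1.s
FROM t1 INNER JOIN t2 ON t1.n = t2.n
ORDER BY t1.s
""").fetchall()

# Semi-join via IN: a t1 row is kept or dropped, never multiplied.
in_rows = conn.execute("""
SELECT n, s
FROM t1
WHERE n IN (SELECT n FROM t2)
ORDER BY s
""").fetchall()

print(join_rows)
print(in_rows)
```

With two T1 rows and two T2 rows sharing n = 2, the join produces 2 x 2 = 4 rows, which is the same effect that turned 400 distinct key pairs into 600 rows in the question.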
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
Yes, this subquery:
select *
from t1
where exists (
select 1
from t2
where t2.n = t1.n
);
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
Yes, using @JTC's second query.

Select a subgroup of records by one distinct column

Sorry if this has been answered before, but all the related questions didn't quite seem to match my purpose.
I have a table that looks like the following:
ID POSS_PHONE CELL_FLAG
=======================
1 111-111-1111 0
2 222-222-2222 0
2 333-333-3333 1
3 444-444-4444 1
I want to select only distinct ID values for an insert, but I don't care which specific ID gets pulled out of the duplicates.
For Example(a valid SELECT would be):
1 111-111-1111 0
2 222-222-2222 0
3 444-444-4444 1
Before I had the CELL_FLAG column, I was just using an aggregate function as so:
SELECT ID, MAX(POSS_PHONE)
FROM TableA
GROUP BY ID
But I can't do:
SELECT ID, MAX(POSS_PHONE), MAX(CELL_FLAG)...
because I would lose integrity within the row, correct?
I've seen some similar examples using CTEs, but once again, nothing that quite fit.
So maybe this is solvable by a CTE or some type of self-join subquery? I'm at a block right now, so I can't see any other solutions.

Just get your aggregation in a subquery and join to it:
SELECT a.ID, sub.Poss_Phone, CELL_FLAG
FROM TableA as a
INNER JOIN (SELECT ID, MAX(POSS_PHONE) as [Poss_Phone]
FROM TableA
GROUP BY ID) Sub
ON Sub.ID = a.ID and SUB.Poss_Phone = A.Poss_Phone
This will keep integrity between your non-aggregated fields but still give you the MAX(Poss_Phone) per ID.
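
Here is the same join run against the sample rows, as a sketch in Python's sqlite3 module (the bracketed alias is replaced by a plain one, since the engine is only a stand-in for SQL Server):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableA (ID INTEGER, POSS_PHONE TEXT, CELL_FLAG INTEGER);
INSERT INTO TableA VALUES
    (1, '111-111-1111', 0),
    (2, '222-222-2222', 0),
    (2, '333-333-3333', 1),
    (3, '444-444-4444', 1);
""")

# Pick one phone per ID in the subquery, then join back to fetch the
# CELL_FLAG that belongs to that exact row.
picked_rows = conn.execute("""
SELECT a.ID, sub.Poss_Phone, a.CELL_FLAG
FROM TableA AS a
INNER JOIN (
    SELECT ID, MAX(POSS_PHONE) AS Poss_Phone
    FROM TableA
    GROUP BY ID
) AS sub
  ON sub.ID = a.ID AND sub.Poss_Phone = a.POSS_PHONE
ORDER BY a.ID
""").fetchall()

print(picked_rows)
```

One caveat: if an ID has two rows with the identical maximum phone number, both rows join and the ID shows up twice.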

Add Column values in sql server query

I have result of two queries like:
Result of query 1
ID Value
1 4
2 0
3 6
4 9
Result of query 2
ID Value
1 6
2 4
3 0
4 1
I want to add values column "Value" and show final result:
Result of Both queries
ID Value
1 10
2 4
3 6
4 10
Please guide me.

select id, sum(value) as value
from (
select id, value from query1
union all
select id, value from query2
) x
group by id

Try using a JOIN:
SELECT
T1.ID,
T1.Value + T2.Value AS Value
FROM (...query1...) AS T1
JOIN (...query2...) AS T2
ON T1.Id = T2.Id
You may also need to consider what should happen if there is an Id present in one result but not in the other. The current query will omit it from the results. You may want to investigate OUTER JOIN as an alternative.
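
To illustrate the missing-ID caveat, here is a sketch in Python's sqlite3 module. The two query results are stored as tables named `query1` and `query2`, and an extra ID 5 that exists only in the first result is added; both the table names and that extra row are assumptions for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE query1 (id INTEGER, value INTEGER);
CREATE TABLE query2 (id INTEGER, value INTEGER);
-- id 5 is a hypothetical extra row present in only one result
INSERT INTO query1 VALUES (1, 4), (2, 0), (3, 6), (4, 9), (5, 7);
INSERT INTO query2 VALUES (1, 6), (2, 4), (3, 0), (4, 1);
""")

# Inner join: ids missing from either side silently disappear.
join_rows = conn.execute("""
SELECT q1.id, q1.value + q2.value AS value
FROM query1 q1
JOIN query2 q2 ON q1.id = q2.id
ORDER BY q1.id
""").fetchall()

# UNION ALL + GROUP BY: every id from either side survives.
union_rows = conn.execute("""
SELECT id, SUM(value) AS value
FROM (
    SELECT id, value FROM query1
    UNION ALL
    SELECT id, value FROM query2
) x
GROUP BY id
ORDER BY id
""").fetchall()

print(join_rows)
print(union_rows)
```

The UNION ALL form keeps ID 5 without needing an outer join or any NULL handling.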

A not particularly nice but fairly easy to comprehend way would be:
SELECT COALESCE(t1.ID, t2.ID) AS ID,
COALESCE(t1.Value, 0) + COALESCE(t2.Value, 0) AS Value
FROM
(SELECT IDColumn AS ID, ValueColumn AS Value FROM TableA) t1
FULL OUTER JOIN
(SELECT IDColumn AS ID, ValueColumn AS Value FROM TableB) t2
ON t1.ID = t2.ID
It has the benefits of
a) I don't know your actual table structure so you should be able to work out how to get the two 'SELECT's working from your original queries
b) If an ID appears in only one of the two tables, that's fine
