Consolidate union result - sql

I am selecting email addresses from two source tables via a union query. Normally, I would rely on union's behavior of filtering out duplicate values, but in this case each source table may or may not have a value for each person, and I have to give "priority" to the email address from the first source table when it exists.
When the same email address exists in both sources, I would like to omit the line from source 2 where the email matches that from source 1, as shown below. Assuming CURRENT DATA, how can I make a new selection from CURRENT DATA to arrive at DESIRED RESULT?
CURRENT DATA
PERSONID  EMAILADDRESS        SOURCE
7538583   email#example       1
7538583   email#example       2
7538583   person#somecompany  2
DESIRED RESULT
PERSONID  EMAILADDRESS        SOURCE
7538583   email#example       1
7538583   person#somecompany  2

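You can use a correlated subquery in the second part of the union. Match on the email address as well as the person, so that only the duplicated row from source 2 is dropped and the person's other addresses are kept:
select personid,
emailaddress,
1 as source
from table1
union
select personid,
emailaddress,
2 as source
from table2
where not exists (select 1
from table1
where table1.personid = table2.personid
and table1.emailaddress = table2.emailaddress);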
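To check the idea against the sample data, here is a minimal sketch that runs the query in an in-memory SQLite database through Python's sqlite3 module. SQLite is an assumption made only so the example can be executed (the question does not name a database), and the NOT EXISTS here compares emailaddress as well as personid, since the desired result keeps the second address from source 2.

```python
import sqlite3

# In-memory SQLite database (an assumption; the question names no engine).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (personid INTEGER, emailaddress TEXT);
    CREATE TABLE table2 (personid INTEGER, emailaddress TEXT);
    INSERT INTO table1 VALUES (7538583, 'email#example');
    INSERT INTO table2 VALUES (7538583, 'email#example'),
                              (7538583, 'person#somecompany');
""")

# Source 1 always wins; a source 2 row is kept only when the same
# person/email pair is not already present in source 1.
rows = conn.execute("""
    SELECT personid, emailaddress, 1 AS source FROM table1
    UNION ALL
    SELECT personid, emailaddress, 2 AS source FROM table2
    WHERE NOT EXISTS (
        SELECT 1 FROM table1
        WHERE table1.personid = table2.personid
          AND table1.emailaddress = table2.emailaddress
    )
    ORDER BY source, emailaddress
""").fetchall()

for row in rows:
    print(row)
```

UNION ALL is used because the constant source column already makes rows from the two branches distinct, so plain UNION would not remove anything extra here.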

Related

SQL to find record with a value in one table but not another table

I have two tables: Address and Address_Backup. Each table can contain multiple rows of data for a single individual because there are multiple address types stored in this table. So the data looks something like this:
ID    Code   Description  Description2  City   State
6798  HOME   478 Elm      NULL          Boise  ID
6798  OTHER  405 S Main   NULL          NULL   NULL
Address_Backup is supposed to be identical to Address, but we've found some data that exists in Address_Backup that doesn't exist in Address. I need a query that joins the ID numbers in the two tables and returns data for addresses where the "Other" code type exists in the Address_Backup table but does not exist in the Address table.
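Assuming your IDs do not change in either table, so that ID 100 in Address always refers to the same individual as ID 100 in Address_Backup, you can use NOT IN. Because an individual can have several rows, the subquery has to filter on the code as well:
SELECT *
FROM Address_Backup
WHERE Code = 'OTHER'
AND ID NOT IN (SELECT ID FROM Address WHERE Code = 'OTHER')
Or, using NOT EXISTS:
SELECT *
FROM Address_Backup AS B
WHERE B.Code = 'OTHER'
AND NOT EXISTS (SELECT 1 FROM Address AS A WHERE A.ID = B.ID AND A.Code = 'OTHER')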
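A small runnable sketch of the anti-join, assuming SQLite through Python's sqlite3 and a trimmed-down version of the sample data in which the OTHER row is missing from Address (that gap is made up for the illustration). The subquery filters on the code as well as the ID, because one individual can have several rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Address (ID INTEGER, Code TEXT, Description TEXT);
    CREATE TABLE Address_Backup (ID INTEGER, Code TEXT, Description TEXT);
    -- Hypothetical state: the OTHER row exists only in the backup.
    INSERT INTO Address VALUES (6798, 'HOME', '478 Elm');
    INSERT INTO Address_Backup VALUES (6798, 'HOME', '478 Elm'),
                                      (6798, 'OTHER', '405 S Main');
""")

# Backup rows with code OTHER that have no OTHER row for the same ID in Address.
rows = conn.execute("""
    SELECT B.ID, B.Code, B.Description
    FROM Address_Backup AS B
    WHERE B.Code = 'OTHER'
      AND NOT EXISTS (
          SELECT 1 FROM Address AS A
          WHERE A.ID = B.ID AND A.Code = 'OTHER'
      )
""").fetchall()

print(rows)
```

Matching on ID alone would return nothing for this data, because ID 6798 still has its HOME row in Address.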

How to delete records in BigQuery based on values in an array?

In Google BigQuery, I would like to delete a subset of records, based on the value of a specific column. It's a query that I need to run repeatedly and that I would like to run automatically.
The problem is that this specific column is of the form STRUCT<column_1 ARRAY<STRING>, column_2 ARRAY<STRING>, ...>, and I don't know how to use such a column in the WHERE clause of a DELETE statement.
Here is basically what I am trying to do (this code does not work):
DELETE
FROM dataset.table t
LEFT JOIN UNNEST(t.category.column_1) AS type
WHERE t.partition_date = '2020-07-22'
AND type = 'some_value'
The error that I'm getting is: Syntax error: Expected end of input but got keyword LEFT at [3:1]
If I replace the DELETE with SELECT *, it does work:
SELECT *
FROM dataset.table t
LEFT JOIN UNNEST(t.category.column_1) AS type
WHERE t.partition_date = '2020-07-22'
AND type = 'some_value'
Does somebody know how to use such a column to delete a subset of records?
EDIT:
Here is some code to create a reproducible example with some silly data (fill in your own dataset and table name in all queries):
Suppose you want to delete all rows where category.type contains the value 'food'.
1 - create a table:
CREATE TABLE <DATASET>.<TABLE_NAME>
(
article STRING,
category STRUCT<
color STRING,
type ARRAY<STRING>
>
);
2 - Insert data into the new table:
INSERT <DATASET>.<TABLE_NAME>
SELECT "apple" AS article, STRUCT('red' AS color, ['fruit','food'] as type) AS category
UNION ALL
SELECT "cabbage" AS article, STRUCT('blue' AS color, ['vegetable', 'food'] as type) AS category
UNION ALL
SELECT "book" AS article, STRUCT('red' AS color, ['object'] as type) AS category
UNION ALL
SELECT "dog" AS article, STRUCT('green' AS color, ['animal', 'pet'] as type) AS category;
3 - Show that select works (return all rows where category.type contains the value 'food'; these are the rows I want to delete):
SELECT *
FROM <DATASET>.<TABLE_NAME>
LEFT JOIN UNNEST(category.type) type
WHERE type = 'food'
4 - My attempt at deleting rows where category.type contains 'food' does not work:
DELETE
FROM <DATASET>.<TABLE_NAME>
LEFT JOIN UNNEST(category.type) type
WHERE type = 'food'
Syntax error: Unexpected keyword LEFT at [3:1]
This is the code I used to delete the desired records (the records where category.type contains the value 'food'.)
DELETE
FROM <DATASET>.<TABLE_NAME> t1
WHERE EXISTS(SELECT 1 FROM UNNEST(t1.category.type) t2 WHERE t2 = 'food')
The embarrassing thing is that I've seen this kind of answer on similar questions (for example, on UPDATE queries). But I come from Oracle SQL, where I think you are required to connect the subquery to the main query in the subquery's WHERE clause (i.e. connect t1 with t2), so I didn't understand these answers. That's why I posted this question.
However, I learned that BigQuery already knows how to connect table t1 and 'table' t2, because t2 is unnested from the array of the current t1 row; you don't have to connect them explicitly.
It is still possible to connect them explicitly (perhaps even recommended?):
DELETE
FROM <DATASET>.<TABLE_NAME> t1
WHERE EXISTS (
SELECT 1
FROM <DATASET>.<TABLE_NAME> t2
LEFT JOIN UNNEST(t2.category.type) AS type
WHERE type = 'food'
AND t1.article = t2.article)
But a second difficulty for me was that the ID in my actual data is somehow hidden in an array>struct-construction, so I got stuck connecting t1 and t2. Fortunately, this is not always an absolute necessity.
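To make the implicit connection concrete, here is a plain-Python sketch (not BigQuery code) of what the EXISTS over UNNEST(t1.category.type) does: the array being searched belongs to the row currently under test, so no extra join condition is needed. The in-memory list and the helper name contains_type are hypothetical and only mirror the example table above.

```python
# Hypothetical in-memory copy of the example table; each row carries its own array.
rows = [
    {"article": "apple", "category": {"color": "red", "type": ["fruit", "food"]}},
    {"article": "cabbage", "category": {"color": "blue", "type": ["vegetable", "food"]}},
    {"article": "book", "category": {"color": "red", "type": ["object"]}},
    {"article": "dog", "category": {"color": "green", "type": ["animal", "pet"]}},
]


def contains_type(row, value):
    """Mimic EXISTS(SELECT 1 FROM UNNEST(row.category.type) t WHERE t = value)."""
    return any(t == value for t in row["category"]["type"])


# DELETE ... WHERE EXISTS(...) keeps exactly the rows for which the test is false.
remaining = [row for row in rows if not contains_type(row, "food")]
remaining_articles = [row["article"] for row in remaining]
print(remaining_articles)
```

The check runs once per row against that row's own type array, which is the role the correlated UNNEST plays in the DELETE statement.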
Since you did not provide any sample data, I am going to explain using some dummy data. If you add your sample data, I can update the answer.
Firstly, according to your description, you have only a STRUCT, not an ARRAY<STRUCT<col_1, col_2>>. For this reason, you do not need UNNEST to access the values within it. Below is an example of how to access a particular field within a STRUCT.
WITH data AS (
SELECT 1 AS id, STRUCT("Alex" AS name, 30 AS age, "NYC" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Leo" AS name, 18 AS age, "Sydney" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Robert" AS name, 25 AS age, "Paris" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Mary" AS name, 28 AS age, "London" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Ralph" AS name, 45 AS age, "London" AS city) AS info
)
SELECT * FROM data
WHERE info.city = "London"
Notice that the STRUCT is named info and the field we accessed is city, which we used in the WHERE clause.
Now, to delete the rows that contain a specific value within the STRUCT (in your case I assume it would be your_struct.column_1), you can use DELETE, or MERGE with DELETE. I saved the above data in a table to run the examples below, which produce the same output.
First method: DELETE
DELETE FROM `project.dataset.table`
WHERE info.city = "Sydney"
Second method: MERGE and DELETE
MERGE `project.dataset.table` a
USING (SELECT * FROM `project.dataset.table` WHERE info.city = "Sydney") b
ON a.info.city = b.info.city
WHEN MATCHED AND b.id = 1 THEN
DELETE
The output is the same for both queries:
Row  id  info.name  info.age  info.city
1    1   Alex       30        NYC
2    1   Robert     25        Paris
3    1   Ralph      45        London
4    1   Mary       28        London
As you can see, the row where info.city = "Sydney" was deleted in both cases.
Keep in mind that both statements remove the rows from your source table itself, so be careful.
Note: Since you want to run this process every day, you could use a scheduled query in the BigQuery console, appending or overwriting the results after each run. Also, it is good practice not to delete data from your source table. Instead, consider creating a new table from your source table without the rows you do not want.
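The last suggestion (build a filtered copy instead of deleting from the source) can be sketched as follows. This is only an illustration under stated assumptions: it uses SQLite through Python's sqlite3 rather than BigQuery, flattens the info STRUCT into ordinary columns, and the table names source and cleaned are made up. In BigQuery the same idea would be a CREATE TABLE ... AS SELECT statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The STRUCT fields are flattened into plain columns for this sketch.
    CREATE TABLE source (id INTEGER, name TEXT, age INTEGER, city TEXT);
    INSERT INTO source VALUES
        (1, 'Alex', 30, 'NYC'),
        (1, 'Leo', 18, 'Sydney'),
        (1, 'Robert', 25, 'Paris'),
        (1, 'Mary', 28, 'London'),
        (1, 'Ralph', 45, 'London');

    -- Copy everything except the unwanted rows; the source stays untouched.
    CREATE TABLE cleaned AS
    SELECT * FROM source WHERE city <> 'Sydney';
""")

source_count = conn.execute("SELECT COUNT(*) FROM source").fetchone()[0]
cleaned_names = [r[0] for r in conn.execute("SELECT name FROM cleaned ORDER BY name")]
print(source_count, cleaned_names)
```

The source table still holds all five rows afterwards, so a wrong filter can be corrected by simply rebuilding the copy.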

SQL - Select TOP 1 from child table with WHERE condition, 2 possible values and 1 is preferred

I have 2 tables, parent and a child with 1-N relation.
Person: Id (INT), Name (VARCHAR)
PersonToCompany: Id (INT), PersonId (INT), Email (Varchar)
I want to JOIN both tables, but select just 1 record from the PersonToCompany table. I know I can do this using e.g. CROSS APPLY, but I also have some conditions.
I want to select only specific PersonToCompany records, like this:
WHERE (Email LIKE '%#abc.com%' OR Email LIKE '%#xyz.com%')
Now the tricky part - some people can have 2 PersonToCompany records with both #abc.com and #xyz.com email domains. In this case, I want to be sure that the record with #abc.com will be selected. How can I do that?
This is my original subquery that selects #abc.com OR #xyz.com with no preference:
CROSS APPLY (
SELECT TOP 1
PersonToCompany.Email AS Email
FROM PersonToCompany
WHERE PersonToCompany.PersonId = Person.Id
AND (PersonToCompany.Email LIKE '%#abc.com%') OR (PersonToCompany.Email LIKE '%#xyz.com%')
) PersonToCompany
TOP 1 without ORDER BY is "give me a row, I don't care which one". So the simple fix is to add an ORDER BY:
CROSS APPLY (
SELECT TOP 1
PersonToCompany.Email AS Email
FROM PersonToCompany
WHERE PersonToCompany.PersonId = Person.Id
AND (PersonToCompany.Email LIKE '%#abc.com%' OR
PersonToCompany.Email LIKE '%#xyz.com%')
ORDER BY CASE WHEN PersonToCompany.Email LIKE '%#abc.com%' THEN 0 ELSE 1 END
) PersonToCompany
(I've also shifted some parentheses around to get the logic correct, I believe - you were bracketing individual predicates, which doesn't really do anything. I've bracketed the OR so that the PersonId match is required no matter which email address is found, which sounds more correct to me)
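For anyone who wants to try the ordering trick without a SQL Server instance, here is a minimal sketch using SQLite through Python's sqlite3. Both are assumptions made for the sake of a runnable example: SQLite has neither TOP nor CROSS APPLY, so a correlated scalar subquery with LIMIT 1 stands in for them, and the sample people are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (Id INTEGER, Name TEXT);
    CREATE TABLE PersonToCompany (Id INTEGER, PersonId INTEGER, Email TEXT);
    -- Hypothetical sample rows.
    INSERT INTO Person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO PersonToCompany VALUES
        (1, 1, 'ann#xyz.com'),
        (2, 1, 'ann#abc.com'),
        (3, 2, 'bob#xyz.com'),
        (4, 3, 'cy#other.com');
""")

# ORDER BY CASE ranks #abc.com rows first, so LIMIT 1 picks them when present.
rows = conn.execute("""
    SELECT p.Name,
           (SELECT ptc.Email
            FROM PersonToCompany AS ptc
            WHERE ptc.PersonId = p.Id
              AND (ptc.Email LIKE '%#abc.com%' OR ptc.Email LIKE '%#xyz.com%')
            ORDER BY CASE WHEN ptc.Email LIKE '%#abc.com%' THEN 0 ELSE 1 END
            LIMIT 1) AS Email
    FROM Person AS p
    ORDER BY p.Id
""").fetchall()

for row in rows:
    print(row)
```

One difference from the original: the scalar subquery keeps people with no matching email (with a NULL Email), whereas CROSS APPLY would drop them from the result.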

SQL query not returning rows if empty

I am new to SQL and I would like to get some insight into my problem.
I am using the following query:
select id,
pid
from assoc
where id in (100422, 100414, 100421, 100419, 100423)
Not all of these ids have a pid; some do and some don't. Currently the query skips the ids that don't have a pid.
I would like the results to look like this:
pid id
-----------
703 100422
313 100414
465 100421
null 100419
null 100423
Any help would be greatly appreciated. Thanks!
Oh, I think I've got the idea: you have to enumerate all the ids and their corresponding pids, putting null where there is no corresponding pid (a kind of outer join). If that's your case, an Oracle solution could be:
with
-- dummy: required ids
dummy as (
select 100422 as id from dual
union all select 100414 as id from dual
union all select 100421 as id from dual
union all select 100419 as id from dual
union all select 100423 as id from dual),
-- main: actual data we have
main as (
select id,
pid
from assoc
-- you may put "id in (select d.id from dummy d)"
where id in (100422, 100414, 100421, 100419, 100423))
-- we want to print out either existing main.pid or null
select main.pid as pid,
dummy.id as id
from dummy left join main on dummy.id = main.id
id is obtained from another table, and assoc only has the pids associated with an id.
The assoc table seems to be the association table used to implement a many-to-many relationship between two entities in a relational database.
It contains entries only for the entities from one table that are in relationship with entities from the other table. It doesn't contain information about the entities that are not in a relationship and some of the results you want to get come from entities that are not in a relationship.
The solution to your problem is to RIGHT JOIN the table where the id column comes from and apply the WHERE condition to the values retrieved from that original table (because it contains the rows you need). The RIGHT JOIN ensures that all rows from the right-side table are included in the result set, even when they have no matching rows in the left-side table.
Assuming the table where the id column comes from is named table1, the query you need is:
SELECT table1.id, assoc.pid
FROM assoc
RIGHT JOIN table1 ON assoc.id = table1.id
WHERE table1.id IN (100422, 100414, 100421, 100419, 100423)
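Here is a runnable sketch of the outer join, assuming SQLite through Python's sqlite3 and assuming (as in the answer) that the ids live in a table named table1. It is written as a LEFT JOIN starting from table1, which is equivalent to the RIGHT JOIN above and also works on older SQLite versions that lack RIGHT JOIN. The id is read from table1, because assoc.id is NULL for ids that have no row in assoc.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER);
    CREATE TABLE assoc (id INTEGER, pid INTEGER);
    INSERT INTO table1 VALUES (100422), (100414), (100421), (100419), (100423);
    -- Only three of the five ids have a pid.
    INSERT INTO assoc VALUES (100422, 703), (100414, 313), (100421, 465);
""")

# Every requested id appears once; pid is NULL where assoc has no row.
rows = conn.execute("""
    SELECT assoc.pid, table1.id
    FROM table1
    LEFT JOIN assoc ON assoc.id = table1.id
    WHERE table1.id IN (100422, 100414, 100421, 100419, 100423)
    ORDER BY assoc.pid IS NULL, table1.id
""").fetchall()

for row in rows:
    print(row)
```

The ORDER BY is only there to make the output deterministic, with the rows lacking a pid listed last.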

Pick One Row per Unique ID from duplicate records

In MS Access 2010, I have a query like the one below that displays duplicate records. The problem is that, even though the IDs are unique, one of the fields differs between rows because I have combined two separate tables in this query. I just want to display one row per ID and eliminate the other rows. It doesn't matter which row I pick. See below:
ID - NAME - FAVCOLOR
1242 - John - Blue
1242 - John - Red
1378 - Mary - Green
I want to pick any one of the rows with the same ID. It doesn't matter which one, as long as I display one row per ID.
ID - NAME - FAVCOLOR
1242 - John - Red
1378 - Mary - Green
Use the SQL from your current query as a subquery and then GROUP BY ID and NAME. You can retrieve the minimum FAVCOLOR since you want only one and don't care which.
SELECT sub.ID, sub.NAME, Min(sub.FAVCOLOR)
FROM
(
SELECT ID, [NAME], FAVCOLOR
FROM TABLE1
UNION ALL
SELECT ID, [NAME], FAVCOLOR
FROM TABLE2
) AS sub
GROUP BY sub.ID, sub.NAME;
Note NAME is a reserved word. Bracket that name or prefix it with the table name or alias to avoid confusing the db engine.
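A quick way to see the GROUP BY approach at work is the sketch below. It assumes SQLite through Python's sqlite3 instead of Access (so NAME needs no brackets), and the split of the sample rows between TABLE1 and TABLE2 is a guess made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (ID INTEGER, NAME TEXT, FAVCOLOR TEXT);
    CREATE TABLE TABLE2 (ID INTEGER, NAME TEXT, FAVCOLOR TEXT);
    -- Hypothetical split of the sample rows across the two tables.
    INSERT INTO TABLE1 VALUES (1242, 'John', 'Blue'), (1378, 'Mary', 'Green');
    INSERT INTO TABLE2 VALUES (1242, 'John', 'Red');
""")

# One row per ID/NAME; MIN() picks an arbitrary but deterministic color.
rows = conn.execute("""
    SELECT sub.ID, sub.NAME, MIN(sub.FAVCOLOR) AS FAVCOLOR
    FROM (
        SELECT ID, NAME, FAVCOLOR FROM TABLE1
        UNION ALL
        SELECT ID, NAME, FAVCOLOR FROM TABLE2
    ) AS sub
    GROUP BY sub.ID, sub.NAME
    ORDER BY sub.ID
""").fetchall()

for row in rows:
    print(row)
```

MIN compares the colors as text, so John ends up with Blue here; using MAX instead would give Red, and either satisfies the "any one row per ID" requirement.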
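Try the union without the ALL keyword and see if you get the desired result. Keep in mind that UNION only removes rows that are identical in every column, so rows that differ in FAVCOLOR will both remain.
Your new query would look like
"SELECT ID, NAME, FAVCOLOR FROM TABLE1 UNION SELECT ID, NAME, FAVCOLOR FROM TABLE2;"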
If you just want the IDs, why is the color in the query? Maybe I'm missing something.
The only thing I could suggest is to use some aggregate function (min, max) to get one color.
Select
id,
name,
max(favcolor)
from (
select * from table1
union
select * from table2 ) t
group by
id,
name