efficient way to compare two tables in bigquery - sql

I want to check whether two tables contain the same data.
I could do it like this:
#standardSQL
SELECT
key1, key2
FROM
(
SELECT
table1.key1,
table1.key2,
table1.column1 - table2.column1 as col1,
table1.col2 - table2.col2 as col2
FROM
`table1` AS table1
LEFT JOIN
`table2` AS table2
ON
table1.key1 = table2.key1
AND
table1.key2 = table2.key2
)
WHERE
col1 != 0
OR
col2 != 0
But when I want to compare all numerical columns, this gets unwieldy, especially if I want to do it for multiple table combinations.
Therefore my question: is anyone aware of a way to iterate over all numerical columns and restrict the result set to those keys where any of these differences was not zero?

In Standard SQL, we found using a UNION ALL of two EXCEPT DISTINCT's works for our use cases:
(
SELECT * FROM table1
EXCEPT DISTINCT
SELECT * from table2
)
UNION ALL
(
SELECT * FROM table2
EXCEPT DISTINCT
SELECT * from table1
)
This will produce differences in both directions:
rows in table1 that are not in table2
rows in table2 that are not in table1
Notes and caveats:
table1 and table2 must have the same width, with columns in the same order and of the same types.
This does not work directly with STRUCT or ARRAY data types. You should either UNNEST them or use TO_JSON_STRING to convert these data types first.
This does not work directly with GEOGRAPHY either; you must convert it to text first using ST_AsText.
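To make the two directions of the difference concrete, here is a minimal sketch that runs the same idea locally. It assumes SQLite (via Python's sqlite3 module) as a stand-in for BigQuery, uses plain EXCEPT (which is distinct in SQLite), and the helper name symmetric_difference and the sample rows are made up for illustration:

```python
import sqlite3


def symmetric_difference(conn, left, right):
    """Return rows found in exactly one of the two tables, tagged with their source."""
    # Table names are interpolated into the SQL, so they must come from trusted code.
    sql = f"""
        SELECT '{left}' AS source, * FROM
            (SELECT * FROM {left} EXCEPT SELECT * FROM {right})
        UNION ALL
        SELECT '{right}' AS source, * FROM
            (SELECT * FROM {right} EXCEPT SELECT * FROM {left})
    """
    return sorted(conn.execute(sql).fetchall())


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (key1 INT, key2 INT, column1 INT, col2 INT);
    CREATE TABLE table2 (key1 INT, key2 INT, column1 INT, col2 INT);
    INSERT INTO table1 VALUES (1, 1, 1, 2), (2, 2, 3, 4), (3, 3, 5, 6);
    INSERT INTO table2 VALUES (1, 1, 1, 29), (2, 2, 3, 4), (4, 4, 7, 8);
""")

diff = symmetric_difference(conn, "table1", "table2")
for row in diff:
    print(row)
```

The extra source column is not part of the query above; it is only there so you can tell which table each unmatched row came from.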

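First, I want to point out some issues with your original query.
The main issues are: 1) the LEFT JOIN, which ignores keys that exist only in table2; 2) the col != 0 test, which silently drops rows where the difference is NULL (a NULL on either side, or no matching row).
Below is how it should be modified to really capture ALL differences from both tables.
Run your original query and the one below, and you should see the difference: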
#standardSQL
SELECT key1, key2
FROM
(
SELECT
IFNULL(table1.key1, table2.key1) key1,
IFNULL(table1.key2, table2.key2) key2,
table1.column1 - table2.column1 AS col1,
table1.col2 - table2.col2 AS col2
FROM `table1` AS table1
FULL OUTER JOIN `table2` AS table2
ON table1.key1 = table2.key1
AND table1.key2 = table2.key2
)
WHERE IFNULL(col1, 1) != 0
OR IFNULL(col2, 1) != 0
Or you can run your original query and the version above against dummy data to see the difference:
#standardSQL
WITH `table1` AS (
SELECT 1 key1, 1 key2, 1 column1, 2 col2 UNION ALL
SELECT 2, 2, 3, 4 UNION ALL
SELECT 3, 3, 5, 6
), `table2` AS (
SELECT 1 key1, 1 key2, 1 column1, 29 col2 UNION ALL
SELECT 2, 2, 3, 4 UNION ALL
SELECT 4, 4, 7, 8
)
SELECT key1, key2
FROM
(
SELECT
IFNULL(table1.key1, table2.key1) key1,
IFNULL(table1.key2, table2.key2) key2,
table1.column1 - table2.column1 AS col1,
table1.col2 - table2.col2 AS col2
FROM `table1` AS table1
FULL OUTER JOIN `table2` AS table2
ON table1.key1 = table2.key1
AND table1.key2 = table2.key2
)
WHERE IFNULL(col1, 1) != 0
OR IFNULL(col2, 1) != 0
Second, the following greatly simplifies the overall query:
#standardSQL
SELECT
IFNULL(table1.key1, table2.key1) key1,
IFNULL(table1.key2, table2.key2) key2
FROM `table1` AS table1
FULL OUTER JOIN `table2` AS table2
ON table1.key1 = table2.key1
AND table1.key2 = table2.key2
WHERE TO_JSON_STRING(table1) != TO_JSON_STRING(table2)
You can test it with the same dummy data as above.
Note: with this solution you don't need to pick specific columns; it simply compares all of them. If you only need to compare specific columns, you still have to cherry-pick them, as in the example below:
#standardSQL
SELECT
IFNULL(table1.key1, table2.key1) key1,
IFNULL(table1.key2, table2.key2) key2
FROM `table1` AS table1
FULL OUTER JOIN `table2` AS table2
ON table1.key1 = table2.key1
AND table1.key2 = table2.key2
WHERE TO_JSON_STRING((table1.column1, table1.col2)) != TO_JSON_STRING((table2.column1, table2.col2))

You will need to specify which columns are numerical, but comparing a single serialized representation of all of them makes for a quick comparison:
#standardSQL
WITH table_a AS (
SELECT 1 id, 2 n1, 3 n2
), table_b AS (
SELECT 1 id, 2 n1, 4 n2
)
SELECT id
FROM table_a a
JOIN table_b b
USING(id)
WHERE TO_JSON_STRING([a.n1, a.n2]) != TO_JSON_STRING([b.n1, b.n2])
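Since the question asks about repeating this for many table combinations, one option is to generate the query text instead of writing it by hand. The following is only a sketch: build_diff_query is a hypothetical helper, it reproduces the inner-join pattern shown above (so keys missing from one side are not reported), and it assumes the table and column names come from a trusted source, for example a list you maintain yourself or one read from INFORMATION_SCHEMA.COLUMNS.

```python
def build_diff_query(table_a, table_b, keys, numeric_columns):
    """Build a query that lists the keys whose numeric columns differ."""
    # Names are interpolated as-is, so they must not come from user input.
    key_list = ", ".join(keys)
    a_values = ", ".join(f"a.{col}" for col in numeric_columns)
    b_values = ", ".join(f"b.{col}" for col in numeric_columns)
    return (
        f"SELECT {key_list}\n"
        f"FROM `{table_a}` a\n"
        f"JOIN `{table_b}` b\n"
        f"USING({key_list})\n"
        f"WHERE TO_JSON_STRING([{a_values}]) != TO_JSON_STRING([{b_values}])"
    )


query = build_diff_query("table_a", "table_b", ["id"], ["n1", "n2"])
print(query)
```

For the sample tables above this prints the same statement as the hand-written query; for other table pairs you only change the arguments.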

Related

select query respecting conditions

I have a table with 4 columns (id, val1, val2, val3).
Does anyone know how to select rows where val3 is the same but val1 is different?
For example:
row1: (id1, user1, matheos, cvn)
row2: (id2, user2, matheos, cvn)
row3: (id3, user3, Claudia, bnps)
Then the query should return row1 and row2.
Your explanation is not entirely clear, but the following query will find matching rows according to the criteria you specified:
select a.*, b.*
from my_table a
join my_table b on b.val3 = a.val3
and b.val1 <> a.val1
and b.id < a.id
In order to produce the rows separately, you can also do:
select *
from my_table a
where exists (
select null from my_table b where b.val3 = a.val3 and b.val1 <> a.val1
)
Based on your explanation, you can try this:
select distinct t1.* from mytable t1
JOIN mytable t2 ON t1.val3 = t2.val3
and t1.val1 != t2.val1;
Demo: SQL Fiddle
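If no fiddle is at hand, the example rows from the question can be checked locally. This is a minimal sketch that assumes SQLite through Python's sqlite3 module and uses the EXISTS form, comparing val1 as the question describes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (id TEXT, val1 TEXT, val2 TEXT, val3 TEXT);
    INSERT INTO my_table VALUES
        ('id1', 'user1', 'matheos', 'cvn'),
        ('id2', 'user2', 'matheos', 'cvn'),
        ('id3', 'user3', 'Claudia', 'bnps');
""")

# Keep a row when some other row shares its val3 but has a different val1.
rows = conn.execute("""
    SELECT a.id, a.val1, a.val2, a.val3
    FROM my_table a
    WHERE EXISTS (
        SELECT 1 FROM my_table b
        WHERE b.val3 = a.val3 AND b.val1 <> a.val1
    )
    ORDER BY a.id
""").fetchall()

for row in rows:
    print(row)
```

Only row1 and row2 come back, because row3 has no partner with the same val3.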

Find unmatched rows between two tables

Given this setup:
CREATE TABLE table1 (column1 text, column2 text);
CREATE TABLE table2 (column1 text, column2 text);
INSERT INTO table1 VALUES
('A', 'A')
, ('B', 'N')
, ('C', 'C')
, ('B', 'A');
INSERT INTO table2 VALUES
('A', 'A')
, ('B', 'N')
, ('C', 'X')
, ('B', 'Y');
How can I find missing combinations of (column1, column2) between these two tables? Rows not matched in the other table.
The desired result for the given example would be:
C | C
B | A
C | X
B | Y
There can be duplicate entries so we'd want to omit those.
One method is union all:
select t1.col1, t1.col2
from t1
where (t1.col1, t1.col2) not in (select t2.col1, t2.col2 from t2)
union all
select t2.col1, t2.col2
from t2
where (t2.col1, t2.col2) not in (select t1.col1, t1.col2 from t1);
If there are duplicates within a table, you can remove them by using select distinct. There is no danger of duplicates between the tables.
Seems to be a perfect task for set operations:
( --all rows from table 1 missing in table 2
select *
from table1
except
select *
from table2
)
union all -- both select return distinct rows
( -- all rows in table 2 missing in table 1
select *
from table2
except
select *
from table1
)
You can try to use not exists with a subquery, then use UNION ALL
select Column1,Column2 from table1 t1
where NOT exists
(
select 1
FROM table2 t2
where t1.Column1 = t2.Column1 and t1.Column2 = t2.Column2
)
UNION ALL
select Column1,Column2 from table2 t1
where NOT exists
(
select 1
FROM table1 t2
where t1.Column1 = t2.Column1 and t1.Column2 = t2.Column2
)
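Whether the two predicates inside NOT EXISTS are joined with AND or OR matters a lot here. The sketch below assumes SQLite (Python's sqlite3 module) and the setup from the question; it matches on the combination of both columns (AND), then swaps in OR to show the contrast:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (column1 TEXT, column2 TEXT);
    CREATE TABLE table2 (column1 TEXT, column2 TEXT);
    INSERT INTO table1 VALUES ('A', 'A'), ('B', 'N'), ('C', 'C'), ('B', 'A');
    INSERT INTO table2 VALUES ('A', 'A'), ('B', 'N'), ('C', 'X'), ('B', 'Y');
""")

# A row counts as matched only if BOTH columns agree with some row on the other side.
UNMATCHED = """
    SELECT column1, column2 FROM table1 t1
    WHERE NOT EXISTS (
        SELECT 1 FROM table2 t2
        WHERE t1.column1 = t2.column1 AND t1.column2 = t2.column2)
    UNION ALL
    SELECT column1, column2 FROM table2 t1
    WHERE NOT EXISTS (
        SELECT 1 FROM table1 t2
        WHERE t1.column1 = t2.column1 AND t1.column2 = t2.column2)
"""

unmatched = sorted(conn.execute(UNMATCHED).fetchall())
print(unmatched)

# With OR, a row counts as matched as soon as ONE column agrees, so nothing is reported.
or_rows = conn.execute(UNMATCHED.replace(" AND ", " OR ")).fetchall()
print(or_rows)
```

With AND, the four rows from the desired result are returned; with OR, every row finds a partial partner and the result is empty.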
You can try set operations: EXCEPT to find the rows that are in one table but not in the other, and UNION to combine the partial results into one.
(SELECT column1,
column2
FROM table1
EXCEPT
SELECT column1,
column2
FROM table2)
UNION
(SELECT column1,
column2
FROM table2
EXCEPT
SELECT column1,
column2
FROM table1);
If you don't need duplicate elimination you can try to use the ALL variants (EXCEPT ALL and UNION ALL). They are generally faster, as the DBMS doesn't have to look for and eliminate duplicates.
The devil is in the details with this seemingly simple task.
Short and among the fastest:
SELECT col1, col2
FROM (SELECT col1, col2, TRUE AS x1 FROM t1) t1
FULL JOIN (SELECT col1, col2, TRUE AS x2 FROM t2) t2 USING (col1, col2)
WHERE (x1 AND x2) IS NULL;
The FULL [OUTER] JOIN includes all rows from both sides, but fills in NULL values for columns of missing rows. The WHERE condition (x1 AND x2) IS NULL identifies these unmatched rows. Equivalent: WHERE x1 IS NULL OR x2 IS NULL.
To eliminate duplicate pairs, add DISTINCT (or GROUP BY) at the end - cheaper for few dupes:
SELECT DISTINCT col1, col2
FROM ...
If you have many dupes on either side, it's cheaper to fold before the join:
SELECT col1, col2
FROM (SELECT DISTINCT col1, col2, TRUE AS x1 FROM t1) t1
FULL JOIN (SELECT DISTINCT col1, col2, TRUE AS x2 FROM t2) t2 USING (col1, col2)
WHERE (x1 AND x2) IS NULL;
It's more complicated if there can be NULL values. DISTINCT / DISTINCT ON or GROUP BY treat them as equal (so dupes with NULL values are folded in the subqueries above). But JOIN or WHERE conditions must evaluate to TRUE for rows to pass. NULL values are not considered equal there, so the FULL [OUTER] JOIN never finds a match for pairs containing NULL. This may or may not be desirable. You just have to be aware of the difference and define your requirements.
Consider the added demo in the SQL Fiddle
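The NULL behaviour described above can also be reproduced with a few lines of code. This is a small sketch that assumes SQLite (Python's sqlite3 module) rather than PostgreSQL; the one pair containing NULL is invented for the demonstration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (col1 TEXT, col2 TEXT);
    CREATE TABLE t2 (col1 TEXT, col2 TEXT);
    INSERT INTO t1 VALUES ('A', NULL), ('A', NULL);
    INSERT INTO t2 VALUES ('A', NULL);
""")

# DISTINCT treats the two ('A', NULL) rows as duplicates and folds them into one.
distinct_rows = conn.execute("SELECT DISTINCT col1, col2 FROM t1").fetchall()
print(distinct_rows)

# The join condition col2 = col2 is never TRUE for NULLs, so no pair matches.
(join_matches,) = conn.execute(
    "SELECT count(*) FROM t1 JOIN t2 USING (col1, col2)"
).fetchone()
print(join_matches)
```

So the same pair is "equal" for de-duplication but "unmatched" for the join, which is exactly the difference you have to decide on.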
If there are no NULL values, no duplicates, but an additional column defined NOT NULL in each table, like the primary key, let's name each id, then it can be as simple as:
SELECT col1, col2
FROM t1
FULL JOIN t2 USING (col1, col2)
WHERE t1.id IS NULL OR t2.id IS NULL;
Related:
Select rows which are not present in other table
PostgreSQL - Create table as select with distinct on specific columns

Remove inner query in SQL

We have a SQL query that does not follow our SQL guidelines. We have to change it, but if we change the logic and remove the inner query, it takes too much time to execute. Below is the query:
select col1,
col2,
case
when col1 <> '' then(select top 1
col1
from table1 as BP
where bp.col1 = FD.col1 order by BP.col1)
when col2 <> '' then(select top 1
BP.col2
from table1 as BP
where BP.col2 = FD.col2 order by BP.col2)
else ''
end
from table2 FD
The above query is used to insert data into a temp table. table1 has almost 100 million rows. Is there any way to remove the inline query while keeping good performance? We have already created indexes on table1. Any thoughts?
Try this
;WITH CTE
AS
(
SELECT
RN = ROW_NUMBER() OVER(ORDER BY COALESCE(T2.col1,T2.col2)),
T2.col1,
T2.col2,
T1Val = COALESCE(T2.col1,T2.col2,'')
FROM table2 T2
LEFT JOIN table1 T1
ON
(
(
ISNULL(T2.col1,'')<>'' AND T1.col1 = T2.col1
)
OR
(
ISNULL(T2.col2,'')<>'' AND T1.col2 = T2.col2
)
)
)
SELECT
*
FROM CTE
WHERE RN = 1
Here are a few suggestions:
Prepare and materialize subquery 1 and subquery 2 ahead of time (GROUP BY col1 or col2). This reduces the size of table1.
Split the main query on table2 into three separate queries:
one with SELECT .. FROM table2 WHERE col1 <> ''
one with SELECT .. FROM table2 WHERE col1 = '' AND col2 <> ''
one with SELECT .. FROM table2 WHERE col1 = '' AND col2 = ''
Use an INNER JOIN against the table created in the first step.
(If you use SSIS, you can run these in parallel and use the materialized table in a Lookup.)
If col1 or col2 is NVARCHAR(MAX) or otherwise large, consider hashing the values (MD5, for example) and comparing the hashes instead.
Make sure all the relevant indexes are in place.
If it is still not fast enough, you can try:
OUTER APPLY (SELECT TOP 1 .. )
Another idea:
SELECT col1, col2, col1 AS yourNewCol
FROM table2 T2
WHERE EXISTS( SELECT 1 FROM table1 T1 WHERE T1.col1 = T2.col1)
UNION ALL
SELECT col1, col2, col2 AS yourNewCol
FROM table2 T2
WHERE
NOT EXISTS( SELECT 1 FROM table1 T1 WHERE T1.col1 = T2.col1)
AND EXISTS( SELECT 1 FROM table1 T1 WHERE T1.col2 = T2.col2)
UNION ALL
...
I don't have a clean solution for you, but some ideas.
Let me know if it helps you.
Regards,
Arnaud

Cleaning up and simplifying a nested SQL statement

I'm trying to determine if there is a better way to do this in SQL. My goal is to run one query which returns two values that are to be used for another query. See below
select *
from table2
where col1 =
( select col1
from table1
where id = 123 )
and col2 =
( select col2
from table1
where id = 123 );
Is there a way to simplify this code, either with a where clause that checks both values against one nested query, or by running the first query and somehow storing the values of col1 and col2 in variables that I can use in the second query?
You can do
select *
from table2
where (col1, col2) = (select col1, col2
from table1
where id = 123)
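A quick way to try the row-value comparison is shown below. This is a sketch under assumptions: it uses SQLite through Python's sqlite3 module (row values need SQLite 3.15 or later; other databases may or may not accept this syntax), and the payload column and the sample rows are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT);
    CREATE TABLE table2 (col1 TEXT, col2 TEXT, payload TEXT);
    INSERT INTO table1 VALUES (123, 'x', 'y'), (124, 'x', 'z');
    INSERT INTO table2 VALUES
        ('x', 'y', 'match'),
        ('x', 'z', 'other id'),
        ('q', 'y', 'only col2 matches');
""")

# Both columns are compared against a single subquery in one predicate.
rows = conn.execute(
    """
    SELECT * FROM table2
    WHERE (col1, col2) = (SELECT col1, col2 FROM table1 WHERE id = ?)
    """,
    (123,),
).fetchall()
print(rows)
```

The subquery runs once per statement and must return at most one row, which holds here because id is the primary key.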
SELECT DISTINCT a.*
FROM table2 a
INNER JOIN table1 b
ON a.col1 = b.col1
AND a.col2 = b.col2
WHERE b.id = 123
You can simply use the query below:
select t2.* from table2 t2,table1 t1 where t1.col1=t2.col1 and
t1.col2=t2.col2 and t1.id=123
Seems like you've got it backwards. Since you know exactly what you want from table1 (so, presumably, the query is smaller), you should start by getting the data from table1, then join the relevant rows from table2:
select table2.*
from table1
inner join table2
on table2.col1 = table1.col1
and table2.col2 = table1.col2
where table1.id = 123

SQL query to find distinct values in two tables?

Table 1            Table 2
Number | Code      Code | Description
1234   | A         A    | Something
1235   | B         C    | Something else
1246   | C         D    | Something other
1247   | A
1248   | B
1249   | A
I would like to find the distinct Code values and get a return like this:
1 | 2
-------
A | A
B |
C | C
  | D
I can't figure out how to write a SQL query that would return me the above results. Anyone have any experience with a query like this or similar?
In proper RDBMS:
SELECT
T1.Code, T2.Code
FROM
(SELECT DISTINCT Code FROM Table1) T1
FULL OUTER JOIN
(SELECT DISTINCT Code FROM Table2) T2
ON T1.Code = T2.Code
In MySQL... the UNION removes duplicates
SELECT
T1.Code, T2.Code
FROM
Table1 T1
LEFT OUTER JOIN
Table2 T2 ON T1.Code = T2.Code
UNION
SELECT
T1.Code, T2.Code
FROM
Table1 T1
RIGHT OUTER JOIN
Table2 T2 ON T1.Code = T2.Code
In Standard SQL, using relational operators and avoiding nulls:
SELECT Code AS col_1, Code AS col_2
FROM Table_1
INTERSECT
SELECT Code AS col_1, Code AS col_2
FROM Table_2
UNION
SELECT Code AS col_1, 'missing' AS col_2
FROM Table_1
EXCEPT
SELECT Code AS col_1, 'missing' AS col_2
FROM Table_2
UNION
SELECT 'missing' AS col_1, Code AS col_2
FROM Table_2
EXCEPT
SELECT 'missing' AS col_1, Code AS col_2
FROM Table_1;
Again in Standard SQL, this time using constructs that MySQL actually supports:
SELECT Code AS col_1, Code AS col_2
FROM Table_1
WHERE EXISTS (
SELECT *
FROM Table_2
WHERE Table_2.Code = Table_1.Code
)
UNION
SELECT Code AS col_1, 'missing' AS col_2
FROM Table_1
WHERE NOT EXISTS (
SELECT *
FROM Table_2
WHERE Table_2.Code = Table_1.Code
)
UNION
SELECT 'missing' AS col_1, Code AS col_2
FROM Table_2
WHERE NOT EXISTS (
SELECT *
FROM Table_1
WHERE Table_1.Code = Table_2.Code
);
What you're looking for is a full outer join:
select a.code as code_1,b.code as code_2
from(
select code
from table1
group by 1
)a
full outer join(
select code
from table2
group by 1
)b
using(code)
order by 1;
This actually looks like a UNION of two outer joins. Try this:
SELECT t1.Code, t2.Code
FROM Table1 AS t1
LEFT JOIN Table2 AS t2 ON t1.Code = t2.Code
UNION
SELECT t1.Code, t2.Code
FROM Table1 AS t1
RIGHT JOIN Table2 AS t2 ON t1.Code = t2.Code
ORDER BY 1, 2
The UNION operation will only keep distinct values.
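To see the two-outer-joins idea produce the requested output, here is a minimal sketch using the sample data from the question. It assumes SQLite through Python's sqlite3 module; because older SQLite versions lack RIGHT JOIN, the second half is written as a LEFT JOIN with the tables swapped, which yields the same rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (Number INTEGER, Code TEXT);
    CREATE TABLE Table2 (Code TEXT, Description TEXT);
    INSERT INTO Table1 VALUES
        (1234, 'A'), (1235, 'B'), (1246, 'C'),
        (1247, 'A'), (1248, 'B'), (1249, 'A');
    INSERT INTO Table2 VALUES
        ('A', 'Something'), ('C', 'Something else'), ('D', 'Something other');
""")

# First half keeps every Table1 code, second half keeps every Table2 code;
# UNION (without ALL) removes the duplicate pairs.
rows = conn.execute("""
    SELECT t1.Code AS code_1, t2.Code AS code_2
    FROM Table1 AS t1
    LEFT JOIN Table2 AS t2 ON t1.Code = t2.Code
    UNION
    SELECT t1.Code, t2.Code
    FROM Table2 AS t2
    LEFT JOIN Table1 AS t1 ON t1.Code = t2.Code
""").fetchall()

for code_1, code_2 in rows:
    print(code_1, code_2)
```

Codes missing on one side show up as None (NULL), which corresponds to the blank cells in the desired result.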
The trick would be to get the distinct values from both tables, something like this:
SELECT a.Code, b.code
FROM
( --Get the DISTINCT Codes from all sets
SELECT Distinct Code from Table1
UNION SELECT Distinct Code from Table2
) x Left JOIN
Table1 a ON x.code = a.Code LEFT JOIN
Table2 b ON x.code = b.Code
SELECT
ct.ct_id,
ct.pd_id,
ct.ct_qty,
pd.product_name,
pd.price,
src.service_id,
src.service_name,
src.service_charge,
src.service_quantity
FROM
cart ct,
product pd,
service src
WHERE ct_session_id = '$sid'
LIMIT 1