Generic comparison method for two tables in BigQuery if tables contain STRUCT type

Generic comparison method for two tables in BigQuery if tables contain STRUCT type - sql

I'm looking for a generic method to compare two tables in BigQuery, even if they have columns that are STRUCT type.
It should work for any pair of tables, and ideally wouldn't involve writing a query that depends on that actual columns of the tables. All I really need to know is if the tables are equal or not, but it would be a bonus if it could show me the difference between the rows that aren't the same.
So something like (in pseudo code)
sizeOf( TABLE A EXCEPT TABLE B ) == 0
or
Hash(TABLE A) == HASH(TABLE B)
Would be fine.
I tried using this:
( SELECT * FROM table1
EXCEPT DISTINCT
SELECT * FROM table2)
UNION ALL
( SELECT * FROM table2
EXCEPT DISTINCT
SELECT * FROM table1)
But I got this error.
Column 1 in EXCEPT ALL has type that does not support set operation comparisons: STRUCT at [3:5]
Does anyone know of a way to get around this?
EDIT
Should have mentioned before, but I need this to work regardless of the ordering of the rows of the table.

I think yo are looking for something like below to start with
#standardSQL
SELECT TO_JSON_STRING(a) FROM `project.dataset.tableA` a
EXCEPT DISTINCT
SELECT TO_JSON_STRING(b) FROM `project.dataset.tableB` b
Or, more complete example - to show differences - note: this can be quite exhausting output for really different tables
#standardSQL
SELECT 'a' table, * FROM (
SELECT TO_JSON_STRING(a) record FROM `project.dataset.tableA` a
EXCEPT DISTINCT
SELECT TO_JSON_STRING(b) FROM `project.dataset.tableB` b
)
UNION ALL
SELECT 'b', * FROM (
SELECT TO_JSON_STRING(b) FROM `project.dataset.tableB` b
EXCEPT DISTINCT
SELECT TO_JSON_STRING(a) FROM `project.dataset.tableA` a
)

Related

SQL count(distinct) from both the table

I have 2 tables. Let's say Table A and Table B. Table A has a column called "name". Table B also has a column "name". I want to find out the count(distinct name). Name should take values from both the columns.
For ex-
Table A
name
A
B
C
Table B
name
A
B
D
Output should be 4.

The best concept is, first combine the data in the way you want using a subquery, and then dedupe or do the 2nd step.
For example,
WITH COMBINED AS (
SELECT
name
FROM
TableA
UNION ALL
SELECT
name
FROM
TableB
)
SELECT
DISTINCT name
FROM
COMBINED
In your situation, the 2nd step can be accomplished by changing UNION ALL to a UNION. This will dedupe the values automatically. You won't even need a subquery or a 2nd step. But I wanted to teach you the concept because it comes up often.
SELECT name FROM TableA
UNION
SELECT name FROM TableB

Then UNION in the CTE will reove all Duplicates
so a COUNT(*) will suffoce
WITH CTE AS (
SELECT name FROM TableA
UNION
SELECT name FROM TableB
)
SELECT COUNT(*) FROM CTE

I hope this query should do it:
SELECT SUM(names) AS total_names
FROM (
SELECT COUNT(DISTINCT(name)) as names FROM TableA
UNION
SELECT COUNT(DISTINCT(name)) as names FROM TableB
) t;
Note: Tested with sql server

Yet another option:
select hll_count.merge(hll_sketch) names
from (
select hll_count.init(name) hll_sketch from tableA
union all
select hll_count.init(name) from tableB
)
HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical error. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
See more about benefits of using HyperLogLog++ functions

SQL combine two query results

I can't use a Union because it's not the result I want, and I can't use join because I haven't any common column. I have tried many different SQL query structures and nothing works as I want.
I need help to achieve what I believe is a really simple SQL query. What I am doing now is
select a, b
from (select top 4 a from element_type order by c) as Y,
(SELECT * FROM (VALUES (NULL), (1), (2), (3)) AS X(b)) as Z
The first is a part of a table and the second is a hand created select that gives results like this:
select a; --Give--> a,b,c,d (1 column)
select b; --Give--> 1,2,3,4 (1 column)
I need a query based on the two first that give me (2 column) :
a,1
b,2
c,3
d,4
How can i do this? UNION, JOIN or anything else? Or maybe I can't.
All I can get for now is this:
a,1
a,2
a,3
a,4
b,1
b,2
...

If you want to join two tables together purely on the order the rows appear, then I hope your database support analytic (window) functions:
SELECT * FROM
(SELECT t.*, ROW_NUMBER() OVER(ORDER BY x) as rown FROM table1 t) t1
INNER JOIN
(SELECT t.*, ROW_NUMBER() OVER(ORDER BY x) as rown FROM table2 t) t2
ON t1.rown = t2.rown
Essentially we invent something to join them on by numbering the rows. If one of your tables already contains incrementing integers from 1, you dont need to ROW_NUMBER() OVER() on that table, because it already has suitable data to join to; you just invent a fake column of incrementing nubmers in the other table and then join together
Actually, even if it doesn't support analytics, there are ugly ways of doing row numbering, such as joining the table back to itself using id < id and COUNT(*) .. GROUP BY id to number the rows. I hate doing it, but if your DB doesnt support ROW_NUMBER i'll post an example.. :/
Bear in mind, of course, that RDBMS have R in the name for a reason - related data is.. well.. related. They don't do so well when data is unrelated, so if your hope is to join the "chalks" table to the "cheese" table even though the two are completely unrelated, you're finding out now why it's hard work! :)

Try using row_number. I've created something that might help you. See below:
declare #tableChar table(letter varchar)
insert into #tableChar(letter)
select 'a';
insert into #tableChar(letter)
select 'b';
insert into #tableChar(letter)
select 'c';
insert into #tableChar(letter)
select 'd';
select letter,ROW_NUMBER() over(order by letter ) from #tableChar

You can user row_number() to achieve this,
select a,row_number() over(order by a) as b from element_type;
As you are not taking second part from other table, so you do not need to use join. But if you are doing this on different tables the you can use row_number() to create key for both the tables and bases on those keys, you can join.
Hope it will help.

Redshift: Cross join make data disappear

I have a weird issue in Redshift with a crossjoin.
I am generating days and want to join them with some ids.
The sample query is this:
with ids as (
Select number as id
from models.number_10000
limit 10
),
day as (
SELECT
TO_CHAR(DATEADD(day,num.number,CAST(DATEADD(day,-463,GETDATE()) AS DATE)),'YYYY-MM-DD') as date_string
FROM
(Select * from models.number_10000 limit 463)
as num
)
SELECT
id,date_string
from ids,day
Everything is working fine so far.
However, if I group by then I have no results.
with ids as (
Select number as id
from models.number_10000
limit 10
),
day as (
SELECT
TO_CHAR(DATEADD(day,num.number,CAST(DATEADD(day,-463,GETDATE()) AS DATE)),'YYYY-MM-DD') as date_string
FROM
(Select * from models.number_10000 limit 463)
as num
)
SELECT
id,date_String
from ids,day
group by 1,2
How is this happening? I have never faced something similar. I guess it's something with the cross join and the group by but it seems very strange.
Any thoughts?

I'd start with the following:
State your JOIN explicitly (ANSI92 Style).
State the names of items you want to be grouped explicitly.
Moreover - remove your GROUP BY statement (as you do not have any aggregate functions) and have a DISTINCT clause in your select statement.

Join two select statements together

I am trying to work out how much we have taken in for entry fees.
I have two separate queries both returning values but i need them be as one instead of two separate queries.
SELECT SUM(ENTRY) AS TOTAL1 FROM MONEY
SELECT SUM(ENTRY) AS TOTAL1 FROM MONEY2

I needed to use UNION in order to get the statements together. Then used the below to get one number.
SELECT SUM(X.TOTAL1) from
(
SELECT SUM(ENTRY) AS TOTAL1 FROM MONEY
UNION
SELECT SUM(ENTRY) AS TOTAL1 FROM MONEY2
) X;

select sum(entry) as grand_total
from ( select entry from money
union all
select entry from money2
);
The point being, you SHOULD use UNION ALL; and how many columns each table has is irrelevant, because you don't need to UNION ALL the two tables (all columns from each); you only need to UNION ALL the ENTRY column from the first table and the ENTRY column from the second table.

Compare two tables of data in HIVE

I have to find out if data in both the tables is same for a given view_date. If same my SQL should return zero, else non zero.
Table1/Table2 columns:
Source
view_date
count
start_date
end_date
I tried in the below way:
SELECT *
FROM (
SELECT count(*)
FROM table1
) a
JOIN (
SELECT count(*)
FROM TABLE 2
) b
WHERE view_date = '05/08/2016'
AND a.x != b.y;
But I am not getting the expected result. Could someone please help me?

Here is one method that counts the number of rows that are unique in each table:
select count(*)
from (select source, count, start_date, end_date,
min(which) as minwhich, max(which) as maxwhich
from ((select source, count, start_date, end_date, 1 as which
from table1
where viewdate = '2016-06-08'
) union all
(select source, count, start_date, end_date, 2 as which
from table2
where viewdate = '2016-06-08'
)
) t12
group by source, count, start_date, end_date
having minwhich = maxwhich
) t;
Note: If rows are duplicated across all values in a table, this does not check that the same number of duplicates are in each table.

To do a full comparison of 2 tables, you not only need to make sure that the number of rows match, but you must check that all the data in all the columns for all the rows match!
This can be a complicated problem (when I worked at Hortonworks, for 1 project we developed 3 different programs to try to solve this). Lately I had the opportunity to develop a program that solves this in an elegant and efficient way: https://github.com/bolcom/hive_compared_bq
The program shows you the differences in a webpage (which is something you could skip if you don't need it) and also gives you a return value 0/1 which is what you currently want.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Generic comparison method for two tables in BigQuery if tables contain STRUCT type - sql

Related

SQL count(distinct) from both the table

SQL combine two query results

Redshift: Cross join make data disappear

Join two select statements together

Compare two tables of data in HIVE

Categories

Resources