Compare 2 versions of a table in BigQuery - SQL

I want to compare 2 versions of a table: the data as it was before the last modification, and the latest data.
Here is a sample SQL script that compares the two:
WITH
before_mod AS ( SELECT *
FROM `big-query-112.temp.tableB`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB({{ lastModification }}, INTERVAL 2 second)),
after_mod AS ( SELECT * FROM `big-query-112.temp.tableB` ),
row_changed AS (
SELECT *
FROM before_mod EXCEPT DISTINCT
SELECT *
FROM after_mod
)
SELECT * FROM row_changed
This SQL first creates a CTE for each version:
before_mod -> holds a snapshot of the table as it was at that specific point in time.
after_mod -> the current data in tableB.
Then the "row_changed" CTE is built by selecting all rows from "before_mod" that are not in "after_mod".
The problem is that BigQuery does not allow references to the same table to use different timestamps in FOR SYSTEM_TIME AS OF.
Exception: If a 'FOR SYSTEM_TIME AS OF' expression is used, all references of a table should use the same TIMESTAMP value.
I also tried putting before_mod in a view and then querying the view, using the SQL below:
CREATE OR REPLACE VIEW `big-query-112.temp.tableB_before_mod_temp` AS (
SELECT *
FROM `big-query-112.temp.tableB`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB('2023-02-04 13:12:35 UTC', INTERVAL 0 second)
);
WITH
before_mod AS ( SELECT * FROM `big-query-112.temp.tableB_before_mod_temp`),
after_mod AS ( SELECT * FROM `big-query-112.temp.tableB` ),
row_changed AS (
SELECT *
FROM before_mod EXCEPT DISTINCT
SELECT *
FROM after_mod
)
SELECT * FROM row_changed
The problem with this one is that it does not show the rows that are different; it seems the table is being read as of only one specific time.
I also cannot use a materialized view. Exception: Invalid value: Materialized view query cannot reference historical versions of the table definition
Is there a way to compare 2 versions of the table without creating a copy?
NOTE: The table does not have an ID (because of the way the table is generated, it is hard to add an ID that is always the same for a specific row).
Also, querying SELECT * FROM `big-query-112.temp.tableB_before_mod_temp` on its own shows the expected results.

Related

show columns in CTE returns an error - why?

I have a show columns query that works fine:
SHOW COLUMNS IN table
but it fails when trying to put it in a CTE, like this:
WITH columns_table AS (
SHOW COLUMNS IN table
)
SELECT * from columns_table
Any ideas why, and how to fix it?
Using RESULT_SCAN:
Returns the result set of a previous command (within 24 hours of when you executed the query) as if the result was a table. This is particularly useful if you want to process the output from any of the following:
SHOW or DESC[RIBE] command that you executed.
SHOW COLUMNS IN ...;
WITH columns_table AS (
SELECT *
FROM table(RESULT_SCAN(LAST_QUERY_ID()))
)
SELECT *
FROM columns_table;
A CTE requires a SELECT clause, so SHOW COLUMNS cannot be used inside one. As an alternative, use INFORMATION_SCHEMA to retrieve the metadata, like below:
WITH columns_table AS (
Select * from INTL_DB.INFORMATION_SCHEMA.COLUMNS where TABLE_NAME='CURRENCIES'
)
SELECT * from columns_table;

How to select the nth column, and order columns' selection in BigQuery

I have this huge table upon which I apply a lot of processing (using CTEs), and I want to perform a UNION ALL on 2 particular CTEs.
SELECT *
, 0 AS orders
, 0 AS revenue
, 0 AS units
FROM secondary_prep_cte WHERE purchase_event_flag IS FALSE
UNION ALL
SELECT *
FROM results_orders_and_revenues_cte
I get a "Column 1164 in UNION ALL has incompatible types : STRING,DATE at [97:5]" error.
Obviously I don't know the name of the column, and I'd like to debug this but I feel like I'm going to waste a lot of time if I can't pin-point which column is 1164.
I also think this is a problem of the order of columns between the CTEs, so I have 2 questions:
How do I identify the 1164th column?
How do I order my columns before performing the UNION ALL?
I found this similar question, but it is for MSSQL and I am using BigQuery.
You can get information from INFORMATION_SCHEMA.COLUMNS but you'll need to create a table or view from the CTE:
CREATE OR REPLACE VIEW `project.dataset.secondary_prep_view` as select * from (select 1 as id, "a" as name, "b" as value)
Then:
SELECT * FROM dataset.INFORMATION_SCHEMA.COLUMNS WHERE table_name = 'secondary_prep_view';
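Once both CTEs have been materialised as views, their column lists can be compared position by position, because UNION ALL matches columns by position rather than by name. Below is a minimal Python sketch of that comparison. It assumes the two schemas have already been exported as (column_name, data_type) pairs sorted by ordinal_position; the sample columns and both helper functions are hypothetical and not part of any BigQuery client library.

```python
from itertools import zip_longest


def first_type_mismatch(left_schema, right_schema):
    """Return (1-based position, left column, right column) for the first
    position where the data types differ, or None if the schemas line up."""
    pairs = zip_longest(left_schema, right_schema)
    for position, (left, right) in enumerate(pairs, start=1):
        # A missing column on either side also counts as a mismatch.
        if left is None or right is None or left[1] != right[1]:
            return position, left, right
    return None


def ordered_select_list(column_names):
    """Render an explicit column list to use in place of SELECT *."""
    return ",\n  ".join(column_names)


# Hypothetical schemas, as (column_name, data_type) in ordinal order.
left = [("user_id", "STRING"), ("event_date", "DATE"), ("orders", "INT64")]
right = [("user_id", "STRING"), ("orders", "INT64"), ("event_date", "DATE")]

mismatch = first_type_mismatch(left, right)
select_list = ordered_select_list([name for name, _ in left])
```

Here the first mismatch is reported at position 2, where one side has event_date (DATE) and the other has orders (INT64). Using the same explicit select list on both sides of the UNION ALL, instead of SELECT *, fixes the column order.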

BigQuery - Create a table from results of a query that uses complex CTEs?

I have a multi CTE query with large underlying datasets that is run too frequently. I could just create a table of the results of that query for people to use instead, and refresh that daily. But I'm lost on the syntax to create such a table.
CREATE OR REPLACE TABLE dataset.target_table
AS
with cte_one as (
select
stuff
from big.table
),
...
cte_five as (
select
stuff
from other_big.table
),
final as (
select *
from cte_five left join cte_x on cte_five.id = cte_x.id
)
SELECT
*
FROM final
That is basically what I have. It actually creates the target table, even with the right schema, but doesn't insert any rows. Any hints? Thanks
If you really want to do this in one step, you can just do SELECT INTO...
with cte_one as (
select
stuff
from big.table
),
...
cte_five as (
select
stuff
from other_big.table
),
final as (
select *
from cte_five left join cte_x on cte_five.id = cte_x.id
)
SELECT
*
INTO dataset.target_table
FROM final
That said, since this isn't just a once-off need I recommend creating the landing table once initially and then scheduling a daily flush and fill (TRUNCATE + INSERT) to update the data. It will give you more explicit control over the data types and also lets you work with a persistent object rather than something built from scratch daily.
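The flush-and-fill pattern can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the standard-library sqlite3 module as a stand-in for the warehouse, DELETE instead of TRUNCATE (SQLite has no TRUNCATE statement), and hypothetical names (source_orders, target_table, refresh_target). In practice the scheduling would be handled by whatever scheduler is already in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_orders (id INTEGER, amount REAL);
-- The landing table is created once, with explicit column types.
CREATE TABLE target_table (id INTEGER, total REAL);
INSERT INTO source_orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);
INSERT INTO target_table VALUES (99, 0.0);  -- stale row from an earlier load
""")

# The expensive multi-CTE query goes here; this one is a small placeholder.
REFRESH_QUERY = """
WITH totals AS (
    SELECT id, SUM(amount) AS total
    FROM source_orders
    GROUP BY id
)
INSERT INTO target_table (id, total)
SELECT id, total FROM totals
"""


def refresh_target(connection):
    """Empty the landing table and refill it inside one transaction."""
    with connection:  # commits on success, rolls back on error
        connection.execute("DELETE FROM target_table")
        connection.execute(REFRESH_QUERY)


refresh_target(conn)
rows = conn.execute("SELECT id, total FROM target_table ORDER BY id").fetchall()
```

Running both statements in one transaction means readers do not see an empty table if the refill fails part-way through.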

Choosing the view query at runtime (postgres database)

I want to create a view which will choose between two possible selects based on a session variable (set_config) on runtime.
Today I do it by a "union all" between two selects in the following manner:
create view my_view as (
select * from X where cast(current_setting('first_select') as int)=1 and ...
union all
select * from Y where cast(current_setting('first_select') as int)=0 and ...
)
The problem is that the Postgres optimizer makes bad decisions when the target is a union.
So when I run for example:
select * from my_view where id in (select id from Z where field='value')
It decides to do a full scan on table X although it has an index on "id".
Is there another way to define such a view without using a "union" clause?
Just OR them together into the WHERE-clause. The optimiser will find the invariant conditions.
CREATE VIEW my_view AS (
SELECT * FROM X WHERE
( cast(current_setting('first_select') as int)=1 AND <condition1> )
OR ( cast(current_setting('first_select') as int)=0 AND <condition2> )
...
)
;

SQL Server : compare two tables with UNION and Select * plus additional label column

I've been playing around with the sample on Jeff's SQL Server Blog to compare two tables to find the differences.
In my case the tables are a backup and the current data. I can get what I want with this SQL statement (simplified by removing most of the columns). I can then see the rows from each table that don't have an exact match and I can see from which table they come.
SELECT
MIN(TableName) as TableName
,[strCustomer]
,[strAddress1]
,[strCity]
,[strPostalCode]
FROM
(SELECT
'Old' as TableName
,[JAS001].[dbo].[AR_CustomerAddresses].[strCustomer]
,[JAS001].[dbo].[AR_CustomerAddresses].[strAddress1]
,[JAS001].[dbo].[AR_CustomerAddresses].[strCity]
,[JAS001].[dbo].[AR_CustomerAddresses].[strPostalCode]
FROM
[JAS001].[dbo].[AR_CustomerAddresses]
UNION ALL
SELECT
'New' as TableName
,[JAS001new].[dbo].[AR_CustomerAddresses].[strCustomer]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strAddress1]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strCity]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strPostalCode]
FROM
[JAS001new].[dbo].[AR_CustomerAddresses]) tmp
GROUP BY
[strCustomer]
,[strAddress1]
,[strCity]
,[strPostalCode]
HAVING
COUNT(*) = 1
This Stack Overflow Answer gives me a much cleaner SQL query but does not tell me from which table the rows come.
SELECT * FROM [JAS001new].[dbo].[AR_CustomerAddresses]
UNION
SELECT * FROM [JAS001].[dbo].[AR_CustomerAddresses]
EXCEPT
SELECT * FROM [JAS001new].[dbo].[AR_CustomerAddresses]
INTERSECT
SELECT * FROM [JAS001].[dbo].[AR_CustomerAddresses]
I could use the first version but I have many tables that I need to compare and I think that there has to be an easy way to add the source table column to the second query. I've tried several things and googled to no avail. I suspect that maybe I'm just not searching for the correct thing since I'm sure it's been answered before.
Maybe I'm going down the wrong trail and there is a better way to compare the databases?
Could you use the following setup to accomplish your goal?
SELECT 'New not in Old' Descriptor, *
FROM
(
SELECT * FROM [JAS001new].[dbo].[AR_CustomerAddresses]
EXCEPT
SELECT * FROM [JAS001].[dbo].[AR_CustomerAddresses]
) a
UNION
SELECT 'Old not in New' Descriptor, *
FROM
(
SELECT * FROM [JAS001].[dbo].[AR_CustomerAddresses]
EXCEPT
SELECT * FROM [JAS001new].[dbo].[AR_CustomerAddresses]
) b
You can't add the table name there because union, except, and intersection all compare all columns. This means you can't differentiate between them by adding the table name to the query. A group by gives you control over what columns are considered in finding duplicates so you can exclude the table name.
To help you with the large number of tables you need to compare you could write a sql query off the metadata tables that hold table names and columns and generate the sql commands dynamically off those values.
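As a rough sketch of generating the comparison dynamically, the Python snippet below builds the labelled two-way EXCEPT query from a pair of table names and runs it for each pair in a list. It uses an in-memory SQLite database purely so the example is runnable; the table names, sample rows and the labelled_diff_sql helper are hypothetical, and on SQL Server the list of pairs would come from the metadata tables rather than being hard-coded.

```python
import sqlite3


def labelled_diff_sql(old_table: str, new_table: str) -> str:
    """Build a two-way EXCEPT diff that tags each row with its origin.

    Table names are interpolated directly, so they must come from a
    trusted source such as the database's own metadata.
    """
    return (
        "SELECT 'New not in Old' AS Descriptor, * FROM "
        f"(SELECT * FROM {new_table} EXCEPT SELECT * FROM {old_table}) a "
        "UNION "
        "SELECT 'Old not in New' AS Descriptor, * FROM "
        f"(SELECT * FROM {old_table} EXCEPT SELECT * FROM {new_table}) b"
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE addresses_old (strCustomer TEXT, strCity TEXT);
CREATE TABLE addresses_new (strCustomer TEXT, strCity TEXT);
INSERT INTO addresses_old VALUES ('C1', 'Oslo'), ('C2', 'Rome');
INSERT INTO addresses_new VALUES ('C1', 'Oslo'), ('C2', 'Milan');
""")

# Hypothetical list of (old, new) table pairs to compare.
table_pairs = [("addresses_old", "addresses_new")]

diff_rows = []
for old_table, new_table in table_pairs:
    query = labelled_diff_sql(old_table, new_table)
    diff_rows.extend(conn.execute(query).fetchall())
diff_rows.sort()
```

With the sample data, the unchanged row for C1 drops out and the changed row for C2 appears twice: once labelled 'New not in Old' with Milan and once labelled 'Old not in New' with Rome.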
Derive an extra column from the table names, like below:
SELECT MIN(TableName) as TableName
,[strCustomer]
,[strAddress1]
,[strCity]
,[strPostalCode]
,table_name_came
FROM
(SELECT 'Old' as TableName
,[JAS001].[dbo].[AR_CustomerAddresses].[strCustomer]
,[JAS001].[dbo].[AR_CustomerAddresses].[strAddress1]
,[JAS001].[dbo].[AR_CustomerAddresses].[strCity]
,[JAS001].[dbo].[AR_CustomerAddresses].[strPostalCode]
,'[JAS001].[dbo].[AR_CustomerAddresses]' as table_name_came
FROM [JAS001].[dbo].[AR_CustomerAddresses]
UNION ALL
SELECT 'New' as TableName
,[JAS001new].[dbo].[AR_CustomerAddresses].[strCustomer]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strAddress1]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strCity]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strPostalCode]
,'[JAS001new].[dbo].[AR_CustomerAddresses]' as table_name_came
FROM [JAS001new].[dbo].[AR_CustomerAddresses]
) tmp
GROUP BY [strCustomer]
,[strAddress1]
,[strCity]
,[strPostalCode]
,table_name_came
HAVING COUNT(*) = 1