select latest Table in a Big Query Dataset - Standard SQL syntax - google-bigquery

I have dataset containing multiple tables with similar names:
e.g.
affilinet_4221_first_20180911_204956
affilinet_4221_first_20180911_160004
affilinet_4221_first_20180911_085559
affilinet_4221_first_20180910_201323
affilinet_4221_first_20180910_201042
affilinet_4221_first_20180910_080006
affilinet_4221_first_20180909_160707
This query identifies the latest dataset (according to yyyymmdd_hhmmss naming convention) with __TABLES_SUMMARY__ method
SELECT max(table_id) as table_id FROM `modemutti-8d8a6.feed_first.__TABLES_SUMMARY__`
where table_id LIKE "affilinet_4221_first_%"
query result
this query extracts all values from a specific table with _TABLE_SUFFIX method
SELECT * FROM `modemutti-8d8a6.feed_first.*`
WHERE _TABLE_SUFFIX = "affilinet_4221_first_20180911_204956"
query result
This query combines __TABLES_SUMMARY__ (which returns affilinet_4221_first_20180911_204956) and _TABLE_SUFFIX
SELECT * FROM `modemutti-8d8a6.feed_first.*`
WHERE _TABLE_SUFFIX = (
SELECT max(table_id) FROM `modemutti-8d8a6.feed_first.__TABLES_SUMMARY__`
where table_id LIKE "affilinet_4221_first_%")
this query fails:
Error: Cannot read field 'modemio_cat_level' of type INT64 as STRING
error screenshot
any idea why is this happening or how I could solve the issue?
------------EDIT------------
#Mikhail solution works correctly but processes a huge amount of data. See explicit call Vs the suggested Method. Another solution would have been
SELECT * FROM `modemutti-8d8a6.feed_first.affilinet_4221_first_*` WHERE _TABLE_SUFFIX =
(
SELECT MAX(_TABLE_SUFFIX) FROM`modemutti-8d8a6.feed_first.affilinet_4221_first_*`
)
but this processes also a much bigger amount of data compared to the explicit query. Is there are way to achieve through a view in the UI or should I rather use the Python / Java SDK via API?

Try below
#standardSQL
SELECT * FROM `modemutti-8d8a6.feed_first.affilinet_4221_first_*`
WHERE _TABLE_SUFFIX = (
SELECT REPLACE(MAX(table_id), 'affilinet_4221_first_', '')
FROM `modemutti-8d8a6.feed_first.__TABLES_SUMMARY__`
WHERE table_id LIKE "affilinet_4221_first_%"
)

Related

show columns in CTE returns an error - why?

I have a show columns query that works fine:
SHOW COLUMNS IN table
but it fails when trying to put it in a CTE, like this:
WITH columns_table AS (
SHOW COLUMNS IN table
)
SELECT * from columns_table
any ideas why and how to fix it?
Using RESULT_SCAN:
Returns the result set of a previous command (within 24 hours of when you executed the query) as if the result was a table. This is particularly useful if you want to process the output from any of the following:
SHOW or DESC[RIBE] command that you executed.
SHOW COLUMNS IN ...;
WITH columns_table AS (
SELECT *
FROM table(RESULT_SCAN(LAST_QUERY_ID()))
)
SELECT *
FROM columns_table;
CTE requires select clause and we cannot use SHOW COLUMN IN CTE's and as a alterative use INFORMATION_SCHEMA to retrieve metadata .Like below:
WITH columns_table AS (
Select * from INTL_DB.INFORMATION_SCHEMA.COLUMNS where TABLE_NAME='CURRENCIES'
)
SELECT * from columns_table;

add table_id to the result from multiple tables in BigQuery

Below is how I structured the data in BigQuery database.
test
-> sales
-> monthly-2015
-> monthly-2016
-> ...
I want to combine the data of all tables with the table name , monthly-*, and below is how I wrote the sql from examples I found.
Running this sql leads an error like following Scalar subquery produced more than one element. How could I fix it to error?
SELECT
*,
(
SELECT
table_id
FROM
`test.sales.__TABLES_SUMMARY__`
WHERE
table_id LIKE 'monthly-%')
FROM
`test.sales.monthly*`
I want to combine the data of all tables with the table name , monthly-*
Try below
SELECT *, 'monthly_' || _TABLE_SUFFIX as table_name
FROM `test.sales.monthly_*`

How do I pass a series of parameters into a TVF to get series of tables in Big Query

In Postgres I have a query that uses a table value function
SELECT
forecast.*
FROM (
SELECT
generate_series(begin_date, end_date, '1 mon'::interval)::date zdate
) zdate
LEFT JOIN LATERAL forecast_f(zdate.zdate)
forecast(forecast_version , source, forecast_date, gl_date, customer, program, rev) ON true
where 1=1;
and forecast_f looks like:
A lot of boiler plate and then:
BEGIN
return query
select * from table where a lot of parameters are pulled in.
I'm trying to do the same thing in Bigquery and have googled around a few uncommon concepts:
Generating a series
passing parameters into a udf with the right data type (always an adventure)
tvf
and the documentation did not have a lot on TVF's I thought maybe it can't handle tvfs and I'd have to join it all in to a column and split it somehow when it comes out of the function. When I googled, others complained about special cases when TVF's don't work, but that would suggest there are cases where it does work, like maybe mine. So I made this:
create or replace temp function snap(t timestamp)
as
(select * from forecast_stuff.forecast_full_practice where zfrom <= t and (zto> t or zto is null));
select * from snap(current_time())
which didn't work. Also, this fancy number:
create or replace temp function snap(t timestamp)
as
((select intersection from forecast_stuff.forecast_full_practice where zfrom <= t and (zto> t or zto is null)));
select * from snap(current_time())
Didn't work either. Something about if not exist not supported in temporary functions. I remember doing something like this in f1 or dremel a few years back. Did they not bring the technology forward?

format the timestamp output for TABLE_DATE_RANGE() BigQuery

I have a requirement to query different tables a once to save my time. Tables names like
abc_yyyymmdd
can be easily query using the
table_date_range(abc_,timestamp('2016-01-01'),timestamp('2016-03-12'))
but I have different format table name
abc_mm_dd_yyyy
is there a way to query in these tables using table_date_range.
In Legacy SQL you can use TABLE_QUERY for this
So it can be something like below
SELECT *
FROM (
TABLE_QUERY(YourDataset, 'LEFT(table_id, 4) = "abc_" AND LENGTH(table_id) = 14
AND CONCAT(SUBSTR(table_id,11,4),'-',SUBSTR(table_id,5,2), -",SUBSTR(table_id,8,2))
BETWEEN "2016-01-01" AND "2016-03-12"')
)
If you can use Standard SQL, you can use the _TABLE_SUFFIX pseudo column to work with any table name format.
Is there an equivalent of table wildcard functions in BigQuery with standard SQL?
In this case, it would be something like:
SELECT ... FROM `mydataset.abc_2016_*` WHERE _TABLE_SUFFIX = '01-01' or _TABLE_SUFFIX = '03-12'

SQL Server : compare two tables with UNION and Select * plus additional label column

I've been playing around with the sample on Jeff' Server blog to compare two tables to find the differences.
In my case the tables are a backup and the current data. I can get what I want with this SQL statement (simplified by removing most of the columns). I can then see the rows from each table that don't have an exact match and I can see from which table they come.
SELECT
MIN(TableName) as TableName
,[strCustomer]
,[strAddress1]
,[strCity]
,[strPostalCode]
FROM
(SELECT
'Old' as TableName
,[JAS001].[dbo].[AR_CustomerAddresses].[strCustomer]
,[JAS001].[dbo].[AR_CustomerAddresses].[strAddress1]
,[JAS001].[dbo].[AR_CustomerAddresses].[strCity]
,[JAS001].[dbo].[AR_CustomerAddresses].[strPostalCode]
FROM
[JAS001].[dbo].[AR_CustomerAddresses]
UNION ALL
SELECT
'New' as TableName
,[JAS001new].[dbo].[AR_CustomerAddresses].[strCustomer]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strAddress1]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strCity]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strPostalCode]
FROM
[JAS001new].[dbo].[AR_CustomerAddresses]) tmp
GROUP BY
[strCustomer]
,[strAddress1]
,[strCity]
,[strPostalCode]
HAVING
COUNT(*) = 1
This Stack Overflow Answer gives me a much cleaner SQL query but does not tell me from which table the rows come.
SELECT * FROM [JAS001new].[dbo].[AR_CustomerAddresses]
UNION
SELECT * FROM [JAS001].[dbo].[AR_CustomerAddresses]
EXCEPT
SELECT * FROM [JAS001new].[dbo].[AR_CustomerAddresses]
INTERSECT
SELECT * FROM [JAS001].[dbo].[AR_CustomerAddresses]
I could use the first version but I have many tables that I need to compare and I think that there has to be an easy way to add the source table column to the second query. I've tried several things and googled to no avail. I suspect that maybe I'm just not searching for the correct thing since I'm sure it's been answered before.
Maybe I'm going down the wrong trail and there is a better way to compare the databases?
Could you use the following setup to accomplish your goal?
SELECT 'New not in Old' Descriptor, *
FROM
(
SELECT * FROM [JAS001new].[dbo].[AR_CustomerAddresses]
EXCEPT
SELECT * FROM [JAS001].[dbo].[AR_CustomerAddresses]
) a
UNION
SELECT 'Old not in New' Descriptor, *
FROM
(
SELECT * FROM [JAS001].[dbo].[AR_CustomerAddresses]
EXCEPT
SELECT * FROM [JAS001new].[dbo].[AR_CustomerAddresses]
) b
You can't add the table name there because union, except, and intersection all compare all columns. This means you can't differentiate between them by adding the table name to the query. A group by gives you control over what columns are considered in finding duplicates so you can exclude the table name.
To help you with the large number of tables you need to compare you could write a sql query off the metadata tables that hold table names and columns and generate the sql commands dynamically off those values.
Derive one column using table names like below
SELECT MIN(TableName) as TableName
,[strCustomer]
,[strAddress1]
,[strCity]
,[strPostalCode]
,table_name_came
FROM
(SELECT 'Old' as TableName
,[JAS001].[dbo].[AR_CustomerAddresses].[strCustomer]
,[JAS001].[dbo].[AR_CustomerAddresses].[strAddress1]
,[JAS001].[dbo].[AR_CustomerAddresses].[strCity]
,[JAS001].[dbo].[AR_CustomerAddresses].[strPostalCode]
,'[JAS001].[dbo].[AR_CustomerAddresses]' as table_name_came
FROM [JAS001].[dbo].[AR_CustomerAddresses]
UNION ALL
SELECT 'New' as TableName
,[JAS001new].[dbo].[AR_CustomerAddresses].[strCustomer]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strAddress1]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strCity]
,[JAS001new].[dbo].[AR_CustomerAddresses].[strPostalCode]
,'[JAS001new].[dbo].[AR_CustomerAddresses]' as table_name_came
FROM [JAS001new].[dbo].[AR_CustomerAddresses]
) tmp
GROUP BY [strCustomer]
,[strAddress1]
,[strCity]
,[strPostalCode]
,table_name_came
HAVING COUNT(*) = 1