Query table whose name changes every day - BigQuery - SQL

I have a table with a name like 'numberedlanding_20210606_storefront'.
This table is generated every day, and the timestamp in its name changes accordingly. For example, the next day's table name would be 'numberedlanding_20210607_storefront'.
I am writing a query that uses this table as a base table and then does further aggregations. For example:
with df1 as (select * from `numberedlanding_20210606_storefront` where col1 > 0),
df2 as (select *, col1/col2 as test from df1)
select * from df1 a join df2 b on a.name = b.name
Now, since this table name changes every day, how do I make it dynamic in this query?
I tried looking into BigQuery's wildcard tables, but I think they only apply to a variable part at the end of the name, and I want to vary the part in the middle.
Also, here is something I tried that was not successful:
CREATE TEMP FUNCTION getTableName() AS (
  CONCAT("numberedlanding_", FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)), "_storefront")
);
SELECT getTableName()
This just gives me the table name I want to use. Any suggestions would be really helpful.

SELECT * FROM `numberedlanding_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', CURRENT_DATE()) || '_storefront'
Note that _TABLE_SUFFIX is a STRING, so the date has to be formatted to match the yyyymmdd part of the table name.
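If you need the whole multi-CTE query to run against the computed name, BigQuery scripting's EXECUTE IMMEDIATE can build the query string dynamically. A minimal sketch, assuming the table lives in the default dataset and reusing the query from the question:

DECLARE table_name STRING DEFAULT CONCAT(
  'numberedlanding_',
  FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)),
  '_storefront');

-- Build the query text with the computed table name, then run it.
EXECUTE IMMEDIATE FORMAT("""
  WITH df1 AS (SELECT * FROM `%s` WHERE col1 > 0),
  df2 AS (SELECT *, col1/col2 AS test FROM df1)
  SELECT * FROM df1 a JOIN df2 b ON a.name = b.name
""", table_name);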

Related

How to get the partition column names for a table?

I have a table that is partitioned on one or more columns. I can do ...
SHOW PARTITIONS table_db.table_1
which gives a list of all partitions like this,
year=2007
year=2015
year=1999
year=1993
but I am only interested in finding which columns the table is partitioned on, in this case, year. And I would like to be able to do this for multiple tables at once, giving me a list of their names and partition columns, somewhat like this:
table_name   partition_col
table_1      year
table_2      year, month
I tried the solutions here...
https://docs.aws.amazon.com/athena/latest/ug/querying-glue-catalog.html#querying-glue-catalog-listing-partitions
SELECT * FROM table_db."table_1$partitions"
does give me results with one column for each partition...
# year
1 2007
2 2015
3 1999
4 1993
...but I couldn't extract the column names from this query.
Try this.
SELECT table_name,
array_join(array_agg(column_name), ', ') as partition_col
FROM information_schema.columns
WHERE extra_info = 'partition key'
GROUP BY 1
Get the metadata with the AWS client provided in your language, e.g. boto3's Athena client for Python:
import boto3

client = boto3.client("athena")
table = client.get_table_metadata(
    CatalogName=catalog,
    DatabaseName=database,
    TableName=name,
)["TableMetadata"]
partition_keys = table["PartitionKeys"]
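To cover multiple tables at once, as the question asks, a sketch that pages through list_table_metadata (assuming default AWS credentials and the default AwsDataCatalog catalog; table_db is the database from the question):

import boto3

client = boto3.client("athena")
paginator = client.get_paginator("list_table_metadata")
for page in paginator.paginate(CatalogName="AwsDataCatalog", DatabaseName="table_db"):
    for table in page["TableMetadataList"]:
        # PartitionKeys is a list of {"Name": ..., "Type": ...} dicts.
        cols = ", ".join(k["Name"] for k in table.get("PartitionKeys", []))
        print(table["Name"], cols)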
It seems this solution is for MySQL, not SQL Server.

How to select the nth column, and order columns' selection in BigQuery

I have this huge table upon which I apply a lot of processing (using CTEs), and I want to perform a UNION ALL on 2 particular CTEs.
SELECT *
, 0 AS orders
, 0 AS revenue
, 0 AS units
FROM secondary_prep_cte WHERE purchase_event_flag IS FALSE
UNION ALL
SELECT *
FROM results_orders_and_revenues_cte
I get a "Column 1164 in UNION ALL has incompatible types : STRING,DATE at [97:5]
Obviously I don't know the name of the column, and I'd like to debug this but I feel like I'm going to waste a lot of time if I can't pin-point which column is 1164.
I also think this is a problem of the order of columns between the CTEs, so I have 2 questions:
How do I identify the 1164th column
How do I order my columns before performing the UNION ALL
I found this similar question but it is for MSSQL, I am using BigQuery
You can get information from INFORMATION_SCHEMA.COLUMNS but you'll need to create a table or view from the CTE:
CREATE OR REPLACE VIEW `project.dataset.secondary_prep_view` as select * from (select 1 as id, "a" as name, "b" as value)
Then:
SELECT * FROM dataset.INFORMATION_SCHEMA.COLUMNS WHERE table_name = 'secondary_prep_view';
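From there, ordinal_position lets you pinpoint the offending column directly. A sketch, assuming the view preserves the CTE's column order:

SELECT column_name, data_type
FROM dataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'secondary_prep_view'
  AND ordinal_position = 1164;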

Select latest table in a BigQuery dataset - standard SQL syntax

I have a dataset containing multiple tables with similar names, e.g.:
affilinet_4221_first_20180911_204956
affilinet_4221_first_20180911_160004
affilinet_4221_first_20180911_085559
affilinet_4221_first_20180910_201323
affilinet_4221_first_20180910_201042
affilinet_4221_first_20180910_080006
affilinet_4221_first_20180909_160707
This query identifies the latest table (according to the yyyymmdd_hhmmss naming convention) with the __TABLES_SUMMARY__ method:
SELECT max(table_id) as table_id FROM `modemutti-8d8a6.feed_first.__TABLES_SUMMARY__`
where table_id LIKE "affilinet_4221_first_%"
This query extracts all values from a specific table with the _TABLE_SUFFIX method:
SELECT * FROM `modemutti-8d8a6.feed_first.*`
WHERE _TABLE_SUFFIX = "affilinet_4221_first_20180911_204956"
This query combines __TABLES_SUMMARY__ (which returns affilinet_4221_first_20180911_204956) and _TABLE_SUFFIX:
SELECT * FROM `modemutti-8d8a6.feed_first.*`
WHERE _TABLE_SUFFIX = (
SELECT max(table_id) FROM `modemutti-8d8a6.feed_first.__TABLES_SUMMARY__`
where table_id LIKE "affilinet_4221_first_%")
This query fails with:
Error: Cannot read field 'modemio_cat_level' of type INT64 as STRING
any idea why is this happening or how I could solve the issue?
------------EDIT------------
@Mikhail's solution works correctly but processes a huge amount of data compared to an explicit call to the single table. Another solution would have been
SELECT * FROM `modemutti-8d8a6.feed_first.affilinet_4221_first_*` WHERE _TABLE_SUFFIX =
(
SELECT MAX(_TABLE_SUFFIX) FROM`modemutti-8d8a6.feed_first.affilinet_4221_first_*`
)
but this also processes a much bigger amount of data than the explicit query. Is there a way to achieve this through a view in the UI, or should I rather use the Python / Java SDK via the API?
Try below
#standardSQL
SELECT * FROM `modemutti-8d8a6.feed_first.affilinet_4221_first_*`
WHERE _TABLE_SUFFIX = (
SELECT REPLACE(MAX(table_id), 'affilinet_4221_first_', '')
FROM `modemutti-8d8a6.feed_first.__TABLES_SUMMARY__`
WHERE table_id LIKE "affilinet_4221_first_%"
)
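If the amount of data processed is the main concern, a scripting sketch that resolves the latest table name first and then queries only that single table (EXECUTE IMMEDIATE builds the query text at run time; project and dataset names are taken from the question):

DECLARE latest STRING;
-- __TABLES_SUMMARY__ is a metadata view, so this step is cheap.
SET latest = (
  SELECT MAX(table_id)
  FROM `modemutti-8d8a6.feed_first.__TABLES_SUMMARY__`
  WHERE table_id LIKE 'affilinet_4221_first_%');
-- Scan only the one latest table instead of the whole wildcard set.
EXECUTE IMMEDIATE FORMAT(
  'SELECT * FROM `modemutti-8d8a6.feed_first.%s`', latest);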

SQL Contains query

I have two tables (A and B) that contain IDs; however, in table B some records have these IDs grouped together, e.g. the IDExec column may contain a record that looks like 'id1 id2'. I'm trying to find the IDs in table A that do not appear in table B. I thought of using something like:
SELECT *
FROM A
WHERE NOT EXISTS( SELECT *
FROM B
WHERE Contains(A.ExecID, B.ExecID))
This isn't working, because CONTAINS needs the second parameter to be a string, text_lex, or variable.
Do you guys have a solution to this problem?
To shed more light on the above problem, the table structures are as follows:
Table A (IDExec, ProdName, BuySell, Quantity, Price, DateTime)
Table B (IDExec, ClientAccountNo, Quantity)
The C# code I've created to manipulate the BuySell data in Table A groups up all the buys/sells of the same product on a given day. The question now is: how would you guys normalise this so I'm not bastardizing IDExec? Would it be better to create a new ID column in Table B called AllocID and link the two tables like that? So something like this:
Table A (IDExec, AllocID, ProdName, BuySell, Quantity, Price, DateTime)
Table B (AllocID, ClientAccountNo, Quantity)
This data should be normalized, storing multiple values in one field is a bad idea.
A workaround is using LIKE:
SELECT *
FROM A
WHERE NOT EXISTS( SELECT *
FROM B
WHERE ' '+B.ExecID+' ' LIKE '% '+A.ExecID+' %')
This is using space delimited values per your example.
This is kind of crude, but it will give you all of the entries in A that are not contained in B.
SELECT * FROM A WHERE A.ExecID not in (SELECT ExecID from B);
I have a very simple solution. It's called normalization. Proper modeling can work wonders for query simplicity and accuracy.
However, you may be stuck with what you have. Assuming ExecID is a string in both tables, try this:
select *
from A
where not exists(
    select *
    from B
    where B.ExecID like '%' || A.ExecID || '%');
This is a horrible query as it performs a complete table scan of B for every row in A and the subquery is susceptible to false hits, so maybe you can do better, but your best course ultimately is a touch of database refactoring.
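For illustration, a sketch of what the anti-join could look like after that refactoring, using a hypothetical junction table B_Execs that stores one IDExec per row (the name is made up; the asker's AllocID idea would populate it):

-- Hypothetical junction table: one row per (AllocID, IDExec) pair.
SELECT a.*
FROM A a
WHERE NOT EXISTS (
    SELECT 1
    FROM B_Execs be
    WHERE be.IDExec = a.IDExec  -- exact match, no LIKE scan needed
);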

How to create temporary table in Google BigQuery

Is there any way to create a temporary table in Google BigQuery through:
SELECT * INTO <temp table>
FROM <table name>
the same as we can in SQL?
For complex queries, I need to create temporary tables to store my data.
2018 update - definitive answer with DDL
With BigQuery's DDL support you can create a table from the results of a query - and specify its expiration at creation time. For example, for 3 days:
#standardSQL
CREATE TABLE `fh-bigquery.public_dump.vtemp`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 3 DAY)
) AS
SELECT corpus, COUNT(*) c
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus
Docs: https://cloud.google.com/bigquery/docs/data-definition-language
2019 update -- With BigQuery scripting (Beta now), CREATE TEMP TABLE is officially supported. See public documentation here.
2018 update: https://stackoverflow.com/a/50227484/132438
Every query in BigQuery creates a temporary table with its results. It stays temporary unless you give a name to the destination table, in which case you are in control of its lifecycle.
Use the API to see the temporary table name, or name your tables when querying.
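A sketch of reading that temporary table name with the google-cloud-bigquery Python client (assumes the library is installed and credentials are configured):

from google.cloud import bigquery

client = bigquery.Client()
job = client.query("SELECT 17 AS answer")
job.result()  # wait for the query to finish
# Every query job exposes its (possibly anonymous, temporary) destination table.
print(job.destination)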
2019 update -- With BigQuery scripting, CREATE TEMP TABLE is officially supported. See public documentation here.
CREATE TEMP TABLE Example
(
x INT64,
y STRING
);
INSERT INTO Example
VALUES (5, 'foo');
INSERT INTO Example
VALUES (6, 'bar');
SELECT *
FROM Example;
A temporary table can be created with WITH in the "New Standard SQL". See WITH clause.
An example given by Google:
WITH subQ1 AS (SELECT SchoolID FROM Roster),
subQ2 AS (SELECT OpponentID FROM PlayerStats)
SELECT * FROM subQ1
UNION ALL
SELECT * FROM subQ2;
To create a temporary table, use the TEMP or TEMPORARY keyword with the CREATE TABLE statement. Use of CREATE TEMPORARY TABLE requires a script, so it's best to wrap it in a BEGIN ... END block:
BEGIN
CREATE TEMP TABLE <table_name> AS SELECT * FROM <source_table> WHERE <condition>;
END;
Example of creating a short-lived table in GCP BigQuery (a regular table with an expiration, rather than a TEMP table):
CREATE TABLE `project_ID_XXXX.Sales.superStore2011`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
) AS
SELECT
Product_Name,Product_Category, SUM(profit) Total_Profit, FORMAT_DATE("%Y",Order_Date) AS Year
FROM
`project_ID_XXXX.Sales.superStore`
WHERE
FORMAT_DATE("%Y",Order_Date)="2011"
GROUP BY
Product_Name,Product_Category,Order_Date
ORDER BY
Year, Total_Profit DESC
LIMIT 5
It's 2022, and if you type the code to create a TEMP table in BigQuery's interactive window, it will not work. It will likely display an error like "Use of CREATE TEMPORARY TABLE requires a script or session".
This hints that your interactive window has to be tied to a session. There is official documentation on how to create sessions.
The short and easy method for me was to go to the MORE menu of the Google BigQuery interactive window and select Query Settings.
Enable/click "Use session mode" and SAVE. That's it, enjoy your temporary tables :D
Take the SQL sample of
SELECT name,count FROM mydataset.babynames
WHERE gender = 'M' ORDER BY count DESC LIMIT 6 INTO mydataset.happyhalloween;
The easiest command line equivalent is
bq query --destination_table=mydataset.happyhalloween \
"SELECT name,count FROM mydataset.babynames WHERE gender = 'M' \
ORDER BY count DESC LIMIT 6"
See the documentation here:
https://cloud.google.com/bigquery/bq-command-line-tool#createtablequery
I followed Google's official documentation while learning UDFs and encountered the error "Use of CREATE TEMPORARY TABLE requires a script or session".
Erroneous script:
CREATE TEMP TABLE users
AS SELECT 1 id, 10 age
UNION ALL SELECT 2, 30
UNION ALL SELECT 3, 10;
Solution:
BEGIN
CREATE TEMP TABLE users
AS SELECT 1 id, 10 age
UNION ALL SELECT 2, 30
UNION ALL SELECT 3, 10;
END;
To create and store your data on the fly, you can specify the optional _SESSION qualifier to create a temporary table.
CREATE TEMP TABLE _SESSION.tmp_01
AS
SELECT name FROM `bigquery-public-data`.usa_names.usa_1910_current
WHERE year = 2017
;
Here you can create the table from a complex query starting after 'AS'; the temporary table is created at once and is deleted after 24 hours.
To access the table,
select * from _SESSION.tmp_01;
Update September 2022:
As per the documentation, you can create a temporary table like:
CREATE TEMP TABLE continents(name STRING, visitors INT64)
AS
SELECT geo.continent, COUNT(DISTINCT user_pseudo_id) AS Continent_Visitors
FROM `firebaseProject.dataset.events_date`
GROUP BY geo.continent
ORDER BY Continent_Visitors DESC;

SELECT * FROM continents;
DROP TABLE continents;