MERGE on multiple tables - google-bigquery

I am trying to run the following, but BigQuery rejects it with "Illegal operation (write) on meta-table".
MERGE x.y.events_* as events
USING
(
select distinct
user_id,
user_pseudo_id
from x.y.events_*
where user_id is not null
and user_pseudo_id is not null
qualify row_number() over (partition by user_pseudo_id) = 1
order by user_pseudo_id
) user_ids
ON events.user_pseudo_id = user_ids.user_pseudo_id
WHEN MATCHED THEN
UPDATE SET events.user_id = user_ids.user_id
This works fine if I name a single table such as x.y.events_20230115 after MERGE, but I have about 700 tables to update, and I would also like to run this dynamically every day so that it updates yesterday's table. With the wildcard, BigQuery tells me that this is an "Illegal operation (write) on meta-table". That makes sense, but I can't figure out how to proceed.
I am aware that I can use something like _table_suffix = FORMAT_DATE('%Y%m%d', DATE_SUB(@run_date, INTERVAL 1 DAY)) in a WHERE clause, but that doesn't seem like a solution here because I'm trying to write data.
Could anyone kindly point me in the right direction? How can I refer to the table suffix dynamically in MERGE x.y.events_, or is there perhaps a better way of doing this, such as some sort of iteration?
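One possible direction, sketched under assumptions rather than as a confirmed fix: since a wildcard table cannot be a DML target, a BigQuery script can build the name of a single shard and run the MERGE against it with EXECUTE IMMEDIATE. The variable names merge_template and yesterday are made up for illustration, and CURRENT_DATE() stands in for the run date (in a scheduled query, @run_date would presumably take its place).

```sql
-- Template for a MERGE against ONE shard; %s is replaced with the shard suffix.
-- The QUALIFY clause keeps one arbitrary user_id per user_pseudo_id, as in the question.
DECLARE merge_template STRING DEFAULT """
  MERGE `x.y.events_%s` AS events
  USING (
    SELECT user_pseudo_id, user_id
    FROM `x.y.events_*`
    WHERE user_id IS NOT NULL
      AND user_pseudo_id IS NOT NULL
    QUALIFY ROW_NUMBER() OVER (PARTITION BY user_pseudo_id) = 1
  ) AS user_ids
  ON events.user_pseudo_id = user_ids.user_pseudo_id
  WHEN MATCHED THEN
    UPDATE SET user_id = user_ids.user_id
""";

-- Suffix of yesterday's shard, e.g. '20230115'.
DECLARE yesterday STRING DEFAULT
  FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY));

-- Daily run: update only yesterday's table.
EXECUTE IMMEDIATE FORMAT(merge_template, yesterday);
```

For the one-off backfill of the roughly 700 existing tables, the last statement could be swapped for a loop over the shard names. This assumes the shards are named events_YYYYMMDD and that x.y.INFORMATION_SCHEMA.TABLES is readable:

```sql
FOR shard IN (
  SELECT REPLACE(table_name, 'events_', '') AS suffix
  FROM x.y.INFORMATION_SCHEMA.TABLES
  WHERE REGEXP_CONTAINS(table_name, r'^events_\d{8}$')
) DO
  -- One MERGE per shard; each iteration rescans the wildcard source.
  EXECUTE IMMEDIATE FORMAT(merge_template, shard.suffix);
END FOR;
```

Because the USING subquery reads every shard on each run, it may be worth materializing the user_id mapping into its own table first if cost becomes a concern.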

Related

Writing Scheduled Queries using the run_date vs current_date

I have created a scheduled query that returns a count of users, and transactions on each day. Here is the code:
SELECT
event_date,
COUNT(DISTINCT user_id) users,
COUNT(DISTINCT transaction_id) transactions,
FROM `xyz.events`
WHERE
event_date = current_date
GROUP BY event_date
ORDER BY event_date
The query shown above works when I execute it manually. But when I use it as a scheduled query it doesn't update the destination table as it should even though if I check the runs, it shows that the query has run successfully for that particular day.
The query shown below however does the trick and runs exactly as intended. It updates the daily count of users and transactions in the destination table.
SELECT
DATE_SUB(@run_date, INTERVAL 1 DAY) event_date,
COUNT(DISTINCT user_id) users,
COUNT(DISTINCT transaction_id) transactions,
FROM `xyz.events`
WHERE
event_date = DATE_SUB(@run_date, INTERVAL 1 DAY)
GROUP BY event_date
ORDER BY event_date
So I wanted to understand why this is happening? Because when run manually both the queries give the same output.
Welcome Anxiety,
When you call the CURRENT_DATE() function you must add the opening and closing parentheses () at the end. The missing parentheses are why this query fails when it is set to run as a scheduled query.
As to why it runs in a regular BigQuery query window, I am not certain, but I assume the UI has some built-in logic to work around the missing parentheses, which is not available to scheduled queries.
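To compare the two variants by hand, it may help to run the scheduled form with a stand-in for the run date, since @run_date is supplied by the scheduler and is normally not defined in an ad hoc run. The following is only a sketch: the script variable run_date is a hypothetical substitute for @run_date, and event_date is assumed to be a DATE column.

```sql
-- Hypothetical stand-in for the scheduler-provided @run_date parameter.
DECLARE run_date DATE DEFAULT CURRENT_DATE();

SELECT
  event_date,
  COUNT(DISTINCT user_id) AS users,
  COUNT(DISTINCT transaction_id) AS transactions
FROM `xyz.events`
-- Same filter as the scheduled query: the day before the run date.
WHERE event_date = DATE_SUB(run_date, INTERVAL 1 DAY)
GROUP BY event_date
ORDER BY event_date;
```

Setting the DEFAULT to a fixed date, for example DATE '2023-01-15', makes it possible to replay a past run.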

Merge statement?

I am more of a beginner with SQL, but I would like some help on which statement would be best for my query. I have an app that records test data. Because a score could be 90 or 85.6, the values land in different columns: the former in value.int_value, the latter in value.double_value. I need to merge the two columns into one column for "test_score". Here is my current query; the data goes to a table called "App_test_outcome":
SELECT event_date, timestamp_micros(event_timestamp) as Timestamp, user_pseudo_id, geo.country, geo.region, geo.city, geo.sub_continent,
(select value.string_value from unnest (event_params) where key = "test_passed") as Test_outcome,
(select value.string_value from unnest (event_params) where key = "test_category") as Test_outcome_category,
(select value.double_value from unnest (event_params) where key = "test_score") as Test_outcome_score,
FROM `Appname.analytics_number.events_*`
WHERE
_TABLE_SUFFIX BETWEEN '20200201' AND '20201130' AND
event_name = "test_completed"
Would I need to write another query and merge its result with the data already in that table, or is there a way to run one query that merges the two columns into one? I would prefer the latter if possible.
When I tried to append the result of a query on the int value to the existing double column, I got the error "invalid schema: Field Test_outcome_score has changed type from FLOAT to INTEGER", which makes me think I cannot merge the two columns anyway.
Any help would be great,
Many thanks,
Maybe IFNULL with CAST will help:
(select IFNULL(value.double_value, CAST(value.int_value AS FLOAT64)) from unnest (event_params) where key = "test_score") as Test_outcome_score
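To see the expression at work without touching the real export, here is a small self-contained sketch. The events CTE is invented sample data that only mimics the event_params shape from the question: one row carries the score as an integer, the other as a double.

```sql
WITH events AS (
  -- Score stored as an integer (double_value is NULL).
  SELECT [
    STRUCT('test_score' AS key,
           STRUCT(CAST(NULL AS FLOAT64) AS double_value, 90 AS int_value) AS value)
  ] AS event_params
  UNION ALL
  -- Score stored as a double (int_value is NULL).
  SELECT [
    STRUCT('test_score' AS key,
           STRUCT(85.6 AS double_value, CAST(NULL AS INT64) AS int_value) AS value)
  ]
)
SELECT
  (SELECT IFNULL(value.double_value, CAST(value.int_value AS FLOAT64))
   FROM UNNEST(event_params)
   WHERE key = 'test_score') AS Test_outcome_score
FROM events;
```

Both rows come back as FLOAT64 (90.0 and 85.6), so the destination column keeps a single type and the "changed type from FLOAT to INTEGER" error should not arise.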

How to extract record's table name when using Table wildcard functions [duplicate]

I have a set of day-sharded data where individual entries do not contain the day. I would like to use table wildcards to select all available data and get back data that is grouped by both the column I am interested in and the day that it was captured. Something, in other words, like this:
SELECT table_id, identifier, Sum(AppAnalytic) as AppAnalyticCount
FROM (TABLE_QUERY(database_main,'table_id CONTAINS "Title_" AND length(table_id) >= 4'))
GROUP BY identifier, table_id order by AppAnalyticCount DESC LIMIT 10
Of course, this does not actually work because table_id is not visible in the table aggregation resulting from the TABLE_QUERY function. Is there any way to accomplish this? Some sort of join on table metadata perhaps?
This functionality is now available in BigQuery through the _TABLE_SUFFIX pseudocolumn. Full documentation is at https://cloud.google.com/bigquery/docs/querying-wildcard-tables.
A couple of things to note:
You will need to use Standard SQL to enable table wildcards.
You will have to alias _TABLE_SUFFIX to another name in your SELECT list, as the following example illustrates:
SELECT _TABLE_SUFFIX as table_id, ... FROM `MyDataset.MyTablePrefix_*`
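Applied to the query from the question, this might look like the sketch below. It assumes the shards live in the dataset database_main and share the prefix Title_ (both taken from the TABLE_QUERY call); the alias table_suffix is an arbitrary name.

```sql
SELECT
  _TABLE_SUFFIX AS table_suffix,          -- the part of the table name matched by *
  identifier,
  SUM(AppAnalytic) AS AppAnalyticCount
FROM `database_main.Title_*`
GROUP BY table_suffix, identifier
ORDER BY AppAnalyticCount DESC
LIMIT 10;
```

Note that _TABLE_SUFFIX holds only the text matched by the wildcard, not the full table name, so the prefix has to be concatenated back on if the complete table_id is needed.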
Not available today, but something I'd love to have too. The team takes feature requests seriously, so thanks for adding support for this one :).
In the meantime, a workaround is doing a manual union of a SELECT of each table, plus an additional column with the date data.
For example, instead of:
SELECT x, #TABLE_ID
FROM table201401, table201402, table201403
You could do:
SELECT x, month
FROM
(SELECT x, '201401' AS month FROM table201401),
(SELECT x, '201402' AS month FROM table201402),
(SELECT x, '201403' AS month FROM table201403)

How to update or insert a record in a postgres table which is obtained by doing another query?

I want to write a simple statistics tool that runs some queries and saves the results in another table in the same database.
Mainly I want to track the number of items in different tables, the number of items touched during a month, and so on. This would give me analytics on how the system is used, information that I cannot get just by looking at the database state at a single moment.
Let's say that I have this query:
select count(*) as mytab_mcount from mytab where updated > CURRENT_DATE - INTERVAL '1 months';
Now I do want to store the result of this query in a stats table so I can query it in order to get some trend data.
Clearly I could code this in some programming language, but I am wondering if I can do it in SQL alone, in the Postgres dialect.
I want to put the result in a table like
date        mytab_mcount  some_stat
2013-09-01  1234          NULL
Clearly the SQL should insert a new row or update the existing one.
Is this possible? Can you post a basic example?
If this could be done in a single query, it would be very easy to automate: all the logic would stay in one place, with a cron job to execute it.
Have you tried something like:
INSERT INTO stat_table (stat_date, table_name, row_count, some_stat)
SELECT CURRENT_DATE, 'mytab', count(*), 2+3
FROM mytab
WHERE updated > CURRENT_DATE - INTERVAL '1 months';
Or
UPDATE stat_table
SET row_count = (SELECT count(*) FROM mytab WHERE updated > CURRENT_DATE - INTERVAL '1 months'),
stat_date = CURRENT_DATE,
some_stat = (SELECT 1+3)
WHERE table_name = 'mytab';
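On PostgreSQL 9.5 or later, the two statements can be folded into a single insert-or-update with INSERT ... ON CONFLICT. This is a sketch, not part of the original answer: the table definition and the choice of (stat_date, table_name) as the unique key are assumptions, and some_stat is left NULL as in the example row from the question.

```sql
-- Assumed layout of the stats table; the primary key is what ON CONFLICT relies on.
CREATE TABLE IF NOT EXISTS stat_table (
  stat_date  date    NOT NULL,
  table_name text    NOT NULL,
  row_count  bigint  NOT NULL,
  some_stat  integer,
  PRIMARY KEY (stat_date, table_name)
);

-- Insert today's row, or overwrite it if the job already ran today.
INSERT INTO stat_table (stat_date, table_name, row_count, some_stat)
SELECT CURRENT_DATE, 'mytab', count(*), NULL::integer
FROM mytab
WHERE updated > CURRENT_DATE - INTERVAL '1 month'
ON CONFLICT (stat_date, table_name)
DO UPDATE SET row_count = EXCLUDED.row_count,
              some_stat = EXCLUDED.some_stat;
```

Keeping one row per day and table preserves the history needed for trend queries, while rerunning the cron job on the same day simply refreshes that day's row.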