Writing Scheduled Queries using the run_date vs current_date - sql

I have created a scheduled query that returns a daily count of users and transactions. Here is the code:
SELECT
event_date,
COUNT(DISTINCT user_id) users,
COUNT(DISTINCT transaction_id) transactions,
FROM `xyz.events`
WHERE
event_date = current_date
GROUP BY event_date
ORDER BY event_date
The query shown above works when I execute it manually. However, when I use it as a scheduled query, it doesn't update the destination table as it should, even though the run history shows that the query ran successfully for that particular day.
The query shown below however does the trick and runs exactly as intended. It updates the daily count of users and transactions in the destination table.
SELECT
DATE_SUB(@run_date, INTERVAL 1 DAY) event_date,
COUNT(DISTINCT user_id) users,
COUNT(DISTINCT transaction_id) transactions,
FROM `xyz.events`
WHERE
event_date = DATE_SUB(@run_date, INTERVAL 1 DAY)
GROUP BY event_date
ORDER BY event_date
So I wanted to understand why this is happening, because when run manually both queries give the same output.

Welcome Anxiety,
When you call the CURRENT_DATE() function, you must add the opening and closing parentheses, (), at the end. Their absence from your function call is why this query fails when it is set to run as a scheduled query.
As to why it runs in a regular BigQuery query window, I am not certain, but I assume the UI has some built-in logic to work around the missing parentheses that is not available to scheduled queries.
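For illustration, here is a minimal sketch of the first query with the suggestion above applied. It only adds the parentheses and explicit AS aliases; the table and column names are the ones from the question. Keep in mind that CURRENT_DATE() with no argument is evaluated in UTC, so "today" in a scheduled run may not be the day you expect in your local time zone.

```sql
-- Sketch only: the question's first query with CURRENT_DATE() written as a function call.
SELECT
  event_date,
  COUNT(DISTINCT user_id) AS users,
  COUNT(DISTINCT transaction_id) AS transactions
FROM `xyz.events`
WHERE
  -- CURRENT_DATE() defaults to UTC; pass a time zone name as an argument if you need a local date.
  event_date = CURRENT_DATE()
GROUP BY event_date
ORDER BY event_date
```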

Related

MERGE on multiple tables

I am trying to do the following but this is an "Illegal operation (write) on meta-table".
MERGE x.y.events_* as events
USING
(
select distinct
user_id,
user_pseudo_id
from x.y.events_*
where user_id is not null
and user_pseudo_id is not null
qualify row_number() over (partition by user_pseudo_id) = 1
order by user_pseudo_id
) user_ids
ON events.user_pseudo_id = user_ids.user_pseudo_id
WHEN MATCHED THEN
UPDATE SET events.user_id = user_ids.user_id
This works fine if I specify x.y.events_20230115 after MERGE, but I have about 700 tables to update, and I would also like to run this dynamically every day so that it updates yesterday's table. With the wildcard, BigQuery tells me that this is an "Illegal operation (write) on meta-table". That makes sense, but I can't figure out how to proceed.
I am aware that I can use something like _table_suffix = FORMAT_DATE('%Y%m%d', DATE_SUB(@run_date, INTERVAL 1 DAY)) in WHERE clauses, but that doesn't seem like a solution here, as I'm trying to write data.
Could anyone kindly point me in the right direction? How can I dynamically refer to the table suffix in MERGE x.y.events_, or is there perhaps a better way of doing this? Some sort of iteration?
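One possible direction, offered as an untested sketch rather than a confirmed answer: build the table name as a string and run the MERGE as dynamic SQL with EXECUTE IMMEDIATE. The sketch assumes it runs as a scheduled query (so that @run_date is available) and that no destination table is configured, since the statement is a script containing DML. The variable name suffix is made up for the example.

```sql
-- Sketch only: target yesterday's shard by name instead of using a wildcard.
DECLARE suffix STRING DEFAULT
  FORMAT_DATE('%Y%m%d', DATE_SUB(@run_date, INTERVAL 1 DAY));

-- FORMAT substitutes the suffix into both table references before the statement runs.
EXECUTE IMMEDIATE FORMAT("""
  MERGE `x.y.events_%s` AS events
  USING (
    SELECT user_pseudo_id, user_id
    FROM `x.y.events_%s`
    WHERE user_id IS NOT NULL
      AND user_pseudo_id IS NOT NULL
    -- ORDER BY added so the row that is kept is deterministic
    QUALIFY ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY user_id) = 1
  ) AS user_ids
  ON events.user_pseudo_id = user_ids.user_pseudo_id
  WHEN MATCHED THEN
    UPDATE SET user_id = user_ids.user_id
""", suffix, suffix);
```

For the roughly 700 existing tables, the same script could in principle be run once per past date, for example through the scheduled query's backfill option, instead of writing a separate loop.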

BigQuery - Date_trunc in window function can't be group by on

I am trying to use date_trunc() on a date in a window function in BigQuery. I used to do this in Snowflake and everything went smoothly. Unfortunately, BigQuery tells me that the full date needs to be in the GROUP BY, which defeats the purpose of using the date_trunc function. I wish to group by "year-month" and customer_id and give every customer a "rank" based on their number of orders per "year-month". Here's an example of my script:
select
id as customer_id,
date_trunc(month from date) as date,
count(1) as orders,
row_number() over (partition by date_trunc(month from date) order by count(1) desc) as customer_order
from table
group by 1,2
And I get this error code :
PARTITION BY expression references column date which is neither grouped nor aggregated
Does anyone know how to avoid this problem in an elegant manner? I know I could use a subquery / CTE to fix this, but I'm curious to understand why BigQuery prevents this operation.
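For reference, the CTE workaround mentioned above could look like the sketch below: aggregate per customer and month first, then rank over the already-truncated month. The names orders_table, order_date and order_month are placeholders standing in for the question's table, date column and alias, and the truncation uses BigQuery's DATE_TRUNC(date_expression, MONTH) form.

```sql
-- Sketch only: placeholder names, not the real schema.
WITH monthly_orders AS (
  SELECT
    id AS customer_id,
    DATE_TRUNC(order_date, MONTH) AS order_month,
    COUNT(*) AS orders
  FROM orders_table
  GROUP BY 1, 2
)
SELECT
  customer_id,
  order_month,
  orders,
  -- The window now partitions on a plain column of the CTE, so nothing ungrouped is referenced.
  ROW_NUMBER() OVER (PARTITION BY order_month ORDER BY orders DESC) AS customer_order
FROM monthly_orders
```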

SQL Statement to Return Records with More than Two Instances and Omit the Last Instance

I am looking to write a SQL statement that will pull EventIDs with two or more instances, but will omit the last instance of these records. This seems crazy, but the purpose of this is to look at events that have multiple updates (the updates are related to the TIME_OF_EVENT column, each time a crew/person updates the event, the time it was updated gets stored here) and see which ones have expired in the middle. I see if they expired in the middle by comparing the TIME_OF_EVENT to the PREV_ERT.
select *
from ert_change_log
where time_of_event > '30-SEP-21 23:59:59'
and source <> 'I'
The SQL above generates the picture below. This is simply just a reference of the table to provide an example of EventIDs that meet this criteria.
In a perfect world, the query I need would only return EventIDs 210043901 and 210044021, and would omit the latest TIME_OF_EVENT for each of those EventIDs.
If this is confusing I would be glad to offer more explanation or clarification!
Thanks for any input.
You can use the ROW_NUMBER analytic function:
SELECT *
FROM (
SELECT e.*,
ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY time_of_event DESC) AS rn
FROM ert_change_log e
WHERE time_of_event >= DATE '2021-10-01'
AND source <> 'I'
)
WHERE rn > 1;

How to update or insert a record in a postgres table which is obtained by doing another query?

I want to write a simple statistics tool that runs some queries and saves the results in another table in the same database.
Mainly I want to track the number of items in different tables, the number of items touched during a month, and so on. This would give me some analytics on how the system is used, information that I cannot get just by looking at the state of the database at a single moment.
Let's say that I have this query:
select count(*) as mytab_mcount from mytab where updated > CURRENT_DATE - INTERVAL '1 months';
Now I want to store the result of this query in a stats table so that I can query it later to get some trend data.
Clearly I could code this in some programming language, but I am wondering whether I can do it in SQL alone, using the Postgres dialect.
I want to put the result in a table like
date mytab_mcount some_stat
2013-09-01 1234 NULL
Clearly the SQL should insert a new row or update the existing one.
Is this possible? Could you give a basic example?
If this could be done in a single query, it would be very easy to automate, keeping all the logic in one place and having a cron job execute it.
Have you tried something like:
INSERT INTO stat_table (stat_date, table_name, row_count, some_stat)
SELECT CURRENT_DATE, 'mytab', count(*), 2+3
FROM mytab
WHERE updated > CURRENT_DATE - INTERVAL '1 months';
Or
UPDATE stat_table
SET row_count = (SELECT count(*) FROM mytab WHERE updated > CURRENT_DATE - INTERVAL '1 months'),
stat_date = CURRENT_DATE,
some_stat = (SELECT 1+3)
WHERE table_name = 'mytab';
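If you want the insert-or-update behaviour in a single statement, PostgreSQL 9.5 and later support INSERT ... ON CONFLICT. The sketch below is a variation on the two statements above, not a tested solution for your schema: the stat_table definition and its primary key on (stat_date, table_name) are assumptions made for the example, and mytab with its updated column comes from the question.

```sql
-- Assumed layout: one row per day and per tracked table.
CREATE TABLE IF NOT EXISTS stat_table (
    stat_date  date    NOT NULL,
    table_name text    NOT NULL,
    row_count  bigint  NOT NULL,
    some_stat  integer,
    PRIMARY KEY (stat_date, table_name)
);

-- Insert today's count, or overwrite it if today's row already exists.
INSERT INTO stat_table (stat_date, table_name, row_count)
SELECT CURRENT_DATE, 'mytab', count(*)
FROM mytab
WHERE updated > CURRENT_DATE - INTERVAL '1 month'
ON CONFLICT (stat_date, table_name)
DO UPDATE SET row_count = EXCLUDED.row_count;
```

Running the statement several times on the same day leaves a single row for that day, which makes it safe to call from a cron job.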

Query to get the duration and details from a table

I have a scenario and not quite sure how to query it. As a sample, I have following table structure and want to get the history of the action for bus:
ID-----TIME---------BUSID----OPID----MOVING----STOPPED----PARKED----COUNT
1------10:10:10-----101------1101-----1---------0----------0---------15
2------10:10:11-----102------1102-----0---------1----------0---------5
3------10:11:10-----101------1101-----1---------0----------0---------15
4------10:12:10-----101------1101-----0---------1----------0---------15
5------10:13:10-----101------1101-----1---------0----------0---------19
6------10:14:10-----101------1101-----1---------0----------0---------19
7------10:15:10-----101------1101-----0---------1----------0---------19
8------10:16:10-----101------1101-----0---------0----------1---------0
9------10:17:10-----101------1101-----0---------0----------1---------0
I want to write a query to get the status of a bus like:
BUSID----OPID----STATUS-----TIME---------DURATION---COUNT
101------1101----MOVING-----10:10:10-----2-----------15
101------1101----STOPPED----10:12:10-----1-----------15
101------1101----MOVING-----10:13:10-----2-----------19
101------1101----STOPPED----10:15:10-----1-----------19
101------1101----PARKED-----10:16:10-----2-----------0
I am using SQL Server 2008.
Thanks for your help.
You can use Common Table Expressions to calculate the duration between the different rows.
WITH cte_log AS
(
SELECT
-- Number the rows from newest to oldest so that row_id + 1 is the previous record
ROW_NUMBER() OVER (ORDER BY [time] DESC) AS row_id,
[time], busid, opid, moving, stopped, parked, [count]
FROM
log_table
WHERE
busid = 101
)
SELECT
current_rows.busid,
current_rows.opid,
current_rows.[time],
-- Seconds elapsed between the previous record and the current one
DATEDIFF(second, previous_rows.[time], current_rows.[time]) AS duration,
current_rows.[count]
FROM
cte_log AS current_rows
LEFT OUTER JOIN
cte_log AS previous_rows ON ((current_rows.row_id + 1) = previous_rows.row_id)
ORDER BY
current_rows.[time] DESC;
The WITH statement creates a temporary result set that is defined within the execution scope of this query. We are using it to fetch the previous record of each row and to calculate the time difference between the current and the previous record.
This example was not tested, and it may not work perfectly, but I hope it gets you going in the correct direction. Feel free to leave feedback.
You may also want to check the following external links on how to use Common Table Expressions:
SQL Select Next Row and SQL Select Previous Row with Current Row using T-SQL CTE
Calculate Difference between current and previous rows... CTE and Row_Number() rocks!
4 Guys From Rolla: Common Table Expressions (CTE) in SQL Server 2005
MSDN: Using Common Table Expressions
Personally, I would denormalize the data so that you have start_time and end_time in the same row. This will make the query much more efficient.
I don't have access to SQL Server at the moment, so there may be syntax errors in the following:
SELECT
BUSID,
OPID,
CASE WHEN MOVING = 1 THEN 'MOVING' WHEN STOPPED = 1 THEN 'STOPPED' ELSE 'PARKED' END AS STATUS,
[TIME],
[COUNT]
FROM BUS_DATA_TABLE
ORDER BY BUSID, [TIME]
You'll note that this does not include duration. Until you order your data, you don't know which is the previous entry. Once the data is ordered you can calculate the duration as the difference between the times in consecutive records. You could do this by SELECTing into a new table and then running a second query.
Ordering by BUSID first should give you your report for all buses.
Making certain assumptions about column type, etc:
SELECT
BUSID,
OPID,
STATUS,
TIME,
DURATION,
COUNT
FROM
TABLENAME
WHERE
BUSID = 101
ORDER BY
TIME
;
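Another way to get the report from the question in one query, without an intermediate table, is the "gaps and islands" pattern: the difference between two ROW_NUMBER() sequences is constant within each run of consecutive rows that share a status. This is a sketch under a few assumptions: the table is called BUS_DATA_TABLE as in the answer above, exactly one of MOVING, STOPPED and PARKED is 1 in each row, and DURATION means the number of consecutive samples in the run (which is what the expected output appears to show). ROW_NUMBER() and CTEs are both available in SQL Server 2008.

```sql
WITH labelled AS
(
    -- Collapse the three flag columns into a single status label
    SELECT BUSID, OPID, [TIME], [COUNT],
           CASE WHEN MOVING = 1 THEN 'MOVING'
                WHEN STOPPED = 1 THEN 'STOPPED'
                ELSE 'PARKED' END AS STATUS
    FROM BUS_DATA_TABLE
),
numbered AS
(
    -- grp stays constant while a bus keeps the same status on consecutive rows
    SELECT labelled.*,
           ROW_NUMBER() OVER (PARTITION BY BUSID ORDER BY [TIME])
         - ROW_NUMBER() OVER (PARTITION BY BUSID, STATUS ORDER BY [TIME]) AS grp
    FROM labelled
)
SELECT BUSID,
       OPID,
       STATUS,
       MIN([TIME])  AS [TIME],      -- start of the run
       COUNT(*)     AS DURATION,    -- number of consecutive samples in the run
       MAX([COUNT]) AS [COUNT]
FROM numbered
GROUP BY BUSID, OPID, STATUS, grp
ORDER BY BUSID, MIN([TIME]);
```

On the sample rows for bus 101 this should produce the five runs listed in the question (MOVING, STOPPED, MOVING, STOPPED, PARKED). Add WHERE BUSID = 101 to the final SELECT if you only want one bus.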