Parameter injection not working in BigQuery Vertex AI notebook for WEEK(WEEKDAY) - google-bigquery

I'm using a Vertex AI notebook, running BigQuery IPython magics and injecting parameters.
params = {"day_of_week": "WEDNESDAY"}
%%bigquery sales_df --project $project_id --params $params
SELECT created_date, num_sales
FROM sales
WHERE DATE_DIFF(created_date, "2023-01-01", WEEK(@day_of_week)) BETWEEN -12 AND 11
I'd expect this to work, but it doesn't, because WEEK expects a WEEKDAY argument (per the BQ docs). Hardcoding it as WEEK(WEDNESDAY) works, but the parameter injection doesn't, because it adds quotes around the string, effectively compiling it as WEEK("WEDNESDAY").
Is there any workaround here? I can't find a BQ function that turns a string weekday ("WEDNESDAY") into the day-of-week enum (WEDNESDAY).
UPDATE: I have raised this as a BigQuery feature request: https://issuetracker.google.com/issues/266201211.

You might try dynamic SQL in your BigQuery magic.
params = {"day_of_week": "WEDNESDAY"}
%%bigquery sales_df --project $project_id --params $params
EXECUTE IMMEDIATE FORMAT("""
SELECT created_date, num_sales
FROM sales
WHERE DATE_DIFF(created_date, "2023-01-01", WEEK(%s)) BETWEEN -12 and 11
""", #day_of_week);
below is a working example I've tried.
%%bigquery --project your-project-id --params {"day_of_week": "WEDNESDAY"}
EXECUTE IMMEDIATE FORMAT("""
SELECT DATE_DIFF(CURRENT_DATE, "2023-01-01", WEEK(%s)) BETWEEN -12 and 11
""", #day_of_week)

Related

Athena Prepared Statement Parameter Order

I'm attempting to use the Athena parameterized queries feature in conjunction with CTEs.
If I run a simple Athena CTE select statement like this, I get back no results even though I should get a single row back with a single value of 1:
aws athena start-query-execution \
--query-string "WITH cte as (select 1 where 1 = ?) select * from cte where 2 = ?" \
--query-execution-context "Database"="default" \
--result-configuration "OutputLocation"="s3://athena-test-bucket/" \
--execution-parameters "1" "2"
# Output CSV:
# "_col0"
Strangely, if I reverse the parameters, the query works correctly:
aws athena start-query-execution \
--query-string "WITH cte as (select 1 where 1 = ?) select * from cte where 2 = ?" \
--query-execution-context "Database"="default" \
--result-configuration "OutputLocation"="s3://athena-test-bucket/" \
--execution-parameters "2" "1"
# Output CSV:
# "_col0"
# "1"
The AWS docs state that:
In parameterized queries, parameters are positional and are denoted by ?. Parameters are assigned values by their order in the query. Named parameters are not supported.
But this clearly doesn't seem to be the case, as the query only acts as expected when the parameters are in reversed order.
How can I get Athena to correctly insert parameters in the standard, first-to-last, positional way that's typical of other DBs? While the above example is simple, in reality, I have queries with arbitrarily nested CTEs with arbitrary amounts of parameters.
This is a known issue (incorrect parameter ordering for prepared statements using a WITH clause) in old versions of Presto (link 1, link 2) and Trino (link), which has since been fixed. For Presto, the fix landed in version 0.272 (fix PR):
Fix parameter ordering for prepared statement queries using a WITH clause.
Sadly, at the time of writing, Athena engine version 2 is based on Presto 0.217, so you need to work around this by reversing the parameter order.
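For completeness, here is a sketch of the same workaround from Python via boto3 (bucket and database names are illustrative): on engine v2 the CTE's ? ends up bound last, so the parameter list is passed in reverse.
import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="WITH cte AS (SELECT 1 WHERE 1 = ?) SELECT * FROM cte WHERE 2 = ?",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://athena-test-bucket/"},
    # Reversed on purpose: the WITH-clause parameter is bound last on Presto 0.217.
    ExecutionParameters=["2", "1"],
)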

TrinoUserError (type=USER_ERROR, name=SYNTAX_ERROR, message="line 7:26: mismatched input 'COUNT'. Expecting: '*', <expression>")

I am using dbt-trino, and for some reason it doesn't understand a MySQL query that works fine when executed directly on MySQL. In this query, I want to select and group records that were created during the previous month.
The Query:
SELECT order_location, COUNT(*) as order_count
FROM {{ ref('x_stg_order_fields') }}
WHERE
created_at >= DATE_FORMAT( CURRENT_DATE - INTERVAL 1 MONTH, '%Y/%m/01' )
AND
created_at < DATE_FORMAT( CURRENT_DATE, '%Y/%m/01' )
GROUP BY order_location
While this query runs fast and successfully directly on MySQL, it returns this error when executed with dbt run:
TrinoUserError(type=USER_ERROR, name=SYNTAX_ERROR, message="line 7:53: mismatched input 'COUNT'. Expecting: '*', <expression>")
Does this mean that dbt-trino doesn't support all MySQL functions?
That error is coming from your database, not from dbt itself. dbt does not parse your SQL commands; it just passes them through to your connected database.
My guess is that {{ ref('x_stg_order_fields') }} may be referring to an ephemeral model that contains a syntax error, or possibly to a field named count that isn't quoted.
You can confirm or rule that out by looking at the SQL that dbt tried to run in your database, by inspecting the target directory in your project. Specifically, target/run/path/to/your_model.sql will show you the actual command being executed. You should be able to check line 7, col 53 in that file, and you will see the code that Trino is erroring about.
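As a quick sketch of that last step (the compiled-model path below is hypothetical; use whatever file target/run actually contains for your model), you can jump straight to the position Trino complains about:
from pathlib import Path

# Hypothetical path; substitute the real file under target/run/ for your model.
compiled = Path("target/run/my_project/models/staging/x_stg_orders.sql")
lines = compiled.read_text().splitlines()

# Trino's "line 7:53" is 1-indexed: line 7, column 53.
print(lines[6])
print(" " * 52 + "^")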

Apache Drill Timestampdiff on Oracle DB

Hey everyone, I'm relatively new to Apache Drill and I'm having trouble converting my Oracle-specific SQL scripts (PL/SQL) to Drill-based queries.
For example, I have a script that checks for processed data in the last X days.
In this script I'm using the sysdate function.
Here is my old script:
SELECT i.id,i.status,status_text,i.kunnr,i.bukrs,i.belnr,i.gjahr,event,i.sndprn,i.createdate,executedate,tstamp,v.typ_text,i.docnum,i.description, i.*
FROM in_job i JOIN vstatus_injob v ON i.id= v.id
WHERE 1=1
AND i.createdate > sysdate - 30.5
order by i.createdate desc;
When I looked for Drill-specific datetime diff functions, I found TIMESTAMPDIFF.
So here is my "drillified" script:
SELECT i.id, i.status, status_text, i.kunnr, i.bukrs, i.belnr, i.gjahr, i.event, i.sndprn, i.createdate, i.executedate, i.tstamp,v.typ_text,i.docnum,i.description,i.*
FROM SchemaNAME.IN_JOB i JOIN SchemaNAME.VSTATUS_INJOB v ON i.id=v.id
WHERE TIMESTAMPDIFF(DAY, CURRENT_TIMESTAMP, i.createdate) >=30
And the error that is returned reads like this:
DATA_READ ERROR: The JDBC storage plugin failed while trying setup the SQL query.
On further inspection I can see the Oracle-specific error:
Caused by: java.sql.SQLSyntaxErrorException: ORA-00904: "TIMESTAMPDIFF": invalid ID
So now my question:
I thought Apache Drill would replace the function TIMESTAMPDIFF at runtime. But from what I can see in the logs, it's more like Drill hands the TIMESTAMPDIFF call over to the Oracle database.
If that's true, how could I change my script to calculate the time difference (in days) and compare it to an int (i.e. 30 in the script)?
If I use sysdate as above, Apache Drill jumps in and says it doesn't know sysdate.
How would you guys handle that?
Thanks in advance and so long :)
I have found a solution... Just in case someone (or even future me) is having a similar problem:
{
  "queryType": "SQL",
  "query": "select to_char((select current_timestamp - interval 'XX' month from (values(1))), 'dd.MM.yy')"
}
With to_char and CURRENT_TIMESTAMP - INTERVAL function calls I can get everything I needed.
I took the query above, packed it into a Grafana variable named "timeStmpDiff", and then queried everything with a JSON API call to my Drill instance.
Basically:
"query" : "SELECT i.id, i.status, status_text, i.kunnr, i.bukrs, i.belnr, i.gjahr, i.event, i.sndprn, i.createdate, i.executedate, i.tstamp,v.typ_text,i.docnum,i.description,i.* FROM ${Schema}.IN_JOB i JOIN ${Schema}.VSTATUS_INJOB v ON i.id=v.id WHERE i.createdate >= '${timeStmpDiff}' order by i.createdate desc"
You can, of course, query it in one go with a subselect.
But because I use Grafana, it made sense to me to bundle that in a variable.
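If you do want it in one go, here is a sketch of the subselect idea sent straight to Drill's REST API from Python (the endpoint is Drill's default; the schema name is illustrative, and I'm assuming Drill accepts the standard quoted INTERVAL literal here):
import requests

query = """
SELECT i.*, v.typ_text
FROM SchemaNAME.IN_JOB i
JOIN SchemaNAME.VSTATUS_INJOB v ON i.id = v.id
WHERE i.createdate >= CURRENT_TIMESTAMP - INTERVAL '30' DAY
ORDER BY i.createdate DESC
"""

resp = requests.post(
    "http://localhost:8047/query.json",  # default Drill REST endpoint
    json={"queryType": "SQL", "query": query},
)
print(resp.json())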

How to store the output of a query in a variable in HIVE

I want to store current_date - 1 in a variable in Hive. I know there are already previous threads on this topic, but the solutions provided there first recommend defining the variable outside Hive in a shell environment and then using that variable inside Hive.
Storing result of query in hive variable
I first got current_date - 1 using:
select date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),1);
Then I tried two approaches:
1. set date1 = ( select date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),1);
and
2. set hivevar:date1 = ( select date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),1);
Both approaches throw an error:
"ParseException line 1:82 cannot recognize input near 'select' 'date_sub' '(' in expression specification"
When I print the variable from approach (1), the select query itself was saved in the variable in place of yesterday's date. Approach (2) throws "{hivevar:dt_chk} is undefined".
I am new to Hive and would appreciate any help. Thanks.
Hive doesn't support a straightforward way to store a query result in a variable. You have to use the shell, along with hiveconf.
date1=$(hive -e "set hive.cli.print.header=false; select date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),1);")
hive -hiveconf "date1"="$date1" -f hive_script.hql
Then in your script you can reference the newly created variable date1:
select '${hiveconf:date1}'
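If you prefer to drive this from Python instead of a shell script, the same two-step approach translates directly; a minimal sketch assuming the hive CLI is on PATH and hive_script.hql references ${hiveconf:date1}:
import subprocess

# Step 1: run the date query and capture yesterday's date as plain text.
date1 = subprocess.check_output(
    ["hive", "-e",
     "set hive.cli.print.header=false; "
     "select date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),1);"],
    text=True,
).strip()

# Step 2: hand it to the script as a hiveconf variable.
subprocess.run(
    ["hive", "-hiveconf", f"date1={date1}", "-f", "hive_script.hql"],
    check=True,
)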
After lots of research, this is probably the best way to set a variable from the output of a SQL query:
INSERT OVERWRITE LOCAL DIRECTORY '<home path>/config/date1'
select CONCAT('set hivevar:date1=',date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),1)) from <some table> limit 1;
source <home path>/config/date1/000000_0;
You will then be able to use ${date1} in your subsequent SQLs.
Here we had to use <some table> limit 1 because Hive has a bug in INSERT OVERWRITE when no table name is specified.

Parameters in Datalab SQL modules

The parameterization example in the "SQL Parameters" IPython notebook in the datalab github repo (under datalab/tutorials/BigQuery/) shows how to change the value being tested for in a WHERE clause. Is it possible to use a parameter to change the name of a field being SELECT'd on?
eg:
SELECT COUNT(DISTINCT $a) AS n
FROM [...]
After I received the answer below, here is what I have done (with a dummy table name and field name, obviously):
%%sql --module test01
DEFINE QUERY get_counts
SELECT $a AS a, COUNT(*) AS n
FROM [project_id.dataset_id.table_id]
GROUP BY a
ORDER BY n DESC
table = bq.Table('project_id.dataset_id.table_id')
field = table.schema['field_name']
bq.Query(test01.get_counts,a=field).sql
bq.Query(test01.get_counts,a=field).results()
You can use a field from a Schema object (e.g. given a table, get a specific field via table.schema[fieldname]).
Or implement a custom object with a _repr_sql_ method. See: https://github.com/GoogleCloudPlatform/datalab/blob/master/sources/lib/api/gcp/bigquery/_schema.py#L49
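A minimal sketch of that second option (the class name here is made up; the only contract is the _repr_sql_ method, which Datalab calls to splice the value into the query unquoted):
class RawField(object):
    """Wraps a bare field name so it is injected into the SQL without quotes."""

    def __init__(self, name):
        self.name = name

    def _repr_sql_(self):
        # The returned text is emitted verbatim in place of $a.
        return self.name

# Usage with the module from the question:
bq.Query(test01.get_counts, a=RawField('field_name')).results()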