Is "--view_udf_resource" broken? - google-bigquery

I would like to reference a UDF inside a view. According to the BigQuery documentation ('bq help mk') and to this post, How do I create a BigQuery view that uses a user-defined function?, it should be possible to do so with the "--view_udf_resource" flag.
However, when I try it I get the following error:
# gsutil cat gs://mybucket/bar.js
CREATE TEMP FUNCTION GetWord() AS ('fire');
# bq mk --nouse_legacy_sql --view_udf_resource="gs://mybucket/bar.js" --view="SELECT 1 as one, GetWord() as myvalue" mydataset.myfoo
Error in query string: Function not found: GetWord at [1:18]
I have also tried it with the Java API and I get the same error:
public void foo() {
    final String viewQuery = "#standardSQL\n SELECT 1 as one, GetWord() as myvalue";
    UserDefinedFunction userDefinedFunction =
        UserDefinedFunction.inline("CREATE TEMP FUNCTION GetWord() AS ('fire');");
    ViewDefinition tableDefinition = ViewDefinition.newBuilder(viewQuery)
        .setUserDefinedFunctions(userDefinedFunction)
        .build();
    TableId viewTableId = TableId.of(projectName, dataSetName, "foobar");
    final TableInfo tableInfo = TableInfo.newBuilder(viewTableId, tableDefinition).build();
    bigQuery.create(tableInfo);
}
com.google.cloud.bigquery.BigQueryException: Function not found: GetWord at [2:19]
Am I doing something wrong? Or is Google's documentation misleading, and it is not possible to reference a custom UDF from a view?

You cannot (currently) create a view using standard SQL that uses UDFs. You need to have all of the logic be inline as part of the query itself, and the post that you are looking at is about JavaScript UDFs using legacy SQL. There is an open feature request to support permanent registration of UDFs, however, which would enable you to reference UDFs from views.
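A minimal sketch of the inline approach, reusing the bq mk command from the question: since the UDF here only returns a constant, its logic can be folded directly into the view's query, so no --view_udf_resource is needed.
bq mk --nouse_legacy_sql --view="SELECT 1 AS one, 'fire' AS myvalue" mydataset.myfoo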

Related

pandas to_csv function not writing to Blob Storage when called from Spark UDF

I am using a Spark UDF to read some data from a GET endpoint and write it as a CSV file to an Azure Blob location.
My GET endpoint takes 2 query parameters, param1 and param2.
So initially, I have a dataframe paramDF that has two columns, param1 and param2.
param1  param2
12      25
45      95
Schema:
paramDF: pyspark.sql.dataframe.DataFrame
  param1: string
  param2: string
Now I write a UDF that accepts the two parameters, register it, and then invoke it for each row in the dataframe.
The UDF is as below:
def executeRestApi(param1, param2):
    dlist = []
    try:
        print(DataUrl.format(token=TOKEN, oid=param1, wid=param2))
        response = requests.get(DataUrl.format(token=TOKEN, oid=param1, wid=param2))
        if response.status_code == 200:
            metrics = response.json()['data']['metrics']
            dic = {}
            dic['metric1'] = metrics['metric1']
            dic['metric2'] = metrics['metric2']
            dlist.append(dic)
            pandas.DataFrame(dlist).to_csv("../../dbfs/mnt/raw/Important/MetricData/listofmetrics.csv", header=True, index=False, mode='x')
        return "Success"
    except Exception as e:
        return "Failure"
Register the UDF:
udf_executeRestApi = udf(executeRestApi, StringType())
Finally, I call the UDF this way:
paramDf.withColumn("result", udf_executeRestApi(col("param1"), col("param2")))
I don't see any errors while calling the UDF; in fact, the UDF returns the value "Success" correctly.
The only thing is that the files are not written to Azure Blob storage, no matter what I try.
UDFs are primarily meant for custom functionality (and to return a value). However, in my case, I am trying to execute the GET API call and the write operation inside the UDF (and that is my main intention here).
There is no issue with my pandas.DataFrame().to_csv() call, as the same line, when tried separately with a simple list, writes data to the Blob correctly.
What could be going wrong here?
Note: The environment is Spark on Databricks.
There isn't any problem with the indentation, even though it looks untidy here.
Try calling display on the dataframe: withColumn is a lazy transformation, so the UDF is not actually executed until an action (such as display or collect) forces evaluation.
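A minimal sketch of that suggestion in PySpark (display is the Databricks notebook helper; collect would be an equivalent action outside Databricks):
resultDf = paramDf.withColumn("result", udf_executeRestApi(col("param1"), col("param2")))
display(resultDf)  # forces the transformation (and the UDF) to run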

Use js packages in BigQuery UDF

I was trying to create a BigQuery UDF which requires an external npm package.
CREATE TEMPORARY FUNCTION tempfn(message STRING)
RETURNS STRING
LANGUAGE js AS """
var tesfn = require('js-123');
return tesfn(message)
""";
SELECT tempfn("Hello") as test;
It gives me an error:
ReferenceError: require is not defined at tempfn(STRING) line 2, columns 15-16
Is there a way that I can use these packages?
You can't load npm packages using require from JavaScript UDFs. You can, however, load external libraries from GCS, as outlined in the documentation. The example that the documentation gives is:
CREATE TEMP FUNCTION myFunc(a FLOAT64, b STRING)
RETURNS STRING
LANGUAGE js AS
"""
  // Assumes 'doInterestingStuff' is defined in one of the library files.
  return doInterestingStuff(a, b);
"""
OPTIONS (
  library="gs://my-bucket/path/to/lib1.js",
  library=["gs://my-bucket/path/to/lib2.js", "gs://my-bucket/path/to/lib3.js"]
);
SELECT myFunc(3.14, 'foo');
Here the assumption is that you have files with these names in Cloud Storage, and that one of them defines doInterestingStuff.
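As a hypothetical illustration, lib1.js could contain nothing more than the function definition, uploaded to the bucket with gsutil (the function body here is made up):
// lib1.js
function doInterestingStuff(a, b) {
  return b + ':' + a;
}
# upload it so the OPTIONS clause above can reference it
gsutil cp lib1.js gs://my-bucket/path/to/lib1.js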

to_json not working with selectExpr in spark

I am reading a Databricks blog post (link)
and I found a problem with the built-in function to_json.
In the code below from this tutorial, it returns an error:
org.apache.spark.sql.AnalysisException: Undefined function: 'to_json'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
Does this mean that this usage in the tutorial is wrong, and that no UDF can be used in selectExpr? Could I do something like register this to_json function in the default database?
val deviceAlertQuery = notifydevicesDS
  .selectExpr("CAST(dcId AS STRING) AS key", "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "device_alerts")
  .start()
You need to import the to_json function:
import org.apache.spark.sql.functions.to_json
Then use the function directly rather than through selectExpr:
data.withColumn("key", $"dcId".cast("string"))
  .select(to_json(struct(data.columns.head, data.columns.tail:_*)).as("value")).show()
You must also be using Spark 2.x.
I hope this helps to solve your problem.
Based on information I got from the mailing list, this function was not added to SQL until Spark 2.2.0. Here is the commit link: commit.
Hope this helps. Thanks to Hyukjin Kwon and Burak Yavuz.

Use standard SQL queries in java bigquery API

Is it possible to use standard SQL queries when using java bigquery API?
I am trying to execute a query, but it throws:
com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
"message" : "11.3 - 11.56: Unrecognized type FLOAT64"
There are two ways to use standard SQL with the BigQuery Java API. The first is to start your query text with #standardSQL, e.g.:
#standardSQL
SELECT ...
FROM YourTable;
The second is to set useLegacySql to false as part of the QueryJobConfiguration object. For example (taken from the documentation):
public static void runStandardSqlQuery(String queryString)
    throws TimeoutException, InterruptedException {
  QueryJobConfiguration queryConfig =
      QueryJobConfiguration.newBuilder(queryString)
          // To use standard SQL syntax, set useLegacySql to false.
          // See: https://cloud.google.com/bigquery/sql-reference/
          .setUseLegacySql(false)
          .build();
  runQuery(queryConfig);
}
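runQuery here is a helper defined on the same documentation page. A minimal, roughly equivalent way to submit the configuration, assuming a recent google-cloud-bigquery client instance named bigquery, would be:
TableResult result = bigquery.query(queryConfig);  // runs the query job and waits for its result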

How to use insert_job

I want to run a BigQuery SQL query using the insert_job method.
I ran the following code:
JobConfigurationQuery = Google::Apis::BigqueryV2::JobConfigurationQuery
bq = Google::Apis::BigqueryV2::BigqueryService.new
scopes = [Google::Apis::BigqueryV2::AUTH_BIGQUERY]
bq.authorization = Google::Auth.get_application_default(scopes)
bq.authorization.fetch_access_token!
query_config = {query: "select colA from [dataset.table]"}
qr = JobConfigurationQuery.new(configuration:{query: query_config})
bq.insert_job(projectId, qr)
and I got an error as below:
Caught error invalid: Job configuration must contain exactly one job-specific configuration object (e.g., query, load, extract, spreadsheetExtract), but there were 0:
Please let me know how to use the insert_job method.
I'm not sure what client library you're using, but insert_job probably takes a JobConfiguration. You should create one of those and set its query parameter to the JobConfigurationQuery you've created.
This is necessary because you can insert various jobs (load, copy, extract) with different types of configurations through this one API method, and they all take a single configuration object with a subfield that specifies the job type and its details.
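A rough sketch of that wrapping, reusing the classes from the question's google-api-client library (the outer Job object is an assumption based on the REST job resource; adjust to whatever your version of insert_job accepts):
config = Google::Apis::BigqueryV2::JobConfiguration.new(
  query: Google::Apis::BigqueryV2::JobConfigurationQuery.new(
    query: "select colA from [dataset.table]"
  )
)
job = Google::Apis::BigqueryV2::Job.new(configuration: config)
bq.insert_job(projectId, job)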
More info from BigQuery's documentation:
jobs.insert documentation
job resource: note the "configuration" field and its "query" subfield