SnappyData - Error creating Kafka streaming table - sql

I'm seeing an issue when creating a spark streaming table using kafka from the snappy shell.
'The exception 'Invalid input 'C', expected dmlOperation, insert, withIdentifier, select or put (line 1, column 1):'
Reference: http://snappydatainc.github.io/snappydata/streamingWithSQL/#spark-streaming-overview
Here is my sql:
CREATE STREAM TABLE if not exists sensor_data_stream
(sensor_id string, metric string)
using kafka_stream
options (
storagelevel 'MEMORY_AND_DISK_SER_2',
rowConverter 'io.snappydata.app.streaming.KafkaStreamToRowsConverter',
zkQuorum 'localhost:2181',
groupId 'streamConsumer',
topics 'test:01');
The shell seems to not like the script at the first character 'C'. I'm attempting to execute the script using the following command:
snappy> run '/scripts/my_test_sensor_script.sql';
any help appreciated!

There is some inconsistency in documentation and actual syntax.The correct syntax is:
CREATE STREAM TABLE sensor_data_stream if not exists (sensor_id string,
metric string) using kafka_stream
options (storagelevel 'MEMORY_AND_DISK_SER_2',
rowConverter 'io.snappydata.app.streaming.KafkaStreamToRowsConverter',
zkQuorum 'localhost:2181',
groupId 'streamConsumer', topics 'test:01');
One more thing you need to do is to write row converter for your data

Mike, You need to create your own rowConverter class by implementing following trait -
trait StreamToRowsConverter extends Serializable {
def toRows(message: Any): Seq[Row]
}
and then specify that rowConverter fully qualified class name in the DDL.
The rowConverter is specific to a schema.
'io.snappydata.app.streaming.KafkaStreamToRowsConverter' is just an placeholder class name, which should be replaced by your own rowConverter class.

Related

Passing DataFrame from notebook to another with pyspark

i'am trying to call a DataFrame that i created in notebook1 to use it in my notebook2 in Databricks Community addition with pyspark and i tried this code dbutils.notebook.run("notebook1", 60, {"dfnumber2"})
but it shows this error.
py4j.Py4JException: Method _run([class java.lang.String, class java.lang.Integer, class java.util.HashSet, null, class java.lang.String]) does not exist
any help please?
The actual problem is that you pass last parameter ({"dfnumber2"}) incorrectly - with this syntax it's a set, not the map type. You need to use syntax: {"table_name": "dfnumber2"} to represent it as a dict/map.
But if you look into documentation of dbutils.notebook.run, you will see following phrase:
To implement notebook workflows, use the dbutils.notebook.* methods. Unlike %run, the dbutils.notebook.run() method starts a new job to run the notebook.
But jobs aren't supported on the Community Edition, so it won't work anyway.
Create a global temp view and pass the table name as argument to your next notebook.
Drnumber2.createOrReplaceGlobalTempView("dfnumber2")
dbutils.notebook.run("notebook1", 60, {table_name:"dfnumber2"})
In your notebook1 you can do
table_name= dbutils.widgets.get("table_name")
Dfnumber2 = spark.sql("select * from global_temp."+table_name)

Using apache beam JsonTimePartitioning to create time partitioned tables in bigqiery

I have tried using the JsonTimePartitioning class in apache beam JAVA sdk to write data to dynamic tables in bigquery but i get "cannot find symbol" for the class JsonTimePartitioning.
this is how i try to import the class
import com.google.api.services.bigquery.model.JsonTimePartitioning;
and this is how i try to use it in my pipeline
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withJsonTimePartitioningTo(new JsonTimePartitioning().setType("DAY")));
I can't seem to find the JsonTimePartitioning anywhere. Can you point to an example that you are trying to follow? The existing methods on BigQueryIO either accept an instance of TimePartiotioning, or a value-provider for a String that is actually a JSON-serialized instance of the same TimePartitioning. And in fact, when calling the TimePartitioning version of the method, you still end up just serializing it into string internally:. You can find an example of how it's used here:
Loading historical data into time-partitioned BigQuery tables To load
historical data into a time-partitioned BigQuery table, specify
BigQueryIO.Write.withTimePartitioning(com.google.api.services.bigquery.model.TimePartitioning)
with a field used for column-based partitioning. For example:
PCollection<Quote> quotes = ...;
quotes.apply(BigQueryIO.write()
.withSchema(schema)
.withFormatFunction(quote -> new TableRow()
.set("timestamp", quote.getTimestamp())
.set(..other columns..))
.to("my-project:my_dataset.my_table")
.withTimePartitioning(new TimePartitioning().setField("time"))); ```

Write to a dynamic BigQuery table through Apache Beam

I am getting the BigQuery table name at runtime and I pass that name to the BigQueryIO.write operation at the end of my pipeline to write to that table.
The code that I've written for it is:
rows.apply("write to BigQuery", BigQueryIO
.writeTableRows()
.withSchema(schema)
.to("projectID:DatasetID."+tablename)
.withWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED));
With this syntax I always get an error,
Exception in thread "main" java.lang.IllegalArgumentException: Table reference is not in [project_id]:[dataset_id].[table_id] format
How to pass the table name with the correct format when I don't know before hand which table it should put the data in? Any suggestions?
Thank You
Very late to the party on this however.
I suspect the issue is you were passing in a string not a table reference.
If you created a table reference I suspect you'd have no issues with the above code.
com.google.api.services.bigquery.model.TableReference table = new TableReference()
.setProjectId(projectID)
.setDatasetId(DatasetID)
.setTableId(tablename);
rows.apply("write to BigQuery", BigQueryIO
.writeTableRows()
.withSchema(schema)
.to(table)
.withWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED));

BigQuery fails to save view that uses functions

We're using BigQuery with their new dialect of "standard" SQL.
the new SQL supports inline functions written in SQL instead of JS, so we created a function to handle date conversion.
CREATE TEMPORARY FUNCTION
STR_TO_TIMESTAMP(str STRING)
RETURNS TIMESTAMP AS (PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%E*SZ', str));
It must be a temporary function as Google returns Error: Only temporary functions are currently supported; use CREATE TEMPORARY FUNCTION
if you try a permanent function.
If you try to save a view with a query that uses the function inline - you get the following error: Failed to save view. No support for CREATE TEMPORARY FUNCTION statements inside views.
If you try to outsmart it, and remove the function (hoping to add it during query time), you'll receive this error Failed to save view. Function not found: STR_TO_TIMESTAMP at [4:7].
Any suggestions on how to address this? We have more complex functions than the example shown.
Since the issue was marked as resolved, BigQuery now supports permanents registration of UDFs.
In order to use your UDF in a view, you'll need to first create it.
CREATE OR REPLACE FUNCTION `ACCOUNT-NAME11111.test.STR_TO_TIMESTAMP`
(str STRING)
RETURNS TIMESTAMP AS (PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%E*SZ', str));
Note that you must use a backtick for the function's name.
There's no TEMPORARY in the statement, as the function will be globally registered and persisted.
Due to the way BigQuery handles namespaces, you must include both the project name and the dataset name (test) in the function's name.
Once it's created and working successfully, you can use it a view.
create view test.test_view as
select `ACCOUNT-NAME11111.test.STR_TO_TIMESTAMP`('2015-02-10T13:00:00Z') as ts
You can then query you view directly without explicitly specifying the UDF anywhere.
select * from test.test_view
As per the documentation https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_function_statement , the functionality is still in Beta phase but is doable. The functions can be viewed in the same dataset it was created and the view can be created.
Please share if that worked fine for you or if you have any findings which would be helpful for others.
Saving a view created with a temp function is still not supported, but what you can do is plan the SQL-query (already rolled out for the latest UI), and then save it as a table. This worked for me, but I guess it depends on the query parameters you want.
##standardSQL
## JS in SQL to extract multiple h.CDs at the same time.
CREATE TEMPORARY FUNCTION getCustomDimension(cd ARRAY<STRUCT< index INT64,
value STRING>>, index INT64)
RETURNS STRING
LANGUAGE js AS """
for(var i = 0; i < cd.length; i++) {
var item = cd[i];
if(item.index == index) {
return item.value
}
}
return '';
""";
SELECT DISTINCT h.page.pagePath, getcustomDimension(h.customDimensions,20), fullVisitorId,h.page.pagePathLevel1, h.page.pagePathLevel2, h.page.pagePathLevel3, getcustomDimension(h.customDimensions,3)
FROM
`XXX.ga_sessions_*`,
UNNEST(hits) AS h
WHERE
### rolling timeframe
_TABLE_SUFFIX = FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(),INTERVAL YY DAY))
AND h.type='PAGE'
Credit for the solution goes to https://medium.com/#JustinCarmony/strategies-for-easier-google-analytics-bigquery-analysis-custom-dimensions-cad8afe7a153

String interpolation in HiveQL

I'm trying to create an external table that routes to an S3 bucket. I want to group all the tables by date, so the path will be something like s3://<bucketname>/<date>/<table-name>. My current function for creating it looks something like
concat('s3://<bucket-name>/', date_format(current_date,'yyyy-MM-dd'), '/<table-name>/');
I can run this in a SELECTquery fine; however when I try to put it in my table creation statement, I get the following error:
set s3-path = concat('s3://<bucket-name>/', date_format(current_date,'yyyy-MM-dd'), '/<table-name>/');
CREATE EXTERNAL TABLE users
(id STRING,
name STRING)
STORED AS PARQUET
LOCATION ${hiveconf:s3-path};
> FAILED: ParseException line 7:9 mismatched input 'concat' expecting StringLiteral near 'LOCATION' in table location specification
Is there any way to do string interpolation or a function invocation in Hive in this context?
you can try something like this. retrieve udf results in Hive. Essentially, as workaround, you can create a script and call it from the terminal passing the parameter like a hive config.