Stream Analytics doesn't produce output to SQL table when reference data is used

I have been working with ASA lately and I am trying to insert an ASA stream directly into a SQL table using reference data. I based my development on this MS article: https://msdn.microsoft.com/en-us/azure/stream-analytics/reference/reference-data-join-azure-stream-analytics.
Overview of data flow - telemetry:
I have many devices of different types (Heat Pumps, Batteries, Water Pumps, AirCon...). Each of these devices has a different JSON schema for its telemetry data. I can distinguish the JSONs by an attribute in the message (e.g. "DeviceType":"HeatPump" or "DeviceType":"AirCon"...)
All of these devices send their telemetry to a single Event Hub.
Behind the Event Hub there is a single Stream Analytics job where I redirect streams to different outputs based on the DeviceType attribute. For example, I redirect telemetry from heat pumps with the query SELECT * INTO output-sql-table FROM input-event-hub WHERE DeviceType = 'HeatPump'.
I would like to use some reference data to "enrich" the ASA stream with some IDKeys before inserting the stream into the SQL table.
What I've already done:
Successfully inserted the ASA stream directly into the SQL table using the ASA query SELECT * INTO [sql-table] FROM Input WHERE DeviceType = 'HeatPump', where [sql-table] has the same schema as the JSON message plus the standard columns (EventProcessedUtcTime, PartitionId, EventEnqueuedUtcTime)
Successfully inserted the ASA stream directly into the SQL table using the ASA query SELECT Column1, Column2, Column3... INTO [sql-table] FROM Input WHERE DeviceType = 'HeatPump' - basically the same query as above, only this time with named columns in the SELECT statement
Generated a JSON file of reference data and put it in BLOB storage
Created a new static reference data input in ASA (not using the {date} and {time} placeholders) pointing to the file in BLOB storage.
Then I joined the reference data to the data stream in the ASA query using the same statement with named columns
Result: no output rows in the SQL table.
When debugging the problem I used the Test functionality in the ASA query editor:
I sampled data from the Event Hub (stream data).
I uploaded sample data from a file (reference data).
After sampling data from the Event Hub had finished, I tested the query -> the output produced some rows -> so it's not a problem with the query.
Yet when I run the ASA job, no output rows are inserted into the SQL table.
Some other ideas I tried:
Used the TRY_CAST function to cast fields from the reference data to the appropriate data types before joining them with the fields in the stream data
Used the TRY_CAST function to cast fields in the SELECT before inserting them into the SQL table
I really don't know what to do now. Any suggestions?
EDIT: added data stream JSON, reference data JSON, ASA query, ASA input configuration, BLOB storage configuration and ASA test output result
Data Stream JSON - single message
[
{
"Activation": 0,
"AvailablePowerNegative": 6.0,
"AvailablePowerPositive": 1.91,
"DeviceID": 99999,
"DeviceIsAvailable": true,
"DeviceOn": true,
"Entity": "HeatPumpTelemetry",
"HeatPumpMode": 3,
"Power": 1.91,
"PowerCompressor": 1.91,
"PowerElHeater": 0.0,
"Source": "<omitted>",
"StatusToPowerOff": 1,
"StatusToPowerOn": 9,
"Timestamp": "2018-08-29T13:34:26.0Z",
"TimestampDevice": "2018-08-29T13:34:09.0Z"
}
]
Reference data JSON - single message
[
{
"SourceID": 1,
"Source": "<ommited>",
"DeviceID": 10,
"DeviceSourceCode": 99999,
"DeviceName": "NULL",
"DeviceType": "Heat Pump",
"DeviceTypeID": 1
}
]
ASA Query
WITH HeatPumpTelemetry AS
(
SELECT
*
FROM
[input-eh]
WHERE
source='<omitted>'
AND entity = 'HeatPumpTelemetry'
)
SELECT
e.Activation,
e.AvailablePowerNegative,
e.AvailablePowerPositive,
e.DeviceID,
e.DeviceIsAvailable,
e.DeviceOn,
e.Entity,
e.HeatPumpMode,
e.Power,
e.PowerCompressor,
e.PowerElHeater,
e.Source,
e.StatusToPowerOff,
e.StatusToPowerOn,
e.Timestamp,
e.TimestampDevice,
e.EventProcessedUtcTime,
e.PartitionId,
e.EventEnqueuedUtcTime
INTO
[out-SQL-HeatPumpTelemetry]
FROM
HeatPumpTelemetry e
LEFT JOIN [input-json-devices] d ON
TRY_CAST(d.DeviceSourceCode as BIGINT) = TRY_CAST(e.DeviceID AS BIGINT)
ASA Reference Data Input configuration
BLOB storage directory tree
ASA test query output

@matejp, I could not reproduce your issue; you could refer to my steps.
reference data in blob storage:
{
"a":"aaa",
"reference":"www.bing.com"
}
stream data in blob storage:
[
{
"id":"1",
"name":"DeIdentified 1",
"DeviceType":"aaa"
},
{
"id":"2",
"name":"DeIdentified 2",
"DeviceType":"No"
}
]
query statement:
SELECT
inputSteam.*,inputRefer.*
into sqloutput
FROM
inputSteam
Join inputRefer on inputSteam.DeviceType = inputRefer.a
Output:
Hope it helps you. Any concerns, let me know.

I think I found the error. In the past few days I tested nearly every possible combination when configuring inputs in Azure Stream Analytics.
I started with this example as a baseline: https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-build-an-iot-solution-using-stream-analytics
I tried the solution without any changes, to be sure that the example with a reference data input works -> it worked.
Then I changed the ASA output from Cosmos DB to a SQL table without changing anything else -> it worked.
Then I changed my initial ASA job to be as close as possible to the ASA job in the example (writing into a SQL table) -> it worked.
Then I started playing with the BLOB directory names -> and here I found the error.
I think the problem I encountered is due to using the character "-" in the folder name.
In my case I had created a folder named "reference-data" and uploaded a file named "devices.json" (folder structure "/reference-data/devices.json") -> the ASA output to the SQL table didn't work.
As soon as I changed the folder name to "refdata" (folder structure "/refdata/devices.json") -> the ASA output to the SQL table worked.
I tried switching the reference data input three times between a folder name containing "-" and one without it => every time, the ASA output to SQL Server stopped working when "-" was in the folder name.
To recap:
I recommend not using "-" in BLOB folder names for static reference data inputs in ASA jobs.

Related

Data Factory Copy Activity: Error found when processing 'Csv/Tsv Format Text' source 'xxx.csv' with row number 6696: found more columns than expected

I am trying to perform a simple copy activity in Azure Data Factory from CSV to a SQL table, but I'm getting the following error:
{
"errorCode": "2200",
"message": "ErrorCode=DelimitedTextMoreColumnsThanDefined,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error found when processing 'Csv/Tsv Format Text' source 'organizations.csv' with row number 6696: found more columns than expected column count 41.,Source=Microsoft.DataTransfer.Common,'",
"failureType": "UserError",
"target": "Copy data1",
"details": []
}
The copy activity is as follows:
Source:
My Sink is as follows:
A preview of the data in the source is as follows:
This seems like a very straightforward copy activity. Any thoughts on what might be causing the error?
My row 6696 looks like the following:
3b1a2e5f-d08b-166b-4b91-eb53009b2377 Compassites Software Solutions organization compassites-software https://www.crunchbase.com/organization/compassites-software 318375 17/07/2008 10:46 05/12/2022 12:17 company compassitesinc.com http://www.compassitesinc.com IND Karnataka Bangalore "Pradeep Court", #163/B, 6th Main 3rd Cross, JP Nagar 3rd phase 560078 operating Custom software solution experts Big Data,Cloud Computing,Information Technology,Mobile,Software Data and Analytics,Information Technology,Internet Services,Mobile,Software 01/11/2005 51-100 info#compassitesinc.com 080-42032572 http://www.facebook.com/compassites http://www.linkedin.com/company/compassites-software-solutions http://twitter.com/compassites https://res.cloudinary.com/crunchbase-production/image/upload/v1397190270/c3e5acbde40f36eaf4f8c6f6eda3f803.png company
No commas
As the error message indicates, there is a record at row number 6696 where a value contains , as a character within it.
Look at the following demonstration, where I have taken a similar case. I have 3 columns in my source. The data looks as shown below:
When I use similar dataset settings and read these values, the same error is thrown.
So the value T1,OG is being treated as if it belongs to 2 different columns, because it contains the dataset delimiter within the value.
Such values throw an error because they are ambiguous to read. One way to avoid this is to enclose such values in the quote character (a double quote in this case).
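For instance, assuming the dataset uses , as the column delimiter and " as the quote character (the column names and the other values below are purely illustrative), the source row would need to look something like this:
id,name,code
1,"T1,OG",A
With the quotes in place, the embedded comma is read as part of the value rather than as a column delimiter.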
Now when I run the copy activity, it gives the desired output, and the table data looks as expected.

Left join not working properly in Azure Stream Analytics

I'm trying to create a simple left join between two inputs (event hubs). The source of the inputs is a function app that processes a RabbitMQ queue and sends the messages to an event hub.
In my eventhub1 I have this data:
[{
"user": "user_aa_1"
}, {
"user": "user_aa_2"
}, {
"user": "user_aa_3"
}, {
"user": "user_cc_1"
}]
In my eventhub2 I have this data:
[{
"user": "user_bb_1"
}, {
"user": "user_bb_2"
}, {
"user": "user_bb_3
}, {
"user": "user_cc_1"
}]
I use this SQL to create my left join:
select hub1.[user] h1,hub2.[user] h2
into thirdTestDataset
from hub1
left join hub2
on hub2.[user] = hub1.[user]
and datediff(hour,hub1,hub2) between 0 and 5
and the test result looks OK...
The problem is when I run the job... I get this result in the Power BI dataset...
Any idea why my left join isn't working like a plain SQL query?
I tested your SQL query and it works well for me too. So if you can't get the expected output after executing the ASA job, I suggest you follow the troubleshooting steps in this document.
Based on your output, it seems that HUB2 has become the left table. You could use the diagnostic log in ASA to locate the true output of the job execution.
I tested this end to end using blob storage for inputs 1 and 2 with your sample data, and a Power BI dataset as output, and observed the expected result.
I think there are a few things that can go wrong with your query:
First, your join has a 5-hour window: basically that means it looks at EH1 and EH2 for matches during that large window, so live results will differ from the sample input, for which you have only 1 row. Can you validate that you had no match during this 5-hour window?
Additionally, by default PBI streaming datasets are "hybrid datasets", so they accumulate results, and without a timestamp in your output schema there is no good way to know when a result was emitted. So you may also be seeing previous data here. I'd suggest a few things:
In Power BI, change the option of your dataset: disable "Historic data analysis" to remove caching of data.
Add a timestamp column to make sure you can identify when the data is generated (the first line of your query becomes: select System.timestamp() as time, hub1.[user] h1, hub2.[user] h2).
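Putting both suggestions together, the full query would look roughly like this (a sketch only; the output name and join window are copied from your original query):
select System.timestamp() as time, hub1.[user] h1, hub2.[user] h2
into thirdTestDataset
from hub1
left join hub2
on hub2.[user] = hub1.[user]
and datediff(hour, hub1, hub2) between 0 and 5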
Let me know if it works for you.
Thanks,
JS (Azure Stream Analytics)

Can't figure out how to insert keys and values of nested JSON data into SQL rows with NiFi

I'm working on a personal project and am very new (learning as I go) to JSON, NiFi, SQL, etc., so forgive any confusing language used here or a potentially really obvious solution. I can clarify as needed.
I need to take the JSON output from a website's API call and insert it into a table in my local MariaDB server that I've set up. The issue is that the JSON data is nested, and two of the key pieces of data that I need to insert are used as variable key objects rather than values, so I don't know how to extract them and put them in the database table. Essentially, I think I need to identify the different pieces of the JSON expression and insert them as values, but I'm clueless how to do so.
I've played around with the EvaluateJSON, SplitJSON, and FlattenJSON processors in particular, but I can't make it work. All I can ever do is get the result of the whole expression, rather than each piece of it.
{"5381":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":74.0,"tm_def_snp":63.0,"temperature":58.0,"st_snp":8.0,"punts":4.0,"punt_yds":178.0,"punt_lng":55.0,"punt_in_20":1.0,"punt_avg":44.5,"humidity":47.0,"gp":1.0,"gms_active":1.0},
"1023":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":82.0,"tm_def_snp":56.0,"temperature":74.0,"off_snp":82.0,"humidity":66.0,"gs":1.0,"gp":1.0,"gms_active":1.0},
"5300":{"wind_speed":17.0,"tm_st_snp":27.0,"tm_off_snp":80.0,"tm_def_snp":64.0,"temperature":64.0,"st_snp":21.0,"pts_std":9.0,"pts_ppr":9.0,"pts_half_ppr":9.0,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl":4.0,"idp_sack":1.0,"idp_qb_hit":2.0,"humidity":100.0,"gp":1.0,"gms_active":1.0,"def_snp":23.0},
"608":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":53.0,"tm_def_snp":79.0,"temperature":88.0,"st_snp":4.0,"pts_std":5.5,"pts_ppr":5.5,"pts_half_ppr":5.5,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl_ast":1.0,"idp_tkl":5.0,"humidity":78.0,"gs":1.0,"gp":1.0,"gms_active":1.0,"def_snp":56.0},
"3396":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":60.0,"tm_def_snp":70.0,"temperature":63.0,"st_snp":19.0,"off_snp":13.0,"humidity":100.0,"gp":1.0,"gms_active":1.0}}
This is a snapshot of an output with a couple thousand lines. Each of the numeric keys that you see above (5381, 1023, 5300, etc.) is a player ID for the stats that follow it. I have a table set up with three columns: Player ID, Stat ID, and Stat Value. For example, I need that first snippet to be inserted into my table like this:
Player ID Stat ID Stat Value
5381 wind_speed 4.0
5381 tm_st_snp 26.0
5381 tm_off_snp 74.0
And so on, for each piece of data. But I don't know how to have NiFi select the right pieces of data to insert in the right columns.
I believe it's possible to use Jolt to transform your JSON into this format:
[
{"playerId":"5381", "statId":"wind_speed", "statValue": 0.123},
{"playerId":"5381", "statId":"tm_st_snp", "statValue": 0.456},
...
]
and then use PutDatabaseRecord with a JSON record reader.
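For that route, the target table just mirrors the three columns from your question. A minimal MariaDB sketch (the table and column names here are placeholders; match them to whatever your record schema and PutDatabaseRecord settings use):
CREATE TABLE player_stats (
    player_id  VARCHAR(16) NOT NULL,   -- numeric key from the JSON, e.g. '5381'
    stat_id    VARCHAR(64) NOT NULL,   -- stat name, e.g. 'wind_speed'
    stat_value DOUBLE                  -- stat value, e.g. 4.0
);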
Another approach is to use the ExecuteGroovyScript processor.
Add a new parameter to it named SQL.mydb and link it to your DBCP controller service.
Then use the following script as the Script Body parameter:
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder
def ff=session.get()
if(!ff)return
//read flow file content and parse it
def body = ff.read().withReader("UTF-8"){reader->
new JsonSlurper().parse(reader)
}
def results = []
//use defined sql connection to create a batch
SQL.mydb.withTransaction{
def cmd = 'insert into mytable(playerId, statId, statValue) values(?,?,?)'
results = SQL.mydb.withBatch(100, cmd){statement->
//run through all keys/subkeys in flow file body
body.each{pid,keys->
keys.each{k,v->
statement.addBatch(pid,k,v)
}
}
}
}
//write results as a new flow file content
ff.write("UTF-8"){writer->
new JsonBuilder(results).writeTo(writer)
}
//transfer to success
REL_SUCCESS << ff

Not all data read in from SQL database to R

I am creating a Shiny app and have been using data from Microsoft SQL Server Management Studio by creating my table with the query below, saving it as a CSV, and reading it in with
alldata<-read.csv(file1$datapath, fileEncoding = "UTF-8-BOM")
with the above code in my server function and the below code in my ui function
fileInput("file1", "Choose CSV File", accept=".csv")
Using this code, I have been able to manipulate all the data (creating tables and plots) from the CSV successfully. I wanted to try obtaining the data directly from the SQL server when my app loads, instead of going into SQL, executing the query, saving the data, and then loading it into my app. I tried the code below, and it sort of works. For example, the variable CODE has 30 levels, all of which are represented and able to be manipulated when I read the data in from the CSV, but only 23 are represented and can be manipulated when I run the code below. Is there a specific reason this might be happening? I tried running the SQL code along with the code that builds my data tables in base R instead of Shiny, to see if I could spot something specific not being read in correctly, but it all works perfectly when I read it in line by line.
library(RODBCext)
dbhandle<-odbcDriverConnect('driver={SQL Server}; server=myserver.com;
database=mydb; trusted_connection=true')
query<-"SELECT CAST(r.DATE_COMPLETED AS DATE)[DATE]
, res.CODE
, r.TYPE
, r.LOCATION
, res.OPERATION
, res.UNIT
FROM
mydb.RECORD r
LEFT OUTER JOIN mydb.RESULT res
ON r.AMSN = res.AMSN
and r.unit = res.unit
where r.STATUS = 'C'
and res.CODE like '%ABC-%'"
auditdata<-sqlExecute(channel=dbhandle, query=query, fetch=TRUE, stringsAsFactors=FALSE)
odbcClose(dbhandle)
*I only want the complete data set loaded once per Shiny session, so I currently have this outside of the server function in my server.R file.
Try adding believeNRows=FALSE to your sqlExecute call. RODBC doesn't return the full query results without this parameter.
auditdata<-sqlExecute(channel=dbhandle, query=query, fetch=TRUE, stringsAsFactors=FALSE, believeNRows=FALSE)
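To confirm that the full result set now comes through, one sanity check (just a sketch reusing the tables and filters from your query) is to count the rows and distinct CODE values on the SQL side and compare them with nrow(auditdata) and length(unique(auditdata$CODE)) in R:
SELECT COUNT(*) AS row_count
     , COUNT(DISTINCT res.CODE) AS code_levels
FROM
    mydb.RECORD r
LEFT OUTER JOIN mydb.RESULT res
    ON r.AMSN = res.AMSN
    and r.unit = res.unit
where r.STATUS = 'C'
    and res.CODE like '%ABC-%'
If both numbers match what you see in R, the ODBC fetch is no longer truncating the results.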

How to create a view against a table that has record fields?

We have a weekly backup process which exports our production Google Appengine Datastore onto Google Cloud Storage, and then into Google BigQuery. Each week, we create a new dataset named like YYYY_MM_DD that contains a copy of the production tables on that day. Over time, we have collected many datasets, like 2014_05_10, 2014_05_17, etc. I want to create a data set Latest_Production_Data that contains a view for each of the tables in the most recent YYYY_MM_DD dataset. This will make it easier for downstream reports to write their query once and always retrieve the most recent data.
To do this, I have code that gets the most recent dataset and the names of all the tables that dataset contains from the BigQuery API. Then, for each of these tables, I fire a tables.insert call to create a view that is a SELECT * from the table I am looking to create a reference to.
This fails for tables that contain a RECORD field, from what looks to be a pretty benign column-naming rule.
For example, I have this table:
For which I issue this API call:
{
'tableReference': {
'projectId': 'redacted',
'tableId': u'AccountDeletionRequest',
'datasetId': 'Latest_Production_Data'
},
'view': {
'query': u'SELECT * FROM [2014_05_17.AccountDeletionRequest]'
},
}
This results in the following error:
HttpError: https://www.googleapis.com/bigquery/v2/projects//datasets/Latest_Production_Data/tables?alt=json returned "Invalid field name "__key__.namespace". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.">
When I execute this query in the BigQuery web console, the columns are renamed to translate the . to an _. I kind of expected the same thing to happen when I issued the create view API call.
Is there an easy way I can programmatically create a view for each of the tables in my dataset, regardless of their underlying schema? The problem I'm encountering now is for record columns, but another problem I anticipate is for tables that have repeated fields. Is there some magic alternative to SELECT * that will take care of all these intricacies for me?
Another idea I had was doing a table copy, but I would prefer not to duplicate the data if I can at all avoid it.
Here is the workaround code I wrote to dynamically generate a SELECT statement for each of the tables:
def get_leaf_column_selectors(dataset, table):
    schema = table_service.get(
        projectId=BQ_PROJECT_ID,
        datasetId=dataset,
        tableId=table
    ).execute()['schema']
    return ",\n".join([
        _get_leaf_selectors("", top_field)
        for top_field in schema["fields"]
    ])

def _get_leaf_selectors(prefix, field):
    if prefix:
        format = prefix + ".%s"
    else:
        format = "%s"
    if 'fields' not in field:
        # Base case
        actual_name = format % field["name"]
        safe_name = actual_name.replace(".", "_")
        return "%s as %s" % (actual_name, safe_name)
    else:
        # Recursive case
        return ",\n".join([
            _get_leaf_selectors(format % field["name"], sub_field)
            for sub_field in field["fields"]
        ])
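For example, for a table with a RECORD column such as __key__, the generated view query ends up looking roughly like this (apart from __key__.namespace, the field names below are hypothetical), with every leaf field selected explicitly and aliased to a legal flat name:
SELECT
  __key__.namespace as __key___namespace,
  __key__.id as __key___id,
  userId as userId
FROM [2014_05_17.AccountDeletionRequest]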
We had a bug where you needed to select out the individual fields in the view and use an 'as' to rename the fields to something legal (i.e. so they don't have '.' in the name).
The bug is now fixed, so you shouldn't see this issue any more. Please ping this thread or start a new question if you see it again.