How to insert a new record with an updated (delta) value from a staging table in Azure Data Explorer - azure-iot-hub

I have a requirement where data is ingested from Azure IoT Hub. Sample incoming data:
{
    "message": {
        "deviceId": "abc-123",
        "timestamp": "2022-05-08T00:00:00+00:00",
        "kWh": 234.2
    }
}
I have the same column mapping in the Azure Data Explorer table. kWh always comes as a cumulative value, not as the delta between two timestamps. Now I need another table that holds the difference between the last inserted kWh value and the current kWh.
It would be a great help if anyone has a suggestion or solution here.
I'm able to calculate the difference on the fly using prev(), but I need to update the table while inserting the data into it.
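For example, a query along these lines gives the delta when run ad hoc (Telemetry here is just a placeholder for my table name):
Telemetry
| sort by deviceId asc, timestamp asc
| extend kWhDelta = kWh - prev(kWh, 1)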

As far as I know, there is no way to perform data manipulation on the fly while ingesting Azure IoT data into Azure Data Explorer through JSON mapping. However, I found a couple of approaches you can take to get the calculations you need. Both approaches involve creating a secondary table to store the calculated data.
Approach 1
This is the closest approach I found to on-the-fly data manipulation. For this to work, you need to create a function that calculates the difference of the kWh field for the latest entry. Once the function is created, you can bind it to the secondary (target) table using an update policy so that it triggers for every new entry in your source table.
Refer to the following resource, Ingest JSON records, which explains with an example how to create a function and bind it to the target table.
Note that you would have to create your own custom function that calculates the difference in kWh.
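For illustration only, here is a rough sketch of what such a function and its update-policy binding could look like. This is not the snippet from the linked resource; the names RawKwh (source table), KwhDelta (target table) and KwhDeltaExpand are assumptions:
// Function that computes the per-device kWh delta for the rows it is given.
.create-or-alter function KwhDeltaExpand() {
    RawKwh
    | sort by deviceId asc, timestamp asc
    | extend kWhDelta = kWh - prev(kWh, 1)
    | project deviceId, timestamp, kWh, kWhDelta
}
// Bind the function to the target table so it runs for every ingestion into RawKwh.
.alter table KwhDelta policy update
@'[{"IsEnabled": true, "Source": "RawKwh", "Query": "KwhDeltaExpand()", "IsTransactional": false}]'
// Caveat: inside an update policy the function only sees the newly ingested batch,
// so the first record of a batch needs its previous reading looked up from data
// already stored (for example with arg_max per deviceId).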
Approach 2
If you do not need real-time data manipulation and your business can tolerate a 1-minute delay, you can create a query similar to the one below, which calculates the temperature difference from the source table (jsondata in my scenario) and writes it to the target table (jsondiffdata):
.set-or-append jsondiffdata <| jsondata
| serialize
| extend temperature = temperature - prev(temperature, 1)
| project temperature, humidity, timesent
Refer to the following resource for more information on how to Ingest from query. You can use Microsoft Power Automate to schedule this query to trigger every minute.
Please be cautious if you decide to go with the second approach, as it uses serialization, which can prevent query parallelism in many scenarios. Please review this resource on window functions and identify a query approach that is better optimized for your business needs.

Related

Data disappears when moving from events_intraday_ to events_

I am using BigQuery to analyze Firebase Analytics events. I use events_intraday_ for real-time analysis and events_ for daily analysis. The data is automatically transferred from events_intraday_ to events_ after a certain time, but some data disappears at that point. The table exists, but the amount of data is clearly reduced; about 2 days out of a week's data is lost. Please tell me why this happens.
Thanks.
Data should not be lost when moved from events_intraday_ to events_.
A common problem that is easy to fix is with the setup: since the intraday table collects the data from "today" in real time, you first need to agree with Google BigQuery on what "today" refers to. BigQuery can't guess which timezone you want to query, which is why the event_timestamp column in BigQuery is a UNIX timestamp that is always in UTC. This post explains it clearly: Firebase BigQuery server offset time.
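As an illustration (not taken from the linked post; the dataset name analytics_123456789 and the Asia/Tokyo timezone are placeholders), you can group events by your local day rather than the UTC day like this:
SELECT
  DATE(TIMESTAMP_MICROS(event_timestamp), 'Asia/Tokyo') AS local_day,  -- convert UTC microseconds to a local calendar day
  COUNT(*) AS events
FROM `analytics_123456789.events_*`
GROUP BY local_day
ORDER BY local_day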
Also, I am not sure your last statement is correct: "events_intraday_" and "events_" are not quite the same thing. An "events_intraday_" table contains raw, unsampled event data for the current day, while the "events_" table contains processed and aggregated event data.
This processing happens after the data is collected but before it is exported to BigQuery, which means you would expect some data to change or be dropped. Generally, the affected fields are traffic sources and linked marketing products (AdWords, Campaign Manager, etc.); if these are the areas you are looking at, it is probably a GA4 processing issue.

BigQuery and Tableau

I connected Tableau to BigQuery and was working on the dashboards. The issue here is that BigQuery charges for the data a query scans every time.
My table is 200 GB. When someone opens the dashboard in Tableau, the query runs over the whole table. Applying any filter on the dashboard runs it again over the whole table.
On 200 GB of data, if someone applies 5 filters for different analyses, BigQuery bills roughly 200 GB * 5 = 1 TB. For one day of testing the analysis we were charged for about 30 TB, yet the underlying table is only 200 GB. Is there any way I can stop Tableau from scanning the entire table in BigQuery every time something changes?
The extract in Tableau is indeed one valid strategy, but only when you are using a custom query. If you access the table directly it won't help, as that will download the 200 GB to your machine.
Other options to limit the amount of data are:
Not selecting any columns you don't need. Do this by hiding unused fields in Tableau; it will then not include those fields in the query it sends to BigQuery. Otherwise it's a SELECT * and you pay for the full 200 GB even if you don't use those fields.
Another option that we use a lot is partitioning our tables, for instance a partition per day of data if you have a date field. Using the TABLE_DATE_RANGE and TABLE_QUERY functions you can then smartly limit the number of partitions, and hence rows, that Tableau will query. I usually hide the complexity of these table wildcard functions away in a view and then use the view in Tableau. Another option is to use a parameter in Tableau to control the TABLE_DATE_RANGE.
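A sketch of what such a view could look like in legacy SQL (TABLE_DATE_RANGE only exists in legacy SQL; the dataset mydataset, the daily-sharded prefix events_ and the selected columns are assumptions):
SELECT
  event_date,
  user_id,
  revenue
FROM TABLE_DATE_RANGE(
  [mydataset.events_],                         -- daily sharded tables events_YYYYMMDD
  DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'),   -- start: 30 days ago
  CURRENT_TIMESTAMP())                         -- end: now
Tableau queries built on top of the view then only touch the last 30 daily shards instead of the full history.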
1) Right now I am learning BQ + Tableau too, and I found that using "Extract" is a must for BQ in Tableau. With this option you can also save time building the dashboard. So my current pipeline is: build query > add it to Tableau > make dashboard > upload dashboard to Tableau Online > schedule update for the Extract.
2) You can send a custom quota request to Google and set up limits per project/per user.
3) If each of your queries touches 200 GB every time, consider optimizing these queries (don't use SELECT *, use only the dates you need, etc.).
The best approach I found was to partition the table in BQ on a date (day) field that has no time component. BQ allows you to partition a table by a day-level field. The important thing here is that even though the field is a day/date with no time component, it should be a TIMESTAMP data type in the BQ table, i.e. you will end up with a column in BQ with data looking like this:
2018-01-01 00:00:00.000 UTC
The reason the field needs to be a TIMESTAMP data type (even though there is no time in the data) is that when you create a viz in Tableau, Tableau generates the SQL that runs against BQ, and for the partitioning field to be used by that generated SQL it needs to be a TIMESTAMP data type.
In Tableau, you should always filter on your partitioned field and BQ will only scan the rows within the ranges of the filter.
I tried partitioning on a DATE data type, looked at the logs in GCP, and saw that the entire table was being scanned. Changing to TIMESTAMP fixed this.
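In current BigQuery DDL that could look something like this sketch (dataset, table and column names are assumptions); the day-only date is stored as a TIMESTAMP and used as the partitioning column:
CREATE TABLE `mydataset.sales_partitioned`
PARTITION BY DATE(sale_ts) AS
SELECT
  TIMESTAMP(sale_date) AS sale_ts,   -- day-only dates stored as a TIMESTAMP at 00:00:00 UTC
  store_id,
  amount
FROM `mydataset.sales_raw`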
The thing about Tableau and BigQuery is that Tableau calculates the filter values using your query (live connection). What I have seen in my project's logging is that it creates the filter queries from your own query:
select 'Custom SQL Query'.filtered_column
from ( your_actual_datasource_query ) as 'Custom SQL Query'
group by 'Custom SQL Query'.filtered_column
Instead, try to create the Tableau data source with incremental extracts, and also try to have the underlying table date-partitioned (BigQuery only supports date partitioning) so that you can limit the data scanned.

How to send data to only one Azure SQL DB Table from Azure Streaming Analytics?

Background
I have set up an IoT project using an Azure Event Hub and Azure Stream Analytics (ASA) based on tutorials from here and here. JSON formatted messages are sent from a wifi enabled device to the event hub using webhooks, which are then fed through an ASA query and stored in one of three Azure SQL databases based on the input stream they came from.
The device (Particle Photon) transmits 3 different messages with different payloads, for which there are 3 SQL tables defined for long term storage/analysis. The next step includes real-time alerts, and visualization through Power BI.
Here is a visual representation of the idea:
The ASA Query
SELECT
ParticleId,
TimePublished,
PH,
-- and other fields
INTO TpEnvStateOutputToSQL
FROM TpEnvStateInput
SELECT
ParticleId,
TimePublished,
EventCode,
-- and other fields
INTO TpEventsOutputToSQL
FROM TpEventsInput
SELECT
ParticleId,
TimePublished,
FreshWater,
-- and other fields
INTO TpConsLevelOutputToSQL
FROM TpConsLevelInput
Problem: For every message received, the data is pushed to all three tables in the database, not only to the output specified in the query. The table the data belongs to gets a new row as expected, while the other two tables get populated with NULLs for the columns for which no data existed.
From the ASA documentation it was my understanding that the INTO keyword would direct the output to the specified sink. But that does not seem to be the case, as the output from all three inputs gets pushed to all sinks (all 3 SQL tables).
The test script I wrote for the Particle Photon will send one of each type of message with hardcoded fields, in the order: EnvState, Event, ConsLevels, each 15 seconds apart, repeating.
Here is an example of the output being sent to all tables, showing one column from each table, generated using this query (in Visual Studio):
SELECT
t1.TimePublished as t1_t2_t3_TimePublished,
t1.ParticleId as t1_t2_t3_ParticleID,
t1.PH as t1_PH,
t2.EventCode as t2_EventCode,
t3.FreshWater as t3_FreshWater
FROM dbo.EnvironmentState as t1, dbo.Event as t2, dbo.ConsumableLevel as t3
WHERE t1.TimePublished = t2.TimePublished AND t2.TimePublished = t3.TimePublished
For an input event of type TpEnvStateInput where the key 'PH' would exist (and not keys 'EventCode' or 'FreshWater', which belong to TpEventInput and TpConsLevelInput, respectively), an entry into only the EnvironmentState table is desired.
Question:
Is there a bug somewhere in the ASA query, or a misunderstanding on my part on how ASA should be used/setup?
I was hoping I would not have to define three separate Stream Analytics containers, as they tend to be rather pricey. After running through this tutorial, and leaving 4 ASA containers running for one day, I used up nearly $5 in Azure credits. At a projected $150/mo cost, there's just no way I could justify sticking with Azure.
ASA is intended for Complex Event Processing. You are using ASA in your queries essentially to pass data from the Event Hub to tables. It will be much cheaper if you host a simple "worker web app" to process the incoming events instead.
This blog post covers the best practices:
http://blogs.msdn.com/b/servicebus/archive/2015/01/16/event-processor-host-best-practices-part-1.aspx
ASA is great if you are doing some transformations, filters, light analytics on your input data in real-time. Furthermore, it also works great if you have some Azure Machine Learning models that are exposed as functions (currently in preview).
In your example, all three "select into" statements are reading from the same input source, and don't have any filter clauses, so all rows would be selected.
If you only want to select specific rows for each of the outputs, you have to specify a filter condition. For example, assuming you only want records with a non-null value in column "PH" for the output "TpEnvStateOutputToSQL", the ASA query would look like this:
SELECT
ParticleId,
TimePublished,
PH
-- and other fields
INTO TpEnvStateOutputToSQL
FROM TpEnvStateInput
WHERE PH IS NOT NULL
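Following the same pattern, a sketch of the full job with a filter on each of the three outputs might look like the following (field lists abbreviated; the IS NOT NULL filters assume each message type carries only its own distinguishing field):
SELECT ParticleId, TimePublished, PH
INTO TpEnvStateOutputToSQL
FROM TpEnvStateInput
WHERE PH IS NOT NULL

SELECT ParticleId, TimePublished, EventCode
INTO TpEventsOutputToSQL
FROM TpEventsInput
WHERE EventCode IS NOT NULL

SELECT ParticleId, TimePublished, FreshWater
INTO TpConsLevelOutputToSQL
FROM TpConsLevelInput
WHERE FreshWater IS NOT NULL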

sql insert varchar or data

I'm developing an application which generates big HTML reports. I need to store data in temp tables in the DB for the HTML pages. Which is the best way to do it? Generate a big XML string in a table tmpTable(num, xmlStr) (xmlStr is approx. 400 KB) for each HTML page, insert it into the table, and then select that page when the user requests it? Or save the data in a temp table like tmpTable1(num, val1, val2, val3...), where the values are just short strings, ints and doubles, and generate the XML from this data when the user requests it? Which way will be better for performance?
If you can normalize the data into tabular format, it's better to keep that data in a table and generate the report on user demand. Also, if the report does not change frequently, you can generate it as a batch process and keep it on the server for the required time period.
Additionally, if you want to do any historical data mining, you still have the raw data in your table and can always run your queries and get the desired outputs. I'd personally go with this approach. Please share what you would choose and any further input/feedback.
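For illustration, a minimal sketch of the normalized option described in the question (column types and the page id value are assumptions):
-- One row of short values per report page instead of a single ~400 KB XML blob.
CREATE TABLE tmpTable1 (
    num  INT NOT NULL,   -- report/page id
    val1 VARCHAR(100),   -- short string value
    val2 INT,
    val3 FLOAT
);

-- The HTML/XML for a page is generated from these rows when the user requests it:
SELECT val1, val2, val3
FROM tmpTable1
WHERE num = 42;          -- 42 is a placeholder page id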

STAD data extraction into Java Application

SAP logs all the transactions executed by users and accumulates them on a monthly basis. The information it stores includes user ID, date, time, the transaction executed, the instance, etc.
Table MONI contains this information, but when I check the table it looks like raw data. It is decoded when displayed via transaction code STAD.
I would like to extract the STAD data into a Java application via a BAPI or RFC call.
You can read the keys of the table with the usual RFC_READ_TABLE, and you can convert the raw data into usable data with the function module IMPORT_DOWNLOAD_FROM_MONI.