Azure Stream Analytics - How to sort event hub data based on Timestamp and custom Id fields

I am trying to sort the Event Hub input data using Stream Analytics and write the output to ADLS Gen2 in CSV format, but the data is not sorted.
I use an Azure Function (timer trigger) that fetches the data from SQL Server sequentially (ordered by the Id field) and sends the batched event data to Event Hubs.
In Stream Analytics, I am using the Event Hub stream as input and ADLS Gen2 as output.
I tried both CollectTop() and TopOne() to sort the data.
I am also using "TIMESTAMP BY SeenTime".
Below is the ASA query (eventdata is the input, adls2 the output):
WITH Step1 AS (
    SELECT eventdata.Id AS Id,
           eventdata.ClientMacAddress,
           eventdata.SeenEpoch,
           eventdata.SeenTime,
           System.Timestamp() t
    FROM eventdata TIMESTAMP BY SeenTime
),
Step2 AS (
    SELECT TopOne() OVER (ORDER BY Id ASC) AS topEvent
    FROM Step1
    GROUP BY TumblingWindow(minute, 10)
)
SELECT udf.convertJsonToString(topEvent)
INTO [adls2]
FROM Step2
Appreciate your help in advance.
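For reference, TopOne() returns at most one record per window, so all but one event per 10 minutes is dropped. A CollectTop() sketch along the same lines (my variant of the query above, not from the original post; the cap of 100 is an assumption) keeps up to 100 windowed events ordered by Id instead:
WITH Step1 AS (
    SELECT eventdata.Id AS Id, eventdata.SeenTime
    FROM eventdata TIMESTAMP BY SeenTime
),
Step2 AS (
    -- CollectTop(N) needs an explicit cap; 100 is an assumed batch size.
    -- It returns the windowed events as one array-valued record, ordered by Id.
    SELECT CollectTop(100) OVER (ORDER BY Id ASC) AS sortedEvents
    FROM Step1
    GROUP BY TumblingWindow(minute, 10)
)
SELECT udf.convertJsonToString(sortedEvents)
INTO [adls2]
FROM Step2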

Related

Accurate JSON Extract in Metabase SQL

I have this table:

id | status | outgoing
1  | paid   | {"a945248027_14454878":"processing","a945248027_14454878":"cancelled","a945248027_14454878":"completed"}
I am trying to extract the value after the underscore and the processes, i.e. 14454878, "processing", "cancelled" and "completed".
I tried extracting the keys using this query in Metabase:
SELECT *,
       CAST(substring(key from '_([^_]+)$') AS INTEGER) AS Volume,
       substring(outgoing::varchar from ':"([a-z]*)') AS Status
FROM table
CROSS JOIN LATERAL json_object_keys(outgoing) AS j(key);
However, upon splitting I got this:

Key        | Volume   | Status
a945248027 | 14454878 | processing
a945248027 | 14454878 | processing
a945248027 | 14454878 | processing
Whereas this is what I am trying to achieve:

Key        | Volume   | Status
a945248027 | 14454878 | processing
a945248027 | 14454878 | cancelled
a945248027 | 14454878 | completed
Please help
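One possible direction (my sketch, not from the original thread, assuming a PostgreSQL backend): json_each_text() yields each key together with its own value, and the json type (unlike jsonb) preserves duplicate keys, so the status no longer has to be re-extracted from the whole string:
SELECT substring(j.key from '^([^_]+)') AS Key,
       CAST(substring(j.key from '_([^_]+)$') AS INTEGER) AS Volume,
       j.value AS Status
FROM table
CROSS JOIN LATERAL json_each_text(outgoing) AS j(key, value);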

Call Azure Stream Analytics UDF with multi-dimensional array of last 5 records, grouped by record

I am trying to call an AzureML UDF from a Stream Analytics query, and that UDF expects an array of 5 rows and 2 columns. The input data is streamed from an IoT hub and we have two fields in the incoming messages: temperature and humidity.
This would be the 'passthrough query':
SELECT GetMetadataPropertyValue([room-telemetry], 'IoTHub.ConnectionDeviceId') AS RoomId,
       Temperature, Humidity
INTO [maintenance-alerts]
FROM [room-telemetry]
I have an AzureML UDF (successfully created) that should be called with the last 5 records per RoomId and that will return one value from the ML model. Obviously, there are multiple rooms in my stream, so I need to find a way to get some kind of window of 5 records grouped per RoomId. I don't seem to find a way to call the UDF with the right arrays selected from the input stream. I know I can create a JavaScript UDF that returns an array from the specific fields, but that works record by record, whereas here I need multiple records grouped by RoomId.
Does anyone have any insights?
Best regards
After the good suggestion of @jean-sébastien and an answer to an isolated question about the array parsing, I finally managed to stitch everything together into a solution that builds (though I still have to get it to run at runtime).
So, the solution consists of using CollectTop to aggregate the latest rows of the entity you want to group by, including the specification of a time window.
The next step was to create the JavaScript UDF to take that data structure and parse it into a multi-dimensional array.
This is the query I have right now:
-- Taking relevant fields from the input stream
WITH RelevantTelemetry AS
(
    SELECT engineid, tmp, hum, eventtime
    FROM [engine-telemetry]
    WHERE engineid IS NOT NULL
),
-- Grouping by engineid in time windows
TimeWindows AS
(
    SELECT engineid,
           CollectTop(2) OVER (ORDER BY eventtime DESC) AS TimeWindow
    FROM [RelevantTelemetry]
    WHERE engineid IS NOT NULL
    GROUP BY TumblingWindow(hour, 24), engineid
)
-- Output time windows for verification purposes
SELECT engineid, Udf.Predict(Udf.getTimeWindows(TimeWindow)) AS Prediction
INTO debug
FROM TimeWindows
And this is the JavaScript UDF:
function getTimeWindows(input) {
    // input is the array produced by CollectTop: each element carries a
    // rank and a value record holding the original row's fields.
    var output = [];
    for (var x in input) {
        // Build one [tmp, hum] pair per collected record.
        var array = [];
        array.push(input[x].value.tmp);
        array.push(input[x].value.hum);
        output.push(array);
    }
    // Multi-dimensional array: [[tmp1, hum1], [tmp2, hum2], ...]
    return output;
}
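For illustration (my example, values invented), the structure CollectTop hands to getTimeWindows and the array it returns look roughly like this:
// Input to getTimeWindows, as produced by CollectTop(2) (sample values):
// [ { "rank": 1, "value": { "tmp": 90.1, "hum": 60.2, "engineid": "e1" } },
//   { "rank": 2, "value": { "tmp": 89.7, "hum": 61.0, "engineid": "e1" } } ]
//
// Return value passed on to Udf.Predict: [[90.1, 60.2], [89.7, 61.0]]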

How to get Firebase console event details such as first_open, app_remove and Registration_Success using BigQuery for the last two weeks?

I'm creating a visualization for the app download count, app remove count, and user registration count from Firebase console data for the last two weeks. The console gives us the total count for the selected period, but we need a date-wise count for each. For that, we plan to get the counts using BigQuery. How do we get all the metrics by writing a single query?
We can get all the metrics using a single query, as below:
SELECT event_date, count(*), platform, event_name
FROM `apple-XYZ.analytics_XXXXXX.events_*`
WHERE (event_name = "app_remove" OR event_name = "first_open" OR event_name = "Registration_Success")
  AND (event_date BETWEEN "20200419" AND "20200502")
  AND (stream_id = "XYZ" OR stream_id = "ZYX")
  AND (platform = "ANDROID" OR platform = "IOS")
GROUP BY platform, event_date, event_name
ORDER BY event_date;
Result: for two weeks (from 19-04-2020 to 02-05-2020)
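As a side note (my rewrite, behavior unchanged), the OR chains can be written more compactly with IN lists:
SELECT event_date, count(*), platform, event_name
FROM `apple-XYZ.analytics_XXXXXX.events_*`
WHERE event_name IN ("app_remove", "first_open", "Registration_Success")
  AND event_date BETWEEN "20200419" AND "20200502"
  AND stream_id IN ("XYZ", "ZYX")
  AND platform IN ("ANDROID", "IOS")
GROUP BY platform, event_date, event_name
ORDER BY event_date;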

How to build a report like Top Conversion Paths in BigQuery

I have a problem building a report like "Top Conversion Paths" in Google Analytics. Any ideas how I can create this?
I found something like this, but it doesn't work (https://lastclick.city/top-conversion-paths-in-ga-and-bigquery.html):
SELECT
  REGEXP_REPLACE(touchpointPath, 'Conversion >.*', 'Conversion') AS touchpointPath,
  COUNT(touchpointPath) AS TOP
FROM (
  SELECT
    GROUP_CONCAT(touchpoint, ' > ') AS touchpointPath
  FROM (
    SELECT *
    FROM (
      SELECT
        fullVisitorId,
        'Conversion' AS touchpoint,
        (visitStartTime + hits.time) AS timestamp
      FROM
        TABLE_DATE_RANGE([pro-tracker-id.ga_sessions_], TIMESTAMP('2018-10-01'), TIMESTAMP('2018-10-05'))
      WHERE
        hits.eventInfo.eventAction = "Email Submission success"
    ), (
      SELECT
        fullVisitorId,
        CONCAT(trafficSource.source, '/', trafficSource.medium) AS touchpoint,
        (visitStartTime + hits.time) AS timestamp
      FROM
        TABLE_DATE_RANGE([pro-tracker-id.ga_sessions_], TIMESTAMP('2018-10-01'), TIMESTAMP('2018-10-05'))
      WHERE
        hits.hitNumber = 1
    )
    ORDER BY timestamp
  )
  GROUP BY fullVisitorId
  HAVING touchpointPath LIKE '%Conversion%'
)
GROUP BY touchpointPath
ORDER BY TOP DESC
It doesn't work because you have to modify the query to your needs.
This line needs to be changed to match your specific event action:
hits.eventInfo.eventAction="YOUR EVENT ACTION HERE")
The table reference and the dates need to be changed too:
TABLE_DATE_RANGE([pro-tracker-id.ga_sessions_], TIMESTAMP('2018-10-01'), TIMESTAMP('2018-10-05'))
The shared article links to more information about the FLATTEN function in BigQuery Legacy SQL.
As far as I know, queries in the new BigQuery UI run as Standard SQL by default; however, you can set the SQL variant by including a prefix to your query in the web UI, in a REST API call, or when using the Cloud Client library.
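For example, the dialect prefix looks like this (the query shown above uses legacy SQL, so it needs the #legacySQL prefix):
#legacySQL
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([pro-tracker-id.ga_sessions_], TIMESTAMP('2018-10-01'), TIMESTAMP('2018-10-05'))
LIMIT 10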

Oldest Record For a Distinct ID - SparkSQL [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 4 years ago.
I am relatively new here, so I will try to follow the SO conventions.
I am working with Spark on Databricks, on the following data:
Distinct_Id                 | Event      | Date
*some alphanumerical value* | App Access | 2018-01-09
*some alphanumerical value* | App Opened | 2017-23-01
...                         | ...        | ...
The data means:
Every distinct_id identifies a distinct user. There are 4 main events - App access, app opened, app launched, mediaReady.
The problem:
I am trying to find the first app access date for a particular distinct_id.
App access is defined as: event in ('App access', 'App opened', 'App Launched')
The first app viewed date for a particular distinct_id.
App viewed is defined as: event == 'mediaReady'
My data is present in parquet files and the data volume is huge (2 years data).
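For context, the parquet files can be exposed to Spark SQL as a temporary view before querying (my sketch; the path is a placeholder I made up):
CREATE OR REPLACE TEMPORARY VIEW df_raw_data
USING parquet
OPTIONS (path "/mnt/data/app_events/");  -- placeholder path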
I tried the following to find the first app access date:
temp_result = spark.sql("""
    WITH cte AS (
        SELECT gaid,
               event,
               event_date,
               RANK() OVER (PARTITION BY gaid ORDER BY event_date) AS rnk
        FROM df_raw_data
        WHERE upper(event) IN ('APP LAUNCHED', 'APP OPENED', 'APP ACCESS')
        GROUP BY gaid, event, event_date
    )
    SELECT DISTINCT gaid, event_date, event FROM cte WHERE rnk = 1
""")
I am trying to write a robust query that will scale as the data grows and still return the result.
I hope I've described the problem in a decent way.
Feels more like a pivot query:
SELECT
    gaid,
    MIN(CASE WHEN event IN ('App access', 'App opened', 'App Launched') THEN date END) AS first_app_access_date,
    MIN(CASE WHEN event IN ('mediaReady') THEN date END) AS first_app_viewed_date
FROM df_raw_data
GROUP BY gaid
I've no idea about case sensitivity etc. in Spark SQL, so you might need to fix some of that up.
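If case sensitivity does turn out to matter, the same pivot can normalize case first (my variant, same columns assumed):
SELECT
    gaid,
    -- upper() makes the event comparison case-insensitive
    MIN(CASE WHEN upper(event) IN ('APP ACCESS', 'APP OPENED', 'APP LAUNCHED') THEN date END) AS first_app_access_date,
    MIN(CASE WHEN upper(event) = 'MEDIAREADY' THEN date END) AS first_app_viewed_date
FROM df_raw_data
GROUP BY gaid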