Kusto - generate data diff / delta (azure-log-analytics)

I created a custom data type to store some configuration of an external product, so each day I send the configuration of this specific product / service (multiple rows, but with an identical data model) to the Log Analytics data store.
Is there a way to show which rows get added or removed between days? The data structure is always the same, e.g.
MyCustomData_CL
| project Guid_g, Name_s, URL_s
I would like to see which records get added / removed at which time, so basically compare every day with the previous day.
How could I accomplish this with Kusto query language?
Best regards,
Jens

The next query uses a full-outer join to compare two sets (one from the previous day, one from the current day). If you don't have a datetime column in your table, you can try using the ingestion_time() function instead (it reveals the time the data was ingested into the table).
let MyCustomData_CL =
    datatable (dt:datetime, Guid_g:string, Name_s:string, URL_s:string)
    [
        datetime(2018-11-27), '111', 'name1', 'url1',
        datetime(2018-11-27), '222', 'name2', 'url2',
        //
        datetime(2018-11-28), '222', 'name2', 'url2',
        datetime(2018-11-28), '333', 'name3', 'url3',
    ];
// one set per day: from start of day, for one day minus a tick
let data_prev = MyCustomData_CL | where dt between (datetime(2018-11-27) .. (1d - 1tick));
let data_new  = MyCustomData_CL | where dt between (datetime(2018-11-28) .. (1d - 1tick));
data_prev
| join kind=fullouter (data_new) on Guid_g, Name_s, URL_s
// null dt => only in the new set (added); null dt1 => only in the previous set (removed)
| extend diff = case(isnull(dt), 'Added', isnull(dt1), 'Removed', 'No change')
Result:

| dt         | Guid_g | Name_s | URL_s | dt1        | Guid_g1 | Name_s1 | URL_s1 | diff      |
|------------|--------|--------|-------|------------|---------|---------|--------|-----------|
|            |        |        |       | 2018-11-28 | 333     | name3   | url3   | Added     |
| 2018-11-27 | 111    | name1  | url1  |            |         |         |        | Removed   |
| 2018-11-27 | 222    | name2  | url2  | 2018-11-28 | 222     | name2   | url2   | No change |
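To compare every day with the previous day rather than two hard-coded dates, a minimal sketch against the real MyCustomData_CL table (assuming the dt column above; if there is no datetime column, substitute ingestion_time() as mentioned earlier):

let day = startofday(now());
let data_prev = MyCustomData_CL | where dt between ((day - 1d) .. (1d - 1tick));
let data_new  = MyCustomData_CL | where dt between (day .. (1d - 1tick));
data_prev
| join kind=fullouter (data_new) on Guid_g, Name_s, URL_s
| extend diff = case(isnull(dt), 'Added', isnull(dt1), 'Removed', 'No change')
| where diff != 'No change' // keep only the delta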

Related

SQL Having/Where clause to compare MAX from current/another table

I have a table with date information that is being copied to another table, and I'm trying to perform an incremental load. Column types: date is a date, hour is an int.

| person | date       | hour |
|--------|------------|------|
| bob    | 2023-01-01 | 1    |
| bill   | 2023-01-02 | 2    |
select * into test.person_copy from
(select * from original.person)
My thought process for performing the incremental load is to check the max(date) and max(hour) from the original table against the copied table, to identify the gap between the max values of the two tables. However, I'm not entirely sure how to implement that logic, as it doesn't seem straightforward with a WHERE clause. A HAVING clause might make more sense, but that doesn't seem correct either:
select * into test.person_copy from
(select * from original.person org
Having max(org.date, org.hour) > (select max(copy.date,copy.hour) from test.person_copy copy)
)
The other variation I had in mind was to use HAVING ... NOT IN:
Having max(org.date, org.hour) NOT IN (select max(copy.date,copy.hour) from test.person_copy copy)
I wasn't sure if the logic is correct. The hour field matters, but I could live with just the date field.
The expected output is that the logic checks for the existing max(date) and only inserts rows that don't exist yet, e.g. 2023-01-03 below:
| person | date | hour |
|--------|------------|------|
| bob | 2023-01-01 | 1 |
| bill | 2023-01-02 | 2 |
| test | 2023-01-03 | 2 |
I don't have access to a Redshift environment, but the following query should work:
select *
into test.person_copy
from original.person org
where dateadd(hrs, org.hour, org.date) >
      (select max(dateadd(hrs, cpy.hour, cpy.date))
       from test.person_copy cpy)
This assumes that when the previous copy was made, the entire set of source rows for that date and hour was copied (so the new incremental load picks up all rows for the dates and hours not already copied). It also means you need an additional criterion in the select to include only completed date-hours (i.e., make sure you don't include rows with hour=10 while the time is still 10:30).
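A minimal sketch of that extra criterion, assuming Redshift's dateadd/getdate and using insert into the existing copy (select into would create a new table each run):

-- load only rows newer than what was already copied,
-- and only for date-hours that have fully elapsed
insert into test.person_copy
select *
from original.person org
where dateadd(hrs, org.hour, org.date) >
      (select max(dateadd(hrs, cpy.hour, cpy.date))
       from test.person_copy cpy)
  and dateadd(hrs, org.hour + 1, org.date) <= getdate();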

Athena query get the index of any element in a list

I need to access the elements in a list-typed column based on the positions of other elements in another list-like column. Say my dataset is like:
WITH dataset AS (
SELECT ARRAY ['hello', 'amazon', 'athena'] AS words,
ARRAY ['john', 'tom', 'dave'] AS names
)
SELECT * FROM dataset
And I want to achieve:
SELECT element_at(words, index(names, 'john')) AS john_word
FROM dataset
Is there a function in Athena like "index", or how can I write one myself? The desired result should look like:
| john_word |
|-----------|
| hello     |
array_position:
array_position(x, element) → bigint
Returns the position of the first occurrence of the element in array x (or 0 if not found).
Note that in Presto, array indexes start from 1.
SELECT element_at(words, array_position(names, 'john')) AS john_word
FROM dataset
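One caveat worth noting: if the name is not present, array_position returns 0, and element_at with index 0 raises an error in Presto (array indexes start at 1). A sketch that returns NULL instead wraps the lookup in an IF guard:

SELECT IF(array_position(names, 'john') > 0,
          element_at(words, array_position(names, 'john'))) AS john_word
FROM dataset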

Using different time periods in one Azure log query

So I have an Azure log query (KQL) that takes a date parameter, e.g. check log entries for the last X days. In this query I look up values from two different logs, and I would like the two lookups to use different date ranges for their respective logs. To get an idea of what I'm looking for, see the query below, which is almost what I have now, with a bit of pseudocode where I can't quite figure out how to structure it.
let usernames = LogNumberOne
| where TimeGenerated > {timeperiod:start} and TimeGenerated < {timeperiod:end}
| bla bla bla lots of stuff
let computernames = LogNumberTwo
| where TimeGenerated > {timeperiod:start} - 2d
| where bla bla bla lots of stuff
usernames
| join kind=innerunique (computernames) on session_id
| some logic to display table
So from LogNumberOne I want the values within the specified time period, but from LogNumberTwo I want the values from the specified time period plus the two days before it. Is this possible, or do I need another parameter? I have tried the query above with {timeperiod:start} - 2d, but that doesn't seem to work; it just uses the {timeperiod:start} value without subtracting the 2 days.
See the next variant, which joins first and filters by time afterwards.
let usernames = datatable(col1:string, session_id:string, Timestamp:datetime )
[
'user1', '1', datetime(2020-05-14 16:00:00),
'user2', '2', datetime(2020-05-14 16:05:30),
];
let computernames =
datatable(session_id:string, ComputerName:string, Timestamp:datetime )
[
'1', 'Computer1', datetime(2020-05-14 16:00:30),
'2', 'Computer2', datetime(2020-05-14 16:06:20),
];
usernames
| join kind=inner (
computernames
| project-rename ComputerTime = Timestamp
) on session_id
| where Timestamp between(ComputerTime .. (-2d))
If large join sets are involved, use the time-window join technique described in the following article:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/join-timewindow
let window = 2d;
let usernames = datatable(col1:string, session_id:string, Timestamp:datetime )
[
'user1', '1', datetime(2020-05-13 16:00:00),
'user2', '2', datetime(2020-05-12 16:05:30),
];
let computernames =
datatable(session_id:string, ComputerName:string, Timestamp:datetime )
[
'1', 'Computer1', datetime(2020-05-14 16:00:30),
'2', 'Computer2', datetime(2020-05-14 16:06:20),
];
usernames
| extend _timeKey = range(bin(Timestamp, 1d), bin(Timestamp, 1d)+window, 1d)
| mv-expand _timeKey to typeof(datetime)
| join kind=inner (
computernames
| project-rename ComputerTime = Timestamp
| extend _timeKey = bin(ComputerTime, 1d)
) on session_id, _timeKey
| where Timestamp between(ComputerTime .. (-window))
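To see what the bucketing does, a minimal standalone sketch: for a single row, range plus mv-expand emit one duplicate per daily bucket, so a row from 2020-05-13 also lands in the 2020-05-14 bucket and can meet that day's computernames rows in the equi-join before the exact time filter runs.

print Timestamp = datetime(2020-05-13 16:00:00)
| extend _timeKey = range(bin(Timestamp, 1d), bin(Timestamp, 1d) + 2d, 1d)
| mv-expand _timeKey to typeof(datetime)
// -> three rows with _timeKey 2020-05-13, 2020-05-14, 2020-05-15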

Issue displaying empty value of repeated columns in Google Data Studio

I've got an issue when trying to visualize in Google Data Studio some information from a denormalized table.
Context: I want to gather all the contacts of a company and their related orders in a table in BigQuery. Contacts can have no orders or multiple orders. Following BigQuery best practice, this table is denormalized and all the orders for a client are in an array of structs. It looks like this:
Fields Examples:
+-------+------------+-------------+-----------+
| Row # | Contact_Id | Orders.date | Orders.id |
+-------+------------+-------------+-----------+
|- 1 | 23 | 2019-02-05 | CB1 |
| | | 2020-03-02 | CB293 |
|- 2 | 2321 | - | - |
|- 3 | 77 | 2010-09-03 | AX3 |
+-------+------------+-------------+-----------+
The issue is when I want to use this table as a data source in Data Studio.
For instance, if I build a table with Contact_Id as the dimension, everything is fine and I can see all my contacts. However, if I add any dimension from the Orders struct, contacts with no orders are no longer displayed. For instance, all info for Contact_Id 2321 disappears from the table.
Has anyone found a workaround to visualize these empty arrays (for instance as null values)? The only solution I've found is to build an intermediary table with the orders unnested.
The workaround I've just discovered is to add an extra field in my DS -> BQ connector:
ARRAY_LENGTH(fields.orders) AS numberoforders
This returns zero if the array is empty; you can then create calculated fields within Data Studio, using the numberoforders field to force values to NULL or zero.
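A minimal sketch of that connector custom query, assuming the table and field names used elsewhere in this thread (Orders as the repeated field):

SELECT
  Contact_id,
  Orders,
  ARRAY_LENGTH(Orders) AS numberoforders  -- 0 when the contact has no orders
FROM myproject.mydataset.mytable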
You can fix this behaviour by slightly changing your query in the BigQuery connector.
Instead of doing this:
SELECT
Contact_id,
Orders
FROM myproject.mydataset.mytable
try this:
SELECT
Contact_id,
IF(ARRAY_LENGTH(Orders) > 0, Orders, [STRUCT(CAST(NULL AS DATE) AS date, CAST(NULL AS STRING) AS id)]) AS Orders
FROM myproject.mydataset.mytable
This way you force your repeated field to contain at least an array with NULL values, and hence Data Studio will represent those missing values.
Also, if you want to create new calculated fields using one of the nested fields, you should check whether the value is NULL first, to avoid filling in all the NULL values. For example, if you have a repeated and nested field which can be 1 or 0, and you want a calculated field that swaps the value, you should do:
IF(myfield.key IS NOT NULL, IF(myfield.key = 1, 0, 1), NULL)
Here you can see what happens if you check before swapping and if you don't:

| Original value | No check | Check |
|----------------|----------|-------|
| 1              | 0        | 0     |
| 0              | 1        | 1     |
| NULL           | 1        | NULL  |
| 1              | 0        | 0     |
| NULL           | 1        | NULL  |

Google BigQuery - Parsing string data from a BigQuery table column

I have a table A within a dataset in BigQuery. This table has multiple columns, and one of them, called hits_eventInfo_eventLabel, has values like below:
{ID:AEEMEO,Score:8.990000;ID:SEAMCV,Score:8.990000;ID:HBLION;Property
ID:DNSEAWH,Score:0.391670;ID:CP1853;ID:HI2367;ID:H25600;}
If you write this string out in tabular form, it contains the following data:

| ID      | Score    |
|---------|----------|
| AEEMEO  | 8.990000 |
| SEAMCV  | 8.990000 |
| HBLION  | -        |
| DNSEAWH | 0.391670 |
| CP1853  | -        |
| HI2367  | -        |
| H25600  | -        |
Some IDs have scores, some don't. I have multiple records with similar strings populated in the hits_eventInfo_eventLabel column of the table.
My question is: how can I parse this string successfully WITHIN BIGQUERY so that I get a list of property IDs and their respective recommendation scores (where they exist)? I would like the order in which the IDs appear in the string to be preserved after parsing.
Would really appreciate any info on this. Thanks in advance!
I would use a combination of SPLIT to separate the string into different rows and REGEXP_EXTRACT to separate it into different columns, i.e.:
select
  regexp_extract(x, r'ID:([^,]*)') as id,
  regexp_extract(x, r'Score:([\d\.]*)') as score
from (
  select split(x, ';') as x from (
    select 'ID:AEEMEO,Score:8.990000;ID:SEAMCV,Score:8.990000;ID:HBLION;Property ID:DNSEAWH,Score:0.391670;ID:CP1853;ID:HI2367;ID:H25600;' as x))
It produces the following result:
| Row | id      | score    |
|-----|---------|----------|
| 1   | AEEMEO  | 8.990000 |
| 2   | SEAMCV  | 8.990000 |
| 3   | HBLION  | null     |
| 4   | DNSEAWH | 0.391670 |
| 5   | CP1853  | null     |
| 6   | HI2367  | null     |
| 7   | H25600  | null     |
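On current BigQuery, the same can be done in Standard SQL with UNNEST; a sketch (the original answer predates Standard SQL), using WITH OFFSET to preserve the order of the IDs:

SELECT
  REGEXP_EXTRACT(part, r'ID:([^,]*)') AS id,
  REGEXP_EXTRACT(part, r'Score:([\d.]*)') AS score
FROM UNNEST(SPLIT(
  'ID:AEEMEO,Score:8.990000;ID:SEAMCV,Score:8.990000;ID:HBLION;Property ID:DNSEAWH,Score:0.391670;ID:CP1853;ID:HI2367;ID:H25600;',
  ';')) AS part WITH OFFSET pos
WHERE part != ''  -- drop the empty element after the trailing ';'
ORDER BY pos      -- keep the IDs in their original order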
You can write your own JavaScript functions in BigQuery to get exactly what you want now: http://googledevelopers.blogspot.com/2015/08/breaking-sql-barrier-google-bigquery.html