Merge data from multiple tables based on a key in Kusto - azure-log-analytics

I'm trying to merge multiple tables in Azure Log Analytics. Each table has a unique column and a common column. Merging them with Join() is inefficient because I can only do two tables at a time. Union() seems to be the correct operator, but when I merge my tables I end up with duplicate rows in my common column.
Example:
// CPU usage
let CPU_table=VPN_Metrics_CL | extend timestamp = (todatetime(ts_s)+7h)
| where metric_s == "system/cpmCPUTotal1Min.rrd"
| extend region = substring(host_s,0,4)
| summarize maxCPU = max(val_d) by region
| extend score_CPU = case(maxCPU <= 59, 0,
maxCPU <= 79, 1,
3)
| project score_CPU, region;
// Memory usage
let Memory_table=VPN_Metrics_CL| extend timestamp = todatetime(ts_s)+7h
| where metric_s in ("hw_mem_used_pct") and val_d >= 0 and host_s contains "vpn"
| extend region = substring(host_s,0,4)
| summarize maxMemory = max(val_d) by region
| extend score_mem = case(maxMemory <= 59, 0,
maxMemory <= 79, 1,
3)
| project score_mem, region;
union CPU_table, Memory_table
I plan on having a total of 10+ tables.
Here is the result:
score_mem | score_CPU | region
          | 0         | USA
0         |           | USA
etc. etc.
How can I merge rows based on a key? The key being the region.
Thanks

If the source is the same table, the most efficient way will be to use conditional aggregates:
let isCpuMetric = (metric_s:string) {metric_s == "system/cpmCPUTotal1Min.rrd"};
let isMemoryMetric = (metric_s:string, val_d:double, host_s:string) {metric_s in ("hw_mem_used_pct") and val_d >= 0 and host_s contains "vpn"};
VPN_Metrics_CL
| extend timestamp = (todatetime(ts_s)+7h)
| extend region = substring(host_s,0,4)
| where isCpuMetric(metric_s) or isMemoryMetric(metric_s, val_d, host_s)
| summarize maxCPU = maxif(val_d, isCpuMetric(metric_s)), maxMemory=maxif(val_d, isMemoryMetric(metric_s, val_d, host_s)) by region
| extend score_mem = case(maxMemory <= 59, 0, maxMemory <= 79, 1, 3),
score_CPU = case(maxCPU <= 59, 0, maxCPU <= 79, 1, 3)
If the sources are different, you can still use the join or lookup operator. If you have results R1 .. RN coming from sub-queries:
R1
| lookup R2 on Region
| lookup R3 on Region
...
Docs for lookup operator: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/lookupoperator
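Applied to the sub-queries from the question, a minimal sketch might look like this (both tables already project a region column, which lookup uses as the key):
// Sketch: start from the CPU scores and pull in the matching memory score per region.
CPU_table
| lookup Memory_table on region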

I found it easier to give every category's score column the same name: "score".
Then, with union, I merge all the tables and summarize a total score.
union CPU_table, Memory_table, AAA_table, bw_data, more_tables.....
| summarize score_total = sum(score) by region, bin(timestamp, $__interval)
| project score_total, region, timestamp
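For example, the CPU sub-query from the question would then look roughly like this (a sketch; $__interval is the same Grafana-style macro used above, and timestamp is kept in the summarize so the final bin() has something to work on):
// Sketch: same logic as the question's CPU_table, but exposing a generic "score" column.
let CPU_table = VPN_Metrics_CL
| extend timestamp = todatetime(ts_s) + 7h
| where metric_s == "system/cpmCPUTotal1Min.rrd"
| extend region = substring(host_s, 0, 4)
| summarize maxCPU = max(val_d) by region, timestamp = bin(timestamp, $__interval)
| extend score = case(maxCPU <= 59, 0, maxCPU <= 79, 1, 3)
| project score, region, timestamp;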

Related

KQL :: return only tags with more than 4 records

I have created a Kusto query that returns our entire database estate. The query only takes 10 lines of code:
Resources
| join kind=inner (
resourcecontainers
| where type == 'microsoft.resources/subscriptions'
| project subscriptionId, subscriptionName = name)
on subscriptionId
| where subscriptionName in~ ('Subscription1','Subscription2')
| where type =~ 'microsoft.sql/servers/databases'
| where name != 'master'
| project subscriptionName, resourceGroup, name, type, location,sku.tier, properties.requestedServiceObjectiveName, tags.customerCode
By contract we are supposed to provide only 4 Azure SQL Databases per customer, but sometimes developers take a copy of one and rename it _old or _backup, and suddenly a customer can have 5 or 6 databases.
This increases the overall cost of the cloud, and I would like to have a list of all customers that have more than 4 databases.
In order to do so I can use the tag tags.customerCode, which holds the 3-letter identifier for each customer.
The code should work like this: if a customer is called ABC and there are 4 Azure SQL Databases with tags.customerCode ABC, the query should return nothing. If there are 5 or 6 databases with tags.customerCode ABC, the query should return all of them.
Not sure if Kusto can be that flexible.
Here is a possible solution.
It should be noted that Azure resource graph supports only a limited subset of KQL.
resourcecontainers
| where type == 'microsoft.resources/subscriptions'
//and name in~ ('Subscription1','Subscription2')
| project subscriptionId, subscriptionName = name
| join kind=inner
(
resources
| where type =~ 'microsoft.sql/servers/databases'
and name != 'master'
)
on subscriptionId
| project subscriptionId, subscriptionName, resourceGroup, name, type, location
,tier = sku.tier
,requestedServiceObjectiveName = properties.requestedServiceObjectiveName
,customerCode = tostring(tags.customerCode)
| summarize dbs = count(), details = make_list(pack_all()) by customerCode
| where dbs > 4
| mv-expand with_itemindex=db_seq ['details']
| project customerCode
,dbs
,db_seq = db_seq + 1
,subscriptionId = details.subscriptionId
,subscriptionName = details.subscriptionName
,resourceGroup = details.resourceGroup
,name = details.name
,type = details.type
,location = details.location
,tier = details.tier
,requestedServiceObjectiveName = details.requestedServiceObjectiveName

Self-join Kusto Query in Analytics Rule

I am working within Microsoft Sentinel Analytics Rules with the Kusto Query Language. (KQL)
I need to work in a table called CrowdstrikeReplicatorLogs_CL, which contains a) data rows that I need to alert on and b) metadata rows that contain information about the subject of the alert.
This means I need to self-join the KQL table with itself to get the final result.
The column to join the table with itself on is the aid_g column.
ThreatIntelligenceIndicator
| where foo == bar
| join kind=innerunique (
CrowdstrikeReplicatorLogs_CL
| where TimeGenerated >= ago(dt_lookBack)
| where event_simpleName_s has_any ("NetworkConnectIP4", "NetworkConnectIP6")
| extend json=parse_json(custom_fields_message_s)
| extend ip4 = json["RemoteAddressIP4"], ip6=json["RemoteAddressIP6"]
| extend CS_ipEntity = tostring(iff(isnotempty(ip4), ip4, ip6))
| extend CommonSecurityLog_TimeGenerated = TimeGenerated
) on $left.TI_ipEntity == $right.CS_ipEntity
| join kind=innerunique (
CrowdstrikeReplicatorLogs_CL
| where custom_fields_message_s has "ComputerName"
| extend customFields=parse_json(custom_fields_message_s)
| project Hostname=customFields['ComputerName'], Platform=event_platform_s, aid_g
) on $left.aid_g == $right.aid_g
;
However, this raises a "Query contains incompatible 'set' commands" error in Sentinel.
Is there a proper way to self-join tables?
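For illustration only (this sketch does not address the Sentinel-specific 'set' error): one common way to express a self-join in KQL is to factor the shared table into a let expression and join the two projections on aid_g. The sketch below reuses the column names from the question, with ago(1d) standing in as a placeholder for dt_lookBack:
// Sketch: factor the shared table into a let expression, then join it with itself on aid_g.
let CS = CrowdstrikeReplicatorLogs_CL
    | where TimeGenerated >= ago(1d);              // placeholder for dt_lookBack
let connections = CS
    | where event_simpleName_s has_any ("NetworkConnectIP4", "NetworkConnectIP6")
    | extend json = parse_json(custom_fields_message_s)
    | extend CS_ipEntity = tostring(coalesce(json.RemoteAddressIP4, json.RemoteAddressIP6));
let hosts = CS
    | where custom_fields_message_s has "ComputerName"
    | extend customFields = parse_json(custom_fields_message_s)
    | project Hostname = tostring(customFields.ComputerName), Platform = event_platform_s, aid_g;
connections
| join kind=innerunique hosts on aid_g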

Convert disk size from megabytes to gigabytes in KQL query

I have the following query that gets me data about VM disks:
InsightsMetrics
| where Namespace == "LogicalDisk"
| extend Tags = todynamic(Tags)
| extend Drive=tostring(todynamic(Tags)["vm.azm.ms/mountId"])
| extend DiskSize=tostring(todynamic(Tags)["vm.azm.ms/diskSizeMB"])
| summarize
Free_space_percentage = avgif(Val, Name == 'FreeSpacePercentage'),
Free_Gigabytes = avgif(Val, Name == 'FreeSpaceMB') /1024
by Computer, Drive
| join (
InsightsMetrics
| where Namespace == "LogicalDisk"
| extend Tags = todynamic(Tags)
| extend DiskSize=tostring(todynamic(Tags)["vm.azm.ms/diskSizeMB"])
| extend Drive=tostring(todynamic(Tags)["vm.azm.ms/mountId"])
) on Computer, Drive
| where DiskSize has "."
| summarize by Computer,Drive , Free_space_percentage, Free_Gigabytes, DiskSize
The issue now is that DiskSize is displayed in megabytes while all the rest are in gigabytes. I have spent several hours trying to convert it to gigabytes without luck. Could someone help me with where and how I should do the conversion in my query?
It seems your issue is not with converting MB to GB, but with structuring a query that will give you the average values as well as the disk size.
Assuming the disks' sizes are not changed during the query period, take_any() will do the trick.
InsightsMetrics
// | where TimeGenerated between(datetime(2022-04-01) .. datetime(2022-04-01 00:00:10))
| where Namespace == "LogicalDisk"
| extend Tags = todynamic(Tags)
| extend Drive = tostring(Tags["vm.azm.ms/mountId"])
| extend diskSizeGB = Tags["vm.azm.ms/diskSizeMB"]/1024.0
| summarize
avg_FreeSpacePercentage = avgif(Val, Name == 'FreeSpacePercentage')
,avg_FreeSpaceGB = avgif(Val, Name == 'FreeSpaceMB') /1024
,take_any(diskSizeGB)
by Computer, Drive
Computer | Drive | avg_FreeSpacePercentage | avg_FreeSpaceGB | diskSizeGB
DC00.na.contosohotels.com | C: | 74.9538803100586 | 94.8232421875 | 126.50976181030273
DC00.na.contosohotels.com | D: | 91.4168853759766 | 14.6240234375 | 15.998043060302734
SQL12.na.contosohotels.com | C: | 57.7019577026367 | 72.998046875 | 126.50976181030273
SQL12.na.contosohotels.com | D: | 92.02197265625 | 29.4443359375 | 31.998043060302734
SQL12.na.contosohotels.com | F: | 99.9144668579102 | 127.7626953125 | 127.87304306030273
AppBE01.na.contosohotels.com | C: | 73.2973098754883 | 92.7275390625 | 126.50976181030273
AppBE01.na.contosohotels.com | D: | 91.3375244140625 | 14.611328125 | 15.998043060302734
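As a side note, if you would rather keep the shape of your original query, the diskSizeMB tag can be cast to a number explicitly before dividing (a sketch; todouble() handles the string value produced by tostring()):
// Sketch: explicit string-to-number conversion of the disk-size tag, then MB -> GB.
| extend DiskSizeGB = todouble(Tags["vm.azm.ms/diskSizeMB"]) / 1024.0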

Spatial Clustering - associate cluster attribute (id) to the geometry that is part of the cluster

I'm having some issues associating a clustered set of geometries with their own properties.
Data
I have a table with a set of geometries:
buildings {
gid integer,
geom geometry(MultiPolygon,4326)
}
I have run the function ST_ClusterWithin with a certain threshold over the "buildings" table.
From that cluster analysis, I got a table named "clusters":
clusters {
cid Integer,
geom geometry(GeometryCollection,4326)
}
Question
I would like to extract into a table each geometry together with its own cluster information:
clustered_building {
gid Integer
cid Integer
geom geometry(MultiPolygon,4326)
}
 gid | cid | geom              |
-----+-----+-------------------+
   1 |   1 | multipolygon(...) |
   2 |   1 | multipolygon(...) |
   3 |   1 | multipolygon(...) |
   4 |   2 | multipolygon(...) |
   5 |   3 | multipolygon(...) |
   6 |   3 | multipolygon(...) |
What I did (but does not work)
I have been trying to use the two functions ST_GeometryN / ST_NumGeometries to parse each multi-geometry and extract the cluster information, using this query, which is derived from one of the standard examples on the ST_GeometryN manual page.
INSERT INTO clustered_building (cid, c_item, geom)
SELECT sel.cid, n, ST_GeometryN(sel.geom, n) AS singlegeom
FROM (SELECT cid, geom, ST_NumGeometries(geom) AS num
      FROM clusters) AS sel
CROSS JOIN generate_series(1, sel.num) n
WHERE n <= ST_NumGeometries(sel.geom);
The query takes a few seconds if I force it to use a series of 10:
CROSS JOIN generate_series(1,10)
But it gets stuck when I ask it to generate a series according to the number of items in each GeometryCollection.
Also, this query does not allow me to link each single geometry to its own features in the buildings table, because I lose the "gid".
Could someone please help me?
Thanks,
Stefano
I don't have your data, but using some dummy values, where ids 1, 2 and 3 intersect and so do 4 and 5, you can do something like the following:
WITH
temp (id, geom) AS
(VALUES (1, ST_Buffer(ST_Makepoint(0, 0), 2)),
(2, ST_Buffer(ST_MakePoint(1, 1), 2)),
(3, ST_Buffer(ST_MakePoint(2, 2), 2)),
(4, ST_Buffer(ST_MakePoint(9, 9), 2)),
(5, ST_Buffer(ST_MakePoint(10, 10), 2))),
clusters(geom) as
(SELECT
ST_Makevalid(
ST_CollectionExtract(
unnest(ST_ClusterIntersecting(geom)), 3))
FROM temp
)
SELECT array_agg(temp.id), cl.geom
FROM clusters cl, temp
WHERE ST_Intersects(cl.geom, temp.geom)
GROUP BY cl.geom;
If you wrap the final cl.geom in ST_AsText, you will see something like:
{1,2,3} | MULTIPOLYGON(((2.81905966523328 0.180940334766718,2.66293922460509 -0.111140466039203,2.4142135623731 -0.414213562373094,2.11114046603921 -0.662939224605089,1.81905966523328 -0.819059665233282,1.84775906502257 -0.765366864730179,1.96157056080646 -0.390180644032256,2 0,2 3.08780778723872e-16,2 0,2.39018064403226 0.0384294391935396,2.76536686473018 0.152240934977427,2.81905966523328 0.180940334766718))......
{4,5} | MULTIPOLYGON(((10.8190596652333 8.18094033476672,10.6629392246051 7.8888595339608,10.4142135623731 7.58578643762691,10.1111404660392 7.33706077539491,9.76536686473018 7.15224093497743,9.39018064403226 7.03842943919354,9 7,8.60981935596775 7.03842943919354,8.23463313526982 7.15224093497743,7.8888595339608 7.33706077539491,7.58578643762691 7.5857864376269,7.33706077539491 7.88885953396079,7.15224093497743 8.23463313526982
where you can see that ids 1, 2, 3 belong to the first multipolygon, and 4, 5 to the other.
The general idea is that you cluster the data and then intersect the returned clusters with the original data, using array_agg to group the ids together, so that the returned multipolygons now contain the original ids. The use of ST_CollectionExtract with 3 as the second parameter, in conjunction with unnest, which splits the geometry collection returned by ST_ClusterIntersecting back into rows, returns each contiguous cluster as a (Multi)Polygon. The ST_MakeValid is there because sometimes, when you intersect geometries with other related geometries, such as the original polygons with your clustered polygons, you get strange rounding effects and GEOS errors about non-noded intersections and the like.
I answered a similar question on gis.stackexchange recently that you might find useful.
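Applying the same idea to the buildings/clusters schema from the question, a rough sketch (assuming each building intersects only the cluster it belongs to) could populate clustered_building directly:
-- Sketch only: assign every building the cid of the cluster geometry it intersects.
INSERT INTO clustered_building (gid, cid, geom)
SELECT b.gid, c.cid, b.geom
FROM buildings AS b
JOIN clusters AS c
  ON ST_Intersects(c.geom, b.geom);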

Creating custom event schedules. Should I use "LIKE"?

I'm creating a campaign event scheduler that allows for frequencies such as "Every Monday", "May 6th through 10th", "Every day except Sunday", etc.
I've come up with a solution that I believe will work fine (not yet implemented); however, it uses "LIKE" in the queries, which I've never been too fond of. If anyone else has a suggestion that can achieve the same result with a cleaner method, please suggest it!
+----------------------+
| Campaign Table |
+----------------------+
| id:int |
| event_id:foreign_key |
| start_at:datetime |
| end_at:datetime |
+----------------------+
+-----------------------------+
| Event Table |
+-----------------------------+
| id:int |
| valid_days_of_week:string | < * = ALL. 345 = Tue, Wed, Thur. etc.
| valid_weeks_of_month:string | < * = ALL. 25 = 2nd and 5th weeks of a month.
| valid_day_numbers:string | < * = ALL. L = last. 2,7,17,29 = 2nd day, 7th, 17th, 29th,. etc.
+-----------------------------+
A sample event schedule would look like this:
valid_days_of_week = '1357' (Sun, Tue, Thu, Sat)
valid_weeks_of_month = '*' (All weeks)
valid_day_numbers = ',1,2,5,6,8,9,25,30,'
Using today's date (6/25/15) as an example, we have the following information to query with:
Day of week: 5 (Thursday)
Week of month: 4 (4th week in June)
Day number: 25
Therefore, to fetch all of the events for today, the query would look something like this:
SELECT c.*
FROM campaigns AS c
LEFT JOIN events AS e
ON c.event_id = e.id
WHERE
( e.valid_days_of_week = '*' OR e.valid_days_of_week LIKE '%5%' )
AND ( e.valid_weeks_of_month = '*' OR e.valid_weeks_of_month LIKE '%4%' )
AND ( e.valid_day_numbers = '*' OR e.valid_day_numbers LIKE '%,25,%' )
That (untested) query would ideally return the example event above. The "LIKE" queries are what have me worried. I want these queries to be fast.
By the way, I'm using PostgreSQL
Looking forward to excellent replies!
Use arrays:
CREATE TABLE events (id INT NOT NULL, dow INT[], wom INT[], dn INT[]);
CREATE INDEX ix_events_dow ON events USING GIST(dow);
CREATE INDEX ix_events_wom ON events USING GIST(wom);
CREATE INDEX ix_events_dn ON events USING GIST(dn);
INSERT
INTO events
VALUES (1, '{1,3,5,7}', '{0}', '{1,2,5,6,8,9,25,30}'); -- 0 means any
, then query:
SELECT *
FROM events
WHERE dow && '{0, 5}'::INT[]
AND wom && '{0, 4}'::INT[]
AND dn && '{0, 25}'::INT[]
This will allow using the indexes to filter the data.
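One caveat, as an aside: stock PostgreSQL has no built-in GiST operator class for integer arrays (GiST on int[] needs the intarray extension), while GIN supports the && operator out of the box, so a GIN variant of the indexes may be the safer default:
-- Assumption: the intarray extension is not installed; GIN's built-in array_ops supports &&.
CREATE INDEX ix_events_dow ON events USING GIN (dow);
CREATE INDEX ix_events_wom ON events USING GIN (wom);
CREATE INDEX ix_events_dn ON events USING GIN (dn);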