Filter in KQL arrays - kql

I have a database in which one column holds a JSON array. It is an array of arrays of numbers, like [[1], [2,3], [4,5,6,7], [8]]. I want to keep only the rows where at least one of the inner arrays has at least 3 elements.
I used this KQL:
traces
| where message has "Items per slot"
| extend items = parse_json(customDimensions["ItemsPerSlot"])
| where ??? // There is a slot with at least 3 items
But how can I write this where condition?

let traces = datatable(id:int, message:string, customDimensions:dynamic)
[
1 ,"Items per slot" ,dynamic({"ItemsPerSlot": [[1], [2,3], [4,5,6,7], [8]]})
,2 ,"Items per slot" ,dynamic({"ItemsPerSlot": [[1], [2,3], [8]]})
];
traces
| where message has "Items per slot"
| extend items = customDimensions["ItemsPerSlot"]
| mv-apply item = items on (where array_length(item) >= 3 | take 1 | project-away item)
| id | message | customDimensions | items |
| 1 | Items per slot | {"ItemsPerSlot":[[1],[2,3],[4,5,6,7],[8]]} | [[1],[2,3],[4,5,6,7],[8]] |
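For readers less familiar with KQL, the filter that the mv-apply clause performs can be sketched in plain Python. This only illustrates the logic and is not Kusto code; the `rows` list and the `has_big_slot` helper are hypothetical names that mirror the sample datatable above.

```python
# Sample rows mirroring the datatable in the answer (ids and arrays only).
rows = [
    {"id": 1, "items": [[1], [2, 3], [4, 5, 6, 7], [8]]},
    {"id": 2, "items": [[1], [2, 3], [8]]},
]


def has_big_slot(items, min_len=3):
    """Return True if any inner array has at least min_len elements."""
    return any(len(slot) >= min_len for slot in items)


# Keep each row at most once when some slot qualifies (the role of `take 1`).
matching_ids = [row["id"] for row in rows if has_big_slot(row["items"])]
print(matching_ids)
```

Only the row with id 1 survives, which matches the result table of the KQL query.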

Related

Dividing column of Dataframe by constant value

I have a DataFrame in the format below.
| Occupation | wa_rating | Genre |
| engineer | 935 | Musical |
Now I want to divide the rating column of this DataFrame by totalRating,
but when I run
resultDF = joinedDF.select(col("wa_rating")/totalRating)
it gives me the error below.
unsupported literal type class java.util.ArrayList
Your totalRating variable is most likely a list, for example [100], and you can't divide a column by a list. This raises your error:
resultDF = joinedDF.select(col("wa_rating")/[100])
but this does not:
resultDF = joinedDF.select(col("wa_rating")/100)
Check that totalRating is an actual number (a float or integer). If it's a list containing a number, simply extract the number from it.
EDIT:
From your comments, we now know that your totalRating is a list. You can transform it to a number with:
totalRating = joinedDF3.groupBy().sum("Rating").collect()[0][0]
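Why the double index works can be sketched in plain Python: `collect()` returns a list of rows, so the number sits one level inside the list and one level inside the row. The `collected` value below is a hypothetical stand-in for what `collect()` returns, not real Spark output, and `divide_rating` is an illustrative helper.

```python
# Hypothetical stand-in for joinedDF.groupBy().sum(...).collect():
# a list holding one row, and that row holds one field.
collected = [(935.0,)]

# Indexing twice unwraps the list and then the row, leaving a plain number.
total_rating = collected[0][0]


def divide_rating(rating, total):
    """Divide by a scalar; reject list-like totals, which caused the error."""
    if isinstance(total, (list, tuple)):
        raise TypeError("total must be a number, not a list")
    return rating / total


print(divide_rating(935, total_rating))
```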

Divide two timecharts in Splunk

I want to divide two timecharts (ideally to look also like a timechart, but something else that emphasizes the trend is also good).
I have two types of URLs and I can generate timecharts for them like this:
index=my-index sourcetype=access | regex _raw="GET\s/x/\w+" | timechart count
index=my-index sourcetype=access | regex _raw="/x/\w+/.*/\d+.*\s+HTTP" | timechart count
The purpose is to emphasize that the relative number of URLs of the second type is increasing and the relative number of URLs of the first type is decreasing.
This is why I want to divide them (ideally the second one by the first one).
For example, if the first series generates 2, 4, 8, 4 and the second one generates 4, 9, 20, 12 I want to have only one dashboard showing somehow the result 2, 2.25, 2.5, 3.
So far I have only managed to combine the two counts like this, without producing a timechart or dividing them:
index=my-index sourcetype=access
| eval type = if(match(_raw, "GET\s/x/\w+"), "new", if(match(_raw, "/x/\w+/.*/\d+.*\s+HTTP"), "old", "other"))
| table type
| search type != "other"
| stats count as "Calls" by type
I also tried some approaches using eval, but none of them worked.
Try this query:
index=my-index sourcetype=access
| eval type = if(match(_raw, "GET\s/x/\w+"), "new", if(match(_raw, "/x/\w+/.*/\d+.*\s+HTTP"), "old", "other"))
| fields type
| search type != "other"
| timechart count(eval(type="new")) as "New", count(eval(type="old")) as "Old"
| eval Div=if(New=0, 0, Old/New)
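The per-bucket division can be checked against the numbers in the question with a small Python sketch. It assumes the two series are already aligned by time bucket, and `ratio_series` is a hypothetical helper name; it returns 0 wherever the denominator bucket is empty.

```python
def ratio_series(numerators, denominators):
    """Divide two equally long series bucket by bucket; 0 where the denominator is 0."""
    return [n / d if d != 0 else 0 for n, d in zip(numerators, denominators)]


new_counts = [2, 4, 8, 4]    # first series from the question
old_counts = [4, 9, 20, 12]  # second series from the question

print(ratio_series(old_counts, new_counts))  # [2.0, 2.25, 2.5, 3.0]
```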

Changing names of variables using the values of another variable

I am trying to rename around 100 dummy variables with the values from a separate variable.
I have a variable products, which stores information on what products a company sells and have generated a dummy variable for each product using:
tab products, gen(productid)
However, the variables are named productid1, productid2 and so on. I would like these variables to take the values of the variable products instead.
Is there a way to do this in Stata without renaming each variable individually?
Edit:
Here is an example of the data that will be used. There will be duplications in the product column.
And then I have run the tab command to create a dummy variable for each product to produce the following table.
sort product
tab product, gen(productid)
I noticed it updates the labels to show what each variable represents.
What I would like to do is to assign the value to be the name of the variable such as commercial to replace productid1 and so on.
Using your example data:
clear
input companyid str10 product
1 "P2P"
2 "Retail"
3 "Commercial"
4 "CreditCard"
5 "CreditCard"
6 "EMFunds"
end
tabulate product, generate(productid)
list, abbreviate(10)
sort product
levelsof product, local(new) clean
tokenize `new'
ds productid*
local i 0
foreach var of varlist `r(varlist)' {
local ++i
rename `var' ``i''
}
Produces the desired output:
list, abbreviate(10)
+---------------------------------------------------------------------------+
| companyid product Commercial CreditCard EMFunds P2P Retail |
|---------------------------------------------------------------------------|
1. | 3 Commercial 1 0 0 0 0 |
2. | 5 CreditCard 0 1 0 0 0 |
3. | 4 CreditCard 0 1 0 0 0 |
4. | 6 EMFunds 0 0 1 0 0 |
5. | 1 P2P 0 0 0 1 0 |
6. | 2 Retail 0 0 0 0 1 |
+---------------------------------------------------------------------------+
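The loop works because tabulate numbers the dummies in the sorted order of the distinct values, which is the same order levelsof returns. A Python sketch of that pairing, using the example data (the `rename_map` name is illustrative only):

```python
# Product values from the example data, duplicates included.
products = ["P2P", "Retail", "Commercial", "CreditCard", "CreditCard", "EMFunds"]

# Distinct values in sorted order, like `levelsof product`.
levels = sorted(set(products))

# productid1 pairs with the first level, productid2 with the second, and so on.
rename_map = {f"productid{i}": name for i, name in enumerate(levels, start=1)}
print(rename_map)
```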
Arbitrary strings might not be legal Stata variable names. This will happen if they (a) are too long; (b) start with any character other than a letter or an underscore; (c) contain characters other than letters, numeric digits and underscores; or (d) are identical to existing variable names. You might be better off making the strings into variable labels, where only an 80 character limit bites.
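Conditions (a) to (d) can be sketched as a small Python check. This is an approximation for illustration only: it assumes the 32-character name limit, handles ASCII names only, and ignores Stata's reserved words; `is_legal_name` is a hypothetical helper, not a Stata function.

```python
import re

MAX_NAME_LEN = 32  # assumed limit on the length of a Stata variable name


def is_legal_name(candidate, existing=()):
    """Check conditions (a) to (d) for a proposed variable name."""
    if len(candidate) > MAX_NAME_LEN:  # (a) too long
        return False
    # (b) and (c): start with a letter or underscore, then only
    # letters, digits and underscores.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", candidate):
        return False
    return candidate not in existing  # (d) clashes with an existing name


print(is_legal_name("Commercial"))
print(is_legal_name("six something"))
```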
This code loops over the variables and does its best:
gen long obs = _n
foreach v of var productid? productid?? productid??? {
su obs if `v' == 1, meanonly
local tryit = product[r(min)]
capture rename `v' `=strtoname("`tryit'")'
}
Note: code not tested.
EDIT: Here is a test. I added code for variable labels. The data example and code show that repeated values and values that could not be variable names are accommodated.
clear
input str13 products
"one"
"two"
"one"
"three"
"four"
"five"
"six something"
end
tab products, gen(productsid)
gen long obs = _n
foreach v of var productsid*{
su obs if `v' == 1, meanonly
local value = products[r(min)]
local tryit = strtoname("`value'")
capture rename `v' `tryit'
if _rc == 0 capture label var `tryit' "`value'"
else label var `v' "`value'"
}
drop obs
describe
Contains data
obs: 7
vars: 7
size: 133
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
products str13 %13s
five byte %8.0g five
four byte %8.0g four
one byte %8.0g one
six_something byte %8.0g six something
three byte %8.0g three
two byte %8.0g two
-------------------------------------------------------------------------------
Another solution is to use the extended macro function
local varlabel : variable label varname
The tested code is:
clear
input companyid str10 product
1 "P2P"
2 "Retail"
3 "Commercial"
4 "CreditCard"
5 "CreditCard"
6 "EMFunds"
end
tab product, gen(product_id)
* get the list of product id variables
ds product_id*
* loop through the product id variables and change the
* variable name to its label
foreach var of varlist `r(varlist)' {
local varlabel: variable label `var'
display "`varlabel'"
local pos = strpos("`varlabel'","==")+2
local varlabel = substr("`varlabel'",`pos',.)
display "`varlabel'"
rename `var' `varlabel'
}

How to filter after group by and aggregate in Spark dataframe?

I have a spark dataframe df with schema as such:
[id:string, label:string, tags:string]
id | label | tag
---|-------|-----
1 | h | null
1 | w | x
1 | v | null
1 | v | x
2 | h | x
3 | h | x
3 | w | x
3 | v | null
3 | v | null
4 | h | null
4 | w | x
5 | w | x
(h,w,v are labels. x can be any non-empty values)
For each id, there is at most one label "h" or "w", but there might be multiple "v". I would like to select all the ids that satisfy the following conditions:
Each id has:
1. one label "h" and its tag = null,
2. one label "w" and its tag != null,
3. at least one label "v" for each id.
I think I need to create three columns, one checking each of the conditions above, and then group by "id".
val hCheck = (label: String, tag: String) => {if (label=="h" && tag==null) 1 else 0}
val udfHCheck = udf(hCheck)
val wCheck = (label: String, tag: String) => {if (label=="w" && tag!=null) 1 else 0}
val udfWCheck = udf(wCheck)
val vCheck = (label: String) => {if (label==null) 1 else 0}
val udfVCheck = udf(vCheck)
dfx = df.withColumn("hCheck", udfHCheck(col("label"), col("tag")))
.withColumn("wCheck", udfWCheck(col("label"), col("tag")))
.withColumn("vCheck", udfVCheck(col("label")))
.select("id","hCheck","wCheck","vCheck")
.groupBy("id")
Somehow I need to group the three columns {"hCheck","wCheck","vCheck"} into a list of vectors of the form [x,0,0], [0,x,0], [0,0,x], and then check whether these vectors include all three of {[1,0,0],[0,1,0],[0,0,1]}.
I have not been able to solve this problem yet, and there might be a better approach than this one. I hope someone can give me suggestions. Thanks.
To convert the three checks to vectors you can do:
val df1 = df.withColumn("hCheck", udfHCheck(col("label"), col("tag")))
.withColumn("wCheck", udfWCheck(col("label"), col("tag")))
.withColumn("vCheck", udfVCheck(col("label")))
.select($"id",array($"hCheck",$"wCheck",$"vCheck").as("vec"))
Next the groupby returns a grouped object on which you need to perform aggregations. Specifically to get all the vectors you should do something like:
.groupBy("id").agg(collect_list($"vec"))
Also you do not need udfs for the various checks. You can do it with column semantics. For example udfHCheck can be written as:
when($"label" === lit("h") && $"tag".isNull, 1).otherwise(0)
BTW, you said you wanted a label 'v' for each id, but in vCheck you just check if the label is null.
Update: Alternative solution
Upon looking on this question again, I would do something like this:
val grouped = df.groupBy("id", "label").agg(count($"label").as("cnt"), first($"tag").as("tag"))
val filtered1 = grouped.filter($"label" === "v" || $"cnt" === 1)
val filtered2 = filtered1.filter($"label" === "v" || ($"label" === "h" && $"tag".isNull) || ($"label" === "w" && $"tag".isNotNull))
val ids = filtered2.groupBy("id").count.filter($"count" === 3)
The idea is that first we groupby BOTH id and label so we have information on the combination. The information we collect is how many values (cnt) and the first element (doesn't matter which).
Now we do two filtering steps:
1. we need exactly one h and one w and any number of v so the first filter gets us these cases.
2. we make sure all the rules are met for each of the cases.
Now we have only combinations of id and label which match the rules so in order for the id to be legal we need to have exactly three instances of label. This leads to the second groupby which simply counts the number of labels which matched the rules. We need exactly three to be legal (i.e. matched all the rules).
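The group, filter, and count steps can be traced on the sample data with a plain Python sketch. It is not Spark code: `None` stands for null, "x" for any non-empty tag, and `rule_ok` is a hypothetical helper that merges the two filtering steps into one check.

```python
from collections import defaultdict

# (id, label, tag) rows from the question.
rows = [
    (1, "h", None), (1, "w", "x"), (1, "v", None), (1, "v", "x"),
    (2, "h", "x"),
    (3, "h", "x"), (3, "w", "x"), (3, "v", None), (3, "v", None),
    (4, "h", None), (4, "w", "x"),
    (5, "w", "x"),
]

# Group by (id, label), keeping the row count and the first tag seen.
grouped = {}
for id_, label, tag in rows:
    cnt, first_tag = grouped.get((id_, label), (0, tag))
    grouped[(id_, label)] = (cnt + 1, first_tag)


def rule_ok(label, cnt, tag):
    """Any number of v; exactly one h with a null tag; exactly one w with a tag."""
    if label == "v":
        return True
    if cnt != 1:
        return False
    return (label == "h" and tag is None) or (label == "w" and tag is not None)


# Count the labels that pass per id; an id is legal only if all three pass.
passed = defaultdict(int)
for (id_, label), (cnt, tag) in grouped.items():
    if rule_ok(label, cnt, tag):
        passed[id_] += 1

legal_ids = sorted(i for i, n in passed.items() if n == 3)
print(legal_ids)
```

On this data only id 1 has a valid h, a valid w and at least one v.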

Highlighting Values in a Crystal Reports Crosstab based on sibling values

I have a crosstab whose rows indicate different classes, with people's names across the top.
| | Required | Person 1 | Person 2 | Person 3 |
| Class 1 | 8 6 | 1 6 | 3 6 | 4 6 |
| Class 2 | 6 2 | 3 2 | 2 2 | 1 2 |
Each field contains 2 values. The first value is the number of hours spent in the class; the second value is the number of hours required for certification.
The Required field is my grand total summary.
In the cross tab expert the fields are defined as follows.
Rows:
Command.descr -> a field containing the class names
Columns:
Command.fullname -> a field containing students full names
Summarized Fields:
Sum of Command.evlength -> summation of all time spent in a given course
Max of #required -> this formula returns the number of required hours based on the course name
I am trying to highlight the field Sum of Command.evlength if it is greater than or equal to the value of Max of #required.
My solution was to perform background formatting. Right-Click on the Sum of Command.evlength field, select Format Field. Click the borders tab, check Background, and enter a formula.
The formula I was using is:
if CurrentFieldValue >= {#required} then color(152, 251, 152) else crNoColor
This is not the correct formula. My crosstab has been placed in the footer, which causes {#required} to contain the last value in the grid which in the above example is 2.
From my research I thought I would have to use GridRowColumnValue(row or column name) to access the value of {#required} in the crosstab, but I could not come up with the correct string to represent it.
Does anyone have a way for me to correctly perform this comparison?
Frustratingly I don't think you can use the highlighting expert to compare to a dynamic value. You could swap the columns round then add the following formulas:
To the max_of_required background colour:
whileprintingrecords;
global numbervar required_hrs := currentfieldvalue;
crNoColor;
To the sum_of_command.evlength background colour:
whileprintingrecords;
global numbervar required_hrs;
if currentfieldvalue >= required_hrs then
crRed
else
crNoColor;
I think there are a few other ways, but I'm not as confident with those, so start here.