Kusto optimization to avoid time out - kql

I am looking to optimize my kusto query which is getting timeout in 10 mins. I have to usually run it for a week period so there is less limit on the time
AzureDiagnostics
| where TimeGenerated between (startofday(datetime("2022-03-26")) .. endofday(datetime("2022-04-04")))
| where Category == 'kube-audit'
| where log_s hasprefix '"code":2'
| where (strlen(log_s) >= 32000
and not(log_s has "aksService")
and not(log_s has "system:serviceaccount:crossplane-system:crossplane")
and not(log_s has "system:serviceaccount:elastic-system:elastic-operator")
and not(log_s has "system:serviceaccount:kube-system:daemon-set-controller")
and not(log_s has "system:serviceaccount:kube-system:deployment-controller")
and not(log_s has "system:serviceaccount:kube-system:endpoint-controller")
and not(log_s has "system:serviceaccount:kube-system:node-controller")
and not(log_s has "system:serviceaccount:kube-system:replicaset-controller")
and not(log_s has "system:serviceaccount:kube-system:statefulset-controller"))
or strlen(log_s) < 32000
| extend op = parse_json(log_s)
| where not(tostring(op.verb) in ("list", "get", "watch"))
| where not(tostring(op.user.username) hasprefix "system:")
| where not(tostring(op.user.username) in ("hcpService", "aksService", "aksProblemDetector", "readinessChecker", "nodeclient", "masterclient"))
| where not(tostring(op.requestURI) in ("/apis/authorization.k8s.io/v1/selfsubjectaccessreviews"))
| extend user = op.user.username
| extend decision = tostring(parse_json(tostring(op.annotations)).["authorization.k8s.io/decision"])
| extend requestURI = tostring(op.requestURI)
| extend name = tostring(parse_json(tostring(op.objectRef)).name)
| extend namespace = tostring(parse_json(tostring(op.objectRef)).namespace)
| extend verb = tostring(op.verb)
| project TimeGenerated, SubscriptionId, ResourceId, namespace, name, requestURI, verb, decision, ['user']
| order by TimeGenerated asc

Related

Weighted + ordered tag search using Postgres

After an AI file analysis across tens of thousands of audio files I end up with this kind of data structure in a Postgres DB:
id | name | tag_1 | tag_2 | tag_3 | tag_4 | tag_5
1 | first song | rock | pop | 80s | female singer | classic rock
2 | second song | pop | rock | jazz | electronic | new wave
3 | third song | rock | funk | rnb | 80s | rnb
Tag positions are really important: the more "to the left", the more prominent it is in the song. The number of tags is also finite (50 tags) and the AI always returns 5 of them for every song, no null values expected.
On the other hand, this is what I have to query:
{"rock" => 15, "pop" => 10, "soul" => 3}
The key is a Tag name and the value an arbitrary weight. Numbers of entries could be random from 1 to 50.
According to the example dataset, in this case it should return [1, 3, 2]
I'm also open for data restructuring if it could be easier to achieve using raw concatenated strings but... is it something doable using Postgres (tsvectors?) or do I really have to use something like Elasticsearch for this?
After a lot of trials and errors this is what I ended up with, using only Postgres:
Turn all data set to integers, so it moves on to something like this (I also added columns to match the real data set more closely) :
id | bpm | tag_1 | tag_2 | tag_3 | tag_4 | tag_5
1 | 114 | 1 | 2 | 3 | 4 | 5
2 | 102 | 2 | 1 | 6 | 7 | 8
3 | 110 | 1 | 9 | 10 | 3 | 12
Store requests in an array as strings (note that I sanitized those requests before with some kind of "request builder"):
requests = [
"bpm BETWEEN 110 AND 124
AND tag_1 = 1
AND tag_2 = 2
AND tag_3 = 3
AND tag_4 = 4
AND tag_5 = 5",
"bpm BETWEEN 110 AND 124
AND tag_1 = 1
AND tag_2 = 2
AND tag_3 = 3
AND tag_4 = 4
AND tag_5 IN (1, 3, 5)",
"bpm BETWEEN 110 AND 124
AND tag_1 = 1
AND tag_2 = 2
AND tag_3 = 3
AND tag_4 IN (1, 3, 5),
AND tag_5 IN (1, 3, 5)",
....
]
Simply loop in the requests array, from the most precise to the most approximate one:
# Ruby / ActiveRecord example
track_ids = []
requests.each do |request|
track_ids += Track.where([
"(#{request})
AND tracks.id NOT IN ?", track_ids
]).pluck(:id)
break if track_ids.length > 200
end
... and done! All my songs are ordered by similarity, the closest match at the top, and the more to the bottom, the more approximate they get. Since everything is about integers, it's pretty fast (fast enough on a 100K rows dataset), and the output looks like pure magic. Bonus point: it is still easily tweakable and maintainable by the whole team.
I do understand that's rough, so I'm open to any more efficient way to do the same thing, even if something else is needed in the stack (ES ?), but so far: it's a simple solution that just works.

Is there any Eloquent way to get sum of relational table and sort by it?

Example:
table Users
ID | Username | sex
1 | Tony | m
2 | Andy | m
3 | Lucy | f
table Scores
ID | user_id | score
1 | 2 | 4
2 | 1 | 3
3 | 1 | 4
4 | 2 | 3
5 | 1 | 1
6 | 3 | 3
7 | 3 | 2
8 | 2 | 3
Expected Result:
ID | Username | sex | score_sum (sum) (desc)
2 | Andy | m | 10
1 | Tony | m | 8
3 | Lucy | f | 5
The code I use so far:
User model:
class User extends Authenticatable
{
...
public function scores()
{
return $this->hasMany('App\Score');
}
...
}
Score model
class Job extends Model
{
//i put nothing here
}
Code in controller:
$users = User::all();
foreach ($users as $user){
$user->score_sum = $user->scores()->sum('score');
}
$users = collect($users)->sortByDesc('score_sum');
return view('homepage', [
'users' => $users->values()->all()
]);
Hope my example above make sense. My code does work, but I thought there must be an Eloquent and elegant way to do this without foreach?
There are 2 options for doing this in an Eloquent way.
Option 1
The first way is to do this to add the score_sum as an attribute that is always included when querying the users model. This is only a good idea if you will be using the score_sum the majority of the time when querying the users table. If you only need the score_sum on very specific view or for specific business logic then I would use the second option below.
To do this you will add the attribute to the users model, you can look here for documentation: https://laravel.com/docs/5.6/eloquent-mutators#defining-an-accessor
Here is an example for your use case:
/app/User.php
class User extends Model
{
.
.
.
public function getScoreSumAttribute($value)
{
return $this->scores()->sum('score');
}
}
Option 2
If you just want to do this for a single use case, then the easiest solution is just to use the sum() function in the eventual foreach loop you will be using (most likely in the view).
For example in a view:
#foreach($users as $user)
<div>Username: {{$user->username}}</div>
<div>Sex: {{$user->sex}}</div>
<div>Score Sum: {{$user->scores()->sum('price')}}</div>
#endforeach
Additionally, if you do not want to do this in a foreach loop you can use a raw query in the Eloquent call in your Controller gets the `score_sum'. Here is an example of how that can be done:
$users = User::select('score_sum',DB::raw(SUM(score) FROM 'scores'))->get();
I did not have a quick environment to test this, you might need a WHERE clause in the DB::raw query
Hope this helps!
This is as nice as it gets:
User::selectRaw('*, (SELECT SUM(score) FROM scores WHERE user_id = users.id) as score_sum')
->orderBy('score_sum', 'DESC')
->get();

Dependent types over datatypes

I struggled for a bit with writing a function that could only be passed certain days of the week. I expected that this would work:
datatype days = sunday | monday | tuesday | wednesday | thursday | friday | saturday
fn only_sunday{d:days | d == sunday}(d: days(d)): days(d) = d
but of course, days(d) was never even defined. That only seemed like it would work because ATS has so many builtin types - int and also int(int), etc.
This also doesn't work, but perhaps it's just the syntax that's wrong?
typedef only_sunday = { d:days | d == sunday }
fn only_sunday(d: only_sunday): only_sunday = d
After revisiting the Dependent Types chapter of Introduction to Programming in ATS, I realized that this works:
datatype days(int) =
| sunday(0)
| monday(1)
| tuesday(2)
| wednesday(3)
| thursday(4)
| friday(5)
| saturday(6)
fn print_days{n:int}(d: days(n)): void =
print_string(case+ d of
| sunday() => "sunday"
| monday() => "monday"
| tuesday() => "tuesday"
| wednesday() => "wednesday"
| thursday() => "thursday"
| friday() => "friday"
| saturday() => "saturday")
overload print with print_days
fn sunday_only{n:int | n == 0}(d: days(n)): days(n) = d
implement main0() =
println!("this typechecks: ", sunday_only(sunday))
but of course, it's a little bit less clear that n == 0 means "the day must be sunday" than it would with some code like d == sunday. And although it doesn't seem that unusual to map days to numbers, consider:
datatype toppings(int) = lettuce(0) | swiss_cheese(1) | mushrooms(2) | ...
In this case the numbers are completely senseless, such that you can only understand any {n:int | n != 1} ... toppings(n) code as anti-swiss-cheese if you have the datatype definition at hand. And if you were to edit in a new topping
datatype toppings(int) = lettuce(0) | tomatoes(1) | swiss_cheese(2) | ...
then it would be quite a chore update 1 to 2 in only any Swiss cheese code.
Is there a more symbolic way to use dependent types?
You could try something like this:
stadef lettuce = 0
stadef swiss_cheese = 1
stadef mushrooms = 2
datatype
toppings(int) =
| lettuce(lettuce)
| swiss_cheese(swiss_cheese)
| mushrooms(mushrooms)

How to write a select query or server-side function that will generate a neat time-flow graph from many data points?

NOTE: I am using a graph database (OrientDB to be specific). This gives me the freedom to write a server-side function in javascript or groovy rather than limit myself to SQL for this issue.*
NOTE 2: Since this is a graph database, the arrows below are simply describing the flow of data. I do not literally need the arrows to be returned in the query. The arrows represent relationships.*
I have data that is represented in a time-flow manner; i.e. EventC occurs after EventB which occurs after EventA, etc. This data is coming from multiple sources, so it is not completely linear. It needs to be congregated together, which is where I'm having the issue.
Currently the data looks something like this:
# | event | next
--------------------------
12:0 | EventA | 12:1
12:1 | EventB | 12:2
12:2 | EventC |
12:3 | EventA | 12:4
12:4 | EventD |
Where "next" is the out() edge to the event that comes next in the time-flow. On a graph this comes out to look like:
EventA-->EventB-->EventC
EventA-->EventD
Since this data needs to be congregated together, I need to merge duplicate events but preserve their edges. In other words, I need a select query that will result in:
-->EventB-->EventC
EventA--|
-->EventD
In this example, since EventB and EventD both occurred after EventA (just at different times), the select query will show two branches off EventA as opposed to two separate time-flows.
EDIT #2
If an additional set of data were to be added to the data above, with EventB->EventE, the resulting data/graph would look like:
# | event | next
--------------------------
12:0 | EventA | 12:1
12:1 | EventB | 12:2
12:2 | EventC |
12:3 | EventA | 12:4
12:4 | EventD |
12:5 | EventB | 12:6
12:6 | EventE |
EventA-->EventB-->EventC
EventA-->EventD
EventB-->EventE
I need a query to produce a tree like:
-->EventC
-->EventB--|
| -->EventE
EventA--|
-->EventD
EDIT #3 and #4
Here is the data with edges shown as opposed to the "next" column above. I also added a couple additional columns here to hopefully clear up any confusion about the data:
# | event | ip_address | timestamp | in | out |
----------------------------------------------------------------------------
12:0 | EventA | 123.156.189.18 | 2015-04-17 12:48:01 | | 13:0 |
12:1 | EventB | 123.156.189.18 | 2015-04-17 12:48:32 | 13:0 | 13:1 |
12:2 | EventC | 123.156.189.18 | 2015-04-17 12:48:49 | 13:1 | |
12:3 | EventA | 103.145.187.22 | 2015-04-17 14:03:08 | | 13:2 |
12:4 | EventD | 103.145.187.22 | 2015-04-17 14:05:23 | 13:2 | |
12:5 | EventB | 96.109.199.184 | 2015-04-17 21:53:00 | | 13:3 |
12:6 | EventE | 96.109.199.184 | 2015-04-17 21:53:07 | 13:3 | |
The data is saved like this to preserve each individual event and the flow of a session (labeled by the ip address).
TL;DR
Got lots of events, some duplicates, and need them all organized into one neat time-flow graph.
Holy cow.
After wrestling with this for over a week I think I FINALLY have a working function. This isn't optimized for performance (oh the loops!), but gets the job done for the time being while I can work on performance. The resulting OrientDB server-side function (written in javascript):
The function:
// Clear previous runs
db.command("truncate class tmp_Then");
db.command("truncate class tmp_Events");
// Get all distinct events
var distinctEvents = db.query("select from Events group by event");
// Send 404 if null, otherwise proceed
if (distinctEvents == null) {
response.send(404, "Events not found", "text/plain", "Error: events not found" );
} else {
var edges = [];
// Loop through all distinct events
distinctEvents.forEach(function(distinctEvent) {
var newEvent = [];
var rid = distinctEvent.field("#rid");
var eventType = distinctEvent.field("event");
// The main query that finds all *direct* descendents of the distinct event
var result = db.query("select from (traverse * from (select from Events where event = ?) where $depth <= 2) where #class = 'Events' and $depth > 1 and #rid in (select from Events group by event)", [eventType]);
// Save the distinct event in a temp table to create temp edges
db.command("create vertex tmp_Events set rid = ?, event = ?", [rid, event]);
edges.push(result);
});
// The edges array defines which edges should exist for a given event
edges.forEach(function(edge, index) {
edge.forEach(function(e) {
// Create the temp edge that corresponds to its distinct event
db.command("create edge tmp_Then from (select from tmp_Events where rid = " + distinctEvents[index].field("#rid") + ") to (select from tmp_Events where rid = " + e.field("#rid") + ")");
});
});
var result = db.query("select from tmp_Events");
return result;
}
Takeaways:
Temp tables appeared to be necessary. I tried to do this without temp tables (classes), but I'm not sure it could be done. I needed to mock edges that didn't exist in the raw data.
Traverse was very helpful in writing the main query. Traversing through an event to find its direct, unique descendents was fairly simple.
Having the ability to write stored procs in Javascript is freaking awesome. This would have been a nightmare in SQL.
omfg loops. I plan to optimize this and continue to make it better so hopefully other people can find some use for it.

Optimizing working scheduling MiniZinc code - constraint programming

Please can you help optimize this working MiniZinc code:
Task: There is a conference which has 6x time slots. There are 3 speakers attending the conference who are each available at certain slots. Each speaker will present for a predetermined number of slots.
Objective: Produce the schedule that has the earliest finish of speakers.
Example: Speakers A, B & C. Talk durations = [1, 2, 1]
Speaker availability:
+---+------+------+------+
| | Sp.A | Sp.B | Sp.C |
+---+------+------+------+
| 1 | | Busy | |
| 2 | Busy | Busy | Busy |
| 3 | Busy | Busy | |
| 4 | | | |
| 5 | | | Busy |
| 6 | Busy | Busy | |
+---+------+------+------+
Link to working MiniZinc code: http://pastebin.com/raw.php?i=jUTaEDv0
What I'm hoping to optimize:
% ensure allocated slots don't overlap and the allocated slot is free for the speaker
constraint
forall(i in 1..num_speakers) (
ending_slot[i] = starting_slot[i] + app_durations[i] - 1
) /\
forall(i,j in 1..num_speakers where i < j) (
no_overlap(starting_slot[i], app_durations[i], starting_slot[j], app_durations[j])
) /\
forall(i in 1..num_speakers) (
forall(j in 1..app_durations[i]) (
starting_slot[i]+j-1 in speaker_availability[i]
)
)
;
Expected solution:
+---+----------+----------+----------+
| | Sp.A | Sp.B | Sp.C |
+---+----------+----------+----------+
| 1 | SELECTED | Busy | |
| 2 | Busy | Busy | Busy |
| 3 | Busy | Busy | SELECTED |
| 4 | | SELECTED | |
| 5 | | SELECTED | Busy |
| 6 | Busy | Busy | |
+---+----------+----------+----------+
I'm hakank (author of the original model). If I understand it correctly, your question now is how to present the table for the optimal solution, not really about finding the solution itself (all FlatZinc solvers I tested solved it in no time).
One way of creating the table is to have a help matrix ("m") which contain information if a speaker is selected (1), busy (-1) or not available (0):
array[1..num_slots, 1..num_speakers] of var -1..1: m;
Then one must connect info in this the matrix and the other decision variables ("starting_slot" and "ending_slot"):
% connect to matrix m
constraint
forall(t in 1..num_slots) (
forall(s in 1..num_speakers) (
(not(t in speaker_availability[s]) <-> m[t,s] = -1)
/\
((t >= starting_slot[s] /\ t <= ending_slot[s]) <-> m[t,s] = 1)
)
)
;
Then the matrix "m" can be printed like this:
% ...
++
[
if s = 1 then "\n" else " " endif ++
if fix(m[t,s]) = -1 then
"Busy "
elseif fix(m[t,s]) = 1 then
"SELECTED"
else
" "
endif
| t in 1..num_slots, s in 1..num_speakers
]
;
As always, there are more than one way of doing this, but I settled with this since it's quite direct.
Here's the complete model:
http://www.hakank.org/minizinc/scheduling_speakers_optimize.mzn
Update: Adding the output of the model:
Starting: [1, 4, 3]
Durations: [1, 2, 1]
Ends: [1, 5, 3]
z: 5
SELECTED Busy
Busy Busy Busy
Busy Busy SELECTED
SELECTED
SELECTED Busy
Busy Busy
----------
==========
Update 2:
Another way is to use cumulative/4 instead of no_overlap/4 which should be more effective, i.e.
constraint
forall(i in 1..num_speakers) (
ending_slot[i] = starting_slot[i] + app_durations[i] - 1
)
% /\ % use cumulative instead (see below)
% forall(i,j in 1..num_speakers where i < j) (
% no_overlap(starting_slot[i], app_durations[i], starting_slot[j], app_durations[j])
% )
/\
forall(i in 1..num_speakers) (
forall(j in 1..app_durations[i]) (
starting_slot[i]+j-1 in speaker_availability[i]
)
)
/\ cumulative(starting_slot, app_durations, [1 | i in 1..num_speakers], 1)
;
Here's the altered version (which give the same result)
http://www.hakank.org/minizinc/scheduling_speakers_optimize2.mzn
(I've also skipped the presentation matrix "m" and do all presentation in the output section.)
For this simple problem instance, there is no discernible difference, but for larger instances this should be faster. (And for larger instances, one might want to test different search heuristics instead of "solve minimize z".)
As I commented on your previous question Constraint Programming: Scheduling speakers in shortest time, the cumulative constraint is appropriate for this. I don't have Minizinc code handy, but there is the model in ECLiPSe (http://eclipseclp.org):
:- lib(ic).
:- lib(ic_edge_finder).
:- lib(branch_and_bound).
solve(JobStarts, Cost) :-
AllUnavStarts = [[2,6],[1,6],[2,5]],
AllUnavDurs = [[2,1],[3,1],[1,1]],
AllUnavRess = [[1,1],[1,1],[1,1]],
JobDurs = [1,2,1],
Ressources = [1,1,1],
length(JobStarts, 3),
JobStarts :: 1..9,
% the jobs must not overlap with each other
cumulative(JobStarts, JobDurs, Ressources, 1),
% for each speaker, no overlap of job and unavailable periods
(
foreach(JobStart,JobStarts),
foreach(JobDur,JobDurs),
foreach(UnavStarts,AllUnavStarts),
foreach(UnavDurs,AllUnavDurs),
foreach(UnavRess,AllUnavRess)
do
cumulative([JobStart|UnavStarts], [JobDur|UnavDurs], [1|UnavRess], 1)
),
% Cost is the maximum end date
( foreach(S,JobStarts), foreach(D,JobDurs), foreach(S+D,JobEnds) do true ),
Cost #= max(JobEnds),
minimize(search(JobStarts,0,smallest,indomain,complete,[]), Cost).