KQL summarize by count then take those with a count above x

I have a situation where I am trying to count all instances of something, and then see only the groups where the count is greater than X.
Right now I have all my clauses, then summarize count() by X, Y, Z, where X, Y, and Z are columns. This gives me about 35 rows, but a lot of them have a count of 1 and do not interest me. Is there a way to filter this further so I only see the rows where the count is greater than a number I choose, for example 20?
I tried adding where count>20, but this gives me a syntax error.

Adding | where count_ > 20 after the summarize does the trick. I missed the _ initially: summarize count() names its output column count_ by default.
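For readers who want to see the same "count per group, then keep the large groups" idea outside KQL, here is a minimal Python sketch. The sample rows and the threshold are made up for illustration; it only mirrors the logic of summarize count() followed by a where on the count, not Kusto itself.

```python
from collections import Counter

# Hypothetical sample data: each tuple stands for one record's (X, Y, Z) values.
rows = [("a", 1, "p")] * 25 + [("b", 2, "q")] + [("c", 3, "r")] * 21

threshold = 20  # assumed cut-off, like the 20 in the question

# Count occurrences of each (X, Y, Z) combination, like summarize count() by X, Y, Z.
counts = Counter(rows)

# Keep only the groups whose count exceeds the threshold, like where count_ > 20.
frequent = {key: n for key, n in counts.items() if n > threshold}
```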


Combining many sort ranks into one master sort rank

Say I have some sorted result from a SQL query that looks like:
x y z
0 0 0
0 0 1
0 0 2
0 1 0
0 1 1
0 2 0
0 2 1
Where x, y and z are sort ranks. These sort ranks are always greater than 0, and smaller than 500mil.
Is there a way to combine the values from x, y and z into one "master" sort rank? Sorting the dataset using this "master" sort rank should result in the same ordering.
I'm thinking I can do something with bit shifting but I am not sure...
Assuming that every value in each of the three columns is between 1 and 500 million, you could use the following formula to generate a unique rank:
z + (500 x 10^6)*y + (500 x 10^6)*(500 x 10^6)*x
To generate this rank you could use the following query:
SELECT
x, y, z,
z + (500 * 1000000)*y + (500 * 1000000)*(500 * 1000000)*x AS master_rank
FROM yourTable;
The reason this works can be seen by examining, say, the z and y columns. The values in z span less than 500 million, while increasing y by 1 adds 500 million to the rank, so no difference in z can outweigh a difference in y. The same logic applies between y and x. This approach is similar to using a bit mask, on a larger scale.
Note that I assume that your version of SQL can tolerate numbers this large: the largest possible rank is about 1.25 x 10^26, which does not fit in a signed 64-bit integer. If it can't, then you might want to consider another approach, possibly just ordering by the three columns as @Gordon mentioned in his answer. Besides this, having 1 bil x 1 bil records would make for a very large table and would have other problems.
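A quick way to sanity-check the formula is to compare it against a plain three-column sort. The sketch below is only an illustration in Python (whose integers do not overflow); BASE and master_rank are names chosen here, and the sample rows are the ones from the question.

```python
BASE = 500 * 1000000  # multiplier from the formula above

def master_rank(x, y, z):
    # Same expression as the SQL: z is the lowest "digit", x the highest.
    return z + BASE * y + BASE * BASE * x

# Rows from the question, deliberately reversed before sorting.
rows = [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 1, 0), (0, 1, 1), (0, 2, 0), (0, 2, 1)]
shuffled = list(reversed(rows))

# Sorting by the combined key should match sorting by (x, y, z).
by_rank = sorted(shuffled, key=lambda r: master_rank(*r))

# Largest rank when every column is 500 million, compared with the signed 64-bit limit.
largest = master_rank(BASE, BASE, BASE)
fits_in_int64 = largest <= 2**63 - 1
```

Here by_rank matches the original order, and fits_in_int64 is False, which is why a 64-bit integer column would overflow at the top of the stated range.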
Do you mean something like this?
order by x * 10000 + y * 100 + z
(You would adjust the numbers for the width you need.)
I'm not sure why you would want to do that instead of:
order by x, y, z
If you do combine into a single value, be careful about integer overflow.

Number sort using Min, Max and Variables

I am new to programming and I am trying to create a program that takes 3 random numbers X, Y and Z and sorts them into ascending order, X being the lowest and Z the highest, using the Min and Max functions and a variable (tmp).
I know that there is a particular strategy that I need to use that affects the (X,Y) pair first, then (Y,Z), then (X,Y) again, but I can't grasp the logic.
The closest I have got so far is...
y=min(y,z)
x=min(x,y)
tmp=max(y,z)
z=tmp
tmp=max(x,y)
y=tmp
x=min(x,y)
tmp=max(x,y)
y=tmp
I've tried so many different combinations, but it seems that the problem is UNSOLVABLE. Can anybody help?
You need to sort the X,Y pair first
tmp=min(x,y)
y=max(x,y)
x=tmp
Then sort the Y,Z pair
tmp = min(y,z)
z=max(y,z)
y=tmp
Then, re-sort the X,Y pair (in case the original Z was the lowest value)
tmp=min(x,y)
y=max(x,y)
x=tmp
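The three steps above can be wrapped in a function and checked against every ordering of three distinct values. This is a sketch in Python; sort3 is a name chosen here for illustration.

```python
from itertools import permutations

def sort3(x, y, z):
    # Step 1: sort the (x, y) pair.
    tmp = min(x, y)
    y = max(x, y)
    x = tmp
    # Step 2: sort the (y, z) pair, which moves the largest value into z.
    tmp = min(y, z)
    z = max(y, z)
    y = tmp
    # Step 3: sort (x, y) again, in case the original z was the smallest.
    tmp = min(x, y)
    y = max(x, y)
    x = tmp
    return x, y, z

# Every permutation of (1, 2, 3) should come back in ascending order.
results = {sort3(*p) for p in permutations((1, 2, 3))}
```

Note that tmp has to hold the min (or the max) before either variable is overwritten; assigning y first and then calling min(x, y) would use the already-changed value.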
If the commands you have mentioned are the only ones available on the website, and you can only use each one once, try:
# Sort X,Y pair
tmp=max(x,y)
x=min(x,y)
y=tmp
# Sort Y,Z pair
tmp=max(y,z)
y=min(y,z)
z=tmp
# Sort X,Y pair again.
tmp=max(x,y)
x=min(x,y)
y=tmp
Hope that helps.
I'm not sure if I understood your question correctly, but you are overwriting your variables. Or are you trying to solve some homework with the restriction to only use the min() and max() functions?
What about using a list?
tmp = [x, y, z]
tmp.sort()
x, y, z = tmp

How to choose a range for a loop based upon the answers of a previous loop?

I'm sorry the title is so confusingly worded, but it's hard to condense this problem down to a few words.
I'm trying to find the minimum value of a specific equation. At first I'm looping through the equation, which for our purposes here can be something like y = .245x^3-.67x^2+5x+12. I want to design a loop where the "steps" through the loop get smaller and smaller.
For example, the first time it loops through, it uses a step of 1. I will get about 30 values. What I need help with is how to use the three smallest values I receive from this first loop.
Here's an example of the values I might get from the first loop: (I should note this isn't supposed to be actual code at all. It's just a brief description of what's happening)
loop from x = 1 to 8 with step 1
results:
x = 1 -> y = 30
x = 2 -> y = 28
x = 3 -> y = 25
x = 4 -> y = 21
x = 5 -> y = 18
x = 6 -> y = 22
x = 7 -> y = 27
x = 8 -> y = 33
I want something that can detect the lowest three values and create a loop. From these results, the values of x that give the smallest three results for y are x = 4, 5, and 6.
So my "guess" at this point would be x = 5. To get a better "guess" I'd like a loop that now does:
loop from x = 4 to x = 6 with step .5
I could keep this pattern going until I get an absurdly accurate guess for the minimum value of x.
Does anybody know of a way I can do this? I know the values I'm going to get are going to be able to be modeled by a parabola opening up, so this format will definitely work. I was thinking that the values could be put into a column. It wouldn't be hard to make something that returns the smallest value for y in that column, and the corresponding x-value.
If I'm being too vague, just let me know, and I can answer any questions you might have.
Nice question. Here's at least a start for what I think you should do for this:
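Sub findMin()
    ' lowest, middle and highest hold the x values that give the three smallest results.
    Dim lowest As Integer
    Dim middle As Integer
    Dim highest As Integer
    ' 999 is a sentinel: retVal(999) is far larger than any value in the loop range.
    lowest = 999
    middle = 999
    highest = 999
    Dim i As Integer
    i = 1
    Do While i < 9
        If retVal(i) < retVal(lowest) Then
            highest = middle
            middle = lowest
            lowest = i
        ElseIf retVal(i) < retVal(middle) Then
            highest = middle
            middle = i
        ElseIf retVal(i) < retVal(highest) Then
            highest = i
        End If
        i = i + 1
    Loop
    ' Show the three x values in the Immediate window.
    Debug.Print lowest, middle, highest
End Sub
Function retVal(num As Integer) As Double
    ' Sqr is the square root in VBA, so the powers are written with ^.
    retVal = 0.245 * num ^ 3 - 0.67 * num ^ 2 + 5 * num + 12
End Function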
What I've done here is set three Integers as your three Min values: lowest, middle, and highest. You loop through the values you're plugging into the formula (here, the retVal function) and comparing the return value of retVal (hence the name) to the values of retVal(lowest), retVal(middle), and retVal(highest), replacing them as necessary. I'm just beginning with VBA so what I've done likely isn't very elegant, but it does at least identify the Integers that result in the lowest values of the function. You may have to play around with the values of lowest, middle, and highest a bit to make it work. I know this isn't EXACTLY what you're looking for, but it's something along the lines of what I think you should do.
There is no trivial way to approach this unless the problem domain is narrowed.
The example polynomial given in fact has no minimum, which is readily determined by observing y'>0 (hence, y is always increasing WRT x).
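To spell out that observation, here is the derivative and its discriminant, worked from the coefficients given in the question:

```latex
\begin{aligned}
y  &= 0.245x^3 - 0.67x^2 + 5x + 12 \\
y' &= 0.735x^2 - 1.34x + 5 \\
\Delta &= (-1.34)^2 - 4(0.735)(5) = 1.7956 - 14.7 = -12.9044 < 0
\end{aligned}
```

Since the discriminant is negative, y' has no real roots, and because its leading coefficient is positive, y' > 0 for every x.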
Given the wide interpretation of "[an] equation, which for our purposes here can be something like y = .245x^3-.67x^2+5x+12", many conditions need to be checked, even assuming the domain is limited to polynomials.
The polynomial order is significant, and the order determines what conditions are necessary to check for how many solutions are possible, or whether any solution is possible at all.
Without taking this complexity into account, an iterative approach could yield an incorrect solution due to underflow error, or an unfortunate choice of iteration steps or bounds.
I'm not trying to be hard here, I think your idea is neat. In practice it is more complicated than you think.
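With those caveats in mind, the shrinking-step loop from the question can be sketched as follows. This is a minimal Python illustration, not a general minimizer: it assumes the function has exactly one minimum inside the starting interval, and refine_minimum, example, and the tolerance are names and values made up here. The test function is a hypothetical parabola, since the cubic in the question has no minimum.

```python
def refine_minimum(f, lo, hi, step=1.0, tol=1e-6):
    """Scan [lo, hi] on a grid, then narrow to the neighbours of the best point."""
    best = lo
    while step > tol:
        count = int(round((hi - lo) / step))
        xs = [lo + k * step for k in range(count + 1)]
        best = min(xs, key=f)          # grid point with the smallest f(x)
        lo = max(lo, best - step)      # keep one grid neighbour on each side
        hi = min(hi, best + step)
        step /= 2                      # halve the step for the next pass
    return best

def example(x):
    # Hypothetical test function with a single minimum at x = 5.2.
    return (x - 5.2) ** 2 + 18

x_min = refine_minimum(example, 1.0, 8.0)
```

On the first pass this scans x = 1 to 8 with step 1, picks x = 5, and then scans 4 to 6 with step 0.5, just as described in the question. If the function has several local minima or none in the interval, the result is not meaningful.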

How would I split a large set of tabular data into smaller relevant tables? (Not a DB Question)

I'm really hoping I can describe this question in an understandable way. This is a puzzle that I have not been able to begin to solve even though I (mostly) understand it. I'm just not sure where to start, and I'm really hoping someone out there can get me headed in the right direction.
I have a LARGE table of data. It describes relationships between objects. Let's say the Y-axis has items numbered 1-1000, and the X-axis has items 1-1000 also. If item #234 on the Y-axis is related to item #791 on X, there will be a mark in the table where the row and column cross. In some industries this is referred to as a Truth Table. One can, at a glance, see how many items in a system relate to each other. The marks in the table can help to identify trends and patterns.
Here's some other helpful stuff about the nature of the table:
The full range of the number of relationships (r) for each item on either axis can be 1 <= r <= axisTotal.
The X and Y axis will share common items, but each axis will also have items that the other axis does not.
Each item will only exist once per axis. It can be on X and Y, but it would only be on each one 1 time.
The total number of items on each axis will most likely NOT be equal. Each axis could have from 50 to 1000's of items.
The end result is that this is going to be a report that needs to be printed. We have successfully printed a table that had about 100-150 items on each axis on an 11in X 17in piece of paper. Any more than that and it begins to be so small it's unreadable.
What I am trying to do is split the super large tables into smaller tables, but related points need to stay together. If I grab item 1-100 on X then I would need each item they relate to from Y.
I've generated a number of these tables and, while the number of relationships CAN be arbitrary, I have never seen an item relate to all other items. So in real practice the range is more like 1 <= r <= (10% * axisTotal). If an item's relationships exceed this range, it can be split up into multiple tables, but that is not optimal at all.
At the end of the day I think we, and our clients, would be happy if a 1000x1000 item table was split into 8 to 10 printed pages of smaller, related tables.
Any guidance would be a great help! Thanks.
---EDIT---
One other thing worth noting, there will be no empty rows or columns in the table. Every item on both the x and y axis will relate to at least 1 item on the opposite axis.
---EDIT---
Here is an example of a small truth table that I'm describing: [image not included]. Every row and column has at least one relationship.
---EDIT---
May 18th, 2011
For what it's worth, I was moving pretty well on this project and then got pulled off for a couple of weeks. So it's going to be a little while before I get back to this problem. But it is one that I will have to solve soon.
---EDIT---
July 11th, 2011
Bummer. Well, looks like I'm not going to be able to solve this problem right now. I was really hoping to be able to figure this out. Through discussion we decided to present the truth table in an Excel spreadsheet as an add-on resource to the main report. Excel 2007 and later will handle 1000's of columns which will more than suffice. Plus, we added some VBA which allows the viewer to double click on the column titles. This action will reduce the rows to only ones where there are interactions. Then it removes empty columns. In this way they can see a small sub-table based on the item they want to view, and can print it if they want.
This isn't an answer, I just want to try to visualize your data a little better. Does it kind of look like this?
         Alice  Bob  Charlie  ...  Zelda
Shoes      X           X
Hats              X                  X
Gloves                 X
...
Pants             X
EDIT
Is it a requirement to show the data in tabular format? Or could you just list each out? Something like:
Alice
    Shoes
Bob
    Hats
    Pants
Charlie
    Shoes
    Gloves
Zelda
    Hats
Or the other way:
Shoes
    Alice
    Charlie
Hats
    Bob
    Zelda
Gloves
    Charlie
Pants
    Bob
EDIT 2
Okay, I've made another larger truth table to hopefully get a better understanding of how you want to split things up:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 x x x x
2 x x x x x x
3 x x x x
4 x x x
5 x x x
6 x x x
7 x x x
8 x x x
For argument's sake let's just say that you can only fit 4 rows on a page (because I don't feel like typing out a giant table this early in the morning), so we're going to split this into two pages. First, it is important to show every row, right? Second, do you need to show columns that never have a value? For instance, Y and Z never have a value for rows 1 through 8 in this table; can they be excluded from the report or do they still need to be there? Third, is the order of the rows important?
If it's not important to show completely empty columns, then we could remove 10 columns from the table above and compress it down to:
A B C E F H I L M O P Q R U V W
1 x x x x
2 x x x x x x
3 x x x x
4 x x x
5 x x x
6 x x x
7 x x x
8 x x x
Then, if row order isn't important, you can compress it further by taking an optimum row arrangement (not necessarily shown here). The two tables below have been further compressed to 11 and 10 columns:
A B C F H I M P Q R U
1 x x x x
2 x x x x x x
5 x x x
7 x x x
A E H I L M O P U W
3 x x x x
4 x x x
6 x x x
8 x x x
Am I going down a completely wrong path here? These are all just questions to help me better understand your data and output requirements.
Also, in all seriousness, is it an option to get larger printers/plotters? Or is it an option to just generate a PDF and use Acrobat's tile printing option?
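The column compression described above (split the rows into pages, then drop the columns that are empty on each page) can be sketched in a few lines of Python. This is only an illustration under assumptions: split_into_pages is a name made up here, rows are taken in sorted order rather than in an optimized arrangement, and the data is the small Shoes/Hats example from earlier in this answer.

```python
def split_into_pages(marks, rows_per_page):
    """marks maps each row label to the set of column labels that have a mark."""
    rows = sorted(marks)
    pages = []
    for start in range(0, len(rows), rows_per_page):
        page_rows = rows[start:start + rows_per_page]
        # Keep only the columns that have at least one mark on this page.
        used_columns = sorted(set().union(*(marks[r] for r in page_rows)))
        pages.append((page_rows, used_columns))
    return pages

marks = {
    "Shoes": {"Alice", "Charlie"},
    "Hats": {"Bob", "Zelda"},
    "Gloves": {"Charlie"},
    "Pants": {"Bob"},
}
pages = split_into_pages(marks, rows_per_page=2)
```

Choosing which rows share a page so that each page needs as few columns as possible is the hard part of the problem, and this sketch does not attempt it.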
Last year I read an article at the Computational Biology PLoS journal (www.ploscompbiol.org), that seems related to your problem.
In short, it describes a new approach for when we already have a set of proteins and tabular data about their one-to-one interactions, and we want to group them so that interaction inside a group and interaction between two groups is either maximized or (this is the innovative idea) minimized.
If we plot the start data table with black for high and white for low interaction it looks randomly gray. The result table, after the calculations and rearranging is done (so grouped items are placed near one another), looks more like orthogonal areas of black and white.
The article: Protein Interaction Networks—More Than Mere Modules,
where there are also references to other older techniques for grouping this kind of data.

Treatment of error values in the SQL standard

I have a question about the SQL standard which I'm hoping a SQL language lawyer can help with.
Certain expressions just don't work. 62 / 0, for example. The SQL standard specifies quite a few ways in which expressions can go wrong in similar ways. Lots of languages deal with these expressions using special exceptional flow control, or bottom pseudo-values.
I have a table, t, with (only) two columns, x and y each of type int. I suspect it isn't relevant, but for definiteness let's say that (x,y) is the primary key of t. This table contains (only) the following values:
x y
7 2
3 0
4 1
26 5
31 0
9 3
What behavior is required by the SQL standard for SELECT expressions operating on this table which may involve division(s) by zero? Alternatively, if no one behavior is required, what behaviors are permitted?
For example, what behavior is required for the following select statements?
The easy one:
SELECT x, y, x / y AS quot
FROM t
A harder one:
SELECT x, y, x / y AS quot
FROM t
WHERE y != 0
An even harder one:
SELECT x, y, x / y AS quot
FROM t
WHERE x % 2 = 0
Would an implementation (say, one that failed to realize on a more complex version of this query that the restriction could be moved inside the extension) be permitted to produce a division by zero error in response to this query, because, say it attempted to divide 3 by 0 as part of the extension before performing the restriction and realizing that 3 % 2 = 1? This could become important if, for example, the extension was over a small table but the result--when joined with a large table and restricted on the basis of data in the large table--ended up restricting away all of the rows which would have required division by zero.
If t had millions of rows, and this last query were performed by a table scan, would an implementation be permitted to return the first several million results before discovering a division by zero near the end when encountering one even value of x with a zero value of y? Would it be required to buffer?
There are even worse cases, ponder this one, which depending on the semantics can ruin boolean short-circuiting or require four-valued boolean logic in restrictions:
SELECT x, y
FROM t
WHERE ((x / y) >= 2) AND ((x % 2) = 0)
If the table is large, this short-circuiting problem can get really crazy. Imagine the table had a million rows, one of which had a 0 divisor. What would the standard say is the semantics of:
SELECT CASE
WHEN EXISTS
(
SELECT x, y, x / y AS quot
FROM t
)
THEN 1
ELSE 0
END AS what_is_my_value
It seems like this value should probably be an error, since it depends on the emptiness or non-emptiness of a result which is an error, but adopting those semantics would seem to prohibit the optimizer from short-circuiting the table scan here. Does this existence query require proving the existence of one non-bottoming row, or also the non-existence of a bottoming row?
I'd appreciate guidance here, because I can't seem to find the relevant part(s) of the specification.
All implementations of SQL that I've worked with treat a division by 0 as an immediate NaN or #INF. The division is supposed to be handled by the front end, not by the implementation itself. The query should not bottom out, but the result set needs to return NaN in this case. Therefore, it's returned at the same time as the result set, and no special warning or message is brought up to the user.
At any rate, to properly deal with this, use the following query:
select
    x, y,
    case y
        when 0 then null
        else x / y
    end as quot
from
    t
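To try the guarded query without a database server, here is a small sketch using SQLite through Python's built-in sqlite3 module. SQLite is used only because it is convenient; other engines may behave differently, and this says nothing about what the SQL standard requires. The ORDER BY is added here so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER, y INTEGER, PRIMARY KEY (x, y))")
conn.executemany(
    "INSERT INTO t (x, y) VALUES (?, ?)",
    [(7, 2), (3, 0), (4, 1), (26, 5), (31, 0), (9, 3)],
)

# The CASE guard returns NULL instead of attempting the division when y is 0.
query = """
SELECT x, y,
       CASE y WHEN 0 THEN NULL ELSE x / y END AS quot
FROM t
ORDER BY x, y
"""
rows = conn.execute(query).fetchall()
conn.close()
```

Because both columns are integers, SQLite performs integer division here, so the row (7, 2) yields 3 rather than 3.5, and the two rows with y = 0 come back with None (NULL) as the quotient.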
To answer your last question, this statement:
SELECT x, y, x / y AS quot
FROM t
Would return this:
x y quot
7 2 3.5
3 0 NaN
4 1 4
26 5 5.2
31 0 NaN
9 3 3
So, your exists would find all the rows in t, regardless of what their quotient was.
Additionally, I was reading over your question again and realized I hadn't discussed where clauses (for shame!). The where clause, or predicate, should always be applied before the columns are calculated.
Think about this query:
select x, y, x/y as quot from t where x%2 = 0
If we had a record (3,0), it applies the where condition, and checks if 3 % 2 = 0. It does not, so it doesn't include that record in the column calculations, and leaves it right where it is.