How to histogram a numeric variable? - splunk

I want to produce a simple histogram of a numeric variable X.
I'm having trouble finding a clear example.
Since it's more important that the histogram be meaningful than beautiful, I would prefer to specify the bin size rather than letting the tool decide. See: Data Scientists: STOP Randomly Binning Histograms

Histograms are a primary tool for understanding the distribution of data. As such, Splunk automatically creates a histogram by default for raw event queries. So it stands to reason that Splunk should provide tools for you to create histograms of your own variables extracted from query results.
It may be that the reason this is hard to find is that the basic answer is very simple:
(your query) |rename (your value) as X
|chart count by X span=1.0
Select "Visualization" and set chart type to "Column Chart" for a traditional vertical-bar histogram.
There is an example of this in the docs described as "Chart the number of transactions by duration".
The span value is used to control binning of the data. Adjust this value to optimize your visualization.
Warning: It is legal to omit span, but if you do so the X-axis will be compacted non-linearly to eliminate empty bins -- this could result in confusion if you aren't careful about observing the bin labels (assuming they're even drawn).
If you have a long-tail distribution, it may be useful to restrict the results to the range of interest. This can be done using where:
(your query) |rename (your value) as X
|where X>=0 and X<=100
|chart count by X span=1.0
Alternatively, use a clamping function to preserve the out-of-range counts:
(your query) |rename (your value) as X
|eval X=max(0,min(X,100))
|chart count by X span=1.0
Another way to deal with long-tails is to use a logarithmic span mode -- special values for span include log2 and log10 (documented as log-span).
If you would like to have both a non-default span and a compressed X-axis, there's probably a parameter for that -- but the documentation is cryptic.
I found that this 2-stage approach made that happen:
(your query) |rename (your value) as X
|bin X span=10.0 as X
|chart count by X
Again, this type of chart can be dangerously misleading if you don't pay careful attention to the labels.
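(If it helps to see the same fixed-width binning and clamping idea outside Splunk, here is a minimal sketch in Python with NumPy; the sample values, the 0-100 range, and the bin width of 10 are made up to mirror the examples above.)
import numpy as np

# Hypothetical stand-in for the values of X returned by the query.
X = np.array([3, 7, 12, 18, 25, 31, 44, 58, 73, 96, 250, 410], dtype=float)
X = np.clip(X, 0, 100)                                        # clamp, mirroring eval X=max(0,min(X,100))
counts, edges = np.histogram(X, bins=np.arange(0, 110, 10))   # fixed bin width of 10 (the "span")
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:3.0f}-{hi:3.0f}: {c}")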

Related

Searching for groups of objects given a reduction function

I have a few questions about a type of search.
First, is there a name for the following type of search, and if so, what is it? I want to search for subsets of objects from some collection such that a reduction and filter function applied to the subset is true. For example, say I have the following objects, each of which contains an id and a value.
[A,10]
[B,10]
[C,10]
[D,9]
[E,11]
I want to search for "all the sets of objects whose summed values equal 30" and I would expect the output to be, {{A,B,C}, {A,D,E}, {B,D,E}, {C,D,E}}.
Second, is the only strategy to perform this search brute-force? Is there some type of general-purpose algorithm for this? Or are search optimizations dependent on the reduction function?
Third, if you came across this problem, what tools would you use to solve it in a general way? Assume the reduction and filter functions could be anything and are not necessarily the sum function. Does SQL provide a good API for this type of search? What about Prolog? Any interesting tips and tricks would be appreciated.
Thanks.
I cannot comment on the problem in general, but a brute-force search can easily be done in Prolog.
w(a,10).
w(b,10).
w(c,10).
w(d,9).
w(e,11).
solve(0, [], _).
solve(N, [X], [X|_]) :- w(X, N).
solve(N, [X|Xs], [X|Bs]) :-
    w(X, W),
    W < N,
    N1 is N - W,
    solve(N1, Xs, Bs).
solve(N, [X|Xs], [_|Bs]) :- % skip element if previous clause fails
    solve(N, [X|Xs], Bs).
Which gives
| ?- solve(30, X, [a, b, c, d, e]).
X = [a,b,c] ? ;
X = [a,d,e] ? ;
X = [b,d,e] ? ;
X = [c,d,e] ? ;
(1 ms) no
SQL is TERRIBLE at this kind of problem. Until recently there was no way to get 'all combinations' of row elements. Now you can do so with recursive common table expressions, but their limitations force you to retain all partial results as well as final results, which you would then have to filter out to get your final answer. About the only benefit you get from SQL's recursive procedure is that you can stop evaluating possible combinations once a sub-path exceeds 30, your target total. That makes it slightly less ugly than an 'evaluate all 2^N combinations' brute-force solution (unless every combination sums to less than the target total).
To solve this with SQL you would be running an algorithm that can be described as:
Seed your result set with all table entries less than your target total and their value as a running sum.
Iteratively join your prior result with all combinations of table that were not already used in the result set and whose value added to running sum is less than or equal to target total. Running sum becomes old running sum plus value, and append ID to ID LIST. Union this new result to the old results. Iterate until no more records qualify.
Make a final pass of the result set to filter out the partial sums that do not total to your target.
Oh, and unless you make special provisions, solutions {A,B,C}, {C,B,A}, and {A,C,B} all look like different solutions (order is significant).
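(For comparison, here is a sketch of the brute-force "evaluate all combinations" baseline mentioned above, written in Python rather than SQL or Prolog; the objects follow the example in the question.)
from itertools import combinations

objects = {"A": 10, "B": 10, "C": 10, "D": 9, "E": 11}
target = 30

# Try every subset size and every combination of ids, keeping those whose
# values sum to the target: {A,B,C}, {A,D,E}, {B,D,E}, {C,D,E}.
solutions = [set(ids)
             for r in range(1, len(objects) + 1)
             for ids in combinations(objects, r)
             if sum(objects[i] for i in ids) == target]
print(solutions)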

What does n mean in big-oh complexity?

In Big-Oh notation, what does n mean? I've seen it described as the input size and as the length of a vector. If it's input size, does it mean memory space on the computer? I often see n used interchangeably with input size.
Examples of Big-Oh:
O(n) is linear running time.
O(log n) is logarithmic running time.
A code complexity analysis example (I'm changing the input from n to m):
def factorial(m):
    product = 1
    for i in range(1, m+1):
        product = product*i
    return product
This is O(n). What does n mean? Is it how much memory it takes? Maybe n means the number of elements in a vector? Then how do you explain it when n=3, a single number?
When somebody says O(n), the n can refer to different things depending on context. When it isn't obvious what n refers to, people ideally point it out explicitly, but several conventions exist:
When the name of the variable(s) used in the O-notation also exist in the code, they almost certainly refer to the value of the variable with that name (if they refer to anything else, that should be pointed out explicitly). So in your original example where you had a variable named n, O(n) would refer to that variable.
When the code does not contain a variable named n and n is the only variable used in the O notation, n usually refers to the total size of the input.
When multiple variables are used, starting with n and then continuing through the alphabet (e.g. O(n*m)), n usually refers to the size of the first parameter, m the second, and so on. However, in my opinion, it's often clearer to use something like |...| or len(...) around the actual parameter names instead (e.g. O(|l1| * |l2|) or O(len(l1) * len(l2)) if your parameters are called l1 and l2); see the sketch after this list.
In the context of graph problems v is usually used to refer to the number of vertices and e to the number of edges.
In all other cases (and also in some of the above cases if there is any ambiguity), it should be explicitly mentioned what the variables mean.
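For example, here is a hypothetical two-parameter function (the names l1 and l2 are just for illustration) whose running time would be written O(len(l1) * len(l2)) under the convention above, or O(n*m) with n = len(l1) and m = len(l2):
def count_common(l1, l2):
    # The inner check runs len(l1) * len(l2) times, so the running time
    # is O(len(l1) * len(l2)).
    common = 0
    for a in l1:
        for b in l2:
            if a == b:
                common += 1
    return common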
In your original code you had a variable named n, so the statement "This is O(n)" almost certainly referred to the value of the parameter n. If we further assume that we're only counting the number of multiplications or the number of times the loop body executes (or we measure the time and pretend that multiplication takes constant time), that statement is correct.
In your edited code, there is no longer a variable named n. So now the statement "This is O(n)" must refer to something else. Usually one would then assume that it refers to the size of the input (which would be the number of bits in m, i.e. log m). But then the statement is blatantly false (it'd be O(2^n), not O(n)), so the original statement clearly referred to the value of n and you broke it by editing the code.
n usually means the amount of input data.
For example, take an array of 10 elements. To iterate over all of its elements you will need ten iterations; n is 10 in this case.
In your example, n is also a value that describes the size of the input data. As you can see, your factorial implementation requires about n+1 iterations, so the asymptotic complexity for this implementation is around O(n) (note: I omitted the +1 since it doesn't change the picture much). If you increase the value passed to your function, it will require more iterations to calculate the result.
O(1) describes an algorithm that will always execute in the same time (or space) regardless of the size of the input data set.
O(N) describes an algorithm whose performance will grow linearly and in direct proportion to the size of the input data set.
O(N^2) represents an algorithm whose performance is directly proportional to the square of the size of the input data set. This is common with algorithms that involve nested iterations over the data set.
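As a small, hypothetical illustration of the O(N^2) case (not code from the question):
def has_duplicate(data):
    # Compares every element against every other element, so for N items
    # the inner comparison runs about N*N times: O(N^2).
    for i in range(len(data)):
        for j in range(len(data)):
            if i != j and data[i] == data[j]:
                return True
    return False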
I hope this helps.

Determine whether there is a subset of size n which has a standard deviation <= s

Given a bunch of numbers, I am trying to determine whether there is a "clump" anywhere where numbers are very densely packed.
To make things more precise, I thought I'd ask a more specific problem: given a set of numbers, I would like to determine whether there is a subset of size n which has a standard deviation <= s. If there are many such subsets, I'd like to find the subset with the lowest standard deviation.
So question #1 : does this formal problem definition effectively capture the intuitive concept of a "clump" of densely packed numbers?
EDIT: I don't actually care about determining which numbers belong to this "clump", I'm much more interested in determining where the clump is centred, which is why I think that specifying n in advance is okay. But feel free to correct me!
And question #2: assuming it does, what is the best way to go about implementing something like this (in particular, I want a solution with the lowest time complexity)? So far I think I have a solution that runs in O(n log n):
First, note that the subset of a given size with the lowest standard deviation must consist of numbers that are consecutive in sorted order. So step 1 is to sort the numbers (this is O(n log n)).
Second, take the first n numbers and compute their standard deviation. If our array of numbers is 0-based, then the first n numbers are [0, n-1]. To get standard deviation, compute s1 and s2 as follows:
s1 = sum of numbers
s2 = sum of squares of numbers
Then, Wikipedia says that the standard deviation is sqrt(n*s2 - s1^2)/n. Record this value as the lowest standard deviation seen so far.
Find the standard deviation of [1, n], [2, n+1], [3, n+2] ... until you hit the last n numbers. Each computation takes only constant time if you keep running totals of s1 and s2: for example, to get the std dev of [1, n], just subtract the 0th element (and its square) from the s1 and s2 totals, add the nth element (and its square), then recalculate the standard deviation. This means that the entire standard-deviation-calculating portion of the algorithm takes linear time.
So the total time complexity is O(n log n).
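Here is a minimal Python sketch of that algorithm (sort, then slide a window of size n while maintaining the running sums s1 and s2); the function name and return value are just for illustration:
import math

def min_std_window(values, n):
    xs = sorted(values)                    # step 1: sort, O(len * log len)
    s1 = sum(xs[:n])                       # running sum of the current window
    s2 = sum(v * v for v in xs[:n])        # running sum of squares
    def std():
        return math.sqrt(max(n * s2 - s1 * s1, 0.0)) / n   # sqrt(n*s2 - s1^2)/n
    best_std, best_start = std(), 0
    for i in range(1, len(xs) - n + 1):    # slide the window; constant work per step
        s1 += xs[i + n - 1] - xs[i - 1]
        s2 += xs[i + n - 1] ** 2 - xs[i - 1] ** 2
        cur = std()
        if cur < best_std:
            best_std, best_start = cur, i
    return best_std, xs[best_start:best_start + n]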
Is my assessment right? Is there a better way to do this? I really need this to run fast on fairly large sets, so the faster the better! Space is less of an issue (I think).
Having been working recently on a similar problem, I find both the definition of the clump and the proposed implementation reasonable.
Another reasonable definition would be to find the minimum of all the ranges of n numbers. Thus, given that the list of numbers x is sorted, one would just find the minimum of x[n]-x[1], x[n+1]-x[2], etc. This would be slightly quicker than finding the standard deviation because it would avoid the multiplications and square roots. Indeed, you can avoid the square roots even when looking for the lowest standard deviation by finding the minimum variance (the square of the standard deviation), rather than the sd itself.
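A quick sketch of that variant in Python (again, the names are just illustrative): after sorting, take the minimum spread over every window of n consecutive values.
def min_range_window(values, n):
    xs = sorted(values)
    # Spread of each window of n consecutive sorted values; the smallest
    # spread marks the tightest "clump". Returns (spread, start index).
    return min((xs[i + n - 1] - xs[i], i) for i in range(len(xs) - n + 1))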
A caution would be that the location of the biggest clump might be quite sensitive to the choice of n. If there is an a priori reason to select a particular n, that won't be a problem. If not, however, it might require some experimentation to select the value of n that fairly reliably finds the clumps you are looking for, whether you are selecting by range or by standard deviation. Some ideas on this can be found in Chapter 6 of the online book ABC of EDA.

Spike removal algorithm

I have an array of values ranging from 30 to 300. I want to somehow compute a weighted average where, if I have 5 values and one is a lot bigger than the rest (a spike), it won't influence the average as much as it would if I simply took the arithmetic average, e.g. (n1+n2+n3+n4+n5)/5.
Does anyone have an idea of how to make a simple algorithm that does just that, or where to look?
Sounds like you're looking to discard data that falls outside some parameter range you've specified. You could do it by computing the median/mode and ignoring values outside of this range when computing your mean. You'll have to adjust the divisor accordingly, of course, to account for the number of discarded values. What this "tolerable" range should be is ultimately up to you to decide, and will likely depend on your specific application needs.
Alternatively, you could try something like eliminating items that fall more than a fraction r outside the range of your overall average. Something like this (in JavaScript):
function Average(arr) {
    var sum = 0;
    for (var i = 0; i < arr.length; i++) sum += arr[i];
    return sum / arr.length;
}
function RangedAverage(arr, r) {
    var x = Average(arr);
    // now eliminate items more than a fraction r out of range of the initial
    // average; iterate backwards so splice() doesn't skip elements
    for (var i = arr.length - 1; i >= 0; i--)
        if (arr[i] < x * (1 - r) || arr[i] > x * (1 + r))
            arr.splice(i, 1);
    x = Average(arr); // compute new average over what remains
    return x;
}
You could try a median filter rather than a mean filter. It's often used in image processing to mitigate spurious pixel values (as opposed to white noise).
As you have noticed, the mean is susceptible to skewing by spikes. Perhaps the median or mode may be a better statistic, as they tend to be less affected by outliers?
This should be a comment but JS seems to be broken for me atm: it's not quite clear whether you are after a single number that is characteristic of your array (i.e. an average) or a new array with the spikes removed (median filter).
In response to that, I'd suggest you first look at whether the median or mode is a more appropriate statistic. If not, then apply a median filter (very good at removing spikes), then average.
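A minimal sketch of that suggestion in Python (window size of 3 chosen arbitrarily): apply a sliding median to knock out isolated spikes, then take a plain mean.
from statistics import mean, median

def median_filter(values, window=3):
    # Replace each value with the median of its neighbourhood; spikes
    # narrower than the window are removed.
    half = window // 2
    return [median(values[max(0, i - half):i + half + 1])
            for i in range(len(values))]

data = [31, 33, 290, 35, 30]               # one obvious spike
print(mean(median_filter(data)))           # the spike no longer dominates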
A Kalman filter is often used in similar applications. I don't know if it qualifies as "simple," but it's robust and well known.
Lots of ways of doing this: You could implement a low-pass digital filter.
Or, if you're just concerned about removing outliers from a statistical summary, you could just remove the highest and lowest N% of your data values from the dataset before averaging.
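A short sketch of that idea in Python (trimming 10% from each end by default; the percentage is just an example):
def trimmed_mean(values, trim=0.10):
    xs = sorted(values)
    k = int(len(xs) * trim)                # items to drop at each end
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)

print(trimmed_mean([31, 33, 290, 35, 30], trim=0.2))   # 33.0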
"Robust statistics" is the search term that will get you into the literature. An advantage of a Kalman filter is that you have a running estimate of the variability of the data, and this allows you eventually to "discard observations that are more than x% likely to be spurious given the whole set of observations so far".

How to use a look up table in MATLAB

I need to perform an exponential operation of two parameters (one set: t, and the other comes from the arrays) on a set of 2D arrays (a 3D Matrix if you want).
f(t,x) = exp(t-x)
And then I need to add up the results over every value in the 3rd dimension. Because it takes too much time to perform the entire operation using bsxfun, I was thinking of using a look-up table.
I can create the table as a matrix LUT (2-dimensional due to the two parameters); then I can retrieve the values using LUT(par1,par2). But accessing the 3rd dimension using a loop is expensive too.
My question is: is there a way to implement such a mechanism (a look-up table) that has predefined values and then just accesses them from the matrix elements (a kind of indexing) without loops? Or, how can I create a look-up table that MATLAB handles automatically to speed up the exponential operation?
EDIT:
I actually used similar methods to create the LUT. Now, my problem actually is how to access it in an efficient way.
Let's say I have a 2-dimensional array M, to whose values I want to apply the function f(t,M(i,j)) for a fixed value t. I can use a loop to go through all the values (i,j) of M. But I want a faster way of doing it, because I have a set of M's, and then I need to apply this procedure to all the other values.
My function is a little more complex than the example I gave:
pr = mean(exp(-bsxfun(@rdivide,bsxfun(@minus,color_vals,double(I)).^2,m)./2),3);
That is my actual function; as you can see, it is more complex than the example I presented, but the idea is the same: it averages, over the third dimension of the set of M's, the exponential of the difference of two arrays.
Hope that helps.
I agree that the question is not very clear, and that showing some code would help. I'll try anyway.
In order to have a LUT make sense at all, the set of values attained by t-x has to be limited, for example to integers.
Assuming that the exponent can be any integer from -1000 to 1000, you could create a LUT like this:
LUT = exp(-1000:1000);
Then you create your indices (assuming t is a 1D array, and x is a 2D array)
indexArray = bsxfun(@minus,reshape(t,[1,1,3]), x) + 1001; %# -1000 turns into 1
Finally, you create your result
output = LUT(indexArray);
%# sum along third dimension (i.e. sum over all `t`)
output = sum(output,3);
I am not sure I understand your question, but I think this is the answer.
x = 0:3
y = 0:2
z = 0:6
[X,Y,Z] = meshgrid(x,y,z)
LUT = (X+Y).^Z