SQL percentile_cont vs SPSS frequencies percentiles - sql

I am trying to figure out how percentiles are calculated by SQL's percentile_cont function and by SPSS in FREQUENCIES. I want to compare them and understand why they get different results.
I have tried looking this up myself, but finding a source for this information is difficult. If you have an explanation for why they differ, can you please share where I can read about that myself?

The percentile formula used in FREQUENCIES in SPSS Statistics is a weighted average method aimed at p(N+1), where p is the percentile expressed as a proportion (0-1 range) and N is the number of cases or records. Ignoring complications associated with weighted data, particularly non-integer weights, you order the data values in ascending order and if p(N+1) is an integer, you take the value of the p(N+1)th ordered case. If p(N+1) is between the integers associated with the ordinal positions of two numbers, you linearly interpolate between the values according to the fractional value of p(N+1).
This general formula is a commonly-used one, denoted method 4 in SAS and method 6 in a well-known November 1996 article in The American Statistician by Hyndman & Fan (Vol. 50, No. 4, pp. 361-365) that is the basis for the nine definitions used in the quantile function in R. There is one special point about the method in FREQUENCIES, which is that while other implementations of this method will set any percentile where p(N+1)>N to the value of the Nth case, in SPSS Statistics the value is given as missing instead.
The method used in SQL's percentile_cont appears to be method 7 in the list of nine from Hyndman & Fan, which aims at 1+(N-1)p. The EXAMINE procedure in SPSS Statistics offers the method used in FREQUENCIES (as the HAVERAGE method) and four additional methods. None of these match the method in SQL's percentile_cont.
Formulas for the statistics in FREQUENCIES, EXAMINE, and other procedures in SPSS Statistics are available in the IBM SPSS Statistics Algorithms manual, a pdf of which is freely downloadable.

Related

Does pandas categorical data speed up indexing?

Somebody told me it is a good idea to convert identifying columns (e.g. person numbers) from strings to categorical. This would speed up some operations like searching, filtering and grouping.
I understand that a 40 chars strings costs much more RAM and time to compare instead of a simple integer.
But I would have some overhead because of a str-to-int-table for translating between two types and to know which integer number belongs to which string "number".
Maybe .astype('categorical') can help me here? Isn't this an integer internally? Does this speed up some operations?
The user guide has the following about categorical data use cases:
The categorical data type is useful in the following cases:
A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.
As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
See also the API docs on categoricals.
The book, Python for Data Analysis by Wes McKinney, has the following on this topic:
The categorical representation can yield significant performance
improvements when you are doing analytics. You can also perform
transformations on the categories while leaving the codes unmodified.
Some example transformations that can be made at relatively low cost are:
Renaming categories
Appending a new category without changing the order or position of the existing categories
GroupBy operations can be significantly faster with categoricals because the underlying algorithms use the integer-based codes array instead of an array of strings.
Series containing categorical data have several special methods similar to the Series.str specialized string methods. This also provides convenient access to the categories and codes.
In large datasets, categoricals are often used as a convenient tool for memory savings and better performance.

Calculate Bell Curve Values

In an MS Access 2007 app, which manages contracts and changes for large construction projects, I need to create a Bell Curve representing a Contract Value, over a time period.
For example, a $500m contract runs for, say, 40 months, and I need a Bell Curve that distributes the Contract Value over these 40 months. The idea is to present a starting point for cashflow projections during the life of the contract.
Using VBA, I had thought to create the 'monthly' values and store them in a temp table, for later use in a report chart. However, I'm stuck trying to work out an algorithm.
Any suggestions on how I might tackle this would be most appreciated.
You will need the =NORMSDIST() function borrowed from Excel as follows:
Public Function Normsdist(X As Double) As Double
Normsdist = Excel.WorksheetFunction.Normsdist(X)
End Function
Requires some knowledge of statistics to use this function to distribute a cash flow over x periods, assuming a standard-normal. I created an Excel sheet to demonstrate how this function is used and posted it here:
Normal Distribution of a cash flow sample .XLSX
If for some reason you hated the idea of using an Excel function, you can pull any statistics text or search for the formula that generates a series of normal values. In your case you want to distribute the cash flow over three standard deviations in each tail. So that's a total of six (6) standard deviations. To divide 40 months over 6 standard deviations it's 6/40 = 0.15 standard deviations each data point (month). Use a for/next/step or similar loop to generate that to a temporary table, as you suggested, and graph it with a column chart (as seen in the above Excel example). Will take just a little VBA coding to make this variable as to user-supplied number of months and total contract.
A standard-normal distribution has a mean of 0 and standard deviation of 1. If you want a flatter bell curve, you can use the NormDist function instead where you can specify the mean and st. dev.

Determine whether there is a subset of size n which has a standard deviation <= s

Given a bunch of numbers, I am trying to determine whether there is a "clump" anywhere where numbers are very densely packed.
To make things more precise, I thought I'd ask a more specific problem: given a set of numbers, I would like to determine whether there is a subset of size n which has a standard deviation <= s. If there are many such subsets, I'd like to find the subset with the lowest standard deviation.
So question #1 : does this formal problem definition effectively capture the intuitive concept of a "clump" of densely packed numbers?
EDIT: I don't actually care about determining which numbers belong to this "clump", I'm much more interested in determining where the clump is centred, which is why I think that specifying n in advance is okay. But feel free to correct me!
And question #2 : assuming it does, what is the best way to go about implementing something like this (in particular, I want a solution with lowest time complexity)? So far I think I have a solution that runs in n log n:
First, note that the lowest-standard-deviation-possessing subset of a given size must consist of consecutive numbers. So step 1 is sort the numbers (this is n log n)
Second, take the first n numbers and compute their standard deviation. If our array of numbers is 0-based, then the first n numbers are [0, n-1]. To get standard deviation, compute s1 and s2 as follows:
s1 = sum of numbers
s2 = sum of squares of numbers
Then, wikipedia says that the standard deviation is sqrt(n*s2 - s1^2)/n. Record this value as the highest standard deviation seen so far.
Find the standard deviation of [1, n], [2, n+1], [3, n+2] ... until you hit the the last n numbers. To do each computation takes only constant time if you keep track of s1 and s2 running totals: for example, to get std dev of [1, n], just subtract the 0th element from the s1 and s2 totals and add the nth element, then recalculate standard deviation. This means that the entire standard deviation calculating portion of the algorithm takes linear time.
So total time complexity n log n.
Is my assessment right? Is there a better way to do this? I really need this to run fast on fairly large sets, so the faster the better! Space is less of an issue (I think).
Having been working recently on a similar problem, both the definition of the clumps and the proposed implementation seem reasonable.
Another reasonable definition would be to find the minimum of all the ranges of n numbers. Thus, given that the list of numbers x is sorted, one would just find the minimum of x[n]-x[1], x[n+1]-x[2], etc. This would be slightly quicker than finding the standard deviation because it would avoid the multiplications and square roots. Indeed, you can avoid the square roots even when looking for the lowest standard deviation by finding the minimum variance (the square of the standard deviation), rather than the sd itself.
A caution would be that the location of the biggest clump might be quite sensitive to the choice of n. If there is an a priori reason to select a particular n, that won't be a problem. If not, however, it might require some experimentation to select the value of n that fairly reliably finds the clumps you are looking for, whether you are selecting by range or by standard deviation. Some ideas on this can be found in Chapter 6 of the online book ABC of EDA.

Spike removal algorithm

I have an array of values ranging from 30 to 300. I want to somehow make an weighted average, where, if I have 5 values and one is a lot bigger than the rest(spike), it won't influence the average that much as it would if I simply make a arithmetic average: eg: (n1+n2+n3+n4+n5)/5.
Does anyone has an idea how to make an simple algorithm that does just that, or where to look?
Sounds like you're looking to discard data that falls outside some parameter range you've specified. You could do it by computing the median/mode and ignoring values outside of this range when computing your mean. You'll have to adjust the divisor accordingly, of course, to account for the number of discarded values. What this "tolerable" range should be is ultimately up to you to decide, and will likely depend on your specific application needs.
Alternatively, you could try something like eliminating items r% out of range of your total average. Something like this (in javascript):
function RangedAverage(arr, r)
{
x = Average(arr);
//now eliminate items r% out of range
for(var i=0; i<arr.length; i++)
if(arr[i] < (x/r) || arr[i]>(x*(1+r)))
arr.splice(i,1);
x = Average(arr); //compute new average
return x;
}
You could try a median filter rather than a mean filter. It's often used in image processing to mitigate spurious pixel values (as opposed to white noise).
As you have noticed the mean is susceptible to skewing by spikes. perhaps median or mode may be a better statistic as they tend to be less skewed?
this should be a comment but js seems to be broken for me atm: its not quite clear whether you are after a single number that is characteristic of your array (i.e. an average) or a new array with the spikes removed (median filter)
in response to that then i'd suggest you first look at if median or mode is more appropriate as a statistic. if not then apply a median filter (very good at removing spikes) then average
A Kalman filter is often used in similar applications. I don't know if it qualifies as "simple," but it's robust and well known.
Lots of ways of doing this: You could implement a low-pass digital filter.
Or, if you're just concerned about removing outliers from a statistical summary, you could just remove the highest and lowest N% of your data values from the dataset before averaging.
"Robust statistics" is the search term that will get you into the literature. An advantage of a Kalman filter is that you have a running estimate of the variability of the data, and this allows you eventually to "discard observations that are more than x% likely to be spurious given the whole set of observations so far".

How would I calculate EXPECTED income if I have PAST income data in mySQL?

Ok, I'm just curious what the formula would be for calculating an expected income over the next X weeks/months/etc, if the only data I have in mySQL DB is all past transactions (dates of transactions, amounts, etc)
I am thinking taking some averages and whatnot, but I can't think of a specific formula (there must be something along those lines) to take say average rise of income over time (weekly/monthly) and then apply it to a select future period and display it weekly/monthly/etc?
Any suggestions?
use AVG() on the income in the past devide it to proper weekly/monthly amounts if neccessary.
see http://dev.mysql.com/doc/refman/5.1/en/group-by-functions.html#function_avg for more info on AVG()
Linear regression + simple integration is probably sufficient for your needs. I leave sorting out exact implementation for your DB up to you, but that follow that link to the "Estimation Methods" section, and probably use Ordinary Least Squares.
Alternatively, you can always slurp your data into something like R where the details are already implemented.
EDIT:
For more detail: you're trying to model INCOME = BASE + SCALING*T where we are assuming that a linear model is "good" (it's probably not great, but it's probably good enough on a short time scale). For two value linear regression, you're pretty much just taking averages; follow that link to "Fitting the Regression Line" and you'll see which things you need to average (y = INCOME and x = T). There are some tricks you can play to simplify the calculation for the computer if you can enforce some other conditions (e.g., having equally spaced time periods + no missing data), but you'll need to math a bit more yourself first if you want to do that (and you'll be less flexible in the face of changing db assumptions).