create a histogram with a dynamic number of partitions in sqlite - sql

I have a column x with integers in the range 0 < x <= maxX.
To create a histogram with five partitions of equal size, I am using the following statement in SQLite:
select case
    when x > 0 and x <= 1*((maxX+4)/5) then 1
    when x > 1*((maxX+4)/5) and x <= 2*((maxX+4)/5) then 2
    when x > 2*((maxX+4)/5) and x <= 3*((maxX+4)/5) then 3
    when x > 3*((maxX+4)/5) and x <= 4*((maxX+4)/5) then 4
    else 5 end as category, count(*) as count
from A, B
group by category
Is there a way to make a "dynamic" query for this in the way that I can create a histogram of n partitions without writing n conditions in the case-statement?

You can use arithmetic to divide the values. Here is one method. It essentially takes the ceiling value of maxX / 5 and uses that to define the partitions:
select (case when cast(maxX / params.n as int) = maxX / params.n
             then (x - 1) / (maxX / params.n)
             else (x - 1) / cast(1 + maxX / params.n as int)
        end) as category, count(*)
from (select 5 as n) params cross join
     A
group by category;
The -1 is because your numbers start at one rather than zero.
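Since SQLite's / already performs integer division on integer operands, a simpler arithmetic sketch of the same idea is possible. This is not the exact method above; it assumes maxX is available as a column or constant alongside x in A, and it maps x in 1..maxX directly onto buckets 1..n:

select (x - 1) * 10 / maxX + 1 as category,  -- 10 partitions; integer division keeps category in 1..10
       count(*) as count
from A
group by category;

The buckets come out equal-sized when maxX is divisible by the partition count, and differ by at most one unit of width otherwise.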

Related

WITH RECURSIVE: is it possible to COUNT() the working table?

I am using HSQLDB 2.6.1, and I want to COUNT() the working table of a recursive CTE.
I wrote the following test:
with recursive
nums (n, m) as
(
    select 1, 1 from (values(1))
    union all
    select * from (
        with
        var (k) as
        (
            select count(*) from nums
        )
        select n+1, var.k from nums, var where n+1 <= 10
    )
)
select n, m from nums;
Here is the result set:
N M
1 1
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
It seems like COUNT() does not work on the working table.
Was it not supposed to work?
And is there another way to count the working table?
This cannot be done directly in the current version (2.7.0) of HyperSQL.
However, the column n of your query is incremented in each round, so counting the identical n values in the result table gives the size of the delta (working) table for each round:
with recursive nums (n, m) as (
...
) select count(*) from nums group by n;
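To see the workaround produce non-trivial counts, here is a hedged, untested sketch in which each round doubles the working table (assuming HSQLDB accepts a VALUES list in the FROM clause of the recursive branch, as it does in the initial branch); the per-n counts then come out as 1, 2, 4, 8:

with recursive nums (n) as (
    select 1 from (values(1))
    union all
    -- each existing row joins two constant rows, doubling the delta each round
    select n + 1 from nums, (values(1),(2)) v(x) where n < 4
)
select n, count(*) as delta_size from nums group by n;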

SQL - Generate "missing rows" with a select

This question relates to SQL Server 2012.
Let's say I have 3 rows generated as follows:

Start Position   End Position   Value
10               13             100
14               14             250
15               25             300
Is there a way I can force SQL to write the output:
10 - 100
11 - 100
12 - 100
13 - 100
14 - 250
15 - 300
16 - 300
and so on and so forth.
I've been racking my brains but can't work out an easy way to do it.
Thanks a lot
J
You can do this with a recursive CTE or a numbers table. Assuming the gaps are no more than a few hundred or thousand:
with n as (
    select row_number() over (order by (select null)) - 1 as n
    from master.spt_values
)
select (t.startpos + n.n) as position, value
from t join
     n
     on t.startpos + n.n <= t.endpos;
No database is complete without its table of numbers! Instructions on how to create one are all over the net, here is an example:
Create a numbers table
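For instance, here is a minimal sketch (one way among many) that builds such a table with a recursive CTE; the table and column names simply mirror the t_numbers table described below:

create table t_numbers (n int primary key);

;with nums as (
    select 1 as n
    union all
    select n + 1 from nums where n < 999999
)
insert into t_numbers (n)
select n from nums
option (maxrecursion 0);  -- lift the default 100-level recursion cap for this statement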
I have a numbers table in my database; it's called t_numbers and has a single column "n" with a row for each number starting from 1. It goes up to 999,999 but takes up very little space on disk. Once you have that, you can write something like this:
-- set up a bit of data to use first
declare @Rows table
(
    StartPos int,
    EndPos int,
    Value int
)
insert into @Rows values (10, 13, 100), (14, 14, 250), (15, 25, 300)
If you don't want null-valued rows for the gaps, use an inner join:
select n.n, Value
from t_numbers n
inner join @Rows r on n.n >= r.StartPos and n.n <= r.EndPos
If you do want the gaps, then left join, but limit the rows returned with a where clause:
select n.n, Value
from t_numbers n
left join @Rows r on n.n >= r.StartPos and n.n <= r.EndPos
where n.n <= (select MAX(EndPos) from @Rows)
Thank you, people! That did it.
In the end I created a table of numbers (thank you, guys) and added:
CROSS JOIN MyNumbers
WHERE T.Real_Position between T.triangle_start_position and T.triangle_end_position
This gave me the exact result set I was looking for.

Lots of WHEN conditions in CASE statement (binning)

How can I do binning in SQL Server 2008 if I need about 100 bins? I need to group records depending on whether a binning variable belongs to one of 100 equal intervals.
For example, if there is a continuous variable age, I could write:
CASE
WHEN AGE >= 0 AND AGE < 1 THEN '1'
WHEN AGE >= 1 AND AGE < 2 THEN '2'
...
WHEN AGE >= 99 AND AGE < 100 THEN '100'
END [age_group]
But this process would be time-consuming. Is there some other way to do that?
Try this code:
SELECT CASE
WHEN AGE = 0 THEN 1
ELSE Ceiling([age])
END [age_group]
FROM #T
Here the CEILING function returns the smallest integer greater than or equal to the specified numeric expression, e.g. SELECT CEILING(0.1) returns 1.
But given your required output, FLOOR(age) + 1 is enough:
SELECT Floor([age]) + 1 [age_group]
FROM #T
Here the FLOOR function returns the largest integer less than or equal to the specified numeric expression, e.g. SELECT FLOOR(1.9) returns 1.
Try this based upon your comment about the segments being 1200:
;With Number
AS
(
SELECT *
FROM (Values(1),(2), (3), (4), (5), (6), (7), (8), (9), (10))N(x)
),
Segments
As
(
SELECT (ROW_NUMBER() OVER(ORDER BY Num1.x) -1) * 1200 As StartNum,
ROW_NUMBER() OVER(ORDER BY Num1.x) * 1200 As EndNum
FROM Number Num1
CROSS APPLY Number Num2
)
SELECT *
FROM Segments
INNER JOIN MyTable
ON MyTable.Price >= StartNum AND MyTable.Price < EndNum
Mathematics, I guess. In this case,
Ceiling(Age) AS [age_group]
cast as necessary into character type of your choice. Ceiling is the 'round up to an integer' function in SQL Server.
You can use arithmetic for this purpose. Something like this:
select floor(bins * (age - minage) / (range + 1)), count(*)
from t cross join
     (select min(age) as minage, max(age) as maxage,
             1.0 * (max(age) - min(age)) as range, 100 as bins
      from t
     ) m
group by floor(bins * (age - minage) / (range + 1))
However, this is overkill for your example, which doesn't need a case at all: with minage = 0 and maxage = 99, the subquery gives range = 99.0, so the expression reduces to floor(100 * age / 100.0) = floor(age), i.e. bins 0 through 99.
If your interval for the groups is fixed - for example 1200 - you can just do an integer division to get the index for that grouping.
For example:
SELECT 1000 / 1200 -- equals 0
SELECT 2200 / 1200 -- equals 1
Remember - you need to cast to int to get the result if you're using a decimal datatype. Integer division requires int on both sides of the operator.
And then add 1 to get the group number, as in the sketch below.
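Putting it together, a minimal sketch using the MyTable and Price names from the earlier answer, with a fixed interval of 1200:

SELECT (Price / 1200) + 1 AS price_group,  -- integer division bins Price into 1200-wide groups
       COUNT(*) AS cnt
FROM MyTable
GROUP BY Price / 1200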

Pearson Correlation SQL Server

I have two tables:

Table 1:
ID, YRMO, Counts
1, Dec 2013, 4
1, Jan 2014, 6
1, Feb 2014, 7
2, Jan 2014, 6
2, Feb 2014, 8

Table 2:
ID, YRMO, Counts
1, Dec 2013, 10
1, Jan 2014, 8
1, March 2014, 12
2, Jan 2014, 6
2, Feb 2014, 10
I want to find the Pearson correlation coefficient for each ID across the two tables. There are more than 200 distinct IDs.
Pearson correlation is a measure of the linear correlation (dependence) between two variables X and Y, giving a value between +1 and −1 inclusive.
More can be found here: http://oreilly.com/catalog/transqlcook/chapter/ch08.html (see the section on calculating correlation).
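For reference, the steps below compute the population form of the coefficient, which in terms of aggregates is:

r = ( avg(x*y) - avg(x) * avg(y) ) / ( stddev_pop(x) * stddev_pop(y) )

where stddev_pop(x) = sqrt( avg(x*x) - avg(x) * avg(x) ), matching the mean and standard deviation computed in steps 1 and 2.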
To calculate the Pearson correlation coefficient, you first calculate the mean, then the standard deviation, and then the correlation coefficient, as outlined below.
1. Calculate Mean
insert into tab2 (tab1_id, mean)
select ID, sum([counts]) /
(select count(*) from tab1) as mean
from tab1
group by ID;
2. Calculate standard deviation
update tab2
set stddev = (
select sqrt(
sum([counts] * [counts]) /
(select count(*) from tab1)
- mean * mean
) stddev
from tab1
where tab1.ID = tab2.tab1_id
group by tab1.ID);
3. Finally Pearson Correlation Coefficient
select ID,
((sf.sum1 / (select count(*) from tab1)
- stats1.mean * stats2.mean
)
/ (stats1.stddev * stats2.stddev)) as PCC
from (
select r1.ID,
sum(r1.[counts] * r2.[counts]) as sum1
from tab1 r1
join tab1 r2
on r1.ID = r2.ID
group by r1.ID
) sf
join tab2 stats1
on stats1.tab1_id = sf.ID
join tab2 stats2
on stats2.tab1_id = sf.ID
You can see the results on your posted data in a demo fiddle here: http://sqlfiddle.com/#!3/0da20/5
EDIT:
I have refined this a bit. You can use the function below to get the PCC, but I am not getting exactly the same result as yours; rather, I get 0.999996000000000 for ID = 1.
This could be a good entry point for you; you can refine the calculation further from here.
create function calculate_PCC(@id int)
returns decimal(16,15)
as
begin
    declare @mean numeric(16,5);
    declare @stddev numeric(16,5);
    declare @count numeric(16,5);
    declare @pcc numeric(16,12);
    declare @store numeric(16,7);
    select @count = CONVERT(numeric(16,5), count(case when Id = @id then 1 end)) from tab1;
    select @mean = convert(numeric(16,5), sum([Counts])) / @count
    from tab1 WHERE ID = @id;
    select @store = (sum(counts * counts) / @count) from tab1 WHERE ID = @id;
    set @stddev = sqrt(@store - (@mean * @mean));
    set @pcc = ((@store - (@mean * @mean)) / (@stddev * @stddev));
    return @pcc;
end
Call the function like this:
select db_name.dbo.calculate_PCC(1)
A Single-Pass Solution:
There are two flavors of the Pearson correlation coefficient, one for a Sample and one for an entire Population. These are simple, single-pass, and I believe, correct formulas for both:
-- Methods for calculating the two Pearson correlation coefficients
SELECT
-- For Population
(avg(x * y) - avg(x) * avg(y)) /
(sqrt(avg(x * x) - avg(x) * avg(x)) * sqrt(avg(y * y) - avg(y) * avg(y)))
AS correlation_coefficient_population,
-- For Sample
(count(*) * sum(x * y) - sum(x) * sum(y)) /
(sqrt(count(*) * sum(x * x) - sum(x) * sum(x)) * sqrt(count(*) * sum(y * y) - sum(y) * sum(y)))
AS correlation_coefficient_sample
FROM (
-- The following generates a table of sample data containing two columns with a luke-warm and tweakable correlation
-- y = x for 0 thru 99, y = x - 100 for 100 thru 199, etc. Execute it as a stand-alone to see for yourself
-- x and y are CAST as DECIMAL to avoid integer math, you should definitely do the same
-- Try TOP 100 or less for full correlation (y = x for all cases), TOP 200 for a PCC of 0.5, TOP 300 for one near 0.33, etc.
-- The superfluous "+ 0" is where you could apply various offsets to see that they have no effect on the results
SELECT TOP 200
CAST(ROW_NUMBER() OVER (ORDER BY [object_id]) - 1 + 0 AS DECIMAL) AS x,
CAST((ROW_NUMBER() OVER (ORDER BY [object_id]) - 1) % 100 AS DECIMAL) AS y
FROM sys.all_objects
) AS a
As I noted in the comments, you can try the example with TOP 100 or less for full correlation (y = x for all cases); TOP 200 yields correlations very near 0.5; TOP 300, around 0.33; etc. There is a place ("+ 0") to add an offset if you like; spoiler alert, it has no effect. Make sure you CAST your values as DECIMAL - integer math can significantly impact these calcs.
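Applied to the question's setup, here is a hedged sketch (the table names tab1 and tab2 are assumed, as in the earlier answer, and rows are paired on ID and YRMO) that computes the population coefficient per ID:

-- population Pearson coefficient per ID; Counts are CAST to DECIMAL to avoid integer math
SELECT t1.ID,
       (AVG(x * y) - AVG(x) * AVG(y)) /
       (SQRT(AVG(x * x) - AVG(x) * AVG(x)) * SQRT(AVG(y * y) - AVG(y) * AVG(y))) AS pcc
FROM (SELECT ID, YRMO, CAST(Counts AS DECIMAL(16, 5)) AS x FROM tab1) t1
JOIN (SELECT ID, YRMO, CAST(Counts AS DECIMAL(16, 5)) AS y FROM tab2) t2
    ON t1.ID = t2.ID AND t1.YRMO = t2.YRMO
GROUP BY t1.ID;

Note that only months present in both tables for an ID contribute to that ID's coefficient.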

Need SQL Server Query to solve 3rd Order Polynomial Regression

Can anyone help with some SQL query code to provide estimates of the coefficients for a 3rd-order polynomial regression?
Please assume that I have a table of X and Y data values and want to estimate a, b and c in:
Y(X) = aX + bX^2 + cX^3 + E
An APPROXIMATE but fast solution would be to sample 4 representative points from the data and solve the polynomial equation for these points.
As for the sampling, you can split the data into equal sectors and compute the average of X and Y for each sector - the split can be done using quartiles of X-values, averages of X-values, min(x)+(max(x)-min(x))/4, or whatever you think is most appropriate. The SQL below illustrates sampling by quartiles (i.e. by row numbers).
As for the solving, I used numberempire.com to solve these* equations for the variables k, a, b, c:
k + a*X1 + b*X1^2 + c*X1^3 - Y1 = 0,
k + a*X2 + b*X2^2 + c*X2^3 - Y2 = 0,
k + a*X3 + b*X3^2 + c*X3^3 - Y3 = 0,
k + a*X4 + b*X4^2 + c*X4^3 - Y4 = 0
*Since Y(X) = 0 + aX + bX^2 + cX^3 + ϵ implicitly includes the point [0, 0] as one of the sample points, it would create bad approximations for data sets that don't include [0, 0]. I took the liberty of solving Y(X) = k + aX + bX^2 + cX^3 + ϵ instead.
The actual SQL would go like this:
select
-- returns 1 row with columns labeled K, A, B and C = coefficients in 3rd order polynomial equation for the 4 sample points
-(X1*(X2p2*(X3p3*Y4-X4p3*Y3)+X2p3*(X4p2*Y3-X3p2*Y4)+(X3p2*X4p3-X3p3*X4p2)*Y2)+X1p2*(X2*(X4p3*Y3-X3p3*Y4)+X2p3*(X3*Y4-X4*Y3)+(X3p3*X4-X3*X4p3)*Y2)+X1p3*(X2*(X3p2*Y4-X4p2*Y3)+X2p2*(X4*Y3-X3*Y4)+(X3*X4p2-X3p2*X4)*Y2)+(X2*(X3p3*X4p2-X3p2*X4p3)+X2p2*(X3*X4p3-X3p3*X4)+X2p3*(X3p2*X4-X3*X4p2))*Y1)/(X1*(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))+X2*(X3p2*X4p3-X3p3*X4p2)+X1p2*(X3*X4p3+X2*(X3p3-X4p3)+X2p3*(X4-X3)-X3p3*X4)+X2p2*(X3p3*X4-X3*X4p3)+X1p3*(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))+X2p3*(X3*X4p2-X3p2*X4)) as k,
(X1p2*(X2p3*(Y4-Y3)-X3p3*Y4+X4p3*Y3+(X3p3-X4p3)*Y2)+X2p2*(X3p3*Y4-X4p3*Y3)+X1p3*(X3p2*Y4+X2p2*(Y3-Y4)-X4p2*Y3+(X4p2-X3p2)*Y2)+X2p3*(X4p2*Y3-X3p2*Y4)+(X3p2*X4p3-X3p3*X4p2)*Y2+(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))*Y1)/(X1*(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))+X2*(X3p2*X4p3-X3p3*X4p2)+X1p2*(X3*X4p3+X2*(X3p3-X4p3)+X2p3*(X4-X3)-X3p3*X4)+X2p2*(X3p3*X4-X3*X4p3)+X1p3*(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))+X2p3*(X3*X4p2-X3p2*X4)) as a,
-(X1*(X2p3*(Y4-Y3)-X3p3*Y4+X4p3*Y3+(X3p3-X4p3)*Y2)+X2*(X3p3*Y4-X4p3*Y3)+X1p3*(X3*Y4+X2*(Y3-Y4)-X4*Y3+(X4-X3)*Y2)+X2p3*(X4*Y3-X3*Y4)+(X3*X4p3-X3p3*X4)*Y2+(X2*(X4p3-X3p3)-X3*X4p3+X3p3*X4+X2p3*(X3-X4))*Y1)/(X1*(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))+X2*(X3p2*X4p3-X3p3*X4p2)+X1p2*(X3*X4p3+X2*(X3p3-X4p3)+X2p3*(X4-X3)-X3p3*X4)+X2p2*(X3p3*X4-X3*X4p3)+X1p3*(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))+X2p3*(X3*X4p2-X3p2*X4)) as b,
(X1*(X2p2*(Y4-Y3)-X3p2*Y4+X4p2*Y3+(X3p2-X4p2)*Y2)+X2*(X3p2*Y4-X4p2*Y3)+X1p2*(X3*Y4+X2*(Y3-Y4)-X4*Y3+(X4-X3)*Y2)+X2p2*(X4*Y3-X3*Y4)+(X3*X4p2-X3p2*X4)*Y2+(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))*Y1)/(X1*(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))+X2*(X3p2*X4p3-X3p3*X4p2)+X1p2*(X3*X4p3+X2*(X3p3-X4p3)+X2p3*(X4-X3)-X3p3*X4)+X2p2*(X3p3*X4-X3*X4p3)+X1p3*(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))+X2p3*(X3*X4p2-X3p2*X4)) as c
from (select
          samples.*,
          -- precomputing the powers should give better performance (at least i hope it would)
          power(X1,2) X1p2, power(X2,2) X2p2, power(X3,2) X3p2, power(X4,2) X4p2,
          power(Y1,3) Y1p3, power(Y2,3) Y2p3, power(Y3,3) Y3p3, power(Y4,3) Y4p3
      from (select
                avg(case when sector = 1 then x end) X1,
                avg(case when sector = 2 then x end) X2,
                avg(case when sector = 3 then x end) X3,
                avg(case when sector = 4 then x end) X4,
                avg(case when sector = 1 then y end) Y1,
                avg(case when sector = 2 then y end) Y2,
                avg(case when sector = 3 then y end) Y3,
                avg(case when sector = 4 then y end) Y4
            from (select x, y,
                         -- splitting to sectors 1 - 4 by row number (SQL Server version)
                         ceiling(row_number() OVER (ORDER BY x asc) / count(*) * 4) sector
                  from original_data
                 ) sectored -- alias added; SQL Server requires derived tables to be named
           ) samples
     ) t
According to developer.mimer.com, the query relies on these optional SQL standard features:
T611, "Elementary OLAP operations"
F591, "Derived tables"
SQL Server has a built-in ranking function NTILE(n) which will more easily create your sectors. I replaced:
ceiling(row_number() OVER (ORDER BY x asc) / count(*) * 4) sector
with:
NTILE(4) OVER(ORDER BY x ASC) [sector]
I also needed to add several "precomputed powers" to allow for the full column range as selected. The full list appears below:
POWER(samples.X1, 2) AS [X1p2],
POWER(samples.X1, 3) AS [X1p3],
POWER(samples.X2, 2) AS [X2p2],
POWER(samples.X2, 3) AS [X2p3],
POWER(samples.X3, 2) AS [X3p2],
POWER(samples.X3, 3) AS [X3p3],
POWER(samples.X4, 2) AS [X4p2],
POWER(samples.X4, 3) AS [X4p3],
POWER(samples.Y1, 3) AS [Y1p3],
POWER(samples.Y2, 3) AS [Y2p3],
POWER(samples.Y3, 3) AS [Y3p3],
POWER(samples.Y4, 3) AS [Y4p3]
Overall, great answer by @Aprillion! Well explained, and the numberempire.com h/t was very helpful.