Calculate 'R Square' and 'P-Value' for multiple linear regression in T-SQL

SQL Server has only a few built-in functions for sophisticated statistical analysis, but I need to calculate a multiple linear regression in T-SQL.
Based on this post (Multiple Linear Regression function in SQL Server), I was able to get the coefficients for the intercept (Y), X1 and X2.
What I need are the p-values for X1 and X2, and also the R Square.
Test data:
DECLARE @TestData TABLE (i INT IDENTITY(1, 1), X1 FLOAT, X2 FLOAT, y FLOAT)
INSERT @TestData
SELECT 0, 17, 210872.3034 UNION ALL
SELECT 0, 23, 191988.2299 UNION ALL
SELECT 0, 18, 204564.9455 UNION ALL
SELECT 0, 4, 189528.9212 UNION ALL
SELECT 0, 0, 200203.6364 UNION ALL
SELECT 11, 0, 218814.1701 UNION ALL
SELECT 5, 0, 220109.2129 UNION ALL
SELECT 2, 0, 214377.8534 UNION ALL
SELECT 1, 0, 204926.9208 UNION ALL
SELECT 0, 0, 202499.4065 UNION ALL
SELECT 0, 3, 196917.8182 UNION ALL
SELECT 0, 9, 202286.0012
Desired output:
R Square 0.4991599183412360
p-value X1 0.0264247876580807
p-value X2 0.7817597643898020
I have already been able to derive the following from the test data above.
b Coefficients
----------------------------------
Intercept (Y) 202119.231151577
X1 C(H) 1992.8421941724
X2 C(C) -83.8561622730127
I know T-SQL is not a good platform for this, but I need it to be done purely in T-SQL.
I am aware of the XLeratorDB function packages for SQL Server.

You could calculate R-squared by hand and create a variable R2 equal to
R2 = (N*Σxy - Σx*Σy)^2 / ((N*Σx^2 - (Σx)^2) * (N*Σy^2 - (Σy)^2))
where Σx and Σy are the sums of your values and N is the number of observations.
The formula for R-squared is simple enough that you don't necessarily need any function or statistical software. Check out this link for calculating it by hand: http://sciencefair.math.iit.edu/analysis/linereg/hand/
You can apply the same logic in T-SQL.
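Note that the sums formula above gives the R-squared of a simple one-predictor regression. For the two-predictor model in the question, one option is to compute R² = 1 − SSE/SST once the coefficients are known. A minimal hedged sketch, assuming it runs in the same batch as the @TestData declaration above and reusing the coefficients from the question:
-- Sketch only: multiple-regression R² as 1 - SSE/SST.
-- Assumes @TestData from the question is still in scope.
DECLARE @b0 FLOAT = 202119.231151577;   -- Intercept (Y)
DECLARE @b1 FLOAT = 1992.8421941724;    -- X1 coefficient
DECLARE @b2 FLOAT = -83.8561622730127;  -- X2 coefficient
DECLARE @ybar FLOAT = (SELECT AVG(y) FROM @TestData);

SELECT 1 - SUM(SQUARE(y - (@b0 + @b1 * X1 + @b2 * X2)))  -- SSE
         / SUM(SQUARE(y - @ybar))                        -- SST
       AS R2
FROM @TestData;
With the coefficients above, this should land near the desired R Square of roughly 0.4992.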

Total distance for points in same table sql

I have the following three tables in Postgres:
ShipmentTrip
id,shipment_id, type,status,lat,long
1, 1, pickup, whatever, 25, 75
2, 1, dropoff, whatever, 27, 76
3, 2, pickup, whatever, 25, 75
4, 2, dropoff, whatever, 27, 76
Shipment
id,...,driver_id
Driver
id
I am trying to calculate the total distance a driver has covered.
I have tried cross joins, subqueries and many other approaches, but still no result.
PostgreSQL's PostGIS extension can calculate distances from latitude and longitude values.
You should use PostGIS functions for that purpose; here is an example with your shipment_id 1:
SELECT ST_Distance(
ST_Transform('SRID=4326;POINT(75 25)'::geometry, 3857),
ST_Transform('SRID=4326;POINT(76 27)'::geometry, 3857));
Based on your sample, the query should look like:
select driver_id, shipment_id, points[1] p1, points[2] p2,
       ST_Distance(
         ST_Transform(('SRID=4326;POINT('||points[1]||')')::geometry, 3857),
         ST_Transform(('SRID=4326;POINT('||points[2]||')')::geometry, 3857)) distance
from (select s.driver_id, st.shipment_id, array_agg(st.long||' '||st.lat::text) points
      from "ShipmentTrip" st
      join "Shipment" s on st.shipment_id = s.id
      join "Driver" d on d.id = s.driver_id
      group by st.shipment_id, s.driver_id) trip;
Coordinates in the lat/long reference system SRID 4326 (units in degrees) are transformed to a metric system (SRID 3857) in order to obtain a distance in meters.
If you don't have the PostGIS extension installed in your database, create it first:
create extension postgis;
Note, however, that this is the straight-line distance between two points, not the distance along roads.
Docs for distance: https://postgis.net/docs/ST_Distance.html and for points: https://postgis.net/docs/ST_MakePoint.html
Fiddle https://dbfiddle.uk/7XFHZWLA
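To answer the per-driver part of the question, a hedged extension of the query above (not part of the original answer) is to wrap it in an outer aggregate and sum the per-shipment distances:
-- Sketch only: total straight-line distance per driver, in meters.
select driver_id, sum(distance) as total_distance
from (select driver_id, shipment_id,
             ST_Distance(
               ST_Transform(('SRID=4326;POINT('||points[1]||')')::geometry, 3857),
               ST_Transform(('SRID=4326;POINT('||points[2]||')')::geometry, 3857)) distance
      from (select s.driver_id, st.shipment_id,
                   array_agg(st.long||' '||st.lat::text) points
            from "ShipmentTrip" st
            join "Shipment" s on st.shipment_id = s.id
            group by st.shipment_id, s.driver_id) trip) per_shipment
group by driver_id;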

Aggregating one bigquery table to another bigquery table

I am trying to aggregate a multi-petabyte (around 7 PB) BigQuery table into another BigQuery table.
I have (partition_key, clusterkey1, clusterkey2, col1, col2, val),
where partition_key is used for BigQuery partitioning and the cluster keys are used for clustering.
For example
(timestamp1, timestamp2, 0, 1, 2, 1)
(timestamp3, timestamp4, 0, 1, 2, 7)
(timestamp31, timestamp22, 2, 1, 2, 2)
(timestamp11, timestamp12, 2, 1, 2, 3)
should result in
(0, 1, 2, 8)
(2, 1, 2, 5)
I want to aggregate val based on (clusterkey2, col1, col2), across all values of partition_key and clusterkey1.
What is a feasible way to do this?
Should I write a custom loader and read all the data line by line, or is there a native way to do this?
Depending on where and how you are executing this, you can do it by writing a simple SQL script and defining the target output, for example:
SELECT clusterkey2
, col1
, col2
, sum(val) as val
from table
group by clusterkey2, col1, col2
This will get you the desired results.
From here you can do a few things, but they are mostly all outlined here in the documentation:
https://cloud.google.com/bigquery/docs/writing-results#writing_query_results
Specifically, from the above you are looking to set the destination table for the query results.
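For example, a hedged sketch of setting the destination directly in SQL with CREATE TABLE ... AS SELECT (the project, dataset, and table names below are hypothetical placeholders):
-- Sketch only: write the aggregate straight into a new table.
CREATE TABLE `my_project.my_dataset.aggregated` AS
SELECT clusterkey2, col1, col2, SUM(val) AS val
FROM `my_project.my_dataset.source_table`
GROUP BY clusterkey2, col1, col2;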
One thing to note: you may want to include the partition key in the WHERE clause to narrow down your data if you do not want aggregate results for the whole table.

Doing a concat over a partition in SQL?

I have some data ordered like so:
date, uid, grouping
2018-01-01, 1, a
2018-01-02, 1, a
2018-01-03, 1, b
2018-02-01, 2, x
2018-02-05, 2, x
2018-02-01, 3, z
2018-03-01, 3, y
2018-03-02, 3, z
And I wanted a final form like:
uid, path
1, "a-a-b"
2, "x-x"
3, "z-y-z"
but running something like
select
a.uid
,concat(grouping) over (partition by date, uid) as path
from temp1 a
This doesn't seem to play well with standard SQL or Google BigQuery (the specific environment I'm working in). Is there an easy way to get the groupings concatenated that I'm missing? I imagine there's a way to brute-force it by including a bunch of if-then statements as custom columns and then concatenating the result, but I'm sure that would be a lot messier. Any thoughts?
You are looking for string_agg():
select a.uid, string_agg(grouping, '-' order by date) as path
from temp1 a
group by a.uid;
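As a hedged, self-contained check in BigQuery with the sample data from the question (GROUPING is a reserved word in GoogleSQL, so the column name is quoted with backticks here):
-- Sketch only: inline the sample rows, then aggregate per uid.
WITH temp1 AS (
  SELECT DATE '2018-01-01' AS date, 1 AS uid, 'a' AS `grouping` UNION ALL
  SELECT DATE '2018-01-02', 1, 'a' UNION ALL
  SELECT DATE '2018-01-03', 1, 'b' UNION ALL
  SELECT DATE '2018-02-01', 2, 'x' UNION ALL
  SELECT DATE '2018-02-05', 2, 'x' UNION ALL
  SELECT DATE '2018-02-01', 3, 'z' UNION ALL
  SELECT DATE '2018-03-01', 3, 'y' UNION ALL
  SELECT DATE '2018-03-02', 3, 'z'
)
SELECT uid, STRING_AGG(`grouping`, '-' ORDER BY date) AS path
FROM temp1
GROUP BY uid;
-- Expected: 1 -> "a-a-b", 2 -> "x-x", 3 -> "z-y-z"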

How can I traverse a tree bottom-up to calculate a (weighted) average of node values in PostgreSQL?

The typical example for e.g. summing a whole tree in PostgreSQL is using WITH RECURSIVE (Common Table Expressions). However, these examples typically go from top to bottom, flatten the tree and perform an aggregate function on the whole result set. I have not found a suitable example (on StackOverflow, Google, etc.) for the problem I am trying to solve:
Consider an unbalanced tree where each node can have an associated value. Most of the values are attached to leaf nodes, but other nodes may have values as well. If a node (leaf or not) has an explicitly attached value, this value is used directly without further calculation (its subtree can then be ignored). If the node has no value, its value should be computed as the average of its direct children.
However, as none of the nodes are guaranteed to have a value attached, I need to go bottom-up in order to obtain the total average. In a nutshell, starting from the leaves, I need to apply AVG() to each set of siblings and use this intermediate result as the value for the parent node (if it has none). The parent's new value (explicitly attached, or the average of its children) is in turn used when calculating the averages at the next level up (the average of the parent and its siblings).
Example situation:
A
+- B (6)
+- C
|  +- D
|     +- E (10)
|     +- F (2)
+- H (18)
   +- I (102)
   +- J (301)
I need to compute the average value for A, which should be 10 (because (6+6+18)/3 = 10; I and J are ignored since H has an explicit value).
Your data can be stored as:
create table tree(id int primary key, parent int, caption text, node_value int);
insert into tree values
(1, 0, 'A', null),
(2, 1, 'B', 6),
(3, 1, 'C', null),
(4, 3, 'D', null),
(5, 4, 'E', 10),
(6, 4, 'F', 2),
(7, 1, 'H', 18),
(8, 7, 'I', 102),
(9, 7, 'J', 301);
The simplest way to do bottom-up aggregation is a recursive function.
create or replace function get_node_value(node_id int)
returns int language plpgsql as $$
declare
    val int;
begin
    -- use the node's own value if it has one attached
    select node_value
    from tree
    where id = node_id
    into val;
    -- otherwise recurse: average the values of its direct children
    if val is null then
        select avg(get_node_value(id))
        from tree
        where parent = node_id
        into val;
    end if;
    return val;
end;
$$;
select get_node_value(1);
get_node_value
----------------
10
(1 row)
Test it here.
It is possible to achieve the same thing in an SQL function. The function code is less obvious, but it may be a bit faster than plpgsql.
create or replace function get_node_value_sql(node_id int)
returns int language sql as $$
select coalesce(
node_value,
(
select avg(get_node_value_sql(id))::int
from tree
where parent = node_id
)
)
from tree
where id = node_id;
$$;
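As a quick hedged check, calling the SQL variant on the root should return the same result as the plpgsql version:
select get_node_value_sql(1);  -- 10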
Viewing a tree from the bottom up with a recursive CTE is not especially complicated. In this particular case the difficulty lies in the fact that the average must be computed for each level separately.
with recursive bottom_up(id, parent, caption, node_value, level, calculated) as (
select
*,
0,
node_value calculated
from tree t
where not exists (
select id
from tree
where parent = t.id)
union all
select
t.*,
b.level + 1,
case when t.node_value is null then b.calculated else t.node_value end
from tree t
join bottom_up b on t.id = b.parent
)
select id, parent, caption, avg(calculated)::int calculated
from (
select id, parent, caption, level, avg(calculated)::int calculated
from bottom_up
group by 1, 2, 3, 4
) s
group by 1, 2, 3
order by 1;
Test it here.

SQL Formula Answer is Different from Excel Formula

I have a query which summarises data for the station I have selected. In my SQL query I have a column created by dividing column 1 by (column 2 * 1000). It works, but if I do the same thing in Excel I get a different answer.
SELECT [Client],[Station],[Saleshouse],[Terr vs# Sat],[Month],[Year 2],
sum([Hour]) as 'Hour',
sum ([Impacts]) as 'Impacts',
sum (nullif([Insertions],0)) as Insertions,
sum (nullif([Final Cost],0)) as FinalCost,
sum (nullif([Resp 1],0)) as 'Resp 1',
sum (nullif([Resp 2],0)) as 'Resp 2',
sum (nullif([Final Cost],0)/nullif([Resp 1],0)) as 'CPR 1',
sum (nullif([Resp 1],0)/nullif(([Impacts]*1000),0)) as 'Resp%',
sum (nullif([Final Cost],0)/nullif([Resp 2],0)) as 'CPR 2',
sum (nullif([Resp 2],0)/nullif([Impacts]*1000,0)) as 'Resp%2',
sum (nullif([Resp 1],0)/ nullif([Resp 2],0)) as 'New vistor/all visitor'
FROM [Exporter].[dbo].[tbl_Television_Data]
WHERE [Station] ='4MUSIC'
GROUP BY [Client],[Station],[Saleshouse],[Terr vs# Sat],[Month],[Year 2] ORDER BY [Station]
The column I'm talking about is Resp%: I get an answer of 0.0365374676389937,
but within Excel I use the same formula and get 0.00283685.
As you can see they are different, yet they use the same values.
Can anyone explain what is happening?
Firstly, the question is tagged with MySQL, but the syntax used looks suspiciously like SQL Server, so I will do all the demonstrations with SQL Server syntax.
Secondly, based on the magnitude of the error (about an order of magnitude), I think you have your calculations mixed up. You appear to be saying that in the three columns below
sum ([Impacts]) as 'column1',
sum (nullif([Resp 2],0)) as 'column2',
sum (nullif([Resp 2],0)/nullif([Impacts]*1000,0)) as 'column3'
the third column is not giving you the same result as if you did column2 / (column1 * 1000) in Excel. However, these are not the same formula: a sum of per-row ratios is not the same as the ratio of the sums. I think what you need is:
NULLIF(SUM([Resp 2]), 0) / NULLIF(SUM(Impacts) * 1000, 0) AS [Resp 2%]
Imagine this simple example (I have removed the NULLIF operators to make the difference between the equations clearer):
DECLARE @T TABLE ([Resp 2] FLOAT, [Impacts] FLOAT);
INSERT @T VALUES (132.44, 40), (103, 43);

SELECT Resp2    = SUM([Resp 2]),
       Impacts  = SUM(Impacts),
       YourCalc = SUM([Resp 2] / ([Impacts] * 1000)),
       MyCalc   = SUM([Resp 2]) / (SUM(Impacts) * 1000)
FROM @T;
This will give you:
+--------+---------+--------------------+--------------------+
| Resp2 | Impacts | YourCalc | MyCalc |
|--------+---------+--------------------+--------------------|
| 235.44 | 83 | 0.0057063488372093 | 0.0028366265060241 |
+--------+---------+--------------------+--------------------+
As you can see, MyCalc pretty much matches what you are expecting (the difference is probably due to rounding), but YourCalc does not (I did not fancy using trial and error to get your calculation to match 0.0365374676389937, but hopefully you get the idea).
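For completeness, a hedged sketch of the corrected aggregate-then-divide Resp% columns slotted back into the original query (the other columns are omitted for brevity):
-- Sketch only: divide the sums instead of summing per-row ratios.
SELECT [Client], [Station],
       SUM([Resp 1]) / NULLIF(SUM([Impacts]) * 1000, 0) AS [Resp%],
       SUM([Resp 2]) / NULLIF(SUM([Impacts]) * 1000, 0) AS [Resp%2]
FROM [Exporter].[dbo].[tbl_Television_Data]
WHERE [Station] = '4MUSIC'
GROUP BY [Client], [Station]
ORDER BY [Station];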