Eliminating outliers by standard deviation in SQL Server

I am trying to eliminate outliers in SQL Server 2008 by standard deviation. I would like only records that contain a value in a specific column within +/- 1 standard deviation of that column's mean.
How can I accomplish this?

If you assume a bell-curve (normal) distribution, only about 68% of values fall within 1 standard deviation of the mean (about 95% fall within 2 standard deviations).
I would load a variable with the standard deviation of your range (derived using the STDEV / STDEVP SQL functions) and then select the values that are within the appropriate number of standard deviations.
declare @stdtest table (colname varchar(20), colvalue int)
insert into @stdtest (colname, colvalue) values ('a', 2)
insert into @stdtest (colname, colvalue) values ('b', 4)
insert into @stdtest (colname, colvalue) values ('c', 4)
insert into @stdtest (colname, colvalue) values ('d', 4)
insert into @stdtest (colname, colvalue) values ('e', 5)
insert into @stdtest (colname, colvalue) values ('f', 5)
insert into @stdtest (colname, colvalue) values ('g', 7)
insert into @stdtest (colname, colvalue) values ('h', 9)
--an explicit scale is needed: plain decimal means decimal(18,0) and would round the deviation to 2
declare @std decimal(18, 4)
declare @mean decimal(18, 4)
declare @lower decimal(18, 4)
declare @higher decimal(18, 4)
declare @noofstds int
--cast before averaging, because AVG over an int column returns an int
select @std = STDEV(colvalue), @mean = AVG(CAST(colvalue AS decimal(18, 4))) from @stdtest
--about 68% of a normal distribution
set @noofstds = 1
select @lower = @mean - (@noofstds * @std)
select @higher = @mean + (@noofstds * @std)
select @lower, @higher, * from @stdtest where colvalue between @lower and @higher
--bounds are about 2.86 and 7.14, so this returns the rows with colvalue 4, 5 and 7
--about 95% of a normal distribution
set @noofstds = 2
select @lower = @mean - (@noofstds * @std)
select @higher = @mean + (@noofstds * @std)
select @lower, @higher, * from @stdtest where colvalue between @lower and @higher
--bounds are about 0.72 and 9.28, so this returns every row
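As a cross-check outside the database, the same bounds can be reproduced with a few lines of Python. This is only an illustrative sketch, not part of the original answer: the helper name `within` is made up, and it relies on `statistics.stdev` matching T-SQL `STDEV` (sample standard deviation) and `statistics.pstdev` matching `STDEVP` (population standard deviation). It also shows the unrounded deviation, which matters because a SQL Server variable declared as plain `decimal` has scale 0 and would store roughly 2.14 as 2.

```python
from statistics import mean, stdev, pstdev

values = [2, 4, 4, 4, 5, 5, 7, 9]  # same sample data as the answer


def within(values, n_std, sd):
    """Return the values lying within n_std standard deviations of the mean."""
    m = mean(values)
    lower, upper = m - n_std * sd, m + n_std * sd
    return [v for v in values if lower <= v <= upper]


sample_sd = stdev(values)       # divides by n - 1, like STDEV
population_sd = pstdev(values)  # divides by n, like STDEVP

one_sd = within(values, 1, sample_sd)
two_sd = within(values, 2, sample_sd)

print(round(sample_sd, 4), population_sd)
print(one_sd)
print(two_sd)
```

With this sample, one standard deviation keeps the six values from 4 to 7 and drops 2 and 9, while two standard deviations keep everything.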

SQL Server has an aggregate function called STDEV that gives you the standard deviation, which is the hard part. After that, just keep the values between the mean minus one STDEV and the mean plus one STDEV.
This is one way you could go about doing it:
create table #test
(
testNumber int
)
INSERT INTO #test (testNumber)
SELECT 2
UNION ALL
SELECT 4
UNION ALL
SELECT 4
UNION ALL
SELECT 4
UNION ALL
SELECT 5
UNION ALL
SELECT 5
UNION ALL
SELECT 7
UNION ALL
SELECT 9
SELECT testNumber FROM #test t
JOIN (
SELECT STDEV (testnumber) as [STDEV], AVG(testnumber) as mean
FROM #test
) X on t.testNumber >= X.mean - X.STDEV AND t.testNumber <= X.mean + X.STDEV

I'd be careful and think about what you're doing. Throwing away outliers might mean that you're discarding information that might not fit into a pre-conceived world view that could be quite wrong. Those outliers might be "black swans" that are rare, though not as rare as you'd think, and quite significant.
You give no context or explanation of what you're doing. It's easy to cite a function or technique that will fulfill the needs of your particular case, but I thought it appropriate to post the caution until additional information is supplied.

Related

How can I delete trailing contiguous records in a partition with a particular value?

I'm using the latest version of SQL Server and have the following problem. Given the table below, the requirement, quite simply, is to delete "trailing" records in each _category partition that have _value = 0. Trailing in this context means, when the records are placed in _date order, any series or contiguous block of records with _value = 0 at the end of the list should be deleted. Records with _value = 0 that have subsequent records in the partition with some non-zero value should stay.
create table #x (_id int identity, _category int, _date date, _value int)
insert into #x values (1, '2022-10-01', 12)
insert into #x values (1, '2022-10-03', 0)
insert into #x values (1, '2022-10-04', 10)
insert into #x values (1, '2022-10-06', 11)
insert into #x values (1, '2022-10-07', 10)
insert into #x values (2, '2022-10-01', 1)
insert into #x values (2, '2022-10-02', 0)
insert into #x values (2, '2022-10-05', 19)
insert into #x values (2, '2022-10-10', 18)
insert into #x values (2, '2022-10-12', 0)
insert into #x values (2, '2022-10-13', 0)
insert into #x values (2, '2022-10-15', 0)
insert into #x values (3, '2022-10-02', 10)
insert into #x values (3, '2022-10-03', 0)
insert into #x values (3, '2022-10-05', 0)
insert into #x values (3, '2022-10-06', 12)
insert into #x values (3, '2022-10-08', 0)
I see a few ways to do it. The brute force way is to run the records through a cursor in date order, grab the ID of any record where _value = 0, and see if that holds until the category changes. I'm trying to avoid procedural T-SQL, though, if I can do it in a query.
To that end, I thought I could apply some gaps and islands trickery and do something with window functions. I feel like there might be a way to leverage last_value() for this, but so far I only see it useful in identifying partitions that meet the criteria, not so much in helping me get the IDs of the records to delete.
The desired result is the deletion of records 10, 11, 12 and 17.
Appreciate any help.

I'm not sure that your requirement requires a gaps and islands approach. Simple exists logic should work. The query below returns the rows to keep; the rows to delete are the ones it leaves out.
SELECT _id, _category, _date, _value
FROM #x x1
WHERE _value <> 0 OR
EXISTS (
SELECT 1
FROM #x x2
WHERE x2._category = x1._category AND
x2._date > x1._date AND
x2._value <> 0
);

Assuming that all _values are greater than or equal to 0 you can use MAX() window function in an updatable CTE:
WITH cte AS (
SELECT *,
MAX(_value) OVER (
PARTITION BY _category
ORDER BY _date
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
) max
FROM #x
)
DELETE FROM cte
WHERE max = 0;
If there are negative _values use MAX(ABS(_value)) instead of MAX(_value).
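To see which rows this kind of condition selects, here is a small Python sketch (illustrative only; the function name `trailing_zero_ids` is made up). It walks each category backwards from its latest date and collects ids until it meets the first non-zero value, which should be the same set of rows for which no non-zero value exists at or after the row's date.

```python
from itertools import groupby

# (_id, _category, _date, _value): the rows from the question, ids assigned in insert order
rows = [
    (1, 1, "2022-10-01", 12), (2, 1, "2022-10-03", 0), (3, 1, "2022-10-04", 10),
    (4, 1, "2022-10-06", 11), (5, 1, "2022-10-07", 10),
    (6, 2, "2022-10-01", 1), (7, 2, "2022-10-02", 0), (8, 2, "2022-10-05", 19),
    (9, 2, "2022-10-10", 18), (10, 2, "2022-10-12", 0), (11, 2, "2022-10-13", 0),
    (12, 2, "2022-10-15", 0),
    (13, 3, "2022-10-02", 10), (14, 3, "2022-10-03", 0), (15, 3, "2022-10-05", 0),
    (16, 3, "2022-10-06", 12), (17, 3, "2022-10-08", 0),
]


def trailing_zero_ids(rows):
    """Return ids of zero-valued rows at the end of each category (by date)."""
    doomed = []
    # ISO date strings sort chronologically, so a plain sort is enough here.
    ordered = sorted(rows, key=lambda r: (r[1], r[2]))
    for _, group in groupby(ordered, key=lambda r: r[1]):
        # Walk from newest to oldest and stop at the first non-zero value.
        for row_id, _cat, _date, value in reversed(list(group)):
            if value != 0:
                break
            doomed.append(row_id)
    return sorted(doomed)


to_delete = trailing_zero_ids(rows)
print(to_delete)
```

For the sample data this gives ids 10, 11, 12 and 17, matching the result the question asks for. A category made up entirely of zeros would be collected in full.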

Using common table expressions, you can use:
WITH CTE_NumberedRows AS (
SELECT *, rn = ROW_NUMBER() OVER(PARTITION BY _category ORDER BY _date)
FROM #x
),
CTE_Keepers AS (
SELECT _category, rnLastKeeper = MAX(rn)
FROM CTE_NumberedRows
WHERE _value <> 0
GROUP BY _category
)
DELETE NR
FROM CTE_NumberedRows NR
LEFT JOIN CTE_Keepers K
ON K._category = NR._category
WHERE NR.rn > ISNULL(K.rnLastKeeper, 0)
EDIT: My original post did not handle the all-zeros edge case. This has been corrected above, together with some naming tweaks.
Tim Biegeleisen's post may be the simpler approach.

Get the list of year values based on the gap and year value in the table

Scenario: I have a table with Year and Gap columns. The output I need starts at the given year value and increments by one until it has produced as many years as the value in the Gap column.
For example, if the YearVal is 2001 and the Gap is 3, I need the output as
Result
--------
2001
2002
2003
What I have tried:
DECLARE @ResultYears TABLE (Gap INT, YearVal INT);
INSERT INTO @ResultYears (Gap, YearVal) VALUES (3, 2001);
;WITH FinalResult AS (
SELECT YearVal AS [YR] FROM @ResultYears
UNION ALL
SELECT [YR] + 1 FROM FinalResult
WHERE [YR] + 1 <= (SELECT YearVal + (Gap -1) FROM @ResultYears)
)
SELECT * FROM FinalResult;
Using the query above, I can achieve the expected result. But if the table has more than one entry, the query does not work.
For example, if I have the entries in the table as below:
DECLARE @ResultYears TABLE (Gap INT, YearVal INT);
INSERT INTO @ResultYears (Gap, YearVal) VALUES
(3, 2001), (4, 2008), (1, 2014), (2, 2018);
How can I modify the query to achieve my expected result?

Is this what you're after?
DECLARE @ResultYears TABLE (Gap INT, YearVal INT);
INSERT INTO @ResultYears (Gap, YearVal) VALUES
(3, 2001), (4, 2008), (1, 2014), (2, 2018);
WITH N AS(
SELECT N
FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL))N(N)),
Tally AS(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS I
FROM N N1, N N2), --100 is more than enough
Years AS(
SELECT RY.YearVal + T.I AS [Year],
RY.Gap,
RY.YearVal
FROM @ResultYears RY
JOIN Tally T ON RY.Gap > T.I)
SELECT *
FROM Years Y
ORDER BY Y.YearVal;
Personally I prefer a tally table over a rCTE; they are far quicker, especially with large datasets, or where the rCTE would have to do a high volume of recursion.
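The join against the tally amounts to pairing every row with the offsets 0 to Gap - 1. A minimal Python sketch of that idea (illustrative only; `expand_years` is a made-up name and `offset` stands in for the tally value `I`):

```python
result_years = [(3, 2001), (4, 2008), (1, 2014), (2, 2018)]  # (Gap, YearVal)


def expand_years(rows):
    """Yield (gap, year) for each year from YearVal to YearVal + Gap - 1."""
    for gap, year_val in rows:
        for offset in range(gap):  # offset plays the role of the tally value I
            yield gap, year_val + offset


expanded = list(expand_years(result_years))
for gap, year in expanded:
    print(gap, year)
```

Each source row is expanded independently, which is why this approach is not affected by the number of rows in the table.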

First, create a multi-statement table-valued function that returns the years covered by a gap:
CREATE FUNCTION [dbo].[ufn_GetYears]
(
@i_Gap INT, @Year INT
)
RETURNS @Temp TABLE
(
Years INT
)
AS
BEGIN
;WITH CTE
AS
(
SELECT 1 AS Seq, DATEFROMPARTS(@Year, 01, 01) AS Years
UNION ALL
SELECT Seq + 1, DATEADD(YEAR, 1, Years)
FROM CTE
WHERE Seq < @i_Gap
)
INSERT INTO @Temp
SELECT DATEPART(YEAR, Years)
FROM CTE
RETURN
END
Sample data:
DECLARE @ResultYears TABLE
(Gap INT,
YearVal INT
);
INSERT INTO @ResultYears (Gap, YearVal) VALUES
(3, 2001), (4, 2008), (1, 2014), (2, 2018);
SQL query to get the expected result using CROSS APPLY:
SELECT R.Gap,dt.Years
FROM @ResultYears R
CROSS APPLY [dbo].[ufn_GetYears](R.Gap,R.YearVal) AS dt
Result
Gap Years
---------
3 2001
3 2002
3 2003
4 2008
4 2009
4 2010
4 2011
1 2014
2 2018
2 2019

If for some reason you prefer a recursive CTE (which is definitely slower):
DECLARE @ResultYears TABLE (Gap INT, YearVal INT);
INSERT INTO @ResultYears (Gap, YearVal) VALUES (3, 2001), (4, 2008), (1, 2014), (2, 2018);
;WITH FinalResult AS (
SELECT YearVal, Gap, YearVal [YR] FROM @ResultYears
UNION ALL
SELECT YearVal, Gap, [YR] + 1
FROM FinalResult
WHERE [YR] + 1 <= YearVal + (Gap -1)
)
SELECT * FROM FinalResult
ORDER BY [YR];
You need to carry the original row's values (YearVal and Gap) through the recursive part. This way the recursion runs as desired.

sql code to convert varchar colum to int

I am stuck on converting a varchar column, schedule, containing the following data: 0, 1, 2, 3,4,8,9,10,11,12,15,16,17,18,19 to INT. I know, please don't ask why this schedule column was not created as INT initially; long story.
So I tried this, but it doesn't work and gives me an error:
select CAST(schedule AS int) from shift_test
The query should check whether the numbers representing days are found in the schedule field, using the SQL code below:
select empid, case when ((DateDiff(hour,'01-01-2014 07:00' , '01-01-2014 00:00')/ 24 )% 15) in ( CAST(schedule AS int))
then 'A' else '*' end as shift_A from Shift_test
After executing it I get this error:
Conversion failed when converting the varchar value to int.
Any help will be appreciated.

Use an ISNUMERIC() test if you are on SQL Server 2008 or 2008 R2. In SQL Server 2012 and later you can use the TRY_CAST() function, which returns NULL instead of raising an error when the conversion fails.
Mock up code for SQL Server 2008/R2:
Select col1, col2,
case
when <condition> and isnumeric(col2) = 1 then cast(col2 as int)
else <do whatever...>
end
as converted_col2
from <yourtable>;
For SQL Server 2012:
Select col1, col2,
case
when <condition> then try_cast(col2 as int)
else <do whatever...>
end
as converted_col2
from <yourtable>;
Example with SQL Server 2008:
declare @T table (empid int, schedule varchar(2)) ;
insert into @T(empid, schedule) values (1, '1');
insert into @T(empid, schedule) values (2, '2');
insert into @T(empid, schedule) values (3, '03');
insert into @T(empid, schedule) values (4, '4');
insert into @T(empid, schedule) values (5, '05');
insert into @T(empid, schedule) values (6, 'A');
select empid,
case
when ISNUMERIC(schedule) = 1
and ((DateDiff(hour,'01-01-2014 07:00' , '10-01-2014 00:00')/ 24 )% 15)
in ( CAST(schedule AS int)) then 'A'
else '*'
end
as shift_A
from @T;

@Binaya I've added my code for more context.
The schedule column contains (0, 1, 2, 3,4,8,9,10,11,12,15,16,17,18,19), stored as varchar.
I want to output A if the result of the calculation ((DateDiff(hour,'01-01-2014 07:00',startdate)/ 24 )% 15) is found in the schedule column; otherwise, output *.
;with Shift_runover (shift_code,schedule,endd,startdate)
-- Start at the beginning of shift.
as
(select shift_code,schedule,Cast(end_date as DateTime) as endd,Cast(start_date as DateTime)as startdate from dbo.Shift_test
union all
-- Add hours up to the desired end date.
select shift_code,schedule,endd,DateAdd(hour, 1,startdate)from Shift_runover where startdate<=endd),
Extendedsamples as
(
-- Calculate the number of days since the beginning of the first shift on 1/1/2014.
select shift_code,schedule,startdate,DateDiff(hour,'01-01-2014 07:00',startdate)/ 24 as Days from Shift_runover ),
Shifts as
(
-- The schedule column contains (0, 1, 2, 3,4,8,9,10,11,12,15,16,17,18,19), stored as varchar.
-- I want to output A if ((DateDiff(hour,'01-01-2014 07:00',startdate)/ 24 )% 15) is found in the schedule column.
select *,
case when (DateDiff(hour,'01-01-2014 07:00',startdate)/ 24 )% 15 in(schedule)
then 'A' else '*' end as shift_A
from ExtendedSamples
)
select *
from Shifts
option ( maxrecursion 0 )
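If each schedule value really holds the whole comma-separated list, as the description suggests, then neither CAST(schedule AS int) nor IN (schedule) can work, because the column is a single string rather than a set of numbers. The intended check looks roughly like the Python sketch below. It is illustrative only: `parse_schedule` and `shift_flag` are made-up names, and it assumes start dates on or after the anchor, where Python's floor division agrees with T-SQL's integer division.

```python
from datetime import datetime

ANCHOR = datetime(2014, 1, 1, 7, 0)  # start of the first shift, from the question


def parse_schedule(schedule):
    """Turn '0, 1, 2, 3,4' into {0, 1, 2, 3, 4}, ignoring stray spaces."""
    return {int(part) for part in schedule.split(",") if part.strip()}


def shift_flag(start, schedule):
    """Return 'A' when the 15-day cycle index of start is in the schedule."""
    hours = int((start - ANCHOR).total_seconds() // 3600)  # whole hours since the anchor
    day_index = (hours // 24) % 15
    return "A" if day_index in parse_schedule(schedule) else "*"


schedule = "0, 1, 2, 3,4,8,9,10,11,12,15,16,17,18,19"
flag_october = shift_flag(datetime(2014, 10, 1), schedule)    # cycle index 2
flag_january = shift_flag(datetime(2014, 1, 6, 7), schedule)  # cycle index 5
print(flag_october, flag_january)
```

In SQL Server the equivalent step would be to split the string into rows before comparing, or to store one schedule day per row.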

Increment value colum by previous row in select sql statement

I have to find a way to solve this issue. In a table like the one below, I would like column C to accumulate on each row: it starts from a constant, adds the value in column B, and adds the previous value of column C.
Furthermore, this should be grouped by user.
For example (starting point for Phil: 350, starting point for Mark: 100):
USER - POINT - INITIALPOINT
Phil - 1000 - 1350
Phil - 150 - 1500
Phil - 200 - 1700
Mark - 300 - 400
Mark - 250 - 650
How can I do that?

Using windowing. The table declaration is SQL Server but the rest is standard SQL if your RDBMS supports it (SQL Server 2012, PostgreSQL 9.1 etc)
DECLARE @t TABLE (ID int IDENTITY(1,1), UserName varchar(100), Point int);
INSERT @t (UserName, Point)
VALUES
('Phil', 1000),
('Phil', 150),
('Phil', 200),
('Mark', 300),
('Mark', 250);
DECLARE @n TABLE (UserName varchar(100), StartPoint int);
INSERT @n (UserName, StartPoint)
VALUES
('Phil', 350),
('Mark', 100);
SELECT
T.ID, T.UserName, T.Point,
N.StartPoint + SUM(Point) OVER(PARTITION BY T.UserName ORDER BY T.ID ROWS UNBOUNDED PRECEDING)
FROM
@n N
JOIN
@t T ON N.UserName = T.UserName
ORDER BY
T.ID;
To do this, you need a column that orders the rows (I used ID) and a source for each user's starting value (I used a separate table).
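The running total per user can be checked by hand with a short Python sketch (illustrative only; `running_totals` is a made-up name). Each user's total is seeded with the starting value and then grows row by row in ID order, which is what the partitioned, ordered SUM does.

```python
# (ID, UserName, Point), same rows as the table variable above
points = [
    (1, "Phil", 1000), (2, "Phil", 150), (3, "Phil", 200),
    (4, "Mark", 300), (5, "Mark", 250),
]
start_points = {"Phil": 350, "Mark": 100}


def running_totals(points, start_points):
    """Return (id, user, point, running_total) rows ordered by id."""
    totals = dict(start_points)  # current total per user, seeded with the start value
    out = []
    for row_id, user, point in sorted(points):
        totals[user] += point
        out.append((row_id, user, point, totals[user]))
    return out


result = running_totals(points, start_points)
for row in result:
    print(row)
```

The last column comes out as 1350, 1500, 1700 for Phil and 400, 650 for Mark, matching the INITIALPOINT values in the question.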

SQL Server 2008 doesn't support cumulative sums directly using window functions. You can use a correlated subquery for the same effect.
So, using the same structure as GBN:
DECLARE @t TABLE (ID int IDENTITY(1,1), UserName varchar(100), Point int);
INSERT @t (UserName, Point)
VALUES
('Phil', 1000),
('Phil', 150),
('Phil', 200),
('Mark', 300),
('Mark', 250);
DECLARE @n TABLE (UserName varchar(100), StartPoint int);
INSERT @n (UserName, StartPoint)
VALUES
('Phil', 350),
('Mark', 100);
SELECT
T.ID, T.UserName, T.Point,
(N.StartPoint +
(select SUM(Point) from @t t2 where t2.UserName = t.userName and t2.ID <= t.id)
)
FROM
@n N
JOIN
@t T ON N.UserName = T.UserName
ORDER BY
T.ID;

You didn't specify your DBMS, so this is ANSI SQL:
select "user",
point,
case
when "user" = 'Phil' then 350
else 100
end + sum(point) over (partition by "user" order by some_date_column) as sum
from the_table
where "user" in ('Mark', 'Phil')
order by "user", some_date_column;
You need some column to sort the rows by, otherwise the "running sum" will be meaningless, as rows in a table are not sorted (there is no such thing as "the first row" in a relational table). That's what some_date_column is for in my example. It could be an increasing primary key or something else, as long as it defines a proper ordering of the rows.

project a sparse result at some level

I don't really know what to call this, but it's not that hard to explain.
Basically, what I have is a result like this:
Similarity ColumnA ColumnB ColumnC
1 SomeValue NULL SomeValue
2 NULL SomeB NULL
3 SomeValue NULL SomeC
4 SomeA NULL NULL
This result is created by matching a set of strings against another table. Each string also contains some values for these ColumnA..C, which are the values I want to aggregate in some way.
Something like min/max works very well but I can't figure out how to get it to account for the highest similarity not just the min/max value. I don't really want the min/max, I want the first non-null value with the highest similarity.
Ideally the result would look like this
ColumnA ColumnB ColumnC
SomeA SomeB SomeC
I'd like to be able to efficiently join in the temporary result to compute the rest, and I've been exploring different options. Something I've been considering is creating a SQL Server CLR aggregate that yields the "first" non-null value, but I'm unsure if there's even such a thing as a first or last when running an aggregate on a result.

Okay, so I figured it out. I originally had trouble with UPDATE ... FROM and JOIN not playing well together. I was counting on the UPDATE occurring multiple times per row, which would have given me the correct results. However, SQL Server gives no such guarantee (it's actually undefined behavior, and although it appeared to work, we'll have none of that). Since you can run UPDATE against a CTE, I combined that with OUTER APPLY to select exactly one row to fill in a missing value where possible.
Here's the whole thing with test data as well.
DECLARE @cost TABLE (
make nvarchar(100) not null,
model nvarchar(100),
a numeric(18,2),
b numeric(18,2)
);
INSERT @cost VALUES ('a%', null, 100, 2);
INSERT @cost VALUES ('a%', 'a%', 149, null);
INSERT @cost VALUES ('a%', 'ab', 349, null);
INSERT @cost VALUES ('b', null, null, 2.5);
INSERT @cost VALUES ('b', 'b%', 249, null);
INSERT @cost VALUES ('b', 'b', null, 3);
DECLARE @unit TABLE (
id int,
make nvarchar(100) not null,
model nvarchar(100)
);
INSERT @unit VALUES (1, 'a', null);
INSERT @unit VALUES (2, 'a', 'a');
INSERT @unit VALUES (3, 'a', 'ab');
INSERT @unit VALUES (4, 'b', null);
INSERT @unit VALUES (5, 'b', 'b');
DECLARE @tmp TABLE (
id int,
specificity int,
a numeric(18,2),
b numeric(18,2),
primary key(id, specificity)
);
INSERT @tmp
OUTPUT inserted.* --FOR DEBUGGING
SELECT
unit.id
, ROW_NUMBER() OVER (
PARTITION BY unit.id
ORDER BY cost.make DESC, cost.model DESC
) AS specificity
, cost.a
, cost.b
FROM @unit unit
INNER JOIN @cost cost ON unit.make LIKE cost.make
AND (cost.model IS NULL OR unit.model LIKE cost.model)
;
--fix the holes
WITH tmp AS (
SELECT *
FROM @tmp
WHERE specificity = 1
AND (a IS NULL OR b IS NULL) --where necessary
)
UPDATE tmp
SET
tmp.a = COALESCE(tmp.a, a.a)
, tmp.b = COALESCE(tmp.b, b.b)
OUTPUT inserted.* --FOR DEBUGGING
FROM tmp
OUTER APPLY (
SELECT TOP 1 a
FROM @tmp a
WHERE a.id = tmp.id
AND a.specificity > 1
AND a.a IS NOT NULL
ORDER BY a.specificity
) a
OUTER APPLY (
SELECT TOP 1 b
FROM @tmp b
WHERE b.id = tmp.id
AND b.specificity > 1
AND b.b IS NOT NULL
ORDER BY b.specificity
) b
;
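The underlying rule from the question, "for each column, take the first non-null value in order of best match", can be sketched in Python against the question's own example. This is illustrative only: `first_non_null_by_similarity` is a made-up name, and it assumes that a larger Similarity number means a better match, which is inferred from the desired output. The T-SQL above ranks the other way round, with specificity 1 as the best row.

```python
# (Similarity, ColumnA, ColumnB, ColumnC), the example result from the question
rows = [
    (1, "SomeValue", None, "SomeValue"),
    (2, None, "SomeB", None),
    (3, "SomeValue", None, "SomeC"),
    (4, "SomeA", None, None),
]


def first_non_null_by_similarity(rows):
    """For each value column, pick the non-null value from the best-matching row."""
    ordered = sorted(rows, key=lambda r: r[0], reverse=True)  # best match first
    n_cols = len(rows[0]) - 1
    result = []
    for col in range(1, n_cols + 1):
        # First non-null value in this column, or None if every row is null.
        result.append(next((r[col] for r in ordered if r[col] is not None), None))
    return tuple(result)


projected = first_non_null_by_similarity(rows)
print(projected)
```

Each column is resolved independently, which mirrors the one OUTER APPLY per column in the T-SQL.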
