SQL: Show average and min/max within standard deviations

I have the following SQL table -
Date  StoreNo  Sales
23/4  34       4323.00
23/4  23       564.00
24/4  34       2345.00
etc
I am running a query that returns average sales, max sales and min sales for a certain period -
select avg(Sales), max(sales), min(sales)
from tbl_sales
where date between etc
But there are some values coming through in the min and max that are really extreme - perhaps because the data entry was bad, perhaps because some anomaly occurred on that date at that store.
What I'd like is a query that returns average, max and min, but somehow excludes the extreme values. I am open to how this is done, but perhaps it would use standard deviations in some way (for example, only using data within x std devs of the true average).
Many thanks

In order to calculate the standard deviation, you first need a pass over all of the rows, so it is hard to do this in a single plain query. The lazy way is to just do it in two passes:
DECLARE
    @Avg float,
    @StDev float

SELECT @Avg = AVG(Sales), @StDev = STDEV(Sales)
FROM tbl_sales
WHERE ...

SELECT AVG(Sales) AS AvgSales, MAX(Sales) AS MaxSales, MIN(Sales) AS MinSales
FROM tbl_sales
WHERE ...
    AND Sales >= @Avg - @StDev * 3
    AND Sales <= @Avg + @StDev * 3
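If you would rather keep it to a single statement, the same two-pass logic can be expressed with windowed aggregates on SQL Server 2005 and later. A minimal sketch, reusing the placeholder WHERE clause from above:
-- Compute AVG/STDEV over the filtered set as window aggregates,
-- then keep only rows within 3 standard deviations of the mean.
WITH Stats AS
(
    SELECT Sales,
           AVG(Sales)   OVER () AS AvgSales,
           STDEV(Sales) OVER () AS StDevSales
    FROM tbl_sales
    WHERE ...   -- same date filter as the two-pass version
)
SELECT AVG(Sales) AS AvgSales, MAX(Sales) AS MaxSales, MIN(Sales) AS MinSales
FROM Stats
WHERE Sales >= AvgSales - StDevSales * 3
  AND Sales <= AvgSales + StDevSales * 3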
Another simple option that might work (fairly common in analysis of scientific data) would be to just drop the minimum and maximum x values, which works if you have a lot of data to process. You can use ROW_NUMBER to do this in one statement:
WITH OrderedValues AS
(
    SELECT
        Sales,
        ROW_NUMBER() OVER (ORDER BY Sales) AS RowNumAsc,
        ROW_NUMBER() OVER (ORDER BY Sales DESC) AS RowNumDesc
    FROM tbl_sales
    WHERE ...
)
SELECT ...
FROM tbl_sales
WHERE ...
    AND Sales >
    (
        SELECT MAX(Sales)
        FROM OrderedValues
        WHERE RowNumAsc <= @ElementsToDiscard
    )
    AND Sales <
    (
        SELECT MIN(Sales)
        FROM OrderedValues
        WHERE RowNumDesc <= @ElementsToDiscard
    )
Replace ROW_NUMBER with RANK or DENSE_RANK if you want to discard a certain number of unique values.
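For example, a sketch of the DENSE_RANK variant, which discards the N lowest and N highest distinct values rather than rows (the @ValuesToDiscard variable is illustrative):
WITH OrderedValues AS
(
    SELECT
        Sales,
        DENSE_RANK() OVER (ORDER BY Sales) AS RankAsc,
        DENSE_RANK() OVER (ORDER BY Sales DESC) AS RankDesc
    FROM tbl_sales
    WHERE ...
)
SELECT AVG(Sales) AS AvgSales, MAX(Sales) AS MaxSales, MIN(Sales) AS MinSales
FROM OrderedValues
WHERE RankAsc > @ValuesToDiscard    -- drop the lowest N distinct values
  AND RankDesc > @ValuesToDiscard   -- drop the highest N distinct values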
Beyond these simple tricks you start to get into some pretty heavy stats. I have to deal with similar kinds of validation and it's far too much material for a SO post. There are a hundred different algorithms that you can tweak in a dozen different ways. I would try to keep it simple if possible!

Expanding on DuffyMo's post, you could do something like this:
With SalesStats As
(
    Select Sales, NTILE( 100 ) OVER ( Order By Sales ) As NtileNum
    From tbl_Sales
)
Select Avg( Sales ), Max( Sales ), Min( Sales )
From SalesStats
Where NtileNum Between 5 And 95
This will exclude (roughly) the lowest 5% and the highest 5% of values. If you have numbers that vary wildly, you may find that the average isn't a quality summary statistic, and you should consider using the median instead. You can do that with something like this:
With SalesStats As
(
    Select Sales
        , NTILE( 100 ) OVER ( Order By Sales ) As NtileNum
        , ROW_NUMBER() OVER ( Order By Sales ) As RowNum
    From tbl_Sales
)
, TotalSalesRows As
(
    Select COUNT(*) As Total
    From tbl_Sales
)
, Median As
(
    Select Avg( Sales ) As MedianSales
    From SalesStats
        Cross Join TotalSalesRows
    Where RowNum In ( (TotalSalesRows.Total + 1) / 2, (TotalSalesRows.Total + 2) / 2 )
)
Select Avg( SalesStats.Sales ), Max( SalesStats.Sales ), Min( SalesStats.Sales ), Median.MedianSales
From SalesStats
    Cross Join Median
Where NtileNum Between 5 And 95
Group By Median.MedianSales
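As a side note, if you are on SQL Server 2012 or later, PERCENTILE_CONT computes the median more directly; a minimal sketch against the same table:
Select Distinct
    PERCENTILE_CONT( 0.5 ) Within Group ( Order By Sales ) Over ( ) As MedianSales
From tbl_Sales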

Maybe what you're looking for are percentiles.
Standard deviation tends to be sensitive to outliers, since it's calculated using the square of the difference between a value and the mean.
Maybe a more robust, less sensitive measure, such as the mean absolute deviation (the average of the absolute differences between each value and the mean), would be more appropriate in your case.
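A rough two-pass sketch of that idea in T-SQL, assuming the same tbl_sales table as above (the 3x cutoff is as arbitrary as before):
DECLARE @Avg float, @MeanAbsDev float

SELECT @Avg = AVG(Sales)
FROM tbl_sales
WHERE ...

SELECT @MeanAbsDev = AVG(ABS(Sales - @Avg))
FROM tbl_sales
WHERE ...

-- keep only rows whose distance from the mean is within 3 mean absolute deviations
SELECT AVG(Sales) AS AvgSales, MAX(Sales) AS MaxSales, MIN(Sales) AS MinSales
FROM tbl_sales
WHERE ...
  AND ABS(Sales - @Avg) <= @MeanAbsDev * 3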

Related

In Spark SQL how do I take 98% of the lowest values

I am using Spark SQL and I have some outliers that have incredibly high transaction counts in comparison to the rest. I only want the lowest 98% of the values and to cut off the top 2% outliers. How do I go about doing that? The TOP function is not being recognized in Spark SQL. This is a sample of the table but it is a very large table.
Date        ID      Name     Transactions
02/02/2022  ABC123  Bob      107
01/05/2022  ACD232  Emma     34
12/03/2022  HH254   Kirsten  23
12/11/2022  HH254   Kirsten  47
You need a couple of window functions to compute the relative rank; the row_number() will give absolute rank, but you won't know where to draw the cutoff line without a full record count to compute the percentile.
In an inner query,
Select t.*,
       row_number() Over (Order By Transactions, Date desc) * 100
           / count(*) Over (Rows Between Unbounded Preceding And Unbounded Following) as percentile
From myTable t
Then in an outer query just
Select * from (*inner query*)
Where percentile <= 98
You might be able to omit the frame specification on the Count(*) and just write Count(*) Over (), I don't know.
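Put together (using the shorter Over () form, which recent Spark versions accept), a sketch of the whole thing, with myTable standing in for the real table name as above:
-- Relative rank as a percentage of the total row count,
-- then keep everything at or below the 98th percentile.
Select Date, ID, Name, Transactions
From (
    Select t.*,
           row_number() Over (Order By Transactions, Date desc) * 100
               / count(*) Over () as percentile
    From myTable t
) ranked
Where percentile <= 98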
You can calculate the 98th percentile value for the Transactions column and then filter the rows where the value of Transactions is at or below that percentile. You can use the following query to accomplish that:
WITH base_data AS (
    SELECT Date, ID, Name, Transactions
    FROM your_table
),
percentiles AS (
    SELECT percentile_approx(Transactions, 0.98) AS p
    FROM base_data
)
SELECT Date, ID, Name, Transactions
FROM base_data
JOIN percentiles
    ON Transactions <= p
The percentile_approx function is applied to the base_data CTE to obtain the 98th percentile value.
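An alternative sketch that skips the join and uses a window function instead (assuming a reasonably recent Spark version):
-- percent_rank() returns each row's relative rank in [0, 1];
-- keep the rows at or below the 0.98 mark.
SELECT Date, ID, Name, Transactions
FROM (
    SELECT *, percent_rank() OVER (ORDER BY Transactions) AS pr
    FROM your_table
) t
WHERE pr <= 0.98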

Using SQL Server : how to use select criteria based on sum

Given the below table and using SQL (SQL Server preferred), how can I select only the ProductIDs whose Orders add up to the first 200 orders or less?
In other words, I'd like the IDs for 'Corn Flakes' and 'Wheeties' returned, since this comes close to the sum of 200 orders, but returning anything more would be over the limit.
Given that 108 + 92 = 200, I must assume that you want the product ids in order.
In that case, you can use a cumulative sum:
select t.*
from (select t.*,
             sum(orders) over (order by product_id) as running_orders
      from t
     ) t
where running_orders <= 200;
Not sure which is more appropriate for your level and version:
select * from T as t
where (
select sum(Orders) from T as t2
where t2.ProductID <= t.ProductID -- *
) <= 200;
with data as (
    select *,
           sum(Orders)
               over (order by ProductID) as cumm -- *
    from T
)
select * from data where cumm <= 200;
Both of these essentially assume there will be no ties, or at least no ties that would both land as a single product order in the 200th spot.
If you discover that you intended to sort by number of orders rather than product ID, change the column references in the lines marked with asterisks.
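If you do switch to sorting by Orders, one way to keep the running total deterministic despite ties is to add the product id as a tiebreaker; a sketch, using the same table and columns as above (the ROWS clause needs SQL Server 2012 or later):
with data as (
    select *,
           sum(Orders) over (order by Orders, ProductID
                             rows unbounded preceding) as cumm
    from T
)
select * from data where cumm <= 200;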

ensure TOP (10) percent includes records from each date in range

I am attempting to perfect an audit methodology to gather 10 percent of records from the last week so they can be audited. I currently use a CROSS APPLY to get 10 percent for each office during the period, but most of those records are from the first 2 days. In order to improve the audit I want to make sure that records for each day in the range are included in the 10 percent.
SELECT t1.PIC, t1.TransID, t1.ID, t1.TranCode, t1.Doc, t1.TranDate, t1.Operator, t1.Office
FROM [dbo].[Office]
CROSS APPLY
(
    SELECT TOP (10) PERCENT d2.*
    FROM ##AUDIT AS d2
    WHERE d2.Office = [dbo].[Office].CodeValue
    ORDER BY d2.TransID
) AS t1
ORDER BY [dbo].[Office].CodeValue
This works great to get me 10 percent from each office, but I need to improve it.
Don't order by TransId, which is presumably incrementally assigned. Instead, use a where to get the date period you want and then order randomly:
SELECT . . .
FROM [dbo].[Office] CROSS APPLY
     (SELECT TOP (10) PERCENT d2.*
      FROM ##AUDIT AS d2
      WHERE d2.Office = [dbo].[Office].CodeValue AND
            d2.tranDate >= dateadd(day, -7, cast(getdate() as date))
      ORDER BY newid()
     ) t1
Here's an alternative method, not using top...percent. Assign row numbers pseudorandomly (using newID), starting over on each new transdate. Use ceiling(tot/10.0) to get the number of records on each date that comprises at least 10% of the sample (e.g. 1 record if there were 10 or fewer records on that day, 2 records if there were between 11 and 20 records, etc), then select that many records from your initial table.
;with CTE as (
    select tranDate, transID
         , count(transID) over (partition by tranDate) tot
         , row_number() over (partition by tranDate order by newid()) rn
    from ##Audit)
select *
from CTE
where ceiling(tot/10.0) >= rn
You can modify the partition by portion of the query if you need to select 10% from each office on each date, or to split by other factors.
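For instance, a sketch of a per-office, per-day variant (same ##Audit table and columns as above):
;with CTE as (
    select Office, tranDate, transID
         , count(transID) over (partition by Office, tranDate) tot
         , row_number() over (partition by Office, tranDate order by newid()) rn
    from ##Audit)
select *
from CTE
where ceiling(tot/10.0) >= rn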

Producing n rows per group

It is known that GROUP BY produces one row per group. I want to produce multiple rows per group. The particular use case is, for example, selecting two cheapest offerings for each item.
It is trivial for two or three elements in the group:
select type, variety, price
from fruits
where price = (select min(price) from fruits as f where f.type = fruits.type)
or price = (select min(price) from fruits as f where f.type = fruits.type
and price > (select min(price) from fruits as f2 where f2.type = fruits.type));
(Select n rows per group in mysql)
But I am looking for a query that can show n rows per group, where n is arbitrarily large. In other words, a query that displays 5 rows per group should be convertible to a query that displays 7 rows per group by just replacing some constants in it.
I am not constrained to any DBMS, so I am interested in any solution that runs on any DBMS. It is fine if it uses some non-standard syntax.
For any database that supports analytic functions / window functions, this is relatively easy:
select *
from (select type,
             variety,
             price,
             rank() over ([partition by something]
                          order by price) rnk
      from fruits) rank_subquery
where rnk <= 3
If you omit the [partition by something], you'll get the top three overall rows. If you want the top three for each type, you'd partition by type in your rank() function.
Depending on how you want to handle ties, you may want to use dense_rank() or row_number() rather than rank(). If two rows tie for first, using rank, the next row would have a rnk of 3 while it would have a rnk of 2 with dense_rank. In both cases, both tied rows would have a rnk of 1. row_number would arbitrarily give one of the two tied rows a rnk of 1 and the other a rnk of 2.
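A small sketch that shows all three side by side on the same data (column names taken from the question):
select type, variety, price,
       rank()       over (partition by type order by price) as rnk,
       dense_rank() over (partition by type order by price) as drnk,
       row_number() over (partition by type order by price) as rn
from fruits;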
To save anyone looking some time: at the time of this writing, apparently this won't work because of MySQL's restriction on LIMIT inside subqueries, documented at https://dev.mysql.com/doc/refman/5.7/en/subquery-restrictions.html.
I've never been a fan of correlated subqueries, as most uses I saw for them could usually be written more simply, but I think this has changed my mind... a little. (This is for MySQL.)
SELECT `type`, `variety`, `price`
FROM `fruits` AS f2
WHERE `price` IN (
    SELECT DISTINCT `price`
    FROM `fruits` AS f1
    WHERE f1.type = f2.type
    ORDER BY `price` ASC
    LIMIT X
)
;
Where X is the "arbitrary" value you wanted.
If you know how you want to limit further in cases of duplicate prices, and the data permits such limiting ...
SELECT `type`, `variety`, `price`
FROM `fruits` AS f2
WHERE (`price`, `other_identifying_criteria`) IN (
    SELECT DISTINCT `price`, `other_identifying_criteria`
    FROM `fruits` AS f1
    WHERE f1.type = f2.type
    ORDER BY `price` ASC, `other_identifying_criteria` [ASC|DESC]
    LIMIT X
)
;
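Given the LIMIT-in-subquery restriction mentioned above, a LIMIT-free sketch of the same idea for MySQL (X again being the cutoff; ties on price are counted once):
SELECT `type`, `variety`, `price`
FROM `fruits` AS f2
WHERE (
    -- keep a row if fewer than X distinct prices in its type are cheaper
    SELECT COUNT(DISTINCT f1.`price`)
    FROM `fruits` AS f1
    WHERE f1.`type` = f2.`type`
      AND f1.`price` < f2.`price`
) < X
;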
"greatest N per group problems" can easily be solved using window functions:
select type, variety, price
from (
select type, variety, price,
dense_rank() over (partition by type) order by price as rnk
from fruits
) t
where rnk <= 5;
Not all window function features are available before SQL Server 2012, though. Try this out:
SQL Server 2005 and Above Solution
DECLARE @yourTable TABLE(Category VARCHAR(50), SubCategory VARCHAR(50), price INT)
INSERT INTO @yourTable
VALUES ('Meat','Steak',1),
       ('Meat','Chicken Wings',3),
       ('Meat','Lamb Chops',5);
DECLARE @n INT = 2;

SELECT DISTINCT Category, CA.SubCategory, CA.price
FROM @yourTable A
CROSS APPLY
(
    SELECT TOP (@n) SubCategory, price
    FROM @yourTable B
    WHERE A.Category = B.Category
    ORDER BY price DESC
) CA
Results in two highest priced subCategories per Category:
Category                  SubCategory               price
------------------------- ------------------------- -----------
Meat                      Chicken Wings             3
Meat                      Lamb Chops                5

SELECT field value minus previous field value

I have a select query that gets a CarID, month, mileage and CO2 emission.
Now it gives for each month per car the mileage like this:
month 1: 5000
month 2: 5200
...
What I really need is the current value minus the previous one. I get data within a certain time frame, and I already included a mileage point before that time frame, so it should be possible to get the total miles per month; I just don't know how. What I want is this:
pre timeframe: 5000
month 1: 200
month 2: 150
...
How would I do this?
Edit: my code so far. I have not yet tried anything, as I have no clue how to start doing this.
resultlist as (
    SELECT
        CarID
        , '01/01/2000' as beginmonth
        , MAX(kilometerstand) as Kilometers
        , MAX(Co2Emission) as CO2
    FROM totalmileagelist
    GROUP BY CarID
    UNION
    SELECT
        CarID
        , beginmonth
        , MAX(kilometerstand) as Kilometers
        , MAX(Co2Emission) as CO2
    FROM resultunionlist
    GROUP BY CarID, beginmonth
)
select * from resultlist
order by CarID, beginmonth
Edit2: explanation to the code
In the first part of the result list I grab the latest mileage per car. In the second part, after the union, I grab the latest mileage per car per month.
If you just want to subtract the previous milage, use the lag() function:
select ml.*,
(kilometerstand - lag(kilometerstand) over (partition by carid order by month)
) as diff
from totalmileagelist ml;
lag() is available in SQL Server 2012+. In earlier versions you can use a correlated subquery or outer apply.
(I missed the version because it is in the title and not on a tag.) In SQL Server 2008:
select ml.*,
(ml.mileage - mlprev.mileage) as diff
from totalmileagelist ml outer apply
(select top 1 ml2.*
from totalmileagelist ml2
where ml2.CarId = ml.CarId and
ml2.month < ml.month
order by ml2.month desc
) mlprev;
Try it like this:
SELECT id, yourColumnValue,
       COALESCE(
       (
           SELECT TOP 1 yourColumnValue
           FROM table_name t
           WHERE t.id > tbl.id
           ORDER BY rowInt
       ), 0) - yourColumnValue AS diff
FROM table_name tbl
ORDER BY id
Or like this, using rank():
select rank() OVER (ORDER BY id) as 'RowId', mileage into temptable
from totalmileagelist;

select t1.mileage - t2.mileage from temptable t1, temptable t2
where t1.RowId = t2.RowId - 1;

drop table temptable;