I want to detect outliers (more than 20 standard deviations from the average), but I don't want values that are more than 3 standard deviations out to affect that average. I came up with this:
SELECT d.*
FROM
(
    SELECT
        d.*,
        (amount - AVG(amount_excl_3z) OVER (PARTITION BY productid))
            / NULLIF(STDEV(amount_excl_3z) OVER (PARTITION BY productid), 0) AS zscore_ex
    FROM
    (
        SELECT
            d.*,
            -- when the amount's z-score exceeds 3, null the amount, else keep it
            CASE WHEN ABS(amount - AVG(amount) OVER (PARTITION BY productid))
                      / NULLIF(STDEV(amount) OVER (PARTITION BY productid), 0) > 3
                 THEN NULL ELSE amount END AS amount_excl_3z
        FROM sales d
        WHERE -- the past year's sales of product 1; one day I will consider all products, hence the partitions
            timestamp > GETUTCDATE() - 365.0
            AND productid = 1
    ) d
) d
WHERE d.zscore_ex > 20
ORDER BY amount DESC
The concern with the data is that if too many outliers occur, they will drastically affect the average: there might be 1,000 occurrences of a product with an amount of 1, and then 5 occurrences with an amount of 20,000. I don't want the 20,000s to influence the average. I don't really want 50 occurrences of 20,000 to influence the average either. 500 occurrences, though, would be OK and would represent a new norm.
The way I'm thus considering doing this is to strip out low-frequency, massive outliers. If they start occurring frequently enough that they pull the average enough to bring them into range, then I'll start including them.
The query above is my best stab at "outlier detection that tries to exclude small numbers of wild outliers from having too much of an impact". Is there any other facility in SQL Server that I can leverage more effectively for this algorithm? Perhaps some analytic function that can indicate where on a distribution curve a point lies? I looked at PERCENT_RANK, CUME_DIST, PERCENTILE_CONT/DISC and NTILE, but they seemed more linear in the distribution of their output than a z-score.
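One option along those lines is PERCENTILE_CONT, which as an analytic function (SQL Server 2012+) can give a per-product median and median absolute deviation, both of which a handful of wild values barely move. The sketch below is only an illustration of that idea, not a drop-in replacement: it reuses the sales/productid/amount/timestamp names from the query above, the 1.4826 factor is the usual scaling constant for a MAD-based "modified z-score", the 20 threshold would likely need retuning, and the MAD collapses to 0 (so nothing is flagged, via the NULLIF) when more than half the amounts are identical.
SELECT d.*,
       -- modified z-score: distance from the median, scaled by the median absolute deviation (MAD)
       (amount - med) / NULLIF(1.4826 * mad, 0) AS robust_z
FROM
(
    SELECT d.*,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ABS(amount - med))
               OVER (PARTITION BY productid) AS mad
    FROM
    (
        SELECT d.*,
               PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount)
                   OVER (PARTITION BY productid) AS med
        FROM sales d
        WHERE timestamp > GETUTCDATE() - 365.0
          AND productid = 1
    ) d
) d
WHERE (amount - med) / NULLIF(1.4826 * mad, 0) > 20
ORDER BY amount DESC;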
Related
I'm brainstorming on ways to find trends over a dataset containing transaction amounts that spans a year.
I'd like to compute an average over the top 25% of observations and over the bottom 75% of observations, and vice versa.
If the entire dataset contains 1000 observations, I'd like to run:
An average of the top 25% and then, separately, an average of the bottom 75%, and then the average of those two results.
Inversely: the top 75% average, then the bottom 25% average, then the average of the two.
For the overall average I have: avg(transaction_amount)
I am aware that in order for the sectioning averages to be useful, I will have to order the data according to the date which I already have accounted for in my SQL code:
select avg(transaction_amount)
from example.table
order by transaction_date
I am now struggling to find a way to split the data between 25% and 75% based on the number of observations.
Thanks.
If you're using MSSQL, it's pretty trivial depending on exactly the output you're looking for.
SELECT
    AVG(sub.transaction_amount) AS avg_amt
FROM (
    SELECT TOP 25 PERCENT
        transaction_amount
    FROM example.table
    ORDER BY transaction_amount DESC  -- the top 25% of rows by amount
) AS sub
Use PERCENT_RANK in order to see which percentage block a row belongs to. Then use this to group your data:
with data as
(
select t.*, percent_rank() over (order by transaction_amount) as pr
from example.table t
)
select
case when pr <= 0.75 then '0-75%' else '75-100%' end as percent,
avg(transaction_amount) as avg,
avg(avg(transaction_amount)) over () as avg_of_avg
from data
group by case when pr <= 0.75 then '0-75%' else '75-100%' end
union all
select
case when pr <= 0.25 then '0-25%' else '25-100%' end as percent,
avg(transaction_amount) as avg,
avg(avg(transaction_amount)) over () as avg_of_avg
from data
group by case when pr <= 0.25 then '0-25%' else '25-100%' end;
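If you prefer to split strictly by row count, NTILE(4) assigns each row to one of four equal-sized buckets and can be used the same way. This is just a sketch of that variant, reusing the same example.table and transaction_amount names:
with data as
(
    select t.*, ntile(4) over (order by transaction_amount) as quartile
    from example.table t
)
select
    case when quartile = 4 then 'top 25%' else 'bottom 75%' end as segment,
    avg(transaction_amount) as avg,
    avg(avg(transaction_amount)) over () as avg_of_avg
from data
group by case when quartile = 4 then 'top 25%' else 'bottom 75%' end;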
I've been playing around with it for the whole day, and it's by far the hardest topic in SQL for me to understand.
Say we have a students table, which consists of a group number and a student rating, like so:
Each group can contain multiple students
And now I want to look for groups where at least 60% of the students have a rating of 4 or higher.
Expecting something like:
group_n    percentage_of_goodies
1120       0.7
1200       0.66
1111       1
I tried this
with group_goodies as (
select group_n, count(id) goodies from students
where rating >= 4
group by group_n
), group_counts as (
select group_n, count(id) acount from students
group by group_n
)
select cast(group_goodies.goodies as float)/group_counts.acount from group_counts, group_goodies
where cast(group_goodies.goodies as float)/group_counts.acount > 0.6;
and got an unexpected result
where the percentage seems to surpass 100% (and it's not because I misplaced the denominator, since there are contradictory outputs below as well), which obviously is not intended. There are also more output rows than there are groups. Apparently I could use window functions, but I can't figure it out myself. So how can I get this query done?
The problem is that extracting the count of students both before and after the filter seems to be impossible within a single query, so I had to create 2 CTEs to get the needed results. Both CTEs seem to output proper results: in the first CTE the number of students rarely exceeds 10, and in the second CTE the amounts are naturally smaller and match expectations. But when I divide them like that, it results in something unreasonable.
If someone explains it properly, it will make my day 😳
If I understand correctly, this is a pretty direct aggregation query:
select group_n, avg( (rating >= 4)::int ) as good_students
from students
group by group_n
having avg( (rating >= 4)::int ) > 0.6;
I don't see why two levels of aggregation would be needed.
The avg() works by converting each rating to 1 if it is 4 or higher and to 0 otherwise. The average of these 0/1 values is the ratio of rows that are 1.
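The ::int cast on a boolean is PostgreSQL syntax. This is not part of the answer above, but in dialects without that cast (SQL Server, for example) a CASE expression is presumably the closest equivalent; the 1.0 keeps the average from being computed in integer arithmetic:
select group_n,
       avg(case when rating >= 4 then 1.0 else 0 end) as good_students
from students
group by group_n
having avg(case when rating >= 4 then 1.0 else 0 end) > 0.6;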
First, aggregate the students per group and rating to get a partial count per rating.
Then calculate the fraction of ratings that are 4 or better and report only those groups where that fraction exceeds 0.6.
SELECT group_n,
1.0 *
sum(ratings_per_group) FILTER (WHERE rating >= 4) /
sum(ratings_per_group)
AS fraction_of_goodies
FROM (SELECT group_n, rating,
count(*) AS ratings_per_group
FROM students
GROUP BY group_n, rating
) AS per_group_and_rating
GROUP BY group_n
HAVING sum(ratings_per_group) FILTER (WHERE rating >= 4) /
sum(ratings_per_group) > 0.6;
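For a quick sanity check, here is a single-level variant of the same FILTER idea run against a small invented data set (the rows below are made up for illustration and are not from the original post): group 1120 has 7 good ratings out of 10 and group 1200 has 2 out of 3, so both are kept, while group 1300 with 1 out of 4 is filtered out.
WITH students (id, group_n, rating) AS (
    VALUES
        -- group 1120: 7 of 10 rated 4 or higher -> 0.7, kept (invented data)
        (1, 1120, 5), (2, 1120, 5), (3, 1120, 4), (4, 1120, 4), (5, 1120, 4),
        (6, 1120, 4), (7, 1120, 4), (8, 1120, 3), (9, 1120, 2), (10, 1120, 1),
        -- group 1200: 2 of 3 -> 0.67, kept
        (11, 1200, 4), (12, 1200, 5), (13, 1200, 3),
        -- group 1300: 1 of 4 -> 0.25, filtered out
        (14, 1300, 4), (15, 1300, 1), (16, 1300, 2), (17, 1300, 3)
)
SELECT group_n,
       1.0 * count(*) FILTER (WHERE rating >= 4) / count(*) AS fraction_of_goodies
FROM students
GROUP BY group_n
HAVING 1.0 * count(*) FILTER (WHERE rating >= 4) / count(*) > 0.6;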
I am using Microsoft SQL Server 2005 Management Studio. I am a bit new, so I hope I am not breaking any rules. My data has 15 columns and almost a million rows; however, I am just giving you a sample to get assistance on one area where I am stuck.
In the above example, as you can see, the 'lastlevel' column values are decreasing. You can also see that the 'Last_read' column's date range runs from today to 14 days prior (it was run yesterday, hence April 27; also, please disregard that the date 2021/04/14 is missing for the 1st customer, it is an anomaly).
Column 'Shipto' provides the customer number, and each customer has at most 14 rows of data.
Please disregard the columns 'current_reading' and 'rn'.
If you look at 'lastlevel' again, you will notice that the values go down consistently; however, on April 18th it goes from 0.73 to 0.74, an increase of 0.01.
What I want to do is: whenever there is an increase at all, I want all 14 of that customer's rows removed from the output, i.e. I only want to see customers that have perfectly descending data with no increases.
Can you help?
WITH
deltas AS
(
-- For each [Shipto]; deduct the preceding row's value and record it as the [delta]
-- Note, each [Shipto]'s first row's delta will therefore be NULL
SELECT
*,
lastlevel - LAG(lastlevel) OVER (PARTITION BY Shipto ORDER BY Last_Read, lastlevel DESC) AS delta
FROM
yourTable
),
max_deltas AS
(
-- Get the maximum of the deltas per [Shipto]
SELECT
*,
MAX(delta) OVER (PARTITION BY Shipto) AS max_delta
FROM
deltas
)
-- Return only rows where the delta never exceeds 0 (thus, never ascending over any timestep)
SELECT
*
FROM
max_deltas
WHERE
max_delta <= 0
I've ordered by Last_Read, lastlevel DESC such that if two readings are on the same date, it is assumed that the highest value should be considered to have happened first.
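An equivalent way to express the same filter, under the same assumptions as the answer above (the yourTable placeholder and LAG being available, i.e. SQL Server 2012+), is to exclude any Shipto that has at least one positive delta; this sketch may read more directly than taking the MAX over a window:
WITH
deltas AS
(
    SELECT
        *,
        lastlevel - LAG(lastlevel) OVER (PARTITION BY Shipto ORDER BY Last_Read, lastlevel DESC) AS delta
    FROM
        yourTable
)
SELECT
    *
FROM
    deltas d
WHERE
    NOT EXISTS
    (
        -- drop the whole customer if any of its readings increased
        SELECT 1
        FROM deltas x
        WHERE x.Shipto = d.Shipto
          AND x.delta > 0
    )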
I have a view consisting of data from different tables. The major fields are BillNo, ITEM_FEE and GroupNo. I need to calculate the total discount for a given GroupNo. The discount calculation is based on the fractional part of the amount, grouped by BillNo (a single BillNo can have multiple entries). If there are multiple transactions for a single BillNo, the discount is calculated when the decimal part of the sum of ITEM_FEE is greater than 0; if there is only a single transaction and the decimal part of ITEM_FEE is greater than 0, then that decimal part is treated as the discount.
I have prepared a script and I am getting the total discount for a particular GroupNo.
declare @GroupNo as nvarchar(100)
set @GroupNo = '3051'

SELECT Sum(disc) Discount
FROM   --sum(ITEM_FEE) TotalAmount,
       (SELECT (SELECT CASE
                         WHEN ( Sum(item_fee) ) % 1 > 0 THEN Sum(( item_fee ) % 1)
                       END
                FROM view_bi_sales VBS
                WHERE VBS.billno = VB.billno
                GROUP BY billno) Disc
        FROM view_bi_sales VB
        WHERE groupno = @GroupNo) temp
The problem is that it takes almost 2 minutes to get the result.
Please help me to find the result faster if possible.
Thank you for all your help and support. As I was already calculating the sum of the decimal part of ITEM_FEE grouped by BillNo, there was no need to check whether it is greater than 0 or not. The query below gives me the desired output in less than 10 seconds:
select sum(discount)
from (select sum(ITEM_FEE % 1) as discount
      from view_BI_Sales
      where groupno = 3051
      group by BillNo) temp
If I understand correctly, you don't need a JOIN. This might help performance:
SELECT SUM(disc) as Discount
FROM (SELECT (CASE WHEN SUM(item_fee % 1) > 0
THEN SUM(item_fee % 1)
END) as disc
FROM view_bi_sales VBS
WHERE groupno = @GroupNo
GROUP BY billno
) vbs;
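Beyond rewriting the query, the remaining cost is likely the scan of the underlying table. If view_bi_sales is a plain view over a base table you control, an index covering the filter and grouping columns may help; the base table and index names below are assumptions for illustration, not details known from the post:
-- Hypothetical: replace dbo.bi_sales with whatever base table view_bi_sales actually reads from.
CREATE INDEX IX_bi_sales_groupno_billno
    ON dbo.bi_sales (groupno, billno)
    INCLUDE (item_fee);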
I have a table with the following structure:
timestamp-start, timestamp-stop
1,5
6,10
25,30
31,35
...
I am only interested in continuous timespans, i.e. where the break between a timestamp-stop and the following timestamp-start is less than 3.
How could I get the aggregated covered timespans as a result:
timestamp-start,timestamp-stop
1,10
25,35
The reason I am considering this is that a user may request a timespan that would need to return several thousand rows. However, most records are continuous, and using the above method could potentially reduce many thousands of rows down to just a dozen. Or is the added computation not worth the savings in bandwidth and latency?
You can group the time stamps in three steps:
Add a flag to determine where a new period starts (that is, a gap greater than 3).
Cumulatively sum the flag to assign groupings.
Re-aggregate with the new groupings.
The code looks like:
select min(ts_start) as ts_start, max(ts_end) as ts_end
from (select t.*,
sum(flag) over (order by ts_start) as grouping
from (select t.*,
(coalesce(ts_start - lag(ts_end) over (order by ts_start),0) > 3)::int as flag
from t
) t
) t
group by grouping;
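As a quick check against the sample data from the question (assuming PostgreSQL, since the ::int cast above is PostgreSQL syntax; the columns are renamed to ts_start/ts_end as in the answer, and the alias grp is just a rename of grouping):
with t (ts_start, ts_end) as (
    values (1, 5), (6, 10), (25, 30), (31, 35)
)
select min(ts_start) as ts_start, max(ts_end) as ts_end
from (select t.*,
             sum(flag) over (order by ts_start) as grp
      from (select t.*,
                   (coalesce(ts_start - lag(ts_end) over (order by ts_start), 0) > 3)::int as flag
            from t
           ) t
     ) t
group by grp
order by ts_start;
-- returns (1, 10) and (25, 35)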