Remove outliers by group in SQL

In my SQL Server table I need to delete outliers for each group separately. Here are my columns:
select
customer,
sku,
stuff,
action,
acnumber,
year
from
mytable
Sample data:
     customer  sku  year  stuff  action
     -----------------------------------
1           1    2  2017     10       0
2           1    2  2017     20       1
3           1    3  2017     30       0
4           1    3  2017     40       1
5           2    4  2017     50       0
6           2    4  2017     60       1
7           2    5  2017     70       0
8           2    5  2017     80       1
9           1    2  2018     10       0
10          1    2  2018     20       1
11          1    3  2018     30       0
12          1    3  2018     40       1
13          2    4  2018     50       0
14          2    4  2018     60       1
15          2    5  2018     70       0
16          2    5  2018     80       1
I need to delete outliers from the stuff variable, but separately for each customer+sku+year group.
Anything below the 25th percentile or above the 75th percentile should be considered an outlier, and this rule must be applied within each group.
How can I clean the dataset for further work?
Note that the dataset also contains the variable action (it takes the values 0 and 1). It is not a grouping variable, but outliers must be deleted only for the ZERO (0) category of the action variable.
In R this is done as follows:
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
new <- remove_outliers(vpg$stuff)
vpg=cbind(new,vpg)

Something like this, maybe (a window function cannot be used directly in a WHERE clause, so compute the rank in a CTE and delete through it):
WITH ranked AS (
    SELECT *, PERCENT_RANK() OVER (PARTITION BY customer, sku, year ORDER BY stuff) AS pr
    FROM mytable
    WHERE action = 0
)
DELETE FROM ranked WHERE pr < 0.25 OR pr > 0.75;
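If the goal is to mirror the R function's 1.5 * IQR fences exactly, rather than trimming everything outside the middle 50%, something along the following lines might work on SQL Server 2012 or later. This is only a sketch: the table and column names come from the question, while the alias b and the choice to compute the quartiles over the action = 0 rows only are assumptions on my part.
-- Delete action = 0 rows lying outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR,
-- with the quartiles computed per customer + sku + year group.
DELETE t
FROM mytable AS t
JOIN (
    SELECT DISTINCT customer, sku, year,
           PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY stuff)
               OVER (PARTITION BY customer, sku, year) AS q1,
           PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY stuff)
               OVER (PARTITION BY customer, sku, year) AS q3
    FROM mytable
    WHERE action = 0
) AS b
  ON b.customer = t.customer AND b.sku = t.sku AND b.year = t.year
WHERE t.action = 0
  AND (t.stuff < b.q1 - 1.5 * (b.q3 - b.q1)
    OR t.stuff > b.q3 + 1.5 * (b.q3 - b.q1));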

Related

Merge rows and convert a string value to a user-defined one when a condition related to other columns is matched

Assuming I'm dealing with this dataframe:
ID   Qualified  Year  Amount A  Amount B
1    No         2020  0         150
1    No         2019  0         100
1    Yes        2019  10        15
1    No         2018  0         100
1    Yes        2018  10        150
2    Yes        2020  0         200
2    No         2017  0         100
...  ...        ...   ...       ...
My desired output should be like this:
ID   Qualified  Year  Amount A  Amount B
1    No         2020  0         150
1    Partial    2019  10        115
1    Partial    2018  10        250
2    Yes        2020  0         200
2    No         2017  0         100
...  ...        ...   ...       ...
As you can see, the Qualified column gets a new merged value (Yes & No -> Partial, and Amount A + Amount B are summed) whenever a year within an ID includes both Yes and No in the Qualified column.
I don't know how to approach this. Could anyone suggest a methodology?
You can use the functions agg() and groupby() to perform this operation.
agg() allows you to use not only common aggregation functions (such as sum, mean, etc.) but also custom-defined functions.
I would do it as follows:
def agg_qualify(x):
    values = x.unique()
    if len(values) > 1:
        return 'Partial'
    return values[0]

df.groupby(['ID', 'Year']).agg({
    'Qualified': lambda x: agg_qualify(x),
    'Amount A': 'sum',
    'Amount B': 'sum',
}).reset_index()
Output:
ID Year Qualified Amount A Amount B
0 1 2018 Partial 10 250.0
1 1 2019 Partial 10 115.0
2 1 2020 No 0 150.0
3 2 2020 Yes 0 200.0

Segmentation total based on multiple conditions

data frame:-
ID spend month_diff
12 10 -1
12 10 -2
12 20 1
12 30 2
13 15 -1
13 20 -2
13 25 1
13 30 2
I want to get the total spend based on the month difference for a particular ID. A negative month_diff means spend by the customer in the last year and a positive one means this year, so I want to compare customers' spend for the past year and this year. The conditions are as follows:
Conditions:
if month_diff >= -2 and < 0, then cumulative spend for the negative months; flag = pre
if month_diff > 0 and <= 2, then cumulative spend for the positive months; flag = post
Desired data frame:-
ID spend month_diff tot_spend flag
12 10 -2 20 pre
12 30 2 50 post
13 20 -2 35 pre
13 30 2 55 post
Use numpy.sign with Series.shift, Series.ne and Series.cumsum to build groups of consecutive signs, then pass them to DataFrame.groupby and aggregate with GroupBy.last and sum.
Finally, use numpy.select:
import numpy as np

a = np.sign(df['month_diff'])
g = a.ne(a.shift()).cumsum()
df1 = (df.groupby(['ID', g])
         .agg({'month_diff': 'last', 'spend': 'sum'})
         .reset_index(level=1, drop=True)
         .reset_index())
df1['flag'] = np.select([df1['month_diff'].ge(-2) & df1['month_diff'].lt(0),
                         df1['month_diff'].gt(0) & df1['month_diff'].le(2)],
                        ['pre', 'post'], default='another val')
print(df1)
ID month_diff spend flag
0 12 -2 20 pre
1 12 2 50 post
2 13 -2 35 pre
3 13 2 55 post

SQL - Select rows after reaching minimum value/threshold

Using SQL Server Management Studio. My data set is as below.
ID Days Value Threshold
A 1 10 30
A 2 20 30
A 3 34 30
A 4 25 30
A 5 20 30
B 1 5 15
B 2 10 15
B 3 12 15
B 4 17 15
B 5 20 15
I want to run a query so only rows after the threshold has been reached are selected for each ID. Also, I want to create a new days column starting at 1 from where the rows are selected. The expected output for the above dataset will look like
ID Days Value Threshold NewDayColumn
A 3 34 30 1
A 4 25 30 2
A 5 20 30 3
B 4 17 15 1
B 5 20 15 2
It doesn't matter if the data goes below the threshold for the latter rows, I want to take the first row when threshold is crossed as 1 and continue counting rows for the ID.
Thank you!
You can use window functions for this. Here is one method:
select t.*, row_number() over (partition by id order by days) as newDayColumn
from (select t.*,
min(case when value > threshold then days end) over (partition by id) as threshold_days
from t
) t
where days >= threshold_days;
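A different way to express the same idea, if it reads more naturally: keep a running count of threshold crossings per id and number the rows once that count becomes positive. This is only a sketch under the same assumptions as above (the data lives in a table referred to as t, and the ordered window aggregate needs SQL Server 2012 or later):
select t2.*,
       -- number the surviving rows per id, starting at 1
       row_number() over (partition by id order by days) as newDayColumn
from (select t.*,
             -- running count of rows up to this day whose value exceeds the threshold
             sum(case when value > threshold then 1 else 0 end)
                 over (partition by id order by days) as crossings_so_far
      from t
     ) t2
where crossings_so_far > 0;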

SQL Aggregate functions with groupings

I need to create some checks to make sure that students are enrolled in the correct courses with the correct number of units. Here is my SQL at the moment.
SELECT StudentID
,AssessmentCode
,BoardCode
,BoardCategory
,BoardUnits
,sum(cast(boardunits as int)) over (partition by studentid,boardcategory) as UnitCount
,Count(boardcategory) over (partition by studentid) as SubjectCount
FROM uvNCStudentSubjectDetails
where fileyear = 2015
and filesemester = 1
and studentyearlevel = 11
and StudentIBFlag = 0
order by Studentnameinternal,BoardCategory
This gives me the following info...
StudentID AssessmentCode BoardCode BoardCategory BoardUnits UnitCount SubjectCount
61687 11TECDAT 11080 A 2 11 7
61687 11PRS1U 11350 A 1 11 7
61687 11MATGEN 11235 A 2 11 7
61687 11LANGRB 11870 A 2 11 7
61687 11ENGSTD 11130 A 2 11 7
61687 11GEOGEO 11190 A 2 11 7
64549 11TECIND 11200 A 2 10 7
64549 11SCIPHY 11310 A 2 10 7
64549 11SCIEAE 11100 A 2 10 7
64549 11MATGEN 11235 A 2 10 7
64549 11ENGSTD 11130 A 2 10 7
64549 11TECHOS 26501 B 2 2 7
64549 11MUSDRS 63212 C 1 1 7
45461 11ECOECO 11110 A 2 13 7
45461 11ENGADV 11140 A 2 13 7
45461 11HISMOD 11270 A 2 13 7
45461 11HISLST 11220 A 2 13 7
45461 11MATMAT 11240 A 2 13 7
45461 11PRS1U 11350 A 1 13 7
45461 11SCIBIO 11030 A 2 13 7
Note that for the first student I have a count of Category A subject units (11 in total); he is only doing Category A subjects. The second student has 10 units of Category A subjects, is doing one Category B subject worth 2 units and one Category C subject worth 1 unit. The final student just has 13 Category A units.
Now what I would really like is something like this...!
StudentID Sum A Units Sum B Units Sum C Units Sum A Units + Sum B Units Count of Subjects
61687 11 0 0 11 7
64549 10 2 1 12 7
45461 13 0 0 13 7
So I would like some aggregate functions with each student grouped onto only one row and the sums of his different units as separate fields. I would also like a field which sums the Category A and B units, and a field which gives a count of the total number of subjects they are doing. I could then use this data to set up some warning messages if a student is not doing the correct number of A or B units, etc.
I have played around with common table expressions, subqueries etc. but am not really sure what I am doing, and am not sure which is the correct way to get the data in the form I want.
Is anyone able to help?
SELECT
    STUDENTID,
    SUM(CASE BOARDCATEGORY WHEN 'A' THEN CAST(BOARDUNITS AS int) ELSE 0 END) AS SUM_A_UNITS,
    SUM(CASE BOARDCATEGORY WHEN 'B' THEN CAST(BOARDUNITS AS int) ELSE 0 END) AS SUM_B_UNITS,
    SUM(CASE BOARDCATEGORY WHEN 'C' THEN CAST(BOARDUNITS AS int) ELSE 0 END) AS SUM_C_UNITS,
    SUM(CASE WHEN BOARDCATEGORY IN ('A', 'B') THEN CAST(BOARDUNITS AS int) ELSE 0 END) AS SUM_A_PLUS_B_UNITS,
    COUNT(BOARDCODE) AS COUNT_OF_SUBJECTS
FROM (
    SELECT StudentID
          ,AssessmentCode
          ,BoardCode
          ,BoardCategory
          ,BoardUnits
          ,sum(cast(boardunits as int)) over (partition by studentid, boardcategory) as UnitCount
          ,Count(boardcategory) over (partition by studentid) as SubjectCount
    FROM uvNCStudentSubjectDetails
    where fileyear = 2015
      and filesemester = 1
      and studentyearlevel = 11
      and StudentIBFlag = 0
) AS StudentSubjects
GROUP BY STUDENTID;
I wrapped your SQL statement inside the solution so that you can see straight away what it does (note that the inner query cannot keep its ORDER BY, since a derived table cannot be ordered, and it needs an alias).
Use SUM with CASE (i.e. sum the units only when a condition is met).
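For what it's worth, a simpler sketch that skips the inner query entirely and aggregates the filtered view directly should give the same one-row-per-student shape; the output column names here are only illustrative:
-- Per-student unit totals by category, straight from the filtered view.
SELECT StudentID,
       SUM(CASE WHEN BoardCategory = 'A' THEN CAST(BoardUnits AS int) ELSE 0 END) AS SumAUnits,
       SUM(CASE WHEN BoardCategory = 'B' THEN CAST(BoardUnits AS int) ELSE 0 END) AS SumBUnits,
       SUM(CASE WHEN BoardCategory = 'C' THEN CAST(BoardUnits AS int) ELSE 0 END) AS SumCUnits,
       SUM(CASE WHEN BoardCategory IN ('A', 'B') THEN CAST(BoardUnits AS int) ELSE 0 END) AS SumAPlusBUnits,
       COUNT(*) AS SubjectCount   -- one row per enrolled subject in the filtered view
FROM uvNCStudentSubjectDetails
WHERE fileyear = 2015
  AND filesemester = 1
  AND studentyearlevel = 11
  AND StudentIBFlag = 0
GROUP BY StudentID;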

Grouping query into group and subgroup

I want to group my data using SQL or R so that I can get the top or bottom 10 Subarea_codes for each Company and Area_code; in essence, the Subarea_codes within each Area_code where each Company has its largest or smallest results.
data.csv
Area_code Subarea_code Company Result
10 101 A 15
10 101 P 10
10 101 C 4
10 102 A 10
10 102 P 8
10 102 C 5
11 111 A 15
11 111 P 20
11 111 C 5
11 112 A 10
11 112 P 5
11 112 C 10
result.csv should be like this
Company Area_code Largest_subarea_code Result Smallest_subarea_code Result
A 10 101 15 102 10
P 10 101 10 102 8
C 10 102 5 101 4
A 11 111 15 112 10
P 11 111 20 112 5
C 11 112 10 111 5
Within each Area_code there can be hundreds of Subarea_codes but I only want the top and bottom 10 for each Company.
Also this doesn't have to be resolved in one query, but can be divided into two queries, meaning smallest is presented in results_10_smallest and largest in result_10_largest. But I'm hoping I can accomplish this with one query for each result.
What I've tried:
SELECT Company, Area_code, Subarea_code, MAX(Result) AS Max_result
FROM data
GROUP BY Subarea_code
ORDER BY Company;
This gives me all the Companies with the highest results within each Subarea_code, which would mean: A, A, P, A-C for the data above.
Using the sqldf package:
df <- read.table(text="Area_code Subarea_code Company Result
10 101 A 15
10 101 P 10
10 101 C 4
10 102 A 10
10 102 P 8
10 102 C 5
11 111 A 15
11 111 P 20
11 111 C 5
11 112 A 10
11 112 P 5
11 112 C 10", header=TRUE)
library(sqldf)
mymax <- sqldf("select Company,
                       Area_code,
                       max(Result) Max_result
                from df
                group by Company, Area_code")
mymaxres <- sqldf("select d.Company,
                          d.Area_code,
                          d.Subarea_code Largest_subarea_code,
                          d.Result
                   from df d, mymax m
                   where d.Company = m.Company and
                         d.Area_code = m.Area_code and
                         d.Result = m.Max_result")
mymin <- sqldf("select Company,
                       Area_code,
                       min(Result) Min_result
                from df
                group by Company, Area_code")
myminres <- sqldf("select d.Company,
                          d.Area_code,
                          d.Subarea_code Smallest_subarea_code,
                          d.Result
                   from df d, mymin m
                   where d.Company = m.Company and
                         d.Area_code = m.Area_code and
                         d.Result = m.Min_result")
result <- sqldf("select a.*, b.Smallest_subarea_code, b.Result
                 from mymaxres a, myminres b
                 where a.Company = b.Company and
                       a.Area_code = b.Area_code")
If you are already doing this in R, why not use the much more efficient data.table instead of sqldf with SQL syntax? Assuming data is your data set, simply:
library(data.table)
setDT(data)[, list(Largest_subarea_code = Subarea_code[which.max(Result)],
Resultmax = max(Result),
Smallest_subarea_code = Subarea_code[which.min(Result)],
Resultmin = min(Result)), by = list(Company, Area_code)]
# Company Area_code Largest_subarea_code Resultmax Smallest_subarea_code Resultmin
# 1: A 10 101 15 102 10
# 2: P 10 101 10 102 8
# 3: C 10 102 5 101 4
# 4: A 11 111 15 112 10
# 5: P 11 111 20 112 5
# 6: C 11 112 10 111 5
There seems to be a discrepancy between the output shown and the description. The description asks for the top 10 and bottom 10 results for each Area code/Company but the sample output shows only the top 1 and the bottom 1. For example, for area code 10 and company A, subarea 101 is top with a result of 15 and subarea 102 is 2nd largest with a result of 10, so according to the description there should be two rows for that company/area code combination. (If there were more data there would be up to 10 rows for that company/area code combination.)
We give two answers. The first assumes the top 10 and bottom 10 are wanted for each company and area code as in the question's description and the second assumes only the top and bottom for each company and area code as in the question's sample output.
1) Top/Bottom 10
Here we assume that the top 10 and bottom 10 results for each Company/Area code are wanted. If it's just the top and bottom one, then see (2) later on (or replace 10 with 1 in the code here). Bottom10 is all rows for which there are 10 or fewer subareas for the same area code and company with equal or smaller results. Top10 is similar.
library(sqldf)
Bottom10 <- sqldf("select a.Company,
a.Area_code,
a.Subarea_code Bottom_Subarea,
a.Result Bottom_Result,
count(*) Bottom_Rank
from df a join df b
on a.Company = b.Company and
a.Area_code = B.Area_code and
b.Result <= a.Result
group by a.Company, a.Area_code, a.Subarea_code
having count(*) <= 10")
Top10 <- sqldf("select a.Company,
a.Area_code,
a.Subarea_code Top_Subarea,
a.Result Top_Result,
count(*) Top_Rank
from df a join df b
on a.Company = b.Company and
a.Area_code = B.Area_code and
b.Result >= a.Result
group by a.Company, a.Area_code, a.Subarea_code
having count(*) <= 10")
The description indicated you wanted the top 10 OR the bottom 10 for each company/area code in which case just use one of the results above. If you want to combine them we show a merge below. We have added a Rank column to indicate the smallest/largest (Rank is 1), second smallest/largest (Rank is 2), etc.
sqldf("select t.Area_code,
t.Company,
t.Top_Rank Rank,
t.Top_Subarea,
t.Top_Result,
b.Bottom_Subarea,
b.Bottom_Result
from Bottom10 b join Top10 t
on t.Area_code = b.Area_code and
t.Company = b.Company and
t.Top_Rank = b.Bottom_Rank
order by t.Area_code, t.Company, t.Top_Rank")
giving:
Area_code Company Rank Top_Subarea Top_Result Bottom_Subarea Bottom_Result
1 10 A 1 101 15 102 10
2 10 A 2 102 10 101 15
3 10 C 1 102 5 101 4
4 10 C 2 101 4 102 5
5 10 P 1 101 10 102 8
6 10 P 2 102 8 101 10
7 11 A 1 111 15 112 10
8 11 A 2 112 10 111 15
9 11 C 1 112 10 111 5
10 11 C 2 111 5 112 10
11 11 P 1 111 20 112 5
12 11 P 2 112 5 111 20
Note that this format makes less sense if there are ties and, in fact, could generate more than 10 rows for a Company/Area code, so you might just want to use the individual Top10 and Bottom10 in that case. You could also consider jittering df$Result if this is a problem:
df$Result <- jitter(df$Result)
# now perform SQL statements
2) Top/Bottom Only
Here we give only the top and bottom results and the corresponding subareas for each company/area code. Note that this uses an extension to SQL supported by SQLite, and the SQL code is substantially simpler:
Bottom1 <- sqldf("select Company,
Area_code,
Subarea_code Bottom_Subarea,
min(Result) Bottom_Result
from df
group by Company, Area_code")
Top1 <- sqldf("select Company,
Area_code,
Subarea_code Top_Subarea,
max(Result) Top_Result
from df
group by Company, Area_code")
sqldf("select a.Company,
a.Area_code,
Top_Subarea,
Top_Result,
Bottom_Subarea
Bottom_Result
from Top1 a join Bottom1 b
on a.Company = b.Company and
a.Area_code = b.Area_code
order by a.Area_code, a.Company")
This gives:
Company Area_code Top_Subarea Top_Result Bottom_Result
1 A 10 101 15 102
2 C 10 102 5 101
3 P 10 101 10 102
4 A 11 111 15 112
5 C 11 112 10 111
6 P 11 111 20 112
Update: Correction, and added (2).
The above answers are fine for fetching the max result.
This solves the top-10 issue:
data.top <- data[ave(-data$Result, data$Company, data$Area_code, FUN = rank) <= 10, ]
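If you would rather stay on the SQL side of the question, a rough window-function counterpart of the rank filter above could look like this; it assumes a dialect that supports ROW_NUMBER (e.g. SQL Server or a recent SQLite) and the table name data from the question:
-- Keep the 10 largest and 10 smallest Results per Company + Area_code.
SELECT Company, Area_code, Subarea_code, Result
FROM (SELECT Company, Area_code, Subarea_code, Result,
             ROW_NUMBER() OVER (PARTITION BY Company, Area_code
                                ORDER BY Result DESC) AS rank_desc,
             ROW_NUMBER() OVER (PARTITION BY Company, Area_code
                                ORDER BY Result ASC) AS rank_asc
      FROM data) ranked
WHERE rank_desc <= 10 OR rank_asc <= 10
ORDER BY Company, Area_code, Result DESC;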
In this script the user declares the company; the script then returns the top 10 max results (likewise for min values).
Result <- NULL
A <- read.table("/your-file.txt", header = TRUE, sep = "\t", na.strings = "NA")
Company <- A$Company == "A"   # can be A, C, P or another value
Subarea <- unique(A$Subarea_code)
for (i in 1:length(Subarea)) {
  Result[i] <- max(A$Result[Company & A$Subarea_code == Subarea[i]])
}
Res1 <- t(rbind(Subarea, Result))
Res2 <- Res1[order(-Res1[, 2]), ]
Res2[1:10, ]