Grouping query into group and subgroup - sql

I want to group my data using SQL or R so that I can get top or bottom 10 Subarea_codes for each Company and Area_code. In essence: the Subarea_codes within the Area_codes where each Company has its largest or smallest result.
data.csv
Area_code Subarea_code Company Result
10 101 A 15
10 101 P 10
10 101 C 4
10 102 A 10
10 102 P 8
10 102 C 5
11 111 A 15
11 111 P 20
11 111 C 5
11 112 A 10
11 112 P 5
11 112 C 10
result.csv should be like this
Company Area_code Largest_subarea_code Result Smallest_subarea_code Result
A 10 101 15 102 10
P 10 101 10 102 8
C 10 102 5 101 4
A 11 111 15 112 10
P 11 111 20 112 5
C 11 112 10 111 5
Within each Area_code there can be hundreds of Subarea_codes but I only want the top and bottom 10 for each Company.
Also this doesn't have to be resolved in one query, but can be divided into two queries, meaning smallest is presented in results_10_smallest and largest in result_10_largest. But I'm hoping I can accomplish this with one query for each result.
What I've tried:
SELECT Company, Area_code, Subarea_code, MAX(Result)
AS Max_result
FROM data
GROUP BY Subarea_code
ORDER BY Company
;
This gives me all the Companies with the highest results within each Subarea_code. Which would mean: A, A, P, A-C for the data above.

Using sqldf package:
df <- read.table(text="Area_code Subarea_code Company Result
10 101 A 15
10 101 P 10
10 101 C 4
10 102 A 10
10 102 P 8
10 102 C 5
11 111 A 15
11 111 P 20
11 111 C 5
11 112 A 10
11 112 P 5
11 112 C 10", header=TRUE)
library(sqldf)
mymax <- sqldf("select Company,
Area_code,
max(Subarea_code) Largest_subarea_code
from df
group by Company,Area_code")
mymaxres <- sqldf("select d.Company,
d.Area_code,
m.Largest_subarea_code,
d.Result
from df d, mymax m
where d.Company=m.Company and
d.Subarea_code=m.Largest_subarea_code")
mymin <- sqldf("select Company,
Area_code,
min(Subarea_code) Smallest_subarea_code
from df
group by Company,Area_code")
myminres <- sqldf("select d.Company,
d.Area_code,
m.Smallest_subarea_code,
d.Result
from df d, mymin m
where d.Company=m.Company and
d.Subarea_code=m.Smallest_subarea_code")
result <- sqldf("select a.*, b.Smallest_subarea_code,b.Result
from mymaxres a, myminres b
where a.Company=b.Company and
a.Area_code=b.Area_code")

If you are already doing this in R, why not use the much more efficient data.table instead of going through SQL syntax with sqldf? Assuming data is your data set, simply:
library(data.table)
setDT(data)[, list(Largest_subarea_code = Subarea_code[which.max(Result)],
Resultmax = max(Result),
Smallest_subarea_code = Subarea_code[which.min(Result)],
Resultmin = min(Result)), by = list(Company, Area_code)]
# Company Area_code Largest_subarea_code Resultmax Smallest_subarea_code Resultmin
# 1: A 10 101 15 102 10
# 2: P 10 101 10 102 8
# 3: C 10 102 5 101 4
# 4: A 11 111 15 112 10
# 5: P 11 111 20 112 5
# 6: C 11 112 10 111 5

There seems to be a discrepancy between the output shown and the description. The description asks for the top 10 and bottom 10 results for each Area code/Company, but the sample output shows only the top 1 and the bottom 1. For example, for area code 10 and company A, subarea 101 is top with a result of 15 and subarea 102 is second largest with a result of 10, so according to the description there should be two rows for that company/area code combination. (If there were more data there would be up to 10 rows for that company/area code combination.)
We give two answers. The first assumes the top 10 and bottom 10 are wanted for each company and area code as in the question's description and the second assumes only the top and bottom for each company and area code as in the question's sample output.
1) Top/Bottom 10
Here we assume that the top 10 and bottom 10 results for each Company/Area code are wanted. If it's just the top and bottom one, see (2) later on (or replace 10 with 1 in the code here). Bottom10 is all rows for which there are 10 or fewer subareas for the same area code and company with equal or smaller results. Top10 is similar.
library(sqldf)
Bottom10 <- sqldf("select a.Company,
a.Area_code,
a.Subarea_code Bottom_Subarea,
a.Result Bottom_Result,
count(*) Bottom_Rank
from df a join df b
on a.Company = b.Company and
a.Area_code = b.Area_code and
b.Result <= a.Result
group by a.Company, a.Area_code, a.Subarea_code
having count(*) <= 10")
Top10 <- sqldf("select a.Company,
a.Area_code,
a.Subarea_code Top_Subarea,
a.Result Top_Result,
count(*) Top_Rank
from df a join df b
on a.Company = b.Company and
a.Area_code = b.Area_code and
b.Result >= a.Result
group by a.Company, a.Area_code, a.Subarea_code
having count(*) <= 10")
The description indicated you wanted the top 10 OR the bottom 10 for each company/area code in which case just use one of the results above. If you want to combine them we show a merge below. We have added a Rank column to indicate the smallest/largest (Rank is 1), second smallest/largest (Rank is 2), etc.
sqldf("select t.Area_code,
t.Company,
t.Top_Rank Rank,
t.Top_Subarea,
t.Top_Result,
b.Bottom_Subarea,
b.Bottom_Result
from Bottom10 b join Top10 t
on t.Area_code = b.Area_code and
t.Company = b.Company and
t.Top_Rank = b.Bottom_Rank
order by t.Area_code, t.Company, t.Top_Rank")
giving:
Area_code Company Rank Top_Subarea Top_Result Bottom_Subarea Bottom_Result
1 10 A 1 101 15 102 10
2 10 A 2 102 10 101 15
3 10 C 1 102 5 101 4
4 10 C 2 101 4 102 5
5 10 P 1 101 10 102 8
6 10 P 2 102 8 101 10
7 11 A 1 111 15 112 10
8 11 A 2 112 10 111 15
9 11 C 1 112 10 111 5
10 11 C 2 111 5 112 10
11 11 P 1 111 20 112 5
12 11 P 2 112 5 111 20
Note that this format makes less sense if there are ties and, in fact, could generate more than 10 rows for a Company/Area code, so you might just want to use the individual Top10 and Bottom10 in that case. You could also consider jittering df$Result if this is a problem:
df$Result <- jitter(df$Result)
# now perform SQL statements
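The self-join ranking behind Top10 can also be tried outside R. The following is a minimal sketch, assuming Python's built-in sqlite3 module and an in-memory table named df loaded with the sample rows from the question. The names top_sql, top_n and n are illustrative only; n is set to 1 so the rows can be compared with the question's expected output.

```python
import sqlite3

# Sample rows from the question: (Area_code, Subarea_code, Company, Result)
rows = [
    (10, 101, "A", 15), (10, 101, "P", 10), (10, 101, "C", 4),
    (10, 102, "A", 10), (10, 102, "P", 8), (10, 102, "C", 5),
    (11, 111, "A", 15), (11, 111, "P", 20), (11, 111, "C", 5),
    (11, 112, "A", 10), (11, 112, "P", 5), (11, 112, "C", 10),
]

con = sqlite3.connect(":memory:")
con.execute(
    "create table df (Area_code int, Subarea_code int, Company text, Result int)"
)
con.executemany("insert into df values (?, ?, ?, ?)", rows)

# For each row a, count the rows b in the same Company/Area_code whose
# Result is at least as large. That count is a's rank from the top.
top_sql = """
select a.Company, a.Area_code, a.Subarea_code, a.Result, count(*) as Top_Rank
from df a join df b
  on a.Company = b.Company
 and a.Area_code = b.Area_code
 and b.Result >= a.Result
group by a.Company, a.Area_code, a.Subarea_code, a.Result
having count(*) <= ?
order by a.Area_code, a.Company, Top_Rank
"""

n = 1  # hypothetical limit; use 10 for the top 10
top_n = con.execute(top_sql, (n,)).fetchall()
for row in top_n:
    print(row)
```

Flipping the comparison to b.Result <= a.Result gives the ranking from the bottom instead.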
2) Top/Bottom Only
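Here we give only the top and bottom results and the corresponding subareas for each company/area code. Note that this uses an extension to SQL supported by SQLite (a bare column selected alongside a single min() or max() takes its value from the row holding that minimum or maximum), and the SQL code is substantially simpler: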
Bottom1 <- sqldf("select Company,
Area_code,
Subarea_code Bottom_Subarea,
min(Result) Bottom_Result
from df
group by Company, Area_code")
Top1 <- sqldf("select Company,
Area_code,
Subarea_code Top_Subarea,
max(Result) Top_Result
from df
group by Company, Area_code")
sqldf("select a.Company,
a.Area_code,
Top_Subarea,
Top_Result,
Bottom_Subarea,
Bottom_Result
from Top1 a join Bottom1 b
on a.Company = b.Company and
a.Area_code = b.Area_code
order by a.Area_code, a.Company")
This gives:
Company Area_code Top_Subarea Top_Result Bottom_Subarea Bottom_Result
1 A 10 101 15 102 10
2 C 10 102 5 101 4
3 P 10 101 10 102 8
4 A 11 111 15 112 10
5 C 11 112 10 111 5
6 P 11 111 20 112 5
Update: Correction and added (2).
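The SQLite behaviour that (2) relies on can be checked directly. This is a small sketch, assuming Python's built-in sqlite3 module and the question's sample rows in a table named df; the variable names top1 and bottom1 are illustrative. The sample data has no ties within a Company/Area_code group, so the chosen row is unambiguous here.

```python
import sqlite3

# Sample rows from the question: (Area_code, Subarea_code, Company, Result)
rows = [
    (10, 101, "A", 15), (10, 101, "P", 10), (10, 101, "C", 4),
    (10, 102, "A", 10), (10, 102, "P", 8), (10, 102, "C", 5),
    (11, 111, "A", 15), (11, 111, "P", 20), (11, 111, "C", 5),
    (11, 112, "A", 10), (11, 112, "P", 5), (11, 112, "C", 10),
]

con = sqlite3.connect(":memory:")
con.execute(
    "create table df (Area_code int, Subarea_code int, Company text, Result int)"
)
con.executemany("insert into df values (?, ?, ?, ?)", rows)

# Subarea_code is a "bare" column: neither grouped nor aggregated.
# With a single max() (or min()) in the query, SQLite takes it from
# the row that holds that maximum (or minimum).
top1 = con.execute("""
select Company, Area_code, Subarea_code, max(Result)
from df
group by Company, Area_code
order by Area_code, Company
""").fetchall()

bottom1 = con.execute("""
select Company, Area_code, Subarea_code, min(Result)
from df
group by Company, Area_code
order by Area_code, Company
""").fetchall()

print(top1)
print(bottom1)
```

Other databases generally reject or handle bare columns differently, so this shortcut should be treated as SQLite-specific.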

The answers above are fine for fetching the max result.
This handles the top-10 case:
data.top <- data[ave(-data$Result, data$Company, data$Area_code, FUN = rank) <= 10, ]

In this script the user chooses the company. The script then lists the top 10 subareas by maximum result (the same approach works for minimum values).
A <- read.table("/your-file.txt", header=TRUE, sep="\t", na.strings="NA")
Company <- A$Company == "A" # can be A, C, P or other values
Subarea <- unique(A$Subarea_code)
# preallocate one slot per subarea
Result <- numeric(length(Subarea))
for (i in seq_along(Subarea))
{Result[i] <- max(A$Result[Company & A$Subarea_code == Subarea[i]])}
Res1 <- cbind(Subarea, Result)
# sort by Result, largest first
Res2 <- Res1[order(-Res1[, 2]), ]
head(Res2, 10)

Related

Aggregate from two tables

I have two tables
Table 'Sales Line' (SL)
Date "Entry No" Item Qty
(PK)
01/01/2018 1 ABC 1
01/02/2018 2 ABC 2
03/02/2018 3 DEF 1
04/06/2018 4 DEF 3
01/01/2019 5 DEF 1
06/06/2019 6 ABC 2
Table 'Cost Breakdown' (CB)
"SL Entry No" Cost
(FK)
1 10
1 15
2 5
3 25
4 10
4 10
5 5
6 5
6 10
Expected result:
Item Tot_Qty Tot_Cost
ABC 3 30
DEF 4 45
Note that I'm only interested in transactions from 2018.
How do I aggregate Tot_Qty and Tot_Cost? Thank you.
With the query suggested by @GMB, the result is:
Item Tot_Qty Tot_Cost
ABC 4 30
DEF 7 45
So each line from SL is repeated once for every corresponding line in CB.
You can join both tables, filter on the date, and aggregate by item:
select sl.item, sum(sl.qty) tot_qty, sum(sl.qty * cb.cost) tot_cost
from sales_line sl
inner join cost_breakdown cb on cb.sl_entry_no = sl.entry_no
where sl.date >= '20180101' and sl.date < '20190101'
group by sl.item
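The row duplication described in the question's follow-up (each sales line repeated once per cost line) can be avoided by summing cost_breakdown per entry before joining. The following is a minimal sketch, assuming Python's built-in sqlite3 module, the table and column names used in the query above, and that the question's dates are dd/mm/yyyy (stored here as ISO text; only the year matters for the filter).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table sales_line (date text, entry_no integer primary key, item text, qty integer);
create table cost_breakdown (sl_entry_no integer, cost integer);
insert into sales_line values
  ('2018-01-01', 1, 'ABC', 1), ('2018-02-01', 2, 'ABC', 2),
  ('2018-02-03', 3, 'DEF', 1), ('2018-06-04', 4, 'DEF', 3),
  ('2019-01-01', 5, 'DEF', 1), ('2019-06-06', 6, 'ABC', 2);
insert into cost_breakdown values
  (1, 10), (1, 15), (2, 5), (3, 25), (4, 10), (4, 10), (5, 5), (6, 5), (6, 10);
""")

# Collapse cost_breakdown to one row per sales line first, so the join
# is one-to-one and qty is counted once per sales line.
totals = con.execute("""
select sl.item, sum(sl.qty) as tot_qty, sum(cb.cost) as tot_cost
from sales_line sl
join (select sl_entry_no, sum(cost) as cost
      from cost_breakdown
      group by sl_entry_no) cb
  on cb.sl_entry_no = sl.entry_no
where sl.date >= '2018-01-01' and sl.date < '2019-01-01'
group by sl.item
order by sl.item
""").fetchall()

print(totals)
```

On the sample data this yields the totals listed under "Expected result" in the question.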

remove outliers by group in sql

In my SQL Server table, I must remove outliers for each group separately. Here are my columns:
select
customer,
sku,
stuff,
action,
acnumber,
year
from
mytable
Sample data:
customer sku year stuff action
-----------------------------------
1 1 2 2017 10 0
2 1 2 2017 20 1
3 1 3 2017 30 0
4 1 3 2017 40 1
5 2 4 2017 50 0
6 2 4 2017 60 1
7 2 5 2017 70 0
8 2 5 2017 80 1
9 1 2 2018 10 0
10 1 2 2018 20 1
11 1 3 2018 30 0
12 1 3 2018 40 1
13 2 4 2018 50 0
14 2 4 2018 60 1
15 2 5 2018 70 0
16 2 5 2018 80 1
I must remove outliers from the stuff variable, separately for each customer+sku+year group.
Everything below the 25th percentile or above the 75th percentile should be considered an outlier, and this rule must be applied within each group.
How can I clean the dataset for further work?
Note that the dataset has an action variable (it takes the values 0 and 1). It is not a grouping variable, but outliers must be deleted only for rows where action is 0.
In R this is solved as follows:
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
new <- remove_outliers(vpg$stuff)
vpg=cbind(new,vpg)
Something like this, maybe (window functions are not allowed directly in WHERE, so the rank is computed in a CTE first):
WITH ranked AS (
SELECT stuff, PERCENT_RANK() OVER (PARTITION BY customer, sku, year ORDER BY stuff) AS pr
FROM mytable
)
DELETE FROM ranked
WHERE pr < .25 OR pr > .75
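The percentile-rank idea can be tried end to end on a toy table. The following is a sketch only, assuming Python's built-in sqlite3 module with an SQLite build that has window functions (3.25 or later), not SQL Server. The data is hypothetical, and the choice to rank only the action = 0 rows is one reading of the question, not something the question states.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
create table mytable (customer int, sku int, year int, stuff int, "action" int)
""")

# Hypothetical data: one group with five action = 0 rows and one action = 1 row.
data = [(1, 2, 2017, s, 0) for s in (10, 20, 30, 40, 500)]
data.append((1, 2, 2017, 999, 1))
con.executemany("insert into mytable values (?, ?, ?, ?, ?)", data)

# Rank the action = 0 rows within each customer/sku/year group, then
# delete those whose percent rank falls outside [0.25, 0.75].
con.execute("""
delete from mytable
where rowid in (
    select rid
    from (
        select rowid as rid,
               percent_rank() over (
                   partition by customer, sku, year
                   order by stuff
               ) as pr
        from mytable
        where "action" = 0
    )
    where pr < 0.25 or pr > 0.75
)
""")

remaining = con.execute(
    'select stuff, "action" from mytable order by stuff'
).fetchall()
print(remaining)
```

With five ranked rows the percent ranks are 0, 0.25, 0.5, 0.75 and 1, so the smallest and largest values are removed and the action = 1 row is left untouched.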

Table returning cumulative sum

I have this table. It contains a column with points (a), a column with a player id (b) and a column with games (c). I would like to use SQL to transform this table so that the values in column a are summed cumulatively per player. This should result in the second table below, where column d contains the running total, column e contains the player id and column f the game number.
So I would like this:
a b c
1 385 11255 1
2 378 11178 1
3 370 11551 1
4 264 11255 2
5 100 11178 2
6 405 11551 2
7 200 11255 3
8 412 11178 3
9 50 11551 3
Into this:
d e f
385 11255 1
649 11255 2
849 11255 3
378 11178 1
478 11178 2
890 11178 3
370 11551 1
775 11551 2
825 11551 3
You can use the SUM() OVER() window function (if your version of SQL Server supports it)
select b,c,sum(a) over(partition by b order by c) as running_sum
from tbl
On versions that don't support it, you can do this with cross apply.
select t.b,t.c,t1.total
from tbl t
cross apply (select sum(a) as total from tbl t1 where t1.b=t.b and t1.c<=t.c) t1
I think you want something like this:
select sum(a) over (partition by b order by c) as d, b as e, c as f
from t
order by e, f;
Cumulative sums with this syntax are supported since SQL Server 2012.
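Both approaches from the answers can be compared on the question's data. This is a minimal sketch, assuming Python's built-in sqlite3 module (SQLite 3.25 or later for the window function) rather than SQL Server; since SQLite has no cross apply, the second query uses a correlated scalar subquery with the same logic. The table name tbl follows the first answer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table tbl (a int, b int, c int)")
con.executemany("insert into tbl values (?, ?, ?)", [
    (385, 11255, 1), (378, 11178, 1), (370, 11551, 1),
    (264, 11255, 2), (100, 11178, 2), (405, 11551, 2),
    (200, 11255, 3), (412, 11178, 3), (50, 11551, 3),
])

# Running total per player (b), ordered by game number (c).
windowed = con.execute("""
select sum(a) over (partition by b order by c) as d, b as e, c as f
from tbl
order by e, f
""").fetchall()

# Same result without window functions: for each row, sum the points of
# the same player's games up to and including the current game.
correlated = con.execute("""
select (select sum(t1.a) from tbl t1
        where t1.b = t.b and t1.c <= t.c) as d,
       t.b as e, t.c as f
from tbl t
order by e, f
""").fetchall()

for row in windowed:
    print(row)
```

The correlated form rescans the table for every row, so the window function is usually the better choice where it is available.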

Group By and get top N in Simple SQL

I have following table in SQLite
BANK:
user-id sender-name receiver-name amount
----------------------------------------
1 A B 200
2 A C 250
3 A B 400
4 A B 520
4 A D 120
4 A D 130
4 A B 110
4 A B 300
4 A B 190
4 A C 230
4 A B 110
4 A C 40
4 A C 80
I need to find the top 3 transactions for each receiver. Solutions exist for several other databases, but they are not compatible with SQLite because they rely on functions such as PARTITION and RANK, or on user-defined variables.
I need a solution in simple SQL queries so that it can be used with SQLite.
Expected result:
receiver-name amount
--------------------
B 1220
C 560
D 250
I managed to do it using only simple functions and a correlated subquery.
To change N, just update the LIMIT value; for my case (top 3) it is LIMIT 3.
SELECT b."receiver-name", (
SELECT SUM(amount)
FROM (
SELECT b2.amount
FROM bank AS b2
WHERE b2."receiver-name" = b."receiver-name"
ORDER BY b2.amount DESC
LIMIT 3
)
) AS sum_amount
FROM bank AS b
GROUP BY b."receiver-name"
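The top-N sum per receiver can be checked against the question's table. The following is a minimal sketch, assuming Python's built-in sqlite3 module. The hyphenated column names are double-quoted, because an unquoted receiver-name would be parsed as a subtraction; the names n and totals are illustrative only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    'create table bank ("user-id" int, "sender-name" text, '
    '"receiver-name" text, amount int)'
)
con.executemany("insert into bank values (?, ?, ?, ?)", [
    (1, "A", "B", 200), (2, "A", "C", 250), (3, "A", "B", 400),
    (4, "A", "B", 520), (4, "A", "D", 120), (4, "A", "D", 130),
    (4, "A", "B", 110), (4, "A", "B", 300), (4, "A", "B", 190),
    (4, "A", "C", 230), (4, "A", "B", 110), (4, "A", "C", 40),
    (4, "A", "C", 80),
])

n = 3  # how many of the largest amounts to add up per receiver

# For each receiver, the inner query picks that receiver's n largest
# amounts and the scalar subquery around it sums them.
totals = con.execute("""
select b."receiver-name",
       (select sum(amount)
        from (select b2.amount
              from bank as b2
              where b2."receiver-name" = b."receiver-name"
              order by b2.amount desc
              limit ?)) as sum_amount
from bank as b
group by b."receiver-name"
order by b."receiver-name"
""", (n,)).fetchall()

print(totals)
```

Receiver D has only two transactions, so its total is simply the sum of both.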

Query to multiply certain sets of rows on a single table

I've got a bit of a complicated query that I'm struggling with. You will notice that the schema isn't the easiest thing to work with but it's what I've been given and there isn't time to re-design (common story!).
I have rows like the ones below. Note: The 3 digit value numbers are just random numbers I made up.
id field_id value
1 5 999
1 6 888
1 7 777
1 8 foo <--- foo so we want the 3 values above
1 9 don't care
2 5 123
2 6 456
2 7 789
2 8 bar <--- bar so we DON'T want the 3 values above
2 9 don't care
3 5 623
3 6 971
3 7 481
3 8 foo <--- foo so we want the 3 values above
3 9 don't care
...
...
n 5 987
n 6 654
n 7 321
n 8 foo <--- foo so we want the 3 values above
n 9 don't care
I want this result:
id result
1 999*888*777
3 623*971*481
...
n 987*654*321
Is this clear? So we have a table with n*5 rows. For each of the sets of 5 rows: 3 of them have values we might want to multiply together, 1 of them tells us if we want to multiply and 1 of them we don't care about so we don't want the row in the query result.
Can we do this in Oracle? Preferably in one query. I guess you need to use a multiplication operator (somehow), and a grouping.
Any help would be great. Thank you.
Something like this:
select m.id, exp(sum(ln(m.value)))
from mytab m
where m.field_id in (5, 6, 7)
and m.id in (select m2.id
from mytab m2
where m2.field_id = 8
and m2.value = 'foo')
group by m.id;
For example:
SQL> select * from mytab;
ID FIELD_ID VAL
---------- ---------- ---
1 5 999
1 6 888
1 7 777
1 8 foo
1 9 x
2 5 123
2 6 456
2 7 789
2 8 bar
2 9 x
3 5 623
3 6 971
3 7 481
3 8 foo
3 9 x
15 rows selected.
SQL> select m.id, exp(sum(ln(m.value))) result
2 from mytab m
3 where m.field_id in (5, 6, 7)
4 and m.id in (select m2.id
5 from mytab m2
6 where m2.field_id = 8
7 and m2.value = 'foo')
8 group by m.id;
ID RESULT
---------- ----------
1 689286024
3 290972773
Same logic, just with the hard-coded values removed. Posting this answer in case it is helpful to others.
SELECT a.id,
exp(sum(ln(a.val)))
FROM mytab a,
(SELECT DISTINCT id,
field_id
FROM mytab
WHERE val = 'foo') b
WHERE a.id = b.id
AND a.field_id < b.field_id
GROUP BY a.id;
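The exp(sum(ln(...))) trick used in these answers rests on the identity exp(ln a + ln b + ln c) = a * b * c. A small sketch in plain Python (standard library only, no database) illustrates it with the values for id 1; the variable names are illustrative.

```python
import math

values = [999, 888, 777]  # the three values for id 1 in the question

# Sum the natural logs, then exponentiate to get the product.
approx = math.exp(sum(math.log(v) for v in values))

# Exact integer product for comparison.
exact = math.prod(values)

# The log/exp route goes through floating point, so round before
# treating the result as an integer.
product = round(approx)

print(exact, product)
```

Two caveats follow from the identity itself: the logarithm is only defined for positive values, so zeros or negative numbers need separate handling, and the floating-point result may differ from the exact product by a tiny amount, which is why rounding is applied here.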