SQL: SELECT all rows with maximum values and with a WHERE condition

I have a table which looks like this:
id data version rulecode
---------------------------
a1 1 100 x1
a2 1 100 x1
a1 1 200 x2
a4 2 500 x2
a7 2 200 x1
a6 2 500 x1
a7 2 500 x2
a8 2 150 x2
a9 3 120 x1
a10 3 130 x2
a10 3 120 x1
a12 3 130 x2
a13 3 130 x1
a14 3 110 x2
a15 3 110 x1
a16 4 220 x1
a17 4 230 x2
a18 4 240 x2
a19 4 240 x1
..........................
..........................
Now I want only the rows that have the maximum version for each data value in (1, 2 and 4).
When I tried with dense_rank(), I got rows for only one value of the data column:
SELECT *
FROM (SELECT *, dense_rank() OVER (ORDER BY version DESC) AS col
      FROM public.table_name
      WHERE data IN (1, 2, 4)) x
WHERE x.col = 1
Output:
id data version rulecode
---------------------------
a1 1 200 x2
My expected output:
id data version rulecode
a1 1 200 x2
a4 2 500 x2
a6 2 500 x1
a7 2 500 x2
a18 4 240 x2
a19 4 240 x1
Note: the data column has values going up into the millions.
Can someone help me out here to get the expected output?

You seem to want a PARTITION BY:
SELECT *
FROM (SELECT *,
DENSE_RANK() OVER (PARTITION BY data ORDER BY version desc) as seqnum
FROM public.table_name
WHERE data in (1, 2, 4)
) x
WHERE x.seqnum = 1

Using analytic functions:
WITH cte AS (
SELECT *, MAX(version) OVER (PARTITION BY data) max_version
FROM yourTable
)
SELECT id, data, version, rulecode
FROM cte
WHERE version = max_version AND data IN (1, 2, 4);
Note that we could have also filtered the data values inside the CTE. I will leave it as is, for a general solution to your problem.
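For reference, this is what that variant looks like with the filter pushed inside the CTE (same placeholder table name, yourTable):
WITH cte AS (
    SELECT *, MAX(version) OVER (PARTITION BY data) AS max_version
    FROM yourTable
    WHERE data IN (1, 2, 4)   -- filter applied before the window function runs
)
SELECT id, data, version, rulecode
FROM cte
WHERE version = max_version;
Both forms return the maximum-version rows per data value; filtering inside the CTE simply restricts the window function to the three data values of interest.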

Related

Create a new column for table B based on information from table A

I have this problem. I want to create a report that keeps everything in table B, but adds another column from table A (QtyRecv).
Condition: If RunningTotalQtyUsed (from table B) < QtyRecv, take that QtyRecv for the new column.
For example, for item A1, (RunningTotalQtyUsed) 55 < 100 (QtyRecv), -> ExpectedQtyRecv = 100.
But if RunningTotalQtyUsed exceeds QtyRecv, we take the next QtyRecv to cover that used quantity.
For example, 101 > 100, -> ExpectedQtyRecv = 138.
149 (RunningTotalQtyUsed) < (100 + 138) (QtyRecv) -> get 138.
250 < (100 + 138 + 121) -> get 121.
The same logic applies to item A2.
If total QtyRecv = 6 + 4 + 10 = 20, but RunningTotalQtyUsed = 31 -> result should be 99999 to notify an error that QtyRecv can't cover QtyUsed.
Table A:
Item QtyRecv
A1 100
A1 138
A1 121
A2 6
A2 4
A2 10
Table B:
Item RunningTotalQtyUsed
A1 55
A1 101
A1 149
A1 250
A2 1
A2 5
A2 9
A2 19
A2 31
Expected result:
Item RunningTotalQtyUsed ExpectedQtyRecv
A1 55 100
A1 101 138
A1 149 138
A1 250 121
A2 1 6
A2 5 6
A2 9 4
A2 19 10
A2 31 99999
What I tried:
SELECT b.*
FROM tableB b LEFT JOIN tableA a
ON b.item = a.item
item RunningTotalQtyUsed
A1 55
A1 55
A1 55
A1 101
A1 101
A1 101
A1 149
A1 149
A1 149
A1 250
A1 250
A1 250
A2 1
A2 1
A2 1
A2 5
A2 5
A2 5
A2 9
A2 9
A2 9
A2 19
A2 19
A2 19
A2 31
A2 31
A2 31
It doesn't keep the same number of rows as table B. How can I keep the rows of table B as they are but add the ExpectedQtyRecv from table A? Thank you so much for all the help!
SELECT B_TOTAL.ITEM, B_TOTAL.SUM_RunningTotalQtyUsed, A_TOTAL.SUM_QtyRecv
FROM
(
    SELECT B.ITEM, SUM(B.RunningTotalQtyUsed) AS SUM_RunningTotalQtyUsed
    FROM TABLE_B AS B
    GROUP BY B.ITEM
) B_TOTAL
LEFT JOIN
(
    SELECT A.ITEM, SUM(A.QtyRecv) AS SUM_QtyRecv
    FROM TABLE_A AS A
    GROUP BY A.ITEM
) A_TOTAL ON B_TOTAL.ITEM = A_TOTAL.ITEM
I cannot be sure, but maybe you need something like the above?
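If the goal is the per-row ExpectedQtyRecv described in the question, here is a rough sketch. It assumes the tables are named tableA and tableB (as in the attempted join above) and that tableA has some column, called recv_seq here purely as a placeholder, that defines the order in which the QtyRecv tiers apply; the sample data does not show such a column. The idea: take the first tier whose cumulative QtyRecv covers RunningTotalQtyUsed, and fall back to 99999 when no tier covers it.
WITH recv AS (
    -- cumulative received quantity per item, in tier order
    SELECT item,
           QtyRecv,
           SUM(QtyRecv) OVER (PARTITION BY item
                              ORDER BY recv_seq   -- placeholder ordering column
                              ROWS UNBOUNDED PRECEDING) AS cum_recv
    FROM tableA
)
SELECT b.item,
       b.RunningTotalQtyUsed,
       COALESCE(
           (SELECT r.QtyRecv
            FROM recv r
            WHERE r.item = b.item
              AND r.cum_recv >= b.RunningTotalQtyUsed
            ORDER BY r.cum_recv
            LIMIT 1),                 -- first tier that covers the used quantity
           99999) AS ExpectedQtyRecv  -- no tier covers it: flag with 99999
FROM tableB b;
With the sample data this gives 100, 138, 138, 121 for A1 and 6, 6, 4, 10, 99999 for A2. LIMIT 1 is PostgreSQL/MySQL syntax; SQL Server would use TOP 1 and Oracle/standard SQL FETCH FIRST 1 ROW ONLY.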

Finding max row after groupby in pandas dataframe

I have a dataframe as follows:
Month Col1 Col2 Val
A p a1 31
A q a1 78
A r b2 13
B x a1 54
B y b2 56
B z b2 65
I want to get the following:
Month a1 b2
A q r
B x z
Essentially, for each pair of Month and Col2, I want to find the value in Col1 which has the maximum Val.
I am not sure how to approach this.
Your problem is to:
find the row with max Val within each group, which is sort_values and drop_duplicates, and
transform the data, which is pivot:
(df.sort_values('Val')
.drop_duplicates(['Month','Col2'], keep='last')
.pivot(index='Month', columns='Col2', values='Col1')
)
Output:
Col2 a1 b2
Month
A q r
B x z

SUM based on another column Sign in Oracle

I have two tables
Table A has only the last-level id (leaf_id) along with sum_data.
id sum_data
A5 40
B3 -50
C2 90
Table B has hierarchy information of id's and the sign to be considered for the id.
id Z has three children: A2, B2 and C2
id parent id leaf_id level sign
Z NULL A5 1 +
A2 Z A5 2 +
A3 A2 A5 3 -
A4 A3 A5 4 +
A5 A4 A5 5 +
Z NULL B3 1 +
B2 Z B3 2 -
B3 B2 B3 3 +
Z NULL C2 1 +
C2 Z C2 2 ignore
I need to calculate the sum_data of Z based on the sign operator; the calculation goes like this:
id parent id leaf_id level sign sum_data
Z NULL A5 1 + -40 --(rolled up sum_data from A2* sign =-40 * +)
A2 Z A5 2 + -40 --(rolled up sum_data from A3* sign =-40 * +)
A3 A2 A5 3 - -40 --(rolled up sum_data from A4* sign = 40 * -)
A4 A3 A5 4 + +40 --(rolled up sum_data from A5)
A5 A4 A5 5 + 40 --got this from Table A
Z NULL B3 1 + 50 --(rolled up sum_data from B2* sign = 50 * +)
B2 Z B3 2 - 50 --(rolled up sum_data from B3* sign = -50 * -)
B3 B2 B3 3 + -50 -- got this from Table A
Z NULL C2 1 + 0
C2 Z C2 2 ignore 0 -- (90 comes from Table A, as sign is ignore it is 0)
My output should be
id sum_data
Z 10 ( -40 from A5 hierarchy + 50 from B3 hierarchy + 0 from C2 hierarchy)
Can you please help me derive the sum_data in Oracle SQL?
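Reading the worked example, each leaf's sum_data from Table A gets multiplied by the + / - signs of its ancestor rows (the leaf's own sign is not part of the product), and an 'ignore' anywhere on the branch, including on the leaf itself as with C2, zeroes that branch's contribution. A rough sketch of just the top-level total for Z, assuming placeholder table names tab_a and tab_b for Table A and Table B:
SELECT 'Z' AS id,
       SUM(a.sum_data *
           CASE WHEN f.has_ignore = 1 THEN 0    -- 'ignore' on the branch: contributes 0
                ELSE POWER(-1, f.minus_cnt)     -- product of the ancestors' +/- signs
           END) AS sum_data
FROM tab_a a
JOIN (
    SELECT leaf_id,
           MAX(CASE WHEN sign NOT IN ('+', '-') THEN 1 ELSE 0 END)        AS has_ignore,
           SUM(CASE WHEN sign = '-' AND id <> leaf_id THEN 1 ELSE 0 END)  AS minus_cnt  -- leaf's own sign excluded, per the example
    FROM tab_b
    GROUP BY leaf_id
) f ON f.leaf_id = a.id;
With the sample data this yields -40 + 50 + 0 = 10. If the rolled-up value is needed at every level (as in the middle table above), a hierarchical approach with CONNECT BY or a recursive WITH clause would be the natural extension.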

What is the R equivalent of SQL "SELECT * FROM table GROUP BY c1, c2"?

I want to reduce my data frame (EDIT: in a cpu-efficient way) to rows with unique values of the pair c3, c4, while keeping all columns. In other words I want to transform my data frame
> df <- data.frame(c1=seq(7), c2=seq(4, 10), c3=c("A", "B", "B", "C", "B", "A", "A"), c4=c(1, 2, 3, 3, 2, 2, 1))
c1 c2 c3 c4
1 1 4 A 1
2 2 5 B 2
3 3 6 B 3
4 4 7 C 3
5 5 8 B 2
6 6 9 A 2
7 7 10 A 1
to the data frame
c1 c2 c3 c4
1 1 4 A 1
2 2 5 B 2
3 3 6 B 3
4 4 7 C 3
6 6 9 A 2
where the values of c1 and c2 could be any value which occurs for a unique pair of c3, c4. Also the order of the resulting data frame is not of importance.
EDIT: My data frame has around 250 000 rows and 12 columns and should be grouped by 2 columns – therefore I need a CPU-efficient solution.
Working but unsatisfactory alternative
I solved this problem with
> library(sqldf)
> sqldf("Select * from df Group By c3, c4")
but in order to speed up and parallelize my program I have to eliminate the calls to sqldf.
EDIT: Currently the sqldf solution clocks at 3.5 seconds. I consider this a decent time. The problem is that I cannot start various queries in parallel therefore I am searching for an alternative way.
Not working attempts
duplicated()
> df[duplicated(df, by=c("c3", "c4")),]
[1] c1 c2 c3 c4
<0 rows> (or 0-length row.names)
selects duplicate rows and does not select rows where only columns c3 and c4 are duplicates.
aggregate()
> aggregate(df, by=list(df$c3, df$c4))
Error in match.fun(FUN) : argument "FUN" is missing, with no default
aggregate requires a function applied to all lines with the same values of c3 and c4
data.table's by
> library(data.table)
> dt <- data.table(df)
> dt[,list(c1, c2) ,by=list(c3, c4)]
c3 c4 c1 c2
1: A 1 1 4
2: A 1 7 10
3: B 2 2 5
4: B 2 5 8
5: B 3 3 6
6: C 3 4 7
7: A 2 6 9
does not kick out the rows which have non-unique values of c3 and c4, whereas
> dt[ ,length(c1), by=list(c3, c4)]
c3 c4 V1
1: A 1 2
2: B 2 2
3: B 3 1
4: C 3 1
5: A 2 1
does discard the values of c1 and c2 and reduces them to one dimension as specified with the passed function length.
Here is a data.table solution.
library(data.table)
setkey(setDT(df),c3,c4) # convert df to a data.table and set the keys.
df[,.SD[1],by=list(c3,c4)]
# c3 c4 c1 c2
# 1: A 1 1 4
# 2: A 2 6 9
# 3: B 2 2 5
# 4: B 3 3 6
# 5: C 3 4 7
The SQL you propose seems to extract the first row having a given combination of (c3,c4) - I assume that's what you want.
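As an aside: in standard SQL, where SELECT * with a bare GROUP BY is not allowed, the same "one row per (c3, c4)" intent is usually written with a window function. A sketch against a table named df:
SELECT c1, c2, c3, c4
FROM (SELECT df.*,
             ROW_NUMBER() OVER (PARTITION BY c3, c4 ORDER BY c1) AS rn
      FROM df) t
WHERE rn = 1;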
EDIT: Response to OP's comments.
The result you cite seems really odd. The benchmarks below, on a dataset with 12 columns and 2.5e5 rows, show that the data.table solution runs in about 25 milliseconds without setting keys, and in about 7 milliseconds with keys set.
set.seed(1) # for reproducible example
df <- data.frame(c3=sample(LETTERS[1:10],2.5e5,replace=TRUE),
c4=sample(1:10,2.5e5,replace=TRUE),
matrix(sample(1:10,2.5e6,replace=TRUE),nc=10))
library(data.table)
DT.1 <- as.data.table(df)
DT.2 <- as.data.table(df)
setkey(DT.2,c3,c4)
f.nokeys <- function() DT.1[,.SD[1],by=list(c3,c4)]
f.keys <- function() DT.2[,.SD[1],by=list(c3,c4)]
library(microbenchmark)
microbenchmark(f.nokeys(),f.keys(),times=10)
# Unit: milliseconds
# expr min lq median uq max neval
# f.nokeys() 23.73651 24.193129 24.609179 25.747767 26.181288 10
# f.keys() 5.93546 6.207299 6.395041 6.733803 6.900224 10
In what ways is your dataset different from this one??
Drawback (maybe): All solutions sort the result by group variables.
Using aggregate
Solution mentioned by Martin: aggregate(. ~ c3 + c4, df, head, 1)
My old solution:
> aggregate(df,by=list(df$c3,df$c4),FUN=head,1)
Group.1 Group.2 c1 c2 c3 c4
1 A 1 1 4 A 1
2 A 2 6 9 A 2
3 B 2 2 5 B 2
4 B 3 3 6 B 3
5 C 3 4 7 C 3
> aggregate(df,by=list(df$c3,df$c4),FUN=head,1)[,-(1:2)]
c1 c2 c3 c4
1 1 4 A 1
2 6 9 A 2
3 2 5 B 2
4 3 6 B 3
5 4 7 C 3
Using ddply
> require(plyr)
Loading required package: plyr
> ddply(df, ~ c3 + c4, head, 1)
c1 c2 c3 c4
1 1 4 A 1
2 6 9 A 2
3 2 5 B 2
4 3 6 B 3
5 4 7 C 3
Some dplyr options:
library(dplyr)
group_by(df, c3, c4) %>% filter(row_number() == 1)
group_by(df, c3, c4) %>% slice(1)
group_by(df, c3, c4) %>% do(head(.,1))
group_by(df, c3, c4) %>% summarise_each(funs(first))
group_by(df, c3, c4) %>% summarise_each(funs(.[1]))
group_by(df, c3, c4) %>% summarise_each(funs(head(.,1)))
group_by(df, c3, c4) %>% distinct()
Here's a dplyr-only benchmark:
library(microbenchmark)
set.seed(99)
df <- data.frame(matrix(sample(500, 25e4*12, replace = TRUE), ncol = 12))
dim(df)
microbenchmark(
f1 = {group_by(df, X1, X2) %>% filter(row_number() == 1)},
f2 = {group_by(df, X1, X2) %>% summarise_each(funs(first))},
f3 = {group_by(df, X1, X2) %>% summarise_each(funs(.[1]))},
f4 = {group_by(df, X1, X2) %>% summarise_each(funs(head(., 1)))},
f5 = {group_by(df, X1, X2) %>% distinct()},
times = 10
)
Unit: milliseconds
expr min lq median uq max neval
f1 498 505 509 527 615 10
f2 726 766 794 815 823 10
f3 1485 1504 1545 1571 1639 10
f4 25170 25668 26027 26188 26406 10
f5 618 622 631 653 675 10
I excluded the version with do(head(.,1)) since it's just not a very good option and takes too long.
You can use interaction and duplicated:
subset(df, !duplicated(interaction(c3, c4)))
# c1 c2 c3 c4
# 1 1 4 A 1
# 2 2 5 B 2
# 3 3 6 B 3
# 4 4 7 C 3
# 6 6 9 A 2

pig order by with rank and join the rank together

I have the following data with the schema (t0:chararray, t1:int)
a0 1
a1 7
b2 9
a2 4
b0 6
And I want to order it by t1 and then add a rank:
a0 1 1
a2 4 2
b0 6 3
a1 7 4
b2 9 5
Is there any convenient way to do this without writing a UDF in Pig?
There is the RANK operation in Pig. This should be sufficient:
X = RANK A BY t1 ASC;
Please see the Pig docs for more details.