SQL query with CASE statement, multiple variables and sub-variables - sql

I am working on a dataset and need to write a query for the requirement below, either in R or with sqldf. I would like to learn how to write it in both languages (SQL and R), so any help is appreciated.
The requirement: print Variable "a" when the Total_Scores of id 34 at Rank 3 is greater than the Total_Scores of id 34 at Rank 4; otherwise print Variable "b".
The same rule applies to every id and its pair of Ranks.
id Rank Variable Total_Scores
34 3 a 11
34 4 b 6
126 3 c 15
126 4 d 18
190 3 e 9
190 4 f 10
388 3 g 20
388 4 h 15
401 3 i 15
401 4 x 11
476 3 y 11
476 4 z 11
536 3 p 15
536 4 q 6
I tried to write an SQL CASE statement and got stuck. Could you please help me write the query? My attempt so far:
"select id ,Rank ,
CASE
WHEN (select Total_Scores from table where id == 34 and Rank == 3) > (select Total_Scores from table where id == 34 and Rank == 4)
THEN "Variable is )
The final output should be:
id Rank Variable Total_Scores
34 3 a 11
126 4 d 18
190 4 f 10
388 3 g 20
401 3 i 15
536 3 p 15

You seem to want the row with the highest score for each id. A canonical way to write this in SQL uses row_number() (here t stands for your table and score for Total_Scores):
select t.*
from (select t.*,
row_number() over (partition by id order by score desc) as seqnum
from t
) t
where seqnum = 1;
This returns one row per id, even when the scores are tied. If you want all rows in that case, use rank() instead of row_number().
An alternative method can have better performance with an index on (id, score):
select t.*
from t
where t.score = (select max(t2.score) from t t2 where t2.id = t.id);
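To see the difference between row_number() and rank() on the tied id (476), here is a minimal sketch that runs the first query against the question's data in an in-memory SQLite database from Python. It assumes SQLite 3.25 or later (needed for window functions); the column name rnk and the helper top_rows are only illustrative names, not part of the original query.

```python
import sqlite3

# The question's data: (id, rank, variable, score)
rows = [
    (34, 3, "a", 11), (34, 4, "b", 6),
    (126, 3, "c", 15), (126, 4, "d", 18),
    (190, 3, "e", 9), (190, 4, "f", 10),
    (388, 3, "g", 20), (388, 4, "h", 15),
    (401, 3, "i", 15), (401, 4, "x", 11),
    (476, 3, "y", 11), (476, 4, "z", 11),
    (536, 3, "p", 15), (536, 4, "q", 6),
]

con = sqlite3.connect(":memory:")
con.execute("create table t (id int, rnk int, variable text, score int)")
con.executemany("insert into t values (?, ?, ?, ?)", rows)

def top_rows(window_fn):
    """Return the variables of the rows numbered 1 by the given window function."""
    sql = f"""
        select variable
        from (select t.*,
                     {window_fn}() over (partition by id order by score desc) as seqnum
              from t) as ranked
        where seqnum = 1
        order by id, variable
    """
    return [r[0] for r in con.execute(sql)]

with_row_number = top_rows("row_number")  # one row per id; the pick for 476 is arbitrary
with_rank = top_rows("rank")              # both tied rows of 476 are kept
print(with_row_number)
print(with_rank)
```

With rank() the result has eight rows (y and z both appear); with row_number() it has seven.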

You can try this.
SELECT T.* FROM (
SELECT id,
MAX(Total_Scores) Max_Total_Scores
FROM MyTable
GROUP BY id
HAVING MAX(Total_Scores) > MIN(Total_Scores) ) AS MX
INNER JOIN MyTable T ON MX.id = T.id AND MX.Max_Total_Scores = T.Total_Scores
ORDER BY id
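As a quick check, the query above can be run against the question's data with Python's built-in sqlite3 module. This is only a sketch: the table is created in memory under the name MyTable used in the answer, and SQLite stands in for whatever database you actually use.

```python
import sqlite3

rows = [
    (34, 3, "a", 11), (34, 4, "b", 6),
    (126, 3, "c", 15), (126, 4, "d", 18),
    (190, 3, "e", 9), (190, 4, "f", 10),
    (388, 3, "g", 20), (388, 4, "h", 15),
    (401, 3, "i", 15), (401, 4, "x", 11),
    (476, 3, "y", 11), (476, 4, "z", 11),
    (536, 3, "p", 15), (536, 4, "q", 6),
]

con = sqlite3.connect(":memory:")
con.execute('create table MyTable (id int, "Rank" int, Variable text, Total_Scores int)')
con.executemany("insert into MyTable values (?, ?, ?, ?)", rows)

sql = """
    SELECT T.id, T."Rank", T.Variable, T.Total_Scores
    FROM (SELECT id, MAX(Total_Scores) AS Max_Total_Scores
          FROM MyTable
          GROUP BY id
          -- drops ids whose scores are all equal (here: id 476)
          HAVING MAX(Total_Scores) > MIN(Total_Scores)) AS MX
    INNER JOIN MyTable T
            ON MX.id = T.id AND MX.Max_Total_Scores = T.Total_Scores
    ORDER BY T.id
"""
result = con.execute(sql).fetchall()
for row in result:
    print(row)
```

This reproduces the six rows of the expected output. Note that the HAVING clause only removes an id when all of its scores are equal; with more than two rows per id, a tie at the top would still return several rows.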

In R
library(dplyr)
df %>% group_by(id) %>%
filter(Total_Scores == max(Total_Scores)) %>% filter(n()==1) %>%
ungroup()
# A tibble: 6 x 4
id Rank Variable Total_Scores
<int> <int> <chr> <int>
1 34 3 a 11
2 126 4 d 18
3 190 4 f 10
4 388 3 g 20
5 401 3 i 15
6 536 3 p 15
Data
df <- read.table(text="
id Rank Variable Total_Scores
34 3 a 11
34 4 b 6
126 3 c 15
126 4 d 18
190 3 e 9
190 4 f 10
388 3 g 20
388 4 h 15
401 3 i 15
401 4 x 11
476 3 y 11
476 4 z 11
536 3 p 15
536 4 q 6
",header=T, stringsAsFactors = F)

Assuming that you want the subset of rows whose Total_Scores is the largest for each id, here are two approaches.
The question did not discuss how to deal with ties. One id in the example (476) has a tie, but the expected output contains no row for it; I assume this was not intended and that either both rows or one of them should have been output. In the solutions below, (1) returns one of the tied rows arbitrarily, whereas (2) returns both.
1) sqldf
If you use max in an SQLite select with group by, the other columns are automatically taken from the row that holds the maximum, so:
library(sqldf)
sqldf("select id, Rank, Variable, max(Total_Scores) Total_Scores
from DF
group by id")
giving:
id Rank Variable Total_Scores
1 34 3 a 11
2 126 4 d 18
3 190 4 f 10
4 388 3 g 20
5 401 3 i 15
6 476 3 y 11
7 536 3 p 15
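The same SQLite behaviour can be checked outside R. Below is a minimal sketch using Python's built-in sqlite3 module on two of the ids from the question (126 without a tie, 476 with one); the dictionary name picked is just for illustration.

```python
import sqlite3

rows = [
    (126, 3, "c", 15), (126, 4, "d", 18),
    (476, 3, "y", 11), (476, 4, "z", 11),
]

con = sqlite3.connect(":memory:")
con.execute('create table DF (id int, "Rank" int, Variable text, Total_Scores int)')
con.executemany("insert into DF values (?, ?, ?, ?)", rows)

# Rank and Variable are "bare" columns: SQLite fills them from the row
# in which max(Total_Scores) was found for that group.
sql = """
    select id, "Rank", Variable, max(Total_Scores) as Total_Scores
    from DF
    group by id
    order by id
"""
picked = {r[0]: r[1:] for r in con.execute(sql)}
print(picked)
```

For id 126 the row (4, 'd', 18) is returned. For id 476 either y or z may come back, since both rows hold the maximum. This is an SQLite extension; most other databases reject the query because Rank and Variable are neither grouped nor aggregated.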
2) base R
In base R we can use ave and subset like this:
subset(DF, ave(Total_Scores, id, FUN = function(x) x == max(x)) > 0)
giving:
id Rank Variable Total_Scores
1 34 3 a 11
4 126 4 d 18
6 190 4 f 10
7 388 3 g 20
9 401 3 i 15
11 476 3 y 11
12 476 4 z 11
13 536 3 p 15
Note
The input in reproducible form:
Lines <- "id Rank Variable Total_Scores
34 3 a 11
34 4 b 6
126 3 c 15
126 4 d 18
190 3 e 9
190 4 f 10
388 3 g 20
388 4 h 15
401 3 i 15
401 4 x 11
476 3 y 11
476 4 z 11
536 3 p 15
536 4 q 6"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)

Related

pandas: get top n including the duplicates of a sorted column

I have some data like the table below, which is sorted by the score column and then by the cat column:
score cat
18 B
18 A
17 A
16 B
16 A
15 B
14 B
13 A
12 A
10 B
9 B
I want to get the top 5 scores, including duplicates, and also add the rank, i.e.
rank score cat
1 18 B
1 18 A
2 17 A
3 16 B
3 16 A
4 15 B
5 14 B
How can I get this using pandas?
Since the data frame is already sorted, try factorize:
df['rnk'] = df.score.factorize()[0]+1
out = df[df['rnk'] <= 5]
out
score cat rnk
0 18 B 1
1 18 A 1
2 17 A 2
3 16 B 3
4 16 A 3
5 15 B 4
6 14 B 5
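factorize numbers the distinct values in order of first appearance, which is why it only works as a rank here because the frame is already sorted by score in descending order. A plain-Python sketch of the same idea (the function name dense_rank_in_order is made up for this illustration):

```python
scores = [18, 18, 17, 16, 16, 15, 14, 13, 12, 10, 9]
cats = ["B", "A", "A", "B", "A", "B", "B", "A", "A", "B", "B"]

def dense_rank_in_order(values):
    """Number distinct values 1, 2, 3, ... in order of first appearance."""
    first_seen = {}
    ranks = []
    for v in values:
        if v not in first_seen:
            first_seen[v] = len(first_seen) + 1
        ranks.append(first_seen[v])
    return ranks

ranks = dense_rank_in_order(scores)
# Keep every row whose rank is within the top 5, duplicates included.
top5 = [(r, s, c) for r, s, c in zip(ranks, scores, cats) if r <= 5]
for row in top5:
    print(row)
```

If the input were not sorted, the numbers would follow the order of appearance rather than the size of the score, so sort first in that case.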

Aggregating counts of different columns subsets

I have a dataset with a tree structure and for each path in the tree, I want to compute the corresponding counts at each level. Here is a minimal reproducible example with two levels.
import numpy as np
import pandas as pd
data = pd.DataFrame()
data['level_1'] = np.random.choice(['1', '2', '3'], 100)
data['level_2'] = np.random.choice(['A', 'B', 'C'], 100)
I know I can get the counts on the last level by doing
counts = data.groupby(['level_1','level_2']).size().reset_index(name='count_2')
print(counts)
level_1 level_2 count_2
0 1 A 10
1 1 B 12
2 1 C 8
3 2 A 10
4 2 B 10
5 2 C 10
6 3 A 17
7 3 B 12
8 3 C 11
What I would like to have is a dataframe with one row for each possible path in the tree with the counts at each level in that path. For the example above, it would be something like
level_1 level_2 count_1 count_2
0 1 A 30 10
1 1 B 30 12
2 1 C 30 8
3 2 A 30 10
4 2 B 30 10
5 2 C 30 10
6 3 A 40 17
7 3 B 40 12
8 3 C 40 11
This is an example with only two levels, which is easy to solve, but I would like to have a way to get those counts for an arbitrary number of levels.
This can be done with transform:
counts['count_1']=counts.groupby(['level_1']).count_2.transform('sum')
counts
level_1 level_2 count_2 count_1
0 1 A 7 30
1 1 B 13 30
2 1 C 10 30
3 2 A 7 30
4 2 B 7 30
5 2 C 16 30
6 3 A 9 40
7 3 B 10 40
8 3 C 21 40
You can also compute this directly from your original data:
groups = data.groupby('level_1').level_2
pd.merge(groups.value_counts(),
groups.size(),
left_index=True,
right_index=True)
which gives:
level_2_x level_2_y
level_1 level_2
1 A 14 39
B 14 39
C 11 39
2 C 13 34
A 12 34
B 9 34
3 B 12 27
C 9 27
A 6 27
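The question also asks about an arbitrary number of levels. The underlying logic is to count, for every path, how many rows share each of its prefixes. Here is a small sketch of that idea in plain Python, independent of pandas; the function names and the sample paths are invented for the illustration.

```python
from collections import Counter

def prefix_counts(paths):
    """Count how many paths share each prefix, for every prefix length."""
    counts = Counter()
    for path in paths:
        for depth in range(1, len(path) + 1):
            counts[path[:depth]] += 1
    return counts

def rows_with_counts(paths):
    """One row per distinct path, followed by the count at each level."""
    counts = prefix_counts(paths)
    return [
        path + tuple(counts[path[:d]] for d in range(1, len(path) + 1))
        for path in sorted(set(paths))
    ]

# Hypothetical two-level sample: (level_1, level_2)
paths = [("1", "A"), ("1", "A"), ("1", "B"),
         ("2", "A"), ("2", "C"), ("2", "C"), ("2", "C")]
table = rows_with_counts(paths)
for row in table:
    print(row)  # (level_1, level_2, count_1, count_2)
```

Because the functions only rely on tuple prefixes, the same code works unchanged for paths with three or more levels.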

Rank multiple columns in pandas

I have a dataset of a series with missing values that I want to replace by the index. The second column contains the same numbers as the first column, but in a different order.
Here's an example:
>>> df
ind u v d
0 5 7 151
1 7 20 151
2 8 40 151
3 20 5 151
This should turn out to:
>>>df
ind u v d
0 1 2 151
1 2 4 151
2 3 5 151
3 4 1 151
I reindexed the values in column 'u' by creating a new column:
>>>df['new_index'] = range(1, len(numbers) + 1)
But how do I now replace the values of the second column so that they refer to these indexes?
Thanks for any advice!
You can use Series.rank, but you first need to create a Series with unstack and then rebuild the DataFrame with unstack again:
df[['u','v']] = df[['u','v']].unstack().rank(method='dense').astype(int).unstack(0)
print (df)
u v d
ind
0 1 2 151
1 2 4 151
2 3 5 151
3 4 1 151
If you use only DataFrame.rank, the output in v is different, because each column is ranked separately:
df[['u','v']] = df[['u','v']].rank(method='dense').astype(int)
print (df)
u v d
ind
0 1 2 151
1 2 3 151
2 3 4 151
3 4 1 151
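The difference between the two outputs comes down to whether the values of u and v are ranked together or per column. A plain-Python sketch of the pooled (dense) ranking, using the numbers from the question; the variable names are only illustrative:

```python
u = [5, 7, 8, 20]
v = [7, 20, 40, 5]

# Pool both columns so that equal values share one rank across u and v.
pooled = sorted(set(u) | set(v))
rank_of = {value: position for position, value in enumerate(pooled, start=1)}

u_ranked = [rank_of[x] for x in u]
v_ranked = [rank_of[x] for x in v]
print(u_ranked)
print(v_ranked)
```

Here 40 gets rank 5 because the pooled values are 5, 7, 8, 20, 40. Ranking v on its own would give 40 the rank 4 instead.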

Select for running total rank based on column values

I have a problem assigning ranks in the scenario below. The running total is calculated from the Cnt field.
My SQL query should return Rank values as in the output below. Each page can hold only 40 rows, so each rank should cover at most 40 records: whenever the running total crosses another multiple of 40, the rank should change.
It would be a great help to get an SQL query that returns these values.
select f1,f2,sum(f2) over(order by f1) running_total
from [dbo].[Sheet1$]
Output:
ID cnt Running Total Rank
1 4 4 1
2 5 9 1
3 4 13 1
4 4 17 1
5 4 21 1
6 5 26 1
7 4 30 1
8 4 34 1
9 4 38 1
10 4 42 2
11 4 46 2
12 4 50 2
13 4 54 2
14 4 58 2
15 4 62 2
16 4 66 2
17 4 70 2
18 4 74 2
19 4 78 2
20 4 82 3
21 4 86 3
22 4 90 3
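-- Integer-divide the running total by 40; add 1 so that the ranks start at 1, as in the expected output
select f1,f2,sum(f2) over(order by f1) running_total, Floor(sum(f2) over(order by f1) / 40) + 1 [rank]
from [dbo].[Sheet1$]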
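The arithmetic can be checked quickly outside the database. Integer division of the running total by 40 starts at 0, while the expected output numbers the ranks from 1, so this sketch adds 1. It is a plain-Python illustration using the Cnt values from the question, not a replacement for the SQL.

```python
from itertools import accumulate

# Cnt values for IDs 1..22 as listed in the question
cnt = [4, 5, 4, 4, 4, 5] + [4] * 16
running_total = list(accumulate(cnt))

# Totals 0..39 -> rank 1, 40..79 -> rank 2, 80..119 -> rank 3, ...
ranks = [total // 40 + 1 for total in running_total]

for id_, (c, total, rank) in enumerate(zip(cnt, running_total, ranks), start=1):
    print(id_, c, total, rank)
```

The question does not say what should happen when the running total is exactly 40. With this formula such a row already gets rank 2; if it should still belong to rank 1, a ceiling division such as -(-total // 40) would be needed instead.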

How to select item in one table that is not present in another table in SQL [closed]

Closed 8 years ago.
The following table is the REORGANISED table:
TID ITEMS TIMES TWU
1 D 5 633
1 M 5 665
1 R 14 861
1 F 4 871
1 I 8 910
1 A 7 942
1 N 7 950
1 Z 2 986
1 H 2 1020
2 S 4 551
2 R 7 861
2 F 6 871
2 I 4 910
2 A 6 942
2 N 8 950
2 Z 6 986
2 H 2 1020
3 U 4 354
3 V 7 528
3 B 2 641
3 J 4 842
3 F 4 871
3 I 2 910
3 A 6 942
3 N 2 950
3 Z 4 986
4 X 4 338
4 O 2 442
4 D 2 633
4 B 6 641
4 M 1 665
4 F 5 871
4 A 1 942
4 N 7 950
4 Z 10 986
4 H 1 1020
5 T 5 365
5 C 8 370
5 K 7 397
5 Q 5 397
5 P 5 471
5 S 3 551
5 D 1 633
5 B 6 641
5 M 6 665
5 J 4 842
5 R 6 861
5 I 1 910
5 A 4 942
5 Z 10 986
5 H 7 1020
6 L 5 305
6 U 1 354
6 K 2 397
The table above is sorted in ascending order of TWU within each TID. For each TID, I treat the item with the minimum TWU as the leaf node; the remaining items are intermediate nodes. The following table is LEAFNODES:
TID ITEMS
1 D
2 S
3 U
4 X
5 T
6 L
Now I want to select the ITEMS in LEAFNODES that are not present as an intermediate node in REORGANISED.
To get records that do not exist in another table, you use NOT EXISTS. You want to know whether intermediate records exist. You consider all records that don't have the minimum twu per tid to be intermediate records. So all records with a twu higher than the minimum are intermediate, and you want to select all items from leafnodes that are not intermediate.
select *
from leafnodes
where not exists
(
select *
from reorganized
where twu >
(
select min(twu)
from reorganized leaf
where leaf.tid = reorganized.tid
)
and reorganized.items = leafnodes.items
);
Here is the same with an IN clause, which I consider slightly more readable. Here you regard the intermediate items as a set, and you want an item NOT to be IN that set.
select *
from leafnodes
where items not in
(
select items
from reorganized
where twu >
(
select min(twu)
from reorganized leaf
where leaf.tid = reorganized.tid
)
);
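To try both variants without a database server, here is a sketch that runs them in an in-memory SQLite database from Python. It loads only a trimmed subset of the REORGANISED rows from the question (two per TID, chosen so that D, S and U also occur as intermediate items), so treat it as an illustration rather than a test on the full data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table reorganized (tid int, items text, times int, twu int)")
con.execute("create table leafnodes (tid int, items text)")

# Subset of the question's REORGANISED table
con.executemany("insert into reorganized values (?, ?, ?, ?)", [
    (1, "D", 5, 633), (1, "M", 5, 665),
    (2, "S", 4, 551), (2, "R", 7, 861),
    (3, "U", 4, 354), (3, "V", 7, 528),
    (4, "X", 4, 338), (4, "D", 2, 633),
    (5, "T", 5, 365), (5, "S", 3, 551),
    (6, "L", 5, 305), (6, "U", 1, 354),
])
con.executemany("insert into leafnodes values (?, ?)", [
    (1, "D"), (2, "S"), (3, "U"), (4, "X"), (5, "T"), (6, "L"),
])

# Rows whose twu is above the minimum of their tid are intermediate nodes.
above_min = """twu > (select min(twu) from reorganized leaf
                      where leaf.tid = reorganized.tid)"""

not_exists = [r[0] for r in con.execute(f"""
    select items from leafnodes
    where not exists (select 1 from reorganized
                      where {above_min}
                        and reorganized.items = leafnodes.items)
    order by tid""")]

not_in = [r[0] for r in con.execute(f"""
    select items from leafnodes
    where items not in (select items from reorganized where {above_min})
    order by tid""")]

print(not_exists)
print(not_in)
```

Both queries return X, T and L on this subset. One difference to keep in mind: if the subquery of NOT IN can return a NULL item, NOT IN yields no rows at all, whereas NOT EXISTS is not affected by that.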
-- Leaf rows minus the rows of REORGANISED that hold the minimum TWU of their TID
select TID, ITEMS from LEAFNODES
except
select a.TID, a.ITEMS
from REORGANISED as a
inner join (select TID, min(TWU) as TWU from REORGANISED group by TID) as temp
on a.TID = temp.TID and a.TWU = temp.TWU
;WITH CTE1 AS
(
-- Select the TId with minimum value from REORGANISED
SELECT TID,MIN(TWU)TWU
FROM REORGANISED
GROUP BY TID
)
,CTE2 AS
(
-- Now we will get value(ITEMS) for the min value and TID from CTE1 for
SELECT C1.TID,R.ITEMS,C1.TWU
FROM CTE1 C1
JOIN REORGANISED R ON C1.TID=R.TID AND C1.TWU=R.TWU
)
SELECT L.*
FROM CTE2 C2
JOIN LEAFNODES L ON C2.TID=L.TID AND C2.ITEMS=L.ITEMS