SQL Server one column data as row and other one as column - sql

Data within the actual table represents the intersection between two categories (for e.g a and b are categories). Data I have has multiple categories. In output table I need to show the intersection between such categories.
This is the actual table data. sample data
col1 Category IntersectedCategory Count
----------------------------------------------
1 a g 12
2 b f 12
3 a a 260
4 c I 38
5 d h 39
6 b g 12
7 b h 27
8 c c 114
9 a h 60
10 a d 57
11 e e 137
12 f h 15
13 g I 12
14 d g 12
15 e f 34
16 c b 15
17 h h 190
18 f f 96
19 c d 14
20 e d 46
21 g f 12
22 e g 12
23 c f 12
24 g g 97
25 d I 72
26 b b 116
27 c h 32
28 b I 45
29 e h 15
30 c g 6
31 a b 16
32 I I 361
33 I f 55
34 a e 38
35 e I 68
36 d d 142
37 g h 6
38 a f 33
39 e b 21
40 b d 21
41 a c 29
42 a I 114
43 I h 81
44 e c 6
45 d f 29
Expected Output:
list a b
----------------
a 20 5
b 5 25

You can use Dynamic SQL for this.
Build a variable with the column names and then PIVOT on them.
Example snippet:
-- using a temporary table for demonstration purposes.
IF OBJECT_ID('tempdb..#TestTable') IS NOT NULL DROP TABLE #TestTable;
CREATE TABLE #TestTable (col1 int identity(1,1) primary key, Category varchar(8), IntersectedCategory varchar(8), [Count] int);
-- Small set of sample data
insert into #TestTable (Category, IntersectedCategory, [Count]) values
('a','a',20)
,('a','b',5)
,('b','a',5)
,('b','b',25)
;
-- Generating the column names
declare #cols varchar(max);
select #cols = concat(#cols+', ', quotename(IntersectedCategory))
from #TestTable
group by IntersectedCategory
order by IntersectedCategory;
-- constructing the sql statement that uses a PIVOT
declare #DynSql varchar(max);
set #DynSql = 'select *
from (select Category as list, IntersectedCategory as col, [Count] as total from #TestTable) as src
pivot (SUM(total) for col in ('+ #cols +')) as pvt
order by list';
-- select #DynSql as DynSql; -- Just to check how the sql looks like
-- running the generated SQL statement
exec(#DynSql);
Result:
list a b
---- -- --
a 20 5
b 5 25

Related

How do I add specific columns from different tables onto an existing table in postgresql?

I have an original table (TABLE 1):
A
B
C
D
1
3
5
7
2
4
6
8
I want to add column F from the table below (Table 2) onto table 1:
A
F
G
H
1
29
5
7
2
30
6
8
As well as adding Column J,L and O from the table below (Table 3) onto column 1:
A
I
J
K
L
M
N
O
1
9
11
13
15
17
19
21
2
10
12
14
16
18
20
22
How do I go about adding only the specific columns onto table 1?
Expected Result:
A
B
C
D
F
J
L
O
1
3
5
7
29
11
15
21
2
4
6
8
30
12
16
22
Use following query
SELECT T1.A,
B,
C,
D,
F,
J,
L,
O
FROM table1 T1
JOIN table2 T2
ON T1.A = T2.A
JOIN table3 t3
ON T1.A = T3.A

Updating values in a table if present in a different table using pandas merge

I have 2 tables, Table p and Table q. The contents of Table p are to be updated from Table q.
Table p:
A B C
1 45 22 25
2 34 46 56
3 59 55 44
Table q:
A B C
1 34 46 59
2 59 55 49
I want to merge these two tables based on column 'A' and 'B' such that if values of 'A', 'B' in table p are not present in table q, values in column B in table p are the same.
Tried:
p['A'] = pd.merge(q, on=['A','B'], how='left')['C']
Output:
A B C
1 45 22 NaN
2 34 46 59
3 59 55 49
Desired Output:
A B C
1 45 22 25
2 34 59 59
3 59 55 49
I can create a different column and merge and then combine back to column 'A' of table p but that seems lengthy. Is there a more direct way to do this?
You can using update
keycol=['A','B']
df1=df1.set_index(keycol)
df1.update(df2.set_index(keycol))
df1
Out[762]:
C
A B
45 22 25.0
34 46 59.0
59 55 49.0
df1.reset_index()
Out[763]:
A B C
0 45 22 25.0
1 34 46 59.0
2 59 55 49.0
Another solution from map
df1.A.map(df2.set_index('A').B).fillna(df1.B)
Out[727]:
1 22.0
2 59.0
3 99.0
Name: A, dtype: float64

SQL query with CASE statement ,multiple variables and sub variables

i am working on dataset where i need to write an query for below requirement either in R-programming or SQLDF , i want to know and learn to write in both language ( SQL and R ) ,kindly help.
Requirement is : i need to print Variable "a" from table when
Total_Scores of id 34 and Rank 3 is GREATER THAN Total_Scores of id 34 and Rank 4 else Print Variable b .
This above case Applicable for each id and Rank
id Rank Variable Total_Scores
34 3 a 11
34 4 b 6
126 3 c 15
126 4 d 18
190 3 e 9
190 4 f 10
388 3 g 20
388 4 h 15
401 3 i 15
401 4 x 11
476 3 y 11
476 4 z 11
536 3 p 15
536 4 q 6
i have tried to write SQL CASE statement and i am stuck ,could you please
help to write the query
"select id ,Rank ,
CASE
WHEN (select Total_Scores from table where id == 34 and Rank == 3) > (select Total_Scores from table where id == 34 and Rank == 4)
THEN "Variable is )
Final output Should be :
id Rank Variable Total_Scores
34 3 a 11
126 4 d 18
190 4 f 10
388 3 g 20
401 3 i 15
536 3 p 15
You seem to want the row with the highest score for each id. A canonical way to write this in SQL uses row_number():
select t.*
from (select t.*,
row_number() over (partition by id order by score desc) as seqnum
from t
) t
where seqnum = 1;
This returns one row per id, even when the scores are tied. If you want all rows in that case, use rank() instead of row_number().
An alternative method can have better performance with an index on (id, score):
select t.*
from t
where t.score = (select max(t2.score) from t t2 where t2.id = t.id);
You can try this.
SELECT T.* FROM (
SELECT id,
MAX(Total_Scores) Max_Total_Scores
FROM MyTable
GROUP BY id
HAVING MAX(Total_Scores) > MIN(Total_Scores) ) AS MX
INNER JOIN MyTable T ON MX.id = T.id AND MX.Max_Total_Scores = T.Total_Scores
ORDER BY id
Sql Fiddle
In R
library(dplyr)
df %>% group_by(id) %>%
filter(Total_Scores == max(Total_Scores)) %>% filter(n()==1) %>%
ungroup()
# A tibble: 6 x 4
id Rank Variable Total_Scores
<int> <int> <chr> <int>
1 34 3 a 11
2 126 4 d 18
3 190 4 f 10
4 388 3 g 20
5 401 3 i 15
6 536 3 p 15
Data
df <- read.table(text="
id Rank Variable Total_Scores
34 3 a 11
34 4 b 6
126 3 c 15
126 4 d 18
190 3 e 9
190 4 f 10
388 3 g 20
388 4 h 15
401 3 i 15
401 4 x 11
476 3 y 11
476 4 z 11
536 3 p 15
536 4 q 6
",header=T, stringsAsFactors = F)
Assuming that what you want is to get the subset of rows whose Total_Scores is largest for that id here are two approaches.
The question did not discuss how to deal with ties. There is one id in the example that has a tie but there is no output corresponding to it which I assume was not intendedand that either both the rows should have been output or one of them. Anyways in the solutions below in (1) it will give one of the rows arbitrarily if there are duplicates whereas (2) will give both.
1) sqldf
If you use max in an SQLite select it will automatically select the other variables of the same row so:
library(sqldf)
sqldf("select id, Rank, Variable, max(Total_Scores) Total_Scores
from DF
group by id")
giving:
id Rank Variable Total_Scores
1 34 3 a 11
2 126 4 d 18
3 190 4 f 10
4 388 3 g 20
5 401 3 i 15
6 476 3 y 11
7 536 3 p 15
2) base R In base R we can use ave and subset like this:
subset(DF, ave(Total_Scores, id, FUN = function(x) x == max(x)) > 0)
giving:
id Rank Variable Total_Scores
1 34 3 a 11
4 126 4 d 18
6 190 4 f 10
7 388 3 g 20
9 401 3 i 15
11 476 3 y 11
12 476 4 z 11
13 536 3 p 15
Note
The input in reproducible form:
Lines <- "id Rank Variable Total_Scores
34 3 a 11
34 4 b 6
126 3 c 15
126 4 d 18
190 3 e 9
190 4 f 10
388 3 g 20
388 4 h 15
401 3 i 15
401 4 x 11
476 3 y 11
476 4 z 11
536 3 p 15
536 4 q 6"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)

count of matching rows and select 30 from each count group in hive

How do I count the matching rows for the below sample data
ID Attribute 1 Attribute 2
1 A AA
2 B CC
3 C BB
4 A AA
5 C BB
6 D AA
7 B AA
8 C DD
9 A AB
10 A AA
the out put should look like this
ID Attribute 1 Attribute 2 count(Attribute1+Attribute2)
1 A AA 3
2 B CC 1
3 C BB 2
4 A AA 3
5 C BB 2
6 D AA 1
7 B AA 1
8 C DD 1
9 A AB 1
10 A AA 3
and then select 50% of rows from each count group. ex : for the mtaching row (A,AA) I need to select only 2 occurances. which would give me the the ID (1 and 4)
Here is the query:
select * from (select product_category_id,
count(1) over (partition by product_category_id) cnt,
row_number() over (partition by product_category_id) rn
from products) p where rn <= 0.5 * cnt;
Here is the sample result, product_category_id, cnt and then row number. In your case you need to have your two attributes in partition by clause.
59 24 1
59 24 2
59 24 3
59 24 4
59 24 5
59 24 6
59 24 7
59 24 8
59 24 9
59 24 10
59 24 11
59 24 12
You can do with SQL query like this.
SELECT *,
(SELECT COUNT(*)
FROM table AS t2
WHERE t1.[Attribute1] = t2.[Attibute1]
AND t1.[Attribute2] = t2.[Attibute2]) AS 'count(Attribute1+Attribute2)'
FROM table AS t1

Dedup using HiveQL

I have a hive table with field 'a'(int), 'b'(string), 'c'(bigint), 'd'(bigint) and 'e'(string).
I have data like:
a b c d e
---------------
1 a 10 18 i
2 b 11 19 j
3 c 12 20 k
4 d 13 21 l
1 e 14 22 m
4 f 15 23 n
2 g 16 24 o
3 h 17 25 p
Table is sorted on key 'b'.
Now we want output like below:
a b c d e
---------------
1 e 14 22 m
4 f 15 23 n
2 g 16 24 o
3 h 17 25 p
which will be deduped on key 'a' but will keep last(latest) 'b'.
Is it possible using Hive query(HiveQL)?
If column b is unique, Try follow hql:
select
*
from
(
select max(b) as max_b
from
table
group by a
) table1
join table on table1.max_b = table.b