Dedup using HiveQL

I have a Hive table with the fields 'a' (int), 'b' (string), 'c' (bigint), 'd' (bigint) and 'e' (string).
I have data like this:
a b c d e
---------------
1 a 10 18 i
2 b 11 19 j
3 c 12 20 k
4 d 13 21 l
1 e 14 22 m
4 f 15 23 n
2 g 16 24 o
3 h 17 25 p
The table is sorted on key 'b'.
We want output like this:
a b c d e
---------------
1 e 14 22 m
4 f 15 23 n
2 g 16 24 o
3 h 17 25 p
That is, the rows are deduplicated on key 'a', keeping the row with the last (latest) 'b' for each 'a'.
Is this possible with a Hive query (HiveQL)?

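If column b is unique, try the following HQL (my_table is a placeholder for your table name, since table itself is a reserved word):
-- inner query: latest b for each a; outer join: fetch the full row for that b
select t.*
from
(
select max(b) as max_b
from my_table
group by a
) latest
join my_table t on latest.max_b = t.b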
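The max-and-join-back approach can be tried out locally. The following is a minimal sketch, assuming SQLite (through Python's built-in sqlite3 module) as a stand-in for Hive and a hypothetical table name events; it also joins on a as well as b, which keeps the result correct if b is only unique within each a.

```python
import sqlite3

# Sample rows from the question; "events" is a hypothetical table name.
rows = [
    (1, "a", 10, 18, "i"), (2, "b", 11, 19, "j"),
    (3, "c", 12, 20, "k"), (4, "d", 13, 21, "l"),
    (1, "e", 14, 22, "m"), (4, "f", 15, 23, "n"),
    (2, "g", 16, 24, "o"), (3, "h", 17, 25, "p"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (a INTEGER, b TEXT, c INTEGER, d INTEGER, e TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", rows)

# For each a, find the largest b, then join back to fetch the full row.
dedup_sql = """
SELECT t.a, t.b, t.c, t.d, t.e
FROM (SELECT a, MAX(b) AS max_b FROM events GROUP BY a) latest
JOIN events t ON t.a = latest.a AND t.b = latest.max_b
ORDER BY t.b
"""
deduped = conn.execute(dedup_sql).fetchall()
for row in deduped:
    print(row)
```

On the sample data this keeps the rows with b = e, f, g and h, one per value of a.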

Related

Add entries of a table into rows of another table

I have two tables
table a:
ID VALUE_z
1 41
2 32
3 51
table b:
ID TYPE z
1 a 10
1 b 15
1 c 20
2 a 12
2 b 8
2 c 5
3 a 21
3 b 4
3 c 2
I want to add the values from table a to table b as new rows, matched on ID, with 'z' as the TYPE. The result should look like this:
table result:
ID TYPE VALUE
1 a 10
1 b 15
1 c 20
1 z 41
2 a 12
2 b 8
2 c 5
2 z 32
3 a 21
3 b 4
3 c 2
3 z 51

Try the following, using an INSERT INTO ... SELECT statement:
insert into tableB
select ID, 'z', VALUE_z
from tableA
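To check the statement against the sample data, here is a small sketch using SQLite through Python's sqlite3 module. It assumes the third column of tableB is named VALUE, as in the expected result; the question's own header for table b shows a different name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableA (ID INTEGER, VALUE_z INTEGER)")
# Assumption: the value column of tableB is called VALUE.
conn.execute("CREATE TABLE tableB (ID INTEGER, TYPE TEXT, VALUE INTEGER)")
conn.executemany("INSERT INTO tableA VALUES (?, ?)", [(1, 41), (2, 32), (3, 51)])
conn.executemany("INSERT INTO tableB VALUES (?, ?, ?)", [
    (1, "a", 10), (1, "b", 15), (1, "c", 20),
    (2, "a", 12), (2, "b", 8), (2, "c", 5),
    (3, "a", 21), (3, "b", 4), (3, "c", 2),
])

# Append one 'z' row per ID, taking the value from tableA.
conn.execute(
    "INSERT INTO tableB (ID, TYPE, VALUE) SELECT ID, 'z', VALUE_z FROM tableA"
)

result = conn.execute(
    "SELECT ID, TYPE, VALUE FROM tableB ORDER BY ID, TYPE"
).fetchall()
for row in result:
    print(row)
```

Naming the target columns explicitly in the INSERT avoids relying on the column order of tableB.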

Group By Client IDs by Type of Order/Combo requested

I have a table like the one below, but with thousands of rows, listing the types of service requested by each client. I would like to group the services each client has requested into a single combination and count each combination once. For example, services A, B, and C requested by client 5 form one specific group (a 'combo') and count as one.
client_id type_of_service count
5 A 1
5 B 1
5 C 1
10 B 1
10 C 1
10 D 1
10 F 1
17 C 1
17 D 1
19 A 1
19 B 1
32 E 1
32 A 1
35 C 1
35 F 1
37 D 1
37 E 1
37 F 1
45 B 1
45 D 1
45 F 1
Which approach should I choose in this case? I am using Metabase.
What I would like to see is something like this:
Combo Service  Count
A, B, C        1
C, D           1
E, F           1
B, D, F        1
Any thoughts would be helpful!
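One possible approach is to build a sorted, comma-separated list of services per client and then count how often each list occurs. The sketch below is only an illustration under assumptions: it uses SQLite through Python's sqlite3 module, a hypothetical table name service_requests, and does the final grouping in Python. In Metabase the same idea would have to be written with whatever string-aggregation function the underlying database provides.

```python
import sqlite3
from collections import Counter
from itertools import groupby

# Sample rows from the question (client_id, type_of_service).
requests = [
    (5, "A"), (5, "B"), (5, "C"),
    (10, "B"), (10, "C"), (10, "D"), (10, "F"),
    (17, "C"), (17, "D"),
    (19, "A"), (19, "B"),
    (32, "E"), (32, "A"),
    (35, "C"), (35, "F"),
    (37, "D"), (37, "E"), (37, "F"),
    (45, "B"), (45, "D"), (45, "F"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE service_requests (client_id INTEGER, type_of_service TEXT)")
conn.executemany("INSERT INTO service_requests VALUES (?, ?)", requests)

# Sort by client and service so every combo is built in a canonical order.
ordered = conn.execute("""
SELECT DISTINCT client_id, type_of_service
FROM service_requests
ORDER BY client_id, type_of_service
""").fetchall()

# One combo string per client, then count identical combos.
combo_counts = Counter(
    ", ".join(service for _, service in group)
    for _, group in groupby(ordered, key=lambda r: r[0])
)
for combo, n in sorted(combo_counts.items()):
    print(combo, n)
```

Sorting the services before joining them matters: without it, "E, A" and "A, E" would be counted as different combos.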

How do I add specific columns from different tables onto an existing table in postgresql?

I have an original table (TABLE 1):
A B C D
1 3 5 7
2 4 6 8
I want to add column F from the table below (Table 2) onto table 1:
A F G H
1 29 5 7
2 30 6 8
I also want to add columns J, L and O from the table below (Table 3) onto table 1:
A I J K L M N O
1 9 11 13 15 17 19 21
2 10 12 14 16 18 20 22
How do I go about adding only the specific columns onto table 1?
Expected Result:
A B C D F J L O
1 3 5 7 29 11 15 21
2 4 6 8 30 12 16 22

Use the following query, which joins the three tables on the shared column A:
SELECT t1.A,
t1.B,
t1.C,
t1.D,
t2.F,
t3.J,
t3.L,
t3.O
FROM table1 t1
JOIN table2 t2
ON t1.A = t2.A
JOIN table3 t3
ON t1.A = t3.A
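The join can be verified against the sample tables with a short script. This is a sketch that uses SQLite through Python's sqlite3 module instead of PostgreSQL; the table names table1, table2 and table3 follow the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (A INTEGER, B INTEGER, C INTEGER, D INTEGER);
CREATE TABLE table2 (A INTEGER, F INTEGER, G INTEGER, H INTEGER);
CREATE TABLE table3 (A INTEGER, I INTEGER, J INTEGER, K INTEGER,
                     L INTEGER, M INTEGER, N INTEGER, O INTEGER);
INSERT INTO table1 VALUES (1, 3, 5, 7), (2, 4, 6, 8);
INSERT INTO table2 VALUES (1, 29, 5, 7), (2, 30, 6, 8);
INSERT INTO table3 VALUES (1, 9, 11, 13, 15, 17, 19, 21),
                          (2, 10, 12, 14, 16, 18, 20, 22);
""")

# Pick only the wanted columns from each table, matching rows on A.
combined = conn.execute("""
SELECT t1.A, t1.B, t1.C, t1.D, t2.F, t3.J, t3.L, t3.O
FROM table1 t1
JOIN table2 t2 ON t1.A = t2.A
JOIN table3 t3 ON t1.A = t3.A
ORDER BY t1.A
""").fetchall()
for row in combined:
    print(row)
```

Note that an inner join drops rows of table1 that have no match in table2 or table3; a LEFT JOIN would keep them with NULLs in the added columns.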

SQL Server one column data as row and other one as column

Data within the actual table represents the intersection between two categories (e.g. a and b are categories). The data has multiple categories. In the output table I need to show the intersection between these categories.
This is the actual table data:
col1 Category IntersectedCategory Count
----------------------------------------------
1 a g 12
2 b f 12
3 a a 260
4 c I 38
5 d h 39
6 b g 12
7 b h 27
8 c c 114
9 a h 60
10 a d 57
11 e e 137
12 f h 15
13 g I 12
14 d g 12
15 e f 34
16 c b 15
17 h h 190
18 f f 96
19 c d 14
20 e d 46
21 g f 12
22 e g 12
23 c f 12
24 g g 97
25 d I 72
26 b b 116
27 c h 32
28 b I 45
29 e h 15
30 c g 6
31 a b 16
32 I I 361
33 I f 55
34 a e 38
35 e I 68
36 d d 142
37 g h 6
38 a f 33
39 e b 21
40 b d 21
41 a c 29
42 a I 114
43 I h 81
44 e c 6
45 d f 29
Expected Output:
list a b
----------------
a 20 5
b 5 25

You can use Dynamic SQL for this.
Build a variable with the column names and then PIVOT on them.
Example snippet:
-- using a temporary table for demonstration purposes.
IF OBJECT_ID('tempdb..#TestTable') IS NOT NULL DROP TABLE #TestTable;
CREATE TABLE #TestTable (col1 int identity(1,1) primary key, Category varchar(8), IntersectedCategory varchar(8), [Count] int);
-- Small set of sample data
insert into #TestTable (Category, IntersectedCategory, [Count]) values
('a','a',20)
,('a','b',5)
,('b','a',5)
,('b','b',25)
;
-- Generating the column names
declare @cols varchar(max);
select @cols = concat(@cols+', ', quotename(IntersectedCategory))
from #TestTable
group by IntersectedCategory
order by IntersectedCategory;
-- constructing the sql statement that uses a PIVOT
declare @DynSql varchar(max);
set @DynSql = 'select *
from (select Category as list, IntersectedCategory as col, [Count] as total from #TestTable) as src
pivot (SUM(total) for col in ('+ @cols +')) as pvt
order by list';
-- select @DynSql as DynSql; -- Just to check how the sql looks like
-- running the generated SQL statement
exec(@DynSql);
Result:
list a b
---- -- --
a 20 5
b 5 25
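The same idea, reading the distinct categories first and then generating one output column per category, can be sketched outside SQL Server. The example below is an illustration only: it uses SQLite through Python's sqlite3 module, a hypothetical table name intersections, and conditional aggregation (SUM over CASE) in place of PIVOT, which SQLite does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE intersections '
    '(Category TEXT, IntersectedCategory TEXT, "Count" INTEGER)'
)
conn.executemany(
    "INSERT INTO intersections VALUES (?, ?, ?)",
    [("a", "a", 20), ("a", "b", 5), ("b", "a", 5), ("b", "b", 25)],
)

# Step 1: collect the values that will become column names.
categories = [r[0] for r in conn.execute(
    "SELECT DISTINCT IntersectedCategory FROM intersections "
    "ORDER BY IntersectedCategory"
)]

def quote_ident(name):
    """Quote a value for use as an SQL identifier (doubling embedded quotes)."""
    return '"' + name.replace('"', '""') + '"'

# Step 2: one SUM(CASE ...) column per category; values are bound as parameters.
columns = ", ".join(
    f'SUM(CASE WHEN IntersectedCategory = ? THEN "Count" END) AS {quote_ident(c)}'
    for c in categories
)
pivot_sql = (
    f"SELECT Category AS list, {columns} "
    "FROM intersections GROUP BY Category ORDER BY Category"
)

cur = conn.execute(pivot_sql, categories)
header = [d[0] for d in cur.description]
pivoted = cur.fetchall()
print(header)
for row in pivoted:
    print(row)
```

As with quotename in the T-SQL version, the generated column names are quoted so that unusual category values cannot break the statement.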

Join information from table A with the information from another table B multiple times

This may be a simple question, but I would like to learn whether this can be done in one query.
Table A: contains gene information
gene start end
1 a 5 0
2 b 6 1
3 c 7 2
4 d 8 3
5 e 9 4
6 f 10 5
7 g 11 6
8 h 12 7
9 i 13 8
10 j 14 9
Table B: contains calculated gene information.
gene1 gene2 cor
1 d j -0.7600805
2 c i 0.4274278
3 e g -0.9249361
4 a f 0.8567928
5 b h -0.3018518
6 d j -0.3723553
7 c i 0.1617981
8 e g 0.8575933
9 a f 0.8409788
10 b h 0.1506035
The result table I'm trying to get is:
gene1 gene2 cor start1 end1 start2 end2
1 d j -0.7600805 8 3 14 9
2 c i 0.4274278 7 2 13 8
3 e g -0.9249361
4 a f 0.8567928
5 b h -0.3018518
6 d j -0.3723553 etc.
7 c i 0.1617981
8 e g 0.8575933
9 a f 0.8409788
10 b h 0.1506035
The method I can think of is to join table A onto table B twice, first by gene1 and then by gene2, which would require an intermediate table. Is there a simpler way to achieve this in one step?

Yes, two joins will do it, and both can go in the same query.
You simply need to do this:
SELECT b.Gene1
,b.Gene2
,b.cor
,a1.Start AS Start1
,a1.End AS End1
,a2.Start AS Start2
,a2.End AS End2
FROM TableB b
INNER JOIN TableA a1
ON a1.Gene = b.Gene1
INNER JOIN TableA a2
ON a2.Gene = b.Gene2
Depending on your DBMS you may need to tweak the syntax a bit; for example, End is a reserved word in many SQL dialects, so that column name may need quoting.
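As a quick check of the double join, here is a sketch using SQLite through Python's sqlite3 module, loaded with the gene table and the first three rows of Table B from the question. The id column on TableB is an addition for stable ordering, and start and end are written in double quotes because END is a keyword in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE TableA (gene TEXT, "start" INTEGER, "end" INTEGER)')
conn.execute("CREATE TABLE TableB (id INTEGER, gene1 TEXT, gene2 TEXT, cor REAL)")

# Gene table from the question: a..j with start 5..14 and end 0..9.
genes = [(g, 5 + i, i) for i, g in enumerate("abcdefghij")]
conn.executemany("INSERT INTO TableA VALUES (?, ?, ?)", genes)
conn.executemany("INSERT INTO TableB VALUES (?, ?, ?, ?)", [
    (1, "d", "j", -0.7600805),
    (2, "c", "i", 0.4274278),
    (3, "e", "g", -0.9249361),
])

# TableA is joined twice under two aliases: once per gene column of TableB.
joined = conn.execute("""
SELECT b.gene1, b.gene2, b.cor,
       a1."start" AS start1, a1."end" AS end1,
       a2."start" AS start2, a2."end" AS end2
FROM TableB b
INNER JOIN TableA a1 ON a1.gene = b.gene1
INNER JOIN TableA a2 ON a2.gene = b.gene2
ORDER BY b.id
""").fetchall()
for row in joined:
    print(row)
```

The two aliases a1 and a2 are what make a single query sufficient: each behaves like an independent copy of TableA.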
