Aggregate strings in group by and ordered in Hive and Presto - hive

I have a table in the following format:
IDX IDY Time Text
idx1 idy1 t1 text1
idx1 idy2 t2 text2
idx1 idy2 t3 text3
idx1 idy1 t4 text4
idx2 idy3 t5 text5
idx2 idy3 t6 text6
idx2 idy1 t7 text7
idx2 idy3 t8 text8
What I'd like to see is something like this:
idx1 text1
idx1 text2, text3
idx1 text4
idx2 text5, text6
idx2 text7
idx2 text8
So in the final phase, I can get to:
text1
text2, text3
text4
==SEPARATOR==
text5, text6
text7
text8
How can I perform this in Hive or Presto? Thanks.

Hive
This is the base query; you can take it from here if you like.
select IDX
,IDY
,min(time) as from_time
,max(time) as to_time
,concat_ws(',',collect_list (Text)) as text
from (select *
,row_number () over
(
partition by IDX
order by Time
) as rn
,row_number () over
(
partition by IDX,IDY
order by Time
) as rn_IDY
from mytable
) t
group by IDX,IDY
,rn - rn_IDY
order by IDX,from_time
+------+------+-----------+---------+-------------+
| idx | idy | from_time | to_time | text |
+------+------+-----------+---------+-------------+
| idx1 | idy1 | t1 | t1 | text1 |
| idx1 | idy2 | t2 | t3 | text2,text3 |
| idx1 | idy1 | t4 | t4 | text4 |
| idx2 | idy3 | t5 | t6 | text5,text6 |
| idx2 | idy1 | t7 | t7 | text7 |
| idx2 | idy3 | t8 | t8 | text8 |
+------+------+-----------+---------+-------------+
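To see why grouping by rn - rn_IDY splits the rows into consecutive runs, it may help to run the inner query on its own. This is only a sketch: the grp alias is a name added here for illustration, and the values in the comments assume that t1 < t2 < ... < t8.

```sql
-- Inner query only, with the grouping key exposed as grp (hypothetical alias).
select IDX
      ,IDY
      ,Time
      ,rn
      ,rn_IDY
      ,rn - rn_IDY as grp
from  (select  *
              ,row_number () over (partition by IDX     order by Time) as rn
              ,row_number () over (partition by IDX,IDY order by Time) as rn_IDY
       from    mytable
      ) t
order by IDX, Time
;
-- idx1 idy1 t1  rn=1 rn_IDY=1 grp=0
-- idx1 idy2 t2  rn=2 rn_IDY=1 grp=1
-- idx1 idy2 t3  rn=3 rn_IDY=2 grp=1
-- idx1 idy1 t4  rn=4 rn_IDY=2 grp=2
```

Within one IDX, rows that share the same IDY and the same grp form one uninterrupted run, which is why (IDX, IDY, grp) works as the group by key.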
Presto
select array_join(array_agg (Text),',') as text
from (select *
,row_number () over
(
partition by IDX
order by Time
) as rn
,row_number () over
(
partition by IDX,IDY
order by Time
) as rn_IDY
from mytable
) t
group by IDX,IDY
,rn - rn_IDY
order by IDX,min(time)
;
+-------------+
| text |
+-------------+
| text1 |
| text2,text3 |
| text4 |
| text5,text6 |
| text7 |
| text8 |
+-------------+
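One caveat, as far as I know: array_agg on its own does not promise any particular element order (and neither does collect_list in Hive). Presto lets you put an ORDER BY inside the aggregate, so a variant along these lines should keep the texts in Time order within each group. Treat it as a sketch against the same mytable, not a tested drop-in.

```sql
-- Same query as above, but the aggregate orders its input by Time explicitly.
select array_join(array_agg(Text order by Time), ',') as text
from  (select  *
              ,row_number () over (partition by IDX     order by Time) as rn
              ,row_number () over (partition by IDX,IDY order by Time) as rn_IDY
       from    mytable
      ) t
group by IDX, IDY, rn - rn_IDY
order by IDX, min(Time)
;
```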

Related

How to join and union at the same time

I'm sure this has a simple solution but I'm struggling to find it.
I have two tables
CREATE TABLE t1 (
"name" VARCHAR(1),
"id" INTEGER,
"data1" VARCHAR(2)
);
INSERT INTO t1
("name", "id", "data1")
VALUES
('a', '1', 'a1'),
('d', '1', 'd1');
CREATE TABLE t2 (
"name" VARCHAR(1),
"id" INTEGER,
"data2" VARCHAR(2)
);
INSERT INTO t2
("name", "id", "data2")
VALUES
('d', '1', 'd2'),
('k', '1', 'k2');
I want this final combined table:
| name | id | data1 | data2 |
| ---- | --- | ----- | ----- |
| a | 1 | a1 | |
| d | 1 | d1 | d2 |
| k | 1 | | k2 |
Things I've tried:
Do a union
select
t1.name,
t1.id,
t1.data1,
NULL as data2
from t1
union
select
t2.name,
t2.id,
NULL as data1,
t2.data2
from t2
| name | id | data1 | data2 |
| ---- | --- | ----- | ----- |
| a | 1 | a1 | |
| d | 1 | d1 | |
| d | 1 | | d2 |
| k | 1 | | k2 |
Do a full join
select * from t1
full join t2 on t2.id = t1.id
and t2.name = t1.name;
| name | id | data1 | name | id | data2 |
| ---- | --- | ----- | ---- | --- | ----- |
| a | 1 | a1 | | | |
| d | 1 | d1 | d | 1 | d2 |
| | | | k | 1 | k2 |
The answer is somewhere in between 😅
You're looking for a FULL OUTER JOIN:
SELECT COALESCE(t1.name, t2.name) AS name,
COALESCE(t1.id, t2.id) AS id,
t1.data1,
t2.data2
FROM t1
FULL OUTER JOIN t2 ON t2.name = t1.name
Output:
name id data1 data2
a 1 a1 null
d 1 d1 d2
k 1 null k2
I think you just want a full join:
select *
from t1 full join
t2
using (name, id);
Thanks, Gordon and Nick for your answers.
I checked the documentation and found an even shorter variation:
select *
from t1
natural full join t2;
From the documentation
Furthermore, the output of JOIN USING suppresses redundant columns: there is no need to print both of the matched columns, since they must have equal values.
While JOIN ON produces all columns from T1 followed by all columns from T2, JOIN USING produces one output column for each of the listed column pairs (in the listed order), followed by any remaining columns from T1, followed by any remaining columns from T2.
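For an engine without FULL JOIN, the "somewhere in between" idea can also be written as the union from the question followed by an aggregation. This is a different technique from the answers above, sketched under the assumption that each table holds at most one row per (name, id); the alias u is just an illustrative name.

```sql
-- Stack both tables, padding the missing column with NULL, then collapse per key.
select name,
       id,
       max(data1) as data1,   -- max() skips NULLs, so the one real value survives
       max(data2) as data2
from (
    select name, id, data1, cast(null as varchar(2)) as data2 from t1
    union all
    select name, id, cast(null as varchar(2)) as data1, data2 from t2
) u
group by name, id
order by name;
```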

Join multiple tables using SQL & T-SQL

Unfortunately, I cannot be sure that the name of my question is correct here.
Example of initial data:
Table 1            Table 2            Table 3
| ID | Name  |     | ID | Info1 |     | ID | Info2 |
|----|-------|     |----|-------|     |----|-------|
| 1  | Name1 |     | 1  | text1 |     | 1  | text1 |
| 2  | Name2 |     | 1  | text1 |     | 1  | text1 |
| 3  | Name3 |     | 2  | text2 |     | 1  | text1 |
                   | 2  | text2 |     | 2  | text2 |
                                      | 3  | text3 |
In my initial data, the 3 tables are related by the field ID.
I need to join table2 and table3 to the first table, but if I join them sequentially (left join table2, then left join table3, both by ID), the second join produces additional records, because after the first join there are already several records per ID.
I need the records of table2 and table3 listed side by side, each in its own column, for every ID of the first table.
Here is an example of the expected result:
Table 3
| ID | Name |Info1(Table2)|Info2(Table3)|
|-------|-----------|-------------|-------------|
| 1 | Name1 | text1 | text1 |
| 1 | Name1 | text1 | text1 |
| 1 | Name1 | null | text1 |
| 2 | Name2 | text2 | text2 |
| 2 | Name2 | text2 | null |
| 3 | Name3 | null | text3 |
This is the method I would use, however, the table design you have could probably be improved on; why are Table2 and Table3 separate in the first place?
USE Sandbox;
GO
CREATE TABLE dbo.Table1 (ID int, [Name] varchar(5))
INSERT INTO dbo.Table1 (ID,
[Name])
VALUES(1,'Name1'),
(2,'Name2'),
(3,'Name3');
CREATE TABLE dbo.Table2 (Id int,Info1 varchar(5));
CREATE TABLE dbo.Table3 (Id int,Info2 varchar(5));
INSERT INTO dbo.Table2 (Id,
Info1)
VALUES(1,'text1'),
(1,'text1'),
(2,'text2'),
(2,'text2');
INSERT INTO dbo.Table3 (Id,
Info2)
VALUES(1,'text1'),
(1,'text1'),
(1,'text1'),
(2,'text2'),
(3,'text3');
WITH T2 AS(
SELECT ID,
Info1,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY (SELECT NULL)) AS RN --SELECT NULL as you have no other columns to actually create an order
FROM Table2),
T3 AS(
SELECT ID,
Info2,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY (SELECT NULL)) AS RN
FROM Table3),
Tally AS(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS I --Assuming you have 10 or less matching items
FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) N(N))
SELECT T1.ID,
T1.[Name],
T2.info1,
T3.info2
FROM Table1 T1
CROSS JOIN Tally T
LEFT JOIN T2 ON T1.ID = T2.ID AND T.I = T2.RN
LEFT JOIN T3 ON T1.ID = T3.ID AND T.I = T3.RN
WHERE T2.ID IS NOT NULL OR T3.ID IS NOT NULL
ORDER BY T1.ID, T.I;
GO
DROP TABLE dbo.Table1;
DROP TABLE dbo.Table2;
DROP TABLE dbo.Table3;
If you have more than 10 rows, then you could build a "proper" tally table on the fly, or create a physical one. One on the fly is probably going to be a better idea though, as I doubt you're going to have hundreds of matching rows.
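As a rough sketch of what a larger on-the-fly tally could look like: cross join the same 10-row set with itself, so each extra join multiplies the row count by 10. The CTE name N and the aliases N1, N2, N3 are illustrative, not taken from the query above.

```sql
WITH N AS(
    -- 10 placeholder rows
    SELECT N
    FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) V(N)),
Tally AS(
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS I
    FROM N N1
         CROSS JOIN N N2  --10 * 10 = 100 rows
         CROSS JOIN N N3) --100 * 10 = 1000 rows
SELECT I
FROM Tally
ORDER BY I;
```

The Tally CTE in the main query could be swapped for this pair of CTEs; drop or add a CROSS JOIN to match the number of rows you expect per ID.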

SELECT Earliest Date from Grouped Results in DATEDIFF

From this table T2, I need to select the earliest date from each group by ID where the Prog is 'YY' and use it in DATEDIFF with respect to EDate:
+----+-----------+-----------+------+
| ID | SDate | Edate | Prog |
+----+-----------+-----------+------+
| 1 | 4/12/2016 | 5/18/2016 | XX |
| 1 | 4/1/2016 | 4/4/2016 | YY |
| 1 | 5/23/2016 | 5/28/2016 | YY |
| 2 | 9/21/2016 | 9/26/2016 | XX |
| 2 | 8/7/2016 | 8/9/2016 | YY |
| 3 | 8/2/2015 | 8/12/2015 | YY |
| 3 | 4/12/2015 | 4/18/2015 | YY |
+----+-----------+-----------+------+
And then show it with the aggregate level in Table T1 as the Desired Output:
+----+------+-----+-----------+------+
| ID | Name | Age | SDate | Days |
+----+------+-----+-----------+------+
| 1 | A | 52 | 4/1/2016 | 3 |
| 2 | B | 11 | 8/7/2016 | 2 |
| 3 | C | 24 | 4/12/2015 | 6 |
+----+------+-----+-----------+------+
Attempt:
SELECT
T1.ID,
T1.Name,
T1.Age,
MIN(T2.SDate) AS [SDate],
--DATEDIFF(day,MIN(T2.SDate),T2.EDate) AS [Days]
FROM T1
INNER JOIN T2
ON T1.ID=T2.ID
WHERE T2.Prog='YY'
GROUP BY
T1.ID,
T1.Name,
T1.Age
I commented out the DATEDIFF function for Days since I am not sure how to formulate it. Something like DATEDIFF(day,(SELECT MIN(SDate) FROM T2 WHERE Prog='YY'),'Another Date') won't work, since it returns the overall MIN(SDate), which isn't partitioned by ID. I can't use SELECT ID,MIN(SDate) FROM T2 WHERE Prog='YY' GROUP BY ID as the inner subquery either, since DATEDIFF only accepts a single date value.
So how do I extract MIN(SDate) and calculate the DATEDIFF for corresponding Edate, for each grouped ID in that case?
Use the min window function to get the min sdate for each id and use it for computing the date difference.
SELECT ID,NAME,Age,DATEDIFF(DD,SDate,EDate)
FROM (
SELECT
T1.ID,
T1.Name,
T1.Age,
MIN(CASE WHEN T2.PROG = 'YY' THEN T2.SDate END) OVER(PARTITION BY T2.ID) AS [SDate],
T2.EDate
FROM T1
INNER JOIN T2 ON T1.ID=T2.ID
) x
Use MIN as a window function:
SELECT T1.ID,
T1.Name,
T1.Age,
DATEDIFF(day,
MIN(T2.SDate) OVER (PARTITION BY T1.ID, T1.Name, T1.Age),
T2.EDate) AS [Days]
FROM T1
INNER JOIN T2
ON T1.ID = T2.ID
WHERE T2.Prog = 'YY'
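Note that both window-function queries above return one row per joined T2 row and compare the minimum SDate with every row's EDate. If the goal is exactly one row per ID, using the EDate of the earliest 'YY' row (which is what the desired output shows), a ROW_NUMBER variant might be closer. A sketch, assuming SDate and EDate are date columns; the names Ranked and rn are made up here.

```sql
WITH Ranked AS (
    SELECT T1.ID,
           T1.Name,
           T1.Age,
           T2.SDate,
           T2.EDate,
           -- rn = 1 marks the earliest 'YY' row of each ID
           ROW_NUMBER() OVER (PARTITION BY T1.ID ORDER BY T2.SDate) AS rn
    FROM T1
    INNER JOIN T2
        ON T1.ID = T2.ID
    WHERE T2.Prog = 'YY'
)
SELECT ID,
       Name,
       Age,
       SDate,
       DATEDIFF(day, SDate, EDate) AS [Days]
FROM Ranked
WHERE rn = 1;
```

For ID 1 in the sample data this picks the 4/1/2016 row and computes the difference to its own EDate, 4/4/2016.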

Overwrite NULL result in query with next NOT NULL value

I have two tables:
t1
t1.id | t1.val
----- | ------
1 | a
2 | b
3 | c
4 | d
5 | e
6 | f
7 | j
and t2
t2.id|t2.val
---- | ---
1| www
3| xxx
6| yyy
7| zzz
When I run this SQL statement:
SELECT t1.id, t1.val, t2.val
FROM t1
LEFT JOIN t2 ON ( t1.id = t2.id )
I get this result:
t1.id | t1.val | t2.val
----- | ------ | ------
1 | a | www
2 | b | NULL
3 | c | xxx
4 | d | NULL
5 | e | NULL
6 | f | yyy
7 | j | zzz
How can I change the SQL statement to get a result like this?
t1.id | t1.val | t2.val
----- | ------ | ------
1 | a | www
2 | b | xxx
3 | c | xxx
4 | d | yyy
5 | e | yyy
6 | f | yyy
7 | j | zzz
Thanks for all!!
One method uses a correlated subquery:
select t1.*,
(select t2.val
from t2
where t2.id >= t1.id
order by t2.id
limit 1
) as t2val
from t1;
Another method uses window functions, but it is a bit more complicated:
SELECT t.id, t.val, t2.val
FROM (SELECT t1.id, t1.val,
MIN(t2.id) OVER (ORDER BY t1.id DESC) as matching_id
FROM t1 LEFT JOIN
t2
ON t1.id = t2.id
) t LEFT JOIN
t2
ON t2.id = t.matching_id;
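-- Oracle: IGNORE NULLS makes LEAD skip ahead to the next non-NULL t2.val
SELECT t1.id, t1.val, NVL(t2.val,LEAD(t2.val) IGNORE NULLS OVER (ORDER BY t1.id)) AS t2val
FROM t1
LEFT JOIN t2 ON ( t1.id = t2.id )
ORDER BY t1.id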

TSQL select the value from two rows that has higher priority and is not null

I am trying to consolidate two rows of the same table, where each row has a priority.
The value of interest is the value having priority 1 if it is not NULL; otherwise the value with priority 0.
An example data source could be:
| Id | GroupId | Priority | Col1 | Col2 | Col3 | ... | Coln |
-----------------------------------------------------------------
| 1 | 1 | 0 | NULL | 4711 | 3.41 | ... | f00 |
| 2 | 1 | 1 | NULL | NULL | 2.83 | ... | bar |
| 3 | 2 | 0 | NULL | 4711 | 3.41 | ... | f00 |
| 4 | 2 | 1 | 23 | NULL | 2.83 | ... | NULL |
and I want to have:
| GroupId | Col1 | Col2 | Col3 | ... | Coln |
-------------------------------------------------
| 1 | NULL | 4711 | 2.83 | ... | bar |
| 2 | 23 | 4711 | 2.83 | ... | f00 |
Is there a generic way in TSQL without the need to check each column explicitly?
SELECT
t1.GroupId,
ISNULL(t2.Col1, t1.Col1) as Col1,
ISNULL(t2.Col2, t1.Col2) as Col2,
ISNULL(t2.Col3, t1.Col3) as Col3,
...
ISNULL(t2.Coln, t1.Coln) as Coln
FROM mytable t1
JOIN mytable t2 ON t1.GroupId = t2.GroupId
WHERE
t1.Priority = 0 AND
t2.Priority = 1
Regards
I'll elaborate on the ROW_NUMBER() solution that @KM suggested since IMO it's the best solution for this. (In CTE form for easier readability)
WITH cte AS (
SELECT
t1.GroupId,
t1.Col1,
t1.Col2,
ROW_NUMBER() OVER(PARTITION BY t1.GroupId ORDER BY t1.Priority DESC) AS [row_id]
FROM
mytable t1
)
SELECT
*
FROM
cte
WHERE
row_id = 1
That will give you the row with the highest priority (according to your rules) for each GroupId in mytable.
ROW_NUMBER and RANK are two of my favorite TSQL tricks. http://msdn.microsoft.com/en-us/library/ms186734.aspx
edit: Another favorite of mine is PIVOT/UNPIVOT which you can use to transpose rows/columns which is another way of going about this type of problem. http://msdn.microsoft.com/en-us/library/ms177410.aspx
I think this would do what you are asking for without using isnull for every column
select
*
from
mytable t1
where
priority=(select max(priority) from mytable where groupid=t1.groupid group by groupid)
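The two whole-row answers above pick one row per group, so they do not fall back to the priority 0 value column by column. If that per-column fallback is what's needed, one option is conditional aggregation. It still lists each column, but it needs no self-join and tolerates a group that has only one of the two rows. A sketch using the sample rows from the question, cut down to Col1 and Col2, and assuming both are int:

```sql
-- Sample rows from the question, reduced to two value columns for brevity.
WITH mytable AS (
    SELECT *
    FROM (VALUES
        (1, 1, 0, CAST(NULL AS int), 4711),
        (2, 1, 1, NULL,              NULL),
        (3, 2, 0, NULL,              4711),
        (4, 2, 1, 23,                NULL)
    ) AS v (Id, GroupId, Priority, Col1, Col2)
)
SELECT GroupId,
       -- take the priority 1 value if it is not NULL, otherwise the priority 0 value
       COALESCE(MAX(CASE WHEN Priority = 1 THEN Col1 END),
                MAX(CASE WHEN Priority = 0 THEN Col1 END)) AS Col1,
       COALESCE(MAX(CASE WHEN Priority = 1 THEN Col2 END),
                MAX(CASE WHEN Priority = 0 THEN Col2 END)) AS Col2
FROM mytable
GROUP BY GroupId
ORDER BY GroupId;
```

For the sample this gives (1, NULL, 4711) and (2, 23, 4711), matching the first two value columns of the desired result.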