Hive: merge or tag multiple rows based on neighboring rows

I have the following table and want to merge multiple rows based on neighboring rows.
INPUT
EXPECTED OUTPUT
The logic is that "abc" is connected to "abcd" in the first row, "abcd" is connected to "abcde" in the second row, and so on, so "abc", "abcd", "abcde" and "abcdef" are all connected and go into one array. The same applies to the remaining rows. The number of connected neighboring rows is arbitrary.
The question is how to do that in a Hive script without any UDF. Do I have to use Spark for this type of operation? Thanks very much.
One idea I had is to tag rows first as
How to do that using Hive script only?

This is an example of a CONNECT BY (hierarchical) query, which Hive and Spark do not support, unlike DB2, Oracle and others.
You can simulate such a query with Spark Scala, but it is far from handy. Putting a tag in means the question is less relevant then, imo.

Here is a workaround using a Hive script to get the intermediate table.
-- Step 1: tag = 1 when this row's node_2 continues into the next row's node_1;
-- tag = 0 marks the last row of a chain.
drop table if exists step1;
create table step1 STORED as orc as
with src as
(
    select split(u.tmp,",")[0] as node_1, split(u.tmp,",")[1] as node_2
    from
    (select stack (7,
        "abc,abcd",
        "abcd,abcde",
        "abcde,abcdef",
        "bcd,bcde",
        "bcde,bcdef",
        "cdef,cdefg",
        "def,defg"
     ) as tmp
    ) u
)
select node_1, node_2,
       if(node_2 = lead(node_1, 1) over (order by node_1), 1, 0) as tag,
       row_number() OVER (order by node_1) as row_num
from src;
-- Step 2: for each chain end (tag = 0), emit the chain number once per row
-- in that chain, then number the emitted rows so they line up with step1.row_num.
drop table if exists step2;
create table step2 STORED as orc as
SELECT tag, row_number() over (ORDER BY tag) as row_num
FROM (
    SELECT cast(v.tag as int) as tag
    FROM (
        SELECT
            -- repeat the chain number (key) once per row in the chain
            split(regexp_replace(repeat(concat(cast(key as string), ","), end_idx-start_idx), ",$",""), ",") as tags
        FROM (
            SELECT COALESCE(lag(row_num, 1) over(ORDER BY row_num), 0) as start_idx,
                   row_num as end_idx,
                   row_number() over (ORDER BY row_num) as key
            FROM step1 where tag=0
        ) a
    ) b
    LATERAL VIEW explode(tags) v as tag
) c ;
-- Step 3: attach the chain number to the original node pairs.
drop table if exists step3;
create table step3 STORED as orc as
SELECT
    a.node_1, a.node_2, b.tag
FROM step1 a
JOIN step2 b
ON a.row_num=b.row_num;
The final table looks like
select * from step3;
+---------------+---------------+------------+
| step3.node_1 | step3.node_2 | step3.tag |
+---------------+---------------+------------+
| abc | abcd | 1 |
| abcd | abcde | 1 |
| abcde | abcdef | 1 |
| bcd | bcde | 2 |
| bcde | bcdef | 2 |
| cdef | cdefg | 3 |
| def | defg | 4 |
+---------------+---------------+------------+
The third column can be used to collect node pairs.
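The tagging logic above can be sketched outside Hive to make the grouping rule explicit. This is only an illustrative Python sketch, not part of the Hive solution: the function names tag_chains and collect_nodes are hypothetical, and it relies on the same assumption as the Hive script, namely that the rows of a chain sit next to each other when sorted by node_1.

```python
# Sample (node_1, node_2) pairs from the question.
edges = [
    ("abc", "abcd"), ("abcd", "abcde"), ("abcde", "abcdef"),
    ("bcd", "bcde"), ("bcde", "bcdef"),
    ("cdef", "cdefg"), ("def", "defg"),
]

def tag_chains(pairs):
    """Give each pair a group tag; a new group starts whenever the previous
    row's node_2 differs from the current row's node_1."""
    tagged = []
    tag = 0
    prev_node_2 = None
    for node_1, node_2 in sorted(pairs):  # same order as ORDER BY node_1
        if node_1 != prev_node_2:
            tag += 1
        tagged.append((node_1, node_2, tag))
        prev_node_2 = node_2
    return tagged

def collect_nodes(tagged):
    """Collect the distinct nodes of each group, in order of appearance."""
    groups = {}
    for node_1, node_2, tag in tagged:
        nodes = groups.setdefault(tag, [])
        for node in (node_1, node_2):
            if node not in nodes:
                nodes.append(node)
    return groups

tagged = tag_chains(edges)
groups = collect_nodes(tagged)
print(groups[1])  # ['abc', 'abcd', 'abcde', 'abcdef']
```

The tags produced here match the third column of step3, and collect_nodes corresponds to grouping by that column and collecting the nodes into an array.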

Related

How to extract a JSON value in Hive

I have a JSON string that is stored in a single cell in the DB, corresponding to a parent ID:
{"profileState":"ACTIVE","isDefault":"true","joinedOn":"2019-03-24T15:19:52.639Z","profileType":"ADULT","id":"abc","signupDeviceId":"1"}||{"profileState":"ACTIVE","isDefault":"true","joinedOn":"2021-09-05T07:47:00.245Z","imageId":"19","profileType":"KIDS","name":"Kids","id":"efg","signupDeviceId":"1"}
Now I want to extract the id from each JSON object above. Let's say we have data like
Parent ID | Profile JSON
1 | {profile_json} (see above string)
I want the output to look like this
Parent ID | ID
1 | abc
1 | efg
Now, I've tried a couple of iterations to solve this
First Approach:
select
get_json_object(p.profile, '$$.id') as id,
test.parent_id
from (
select split(
regexp_replace(
regexp_extract(profiles, '^\\[(.+)\\]$$',1),
'\\}\\,\\{', '\\}\\|\\|\\{'),
'\\|\\|') as profile_list,
parent_id ,
from source_table) test
lateral view explode(test.profile_list) p as profile
)
But this returns NULL values in the id column. Is there something I'm missing here?
Second Approach:
with profiles as(
select regexp_replace(
regexp_extract(profiles, '^\\[(.+)\\]$$',1),
'\\}\\,\\{', '\\}\\|\\|\\{') as profile_list,
parent_id
from source_table
)
SELECT
get_json_object (t1.profile_list,'$.id')
FROM profiles t1
The second approach is only returning the first id (abc) as per the above JSON string.
I tried to replicate this in Apache Hive v4.
Data
+----------------------------------------------------+------------------+
| data | parent_id |
+----------------------------------------------------+------------------+
| {"profileState":"ACTIVE","isDefault":"true","joinedOn":"2019-03-24T15:19:52.639Z","profileType":"ADULT","id":"abc","signupDeviceId":"1"}||{"profileState":"ACTIVE","isDefault":"true","joinedOn":"2021-09-05T07:47:00.245Z","imageId":"19","profileType":"KIDS","name":"Kids","id":"efg","signupDeviceId":"1"} | 1.0 |
+----------------------------------------------------+------------------+
SQL
select pid,get_json_object(expl_jid,'$.id') json_id from
(select parent_id pid,split(data,'\\|\\|') jid from tabl1)a
lateral view explode(jid) exp_tab as expl_jid;
+------+----------+
| pid | json_id |
+------+----------+
| 1.0 | abc |
| 1.0 | efg |
+------+----------+
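The split-then-explode idea in this answer can be checked quickly outside Hive. The following is a minimal Python sketch, assuming the "||" delimiter never occurs inside a JSON value; the function name extract_ids is hypothetical and the sample string is a trimmed version of the one in the question.

```python
import json

# Trimmed version of the sample cell: two JSON objects joined by "||".
raw = (
    '{"profileState":"ACTIVE","profileType":"ADULT","id":"abc"}'
    '||'
    '{"profileState":"ACTIVE","profileType":"KIDS","name":"Kids","id":"efg"}'
)

def extract_ids(parent_id, profiles):
    """Split the cell on "||" (like split + explode) and read "id" from
    each piece (like get_json_object(..., '$.id'))."""
    return [(parent_id, json.loads(part)["id"]) for part in profiles.split("||")]

print(extract_ids(1, raw))  # [(1, 'abc'), (1, 'efg')]
```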
Solved this. I was using an extra $ in the JSON path in the First Approach ('$$.id' instead of '$.id'):
select
get_json_object(p.profile, '$.id') as id,
test.parent_id
from (
select split(
regexp_replace(
regexp_extract(profiles, '^\\[(.+)\\]$$',1),
'\\}\\,\\{', '\\}\\|\\|\\{'),
'\\|\\|') as profile_list,
parent_id
from source_table) test
lateral view explode(test.profile_list) p as profile

Sort each character in a string from a specific column in Snowflake SQL

I am trying to alphabetically sort each value in a column with Snowflake. For example I have:
| NAME |
| ---- |
| abc |
| bca |
| acb |
and want
| NAME |
| ---- |
| abc |
| abc |
| abc |
How would I go about doing that? I've tried using SPLIT and then ordering the rows, but that doesn't seem to work without a specific delimiter.
Using REGEXP_REPLACE to introduce separator between each character, STRTOK_SPLIT_TO_TABLE to get individual letters as rows and LISTAGG to combine again as sorted string:
SELECT tab.col, LISTAGG(s.value) WITHIN GROUP (ORDER BY s.value) AS result
FROM tab
, TABLE(STRTOK_SPLIT_TO_TABLE(REGEXP_REPLACE(tab.col, '(.)', '\\1~'), '~')) AS s
GROUP BY tab.col;
For sample data:
CREATE OR REPLACE TABLE tab
AS
SELECT 'abc' AS col UNION
SELECT 'bca' UNION
SELECT 'acb';
Output:
A similar implementation to Lukasz's, but using regexp_extract_all to extract the individual characters as an array, which we then split into rows using flatten. listagg then stitches them back together in the order specified in the within group clause.
with cte (col) as
(select 'abc' union
select 'bca' union
select 'acb')
select col, listagg(b.value) within group (order by b.value) as col2
from cte, lateral flatten(regexp_extract_all(col,'.')) b
group by col;
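Both answers follow the same three steps: split the string into characters, sort them, and join them back. As a point of comparison only, here is the same idea as a small Python sketch (the function name sort_chars is hypothetical):

```python
def sort_chars(value):
    """Split into characters, sort them, and join them back into a string."""
    return "".join(sorted(value))

names = ["abc", "bca", "acb"]
print([sort_chars(n) for n in names])  # ['abc', 'abc', 'abc']
```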

Json Array Column split into Rows SQL

Currently I have in my DB(mariaDB 10.3) a column that is called data and contains a json array:
client| data
1 | '["a","b","c"]'
2 | '["k"]'
and I would like to break it down into
client| data
1 | "a"
1 | "b"
1 | "c"
2 | "k"
Unfortunately, MariaDB 10.3 does not support the "unnesting" function JSON_TABLE(), unlike MySQL 8.0 (MariaDB only added it in 10.6).
We are left with some kind of iterative approach, typically by using a table of numbers to enumerate the array elements. If you have a table with at least as many rows as the maximum number of elements in an array, say bigtable, you can do:
select client, json_unquote(json_extract(t.data, concat('$[', n.rn - 1, ']'))) value
from mytable t
inner join (select row_number() over() rn from bigtable) n
on n.rn <= json_length(t.data)
order by t.client, n.rn
Demo on DB Fiddle:
client | value
-----: | :----
1 | a
1 | b
1 | c
2 | k
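To see why the numbers-table join works, here is a rough Python sketch of the same logic. It is an illustration under assumptions, not MariaDB code: the range named numbers stands in for bigtable, and the function name unnest is hypothetical.

```python
import json

# Sample rows from the question: (client, JSON array as text).
rows = [(1, '["a","b","c"]'), (2, '["k"]')]

# Stand-in for "bigtable": must have at least as many numbers as the longest array.
numbers = range(1, 11)

def unnest(rows, numbers):
    """Pair each row with every number n <= array length, then pick element n - 1."""
    out = []
    for client, data in rows:
        arr = json.loads(data)
        for n in numbers:
            if n <= len(arr):                    # join condition: n.rn <= json_length(t.data)
                out.append((client, arr[n - 1]))  # json_extract(t.data, '$[n - 1]')
    return out

print(unnest(rows, numbers))  # [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'k')]
```

If the numbers source is shorter than the longest array, the extra elements are silently dropped, which is why the answer asks for a table with enough rows.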

Hive: How to check if values of one array are present in another?

I have two arrays like this , which are being returned from a UDF I created:
array A - [P908,S57,A65]
array B - [P908,S57]
I need to check if elements of array A are present in array B, or elements of array B are present in array A using hive queries.
I am stuck here. Could anyone suggest a way?
Can I also return some other data type from the UDF in place of array to make the comparison easier?
select concat(',',concat_ws(',',A),',') regexp
concat(',(',concat_ws('|',B),'),') as are_common_elements
from mytable
;
Demo
create table mytable (id int,A array<string>,B array<string>);
insert into table mytable
select 1,array('P908','S57','A65'),array('P908','S57')
union all select 2,array('P908','S57','A65'),array('P9','S5777')
;
select * from mytable;
+------------+----------------------+----------------+
| mytable.id | mytable.a | mytable.b |
+------------+----------------------+----------------+
| 1 | ["P908","S57","A65"] | ["P908","S57"] |
| 2 | ["P908","S57","A65"] | ["P9","S5777"] |
+------------+----------------------+----------------+
select id
,concat(',',concat_ws(',',A),',') as left_side_of_regexp
,concat(',(',concat_ws('|',B),'),') as right_side_of_regexp
,concat(',',concat_ws(',',A),',') regexp
concat(',(',concat_ws('|',B),'),') as are_common_elements
from mytable
;
+----+---------------------+----------------------+---------------------+
| id | left_side_of_regexp | right_side_of_regexp | are_common_elements |
+----+---------------------+----------------------+---------------------+
| 1 | ,P908,S57,A65, | ,(P908|S57), | true |
| 2 | ,P908,S57,A65, | ,(P9|S5777), | false |
+----+---------------------+----------------------+---------------------+
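The regular-expression trick can be reproduced in a few lines of Python to see how the two sides are built. This is a sketch only: the function name have_common_elements is hypothetical, and the call to re.escape is an addition that the Hive query above does not have, so elements containing regex metacharacters would behave differently in the original.

```python
import re

def have_common_elements(a, b):
    """Wrap A as ',x,y,z,' and search it for ',(p|q),' built from B."""
    left = "," + ",".join(a) + ","
    # re.escape is an assumption added here; the Hive query uses the raw values.
    pattern = ",(" + "|".join(re.escape(x) for x in b) + "),"
    return re.search(pattern, left) is not None

print(have_common_elements(["P908", "S57", "A65"], ["P908", "S57"]))  # True
print(have_common_elements(["P908", "S57", "A65"], ["P9", "S5777"]))  # False
```

The surrounding commas are what prevent "P9" from matching inside "P908": each alternative has to match a whole element.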
We can do this using a lateral view.
Let's say we have two tables, table1 and table2, with array columns col1 and col2 respectively.
Use something like the following:
select collect_set (array_contains (col1 , r.tab2) )
from table1 ,
(select exp1 as tab2
from table2 t2 lateral view explode(col2) exploded_table as exp1 ) r
You can also use array_intersect (in Hive versions that provide it) or another array function.

Get previous and next row from rows selected with (WHERE) conditions

For example I have this statement:
my name is Joseph and my father's name is Brian
This statement is split by word, as in this table:
------------------------------
| ID | word |
------------------------------
| 1 | my |
| 2 | name |
| 3 | is |
| 4 | Joseph |
| 5 | and |
| 6 | my |
| 7 | father's |
| 8 | name |
| 9 | is |
| 10 | Brian |
------------------------------
I want to get the previous and next word of each word.
For example, I want to get the previous and next word of "name":
--------------------------
| my | name | is |
--------------------------
| father's | name | is |
--------------------------
How could I get this result?
You didn't specify your DBMS, so the following is ANSI SQL:
select prev_word, word, next_word
from (
select id,
lag(word) over (order by id) as prev_word,
word,
lead(word) over (order by id) as next_word
from words
) as t
where word = 'name';
SQLFiddle: http://sqlfiddle.com/#!12/7639e/1
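What LAG and LEAD compute here can be sketched in plain Python. This is an illustration only, with a hypothetical function name with_neighbors. It also shows why the filter on word = 'name' has to sit outside the subquery: the neighbors must be computed over all rows before any rows are filtered out.

```python
words = [
    (1, "my"), (2, "name"), (3, "is"), (4, "Joseph"), (5, "and"),
    (6, "my"), (7, "father's"), (8, "name"), (9, "is"), (10, "Brian"),
]

def with_neighbors(rows):
    """For each row ordered by id, return (previous word, word, next word)."""
    ordered = sorted(rows)  # ORDER BY id
    result = []
    for i, (_id, word) in enumerate(ordered):
        prev_word = ordered[i - 1][1] if i > 0 else None                 # LAG(word)
        next_word = ordered[i + 1][1] if i + 1 < len(ordered) else None  # LEAD(word)
        result.append((prev_word, word, next_word))
    return result

# Filter only after the neighbors have been computed over the full table.
matches = [row for row in with_neighbors(words) if row[1] == "name"]
print(matches)  # [('my', 'name', 'is'), ("father's", 'name', 'is')]
```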
Why did nobody give the simple answer?
SELECT LAG(word) OVER ( ORDER BY ID ) AS PreviousWord ,
word ,
LEAD(word) OVER ( ORDER BY ID ) AS NextWord
FROM words;
Without subqueries:
SELECT a.word
FROM my_table AS a
JOIN my_table AS b
ON b.word = 'name' AND abs(a.id - b.id) <= 1
ORDER BY a.id
Use a join to get the expected result on SQL Server 2005 and later.
create table words (id integer, word varchar(20));
insert into words
values
(1 ,'my'),
(2 ,'name'),
(3 ,'is'),
(4 ,'joseph'),
(5 ,'and'),
(6 ,'my'),
(7 ,'father'),
(8 ,'name'),
(9 ,'is'),
(10,'brian');
SELECT A.Id , C.word AS PrevName ,
A.word AS CurName ,
B.word AS NxtName
FROM words AS A
LEFT JOIN words AS B ON A.Id = B.Id - 1
LEFT JOIN words AS C ON A.Id = C.Id + 1
WHERE A.Word = 'name'
Result:
Fiddler Demo
Try this
SELECT *
FROM tablename a
WHERE ID IN(SELECT ID - 1
FROM tablename
WHERE word = 'name') -- will fetch previous rows of word `name`
OR ID IN(SELECT ID + 1
FROM tablename
WHERE word = 'name') -- will fetch next rows of word `name`
OR word = 'name' -- to fetch the rows where word = `name`
Here's a different approach, if you want the selects to be fast. It takes a bit of preparation work.
Create a new column (e.g. "phrase") in the database that will contain the words you want (i.e. the previous, the current and the next).
Write a trigger that, on insert, appends the new word to the previous row's phrase, and fills the new row's phrase with the previous row's word followed by the new word.
If the individual words can change, you'll need a trigger on update to keep the phrase in sync.
Then just select the phrase. You get much better speed, but at the cost of extra storage and slower insert and harder maintainability. Obviously you have to update the phrase column for the existing records, but you have the SQL to do that in the other answers.
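A minimal sketch of the trigger idea, using SQLite through Python's sqlite3 module so it can be run anywhere. Everything here is an assumption made for illustration: the trigger name words_phrase_ai, the column types, and in particular the assumption that ids are consecutive, so the previous row is always id - 1. Other databases will need their own trigger syntax, and the update trigger mentioned above is not covered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT, phrase TEXT);

CREATE TRIGGER words_phrase_ai AFTER INSERT ON words
BEGIN
    -- Append the new word to the previous row's phrase.
    UPDATE words
    SET phrase = phrase || ' ' || NEW.word
    WHERE id = NEW.id - 1;

    -- Fill the new row's phrase: previous word (if any) followed by the new word.
    UPDATE words
    SET phrase = COALESCE((SELECT word FROM words WHERE id = NEW.id - 1) || ' ', '')
                 || NEW.word
    WHERE id = NEW.id;
END;
""")

sentence = "my name is Joseph and my father's name is Brian".split()
conn.executemany(
    "INSERT INTO words (id, word) VALUES (?, ?)",
    list(enumerate(sentence, start=1)),
)

# Reading the neighbors is now a plain lookup with no window function or join.
phrases = [p for (p,) in conn.execute(
    "SELECT phrase FROM words WHERE word = 'name' ORDER BY id")]
print(phrases)  # ['my name is', "father's name is"]
```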