Hive: concat two map objects

I have two tables as follows in Hive:
Table 1
key1 | value1
int | map(int,array(int))
Table 2
key2 | value2
int | map(int,array(int))
and now I join the tables on key and I want to concat the two maps that have the same key. In other words, the final table should look like:
Table
key | value
int | map(int,array(int))
I tried to use the collect_set function while joining, as follows:
collect_set(value1,value2)
but it throws an exception that only one input is required. Any thoughts or comments?
Thanks

COLLECT_SET() is an aggregate function, so it wouldn't really be useful (or valid) if you are trying to combine things. One thing you could try is using COMBINE(), which can be found in the Brickhouse library of UDFs. Suppose you had some data like:
table0:
idx map_kv
0 {2:[1,2,3,4], 3:[5,6,7,8,9]}
table1:
idx map_kv
0 {2:[5,6,7,8,9], 3:[1,2,3,4]}
Then you could do
Query:
ADD JAR /path/to/jar/brickhouse-0.7.1.jar;
CREATE TEMPORARY FUNCTION COLLECT AS 'brickhouse.udf.collect.CollectUDAF';
CREATE TEMPORARY FUNCTION COMBINE AS 'brickhouse.udf.collect.CombineUDF';
SELECT idx
     , COLLECT(map_key, arr) AS final_map
FROM (
    SELECT a.idx
         , a.map_key
         , COMBINE(map_val_0, map_val_1) AS arr
    FROM (
        SELECT idx
             , map_key
             , map_val_0
        FROM database.table0
        LATERAL VIEW EXPLODE(map_kv) exptbl0 AS map_key, map_val_0 ) a
    JOIN (
        SELECT idx
             , map_key
             , map_val_1
        FROM database.table1
        LATERAL VIEW EXPLODE(map_kv) exptbl1 AS map_key, map_val_1 ) b
    ON a.idx = b.idx AND a.map_key = b.map_key ) c
GROUP BY idx;
This will produce:
Output:
idx final_map
0 {2:[1,2,3,4,5,6,7,8,9], 3:[5,6,7,8,9,1,2,3,4]}

Related

Pivot with column name in Postgres

I have the following table tbl:
column1 | column2 | column3
-----------------------------------
1 | 'value1' | 3
2 | 'value2' | 4
How can I "pivot" with column names to produce output like:
column1 | 1 | 2
column2 | 'value1' |'value2'
column3 | 3 | 4
As has been commented, the issue of data types is undefined in the question.
If you are OK with all result columns being type text (every data type can be converted to text), you can use one of these:
Plain SQL
WITH cte AS (
SELECT nu.*
FROM tbl t
, LATERAL (
VALUES
(1, t.column1::text)
, (2, t.column2)
, (3, t.column3::text)
) nu(rn, c)
)
SELECT *
FROM (TABLE cte OFFSET 0 LIMIT 3) c1
JOIN (TABLE cte OFFSET 3 LIMIT 3) c2 USING (rn);
The same with useful column names:
WITH cte AS (
SELECT nu.*
FROM tbl t
, LATERAL (
VALUES
('column1', t.column1::text)
, ('column2', t.column2)
, ('column3', t.column3::text)
) nu(rn, c)
)
SELECT * FROM (
SELECT *
FROM (TABLE cte OFFSET 0 LIMIT 3) c1
JOIN (TABLE cte OFFSET 3 LIMIT 3) c2 USING (rn)
) t (key, row1, row2);
Works in any modern version of Postgres.
The SQL string has to be adapted to the number of rows and columns. See fiddles below!
Using a document type as a stepping stone
Makes for shorter code.
With many rows and many columns, performance of the SQL solution may scale better because the intermediate derived table is smaller.
(The approach is limited anyway, as you can't have more than ~ 1600 table columns in Postgres.)
Since everything is converted to text anyway, hstore seems most efficient. See:
Key value pair in PostgreSQL
SELECT key
, arr[1] AS row1
, arr[2] AS row2
FROM (
SELECT x.key, array_agg(x.value) AS arr
FROM tbl t, each(hstore(t)) x
GROUP BY 1
) sub
ORDER BY 1;
Technically speaking, we would have to enforce the right sort order in array_agg(), but that should work without an explicit ORDER BY. To be absolutely sure you can add one: array_agg(x.value ORDER BY t.ctid), using ctid for lack of a better ordering column.
You can do the same with JSON functions in Postgres 9.3+. Just replace each(hstore(t)) with json_each_text(row_to_json(t)). The rest is identical.
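For example, the JSON twin of the hstore query above would be (a sketch for Postgres 9.3+, against the same tbl):
SELECT key
     , arr[1] AS row1
     , arr[2] AS row2
FROM (
    SELECT x.key, array_agg(x.value) AS arr
    FROM tbl t, json_each_text(row_to_json(t)) x
    GROUP BY 1
    ) sub
ORDER BY 1;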
These fiddles demonstrate how to scale each query:
Original example with 2 rows of 3 columns:
db<>fiddle here
Scaled up to 3 rows of 4 columns:
db<>fiddle here

Need Sorting With External Array or Comma Separated data

I am working with PostgreSQL 8.0.2 and I have this table:
create table rate_date (id serial, rate_name text);
and it's data is
id rate_name
--------------
1 startRate
2 MidRate
3 xlRate
4 xxlRate
After a select it will show data in the default order, or ordered by any column of the same table. My requirement is that I have a separate entity from which I will get data such as (xlRate, MidRate, startRate, xxlRate), and I want to use this data to sort the select on table rate_date. I have tried a VALUES join but it's not working, and I can't think of any other solution. If anyone has an idea, please share the details.
Output should be
xlRate
MidRate
startRate
xxlRate
My attempt/thinking:
select id, rate_name
from rate_date r
join (
    VALUES (1, 'xlRate'), (2, 'MidRate')
) as x(a, b) on x.b = r.rate_name
I am not sure if this is helpful but in Oracle you could achieve that this way:
select *
from
(
select id, rate_name,
case rate_name
when 'xlRate' then 1
when 'MidRate' then 2
when 'startRate' then 3
when 'xxlRate' then 4
else 100
end my_order
from rate_date r
)
order by my_order
Maybe you can do something like this in PostgreSQL?
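The same idea ports to PostgreSQL almost verbatim; the only required change is an alias on the derived table, which Oracle tolerates omitting but Postgres does not. A sketch against the rate_date table above (plain CASE, so it should also run on 8.0):
select id, rate_name
from (
    select id, rate_name,
           case rate_name
               when 'xlRate' then 1
               when 'MidRate' then 2
               when 'startRate' then 3
               when 'xxlRate' then 4
               else 100
           end as my_order
    from rate_date
) sub
order by my_order;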

SQL grouping by distinct values in a multi-value string column

I want to perform a group-by based on the distinct values in a string column that has multiple values.
The said column has a list of strings in a standard format separated by commas. The potential values are only a,b,c,d.
For example the column collection (type: String) contains:
Row 1: ["a","b"]
Row 2: ["b","c"]
Row 3: ["b","c","a"]
Row 4: ["d"]
The expected output is a count of unique values:
collection | count
a | 2
b | 3
c | 2
d | 1
For all of the below I used this table:
create table tmp (
id INT auto_increment,
test VARCHAR(255),
PRIMARY KEY (id)
);
insert into tmp (test) values
("a,b"),
("b,c"),
("b,c,a"),
("d")
;
If the possible values are only a, b, c, d you can try one of these:
Take note that this will only work if you don't have near-identical values like test and test_new, because then test would also be joined with all the test_new rows and the count would not match (see the token-exact FIND_IN_SET variant after the output below).
select collection, COUNT(*) as count from tmp JOIN (
select CONCAT("%", tb.collection, "%") as like_collection, collection from (
select "a" COLLATE utf8_general_ci as collection
union select "b" COLLATE utf8_general_ci as collection
union select "c" COLLATE utf8_general_ci as collection
union select "d" COLLATE utf8_general_ci as collection
) tb
) tb1
ON tmp.test LIKE tb1.like_collection
GROUP BY tb1.collection;
Which will give you the result you want
collection | count
a | 2
b | 3
c | 2
d | 1
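If you need exact token matching to avoid the test/test_new caveat above, MySQL's FIND_IN_SET can replace the LIKE join, since it only matches whole comma-delimited items. A sketch against the same tmp table (you may still need the COLLATE clauses depending on your settings):
select tb1.collection, COUNT(*) as count from tmp JOIN (
    select "a" as collection
    union select "b"
    union select "c"
    union select "d"
) tb1
ON FIND_IN_SET(tb1.collection, tmp.test) > 0
GROUP BY tb1.collection;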
or you can try this one
SELECT
(SELECT COUNT(*) FROM tmp WHERE test LIKE '%a%') as a_count,
(SELECT COUNT(*) FROM tmp WHERE test LIKE '%b%') as b_count,
(SELECT COUNT(*) FROM tmp WHERE test LIKE '%c%') as c_count,
(SELECT COUNT(*) FROM tmp WHERE test LIKE '%d%') as d_count
;
The result would be like this
a_count | b_count | c_count | d_count
2 | 3 | 2 | 1
What you need to do is first explode the collection column into separate rows (like a flatMap operation). In Redshift the only way to generate new rows is to JOIN, so let's CROSS JOIN your input table with a static table of consecutive numbers, and keep only the numbers less than or equal to the number of elements in the collection. Then we'll use the split_part function to read the item at the correct index. Once we have the exploded table, we'll do a simple GROUP BY.
If your items are stored as JSON array strings ('["a", "b", "c"]') then you can use JSON_ARRAY_LENGTH and JSON_EXTRACT_ARRAY_ELEMENT_TEXT instead of REGEXP_COUNT and SPLIT_PART respectively (a sketch follows the query below).
with
index as (
select 1 as i
union all select 2
union all select 3
union all select 4 -- could be substituted with 'select row_number() over () as i from arbitrary_table limit 4'
),
agg as (
select 'a,b' as collection
union all select 'b,c'
union all select 'b,c,a'
union all select 'd'
)
select
split_part(collection, ',', i) as item,
count(*)
from index,agg
where regexp_count(agg.collection, ',') + 1 >= index.i -- only get rows where number of items matches
group by 1
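For reference, a rough sketch of that JSON variant, assuming the rows are stored as JSON array strings (JSON_EXTRACT_ARRAY_ELEMENT_TEXT is 0-indexed, hence the i - 1):
with
index as (
    select 1 as i
    union all select 2
    union all select 3
    union all select 4
),
agg as (
    select '["a","b"]' as collection
    union all select '["b","c"]'
    union all select '["b","c","a"]'
    union all select '["d"]'
)
select
    json_extract_array_element_text(agg.collection, index.i - 1) as item,
    count(*)
from index, agg
where json_array_length(agg.collection) >= index.i -- only keep valid indexes
group by 1;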

Convert an array into a Map

I have a table with a column like
[{"key":"e","value":["253","203","204"]},{"key":"st","value":["mi"]},{"key":"k2","value":["1","2"]}]
Which is of the format array<struct<key:string,value:array<string>>>
I want to convert the column into below format :
{"e":["253","203","204"],"st":["mi"],"k2":["1","2"]}
which is of the type map<string,array<string>>
I have tried exploding the array but that does not work. Any ideas how I can do this in Hive?
Without the use of external libraries it's impossible. Refer to Brickhouse or create your own UDAF.
Note: the following code provides snippets that reproduce the problem and then solve what Hive's built-in functions can solve, i.e. map<string,string>, not map<string,array<string>>.
-- reproducing the problem
CREATE TABLE test_table(id INT, input ARRAY<STRUCT<key:STRING,value:ARRAY<STRING>>>);
INSERT INTO TABLE test_table
SELECT
1 AS id,
ARRAY(
named_struct("key","e", "value", ARRAY("253","203","204")),
named_struct("key","st", "value", ARRAY("mi")),
named_struct("key","k2", "value", ARRAY("1", "2"))
) AS input;
SELECT id, input FROM test_table;
+-----+-------------------------------------------------------------------------------------------------------+--+
| id | input |
+-----+-------------------------------------------------------------------------------------------------------+--+
| 1 | [{"key":"e","value":["253","203","204"]},{"key":"st","value":["mi"]},{"key":"k2","value":["1","2"]}] |
+-----+-------------------------------------------------------------------------------------------------------+--+
With exploding and using STRUCT features, we can split the keys and values.
SELECT id, exploded_input.key, exploded_input.value
FROM (
SELECT id, exploded_input
FROM test_table LATERAL VIEW explode(input) d AS exploded_input
) x;
+-----+------+----------------------+--+
| id | key | value |
+-----+------+----------------------+--+
| 1 | e | ["253","203","204"] |
| 1 | st | ["mi"] |
| 1 | k2 | ["1","2"] |
+-----+------+----------------------+--+
The idea is to use your UDAF to "collect" a map while aggregating on id.
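For instance, with the Brickhouse CollectUDAF from the first question's answer above, that aggregation becomes a one-liner producing a true map<string,array<string>> (a sketch; the jar path is a placeholder):
ADD JAR /path/to/jar/brickhouse-0.7.1.jar;
CREATE TEMPORARY FUNCTION COLLECT AS 'brickhouse.udf.collect.CollectUDAF';
SELECT id, COLLECT(exploded_input.key, exploded_input.value) AS final_map
FROM test_table LATERAL VIEW explode(input) d AS exploded_input
GROUP BY id;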
What Hive can solve with built-in functions is generating map<string,string>: convert each row to a string with a special delimiter, aggregate the rows with another special delimiter, and then use the str_to_map built-in function on those delimiters.
SELECT
id,
str_to_map(
-- outputs: e:253,203,204#st:mi#k2:1,2 with delimiters between aggregated rows
concat_ws('#', collect_list(list_to_string)),
'#', -- first delimiter
':' -- second delimiter
) mapped_output
FROM (
SELECT
id,
-- outputs 3 rows: (e:253,203,204), (st:mi), (k2:1,2)
CONCAT(exploded_input.key,':' , CONCAT_WS(',', exploded_input.value)) as list_to_string
FROM (
SELECT id, exploded_input
FROM test_table LATERAL VIEW explode(input) d AS exploded_input
) x
) y
GROUP BY id;
Which outputs a string to string map like:
+-----+-------------------------------------------+--+
| id | mapped_output |
+-----+-------------------------------------------+--+
| 1 | {"e":"253,203,204","st":"mi","k2":"1,2"} |
+-----+-------------------------------------------+--+
with input_set as (
    select array(
        named_struct('key', 'e', 'value', array('253','203','204')),
        named_struct('key', 'st', 'value', array('mi')),
        named_struct('key', 'k2', 'value', array('1','2'))
    ) as input_array
), break_input_set as (
    select y.col_num as y_col_num, y.col_value as y_col_value
    from input_set
    lateral view posexplode(input_set.input_array) y as col_num, col_value
), create_map as (
    select map(y_col_value.key, y_col_value.value) as final_map
    from break_input_set
)
select * from create_map;
var data = [{"key":"e","value":["253","203","204"]},{"key":"st","value":["mi"]},{"key":"k2","value":["1","2"]}];
var obj = {};
// fold each {key, value} struct into one object keyed by "key"
for (var i = 0; i < data.length; i++) {
    obj[data[i].key] = data[i].value;
}
obj will be in the required format

PostgreSQL query on text array value

I have a table where one column has an array - but stored in a text format:
mytable
id ids
-- -------
1 '[3,4]'
2 '[3,5]'
3 '[3]'
etc ...
I want to find all records that have the value 5 as an array element in the ids column.
I was trying to achieve this by using the "string to array" function and removing the [ symbols with the translate function, but couldn't find a way.
You can do this: http://www.sqlfiddle.com/#!1/5c148/12
select *
from tbl
where translate(ids, '[]','{}')::int[] && array[5];
Output:
| ID | IDS |
--------------
| 2 | [3,5] |
You can also use bool_or: http://www.sqlfiddle.com/#!1/5c148/11
with a as
(
select id, unnest(translate(ids, '[]','{}')::int[]) as elem
from tbl
)
select id
from a
group by id
having bool_or(elem = 5);
To see the original elements:
with a as
(
select id, unnest(translate(ids, '[]','{}')::int[]) as elem
from tbl
)
select id, '[' || array_to_string(array_agg(elem), ',') || ']' as ids
from a
group by id
having bool_or(elem = 5);
Output:
| ID | IDS |
--------------
| 2 | [3,5] |
PostgreSQL DDL is atomic; if it's not too late in your project, just restructure your stringly-typed array into a real array: http://www.sqlfiddle.com/#!1/6e18c/2
alter table tbl
add column id_array int[];
update tbl set id_array = translate(ids,'[]','{}')::int[];
alter table tbl drop column ids;
Query:
select *
from tbl
where id_array && array[5]
Output:
| ID | ID_ARRAY |
-----------------
| 2 | 3,5 |
You can also use contains operator: http://www.sqlfiddle.com/#!1/6e18c/6
select *
from tbl
where id_array #> array[5];
I prefer the && syntax though; it directly connotes intersection. It reflects that you are detecting whether there's an intersection between two sets (an array is a set).
http://www.postgresql.org/docs/8.2/static/functions-array.html
If you store the string representation of your arrays slightly differently, you can cast to array of integer directly:
INSERT INTO mytable
VALUES
(1, '{3,4}')
,(2, '{3,5}')
,(3, '{3}');
SELECT id, ids::int[]
FROM mytable;
Otherwise, you have to put in one more step:
SELECT (translate(ids, '[]','{}'))::int[]
FROM mytable
I would consider making the column an array type to begin with.
Either way, you can find your row like this:
SELECT id, ids
FROM (
SELECT id, ids, unnest(ids::int[]) AS elem
FROM mytable
) x
WHERE elem = 5