Merging records by 1+ common elements - apache-spark-sql

I have a hive table with the following schema:
int key1 # this is unique
array<int> key2_list
Now I want to merge records if their key2_lists have any common element(s). For example, if record A has (10, [1,3,5]) and record B has (12, [1,2,4]), I want to merge them into ([10,12], [1,2,3,4,5]) or just ([1,2,3,4,5]).
If it is easier, the input table can have the following schema instead:
int key1
int key2
I'd prefer to do this through Hive or SparkSQL.

For your second table definition you can run something like
select collect_list(key1) from yourtable group by key2
Or for your first table definition
select collect_list(key1) from
(select key1, mykey FROM yourtable LATERAL VIEW explode(key2_list) a AS mykey) t
group by mykey
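Note that group by key2 merges only records sharing that exact key value; if the merge should be transitive (A overlaps B, B overlaps C, so all three merge), this is a connected-components problem. A minimal union-find sketch in Python, assuming the (key1, key2_list) pairs fit in memory after collecting them from the table:

```python
from collections import defaultdict

def merge_records(records):
    """Merge (key1, key2_list) records whose lists share any element,
    transitively, using union-find over the key2 values."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Link all key2 values that appear in the same record
    for _, key2_list in records:
        for k in key2_list:
            parent.setdefault(k, k)
        for k in key2_list[1:]:
            union(key2_list[0], k)

    # Group records by the component root of their first key2 value
    groups = defaultdict(lambda: (set(), set()))
    for key1, key2_list in records:
        k1s, k2s = groups[find(key2_list[0])]
        k1s.add(key1)
        k2s.update(key2_list)
    return [(sorted(k1s), sorted(k2s)) for k1s, k2s in groups.values()]

print(merge_records([(10, [1, 3, 5]), (12, [1, 2, 4])]))
# → [([10, 12], [1, 2, 3, 4, 5])]
```

At scale the same idea is usually expressed as a connected-components job (e.g. GraphFrames' connectedComponents in Spark) rather than in local memory.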

Related

Traverse a nested json object and insert the keys and values into two related tables

I'm passing the following json structure to my procedure:
{questA: [[a1, a2], [a3, a4]], questB: [[b1, b2], [b2, b4]...]}
I would like to go over all the 'quest' keys (questA, questB...) and insert each key name into one table and its value sets into another table as multiple rows, so each set (a1, a2) gets its own row plus a foreign key field referencing its parent quest key.
quest
-------
id
key
questValues
-------------
id
val1
val2
quest_id
foreign key (quest_id) references quest(id)
I've tried something like:
FOR key, val IN SELECT * FROM jsonb_each_text(myJson) LOOP
...
END LOOP;
But it loops over everything so the val arrays are just plain text now. I thought about chaining selects with one of the json literal functions but I'm unsure about the syntax.
You can indeed do this by chaining the output of various JSON functions:
with input (parameter) as (
    values ('{"questA": [["a1", "a2"], ["a3", "a4"]], "questB": [["b1", "b2"], ["b2", "b4"]]}'::jsonb)
), elements as (
    select j.quest, k.answer
    from input i
        cross join lateral jsonb_each(i.parameter) as j(quest, vals)
        cross join lateral jsonb_array_elements(j.vals) as k(answer)
), new_quests as (
    insert into quest ("key")
    select distinct quest
    from elements
    returning *
)
insert into quest_values (val1, val2, quest_id)
select e.answer ->> 0 as val1,
       e.answer ->> 1 as val2,
       nq.id as quest_id
from new_quests nq
    join elements e on e.quest = nq.key;
The first step ("elements") turns the JSON value into rows that can be used as the source of the INSERT statements. It returns this:
quest | answer
-------+-------------
questA | ["a1", "a2"]
questA | ["a3", "a4"]
questB | ["b1", "b2"]
questB | ["b2", "b4"]
The next step inserts the unique values of the quest column into the quest table and returns the generated IDs.
And the final statement joins the generated IDs with the rows from the first step and extracts the two array elements as two values. It uses that query as the source to insert into the quest_values table.
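The flattening done by jsonb_each plus jsonb_array_elements in the elements step can be mimicked in plain Python to sanity-check the intermediate rows, using the same sample payload:

```python
import json

payload = '{"questA": [["a1", "a2"], ["a3", "a4"]], "questB": [["b1", "b2"], ["b2", "b4"]]}'

# jsonb_each + jsonb_array_elements: one output row per (quest, answer) pair
rows = [(quest, answer)
        for quest, answers in json.loads(payload).items()
        for answer in answers]

for quest, answer in rows:
    print(quest, answer)
# questA ['a1', 'a2']
# questA ['a3', 'a4']
# questB ['b1', 'b2']
# questB ['b2', 'b4']
```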
Inside a procedure you obviously don't need the part that generates the sample data, so it would look something like this:
with elements as (
    select j.quest, k.answer
    from jsonb_each(the_parameter) as j(quest, vals)
        cross join lateral jsonb_array_elements(j.vals) as k(answer)
), new_quests as (
    insert into quest ("key")
    select distinct quest
    from elements
    returning *
)
insert into quest_values (val1, val2, quest_id)
select e.answer ->> 0 as val1,
       e.answer ->> 1 as val2,
       nq.id as quest_id
from new_quests nq
    join elements e on e.quest = nq.key;
Where the_parameter is the JSONB parameter passed to your procedure.
Online example: https://rextester.com/NBJIK44025

Clone a record, then use its auto increment id for further operations

Update:
After narrowing down the code it seems that the line
INSERT INTO table1 TABLE table1_temp RETURNING id
is causing the issue. Any tips what is wrong with this?
Original question:
table1 has many columns (I don't care which) and it has an auto increment primary key (id). This is what I need to do and how I'm trying:
First, I'd like to duplicate a record in table1.
BEGIN;
CREATE TEMP TABLE table1_temp ON COMMIT DROP AS
SELECT * FROM table1 WHERE id = <some integer>;
ALTER TABLE table1_temp DROP COLUMN id;
WITH generated_id AS (
INSERT INTO table1 TABLE table1_temp RETURNING id
)
Then, perform an insert to some_table where I need to use the generated id of the copy that was created in table1.
INSERT INTO some_table (something, the_id_into_this)
VALUES ('some value', (SELECT id FROM generated_id));
Then get some data from yet_another_table (columns: somestuff, id_here) and use this and the id for an insert into that same table.
INSERT INTO yet_another_table
(SELECT somestuff,
(SELECT id FROM generated_id) AS id_here
FROM yet_another_table
WHERE id_here = <some integer>)
Finally, I need to return the id so I can use it in my app...
RETURNING id_here AS id;
COMMIT;
Am I on the right path implementing this? When running the query, I get the following error:
column "id" is of type integer but expression is of type character
varying HINT: You will need to rewrite or cast the expression.
It doesn't tell me the line number where it occurs, and I have no idea what might cause this.
INSERT INTO table1 TABLE table1_temp
You cannot do that because table1_temp has a different set of columns (you dropped the id column).
You need to specify columns explicitly (all but id column):
INSERT INTO table1(col1, col2, ...) TABLE table1_temp
I found a simple solution for cloning a record with an auto increment id that doesn't require you to specify any other columns of the table:
BEGIN;
CREATE TEMP TABLE table1_temp ON COMMIT DROP AS
SELECT * FROM table1 WHERE id = #;
UPDATE table1_temp SET id = nextval('table1_seq');
INSERT INTO table1 TABLE table1_temp;
COMMIT;
And for the CTE part of the question, here is how you can reuse a returned value in multiple subsequent queries by chaining WITH statements:
WITH generated_id AS (
INSERT INTO ... RETURNING id
), _ AS (
QUERY1 ... SELECT id FROM generated_id ...
), __ AS (
QUERY2 ... SELECT id FROM generated_id ...
...
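The clone-then-reuse-the-id pattern itself can be sketched outside PostgreSQL; here with Python's sqlite3, where cursor.lastrowid stands in for RETURNING id (the table and column names are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY AUTOINCREMENT, a TEXT, b INTEGER);
    CREATE TABLE some_table (something TEXT, the_id INTEGER REFERENCES table1(id));
    INSERT INTO table1 (a, b) VALUES ('original', 42);
""")

# Clone row 1, listing every column except the auto-increment id
cur = conn.execute(
    "INSERT INTO table1 (a, b) SELECT a, b FROM table1 WHERE id = ?", (1,))
new_id = cur.lastrowid  # the generated id of the copy

# Reuse the generated id in a follow-up insert
conn.execute("INSERT INTO some_table VALUES ('some value', ?)", (new_id,))
print(new_id)  # → 2
```

The key point is the same as in the accepted answer: the clone INSERT must name every column except the auto-increment one, so the database generates a fresh id you can capture and reuse.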

Hive - getting the column names count of a table

How can I get the column count of a Hive table using HQL? I know we can use describe tablename to get the column names. How do we get the count?
create table mytable(i int,str string,dt date, ai array<int>,strct struct<k:int,j:int>);
select count(*)
from (select transform ('')
using 'hive -e "desc mytable"'
as col_name,data_type,comment
) t
;
5
Some additional playing around:
create table mytable (id int,first_name string,last_name string);
insert into mytable values (1,'Dudu',null);
select size(array(*)) from mytable limit 1;
This is not bulletproof since not all combinations of column types can be combined into an array.
It also requires that the table contain at least one row.
Here is a more complex but more robust solution with respect to column types; it also requires that the table contain at least one row:
select size(str_to_map(val)) from (select transform (struct(*)) using 'sed -r "s/.(.*)./\1/"' as val from mytable) t;
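The transform trick works because desc prints one line per column, so counting those lines gives the column count. A Python sketch of that counting step, run against hardcoded sample output rather than a live hive CLI:

```python
# Sample of what `hive -e "desc mytable"` prints for the table above:
# one line per column, tab-separated name / type / comment
desc_output = """\
i\tint\t
str\tstring\t
dt\tdate\t
ai\tarray<int>\t
strct\tstruct<k:int,j:int>\t
"""

column_count = len([line for line in desc_output.splitlines() if line.strip()])
print(column_count)  # → 5
```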

Merging two tables into one with the same column names

I use this command to merge 2 tables into one:
CREATE TABLE table1 AS
SELECT name, sum(cnt)
FROM (SELECT * FROM table2 UNION ALL SELECT * FROM table3) X
GROUP BY name
ORDER BY 1;
table2 and table3 are tables with columns named name and cnt, but the result table (table1) has the columns name and sum.
The question is how to change the command so that the result table will have the columns name and cnt?
Have you tried this (note the AS cnt)?
CREATE TABLE table1 AS SELECT name,sum(cnt) AS cnt
FROM ...
In the absence of an explicit name, the output of a function inherits the basic function name in Postgres. You can use a column alias in the SELECT list to fix this - like #hennes already supplied.
If you need to inherit all original columns with name and type (and possibly more) you can also create the table with a separate command:
To copy columns with names and data types only, still use CREATE TABLE AS, but add LIMIT 0:
CREATE TABLE table1 AS
TABLE table2 LIMIT 0; -- "TABLE" is just shorthand for "SELECT * FROM"
To copy (per documentation):
all column names, their data types, and their not-null constraints:
CREATE TABLE table1 (LIKE table2);
... and optionally also defaults, constraints, indexes, comments and storage settings:
CREATE TABLE table1 (LIKE table2 INCLUDING ALL);
... or, for instance, just defaults and constraints:
CREATE TABLE table1 (LIKE table2 INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
Then INSERT:
INSERT INTO table1 (name, cnt)
SELECT ... -- column names are ignored
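The effect of the AS cnt alias can be checked with any SQL engine; a quick sqlite3 sketch (Postgres would name the unaliased aggregate column sum, sqlite calls it sum(cnt), but the alias behaves the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table2 (name TEXT, cnt INTEGER);
    CREATE TABLE table3 (name TEXT, cnt INTEGER);
    INSERT INTO table2 VALUES ('a', 1);
    INSERT INTO table3 VALUES ('a', 2);
""")

# With AS cnt, the created table keeps the original column name
conn.execute("""
    CREATE TABLE table1 AS
    SELECT name, sum(cnt) AS cnt
    FROM (SELECT * FROM table2 UNION ALL SELECT * FROM table3) X
    GROUP BY name
    ORDER BY 1
""")
columns = [row[1] for row in conn.execute("PRAGMA table_info(table1)")]
print(columns)  # → ['name', 'cnt']
```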

update table with 4 columns specified, but only 2 columns are available

I have one table called test, which has 4 columns:
id INT
v_out INT
v_in INT
label CHARACTER
I'm trying to update the table with the following query:
String sql = "
update
test
set
v_out = temp.outV
, v_in = temp.inV
, label = temp.label
from (
values(
(1,234,235,'abc')
,(2,234,5585,'def')
)
) as temp (e_id, outV, inV, label)
where
id = temp.e_id
";
When I execute it, I got the error:
org.postgresql.util.PSQLException: ERROR:
table "temp" has 2 columns available but 4 columns specified
What's the problem, and how can I solve it?
The values for the values clause must not be enclosed in parentheses:
values (
(1,234,235,'abc'), (2,234,5585,'def')
)
creates a single row with two columns, each of which is an anonymous "record" with 4 fields.
What you want is:
from (
values
(1,234,235,'abc'),
(2,234,5585,'def')
) as temp (e_id, outV, inV, label)
SQLFiddle showing the difference: http://sqlfiddle.com/#!15/d41d8/2763
This behavior is documented, but that is quite hard to find:
http://www.postgresql.org/docs/current/static/rowtypes.html#AEN7362
It's essentially the same thing as select (col1, col2) from some_table vs. select col1, col2 from some_table. The first one returns one column with an anonymous composite type that has two fields. The second one returns two columns from the table.
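A Python analogy for the two shapes, with tuples standing in for rows and records:

```python
# values ((1,234,235,'abc'), (2,234,5585,'def'))  -- one row, two record columns
one_row = [((1, 234, 235, 'abc'), (2, 234, 5585, 'def'))]

# values (1,234,235,'abc'), (2,234,5585,'def')    -- two rows, four columns each
two_rows = [(1, 234, 235, 'abc'), (2, 234, 5585, 'def')]

print(len(one_row), len(one_row[0]))    # → 1 2
print(len(two_rows), len(two_rows[0]))  # → 2 4
```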