How can I insert a key-value pair into a hive map? - hive

Based on the following tutorial, Hive has a map type. However, there does not seem to be a documented way to insert a new key-value pair into a Hive map, via a SELECT with some UDF or built-in function. Is this possible?
As a clarification, suppose I have a table called foo with a single column, typed map, named column_containing_map.
Now I want to create a new table that also has one column, typed map, but I want each map (which is contained within a single column) to have an additional key-value pair.
A query might look like this:
CREATE TABLE IF NOT EXISTS bar AS
SELECT ADD_TO_MAP(column_containing_map, "NewKey", "NewValue")
FROM foo;
Then the table bar would contain the same maps as table foo except each map in bar would have an additional key-value pair.

Consider you have a student table which contains student marks in various subjects.
hive> desc student;
id string
name string
class string
marks map<string,string>
You can insert values directly to table as below.
INSERT INTO TABLE student
SELECT STACK(1,
'100','Sekar','Mathematics',map("Mathematics","78")
)
FROM empinfo
LIMIT 1;
Here 'empinfo' table can be any table in your database.
And Results are:
100 Sekar Mathematics {"Mathematics":"78"}

for key-value pairs, you can insert like following sql:
INSERT INTO TABLE student values( "id","name",'class',
map("key1","value1","key2","value2","key3","value3","key4","value4") )
please pay attention to sequence of the values in map.

I think the combine function from brickhouse will do what you need. Slightly modifying the query in your original question, it would look something like this
SELECT
combine(column_containing_map, str_to_map("NewKey:NewValue"))
FROM
foo;
The limitation with this example is that str_to_map creates a MAP< STRING,STRING >. If your hive map contains other primitive types for the keys or values, this won't work.

I'm sorry, I didn't quite get this. What do you mean by with some UDF or built-in function?If you wish to insert into a table which has a Map field it's similar to any other datatype. For example :
I have a table called complex1, created like this :
CREATE TABLE complex1(c1 array<string>, c2 map<int,string> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '-' MAP KEYS TERMINATED BY ':' LINES TERMINATED BY '\n';
I also have a file, called com.txt, which contains this :
Mohammad-Tariq,007:Bond
Now, i'll load this data into the above created table :
load data inpath '/inputs/com.txt' into table complex1;
So this table contains :
select * from complex1;
OK
["Mohammad","Tariq"] {7:"Bond"}
Time taken: 0.062 seconds
I have one more table, called complex2 :
CREATE TABLE complex2(c1 map<int,string>);
Now, to select data from complex1 and insert into complex2 i'll do this :
insert into table complex2 select c2 from complex1;
Scan the table to cross check :
select * from complex2;
OK
{7:"Bond"}
Time taken: 0.062 seconds
HTH

Related

Table can't be queried after change column position

When querying table using "select * from t2p", the reponse is as blow. I think I have missed some concepts, please help me out.
Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.hive.serde2.lazy.objectinspector.LazyMapObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
Step1, create table
create table t2p(id int, name string, score map<string,double>)
partitioned by (class int)
row format delimited
fields terminated by ','
collection items terminated by '\\;'
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;
Step2, insert data like
1,zs,math:90.0;english:92.0
2,ls,chinese:89.0;math:80.0
3,xm,geo:87.0;math:80.0
4,lh,chinese:89.0;english:81.0
5,xw,physics:91v;english:81.0
Step3, add another column
alter table t2p add columns (school string);
Step4, change column's order
alter table t2p change school school string after name;
Step5, do query and get error as mentioned above.
select * from t2p;
This is an obvious error.
Your command alter table t2p change school school string after name; changes metadata only. If you are moving columns, the data must already match the new schema or you must change it to match by some other means.
Which means, the map column has to be matching to the new column. In other words, if you want to move column around, make sure new column and existing data types are same.
I did a simple experiment with int data type. It worked because data type are not hugely different but you can see metadata changed but data stayed same.
create table t2p(id int, name string, score int)
partitioned by (class int)
stored as textfile;
insert into t2p partition(class=1) select 100,'dum', 199;
alter table t2p add columns (school string);
alter table t2p change school school string after name;
MSCK REPAIR TABLE t2p ;
select * from t2p;
You can see new column school is mapped to position 3( defined as INT).
Solution - You can do this but make sure new structure+data type is compatible to old structure.

BigQuery insert values AS, assume nulls for missing columns

Imagine there is a table with 1000 columns.
I want to add a row with values for 20 columns and assume NULLs for the rest.
INSERT VALUES syntax can be used for that:
INSERT INTO `tbl` (
date,
p,
... # 18 more names
)
VALUES(
DATE('2020-02-01'),
'p3',
... # 18 more values
)
The problem with it is that it is hard to tell which value corresponds to which column. And if you need to change/comment out some value then you have to make edits in two places.
INSERT SELECT syntax also can be used:
INSERT INTO `tbl`
SELECT
DATE('2020-02-01') AS date,
'p3' AS p,
... # 18 more value AS column
... # 980 more NULL AS column
Then if I need to comment out some column just one line has to be commented out.
But obviously having to set 980 NULLs is an inconvenience.
What is the way to combine both approaches? To achieve something like:
INSERT INTO `tbl`
SELECT
DATE('2020-02-01') AS date,
'p3' AS p,
... # 18 more value AS column
The query above doesn't work, the error is Inserted row has wrong column count; Has 20, expected 1000.
Your first version is really the only one you should ever be using for SQL inserts. It ensures that every target column is explicitly mentioned, and is unambiguous with regard to where the literals in the VALUES clause should go. You can use the version which does not explicitly mention column names. At first, it might seem that you are saving yourself some code. But realize that there is a column list which will be used, and it is the list of all the table's columns, in whatever their positions from definition are. Your code might work, but appreciate that any addition/removal of a column, or changing of column order, can totally break your insert script. For this reason, most will strongly advocate for the first version.
You can try following solution, it is combination of above 2 process which you have highlighted in case study:-
INSERT INTO `tbl` (date, p, 18 other coll names)
SELECT
DATE('2020-02-01') AS date,
'p3' AS p,
... # 18 more value AS column
Couple of things you should consider here are :-
Other 980 Columns should ne Nullable, that means it should hold NULL values.
All 18 columns in Insert line and Select should be in same order so that data will be inserted in same correct order.
To Avoid any confusion, try to use Alease in Select Query same as Insert Table Column name. It will remove any ambiguity.
Hopefully it will work for you.
In BigQuery, the best way to do what you're describing is to first load to a staging table. I'll assume you can get the values you want to insert into JSON format with keys that correspond to the target column names.
values.json
{"date": "2020-01-01", "p": "p3", "column": "value", ... }
Then generate a schema file for the target table and save it locally
bq show --schema project:dataset.tbl > schema.json
Load the new data to the staging table using the target schema. This gives you "named" null values for each column present in the target schema but missing from your json, bypassing the need to write them out.
bq load --replace --source_format=NEWLINE_DELIMIITED_JSON \
project:dataset.stg_tbl values.json schema.json
Now the insert select statement works every time
insert into `project:dataset.tbl`
select * from `project:dataset.stg_tbl`
Not a pure SQL solution but I managed this by loading my staging table with data then running something like:
from google.cloud import bigquery
client = bigquery.Client()
table1 = client.get_table(f"{project_id}.{dataset_name}.table1")
table1_col_map = {field.name: field for field in table1.schema}
table2 = client.get_table(f"{project_id}.{dataset_name}.table2")
table2_col_map = {field.name: field for field in table2.schema}
combined_schema = {**table2_col_map, **table1_col_map}
table1.schema = list(combined_schema.values())
client.update_table(table1_cols, ["schema"])
Explanation:
This will retrieve the schemas of both, convert their schemas into a dictionary with key as column name and value as the actual field info from the sdk. Then both are combined with dictionary unpacking (the order of unpacking determines which table's columns have precedence when a column is common between them. Finally the combined schema is assigned back to table 1 and used to update the table, adding the missing columns with nulls.

Trying to insert into Table A and link Table B and C but add to all if not exist

I have 4 tables:
Table A:
LogID (unique identifier),
UserID (bigint),
LogDate (date/time),
LogEventID (int),
IPID (varchar(36)),
UserAgentID (varchar(36))
Table B:
IPID (unique identifier),
IPAddress (varchar(255))
Table C:
UserAgentID (unique identifier),
UserAgent (varchar(255))
Table D:
LogEventID (int),
LogEvent (varchar(255))
I am trying to write the to Table A but need to check Table B, Table C and Table D contain data so I can link them. If they don’t contain any data, I would need to create some. Some of the tables may contain data, sometimes none of them may.
Pretty much everything, really struggling
first, you do a insert into table B, C, D WHERE NOT EXISTS
example
INSERT INTO TableB (IPID, IPAddress)
SELECT #IPPD, #IPAddress
WHERE NOT EXISTS
(
SELECT *
FROM TableB x
WHERE x.IPID = #IPID
)
then you insert into table A
INSERT INTO TableA ( . . . )
SELECT . . .
SQL Server doesn't let you modify multiple tables in a single statement, so you cannot do this with a single statement.
What can you do?
You can wrap the multiple statements in a single transaction, if your goal is to modify the database only once.
You can write the multiple statements in stored procedure.
What you really probably want is a view with insert triggers on the view. You can define a view that is the join of the tables with the values from the reference tables. An insert trigger can then check if the values exist and replace them with the appropriate ids. Or, insert into the appropriate table.
The third option does exactly what you want. I find that it is a bit of trouble to maintain triggers, so for an application, I would prefer wrapping the logic in a stored procedure.

Hive - Create Table statement with 'select query' and 'fields terminated by' commands

I want to create a table in Hive using a select statement which takes a subset of a data from another table. I used the following query to do so :
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada';
When I looked into the HDFS location of this table, there are no field separators.
But I need to create a table with filtered data from another table along with a field separator. For example I am trying to do something like :
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada'
ROW FORMAT SERDE
FIELDS TERMINATED BY '|';
This is not working though. I know the alternate way is to create a table structure with field names and the "FIELDS TERMINATED BY '|'" command and then load the data.
But is there any other way to combine the two into a single query that enables me to create a table with filtered data from another table and also with a field separator ?
Put row format delimited .. in front of AS select
do it like this
Change the query to yours
hive> CREATE TABLE ttt row format delimited fields terminated by '|' AS select *,count(1) from t1 group by id ,name ;
Query ID = root_20180702153737_37802c0e-525a-4b00-b8ec-9fac4a6d895b
here is the result
[root#hadoop1 ~]# hadoop fs -cat /user/hive/warehouse/ttt/**
2|\N|1
3|\N|1
4|\N|1
As you can see in the documentation, when using the CTAS (Create Table As Select) statement, the ROW FORMAT statement (in fact, all the settings related to the new table) goes before the SELECT statement.

Dynamic partition cannot be the parent of a static partition

I'm trying to aggregate data from 1 table (whose data is re-calculated monthly) in another table (holding the same data but for all time) in Hive. However, whenever I try to combine the data, I get the following error:
FAILED: SemanticException [Error 10094]: Line 3:74 Dynamic partition cannot be the parent of a static partition 'category'
The code I'm using to create the tables is below:
create table my_data_by_category (views int, submissions int)
partitioned by (category string)
row format delimited
fields terminated by ','
escaped by '\\'
location '${hiveconf:OUTPUT}/${hiveconf:DATE_DIR}/my_data_by_category';
create table if not exists my_data_lifetime_total_by_category
like my_data_by_category
row format delimited
fields terminated by ','
escaped by '\\'
stored as textfile
location '${hiveconf:OUTPUT}/lifetime-totals/my_data_by_category';
The code I'm using to populate the tables is below:
insert overwrite table my_data_by_category partition(category)
select mdcc.col1, mdcc2.col2, pcc.category
from my_data_col1_counts_by_category mdcc
left outer join my_data_col2_counts_by_category mdcc2 where mdcc.category = mdcc2.category
group by mdcc.category, mdcc.col1, mdcc2.col2;
insert overwrite table my_data_lifetime_total_by_category partition(category)
select mdltc.col1 + mdc.col1 as col1, mdltc.col2 + mdc.col2, mdc.category
from my_data_lifetime_total_by_category mdltc
full outer join my_data_by_category mdc on mdltc.category = mdc.category
where mdltc.col1 is not null and mdltc.col2 is not null;
The frustrating part is that I have this data partitioned on another column and repeating this same process with that partition works without a problem. I've tried Googling the "Dynamic partition cannot be the parent of a static partition" error message, but I can't find any guidance on what causes this or how it can be fixed. I'm pretty sure that there's an issue with a way that 1 or more of my tables is set up, but I can't see what. What's causing this error and what I can I do resolve it?
There is no partitioned by clause in this script. As you are trying to insert into non partitioned table using partition in insert statement, it is failing.
create table if not exists my_data_lifetime_total_by_category
like my_data_by_category
row format delimited
fields terminated by ','
escaped by '\\'
stored as textfile
location '${hiveconf:OUTPUT}/lifetime-totals/my_data_by_category';
No. You don't need to add partition clause.
You are doing group by mdcc.category in insert overwrite table my_data_by_category partition(category)..... but you are not using any UDAF.
Are you sure you can do this?
I think that if you change your second create statement to:
create table if not exists my_data_lifetime_total_by_category
partitioned by (category string)
row format delimited
fields terminated by ','
escaped by '\\'
stored as textfile
location '${hiveconf:OUTPUT}/lifetime-totals/my_data_by_category';
you should then be free of errors