Impala - Handle special characters on partition column

I am currently working on a job that copies data from a staging table to the final table. The staging-table column used for partitioning the final table has multiple records containing single quotes (e.g. supplies'A, demand'A, etc.). Because of this, the Impala INSERT OVERWRITE statement fails with the following message:
Query: insert OVERWRITE rec_details ( rec_id, rec_name, rec_value ) PARTITION (rec_part) SELECT rec_id, rec_name, rec_value, rec_name FROM staging_rec_details
Query submitted at: 2017-06-12 03:23:22 (Coordinator: http://hostname:port)
Query progress can be monitored at: http://hostname:port/query_plan?query_id=ea4e14229d1c0119:a839f51500000000
WARNINGS: TableLoadingException: Failed to load metadata for table: rec_details
CAUSED BY: IllegalStateException: Invalid partition name: rec_part=-supplies'A
The DDL statements are as follows:
--DDL 1 - Staging Table
CREATE EXTERNAL TABLE staging_rec_details(
rec_id STRING,
rec_name STRING,
rec_value STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\007'
LINES TERMINATED BY '\001'
--WITH SERDEPROPERTIES ('serialization.format'='\t', 'field.delim'='\t')
STORED AS TEXTFILE
LOCATION '/staging/staging_rec_details';
--DDL 2 - Final Table
CREATE EXTERNAL TABLE rec_details(
rec_id STRING,
rec_name STRING,
rec_value STRING
)
PARTITIONED BY (rec_part STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\007'
LINES TERMINATED BY '\001'
--WITH SERDEPROPERTIES ('serialization.format'='\t', 'field.delim'='\t')
STORED AS PARQUET
LOCATION '/data/rec_details';
The following Impala statement is used for inserting the records:
--Impala SQL
INSERT OVERWRITE rec_details
(
rec_id, rec_name, rec_value
)
PARTITION (rec_part)
SELECT
rec_id, rec_name, rec_value, rec_name
FROM staging_rec_details;
How can I insert data into the final table when the partition column contains a special character such as a single quote?

The issue was resolved by replacing the special character:
-- Modified Impala SQL
INSERT OVERWRITE rec_details
(
rec_id, rec_name, rec_value
) PARTITION (rec_part)
SELECT
rec_id, rec_name, rec_value,
regexp_replace(rec_name,'\'','')
FROM staging_rec_details;
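If values such as supplies'A and suppliesA must stay distinguishable after the cleanup, a variant of the same fix (a sketch along the same lines, not taken from the original answer) substitutes a harmless character instead of dropping the quote:

-- Variant: replace the quote with an underscore instead of removing it
INSERT OVERWRITE rec_details
(
rec_id, rec_name, rec_value
) PARTITION (rec_part)
SELECT
rec_id, rec_name, rec_value,
regexp_replace(rec_name, '\'', '_')
FROM staging_rec_details;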

Related

Impala CREATE EXTERNAL TABLE and remove double quotes

I have CSV data, for example:
"Female","44","0","0","Yes","Govt_job","Urban","103.59","32.7","formerly smoked"
I put it on HDFS with hdfs dfs -put,
and now I want to create an external table from it in Impala (not in Hive).
Is there an option to do this without the double quotes?
This is what I run via impala-shell:
CREATE EXTERNAL TABLE IF NOT EXISTS test_test.test1_ext
( `gender` STRING,`age` STRING,`hypertension` STRING,`heart_disease` STRING,`ever_married` STRING,`work_type` STRING,`Residence_type` STRING,`avg_glucose_level` STRING,`bmi` STRING,`smoking_status` STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION "/user/test/tmp/test1"
Update 28.11
I managed to do it by creating the external table and then creating a VIEW that SELECTs with case/when and concat() for each column.
Impala uses the Hive metastore, so anything created in Hive is available from Impala after issuing an INVALIDATE METADATA dbname.tablename. However, to remove the quotes you need the Hive SerDe library 'org.apache.hadoop.hive.serde2.OpenCSVSerde', and this is not accessible from Impala. My suggestion would be to do the following:
Create the external table in Hive
CREATE EXTERNAL TABLE IF NOT EXISTS test_test.test1_ext
( gender STRING, age STRING, hypertension STRING, heart_disease STRING, ever_married STRING, work_type STRING, Residence_type STRING, avg_glucose_level STRING, bmi STRING, smoking_status STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = ",",
"quoteChar" = """
)
STORED AS TEXTFILE
LOCATION "/user/test/tmp/test1"
Create a managed table in Hive using CTAS
CREATE TABLE test_test.mytable AS SELECT * FROM test_test.test1_ext;
Make it available in Impala
INVALIDATE METADATA test_test.mytable;
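For completeness, the view-based workaround mentioned in the question's update can be done entirely in Impala, stripping the quotes at query time. A minimal sketch, using regexp_replace instead of the asker's case/concat approach; the view name test1_clean is made up:

-- Impala-only alternative: strip the quotes in a view over the raw table
CREATE VIEW test_test.test1_clean AS
SELECT
regexp_replace(gender, '"', '') AS gender,
regexp_replace(age, '"', '') AS age,
regexp_replace(hypertension, '"', '') AS hypertension,
regexp_replace(heart_disease, '"', '') AS heart_disease,
regexp_replace(ever_married, '"', '') AS ever_married,
regexp_replace(work_type, '"', '') AS work_type,
regexp_replace(Residence_type, '"', '') AS Residence_type,
regexp_replace(avg_glucose_level, '"', '') AS avg_glucose_level,
regexp_replace(bmi, '"', '') AS bmi,
regexp_replace(smoking_status, '"', '') AS smoking_status
FROM test_test.test1_ext;

Note that this only removes quote characters; unlike OpenCSVSerde it will not handle commas embedded inside quoted fields.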

Getting Error 10293 while inserting a row to a hive table having an array as one of the fields

I have a hive table created using the following query:
create table arraytbl (id string, model string, cost int, colors array <string>,size array <float>)
row format delimited fields terminated by ',' collection items terminated by '#';
Now, while trying to insert a row:
insert into arraytbl values
("AA","AAA",5600,colors("red","blue","green"),size(5.6,4.3));
I get the following error:
FAILED: SemanticException [Error 10293]: Unable to create temp file for insert values Expression of type TOK_FUNCTION not supported in insert/values
How can I resolve this issue?
The syntax for inserting values into a complex datatype is a bit weird, though that is just my personal opinion.
You need a dummy table to insert values into a Hive table with a complex datatype.
insert into arraytbl select "AA","AAA",5600, array("red","blue","green"), array(CAST(5.6 AS FLOAT),CAST(4.3 AS FLOAT)) from (select 'a') x;
And this is how it looks after the insert:
hive> select * from arraytbl;
OK
AA AAA 5600 ["red","blue","green"] [5.6,4.3]
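Once inserted, individual elements of the complex columns can be read back by zero-based index, e.g.:

-- Array elements are accessed by zero-based index in Hive
select id, colors[0] as first_color, `size`[1] as second_size from arraytbl;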

Getting Exception while executing the Hive Create Table statement

I am getting a "SemanticException [Error 10002]: Invalid column reference" while executing the statement below.
CREATE TABLE IF NOT EXISTS default.employee_details_3(FirstName VARCHAR(20),LastName VARCHAR(20)) COMMENT 'This is a test table mod' PARTITIONED BY(Emp_id INT,Gender VARCHAR(15),EmailAddress VARCHAR(40)) CLUSTERED BY(Emp_id,Gender,EmailAddress) INTO 14 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS SEQUENCEFILE ;
I have used the following link for reference
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
The error occurs because you are partitioning and bucketing on the same columns. You cannot use the same column in both the PARTITIONED BY and the CLUSTERED BY clause: partition columns are not stored in the table's data files, so there is nothing for the bucketing to hash on.
Use different columns and it will work.
Try the query below:
CREATE TABLE IF NOT EXISTS default.employee_details_3
(FirstName VARCHAR(20),
LastName VARCHAR(20)) COMMENT 'This is a test table mod'
PARTITIONED BY(Emp_id INT,Gender VARCHAR(15),EmailAddress VARCHAR(40))
CLUSTERED BY(FirstName) INTO 14 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS SEQUENCEFILE ;
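With that definition, rows would then be loaded with a dynamic-partition insert, roughly as follows (a sketch; source_table is a stand-in name for wherever the data actually lives):

-- Dynamic partitioning must be enabled for inserts like this
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Partition columns go last in the SELECT, in PARTITION-clause order
INSERT INTO TABLE default.employee_details_3
PARTITION (Emp_id, Gender, EmailAddress)
SELECT FirstName, LastName, Emp_id, Gender, EmailAddress
FROM source_table;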

Insert data into hive table without delimiters

I want 10 words in one column and another 10 words in another column. How do I insert data into a Hive table with no specified delimiters, using UDFs?
CREATE TABLE employees_stg (emplid STRING, name STRING, age STRING, salary STRING, dept STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.{4})(.{35})(.{3})(.{11})(.{4})", --Length of each column specified between braces "({})"
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s" --Output in string format
)
LOCATION '/path/to/input/employees_stg';
LOAD DATA INPATH '/path/to/sample_file.txt' INTO TABLE employees_stg;
SELECT * FROM employees_stg;
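For reference, each input line is carved up purely by character position; the regex above maps the columns to these fixed-width offsets:

-- emplid : characters 1-4   (.{4})
-- name   : characters 5-39  (.{35})
-- age    : characters 40-42 (.{3})
-- salary : characters 43-53 (.{11})
-- dept   : characters 54-57 (.{4})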

How do you add Data to an Existing Hive Metastore?

I have multiple subdirectories in S3 that contain .orc files. I'm trying to create a hive metastore so I can query the data with Presto / Hive, etc. The data is poorly structured (no consistent delimiter, ugly characters, etc.). Here's a scrubbed sample:
1488736466 199.199.199.199 0_b.www.sphericalcow.com.f9b1.qk-g6m6z24tdr.v4.url.name.com TXT IN: NXDOMAIN/0/143
1488736466 6.6.5.4 0.3399.186472.4306.6668.638.cb5a.names-things.update.url.name.com TXT IN: NOERROR/3/306 0\009253\009http://az.blargi.ng/%D3%AB%EF%BF%BD%EF%BF%BD/\009 0\009253\009http://casinoroyal.online/\009 0\009253\009http://d2njbfxlilvpsq.cloudfront.net/b_zq_ym_bangvideo/bangvideo0826.apk\009
I was able to create a table pointing to one of the subdirectories using a serde regex and the fields are parsing properly, but as far as I can tell I can only load one subfolder at a time.
How does one add more data to an existing hive metastore?
Here's an example of my hive metastore create statement with the regex serde bit:
DROP TABLE IF EXISTS test;
CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
COMMENT 'fill all the tables with the datas.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (\\S*) (.*)",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS ORC
LOCATION 's3://path/to/one/of/10/folders/'
tblproperties ("orc.compress" = "SNAPPY", "skip.header.line.count"="2");
select * from test limit 10;
I realize there is probably a very simple solution. I tried INSERT INTO in place of CREATE EXTERNAL TABLE, but it understandably complains about the input, and I looked in both the Hive and SerDe documentation but was unable to find a reference to adding data to an existing store.
A possible solution is to use partitions.
CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
partitioned by (mypartcol string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (\\S*) (.*)"
)
LOCATION 's3://whatever/as/long/as/it/is/empty'
tblproperties ("skip.header.line.count"="2");
alter table test add partition (mypartcol='folder 1') location 's3://path/to/1st/of/10/folders/';
alter table test add partition (mypartcol='folder 2') location 's3://path/to/2nd/of/10/folders/';
.
.
.
alter table test add partition (mypartcol='folder 10') location 's3://path/to/10th/of/10/folders/';
For @TheProletariat (the OP):
It seems there is no need for RegexSerDe since the columns are delimited by space (' ').
Note the use of tblproperties ("serialization.last.column.takes.rest"="true")
create external table test
(
field1 bigint
,field2 string
,field3 string
,field4 string
)
row format delimited
fields terminated by ' '
tblproperties ("serialization.last.column.takes.rest"="true")
;
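To illustrate with the first scrubbed sample line from the question: field1 gets 1488736466, field2 gets 199.199.199.199, field3 gets the long hostname, and, thanks to serialization.last.column.takes.rest, field4 gets everything that remains (TXT IN: NXDOMAIN/0/143, spaces included).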