Hive: Create table from text file. Handle special character

I have a data file in txt format which I need to load into a Hive table.
I created a table for this file and then used the LOAD command to insert the data, as shown below:
CREATE TABLE dev.table
(`Date` string, -- backticks: date is a reserved keyword in recent Hive versions
c1 string,
c2 string,
c3 string,
c4 string,
c5 string,
c6 string,
c7 string,
c8 string)
row format delimited fields terminated by '\t' stored as textfile;
LOAD DATA LOCAL INPATH 'filepath.txt' OVERWRITE INTO TABLE dev.table;
The data gets inserted into the table, but a special character appears in each column. Below is sample data.
Please help me get rid of this special character.

Please check the data with this query, since what you are selecting lives in dev.table:
select `Date` from dev.table limit 10;
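The stray character in this situation is very often a Windows carriage return (CR) at the end of each line, since Hive loads text files byte for byte; a UTF-8 BOM at the start of the file is the other usual suspect. A quick shell sketch, assuming CRs are the culprit (the file contents below are made up for illustration):

```shell
# Simulate a tab-delimited file with Windows CRLF line endings:
printf '2020-01-01\tv1\r\n2020-01-02\tv2\r\n' > filepath.txt

# cat -v makes the carriage returns visible as ^M at line ends:
cat -v filepath.txt

# Strip the CRs, then point LOAD DATA LOCAL INPATH at the cleaned file:
tr -d '\r' < filepath.txt > filepath_clean.txt
cat -v filepath_clean.txt
```

If `cat -v` shows `^M` at the end of each line, the last column of each row will carry that character into Hive unless the file is cleaned first.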

Related

How to load a "|" delimited file into hive without creating a hive table with "ROW FORMAT DELIMITER"

I am trying to load a local file with "|"-delimited values into a Hive table. We usually create the table with ROW FORMAT DELIMITED FIELDS TERMINATED BY '|', but I want to create a plain table and then load the data. What is the right syntax I need to use? Please suggest.
Working Code
CREATE TABLE IF NOT EXISTS testdb.TEST_DATA_TABLE
( column1 string,
  column2 bigint
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
LOAD DATA LOCAL INPATH 'xxxxx.csv' INTO TABLE testdb.TEST_DATA_TABLE;
But I want to do :
CREATE TABLE IF NOT EXISTS testdb.TEST_DATA_TABLE
( column1 string,
  column2 bigint
);
LOAD DATA LOCAL INPATH 'xxxxx.csv' INTO TABLE testdb.TEST_DATA_TABLE FIELDS TERMINATED BY '|';
Reason being: if I create the table the first way, HDFS will store the table's data with the "|" delimiter.
With the second DDL you have provided, Hive will create the table in its default format (textfile, ORC, Parquet, etc., as per your configuration) with Ctrl-A (\001), Hive's default field delimiter.
If you want the file in HDFS to be pipe-delimited, then you need to create the Hive table as a text table with the '|' delimiter.
(or)
You can also write the result of a SELECT query to a local or HDFS path with a pipe delimiter.
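A sketch of that second option, following the same pattern the last question on this page uses (the output directory is a placeholder):

```sql
-- Export the table's rows pipe-delimited to an HDFS directory;
-- add LOCAL after OVERWRITE to write to the local filesystem instead.
INSERT OVERWRITE DIRECTORY '/tmp/pipe_out'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
SELECT * FROM testdb.TEST_DATA_TABLE;
```

This leaves the table's own storage untouched and only formats the exported copy.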

Can we load text file separated by :: into hive table?

Is there a way to load a simple text file whose fields are separated by "::" into a Hive table, other than replacing the "::" with "," and then loading it?
Replacing the "::" is quick when the text file is small, but what if it contains millions of records?
Try creating the Hive table with a regex SerDe.
Example:
I had a file with the following text in it:
i::90
w::99
Create Hive table:
hive> create external table default.i
(Id STRING,
Name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '(.*?)::(.*)')
STORED AS TEXTFILE;
Select from Hive table:
hive> select * from i;
+-------+---------+--+
| i.id | i.name |
+-------+---------+--+
| i | 90 |
| w | 99 |
+-------+---------+--+
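To see how that pattern splits each line, here is a small Python check of the same expression (Python's `re` behaves the same as the SerDe's Java regex for this pattern); the lazy first group is what makes multi-`::` lines split on the first occurrence:

```python
import re

# The SerDe's pattern: a non-greedy first group, so the match splits
# on the FIRST "::" and the greedy second group takes the rest.
pattern = re.compile(r'(.*?)::(.*)')

print(pattern.match('i::90').groups())    # ('i', '90')
print(pattern.match('a::b::c').groups())  # ('a', 'b::c')
```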
If you want to skip a header line, use this syntax:
hive> create external table default.i
(Id STRING,
Name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '(.*?)::(.*)')
STORED AS TEXTFILE
tblproperties ('skip.header.line.count'='1');
UPDATE:
Check whether there are any older files in your table's location; if there are, delete them (if you don't want them).
1. Create the Hive table as:
create external table <db_name>.<table_name>
(col1 STRING,
col2 STRING,
col3 string,
col4 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '(.*?)::(.*?)::(.*?)::(.*)')
STORED AS TEXTFILE;
2. Then run:
load data local inpath 'Source path' overwrite into table <db_name>.<table_name>;

error loading csv into hive table

I'm trying to load a tab-delimited file into a Hive table, and I want to skip the first row because it contains column names. I'm running the code below, but I'm getting the error shown underneath it. Does anyone see what the issue is?
Code:
set hive.exec.compress.output=false;
set hive.mapred.mode=nonstrict;
-- region to state mapping
DROP TABLE IF EXISTS StateRegion;
CREATE TEMPORARY TABLE StateRegion (Zip_Code int,
Place_Name string,
State string,
State_Abbreviate string,
County string,
Latitude float,
Longitude float,
ZIP_CD int,
District_NM string,
Region_NM string)
row format delimited fields terminated by '\t'
tblproperties("skip.header.line.count"="1");
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'StateRegion'
OVERWRITE INTO TABLE StateRegion;
--test Export
INSERT OVERWRITE LOCAL DIRECTORY './StateRegionTest/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from StateRegion;
Error:
FAILED: ParseException line 2:0 cannot recognize input near 'STORED' 'AS' 'TEXTFILE'
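The stray semicolon after the TBLPROPERTIES line terminates the CREATE statement early, so `STORED AS TEXTFILE;` is parsed as a new statement and the parser fails on `STORED`. The clause order is also reversed: STORED AS must come before TBLPROPERTIES. A corrected version of the DDL (same columns as in the question):

```sql
CREATE TEMPORARY TABLE StateRegion (
  Zip_Code int,
  Place_Name string,
  State string,
  State_Abbreviate string,
  County string,
  Latitude float,
  Longitude float,
  ZIP_CD int,
  District_NM string,
  Region_NM string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");
```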

Insert data into hive table without delimiters

I want 10 words in one column and another 10 words in another column. How do I insert data into a Hive table with no specified delimiters, using UDFs?
CREATE TABLE employees_stg (emplid STRING, name STRING, age STRING, salary STRING, dept STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.{4})(.{35})(.{3})(.{11})(.{4})", --Length of each column specified between braces "({})"
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s" --Output in string format
)
LOCATION '/path/to/input/employees_stg';
LOAD DATA INPATH '/path/to/sample_file.txt' INTO TABLE employees_stg;
SELECT * FROM employees_stg;
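Since the regex above carves each line into fixed-width slices (4 + 35 + 3 + 11 + 4 characters), you can sanity-check the widths in Python before pointing the SerDe at real data; the sample line below is made up for illustration:

```python
import re

# Widths taken from the SerDe properties above.
pattern = re.compile(r'(.{4})(.{35})(.{3})(.{11})(.{4})')

# Build a 57-character fixed-width record: emplid, name, age, salary, dept.
line = '1001' + 'John Smith'.ljust(35) + ' 34' + '75000'.rjust(11) + 'SLS '
fields = [f.strip() for f in pattern.match(line).groups()]
print(fields)  # ['1001', 'John Smith', '34', '75000', 'SLS']
```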

How do I upload a key=value format file into a Hive table?

I am new to data engineering, so this might be a basic question, appreciate your help here.
I have a file which is in the following format -
first_name=A1 last_name=B1 city=Austin state=TX Zip=78703
first_name=A2 last_name=B2 city=Seattle state=WA
Note: No zip code available for the second row.
I need to upload this into Hive, in the following format:
First_name Last_name City State Zip
A1 B1 Austin TX 78703
A2 B2 Seattle WA NULL
Thanks for your help!!
I figured out a way to do this in Hive. The idea is to first load the entire data set into an n×1 table (n being the number of rows), and then parse out the key names in a second step using the str_to_map function.
Step 1: Load all data into a one-column table. Use a field delimiter which you are sure does not occur in your data ('\002' in this case).
DROP TABLE IF EXISTS kv_001;
CREATE EXTERNAL TABLE kv_001 (
col_import string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'
LOCATION 's3://location/directory/';
Step 2: Using the str_to_map function, extract the keys that are needed
DROP TABLE IF EXISTS required_table;
CREATE TABLE required_table
(first_name STRING
, last_name STRING
, city STRING
, state STRING
, zip INT);
INSERT OVERWRITE TABLE required_table
SELECT
params["first_name"] AS first_name
, params["last_name"] AS last_name
, params["city"] AS city
, params["state"] AS state
, params["zip"] AS zip
FROM
(SELECT str_to_map(col_import, '\001', '=') params FROM kv_001) A;
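Note that the pair delimiter passed to str_to_map has to match whatever actually separates the pairs in your file ('\001' in the query above; the sample rows in the question are space-separated). A rough Python equivalent of the parsing step, just to illustrate the logic:

```python
def parse_kv_line(line, pair_delim=' ', kv_delim='='):
    # Roughly what Hive's str_to_map does: split the line into pairs,
    # then split each pair once on the key/value delimiter.
    return dict(p.split(kv_delim, 1) for p in line.split(pair_delim) if p)

row = parse_kv_line('first_name=A2 last_name=B2 city=Seattle state=WA')
print(row['city'])     # Seattle
print(row.get('zip'))  # None -> surfaces as NULL in the target table
```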
You can also transform the file with a python3 script and then upload the result to the Hive table.
Try these steps.
Example script:
import sys

for line in sys.stdin:
    line = line.split()
    res = []
    for item in line:
        res.append(item.split("=")[1])
    if len(line) == 4:
        res.append("NULL")
    print(",".join(res))
This works as long as only the zip field can be empty. To apply it, use something like:
cat file | python3 script.py > output.csv
Then upload this file to hdfs using
hadoop fs -copyFromLocal ./output.csv hdfs:///tmp/
And create the table in hive using
CREATE TABLE my_table
(first_name STRING, last_name STRING, city STRING, state STRING, zip STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/output.csv'
OVERWRITE INTO TABLE my_table;