Let's say I want to create a simple table with 4 columns in Hive and load some pipe-delimited data.
CREATE table TEST_1 (
COL1 string,
COL2 string,
COL3 string,
COL4 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
;
Raw Data:
123|456|Dasani Bottled \| Water|789
I expect the COL3 value to be "Dasani Bottled \| Water". Because the table uses "|" as the field delimiter and the sequence "\|" contains a pipe character, Hive splits the value at that point and every column from COL3 onward is shifted out of position.
Is there any way to resolve the issue so Hive can load data correctly?
Thanks for any help.
You can add the ESCAPED BY clause to the table definition to enable character escaping:
CREATE table TEST_1 (
COL1 string,
COL2 string,
COL3 string,
COL4 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|' ESCAPED BY '\\'
;
From the Hive documentation:
Enable escaping for the delimiter characters by using the 'ESCAPED BY'
clause (such as ESCAPED BY '\') Escaping is needed if you want to
work with data that can contain these delimiter characters.
A custom NULL format can also be specified using the 'NULL DEFINED AS'
clause (default is '\N').
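To make the effect of an escape character concrete, here is a minimal Python sketch of escape-aware splitting. It is an illustration only, not Hive's actual parser: the function name `split_escaped` is hypothetical, and the sketch assumes the escape character is dropped from the resulting field.

```python
def split_escaped(line, delimiter="|", escape="\\"):
    """Split line on delimiter, treating escape + next char as a literal char."""
    fields, current = [], []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == escape and i + 1 < len(line):
            # Keep the escaped character, drop the escape itself (assumption).
            current.append(line[i + 1])
            i += 2
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields


row = r"123|456|Dasani Bottled \| Water|789"

# Naive split: the escaped pipe is treated as a delimiter, giving 5 pieces.
print(row.split("|"))

# Escape-aware split: 4 fields, with the pipe kept inside the third one.
print(split_escaped(row))
```

The naive split reproduces the column shift described in the question, while the escape-aware version keeps the row at four fields.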
I need to bulk insert into an instance of SQL Server 2016 (13.0.4224.16, where the FORMAT and FIELDQUOTE options are not available) using a special character as the field delimiter, while preserving any Unicode characters that may be in the data sets. I am trying to use ¿ (hex 0xBF), or any other character that I know isn't in my data set, as the field delimiter.
I have a UTF-8 encoded test.txt file containing some test data (headers excluded in dataset):
foo¿bar¿foobar¿
\n < last line
and a T-SQL statement to insert it:
BULK INSERT [dbo].[testTable]
FROM 'C:\Datasource\test.txt'
WITH (KEEPNULLS,
MAXERRORS=0,
FIELDTERMINATOR='0xBF');
into this table:
create table [dbo].[testTable](
col1 nvarchar(50),
col2 nvarchar(50),
col3 nvarchar(50),
col4 nvarchar(50)
)
When I run a SELECT on testTable, it returns:
col1 col2 col3 col4
foo┬ bar┬ foobar┬ NULL
Why are these ┬ characters showing up? I'm guessing my delimiters are being decoded incorrectly and included in the data. If I change the delimiter to |, the data loads without issue, but | occurs in my data sets and would break the inserts further down the line. I tried adding CODEPAGE=65001, which imports Unicode characters without issue when using a pipe delimiter, but with the special-character delimiter it results in this error:
Bulk load: An unexpected end of file was encountered in the data file.
Edit:
I've changed the import txt file to UTF-16 encoding but still encounter the same issues.
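One plausible reading of the ┬ characters can be checked with a few lines of Python. This is a sketch under stated assumptions, not a confirmed diagnosis: it assumes the file is split byte-wise on 0xBF and that the remaining bytes are displayed using an OEM code page such as 437 (the server's actual code page is not stated in the question).

```python
line = "foo¿bar¿foobar¿"

# In UTF-8, ¿ (U+00BF) is encoded as two bytes: 0xC2 0xBF.
raw = line.encode("utf-8")
print(raw)  # b'foo\xc2\xbfbar\xc2\xbffoobar\xc2\xbf'

# Splitting on the single byte 0xBF leaves the lead byte 0xC2
# attached to the end of every field.
fields = raw.split(b"\xbf")
print(fields)  # [b'foo\xc2', b'bar\xc2', b'foobar\xc2', b'']

# Decoded with code page 437 (assumption), byte 0xC2 is the box-drawing
# character ┬, which matches the output shown in the question.
shown = [f.decode("cp437") for f in fields]
print(shown)  # ['foo┬', 'bar┬', 'foobar┬', '']
```

Under these assumptions, the delimiter is a two-byte sequence in the file while the terminator passed to BULK INSERT is a single byte, so half of each delimiter ends up in the data.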
I used the sqoop-import command to import data from Teradata into Hive. Sqoop-import creates a text file with a comma (,) as the delimiter.
After the import, I created an external table as shown below:
CREATE EXTERNAL TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, description String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
But the description column has values like "abc,xyz,mnl", so the data is not loaded into the Hive table correctly. How can I make Sqoop create the text file with a delimiter other than a comma?
And how should I then delimit the fields when creating the Hive external table?
Use --fields-terminated-by in your Sqoop job if you want to avoid the default delimiter.
--fields-terminated-by sets the field separator character used in the output.
Example: --fields-terminated-by '|'
Then change the field separator in the CREATE TABLE statement to match: FIELDS TERMINATED BY '|'
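The effect of the delimiter choice can be sketched in Python. The record below is hypothetical apart from the description value "abc,xyz,mnl" quoted in the question, and the helper `to_line` is introduced only for illustration.

```python
# Hypothetical employee record; only the description value comes from the question.
record = {"eid": 1, "name": "John", "salary": "1000", "description": "abc,xyz,mnl"}


def to_line(rec, delimiter):
    """Join the record's values into one delimited text line."""
    return delimiter.join(str(value) for value in rec.values())


comma_line = to_line(record, ",")
pipe_line = to_line(record, "|")

# With a comma delimiter the description is broken apart: 6 pieces for 4 columns.
print(comma_line.split(","))

# With a pipe delimiter the row splits back into exactly 4 fields.
print(pipe_line.split("|"))
```

This only works if the chosen delimiter never occurs in the data, so it is worth checking the source columns before settling on one.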
Can we use a quote (" or ') as a delimiter in Hive data files? If not, why?
A list of the characters that can be used as delimiters for Hive data would be great.
When using decimal notation, you can use the whole basic ASCII range (decimal 0-127); I tested this.
Avoid using \n or \r.
As for " and ', it is straightforward:
create table mytable (i int,j int) row format delimited fields terminated by '"';
create table mytable (i int,j int) row format delimited fields terminated by "'";
or
create table mytable (i int,j int) row format delimited fields terminated by '\'';
create table mytable (i int,j int) row format delimited fields terminated by "\"";
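The rule of thumb from this answer (a single character in the decimal range 0-127, excluding \n and \r) can be written down as a small Python check. The function name `is_usable_delimiter` is hypothetical, and the sketch encodes only the answer's guidance, not any validation that Hive itself performs.

```python
def is_usable_delimiter(ch):
    """Apply the answer's rule: one basic-ASCII character, not a line break."""
    if len(ch) != 1:
        return False
    code = ord(ch)
    return 0 <= code <= 127 and ch not in ("\n", "\r")


# Print each candidate with its decimal code and whether it passes the rule.
for candidate in ['"', "'", "|", "\t", "\n", "¿"]:
    print(repr(candidate), ord(candidate), is_usable_delimiter(candidate))
```

The double quote (decimal 34) and the apostrophe (decimal 39) both pass, while the newline and the non-ASCII ¿ (decimal 191) do not.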
CREATE TABLE IF NOT EXISTS user.name_visits(
date1 TIMESTAMP,
MV String,
visits_by_MV int
)
COMMENT ‘visits_at_MV’
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’
LINES TERMINATED BY ‘\n’
;
Hive reports an error near BY.
The query below worked for me:
CREATE TABLE IF NOT EXISTS user.name_visits(
date1 TIMESTAMP,
MV STRING,
visits_by_MV INT
)
COMMENT 'visits_at_MV'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
;
The error you are seeing could be caused by the editor you are using.
Look at your quotation marks: they are LEFT SINGLE QUOTATION MARK (‘) and RIGHT SINGLE QUOTATION MARK (’).
The only change I made was to use an APOSTROPHE (') instead.
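If statements are regularly pasted from an editor that substitutes typographic quotes, a small cleanup step before submitting them may help. The following Python sketch is one possible approach; the name `straighten_quotes` is hypothetical.

```python
# Map typographic quotes to their plain ASCII equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
})


def straighten_quotes(sql):
    """Return sql with curly quotes replaced by straight ones."""
    return sql.translate(CURLY_TO_STRAIGHT)


ddl = "COMMENT ‘visits_at_MV’"
print(straighten_quotes(ddl))  # COMMENT 'visits_at_MV'
```

Note that this replaces every curly quote, including any that are intentionally part of a string value, so it suits DDL cleanup better than arbitrary data.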
Try it this way; it should work.
Replace the single quotes with double quotes, as below:
CREATE TABLE IF NOT EXISTS user.name_visits(
date1 TIMESTAMP,
MV String,
visits_by_MV int
)
COMMENT "visits_at_MV"
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
;
My data looks like:
a||b||c
To fetch the data my create table statement is:
CREATE TABLE
( col1 STRING,
col2 STRING,
col3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "||";
But Hive treats '|' as the delimiter, not "||".
Can anyone help me with this?
You may use RegexSerDe when dealing with multi-character delimiter strings:
create table mytable (
col1 string,
col2 string,
col3 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^([^\\|]+)\\|\\|([^\\|]+)\\|\\|([^\\|]+)$",
"output.format.string" = "%1$s %2$s %3$s")
STORED AS TEXTFILE
LOCATION '/path/to/data';
Note: refine the regex to suit your needs.
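Before putting a pattern into the table definition, it can be tried against sample rows. The Python sketch below uses the same pattern with single backslashes, since the doubled backslashes in the DDL are string-literal escaping. It assumes Python's `re` module and Java's regex engine treat these simple constructs the same way.

```python
import re

# Same pattern as in the SERDEPROPERTIES above, without the extra escaping layer.
PATTERN = re.compile(r"^([^\|]+)\|\|([^\|]+)\|\|([^\|]+)$")

match = PATTERN.match("a||b||c")
print(match.groups())  # ('a', 'b', 'c')

# Because each group uses +, a row with an empty field does not match at all.
print(PATTERN.match("a||||c"))  # None
```

The second case is one example of where the regex may need refining: replacing `+` with `*` in a group would allow that column to be empty.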