Hive 'Load' command not considering Field Delimiter - hive

I am trying to create a Hive table using the Hive CLI on the Hortonworks Sandbox as well as on a C3 cluster. In my CREATE TABLE command, I specify the following:
FIELDS TERMINATED BY '\u0010'
and then I load the table using the LOAD command. This gives a correct Hive table on the Sandbox, but on the C3 cluster it appends all the fields into the first column and gives NULL values for the rest of the columns.
Please help me resolve this issue.
Thanks

There's a bug with Unicode literals that is supposed to be fixed in version 2.1.
Use decimal or octal notation instead.
... fields terminated by '\020' (Octal)
... fields terminated by '16' (Decimal)
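For example, a minimal sketch of the octal workaround; the table name, columns, and input path below are hypothetical placeholders:
-- Hypothetical table; the relevant part is the octal escape in the delimiter clause.
CREATE TABLE my_table (
  id     INT,
  name   STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\020'   -- octal for the same 0x10 control character as '\u0010'
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/me/input/datafile.txt' INTO TABLE my_table;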

Related

Specify multiple delimiters for Redshift copy command

Is there a way to specify multiple delimiters for the Redshift COPY command while loading data?
I have a data file in the following format:
1 | ab | cd | ef
2 | gh | ij | kl
I am using a command like this:-
COPY MY_TBL
FROM 's3://s3-file-path'
iam_role 'arn:aws:iam::ddfjhgkjdfk'
manifest
IGNOREHEADER 1
gzip delimiter '|';
Fields are separated by | and records are separated by newlines. How do I copy this data into Redshift? My query above gives me a "delimiter not found" error.
No, delimiters are single characters.
From Data Format Parameters:
Specifies the single ASCII character that is used to separate fields in the input file, such as a pipe character ( | ), a comma ( , ), or a tab ( \t ).
You could import it with a pipe delimiter, then run an UPDATE command with TRIM() to strip off the spaces.
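A minimal sketch of that approach, assuming hypothetical column names col_a, col_b and col_c in MY_TBL:
-- After loading with DELIMITER '|', trim the surrounding spaces in place.
UPDATE MY_TBL
SET col_a = TRIM(col_a),
    col_b = TRIM(col_b),
    col_c = TRIM(col_c);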
Your error suggests that something in your data is causing the COPY command to fail. This could be a number of things, from file encoding to some funky data in there. I've struggled with the "delimiter not found" error recently; it turned out to be the ESCAPE parameter combined with trailing backslashes in my data, which prevented my delimiter (\t) from being picked up.
Fortunately, there are a few steps you can take to help you narrow down the issue:
stl_load_errors - This system table contains details of any errors logged by Redshift during the COPY operation. It should identify the row number in your data file that is causing the problem (see the query sketch at the end of this answer).
NOLOAD - will allow you to run your copy command without actually loading any data to Redshift. This performs the COPY ANALYZE operation and will highlight any errors in the stl_load_errors table.
FILLRECORD - This allows Redshift to "fill" any columns that it sees as missing in the input data. This is essentially to deal with any ragged-right data files, but can be useful in helping to diagnose issues that can lead to the "delimiter not found" error. This will let you load your data to Redshift and then query in database to see where your columns start being out of place.
From the sample you've posted, your setup looks good, but obviously this isn't the entire picture. The options above should help you narrow down the offending row(s) to help resolve the issue.
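A quick way to pull the most recent COPY failures out of stl_load_errors (a standard Redshift system table, so the columns below should be available):
SELECT starttime, filename, line_number, colname, err_code, err_reason, raw_line
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;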

Hue on Cloudera - NULL values (importing file)

Yesterday I installed the Cloudera QuickStart VM 5.8. After importing files from the database through HUE, some tables contain NULL values (the entire column). In the preceding import steps, the data displayed properly, as it should be imported.
Can you run the command DESCRIBE FORMATTED table_name in the Hive shell to see what the field delimiter is, then go to the warehouse directory and check whether the delimiter in the data and in the table definition is the same? I am fairly sure it is not, and that is why you see NULLs.
I am assuming you have imported the data into the default warehouse directory.
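For example, a quick check in the Hive shell (table_name is a placeholder):
DESCRIBE FORMATTED table_name;
-- Compare the field.delim value under Storage Desc Params with the separator
-- actually used in the files under the warehouse directory.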
Then you can do one of the following:
1) Delete your Hive table and create it again with the correct delimiter, matching the actual data (ROW FORMAT DELIMITED FIELDS TERMINATED BY 'your delimiter'), and give the location of your data file (see the sketch after these options),
or
2) Delete the imported data and run the Sqoop import again, passing --fields-terminated-by 'the delimiter in the Hive table definition'.
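A sketch of option 1, with hypothetical table/column names and a placeholder location; replace ',' with whatever delimiter the data actually uses:
DROP TABLE IF EXISTS my_table;
CREATE EXTERNAL TABLE my_table (
  col_0 STRING,
  col_1 STRING,
  col_2 STRING,
  col_3 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/my_table';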
Also, check the data types of the second (col_1) and third (col_2) columns in the original database you are exporting from.
This cannot be a case of a missing delimiter; otherwise the fourth column (col_3) would not have been populated correctly, which it is.

Getting error while retrieving columns on HIVE "TIMESTAMP" column

In Hive, I am trying to create a table on a log file. I have data in the following format:
1000000000012311 1373346000 21.4 XX
1000000020017331 1358488800 16.9 YY
The second field is a Unix timestamp. I am writing the following Hive query:
CREATE EXTERNAL TABLE log(user STRING, tdate TIMESTAMP, spend DOUBLE, state STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n'
LOCATION '/user/XXX/YYY/ZZZ';
The table is created, but when I try to get the data from the table with
SELECT * FROM log LIMIT 10;
I get the following error:
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating tdate
I have checked the Hive manual and also googled it, but didn't find any solution.
For epoch values, you can define the column as BIGINT and then use the built-in UDF from_unixtime() to convert it to a string representing the date. Something like: select from_unixtime(tdate) from log
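A minimal sketch of that suggestion, reusing the columns and placeholder location from the question:
CREATE EXTERNAL TABLE log (user STRING, tdate BIGINT, spend DOUBLE, state STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n'
LOCATION '/user/XXX/YYY/ZZZ';

-- from_unixtime() turns the epoch seconds into a 'yyyy-MM-dd HH:mm:ss' string
SELECT user, from_unixtime(tdate) AS tdate_str, spend, state
FROM log
LIMIT 10;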
There is a similar post at this link: How to create an external Hive table with column typed Timestamp
Hive supports the TIMESTAMP data type, but earlier versions could not accept Timestamp as a data type when used through JDBC. This has been fixed since Hive version 0.8.0. You can check the JIRA ticket that was raised:
https://issues.apache.org/jira/browse/HIVE-2957

Hadoop Hive: create external table with dynamic location

I am trying to create a Hive external table that points to an S3 output file.
The file name should reflect the current date (it is always a new file).
I tried this:
CREATE EXTERNAL TABLE s3_export (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION concat('s3://BlobStore/Exports/Daily_', from_unixtime(unix_timestamp(),'yyyy-MM-dd'));
but I get an error:
FAILED: Parse Error: line 3:9 mismatched input 'concat' expecting StringLiteral near 'LOCATION' in table location specification
Is there any way to dynamically specify the table location?
OK, I found the Hive variables feature.
So I pass the location on the CLI as follows:
hive -d s3file=s3://BlobStore/Exports/APKsCollection_test/`date +%F`/
and then use the variable in the Hive command:
CREATE EXTERNAL TABLE s3_export (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${s3file}';
This doesn't work on my side.
How did you make it work?
hive -d s3file=s3://BlobStore/Exports/APKsCollection_test/`date +%F`/

Configuring delimiter for Hive MR Jobs

Is there any way to configure the delimiter for Hive MR jobs?
The default delimiter used by Hive internally is the "hive delimiter" (\001). My use case is to configure the delimiter so that I can use any delimiter the requirement calls for. In Hadoop there is a property, mapred.textoutputformat.separator, which sets the key-value separator to the value specified for this property. Is there any such way to configure the delimiter in Hive? I have searched a lot but didn't find any useful links. Please help me.
As of hive-0.11.0, you can write
INSERT OVERWRITE LOCAL DIRECTORY '...'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT ...
See HIVE-3682 for the complete syntax.
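A fuller example of that syntax, with a placeholder output directory and table name:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT * FROM my_table;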
You can try this (the ROW FORMAT clause belongs to an INSERT OVERWRITE [LOCAL] DIRECTORY statement and goes before the SELECT):
INSERT OVERWRITE DIRECTORY '...'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY 'YourChar' (example: FIELDS TERMINATED BY '\t')
SELECT (rest of your query)
You can also use this:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim'='-','serialization.format'='-')
This will separate columns using the '-' delimiter, but it is specific to LazySimpleSerDe.
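For instance, a table definition using those SERDEPROPERTIES (hypothetical table and column names):
CREATE TABLE dash_delimited (
  id   INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim'='-','serialization.format'='-')
STORED AS TEXTFILE;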
I guess you are using the INSERT OVERWRITE DIRECTORY option to write to an HDFS file.
If you create a Hive table on top of that HDFS file without specifying a delimiter, it will take '\001' as the delimiter, so you can read the file through a Hive table without any issues.
If your source table did not specify the delimiter in its CREATE statement, then you won't be able to change it; your output will always contain the default. And yes, the delimiter is controlled by the CREATE statement of the source table, so that isn't configurable either.
I have had a similar issue and ended up replacing the \001 delimiter as a second step after the Hive MR job finished.