Configuring delimiter for Hive MR Jobs - hive

Is there any way to configure the delimiter for Hive MR jobs?
The default delimiter used internally by Hive is the "hive delimiter" ('\001'). My use case is to configure the delimiter so that I can use any delimiter as required. In Hadoop there is a property, "mapred.textoutputformat.separator", which sets the key-value separator to the value specified for that property. Is there any such way to configure the delimiter in Hive? I searched a lot but didn't find any useful links. Please help me.

As of hive-0.11.0, you can write
INSERT OVERWRITE LOCAL DIRECTORY '...'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT ...
See HIVE-3682 for the complete syntax.

You can try this (note that in an INSERT OVERWRITE DIRECTORY statement, the ROW FORMAT clause goes before the SELECT):
ROW FORMAT DELIMITED
FIELDS TERMINATED BY 'YourChar' (for example: FIELDS TERMINATED BY '\t')
SELECT (rest of your query)

You can also use this:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim'='-','serialization.format'='-')
This will separate columns using the - delimiter, but it is specific to LazySimpleSerDe.
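As a rough illustration of what that delimiter setting produces (modeled here in Python, outside Hive; the column values are made up for the example):

```python
# Rough model of a custom field delimiter: LazySimpleSerDe with
# field.delim='-' writes one line per row, columns joined by '-'.
row = ["2023", "01", "15"]          # hypothetical column values
serialized = "-".join(row)          # what would land in the output file
assert serialized == "2023-01-15"

# Reading it back: split on the same delimiter to recover the columns.
columns = serialized.split("-")
assert columns == ["2023", "01", "15"]
```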

I guess you are using the INSERT OVERWRITE DIRECTORY option to write to an HDFS file.
If you create a Hive table on top of the HDFS file with no delimiter specified, it will take '\001' as the delimiter, so you can read the file from a Hive table without any issues.

If your source table does not specify the delimiter in its CREATE statement, you won't be able to change it; your output will always contain the default. And yes, the delimiter is controlled by the CREATE statement of the source table, so it isn't configurable either.
I had a similar issue and ended up replacing '\001' as a second step after the Hive MR job finished.
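That post-processing step can be as simple as replacing '\001' with the delimiter you want. A minimal sketch in Python (the file names are hypothetical):

```python
# Replace Hive's default '\001' field delimiter with a tab as a
# post-processing step after the Hive MR job has written its output.
# File names here are hypothetical.
def retab(in_path, out_path, new_delim="\t"):
    with open(in_path, "r") as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line.replace("\x01", new_delim))

# Example: a two-column row written with the default delimiter.
with open("hive_out.txt", "w") as f:
    f.write("abc\x01123\n")

retab("hive_out.txt", "hive_out.tsv")
with open("hive_out.tsv") as f:
    assert f.read() == "abc\t123\n"
```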

Related

How to Copy data from s3 to Redshift with "," in the field values

I am faced with an "Extra column(s) found" error while loading data from S3 into Redshift.
Since my data has 863830 rows and 21 columns, I'll give you a small example of how the data looks.
create table test_table(
name varchar(500),
age varchar(500)
)
and my data would be
(ABC,12)
(First,Last,25)
where First,Last should go into a single column.
Unfortunately, I am unable to do that with this COPY command:
COPY test_table from 'path'
iam_role 'credentials'
region 'us-east-1'
IGNOREHEADER 1
delimiter as ','
Is there any way to accommodate commas in a field?
Is it a CSV file that you're trying to load? If so, try loading with the CSV format parameter specified in the command, rather than the delimiter ',' parameter. Here's an example -
COPY test_table from 'path'
iam_role 'credentials'
region 'us-east-1'
IGNOREHEADER 1
CSV;
If that doesn't help, you may have to use the ESCAPE parameter. This would need modifications in your file too. Here's an example - https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#r_COPY_command_examples-copy-data-with-the-escape-option
Your data doesn't conform to the CSV specification. See RFC 4180.
To store your example data, the field with the comma in it needs to be enclosed in double quotes:
ABC,12
"First,Last",25
The parentheses in the data file will also need to be removed as these will be interpreted as part of the data fields.
Alternatively you could change the delimiter of your data from "," to something else like "%". However if this character is in your data then you are right back where you started. Ad hoc delimited files only work if you use a character that will never be in your data. This is why I recommend that you use the more robust CSV specification and use the "CSV" option to COPY.
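The quoting behavior described above is exactly what an RFC 4180 CSV writer/reader pair gives you. A small Python sketch, using the example rows from above:

```python
import csv, io

# Write the example rows with RFC 4180 quoting: the field that
# contains a comma is automatically enclosed in double quotes.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["ABC", "12"])
writer.writerow(["First,Last", "25"])

text = buf.getvalue()
assert '"First,Last",25' in text   # the comma-bearing field got quoted

# Reading it back keeps "First,Last" as a single column.
rows = list(csv.reader(io.StringIO(text)))
assert rows == [["ABC", "12"], ["First,Last", "25"]]
```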

Use multi-character delimiter in Amazon Redshift COPY command

I am trying to load a data file that has a multi-character delimiter ('|~|') into an Amazon Redshift DB using the COPY command. The Redshift COPY command does not allow multi-character delimiters.
My data looks like this -
John|~|23|~|Los Angeles|~|USA
Jade|~|27|~|New York|~|USA
When I try to use multi-characters in the COPY command I get "COPY delimiter must be a single character;" error.
My COPY command looks like this -
copy test_data from 's3://abcd/testFile'
credentials 'aws_access_key_id=<redacted>;aws_secret_access_key=<redacted>'
delimiter '|~|'
null as '\0'
acceptinvchars
ignoreheader as 1
MAXERROR 1;
I cannot replace or edit the source files since they are very large(>100GB), so I need a solution within the AWS Redshift paradigm.
If you can't edit the source files, and you can't use a multi-character delimiter, then use | as the delimiter and add additional (fake) columns that will be loaded with ~.
You can then either ignore these columns, or use CREATE TABLE AS to copy the data to a new table but without those columns.
Or, use CREATE VIEW to make a version of that table without the fake columns.
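To see why the fake-column trick works, note that splitting |~| data on | alone leaves a lone ~ in every other field. A Python model of the idea (in the actual COPY the fake columns would be real table columns you later drop or hide):

```python
# Model of the fake-column trick: split on '|' only, so every other
# "column" contains the leftover '~', then drop those fake columns.
line = "John|~|23|~|Los Angeles|~|USA"

fields = line.split("|")
assert fields == ["John", "~", "23", "~", "Los Angeles", "~", "USA"]

# Keep the even-indexed fields; the odd ones are the fake '~' columns.
real = fields[0::2]
assert real == ["John", "23", "Los Angeles", "USA"]
```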

Specify multiple delimiters for Redshift copy command

Is there a way to specify multiple delimiters to the Redshift COPY command while loading data?
I have a data file with the following format:
1 | ab | cd | ef
2 | gh | ij | kl
I am using a command like this:
COPY MY_TBL
FROM 's3://s3-file-path'
iam_role 'arn:aws:iam::ddfjhgkjdfk'
manifest
IGNOREHEADER 1
gzip delimiter '|';
Fields are separated by | and records are separated by newlines. How do I copy this data into Redshift? My query above gives me a "delimiter not found" error.
No, delimiters are single characters.
From Data Format Parameters:
Specifies the single ASCII character that is used to separate fields in the input file, such as a pipe character ( | ), a comma ( , ), or a tab ( \t ).
You could import it with a pipe delimiter, then perform an UPDATE command with TRIM() to strip off the spaces.
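What the load-then-trim approach amounts to, modeled in Python (in Redshift the trimming would be an UPDATE applying TRIM/BTRIM to each column):

```python
# Loading "1 | ab | cd | ef" with a plain '|' delimiter leaves the
# surrounding spaces in each field; a trim pass cleans them up.
line = "1 | ab | cd | ef"

raw = line.split("|")
assert raw == ["1 ", " ab ", " cd ", " ef"]

trimmed = [f.strip() for f in raw]
assert trimmed == ["1", "ab", "cd", "ef"]
```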
Your error above suggests that something in your data is causing the COPY command to fail. This could be a number of things, from file encoding, to some funky data in there. I've struggled with the "delimiter not found" error recently, which turned out to be the ESCAPE parameter combined with trailing backslashes in my data which prevented my delimiter (\t) from being picked up.
Fortunately, there are a few steps you can take to help you narrow down the issue:
stl_load_errors - This system table contains details on any error logged by Redshift during the COPY operation. This should be able to identify the row number in your data file that is causing the problem.
NOLOAD - will allow you to run your copy command without actually loading any data to Redshift. This performs the COPY ANALYZE operation and will highlight any errors in the stl_load_errors table.
FILLRECORD - This allows Redshift to "fill" any columns that it sees as missing in the input data. This is essentially to deal with any ragged-right data files, but can be useful in helping to diagnose issues that can lead to the "delimiter not found" error. This will let you load your data to Redshift and then query in database to see where your columns start being out of place.
From the sample you've posted, your setup looks good, but obviously this isn't the entire picture. The options above should help you narrow down the offending row(s) to help resolve the issue.

Hue on Cloudera - NULL values (importing file)

Yesterday I installed the Cloudera QuickStart VM 5.8. After importing files from the database via Hue, some tables had NULL values (the entire column). In the previous steps the data displayed properly, as it should after import.
First Pic.
Second Pic.
Can you run the command describe formatted table_name in the Hive shell and check the field delimiter, then go to the warehouse directory and see whether the delimiter in the data matches the one in the table definition? I am sure it will not match; that is why you see NULLs.
I am assuming you have imported the data into the default warehouse directory.
Then you can do one of the following:
1) Delete your Hive table and create it again with the correct delimiter as it appears in the actual data (ROW FORMAT DELIMITED FIELDS TERMINATED BY 'your delimiter'), giving the location of your data file.
or
2) Delete the imported data and run the Sqoop import again, passing fields-terminated-by with the delimiter from the Hive table definition.
Also check the data types of the second (col_1) and third (col_2) columns in the original database you are exporting from.
This cannot be a case of a missing delimiter; otherwise the fourth column (col_3) would not have been populated correctly, which it is.

Hive External table-CSV File- Header row

Below is the hive table i have created:
CREATE EXTERNAL TABLE Activity (
column1 type,
column2 type
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/exttable/';
In my HDFS location /exttable, I have a lot of CSV files, and each CSV file also contains a header row. When I run SELECT queries, the result contains the header row as well.
Is there any way in HIVE where we can ignore the header row or first line ?
You can now skip the header row as of Hive 0.13.0:
tblproperties ("skip.header.line.count"="1");
If you are using Hive version 0.13.0 or higher you can specify "skip.header.line.count"="1" in your table properties to remove the header.
For detailed information on the patch see: https://issues.apache.org/jira/browse/HIVE-5795
Let's say you want to load a CSV file like the one below, located at /home/test/que.csv:
1,TAP (PORTUGAL),AIRLINE
2,ANSA INTERNATIONAL,AUTO RENTAL
3,CARLTON HOTELS,HOTEL-MOTEL
Now, we need to create a location in HDFS that holds this data.
hadoop fs -put /home/test/que.csv /user/mcc
The next step is to create a table. There are two types to choose from; see the Hive documentation for choosing one.
Example for External Table.
create external table industry_
(
MCC string ,
MCC_Name string,
MCC_Group string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/mcc/'
tblproperties ("skip.header.line.count"="1");
Note: When accessed via Spark SQL, the header row of the CSV will be shown as a data row.
Tested on: spark version 2.4.
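Outside Hive, skipping a header amounts to dropping the first line before parsing, which is what skip.header.line.count does at read time. A Python sketch over the sample data above (the header text is made up):

```python
import io

# The sample file from above, with a hypothetical header row prepended.
data = """MCC,MCC_Name,MCC_Group
1,TAP (PORTUGAL),AIRLINE
2,ANSA INTERNATIONAL,AUTO RENTAL
3,CARLTON HOTELS,HOTEL-MOTEL
"""

lines = io.StringIO(data).read().splitlines()
rows = [line.split(",") for line in lines[1:]]   # skip the header row

assert rows[0] == ["1", "TAP (PORTUGAL)", "AIRLINE"]
assert len(rows) == 3
```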
There is not. However, you can pre-process your files to skip the first row before loading into HDFS -
tail -n +2 withfirstrow.csv > withoutfirstrow.csv
Alternatively, you can build a filter into the WHERE clause in Hive to ignore the first row.
If your Hive version doesn't support tblproperties ("skip.header.line.count"="1"), you can use the Unix command below to drop the first line (the column header) before putting the file in HDFS.
sed -n '2,$p' File_with_header.csv > File_with_No_header.csv
To remove the header from the csv file in place use:
sed -i 1d filename.csv