Hive ORC or AVRO format with field delimiters - hive

What does it mean for a Hive table in ORC or Avro format to have a field delimiter specified? Does Hive ignore it even if it is specified?
For example,
CREATE TABLE IF NOT EXISTS T (
  C1 STRING,
  C2 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY");

ORC and Avro are binary formats with their own internal serialization, so Hive ignores the ROW FORMAT DELIMITED / FIELDS TERMINATED BY clause for them. When you specify a compression format, it is used; the delimiter need not be specified and has no effect.
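For example, the same table can be declared without any delimiter clause and behaves identically (a minimal sketch of the equivalent DDL):
CREATE TABLE IF NOT EXISTS T (
  C1 STRING,
  C2 STRING)
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY");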

Related

Hive: FieldName in Uppercase in Avro file schema required

I have created a hive table. Below is the create statement:
CREATE EXTERNAL TABLE schemanm.tbl_name(
FIELD_NAME_1 string COMMENT
FIELD_NAME_2 string COMMENT
.....)
PARTITIONED BY (
`part_1` string,
`part_2` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
'path/to/directory'
When I write data into the table through an INSERT OVERWRITE statement, the field names in the schema inside the Avro file are written in lowercase. The requirement is to have the field names in uppercase in the schema. Below is a snippet showing the schema from the Avro file.
Objavro.schema∂,
{"type":"record","name":"tbl_name","namespace":"schemanm","fields":
[{"name":"field_name_1","type":["null","string"],"default":null},
{"name":"field_name_2","type":["null","string"],"default":null}]
Here field_name_1 and field_name_2 should be FIELD_NAME_1 and FIELD_NAME_2, respectively.
I am stuck on this and cannot figure out what changes I can make to the create statement so that the field names get written in uppercase. Any help would be appreciated. Thanks in advance.
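One approach worth trying (a sketch only, not verified against this exact setup): Hive itself lowercases column names, but the AvroSerDe accepts an explicit writer schema through the avro.schema.literal table property, and the casing declared there may be carried into the files it writes. The column list is omitted because the AvroSerDe derives the columns from that schema:
CREATE EXTERNAL TABLE schemanm.tbl_name
PARTITIONED BY (
  `part_1` string,
  `part_2` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'path/to/directory'
TBLPROPERTIES ('avro.schema.literal'='{
  "type":"record","name":"tbl_name","namespace":"schemanm",
  "fields":[
    {"name":"FIELD_NAME_1","type":["null","string"],"default":null},
    {"name":"FIELD_NAME_2","type":["null","string"],"default":null}]}');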

Hive external table delimited by commas, but comma present in data

I have some data coming in from an external source of the format:
user_id, user_name, project_name, position
"111", "Tom Petty", "Heartbreakers", "Vocals"
"222", "Ringo Starr", "Beatles, The", "Drummer"
"333", "Tom Brady", "Patriots", "QB"
And I create my external table thusly:
CREATE EXTERNAL TABLE tab1 (
USER_ID String,
USER_NAME String,
PROJECT_NAME String,
POSITION String
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/blah/foo'
The problem occurs when the data in some of the columns has embedded commas in it, Beatles, The for instance. This results in Hive putting the word The into the next column (position) and dropping the data in the last column.
All the incoming data fields are wrapped in double quotes, but they are comma-delimited even though they may contain commas themselves. Unfortunately, having the sender clean the data is not an option.
How can I go about creating this table?
Try this:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\""
)
You can use the OpenCSVSerde in your Hive table creation with these SerDe properties.
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
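Applied to the table in the question, the full DDL would look roughly like this (a sketch; note that the OpenCSVSerde exposes every column as a string):
CREATE EXTERNAL TABLE tab1 (
  USER_ID String,
  USER_NAME String,
  PROJECT_NAME String,
  POSITION String
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION '/user/blah/foo';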

Querying S3 Text files using Athena

I created a database and a table in Athena pointing to an S3 bucket where I have log files created with the UNLOAD command on a Redshift database. The files use the default pipe (|) delimiter for the columns.
While creating the table through the Athena interface, I set the field terminator to pipe (|) and left the collection and map key terminators at their defaults. Here is the DDL statement.
CREATE EXTERNAL TABLE IF NOT EXISTS testdb.worktable (
field1 string,
field2 string,
field3 int,
field4 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '|',
'field.delim' = '|',
'collection.delim' = 'undefined',
'mapkey.delim' = 'undefined'
) LOCATION 's3://bucket_location'
TBLPROPERTIES ('has_encrypted_data'='false');
Problem :
Most of the rows are correctly aligned to the fields mentioned as columns (delimited by pipe |), but when there are spaces in a particular field, say for example a space under the field2 column, the data shifts to the right, meaning the field3 column data shows up under the field4 column.
Could someone help me fix this error? Thank you!

HIVE 2.1.1 Table creation CSV-Serde

So I did all the research and couldn't find the same issue reported anywhere for Hive.
I followed the link below and have no issues with data in quotes:
https://github.com/ogrodnek/csv-serde
My external table creation has the SerDe properties below, but for some reason the default escapeChar ('\') is being replaced by the quoteChar, which is a double quote ("), in my data.
CREATE EXTERNAL TABLE IF NOT EXISTS people_full(
`Unique ID` STRING,
.
.
.
.
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"escapeChar" = "\\"
)
STORED AS TEXTFILE
DATA ISSUE :
Sample HDFS Source data : "\"Robs business
Target HIVE Output : """Robs business
So the three double quotes seen in """Robs business after the replacement are causing unwanted delimitation of the data (the column is a very long string), maybe because Hive cannot handle three double quotes inside the data (the quote (") is also my default quote character)?
Why is this happening, and is there a solution? Please help. Many thanks.
Best,
Asha
To load a CSV file with double quotes inside the data into HDFS and create a Hive table for that file, use the following query to create an external table; it works fine and displays each record as it appears in the file.
create external table tablename (colname datatype, colname2 datatype2)
row format SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
stored as textfile
location '/dir_name/';
Here, tablename is the name of the table, colname and colname2 are the column names you are going to give, datatype and datatype2 are types such as string or int, and dir_name is the HDFS location of the CSV or text file.
Alternatively, try it with ESCAPED BY; it will work.
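For reference, an ESCAPED BY variant using the plain delimited row format might look like the sketch below (the single column and the location are placeholders, and this uses LazySimpleSerDe rather than the OpenCSVSerde, so surrounding quotes are not stripped):
CREATE EXTERNAL TABLE people_full (
  `Unique ID` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
STORED AS TEXTFILE
LOCATION '/path/to/people_full';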

Create hive timestamp from pig

How can I create a timestamp field in Pig from a string that Hive accepts as a timestamp?
I have formatted the string in Pig to match the timestamp format in Hive, but after loading it is null instead of showing the date.
2014-04-10 09:45:56 is how the format looks in Pig, and it matches the Hive timestamp format, but it cannot be loaded (it only works if I load it into a string field).
Any ideas why?
Quick update: no HCatalog is available.
The problem is that in some cases the timestamp fields contain null values, and then all the fields become null when using the timestamp data type. When putting a timestamp on a column where every row is in the above format, it works fine. So the real question is how null values can be handled.
I suspect you have written your data to HDFS using PigStorage and you want to load it into a Hive table. The problem is that a missing tuple field will be written by Pig as an empty string, which Hive 0.11 treats as null. So far so good.
But then all the subsequent fields will be treated as null as well, even though they can have different values. Hive 0.12 doesn't have this issue.
Depending on the SerDe type, Hive can interpret different strings as null. In the case of LazySimpleSerDe it is \N.
You have two options:
set the table's null format property to the empty string which is produced by Pig
or store \N in Pig for null fields
E.g., given the following data in Pig 0.11:
A = load 'data' as (txt:chararray, ts:chararray);
dump A;
(a,2014-04-10 09:45:56)
(b,2014-04-11 10:45:56)
(,)
(e,2014-04-12 11:45:56)
Option 1:
store A into '/user/data';
Hive 0.11 :
CREATE EXTERNAL TABLE test (txt string, tms TimeStamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/data';
alter table test SET SERDEPROPERTIES('serialization.null.format' = '');
Option 2:
...
B = foreach A generate txt, (ts is null?'\\N':ts);
store B into '/user/data';
Then create the table in Hive without setting the serde property.