HIVE 2.1.1 Table creation CSV-Serde

I did my research and couldn't find the same issue reported anywhere for Hive.
I followed the link below and I have no issues with data in quotes:
https://github.com/ogrodnek/csv-serde
My external table creation uses the SerDe properties below, but for some reason the default escapeChar ('\') is being replaced by the quoteChar, which is a double quote (") for my data.
CREATE EXTERNAL TABLE IF NOT EXISTS people_full(
  `Unique ID` STRING,
  ...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "escapeChar" = "\\"
)
STORED AS TEXTFILE;
DATA ISSUE:
Sample HDFS source data: "\"Robs business
Target Hive output: """Robs business
The three double quotes seen in """Robs business after the replacement are causing unwanted data delimitation (the column is a very long string), maybe because Hive cannot handle three double quotes inside data (the double quote (") is also my default quote character)?
Why is this happening, and is there a solution? Please help. Many thanks.
Best,
Asha

To import a CSV file that has double quotes within the data into HDFS and create a Hive table for it, use the following query to create an external table; it works fine and displays each record as it appears in the file.
create external table tablename (colname datatype, colname2 datatype2)
row format SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
stored as textfile location '/dir_name/';
Here, tablename is the name of the table, datatype is a type such as string or int, colname is the name of the column, and dir_name is the HDFS directory holding the CSV or text file.
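For instance, a minimal instantiation of that template (the table name, columns, and path here are hypothetical):
create external table people (id string, name string)
row format SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
stored as textfile location '/data/people/';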

Try with ESCAPED BY; it will work.
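A minimal sketch of the idea, assuming a plain-text table (ESCAPED BY belongs to ROW FORMAT DELIMITED rather than to OpenCSVSerde; the table and column names are hypothetical):
CREATE EXTERNAL TABLE people_escaped (
  id STRING,
  business STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
STORED AS TEXTFILE
LOCATION '/dir_name/';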

Related

Hive: FieldName in Uppercase in avrofile schema required

I have created a hive table. Below is the create statement:
CREATE EXTERNAL TABLE schemanm.tbl_name(
  FIELD_NAME_1 string COMMENT '...',
  FIELD_NAME_2 string COMMENT '...',
  .....)
PARTITIONED BY (
  `part_1` string,
  `part_2` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  'path/to/directory'
When I write data into the table through an INSERT OVERWRITE statement, the field names in the Avro file's schema are written in lowercase. The requirement is to have the field names in uppercase in the schema. Below is a snippet showing the schema from the Avro file.
Objavro.schema∂,
{"type":"record","name":"tbl_name","namespace":"schemanm","fields":
[{"name":"field_name_1","type":["null","string"],"default":null},
{"name":"field_name_2","type":["null","string"],"default":null}]
Here field_name_1 and field_name_2 should be FIELD_NAME_1, FIELD_NAME_2 respectively.
I am stuck on this and cannot figure out what changes to make in the create statement so that the field names are written in uppercase. Any help would be appreciated. Thanks in advance.
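One approach worth trying (a sketch, not from the original thread) is to pin the schema explicitly through the avro.schema.literal table property, spelling the field names in uppercase; the AvroSerDe takes its schema from this property instead of deriving one from the column list:
CREATE EXTERNAL TABLE schemanm.tbl_name
PARTITIONED BY (
  `part_1` string,
  `part_2` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'path/to/directory'
TBLPROPERTIES ('avro.schema.literal'='{"type":"record","name":"tbl_name","namespace":"schemanm","fields":[{"name":"FIELD_NAME_1","type":["null","string"],"default":null},{"name":"FIELD_NAME_2","type":["null","string"],"default":null}]}');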

select row from orc snappy table in hive

I have created a table employee_orc, which is in ORC format with Snappy compression.
create table employee_orc(emp_id string, name string)
row format delimited fields terminated by '\t' stored as orc tblproperties("orc.compress"="SNAPPY");
I loaded data into the table using an INSERT statement.
The employee_orc table has 1000 records.
When I run the query below, it shows all the records:
select * from employee_orc;
But when I run the query below, it returns zero results even though the record exists:
select * from employee_orc where emp_id = "EMP456";
Why am I unable to retrieve this single record from the employee_orc table?
The record does not exist as stored. You may think the values are the same because they look the same, but there is some difference. One possibility is spaces at the beginning or end of the string. For this, you can use like:
where emp_id like '%EMP456%'
This might help you.
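As a quick diagnostic (a sketch, not from the original answer), comparing raw and trimmed lengths exposes hidden whitespace:
select emp_id, length(emp_id) as raw_len, length(trim(emp_id)) as trimmed_len
from employee_orc
where emp_id like '%EMP456%';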
For my part, I don't understand why you want to specify a delimiter for ORC: ROW FORMAT DELIMITED describes text files, while ORC stores data in its own binary format, so the clause serves no purpose here. Are you confusing CSV with ORC, or external with managed tables?
I advise you to create your table differently:
create table employee_orc(emp_id string, name string)
stored as ORC
TBLPROPERTIES ("orc.compress"="ZLIB");

Create table on Athena only from certain S3 files (depending on filename)

I'm trying to create a table on Athena from S3 files.
In my bucket, I have different types of files (Activity, Epoch, BodyComp, etc.) and I'd like this table to contain only "Activity" files, assuming their filenames look like:
"Activity__xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx__yyyyyyyyyy.json"
where :
- x is a character or a digit
- y is a digit
I can do that after creating the table with this SELECT statement, but the query takes too much time:
SELECT *, regexp_extract("$path", '[^/]+$') AS filename
FROM runs
WHERE regexp_extract("$path", '[^/]+$') like 'Activity__%';
I'd like to do it directly in the CREATE TABLE statement.
I tried this with "input.regex", but it didn't work:
CREATE EXTERNAL TABLE IF NOT EXISTS runs(
summaryId string,
distanceInMeters float,
maxHeartRateInBeatsPerMinute int,
totalElevationGainInMeters float,
userAccessToken string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1',
"input.regex" = "^Activity\_\_\w{8}-\w{4}-\w{4}-\w{4}-\w{12}\_\_\d{10}\.json")
LOCATION 's3://com.connector/'
TBLPROPERTIES ('has_encrypted_data'='false');
I think the problem comes from the fact that "input.regex" is not the right property for this: it is applied to each row's contents, not to the filename.
Thank you for your help,
Max
There is no direct way of doing this. Either rename the unwanted files so they start with an underscore (_), which Athena ignores, or use CTAS and pass it the SELECT query above.
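A minimal CTAS sketch (the target table name and output location are hypothetical; the filter reuses the filename match from the question):
CREATE TABLE activity_runs
WITH (format = 'PARQUET',
      external_location = 's3://com.connector/activity_runs/')
AS SELECT *
FROM runs
WHERE regexp_extract("$path", '[^/]+$') LIKE 'Activity__%';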

Ingesting from Existing Table string field in Serde

I'm looking to parse a JSON string in Hive using a SerDe, but I don't see an easy way of doing so for a string that is already in a Hive table. Do you know how I can do this?
To make my scenario more understandable, here is a butchered example of what I might try:
ADD JAR hdfs:////user/d/libs/json-serde-1.3.8-jar-with-dependencies.jar;
CREATE TEMPORARY TABLE TN (v string);
Insert overwrite table TN select '
[
{"t1":31646203,"t2":"h","s1":
[
{"r1":"w","r2":"w2"}
]
}
]' as v;
CREATE TABLE deserializeThis (jsonDeserialized array<struct<t1:int,t2:string,s1:array<struct<r1:string, r2:string>>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
Insert overwrite table deserializeThis select v FROM TN;
Thanks for all your help!
A SerDe operates at the file-system level, so one way to apply it here is to push the string data out to a storage location and then read from that location through a SerDe-backed table.
This answer describes the approach:
How can I parse a Json column of a Hive table using a Json serde?
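A minimal sketch of that approach, continuing the example above (the staging path and table names are hypothetical):
-- Write the raw JSON strings to a dedicated location as plain text.
CREATE TABLE json_staging (v string)
LOCATION '/user/d/json_staging';
INSERT OVERWRITE TABLE json_staging SELECT v FROM TN;
-- Read the same location back through the JSON SerDe; the schema mirrors the question's example.
CREATE EXTERNAL TABLE deserializeThis (jsonDeserialized array<struct<t1:int,t2:string,s1:array<struct<r1:string,r2:string>>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/d/json_staging';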

Create a table in Hive and populate it with data

While trying to load data into a Hive table I encountered behavior that looks strange to me. My data is made up of JSON objects loaded as records into a table called twitter_test containing a single column named "json".
Now I want to extract three fields from each JSON object and build a new table called "my_twitter". I thus issue the command:
CREATE TABLE my_twitter AS
SELECT
  regexp_replace(get_json_object(t.json, '$.body\[0]'), '\n', '') as text,
  get_json_object(t.json, '$.publishingdate\[0]') as created_at,
  get_json_object(t.json, '$.author_screen_name\[0]') as author
FROM twitter_test AS t;
The result is a table with three columns that contains no data. However, if I run the SELECT command alone, it returns data as expected.
By trial and error I found out that I need to add LIMIT x at the end of the query for data to be inserted into the new table. The question is: why?
Furthermore, it seems strange that I need to know in advance the number x of rows returned by the SELECT statement for the CREATE to work correctly. Is there any workaround?
You could create a table over this JSON data using the JSON SerDe, which would parse the JSON objects and let you select each individual column easily.
Find below a sample Hive DDL for creating a JSON table using the JSON SerDe:
CREATE EXTERNAL TABLE `json_table`(
  A string,
  B string
)
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'PATH';
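With the table in place, each parsed field can be queried directly, for example (column names follow the sample DDL above):
SELECT A, B FROM json_table LIMIT 10;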