Is defining a delimiter in a Hive ORC table useless?

When you create an ORC table in Hive, the data files are stored in ORC format. This means you can't read an individual file outside of the ORC table.
Here's an example ORC create table statement:
CREATE TABLE IF NOT EXISTS table_orc_v1
(
col1 int,
col2 int
)
PARTITIONED BY (odate date)
CLUSTERED BY (col1) INTO 10 BUCKETS
STORED AS ORC TBLPROPERTIES('transactional'='true');
If I add a delimiter clause to this table, as you would for a CSV-style (non-ORC) table:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
will it
1) not affect table performance
2) slow down performance, because it converts the data to a CSV file that you can never read
3) give me some benefit that I'm not aware of
4) do something else?

If you use a binary format (ORC, Avro, Parquet) to store your data, ROW FORMAT DELIMITED FIELDS TERMINATED BY is simply ignored. You can include it in your table definition, and it may not give you an error, but the delimiters are not used.
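As a rough illustration of the point above, a small check can flag DDL that pairs a delimiter clause with one of the binary formats named in the answer. This is only a sketch: `has_unused_delimiter_clause` is a hypothetical helper, not part of Hive, and it relies on simple text matching rather than a real SQL parser.

```python
import re

# Formats named in the answer above; their files are binary, so text
# delimiters have nothing to act on.
BINARY_FORMATS = ("ORC", "AVRO", "PARQUET")


def has_unused_delimiter_clause(ddl: str) -> bool:
    """Return True if the DDL combines ROW FORMAT DELIMITED with a binary STORED AS format."""
    # Collapse whitespace and upper-case so multi-line DDL matches reliably.
    normalized = " ".join(ddl.upper().split())
    delimited = "ROW FORMAT DELIMITED" in normalized
    stored_as = re.search(r"STORED AS (\w+)", normalized)
    return delimited and stored_as is not None and stored_as.group(1) in BINARY_FORMATS


ddl = """
CREATE TABLE t (col1 int, col2 int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
"""
print(has_unused_delimiter_clause(ddl))
```

A table declared with STORED AS TEXTFILE, or an ORC table without the delimiter clause, is not flagged.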

Related

How to create an Impala table with a complex data type, and how to specify a delimiter for an array column

I am trying to create an Impala table with an array column, and I have to use a custom delimiter for that column.
I tried the query below, but it throws an error.
Create table array_demo( arra_col ARRAY<string>) row format delimited fields terminated by ','
collection items terminated by '|' stored as parquet
You should omit the ROW FORMAT clause and the subclauses specifying the terminators, and include a STORED AS clause (Parquet is the only format Impala supports with complex data).
The data files to load the table have to be in parquet format too.
If you don't have the data file in Parquet format, you can create the table in Hive,
then create a copy using CREATE TABLE … AS SELECT (CTAS statement), with STORED AS PARQUET.
You then can query the table in Impala.
As an example
-- Create table in Hive
CREATE TABLE array_demo( arra_col ARRAY<STRING>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
STORED AS TEXTFILE;
-- Copy the table as parquet format
CREATE TABLE array_demo_impala
STORED AS PARQUET
AS
SELECT *
FROM array_demo;
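To make the role of the two terminators in the Hive text table concrete, here is a simplified sketch of how one line of a delimited text file breaks down. It assumes every field is an array and ignores escaping and nested collections, so it is an illustration of the idea rather than how Hive's SerDe is implemented; `parse_row` is a hypothetical name.

```python
def parse_row(line: str, field_sep: str = ",", item_sep: str = "|") -> list:
    """Split one text row into fields, then split each field into array items."""
    fields = line.rstrip("\n").split(field_sep)
    return [field.split(item_sep) for field in fields]


# One row of a single-column table whose column is ARRAY<STRING>.
print(parse_row("a|b|c\n"))
```

Once the data is copied into Parquet, the array structure is stored in the file itself, which is why the terminators only matter for the text table.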

data appears as null on redshift external table while working right on athena

I'm trying to run the following simple query on Redshift Spectrum:
select * from company.vehicles where vehicle_id is not null
It returns 0 rows (every value in the table appears as null). However, when I run the same query on Athena, it works fine and returns results. I tried msck repair, but both Athena and Redshift use the same metastore, so it shouldn't matter.
I also don't see any errors.
The files are in ORC format.
The create table query is:
CREATE EXTERNAL TABLE `vehicles`(
`vehicle_id` bigint,
`parent_id` bigint,
`client_id` bigint,
`assets_group` int,
`drivers_group` int)
PARTITIONED BY (
`dt` string,
`datacenter` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://company-rt-data/metadata/out/vehicles/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'classification'='orc',
'compressionType'='none')
Any idea?
How did you create your external table?
For Spectrum, you have to explicitly specify which values should be treated as null.
Add the parameter 'serialization.null.format'='' to TABLE PROPERTIES of your external table in Spectrum so that every column value of '' is treated as NULL:
CREATE EXTERNAL TABLE external_schema.your_table_name(
)
row format delimited
fields terminated by ','
stored as textfile
LOCATION [filelocation]
TABLE PROPERTIES('numRows'='100', 'skip.header.line.count'='1','serialization.null.format'='');
Alternatively, you can set SERDEPROPERTIES when creating the external table so that NULL values are recognized automatically.
Eventually it turned out to be a bug in Redshift. To fix it, we needed to run the following command:
ALTER TABLE table_name SET TABLE PROPERTIES ('orc.schema.resolution'='position');
I had a similar problem and found this solution.
In my case I had external tables that were created with Athena and pointed to an S3 bucket containing heavily nested JSON data. To access them with Redshift, I set json_serialization_enable to true before my queries to make the nested JSON columns queryable. This led to some columns being NULL when the JSON exceeded a size limit, as the documentation states:
If the serialization overflows the maximum VARCHAR size of 65535, the cell is set to NULL.
To solve this issue I used Amazon Redshift Spectrum instead of serialization: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data.html.

Parquet Files Generation with hive

I'm trying to generate some Parquet files with Hive. To accomplish this, I loaded a regular Hive table from some .tbl files with this command in Hive:
CREATE TABLE REGION (
R_REGIONKEY BIGINT,
R_NAME STRING,
R_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
location '/tmp/tpch-generate';
After this I just execute these 2 lines:
create table parquet_region LIKE region STORED AS PARQUET;
insert into parquet_region select * from region;
But when I check the output generated in HDFS, I don't find any .parquet file. Instead I find files with names like 0000_0 to 0000_21, and the sum of their sizes is much bigger than the original .tbl file.
What am I doing wrong?
The INSERT statement doesn't create files with an extension, but these are the Parquet files.
You can use DESCRIBE FORMATTED <table> to show the table information, including its storage format.
hive> DESCRIBE FORMATTED <table_name>
Additional note: you can also create a new table from the source table using the query below:
CREATE TABLE new_test STORED AS PARQUET AS SELECT * FROM source_table;
It creates the new table in Parquet format and copies the structure as well as the data.
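If you want to confirm that the extension-less files really are Parquet, one option is to copy a file locally (for example with hdfs dfs -get) and check its magic bytes: a Parquet file starts and ends with the 4 bytes PAR1. The sketch below does that check; `looks_like_parquet` and the file name 000000_0 are hypothetical, and the check does not validate the file's contents.

```python
import os

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path: str) -> bool:
    """Return True if the file starts and ends with the Parquet magic bytes."""
    size = os.path.getsize(path)
    if size < 2 * len(PARQUET_MAGIC):
        return False  # too small to hold both markers
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


if __name__ == "__main__":
    local_copy = "000000_0"  # hypothetical local copy of one output file
    if os.path.exists(local_copy):
        print(looks_like_parquet(local_copy))
```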

Creation of a partitioned external table with hive: no data available

I have the following file on HDFS:
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive but I cannot see any data:
Since it's an external table, data insertion should be done automatically, right?
Your file should have its columns in this order:
int,string
but the contents of your file are in this order:
string,int
Change your file to the following:
86,"2016-08-20"
78,"2016-08-21"
It should then work.
Also, it is not recommended to use keywords (such as date) as column names.
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder with a csv file corresponding to the structure of the table, data will be automatically loaded and you can already use select queries to see the data.
Solved!
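The accepted fix works because a partition added without an explicit LOCATION is expected in a subfolder of the table location named column=value. A minimal sketch of that naming rule, using a hypothetical helper `partition_dir` and assuming a single partition column:

```python
def partition_dir(table_location: str, column: str, value: str) -> str:
    """Build the default directory for one partition under the table location."""
    return f"{table_location.rstrip('/')}/{column}={value}"


print(partition_dir("/flumania/google_analytics/", "date_string", "2016-09-06"))
```

For the table above this prints /flumania/google_analytics/date_string=2016-09-06, which is the folder the csv file has to be in.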

automatically partition Hive tables based on S3 directory names

I have data stored in S3 like:
/bucket/date=20140701/file1
/bucket/date=20140701/file2
...
/bucket/date=20140701/fileN
/bucket/date=20140702/file1
/bucket/date=20140702/file2
...
/bucket/date=20140702/fileN
...
My understanding is that if I pull in that data via Hive, it will automatically interpret date as a partition. My table creation looks like:
CREATE EXTERNAL TABLE search_input(
col 1 STRING,
col 2 STRING,
...
)
PARTITIONED BY(date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/';
However Hive doesn't recognize any data. Any queries I run return with 0 results. If I instead just grab one of the dates via:
CREATE EXTERNAL TABLE search_input_20140701(
col 1 STRING,
col 2 STRING,
...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/date=20140701';
I can query data just fine.
Why doesn't Hive recognize the nested directories with the "date=date_str" partition?
Is there a better way to have Hive run a query over multiple sub-directories and slice it based on a datetime string?
In order to get this to work I had to do 2 things:
Enable recursive directory support:
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
For some reason it would still not recognize my partitions so I had to recover them via:
ALTER TABLE search_input RECOVER PARTITIONS;
You can use:
SHOW PARTITIONS table;
to check and see that they've been recovered.
I faced the same issue and realized that Hive did not have the partition metadata. So we need to add that metadata using an ALTER TABLE ADD PARTITION query. This becomes tedious if you have a few hundred partitions, because you have to write the same query again and again with different values.
ALTER TABLE <table name> ADD PARTITION(<partitioned column name>=<partition value>);
Once you have run the above query for all available partitions, you should see results in your Hive queries.
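Writing a few hundred of these statements by hand is error-prone, so one option is to generate them from the directory listing. The sketch below assumes you already have the object keys as a list of strings (for example from an S3 listing) and that the directories follow the column=value layout shown in the question. `partition_statements` is a hypothetical helper, and the backticks around the column name are there because date is a keyword.

```python
import re


def partition_statements(keys, table, column):
    """Build one ADD PARTITION statement per distinct column=value directory in keys."""
    pattern = re.compile(rf"(?:^|/){re.escape(column)}=([^/]+)/")
    values = set()
    for key in keys:
        match = pattern.search(key)
        if match:
            values.add(match.group(1))
    return [
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION (`{column}`='{value}');"
        for value in sorted(values)
    ]


keys = [
    "/bucket/date=20140701/file1",
    "/bucket/date=20140701/file2",
    "/bucket/date=20140702/file1",
]
for statement in partition_statements(keys, "search_input", "date"):
    print(statement)
```

The output can be saved to a file and run with the Hive CLI. On stock Apache Hive, MSCK REPAIR TABLE search_input; serves a similar purpose by scanning the table location for directories in this layout and registering them.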