ORC file format - hive

I am new to Hive. Could you please answer the questions below?
Why do we need a base table while loading data into an ORC table?
Can't we directly create the table as ORC and load data into it?

1. Why do we need a base table while loading the data in ORC?
We need a base table because most of the time we receive the data as plain text files, i.e. CSV, TXT, DAT or some other delimited format that we can open and read directly. The ORC format, however, stores the data differently, using its own algorithms to optimize the rows and columns, so a raw text file cannot simply be dropped into an ORC table.
Hence the need for a base table: we create a staging table stored as TEXTFILE, select the data from it, and write it into the ORC table (see the sketch below).
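A minimal sketch of that staging pattern, assuming hypothetical table names emp_staging / emp_orc and a CSV file at /tmp/emp.csv:
-- Staging table: plain text, matching the layout of the incoming CSV
CREATE TABLE emp_staging (id INT, name STRING, salary DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- Load the raw file into the staging table (a file move, no conversion happens here)
LOAD DATA LOCAL INPATH '/tmp/emp.csv' INTO TABLE emp_staging;
-- Final ORC table
CREATE TABLE emp_orc (id INT, name STRING, salary DOUBLE)
STORED AS ORC;
-- The conversion to ORC happens during this insert
INSERT INTO TABLE emp_orc SELECT * FROM emp_staging;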
2. Can't we directly create table as ORC and load data in it?
Yes, you can create the table as ORC and load data into it directly, provided the data you load is already in ORC format; plain text files still need the staging step above.
To understand more about ORC, you can refer to https://orc.apache.org/docs/
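For example (hypothetical table names and path, continuing the sketch above):
-- Works only because the file being loaded is itself an ORC file
LOAD DATA INPATH '/data/exports/emp_000.orc' INTO TABLE emp_orc;
-- Or, if the source data is already queryable in Hive, create and fill the ORC table in one step
CREATE TABLE emp_orc_ctas STORED AS ORC AS SELECT * FROM emp_staging;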

Usually, if you don't define a file format, Hive defaults to TEXTFILE.
The need for a base table arises because when you create a Hive table in ORC format and then try to load data using the command:
LOAD DATA INPATH '' ..
it simply moves the data from one location to another, with no conversion.
A Hive ORC table won't understand a text file. That's where the SerDe comes into the picture: you define the SerDe while creating the table.
So for an operation like:
1. SELECT * (read)
2. INSERT INTO (write)
the SerDe serializes and deserializes between the on-disk format (here ORC) and the rows mapped to the Hive columns; a short sketch of the two paths follows.
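A minimal sketch of those two paths, reusing the hypothetical emp_staging / emp_orc tables from above:
-- Write path: the ORC SerDe serializes the selected rows into ORC files on HDFS
INSERT INTO TABLE emp_orc SELECT * FROM emp_staging;
-- Read path: the ORC SerDe deserializes the ORC files back into rows mapped to the table's columns
SELECT * FROM emp_orc;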

Related

impala/hive show file format

How can I have impala or hive return the file format of the underlying files on HDFS for a table?
I tried:
SHOW FILES IN database.table_name
This lists the files, but the problem is that some people stored Parquet files as .parq and others as .parquet. Is there any way to return the file format, such that one could use it in a new CREATE statement?
Use good old SHOW CREATE TABLE mytable.
Check the output: it clearly mentions the file format. It also shows the folder inside which the files are stored. You should not try to use the file names; let Impala decide the names. Below is a sample result from Impala.
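For instance, running the statement against the sample table shown in the result below:
SHOW CREATE TABLE edh.mytable;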
result
CREATE TABLE edh.mytable (
column1 STRING
)
STORED AS PARQUET --file format
LOCATION 's3a://cc-mys3/edh/user/hive/warehouse/edh.db/mytable' --folder location

Hive ORC File Format

When we create an ORC table in Hive, we can see that the data is compressed and not exactly readable in HDFS. So how is Hive able to convert that compressed data into the readable format that is shown to us when we fire a simple SELECT * query on that table?
Thanks for suggestions!!
By using the ORC SerDe while creating the table; you have to provide the package name of the SerDe class in the ROW FORMAT SERDE '' clause.
What the SerDe does is deserialize data in a particular on-disk format into objects that Hive can process, and serialize those objects again to store them back in HDFS.
Hive uses a "SerDe" (Serializer/Deserializer) to do that. When you create a table you mention the file format, e.g. in your case it's ORC ("STORED AS ORC"), right? Hive uses the ORC library (jar file) internally to convert the data into a readable format. To know more about Hive internals, search for "Hive SerDe" and you will see how the data is converted to objects and vice versa.
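For illustration, STORED AS ORC is shorthand for spelling out the SerDe and input/output format classes yourself; a sketch with a hypothetical table name:
CREATE TABLE logs_orc (msg STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
-- The OrcSerde handles the row/object conversion; the input/output format classes read and write the ORC files.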

Automatic updating of an ORC table

I want to do the following thing in Hive: I create an external table stored as a TextFile, and I convert this table into an ORC table (in the usual way: first create an empty ORC table, then load the data from the original one).
For my TextFile table, my data is located in HDFS in a directory, say /user/MY_DATA/.
So when I add/drop files from MY_DATA, my TextFile table is automatically updated. Now I would like the ORC table to be automatically updated too. Do you know if this is possible?
Thank you!
No, there is no straightforward way to do this. You need to add the new data to the ORC table as you did for the first load, or you can create a new ORC table and drop the old one:
CREATE TABLE orc_emp STORED AS ORC AS SELECT * FROM employees.emp;
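If a periodic rebuild is acceptable, one option (a sketch, reusing the table names above and assuming the external text table is employees.emp) is to schedule a simple INSERT OVERWRITE so the ORC copy is re-materialized from whatever files currently sit in the source directory:
INSERT OVERWRITE TABLE orc_emp SELECT * FROM employees.emp;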

Acid Properties in Hive

Just wanted to know if it is possible to run ACID transactions on a table that is stored in TEXTFILE format in Hive. I know we can keep the table in text file format, create a new table in ORC format, and insert data into it from the textfile table. Is there any other approach that reduces this overhead?
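For reference, the workaround described in the question looks roughly like the sketch below; Hive's ACID transactions require a managed ORC table with transactional=true (table names are hypothetical, and older Hive versions additionally require the table to be bucketed with CLUSTERED BY):
-- Text staging table; it cannot itself be transactional
CREATE TABLE tx_data_text (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- Transactional target table must be ORC
CREATE TABLE tx_data_orc (id INT, val STRING)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
-- Copy the data across; subsequent UPDATE/DELETE statements run against tx_data_orc
INSERT INTO TABLE tx_data_orc SELECT * FROM tx_data_text;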

Load local csv file to hive parquet table directly, without resorting to a temp textfile table

I am now preparing to store data from .csv files into Hive. Of course, because of the good performance of the Parquet file format, the Hive table should be in Parquet format. So the normal way is to create a temp table whose format is textfile, load the local CSV file data into this temp table, and finally create a Parquet table with the same structure and run insert into parquet_table select * from textfile_table;.
But I don't think this temp textfile table is necessary. So my question is: is there a way for me to load these local .csv files into a Hive Parquet-format table directly, i.e. without resorting to a temp table? Or an easier way to accomplish this task?
As stated in the Hive documentation:
NO verification of data against the schema is performed by the load command.
If the file is in hdfs, it is moved into the Hive-controlled file system namespace.
You could skip a step by using CREATE TABLE AS SELECT for the parquet table.
So you'll have 3 steps (a full sketch follows the list):
1. Create the text table defining the schema
2. Load data into the text table (this moves the file into the new table's directory)
3. CREATE TABLE parquet_table STORED AS PARQUET AS SELECT * FROM textfile_table; (supported from Hive 0.13)
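Put together, a minimal sketch of the three steps (the file path, column names, and delimiter are assumptions):
-- 1. Text staging table matching the CSV layout
CREATE TABLE textfile_table (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- 2. Move the local CSV into the staging table's directory (no conversion, no schema validation)
LOAD DATA LOCAL INPATH '/path/to/data.csv' INTO TABLE textfile_table;
-- 3. Create and populate the Parquet table in one statement (CTAS, Hive 0.13+)
CREATE TABLE parquet_table STORED AS PARQUET AS SELECT * FROM textfile_table;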