Understanding Partitioning in Hive - hive

I am trying to learn Hive, and while reading Hadoop: The Definitive Guide I got confused about a few things.
As per the text, Hive implements partitioning by creating one sub-directory per distinct value of the partitioning column. But loading data in Hive simply means copying files, and no validation is done at load time, only at query time. So does Hive inspect the data in order to partition it? Or how does it determine which file should go to which directory?

Or how does it determine which file should go to which directory?
It doesn't. You have to set the destination partition yourself in the LOAD DATA command. When you LOAD into a partitioned table, you must name the specific partition (the directory) that will receive the data by means of the PARTITION clause. According to the documentation:
The target being loaded to can be a table or a partition. If the table
is partitioned, then one must specify a specific partition of the
table by specifying values for all of the partitioning columns.
For instance, in this example:
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
The two files will be stored in the invites/ds=2008-08-15 and invites/ds=2008-08-08 folders.
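To make the directory mapping concrete, here is a minimal HiveQL sketch of the table those LOAD statements target. The column list (foo, bar) is an assumption made for illustration; only the table name and the ds partition column come from the example above.

```sql
-- Partitioned table; the regular columns (foo, bar) are assumed for illustration.
CREATE TABLE invites (foo INT, bar STRING)
PARTITIONED BY (ds STRING);

-- After the two LOAD DATA statements above, each file sits in its own
-- partition directory and the metastore lists both partitions.
SHOW PARTITIONS invites;

-- The value of ds is not read from the file contents; it comes from the
-- partition the file was loaded into, so this filter only needs to read
-- the invites/ds=2008-08-15 directory.
SELECT foo, bar FROM invites WHERE ds = '2008-08-15';
```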

Related

Can hive metastore virtually partition data based on column value without physically changing the directory structure?

As an example, consider that I have data on all the major sports events that have taken place. The schema is given below:
EventName,Date,Month,Year,City
This data is physically laid out in HDFS by year, date and month.
Now I want to create virtual partitions on it based on some other column value, e.g. City. The data would stay physically stored in HDFS in the year, date, month structure only, but the metadata would keep track of the virtual partitions.
Can hive metastore do it for me?
I don't think that is possible. Partitioning in Hive means creating a separate directory for each partition. The metastore only holds the table's metadata; it does not control the actual data. When a query filters on a partition column, Hive reads only the directories of the matching partitions. With a "virtual" partition that leaves the HDFS layout unchanged, the rows for a given city would still be spread across every directory, so the query would have to scan the entire data set and no optimisation would be gained.
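The difference can be sketched in HiveQL as follows. This is only an illustration under stated assumptions: the table name, the column names, the comma delimiter and the /data/sports_events path are all hypothetical, the files are assumed to hold just the event name and city, and the partitions are assumed to be registered already.

```sql
-- Hypothetical external table over the existing year/month/day layout.
CREATE EXTERNAL TABLE sports_events (
  event_name STRING,
  city       STRING
)
PARTITIONED BY (event_year INT, event_month INT, event_day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sports_events';

-- Filters on partition columns let Hive skip every directory that does
-- not match (partition pruning).
SELECT event_name
FROM sports_events
WHERE event_year = 2015 AND event_month = 6;

-- city is an ordinary column, so its values can appear in any directory
-- and every partition has to be scanned.
SELECT event_name
FROM sports_events
WHERE city = 'London';
```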

loading data to hive dynamic partitioned tables

I have created a Hive table with dynamic partitioning on a column. Is there a way to load the data directly from files using the "LOAD DATA" statement? Or do we have to create a non-partitioned intermediate table, load the file data into it, and then insert from that intermediate table into the partitioned table, as mentioned in Hive loading in partitioned table?
No, the LOAD DATA command ONLY copies the files to the destination directory. It doesn't read the records of the input file, so it CANNOT do partitioning based on record values.
If your input data is already split into multiple files by partition, you could copy the files directly to the table location in HDFS, each under a partition directory that you create manually (or just point to their current location in the case of an EXTERNAL table), and then use the following ALTER command to add the partition. This way you can skip the LOAD DATA statement altogether.
ALTER TABLE <table-name>
ADD PARTITION (<...>)
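As a filled-in sketch of that command, assuming a hypothetical table page_views partitioned by a dt string column and hypothetical HDFS paths:

```sql
-- Register directories that already hold the files for each partition.
-- Table name, partition column and paths are assumptions for illustration.
ALTER TABLE page_views ADD IF NOT EXISTS
  PARTITION (dt = '2016-01-01') LOCATION '/data/page_views/2016-01-01';

ALTER TABLE page_views ADD IF NOT EXISTS
  PARTITION (dt = '2016-01-02') LOCATION '/data/page_views/2016-01-02';

-- If the directories sit under the table location and follow the
-- dt=<value> naming convention, the metastore can discover them instead.
MSCK REPAIR TABLE page_views;
```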
There is no other way: to load files directly, the partitions have to be specified manually.
For dynamic partitioning, we need a staging table and then an INSERT from it.
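The staging-table route could look roughly like this. The table names, columns, delimiter and input path are hypothetical; the two SET properties are the standard switches for dynamic partition inserts.

```sql
-- Allow Hive to derive partition values from the query result.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Non-partitioned staging table that simply receives the raw file.
CREATE TABLE sales_staging (id INT, amount DOUBLE, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/data/incoming/sales.csv' INTO TABLE sales_staging;

-- Partitioned target table.
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (country STRING);

-- The partition column goes last in the SELECT list; Hive creates one
-- partition directory per distinct country value it encounters.
INSERT OVERWRITE TABLE sales PARTITION (country)
SELECT id, amount, country
FROM sales_staging;
```

Unlike LOAD DATA, the INSERT runs a job that reads every record, which is what makes routing rows by value possible.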

What will be DataSet size in hive

I have 1 TB of data in HDFS in .csv format. When I load it into my Hive table, what will the total size of the data be? I mean, will there be 2 copies of the same data, i.e. one copy in HDFS and another in the Hive table? Please clarify. Thanks in advance.
If you create a Hive external table, you provide an HDFS location for the table and store the data in that location.
When you create a Hive internal (managed) table, Hive creates a directory for it under the warehouse directory, here /apps/hive/warehouse/ (the actual path is set by hive.metastore.warehouse.dir).
Say your table is named table1; its directory will then be /apps/hive/warehouse/table1.
This directory is also an HDFS directory, and when you load data into an internal table it goes into that directory.
Hive keeps a mapping between each table and its HDFS location, so when you read the table it reads from the mapped directory.
Hence there won't be a duplicate copy of the data: the table's data is simply the files in that HDFS location.
However, if the replication factor of your Hadoop cluster is set to 3 (the default), 1 TB of data will take 3 TB of cluster disk space. That is HDFS replication and is unrelated to the Hive table.
Please see below link to know more about Data replication.
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication
It depends whether you are creating an internal or external table in Hive.
If you create an external table in Hive, it will create a mapping to where your data is stored in HDFS and there won't be any duplication at all. Hive reads the data from wherever that location is in HDFS.
Read more about external tables here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables
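For the 1 TB of CSV already in HDFS, an external table might be declared along these lines. The table name, columns and the /data/events_csv directory are assumptions for illustration only.

```sql
-- Map a table onto the directory that already holds the CSV files.
-- Nothing is copied or moved; only metadata is written to the metastore.
CREATE EXTERNAL TABLE events_csv (
  event_name STRING,
  city       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/events_csv';

-- The Location and Table Type fields in the output show where the data
-- lives and whether the table is external or managed.
DESCRIBE FORMATTED events_csv;
```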

How to Reload Partition Data into ORC

Are there any best practices for loading data into a partitioned ORC table? Suppose I load 120 GB of data into an ORC table partitioned on 2 columns. If I want to reload the data for a particular partition, how should I go about it? How do I delete a partition: is it ALTER TABLE ... DROP PARTITION (partition value)? Even after dropping the partition, I still see the ORC partition files in the Hive warehouse folder. How do I clean up the unused partition files? And if I want to load the data of a single partition back into the dropped partition, what is the best way to do it?
Can ORC with partitions and buckets give better performance than ORC with partitions only, or ORC with no partitions?
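Dropping a table or a partition always removes the metadata, but for an external table it does not necessarily delete the data, which would explain why the ORC files are still in the warehouse folder. For a managed table, dropping a partition deletes its data as well, and TRUNCATE removes the rows of a table or partition while keeping its definition. Depending on the Hive version, TRUNCATE may be rejected for external tables, in which case the partition directory has to be removed from HDFS directly. Read more here.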
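A minimal sketch of reloading one partition, assuming a hypothetical managed ORC table sales_orc partitioned on two columns and a hypothetical source table sales_raw that holds the fresh rows:

```sql
-- Hypothetical ORC table partitioned on two columns.
CREATE TABLE sales_orc (id INT, amount DOUBLE)
PARTITIONED BY (sale_year INT, sale_month INT)
STORED AS ORC;

-- INSERT OVERWRITE with a fully specified partition replaces the contents
-- of just that partition, so no separate drop is needed for a reload.
INSERT OVERWRITE TABLE sales_orc PARTITION (sale_year = 2016, sale_month = 1)
SELECT id, amount
FROM sales_raw
WHERE sale_year = 2016 AND sale_month = 1;

-- To remove a partition outright; for a managed table this also removes
-- the partition's files.
ALTER TABLE sales_orc DROP IF EXISTS PARTITION (sale_year = 2016, sale_month = 1);
```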

Where does hive stores its table?

I am new to Hadoop and have just started working with Hive. In my understanding it provides a query language to process data in HDFS. With HiveQL we can create tables and load data into them from HDFS.
So my question is: where are those tables stored? Specifically, if we have a 100 GB file in HDFS and we want to make a Hive table out of that data, what will the size of that table be and where is it stored?
If my understanding of this concept is wrong, please correct me.
If the data is 100 GB, you should consider a Hive external table (as opposed to a "managed table"; for the difference, see this).
With an external table the data itself stays on HDFS in the path that you specify (you may specify a directory of files as long as they all have the same structure), and Hive only records a mapping to it in the metastore, whereas a managed table stores the data "in Hive", that is, under Hive's warehouse directory, which is also on HDFS.
When you drop a managed table, the underlying data is dropped as well, whereas dropping an external table only removes the metadata in the metastore that references the data.
Either way you are using only 100 GB as seen by the user, and you benefit from the robustness HDFS gets through replication of the data.
Hive will create a directory on HDFS. If you don't specify a location, it creates the table directory under /user/hive/warehouse on HDFS. After the LOAD command the files are moved to /user/hive/warehouse/tablename. You can also point the table at an existing HDFS directory, for example when the files are already partitioned, or use an external table.
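The managed-table case can be sketched as follows; the table name, its single column and the input path are hypothetical.

```sql
-- Managed table with no LOCATION clause: Hive creates its directory
-- under the configured warehouse directory.
CREATE TABLE web_logs (line STRING);

-- LOAD DATA INPATH (without LOCAL) moves the HDFS file into the table
-- directory rather than copying it, so the data still exists only once.
LOAD DATA INPATH '/data/incoming/web_logs.txt' INTO TABLE web_logs;

-- The Location field in the output shows the table's HDFS directory.
DESCRIBE FORMATTED web_logs;
```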