Hive table deletion and query processing

As per my understanding of Hive concepts, when we load a dataset into a Hive table, the data file is moved from its source path into the Hive warehouse within HDFS, and HDFS is set to keep three replicas of the data.
These questions might look silly, but as I am a beginner I want to clear my doubts.
My questions are:
1) If I delete the Hive table, will it delete the data file from the Hive warehouse only, or the other two replicas in HDFS as well?
2) When we run a query on a Hive table, is that query processed in a distributed manner?
Say one data file is 1 GB in size (that is, 8 blocks x 128 MB); with a replication factor of three, there would be 24 blocks in total for this file.
Will our Hive query be distributed among all the data blocks, or will it be processed on the Hive warehouse blocks only?
Thanks in advance.

If you do "load data inpath" from an HDFS path, the data is moved from the source to the destination HDFS path.
If you do "load data local inpath", the data is not moved; it is copied from the local filesystem into HDFS.
For your questions:
1) If you delete a file in HDFS, all of its replicas are deleted. Dropping a managed Hive table deletes its warehouse directory, and HDFS then removes every replica of those blocks.
2) If you have a 1 GB file (8 blocks) with a replication factor of three, when you trigger the query in the Hive CLI it is converted to a MapReduce job. That job processes only the 8 distinct blocks, not all 24 replicas. If a datanode involved in the job fails, the job reads another replica of the affected block on a different node and continues (and slow tasks may additionally be re-run elsewhere via speculative execution).
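The two load variants above can be sketched side by side (the table name and file paths below are hypothetical, chosen only for illustration):

```sql
-- Hypothetical table; assumes the files already exist at the paths shown.
CREATE TABLE web_logs (line STRING) STORED AS TEXTFILE;

-- Moves the file from its HDFS source path into the table's warehouse directory;
-- the file disappears from /data/incoming/.
LOAD DATA INPATH '/data/incoming/logs.tsv' INTO TABLE web_logs;

-- Copies the file from the local filesystem into HDFS; the local copy is untouched.
LOAD DATA LOCAL INPATH '/home/user/logs.tsv' INTO TABLE web_logs;
```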

Related

How to optimize spark to write data to s3 by joining 2 tables of 20TB each

We are joining two tables, each 20 TB, in a Spark job.
We are running it on an r5.4xlarge EMR cluster.
How can we optimize it? Can anyone share a few parameters to fine-tune the job and help it run quicker?
More details:
Both input tables are timestamp-partitioned.
The files in each partition are around 128 MB each.
The input and output file format is Parquet.
The output is written to an external Glue table and stored on S3.
Thanks in advance.
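With two 20 TB inputs, a broadcast join is off the table, so this will be a shuffle-heavy sort-merge join. A few commonly tuned Spark properties can serve as a starting point; the values below are illustrative assumptions for a cluster of this class, not recommendations, and should be validated against the Spark UI:

```
# Sketch of a tuning baseline -- all values are assumptions to adjust per run.
spark.sql.adaptive.enabled=true               # let AQE right-size shuffle partitions
spark.sql.adaptive.skewJoin.enabled=true      # split skewed join partitions automatically
spark.sql.shuffle.partitions=4000             # raise from the 200 default for TB-scale shuffles
spark.sql.files.maxPartitionBytes=268435456   # 256 MB input splits to cut task count
spark.executor.cores=4
spark.executor.memory=24g
```

Also make sure both reads are pruned to the timestamp partitions you actually need before the join; scanning the full 20 TB on each side dominates everything else.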

PutHiveQL NiFi Processor extremely slow - misconfiguration?

I am currently setting up a simple NiFi flow that reads from an RDBMS source and writes to a Hive sink. The flow works as expected until the PutHiveQL processor, which runs extremely slowly: it inserts roughly one record per minute.
It is currently set up as a standalone instance running on one node.
The logs show an insert roughly every minute:
(INSERT INTO customer (id, name, address) VALUES (x, x, x))
Any ideas about why this may be? Improvements to try?
Thanks in advance
Inserting one record at a time into Hive results in extreme slowness.
Since you are doing regular row-by-row inserts into the Hive table, change your flow:
QueryDatabaseTable
PutHDFS
Then create a Hive Avro table on top of the HDFS directory where you stored the data.
(or)
QueryDatabaseTable
ConvertAvroToORC //in case you need to store the data in ORC format
PutHDFS
Then create a Hive ORC table on top of the HDFS directory where you stored the data.
Are you pushing one record at a time? If so, use the MergeRecord processor to create batches before pushing into PutHiveQL.
It is recommended to batch around 100 records:
See here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hive-nar/1.5.0/org.apache.nifi.processors.hive.PutHiveQL/
Batch Size | 100 | The preferred number of FlowFiles to put to the database in a single transaction
Use the MergeRecord processor and set the number of records and/or a timeout; it should speed things up considerably.
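Once PutHDFS has landed the files, the external table from the suggested flow can be created roughly like this (the HDFS location is an assumption; the columns match the INSERT statement in the question):

```sql
-- External ORC table over the directory written by PutHDFS (path is hypothetical).
CREATE EXTERNAL TABLE customer (
  id INT,
  name STRING,
  address STRING
)
STORED AS ORC
LOCATION '/data/nifi/customer';
```

Because the table is external, NiFi keeps appending files to the directory and Hive picks them up on the next query, with no per-row INSERT at all.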

what is the impact of small hive tables?

In one of my use cases, several tables were created from a bunch of CSV files, each about 50-80 MB. Each table is configured with 2 buckets and stored in ORC format. However, when I look at the Hive warehouse directory in HDFS, each table is only about 4-5 MB. I have already brought the Hive block size down from its default to 64 MB. My concern is that small files in HDFS put pressure on the NameNode. Is it similarly an issue with small Hive tables? Can I bring the Hive block size down further?
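On the small-file concern: if the tables do accumulate many files over time, Hive can merge small ORC files in place rather than relying on a smaller block size. A minimal sketch (the table and partition names are hypothetical):

```sql
-- Merges the table's small ORC files into fewer, larger ones.
ALTER TABLE my_orc_table CONCATENATE;

-- For a partitioned table, concatenate one partition at a time.
ALTER TABLE my_orc_table PARTITION (dt='2019-01-01') CONCATENATE;
```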

Does Hive duplicate data?

I have a large log file which I loaded into HDFS. HDFS replicates it to different nodes based on rack awareness.
Now I load the same file into a Hive table. The commands are as below:
create table log_analysis (logtext string) STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/';
LOAD DATA INPATH '/user/log/apache.log' OVERWRITE INTO TABLE log_analysis;
Now when I look at the '/user/hive/warehouse/' directory, there is a table file, and when I copy it to local it contains all the log file data.
My question is: the existing file in HDFS is replicated, and loading that file into a Hive table, which is also stored on HDFS, gets it replicated too.
Isn't that the same file stored six times (assuming a replication factor of 3)? That would be such a waste of resources.
Correct: when you load the data from HDFS, the data moves from HDFS to /user/hive/warehouse/yourdatabasename/tablename.
Your question indicates that you created an INTERNAL (managed) table in Hive and are loading data into it from an HDFS location.
When you load data into an internal table using the LOAD DATA INPATH command, the data is moved from its original location to a new one, in your case /user/hive/warehouse/log_analysis. It is effectively given a new HDFS address, and you won't see anything at the previous location.
When data is moved from one HDFS location to another, the NameNode records the new location and deletes the old metadata for that data. Hence there is no duplicate copy: there are still only three replicas, and the data is stored only three times.
I hope that clears it up.
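If you would rather Hive not move the file at all, an external table is the usual alternative. A sketch reusing the question's schema (the table name and location are assumptions):

```sql
-- The log file stays at its original HDFS location; Hive only records the mapping.
-- DROP TABLE removes the metadata but leaves the file in HDFS.
CREATE EXTERNAL TABLE log_analysis_ext (logtext STRING)
STORED AS TEXTFILE
LOCATION '/user/log/';
```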

What will the dataset size be in Hive?

I have 1 TB of data in HDFS in .csv format. When I load it into my Hive table, what will the total size of the data be? I mean, will there be two copies of the same data, i.e. one copy in HDFS and another in the Hive table? Please clarify. Thanks in advance.
If you create a Hive external table, you provide an HDFS location for the table, and the data is stored at that particular location.
When you create a Hive internal table, Hive creates a directory under /apps/hive/warehouse/.
Say your table name is table1; then its directory will be /apps/hive/warehouse/table1.
This directory is also an HDFS directory, and when you load data into an internal table it goes into that directory.
Hive keeps a mapping between each table and its corresponding HDFS location, and when you read the data it reads from the mapped directory.
Hence there won't be a duplicate copy of the data for the table and its HDFS location.
But if the data replication factor in your Hadoop cluster is set to 3 (the default), your 1 TB of data will take 3 TB of cluster disk space; that is HDFS replication, not anything specific to your Hive table.
Please see the link below to learn more about data replication.
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication
It depends on whether you are creating an internal or an external table in Hive.
If you create an external table, Hive only creates a mapping to where your data is stored in HDFS, and there is no duplication at all; Hive picks up the data wherever it is stored in HDFS.
Read more about external tables here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables
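The managed/external distinction can be seen side by side (table names and the external location below are hypothetical):

```sql
-- Internal (managed): Hive owns /apps/hive/warehouse/table1;
-- DROP TABLE deletes the data files as well.
CREATE TABLE table1 (col1 STRING);

-- External: Hive only records the location;
-- DROP TABLE removes the metadata and leaves the files in place.
CREATE EXTERNAL TABLE table1_ext (col1 STRING)
LOCATION '/data/table1';

-- Shows the mapped HDFS location for either table.
DESCRIBE FORMATTED table1;
```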