Create Table in Hive with one file - hive

I'm creating a new table in Hive using:
CREATE TABLE new_table AS select * from old_table;
My problem is that after the table is created, it generates multiple files for each partition, while I want only one file for each partition.
How can I define it in the table?
Thank you!

There are many possible solutions:
1) Add DISTRIBUTE BY on the partition key at the end of your query. Without it, each reducer may receive rows for many partitions and will create a file in every one of them. Distributing by the partition key sends all rows of a partition to the same reducer, which may reduce the number of files and memory consumption as well. The hive.exec.reducers.bytes.per.reducer setting defines how much data each reducer will process.
2) Simple, and quite good if there is not too much data: add ORDER BY to force a single reducer. Alternatively, increase hive.exec.reducers.bytes.per.reducer, for example set hive.exec.reducers.bytes.per.reducer=500000000; (about 500 MB per reducer), so that fewer reducers are started. The single-reducer approach is only suitable for small data; it will run slowly if there is a lot of data.
If your task is map-only, then consider options 3-5 instead:
3) If running on MapReduce, switch on merging:
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=500000000; --Size of merged files at the end of the job
set hive.merge.smallfiles.avgsize=500000000; --When the average output file size of a job is less than this number,
--Hive will start an additional map-reduce job to merge the output files into bigger files
4) When running on Tez:
set hive.merge.tezfiles=true;
set hive.merge.size.per.task=500000000;
set hive.merge.smallfiles.avgsize=500000000;
5) For ORC files, you can merge files efficiently using this command:
ALTER TABLE T [PARTITION partition_spec] CONCATENATE; -- for ORC
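To illustrate option 1, here is a minimal sketch. It assumes new_table already exists as a table partitioned by a column I'll call part_col; col1, col2, and part_col are made-up names, not taken from the question.

```sql
-- Allow partitions to be created dynamically from the query result.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The dynamic partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE new_table PARTITION (part_col)
SELECT col1, col2, part_col
FROM old_table
DISTRIBUTE BY part_col;  -- all rows of one partition go to the same reducer
```

With this, each partition is normally written by a single reducer. Keep in mind that a very large partition will make that one reducer slow.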

PutHiveQL NiFi Processor extremely slow - misconfiguration?

I am currently setting up a simple NiFi flow that reads from an RDBMS source and writes to a Hive sink. The flow works as expected until the PutHiveQL processor, which is running extremely slowly. It inserts approximately one record every minute.
It is currently set up as a standalone instance running on one node.
The logs show an insert approximately every minute:
(INSERT INTO customer (id, name, address) VALUES (x, x, x))
Any ideas about why this may be? Improvements to try?
Thanks in advance
Inserting one record at a time into Hive will be extremely slow.
Since you are doing regular inserts into a Hive table, change your flow to:
QueryDatabaseTable
PutHDFS
Then create a Hive Avro table on top of the HDFS directory where you have stored the data.
(or)
QueryDatabaseTable
ConvertAvroToORC //in case you need to store the data in ORC format
PutHDFS
Then create a Hive ORC table on top of the HDFS directory where you have stored the data.
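For the last step, a minimal sketch of the table definition might look like this. The column types and the path /data/customer_orc are assumptions on my part; use whatever directory PutHDFS actually writes to, and make sure the columns match the schema of the files.

```sql
-- External table: Hive reads the files in place and does not move them.
CREATE EXTERNAL TABLE customer_orc (
  id INT,
  name STRING,
  address STRING
)
STORED AS ORC
LOCATION '/data/customer_orc';  -- hypothetical HDFS directory written by PutHDFS
```

For the Avro variant of the flow, STORED AS AVRO would be used instead (available since Hive 0.14).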
Are you pushing one record at a time? If so, you can use the MergeRecord processor to create batches before pushing them to PutHiveQL.
The recommended batch size is 100 records.
See here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hive-nar/1.5.0/org.apache.nifi.processors.hive.PutHiveQL/
Batch Size | 100 | The preferred number of FlowFiles to put to the database in a single transaction
Use the MergeRecord processor and set the number of records and/or a timeout; it should speed things up considerably.

Is there a way to merge ORC files in HDFS without using ALTER TABLE CONCATENATE command?

This is my first week with Hive and HDFS, so please bear with me.
Almost all the ways I have seen so far to merge multiple ORC files suggest using ALTER TABLE with the CONCATENATE command.
But I need to merge multiple ORC files of the same table without having to ALTER the table. Another option is to create a copy of the existing table and then use ALTER TABLE on that, so that my original table remains unchanged. But I can't do that either, for space and data redundancy reasons.
What I'm trying to achieve (ideally) is this: I need to transport these ORCs as one file per table into a cloud environment. So, is there a way to merge the ORCs on the go during the transfer to the cloud? Can this be achieved with or without Hive, maybe directly in HDFS?
Two possible methods other than ALTER TABLE CONCATENATE:
Try to configure a merge task; see details here: https://stackoverflow.com/a/45266244/2700344
Alternatively, you can force a single reducer. This method is suitable for files that are not too big. You can overwrite the same table with ORDER BY, which forces a single reducer on the last ORDER BY stage. This will run slowly or even fail with big files, because all the data passes through a single reducer:
INSERT OVERWRITE TABLE your_table
SELECT * FROM your_table
ORDER BY some_col; --this will force a single reducer
As a side effect, you will get a better-packed ORC file with an efficient index on the columns listed in ORDER BY.

How do I force hive to always create a consistent filename like 000000_0?

I am doing an INSERT OVERWRITE operation through a Hive external table onto AWS S3. Hive creates an output file 000000_0 on S3. However, at times I notice that it creates files with other names, like 0000003_0. I always need to overwrite the existing file, but with inconsistent file names I am unable to do so. How do I force Hive to always create a consistent filename like 000000_0? Below is an example of what my code looks like, where tab_content is a Hive external table.
INSERT OVERWRITE TABLE tab_content
PARTITION(datekey)
select * from source
It is better not to do this; instead, modify your program to accept any number of files in the directory.
Each reducer (or mapper, if the job is map-only) creates its own file. The reducers know nothing about each other, and the files are named when they are created. Files are named like 000001_0, 000002_0. But a file can also be named 000001_1 if attempt number 0 failed and attempt number 1 succeeded. Also, if the table is partitioned and there is no DISTRIBUTE BY on the partition key at the end of the query, each reducer will create its own file in each partition.
You can force the job to run on a single final reducer (for example by adding an ORDER BY clause or by setting set mapred.reduce.tasks = 1;). But bear in mind that this solution is not scalable, because too much data will cause performance problems on a single reducer. And what happens if attempt 0 fails, the task is restarted, and attempt 1 succeeds? It will create 000001_1 instead of 000001_0.
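A minimal sketch of the single-reducer variant, using the tables from the question. I'm assuming dynamic partitioning is what is intended and that datekey is the last column of source. Even with this, the file name is not guaranteed, for the retry reason described above.

```sql
-- Allow the datekey partitions to be created from the query result.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE tab_content PARTITION (datekey)
SELECT * FROM source
ORDER BY datekey;  -- global ordering runs on a single final reducer
```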

Hive insert vs Hive Load: What are the trade offs?

I'm learning Hadoop/Big Data technologies. I would like to ingest data in bulk into Hive. I started working with a simple CSV file, and when I tried to use the INSERT command to load it record by record, a single record insertion took around 1 minute. When I put the file into HDFS and then used the LOAD command, it was instantaneous, since it just copies the file into Hive's warehouse. I just want to know what trade-offs one has to face when opting for LOAD instead of INSERT.
Load: Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move data files into the locations corresponding to Hive tables.
Insert: query results can be inserted into tables by using the INSERT clause, which in turn runs MapReduce jobs, so it takes some time to execute.
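To make the difference concrete, here is a rough sketch. The table names, columns, and the HDFS path are made up for illustration.

```sql
-- Staging table whose format matches the raw CSV file.
CREATE TABLE customers_staging (
  id INT,
  name STRING,
  address STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- LOAD only moves the file into the table's directory; no job is run.
LOAD DATA INPATH '/user/me/customers.csv' INTO TABLE customers_staging;

-- Target table in ORC format.
CREATE TABLE customers_orc (
  id INT,
  name STRING,
  address STRING
)
STORED AS ORC;

-- INSERT ... SELECT runs a job and rewrites the data, here into ORC.
INSERT INTO TABLE customers_orc SELECT * FROM customers_staging;
```

So a common pattern is to LOAD the raw file into a staging table quickly, then pay the cost of one bulk INSERT ... SELECT to convert it, rather than inserting row by row.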
If you want to optimize/tune the INSERT statements, below are some techniques:
1. Set the execution engine to Tez (if it is already installed), either in hive-site.xml or per session:
set hive.execution.engine=tez;
2. Use the ORC file format:
CREATE TABLE A_ORC (
customerID int, name string, age int, address string
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
INSERT INTO TABLE A_ORC SELECT * FROM A;
3. Running jobs concurrently in Hive can reduce the overall running time. To achieve that, the config below, whose defaults come from hive-default.xml, needs to be changed:
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=<your value>;
For more info, you can visit http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
Hope this helps.

Hive query - INSERT OVERWRITE LOCAL DIRECTORY creates multiple files for a single table

I do the following from a hive table myTable.
INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out' SELECT concat_ws('',NAME,PRODUCT,PRC,field1,field2,field3,field4,field5) FROM myTable;
So, this command generates 2 files 000000_0 and 000001_0 inside the folder out/.
But, I need the contents as a single file. What should I do?
There are multiple files in the directory because every reducer writes one file. If you really need the contents as a single file, run your MapReduce job with only 1 reducer, which will write to a single file.
However, depending on your data size, running a single reducer might not be a good approach.
Edit: Instead of forcing Hive to run 1 reduce task and output a single file, it would be better to use hadoop fs operations to merge the outputs into a single file.
For example:
hadoop fs -text /myDir/out/* | hadoop fs -put - /myDir/out.txt
A bit late to the game, but I found that using LIMIT large_number works, where large_number is bigger than the number of rows in your query. It forces Hive to use at least one reducer. For example:
set mapred.reduce.tasks=1;
INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out' SELECT * FROM table_name LIMIT 1000000000;
Worked flawlessly.
CLUSTER BY will do the job.