Hadoop merge files - hive

I have run a map-only Hive job with 674 mappers, which generated 674 .gz files. I want to merge these into around 30-35 files. I have tried the hive.merge.mapfiles property but am not getting the merged output.

Try using the Tez execution engine together with hive.merge.tezfiles. You might also want to specify the target size.
set hive.execution.engine=tez; -- TEZ execution engine
set hive.merge.tezfiles=true; -- Notifying that merge step is required
set hive.merge.smallfiles.avgsize=128000000; --128MB
set hive.merge.size.per.task=128000000; -- 128MB
If you want to go with the MR engine, then add the following settings (I haven't tried it personally):
set hive.merge.mapredfiles=true; -- Notifying that merge step is required
set hive.merge.smallfiles.avgsize=128000000; --128MB
set hive.merge.size.per.task=128000000; -- 128MB
The above settings will spawn one more step to merge the files, and the approximate size of each part file should be 128 MB.
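Note that these settings control file size rather than an exact file count; the number of merged files will be roughly the total output size divided by hive.merge.size.per.task. As a minimal sketch of a full Tez run with these settings (target_table and source_table are placeholder names, not from the original question):
set hive.execution.engine=tez;
set hive.merge.tezfiles=true;                 -- run an extra merge step after the map-only job
set hive.merge.smallfiles.avgsize=128000000;  -- trigger the merge when the average output file is below 128MB
set hive.merge.size.per.task=128000000;       -- target size of each merged file

insert overwrite table target_table           -- placeholder table names
select * from source_table;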
Reference:
Settings description

Related

How to force MR execution when running simple Hive query?

I have Hive 2.1.1 over MR, a table test_table stored as sequencefile, and the following ad-hoc query:
select t.*
from test_table t
where t.test_column = 100
Although this query can be executed without starting MR (as a fetch task), sometimes scanning the HDFS files directly takes longer than triggering a single map job.
When I want to enforce MR execution, I make the query more complex, e.g. by using distinct. The significant drawbacks of this approach are:
Query results may differ from the original query's
It puts meaningless computational load on the cluster
Is there a recommended way to force MR execution when using Hive-on-MR?
The Hive executor decides whether to execute a map task or a fetch task depending on the following settings (with defaults):
hive.fetch.task.conversion ("more") — the strategy for converting MR tasks into fetch tasks
hive.fetch.task.conversion.threshold (1 GB) — max size of input data that can be fed to a fetch task
hive.fetch.task.aggr (false) — when set to true, queries like select count(*) from src also can be executed in a fetch task
This suggests the following two options:
set hive.fetch.task.conversion.threshold to a lower value, e.g. 512 MB
set hive.fetch.task.conversion to "none"
For some reason lowering the threshold did not change anything in my case, so I stuck with the second option: it seems fine for ad-hoc queries.
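As a minimal sketch, the second option applied to the original query in a single session (resetting the property afterwards is optional):
set hive.fetch.task.conversion=none;  -- disable fetch-task conversion so Hive launches a map job

select t.*
from test_table t
where t.test_column = 100;

set hive.fetch.task.conversion=more;  -- restore the default for the rest of the session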
More details regarding these settings can be found in Cloudera forum and Hive wiki.
Just add set hive.execution.engine=mr; before your query and it will force Hive to use MR.

Can you control hdfs file size for a HortonWorks HDP 3.4.1 Managed Table?

Currently testing a cluster, and when using "CREATE TABLE AS" the resulting managed table ends up being one file of ~1.2 GB, while the base file the query is created from has many small files. The SELECT portion runs fast, but then the result is 2 reducers running to create one file, which takes 75% of the run time.
Additional testing:
1) If "CREATE EXTERNAL TABLE AS" is used, the query runs very fast and there is no file-merge step involved.
2) Also, the merging doesn't appear to occur with version HDP 3.0.1.
You can set hive.exec.reducers.bytes.per.reducer=<number> to let Hive decide the number of reducers based on reducer input size (the default value is 1 GB, i.e. 1000000000 bytes). [You can refer to the links provided by @leftjoin to get more details about this property and how to fine-tune it for your needs.]
Another option you can try is to change the following properties:
set mapreduce.job.reduces=<number>
set hive.exec.reducers.max=<number>
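A rough sketch of how these could be applied to the CTAS in question (the table names and the 256 MB value are illustrative, not from the original post):
set hive.exec.reducers.bytes.per.reducer=256000000;  -- ~256MB of input per reducer instead of the 1GB default
set hive.exec.reducers.max=32;                       -- upper bound on the number of reducers

create table my_managed_table as                     -- placeholder table names
select * from my_base_table;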

Create Table in Hive with one file

I'm creating a new table in Hive using:
CREATE TABLE new_table AS select * from old_table;
My problem is that after the table is created, it generates multiple files for each partition, while I want only one file for each partition.
How can I define it in the table?
Thank you!
There are many possible solutions:
1) Add distribute by partition key at the end of your query (see the sketch after this list). Maybe there are many partitions per reducer and each reducer creates files for each partition. This may reduce the number of files and memory consumption as well. The hive.exec.reducers.bytes.per.reducer setting defines how much data each reducer will process.
2) Simple, and quite good if there is not too much data: add order by to force a single reducer, or increase hive.exec.reducers.bytes.per.reducer=500000000; --500M. This single-reducer solution is only for small data volumes; it will run slowly if there is a lot of data.
If your task is map-only, then options 3-5 are a better fit:
3) If running on mapreduce, switch-on merge:
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=500000000; --Size of merged files at the end of the job
set hive.merge.smallfiles.avgsize=500000000; --When the average output file size of a job is less than this number,
--Hive will start an additional map-reduce job to merge the output files into bigger files
4) When running on Tez
set hive.merge.tezfiles=true;
set hive.merge.size.per.task=500000000;
set hive.merge.smallfiles.avgsize=500000000;
5) For ORC files you can merge them efficiently using this command:
ALTER TABLE T [PARTITION partition_spec] CONCATENATE;
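Here is the sketch of option 1 referenced above, assuming the target table is partitioned by a column named part_col and has data columns col1 and col2 (all placeholder names), written as a dynamic-partition insert into an already-created new_table:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.reducers.bytes.per.reducer=500000000;  -- each reducer processes ~500MB

insert overwrite table new_table partition (part_col)
select col1, col2, part_col                          -- partition column goes last in the select list
from old_table
distribute by part_col;                              -- all rows of a partition go to one reducer, which then writes one file per partition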

How to decrease the number of mappers in Hive when the files are bigger than the block size?

Guys, I have a table in Hive which has more than 720 partitions, and in each partition there are more than 400 files with an average file size of 1 GB.
Now I execute the following SQL:
insert overwrite table test_abc select * from DEFAULT.abc A WHERE A.P_HOUR ='2017042400' ;
This partition (P_HOUR='2017042400') has 409 files. When I submit this SQL, I get the following output:
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:409
INFO : Submitting tokens for job: job_1482996444961_9384015
I googled many docs to find out how to decrease the number of mappers; lots of docs solve this problem for the case when the files are small.
I have tried the following settings in beeline, but they did not work:
-- first attempt
set mapred.min.split.size=5000000000;
set mapred.max.split.size=10000000000;
set mapred.min.split.size.per.node=5000000000;
set mapred.min.split.size.per.rack=5000000000;
-- second attempt
set mapreduce.input.fileinputformat.split.minsize=5000000000;
set mapreduce.input.fileinputformat.split.maxsize=10000000000;
set mapreduce.input.fileinputformat.split.minsize.per.rack=5000000000;
set mapreduce.input.fileinputformat.split.minsize.per.node=5000000000;
My Hadoop version is:
Hadoop 2.7.2
Compiled by root on 11 Jul 2016 10:58:45
My Hive version is:
Connected to: Apache Hive (version 1.3.0)
Driver: Hive JDBC (version 1.3.0)
In addition to the settings in your post, add:
set hive.hadoop.supports.splittable.combineinputformat=true;
hive.hadoop.supports.splittable.combineinputformat
- Default Value: false
- Added In: Hive 0.6.0
Whether to combine small input files so that fewer mappers are spawned.
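For example, a sketch combining this flag with one of the split-size settings from the question (the 5 GB value is illustrative; with 409 files of roughly 1 GB each, splits of up to ~5 GB would yield on the order of 80-100 mappers, depending on node and rack locality):
set hive.hadoop.supports.splittable.combineinputformat=true;   -- allow Hive to combine input files into larger splits
set mapreduce.input.fileinputformat.split.maxsize=5000000000;  -- up to ~5GB of input per split

insert overwrite table test_abc
select * from DEFAULT.abc A WHERE A.P_HOUR ='2017042400';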
MRv2 uses CombineInputFormat, while Tez uses grouped splits to determine the number of mappers. If your execution engine is mr and you would like to reduce the number of mappers, use:
mapreduce.input.fileinputformat.split.maxsize=xxxxx
If maxSplitSize is specified, then blocks on the same node are combined to form a single split. Blocks that are left over are then combined with other blocks in the same rack. If maxSplitSize is not specified, then blocks from the same rack are combined in a single split; no attempt is made to create node-local splits. If maxSplitSize is equal to the block size, then this class is similar to the default splitting behavior in Hadoop.
This link can be helpful for controlling the number of mappers in Hive if your execution engine is mr.
If your execution engine is tez and you would like to control the number of mappers, then use:
set tez.grouping.max-size = XXXXXX;
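A sketch for the Tez case (the values are illustrative; tez.grouping.min-size and tez.grouping.max-size bound how much input each Tez task reads):
set hive.execution.engine=tez;
set tez.grouping.min-size=2000000000;  -- at least ~2GB of input per task
set tez.grouping.max-size=5000000000;  -- at most ~5GB of input per task

insert overwrite table test_abc
select * from DEFAULT.abc A WHERE A.P_HOUR ='2017042400';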
Here is a good reference on parallelism in Hive for the Tez execution engine.

Set parquet snappy output file size in Hive?

I'm trying to split parquet/snappy files created by hive INSERT OVERWRITE TABLE... on the dfs.block.size boundary, as Impala issues a warning when a file in a partition is larger than the block size.
Impala logs the following warnings:
Parquet files should not be split into multiple hdfs-blocks. file=hdfs://<SERVER>/<PATH>/<PARTITION>/000000_0 (1 of 7 similar)
Code:
CREATE TABLE <TABLE_NAME>(<FILEDS>)
PARTITIONED BY (
year SMALLINT,
month TINYINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\037'
STORED AS PARQUET TBLPROPERTIES ("parquet.compression"="SNAPPY");
As for the INSERT hql script:
SET dfs.block.size=134217728;
SET hive.exec.reducers.bytes.per.reducer=134217728;
SET hive.merge.mapfiles=true;
SET hive.merge.size.per.task=134217728;
SET hive.merge.smallfiles.avgsize=67108864;
SET hive.exec.compress.output=true;
SET mapred.max.split.size=134217728;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT OVERWRITE TABLE <TABLE_NAME>
PARTITION (year=<YEAR>, month=<MONTH>)
SELECT <FIELDS>
from <ANOTHER_TABLE> where year=<YEAR> and month=<MONTH>;
The issue is that the file sizes are all over the place:
partition 1: 1 file: size = 163.9 M
partition 2: 2 files: sizes = 207.4 M, 128.0 M
partition 3: 3 files: sizes = 166.3 M, 153.5 M, 162.6 M
partition 4: 3 files: sizes = 151.4 M, 150.7 M, 45.2 M
The issue stays the same no matter whether the dfs.block.size setting (and the other settings above) is increased to 256M, 512M or 1G (for different data sets).
Is there a way/setting to make sure that the output parquet/snappy files are split just below the HDFS block size?
There is no way to close a file once it grows to the size of a single HDFS block and then start a new file. That would go against how HDFS typically works: having files that span many blocks.
The right solution is for Impala to schedule its tasks where the blocks are local instead of complaining that the file spans more than one block. This was completed recently as IMPALA-1881 and will be released in Impala 2.3.
You need to set both the parquet block size and the dfs block size:
SET dfs.block.size=134217728;
SET parquet.block.size=134217728;
Both need to be set to the same value because you want a parquet block to fit inside an HDFS block.
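For example, the SET block from the question with parquet.block.size added alongside dfs.block.size (keeping the original 128 MB values; this is a sketch rather than a verified fix):
SET dfs.block.size=134217728;      -- 128MB HDFS block
SET parquet.block.size=134217728;  -- 128MB parquet row group, so a row group fits inside one HDFS block
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

INSERT OVERWRITE TABLE <TABLE_NAME>
PARTITION (year=<YEAR>, month=<MONTH>)
SELECT <FIELDS>
from <ANOTHER_TABLE> where year=<YEAR> and month=<MONTH>;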
In some cases you can set the parquet block size by setting mapred.max.split.size (parquet 1.4.2+), which you already did. You can set it lower than the hdfs block size to increase parallelism. Parquet tries to align to hdfs blocks, when possible:
https://github.com/Parquet/parquet-mr/pull/365
Edit 11/16/2015:
According to
https://github.com/Parquet/parquet-mr/pull/365#issuecomment-157108975
this also might be IMPALA-1881 which is fixed in Impala 2.3.