I'm trying to split parquet/snappy files created by Hive INSERT OVERWRITE TABLE... on the dfs.block.size boundary, because Impala issues a warning when a file in a partition is larger than the block size.
Impala logs warnings like the following:
Parquet files should not be split into multiple hdfs-blocks. file=hdfs://<SERVER>/<PATH>/<PARTITION>/000000_0 (1 of 7 similar)
Code:
CREATE TABLE <TABLE_NAME> (<FIELDS>)
PARTITIONED BY (
year SMALLINT,
month TINYINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\037'
STORED AS PARQUET TBLPROPERTIES ("parquet.compression"="SNAPPY");
As for the INSERT HQL script:
SET dfs.block.size=134217728;
SET hive.exec.reducers.bytes.per.reducer=134217728;
SET hive.merge.mapfiles=true;
SET hive.merge.size.per.task=134217728;
SET hive.merge.smallfiles.avgsize=67108864;
SET hive.exec.compress.output=true;
SET mapred.max.split.size=134217728;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT OVERWRITE TABLE <TABLE_NAME>
PARTITION (year=<YEAR>, month=<MONTH>)
SELECT <FIELDS>
from <ANOTHER_TABLE> where year=<YEAR> and month=<MONTH>;
The issue is that the file sizes are all over the place:
partition 1: 1 file: size = 163.9 M
partition 2: 2 files: sizes = 207.4 M, 128.0 M
partition 3: 3 files: sizes = 166.3 M, 153.5 M, 162.6 M
partition 4: 3 files: sizes = 151.4 M, 150.7 M, 45.2 M
The issue is the same no matter what the dfs.block.size setting (and the other settings above) is increased to: 256M, 512M or 1G (for different data sets).
Is there a way/setting to make sure that the output parquet/snappy files are split just below the HDFS block size?
There is no way to close files once they grow to the size of a single HDFS block and start a new file. That would go against how HDFS typically works: having files that span many blocks.
The right solution is for Impala to schedule its tasks where the blocks are local instead of complaining that the file spans more than one block. This was completed recently as IMPALA-1881 and will be released in Impala 2.3.
You need to have both the Parquet block size and the dfs block size set:
SET dfs.block.size=134217728;
SET parquet.block.size=134217728;
Both need to be set to the same value because you want a Parquet block to fit inside an HDFS block.
In some cases you can set the Parquet block size by setting mapred.max.split.size (Parquet 1.4.2+), which you already did. You can set it lower than the HDFS block size to increase parallelism. Parquet tries to align to HDFS blocks, when possible:
https://github.com/Parquet/parquet-mr/pull/365
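As a minimal sketch, the INSERT script from the question with both settings added (134217728 is the 128 MB value used above; the placeholders are the same as in the original):
-- 134217728 bytes = 128 MB; both sizes matched so a Parquet block fits in one HDFS block
SET dfs.block.size=134217728;
SET parquet.block.size=134217728;
INSERT OVERWRITE TABLE <TABLE_NAME>
PARTITION (year=<YEAR>, month=<MONTH>)
SELECT <FIELDS>
FROM <ANOTHER_TABLE> WHERE year=<YEAR> AND month=<MONTH>;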
Edit 11/16/2015:
According to
https://github.com/Parquet/parquet-mr/pull/365#issuecomment-157108975
this might also be IMPALA-1881, which is fixed in Impala 2.3.
Related
I am running Hive external table queries. Issue:
The row count that Hive shows for 'SELECT * FROM table1' is different from the result of 'SELECT count(*) FROM table1'. They should match, but they don't, and I'm not sure why. The results match for small data (20 MB or so), but for a big table (i.e. 600 MB) they do not match. Has anyone faced this issue?
Below are some queries I ran to show the result. My source file is an RDS file, which I convert to a CSV file, upload to HDFS, and create an external table over.
Additional details:
Note:
I only face this issue for big files, e.g. 200 MB or more; for small files, e.g. 80 MB, there is no issue.
SELECT count(*) FROM dbname1.cy_tablet WHERE Ranid IS NULL; -- zero results
We resolved the issue, and all counts match now.
We removed the headers from the CSV file used as the source for the Hive external tables, by using col_names = FALSE:
write_delim(df_data, output_file, delim = "|", col_names = FALSE)
We also removed the following line from the CREATE EXTERNAL TABLE command:
TBLPROPERTIES('skip.header.line.count'='1')
The above steps resolved our issue.
The issue was happening with big files. At our site the HDFS block size is 128 MB; dividing a file's size by 128 MB gives a number, and the count difference I was seeing matched that number. So I think the issue was with the headers, presumably because the header skip was being applied per split/block rather than once per file: for example, a ~600 MB file spans roughly 600/128 ≈ 5 blocks, which matches a difference of about 5 rows.
Note: We used the pipe '|' as the delimiter because we faced some other issues when using ','.
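For reference, a minimal sketch of how the external table could be declared once the CSV has no header row (the column list beyond Ranid and the LOCATION path are hypothetical; only the table name comes from the query above):
CREATE EXTERNAL TABLE dbname1.cy_tablet (
  Ranid STRING,
  col2 STRING,
  col3 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/hive/external/cy_tablet';
-- no TBLPROPERTIES('skip.header.line.count'='1') needed, since the file has no header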
I have created an external table in Hive using Parquet with Snappy compression. I want to configure the size of the Parquet files created in the target folder, i.e. min file size = 1024 B and max file size = 128 MB; the number of Parquet files created should follow from these configurations. Can anyone please let me know how this can be achieved?
Thanks
I'm processing the output of the below Hive query:
INSERT OVERWRITE DIRECTORY '/main_directory/table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM table;
one part file at a time, sequentially, in Python. However, some of the files are too big and the Python script fails with a memory error.
Therefore, I need to limit the size of each output part file to ~300 MB or ~250K rows.
I'm creating a new table in Hive using:
CREATE TABLE new_table AS select * from old_table;
My problem is that after the table is created, it generates multiple files for each partition, while I want only one file per partition.
How can I define this for the table?
Thank you!
There are many possible solutions:
1) Add DISTRIBUTE BY <partition key> at the end of your query (see the sketch after this list). There may be many partitions per reducer, and each reducer then creates a file for each of its partitions; distributing by the partition key can reduce the number of files and the memory consumption as well. The hive.exec.reducers.bytes.per.reducer setting defines how much data each reducer will process.
2) Simple, and quite good if there is not too much data: add ORDER BY to force a single reducer, or increase hive.exec.reducers.bytes.per.reducer=500000000; -- 500M files. The single-reducer approach is only for small data volumes; it will run slowly if there is a lot of data.
If your task is map-only, then consider options 3-5 instead:
3) If running on MapReduce, switch on merge:
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=500000000; --Size of merged files at the end of the job
set hive.merge.smallfiles.avgsize=500000000; --When the average output file size of a job is less than this number,
--Hive will start an additional map-reduce job to merge the output files into bigger files
4) When running on Tez:
set hive.merge.tezfiles=true;
set hive.merge.size.per.task=500000000;
set hive.merge.smallfiles.avgsize=500000000;
5) For ORC files you can merge files efficiently using this command:
ALTER TABLE T [PARTITION partition_spec] CONCATENATE; - for ORC
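As a minimal sketch, here are options 1 and 2 applied to the CTAS from the question (part_col is a hypothetical partition column; the 500M value is illustrative):
-- Option 1: route all rows of a partition to the same reducer
SET hive.exec.reducers.bytes.per.reducer=500000000;
CREATE TABLE new_table AS
SELECT * FROM old_table
DISTRIBUTE BY part_col;
-- Option 2: force a single reducer (only for small data volumes)
CREATE TABLE new_table_ordered AS
SELECT * FROM old_table
ORDER BY part_col;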
I'm learning Hadoop/big data technologies. I would like to ingest data into Hive in bulk. I started working with a simple CSV file, and when I tried to use the INSERT command to load it record by record, a single record insertion took around 1 minute. When I put the file into HDFS and then used the LOAD command, it was instantaneous, since it just copies the file into Hive's warehouse. I just want to know what trade-offs one has to face when opting for LOAD instead of INSERT.
Load: Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move data files into locations corresponding to Hive tables.
Insert: query results can be inserted into tables by using the INSERT clause, which in turn runs MapReduce jobs, so it takes some time to execute.
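For contrast, a minimal sketch of the two approaches (the table names and the path are hypothetical):
-- LOAD: a pure file move into the table's location, no MapReduce job runs
LOAD DATA INPATH '/user/hadoop/staging/data.csv' INTO TABLE my_table;
-- INSERT: runs a MapReduce/Tez job over the query, so it takes longer but can transform the data
INSERT INTO TABLE my_table
SELECT * FROM staging_table;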
If you want to optimize/tune the INSERT statements, below are some techniques:
1. Set the execution engine to Tez (if it is already installed), either in hive-site.xml or per session:
set hive.execution.engine=tez;
2. Use ORC files:
CREATE TABLE A_ORC (
customerID int, name string, age int, address string
) STORED AS ORC TBLPROPERTIES ("orc.compress" = "SNAPPY");
INSERT INTO TABLE A_ORC SELECT * FROM A;
3. Running jobs concurrently in Hive can save overall job running time. To achieve that, the config below needs to be set (per session, or as a default in hive-site.xml):
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=<your value>;
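As a hedged sketch, a session combining techniques 1 and 3 with the ORC insert from technique 2 (the thread count of 8 is illustrative):
set hive.execution.engine=tez;
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
INSERT INTO TABLE A_ORC SELECT * FROM A;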
For more info, you can visit http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
Hope this helps.