How do I set the Parquet file size? I've tried tweaking some settings, but I still end up with a single large Parquet file.
I've created a partitioned external table and insert into it via an INSERT OVERWRITE statement.
SET hive.auto.convert.join=false;
SET hive.support.concurrency=false;
SET hive.exec.reducers.max=600;
SET hive.exec.parallel=true;
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.Lz4Codec;
SET mapreduce.map.output.compress=false;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec;
SET hive.groupby.orderby.position.alias=true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.optimize.sort.dynamic.partition=true;
SET hive.resultset.use.unique.column.names=false;
SET mapred.reduce.tasks=100;
SET dfs.blocksize=268435456;
SET parquet.block.size=268435456;
INSERT OVERWRITE TABLE my_table PARTITION (dt)
SELECT x, sum(y), dt FROM managed_table GROUP BY dt, x;
Using the dfs.blocksize and parquet.block.size parameters, I was hoping to generate 256 MB Parquet files, but I'm getting a single 4 GB Parquet file.
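What is often pointed out for this symptom is that, with hive.optimize.sort.dynamic.partition enabled, all rows for a given partition land on a single reducer, which then writes one file per partition regardless of parquet.block.size. A hedged sketch of the usual workaround, which spreads each partition over several reducers (the bucket count of 16 is an assumption; pick roughly the partition size divided by the target file size):
SET hive.optimize.sort.dynamic.partition=false;
SET hive.exec.reducers.bytes.per.reducer=268435456;
INSERT OVERWRITE TABLE my_table PARTITION (dt)
SELECT x, y, dt                           -- dynamic partition column goes last
FROM (
  SELECT x, sum(y) AS y, dt
  FROM managed_table
  GROUP BY dt, x
) agg
DISTRIBUTE BY dt, FLOOR(RAND() * 16);     -- spread each dt over up to 16 reducers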
Related
I want to export 100 million rows to a CSV file.
I am using SQL Developer and SQLcl, but the fetch is taking a very long time.
I am using the following commands in SQLcl:
SET FEEDBACK OFF
SET SQLFORMAT CSV
ALTER SESSION SET NLS_NUMERIC_CHARACTERS = '.,';
ALTER SESSION SET NLS_DATE_FORMAT = "YYYY-MM-DD";
ALTER SESSION SET NLS_TIMESTAMP_TZ_FORMAT = "YYYY-MM-DD HH24:MI:SS.FF TZR";
SPOOL C:\WORK\emp.csv
SELECT /*+ PARALLEL */ * FROM demo_table c;
Could anyone please help me export the CSV file faster?
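Most of the time usually goes into fetching and rendering rows in the client rather than into the query itself, so a commonly suggested approach is to spool from a script file with terminal output switched off and a larger fetch size. A hedged sketch (the ARRAYSIZE value is an assumption, and SET TERMOUT OFF only has an effect when the commands are run from a script, e.g. @export.sql):
SET FEEDBACK OFF
SET TERMOUT OFF
SET TRIMSPOOL ON
SET ARRAYSIZE 5000
SET SQLFORMAT CSV
SPOOL C:\WORK\emp.csv
SELECT /*+ PARALLEL */ * FROM demo_table;
SPOOL OFF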
I am using a MERGE query that inserts over 800 million records into a table from another table in the same database (a conversion project). We run into the error below when the MERGE gets to this particular target table.
2019-02-05 16:35:03.002 Error Could not allocate space for object 'dbo.SORT temporary run storage: 140820412694528' in database 'tempdb' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
MERGE dbo.' + #p_TargetDLTable + ' as TARGET
USING dbo.' + #p_SourceDLSDTable + ' as SOURCE
ON (TARGET.docid = source.docid AND TARGET.objectid = source.objectid AND
target.pagenum = source.pagenum
and target.subpagenum = source.subpagenum and target.pagever =
source.pagever and target.pathid = source.pathid
and target.annote = source.annote)
WHEN NOT MATCHED BY TARGET AND source.clipid != ''X''
THEN INSERT (docid, pagenum, subpagenum, pagever, objectid, pathid, annote,
formatid, ftoffset, ftcount) VALUES (
source.docid, source.pagenum, source.subpagenum, source.pagever,
source.objectid, source.pathid,source.annote ,source.formatid
,source.ftoffset, source.ftcount); '
I decided to use a MERGE query rather than INSERT INTO because, for the type of join that had to be done, all the research I found suggested it would perform better.
Is there a way to increase tempdb, or a way for the MERGE to avoid using tempdb? Does an INSERT INTO query also use tempdb?
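If tempdb cannot be grown, one alternative that is sometimes suggested is to replace the single 800-million-row MERGE with a plain INSERT ... SELECT that loads in key-range batches, so each sort is much smaller. A hedged sketch with hypothetical resolved table names (dbo.TargetDL, dbo.SourceDLSD) and an assumed numeric docid and batch size:
-- Load in docid ranges instead of one huge MERGE; the NOT EXISTS reproduces
-- the WHEN NOT MATCHED BY TARGET condition from the question.
DECLARE @batch_start BIGINT = 0, @batch_size BIGINT = 10000000, @max_docid BIGINT;
SELECT @max_docid = MAX(docid) FROM dbo.SourceDLSD;
WHILE @batch_start <= @max_docid
BEGIN
    INSERT INTO dbo.TargetDL (docid, pagenum, subpagenum, pagever, objectid,
                              pathid, annote, formatid, ftoffset, ftcount)
    SELECT s.docid, s.pagenum, s.subpagenum, s.pagever, s.objectid,
           s.pathid, s.annote, s.formatid, s.ftoffset, s.ftcount
    FROM dbo.SourceDLSD AS s
    WHERE s.docid >= @batch_start
      AND s.docid < @batch_start + @batch_size
      AND s.clipid <> 'X'
      AND NOT EXISTS (SELECT 1 FROM dbo.TargetDL AS t
                      WHERE t.docid = s.docid AND t.objectid = s.objectid
                        AND t.pagenum = s.pagenum AND t.subpagenum = s.subpagenum
                        AND t.pagever = s.pagever AND t.pathid = s.pathid
                        AND t.annote = s.annote);
    SET @batch_start = @batch_start + @batch_size;
END;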
I have a non-partitioned table SRC. I have a destination table DST partitioned on a date field, action_dt. I want to load data from SRC into DST and partition it by action_dt.
At the time of the load, the SRC table has data (30 million records) for just one action_dt (for example, 20170701). I use the query below to do the insert:
SET mapred.max.split.size=268435456;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.output=true;
SET parquet.compression=SNAPPY;
SET hive.merge.size.per.task=268435456;
SET hive.merge.smallfiles.avgsize=268435456;
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET mapreduce.reduce.memory.mb = 16384;
SET mapreduce.reduce.java.opts=-Djava.net.preferIPv4Stack=true -Xmx12g;
SET hive.execution.engine=tez;
SET hive.exec.reducers.bytes.per.reducer=104857600;
INSERT OVERWRITE TABLE DST partition(action_dt)
SELECT col1, col2, col3, action_dt FROM SRC;
The SRC table is gzip compressed and has about 80 files of 80-100 MB each. When the above query is executed, around 70 reducers are launched; 69 of them finish within 10 seconds, but the 70th reducer ends up processing all the data.
Why is this happening? Is it because Hive recognizes that the data belongs to one action_dt (20170701)? Is there a way this can be split so that multiple reducers can process the data? I tried using DISTRIBUTE BY, but it did not work.
Appreciate any feedback. Thanks.
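If DISTRIBUTE BY was tried on action_dt alone, every row still hashes to the same reducer because there is only one action_dt value. One variant that is often suggested is to distribute by the partition key plus a random bucket; a hedged sketch (the bucket count of 20 and turning off the sorted dynamic-partition optimization are assumptions):
SET hive.optimize.sort.dynamic.partition=false;
INSERT OVERWRITE TABLE DST PARTITION (action_dt)
SELECT col1, col2, col3, action_dt
FROM SRC
DISTRIBUTE BY action_dt, FLOOR(RAND() * 20);   -- ~20 reducers now share the single date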
My script keeps failing with the same error:
java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support
The code itself looks like this:
WITH step1 AS(
SELECT columns
FROM t1 stg
WHERE time_key < '2017-04-08' AND time_key >= DATE_ADD('2017-04-08', -31)
GROUP BY columns
HAVING conditions1
)
, step2 AS(
SELECT columns
FROM t2
WHERE conditions2
)
, step3 AS(
SELECT columns
FROM step1 stg
JOIN step2 comverse_sub
ON conditions3
)
INSERT INTO TABLE t1 PARTITION(time_key = '2017-04-08')
SELECT columns
FROM step3
WHERE conditions4
I checked whether Snappy is installed by running
hadoop checknative -a
and got
snappy: true /usr/hdp/2.5.0.0-1245/hadoop/lib/native/libsnappy.so.1
My settings for Tez are:
set tez.queue.name=adhoc;
set hive.execution.engine=tez;
set hive.tez.container.size=4096;
set hive.auto.convert.join=true;
set hive.exec.parallel=true;
set hive.tez.auto.reducer.parallelism=true;
SET hive.exec.compress.output=true;
SET tez.runtime.compress=true;
SET tez.runtime.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
I should also note that not all scripts on Tez fail; some work, like this one:
WITH hist AS(
SELECT columns
FROM t1
WHERE conditions1
)
INSERT INTO TABLE t1 PARTITION(time_key)
SELECT columns
FROM hist
INNER JOIN t2
on conditions2
INNER JOIN t3
ON conditions3
WHERE conditions4
Why does this happen?
I checked a few similar questions, but they didn't help. Also, when I run the scripts on MR, they all work.
Well, I solved it. I just kept adding settings until it worked. I still don't know what the problem was, and the solution is not very good, but it'll do for the time being.
set tez.queue.name=adhoc;
set hive.execution.engine=tez;
set hive.tez.container.size=4096;
set hive.auto.convert.join=true;
set hive.exec.parallel=true;
set hive.tez.auto.reducer.parallelism=true;
SET hive.exec.compress.output=true;
SET tez.runtime.compress=true;
SET tez.runtime.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET das.reduce-tasks-per-node=12;
SET das.map-tasks-per-node=12;
SET das.job.map-task.memory=4096;
SET das.job.reduce-task.memory=4096;
SET das.job.application-manager.memory=4096;
SET tez.runtime.io.sort.mb=512;
SET tez.runtime.io.sort.factor=100;
set hive.tez.min.partition.factor=0.25;
set hive.tez.max.partition.factor=2.0;
set mapred.reduce.tasks = -1;
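Since it is still unclear which of these settings actually mattered, one sanity check that is sometimes suggested for this error is to take Snappy out of the session entirely and see whether the job still fails (a sketch, using the same properties as above):
SET hive.exec.compress.output=false;
SET tez.runtime.compress=false;
-- or keep compression but use a codec with a pure-Java fallback:
-- SET tez.runtime.compress.codec=org.apache.hadoop.io.compress.DefaultCodec;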
I am trying to create an external table with TBLPROPERTIES in Hive. The table gets created, but it does not return any rows. Any ideas? Please find the script I am using below:
Thanks for your time and suggestions in advance.
The data is in a nested subfolder: /user/test/test1/test2/samplefile.csv
use dw_raw;
drop table if exists temp_external_tab1;
create external table if not exists temp_external_tab1 (
col1 int,
col2 string,
col3 string,
col4 string
)
row format delimited fields terminated by ','
lines terminated by '\n'
stored as textfile
location '/user/test/test1/'
tblproperties ("hive.input.dir.recursive" = "TRUE",
"hive.mapred.supports.subdirectories" = "TRUE",
"hive.supports.subdirectories" = "TRUE",
"mapred.input.dir.recursive" = "TRUE");
These are not table properties, but global settings.
You should set these using 'set', i.e.:
set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
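After setting those in the same session, the query should pick up the file in the nested subdirectory; a quick check (sketch), assuming the table from the question:
select count(*) from temp_external_tab1;
select * from temp_external_tab1 limit 10;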
You've created a table but haven't put any data into it. Try
hive> LOAD DATA LOCAL INPATH '/user/test/test1/test2/samplefile.csv'
INTO TABLE temp_external_tab1;
If you are using Ambari, set the following properties in the Hive advanced config, inside the custom hive-site.xml:
SET hive.input.dir.recursive=TRUE
SET hive.mapred.supports.subdirectories=TRUE
SET hive.supports.subdirectories=TRUE
SET mapred.input.dir.recursive=TRUE
Then restart the affected services. This will make Hive read all the data recursively.