Hive INSERT OVERWRITE on a partitioned table sending all data to a single reducer

I have a non-partitioned table SRC and a destination table DST partitioned by a date field, action_dt. I want to load data from SRC into DST and partition it by action_dt.
At the time of the load, the SRC table has data (30 million records) for just one action_dt (for example, 20170701). I use the query below to do the insert:
SET mapred.max.split.size=268435456;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.output=true;
SET parquet.compression=SNAPPY;
SET hive.merge.size.per.task=268435456;
SET hive.merge.smallfiles.avgsize=268435456;
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET mapreduce.reduce.memory.mb = 16384;
SET mapreduce.reduce.java.opts=-Djava.net.preferIPv4Stack=true -Xmx12g;
SET hive.execution.engine=tez;
SET hive.exec.reducers.bytes.per.reducer=104857600;
INSERT OVERWRITE TABLE DST partition(action_dt)
SELECT col1, col2, col3, action_dt FROM SRC;
The SRC table is gzip compressed and has about 80 files of 80-100 MB each. When the above query is executed, around 70 reducers are launched; 69 of them finish within 10 seconds, but the 70th reducer ends up processing all the data.
Why is this happening? Is it because Hive recognizes that all the data belongs to a single action_dt (20170701)? Is there a way the data can be split so that multiple reducers process it? I tried DISTRIBUTE BY, but it did not work.
Appreciate any feedback. Thanks.
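For what it's worth, here is a minimal sketch of the salted DISTRIBUTE BY that is usually suggested for this situation, assuming the skew comes from every row sharing the same action_dt; the bucket count of 64 and the hive.optimize.sort.dynamic.partition setting are assumptions on my part, not something from the original post:
-- Spread the single action_dt value across many reducers by adding a
-- synthetic bucket to the distribution key; each reducer then writes its
-- own file into the same partition. Disabling the sorted dynamic-partition
-- optimization (assumed here) keeps rows from being routed by the partition key alone.
SET hive.optimize.sort.dynamic.partition=false;
INSERT OVERWRITE TABLE DST PARTITION (action_dt)
SELECT col1, col2, col3, action_dt
FROM SRC
DISTRIBUTE BY action_dt, FLOOR(RAND() * 64);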

Related

How to fetch all rows in parallel in PL/SQL? I want to export 100 million rows to CSV for ETL

I want to export 100 million rows to a CSV file.
I am using SQL Developer and SQLcl, but fetching is taking a very long time.
I am using the following commands in SQLcl:
SET FEEDBACK OFF
SET SQLFORMAT CSV
ALTER SESSION SET NLS_NUMERIC_CHARACTERS = '.,';
ALTER SESSION SET NLS_DATE_FORMAT = "YYYY-MM-DD";
ALTER SESSION SET NLS_TIMESTAMP_TZ_FORMAT = "YYYY-MM-DD HH24:MI:SS.FF TZR";
SPOOL C:\WORK\emp.csv
SELECT /*+ PARALLEL */ * FROM demo_table c;
Could anyone please tell me how to export the CSV file faster?
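A hedged sketch of the client-side tweaks that usually speed up spooling from SQLcl or SQL*Plus, assuming the bottleneck is row fetching and terminal rendering rather than the query itself; the array size of 5000 is an illustrative value:
SET FEEDBACK OFF
SET SQLFORMAT CSV
-- Fetch more rows per round trip to the database
SET ARRAYSIZE 5000
-- Suppress terminal rendering while spooling (only takes effect when run from a script file)
SET TERMOUT OFF
SPOOL C:\WORK\emp.csv
SELECT /*+ PARALLEL */ * FROM demo_table;
SPOOL OFF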

Hive set parquet file size?

How do I set the parquet file size? I've tried tweaking some settings, but ultimately I get a single large parquet file.
I've created a partitioned external table, which I then load via an INSERT OVERWRITE statement.
SET hive.auto.convert.join=false;
SET hive.support.concurrency=false;
SET hive.exec.reducers.max=600;
SET hive.exec.parallel=true;
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.Lz4Codec;
SET mapreduce.map.output.compress=false;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec;
SET hive.groupby.orderby.position.alias=true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.optimize.sort.dynamic.partition=true;
SET hive.resultset.use.unique.column.names=false;
SET mapred.reduce.tasks=100;
SET dfs.blocksize=268435456;
SET parquet.block.size=268435456;
INSERT OVERWRITE TABLE my_table PARTITION (dt)
SELECT dt, x, sum(y) FROM managed_table GROUP BY dt, x;
Using the dfs.blocksize and parquet.block.size parameters, I was hoping to generate 256 MB Parquet files, but I'm getting a single 4 GB Parquet file.
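A hedged sketch of one way this is often tackled, under the assumption that hive.optimize.sort.dynamic.partition=true routes every row of a partition to a single reducer and therefore a single output file; the bucket expression, the bucket count of 16, and the column layout of my_table are assumptions, not taken from the question:
-- parquet.block.size only bounds the row-group size inside a file, not the file itself,
-- so the file count has to come from the number of reducers writing each partition.
SET hive.optimize.sort.dynamic.partition=false;
INSERT OVERWRITE TABLE my_table PARTITION (dt)
SELECT x, y, dt   -- dynamic partition column last (assumed schema)
FROM (
  SELECT dt, x, SUM(y) AS y
  FROM managed_table
  GROUP BY dt, x
) agg
DISTRIBUTE BY dt, PMOD(HASH(x), 16);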

How to merge 2 huge SQL Server tables without a lot of disk space?

I have 2 tables. Here is my merge statement:
MERGE INTO Transactions_Raw AS Target
USING Four_Fields AS Source
ON
Target.PI = Source.PI AND
Target.TIME_STAMP = Source.TIME_STAMP AND
Target.STS = Source.STS
WHEN MATCHED THEN
UPDATE SET
Target.FROM_APP_A = Source.FROM_APP_A,
Target.TO_APP_A = Source.TO_APP_A,
Target.FROM_APP_B = Source.FROM_APP_B,
Target.TO_APP_B = Source.TO_APP_B;
Both tables have around 77 million rows.
When I run this command, it fails due to tempdb growing and running out of disk space. I cannot get more space.
I tried running it as an update statement and it still failed.
I even tried using SSIS with sort and merge-join transformations from these two source tables into a third, empty table, left joining on Transactions_Raw. It failed and crashed the server, requiring a reboot.
I am thinking I need to run it in batches of 100,000 rows or so. How would I do that? Any other suggestions?
Disclaimer: this applies if you are willing to use a framework for the job.
spring-batch is a good choice for this task.
Even though it involves some coding, it helps you transfer the data in a seamless and consistent way.
You can choose the chunk size (e.g. 100,000 rows) to transfer at a time.
Any failure while inserting the data is then limited to that chunk of rows, so you have less data to deal with and only need to re-insert that delta after fixing it; the correct data is still inserted successfully. spring-batch gives you this whole set of controls.
I usually prefer BCP for such operations on huge tables.
If this query (make sure its column list matches the structure of the Target table)
SELECT
Target.ID,
Target.TIME_STAMP,
Target.STS,
ISNULL(Source.FROM_APP_A, Target.FROM_APP_A) AS FROM_APP_A,
ISNULL(Source.TO_APP_A , Target.TO_APP_A) AS TO_APP_A,
ISNULL(Source.FROM_APP_B, Target.FROM_APP_B) AS FROM_APP_B,
ISNULL(Source.TO_APP_B , Target.TO_APP_B) AS TO_APP_B
FROM Database.dbo.Transactions_Raw AS Target
LEFT JOIN Database.dbo.Four_Fields AS Source
ON Target.ID = Source.ID AND
Target.TIME_STAMP = Source.TIME_STAMP AND
Target.STS = Source.STS
won't overflow your tempdb, you can use it to export the data to a file using something like
bcp "query_here" queryout D:\Data.dat -S server -Uuser -k -n
Later, import the data into a new table using something like
bcp Database.dbo.NewTable in D:\Data.dat -S server -Uuser -n -k -b 100000 -h TABLOCK
-- This is a code example of the batching loop I described in the comments; you just have to plug in your MERGE statement where indicated.
SET NOCOUNT ON;
DECLARE @PerBatchCount as int
DECLARE @MAXID as bigint
DECLARE @WorkingOnID as bigint
SET @PerBatchCount = 1000
-- Find the range of IDs to process
SELECT @WorkingOnID = MIN(ID), @MAXID = MAX(ID)
FROM MasterTableYouAreWorkingOn
WHILE @WorkingOnID <= @MAXID
BEGIN
-- Do an update on all the rows that exist in the source table now:
-- your MERGE statement goes here, with the WHERE below added to limit it by ID
-- WHERE ID BETWEEN @WorkingOnID AND (@WorkingOnID + @PerBatchCount - 1)
-- If the loop churns through too much data, this delay gives the log time to ship/empty out
WAITFOR DELAY '00:00:10' -- wait 10 seconds
SET @WorkingOnID = @WorkingOnID + @PerBatchCount
END

HIVE on TEZ (java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support)

My script keeps failing with the same error:
java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support
The code itself looks like this
WITH step1 AS(
SELECT columns
FROM t1 stg
WHERE time_key < '2017-04-08' AND time_key >= DATE_ADD('2017-04-08', -31)
GROUP BY columns
HAVING conditions1
)
, step2 AS(
SELECT columns
FROM t2
WHERE conditions2
)
, step3 AS(
SELECT columns
FROM stg
JOIN comverse_sub
ON conditions3
)
INSERT INTO TABLE t1 PARTITION(time_key = '2017-04-08')
SELECT columns
FROM step3
WHERE conditions4
I checked whether Snappy is installed using
hadoop checknative -a
and got
snappy: true /usr/hdp/2.5.0.0-1245/hadoop/lib/native/libsnappy.so.1
My settings for tez are
set tez.queue.name=adhoc;
set hive.execution.engine=tez;
set hive.tez.container.size=4096;
set hive.auto.convert.join=true;
set hive.exec.parallel=true;
set hive.tez.auto.reducer.parallelism=true;
SET hive.exec.compress.output=true;
SET tez.runtime.compress=true;
SET tez.runtime.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
I should also note that not all scripts on Tez fail. Some work, like this one:
WITH hist AS(
SELECT columns
FROM t1
WHERE conditions1
)
INSERT INTO TABLE t1 PARTITION(time_key)
SELECT columns
FROM hist
INNER JOIN t2
on conditions2
INNER JOIN t3
ON conditions3
WHERE conditions4
Why does this happen?
I checked several related questions and answers, but they didn't help. Also, when I run the scripts on MR they all work.
Well, I solved it. I just kept adding settings until it worked. I still don't know what the problem was, and the solution is not very good, but it will do for the time being.
set tez.queue.name=adhoc;
set hive.execution.engine=tez;
set hive.tez.container.size=4096;
set hive.auto.convert.join=true;
set hive.exec.parallel=true;
set hive.tez.auto.reducer.parallelism=true;
SET hive.exec.compress.output=true;
SET tez.runtime.compress=true;
SET tez.runtime.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET das.reduce-tasks-per-node=12;
SET das.map-tasks-per-node=12;
SET das.job.map-task.memory=4096;
SET das.job.reduce-task.memory=4096;
SET das.job.application-manager.memory=4096;
SET tez.runtime.io.sort.mb=512;
SET tez.runtime.io.sort.factor=100;
set hive.tez.min.partition.factor=0.25;
set hive.tez.max.partition.factor=2.0;
set mapred.reduce.tasks = -1;
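As an aside, and only an assumption on my part rather than the confirmed root cause: this error usually means the Tez containers themselves cannot load libsnappy, even when hadoop checknative finds it on the gateway host. Pointing the Tez launch environments at the native library directory is one common remedy; the path below is the HDP 2.5 location reported by checknative above.
SET tez.am.launch.env=LD_LIBRARY_PATH=/usr/hdp/2.5.0.0-1245/hadoop/lib/native;
SET tez.task.launch.env=LD_LIBRARY_PATH=/usr/hdp/2.5.0.0-1245/hadoop/lib/native;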

Issue creating Hive External table using tblproperties

I am trying to create an external table with tblproperties in Hive. The table gets created, but it does not return any rows. Any ideas? Please find the scripts I am using below.
Thanks for your time and suggestions in advance.
The data is in a nested subdirectory: /user/test/test1/test2/samplefile.csv
use dw_raw;
drop table if exists temp_external_tab1;
create external table if not exists temp_external_tab1 (
col1 int,
col2 string,
col3 string,
col4 string
)
row format delimited fields terminated by ','
lines terminated by '\n'
stored as textfile
location '/user/test/test1/'
tblproperties ("hive.input.dir.recursive" = "TRUE",
"hive.mapred.supports.subdirectories" = "TRUE",
"hive.supports.subdirectories" = "TRUE",
"mapred.input.dir.recursive" = "TRUE");
These are not table properties, but global settings.
You should set these using 'set', i.e.:
set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
You've created a table but haven't put any data into it. Try
hive> LOAD DATA LOCAL INPATH '/user/test/test1/test2/samplefile.csv'
INTO TABLE temp_external_tab1;
If you are using Ambari, then set the following properties in the Hive advanced config, inside custom hive-site.xml.
SET hive.input.dir.recursive=TRUE
SET hive.mapred.supports.subdirectories=TRUE
SET hive.supports.subdirectories=TRUE
SET mapred.input.dir.recursive=TRUE
Then restart the affected services. This will make Hive read all the data recursively.
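For reference, here is a sketch of what two of those entries would look like as properties inside the custom hive-site.xml section (the remaining ones follow the same pattern); which of them a given Hive version actually honors is not something this answer confirms:
<property>
  <name>hive.mapred.supports.subdirectories</name>
  <value>true</value>
</property>
<property>
  <name>mapred.input.dir.recursive</name>
  <value>true</value>
</property>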