Hive insert hangs in reduce at 99% - hive

I am trying to insert into a partitioned Hive table.
The map phase succeeds, but the reduce phase gets to 99% and then stays there without finishing. This goes on for hours without any result.
Can somebody let me know what could be the reason?
Note: I tried to insert into a non-partitioned Parquet table and it succeeded.
But I want to create a partitioned table.
The logs as seen in Hue are as below:
INFO : 2017-11-21 15:42:56,672 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 76743.67 sec
INFO : 2017-11-21 15:43:57,045 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 76816.54 sec
INFO : 2017-11-21 15:44:57,332 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 76892.15 sec

If you are inserting data into a table using dynamic partitioning in the DML and your data is large, the reducer can become the bottleneck.
If possible, pass the partition values explicitly in the DML through parameterization, if you already know the partition values from some source.
The root cause of the problem is that the reducer has to compute the DISTINCT values of the partition column.
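As a hedged sketch of that suggestion (the table and column names are placeholders, and ${date} / ${locale} stand for parameterized values supplied by the caller):
-- Static partition insert: the partition values are given explicitly,
-- so the reducer does not have to discover them via DISTINCT.
INSERT OVERWRITE TABLE target_table PARTITION (date = '${date}', locale = '${locale}')
SELECT col1, col2
FROM source_table
WHERE date = '${date}' AND locale = '${locale}';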

Related

Writing speed in Delta tables significantly increases after copying it in Databricks

I am merging a PySpark dataframe into a Delta table. The output delta is partitioned by DATE. The following query takes 30s to run:
from delta.tables import DeltaTable  # needed for DeltaTable.forPath

query = (
    DeltaTable.forPath(spark, PATH_TO_THE_TABLE)
    .alias("actual")
    .merge(
        spark_df.alias("sdf"),
        "actual.DATE >= current_date() - INTERVAL 1 DAYS"
        " AND (actual.feat1 = sdf.feat1)"
        " AND (actual.TIME = sdf.TIME)"
        " AND (actual.feat2 = sdf.feat2)",
    )
    .whenNotMatchedInsertAll()
)
query.execute()  # runs the merge
After copying the content of the Delta to another location, the above query becomes 60 times faster (i.e. it takes 0.5s on the same cluster), when using the NEW_PATH instead of PATH_TO_THE_TABLE. Here is the command to copy the delta:
(spark.read.format("delta").load(PATH_TO_THE_TABLE)
    .write.format("delta")
    .mode("overwrite")
    .partitionBy("DATE")
    .save(NEW_PATH))
How can I make querying on the first delta as fast as on the new one? I understand that Delta has a versioning system and I suspect it is the reason it takes so much time. I tried to vacuum the Delta table (which lowered the query time to 20s) but I am still far from the 0.5s.
Stack:
Python 3.7;
PySpark 3.0.1;
Databricks Runtime 7.3 LTS
I see that you filter on actual.DATE >= current_date(). Since this is the most important part of your query, try to periodically run OPTIMIZE with ZORDER to sort the Delta table by that field:
OPTIMIZE actual ZORDER BY (actual.DATE)
You can also try to fully vacuum the Delta table:
VACUUM actual RETAIN 0 HOURS
To do that, you need to set spark.databricks.delta.retentionDurationCheck.enabled to false.
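A minimal Spark SQL sketch of that sequence, assuming the table path is a placeholder (note that retaining 0 hours removes the ability to time travel to older versions):
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM delta.`/path/to/the/table` RETAIN 0 HOURS;
SET spark.databricks.delta.retentionDurationCheck.enabled = true;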
If you don't need the benefits of Delta (transactions, concurrent writes, time-travel history, etc.), you can just use plain Parquet.

BigQuery. Long execution time on small datasets

I created a new Google Cloud project and set up a BigQuery database. I tried different queries; they all take too long to execute. Currently we don't have a lot of data, so I expected high performance.
Below are some examples of queries and their execution time.
Query #1 (Job Id bquxjob_11022e81_172cd2d59ba):
select date(installtime) regtime
,count(distinct userclientid) users
,sum(fm.advcost) advspent
from DWH.DimUser du
join DWH.FactMarketingSpent fm on fm.date = date(du.installtime)
group by 1
The query failed after more than an hour with the error "Query exceeded resource limits. 14521.457814668494 CPU seconds were used, and this query must use less than 12800.0 CPU seconds."
Query execution plan: https://prnt.sc/t30bkz
Query #2 (Job Id bquxjob_41f963ae_172cd41083f):
select fd.date
,sum(fd.revenue) adrevenue
,sum(fm.advcost) advspent
from DWH.FactAdRevenue fd
join DWH.FactMarketingSpent fm on fm.date = fd.date
group by 1
Execution took 59.3 sec, 7.7 MB processed, which is too slow.
Query Execution plan: https://prnt.sc/t309t4
Query #3 (Job Id bquxjob_3b19482d_172cd31f629)
select date(installtime) regtime
,count(distinct userclientid) users
from DWH.DimUser du
group by 1
Execution time 5.0 sec elapsed, 42.3 MB processed, which is not terrible but should be faster for such small volumes of data.
Tables used :
DimUser - Table size 870.71 MB, Number of rows 2,771,379
FactAdRevenue - Table size 6.98 MB, Number of rows 53,816
FactMarketingSpent - Table size 68.57 MB, Number of rows 453,600
The question is: what am I doing wrong that makes the query execution time so long? If everything is OK, I would be glad to hear any advice on how to reduce the execution time for such simple queries. If anyone from Google reads my question, I would appreciate it if the job IDs were checked.
Thank you!
P.S. Previously I used BigQuery on other projects, and the performance and execution times were incredibly good for tables of 50+ TB.
Posting the same reply I've given in the GCP Slack workspace:
Your first two queries look like one particular worker is overloaded. You can see this because, in the compute section of the execution plan, the max time is very different from the avg time. This could be for a number of reasons, but I can see that you are joining a table of 700k+ rows (the 2nd input) to a table of ~50k rows (the 1st input). This is not good practice; you should switch it so the larger table is the leftmost table in the join. See https://cloud.google.com/bigquery/docs/best-practices-performance-compute?hl=en_US#optimize_your_join_patterns
You may also have heavy skew in your join keys (e.g. 90% of rows on 1/1/2020, or NULL). Check this.
For the third query, that time is expected; try an approximate count instead to speed it up. Also note that BigQuery tends to get faster if you run the same query over and over, so this will get quicker.
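As a hedged sketch of those two suggestions, reusing the table names from the question: query #2 with the larger table (FactMarketingSpent, ~450k rows) moved to the leftmost position, and query #3 with an approximate distinct count:
select fm.date
,sum(fd.revenue) adrevenue
,sum(fm.advcost) advspent
from DWH.FactMarketingSpent fm
join DWH.FactAdRevenue fd on fd.date = fm.date
group by 1;

select date(installtime) regtime
,approx_count_distinct(userclientid) users
from DWH.DimUser du
group by 1;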

Teradata Current CPU utilization (Not User level and no History data)

I want to run a heavy extraction, basically to migrate data from Teradata to a cloud warehouse, and I want to check the current CPU utilization (as a percentage) of the overall Teradata system so that I can scale the number of extraction processes accordingly.
I know this type of information is available in "dbc.resusagespma", but that looks like historical data rather than the current values we can see in Viewpoint.
Can we get such run-time information with the help of SQL in Teradata?
This info is returned by one of the PMPC-API functions, syslib.MonitorPhysicalSummary; of course, you need Execute Function rights:
SELECT * FROM TABLE (MonitorPhysicalSummary()) AS t
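If those rights are missing, granting them would look roughly like the sketch below (the user name is a placeholder, and the grant must be run by a user with the authority to give it):
-- Grant the Execute Function right needed for the PMPC-API call above
GRANT EXECUTE FUNCTION ON SYSLIB.MonitorPhysicalSummary TO extract_user;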

Loading data into dynamic partitions takes a lot of time

When I run a Hive query with dynamic partitioning, I see that loading data into the partitions is a time-consuming process.
Given that I have data only with date = 2015-03-15 and locale = us and run the query with dynamic partitioning, I see -
Loading data to table schema.tablename partition (date=null, locale=null)
Loading partition {date=2015-01-25,locale=us}
Loading partition {date=2015-02-12,locale=mx}
Loading partition {date=2015-03-17,locale=us}
Loading partition {date=2014-12-31,locale=tw}
...
...
I see a lot of stats being gathered here, even though I have the setting
SET hive.stats.autogather=false;
I would like to know if there is any setting or anything else I can do to minimize the time taken to load data into the partitions. Please help.
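For reference, a dynamic-partition insert of the kind described above typically looks like the sketch below (table and column names are placeholders; the two SET statements are the standard switches that enable dynamic partitioning):
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Partition columns are listed without values, so Hive derives them per row
INSERT OVERWRITE TABLE schema.tablename PARTITION (date, locale)
SELECT col1, col2, date, locale
FROM schema.source_table
WHERE date = '2015-03-15' AND locale = 'us';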

Interpreting and finetuning the BATCHSIZE parameter?

So I am playing around with the BULK INSERT statement and am beginning to love it. What was taking the SQL Server Import/Export Wizard 7 hours takes only 1-3 hours using BULK INSERT. However, what I am observing is that the time to completion is heavily dependent on the BATCHSIZE specification.
Following are the times I observed for a 5.7 GB file containing 50 million records:
BATCHSIZE = 50000, Time Taken: 17.30 mins
BATCHSIZE = 10000, Time Taken: 14:00 mins
BATCHSIZE = 5000 , Time Taken: 15:00 mins
This makes me curious: is it possible to determine a good number for BATCHSIZE, and if so, what factors does it depend on? Can it be approximated without having to run the same query tens of times?
My next run will be a 70 GB file containing 780 million records. Any suggestions would be appreciated. I will report back the results once I finish.
There is some information here, and it appears the batch size should be as large as is practical; the documentation states that in general the larger the batch size, the better the performance, but you are not experiencing that at all. It seems that 10k is a good batch size to start with, but I would look at optimizing the bulk insert from other angles, such as putting the database into the simple recovery model or specifying a TABLOCK hint during your import process.
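A hedged T-SQL sketch of those suggestions; the database, table, and file names are placeholders, and the terminators must match the actual file format:
ALTER DATABASE MyImportDb SET RECOVERY SIMPLE;   -- simple recovery model

BULK INSERT dbo.MyTargetTable
FROM 'C:\data\bigfile.csv'
WITH (
    BATCHSIZE = 10000,       -- starting point suggested above
    TABLOCK,                 -- table-level lock, helps enable minimal logging
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
);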