AWS glue write dynamic frame out of memory (OOM) - amazon-s3

I am using AWS Glue to run a PySpark job that reads a dynamic frame from the catalog (data in Redshift), then writes it to S3 in CSV format. I am getting this error saying the executor is out of memory:
An error occurred while calling o883.pyWriteDynamicFrame. Job aborted due to stage failure: Task 388 in stage 21.0 failed 4 times, most recent failure: Lost task 388.3 in stage 21.0 (TID 10713, ip-10-242-88-215.us-west-2.compute.internal, executor 86): ExecutorLostFailure (executor 86 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 16.1 GB of 16 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
My guess is that the dataframe is not partitioned well before writing, so one executor runs out of memory. But when I follow this doc https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html to add partition keys to my dynamic frame, the job simply times out after 4 hours. (The partition key I chose splits the data set into around 10 partitions.)
Some other approaches I tried:
Configuring the fetchsize, but the AWS doc shows that Glue configures the fetchsize to 1,000 by default. https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/
Setting pushdown predicates, but the input dataset is created daily and is not partitioned. I also need all rows to perform joins/filters in the ETL, so this might not be a good solution for me.
Does anyone know some good alternatives to try out?
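One common mitigation (a sketch of my own, not from the thread or the Glue docs) is to convert the DynamicFrame to a DataFrame and repartition it before writing, sizing the partition count from the data volume so no single task buffers too much. The 128 MiB per-partition target below is a hypothetical choice:

```python
# Sketch: pick a partition count so each output task handles a bounded
# amount of data, then repartition before writing -- in Glue that would
# look roughly like:
#   dynamic_frame.toDF().repartition(n).write.csv(path)
# with n computed as below. The 128 MiB target is an assumption.
def target_partition_count(total_bytes: int, target_bytes: int = 128 * 1024**2) -> int:
    """Aim for roughly target_bytes of data per output partition."""
    return max(1, -(-total_bytes // target_bytes))  # ceiling division

print(target_partition_count(50 * 1024**3))  # 50 GiB -> 400 partitions
```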

Related

What is the Flow Run Size Limit of a notebook activity in Azure Synapse?

I created a large spark notebook and ran it successfully in Azure Synapse. Then I created a new pipeline with a new notebook activity pointing to the existing spark notebook. I triggered it and it failed with the error message:
ErrorCode=FlowRunSizeLimitExceeded, ErrorMessage=Triggering the pipeline failed
due to large run size. This could happen when a run has a large number of
activities or large inputs used in some of the activities, including parameters.
There is only one activity in that pipeline, so it can't be the number of activities being exceeded. I googled "flow run size limit" for activities and found no results. What is the flow run size limit on the notebook activity?
Here is the information:
Filename                         | Blob Size
UID_ISO_FIPS_LookUp_Table.csv    | 396 KiB
05-11-2021.csv                   | 630 KiB
https://ghoapi.azureedge.net/api | 476 KiB

Type            | Size          | Cell Total | Cluster Size
.ipynb notebook | 668,522 bytes | 43 cells   | Small (4 vCores / 32 GB) - 3 to 3 nodes
The purpose of the notebook is to join three files into a single file with a single table. The CSV processing includes filtering, selecting columns, renaming columns, and aggregating values.
Could someone explain why the error message occurred?
I was able to import that .csv with the following Python code on a small Spark pool:
%%pyspark
df = spark.read.load('abfss://someContainer@somestorageAccount.dfs.core.windows.net/raw/csv/05-11-2021.csv',
                     format='csv', header=True)
display(df.limit(10))
df.createOrReplaceTempView("tmp")
Saving it as a temp view allows you to write conventional SQL to query the dataframe, e.g.:
%%sql
SELECT SUM(deaths) xsum
FROM tmp

How can I change physical memory in a mapreduce/hive job?

I'm trying to run a Hive INSERT OVERWRITE query on an EMR cluster with 40 worker nodes and a single master node.
However, while running the INSERT OVERWRITE query, as soon as it reaches
Stage-1 map = 100%, reduce = 100%, Cumulative CPU 180529.86 sec
I get the following error:
Ended Job = job_1599289114675_0001 with errors
Diagnostic Messages for this Task:
Container [pid=9944,containerID=container_1599289114675_0001_01_041995] is running beyond physical memory limits. Current usage: 1.5 GB of 1.5 GB physical memory used; 3.2 GB of 7.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1599289114675_0001_01_041995 :
I'm not sure how I can change the 1.5 GB physical memory number. In my configurations I don't see such a number, and I don't understand how that 1.5 GB figure is calculated.
I even tried setting yarn.nodemanager.vmem-pmem-ratio to 5 as suggested in some forums, but irrespective of this change I still get the error.
This is how the job starts:
Number of reduce tasks not specified. Estimated from input data size: 942
Hadoop job information for Stage-1: number of mappers: 910; number of reducers: 942
And this is how my configuration file looks for the cluster. I can't work out which settings I have to change to avoid this issue. Could it also be due to Tez settings? (I'm not using it as the engine, though.)
Any suggestions will be greatly appreciated, thanks.
While opening the Hive console, append the following to the command:
--hiveconf mapreduce.map.memory.mb=8192 --hiveconf mapreduce.reduce.memory.mb=8192 --hiveconf mapreduce.map.java.opts=-Xmx7600M
In case you still get the Java heap error, try increasing to higher values, but make sure that the heap size in mapreduce.map.java.opts doesn't exceed mapreduce.map.memory.mb.
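As a rough sanity check on those numbers, the heap passed via mapreduce.map.java.opts should stay comfortably below the container size in mapreduce.map.memory.mb. The 0.8 ratio below is a common rule of thumb, not something stated in the answer above:

```python
# Sketch: derive a safe -Xmx (MB) from a YARN container size (MB).
# The 0.8 ratio is a common rule of thumb that leaves headroom for
# off-heap memory; it is an assumption, not an official Hive/YARN value.
def heap_for_container(container_mb: int, ratio: float = 0.8) -> int:
    return int(container_mb * ratio)

# For the 8192 MB containers suggested above:
print(heap_for_container(8192))  # 6553 -> below the -Xmx7600M in the answer
```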

ERROR : FAILED: Error in acquiring locks: Error communicating with the metastore org.apache.hadoop.hive.ql.lockmgr.LockException

Getting the "Error in acquiring locks" error when trying to run count(*) on partitioned tables.
The table has 365 partitions. When filtered on <= 350 partitions, the queries work fine; when more partitions are included in the query, it fails with the error.
Working on Hive-managed ACID tables, with the following default values:
hive.support.concurrency=true // cannot set this to false; that throws "<table> is missing from the ValidWriteIdList config: null" -- it should be true for ACID reads and writes
hive.lock.manager=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.txn.strict.locking.mode=false
hive.exec.dynamic.partition.mode=nonstrict
Tried increasing/decreasing the values of the following within a beeline session:
hive.lock.numretries
hive.unlock.numretries
hive.lock.sleep.between.retries
hive.metastore.batch.retrieve.max={default 300} //changed to 10000
hive.metastore.server.max.message.size={default 104857600} // changed to 10485760000
hive.metastore.limit.partition.request={default -1} //did not change as -1 is unlimited
hive.lock.query.string.max.length={default 10000} //changed to higher value
Using the HDI-4.0 interactive-query-llap cluster; the metastore is backed by the default SQL server provided with it.
The problem is NOT due to the service tier of the Hive metastore database.
Based on the symptom, it is most probably due to too many partitions in one query.
I have met the same issue several times.
In hivemetastore.log, you should be able to see an error like:
metastore.RetryingHMSHandler: MetaException(message:Unable to update transaction database com.microsoft.sqlserver.jdbc.SQLServerException: The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:578)
This is because in the Hive metastore, each partition involved in the Hive query requires at most 8 parameters to acquire a lock.
Some possible workarounds:
Decompose the query into multiple sub-queries that read from fewer partitions.
Reduce the number of partitions by setting different partition keys.
Remove partitioning if partition keys don't have any filters.
The following parameters manage the batch size for the INSERT queries generated by direct SQL. Their default value is 1000. Set both of them to 100 (as a good starting point) in the Custom hive-site section of the Hive configs via Ambari, and restart ALL Hive-related components (including the Hive metastore).
hive.direct.sql.max.elements.values.clause=100
hive.direct.sql.max.elements.in.clause=100
We also faced the same error in HDInsight and, after making many configuration changes similar to yours, the only thing that worked was scaling up our Hive metastore SQL DB server.
We had to scale it all the way to a P2 tier with 250 DTUs for our workloads to run without these lock exceptions. As you may know, with a higher tier and DTU count the SQL server's IOPS and response time improve, so we suspected that metastore performance was the root cause of these lock exceptions as the workload increased.
The following link provides information about DTU-based performance variation for SQL servers in Azure:
https://learn.microsoft.com/en-us/azure/sql-database/sql-database-service-tiers-dtu
Additionally as I know, the default Hive metastore that gets provisioned when you opt to not provide an external DB in cluster creation is just an S1 tier DB. This would not be suitable for any high capacity workloads. At the same time, as a best practice always provision your metastores external to the cluster and attach at cluster provisioning time, as this gives you the flexibility to connect the same Metastore to multiple clusters (so that your Hive layer schema can be shared across multiple clusters, e.g. Hadoop for ETLs and Spark for Processing / Machine Learning), and you have the full control to scale up or down your metastore as per your need anytime.
The only way to scale the default metastore is by engaging the Microsoft support.
We faced the same issue in HDInsight and solved it by upgrading the metastore.
The default metastore had only 5 DTUs, which is not recommended for production environments, so we migrated to a custom metastore, spun up an Azure SQL Server (P2, above 250 DTUs), and set the properties below:
hive.direct.sql.max.elements.values.clause=200
hive.direct.sql.max.elements.in.clause=200
The values above are set because SQL Server cannot process more than 2,100 parameters. When you have more than roughly 348 partitions you hit this issue, since each partition can create up to 8 parameters for the metastore.
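The arithmetic can be sketched as follows. The params_per_partition values are illustrative: the thread says each partition needs "at most 8" parameters, and the observed ~350-partition ceiling is consistent with about 6:

```python
# Sketch: how SQL Server's 2,100-parameter limit caps the number of
# partitions per metastore request. The per-partition parameter counts
# are illustrative, based on the "at most 8 parameters" claim above.
SQL_SERVER_PARAM_LIMIT = 2100

def max_partitions_per_query(params_per_partition: int) -> int:
    return SQL_SERVER_PARAM_LIMIT // params_per_partition

print(max_partitions_per_query(6))  # 350 -> consistent with queries working at <= 350
print(max_partitions_per_query(8))  # 262 -> worst case at 8 params per partition
```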

Not able to update Big query table with Transfer from a Storage file

I am not able to update a BigQuery table from a storage file. I have the latest data file and the transfer runs successfully, but it says "8:36:01 AM Detected that no changes will be made to the destination table."
Tried multiple ways.
Please help.
Thanks,
-Srini
You have to wait 1 hour after your file has been updated in Cloud Storage: https://cloud.google.com/bigquery-transfer/docs/cloud-storage-transfer?hl=en_US#minimum_intervals
I had the same error. I created two transfers from GCS to BigQuery, with the write preference set to MIRROR and APPEND respectively. I got the logs below (no error); the GCS file had been uploaded less than one hour before.
MIRROR: Detected that no changes will be made to the destination table. Summary: succeeded 0 jobs, failed 0 jobs.
APPEND: None of the 1 new file(s) found matching "gs://mybucket/myfile" meet the requirement of being at least 60 minutes old. They will be loaded in next run. Summary: succeeded 0 jobs, failed 0 jobs.
Both jobs went through one hour later.
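In other words, the transfer only picks up objects that were last modified at least 60 minutes before the run. A toy check of that rule (eligible_for_transfer is my own illustrative helper, not part of any Google API):

```python
# Sketch: GCS-to-BigQuery transfers skip files modified < 60 minutes ago.
from datetime import datetime, timedelta, timezone

MIN_FILE_AGE = timedelta(minutes=60)

def eligible_for_transfer(updated_at: datetime, now: datetime) -> bool:
    """True if the GCS object is at least 60 minutes old at run time."""
    return now - updated_at >= MIN_FILE_AGE

run_time = datetime(2021, 5, 11, 9, 0, tzinfo=timezone.utc)
print(eligible_for_transfer(run_time - timedelta(minutes=30), run_time))  # False
print(eligible_for_transfer(run_time - timedelta(minutes=90), run_time))  # True
```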

Pig script on aws emr with tez occasionally fails with OutOfMemoryException

I have a pig script running on an EMR cluster (emr-5.4.0) using a custom UDF. The UDF is used to look up some dimensional data, for which it imports a (somewhat) large amount of text data.
In the pig script, the UDF is used as follows:
DEFINE LookupInteger com.ourcompany.LookupInteger(<some parameters>);
The UDF stores some data in a Map<Integer, Integer>.
On some input data, the aggregation fails with an exception like the following:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.split(String.java:2377)
at java.lang.String.split(String.java:2422)
[...]
at com.ourcompany.LocalFileUtil.toMap(LocalFileUtil.java:71)
at com.ourcompany.LookupInteger.exec(LookupInteger.java:46)
at com.ourcompany.LookupInteger.exec(LookupInteger.java:19)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextInteger(POUserFunc.java:379)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.genericGetNext(POBinCond.java:76)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.getNextInteger(POBinCond.java:118)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
This does not occur when the pig aggregation is run with mapreduce, so a workaround for us is to replace pig -x tez with pig -x mapreduce.
As I'm new to Amazon EMR and to pig with tez, I'd appreciate some hints on how to analyse or debug the issue.
EDIT:
It looks like strange runtime behaviour when running the pig script on the Tez stack.
Please note that the pig script uses
replicated joins (the smaller relations to be joined need to fit into memory) and
the already mentioned UDF, which initialises a Map<Integer, Integer>, producing the aforementioned OutOfMemoryError.
We found another workaround for the Tez backend: using increased values for mapreduce.map.memory.mb and mapreduce.map.java.opts (0.8 times mapreduce.map.memory.mb). Those values are bound to the EC2 instance types and are usually fixed (see aws emr task config). By (temporarily) doubling them, we were able to make the pig script succeed.
The following values were set for an m3.xlarge core instance, which has these defaults:
mapreduce.map.java.opts := -Xmx1152m
mapreduce.map.memory.mb := 1440
Pig startup command
pig -Dmapreduce.map.java.opts=-Xmx2304m \
-Dmapreduce.map.memory.mb=2880 -stop_on_failure -x tez ... script.pig
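A quick check with the numbers quoted above confirms that the doubled settings preserve the 0.8 heap-to-container ratio mentioned earlier (values copied from the m3.xlarge defaults and the pig command):

```python
# Sketch: the doubled settings keep the same ~0.8 heap/container ratio
# as the m3.xlarge defaults quoted above.
default_memory_mb, default_xmx_mb = 1440, 1152
doubled_memory_mb, doubled_xmx_mb = 2880, 2304

print(default_xmx_mb / default_memory_mb)  # 0.8
print(doubled_xmx_mb / doubled_memory_mb)  # 0.8
```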
EDIT
One colleague came up with the following idea:
Another workaround for the OutOfMemoryError: GC overhead limit exceeded could be to add explicit STORE and LOAD statements for the problematic relations; that would make tez flush the data to storage. This could also help in debugging the issue, as the (temporary, intermediate) data can be inspected with other pig scripts.