AWS Glue ETL"Failed to delete key: target_folder/_temporary" caused by S3 exception "Please reduce your request rate" - amazon-s3

Glue job configured to max 10 nodes capacity, 1 job in parallel and no retries on failure is giving an error "Failed to delete key: target_folder/_temporary", and according to stacktrace the issue is that S3 service starts blocking the Glue requests due to the amount of requests: "AmazonS3Exception: Please reduce your request rate."
Note: The issue is not with IAM as the IAM role that glue job is using has permissions to delete objects in S3.
I found a suggestion for this issue on GitHub with a proposition of reducing the worker count: https://github.com/aws-samples/aws-glue-samples/issues/20
"I've had success reducing the number of workers."
However, I don't think that 10 is too many workers and would even like to actually increase the worker count to 20 to speed up the ETL.
Did anyone have any success who faced this issue? How would I go about solving it?
Shortened stacktrace:
py4j.protocol.Py4JJavaError: An error occurred while calling o151.pyWriteDynamicFrame.
: java.io.IOException: Failed to delete key: target_folder/_temporary
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:665)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:332)
...
Caused by: java.io.IOException: 1 exceptions thrown from 12 batch deletes
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:384)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1372)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:663)
...
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: ...
Part of Glue ETL python script (just in case):
datasource0 = glueContext.create_dynamic_frame.from_catalog(database="database", table_name="table_name", transformation_ctx="datasource0")
... relationalizing, renaming and etc. Transforming from DynamicDataframe to PySpark dataframe and back.
partition_ready = Map.apply(frame=processed_dataframe, f=map_date_partition, transformation_ctx="map_date_partition")
datasink = glueContext.write_dynamic_frame.from_options(frame=partition_ready, connection_type="s3", connection_options={"path": "s3://bucket/target_folder", "partitionKeys": ["year", "month", "day", "hour"]}, format="parquet", transformation_ctx="datasink")
job.commit()
Solved(Kind of), thank you to user ayazabbas
Accepted the answer that helped me into the correct direction of a solution. One of the things I was searching for is how to reduce many small files into big chunks and repartition does exactly that. Instead of repartition(x) I used coalesce(x) where x is 4*worker count of a glue job so that Glue service could allocate each data chunk to each available vCPU resource. It might make sense to have x at least 2*4*worker_count to account for slower and faster transformation parts if they do exist.
Another thing I did was reduce the number of columns by which I was partitioning the data before writing it to S3 from 5 to 4.
Current drawback is that I haven't figured out how to find the worker count within the glue script that the glue service allocates for the job, thus the number is hardcoded according to the job configuration (Glue service allocates sometimes more nodes than what is configured).

I had this same issue. I worked around it by running repartition(x) on the dynamic frame before writing to S3. This forces x files per partition and the max parallelism during the write process will be x, reducing S3 the request rate.
I set x to 1 as I wanted 1 parquet file per partition so I'm not sure what the safe upper limit of parallelism you can have is before the request rate gets too high.
I couldn't figure out a nicer way to solve this issue, it's annoying because you have so much idle capacity during the write process.
Hope that helps.

Related

Spark - Failed to load collect frame - "RetryingBlockFetcher - Exception while beginning fetch"

We have a Scala Spark application, that reads something like 70K records from the DB to a data frame, each record has 2 fields.
After reading the data from the DB, we make minor mapping and load this as a broadcast for later usage.
Now, in local environment, there is an exception, timeout from the RetryingBlockFetcher while running the following code:
dataframe.select("id", "mapping_id")
.rdd.map(row => row.getString(0) -> row.getLong(1))
.collectAsMap().toMap
The exception is:
2022-06-06 10:08:13.077 task-result-getter-2 ERROR
org.apache.spark.network.shuffle.RetryingBlockFetcher Exception while
beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /1.1.1.1:62788
at
org.apache.spark.network.client.
TransportClientFactory.createClient(Transpor .tClientFactory.java:253)
at
org.apache.spark.network.client.
TransportClientFactory.createClient(TransportClientFactory.java:195)
at
org.apache.spark.network.netty.
NettyBlockTransferService$$anon$2.
createAndStart(NettyBlockTransferService.scala:122)
In the local environment, I simply create the spark session with local "spark.master"
When I limit the max of records to 20K, it works well.
Can you please help? maybe I need to configure something in my local environment in order that the original code will work properly?
Update:
I tried to change a lot of Spark-related configurations in my local environment, both memory, a number of executors, timeout-related settings, and more, but nothing helped! I just got the timeout after more time...
I realized that the data frame that I'm reading from the DB has 1 partition of 62K records, while trying to repartition with 2 or more partitions the process worked correctly and I managed to map and collect as needed.
Any idea why this solves the issue? Is there a configuration in the spark that can solve this instead of repartition?
Thanks!

Error while uploading a huge .csv file to dynamodb through s3 bucket using lambda function

My funtion is
import boto3
import csv
s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')
def lambda_handler(event, context):
bucket='bucketname'
file_name='filename.csv'
obj = s3.get_object(Bucket=bucket,Key=file_name)
rows = obj['Body'].read()
lines = rows.splitlines()
# print(lines)
reader = csv.reader(lines)
parsed_csv = list(reader)
num_rows = (len(parsed_csv))
table = dynamodb.Table('table_name')
with table.batch_writer() as batch:
for i in range(1,num_rows):
Brand_Name= parsed_csv[i][0]
Assigned_Brand_Name= parsed_csv[i][1]
Brand_URL= parsed_csv[i][2]
Generic_Name= parsed_csv[i][3]
HSN_Code= parsed_csv[i][4]
GST_Rate= parsed_csv[i][5]
Price= parsed_csv[i][6]
Dosage= parsed_csv[i][7]
Package= parsed_csv[i][8]
Size= parsed_csv[i][9]
Size_Unit= parsed_csv[i][10]
Administration_Form= parsed_csv[i][11]
Company= parsed_csv[i][12]
Uses= parsed_csv[i][13]
Side_Effects= parsed_csv[i][14]
How_to_use= parsed_csv[i][15]
How_to_work= parsed_csv[i][16]
FAQs_Downloaded= parsed_csv[i][17]
Alternate_Brands= parsed_csv[i][18]
Prescription_Required= parsed_csv[i][19]
Interactions= parsed_csv[i][20]
batch.put_item(Item={
'Brand Name':Assigned_Brand_Name
'Brand URL':Brand_URL,
'Generic Name':Generic_Name,
'Price':Price,
'Dosage':Dosage,
'Company':Company,
'Uses':Uses,
'Side Effects':Side_Effects,
'How to use':How_to_use,
'How to work':How_to_work,
'FAQs Downloaded?':FAQs_Downloaded,
'Alternate Brands':Alternate_Brands,
'Prescription Required':Prescription_Required,
'Interactions':Interactions
})
Response:
{
"errorMessage": "2020-10-14T11:40:56.792Z ecd63bdb-16bc-4813-afed-cbf3e1fa3625 Task timed out after 3.00 seconds"
}
You haven't specified how many rows there are is your CSV file. "Huge" is pretty subjective so it is possible that your task is timing out due to throttling on the DynamoDB table.
If you are using provisioned capacity on the table you are loading into, make sure you have enough capacity allocated. If you're using on-demand capacity then this might be due to the on-demand partitioning that happens when the table needs to scale up.
Either way, you may want to add some error handling for situations like these and add a delay when you get a timeout, before retrying and resuming.
Something to keep in mind is that writes to Dynamo always take 1 WCU and the maximum capacity a single partition can have is 1000 WCU so as your write throughput increases, the table may undergo multiple splits behind the scenes when you're in on-demand mode. For provisioned mode, you'll have to have allocated enough capacity to begin with, otherwise you'll be limited to writing however many items / second you have allocated write capacity.

Flink s3 read error: Data read has a different length than the expected

Using flink 1.7.0, but also seen on flink 1.8.0. We are getting frequent but somewhat random errors when reading gzipped objects from S3 through the flink .readFile source:
org.apache.flink.fs.s3base.shaded.com.amazonaws.SdkClientException: Data read has a different length than the expected: dataLength=9713156; expectedLength=9770429; includeSkipped=true; in.getClass()=class org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.AmazonS3Client$2; markedSupported=false; marked=0; resetSinceLastMarked=false; markCount=0; resetCount=0
at org.apache.flink.fs.s3base.shaded.com.amazonaws.util.LengthCheckInputStream.checkLength(LengthCheckInputStream.java:151)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:93)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:76)
at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.s3a.S3AInputStream.closeStream(S3AInputStream.java:529)
at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:490)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at org.apache.flink.fs.s3.common.hadoop.HadoopDataInputStream.close(HadoopDataInputStream.java:89)
at java.util.zip.InflaterInputStream.close(InflaterInputStream.java:227)
at java.util.zip.GZIPInputStream.close(GZIPInputStream.java:136)
at org.apache.flink.api.common.io.InputStreamFSInputWrapper.close(InputStreamFSInputWrapper.java:46)
at org.apache.flink.api.common.io.FileInputFormat.close(FileInputFormat.java:861)
at org.apache.flink.api.common.io.DelimitedInputFormat.close(DelimitedInputFormat.java:536)
at org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator$SplitReader.run(ContinuousFileReaderOperator.java:336)
ys
Within a given job, we generally see many / most of the jobs read successfully, but there's pretty much always at least one failure (say out of 50 files).
It seems this error is actually originating from the AWS client, so perhaps flink has nothing to do with it, but I'm hopeful someone might have an insight as to how to make this work reliably.
When the error occurs, it ends up killing the source and canceling all the connected operators. I'm still new to flink, but I would think that this is something that could be recoverable from a previous snapshot? Should I expect that flink will retry reading the file when this kind of exception occurs?
Maybe you can try to add more connection for s3a like
flink:
...
config: |
fs.s3a.connection.maximum: 320

Dataflow's BigQuery inserter thread pool exhausted

I'm using Dataflow to write data into BigQuery.
When the volume gets big and after some time, I get this error from Dataflow:
{
metadata: {
severity: "ERROR"
projectId: "[...]"
serviceName: "dataflow.googleapis.com"
region: "us-east1-d"
labels: {…}
timestamp: "2016-08-19T06:39:54.492Z"
projectNumber: "[...]"
}
insertId: "[...]"
log: "dataflow.googleapis.com/worker"
structPayload: {
message: "Uncaught exception: "
work: "[...]"
thread: "46"
worker: "[...]-08180915-7f04-harness-jv7y"
exception: "java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask#1a1680f rejected from java.util.concurrent.ThreadPoolExecutor#b11a8a1[Shutting down, pool size = 100, active threads = 100, queued tasks = 2316, completed tasks = 1192]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
at java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:681)
at com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:218)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.flushRows(BigQueryIO.java:2155)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.finishBundle(BigQueryIO.java:2113)
at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.finishBundle(DoFnRunnerBase.java:158)
at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:196)
at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.finishBundle(ForwardingParDoFn.java:47)
at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.finish(ParDoOperation.java:62)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:79)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:657)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.access$500(StreamingDataflowWorker.java:86)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:483)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)"
logger: "com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker"
stage: "F10"
job: "[...]"
}
}
It looks like I'm exhausting the thread pool defined in BigQueryTableInserter.java:84. This thread pool has an hardcoded size of 100 threads and cannot be configured.
My questions are:
How could I avoid this error?
Am I doing something wrong?
Shouldn't the pool size be configurable? How can 100 threads be the perfect fit for all needs and machine types?
Here's a bit of context of my usage:
I'm using Dataflow in streaming mode, reading from Kafka using KafkaIO.java
"After some time" is a few hours, (less than 12h)
I'm using 36 workers of type n1-standard-4
I'm reading around 180k messages/s from Kafka (about 130MB/s of network input to my workers)
Messages are grouped together, outputting around 7k messages/s into BigQuery
Dataflow workers are in the us-east1-d zone, BigQuery dataset location is US
You aren't doing anything wrong, though you may need more resources, depending on how long volume stays high.
The streaming BigQueryIO write does some basic batching of inserts by data size and row count. If I understand your numbers correctly, your rows are large enough that each is being submitted to BigQuery in its own request.
It seems that the thread pool for inserts should install ThreadPoolExecutor.CallerRunsPolicy which causes the caller to block and run jobs synchronously when they exceed the capacity of the executor. I've posted PR #393. This will convert the work queue overflow into pipeline backlog as all the processing threads block.
At this point, the issue is standard:
If the backlog is temporary, you'll catch up once volume decreases.
If the backlog grows without bound, then of course it will not solve the issue and you will need to apply more resources. The signs should be the same as any other backlog.
Another point to be aware of is that around 250 rows/second per thread this will exceed the BigQuery quota of 100k updates/second for a table (such failures will be retried, so you might get past them anyhow). If I understand your numbers correctly, you are far from this.

File: 0: Unexpected from Google BigQuery load job

I've a compressed json file (900MB, newline delimited) and load into a new table via bq command and get the load failure:
e.g.
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values mtdataset.mytable gs://xxx/data.gz schema.json
Waiting on bqjob_r3ec270ec14181ca7_000001461d860737_1 ... (1049s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r3ec270ec14181ca7_000001461d860737_1': Too many errors encountered. Limit is: 0.
Failure details:
- File: 0: Unexpected. Please try again.
Why the error?
I tried again with the --max_bad_records, still not useful error message
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values --max_bad_records 2 XXX.test23 gs://XXX/20140521/file1.gz schema.json
Waiting on bqjob_r518616022f1db99d_000001461f023f58_1 ... (319s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r518616022f1db99d_000001461f023f58_1': Unexpected. Please try again.
And also cannot find any useful message in the console.
To BigQuery team, can you have a look using the job ID?
As far I know there are two error sections on a job. There is one error result, and that's what you see now. And there is a second, which should be a stream of errors. This second is important as you could have errors in it, but the actual job might succeed.
Also you can set the --max_bad_records=3 on the BQ tool. Check here for more params https://developers.google.com/bigquery/bq-command-line-tool
You probably have an error that is for each line, so you should try a sample set from this big file first.
Also there is an open feature request to improve the error message, you can star (vote) this ticket https://code.google.com/p/google-bigquery-tools/issues/detail?id=13
This answer will be picked up by the BQ team, so for them I am sharing that: We need an endpoint where we can query based on a jobid, the state, or the stream of errors. It would help a lot to get a full list of errors, it would help debugging the BQ jobs. This could be easy to implement.
I looked up this job in the BigQuery logs, and unfortunately, there isn't any more information than "failed to read" somewhere after about 930 MB have been read.
I've filed a bug that we're dropping important error information in one code path and submitted a fix. However, this fix won't be live until next week, and all that will do is give us more diagnostic information.
Since this is repeatable, it isn't likely a transient error reading from GCS. That means one of two problems: we have trouble decoding the .gz file, or there is something wrong with that particular GCS object.
For the first issue, you could try decompressing the file and re-uploading it as uncompressed. While it may sound like a pain to send gigabytes of data over the network, the good news is that the import will be faster since it can be done in parallel (we can't import a compressed file in parallel since it can only be read sequentially).
For the second issue (which is somewhat less likely) you could try downloading the file yourself to make sure you don't get errors, or try re-uploading the same file and seeing if that works.