Dataflow's BigQuery inserter thread pool exhausted - google-bigquery

I'm using Dataflow to write data into BigQuery.
When the volume gets big and after some time, I get this error from Dataflow:
{
metadata: {
severity: "ERROR"
projectId: "[...]"
serviceName: "dataflow.googleapis.com"
region: "us-east1-d"
labels: {…}
timestamp: "2016-08-19T06:39:54.492Z"
projectNumber: "[...]"
}
insertId: "[...]"
log: "dataflow.googleapis.com/worker"
structPayload: {
message: "Uncaught exception: "
work: "[...]"
thread: "46"
worker: "[...]-08180915-7f04-harness-jv7y"
exception: "java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@1a1680f rejected from java.util.concurrent.ThreadPoolExecutor@b11a8a1[Shutting down, pool size = 100, active threads = 100, queued tasks = 2316, completed tasks = 1192]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
at java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:681)
at com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:218)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.flushRows(BigQueryIO.java:2155)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.finishBundle(BigQueryIO.java:2113)
at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.finishBundle(DoFnRunnerBase.java:158)
at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:196)
at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.finishBundle(ForwardingParDoFn.java:47)
at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.finish(ParDoOperation.java:62)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:79)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:657)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.access$500(StreamingDataflowWorker.java:86)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:483)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)"
logger: "com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker"
stage: "F10"
job: "[...]"
}
}
It looks like I'm exhausting the thread pool defined in BigQueryTableInserter.java:84. This thread pool has a hardcoded size of 100 threads and cannot be configured.
My questions are:
How could I avoid this error?
Am I doing something wrong?
Shouldn't the pool size be configurable? How can 100 threads be the perfect fit for all needs and machine types?
Here's a bit of context of my usage:
I'm using Dataflow in streaming mode, reading from Kafka using KafkaIO.java
"After some time" is a few hours, (less than 12h)
I'm using 36 workers of type n1-standard-4
I'm reading around 180k messages/s from Kafka (about 130MB/s of network input to my workers)
Messages are grouped together, outputting around 7k messages/s into BigQuery
Dataflow workers are in the us-east1-d zone, BigQuery dataset location is US

You aren't doing anything wrong, though you may need more resources, depending on how long volume stays high.
The streaming BigQueryIO write does some basic batching of inserts by data size and row count. If I understand your numbers correctly, your rows are large enough that each is being submitted to BigQuery in its own request.
It seems that the thread pool for inserts should install ThreadPoolExecutor.CallerRunsPolicy, which causes the caller to block and run jobs synchronously when they exceed the capacity of the executor. I've posted PR #393. This will convert the work queue overflow into pipeline backlog as all the processing threads block.
At this point, the issue is standard:
If the backlog is temporary, you'll catch up once volume decreases.
If the backlog grows without bound, then of course it will not solve the issue and you will need to apply more resources. The signs should be the same as any other backlog.
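The caller-runs idea mentioned above is not specific to Java. As a rough, language-neutral illustration (this is not the Dataflow SDK code, and the class name CallerRunsExecutor is hypothetical), a minimal Python sketch of an executor that runs a task in the submitting thread once its bounded capacity is used up could look like this:

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class CallerRunsExecutor:
    """Hypothetical bounded executor: when saturated, submit() runs the task inline."""

    def __init__(self, max_workers, max_pending):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        # One permit for every task that may be running or queued in the pool.
        self._slots = threading.BoundedSemaphore(max_workers + max_pending)

    def submit(self, fn, *args, **kwargs):
        if self._slots.acquire(blocking=False):
            try:
                future = self._pool.submit(fn, *args, **kwargs)
            except Exception:
                self._slots.release()
                raise
            # Give the permit back once the pooled task has finished.
            future.add_done_callback(lambda _f: self._slots.release())
            return future
        # Saturated: run in the caller's thread instead of rejecting the task.
        # The caller is busy for the duration, which throttles the producer.
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

Because the submitting thread does the work itself when the pool is full, overflow shows up as slower producers (backlog) rather than as a RejectedExecutionException-style failure.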
Another point to be aware of: at around 250 rows/second per thread, this will exceed the BigQuery quota of 100k updates/second for a table (such failures will be retried, so you might get past them anyhow). If I understand your numbers correctly, you are far from this.

Related

Error while uploading a huge .csv file to dynamodb through s3 bucket using lambda function

My function is:
import boto3
import csv

s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')

def lambda_handler(event, context):
    bucket = 'bucketname'
    file_name = 'filename.csv'
    obj = s3.get_object(Bucket=bucket, Key=file_name)
    # On Python 3, csv.reader needs text, so decode the bytes from S3 (UTF-8 assumed)
    rows = obj['Body'].read().decode('utf-8')
    lines = rows.splitlines()
    # print(lines)
    reader = csv.reader(lines)
    parsed_csv = list(reader)
    num_rows = len(parsed_csv)
    table = dynamodb.Table('table_name')
    with table.batch_writer() as batch:
        # Row 0 is skipped (presumably the header row)
        for i in range(1, num_rows):
            Brand_Name = parsed_csv[i][0]
            Assigned_Brand_Name = parsed_csv[i][1]
            Brand_URL = parsed_csv[i][2]
            Generic_Name = parsed_csv[i][3]
            HSN_Code = parsed_csv[i][4]
            GST_Rate = parsed_csv[i][5]
            Price = parsed_csv[i][6]
            Dosage = parsed_csv[i][7]
            Package = parsed_csv[i][8]
            Size = parsed_csv[i][9]
            Size_Unit = parsed_csv[i][10]
            Administration_Form = parsed_csv[i][11]
            Company = parsed_csv[i][12]
            Uses = parsed_csv[i][13]
            Side_Effects = parsed_csv[i][14]
            How_to_use = parsed_csv[i][15]
            How_to_work = parsed_csv[i][16]
            FAQs_Downloaded = parsed_csv[i][17]
            Alternate_Brands = parsed_csv[i][18]
            Prescription_Required = parsed_csv[i][19]
            Interactions = parsed_csv[i][20]
            batch.put_item(Item={
                'Brand Name': Assigned_Brand_Name,
                'Brand URL': Brand_URL,
                'Generic Name': Generic_Name,
                'Price': Price,
                'Dosage': Dosage,
                'Company': Company,
                'Uses': Uses,
                'Side Effects': Side_Effects,
                'How to use': How_to_use,
                'How to work': How_to_work,
                'FAQs Downloaded?': FAQs_Downloaded,
                'Alternate Brands': Alternate_Brands,
                'Prescription Required': Prescription_Required,
                'Interactions': Interactions
            })
Response:
{
"errorMessage": "2020-10-14T11:40:56.792Z ecd63bdb-16bc-4813-afed-cbf3e1fa3625 Task timed out after 3.00 seconds"
}
You haven't specified how many rows there are in your CSV file. "Huge" is pretty subjective, so it is possible that your task is timing out due to throttling on the DynamoDB table.
If you are using provisioned capacity on the table you are loading into, make sure you have enough capacity allocated. If you're using on-demand capacity then this might be due to the on-demand partitioning that happens when the table needs to scale up.
Either way, you may want to add some error handling for situations like these and add a delay when you get a timeout, before retrying and resuming.
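As an illustration of the "delay, then retry" suggestion, here is a minimal sketch of exponential backoff with jitter. The helper name retry_with_backoff and its parameters are hypothetical; with boto3 you would typically pass an is_retryable check that looks for throttling errors (for example a ClientError with the ProvisionedThroughputExceededException code), and the exact defaults here are assumptions to tune. Retries only help if the function's own timeout (3 seconds in the error above) leaves room for them.

```python
import random
import time


def retry_with_backoff(operation, is_retryable, max_attempts=5,
                       base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call operation(); on a retryable error, wait and try again.

    The wait is a random value between 0 and min(max_delay, base_delay * 2**attempt),
    so concurrent writers do not all retry at the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            # Give up on the last attempt or on errors that retrying cannot fix.
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The sleep parameter is injectable only so the helper can be exercised without real waiting; in the Lambda function the default time.sleep would be used.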
Something to keep in mind is that every write to DynamoDB consumes at least 1 WCU (one WCU per 1 KB of item size, rounded up) and the maximum capacity a single partition can have is 1000 WCU, so as your write throughput increases, the table may undergo multiple splits behind the scenes when you're in on-demand mode. For provisioned mode, you'll need to have allocated enough capacity to begin with; otherwise you'll be limited to writing only as many items per second as your allocated write capacity allows.

AWS Glue ETL "Failed to delete key: target_folder/_temporary" caused by S3 exception "Please reduce your request rate"

Glue job configured to max 10 nodes capacity, 1 job in parallel and no retries on failure is giving an error "Failed to delete key: target_folder/_temporary", and according to stacktrace the issue is that S3 service starts blocking the Glue requests due to the amount of requests: "AmazonS3Exception: Please reduce your request rate."
Note: The issue is not with IAM as the IAM role that glue job is using has permissions to delete objects in S3.
I found a suggestion for this issue on GitHub with a proposition of reducing the worker count: https://github.com/aws-samples/aws-glue-samples/issues/20
"I've had success reducing the number of workers."
However, I don't think that 10 is too many workers and would even like to actually increase the worker count to 20 to speed up the ETL.
Has anyone who faced this issue had any success? How would I go about solving it?
Shortened stacktrace:
py4j.protocol.Py4JJavaError: An error occurred while calling o151.pyWriteDynamicFrame.
: java.io.IOException: Failed to delete key: target_folder/_temporary
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:665)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:332)
...
Caused by: java.io.IOException: 1 exceptions thrown from 12 batch deletes
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:384)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1372)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:663)
...
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: ...
Part of Glue ETL python script (just in case):
datasource0 = glueContext.create_dynamic_frame.from_catalog(database="database", table_name="table_name", transformation_ctx="datasource0")
... relationalizing, renaming and etc. Transforming from DynamicDataframe to PySpark dataframe and back.
partition_ready = Map.apply(frame=processed_dataframe, f=map_date_partition, transformation_ctx="map_date_partition")
datasink = glueContext.write_dynamic_frame.from_options(frame=partition_ready, connection_type="s3", connection_options={"path": "s3://bucket/target_folder", "partitionKeys": ["year", "month", "day", "hour"]}, format="parquet", transformation_ctx="datasink")
job.commit()
Solved (kind of), thanks to user ayazabbas
I accepted the answer that pointed me in the right direction. One of the things I was searching for was how to combine many small files into bigger chunks, and repartition does exactly that. Instead of repartition(x) I used coalesce(x), where x is 4 * worker count of the Glue job, so that the Glue service could allocate one data chunk to each available vCPU. It might make sense to set x to at least 2 * 4 * worker_count to account for slower and faster transformation parts, if they exist.
Another thing I did was reduce the number of columns by which I was partitioning the data before writing it to S3 from 5 to 4.
The current drawback is that I haven't figured out how to find, from within the Glue script, the worker count that the Glue service allocates for the job, so the number is hardcoded according to the job configuration (the Glue service sometimes allocates more nodes than configured).
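The sizing rule from this update (x = 4 * worker count, optionally doubled) can be written down as a small helper. This is only a sketch: target_partitions and its parameter names are hypothetical, the 4 vCPUs per worker comes from the post, and the worker count is still hardcoded from the job configuration as described.

```python
def target_partitions(worker_count, vcpus_per_worker=4, slack_factor=1):
    """Number of partitions to coalesce to before writing to S3.

    slack_factor=2 gives the 2 * 4 * worker_count variant suggested for
    jobs whose transformation steps run at uneven speeds.
    """
    if min(worker_count, vcpus_per_worker, slack_factor) < 1:
        raise ValueError("all arguments must be >= 1")
    return slack_factor * vcpus_per_worker * worker_count


# Worker count taken from the job configuration (10 in the question).
x = target_partitions(worker_count=10)

# In the Glue script this would be applied roughly as follows (not run here):
#   partition_ready = partition_ready.coalesce(x)
```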
I had this same issue. I worked around it by running repartition(x) on the dynamic frame before writing to S3. This forces x files per partition and the max parallelism during the write process will be x, reducing the S3 request rate.
I set x to 1 as I wanted 1 parquet file per partition so I'm not sure what the safe upper limit of parallelism you can have is before the request rate gets too high.
I couldn't figure out a nicer way to solve this issue, it's annoying because you have so much idle capacity during the write process.
Hope that helps.

How to interpret the RabbitMQ Message stats?

I want to get and historize the queue metrics "Enqueued, Dequeued and Size" (terminology I previously met in ActiveMQ).
The moving charts provided in the management plugin are not enough for the monitoring that I need to do.
So with RabbitMQ, I'm getting data from https://rabbitmq-server:15672/api/queues/myvhost
This returns JSON. For a queue, I can obtain real-life production data like:
"messages":0, // for "Size"
"message_stats":{
"deliver_get":171528, // for "Dequeued"
"ack":162348,
"redeliver":9513,
"deliver_no_ack":0,
"deliver":171528,
"get":0,
"publish":51293 // for "Enqueued"
(...)
I'm in particular surprised by the publish counter:
Its value can even decrease between two measurements taken a couple of minutes apart! (see sample chart around 17:00)
As you can see on my data, the deliver_get is significantly larger than the publish.
https://my-rabbitmq:15672/doc/stats.html doesn't give a lot of details that could explain what I actually notice.
Also, under the message_stats object that I obtain, I'm missing some counters like confirm and return which could be related to the enqueuing.
Are there relationships between these metrics? (like deliver_get + messages = redeliver + publish, but that one doesn't work with my figures)
Is there other, more detailed documentation about these metrics?

StackExchange.Redis System.TimeoutException

I got this timeout exception suddenly when I try to persist a range of data, it was working before and I didn't do any changes:
Timeout performing HMSET {key}, inst: 0, mgr: ExecuteSelect, err:
never, queue: 2, qu: 1, qs: 1, qc: 0, wr: 1, wq: 1, in: 0, ar: 0,
clientName: {machine-name}, serverEndpoint:
Unspecified/localhost:6379, keyHashSlot: 2689, IOCP:
(Busy=0,Free=1000,Min=4,Max=1000), WORKER:
(Busy=0,Free=2047,Min=4,Max=2047), Local-CPU: 100% (Please take a look
at this article for some common client-side issues that can cause
timeouts:
https://github.com/StackExchange/StackExchange.Redis/tree/master/Docs/Timeouts.md)
I'm using Redis on windows.
In your timeout error message, I see Local-CPU: 100%. This is the CPU on your client that is calling into Redis server. You might want to look into what is causing the high CPU load on your client.
This article describes why high CPU usage can lead to client-side timeouts. https://gist.github.com/JonCole/db0e90bedeb3fc4823c2#high-cpu-usage
So, I battled with this issue for a few days and almost gave up. Like @Amr Reda said, breaking large sets into smaller ones might work, but that's not optimal.
In my case, I was trying to move 27,000 records into Redis and I kept encountering the issue.
To resolve the issue, increase the SyncTimeout value in your Redis connection string. It's set by default to 1000 ms, i.e. 1 second. Large datasets typically take longer to add.
I found out what was causing the issue: I was trying to bulk insert into a hash. What I did was chunk the inserted list into smaller ones.
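The chunking approach is independent of the client library. A minimal sketch in Python (the helper name chunked and the batch size of 1000 are assumptions, not values from the answers) shows the idea of splitting one large list into bounded batches, each of which would then be sent as its own hash-set call:

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be >= 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# 27,000 records split into batches of 1,000 gives 27 separate writes.
batches = list(chunked(range(27000), 1000))
```

Smaller batches keep each individual command short, so a single call is less likely to exceed the synchronous timeout.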
Quick suggestions that worked in my case, using a .NET console project with very high concurrency across multiple threads (around 30,000).
In Program.cs, I added some ThreadPool settings:
int newWorkerThreadsPerCore = 50, newIOCPPerCore = 100;
ThreadPool.SetMinThreads(newWorkerThreadsPerCore, newIOCPPerCore);
Also, I had to change everything from:
var redisValue = dbCache.StringGet("SOMETHING");
To:
var redisValue = dbCache.StringGetAsync("SOMETHING").Result;
Even though the two might look almost the same (since you always end up waiting for a result), if you use the non-async version and a single thread hits a Redis timeout, all the other 29,999 threads will wait for Redis to time out too, while with the async version only that single thread times out.

ServerXmlHttpRequest hanging sometimes when doing a POST

I have a job that periodically does some work involving ServerXmlHttpRequest to perform an HTTP POST. The job runs every 60 seconds.
And normally it runs without issue. But there's about a 1 in 50,000 chance (every two or three months) that it will hang:
IXMLHttpRequest http = new ServerXmlHttpRequest();
http.open("POST", deleteUrl, false, "", "");
http.send(stuffToDelete); <---hang
When it hangs, not even the Task Scheduler (with the option enabled to kill the job if it takes longer than 3 minutes to run) can end the task. I have to connect to the remote customer's network, get on the server, and use Task Manager to kill the process.
And then it's good for another month or three.
Eventually i started using Task Manager to create a process dump, so i could analyze where the hang is. After five crash dumps (over the last 11 months or so) i get a consistent picture:
ntdll.dll!_NtWaitForMultipleObjects@20()
KERNELBASE.dll!_WaitForMultipleObjectsEx@20()
user32.dll!MsgWaitForMultipleObjectsEx()
user32.dll!_MsgWaitForMultipleObjects@20()
urlmon.dll!CTransaction::CompleteOperation(int fNested) Line 2496
urlmon.dll!CTransaction::StartEx(IUri * pIUri, IInternetProtocolSink * pOInetProtSink, IInternetBindInfo * pOInetBindInfo, unsigned long grfOptions, unsigned long dwReserved) Line 4453 C++
urlmon.dll!CTransaction::Start(const wchar_t * pwzURL, IInternetProtocolSink * pOInetProtSink, IInternetBindInfo * pOInetBindInfo, unsigned long grfOptions, unsigned long dwReserved) Line 4515 C++
msxml3.dll!URLMONRequest::send()
msxml3.dll!XMLHttp::send()
Contoso.exe!FrobImporter.TFrobImporter.DeleteFrobs Line 971
Contoso.exe!FrobImporter.TFrobImporter.ImportCore Line 1583
Contoso.exe!FrobImporter.TFrobImporter.RunImport Line 1070
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.HandleFrobImport Line 433
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.CoreExecute Line 71
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.Execute Line 84
Contoso.exe!Contoso.Contoso Line 167
kernel32.dll!@BaseThreadInitThunk@12()
ntdll.dll!__RtlUserThreadStart()
ntdll.dll!__RtlUserThreadStart@8()
So i do a ServerXmlHttpRequest.send, and it never returns. It will sit there for days (causing the system to miss financial transactions, until come Sunday night i get a call that it's broken).
It is of no help unless someone knows how to debug code, but the registers in the stalled thread at the time of the dump are:
EAX 00000030
EBX 00000000
ECX 00000000
EDX 00000000
ESI 002CAC08
EDI 00000001
EIP 732A08A7
ESP 0018F684
EBP 0018F6C8
EFL 00000000
Windows Server 2012 R2
Microsoft IIS/8.5
Default timeouts of ServerXmlHttpRequest
You can use serverXmlHttpRequest.setTimeouts(...) to configure the four classes of timeouts:
resolveTimeout: The value is applied to mapping host names (such as "www.microsoft.com") to IP addresses; the default value is infinite, meaning no timeout.
connectTimeout: A long integer. The value is applied to establishing a communication socket with the target server, with a default timeout value of 60 seconds.
sendTimeout: The value applies to sending an individual packet of request data (if any) on the communication socket to the target server. A large request sent to a server will normally be broken up into multiple packets; the send timeout applies to sending each packet individually. The default value is 30 seconds.
receiveTimeout: The value applies to receiving a packet of response data from the target server. Large responses will be broken up into multiple packets; the receive timeout applies to fetching each packet of data off the socket. The default value is 30 seconds.
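The defaults above are quoted in seconds, while setTimeouts is, to my understanding, called with millisecond values in the order resolve, connect, send, receive. A small sketch of that conversion follows; the helper name set_timeouts_args is hypothetical, and the 5-second resolve timeout is an arbitrary example value, not a documented default.

```python
def set_timeouts_args(resolve_s, connect_s, send_s, receive_s):
    """Convert timeouts in seconds to the millisecond arguments for setTimeouts,
    in the order (resolve, connect, send, receive)."""
    seconds = (resolve_s, connect_s, send_s, receive_s)
    if any(s < 0 for s in seconds):
        raise ValueError("timeouts must not be negative")
    return tuple(int(round(s * 1000)) for s in seconds)


# Explicit resolve timeout (example value) plus the documented 60/30/30 defaults.
args = set_timeouts_args(5, 60, 30, 30)

# With a COM object in hand this would be passed on roughly as (not run here):
#   http.setTimeouts(*args)
```

Note that the send and receive values apply per packet, so they bound the wait for each packet rather than the duration of the whole request.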
The KB305053 (a server that decides to keep the connection open will cause serverXmlHttpRequest to wait for the connection to close) seems like it plausibly could be the issue. But the 30 second default timeout would have taken care of that.
Possible workaround - Add myself to a Job
The Windows Task Scheduler is unable to terminate the task, even though the option is enabled to do so.
I will look into using the Windows Job API to add my own process to a job, and use SetInformationJobObject to set a time limit on my process:
CreateJobObject
AssignProcessToJobObject
SetInformationJobObject
to limit my process to three minutes of execution time:
PerProcessUserTimeLimit
If LimitFlags specifies
JOB_OBJECT_LIMIT_PROCESS_TIME, this member is the per-process
user-mode execution time limit, in 100-nanosecond ticks. Otherwise,
this member is ignored.
The system periodically checks to determine
whether each process associated with the job has accumulated more
user-mode time than the set limit. If it has, the process is
terminated.
If the job is nested, the effective limit is the most
restrictive limit in the job chain.
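Since PerProcessUserTimeLimit is expressed in 100-nanosecond ticks, the three-minute limit has to be converted before it is passed to SetInformationJobObject. A short arithmetic sketch (the helper name seconds_to_ticks is hypothetical):

```python
# One tick is 100 nanoseconds, so there are 10,000,000 ticks per second.
TICKS_PER_SECOND = 10_000_000


def seconds_to_ticks(seconds):
    """Convert a duration in seconds to 100-nanosecond ticks."""
    return int(round(seconds * TICKS_PER_SECOND))


# Three minutes of user-mode time, as a value for PerProcessUserTimeLimit.
three_minutes_in_ticks = seconds_to_ticks(3 * 60)
```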
Although since Task Scheduler uses Job objects to also limit a task's time, i'm not hopeful that the Job Object can limit a job either.
Edit: Job objects cannot limit a process by process time - only user time. And with a process idle waiting for an object, it will not accumulate any user time - certainly not three minutes worth.
Bonus Reading
How can a ServerXMLHTTP GET request hang? (GET, not POST)
KB305053: ServerXMLHTTP Stops Responding When You Send a POST Request (which says the timeout should expire; where mine does not)
MS Forums: oHttp.Send - Hangs (HEAD, not POST)
MS Forums: ASP to test SOAP WebService using MSXML2.ServerXMLHTTP Send hangs
CC to MS Support Forums
Consider switching to a newer, supported API.
msxml6.dll using MSXML2.ServerXMLHTTP.6.0
winhttpcom.dll using WinHttp.WinHttpRequest.5.1.
The msxml3.dll library is no longer supported and is only kept around for compatibility reasons. Plus, there were a number of security and stability improvements included with msxml4.dll (and newer) that you are missing out on.