Node AWS.S3 SDK upload timeout - amazon-s3

Using the Node AWS SDK S3.upload method is not completing multi part uploads for some reason.
A readable stream that receives uploads from a browser is set as the Body (the readable stream is able to be be piped to file writableStream without any problems).
S3.upload is given the following options object:
{
partSize: 1024*1024*5,
queueSize: 1
}
When trying to upload a ~8.5mb file, the file is completely sent from the browser, but the request returned from S3.upload continually fires 'httpUploadProgress' events that indicate that all bytes have been uploaded. The following is received continually until the error occurs:
progress { loaded: 8832825,
total: 8832825,
part: 1,
key: 'c82d3ef1-5d95-47df-aaa9-2cee48afd702' }
progress { loaded: 8832825,
total: 8832825,
part: 1,
key: 'c82d3ef1-5d95-47df-aaa9-2cee48afd702' }
progress { loaded: 8832825,
total: 8832825,
part: 1,
key: 'c82d3ef1-5d95-47df-aaa9-2cee48afd702' }
RequestTimeout: Your socket connection to the server was not read from
or written to within the timeout period. Idle connections will be
closed.
The progress loaded field shows that it has loaded the total bytes, but the upload is never completed. Even the end event on the readable stream fires.
Console logging in the SDK itself shows that S3.upload consumes all the available data from the readable stream even if the part size is set to 5mb and the queue size is set to 1.
Does the part size and queue size have an impact on proper usage of S3.upload? How can this problem be investigated further?

I had to use createMultipartUpload and uploadPart for my larger (8Mb) file upload.
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#uploadPart-property

Related

Redis stream XReadGroup not reading new messages even if `BLOCK` parameter is 0

I am using redis stream and XReadGroup for reading messages from stream. I have set block parameter as 0.
currently my code look like this
data, err := w.rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
Group: w.opts.group,
Consumer: w.opts.consumer,
Streams: []string{w.opts.streamName, ">"},
Count: 1,
Block: 0,
}).Result()
I am currently facing a problem that if I keep the application (involving this code) idle for 10-12 hours, XReadGroup is not able to read new messages, if I restart the application then all the new messages consumed at once. Is there any solution for this problem?
You can have a block time of let's say 10s, it does not change anything (I guess the code you provided is in a while(true)).
From my experience you can keep the app idle for days and it still works.
I don't really know why but I guess it has to do with the "constant" back and forth "reseting" the connection.

Request timed out error on copying data in azure data factory

I am receiving the below error on running a copy activity in my adf pipeline. My source and sink are cosmos db containers in different subscription. ADF pipeline is created in subscription which has target(sink) cosmos db container.
Error:
Error code 2200 Failure type User configuration issue
Type=Microsoft.Azure.Documents.RequestTimeoutException,Message=Request
timed out. ActivityId: 0d2b8ebb-090d-43eb-8494-f82e53b3134b, Request
URI: /dbs/ZLQDAA==/colls/ZLQDAIez1wo=/docs, RequestStats: , SDK:
documentdb-dotnet-sdk/2.5.1 Host/64-bit
MicrosoftWindowsNT/6.2.9200.0,Source=Microsoft.Azure.Documents.Client,''Type=System.Threading.Tasks.TaskCanceledException,Message=A
task was
canceled.,Source=mscorlib,''Type=Microsoft.Azure.Documents.RequestTimeoutException,Message=Request
timed out. ActivityId: 0d2b8ebb-090d-43eb-8494-f82e53b3134b, Request
URI: /dbs/ZLQDAA==/colls/ZLQDAIez1wo=/docs, RequestStats: , SDK:
documentdb-dotnet-sdk/2.5.1 Host/64-bit
MicrosoftWindowsNT/6.2.9200.0,Source=Microsoft.Azure.Documents.Client,''Type=System.Threading.Tasks.TaskCanceledException,Message=A
task was
canceled.,Source=mscorlib,''Type=Microsoft.Azure.Documents.RequestTimeoutException,Message=Request
timed out. ActivityId: 0d2b8ebb-090d-43eb-8494-f82e53b3134b, Request
URI: /dbs/ZLQDAA==/colls/ZLQDAIez1wo=/docs, RequestStats: , SDK:
documentdb-dotnet-sdk/2.5.1 Host/64-bit
MicrosoftWindowsNT/6.2.9200.0,Source=Microsoft.Azure.Documents.Client,''Type=System.Threading.Tasks.TaskCanceledException,Message=A
task was
canceled.,Source=mscorlib,''Type=Microsoft.Azure.Documents.RequestTimeoutException,Message=Request
timed out. ActivityId: 0d2b8ebb-090d-43eb-8494-f82e53b3134b, Request
URI: /dbs/ZLQDAA==/colls/ZLQDAIez1wo=/docs, RequestStats: , SDK:
documentdb-dotnet-sdk/2.5.1 Host/64-bit
MicrosoftWindowsNT/6.2.9200.0,Source=Microsoft.Azure.Documents.Client,''Type=System.Threading.Tasks.TaskCanceledException,Message=A
task was canceled.,Source=mscorlib,'
As per official documentation
Cosmos DB limits single request’s size to 2MB. The formula is Request Size = Single Document Size * Write Batch Size. If you hit error saying “Request size is too large.”, reduce the writeBatchSize value in copy sink configuration.
Page size: The number of documents per page of the query result. Default is "-1" which uses the service dynamic page up to 1000.
Throughput: Set an optional value for the number of RUs you'd like to apply to your CosmosDB collection for each execution of this data flow during the read operation. Minimum is 400.

GCP - Message from PubSub to BigQuery

I need to get the data from my pubsub message and insert into bigquery.
What I have:
const topicName = "-----topic-name-----";
const data = JSON.stringify({ foo: "bar" });
// Imports the Google Cloud client library
const { PubSub } = require("#google-cloud/pubsub");
// Creates a client; cache this for further use
const pubSubClient = new PubSub();
async function publishMessageWithCustomAttributes() {
// Publishes the message as a string, e.g. "Hello, world!" or JSON.stringify(someObject)
const dataBuffer = Buffer.from(data);
// Add two custom attributes, origin and username, to the message
const customAttributes = {
origin: "nodejs-sample",
username: "gcp",
};
const messageId = await pubSubClient
.topic(topicName)
.publish(dataBuffer, customAttributes);
console.log(`Message ${messageId} published.`);
}
publishMessageWithCustomAttributes().catch(console.error);
I need to get the data/attributes from this message and query in BigQuery, anyone can help me?
Thaks in advance!
In fact, there is 2 solutions to consume the messages: either a message per message, or in bulk.
Firstly, before going in detail, and because you will perform BigQuery calls (or Facebook API calls), you will spend a lot of the processing time to wait the API response.
Message per Message
If you have an acceptable volume of message, you can perform a message per message processing. You have 2 solutions here:
You can handle each message with Cloud Functions. Set the minimal amount of memory to the functions (128Mb) to limit the CPU cost and thus the global cost. Indeed, because you will wait a lot, don't spend expensive CPU cost to do nothing! Ok, you will process slowly the data when they will be there but, it's a tradeoff.
Create Cloud Function on the topic, or a Push Subscription to call a HTTP triggered Cloud Functions
You can also handle request concurrently with Cloud Run. Cloud Run can handle up to 250 requests concurrently (in preview), and because you will wait a lot, it's perfectly suitable. If you need more CPU and memory, you can increase these value to 4CPU and 8Gb of memory. It's my preferred solution.
Bulk processing is possible if you are able to easily manage multi-cpu multi-(light)thread development. It's easy in Go. Concurrency in Node is also easy (await/async) but I don't know if it's multi-cpu capable or only single-cpu. Anyway, the principle is the following
Create a pull subscription on PubSub topic
Create a Cloud Run (better for multi-cpu, but also work with App Engine or Cloud Functions) that will listen the pull subscription for a while (let's say 10 minutes)
For each message pulled, an async process is performed: get the data/attribute, make the call to BigQuery, ack the message
After the timeout of the pull connexion, close the message listening, finish the current message processing and exit gracefully (return 200 HTTP code)
Create a Cloud Scheduler that call every 10 minutes the Cloud Run service. Set the timeout to 15 minutes and discard retries.
Deploy the Cloud Run service with a timeout of 15 minutes.
This solution offers a better message throughput processing (you can process more than 250 message per Cloud Run service), but don't have a real advantage because you are limited by the API call latency.
EDIT 1
Code sample
// For pubsunb triggered function
exports.logMessageTopic = (message, context) => {
console.log("Message Content")
console.log(Buffer.from(message.data, 'base64').toString())
console.log("Attribute list")
for (let key in message.attributes) {
console.log(key + " -> " + message.attributes[key]);
};
};
// For push subscription
exports.logMessagePush = (req, res) => {
console.log("Message Content")
console.log(Buffer.from(req.body.message.data, 'base64').toString())
console.log("Attribute list")
for (let key in req.body.message.attributes) {
console.log(key + " -> " + req.body.message.attributes[key]);
};
};

Why we need to setReadLimit(int) in AWS S3 Java client

I am working on AWS Java S3 Library.
This is my code which is uploading the file to s3 using High-Level API of AWS.
ClientConfiguration configuration = new ClientConfiguration();
configuration.setUseGzip(true);
configuration.setConnectionTTL(1000 * 60 * 60);
AmazonS3Client amazonS3Client = new AmazonS3Client(configuration);
TransferManager transferManager = new TransferManager(amazonS3Client);
ObjectMetadata objectMetadata = new ObjectMetadata();
objectMetadata.setContentLength(message.getBodyLength());
objectMetadata.setContentType("image/jpg");
transferManager.getConfiguration().setMultipartUploadThreshold(1024 * 10);
PutObjectRequest request = new PutObjectRequest("test", "/image/test", inputStream, objectMetadata);
request.getRequestClientOptions().setReadLimit(1024 * 10);
request.setSdkClientExecutionTimeout(1000 * 60 * 60);
Upload upload = transferManager.upload(request);
upload.waitForCompletion();
I am trying to upload a large file. It is working properly but sometimes I am getting below error. I have set readLimit as (1024*10).
2019-04-05 06:41:05,679 ERROR [com.demo.AwsS3TransferThread] (Aws-S3-upload) Error in saving File[media/image/osc/54/54ec3f2f-a938-473c-94b7-a55f39aac4a6.png] on S3[demo-test]: com.amazonaws.ResetException: Failed to reset the request input stream; If the request involves an input stream, the maximum stream buffer size can be configured via request.getRequestClientOptions().setReadLimit(int)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.resetRequestInputStream(AmazonHttpClient.java:1221)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1042)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:948)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:661)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:635)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:618)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$300(AmazonHttpClient.java:586)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:573)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:445)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4041)
at com.amazonaws.services.s3.AmazonS3Client.doUploadPart(AmazonS3Client.java:3041)
at com.amazonaws.services.s3.AmazonS3Client.uploadPart(AmazonS3Client.java:3026)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadPartsInSeries(UploadCallable.java:255)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadInParts(UploadCallable.java:189)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.call(UploadCallable.java:121)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:139)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:47)
What is the perpose of readLimit?
How it will usefull?
What should I do to avoid this kind of exception?
After researching on this for 1 week,
I have found that if your uploading file size is less than 48GB then you can set readLimit value 5.01MB.
because AWS dividing file into multiple parts and each part size is value is 5MB(If you have not changed minimum part size value). as per the AWS specs, last part size is less than 5MB. so I have set readLimit 5MB and it solves the issue.
InputStream readLimit purpose:
Marks the current position in this input stream. A subsequent call to the reset method repositions this stream at the last marked position so that subsequent reads re-read the same bytes.Readlimit arguments tells this input stream to allow that many bytes to be read before the mark position gets invalidated. The general contract of mark is that, if the method markSupported returns, the stream somehow remembers all the bytes read after the call to mark and stands ready to supply those same bytes again if and whenever the method reset is called. However, the stream is not required to remember any data at all if more than readLimit bytes are read from the stream before reset is called.

Why does S3 sometimes return HTTP 206 when the whole file has been downloaded?

I've had my S3 bucket logging into another bucket using Server Access Log Format for a while. For the Operation: REST.GET.OBJECT sometimes an HTTP Status: 206 Partial Content is returned because the whole file wasn't downloaded. But I can see in the logs that sometimes when HTTP Status: 206 is returned the whole file was downloaded. I've removed some fields to make it simpler:
Operation: REST.GET.OBJECT
Request-URI: "GET [File] HTTP/1.1"
HTTP Status: 206
Error Code: -
Bytes Sent: 76431360
Object Size: 76431360
Total Time: 16276
Turn-Around Time: 190
What happened here? If the Bytes Sent are the same as the Object Size then how can the source report this as a Partial Content?
The 206 status has nothing to do with incomplete file transfer. The server determines what status code to send before it starts sending the response body, so it would have to predict future to know whether it will be able to send the whole file.
Instead, what 206 status code actually means is that the following three things happened at once:
the client sent Range header in its request;
the server decided to honour it and send exactly the bytes requested, not the whole file;
the server was actually able to do so — the range was valid and satisfiable.
In this case, the standard requires the server to reply with the 206 status code, not 200, regardless whether the range happen to cover exactly the whole file or only a part of it.