blob.UploadText() to the same blob from multiple threads - azure-storage

A beginner question but I could not find a definite answer anywhere else.
What will happen if I use blob.UploadText() to update the same blob at the same time from two different threads? Will the later call just fail, or will it simply overwrite?
Also, what happens if I read from the blob while UploadText() is in progress? Will the read return a previous version, or will it fail? If the read fails, should I retry?
In my scenario, I have multiple instances of the worker role that may update the same blob (of size 10 MB) from time to time. I do not really care about who wins the write as long as they do not corrupt the blob and reading the blob is not blocked for a long time.
Update: I wrote a sample program to upload a 5 MB file to the same blob reference from 3 different threads. What I found is this:
1. If the threads reuse the same blob reference object, the second and third threads get an MD5-mismatch exception, but the first thread can still finish the upload successfully.
2. If each thread creates its own BlobClient/blob reference object, all the uploads succeed and the final blob content matches whichever upload finished last.
However, after I increased the upload to a 10 MB file, I am getting a StorageClientException: "The specified block list is invalid." What is worse, the blob content is somehow 0 bytes after this exception, so the failed upload wiped out the blob content. That seems like a pretty serious problem. Any ideas?
Below is the sample code:
Task[] tasks = new Task[6];
tasks[0] = Task.Factory.StartNew(() => DoUpload(BlobClient, testStreams, 0));
tasks[1] = Task.Factory.StartNew(() => ReadBlob(BlobClient,0));
tasks[2] = Task.Factory.StartNew(() => DoUpload(BlobClient, testStreams, 1));
tasks[3] = Task.Factory.StartNew(() => ReadBlob(BlobClient, 1));
tasks[4] = Task.Factory.StartNew(() => DoUpload(BlobClient, testStreams, 2));
tasks[5] = Task.Factory.StartNew(() => ReadBlob(BlobClient, 2));
Task.WaitAll(tasks);
DoUpload uses the following to upload:
blob.UploadFromStream(streams[i]);
Thanks.

Related

Error while uploading a huge .csv file to dynamodb through s3 bucket using lambda function

My function is:
import boto3
import csv

s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')

def lambda_handler(event, context):
    bucket = 'bucketname'
    file_name = 'filename.csv'
    obj = s3.get_object(Bucket=bucket, Key=file_name)
    rows = obj['Body'].read().decode('utf-8')  # decode the bytes so csv.reader gets strings
    lines = rows.splitlines()
    # print(lines)
    reader = csv.reader(lines)
    parsed_csv = list(reader)
    num_rows = len(parsed_csv)
    table = dynamodb.Table('table_name')
    with table.batch_writer() as batch:
        for i in range(1, num_rows):
            Brand_Name = parsed_csv[i][0]
            Assigned_Brand_Name = parsed_csv[i][1]
            Brand_URL = parsed_csv[i][2]
            Generic_Name = parsed_csv[i][3]
            HSN_Code = parsed_csv[i][4]
            GST_Rate = parsed_csv[i][5]
            Price = parsed_csv[i][6]
            Dosage = parsed_csv[i][7]
            Package = parsed_csv[i][8]
            Size = parsed_csv[i][9]
            Size_Unit = parsed_csv[i][10]
            Administration_Form = parsed_csv[i][11]
            Company = parsed_csv[i][12]
            Uses = parsed_csv[i][13]
            Side_Effects = parsed_csv[i][14]
            How_to_use = parsed_csv[i][15]
            How_to_work = parsed_csv[i][16]
            FAQs_Downloaded = parsed_csv[i][17]
            Alternate_Brands = parsed_csv[i][18]
            Prescription_Required = parsed_csv[i][19]
            Interactions = parsed_csv[i][20]
            batch.put_item(Item={
                'Brand Name': Assigned_Brand_Name,
                'Brand URL': Brand_URL,
                'Generic Name': Generic_Name,
                'Price': Price,
                'Dosage': Dosage,
                'Company': Company,
                'Uses': Uses,
                'Side Effects': Side_Effects,
                'How to use': How_to_use,
                'How to work': How_to_work,
                'FAQs Downloaded?': FAQs_Downloaded,
                'Alternate Brands': Alternate_Brands,
                'Prescription Required': Prescription_Required,
                'Interactions': Interactions
            })
Response:
{
"errorMessage": "2020-10-14T11:40:56.792Z ecd63bdb-16bc-4813-afed-cbf3e1fa3625 Task timed out after 3.00 seconds"
}
You haven't specified how many rows there are in your CSV file. "Huge" is pretty subjective, so it is possible that your task is timing out due to throttling on the DynamoDB table.
If you are using provisioned capacity on the table you are loading into, make sure you have enough capacity allocated. If you're using on-demand capacity then this might be due to the on-demand partitioning that happens when the table needs to scale up.
Either way, you may want to add some error handling for situations like these and add a delay when you get a timeout, before retrying and resuming.
Something to keep in mind is that writes to Dynamo always take at least 1 WCU, and the maximum capacity a single partition can have is 1000 WCU, so as your write throughput increases the table may undergo multiple splits behind the scenes when you're in on-demand mode. For provisioned mode, you'll have to have allocated enough capacity to begin with; otherwise you'll be limited to writing however many items per second your allocated write capacity permits.
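As a rough illustration of the retry-with-backoff advice above, here is a sketch using the AWS SDK for Java v2 (the question's code is Python, so this only shows the pattern; the table name, batch size, and backoff values are assumptions, not anything from the original post):
import java.util.List;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.BatchWriteItemRequest;
import software.amazon.awssdk.services.dynamodb.model.BatchWriteItemResponse;
import software.amazon.awssdk.services.dynamodb.model.ProvisionedThroughputExceededException;
import software.amazon.awssdk.services.dynamodb.model.WriteRequest;

public class RetryingBatchWriter {
    // Writes one batch (at most 25 WriteRequests), retrying throttled or
    // unprocessed items with exponential backoff instead of failing outright.
    static void writeWithBackoff(DynamoDbClient dynamo, String tableName,
                                 List<WriteRequest> batch) throws InterruptedException {
        Map<String, List<WriteRequest>> pending = Map.of(tableName, batch);
        long delayMs = 100;
        while (!pending.isEmpty()) {
            try {
                BatchWriteItemResponse response = dynamo.batchWriteItem(
                        BatchWriteItemRequest.builder().requestItems(pending).build());
                pending = response.unprocessedItems();   // retry only what DynamoDB could not absorb
            } catch (ProvisionedThroughputExceededException throttled) {
                // The table (or a hot partition) is throttling; keep the same batch and back off.
            }
            if (!pending.isEmpty()) {
                Thread.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, 5_000);  // exponential backoff, capped at 5 seconds
            }
        }
    }
}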

Is there a way to control the number of bytes read in Reactor Netty's TcpClient?

I am using TcpClient to connect to a simple TCP echo server. Messages consist of the message size in 4 bytes followed by the message itself. For instance, to send the message "hello", the server will expect "0005hello", and respond with "0005hello".
When testing under load (approximately 300+ concurrent users), adjacent requests sometimes result in responses "piling up", e.g. sending "0004good" followed by "0003day" might result in the client receiving "0004good0003" followed by "day".
In a conventional, non-WebFlux-based TCP client, one would normally read the first 4 bytes from the socket into a buffer, determine the length of the message N, then read the following N bytes from the socket into a buffer, before returning the response. Is it possible to achieve such fine-grained control, perhaps by using TcpClient's underlying Channel?
I have also considered the approach of accumulating responses in some data structure (Queue, StringBuffer, etc.) and having a daemon parse the result, but this has not had the desired performance in practice.
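For reference, the conventional blocking approach described above might look roughly like this in plain Java (a sketch only; it assumes the 4-byte length prefix is ASCII digits, as in the examples):
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class BlockingFrameReader {
    // Reads one length-prefixed frame: 4 ASCII digits, then that many bytes of payload.
    public static String readFrame(InputStream in) throws IOException {
        DataInputStream din = new DataInputStream(in);
        byte[] lengthBytes = new byte[4];
        din.readFully(lengthBytes);                                   // e.g. "0005"
        int length = Integer.parseInt(new String(lengthBytes, StandardCharsets.US_ASCII));
        byte[] body = new byte[length];
        din.readFully(body);                                          // e.g. "hello"
        return new String(body, StandardCharsets.US_ASCII);
    }
}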
I solved this by adding a handler of type LengthFieldBasedFrameDecoder to the Connection:
TcpClient.create()
    .host(ADDRESS)
    .port(PORT)
    .doOnConnected((connection) -> {
        connection.addHandler("parseLengthFromFirstFourBytes",
                new LengthFieldBasedFrameDecoder(9999, 0, 4) {
                    @Override
                    protected long getUnadjustedFrameLength(ByteBuf buf, int offset, int length, ByteOrder order) {
                        ByteBuf lengthBuffer = buf.copy(0, 4);
                        byte[] messageLengthBytes = new byte[4];
                        lengthBuffer.readBytes(messageLengthBytes);
                        String messageLengthString = new String(messageLengthBytes);
                        return Long.parseLong(messageLengthString);
                    }
                });
    })
    .connect()
    .subscribe();
This solves the issue with the caveat that responses still "pile up" (as described in the question) when the application is subjected to sufficient load.

Snowflake COPY INTO from JSON - ON_ERROR = CONTINUE - Weird Issue

I am trying to load a JSON file from the staging area (S3) into a stage table using the COPY INTO command.
Table:
create or replace TABLE stage_tableA (
RAW_JSON VARIANT NOT NULL
);
Copy Command:
copy into stage_tableA from @stgS3/filename_45.gz file_format = (format_name = 'file_json')
I got the below error when executing the above (sample provided):
SQL Error [100069] [22P02]: Error parsing JSON: document is too large, max size 16777216 bytes If you would like to continue loading
when an error is encountered, use other values such as 'SKIP_FILE' or
'CONTINUE' for the ON_ERROR option. For more information on loading
options, please run 'info loading_data' in a SQL client.
When I set ON_ERROR=CONTINUE, records were partially loaded, i.e. up to the record that exceeded the max size. But no records after the error record were loaded.
Was ON_ERROR=CONTINUE supposed to skip only the record that exceeds the max size and load the records before and after it?
Yes, the ON_ERROR=CONTINUE skips the offending line and continues to load the rest of the file.
To help us provide more insight, can you answer the following:
How many records are in your file?
How many got loaded?
At what line was the error first encountered?
You can find this information using the COPY_HISTORY() table function.
Try setting the option strip_outer_array = true on the file format and attempt the load again.
The considerations for loading large size semi-structured data are documented in the below article:
https://docs.snowflake.com/en/user-guide/semistructured-considerations.html
I partially agree with Chris. The ON_ERROR=CONTINUE option only helps if there are in fact multiple JSON objects in the file. If it's one massive object, then with ON_ERROR=CONTINUE you would simply get no error and no record loaded.
If you know your JSON payload is smaller than 16 MB, then definitely try strip_outer_array = true. Also, if your JSON has a lot of nulls ("NULL") as values, use STRIP_NULL_VALUES = TRUE, as this will slim down your payload as well. Hope that helps.

Why do we need setReadLimit(int) in the AWS S3 Java client

I am working with the AWS S3 Java library.
This is my code, which uploads a file to S3 using the high-level API of the AWS SDK:
ClientConfiguration configuration = new ClientConfiguration();
configuration.setUseGzip(true);
configuration.setConnectionTTL(1000 * 60 * 60);
AmazonS3Client amazonS3Client = new AmazonS3Client(configuration);
TransferManager transferManager = new TransferManager(amazonS3Client);
ObjectMetadata objectMetadata = new ObjectMetadata();
objectMetadata.setContentLength(message.getBodyLength());
objectMetadata.setContentType("image/jpg");
transferManager.getConfiguration().setMultipartUploadThreshold(1024 * 10);
PutObjectRequest request = new PutObjectRequest("test", "/image/test", inputStream, objectMetadata);
request.getRequestClientOptions().setReadLimit(1024 * 10);
request.setSdkClientExecutionTimeout(1000 * 60 * 60);
Upload upload = transferManager.upload(request);
upload.waitForCompletion();
I am trying to upload a large file. It works properly, but sometimes I get the error below. I have set readLimit to (1024 * 10).
2019-04-05 06:41:05,679 ERROR [com.demo.AwsS3TransferThread] (Aws-S3-upload) Error in saving File[media/image/osc/54/54ec3f2f-a938-473c-94b7-a55f39aac4a6.png] on S3[demo-test]: com.amazonaws.ResetException: Failed to reset the request input stream; If the request involves an input stream, the maximum stream buffer size can be configured via request.getRequestClientOptions().setReadLimit(int)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.resetRequestInputStream(AmazonHttpClient.java:1221)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1042)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:948)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:661)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:635)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:618)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$300(AmazonHttpClient.java:586)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:573)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:445)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4041)
at com.amazonaws.services.s3.AmazonS3Client.doUploadPart(AmazonS3Client.java:3041)
at com.amazonaws.services.s3.AmazonS3Client.uploadPart(AmazonS3Client.java:3026)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadPartsInSeries(UploadCallable.java:255)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadInParts(UploadCallable.java:189)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.call(UploadCallable.java:121)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:139)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:47)
What is the purpose of readLimit?
How is it useful?
What should I do to avoid this kind of exception?
After researching this for a week,
I have found that if the file you are uploading is smaller than roughly 48 GB (the most that 10,000 parts of 5 MB each can cover), you can set the readLimit to about 5.01 MB.
This is because AWS splits the file into multiple parts, and each part is 5 MB (if you have not changed the minimum part size); per the AWS specs, only the last part may be smaller than 5 MB. So I set the readLimit slightly above the 5 MB part size and it solved the issue.
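As a minimal sketch of that idea, reusing the transferManager and request objects from the question's code (the exact amount of headroom above the part size is an assumption, not an AWS-documented value):
int partSize = 5 * 1024 * 1024;                            // 5 MB, the default minimum part size
transferManager.getConfiguration().setMinimumUploadPartSize(partSize);
transferManager.getConfiguration().setMultipartUploadThreshold(partSize);

// The read limit should cover a bit more than one part, so the SDK can
// mark/reset the stream and re-send a whole part if a part upload fails.
request.getRequestClientOptions().setReadLimit(partSize + 1024);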
The purpose of InputStream readlimit:
Marks the current position in this input stream. A subsequent call to the reset method repositions this stream at the last marked position so that subsequent reads re-read the same bytes. The readlimit argument tells this input stream to allow that many bytes to be read before the mark position becomes invalid. The general contract of mark is that, if the method markSupported returns true, the stream somehow remembers all the bytes read after the call to mark and stands ready to supply those same bytes again if and whenever the method reset is called. However, the stream is not required to remember any data at all if more than readlimit bytes are read from the stream before reset is called.
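A small self-contained sketch of that mark/reset contract using plain java.io streams (the buffer and limit sizes here are arbitrary illustration values):
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

public class MarkResetDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = new byte[64];
        // Small internal buffer so the mark really is dropped once we read past the limit.
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(data), 8);

        in.mark(8);                // promise to read at most ~8 bytes before calling reset
        in.read(new byte[4]);      // stay within the limit
        in.reset();                // fine: the marked position is still remembered

        in.mark(8);
        in.read(new byte[32]);     // read far past the limit; the mark may be discarded
        in.reset();                // with the JDK's BufferedInputStream this throws
                                   // IOException: "Resetting to invalid mark"
    }
}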

Cascading S3 Sink Tap not being deleted with SinkMode.REPLACE

We are running Cascading with a Sink Tap being configured to store in Amazon S3 and were facing some FileAlreadyExistsException (see [1]).
This happened only from time to time (roughly 1 run in 100) and was not reproducible.
Digging into the Cascading code, we discovered that Hfs.deleteResource() is called (among others) by BaseFlow.deleteSinksIfNotUpdate().
By the way, we were quite intrigued by the silently swallowed NPE (with the comment "hack to get around npe thrown when fs reaches root directory").
From there, we extended the Hfs tap with our own Tap to add more action in the deleteResource() method (see [2]), with a retry mechanism calling getFileSystem(conf).delete directly.
The retry mechanism seemed to bring improvement, but we are still sometimes facing failures (see the example in [3]): it looks like HDFS returns isDeleted=true, but when we immediately check whether the folder exists, we get exists=true, which should not happen. The logs also randomly show isDeleted as true or false when the flow succeeds, which suggests the returned value is irrelevant or cannot be trusted.
Can anybody share their own S3 experience with such behavior: "the folder should be deleted, but it is not"? We suspect an S3 issue, but could it also be in Cascading or HDFS?
We run on Hadoop Cloudera-cdh3u5 and Cascading 2.0.1-wip-dev.
[1]
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3n://... already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at com.twitter.elephantbird.mapred.output.DeprecatedOutputFormatWrapper.checkOutputSpecs(DeprecatedOutputFormatWrapper.java:75)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:923)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:882)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:882)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:856)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:104)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:174)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.j
[2]
@Override
public boolean deleteResource(JobConf conf) throws IOException {
    LOGGER.info("Deleting resource {}", getIdentifier());
    boolean isDeleted = super.deleteResource(conf);
    LOGGER.info("Hfs Sink Tap isDeleted is {} for {}", isDeleted, getIdentifier());
    Path path = new Path(getIdentifier());
    int retryCount = 0;
    int cumulativeSleepTime = 0;
    int sleepTime = 1000;
    while (getFileSystem(conf).exists(path)) {
        LOGGER.info("Resource {} still exists, it should not... - I will continue to wait patiently...",
                getIdentifier());
        try {
            LOGGER.info("Now I will sleep " + sleepTime / 1000
                    + " seconds while trying to delete {} - attempt: {}",
                    getIdentifier(), retryCount + 1);
            Thread.sleep(sleepTime);
            cumulativeSleepTime += sleepTime;
            sleepTime *= 2;
        } catch (InterruptedException e) {
            e.printStackTrace();
            LOGGER.error("Interrupted while sleeping trying to delete {} with message {}...",
                    getIdentifier(), e.getMessage());
            throw new RuntimeException(e);
        }
        if (retryCount == 0) {
            getFileSystem(conf).delete(getPath(), true);
        }
        retryCount++;
        if (cumulativeSleepTime > MAXIMUM_TIME_TO_WAIT_TO_DELETE_MS) {
            break;
        }
    }
    if (getFileSystem(conf).exists(path)) {
        LOGGER.error("We didn't succeed to delete the resource {}. Throwing now a runtime exception.",
                getIdentifier());
        throw new RuntimeException("Although we waited to delete the resource for "
                + getIdentifier() + ' ' + retryCount
                + " iterations, it still exists - This must be an issue in the underlying storage system.");
    }
    return isDeleted;
}
[3]
INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] at least one sink is marked for delete
INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] sink oldest modified date: Wed Dec 31 23:59:59 UTC 1969
INFO [pool-2-thread-15] (HiveSinkTap.java:148) - Now I will sleep 1 seconds while trying to delete s3n://... - attempt: 1
INFO [pool-2-thread-15] (HiveSinkTap.java:130) - Deleting resource s3n://...
INFO [pool-2-thread-15] (HiveSinkTap.java:133) - Hfs Sink Tap isDeleted is true for s3n://...
ERROR [pool-2-thread-15] (HiveSinkTap.java:175) - We didn't succeed to delete the resource s3n://... Throwing now a runtime exception.
WARN [pool-2-thread-15] (Cascade.java:706) - [...] flow failed: ...
java.lang.RuntimeException: Although we waited to delete the resource for s3n://... 0 iterations, it still exists - This must be an issue in the underlying storage system.
at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:179)
at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:40)
at cascading.flow.BaseFlow.deleteSinksIfNotUpdate(BaseFlow.java:971)
at cascading.flow.BaseFlow.prepare(BaseFlow.java:733)
at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:761)
at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:710)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
First, double check the Cascading compatibility page for supported distributions.
http://www.cascading.org/support/compatibility/
Note Amazon EMR is listed as they periodically run the compatibility tests and report the results back.
Second, S3 is an eventually consistent filesystem; HDFS is not. So assumptions about the behavior of HDFS don't carry over to storing data in S3. For example, a rename is really a copy and delete, and the copy can take hours. Amazon has patched their internal distribution to accommodate many of the differences.
Third, there are no real directories in S3. They are a hack, supported differently by different S3 interfaces (jets3t vs s3cmd vs ...). This is bound to be problematic considering the prior point.
Fourth, network latency and reliability are critical, especially when communicating with S3. Historically I've found the Amazon network to be better behaved when manipulating massive datasets on S3 from EMR than from standard EC2 instances. I also believe there is a patch in EMR that improves matters here as well.
So I'd suggest trying the EMR Apache Hadoop distribution to see if your issues clear up.
When running any jobs on Hadoop that use files in S3, the nuances of eventual consistency must be kept in mind.
I've helped troubleshoot many apps which turned out to have similar race conditions for delete as their root issue -- whether they were in Cascading or Hadoop streaming or written directly in Java.
There was discussion at one point of having notifications from S3 after a given key/value pair had been fully deleted. I haven't kept up on where that feature stood. Otherwise, it's probably best to design systems -- again, whether in Cascading or any other app that uses S3 -- such that data which is consumed or produced by a batch workflow gets managed in HDFS or HBase or a key/value framework (e.g., I have used Redis for this). Then S3 gets used for durable storage, but not for intermediate data.