JSONStore error when pushing a collection with documents containing files - ibm-mobilefirst

I have a JSONStore for a list of customers, and the user can add documents to those customers using the app.
The list of customers and their data (including the attached documents) must be kept in sync with the backend.
When I add a 774 KB PPT document (that is its binary size; I transform it to Base64) to the JSONStore and execute push(), it fails with this error:
E/CursorWindow(32705): need to grow: mSize = 1048576, size = 1056310, freeSpace() = 1048464, numRows = 1
E/CursorWindow(32705): Attempting to grow window beyond max size (1048576)
E/Cursor(32705): Failed allocating 1056310 bytes for text/blob at 0,1
D/Cursor(32705): finish_program_and_get_row_count row 0
E/CursorWindow(32705): need to grow: mSize = 1048576, size = 1056310, freeSpace() = 1048464, numRows = 1
E/CursorWindow(32705): Attempting to grow window beyond max size (1048576)
E/Cursor(32705): Failed allocating 1056310 bytes for text/blob at 0,1
D/Cursor(32705): finish_program_and_get_row_count row 0
E/CursorWindow(32705): Bad request for field slot 0,0. numRows = 0, numColumns = 4
E/jsonstore-core(32705): error while dispatching action "allDirty"
E/jsonstore-core(32705): java.lang.IllegalStateException: get field slot from row 0 col 0 failed
E/jsonstore-core(32705): at net.sqlcipher.CursorWindow.getLong_native(Native Method)
E/jsonstore-core(32705): at net.sqlcipher.CursorWindow.getLong(CursorWindow.java:381)
E/jsonstore-core(32705): at net.sqlcipher.AbstractWindowedCursor.getLong(AbstractWindowedCursor.java:110)
E/jsonstore-core(32705): at net.sqlcipher.AbstractCursor.moveToPosition(AbstractCursor.java:195)
E/jsonstore-core(32705): at net.sqlcipher.AbstractCursor.moveToNext(AbstractCursor.java:257)
E/jsonstore-core(32705): at android.database.CursorWrapper.moveToNext(CursorWrapper.java:166)
E/jsonstore-core(32705): at com.worklight.androidgap.plugin.storage.AllDirtyActionDispatcher$AllDirtyAction.performAction(AllDirtyActionDispatcher.java:148)
E/jsonstore-core(32705): at com.worklight.androidgap.plugin.storage.AllDirtyActionDispatcher$AllDirtyAction.performAction(AllDirtyActionDispatcher.java:119)
E/jsonstore-core(32705): at com.worklight.androidgap.plugin.storage.DatabaseActionDispatcher$Context.performReadableDatabaseAction(DatabaseActionDispatcher.java:141)
E/jsonstore-core(32705): at com.worklight.androidgap.plugin.storage.AllDirtyActionDispatcher.dispatch(AllDirtyActionDispatcher.java:64)
E/jsonstore-core(32705): at com.worklight.androidgap.plugin.storage.DatabaseActionDispatcher.dispatch(DatabaseActionDispatcher.java:56)
E/jsonstore-core(32705): at com.worklight.androidgap.plugin.storage.BaseActionDispatcher.dispatch(BaseActionDispatcher.java:87)
E/jsonstore-core(32705): at com.worklight.androidgap.plugin.storage.DispatchingPlugin$ActionDispatcherRunnable.run(DispatchingPlugin.java:113)
E/jsonstore-core(32705): at com.worklight.androidgap.plugin.storage.DispatchingPlugin$SerialExecutor$1.run(DispatchingPlugin.java:147)
E/jsonstore-core(32705): at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1076)
E/jsonstore-core(32705): at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:569)
E/jsonstore-core(32705): at java.lang.Thread.run(Thread.java:856)
E/myApp (32705): [wl.jsonstore] {"src":"push","err":8,"msg":"FAILED_TO_GET_UNPUSHED_DOCUMENTS_FROM_DB","col":"Documentos","usr":"jsonstore","doc":{},"res":{}}
I can add the document; the error occurs when executing the push() method.
Everything I have seen on Stack Overflow and in the Information Center about JSONStore says there is no size limit, and I have more than enough free space on my phone.
Any idea?
Thank you.

"Don't store blobs [...] in your database. Store identifiers in your database and put the blobs as files onto the storage." - Source
Cordova has a File API you can use.
Here's a quick example:
//Code to write customer-1-file1.ppt and customer-1-file2.ppt to disk.
//See Cordova's File API.
//Pseudocode to get the blobsCollection and add metadata so the files can be found later.
//This would go inside the success callback for writing the files.
WL.JSONStore.get('blobsCollection')
  .add([{fileName: 'customer-1-file1.ppt'}, {fileName: 'customer-1-file2.ppt'}]);
//Some time has passed...
//Pseudocode to find the documents whose fileName matches %customer-1%
//(% is a wildcard that matches any string; {exact: false} does the fuzzy match).
WL.JSONStore.get('blobsCollection')
  .find({fileName: 'customer-1'}, {exact: false})
  .then(function (listOfFiles) {
    //listOfFiles => [{_id: 1, json: {fileName: 'customer-1-file1.ppt'}},
    //                {_id: 2, json: {fileName: 'customer-1-file2.ppt'}}]
    var firstFile = listOfFiles[0].json.fileName;
    //Code to read firstFile from disk. See Cordova's File API.
  });
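For the parts marked "Code to write ... to disk", here's a minimal sketch using cordova-plugin-file, assuming the document is held as a Base64 string. The helper name and the MIME type are illustrative, not part of any API:
//Minimal sketch: write a Base64 string to disk with cordova-plugin-file.
//writeBlobToDisk is a hypothetical helper; adjust the MIME type as needed.
function writeBlobToDisk(fileName, base64Data, onDone, onError) {
  window.requestFileSystem(LocalFileSystem.PERSISTENT, 0, function (fs) {
    fs.root.getFile(fileName, {create: true, exclusive: false}, function (fileEntry) {
      fileEntry.createWriter(function (writer) {
        writer.onwriteend = onDone;
        writer.onerror = onError;
        //Decode the Base64 string back to bytes before writing.
        var byteString = atob(base64Data);
        var bytes = new Uint8Array(byteString.length);
        for (var i = 0; i < byteString.length; i++) {
          bytes[i] = byteString.charCodeAt(i);
        }
        writer.write(new Blob([bytes], {type: 'application/vnd.ms-powerpoint'}));
      }, onError);
    }, onError);
  }, onError);
}
With this approach, only the small fileName metadata document goes into the JSONStore collection, as shown above.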
JSONStore is backed by SQLite (well, technically SQLCipher which is a wrapper for SQLite that adds data encryption). Read Internal Versus External BLOBs in SQLite. The takeaway is "For BLOBs smaller than 100KB, reads are faster when the BLOBs are stored directly in the database file. For BLOBs larger than 100KB, reads from a separate file are faster".
If you need to store blobs bigger than the default SQLite CursorWindow size (1048576 bytes), I suggest filing a feature request here.
I'll make sure this is mentioned in the documentation.
Note that there's a getPushRequired API you can use to get the list of documents that the push API will try to send to the Worklight Adapter. You will need to send file changes yourself to the Worklight Adapter using WL.Client.invokeProcedure, or directly to a backend using something like jQuery.ajax.
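As a rough sketch of that flow (the adapter name 'customerAdapter' and procedure 'uploadDocuments' are made-up placeholders):
//Sketch: fetch the dirty documents and send them to an adapter yourself.
//'customerAdapter' and 'uploadDocuments' are hypothetical names.
WL.JSONStore.get('Documentos')
  .getPushRequired()
  .then(function (dirtyDocs) {
    WL.Client.invokeProcedure({
      adapter: 'customerAdapter',
      procedure: 'uploadDocuments',
      parameters: [dirtyDocs]
    }, {
      onSuccess: function () {
        //Optionally mark the documents clean here (see the markClean API)
        //so push/getPushRequired stops reporting them.
      },
      onFailure: function (error) {
        console.log(error);
      }
    });
  });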

Related

Why do we need setReadLimit(int) in the AWS S3 Java client?

I am working with the AWS Java S3 library.
This is my code, which uploads a file to S3 using the AWS high-level API.
ClientConfiguration configuration = new ClientConfiguration();
configuration.setUseGzip(true);
configuration.setConnectionTTL(1000 * 60 * 60);
AmazonS3Client amazonS3Client = new AmazonS3Client(configuration);
TransferManager transferManager = new TransferManager(amazonS3Client);
ObjectMetadata objectMetadata = new ObjectMetadata();
objectMetadata.setContentLength(message.getBodyLength());
objectMetadata.setContentType("image/jpg");
transferManager.getConfiguration().setMultipartUploadThreshold(1024 * 10);
PutObjectRequest request = new PutObjectRequest("test", "/image/test", inputStream, objectMetadata);
request.getRequestClientOptions().setReadLimit(1024 * 10);
request.setSdkClientExecutionTimeout(1000 * 60 * 60);
Upload upload = transferManager.upload(request);
upload.waitForCompletion();
I am trying to upload a large file. It works properly, but sometimes I get the error below. I have set readLimit to (1024 * 10).
2019-04-05 06:41:05,679 ERROR [com.demo.AwsS3TransferThread] (Aws-S3-upload) Error in saving File[media/image/osc/54/54ec3f2f-a938-473c-94b7-a55f39aac4a6.png] on S3[demo-test]: com.amazonaws.ResetException: Failed to reset the request input stream; If the request involves an input stream, the maximum stream buffer size can be configured via request.getRequestClientOptions().setReadLimit(int)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.resetRequestInputStream(AmazonHttpClient.java:1221)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1042)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:948)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:661)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:635)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:618)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$300(AmazonHttpClient.java:586)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:573)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:445)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4041)
at com.amazonaws.services.s3.AmazonS3Client.doUploadPart(AmazonS3Client.java:3041)
at com.amazonaws.services.s3.AmazonS3Client.uploadPart(AmazonS3Client.java:3026)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadPartsInSeries(UploadCallable.java:255)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadInParts(UploadCallable.java:189)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.call(UploadCallable.java:121)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:139)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:47)
What is the purpose of readLimit?
How is it useful?
What should I do to avoid this kind of exception?
After researching this for a week, I found that if the file you are uploading is smaller than 48 GB, you can set the readLimit value to 5.01 MB.
AWS divides the file into multiple parts, and each part is 5 MB (if you have not changed the minimum part size); per the AWS specs, the last part can be smaller than 5 MB. So I set readLimit to just over 5 MB, and it solved the issue.
InputStream readLimit purpose:
Marks the current position in this input stream. A subsequent call to the reset method repositions this stream at the last marked position so that subsequent reads re-read the same bytes. The readlimit argument tells this input stream to allow that many bytes to be read before the mark position gets invalidated. The general contract of mark is that, if the method markSupported returns true, the stream somehow remembers all the bytes read after the call to mark and stands ready to supply those same bytes again if and whenever the method reset is called. However, the stream is not required to remember any data at all if more than readlimit bytes are read from the stream before reset is called.

Node AWS.S3 SDK upload timeout

The Node AWS SDK's S3.upload method is not completing multipart uploads for some reason.
A readable stream that receives uploads from a browser is set as the Body (the readable stream can be piped to a file writable stream without any problems).
S3.upload is given the following options object:
{
partSize: 1024*1024*5,
queueSize: 1
}
When trying to upload a ~8.5mb file, the file is completely sent from the browser, but the request returned from S3.upload continually fires 'httpUploadProgress' events that indicate that all bytes have been uploaded. The following is received continually until the error occurs:
progress { loaded: 8832825,
total: 8832825,
part: 1,
key: 'c82d3ef1-5d95-47df-aaa9-2cee48afd702' }
progress { loaded: 8832825,
total: 8832825,
part: 1,
key: 'c82d3ef1-5d95-47df-aaa9-2cee48afd702' }
progress { loaded: 8832825,
total: 8832825,
part: 1,
key: 'c82d3ef1-5d95-47df-aaa9-2cee48afd702' }
RequestTimeout: Your socket connection to the server was not read from
or written to within the timeout period. Idle connections will be
closed.
The progress loaded field shows that it has loaded the total bytes, but the upload is never completed. Even the end event on the readable stream fires.
Console logging in the SDK itself shows that S3.upload consumes all the available data from the readable stream even if the part size is set to 5 MB and the queue size is set to 1.
Do the part size and queue size have an impact on proper usage of S3.upload? How can this problem be investigated further?
I had to use createMultipartUpload and uploadPart for my larger (8 MB) file upload.
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#uploadPart-property
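Roughly, that flow looks like this with the v2 JavaScript SDK (the bucket, key, and part buffers are placeholders; error handling and retries are omitted):
//Rough sketch of a manual multipart upload with the AWS SDK for JavaScript (v2).
//Bucket, key, and buffers are placeholders.
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

async function uploadInParts(bucket, key, buffers) {
  var multipart = await s3.createMultipartUpload({Bucket: bucket, Key: key}).promise();
  var parts = [];
  for (var i = 0; i < buffers.length; i++) {
    //Part numbers start at 1; every part except the last must be at least 5 MB.
    var res = await s3.uploadPart({
      Bucket: bucket,
      Key: key,
      UploadId: multipart.UploadId,
      PartNumber: i + 1,
      Body: buffers[i]
    }).promise();
    parts.push({ETag: res.ETag, PartNumber: i + 1});
  }
  //S3 only assembles the object once the upload is explicitly completed.
  return s3.completeMultipartUpload({
    Bucket: bucket,
    Key: key,
    UploadId: multipart.UploadId,
    MultipartUpload: {Parts: parts}
  }).promise();
}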

What is going wrong with my ETL process?

I'm using GoodData's CloudConnect (based on CloverETL) to read a massive JSON file and write certain elements to a .csv.
Unfortunately, I'm seeing the error pasted below in the console log. Am I running out of memory because of the error, or is insufficient memory the actual error?
ERROR [WatchDog_0] - Component [JSONReader:JSONREADER1] finished with status ERROR.
Java heap space
ERROR [WatchDog_0] - Error details:
org.jetel.exception.JetelRuntimeException: Component [JSONReader:JSONREADER1] finished with status ERROR.
at org.jetel.graph.Node.createNodeException(Node.java:543)
at org.jetel.graph.Node.run(Node.java:522)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
at org.jetel.component.TreeReader$StreamConvertingXPathProcessor.checkThrownException(TreeReader.java:766)
at org.jetel.component.TreeReader$StreamConvertingXPathProcessor.manageThread(TreeReader.java:757)
at org.jetel.component.TreeReader$StreamConvertingXPathProcessor.processInput(TreeReader.java:732)
at org.jetel.component.TreeReader.execute(TreeReader.java:412)
at org.jetel.graph.Node.run(Node.java:493)
... 1 more
Caused by: java.lang.OutOfMemoryError: Java heap space
at net.sf.saxon.tinytree.TinyTree.condense(TinyTree.java:379)
at net.sf.saxon.tinytree.TinyBuilder.close(TinyBuilder.java:177)
at net.sf.saxon.event.ReceivingContentHandler.endDocument(ReceivingContentHandler.java:219)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endDocument(AbstractSAXParser.java:745)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:515)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
at net.sf.saxon.event.Sender.sendSAXSource(Sender.java:404)
at net.sf.saxon.event.Sender.send(Sender.java:193)
at net.sf.saxon.event.Sender.send(Sender.java:50)
at net.sf.saxon.Configuration.buildDocument(Configuration.java:2973)
at net.sf.saxon.sxpath.XPathExpression.evaluate(XPathExpression.java:154)
at org.jetel.component.tree.reader.xml.XmlXPathEvaluator.iterate(XmlXPathEvaluator.java:79)
at org.jetel.component.tree.reader.XPathPushParser.handleContext(XPathPushParser.java:104)
at org.jetel.component.tree.reader.XPathPushParser.parse(XPathPushParser.java:84)
at org.jetel.component.TreeReader$StreamConvertingXPathProcessor$PipeParser.work(TreeReader.java:827)
at org.jetel.graph.runtime.CloverWorker.run(CloverWorker.java:87)
... 1 more
This looks like the second case: the error is caused by insufficient memory for your task.
The error occurred while evaluating (one of) your JSONReader component(s).
The JSON seems to be really huge, and you should consider splitting the task into smaller ones if possible.
Did you run your transformation locally or on the GoodData server?
It is really hard to advise anything specific without knowing the details.
Try using JSONExtract instead of JSONReader - it uses less memory but also reads JSON files.
From the respective help documents:
JSONReader uses DOM, so the whole input is stored in memory and therefore the component can be memory-greedy.
JSONExtract uses SAX instead of DOM, so it uses less memory than JSONReader

blob.uploadtext to the same blob from multiple threads

A beginner question, but I could not find a definitive answer anywhere else.
What will happen if I use blob.uploadtext() to update the same blob at the same time from two different threads? Will the later one just fail, or just overwrite?
Also, what happens if I read from the blob while the uploadtext() is in progress? Will the read return a previous version? Or will the read fail? If the read fails, should I retry?
In my scenario, I have multiple instances of the worker role that may update the same blob (of size 10 MB) from time to time. I do not really care about who wins the write as long as they do not corrupt the blob and reading the blob is not blocked for a long time.
Update: I wrote a sample program to upload a 5 MB file to the same blob reference from three different threads. What I found is this:
1. If the threads reuse the blobReference object, the second and third threads get an MD5-mismatch exception, but the first thread can still finish the upload successfully.
2. If each thread creates its own BlobClient/BlobReference object, all the uploads succeed and the final blob content matches whichever upload finished last.
However, after I increased the upload to a 10 MB file, I am getting a StorageClientException: The specified block list is invalid. What is worse, the blob content is somehow 0 bytes after this exception, so my blob upload failed and wiped out the existing blob content. That seems like a pretty serious problem. Any ideas?
Below is the sample code:
Task[] tasks = new Task[6];
tasks[0] = Task.Factory.StartNew(() => DoUpload(BlobClient, testStreams, 0));
tasks[1] = Task.Factory.StartNew(() => ReadBlob(BlobClient,0));
tasks[2] = Task.Factory.StartNew(() => DoUpload(BlobClient, testStreams, 1));
tasks[3] = Task.Factory.StartNew(() => ReadBlob(BlobClient, 1));
tasks[4] = Task.Factory.StartNew(() => DoUpload(BlobClient, testStreams, 2));
tasks[5] = Task.Factory.StartNew(() => ReadBlob(BlobClient, 2));
Task.WaitAll(tasks);
DoUpload uses the following to upload:
blob.UploadFromStream(streams[i]);
Thanks.

Cascading S3 Sink Tap not being deleted with SinkMode.REPLACE

We are running Cascading with a Sink Tap configured to store in Amazon S3 and were facing some FileAlreadyExistsException errors (see [1]).
This happened only from time to time (roughly 1 run in 100) and was not reproducible.
Digging into the Cascading code, we discovered that Hfs.deleteResource() is called (among others) by BaseFlow.deleteSinksIfNotUpdate().
By the way, we were quite intrigued by the silent NPE (with the comment "hack to get around npe thrown when fs reaches root directory").
From there, we extended the Hfs tap with our own Tap to add more action in the deleteResource() method (see [2]), with a retry mechanism calling getFileSystem(conf).delete directly.
The retry mechanism seemed to bring improvement, but we are still sometimes facing failures (see the example in [3]): it looks like HDFS returns isDeleted=true, but when we then ask directly whether the folder exists, we receive exists=true, which should not happen. The logs also randomly show isDeleted as true or false when the flow succeeds, which suggests the returned value is irrelevant or not to be trusted.
Can anybody share their own S3 experience with such behavior: "the folder should be deleted, but it is not"? We suspect an S3 issue, but could it also be in Cascading or HDFS?
We run on Hadoop Cloudera-cdh3u5 and Cascading 2.0.1-wip-dev.
[1]
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3n://... already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at com.twitter.elephantbird.mapred.output.DeprecatedOutputFormatWrapper.checkOutputSpecs(DeprecatedOutputFormatWrapper.java:75)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:923)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:882)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:882)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:856)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:104)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:174)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.j
[2]
@Override
public boolean deleteResource(JobConf conf) throws IOException {
    LOGGER.info("Deleting resource {}", getIdentifier());
    boolean isDeleted = super.deleteResource(conf);
    LOGGER.info("Hfs Sink Tap isDeleted is {} for {}", isDeleted, getIdentifier());
    Path path = new Path(getIdentifier());
    int retryCount = 0;
    int cumulativeSleepTime = 0;
    int sleepTime = 1000;
    while (getFileSystem(conf).exists(path)) {
        LOGGER.info("Resource {} still exists, it should not... - I will continue to wait patiently...",
            getIdentifier());
        try {
            LOGGER.info("Now I will sleep " + sleepTime / 1000
                + " seconds while trying to delete {} - attempt: {}",
                getIdentifier(), retryCount + 1);
            Thread.sleep(sleepTime);
            cumulativeSleepTime += sleepTime;
            sleepTime *= 2;
        } catch (InterruptedException e) {
            e.printStackTrace();
            LOGGER.error("Interrupted while sleeping trying to delete {} with message {}...",
                getIdentifier(), e.getMessage());
            throw new RuntimeException(e);
        }
        if (retryCount == 0) {
            getFileSystem(conf).delete(getPath(), true);
        }
        retryCount++;
        if (cumulativeSleepTime > MAXIMUM_TIME_TO_WAIT_TO_DELETE_MS) {
            break;
        }
    }
    if (getFileSystem(conf).exists(path)) {
        LOGGER.error("We didn't succeed to delete the resource {}. Throwing now a runtime exception.",
            getIdentifier());
        throw new RuntimeException("Although we waited to delete the resource for "
            + getIdentifier() + ' ' + retryCount
            + " iterations, it still exists - This must be an issue in the underlying storage system.");
    }
    return isDeleted;
}
[3]
INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] at least one sink is marked for delete
INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] sink oldest modified date: Wed Dec 31 23:59:59 UTC 1969
INFO [pool-2-thread-15] (HiveSinkTap.java:148) - Now I will sleep 1 seconds while trying to delete s3n://... - attempt: 1
INFO [pool-2-thread-15] (HiveSinkTap.java:130) - Deleting resource s3n://...
INFO [pool-2-thread-15] (HiveSinkTap.java:133) - Hfs Sink Tap isDeleted is true for s3n://...
ERROR [pool-2-thread-15] (HiveSinkTap.java:175) - We didn't succeed to delete the resource s3n://... Throwing now a runtime exception.
WARN [pool-2-thread-15] (Cascade.java:706) - [...] flow failed: ...
java.lang.RuntimeException: Although we waited to delete the resource for s3n://... 0 iterations, it still exists - This must be an issue in the underlying storage system.
at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:179)
at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:40)
at cascading.flow.BaseFlow.deleteSinksIfNotUpdate(BaseFlow.java:971)
at cascading.flow.BaseFlow.prepare(BaseFlow.java:733)
at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:761)
at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:710)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
First, double check the Cascading compatibility page for supported distributions.
http://www.cascading.org/support/compatibility/
Note that Amazon EMR is listed, as they periodically run the compatibility tests and report the results back.
Second, S3 is an eventually consistent filesystem; HDFS is not. So assumptions about the behavior of HDFS don't carry over to storing data on S3. For example, a rename is really a copy and delete, and the copy can take hours. Amazon has patched their internal distribution to accommodate many of the differences.
Third, there are no directories in S3. They are a hack, supported differently by different S3 interfaces (jets3t vs s3cmd vs ...). This is bound to be problematic considering the prior point.
Fourth, network latency and reliability are critical, especially when communicating with S3. Historically, I've found the Amazon network to be better behaved when manipulating massive datasets on S3 from EMR rather than standard EC2 instances. I also believe there is a patch in EMR that improves matters here as well.
So I'd suggest try running the EMR Apache Hadoop distribution to see if your issues clear up.
When running any jobs on Hadoop that use files in S3, the nuances of eventual consistency must be kept in mind.
I've helped troubleshoot many apps that turned out to have similar race conditions on delete as their root issue, whether they were in Cascading or Hadoop streaming or written directly in Java.
There was discussion at one point of having notifications from S3 after a given key/value pair had been fully deleted, but I haven't kept up on where that feature stands. Otherwise, it's probably best to design systems, whether in Cascading or any other app that uses S3, such that data consumed or produced by a batch workflow gets managed in HDFS, HBase, or a key/value framework (e.g., I have used Redis for this). Then S3 gets used for durable storage, but not for intermediate data.