Kafka Streams failing with Materialized.`as`(STORE_NAME) - kotlin

I am creating a streams application to enrich data from a topic, stored in a table. Everything works perfectly until I add the Materialized.as API. Does anyone have any idea why this happens? I have scoured the web for resources explaining this.
It always runs against a fresh Kafka instance.
val source = builder.stream<String, Example>(INPUT)
val newValues = source
    .mapValues { key, value ->
        NewValue()
    }
    .toTable(Materialized.`as`(STORE_NAME))
If I remove it, the code works fine.
It throws the following stacktrace and kills the stream:
java.lang.IllegalStateException: Tried to lookup lag for unknown task 0_1
at org.apache.kafka.streams.processor.internals.assignment.ClientState.lagFor(ClientState.java:306)
at org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor$$Lambda$1311/000000000000000000.applyAsLong(Unknown Source)
at java.base/java.util.Comparator.lambda$comparingLong$6043328a$1(Comparator.java:511)
at java.base/java.util.Comparator$$Lambda$1312/000000000000000000.compare(Unknown Source)
at java.base/java.util.Comparator.lambda$thenComparing$36697e65$1(Comparator.java:216)
at java.base/java.util.Comparator$$Lambda$178/000000000000000000.compare(Unknown Source)
at java.base/java.util.TreeMap.put(TreeMap.java:550)
at java.base/java.util.TreeSet.add(TreeSet.java:255)
at java.base/java.util.AbstractCollection.addAll(AbstractCollection.java:352)
at java.base/java.util.TreeSet.addAll(TreeSet.java:312)
at org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.getPreviousTasksByLag(StreamsPartitionAssignor.java:1265)
at org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.assignTasksToThreads(StreamsPartitionAssignor.java:1179)
at org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.computeNewAssignment(StreamsPartitionAssignor.java:930)
at org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.assign(StreamsPartitionAssignor.java:394)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.performAssignment(ConsumerCoordinator.java:590)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.onJoinLeader(AbstractCoordinator.java:689)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.access$1400(AbstractCoordinator.java:111)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$JoinGroupResponseHandler.handle(AbstractCoordinator.java:602)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$JoinGroupResponseHandler.handle(AbstractCoordinator.java:575)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:1132)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:1107)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:206)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:169)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:129)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:602)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:412)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:297)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1301)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1237)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1210)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:766)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:624)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:551)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:510)
Edit:
It turned out that state was being kept between runs of the Testcontainers tests, so I used the KafkaStreams#cleanUp method to resolve the conflicting stream state.
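For reference, a minimal sketch of that fix (assumptions: a Testcontainers KafkaContainer named kafka, the StreamsBuilder from above, and an illustrative application id; not the exact test code):
import java.util.Properties
import org.apache.kafka.streams.KafkaStreams
import org.apache.kafka.streams.StreamsConfig

val props = Properties().apply {
    put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-test")          // illustrative id
    put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, kafka.bootstrapServers)  // assumed Testcontainers KafkaContainer
}

val streams = KafkaStreams(builder.build(), props)
streams.cleanUp()  // deletes the local state directory (e.g. RocksDB data) left over from a previous run
streams.start()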

Related

How can I run various tests for Quarkus Kafka Stream with Testcontainer?

Following the steps described here https://quarkus.io/guides/kafka#testing-using-a-kafka-broker it's possible to define Quarkus tests using a "real" Kafka broker.
@QuarkusTest instantiates all the resources needed, including KafkaStreams, and during the individual tests (@Test) we can limit ourselves to producing records for the input topics and consuming results from the output topics.
The current stream topology includes groupBy, aggregation, and join steps.
During testing, the problem is that after the first test, all other tests see "dirty aggregates". A kafkaStreams.cleanUp() might solve the problem, but it produces an error:
Caused by: java.lang.IllegalStateException: Cannot clean up while running.
at org.apache.kafka.streams.KafkaStreams.cleanUp(KafkaStreams.java:1486)
at eu.reply.lea.visibility.unieuro.stream.TopologyProducerIT.setup(TopologyProducerIT.java:70)
at eu.reply.lea.visibility.unieuro.stream.TopologyProducerIT_Bean.create(Unknown Source)
at eu.reply.lea.visibility.unieuro.stream.TopologyProducerIT_Bean.get(Unknown Source)
at eu.reply.lea.visibility.unieuro.stream.TopologyProducerIT_Bean.get(Unknown Source)
at io.quarkus.arc.impl.InstanceImpl.getBeanInstance(InstanceImpl.java:225)
at io.quarkus.arc.impl.InstanceImpl.getInternal(InstanceImpl.java:211)
at io.quarkus.arc.impl.InstanceImpl.get(InstanceImpl.java:97)
... 73 more
The question is: what is the correct approach to run Kafka Streams tests in Quarkus? The "traditional" approach of performing a test, rolling back, and continuing with the next one does not seem applicable.
Also the following approach fails:
// test 1
kafkaStreams.close();
kafkaStreams.cleanUp();
kafkaStreams.start();
// test 2
kafkaStreams.close();
kafkaStreams.cleanUp();
kafkaStreams.start();
// ...
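For what it's worth, cleanUp() may only be called before start() or after close(), and a closed KafkaStreams instance cannot be started again, so a close/cleanUp/start sequence on the same instance cannot work. A rough sketch of a per-test reset pattern (in Kotlin to match the main question; the streams, topology, and props names are assumptions standing in for whatever the test setup provides):
import org.apache.kafka.streams.KafkaStreams
import org.junit.jupiter.api.BeforeEach

// assumed field on the test class: var streams: KafkaStreams? = null
@BeforeEach
fun resetStreams() {
    streams?.close()                            // stop the instance left over from the previous test
    streams = KafkaStreams(topology, props).apply {
        cleanUp()                               // allowed here: this new instance has not been started yet
        start()
    }
}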

Issues while running with BigQueryIO.Write.Method.STORAGE_WRITE_API

We are testing with STORAGE_WRITE_API to insert data into BigQuery. We've seen several errors/warnings in our Dataflow pipeline (written in Java). It may work well in the beginning, but eventually the system lag increases, the pipeline stops processing any data from Pub/Sub, and the unacked messages pile up.
One common warning is:
Operation ongoing in step insertTableRowsToBigQuery/StorageApiLoads/StorageApiWriteSharded/Write Records for at least 03h35m00s without outputting or completing in state process
at java.base@11.0.9/jdk.internal.misc.Unsafe.park(Native Method)
at java.base@11.0.9/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
at java.base@11.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
at java.base@11.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
at java.base@11.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
at java.base@11.0.9/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
at app//org.apache.beam.sdk.io.gcp.bigquery.RetryManager$Callback.await(RetryManager.java:153)
at app//org.apache.beam.sdk.io.gcp.bigquery.RetryManager$Operation.await(RetryManager.java:136)
at app//org.apache.beam.sdk.io.gcp.bigquery.RetryManager.await(RetryManager.java:256)
at app//org.apache.beam.sdk.io.gcp.bigquery.RetryManager.run(RetryManager.java:248)
at app//org.apache.beam.sdk.io.gcp.bigquery.StorageApiWritesShardedRecords$WriteRecordsDoFn.process(StorageApiWritesShardedRecords.java:453)
at app//org.apache.beam.sdk.io.gcp.bigquery.StorageApiWritesShardedRecords$WriteRecordsDoFn$DoFnInvoker.invokeProcessElement(Unknown Source)
Other exceptions we've seen:
Got error io.grpc.StatusRuntimeException: FAILED_PRECONDITION: Stream is closed
Got error io.grpc.StatusRuntimeException: ALREADY_EXIST
PodSandboxStatus of sandbox "..." for pod "df-...-pipeline-...-harness-qw4j_default(...)" error: rpc error: code = Unknown desc = Error: No such container
Code sample:
toBq.apply("insertTableRowsToBigQuery",
    BigQueryIO
        .writeTableRows()
        .to(String.format("%s:%s.%s", PROJECT_ID, DATASET, table))
        .withTriggeringFrequency(Duration.standardSeconds(options.getTriggeringFrequency()))
        .withNumStorageWriteApiStreams(options.getNumStorageWriteApiStreams())
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
There was a production issue related to connections getting stuck after streaming 10 MB, which has since been fixed. If you try again, it should work.

Getting sometimes NullPointerException while saving into cassandra

I have the following method to write into Cassandra. Sometimes it saves the data fine.
When I run it again, it sometimes throws a NullPointerException.
I'm not sure what is going wrong here... Can you please help me?
@throws(classOf[IOException])
def writeDfToCassandra(o_model_family: DataFrame, keyspace: String, columnFamilyName: String) = {
  logger.info(s"writeDfToCassandra")
  o_model_family.write.format("org.apache.spark.sql.cassandra")
    .options(Map("table" -> columnFamilyName, "keyspace" -> keyspace))
    .mode(SaveMode.Append)
    .save()
}
18/10/29 05:23:56 ERROR BMValsProcessor: java.lang.NullPointerException
at java.util.regex.Matcher.getTextLength(Matcher.java:1283)
at java.util.regex.Matcher.reset(Matcher.java:309)
at java.util.regex.Matcher.<init>(Matcher.java:229)
at java.util.regex.Pattern.matcher(Pattern.java:1093)
at scala.util.matching.Regex.findFirstIn(Regex.scala:388)
at org.apache.spark.util.Utils$$anonfun$redact$1$$anonfun$apply$15.apply(Utils.scala:2698)
at org.apache.spark.util.Utils$$anonfun$redact$1$$anonfun$apply$15.apply(Utils.scala:2698)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.util.Utils$$anonfun$redact$1.apply(Utils.scala:2698)
at org.apache.spark.util.Utils$$anonfun$redact$1.apply(Utils.scala:2696)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.util.Utils$.redact(Utils.scala:2696)
at org.apache.spark.util.Utils$.redact(Utils.scala:2663)
at org.apache.spark.sql.internal.SQLConf$$anonfun$redactOptions$1.apply(SQLConf.scala:1650)
at org.apache.spark.sql.internal.SQLConf$$anonfun$redactOptions$1.apply(SQLConf.scala:1650)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.internal.SQLConf.redactOptions(SQLConf.scala:1650)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.simpleString(SaveIntoDataSourceCommand.scala:52)
at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:178)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556)
at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$4.apply(QueryExecution.scala:198)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$4.apply(QueryExecution.scala:198)
at org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:100)
at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:198)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
at com.snp.utils.DbUtils$.writeDfToCassandra(DbUtils.scala:47)
Oddly, this is failing in the "redact" function of the Spark Utils. This is used on options that are potentially passed to Spark to remove sensitive data from UIs and such. I can't imagine why a null key name would pop up in your SqlConf (since I believe you can only have empty Strings), but I would check there. Could it be a mutation of the conf while the method is being executed?

org.jenkinsci.plugins.scriptsecurity.sandbox.RejectedAccessException: Scripts not permitted to use method hudson.model.Item getName

I was trying to delete the old build history using a Groovy script. It was working fine earlier, and now, without any changes, I am facing the issue below:
ERROR: Build step failed with exception
org.jenkinsci.plugins.scriptsecurity.sandbox.RejectedAccessException: Scripts not permitted to use method hudson.model.Item getName
at org.jenkinsci.plugins.scriptsecurity.sandbox.whitelists.StaticWhitelist.rejectMethod(StaticWhitelist.java:175)
at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:137)
at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:155)
at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:159)
at org.kohsuke.groovy.sandbox.impl.Checker$checkedCall.callStatic(Unknown Source)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:56)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:194)
at Script1.deleteBuildHistory(Script1.groovy:71)
at Script1$deleteBuildHistory.call(Unknown Source)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:157)
at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:133)
at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:155)
at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:159)
at org.kohsuke.groovy.sandbox.impl.Checker$checkedCall.callStatic(Unknown Source)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:56)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:194)
at Script1.run(Script1.groovy:58)
at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox.run(GroovySandbox.java:141)
at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SecureGroovyScript.evaluate(SecureGroovyScript.java:333)
at hudson.plugins.groovy.SystemGroovy.run(SystemGroovy.java:95)
at hudson.plugins.groovy.SystemGroovy.perform(SystemGroovy.java:59)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:744)
at hudson.model.Build$BuildExecution.build(Build.java:206)
at hudson.model.Build$BuildExecution.doRun(Build.java:163)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
at hudson.model.Run.execute(Run.java:1798)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:429)
Build step 'Execute system Groovy script' marked build as failure
Finished: FAILURE
In my Groovy script I am using the API hudson.model.Hudson.instance.getItem(envVar.get("JOB_NAME")) to get the Jenkins job. Since it was working earlier, I am not sure how to resolve this issue now. Kindly provide inputs.
You are using a rather generic way to access data from an object, which might be exploited somehow, so it got blacklisted (or rather, not whitelisted) in the Jenkins Groovy sandbox.
You have several options here:
Just add an exception using in-process script approval
Use a less generic and therefore safer syntax like env.JOB_NAME.
I would definitely go for the second option in your case, since it has no disadvantages and is simpler than your current code.
As for why it worked before: there might have been an approval which somehow got lost (this happened to me once), or the call you are using was removed from the whitelist in an update of the security plugin.

Cascading S3 Sink Tap not being deleted with SinkMode.REPLACE

We are running Cascading with a Sink Tap being configured to store in Amazon S3 and were facing some FileAlreadyExistsException (see [1]).
This was only from time to time (1 time on around 100) and was not reproducable.
Digging into the Cascading code, we discovered that Hfs.deleteResource() is called (among others) by BaseFlow.deleteSinksIfNotUpdate().
Btw, we were quite intrigued with the silent NPE (with comment "hack to get around npe thrown when fs reaches root directory").
From there, we extended the Hfs tap with our own Tap to add more action in the deleteResource() method (see [2]) with a retry mechanism calling directly the getFileSystem(conf).delete.
The retry mechanism seemed to bring improvement, but we are still sometimes facing failures (see example in [3]): it looks like HDFS returns isDeleted=true, but when asking directly afterwards whether the folder exists, we receive exists=true, which should not happen. The logs also randomly show isDeleted as true or false when the flow succeeds, which suggests the returned value is irrelevant or not to be trusted.
Can anybody share their own S3 experience with such behavior: "folder should be deleted, but it is not"? We suspect an S3 issue, but could it also be in Cascading or HDFS?
We run on Hadoop Cloudera-cdh3u5 and Cascading 2.0.1-wip-dev.
[1]
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3n://... already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at com.twitter.elephantbird.mapred.output.DeprecatedOutputFormatWrapper.checkOutputSpecs(DeprecatedOutputFormatWrapper.java:75)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:923)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:882)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:882)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:856)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:104)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:174)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.j
[2]
@Override
public boolean deleteResource(JobConf conf) throws IOException {
    LOGGER.info("Deleting resource {}", getIdentifier());
    boolean isDeleted = super.deleteResource(conf);
    LOGGER.info("Hfs Sink Tap isDeleted is {} for {}", isDeleted, getIdentifier());
    Path path = new Path(getIdentifier());
    int retryCount = 0;
    int cumulativeSleepTime = 0;
    int sleepTime = 1000;
    while (getFileSystem(conf).exists(path)) {
        LOGGER.info("Resource {} still exists, it should not... - I will continue to wait patiently...",
            getIdentifier());
        try {
            LOGGER.info("Now I will sleep " + sleepTime / 1000
                + " seconds while trying to delete {} - attempt: {}",
                getIdentifier(), retryCount + 1);
            Thread.sleep(sleepTime);
            cumulativeSleepTime += sleepTime;
            sleepTime *= 2;
        } catch (InterruptedException e) {
            e.printStackTrace();
            LOGGER.error("Interrupted while sleeping trying to delete {} with message {}...",
                getIdentifier(), e.getMessage());
            throw new RuntimeException(e);
        }
        if (retryCount == 0) {
            getFileSystem(conf).delete(getPath(), true);
        }
        retryCount++;
        if (cumulativeSleepTime > MAXIMUM_TIME_TO_WAIT_TO_DELETE_MS) {
            break;
        }
    }
    if (getFileSystem(conf).exists(path)) {
        LOGGER.error("We didn't succeed to delete the resource {}. Throwing now a runtime exception.",
            getIdentifier());
        throw new RuntimeException(
            "Although we waited to delete the resource for "
                + getIdentifier()
                + ' '
                + retryCount
                + " iterations, it still exists - This must be an issue in the underlying storage system.");
    }
    return isDeleted;
}
[3]
INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] at least one sink is marked for delete
INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] sink oldest modified date: Wed Dec 31 23:59:59 UTC 1969
INFO [pool-2-thread-15] (HiveSinkTap.java:148) - Now I will sleep 1 seconds while trying to delete s3n://... - attempt: 1
INFO [pool-2-thread-15] (HiveSinkTap.java:130) - Deleting resource s3n://...
INFO [pool-2-thread-15] (HiveSinkTap.java:133) - Hfs Sink Tap isDeleted is true for s3n://...
ERROR [pool-2-thread-15] (HiveSinkTap.java:175) - We didn't succeed to delete the resource s3n://... Throwing now a runtime exception.
WARN [pool-2-thread-15] (Cascade.java:706) - [...] flow failed: ...
java.lang.RuntimeException: Although we waited to delete the resource for s3n://... 0 iterations, it still exists - This must be an issue in the underlying storage system.
at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:179)
at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:40)
at cascading.flow.BaseFlow.deleteSinksIfNotUpdate(BaseFlow.java:971)
at cascading.flow.BaseFlow.prepare(BaseFlow.java:733)
at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:761)
at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:710)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
First, double check the Cascading compatibility page for supported distributions.
http://www.cascading.org/support/compatibility/
Note Amazon EMR is listed as they periodically run the compatibility tests and report the results back.
Second, S3 is an eventually consistent filesystem; HDFS is not. So assumptions about the behavior of HDFS don't carry over to storing data in S3. For example, a rename is really a copy and delete, and the copy can take hours. Amazon has patched their internal distribution to accommodate many of the differences.
Third, there are no directories in S3. They are a hack, supported differently by different S3 interfaces (jets3t vs s3cmd vs ...). This is bound to be problematic considering the prior point.
Fourth, network latency and reliability are critical, especially when communicating with S3. Historically I've found the Amazon network to be better behaved when manipulating massive datasets on S3 from EMR rather than from standard EC2 instances. I also believe there is a patch in EMR that improves matters here as well.
So I'd suggest trying the EMR Apache Hadoop distribution to see if your issues clear up.
When running any jobs on Hadoop that use files in S3, the nuances of eventual consistency must be kept in mind.
I've helped troubleshoot many apps, whether in Cascading or Hadoop streaming or written directly in Java, which turned out to have similar race conditions on delete as their root issue.
There was discussion at one point of having notifications from S3 after a given key/value pair had been fully deleted. I haven't kept up on where that feature stood. Otherwise, it's probably best to design systems, again whether in Cascading or any other app that uses S3, such that data which is consumed or produced by a batch workflow is managed in HDFS or HBase or a key/value framework (e.g., I have used Redis for this). S3 then gets used for durable storage, but not for intermediate data.