I am getting an intermittent HTTP error when I try to load the contents of files in Azure Databricks from ADLS Gen2. The storage account has been mounted using a service principal associated with Databricks and has been given Storage Blob Data Contributor access through RBAC on the data lake storage account. A sample statement to load is
df = spark.read.format("orc").load("dbfs:/mnt/{storageaccount}/{filesystem}/{filename}")
The error message I get is:
Py4JJavaError: An error occurred while calling o214.load. : java.io.IOException: GET https://{storageaccount}.dfs.core.windows.net/{filesystem}/{filename}?timeout=90 StatusCode=412 StatusDescription=The condition specified using HTTP conditional header(s) is not met.
ErrorCode=ConditionNotMet ErrorMessage=The condition specified using HTTP conditional header(s) is not met.
RequestId:51fbfff7-d01f-002b-49aa-4c89d5000000
Time:2019-08-06T22:55:14.5585584Z
This error is not with all the files in the filesystem. I can load most of the files. The error is just with some of the files. Not sure what the issue is here.
This has been resolved now. The underlying issue was caused by a change on Microsoft's end. This is the RCA I got from Microsoft Support:
There was a storage configuration that is turned on incorrectly during the latest storage tenant upgrade. This type of error would only show up for the namespace enabled account on the latest upgraded tenant. The mitigation for this issue is to turn off the configuration on the specific tenant, and we had kicked off the super sonic configuration rollout for the all the tenants. We have since added additional Storage upgrade validation for ADLS Gen 2 to help cover this type of scenario.
I had the same problem on one file today. Downloading the file, deleting it from storage and putting it back solved the problem.
Renaming the file did not work.
Edit: we now see it on more files, seemingly at random.
We worked around the problem by copying the entire folder to a new folder and renaming the copy to the original name. Jobs run without problems again.
Still the question remains, why did the files end up in this situation?
Same issue here. After some research, it seems it was probably an If-Match ETag condition failure in the HTTP GET request. Microsoft explains that the service returns error 412 when this happens in this post: https://azure.microsoft.com/de-de/blog/managing-concurrency-in-microsoft-azure-storage-2/
Regardless, Databricks seem to have resolved the issue on their end now.
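To make the If-Match explanation above concrete, here is a minimal Python sketch of how a conditional GET is evaluated. It does not call Azure at all: StoredBlob and conditional_get are hypothetical names invented for this illustration, and the ETag values are made up. It only mirrors the general HTTP rule that a failed If-Match precondition is answered with 412.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class StoredBlob:
    """Hypothetical stand-in for a stored file and its current ETag."""
    etag: str
    content: bytes


def conditional_get(blob: StoredBlob, if_match: Optional[str] = None) -> Tuple[int, bytes]:
    """Return (status_code, body) for a GET with an optional If-Match header."""
    # No precondition, or a wildcard: the read always succeeds.
    if if_match is None or if_match == "*":
        return 200, blob.content
    # The precondition holds only if the client's ETag is still current.
    if if_match == blob.etag:
        return 200, blob.content
    # Otherwise the request is rejected with 412 Precondition Failed.
    return 412, b""


# Made-up ETag values, for illustration only.
blob = StoredBlob(etag='"0x8D71A"', content=b"orc-bytes")
ok_status, _ = conditional_get(blob, if_match='"0x8D71A"')
stale_status, _ = conditional_get(blob, if_match='"0x8D000"')
print(ok_status, stale_status)
```

If the reader holds an ETag that no longer matches what the service reports for the file, every read of that file fails in this way, which would fit the observation that only some files were affected.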
I want to store all the data from a Kafka topic in Amazon S3. I have a Kafka cluster that receives 200,000 messages per second on one topic, and each message value has 50 fields (strings, timestamps, integers, and floats).
My main idea is to use a Kafka connector to store the data in an S3 bucket and then use Amazon Glue to transform the data and store it in another bucket. I have the following questions:
1) How should I do it? Will that architecture work well? I tried Amazon EMR (Spark Streaming), but I had too many concerns; see "How to decrease the processing time and failed tasks using Apache Spark for events streaming from Apache Kafka?"
2) I tried to use Kafka Connect from Confluent, but I have a few questions:
Can I connect to my Kafka cluster from another Kafka instance and run my Kafka S3 connector in standalone mode?
What does this error mean: "ERROR Task s3-sink-0 threw an uncaught and unrecoverable exception"?
ERROR Task s3-sink-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:142)
java.lang.NullPointerException
at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:122)
at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:290)
at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:421)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:26,086] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:143)
[2018-10-05 15:32:27,980] WARN could not create Dir using directory from url file:/targ. skipping. (org.reflections.Reflections:104)
java.lang.NullPointerException
at org.reflections.vfs.Vfs$DefaultUrlTypes$3.matches(Vfs.java:239)
at org.reflections.vfs.Vfs.fromURL(Vfs.java:98)
at org.reflections.vfs.Vfs.fromURL(Vfs.java:91)
at org.reflections.Reflections.scan(Reflections.java:237)
at org.reflections.Reflections.scan(Reflections.java:204)
at org.reflections.Reflections.<init>(Reflections.java:129)
at org.apache.kafka.connect.runtime.AbstractHerder.connectorPlugins(AbstractHerder.java:268)
at org.apache.kafka.connect.runtime.AbstractHerder$1.run(AbstractHerder.java:377)
at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:27,981] WARN could not create Vfs.Dir from url. ignoring the exception and continuing (org.reflections.Reflections:208)
org.reflections.ReflectionsException: could not create Vfs.Dir from url, no matching UrlType was found [file:/targ] either use fromURL(final URL url, final List urlTypes) or use the static setDefaultURLTypes(final List urlTypes) or addDefaultURLTypes(UrlType urlType) with your specialized UrlType.
at org.reflections.vfs.Vfs.fromURL(Vfs.java:109)
at org.reflections.vfs.Vfs.fromURL(Vfs.java:91)
at org.reflections.Reflections.scan(Reflections.java:237)
at org.reflections.Reflections.scan(Reflections.java:204)
at org.reflections.Reflections.<init>(Reflections.java:129)
at org.apache.kafka.connect.runtime.AbstractHerder.connectorPlugins(AbstractHerder.java:268)
at org.apache.kafka.connect.runtime.AbstractHerder$1.run(AbstractHerder.java:377)
at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:35,441] INFO Reflections took 12393 ms to scan 429 urls, producing 13521 keys and 95814 values (org.reflections.Reflections:229)
If you can resume the steps to connect to Kafka and keep on s3 from
another Kafka instance, how will you do?
What do all these fields mean: key.converter, value.converter, key.converter.schemas.enable, value.converter.schemas.enable, internal.key.converter, internal.value.converter, internal.key.converter.schemas.enable, internal.value.converter.schemas.enable?
What are the possible values for key.converter and value.converter?
3) Once my raw data is in a bucket, I would like to use Amazon Glue to take these data, to deserialize Protobuffer, to change the format of some fields, and finally to store it in another bucket in Parquet. How can I use my own java protobuffer library in Amazon Glue?
4) If I want to query with Amazon Athena, how can I load the partitions automatically (year, month, day, hour)? With the crawlers and schedulers of Amazon Glue?
To complement @cricket_007's answer:
Can I connect to my Kafka Cluster from other Kafka instance and run in a standalone way my Kafka Connector s3?
The Kafka S3 connector is part of the Confluent distribution, which also includes Kafka and other related services. It is not meant to run directly on your brokers, but rather:
as a standalone worker that runs a connector configuration supplied when the service is launched,
or as a separate cluster of workers running alongside your Kafka brokers' cluster. In that case, it is better to manage connectors through the Kafka Connect REST API (search for "Managing Kafka Connectors" for documentation with examples).
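As a rough sketch of the REST route mentioned above, the following Python snippet builds the JSON payload for an S3 sink connector and prepares (without sending) the POST request to a Connect worker. The host, topic, bucket, region, and flush size are placeholder assumptions, not values from the question; the connector name s3-sink is taken from the log in the question. Check the property names against the kafka-connect-s3 documentation for your version before relying on them.

```python
import json
import urllib.request

# Assumed worker address: 8083 is Kafka Connect's default REST port.
CONNECT_URL = "http://localhost:8083/connectors"

# Placeholder values: replace the topic, bucket, and region with your own.
connector = {
    "name": "s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "1",
        "topics": "my-topic",
        "s3.bucket.name": "my-raw-bucket",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",
    },
}


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Prepare a POST request carrying the connector definition as JSON."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(CONNECT_URL, connector)
print(request.get_method(), request.full_url)

# Uncomment to submit the connector to a running worker:
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```

Keeping the network call commented out makes it possible to inspect the payload first; in distributed mode the same payload can be sent to any worker in the Connect cluster.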
If you can resume the steps to connect to Kafka and keep on s3 from
another Kafka instance, how will you do?
Are you talking about another Kafka Connect instance?
If so, you can simply run the Kafka Connect service in distributed mode, which was meant to give the reliability you seem to be looking for...
Or do you mean another Kafka (brokers) cluster?
In that case, you could try (but that would be experimental, and I haven't tried it myself...) running Kafka Connect in standalone mode and simply updating the bootstrap.servers parameter of your connector's configuration to point to the new cluster. Why that might work: in standalone mode, the offsets of your sink connector(s) are stored locally on your worker (in contrast to distributed mode, where the offsets are stored on the Kafka cluster directly...). Why that might not work: it's simply not intended for this use, and I'm guessing you might need your topics and partitions to be exactly the same...?
What are the possible values for key.converter, value.converter?
Check Confluent's documentation for kafka-connect-s3 ;)
How can I use my own java protobuffer library in Amazon Glue?
Not sure of the actual method, but Glue jobs spawn off an EMR cluster behind the scenes so I don't see why it shouldn't be possible...
If I want to query with Amazon Athena, how can I load the partitions automatically (year, month, day, hour)? With the crawlers and schedulers of Amazon Glue?
Yes.
Assuming daily partitioning, you could have your schedule run the crawler first thing in the morning, as soon as you can expect new data to have created that day's folder on S3 (so at least one object for that day exists on S3)... The crawler will add that day's partition, which will then be available for querying along with any newly added objects.
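For illustration, here is a small Python sketch of the folder layout this relies on. It assumes Hive-style key=value prefixes (year=/month=/day=/hour=), which is a layout choice and not something stated in the question; the bucket and table names are placeholders. It also renders the ALTER TABLE ... ADD PARTITION statement that Athena accepts, as a manual alternative to waiting for the crawler.

```python
from datetime import datetime, timezone


def partition_prefix(base: str, ts: datetime) -> str:
    """Build a Hive-style S3 prefix for the hour that contains ts."""
    return f"{base.rstrip('/')}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"


def add_partition_ddl(table: str, base: str, ts: datetime) -> str:
    """Render a statement that registers one hourly partition."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{ts:%Y}', month='{ts:%m}', day='{ts:%d}', hour='{ts:%H}') "
        f"LOCATION '{partition_prefix(base, ts)}'"
    )


# Placeholder bucket and table names.
ts = datetime(2018, 10, 5, 15, 32, tzinfo=timezone.utc)
prefix = partition_prefix("s3://my-raw-bucket/events", ts)
ddl = add_partition_ddl("events", "s3://my-raw-bucket/events", ts)
print(prefix)
print(ddl)
```

With this layout, a crawler run picks up each new prefix as a partition, and queries that filter on year, month, day, and hour only scan the matching prefixes.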
We use S3 Connect for hundreds of topics and process data using Hive, Athena, Spark, Presto, etc. Seems to work fine, though I feel like an actual database might return results faster.
In any case, to answer about Connect
Can I connect to my Kafka cluster from another Kafka instance and run my Kafka S3 connector in standalone mode?
I'm not sure I understand the question, but Kafka Connect needs to connect to one cluster; you don't need two Kafka clusters to use it. You'd typically run Kafka Connect processes as part of their own cluster, not on the brokers.
What does this error mean: "ERROR Task s3-sink-0 threw an uncaught and unrecoverable exception"?
It means you need to look at the logs to figure out what exception is being thrown and stopping the connector from reading data.
WARN could not create Dir using directory from url file:/targ
If you're using the HDFS connector, I don't think you should be using the default file:// URI.
If you can resume the steps to connect to Kafka and keep on s3 from another Kafka instance, how will you do?
You can't "resume from another Kafka instance". As mentioned, Connect can only consume from a single Kafka cluster, and any consumed offsets and consumer groups are stored with it.
What do all these fields mean?
internal.key.converter, internal.value.converter, internal.key.converter.schemas.enable, internal.value.converter.schemas.enable
These fields are removed from the latest Kafka releases, so you can ignore them. You definitely should not change them.
key.converter, value.converter
These are your serializers and deserializers, like the ones the regular producer and consumer APIs have.
key.converter.schemas.enable, value.converter.schemas.enable
I believe these are only important for JSON converters. See https://rmoff.net/2017/09/06/kafka-connect-jsondeserializer-with-schemas-enable-requires-schema-and-payload-fields
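As an illustration of the schemas.enable setting for the JSON converter: when it is enabled, each message is expected to be a JSON envelope with "schema" and "payload" fields, which is what the linked post's title refers to. The sketch below only inspects message bytes in plain Python; has_schema_envelope is a hypothetical helper, not part of Kafka Connect, and the sample record is made up.

```python
import json


def has_schema_envelope(raw: bytes) -> bool:
    """True if the message is a JSON object with both 'schema' and 'payload' fields."""
    try:
        doc = json.loads(raw)
    except ValueError:
        # Not valid JSON (or not valid UTF-8) at all.
        return False
    return isinstance(doc, dict) and "schema" in doc and "payload" in doc


# A plain JSON record, as most producers write it.
plain = b'{"id": 1, "name": "a"}'

# The same record wrapped in a schema/payload envelope.
wrapped = json.dumps({
    "schema": {
        "type": "struct",
        "optional": False,
        "fields": [
            {"field": "id", "type": "int32", "optional": False},
            {"field": "name", "type": "string", "optional": False},
        ],
    },
    "payload": {"id": 1, "name": "a"},
}).encode("utf-8")

print(has_schema_envelope(plain), has_schema_envelope(wrapped))
```

If the topic holds plain JSON like the first sample, setting key.converter.schemas.enable and value.converter.schemas.enable to false is usually the matching configuration.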
to deserialize Protobuf, to change the format of some fields, and finally to store it in another bucket in Parquet
Kafka Connect would need to be loaded with a Protobuf converter, and I don't know whether there is one (I think Blue Apron wrote something... search GitHub).
Generally speaking, Avro would be much easier to convert to Parquet because native libraries already exist to do that. S3 Connect by Confluent doesn't currently write Parquet format, but there is an open PR. The alternative is to use Pinterest's Secor library.
I don't know Glue, but if it's like Hive, you would use ADD JAR during a query to load external code plugins and functions.
I have minimal experience with Athena, but Glue maintains all the partitions as a Hive metastore. The automatic part would be the crawler; you can then put a filter on the query to do partition pruning.
We have Apache Flink (1.4.2) running on an EMR cluster. We are checkpointing to an S3 bucket, and are pushing about 5,000 records per second through the flows. We recently saw the following error in our logs:
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@ip-XXX-XXX-XXX-XXX:XXXXXX/user/taskmanager#-XXXXXXX]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.messages.TaskManagerMessages$RequestTaskManagerLog".
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:442)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
Immediately after this we got the following in our logs:
2018-07-30 15:08:32,177 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 831 @ 1532963312177
2018-07-30 15:09:46,750 ERROR org.apache.flink.runtime.blob.BlobServerConnection - PUT operation failed
java.io.EOFException: Read an incomplete length
at org.apache.flink.runtime.blob.BlobUtils.readLength(BlobUtils.java:366)
at org.apache.flink.runtime.blob.BlobServerConnection.readFileFully(BlobServerConnection.java:403)
at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:349)
at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:114)
At this point, the flow crashed and was not able to recover automatically. However, we were able to restart the flow manually without needing to change the location of the S3 bucket. The fact that the crash occurred while pushing to S3 makes me think that is the crux of the problem.
Any ideas?
FYI, this was caused by too much cross-talk between nodes flooding the NICs on each server. The solution was more intelligent partitioning.
We have a Google Cloud Dataflow job, which writes to Bigtable (via HBase API). Unfortunately, it fails due to:
java.io.IOException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
at com.google.bigtable.repackaged.com.google.auth.oauth2.DefaultCredentialsProvider.getDefaultCredentials(DefaultCredentialsProvider.java:74)
at com.google.bigtable.repackaged.com.google.auth.oauth2.GoogleCredentials.getApplicationDefault(GoogleCredentials.java:54)
at com.google.bigtable.repackaged.com.google.cloud.config.CredentialFactory.getApplicationDefaultCredential(CredentialFactory.java:181)
at com.google.bigtable.repackaged.com.google.cloud.config.CredentialFactory.getCredentials(CredentialFactory.java:100)
at com.google.bigtable.repackaged.com.google.cloud.grpc.io.CredentialInterceptorCache.getCredentialsInterceptor(CredentialInterceptorCache.java:85)
at com.google.bigtable.repackaged.com.google.cloud.grpc.BigtableSession.<init>(BigtableSession.java:257)
at org.apache.hadoop.hbase.client.AbstractBigtableConnection.<init>(AbstractBigtableConnection.java:123)
at org.apache.hadoop.hbase.client.AbstractBigtableConnection.<init>(AbstractBigtableConnection.java:91)
at com.google.cloud.bigtable.hbase1_0.BigtableConnection.<init>(BigtableConnection.java:33)
at com.google.cloud.bigtable.dataflow.CloudBigtableConnectionPool$1.<init>(CloudBigtableConnectionPool.java:72)
at com.google.cloud.bigtable.dataflow.CloudBigtableConnectionPool.createConnection(CloudBigtableConnectionPool.java:72)
at com.google.cloud.bigtable.dataflow.CloudBigtableConnectionPool.getConnection(CloudBigtableConnectionPool.java:64)
at com.google.cloud.bigtable.dataflow.CloudBigtableConnectionPool.getConnection(CloudBigtableConnectionPool.java:57)
at com.google.cloud.bigtable.dataflow.AbstractCloudBigtableTableDoFn.getConnection(AbstractCloudBigtableTableDoFn.java:96)
at com.google.cloud.bigtable.dataflow.CloudBigtableIO$CloudBigtableSingleTableBufferedWriteFn.getBufferedMutator(CloudBigtableIO.java:836)
at com.google.cloud.bigtable.dataflow.CloudBigtableIO$CloudBigtableSingleTableBufferedWriteFn.processElement(CloudBigtableIO.java:861)
Which makes very little sense, because the job is already running on Cloud Dataflow service/VMs.
The Cloud Dataflow job id: 2016-05-13_11_11_57-8485496303848899541
We are using bigtable-hbase-dataflow version 0.3.0, and we want to use HBase API.
I believe this is a known issue where GCE instances are very rarely unable to get the default credentials during startup.
We have been working on a fix which should be part of the next release (1.6.0) which should be coming soon. In the meantime we'd suggest re-submitting the job which should work. If you run into problems consistently or want to discuss other workarounds (such as backporting the 1.6.0 fix) please reach out to us.
1.7.0 has been released, so this should be fixed now: https://cloud.google.com/dataflow/release-notes/release-notes-java
I have two clusters, one on a local virtual machine and another in a remote cloud. Both clusters run in Standalone mode.
My Environment:
Scala: 2.10.4
Spark: 1.5.1
JDK: 1.8.40
OS: CentOS Linux release 7.1.1503 (Core)
The local cluster:
Spark Master: spark://local1:7077
The remote cluster:
Spark Master: spark://remote1:7077
I want to finish this:
Write code (just a simple word count) in IntelliJ IDEA locally (on my laptop), set the Spark Master URL to spark://local1:7077 or spark://remote1:7077, and then run the code from IntelliJ IDEA. That is, I don't want to use spark-submit to submit a job.
But I got some problem:
When I use the local cluster, everything goes well. Both running the code in IntelliJ IDEA and using spark-submit submit the job to the cluster, and the job finishes.
But when I use the remote cluster, I get a warning log:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Note that it says sufficient resources, not sufficient memory!
This log keeps printing and nothing else happens. Both spark-submit and running the code in IntelliJ IDEA give the same result.
I want to know:
Is it possible to submit code from IntelliJ IDEA to a remote cluster?
If it's OK, does it need configuration?
What are the possible reasons that can cause my problem?
How can I handle this problem?
Thanks a lot!
Update
There is a similar question here, but I think my scenario is different. When I run my code in IntelliJ IDEA with the Spark Master set to the local virtual machine cluster, it works. With the remote cluster, I get the "Initial job has not accepted any resources;..." warning instead.
I want to know whether a security policy or firewall can cause this.
Submitting code programmatically (e.g. via SparkSubmit) is quite tricky. At the least, there is a variety of environment settings and considerations, handled by the spark-submit script, that are quite difficult to replicate within a Scala program. I am still uncertain how to achieve it, and there have been a number of long-running threads in the Spark developer community on the topic.
My answer here is about a portion of your post: specifically the
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
The reason is typically a mismatch between the memory and/or number of cores requested by your job and what is available on the cluster. Possibly, when submitting from IntelliJ, the settings in
$SPARK_HOME/conf/spark-defaults.conf
did not match the parameters required for your task on the existing cluster. You may need to update:
spark.driver.memory 4g
spark.executor.memory 8g
spark.executor.cores 8
You can check the Spark UI on port 8080 to verify that the resources you requested are actually available on the cluster.
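The mismatch described above can be sketched as a simple check. This is a deliberately simplified model in plain Python, not Spark's actual scheduler: Worker and can_place_executor are hypothetical names, and the numbers are made-up examples of what the master UI might show as free cores and memory per worker.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Worker:
    """Free resources on one worker, as read off the master UI (simplified model)."""
    free_cores: int
    free_memory_gb: int


def can_place_executor(workers: List[Worker], executor_cores: int, executor_memory_gb: int) -> bool:
    """True if at least one worker can host a single executor of the requested size."""
    return any(
        w.free_cores >= executor_cores and w.free_memory_gb >= executor_memory_gb
        for w in workers
    )


# Made-up cluster: two workers with 4 free cores and 6 GB free memory each.
cluster = [Worker(free_cores=4, free_memory_gb=6), Worker(free_cores=4, free_memory_gb=6)]

# Requesting 8 cores and 8 GB per executor fits on no worker, so the job would keep waiting.
too_big = can_place_executor(cluster, executor_cores=8, executor_memory_gb=8)

# A smaller request fits.
fits = can_place_executor(cluster, executor_cores=2, executor_memory_gb=4)
print(too_big, fits)
```

When no worker can satisfy the request, the scheduler has nothing to offer the job, which is consistent with the warning repeating indefinitely.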