Databricks spark_jar_task failed when submitted via API - api

I am using to submit a sample spark_jar_task
My sample spark_jar_task request to calculate Pi :
"libraries": [
{
"jar": "dbfs:/mnt/test-prd-foundational-projects1/spark-examples_2.11-2.4.5.jar"
}
],
"spark_jar_task": {
"main_class_name": "org.apache.spark.examples.SparkPi"
}
Databricks sysout logs where it prints the Pi value as expected
....
(This session will block until Rserve is shut down) Spark package found in SPARK_HOME: /databricks/spark DATABRICKS_STDOUT_END-19fc0fbc-b643-4801-b87c-9d22b9e01cd2-1589148096455
Executing command, time = 1589148103046.
Executing command, time = 1589148115170.
Pi is roughly 3.1370956854784273
Heap
.....
Spark_jar_task though prints the PI value in log, the job got terminated with failed status without stating the error. Below are response of api /api/2.0/jobs/runs/list/?job_id=23.
{
"runs": [
{
"job_id": 23,
"run_id": 23,
"number_in_job": 1,
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "FAILED",
"state_message": ""
},
"task": {
"spark_jar_task": {
"jar_uri": "",
"main_class_name": "org.apache.spark.examples.SparkPi",
"run_as_repl": true
}
},
"cluster_spec": {
"new_cluster": {
"spark_version": "6.4.x-scala2.11",
......
.......
Why the job failed here? Any suggestions will be appreciated!
EDIT :
The errorlog says
20/05/11 18:24:15 INFO ProgressReporter$: Removed result fetcher for 740457789401555410_9000204515761834296_job-34-run-1-action-34
20/05/11 18:24:15 WARN ScalaDriverWrapper: Spark is detected to be down after running a command
20/05/11 18:24:15 WARN ScalaDriverWrapper: Fatal exception (spark down) in ReplId-a46a2-6fb47-361d2
com.databricks.backend.common.rpc.SparkStoppedException: Spark down:
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:493)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:597)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:390)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
at java.lang.Thread.run(Thread.java:748)
20/05/11 18:24:17 INFO ShutdownHookManager: Shutdown hook called

I found answer from this post https://github.com/dotnet/spark/issues/126
Looks like, we shouldnt deliberately call
spark.stop()
when running as a jar in databricks

Related

Unable to run relay-compiler

I installed the relay for my react-native project following the steps on relay.dev. Running the compiler works fine when I have an empty schema file. Putting schema to the file starts throwing me this error:
thread 'main' panicked at 'Expect GraphQLAsts to exist.', /home/runner/work/relay/relay/compiler/crates/relay-compiler/src/compiler.rs:335:14
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
My schema file is
// relay_schema.graphql
type Query {
tasks: [TaskNode]
}
type TaskNode {
id: ID!
}
and my relay config is:
module.exports = {
// ...
// Configuration options accepted by the `relay-compiler` command-line tool and `babel-plugin-relay`.
src: './src/',
language: 'flow',
schema: './relay_schema.graphql',
exclude: ['**/node_modules/**', '**/__mocks__/**', '**/__generated__/**'],
};
I'm completely lost on what to do here
Started working after updating src to src: './'. It's better to verify the paths such as src and schema. Error messages shown by the compiler are not much help when stuck here.

nodejs loopback api is not running while trying to run in pm2

My pm2 ecosystem.config.js configuration like below:
HGBackend is not running. Others are running in pm2
module.exports = {
apps : [
{
name : "HGBackend",
cwd : "hgbackend/server",
script : "config.json"
},
{
name : "HGBlockchain",
cwd : "hgblockchain/localgrammes",
script : "index.js"
// args : "start:staging"
// instances : 4,
// exec_mode : "cluster"
},
{
name : "HGWeb",
cwd : "hgweb/src/server",
script : "server.js",
description: ""
}
]}
All are working except HGBackend. HGBackend is loopback api. Others are react and express api.
What will be the cause for not running HGBackend? Can anyone help me?
For the HGWeb and HGBlockchain applications, you are executing the relevant server.js in the script configuration. For the HGBackend you are passing the config.json as the script. For Loopback you also need to execute the server.js.
Yes I have corrected is. Actually problem in fatasource.json file. Thanks

SSL Error: Handshake failed with fatal error - Querying fabric-sdk-rest server on a Fabric Network with TLS enabled

I started a Multi-Host Fabric Network usind docker swarm made up of 1 CA-server, 1 Orderer, 2 Peers (both in Org1, one on PC1 and one on PC2), 2 CouchBD (one for each Peer) with fabric-sdk-rest running on PC2.
Now if I disable TLS in the Fabric Network, everything works fine. But if i enable the TLS in the network, the SDK cannot connect to the peers failing to query.
Here I show the configuration of the network and the fabric-sdk-rest:
(crypto-config.yaml)
OrdererOrgs:
- Name: Orderer
Domain: example.com
Specs:
- Hostname: orderer
PeerOrgs:
- Name: Org1
Domain: org1.example.com
Template:
Count: 2
Users:
Count: 0
(datasources.json)
{
"db": {
"name": "db",
"connector": "memory"
},
"fabricDataSource": {
"name": "fabricDataSource",
"connector": "fabric",
"keyStoreFile": "/tmp/fabricSDKStore",
"fabricUser": {
"username": "Admin#org1.example.com",
"mspid": "Org1MSP",
"cryptoContent": {
"privateKey":"$HOME/mynetwork/crypto-config/peerOrganizations/org1.example.com/users/Admin#org1.example.com/msp/keystore/KEY_sk",
"signedCert":"$HOME/mynetwork/crypto-config/peerOrganizations/org1.example.com/users/Admin#org1.example.com/msp/signcerts/Admin#org1.example.com-cert.pem"
}
},
"COMMENT_orgs":"Referenced by peers to avoid having to configure the same file location multiple times. Change CACertFile locations for your fabric",
"orgs": [
{ "name":"org1", "CACertFile":"$HOME/mynetwork/crypto-config/peerOrganizations/org1.example.com/ca/ca.org1.example.com-cert.pem"}
],
"COMMENT_peers" : "Configured array is for use with the fabric-sample when running it in a local docker set up. eventURL and publicCertFile not currently used.",
"peers": [
{ "requestURL":"grpcs://peer1.org1.example.com:7051", "eventURL":"grpcs://peer1.org1.example.com:7053", "orgIndex":"0", "publicCertFile":"$HOME/mynetwork/crypto-config/peerOrganizations/org1.example.com/peers/peer1.org1.example.com/msp/signcerts/peer1.org1.example.com-cert.pem", "hostname":"peer1" }
],
"COMMENT_peers_secure" : "UNUSED. This is a copy of the above with grpcs URLs. Replace peers content with this if grpcs urls are needed.",
"peers-secure": [
{ "requestURL":"grpcs://peer1.org1.example.com:7051", "eventURL":"grpcs://peer1.org1.example.com:7053", "orgIndex":"0", "publicCertFile":"$HOME/mynetwork/crypto-config/peerOrganizations/org1.example.com/peers/peer1.org1.example.com/msp/signcerts/peer1.org1.example.com-cert.pem", "hostname":"peer1" }
],
"orderers": [
{ "url":"grpcs://orderer.example.com:7050", "CACertFile":"$HOME/mynetwork/crypto-config/ordererOrganizations/example.com/ca/ca.example.com-cert.pem", "publicCertFile": "$HOME/mynetwork/crypto-config/ordererOrganizations/example.com/orderers/orderer.example.com/msp/signcerts/orderer.example.com-cert.pem", "hostname":"orderer"}
],
"COMMENT_orderers_secure" : "UNUSED. This is a copy of the above with grpcs URLs. Replace orderers content with this if grpcs urls are needed.",
"orderers-secure": [
{ "url":"grpcs://orderer.example.com:7050", "CACertFile":"$HOME/mynetwork/crypto-config/ordererOrganizations/example.com/ca/ca.example.com-cert.pem", "publicCertFile": "$HOME/mynetwork/crypto-config/ordererOrganizations/example.com/orderers/orderer.example.com/msp/signcerts/orderer.example.com-cert.pem", "hostname":"orderer"}
],
"COMMENT_channels":"fabric-sdk-node Client class requires channel information to be configured during bootstrap.",
"channels": [
{ "name":"mychannel", "peersIndex":[0], "orderersIndex":[0] }
],
"channels-first-network": [
{ "name":"mychannel", "peersIndex":[0,1,2,3], "orderersIndex":[0] }
]
}
}
Once started the Hyperledger Fabric SDK REST server at https://0.0.0.0:3000, when I try to make the GET channels query from the explorer, I get the following error:
error: [fabricconnector.js]: Failed to queryChannels: Error: 14 UNAVAILABLE: Connect Failed
Error not handled for the GET request /api/fabric/1_0/channels: Error: 14 UNAVAILABLE: Connect Failed
at Object.exports.createStatusError ($HOME/mynetwork/fabric-sdk-rest/packages/loopback-connector-fabric/node_modules/grpc/src/common.js:87:15)
at Object.onReceiveStatus ($HOME/mynetwork/fabric-sdk-rest/packages/loopback-connector-fabric/node_modules/grpc/src/client_interceptors.js:1214:28)
at InterceptingListener._callNext ($HOME/mynetwork/fabric-sdk-rest/packages/loopback-connector-fabric/node_modules/grpc/src/client_interceptors.js:590:42)
at InterceptingListener.onReceiveStatus ($HOME/mynetwork/fabric-sdk-rest/packages/loopback-connector-fabric/node_modules/grpc/src/client_interceptors.js:640:8)
at callback ($HOME/mynetwork/fabric-sdk-rest/packages/loopback-connector-fabric/node_modules/grpc/src/client_interceptors.js:867:24)
E0510 10:51:04.780559355 12247 ssl_transport_security.cc:989] Handshake failed with fatal error SSL_ERROR_SSL: error:14090086:SSL routines:ssl3_get_server_certificate:certificate verify failed.
Has anyone ever seen this error? Can anyone help me get through this, please?

Apache Beam on Cloud Dataflow - Failed to query Cadvisor

I have a cloud dataflow that is reading from a Pub/Sub and pushing data out to BQ. Recently the dataflow is reporting the error below and not writing any data to BQ.
{
insertId: "3878608796276796502:822931:0:1075"
jsonPayload: {
line: "work_service_client.cc:490"
message: "gcpnoelevationcall-01211413-b90e-harness-n1wd Failed to query CAdvisor at URL=<IPAddress>:<PORT>/api/v2.0/stats?count=1, error: INTERNAL: Couldn't connect to server"
thread: "231"
}
labels: {
compute.googleapis.com/resource_id: "3878608796276796502"
compute.googleapis.com/resource_name: "gcpnoelevationcall-01211413-b90e-harness-n1wd"
compute.googleapis.com/resource_type: "instance"
dataflow.googleapis.com/job_id: "2018-01-21_14_13_45"
dataflow.googleapis.com/job_name: "gcpnoelevationcall"
dataflow.googleapis.com/region: "global"
}
logName: "projects/poc/logs/dataflow.googleapis.com%2Fshuffler"
receiveTimestamp: "2018-01-21T22:41:40.053806623Z"
resource: {
labels: {
job_id: "2018-01-21_14_13_45"
job_name: "gcpnoelevationcall"
project_id: "poc"
region: "global"
step_id: ""
}
type: "dataflow_step"
}
severity: "ERROR"
timestamp: "2018-01-21T22:41:39.524005Z"
}
Any ideas, on how could I help this? Has anyone faced a similar issue before?
If this just happened once it could be attributed to a transient issue. The process running on the worker node can't reach cAdvisor. Either the cAdvisor container is not running or there is a temporal problem on the worker that can't contact cAdvisor and the job gets stuck.

Meld error with Datastax Enterprise

Provisioning a DSE cluster with the lifecycle manager fails consitently. Master node (also the one OpsCenter is running on) installed correctly. Each one of the other nodes fails the install (also config) task. Have double-checked the SSH credentials and ports. Any ideas on how to investigate further and fix the issue would be great.
Please excuse the length - trying to provide all of the relevant info.
Ubuntu 14.04.4,
JRE: 1.8.0.91,
DSE 5.0.0
job events:
...
"results": [
{
"event-subtype": "start",
"event-type": "milestone",
"message": "job started...",
...
},
{
"event-subtype": "invocation",
"event-type": "shell-command",
"message": "Invoked command: if [ -x $(which yum) ] && [ -f /etc/redhat-release -o -f /etc/SuSE-release ]; then echo -n yum; elif [ -x $(which apt-get) ]; then echo -n apt; fi"
...
},
{
"event-subtype": "uploaded-facts",
"event-type": "milestone",
"message": "Uploaded facts to OpsCenter server",
...
},
{
"event-subtype": "meld-error",
"event-type": "error",
"message": "Unexpected error executing meld",
...
},
{
"event-subtype": "MeldError",
"event-type": "error",
"message": "Meld failed on: name=\"NODE-2\" ssh-management-address=\"<IP>\" node-id=\"<node-id>\" job-id=\"<job-id>\" stdout=\"\r\n\" stderr=\"\"",
...
}
]
opscenterd.log
/var/log/opscenter/opscenterd.log-2016-07-02 16:34:16,848 [opscenterd] INFO: Install job started for node name="NODE-2" ssh-management-address="<IP>" node-id="<node-id>" (async-thread-macro-53)
/var/log/opscenter/opscenterd.log-2016-07-02 16:34:16,850 [opscenterd] INFO: using ssh-private-key (async-thread-macro-53)
/var/log/opscenter/opscenterd.log-2016-07-02 16:34:18,478 [opscenterd] INFO: Received milestone from node name="NODE-2" ssh-management-address="<IP>" node-id="<node-id>" message="Uploaded facts to OpsCenter server" job-id="a630c081-6ac1-4b00-ac08-18fef320e0d5" (MainThread)
/var/log/opscenter/opscenterd.log:2016-07-02 16:34:18,675 [opscenterd] ERROR: Received error from node event-subtype="meld-error" job-id="a630c081-6ac1-4b00-ac08-18fef320e0d5" name="NODE-2" traceback="Traceback (most recent call last):
/var/log/opscenter/opscenterd.log: File \"meld.py\", line 3313, in run
/var/log/opscenter/opscenterd.log- rc = engine.go()
/var/log/opscenter/opscenterd.log: File \"meld.py\", line 2991, in go
/var/log/opscenter/opscenterd.log- self.file_manager.get_config_files()
/var/log/opscenter/opscenterd.log: File \"meld.py\", line 1280, in get_config_files
/var/log/opscenter/opscenterd.log- {\"accept\": \"application/json\"})
/var/log/opscenter/opscenterd.log: File \"meld.py\", line 598, in get
/var/log/opscenter/opscenterd.log- return json.loads(response.read())
/var/log/opscenter/opscenterd.log- File \"/usr/lib/python2.7/socket.py\", line 351, in read
/var/log/opscenter/opscenterd.log- data = self._sock.recv(rbufsize)
/var/log/opscenter/opscenterd.log- File \"/usr/lib/python2.7/httplib.py\", line 549, in read
/var/log/opscenter/opscenterd.log- return self._read_chunked(amt)
/var/log/opscenter/opscenterd.log- File \"/usr/lib/python2.7/httplib.py\", line 609, in _read_chunked
/var/log/opscenter/opscenterd.log- value.append(self._safe_read(amt))
/var/log/opscenter/opscenterd.log- File \"/usr/lib/python2.7/httplib.py\", line 666, in _safe_read
/var/log/opscenter/opscenterd.log- raise IncompleteRead(''.join(s), amt)
/var/log/opscenter/opscenterd.log:IncompleteRead: IncompleteRead(4153 bytes read, 4039 more expected)" ssh-management-address="<IP>" node-id="<node-id>" event-type="error" message="Unexpected error executing meld" (MainThread)
/var/log/opscenter/opscenterd.log-2016-07-02 16:34:18,892 [opscenterd] ERROR: Install job a630c081-6ac1-4b00-ac08-18fef320e0d5 failed! (async-thread-macro-54)
/var/log/opscenter/opscenterd.log:2016-07-02 16:34:19,105 [opscenterd] ERROR: Meld failed on: name="NODE-2" ssh-management-address="<IP>" node-id="<node-id>" job-id="a630c081-6ac1-4b00-ac08-18fef320e0d5" stdout="
/var/log/opscenter/opscenterd.log-" stderr="" (async-thread-macro-53)
Thank you
EDIT: Captured the HTTP traffic between NODE2 and master. The error occurs while transferring config files. One of them is not transferred completely for some reason. The json looks resonable until some gibberish appears.
{"filename": "dse.yaml", "contents": {"internode_messaging_options": {"client_worker_threads": 16, "port": 8609, "server_worker_threads": 16, "server_acceptor_thread
Yvatv+~UK{.kMI4^QOrqQTDX_3"DPm,v!"H&M$!1M7
LRYCs{l>-df;cj
W6C9dq
The config files are valid and do work on the master node. Only the replication fails.
OpsCenter LCM developer here. Your issue is caused by OPSC-8851 in the LCM known issues list: http://docs.datastax.com/en/opscenter/6.0/opsc/release_notes/opscReleaseNotes600.html
This is only triggered under certain network conditions and was discovered too close to release to get fixed in 6.0.0. It's a high priority though, and will be fixed in a subsequent release soon. Unfortunately, I don't think there's anything you can do to work around this in the field. If you're a DataStax customer, you could contact support and potentially get a patch now to workaround the issue... otherwise the only thing I can suggest is to watch the upcoming release notes.
Edit: I should also note that in our tests the issue is intermittent. LCM is designed so you can rerun failed jobs safely (aka it's idempotent) so in all but the most extreme cases you can also work around this just by rerunning your job.
You can specify the private IP for Listen Address and 0.0.0.0 for broadcast address and LCM should be able to provision appropriately.