Hive on Tez throws "No LLAP Daemons are running" ERROR

I have an LLAP service running on a YARN cluster on Amazon EMR.
Here is the image showing that the LLAP service is up; its name is llap_service.
I've set "hive.llap.daemon.service.hosts" to "#llap_service",
but my query in Hive does not succeed. The log looks like this:
2018-01-08 08:22:43,866 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: Timeout monitor task not started. Timeout future state: false, #instances: 0
2018-01-08 08:22:43,866 [INFO] [TaskSchedulerEventHandlerThread] |tezplugins.LlapTaskSchedulerService|: PendingTasksInfo=numPriorityLevels=5. [p=38,c=2][p=41,c=2][p=44,c=1][p=50,c=2][p=89,c=22]. totalPendingTasks=29. delayedTaskQueueSize=0
2018-01-08 08:22:43,866 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: Inadequate total resources before scheduling pending tasks. Signalling scheduler timeout monitor thread to start timer.
2018-01-08 08:22:43,866 [INFO] [TaskSchedulerEventHandlerThread] |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. task=attempt_1515129349345_0057_1_08_000022_0, priority=89, capability=<memory:2048, vCores:1>, hosts=null
2018-01-08 08:22:43,866 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: Timeout monitor task not started. Timeout future state: false, #instances: 0
2018-01-08 08:22:43,866 [INFO] [TaskSchedulerEventHandlerThread] |tezplugins.LlapTaskSchedulerService|: PendingTasksInfo=numPriorityLevels=5. [p=38,c=2][p=41,c=2][p=44,c=1][p=50,c=2][p=89,c=23]. totalPendingTasks=30. delayedTaskQueueSize=0
2018-01-08 08:22:43,866 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: Inadequate total resources before scheduling pending tasks. Signalling scheduler timeout monitor thread to start timer.
2018-01-08 08:22:43,867 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: Timeout monitor task not started. Timeout future state: false, #instances: 0
2018-01-08 08:22:50,987 [INFO] [LlapTaskSchedulerTimedLogThread] |tezplugins.LlapTaskSchedulerService|: Stats for current dag: NumPreemptedTasks=0, NumRequestedAllocations=30, NumRequestsWithlocation=4, NumLocalAllocations=0,NumNonLocalAllocations=0,NumTotalAllocations=0,NumRequestsWithoutLocation=26, NumRejectedTasks=0, NumCommFailures=0, NumDelayedAllocations=2, LocalityBasedAllocationsPerHost={}, NumAllocationsPerHost={}
2018-01-08 08:23:31,081 [INFO] [AMRM Callback Handler Thread] |rm.YarnTaskSchedulerService|: Allocated: <memory:0, vCores:0> Free: <memory:6477824, vCores:1> pendingRequests: 0 delayedContainers: 0 heartbeats: 51 lastPreemptionHeartbeat: 50
2018-01-08 08:23:42,334 [INFO] [LlapTaskSchedulerTimeoutMonitor] |tezplugins.LlapTaskSchedulerService$SchedulerTimeoutMonitor|: Reporting SERVICE_UNAVAILABLE error as no instances are running
2018-01-08 08:23:42,336 [WARN] [TaskSchedulerAppCallbackExecutor #0] |tez.Utils|: Error reported by TaskScheduler [[2:LLAP]][SERVICE_UNAVAILABLE] No LLAP Daemons are running, Failing dag: [SELECT d_placement.name AS `d_placem...DESC(Stage-1), dag_1515129349345_0057_1]
Hive on Tez works perfectly without LLAP. Could anyone tell me where I went wrong? Thanks very much.

You need to find out from the logs why the LLAP daemon is failing to start. From the RM UI > Slider LLAP application > container logs, you might see something like this:
"LLAP service hosts startswith '#' but hive.zookeeper.quorum is not set. hive.zookeeper.quorum must be set."
You need to add the following property in hive-site.xml:
<property>
  <name>hive.zookeeper.quorum</name>
  <value><hostname1>:2181,<hostname2>:2181,[...]</value>
</property>
Then regenerate the Slider LLAP YARN app.
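For reference, on Hive 2.x the LLAP Slider package is usually regenerated with the hive --service llap command and then launched with the run.sh it produces. A rough sketch follows; the instance count and sizes are placeholders, the exact options vary by Hive build, and --name must match what hive.llap.daemon.service.hosts points at (here llap_service):
# regenerate the LLAP Slider package (values are illustrative only)
hive --service llap \
  --name llap_service \
  --instances 2 \
  --size 8g \
  --executors 4 \
  --cache 2g
# the command writes a directory like llap-slider-<date>/ containing run.sh;
# running it deploys the LLAP daemons as a Slider application on YARN
cd llap-slider-*/ && ./run.sh
Once the daemons register in ZooKeeper under the llap_service namespace, the Tez AM should stop reporting "#instances: 0".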
This blog post helped a lot!

Related

Manual AWS X-Ray traces not showing even though they are sent

I'm sending X-Ray data from Python manually (no Django, Flask, etc.). I can see the X-Ray data being sent in the logs, for example:
Jan 24 16:50:17 ip-172-16-7-143 python3[10700]: DEBUG:sending: {"format":"json","version":1}
Jan 24 16:50:17 ip-172-16-7-143 python3[10700]: {"aws": {"xray": {"sdk": "X-Ray for Python", "sdk_version": "2.4.3"}}, "end_time": 1579884617.5194468, "id": "c59efdf40abecd22", "in_progress": false, "name": "handle request", "service": {"runtime": "CPython", "runtime_version": "3.6.9"}, "start_time": 1579884515.5117097, "trace_id": "1-5e2b1fe3-c1c3cbc802cae49e9c364371"} to 127.0.0.1:2000.
But nothing shows up in the console. I've tried all the different filters and time frames. Where should I be looking?
UPDATE:
Adding the X-Ray daemon logs:
2020-01-24T01:50:35Z [Info] Initializing AWS X-Ray daemon 3.2.0
2020-01-24T01:50:35Z [Info] Using buffer memory limit of 9 MB
2020-01-24T01:50:35Z [Info] 144 segment buffers allocated
2020-01-24T01:50:35Z [Info] Using region: us-east-2
2020-01-24T01:50:35Z [Info] HTTP Proxy server using X-Ray Endpoint : https://xray.us-east-2.amazonaws.com
2020-01-24T01:50:35Z [Info] Starting proxy http server on 127.0.0.1:2000
From the log it looks like your X-Ray daemon never received any trace segments; otherwise there would be a log line like "[Info] Successfully sent batch of 1 segments (0.100 seconds)".
Are you using the official X-Ray Python SDK? How does the "manual sending" work? Please verify that the daemon and your application are running in the same network environment. For example, if the daemon is running in a container, make sure its UDP port 2000 is reachable from the application, and vice versa.
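For comparison, emitting a segment with the official Python SDK (aws_xray_sdk) pointed at the local daemon looks roughly like the following sketch; the service and segment names are placeholders:
# minimal sketch using the official SDK: pip install aws-xray-sdk
from aws_xray_sdk.core import xray_recorder

# point the recorder at the daemon's UDP endpoint and disable sampling
# so that every segment is actually emitted
xray_recorder.configure(
    service='handle request',          # placeholder service name
    daemon_address='127.0.0.1:2000',
    sampling=False,
)

segment = xray_recorder.begin_segment('handle request')
try:
    pass  # the real work goes here
finally:
    xray_recorder.end_segment()
If the daemon receives the segment, its log should contain the "Successfully sent batch" line mentioned above.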

RabbitMQ: listeners.tcp.default port is not changed as expected

I have installed RabbitMQ through Homebrew.
It starts with ./rabbitmq-server without any problem:
## ##
## ## RabbitMQ 3.7.6. Copyright (C) 2007-2018 Pivotal Software, Inc.
########## Licensed under the MPL. See http://www.rabbitmq.com/
###### ##
########## Logs: /usr/local/var/log/rabbitmq/rabbit#localhost.log
/usr/local/var/log/rabbitmq/rabbit#localhost_upgrade.log
Starting broker...
completed with 6 plugins.
I have read the following:
RabbitMQ Configuration
rabbitmq.conf.example
rabbitmq_conf_homebrew
Thus, the /usr/local/etc/rabbitmq path contains:
enabled_plugins
rabbitmq-env.conf
rabbitmq.conf (created manually)
The contents of these files are:
enabled_plugins
[rabbitmq_management,rabbitmq_stomp,rabbitmq_amqp1_0,rabbitmq_mqtt].
rabbitmq-env.conf
CONFIG_FILE=/usr/local/etc/rabbitmq/rabbitmq
NODE_IP_ADDRESS=127.0.0.1
NODENAME=rabbit#localhost
rabbitmq.conf
# listeners.tcp.default = 5672
listeners.tcp.default = 5662
#listeners.tcp.local = 127.0.0.1:5662 <-- Alpha
#listeners.tcp.local_v6 = ::1:5662 <-- Beta
# mqtt.listeners.tcp.default = 1883
mqtt.listeners.tcp.default = 1873
# stomp.listeners.tcp.default = 61613
stomp.listeners.tcp.default = 61603
The intent is to decrease each port by 10. It only works for MQTT and STOMP. The listeners.tcp.default value is ignored; it remains 5672 instead of 5662 as expected. I can confirm this from the /usr/local/var/log/rabbitmq/rabbit#localhost.log content, as follows:
...
2018-07-29 12:46:31.461 [info] <0.321.0> Starting message stores for vhost '/'
2018-07-29 12:46:31.461 [info] <0.325.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2018-07-29 12:46:31.465 [info] <0.321.0> Started message store of type transient for vhost '/'
2018-07-29 12:46:31.465 [info] <0.328.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2018-07-29 12:46:31.490 [info] <0.321.0> Started message store of type persistent for vhost '/'
2018-07-29 12:46:31.495 [info] <0.363.0> started TCP Listener on 127.0.0.1:5672
2018-07-29 12:46:31.495 [info] <0.223.0> Setting up a table for connection tracking on this node: tracked_connection_on_node_rabbit#localhost
2018-07-29 12:46:31.495 [info] <0.223.0> Setting up a table for per-vhost connection counting on this node: tracked_connection_per_vhost_on_node_rabbit#localhost
2018-07-29 12:46:31.496 [info] <0.33.0> Application rabbit started on node rabbit#localhost
2018-07-29 12:46:31.496 [info] <0.369.0> rabbit_stomp: default user 'guest' enabled
2018-07-29 12:46:31.497 [info] <0.385.0> started STOMP TCP Listener on [::]:61603
2018-07-29 12:46:31.497 [info] <0.33.0> Application rabbitmq_stomp started on node rabbit#localhost
2018-07-29 12:46:31.497 [info] <0.33.0> Application cowboy started on node rabbit#localhost
2018-07-29 12:46:31.498 [info] <0.33.0> Application rabbitmq_web_dispatch started on node rabbit#localhost
2018-07-29 12:46:31.572 [info] <0.33.0> Application rabbitmq_management_agent started on node rabbit#localhost
2018-07-29 12:46:31.600 [info] <0.438.0> Management plugin started. Port: 15672
2018-07-29 12:46:31.600 [info] <0.544.0> Statistics database started.
2018-07-29 12:46:31.601 [info] <0.33.0> Application rabbitmq_management started on node rabbit#localhost
2018-07-29 12:46:31.601 [info] <0.33.0> Application rabbitmq_amqp1_0 started on node rabbit#localhost
2018-07-29 12:46:31.601 [info] <0.557.0> MQTT retained message store: rabbit_mqtt_retained_msg_store_dets
2018-07-29 12:46:31.621 [info] <0.575.0> started MQTT TCP Listener on [::]:1873
2018-07-29 12:46:31.622 [info] <0.33.0> Application rabbitmq_mqtt started on node rabbit#localhost
2018-07-29 12:46:31.622 [notice] <0.94.0> Changed loghwm of /usr/local/var/log/rabbitmq/rabbit#localhost.log to 50
2018-07-29 12:46:31.882 [info] <0.5.0> Server startup complete; 6 plugins started.
* rabbitmq_mqtt
* rabbitmq_amqp1_0
* rabbitmq_management
* rabbitmq_management_agent
* rabbitmq_web_dispatch
* rabbitmq_stomp
Thus, from the above:
started TCP Listener on 127.0.0.1:5672 (should be 5662)
started STOMP TCP Listener on [::]:61603 (changed as expected)
Management plugin started. Port: 15672 (no change needed)
started MQTT TCP Listener on [::]:1873 (changed as expected)
I have the same behaviour if I enable Alpha and Beta.
The server is stopped with ./rabbitmqctl stop and started again with ./rabbitmq-server
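To double-check what the running node actually picked up, the status output can be inspected; a sketch, assuming the same sbin directory as above:
# shows, among other things, the config file(s) the node loaded ...
./rabbitmqctl status | grep -A 2 config_files
# ... and the listeners it actually bound
./rabbitmqctl status | grep -A 8 listeners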
What is missing or wrong?

YARN reports Flink job as FINISHED and SUCCEEDED when the Flink job failed

I am running a Flink job on YARN. We use "flink run" on the command line to submit our job to YARN. One day we had an exception in the Flink job; since we hadn't enabled the Flink restart strategy, it simply failed. But we eventually found that the job status in the YARN application list was "SUCCEEDED", which we expected to be "FAILED".
Flink CLI log:
06/12/2018 03:13:37 FlatMap (getTagStorageMapper.flatMap)(23/32) switched to CANCELED
06/12/2018 03:13:37 GroupReduce (ResultReducer.reduceGroup)(31/32) switched to CANCELED
06/12/2018 03:13:37 FlatMap (SubClassEDFJoinMapper.flatMap)(29/32) switched to CANCELED
06/12/2018 03:13:37 CHAIN DataSource (SubClassInventory.AvroInputFormat.createInput) -> FlatMap (SubClassInventoryMapper.flatMap)(27/32) switched to CANCELED
06/12/2018 03:13:37 GroupReduce (OutputReducer.reduceGroup)(28/32) switched to CANCELED
06/12/2018 03:13:37 CHAIN DataSource (SubClassInventory.AvroInputFormat.createInput) -> FlatMap (BIMBQMInstrumentMapper.flatMap)(27/32) switched to CANCELED
06/12/2018 03:13:37 GroupReduce (BIMBQMGovCorpReduce.reduceGroup)(30/32) switched to CANCELED
06/12/2018 03:13:37 FlatMap (BIMBQMEVMJoinMapper.flatMap)(32/32) switched to CANCELED
06/12/2018 03:13:37 Job execution switched to status FAILED.
No JobSubmissionResult returned, please make sure you called ExecutionEnvironment.execute()
2018-06-12 03:13:37,625 INFO org.apache.flink.yarn.YarnClusterClient - Sending shutdown request to the Application Master
2018-06-12 03:13:37,625 INFO org.apache.flink.yarn.YarnClusterClient - Start application client.
2018-06-12 03:13:37,630 INFO org.apache.flink.yarn.ApplicationClient - Notification about new leader address akka.tcp://flink#ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager with session ID 00000000-0000-0000-0000-000000000000.
2018-06-12 03:13:37,632 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2018-06-12 03:13:37,633 INFO org.apache.flink.yarn.ApplicationClient - Received address of new leader akka.tcp://flink#ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager with session ID 00000000-0000-0000-0000-000000000000.
2018-06-12 03:13:37,634 INFO org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager null.
2018-06-12 03:13:37,635 INFO org.apache.flink.yarn.ApplicationClient - Trying to register at JobManager akka.tcp://flink#ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager.
2018-06-12 03:13:37,688 INFO org.apache.flink.yarn.ApplicationClient - Successfully registered at the ResourceManager using JobManager Actor[akka.tcp://flink#ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager#182802345]
2018-06-12 03:13:38,648 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2018-06-12 03:13:39,480 INFO org.apache.flink.yarn.YarnClusterClient - Application application_1528772982594_0001 finished with state FINISHED and final state SUCCEEDED at 1528773218662
2018-06-12 03:13:39,480 INFO org.apache.flink.yarn.YarnClusterClient - YARN Client is shutting down
2018-06-12 03:13:39,582 INFO org.apache.flink.yarn.ApplicationClient - Stopped Application client.
2018-06-12 03:13:39,583 INFO org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager Actor[akka.tcp://flink#ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager#182802345].
Flink job manager Log:
FlatMap (BIMBQMEVMJoinMapper.flatMap) (32/32) (67a002e07fe799c1624a471340c8cf9d) switched from CANCELING to CANCELED.
Try to restart or fail the job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) if no longer possible.
Requesting new TaskManager container with 8192 megabytes memory. Pending requests: 1
Job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) switched from state FAILING to FAILED.
Could not restart the job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) because the restart strategy prevented it.
Unregistered task manager ip-10-97-44-186/10.97.44.186. Number of registered task managers 31. Number of available slots 31
Stopping JobManager with final application status SUCCEEDED and diagnostics: Flink YARN Client requested shutdown
Shutting down cluster with status SUCCEEDED : Flink YARN Client requested shutdown
Unregistering application from the YARN Resource Manager
Waiting for application to be successfully unregistered.
Can anybody help me understand why YARN says my Flink job "SUCCEEDED"?
The application status reported in YARN does not reflect the status of the executed job but the status of the Flink cluster, because the cluster is what the YARN application actually is. Thus, the final status of the YARN application only depends on whether the Flink cluster shut down properly or not. Put differently, a failed job does not necessarily mean that the Flink cluster failed; these are two different things.
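For this particular run, the two statuses can be seen side by side; a sketch, using the application ID from the CLI log above:
# final status of the YARN application = how the Flink cluster shut down
yarn application -status application_1528772982594_0001
# the job's own outcome is in the JobManager log, e.g.
#   "... switched from state FAILING to FAILED."
yarn logs -applicationId application_1528772982594_0001 | grep 'switched from state'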

Way to show AWS API calls being made by Packer in post-processors section?

I have a Packer template with the following post-processors section:
"post-processors": [
{
"type": "amazon-import",
"ami_name": "my_image-{{user `os_version`}}",
"access_key": "{{user `aws_access_key`}}",
"secret_key": "{{user `aws_secret_key`}}",
"region": "us-east-1",
"s3_bucket_name": "my_s3_bucket",
"tags": {
"Description": "Packer build {{timestamp}}",
"Version": "{{user `build_version`}}"
},
"only": ["aws"]
}
I'm trying to debug a policy/permissions issue and wanted to see more details as to what AWS API calls Packer is making here with the amazon-import Post-Processor.
I'm aware of the PACKER_LOG=1 environment variable, but is there anything more verbose than this? This output doesn't give me much to go on:
2017/08/11 23:55:24 packer: 2017/08/11 23:55:24 Waiting for state to become: completed
2017/08/11 23:55:24 packer: 2017/08/11 23:55:24 Using 2s as polling delay (change with AWS_POLL_DELAY_SECONDS)
2017/08/11 23:55:24 packer: 2017/08/11 23:55:24 Allowing 300s to complete (change with AWS_TIMEOUT_SECONDS)
2017/08/12 00:29:59 ui: aws (amazon-import): Import task import-ami-fg0qxxdb complete
aws (amazon-import): Import task import-ami-fg0qxxdb complete
2017/08/12 00:29:59 ui: aws (amazon-import): Starting rename of AMI (ami-c01125bb)
aws (amazon-import): Starting rename of AMI (ami-c01125bb)
2017/08/12 00:29:59 ui: aws (amazon-import): Waiting for AMI rename to complete (may take a while)
2017/08/12 00:29:59 packer: 2017/08/12 00:29:59 Waiting for state to become: available
aws (amazon-import): Waiting for AMI rename to complete (may take a while)
2017/08/12 00:29:59 packer: 2017/08/12 00:29:59 Using 2s as polling delay (change with AWS_POLL_DELAY_SECONDS)
2017/08/12 00:29:59 packer: 2017/08/12 00:29:59 Allowing 300s to complete (change with AWS_TIMEOUT_SECONDS)
2017/08/12 00:29:59 packer: 2017/08/12 00:29:59 Error on AMIStateRefresh: UnauthorizedOperation: You are not authorized to perform this operation.
2017/08/12 00:29:59 packer: status code: 403, request id: f53ea750-788e-4213-accc-def6ca459113
2017/08/12 00:29:59 [INFO] (telemetry) ending amazon-import
2017/08/12 00:29:59 [INFO] (telemetry) found error: Error waiting for AMI (ami-3f132744): UnauthorizedOperation: You are not authorized to perform this operation.
status code: 403, request id: f53ea750-788e-4213-accc-def6ca459113
2017/08/12 00:29:59 Deleting original artifact for build 'aws'
2017/08/12 00:29:59 ui error: Build 'aws' errored: 1 error(s) occurred:
* Post-processor failed: Error waiting for AMI (ami-3f132744): UnauthorizedOperation: You are not authorized to perform this operation.
status code: 403, request id: f53ea750-788e-4213-accc-def6ca459113
2017/08/12 00:29:59 Builds completed. Waiting on interrupt barrier...
2017/08/12 00:29:59 machine readable: error-count []string{"1"}
2017/08/12 00:29:59 ui error:
==> Some builds didn't complete successfully and had errors:
2017/08/12 00:29:59 machine readable: aws,error []string{"1 error(s) occurred:\n\n* Post-processor failed: Error waiting for AMI (ami-3f132744): UnauthorizedOperation: You are not authorized to perform this operation.\n\tstatus code: 403, request id: f53ea750-788e-4213-accc-def6ca459113"}
Build 'aws' errored: 1 error(s) occurred:
2017/08/12 00:29:59 ui error: --> aws: 1 error(s) occurred:
* Post-processor failed: Error waiting for AMI (ami-3f132744): UnauthorizedOperation: You are not authorized to perform this operation.
status code: 403, request id: f53ea750-788e-4213-accc-def6ca459113
2017/08/12 00:29:59 ui:
==> Builds finished but no artifacts were created.
* Post-processor failed: Error waiting for AMI (ami-3f132744): UnauthorizedOperation: You are not authorized to perform this operation.
status code: 403, request id: f53ea750-788e-4213-accc-def6ca459113
==> Some builds didn't complete successfully and had errors:
--> aws: 1 error(s) occurred:
* Post-processor failed: Error waiting for AMI (ami-3f132744): UnauthorizedOperation: You are not authorized to perform this operation.
status code: 403, request id: f53ea750-788e-4213-accc-def6ca459113
==> Builds finished but no artifacts were created.
2017/08/12 00:30:00 [WARN] (telemetry) Error finalizing report. This is safe to ignore. Post https://checkpoint-api.hashicorp.com/v1/telemetry/packer: context deadline exceeded
2017/08/12 00:30:00 waiting for all plugin processes to complete...
2017/08/12 00:30:00 /usr/local/bin/packer: plugin process exited
2017/08/12 00:30:00 /usr/local/bin/packer: plugin process exited
2017/08/12 00:30:00 /usr/local/bin/packer: plugin process exited
I'm assuming this is a policy permissions issue but I can't tell what I'm missing from the above output.
Unfortunately there is no more debugging to enable.
I recommend that you verify that you have created all the policies according to the docs and review the permissions for the user. You can do that by pasting the access key ID into the IAM search.
As a last resort it can be useful to go through the process manually with the AWS CLI.
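For instance, the call that fails in the log above is the AMI state refresh, which corresponds to an ec2:DescribeImages call, so replaying it with the same credentials can narrow down the missing permission. A sketch, using the IDs from the log:
# confirm which principal the access keys actually belong to
aws sts get-caller-identity
# replay the call the post-processor was making when it got the 403
aws ec2 describe-images --region us-east-1 --image-ids ami-3f132744
# the import task itself can also be inspected
aws ec2 describe-import-image-tasks --region us-east-1 --import-task-ids import-ami-fg0qxxdb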
Not within Packer, but you could use AWS CloudTrail to see which APIs have been called:
https://aws.amazon.com/cloudtrail/
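CloudTrail events can also be pulled from the CLI once they show up (they may lag by several minutes); for example, filtering on the failing call:
aws cloudtrail lookup-events \
  --region us-east-1 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=DescribeImages \
  --max-results 20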

IO thread error : 1595 (Relay log write failure: could not queue event from master)

Slave status :
Last_IO_Errno: 1595
Last_IO_Error: Relay log write failure: could not queue event from master
Last_SQL_Errno: 0
From the error log:
[ERROR] Slave I/O for channel 'db12': Unexpected master's heartbeat data: heartbeat is not compatible with local info; the event's data: log_file_name toku10-bin.000063<D1> log_pos 97223067, Error_code: 1623
[ERROR] Slave I/O for channel 'db12': Relay log write failure: could not queue event from master, Error_code: 1595
I tried restarting the slave IO thread many times; it's still the same.
We need to manually start the IO thread whenever it stops; I suspect this is a bug in Percona.
I have simply written a shell script and scheduled it to run every 10 minutes: it checks whether the IO thread is running and, if not, runs START SLAVE IO_THREAD FOR CHANNEL 'db12';. It's working for now.
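The watchdog can be as simple as a cron job like the following sketch; it assumes credentials come from ~/.my.cnf or a login path, and the channel name is the one from the error above:
#!/bin/bash
# restart the replication IO thread for channel 'db12' if it is not running
IO_RUNNING=$(mysql -e "SHOW SLAVE STATUS FOR CHANNEL 'db12'\G" \
             | awk '/Slave_IO_Running:/ {print $2}')
if [ "$IO_RUNNING" != "Yes" ]; then
    mysql -e "START SLAVE IO_THREAD FOR CHANNEL 'db12';"
fi
# crontab entry: */10 * * * * /path/to/check_io_thread.sh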