How to delete corrupt local EKS Anywhere cluster - amazon-eks

I've created a local EKS Anywhere cluster following this tutorial.
$ CLUSTER_NAME=dev-cluster
$ eksctl anywhere generate clusterconfig $CLUSTER_NAME \
--provider docker > $CLUSTER_NAME.yaml
$ eksctl anywhere create cluster -f $CLUSTER_NAME.yaml
After creating the cluster, I tried to delete it. While the deletion was still in progress, I pressed Ctrl+C to stop the operation. The cluster now seems to be corrupt and I can no longer delete it.
$ eksctl anywhere delete cluster -f ${CONFIG_FILE}
Performing provider setup and validations
Creating management cluster
collecting cluster diagnostics
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x2479de7]
goroutine 1 [running]:
github.com/aws/eks-anywhere/pkg/workflows.(*CollectDiagnosticsTask).Run(0x0, 0x2b04168, 0xc000130008, 0xc00015efd0, 0xc061b451d05e8958, 0xc000c83440)
/codebuild/output/src572116999/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/aws.eks-anywhere/pkg/workflows/diagnostics.go:23 +0x67
github.com/aws/eks-anywhere/pkg/workflows.(*deleteManagementCluster).Run(0xc0004824f8, 0x2b04168, 0xc000130008, 0xc00015efd0, 0x13, 0x2)
/codebuild/output/src572116999/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/aws.eks-anywhere/pkg/workflows/delete.go:172 +0x191
github.com/aws/eks-anywhere/pkg/task.(*taskRunner).RunTask(0xc000a93ba8, 0x2b04168, 0xc000130008, 0xc00015efd0, 0x0, 0x0)
/codebuild/output/src572116999/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/aws.eks-anywhere/pkg/task/task.go:115 +0x1ef
github.com/aws/eks-anywhere/pkg/workflows.(*Delete).Run(0xc000a93c80, 0x2b04168, 0xc000130008, 0xc000a76330, 0xc0004d57a0, 0xc000c85300, 0x0, 0x0, 0x0, 0xc0007d1cc0)
/codebuild/output/src572116999/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/aws.eks-anywhere/pkg/workflows/delete.go:53 +0x145
github.com/aws/eks-anywhere/cmd/eksctl-anywhere/cmd.(*deleteClusterOptions).deleteCluster(0x37fb100, 0x2b04168, 0xc000130008, 0xc00041cec0, 0x0)
/codebuild/output/src572116999/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/aws.eks-anywhere/cmd/eksctl-anywhere/cmd/deletecluster.go:120 +0x389
github.com/aws/eks-anywhere/cmd/eksctl-anywhere/cmd.glob..func2(0x37cc180, 0xc00041cec0, 0x0, 0x2, 0x0, 0x0)
/codebuild/output/src572116999/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/aws.eks-anywhere/cmd/eksctl-anywhere/cmd/deletecluster.go:36 +0xb3
github.com/spf13/cobra.(*Command).execute(0x37cc180, 0xc00041cea0, 0x2, 0x2, 0x37cc180, 0xc00041cea0)
/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:852 +0x472
github.com/spf13/cobra.(*Command).ExecuteC(0x37cb280, 0x8, 0xc000000180, 0x2496ec5)
/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897
github.com/spf13/cobra.(*Command).ExecuteContext(...)
/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:890
github.com/aws/eks-anywhere/cmd/eksctl-anywhere/cmd.Execute(0x0, 0x0)
/codebuild/output/src572116999/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/aws.eks-anywhere/cmd/eksctl-anywhere/cmd/root.go:43 +0x53
main.main()
/codebuild/output/src572116999/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/aws.eks-anywhere/cmd/eksctl-anywhere/main.go:29 +0xe5
The main error seems to be:
invalid memory address or nil pointer dereference
How can I manually delete this local cluster? Running brew uninstall aws/tap/eks-anywhere did not remove it.

Error running tez in hive. Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask

Hadoop 3.3.5
Hive 3.1.3
Tez 0.10.2
I followed the instructions at this link to build Tez 0.10.2 for Hadoop 3.3.5: https://tez.apache.org/install.html
The database is stored in an S3 bucket, and I can run 'select count(*) from m1.t1' with hive.execution.engine=mr.
When I set hive.execution.engine=tez and run the same query, I get this error immediately:
2023-02-15T21:21:09,208 INFO [a6e2cd1a-b2c9-42d8-9568-8e0b64677f77 main] client.TezClient: App did not succeed. Diagnostics: Application application_1676506240754_0019 failed 2 times due to AM Container for appattempt_1676506240754_0019_000002 exited with exitCode: 1
Failing this attempt.Diagnostics: [2023-02-15 21:21:08.730]Exception from container-launch.
Container id: container_1676506240754_0019_02_000001
Exit code: 1
[2023-02-15 21:21:08.732]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.tez.dag.app.DAGAppMaster
[2023-02-15 21:21:08.733]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.tez.dag.app.DAGAppMaster
If I set tez.use.cluster.hadoop-libs to true in tez-site.xml, the job starts on YARN but fails with an AWS credentials error, even though I have set the fs.s3a credentials in Hadoop's core-site.xml, in Hive's hive-site.xml, and as environment variables in .bashrc.
The keys below are masked and shown as samples only:
echo $AWS_ACCESS_KEY_ID
I9U996400005XXXXXXXX
echo $AWS_SECRET_KEY
mPY8GiU6NegNWoVnaODXXXXXXXXXXXXXXXXXXXX
hive> set hive.execution.engine=tez;
hive> select count(*) from m1.t1;
Query ID = hdp-user_20230215210146_62ed9fab-5d4a-42a9-bf54-5fb6f84a9048
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1676506240754_0015)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 container INITIALIZING -1 0 0 -1 0 0
Reducer 2 container INITED 1 0 0 1 0 0
----------------------------------------------------------------------------------------------
VERTICES: 00/02 [>>--------------------------] 0% ELAPSED TIME: 2.03 s
----------------------------------------------------------------------------------------------
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1676506240754_0015_3_00, diagnostics=[Vertex vertex_1676506240754_0015_3_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: t1 initializer failed, vertex=vertex_1676506240754_0015_3_00 [Map 1], java.nio.file.AccessDeniedException: s3a://hadoop-cluster/warehouse/tablespace/managed/hive/m1.db/t1: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY))
I also tried adding all fs.s3a properties from core-site.xml to tez-site.xml, and setting fs.s3a.access.key and fs.s3a.secret.key inside the Hive session, but I still get the same error:
org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY))
Question: according to the Tez install instructions:
"Ensure tez.use.cluster.hadoop-libs is not set in tez-site.xml, or if it is set, the value should be false"
But when it is set to false, Tez cannot run at all.
When it is set to true, I get the AWS credentials error even though I have set the credentials in every configuration file and environment variable I could think of.
==========================================================
Update:
I'm not sure this is the right answer to the problem, but I finally got it working by adding this property to hive-site.xml:
<property>
<name>hive.conf.hidden.list</name>
<value>javax.jdo.option.ConnectionPassword,hive.server2.keystore.password,fs.s3a.proxy.password,dfs.adls.oauth2.credential,fs.adl.oauth2.credential</value>
</property>
By default, the fs.s3a credential properties are treated as hidden configuration even if you don't set this property. I added the property explicitly and removed the fs.s3a credential-related entries from its value.
Now I can run select count(*) with Tez.
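To double-check which fs.s3a entries a given hive-site.xml still lists in hive.conf.hidden.list, something like the following could be used. This is only a sketch: the function name hidden_s3a_keys and the SAMPLE string are made up for illustration, and it assumes the usual configuration/property/name/value layout of Hadoop-style XML files.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def hidden_s3a_keys(hive_site_xml: str) -> Optional[List[str]]:
    """Return the fs.s3a.* entries in hive.conf.hidden.list.

    Returns None when the property is not set at all, in which case
    Hive falls back to its built-in default list.
    """
    root = ET.fromstring(hive_site_xml)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name == "hive.conf.hidden.list":
            value = prop.findtext("value") or ""
            entries = [entry.strip() for entry in value.split(",")]
            return [entry for entry in entries if entry.startswith("fs.s3a.")]
    return None


# Illustrative input: the property exactly as posted in the update above.
SAMPLE = """<configuration>
<property>
<name>hive.conf.hidden.list</name>
<value>javax.jdo.option.ConnectionPassword,hive.server2.keystore.password,fs.s3a.proxy.password,dfs.adls.oauth2.credential,fs.adl.oauth2.credential</value>
</property>
</configuration>"""

print(hidden_s3a_keys(SAMPLE))  # ['fs.s3a.proxy.password']
```

With the value from the update, only fs.s3a.proxy.password remains hidden, so the access key and secret key are no longer on the hidden list.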

Failed to use vscode remote ssh, but use ssh directly can work

Problem
I reinstalled the operating system on my server. Before that, Remote-SSH worked normally, but now I can no longer connect to the server with it. Connecting with ssh directly still works.
I suppose it manages to log in to the system but then something breaks.
The error log is below:
Welcome to Ubuntu 20.04 LTS (GNU/Linux 5.4.0-77-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
System information as of Tue 14 Sep 2021 09:56:58 PM CST
System load: 0.07 Processes: 117
Usage of /: 6.5% of 59.00GB Users logged in: 1
Memory usage: 10% IPv4 address for eth0: 10.0.12.2
Swap usage: 0%
* Super-optimized for small spaces - read how we shrank the memory
footprint of MicroK8s to make it the smallest full K8s around.
https://ubuntu.com/blog/microk8s-memory-optimisation
ready: 6425958cce28
Linux 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021
6425958cce28: running
bash: line 1: _exitcode: command not found
bash: line 2: syntax error near unexpected token `elif'
bash: line 2: ` elif [[ $ALLOW_CLIENT_DOWNLOAD == "1" ]]; then'
-sh: 4: function: not found
-sh: 69: [[: not found
-sh: 90: [[: not found
-sh: 155: Syntax error: "(" unexpected (expecting "then")
Transferred: sent 17180, received 4016 bytes, in 0.5 seconds
Bytes per second: sent 35433.6, received 8283.0
local-server-1> ssh child died, shutting down
[21:56:58.587] Failed to parse remote port from server output
[21:56:58.588] Resolver error: Error:
at Function.Create (/Users/luther/.vscode/extensions/ms-vscode-remote.remote-ssh-0.65.7/out/extension.js:1:64659)
at Object.t.handleInstallOutput (/Users/luther/.vscode/extensions/ms-vscode-remote.remote-ssh-0.65.7/out/extension.js:1:63302)
at Object.e [as tryInstallWithLocalServer] (/Users/luther/.vscode/extensions/ms-vscode-remote.remote-ssh-0.65.7/out/extension.js:1:387573)
at processTicksAndRejections (internal/process/task_queues.js:93:5)
at async /Users/luther/.vscode/extensions/ms-vscode-remote.remote-ssh-0.65.7/out/extension.js:1:294473
at async Object.t.withShowDetailsEvent (/Users/luther/.vscode/extensions/ms-vscode-remote.remote-ssh-0.65.7/out/extension.js:1:406463)
at async /Users/luther/.vscode/extensions/ms-vscode-remote.remote-ssh-0.65.7/out/extension.js:1:386112
at async E (/Users/luther/.vscode/extensions/ms-vscode-remote.remote-ssh-0.65.7/out/extension.js:1:382710)
at async Object.t.resolveWithLocalServer (/Users/luther/.vscode/extensions/ms-vscode-remote.remote-ssh-0.65.7/out/extension.js:1:385728)
at async Object.t.resolve (/Users/luther/.vscode/extensions/ms-vscode-remote.remote-ssh-0.65.7/out/extension.js:1:295870)
at async /Users/luther/.vscode/extensions/ms-vscode-remote.remote-ssh-0.65.7/out/extension.js:127:110656
[21:56:58.592] ------
Tried
I tried deleting the known_hosts file on the host and reinstalling the Remote-SSH extension, but neither helped.
I am pretty new to Remote-SSH, so I would appreciate a detailed solution.
Thanks :)
I downgraded Remote-SSH. Then I changed my default shell to zsh and upgraded Remote-SSH again. It started installing the '.vscode-server' directory again, and this time it worked.
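The log above contains lines such as "-sh: 69: [[: not found", which is the kind of message a plain POSIX sh prints when it meets bash-only syntax. That would be consistent with the fix of changing the default shell, although this is an inference from the log rather than something confirmed. A small sketch for pulling those lines out of a Remote-SSH log (the names SH_ERROR, sh_errors and LOG are made up for illustration):

```python
import re

# Matches errors of the form "-sh: <line number>: <message>".
# Lines starting with "bash:" are deliberately not matched.
SH_ERROR = re.compile(r"^-?sh: \d+: (.+)$")


def sh_errors(log: str):
    """Return the messages of all sh-style error lines found in the log."""
    messages = []
    for line in log.splitlines():
        match = SH_ERROR.match(line.strip())
        if match:
            messages.append(match.group(1))
    return messages


# Excerpt copied from the error log in the question.
LOG = r"""bash: line 1: _exitcode: command not found
bash: line 2: syntax error near unexpected token `elif'
-sh: 4: function: not found
-sh: 69: [[: not found
-sh: 90: [[: not found
-sh: 155: Syntax error: "(" unexpected (expecting "then")
"""

for message in sh_errors(LOG):
    print(message)
```

If this prints anything, part of the setup script was probably interpreted by sh rather than bash, and checking the account's default shell on the server is a reasonable next step.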

gem5 x86 kvm doesn't work with error "KVM: Failed to enter virtualized mode (hw reason: 0x80000021)"

I tried to run gem5 in full-system (FS) mode with KVM to fast-forward the Linux boot, and it failed with this error:
info: 0x4b564d04: 0x0
info: 0x3b: 0x0
info: 0x6e0: 0x0
info: 0x1a0: 0x0
info: 0x17a: 0x0
info: 0x17b: 0x0
info: 0x9e: 0x0
panic: KVM: Failed to enter virtualized mode (hw reason: 0x80000021)
Memory Usage: 33878524 KBytes
Program aborted at tick 186932115
--- BEGIN LIBC BACKTRACE ---
gem5/build/X86/gem5.opt(_Z15print_backtracev+0x28)[0x15e45d8]
gem5/build/X86/gem5.opt(_Z12abortHandleri+0x46)[0x15f5196]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fb3c9f7d390]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7fb3c8a72428]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7fb3c8a7402a]
gem5/build/X86/gem5.opt[0x80f14f]
gem5/build/X86/gem5.opt[0x18cb151]
gem5/build/X86/gem5.opt(_ZN10BaseKvmCPU13handleKvmExitEv+0x1bc)[0x18cb8bc]
gem5/build/X86/gem5.opt(_ZN10BaseKvmCPU4tickEv+0x229)[0x18c8d69]
gem5/build/X86/gem5.opt(_ZN10EventQueue10serviceOneEv+0xd5)[0x15eb485]
gem5/build/X86/gem5.opt(_Z9doSimLoopP10EventQueue+0x48)[0x160a9c8]
gem5/build/X86/gem5.opt[0x160ad1f]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd57f)[0x7fb3c93e557f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fb3c9f736ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fb3c8b4441d]
--- END LIBC BACKTRACE ---
I used gem5art and slightly modified the run script so that it runs /bin/bash instead of the SPEC benchmark. This error seems to have been reported a while ago in the issue linked here. It was apparently fixed in gem5 v19, but I still get the same error code. Could anyone explain why this error happens and how to fix it?

Crash when starting Traefik in cluster mode

I recently wanted to change from a one-node Traefik install (that was using a configuration file), to a 3-node Traefik cluster.
Following the docs, I uploaded the configuration:
$ traefik storeconfig
It displayed no error, and checking the Consul KV, keys are there.
But when launching Traefik in cluster mode, I get a segmentation fault:
$ traefik --cluster=true -d
INFO[0001] Using TOML configuration file /etc/traefik/traefik.toml
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x83500e]
goroutine 1 [running]:
github.com/containous/traefik/cluster.NewLeadership(0x2e08560, 0xc420557840, 0xc4202a1340, 0x0)
/go/src/github.com/containous/traefik/cluster/leadership.go:28 +0x6e
github.com/containous/traefik/server.NewServer(0x2540be400, 0x100, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc42035b930, 0x5, 0xc4205ef740, ...)
/go/src/github.com/containous/traefik/server/server.go:105 +0x63e
main.run(0xc4205678c0)
/go/src/github.com/containous/traefik/cmd/traefik/traefik.go:307 +0x6f6
main.main.func1(0xc42016cdc0, 0xc4202b31a0)
/go/src/github.com/containous/traefik/cmd/traefik/traefik.go:61 +0xd9
github.com/containous/traefik/vendor/github.com/containous/staert.(*Staert).Run(0xc4206c1f30, 0x1aa1940, 0xc420496300)
/go/src/github.com/containous/traefik/vendor/github.com/containous/staert/staert.go:83 +0x2e
main.main()
/go/src/github.com/containous/traefik/cmd/traefik/traefik.go:218 +0x1bf1
I've tried both the latest stable release (1.3.7) and 1.4.0-rc1; both show the same error.
Any ideas?
I suspect your traefik.toml is incorrect and does not contain the right configuration for your Consul backend.
Try using the following command, or modify the consul section of the config in /etc/traefik/traefik.toml:
traefik --consul --consul.endpoint=YOURENDPOINTHERE --cluster=true -d
Also refer to the documentation:
https://docs.traefik.io/configuration/backends/consul/

Redis 3.0.3 cluster warning: wrong element type nil at 0 (expected array)

When I create a Redis cluster, I get the following errors:
[pirate@zhangbincan src]$ /home/zhangbincan/tools/ruby/ruby-2.2.2/ruby redis-trib.rb create --replicas 1 192.168.1.114:6379 192.168.1.114:6780 192.168.1.114:6381 192.168.1.114:6382 192.168.1.114:6383 192.168.1.114:6384
Creating cluster
Connecting to node 192.168.1.114:6379: OK
/usr/local/ruby2.2.2/lib/ruby/gems/2.2.0/gems/redis3.0.0/lib/redis.rb:182: warning: wrong element type nil at 0 (expected array)
/usr/local/ruby2.2.2/lib/ruby/gems/2.2.0/gems/redis3.0.0/lib/redis.rb:182: warning: ignoring wrong elements is deprecated, remove them explicitly
/usr/local/ruby2.2.2/lib/ruby/gems/2.2.0/gems/redis3.0.0/lib/redis.rb:182: warning: this causes ArgumentError in the next release
/usr/local/ruby2.2.2/lib/ruby/gems/2.2.0/gems/redis3.0.0/lib/redis.rb:182: warning: wrong element type nil at 18 (expected array)
/usr/local/ruby2.2.2/lib/ruby/gems/2.2.0/gems/redis3.0.0/lib/redis.rb:182: warning: this causes ArgumentError in the next release
Connecting to node 192.168.1.114:6780: [ERR] Sorry, can't connect to node 192.168.1.114:6780
The environment is:
[pirate@zhangbincan src]$ gem -v
2.4.8
[pirate@zhangbincan src]$ /home/zhangbincan/tools/ruby/ruby-2.2.2/ruby -v
ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
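The relevant error is "Sorry, can't connect to node 192.168.1.114:6780". Check that a Redis instance is actually listening on port 6780 and that the port is reachable. The other nodes in the command use ports 6379 and 6381 to 6384, so 6780 may be a typo for 6380.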
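One way to check whether a port accepts TCP connections is a short script like the one below. It is a sketch only: the helper name can_connect is made up for illustration, and the demo runs against a throwaway local listener so that it works anywhere.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Self-contained demo: open a temporary listener on a free local port
# and confirm that the helper can reach it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
open_port = listener.getsockname()[1]
print(can_connect("127.0.0.1", open_port))
listener.close()
```

For the cluster in the question, the same helper could be called as can_connect("192.168.1.114", port) for each port passed to redis-trib.rb. A False result for one port would point to a node that is not running or not reachable on that port.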