successful snapshot fails to load some shards, RepositoryMissingException in elasticsearch - amazon-s3

I had a backup complete successfully to my S3 bucket in Elasticsearch:
{
  "state": "SUCCESS",
  "start_time": "2014-12-06T00:12:39.362Z",
  "start_time_in_millis": 1417824759362,
  "end_time": "2014-12-06T00:33:34.352Z",
  "end_time_in_millis": 1417826014352,
  "duration_in_millis": 1254990,
  "failures": [],
  "shards": {
    "total": 345,
    "failed": 0,
    "successful": 345
  }
}
But when I restore from the snapshot, I have a few failed shards, with the following message:
[2014-12-08 00:00:05,580][WARN ][cluster.action.shard] [Sunder] [kibana-int][4] received shard failed for [kibana-int][4],
node[_QG8dkDaRD-H1uPL_p57lw], [P], restoring[elasticsearch:snapshot_1], s[INITIALIZING], indexUUID [SAuv_EU3TBGZ71NhkC7WOA],
reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[kibana-int][4] failed recovery];
nested: IndexShardRestoreFailedException[[kibana-int][4] restore failed];
nested: RepositoryMissingException[[elasticsearch] missing]; ]]
How do I reconcile the data, or, if necessary, remove the failed shards from my cluster so that the recovery can complete?

Related

Pyodbc to Hive on Dataproc intermittent error: (79) Failed to reconnect to server. (79) (SQLExecDirectW)

I launch a Dataproc cluster with Hive using the Google APIs in Python and connect to Hive with pyodbc. Hive queries succeed and fail seemingly at random.
Cloudera Hive ODBC driver 2.6.9
pyodbc 4.0.30
Error: "pyodbc.OperationalError: ('08S01', '[08S01] [Cloudera][Hardy] (79) Failed to reconnect to server. (79) (SQLExecDirectW)')"
Some server logs:
{
  "insertId": "xxx",
  "jsonPayload": {
    "filename": "yarn-yarn-timelineserver-xxx-m.log",
    "class": "org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager",
    "message": "ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted"
  }
},
{
  "insertId": "xxx",
  "jsonPayload": {
    "container": "container_xxx_0002_01_000001",
    "thread": "ORC_GET_SPLITS #4",
    "application": "application_xxx_0002",
    "message": "Failed to get files with ID; using regular API: Only supported for DFS; got class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "class": "|io.AcidUtils|",
    "filename": "application_xxx_0002.container_xxx_0002_01_000001.syslog_dag_xxx_0002_4",
    "container_logname": "syslog_dag_xxx_0002_4"
  }
},

Terraform data source of an existing S3 bucket fails plan stage attempting a GetBucketWebsite request which returns NoSuchWebsiteConfiguration

I'm trying to use a data source of an existing S3 bucket like this:
data "aws_s3_bucket" "src-config-bucket" {
bucket = "single-word-name" }
And Terraform always fails the plan stage with the message:
Error: UnauthorizedOperation: You are not authorized to perform this operation.
status code: 403, request id: XXXXX
The failing requests can be seen with the following details:
{
  "eventVersion": "1.08",
  "userIdentity": {
    "type": "IAMUser",
    "principalId": "ANONYMIZED",
    "arn": "arn:aws:iam::1234567890:user/terraformops",
    "accountId": "123456789012",
    "accessKeyId": "XXXXXXXXXXXXXXXXXX",
    "userName": "terraformops"
  },
  "eventTime": "2021-02-02T18:12:19Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "GetBucketWebsite",
  "awsRegion": "eu-west-1",
  "sourceIPAddress": "X.Y.Z.W",
  "userAgent": "[aws-sdk-go/1.36.28 (go1.15.5; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.14.4 (+https://www.terraform.io)]",
  "errorCode": "NoSuchWebsiteConfiguration",
  "errorMessage": "The specified bucket does not have a website configuration",
  "requestParameters": {
    "bucketName": "s3-bucket-name",
    "website": "",
    "Host": "s3-bucket-name.s3.eu-west-1.amazonaws.com"
  }
}
Why can't I use an existing S3 bucket as a data source within Terraform? I don't treat it as a website anywhere in the Terraform project, so I don't know why it makes the GetBucketWebsite call to the server and fails. Hope someone can help.
Thanks.
I don't know why it makes the GetBucketWebsite call and fails.
It calls GetBucketWebsite because the aws_s3_bucket data source returns that information through its website_endpoint and website_domain attributes.
So you need permission to call this action on the bucket. The error message suggests that the IAM user/role you use to query the bucket does not have all the permissions needed to read that information.
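As an illustration, a policy statement along these lines should cover that call (a sketch only; the bucket name is taken from the question and the action list is not exhaustive):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketWebsite",
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::single-word-name"
    }
  ]
}
The data source reads several other bucket attributes as well, so additional s3:Get* actions may be needed depending on your setup.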

Seaweedfs volume management

I have 2 questions concerning a Seaweedfs cluster we have running. The leader is started with the following command:
/usr/local/bin/weed server -ip=192.168.13.154 -ip.bind=192.168.13.154 -dir=/opt/seaweedfs/volume-1,/opt/seaweedfs/volume-2,/opt/seaweedfs/volume-3 -master.dir=/opt/seaweedfs/master -master.peers=192.168.13.154:9333,192.168.13.155:9333,192.168.13.156:9333 -volume.max=30,30,30 -filer=true -s3=true -metrics.address=192.168.13.84:9091
Question 1
I created a master.toml file using weed scaffold -config=master:
[master.maintenance]
# periodically run these scripts are the same as running them from 'weed shell'
scripts = """
ec.encode -fullPercent=95 -quietFor=1h
ec.rebuild -force
ec.balance -force
volume.balance -force
"""
sleep_minutes = 17 # sleep minutes between each script execution
However, the maintenance scripts seem to fail with:
shell failed to keep connected to localhost:9333: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:19333: connect: connection refused"
This makes sense, since the master is bound to IP 192.168.13.154 and the maintenance script tries to connect to localhost. How can I specify the master IP in the master.toml file?
Question 2
The number of volumes seems to grow faster than the amount of disk space used. For example, on the .154 server there are only 11 free volumes, but looking at the disk space there should be many more.
Status:
{
  "Topology": {
    "DataCenters": [
      {
        "Free": 16,
        "Id": "DefaultDataCenter",
        "Max": 270,
        "Racks": [
          {
            "DataNodes": [
              {
                "EcShards": 0,
                "Free": 0,
                "Max": 90,
                "PublicUrl": "192.168.13.155:8080",
                "Url": "192.168.13.155:8080",
                "Volumes": 90
              },
              {
                "EcShards": 0,
                "Free": 11,
                "Max": 90,
                "PublicUrl": "192.168.13.154:8080",
                "Url": "192.168.13.154:8080",
                "Volumes": 79
              },
              {
                "EcShards": 0,
                "Free": 5,
                "Max": 90,
                "PublicUrl": "192.168.13.156:8080",
                "Url": "192.168.13.156:8080",
                "Volumes": 85
              }
            ],
            "Free": 16,
            "Id": "DefaultRack",
            "Max": 270
          }
        ]
      }
    ],
    "Free": 16,
    "Max": 270,
    "layouts": [
      ...
    ]
  },
  "Version": "30GB 1.44"
}
Disk (192.168.13.154):
/dev/sdb1 1007G 560G 397G 59% /opt/seaweedfs/volume-1
/dev/sdc1 1007G 542G 414G 57% /opt/seaweedfs/volume-2
/dev/sdd1 1007G 398G 559G 42% /opt/seaweedfs/volume-3
Is this related to the maintenance scripts not running properly, or is there something else I'm not understanding correctly?
Question 1: Added a fix https://github.com/chrislusf/seaweedfs/commit/56244fb9a13c75616aa8a9232c62d1b896906e98
Question 2: Likely related to master leadership changes.
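Until a build with that fix is available, one possible workaround (a sketch, using the master address from the question) is to run the same maintenance commands from weed shell pointed explicitly at the master:
weed shell -master=192.168.13.154:9333
> ec.encode -fullPercent=95 -quietFor=1h
> ec.rebuild -force
> ec.balance -force
> volume.balance -force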

My AKS Cluster was brought down, how can I recover?

I have been playing around with load-testing my application on a single agent cluster in AKS. During the testing, the connection to the dashboard stalled and never resumed. My application seems down as well, so I am assuming the cluster is in a bad state.
The API server is restate-f4cbd3d9.hcp.centralus.azmk8s.io
kubectl cluster-info dump shows the following error:
{
  "name": "kube-dns-v20-6c8f7f988b-9wpx9.14fbbbd6bf60f0cf",
  "namespace": "kube-system",
  "selfLink": "/api/v1/namespaces/kube-system/events/kube-dns-v20-6c8f7f988b-9wpx9.14fbbbd6bf60f0cf",
  "uid": "47f57d3c-d577-11e7-88d4-0a58ac1f0249",
  "resourceVersion": "185572",
  "creationTimestamp": "2017-11-30T02:36:34Z",
  "InvolvedObject": {
    "Kind": "Pod",
    "Namespace": "kube-system",
    "Name": "kube-dns-v20-6c8f7f988b-9wpx9",
    "UID": "9d2b20f2-d3f5-11e7-88d4-0a58ac1f0249",
    "APIVersion": "v1",
    "ResourceVersion": "299",
    "FieldPath": "spec.containers{kubedns}"
  },
  "Reason": "Unhealthy",
  "Message": "Liveness probe failed: Get http://10.244.0.4:8080/healthz-kubedns: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)",
  "Source": {
    "Component": "kubelet",
    "Host": "aks-agentpool-34912234-0"
  },
  "FirstTimestamp": "2017-11-30T02:23:50Z",
  "LastTimestamp": "2017-11-30T02:59:00Z",
  "Count": 6,
  "Type": "Warning"
}
There are also some pod sync errors in kube-system.
Example of issue:
az aks browse -g REstate.Server -n REstate
Merged "REstate" as current context in C:\Users\User\AppData\Local\Temp\tmp29d0conq
Proxy running on http://127.0.0.1:8001/
Press CTRL+C to close the tunnel...
error: error upgrading connection: error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection timed out
You'll probably need to SSH to the node to see if the kubelet service is running. For the future, you can set resource quotas to keep workloads from exhausting all resources on the cluster nodes.
Resource Quotas - https://kubernetes.io/docs/concepts/policy/resource-quotas/
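For illustration, a minimal ResourceQuota might look like this (the namespace and the limits are assumptions, not values from the question):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: load-test        # hypothetical namespace used for the load test
spec:
  hard:
    requests.cpu: "2"         # total CPU requested by all pods in the namespace
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi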

AWS xray put trace segment command return error

I am trying to send a segment document manually using the CLI, following the example on this page: https://docs.aws.amazon.com/xray/latest/devguide/xray-api-sendingdata.html#xray-api-segments
I created my own trace ID, as well as start and end times.
The commands I used are:
> DOC='{"trace_id": "'$TRACE_ID'", "id": "6226467e3f841234", "start_time": 1581596193, "end_time": 1581596198, "name": "test.com"}'
> echo $DOC
{"trace_id": "1-5e453c54-3dc3e03a3c86f97231d06c88", "id": "6226467e3f845502", "start_time": 1581596193, "end_time": 1581596198, "name": "test.com"}
> aws xray put-trace-segments --trace-segment-documents $DOC
{
"UnprocessedTraceSegments": [
{
"ErrorCode": "ParseError",
"Message": "Invalid segment. ErrorCode: ParseError"
},
{
"ErrorCode": "MissingId",
"Message": "Invalid segment. ErrorCode: MissingId"
},
{
"ErrorCode": "MissingId",
"Message": "Invalid segment. ErrorCode: MissingId"
},
.................
The put-trace-segments command keeps giving me errors. The segment doc complies with the JSON schema too. Am I missing something else?
Thanks.
I needed to enclose the JSON in double quotes. The command that works for me was: aws xray put-trace-segments --trace-segment-documents "$DOC"
Without the quotes, the shell word-splits $DOC into several arguments, each of which gets parsed as a separate, invalid segment document. This is probably due to an error in the documentation, or the X-Ray team was using a different kind of shell.
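For reference, the full sequence with the quoting fix applied might look like this (bash assumed; the values are copied from the question):
TRACE_ID="1-5e453c54-3dc3e03a3c86f97231d06c88"
DOC='{"trace_id": "'"$TRACE_ID"'", "id": "6226467e3f841234", "start_time": 1581596193, "end_time": 1581596198, "name": "test.com"}'
# Quoting "$DOC" keeps the JSON as a single argument instead of letting the shell split it on spaces.
aws xray put-trace-segments --trace-segment-documents "$DOC"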