Rancher Cluster Flapping - Increase API Read Body Timeout? - api

We are using Rancher 2.2.13 and Kubernetes 1.13.12 in GKE. Our cluster connection keeps flapping in Rancher. The agent logs show:
E0804 00:50:08.384154 6 request.go:853] Unexpected error when reading response body: context.deadlineExceededError{}
E0804 00:50:08.384223 6 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Secret: Unexpected error context.deadlineExceededError{} when reading response body. Please retry.
E0804 00:50:08.385380 6 request.go:853] Unexpected error when reading response body: &http.httpError{err:"context deadline exceeded (Client.Timeout exceeded while reading body)", timeout:true}
E0804 00:50:08.385431 6 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ConfigMap: Unexpected error &http.httpError{err:"context deadline exceeded (Client.Timeout exceeded while reading body)", timeout:true} when reading response body. Please retry.
The underlying issue seems to be that this particular cluster has roughly 24K ConfigMaps and 17K Secrets, so the list responses for both are going to be immense.
Is there any way to increase the read timeout for the body? Is there a paging feature, or is there any way to implement one if there isn't?
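For what it's worth, the Kubernetes API itself supports chunked list responses via the limit and continue parameters, which avoids returning tens of thousands of objects in one body. This doesn't change how the Rancher agent issues its list calls, but a minimal sketch with the official Python client shows the mechanism (the page size of 500 is an arbitrary choice):

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# Fetch Secrets in pages of 500 instead of one huge list response.
continue_token = None
total = 0
while True:
    resp = v1.list_secret_for_all_namespaces(limit=500, _continue=continue_token)
    total += len(resp.items)
    continue_token = resp.metadata._continue
    if not continue_token:
        break

print(f"fetched {total} secrets in chunks of 500")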

Related

Load from GCS to GBQ causes an internal BigQuery error

My application creates thousands of "load jobs" daily to load data from Google Cloud Storage URIs into BigQuery, and only a few of them fail with the error:
"Finished with errors. Detail: An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support. Error: 7916072"
The application is written in Python and uses these libraries:
google-cloud-storage==1.42.0
google-cloud-bigquery==2.24.1
google-api-python-client==2.37.0
The load job is created by calling:
load_job = self._client.load_table_from_uri(
    source_uris=source_uri,
    destination=destination,
    job_config=job_config,
)
This method has a default parameter:
retry: retries.Retry = DEFAULT_RETRY,
so the job should automatically be retried on such errors.
ID of a specific job that finished with the error:
"load_job_id": "6005ab89-9edf-4767-aaf1-6383af5e04b6"
"load_job_location": "US"
After getting the error, the application recreates the job, but it doesn't help.
Subsequent failed job IDs:
5f43a466-14aa-48cc-a103-0cfb4e0188a2
43dc3943-4caa-4352-aa40-190a2f97d48d
43084fcd-9642-4516-8718-29b844e226b1
f25ba358-7b9d-455b-b5e5-9a498ab204f7
...
As mentioned in the error message, wait according to the back-off requirements described in the BigQuery Service Level Agreement, then try the operation again.
If the error continues to occur and you have a support plan, please create a new GCP support case. Otherwise, you can open a new issue on the issue tracker describing your issue. You can also try to reduce the frequency of this error by using Reservations.
For more information about the error messages you can refer to this document.
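As far as I can tell, the retry parameter on load_table_from_uri only covers the API request that creates the job; a job that was created successfully but finished with an internal error has to be resubmitted by the application. A minimal sketch of what retrying with back-off could look like, reusing the client, source_uri, destination and job_config names from the question (the attempt limit and delays are arbitrary):

import time

MAX_ATTEMPTS = 5  # arbitrary cap, tune to your workload

def load_with_backoff(client, source_uri, destination, job_config):
    # Resubmit the load job with exponential back-off on transient failures.
    delay = 2.0
    for attempt in range(1, MAX_ATTEMPTS + 1):
        load_job = client.load_table_from_uri(
            source_uris=source_uri,
            destination=destination,
            job_config=job_config,
        )
        try:
            return load_job.result()  # waits for completion, raises if the job failed
        except Exception as exc:  # e.g. google.api_core.exceptions.InternalServerError
            if attempt == MAX_ATTEMPTS:
                raise
            print(f"load job {load_job.job_id} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
            delay *= 2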

Request timed out error on copying data in azure data factory

I am receiving the below error when running a copy activity in my ADF pipeline. My source and sink are Cosmos DB containers in different subscriptions. The ADF pipeline is created in the subscription that has the target (sink) Cosmos DB container.
Error:
Error code 2200 Failure type User configuration issue
Type=Microsoft.Azure.Documents.RequestTimeoutException,Message=Request
timed out. ActivityId: 0d2b8ebb-090d-43eb-8494-f82e53b3134b, Request
URI: /dbs/ZLQDAA==/colls/ZLQDAIez1wo=/docs, RequestStats: , SDK:
documentdb-dotnet-sdk/2.5.1 Host/64-bit
MicrosoftWindowsNT/6.2.9200.0,Source=Microsoft.Azure.Documents.Client,''Type=System.Threading.Tasks.TaskCanceledException,Message=A
task was
canceled.,Source=mscorlib,''Type=Microsoft.Azure.Documents.RequestTimeoutException,Message=Request
timed out. ActivityId: 0d2b8ebb-090d-43eb-8494-f82e53b3134b, Request
URI: /dbs/ZLQDAA==/colls/ZLQDAIez1wo=/docs, RequestStats: , SDK:
documentdb-dotnet-sdk/2.5.1 Host/64-bit
MicrosoftWindowsNT/6.2.9200.0,Source=Microsoft.Azure.Documents.Client,''Type=System.Threading.Tasks.TaskCanceledException,Message=A
task was
canceled.,Source=mscorlib,''Type=Microsoft.Azure.Documents.RequestTimeoutException,Message=Request
timed out. ActivityId: 0d2b8ebb-090d-43eb-8494-f82e53b3134b, Request
URI: /dbs/ZLQDAA==/colls/ZLQDAIez1wo=/docs, RequestStats: , SDK:
documentdb-dotnet-sdk/2.5.1 Host/64-bit
MicrosoftWindowsNT/6.2.9200.0,Source=Microsoft.Azure.Documents.Client,''Type=System.Threading.Tasks.TaskCanceledException,Message=A
task was
canceled.,Source=mscorlib,''Type=Microsoft.Azure.Documents.RequestTimeoutException,Message=Request
timed out. ActivityId: 0d2b8ebb-090d-43eb-8494-f82e53b3134b, Request
URI: /dbs/ZLQDAA==/colls/ZLQDAIez1wo=/docs, RequestStats: , SDK:
documentdb-dotnet-sdk/2.5.1 Host/64-bit
MicrosoftWindowsNT/6.2.9200.0,Source=Microsoft.Azure.Documents.Client,''Type=System.Threading.Tasks.TaskCanceledException,Message=A
task was canceled.,Source=mscorlib,'
As per the official documentation:
Cosmos DB limits a single request's size to 2 MB. The formula is Request Size = Single Document Size * Write Batch Size. If you hit an error saying "Request size is too large.", reduce the writeBatchSize value in the copy sink configuration.
Page size: The number of documents per page of the query result. Default is "-1", which uses the service dynamic page size of up to 1000.
Throughput: Set an optional value for the number of RUs you'd like to apply to your CosmosDB collection for each execution of this data flow during the read operation. Minimum is 400.
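To make the quoted formula concrete, here is a back-of-the-envelope calculation (the average document size is a made-up example value):

REQUEST_LIMIT_BYTES = 2 * 1024 * 1024    # Cosmos DB per-request limit (2 MB)
avg_document_bytes = 25 * 1024           # hypothetical: ~25 KB per document

max_write_batch_size = REQUEST_LIMIT_BYTES // avg_document_bytes
print(max_write_batch_size)  # 81 -> set writeBatchSize at or below this in the copy sink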

First Request, response "OVER_QUERY_LIMIT"

We are experiencing a problem where we have apparently used up our entire request quota.
Request:
https://maps.googleapis.com/maps/api/directions/json?key=****&units=metricmode=driving&origin=-18.953220736126821,-48.24894517816412&destination=-48.2786408,-18.9218197&
Response:
error_message "You have exceeded your daily request quota for this API."
routes []
status "OVER_QUERY_LIMIT"
Our metrics haven't reported any issues. I am curious how and why these errors are being created.
My account is new and is not in production; we're just testing, and our metrics indicate only 22 requests...

Splunk rex query does not return desired result

I am looking to search for error types in my Splunk logs. A typical error log looks like this:
ERROR 2016/03/16 22:13:55 Program exited with error Calling service: Post http://hostname/v1.21/resource/create?name=/60b80cf9-ebc4-11e5-a9cb-3c4a92db9491-2: read unix #->/var/run/program.sock: use of closed network connection (Client.Timeout exceeded while awaiting headers)
Note that the common part is "Program exited with error". I am looking to capture the part that follows this common part of the error message. I tried a couple of rex expressions. Both returned different results, and importantly, neither captured the error type shown above. I am giving the one that worked better here.
* | rex "Program exited with error\s+(?<reason>.+)" | top reason
An example of a log it matched:
Unable to get program status, Get http://192.168.0.2:2774/program/v1/status: net/http: timeout awaiting response headers
However, it did not match logs of the form:
initial ZK connection failed, stat /var/program/f47aae5c-ea42-11e5-8975-fc15b40f4cc4/srcheck/started: no such file or directory
Calling service: Post http://hostname/v1.21/resource/create?name=/60b80cf9-ebc4-11e5-a9cb-3c4a92db9491-2: read unix #->/var/run/program.sock: use of closed network connection (Client.Timeout exceeded while awaiting headers)
Could someone help me understand what's wrong with my rex expression and what the right one would be so I get all possible error types?
This recipe:
"ERROR.*Program exited with error.*:.*:.*:\s+(?<reason>.+)"
will yield:
use of closed network connection (Client.Timeout exceeded while awaiting headers)
I don't have enough sample data to know if this will hold up or not. For example, I'm counting on exactly three colons to get me to the interesting part. Also, I don't know if you care about other things like the hostname, the fact that it's a Post, etc. But based on your sample of one, this answer should do the trick.
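If you want to sanity-check the pattern outside Splunk, here is a quick sketch using Python's re module (note that Python spells the named group (?P<reason>...) where Splunk's PCRE accepts (?<reason>...)); the log line is the sample from the question:

import re

# Same pattern as the rex above, in Python's named-group syntax.
pattern = re.compile(r"ERROR.*Program exited with error.*:.*:.*:\s+(?P<reason>.+)")

log = (
    "ERROR 2016/03/16 22:13:55 Program exited with error Calling service: "
    "Post http://hostname/v1.21/resource/create?name=/60b80cf9-ebc4-11e5-a9cb-3c4a92db9491-2: "
    "read unix #->/var/run/program.sock: use of closed network connection "
    "(Client.Timeout exceeded while awaiting headers)"
)

match = pattern.search(log)
print(match.group("reason"))
# -> use of closed network connection (Client.Timeout exceeded while awaiting headers)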

Frequent 503 errors raised from BigQuery Streaming API

Streaming data into BigQuery keeps failing due to the following error, which has been occurring more frequently recently:
com.google.api.client.googleapis.json.GoogleJsonResponseException: 503 Service Unavailable
{
  "code" : 503,
  "errors" : [ {
    "domain" : "global",
    "message" : "Connection error. Please try again.",
    "reason" : "backendError"
  } ],
  "message" : "Connection error. Please try again."
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:312)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1049)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:410)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343)
Relevant question references:
Getting high rate of 503 errors with BigQuery Streaming API
BigQuery - BackEnd error when loading from JAVA API
We (the BigQuery team) are looking into your report of increased connection errors. From internal monitoring, there hasn't been a global spike in connection errors in the last several days. However, that doesn't mean that your tables, specifically, weren't affected.
Connection errors can be tricky to chase down, because they can be caused by errors before they get to the BigQuery servers or after they leave. The more information you can provide, the easier it is for us to diagnose the issue.
The best practice for streaming input is to handle temporary errors like this by retrying the request. It can be a little tricky, since when you get a connection error you don't actually know whether the insert succeeded. If you include a unique insertId with your data (see the documentation here), you can safely resend the request (within the deduplication window period, which I think is 15 minutes) without worrying that the same row will get added multiple times.
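For illustration, a minimal sketch of a streaming insert that supplies an insertId per row, here with the google-cloud-bigquery Python client (the table name and row payloads are placeholders); the row_ids argument is what populates insertId for deduplication:

import uuid

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder table

rows = [
    {"user_id": 1, "event": "click"},
    {"user_id": 2, "event": "view"},
]
# One insertId per row; resending the SAME ids within the dedup window
# lets BigQuery drop duplicates if a retry re-delivers a row.
row_ids = [str(uuid.uuid4()) for _ in rows]

errors = client.insert_rows_json(table_id, rows, row_ids=row_ids)
if errors:
    # On backendError / connection errors, retry with the same row_ids.
    print(f"insert failed, will retry with the same insertIds: {errors}")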