Debugging istio rate limiting handler - redis

I'm trying to apply rate limiting on some of our internal services (inside the mesh).
I used the example from the docs and generated redis rate limiting configurations that include a (redis) handler, quota instance, quota spec, quota spec binding and rule to apply the handler.
This redis handler:
apiVersion: config.istio.io/v1alpha2
kind: handler
metadata:
  name: redishandler
  namespace: istio-system
spec:
  compiledAdapter: redisquota
  params:
    redisServerUrl: <REDIS>:6379
    connectionPoolSize: 10
    quotas:
    - name: requestcountquota.instance.istio-system
      maxAmount: 10
      validDuration: 100s
      rateLimitAlgorithm: FIXED_WINDOW
      overrides:
      - dimensions:
          destination: s1
        maxAmount: 1
      - dimensions:
          destination: s3
        maxAmount: 1
      - dimensions:
          destination: s2
        maxAmount: 1
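To make the intended semantics of this handler concrete, here is a minimal Python sketch of a fixed-window quota with per-dimension overrides. It only illustrates the idea behind FIXED_WINDOW, maxAmount, validDuration and overrides; it is not Mixer's redisquota code, and the names FixedWindowQuota, limit_for and allow are made up for the example.

```python
import time


class FixedWindowQuota:
    """Illustrative fixed-window counter; NOT the redisquota adapter's code."""

    def __init__(self, max_amount, valid_duration, overrides=None, clock=time.monotonic):
        self.max_amount = max_amount          # default limit per window
        self.valid_duration = valid_duration  # window length in seconds
        self.overrides = overrides or []      # list of (dimensions, max_amount)
        self.clock = clock                    # injectable so the sketch can be tested
        self.windows = {}                     # dimensions key -> (window_start, used)

    def limit_for(self, dimensions):
        # The first override whose dimensions all match wins; else the default.
        for override_dims, amount in self.overrides:
            if all(dimensions.get(k) == v for k, v in override_dims.items()):
                return amount
        return self.max_amount

    def allow(self, dimensions, charge=1):
        key = tuple(sorted(dimensions.items()))
        now = self.clock()
        start, used = self.windows.get(key, (now, 0))
        if now - start >= self.valid_duration:
            start, used = now, 0              # window expired: start a new one
        if used + charge > self.limit_for(dimensions):
            self.windows[key] = (start, used)
            return False                      # this is where a 429 would be expected
        self.windows[key] = (start, used + charge)
        return True


quota = FixedWindowQuota(
    max_amount=10,
    valid_duration=100,
    overrides=[({"destination": "s1"}, 1)],
)
print(quota.allow({"destination": "s1"}))  # first request in the window
print(quota.allow({"destination": "s1"}))  # second request exceeds the override
```

With the values from the handler above, the second request to s1 inside the same 100s window is the one that should be rejected, so a second curl that still succeeds suggests the check is never being made.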
The quota instance (I'm only interested in limiting by destination at the moment):
apiVersion: config.istio.io/v1alpha2
kind: instance
metadata:
  name: requestcountquota
  namespace: istio-system
spec:
  compiledTemplate: quota
  params:
    dimensions:
      destination: destination.labels["app"] | destination.service.host | "unknown"
A quota spec, charging 1 per request if I understand correctly:
apiVersion: config.istio.io/v1alpha2
kind: QuotaSpec
metadata:
  name: request-count
  namespace: istio-system
spec:
  rules:
  - quotas:
    - charge: 1
      quota: requestcountquota
A quota spec binding for all participating services (which pre-fetch quota). I also tried service: "*", which did nothing either.
apiVersion: config.istio.io/v1alpha2
kind: QuotaSpecBinding
metadata:
  name: request-count
  namespace: istio-system
spec:
  quotaSpecs:
  - name: request-count
    namespace: istio-system
  services:
  - name: s2
    namespace: default
  - name: s3
    namespace: default
  - name: s1
    namespace: default
  # - service: '*' # Uncomment this to bind *all* services to request-count
A rule to apply the handler. It currently applies to all requests (I also tried match conditions, which changed nothing):
apiVersion: config.istio.io/v1alpha2
kind: rule
metadata:
  name: quota
  namespace: istio-system
spec:
  actions:
  - handler: redishandler
    instances:
    - requestcountquota
The VirtualService definitions are pretty similar for all participants:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: s1
spec:
  hosts:
  - s1
  http:
  - route:
    - destination:
        host: s1
The problem is nothing really happens and no rate limiting takes place. I tested with curl from pods inside the mesh. The redis instance is empty (no keys on db 0, which I assume is what the rate limiting would use) so I know it can't practically rate-limit anything.
The handler seems to be configured properly (how can I make sure?), because earlier errors in it were reported by mixer (policy). There are still some errors, but none that I associate with this problem or this configuration. The only line that mentions the redis handler is this:
2019-12-17T13:44:22.958041Z info adapters adapter closed all scheduled daemons and workers {"adapter": "redishandler.istio-system"}
But it's unclear whether this is a problem. I assume it's not.
These are the rest of the lines from the reload once I deploy:
2019-12-17T13:44:22.601644Z info Built new config.Snapshot: id='43'
2019-12-17T13:44:22.601866Z info adapters getting kubeconfig from: "" {"adapter": "kubernetesenv.istio-system"}
2019-12-17T13:44:22.601881Z warn Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2019-12-17T13:44:22.602718Z info adapters Waiting for kubernetes cache sync... {"adapter": "kubernetesenv.istio-system"}
2019-12-17T13:44:22.903844Z info adapters Cache sync successful. {"adapter": "kubernetesenv.istio-system"}
2019-12-17T13:44:22.903878Z info adapters getting kubeconfig from: "" {"adapter": "kubernetesenv.istio-system"}
2019-12-17T13:44:22.903882Z warn Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2019-12-17T13:44:22.904808Z info Setting up event handlers
2019-12-17T13:44:22.904939Z info Starting Secrets controller
2019-12-17T13:44:22.904991Z info Waiting for informer caches to sync
2019-12-17T13:44:22.957893Z info Cleaning up handler table, with config ID:42
2019-12-17T13:44:22.957924Z info adapters deleted remote controller {"adapter": "kubernetesenv.istio-system"}
2019-12-17T13:44:22.957999Z info adapters adapter closed all scheduled daemons and workers {"adapter": "prometheus.istio-system"}
2019-12-17T13:44:22.958041Z info adapters adapter closed all scheduled daemons and workers {"adapter": "redishandler.istio-system"}
2019-12-17T13:44:22.958065Z info adapters shutting down daemon... {"adapter": "kubernetesenv.istio-system"}
2019-12-17T13:44:22.958050Z info adapters shutting down daemon... {"adapter": "kubernetesenv.istio-system"}
2019-12-17T13:44:22.958096Z info adapters shutting down daemon... {"adapter": "kubernetesenv.istio-system"}
2019-12-17T13:44:22.958182Z info adapters shutting down daemon... {"adapter": "kubernetesenv.istio-system"}
2019-12-17T13:44:23.958109Z info adapters adapter closed all scheduled daemons and workers {"adapter": "kubernetesenv.istio-system"}
2019-12-17T13:55:21.042131Z info transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2019-12-17T14:14:00.265722Z info transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I'm using the demo profile with disablePolicyChecks: false to enable rate limiting. This is on istio 1.4.0, deployed on EKS.
I also tried memquota (this is our staging environment) with low limits and nothing seems to work. I never got a 429 no matter how much I went over the rate limit configured.
I don't know how to debug this and see where the configuration is wrong causing it to do nothing.
Any help is appreciated.

I too spent hours trying to decipher the documentation and get a sample working.
According to the documentation, they recommended that we enable policy checks:
https://istio.io/docs/tasks/policy-enforcement/rate-limiting/
However when that did not work, I did an "istioctl profile dump", searched for policy, and tried several settings.
I used Helm install with the following flags and was then able to get the described behaviour:
--set global.disablePolicyChecks=false \
--set values.pilot.policy.enabled=true
The second flag is what made it work, but it's not in the docs.

Related

Forward Flex Gateway Logs to Splunk

I have an instance of MuleSoft's Flex Gateway (v 1.2.0) installed on a Linux machine in a podman container. I am trying to forward container as well as API logs to Splunk. Below is my log.yaml file in /home/username/app folder. Not sure what I am doing wrong, but the logs are not getting forwarded to Splunk.
apiVersion: gateway.mulesoft.com/v1alpha1
kind: Configuration
metadata:
  name: logging-config
spec:
  logging:
    outputs:
    - name: default
      type: splunk
      parameters:
        host: <instance-name>.splunkcloud.com
        port: "443"
        splunk_token: xxxxx-xxxxx-xxxx-xxxx
        tls: "on"
        tls.verify: "off"
        splunk_send_raw: "on"
    runtimeLogs:
      logLevel: info
      outputs:
      - default
    accessLogs:
      outputs:
      - default
Please advise.
The endpoint for Splunk's HTTP Event Collector (HEC) is https://http-input.<instance-name>.splunkcloud.com:443/services/collector/raw. If you're using a free trial of Splunk Cloud then change the port number to 8088. See https://docs.splunk.com/Documentation/Splunk/latest/Data/UsetheHTTPEventCollector#Send_data_to_HTTP_Event_Collector_on_Splunk_Cloud_Platform for details.
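One way to check the endpoint and token independently of Flex Gateway is to post a test event to HEC yourself. The sketch below only builds the request using the URL pattern quoted above and the standard "Authorization: Splunk <token>" header. The function name build_hec_request and the instance name example-instance are placeholders for illustration, and actually sending the request is left commented out.

```python
import urllib.request


def build_hec_request(instance_name, token, payload, port=443):
    """Build (but do not send) a POST to the HEC raw endpoint."""
    url = f"https://http-input.{instance_name}.splunkcloud.com:{port}/services/collector/raw"
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Authorization": f"Splunk {token}"},  # HEC token auth header
        method="POST",
    )


# "example-instance" and the token are placeholders, not real values.
req = build_hec_request("example-instance", "xxxxx-xxxxx-xxxx-xxxx", "test event from the gateway host")
print(req.get_method(), req.full_url)
# To actually send it from the gateway host:
# urllib.request.urlopen(req, timeout=10)
```

If this request succeeds from the gateway host but the gateway's own logs still do not arrive, the problem is more likely in the generated configuration than in the network path or the token.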
I managed to get this to work. The issue was that I had to give full permissions to the app folder using the "chmod" command. Once that was done, the fluent-bit.conf file had an entry for Splunk and logs started flowing.

OpenShift: Pod often does not return expected requests

I'm running a .NET Core 3.1 REST API inside a pod in OpenShift, and it processes every request smoothly: all transactions to the database (outside the OpenShift network) are executed properly, and all programmatic executions finish without errors. However, about 1 in 15 requests fails with ECONNRESET and never returns the HTTP response.
Let's say I make a GET to /users/id/3. I can see this request hitting my REST API, being processed all the way down the infra layer, fetching data from the DB, wrapping the result, and sending it back to finish the request. At this point, however, nothing is received by the frontend or Postman, even though the API logs show that the request finished and returned.
All of these requests take 2.3 min to execute, and often finish in an ECONNRESET. I'm at a loss as to how to troubleshoot this. I have tried curl'ing the resource from another pod and the same behaviour appears.
I think these requests are sometimes getting lost in the cluster network, so I tried playing with the sessionAffinity of the service config, but as far as I understand it's not really related. Do I have a wrong route config or service config?
Route config
spec:
  host: api.com.cloud
  to:
    kind: Service
    name: api
    weight: 100
  port:
    targetPort: 8080-tcp
  tls:
    termination: edge
  wildcardPolicy: None
status:
  ingress:
  - host: api.com.cloud
    routerName: default
    conditions:
    - type: Admitted
      status: 'True'
      lastTransitionTime: XXXX
    wildcardPolicy: None
    routerCanonicalHostname: router-default.apps.com.cloud
Service config
spec:
  clusterIP: XXXX
  ipFamilies:
  - IPv4
  ports:
  - name: 8080-tcp
    protocol: TCP
    port: 8080
    targetPort: 8080
  internalTrafficPolicy: Cluster
  clusterIPs:
  - XXX
  type: ClusterIP
  ipFamilyPolicy: SingleStack
  sessionAffinity: None
  selector:
    deploymentconfig: api
status:
  loadBalancer: {}

Registered Targets Disappear

I have a working EKS cluster. It is using an ALB for ingress.
When I apply a service and then an ingress, most of these work as expected. However, some target groups eventually have no registered targets. If I look up the service's endpoints with kubectl describe svc my-service-name and manually register them in the target group, the pods are reachable again, but that's not a sustainable process.
Any ideas on what might be happening? Why doesn't EKS keep the target groups updated as pods cycle?
Each service (secrets, deployment, service and ingress) consists of a set of .yaml files applied like:
deploy.sh
#!/bin/bash
set -e
kubectl apply -f ./secretsMap.yaml
kubectl apply -f ./configMap.yaml
kubectl apply -f ./deployment.yaml
kubectl apply -f ./service.yaml
kubectl apply -f ./ingress.yaml
service.yaml
apiVersion: v1
kind: Service
metadata:
  name: "site-bob"
  namespace: "next-sites"
spec:
  ports:
  - port: 80
    targetPort: 3000
    protocol: TCP
  type: NodePort
  selector:
    app: "site-bob"
ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: "site-bob"
  namespace: "next-sites"
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/tags: Environment=Production,Group=api
    alb.ingress.kubernetes.io/backend-protocol: HTTP
    alb.ingress.kubernetes.io/ip-address-type: ipv4
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80},{"HTTPS":443}]'
    alb.ingress.kubernetes.io/load-balancer-name: eks-ingress-1
    alb.ingress.kubernetes.io/group.name: eks-ingress-1
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-2:402995436123:certificate/9db9dce3-055d-4655-842e-xxxxx
    alb.ingress.kubernetes.io/healthcheck-port: traffic-port
    alb.ingress.kubernetes.io/healthcheck-path: /
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: '30'
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '16'
    alb.ingress.kubernetes.io/success-codes: 200,201
    alb.ingress.kubernetes.io/healthy-threshold-count: '2'
    alb.ingress.kubernetes.io/unhealthy-threshold-count: '2'
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=60
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
    alb.ingress.kubernetes.io/actions.ssl-redirect: >
      {
        "type": "redirect",
        "redirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}
      }
    alb.ingress.kubernetes.io/actions.svc-host: >
      {
        "type":"forward",
        "forwardConfig":{
          "targetGroups":[
            {
              "serviceName":"site-bob",
              "servicePort": 80,"weight":20}
          ],
          "targetGroupStickinessConfig":{"enabled":true,"durationSeconds":200}
        }
      }
  labels:
    app: site-bob
spec:
  rules:
  - host: "staging-bob.imgeinc.net"
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ssl-redirect
            port:
              name: use-annotation
      - backend:
          service:
            name: svc-host
            port:
              name: use-annotation
        pathType: ImplementationSpecific
Something in my configuration tagged two security groups as being owned by the cluster. When I checked the load balancer controller logs:
kubectl logs -n kube-system aws-load-balancer-controller-677c7998bb-l7mwb
I saw many lines like:
{"level":"error","ts":1641996465.6707578,"logger":"controller-runtime.manager.controller.targetGroupBinding","msg":"Reconciler error","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding","name":"k8s-nextsite-sitefest-89a6f0ff0a","namespace":"next-sites","error":"expect exactly one securityGroup tagged with kubernetes.io/cluster/imageinc-next-eks-4KN4v6EX for eni eni-0c5555fb9a87e93ad, got: [sg-04b2754f1c85ac8b9 sg-07b026b037dd4d6a4]"}
sg-07b026b037dd4d6a4 has description: EKS created security group applied to ENI that is attached to EKS Control Plane master nodes, as well as any managed workloads.
sg-04b2754f1c85ac8b9 has description: Security group for all nodes in the cluster.
I removed the tag:
{
  Key: 'kubernetes.io/cluster/_cluster name_',
  value: 'owned'
}
from sg-04b2754f1c85ac8b9
and the TargetGroups started to fill in and everything is now working. Both groups were created and tagged by Terraform. I suspect my worker group configuration is off.
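To spot this condition quickly in the controller logs, the error lines can be filtered with a few lines of Python. This is a rough sketch that assumes the JSON log format shown above; the function name conflicting_security_groups is made up for the example.

```python
import json
import re


def conflicting_security_groups(log_line):
    """Return the security group IDs listed in an 'expect exactly one securityGroup' error."""
    try:
        entry = json.loads(log_line)
    except json.JSONDecodeError:
        return []  # not a JSON log line
    if not isinstance(entry, dict) or entry.get("level") != "error":
        return []
    match = re.search(
        r"expect exactly one securityGroup.*got: \[([^\]]*)\]",
        str(entry.get("error", "")),
    )
    return match.group(1).split() if match else []


line = (
    '{"level":"error","msg":"Reconciler error","error":"expect exactly one securityGroup '
    "tagged with kubernetes.io/cluster/imageinc-next-eks-4KN4v6EX for eni "
    'eni-0c5555fb9a87e93ad, got: [sg-04b2754f1c85ac8b9 sg-07b026b037dd4d6a4]"}'
)
print(conflicting_security_groups(line))
```

Feeding it the output of the kubectl logs command above, one line at a time, lists the security groups that carry the conflicting cluster tag.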
I faced the same issue when creating the cluster with Terraform. Updating the AWS Load Balancer Controller from 2.3 to 2.4.4 solved it.

Assign roles to EKS cluster in manifest file?

I'm new to Kubernetes, and am playing with eksctl to create an EKS cluster in AWS. Here's my simple manifest file
kind: ClusterConfig
apiVersion: eksctl.io/v1alpha5
metadata:
  name: sandbox
  region: us-east-1
  version: "1.18"
managedNodeGroups:
  - name: ng-sandbox
    instanceType: r5a.xlarge
    privateNetworking: true
    desiredCapacity: 2
    minSize: 1
    maxSize: 4
    ssh:
      allow: true
      publicKeyName: my-ssh-key
fargateProfiles:
  - name: fp-default
    selectors:
      # All workloads in the "default" Kubernetes namespace will be
      # scheduled onto Fargate:
      - namespace: default
      # All workloads in the "kube-system" Kubernetes namespace will be
      # scheduled onto Fargate:
      - namespace: kube-system
  - name: fp-sandbox
    selectors:
      # All workloads in the "sandbox" Kubernetes namespace matching the
      # following label selectors will be scheduled onto Fargate:
      - namespace: sandbox
        labels:
          env: sandbox
          checks: passed
I created 2 roles: EKSClusterRole for cluster management and EKSWorkerRole for the worker nodes. Where do I use them in the file? I'm looking at the eksctl Config file schema page and it's not clear to me where in the manifest file to use them.
As you mentioned, it's in the managedNodeGroups docs
managedNodeGroups:
  - ...
    iam:
      instanceRoleARN: my-role-arn
      # or
      # instanceRoleName: my-role-name
You should also read about
Creating a cluster with Fargate support using a config file
AWS Fargate

DigitalOcean pod has unbound immediate PersistentVolumeClaims

I am trying to run a Redis cluster in Kubernetes in DigitalOcean.
As a poc, I simply tried running an example I found online (https://github.com/sanderploegsma/redis-cluster/blob/master/redis-cluster.yml), which is able to spin up the pods appropriately when running locally using minikube.
However, when running it on Digital Ocean, I always get the following error:
Warning FailedScheduling 3s (x8 over 17s) default-scheduler pod has unbound immediate PersistentVolumeClaims (repeated 4 times)
Given that I am not changing anything, I am not sure why this would not work. Does anyone have any suggestions?
EDIT: some additional info
$ kubectl describe pvc
Name: data-redis-cluster-0
Namespace: default
StorageClass:
Status: Pending
Volume:
Labels: app=redis-cluster
Annotations: <none>
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
Events:
  Type    Reason         Age                     From                         Message
  ----    ------         ----                    ----                         -------
  Normal  FailedBinding  3m19s (x3420 over 14h)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set
Mounted By: <none>
EDIT: setting the default storage class partially resolved the problem!
However, the node is now not able to find available volumes to bind:
kubectl describe pvc:
Name: data-redis-cluster-0
Namespace: default
StorageClass: local-storage
Status: Pending
Volume:
Labels: app=redis-cluster
Annotations: <none>
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
Events:
  Type    Reason                Age                     From                         Message
  ----    ------                ----                    ----                         -------
  Normal  WaitForFirstConsumer  12m (x9 over 13m)       persistentvolume-controller  waiting for first consumer to be created before binding
  Normal  WaitForFirstConsumer  3m19s (x26 over 9m34s)  persistentvolume-controller  waiting for first consumer to be created before binding
kubectl describe pod redis-cluster-0
....
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  16m (x25 over 17m)  default-scheduler  0/5 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 4 node(s) didn't find available persistent volumes to bind.
kubectl describe sc
Name: local-storage
IsDefaultClass: Yes
Annotations: storageclass.kubernetes.io/is-default-class=true
Provisioner: kubernetes.io/no-provisioner
Parameters: <none>
AllowVolumeExpansion: <unset>
MountOptions: <none>
ReclaimPolicy: Delete
VolumeBindingMode: WaitForFirstConsumer
Events: <none>
kubernetes manager pod logs:
I1028 15:30:56.154131 1 event.go:221] Event(v1.ObjectReference{Kind:"StatefulSet", Namespace:"default", Name:"redis-cluster", UID:"7528483e-dac6-11e8-871f-2e55450d570e", APIVersion:"apps/v1", ResourceVersion:"2588806", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' create Claim data-redis-cluster-0 Pod redis-cluster-0 in StatefulSet redis-cluster success
I1028 15:30:56.166649 1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"data-redis-cluster-0", UID:"76746506-dac6-11e8-871f-2e55450d570e", APIVersion:"v1", ResourceVersion:"2588816", FieldPath:""}): type: 'Normal' reason: 'WaitForFirstConsumer' waiting for first consumer to be created before binding
I1028 15:30:56.220464 1 event.go:221] Event(v1.ObjectReference{Kind:"StatefulSet", Namespace:"default", Name:"redis-cluster", UID:"7528483e-dac6-11e8-871f-2e55450d570e", APIVersion:"apps/v1", ResourceVersion:"2588806", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' create Pod redis-cluster-0 in StatefulSet redis-cluster successful
I1028 15:30:57.004631 1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"data-redis-cluster-0", UID:"76746506-dac6-11e8-871f-2e55450d570e", APIVersion:"v1", ResourceVersion:"2588825", FieldPath:""}): type: 'Normal' reason: 'WaitForFirstConsumer' waiting for first consumer to be created before binding
This:
no storage class is set
And an empty output for kubectl describe sc means that there's no storage class.
I recommend installing the CSI-driver for Digital Ocean. That will create a do-block-storage class using the Kubernetes CSI interface.
Another option is to use local storage. Using a local storage class:
$ cat <<EOF | kubectl apply -f -
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
EOF
Then for either case you may need to set it as a default storage class if you don't specify storageClassName in your PVC:
$ kubectl patch storageclass local-storage -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
or
$ kubectl patch storageclass do-block-storage -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
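If you go the local-storage route, keep in mind that kubernetes.io/no-provisioner does not create volumes dynamically, which matches the "didn't find available persistent volumes to bind" event in the question: each PersistentVolume has to be created by hand and pinned to a node. The sketch below builds such a manifest as JSON, which kubectl apply -f - also accepts. The PV name, node name, path and capacity are hypothetical and need to match your nodes and the size and access mode your PVC requests.

```python
import json


def local_pv_manifest(name, node_name, path, capacity, storage_class="local-storage"):
    """Build a local PersistentVolume pinned to one node (all arguments are caller-supplied)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": capacity},
            "accessModes": ["ReadWriteOnce"],
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": storage_class,
            "local": {"path": path},  # the directory must already exist on the node
            "nodeAffinity": {
                "required": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {
                                    "key": "kubernetes.io/hostname",
                                    "operator": "In",
                                    "values": [node_name],
                                }
                            ]
                        }
                    ]
                }
            },
        },
    }


# Hypothetical values for illustration only.
manifest = local_pv_manifest("redis-pv-0", "worker-node-1", "/mnt/disks/redis-0", "1Gi")
print(json.dumps(manifest, indent=2))
```

Saved as, say, make_pv.py (a made-up file name), the output can be piped straight into kubectl apply -f -, once per replica of the StatefulSet.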
It is a StatefulSet that uses PersistentVolumeClaims.
You need to configure a default StorageClass in your cluster so that the PersistentVolumeClaim can obtain its storage from it.
In minikube one is already available, so it succeeds without error:
C02W84XMHTD5:ucp iahmad$ kubectl get sc --all-namespaces
NAME                 PROVISIONER                AGE
standard (default)   k8s.io/minikube-hostpath   7d