Readiness probe failed: Client.Timeout exceeded while awaiting headers

Date: 2021-09-09

Pods restart frequently, causing periodic timeout errors

After you complete your installation, you might encounter an issue that causes some pods to become not ready every few minutes. This issue can also make it difficult to log in.

Symptoms

One or more pods experience multiple restarts that result in the pod or pods being frequently in a not ready state. In addition, attempts to log in result in periodic 502 Bad Gateway or 504 Gateway Timeout errors.

You can view the events for a pod that is frequently restarting by running the following commands:

kubectl describe pod <pod name> -n <namespace-name>

Your output can include the following error message:

Readiness probe failed: Get http://<host>:<port>/readinessProbe: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

kubectl logs <pod name> -n <namespace-name>

The logs might show errors similar to the following sample log messages:

[2020-01-23T19:59:23.036] [ERROR] [mcm-ui] [status] GET /readinessProbe  500
[2020-01-23T19:59:29.064] [ERROR] [mcm-ui] [status] GET /readinessProbe  500
[2020-01-23T19:59:38.087] [ERROR] [mcm-ui] [status] GET /readinessProbe  500
[2020-01-23T20:01:53.096] [ERROR] [mcm-ui] [status] GET /readinessProbe  500
[2020-01-23T20:01:59.111] [ERROR] [mcm-ui] [status] GET /readinessProbe  500
[2020-01-23T20:02:08.137] [ERROR] [mcm-ui] [status] GET /readinessProbe  500
[2020-01-23T20:11:39.951] [ERROR] [mcm-ui] [status] GET /readinessProbe  500
[2020-01-23T20:11:50.184] [ERROR] [mcm-ui] [status] GET /readinessProbe  500
[2020-01-23T20:11:59.207] [ERROR] [mcm-ui] [status] GET /readinessProbe  500
[2020-01-23T20:12:08.232] [ERROR] [mcm-ui] [status] GET /readinessProbe  500
[2020-01-23T20:13:53.051] [ERROR] [mcm-ui] [status] GET /readinessProbe  500
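To gauge how often the probe is failing, you can count the failure entries in the saved logs. The following is a minimal sketch that uses an inline sample of the messages above; in practice you would redirect the `kubectl logs` output to a file first:

```shell
# Count readiness-probe failures in saved pod logs.
# In practice: kubectl logs <pod name> -n <namespace-name> > pod.log
cat > pod.log <<'EOF'
[2020-01-23T19:59:23.036] [ERROR] [mcm-ui] [status] GET /readinessProbe  500
[2020-01-23T19:59:29.064] [ERROR] [mcm-ui] [status] GET /readinessProbe  500
[2020-01-23T20:01:53.096] [ERROR] [mcm-ui] [status] GET /readinessProbe  500
EOF
grep -c 'GET /readinessProbe  500' pod.log
```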

Causes

This issue can occur when a pod's readiness probes fail frequently. While the pod is not ready, you might not be able to log in or use the console.
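The timing follows from the probe settings: the kubelet marks a pod not ready after failureThreshold consecutive probe failures, spaced periodSeconds apart. As a rough sketch using the default values shown in the workaround below (failureThreshold: 3, periodSeconds: 10):

```shell
# Estimate how long until the kubelet marks the pod not ready once the
# readiness endpoint starts failing (assumes consecutive failures).
failureThreshold=3
periodSeconds=10
echo "not ready after ~$((failureThreshold * periodSeconds)) seconds"
```

With the defaults, that is roughly 30 seconds of failing probes before the pod drops out of service, which matches the periodic not-ready behavior described in the symptoms.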

Resolving the problem

To reduce the frequency of these timeout errors, you can either apply a DNS config patch or configure a workaround.

Apply DNS config patch

To help address the periodic timeout errors, you can apply an OpenShift DNS config patch for IBM Cloud Pak foundational services clusters. This patch addresses an issue in which requests from pods to services are delayed by up to 5 seconds; normal responses typically take only milliseconds.
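To check whether a pod is affected, one common diagnostic (a sketch, not part of the official patch instructions; it assumes the container image includes a shell and `nslookup`) is to time DNS lookups from inside a pod and watch for the roughly 5-second outliers:

```shell
# Time a service lookup from inside a pod; repeated runs that intermittently
# take ~5 seconds longer than usual suggest the DNS delay this patch targets.
kubectl exec -n <namespace-name> <pod name> -- sh -c 'time nslookup kubernetes.default.svc.cluster.local'
```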

This patch is available as an interim fix for IBM Cloud Pak foundational services. For more information about the patch, and to obtain it, go to IBM® Fix Central, which contains fixes and updates for IBM products.

To directly access the interim fix, see CS=3.2.4-fix-37137.

To apply the patch, follow the README instructions that are included with the interim fix.

Workaround

To work around the issue, manually adjust all liveness and readiness probes in the affected daemonset. The following steps use the auth-idp daemonset as an example.

  1. Install kubectl. See Installing the Kubernetes CLI (kubectl).

  2. Edit the auth-idp daemonset.

    kubectl edit ds auth-idp -n kube-system
    
  3. Locate the settings for each liveness and readiness probe for all containers in the daemonset. For example, the following section shows the readinessProbe settings for the platform-auth-service container:

         name: platform-auth-service
         ports:
         - containerPort: 8443
           hostPort: 8443
           name: http
           protocol: TCP
         readinessProbe:
           failureThreshold: 3
           httpGet:
             path: /
             port: 8443
             scheme: HTTPS
           periodSeconds: 10
           successThreshold: 1
           timeoutSeconds: 1
         resources:
           limits:
             cpu: "1"
             memory: 1Gi
           requests:
             cpu: 100m
             memory: 256Mi
    
  4. For each probe in all containers in the daemonset, increase the initialDelaySeconds, periodSeconds, and timeoutSeconds settings. If the initialDelaySeconds setting is missing, add it. The following example shows the placement of these settings for a readiness probe:

     readinessProbe:
       failureThreshold: 3
       httpGet:
         path: /
         port: 8443
         scheme: HTTPS
       initialDelaySeconds: 420
       periodSeconds: 30
       successThreshold: 1
       timeoutSeconds: 10
    
  5. Save the file and wait until all the auth-idp pods restart. The pods might take a few minutes to restart.
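As an alternative to editing the daemonset interactively, the same change can be applied with `kubectl patch`. The following is a sketch, not part of the original instructions: the file name is hypothetical, and a strategic merge patch matches containers by the `name` field, so repeat the entry for each container in the daemonset.

```yaml
# patch-probes.yaml (hypothetical file name)
# Strategic merge patch that raises the readiness probe timings for the
# platform-auth-service container shown in step 3.
spec:
  template:
    spec:
      containers:
      - name: platform-auth-service
        readinessProbe:
          initialDelaySeconds: 420
          periodSeconds: 30
          timeoutSeconds: 10
```

Apply it with `kubectl patch ds auth-idp -n kube-system --patch-file patch-probes.yaml`, then wait for the pods to restart as in step 5.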

Original source: https://www.cnblogs.com/cheyunhua/p/15246305.html