Kubernetes: Troubleshooting First Smells
Starting with application failure inspection, moving on to master node (control plane) inspection, and finishing with identifying worker node issues and possibly resuming services
$ alias k=kubectl (to avoid typing "kubectl" every time)
Application Failure
> Check availability of the web service at its node port
> Verify the endpoint
$ k describe service web-service
>> Compare the Endpoints in the description with the node port checked above
>> Compare the Selector values with the labels on the underlying pod
$ k get pod
> Check whether the pod status is Running
$ k describe pod pod-name
> Check events related to pod
> Check the pod logs
$ k logs pod-name (optional: -f to follow, --previous to see logs of the previous container instance)
From here, repeat the steps above for the underlying services & pods further down the chain. Further details: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/
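The endpoint check above can be scripted. A minimal sketch, assuming the `kubectl describe service` output format shown below (the sample values are illustrative, not from a real cluster); an empty or `<none>` Endpoints field usually means the service selector matches no pod labels:

```shell
# Sample `kubectl describe service` output (illustrative values)
describe_output='Name:       web-service
Type:       NodePort
Selector:   app=web
Endpoints:  10.244.1.5:8080
NodePort:   <unset>  30080/TCP'

# Pull out the Endpoints field; "<none>" or empty means no pod matched the selector
endpoints=$(printf '%s\n' "$describe_output" | awk '/^Endpoints:/ {print $2}')
if [ -z "$endpoints" ] || [ "$endpoints" = "<none>" ]; then
  echo "No endpoints: the service selector probably matches no pod labels"
else
  echo "Endpoints: $endpoints"
fi
```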
Control Plane (Master Node) Failure
(Here we are basically dealing with static pods)
> Check the status of the cluster: are all nodes healthy?
$ k get nodes
> Check the status of pods
$ k get pods
> Check that all pods are running in the kube-system namespace (incl. coredns, etcd, kube-apiserver, kube-controller-manager, kube-proxy, kube-scheduler)
$ k get pods -n kube-system
(NOTE: Each dysfunctional component among above might require separate configuration checks & debugging not scoped in this article)
Alternatively, check these 3 services on the master nodes
$ service kube-apiserver status
$ service kube-controller-manager status
$ service kube-scheduler status
And check these 2 on Worker Nodes
$ service kubelet status
$ service kube-proxy status
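On clusters where the control plane runs as OS services rather than static pods, the checks above can be wrapped in a loop. A sketch using `systemctl is-active`, which prints the unit state; the service names are the ones listed above:

```shell
# Report the state of each control-plane service; prints "unknown"
# when systemd has no such unit (e.g. on a kubeadm-managed cluster)
for svc in kube-apiserver kube-controller-manager kube-scheduler; do
  state=$(systemctl is-active "$svc" 2>/dev/null || true)
  echo "$svc: ${state:-unknown}"
done
```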
> Check Logs of Control Plane Components
$ k logs kube-apiserver-master -n kube-system , OR
$ journalctl -u kube-apiserver
Further details: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/
Worker Node Failure
> Start by checking status of nodes in cluster
$ k get nodes
> The nodes above should all be "Ready"; if any is "NotReady", check the details of that node with the following steps
$ k describe node node-name
> Each node has a set of conditions (viz. OutOfDisk, MemoryPressure, DiskPressure, PIDPressure, Ready) reflecting its status.
> If the status is Unknown, check LastHeartbeatTime to see when the node went down, then try to bring the node back up
$ ssh node-name
$ top (check the CPU situation)
$ df -h (check the disk-space situation)
$ /etc/init.d/kubelet restart
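The NotReady check at the start of this section can be automated by parsing `kubectl get nodes` output. A sketch over an illustrative sample (node names and versions are made up):

```shell
# Sample `kubectl get nodes` output (illustrative)
nodes_output='NAME     STATUS     ROLES    AGE   VERSION
master   Ready      master   10d   v1.19.0
node01   NotReady   <none>   10d   v1.19.0'

# Skip the header row and print any node whose STATUS column is not "Ready"
printf '%s\n' "$nodes_output" | awk 'NR>1 && $2 != "Ready" {print $1 " is " $2}'
```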
> Check Kubelet status and Kubelet logs for possible issues
$ service kubelet status
$ journalctl -u kubelet
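When scanning kubelet logs, klog error lines start with `E` followed by date digits, which makes them easy to filter. A sketch over illustrative sample lines (not real log output):

```shell
# Sample journalctl-style kubelet log lines (illustrative)
logs='Mar 01 10:00:01 node01 kubelet[800]: I0301 10:00:01.000001 starting kubelet
Mar 01 10:00:02 node01 kubelet[800]: E0301 10:00:02.000002 failed to load Kubelet config file'

# klog severity prefix: Innnn = info, Ennnn = error; keep only error lines
printf '%s\n' "$logs" | grep ' E[0-9]'
```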
> Check Kubelet Certificates
$ openssl x509 -in /var/lib/kubelet/NODENAME.crt -text
> Check certificate “Issuer” and “Validity”
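The two fields worth checking can be printed directly with `openssl x509` flags instead of scrolling through the full `-text` dump. A sketch with a hypothetical helper, `cert_summary`:

```shell
# cert_summary is a hypothetical helper: prints only the issuer and the
# validity window (notBefore/notAfter) of the certificate at the given path
cert_summary() {
  openssl x509 -in "$1" -noout -issuer -dates
}
# Usage: cert_summary /var/lib/kubelet/NODENAME.crt
```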
Pod Failure
> Get Pod description
$ k describe pod mypod
> Check the events, check the container command for typos, and check that the correct image is used
> Extract Pod definition to make changes
$ k get pod mypod -o yaml > mypoddef.yaml
> Apply changes and redeploy pod
$ k delete pod mypod
$ k apply -f mypoddef.yaml
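A typo in the image name is a common cause here. One way to fix the extracted definition before reapplying it, sketched with an illustrative file in /tmp and made-up image names:

```shell
# Illustrative extracted definition with a typo in the image tag
printf 'image: nginx:latst\n' > /tmp/mypoddef_example.yaml

# Correct the image reference in place, then the file can be re-applied
sed -i 's|image: nginx:latst|image: nginx:latest|' /tmp/mypoddef_example.yaml
cat /tmp/mypoddef_example.yaml
```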