Kubernetes: Troubleshooting First Smells

Amandeep Midha
Jan 31, 2021

Starting from application failure inspection, moving on to master node inspection, and finally identifying worker node issues and restoring services

$ alias k=kubectl ( to avoid typing ‘kubectl’ every time )
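
The alias lasts only for the current shell session; assuming bash, you can persist it in your profile:

$ echo 'alias k=kubectl' >> ~/.bashrc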

Application Failure

> Check the availability of the web service at its node port

$ curl http://web-service-ip:node-port
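
For example, assuming the service is exposed on NodePort 30080 and a node is reachable at 192.168.1.10 ( both values hypothetical; NodePorts fall in the 30000-32767 range by default ):

$ curl http://192.168.1.10:30080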

> Verify endpoint

$ k describe service web-service

>> Compare the Endpoints in the description with the node port checked above
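
You can also list the endpoints directly:

$ k get endpoints web-service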

>> Compare the Service’s Selector values with the labels on the underlying pods ( see the label check below )

$ k get pod

> Check that the pod status is Running
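
To compare the selector and labels directly ( app=web is a hypothetical selector; substitute your Service’s actual one ):

$ k describe service web-service | grep -i selector

$ k get pods --selector app=web --show-labels ( should list the pods backing the service )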

$ k describe pod pod-name

> Check events related to pod

$ k logs pod-name ( optional: -f to follow, --previous to see logs from the previous container instance )

> Check the pod logs for application errors
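
For multi-container pods, name the container explicitly ( app-container is a hypothetical container name ):

$ k logs pod-name -c app-container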

From here, work your way down the chain, repeating the steps above for each dependent service and pod. Further details: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/

Control Plane (Master Node) Failure

(Here we are basically dealing with static pods, i.e. control plane components that the kubelet runs directly from manifest files)

> Check the status of the cluster to see whether all nodes are healthy

$ k get nodes

> Check status of pods

$ k get pods

> Check that all are Running
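
To scan every namespace at once:

$ k get pods -A ( -A is shorthand for --all-namespaces )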

$ k get pods -n kube-system

> Check that all pods are running in the kube-system namespace ( including coredns, etcd, kube-apiserver, kube-controller-manager, kube-proxy, and kube-scheduler )

(NOTE: Each dysfunctional component among the above may require separate configuration checks and debugging, which are beyond the scope of this article)
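
If your cluster still serves it ( the API is deprecated since v1.19 ), you can also get a one-line health summary of the scheduler, controller manager, and etcd:

$ k get componentstatuses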

Alternatively, if the control plane components run as system services rather than static pods, check these three on the master nodes

$ service kube-apiserver status

$ service kube-controller-manager status

$ service kube-scheduler status

And check these two on the worker nodes

$ service kubelet status

$ service kube-proxy status
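
On systemd-based hosts the systemctl equivalents work too ( note: on kubeadm clusters kube-proxy usually runs as a DaemonSet, not a service ):

$ systemctl status kubelet

$ k get ds kube-proxy -n kube-system ( if kube-proxy runs as a DaemonSet )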

> Check Logs of Control Plane Components

$ k logs kube-apiserver-master -n kube-system ( if it runs as a static pod ), or

$ journalctl -u kube-apiserver ( if it runs as a service )
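
If the API server itself is down, kubectl won’t respond; in that case inspect the container runtime directly ( the commands below assume Docker as the runtime; container-id is a placeholder for whatever docker ps prints ):

$ docker ps -a | grep kube-apiserver

$ docker logs container-id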

Further details: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/

Worker Node Failure

> Start by checking status of nodes in cluster

$ k get nodes

> All the nodes should be “Ready”; if one shows “NotReady”, check that node’s details with the following steps

$ k describe node node-name

> Each node has a set of conditions ( viz. OutOfDisk, MemoryPressure, DiskPressure, PIDPressure, Ready ) reflecting its status.
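
To print only the conditions, a jsonpath query like this works ( shown for illustration; the describe output carries the same data ):

$ k get node node-name -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'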

> If a condition’s status is Unknown, the node may have stopped reporting to the control plane; check LastHeartbeatTime to see when it went down, then SSH into the node and try to bring it back up

$ ssh node-name

$ top ( check CPU and memory usage )

$ df -h ( check disk space )

$ /etc/init.d/kubelet restart ( or, on systemd hosts: systemctl restart kubelet )

> Check Kubelet status and Kubelet logs for possible issues

$ service kubelet status

$ journalctl -u kubelet
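
journalctl also takes time filters, which helps correlate kubelet logs with the LastHeartbeatTime seen above:

$ journalctl -u kubelet --since "1 hour ago"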

> Check Kubelet Certificates

$ openssl x509 -in /var/lib/kubelet/NODENAME.crt -text -noout ( path may vary; kubeadm clusters keep kubelet certs under /var/lib/kubelet/pki/ )

> Check the certificate’s “Issuer” ( it should be the cluster CA ) and “Validity” ( it should not be expired )
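
To check just the expiry, openssl can print the end date alone; recent kubeadm versions also have a built-in check:

$ openssl x509 -in /var/lib/kubelet/NODENAME.crt -noout -enddate

$ kubeadm certs check-expiration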

Pod Failure

> Get Pod description

$ k describe pod mypod

> Check the Events, verify the container command has no typos, and confirm the image name is correct
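
To print just the image(s) a pod is using:

$ k get pod mypod -o jsonpath='{.spec.containers[*].image}'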

> Extract Pod definition to make changes

$ k get pod mypod -o yaml > mypoddef.yaml

> Make the changes in the file, then delete and re-create the pod

$ k delete pod mypod

$ k apply -f mypoddef.yaml
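
Alternatively, kubectl can do the delete-and-recreate in one step:

$ k replace --force -f mypoddef.yaml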
