Troubleshooting AWS Elastic Kubernetes Service (EKS)

January 13, 2020

Overview

Kubernetes has become the de facto standard for container orchestration. It is widely deployed in organizations of all sizes, especially those that are moving on-premises workloads to the cloud and to a microservices-based architecture. While raw Kubernetes is not easy to deploy and manage, cloud service providers such as AWS, Azure and IBM Bluemix provide managed services that significantly ease the adoption of this technology. Specifically, AWS offers Elastic Kubernetes Service (EKS), which is integrated with a variety of other AWS services including compute, networking, and security.

Given its complexity, Kubernetes errors can be hard to diagnose and troubleshoot, even with such managed services. This document describes some common errors with AWS EKS deployment and techniques to troubleshoot them. It does not cover errors associated with deploying containerized applications into Kubernetes; it focuses only on errors related to the infrastructure.

A Very Brief Introduction to EKS

EKS consists of two subsystems: a control plane that is fully managed by AWS, and worker nodes that are provisioned by the customer as needed. The control plane runs Kubernetes components such as etcd (which acts as a backing store for cluster data) and the API server (which allows worker nodes and command line tools to communicate with the control plane). Worker nodes are EC2 instances provisioned using an Auto Scaling group, which allows the customer to decide how much capacity and elasticity is required.

Common Errors and Troubleshooting Tips

This section describes a few common error situations that you may encounter with EKS. Each of these errors has an underlying cause which can be recognized by the symptoms and error messages found in a variety of log files.

Invalid EC2 AMI

Worker nodes (i.e. EC2 instances) register themselves with the control plane on startup. To do this, the instance requires a number of additional packages and configurations. AWS provides AMIs for EKS that include these prerequisites. (AWS also provides the source code required for building custom AMIs.) EC2 instances that lack the required packages, or include an incompatible version of them, result in a node status of NotReady. Comparing the version of the AMI with the version of your EKS cluster shows whether they are compatible. In the example below, the nodes report version v1.14.7 while the cluster's server version is 1.11, so the AMI is incompatible with the cluster.


$ kubectl get nodes
NAME                      STATUS   ROLES  AGE VERSION
ip-10-0-0-62.ec2.internal NotReady <none> 11m v1.14.7-eks-1861c5
ip-10-0-0-95.ec2.internal NotReady <none> 10m v1.14.7-eks-1861c5

Server version: 1.11

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-04T04:48:03Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.10-eks-7f15cc", GitCommit:"7f15ccb4e58f112866f7ddcfebf563f199558488", GitTreeState:"clean", BuildDate:"2019-08-19T17:46:02Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
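
The same comparison can be scripted. The following is a minimal sketch rather than an official compatibility check: it assumes version strings in the form shown above and applies only the upstream Kubernetes rule that a kubelet must not be newer than the API server. How many minor versions older a kubelet may be depends on the Kubernetes release, so that check is left out. The function names are hypothetical.

```python
import re


def major_minor(git_version: str) -> tuple:
    """Extract (major, minor) from a string such as 'v1.14.7-eks-1861c5'."""
    match = re.match(r"v?(\d+)\.(\d+)", git_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {git_version!r}")
    return int(match.group(1)), int(match.group(2))


def kubelet_newer_than_server(node_version: str, server_version: str) -> bool:
    """Return True when the node's kubelet is newer than the API server."""
    return major_minor(node_version) > major_minor(server_version)


# Values taken from the kubectl listings above.
print(kubelet_newer_than_server("v1.14.7-eks-1861c5", "v1.11.10-eks-7f15cc"))  # prints True
```

A result of True points to an AMI built for a newer Kubernetes version than the cluster runs.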

Incorrect Security Group Configuration

Security Groups must be correctly configured in order for worker nodes and the control plane to communicate with each other. When these security groups are incorrectly configured, Kubernetes will not be able to register worker nodes. Specifically, the following rules must be configured:

  • Inbound HTTPS on port 443 to the control plane (i.e. to the EKS cluster). This is required for worker nodes and the kubectl CLI to communicate with the control plane.
  • Inbound TCP on port 1025+ to the worker nodes from the control plane.
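
The two rules above can be checked against exported security group data. The sketch below is illustrative only: it assumes inbound rules shaped like the IpPermissions entries returned by the EC2 DescribeSecurityGroups API, it only considers rules whose source is another security group (not CIDR ranges, which kubectl access from outside the VPC would typically rely on), and the group IDs sg-controlplane and sg-workers are made up.

```python
def allows_tcp(permissions, port_from, port_to, source_group_id):
    """Return True if a rule admits TCP port_from..port_to from source_group_id."""
    for rule in permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            covers_ports = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            low, high = rule.get("FromPort"), rule.get("ToPort")
            covers_ports = (
                low is not None and high is not None
                and low <= port_from and port_to <= high
            )
        else:
            continue
        sources = {pair["GroupId"] for pair in rule.get("UserIdGroupPairs", [])}
        if covers_ports and source_group_id in sources:
            return True
    return False


# Hypothetical inbound rules for the two security groups.
control_plane_inbound = [
    {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
     "UserIdGroupPairs": [{"GroupId": "sg-workers"}]},
]
worker_inbound = [
    {"IpProtocol": "tcp", "FromPort": 1025, "ToPort": 65535,
     "UserIdGroupPairs": [{"GroupId": "sg-controlplane"}]},
]

print(allows_tcp(control_plane_inbound, 443, 443, "sg-workers"))       # prints True
print(allows_tcp(worker_inbound, 1025, 65535, "sg-controlplane"))      # prints True
```

If either call returns False for your own rule data, that rule is a likely cause of nodes failing to register.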

This error manifests itself as a network error and can be observed in several ways. In the following example, pods in the kube-system namespace show an error or Pending status:

NAME                       READY   STATUS             RESTARTS   AGE
aws-node-bbwpq             0/1     CrashLoopBackOff   12         51m
aws-node-nw7v8             0/1     CrashLoopBackOff   12         51m
coredns-7bcbfc4774-g8sz7   0/1     Pending            0          54m
coredns-7bcbfc4774-qnrcw   0/1     Pending            0          54m
kube-proxy-dnhr6           1/1     Running            0          51m
kube-proxy-j5gps           1/1     Running            0          51m

This error can also be diagnosed by examining the log files on the worker node (in /var/log), assuming you have SSH access to the nodes. To see the error message, view the log file corresponding to the aws-node container as shown in the example below. The error message in this example is “Failed to communicate with K8S Server”.

{"log":"====== Installing AWS-CNI ======\n","stream":"stdout","time":"2019-10-03T18:42:42.807515657Z"}
{"log":"====== Starting amazon-k8s-agent ======\n","stream":"stdout","time":"2019-10-03T18:42:42.821888604Z"}
{"log":"ERROR: logging before flag.Parse: E1003 18:43:12.854890 9 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout)\n","stream":"stderr","time":"2019-10-03T18:43:12.855417379Z"}
{"log":"Failed to communicate with K8S Server. Please check instance security groups or http proxy setting","stream":"stdout","time":"2019-10-03T18:43:42.907376066Z"}
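
Because each line of this log file is a JSON object with log, stream and time fields, the interesting entries can be filtered out programmatically. The sketch below is one possible approach, not a standard tool: the function name and the choice to flag stderr entries and messages containing "failed" are assumptions made for illustration.

```python
import json


def error_entries(lines):
    """Yield (time, message) for entries on stderr or mentioning a failure."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        message = entry.get("log", "").strip()
        if entry.get("stream") == "stderr" or "failed" in message.lower():
            yield entry.get("time"), message


# Two of the log lines shown above, kept as raw strings so "\n" stays JSON-escaped.
sample = [
    r'{"log":"====== Starting amazon-k8s-agent ======\n","stream":"stdout","time":"2019-10-03T18:42:42.821888604Z"}',
    r'{"log":"Failed to communicate with K8S Server. Please check instance security groups or http proxy setting","stream":"stdout","time":"2019-10-03T18:43:42.907376066Z"}',
]

for timestamp, message in error_entries(sample):
    print(timestamp, message)
```

On a worker node, the same function can be fed an open file object for the aws-node container log instead of the sample list.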

You can confirm this error by trying to connect to the Kubernetes server from the EC2 node using curl, as shown below (replace the IP address with the cluster IP of your EKS cluster). The command will fail or time out if an incorrect security group configuration blocks network connectivity.

curl -vk https://172.20.0.1:443/api

Missing or Incorrectly Configured Gateway

As mentioned previously, worker nodes require a number of additional Linux packages to be installed at startup in order to communicate with the control plane. This is accomplished via standard Linux package managers and package repositories. An Internet Gateway (or NAT Gateway) must be attached to the VPC to enable EC2 instances to communicate with package repositories. If this gateway is either missing or incorrectly configured, worker nodes will not bootstrap correctly, and the cluster will not recognize them. This error can be seen in the system log of the EC2 instance as shown below:

[   44.600544] cloud-init[3835]: and yum doesn't have enough cached data to continue. At this point the only
[   44.612407] cloud-init[3835]: safe thing yum can do is fail. There are a few ways to work "fix" this:
[   44.622313] cloud-init[3835]: 1. Contact the upstream for the repository and get them to fix the problem.
[   44.628873] cloud-init[3835]: 2. Reconfigure the baseurl/etc. for the repository, to point to a working

This error can also be caused by incorrectly configured outbound rules in the security group associated with worker node EC2 instances. If the security group does not allow outbound access, the instance will not be able to communicate with the package repository to install required packages.
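
One way to check the gateway side of this problem is to look for a usable default route in the route table associated with the worker node subnet. The sketch below assumes route entries shaped like the Routes list returned by the EC2 DescribeRouteTables API; the function name and the gateway IDs in the sample are hypothetical. It does not check security group outbound rules.

```python
def has_default_route(routes):
    """Return True if an active 0.0.0.0/0 route targets an Internet or NAT gateway."""
    for route in routes:
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State") == "blackhole":
            continue  # the target no longer exists, so the route is unusable
        target = route.get("GatewayId") or route.get("NatGatewayId") or ""
        if target.startswith(("igw-", "nat-")):
            return True
    return False


# Hypothetical route tables: one with only the local VPC route, one with a NAT gateway.
local_only = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
]
with_nat = local_only + [
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0example", "State": "active"},
]

print(has_default_route(local_only))  # prints False
print(has_default_route(with_nat))    # prints True
```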

Insufficient AWS Permissions

Worker nodes require a few AWS IAM permissions in order to access required resources during startup. One such permission is ecr:GetAuthorizationToken. If the instance profile attached to the worker node EC2 instances does not have this permission, the nodes will not be able to download the Docker container images required to run Kubernetes. In such situations, an error message similar to the listing below may be seen in /var/log/messages on the EC2 instances.

Oct  4 17:48:57 ip-10-0-0-9 kubelet: status code: 400, request id: 81a6c1bc-977d-48c7-9032-fdea26b8e7bd
Oct  4 17:48:57 ip-10-0-0-9 dockerd: time="2019-10-04T17:48:57.474787082Z" level=info msg="Attempting next endpoint for pull after error: Get https://602401143452.dkr.ecr.us-east-1.amazonaws.com/v2/eks/pause-amd64/manifests/3.1: no basic auth credentials"
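
If you have the policy documents attached to the node role at hand, a quick check for the missing permission might look like the sketch below. It is a deliberate simplification, not a full IAM evaluation: it only looks at Allow statements and their Action patterns, and ignores Deny statements, NotAction, Resource and Condition. The function name and the sample policy are hypothetical.

```python
from fnmatch import fnmatchcase


def policy_allows(policy_document, action):
    """Return True if any Allow statement's Action pattern matches the action."""
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]  # a single statement may appear without a list
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        patterns = statement.get("Action", [])
        if isinstance(patterns, str):
            patterns = [patterns]
        # IAM action names are case-insensitive and may contain * wildcards.
        if any(fnmatchcase(action.lower(), p.lower()) for p in patterns):
            return True
    return False


# Hypothetical policy that permits image pulls but omits the token permission.
node_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["ecr:BatchGetImage", "ecr:GetDownloadUrlForLayer"],
         "Resource": "*"},
    ],
}

print(policy_allows(node_policy, "ecr:GetAuthorizationToken"))  # prints False
```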

Insufficient Network Capacity

Availability of network resources must be taken into account when determining how many EC2 instances to provision as worker nodes, and how large each instance should be. Kubernetes uses CNI (Container Network Interface) to allocate network resources. The Amazon VPC CNI plugin for Kubernetes assigns a VPC IP address to each pod. As a result, the number of pods that can be deployed in the cluster is limited by the number of IP addresses available with the selected EC2 instance type. For example, the instance type t3.medium supports 3 interfaces with up to 6 IP addresses each. When the number of pods exceeds the number of available IP addresses, the excess pods remain in Pending or ContainerCreating status.
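
To estimate the per-node pod limit, the sketch below assumes the formula behind the default max-pods values shipped with the EKS AMI: one address on each interface is reserved for the interface itself, and two pods that use host networking (aws-node and kube-proxy) do not need a VPC address of their own. Treat the result as an estimate and confirm it against current AWS documentation; the function name is hypothetical.

```python
def max_pods(interfaces: int, ips_per_interface: int) -> int:
    """Estimate the pod limit for a node using the VPC CNI plugin."""
    return interfaces * (ips_per_interface - 1) + 2


# t3.medium figures from the text: 3 interfaces with 6 addresses each.
print(max_pods(3, 6))  # prints 17
```

Multiplying this figure by the number of worker nodes gives a rough upper bound on the pods the cluster can run, including system pods.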

This error can be diagnosed in two ways (a “failed to assign an IP address to container” message is seen in both cases):

a. Examining pod events using kubectl CLI (kubectl describe pod … )

Type     Reason                  Age                    From                                Message
----     ------                  ----                   ----                                -------
Warning  FailedScheduling        2m21s (x5 over 2m38s)  default-scheduler                   0/1 nodes are available: 1 Insufficient pods.
Normal   Scheduled               2m19s                  default-scheduler                   Successfully assigned istio-system/servicegraph-849c995588-n2tjg to ip-10-0-0-39.ec2.internal
Warning  FailedCreatePodSandBox  2m17s                  kubelet, ip-10-0-0-39.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "1814151dac7d051c85eb667e1c1f6abd595fda434a7bcee975558b9c46ade728" network for pod "servicegraph-849c995588-n2tjg": NetworkPlugin cni failed to set up pod "servicegraph-849c995588-n2tjg_istio-system" network: add cmd: failed to assign an IP address to container

b. Examining /var/log/messages on the EC2 instance

Oct 7 16:41:26 ip-10-0-0-61 kubelet: E1007 16:41:26.903309 4468 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to set up sandbox container "5a12123016eec119fb3c1fd6baa233344e650992cb11b23ebabc72890b4e39ca" network for pod "istio-policy-7d667689b7-xkjz6": NetworkPlugin cni failed to set up pod "istio-policy-7d667689b7-xkjz6_istio-system" network: add cmd: failed to assign an IP address to container

Troubleshooting Guidelines

Here are some general guidelines for troubleshooting and root cause analysis. These guidelines assume you have access to the kubectl CLI and, in some cases, also have access to the EC2 instances via SSH.

  • Check if all nodes show a healthy status, i.e. Ready status (kubectl get nodes). If not, examine the logs listed below to determine possible errors.
  • Check the status of all Pods in the kube-system namespace to see if they are healthy, i.e. in Running or Completed status (kubectl get pods -n kube-system). If not, examine the log files listed below.
  • If any node is not healthy, check the system log (using AWS Console) of the corresponding EC2 instance to look for errors.
  • If the EC2 instances can be accessed via SSH (either from a remote machine or a Bastion host), check whether the Docker daemon is running (ps -ef | grep dockerd). If not, check system log files to look for Docker-related error messages.
  • Examine the Linux system log file (/var/log/messages) and container log files (/var/log/containers/*) for error messages. Examples of error messages include insufficient resources, unable to contact master (i.e. control plane), etc.
  • If CloudWatch is enabled, examine the CloudWatch logs for errors (though these logs may not contain sufficient detail to diagnose the root cause of the error).
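
The first two checks in this list lend themselves to a small script. The sketch below is one way to do it, assuming the default column layout of kubectl get nodes and kubectl get pods shown earlier in this document; the function name is hypothetical, and the sample rows are taken from the listings above.

```python
def unhealthy(table_text, healthy_statuses):
    """Return (name, status) for rows whose STATUS is not in healthy_statuses."""
    rows = [line.split() for line in table_text.strip().splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    name_col, status_col = header.index("NAME"), header.index("STATUS")
    return [
        (row[name_col], row[status_col])
        for row in body
        if row[status_col] not in healthy_statuses
    ]


NODES = """
NAME                      STATUS   ROLES  AGE VERSION
ip-10-0-0-62.ec2.internal NotReady <none> 11m v1.14.7-eks-1861c5
ip-10-0-0-95.ec2.internal NotReady <none> 10m v1.14.7-eks-1861c5
"""

PODS = """
NAME                       READY   STATUS             RESTARTS   AGE
aws-node-bbwpq             0/1     CrashLoopBackOff   12         51m
coredns-7bcbfc4774-g8sz7   0/1     Pending            0          54m
kube-proxy-dnhr6           1/1     Running            0          51m
"""

print(unhealthy(NODES, {"Ready"}))
print(unhealthy(PODS, {"Running", "Completed"}))
```

In practice the table text could come from running kubectl through Python's subprocess module. Note that this simple check also flags a cordoned node, whose status reads Ready,SchedulingDisabled.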

Authored By

Sonny Werghis

Principal Architecture Consultant
