Monitoring Kubernetes

Overview

Visibility is a key characteristic of cloud native application architectures. We need to see when and where failures occur, and we need to measure them to establish a baseline against which deviations from normal behavior can be identified and addressed. As such, monitoring, feature-rich metrics, alerting tools, and data visualization frameworks are key elements of cloud native applications.

How Is Monitoring Apps on Kubernetes Different?

Containerized systems such as Kubernetes environments present new monitoring challenges as compared to virtual machine-based compute environments. These differences include the following:

  • The ephemeral nature of containers

  • An increasing density of objects, services, and metrics within a given node

  • A shift in focus from machines to services

  • A more diverse set of consumers of monitoring data

  • Changes in the software development life cycle

As monolithic apps are refactored into microservices and orchestrated with Kubernetes, requirements for monitoring those apps are changing. To start, instrumentation to capture application data needs to be at a container level, at scale, across thousands of endpoints. Because Kubernetes workloads are ephemeral by default and can start or stop at any time, application monitoring must be dynamic and aware of Kubernetes labels and namespaces. A consistent set of rules or alerts must be applied to all pods, new and old.
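
The idea of applying one consistent set of rules to all pods, new and old, can be sketched by matching rules against namespaces and labels instead of pod names. The following is a minimal illustration only; the `Pod` and `AlertRule` classes, the rule names, and the sample pods are hypothetical and are not part of any Kubernetes or monitoring API.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    namespace: str
    labels: dict = field(default_factory=dict)


@dataclass
class AlertRule:
    name: str
    namespace: str   # namespace the rule applies to
    selector: dict   # label key/value pairs that must all match

    def matches(self, pod: Pod) -> bool:
        if pod.namespace != self.namespace:
            return False
        return all(pod.labels.get(k) == v for k, v in self.selector.items())


def rules_for(pod, rules):
    """Return the names of all rules that apply to a pod."""
    return [r.name for r in rules if r.matches(pod)]


# Hypothetical rules: one per app label, one for every pod in the namespace.
rules = [
    AlertRule("high-error-rate", "shop", {"app": "checkout"}),
    AlertRule("pod-restarts", "shop", {}),  # empty selector matches all pods
]

# A replaced pod gets a new name but keeps its labels.
old_pod = Pod("checkout-7d9f-abcde", "shop", {"app": "checkout", "version": "v1"})
new_pod = Pod("checkout-5c2b-fghij", "shop", {"app": "checkout", "version": "v2"})
other_pod = Pod("cart-1", "shop", {"app": "cart"})

for pod in (old_pod, new_pod, other_pod):
    print(pod.name, rules_for(pod, rules))
```

Because the rules key on labels and namespaces, the replacement pod picks up the same alerts as the pod it replaced without any per-pod configuration.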

Observability should always be top of mind when you’re developing new apps or refactoring existing ones. Ideally, a common layer of baseline metrics applies to all apps and infrastructure, and custom metrics are added on top of it. Adding a new metric in response to user feedback should not require a major rework of your monitoring stack.

App Monitoring with Prometheus

The open-source community is converging on Prometheus as the preferred solution to the challenges of Kubernetes monitoring. Its ability to keep pace with the evolving requirements of Kubernetes, combined with a rich set of language-specific client libraries, gives Prometheus an advantage.
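
To make the role of the client libraries more concrete, the sketch below renders a counter in the Prometheus text exposition format, which is the kind of output an instrumented app serves for Prometheus to scrape. This is an illustration using only the Python standard library, not the Prometheus client library itself; the function `render_counter`, the metric name, and the label values are assumptions made for the example, and label-value escaping is deliberately left out.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to the counter value.
    Label values are assumed to need no escaping in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical request counts, broken down by labels.
samples = {
    (("code", "200"), ("method", "GET")): 3,
    (("code", "500"), ("method", "POST")): 1,
}

output = render_counter("http_requests_total", "Total HTTP requests.", samples)
print(output, end="")
```

In practice an app would normally rely on one of the client libraries to maintain and expose these values; the point here is only that each label combination becomes its own time series.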

Monitoring Resource Consumption and Preventing Infiltration

How can you protect your cloud-based Kubernetes system from hijackers and infiltrators? Here are some suggestions:

  • Monitor cluster and network utilization.

  • Monitor for suspicious activity and analyze failed login and RBAC events.

  • Monitor configurations, such as dashboard access, for risks and vulnerabilities.

The NIST document titled Security Assurance Requirements for Linux Application Container Deployments sets forth security requirements and countermeasures to help meet the recommendations of the NIST Application Container Security Guide when containerized applications are deployed in production environments. According to NIST, you should log and monitor resource consumption of containers to ensure availability of critical resources.
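
As a rough illustration of monitoring container resource consumption, the sketch below compares observed usage with configured limits and reports anything at or above a threshold. The data layout, the container names, the resource keys, and the 90 percent threshold are all assumptions chosen for the example; a real deployment would read these values from its metrics pipeline.

```python
def over_threshold(usage, limits, threshold=0.9):
    """Return (container, resource, ratio) where usage >= threshold * limit."""
    findings = []
    for container, used in usage.items():
        for resource, amount in used.items():
            limit = limits.get(container, {}).get(resource)
            if not limit:
                continue  # no limit configured: nothing to compare against
            ratio = amount / limit
            if ratio >= threshold:
                findings.append((container, resource, round(ratio, 2)))
    return findings


# Hypothetical usage and limits, keyed by container name.
usage = {
    "web": {"cpu_millicores": 475, "memory_mib": 300},
    "worker": {"cpu_millicores": 100, "memory_mib": 64},
}
limits = {
    "web": {"cpu_millicores": 500, "memory_mib": 512},
    "worker": {"cpu_millicores": 1000},  # no memory limit set
}

for finding in over_threshold(usage, limits):
    print(finding)
```

Containers without a limit are skipped rather than flagged here; whether a missing limit should itself raise an alert is a policy decision this sketch does not make.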

Security Monitoring and Auditing

Let's talk more about security monitoring. The right amount and type of security monitoring for your cluster depends largely on the amount of time and staffing you have to respond to alerts and keep an eye on things. As a general rule, you shouldn't spend time building security monitoring systems that you don't have the time to maintain and tune. Start with the real-time (alert-based) and periodic (audit review) analyst or operator workflows you want to enable, and build the monitoring platform you need to enable those workflows.

Logging

The bedrock of security monitoring is logging. You should generally capture application logs, host-level logs, Kubernetes API audit logs, and cloud provider logs (if applicable). There are well-established patterns for implementing log aggregation on common cluster configurations.
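
As one example of what Kubernetes API audit logs make possible, the sketch below counts unauthorized (HTTP 401) and forbidden (HTTP 403) responses per user, which is a simple way to start analyzing failed login and RBAC events. It assumes audit events are available as one JSON object per line with `user.username` and `responseStatus.code` fields; the sample events are made up for illustration.

```python
import json
from collections import Counter


def denied_requests(lines):
    """Count 401 and 403 responses per (username, status code)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        code = event.get("responseStatus", {}).get("code")
        if code in (401, 403):
            user = event.get("user", {}).get("username", "<unknown>")
            counts[(user, code)] += 1
    return counts


# Hypothetical audit events, one JSON document per line.
sample = [
    '{"verb": "get", "user": {"username": "alice"}, '
    '"objectRef": {"resource": "secrets", "namespace": "prod"}, '
    '"responseStatus": {"code": 403}}',
    '{"verb": "list", "user": {"username": "alice"}, '
    '"objectRef": {"resource": "secrets", "namespace": "prod"}, '
    '"responseStatus": {"code": 403}}',
    '{"verb": "get", "user": {"username": "bob"}, '
    '"objectRef": {"resource": "pods", "namespace": "dev"}, '
    '"responseStatus": {"code": 200}}',
]

counts = denied_requests(sample)
for (user, code), n in counts.most_common():
    print(user, code, n)
```

A count like this is only a starting point; which users and thresholds deserve an alert depends on the workflows you decided to support.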

Centralized logging is an essential part of any enterprise Kubernetes deployment. Configuring and maintaining a real-time, high-performance central repository for log collection can ease the day-to-day work of tracking what went wrong and its impact. Effective central logging also helps development teams quickly observe application logs to characterize application performance. Security compliance and auditing often require a company to maintain digital trails of who did what and when. In most cases, a robust logging solution is the most efficient way to satisfy these requirements.

For security audit purposes, consider streaming your logs to an external location with append-only access from within your cluster. For example, on AWS, you can create an S3 bucket in an isolated AWS account and give append-only access to your cluster log aggregator. This ensures your logs cannot be tampered with even in the case of a total cluster compromise.
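
One way to approximate the append-only access described above is an IAM policy that allows the log aggregator to write objects but grants no read or delete permissions. The sketch below builds such a policy document as a Python dictionary. The bucket name and prefix are hypothetical, and the sketch assumes that bucket versioning or Object Lock is enabled on the bucket, because `s3:PutObject` on its own still allows an existing key to be overwritten.

```python
import json


def write_only_policy(bucket, prefix):
    """Build an IAM policy document that permits only object writes."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowLogWritesOnly",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            }
        ],
    }


# Hypothetical bucket in the isolated account and a per-cluster prefix.
policy = write_only_policy("example-audit-logs", "cluster-a")
print(json.dumps(policy, indent=2))
```

Keeping the policy this narrow means that credentials stolen from the cluster can add new log objects but cannot list, read, or delete what is already stored.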

Log Aggregation and vRealize Log Insight

Log aggregation involves much more than rendering messages. An effective log aggregator must process events from thousands of endpoints, accommodate real-time queries, and provide an analytics engine capable of producing metrics that help solve complex technical and business problems. You have the option to implement log aggregation using vRealize Log Insight or a number of popular open source or commercial logging and analytics solutions, such as Elasticsearch, Fluentd, and Kibana (often deployed together) or Splunk. Each solution has its own strengths and weaknesses. VMware PKS gives you the flexibility to choose the solution that best aligns with your processes and tooling.

Network Monitoring

Network-based security monitoring tools, such as a network intrusion detection system (IDS) and web application firewalls, may work nearly out of the box, but making them work well takes some effort. The biggest hurdle is that many tools expect IP addresses to be a useful context for events. In Kubernetes, pod IP addresses change as pods are rescheduled, so an address alone says little about the workload behind it. To integrate these tools with Kubernetes, consider enriching the collected events with Kubernetes namespace, pod name, and pod label metadata. This adds valuable context that you can use for alerting or manual review, and it can make these traditional tools even more powerful in your cluster than in a more traditional environment. Some monitoring tools already collect Kubernetes metadata; for those that don't, you can write custom event enrichment code. The tools that come with VMware NSX can also help with network monitoring.
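
Custom event enrichment of this kind can be as simple as a lookup from IP address to pod metadata. The sketch below is a minimal illustration; the event fields `src_ip` and `dst_ip`, the pod records, and the function names are assumptions for the example. It also assumes the index is kept current by some other process, because pod IP addresses are reused over time.

```python
def build_ip_index(pods):
    """Map each pod IP address to that pod's metadata."""
    return {pod["ip"]: pod for pod in pods}


def enrich(event, ip_index):
    """Return a copy of the event with pod metadata added where known."""
    enriched = dict(event)
    for side in ("src", "dst"):
        pod = ip_index.get(event.get(f"{side}_ip"))
        if pod:
            enriched[f"{side}_pod"] = {
                "namespace": pod["namespace"],
                "name": pod["name"],
                "labels": pod["labels"],
            }
    return enriched


# Hypothetical snapshot of running pods.
pods = [
    {"ip": "10.1.2.3", "namespace": "shop", "name": "checkout-abc",
     "labels": {"app": "checkout"}},
    {"ip": "10.1.2.4", "namespace": "shop", "name": "cart-def",
     "labels": {"app": "cart"}},
]

# Hypothetical IDS event that carries only IP addresses.
event = {"signature": "suspicious-outbound", "src_ip": "10.1.2.3",
         "dst_ip": "203.0.113.9"}

enriched = enrich(event, build_ip_index(pods))
print(enriched)
```

The external destination address is left as is, while the source side now identifies the namespace, pod, and labels an analyst needs to route the alert.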

Host Event Monitoring

It's also possible to run host-based intrusion detection, such as file integrity monitoring and Linux system call logging (for example, auditd), directly on Kubernetes nodes. The results are hard to manage, however, because the workload running on any particular node varies from hour to hour as applications deploy and Kubernetes orchestrates pods.

To make sense of host-based events, you'll again want to consider extending your existing tools to include Kubernetes pod or container metadata in the context of captured events. Newer systems such as Sysdig Falco include this context out of the box.
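
When a host event carries only a node name and a timestamp, one fallback is to look up which pods were scheduled on that node at that moment. The sketch below assumes you keep a placement history yourself; the record layout, node names, and pod names are hypothetical, and tools that attach container metadata directly make this step unnecessary.

```python
from datetime import datetime

# Hypothetical placement history: which pod ran on which node, and when.
placements = [
    {"node": "node-1", "namespace": "shop", "pod": "checkout-abc",
     "start": datetime(2024, 1, 1, 9, 0), "end": datetime(2024, 1, 1, 10, 0)},
    {"node": "node-1", "namespace": "batch", "pod": "report-xyz",
     "start": datetime(2024, 1, 1, 10, 0), "end": None},  # still running
    {"node": "node-2", "namespace": "shop", "pod": "cart-def",
     "start": datetime(2024, 1, 1, 8, 0), "end": None},
]


def pods_on_node_at(placements, node, when):
    """Return namespace/pod names that were on the node at the given time."""
    matches = []
    for p in placements:
        if p["node"] != node:
            continue
        ended = p["end"] is not None and when >= p["end"]
        if p["start"] <= when and not ended:
            matches.append(f'{p["namespace"]}/{p["pod"]}')
    return matches


# The same node maps to different workloads an hour apart.
print(pods_on_node_at(placements, "node-1", datetime(2024, 1, 1, 9, 30)))
print(pods_on_node_at(placements, "node-1", datetime(2024, 1, 1, 10, 30)))
```

This narrows an event down to the candidate workloads on a node, but it cannot say which container produced the event, which is why per-event container metadata is preferable when it is available.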

Monitoring and DevOps

To support DevOps teams, a container platform adds security, logging, monitoring, analytics, dashboards, and other operational features.

Wavefront

VMware Cloud PKS can be integrated with Wavefront by VMware to efficiently monitor containers at enterprise scale. Wavefront delivers monitoring and analytics throughout a cloud-native stack for always-on metrics as a service. When integrated with VMware Cloud PKS, Wavefront gives developers and DevOps real-time visibility into the operations and performance of containerized workloads and Kubernetes clusters.

Fluentd, Fluent Bit, and Elasticsearch

Fluentd is an open-source data collector used to create a unified logging layer that collects and processes data. Fluent Bit is a lightweight data forwarder that sends data from the edge to Fluentd aggregators. Fluentd and Fluent Bit can be integrated with VMware Cloud PKS to collect logging data and push it to an output destination such as Elasticsearch, a distributed search and analytics engine that lets data engineers query unstructured, structured, and time-series data.
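
To give a sense of what pushing log data to Elasticsearch involves, the sketch below formats log records as a newline-delimited body of the kind the Elasticsearch bulk API accepts: an action line followed by the document, repeated per record. In practice the Fluentd or Fluent Bit output plugin handles this for you; the index name, the record fields, and the function `to_bulk_body` are assumptions for illustration.

```python
import json


def to_bulk_body(index, records):
    """Format records as newline-delimited JSON for a bulk index request."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(record))                        # document line
    return "\n".join(lines) + "\n"  # the body must end with a newline


# Hypothetical log records, already tagged with Kubernetes metadata.
records = [
    {"log": "payment accepted",
     "kubernetes": {"namespace_name": "shop", "pod_name": "checkout-abc"}},
    {"log": "cart updated",
     "kubernetes": {"namespace_name": "shop", "pod_name": "cart-def"}},
]

body = to_bulk_body("k8s-logs", records)
print(body, end="")
```

Batching records this way is what lets a collector ship events from many endpoints without issuing one request per log line.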

Prometheus and Grafana

Prometheus is an open-source monitoring system that is widely used with Kubernetes. It excels at collecting and querying multidimensional time-series data. Prometheus is hosted by the Cloud Native Computing Foundation, of which VMware is a member. Grafana is an open-source metrics dashboard commonly used with Prometheus to display data. Both Prometheus and Grafana can be integrated with VMware Cloud PKS.
