@amdove commented Jan 16, 2026

Description

Expanded the monitoring documentation to provide comprehensive coverage of all container-level metrics collected via cAdvisor and kube-state-metrics. Organized metrics into clear categories (memory, CPU, network, filesystem, lifecycle) with detailed descriptions of each metric's purpose.

Added an extensive troubleshooting section with practical PromQL queries for diagnosing:

  • Memory issues and OOMKilled pods
  • CPU throttling and performance problems
  • Network connectivity, throughput, and packet drops
  • Disk I/O usage and IOPS bottlenecks
  • Container restart patterns and lifecycle issues

Each troubleshooting category includes ready-to-use Grafana queries and key investigation points to help operators quickly diagnose and resolve container resource issues.
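
For reviewers who want a sense of the query style, the following are illustrative sketches only, not copies of the queries added in this PR. They assume the standard cAdvisor and kube-state-metrics metric names, that the scrape configuration attaches `namespace`, `pod`, and `container` labels to cAdvisor series, and a hypothetical namespace called `my-namespace`.

```promql
# Memory: working-set usage as a fraction of the container memory limit.
# Assumes memory limits are set; containers without a limit return no result.
sum by (pod, container) (container_memory_working_set_bytes{namespace="my-namespace", container!=""})
  /
sum by (pod, container) (kube_pod_container_resource_limits{namespace="my-namespace", resource="memory"})

# Memory: containers whose most recent termination reason was OOMKilled.
kube_pod_container_status_last_terminated_reason{namespace="my-namespace", reason="OOMKilled"} == 1

# CPU: fraction of CFS scheduling periods in which the container was throttled.
sum by (pod, container) (rate(container_cpu_cfs_throttled_periods_total{namespace="my-namespace"}[5m]))
  /
sum by (pod, container) (rate(container_cpu_cfs_periods_total{namespace="my-namespace"}[5m]))

# Network: inbound packets dropped per second, per pod.
sum by (pod) (rate(container_network_receive_packets_dropped_total{namespace="my-namespace"}[5m]))

# Lifecycle: containers that restarted at least once in the last hour.
increase(kube_pod_container_status_restarts_total{namespace="my-namespace"}[1h]) > 0
```

Each expression is a separate query and should be entered in its own Grafana panel or Explore tab. The 5m and 1h windows are arbitrary starting points and may need adjusting to the scrape interval in use.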

Category of change

  • Bug fix (non-breaking change which fixes an issue)
  • Version upgrade (upgrading the version of a service or product)
  • New feature (non-breaking change which adds functionality)
  • Build: a code change that affects the build system or external dependencies
  • Performance: a code change that improves performance
  • Refactor: a code change that neither fixes a bug nor adds a feature
  • Documentation: documentation changes
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

@amdove marked this pull request as ready for review January 20, 2026 21:39