Prometheus and Grafana combo on Kubernetes

I’ve previously used Prometheus/Grafana with Kubernetes-based applications, but this time I wanted to delve more into the theoretical aspects, hence this article :). Additionally, I will follow it up with a more practical example: a simple CRUD backend/DB application hosted on Kubernetes that utilizes Prometheus.

Prometheus

Before explaining what Prometheus is, let’s first define what a metric is and which metrics we typically need.
A metric is a quantitative measure that provides information about the performance, health, or behavior of a system or application, helping to assess its state and detect anomalies or trends over time.

Prometheus is an open-source monitoring and alerting system designed to collect metrics from various targets, such as servers, containers, databases, and other systems, by scraping HTTP endpoints. Prometheus stores these metrics in a time-series database, allowing users to query, visualize, and alert on the data. It supports a powerful query language (PromQL) and provides a flexible alerting mechanism based on defined rules.

Time-series data means that each metric sample is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.

For example, in the following code example we can see a metric named http_requests_total representing the total number of HTTP requests. Between the metric’s name and its actual value, we can find the labels inside the curly braces: key-value pairs which further describe the metric.

# HELP http_requests_total Total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET", endpoint="/api/users", status="200"} 100
http_requests_total{method="POST", endpoint="/api/users", status="200"} 50
http_requests_total{method="GET", endpoint="/api/products", status="404"} 20
http_requests_total{method="POST", endpoint="/api/products", status="500"} 5

Architecture of Prometheus

Before diving into the details, it’s important to note that Prometheus operates on a pull-based mechanism: it pulls metrics from the configured targets at regular intervals (the default scrape interval being 1 minute).
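To make the pull model concrete, here is a minimal prometheus.yml sketch; the job name and target address are placeholders I made up for illustration, not something from this article:

global:
  scrape_interval: 1m               # how often Prometheus pulls metrics from each target

scrape_configs:
  - job_name: "my-app"              # hypothetical job name
    metrics_path: /metrics          # the HTTP endpoint Prometheus scrapes
    static_configs:
      - targets: ["my-app:8080"]    # hypothetical host:port exposing metrics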

The following picture shows the architecture of what is often called the Prometheus stack (the Prometheus server plus the rest of the components which usually get lumped together and deployed alongside it).

In a nutshell: we have our applications, which import the client libraries, and using them we define our metrics (see the sketch below).
Alongside the applications we have the so-called exporters: software components that expose metrics in a format that Prometheus can scrape and store. Think of them as intermediaries between the applications you want to monitor and Prometheus itself.
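As a rough illustration of the client-library side, here is a minimal Python sketch using the official prometheus_client library; the metric mirrors the http_requests_total example above, and the port and label values are arbitrary choices of mine:

from prometheus_client import Counter, start_http_server
import random
import time

# Define a counter metric with labels, mirroring the http_requests_total example above.
REQUESTS = Counter(
    "http_requests_total",
    "Total number of HTTP requests.",
    ["method", "endpoint", "status"],
)

if __name__ == "__main__":
    # Expose /metrics on port 8000 so Prometheus can scrape this process.
    start_http_server(8000)
    while True:
        # In a real application you would increment this inside your request handler.
        REQUESTS.labels(method="GET", endpoint="/api/users", status="200").inc()
        time.sleep(random.random())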

The following components are better thought of as logical parts of the Prometheus server itself. Service discovery is the mechanism by which Prometheus identifies and dynamically discovers the targets it should monitor.
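As a hedged sketch of what Kubernetes service discovery looks like in the scrape configuration (the annotation-based filtering shown here is a common convention, not the only option):

scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod                   # discover every pod in the cluster
    relabel_configs:
      # keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"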

We already explained scraping briefly: it is the pull mechanism which Prometheus executes, and we configure things such as the scrape interval, the HTTP endpoint to scrape from, and so on (as in the prometheus.yml sketch above).

Prometheus uses its own time-series database optimized for handling metrics data. This database organizes metrics into time series, where each time series represents a sequence of data points over time. This design enables efficient storage and querying of metrics data. By default, Prometheus stores metrics data locally on disk.
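For reference, local storage is controlled through flags on the Prometheus binary; the path and retention period below are just example values:

# example invocation: local TSDB under /prometheus, samples kept for 15 days
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=15d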

Prometheus has a number of HTTP APIs that allow you to both request raw data and evaluate PromQL queries. On top of these APIs, Prometheus provides the expression browser, which is suitable for ad hoc querying and data exploration, but it is not a general dashboard system.
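For example, assuming the default port 9090, an instant query can be evaluated against the HTTP API like this (the query itself, the built-in up metric, is just an illustration):

curl 'http://localhost:9090/api/v1/query?query=up'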

Rules in Prometheus are used to define conditions or expressions that evaluate metrics data. They are written in PromQL, which allows users to express complex conditions and calculations based on the metrics data.
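A minimal recording-rule sketch; the rule name and expression are illustrative, building on the http_requests_total counter from earlier:

groups:
  - name: example-recording-rules
    rules:
      # precompute the per-job request rate so dashboards can query it cheaply
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))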

Alerts in Prometheus are derived from rules and are used to notify users when specific conditions or thresholds are met or violated.
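An alerting rule looks very similar; here is a hedged sketch that fires when the 5xx rate stays above an arbitrary example threshold for ten minutes:

groups:
  - name: example-alerting-rules
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status="500"}[5m])) > 0.05   # example threshold
        for: 10m                      # the condition must hold this long before firing
        labels:
          severity: warning
        annotations:
          summary: "HTTP 5xx rate is above 0.05 requests per second"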

Alertmanager is a separate component which receives alerts from the Prometheus server and turns them into notifications. Notifications can go out via email, chat applications such as Slack, and services such as PagerDuty.
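A minimal alertmanager.yml sketch routing everything to a single email receiver; the SMTP host and all addresses are placeholders:

global:
  smtp_smarthost: "smtp.example.com:587"    # placeholder SMTP server
  smtp_from: "alertmanager@example.com"     # placeholder sender address

route:
  receiver: team-email                      # default receiver for all alerts

receivers:
  - name: team-email
    email_configs:
      - to: "oncall@example.com"            # placeholder recipient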

Types of Prometheus metrics

We briefly touched on the format of metrics; there are currently several types of them.
The first one is the counter, which we already saw above: a counter is a type of metric that keeps adding up over time. It’s like a scoreboard that only goes up: once set, it either increases or gets reset to zero when the system restarts.

The second one is the gauge: a single numerical value that can arbitrarily go up and down.

# HELP memory_usage_bytes Amount of memory being used.
# TYPE memory_usage_bytes gauge
memory_usage_bytes{process="web_server"} 1.5e+09

The third one is the histogram: it samples observations and counts them in configurable buckets. It also provides a sum of all observed values. Think of it as a histogram chart in Excel :).

# HELP http_request_duration_seconds Duration of HTTP requests.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="0.5"} 200
# ... more buckets
http_request_duration_seconds_sum 150
http_request_duration_seconds_count 300

The last one is the summary. While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window. For example (with illustrative values):

# HELP db_query_latency_seconds Latency of database queries in seconds
# TYPE db_query_latency_seconds summary
db_query_latency_seconds{query_type="SELECT", database="customers", quantile="0.5"} 0.12
db_query_latency_seconds{query_type="SELECT", database="customers", quantile="0.9"} 0.34
db_query_latency_seconds{query_type="SELECT", database="customers", quantile="0.99"} 0.81
db_query_latency_seconds_sum{query_type="SELECT", database="customers"} 75
db_query_latency_seconds_count{query_type="SELECT", database="customers"} 500

The summary metric db_query_latency_seconds provides insights into the distribution of SELECT query latencies on the “customers” database.

Each quantile we are interested in (0.5, 0.9, 0.99) is exposed as its own series via the quantile label. The accompanying _count series (500 here) represents the total number of observations accumulated over the sliding time window, and _sum is the sum of all observed values.

To retrieve the value for one of the quantiles, we simply select the corresponding series, for example:
db_query_latency_seconds{query_type="SELECT", database="customers", quantile="0.5"}
The result is a value X which represents the latency below which 50% of SELECT queries on the “customers” database complete.

Grafana

Grafana is an open-source platform for monitoring and observability, allowing users to visualize and analyze metrics and logs from various data sources in real-time.
In order to visualize something, we need a source of information; this is where Grafana’s data sources come into play. In Grafana, a data source is a connection to a specific data repository or service from which Grafana retrieves metrics, logs, or other time-series data for visualization and analysis. Grafana supports a wide range of data sources, allowing users to integrate data from multiple sources into their dashboards for comprehensive monitoring and observability.
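Data sources can be added through the UI or provisioned from a file; here is a minimal provisioning sketch for a Prometheus data source, where the URL is a placeholder for wherever Prometheus is reachable in your cluster:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                                        # Grafana proxies the requests
    url: http://prometheus-service.monitoring.svc:9090   # placeholder service address
    isDefault: true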

Once the data source has been configured, the next step is to create queries to retrieve specific data for visualization on your panel. Each data source comes with a query editor, which formulates custom queries according to the source’s structure.
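With the Prometheus data source, a panel query is just PromQL; for example, building on the earlier counter, a per-endpoint request rate could look like this:

sum by (endpoint) (rate(http_requests_total[5m]))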

In practice you will rarely write your own queries, but rather use ready-made dashboards, which are basically preconfigured sets of queries and panels.

Adding dashboards is very easy: head over to https://grafana.com/grafana/dashboards/, find the dashboard you would like to use, get its ID, and import it into Grafana.

For example, the following picture shows the dashboard (https://grafana.com/grafana/dashboards/13332-kube-state-metrics-v2/) which visualizes the data from the Prometheus exporter kube-state-metrics.


Running Prometheus (stack) and Grafana on Kubernetes

Now that we have gone over the theoretical part, let’s put this into action and deploy stuff.

There are generally three ways, just like for most apps these days, to run the Prometheus stack on Kubernetes: applying raw manifests, deploying via a basic Helm chart, and deploying through an Operator.

For production purposes, using an Operator is heavily advised. However, for the scope of this project, I will do it using raw manifests. This way, we can learn about all the nitty-gritty details that are usually hidden when we use the other two methods.
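For comparison, the Helm route would look roughly like this, using the charts published by the prometheus-community project (namespace and release names are my own choices):

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# the basic chart; the Operator-based route uses the kube-prometheus-stack chart instead
helm install prometheus prometheus-community/prometheus --namespace monitoring --create-namespace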

The following picture shows the overall architecture of the Prometheus stack deployed in the context of Kubernetes.

Although not shown in the picture, for each deployment a corresponding config map is created. This config map contains all the necessary configuration specific to each service, such as Prometheus rules and scrape config jobs, Grafana’s Prometheus service address, Alertmanager’s notification email addresses, and more. It is subsequently mounted as a volume (see the sketch below).
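As a hedged sketch of that pattern for the Prometheus server itself (the names, namespace, and paths are illustrative, not taken from the manifests in this article):

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 1m
---
# relevant fragment of the Prometheus Deployment's pod spec
spec:
  containers:
    - name: prometheus
      image: prom/prometheus
      args:
        - "--config.file=/etc/prometheus/prometheus.yml"
      volumeMounts:
        - name: config
          mountPath: /etc/prometheus
  volumes:
    - name: config
      configMap:
        name: prometheus-config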

Adding exporters

TODO

Sources

I’ve mostly used the Prometheus docs, the DevOpsCube blog post at https://devopscube.com/setup-prometheus-monitoring-on-kubernetes/, and the book Prometheus: Up & Running.