Add a new `Kafka (Enterprise)` row to both vmagent dashboards:
- `dashboards/vmagent.json`
- `dashboards/vm/vmagent.json`
The row is placed before `Drilldown` and contains three Kafka-specific
panels:
- `Kafka bytes`
- `Kafka messages in/out`
- `Kafka and consumer errors`
The goal is to provide a compact Kafka-focused view for enterprise
vmagent deployments without duplicating the existing generic remote
write panels such as connection saturation and persistent queue size.
The new row helps distinguish:
- producer vs consumer throughput at the Kafka topic level
- message-rate shifts that may indicate smaller Kafka payloads and
higher per-message overhead
- producer-side Kafka errors vs consumer-side Kafka errors
Descriptions include links to the relevant Kafka documentation sections.
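A sketch of the kind of per-topic queries such panels rely on (the `vmagent_kafka_*` metric names below are hypothetical placeholders, not the exact Enterprise metric names):

```promql
# Producer vs consumer message throughput at the Kafka topic level.
sum(rate(vmagent_kafka_messages_produced_total[$__rate_interval])) by (topic)
sum(rate(vmagent_kafka_messages_consumed_total[$__rate_interval])) by (topic)

# Producer-side vs consumer-side Kafka errors.
sum(rate(vmagent_kafka_producer_errors_total[$__rate_interval])) by (topic)
sum(rate(vmagent_kafka_consumer_errors_total[$__rate_interval])) by (topic)
```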
PR https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10728
---------
Co-authored-by: Max Kotliar <mkotlyar@victoriametrics.com>
Metadata is enabled by default since v1.137.0, and the metadata volume
can be a big contributor to resource usage and network traffic.
vmagent dashboard:
1. `Troubleshooting` section: rename `Datapoints rate` panel to `Rows
rate` to include metadata rate;
2. `Ingestion` section: add metadata rate to existing `Rows rate` panel.
(The difference between this panel and the one above is that this panel
only contains data from write requests, while the panel above also
includes data from scraping; see the query sketch below.)
vmcluster dashboard:
1. `vminsert` section: add `Rows rate` panel
Didn’t see a good place for it in the vmsingle dashboard, since it
doesn’t have a dedicated insert section, and I don’t want to add it to
`overview` yet.
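A sketch of the kind of queries behind a combined `Rows rate` panel (the metadata metric name below is a placeholder, not necessarily the one used in the dashboards):

```promql
# Samples ingested per second, by ingestion protocol.
sum(rate(vm_rows_inserted_total[$__rate_interval])) by (type)

# Metadata entries ingested per second
# (hypothetical metric name, for illustration only).
sum(rate(vm_metadata_rows_inserted_total[$__rate_interval])) by (type)
```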
https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10868
This commit adds new metrics `vmalert_remotewrite_queue_capacity` and
`vmalert_remotewrite_queue_size`. The size gauge is updated on each push,
so its update frequency depends on `-remoteWrite.concurrency` and
`-remoteWrite.flushInterval`.
It doesn't account for the pending data within each pusher's request, but it
should provide a general indication of queue usage.
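A sketch of how the new gauges can be combined to watch queue saturation (panel expressions may differ):

```promql
# Fraction of the vmalert remote-write queue currently in use;
# values close to 1 indicate the queue is about to overflow.
vmalert_remotewrite_queue_size / vmalert_remotewrite_queue_capacity
```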
Related PR https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10765
Changes:
- Added the number of `pending alerts` and `firing alerts` (see the query sketch after this list)
- Improved `transformations` for the `FIRING over time by group and rules` panel
- Added sorting for the `FIRING over time by rule` panel
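Queries along these lines back the new panels (a sketch; the exact expressions and grouping labels in the dashboard may differ):

```promql
# Alerts currently in the firing state, per group.
sum(vmalert_alerts_firing) by (group)

# Alerts currently in the pending state, per group.
sum(vmalert_alerts_pending) by (group)
```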
Signed-off-by: sias32 <sias.32@yandex.ru>
Co-authored-by: Max Kotliar <mkotlyar@victoriametrics.com>
Before, by mistake, the datasource was referenced by input name instead
of variable name. For an unknown reason, it worked well in the local setup
and on the playground.
The fix is confirmed by users and keeps working in the local setup
and on the playground.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The new Grafana dashboard uses the following APIs:
- /api/v1/status/tsdb
- /api/v1/status/metric_names_stats
It shows the list of metric names, their request counts, and the last time
they were "used". Clicking on a metric name allows exploring its
cardinality.
Based on https://github.com/VictoriaMetrics/VictoriaMetrics/issues/9832
-----------
The PR contains a few unrelated changes:
* rename of the folder for the Prometheus datasource to remove the duplicated
word
* fix for vmalert's access to the datasource, as before it wasn't able
to read/write properly
-------------
The dashboard screencast:
https://github.com/user-attachments/assets/01dda5d9-14e5-4f5a-b795-a838abec4f5e
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Haley Wang <haley@victoriametrics.com>
* add a meaningful description, as it is required for publishing on grafana.com
* remove the dependency on `victoriametrics-metrics-datasource`, as it is not used
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Components like vmselect and vminsert rarely touch disk, so most of the
time their disk I/O values are 0. Filtering out 0 values makes the panel cleaner.
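One way to express such filtering at the query level (a sketch; the panel may instead hide empty series via display options):

```promql
# Disk write rate per instance; the trailing `> 0` drops series that are
# flat zero, e.g. vmselect/vminsert instances that don't touch disk.
sum(rate(process_io_storage_written_bytes_total[$__rate_interval])) by (instance) > 0
```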
Signed-off-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
Add a Source Code data link (click a bar or line in a graph to follow it) that
points directly to a source code file on GitHub. The `VictoriaMetrics -
cluster`, `VictoriaMetrics - single-node`, and `VictoriaMetrics -
vmagent` dashboards were updated. I did not add it to other panels since
they do not have a Drilldown section at all.
Also, fixed a misplaced Drilldown link in the `VictoriaMetrics -
single-node` dashboard.
The proxy service code is here:
https://github.com/VictoriaMetrics/location2source/
### Describe Your Changes
The new title better aligns with the code of
[writeconcurrencylimiter](d9dabea303/lib/writeconcurrencylimiter/concurrencylimiter.go (L140)),
the panel description and the metric used in the query.
Previously, the panel title suggested that it reflected only disk write
performance. During an incident investigation, this led to a wrong
assumption that the panel was unrelated to client-side performance.
In reality, the metric [includes the full write
path](98e320842c/lib/vminsertapi/server.go (L263)):
time spent reading data from the TCP connection, processing it, and
acknowledging the block. The updated title reflects this behavior more
accurately and reduces the risk of misinterpretation during incident
analysis.
This simplifies troubleshooting via the vm_log_messages_total metric
when logs are unavailable. Logs may be unavailable when the -loggerLevel command-line
flag is set to a value other than INFO. They may also be unavailable when clients
use the Monitoring of Monitoring service ( https://victoriametrics.com/products/mom/ ),
which provides metrics, but doesn't provide logs from VictoriaMetrics components
running on the client side.
Add an `is_printed` label to the `vm_log_messages_total` metric in order to detect whether
the given log message has been suppressed or printed.
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10304
While at it, make the description of the TooManyLogs alert more readable;
the alert is based on the vm_log_messages_total metric.
Also restore the `level!="info"` filter instead of `level="error"`
in the query for this alerting rule, in order to be consistent with the queries
in the official dashboards for VictoriaMetrics components.
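For reference, the rule's query has roughly the following shape (a sketch; the grouping labels here are assumptions, not the exact rule definition):

```promql
# Rate of non-info log messages over the last 5 minutes; the restored
# level!="info" filter matches the official dashboards' queries.
sum(increase(vm_log_messages_total{level!="info"}[5m])) by (job, instance) > 0
```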
TODO: investigate the too-high warnings rate at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/2760
and fix it at the source of these warnings instead of modifying the query
for the TooManyLogs alert.
### Describe Your Changes
The Backups errors panel uses a hard-coded rate interval. When looking over a
large period of time, the displayed value would likely stay low due to the
hard-coded interval, while in reality the number of errors is much larger.
This change addresses this by using the `$__rate_interval` variable in Grafana, so
the rate interval aligns with the date/time range selected in Grafana.
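A sketch of the updated expression shape (the metric name below is illustrative, not necessarily the one used by the panel):

```promql
# Backup errors rate with Grafana's $__rate_interval, so the lookbehind
# window scales with the selected date/time range.
sum(rate(vm_backup_errors_total[$__rate_interval])) by (instance)
```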
Add the `vm_rollup_result_cache_requests_total` metric, which tracks the
number of requests to the query rollup cache.
As described in
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10117, when
retrieving cached data from the rollup result cache, there can be mixed
`get()` and `getBig()` calls to the underlying fastcache. And it's
unpredictable how many times `getBig()` will call `get()`, so the
metrics from fastcache cannot be used to indicate query cache miss
ratio.
Exposing a new counter `vm_rollup_result_cache_requests_total` to track
the number of requests to the query rollup cache, together with the
existing `vm_rollup_result_cache_miss_total`, allows for monitoring the
rollup cache miss rate per query (or subquery), which is more
user-facing.
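With both counters available, the miss ratio can be computed as follows:

```promql
# Rollup result cache miss ratio: misses divided by total requests.
rate(vm_rollup_result_cache_miss_total[$__rate_interval])
/
rate(vm_rollup_result_cache_requests_total[$__rate_interval])
```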
fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10117
related to https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5056
follow up https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10177
Add `vmauth_user_request_backend_requests_total` and
`vmauth_unauthorized_user_request_backend_requests_total`, which track
the number of user request errors and are aligned with
`vmauth_user_requests_total`.
The existing `vmauth_http_request_errors_total` currently only counts
requests with `invalid_auth_token`. Once authorization has passed, any
subsequent request errors are tracked under
`xxx_user_request_backend_requests_total`.
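A sketch of how the new counters can be used (the `code` and `username` labels here are assumptions about the metrics' labeling, not confirmed by the source):

```promql
# Share of per-user backend requests that end in 5xx responses.
sum(rate(vmauth_user_request_backend_requests_total{code=~"5.."}[$__rate_interval])) by (username)
/
sum(rate(vmauth_user_request_backend_requests_total[$__rate_interval])) by (username)
```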
Right now we have two separate panels: RSS memory % usage and RSS
anonymous memory % usage. This makes trend comparison difficult because
one has to visually correlate two independent panels. Another problem
is that these panels don't show Go runtime allocations at all. The same
applies to memory allocated in C: there are allocations in C (zstd) one
should account for, but there isn't even a metric to expose them.
The commit adds a Memory usage breakdown panel to the Drilldown section. It
provides insight into the distribution of Go Stack, Go Heap, Go Heap Released,
Go Other, Mmap: VM Cache, and File cache memory.
It should help spot trend changes in memory by type or investigate
issues such as #10069 and #10028 more easily.
Panel info:
This panel shows memory usage by category.
How to use:
- Start from the high-level RSS panel.
- Identify an instance with unexpected or abnormal memory growth.
- Filter to that instance to inspect the detailed breakdown here.
Interpretation:
- A steadily rising Go Heap usually indicates a memory leak. Collect
pprof memory profile.
- A growing Go Stack commonly points to a goroutine leak.
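Most of the categories map to standard Go runtime and process metrics; a rough sketch of the series behind them (the exact panel expressions, especially for the Mmap: VM Cache and File cache components, may differ):

```promql
# Go Stack: memory currently in use for goroutine stacks.
go_memstats_stack_inuse_bytes

# Go Heap: heap memory currently in use.
go_memstats_heap_inuse_bytes

# Go Heap Released: heap memory returned to the OS.
go_memstats_heap_released_bytes

# Anonymous RSS, for comparison with the high-level RSS panel.
process_resident_memory_anon_bytes
```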
<img width="1508" height="628" alt="Screenshot 2025-12-08 at 13 18 44"
src="https://github.com/user-attachments/assets/0e794324-e86d-468e-b926-8bb11f5a2043"
/>
<img width="1503" height="674" alt="Screenshot 2025-12-08 at 13 19 34"
src="https://github.com/user-attachments/assets/62fc3fff-33b3-4dfe-ad3f-ad0526a8a606"
/>
Related PR https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10139
Add ad-hoc filters to query stats and operator dashboards.
These filters are useful for exploring non-uniform metric sets
without distinct job/instance filters.