Update docs/victoriametrics/Troubleshooting.md

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Signed-off-by: Pablo (Tomas) Fernandez <46322567+TomFern@users.noreply.github.com>
Grammar pass
2026-06-24 19:17:42 +03:00 · 2026-06-24 13:31:27 +01:00 · 2026-06-24 13:13:39 +01:00 · 2026-06-23 19:17:57 +02:00 · 2026-06-23 19:46:48 +03:00 · 2026-06-23 19:37:51 +03:00
3 changed files with 150 additions and 6 deletions
--- a/.github/workflows/codeql-analysis-go.yml
+++ b/.github/workflows/codeql-analysis-go.yml
@@ -54,14 +54,14 @@ jobs:
          restore-keys: go-artifacts-${{ runner.os }}-codeql-analyze-${{ steps.go.outputs.go-version }}-

      - name: Initialize CodeQL
-        uses: github/codeql-action/init@87557b9c84dde89fdd9b10e88954ac2f4248e463  # v4.36.1
+        uses: github/codeql-action/init@e46ed2cbd01164d986452f91f178727624ae40d7  # v4.35.3
        with:
          languages: go

      - name: Autobuild
-        uses: github/codeql-action/autobuild@87557b9c84dde89fdd9b10e88954ac2f4248e463  # v4.36.1
+        uses: github/codeql-action/autobuild@e46ed2cbd01164d986452f91f178727624ae40d7  # v4.35.3

      - name: Perform CodeQL Analysis
-        uses: github/codeql-action/analyze@87557b9c84dde89fdd9b10e88954ac2f4248e463  # v4.36.1
+        uses: github/codeql-action/analyze@e46ed2cbd01164d986452f91f178727624ae40d7  # v4.35.3
        with:
          category: 'language:go'
--- a/docs/victoriametrics/BestPractices.md
+++ b/docs/victoriametrics/BestPractices.md
@@ -16,9 +16,39 @@ aliases:
 ---
 ## Install Recommendation

-It is recommended to run the latest available release of VictoriaMetrics from [this page](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/latest), since it contains all the bugfixes and enhancements.
+It is recommended to run the latest available release of VictoriaMetrics from [this page](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/latest), as it includes all bug fixes and enhancements.

-There is no need to tune VictoriaMetrics because it uses reasonable defaults for command-line flags. These flags are automatically adjusted for the available CPU and RAM resources. There is no need in Operating System tuning because VictoriaMetrics is optimized for default OS settings. The only option is to increase the limit on the [number of open files in the OS](https://medium.com/@muhammadtriwibowo/set-permanently-ulimit-n-open-files-in-ubuntu-4d61064429a), so VictoriaMetrics could accept more incoming connections and could keep open more data files.
+There is no need to tune VictoriaMetrics, as it uses reasonable defaults for its command-line flags. These flags are automatically adjusted for the available CPU and RAM resources. There is no need for operating system tuning because VictoriaMetrics is optimized for default OS settings. The only recommended change is to increase the limit on the [number of open files in the OS](https://medium.com/@muhammadtriwibowo/set-permanently-ulimit-n-open-files-in-ubuntu-4d61064429a), so that VictoriaMetrics can accept more incoming connections and keep more data files open. VictoriaMetrics is developed and tested to run efficiently with these defaults, which fit the majority of workloads. Change a setting only when the docs explicitly recommend it and explain when and why.
+
+## Memory
+
+VictoriaMetrics components detect the available memory at startup as the smaller of the host RAM and the cgroup memory limit.
+To keep them stable:
+
+1. Do not set `GOMEMLIMIT`. VictoriaMetrics paces garbage collection with `GOGC` and does not use `GOMEMLIMIT` for sizing.
+   `GOMEMLIMIT` bounds only Go runtime memory, not the process's total RSS, off-heap caches, mmap-ed files,
+   or the OS page cache, so it is not a reliable way to size VictoriaMetrics containers. It can curb Go heap growth,
+   but setting it too low makes the garbage collector run more often than `GOGC` dictates,
+   spending in the worst case up to ~50% of CPU time on GC —
+   the ceiling enforced by Go's [GC CPU limiter](https://go.dev/doc/gc-guide).
+   Set the container memory limit, and VictoriaMetrics automatically sizes its caches and memory-aware limits.
+
+1. Do not hand-tune cache sizes with `-storage.cacheSize*` flags; rely on the defaults.
+   If a component needs larger caches, move it to a host with more memory.
+   See [Cache tuning](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#cache-tuning).
+
+1. Do not autoscale `vmstorage` with the Vertical Pod Autoscaler (VPA) or the Horizontal Pod Autoscaler (HPA):
+   - VPA: cache sizes are derived from the memory limit, which is read only once at startup.
+     Modes that recreate the pod (`Recreate`, `Auto`) reset the caches and force a cold start,
+     causing slow inserts and query latency spikes. In-place resizing is not picked up at runtime,
+     so `vmstorage` keeps the cache budget and the `vm_available_memory_bytes` value set at startup, which also skews the dashboards.
+     Set fixed memory requests and limits for `vmstorage` rather than autoscaling.
+   - HPA: `vmstorage` is stateful. Adding nodes sends new series to them while existing data stays where it is.
+     Removing nodes makes the data on them unavailable to queries and can cause data loss without replication.
+     Frequent scaling keeps changing the routing and can degrade the cluster.
+
+1. Leave headroom for the OS page cache and workload spikes —
+   see [capacity planning](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#capacity-planning).

 ## Swap

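Note on the Memory section added in the hunk above: the startup detection it describes (the smaller of the host RAM and the cgroup memory limit) and the cache budget derived from it can be sketched roughly as follows. This is an illustrative Python model, not VictoriaMetrics code. The function names and the `None` convention for "no cgroup limit" are hypothetical, and the 60% default for `-memory.allowedPercent` is taken from the Troubleshooting.md change in this same PR.

```python
from typing import Optional

GiB = 1024 ** 3


def available_memory_bytes(host_ram: int, cgroup_limit: Optional[int]) -> int:
    """Model of startup detection: the smaller of host RAM and the cgroup limit.

    cgroup_limit is None when no limit is set (a convention assumed for this sketch).
    """
    if cgroup_limit is None or cgroup_limit <= 0:
        return host_ram
    return min(host_ram, cgroup_limit)


def cache_budget_bytes(available: int, allowed_percent: float = 60.0) -> int:
    """Share of the detected memory that in-process caches may use."""
    return int(available * allowed_percent / 100)


# A 64 GiB host running a container limited to 16 GiB.
avail = available_memory_bytes(host_ram=64 * GiB, cgroup_limit=16 * GiB)
budget = cache_budget_bytes(avail)
print(f"available={avail / GiB:.1f} GiB, cache budget={budget / GiB:.1f} GiB")
```

Because both values are computed once, the model also shows why an in-place resize of the container limit has no effect until the process restarts.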
--- a/docs/victoriametrics/Troubleshooting.md
+++ b/docs/victoriametrics/Troubleshooting.md
@@ -188,7 +188,7 @@ If you see unexpected or unreliable query results from VictoriaMetrics, then try

 These are the most common reasons for slow data ingestion in VictoriaMetrics:

-1. Memory shortage for the given amounts of [active time series](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-an-active-time-series).
+1. [Memory shortage](#memory-shortage) for the given number of [active time series](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-an-active-time-series).

   VictoriaMetrics (or `vmstorage` in the cluster version of VictoriaMetrics) maintains an in-memory cache `storage/tsid`
   for a quick search for internal series IDs for each incoming metric. VictoriaMetrics automatically determines the maximum 
@@ -352,6 +352,120 @@ These are the solutions that exist for improving the performance of slow queries
  See also [this article](https://valyala.medium.com/how-to-optimize-promql-and-metricsql-queries-85a1b75bf986),
  which explains how to identify and optimize slow queries.

+## Memory shortage
+
+High memory utilization alone does not indicate a shortage.
+VictoriaMetrics components can operate normally under high memory utilization,
+but it is recommended to keep
+[at least 50% of memory free for stability](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#capacity-planning).
+Before reacting, distinguish a memory shortage (not enough memory for the workload)
+from high utilization and from memory pressure.
+Memory pressure is the kernel's reclaim activity, as reported by [PSI](https://docs.kernel.org/accounting/psi.html).
+It can indicate a shortage. It is better to avoid memory pressure
+because it puts a VictoriaMetrics component at risk of not getting enough memory, which can lead to OOMs.
+
+### Memory resource accounting
+
+A component detects the memory available to it at startup as the smaller of the host RAM and the cgroup memory limit,
+and exposes it as `vm_available_memory_bytes`. The main observable parts of its memory usage are:
+
+1. Go (anonymous) memory — `process_resident_memory_anon_bytes`. It includes:
+
+   - A portion of total memory for caches, capped by `-memory.allowedPercent` (default 60%) or `-memory.allowedBytes`.
+     The memory is used for in-process caches, for example, `storage/tsid`, `indexdb` blocks, and the rollup cache.
+   - The Go heap, goroutine stacks, anonymous off-heap allocations, and runtime overhead used for ingestion and queries.
+     Only the cache part is capped by `-memory.allowedPercent`. The Go heap grows on top of it under load.
+
+1. OS page cache for mmap-ed data and index files — the OS uses the memory not taken by the cache budget or the Go heap,
+   and reclaims it as needed. `process_resident_memory_file_bytes` shows the part of these file-backed pages
+   currently resident for the process, so it is a lower bound on the useful page cache.
+
+Before tuning and troubleshooting memory issues,
+see [Best practices](https://docs.victoriametrics.com/victoriametrics/bestpractices/#memory)
+for memory configuration guidance. Be sure that `GOMEMLIMIT` is not set,
+and that the [VPA](https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler) is not used for `vmstorage` pods.
+
+### Signals
+
+These metrics describe how a component uses memory and are visible on the
+[official Grafana dashboards for VictoriaMetrics](https://grafana.com/orgs/victoriametrics/dashboards).
+None of them means a shortage on its own — the next section reads them together.
+
+- `process_resident_memory_anon_bytes` / `vm_available_memory_bytes` — anonymous memory (caches plus Go heap)
+  as a share of the available memory. This is the unreclaimable side, without OS page cache.
+
+- `process_resident_memory_file_bytes` — the process's resident file-backed memory for mmap-ed data and index files.
+  It is reclaimable and is a lower bound on the useful page cache.
+
+- `process_pressure_memory_waiting_seconds_total`, `process_pressure_memory_stalled_seconds_total` —
+  [PSI](https://docs.kernel.org/accounting/psi.html): time tasks were stalled waiting for memory reclaim.
+  Populated only on Linux hosts with PSI support.
+
+- `vm_cache_size_bytes` and `vm_cache_size_max_bytes` (per `type`, e.g., `storage/tsid`) —
+  current and maximum size of each in-process cache. Their ratio shows how full a cache is.
+
+- `vm_slow_row_inserts_total` vs `vm_rows_added_to_storage_total` —
+  share of ingested rows that missed the `storage/tsid` cache ([slow inserts](#slow-data-ingestion)).
+
+- `increase(vm_new_timeseries_created_total[24h])` vs the number of
+  [active time series](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-an-active-time-series) —
+  the [churn rate](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-high-churn-rate):
+  new series registered over a day relative to the active set.
+
+- `process_major_pagefaults_total` — rate of pages read from disk (page-cache misses, refaults, or swap-in).
+
+- `go_memstats_heap_inuse_bytes` and the `CPU spent on GC` panel — the Go heap working set and the CPU cost of garbage collection.
+
+### Shortage patterns
+
+There are three patterns of memory shortage:
+
+1. **The cache cannot hold the active series (cache-bound shortage).** The `storage/tsid` cache is full:
+   `vm_cache_size_bytes{type="storage/tsid"}` is close to `vm_cache_size_max_bytes{type="storage/tsid"}`, and slow inserts stay high.
+   Most sustained slow inserts are caused by `storage/tsid` cache misses.
+   The main sources are new series and misses on already-known active series.
+   `indexdb`-rotation repopulation can also contribute to `vm_slow_row_inserts_total`,
+   but it usually appears as a time-bound spike around rotation, not as a sustained high slow-inserts rate.
+   If slow inserts stay above 5% of ingested rows during a stable window without restarts or rerouting
+   and cannot be roughly explained by `rate(vm_new_timeseries_created_total)` plus `rate(vm_timeseries_repopulated_total)`,
+   it points to misses on active series that no longer fit the cache.
+   See the detailed explanation in the [Slow data ingestion](#slow-data-ingestion) section.
+
+1. **The Go heap exceeds its budget (heap-bound shortage).** `go_memstats_heap_inuse_bytes` climbs well above its stable baseline.
+   Check it together with the non-cache part of anonymous memory:
+   `process_resident_memory_anon_bytes` minus the component's `vm_cache_size_bytes`.
+   There is no fixed normal value, so compare against the component's stable historical data.
+   A `process_resident_memory_anon_bytes / vm_available_memory_bytes` ratio that keeps rising leaves little headroom
+   and may lead to an [OOM kill](#out-of-memory-errors).
+   If heap growth tracks query or ingestion load, it is workload-driven.
+   If it grows regardless of load, suspect a memory leak and collect a heap profile.
+
+1. **The OS page cache is too small (I/O-bound shortage).** The rate of `process_major_pagefaults_total` rises above the component's baseline
+   and `process_resident_memory_file_bytes` shrinks or stays too small for the working set,
+   while disk reads and query latency grow.
+   The file working set (data and index parts) no longer fits within the available memory for the page cache.
+   Add memory according to
+   [capacity planning](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#capacity-planning).
+
+PSI is an additional confirmation signal: a rising rate of the `process_pressure_memory_*` counters indicates that the kernel is reclaiming memory for the cgroup.
+Where PSI is unavailable, rely on the per-pattern signals above.
+
+### How to fix
+
+Once you have confirmed a sustained shortage, as opposed to normal high memory utilization,
+use the approaches below to resolve it:
+
+- Reduce the number of [active time series](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-an-active-time-series)
+  or the [churn rate](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-high-churn-rate) —
+  see [Slow data ingestion](#slow-data-ingestion).
+- Increase available memory by scaling vertically or horizontally — see capacity planning for
+  [single-node VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#capacity-planning)
+  and the [cluster version](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#capacity-planning).
+  Scaling memory is preferred over tuning individual caches, as covered in [Out of memory errors](#out-of-memory-errors).
+- Investigate Go heap growth or a suspected memory leak — collect a memory profile using the profiling guide for
+  [single-node VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#profiling)
+  or [cluster components](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#profiling).
+
 ## Out of memory errors

 The following are the most common sources of out-of-memory (aka OOM) crashes in VictoriaMetrics:
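
Note on the Memory shortage section added in the hunk above: the cache-bound check it describes (a nearly full `storage/tsid` cache, slow inserts above 5% of ingested rows, and slow inserts not roughly explained by new plus repopulated series) can be sketched as follows. This is an illustrative Python model under stated assumptions, not a VictoriaMetrics API. The class and function names are hypothetical, and the `fullness` and `explained` cutoffs are assumptions standing in for "close to" and "roughly explained", which the guide does not quantify.

```python
from dataclasses import dataclass


@dataclass
class IngestRates:
    """Per-second rates averaged over a stable window (no restarts or rerouting)."""
    rows_added: float    # rate(vm_rows_added_to_storage_total)
    slow_inserts: float  # rate(vm_slow_row_inserts_total)
    new_series: float    # rate(vm_new_timeseries_created_total)
    repopulated: float   # rate(vm_timeseries_repopulated_total)


def slow_insert_share(r: IngestRates) -> float:
    """Share of ingested rows that missed the storage/tsid cache."""
    return r.slow_inserts / r.rows_added if r.rows_added > 0 else 0.0


def looks_cache_bound(
    r: IngestRates,
    cache_size: float,              # vm_cache_size_bytes{type="storage/tsid"}
    cache_max: float,               # vm_cache_size_max_bytes{type="storage/tsid"}
    share_threshold: float = 0.05,  # the 5% figure from the guide
    fullness: float = 0.9,          # assumption: what counts as "close to" the maximum
    explained: float = 0.8,         # assumption: what counts as "roughly explained"
) -> bool:
    cache_full = cache_max > 0 and cache_size / cache_max >= fullness
    high_share = slow_insert_share(r) > share_threshold
    # New and repopulated series legitimately cause slow inserts.
    churn_explains = (r.new_series + r.repopulated) >= explained * r.slow_inserts
    return cache_full and high_share and not churn_explains


steady = IngestRates(rows_added=100_000, slow_inserts=8_000, new_series=500, repopulated=100)
churny = IngestRates(rows_added=100_000, slow_inserts=8_000, new_series=7_800, repopulated=0)
print(looks_cache_bound(steady, cache_size=950, cache_max=1000))
print(looks_cache_bound(churny, cache_size=950, cache_max=1000))
```

In this sketch the `steady` workload is flagged because most of its slow inserts are misses on known series, while the `churny` one is not, since its slow inserts are accounted for by newly created series and point to churn rather than to an undersized cache.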
Author	SHA1	Message	Date
Pablo (Tomas) Fernandez	c19a5fd334	Update docs/victoriametrics/Troubleshooting.md Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com> Signed-off-by: Pablo (Tomas) Fernandez <46322567+TomFern@users.noreply.github.com>	2026-06-24 13:31:27 +01:00
Pablo Fernandez	fcae2ea9cd	Grammar pass	2026-06-24 13:13:39 +01:00
kirillyu	51cff76dcf	docs: address review feedback on memory shortage guide	2026-06-23 19:17:57 +02:00
Kirill Yurkov	33cf6845c1	Apply suggestion from @vrutkovs Co-authored-by: Vadim Rutkovsky <vadim@vrutkovs.eu> Signed-off-by: Kirill Yurkov <kirillyu@users.noreply.github.com>	2026-06-23 19:46:48 +03:00
Kirill Yurkov	ee5cbb5c74	Apply suggestion from @vrutkovs Co-authored-by: Vadim Rutkovsky <vadim@vrutkovs.eu> Signed-off-by: Kirill Yurkov <kirillyu@users.noreply.github.com>	2026-06-23 19:37:51 +03:00
Kirill Yurkov	d35c7a77f9	Apply suggestion from @vrutkovs Co-authored-by: Vadim Rutkovsky <vadim@vrutkovs.eu> Signed-off-by: Kirill Yurkov <kirillyu@users.noreply.github.com>	2026-06-23 19:33:23 +03:00
Kirill Yurkov	ccef171138	Apply suggestion from @vrutkovs Co-authored-by: Vadim Rutkovsky <vadim@vrutkovs.eu> Signed-off-by: Kirill Yurkov <kirillyu@users.noreply.github.com>	2026-06-23 19:30:58 +03:00
Kirill Yurkov	b79304a9a6	Apply suggestion from @vrutkovs Co-authored-by: Vadim Rutkovsky <vadim@vrutkovs.eu> Signed-off-by: Kirill Yurkov <kirillyu@users.noreply.github.com>	2026-06-23 19:28:25 +03:00
Kirill Yurkov	fc1f7c9ca0	Apply suggestion from @vrutkovs Co-authored-by: Vadim Rutkovsky <vadim@vrutkovs.eu> Signed-off-by: Kirill Yurkov <kirillyu@users.noreply.github.com>	2026-06-23 19:25:51 +03:00
kirillyu	5ae375f6af	docs: add memory shortage troubleshooting guide Add a Memory shortage section to Troubleshooting: how to tell a real shortage from normal high memory use, the signals to watch on the dashboards, the cache-bound, heap-bound and I/O-bound patterns, and how to fix each. Add a Memory section to Best practices (GOMEMLIMIT, cache tuning, VPA, headroom) and link it from the slow data ingestion notes.	2026-06-22 17:15:41 +02:00