Add Memory shortage troubleshooting section and Memory best practices

Memory accounting, signals, shortage patterns, out-of-memory detection, and fixes; plus a Memory section in Best practices.
docs/vmanomaly: fill in missing args and links (post v1.29.7 update) (#11165)
2026-06-26 20:18:05 +03:00 · 2026-06-26 12:18:21 +02:00 · 2026-06-25 10:27:18 +03:00
6 changed files with 176 additions and 33 deletions
--- a/.github/workflows/codeql-analysis-go.yml
+++ b/.github/workflows/codeql-analysis-go.yml
@@ -54,14 +54,14 @@ jobs:
          restore-keys: go-artifacts-${{ runner.os }}-codeql-analyze-${{ steps.go.outputs.go-version }}-

      - name: Initialize CodeQL
-        uses: github/codeql-action/init@8aad20d150bbac5944a9f9d289da16a4b0d87c1e  # v4.36.2
+        uses: github/codeql-action/init@e46ed2cbd01164d986452f91f178727624ae40d7  # v4.35.3
        with:
          languages: go

      - name: Autobuild
-        uses: github/codeql-action/autobuild@8aad20d150bbac5944a9f9d289da16a4b0d87c1e  # v4.36.2
+        uses: github/codeql-action/autobuild@e46ed2cbd01164d986452f91f178727624ae40d7  # v4.35.3

      - name: Perform CodeQL Analysis
-        uses: github/codeql-action/analyze@8aad20d150bbac5944a9f9d289da16a4b0d87c1e  # v4.36.2
+        uses: github/codeql-action/analyze@e46ed2cbd01164d986452f91f178727624ae40d7  # v4.35.3
        with:
          category: 'language:go'
--- a/docs/anomaly-detection/CHANGELOG.md
+++ b/docs/anomaly-detection/CHANGELOG.md
@@ -17,11 +17,11 @@ Please find the changelog for VictoriaMetrics Anomaly Detection below.
 ## v1.29.7
 Released: 2026-06-25

- UI: updated [vmanomaly UI](https://docs.victoriametrics.com/anomaly-detection/ui/) from [v1.7.1](https://docs.victoriametrics.com/anomaly-detection/ui/#v171) to [v1.7.2](https://docs.victoriametrics.com/anomaly-detection/ui/#v172), see respective [release notes](https://docs.victoriametrics.com/anomaly-detection/ui/#v172) for details.
+- UI: updated [vmanomaly UI](https://docs.victoriametrics.com/anomaly-detection/ui/) from [v1.7.1](https://docs.victoriametrics.com/anomaly-detection/ui/#v171) to [v1.7.2](https://docs.victoriametrics.com/anomaly-detection/ui/#v172), see respective [release notes](https://docs.victoriametrics.com/anomaly-detection/ui/#v172) for details. Notable additions include the `api/v1/server/model` endpoint for accessing the config and queries of production models from the UI, either manually or through the [AI assistant](https://docs.victoriametrics.com/anomaly-detection/ui/#ai-assistance).

 - IMPROVEMENT: Increased high-cardinality inference scaling by optionally scattering periodic infer jobs to reduce contention on shared resources (e.g. datasource, CPU, RAM) when `settings.n_workers > 1` and `scheduler.infer_every` is smaller than the total time to fetch and process all queries. This is controlled by new `scatter_infer_jobs` boolean argument of [Periodic Scheduler](https://docs.victoriametrics.com/anomaly-detection/components/scheduler/#parameters-1) (default: `false`).

- IMPROVEMENT: Optimized internal batching for reader post-fetch series processing, exposing reader processing queue depth, and clarifying inference skip logs after data fetch timeouts.
+- IMPROVEMENT: Optimized internal batching for reader post-fetch series processing, exposing reader processing queue depth (`vmanomaly_reader_processing_tasks_queued` [metric](https://docs.victoriametrics.com/anomaly-detection/components/monitoring/#reader-behaviour-metrics)), and clarifying inference skip logs after data fetch timeouts. See `series_processing_batch_size` argument of [VmReader](https://docs.victoriametrics.com/anomaly-detection/components/reader/#vm-reader) and [VLogsReader](https://docs.victoriametrics.com/anomaly-detection/components/reader/#victorialogs-reader) for details.

 - IMPROVEMENT: Refined `VmReader` and `VLogsReader` logging after datasource request failures by suppressing the follow-up generic "No data" or "No unseen data" warning for failed fetches. Failed requests now keep the original datasource error while empty successful responses still emit the no-data warning.

--- a/docs/anomaly-detection/components/reader.md
+++ b/docs/anomaly-detection/components/reader.md
@@ -893,6 +893,19 @@ If a path to a CA bundle file (like `ca.crt`), it will verify the certificate us
 (Optional) Password for authentication. If set, it will be used to authenticate the request.
            </td>
        </tr>
+        <tr>
+            <td>
+
+<span style="white-space: nowrap;">`series_processing_batch_size`</span>
+            </td>
+            <td>
+
+`8`
+            </td>
+            <td>
+Optional argument {{% available_from "v1.29.7" anomaly %}} that sets the number of time series to process together while preparing data for the fit or infer stages. Defaults to `8`. Suggested values are 4-16 for high-cardinality queries.
+            </td>
+        </tr>
    </tbody>
 </table>

@@ -911,6 +924,7 @@ reader:
  # tenant_id: '0:0'  # for cluster version only
  sampling_period: '1m'
  max_points_per_query: 10000
+  series_processing_batch_size: 8
  data_range: [0, 'inf']  # reader-level
  offset: '0s'  # reader-level
  timeout: '30s'
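The batching behaviour that `series_processing_batch_size` describes can be pictured with a short sketch. This is only an illustration of processing series in fixed-size groups, not vmanomaly's actual implementation: the helper name `iter_batches` is hypothetical, and the default of `8` simply mirrors the documented default.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(series: Sequence[T], batch_size: int = 8) -> Iterator[List[T]]:
    """Yield consecutive groups of at most `batch_size` series.

    Hypothetical helper: it only illustrates the idea of preparing
    several time series together instead of one by one.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(series), batch_size):
        yield list(series[start:start + batch_size])


if __name__ == "__main__":
    # 20 fetched series are prepared in three groups.
    fetched = [f"series_{i}" for i in range(20)]
    for batch in iter_batches(fetched, batch_size=8):
        print(len(batch), batch[0], "...", batch[-1])
```

A smaller batch keeps fewer series in flight at once, while a larger one reduces per-batch overhead. The suggested 4-16 range in the table above reflects that trade-off for high-cardinality queries.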
--- a/docs/victoriametrics/BestPractices.md
+++ b/docs/victoriametrics/BestPractices.md
@@ -16,9 +16,35 @@ aliases:
 ---
 ## Install Recommendation

-It is recommended to run the latest available release of VictoriaMetrics from [this page](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/latest), since it contains all the bugfixes and enhancements.
+It is recommended to run the latest available release of VictoriaMetrics from [this page](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/latest), as it includes all bug fixes and enhancements.

-There is no need to tune VictoriaMetrics because it uses reasonable defaults for command-line flags. These flags are automatically adjusted for the available CPU and RAM resources. There is no need in Operating System tuning because VictoriaMetrics is optimized for default OS settings. The only option is to increase the limit on the [number of open files in the OS](https://medium.com/@muhammadtriwibowo/set-permanently-ulimit-n-open-files-in-ubuntu-4d61064429a), so VictoriaMetrics could accept more incoming connections and could keep open more data files.
+There is no need to tune VictoriaMetrics, as it uses reasonable defaults for its command-line flags. These flags are automatically adjusted for the available CPU and RAM resources. There is no need for operating system tuning because VictoriaMetrics is optimized for default OS settings. The only recommended change is to increase the limit on the [number of open files in the OS](https://medium.com/@muhammadtriwibowo/set-permanently-ulimit-n-open-files-in-ubuntu-4d61064429a), so that VictoriaMetrics can accept more incoming connections and keep more data files open. VictoriaMetrics is developed and tested to run efficiently on these defaults, which fit the majority of workloads. Change a setting only when the docs explicitly instruct you to and explain when and why.
+
+## Memory
+
+VictoriaMetrics components detect the available memory at startup as the smaller of the host RAM and the cgroup memory limit.
+To keep them stable:
+
+1. Do not set `GOMEMLIMIT`. Set the container/cgroup memory limit instead, and VictoriaMetrics automatically
+   sizes its memory-aware limits from it. All VictoriaMetrics components ship with their own GC settings,
+   and it is recommended to keep them.
+
+1. Do not hand-tune cache sizes with `-storage.cacheSize*` flags; rely on the defaults.
+   If a component needs larger caches, move it to a host with more memory.
+   See [Cache tuning](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#cache-tuning).
+
+1. Do not autoscale `vmstorage` with the Vertical Pod Autoscaler (VPA) or the Horizontal Pod Autoscaler (HPA).
+   VPA: cache sizes are derived from the memory limit, read only once at startup.
+   Modes that recreate the pod (`Recreate`, `Auto`) reset the caches and force a cold start,
+   causing slow inserts and query latency spikes. In-place resizing is not picked up at runtime,
+   so `vmstorage` keeps the budget and `vm_available_memory_bytes` initialized at startup, which also skews the dashboards.
+   Set fixed memory requests and limits for `vmstorage` rather than autoscaling.
+   HPA: `vmstorage` is stateful. Adding nodes sends new series to them while existing data stays where it is.
+   Removing nodes makes the data on them unavailable to queries and can cause data loss without replication.
+   Frequent scaling keeps changing the routing and can degrade the cluster.
+
+1. Leave headroom for the OS page cache and workload spikes -
+   see [capacity planning](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#capacity-planning).

 ## Swap

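The memory detection rule from the new `## Memory` section can be sketched as follows. This is a minimal illustration, not VictoriaMetrics source code. The function names `available_memory` and `memory_budget` are hypothetical, a cgroup limit of `None` is assumed to mean "no limit", and the 60% default is taken from the `-memory.allowedPercent` description in the Troubleshooting.md changes.

```python
from typing import Optional

GIB = 1024 ** 3


def available_memory(host_ram_bytes: int, cgroup_limit_bytes: Optional[int]) -> int:
    """Return the smaller of the host RAM and the cgroup memory limit.

    Assumption: `None` stands for "no cgroup limit configured".
    """
    if cgroup_limit_bytes is None:
        return host_ram_bytes
    return min(host_ram_bytes, cgroup_limit_bytes)


def memory_budget(available_bytes: int, allowed_percent: float = 60.0) -> int:
    """Return the share of the available memory used as the memory budget."""
    return int(available_bytes * allowed_percent / 100)


if __name__ == "__main__":
    # A 64 GiB host running a container limited to 16 GiB.
    avail = available_memory(64 * GIB, 16 * GIB)
    print("available GiB:", avail / GIB)
    print("budget GiB:", memory_budget(avail) / GIB)
```

Because the value is computed once at startup, raising the limit afterwards changes nothing until the process restarts, which is why the section advises fixed requests and limits for `vmstorage`.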
--- a/docs/victoriametrics/FAQ.md
+++ b/docs/victoriametrics/FAQ.md
@@ -416,7 +416,7 @@ The cache size depends on the available memory for VictoriaMetrics in the host s
 then VictoriaMetrics needs to read and unpack the information from disk on every incoming sample for time series missing in the cache.
 This operation is much slower than the cache lookup, so such an insert is named a `slow insert`.
 A high percentage of slow inserts on the [official dashboard for VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#monitoring) indicates
-a memory shortage for the current number of [active time series](#what-is-an-active-time-series). Such a condition usually leads
+a [memory shortage](https://docs.victoriametrics.com/victoriametrics/troubleshooting/#memory-shortage) for the current number of [active time series](#what-is-an-active-time-series). Such a condition usually leads
 to a significant slowdown for data ingestion and to significantly increased disk IO and CPU usage.
 The solution is to add more memory or to reduce the number of [active time series](#what-is-an-active-time-series).

--- a/docs/victoriametrics/Troubleshooting.md
+++ b/docs/victoriametrics/Troubleshooting.md
@@ -188,7 +188,7 @@ If you see unexpected or unreliable query results from VictoriaMetrics, then try

 These are the most common reasons for slow data ingestion in VictoriaMetrics:

-1. Memory shortage for the given amounts of [active time series](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-an-active-time-series).
+1. [Memory shortage](https://docs.victoriametrics.com/victoriametrics/troubleshooting/#memory-shortage) for the given amounts of [active time series](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-an-active-time-series).

   VictoriaMetrics (or `vmstorage` in the cluster version of VictoriaMetrics) maintains an in-memory cache `storage/tsid`
   for a quick search for internal series IDs for each incoming metric. VictoriaMetrics automatically determines the maximum 
@@ -352,35 +352,138 @@ These are the solutions that exist for improving the performance of slow queries
  See also [this article](https://valyala.medium.com/how-to-optimize-promql-and-metricsql-queries-85a1b75bf986),
  which explains how to identify and optimize slow queries.

-## Out of memory errors
+## Memory shortage

-The following are the most common sources of out-of-memory (aka OOM) crashes in VictoriaMetrics:
+High memory utilization alone does not indicate a shortage.
+A VictoriaMetrics component can operate normally under high memory utilization,
+but it is recommended to keep [at least 50% of free memory for stability](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#capacity-planning).
+A shortage means there is not enough memory for the workload.
+It is different from high utilization and from memory pressure (the kernel's reclaim activity, shown by [PSI](https://docs.kernel.org/accounting/psi.html)).
+Use the [signals](https://docs.victoriametrics.com/victoriametrics/troubleshooting/#memory-signals) and [patterns](https://docs.victoriametrics.com/victoriametrics/troubleshooting/#memory-shortage-patterns) below to tell them apart,
+and see [how to fix memory issues](https://docs.victoriametrics.com/victoriametrics/troubleshooting/#how-to-fix-memory-issues) to resolve a shortage.

-1. Improper command-line flag values. Inspect command-line flags passed to VictoriaMetrics components.
-   If you don't clearly understand the purpose or the effect of some flags, remove them
-   from the list of flags passed to VictoriaMetrics components. Improper command-line flag values
-   may lead to increased memory and CPU usage. Increased memory usage increases the risk of OOM crashes.
-   VictoriaMetrics is optimized to run with default flag values (e.g., when they aren't explicitly set).
+VictoriaMetrics components detect the available memory at startup as the smaller of the host RAM and the cgroup memory limit,
+and expose it as `vm_available_memory_bytes`. The actual usage is `process_resident_memory_bytes`, which has two main parts:

-   For example, it isn't recommended to change cache sizes in VictoriaMetrics, as this frequently leads to OOM exceptions.
-   [These docs](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#cache-tuning) refer to command-line flags that aren't
-   recommended to tune. If you see that VictoriaMetrics needs to increase some cache sizes for the current workload,
-   then it is better to migrate to a host with more memory instead of trying to tune cache sizes manually.
+1. Go (anonymous) memory - `process_resident_memory_anon_bytes`. It includes:

-1. Unexpected heavy queries. The query is considered heavy if it needs to select and process millions of unique time series.
-   Such a query may cause an OOM exception, as VictoriaMetrics needs to keep some per-series data in memory.
-   VictoriaMetrics provides [various settings](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#resource-usage-limits)
-   that can help limit resource usage.
-   For more context, see [How to optimize PromQL and MetricsQL queries](https://valyala.medium.com/how-to-optimize-promql-and-metricsql-queries-85a1b75bf986).
-   VictoriaMetrics also provides [query tracer](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#query-tracing)
-   to help identify the source of heavy queries. Slow queries can be logged with additional details via [Query execution stats](https://docs.victoriametrics.com/victoriametrics/query-stats/). 
+   - `-memory.allowedPercent` (default 60%, or `-memory.allowedBytes`) sets a memory budget whose use differs per component:
+     - `vmstorage`: in-process caches (for example `storage/tsid` and `indexdb/file`) and in-memory data parts.
+     - `vmselect`: the rollup result cache and the per-query rollup memory.
+     - `vminsert`: in-memory row buffers held per `vmstorage` node before flushing to `vmstorage`.
+     - `vmagent`: in-memory blocks held before they are written to the persistent queue.
+     - `vmauth`: not sized by it, but bounded by `-maxConcurrentRequests` and `-requestBufferSize` instead.
+   - The Go heap, goroutine stacks and runtime overhead used for ingestion and queries.

-1. Lack of free memory for processing workload spikes. If VictoriaMetrics components use almost all the available memory
-   under the current workload, then it is recommended to migrate to a host with larger amounts of memory.
-   This would protect from possible OOM crashes on workload spikes. It is recommended to have at least 50%
-   of free memory to gracefully handle possible workload spikes.
-   See [capacity planning for single-node VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#capacity-planning)
-   and [capacity planning for the cluster version of VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#capacity-planning).
+1. OS page cache for the on-disk [data](https://docs.victoriametrics.com/victoriametrics/#storage)
+   and [indexdb](https://docs.victoriametrics.com/victoriametrics/#indexdb). The OS caches recently
+   read parts of these files in free RAM and reclaims them under memory pressure.
+   `process_resident_memory_file_bytes` shows how much of them is currently resident for the process.
+
+Before tuning and troubleshooting memory issues,
+see [Best practices](https://docs.victoriametrics.com/victoriametrics/bestpractices/#memory)
+for memory configuration guidance. Make sure that `GOMEMLIMIT` is not set,
+and that the [VPA](https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler) is not used for `vmstorage` pods.
+
+### Memory signals
+
+These metrics describe how a component uses memory. See [how to monitor VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/#monitoring)
+to set up scraping and the Grafana dashboards that show them.
+None of them means a shortage on its own; read them together in [Memory shortage patterns](https://docs.victoriametrics.com/victoriametrics/troubleshooting/#memory-shortage-patterns).
+
+- `process_resident_memory_anon_bytes` / `vm_available_memory_bytes` - anonymous memory (caches plus Go heap)
+  as a share of the available memory. The OS cannot reclaim this memory.
+
+- `process_resident_memory_file_bytes` - the OS page cache for the component's on-disk data,
+  currently resident for the process. Reclaimable by the OS.
+
+- `process_pressure_memory_waiting_seconds_total`, `process_pressure_memory_stalled_seconds_total` -
+  [PSI](https://docs.kernel.org/accounting/psi.html): time tasks were stalled waiting for memory reclaim.
+  Populated only on Linux hosts with PSI support.
+
+- `vm_cache_size_bytes / vm_cache_size_max_bytes` (per `type`, e.g., `storage/tsid`) -
+  how full each in-process cache is.
+
+- `vm_slow_row_inserts_total` / `vm_rows_added_to_storage_total` -
+  share of ingested rows that missed the `storage/tsid` cache ([slow inserts](https://docs.victoriametrics.com/victoriametrics/troubleshooting/#slow-data-ingestion)).
+
+- `increase(vm_new_timeseries_created_total[24h]) / vm_cache_entries{type="storage/hour_metric_ids"}` -
+  the [churn rate](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-high-churn-rate):
+  new series created over a day relative to the
+  [active time series](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-an-active-time-series).
+
+- `process_major_pagefaults_total` - rate of pages read from disk (page-cache misses, refaults, or swap-in).
+
+- `go_memstats_heap_inuse_bytes` and the `CPU spent on GC` panel - the Go heap working set and the CPU cost of garbage collection.
+
+#### Out of memory errors
+
+An out-of-memory (OOM) kill is the strongest sign of a memory shortage, but the process cannot report it
+because it is already dead. Detect the kill from outside the process:
+
+- Kubernetes: a container restart with reason `OOMKilled` in the pod events (`kubectl describe pod`).
+- Linux hosts: the kernel OOM killer log in `dmesg` or `journalctl`, with the `oom_score` for the killed process.
+- Container runtime logs record the same kill from the runtime side.
+
+To prevent recurrence, resolve the underlying shortage (see [How to fix memory issues](https://docs.victoriametrics.com/victoriametrics/troubleshooting/#how-to-fix-memory-issues)).
+
+### Memory shortage patterns
+
+There are three patterns of memory shortage:
+
+1. **The cache cannot hold the active series (cache-bound shortage).** The `storage/tsid` cache is full:
+   `vm_cache_size_bytes{type="storage/tsid"}` is close to `vm_cache_size_max_bytes{type="storage/tsid"}`, and slow inserts stay high.
+   Most slow inserts come from `storage/tsid` cache misses, on new series or on already-known active series.
+   If they stay above 5% of ingested rows during a stable window without restarts or rerouting,
+   and are not explained by `rate(vm_new_timeseries_created_total)`,
+   it points to misses on active series that no longer fit the cache.
+   See the detailed explanation in the [Slow data ingestion](https://docs.victoriametrics.com/victoriametrics/troubleshooting/#slow-data-ingestion) section.
+
+1. **The Go heap exceeds its budget (heap-bound shortage).** `go_memstats_heap_inuse_bytes` climbs well above its stable baseline.
+   Check it together with the non-cache part of anonymous memory:
+   `process_resident_memory_anon_bytes` minus the component's `vm_cache_size_bytes`.
+   There is no fixed normal value, so compare against the component's own stable history.
+   A `process_resident_memory_anon_bytes / vm_available_memory_bytes` ratio that keeps rising leaves little headroom
+   and may lead to an [OOM kill](https://docs.victoriametrics.com/victoriametrics/troubleshooting/#out-of-memory-errors).
+   A single heavy query can spike the heap on its own: if it has to select and process millions of unique time series,
+   VictoriaMetrics keeps some per-series data in memory while the query runs. Find it with the
+   [query tracer](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#query-tracing),
+   log slow queries with [query execution stats](https://docs.victoriametrics.com/victoriametrics/query-stats/),
+   bound it with [resource usage limits](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#resource-usage-limits),
+   and see [how to optimize PromQL and MetricsQL queries](https://valyala.medium.com/how-to-optimize-promql-and-metricsql-queries-85a1b75bf986).
+   If heap growth correlates with query or ingestion load, it is workload-driven.
+   If the heap grows regardless of load, suspect a memory leak, collect a heap profile, and [file a bug report](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/new).
+
+1. **The OS page cache is too small (I/O-bound shortage).** The main signal is a consistently high
+   `process_major_pagefaults_total` rate: the component's data and `indexdb` no longer fit the page cache,
+   so it reads them from disk and query latency grows. `process_resident_memory_file_bytes` drops as the OS
+   evicts these file pages. Swap causes the same symptoms, so keep it disabled on `vmstorage` and
+   single-node hosts (see [Swap](https://docs.victoriametrics.com/victoriametrics/bestpractices/#swap)).
+   Add memory according to
+   [capacity planning](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#capacity-planning).
+
+PSI is an additional confirmation signal: a rising `process_pressure_memory_*` value indicates that the kernel is reclaiming memory for the cgroup.
+Where PSI is unavailable, rely on the per-pattern signals above.
+
+### How to fix memory issues
+
+Once you have confirmed a shortage rather than normal high memory utilization, and it persists,
+use the approaches below to resolve it:
+
+- Reduce the number of [active time series](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-an-active-time-series)
+  or the [churn rate](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-high-churn-rate) -
+  see [Slow data ingestion](https://docs.victoriametrics.com/victoriametrics/troubleshooting/#slow-data-ingestion).
+- Add more memory by scaling vertically or horizontally - see capacity planning for
+  [single-node VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#capacity-planning)
+  and the [cluster version](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#capacity-planning).
+  Spare memory also absorbs workload spikes that would otherwise OOM a component running near its limit.
+- Remove command-line flags whose impact you do not clearly understand. Improper flags can raise
+  memory usage and lead to OOM crashes. In particular, do not change
+  [cache sizes](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#cache-tuning);
+  add more memory instead.
+- Investigate Go heap growth or a suspected memory leak - collect a memory profile using the profiling guide for
+  [single-node VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#profiling)
+  or [cluster components](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#profiling).

 ## Cluster instability
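The ratios used in the `Memory signals` and `Memory shortage patterns` sections of the Troubleshooting.md changes can be combined as in the sketch below. It is a rough illustration under stated assumptions, not part of VictoriaMetrics. The `MemorySample` class and its field names are hypothetical, the values are assumed to be already fetched from the metrics named in the comments, and the 0.9 "cache is close to full" threshold is an assumption. Only the 5% slow-insert threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass
class MemorySample:
    """Hypothetical container for values read from the metrics in the comments."""
    anon_bytes: float            # process_resident_memory_anon_bytes
    available_bytes: float       # vm_available_memory_bytes
    caches_bytes: float          # sum of vm_cache_size_bytes for the component
    tsid_cache_bytes: float      # vm_cache_size_bytes{type="storage/tsid"}
    tsid_cache_max_bytes: float  # vm_cache_size_max_bytes{type="storage/tsid"}
    slow_inserts: float          # increase of vm_slow_row_inserts_total over a window
    rows_added: float            # increase of vm_rows_added_to_storage_total over the same window


def ratio(numerator: float, denominator: float) -> float:
    """Divide safely, returning 0.0 when the denominator is not positive."""
    return numerator / denominator if denominator > 0 else 0.0


def anon_share(s: MemorySample) -> float:
    """Anonymous memory as a share of the available memory."""
    return ratio(s.anon_bytes, s.available_bytes)


def non_cache_anon_bytes(s: MemorySample) -> float:
    """Anonymous memory minus the in-process caches (the heap-bound signal)."""
    return s.anon_bytes - s.caches_bytes


def is_cache_bound(s: MemorySample,
                   cache_full_threshold: float = 0.9,     # assumption
                   slow_insert_threshold: float = 0.05    # 5% from the text
                   ) -> bool:
    """True when the storage/tsid cache is nearly full and slow inserts stay high."""
    cache_fill = ratio(s.tsid_cache_bytes, s.tsid_cache_max_bytes)
    slow_share = ratio(s.slow_inserts, s.rows_added)
    return cache_fill >= cache_full_threshold and slow_share > slow_insert_threshold


if __name__ == "__main__":
    sample = MemorySample(6e9, 10e9, 4e9, 0.95e9, 1e9, 8_000, 100_000)
    print(anon_share(sample), non_cache_anon_bytes(sample), is_cache_bound(sample))
```

As the diff notes, none of these values means a shortage on its own. The check is meaningful only over a stable window without restarts or rerouting, and after ruling out new series as the cause of the slow inserts.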
Author	SHA1	Message	Date
kirillyu	a1c3b4d267	Add Memory shortage troubleshooting section and Memory best practices Memory accounting, signals, shortage patterns, out-of-memory detection, and fixes; plus a Memory section in Best practices.	2026-06-26 12:18:21 +02:00
Fred Navruzov	50a827256a	docs/vmanomaly: fill in missing args and links (post v1.29.7 update) (#11165 ) Addition of missing links/args and slight refactor of changelog notes for clarity (post v1.29.7 update) Follow-up on `e30e8be1f4`	2026-06-25 10:27:18 +03:00