Compare commits

14 Commits

Author SHA1 Message Date
hagen1778
13bd827ea3 * add changelog lines
* use `${ds:text}` instead of hardcoded value

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2026-02-25 11:23:30 +01:00
Roman Khavronenko
96d3ee0209 app/vmselect: properly apply extra filters for tenant tokens for /api/v1/label/../values (#10503)
Previously, extra filters were ignored for
`/api/v1/label/vm_account_id/values` or
`/api/v1/label/vm_project_id/values` calls. As a result, even if a user's
visibility was limited by applying the
`?extra_filters[]={vm_account_id="1"}` param, they could get the list of
all available tenants in the system.

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>

(cherry picked from commit d2a033453e)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2026-02-25 09:06:17 +01:00
hagen1778
8c38e8dae2 app/vmalert: rename MiniMum => Minimum
Follow-up after a5811d3c3b

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2026-02-25 09:06:17 +01:00
Fedor Kanin
dd0e8d73b9 docs/vmalert: fix a typo by replacing maxiMum with maximum (#10516)
### Describe Your Changes

Fix a typo by replacing `maxiMum` with `maximum` in Markdown docs and
CLI flags help.

Resolve #10515 

### Checklist

The following checks are **mandatory**:

- [x] My change adheres to [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/victoriametrics/contributing/#pull-request-checklist).
- [x] My change adheres to [VictoriaMetrics development
goals](https://docs.victoriametrics.com/victoriametrics/goals/).
2026-02-25 09:06:17 +01:00
JAYICE
86aa9c92e6 document: enrich the description of buckets_limit (#10465)
### Describe Your Changes

fix https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10417

### Checklist

The following checks are **mandatory**:

- [x] My change adheres to [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/victoriametrics/contributing/#pull-request-checklist).
- [x] My change adheres to [VictoriaMetrics development
goals](https://docs.victoriametrics.com/victoriametrics/goals/).
2026-02-25 09:06:17 +01:00
Roman Khavronenko
6752178b68 docs: re-visit Troubleshooting docs (#10512)
* remove ToC in the beginning, as it duplicates right-bar functionality
and is easier to make a mistake with. For example, it didn't have the
ZFS section in it
* simplify wording where it was possible
* reference new tools VM got in recent releases
* re-prioritize tips order based on personal experience

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: Roman Khavronenko <hagen1778@gmail.com>
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Co-authored-by: Pablo (Tomas) Fernandez <46322567+TomFern@users.noreply.github.com>
2026-02-25 09:06:16 +01:00
Roman Khavronenko
d31accad0b dashboards: filter out zero value for Major page faults panel (#10517)
Components like vmselect and vminsert rarely touch disk, so most of the
time their values are 0. Filtering out 0 values makes the panel cleaner.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2026-02-25 09:06:16 +01:00
Artem Fetishev
82e371f8b7 lib/uint64set: move set un/marshal methods from Storage to uint64set (#10521)
A refactoring that moves the uint64set.Set marshaling and unmarshaling from lib/storage/storage.go to lib/uint64set. Also added function docs and tests.

Signed-off-by: Artem Fetishev <rtm@victoriametrics.com>
2026-02-25 09:06:16 +01:00
Zhu Jiekun
2232a7b7c8 flaky test: disable GC during sync.Pool test (#10523)
Disable GC when testing sync.Pool `Get` and `Put` logic, so the items in pool won't be recycled too fast.

Follow-up for 785daff65d.
2026-02-25 09:06:16 +01:00
Fred Navruzov
7f9f0b3040 docs/vmanomaly - strip bad chars from filenames (#10525)
### Describe Your Changes

Strip spaces and `=` from filenames as suggested in #10522 

now
```shellhelp
find ./docs |egrep '[ =]'
```
returns no such files

### Checklist

The following checks are **mandatory**:

- [x] My change adheres to [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/victoriametrics/contributing/#pull-request-checklist).
- [x] My change adheres to [VictoriaMetrics development
goals](https://docs.victoriametrics.com/victoriametrics/goals/).
2026-02-25 09:06:16 +01:00
Max Kotliar
8d17632de7 .github: Run apptests on separate pool of runners
It should prevent apptest timeouts due to runner saturation. When
apptests are run with other tests and linters they do not have enough
CPU to complete in time and often time out.

If one re-runs the apptests shortly after they are likely to pass
because the same runner has enough resources available (other job
finished).

Remove GOGC=10 as the runner has enough memory (16Gb) to run apptests.

I did some tests and observed a drop in overall test duration from 4.5m to
3.30-3m.
2026-02-25 09:06:16 +01:00
Vadim Rutkovsky
ec6b456ae5 dashboards: operator dashboard should extract version from metrics (#10502)
### Describe Your Changes

Use vm_app_version to determine operator version instead of static text

### Checklist

The following checks are **mandatory**:

- [x] My change adheres to [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/victoriametrics/contributing/#pull-request-checklist).
- [x] My change adheres to [VictoriaMetrics development
goals](https://docs.victoriametrics.com/victoriametrics/goals/).

Signed-off-by: Vadim Rutkovsky <vadim@vrutkovs.eu>
2026-02-25 09:06:16 +01:00
Roman Khavronenko
8fefa62143 docs: add dedicated opentelemetry section to docs (#10491)
The new section is supposed to contain otel related information for all
products, like VT, VM, VL.

It is also supposed to be visible to readers right away, without the need to
dig for info in each product.

It contains basic information and is supposed to act as a router to more
detailed info in each product.

While there, also updated VM-related otel info.


---------

Depends on
https://github.com/VictoriaMetrics/victoriametrics-datasource/pull/458

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2026-02-25 09:06:15 +01:00
sias32
973f673c33 dashboards/deployment: added links for vmalert
Signed-off-by: sias32 <sias.32@yandex.ru>
2026-02-22 15:27:07 +03:00
35 changed files with 718 additions and 354 deletions

View File

@@ -86,7 +86,7 @@ jobs:
- run: go version
- name: Run tests
run: GOGC=10 make ${{ matrix.scenario}}
run: make ${{ matrix.scenario}}
- name: Publish coverage
uses: codecov/codecov-action@v5
@@ -95,7 +95,7 @@ jobs:
apptest:
name: apptest
runs-on: ubuntu-latest
runs-on: apptest
steps:
- name: Code checkout

View File

@@ -31,8 +31,8 @@ var (
"0 means no limit.")
ruleUpdateEntriesLimit = flag.Int("rule.updateEntriesLimit", 20, "Defines the max number of rule's state updates stored in-memory. "+
"Rule's updates are available on rule's Details page and are used for debugging purposes. The number of stored updates can be overridden per rule via update_entries_limit param.")
resendDelay = flag.Duration("rule.resendDelay", 0, "MiniMum amount of time to wait before resending an alert to notifier.")
maxResolveDuration = flag.Duration("rule.maxResolveDuration", 0, "Limits the maxiMum duration for automatic alert expiration, "+
resendDelay = flag.Duration("rule.resendDelay", 0, "Minimum amount of time to wait before resending an alert to notifier.")
maxResolveDuration = flag.Duration("rule.maxResolveDuration", 0, "Limits the maximum duration for automatic alert expiration, "+
"which by default is 4 times evaluationInterval of the parent group")
evalDelay = flag.Duration("rule.evalDelay", 30*time.Second, "Adjustment of the 'time' parameter for rule evaluation requests to compensate intentional data delay from the datasource. "+
"Normally, should be equal to '-search.latencyOffset' (cmd-line flag configured for VictoriaMetrics single-node or vmselect). "+

View File

@@ -171,6 +171,26 @@ func TestClusterMultiTenantSelect(t *testing.T) {
t.Errorf("unexpected response (-want, +got):\n%s", diff)
}
// /api/v1/label/../value with extra_filters
wantVR := apptest.NewPrometheusAPIV1LabelValuesResponse(t,
`{"data": [
"5"
]
}`)
wantSR.Sort()
gotVR := vmselect.PrometheusAPIV1LabelValues(t, "vm_account_id", "foo", apptest.QueryOpts{
Start: "2022-05-10T08:00:00.000Z",
End: "2022-05-10T08:30:00.000Z",
ExtraFilters: []string{`{vm_account_id="5"}`},
Tenant: "multitenant",
})
gotSR.Sort()
if diff := cmp.Diff(wantVR, gotVR, cmpopts.IgnoreFields(apptest.PrometheusAPIV1LabelValuesResponse{}, "Status", "IsPartial")); diff != "" {
t.Errorf("unexpected response (-want, +got):\n%s", diff)
}
// Delete series from specific tenant
vmselect.APIV1AdminTSDBDeleteSeries(t, "foo_bar", apptest.QueryOpts{
Tenant: "5:15",

View File

@@ -506,6 +506,24 @@
"value": 200
}
]
},
{
"matcher": {
"id": "byName",
"options": "Alert"
},
"properties": [
{
"id": "links",
"value": [
{
"targetBlank": true,
"title": "Alert",
"url": "/alerting/${ds:text}/${__value.text}/find"
}
]
}
]
}
]
},
@@ -659,4 +677,4 @@
"uid": "ehXxUsGSk",
"version": 1,
"weekStart": ""
}
}

View File

@@ -91,8 +91,26 @@
"type": "row"
},
{
"datasource": {
"type": "prometheus",
"uid": "$ds"
},
"fieldConfig": {
"defaults": {},
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": 0
}
]
}
},
"overrides": []
},
"gridPos": {
@@ -103,17 +121,42 @@
},
"id": 24,
"options": {
"code": {
"language": "plaintext",
"showLineNumbers": false,
"showMiniMap": false
"colorMode": "value",
"graphMode": "none",
"justifyMode": "auto",
"orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "/^short_version$/",
"values": false
},
"content": "<div style=\"text-align: center;\">$version</div>",
"mode": "markdown"
"showPercentChange": false,
"textMode": "value",
"wideLayout": true
},
"pluginVersion": "12.3.0",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "$ds"
},
"editorMode": "code",
"exemplar": false,
"expr": "vm_app_version{job=~\"$job\",instance=~\"$instance\"}",
"format": "table",
"instant": true,
"interval": "",
"legendFormat": "{{short_version}}",
"range": false,
"refId": "A"
}
],
"title": "Version",
"type": "text"
"type": "stat"
},
{
"datasource": {

View File

@@ -5129,7 +5129,7 @@
"uid": "${ds}"
},
"editorMode": "code",
"expr": "sum(rate(process_major_pagefaults_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job)",
"expr": "sum(rate(process_major_pagefaults_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job) > 0",
"legendFormat": "__auto",
"range": true,
"refId": "A"
@@ -11153,7 +11153,7 @@
"uid": "${ds}"
},
"editorMode": "code",
"expr": "sum(rate(process_major_pagefaults_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job,instance)",
"expr": "sum(rate(process_major_pagefaults_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job,instance) > 0",
"legendFormat": "{{instance}} ({{job}})",
"range": true,
"refId": "A"

View File

@@ -5174,7 +5174,7 @@
"uid": "${ds}"
},
"editorMode": "code",
"expr": "sum(rate(process_major_pagefaults_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job)",
"expr": "sum(rate(process_major_pagefaults_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job) > 0",
"legendFormat": "__auto",
"range": true,
"refId": "A"
@@ -7667,7 +7667,7 @@
"uid": "${ds}"
},
"editorMode": "code",
"expr": "sum(rate(process_major_pagefaults_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job,instance)",
"expr": "sum(rate(process_major_pagefaults_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job,instance) > 0",
"legendFormat": "{{instance}} ({{job}})",
"range": true,
"refId": "A"

View File

@@ -92,29 +92,72 @@
"type": "row"
},
{
"datasource": {
"type": "victoriametrics-metrics-datasource",
"uid": "$ds"
},
"fieldConfig": {
"defaults": {},
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": 0
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 3,
"h": 4,
"w": 4,
"x": 0,
"y": 1
},
"id": 24,
"options": {
"code": {
"language": "plaintext",
"showLineNumbers": false,
"showMiniMap": false
"colorMode": "value",
"graphMode": "none",
"justifyMode": "auto",
"orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "/^short_version$/",
"values": false
},
"content": "<div style=\"text-align: center;\">$version</div>",
"mode": "markdown"
"showPercentChange": false,
"textMode": "value",
"wideLayout": true
},
"pluginVersion": "12.3.0",
"targets": [
{
"datasource": {
"type": "victoriametrics-metrics-datasource",
"uid": "$ds"
},
"editorMode": "code",
"exemplar": false,
"expr": "vm_app_version{job=~\"$job\",instance=~\"$instance\"}",
"format": "table",
"instant": true,
"interval": "",
"legendFormat": "{{short_version}}",
"range": false,
"refId": "A"
}
],
"title": "Version",
"type": "text"
"type": "stat"
},
{
"datasource": {

View File

@@ -5130,7 +5130,7 @@
"uid": "${ds}"
},
"editorMode": "code",
"expr": "sum(rate(process_major_pagefaults_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job)",
"expr": "sum(rate(process_major_pagefaults_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job) > 0",
"legendFormat": "__auto",
"range": true,
"refId": "A"
@@ -11154,7 +11154,7 @@
"uid": "${ds}"
},
"editorMode": "code",
"expr": "sum(rate(process_major_pagefaults_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job,instance)",
"expr": "sum(rate(process_major_pagefaults_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job,instance) > 0",
"legendFormat": "{{instance}} ({{job}})",
"range": true,
"refId": "A"

View File

@@ -5175,7 +5175,7 @@
"uid": "${ds}"
},
"editorMode": "code",
"expr": "sum(rate(process_major_pagefaults_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job)",
"expr": "sum(rate(process_major_pagefaults_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job) > 0",
"legendFormat": "__auto",
"range": true,
"refId": "A"
@@ -7668,7 +7668,7 @@
"uid": "${ds}"
},
"editorMode": "code",
"expr": "sum(rate(process_major_pagefaults_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job,instance)",
"expr": "sum(rate(process_major_pagefaults_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job,instance) > 0",
"legendFormat": "{{instance}} ({{job}})",
"range": true,
"refId": "A"

View File

@@ -27,7 +27,7 @@ groups:
labels:
severity: critical
annotations:
dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=20&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/oS7Bi_0Wz?viewPanel=20&var-instance={{ $labels.instance }}"
summary: "Instance {{ $labels.instance }} will run out of disk space in 3 days"
description: "Taking into account current ingestion rate, free disk space will be enough only
for {{ $value | humanizeDuration }} on instance {{ $labels.instance }}.\n
@@ -51,7 +51,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=20&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/oS7Bi_0Wz?viewPanel=20&var-instance={{ $labels.instance }}"
summary: "Instance {{ $labels.instance }} will become read-only in 3 days"
description: "Taking into account current ingestion rate, free disk space and -storage.minFreeDiskSpaceBytes
instance {{ $labels.instance }} will remain writable for {{ $value | humanizeDuration }}.\n
@@ -68,7 +68,7 @@ groups:
labels:
severity: critical
annotations:
dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=20&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/oS7Bi_0Wz?viewPanel=20&var-instance={{ $labels.instance }}"
summary: "Instance {{ $labels.instance }} (job={{ $labels.job }}) will run out of disk space soon"
description: "Disk utilisation on instance {{ $labels.instance }} is more than 80%.\n
Having less than 20% of free disk space could cripple merges processes and overall performance.
@@ -81,7 +81,7 @@ groups:
severity: warning
show_at: dashboard
annotations:
dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=52&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/oS7Bi_0Wz?viewPanel=52&var-instance={{ $labels.instance }}"
summary: "Too many errors served for {{ $labels.job }} path {{ $labels.path }} (instance {{ $labels.instance }})"
description: "Requests to path {{ $labels.path }} are receiving errors.
Please verify if clients are sending correct requests."
@@ -100,7 +100,7 @@ groups:
severity: warning
show_at: dashboard
annotations:
dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=44&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/oS7Bi_0Wz?viewPanel=44&var-instance={{ $labels.instance }}"
summary: "Too many RPC errors for {{ $labels.job }} (instance {{ $labels.instance }})"
description: "RPC errors are interconnection errors between cluster components.\n
Possible reasons for errors are misconfiguration, overload, network blips or unreachable components."
@@ -116,7 +116,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=102"
dashboard: "{{ $externalURL }}/d/oS7Bi_0Wz?viewPanel=102"
summary: "Churn rate is more than 10% for the last 15m"
description: "VM constantly creates new time series.\n
This effect is known as Churn Rate.\n
@@ -132,7 +132,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=102"
dashboard: "{{ $externalURL }}/d/oS7Bi_0Wz?viewPanel=102"
summary: "Too high number of new series created over last 24h"
description: "The number of created new time series over last 24h is 3x times higher than
current number of active series.\n
@@ -151,7 +151,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=108"
dashboard: "{{ $externalURL }}/d/oS7Bi_0Wz?viewPanel=108"
summary: "Percentage of slow inserts is more than 5% for the last 15m"
description: "High rate of slow inserts may be a sign of resource exhaustion
for the current load. It is likely more RAM is needed for optimal handling of the current number of active time series.
@@ -164,7 +164,7 @@ groups:
severity: warning
show_at: dashboard
annotations:
dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=139&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/oS7Bi_0Wz?viewPanel=139&var-instance={{ $labels.instance }}"
summary: "Connection between vminsert on {{ $labels.instance }} and vmstorage on {{ $labels.addr }} is saturated"
description: "The connection between vminsert (instance {{ $labels.instance }}) and vmstorage (instance {{ $labels.addr }})
is saturated by more than 90% and vminsert won't be able to keep up.\n
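
The hunks above replace a hardcoded `http://localhost:3000` prefix with the `{{ $externalURL }}` template variable. The Go sketch below shows how such an annotation could expand using the standard `text/template` package. It is an illustration under assumptions, not vmalert's implementation: the `alertData` type, the `header` constant, the `renderAnnotation` helper and the example address are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// alertData is a hypothetical stand-in for the values exposed to
// annotation templates.
type alertData struct {
	ExternalURL string
	Labels      map[string]string
}

// header binds the $externalURL and $labels variables so that the
// annotation text can reference them directly.
const header = `{{ $externalURL := .ExternalURL }}{{ $labels := .Labels }}`

// renderAnnotation expands annotation text against data.
func renderAnnotation(text string, data alertData) (string, error) {
	tpl, err := template.New("annotation").Parse(header + text)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tpl.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := renderAnnotation(
		`{{ $externalURL }}/d/oS7Bi_0Wz?viewPanel=20&var-instance={{ $labels.instance }}`,
		alertData{
			ExternalURL: "http://grafana.example:3000", // assumed address
			Labels:      map[string]string{"instance": "vmstorage-0:8482"},
		},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Templating the prefix keeps the rule files portable: the dashboard links follow whatever external address the deployment is configured with rather than pointing at localhost.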

View File

@@ -15,7 +15,7 @@ groups:
labels:
severity: critical
annotations:
dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=49&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/G7Z9GzMGz?viewPanel=49&var-instance={{ $labels.instance }}"
summary: "Instance {{ $labels.instance }} is dropping data from persistent queue"
description: "Vmagent dropped {{ $value | humanize1024 }} from persistent queue
on instance {{ $labels.instance }} for the last 10m."
@@ -26,7 +26,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=79&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/G7Z9GzMGz?viewPanel=79&var-instance={{ $labels.instance }}"
summary: "Vmagent is dropping data blocks that are rejected by remote storage"
description: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} drops the rejected by
remote-write server data blocks. Check the logs to find the reason for rejects."
@@ -37,7 +37,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=31&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/G7Z9GzMGz?viewPanel=31&var-instance={{ $labels.instance }}"
summary: "Vmagent fails to scrape one or more targets"
description: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} fails to scrape targets for last 15m"
@@ -61,7 +61,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=77&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/G7Z9GzMGz?viewPanel=77&var-instance={{ $labels.instance }}"
summary: "Vmagent responds with too many errors on data ingestion protocols"
description: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} responds with errors to write requests for last 15m."
@@ -71,7 +71,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=61&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/G7Z9GzMGz?viewPanel=61&var-instance={{ $labels.instance }}"
summary: "Job \"{{ $labels.job }}\" on instance {{ $labels.instance }} fails to push to remote storage"
description: "Vmagent fails to push data via remote write protocol to destination \"{{ $labels.url }}\"\n
Ensure that destination is up and reachable."
@@ -87,7 +87,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=84&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/G7Z9GzMGz?viewPanel=84&var-instance={{ $labels.instance }}"
summary: "Remote write connection from \"{{ $labels.job }}\" (instance {{ $labels.instance }}) to {{ $labels.url }} is saturated"
description: "The remote write connection between vmagent \"{{ $labels.job }}\" (instance {{ $labels.instance }}) and destination \"{{ $labels.url }}\"
is saturated by more than 90% and vmagent won't be able to keep up.\n
@@ -101,7 +101,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=98&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/G7Z9GzMGz?viewPanel=98&var-instance={{ $labels.instance }}"
summary: "Persistent queue writes for instance {{ $labels.instance }} are saturated"
description: "Persistent queue writes for vmagent \"{{ $labels.job }}\" (instance {{ $labels.instance }})
are saturated by more than 90% and vmagent won't be able to keep up with flushing data on disk.
@@ -113,7 +113,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=99&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/G7Z9GzMGz?viewPanel=99&var-instance={{ $labels.instance }}"
summary: "Persistent queue reads for instance {{ $labels.instance }} are saturated"
description: "Persistent queue reads for vmagent \"{{ $labels.job }}\" (instance {{ $labels.instance }})
are saturated by more than 90% and vmagent won't be able to keep up with reading data from the disk.
@@ -124,7 +124,7 @@ groups:
labels:
severity: critical
annotations:
dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=88&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/G7Z9GzMGz?viewPanel=88&var-instance={{ $labels.instance }}"
summary: "Instance {{ $labels.instance }} reached 90% of the limit"
description: "Max series limit set via -remoteWrite.maxHourlySeries flag is close to reaching the max value.
Then samples for new time series will be dropped instead of sending them to remote storage systems."
@@ -134,7 +134,7 @@ groups:
labels:
severity: critical
annotations:
dashboard: "http://localhost:3000/d/G7Z9GzMGz?viewPanel=90&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/G7Z9GzMGz?viewPanel=90&var-instance={{ $labels.instance }}"
summary: "Instance {{ $labels.instance }} reached 90% of the limit"
description: "Max series limit set via -remoteWrite.maxDailySeries flag is close to reaching the max value.
Then samples for new time series will be dropped instead of sending them to remote storage systems."

View File

@@ -23,7 +23,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/LzldHAVnz?viewPanel=13&var-instance={{ $labels.instance }}&var-file={{ $labels.file }}&var-group={{ $labels.group }}"
dashboard: "{{ $externalURL }}/d/LzldHAVnz?viewPanel=13&var-instance={{ $labels.instance }}&var-file={{ $labels.file }}&var-group={{ $labels.group }}"
summary: "Alerting rules are failing for vmalert instance {{ $labels.instance }}"
description: "Alerting rules execution is failing for \"{{ $labels.alertname }}\" from group \"{{ $labels.group }}\" in file \"{{ $labels.file }}\".
Check vmalert's logs for detailed error message."
@@ -34,7 +34,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/LzldHAVnz?viewPanel=30&var-instance={{ $labels.instance }}&var-file={{ $labels.file }}&var-group={{ $labels.group }}"
dashboard: "{{ $externalURL }}/d/LzldHAVnz?viewPanel=30&var-instance={{ $labels.instance }}&var-file={{ $labels.file }}&var-group={{ $labels.group }}"
summary: "Recording rules are failing for vmalert instance {{ $labels.instance }}"
description: "Recording rules execution is failing for \"{{ $labels.recording }}\" from group \"{{ $labels.group }}\" in file \"{{ $labels.file }}\".
Check vmalert's logs for detailed error message."
@@ -45,7 +45,7 @@ groups:
labels:
severity: info
annotations:
dashboard: "http://localhost:3000/d/LzldHAVnz?viewPanel=33&var-file={{ $labels.file }}&var-group={{ $labels.group }}"
dashboard: "{{ $externalURL }}/d/LzldHAVnz?viewPanel=33&var-file={{ $labels.file }}&var-group={{ $labels.group }}"
summary: "Recording rule {{ $labels.recording }} ({{ $labels.group }}) produces no data"
description: "Recording rule \"{{ $labels.recording }}\" from group \"{{ $labels.group }}\ in file \"{{ $labels.file }}\"
produces 0 samples over the last 30min. It might be caused by a misconfiguration

View File

@@ -11,7 +11,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/nbuo5Mr4k?viewPanel=10&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/nbuo5Mr4k?viewPanel=10&var-instance={{ $labels.instance }}"
summary: "vmauth ({{ $labels.instance }}) reached concurrent requests limit"
description: "Possible solutions: increase -maxQueueDuration flag value, increase -maxConcurrentRequests flag value,
deploy additional vmauth replicas, check requests latency at backend service.
@@ -22,7 +22,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/nbuo5Mr4k?viewPanel=10&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/nbuo5Mr4k?viewPanel=10&var-instance={{ $labels.instance }}"
summary: "vmauth ({{ $labels.instance }}) has reached concurrent requests limit for username {{ $labels.username }}"
description: "Possible solutions: increase -maxQueueDuration flag value, increase -maxConcurrentPerUserRequests flag value,
deploy additional vmauth replicas, check requests latency at backend service."
@@ -32,7 +32,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/nbuo5Mr4k?viewPanel=10&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/nbuo5Mr4k?viewPanel=10&var-instance={{ $labels.instance }}"
summary: "vmauth ({{ $labels.instance }}) has reached concurrent requests limit for unauthorized user"
description: "Possible solutions: increase -maxQueueDuration flag value, increase -maxConcurrentPerUserRequests flag value,
deploy additional vmauth replicas, check requests latency at backend service."
@@ -42,7 +42,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/nbuo5Mr4k?viewPanel=37&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/nbuo5Mr4k?viewPanel=37&var-instance={{ $labels.instance }}"
summary: "Too many errors served for unauthorized user (instance {{ $labels.instance }})"
description: "Requests from unauthorized user are receiving errors.
Please check the vmauth logs to verify that the configuration is correct and clients are sending valid requests."
@@ -52,7 +52,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/nbuo5Mr4k?viewPanel=37&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/nbuo5Mr4k?viewPanel=37&var-instance={{ $labels.instance }}"
summary: "Too many errors served for user {{ $labels.username }} (instance {{ $labels.instance }})"
description: "Requests from user {{ $labels.username }} are receiving errors.
Please check the vmauth logs to verify that the configuration is correct and clients are sending valid requests."

View File

@@ -27,7 +27,7 @@ groups:
labels:
severity: critical
annotations:
dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=53&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/wNf0q_kZk?viewPanel=53&var-instance={{ $labels.instance }}"
summary: "Instance {{ $labels.instance }} will run out of disk space soon"
description: "Taking into account current ingestion rate, free disk space will be enough only
for {{ $value | humanizeDuration }} on instance {{ $labels.instance }}.\n
@@ -51,7 +51,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/oS7Bi_0Wz?viewPanel=53&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/oS7Bi_0Wz?viewPanel=53&var-instance={{ $labels.instance }}"
summary: "Instance {{ $labels.instance }} will become read-only in 3 days"
description: "Taking into account current ingestion rate and free disk space
instance {{ $labels.instance }} is writable for {{ $value | humanizeDuration }}.\n
@@ -68,7 +68,7 @@ groups:
labels:
severity: critical
annotations:
dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=53&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/wNf0q_kZk?viewPanel=53&var-instance={{ $labels.instance }}"
summary: "Instance {{ $labels.instance }} (job={{ $labels.job }}) will run out of disk space soon"
description: "Disk utilisation on instance {{ $labels.instance }} is more than 80%.\n
Having less than 20% of free disk space could cripple merge processes and overall performance.
@@ -80,7 +80,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=35&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/wNf0q_kZk?viewPanel=35&var-instance={{ $labels.instance }}"
summary: "Too many errors served for path {{ $labels.path }} (instance {{ $labels.instance }})"
description: "Requests to path {{ $labels.path }} are receiving errors.
Please verify if clients are sending correct requests."
@@ -96,7 +96,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=66&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/wNf0q_kZk?viewPanel=66&var-instance={{ $labels.instance }}"
summary: "Churn rate is more than 10% on \"{{ $labels.instance }}\" for the last 15m"
description: "VM constantly creates new time series on \"{{ $labels.instance }}\".\n
This effect is known as Churn Rate.\n
@@ -112,7 +112,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=66&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/wNf0q_kZk?viewPanel=66&var-instance={{ $labels.instance }}"
summary: "Too high number of new series on \"{{ $labels.instance }}\" created over last 24h"
description: "The number of created new time series over last 24h is 3x times higher than
current number of active series on \"{{ $labels.instance }}\".\n
@@ -131,7 +131,7 @@ groups:
labels:
severity: warning
annotations:
dashboard: "http://localhost:3000/d/wNf0q_kZk?viewPanel=68&var-instance={{ $labels.instance }}"
dashboard: "{{ $externalURL }}/d/wNf0q_kZk?viewPanel=68&var-instance={{ $labels.instance }}"
summary: "Percentage of slow inserts is more than 5% on \"{{ $labels.instance }}\" for the last 15m"
description: "High rate of slow inserts on \"{{ $labels.instance }}\" may be a sign of resource exhaustion
for the current load. It is likely more RAM is needed for optimal handling of the current number of active time series.

View File

@@ -136,21 +136,21 @@ models:
Here's what the default (backward-compatible) behavior looks like: anomalies are tracked in `both` directions (`y > yhat` or `y < yhat`). This is useful when there is no domain expertise to filter the required direction.
![schema_detection_direction=both](schema_detection_direction=both.webp)
![schema_detection_direction=both](schema_detection_direction_both.webp)
When set to `above_expected`, anomalies are tracked only when `y > yhat`.
*Example metrics*: Error rate, response time, page load time, number of failed transactions - metrics where *lower values are better*, so **higher** values are typically tracked.
![schema_detection_direction=above_expected](schema_detection_direction=above_expected.webp)
![schema_detection_direction=above_expected](schema_detection_direction_above_expected.webp)
When set to `below_expected`, anomalies are tracked only when `y < yhat`.
*Example metrics*: Service Level Agreement (SLA) compliance, conversion rate, Customer Satisfaction Score (CSAT) - metrics where *higher values are better*, so **lower** values are typically tracked.
![schema_detection_direction=below_expected](schema_detection_direction=below_expected.webp)
![schema_detection_direction=below_expected](schema_detection_direction_below_expected.webp)
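The three direction modes described above can be sketched as a single predicate. This is a minimal illustration rather than the vmanomaly implementation: the function name `is_tracked` and its signature are hypothetical, and only the comparison rules (`y > yhat`, `y < yhat`, or both) come from the text.

```python
def is_tracked(y: float, yhat: float, direction: str = "both") -> bool:
    """Return True if the deviation of y from yhat is tracked for the given direction.

    Hypothetical helper: only the comparison rules mirror the documentation.
    """
    if direction == "both":
        # Default behavior: deviations in either direction are tracked.
        return y > yhat or y < yhat
    if direction == "above_expected":
        # E.g. error rate or response time, where higher values are worse.
        return y > yhat
    if direction == "below_expected":
        # E.g. SLA compliance or conversion rate, where lower values are worse.
        return y < yhat
    raise ValueError(f"unknown detection direction: {direction!r}")
```

With `above_expected`, a drop in error rate below the prediction is ignored, while the same drop would be tracked under `both` or `below_expected`.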
Config with a split example:
@@ -199,13 +199,13 @@ reader:
Visualizations below demonstrate this concept; the green zone defined as the `[yhat - min_dev_from_expected, yhat + min_dev_from_expected]` range excludes actual data points (`y`) from generating anomaly scores if they fall within that range.
![min_dev_from_expected-default](schema_min_dev_from_expected=0.webp)
![min_dev_from_expected-default](schema_min_dev_from_expected_0.webp)
![min_dev_from_expected-small](schema_min_dev_from_expected=1.0.webp)
![min_dev_from_expected-small](schema_min_dev_from_expected_1_0.webp)
![min_dev_from_expected-big](schema_min_dev_from_expected=5.0.webp)
![min_dev_from_expected-big](schema_min_dev_from_expected_5_0.webp)
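The green zone shown in the visualizations can be expressed as a simple range check. The sketch below is an assumption-laden illustration, not vmanomaly code: the function name `is_suppressed` is hypothetical, and treating the range boundaries as inclusive follows the closed-interval notation used in the text.

```python
def is_suppressed(y: float, yhat: float, min_dev_from_expected: float = 0.0) -> bool:
    """Return True if y falls inside [yhat - d, yhat + d], where d is min_dev_from_expected.

    Points inside this range do not generate anomaly scores.
    Hypothetical helper; inclusive boundaries are an assumption.
    """
    return abs(y - yhat) <= min_dev_from_expected
```

For `yhat = 10.0`, a data point `y = 12.0` falls outside the zone when `min_dev_from_expected=1.0` and inside it when `min_dev_from_expected=5.0`.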
Example config of how to use this param based on query results:

View File

@@ -0,0 +1,91 @@
VictoriaMetrics software provides native [OpenTelemetry](https://opentelemetry.io/) ingestion across **metrics**, **logs**, and **traces** via dedicated components.
This allows running an OpenTelemetry-based observability pipeline with VictoriaMetrics software as your backend.
VictoriaMetrics provides a dedicated database for each [signal type](https://opentelemetry.io/docs/concepts/signals/):
- [VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/) for [Metrics](https://opentelemetry.io/docs/concepts/signals/metrics/);
- [VictoriaLogs](https://docs.victoriametrics.com/victorialogs/) for [Logs](https://opentelemetry.io/docs/concepts/signals/logs/);
- [VictoriaTraces](https://docs.victoriametrics.com/victoriatraces/) for [Traces](https://opentelemetry.io/docs/concepts/signals/traces/).
![README.webp](README.webp)
{width="700"}
Each database is optimized for its own signal and usage scenario to improve maintainability and efficiency.
Resources:
* [OpenTelemetry Astronomy Shop demo](https://github.com/VictoriaMetrics-Community/opentelemetry-demo) with integrated VictoriaMetrics backends.
* Live [Grafana Playground](https://play-grafana.victoriametrics.com/) with OTel demo and VictoriaMetrics components.
* [Full-Stack Observability with VictoriaMetrics in the OTel Demo](https://victoriametrics.com/blog/victoriametrics-full-stack-observability-otel-demo/) blogpost.
---
## Metrics (VictoriaMetrics)
VictoriaMetrics single-node, vmagent and vminsert components support ingestion of metrics via OpenTelemetry Protocol (OTLP)
from [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) and applications instrumented with [OpenTelemetry SDKs](https://opentelemetry.io/docs/languages/).
See the detailed description about protocol support [here](https://docs.victoriametrics.com/victoriametrics/#sending-data-via-opentelemetry).
> See a practical guide [How to use OpenTelemetry metrics with VictoriaMetrics](https://docs.victoriametrics.com/guides/getting-started-with-opentelemetry/).
Once metrics are ingested into VictoriaMetrics, they can be read via the following tools:
1. [vmui](https://docs.victoriametrics.com/victoriametrics/#vmui) - VictoriaMetrics User Interface for ad-hoc queries
and data exploration.
1. [Grafana](https://docs.victoriametrics.com/victoriametrics/integrations/grafana/) - integrates with VictoriaMetrics
using [Prometheus datasource](https://grafana.com/docs/grafana/latest/datasources/prometheus/)
or [VictoriaMetrics datasource](https://grafana.com/grafana/plugins/victoriametrics-metrics-datasource/) plugins.
1. [Perses](https://docs.victoriametrics.com/victoriametrics/integrations/perses/) - integrates with VictoriaMetrics
via [Prometheus plugins](https://perses.dev/plugins/docs/prometheus/).
1. [vmalert](https://docs.victoriametrics.com/victoriametrics/vmalert/) - is an alerting tool for VictoriaMetrics.
It executes a list of the given alerting or recording rules and sends notifications to Alertmanager.
## Logs (VictoriaLogs)
VictoriaLogs single-node, vlagent and vlinsert components support ingestion of logs via OpenTelemetry Protocol (OTLP) from [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/)
and applications instrumented with [OpenTelemetry SDKs](https://opentelemetry.io/docs/languages/).
See the detailed description about protocol support [here](https://docs.victoriametrics.com/victorialogs/data-ingestion/opentelemetry/).
> See a practical guide
[How to use OpenTelemetry metrics with VictoriaLogs](https://docs.victoriametrics.com/guides/getting-started-with-opentelemetry/).
Once logs are ingested into VictoriaLogs, they can be read via the following tools:
1. [vmui](https://docs.victoriametrics.com/victorialogs/querying/#web-ui) - VictoriaLogs User Interface for ad-hoc queries
and data exploration.
1. [Grafana](https://docs.victoriametrics.com/victorialogs/integrations/grafana/) - integrates with VictoriaLogs
using [VictoriaLogs datasource](https://grafana.com/grafana/plugins/victoriametrics-logs-datasource/) plugin.
1. [Perses](https://docs.victoriametrics.com/victorialogs/integrations/perses/) - integrates with VictoriaLogs
via [VictoriaLogs plugins](https://perses.dev/plugins/docs/victorialogs/).
1. [vmalert](https://docs.victoriametrics.com/victorialogs/vmalert/) - is an alerting tool for VictoriaLogs.
It executes a list of the given alerting rules and sends notifications to Alertmanager. It can convert LogsQL queries
into metrics via recording rules and persist them into VictoriaMetrics.
## Traces (VictoriaTraces)
VictoriaTraces single-node and vtinsert components support ingestion of traces via OpenTelemetry Protocol (OTLP) from [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/)
and applications instrumented with [OpenTelemetry SDKs](https://opentelemetry.io/docs/languages/).
See the detailed description about protocol support [here](https://docs.victoriametrics.com/victoriatraces/data-ingestion/opentelemetry/).
Once traces are ingested into VictoriaTraces, they can be read via the following tools:
1. [Grafana](https://docs.victoriametrics.com/victorialogs/integrations/grafana/) - integrates with VictoriaTraces
using [Jaeger datasource](https://grafana.com/docs/grafana/latest/datasources/jaeger/) plugin.
1. [Jaeger frontend](https://www.jaegertracing.io/docs/2.6/deployment/frontend-ui/) - integrates with VictoriaTraces
via [Jaeger Query Service JSON APIs](https://www.jaegertracing.io/docs/2.6/apis/#internal-http-json).
1. [vmalert](https://docs.victoriametrics.com/victoriatraces/vmalert/) - is an alerting tool for VictoriaTraces.
It executes a list of the given alerting rules and sends notifications to Alertmanager. It can convert LogsQL queries
into metrics via recording rules and persist them into VictoriaMetrics.
## Correlations
Signals can be correlated together if they share the same list of attributes, so they can uniquely identify the
same system or event. The recommended user interface for correlations is Grafana thanks to its [correlation interfaces](https://grafana.com/docs/grafana/latest/administration/correlations/).
See below various scenarios of correlating signals in Grafana using VictoriaMetrics, VictoriaLogs and VictoriaTraces as backends.
Depending on the Grafana datasource plugin there could be multiple correlations available:
1. Trace to logs, log to trace, log to metrics - see [correlations via VictoriaLogs plugin](https://docs.victoriametrics.com/victorialogs/integrations/grafana/#correlations).
1. Trace to metrics, metric to logs, metric to traces - see [correlations via VictoriaMetrics plugin](https://docs.victoriametrics.com/victoriametrics/integrations/grafana/datasource/#correlations).
1. Metrics to logs or traces correlations are possible via Prometheus datasource as well.
1. Plugins Tempo, Jaeger, and Zipkin can correlate with logs or metrics using [Trace to logs](https://grafana.com/docs/grafana/latest/explore/trace-integration/#trace-to-logs)
and [Trace to metrics](https://grafana.com/docs/grafana/latest/visualizations/explore/trace-integration/#trace-to-metrics) features.

Binary file not shown.

View File

@@ -0,0 +1,14 @@
---
title: OpenTelemetry
weight: 60
menu:
docs:
weight: 60
identifier: opentelemetry
tags:
- metrics
- logs
- traces
- otel
---
{{% content "README.md" %}}

View File

@@ -1227,7 +1227,10 @@ Metric names are stripped from the resulting series. Add [keep_metric_names](#ke
#### buckets_limit
`buckets_limit(limit, buckets)` is a [transform function](#transform-functions), which limits the number
of [histogram buckets](https://valyala.medium.com/improving-histogram-usability-for-prometheus-and-grafana-bc7e5df0e350) to the given `limit`.
of [histogram buckets](https://valyala.medium.com/improving-histogram-usability-for-prometheus-and-grafana-bc7e5df0e350) to the given `limit`.
The result will preserve the first and the last bucket to improve accuracy for min and max values.
So, if the `limit` is greater than 0 and less than 3, the function will still return 3 buckets: the first bucket, the last bucket, and a selected bucket.
See also [prometheus_buckets](#prometheus_buckets) and [histogram_quantile](#histogram_quantile).

View File

@@ -1033,18 +1033,10 @@ VictoriaMetrics also may scrape Prometheus targets - see [these docs](#how-to-sc
### Sending data via OpenTelemetry
VictoriaMetrics supports data ingestion via [OpenTelemetry protocol for metrics](https://github.com/open-telemetry/opentelemetry-specification/blob/ffddc289462dfe0c2041e3ca42a7b1df805706de/specification/metrics/data-model.md) at `/opentelemetry/v1/metrics` path.
VictoriaMetrics supports data ingestion via [OpenTelemetry protocol (OTLP) for metrics](https://github.com/open-telemetry/opentelemetry-specification/blob/ffddc289462dfe0c2041e3ca42a7b1df805706de/specification/metrics/data-model.md) at `/opentelemetry/v1/metrics` path.
It expects `protobuf`-encoded requests at `/opentelemetry/v1/metrics`. For gzip-compressed workload set HTTP request header `Content-Encoding: gzip`.
VictoriaMetrics expects `protobuf`-encoded requests at `/opentelemetry/v1/metrics`.
Set HTTP request header `Content-Encoding: gzip` when sending gzip-compressed data to `/opentelemetry/v1/metrics`.
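A gzip-compressed request to this endpoint could be assembled as follows. This is a hedged sketch: the helper name `build_otlp_request` is hypothetical, the host `victoriametrics:8428` is an example address, the payload is placeholder bytes standing in for a real protobuf-encoded OTLP message, and the `application/x-protobuf` content type is assumed from the OTLP/HTTP convention. The request is only built here, not sent.

```python
import gzip
import urllib.request


def build_otlp_request(base_url: str, protobuf_payload: bytes) -> urllib.request.Request:
    """Build (but do not send) a gzip-compressed POST to /opentelemetry/v1/metrics."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/opentelemetry/v1/metrics",
        data=gzip.compress(protobuf_payload),
        method="POST",
        headers={
            # Assumed content type for protobuf-encoded OTLP over HTTP.
            "Content-Type": "application/x-protobuf",
            # Tells the server that the body is gzip-compressed.
            "Content-Encoding": "gzip",
        },
    )


# Placeholder bytes: a real client would pass a serialized OTLP metrics message.
req = build_otlp_request("http://victoriametrics:8428", b"placeholder-protobuf-bytes")
```

Passing `req` to `urllib.request.urlopen` would send it. In practice, the OpenTelemetry collector or SDK exporters handle encoding and compression themselves.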
VictoriaMetrics stores the ingested OpenTelemetry [raw samples](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#raw-samples) as is without any transformations.
Pass `-opentelemetry.usePrometheusNaming` command-line flag to VictoriaMetrics for automatic conversion of metric names and labels into Prometheus-compatible format.
Pass `-opentelemetry.convertMetricNamesToPrometheus` command-line flag to VictoriaMetrics for applying Prometheus-compatible format conversion only for metrics names.
OpenTelemetry [exponential histogram](https://opentelemetry.io/docs/specs/otel/metrics/data-model/#exponentialhistogram) is automatically converted
to [VictoriaMetrics histogram format](https://valyala.medium.com/improving-histogram-usability-for-prometheus-and-grafana-bc7e5df0e350).
Using the following exporter configuration in the OpenTelemetry collector will allow you to send metrics into VictoriaMetrics:
Use the following OpenTelemetry collector exporter configuration to push metrics to VictoriaMetrics:
```yaml
exporters:
@@ -1057,7 +1049,7 @@ exporters:
> Note, [cluster version of VM](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#url-format) expects specifying tenant ID, i.e. `http://<vminsert>:<port>/insert/<accountID>/opentelemetry`.
> See more about [multitenancy](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#multitenancy).
Remember to add the exporter to the desired service pipeline in order to activate the exporter.
Remember to add the exporter to the desired service pipeline to activate the exporter.
```yaml
service:
@@ -1069,7 +1061,22 @@ service:
- otlp
```
By default, VictoriaMetrics stores the ingested OpenTelemetry [metric samples](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#raw-samples) as is **without any transformations**.
The following label transformations can be enabled:
* `--usePromCompatibleNaming` - replaces characters unsupported by Prometheus with `_` in metric names and labels **for all ingestion protocols**.
For example, `process.cpu.time{service.name="foo"}` is converted to `process_cpu_time{service_name="foo"}`.
* `--opentelemetry.usePrometheusNaming` - converts metric names and labels according to [OTLP Metric points to Prometheus specification](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.33.0/specification/compatibility/prometheus_and_openmetrics.md#otlp-metric-points-to-prometheus) for metrics ingested via OTLP.
For example, `process.cpu.time{service.name="foo"}` is converted to `process_cpu_time_seconds_total{service_name="foo"}`.
* `-opentelemetry.convertMetricNamesToPrometheus` - converts **only metric names** according to [OTLP Metric points to Prometheus specification](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.33.0/specification/compatibility/prometheus_and_openmetrics.md#otlp-metric-points-to-prometheus) for metrics ingested via OTLP.
For example, `process.cpu.time{service.name="foo"}` is converted to `process_cpu_time_seconds_total{service.name="foo"}`. See more about this use case [here](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/9830).
> These flags can be applied on vmagent, vminsert or VictoriaMetrics single-node.
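The character replacement performed by the first flag can be approximated as below. This is a simplified sketch, not the VictoriaMetrics implementation: the function name `to_prom_compatible` is hypothetical, and the allowed character sets (`[a-zA-Z0-9_:]` for metric names, `[a-zA-Z0-9_]` for label names) are assumed from the Prometheus data model. It does not add unit or `_total` suffixes, which the OpenTelemetry-specific flags do.

```python
import re

# Assumed character sets, taken from the Prometheus data model.
_BAD_METRIC_CHARS = re.compile(r"[^a-zA-Z0-9_:]")
_BAD_LABEL_CHARS = re.compile(r"[^a-zA-Z0-9_]")


def to_prom_compatible(metric: str, labels: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Replace characters unsupported by Prometheus with '_' in metric and label names.

    Label values are left untouched. Hypothetical helper for illustration only.
    """
    new_metric = _BAD_METRIC_CHARS.sub("_", metric)
    new_labels = {_BAD_LABEL_CHARS.sub("_", name): value for name, value in labels.items()}
    return new_metric, new_labels


result = to_prom_compatible("process.cpu.time", {"service.name": "foo"})
# result == ("process_cpu_time", {"service_name": "foo"})
```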
OpenTelemetry [exponential histogram](https://opentelemetry.io/docs/specs/otel/metrics/data-model/#exponentialhistogram) is automatically converted
to [VictoriaMetrics histogram format](https://valyala.medium.com/improving-histogram-usability-for-prometheus-and-grafana-bc7e5df0e350).
See [How to use OpenTelemetry metrics with VictoriaMetrics](https://docs.victoriametrics.com/guides/getting-started-with-opentelemetry/).
See more about [OpenTelemetry in VictoriaMetrics](https://docs.victoriametrics.com/opentelemetry/).
## JSON line format

View File

@@ -12,96 +12,78 @@ aliases:
- /troubleshooting/index.html
- /troubleshooting/
---
This document contains troubleshooting guides for the most common issues when working with VictoriaMetrics:
- [General troubleshooting checklist](#general-troubleshooting-checklist)
- [Unexpected query results](#unexpected-query-results)
- [Slow data ingestion](#slow-data-ingestion)
- [Slow queries](#slow-queries)
- [Out of memory errors](#out-of-memory-errors)
- [Cluster instability](#cluster-instability)
- [Too much disk space used](#too-much-disk-space-used)
- [Monitoring](#monitoring)
This document contains troubleshooting guides for the most common issues when working with VictoriaMetrics.
## General troubleshooting checklist
If you hit some issue or have some question about VictoriaMetrics components,
then please follow these steps in order to quickly find the solution:
If you encounter an issue or have a question about VictoriaMetrics components, follow these steps to quickly find a solution:
1. Check the version of VictoriaMetrics component, you are troubleshooting and compare
it to [the latest available version](https://docs.victoriametrics.com/victoriametrics/changelog/).
If the used version is lower than the latest available version, then there are high chances
that the issue is already resolved in newer versions. Carefully read [the changelog](https://docs.victoriametrics.com/victoriametrics/changelog/)
between your version and the latest version and check whether the issue is already fixed there.
1. Check the version of the VictoriaMetrics component you are troubleshooting and compare
it with [the latest available version](https://docs.victoriametrics.com/victoriametrics/changelog/).
If the issue is already fixed in newer versions, then upgrade to the newer version and verify whether the issue is fixed:
If you are running an older version, the issue may already be fixed. Review the [changelog](https://docs.victoriametrics.com/victoriametrics/changelog/)
for all releases between your version and the latest release to see whether the problem has been resolved.
If the issue is fixed in a newer release, upgrade and verify that the problem no longer occurs:
- [How to upgrade single-node VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#how-to-upgrade-victoriametrics)
- [How to upgrade VictoriaMetrics cluster](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#updating--reconfiguring-cluster-nodes)
Upgrade procedure for other VictoriaMetrics components is as simple as gracefully stopping the component
by sending `SIGINT` signal to it and starting the new version of the component.
The upgrade procedure for other VictoriaMetrics components is as simple as gracefully stopping the component
by sending it a `SIGINT` signal and starting the new version of the component.
There may be breaking changes between different versions of VictoriaMetrics components in rare cases.
These cases are documented in [the changelog](https://docs.victoriametrics.com/victoriametrics/changelog/).
So please read the changelog before the upgrade.
In rare cases, upgrades may include breaking changes. These cases are documented in the [changelog](https://docs.victoriametrics.com/victoriametrics/changelog/).
Pay special attention to the **Update notes** near the top of the changelog, as they point out any special actions or considerations to take when upgrading.
1. Inspect command-line flags passed to VictoriaMetrics components and remove flags that have unclear outcomes for your workload.
VictoriaMetrics components are designed to work optimally with the default command-line flag values (e.g. when these flags aren't set explicitly).
It is recommended to remove flags with unclear outcomes, since they may result in unexpected issues.
1. Review command-line flags passed to VictoriaMetrics components and remove any flags whose impact on your workload is unclear.
VictoriaMetrics components are optimized to work well with default settings (that is, when flags aren't explicitly set).
Unnecessary or poorly understood flags can lead to unexpected behavior, so it's best to remove them unless you clearly understand why they are needed.
1. Check for logs in VictoriaMetrics components. They may contain useful information about cause of the issue
and how to fix the issue. If the log message doesn't have enough useful information for troubleshooting,
then search the log message in Google. There are high chances that the issue is already reported
somewhere (docs, StackOverflow, Github issues, etc.) and the solution is already documented there.
1. Check logs. They often contain useful details about the root cause and possible fixes.
1. If VictoriaMetrics logs have no relevant information, then try searching for the issue in Google
via multiple keywords and phrases specific to the issue. There are high chances that the issue
and the solution is already documented somewhere.
If the logs don't provide enough information, try searching the error message on Google. In many cases, the issue has
already been discussed (in documentation, on Stack Overflow, or in GitHub issues), and a solution may already be available.
1. Try searching for the issue at [VictoriaMetrics GitHub](https://github.com/VictoriaMetrics/VictoriaMetrics/issues).
The signal/noise quality of search results here is much lower than in Google, but sometimes it may help
finding the relevant information about the issue when Google fails to find the needed information.
If you located the relevant GitHub issue, but it misses some information on how to diagnose or troubleshoot it,
then please provide this information in comments to the issue. This increases chances that it will be resolved soon.
1. If VictoriaMetrics logs do not have relevant information, then try searching for the issue on Google
using multiple keywords and phrases specific to the issue. In many cases, both the issue and its solution are already documented.
1. Try searching for information about the issue in [VictoriaMetrics source code](https://github.com/search?q=repo%3AVictoriaMetrics%2FVictoriaMetrics&type=code).
GitHub code search may be not very good in some cases, so it is recommended [checking out VictoriaMetrics source code](https://github.com/VictoriaMetrics/VictoriaMetrics/)
and perform local search in the checked out code.
Note that the source code for VictoriaMetrics cluster is located in [the cluster](https://github.com/VictoriaMetrics/VictoriaMetrics/tree/cluster) branch.
1. Try searching for the issue in [VictoriaMetrics GitHub](https://github.com/VictoriaMetrics/VictoriaMetrics/issues).
The signal-to-noise ratio of search results here is much lower than on Google, but sometimes it can help
find relevant information when Google fails.
If you located the relevant GitHub issue, but it lacks details for diagnosis or troubleshooting,
then please add them in the issue comments. This increases the chance that it will be resolved soon.
1. Try searching for information about the issue in the history of [VictoriaMetrics Slack chat](https://victoriametrics.slack.com).
There are non-zero chances that somebody already stuck with the same issue and documented the solution at Slack.
1. Try searching for information about the issue in the [VictoriaMetrics source code](https://github.com/search?q=repo%3AVictoriaMetrics%2FVictoriaMetrics&type=code).
GitHub code search may not be very effective in some cases, so it is recommended [to check out the VictoriaMetrics source code](https://github.com/VictoriaMetrics/VictoriaMetrics/)
and perform a local search in the code.
Note that the source code for the VictoriaMetrics cluster is located in [the cluster](https://github.com/VictoriaMetrics/VictoriaMetrics/tree/cluster) branch.
1. If steps above didn't help finding the solution to the issue, then please [file a new issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/new/choose)
by providing the maximum details on how to reproduce the issue.
1. If the steps above didn't help to find the solution to the issue, then please [file a new issue](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/new/choose)
with as many details as possible on how to reproduce it.
After that you can post the link to the issue to [VictoriaMetrics Slack chat](https://victoriametrics.slack.com),
so VictoriaMetrics community could help finding the solution to the issue. It is better filing the issue at VictoriaMetrics GitHub
before posting your question to VictoriaMetrics Slack chat, since GitHub issues are indexed by Google,
while Slack messages aren't indexed by Google. This simplifies searching for the solution to the issue for future VictoriaMetrics users.
After that you can post the link to the issue in the [VictoriaMetrics Slack chat](https://victoriametrics.slack.com),
so the VictoriaMetrics community can help find a solution. It is better to file the issue on VictoriaMetrics GitHub
before posting your question to the VictoriaMetrics Slack chat, since GitHub issues are indexed by Google,
while Slack messages are not. This simplifies finding a solution to the issue for future VictoriaMetrics users.
1. Pro tip 1: if you see that [VictoriaMetrics docs](https://docs.victoriametrics.com/victoriametrics/) contain incomplete or incorrect information,
then please create a pull request with the relevant changes. This will help VictoriaMetrics community.
All the docs published at `https://docs.victoriametrics.com` are located in the [docs](https://github.com/VictoriaMetrics/VictoriaMetrics/tree/master/docs)
folder inside VictoriaMetrics repository.
then please create a pull request with the relevant changes or a new issue explaining the problem. This will help the VictoriaMetrics community.
1. Pro tip 2: please provide links to existing docs / GitHub issues / StackOverflow questions
instead of copy-n-pasting the information from these sources when asking or answering questions
from VictoriaMetrics community. If the linked resources have no enough information,
then it is better posting the missing information in the web resource before providing links
to this information in Slack chat. This will simplify searching for this information in the future
instead of copying and pasting the information from these sources when asking or answering questions
to the VictoriaMetrics community. If the linked resources do not have enough information,
then it is better to add the missing information to the original web resource before linking to it in Slack chat. This will simplify searching for this information in the future
for VictoriaMetrics users via Google and [Perplexity](https://www.perplexity.ai/).
1. Pro tip 3: if you are answering somebody's question about VictoriaMetrics components
at GitHub issues / Slack chat / StackOverflow, then the best answer is a direct link to the information
regarding the question.
in GitHub issues / Slack chat / StackOverflow, then the best answer is a direct link to the information
with the answer or solution to the question.
The better answer is a concise message with multiple links to the relevant information.
The worst answer is a message with misleading or completely wrong information.
1. Pro tip 4: if you can fix the issue on yourself, then please do it and provide the corresponding pull request!
We are glad to get pull requests from VictoriaMetrics community.
1. Pro tip 4: If you can fix the issue on your own, then please do it and provide the corresponding pull request!
We are happy to get pull requests from the VictoriaMetrics community.
## Unexpected query results
@@ -111,150 +93,150 @@ If you see unexpected or unreliable query results from VictoriaMetrics, then try
`sum(rate(http_requests_total[5m])) by (job)`, then check whether the following queries return
expected results:
- Remove the outer `sum` and execute `rate(http_requests_total[5m])`,
since aggregations could hide some missing series, gaps in data or anomalies in existing series.
If this query returns too many time series, then try adding more specific label filters to it.
- Remove the outer `sum` and execute `rate(http_requests_total[5m])`.
Aggregations could hide missing series, data gaps, or anomalies.
- If the query returns too many series, try adding more specific label filters.
For example, if you see that the original query returns unexpected results for the `job="foo"`,
then use `rate(http_requests_total{job="foo"}[5m])` query.
If this isn't enough, then continue adding more specific label filters, so the resulting query returns
manageable number of time series.
then use the `rate(http_requests_total{job="foo"}[5m])` query.
Continue adding more specific label filters until the resulting query returns a manageable number of time series.
- Remove the outer `rate` and execute `http_requests_total`. Additional label filters may be added here in order
to reduce the number of returned series.
- Remove the outer `rate` and execute `http_requests_total`. Add label filters to reduce the number of returned series
if needed.
Sometimes the query may be improperly constructed, so it returns unexpected results.
It is recommended reading and understanding [MetricsQL docs](https://docs.victoriametrics.com/victoriametrics/metricsql/),
Sometimes the query may be improperly constructed, leading to unexpected results.
It is recommended to read and understand [MetricsQL docs](https://docs.victoriametrics.com/victoriametrics/metricsql/),
especially [subqueries](https://docs.victoriametrics.com/victoriametrics/metricsql/#subqueries)
and [rollup functions](https://docs.victoriametrics.com/victoriametrics/metricsql/#rollup-functions) sections.
1. If the simplest query continues returning unexpected / unreliable results, then try verifying correctness
of raw unprocessed samples for this query via [/api/v1/export](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#how-to-export-data-in-json-line-format)
on the given `[start..end]` time range and check whether they are expected:
of raw unprocessed samples in [vmui](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#vmui) via the `Raw Query` tab.
Responses returned from [/api/v1/query](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#instant-query)
and [/api/v1/query_range](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#range-query) contain **evaluated** data
instead of stored raw samples. In some cases, [staleness](https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness),
[deduplication](https://docs.victoriametrics.com/victoriametrics/#deduplication), or irregular scrapes can affect evaluations.
See [this short video](https://www.youtube.com/watch?v=7AyVCC6uKfI) for details.
Download the raw data via the `Export` button in vmui's `Raw Query` tab or via an [/api/v1/export](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#how-to-export-data-in-json-line-format)
query on the given `[start..end]` time range, and check whether the samples are as expected:
```sh
single-node: curl http://victoriametrics:8428/api/v1/export -d 'match[]=http_requests_total' -d 'start=...' -d 'end=...' -d 'reduce_mem_usage=1'
cluster: curl http://<vmselect>:8481/select/<tenantID>/prometheus/api/v1/export -d 'match[]=http_requests_total' -d 'start=...' -d 'end=...' -d 'reduce_mem_usage=1'
```
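The two URL shapes above differ only in the path prefix, which a small helper can make explicit. This is an illustrative sketch: the function name `export_url` is hypothetical, the example hosts and timestamps are placeholders, and passing the parameters in the query string (instead of as form data, as `curl -d` does) is an assumption.

```python
from typing import Optional
from urllib.parse import urlencode


def export_url(base_url: str, match: str, start: str, end: str,
               tenant_id: Optional[str] = None) -> str:
    """Build an /api/v1/export URL for single-node (no tenant) or cluster (with tenant)."""
    if tenant_id is None:
        path = "/api/v1/export"
    else:
        # The cluster version expects the tenant ID in the vmselect path.
        path = f"/select/{tenant_id}/prometheus/api/v1/export"
    query = urlencode({
        "match[]": match,
        "start": start,
        "end": end,
        "reduce_mem_usage": "1",
    })
    return f"{base_url.rstrip('/')}{path}?{query}"


single = export_url("http://victoriametrics:8428", "http_requests_total",
                    "1700000000", "1700003600")
cluster = export_url("http://vmselect:8481", "http_requests_total",
                     "1700000000", "1700003600", tenant_id="0")
```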
When raising a GitHub ticket about query issues, please also attach the raw data, so maintainers can reproduce your case locally.
Note that responses returned from [/api/v1/query](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#instant-query)
and from [/api/v1/query_range](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#range-query) contain **evaluated** data
instead of raw samples stored in VictoriaMetrics. See [these docs](https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness)
for details. The raw samples can be also viewed in [vmui](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#vmui) in `Raw Query` tab and shared via `export` button.
1. Try executing the query with [tracer](https://docs.victoriametrics.com/victoriametrics/#query-tracing) enabled. The trace
contains a lot of additional information about query execution, series matching, caches, and internal modifications.
When raising a GitHub ticket about query issues, please also attach the trace so maintainers can investigate.
If you migrate from InfluxDB, then pass `-search.setLookbackToStep` command-line flag to single-node VictoriaMetrics
or to `vmselect` in VictoriaMetrics cluster. See also [how to migrate from InfluxDB to VictoriaMetrics](https://docs.victoriametrics.com/guides/migrate-from-influx/).
1. If you observe gaps when plotting series, it is likely caused by irregular intervals for metrics collection (network delays
or targets unavailability during scrapes, irregular pushes, irregular timestamps).
VictoriaMetrics automatically [fills the gaps](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#range-query)
based on the median interval between [data samples](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#raw-samples).
This may yield incorrect results for irregular data, as the median will be skewed. In this case, it is recommended to either fix the
irregularities or switch to the static interval for gaps filling by setting `-search.minStalenessInterval=5m` command-line flag (`5m` is
used by Prometheus by default).
1. Sometimes response caching may lead to unexpected results when samples with older timestamps
1. Sometimes, response caching may lead to unexpected results when samples with older timestamps
are ingested into VictoriaMetrics (aka [backfilling](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#backfilling)).
Try disabling response cache and see whether this helps. This can be done in the following ways:
Try disabling response cache and see whether this helps:
- By clicking on the toggle `Disable cache` in vmui.
- By passing `-search.disableCache` command-line flag to a single-node VictoriaMetrics
or to all the `vmselect` components if cluster version of VictoriaMetrics is used.
or to all the `vmselect` components if the cluster version of VictoriaMetrics is used.
- By passing `nocache=1` query arg to every request to `/api/v1/query` and `/api/v1/query_range`.
If you use Grafana, then this query arg can be specified in `Custom Query Parameters` field
at Prometheus datasource settings - see [these docs](https://grafana.com/docs/grafana/latest/datasources/prometheus/) for details.
If you use Grafana, then this query arg can be specified in the `Custom Query Parameters` field
in Prometheus datasource settings. See [these docs](https://grafana.com/docs/grafana/latest/datasources/prometheus/) for details.
If the problem was in the cache, try resetting it via [resetRollupCache handler](https://docs.victoriametrics.com/victoriametrics/url-examples/#internalresetrollupresultcache).
If the problem was in the cache, try resetting it via the [resetRollupCache handler](https://docs.victoriametrics.com/victoriametrics/url-examples/#internalresetrollupresultcache).
1. If you use cluster version of VictoriaMetrics, then it may return partial responses by default
when some of `vmstorage` nodes are temporarily unavailable - see [cluster availability docs](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#cluster-availability)
for details. If you want to prioritize query consistency over cluster availability,
then you can pass `-search.denyPartialResponse` command-line flag to all the `vmselect` nodes.
In this case VictoriaMetrics returns an error during querying if at least a single `vmstorage` node is unavailable.
Another option is to pass `deny_partial_response=1` query arg to `/api/v1/query` and `/api/v1/query_range`.
If you use Grafana, then this query arg can be specified in `Custom Query Parameters` field
at Prometheus datasource settings - see [these docs](https://grafana.com/docs/grafana/latest/datasources/prometheus/) for details.
1. The cluster version of VictoriaMetrics may return partial responses by default when some of the `vmstorage` nodes are temporarily
unavailable. See [cluster availability docs](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#cluster-availability).
If you want to prioritize query consistency over cluster availability, then pass `-search.denyPartialResponse` command-line flag to all the `vmselect` nodes.
This causes VictoriaMetrics to return an error during query execution if at least one `vmstorage` node is unavailable.
Another option is to pass `deny_partial_response=1` query argument to `/api/v1/query` and `/api/v1/query_range`.
If you use Grafana, then this query argument can be specified in the `Custom Query Parameters` field
in Prometheus/VictoriaMetrics datasource settings. See [these docs](https://grafana.com/docs/grafana/latest/datasources/prometheus/) for details.
1. If you pass `-replicationFactor` command-line flag to `vmselect`, then it is recommended removing this flag from `vmselect`,
1. If you pass the `-replicationFactor` command-line flag to `vmselect`, then it is recommended to remove this flag from `vmselect`,
since it may lead to incomplete responses when `vmstorage` nodes contain less than `-replicationFactor`
copies of the requested data.
1. If you observe gaps when plotting time series try simplifying your query according to p2 and follow the list.
If problem still remains, then it is likely caused by irregular intervals for metrics collection (network delays
or targets unavailability on scrapes, irregular pushes, irregular timestamps).
VictoriaMetrics automatically [fills the gaps](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#range-query)
based on median interval between [data samples](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#raw-samples).
This may work incorrectly for irregular data as median will be skewed. In this case it is recommended to switch
to the static interval for gaps filling by setting `-search.minStalenessInterval=5m` command-line flag (`5m` is
the static interval used by Prometheus).
1. If you observe recently written data is not immediately visible/queryable, then read more about
1. If you observe that recently written data is not immediately visible/queryable, then read more about
[query latency](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#query-latency) behavior.
1. Try upgrading to the [latest available version of VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/latest)
and verifying whether the issue is fixed there.
1. Try executing the query with `trace=1` query arg. This enables query tracing, that may contain
useful information on why the query returns unexpected data. See [query tracing docs](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#query-tracing) for details.
1. Inspect command-line flags passed to VictoriaMetrics components. If you don't clearly understand the purpose
or the effect of some flags, then remove them from the list of flags.
VictoriaMetrics components are optimized to work well with default settings (that is, when flags aren't explicitly set).
Unnecessary or poorly understood flags can lead to unexpected behavior, so it's best to remove them unless you clearly understand why they are needed.
1. Inspect command-line flags passed to VictoriaMetrics components. If you don't understand clearly the purpose
or the effect of some flags, then remove them from the list of flags passed to VictoriaMetrics components,
because some command-line flags may change query results in unexpected ways when set to improper values.
VictoriaMetrics is optimized for running with default flag values (e.g. when they aren't set explicitly).
1. If the steps above didn't help identifying the root cause of unexpected query results,
then [file a bugreport](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/new) with details on how to reproduce the issue.
Instead of sharing screenshots in the issue, consider sharing query and [trace](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#query-tracing)
results in [VMUI](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#vmui) by clicking on `Export query` button in top right corner of the graph area.
1. If the steps above didn't help identify the root cause of unexpected query results,
then [file a bug report](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/new) with details on how to reproduce the issue.
Instead of sharing screenshots in the issue, consider sharing the query, [raw samples](https://docs.victoriametrics.com/victoriametrics/#vmui) and [trace](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#query-tracing)
results via [VMUI](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#vmui).
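
The `nocache=1` and `trace=1` query args mentioned in this section can be combined in a single request. The following is a minimal sketch, assuming a single-node instance reachable at `http://victoriametrics:8428` (the address used in the export example above). `VM_URL` and `debug_query_cmd` are hypothetical names introduced for illustration, and the function only prints the `curl` command instead of running it:

```sh
# Base URL of single-node VictoriaMetrics (assumption: same address as in the export example).
VM_URL="${VM_URL:-http://victoriametrics:8428}"

# debug_query_cmd QUERY
# Hypothetical helper: prints a curl command that runs an instant query
# with response caching disabled (nocache=1) and query tracing enabled (trace=1).
# The query must not contain single quotes, since it is wrapped in them.
debug_query_cmd() {
  query="$1"
  printf "curl -s '%s/api/v1/query' --data-urlencode 'query=%s' -d 'nocache=1' -d 'trace=1'\n" "$VM_URL" "$query"
}

debug_query_cmd 'sum(rate(http_requests_total[5m]))'
```

Copy the printed command into a terminal to execute it. For the cluster version, `VM_URL` would need to point at the `vmselect` address with the `/select/<tenantID>/prometheus` prefix shown in the export example.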
## Slow data ingestion
These are the most commons reasons for slow data ingestion in VictoriaMetrics:
These are the most common reasons for slow data ingestion in VictoriaMetrics:
1. Memory shortage for the given amounts of [active time series](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-an-active-time-series).
VictoriaMetrics (or `vmstorage` in cluster version of VictoriaMetrics) maintains an in-memory cache
for quick search for internal series ids per each incoming metric.
This cache is named `storage/tsid`. VictoriaMetrics automatically determines the maximum size for this cache
depending on the available memory on the host where VictoriaMetrics (or `vmstorage`) runs. If the cache size isn't enough
for holding all the entries for active time series, then VictoriaMetrics locates the needed data on disk,
unpacks it, re-constructs the missing entry and puts it into the cache. This takes additional CPU time and disk read IO.
VictoriaMetrics (or `vmstorage` in the cluster version of VictoriaMetrics) maintains an in-memory cache `storage/tsid`
for a quick search for internal series IDs for each incoming metric. VictoriaMetrics automatically determines the maximum
size for this cache depending on the available memory on the host where VictoriaMetrics (or `vmstorage`) runs.
If the cache size isn't enough to hold all the entries for active time series, then VictoriaMetrics locates the required data on disk,
unpacks it, reconstructs the missing entry, and adds it to the cache. This takes additional CPU time and disk read I/O.
The [official Grafana dashboards for VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#monitoring)
contain `Slow inserts` graph, that shows the cache miss percentage for `storage/tsid` cache
during data ingestion. If `slow inserts` graph shows values greater than 5% for more than 10 minutes,
then it is likely the current number of [active time series](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-an-active-time-series)
contain a `Slow inserts` graph that shows the cache miss percentage for the `storage/tsid` cache during data ingestion.
If the `slow inserts` graph shows values greater than 5% for more than 10 minutes,
then it is likely that the current number of [active time series](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-an-active-time-series)
cannot fit the `storage/tsid` cache.
These are the solutions that exist for this issue:
- To increase the available memory on the host where VictoriaMetrics runs until `slow inserts` percentage
will become lower than 5%. If you run VictoriaMetrics cluster, then you need increasing total available
memory at `vmstorage` nodes. This can be done in two ways: either to increase the available memory
per each existing `vmstorage` node or to add more `vmstorage` nodes to the cluster.
- Increase the available memory on the host where VictoriaMetrics runs until the `slow inserts` percentage
drops to 5% or less. If you run a VictoriaMetrics cluster, then you need to increase the total available
memory at all `vmstorage` nodes. This can be done in two ways: either increase the available memory
for each `vmstorage` node or add more `vmstorage` nodes to the cluster to spread the load.
- To reduce the number of active time series. The [official Grafana dashboards for VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#monitoring)
contain a graph showing the number of active time series. Recent versions of VictoriaMetrics
provide [cardinality explorer](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#cardinality-explorer),
that can help determining and fixing the source of [high cardinality](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-high-cardinality).
- Reduce the number of active time series. The [official Grafana dashboards for VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#monitoring)
contain a graph showing the number of active time series. Use the [cardinality explorer](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#cardinality-explorer)
to determine and fix the source of [high cardinality](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-high-cardinality).
- Insert performance can degrade when the same time series arrives with labels in different order.
- Insert performance can degrade when the same time series arrives with labels in a different order.
Ensure your ingestion client always sends labels in a consistent order for each series.
Prometheus and `vmagent` already guarantee this, but custom or third-party clients might not.
As a fallback, you can enable `-sortLabels=true` on VictoriaMetrics or on `vminsert` in cluster mode.
This forces the server to normalize label order, though it increases CPU usage during ingestion.
1. [High churn rate](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-high-churn-rate),
e.g. when old time series are substituted with new time series at a high rate.
When VictoriaMetrics encounters a sample for new time series, it needs to register the time series
in the internal index (aka `indexdb`), so it can be quickly located on subsequent select queries.
1. [High churn rate](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-high-churn-rate).
When VictoriaMetrics encounters a sample for a new time series, it needs to register the time series
in the internal index (aka `indexdb`), so it can be quickly located during select queries.
The process of registering new time series in the internal index is an order of magnitude slower
than the process of adding new sample to already registered time series.
than the process of adding a new sample to an already registered time series.
So VictoriaMetrics may work slower than expected under [high churn rate](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-high-churn-rate).
The [official Grafana dashboards for VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#monitoring)
provides `Churn rate` graph, that shows the average number of new time series registered
provide a `Churn rate` graph, which shows the average number of new time series registered
during the last 24 hours. If this number exceeds the number of [active time series](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-an-active-time-series),
then you need to identify and fix the source of [high churn rate](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-high-churn-rate).
The most common source of high churn rate is a label, that frequently changes its value. Try avoiding such labels.
The [cardinality explorer](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#cardinality-explorer) can help identifying
The most common source of high churn rate is a label that frequently changes its value (such as `timestamp` or `session_id`). **Try avoiding such labels.**
The [cardinality explorer](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#cardinality-explorer) can help identify
such labels.
1. Resource shortage. The [official Grafana dashboards for VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#monitoring)
contain `resource usage` graphs, that show memory usage, CPU usage, disk IO usage and free disk size.
Make sure VictoriaMetrics has enough free resources for graceful handling of potential spikes in workload
contain `Resource usage` graphs that show memory usage, CPU usage, disk I/O usage, etc.
Make sure VictoriaMetrics has enough free resources for gracefully handling potential spikes in workload
according to the following recommendations:
- 50% of free CPU
@@ -262,52 +244,51 @@ These are the most commons reasons for slow data ingestion in VictoriaMetrics:
- 20% of free disk space
If VictoriaMetrics components have lower amounts of free resources, then this may lead
to **significant** performance degradation after workload increases slightly.
to **significant** performance degradation when workload increases slightly.
For example:
- If the percentage of free CPU is close to 0, then VictoriaMetrics
may experience arbitrary long delays during data ingestion when it cannot keep up
with slightly increased data ingestion rate.
may experience arbitrarily long delays during data ingestion, even with slight increases in ingestion rate.
- If the percentage of free memory reaches 0, then the Operating System where VictoriaMetrics components run,
may not have enough memory for [page cache](https://en.wikipedia.org/wiki/Page_cache).
VictoriaMetrics relies on page cache for quick queries over recently ingested data.
If the operating system has no enough free memory for page cache, then it needs
to re-read the requested data from disk. This may **significantly** increase disk read IO
- If the percentage of free memory reaches 0, then the operating system where VictoriaMetrics components run
may not have enough memory for the [page cache](https://en.wikipedia.org/wiki/Page_cache).
VictoriaMetrics relies on the page cache for quick queries over recently ingested data.
If the operating system does not have enough free memory for the page cache, then it must
re-read the requested data from disk. This may **significantly** increase disk read I/O
and slow down both queries and data ingestion.
- If free disk space is lower than 20%, then VictoriaMetrics is unable to perform optimal
background merge of the incoming data. This leads to increased number of data files on disk,
that, in turn, slows down both data ingestion and querying. See [these docs](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#storage) for details.
- If free disk space is below 20%, then VictoriaMetrics may be unable to perform optimal
background merge of the incoming data. This results in more data files on disk.
That, in turn, slows down both data ingestion and querying. See [these docs](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#storage) for details.
1. If you run cluster version of VictoriaMetrics, then make sure `vminsert` and `vmstorage` components
are located in the same network with small network latency between them.
`vminsert` packs incoming data into batch packets and sends them to `vmstorage` one-by-one.
It waits until `vmstorage` returns back `ack` response before sending the next packet.
1. If you run the cluster version of VictoriaMetrics, then make sure `vminsert` and `vmstorage` components
are located in the same network with a low network latency between them.
`vminsert` packs incoming data into batch packets and sends them to `vmstorage` one by one.
It waits until `vmstorage` returns an `ack` response before sending the next packet.
If the network latency between `vminsert` and `vmstorage` is high (for example, if they run in different datacenters),
then this may become limiting factor for data ingestion speed.
then this may become a limiting factor for data ingestion speed.
The [official Grafana dashboard for cluster version of VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#monitoring)
contain `connection saturation` graph for `vminsert` components. If this graph reaches 100% (1s),
The [official Grafana dashboard for the cluster version of VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#monitoring)
contains a `connection saturation` panel for `vminsert` components. If this graph reaches 100% (1s),
then it is likely you have issues with network latency between `vminsert` and `vmstorage`.
Another possible issue for 100% connection saturation between `vminsert` and `vmstorage`
is resource shortage at `vmstorage` nodes. In this case you need to increase amounts
of available resources (CPU, RAM, disk IO) at `vmstorage` nodes or to add more `vmstorage` nodes to the cluster.
is a resource shortage in the `vmstorage` nodes. In this case, you need to increase the amount
of available resources (CPU, RAM, disk I/O) at `vmstorage` nodes or add more `vmstorage` nodes to the cluster.
1. Noisy neighbor. Make sure VictoriaMetrics components run in an environment without other resource-hungry apps.
Such apps may steal RAM, CPU, disk IO and network bandwidth, that is needed for VictoriaMetrics components.
Issues like this are very hard to catch via [official Grafana dashboard for cluster version of VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#monitoring)
Such apps may steal RAM, CPU, disk I/O, and network bandwidth that are needed for VictoriaMetrics components.
Issues like this are hard to catch via the [official Grafana dashboard for the cluster version of VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#monitoring)
and proper diagnosis would require checking resource usage on the instances where VictoriaMetrics runs.
1. If you see `TooHighSlowInsertsRate` [alert](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#monitoring) when single-node VictoriaMetrics or `vmstorage` has enough
free CPU and RAM, then increase `-cacheExpireDuration` command-line flag at single-node VictoriaMetrics or at `vmstorage` to the value,
1. If you see a `TooHighSlowInsertsRate` [alert](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#monitoring) when single-node VictoriaMetrics or `vmstorage` has enough
free CPU and RAM, then increase the `-cacheExpireDuration` command-line flag at single-node VictoriaMetrics or at `vmstorage` to a value
that exceeds the interval between ingested samples for the same time series (aka `scrape_interval`).
See [this comment](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3976#issuecomment-1476883183) for more details.
1. If you see constant and abnormally high CPU usage of VictoriaMetrics component, check `CPU spent on GC` panel
on the corresponding [Grafana dashboard](https://grafana.com/orgs/victoriametrics) in `Resource usage` section. If percentage of CPU time spent on garbage collection
is high, then CPU usage of the component can be reduced at the cost of higher memory usage by changing [GOGC](https://tip.golang.org/doc/gc-guide#GOGC) environment variable
to higher values. By default VictoriaMetrics components use `GOGC=30`. Try running VictoriaMetrics components with `GOGC=100` and see whether this helps reducing CPU usage.
1. If you see constant and abnormally high CPU usage of a VictoriaMetrics component, check the `CPU spent on GC` panel
on the corresponding [Grafana dashboard](https://grafana.com/orgs/victoriametrics) in the `Resource usage` section. If the percentage of CPU time spent on garbage collection
is high, then CPU usage of the component can be reduced at the cost of higher memory usage by increasing the [GOGC](https://tip.golang.org/doc/gc-guide#GOGC) environment variable.
By default, VictoriaMetrics components use `GOGC=30`. Try running VictoriaMetrics components with `GOGC=100` and see whether this helps reduce CPU usage.
Note that higher `GOGC` values may increase memory usage.
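
The 5% threshold for `slow inserts` from the first item of this list can be expressed as a simple check. This is a minimal sketch, assuming you already have the number of slow inserts and the total number of inserted rows for the same time window (for example, read from the dashboard). `slow_inserts_exceeded` is a hypothetical helper, not a VictoriaMetrics tool:

```sh
# slow_inserts_exceeded SLOW TOTAL
# Succeeds (exit code 0) when SLOW/TOTAL is above the 5% threshold.
slow_inserts_exceeded() {
  slow="$1"
  total="$2"
  # Nothing was ingested, so there is nothing to evaluate.
  [ "$total" -gt 0 ] || return 1
  # Compare slow/total > 5/100 using integer arithmetic only.
  [ $((slow * 100)) -gt $((total * 5)) ]
}

# Example numbers (made up for illustration): 700 slow inserts out of 10000 rows is 7%.
if slow_inserts_exceeded 700 10000; then
  echo "slow inserts above 5%"
else
  echo "slow inserts within 5%"
fi
```

Note that the text above treats the threshold as a problem only when it is exceeded for more than 10 minutes, so a single reading above 5% is not conclusive on its own.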
## Slow queries
@@ -316,40 +297,43 @@ Some queries may take more time and resources (CPU, RAM, network bandwidth) than
VictoriaMetrics logs slow queries if their execution time exceeds the duration passed
to `-search.logSlowQueryDuration` command-line flag (5s by default).
VictoriaMetrics provides [`top queries` page at VMUI](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#top-queries), that shows
queries that took the most time to execute.
VictoriaMetrics provides a [`top queries` page in VMUI](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#top-queries) that shows
the longest-running queries. [Query execution stats](https://docs.victoriametrics.com/victoriametrics/query-stats/) can be used for dumping slow queries
to logs.
These are the solutions that exist for improving performance of slow queries:
These are the solutions that exist for improving the performance of slow queries:
- Investigating the bottleneck in query execution using [query tracing](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#query-tracing).
It will show the percentage of time spent on each execution step and help understand the volume of processed data.
- Adding more CPU and memory to VictoriaMetrics, so it may perform the slow query faster.
If you use cluster version of VictoriaMetrics, then migrating `vmselect` nodes to machines
with more CPU and RAM should help improving speed for slow queries. Query performance
is always limited by resources of one `vmselect` that processes the query. For example, if 2vCPU cores on `vmselect`
isn't enough to process query fast enough, then migrating `vmselect` to a machine with 4vCPU cores should increase heavy query performance by up to 2x.
If the line on `concurrent select` graph form the [official Grafana dashboard for VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#monitoring)
If you use the cluster version of VictoriaMetrics, then migrating `vmselect` nodes to machines
with more CPU and RAM should help improve speed for slow queries. Query performance
is always limited by the resources of **one** `vmselect` that processes the query. For example, if 2 vCPU cores on `vmselect`
can't process queries fast enough, then migrating `vmselect` to a machine with 4 vCPU cores should increase heavy query performance by up to 2x.
If the line on the `concurrent select` graph from the [official Grafana dashboard for VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#monitoring)
is close to the limit, then prefer adding more `vmselect` nodes to the cluster.
Sometimes adding more `vmstorage` nodes also can help improving the speed for slow queries.
Sometimes adding more `vmstorage` nodes can also help improve the speed for slow queries.
- Rewriting slow queries, so they become faster. Unfortunately it is hard determining
whether the given query is slow by just looking at it.
- Rewriting slow queries, so they become faster.
The main source of slow queries in practice is [alerting and recording rules](https://docs.victoriametrics.com/victoriametrics/vmalert/#rules)
with long lookbehind windows in square brackets. These queries are frequently used in SLI/SLO calculations by tools such as [Sloth](https://github.com/slok/sloth).
For example, `avg_over_time(up[30d]) > 0.99` needs to read and process
all the [raw samples](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#raw-samples)
for `up` [time series](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#time-series) over the last 30 days
each time it executes. If this query is executed frequently, then it can take significant share of CPU, disk read IO, network bandwidth and RAM.
for the `up` [time series](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#time-series) over the last 30 days
each time it executes. If this query is executed frequently, it can take a significant share of CPU, disk read I/O, network bandwidth, and RAM.
Such queries can be optimized in the following ways:
- To reduce the lookbehind window in square brackets. For example, `avg_over_time(up[10d])` takes up to 3x less compute resources
- To reduce the lookbehind window in square brackets. For example, `avg_over_time(up[10d])` takes up to 3x less compute resources
than `avg_over_time(up[30d])` at VictoriaMetrics.
- To increase evaluation interval for alerting and recording rules, so they are executed less frequently.
For example, increasing `-evaluationInterval` command-line flag value at [vmalert](https://docs.victoriametrics.com/victoriametrics/vmalert/)
from `1m` to `2m` should reduce compute resource usage at VictoriaMetrics by 2x.
- To increase the evaluation interval for alerting and recording rules, so they are executed less frequently.
For example, increasing the `-evaluationInterval` command-line flag value at [vmalert](https://docs.victoriametrics.com/victoriametrics/vmalert/)
from `1m` to `2m` should reduce compute resource usage at VictoriaMetrics by 2x.
Another source of slow queries is improper use of [subqueries](https://docs.victoriametrics.com/victoriametrics/metricsql/#subqueries).
It is recommended avoiding subqueries if you don't understand clearly how they work.
It is recommended to avoid subqueries if you don't clearly understand how they work.
It is easy to create a subquery without knowing about it.
For example, `rate(sum(some_metric))` is implicitly transformed into the following subquery
according to [implicit conversion rules for MetricsQL queries](https://docs.victoriametrics.com/victoriametrics/metricsql/#implicit-query-conversions):
@@ -365,67 +349,64 @@ These are the solutions that exist for improving performance of slow queries:
It is likely this query won't return the expected results. `sum(rate(some_metric))` must be used instead.
See [this article](https://www.robustperception.io/rate-then-sum-never-sum-then-rate/) for more details.
VictoriaMetrics provides [query tracing](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#query-tracing) feature,
that can help determining the source of slow query.
See also [this article](https://valyala.medium.com/how-to-optimize-promql-and-metricsql-queries-85a1b75bf986),
that explains how to determine and optimize slow queries.
which explains how to identify and optimize slow queries.
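
The "up to 3x" and "2x" estimates for lookbehind windows and evaluation intervals in this section follow from simple arithmetic. This is a rough sketch, assuming a 30-second scrape interval (an assumption, not stated in the text) and ignoring any caching. `samples_per_day` is a hypothetical helper that estimates how many raw samples per series a rule reads in one day:

```sh
# samples_per_day WINDOW_SECONDS SCRAPE_SECONDS EVAL_SECONDS
# Estimates raw samples read per series per day by a rule
# with the given lookbehind window and evaluation interval.
samples_per_day() {
  window="$1"
  scrape="$2"
  eval_interval="$3"
  per_eval=$((window / scrape))             # samples inside one lookbehind window
  evals_per_day=$((86400 / eval_interval))  # rule evaluations per day
  echo $((per_eval * evals_per_day))
}

day=86400
samples_per_day $((30 * day)) 30 60    # [30d] window, 1m evaluation interval
samples_per_day $((10 * day)) 30 60    # [10d] window, 1m evaluation interval
samples_per_day $((30 * day)) 30 120   # [30d] window, 2m evaluation interval
```

Under these assumptions, the second result is one third of the first and the third result is one half of the first. Real resource usage depends on more factors than the number of samples, so treat these ratios as an upper bound on the savings.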
## Out of memory errors
There are the following most common sources of out of memory (aka OOM) crashes in VictoriaMetrics:
The following are the most common sources of out-of-memory (aka OOM) crashes in VictoriaMetrics:
1. Improper command-line flag values. Inspect command-line flags passed to VictoriaMetrics components.
If you don't understand clearly the purpose or the effect of some flags - remove them
from the list of flags passed to VictoriaMetrics components. Improper command-line flags values
may lead to increased memory and CPU usage. The increased memory usage increases chances for OOM crashes.
VictoriaMetrics is optimized for running with default flag values (e.g. when they aren't set explicitly).
If you don't clearly understand the purpose or the effect of some flags, remove them
from the list of flags passed to VictoriaMetrics components. Improper command-line flag values
may lead to increased memory and CPU usage. Increased memory usage increases the risk of OOM crashes.
VictoriaMetrics is optimized to run with default flag values (e.g., when they aren't explicitly set).
For example, it isn't recommended tuning cache sizes in VictoriaMetrics, since it frequently leads to OOM exceptions.
[These docs](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#cache-tuning) refer command-line flags, that aren't
recommended to tune. If you see that VictoriaMetrics needs increasing some cache sizes for the current workload,
then it is better migrating to a host with more memory instead of trying to tune cache sizes manually.
For example, it isn't recommended to change cache sizes in VictoriaMetrics, as this frequently leads to OOM exceptions.
[These docs](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#cache-tuning) refer to command-line flags that aren't
recommended to tune. If you see that VictoriaMetrics needs to increase some cache sizes for the current workload,
then it is better to migrate to a host with more memory instead of trying to tune cache sizes manually.
1. Unexpected heavy queries. The query is considered as heavy if it needs to select and process millions of unique time series.
Such query may lead to OOM exception, since VictoriaMetrics needs to keep some of per-series data in memory.
VictoriaMetrics provides [various settings](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#resource-usage-limits),
1. Unexpected heavy queries. The query is considered heavy if it needs to select and process millions of unique time series.
Such a query may cause an OOM exception, as VictoriaMetrics needs to keep some per-series data in memory.
VictoriaMetrics provides [various settings](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#resource-usage-limits)
that can help limit resource usage.
For more context, see [How to optimize PromQL and MetricsQL queries](https://valyala.medium.com/how-to-optimize-promql-and-metricsql-queries-85a1b75bf986).
VictoriaMetrics also provides a [query tracer](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#query-tracing)
to help identify the source of heavy query.
to help identify the source of heavy queries. Slow queries can be logged with additional details via [Query execution stats](https://docs.victoriametrics.com/victoriametrics/query-stats/).
1. Lack of free memory for processing workload spikes. If VictoriaMetrics components use almost all the available memory
under the current workload, then it is recommended migrating to a host with bigger amounts of memory.
under the current workload, then it is recommended to migrate to a host with larger amounts of memory.
This would protect from possible OOM crashes on workload spikes. It is recommended to have at least 50%
of free memory for graceful handling of possible workload spikes.
of free memory to gracefully handle possible workload spikes.
See [capacity planning for single-node VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#capacity-planning)
and [capacity planning for cluster version of VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#capacity-planning).
and [capacity planning for the cluster version of VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#capacity-planning).
## Cluster instability
VictoriaMetrics cluster may become unstable if there is no enough free resources (CPU, RAM, disk IO, network bandwidth)
The VictoriaMetrics cluster may become unstable if there are not enough free resources (CPU, RAM, disk I/O, network bandwidth)
for processing the current workload.
The most common sources of cluster instability are:
- Workload spikes. For example, if the number of active time series increases by 2x while
the cluster has no enough free resources for processing the increased workload,
then it may become unstable.
VictoriaMetrics provides various configuration settings, that can be used for limiting unexpected workload spikes.
the cluster does not have enough free resources for processing the increased workload, then it may become unstable.
VictoriaMetrics provides several configuration settings to limit unexpected workload spikes.
See [these docs](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#resource-usage-limits) for details.
- Various maintenance tasks such as rolling upgrades or rolling restarts during configuration changes.
- Various maintenance tasks, such as rolling upgrades or rolling restarts, during configuration changes.
For example, if a cluster contains `N=3` `vmstorage` nodes and they are restarted one-by-one (aka rolling restart),
then the cluster will have only `N-1=2` healthy `vmstorage` nodes during the rolling restart.
This means that the load on healthy `vmstorage` nodes increases by at least `100%/(N-1)=50%`
comparing to the load before rolling restart. E.g. they need to process 50% more incoming
compared to the load before rolling restart. E.g., they need to process 50% more incoming
data and to return 50% more data during queries. In reality, the load on the remaining `vmstorage`
nodes increases even more because they need to register new time series, that were re-routed
from temporarily unavailable `vmstorage` node. If `vmstorage` nodes had less than 50%
of free resources (CPU, RAM, disk IO) before the rolling restart, then it
nodes increases even more because they need to register new time series that were re-routed
from a temporarily unavailable `vmstorage` node. If `vmstorage` nodes had less than 50%
of free resources (CPU, RAM, disk I/O) before the rolling restart, then it
can lead to cluster overload and instability for both data ingestion and querying.
The workload increase during rolling restart can be reduced by increasing
the number of `vmstorage` nodes in the cluster. For example, if VictoriaMetrics cluster contains
the number of `vmstorage` nodes in the cluster. For example, if the VictoriaMetrics cluster contains
`N=11` `vmstorage` nodes, then the workload increase during rolling restart of `vmstorage` nodes
would be `100%/(N-1)=10%`. It is recommended to have at least 8 `vmstorage` nodes in the cluster.
The recommended number of `vmstorage` nodes should be multiplied by `-replicationFactor` if replication is enabled -
@@ -433,11 +414,11 @@ The most common sources of cluster instability are:
for details.
- Time series sharding. Received time series [are consistently sharded](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#architecture-overview)
by `vminsert` between configured `vmstorage` nodes. As a sharding key `vminsert` is using time series name and labels,
respecting their order. If the order of labels in time series is constantly changing, this could cause wrong sharding
calculation and result in un-even and sub-optimal time series distribution across available vmstorages. It is expected
that metrics pushing client is responsible for consistent labels order (like `Prometheus` or `vmagent` during scraping).
If this can't be guaranteed, set `-sortLabels=true` command-line flag to `vminsert`. Please note, sorting may increase
by `vminsert` between configured `vmstorage` nodes. As a sharding key, `vminsert` uses the time series name and labels,
respecting their order. If the order of labels in a time series is constantly changing, this can cause incorrect sharding
calculation and result in uneven and suboptimal time series distribution across the available `vmstorage` nodes. It is expected
that the client pushing metrics is responsible for consistent label order (like `Prometheus` or `vmagent` during scraping).
If this can't be guaranteed, pass the `-sortLabels=true` command-line flag to `vminsert`. Please note that sorting may increase
CPU usage for `vminsert`.
- Network instability between cluster components (`vminsert`, `vmselect`, `vmstorage`) may lead to increased error rates, timeouts, or degraded performance.
@@ -447,14 +428,14 @@ The most common sources of cluster instability are:
but can still cause transient network failures. In such cases, check CPU usage at the OS level with higher-resolution tools.
Consider increasing `-vmstorageDialTimeout` and `-rpc.handshakeTimeout`{{% available_from "v1.124.0" %}} to mitigate the effects of CPU spikes.
If resource usage looks normal but networking issues still occur, then the root cause is likely outside VictoriaMetrics.
If resource usage appears normal but networking issues persist, the root cause is likely outside VictoriaMetrics.
This may be caused by unreliable or congested network links, especially across availability zones or regions.
In multi-AZ setups, consider [a multi-level cluster](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#multi-level-cluster-setup) with region-local load balancers to reduce cross-zone connections.
If the network cannot be improved, increasing timeouts such as `-vmstorageDialTimeout`, `-rpc.handshakeTimeout`{{% available_from "v1.124.0" %}}, or `-search.maxQueueDuration` may help, but should be done cautiously, as higher timeouts can impact cluster stability in other ways.
Keep in mind that VictoriaMetrics assumes reliable networking between components. If the network is unstable, the overall cluster stability may degrade regardless of resource availability.
The obvious solution against VictoriaMetrics cluster instability is to make sure cluster components
have enough free resources for graceful processing of the increased workload.
The obvious solution to VictoriaMetrics cluster instability is to make sure cluster components
have sufficient free resources to handle the increased workload gracefully.
See [capacity planning docs](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#capacity-planning)
and [cluster resizing and scalability docs](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#cluster-resizing-and-scalability)
for details.
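The `100%/(N-1)` estimate from the rolling-restart example above can be checked with a few lines of Go. This is only a minimal sketch: `loadIncreasePercent` is a hypothetical helper, not part of VictoriaMetrics, and it covers just the lower bound, ignoring the extra cost of registering re-routed time series.

```go
package main

import "fmt"

// loadIncreasePercent returns the minimum extra load, in percent, that each
// remaining vmstorage node takes on while one of n nodes is restarting.
// It implements the 100%/(N-1) estimate; n must be at least 2.
// Hypothetical helper for illustration only.
func loadIncreasePercent(n int) float64 {
	return 100 / float64(n-1)
}

func main() {
	// Node counts mentioned in the text: 3, the recommended minimum of 8, and 11.
	for _, n := range []int{3, 8, 11} {
		fmt.Printf("N=%d: at least +%.1f%% load per remaining node\n", n, loadIncreasePercent(n))
	}
}
```

Comparing the result with the free CPU, RAM, and disk I/O headroom on each `vmstorage` node gives a rough idea of whether a rolling restart is safe at the current cluster size.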
@@ -464,10 +445,10 @@ for details.
If too much disk space is used by a [single-node VictoriaMetrics](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/) or by `vmstorage` component
at [VictoriaMetrics cluster](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/), then please check the following:
- Make sure that there are no old snapshots, since they can occupy disk space. See [how to work with snapshots](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#how-to-work-with-snapshots)
, [snapshot troubleshooting](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#snapshot-troubleshooting) and [vmbackup troubleshooting](https://docs.victoriametrics.com/victoriametrics/vmbackup/#troubleshooting).
- Make sure that there are no old snapshots, since they can occupy disk space. See [how to work with snapshots](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#how-to-work-with-snapshots),
[snapshot troubleshooting](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#snapshot-troubleshooting) and [vmbackup troubleshooting](https://docs.victoriametrics.com/victoriametrics/vmbackup/#troubleshooting).
- Under normal conditions the size of `<-storageDataPath>/indexdb` folder must be smaller than the size of `<-storageDataPath>/data` folder, where `-storageDataPath`
- Under normal conditions, the size of `<-storageDataPath>/indexdb` folder must be smaller than the size of `<-storageDataPath>/data` folder, where `-storageDataPath`
is the corresponding command-line flag value. This can be checked by the following query if [VictoriaMetrics monitoring](#monitoring) is properly set up:
```metricsql
@@ -476,22 +457,22 @@ at [VictoriaMetrics cluster](https://docs.victoriametrics.com/victoriametrics/cl
sum(vm_data_size_bytes{type=~"(storage|indexdb)/.+"}) without(type)
```
If this query returns values bigger than 0.5, then it is likely there is a [high churn rate](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-high-churn-rate) issue,
that results in excess disk space usage for both `indexdb` and `data` folders under `-storageDataPath` folder.
The solution is to identify and fix the source of high churn rate with [cardinality explorer](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#cardinality-explorer).
If this query returns values greater than 0.5, then it is likely there is a [high churn rate](https://docs.victoriametrics.com/victoriametrics/faq/#what-is-high-churn-rate) issue,
that results in excess disk space usage for both the `indexdb` and `data` folders under the `-storageDataPath` folder.
The solution is to identify and fix the source of the high churn rate with the [cardinality explorer](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#cardinality-explorer).
## Monitoring
Having proper [monitoring](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#monitoring)
would help identify and prevent most of the issues listed above.
can help identify and prevent most of the issues listed above.
[Grafana dashboards](https://grafana.com/orgs/victoriametrics/dashboards) contain panels reflecting the
health state, resource usage and other specific metrics for VictoriaMetrics components.
health state, resource usage, and other specific metrics for VictoriaMetrics components.
The list of [recommended alerting rules](https://github.com/VictoriaMetrics/VictoriaMetrics/tree/master/deployment/docker#alerts)
for VictoriaMetrics components will notify about issues and provide recommendations for how to solve them.
Check the list of [recommended alerting rules](https://github.com/VictoriaMetrics/VictoriaMetrics/tree/master/deployment/docker#alerts)
for VictoriaMetrics components to get notified about issues and receive recommendations for resolving them.
Internally, we heavily rely both on dashboards and alerts, and constantly improve them.
Internally, we rely heavily on both dashboards and alerts, and we constantly improve them.
It is important to stay up to date with such changes.
@@ -500,11 +481,12 @@ It is important to stay up to date with such changes.
On some ZFS filesystems, mixing reads from memory-mapped files (`mmap`) with usage of the `mincore()` syscall can trigger a bug in the ZFS in-memory cache (ARC), potentially resulting in **data read corruption** in VictoriaMetrics processes. This scenario has been observed when VictoriaMetrics instances access data directories on ZFS.
Symptoms:
Note that the source code for the VictoriaMetrics cluster is located in [the cluster](https://github.com/VictoriaMetrics/VictoriaMetrics/tree/cluster) branch.
- Unexpected read errors when accessing data on ZFS.
- Corrupted or inconsistent query results.
- Crashes or panics in storage/query components when reading from ZFS.
It could be mitigated with `--fs.disableMincore` flag:
It could be mitigated with the `--fs.disableMincore` flag:
```text
./bin/victoria-metrics --storageDataPath /path/to/zfs/data --fs.disableMincore


@@ -31,10 +31,15 @@ See also [LTS releases](https://docs.victoriametrics.com/victoriametrics/lts-rel
* FEATURE: all VictoriaMetrics components: expose `process_cpu_seconds_total`, `process_resident_memory_bytes`, and other process-level metrics when running on macOS. See [metrics#75](https://github.com/VictoriaMetrics/metrics/issues/75).
* FEATURE: [dashboards/vmauth](https://grafana.com/grafana/dashboards/21394): add `Request body buffering duration` panel to the `Troubleshooting` section. This panel shows the time spent buffering incoming client request bodies, helping identify slow client uploads and potential concurrency issues. The panel is only available when `-requestBufferSize` is non-zero. See [#10309](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10309).
* FEATURE: [vmagent](https://docs.victoriametrics.com/victoriametrics/vmagent/), [vmsingle](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/), `vminsert` and `vmstorage` in [VictoriaMetrics cluster](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/): enable [ingestion](https://docs.victoriametrics.com/victoriametrics/vmagent/#metric-metadata) and in-memory [storage](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#metrics-metadata) of metrics metadata by default. Metadata ingestion can be disabled with `-enableMetadata=false`. See [#2974](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2974).
* FEATURE: [dashboards/operator](https://grafana.com/grafana/dashboards/17869): extract the operator version from metrics instead of using a hardcoded value.
* FEATURE: [dashboards/alert-statistics](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/dashboards/alert-statistics.json): add a link to a specific alerting rule on the table of firing alerts. Thanks to @sias32.
* FEATURE: [alerts](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/rules): use `$externalURL` instead of `localhost` in the alerting rules. This should improve the usability of the rules if `$externalURL` is correctly configured, without the need to update rule annotations. Thanks to @sias32.
* BUGFIX: [vmsingle](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/) and `vmstorage` in [VictoriaMetrics cluster](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/): prevent panic `error parsing regexp: expression nests too deeply` triggered by large repetition ranges in regex, for example `{"__name__"=~"a{0,1000}"}`. See [VictoriaLogs#1112](https://github.com/VictoriaMetrics/VictoriaLogs/issues/1112).
* BUGFIX: [vmui](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/#vmui): fix escaping for label names with special characters. See [#10485](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10485).
* BUGFIX: `vmstorage` in [VictoriaMetrics cluster](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/): properly search tenants for [multitenant](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#multitenancy) query request. See [#10422](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10422).
* BUGFIX: `vmstorage` in [VictoriaMetrics cluster](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/): properly apply the `extra_filters[]` filter when querying `vm_account_id` or `vm_project_id` labels via a [multitenant](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#multitenancy) request to the `/api/v1/label/…/values` API. Previously, `extra_filters[]` was ignored.
## [v1.136.0](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.136.0)


@@ -429,9 +429,9 @@ See the docs at https://docs.victoriametrics.com/victoriametrics/vmalert/ .
-rule.evalDelay duration
Adjustment of the 'time' parameter for rule evaluation requests to compensate intentional data delay from the datasource. Normally, should be equal to '-search.latencyOffset' (cmd-line flag configured for VictoriaMetrics single-node or vmselect). This doesn't apply to groups with eval_offset specified. (default 30s)
-rule.maxResolveDuration duration
Limits the maxiMum duration for automatic alert expiration, which by default is 4 times evaluationInterval of the parent group
Limits the maximum duration for automatic alert expiration, which by default is 4 times evaluationInterval of the parent group
-rule.resendDelay duration
MiniMum amount of time to wait before resending an alert to notifier.
Minimum amount of time to wait before resending an alert to notifier.
-rule.resultsLimit int
Limits the number of alerts or recording results a single rule can produce. Can be overridden by the limit option under group if specified. If exceeded, the rule will be marked with an error and all its results will be discarded. 0 means no limit.
-rule.templates array


@@ -5,6 +5,7 @@ import (
"fmt"
"io"
"reflect"
"runtime/debug"
"sort"
"strings"
"sync"
@@ -821,6 +822,11 @@ func sortLabels(labels []prompb.Label) {
func TestPutBigWriteRequestContext(t *testing.T) {
f := func(l, c, expectC int) {
t.Helper()
// Disable GC so that pooled items aren't reclaimed during the test; the previous setting is restored afterwards.
prevPercent := debug.SetGCPercent(-1)
defer debug.SetGCPercent(prevPercent)
// Reset the whole pool first, since other test cases could interfere.
wctxPool = sync.Pool{}


@@ -891,7 +891,7 @@ func (s *Storage) mustLoadNextDayMetricIDs(date uint64) *nextDayMetricIDs {
}
// Unmarshal uint64set
m, tail, err := unmarshalUint64Set(src)
m, tail, err := uint64set.Unmarshal(src)
if err != nil {
logger.Infof("discarding %s because cannot load uint64set: %s", path, err)
return e
@@ -931,7 +931,7 @@ func (s *Storage) mustLoadHourMetricIDs(hour uint64, name string) *hourMetricIDs
}
// Unmarshal uint64set
m, tail, err := unmarshalUint64Set(src)
m, tail, err := uint64set.Unmarshal(src)
if err != nil {
logger.Infof("discarding %s because cannot load uint64set: %s", path, err)
return hm
@@ -952,7 +952,7 @@ func (s *Storage) mustSaveNextDayMetricIDs(e *nextDayMetricIDs) {
dst = encoding.MarshalUint64(dst, e.date)
// Marshal metricIDs
dst = marshalUint64Set(dst, &e.metricIDs)
dst = e.metricIDs.Marshal(dst)
fs.MustWriteSync(path, dst)
}
@@ -965,37 +965,11 @@ func (s *Storage) mustSaveHourMetricIDs(hm *hourMetricIDs, name string) {
dst = encoding.MarshalUint64(dst, hm.hour)
// Marshal hm.m
dst = marshalUint64Set(dst, hm.m)
dst = hm.m.Marshal(dst)
fs.MustWriteSync(path, dst)
}
func unmarshalUint64Set(src []byte) (*uint64set.Set, []byte, error) {
mLen := encoding.UnmarshalUint64(src)
src = src[8:]
if uint64(len(src)) < 8*mLen {
return nil, nil, fmt.Errorf("cannot unmarshal uint64set; got %d bytes; want at least %d bytes", len(src), 8*mLen)
}
m := &uint64set.Set{}
for range mLen {
metricID := encoding.UnmarshalUint64(src)
src = src[8:]
m.Add(metricID)
}
return m, src, nil
}
func marshalUint64Set(dst []byte, m *uint64set.Set) []byte {
dst = encoding.MarshalUint64(dst, uint64(m.Len()))
m.ForEach(func(part []uint64) bool {
for _, metricID := range part {
dst = encoding.MarshalUint64(dst, metricID)
}
return true
})
return dst
}
func mustGetMinTimestampForCompositeIndex(metadataDir string, isEmptyDB bool) int64 {
path := filepath.Join(metadataDir, "minTimestampForCompositeIndex")
minTimestamp, err := loadMinTimestampForCompositeIndex(path)


@@ -1,6 +1,7 @@
package uint64set
import (
"fmt"
"math/bits"
"slices"
"sort"
@@ -8,6 +9,7 @@ import (
"sync/atomic"
"unsafe"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/encoding"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/slicesutil"
)
@@ -38,6 +40,49 @@ func (s *bucket32Sorter) Swap(i, j int) {
a[i], a[j] = a[j], a[i]
}
// Unmarshal creates an instance of a set from bytes.
//
// The first 8 src bytes contain the set length (the number of elements in the
// set). Since each element is 8 bytes long, the number of remaining src bytes
// must be at least 8*length, or else the function returns an error. The
// function reads exactly 8*length bytes and constructs an instance of a
// set. The remaining src bytes are returned along with the set.
func Unmarshal(src []byte) (*Set, []byte, error) {
if len(src) < 8 {
return nil, nil, fmt.Errorf("cannot unmarshal uint64set; got %d bytes; want at least 8 bytes", len(src))
}
sLen := encoding.UnmarshalUint64(src)
src = src[8:]
if uint64(len(src)) < 8*sLen {
return nil, nil, fmt.Errorf("cannot unmarshal uint64set; got %d bytes; want at least %d bytes", len(src), 8*sLen)
}
s := &Set{}
for range sLen {
e := encoding.UnmarshalUint64(src)
src = src[8:]
s.Add(e)
}
return s, src, nil
}
// Marshal encodes the set as a sequence of bytes.
//
// The first 8 bytes contain the length of the set (the number of elements the
// set contains). The subsequent bytes are the uint64 elements themselves.
//
// The marshaling result is appended to the end of dst, i.e. the initial dst
// content is not overwritten.
func (s *Set) Marshal(dst []byte) []byte {
dst = encoding.MarshalUint64(dst, uint64(s.Len()))
s.ForEach(func(part []uint64) bool {
for _, e := range part {
dst = encoding.MarshalUint64(dst, e)
}
return true
})
return dst
}
// Clone returns an independent copy of s.
func (s *Set) Clone() *Set {
if s == nil || s.itemsCount == 0 {
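The byte layout described in the `Unmarshal` and `Marshal` comments above (an 8-byte element count followed by 8-byte elements, with unread bytes returned as the tail) can be sketched without the `uint64set` package. The sketch below makes several assumptions: it uses a plain `[]uint64` instead of `Set`, uses `encoding/binary` with big-endian byte order (the order the tests in this change use to build expected bytes), and `marshalUint64s` / `unmarshalUint64s` are hypothetical names, not functions from the repository.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// marshalUint64s appends the element count and then every element to dst,
// each as an 8-byte big-endian value. Existing dst content is preserved.
// Hypothetical helper that only mirrors the layout described above.
func marshalUint64s(dst []byte, elems []uint64) []byte {
	dst = binary.BigEndian.AppendUint64(dst, uint64(len(elems)))
	for _, e := range elems {
		dst = binary.BigEndian.AppendUint64(dst, e)
	}
	return dst
}

// unmarshalUint64s reads the count, then exactly count elements, and returns
// the unread remainder of src as the tail.
func unmarshalUint64s(src []byte) ([]uint64, []byte, error) {
	if len(src) < 8 {
		return nil, nil, fmt.Errorf("got %d bytes; want at least 8 bytes", len(src))
	}
	n := binary.BigEndian.Uint64(src)
	src = src[8:]
	// Dividing the available length by 8 (rather than multiplying n by 8)
	// keeps the check safe from integer overflow for very large n.
	if uint64(len(src))/8 < n {
		return nil, nil, fmt.Errorf("got %d bytes; want enough for %d elements", len(src), n)
	}
	elems := make([]uint64, 0, n)
	for i := uint64(0); i < n; i++ {
		elems = append(elems, binary.BigEndian.Uint64(src))
		src = src[8:]
	}
	return elems, src, nil
}

func main() {
	buf := marshalUint64s(nil, []uint64{7, 42, 1 << 40})
	buf = append(buf, 0xAA, 0xBB) // trailing bytes that belong to something else

	elems, tail, err := unmarshalUint64s(buf)
	if err != nil {
		panic(err)
	}
	fmt.Println(elems, len(tail))
}
```

Returning the tail lets a caller store several values back to back in one file and decode them in sequence, which is how the callers in this change use the date or hour prefix followed by the set.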


@@ -1,6 +1,7 @@
package uint64set
import (
"encoding/binary"
"fmt"
"math/rand"
"reflect"
@@ -895,3 +896,115 @@ func TestSubtract(t *testing.T) {
f(a, b1, want1)
f(a, b2, want2)
}
func TestUnmarshal(t *testing.T) {
n := uint64(100_000)
src := make([]byte, (n+1)*8+10)
binary.BigEndian.PutUint64(src, n)
want := &Set{}
for i := range n {
binary.BigEndian.PutUint64(src[(i+1)*8:], i)
want.Add(i)
}
got, gotTail, err := Unmarshal(src)
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if !got.Equal(want) {
diff := cmp.Diff(want.AppendTo(nil), got.AppendTo(nil))
t.Fatalf("unexpected set (-want, +got):\n%s", diff)
}
wantTail := make([]byte, 10)
if diff := cmp.Diff(wantTail, gotTail); diff != "" {
t.Fatalf("unexpected tail bytes (-want, +got):\n%s", diff)
}
}
func TestUnmarshal_zeroLenSet(t *testing.T) {
src := make([]byte, 8)
want := &Set{}
got, gotTail, err := Unmarshal(src)
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if !got.Equal(want) {
diff := cmp.Diff(want.AppendTo(nil), got.AppendTo(nil))
t.Fatalf("unexpected set (-want, +got):\n%s", diff)
}
wantTail := []byte{}
if diff := cmp.Diff(wantTail, gotTail); diff != "" {
t.Fatalf("unexpected tail bytes (-want, +got):\n%s", diff)
}
}
func TestUnmarshal_tooShortToIncludeSetLen(t *testing.T) {
src := make([]byte, 7) // set length occupies 8 bytes.
got, gotTail, err := Unmarshal(src)
if err == nil {
t.Fatalf("expected error but got nil")
}
if got != nil {
t.Fatalf("expected nil set but got: %v", got.AppendTo(nil))
}
if gotTail != nil {
t.Fatalf("expected nil tail bytes but got: %v", gotTail)
}
}
func TestUnmarshal_numElementsLessThanLen(t *testing.T) {
n := uint64(10)
src := make([]byte, n*8) // contains only 9 elements instead of 10.
binary.BigEndian.PutUint64(src, n)
for i := range n - 1 {
binary.BigEndian.PutUint64(src[(i+1)*8:], i)
}
got, gotTail, err := Unmarshal(src)
if err == nil {
t.Fatalf("expected error but got nil")
}
if got != nil {
t.Fatalf("expected nil set but got: %v", got.AppendTo(nil))
}
if gotTail != nil {
t.Fatalf("expected nil tail bytes but got: %v", gotTail)
}
}
func TestMarshal_emptyDst(t *testing.T) {
n := uint64(100_000)
want := make([]byte, (n+1)*8)
binary.BigEndian.PutUint64(want, n)
s := &Set{}
for i := range n {
binary.BigEndian.PutUint64(want[(i+1)*8:], i)
s.Add(i)
}
got := s.Marshal(nil)
if diff := cmp.Diff(want, got); diff != "" {
t.Fatalf("unexpected bytes (-want, +got):\n%s", diff)
}
}
func TestMarshal_nonEmptyDst(t *testing.T) {
n := uint64(100_000)
got := make([]byte, 10)
want := make([]byte, 10+(n+1)*8)
for i := range 10 {
got[i] = byte(i)
want[i] = byte(i)
}
binary.BigEndian.PutUint64(want[10:], n)
s := &Set{}
for i := range n {
binary.BigEndian.PutUint64(want[10+(i+1)*8:], i)
s.Add(i)
}
got = s.Marshal(got)
if diff := cmp.Diff(want, got); diff != "" {
t.Fatalf("unexpected bytes (-want, +got):\n%s", diff)
}
}