mirror of
https://github.com/VictoriaMetrics/VictoriaMetrics.git
synced 2026-05-17 08:36:55 +03:00
deployment/alerts: move IndexDBRecordsDrop and TooManyTSIDMisses rules to storage-related files
`IndexDBRecordsDrop` and `TooManyTSIDMisses` were mistakenly placed to `alerts-health.yml`, which was supposed to contain rules related to all VM components. But these two rules are related to storage components only (vmstorage and vmsingle). Moving them to corresponding files. Signed-off-by: hagen1778 <roman@victoriametrics.com>
This commit is contained in:
@@ -198,4 +198,29 @@ groups:
|
||||
are changing too frequently or if the cache size is too low. There are following ways to mitigate cache overutilization:
|
||||
- disable cache via `--storage.trackMetricNamesStats=false` flag, so metric names usage will stop tracking
|
||||
- increase the cache size via `--storage.cacheSizeMetricNamesStats` flag
|
||||
- reset the cache (see docs for details)"
|
||||
- reset the cache (see docs for details)"
|
||||
|
||||
- alert: IndexDBRecordsDrop
|
||||
expr: increase(vm_indexdb_items_dropped_total[5m]) > 0
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "IndexDB skipped registering items during data ingestion with reason={{ $labels.reason }}."
|
||||
description: |
|
||||
VictoriaMetrics could skip registering new timeseries during ingestion if they fail the validation process.
|
||||
For example, `reason=too_long_item` means that time series cannot exceed 64KB. Please, reduce the number
|
||||
of labels or label values for such series. Or enforce these limits via `-maxLabelsPerTimeseries` and
|
||||
`-maxLabelValueLen` command-line flags.
|
||||
|
||||
- alert: TooManyTSIDMisses
|
||||
expr: increase(vm_missing_tsids_for_metric_id_total[5m]) > 0
|
||||
for: 15m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Unexpected TSID misses for job \"{{ $labels.job }}\" ({{ $labels.instance }}) for the last 15 minutes"
|
||||
description: |
|
||||
Unexpected TSID misses for \"{{ $labels.job }}\" ({{ $labels.instance }}) for the last 15 minutes.
|
||||
If this happens after unclean shutdown of VictoriaMetrics process (via \"kill -9\", OOM or power off),
|
||||
then this is OK - the alert must go away in a few minutes after the restart.
|
||||
Otherwise this may point to the corruption of index data.
|
||||
@@ -82,19 +82,6 @@ groups:
|
||||
Check the logs for the given target. Check also the \"location\" label at the vm_log_messages_total metric if -loggerLevel command-line flag is set to value other than INFO.
|
||||
This label contains code locations responsible for generating log messages suppressed by -loggerLevel.
|
||||
|
||||
- alert: TooManyTSIDMisses
|
||||
expr: increase(vm_missing_tsids_for_metric_id_total[5m]) > 0
|
||||
for: 15m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Unexpected TSID misses for job \"{{ $labels.job }}\" ({{ $labels.instance }}) for the last 15 minutes"
|
||||
description: |
|
||||
Unexpected TSID misses for \"{{ $labels.job }}\" ({{ $labels.instance }}) for the last 15 minutes.
|
||||
If this happens after unclean shutdown of VictoriaMetrics process (via \"kill -9\", OOM or power off),
|
||||
then this is OK - the alert must go away in a few minutes after the restart.
|
||||
Otherwise this may point to the corruption of index data.
|
||||
|
||||
- alert: ConcurrentInsertsHitTheLimit
|
||||
expr: avg_over_time(vm_concurrent_insert_current[1m]) >= vm_concurrent_insert_capacity
|
||||
for: 15m
|
||||
@@ -109,28 +96,6 @@ groups:
|
||||
making write attempts. If vmagent's or vminsert's CPU usage and network saturation are at normal level, then
|
||||
it might be worth adjusting `-maxConcurrentInserts` cmd-line flag.
|
||||
|
||||
- alert: IndexDBRecordsDrop
|
||||
expr: increase(vm_indexdb_items_dropped_total[5m]) > 0
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "IndexDB skipped registering items during data ingestion with reason={{ $labels.reason }}."
|
||||
description: |
|
||||
VictoriaMetrics could skip registering new timeseries during ingestion if they fail the validation process.
|
||||
For example, `reason=too_long_item` means that time series cannot exceed 64KB. Please, reduce the number
|
||||
of labels or label values for such series. Or enforce these limits via `-maxLabelsPerTimeseries` and
|
||||
`-maxLabelValueLen` command-line flags.
|
||||
|
||||
- alert: RowsRejectedOnIngestion
|
||||
expr: rate(vm_rows_ignored_total[5m]) > 0
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Some rows are rejected on \"{{ $labels.instance }}\" on ingestion attempt"
|
||||
description: "Ingested rows on instance \"{{ $labels.instance }}\" are rejected due to the
|
||||
following reason: \"{{ $labels.reason }}\""
|
||||
|
||||
- alert: TooHighQueryLoad
|
||||
expr: increase(vm_concurrent_select_limit_timeout_total[5m]) > 0
|
||||
for: 15m
|
||||
@@ -148,3 +113,14 @@ groups:
|
||||
* increase compute resources or number of replicas;
|
||||
* adjust limits `-search.maxConcurrentRequests` and `-search.maxQueueDuration`.
|
||||
See more at https://docs.victoriametrics.com/victoriametrics/troubleshooting/#slow-queries.
|
||||
|
||||
- alert: RowsRejectedOnIngestion
|
||||
expr: rate(vm_rows_ignored_total[5m]) > 0
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Some rows are rejected on \"{{ $labels.instance }}\" on ingestion attempt"
|
||||
description: "Ingested rows on instance \"{{ $labels.instance }}\" are rejected due to the
|
||||
following reason: \"{{ $labels.reason }}\""
|
||||
|
||||
|
||||
@@ -164,4 +164,29 @@ groups:
|
||||
are changing too frequently or if the cache size is too low. There are following ways to mitigate cache overutilization:
|
||||
- disable cache via `--storage.trackMetricNamesStats=false` flag, so metric names usage will stop tracking
|
||||
- increase the cache size via `--storage.cacheSizeMetricNamesStats` flag
|
||||
- reset the cache (see docs for details)"
|
||||
- reset the cache (see docs for details)"
|
||||
|
||||
- alert: IndexDBRecordsDrop
|
||||
expr: increase(vm_indexdb_items_dropped_total[5m]) > 0
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "IndexDB skipped registering items during data ingestion with reason={{ $labels.reason }}."
|
||||
description: |
|
||||
VictoriaMetrics could skip registering new timeseries during ingestion if they fail the validation process.
|
||||
For example, `reason=too_long_item` means that time series cannot exceed 64KB. Please, reduce the number
|
||||
of labels or label values for such series. Or enforce these limits via `-maxLabelsPerTimeseries` and
|
||||
`-maxLabelValueLen` command-line flags.
|
||||
|
||||
- alert: TooManyTSIDMisses
|
||||
expr: increase(vm_missing_tsids_for_metric_id_total[5m]) > 0
|
||||
for: 15m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Unexpected TSID misses for job \"{{ $labels.job }}\" ({{ $labels.instance }}) for the last 15 minutes"
|
||||
description: |
|
||||
Unexpected TSID misses for \"{{ $labels.job }}\" ({{ $labels.instance }}) for the last 15 minutes.
|
||||
If this happens after unclean shutdown of VictoriaMetrics process (via \"kill -9\", OOM or power off),
|
||||
then this is OK - the alert must go away in a few minutes after the restart.
|
||||
Otherwise this may point to the corruption of index data.
|
||||
Reference in New Issue
Block a user