Compare commits

...

3 Commits

Author SHA1 Message Date
Vadim Rutkovsky
f6f2b8b025 deployment/docker/rules: add VMSelectConcurrentQueriesExceedMemoryLimit alert
Warn users when cluster is misconfigured to allow too many concurrent selects
2026-07-01 12:53:03 +02:00
Max Kotliar
8e5ca7e9b0 docs/changelog: fix bugfixs order 2026-07-01 12:32:05 +03:00
Max Kotliar
119ba0fb5b app/vmauth: slow down unauthorized responses to mitigate brute-force attacks (#11195)
Without a delay, an attacker can attempt thousands of authentication requests per second at minimal cost. The only indication of such an attack is the `vmauth_http_request_errors_total{reason="invalid_auth_token"}` metric, which may go unnoticed if it isn't being monitored.

Introduce a random delay of 2–3 seconds before returning 401 Unauthorized responses. This significantly increases the cost of sustaining a brute-force attack while having a negligible impact on legitimate clients, which are not expected to generate unauthorized
requests frequently. 

OWASP Top10 also recommends adding delays on failed auth: https://owasp.org/Top10/2025/A07_2025-Authentication_Failures.

The added delay could itself be abused as part of a DoS attack by tying up request-processing resources. However, such an attack is easier to detect because it generates sustained unauthorized traffic from identifiable source IPs, allowing operators to block or rate-limit the offending clients using specialized tools.

Fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/11180

---------

Signed-off-by: Nikolay <nik@victoriametrics.com>
Co-authored-by: f41gh7 <nik@victoriametrics.com>
2026-07-01 12:31:06 +03:00
4 changed files with 44 additions and 5 deletions

View File

@@ -6,6 +6,7 @@ import (
"flag"
"fmt"
"io"
"math/rand/v2"
"net"
"net/http"
"net/textproto"
@@ -31,6 +32,7 @@ import (
"github.com/VictoriaMetrics/VictoriaMetrics/lib/procutil"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/promauth"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/pushmetrics"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/timerpool"
)
var (
@@ -206,6 +208,7 @@ func requestHandler(w http.ResponseWriter, r *http.Request) bool {
}
invalidAuthTokenRequests.Inc()
slowdownUnauthorizedResponse(r)
if *logInvalidAuthTokens {
err := fmt.Errorf("cannot authorize request with auth tokens %q", ats)
err = &httpserver.ErrorWithStatusCode{
@@ -889,3 +892,20 @@ func debugInfo(u *url.URL, r *http.Request) string {
fmt.Fprint(s, ")")
return s.String()
}
// slowdownUnauthorizedResponse adds a random delay in the [2..3] seconds range before returning an unauthorized response.
// This reduces the effectiveness of brute-force.
//
// Recommended by OWASP Top10:
// https://owasp.org/Top10/2025/A07_2025-Authentication_Failures
func slowdownUnauthorizedResponse(r *http.Request) {
d := 2*time.Second + time.Duration(rand.IntN(1000))*time.Millisecond
t := timerpool.Get(d)
select {
case <-t.C:
case <-r.Context().Done():
}
timerpool.Put(t)
}

View File

@@ -1687,6 +1687,10 @@ func assertInstantValues(tss []*timeseries) {
var memoryIntensiveQueries = metrics.NewCounter(`vm_memory_intensive_queries_total`)
var _ = metrics.NewGauge(`vm_max_memory_per_query`, func() float64 {
return float64(maxMemoryPerQuery.N)
})
func evalRollupFuncWithMetricExpr(qt *querytracer.Tracer, ec *EvalConfig, funcName string, rf rollupFunc,
expr metricsql.Expr, me *metricsql.MetricExpr, iafc *incrementalAggrFuncContext, windowExpr *metricsql.DurationExpr,
) ([]*timeseries, error) {

View File

@@ -223,4 +223,16 @@ groups:
Unexpected TSID misses for \"{{ $labels.job }}\" ({{ $labels.instance }}) for the last 15 minutes.
If this happens after unclean shutdown of VictoriaMetrics process (via \"kill -9\", OOM or power off),
then this is OK - the alert must go away in a few minutes after the restart.
Otherwise this may point to the corruption of index data.
Otherwise this may point to the corruption of index data.
- alert: VMSelectConcurrentQueriesExceedMemoryLimit
expr: (vm_max_memory_per_query * on(job, instance) vm_concurrent_select_capacity) > on(job, instance) vm_available_memory_bytes
for: 5m
labels:
severity: warning
annotations:
summary: "vmselect ({{ $labels.instance }}) concurrent query memory may exceed pod limit"
description: "Current concurrent queries ({{ $value | humanize1024 }} combined max memory) exceed
the available memory on instance {{ $labels.instance }}.
This may result in OOM kills. Consider reducing -maxConcurrentRequests,
lowering -maxMemoryPerQuery, or scaling up pod memory limits."

View File

@@ -28,14 +28,17 @@ See also [LTS releases](https://docs.victoriametrics.com/victoriametrics/lts-rel
* SECURITY: upgrade base docker image (Alpine) from 3.23.4 to 3.24.1. See [Alpine 3.24.1 release notes](https://www.alpinelinux.org/posts/Alpine-3.24.1-released.html).
* BUGFIX: all VictoriaMetrics components: cancel in-flight HTTP requests shortly before `-http.maxGracefulShutdownDuration` elapses during graceful shutdown, so they can drain and the shutdown completes cleanly within that window instead of timing out and exiting via `logger.Fatalf` -> `os.Exit`. This prevents skipping the storage flush and losing in-memory data when long-lived requests are in flight (such as VictoriaLogs live tailing). See [#1502](https://github.com/VictoriaMetrics/VictoriaLogs/issues/1502).
* BUGFIX: `vminsert` in [VictoriaMetrics cluster](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/): fixes unexpected rare rerouting. See [#11162](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/11162).
* BUGFIX: `vmselect` in [VictoriaMetrics cluster](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/): propagate cache reset operation to `selectNode` when `/internal/resetRollupResultCache` is called. Previously, the propagation only happened when the `delete_series` API was called. See [#11112](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/11112).
* BUGFIX: [stream aggregation](https://docs.victoriametrics.com/victoriametrics/stream-aggregation/): fix possible unexpected increases in `rate_avg` and `rate_sum` if an out-of-order sample is ingested after the previous flush. See [#11140](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/11140).
* FEATURE: `vmselect` in [VictoriaMetrics cluster](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/): expose `vm_max_memory_per_query` metric reflecting the `-search.maxMemoryPerQuery` limit. Create `VMSelectConcurrentQueriesExceedMemoryLimit` alert to warn when OOMs are possible due to misconfiguration of `-search.maxMemoryPerQuery` and max concurrent queries.
* FEATURE: [vmauth](https://docs.victoriametrics.com/victoriametrics/vmauth/): add `default_vm_access_claim` field into `jwt` section of auth config. It could be used at [JWT claim placeholders](https://docs.victoriametrics.com/victoriametrics/vmauth/#jwt-claim-based-request-templating), if `JWT` token doesn't have `vm_access` claim. See [#11054](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/11054).
* FEATURE: [vmagent](https://docs.victoriametrics.com/victoriametrics/vmagent/): reduces CPU usage by 10% at [sharding among remote storages](https://docs.victoriametrics.com/victoriametrics/vmagent/#sharding-among-remote-storages). See [#11113](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/11113). Thanks to @bennf for contribution.
* FEATURE: [vmsingle](https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/) and `vmselect` in [VictoriaMetrics cluster](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/): add `optimize_repeated_binary_op_subexprs=1` query arg to [/api/v1/query_range](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#range-query) for executing binary operator sides sequentially when they share the same optimized aggregate rollup result expression. This allows the second side to reuse rollup result cache populated by the first side. See [#10575](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10575).
* FEATURE: [vmauth](https://docs.victoriametrics.com/victoriametrics/vmauth/): prevent possible password brute-force attacks with an artificial 2-3 second delay as recommended by [OWASP](https://owasp.org/Top10/2025/A07_2025-Authentication_Failures). See [#11180](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/11180).
* BUGFIX: all VictoriaMetrics components: cancel in-flight HTTP requests shortly before `-http.maxGracefulShutdownDuration` elapses during graceful shutdown, so they can drain and the shutdown completes cleanly within that window instead of timing out and exiting via `logger.Fatalf` -> `os.Exit`. This prevents skipping the storage flush and losing in-memory data when long-lived requests are in flight (such as VictoriaLogs live tailing). See [#1502](https://github.com/VictoriaMetrics/VictoriaLogs/issues/1502).
* BUGFIX: `vminsert` in [VictoriaMetrics cluster](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/): fixes unexpected rare rerouting. See [#11162](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/11162).
* BUGFIX: `vmselect` in [VictoriaMetrics cluster](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/): propagate cache reset operation to `selectNode` when `/internal/resetRollupResultCache` is called. Previously, the propagation only happened when the `delete_series` API was called. See [#11112](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/11112).
* BUGFIX: [stream aggregation](https://docs.victoriametrics.com/victoriametrics/stream-aggregation/): fix possible unexpected increases in `rate_avg` and `rate_sum` if an out-of-order sample is ingested after the previous flush. See [#11140](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/11140).
## [v1.146.0](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.146.0)