Since the first connection is not closed, vmstorage will never
terminate gracefully, which causes all caches to be reset on the next
start-up.
Follow-up for 244769a00d (#10136)
Signed-off-by: Artem Fetishev <rtm@victoriametrics.com>
The `err` may contain information about request cancellation performed by the server code.
In such cases the error must be logged. The error must be ignored only if the client canceled the request.
This is a follow-up for the commit c9596a0364
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10078
This is to debug cases when the metric name tracker resets the tsid cache
after restart. It could be due to vmstorage not having enough time to stop
gracefully. Logs should provide this info.
Signed-off-by: Artem Fetishev <rtm@victoriametrics.com>
This fixes the following corner case: if all instances of a cache have
zero size, the stats won't be set at all. This results in some weird
graphs if the cache is reset very often (such as tfssCache): the cache
sizeMaxBytes alternates between the actual value and zero.
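The corner case above can be sketched in Go; the type and field names below are illustrative, not the actual VictoriaMetrics code. The sketch assumes an UpdateStats method that accumulates per-instance stats, and shows why it must add them unconditionally, even when the instance size is zero:

```go
package main

import "fmt"

// cacheStats and workingSetCache are illustrative names.
type cacheStats struct {
	SizeBytes    uint64
	SizeMaxBytes uint64
}

type workingSetCache struct {
	sizeBytes    uint64
	sizeMaxBytes uint64
}

func (c *workingSetCache) UpdateStats(s *cacheStats) {
	// A buggy variant would skip the update when sizeBytes == 0:
	//
	//   if c.sizeBytes == 0 { return }
	//
	// The fix is to always accumulate the stats, so sizeMaxBytes doesn't
	// flap between the real value and 0 after frequent cache resets.
	s.SizeBytes += c.sizeBytes
	s.SizeMaxBytes += c.sizeMaxBytes
}

func main() {
	caches := []*workingSetCache{
		{sizeBytes: 0, sizeMaxBytes: 1024}, // just reset, but still has a max size
		{sizeBytes: 0, sizeMaxBytes: 1024},
	}
	var s cacheStats
	for _, c := range caches {
		c.UpdateStats(&s)
	}
	fmt.Println(s.SizeMaxBytes) // 2048, not 0
}
```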
Follow-up for f62893c151
Signed-off-by: Artem Fetishev <rtm@victoriametrics.com>
Commit 5a587f2006 was not properly ported
to the single-node branch. Since the single-node version is able to perform both
promscrape and self-scrape, the metadata-add methods must be added to both
of those paths.
This commit fixes the missing metadata addition to the storage.
Fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10175
Previously a short spike in the number of concurrent requests immediately led to `429 Too Many Requests` errors
when the number of concurrent requests exceeded -maxConcurrentRequests or -maxConcurrentPerUserRequests.
This commit allows processing short spikes in the number of concurrent requests during the -maxQueueDuration timeout.
Requests are rejected only if they cannot be served according to the concurrency limits within -maxQueueDuration.
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10078
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10112
- Introduce backendURLs struct, which holds all the backend urls and allows stopping
all the health checkers across all the backend urls with a single call to backendURLs.stopHealthChecks().
- Immediately cancel the pending Dial call to the backend when backendURLs.stopHealthChecks() is called.
Use lib/netutil.Dialer.DialContext() for this.
- Replace the fragile closing of the stopHealthCheckCh channel via stopHealthCheckOnce.Do()
  with an easier-to-maintain call to the cancel() func of the corresponding healthChecksContext.
- Wait until the health checker goroutines are finished before returning from UserInfo.stopHealthChecks().
  Previously the health checker goroutines could keep running for some time, trying to dial the backend
  after the return from UserInfo.stopHealthChecks().
- Try dialing the broken backend for https urls. It is better if the broken backend logs the error
instead of routing client requests to the broken backend.
- Log dial errors to the broken backend, so users could troubleshoot the backend connectivity issue with more details.
- Refer to the correct issue - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/9997 -
  in the comments explaining why periodic dialing of the broken backend is needed.
  Previously https://github.com/VictoriaMetrics/VictoriaMetrics/issues/9890 was incorrectly referenced.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/9997
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10147
Follow-up for https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10177
Add `vmauth_user_request_backend_requests_total` and
`vmauth_unauthorized_user_request_backend_requests_total`, which track
the number of requests sent to backends and are aligned with
`vmauth_user_requests_total`.
The existing `vmauth_http_request_errors_total` currently only counts
requests with `invalid_auth_token`. Once authorization has passed, any
subsequent request errors are tracked under
`xxx_user_request_backend_requests_total`.
This commit introduces the global `sampleLimit` setting to restrict the number
of samples accepted per scrape target, mirroring the behavior of
Prometheus.
Motivation:
1) The existing `-promscrape.seriesLimitPerTarget` flag currently takes
precedence over any `sample_limit` setting defined directly on the
scrape target. The new `sampleLimit` implementation ensures that the
target configuration is able to override the global setting, allowing
users to define specific limits per target.
2) The existing series limit flag uses memory-intensive Bloom filters,
resulting in high RAM consumption under high-cardinality scraping
scenarios. The `sampleLimit` provides a much simpler, low-overhead
alternative.
Fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10145
The encoding.DecompressZSTD* functions consistently update the vm_zstd_block_decompress_calls_total metric.
Also make the following improvements after the commit 10f7cd2ffc:
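The precedence described in point 1 can be sketched with a hypothetical helper (the function name is illustrative, not the actual vmagent code): a per-target `sample_limit` overrides the global setting, and the global setting applies only when the target doesn't define its own limit.

```go
package main

import "fmt"

// effectiveSampleLimit returns the sample limit to enforce for a scrape
// target: the per-target sample_limit wins over the global setting.
func effectiveSampleLimit(perTarget, global int) int {
	if perTarget > 0 {
		return perTarget
	}
	return global
}

func main() {
	fmt.Println(effectiveSampleLimit(0, 5000))    // 5000: the global setting applies
	fmt.Println(effectiveSampleLimit(1000, 5000)) // 1000: the target config overrides it
}
```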
- Add encoding.DecompressZSTDLimited() function and use it instead of zstd.DecompressLimited,
so it properly updates vm_zstd_block_decompress_calls_total metric.
- Clarify description for the encoding.DecompressZSTD* and zstd.Decompress* functions.
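The pattern above can be sketched as follows. The counter and the decompressBlock stand-in below are illustrative (the real code lives in lib/encoding and calls the actual zstd decompressor); the point is that every Decompress* entry point bumps the same counter, so the metric stays consistent regardless of which variant is called.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// zstdBlockDecompressCalls mirrors the vm_zstd_block_decompress_calls_total metric.
var zstdBlockDecompressCalls atomic.Uint64

// decompressBlock stands in for the real zstd decompression call.
func decompressBlock(dst, src []byte, maxSize int) ([]byte, error) {
	return append(dst, src...), nil
}

// DecompressZSTD decompresses src and updates the calls counter.
func DecompressZSTD(dst, src []byte) ([]byte, error) {
	zstdBlockDecompressCalls.Add(1)
	return decompressBlock(dst, src, 0)
}

// DecompressZSTDLimited is like DecompressZSTD, but limits the size of the
// decompressed data - and it updates the same counter, so the metric
// doesn't miss calls made through the limited variant.
func DecompressZSTDLimited(dst, src []byte, maxSize int) ([]byte, error) {
	zstdBlockDecompressCalls.Add(1)
	return decompressBlock(dst, src, maxSize)
}

func main() {
	_, _ = DecompressZSTD(nil, []byte("a"))
	_, _ = DecompressZSTDLimited(nil, []byte("b"), 64)
	fmt.Println(zstdBlockDecompressCalls.Load()) // 2
}
```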
Currently, `dateMetricIDCache` is reset when it is full, but it is never
reset when it is not full and the data it stores is no longer needed. This leads
to the following problems:
- During regular data ingestion the cache sizeBytes may exceed max
allowed size and the cache gets reset which may potentially slow down
data ingestion (see #10064)
- The cache is per-indexDB. This means that in partition index (#8134)
there will be as many instances of this cache as the number of
partitions. If someone performs a backfill across all partitions, this
will fill all caches and they will never get reset even if no more
historical data is ingested.
So the solution is to periodically rotate the cache. After the first
rotation the data is not deleted but moved to `prev` storage. After the
second rotation `prev` gets deleted. This gives the cache an opportunity
to restore the `prev` data if it is still in use. Based on #10167.
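A minimal Go sketch of this two-generation rotation scheme (illustrative names, not the actual dateMetricIDCache code): a periodic rotation moves curr to prev and drops the old prev, while a lookup that hits prev promotes the entry back into curr, keeping data that is still in use alive across rotations.

```go
package main

import "fmt"

// rotatingCache keeps two generations of entries: curr and prev.
type rotatingCache struct {
	curr map[string]bool
	prev map[string]bool
}

func newRotatingCache() *rotatingCache {
	return &rotatingCache{curr: map[string]bool{}, prev: map[string]bool{}}
}

func (c *rotatingCache) Set(k string) { c.curr[k] = true }

func (c *rotatingCache) Has(k string) bool {
	if c.curr[k] {
		return true
	}
	if c.prev[k] {
		c.curr[k] = true // still in use - restore into curr
		return true
	}
	return false
}

// Rotate is called periodically: the old prev is deleted and curr becomes prev.
func (c *rotatingCache) Rotate() {
	c.prev = c.curr
	c.curr = map[string]bool{}
}

func main() {
	c := newRotatingCache()
	c.Set("2024-05-01/metric1")
	c.Rotate()
	fmt.Println(c.Has("2024-05-01/metric1")) // true: restored from prev
	c.Rotate()
	c.Rotate()
	fmt.Println(c.Has("2024-05-01/metric1")) // false: unused for two rotations, deleted
}
```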
This PR also removes the recently introduced
`-storage.cacheSizeIndexDBDateMetricID` flag (see #10135). This should
be safe since the flag is new and its use case is very niche, i.e. hardly
anyone uses it yet.
---------
Signed-off-by: Artem Fetishev <rtm@victoriametrics.com>
We have `vmauth_user_requests_total` and
`vmauth_unauthorized_user_requests_total` to track requests from the
user side. However, in scenarios such as request timeouts or when the
response code matches `retry_status_code`, a single request may be
retried across multiple backends.
Exposing counters `vmauth_user_request_backend_requests_total` and
`vmauth_unauthorized_user_request_backend_requests_total` that track the
number of requests sent to backends provides insight into the routing
logic and can help identify if requests are being consistently retried,
which may contribute to increased request duration.
Related PR https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10171
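The retry relationship described above can be sketched with a toy counter set (illustrative only - the real vmauth uses the github.com/VictoriaMetrics/metrics package): one user request that is retried across backends produces a single increment of the user-requests counter but several increments of the backend-requests counter, so the ratio between the two exposes the retry behavior.

```go
package main

import (
	"fmt"
	"sync"
)

// counterSet maps a (metric name, username) pair to a counter value.
type counterSet struct {
	mu sync.Mutex
	m  map[string]uint64
}

func newCounterSet() *counterSet { return &counterSet{m: map[string]uint64{}} }

func (cs *counterSet) inc(name, user string) {
	cs.mu.Lock()
	cs.m[fmt.Sprintf(`%s{username=%q}`, name, user)]++
	cs.mu.Unlock()
}

func (cs *counterSet) get(name, user string) uint64 {
	cs.mu.Lock()
	defer cs.mu.Unlock()
	return cs.m[fmt.Sprintf(`%s{username=%q}`, name, user)]
}

func main() {
	cs := newCounterSet()
	// One user request which is retried on a second backend:
	cs.inc("vmauth_user_requests_total", "alice")
	cs.inc("vmauth_user_request_backend_requests_total", "alice")
	cs.inc("vmauth_user_request_backend_requests_total", "alice") // retry
	fmt.Println(cs.get("vmauth_user_requests_total", "alice"))                 // 1
	fmt.Println(cs.get("vmauth_user_request_backend_requests_total", "alice")) // 2
}
```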
Currently, backendErrors may be counted twice if a request to the
backend fails due to context.DeadlineExceeded.
9bc7a17d80/app/vmauth/main.go (L328)
9bc7a17d80/app/vmauth/main.go (L294)
And we increment this counter in a way that is somewhat inconsistent.
Given that the counter's name is `xx_request_backend_errors_total`, it
should only increase when a backend request returns an error. This value
can exceed the user request error count if multiple backend requests
fail for a single user request.
The `xxx_request_backend_errors_total` counter should be used in
conjunction with the `xxx_request_backend_requests_total` introduced in
https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10171.
There is no reason to send a request to the first backend if all
backends are marked as broken.
Also,
> // getFirstAvailableBackendURL returns the first available backendURL, which isn't broken.
The fix only skips a redundant request when all backends are
unavailable; it doesn't introduce any changes from the user's perspective,
so the changelog entry is skipped.
When time series deletion is performed, some of the storage caches
need to be reset while others do not. This PR reviews all storage caches,
documents why each is reset or not, and places all the resetting
logic (and comments) in one place.
### Describe Your Changes
Previously, a backend was considered healthy as soon as its
'bu.brokenDeadline' deadline expired, even if it was still unavailable.
This caused avoidable request failures and retries.
Now vmauth performs a TCP dial (1s timeout) before restoring the backend
to the healthy pool. This avoids routing traffic to backends that are still down.
The dial check also covers cases where a route to the backend cannot be
resolved. Without this check, user requests would hang until the
connection timeout, leading to long waits or errors. The new check
fails fast and doesn't impact real user requests.
Fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/9997
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres to [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/victoriametrics/contributing/#pull-request-checklist).
- [ ] My change adheres to [VictoriaMetrics development
goals](https://docs.victoriametrics.com/victoriametrics/goals/).