Previously, (*writeconcurrencylimiter.Reader).Read() could permanently leak concurrency tokens from the -maxConcurrentInserts semaphore.
Consider the following example:
* GetReader() acquires a token, then PutReader() unconditionally releases it.
* Read() calls DecConcurrency() before the underlying I/O and IncConcurrency() after it. If IncConcurrency() returns an error, Read() returns without holding a token.
* Each such failure permanently removes one slot from the concurrencyLimitCh semaphore. Slots leak one by one until the channel is fully drained, at which point DecConcurrency() blocks forever, deadlocking ingestion on vmstorage.
This commit adds tracking of the obtained token to the reader, which prevents the possible token leak.
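A minimal sketch of the token-ownership tracking, with simplified names and a channel-modeled semaphore (the real lib/writeconcurrencylimiter types differ):

```go
package limiter

import (
	"io"
	"time"
)

// concurrencyLimitCh is an illustrative channel-based semaphore:
// holding a token means having a value stored in the channel.
var concurrencyLimitCh = make(chan struct{}, 16)

// reader remembers whether it currently holds a token, so the token
// is never released twice and never silently dropped.
type reader struct {
	r        io.Reader
	hasToken bool
}

// Read releases the token around the blocking I/O and re-acquires it
// afterwards, updating hasToken to reflect actual ownership.
func (r *reader) Read(p []byte) (int, error) {
	if r.hasToken {
		<-concurrencyLimitCh // release the token for the I/O duration
		r.hasToken = false
	}
	n, err := r.r.Read(p)
	select {
	case concurrencyLimitCh <- struct{}{}: // re-acquire the token
		r.hasToken = true
	case <-time.After(time.Second):
		// Re-acquisition failed; hasToken stays false, so putReader()
		// won't release a token the reader doesn't hold.
	}
	return n, err
}

// putReader releases the token only if the reader still holds it,
// instead of unconditionally.
func putReader(r *reader) {
	if r.hasToken {
		<-concurrencyLimitCh
		r.hasToken = false
	}
}
```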
Fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10784
This reverts commit b3c03c023c.
Reason for revert: the original logic was correct from the user's perspective:
- The -maxRequestBodySizeToRetry command-line flag controls the size of the request body,
which could be retried on backend failure. The meaning of this flag wasn't changed after
the introduction of the -requestBufferSize flag in the commit e31abfc25c
(see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10309 )
- The -requestBufferSize flag controls the size of the buffer for reading the request body
before sending it to the backend and before applying concurrency limits.
These flags are independent from the user's perspective. The fact that these flags share the implementation
shouldn't be known to the user - this is an implementation detail, which allows avoiding double buffering.
Both flags enable request buffering. If the user wants to disable all request buffering,
then both flags must be set to 0. That's why these flags are cross-mentioned in their -help descriptions.
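A sketch of how the two flags can share one buffer without double buffering; the resolution logic below is an assumption for illustration, not the actual vmauth code:

```go
package main

import (
	"flag"
	"fmt"
)

// Flag defaults mirror the values described above; the descriptions
// and resolution logic are illustrative.
var (
	maxRequestBodySizeToRetry = flag.Int("maxRequestBodySizeToRetry", 16*1024,
		"Maximum request body size which can be buffered and retried on backend failure; 0 disables retry buffering")
	requestBufferSize = flag.Int("requestBufferSize", 32*1024,
		"Buffer size for reading the request body before sending it to the backend; 0 disables read buffering")
)

// effectiveBufferSize returns the size of the single shared buffer.
// Sizing one buffer to the larger of the two flags avoids double
// buffering while keeping the flags independent for the user:
// buffering is fully disabled only when both flags are 0.
func effectiveBufferSize() int {
	n := *maxRequestBodySizeToRetry
	if *requestBufferSize > n {
		n = *requestBufferSize
	}
	return n
}

func main() {
	flag.Parse()
	fmt.Println("shared buffer size:", effectiveBufferSize())
}
```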
Also the reverted commit had the following issues:
- It reduced the default value for the -requestBufferSize flag from 32KiB to 16KiB.
The 32KiB value has been calculated and justified at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10309 .
It shouldn't increase vmauth memory usage too much for typical workloads.
For example, if vmauth handles 10K concurrent requests, then the memory overhead for the request buffering
will be 10K*32KiB=320MiB. This is a small price for being able to efficiently handle 10K concurrent requests.
- It added a dot to the end of the https://docs.victoriametrics.com/victoriametrics/vmauth/#request-body-buffering link
in the description of the -requestBufferSize flag. This breaks clicking the link in some environments,
since the trailing dot is considered a part of the URL.
- It added a superfluous whitespace in front of the 'Disabling request buffering' text inside the description
of the -requestBufferSize flag.
- It introduced unnecessary complexity for the user by mentioning that the zero value
at -maxBufferSize disables buffering for request retries (these things must be independent
from the user's perspective).
- It changed the bufferedBody logic in non-trivial ways, which aren't related to the original issue.
If these changes are needed, then they must be justified in a separate issue and must be prepared
in a separate pull request / commit.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10675
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10677
Previously, Storage.table was initialized after startFreeDiskSpaceWatcher was called.
This created a potential race condition: if openTable took a long time to complete
and disk space was freed during that window, the free disk space watcher could read an
uninitialized (or partially initialized) Storage.table, leading to an 'invalid memory
address or nil pointer dereference' panic.
This commit properly initializes s.isReadOnly state during storage start and
starts FreeDiskSpaceWatcher after openTable.
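A simplified sketch of the corrected startup ordering (type names and the watcher body are illustrative):

```go
package storage

import (
	"sync/atomic"
	"time"
)

type table struct{ /* ... */ }

func mustOpenTable(path string) *table { return &table{} }

// Storage is a simplified stand-in for the real type.
type Storage struct {
	path       string
	table      *table
	isReadOnly atomic.Bool
	stopCh     chan struct{}
}

// MustStart initializes everything the watcher reads before the
// watcher goroutine is started, so the watcher can never observe a
// nil (or partially initialized) s.table.
func (s *Storage) MustStart() {
	s.isReadOnly.Store(false)       // initialize read-only state during start
	s.table = mustOpenTable(s.path) // may take a long time
	s.startFreeDiskSpaceWatcher()   // safe: s.table is fully set by now
}

func (s *Storage) startFreeDiskSpaceWatcher() {
	go func() {
		t := time.NewTicker(time.Second)
		defer t.Stop()
		for {
			select {
			case <-s.stopCh:
				return
			case <-t.C:
				// Reading s.table here is safe because this goroutine
				// starts only after MustStart assigned it.
				_ = s.table
			}
		}
	}()
}
```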
The bug was introduced in github.com/VictoriaMetrics/VictoriaMetrics/commit/27b958ba8bc66578206ddac26ccf47b2cc3e8101
Fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10747
Align group evaluation time with the `eval_offset` option to allow users
to manage group execution more effectively by understanding the exact
time each group will be scheduled, particularly in cases of spreading
rule execution within a window, chaining groups, or debugging data delay
issues.
If the group evaluation takes less than the group interval, but the
initial evaluation combined with the additional restore operation
exceeds the group interval, the evaluation time will be gradually
corrected in subsequent evaluations, as the interval ticker schedule
remains unchanged.
For groups without `eval_offset`, this change also ensures that all
evaluations follow the interval. Previously, the gap between the first
and second evaluations was larger than the interval. The `eval_delay`
option continues to help prevent partial responses.
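A sketch of the alignment logic (the function name and rounding details are illustrative, not vmalert's actual scheduling code):

```go
package main

import (
	"fmt"
	"time"
)

// nextEvalTime returns the next evaluation timestamp aligned to the
// group interval plus eval_offset. Because the result depends only on
// the wall clock, the schedule stays predictable even if an evaluation
// (or the rules state restore) overruns the interval.
func nextEvalTime(now time.Time, interval, offset time.Duration) time.Time {
	aligned := now.Truncate(interval).Add(offset)
	for !aligned.After(now) {
		aligned = aligned.Add(interval)
	}
	return aligned
}

func main() {
	now := time.Date(2025, 1, 1, 10, 7, 0, 0, time.UTC)
	// interval=5m, eval_offset=1m: the group always fires at minutes
	// 1, 6, 11, ... past the hour, so the next run is 10:11:00.
	fmt.Println(nextEvalTime(now, 5*time.Minute, time.Minute))
}
```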
fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10772.
Follow-up commit for
211fb08028
Address @f41gh7's review comments:
- Move code from `lib/osinfo` to `lib/appmetrics`.
- Make the logic private.
- Use metrics.WriteGaugeUint64 func.
- Remove registration logic from `app/xxx/main.go`.
- Remove `lib/osinfo` package.
At 00:00 UTC the ingested samples start to have timestamps for the new
day (the ingested samples are always recent). Even though there was a
next-day prefill of the per-day index during the last hour of the day,
some performance degradation is still possible.
For example, in https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10698
it is manifested as `vminsert-to-vmstorage connection saturation` peaks
right after midnight.
A possible hypothesis for why this happens: at midnight,
currHourMetricIDs is empty and prevHourMetricIDs cannot be used because
it holds metricIDs for the previous day. So the ingestion logic hits
dateMetricIDsCache, which may not have the metricID in its read-only
buffer and therefore must acquire a lock to check its prev read-only
buffer or read-write buffer. This creates lock contention and therefore
raises ingestion request latency.
A solution to this could be re-using the nextDayMetricIDs during the
first hour of the day. During this time, it is equivalent to
currHourMetricIDs.
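A simplified sketch of the proposed reuse (the real code uses uint64set.Set and atomic snapshots rather than plain maps):

```go
package storage

const hoursPerDay = 24

// Storage fields are simplified stand-ins for the real ones.
type Storage struct {
	currHour             uint64
	currHourMetricIDs    map[uint64]struct{}
	nextDayMetricIDs     map[uint64]struct{}
	nextDayMetricIDsDate uint64 // days since Unix epoch the prefill targets
}

// getHourMetricIDs returns the in-memory set to consult for the given
// hour index (hours since Unix epoch). During the first hour of a day
// the next-day set prefilled yesterday covers the same metricIDs as
// currHourMetricIDs, so reusing it avoids hitting dateMetricIDsCache
// (and its lock) right after midnight.
func (s *Storage) getHourMetricIDs(hour uint64) map[uint64]struct{} {
	if hour%hoursPerDay == 0 && hour/hoursPerDay == s.nextDayMetricIDsDate {
		// First hour of the day: currHourMetricIDs has just been reset,
		// but the prefilled next-day set is equivalent to it.
		return s.nextDayMetricIDs
	}
	if hour == s.currHour {
		return s.currHourMetricIDs
	}
	return nil // caller falls back to dateMetricIDsCache
}
```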
---------
Signed-off-by: Artem Fetishev <rtm@victoriametrics.com>
Signed-off-by: Artem Fetishev <149964189+rtm0@users.noreply.github.com>
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
This change reverts part of the changes in
https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10686
Motivation: the docs added in https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10686 are in most cases too verbose, AI-generated and of low practical value.
The improvement goal: remove bloat from the docs and keep them practical and useful.
What it does:
- Completely removes items from the sidebar
- Moves the content of the most important playground pages to the
`/playground/` stub (README.md), using an H2 for each playground.
- Updates and cleans the text.
- Removes the individual child pages in the playground category (keeping
only the `/playgrounds/` page/stub).
- Removes items as these don't really need much introduction or aren't
playgrounds:
- log to logsql: a conversion tool
- sql to logsql: same
- Adds a Grafana playground section
Links to child pages will become invalid. We don't preserve them, as this doc is pretty new (one week in production) and is unlikely to have links already persisted somewhere.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
This commit adds the new metrics `vmalert_remotewrite_queue_capacity` and `vmalert_remotewrite_queue_size`, which are updated on each push, so their
update frequency depends on `-remoteWrite.concurrency` and
`-remoteWrite.flushInterval`.
While the metrics don't account for the pending data within each pusher's request, they
should provide a general indication of the queue usage.
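A sketch of how such gauges can be updated on each push, assuming a channel-modeled queue and an illustrative url label (the actual vmalert queue and labels differ):

```go
package remotewrite

import "github.com/VictoriaMetrics/metrics"

// Queue usage gauges. The url label value below is an assumption for
// illustration; the real vmalert code may label the metrics differently.
var (
	queueCapacity = metrics.NewGauge(`vmalert_remotewrite_queue_capacity{url="http://victoria:8428/api/v1/write"}`, nil)
	queueSize     = metrics.NewGauge(`vmalert_remotewrite_queue_size{url="http://victoria:8428/api/v1/write"}`, nil)
)

// push enqueues a batch and refreshes the gauges, so their update
// frequency follows the push frequency, which in turn depends on
// -remoteWrite.concurrency and -remoteWrite.flushInterval.
func push(queue chan []byte, batch []byte) bool {
	select {
	case queue <- batch:
		queueCapacity.Set(float64(cap(queue)))
		queueSize.Set(float64(len(queue)))
		return true
	default:
		return false // queue is full: the batch is dropped
	}
}
```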
Related PR https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10765
Extract repeated code from nextDayMetricIDs synctests into separate
funcs to make the code more readable.
The change was originally introduced in
https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10704 and was
extracted into a separate PR to keep the original change simple.
Previously, vminsert did not account for the ingest concurrency limit in buffer size calculation.
This could lead to excessively large buffers and OOM errors when the concurrency limit was reached.
This commit fixes buffer size calculation by separating `insertCtx` and `storageNode` buffer size limits.
`storageNode` buffer size is set to a larger value, as it is allocated per configured `-storageNode`
and is independent of the concurrency limit.
`insertCtx` buffer size now accounts for the configured concurrency limit
and calculates the maximum buffer size accordingly.
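An illustrative sketch of the split limits; the /4 memory fraction and function names are assumptions, not the actual vminsert formula:

```go
package main

import "fmt"

// storageNode buffers are allocated per configured -storageNode and
// don't depend on the concurrency limit, so they may be larger;
// insertCtx buffers exist per concurrent insert, so their limit must
// divide the memory budget by the -maxConcurrentInserts limit.
func maxInsertCtxBufSize(memoryAllowed, maxConcurrentInserts int) int {
	return memoryAllowed / 4 / maxConcurrentInserts
}

func maxStorageNodeBufSize(memoryAllowed, storageNodes int) int {
	return memoryAllowed / 4 / storageNodes
}

func main() {
	// e.g. 4GiB allowed, 128 concurrent inserts -> 8MiB per insertCtx;
	// 8 storage nodes -> 128MiB per storageNode buffer.
	fmt.Println(maxInsertCtxBufSize(4<<30, 128), maxStorageNodeBufSize(4<<30, 8))
}
```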
fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10725
Previously, vmselect in cluster-native mode could return partial responses to the upstream vmselect.
Since the upstream vmselect expects full responses (mimicking vmstorage behavior),
partial responses must be disabled in cluster-native mode.
This prevents incomplete responses from being cached at the upstream vmselect level.
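A minimal sketch of the guard, with simplified names (the real vmselect request handling differs):

```go
package vmselect

// requestCtx is a simplified stand-in for the real vmselect request state.
type requestCtx struct {
	// isClusternative is true when the request comes from an upstream
	// vmselect rather than from an end user.
	isClusternative     bool
	denyPartialResponse bool
}

// applyPartialResponsePolicy forces full responses for cluster-native
// requests: the upstream vmselect treats this node like vmstorage and
// would otherwise cache incomplete results.
func applyPartialResponsePolicy(rc *requestCtx, userDeniesPartial bool) {
	rc.denyPartialResponse = userDeniesPartial || rc.isClusternative
}
```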
Fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10678
Automatically set daily and hourly series limits to `MaxInt32` when `-remoteWrite.maxHourlySeries` or `-remoteWrite.maxDailySeries` is set to `-1`.
This change addresses a usability issue with the cardinality limiter. Users may want to enable the limiter to observe its metrics before deciding on an appropriate limit. However, the underlying bloom filter only supports `int32`, so setting large values can lead to overflow.
With this PR:
* Setting either flag to `-1` is treated as “no practical limit” and internally mapped to `math.MaxInt32`
* Values exceeding `int32` are safely clamped to `MaxInt32` to prevent overflow
This allows users to enable the limiter for estimation purposes without risking invalid configurations or runtime issues.
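A sketch of the normalization, assuming a hypothetical helper name:

```go
package main

import (
	"fmt"
	"math"
)

// normalizeSeriesLimit maps a user-supplied limit to the value handed
// to the int32-based bloom filter: -1 means "no practical limit" and
// values beyond the int32 range are clamped to avoid overflow.
func normalizeSeriesLimit(n int64) int32 {
	if n == -1 || n > math.MaxInt32 {
		return math.MaxInt32
	}
	return int32(n)
}

func main() {
	fmt.Println(normalizeSeriesLimit(-1))        // 2147483647: estimation mode
	fmt.Println(normalizeSeriesLimit(5_000_000)) // 5000000: used as-is
	fmt.Println(normalizeSeriesLimit(1 << 40))   // clamped to 2147483647
}
```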
https://github.com/VictoriaMetrics/VictoriaMetrics/pull/9614
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Signed-off-by: Max Kotliar <kotlyar.maksim@gmail.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
Co-authored-by: Max Kotliar <mkotlyar@victoriametrics.com>
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Previously, the last scrape result was updated unconditionally, despite possible scrape errors.
This commit updates the last scrape result only on a successful scrape. This properly accounts for the `scrape_series_added` metric and aligns it with the same metric in Prometheus.
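A simplified sketch of the change (names and the diff logic are illustrative):

```go
package scraper

// scrapeWork is a simplified stand-in for the real scraper state.
type scrapeWork struct {
	lastScrape  []byte // body of the last successful scrape
	seriesAdded int
}

// countNewSeries would diff the current response against the previous
// one; the real logic is elided here.
func countNewSeries(prev, cur []byte) int { return 0 }

// processScrapeResult stores the scrape body only on success, so
// scrape_series_added is always computed against the last successful
// scrape, matching Prometheus behavior.
func (sw *scrapeWork) processScrapeResult(body []byte, err error) {
	if err != nil {
		return // keep the previous successful result on scrape errors
	}
	sw.seriesAdded = countNewSeries(sw.lastScrape, body)
	sw.lastScrape = body
}
```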
fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10653
The previously introduced `requestBufferSize` flag raised the default value of
the in-memory buffer from 16KB to 32KB. This could increase memory usage for
vmauth. It also made it unclear how to actually disable request buffering.
This commit aligns the flag's value back to 16KB and disables request
buffering if either flag's value is 0, as mentioned in the flag descriptions.
If either flag has a non-default value, that value is used as the max size
for the request buffer. If both flags are modified, the bigger value wins.
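A sketch of the "0 disables buffering" behavior, with illustrative names and structure:

```go
package main

import (
	"bytes"
	"io"
)

// wrapRequestBody buffers up to bufSize bytes of the request body so
// the request can be re-sent to another backend on failure. bufSize<=0
// disables buffering: the body is streamed directly and the request is
// not retryable.
func wrapRequestBody(body io.Reader, bufSize int) (r io.Reader, retryable bool) {
	if bufSize <= 0 {
		return body, false
	}
	prefix, err := io.ReadAll(io.LimitReader(body, int64(bufSize)))
	if err != nil {
		return body, false
	}
	// The request is retryable only if the whole body fit in the buffer.
	retryable = len(prefix) < bufSize
	return io.MultiReader(bytes.NewReader(prefix), body), retryable
}
```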
Fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/10675
I expect the change to help in two ways:
1. Spreading remote write flushes over the flush interval to avoid
congestion at the remote write destination;
2. Enhancing queue data consumption. Currently, all flushers may always
flush data simultaneously, resulting in periods where no flushers are
consuming data from the queue, which increases the risk of reaching the
queue limit `remoteWrite.maxQueueSize` even with an increased
`remoteWrite.concurrency`. By making the flushers more dispersed, it is
more likely that some flushers are consistently consuming data from the
queue, which should make queue management easier (see the sketch below).
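A sketch of how flusher start times can be staggered (the actual vmagent jitter logic may differ):

```go
package main

import (
	"fmt"
	"time"
)

// startFlushers staggers flusher start times evenly across the flush
// interval, so at any moment some flusher is likely consuming from the
// queue instead of all flushers flushing (then sleeping) in lockstep.
func startFlushers(concurrency int, interval time.Duration, flush func(worker int)) {
	for i := 0; i < concurrency; i++ {
		go func(i int) {
			// Offset flusher i by i/concurrency of the interval.
			time.Sleep(interval / time.Duration(concurrency) * time.Duration(i))
			t := time.NewTicker(interval)
			defer t.Stop()
			for range t.C {
				flush(i)
			}
		}(i)
	}
}

func main() {
	startFlushers(4, time.Second, func(w int) { fmt.Println("flush from worker", w) })
	time.Sleep(2500 * time.Millisecond)
}
```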
Related PR https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10729/
Changes:
- Added the number of `pending alerts` and `firing alerts`.
- Improved `transformations` for the 'FIRING over time by group and rules' panel.
- Added sorting for the 'FIRING over time by rule' panel.
Signed-off-by: sias32 <sias.32@yandex.ru>
Co-authored-by: Max Kotliar <mkotlyar@victoriametrics.com>
Replace the 1.2 multiplier with 1.25 in the disk space estimation formula.
1.2 only provides ~16.7% free space, while the docs recommend keeping
20%. Using 1.25 correctly accounts for 20% free space.
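The arithmetic: with required capacity c = k*d for data size d, the free-space fraction is (k-1)/k:

```latex
\frac{k-1}{k}\Big|_{k=1.2} = \frac{0.2}{1.2} \approx 16.7\%,
\qquad
\frac{k-1}{k}\Big|_{k=1.25} = \frac{0.25}{1.25} = 20\%
```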
Inspired by
https://github.com/VictoriaMetrics/VictoriaMetrics/pull/10394
Add per-URL `-remoteWrite.disableMetadata` flag to control metadata
sending for each remote storage independently.
After v1.137.0 enabled `-enableMetadata` by default, metadata is sent to
ALL remote write targets, even those with relabeling filters that drop
most metrics. This causes unnecessary growth in
`vmagent_remotewrite_requests_total` and a significant increase in
network load for heavily filtered remote write destinations.
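A sketch of per-URL flag handling using a comma-separated list; vmagent's real implementation relies on its own flagutil helpers, so names and parsing below are assumptions:

```go
package main

import (
	"flag"
	"fmt"
	"strconv"
	"strings"
)

// disableMetadata sketches a per-URL boolean flag as a comma-separated
// list, one entry per configured -remoteWrite.url.
var disableMetadata = flag.String("remoteWrite.disableMetadata", "",
	"Comma-separated booleans; disables metadata sending for the corresponding -remoteWrite.url")

// metadataDisabledForURL reports whether metadata sending is disabled
// for the -remoteWrite.url at the given index. Missing or malformed
// entries default to false (metadata stays enabled).
func metadataDisabledForURL(idx int) bool {
	vals := strings.Split(*disableMetadata, ",")
	if idx >= len(vals) {
		return false
	}
	v, err := strconv.ParseBool(strings.TrimSpace(vals[idx]))
	return err == nil && v
}

func main() {
	flag.Parse()
	// e.g. -remoteWrite.disableMetadata=true,false keeps metadata only
	// for the second -remoteWrite.url.
	for i := 0; i < 2; i++ {
		fmt.Printf("url #%d: metadata disabled=%v\n", i+1, metadataDisabledForURL(i))
	}
}
```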