### Describe Your Changes
The current one-second timeout for individual read or write operations
during the handshake phase has proven to be insufficient in some
scenarios
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/9345. For
example, short-lived CPU spikes lasting a few seconds can cause
handshake failures due to the low timeout threshold.
While a small timeout may work well in environments with fast and
reliable networking, such as within a single datacenter, it becomes
problematic in more complex setups—particularly in a [multi-level
cluster
setup](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#multi-level-cluster-setup)
where the top-level vmselect may reside in a different availability zone
and work on a less reliable network.
Another issue with the per-operation timeout approach is that it allows
the total time for a handshake to accumulate significantly in the
worst-case scenario. If each operation experiences a delay just under
the timeout threshold, the entire handshake process could take up to 6s.
Which accounts for 60% of `-search.maxQueueDuration` and leaves only 4s
for the actual query.
Introducing a single timeout for the entire handshake process would
provide more predictable behavior and improve usability from a
configuration standpoint. The timeout for the whole handshake op is also
easier to understand from the operator's point of view. Increasing the
timeout value and providing a configuration option for it would make the
system more resilient to transient conditions like CPU contention and
better suited for use cases involving cross-AZ communication.
Fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/9345
### Checklist
The following checks are **mandatory**:
- [x] My change adheres to [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/victoriametrics/contributing/#pull-request-checklist).
- [x] My change adheres to [VictoriaMetrics development
goals](https://docs.victoriametrics.com/victoriametrics/goals/).
Adds a hint to check for errors on the client side when a network error
occurs during the handshake.
Follow-up on commit 53170abdccd2ca3f5952a916c5f544e0e77b5596
This commit modifies the logging behavior for client network errors
(e.g., EOFs, timeouts) during the handshake process. They are now logged
as warnings instead of errors, as they are not actionable from the
server’s perspective. Here's some examples of such errors.
Timeouts during the initial read phase:
2025-04-09T07:08:59.323Z error
VictoriaMetrics/lib/vmselectapi/server.go:204 cannot perform
vmselect handshake with client "<REDACTED>": cannot read hello: cannot
read message with size 11: read tcp4 <REDACTED>-><REDACTED>: i/o
timeout; read only 0 bytes
EOFs occurring later in the handshake process:
2025-04-08T18:01:30.783Z error
VictoriaMetrics/lib/vmselectapi/server.go:204 cannot perform
vmselect handshake with client "<REDACTED>": cannot read isCompressed
flag: cannot read message with size 1: EOF; read only 0 bytes
By logging these as warnings, we reduce noise in error logs while
preserving valuble information for debug.
This commit adds new fields - `requestsCount` and `lastRequestTimestamp`
to series count be metric names stats.
It allows to display an additional stats at explore cardinality page.
Stats will only be added if `storage.trackMetricNameStats` flag is set.
This change requires an update to RPC protocol in order to properly
marshal data.
In addition, this commit adds integration tests to TSDB stats API.
Related issue:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6145
This feature allows to track query requests by metric names. Tracker
state is stored in-memory, capped by 1/100 of allocated memory to the
storage. If cap exceeds, tracker rejects any new items add and instead
registers query requests for already observed metric names.
This feature is disable by default and new flag:
`-storage.trackMetricNamesStats` enables it.
New API added to the select component:
* /api/v1/status/metric_names_stats - which returns a JSON
object
with usage statistics.
* /admin/api/v1/status/metric_names_stats/reset - which resets internal
state of the tracker and reset tsid/cache.
New metrics were added for this feature:
* vm_cache_size_bytes{type="storage/metricNamesUsageTracker"}
* vm_cache_size{type="storage/metricNamesUsageTracker"}
* vm_cache_size_max_bytes{type="storage/metricNamesUsageTracker"}
Related issue:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4458
---------
Signed-off-by: f41gh7 <nik@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
This is a followup for https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4418 to also handle "connection reset by peer" errors in connection handling logic.
This error can be triggered just the same as described in original PR: when query was closed on vmselect side and connection has been interrupted.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
This error may be wrapped in another error, and should normally be tested using
`errors.Is(err, net.ErrClosed)`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The "broken pipe" error is emitted when the connection has been interrupted abruptly.
It could happen due to unexpected network glitch or because connection was
interrupted by remote client. In both cases, remote client will notice
connection breach and handle it on its own. No need in logging this error
on both: server and client side.
This change should reduce the amount of log noise on vmstorage side. In the same time,
it is not expected to lose any information, since important logs should be still
emitted by the vmselect.
To conduct an experiment for testing this change see the following instructions:
1. Setup vmcluster with at least 2 storage nodes, 1 vminsert and 1 vmselect
2. Run vmselect with complexity limit checked on the client side: `-search.maxSamplesPerQuery=1`
3. Ingest some data and query it back: `count({__name__!=""})`
4. Observe the logs on vmselect and vmstorage side
Before the change, vmselect will log message about complexity limits exceeded. When this happens,
vmselect closes network connections to vmstorage nodes signalizing that it doesn't expect any data back.
Both vmstorage processes will try to push data to the connection and will fail with "broken pipe" error,
means that vmselect closed the connection.
After the change, vmstorages should remain silent. And vmselect will continue emittin the error message
about complexity limits exceeded.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
vmselect passes query timeout to vmstorage in seconds.
The commit 20e9598254 treated it as timeout in nanoseconds.
Fix this in order to prevent from the following errors under vmstorage load:
cannot process vmselect request: cannot execute "search_v7": couldn't start executing the request in 0.000 seconds,
since -search.maxConcurrentRequests=... concurrent requests are already executed.
Previously the -maxConcurrentInserts was limiting the number of established client connections,
which write data to VictoriaMetrics. Some of these connections could be idle.
Such connections do not consume big amounts of CPU and RAM, so there is a little sense in limiting
the number of such connections. So now the -maxConcurrentInserts command-line option
limits the number of concurrently executed insert requests, not including idle connections.
It is recommended removing -maxConcurrentInserts command-line option, since the default value
for this option should work good for most cases.
This should prevent from out of memory errors when big number of vmselect
nodes send many concurrent requests to vmstorage
The limit can be controlled at vmstorage via the following command-line flags:
- search.maxConcurrentRequests
- search.maxQueueDuration
See https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#resource-usage-limits
* {app/vmstorage,app/vmselect}: add API to get list of existing tenants
* {app/vmstorage,app/vmselect}: add API to get list of existing tenants
* app/vmselect: fix error message
* {app/vmstorage,app/vmselect}: fix error messages
* app/vmselect: change log level for error handling
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Previously SearchMetricNames was returning unmarshaled metric names.
This wasn't great for vmstorage, which should spend additional CPU time
for marshaling the metric names before sending them to vmselect.
While at it, remove possible duplicate metric names, which could occur when
multiple samples for new time series are ingested via concurrent requests.
Also sort the metric names before returning them to the client.
This simplifies debugging of the returned metric names across repeated requests to /api/v1/series
This opens doors for implementing vmselect api server at vmselect level,
so top-level vmselect could query lower-level vmselect nodes in the same way
as it queries vmstorage nodes.
This will create the ability to create highly available querying architecture
when multiple independent VictoriaMetrics clusters with the same data
are located in distinct availability zones. In this case we can use top-level
vmselect instead of Promxy for simultaneous querying of all the clusters
in all the AZs.