VictoriaMetrics

mirror of https://github.com/VictoriaMetrics/VictoriaMetrics.git synced 2026-06-23 02:28:07 +03:00

Author	SHA1	Message	Date
Max Kotliar	5eae13fbe9	lib/handshake: set deadline for whole handshake; change deadline (1s per op to 3s whole process) (#9541) The current one-second timeout for individual read or write operations during the handshake phase has proven to be insufficient in some scenarios; see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/9345. For example, short-lived CPU spikes lasting a few seconds can cause handshake failures due to the low timeout threshold. While a small timeout may work well in environments with fast and reliable networking, such as within a single datacenter, it becomes problematic in more complex setups, particularly in a [multi-level cluster setup](https://docs.victoriametrics.com/victoriametrics/cluster-victoriametrics/#multi-level-cluster-setup) where the top-level vmselect may reside in a different availability zone and work on a less reliable network. Another issue with the per-operation timeout approach is that it allows the total time for a handshake to accumulate significantly in the worst-case scenario. If each operation experiences a delay just under the timeout threshold, the entire handshake process could take up to 6s, which accounts for 60% of `-search.maxQueueDuration` and leaves only 4s for the actual query. Introducing a single timeout for the entire handshake process would provide more predictable behavior and improve usability from a configuration standpoint. The timeout for the whole handshake op is also easier to understand from the operator's point of view. Increasing the timeout value and providing a configuration option for it would make the system more resilient to transient conditions like CPU contention and better suited for use cases involving cross-AZ communication. Fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/9345	2025-08-11 19:30:03 +03:00
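The difference between a per-operation timeout and a single deadline for the whole handshake can be sketched in Go as follows. This is a minimal illustration and not the code from `lib/handshake`: the function names `doHandshake`, `isTimeout` and `demo`, the 5-byte hello message and the 100ms demo timeout are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"time"
)

// doHandshake is a hypothetical, simplified server-side handshake.
// A single deadline is set up front and covers every read and write below,
// so the total handshake time cannot exceed the given timeout.
func doHandshake(c net.Conn, timeout time.Duration) error {
	if err := c.SetDeadline(time.Now().Add(timeout)); err != nil {
		return fmt.Errorf("cannot set handshake deadline: %w", err)
	}
	// Clear the deadline so it does not affect traffic after the handshake.
	defer c.SetDeadline(time.Time{})

	hello := make([]byte, 5)
	if _, err := io.ReadFull(c, hello); err != nil {
		return fmt.Errorf("cannot read hello: %w", err)
	}
	if _, err := c.Write([]byte("ok")); err != nil {
		return fmt.Errorf("cannot write reply: %w", err)
	}
	return nil
}

// isTimeout reports whether err wraps a network timeout.
func isTimeout(err error) bool {
	var ne net.Error
	return errors.As(err, &ne) && ne.Timeout()
}

// demo runs the handshake over an in-memory connection.
// When clientSendsHello is false, the client stays silent and the
// server side is expected to hit the handshake deadline.
func demo(clientSendsHello bool) error {
	srv, cli := net.Pipe()
	defer srv.Close()
	defer cli.Close()
	if clientSendsHello {
		go func() {
			cli.Write([]byte("hello"))
			io.ReadFull(cli, make([]byte, 2))
		}()
	}
	return doHandshake(srv, 100*time.Millisecond)
}

func main() {
	fmt.Println("responsive client:", demo(true))
	fmt.Println("silent client timed out:", isTimeout(demo(false)))
}
```

With per-operation timeouts, each read or write would get its own fresh deadline, so the worst-case total grows with the number of operations. Setting one deadline before the first operation bounds the whole exchange.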
Max Kotliar	4bd258a36d	lib/handshake: log client network errors during handshake as warnings (follow up) Adds a hint to check for errors on the client side when a network error occurs during the handshake. Follow-up on commit 53170abdccd2ca3f5952a916c5f544e0e77b5596	2025-05-06 12:01:06 +02:00
Aliaksandr Valialkin	bcacf4c28b	use new canonical urls to single-server-victoriametrics docs: https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/ This avoids a redirect from the old link https://docs.victoriametrics.com/ to https://docs.victoriametrics.com/victoriametrics/single-server-victoriametrics/ , and fixes `backwards` navigation for these links across VictoriaMetrics docs. This is a follow-up for `f152021521` See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/8595#issuecomment-2831598274	2025-04-30 22:35:40 +02:00
f41gh7	234bc82f6c	lib/handshake: log client network errors during handshake as warnings This commit modifies the logging behavior for client network errors (e.g., EOFs, timeouts) during the handshake process. They are now logged as warnings instead of errors, as they are not actionable from the server’s perspective. Here are some examples of such errors. Timeouts during the initial read phase: 2025-04-09T07:08:59.323Z error VictoriaMetrics/lib/vmselectapi/server.go:204 cannot perform vmselect handshake with client "<REDACTED>": cannot read hello: cannot read message with size 11: read tcp4 <REDACTED>-><REDACTED>: i/o timeout; read only 0 bytes EOFs occurring later in the handshake process: 2025-04-08T18:01:30.783Z error VictoriaMetrics/lib/vmselectapi/server.go:204 cannot perform vmselect handshake with client "<REDACTED>": cannot read isCompressed flag: cannot read message with size 1: EOF; read only 0 bytes By logging these as warnings, we reduce noise in error logs while preserving valuable information for debugging.	2025-04-25 12:02:39 +03:00
Nikolay	07d0593076	lib/storage: enhance TSDB status response This commit adds new fields, `requestsCount` and `lastRequestTimestamp`, to the series-count-by-metric-name stats. This allows displaying additional stats on the cardinality explorer page. The stats are only added if the `-storage.trackMetricNamesStats` flag is set. This change requires an update to the RPC protocol in order to properly marshal the data. In addition, this commit adds integration tests for the TSDB stats API. Related issue: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6145	2025-04-16 19:56:46 +02:00
Nikolay	773b8b0b28	lib/storage: add tracker for time series metric names statistics This feature allows tracking query requests by metric name. The tracker state is stored in memory and is capped at 1/100 of the memory allocated to the storage. If the cap is exceeded, the tracker rejects new items and only registers query requests for already observed metric names. This feature is disabled by default; the new flag `-storage.trackMetricNamesStats` enables it. New API endpoints added to the select component: * /api/v1/status/metric_names_stats - returns a JSON object with usage statistics. * /admin/api/v1/status/metric_names_stats/reset - resets the internal state of the tracker and resets tsid/cache. New metrics were added for this feature: * vm_cache_size_bytes{type="storage/metricNamesUsageTracker"} * vm_cache_size{type="storage/metricNamesUsageTracker"} * vm_cache_size_max_bytes{type="storage/metricNamesUsageTracker"} Related issue: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4458 --------- Signed-off-by: f41gh7 <nik@victoriametrics.com> Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>	2025-03-06 22:10:41 +01:00
f41gh7	a98163a9e0	app/vmselect/netstorage: stop exposing `vm_index_search_duration_seconds` metric This metric records time spent on search operations in the index. It was introduced in [v1.56.0](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.56.0). However, this metric was used neither in dashboards nor in alerting rules. It also has high cardinality because index search operation latency can differ by 3 orders of magnitude. See [example](https://play.victoriametrics.com/select/accounting/1/6a716b0f-38bc-4856-90ce-448fd713e3fe/prometheus/graph/#/cardinality?date=2025-02-05&match=vm_index_search_duration_seconds_bucket&topN=10&focusLabel=). Hence, dropping it as unused. --------- Signed-off-by: hagen1778 <roman@victoriametrics.com>	2025-02-06 13:48:32 +01:00
Aliaksandr Valialkin	d845edc24b	lib: consistently use atomic.* types instead of atomic.* functions See `ea9e2b19a5`	2024-02-24 02:10:04 +02:00
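As a rough illustration of what moving from `atomic.*` functions to `atomic.*` types looks like, here is a minimal Go sketch. The `requestStats` struct and `countConcurrently` function are hypothetical names invented for this example and are not taken from the repository.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// requestStats is a hypothetical struct used only for this illustration.
type requestStats struct {
	// atomic.Uint64 (available since Go 1.19) is accessed only through
	// its methods, so a plain non-atomic read or write cannot slip in.
	requests atomic.Uint64
}

// countConcurrently increments the counter from n goroutines
// and returns the final value.
func countConcurrently(n int) uint64 {
	var s requestStats
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Function style would be atomic.AddUint64(&counter, 1)
			// on a plain uint64 field.
			s.requests.Add(1)
		}()
	}
	wg.Wait()
	return s.requests.Load()
}

func main() {
	fmt.Println(countConcurrently(100)) // 100
}
```

The typed form also guarantees correct alignment for 64-bit values on 32-bit platforms, which the function form leaves to the caller.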
Zakhar Bessarab	f7834767c1	vmcluster: re-routing enhancement (#5293) * app/vmstorage: close vminsert connections gradually before stopping storage Implements graceful shutdown approach suggested here - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4922#issuecomment-1768146878 Test results for this can be found here - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4922#issuecomment-1790640274 Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com> * app/vmstorage: update graceful shutdown logic - close connections from vminsert in deterministic order - update flag description - lower default timeout to 25 seconds. The 25-second value was chosen because the lowest default value used in default configuration deployments is 30s (the default value in Kubernetes and ansible-playbooks). Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com> * docs/cluster: add information about re-routing enhancement during restart Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com> * docs/changelog: add entry for new command-line flag Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com> * {app/vmstorage,lib/ingestserver}: address review feedback Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com> * docs/cluster: add note to update workload scheduler timeout Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com> * wip --------- Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com> Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>	2023-11-14 01:00:42 +01:00
Aliaksandr Valialkin	36a1fdca6c	all: consistently use %w instead of %s when an error is passed to fmt.Errorf() This allows consistently using errors.Is() for verifying whether the given error wraps some other known error.	2023-10-26 09:44:40 +02:00
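The effect of `%w` versus `%s` on `errors.Is()` can be seen in a short Go sketch. The helper names `wrapWithS`, `wrapWithW` and `detects` are made up for this example, and the "cannot read hello" message is used only as a familiar-looking prefix.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// wrapWithS formats the cause with %s. The result contains the cause's
// text but keeps no link to the original error value.
func wrapWithS(cause error) error {
	return fmt.Errorf("cannot read hello: %s", cause)
}

// wrapWithW formats the cause with %w. The result wraps the cause,
// so errors.Is and errors.As can find it.
func wrapWithW(cause error) error {
	return fmt.Errorf("cannot read hello: %w", cause)
}

// detects reports whether io.EOF is still detectable via errors.Is
// after being passed through the given wrapping function.
func detects(wrap func(error) error) bool {
	return errors.Is(wrap(io.EOF), io.EOF)
}

func main() {
	fmt.Println(detects(wrapWithS)) // false
	fmt.Println(detects(wrapWithW)) // true
}
```

Both variants produce the same message text; only the `%w` variant preserves the error chain.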
Roman Khavronenko	8b2c30c51b	lib/vmselect: bump maxSearchQuerySize to 5MB (#5158 ) See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5154#issuecomment-1757216612 https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5154 Signed-off-by: hagen1778 <roman@victoriametrics.com>	2023-10-11 12:25:54 +02:00
Nikolay	fac272bc10	lib/vmselectapi: do not send empty label names for labelNames request (#4936 ) * lib/vmselectapi: do not send empty label names for labelNames request it breaks cluster communication, since vmselect incorrectly reads request buffer, leaving unread data on it https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4932 * typo fix * wip --------- Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>	2023-09-01 23:24:51 +02:00
Aliaksandr Valialkin	3bc3fb6adf	lib/vmselectapi: move the code for checking the expected client errors into a isExpectedError() function	2023-07-06 16:37:59 -07:00
Zakhar Bessarab	bf4120a3d9	lib/vmselectapi: extend error handling to ignore "reset by peer" (#4498 ) This is a followup for https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4418 to also handle "connection reset by peer" errors in connection handling logic. This error can be triggered just the same as described in original PR: when query was closed on vmselect side and connection has been interrupted. Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>	2023-06-22 11:24:18 +02:00
hagen1778	dde01c826d	lib/vmselectapi: properly check for net.ErrClosed This error may be wrapped in another error, and should normally be tested using `errors.Is(err, net.ErrClosed)`. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2023-06-09 10:42:03 +02:00
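A small Go sketch of why a direct comparison with `net.ErrClosed` can miss wrapped errors. The error here is constructed by hand as a `*net.OpError` so that the example needs no real sockets; the function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// simulatedAcceptError builds, by hand, an error of the shape the net
// package typically returns for operations on a closed connection or
// listener: a *net.OpError whose Err field is net.ErrClosed.
func simulatedAcceptError() error {
	return &net.OpError{Op: "accept", Net: "tcp", Err: net.ErrClosed}
}

// isClosedDirect uses plain equality and therefore misses wrapped errors.
func isClosedDirect(err error) bool {
	return err == net.ErrClosed
}

// isClosed walks the wrap chain with errors.Is.
func isClosed(err error) bool {
	return errors.Is(err, net.ErrClosed)
}

func main() {
	err := simulatedAcceptError()
	fmt.Println(isClosedDirect(err)) // false
	fmt.Println(isClosed(err))       // true
}
```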
Roman Khavronenko	dfb05c884b	lib/vmselectapi: suppress "broken pipe" error logs on vmstorage side (#4418) The "broken pipe" error is emitted when the connection has been interrupted abruptly. It could happen due to an unexpected network glitch or because the connection was interrupted by the remote client. In both cases, the remote client will notice the connection breach and handle it on its own. There is no need to log this error on both the server and the client side. This change should reduce the amount of log noise on the vmstorage side. At the same time, it is not expected to lose any information, since important logs should still be emitted by vmselect. To conduct an experiment for testing this change, see the following instructions: 1. Set up vmcluster with at least 2 storage nodes, 1 vminsert and 1 vmselect 2. Run vmselect with the complexity limit checked on the client side: `-search.maxSamplesPerQuery=1` 3. Ingest some data and query it back: `count({__name__!=""})` 4. Observe the logs on the vmselect and vmstorage side Before the change, vmselect will log a message about complexity limits exceeded. When this happens, vmselect closes network connections to vmstorage nodes, signaling that it doesn't expect any data back. Both vmstorage processes will try to push data to the connection and will fail with a "broken pipe" error, meaning that vmselect closed the connection. After the change, vmstorage nodes should remain silent, and vmselect will continue emitting the error message about complexity limits exceeded. Signed-off-by: hagen1778 <roman@victoriametrics.com>	2023-06-08 08:31:05 -07:00
Aliaksandr Valialkin	0397b3f0f7	lib/handshake: do not pollute logs with `cannot read hello` messages on TCP health checks Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1762	2023-05-18 10:37:59 -07:00
Nikolay	113a89904d	lib/vmselectapi: fixes regression for disable compression setting (#3932 ) after vmselect api refactoring it wasn't possible to disable response cache. This patch restores correct behavior for rpc.disableCompression flag	2023-03-12 01:48:08 -08:00
Nikolay	ebebaecd94	lib/netutil: init implementation of proxy protocol (#3687) * lib/netutil: init implementation of proxy protocol https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3335 * wip Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>	2023-01-26 23:25:22 -08:00
Aliaksandr Valialkin	d8329e47cf	lib/vmselectapi: propagate timeout errors from vmselect to vmstorage instead of closing the connection established from vmselect to vmstorage This is a follow-up for `20e9598254`	2023-01-20 19:30:22 -08:00
Aliaksandr Valialkin	af58ac25f6	lib/vmselectapi: properly calculate query timeout vmselect passes query timeout to vmstorage in seconds. The commit `20e9598254` treated it as timeout in nanoseconds. Fix this to prevent the following errors under vmstorage load: cannot process vmselect request: cannot execute "search_v7": couldn't start executing the request in 0.000 seconds, since -search.maxConcurrentRequests=... concurrent requests are already executed.	2023-01-11 01:21:55 -08:00
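The class of bug described here, a value in seconds being interpreted as nanoseconds, is easy to reproduce in Go because `time.Duration` is an integer count of nanoseconds. The sketch below is only an illustration of that unit mix-up; the function names are hypothetical and the code is not taken from `lib/vmselectapi`.

```go
package main

import (
	"fmt"
	"time"
)

// timeoutAsNanoseconds converts the raw number directly, so a value
// meant as seconds ends up being that many nanoseconds.
func timeoutAsNanoseconds(secs uint32) time.Duration {
	return time.Duration(secs)
}

// timeoutAsSeconds scales the value by time.Second, which gives the
// intended duration.
func timeoutAsSeconds(secs uint32) time.Duration {
	return time.Duration(secs) * time.Second
}

func main() {
	fmt.Printf("%.3f seconds\n", timeoutAsNanoseconds(30).Seconds()) // 0.000 seconds
	fmt.Printf("%.3f seconds\n", timeoutAsSeconds(30).Seconds())     // 30.000 seconds
}
```

The first line of output shows how a 30-second timeout can surface as "0.000 seconds" in a formatted message.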
Aliaksandr Valialkin	f7130d571d	app/vmselect: improve logging when the incoming query cannot be executed because of timeout in the wait queue	2023-01-11 01:12:25 -08:00
Aliaksandr Valialkin	2ca48444e2	lib/vmselectapi: typo fix after `20e9598254`	2023-01-06 22:13:32 -08:00
Aliaksandr Valialkin	b275983403	lib/writeconcurrencylimiter: improve the logic behind -maxConcurrentInserts limit Previously the -maxConcurrentInserts was limiting the number of established client connections, which write data to VictoriaMetrics. Some of these connections could be idle. Such connections do not consume big amounts of CPU and RAM, so there is little sense in limiting the number of such connections. So now the -maxConcurrentInserts command-line option limits the number of concurrently executed insert requests, not including idle connections. It is recommended to remove the -maxConcurrentInserts command-line option, since the default value for this option should work well for most cases.	2023-01-06 22:07:16 -08:00
Aliaksandr Valialkin	20e9598254	lib/vmselectapi: limit the number of concurrently executed requests This should prevent out-of-memory errors when a large number of vmselect nodes send many concurrent requests to vmstorage. The limit can be controlled at vmstorage via the following command-line flags: - search.maxConcurrentRequests - search.maxQueueDuration See https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#resource-usage-limits	2023-01-06 18:39:46 -08:00
Zakhar Bessarab	e407e7243a	{app/vmstorage,app/vmselect}: add API to get list of existing tenants (#3348 ) * {app/vmstorage,app/vmselect}: add API to get list of existing tenants * {app/vmstorage,app/vmselect}: add API to get list of existing tenants * app/vmselect: fix error message * {app/vmstorage,app/vmselect}: fix error messages * app/vmselect: change log level for error handling * wip Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>	2022-11-25 10:32:45 -08:00
Aliaksandr Valialkin	10402459d8	lib/vmselectapi: do not log connection accept/close from vmselect These log messages became too spammy in production clusters after the commit `190c8b463c` , which closes idle connections from vmselect to vmstorage. Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2508	2022-08-12 09:15:29 +03:00
Aliaksandr Valialkin	1b39be3305	lib/vmselectapi: add `rpc call` prefix to the trace of the rpc call in order to make it more clear	2022-08-12 00:20:49 +03:00
Aliaksandr Valialkin	1ec4dfd678	lib/vmselectapi: pass storage.SearchQuery to API calls instead of []*storage.TagFilters + storage.TimeRange + maxMetrics This reduces the number of args to vmselectapi calls	2022-07-06 12:46:22 +03:00
Aliaksandr Valialkin	2e721f7d16	lib/vmselectapi: rename Server.MustClose to more clear Server.MustStop	2022-07-06 12:46:22 +03:00
Aliaksandr Valialkin	270e555f47	lib/vmselectapi: pass maxSuffixes arg to tagValueSuffixes RPC call	2022-07-06 12:46:22 +03:00
Aliaksandr Valialkin	78eeca6f0d	lib/vmselectapi: rename deleteMetrics to more correct deleteSeries	2022-07-06 12:46:21 +03:00
Aliaksandr Valialkin	5afa54e845	lib/vmselectapi: use string type for tagKey and tagValuePrefix args at TagValueSuffixes() This improves the API consistency	2022-07-06 12:46:21 +03:00
Aliaksandr Valialkin	7d5d33fd71	lib/storage: return marshaled metric names from SearchMetricNames Previously SearchMetricNames was returning unmarshaled metric names. This wasn't great for vmstorage, which should spend additional CPU time for marshaling the metric names before sending them to vmselect. While at it, remove possible duplicate metric names, which could occur when multiple samples for new time series are ingested via concurrent requests. Also sort the metric names before returning them to the client. This simplifies debugging of the returned metric names across repeated requests to /api/v1/series	2022-06-28 18:16:32 +03:00
Aliaksandr Valialkin	399d4c36ae	app/vmselect: optimize /api/v1/series a bit for time ranges smaller than one day	2022-06-28 12:55:20 +03:00
Aliaksandr Valialkin	64505e924d	app/vmstorage: extract vmselect api server into a separate package - lib/vmselectapi This opens the door to implementing the vmselect api server at vmselect level, so top-level vmselect could query lower-level vmselect nodes in the same way as it queries vmstorage nodes. This makes it possible to build a highly available querying architecture when multiple independent VictoriaMetrics clusters with the same data are located in distinct availability zones. In this case we can use top-level vmselect instead of Promxy for simultaneous querying of all the clusters in all the AZs.	2022-06-27 14:20:41 +03:00

36 Commits