Compare commits

...

298 Commits

Author SHA1 Message Date
Zakhar Bessarab
87f1859300 lib/logstorage/tokenizer: try out swiss map 2 2023-12-14 19:57:14 +04:00
Zakhar Bessarab
31ab86c35f lib/logstorage/tokenizer: try out swiss map
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-12-14 19:51:06 +04:00
Zakhar Bessarab
ae52ee1857 deployment/logs-benchmark/single: use udp input for fluentbit
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-12-12 20:55:13 +04:00
Zakhar Bessarab
a987f2fab9 deployment/logs-benchmark/single: use fluentbit
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-12-12 20:04:18 +04:00
Zakhar Bessarab
737a4264d4 deployment/logs-benchmark: fix download script
Handle redirects to actual files

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-12-12 15:56:46 +04:00
Zakhar Bessarab
d5298b50d1 deployment/logs-benchmark: add case for self-testing
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-12-12 15:55:29 +04:00
hagen1778
e0fc5ef140 lib/promscrape: comsetic changes after e373bb84d5
* fix typos in docs
* add `shard-` prefix to generated links when `-promscrape.cluster.memberURLTemplate` is enabled

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-12-12 11:28:18 +01:00
hagen1778
1e02efd511 docs: fix formatting after a list in CHANGELOG.md
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-12-12 10:28:11 +01:00
hagen1778
39c405ed4d app/vmctl: follow-up after 27668c9d01
* remove duplications in error messages
* mention the change in CHANGELOG.md

27668c9d01
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-12-11 15:30:47 +01:00
wozz
27668c9d01 vmctl: check error in response from influxdb (#5446) 2023-12-11 15:24:08 +01:00
hagen1778
e13dc04fbf docs: mention alerts change in CHANGELOG.md
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-12-11 15:23:09 +01:00
hagen1778
8fb68152e6 alerts: simplify aggregation of alerting rules
This is follow-up after
75196d7234

It updates some of the alerting rules to remove unnecessary aggregations.
It keeps aggregations for expressions which are using multiple time series
filters to make sure their label will match.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-12-11 15:17:30 +01:00
7840vz
75196d7234 alerts: inverse grouping in vmagent alerts (#5429)
Aggregations with by() have one sideeffect, that any custom labels you add to hosts are dropped too which can be used for alerts routing.

Therefore, some good practice could be to use without() instead, with labels, like without(path) , or without(url) to get same aggregations but with any external labels left intact.
2023-12-11 15:01:29 +01:00
Aliaksandr Valialkin
51df2248f0 vendor: run make vendor-update 2023-12-11 10:48:36 +02:00
Aliaksandr Valialkin
635da5fab7 docs/CHANGELOG.md: document v1.93.9 LTS release 2023-12-11 10:39:28 +02:00
Aliaksandr Valialkin
ce8ae450fc docs/CHANGELOG.md: document v1.87.12 2023-12-10 14:30:22 +02:00
Aliaksandr Valialkin
6d03779870 deployment/docker: update base Docker image from alpine:3.18.5 to alpine:3.19.0
See https://www.alpinelinux.org/posts/Alpine-3.19.0-released.html
2023-12-10 02:28:19 +02:00
Aliaksandr Valialkin
a5bc9d93cc docs: sync -help output for VictoriaMetrics components after recent changes 2023-12-10 01:13:41 +02:00
Aliaksandr Valialkin
3d3b0e31e0 app/vmselect: add -search.maxResponseSeries command-line flag for limiting the number of time series a single response can return
This limit can be used for preventing from high memory usage at Grafana when the response returns too many series.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5372
2023-12-10 00:54:42 +02:00
Aliaksandr Valialkin
b1fed78e0b app: make more clear that -tls enables https at -httpListenAddr 2023-12-10 00:25:01 +02:00
Aliaksandr Valialkin
c7504daa7a docs: follow-up after 49552eaa15
Link to the related issue - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4792
Fix heading for `Modifying HTTP headers` chapter at docs/vmagent.md
2023-12-08 23:56:56 +02:00
Aliaksandr Valialkin
042267541f app/vmauth: add support for hot standby mode via first_available load balancing policy
vmauth in `hot standby` mode sends requests to the first url_prefix while it is available.
If the first url_prefix becomes unavailable, then vmauth falls back to the next url_prefix.
This allows building highly available setup as described at https://docs.victoriametrics.com/vmauth.html#high-availability

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4893
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4792
2023-12-08 23:31:07 +02:00
Aliaksandr Valialkin
b05e1512d4 lib/promscrape: add a wraning when the /service-discovery page contains incomplete list of dropped targets 2023-12-08 19:03:51 +02:00
dependabot[bot]
1065deccf8 build(deps): bump actions/setup-go from 4 to 5 (#5435)
Bumps [actions/setup-go](https://github.com/actions/setup-go) from 4 to 5.
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](https://github.com/actions/setup-go/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/setup-go
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-12-08 16:55:31 +03:00
noodles2hg
8efe694160 lib/streamaggr/streamaggr.go: fix link in error message (#5439) 2023-12-08 16:55:05 +03:00
Roman Khavronenko
74b09ab4de app/vmalert: sanitize label names before sending to Alertmanager (#5442)
Before, vmalert would send notifications with labels containing characters
  not supported by Alertmanager validator, resulting into validation errors
  like `msg="Failed to validate alerts" err="invalid label set: invalid name "foo.bar"`

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-12-08 16:53:35 +03:00
hagen1778
6acf28715b docs: mentioned that re-routing increases number of active time series
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-12-08 13:29:23 +01:00
Aliaksandr Valialkin
adc69b872c docs: run make docs-images-to-webp after 02a0a7f428
This reduces added image sizes by 3x

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5437
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5370
2023-12-07 16:40:37 +02:00
Alexander Marshalov
02a0a7f428 added field version to the response for /api/v1/status/buildinfo API for using more efficient API in Grafana for receiving label values, added additional info about setup Grafana datasource (#5370) (#5437) 2023-12-07 16:37:36 +02:00
Aliaksandr Valialkin
e373bb84d5 lib/promscrape: add -promscrape.cluster.memberURLTemplate command-line flag for creating direct links to vmagent instances at /service-discovery page
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4018#issuecomment-1843811569
2023-12-07 16:04:21 +02:00
Aliaksandr Valialkin
802adf3b65 app/vmselect/prometheus: go fmt after b39e9257eb 2023-12-07 16:01:18 +02:00
Aliaksandr Valialkin
b39e9257eb app/vmselect/prometheus: properly encode Prometheus label values at /federate endpoint
Prometheus spec says that only \, \n and " must be escaped inside label values.
See 995743836e/content/docs/instrumenting/exposition_formats.md (L90)

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5431
2023-12-07 15:36:01 +02:00
hagen1778
cb90d09c9d docs: fix wrong link in Troubleshooting.md
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-12-07 12:54:55 +01:00
Aliaksandr Valialkin
3a8b9cc81e docs/vmagent.md: mention that vmagent displays up to -promscrape.maxDroppedTargets at /service-discovery page
Suggest increasing `-promscrape.maxDroppedTargets` command-line flag value if /service-discovery page
misses some dropped targets.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5389
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4018
2023-12-07 00:34:02 +02:00
Aliaksandr Valialkin
7cb8ed8271 lib/promscrape: show -promscrape.cluster.memberNum values for vmagent instances, which scrape the given dropped target at /service-discovery page
The /service-discovery page contains the list of all the discovered targets
after the commit 487f6380d0 on all the vmagent instances
in cluster mode ( https://docs.victoriametrics.com/vmagent.html#scraping-big-number-of-targets ).

This commit improves debuggability of targets in cluster mode by providing a list of -promscrape.cluster.memberNum
values per each target at /service-discovery page, which has been dropped becasue of sharding,
e.g. if this target is scraped by other vmagent instances in the cluster.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5389
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4018
2023-12-07 00:05:32 +02:00
Aliaksandr Valialkin
efbe25a678 deployment/docker: update Go builder from Go1.21.4 to Go1.21.5
See https://github.com/golang/go/issues?q=milestone%3AGo1.21.5+label%3ACherryPickApproved
2023-12-06 22:33:40 +02:00
Aliaksandr Valialkin
67468a0c46 lib/promscrape: show never scraped message for never scraped targets at /targets page 2023-12-06 22:33:39 +02:00
Dmytro Kozlov
935bec447b app/vmalert: replace error metrics for gauges with counter metrics (#5217)
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5160

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
2023-12-06 19:39:35 +01:00
Aliaksandr Valialkin
65bc460323 lib/promscrape: follow-up for 97373b7786
Substitute O(N^2) algorithm for exposing the `vm_promscrape_scrape_pool_targets` metric
with O(N) algorithm, where N is the number of scrape jobs. The previous algorithm could slow down
/metrics exposition significantly when -promscrape.config contains thousands of scrape jobs.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5311
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5335
2023-12-06 17:35:50 +02:00
Aliaksandr Valialkin
e4f5039509 app/vmselect: properly adjust the lower bound for the time range where raw samples must be selected for default_rollup() function
Previously the lower bound could be too small, which could result in missing values at the beginning of the graph
for default_rollup() function. This function is automatically applied to all the series selectors if they aren't
explicitly wrapped into a rollup function - see https://docs.victoriametrics.com/MetricsQL.html#implicit-query-conversions

While at it, properly take into account `-search.minStalenessInterval` command-line flag when adjusting
the lower bound for the selected time range.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5388
2023-12-06 14:20:14 +02:00
Hui Wang
97373b7786 vmagent: add vm_promscrape_scrape_pool_targets for scrape jobs like… (#5335)
* vmagent: export `vm_promscrape_scrape_pool_targets` metric to track the number of targets that each scrape_job discovers

* add extra panel for new metric
2023-12-06 15:44:39 +08:00
Artem Navoiev
17e2b4f814 fix anchor in docs
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
2023-12-05 13:10:43 +01:00
Aliaksandr Valialkin
06c73df55a Revert "add datadog /api/v2/series and /api/beta/sketches support (#5094)"
This reverts commit 543f218fe9.

Reason for revert: https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5094#issuecomment-1839789080
2023-12-05 02:26:22 +02:00
Aliaksandr Valialkin
bc550e22d7 Revert "lib/protoparser/datadog: follow-up after 543f218fe96574b9b2189c8350bb09afa349e3bb"
This reverts commit 98d0f81f21.

Reson for revert: see https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5094#issuecomment-1839789080
2023-12-05 02:19:29 +02:00
Aliaksandr Valialkin
fdbbbf33ca app/vmagent: add -enableMultitenantHandlers command-line flag
This flag allows converting tenant id to (vm_account_id, vm_project_id) labels.
this flag deprecates `-remoteWrite.multitenantURL` command-line flag,
because `-enableMultitenantHandlers` is easier to use and combine with multitenant url
at vminsert - https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#multitenancy-via-labels

See https://docs.victoriametrics.com/vmagent.html#multitenancy

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/1505
2023-12-05 01:28:37 +02:00
Aliaksandr Valialkin
902f1e5fdc docs/vmagent.md: mention that it may be useful to disable on-disk data persistence when reading data from Kafka or Google PubSub
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2110
2023-12-04 23:12:51 +02:00
Aliaksandr Valialkin
0160435802 app/vmagent: code cleanup for Kafka and Google PubSub consumers / producers
- Add links to relevant docs into descriptions for every -kafka.* and -gcp.pubsub.* command-line flags.
- Wait until message processing goroutines are stopped before returning from gcppubsub.Stop().
- Prevent from multiple calls to Init() without Stop().
- Drop message if tenantID cannot be parsed properly.
- Take into account tenantID for all the supported message formats.
- Support gzip-compressed messages for graphite format.
- Use exponential backoff sleep when the message cannot be pushed to remote storage systems
  because of disabled on-disk persistence - https://docs.victoriametrics.com/vmagent.html#disabling-on-disk-persistence
- Unblock from sleep as soon as Stop() is called. Previously the sleep could take up to 2 seconds after Stop() is called.
- Remove unused globalCtx and initContext from app/vmagent/remotewrite/gcppubsub
- Mention Google PubSub support at docs/enterprise.md
- Make Google PubSub docs more clear at docs/vmagent.md

This is a follow-up for commits 115245924a5f096c5a3383d6cc8e8b6fbd421984
and e6eab781ce42285a6a1750dc01eba6801dd35516 .

Updates https://github.com/VictoriaMetrics/VictoriaMetrics-enterprise/pull/717
Updates https://github.com/VictoriaMetrics/VictoriaMetrics-enterprise/pull/713
2023-12-04 22:46:28 +02:00
Dmytro Kozlov
a28cc6ebec app/vmalert: expose /vmalert/api/v1/rule and /api/v1/rule API which returns rule status in JSON format (#5397)
* app/vmalert: expose `/vmalert/api/v1/rule` and `/api/v1/rule` API which returns rule status in JSON format

* app/vmalert: hide updates if query param not set

* app/vmalert: fix panic (recursion call)

* app/vmalert: add needed group name and file name

* app/vmalert: fix comment, update behavior

* app/vmalert: fix description

* app/vmalert: simplify API for /api/v1/rule

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* app/vmalert: simplify API for /api/v1/rule

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* app/vmalert: simplify API for /api/v1/rule

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* app/vmalert: simplify API for /api/v1/rule

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* app/vmalert: simplify API for /api/v1/rule

Signed-off-by: hagen1778 <roman@victoriametrics.com>

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
2023-12-04 18:40:33 +03:00
Aliaksandr Valialkin
17900e39d7 app/vminsert/newrelic: simplify the code a bit after 1fb8dc0092
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5416
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5421
2023-12-04 16:52:34 +02:00
Dmytro Kozlov
d1aa15688a app/vminsert: fix newrelic ingestion in cluster version (#5421)
Properly pass tenant ID to ingested data from newrelic.
Before tenant ID was mistakenly skipped.
2023-12-04 16:52:29 +02:00
xzchaoo
7c7a32efd7 docs: fix the typo in vmctl.md (#5419)
fix the typo in vmctl.md

Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
2023-12-04 09:48:33 +01:00
Aliaksandr Valialkin
f5c4fcc250 lib/backup: consistently use path.Join() when constructing paths for s3, gs and azblob
E.g. replace `fs.Dir + filePath` with `path.Join(fs.Dir, filePath)`

The fs.Dir is guaranteed to end with slash - see Init() functions.
The filePath may start with slash. If it starts with slash, then `fs.Dir + filePath` constructs
an incorrect path with double slashes.
path.Join() properly substitutes duplicate slashes with a single slash in this case.

While at it, also substitute incorrect usage of filepath.Join() with path.Join()
for constructing paths to object storage systems, which expect forward slashes in paths.
filepath.Join() substittues forward slashes with backslashes on Windows, so this may break
creating or managing backups from Windows.

This is a follow-up for 0399367be602b577baf6a872ca81bf0f99ba401b
Updates https://github.com/VictoriaMetrics/VictoriaMetrics-enterprise/pull/719
2023-12-04 10:34:39 +02:00
Zakhar Bessarab
3532f52f4b lib/backup/s3remote: remove prev object versions for recursive delete (#719)
* lib/backup/s3remote: remove prev object versions for recursive delete

- fix error caused by sending empty objects list to be deleted. This was possible in case old versions of objects where deleted, but root-level entries where still available. This caused paginator to return an empty page which wasn't skipped.

- delete previous versions of objects recursively for S3 remote

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* docs/changelog: add vmbackupmanager fix entry

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* lib/backup/s3remote: unify path construction for S3 objects

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-12-04 10:32:05 +02:00
Aliaksandr Valialkin
c7a2e4e90a deployment/docker: update backe Docker image from alpine 3.18.4 to 3.18.5
See https://www.alpinelinux.org/posts/Alpine-3.15.11-3.16.8-3.17.6-3.18.5-released.html
2023-12-03 18:53:51 +02:00
hagen1778
41291b6290 docs: fix formatting for datadog example
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-12-01 22:21:51 +01:00
Aliaksandr Valialkin
f62e03b3d2 app/vmselect: do not limit concurrency for static and fast queries
Previously concurrency for static and fast queries was limited with the -search.maxConcurrentRequests
command-line flag. This could complicate identifying heavy queries via `vmui` at `Top queries` and `Active queries` pages,
since `vmui` and these pages couldn't be opened on overloaded vmselect.

Thanks to @f41gh7 for the idea.
2023-12-01 17:25:01 +02:00
Aliaksandr Valialkin
487f6380d0 lib/promscrape: show dropped targets because of sharding at /service-discovery page
Previously the /service-discovery page didn't show targets dropped because of sharding
( https://docs.victoriametrics.com/vmagent.html#scraping-big-number-of-targets ).

Show also the reason why every target is dropped at /service-discovery page.
This should improve debuging why particular targets are dropped.

While at it, do not remove dropped targets from the list at /service-discovery page
until the total number of targets exceeds the limit passed to -promscrape.maxDroppedTargets .
Previously the list was cleaned up every 10 minutes from the entries, which weren't updated
for the last minute. This could complicate debugging of dropped targets.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5389
2023-12-01 16:48:48 +02:00
hagen1778
e1359c904c docs: follow-up after 760a530305
760a530305
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-12-01 13:27:48 +01:00
Piyush Verma
82a6e4efe5 Rename prom_writter.go to prom_writer.go (#5411)
Possible spelling mistake here.
2023-12-01 12:20:31 +01:00
Artem Navoiev
760a530305 docs: vmagent info about p queue disk size (#5399)
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
2023-12-01 12:19:56 +01:00
Hui Wang
1911320c86 vmalert-tool: fix alert_rule_test case when eval_time is not multiple of evaluation_interval (#5387)
Co-authored-by: hagen1778 <roman@victoriametrics.com>
2023-12-01 12:17:24 +01:00
Aliaksandr Valialkin
8eddccfbb4 all: expose additional metrics for simplifying debugging of VictoriaMetrics components
Updates https://github.com/VictoriaMetrics/metrics/issues/54
2023-11-30 02:06:54 +02:00
Aliaksandr Valialkin
837f6f0975 docs/vmauth.md: add typical use cases 2023-11-29 20:49:00 +02:00
hagen1778
ec5b72c879 docs: add 3rd party article "Observe and record performance of Spark jobs with Victoria Metrics"
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-29 16:42:18 +01:00
Aliaksandr Valialkin
ac65c6b178 lib/promrelabel: add keep_if_contains and drop_if_contains relabeling actions 2023-11-29 12:22:43 +02:00
Nikolay
41f7940f97 lib/streamaggr: properly reference slice with labels (#5406)
* lib/streamaggr: properly reference slice with labels
by limiting slice capacity. It must fix issues with slice modification, in case of append new slice will be allocated, instead of modifying refrenced slice
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5402

* Reduce memory allocations when output_relabel_configs adds new labels to output samples

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-11-29 10:03:04 +02:00
Github Actions
7ca783dee9 Automatic update operator docs from VictoriaMetrics/operator@f628bee (#5407) 2023-11-28 16:50:24 +01:00
Andrii Chubatiuk
48228031e4 docs: sync mistakenly deleted docs from 543f218fe9
543f218fe9
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-28 16:46:08 +01:00
hagen1778
98d0f81f21 lib/protoparser/datadog: follow-up after 543f218fe9
* prevent /api/v1 from panic on parsing rows
* add tests for Extract function for v1 and v2 api's
* separate request types in different pools to prevent different objects mixing
* add changelog line

543f218fe9
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-28 15:04:15 +01:00
Andrii Chubatiuk
543f218fe9 add datadog /api/v2/series and /api/beta/sketches support (#5094)
Co-authored-by: Andrew Chubatiuk <andrew.chubatiuk@motional.com>
Co-authored-by: Nikolay <https://github.com/f41gh7>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
2023-11-28 14:52:29 +01:00
hagen1778
2291958648 make: remove build duplicates for crossbuild
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-28 11:56:21 +01:00
hagen1778
5424632ba3 docs: mention contributor of PR 5368
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-28 11:55:05 +01:00
luckyxiaoqiang
d7897e0d70 app/vmselect/promql: add day_of_year() function (#5368)
Co-authored-by: dingxiaoqiang <dingxiaoqiang@bytedance.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
2023-11-28 11:54:00 +01:00
hagen1778
8a0bb4bf17 docs: mention loadbalancer in Monitoring chapter
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-28 11:14:17 +01:00
hagen1778
d024fcf37f docs: fix indentation for /api/v1/labels
The indentation didn't change in b51d16e74c

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-28 09:41:17 +01:00
hagen1778
5e38dde18d docs: clarify steps for rollup cache purge for vmselects
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-28 09:38:19 +01:00
hagen1778
f42ec79958 docs: fix link for cache reset on vmselects
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-28 09:29:10 +01:00
Ivan Yatskevich
c5b5895162 docs/dns-srv-typo-fix: replace dns+src with dns+srv (#5396) 2023-11-27 17:48:56 +04:00
Aliaksandr Valialkin
3d57cb3234 docs/Cluster-VictoriaMetrics.md: document that multitenancy via labels is applied to data ingested via non-http protocols
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/3009
2023-11-27 12:02:19 +02:00
Github Actions
255bede1f2 Automatic update operator docs from VictoriaMetrics/operator@e737115 (#5394) 2023-11-27 13:05:01 +04:00
Aliaksandr Valialkin
fc2e7a30b3 app/vmagent: properly increase vmagent_remotewrite_samples_dropped_total when scraped samples cannot be sent to the remote storage and -remoteWrite.dropSamplesOnOverload is set
This is a follow-up for 5034aa0773
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2110
2023-11-25 14:44:32 +02:00
Aliaksandr Valialkin
48f0aa8483 docs/vmagent.md: remove duplicate chapter for Google PubSub integration
The previous chapter has been added in 752f89f13f
See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5358
2023-11-25 14:44:03 +02:00
Aliaksandr Valialkin
5034aa0773 app/vmagent: follow-up for 090cb2c9de
- Add Try* prefix to functions, which return bool result in order to improve readability and reduce the probability of missing check
  for the result returned from these functions.

- Call the adjustSampleValues() only once on input samples. Previously it was called on every attempt to flush data to peristent queue.

- Properly restore the initial state of WriteRequest passed to tryPushWriteRequest() before returning from this function
  after unsuccessful push to persistent queue. Previously a part of WriteRequest samples may be lost in such case.

- Add -remoteWrite.dropSamplesOnOverload command-line flag, which can be used for dropping incoming samples instead
  of returning 429 Too Many Requests error to the client when -remoteWrite.disableOnDiskQueue is set and the remote storage
  cannot keep up with the data ingestion rate.

- Add vmagent_remotewrite_samples_dropped_total metric, which counts the number of dropped samples.

- Add vmagent_remotewrite_push_failures_total metric, which counts the number of unsuccessful attempts to push
  data to persistent queue when -remoteWrite.disableOnDiskQueue is set.

- Remove vmagent_remotewrite_aggregation_metrics_dropped_total and vm_promscrape_push_samples_dropped_total metrics,
  because they are replaced with vmagent_remotewrite_samples_dropped_total metric.

- Update 'Disabling on-disk persistence' docs at docs/vmagent.md

- Update stale comments in the code

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5088
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2110
2023-11-25 12:09:44 +02:00
Nikolay
090cb2c9de app/vmagent: allow to disabled on-disk persistence (#5088)
* app/vmagent: allow to disabled on-disk queue
Previously, it wasn't possible to build data processing pipeline with a
chain of vmagents. In case when remoteWrite for the last vmagent in the
chain wasn't accessible, it persisted data only when it has enough disk
capacity. If disk queue is full, it started to silently drop ingested
metrics.

New flags allows to disable on-disk persistent and immediatly return an
error if remoteWrite is not accessible anymore. It blocks any writes and
notify client, that data ingestion isn't possible.

Main use case for this feature - use external queue such as kafka for
data persistence.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2110

* adds test, updates readme

* apply review suggestions

* update docs for vmagent

* makes linter happy

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-11-24 13:42:11 +01:00
Aliaksandr Valialkin
a7800cdb95 vendor: update github.com/VictoriaMetrics/fastcache from v1.12.1 to v1.12.2
This should help reducing GC overhead growth at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5379
2023-11-24 13:30:15 +02:00
Aliaksandr Valialkin
2cd9cda12c docs: make more visible that the maximum JSON line length, which is accepted by /api/v1/import, is limited by -import.maxLineLen command-line flag value
This is a follow-up for 0cf55ded34

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5364
2023-11-24 13:12:51 +02:00
Roman Khavronenko
0cf55ded34 lib/protoparser: decrease import.maxLineLen from 100MB to 10MB (#5364)
Tests showed that importing a single line with 70MB size takes 5.3GiB
RSS memory for VictoriaMetrics single-node.
In the scenario when user exports and imports data from one VM to another,
it could possibly lead to OOM exception for destination VM.

Importing a single line with 16MB size taks 1.3GiB RSS memory.
Hence, the limit for `import.maxLineLen` was decreased from 100MB to 10MB
to improve reliability of VictoriaMetrics during imports.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-11-24 12:53:04 +02:00
Aliaksandr Valialkin
06d2d933fb docs/CHANGELOG.md: document Google PubSub support at vmagent (see 752f89f13f ) 2023-11-23 21:13:46 +02:00
Nikolay
752f89f13f apply review comments (#5358) 2023-11-23 21:10:08 +02:00
Github Actions
93b8bf66aa Automatic update operator docs from VictoriaMetrics/operator@fec3f9d (#5381) 2023-11-23 21:06:03 +02:00
Aliaksandr Valialkin
1831c731a3 app/vmagent/remotewrite: do not drop persistent queues when -remoteWrite.multitenantURL is set
It is unsafe to drop persistent queues when -remoteWrite.multitenantURL command-line flag is set,
since these queues are created on demand when a new sample for the given tenant is pushed
to the remote storage.

This addresses https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5357
The issue has been appeared in the commit f3a51e8b1d
when implementing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4014
2023-11-23 20:40:39 +02:00
Github Actions
e1fb9d9230 Automatic update operator docs from VictoriaMetrics/operator@45bfa36 (#5373) 2023-11-22 20:24:50 +02:00
Aliaksandr Valialkin
348482c575 app/vmalert/notifier: remove backticks from the description for -notifier.blackhole command-line flag
Backticks in flag description are automatically converted to flag type. See https://pkg.go.dev/flag#PrintDefaults

This is a follow-up for 20025d4fd6 and 25317b4e70
2023-11-22 20:17:01 +02:00
Aliaksandr Valialkin
0b0776a440 docs/Cluster-VictoriaMetrics.md: sync with cluster branch 2023-11-22 19:31:21 +02:00
Aliaksandr Valialkin
334a739ff6 docs: convert png images to webp in all the docs except of docs/operator/*
This reduces the size of docs/* folder from 33MB to 18MB

Images inside docs/operator/* must be converted at the https://github.com/VictoriaMetrics/operator/tree/master/docs
and then the updated images must be automatically propagated to the docs/operator/*

This is a follow-up for d3f919df3e

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5206
2023-11-22 19:21:00 +02:00
Github Actions
0715b4e121 Automatic update operator docs from VictoriaMetrics/operator@b96131b (#5371) 2023-11-22 03:59:07 -08:00
Aliaksandr Valialkin
389f34cb57 deployment/docker: remove built binaries at bin folder after creating docker image from them at make publish-via-docker 2023-11-21 14:33:24 +02:00
Aliaksandr Valialkin
3a15c9ffb3 .github/workflows/codeql-analysis.yml: cache Go artifacts 2023-11-21 13:04:57 +02:00
Aliaksandr Valialkin
2b420b5c0a .github/workflows/main.yml: ignore changes inside dashboards and deployment/**.yml
The dashaboards/ and deployment/**.yml do not contain files, which may change main workflow results,
so it is better to ignore them.
2023-11-21 12:57:42 +02:00
Aliaksandr Valialkin
fb835ad658 .github/workflows: run build and test jobs in parallel in order to speed up the workflow run 2023-11-21 12:22:01 +02:00
hagen1778
d493da562e lib/storage: fix typo
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-21 11:20:43 +01:00
Aliaksandr Valialkin
5d1ce9891b .github/workflows: add Go version to Go artifacts cache key
When Go version changes, artifacts for the previous Go version may becomes useless,
so there is a little sense in re-using them.
2023-11-21 12:09:44 +02:00
Aliaksandr Valialkin
74fda0b311 .github/workflows: take into account Makefile contents when generating cache key for Go build aftifacts
The cache key must change when the corresponding 'make ...' command changes inside Makefile.
2023-11-21 12:09:43 +02:00
Aliaksandr Valialkin
95f12c7e28 .github/workflows: use stable Go release - it should always point to the latest stable release
This eliminates the need to update .github/workflows/* files whenever new Go stable release is out,
like in the 2db1a664e1 .

See https://github.com/actions/setup-go#using-stableoldstable-aliases
2023-11-21 12:09:43 +02:00
hagen1778
e96b4410a1 lib/storage: fix typo
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-21 10:52:53 +01:00
Aliaksandr Valialkin
c160a49908 .github/workflows/main.yml: try improving caching for Go artifacts
The default caching for Go artifacts from actions/setup-go@v4 uses the hash of go.sum file
as a cache key - see https://github.com/actions/cache/blob/main/examples.md#go---modules .
This isn't enough for VictoriaMetrics case, since different makefile actions build different the Go artifacts,
which need to be cached. So embed the action name in the cache key.
2023-11-21 01:11:08 +02:00
Aliaksandr Valialkin
a007a5a8a4 .github/workflows/codeql-analysis.yml: remove check-latest and cache inputs for actions/setup-go
- The `cache: true` is no longer needed starting from actions/setup-go@v4 - see https://github.com/actions/setup-go#v4
- The `check-latest: true` may slow down the action, so it is better to disalbe it - see https://github.com/actions/setup-go#check-latest-version
2023-11-21 00:56:43 +02:00
Aliaksandr Valialkin
f3d47c3dc3 Makefile: allow specifying the needed concurrency for make via MAKE_CONCURRENCY env var 2023-11-21 00:55:19 +02:00
Aliaksandr Valialkin
ba803a7cd2 Makefile: localize parallel make to release, release-victoria-logs and crossbuild commands
This is a follow-up for 81ddee4f3a
2023-11-20 23:15:17 +02:00
Aliaksandr Valialkin
81ddee4f3a Makefile: speedup release, publish and crossbuild rules by using parallel make 2023-11-20 22:53:23 +02:00
Aliaksandr Valialkin
fbab838dc0 app/vmagent/README.md: sync with docs/vmagent.md after cbe4a5c251 , so make docs-sync properly works 2023-11-20 22:42:58 +02:00
Nikolay
cbe4a5c251 app/vmagent: adds google pubsub as remoteWrite dst and ingest consumer (#713)
it allows to push and receive metrics from google pubsub queue
Adds needed documentation and examples for it
2023-11-20 22:42:30 +02:00
hagen1778
1ef6b7f32b docs: calrify version when vminsertConnsShutdownDuration was added
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-20 17:15:37 +01:00
hagen1778
20025d4fd6 docs: typo after 3f5a41e35e
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-20 17:05:15 +01:00
hagen1778
3ffa8975d4 docs: follow-up after d3f919df3e
d3f919df3e
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-20 11:52:09 +01:00
Dmytro Kozlov
d3f919df3e docs/managed-victoriametrics: use webp format to reduce image size (#5206) 2023-11-20 11:28:47 +01:00
Khanh Quoc Le
4b7e6b36ce Add _stream fields log (#5068) 2023-11-17 15:58:52 +01:00
Hui Wang
ae3107153c lib/protoparser/promremotewrite: fall back to zstd decoding if Snappy-decoding fails (#5344)
This case is possible after the following steps:
1. vmagent successfully performed handshake with the -remoteWrite.url and the remote storage supports zstd-compressed data.
2. remote storage became unavailable or slow to ingest data, vmagent compressed the collected data into blocks with zstd and puts these blocks to persistent queue on disk.
3. vmagent restarts and the remote storage is unavailable during the handshake, then vmagent falls back to Snappy compression.
4. vmagent starts sending zstd-compressed data from persistent queue to the remote storage, while falsely advertizing it sends Snappy-compressed data.
5. The remote storage receives zstd-compressed data and fails unpacking it with Snappy.

The solution is the same as 12cd32fd75, just fall back to zstd decompression if Snappy decompression fails.
2023-11-17 15:51:09 +01:00
Aliaksandr Valialkin
34a26397d7 docs/enterprise.md: update VictoriaMetrics version in examples from v1.95.0 to v1.95.1
See https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.95.1
2023-11-17 15:46:57 +01:00
Github Actions
75059f3feb Automatic update operator docs from VictoriaMetrics/operator@388745c (#5340) 2023-11-17 15:45:25 +01:00
Aliaksandr Valialkin
727709da67 deployment: update VictoriaMetrics docker image tag from v1.95.0 to v1.95.1
See https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.95.1
2023-11-17 15:44:15 +01:00
Aliaksandr Valialkin
faee0e43d1 app/vmselect/promql: reduce the number of memory allocations inside copyTimeseriesShallow()
Previously the number of memory allocations inside copyTimeseriesShallow() was equal to 1+len(tss)
Reduce this number to 2 by pre-allocating a slice of timeseries structs with len(tss) length.
2023-11-17 15:40:03 +01:00
luckyxiaoqiang
c8e6e47e2a docs/metricsql: remove duplicate sentence (#5349) 2023-11-17 15:18:17 +01:00
Artem Navoiev
657f3bdd21 gh action bump pagefind version to 1.0.4
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
2023-11-17 14:26:17 +01:00
Aliaksandr Valialkin
3545633934 docs/CHANGELOG.md: cut v1.95.1 2023-11-16 20:31:59 +01:00
Aliaksandr Valialkin
e9d86d7e52 vendor: run make vendor-update 2023-11-16 20:20:27 +01:00
Aliaksandr Valialkin
aefd744abb dashboards: remove path!="/favicon.ico" filter from requests rate graphs
The `path!="/favicon.ico"` filter has little sense, since there are many other special paths,
which may be filtered out - /metrics, /flags, /health, /ping, /robots.txt, /-/healthy, /-/ready, /reload, etc.
See /lib/httpserver/httpserver.go for more details.
It will be hard or impossible to maintain filters for all these paths, so it is better to drop this filter
in order to simplify queries and improve the consistency of these queries.
2023-11-16 19:28:49 +01:00
Aliaksandr Valialkin
aae06e003e app/vmselect: simplify code a bit after 63e0f16062
Use only a single call to prometheus.WriteErrorResponse() inside sendPrometheusError
2023-11-16 18:14:33 +01:00
Aliaksandr Valialkin
84b3c1d3bc docs/FAQ.md: add a link to https://docs.victoriametrics.com/#monitoring in questions where this is needed 2023-11-16 17:45:30 +01:00
Aliaksandr Valialkin
2ea03cf80d lib/handshake: add SetReadDeadline and SetWriteDeadline implementations additionally to SetDeadline
This is a follow-up for 27a5461785

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5327
2023-11-16 16:48:05 +01:00
Roman Khavronenko
1fbd0dd9d8 lib/handshake: check for deadline in Read and Write methods (#5327)
The buffered connection could have exceeded the underlying connection
deadline during reading or writing to an internal buffer.
With this change, buffered connection struct additionally checks
for a deadline in Read/Write methods.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-16 16:47:46 +01:00
Github Actions
8924fea33a Automatic update operator docs from VictoriaMetrics/operator@bc8b02f (#5331) 2023-11-16 16:29:37 +01:00
Roman Khavronenko
8dfc874be3 docs/vmalert: clarify deduplication recommendations for HA setup (#5336)
Please see discussion here https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5279

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-16 16:26:57 +01:00
Aliaksandr Valialkin
61035419d5 docs/CHANGELOG.md: remove duplicate word query after 2cbdb1db22 2023-11-16 16:24:03 +01:00
Aliaksandr Valialkin
2cbdb1db22 app/vmselect/promql: properly handle duplicate series when merging cached results with the results obtained from the database
evalRollupFuncNoCache() may return time series with identical labels (aka duplicate series)
when performing queries satisfying all the following conditions:

- It must select time series with multiple metric names. For example, {__name__=~"foo|bar"}
- The series selector must be wrapped into rollup function, which drops metric names. For example, rate({__name__=~"foo|bar"})
- The rollup function must be wrapped into aggregate function, which has no streaming optimization.
  For example, quantile(0.9, rate({__name__=~"foo|bar"})

In this case VictoriaMetrics shouldn't return `cannot merge series: duplicate series found` error.
Instead, it should fall back to query execution with disabled cache.

Also properly store the merged results. Previously they were incorrectly stored because of a typo
introduced in the commit 41a0fdaf39

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5332
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5337
2023-11-16 16:01:40 +01:00
hagen1778
d389a4fcf3 dashboards: use version instead of short_version in annotations
`version` label won't show the difference if various flavors of the same
version were deployed. But `short_version` will.

For example, on the sandbox env we test VM builds before new version release.
Without this change, the version update won't be visible on dashboard.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-16 09:26:47 +01:00
Github Actions
ec7cac3641 Automatic update Grafana datasource docs from VictoriaMetrics/grafana-datasource@95d0711 (#5329) 2023-11-16 11:12:32 +04:00
Aliaksandr Valialkin
0a28c8e91b docs/Articles.md: add https://hackernoon.com/unleashing-vm-histograms-for-ruby-migrating-from-prometheus-to-victoriametrics-with-vm-client 2023-11-16 00:53:44 +01:00
Yury Molodov
98e73a4022 vmui: change autocomplete hotkey to Alt/Option + A (#5328) 2023-11-15 23:33:10 +01:00
Aliaksandr Valialkin
17c45d1206 docs/vmbackup.md: fix links to https://docs.victoriametrics.com/vmbackup.html#permanent-deletion-of-objects-in-s3-compatible-storages
This is a follow-up for 2fc7e9f47e
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5121
2023-11-15 23:26:32 +01:00
Aliaksandr Valialkin
85bf63078c docs/vmagent.md: refer to proper command-line flag: -remoteWrite.shardByURL.labels instead of -remoteWrite.shardByURLLabels
This is a follow-up for ed70a40669

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4942
2023-11-15 23:03:13 +01:00
Aliaksandr Valialkin
7d4873bcef docs: mention that VictoriaMetrics and vmagent support data ingestion via New Relic protocol now
This is a follow-up for f60c08a7bd
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3520
2023-11-15 22:54:29 +01:00
Aliaksandr Valialkin
a79a439dac docs/Release-Guide.md: point to the proper location for CHANGELOG.md at github.com/VictoriaMetrics/operator repository 2023-11-15 21:55:45 +01:00
Aliaksandr Valialkin
4c25ee3597 deployment: update reference to VictoriaLogs from v0.4.1-victorialogs to v0.4.2-victorialogs 2023-11-15 20:46:53 +01:00
Aliaksandr Valialkin
ebd76588da docs/VictoriaLogs/README.md: cut v0.4.2-victorialogs 2023-11-15 20:43:57 +01:00
Aliaksandr Valialkin
9b248e3b2f deployment: update references to VictoriaMetrics components from v1.94.0 to v1.95.0 2023-11-15 20:35:11 +01:00
Aliaksandr Valialkin
0ea35215f6 app/vmalert-tool: add missing multiarch directory
This is needed for 'make publish-vmalert-tool'
2023-11-15 18:11:50 +01:00
Aliaksandr Valialkin
91f5c24f82 docs/CHANGELOG.md: cut v1.95.0 release 2023-11-15 17:45:52 +01:00
Aliaksandr Valialkin
8d9e365512 vendor: update github.com/klasuspost/compress from v1.17.2 to v1.17.3
See https://github.com/klauspost/compress/releases/tag/v1.17.3
2023-11-15 17:18:22 +01:00
Aliaksandr Valialkin
741013a33f docs/CHANGELOG.md: document v1.93.8 LTS release 2023-11-15 17:12:44 +01:00
Aliaksandr Valialkin
d9a7dea9a1 lib/querytracer: add missing blank comment line after 3121d76bee 2023-11-15 16:10:43 +01:00
Aliaksandr Valialkin
e837968e49 docs/stream-aggregation.md: clarify that stream aggregation is applied after all the configured relabeling
This is a follow-up after 68d2cb203d
2023-11-15 15:54:15 +01:00
Aliaksandr Valialkin
5bfa2a3e97 docs/CHANGELOG.md: document v1.87.11 LTS release 2023-11-15 15:53:05 +01:00
hagen1778
68d2cb203d docs/stream-aggr: specify the relabeling order during aggregation
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-15 14:28:03 +01:00
Aliaksandr Valialkin
02624c41eb app/vmctl/README.md: sync with docs/vmctl.md after 7b2e2a23c2 2023-11-15 12:25:33 +01:00
Aliaksandr Valialkin
ef2113e5df Makefile: remove package-base dependency from publish rule, since this dep is set inside all the publish-* dependencies
This is a follow-up for d4099a75be
2023-11-15 12:23:31 +01:00
John Belmonte
7b2e2a23c2 vmctl README.md typo (#5326) 2023-11-15 11:47:25 +04:00
John Belmonte
efd5d1f2dd relabeling.md: fix link (#5325) 2023-11-15 11:32:09 +04:00
Aliaksandr Valialkin
201fb6ec5c vendor: run make vendor-update 2023-11-14 22:45:07 +01:00
Aliaksandr Valialkin
3076c1f400 lib/ingestserver: properly log the number of closed connections
Previously there was off-by-one error, which resulted in logging len(conns-1) connections instead of len(conns)

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4922
2023-11-14 21:53:24 +01:00
Aliaksandr Valialkin
6a533023b1 docs/CHANGELOG.md: consistently prepend command-line flags with a single dash 2023-11-14 21:44:19 +01:00
Aliaksandr Valialkin
0bf48a5d2a docs/Cluster-VictoriaMetrics.md: clarify how -storage.vminsertConnsShutdownDuration command-line flag works 2023-11-14 21:41:47 +01:00
hagen1778
feff13851c docs: clarify vmalert flag changes
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-14 21:18:58 +01:00
Nikolay
3121d76bee lib/querytracer: makes package concurrent safe to use (#5322)
* lib/querytracer: makes package concurrent safe to use
it must fix various issues with concurrent code usage.
Especially, when it's not reasonable to wait for all goroutines to be finished

* wip

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-11-14 20:59:08 +01:00
Aliaksandr Valialkin
cb106bdf39 lib/logger: increase default -loggerMaxArgLen command-line flag value from 500 to 1000
The 500 chars limit for the maximum arg lengths during logging appeared to be too low for some cases
2023-11-14 19:52:27 +01:00
Artem Navoiev
5d61a7327d docs: update Grafana setup section, use more direct link and add noti… (#5287) 2023-11-14 14:34:01 +01:00
hagen1778
d3ae2b2f62 dashboards: update description for RSS and anonymous memory panels to be consistent for single-node, cluster and vmagent dashboards.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-14 09:50:06 +01:00
hagen1778
d6ae082598 deployment/dashboards: respect job and instance filters for alerts annotation in cluster and single-node dashboards
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-14 09:38:15 +01:00
Aliaksandr Valialkin
fbc6289a21 app/vmselect/promql: typo fixes after 7cf7740d18 2023-11-14 03:34:37 +01:00
Aliaksandr Valialkin
f9bd265249 lib/ingestserver: typo fix after f7834767c1
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4922
2023-11-14 03:26:26 +01:00
Aliaksandr Valialkin
7cf7740d18 app/vmselect/promql: properly handle instant query optimization conrner cases for min_over_time() and max_over_time()
- If min_over_time(m[offset] @ timestamp) <= min_over_time(m[offset] @ (timestamp-window)),
  then the optimization can be applied.

- If max_over_time(m[offset] @ timestamp) >= max_over_time(m[offset] @ (timestamp-window)),
  then the optimization can be applied.
2023-11-14 02:58:08 +01:00
Yury Molodov
0116a333fb vmui: reduced the number of server requests (#5253)
* vmui: reduced the number of server requests

* run `make vmui-update vmui-logs-update`

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-11-14 01:50:00 +01:00
Aliaksandr Valialkin
43e3302803 docs/CHANGELOG.md: document 0e056ddb2d
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5203
2023-11-14 01:24:05 +01:00
Yury Molodov
0e056ddb2d vmui: fix trailing slash in serverURL (#5271)
* vmui: add function to autoremove slash at the end of serverURL (#5203)

* vmui: change removeTrailingSlash func
2023-11-14 01:21:16 +01:00
Noah Labrecque
12aefa8a4b fix: apply correct bounds to sf and tf (#5274) 2023-11-14 01:17:16 +01:00
Zakhar Bessarab
37997abd14 vmcluster: re-routing enhancement (#5293)
* app/vmstorage: close vminsert connections gradually before stopping storage

Implements graceful shutdown approach suggested here - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4922#issuecomment-1768146878

Test results for this can be found here - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4922#issuecomment-1790640274

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* app/vmstorage: update graceful shutdown logic

- close connections from vminsert in determenistic order
- update flag description
- lower default timeout to 25 seconds. 25 seconds value was chosen because the lowest default value used in default configuration deployments is 30s(default value in Kubernetes and ansible-playbooks).

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* docs/cluster: add information about re-routing enhancement during restart

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* docs/changelog: add entry for new command-line flag

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* {app/vmstorage,lib/ingestserver}: address review feedback

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* docs/cluster: add note to update workload scheduler timeout

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>

* wip

---------

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-11-14 01:03:44 +01:00
Aliaksandr Valialkin
cef7a39ba3 lib/logstorage: always check the previous indexBlockHeader for blocks with matching tenantID and/or streamID
The previous indexBlockHeader may contain blocks for the matching tenantID and/or streamID,
so it must be scanned unconditionally during the search.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5295
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4856

This is a follow-up for 89dcbc2fe7
2023-11-13 23:13:53 +01:00
XLONG96
89dcbc2fe7 lib/logstorage: fix streamID and tenantID search (#4856) (#5295) 2023-11-13 23:09:39 +01:00
Aliaksandr Valialkin
8eed04b2c6 app/vmauth: add ability to drop the specified number of /-delimited prefix parts from request path
This can be done via `drop_src_path_prefix_parts` option at `url_map` and `user` levels.

See https://docs.victoriametrics.com/vmauth.html#dropping-request-path-prefix
2023-11-13 22:32:22 +01:00
Aliaksandr Valialkin
0feaeca3c1 lib/protoparser/promremotewrite: fall back to Snappy decoding if zstd decoding fails
This case is possible after the following steps:

1. vmagent tries to perform handshake with the -remoteWrite.url in order to determine whether
   the remote storage supports zstd-compressed data.
2. The remote storage is unavailable during the handshake. In this case vmagent falls back to Snappy compression
   for the data sent to the remote storage.
3. vmagent compresses the collected data into blocks with Snappy and puts these blocks to persistent queue on disk.
4. The remote storage becomes available.
5. vmagent restarts, performs the handshake with the remote storage and detects that it supports zstd-compressed data.
6. vmagent starts sending Snappy-compressed data from persistent queue to the remote storage,
   while falsely advertizing it sends zstd-compressed data.
7. The remote storage receives Snappy-compressed data and fails unpacking it with zstd.

The solution is to just fall back to Snappy decompression if zstd decompression fails.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5301
2023-11-13 21:19:08 +01:00
Aliaksandr Valialkin
8af56ea2ed lib/htmlcomponents: use relative links for the top page and for favicon.ico
This allows hiding VictoriaMetrics components behind proxies with arbitrary path prefixes.
For example, vmagent HTTP handlers can be served via /vmagent/ path prefix:

- http://proxy/vmagent/targets
- http://proxy/vmagent/service-discovery

The path prefix can be arbitrary. For example, below are vmagent urls
for /tenantID/vmagent/ path prefix:

- http://proxy/tenantID/vmagent/targets
- http://proxy/tenantID/vmagent/service-discovery

While at it, consistently serve favicon.ico from any path directory.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5306
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5307
2023-11-13 20:29:05 +01:00
Aliaksandr Valialkin
cf23dc6480 all: cleanup: remove // +build ... lines, since they are no longer needed after Go1.17, and the minimum supported Go version for VictoriaMetrics source code is Go1.20 2023-11-13 19:12:51 +01:00
Aliaksandr Valialkin
b7fb7c5f77 vendor: run make vendor-update 2023-11-13 18:50:16 +01:00
Aliaksandr Valialkin
3e93fa61ad lib/regexutil: properly handle alternate regexps surrounded by .+ or .*
Previously the following regexps were improperly handled:

  .+foo|bar.+
  .*foo|bar.*

This could lead to unexpected regexp match results.

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5297

Thanks to @Haleygo for the initial attempt to fix the issue at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5308
2023-11-13 18:23:38 +01:00
Aliaksandr Valialkin
c8d901424f docs/VictoriaLogs/CHANGELOG.md: follow-up for 66527c5981
Document the change

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5312
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5300
2023-11-13 10:39:48 +01:00
Yury Molodov
66527c5981 vmui: ui logs enhancements (#5312)
* vmui/logs: fix time sorting #5300

* vmui/logs: add base query validation

* vmui/logs: add a message for empty results
2023-11-13 10:36:52 +01:00
Aliaksandr Valialkin
6340911d38 lib/stringsutil: add tests for LimitStringLen() function 2023-11-13 10:32:33 +01:00
Dmytro Kozlov
4722b70c89 lib/stringsutil: fix failing test (#5313)
We have failed test on master branch.

```
--- FAIL: TestFormatLogMessage (0.00s)
    logger_test.go:24: unexpected result; got
        "foo: abcde, \"foo bar baz\", xx"
        want
        "foo: a..e, \"f..z\", xx"
```
if failed because maxArgs maxLen <= 4 in the  `LimitStringLen` in that case we always will return the income string
but in the test we limit the maxLen by value 4
```
f("foo: %s, %q, %s", []interface{}{"abcde", fmt.Errorf("foo bar baz"), "xx"}, 4, `foo: a..e, "f..z", xx`)
2023-11-13 09:51:49 +01:00
Aliaksandr Valialkin
ba058a4514 docs/CHANGELOG.md: remove trailing whitespace after bffd30b57a 2023-11-13 09:24:29 +01:00
Aliaksandr Valialkin
25ff811d78 docs/vmauth.md: add missing dashes in front of command-line flags at the Backend TLS setup section
Dashes must be consistently used in front of command-line flags across the documentation.

This is a follow up for 61594d2bd8
2023-11-13 09:12:57 +01:00
Aliaksandr Valialkin
eded218e8c app/vmauth: properly pass Host header to backends
Previously the `Host` header was remained unchanged when passing it in requests to backends.
This may improperly work if the backend uses host-based routing.

While at it, allows http/2.0 requests to backends. While VictoriaMetrics components
do not accept http/2.0 requests, other backends can require such requests.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240
2023-11-13 09:05:39 +01:00
Aliaksandr Valialkin
61594d2bd8 app/vmauth: follow-up for 323f3720ed
- Re-use identically configured http.Transport across multiple users.
  This fixes handling of the limit on the number of connection, which can be established per each backend
  via -maxIdleConnsPerBackend command-line flag. This limit stopped working after 323f3720ed

- Add docs about backend TLS setup at https://docs.victoriametrics.com/vmauth.html#backend-tls-setup

- Add ability to disable backend TLS verification for all the users via -backend.tlsInsecureSkipVerify command-line flag.
  This flag may be useful when -auth.config contains big number of users, and every user must disable backend TLS verification.

- Add ability to specify TLS Root CA via tls_ca_file option at per-user basis and via -backend.tlsCAFile command-line flag
  across all the users.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240
2023-11-13 08:33:10 +01:00
Aliaksandr Valialkin
ec72b750f2 docs/Articles.md: typo fix 2023-11-13 08:05:16 +01:00
Aliaksandr Valialkin
bfec8a3751 app/vmauth: improve docs a bit after 323f3720ed
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240
2023-11-11 12:49:28 +01:00
Aliaksandr Valialkin
0df8489e0a app/vmagent/README.md: sync with docs/vmagent.md after 930d26b2ff 2023-11-11 12:31:32 +01:00
Aliaksandr Valialkin
230230cf0b lib/logger: add -loggerMaxArgLen command-line flag for fine-tuning the maximum length of logged args 2023-11-11 12:30:08 +01:00
Aliaksandr Valialkin
80213f07fa app/vmselect/promql: optimize instant queries with min_over_time() and max_over_time() rollup functions
This is a follow-up for 41a0fdaf39
2023-11-11 12:10:03 +01:00
Aliaksandr Valialkin
2db1a664e1 deployment: update Go builder from Go1.21.3 to Go1.21.4
See https://github.com/golang/go/issues?q=milestone%3AGo1.21.4+label%3ACherryPickApproved
2023-11-10 22:28:44 +01:00
Aliaksandr Valialkin
010dc15d16 lib/blockcache: do not cache entries, which were attempted to be accessed 1 or 2 times
Previously entries which were accessed only 1 time weren't cached.
It has been appeared that some rarely executed heavy queries may read indexdb block twice
in a row instead of once. There is no need in caching such a block then.
This change should eliminate cache size spikes for indexdb/dataBlocks when such heavy queries are executed.

Expose -blockcache.missesBeforeCaching command-line flag, which can be used for fine-tuning
the number of cache misses needed before storing the block in the caching.
2023-11-10 22:28:03 +01:00
Aliaksandr Valialkin
22498c5087 docs/Articles.md: sort third-party articles by importance 2023-11-10 21:26:51 +01:00
Aliaksandr Valialkin
dc668b8246 docs/Articles.md: add a link to https://blog.cloudflare.com/introducing-http-traffic-anomalies-notifications/ 2023-11-10 20:59:57 +01:00
Aliaksandr Valialkin
a9a26c20b5 docs/Single-server-VictoriaMetrics.md: make High availability section more clear 2023-11-10 20:58:32 +01:00
Aliaksandr Valialkin
d407d13e7b Makefile: update golangci-lint version from v1.54.2 to v1.55.1
See https://github.com/golangci/golangci-lint/releases/tag/v1.55.1
2023-11-10 20:23:48 +01:00
PhracturedBlue
2474281f1b Support building images via podman (#4978) 2023-11-09 00:50:21 -08:00
Zakhar Bessarab
73a1862182 docs/changelog: document vmbackupmanager bugfix (#5303)
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-11-08 18:51:14 +01:00
Artem Navoiev
930d26b2ff docs: vmagent change the codeblock languages
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
2023-11-08 18:12:29 +01:00
Github Actions
2b15f2fc8c Automatic update Grafana datasource docs from VictoriaMetrics/grafana-datasource@cc18249 (#5305) 2023-11-08 18:19:12 +04:00
Github Actions
746cc93e95 Automatic update Grafana datasource docs from VictoriaMetrics/grafana-datasource@52bdb4a (#5304) 2023-11-08 17:59:10 +04:00
Roman Khavronenko
bffd30b57a app/vmalert: update remote-write process (#5284)
* app/vmalert: update remote-write process

* automatically retry remote-write requests on closed connections. The change should reduce the amount of logs produced in environments with short-living connections or environments without support of keep-alive on network balancers.
* increment `vmalert_remotewrite_errors_total` metric if all retries to send remote-write request failed. Before, this metric was incremented only if remote-write client's buffer is overloaded.
* increment `vmalert_remotewrite_dropped_rows_total` amd `vmalert_remotewrite_dropped_bytes_total` metrics if remote-write client's buffer is overloaded. Before, these metrics were incremented only after unsuccessful HTTP calls.

Signed-off-by: hagen1778 <roman@victoriametrics.com>

* Update docs/CHANGELOG.md

---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Hui Wang <haley@victoriametrics.com>
2023-11-08 14:53:07 +08:00
Artem Navoiev
5afc6a5765 github actions: sync docs use the latest hugo version in CI
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
2023-11-07 12:23:12 +01:00
Artem Navoiev
b51d16e74c docs: url example change the title h2->h3 h3->h4 for better indexing in search
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
2023-11-07 09:54:32 +01:00
Artem Navoiev
e4f44bad91 docs: fix formatting in stream aggregation more
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
2023-11-07 09:31:21 +01:00
Artem Navoiev
ea4ca70be1 docs: fix formatting in stream aggregation
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
2023-11-07 09:27:44 +01:00
Github Actions
4d2b98b755 Automatic update operator docs from VictoriaMetrics/operator@b4b79da (#5291) 2023-11-06 16:04:12 +08:00
hagen1778
c07dc45786 app/vmalert: fix typo in remoteWrite.concurrency description
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-03 22:04:50 +01:00
Yury Molodov
f90d2ec843 vmui: display query error on Explore metrics page (#5272)
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5202
2023-11-03 16:23:19 +01:00
hagen1778
054367c421 docs: make docs-sync after 323f3720ed
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-03 16:22:23 +01:00
Zakhar Bessarab
323f3720ed app/vmauth: add option to skip TLS verification (#5256)
Add `tls_insecure_skip_verify` option on per-user basis which allows to disable TLS verification for all requests to backend on behalf of this user.

See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
2023-11-03 12:04:17 +01:00
Aliaksandr Valialkin
da24c129db vendor: run make vendor-update 2023-11-02 21:01:21 +01:00
Aliaksandr Valialkin
f2b373dd06 go.mod: pin the latest working version of golang.org/x/exp 2023-11-02 20:55:24 +01:00
Aliaksandr Valialkin
815fda8995 docs: update -help output after recent changes to VictoriaMetrics components 2023-11-02 20:27:10 +01:00
Aliaksandr Valialkin
65db6609eb docs/CHANGELOG.md: update the description of the optimization for SLO/SLI-like queries according to latest changes
See commits 4497a08e3d and 92826b0b4a
2023-11-02 20:05:05 +01:00
Roman Khavronenko
b5254199c6 app/vmalert: add label file pointing to the group's filename to metrics (#5281)
The filename should help identifying alerting rules belonging to specific groups
with identical names but different filenames.

https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5267

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-02 16:01:31 +01:00
hagen1778
6eb205f8b0 app/vmalert: verify alert name correctness in restore test
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-11-02 15:28:39 +01:00
Hui Wang
90d45574bf vmalert: reduce restore query request for each alerting rule (#5265)
reduce the number of queries for restoring alerts state on start-up. 
The change should speed up the restore process and reduce pressure on `remoteRead.url`.
2023-11-02 15:22:13 +01:00
Aliaksandr Valialkin
eeb25bc4a5 app/vmselect/promql: add missing trace message in rollupResultCache.GetSeries() 2023-11-02 09:22:48 +01:00
Aliaksandr Valialkin
dd33fc0c76 docs/CHANGELOG.md: typo fix: tis -> this 2023-11-02 08:33:40 +01:00
Aliaksandr Valialkin
536c1301fd docs/Single-server-VictoriaMetrics.md: document why data inside <-storageDataPath>/snapshots directory should be manipulated only via snapshot API 2023-11-02 08:30:39 +01:00
Aliaksandr Valialkin
87a86ec9db docs/CHANGELOG.md: document v1.93.7 LTS release 2023-11-02 08:21:00 +01:00
Aliaksandr Valialkin
ed70a40669 app/vmagent/remotewrite: add -remoteWrite.shardByURL.labels command-line flag
This command-line flag can be used for specifying a list of labels used for sharding
among -remoteWrite.url entries when -remoteWrite.shardByURL command-line flag is set.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4942
2023-11-01 23:08:54 +01:00
Alexander Marshalov
828ddd4e4f vmauth: add browser authorization request for http requests without… (#5234)
* vmauth: add browser authorization request for http requests without credentials to a route that is not in the `unauthorized_user` section (when `unauthorized_user` is specified).

* add link to issue in CHANGELOG

* Extend vmauth docs

* wip

---------

Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
2023-11-01 20:59:46 +01:00
Aliaksandr Valialkin
4497a08e3d app/vmselect/promql: reduce the minimum lookbehind window for enabling SLO/SLI optimizations from 24 hours to 6 hours
This reduction is based on production testing.

Also expose -search.minWindowForInstantRollupOptimization command-line flag, so users could fine-tune this arg for their needs
2023-11-01 20:18:44 +01:00
Aliaksandr Valialkin
f59dda3223 app/vmselect: run make quicktemplate-gen after b8739bc00b 2023-11-01 17:52:56 +01:00
Aliaksandr Valialkin
b8739bc00b app/vmselect: return stats.seriesFetched as string instead of number
vmalert expects string value for stats.seriesFetched, so it is impossible
switching to number without breaking compatibility with old vmalert releases :(

It is still unclear why stats.seriesFetched has string type in the first place...
2023-11-01 17:49:59 +01:00
Github Actions
b44b74d118 Automatic update operator docs from VictoriaMetrics/operator@49826be (#5270)
Co-authored-by: Alexander Marshalov <_@marshalov.org>
2023-11-01 17:45:56 +01:00
Artem Navoiev
324ecbeed6 add Try new Docs button in the current docs
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
2023-11-01 17:22:49 +01:00
Aliaksandr Valialkin
da887b49e7 app/vmui: show query execution duration in the header of query input field
This should simplify the process of query optimization
2023-11-01 16:43:51 +01:00
Hui Wang
e482eeff58 vmalert: support specifying full http url in notifier static_configs target (#5261)
* vmalert: support specifying full http or https urls in notifier static_configs target address
* show right label results in ui
2023-11-01 19:53:50 +08:00
Aliaksandr Valialkin
92826b0b4a app/vmselect/promql: apply SLO-like optimization to all the count_*_over_time() functions
This is a follow-up for 41a0fdaf39
2023-11-01 09:58:16 +01:00
Aliaksandr Valialkin
0189081490 app/vmselect/promql: typo fix, which could lead to panic during range query execution
The panic is:

  BUG: unexpected values after merging new values

This is a follow-up for 41a0fdaf39
2023-11-01 09:58:15 +01:00
Github Actions
3e76c3bd50 Automatic update operator docs from VictoriaMetrics/operator@57c1bf6 (#5266) 2023-11-01 16:33:25 +08:00
Aliaksandr Valialkin
c4c6ee9485 app/vmui: fix non-working Disable cache checkbox at JSON and Table views 2023-10-31 22:58:06 +01:00
Aliaksandr Valialkin
a70818f72f app/vmselect/promql: properly calculate rollup result if lookbehind window isn't set
This is a follow-up for 41a0fdaf39
2023-10-31 22:22:37 +01:00
Aliaksandr Valialkin
ea81f6fc36 app/vmselect/promql: add outliers_iqr(q) and outlier_iqr_over_time(m[d]) functions
These functions allow detecting anomalies in series and samples using Interquartile range method.
See Outliers section at https://en.wikipedia.org/wiki/Interquartile_range for more details.
2023-10-31 22:10:31 +01:00
Aliaksandr Valialkin
fba93dbe0b vendor: run make vendor-update 2023-10-31 20:19:51 +01:00
Aliaksandr Valialkin
41a0fdaf39 app/vmselect/promql: optimize repeated SLI-like instant queries with lookbehind windows >= 1d
Repeated instant queries with long lookbehind windows, which contain one of the following rollup functions,
are optimized via partial result caching:

- sum_over_time()
- count_over_time()
- avg_over_time()
- increase()
- rate()

The basic idea of optimization is to calculate

  rf(m[d] @ t)

as

  rf(m[offset] @ t) + rf(m[d] @ (t-offset)) - rf(m[offset] @ (t-d))

where rf(m[d] @ (t-offset)) is cached query result, which was calculated previously

The offset may be in the range of up to 1 hour.
2023-10-31 19:25:23 +01:00
Aliaksandr Valialkin
c96fc05f3e docs/Cluster-VictoriaMetrics.md: sync with cluster branch after 9d8f93050c 2023-10-31 19:13:11 +01:00
Aliaksandr Valialkin
51aab7bb17 app/vmselect/promql: wrap too long line after a950873fff 2023-10-31 18:59:10 +01:00
Aliaksandr Valialkin
714af89b13 lib/httpserver: follow-up for 0638bbe69c
- Replace spaces with underscores in the `reason` label value for the vm_http_request_errors_total metric
  in order be consistent with Prometheus-like naming

- Clarify the description for the change at docs/CHANGELOG.md

Updates https://github.com/victoriaMetrics/victoriaMetrics/issues/4590
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5166
2023-10-31 18:52:39 +01:00
Aliaksandr Valialkin
98699f203b lib/persistentqueue: properly re-create flock.lock file inside directory if persistent queue is broken.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5249

Thanks to @Sniper91 for the bugreport and initial fix at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5233
2023-10-31 18:38:32 +01:00
Aliaksandr Valialkin
efb6ac27c2 lib/httpserver: call Request.Header() only once instead of calling it each time a new request header is set
This is a follow-up for ad839aa492
2023-10-31 18:38:32 +01:00
Artem Navoiev
68f82b1c06 github actions: fix typo in hugo version
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
2023-10-31 17:52:35 +01:00
Artem Navoiev
1020faa7b4 github actions: use 0.119 hugo version as far latest contains bug
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
2023-10-31 17:50:12 +01:00
Aliaksandr Valialkin
aade16f534 docs/Cluster-VictoriaMetrics.md: clarify the description on why -dosnwampling.period must be set at both vmstorage and vmselect
This is a follow-up for ca7457d906
2023-10-31 16:53:46 +01:00
Aliaksandr Valialkin
a71efce784 docs/Single-server-VictoriaMetrics.md: cosmetic fixes after 23369321f1 2023-10-31 16:42:29 +01:00
Aliaksandr Valialkin
4ac95b6f49 docs/CHANGELOG.md: move the description for -http.header.* command-line flags from SECURITY to FEATURE
The SECURITY label should be applied only to changes, which fix security issues.
The change at ad839aa492 adds new command-line flags, which can be used
for improving security in some cases. They do not fix any security issues.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5111
2023-10-31 16:23:08 +01:00
Aliaksandr Valialkin
7ac49162c6 lib/storage: follow-up for 29cebd82fb
Use atomic.CompareAndSwapUint32() instead of atomic.LoadUint32() followed by atomic.StoreUint32().
This makes the code more clear.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5159
2023-10-31 16:08:54 +01:00
hagen1778
f6208965ce dashboards/cluster: fix description about max threshold for Concurrent selects panel.
Before, it was mistakenly implying that `max` is equal to the double of available CPUs.

Addresses https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5214

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-31 16:05:33 +01:00
Roman Khavronenko
a950873fff app/vmselect: expose vm_memory_intensive_queries_total counter metric (#5208)
The new metric gets increased each time `-search.logQueryMemoryUsage` memory limit
is exceeded by a query. This metric should help to identify expensive and heavy queries
without inspecting the logs.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-31 13:31:09 +01:00
hagen1778
a8051d48c4 docs: follow-up for 0638bbe69c
0638bbe69c
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-31 12:54:30 +01:00
venkatbvc
0638bbe69c vmauth: add counter metrics for auth successes and failures (#5166)
New labels `reason="wrong basic auth creds"` and `reason="wrong auth key"` were
added to metric `vm_http_request_errors_total`  to help identify auth errors.

https://github.com/victoriaMetrics/victoriaMetrics/issues/4590

Co-authored-by: Rao, B V Chalapathi <b_v_chalapathi.rao@nokia.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
2023-10-31 12:48:02 +01:00
hagen1778
aaf9e3d526 dashboards/vmalert: add new panel Missed evaluations
The new panel supposed to indicate alerting groups that miss their evaluations.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-31 10:35:19 +01:00
hagen1778
9866974a53 deployment/alerts: add TooManyMissedIterations alerting rule
The new rule for vmalert supposed to detect groups that miss their
evaulations due to slow queries.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-31 10:35:18 +01:00
hagen1778
8874b525b7 dashboards: fix Errors rate to Alertmanager filter
The panel `Errors rate to Alertmanager` had `group` label filter
applied to the expression, while the metric `vmalert_alerts_send_errors_total`
doesn't have that label. This resulted into always empty results.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-31 10:16:45 +01:00
Roman Khavronenko
ca7457d906 docs: explain motivation behind having -downsampling.period on vmselect (#5205)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-30 19:03:36 +01:00
Roman Khavronenko
23369321f1 docs: mention information loss when downsampling gauges (#5204)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-30 15:29:06 +01:00
Hui Wang
abcb21aa5e vmalert: fix alert firing state in replay mode (#5192)
fix possible missing firing states for alerting rules in replay mode
Before if one firing stage is bigger than single query request range, like rule with a big `for`, alerting rule won't able to be detected as firing.

Co-authored-by: hagen1778 <roman@victoriametrics.com>
2023-10-30 13:54:18 +01:00
hagen1778
e964df8039 docs/troubleshooting: mention issue with un-ordered labels
See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5219#issuecomment-1773441711

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-30 13:53:14 +01:00
hagen1778
a64b37cf24 docs: rm mention of default values for security HTTP headers
The headers, their corresponding flags are mentioned at
https://docs.victoriametrics.com/#security

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-30 11:46:17 +01:00
Dima Lazerka
ad839aa492 lib/httpserver: add flags to specify HSTS / Frame-Options / CSP headers for httpserver (#5111)
support `Strict-Transport-Security`, `Content-Security-Policy` and `X-Frame-Options`
HTTP headers in all VictoriaMetrics components. 
The values for headers can be specified by users via the following flags: 
`-http.header.hsts`, `-http.header.csp` and `-http.header.frameOptions`.

Co-authored-by: hagen1778 <roman@victoriametrics.com>
2023-10-30 11:33:38 +01:00
Roman Khavronenko
29cebd82fb lib/storage: log warning about RO mode only on state change (#5191)
Before, vmstorage would log the same message each second producing excessive
amount of logs.

See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5159

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-30 10:52:57 +01:00
Aliaksandr Valialkin
9149353a36 app/vmui: change the order of tables at Top queries tab
Move the most interesting table - queries with the most summary time to execute - to the top
2023-10-28 11:56:16 +02:00
Aliaksandr Valialkin
613b545dfd lib/promscrape/discovery/kubernetes: propagate possible errors at newAPIWatcher() to the caller
This allows substituting FATAL panics with recoverable runtime errors such as missing or invalid TLS CA file
and/or missing/invalid /var/run/secrets/kubernetes.io/serviceaccount/namespace file.
Now these errors are logged instead of PANIC'ing, so they can be fixed by updating the corresponding files
without the need to restart vmagent.

This is a follow-up for 90427abc65
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5243
2023-10-27 20:24:46 +02:00
Hui Wang
90427abc65 lib/promscrape/discovery/kubernetes: avoid possible panic if given caFile under kubernetes.SDConfig.HTTPClientConfig is not exist (#5243)
follow up d5a599badc
2023-10-27 20:20:22 +02:00
Aliaksandr Valialkin
632d788b63 lib/promscrape/discovery/kubernetes: stop all the url watchers, which belong to a particular groupWatcher, at once
Previously url watchers for pod, service and node objects could be mistakenly closed
when service discovery was set up only for endpoints and endpointslice roles,
since watchers for these roles may start start pod, service and node url watchers
with nil apiWatcher passed to groupWatcher.startWatchersForRole().

Now all the url watchers, which belong to a particular groupWatcher, are stopped at once
when this groupWatcher has no apiWatcher subscribers.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5216

The issue has been introduced in v1.93.5 when addressing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
2023-10-27 13:51:35 +02:00
Hui Wang
7c90ce39cb do not print redundant error logs when failed to scrape consul or no… (#5239)
* do not print redundant error logs when failed to scrape consul or nomad target
prometheus performs the same because it uses consul lib which just drops the error(1806bcb38c/api/api.go (L1134))
2023-10-27 13:31:55 +08:00
hagen1778
3aec7eb44f app/vmalert: remove unclear comment
The timestamp alignment should be applied as a last step
to keep the timestamp consistent.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-26 15:41:35 +02:00
Daria Karavaieva
b60bb1d98a model list - isolation forest (#5235)
* model list - isolation forest

* curse of dimensionality

* isol forest definition change, minor fixes

* blank line fix
2023-10-26 12:25:54 +02:00
Aliaksandr Valialkin
68b1b3c4d4 Makefile: move build commands for vmalert-tool closer to vmalert
This should simplify maintenance of Makefile commands related to vmalert.

This is a follow-up for dc28196237
2023-10-26 08:07:50 +02:00
Aliaksandr Valialkin
cdbc06a639 lib/promscrape: do not add a suggestion for enabling TCP6 in error message when the dial address is TCPv4 2023-10-25 17:57:56 -07:00
Dima Lazerka
8b41b506c2 Revert "lib/promscrape: do not add a suggestion for enabling TCP6 in error message when the dial address is TCPv4"
It broke CI (lint)

This reverts commit 5464376d16.
2023-10-25 16:24:31 -07:00
Aliaksandr Valialkin
5464376d16 lib/promscrape: do not add a suggestion for enabling TCP6 in error message when the dial address is TCPv4 2023-10-26 00:29:51 +02:00
Aliaksandr Valialkin
ac933cc423 lib/promscrape: properly track the number of updated service discovery routines inside Config.mustRestart()
This is a follow-up for d5a599badc
2023-10-26 00:06:29 +02:00
Aliaksandr Valialkin
612dcf231a lib/promauth: typo fix in the error message after d5a599badc: obtaine -> obtain 2023-10-25 23:38:00 +02:00
Aliaksandr Valialkin
d5a599badc lib/promauth: follow-up for e16d3f5639
- Make sure that invalid/missing TLS CA file or TLS client certificate files at vmagent startup
  don't prevent from processing the corresponding scrape targets after the file becomes correct,
  without the need to restart vmagent.
  Previously scrape targets with invalid TLS CA file or TLS client certificate files
  were permanently dropped after the first attempt to initialize them, and they didn't
  appear until the next vmagent reload or the next change in other places of the loaded scrape configs.

- Make sure that TLS CA is properly re-loaded from file after it changes without the need to restart vmagent.
  Previously the old TLS CA was used until vmagent restart.

- Properly handle errors during http request creation for the second attempt to send data to remote system
  at vmagent and vmalert. Previously failed request creation could result in nil pointer dereferencing,
  since the returned request is nil on error.

- Add more context to the logged error during AWS sigv4 request signing before sending the data to -remoteWrite.url at vmagent.
  Previously it could miss details on the source of the request.

- Do not create a new HTTP client per second when generating OAuth2 token needed to put in Authorization header
  of every http request issued by vmagent during service discovery or target scraping.
  Re-use the HTTP client instead until the corresponding scrape config changes.

- Cache error at lib/promauth.Config.GetAuthHeader() in the same way as the auth header is cached,
  e.g. the error is cached for a second now. This should reduce load on CPU and OAuth2 server
  when auth header cannot be obtained because of temporary error.

- Share tls.Config.GetClientCertificate function among multiple scrape targets with the same tls_config.
  Cache the loaded certificate and the error for one second. This should significantly reduce CPU load
  when scraping big number of targets with the same tls_config.

- Allow loading TLS certificates from HTTP and HTTPs urls by specifying these urls at `tls_config->cert_file` and `tls_config->key_file`.

- Improve test coverage at lib/promauth

- Skip unreachable or invalid files specified at `scrape_config_files` during vmagent startup, since these files may become valid later.
  Previously vmagent was exitting in this case.

Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959
2023-10-25 23:19:37 +02:00
Aliaksandr Valialkin
c22e3e7b1d lib/promscrape/discovery/kubernetes/kubeconfig_test.go: make TestParseKubeConfigSuccess test code easier to follow 2023-10-25 23:17:18 +02:00
Aliaksandr Valialkin
eed5206376 lib/promauth: properly parse string contents for ca, cert and key fields at tls_config
Previously yaml parser wasn't accepting string values for these fields,
because it was mistakenly expecting a list of uint8 values instead.
2023-10-25 23:12:21 +02:00
Aliaksandr Valialkin
4afcb2a689 lib/promscrape: move duplicate code from functions, which collect ScrapeWork lists for distinct SD types into Config.getScrapeWorkGeneric()
This removes more than 200 lines of duplicate code
2023-10-25 23:03:40 +02:00
Aliaksandr Valialkin
cb34d4440c app/vmalert/config: fix flacky test TestParseBad
It could return either `failed to read` or `failed to parse` errors depending
on whether the given url can be loaded or not under the current environment
2023-10-25 21:30:55 +02:00
Aliaksandr Valialkin
42dd71bb63 all: consistently use %w instead of %s in when error is passed to fmt.Errorf()
This allows consistently using errors.Is() for verifying whether the given error wraps some other known error.
2023-10-25 21:24:03 +02:00
Aliaksandr Valialkin
305c96e384 lib/workingsetcache: fix outdated comments for Load() and New() functions 2023-10-25 21:04:20 +02:00
hagen1778
c07909a20b app/vmalert: fix typo in tests
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-25 16:28:27 +02:00
hagen1778
eed0c3c6b0 app/vmalert: fix tests after a216fe6728
a216fe6728
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-25 16:25:26 +02:00
hagen1778
a216fe6728 app/vmalert: follow-up after c9375cac5e
c9375cac5e

Descriptions were updated in attempt to make it more clear for readers,
re-phrasing and linking missing docs.

`eval_delay` was added to tests to verify it can be unmarshalled.

`eval_delay` is now applied before timestamp alignment to make it more predictable.
Before, if delay < interval the timestamp won't be aligned.

`eval_delay` and `eval_offset` was added to API output.

`PreviouslySentSeriesToRW` converted to private `previouslySentSeriesToRW`.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-25 13:07:13 +02:00
Hui Wang
c9375cac5e vmalert: add -rule.evalDelay flag and eval_delay as group attribute (#5185)
Also mark `-datasource.lookback` as will be deprecated, see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155.
2023-10-25 11:54:18 +02:00
hagen1778
4e0a779efe deployment/alerts: update TooHighMemoryUsage annotation
The memory usage isn't measured on 5m interval anymore.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-24 09:53:44 +02:00
hagen1778
003ef3a518 deployment/alerts: make TooHighMemoryUsage more tolerable to spikes
Using `min_over_time` should reduce the amount of false positives when
component is running in near-the-threshold state. Now it should trigger
only if all collected samples were above the threshold on 10m interval.

Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-24 09:39:46 +02:00
hagen1778
685f9c3c98 deployment/alerts: make RemoteWriteConnectionIsSaturated expr readable
Signed-off-by: hagen1778 <roman@victoriametrics.com>
2023-10-24 09:31:57 +02:00
1735 changed files with 55184 additions and 57547 deletions

View File

@@ -14,13 +14,25 @@ jobs:
name: Build
runs-on: ubuntu-latest
steps:
- name: Setup Go
uses: actions/setup-go@main
with:
go-version: 1.21.3
id: go
- name: Code checkout
uses: actions/checkout@master
- name: Setup Go
id: go
uses: actions/setup-go@v5
with:
go-version: stable
cache: false
- name: Cache Go artifacts
uses: actions/cache@v3
with:
path: |
~/.cache/go-build
~/go/pkg/mod
~/go/bin
key: go-artifacts-${{ runner.os }}-check-licenses-${{ steps.go.outputs.go-version }}-${{ hashFiles('go.sum', 'Makefile', 'app/**/Makefile') }}
restore-keys: go-artifacts-${{ runner.os }}-check-licenses-
- name: Check License
run: |
make check-licenses
run: make check-licenses

View File

@@ -55,11 +55,22 @@ jobs:
uses: actions/checkout@v4
- name: Set up Go
uses: actions/setup-go@v4
id: go
uses: actions/setup-go@v5
with:
go-version: 1.21.3
check-latest: true
cache: true
go-version: stable
cache: false
if: ${{ matrix.language == 'go' }}
- name: Cache Go artifacts
uses: actions/cache@v3
with:
path: |
~/.cache/go-build
~/go/pkg/mod
~/go/bin
key: go-artifacts-${{ runner.os }}-codeql-analyze-${{ steps.go.outputs.go-version }}-${{ hashFiles('go.sum', 'Makefile', 'app/**/Makefile') }}
restore-keys: go-artifacts-${{ runner.os }}-codeql-analyze-
if: ${{ matrix.language == 'go' }}
# Initializes the CodeQL tools for scanning.

View File

@@ -7,6 +7,8 @@ on:
paths-ignore:
- "docs/**"
- "**.md"
- "dashboards/**"
- "deployment/**.yml"
pull_request:
branches:
- master
@@ -14,6 +16,8 @@ on:
paths-ignore:
- "docs/**"
- "**.md"
- "dashboards/**"
- "deployment/**.yml"
permissions:
contents: read
@@ -30,18 +34,55 @@ jobs:
uses: actions/checkout@v4
- name: Setup Go
uses: actions/setup-go@v4
id: go
uses: actions/setup-go@v5
with:
go-version: 1.21.3
check-latest: true
cache: true
go-version: stable
cache: false
- name: Dependencies
- name: Cache Go artifacts
uses: actions/cache@v3
with:
path: |
~/.cache/go-build
~/go/pkg/mod
~/go/bin
key: go-artifacts-${{ runner.os }}-check-all-${{ steps.go.outputs.go-version }}-${{ hashFiles('go.sum', 'Makefile', 'app/**/Makefile') }}
restore-keys: go-artifacts-${{ runner.os }}-check-all-
- name: Run check-all
run: |
make install-golangci-lint
make check-all
git diff --exit-code
build:
needs: lint
name: build
runs-on: ubuntu-latest
steps:
- name: Code checkout
uses: actions/checkout@v4
- name: Setup Go
id: go
uses: actions/setup-go@v5
with:
go-version: stable
cache: false
- name: Cache Go artifacts
uses: actions/cache@v3
with:
path: |
~/.cache/go-build
~/go/pkg/mod
~/go/bin
key: go-artifacts-${{ runner.os }}-crossbuild-${{ steps.go.outputs.go-version }}-${{ hashFiles('go.sum', 'Makefile', 'app/**/Makefile') }}
restore-keys: go-artifacts-${{ runner.os }}-crossbuild-
- name: Build
run: make crossbuild
test:
needs: lint
strategy:
@@ -54,43 +95,26 @@ jobs:
uses: actions/checkout@v4
- name: Setup Go
uses: actions/setup-go@v4
id: go
uses: actions/setup-go@v5
with:
go-version: 1.21.3
check-latest: true
cache: true
go-version: stable
cache: false
- name: Cache Go artifacts
uses: actions/cache@v3
with:
path: |
~/.cache/go-build
~/go/pkg/mod
~/go/bin
key: go-artifacts-${{ runner.os }}-${{ matrix.scenario }}-${{ steps.go.outputs.go-version }}-${{ hashFiles('go.sum', 'Makefile', 'app/**/Makefile') }}
restore-keys: go-artifacts-${{ runner.os }}-${{ matrix.scenario }}-
- name: run tests
run: |
make ${{ matrix.scenario}}
run: make ${{ matrix.scenario}}
- name: Publish coverage
uses: codecov/codecov-action@v3
with:
file: ./coverage.txt
build:
needs: test
name: build
runs-on: ubuntu-latest
steps:
- name: Code checkout
uses: actions/checkout@v4
- name: Setup Go
id: go
uses: actions/setup-go@v4
with:
go-version: 1.21.3
check-latest: true
cache: true
- uses: actions/cache@v3
with:
path: gocache-for-docker
key: gocache-docker-${{ runner.os }}-${{ steps.go.outputs.go-version }}-${{ hashFiles('go.mod') }}
- name: Build
run: |
make victoria-metrics-crossbuild
make vmuitils-crossbuild

View File

@@ -7,7 +7,8 @@ on:
- 'docs/**'
workflow_dispatch: {}
env:
PAGEFIND_VERSION: "1.0.3"
PAGEFIND_VERSION: "1.0.4"
HUGO_VERSION: "latest"
permissions:
contents: read # This is required for actions/checkout and to commit back image update
deployments: write
@@ -28,7 +29,7 @@ jobs:
path: docs
- uses: peaceiris/actions-hugo@v2
with:
hugo-version: 'latest'
hugo-version: ${{env.HUGO_VERSION}}
extended: true
- name: Install PageFind #install the static search engine for index build
uses: supplypike/setup-bin@v3

140
Makefile
View File

@@ -1,5 +1,7 @@
PKG_PREFIX := github.com/VictoriaMetrics/VictoriaMetrics
MAKE_CONCURRENCY ?= $(shell cat /proc/cpuinfo | grep -c processor)
MAKE_PARALLEL := $(MAKE) -j $(MAKE_CONCURRENCY)
DATEINFO_TAG ?= $(shell date -u +'%Y%m%d-%H%M%S')
BUILDINFO_TAG ?= $(shell echo $$(git describe --long --all | tr '/' '-')$$( \
git diff-index --quiet HEAD -- || echo '-dirty-'$$(git diff-index -u HEAD | openssl sha1 | cut -d' ' -f2 | cut -c 1-8)))
@@ -15,6 +17,7 @@ GO_BUILDINFO = -X '$(PKG_PREFIX)/lib/buildinfo.Version=$(APP_NAME)-$(DATEINFO_TA
.PHONY: $(MAKECMDGOALS)
include app/*/Makefile
include docs/Makefile
include deployment/*/Makefile
include dashboards/Makefile
include snap/local/Makefile
@@ -25,150 +28,152 @@ all: \
victoria-logs-prod \
vmagent-prod \
vmalert-prod \
vmalert-tool-prod \
vmauth-prod \
vmbackup-prod \
vmrestore-prod \
vmctl-prod \
vmalert-tool-prod
vmctl-prod
clean:
rm -rf bin/*
publish: package-base \
publish: \
publish-victoria-metrics \
publish-vmagent \
publish-vmalert \
publish-vmalert-tool \
publish-vmauth \
publish-vmbackup \
publish-vmrestore \
publish-vmctl \
publish-vmalert-tool
publish-vmctl
package: \
package-victoria-metrics \
package-victoria-logs \
package-vmagent \
package-vmalert \
package-vmalert-tool \
package-vmauth \
package-vmbackup \
package-vmrestore \
package-vmctl \
package-vmalert-tool
package-vmctl
vmutils: \
vmagent \
vmalert \
vmalert-tool \
vmauth \
vmbackup \
vmrestore \
vmctl \
vmalert-tool
vmctl
vmutils-pure: \
vmagent-pure \
vmalert-pure \
vmalert-tool-pure \
vmauth-pure \
vmbackup-pure \
vmrestore-pure \
vmctl-pure \
vmalert-tool-pure
vmctl-pure
vmutils-linux-amd64: \
vmagent-linux-amd64 \
vmalert-linux-amd64 \
vmalert-tool-linux-amd64 \
vmauth-linux-amd64 \
vmbackup-linux-amd64 \
vmrestore-linux-amd64 \
vmctl-linux-amd64 \
vmalert-tool-linux-amd64
vmctl-linux-amd64
vmutils-linux-arm64: \
vmagent-linux-arm64 \
vmalert-linux-arm64 \
vmalert-tool-linux-arm64 \
vmauth-linux-arm64 \
vmbackup-linux-arm64 \
vmrestore-linux-arm64 \
vmctl-linux-arm64 \
vmalert-tool-linux-arm64
vmctl-linux-arm64
vmutils-linux-arm: \
vmagent-linux-arm \
vmalert-linux-arm \
vmalert-tool-linux-arm \
vmauth-linux-arm \
vmbackup-linux-arm \
vmrestore-linux-arm \
vmctl-linux-arm \
vmalert-tool-linux-arm
vmctl-linux-arm
vmutils-linux-386: \
vmagent-linux-386 \
vmalert-linux-386 \
vmalert-tool-linux-386 \
vmauth-linux-386 \
vmbackup-linux-386 \
vmrestore-linux-386 \
vmctl-linux-386 \
vmalert-tool-linux-386
vmctl-linux-386
vmutils-linux-ppc64le: \
vmagent-linux-ppc64le \
vmalert-linux-ppc64le \
vmalert-tool-linux-ppc64le \
vmauth-linux-ppc64le \
vmbackup-linux-ppc64le \
vmrestore-linux-ppc64le \
vmctl-linux-ppc64le \
vmalert-tool-linux-ppc64le
vmctl-linux-ppc64le
vmutils-darwin-amd64: \
vmagent-darwin-amd64 \
vmalert-darwin-amd64 \
vmalert-tool-darwin-amd64 \
vmauth-darwin-amd64 \
vmbackup-darwin-amd64 \
vmrestore-darwin-amd64 \
vmctl-darwin-amd64 \
vmalert-tool-darwin-amd64
vmctl-darwin-amd64
vmutils-darwin-arm64: \
vmagent-darwin-arm64 \
vmalert-darwin-arm64 \
vmalert-tool-darwin-arm64 \
vmauth-darwin-arm64 \
vmbackup-darwin-arm64 \
vmrestore-darwin-arm64 \
vmctl-darwin-arm64 \
vmalert-tool-darwin-arm64
vmctl-darwin-arm64
vmutils-freebsd-amd64: \
vmagent-freebsd-amd64 \
vmalert-freebsd-amd64 \
vmalert-tool-freebsd-amd64 \
vmauth-freebsd-amd64 \
vmbackup-freebsd-amd64 \
vmrestore-freebsd-amd64 \
vmctl-freebsd-amd64 \
vmalert-tool-freebsd-amd64
vmctl-freebsd-amd64
vmutils-openbsd-amd64: \
vmagent-openbsd-amd64 \
vmalert-openbsd-amd64 \
vmalert-tool-openbsd-amd64 \
vmauth-openbsd-amd64 \
vmbackup-openbsd-amd64 \
vmrestore-openbsd-amd64 \
vmctl-openbsd-amd64 \
vmalert-tool-openbsd-amd64
vmctl-openbsd-amd64
vmutils-windows-amd64: \
vmagent-windows-amd64 \
vmalert-windows-amd64 \
vmalert-tool-windows-amd64 \
vmauth-windows-amd64 \
vmbackup-windows-amd64 \
vmrestore-windows-amd64 \
vmctl-windows-amd64 \
vmalert-tool-windows-amd64
vmctl-windows-amd64
crossbuild:
$(MAKE_PARALLEL) victoria-metrics-crossbuild vmutils-crossbuild
victoria-metrics-crossbuild: \
victoria-metrics-linux-386 \
victoria-metrics-linux-amd64 \
victoria-metrics-linux-arm64 \
victoria-metrics-linux-arm \
victoria-metrics-linux-386 \
victoria-metrics-linux-ppc64le \
victoria-metrics-darwin-amd64 \
victoria-metrics-darwin-arm64 \
@@ -180,7 +185,6 @@ vmutils-crossbuild: \
vmutils-linux-amd64 \
vmutils-linux-arm64 \
vmutils-linux-arm \
vmutils-linux-386 \
vmutils-linux-ppc64le \
vmutils-darwin-amd64 \
vmutils-darwin-arm64 \
@@ -190,14 +194,15 @@ vmutils-crossbuild: \
publish-release:
rm -rf bin/*
git checkout $(TAG) && LATEST_TAG=stable $(MAKE) release publish && \
git checkout $(TAG)-cluster && LATEST_TAG=cluster-stable $(MAKE) release publish && \
git checkout $(TAG)-enterprise && LATEST_TAG=enterprise-stable $(MAKE) release publish && \
git checkout $(TAG)-enterprise-cluster && LATEST_TAG=enterprise-cluster-stable $(MAKE) release publish
git checkout $(TAG) && $(MAKE) release && LATEST_TAG=stable $(MAKE) publish && \
git checkout $(TAG)-cluster && $(MAKE) release && LATEST_TAG=cluster-stable $(MAKE) publish && \
git checkout $(TAG)-enterprise && $(MAKE) release && LATEST_TAG=enterprise-stable $(MAKE) publish && \
git checkout $(TAG)-enterprise-cluster && $(MAKE) release && LATEST_TAG=enterprise-cluster-stable $(MAKE) publish
release: \
release-victoria-metrics \
release-vmutils
release:
$(MAKE_PARALLEL) \
release-victoria-metrics \
release-vmutils
release-victoria-metrics: \
release-victoria-metrics-linux-386 \
@@ -256,16 +261,16 @@ release-victoria-metrics-windows-goarch: victoria-metrics-windows-$(GOARCH)-prod
cd bin && rm -rf \
victoria-metrics-windows-$(GOARCH)-prod.exe
release-victoria-logs: \
release-victoria-logs-linux-386 \
release-victoria-logs-linux-amd64 \
release-victoria-logs-linux-arm \
release-victoria-logs-linux-arm64 \
release-victoria-logs-darwin-amd64 \
release-victoria-logs-darwin-arm64 \
release-victoria-logs-freebsd-amd64 \
release-victoria-logs-openbsd-amd64 \
release-victoria-logs-windows-amd64
release-victoria-logs:
$(MAKE_PARALLEL) release-victoria-logs-linux-386 \
release-victoria-logs-linux-amd64 \
release-victoria-logs-linux-arm \
release-victoria-logs-linux-arm64 \
release-victoria-logs-darwin-amd64 \
release-victoria-logs-darwin-arm64 \
release-victoria-logs-freebsd-amd64 \
release-victoria-logs-openbsd-amd64 \
release-victoria-logs-windows-amd64
release-victoria-logs-linux-386:
GOOS=linux GOARCH=386 $(MAKE) release-victoria-logs-goos-goarch
@@ -354,72 +359,72 @@ release-vmutils-windows-amd64:
release-vmutils-goos-goarch: \
vmagent-$(GOOS)-$(GOARCH)-prod \
vmalert-$(GOOS)-$(GOARCH)-prod \
vmalert-tool-$(GOOS)-$(GOARCH)-prod \
vmauth-$(GOOS)-$(GOARCH)-prod \
vmbackup-$(GOOS)-$(GOARCH)-prod \
vmrestore-$(GOOS)-$(GOARCH)-prod \
vmctl-$(GOOS)-$(GOARCH)-prod \
vmalert-tool-$(GOOS)-$(GOARCH)-prod
vmctl-$(GOOS)-$(GOARCH)-prod
cd bin && \
tar --transform="flags=r;s|-$(GOOS)-$(GOARCH)||" -czf vmutils-$(GOOS)-$(GOARCH)-$(PKG_TAG).tar.gz \
vmagent-$(GOOS)-$(GOARCH)-prod \
vmalert-$(GOOS)-$(GOARCH)-prod \
vmalert-tool-$(GOOS)-$(GOARCH)-prod \
vmauth-$(GOOS)-$(GOARCH)-prod \
vmbackup-$(GOOS)-$(GOARCH)-prod \
vmrestore-$(GOOS)-$(GOARCH)-prod \
vmctl-$(GOOS)-$(GOARCH)-prod \
vmalert-tool-$(GOOS)-$(GOARCH)-prod
&& sha256sum vmutils-$(GOOS)-$(GOARCH)-$(PKG_TAG).tar.gz \
vmagent-$(GOOS)-$(GOARCH)-prod \
vmalert-$(GOOS)-$(GOARCH)-prod \
vmalert-tool-$(GOOS)-$(GOARCH)-prod \
vmauth-$(GOOS)-$(GOARCH)-prod \
vmbackup-$(GOOS)-$(GOARCH)-prod \
vmrestore-$(GOOS)-$(GOARCH)-prod \
vmctl-$(GOOS)-$(GOARCH)-prod \
vmalert-tool-$(GOOS)-$(GOARCH)-prod \
| sed s/-$(GOOS)-$(GOARCH)-prod/-prod/ > vmutils-$(GOOS)-$(GOARCH)-$(PKG_TAG)_checksums.txt
cd bin && rm -rf \
vmagent-$(GOOS)-$(GOARCH)-prod \
vmalert-$(GOOS)-$(GOARCH)-prod \
vmalert-tool-$(GOOS)-$(GOARCH)-prod \
vmauth-$(GOOS)-$(GOARCH)-prod \
vmbackup-$(GOOS)-$(GOARCH)-prod \
vmrestore-$(GOOS)-$(GOARCH)-prod \
vmctl-$(GOOS)-$(GOARCH)-prod \
vmalert-tool-$(GOOS)-$(GOARCH)-prod
vmctl-$(GOOS)-$(GOARCH)-prod
release-vmutils-windows-goarch: \
vmagent-windows-$(GOARCH)-prod \
vmalert-windows-$(GOARCH)-prod \
vmalert-tool-windows-$(GOARCH)-prod \
vmauth-windows-$(GOARCH)-prod \
vmbackup-windows-$(GOARCH)-prod \
vmrestore-windows-$(GOARCH)-prod \
vmctl-windows-$(GOARCH)-prod \
vmalert-tool-windows-$(GOARCH)-prod
vmctl-windows-$(GOARCH)-prod
cd bin && \
zip vmutils-windows-$(GOARCH)-$(PKG_TAG).zip \
vmagent-windows-$(GOARCH)-prod.exe \
vmalert-windows-$(GOARCH)-prod.exe \
vmalert-tool-windows-$(GOARCH)-prod.exe \
vmauth-windows-$(GOARCH)-prod.exe \
vmbackup-windows-$(GOARCH)-prod.exe \
vmrestore-windows-$(GOARCH)-prod.exe \
vmctl-windows-$(GOARCH)-prod.exe \
vmalert-tool-windows-$(GOARCH)-prod.exe \
&& sha256sum vmutils-windows-$(GOARCH)-$(PKG_TAG).zip \
vmagent-windows-$(GOARCH)-prod.exe \
vmalert-windows-$(GOARCH)-prod.exe \
vmalert-tool-windows-$(GOARCH)-prod.exe \
vmauth-windows-$(GOARCH)-prod.exe \
vmbackup-windows-$(GOARCH)-prod.exe \
vmrestore-windows-$(GOARCH)-prod.exe \
vmctl-windows-$(GOARCH)-prod.exe \
vmalert-tool-windows-$(GOARCH)-prod.exe \
> vmutils-windows-$(GOARCH)-$(PKG_TAG)_checksums.txt
cd bin && rm -rf \
vmagent-windows-$(GOARCH)-prod.exe \
vmalert-windows-$(GOARCH)-prod.exe \
vmalert-tool-windows-$(GOARCH)-prod.exe \
vmauth-windows-$(GOARCH)-prod.exe \
vmbackup-windows-$(GOARCH)-prod.exe \
vmrestore-windows-$(GOARCH)-prod.exe \
vmctl-windows-$(GOARCH)-prod.exe \
vmalert-tool-windows-$(GOARCH)-prod.exe
vmctl-windows-$(GOARCH)-prod.exe
pprof-cpu:
go tool pprof -trim_path=github.com/VictoriaMetrics/VictoriaMetrics@ $(PPROF_FILE)
@@ -486,7 +491,7 @@ golangci-lint: install-golangci-lint
golangci-lint run
install-golangci-lint:
which golangci-lint || curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b $(shell go env GOPATH)/bin v1.54.2
which golangci-lint || curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b $(shell go env GOPATH)/bin v1.55.1
govulncheck: install-govulncheck
govulncheck ./...
@@ -529,12 +534,3 @@ copy-docs:
docs-sync:
SRC=README.md DST=docs/README.md OLD_URL='' ORDER=0 TITLE=VictoriaMetrics $(MAKE) copy-docs
SRC=README.md DST=docs/Single-server-VictoriaMetrics.md OLD_URL='/Single-server-VictoriaMetrics.html' TITLE=VictoriaMetrics ORDER=1 $(MAKE) copy-docs
SRC=app/vmagent/README.md DST=docs/vmagent.md OLD_URL='/vmagent.html' ORDER=3 TITLE=vmagent $(MAKE) copy-docs
SRC=app/vmalert/README.md DST=docs/vmalert.md OLD_URL='/vmalert.html' ORDER=4 TITLE=vmalert $(MAKE) copy-docs
SRC=app/vmauth/README.md DST=docs/vmauth.md OLD_URL='/vmauth.html' ORDER=5 TITLE=vmauth $(MAKE) copy-docs
SRC=app/vmbackup/README.md DST=docs/vmbackup.md OLD_URL='/vmbackup.html' ORDER=6 TITLE=vmbackup $(MAKE) copy-docs
SRC=app/vmrestore/README.md DST=docs/vmrestore.md OLD_URL='/vmrestore.html' ORDER=7 TITLE=vmrestore $(MAKE) copy-docs
SRC=app/vmctl/README.md DST=docs/vmctl.md OLD_URL='/vmctl.html' ORDER=8 TITLE=vmctl $(MAKE) copy-docs
SRC=app/vmgateway/README.md DST=docs/vmgateway.md OLD_URL='/vmgateway.html' ORDER=9 TITLE=vmgateway $(MAKE) copy-docs
SRC=app/vmbackupmanager/README.md DST=docs/vmbackupmanager.md OLD_URL='/vmbackupmanager.html' ORDER=10 TITLE=vmbackupmanager $(MAKE) copy-docs
SRC=app/vmalert-tool/README.md DST=docs/vmalert-tool.md OLD_URL='' ORDER=12 TITLE=vmalert-tool $(MAKE) copy-docs

205
README.md
View File

@@ -9,7 +9,7 @@
[![Build Status](https://github.com/VictoriaMetrics/VictoriaMetrics/workflows/main/badge.svg)](https://github.com/VictoriaMetrics/VictoriaMetrics/actions)
[![codecov](https://codecov.io/gh/VictoriaMetrics/VictoriaMetrics/branch/master/graph/badge.svg)](https://codecov.io/gh/VictoriaMetrics/VictoriaMetrics)
<img src="logo.png" width="300" alt="VictoriaMetrics logo">
<img src="docs/logo.webp" width="300" alt="VictoriaMetrics logo">
VictoriaMetrics is a fast, cost-effective and scalable monitoring solution and time series database.
@@ -87,6 +87,7 @@ VictoriaMetrics has the following prominent features:
* [Arbitrary CSV data](#how-to-import-csv-data).
* [Native binary format](#how-to-import-data-in-native-format).
* [DataDog agent or DogStatsD](#how-to-send-data-from-datadog-agent).
* [NewRelic infrastructure agent](#how-to-send-data-from-newrelic-agent).
* [OpenTelemetry metrics format](#sending-data-via-opentelemetry).
* It supports powerful [stream aggregation](https://docs.victoriametrics.com/stream-aggregation.html), which can be used as a [statsd](https://github.com/statsd/statsd) alternative.
* It supports metrics [relabeling](#relabeling).
@@ -338,7 +339,8 @@ which can be used as faster and less resource-hungry alternative to Prometheus.
## Grafana setup
Create [Prometheus datasource](http://docs.grafana.org/features/datasources/prometheus/) in Grafana with the following url:
Create [Prometheus datasource](https://grafana.com/docs/grafana/latest/datasources/prometheus/configure-prometheus-data-source/)
in Grafana with the following url:
```url
http://<victoriametrics-addr>:8428
@@ -346,12 +348,21 @@ http://<victoriametrics-addr>:8428
Substitute `<victoriametrics-addr>` with the hostname or IP address of VictoriaMetrics.
In the "Type and version" section it is recommended to set the type to "Prometheus" and the version to at least "2.24.x":
<img src="docs/grafana-datasource-prometheus.webp" alt="Grafana datasource" />
This allows Grafana to use a more efficient API to get label values.
Then build graphs and dashboards for the created datasource using [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/)
or [MetricsQL](https://docs.victoriametrics.com/MetricsQL.html).
Alternatively, use VictoriaMetrics [datasource plugin](https://github.com/VictoriaMetrics/grafana-datasource) with support of extra features.
See more in [description](https://github.com/VictoriaMetrics/grafana-datasource#victoriametrics-data-source-for-grafana).
Creating a datasource may require [specific permissions](https://grafana.com/docs/grafana/latest/administration/data-source-management/).
If you don't see an option to create a data source - try contacting system administrator.
## How to upgrade VictoriaMetrics
VictoriaMetrics is developed at a fast pace, so it is recommended periodically checking [the CHANGELOG page](https://docs.victoriametrics.com/CHANGELOG.html) and performing regular upgrades.
@@ -442,7 +453,7 @@ This information is obtained from the `/api/v1/status/active_queries` HTTP endpo
[VMUI](#vmui) provides an ability to explore metrics exported by a particular `job` / `instance` in the following way:
1. Open the `vmui` at `http://victoriametrics:8428/vmui/`.
1. Click the `Explore metrics` tab.
1. Click the `Explore Prometheus metrics` tab.
1. Select the `job` you want to explore.
1. Optionally select the `instance` for the selected job to explore.
1. Select metrics you want to explore and compare.
@@ -502,9 +513,9 @@ See also [vmagent](https://docs.victoriametrics.com/vmagent.html), which can be
## How to send data from DataDog agent
VictoriaMetrics accepts data from [DataDog agent](https://docs.datadoghq.com/agent/)
or [DogStatsD](https://docs.datadoghq.com/developers/dogstatsd/)
via ["submit metrics" API](https://docs.datadoghq.com/api/latest/metrics/#submit-metrics)
VictoriaMetrics accepts data from [DataDog agent](https://docs.datadoghq.com/agent/)
or [DogStatsD](https://docs.datadoghq.com/developers/dogstatsd/)
via ["submit metrics" API](https://docs.datadoghq.com/api/latest/metrics/#submit-metrics)
at `/datadog/api/v1/series` path.
### Sending metrics to VictoriaMetrics
@@ -513,7 +524,7 @@ DataDog agent allows configuring destinations for metrics sending via ENV variab
or via [configuration file](https://docs.datadoghq.com/agent/guide/agent-configuration-files/) in section `dd_url`.
<p align="center">
<img src="docs/Single-server-VictoriaMetrics-sending_DD_metrics_to_VM.png" width="800">
<img src="docs/Single-server-VictoriaMetrics-sending_DD_metrics_to_VM.webp" width="800">
</p>
To configure DataDog agent via ENV variable add the following prefix:
@@ -547,7 +558,7 @@ DataDog allows configuring [Dual Shipping](https://docs.datadoghq.com/agent/guid
sending via ENV variable `DD_ADDITIONAL_ENDPOINTS` or via configuration file `additional_endpoints`.
<p align="center">
<img src="docs/Single-server-VictoriaMetrics-sending_DD_metrics_to_VM_and_DD.png" width="800">
<img src="docs/Single-server-VictoriaMetrics-sending_DD_metrics_to_VM_and_DD.webp" width="800">
</p>
Run DataDog using the following ENV variable with VictoriaMetrics as additional metrics receiver:
@@ -1126,6 +1137,18 @@ For example, the following command builds the image on top of [scratch](https://
ROOT_IMAGE=scratch make package-victoria-metrics
```
#### Building VictoriaMetrics with Podman
VictoriaMetrics can be built with Podman in either rootful or rootless mode.
When building via rootlful Podman, simply add `DOCKER=podman` to the relevant `make` commandline. To build
via rootless Podman, add `DOCKER=podman DOCKER_RUN="podman run --userns=keep-id"` to the `make`
commandline.
For example: `make victoria-metrics-pure DOCKER=podman DOCKER_RUN="podman run --userns=keep-id"`
Note that `production` builds are not supported via Podman becuase Podman does not support `buildx`.
## Start with docker-compose
[Docker-compose](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/docker-compose.yml)
@@ -1152,6 +1175,15 @@ Snapshots are created under `<-storageDataPath>/snapshots` directory, where `<-s
is the command-line flag value. Snapshots can be archived to backup storage at any time
with [vmbackup](https://docs.victoriametrics.com/vmbackup.html).
Snapshots consist of a mix of hard-links and soft-links to various files and directories inside `-storageDataPath`.
See [this article](https://medium.com/@valyala/how-victoriametrics-makes-instant-snapshots-for-multi-terabyte-time-series-data-e1f3fb0e0282)
for more details. This adds some restrictions on what can be done with the contents of `<-storageDataPath>/snapshots` directory:
- Do not delete subdirectories inside `<-storageDataPath>/snapshots` with `rm` or similar commands, since this will leave some snapshot data undeleted.
Prefer using the `/snapshot/delete` API for deleting snapshot. See below for more details about this API.
- Do not copy subdirectories inside `<-storageDataPath>/snapshot` with `cp`, `rsync` or similar commands, since there are high chances
that these commands won't copy some data stored in the snapshot. Prefer using [vmbackup](https://docs.victoriametrics.com/vmbackup.html) for making copies of snapshot data.
The `http://<victoriametrics-addr>:8428/snapshot/list` page contains the list of available snapshots.
Navigate to `http://<victoriametrics-addr>:8428/snapshot/delete?snapshot=<snapshot-name>` in order
@@ -1388,7 +1420,12 @@ For example, `/api/v1/import?extra_label=foo=bar` would add `"foo":"bar"` label
Note that it could be required to flush response cache after importing historical data. See [these docs](#backfilling) for detail.
VictoriaMetrics parses input JSON lines one-by-one. It loads the whole JSON line in memory, then parses it and then saves the parsed samples into persistent storage. This means that VictoriaMetrics can occupy big amounts of RAM when importing too long JSON lines. The solution is to split too long JSON lines into smaller lines. It is OK if samples for a single time series are split among multiple JSON lines.
VictoriaMetrics parses input JSON lines one-by-one. It loads the whole JSON line in memory, then parses it and then saves the parsed samples into persistent storage.
This means that VictoriaMetrics can occupy big amounts of RAM when importing too long JSON lines.
The solution is to split too long JSON lines into shorter lines. It is OK if samples for a single time series are split among multiple JSON lines.
JSON line length can be limited via `max_rows_per_line` query arg when exporting via [/api/v1/export](#how-to-export-data-in-json-line-format).
The maximum JSON line length, which can be parsed by VictoriaMetrics, is limited by `-import.maxLineLen` command-line flag value.
### How to import data in native format
@@ -1565,15 +1602,18 @@ The format follows [JSON streaming concept](http://ndjson.org/), e.g. each line
```
Note that every JSON object must be written in a single line, e.g. all the newline chars must be removed from it.
Every line length is limited by the value passed to `-import.maxLineLen` command-line flag (by default this is 100MB).
[/api/v1/import](#how-to-import-data-in-json-line-format) handler doesn't accept JSON lines longer than the value
passed to `-import.maxLineLen` command-line flag (by default this is 10MB).
It is recommended passing 1K-10K samples per line for achieving the maximum data ingestion performance at [/api/v1/import](#how-to-import-data-in-json-line-format).
Too long JSON lines may increase RAM usage at VictoriaMetrics side.
[/api/v1/export](#how-to-export-data-in-json-line-format) handler accepts `max_rows_per_line` query arg, which allows limiting the number of samples per each exported line.
It is OK to split [raw samples](https://docs.victoriametrics.com/keyConcepts.html#raw-samples)
for the same [time series](https://docs.victoriametrics.com/keyConcepts.html#time-series) across multiple lines.
The number of lines in JSON line document can be arbitrary.
The number of lines in the request to [/api/v1/import](#how-to-import-data-in-json-line-format) can be arbitrary - they are imported in streaming manner.
## Relabeling
@@ -1661,6 +1701,8 @@ By default, VictoriaMetrics is tuned for an optimal resource usage under typical
- `-search.maxConcurrentRequests` limits the number of concurrent requests VictoriaMetrics can process. Bigger number of concurrent requests usually means bigger memory usage. For example, if a single query needs 100 MiB of additional memory during its execution, then 100 concurrent queries may need `100 * 100 MiB = 10 GiB` of additional memory. So it is better to limit the number of concurrent queries, while suspending additional incoming queries if the concurrency limit is reached. VictoriaMetrics provides `-search.maxQueueDuration` command-line flag for limiting the max wait time for suspended queries. See also `-search.maxMemoryPerQuery` command-line flag.
- `-search.maxSamplesPerSeries` limits the number of raw samples the query can process per each time series. VictoriaMetrics sequentially processes raw samples per each found time series during the query. It unpacks raw samples on the selected time range per each time series into memory and then applies the given [rollup function](https://docs.victoriametrics.com/MetricsQL.html#rollup-functions). The `-search.maxSamplesPerSeries` command-line flag allows limiting memory usage in the case when the query is executed on a time range, which contains hundreds of millions of raw samples per each located time series.
- `-search.maxSamplesPerQuery` limits the number of raw samples a single query can process. This allows limiting CPU usage for heavy queries.
- `-search.maxResponseSeries` limits the number of time series a single query can return from [`/api/v1/query`](https://docs.victoriametrics.com/keyConcepts.html#instant-query)
and [`/api/v1/query_range`](https://docs.victoriametrics.com/keyConcepts.html#range-query).
- `-search.maxPointsPerTimeseries` limits the number of calculated points, which can be returned per each matching time series from [range query](https://docs.victoriametrics.com/keyConcepts.html#range-query).
- `-search.maxPointsSubqueryPerTimeseries` limits the number of calculated points, which can be generated per each matching time series during [subquery](https://docs.victoriametrics.com/MetricsQL.html#subqueries) evaluation.
- `-search.maxSeriesPerAggrFunc` limits the number of time series, which can be generated by [MetricsQL aggregate functions](https://docs.victoriametrics.com/MetricsQL.html#aggregate-functions) in a single query.
@@ -1669,48 +1711,50 @@ By default, VictoriaMetrics is tuned for an optimal resource usage under typical
- `-search.maxTagValues` limits the number of items, which may be returned from [/api/v1/label/.../values](https://prometheus.io/docs/prometheus/latest/querying/api/#querying-label-values). This endpoint is used mostly by Grafana for auto-completion of label values. Queries to this endpoint may take big amounts of CPU time and memory when the database contains big number of unique time series because of [high churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate). In this case it might be useful to set the `-search.maxTagValues` to quite low value in order to limit CPU and memory usage.
- `-search.maxTagValueSuffixesPerSearch` limits the number of entries, which may be returned from `/metrics/find` endpoint. See [Graphite Metrics API usage docs](#graphite-metrics-api-usage).
See also [cardinality limiter](#cardinality-limiter) and [capacity planning docs](#capacity-planning).
See also [resource usage limits at VictoriaMetrics cluster](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#resource-usage-limits),
[cardinality limiter](#cardinality-limiter) and [capacity planning docs](#capacity-planning).
## High availability
* Install multiple VictoriaMetrics instances in distinct datacenters (availability zones).
* Pass addresses of these instances to [vmagent](https://docs.victoriametrics.com/vmagent.html) via `-remoteWrite.url` command-line flag:
The general approach for achieving high availability is the following:
- to run two identically configured VictoriaMetrics instances in distinct datacenters (availability zones)
- to store the collected data simultaneously into these instances via [vmagent](https://docs.victoriametrics.com/vmagent.html) or Prometheus
- to query the first VictoriaMetrics instance and to fail over to the second instance when the first instance becomes temporarily unavailable.
Such a setup guarantees that the collected data isn't lost when one of VictoriaMetrics instance becomes unavailable.
The collected data continues to be written to the available VictoriaMetrics instance, so it should be available for querying.
Both [vmagent](https://docs.victoriametrics.com/vmagent.html) and Prometheus buffer the collected data locally if they cannot send it
to the configured remote storage. So the collected data will be written to the temporarily unavailable VictoriaMetrics instance
after it becomes available.
If you use [vmagent](https://docs.victoriametrics.com/vmagent.html) for storing the data into VictoriaMetrics,
then it can be configured with multiple `-remoteWrite.url` command-line flags, where every flag points to the VictoriaMetrics
instance in a particular availability zone, in order to replicate the collected data to all the VictoriaMetrics instances.
For example, the following command instructs `vmagent` to replicate data to `vm-az1` and `vm-az2` instances of VictoriaMetrics:
```console
/path/to/vmagent -remoteWrite.url=http://<victoriametrics-addr-1>:8428/api/v1/write -remoteWrite.url=http://<victoriametrics-addr-2>:8428/api/v1/write
/path/to/vmagent \
-remoteWrite.url=http://<vm-az1>:8428/api/v1/write \
-remoteWrite.url=http://<vm-az2>:8428/api/v1/write
```
Alternatively these addresses may be passed to `remote_write` section in Prometheus config:
If you use Prometheus for collecting and writing the data to VictoriaMetrics,
then the following [`remote_write`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write) section
in Prometheus config can be used for replicating the collected data to `vm-az1` and `vm-az2` VictoriaMetrics instances:
```yml
remote_write:
- url: http://<victoriametrics-addr-1>:8428/api/v1/write
queue_config:
max_samples_per_send: 10000
# ...
- url: http://<victoriametrics-addr-N>:8428/api/v1/write
queue_config:
max_samples_per_send: 10000
- url: http://<vm-az1>:8428/api/v1/write
- url: http://<vm-az2>:8428/api/v1/write
```
* Apply the updated config:
It is recommended to use [vmagent](https://docs.victoriametrics.com/vmagent.html) instead of Prometheus for highly loaded setups,
since it uses lower amounts of RAM, CPU and network bandwidth than Prometheus.
```console
kill -HUP `pidof prometheus`
```
It is recommended to use [vmagent](https://docs.victoriametrics.com/vmagent.html) instead of Prometheus for highly loaded setups.
* Now Prometheus should write data into all the configured `remote_write` urls in parallel.
* Set up [Promxy](https://github.com/jacksontj/promxy) in front of all the VictoriaMetrics replicas.
* Set up Prometheus datasource in Grafana that points to Promxy.
If you have Prometheus HA pairs with replicas `r1` and `r2` in each pair, then configure each `r1`
to write data to `victoriametrics-addr-1`, while each `r2` should write data to `victoriametrics-addr-2`.
Another option is to write data simultaneously from Prometheus HA pair to a pair of VictoriaMetrics instances
with the enabled de-duplication. See [this section](#deduplication) for details.
If you use identically configured [vmagent](https://docs.victoriametrics.com/vmagent.html) instances for collecting the same data
and sending it to VictoriaMetrics, then do not forget enabling [deduplication](#deduplication) at VictoriaMetrics side.
## Deduplication
@@ -1897,8 +1941,23 @@ See how to request a free trial license [here](https://victoriametrics.com/produ
* `-downsampling.period=30d:5m,180d:1h` instructs VictoriaMetrics to deduplicate samples older than 30 days with 5 minutes interval and to deduplicate samples older than 180 days with 1 hour interval.
Downsampling is applied independently per each time series. It can reduce disk space usage and improve query performance if it is applied to time series with big number of samples per each series. The downsampling doesn't improve query performance if the database contains big number of time series with small number of samples per each series (aka [high churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate)), since downsampling doesn't reduce the number of time series. So the majority of time is spent on searching for the matching time series.
It is possible to use [stream aggregation](https://docs.victoriametrics.com/stream-aggregation.html) in vmagent or recording rules in [vmalert](https://docs.victoriametrics.com/vmalert.html) in order to [reduce the number of time series](https://docs.victoriametrics.com/vmalert.html#downsampling-and-aggregation-via-vmalert).
Downsampling is applied independently per each time series and leaves a single [raw sample](https://docs.victoriametrics.com/keyConcepts.html#raw-samples)
with the biggest [timestamp](https://en.wikipedia.org/wiki/Unix_time) on the configured interval, in the same way as [deduplication](#deduplication) does.
It works the best for [counters](https://docs.victoriametrics.com/keyConcepts.html#counter) and [histograms](https://docs.victoriametrics.com/keyConcepts.html#histogram),
as their values are always increasing. But downsampling [gauges](https://docs.victoriametrics.com/keyConcepts.html#gauge)
and [summaries](https://docs.victoriametrics.com/keyConcepts.html#summary)
would mean losing the changes within the downsampling interval. Please note, you can use [recording rules](https://docs.victoriametrics.com/vmalert.html#rules)
or [steaming aggregation](https://docs.victoriametrics.com/stream-aggregation.html)
to apply custom aggregation functions, like min/max/avg etc., in order to make gauges more resilient to downsampling.
Downsampling can reduce disk space usage and improve query performance if it is applied to time series with big number
of samples per each series. The downsampling doesn't improve query performance if the database contains big number
of time series with small number of samples per each series (aka [high churn rate](https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate)),
since downsampling doesn't reduce the number of time series. In this case the majority of query time is spent on searching for the matching time series
instead of processing the found samples.
It is possible to use [stream aggregation](https://docs.victoriametrics.com/stream-aggregation.html) in [vmagent](https://docs.victoriametrics.com/vmagent.html)
or recording rules in [vmalert](https://docs.victoriametrics.com/vmalert.html) in order to
[reduce the number of time series](https://docs.victoriametrics.com/vmalert.html#downsampling-and-aggregation-via-vmalert).
Downsampling happens during [background merges](https://docs.victoriametrics.com/#storage)
and can't be performed if there is not enough of free disk space or if vmstorage
@@ -1956,6 +2015,8 @@ VictoriaMetrics provides the following security-related command-line flags:
* `-flagsAuthKey` for protecting `/flags` endpoint.
* `-pprofAuthKey` for protecting `/debug/pprof/*` endpoints, which can be used for [profiling](#profiling).
* `-denyQueryTracing` for disallowing [query tracing](#query-tracing).
* `-http.header.hsts`, `-http.header.csp`, and `-http.header.frameOptions` for serving `Strict-Transport-Security`, `Content-Security-Policy`
and `X-Frame-Options` HTTP response headers.
Explicitly set internal network interface for TCP and UDP ports for data ingestion with Graphite and OpenTSDB formats.
For example, substitute `-graphiteListenAddr=:2003` with `-graphiteListenAddr=<internal_iface_ip>:2003`. This protects from unexpected requests from untrusted network interfaces.
@@ -1991,6 +2052,8 @@ Alternatively, single-node VictoriaMetrics can self-scrape the metrics when `-se
set to duration greater than 0. For example, `-selfScrapeInterval=10s` would enable self-scraping of `/metrics` page
with 10 seconds interval.
_Please note, never use loadbalancer address for scraping metrics. All monitored components should be scraped directly by their address._
Official Grafana dashboards available for [single-node](https://grafana.com/grafana/dashboards/10229-victoriametrics/)
and [clustered](https://grafana.com/grafana/dashboards/11176-victoriametrics-cluster/) VictoriaMetrics.
See an [alternative dashboard for clustered VictoriaMetrics](https://grafana.com/grafana/dashboards/11831)
@@ -2317,7 +2380,7 @@ It is recommended disabling query cache with `-search.disableCache` command-line
historical data with timestamps from the past, since the cache assumes that the data is written with
the current timestamps. Query cache can be enabled after the backfilling is complete.
An alternative solution is to query [/internal/resetRollupResultCache](https://docs.victoriametrics.com/url-examples.html#internalresetRollupResultCache) handler after the backfilling is complete. This will reset the query cache, which could contain incomplete data cached during the backfilling.
An alternative solution is to query [/internal/resetRollupResultCache](https://docs.victoriametrics.com/url-examples.html#internalresetrollupresultcache) handler after the backfilling is complete. This will reset the query cache, which could contain incomplete data cached during the backfilling.
Yet another solution is to increase `-search.cacheTimestampOffset` flag value in order to disable caching
for data with timestamps close to the current time. Single-node VictoriaMetrics automatically resets response
@@ -2450,6 +2513,20 @@ Adhering `KISS` principle simplifies the resulting code and architecture, so it
Report bugs and propose new features [here](https://github.com/VictoriaMetrics/VictoriaMetrics/issues).
## Images in documentation
Please, keep image size and number of images per single page low. Keep the docs page as lightweight as possible.
If the page needs to have many images, consider using WEB-optimized image format [webp](https://developers.google.com/speed/webp).
When adding a new doc with many images use `webp` format right away. Or use a Makefile command below to
convert already existing images at `docs` folder automatically to `web` format:
```console
make docs-images-to-webp
```
Once conversion is done, update the path to images in your docs and verify everything is correct.
## VictoriaMetrics Logo
[Zip](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/VM_logo.zip) contains three folders with different image orientations (main color and inverted version).
@@ -2487,6 +2564,8 @@ Pass `-help` to VictoriaMetrics in order to see the list of supported command-li
```
-bigMergeConcurrency int
Deprecated: this flag does nothing. Please use -smallMergeConcurrency for controlling the concurrency of background merges. See https://docs.victoriametrics.com/#storage
-blockcache.missesBeforeCaching int
The number of cache misses before putting the block into cache. Higher values may reduce indexdb/dataBlocks cache size at the cost of higher CPU and disk read usage (default 2)
-cacheExpireDuration duration
Items are removed from in-memory caches after they aren't accessed for this duration. Lower values may reduce memory usage at the cost of higher CPU usage. See also -prevCacheRemovalPercent (default 30m0s)
-configAuthKey string
@@ -2507,7 +2586,7 @@ Pass `-help` to VictoriaMetrics in order to see the list of supported command-li
-denyQueryTracing
Whether to disable the ability to trace queries. See https://docs.victoriametrics.com/#query-tracing
-downsampling.period array
Comma-separated downsampling periods in the format 'offset:period'. For example, '30d:10m' instructs to leave a single sample per 10 minutes for samples older than 30 days. When setting multiple downsampling periods, it is necessary for the periods to be multiples of each other. See https://docs.victoriametrics.com/#downsampling for details. This flag is available only in Enterprise binaries. See https://docs.victoriametrics.com/enterprise.html
Comma-separated downsampling periods in the format 'offset:period'. For example, '30d:10m' instructs to leave a single sample per 10 minutes for samples older than 30 days. When setting multiple downsampling periods, it is necessary for the periods to be multiples of each other. See https://docs.victoriametrics.com/#downsampling for details. This flag is available only in VictoriaMetrics enterprise. See https://docs.victoriametrics.com/enterprise.html
Supports an array of values separated by comma or specified via multiple flags.
-dryRun
Whether to check config files without running VictoriaMetrics. The following config files are checked: -promscrape.config, -relabelConfig and -streamAggr.config. Unknown config entries aren't allowed in -promscrape.config by default. This can be changed with -promscrape.config.strictParse=false command-line flag
@@ -2541,6 +2620,12 @@ Pass `-help` to VictoriaMetrics in order to see the list of supported command-li
Incoming http connections are closed after the configured timeout. This may help to spread the incoming load among a cluster of services behind a load balancer. Please note that the real timeout may be bigger by up to 10% as a protection against the thundering herd problem (default 2m0s)
-http.disableResponseCompression
Disable compression of HTTP responses to save CPU resources. By default, compression is enabled to save network bandwidth
-http.header.csp string
Value for 'Content-Security-Policy' header
-http.header.frameOptions string
Value for 'X-Frame-Options' header
-http.header.hsts string
Value for 'Strict-Transport-Security' header
-http.idleConnTimeout duration
Timeout for incoming idle http connections (default 1m0s)
-http.maxGracefulShutdownDuration duration
@@ -2554,12 +2639,12 @@ Pass `-help` to VictoriaMetrics in order to see the list of supported command-li
-httpAuth.username string
Username for HTTP server's Basic Auth. The authentication is disabled if empty. See also -httpAuth.password
-httpListenAddr string
TCP address to listen for http connections. See also -httpListenAddr.useProxyProtocol (default ":8428")
TCP address to listen for http connections. See also -tls and -httpListenAddr.useProxyProtocol (default ":8428")
-httpListenAddr.useProxyProtocol
Whether to use proxy protocol for connections accepted at -httpListenAddr . See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt . With enabled proxy protocol http server cannot serve regular /metrics endpoint. Use -pushmetrics.url for metrics pushing
-import.maxLineLen size
The maximum length in bytes of a single line accepted by /api/v1/import; the line length can be limited with 'max_rows_per_line' query arg passed to /api/v1/export
Supports the following optional suffixes for size values: KB, MB, GB, TB, KiB, MiB, GiB, TiB (default 104857600)
Supports the following optional suffixes for size values: KB, MB, GB, TB, KiB, MiB, GiB, TiB (default 10485760)
-influx.databaseNames array
Comma-separated list of database names to return from /query and /influx/query API. This can be needed for accepting data from Telegraf plugins such as https://github.com/fangli/fluent-plugin-influxdb
Supports an array of values separated by comma or specified via multiple flags.
@@ -2608,6 +2693,8 @@ Pass `-help` to VictoriaMetrics in order to see the list of supported command-li
Allows renaming fields in JSON formatted logs. Example: "ts:timestamp,msg:message" renames "ts" to "timestamp" and "msg" to "message". Supported fields: ts, level, caller, msg
-loggerLevel string
Minimum level of errors to log. Possible values: INFO, WARN, ERROR, FATAL, PANIC (default "INFO")
-loggerMaxArgLen int
The maximum length of a single logged argument. Longer arguments are replaced with 'arg_start..arg_end', where 'arg_start' and 'arg_end' is prefix and suffix of the arg with the length not exceeding -loggerMaxArgLen / 2 (default 1000)
-loggerOutput string
Output for the logs. Supported values: stderr, stdout (default "stderr")
-loggerTimezone string
@@ -2615,7 +2702,7 @@ Pass `-help` to VictoriaMetrics in order to see the list of supported command-li
-loggerWarnsPerSecondLimit int
Per-second limit on the number of WARN messages. If more than the given number of warns are emitted per second, then the remaining warns are suppressed. Zero values disable the rate limit
-maxConcurrentInserts int
The maximum number of concurrent insert requests. Default value should work for most cases, since it minimizes the memory usage. The default value can be increased when clients send data over slow networks. See also -insert.maxQueueDuration (default 8)
The maximum number of concurrent insert requests. Default value should work for most cases, since it minimizes the memory usage. The default value can be increased when clients send data over slow networks. See also -insert.maxQueueDuration (default 32)
-maxInsertRequestSize size
The maximum size in bytes of a single Prometheus remote_write API request
Supports the following optional suffixes for size values: KB, MB, GB, TB, KiB, MiB, GiB, TiB (default 33554432)
@@ -2630,6 +2717,9 @@ Pass `-help` to VictoriaMetrics in order to see the list of supported command-li
Allowed percent of system memory VictoriaMetrics caches may occupy. See also -memory.allowedBytes. Too low a value may increase cache miss rate usually resulting in higher CPU and disk IO usage. Too high a value may evict too much data from the OS page cache which will result in higher disk IO usage (default 60)
-metricsAuthKey string
Auth key for /metrics endpoint. It must be passed via authKey query arg. It overrides httpAuth.* settings
-newrelic.maxInsertRequestSize size
The maximum size in bytes of a single NewRelic request to /newrelic/infra/v2/metrics/events/bulk
Supports the following optional suffixes for size values: KB, MB, GB, TB, KiB, MiB, GiB, TiB (default 67108864)
-opentsdbHTTPListenAddr string
TCP address to listen for OpenTSDB HTTP put requests. Usually :4242 must be set. Doesn't work if empty. See also -opentsdbHTTPListenAddr.useProxyProtocol
-opentsdbHTTPListenAddr.useProxyProtocol
@@ -2657,6 +2747,8 @@ Pass `-help` to VictoriaMetrics in order to see the list of supported command-li
If non-empty, then the label with this name and the -promscrape.cluster.memberNum value is added to all the scraped metrics. See https://docs.victoriametrics.com/vmagent.html#scraping-big-number-of-targets for more info
-promscrape.cluster.memberNum string
The number of vmagent instance in the cluster of scrapers. It must be a unique value in the range 0 ... promscrape.cluster.membersCount-1 across scrapers in the cluster. Can be specified as pod name of Kubernetes StatefulSet - pod-name-Num, where Num is a numeric part of pod name. See also -promscrape.cluster.memberLabel . See https://docs.victoriametrics.com/vmagent.html#scraping-big-number-of-targets for more info (default "0")
-promscrape.cluster.memberURLTemplate string
An optional template for URL to access vmagent instance with the given -promscrape.cluster.memberNum value. Every %d occurence in the template is substituted with -promscrape.cluster.memberNum at urls to vmagent instances responsible for scraping the given target at /service-discovery page. For example -promscrape.cluster.memberURLTemplate='http://vmagent-%d:8429/targets'. See https://docs.victoriametrics.com/vmagent.html#scraping-big-number-of-targets for more details
-promscrape.cluster.membersCount int
The number of members in a cluster of scrapers. Each member must have a unique -promscrape.cluster.memberNum in the range 0 ... promscrape.cluster.membersCount-1 . Each member then scrapes roughly 1/N of all the targets. By default, cluster scraping is disabled, i.e. a single scraper scrapes all the targets. See https://docs.victoriametrics.com/vmagent.html#scraping-big-number-of-targets for more info (default 1)
-promscrape.cluster.name string
@@ -2670,7 +2762,7 @@ Pass `-help` to VictoriaMetrics in order to see the list of supported command-li
-promscrape.config.strictParse
Whether to deny unsupported fields in -promscrape.config . Set to false in order to silently skip unsupported fields (default true)
-promscrape.configCheckInterval duration
Interval for checking for changes in '-promscrape.config' file. By default, the checking is disabled. Send SIGHUP signal in order to force config check for changes
Interval for checking for changes in -promscrape.config file. By default, the checking is disabled. See how to reload -promscrape.config file at https://docs.victoriametrics.com/vmagent.html#configuration-update
-promscrape.consul.waitTime duration
Wait time used by Consul service discovery. Default value is used if not set
-promscrape.consulSDCheckInterval duration
@@ -2753,11 +2845,11 @@ Pass `-help` to VictoriaMetrics in order to see the list of supported command-li
-relabelConfig string
Optional path to a file with relabeling rules, which are applied to all the ingested metrics. The path can point either to local file or to http url. See https://docs.victoriametrics.com/#relabeling for details. The config is reloaded on SIGHUP signal
-retentionFilter array
Retention filter in the format 'filter:retention'. For example, '{env="dev"}:3d' configures the retention for time series with env="dev" label to 3 days. See https://docs.victoriametrics.com/#retention-filters for details. This flag is available only in Enterprise binaries. See https://docs.victoriametrics.com/enterprise.html
Retention filter in the format 'filter:retention'. For example, '{env="dev"}:3d' configures the retention for time series with env="dev" label to 3 days. See https://docs.victoriametrics.com/#retention-filters for details. This flag is available only in VictoriaMetrics enterprise. See https://docs.victoriametrics.com/enterprise.html
Supports an array of values separated by comma or specified via multiple flags.
-retentionPeriod value
Data with timestamps outside the retentionPeriod is automatically deleted. The minimum retentionPeriod is 24h or 1d. See also -retentionFilter
The following optional suffixes are supported: h (hour), d (day), w (week), y (year). If suffix isn't set, then the duration is counted in months (default 1)
The following optional suffixes are supported: s (second), m (minute), h (hour), d (day), w (week), y (year). If suffix isn't set, then the duration is counted in months (default 1)
-retentionTimezoneOffset duration
The offset for performing indexdb rotation. If set to 0, then the indexdb rotation is performed at 4am UTC time per each -retentionPeriod. If set to 2h, then the indexdb rotation is performed at 4am EET time (the timezone with +2h offset)
-search.cacheTimestampOffset duration
@@ -2773,12 +2865,12 @@ Pass `-help` to VictoriaMetrics in order to see the list of supported command-li
-search.latencyOffset duration
The time when data points become visible in query results after the collection. It can be overridden on per-query basis via latency_offset arg. Too small value can result in incomplete last points for query results (default 30s)
-search.logQueryMemoryUsage size
Log queries, which require more memory than specified by this flag. This may help detecting and optimizing heavy queries. Query logging is disabled by default. See also -search.logSlowQueryDuration and -search.maxMemoryPerQuery
Log query and increment vm_memory_intensive_queries_total metric each time the query requires more memory than specified by this flag. This may help detecting and optimizing heavy queries. Query logging is disabled by default. See also -search.logSlowQueryDuration and -search.maxMemoryPerQuery
Supports the following optional suffixes for size values: KB, MB, GB, TB, KiB, MiB, GiB, TiB (default 0)
-search.logSlowQueryDuration duration
Log queries with execution time exceeding this value. Zero disables slow query logging. See also -search.logQueryMemoryUsage (default 5s)
-search.maxConcurrentRequests int
The maximum number of concurrent search requests. It shouldn't be high, since a single request can saturate all the CPU cores, while many concurrently executed requests may require high amounts of memory. See also -search.maxQueueDuration and -search.maxMemoryPerQuery (default 8)
The maximum number of concurrent search requests. It shouldn't be high, since a single request can saturate all the CPU cores, while many concurrently executed requests may require high amounts of memory. See also -search.maxQueueDuration and -search.maxMemoryPerQuery (default 16)
-search.maxExportDuration duration
The maximum duration for /api/v1/export call (default 720h0m0s)
-search.maxExportSeries int
@@ -2797,7 +2889,7 @@ Pass `-help` to VictoriaMetrics in order to see the list of supported command-li
The maximum amounts of memory a single query may consume. Queries requiring more memory are rejected. The total memory limit for concurrently executed queries can be estimated as -search.maxMemoryPerQuery multiplied by -search.maxConcurrentRequests . See also -search.logQueryMemoryUsage
Supports the following optional suffixes for size values: KB, MB, GB, TB, KiB, MiB, GiB, TiB (default 0)
-search.maxPointsPerTimeseries int
The maximum points per a single timeseries returned from /api/v1/query_range. This option doesn't limit the number of scanned raw samples in the database. The main purpose of this option is to limit the number of per-series points returned to graphing UI such as VMUI or Grafana. There is no sense in setting this limit to values bigger than the horizontal resolution of the graph (default 30000)
The maximum points per a single timeseries returned from /api/v1/query_range. This option doesn't limit the number of scanned raw samples in the database. The main purpose of this option is to limit the number of per-series points returned to graphing UI such as VMUI or Grafana. There is no sense in setting this limit to values bigger than the horizontal resolution of the graph. See also -search.maxResponseSeries (default 30000)
-search.maxPointsSubqueryPerTimeseries int
The maximum number of points per series, which can be generated by subquery. See https://valyala.medium.com/prometheus-subqueries-in-victoriametrics-9b1492b720b3 (default 100000)
-search.maxQueryDuration duration
@@ -2807,6 +2899,8 @@ Pass `-help` to VictoriaMetrics in order to see the list of supported command-li
Supports the following optional suffixes for size values: KB, MB, GB, TB, KiB, MiB, GiB, TiB (default 16384)
-search.maxQueueDuration duration
The maximum time the request waits for execution when -search.maxConcurrentRequests limit is reached; see also -search.maxQueryDuration (default 10s)
-search.maxResponseSeries int
The maximum number of time series which can be returned from /api/v1/query and /api/v1/query_range . The limit is disabled if it equals to 0. See also -search.maxPointsPerTimeseries and -search.maxUniqueTimeseries
-search.maxSamplesPerQuery int
The maximum number of raw samples a single query can process across all time series. This protects from heavy queries, which select unexpectedly high number of raw samples. See also -search.maxSamplesPerSeries (default 1000000000)
-search.maxSamplesPerSeries int
@@ -2832,9 +2926,12 @@ Pass `-help` to VictoriaMetrics in order to see the list of supported command-li
-search.maxUniqueTimeseries int
The maximum number of unique time series, which can be selected during /api/v1/query and /api/v1/query_range queries. This option allows limiting memory usage (default 300000)
-search.maxWorkersPerQuery int
The maximum number of CPU cores a single query can use. The default value should work good for most cases. The flag can be set to lower values for improving performance of big number of concurrently executed queries. The flag can be set to bigger values for improving performance of heavy queries, which scan big number of time series (>10K) and/or big number of samples (>100M). There is no sense in setting this flag to values bigger than the number of CPU cores available on the system (default 4)
The maximum number of CPU cores a single query can use. The default value should work good for most cases. The flag can be set to lower values for improving performance of big number of concurrently executed queries. The flag can be set to bigger values for improving performance of heavy queries, which scan big number of time series (>10K) and/or big number of samples (>100M). There is no sense in setting this flag to values bigger than the number of CPU cores available on the system (default 16)
-search.minStalenessInterval duration
The minimum interval for staleness calculations. This flag could be useful for removing gaps on graphs generated from time series with irregular intervals between samples. See also '-search.maxStalenessInterval'
-search.minWindowForInstantRollupOptimization value
Enable cache-based optimization for repeated queries to /api/v1/query (aka instant queries), which contain rollup functions with lookbehind window exceeding the given value
The following optional suffixes are supported: s (second), m (minute), h (hour), d (day), w (week), y (year). If suffix isn't set, then the duration is counted in months (default 3h)
-search.noStaleMarkers
Set this flag to true if the database doesn't contain Prometheus stale markers, so there is no need in spending additional CPU time on its handling. Staleness markers may exist only in data obtained from Prometheus scrape targets
-search.queryStats.lastQueriesCount int
@@ -2861,7 +2958,7 @@ Pass `-help` to VictoriaMetrics in order to see the list of supported command-li
The timeout for creating new snapshot. If set, make sure that timeout is lower than backup period
-snapshotsMaxAge value
Automatically delete snapshots older than -snapshotsMaxAge if it is set to non-zero duration. Make sure that backup process has enough time to finish the backup before the corresponding snapshot is automatically deleted
The following optional suffixes are supported: h (hour), d (day), w (week), y (year). If suffix isn't set, then the duration is counted in months (default 0)
The following optional suffixes are supported: s (second), m (minute), h (hour), d (day), w (week), y (year). If suffix isn't set, then the duration is counted in months (default 0)
-sortLabels
Whether to sort labels for incoming samples before writing them to storage. This may be needed for reducing memory usage at storage when the order of labels in incoming samples is random. For example, if m{k1="v1",k2="v2"} may be sent as m{k2="v2",k1="v1"}. Enabled sorting for labels can slow down ingestion performance a bit
-storage.cacheSizeIndexDBDataBlocks size

View File

@@ -26,7 +26,7 @@ import (
)
var (
httpListenAddr = flag.String("httpListenAddr", ":8428", "TCP address to listen for http connections. See also -httpListenAddr.useProxyProtocol")
httpListenAddr = flag.String("httpListenAddr", ":8428", "TCP address to listen for http connections. See also -tls and -httpListenAddr.useProxyProtocol")
useProxyProtocol = flag.Bool("httpListenAddr.useProxyProtocol", false, "Whether to use proxy protocol for connections accepted at -httpListenAddr . "+
"See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt . "+
"With enabled proxy protocol http server cannot serve regular /metrics endpoint. Use -pushmetrics.url for metrics pushing")

View File

@@ -105,7 +105,7 @@ func RequestHandler(path string, w http.ResponseWriter, r *http.Request) bool {
vlstorage.MustAddRows(lr)
logstorage.PutLogRows(lr)
if err != nil {
logger.Warnf("cannot decode log message #%d in /_bulk request: %s", n, err)
logger.Warnf("cannot decode log message #%d in /_bulk request: %s, stream fields: %s", n, err, cp.StreamFields)
return true
}

View File

@@ -120,10 +120,10 @@ func compressData(s string) string {
var bb bytes.Buffer
zw := gzip.NewWriter(&bb)
if _, err := zw.Write([]byte(s)); err != nil {
panic(fmt.Errorf("unexpected error when compressing data: %s", err))
panic(fmt.Errorf("unexpected error when compressing data: %w", err))
}
if err := zw.Close(); err != nil {
panic(fmt.Errorf("unexpected error when closing gzip writer: %s", err))
panic(fmt.Errorf("unexpected error when closing gzip writer: %w", err))
}
return bb.String()
}

View File

@@ -43,7 +43,7 @@ func benchmarkReadBulkRequest(b *testing.B, isGzip bool) {
r.Reset(dataBytes)
_, err := readBulkRequest(r, isGzip, timeField, msgField, processLogMessage)
if err != nil {
panic(fmt.Errorf("unexpected error: %s", err))
panic(fmt.Errorf("unexpected error: %w", err))
}
}
})

View File

@@ -29,7 +29,7 @@ func benchmarkParseJSONRequest(b *testing.B, streams, rows, labels int) {
for pb.Next() {
_, err := parseJSONRequest(data, func(timestamp int64, fields []logstorage.Field) {})
if err != nil {
panic(fmt.Errorf("unexpected error: %s", err))
panic(fmt.Errorf("unexpected error: %w", err))
}
}
})

View File

@@ -84,7 +84,7 @@ func parseProtobufRequest(data []byte, processLogMessage func(timestamp int64, f
err = req.Unmarshal(bb.B)
if err != nil {
return 0, fmt.Errorf("cannot parse request body: %s", err)
return 0, fmt.Errorf("cannot parse request body: %w", err)
}
var commonFields []logstorage.Field
@@ -97,7 +97,7 @@ func parseProtobufRequest(data []byte, processLogMessage func(timestamp int64, f
// Labels are same for all entries in the stream.
commonFields, err = parsePromLabels(commonFields[:0], stream.Labels)
if err != nil {
return rowsIngested, fmt.Errorf("cannot parse stream labels %q: %s", stream.Labels, err)
return rowsIngested, fmt.Errorf("cannot parse stream labels %q: %w", stream.Labels, err)
}
fields := commonFields

View File

@@ -31,7 +31,7 @@ func benchmarkParseProtobufRequest(b *testing.B, streams, rows, labels int) {
for pb.Next() {
_, err := parseProtobufRequest(body, func(timestamp int64, fields []logstorage.Field) {})
if err != nil {
panic(fmt.Errorf("unexpected error: %s", err))
panic(fmt.Errorf("unexpected error: %w", err))
}
}
})

View File

@@ -1,13 +1,13 @@
{
"files": {
"main.css": "./static/css/main.9a224445.css",
"main.js": "./static/js/main.02178f4b.js",
"static/js/522.b5ae4365.chunk.js": "./static/js/522.b5ae4365.chunk.js",
"static/media/MetricsQL.md": "./static/media/MetricsQL.957b90ab4cb4852eec26.md",
"main.css": "./static/css/main.d1313636.css",
"main.js": "./static/js/main.1919fefe.js",
"static/js/522.da77e7b3.chunk.js": "./static/js/522.da77e7b3.chunk.js",
"static/media/MetricsQL.md": "./static/media/MetricsQL.8644fd7c964802dd34a9.md",
"index.html": "./index.html"
},
"entrypoints": [
"static/css/main.9a224445.css",
"static/js/main.02178f4b.js"
"static/css/main.d1313636.css",
"static/js/main.1919fefe.js"
]
}

View File

@@ -1 +1 @@
<!doctype html><html lang="en"><head><meta charset="utf-8"/><link rel="icon" href="./favicon.ico"/><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=5"/><meta name="theme-color" content="#000000"/><meta name="description" content="UI for VictoriaMetrics"/><link rel="apple-touch-icon" href="./apple-touch-icon.png"/><link rel="icon" type="image/png" sizes="32x32" href="./favicon-32x32.png"><link rel="manifest" href="./manifest.json"/><title>VM UI</title><script src="./dashboards/index.js" type="module"></script><meta name="twitter:card" content="summary_large_image"><meta name="twitter:image" content="./preview.jpg"><meta name="twitter:title" content="UI for VictoriaMetrics"><meta name="twitter:description" content="Explore and troubleshoot your VictoriaMetrics data"><meta name="twitter:site" content="@VictoriaMetrics"><meta property="og:title" content="Metric explorer for VictoriaMetrics"><meta property="og:description" content="Explore and troubleshoot your VictoriaMetrics data"><meta property="og:image" content="./preview.jpg"><meta property="og:type" content="website"><script defer="defer" src="./static/js/main.02178f4b.js"></script><link href="./static/css/main.9a224445.css" rel="stylesheet"></head><body><noscript>You need to enable JavaScript to run this app.</noscript><div id="root"></div></body></html>
<!doctype html><html lang="en"><head><meta charset="utf-8"/><link rel="icon" href="./favicon.ico"/><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=5"/><meta name="theme-color" content="#000000"/><meta name="description" content="UI for VictoriaMetrics"/><link rel="apple-touch-icon" href="./apple-touch-icon.png"/><link rel="icon" type="image/png" sizes="32x32" href="./favicon-32x32.png"><link rel="manifest" href="./manifest.json"/><title>VM UI</title><script src="./dashboards/index.js" type="module"></script><meta name="twitter:card" content="summary_large_image"><meta name="twitter:image" content="./preview.jpg"><meta name="twitter:title" content="UI for VictoriaMetrics"><meta name="twitter:description" content="Explore and troubleshoot your VictoriaMetrics data"><meta name="twitter:site" content="@VictoriaMetrics"><meta property="og:title" content="Metric explorer for VictoriaMetrics"><meta property="og:description" content="Explore and troubleshoot your VictoriaMetrics data"><meta property="og:image" content="./preview.jpg"><meta property="og:type" content="website"><script defer="defer" src="./static/js/main.1919fefe.js"></script><link href="./static/css/main.d1313636.css" rel="stylesheet"></head><body><noscript>You need to enable JavaScript to run this app.</noscript><div id="root"></div></body></html>

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@@ -7,7 +7,7 @@
/*! regenerator-runtime -- Copyright (c) 2014-present, Facebook, Inc. -- license (MIT): https://github.com/facebook/regenerator/blob/main/LICENSE */
/**
* @remix-run/router v1.7.2
* @remix-run/router v1.10.0
*
* Copyright (c) Remix Software Inc.
*
@@ -18,7 +18,7 @@
*/
/**
* React Router DOM v6.14.2
* React Router DOM v6.17.0
*
* Copyright (c) Remix Software Inc.
*
@@ -29,7 +29,7 @@
*/
/**
* React Router v6.14.2
* React Router v6.17.0
*
* Copyright (c) Remix Software Inc.
*

View File

@@ -1,11 +1,11 @@
---
sort: 14
weight: 14
sort: 23
weight: 23
title: MetricsQL
menu:
docs:
parent: "victoriametrics"
weight: 14
parent: 'victoriametrics'
weight: 23
aliases:
- /ExtendedPromQL.html
- /MetricsQL.html
@@ -21,7 +21,8 @@ However, there are some [intentional differences](https://medium.com/@romanhavro
[Standalone MetricsQL package](https://godoc.org/github.com/VictoriaMetrics/metricsql) can be used for parsing MetricsQL in external apps.
If you are unfamiliar with PromQL, then it is suggested reading [this tutorial for beginners](https://medium.com/@valyala/promql-tutorial-for-beginners-9ab455142085).
If you are unfamiliar with PromQL, then it is suggested reading [this tutorial for beginners](https://medium.com/@valyala/promql-tutorial-for-beginners-9ab455142085)
and introduction into [basic querying via MetricsQL](https://docs.victoriametrics.com/keyConcepts.html#metricsql).
The following functionality is implemented differently in MetricsQL compared to PromQL. This improves user experience:
@@ -109,7 +110,7 @@ The list of MetricsQL features on top of PromQL:
* [histogram_quantile](#histogram_quantile) accepts optional third arg - `boundsLabel`.
In this case it returns `lower` and `upper` bounds for the estimated percentile.
See [this issue for details](https://github.com/prometheus/prometheus/issues/5706).
* `default` binary operator. `q1 default q2` fills gaps in `q1` with the corresponding values from `q2`.
* `default` binary operator. `q1 default q2` fills gaps in `q1` with the corresponding values from `q2`. See also [drop_empty_series](#drop_empty_series).
* `if` binary operator. `q1 if q2` removes values from `q1` for missing values from `q2`.
* `ifnot` binary operator. `q1 ifnot q2` removes values from `q1` for existing values from `q2`.
* `WITH` templates. This feature simplifies writing and managing complex queries.
@@ -531,7 +532,7 @@ See also [duration_over_time](#duration_over_time) and [lag](#lag).
`mad_over_time(series_selector[d])` is a [rollup function](#rollup-functions), which calculates [median absolute deviation](https://en.wikipedia.org/wiki/Median_absolute_deviation)
over raw samples on the given lookbehind window `d` per each time series returned from the given [series_selector](https://docs.victoriametrics.com/keyConcepts.html#filtering).
See also [mad](#mad) and [range_mad](#range_mad).
See also [mad](#mad), [range_mad](#range_mad) and [outlier_iqr_over_time](#outlier_iqr_over_time).
#### max_over_time
@@ -561,6 +562,18 @@ This function is supported by PromQL. See also [tmin_over_time](#tmin_over_time)
for raw samples on the given lookbehind window `d`. It is calculated individually per each time series returned
from the given [series_selector](https://docs.victoriametrics.com/keyConcepts.html#filtering). It is expected that raw sample values are discrete.
#### outlier_iqr_over_time
`outlier_iqr_over_time(series_selector[d])` is a [rollup function](#rollup-functions), which returns the last sample on the given lookbehind window `d`
if its value is either smaller than the `q25-1.5*iqr` or bigger than `q75+1.5*iqr` where:
- `iqr` is an [Interquartile range](https://en.wikipedia.org/wiki/Interquartile_range) over raw samples on the lookbehind window `d`
- `q25` and `q75` are 25th and 75th [percentiles](https://en.wikipedia.org/wiki/Percentile) over raw samples on the lookbehind window `d`.
The `outlier_iqr_over_time()` is useful for detecting anomalies in gauge values based on the previous history of values.
For example, `outlier_iqr_over_time(memory_usage_bytes[1h])` triggers when `memory_usage_bytes` suddenly goes outside the usual value range for the last 24 hours.
See also [outliers_iqr](#outliers_iqr).
#### predict_linear
`predict_linear(series_selector[d], t)` is a [rollup function](#rollup-functions), which calculates the value `t` seconds in the future using
@@ -865,7 +878,7 @@ from the given [series_selector](https://docs.victoriametrics.com/keyConcepts.ht
Metric names are stripped from the resulting rollups. Add [keep_metric_names](#keep_metric_names) modifier in order to keep metric names.
See also [zscore](#zscore) and [range_trim_zscore](#range_trim_zscore).
See also [zscore](#zscore), [range_trim_zscore](#range_trim_zscore) and [outlier_iqr_over_time](#outlier_iqr_over_time).
### Transform functions
@@ -1055,6 +1068,17 @@ Metric names are stripped from the resulting series. Add [keep_metric_names](#ke
This function is supported by PromQL. See also [rad](#rad).
#### drop_empty_series
`drop_empty_series(q)` is a [transform function](#transform-functions), which drops empty series from `q`.
This function can be used when `default` operator should be applied only to non-empty series. For example,
`drop_empty_series(temperature < 30) default 42` returns series, which have at least a single sample smaller than 30 on the selected time range,
while filling gaps in the returned series with 42.
On the other hand `(temperature < 30) default 40` returns all the `temperature` series, even if they have no samples smaller than 30,
by replacing all the values bigger or equal to 30 with 40.
#### end
`end()` is a [transform function](#transform-functions), which returns the unix timestamp in seconds for the last point.
@@ -1591,7 +1615,7 @@ which maps `label` values from `src_*` to `dst*` for all the time series returne
which drops time series from `q` with `label` not matching the given `regexp`.
This function can be useful after [rollup](#rollup)-like functions, which may return multiple time series for every input series.
See also [label_mismatch](#label_mismatch).
See also [label_mismatch](#label_mismatch) and [labels_equal](#labels_equal).
#### label_mismatch
@@ -1599,7 +1623,7 @@ See also [label_mismatch](#label_mismatch).
which drops time series from `q` with `label` matching the given `regexp`.
This function can be useful after [rollup](#rollup)-like functions, which may return multiple time series for every input series.
See also [label_match](#label_match).
See also [label_match](#label_match) and [labels_equal](#labels_equal).
#### label_move
@@ -1642,23 +1666,30 @@ for the given `label` for every time series returned by `q`.
For example, if `label_value(foo, "bar")` is applied to `foo{bar="1.234"}`, then it will return a time series
`foo{bar="1.234"}` with `1.234` value. Function will return no data for non-numeric label values.
#### labels_equal
`labels_equal(q, "label1", "label2", ...)` is [label manipulation function](#label-manipulation-functions), which returns `q` series with identical values for the listed labels
"label1", "label2", etc.
See also [label_match](#label_match) and [label_mismatch](#label_mismatch).
#### sort_by_label
`sort_by_label(q, label1, ... labelN)` is [label manipulation function](#label-manipulation-functions), which sorts series in ascending order by the given set of labels.
`sort_by_label(q, "label1", ... "labelN")` is [label manipulation function](#label-manipulation-functions), which sorts series in ascending order by the given set of labels.
For example, `sort_by_label(foo, "bar")` would sort `foo` series by values of the label `bar` in these series.
See also [sort_by_label_desc](#sort_by_label_desc) and [sort_by_label_numeric](#sort_by_label_numeric).
#### sort_by_label_desc
`sort_by_label_desc(q, label1, ... labelN)` is [label manipulation function](#label-manipulation-functions), which sorts series in descending order by the given set of labels.
`sort_by_label_desc(q, "label1", ... "labelN")` is [label manipulation function](#label-manipulation-functions), which sorts series in descending order by the given set of labels.
For example, `sort_by_label(foo, "bar")` would sort `foo` series by values of the label `bar` in these series.
See also [sort_by_label](#sort_by_label) and [sort_by_label_numeric_desc](#sort_by_label_numeric_desc).
#### sort_by_label_numeric
`sort_by_label_numeric(q, label1, ... labelN)` is [label manipulation function](#label-manipulation-functions), which sorts series in ascending order by the given set of labels
`sort_by_label_numeric(q, "label1", ... "labelN")` is [label manipulation function](#label-manipulation-functions), which sorts series in ascending order by the given set of labels
using [numeric sort](https://www.gnu.org/software/coreutils/manual/html_node/Version-sort-is-not-the-same-as-numeric-sort.html).
For example, if `foo` series have `bar` label with values `1`, `101`, `15` and `2`, then `sort_by_label_numeric(foo, "bar")` would return series
in the following order of `bar` label values: `1`, `2`, `15` and `101`.
@@ -1667,7 +1698,7 @@ See also [sort_by_label_numeric_desc](#sort_by_label_numeric_desc) and [sort_by_
#### sort_by_label_numeric_desc
`sort_by_label_numeric_desc(q, label1, ... labelN)` is [label manipulation function](#label-manipulation-functions), which sorts series in descending order
`sort_by_label_numeric_desc(q, "label1", ... "labelN")` is [label manipulation function](#label-manipulation-functions), which sorts series in descending order
by the given set of labels using [numeric sort](https://www.gnu.org/software/coreutils/manual/html_node/Version-sort-is-not-the-same-as-numeric-sort.html).
For example, if `foo` series have `bar` label with values `1`, `101`, `15` and `2`, then `sort_by_label_numeric(foo, "bar")`
would return series in the following order of `bar` label values: `101`, `15`, `2` and `1`.
@@ -1839,20 +1870,33 @@ This function is supported by PromQL.
`mode(q) by (group_labels)` is [aggregate function](#aggregate-functions), which returns [mode](https://en.wikipedia.org/wiki/Mode_(statistics))
per each `group_labels` for all the time series returned by `q`. The aggregate is calculated individually per each group of points with the same timestamp.
#### outliers_iqr
`outliers_iqr(q)` is [aggregate function](#aggregate-functions), which returns time series from `q` with at least a single point
outside e.g. [Interquartile range outlier bounds](https://en.wikipedia.org/wiki/Interquartile_range) `[q25-1.5*iqr .. q75+1.5*iqr]`
comparing to other time series at the given point, where:
- `iqr` is an [Interquartile range](https://en.wikipedia.org/wiki/Interquartile_range) calculated independently per each point on the graph across `q` series.
- `q25` and `q75` are 25th and 75th [percentiles](https://en.wikipedia.org/wiki/Percentile) calculated independently per each point on the graph across `q` series.
The `outliers_iqr()` is useful for detecting anomalous series in the group of series. For example, `outliers_iqr(temperature) by (country)` returns
per-country series with anomalous outlier values comparing to the rest of per-country series.
See also [outliers_mad](#outliers_mad), [outliersk](#outliersk) and [outlier_iqr_over_time](#outlier_iqr_over_time).
#### outliers_mad
`outliers_mad(tolerance, q)` is [aggregate function](#aggregate-functions), which returns time series from `q` with at least
a single point outside [Median absolute deviation](https://en.wikipedia.org/wiki/Median_absolute_deviation) (aka MAD) multiplied by `tolerance`.
E.g. it returns time series with at least a single point below `median(q) - mad(q)` or a single point above `median(q) + mad(q)`.
See also [outliersk](#outliersk) and [mad](#mad).
See also [outliers_iqr](#outliers_iqr), [outliersk](#outliersk) and [mad](#mad).
#### outliersk
`outliersk(k, q)` is [aggregate function](#aggregate-functions), which returns up to `k` time series with the biggest standard deviation (aka outliers)
out of time series returned by `q`.
See also [outliers_mad](#outliers_mad).
See also [outliers_iqr](#outliers_iqr) and [outliers_mad](#outliers_mad).
#### quantile
@@ -1972,7 +2016,7 @@ See also [bottomk_min](#bottomk_min).
per each `group_labels` for all the time series returned by `q`. The aggregate is calculated individually per each group of points with the same timestamp.
This function is useful for detecting anomalies in the group of related time series.
See also [zscore_over_time](#zscore_over_time) and [range_trim_zscore](#range_trim_zscore).
See also [zscore_over_time](#zscore_over_time), [range_trim_zscore](#range_trim_zscore) and [outliers_iqr](#outliers_iqr).
## Subqueries

File diff suppressed because it is too large Load Diff

View File

@@ -65,7 +65,9 @@ func insertRows(at *auth.Token, rows []parser.Row, extraLabels []prompbmarshal.L
ctx.WriteRequest.Timeseries = tssDst
ctx.Labels = labels
ctx.Samples = samples
remotewrite.Push(at, &ctx.WriteRequest)
if !remotewrite.TryPush(at, &ctx.WriteRequest) {
return remotewrite.ErrQueueFullHTTPRetry
}
rowsInserted.Add(len(rows))
if at != nil {
rowsTenantInserted.Get(at).Add(len(rows))

View File

@@ -88,7 +88,9 @@ func insertRows(at *auth.Token, series []datadog.Series, extraLabels []prompbmar
ctx.WriteRequest.Timeseries = tssDst
ctx.Labels = labels
ctx.Samples = samples
remotewrite.Push(at, &ctx.WriteRequest)
if !remotewrite.TryPush(at, &ctx.WriteRequest) {
return remotewrite.ErrQueueFullHTTPRetry
}
rowsInserted.Add(rowsTotal)
if at != nil {
rowsTenantInserted.Get(at).Add(rowsTotal)

View File

@@ -5,6 +5,7 @@ import (
"github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/common"
"github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/auth"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/prompbmarshal"
parser "github.com/VictoriaMetrics/VictoriaMetrics/lib/protoparser/graphite"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/protoparser/graphite/stream"
@@ -20,10 +21,12 @@ var (
//
// See https://graphite.readthedocs.io/en/latest/feeding-carbon.html#the-plaintext-protocol
func InsertHandler(r io.Reader) error {
return stream.Parse(r, insertRows)
return stream.Parse(r, false, func(rows []parser.Row) error {
return insertRows(nil, rows)
})
}
func insertRows(rows []parser.Row) error {
func insertRows(at *auth.Token, rows []parser.Row) error {
ctx := common.GetPushCtx()
defer common.PutPushCtx(ctx)
@@ -56,7 +59,9 @@ func insertRows(rows []parser.Row) error {
ctx.WriteRequest.Timeseries = tssDst
ctx.Labels = labels
ctx.Samples = samples
remotewrite.Push(nil, &ctx.WriteRequest)
if !remotewrite.TryPush(at, &ctx.WriteRequest) {
return remotewrite.ErrQueueFullHTTPRetry
}
rowsInserted.Add(len(rows))
rowsPerInsert.Update(float64(len(rows)))
return nil

View File

@@ -36,9 +36,9 @@ var (
// InsertHandlerForReader processes remote write for influx line protocol.
//
// See https://github.com/influxdata/telegraf/tree/master/plugins/inputs/socket_listener/
func InsertHandlerForReader(r io.Reader, isGzipped bool) error {
func InsertHandlerForReader(at *auth.Token, r io.Reader, isGzipped bool) error {
return stream.Parse(r, isGzipped, "", "", func(db string, rows []parser.Row) error {
return insertRows(nil, db, rows, nil)
return insertRows(at, db, rows, nil)
})
}
@@ -130,7 +130,9 @@ func insertRows(at *auth.Token, db string, rows []parser.Row, extraLabels []prom
ctx.ctx.Labels = labels
ctx.ctx.Samples = samples
ctx.commonLabels = commonLabels
remotewrite.Push(at, &ctx.ctx.WriteRequest)
if !remotewrite.TryPush(at, &ctx.ctx.WriteRequest) {
return remotewrite.ErrQueueFullHTTPRetry
}
rowsInserted.Add(rowsTotal)
if at != nil {
rowsTenantInserted.Get(at).Add(rowsTotal)

View File

@@ -11,8 +11,6 @@ import (
"sync/atomic"
"time"
"github.com/VictoriaMetrics/metrics"
"github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/csvimport"
"github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/datadog"
"github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/graphite"
@@ -42,12 +40,13 @@ import (
"github.com/VictoriaMetrics/VictoriaMetrics/lib/promscrape"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/protoparser/common"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/pushmetrics"
"github.com/VictoriaMetrics/metrics"
)
var (
httpListenAddr = flag.String("httpListenAddr", ":8429", "TCP address to listen for http connections. "+
"Set this flag to empty value in order to disable listening on any port. This mode may be useful for running multiple vmagent instances on the same server. "+
"Note that /targets and /metrics pages aren't available if -httpListenAddr=''. See also -httpListenAddr.useProxyProtocol")
"Note that /targets and /metrics pages aren't available if -httpListenAddr=''. See also -tls and -httpListenAddr.useProxyProtocol")
useProxyProtocol = flag.Bool("httpListenAddr.useProxyProtocol", false, "Whether to use proxy protocol for connections accepted at -httpListenAddr . "+
"See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt . "+
"With enabled proxy protocol http server cannot serve regular /metrics endpoint. Use -pushmetrics.url for metrics pushing")
@@ -125,7 +124,7 @@ func main() {
common.StartUnmarshalWorkers()
if len(*influxListenAddr) > 0 {
influxServer = influxserver.MustStart(*influxListenAddr, *influxUseProxyProtocol, func(r io.Reader) error {
return influx.InsertHandlerForReader(r, false)
return influx.InsertHandlerForReader(nil, r, false)
})
}
if len(*graphiteListenAddr) > 0 {
@@ -140,7 +139,7 @@ func main() {
opentsdbhttpServer = opentsdbhttpserver.MustStart(*opentsdbHTTPListenAddr, *opentsdbHTTPUseProxyProtocol, httpInsertHandler)
}
promscrape.Init(remotewrite.Push)
promscrape.Init(remotewrite.PushDropSamplesOnFailure)
if len(*httpListenAddr) > 0 {
go httpserver.Serve(*httpListenAddr, *useProxyProtocol, requestHandler)

View File

@@ -84,6 +84,8 @@ func insertRows(at *auth.Token, block *stream.Block, extraLabels []prompbmarshal
ctx.WriteRequest.Timeseries = tssDst
ctx.Labels = labels
ctx.Samples = samples
remotewrite.Push(at, &ctx.WriteRequest)
if !remotewrite.TryPush(at, &ctx.WriteRequest) {
return remotewrite.ErrQueueFullHTTPRetry
}
return nil
}

View File

@@ -76,7 +76,9 @@ func insertRows(at *auth.Token, rows []newrelic.Row, extraLabels []prompbmarshal
ctx.WriteRequest.Timeseries = tssDst
ctx.Labels = labels
ctx.Samples = samples
remotewrite.Push(at, &ctx.WriteRequest)
if !remotewrite.TryPush(at, &ctx.WriteRequest) {
return remotewrite.ErrQueueFullHTTPRetry
}
rowsInserted.Add(len(rows))
if at != nil {
rowsTenantInserted.Get(at).Add(samplesCount)

View File

@@ -59,7 +59,9 @@ func insertRows(at *auth.Token, tss []prompbmarshal.TimeSeries, extraLabels []pr
ctx.WriteRequest.Timeseries = tssDst
ctx.Labels = labels
ctx.Samples = samples
remotewrite.Push(at, &ctx.WriteRequest)
if !remotewrite.TryPush(at, &ctx.WriteRequest) {
return remotewrite.ErrQueueFullHTTPRetry
}
rowsInserted.Add(rowsTotal)
if at != nil {
rowsTenantInserted.Get(at).Add(rowsTotal)

View File

@@ -56,7 +56,9 @@ func insertRows(rows []parser.Row) error {
ctx.WriteRequest.Timeseries = tssDst
ctx.Labels = labels
ctx.Samples = samples
remotewrite.Push(nil, &ctx.WriteRequest)
if !remotewrite.TryPush(nil, &ctx.WriteRequest) {
return remotewrite.ErrQueueFullHTTPRetry
}
rowsInserted.Add(len(rows))
rowsPerInsert.Update(float64(len(rows)))
return nil

View File

@@ -64,7 +64,9 @@ func insertRows(at *auth.Token, rows []parser.Row, extraLabels []prompbmarshal.L
ctx.WriteRequest.Timeseries = tssDst
ctx.Labels = labels
ctx.Samples = samples
remotewrite.Push(at, &ctx.WriteRequest)
if !remotewrite.TryPush(at, &ctx.WriteRequest) {
return remotewrite.ErrQueueFullHTTPRetry
}
rowsInserted.Add(len(rows))
rowsPerInsert.Update(float64(len(rows)))
return nil

View File

@@ -73,7 +73,9 @@ func insertRows(at *auth.Token, rows []parser.Row, extraLabels []prompbmarshal.L
ctx.WriteRequest.Timeseries = tssDst
ctx.Labels = labels
ctx.Samples = samples
remotewrite.Push(at, &ctx.WriteRequest)
if !remotewrite.TryPush(at, &ctx.WriteRequest) {
return remotewrite.ErrQueueFullHTTPRetry
}
rowsInserted.Add(len(rows))
if at != nil {
rowsTenantInserted.Get(at).Add(len(rows))

View File

@@ -69,7 +69,9 @@ func insertRows(at *auth.Token, timeseries []prompb.TimeSeries, extraLabels []pr
ctx.WriteRequest.Timeseries = tssDst
ctx.Labels = labels
ctx.Samples = samples
remotewrite.Push(at, &ctx.WriteRequest)
if !remotewrite.TryPush(at, &ctx.WriteRequest) {
return remotewrite.ErrQueueFullHTTPRetry
}
rowsInserted.Add(rowsTotal)
if at != nil {
rowsTenantInserted.Get(at).Add(rowsTotal)

View File

@@ -106,12 +106,15 @@ type client struct {
func newHTTPClient(argIdx int, remoteWriteURL, sanitizedURL string, fq *persistentqueue.FastQueue, concurrency int) *client {
authCfg, err := getAuthConfig(argIdx)
if err != nil {
logger.Panicf("FATAL: cannot initialize auth config for remoteWrite.url=%q: %s", remoteWriteURL, err)
logger.Fatalf("cannot initialize auth config for -remoteWrite.url=%q: %s", remoteWriteURL, err)
}
tlsCfg, err := authCfg.NewTLSConfig()
if err != nil {
logger.Fatalf("cannot initialize tls config for -remoteWrite.url=%q: %s", remoteWriteURL, err)
}
tlsCfg := authCfg.NewTLSConfig()
awsCfg, err := getAWSAPIConfig(argIdx)
if err != nil {
logger.Fatalf("FATAL: cannot initialize AWS Config for remoteWrite.url=%q: %s", remoteWriteURL, err)
logger.Fatalf("cannot initialize AWS Config for -remoteWrite.url=%q: %s", remoteWriteURL, err)
}
tr := &http.Transport{
DialContext: statDial,
@@ -302,7 +305,7 @@ func (c *client) runWorker() {
continue
}
// Return unsent block to the queue.
c.fq.MustWriteBlock(block)
c.fq.MustWriteBlockIgnoreDisabledPQ(block)
return
case <-c.stopCh:
// c must be stopped. Wait for a while in the hope the block will be sent.
@@ -311,11 +314,11 @@ func (c *client) runWorker() {
case ok := <-ch:
if !ok {
// Return unsent block to the queue.
c.fq.MustWriteBlock(block)
c.fq.MustWriteBlockIgnoreDisabledPQ(block)
}
case <-time.After(graceDuration):
// Return unsent block to the queue.
c.fq.MustWriteBlock(block)
c.fq.MustWriteBlockIgnoreDisabledPQ(block)
}
return
}
@@ -328,15 +331,25 @@ func (c *client) doRequest(url string, body []byte) (*http.Response, error) {
return nil, err
}
resp, err := c.hc.Do(req)
if err != nil && errors.Is(err, io.EOF) {
// it is likely connection became stale.
// So we do one more attempt in hope request will succeed.
// If not, the error should be handled by the caller as usual.
// This should help with https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4139
req, _ = c.newRequest(url, body)
resp, err = c.hc.Do(req)
if err == nil {
return resp, nil
}
return resp, err
if !errors.Is(err, io.EOF) && !errors.Is(err, io.ErrUnexpectedEOF) {
return nil, err
}
// It is likely connection became stale or timed out during the first request.
// Make another attempt in hope request will succeed.
// If not, the error should be handled by the caller as usual.
// This should help with https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4139
req, err = c.newRequest(url, body)
if err != nil {
return nil, fmt.Errorf("second attempt: %w", err)
}
resp, err = c.hc.Do(req)
if err != nil {
return nil, fmt.Errorf("second attempt: %w", err)
}
return resp, nil
}
func (c *client) newRequest(url string, body []byte) (*http.Request, error) {
@@ -362,8 +375,7 @@ func (c *client) newRequest(url string, body []byte) (*http.Request, error) {
if c.awsCfg != nil {
sigv4Hash := awsapi.HashHex(body)
if err := c.awsCfg.SignRequest(req, sigv4Hash); err != nil {
// there is no need in retry, request will be rejected by client.Do and retried by code below
logger.Warnf("cannot sign remoteWrite request with AWS sigv4: %s", err)
return nil, fmt.Errorf("cannot sign remoteWrite request with AWS sigv4: %w", err)
}
}
return req, nil

View File

@@ -37,9 +37,9 @@ type pendingSeries struct {
periodicFlusherWG sync.WaitGroup
}
func newPendingSeries(pushBlock func(block []byte), isVMRemoteWrite bool, significantFigures, roundDigits int) *pendingSeries {
func newPendingSeries(fq *persistentqueue.FastQueue, isVMRemoteWrite bool, significantFigures, roundDigits int) *pendingSeries {
var ps pendingSeries
ps.wr.pushBlock = pushBlock
ps.wr.fq = fq
ps.wr.isVMRemoteWrite = isVMRemoteWrite
ps.wr.significantFigures = significantFigures
ps.wr.roundDigits = roundDigits
@@ -57,10 +57,11 @@ func (ps *pendingSeries) MustStop() {
ps.periodicFlusherWG.Wait()
}
func (ps *pendingSeries) Push(tss []prompbmarshal.TimeSeries) {
func (ps *pendingSeries) TryPush(tss []prompbmarshal.TimeSeries) bool {
ps.mu.Lock()
ps.wr.push(tss)
ok := ps.wr.tryPush(tss)
ps.mu.Unlock()
return ok
}
func (ps *pendingSeries) periodicFlusher() {
@@ -70,18 +71,20 @@ func (ps *pendingSeries) periodicFlusher() {
}
ticker := time.NewTicker(*flushInterval)
defer ticker.Stop()
mustStop := false
for !mustStop {
for {
select {
case <-ps.stopCh:
mustStop = true
ps.mu.Lock()
ps.wr.mustFlushOnStop()
ps.mu.Unlock()
return
case <-ticker.C:
if fasttime.UnixTimestamp()-atomic.LoadUint64(&ps.wr.lastFlushTime) < uint64(flushSeconds) {
continue
}
}
ps.mu.Lock()
ps.wr.flush()
_ = ps.wr.tryFlush()
ps.mu.Unlock()
}
}
@@ -90,16 +93,16 @@ type writeRequest struct {
// Move lastFlushTime to the top of the struct in order to guarantee atomic access on 32-bit architectures.
lastFlushTime uint64
// pushBlock is called when whe write request is ready to be sent.
pushBlock func(block []byte)
// The queue to send blocks to.
fq *persistentqueue.FastQueue
// Whether to encode the write request with VictoriaMetrics remote write protocol.
isVMRemoteWrite bool
// How many significant figures must be left before sending the writeRequest to pushBlock.
// How many significant figures must be left before sending the writeRequest to fq.
significantFigures int
// How many decimal digits after point must be left before sending the writeRequest to pushBlock.
// How many decimal digits after point must be left before sending the writeRequest to fq.
roundDigits int
wr prompbmarshal.WriteRequest
@@ -112,7 +115,7 @@ type writeRequest struct {
}
func (wr *writeRequest) reset() {
// Do not reset lastFlushTime, pushBlock, isVMRemoteWrite, significantFigures and roundDigits, since they are re-used.
// Do not reset lastFlushTime, fq, isVMRemoteWrite, significantFigures and roundDigits, since they are re-used.
wr.wr.Timeseries = nil
@@ -130,23 +133,40 @@ func (wr *writeRequest) reset() {
wr.buf = wr.buf[:0]
}
func (wr *writeRequest) flush() {
// mustFlushOnStop force pushes wr data into wr.fq
//
// This is needed in order to properly save in-memory data to persistent queue on graceful shutdown.
func (wr *writeRequest) mustFlushOnStop() {
wr.wr.Timeseries = wr.tss
wr.adjustSampleValues()
atomic.StoreUint64(&wr.lastFlushTime, fasttime.UnixTimestamp())
pushWriteRequest(&wr.wr, wr.pushBlock, wr.isVMRemoteWrite)
if !tryPushWriteRequest(&wr.wr, wr.mustWriteBlock, wr.isVMRemoteWrite) {
logger.Panicf("BUG: final flush must always return true")
}
wr.reset()
}
func (wr *writeRequest) adjustSampleValues() {
samples := wr.samples
if n := wr.significantFigures; n > 0 {
func (wr *writeRequest) mustWriteBlock(block []byte) bool {
wr.fq.MustWriteBlockIgnoreDisabledPQ(block)
return true
}
func (wr *writeRequest) tryFlush() bool {
wr.wr.Timeseries = wr.tss
atomic.StoreUint64(&wr.lastFlushTime, fasttime.UnixTimestamp())
if !tryPushWriteRequest(&wr.wr, wr.fq.TryWriteBlock, wr.isVMRemoteWrite) {
return false
}
wr.reset()
return true
}
func adjustSampleValues(samples []prompbmarshal.Sample, significantFigures, roundDigits int) {
if n := significantFigures; n > 0 {
for i := range samples {
s := &samples[i]
s.Value = decimal.RoundToSignificantFigures(s.Value, n)
}
}
if n := wr.roundDigits; n < 100 {
if n := roundDigits; n < 100 {
for i := range samples {
s := &samples[i]
s.Value = decimal.RoundToDecimalDigits(s.Value, n)
@@ -154,21 +174,27 @@ func (wr *writeRequest) adjustSampleValues() {
}
}
func (wr *writeRequest) push(src []prompbmarshal.TimeSeries) {
func (wr *writeRequest) tryPush(src []prompbmarshal.TimeSeries) bool {
tssDst := wr.tss
maxSamplesPerBlock := *maxRowsPerBlock
// Allow up to 10x of labels per each block on average.
maxLabelsPerBlock := 10 * maxSamplesPerBlock
for i := range src {
tssDst = append(tssDst, prompbmarshal.TimeSeries{})
wr.copyTimeSeries(&tssDst[len(tssDst)-1], &src[i])
if len(wr.samples) >= maxSamplesPerBlock || len(wr.labels) >= maxLabelsPerBlock {
wr.tss = tssDst
wr.flush()
if !wr.tryFlush() {
return false
}
tssDst = wr.tss
}
tsSrc := &src[i]
adjustSampleValues(tsSrc.Samples, wr.significantFigures, wr.roundDigits)
tssDst = append(tssDst, prompbmarshal.TimeSeries{})
wr.copyTimeSeries(&tssDst[len(tssDst)-1], tsSrc)
}
wr.tss = tssDst
return true
}
func (wr *writeRequest) copyTimeSeries(dst, src *prompbmarshal.TimeSeries) {
@@ -196,10 +222,10 @@ func (wr *writeRequest) copyTimeSeries(dst, src *prompbmarshal.TimeSeries) {
wr.buf = buf
}
func pushWriteRequest(wr *prompbmarshal.WriteRequest, pushBlock func(block []byte), isVMRemoteWrite bool) {
func tryPushWriteRequest(wr *prompbmarshal.WriteRequest, tryPushBlock func(block []byte) bool, isVMRemoteWrite bool) bool {
if len(wr.Timeseries) == 0 {
// Nothing to push
return
return true
}
bb := writeRequestBufPool.Get()
bb.B = prompbmarshal.MarshalWriteRequest(bb.B[:0], wr)
@@ -212,11 +238,13 @@ func pushWriteRequest(wr *prompbmarshal.WriteRequest, pushBlock func(block []byt
}
writeRequestBufPool.Put(bb)
if len(zb.B) <= persistentqueue.MaxBlockSize {
pushBlock(zb.B)
if !tryPushBlock(zb.B) {
return false
}
blockSizeRows.Update(float64(len(wr.Timeseries)))
blockSizeBytes.Update(float64(len(zb.B)))
snappyBufPool.Put(zb)
return
return true
}
snappyBufPool.Put(zb)
} else {
@@ -229,23 +257,36 @@ func pushWriteRequest(wr *prompbmarshal.WriteRequest, pushBlock func(block []byt
samples := wr.Timeseries[0].Samples
if len(samples) == 1 {
logger.Warnf("dropping a sample for metric with too long labels exceeding -remoteWrite.maxBlockSize=%d bytes", maxUnpackedBlockSize.N)
return
return true
}
n := len(samples) / 2
wr.Timeseries[0].Samples = samples[:n]
pushWriteRequest(wr, pushBlock, isVMRemoteWrite)
if !tryPushWriteRequest(wr, tryPushBlock, isVMRemoteWrite) {
wr.Timeseries[0].Samples = samples
return false
}
wr.Timeseries[0].Samples = samples[n:]
pushWriteRequest(wr, pushBlock, isVMRemoteWrite)
if !tryPushWriteRequest(wr, tryPushBlock, isVMRemoteWrite) {
wr.Timeseries[0].Samples = samples
return false
}
wr.Timeseries[0].Samples = samples
return
return true
}
timeseries := wr.Timeseries
n := len(timeseries) / 2
wr.Timeseries = timeseries[:n]
pushWriteRequest(wr, pushBlock, isVMRemoteWrite)
if !tryPushWriteRequest(wr, tryPushBlock, isVMRemoteWrite) {
wr.Timeseries = timeseries
return false
}
wr.Timeseries = timeseries[n:]
pushWriteRequest(wr, pushBlock, isVMRemoteWrite)
if !tryPushWriteRequest(wr, tryPushBlock, isVMRemoteWrite) {
wr.Timeseries = timeseries
return false
}
wr.Timeseries = timeseries
return true
}
var (

View File

@@ -26,13 +26,16 @@ func testPushWriteRequest(t *testing.T, rowsCount, expectedBlockLenProm, expecte
t.Helper()
wr := newTestWriteRequest(rowsCount, 20)
pushBlockLen := 0
pushBlock := func(block []byte) {
pushBlock := func(block []byte) bool {
if pushBlockLen > 0 {
panic(fmt.Errorf("BUG: pushBlock called multiple times; pushBlockLen=%d at first call, len(block)=%d at second call", pushBlockLen, len(block)))
}
pushBlockLen = len(block)
return true
}
if !tryPushWriteRequest(wr, pushBlock, isVMRemoteWrite) {
t.Fatalf("cannot push data to to remote storage")
}
pushWriteRequest(wr, pushBlock, isVMRemoteWrite)
if math.Abs(float64(pushBlockLen-expectedBlockLen)/float64(expectedBlockLen)*100) > tolerancePrc {
t.Fatalf("unexpected block len for rowsCount=%d, isVMRemoteWrite=%v; got %d bytes; expecting %d bytes +- %.0f%%",
rowsCount, isVMRemoteWrite, pushBlockLen, expectedBlockLen, tolerancePrc)

View File

@@ -3,6 +3,7 @@ package remotewrite
import (
"flag"
"fmt"
"strconv"
"strings"
"sync"
@@ -92,6 +93,7 @@ func (rctx *relabelCtx) applyRelabeling(tss []prompbmarshal.TimeSeries, pcs *pro
// Nothing to change.
return tss
}
rctx.reset()
tssDst := tss[:0]
labels := rctx.labels[:0]
for i := range tss {
@@ -120,6 +122,7 @@ func (rctx *relabelCtx) appendExtraLabels(tss []prompbmarshal.TimeSeries, extraL
if len(extraLabels) == 0 {
return
}
rctx.reset()
labels := rctx.labels[:0]
for i := range tss {
ts := &tss[i]
@@ -139,6 +142,34 @@ func (rctx *relabelCtx) appendExtraLabels(tss []prompbmarshal.TimeSeries, extraL
rctx.labels = labels
}
func (rctx *relabelCtx) tenantToLabels(tss []prompbmarshal.TimeSeries, accountID, projectID uint32) {
rctx.reset()
accountIDStr := strconv.FormatUint(uint64(accountID), 10)
projectIDStr := strconv.FormatUint(uint64(projectID), 10)
labels := rctx.labels[:0]
for i := range tss {
ts := &tss[i]
labelsLen := len(labels)
for _, label := range ts.Labels {
labelName := label.Name
if labelName == "vm_account_id" || labelName == "vm_project_id" {
continue
}
labels = append(labels, label)
}
labels = append(labels, prompbmarshal.Label{
Name: "vm_account_id",
Value: accountIDStr,
})
labels = append(labels, prompbmarshal.Label{
Name: "vm_project_id",
Value: projectIDStr,
})
ts.Labels = labels[labelsLen:]
}
rctx.labels = labels
}
type relabelCtx struct {
// pool for labels, which are used during the relabeling.
labels []prompbmarshal.Label
@@ -160,7 +191,7 @@ func getRelabelCtx() *relabelCtx {
}
func putRelabelCtx(rctx *relabelCtx) {
rctx.labels = rctx.labels[:0]
rctx.reset()
relabelCtxPool.Put(rctx)
}

View File

@@ -3,6 +3,7 @@ package remotewrite
import (
"flag"
"fmt"
"net/http"
"net/url"
"path/filepath"
"strconv"
@@ -10,6 +11,8 @@ import (
"sync/atomic"
"time"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/httpserver"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/auth"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/bloomfilter"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/bytesutil"
@@ -23,6 +26,7 @@ import (
"github.com/VictoriaMetrics/VictoriaMetrics/lib/procutil"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/prompbmarshal"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/promrelabel"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/promutils"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/streamaggr"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/tenantmetrics"
"github.com/VictoriaMetrics/metrics"
@@ -33,15 +37,22 @@ var (
remoteWriteURLs = flagutil.NewArrayString("remoteWrite.url", "Remote storage URL to write data to. It must support either VictoriaMetrics remote write protocol "+
"or Prometheus remote_write protocol. Example url: http://<victoriametrics-host>:8428/api/v1/write . "+
"Pass multiple -remoteWrite.url options in order to replicate the collected data to multiple remote storage systems. "+
"The data can be sharded among the configured remote storage systems if -remoteWrite.shardByURL flag is set. "+
"See also -remoteWrite.multitenantURL")
"The data can be sharded among the configured remote storage systems if -remoteWrite.shardByURL flag is set")
remoteWriteMultitenantURLs = flagutil.NewArrayString("remoteWrite.multitenantURL", "Base path for multitenant remote storage URL to write data to. "+
"See https://docs.victoriametrics.com/vmagent.html#multitenancy for details. Example url: http://<vminsert>:8480 . "+
"Pass multiple -remoteWrite.multitenantURL flags in order to replicate data to multiple remote storage systems. See also -remoteWrite.url")
"Pass multiple -remoteWrite.multitenantURL flags in order to replicate data to multiple remote storage systems. "+
"This flag is deprecated in favor of -enableMultitenantHandlers . See https://docs.victoriametrics.com/vmagent.html#multitenancy")
enableMultitenantHandlers = flag.Bool("enableMultitenantHandlers", false, "Whether to process incoming data via multitenant insert handlers according to "+
"https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#url-format . By default incoming data is processed via single-node insert handlers "+
"according to https://docs.victoriametrics.com/#how-to-import-time-series-data ."+
"See https://docs.victoriametrics.com/vmagent.html#multitenancy for details")
shardByURL = flag.Bool("remoteWrite.shardByURL", false, "Whether to shard outgoing series across all the remote storage systems enumerated via -remoteWrite.url . "+
"By default the data is replicated across all the -remoteWrite.url . See https://docs.victoriametrics.com/vmagent.html#sharding-among-remote-storages")
tmpDataPath = flag.String("remoteWrite.tmpDataPath", "vmagent-remotewrite-data", "Path to directory where temporary data for remote write component is stored. "+
"See also -remoteWrite.maxDiskUsagePerURL")
shardByURLLabels = flagutil.NewArrayString("remoteWrite.shardByURL.labels", "Optional list of labels, which must be used for sharding outgoing samples "+
"among remote storage systems if -remoteWrite.shardByURL command-line flag is set. By default all the labels are used for sharding in order to gain "+
"even distribution of series over the specified -remoteWrite.url systems")
tmpDataPath = flag.String("remoteWrite.tmpDataPath", "vmagent-remotewrite-data", "Path to directory for storing pending data, which isn't sent to the configured -remoteWrite.url . "+
"See also -remoteWrite.maxDiskUsagePerURL and -remoteWrite.disableOnDiskQueue")
keepDanglingQueues = flag.Bool("remoteWrite.keepDanglingQueues", false, "Keep persistent queues contents at -remoteWrite.tmpDataPath in case there are no matching -remoteWrite.url. "+
"Useful when -remoteWrite.url is changed temporarily and persistent queue files will be needed later on.")
queues = flag.Int("remoteWrite.queues", cgroup.AvailableCPUs()*2, "The number of concurrent queues to each -remoteWrite.url. Set more queues if default number of queues "+
@@ -80,6 +91,11 @@ var (
"are written to the corresponding -remoteWrite.url . See also -remoteWrite.streamAggr.keepInput and https://docs.victoriametrics.com/stream-aggregation.html")
streamAggrDedupInterval = flagutil.NewArrayDuration("remoteWrite.streamAggr.dedupInterval", 0, "Input samples are de-duplicated with this interval before being aggregated. "+
"Only the last sample per each time series per each interval is aggregated if the interval is greater than zero")
disableOnDiskQueue = flag.Bool("remoteWrite.disableOnDiskQueue", false, "Whether to disable storing pending data to -remoteWrite.tmpDataPath "+
"when the configured remote storage systems cannot keep up with the data ingestion rate. See https://docs.victoriametrics.com/vmagent.html#disabling-on-disk-persistence ."+
"See also -remoteWrite.dropSamplesOnOverload")
dropSamplesOnOverload = flag.Bool("remoteWrite.dropSamplesOnOverload", false, "Whether to drop samples when -remoteWrite.disableOnDiskQueue is set and if the samples "+
"cannot be pushed into the configured remote storage systems in a timely manner. See https://docs.victoriametrics.com/vmagent.html#disabling-on-disk-persistence")
)
var (
@@ -92,11 +108,19 @@ var (
// Data without tenant id is written to defaultAuthToken if -remoteWrite.multitenantURL is specified.
defaultAuthToken = &auth.Token{}
// ErrQueueFullHTTPRetry must be returned when TryPush() returns false.
ErrQueueFullHTTPRetry = &httpserver.ErrorWithStatusCode{
Err: fmt.Errorf("remote storage systems cannot keep up with the data ingestion rate; retry the request later " +
"or remove -remoteWrite.disableOnDiskQueue from vmagent command-line flags, so it could save pending data to -remoteWrite.tmpDataPath; " +
"see https://docs.victoriametrics.com/vmagent.html#disabling-on-disk-persistence"),
StatusCode: http.StatusTooManyRequests,
}
)
// MultitenancyEnabled returns true if -remoteWrite.multitenantURL is specified.
// MultitenancyEnabled returns true if -enableMultitenantHandlers or -remoteWrite.multitenantURL is specified.
func MultitenancyEnabled() bool {
return len(*remoteWriteMultitenantURLs) > 0
return *enableMultitenantHandlers || len(*remoteWriteMultitenantURLs) > 0
}
// Contains the current relabelConfigs.
@@ -116,6 +140,8 @@ func InitSecretFlags() {
}
}
var shardByURLLabelsMap map[string]struct{}
// Init initializes remotewrite.
//
// It must be called after flag.Parse().
@@ -152,6 +178,13 @@ func Init() {
if *queues <= 0 {
*queues = 1
}
if len(*shardByURLLabels) > 0 {
m := make(map[string]struct{}, len(*shardByURLLabels))
for _, label := range *shardByURLLabels {
m[label] = struct{}{}
}
shardByURLLabelsMap = m
}
initLabelsGlobal()
// Register SIGHUP handler for config reload before loadRelabelConfigs.
@@ -170,6 +203,7 @@ func Init() {
if len(*remoteWriteURLs) > 0 {
rwctxsDefault = newRemoteWriteCtxs(nil, *remoteWriteURLs)
}
dropDanglingQueues()
// Start config reloader.
configReloaderWG.Add(1)
@@ -187,6 +221,42 @@ func Init() {
}()
}
func dropDanglingQueues() {
if *keepDanglingQueues {
return
}
if len(*remoteWriteMultitenantURLs) > 0 {
// Do not drop dangling queues for *remoteWriteMultitenantURLs, since it is impossible to determine
// unused queues for multitenant urls - they are created on demand when new sample for the given
// tenant is pushed to remote storage.
return
}
// Remove dangling persistent queues, if any.
// This is required for the case when the number of queues has been changed or URL have been changed.
// See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4014
//
existingQueues := make(map[string]struct{}, len(rwctxsDefault))
for _, rwctx := range rwctxsDefault {
existingQueues[rwctx.fq.Dirname()] = struct{}{}
}
queuesDir := filepath.Join(*tmpDataPath, persistentQueueDirname)
files := fs.MustReadDir(queuesDir)
removed := 0
for _, f := range files {
dirname := f.Name()
if _, ok := existingQueues[dirname]; !ok {
logger.Infof("removing dangling queue %q", dirname)
fullPath := filepath.Join(queuesDir, dirname)
fs.MustRemoveAll(fullPath)
removed++
}
}
if removed > 0 {
logger.Infof("removed %d dangling queues from %q, active queues: %d", removed, *tmpDataPath, len(rwctxsDefault))
}
}
func reloadRelabelConfigs() {
relabelConfigReloads.Inc()
logger.Infof("reloading relabel configs pointed by -remoteWrite.relabelConfig and -remoteWrite.urlRelabelConfig")
@@ -260,33 +330,6 @@ func newRemoteWriteCtxs(at *auth.Token, urls []string) []*remoteWriteCtx {
}
rwctxs[i] = newRemoteWriteCtx(i, remoteWriteURL, maxInmemoryBlocks, sanitizedURL)
}
if !*keepDanglingQueues {
// Remove dangling queues, if any.
// This is required for the case when the number of queues has been changed or URL have been changed.
// See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4014
existingQueues := make(map[string]struct{}, len(rwctxs))
for _, rwctx := range rwctxs {
existingQueues[rwctx.fq.Dirname()] = struct{}{}
}
queuesDir := filepath.Join(*tmpDataPath, persistentQueueDirname)
files := fs.MustReadDir(queuesDir)
removed := 0
for _, f := range files {
dirname := f.Name()
if _, ok := existingQueues[dirname]; !ok {
logger.Infof("removing dangling queue %q", dirname)
fullPath := filepath.Join(queuesDir, dirname)
fs.MustRemoveAll(fullPath)
removed++
}
}
if removed > 0 {
logger.Infof("removed %d dangling queues from %q, active queues: %d", removed, *tmpDataPath, len(rwctxs))
}
}
return rwctxs
}
@@ -295,7 +338,7 @@ var configReloaderWG sync.WaitGroup
// Stop stops remotewrite.
//
// It is expected that nobody calls Push during and after the call to this func.
// It is expected that nobody calls TryPush during and after the call to this func.
func Stop() {
close(configReloaderStopCh)
configReloaderWG.Wait()
@@ -305,7 +348,7 @@ func Stop() {
}
rwctxsDefault = nil
// There is no need in locking rwctxsMapLock here, since nobody should call Push during the Stop call.
// There is no need in locking rwctxsMapLock here, since nobody should call TryPush during the Stop call.
for _, rwctxs := range rwctxsMap {
for _, rwctx := range rwctxs {
rwctx.MustStop()
@@ -321,24 +364,47 @@ func Stop() {
}
}
// Push sends wr to remote storage systems set via `-remoteWrite.url`.
// PushDropSamplesOnFailure pushes wr to the configured remote storage systems set via -remoteWrite.url and -remoteWrite.multitenantURL
//
// If at is nil, then the data is pushed to the configured `-remoteWrite.url`.
// If at isn't nil, the data is pushed to the configured `-remoteWrite.multitenantURL`.
// If at is nil, then the data is pushed to the configured -remoteWrite.url.
// If at isn't nil, the data is pushed to the configured -remoteWrite.multitenantURL.
//
// Note that wr may be modified by Push because of relabeling and rounding.
func Push(at *auth.Token, wr *prompbmarshal.WriteRequest) {
if at == nil && len(*remoteWriteMultitenantURLs) > 0 {
// Write data to default tenant if at isn't set while -remoteWrite.multitenantURL is set.
// PushDropSamplesOnFailure can modify wr contents.
func PushDropSamplesOnFailure(at *auth.Token, wr *prompbmarshal.WriteRequest) {
_ = tryPush(at, wr, true)
}
// TryPush tries sending wr to the configured remote storage systems set via -remoteWrite.url and -remoteWrite.multitenantURL
//
// If at is nil, then the data is pushed to the configured -remoteWrite.url.
// If at isn't nil, the data is pushed to the configured -remoteWrite.multitenantURL.
//
// TryPush can modify wr contents, so the caller must re-initialize wr before calling TryPush() after unsuccessful attempt.
// TryPush may send partial data from wr on unsuccessful attempt, so repeated call for the same wr may send the data multiple times.
//
// The caller must return ErrQueueFullHTTPRetry to the client, which sends wr, if TryPush returns false.
func TryPush(at *auth.Token, wr *prompbmarshal.WriteRequest) bool {
return tryPush(at, wr, *dropSamplesOnOverload)
}
func tryPush(at *auth.Token, wr *prompbmarshal.WriteRequest, dropSamplesOnFailure bool) bool {
tss := wr.Timeseries
if at == nil && MultitenancyEnabled() {
// Write data to default tenant if at isn't set when multitenancy is enabled.
at = defaultAuthToken
}
var tenantRctx *relabelCtx
var rwctxs []*remoteWriteCtx
if at == nil {
rwctxs = rwctxsDefault
} else if len(*remoteWriteMultitenantURLs) == 0 {
// Convert at to (vm_account_id, vm_project_id) labels.
tenantRctx = getRelabelCtx()
defer putRelabelCtx(tenantRctx)
rwctxs = rwctxsDefault
} else {
if len(*remoteWriteMultitenantURLs) == 0 {
logger.Panicf("BUG: -remoteWrite.multitenantURL command-line flag must be set when __tenant_id__=%q label is set", at)
}
rwctxsMapLock.Lock()
tenantID := tenantmetrics.TenantID{
AccountID: at.AccountID,
@@ -352,18 +418,37 @@ func Push(at *auth.Token, wr *prompbmarshal.WriteRequest) {
rwctxsMapLock.Unlock()
}
rowsCount := getRowsCount(tss)
if *disableOnDiskQueue {
// Quick check whether writes to configured remote storage systems are blocked.
// This allows saving CPU time spent on relabeling and block compression
// if some of remote storage systems cannot keep up with the data ingestion rate.
for _, rwctx := range rwctxs {
if rwctx.fq.IsWriteBlocked() {
pushFailures.Inc()
if dropSamplesOnFailure {
// Just drop samples
samplesDropped.Add(rowsCount)
return true
}
return false
}
}
}
var rctx *relabelCtx
rcs := allRelabelConfigs.Load()
pcsGlobal := rcs.global
if pcsGlobal.Len() > 0 {
rctx = getRelabelCtx()
defer putRelabelCtx(rctx)
}
tss := wr.Timeseries
rowsCount := getRowsCount(tss)
globalRowsPushedBeforeRelabel.Add(rowsCount)
maxSamplesPerBlock := *maxRowsPerBlock
// Allow up to 10x of labels per each block on average.
maxLabelsPerBlock := 10 * maxSamplesPerBlock
for len(tss) > 0 {
// Process big tss in smaller blocks in order to reduce the maximum memory usage
samplesCount := 0
@@ -371,7 +456,7 @@ func Push(at *auth.Token, wr *prompbmarshal.WriteRequest) {
i := 0
for i < len(tss) {
samplesCount += len(tss[i].Samples)
labelsCount += len(tss[i].Labels)
labelsCount += len(tss[i].Samples) * len(tss[i].Labels)
i++
if samplesCount >= maxSamplesPerBlock || labelsCount >= maxLabelsPerBlock {
break
@@ -384,6 +469,9 @@ func Push(at *auth.Token, wr *prompbmarshal.WriteRequest) {
} else {
tss = nil
}
if tenantRctx != nil {
tenantRctx.tenantToLabels(tssBlock, at.AccountID, at.ProjectID)
}
if rctx != nil {
rowsCountBeforeRelabel := getRowsCount(tssBlock)
tssBlock = rctx.applyRelabeling(tssBlock, pcsGlobal)
@@ -392,25 +480,35 @@ func Push(at *auth.Token, wr *prompbmarshal.WriteRequest) {
}
sortLabelsIfNeeded(tssBlock)
tssBlock = limitSeriesCardinality(tssBlock)
pushBlockToRemoteStorages(rwctxs, tssBlock)
if rctx != nil {
rctx.reset()
if !tryPushBlockToRemoteStorages(rwctxs, tssBlock) {
if !*disableOnDiskQueue {
logger.Panicf("BUG: tryPushBlockToRemoteStorages must return true if -remoteWrite.disableOnDiskQueue isn't set")
}
pushFailures.Inc()
if dropSamplesOnFailure {
samplesDropped.Add(rowsCount)
return true
}
return false
}
}
if rctx != nil {
putRelabelCtx(rctx)
}
return true
}
func pushBlockToRemoteStorages(rwctxs []*remoteWriteCtx, tssBlock []prompbmarshal.TimeSeries) {
var (
samplesDropped = metrics.NewCounter(`vmagent_remotewrite_samples_dropped_total`)
pushFailures = metrics.NewCounter(`vmagent_remotewrite_push_failures_total`)
)
func tryPushBlockToRemoteStorages(rwctxs []*remoteWriteCtx, tssBlock []prompbmarshal.TimeSeries) bool {
if len(tssBlock) == 0 {
// Nothing to push
return
return true
}
if len(rwctxs) == 1 {
// Fast path - just push data to the configured single remote storage
rwctxs[0].Push(tssBlock)
return
return rwctxs[0].TryPush(tssBlock)
}
// We need to push tssBlock to multiple remote storages.
@@ -418,15 +516,28 @@ func pushBlockToRemoteStorages(rwctxs []*remoteWriteCtx, tssBlock []prompbmarsha
if *shardByURL {
// Shard the data among rwctxs
tssByURL := make([][]prompbmarshal.TimeSeries, len(rwctxs))
tmpLabels := promutils.GetLabels()
for _, ts := range tssBlock {
h := getLabelsHash(ts.Labels)
hashLabels := ts.Labels
if len(shardByURLLabelsMap) > 0 {
hashLabels = tmpLabels.Labels[:0]
for _, label := range ts.Labels {
if _, ok := shardByURLLabelsMap[label.Name]; ok {
hashLabels = append(hashLabels, label)
}
}
}
h := getLabelsHash(hashLabels)
idx := h % uint64(len(tssByURL))
tssByURL[idx] = append(tssByURL[idx], ts)
}
promutils.PutLabels(tmpLabels)
// Push sharded data to remote storages in parallel in order to reduce
// the time needed for sending the data to multiple remote storage systems.
var wg sync.WaitGroup
wg.Add(len(rwctxs))
var anyPushFailed uint64
for i, rwctx := range rwctxs {
tssShard := tssByURL[i]
if len(tssShard) == 0 {
@@ -434,11 +545,13 @@ func pushBlockToRemoteStorages(rwctxs []*remoteWriteCtx, tssBlock []prompbmarsha
}
go func(rwctx *remoteWriteCtx, tss []prompbmarshal.TimeSeries) {
defer wg.Done()
rwctx.Push(tss)
if !rwctx.TryPush(tss) {
atomic.StoreUint64(&anyPushFailed, 1)
}
}(rwctx, tssShard)
}
wg.Wait()
return
return atomic.LoadUint64(&anyPushFailed) == 0
}
// Replicate data among rwctxs.
@@ -446,13 +559,17 @@ func pushBlockToRemoteStorages(rwctxs []*remoteWriteCtx, tssBlock []prompbmarsha
// the time needed for sending the data to multiple remote storage systems.
var wg sync.WaitGroup
wg.Add(len(rwctxs))
var anyPushFailed uint64
for _, rwctx := range rwctxs {
go func(rwctx *remoteWriteCtx) {
defer wg.Done()
rwctx.Push(tssBlock)
if !rwctx.TryPush(tssBlock) {
atomic.StoreUint64(&anyPushFailed, 1)
}
}(rwctx)
}
wg.Wait()
return atomic.LoadUint64(&anyPushFailed) == 0
}
// sortLabelsIfNeeded sorts labels if -sortLabels command-line flag is set.
@@ -572,13 +689,19 @@ func newRemoteWriteCtx(argIdx int, remoteWriteURL *url.URL, maxInmemoryBlocks in
logger.Warnf("rounding the -remoteWrite.maxDiskUsagePerURL=%d to the minimum supported value: %d", maxPendingBytes, persistentqueue.DefaultChunkFileSize)
maxPendingBytes = persistentqueue.DefaultChunkFileSize
}
fq := persistentqueue.MustOpenFastQueue(queuePath, sanitizedURL, maxInmemoryBlocks, maxPendingBytes)
fq := persistentqueue.MustOpenFastQueue(queuePath, sanitizedURL, maxInmemoryBlocks, maxPendingBytes, *disableOnDiskQueue)
_ = metrics.GetOrCreateGauge(fmt.Sprintf(`vmagent_remotewrite_pending_data_bytes{path=%q, url=%q}`, queuePath, sanitizedURL), func() float64 {
return float64(fq.GetPendingBytes())
})
_ = metrics.GetOrCreateGauge(fmt.Sprintf(`vmagent_remotewrite_pending_inmemory_blocks{path=%q, url=%q}`, queuePath, sanitizedURL), func() float64 {
return float64(fq.GetInmemoryQueueLen())
})
_ = metrics.GetOrCreateGauge(fmt.Sprintf(`vmagent_remotewrite_queue_blocked{path=%q, url=%q}`, queuePath, sanitizedURL), func() float64 {
if fq.IsWriteBlocked() {
return 1
}
return 0
})
var c *client
switch remoteWriteURL.Scheme {
@@ -600,7 +723,7 @@ func newRemoteWriteCtx(argIdx int, remoteWriteURL *url.URL, maxInmemoryBlocks in
}
pss := make([]*pendingSeries, pssLen)
for i := range pss {
pss[i] = newPendingSeries(fq.MustWriteBlock, c.useVMProto, sf, rd)
pss[i] = newPendingSeries(fq, c.useVMProto, sf, rd)
}
rwctx := &remoteWriteCtx{
@@ -617,7 +740,7 @@ func newRemoteWriteCtx(argIdx int, remoteWriteURL *url.URL, maxInmemoryBlocks in
sasFile := streamAggrConfig.GetOptionalArg(argIdx)
if sasFile != "" {
dedupInterval := streamAggrDedupInterval.GetOptionalArg(argIdx)
sas, err := streamaggr.LoadFromFile(sasFile, rwctx.pushInternal, dedupInterval)
sas, err := streamaggr.LoadFromFile(sasFile, rwctx.pushInternalTrackDropped, dedupInterval)
if err != nil {
logger.Fatalf("cannot initialize stream aggregators from -remoteWrite.streamAggr.config=%q: %s", sasFile, err)
}
@@ -653,7 +776,7 @@ func (rwctx *remoteWriteCtx) MustStop() {
rwctx.rowsDroppedByRelabel = nil
}
func (rwctx *remoteWriteCtx) Push(tss []prompbmarshal.TimeSeries) {
func (rwctx *remoteWriteCtx) TryPush(tss []prompbmarshal.TimeSeries) bool {
// Apply relabeling
var rctx *relabelCtx
var v *[]prompbmarshal.TimeSeries
@@ -691,7 +814,9 @@ func (rwctx *remoteWriteCtx) Push(tss []prompbmarshal.TimeSeries) {
}
matchIdxsPool.Put(matchIdxs)
}
rwctx.pushInternal(tss)
// Try pushing the data to remote storage
ok := rwctx.tryPushInternal(tss)
// Return back relabeling contexts to the pool
if rctx != nil {
@@ -699,6 +824,8 @@ func (rwctx *remoteWriteCtx) Push(tss []prompbmarshal.TimeSeries) {
tssPool.Put(v)
putRelabelCtx(rctx)
}
return ok
}
var matchIdxsPool bytesutil.ByteBufferPool
@@ -718,7 +845,21 @@ func dropAggregatedSeries(src []prompbmarshal.TimeSeries, matchIdxs []byte, drop
return dst
}
func (rwctx *remoteWriteCtx) pushInternal(tss []prompbmarshal.TimeSeries) {
func (rwctx *remoteWriteCtx) pushInternalTrackDropped(tss []prompbmarshal.TimeSeries) {
if rwctx.tryPushInternal(tss) {
return
}
if !*disableOnDiskQueue {
logger.Panicf("BUG: tryPushInternal must return true if -remoteWrite.disableOnDiskQueue isn't set")
}
pushFailures.Inc()
if *dropSamplesOnOverload {
rowsCount := getRowsCount(tss)
samplesDropped.Add(rowsCount)
}
}
func (rwctx *remoteWriteCtx) tryPushInternal(tss []prompbmarshal.TimeSeries) bool {
var rctx *relabelCtx
var v *[]prompbmarshal.TimeSeries
if len(labelsGlobal) > 0 {
@@ -732,13 +873,16 @@ func (rwctx *remoteWriteCtx) pushInternal(tss []prompbmarshal.TimeSeries) {
pss := rwctx.pss
idx := atomic.AddUint64(&rwctx.pssNextIdx, 1) % uint64(len(pss))
pss[idx].Push(tss)
ok := pss[idx].TryPush(tss)
if rctx != nil {
*v = prompbmarshal.ResetTimeSeries(tss)
tssPool.Put(v)
putRelabelCtx(rctx)
}
return ok
}
func (rwctx *remoteWriteCtx) reinitStreamAggr() {
@@ -751,7 +895,7 @@ func (rwctx *remoteWriteCtx) reinitStreamAggr() {
logger.Infof("reloading stream aggregation configs pointed by -remoteWrite.streamAggr.config=%q", sasFile)
metrics.GetOrCreateCounter(fmt.Sprintf(`vmagent_streamaggr_config_reloads_total{path=%q}`, sasFile)).Inc()
dedupInterval := streamAggrDedupInterval.GetOptionalArg(rwctx.idx)
sasNew, err := streamaggr.LoadFromFile(sasFile, rwctx.pushInternal, dedupInterval)
sasNew, err := streamaggr.LoadFromFile(sasFile, rwctx.pushInternalTrackDropped, dedupInterval)
if err != nil {
metrics.GetOrCreateCounter(fmt.Sprintf(`vmagent_streamaggr_config_reloads_errors_total{path=%q}`, sasFile)).Inc()
metrics.GetOrCreateCounter(fmt.Sprintf(`vmagent_streamaggr_config_reload_successful{path=%q}`, sasFile)).Set(0)

View File

@@ -76,7 +76,9 @@ func insertRows(at *auth.Token, rows []parser.Row, extraLabels []prompbmarshal.L
ctx.WriteRequest.Timeseries = tssDst
ctx.Labels = labels
ctx.Samples = samples
remotewrite.Push(at, &ctx.WriteRequest)
if !remotewrite.TryPush(at, &ctx.WriteRequest) {
return remotewrite.ErrQueueFullHTTPRetry
}
rowsInserted.Add(rowsTotal)
if at != nil {
rowsTenantInserted.Get(at).Add(rowsTotal)

View File

@@ -1,246 +1,3 @@
See vmalert-tool docs [here](https://docs.victoriametrics.com/vmalert-tool.html).
# vmalert-tool
VMAlert command-line tool
## Unit testing for rules
You can use `vmalert-tool` to run unit tests for alerting and recording rules.
It will perform the following actions:
* sets up an isolated VictoriaMetrics instance;
* simulates the periodic ingestion of time series;
* queries the ingested data for recording and alerting rules evaluation like [vmalert](https://docs.victoriametrics.com/vmalert.html);
* checks whether the firing alerts or resulting recording rules match the expected results.
See how to run vmalert-tool for unit test below:
```
# Run vmalert-tool with one or multiple test files via --files cmd-line flag
./vmalert-tool unittest --files test1.yaml --files test2.yaml
```
vmalert-tool unittest is compatible with [Prometheus config format for tests](https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/#test-file-format)
except `promql_expr_test` field. Use `metricsql_expr_test` field name instead. The name is different because vmalert-tool
validates and executes [MetricsQL](https://docs.victoriametrics.com/MetricsQL.html) expressions,
which aren't always backward compatible with [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/).
### Test file format
The configuration format for files specified in `--files` cmd-line flag is the following:
```yaml
# Path to the files or http url containing [rule groups](https://docs.victoriametrics.com/vmalert.html#groups) configuration.
# Enterprise version of vmalert-tool supports S3 and GCS paths to rules.
rule_files:
[ - <string> ]
# The evaluation interval for rules specified in `rule_files`
[ evaluation_interval: <duration> | default = 1m ]
# Groups listed below will be evaluated by order.
# Not All the groups need not be mentioned, if not, they will be evaluated by define order in rule_files.
group_eval_order:
[ - <string> ]
# The list of unit test files to be checked during evaluation.
tests:
[ - <test_group> ]
```
#### `<test_group>`
```yaml
# Interval between samples for input series
interval: <duration>
# Time series to persist into the database according to configured <interval> before running tests.
input_series:
[ - <series> ]
# Name of the test group, optional
[ name: <string> ]
# Unit tests for alerting rules
alert_rule_test:
[ - <alert_test_case> ]
# Unit tests for Metricsql expressions.
metricsql_expr_test:
[ - <metricsql_expr_test> ]
# External labels accessible for templating.
external_labels:
[ <labelname>: <string> ... ]
```
#### `<series>`
```yaml
# series in the following format '<metric name>{<label name>=<label value>, ...}'
# Examples:
# series_name{label1="value1", label2="value2"}
# go_goroutines{job="prometheus", instance="localhost:9090"}
series: <string>
# values support several special equations:
# 'a+bxc' becomes 'a a+b a+(2*b) a+(3*b) … a+(c*b)'
# Read this as series starts at a, then c further samples incrementing by b.
# 'a-bxc' becomes 'a a-b a-(2*b) a-(3*b) … a-(c*b)'
# Read this as series starts at a, then c further samples decrementing by b (or incrementing by negative b).
# '_' represents a missing sample from scrape
# 'stale' indicates a stale sample
# Examples:
# 1. '-2+4x3' becomes '-2 2 6 10' - series starts at -2, then 3 further samples incrementing by 4.
# 2. ' 1-2x4' becomes '1 -1 -3 -5 -7' - series starts at 1, then 4 further samples decrementing by 2.
# 3. ' 1x4' becomes '1 1 1 1 1' - shorthand for '1+0x4', series starts at 1, then 4 further samples incrementing by 0.
# 4. ' 1 _x3 stale' becomes '1 _ _ _ stale' - the missing sample cannot increment, so 3 missing samples are produced by the '_x3' expression.
values: <string>
```
#### `<alert_test_case>`
vmalert by default adds `alertgroup` and `alertname` to the generated alerts and time series.
So you will need to specify both `groupname` and `alertname` under a single `<alert_test_case>`,
but no need to add them under `exp_alerts`.
You can also pass `--disableAlertgroupLabel` to skip `alertgroup` check.
```yaml
# The time elapsed from time=0s when this alerting rule should be checked.
# Means this rule should be firing at this point, or shouldn't be firing if 'exp_alerts' is empty.
eval_time: <duration>
# Name of the group name to be tested.
groupname: <string>
# Name of the alert to be tested.
alertname: <string>
# List of the expected alerts that are firing under the given alertname at
# the given evaluation time. If you want to test if an alerting rule should
# not be firing, then you can mention only the fields above and leave 'exp_alerts' empty.
exp_alerts:
[ - <alert> ]
```
#### `<alert>`
```yaml
# These are the expanded labels and annotations of the expected alert.
# Note: labels also include the labels of the sample associated with the alert
exp_labels:
[ <labelname>: <string> ]
exp_annotations:
[ <labelname>: <string> ]
```
#### `<metricsql_expr_test>`
```yaml
# Expression to evaluate
expr: <string>
# The time elapsed from time=0s when this expression be evaluated.
eval_time: <duration>
# Expected samples at the given evaluation time.
exp_samples:
[ - <sample> ]
```
#### `<sample>`
```yaml
# Labels of the sample in usual series notation '<metric name>{<label name>=<label value>, ...}'
# Examples:
# series_name{label1="value1", label2="value2"}
# go_goroutines{job="prometheus", instance="localhost:9090"}
labels: <string>
# The expected value of the Metricsql expression.
value: <number>
```
### Example
This is an example input file for unit testing which will pass.
`test.yaml` is the test file which follows the syntax above and `alerts.yaml` contains the alerting rules.
With `rules.yaml` in the same directory, run `./vmalert-tool unittest --files=./unittest/testdata/test.yaml`.
#### `test.yaml`
```yaml
rule_files:
- rules.yaml
evaluation_interval: 1m
tests:
- interval: 1m
input_series:
- series: 'up{job="prometheus", instance="localhost:9090"}'
values: "0+0x1440"
metricsql_expr_test:
- expr: suquery_interval_test
eval_time: 4m
exp_samples:
- labels: '{__name__="suquery_interval_test", datacenter="dc-123", instance="localhost:9090", job="prometheus"}'
value: 1
alert_rule_test:
- eval_time: 2h
groupname: group1
alertname: InstanceDown
exp_alerts:
- exp_labels:
job: prometheus
severity: page
instance: localhost:9090
datacenter: dc-123
exp_annotations:
summary: "Instance localhost:9090 down"
description: "localhost:9090 of job prometheus has been down for more than 5 minutes."
- eval_time: 0
groupname: group1
alertname: AlwaysFiring
exp_alerts:
- exp_labels:
datacenter: dc-123
- eval_time: 0
groupname: group1
alertname: InstanceDown
exp_alerts: []
external_labels:
datacenter: dc-123
```
#### `alerts.yaml`
```yaml
# This is the rules file.
groups:
- name: group1
rules:
- alert: InstanceDown
expr: up == 0
for: 5m
labels:
severity: page
annotations:
summary: "Instance {{ $labels.instance }} down"
description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
- alert: AlwaysFiring
expr: 1
- name: group2
rules:
- record: job:test:count_over_time1m
expr: sum without(instance) (count_over_time(test[1m]))
- record: suquery_interval_test
expr: count_over_time(up[5m:])
```
vmalert-tool docs can be edited at [docs/vmalert-tool.md](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/docs/vmalert-tool.md).

View File

@@ -0,0 +1,12 @@
# See https://medium.com/on-docker/use-multi-stage-builds-to-inject-ca-certs-ad1e8f01de1b
ARG certs_image
ARG root_image
FROM $certs_image as certs
RUN apk update && apk upgrade && apk --update --no-cache add ca-certificates
FROM $root_image
COPY --from=certs /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt
EXPOSE 8429
ENTRYPOINT ["/vmalert-tool-prod"]
ARG TARGETARCH
COPY vmalert-tool-linux-${TARGETARCH}-prod ./vmalert-tool-prod

View File

@@ -21,6 +21,11 @@ tests:
groupname: group2
alertname: SameAlertNameWithDifferentGroup
exp_alerts: []
- eval_time: 150s
groupname: group1
alertname: SameAlertNameWithDifferentGroup
exp_alerts:
- {}
- eval_time: 6m
groupname: group1
alertname: SameAlertNameWithDifferentGroup

View File

@@ -336,7 +336,7 @@ func (tg *testGroup) test(evalInterval time.Duration, groupOrderMap map[string]i
if disableAlertgroupLabel {
g.Name = ""
}
if _, ok := alertExpResultMap[time.Duration(ts.UnixNano())][g.Name]; !ok {
if _, ok := alertExpResultMap[alertEvalTimes[evalIndex]][g.Name]; !ok {
continue
}
if _, ok := gotAlertsMap[g.Name]; !ok {
@@ -347,7 +347,7 @@ func (tg *testGroup) test(evalInterval time.Duration, groupOrderMap map[string]i
if !isAlertRule {
continue
}
if _, ok := alertExpResultMap[time.Duration(ts.UnixNano())][g.Name][ar.Name]; ok {
if _, ok := alertExpResultMap[alertEvalTimes[evalIndex]][g.Name][ar.Name]; ok {
for _, got := range ar.GetAlerts() {
if got.State != notifier.StateFiring {
continue

File diff suppressed because it is too large Load Diff

View File

@@ -19,11 +19,14 @@ import (
// Group contains list of Rules grouped into
// entity with one name and evaluation interval
type Group struct {
Type Type `yaml:"type,omitempty"`
File string
Name string `yaml:"name"`
Interval *promutils.Duration `yaml:"interval,omitempty"`
EvalOffset *promutils.Duration `yaml:"eval_offset,omitempty"`
Type Type `yaml:"type,omitempty"`
File string
Name string `yaml:"name"`
Interval *promutils.Duration `yaml:"interval,omitempty"`
EvalOffset *promutils.Duration `yaml:"eval_offset,omitempty"`
// EvalDelay will adjust the `time` parameter of rule evaluation requests to compensate intentional query delay from datasource.
// see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155
EvalDelay *promutils.Duration `yaml:"eval_delay,omitempty"`
Limit int `yaml:"limit,omitempty"`
Rules []Rule `yaml:"rules"`
Concurrency int `yaml:"concurrency"`
@@ -233,7 +236,7 @@ func ParseSilent(pathPatterns []string, validateTplFn ValidateTplFn, validateExp
files, err := readFromFS(pathPatterns)
if err != nil {
return nil, fmt.Errorf("failed to read from the config: %s", err)
return nil, fmt.Errorf("failed to read from the config: %w", err)
}
return parse(files, validateTplFn, validateExpressions)
}
@@ -242,11 +245,11 @@ func ParseSilent(pathPatterns []string, validateTplFn ValidateTplFn, validateExp
func Parse(pathPatterns []string, validateTplFn ValidateTplFn, validateExpressions bool) ([]Group, error) {
files, err := readFromFS(pathPatterns)
if err != nil {
return nil, fmt.Errorf("failed to read from the config: %s", err)
return nil, fmt.Errorf("failed to read from the config: %w", err)
}
groups, err := parse(files, validateTplFn, validateExpressions)
if err != nil {
return nil, fmt.Errorf("failed to parse %s: %s", pathPatterns, err)
return nil, fmt.Errorf("failed to parse %s: %w", pathPatterns, err)
}
if len(groups) < 1 {
cLogger.Warnf("no groups found in %s", strings.Join(pathPatterns, ";"))

View File

@@ -106,7 +106,7 @@ func TestParseBad(t *testing.T) {
},
{
[]string{"http://unreachable-url"},
"failed to read",
"failed to",
},
}
for _, tc := range testCases {

View File

@@ -49,7 +49,7 @@ func (fs *FS) Read(files []string) (map[string][]byte, error) {
path, resp.StatusCode, http.StatusOK, data)
}
if err != nil {
return nil, fmt.Errorf("cannot read %q: %s", path, err)
return nil, fmt.Errorf("cannot read %q: %w", path, err)
}
result[path] = data
}

View File

@@ -15,6 +15,7 @@ groups:
interval: 2s
concurrency: 2
type: prometheus
eval_delay: 30s
rules:
- alert: Conns
expr: sum(vm_tcplistener_conns) by (instance) > 1

View File

@@ -43,7 +43,9 @@ var (
oauth2TokenURL = flag.String("datasource.oauth2.tokenUrl", "", "Optional OAuth2 tokenURL to use for -datasource.url.")
oauth2Scopes = flag.String("datasource.oauth2.scopes", "", "Optional OAuth2 scopes to use for -datasource.url. Scopes must be delimited by ';'")
lookBack = flag.Duration("datasource.lookback", 0, `Lookback defines how far into the past to look when evaluating queries. For example, if the datasource.lookback=5m then param "time" with value now()-5m will be added to every query.`)
lookBack = flag.Duration("datasource.lookback", 0, `Will be deprecated soon, please adjust "-search.latencyOffset" at datasource side `+
`or specify "latency_offset" in rule group's params. Lookback defines how far into the past to look when evaluating queries. `+
`For example, if the datasource.lookback=5m then param "time" with value now()-5m will be added to every query.`)
queryStep = flag.Duration("datasource.queryStep", 5*time.Minute, "How far a value can fallback to when evaluating queries. "+
"For example, if -datasource.queryStep=15s then param \"step\" with value \"15s\" will be added to every query. "+
"If set to 0, rule's evaluation interval will be used instead.")
@@ -83,7 +85,10 @@ func Init(extraParams url.Values) (QuerierBuilder, error) {
return nil, fmt.Errorf("datasource.url is empty")
}
if !*queryTimeAlignment {
logger.Warnf("flag `datasource.queryTimeAlignment` is deprecated and will be removed in next releases, please use `eval_alignment` in rule group instead")
logger.Warnf("flag `-datasource.queryTimeAlignment` is deprecated and will be removed in next releases. Please use `eval_alignment` in rule group instead.")
}
if *lookBack != 0 {
logger.Warnf("flag `-datasource.lookback` will be deprecated soon. Please use `-rule.evalDelay` command-line flag instead. See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155 for details.")
}
tr, err := utils.Transport(*addr, *tlsCertFile, *tlsKeyFile, *tlsCAFile, *tlsServerName, *tlsInsecureSkipVerify)
@@ -113,7 +118,7 @@ func Init(extraParams url.Values) (QuerierBuilder, error) {
}
_, err = authCfg.GetAuthHeader()
if err != nil {
return nil, fmt.Errorf("failed to set request auth header to datasource %q: %s", *addr, err)
return nil, fmt.Errorf("failed to set request auth header to datasource %q: %w", *addr, err)
}
return &VMStorage{

View File

@@ -142,24 +142,30 @@ func (s *VMStorage) Query(ctx context.Context, query string, ts time.Time) (Resu
return Result{}, nil, err
}
resp, err := s.do(ctx, req)
if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
// something in the middle between client and datasource might be closing
// the connection. So we do a one more attempt in hope request will succeed.
req, _ = s.newQueryRequest(query, ts)
resp, err = s.do(ctx, req)
}
if err != nil {
return Result{}, req, err
if !errors.Is(err, io.EOF) && !errors.Is(err, io.ErrUnexpectedEOF) {
// Return unexpected error to the caller.
return Result{}, nil, err
}
// Something in the middle between client and datasource might be closing
// the connection. So we do a one more attempt in hope request will succeed.
req, err = s.newQueryRequest(query, ts)
if err != nil {
return Result{}, nil, fmt.Errorf("second attempt: %w", err)
}
resp, err = s.do(ctx, req)
if err != nil {
return Result{}, nil, fmt.Errorf("second attempt: %w", err)
}
}
defer func() {
_ = resp.Body.Close()
}()
// Process the received response.
parseFn := parsePrometheusResponse
if s.dataSourceType != datasourcePrometheus {
parseFn = parseGraphiteResponse
}
result, err := parseFn(req, resp)
_ = resp.Body.Close()
return result, req, err
}
@@ -177,23 +183,31 @@ func (s *VMStorage) QueryRange(ctx context.Context, query string, start, end tim
return res, fmt.Errorf("end param is missing")
}
req, err := s.newQueryRangeRequest(query, start, end)
if err != nil {
return Result{}, err
}
resp, err := s.do(ctx, req)
if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
// something in the middle between client and datasource might be closing
// the connection. So we do a one more attempt in hope request will succeed.
req, _ = s.newQueryRangeRequest(query, start, end)
resp, err = s.do(ctx, req)
}
if err != nil {
return res, err
}
defer func() {
_ = resp.Body.Close()
}()
return parsePrometheusResponse(req, resp)
resp, err := s.do(ctx, req)
if err != nil {
if !errors.Is(err, io.EOF) && !errors.Is(err, io.ErrUnexpectedEOF) {
// Return unexpected error to the caller.
return res, err
}
// Something in the middle between client and datasource might be closing
// the connection. So we do a one more attempt in hope request will succeed.
req, err = s.newQueryRangeRequest(query, start, end)
if err != nil {
return res, fmt.Errorf("second attempt: %w", err)
}
resp, err = s.do(ctx, req)
if err != nil {
return res, fmt.Errorf("second attempt: %w", err)
}
}
// Process the received response.
res, err = parsePrometheusResponse(req, resp)
_ = resp.Body.Close()
return res, err
}
func (s *VMStorage) do(ctx context.Context, req *http.Request) (*http.Response, error) {
@@ -219,7 +233,7 @@ func (s *VMStorage) do(ctx context.Context, req *http.Request) (*http.Response,
func (s *VMStorage) newQueryRangeRequest(query string, start, end time.Time) (*http.Request, error) {
req, err := s.newRequest()
if err != nil {
return nil, fmt.Errorf("cannot create query_range request to datasource %q: %s", s.datasourceURL, err)
return nil, fmt.Errorf("cannot create query_range request to datasource %q: %w", s.datasourceURL, err)
}
s.setPrometheusRangeReqParams(req, query, start, end)
return req, nil
@@ -228,7 +242,7 @@ func (s *VMStorage) newQueryRangeRequest(query string, start, end time.Time) (*h
func (s *VMStorage) newQueryRequest(query string, ts time.Time) (*http.Request, error) {
req, err := s.newRequest()
if err != nil {
return nil, fmt.Errorf("cannot create query request to datasource %q: %s", s.datasourceURL, err)
return nil, fmt.Errorf("cannot create query request to datasource %q: %w", s.datasourceURL, err)
}
switch s.dataSourceType {
case "", datasourcePrometheus:

View File

@@ -112,14 +112,14 @@ func parsePrometheusResponse(req *http.Request, resp *http.Response) (res Result
return res, fmt.Errorf("response error, query: %s, errorType: %s, error: %s", req.URL.Redacted(), r.ErrorType, r.Error)
}
if r.Status != statusSuccess {
return res, fmt.Errorf("unknown status: %s, Expected success or error ", r.Status)
return res, fmt.Errorf("unknown status: %s, Expected success or error", r.Status)
}
var parseFn func() ([]Metric, error)
switch r.Data.ResultType {
case rtVector:
var pi promInstant
if err := json.Unmarshal(r.Data.Result, &pi.Result); err != nil {
return res, fmt.Errorf("umarshal err %s; \n %#v", err, string(r.Data.Result))
return res, fmt.Errorf("unmarshal err %w; \n %#v", err, string(r.Data.Result))
}
parseFn = pi.metrics
case rtMatrix:

View File

@@ -47,8 +47,8 @@ all files with prefix rule_ in folder dir.
See https://docs.victoriametrics.com/vmalert.html#reading-rules-from-object-storage
`)
ruleTemplatesPath = flagutil.NewArrayString("rule.templates", `Path or glob pattern to location with go template definitions
for rules annotations templating. Flag can be specified multiple times.
ruleTemplatesPath = flagutil.NewArrayString("rule.templates", `Path or glob pattern to location with go template definitions `+
`for rules annotations templating. Flag can be specified multiple times.
Examples:
-rule.templates="/path/to/file". Path to a single file with go templates
-rule.templates="dir/*.tpl" -rule.templates="/*.tpl". Relative path to all .tpl files in "dir" folder,
@@ -59,7 +59,7 @@ absolute path to all .tpl files in root.
configCheckInterval = flag.Duration("configCheckInterval", 0, "Interval for checking for changes in '-rule' or '-notifier.config' files. "+
"By default, the checking is disabled. Send SIGHUP signal in order to force config check for changes.")
httpListenAddr = flag.String("httpListenAddr", ":8880", "Address to listen for http connections. See also -httpListenAddr.useProxyProtocol")
httpListenAddr = flag.String("httpListenAddr", ":8880", "Address to listen for http connections. See also -tls and -httpListenAddr.useProxyProtocol")
useProxyProtocol = flag.Bool("httpListenAddr.useProxyProtocol", false, "Whether to use proxy protocol for connections accepted at -httpListenAddr . "+
"See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt . "+
"With enabled proxy protocol http server cannot serve regular /metrics endpoint. Use -pushmetrics.url for metrics pushing")
@@ -230,7 +230,9 @@ func newManager(ctx context.Context) (*manager, error) {
if err != nil {
return nil, fmt.Errorf("failed to init remoteWrite: %w", err)
}
manager.rw = rw
if rw != nil {
manager.rw = rw
}
rr, err := remoteread.Init()
if err != nil {

View File

@@ -191,7 +191,7 @@ func (a Alert) toPromLabels(relabelCfg *promrelabel.ParsedConfigs) []prompbmarsh
var labels []prompbmarshal.Label
for k, v := range a.Labels {
labels = append(labels, prompbmarshal.Label{
Name: k,
Name: promrelabel.SanitizeMetricName(k),
Value: v,
})
}

View File

@@ -237,6 +237,11 @@ func TestAlert_toPromLabels(t *testing.T) {
[]prompbmarshal.Label{{Name: "a", Value: "baz"}, {Name: "foo", Value: "bar"}},
nil,
)
fn(
map[string]string{"foo.bar": "baz", "service!name": "qux"},
[]prompbmarshal.Label{{Name: "foo_bar", Value: "baz"}, {Name: "service_name", Value: "qux"}},
nil,
)
pcs, err := promrelabel.ParseRelabelConfigsData([]byte(`
- target_label: "foo"

View File

@@ -3,7 +3,6 @@ package notifier
import (
"crypto/md5"
"fmt"
"gopkg.in/yaml.v2"
"net/url"
"os"
"path"
@@ -11,6 +10,8 @@ import (
"strings"
"time"
"gopkg.in/yaml.v2"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/promauth"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/promrelabel"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/promscrape/discovery/consul"
@@ -142,26 +143,23 @@ func parseLabels(target string, metaLabels *promutils.Labels, cfg *Config) (stri
if labels.Len() == 0 {
return "", nil, nil
}
schemeRelabeled := labels.Get("__scheme__")
if len(schemeRelabeled) == 0 {
schemeRelabeled = "http"
scheme := labels.Get("__scheme__")
if len(scheme) == 0 {
scheme = "http"
}
addressRelabeled := labels.Get("__address__")
if len(addressRelabeled) == 0 {
alertsPath := labels.Get("__alerts_path__")
if !strings.HasPrefix(alertsPath, "/") {
alertsPath = "/" + alertsPath
}
address := labels.Get("__address__")
if len(address) == 0 {
return "", nil, nil
}
if strings.Contains(addressRelabeled, "/") {
return "", nil, nil
}
addressRelabeled = addMissingPort(schemeRelabeled, addressRelabeled)
alertsPathRelabeled := labels.Get("__alerts_path__")
if !strings.HasPrefix(alertsPathRelabeled, "/") {
alertsPathRelabeled = "/" + alertsPathRelabeled
}
u := fmt.Sprintf("%s://%s%s", schemeRelabeled, addressRelabeled, alertsPathRelabeled)
address = addMissingPort(scheme, address)
u := fmt.Sprintf("%s://%s%s", scheme, address, alertsPath)
if _, err := url.Parse(u); err != nil {
return "", nil, fmt.Errorf("invalid url %q for scheme=%q (%q), target=%q, metrics_path=%q (%q): %w",
u, cfg.Scheme, schemeRelabeled, target, addressRelabeled, alertsPathRelabeled, err)
u, cfg.Scheme, scheme, target, address, alertsPath, err)
}
return u, labels, nil
}
@@ -181,9 +179,24 @@ func addMissingPort(scheme, target string) string {
func mergeLabels(target string, metaLabels *promutils.Labels, cfg *Config) *promutils.Labels {
// See https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
m := promutils.NewLabels(3 + metaLabels.Len())
m.Add("__address__", target)
m.Add("__scheme__", cfg.Scheme)
m.Add("__alerts_path__", path.Join("/", cfg.PathPrefix, alertManagerPath))
address := target
scheme := cfg.Scheme
alertsPath := path.Join("/", cfg.PathPrefix, alertManagerPath)
// try to extract optional scheme and alertsPath from __address__.
if strings.HasPrefix(address, "http://") {
scheme = "http"
address = address[len("http://"):]
} else if strings.HasPrefix(address, "https://") {
scheme = "https"
address = address[len("https://"):]
}
if n := strings.IndexByte(address, '/'); n >= 0 {
alertsPath = address[n:]
address = address[:n]
}
m.Add("__address__", address)
m.Add("__scheme__", scheme)
m.Add("__alerts_path__", alertsPath)
m.AddFrom(metaLabels)
return m
}

View File

@@ -87,7 +87,7 @@ func (cw *configWatcher) reload(path string) error {
func (cw *configWatcher) add(typeK TargetType, interval time.Duration, labelsFn getLabels) error {
targets, errors := targetsFromLabels(labelsFn, cw.cfg, cw.genFn)
for _, err := range errors {
return fmt.Errorf("failed to init notifier for %q: %s", typeK, err)
return fmt.Errorf("failed to init notifier for %q: %w", typeK, err)
}
cw.setTargets(typeK, targets)
@@ -107,7 +107,7 @@ func (cw *configWatcher) add(typeK TargetType, interval time.Duration, labelsFn
}
updateTargets, errors := targetsFromLabels(labelsFn, cw.cfg, cw.genFn)
for _, err := range errors {
logger.Errorf("failed to init notifier for %q: %s", typeK, err)
logger.Errorf("failed to init notifier for %q: %w", typeK, err)
}
cw.setTargets(typeK, updateTargets)
}
@@ -118,7 +118,7 @@ func (cw *configWatcher) add(typeK TargetType, interval time.Duration, labelsFn
func targetsFromLabels(labelsFn getLabels, cfg *Config, genFn AlertURLGenerator) ([]Target, []error) {
metaLabels, err := labelsFn()
if err != nil {
return nil, []error{fmt.Errorf("failed to get labels: %s", err)}
return nil, []error{fmt.Errorf("failed to get labels: %w", err)}
}
var targets []Target
var errors []error
@@ -167,11 +167,11 @@ func (cw *configWatcher) start() error {
for _, target := range cfg.Targets {
address, labels, err := parseLabels(target, nil, cw.cfg)
if err != nil {
return fmt.Errorf("failed to parse labels for target %q: %s", target, err)
return fmt.Errorf("failed to parse labels for target %q: %w", target, err)
}
notifier, err := NewAlertManager(address, cw.genFn, httpCfg, cw.cfg.parsedAlertRelabelConfigs, cw.cfg.Timeout.Duration())
if err != nil {
return fmt.Errorf("failed to init alertmanager for addr %q: %s", address, err)
return fmt.Errorf("failed to init alertmanager for addr %q: %w", address, err)
}
targets = append(targets, Target{
Notifier: notifier,
@@ -189,14 +189,14 @@ func (cw *configWatcher) start() error {
sdc := &cw.cfg.ConsulSDConfigs[i]
targetLabels, err := sdc.GetLabels(cw.cfg.baseDir)
if err != nil {
return nil, fmt.Errorf("got labels err: %s", err)
return nil, fmt.Errorf("got labels err: %w", err)
}
labels = append(labels, targetLabels...)
}
return labels, nil
})
if err != nil {
return fmt.Errorf("failed to start consulSD discovery: %s", err)
return fmt.Errorf("failed to start consulSD discovery: %w", err)
}
}
@@ -207,14 +207,14 @@ func (cw *configWatcher) start() error {
sdc := &cw.cfg.DNSSDConfigs[i]
targetLabels, err := sdc.GetLabels(cw.cfg.baseDir)
if err != nil {
return nil, fmt.Errorf("got labels err: %s", err)
return nil, fmt.Errorf("got labels err: %w", err)
}
labels = append(labels, targetLabels...)
}
return labels, nil
})
if err != nil {
return fmt.Errorf("failed to start DNSSD discovery: %s", err)
return fmt.Errorf("failed to start DNSSD discovery: %w", err)
}
}
return nil

View File

@@ -318,3 +318,47 @@ func TestMergeHTTPClientConfigs(t *testing.T) {
t.Fatalf("expected BasicAuth tp be present")
}
}
func TestParseLabels(t *testing.T) {
testCases := []struct {
name string
target string
cfg *Config
expectedAddress string
expectedErr bool
}{
{
"invalid address",
"invalid:*//url",
&Config{},
"",
true,
},
{
"use some default params",
"alertmanager:9093",
&Config{PathPrefix: "test"},
"http://alertmanager:9093/test/api/v2/alerts",
false,
},
{
"use target address",
"https://alertmanager:9093/api/v1/alerts",
&Config{Scheme: "http", PathPrefix: "test"},
"https://alertmanager:9093/api/v1/alerts",
false,
},
}
for _, tc := range testCases {
t.Run(tc.name, func(t *testing.T) {
address, _, err := parseLabels(tc.target, nil, tc.cfg)
if err == nil == tc.expectedErr {
t.Fatalf("unexpected error; got %t; want %t", err != nil, tc.expectedErr)
}
if address != tc.expectedAddress {
t.Fatalf("unexpected address; got %q; want %q", address, tc.expectedAddress)
}
})
}
}

View File

@@ -23,7 +23,7 @@ var (
"It is hidden by default, since it can contain sensitive info such as auth key")
blackHole = flag.Bool("notifier.blackhole", false, "Whether to blackhole alerting notifications. "+
"Enable this flag if you want vmalert to evaluate alerting rules without sending any notifications to external receivers (eg. alertmanager). "+
"`-notifier.url`, `-notifier.config` and `-notifier.blackhole` are mutually exclusive.")
"-notifier.url, -notifier.config and -notifier.blackhole are mutually exclusive.")
basicAuthUsername = flagutil.NewArrayString("notifier.basicAuth.username", "Optional basic auth username for -notifier.url")
basicAuthPassword = flagutil.NewArrayString("notifier.basicAuth.password", "Optional basic auth password for -notifier.url")
@@ -90,7 +90,7 @@ func Init(gen AlertURLGenerator, extLabels map[string]string, extURL string) (fu
externalLabels = extLabels
eu, err := url.Parse(externalURL)
if err != nil {
return nil, fmt.Errorf("failed to parse external URL: %s", err)
return nil, fmt.Errorf("failed to parse external URL: %w", err)
}
templates.UpdateWithFuncs(templates.FuncsWithExternalURL(eu))
@@ -116,7 +116,7 @@ func Init(gen AlertURLGenerator, extLabels map[string]string, extURL string) (fu
if len(*addrs) > 0 {
notifiers, err := notifiersFromFlags(gen)
if err != nil {
return nil, fmt.Errorf("failed to create notifier from flag values: %s", err)
return nil, fmt.Errorf("failed to create notifier from flag values: %w", err)
}
staticNotifiersFn = func() []Notifier {
return notifiers
@@ -126,7 +126,7 @@ func Init(gen AlertURLGenerator, extLabels map[string]string, extURL string) (fu
cw, err = newWatcher(*configPath, gen)
if err != nil {
return nil, fmt.Errorf("failed to init config watcher: %s", err)
return nil, fmt.Errorf("failed to init config watcher: %w", err)
}
return cw.notifiers, nil
}

View File

@@ -5,6 +5,7 @@ static_configs:
- targets:
- localhost:9093
- localhost:9095
- https://localhost:9093/test/api/v2/alerts
basic_auth:
username: foo
password: bar

View File

@@ -3,6 +3,7 @@ package remotewrite
import (
"bytes"
"context"
"errors"
"flag"
"fmt"
"io"
@@ -117,12 +118,19 @@ func NewClient(ctx context.Context, cfg Config) (*Client, error) {
// Push adds timeseries into queue for writing into remote storage.
// Push returns and error if client is stopped or if queue is full.
func (c *Client) Push(s prompbmarshal.TimeSeries) error {
rwTotal.Inc()
select {
case <-c.doneCh:
rwErrors.Inc()
droppedRows.Add(len(s.Samples))
droppedBytes.Add(s.Size())
return fmt.Errorf("client is closed")
case c.input <- s:
return nil
default:
rwErrors.Inc()
droppedRows.Add(len(s.Samples))
droppedBytes.Add(s.Size())
return fmt.Errorf("failed to push timeseries - queue is full (%d entries). "+
"Queue size is controlled by -remoteWrite.maxQueueSize flag",
c.maxQueueSize)
@@ -181,11 +189,14 @@ func (c *Client) run(ctx context.Context) {
}
var (
rwErrors = metrics.NewCounter(`vmalert_remotewrite_errors_total`)
rwTotal = metrics.NewCounter(`vmalert_remotewrite_total`)
sentRows = metrics.NewCounter(`vmalert_remotewrite_sent_rows_total`)
sentBytes = metrics.NewCounter(`vmalert_remotewrite_sent_bytes_total`)
sendDuration = metrics.NewFloatCounter(`vmalert_remotewrite_send_duration_seconds_total`)
droppedRows = metrics.NewCounter(`vmalert_remotewrite_dropped_rows_total`)
droppedBytes = metrics.NewCounter(`vmalert_remotewrite_dropped_bytes_total`)
sendDuration = metrics.NewFloatCounter(`vmalert_remotewrite_send_duration_seconds_total`)
bufferFlushDuration = metrics.NewHistogram(`vmalert_remotewrite_flush_duration_seconds`)
_ = metrics.NewGauge(`vmalert_remotewrite_concurrency`, func() float64 {
@@ -222,6 +233,11 @@ func (c *Client) flush(ctx context.Context, wr *prompbmarshal.WriteRequest) {
L:
for attempts := 0; ; attempts++ {
err := c.send(ctx, b)
if errors.Is(err, io.EOF) {
// Something in the middle between client and destination might be closing
// the connection. So we do a one more attempt in hope request will succeed.
err = c.send(ctx, b)
}
if err == nil {
sentRows.Add(len(wr.Timeseries))
sentBytes.Add(len(b))
@@ -259,6 +275,7 @@ L:
}
rwErrors.Inc()
droppedRows.Add(len(wr.Timeseries))
droppedBytes.Add(len(b))
logger.Errorf("attempts to send remote-write request failed - dropping %d time series",
@@ -282,7 +299,9 @@ func (c *Client) send(ctx context.Context, data []byte) error {
if c.authCfg != nil {
err = c.authCfg.SetHeaders(req, true)
if err != nil {
return &nonRetriableError{err: err}
return &nonRetriableError{
err: err,
}
}
}
if !*disablePathAppend {
@@ -301,13 +320,14 @@ func (c *Client) send(ctx context.Context, data []byte) error {
// Prometheus remote Write compatible receivers MUST
switch resp.StatusCode / 100 {
case 2:
// respond with a HTTP 2xx status code when the write is successful.
// respond with HTTP 2xx status code when write is successful.
return nil
case 4:
if resp.StatusCode != http.StatusTooManyRequests {
// MUST NOT retry write requests on HTTP 4xx responses other than 429
return &nonRetriableError{fmt.Errorf("unexpected response code %d for %s. Response body %q",
resp.StatusCode, req.URL.Redacted(), body)}
return &nonRetriableError{
err: fmt.Errorf("unexpected response code %d for %s. Response body %q", resp.StatusCode, req.URL.Redacted(), body),
}
}
fallthrough
default:

View File

@@ -30,7 +30,7 @@ var (
maxQueueSize = flag.Int("remoteWrite.maxQueueSize", 1e5, "Defines the max number of pending datapoints to remote write endpoint")
maxBatchSize = flag.Int("remoteWrite.maxBatchSize", 1e3, "Defines max number of timeseries to be flushed at once")
concurrency = flag.Int("remoteWrite.concurrency", 1, "Defines number of writers for concurrent writing into remote querier")
concurrency = flag.Int("remoteWrite.concurrency", 1, "Defines number of writers for concurrent writing into remote write endpoint")
flushInterval = flag.Duration("remoteWrite.flushInterval", 5*time.Second, "Defines interval of flushes to remote write endpoint")
tlsInsecureSkipVerify = flag.Bool("remoteWrite.tlsInsecureSkipVerify", false, "Whether to skip tls verification when connecting to -remoteWrite.url")

View File

@@ -36,11 +36,11 @@ func replay(groupsCfg []config.Group, qb datasource.QuerierBuilder, rw remotewri
}
tFrom, err := time.Parse(time.RFC3339, *replayFrom)
if err != nil {
return fmt.Errorf("failed to parse %q: %s", *replayFrom, err)
return fmt.Errorf("failed to parse %q: %w", *replayFrom, err)
}
tTo, err := time.Parse(time.RFC3339, *replayTo)
if err != nil {
return fmt.Errorf("failed to parse %q: %s", *replayTo, err)
return fmt.Errorf("failed to parse %q: %w", *replayTo, err)
}
if !tTo.After(tFrom) {
return fmt.Errorf("replay.timeTo must be bigger than replay.timeFrom")

View File

@@ -30,6 +30,7 @@ type AlertingRule struct {
Annotations map[string]string
GroupID uint64
GroupName string
File string
EvalInterval time.Duration
Debug bool
@@ -47,7 +48,7 @@ type AlertingRule struct {
}
type alertingRuleMetrics struct {
errors *utils.Gauge
errors *utils.Counter
pending *utils.Gauge
active *utils.Gauge
samples *utils.Gauge
@@ -67,6 +68,7 @@ func NewAlertingRule(qb datasource.QuerierBuilder, group *Group, cfg config.Rule
Annotations: cfg.Annotations,
GroupID: group.ID(),
GroupName: group.Name,
File: group.File,
EvalInterval: group.Interval,
Debug: cfg.Debug,
q: qb.BuildWithParams(datasource.QuerierParams{
@@ -91,7 +93,7 @@ func NewAlertingRule(qb datasource.QuerierBuilder, group *Group, cfg config.Rule
entries: make([]StateEntry, entrySize),
}
labels := fmt.Sprintf(`alertname=%q, group=%q, id="%d"`, ar.Name, group.Name, ar.ID())
labels := fmt.Sprintf(`alertname=%q, group=%q, file=%q, id="%d"`, ar.Name, group.Name, group.File, ar.ID())
ar.metrics.pending = utils.GetOrCreateGauge(fmt.Sprintf(`vmalert_alerts_pending{%s}`, labels),
func() float64 {
ar.alertsMu.RLock()
@@ -116,14 +118,7 @@ func NewAlertingRule(qb datasource.QuerierBuilder, group *Group, cfg config.Rule
}
return float64(num)
})
ar.metrics.errors = utils.GetOrCreateGauge(fmt.Sprintf(`vmalert_alerting_rules_error{%s}`, labels),
func() float64 {
e := ar.state.getLast()
if e.Err == nil {
return 0
}
return 1
})
ar.metrics.errors = utils.GetOrCreateCounter(fmt.Sprintf(`vmalert_alerting_rules_errors_total{%s}`, labels))
ar.metrics.samples = utils.GetOrCreateGauge(fmt.Sprintf(`vmalert_alerting_rules_last_evaluation_samples{%s}`, labels),
func() float64 {
e := ar.state.getLast()
@@ -269,7 +264,7 @@ func (ar *AlertingRule) toLabels(m datasource.Metric, qFn templates.QueryFn) (*l
Expr: ar.Expr,
})
if err != nil {
return nil, fmt.Errorf("failed to expand labels: %s", err)
return nil, fmt.Errorf("failed to expand labels: %w", err)
}
for k, v := range extraLabels {
ls.processed[k] = v
@@ -295,24 +290,33 @@ func (ar *AlertingRule) toLabels(m datasource.Metric, qFn templates.QueryFn) (*l
}
// execRange executes alerting rule on the given time range similarly to exec.
// It doesn't update internal states of the Rule and meant to be used just
// to get time series for backfilling.
// It returns ALERT and ALERT_FOR_STATE time series as result.
// When making consecutive calls make sure to respect time linearity for start and end params,
// as this function modifies AlertingRule alerts state.
// It is not thread safe.
// It returns ALERT and ALERT_FOR_STATE time series as a result.
func (ar *AlertingRule) execRange(ctx context.Context, start, end time.Time) ([]prompbmarshal.TimeSeries, error) {
res, err := ar.q.QueryRange(ctx, ar.Expr, start, end)
if err != nil {
return nil, err
}
var result []prompbmarshal.TimeSeries
holdAlertState := make(map[uint64]*notifier.Alert)
qFn := func(query string) ([]datasource.Metric, error) {
return nil, fmt.Errorf("`query` template isn't supported in replay mode")
}
for _, s := range res.Data {
ls, err := ar.toLabels(s, qFn)
if err != nil {
return nil, fmt.Errorf("failed to expand labels: %s", err)
}
h := hash(ls.processed)
a, err := ar.newAlert(s, nil, time.Time{}, qFn) // initial alert
if err != nil {
return nil, fmt.Errorf("failed to create alert: %s", err)
return nil, fmt.Errorf("failed to create alert: %w", err)
}
if ar.For == 0 { // if alert is instant
// if alert is instant, For: 0
if ar.For == 0 {
a.State = notifier.StateFiring
for i := range s.Values {
result = append(result, ar.alertToTimeSeries(a, s.Timestamps[i])...)
@@ -324,18 +328,32 @@ func (ar *AlertingRule) execRange(ctx context.Context, start, end time.Time) ([]
prevT := time.Time{}
for i := range s.Values {
at := time.Unix(s.Timestamps[i], 0)
// try to restore alert's state on the first iteration
if at.Equal(start) {
if _, ok := ar.alerts[h]; ok {
a = ar.alerts[h]
prevT = at
}
}
if at.Sub(prevT) > ar.EvalInterval {
// reset to Pending if there are gaps > EvalInterval between DPs
a.State = notifier.StatePending
a.ActiveAt = at
} else if at.Sub(a.ActiveAt) >= ar.For {
a.Start = time.Time{}
} else if at.Sub(a.ActiveAt) >= ar.For && a.State != notifier.StateFiring {
a.State = notifier.StateFiring
a.Start = at
}
prevT = at
result = append(result, ar.alertToTimeSeries(a, s.Timestamps[i])...)
// save alert's state on last iteration, so it can be used on the next execRange call
if at.Equal(end) {
holdAlertState[h] = a
}
}
}
ar.alerts = holdAlertState
return result, nil
}
@@ -360,6 +378,9 @@ func (ar *AlertingRule) exec(ctx context.Context, ts time.Time, limit int) ([]pr
defer func() {
ar.state.add(curState)
if curState.Err != nil {
ar.metrics.errors.Inc()
}
}()
ar.alertsMu.Lock()
@@ -388,7 +409,7 @@ func (ar *AlertingRule) exec(ctx context.Context, ts time.Time, limit int) ([]pr
for _, m := range res.Data {
ls, err := ar.toLabels(m, qFn)
if err != nil {
curState.Err = fmt.Errorf("failed to expand labels: %s", err)
curState.Err = fmt.Errorf("failed to expand labels: %w", err)
return nil, curState.Err
}
h := hash(ls.processed)
@@ -513,7 +534,7 @@ func (ar *AlertingRule) newAlert(m datasource.Metric, ls *labelSet, start time.T
if ls == nil {
ls, err = ar.toLabels(m, qFn)
if err != nil {
return nil, fmt.Errorf("failed to expand labels: %s", err)
return nil, fmt.Errorf("failed to expand labels: %w", err)
}
}
a := &notifier.Alert{
@@ -591,44 +612,41 @@ func (ar *AlertingRule) restore(ctx context.Context, q datasource.Querier, ts ti
return nil
}
for _, a := range ar.alerts {
nameStr := fmt.Sprintf("%s=%q", alertNameLabel, ar.Name)
if !*disableAlertGroupLabel {
nameStr = fmt.Sprintf("%s=%q,%s=%q", alertGroupNameLabel, ar.GroupName, alertNameLabel, ar.Name)
}
var labelsFilter string
for k, v := range ar.Labels {
labelsFilter += fmt.Sprintf(",%s=%q", k, v)
}
expr := fmt.Sprintf("last_over_time(%s{%s%s}[%ds])",
alertForStateMetricName, nameStr, labelsFilter, int(lookback.Seconds()))
res, _, err := q.Query(ctx, expr, ts)
if err != nil {
return fmt.Errorf("failed to execute restore query %q: %w ", expr, err)
}
if len(res.Data) < 1 {
ar.logDebugf(ts, nil, "no response was received from restore query")
return nil
}
for _, series := range res.Data {
series.DelLabel("__name__")
labelSet := make(map[string]string, len(series.Labels))
for _, v := range series.Labels {
labelSet[v.Name] = v.Value
}
id := hash(labelSet)
a, ok := ar.alerts[id]
if !ok {
continue
}
if a.Restored || a.State != notifier.StatePending {
continue
}
var labelsFilter []string
for k, v := range a.Labels {
labelsFilter = append(labelsFilter, fmt.Sprintf("%s=%q", k, v))
}
sort.Strings(labelsFilter)
expr := fmt.Sprintf("last_over_time(%s{%s}[%ds])",
alertForStateMetricName, strings.Join(labelsFilter, ","), int(lookback.Seconds()))
ar.logDebugf(ts, nil, "restoring alert state via query %q", expr)
res, _, err := q.Query(ctx, expr, ts)
if err != nil {
return err
}
qMetrics := res.Data
if len(qMetrics) < 1 {
ar.logDebugf(ts, nil, "no response was received from restore query")
continue
}
// only one series expected in response
m := qMetrics[0]
// __name__ supposed to be alertForStateMetricName
m.DelLabel("__name__")
// we assume that restore query contains all label matchers,
// so all received labels will match anyway if their number is equal.
if len(m.Labels) != len(a.Labels) {
ar.logDebugf(ts, nil, "state restore query returned not expected label-set %v", m.Labels)
continue
}
a.ActiveAt = time.Unix(int64(m.Values[0]), 0)
a.ActiveAt = time.Unix(int64(series.Values[0]), 0)
a.Restored = true
logger.Infof("alert %q (%d) restored to state at %v", a.Name, a.ID, a.ActiveAt)
}

View File

@@ -3,6 +3,7 @@ package rule
import (
"context"
"errors"
"fmt"
"reflect"
"sort"
"strings"
@@ -13,6 +14,7 @@ import (
"github.com/VictoriaMetrics/VictoriaMetrics/app/vmalert/config"
"github.com/VictoriaMetrics/VictoriaMetrics/app/vmalert/datasource"
"github.com/VictoriaMetrics/VictoriaMetrics/app/vmalert/notifier"
"github.com/VictoriaMetrics/VictoriaMetrics/app/vmalert/utils"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/prompbmarshal"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/promutils"
)
@@ -346,15 +348,18 @@ func TestAlertingRule_Exec(t *testing.T) {
}
func TestAlertingRule_ExecRange(t *testing.T) {
fakeGroup := Group{Name: "TestRule_ExecRange"}
testCases := []struct {
rule *AlertingRule
data []datasource.Metric
expAlerts []*notifier.Alert
rule *AlertingRule
data []datasource.Metric
expAlerts []*notifier.Alert
expHoldAlertStateAlerts map[uint64]*notifier.Alert
}{
{
newTestAlertingRule("empty", 0),
[]datasource.Metric{},
nil,
nil,
},
{
newTestAlertingRule("empty labels", 0),
@@ -364,6 +369,7 @@ func TestAlertingRule_ExecRange(t *testing.T) {
[]*notifier.Alert{
{State: notifier.StateFiring},
},
nil,
},
{
newTestAlertingRule("single-firing", 0),
@@ -376,6 +382,7 @@ func TestAlertingRule_ExecRange(t *testing.T) {
State: notifier.StateFiring,
},
},
nil,
},
{
newTestAlertingRule("single-firing-on-range", 0),
@@ -387,6 +394,7 @@ func TestAlertingRule_ExecRange(t *testing.T) {
{State: notifier.StateFiring},
{State: notifier.StateFiring},
},
nil,
},
{
newTestAlertingRule("for-pending", time.Second),
@@ -398,6 +406,16 @@ func TestAlertingRule_ExecRange(t *testing.T) {
{State: notifier.StatePending, ActiveAt: time.Unix(3, 0)},
{State: notifier.StatePending, ActiveAt: time.Unix(5, 0)},
},
map[uint64]*notifier.Alert{hash(map[string]string{"alertname": "for-pending"}): {
GroupID: fakeGroup.ID(),
Name: "for-pending",
Labels: map[string]string{"alertname": "for-pending"},
Annotations: map[string]string{},
State: notifier.StatePending,
ActiveAt: time.Unix(5, 0),
Value: 1,
For: time.Second,
}},
},
{
newTestAlertingRule("for-firing", 3*time.Second),
@@ -409,6 +427,38 @@ func TestAlertingRule_ExecRange(t *testing.T) {
{State: notifier.StatePending, ActiveAt: time.Unix(1, 0)},
{State: notifier.StateFiring, ActiveAt: time.Unix(1, 0)},
},
map[uint64]*notifier.Alert{hash(map[string]string{"alertname": "for-firing"}): {
GroupID: fakeGroup.ID(),
Name: "for-firing",
Labels: map[string]string{"alertname": "for-firing"},
Annotations: map[string]string{},
State: notifier.StateFiring,
ActiveAt: time.Unix(1, 0),
Start: time.Unix(5, 0),
Value: 1,
For: 3 * time.Second,
}},
},
{
newTestAlertingRule("for-hold-pending", time.Second),
[]datasource.Metric{
{Values: []float64{1, 1, 1}, Timestamps: []int64{1, 2, 5}},
},
[]*notifier.Alert{
{State: notifier.StatePending, ActiveAt: time.Unix(1, 0)},
{State: notifier.StateFiring, ActiveAt: time.Unix(1, 0)},
{State: notifier.StatePending, ActiveAt: time.Unix(5, 0)},
},
map[uint64]*notifier.Alert{hash(map[string]string{"alertname": "for-hold-pending"}): {
GroupID: fakeGroup.ID(),
Name: "for-hold-pending",
Labels: map[string]string{"alertname": "for-hold-pending"},
Annotations: map[string]string{},
State: notifier.StatePending,
ActiveAt: time.Unix(5, 0),
Value: 1,
For: time.Second,
}},
},
{
newTestAlertingRule("for=>pending=>firing=>pending=>firing=>pending", time.Second),
@@ -422,9 +472,10 @@ func TestAlertingRule_ExecRange(t *testing.T) {
{State: notifier.StateFiring, ActiveAt: time.Unix(5, 0)},
{State: notifier.StatePending, ActiveAt: time.Unix(20, 0)},
},
nil,
},
{
newTestAlertingRule("multi-series-for=>pending=>pending=>firing", 3*time.Second),
newTestAlertingRule("multi-series", 3*time.Second),
[]datasource.Metric{
{Values: []float64{1, 1, 1}, Timestamps: []int64{1, 3, 5}},
{
@@ -436,7 +487,6 @@ func TestAlertingRule_ExecRange(t *testing.T) {
{State: notifier.StatePending, ActiveAt: time.Unix(1, 0)},
{State: notifier.StatePending, ActiveAt: time.Unix(1, 0)},
{State: notifier.StateFiring, ActiveAt: time.Unix(1, 0)},
//
{
State: notifier.StatePending, ActiveAt: time.Unix(1, 0),
Labels: map[string]string{
@@ -450,6 +500,29 @@ func TestAlertingRule_ExecRange(t *testing.T) {
},
},
},
map[uint64]*notifier.Alert{
hash(map[string]string{"alertname": "multi-series"}): {
GroupID: fakeGroup.ID(),
Name: "multi-series",
Labels: map[string]string{"alertname": "multi-series"},
Annotations: map[string]string{},
State: notifier.StateFiring,
ActiveAt: time.Unix(1, 0),
Start: time.Unix(5, 0),
Value: 1,
For: 3 * time.Second,
},
hash(map[string]string{"alertname": "multi-series", "foo": "bar"}): {
GroupID: fakeGroup.ID(),
Name: "multi-series",
Labels: map[string]string{"alertname": "multi-series", "foo": "bar"},
Annotations: map[string]string{},
State: notifier.StatePending,
ActiveAt: time.Unix(5, 0),
Value: 1,
For: 3 * time.Second,
},
},
},
{
newTestRuleWithLabels("multi-series-firing", "source", "vm"),
@@ -477,16 +550,16 @@ func TestAlertingRule_ExecRange(t *testing.T) {
"source": "vm",
}},
},
nil,
},
}
fakeGroup := Group{Name: "TestRule_ExecRange"}
for _, tc := range testCases {
t.Run(tc.rule.Name, func(t *testing.T) {
fq := &datasource.FakeQuerier{}
tc.rule.q = fq
tc.rule.GroupID = fakeGroup.ID()
fq.Add(tc.data...)
gotTS, err := tc.rule.execRange(context.TODO(), time.Now(), time.Now())
gotTS, err := tc.rule.execRange(context.TODO(), time.Unix(1, 0), time.Unix(5, 0))
if err != nil {
t.Fatalf("unexpected err: %s", err)
}
@@ -512,6 +585,11 @@ func TestAlertingRule_ExecRange(t *testing.T) {
t.Fatalf("%d: expected \n%v but got \n%v", i, exp, got)
}
}
if tc.expHoldAlertStateAlerts != nil {
if !reflect.DeepEqual(tc.expHoldAlertStateAlerts, tc.rule.alerts) {
t.Fatalf("expected hold alerts state: \n%v but got \n%v", tc.expHoldAlertStateAlerts, tc.rule.alerts)
}
}
})
}
}
@@ -564,6 +642,9 @@ func TestGroup_Restore(t *testing.T) {
if got.ActiveAt != exp.ActiveAt {
t.Fatalf("expected ActiveAt %v; got %v", exp.ActiveAt, got.ActiveAt)
}
if got.Name != exp.Name {
t.Fatalf("expected alertname %q; got %q", exp.Name, got.Name)
}
}
}
@@ -579,6 +660,7 @@ func TestGroup_Restore(t *testing.T) {
[]config.Rule{{Alert: "foo", Expr: "foo", For: promutils.NewDuration(time.Second)}},
map[uint64]*notifier.Alert{
hash(map[string]string{alertNameLabel: "foo", alertGroupNameLabel: "TestRestore"}): {
Name: "foo",
ActiveAt: defaultTS,
},
})
@@ -592,6 +674,7 @@ func TestGroup_Restore(t *testing.T) {
[]config.Rule{{Alert: "foo", Expr: "foo", For: promutils.NewDuration(time.Second)}},
map[uint64]*notifier.Alert{
hash(map[string]string{alertNameLabel: "foo", alertGroupNameLabel: "TestRestore"}): {
Name: "foo",
ActiveAt: ts,
},
})
@@ -599,7 +682,7 @@ func TestGroup_Restore(t *testing.T) {
// two rules, two active alerts, one with state restored
ts = time.Now().Truncate(time.Hour)
fqr.Set(`last_over_time(ALERTS_FOR_STATE{alertgroup="TestRestore",alertname="bar"}[3600s])`,
stateMetric("foo", ts))
stateMetric("bar", ts))
fn(
[]config.Rule{
{Alert: "foo", Expr: "foo", For: promutils.NewDuration(time.Second)},
@@ -607,9 +690,11 @@ func TestGroup_Restore(t *testing.T) {
},
map[uint64]*notifier.Alert{
hash(map[string]string{alertNameLabel: "foo", alertGroupNameLabel: "TestRestore"}): {
Name: "foo",
ActiveAt: defaultTS,
},
hash(map[string]string{alertNameLabel: "bar", alertGroupNameLabel: "TestRestore"}): {
Name: "bar",
ActiveAt: ts,
},
})
@@ -627,9 +712,11 @@ func TestGroup_Restore(t *testing.T) {
},
map[uint64]*notifier.Alert{
hash(map[string]string{alertNameLabel: "foo", alertGroupNameLabel: "TestRestore"}): {
Name: "foo",
ActiveAt: ts,
},
hash(map[string]string{alertNameLabel: "bar", alertGroupNameLabel: "TestRestore"}): {
Name: "bar",
ActiveAt: ts,
},
})
@@ -642,6 +729,7 @@ func TestGroup_Restore(t *testing.T) {
[]config.Rule{{Alert: "foo", Expr: "foo", For: promutils.NewDuration(time.Second)}},
map[uint64]*notifier.Alert{
hash(map[string]string{alertNameLabel: "foo", alertGroupNameLabel: "TestRestore"}): {
Name: "foo",
ActiveAt: defaultTS,
},
})
@@ -654,6 +742,7 @@ func TestGroup_Restore(t *testing.T) {
[]config.Rule{{Alert: "foo", Expr: "foo", Labels: map[string]string{"env": "dev"}, For: promutils.NewDuration(time.Second)}},
map[uint64]*notifier.Alert{
hash(map[string]string{alertNameLabel: "foo", alertGroupNameLabel: "TestRestore", "env": "dev"}): {
Name: "foo",
ActiveAt: ts,
},
})
@@ -666,6 +755,7 @@ func TestGroup_Restore(t *testing.T) {
[]config.Rule{{Alert: "foo", Expr: "foo", Labels: map[string]string{"env": "dev"}, For: promutils.NewDuration(time.Second)}},
map[uint64]*notifier.Alert{
hash(map[string]string{alertNameLabel: "foo", alertGroupNameLabel: "TestRestore", "env": "dev"}): {
Name: "foo",
ActiveAt: defaultTS,
},
})
@@ -990,6 +1080,9 @@ func newTestAlertingRule(name string, waitFor time.Duration) *AlertingRule {
EvalInterval: waitFor,
alerts: make(map[uint64]*notifier.Alert),
state: &ruleState{entries: make([]StateEntry, 10)},
metrics: &alertingRuleMetrics{
errors: utils.GetOrCreateCounter(fmt.Sprintf(`vmalert_alerting_rules_errors_total{alertname=%q}`, name)),
},
}
return &rule
}

View File

@@ -31,7 +31,9 @@ var (
"Rule's updates are available on rule's Details page and are used for debugging purposes. The number of stored updates can be overridden per rule via update_entries_limit param.")
resendDelay = flag.Duration("rule.resendDelay", 0, "MiniMum amount of time to wait before resending an alert to notifier")
maxResolveDuration = flag.Duration("rule.maxResolveDuration", 0, "Limits the maxiMum duration for automatic alert expiration, "+
"which by default is 4 times evaluationInterval of the parent ")
"which by default is 4 times evaluationInterval of the parent group")
evalDelay = flag.Duration("rule.evalDelay", 30*time.Second, "Adjustment of the `time` parameter for rule evaluation requests to compensate intentional data delay from the datasource."+
"Normally, should be equal to `-search.latencyOffset` (cmd-line flag configured for VictoriaMetrics single-node or vmselect).")
disableAlertGroupLabel = flag.Bool("disableAlertgroupLabel", false, "Whether to disable adding group's Name as label to generated alerts and time series.")
remoteReadLookBack = flag.Duration("remoteRead.lookback", time.Hour, "Lookback defines how far to look into past for alerts timeseries."+
" For example, if lookback=1h then range from now() to now()-1h will be scanned.")
@@ -39,13 +41,16 @@ var (
// Group is an entity for grouping rules
type Group struct {
mu sync.RWMutex
Name string
File string
Rules []Rule
Type config.Type
Interval time.Duration
EvalOffset *time.Duration
mu sync.RWMutex
Name string
File string
Rules []Rule
Type config.Type
Interval time.Duration
EvalOffset *time.Duration
// EvalDelay will adjust timestamp for rule evaluation requests to compensate intentional query delay from datasource.
// see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155
EvalDelay *time.Duration
Limit int
Concurrency int
Checksum string
@@ -139,6 +144,9 @@ func NewGroup(cfg config.Group, qb datasource.QuerierBuilder, defaultInterval ti
if cfg.EvalOffset != nil {
g.EvalOffset = &cfg.EvalOffset.D
}
if cfg.EvalDelay != nil {
g.EvalDelay = &cfg.EvalDelay.D
}
for _, h := range cfg.Headers {
g.Headers[h.Key] = h.Value
}
@@ -331,7 +339,7 @@ func (g *Group) Start(ctx context.Context, nts func() []notifier.Notifier, rw re
Rw: rw,
Notifiers: nts,
notifierHeaders: g.NotifierHeaders,
PreviouslySentSeriesToRW: make(map[uint64]map[string][]prompbmarshal.Label),
previouslySentSeriesToRW: make(map[uint64]map[string][]prompbmarshal.Label),
}
g.infof("started")
@@ -520,7 +528,7 @@ func (g *Group) ExecOnce(ctx context.Context, nts func() []notifier.Notifier, rw
Rw: rw,
Notifiers: nts,
notifierHeaders: g.NotifierHeaders,
PreviouslySentSeriesToRW: make(map[uint64]map[string][]prompbmarshal.Label),
previouslySentSeriesToRW: make(map[uint64]map[string][]prompbmarshal.Label),
}
if len(g.Rules) < 1 {
return nil
@@ -581,10 +589,14 @@ func (g *Group) adjustReqTimestamp(timestamp time.Time) time.Time {
// to 10:30, to the previous evaluationInterval.
return ts.Add(-g.Interval)
}
// EvalOffset shouldn't interfere with evalAlignment,
// so we return it immediately
// when `eval_offset` is using, ts shouldn't be effect by `eval_alignment` and `eval_delay`
// since it should be always aligned.
return ts
}
timestamp = timestamp.Add(-g.getEvalDelay())
// always apply the alignment as a last step
if g.evalAlignment == nil || *g.evalAlignment {
// align query time with interval to get similar result with grafana when plotting time series.
// see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5049
@@ -594,6 +606,13 @@ func (g *Group) adjustReqTimestamp(timestamp time.Time) time.Time {
return timestamp
}
func (g *Group) getEvalDelay() time.Duration {
if g.EvalDelay != nil {
return *g.EvalDelay
}
return *evalDelay
}
// executor contains group's notify and rw configs
type executor struct {
Notifiers func() []notifier.Notifier
@@ -602,11 +621,11 @@ type executor struct {
Rw remotewrite.RWClient
previouslySentSeriesToRWMu sync.Mutex
// PreviouslySentSeriesToRW stores series sent to RW on previous iteration
// previouslySentSeriesToRW stores series sent to RW on previous iteration
// map[ruleID]map[ruleLabels][]prompb.Label
// where `ruleID` is ID of the Rule within a Group
// and `ruleLabels` is []prompb.Label marshalled to a string
PreviouslySentSeriesToRW map[uint64]map[string][]prompbmarshal.Label
previouslySentSeriesToRW map[uint64]map[string][]prompbmarshal.Label
}
// execConcurrently executes rules concurrently if concurrency>1
@@ -644,9 +663,6 @@ var (
execTotal = metrics.NewCounter(`vmalert_execution_total`)
execErrors = metrics.NewCounter(`vmalert_execution_errors_total`)
remoteWriteErrors = metrics.NewCounter(`vmalert_remotewrite_errors_total`)
remoteWriteTotal = metrics.NewCounter(`vmalert_remotewrite_total`)
)
func (e *executor) exec(ctx context.Context, r Rule, ts time.Time, resolveDuration time.Duration, limit int) error {
@@ -667,9 +683,7 @@ func (e *executor) exec(ctx context.Context, r Rule, ts time.Time, resolveDurati
pushToRW := func(tss []prompbmarshal.TimeSeries) error {
var lastErr error
for _, ts := range tss {
remoteWriteTotal.Inc()
if err := e.Rw.Push(ts); err != nil {
remoteWriteErrors.Inc()
lastErr = fmt.Errorf("rule %q: remote write failure: %w", r, err)
}
}
@@ -723,7 +737,7 @@ func (e *executor) getStaleSeries(r Rule, tss []prompbmarshal.TimeSeries, timest
var staleS []prompbmarshal.TimeSeries
// check whether there are series which disappeared and need to be marked as stale
e.previouslySentSeriesToRWMu.Lock()
for key, labels := range e.PreviouslySentSeriesToRW[rID] {
for key, labels := range e.previouslySentSeriesToRW[rID] {
if _, ok := ruleLabels[key]; ok {
continue
}
@@ -732,7 +746,7 @@ func (e *executor) getStaleSeries(r Rule, tss []prompbmarshal.TimeSeries, timest
staleS = append(staleS, ss)
}
// set previous series to current
e.PreviouslySentSeriesToRW[rID] = ruleLabels
e.previouslySentSeriesToRW[rID] = ruleLabels
e.previouslySentSeriesToRWMu.Unlock()
return staleS
@@ -750,14 +764,14 @@ func (e *executor) purgeStaleSeries(activeRules []Rule) {
for _, rule := range activeRules {
id := rule.ID()
prev, ok := e.PreviouslySentSeriesToRW[id]
prev, ok := e.previouslySentSeriesToRW[id]
if ok {
// keep previous series for staleness detection
newPreviouslySentSeriesToRW[id] = prev
}
}
e.PreviouslySentSeriesToRW = nil
e.PreviouslySentSeriesToRW = newPreviouslySentSeriesToRW
e.previouslySentSeriesToRW = nil
e.previouslySentSeriesToRW = newPreviouslySentSeriesToRW
e.previouslySentSeriesToRWMu.Unlock()
}

View File

@@ -321,7 +321,7 @@ func TestResolveDuration(t *testing.T) {
func TestGetStaleSeries(t *testing.T) {
ts := time.Now()
e := &executor{
PreviouslySentSeriesToRW: make(map[uint64]map[string][]prompbmarshal.Label),
previouslySentSeriesToRW: make(map[uint64]map[string][]prompbmarshal.Label),
}
f := func(r Rule, labels, expLabels [][]prompbmarshal.Label) {
t.Helper()
@@ -414,7 +414,7 @@ func TestPurgeStaleSeries(t *testing.T) {
f := func(curRules, newRules, expStaleRules []Rule) {
t.Helper()
e := &executor{
PreviouslySentSeriesToRW: make(map[uint64]map[string][]prompbmarshal.Label),
previouslySentSeriesToRW: make(map[uint64]map[string][]prompbmarshal.Label),
}
// seed executor with series for
// current rules
@@ -424,13 +424,13 @@ func TestPurgeStaleSeries(t *testing.T) {
e.purgeStaleSeries(newRules)
if len(e.PreviouslySentSeriesToRW) != len(expStaleRules) {
if len(e.previouslySentSeriesToRW) != len(expStaleRules) {
t.Fatalf("expected to get %d stale series, got %d",
len(expStaleRules), len(e.PreviouslySentSeriesToRW))
len(expStaleRules), len(e.previouslySentSeriesToRW))
}
for _, exp := range expStaleRules {
if _, ok := e.PreviouslySentSeriesToRW[exp.ID()]; !ok {
if _, ok := e.previouslySentSeriesToRW[exp.ID()]; !ok {
t.Fatalf("expected to have rule %d; got nil instead", exp.ID())
}
}
@@ -515,7 +515,7 @@ func TestFaultyRW(t *testing.T) {
e := &executor{
Rw: &remotewrite.Client{},
PreviouslySentSeriesToRW: make(map[uint64]map[string][]prompbmarshal.Label),
previouslySentSeriesToRW: make(map[uint64]map[string][]prompbmarshal.Label),
}
err := e.exec(context.Background(), r, time.Now(), 0, 10)
@@ -628,6 +628,7 @@ func TestGroupStartDelay(t *testing.T) {
func TestGetPrometheusReqTimestamp(t *testing.T) {
offset := 30 * time.Minute
evalDelay := 1 * time.Minute
disableAlign := false
testCases := []struct {
name string
@@ -635,7 +636,7 @@ func TestGetPrometheusReqTimestamp(t *testing.T) {
originTS, expTS string
}{
{
"with query align",
"with query align + default evalDelay",
&Group{
Interval: time.Hour,
},
@@ -643,16 +644,16 @@ func TestGetPrometheusReqTimestamp(t *testing.T) {
"2023-08-28T11:00:00+00:00",
},
{
"without query align",
"without query align + default evalDelay",
&Group{
Interval: time.Hour,
evalAlignment: &disableAlign,
},
"2023-08-28T11:11:00+00:00",
"2023-08-28T11:11:00+00:00",
"2023-08-28T11:10:30+00:00",
},
{
"with eval_offset, find previous offset point",
"with eval_offset, find previous offset point + default evalDelay",
&Group{
EvalOffset: &offset,
Interval: time.Hour,
@@ -661,7 +662,7 @@ func TestGetPrometheusReqTimestamp(t *testing.T) {
"2023-08-28T10:30:00+00:00",
},
{
"with eval_offset",
"with eval_offset + default evalDelay",
&Group{
EvalOffset: &offset,
Interval: time.Hour,
@@ -669,14 +670,44 @@ func TestGetPrometheusReqTimestamp(t *testing.T) {
"2023-08-28T11:41:00+00:00",
"2023-08-28T11:30:00+00:00",
},
{
"1h interval with eval_delay",
&Group{
EvalDelay: &evalDelay,
Interval: time.Hour,
},
"2023-08-28T11:41:00+00:00",
"2023-08-28T11:00:00+00:00",
},
{
"1m interval with eval_delay",
&Group{
EvalDelay: &evalDelay,
Interval: time.Minute,
},
"2023-08-28T11:41:13+00:00",
"2023-08-28T11:40:00+00:00",
},
{
"disable alignment with eval_delay",
&Group{
EvalDelay: &evalDelay,
Interval: time.Hour,
evalAlignment: &disableAlign,
},
"2023-08-28T11:41:00+00:00",
"2023-08-28T11:40:00+00:00",
},
}
for _, tc := range testCases {
originT, _ := time.Parse(time.RFC3339, tc.originTS)
expT, _ := time.Parse(time.RFC3339, tc.expTS)
gotTS := tc.g.adjustReqTimestamp(originT)
if !gotTS.Equal(expT) {
t.Fatalf("get wrong prometheus request timestamp, expect %s, got %s", expT, gotTS)
}
t.Run(tc.name, func(t *testing.T) {
originT, _ := time.Parse(time.RFC3339, tc.originTS)
expT, _ := time.Parse(time.RFC3339, tc.expTS)
gotTS := tc.g.adjustReqTimestamp(originT)
if !gotTS.Equal(expT) {
t.Fatalf("get wrong prometheus request timestamp, expect %s, got %s", expT, gotTS)
}
})
}
}

View File

@@ -17,12 +17,14 @@ import (
// to evaluate configured Expression and
// return TimeSeries as result.
type RecordingRule struct {
Type config.Type
RuleID uint64
Name string
Expr string
Labels map[string]string
GroupID uint64
Type config.Type
RuleID uint64
Name string
Expr string
Labels map[string]string
GroupID uint64
GroupName string
File string
q datasource.Querier
@@ -34,7 +36,7 @@ type RecordingRule struct {
}
type recordingRuleMetrics struct {
errors *utils.Gauge
errors *utils.Counter
samples *utils.Gauge
}
@@ -52,13 +54,15 @@ func (rr *RecordingRule) ID() uint64 {
// NewRecordingRule creates a new RecordingRule
func NewRecordingRule(qb datasource.QuerierBuilder, group *Group, cfg config.Rule) *RecordingRule {
rr := &RecordingRule{
Type: group.Type,
RuleID: cfg.ID,
Name: cfg.Record,
Expr: cfg.Expr,
Labels: cfg.Labels,
GroupID: group.ID(),
metrics: &recordingRuleMetrics{},
Type: group.Type,
RuleID: cfg.ID,
Name: cfg.Record,
Expr: cfg.Expr,
Labels: cfg.Labels,
GroupID: group.ID(),
GroupName: group.Name,
File: group.File,
metrics: &recordingRuleMetrics{},
q: qb.BuildWithParams(datasource.QuerierParams{
DataSourceType: group.Type.String(),
EvaluationInterval: group.Interval,
@@ -78,15 +82,8 @@ func NewRecordingRule(qb datasource.QuerierBuilder, group *Group, cfg config.Rul
entries: make([]StateEntry, entrySize),
}
labels := fmt.Sprintf(`recording=%q, group=%q, id="%d"`, rr.Name, group.Name, rr.ID())
rr.metrics.errors = utils.GetOrCreateGauge(fmt.Sprintf(`vmalert_recording_rules_error{%s}`, labels),
func() float64 {
e := rr.state.getLast()
if e.Err == nil {
return 0
}
return 1
})
labels := fmt.Sprintf(`recording=%q, group=%q, file=%q, id="%d"`, rr.Name, group.Name, group.File, rr.ID())
rr.metrics.errors = utils.GetOrCreateCounter(fmt.Sprintf(`vmalert_recording_rules_errors_total{%s}`, labels))
rr.metrics.samples = utils.GetOrCreateGauge(fmt.Sprintf(`vmalert_recording_rules_last_evaluation_samples{%s}`, labels),
func() float64 {
e := rr.state.getLast()
@@ -138,6 +135,9 @@ func (rr *RecordingRule) exec(ctx context.Context, ts time.Time, limit int) ([]p
defer func() {
rr.state.add(curState)
if curState.Err != nil {
rr.metrics.errors.Inc()
}
}()
if err != nil {

View File

@@ -8,6 +8,7 @@ import (
"time"
"github.com/VictoriaMetrics/VictoriaMetrics/app/vmalert/datasource"
"github.com/VictoriaMetrics/VictoriaMetrics/app/vmalert/utils"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/prompbmarshal"
)
@@ -201,9 +202,15 @@ func TestRecordingRuleLimit(t *testing.T) {
metricWithValuesAndLabels(t, []float64{2, 3}, "__name__", "bar", "job", "bar"),
metricWithValuesAndLabels(t, []float64{4, 5, 6}, "__name__", "baz", "job", "baz"),
}
rule := &RecordingRule{Name: "job:foo", state: &ruleState{entries: make([]StateEntry, 10)}, Labels: map[string]string{
"source": "test_limit",
}}
rule := &RecordingRule{Name: "job:foo",
state: &ruleState{entries: make([]StateEntry, 10)},
Labels: map[string]string{
"source": "test_limit",
},
metrics: &recordingRuleMetrics{
errors: utils.GetOrCreateCounter(`vmalert_recording_rules_errors_total{alertname="job:foo"}`),
},
}
var err error
for _, testCase := range testCases {
fq := &datasource.FakeQuerier{}
@@ -223,6 +230,9 @@ func TestRecordingRule_ExecNegative(t *testing.T) {
"job": "test",
},
state: &ruleState{entries: make([]StateEntry, 10)},
metrics: &recordingRuleMetrics{
errors: utils.GetOrCreateCounter(`vmalert_recording_rules_errors_total{alertname="job:foo"}`),
},
}
fq := &datasource.FakeQuerier{}
expErr := "connection reset by peer"

View File

@@ -43,26 +43,26 @@ type ruleState struct {
// StateEntry stores rule's execution states
type StateEntry struct {
// stores last moment of time rule.Exec was called
Time time.Time
Time time.Time `json:"time"`
// stores the timesteamp with which rule.Exec was called
At time.Time
At time.Time `json:"at"`
// stores the duration of the last rule.Exec call
Duration time.Duration
Duration time.Duration `json:"duration"`
// stores last error that happened in Exec func
// resets on every successful Exec
// may be used as Health ruleState
Err error
Err error `json:"error"`
// stores the number of samples returned during
// the last evaluation
Samples int
Samples int `json:"samples"`
// stores the number of time series fetched during
// the last evaluation.
// Is supported by VictoriaMetrics only, starting from v1.90.0
// If seriesFetched == nil, then this attribute was missing in
// datasource response (unsupported).
SeriesFetched *int
SeriesFetched *int `json:"series_fetched"`
// stores the curl command reflecting the HTTP request used during rule.Exec
Curl string
Curl string `json:"curl"`
}
// GetLastEntry returns latest stateEntry of rule
@@ -166,7 +166,7 @@ func replayRule(r Rule, start, end time.Time, rw remotewrite.RWClient, replayRul
var n int
for _, ts := range tss {
if err := rw.Push(ts); err != nil {
return n, fmt.Errorf("remote write failure: %s", err)
return n, fmt.Errorf("remote write failure: %w", err)
}
n += len(ts.Samples)
}

Binary file not shown.

Before

Width:  |  Height:  |  Size: 73 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 151 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 122 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 80 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 62 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 109 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 41 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 41 KiB

View File

@@ -132,6 +132,24 @@ func (rh *requestHandler) handler(w http.ResponseWriter, r *http.Request) bool {
w.Header().Set("Content-Type", "application/json")
w.Write(data)
return true
case "/vmalert/api/v1/rule", "/api/v1/rule":
rule, err := rh.getRule(r)
if err != nil {
httpserver.Errorf(w, r, "%s", err)
return true
}
rwu := apiRuleWithUpdates{
apiRule: rule,
StateUpdates: rule.Updates,
}
data, err := json.Marshal(rwu)
if err != nil {
httpserver.Errorf(w, r, "failed to marshal rule: %s", err)
return true
}
w.Header().Set("Content-Type", "application/json")
w.Write(data)
return true
case "/-/reload":
logger.Infof("api config reload was called, sending sighup")
procutil.SelfSIGHUP()
@@ -147,11 +165,11 @@ func (rh *requestHandler) handler(w http.ResponseWriter, r *http.Request) bool {
func (rh *requestHandler) getRule(r *http.Request) (apiRule, error) {
groupID, err := strconv.ParseUint(r.FormValue(paramGroupID), 10, 64)
if err != nil {
return apiRule{}, fmt.Errorf("failed to read %q param: %s", paramGroupID, err)
return apiRule{}, fmt.Errorf("failed to read %q param: %w", paramGroupID, err)
}
ruleID, err := strconv.ParseUint(r.FormValue(paramRuleID), 10, 64)
if err != nil {
return apiRule{}, fmt.Errorf("failed to read %q param: %s", paramRuleID, err)
return apiRule{}, fmt.Errorf("failed to read %q param: %w", paramRuleID, err)
}
obj, err := rh.m.ruleAPI(groupID, ruleID)
if err != nil {
@@ -163,11 +181,11 @@ func (rh *requestHandler) getRule(r *http.Request) (apiRule, error) {
func (rh *requestHandler) getAlert(r *http.Request) (*apiAlert, error) {
groupID, err := strconv.ParseUint(r.FormValue(paramGroupID), 10, 64)
if err != nil {
return nil, fmt.Errorf("failed to read %q param: %s", paramGroupID, err)
return nil, fmt.Errorf("failed to read %q param: %w", paramGroupID, err)
}
alertID, err := strconv.ParseUint(r.FormValue(paramAlertID), 10, 64)
if err != nil {
return nil, fmt.Errorf("failed to read %q param: %s", paramAlertID, err)
return nil, fmt.Errorf("failed to read %q param: %w", paramAlertID, err)
}
a, err := rh.m.alertAPI(groupID, alertID)
if err != nil {

View File

@@ -143,6 +143,28 @@ func TestHandler(t *testing.T) {
t.Errorf("expected 1 group got %d", length)
}
})
t.Run("/api/v1/rule?ruleID&groupID", func(t *testing.T) {
expRule := ruleToAPI(ar)
gotRule := apiRule{}
getResp(ts.URL+"/"+expRule.APILink(), &gotRule, 200)
if expRule.ID != gotRule.ID {
t.Errorf("expected to get Rule %q; got %q instead", expRule.ID, gotRule.ID)
}
gotRule = apiRule{}
getResp(ts.URL+"/vmalert/"+expRule.APILink(), &gotRule, 200)
if expRule.ID != gotRule.ID {
t.Errorf("expected to get Rule %q; got %q instead", expRule.ID, gotRule.ID)
}
gotRuleWithUpdates := apiRuleWithUpdates{}
getResp(ts.URL+"/"+expRule.APILink(), &gotRuleWithUpdates, 200)
if gotRuleWithUpdates.StateUpdates == nil || len(gotRuleWithUpdates.StateUpdates) < 1 {
t.Fatalf("expected %+v to have state updates field not empty", gotRuleWithUpdates.StateUpdates)
}
})
}
func TestEmptyResponse(t *testing.T) {

View File

@@ -94,6 +94,10 @@ type apiGroup struct {
NotifierHeaders []string `json:"notifier_headers,omitempty"`
// Labels is a set of label value pairs, that will be added to every rule.
Labels map[string]string `json:"labels,omitempty"`
// EvalOffset Group will be evaluated at the exact time offset on the range of [0...evaluationInterval]
EvalOffset float64 `json:"eval_offset,omitempty"`
// EvalDelay will adjust the `time` parameter of rule evaluation requests to compensate intentional query delay from datasource.
EvalDelay float64 `json:"eval_delay,omitempty"`
}
// groupAlerts represents a group of alerts for WEB view
@@ -147,6 +151,10 @@ type apiRule struct {
ID string `json:"id"`
// GroupID is an unique Group's ID
GroupID string `json:"group_id"`
// GroupName is Group name rule belong to
GroupName string `json:"group_name"`
// File is file name where rule is defined
File string `json:"file"`
// Debug shows whether debug mode is enabled
Debug bool `json:"debug"`
@@ -156,6 +164,19 @@ type apiRule struct {
Updates []rule.StateEntry `json:"-"`
}
// apiRuleWithUpdates represents apiRule but with extra fields for marshalling
type apiRuleWithUpdates struct {
apiRule
// Updates contains the ordered list of recorded ruleStateEntry objects
StateUpdates []rule.StateEntry `json:"updates,omitempty"`
}
// APILink returns a link to the rule's JSON representation.
func (ar apiRule) APILink() string {
return fmt.Sprintf("api/v1/rule?%s=%s&%s=%s",
paramGroupID, ar.GroupID, paramRuleID, ar.ID)
}
// WebLink returns a link to the alert which can be used in UI.
func (ar apiRule) WebLink() string {
return fmt.Sprintf("rule?%s=%s&%s=%s",
@@ -223,8 +244,10 @@ func alertingToAPI(ar *rule.AlertingRule) apiRule {
Debug: ar.Debug,
// encode as strings to avoid rounding in JSON
ID: fmt.Sprintf("%d", ar.ID()),
GroupID: fmt.Sprintf("%d", ar.GroupID),
ID: fmt.Sprintf("%d", ar.ID()),
GroupID: fmt.Sprintf("%d", ar.GroupID),
GroupName: ar.GroupName,
File: ar.File,
}
if lastState.Err != nil {
r.LastError = lastState.Err.Error()
@@ -309,6 +332,12 @@ func groupToAPI(g *rule.Group) apiGroup {
Labels: g.Labels,
}
if g.EvalOffset != nil {
ag.EvalOffset = g.EvalOffset.Seconds()
}
if g.EvalDelay != nil {
ag.EvalDelay = g.EvalDelay.Seconds()
}
ag.Rules = make([]apiRule, 0)
for _, r := range g.Rules {
ag.Rules = append(ag.Rules, ruleToAPI(r))

View File

@@ -1,500 +1,3 @@
# vmauth
See vmauth docs [here](https://docs.victoriametrics.com/vmauth.html).
`vmauth` is a simple auth proxy, router and [load balancer](#load-balancing) for [VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics).
It reads auth credentials from `Authorization` http header ([Basic Auth](https://en.wikipedia.org/wiki/Basic_access_authentication), `Bearer token` and [InfluxDB authorization](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1897) is supported),
matches them against configs pointed by [-auth.config](#auth-config) command-line flag and proxies incoming HTTP requests to the configured per-user `url_prefix` on successful match.
The `-auth.config` can point to either local file or to http url.
## Quick start
Just download `vmutils-*` archive from [releases page](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/latest), unpack it
and pass the following flag to `vmauth` binary in order to start authorizing and routing requests:
```console
/path/to/vmauth -auth.config=/path/to/auth/config.yml
```
After that `vmauth` starts accepting HTTP requests on port `8427` and routing them according to the provided [-auth.config](#auth-config).
The port can be modified via `-httpListenAddr` command-line flag.
The auth config can be reloaded via the following ways:
- By passing `SIGHUP` signal to `vmauth`.
- By querying `/-/reload` http endpoint. This endpoint can be protected with `-reloadAuthKey` command-line flag. See [security docs](#security) for more details.
- By specifying `-configCheckInterval` command-line flag to the interval between config re-reads. For example, `-configCheckInterval=5s` will re-read the config
and apply new changes every 5 seconds.
Docker images for `vmauth` are available [here](https://hub.docker.com/r/victoriametrics/vmauth/tags).
See how `vmauth` used in [docker-compose env](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/README.md#victoriametrics-cluster).
Pass `-help` to `vmauth` in order to see all the supported command-line flags with their descriptions.
Feel free [contacting us](mailto:info@victoriametrics.com) if you need customized auth proxy for VictoriaMetrics with the support of LDAP, SSO, RBAC, SAML,
accounting and rate limiting such as [vmgateway](https://docs.victoriametrics.com/vmgateway.html).
## Load balancing
Each `url_prefix` in the [-auth.config](#auth-config) may contain either a single url or a list of urls.
In the latter case `vmauth` balances load among the configured urls in least-loaded round-robin manner.
If the backend at the configured url isn't available, then `vmauth` tries sending the request to the remaining configured urls.
It is possible to configure automatic retry of requests if the backend responds with status code from optional `retry_status_codes` list.
Load balancing feature can be used in the following cases:
- Balancing the load among multiple `vmselect` and/or `vminsert` nodes in [VictoriaMetrics cluster](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html).
The following `-auth.config` file can be used for spreading incoming requests among 3 vmselect nodes and re-trying failed requests
or requests with 500 and 502 response status codes:
```yml
unauthorized_user:
url_prefix:
- http://vmselect1:8481/
- http://vmselect2:8481/
- http://vmselect3:8481/
retry_status_codes: [500, 502]
```
- Spreading select queries among multiple availability zones (AZs) with identical data. For example, the following config spreads select queries
among 3 AZs. Requests are re-tried if some AZs are temporarily unavailable or if some `vmstorage` nodes in some AZs are temporarily unavailable.
`vmauth` adds `deny_partial_response=1` query arg to all the queries in order to guarantee to get full response from every AZ.
See [these docs](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#cluster-availability) for details.
```yml
unauthorized_user:
url_prefix:
- https://vmselect-az1/?deny_partial_response=1
- https://vmselect-az2/?deny_partial_response=1
- https://vmselect-az3/?deny_partial_response=1
retry_status_codes: [500, 502, 503]
```
Load balancig can also be configured independently per each user and per each `url_map` entry.
See [auth config docs](#auth-config) for more details.
## Concurrency limiting
`vmauth` limits the number of concurrent requests it can proxy according to the following command-line flags:
- `-maxConcurrentRequests` limits the global number of concurrent requests `vmauth` can serve across all the configured users.
- `-maxConcurrentPerUserRequests` limits the number of concurrent requests `vmauth` can serve per each configured user.
It is also possible to set individual limits on the number of concurrent requests per each user
with the `max_concurrent_requests` option - see [auth config example](#auth-config).
`vmauth` responds with `429 Too Many Requests` HTTP error when the number of concurrent requests exceeds the configured limits.
The following [metrics](#monitoring) related to concurrency limits are exposed by `vmauth`:
- `vmauth_concurrent_requests_capacity` - the global limit on the number of concurrent requests `vmauth` can serve.
It is set via `-maxConcurrentRequests` command-line flag.
- `vmauth_concurrent_requests_current` - the current number of concurrent requests `vmauth` processes.
- `vmauth_concurrent_requests_limit_reached_total` - the number of requests rejected with `429 Too Many Requests` error
because of the global concurrency limit has been reached.
- `vmauth_user_concurrent_requests_capacity{username="..."}` - the limit on the number of concurrent requests for the given `username`.
- `vmauth_user_concurrent_requests_current{username="..."}` - the current number of concurrent requests for the given `username`.
- `vmauth_user_concurrent_requests_limit_reached_total{username="foo"}` - the number of requests rejected with `429 Too Many Requests` error
because of the concurrency limit has been reached for the given `username`.
- `vmauth_unauthorized_user_concurrent_requests_capacity` - the limit on the number of concurrent requests for unauthorized users (if `unauthorized_user` section is used).
- `vmauth_unauthorized_user_concurrent_requests_current` - the current number of concurrent requests for unauthorized users (if `unauthorized_user` section is used).
- `vmauth_unauthorized_user_concurrent_requests_limit_reached_total` - the number of requests rejected with `429 Too Many Requests` error
because of the concurrency limit has been reached for unauthorized users (if `unauthorized_user` section is used).
## IP filters
[Enterprise version](https://docs.victoriametrics.com/enterprise.html) of `vmauth` can be configured to allow / deny incoming requests via global and per-user IP filters.
For example, the following config allows requests to `vmauth` from `10.0.0.0/24` network and from `1.2.3.4` IP address, while denying requests from `10.0.0.42` IP address:
```yml
users:
# User configs here
ip_filters:
allow_list:
- 10.0.0.0/24
- 1.2.3.4
deny_list: [10.0.0.42]
```
The following config allows requests for the user 'foobar' only from the IP `127.0.0.1`:
```yml
users:
- username: "foobar"
password: "***"
url_prefix: "http://localhost:8428"
ip_filters:
allow_list: [127.0.0.1]
```
See config example of using IP filters [here](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/app/vmauth/example_config_ent.yml).
## Auth config
`-auth.config` is represented in the following simple `yml` format:
```yml
# Arbitrary number of usernames may be put here.
# It is possible to set multiple identical usernames with different passwords.
# Such usernames can be differentiated by `name` option.
users:
# Requests with the 'Authorization: Bearer XXXX' and 'Authorization: Token XXXX'
# header are proxied to http://localhost:8428 .
# For example, http://vmauth:8427/api/v1/query is proxied to http://localhost:8428/api/v1/query
# Requests with the Basic Auth username=XXXX are proxied to http://localhost:8428 as well.
- bearer_token: "XXXX"
url_prefix: "http://localhost:8428"
# Requests with the 'Authorization: Bearer YYY' header are proxied to http://localhost:8428 ,
# The `X-Scope-OrgID: foobar` http header is added to every proxied request.
# The `X-Server-Hostname` http header is removed from the proxied response.
# For example, http://vmauth:8427/api/v1/query is proxied to http://localhost:8428/api/v1/query
- bearer_token: "YYY"
url_prefix: "http://localhost:8428"
# extra headers to add to the request or remove from the request (if header value is empty)
headers:
- "X-Scope-OrgID: foobar"
# extra headers to add to the response or remove from the response (if header value is empty)
response_headers:
- "X-Server-Hostname:" # empty value means the header will be removed from the response
# All the requests to http://vmauth:8427 with the given Basic Auth (username:password)
# are proxied to http://localhost:8428 .
# For example, http://vmauth:8427/api/v1/query is proxied to http://localhost:8428/api/v1/query
#
# The given user can send maximum 10 concurrent requests according to the provided max_concurrent_requests.
# Excess concurrent requests are rejected with 429 HTTP status code.
# See also -maxConcurrentPerUserRequests and -maxConcurrentRequests command-line flags.
- username: "local-single-node"
password: "***"
url_prefix: "http://localhost:8428"
max_concurrent_requests: 10
# All the requests to http://vmauth:8427 with the given Basic Auth (username:password)
# are proxied to http://localhost:8428 with extra_label=team=dev query arg.
# For example, http://vmauth:8427/api/v1/query is routed to http://localhost:8428/api/v1/query?extra_label=team=dev
- username: "local-single-node2"
password: "***"
url_prefix: "http://localhost:8428?extra_label=team=dev"
# All the requests to http://vmauth:8427 with the given Basic Auth (username:password)
# are load-balanced among http://vmselect1:8481/select/123/prometheus and http://vmselect2:8481/select/123/prometheus
# For example, http://vmauth:8427/api/v1/query is proxied to the following urls in a round-robin manner:
# - http://vmselect1:8481/select/123/prometheus/api/v1/select
# - http://vmselect2:8481/select/123/prometheus/api/v1/select
- username: "cluster-select-account-123"
password: "***"
url_prefix:
- "http://vmselect1:8481/select/123/prometheus"
- "http://vmselect2:8481/select/123/prometheus"
# All the requests to http://vmauth:8427 with the given Basic Auth (username:password)
# are load-balanced between http://vminsert1:8480/insert/42/prometheus and http://vminsert2:8480/insert/42/prometheus
# For example, http://vmauth:8427/api/v1/write is proxied to the following urls in a round-robin manner:
# - http://vminsert1:8480/insert/42/prometheus/api/v1/write
# - http://vminsert2:8480/insert/42/prometheus/api/v1/write
- username: "cluster-insert-account-42"
password: "***"
url_prefix:
- "http://vminsert1:8480/insert/42/prometheus"
- "http://vminsert2:8480/insert/42/prometheus"
# A single user for querying and inserting data:
#
# - Requests to http://vmauth:8427/api/v1/query, http://vmauth:8427/api/v1/query_range
# and http://vmauth:8427/api/v1/label/<label_name>/values are proxied to the following urls in a round-robin manner:
# - http://vmselect1:8481/select/42/prometheus
# - http://vmselect2:8481/select/42/prometheus
# For example, http://vmauth:8427/api/v1/query is proxied to http://vmselect1:8480/select/42/prometheus/api/v1/query
# or to http://vmselect2:8480/select/42/prometheus/api/v1/query .
# Requests are re-tried at other url_prefix backends if response status codes match 500 or 502.
#
# - Requests to http://vmauth:8427/api/v1/write are proxied to http://vminsert:8480/insert/42/prometheus/api/v1/write .
# The "X-Scope-OrgID: abc" http header is added to these requests.
# The "X-Server-Hostname" http header is removed from the proxied response.
#
# Request which do not match `src_paths` from the `url_map` are proxied to the urls from `default_url`
# in a round-robin manner. The original request path is passed in `request_path` query arg.
# For example, request to http://vmauth:8427/non/existing/path are proxied:
# - to http://default1:8888/unsupported_url_handler?request_path=/non/existing/path
# - or http://default2:8888/unsupported_url_handler?request_path=/non/existing/path
- username: "foobar"
url_map:
- src_paths:
- "/api/v1/query"
- "/api/v1/query_range"
- "/api/v1/label/[^/]+/values"
url_prefix:
- "http://vmselect1:8481/select/42/prometheus"
- "http://vmselect2:8481/select/42/prometheus"
retry_status_codes: [500, 502]
- src_paths: ["/api/v1/write"]
url_prefix: "http://vminsert:8480/insert/42/prometheus"
headers:
- "X-Scope-OrgID: abc"
response_headers:
- "X-Server-Hostname:" # empty value means the header will be removed from the response
ip_filters:
deny_list: [127.0.0.1]
default_url:
- "http://default1:8888/unsupported_url_handler"
- "http://default2:8888/unsupported_url_handler"
# Requests without Authorization header are routed according to `unauthorized_user` section.
# Requests are routed in round-robin fashion between `url_prefix` backends.
# The deny_partial_response query arg is added to all the routed requests.
# The requests are re-tried if url_prefix backends send 500 or 503 response status codes.
unauthorized_user:
url_prefix:
- http://vmselect-az1/?deny_partial_response=1
- http://vmselect-az2/?deny_partial_response=1
retry_status_codes: [503, 500]
ip_filters:
allow_list: ["1.2.3.0/24", "127.0.0.1"]
deny_list:
- 10.1.0.1
```
The config may contain `%{ENV_VAR}` placeholders, which are substituted by the corresponding `ENV_VAR` environment variable values.
This may be useful for passing secrets to the config.
Please note, vmauth doesn't follow redirects. If destination redirects request to a new location, make sure this
location is supported in vmauth `url_map` config.
## Security
It is expected that all the backend services protected by `vmauth` are located in an isolated private network, so they can be accessed by external users only via `vmauth`.
Do not transfer Basic Auth headers in plaintext over untrusted networks. Enable https. This can be done by passing the following `-tls*` command-line flags to `vmauth`:
```console
-tls
Whether to enable TLS (aka HTTPS) for incoming requests. -tlsCertFile and -tlsKeyFile must be set if -tls is set
-tlsCertFile string
Path to file with TLS certificate. Used only if -tls is set. Prefer ECDSA certs instead of RSA certs, since RSA certs are slow
-tlsKeyFile string
Path to file with TLS key. Used only if -tls is set
```
Alternatively, [https termination proxy](https://en.wikipedia.org/wiki/TLS_termination_proxy) may be put in front of `vmauth`.
It is recommended protecting the following endpoints with authKeys:
* `/-/reload` with `-reloadAuthKey` command-line flag, so external users couldn't trigger config reload.
* `/flags` with `-flagsAuthKey` command-line flag, so unauthorized users couldn't get application command-line flags.
* `/metrics` with `-metricsAuthKey` command-line flag, so unauthorized users couldn't get access to [vmauth metrics](#monitoring).
* `/debug/pprof` with `-pprofAuthKey` command-line flag, so unauthorized users couldn't get access to [profiling information](#profiling).
`vmauth` also supports the ability to restrict access by IP - see [these docs](#ip-filters). See also [concurrency limiting docs](#concurrency-limiting).
## Monitoring
`vmauth` exports various metrics in Prometheus exposition format at `http://vmauth-host:8427/metrics` page. It is recommended setting up regular scraping of this page
either via [vmagent](https://docs.victoriametrics.com/vmagent.html) or via Prometheus, so the exported metrics could be analyzed later.
`vmauth` exports `vmauth_user_requests_total` [counter](https://docs.victoriametrics.com/keyConcepts.html#counter) metric
and `vmauth_user_request_duration_seconds_*` [summary](https://docs.victoriametrics.com/keyConcepts.html#summary) metric
with `username` label. The `username` label value equals to `username` field value set in the `-auth.config` file.
It is possible to override or hide the value in the label by specifying `name` field.
For example, the following config will result in `vmauth_user_requests_total{username="foobar"}`
instead of `vmauth_user_requests_total{username="secret_user"}`:
```yml
users:
- username: "secret_user"
name: "foobar"
# other config options here
```
For unauthorized users `vmauth` exports `vmauth_unauthorized_user_requests_total`
[counter](https://docs.victoriametrics.com/keyConcepts.html#counter) metric and
`vmauth_unauthorized_user_request_duration_seconds_*` [summary](https://docs.victoriametrics.com/keyConcepts.html#summary)
metric without label (if `unauthorized_user` section of config is used).
## How to build from sources
It is recommended using [binary releases](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/latest) - `vmauth` is located in `vmutils-*` archives there.
### Development build
1. [Install Go](https://golang.org/doc/install). The minimum supported version is Go 1.20.
1. Run `make vmauth` from the root folder of [the repository](https://github.com/VictoriaMetrics/VictoriaMetrics).
It builds `vmauth` binary and puts it into the `bin` folder.
### Production build
1. [Install docker](https://docs.docker.com/install/).
1. Run `make vmauth-prod` from the root folder of [the repository](https://github.com/VictoriaMetrics/VictoriaMetrics).
It builds `vmauth-prod` binary and puts it into the `bin` folder.
### Building docker images
Run `make package-vmauth`. It builds `victoriametrics/vmauth:<PKG_TAG>` docker image locally.
`<PKG_TAG>` is auto-generated image tag, which depends on source code in the repository.
The `<PKG_TAG>` may be manually set via `PKG_TAG=foobar make package-vmauth`.
The base docker image is [alpine](https://hub.docker.com/_/alpine) but it is possible to use any other base image
by setting it via `<ROOT_IMAGE>` environment variable. For example, the following command builds the image on top of [scratch](https://hub.docker.com/_/scratch) image:
```console
ROOT_IMAGE=scratch make package-vmauth
```
## Profiling
`vmauth` provides handlers for collecting the following [Go profiles](https://blog.golang.org/profiling-go-programs):
* Memory profile. It can be collected with the following command (replace `0.0.0.0` with hostname if needed):
<div class="with-copy" markdown="1">
```console
curl http://0.0.0.0:8427/debug/pprof/heap > mem.pprof
```
</div>
* CPU profile. It can be collected with the following command (replace `0.0.0.0` with hostname if needed):
<div class="with-copy" markdown="1">
```console
curl http://0.0.0.0:8427/debug/pprof/profile > cpu.pprof
```
</div>
The command for collecting CPU profile waits for 30 seconds before returning.
The collected profiles may be analyzed with [go tool pprof](https://github.com/google/pprof).
It is safe sharing the collected profiles from security point of view, since they do not contain sensitive information.
## Advanced usage
Pass `-help` command-line arg to `vmauth` in order to see all the configuration options:
```console
./vmauth -help
vmauth authenticates and authorizes incoming requests and proxies them to VictoriaMetrics.
See the docs at https://docs.victoriametrics.com/vmauth.html .
-auth.config string
Path to auth config. It can point either to local file or to http url. See https://docs.victoriametrics.com/vmauth.html for details on the format of this auth config
-configCheckInterval duration
interval for config file re-read. Zero value disables config re-reading. By default, refreshing is disabled, send SIGHUP for config refresh.
-enableTCP6
Whether to enable IPv6 for listening and dialing. By default, only IPv4 TCP and UDP are used
-envflag.enable
Whether to enable reading flags from environment variables in addition to the command line. Command line flag values have priority over values from environment vars. Flags are read only from the command line if this flag isn't set. See https://docs.victoriametrics.com/#environment-variables for more details
-envflag.prefix string
Prefix for environment variables if -envflag.enable is set
-eula
Deprecated, please use -license or -licenseFile flags instead. By specifying this flag, you confirm that you have an enterprise license and accept the ESA https://victoriametrics.com/legal/esa/ . This flag is available only in Enterprise binaries. See https://docs.victoriametrics.com/enterprise.html
-failTimeout duration
Sets a delay period for load balancing to skip a malfunctioning backend (default 3s)
-filestream.disableFadvise
Whether to disable fadvise() syscall when reading large data files. The fadvise() syscall prevents from eviction of recently accessed data from OS page cache during background merges and backups. In some rare cases it is better to disable the syscall if it uses too much CPU
-flagsAuthKey string
Auth key for /flags endpoint. It must be passed via authKey query arg. It overrides httpAuth.* settings
-fs.disableMmap
Whether to use pread() instead of mmap() for reading data files. By default, mmap() is used for 64-bit arches and pread() is used for 32-bit arches, since they cannot read data files bigger than 2^32 bytes in memory. mmap() is usually faster for reading small data chunks than pread()
-http.connTimeout duration
Incoming http connections are closed after the configured timeout. This may help to spread the incoming load among a cluster of services behind a load balancer. Please note that the real timeout may be bigger by up to 10% as a protection against the thundering herd problem (default 2m0s)
-http.disableResponseCompression
Disable compression of HTTP responses to save CPU resources. By default, compression is enabled to save network bandwidth
-http.idleConnTimeout duration
Timeout for incoming idle http connections (default 1m0s)
-http.maxGracefulShutdownDuration duration
The maximum duration for a graceful shutdown of the HTTP server. A highly loaded server may require increased value for a graceful shutdown (default 7s)
-http.pathPrefix string
An optional prefix to add to all the paths handled by http server. For example, if '-http.pathPrefix=/foo/bar' is set, then all the http requests will be handled on '/foo/bar/*' paths. This may be useful for proxied requests. See https://www.robustperception.io/using-external-urls-and-proxies-with-prometheus
-http.shutdownDelay duration
Optional delay before http server shutdown. During this delay, the server returns non-OK responses from /health page, so load balancers can route new requests to other servers
-httpAuth.password string
Password for HTTP server's Basic Auth. The authentication is disabled if -httpAuth.username is empty
-httpAuth.username string
Username for HTTP server's Basic Auth. The authentication is disabled if empty. See also -httpAuth.password
-httpListenAddr string
TCP address to listen for http connections. See also -httpListenAddr.useProxyProtocol (default ":8427")
-httpListenAddr.useProxyProtocol
Whether to use proxy protocol for connections accepted at -httpListenAddr . See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt . With enabled proxy protocol http server cannot serve regular /metrics endpoint. Use -pushmetrics.url for metrics pushing
-internStringCacheExpireDuration duration
The expiry duration for caches for interned strings. See https://en.wikipedia.org/wiki/String_interning . See also -internStringMaxLen and -internStringDisableCache (default 6m0s)
-internStringDisableCache
Whether to disable caches for interned strings. This may reduce memory usage at the cost of higher CPU usage. See https://en.wikipedia.org/wiki/String_interning . See also -internStringCacheExpireDuration and -internStringMaxLen
-internStringMaxLen int
The maximum length for strings to intern. A lower limit may save memory at the cost of higher CPU usage. See https://en.wikipedia.org/wiki/String_interning . See also -internStringDisableCache and -internStringCacheExpireDuration (default 500)
-license string
Lisense key for VictoriaMetrics Enterprise. See https://victoriametrics.com/products/enterprise/ . Trial Enterprise license can be obtained from https://victoriametrics.com/products/enterprise/trial/ . This flag is available only in Enterprise binaries. The license key can be also passed via file specified by -licenseFile command-line flag
-license.forceOffline
Whether to enable offline verification for VictoriaMetrics Enterprise license key, which has been passed either via -license or via -licenseFile command-line flag. The issued license key must support offline verification feature. Contact info@victoriametrics.com if you need offline license verification. This flag is avilable only in Enterprise binaries
-licenseFile string
Path to file with license key for VictoriaMetrics Enterprise. See https://victoriametrics.com/products/enterprise/ . Trial Enterprise license can be obtained from https://victoriametrics.com/products/enterprise/trial/ . This flag is available only in Enterprise binaries. The license key can be also passed inline via -license command-line flag
-logInvalidAuthTokens
Whether to log requests with invalid auth tokens. Such requests are always counted at vmauth_http_request_errors_total{reason="invalid_auth_token"} metric, which is exposed at /metrics page
-loggerDisableTimestamps
Whether to disable writing timestamps in logs
-loggerErrorsPerSecondLimit int
Per-second limit on the number of ERROR messages. If more than the given number of errors are emitted per second, the remaining errors are suppressed. Zero values disable the rate limit
-loggerFormat string
Format for logs. Possible values: default, json (default "default")
-loggerJSONFields string
Allows renaming fields in JSON formatted logs. Example: "ts:timestamp,msg:message" renames "ts" to "timestamp" and "msg" to "message". Supported fields: ts, level, caller, msg
-loggerLevel string
Minimum level of errors to log. Possible values: INFO, WARN, ERROR, FATAL, PANIC (default "INFO")
-loggerOutput string
Output for the logs. Supported values: stderr, stdout (default "stderr")
-loggerTimezone string
Timezone to use for timestamps in logs. Timezone must be a valid IANA Time Zone. For example: America/New_York, Europe/Berlin, Etc/GMT+3 or Local (default "UTC")
-loggerWarnsPerSecondLimit int
Per-second limit on the number of WARN messages. If more than the given number of warns are emitted per second, then the remaining warns are suppressed. Zero values disable the rate limit
-maxConcurrentPerUserRequests int
The maximum number of concurrent requests vmauth can process per each configured user. Other requests are rejected with '429 Too Many Requests' http status code. See also -maxConcurrentRequests command-line option and max_concurrent_requests option in per-user config (default 300)
-maxConcurrentRequests int
The maximum number of concurrent requests vmauth can process. Other requests are rejected with '429 Too Many Requests' http status code. See also -maxConcurrentPerUserRequests and -maxIdleConnsPerBackend command-line options (default 1000)
-maxIdleConnsPerBackend int
The maximum number of idle connections vmauth can open per each backend host. See also -maxConcurrentRequests (default 100)
-maxRequestBodySizeToRetry size
The maximum request body size, which can be cached and re-tried at other backends. Bigger values may require more memory
Supports the following optional suffixes for size values: KB, MB, GB, TB, KiB, MiB, GiB, TiB (default 16384)
-memory.allowedBytes size
Allowed size of system memory VictoriaMetrics caches may occupy. This option overrides -memory.allowedPercent if set to a non-zero value. Too low a value may increase the cache miss rate usually resulting in higher CPU and disk IO usage. Too high a value may evict too much data from the OS page cache resulting in higher disk IO usage
Supports the following optional suffixes for size values: KB, MB, GB, TB, KiB, MiB, GiB, TiB (default 0)
-memory.allowedPercent float
Allowed percent of system memory VictoriaMetrics caches may occupy. See also -memory.allowedBytes. Too low a value may increase cache miss rate usually resulting in higher CPU and disk IO usage. Too high a value may evict too much data from the OS page cache which will result in higher disk IO usage (default 60)
-metricsAuthKey string
Auth key for /metrics endpoint. It must be passed via authKey query arg. It overrides httpAuth.* settings
-pprofAuthKey string
Auth key for /debug/pprof/* endpoints. It must be passed via authKey query arg. It overrides httpAuth.* settings
-pushmetrics.extraLabel array
Optional labels to add to metrics pushed to -pushmetrics.url . For example, -pushmetrics.extraLabel='instance="foo"' adds instance="foo" label to all the metrics pushed to -pushmetrics.url
Supports an array of values separated by comma or specified via multiple flags.
-pushmetrics.interval duration
Interval for pushing metrics to -pushmetrics.url (default 10s)
-pushmetrics.url array
Optional URL to push metrics exposed at /metrics page. See https://docs.victoriametrics.com/#push-metrics . By default, metrics exposed at /metrics page aren't pushed to any remote storage
Supports an array of values separated by comma or specified via multiple flags.
-reloadAuthKey string
Auth key for /-/reload http endpoint. It must be passed as authKey=...
-responseTimeout duration
The timeout for receiving a response from backend (default 5m0s)
-tls
Whether to enable TLS for incoming HTTP requests at -httpListenAddr (aka https). -tlsCertFile and -tlsKeyFile must be set if -tls is set
-tlsCertFile string
Path to file with TLS certificate if -tls is set. Prefer ECDSA certs instead of RSA certs as RSA certs are slower. The provided certificate file is automatically re-read every second, so it can be dynamically updated
-tlsCipherSuites array
Optional list of TLS cipher suites for incoming requests over HTTPS if -tls is set. See the list of supported cipher suites at https://pkg.go.dev/crypto/tls#pkg-constants
Supports an array of values separated by comma or specified via multiple flags.
-tlsKeyFile string
Path to file with TLS key if -tls is set. The provided key file is automatically re-read every second, so it can be dynamically updated
-tlsMinVersion string
Optional minimum TLS version to use for incoming requests over HTTPS if -tls is set. Supported values: TLS10, TLS11, TLS12, TLS13
-version
Show VictoriaMetrics version
```
vmauth docs can be edited at [docs/vmauth.md](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/docs/vmauth.md).

View File

@@ -5,6 +5,7 @@ import (
"encoding/base64"
"flag"
"fmt"
"net/http"
"net/url"
"os"
"regexp"
@@ -14,13 +15,15 @@ import (
"sync/atomic"
"time"
"github.com/VictoriaMetrics/metrics"
"gopkg.in/yaml.v2"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/envtemplate"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/fasttime"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/flagutil"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/fs"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/logger"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/procutil"
"github.com/VictoriaMetrics/metrics"
"gopkg.in/yaml.v2"
)
var (
@@ -28,6 +31,10 @@ var (
"See https://docs.victoriametrics.com/vmauth.html for details on the format of this auth config")
configCheckInterval = flag.Duration("configCheckInterval", 0, "interval for config file re-read. "+
"Zero value disables config re-reading. By default, refreshing is disabled, send SIGHUP for config refresh.")
defaultRetryStatusCodes = flagutil.NewArrayInt("retryStatusCodes", 0, "Comma-separated list of default HTTP response status codes when vmauth re-tries the request on other backends. "+
"See https://docs.victoriametrics.com/vmauth.html#load-balancing for details")
defaultLoadBalancingPolicy = flag.String("loadBalancingPolicy", "least_loaded", "The default load balancing policy to use for backend urls specified inside url_prefix section. "+
"Supported policies: least_loaded, first_available. See https://docs.victoriametrics.com/vmauth.html#load-balancing for more details")
)
// AuthConfig represents auth config.
@@ -38,20 +45,26 @@ type AuthConfig struct {
// UserInfo is user information read from authConfigPath
type UserInfo struct {
Name string `yaml:"name,omitempty"`
BearerToken string `yaml:"bearer_token,omitempty"`
Username string `yaml:"username,omitempty"`
Password string `yaml:"password,omitempty"`
URLPrefix *URLPrefix `yaml:"url_prefix,omitempty"`
URLMaps []URLMap `yaml:"url_map,omitempty"`
HeadersConf HeadersConf `yaml:",inline"`
MaxConcurrentRequests int `yaml:"max_concurrent_requests,omitempty"`
DefaultURL *URLPrefix `yaml:"default_url,omitempty"`
RetryStatusCodes []int `yaml:"retry_status_codes,omitempty"`
Name string `yaml:"name,omitempty"`
BearerToken string `yaml:"bearer_token,omitempty"`
Username string `yaml:"username,omitempty"`
Password string `yaml:"password,omitempty"`
URLPrefix *URLPrefix `yaml:"url_prefix,omitempty"`
URLMaps []URLMap `yaml:"url_map,omitempty"`
HeadersConf HeadersConf `yaml:",inline"`
MaxConcurrentRequests int `yaml:"max_concurrent_requests,omitempty"`
DefaultURL *URLPrefix `yaml:"default_url,omitempty"`
RetryStatusCodes []int `yaml:"retry_status_codes,omitempty"`
LoadBalancingPolicy string `yaml:"load_balancing_policy,omitempty"`
DropSrcPathPrefixParts int `yaml:"drop_src_path_prefix_parts,omitempty"`
TLSInsecureSkipVerify *bool `yaml:"tls_insecure_skip_verify,omitempty"`
TLSCAFile string `yaml:"tls_ca_file,omitempty"`
concurrencyLimitCh chan struct{}
concurrencyLimitReached *metrics.Counter
httpTransport *http.Transport
requests *metrics.Counter
requestsDuration *metrics.Summary
}
@@ -113,10 +126,12 @@ func (h *Header) MarshalYAML() (interface{}, error) {
// URLMap is a mapping from source paths to target urls.
type URLMap struct {
SrcPaths []*SrcPath `yaml:"src_paths,omitempty"`
URLPrefix *URLPrefix `yaml:"url_prefix,omitempty"`
HeadersConf HeadersConf `yaml:",inline"`
RetryStatusCodes []int `yaml:"retry_status_codes,omitempty"`
SrcPaths []*SrcPath `yaml:"src_paths,omitempty"`
URLPrefix *URLPrefix `yaml:"url_prefix,omitempty"`
HeadersConf HeadersConf `yaml:",inline"`
RetryStatusCodes []int `yaml:"retry_status_codes,omitempty"`
LoadBalancingPolicy string `yaml:"load_balancing_policy,omitempty"`
DropSrcPathPrefixParts int `yaml:"drop_src_path_prefix_parts,omitempty"`
}
// SrcPath represents an src path
@@ -127,8 +142,28 @@ type SrcPath struct {
// URLPrefix represents passed `url_prefix`
type URLPrefix struct {
n uint32
n uint32
// the list of backend urls
bus []*backendURL
// requests are re-tried on other backend urls for these http response status codes
retryStatusCodes []int
// load balancing policy used
loadBalancingPolicy string
}
func (up *URLPrefix) setLoadBalancingPolicy(loadBalancingPolicy string) error {
switch loadBalancingPolicy {
case "", // empty string is equivalent to least_loaded
"least_loaded",
"first_available":
up.loadBalancingPolicy = loadBalancingPolicy
return nil
default:
return fmt.Errorf("unexpected load_balancing_policy: %q; want least_loaded or first_available", loadBalancingPolicy)
}
}
type backendURL struct {
@@ -147,6 +182,10 @@ func (bu *backendURL) setBroken() {
atomic.StoreUint64(&bu.brokenDeadline, deadline)
}
func (bu *backendURL) get() {
atomic.AddInt32(&bu.concurrentRequests, 1)
}
func (bu *backendURL) put() {
atomic.AddInt32(&bu.concurrentRequests, -1)
}
@@ -155,6 +194,40 @@ func (up *URLPrefix) getBackendsCount() int {
return len(up.bus)
}
// getBackendURL returns the backendURL depending on the load balance policy.
//
// backendURL.put() must be called on the returned backendURL after the request is complete.
func (up *URLPrefix) getBackendURL() *backendURL {
if up.loadBalancingPolicy == "first_available" {
return up.getFirstAvailableBackendURL()
}
return up.getLeastLoadedBackendURL()
}
// getFirstAvailableBackendURL returns the first available backendURL, which isn't broken.
//
// backendURL.put() must be called on the returned backendURL after the request is complete.
func (up *URLPrefix) getFirstAvailableBackendURL() *backendURL {
bus := up.bus
bu := bus[0]
if !bu.isBroken() {
// Fast path - send the request to the first url.
bu.get()
return bu
}
// Slow path - the first url is temporarily unavailabel. Fall back to the remaining urls.
for i := 1; i < len(bus); i++ {
if !bus[i].isBroken() {
bu = bus[i]
break
}
}
bu.get()
return bu
}
// getLeastLoadedBackendURL returns the backendURL with the minimum number of concurrent requests.
//
// backendURL.put() must be called on the returned backendURL after the request is complete.
@@ -163,7 +236,7 @@ func (up *URLPrefix) getLeastLoadedBackendURL() *backendURL {
if len(bus) == 1 {
// Fast path - return the only backend url.
bu := bus[0]
atomic.AddInt32(&bu.concurrentRequests, 1)
bu.get()
return bu
}
@@ -194,7 +267,7 @@ func (up *URLPrefix) getLeastLoadedBackendURL() *backendURL {
minRequests = n
}
}
atomic.AddInt32(&buMin.concurrentRequests, 1)
buMin.get()
return buMin
}
@@ -204,6 +277,7 @@ func (up *URLPrefix) UnmarshalYAML(f func(interface{}) error) error {
if err := f(&v); err != nil {
return err
}
var urls []string
switch x := v.(type) {
case string:
@@ -224,6 +298,7 @@ func (up *URLPrefix) UnmarshalYAML(f func(interface{}) error) error {
default:
return fmt.Errorf("unexpected type for `url_prefix`: %T; want string or []string", v)
}
bus := make([]*backendURL, len(urls))
for i, u := range urls {
pu, err := url.Parse(u)
@@ -420,6 +495,21 @@ func parseAuthConfig(data []byte) (*AuthConfig, error) {
}
ui := ac.UnauthorizedUser
if ui != nil {
if ui.Username != "" {
return nil, fmt.Errorf("field username can't be specified for unauthorized_user section")
}
if ui.Password != "" {
return nil, fmt.Errorf("field password can't be specified for unauthorized_user section")
}
if ui.BearerToken != "" {
return nil, fmt.Errorf("field bearer_token can't be specified for unauthorized_user section")
}
if ui.Name != "" {
return nil, fmt.Errorf("field name can't be specified for unauthorized_user section")
}
if err := ui.initURLs(); err != nil {
return nil, err
}
ui.requests = metrics.GetOrCreateCounter(`vmauth_unauthorized_user_requests_total`)
ui.requestsDuration = metrics.GetOrCreateSummary(`vmauth_unauthorized_user_request_duration_seconds`)
ui.concurrencyLimitCh = make(chan struct{}, ui.getMaxConcurrentRequests())
@@ -430,6 +520,11 @@ func parseAuthConfig(data []byte) (*AuthConfig, error) {
_ = metrics.GetOrCreateGauge(`vmauth_unauthorized_user_concurrent_requests_current`, func() float64 {
return float64(len(ui.concurrencyLimitCh))
})
tr, err := getTransport(ui.TLSInsecureSkipVerify, ui.TLSCAFile)
if err != nil {
return nil, fmt.Errorf("cannot initialize HTTP transport: %w", err)
}
ui.httpTransport = tr
}
return &ac, nil
}
@@ -455,30 +550,11 @@ func parseAuthConfigUsers(ac *AuthConfig) (map[string]*UserInfo, error) {
if byAuthToken[at2] != nil {
return nil, fmt.Errorf("duplicate auth token found for bearer_token=%q, username=%q: %q", ui.BearerToken, ui.Username, at2)
}
if ui.URLPrefix != nil {
if err := ui.URLPrefix.sanitize(); err != nil {
return nil, err
}
}
if ui.DefaultURL != nil {
if err := ui.DefaultURL.sanitize(); err != nil {
return nil, err
}
}
for _, e := range ui.URLMaps {
if len(e.SrcPaths) == 0 {
return nil, fmt.Errorf("missing `src_paths` in `url_map`")
}
if e.URLPrefix == nil {
return nil, fmt.Errorf("missing `url_prefix` in `url_map`")
}
if err := e.URLPrefix.sanitize(); err != nil {
return nil, err
}
}
if len(ui.URLMaps) == 0 && ui.URLPrefix == nil {
return nil, fmt.Errorf("missing `url_prefix`")
if err := ui.initURLs(); err != nil {
return nil, err
}
name := ui.name()
if ui.BearerToken != "" {
if ui.Password != "" {
@@ -500,12 +576,71 @@ func parseAuthConfigUsers(ac *AuthConfig) (map[string]*UserInfo, error) {
_ = metrics.GetOrCreateGauge(fmt.Sprintf(`vmauth_user_concurrent_requests_current{username=%q}`, name), func() float64 {
return float64(len(ui.concurrencyLimitCh))
})
tr, err := getTransport(ui.TLSInsecureSkipVerify, ui.TLSCAFile)
if err != nil {
return nil, fmt.Errorf("cannot initialize HTTP transport: %w", err)
}
ui.httpTransport = tr
byAuthToken[at1] = ui
byAuthToken[at2] = ui
}
return byAuthToken, nil
}
func (ui *UserInfo) initURLs() error {
retryStatusCodes := defaultRetryStatusCodes.Values()
loadBalancingPolicy := *defaultLoadBalancingPolicy
if ui.URLPrefix != nil {
if err := ui.URLPrefix.sanitize(); err != nil {
return err
}
if len(ui.RetryStatusCodes) > 0 {
retryStatusCodes = ui.RetryStatusCodes
}
if ui.LoadBalancingPolicy != "" {
loadBalancingPolicy = ui.LoadBalancingPolicy
}
ui.URLPrefix.retryStatusCodes = retryStatusCodes
if err := ui.URLPrefix.setLoadBalancingPolicy(loadBalancingPolicy); err != nil {
return err
}
}
if ui.DefaultURL != nil {
if err := ui.DefaultURL.sanitize(); err != nil {
return err
}
}
for _, e := range ui.URLMaps {
if len(e.SrcPaths) == 0 {
return fmt.Errorf("missing `src_paths` in `url_map`")
}
if e.URLPrefix == nil {
return fmt.Errorf("missing `url_prefix` in `url_map`")
}
if err := e.URLPrefix.sanitize(); err != nil {
return err
}
rscs := retryStatusCodes
lbp := loadBalancingPolicy
if len(e.RetryStatusCodes) > 0 {
rscs = e.RetryStatusCodes
}
if e.LoadBalancingPolicy != "" {
lbp = e.LoadBalancingPolicy
}
e.URLPrefix.retryStatusCodes = rscs
if err := e.URLPrefix.setLoadBalancingPolicy(lbp); err != nil {
return err
}
}
if len(ui.URLMaps) == 0 && ui.URLPrefix == nil {
return fmt.Errorf("missing `url_prefix`")
}
return nil
}
func (ui *UserInfo) name() string {
if ui.Name != "" {
return ui.Name

View File

@@ -221,22 +221,26 @@ func TestParseAuthConfigSuccess(t *testing.T) {
}
// Single user
insecureSkipVerifyTrue := true
f(`
users:
- username: foo
password: bar
url_prefix: http://aaa:343/bbb
max_concurrent_requests: 5
tls_insecure_skip_verify: true
`, map[string]*UserInfo{
getAuthToken("", "foo", "bar"): {
Username: "foo",
Password: "bar",
URLPrefix: mustParseURL("http://aaa:343/bbb"),
MaxConcurrentRequests: 5,
TLSInsecureSkipVerify: &insecureSkipVerifyTrue,
},
})
// Multiple url_prefix entries
insecureSkipVerifyFalse := false
f(`
users:
- username: foo
@@ -244,6 +248,10 @@ users:
url_prefix:
- http://node1:343/bbb
- http://node2:343/bbb
tls_insecure_skip_verify: false
retry_status_codes: [500, 501]
load_balancing_policy: first_available
drop_src_path_prefix_parts: 1
`, map[string]*UserInfo{
getAuthToken("", "foo", "bar"): {
Username: "foo",
@@ -252,6 +260,10 @@ users:
"http://node1:343/bbb",
"http://node2:343/bbb",
}),
TLSInsecureSkipVerify: &insecureSkipVerifyFalse,
RetryStatusCodes: []int{500, 501},
LoadBalancingPolicy: "first_available",
DropSrcPathPrefixParts: 1,
},
})
@@ -448,6 +460,47 @@ users:
}
func TestParseAuthConfigPassesTLSVerificationConfig(t *testing.T) {
c := `
users:
- username: foo
password: bar
url_prefix: https://aaa/bbb
max_concurrent_requests: 5
tls_insecure_skip_verify: true
unauthorized_user:
url_prefix: http://aaa:343/bbb
max_concurrent_requests: 5
tls_insecure_skip_verify: false
`
ac, err := parseAuthConfig([]byte(c))
if err != nil {
t.Fatalf("unexpected error: %s", err)
}
m, err := parseAuthConfigUsers(ac)
if err != nil {
t.Fatalf("unexpected error: %s", err)
}
ui := m[getAuthToken("", "foo", "bar")]
if !isSetBool(ui.TLSInsecureSkipVerify, true) || !ui.httpTransport.TLSClientConfig.InsecureSkipVerify {
t.Fatalf("unexpected TLSInsecureSkipVerify value for user foo")
}
if !isSetBool(ac.UnauthorizedUser.TLSInsecureSkipVerify, false) || ac.UnauthorizedUser.httpTransport.TLSClientConfig.InsecureSkipVerify {
t.Fatalf("unexpected TLSInsecureSkipVerify value for unauthorized_user")
}
}
func isSetBool(boolP *bool, expectedValue bool) bool {
if boolP == nil {
return false
}
return *boolP == expectedValue
}
func getSrcPaths(paths []string) []*SrcPath {
var sps []*SrcPath
for _, path := range paths {

View File

@@ -37,11 +37,20 @@ users:
# All the requests to http://vmauth:8427 with the given Basic Auth (username:password)
# are proxied to http://localhost:8428 with extra_label=team=dev query arg.
# For example, http://vmauth:8427/api/v1/query is routed to http://localhost:8428/api/v1/query?extra_label=team=dev
# For example, http://vmauth:8427/api/v1/query is proxied to http://localhost:8428/api/v1/query?extra_label=team=dev
- username: "local-single-node2"
password: "***"
url_prefix: "http://localhost:8428?extra_label=team=dev"
# All the requests to http://vmauth:8427 with the given Basic Auth (username:password)
# are proxied to https://localhost:8428
# For example, http://vmauth:8427/api/v1/query is proxied to https://localhost/api/v1/query
# TLS verification is ignored for https://localhost.
- username: "local-single-node-with-tls"
password: "***"
url_prefix: "https://localhost"
tls_insecure_skip_verify: true
# All the requests to http://vmauth:8427 with the given Basic Auth (username:password)
# are load-balanced among http://vmselect1:8481/select/123/prometheus and http://vmselect2:8481/select/123/prometheus
# For example, http://vmauth:8427/api/v1/query is proxied to the following urls in a round-robin manner:
@@ -82,6 +91,8 @@ users:
# For example, request to http://vmauth:8427/non/existing/path are proxied:
# - to http://default1:8888/unsupported_url_handler?request_path=/non/existing/path
# - or http://default2:8888/unsupported_url_handler?request_path=/non/existing/path
#
# Regular expressions are allowed in `src_paths` entries.
- username: "foobar"
url_map:
- src_paths:
@@ -100,9 +111,9 @@ users:
- "http://default1:8888/unsupported_url_handler"
- "http://default2:8888/unsupported_url_handler"
# Requests without Authorization header are routed according to `unauthorized_user` section.
# Requests are routed in round-robin fashion between `url_prefix` backends.
# The deny_partial_response query arg is added to all the routed requests.
# Requests without Authorization header are proxied according to `unauthorized_user` section.
# Requests are proxied in round-robin fashion between `url_prefix` backends.
# The deny_partial_response query arg is added to all the proxied requests.
# The requests are re-tried if url_prefix backends send 500 or 503 response status codes.
unauthorized_user:
url_prefix:

View File

@@ -20,6 +20,8 @@ users:
# For example, request to http://vmauth:8427/non/existing/path are proxied:
# - to http://default1:8888/unsupported_url_handler?request_path=/non/existing/path
# - or http://default2:8888/unsupported_url_handler?request_path=/non/existing/path
#
# Regular expressions are allowed in `src_paths` entries.
- username: "foobar"
url_map:
- src_paths:

View File

@@ -2,6 +2,8 @@ package main
import (
"context"
"crypto/tls"
"crypto/x509"
"errors"
"flag"
"fmt"
@@ -15,20 +17,23 @@ import (
"sync"
"time"
"github.com/VictoriaMetrics/metrics"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/buildinfo"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/bytesutil"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/encoding"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/envflag"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/flagutil"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/fs"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/httpserver"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/logger"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/netutil"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/procutil"
"github.com/VictoriaMetrics/VictoriaMetrics/lib/pushmetrics"
"github.com/VictoriaMetrics/metrics"
)
var (
httpListenAddr = flag.String("httpListenAddr", ":8427", "TCP address to listen for http connections. See also -httpListenAddr.useProxyProtocol")
httpListenAddr = flag.String("httpListenAddr", ":8427", "TCP address to listen for http connections. See also -tls and -httpListenAddr.useProxyProtocol")
useProxyProtocol = flag.Bool("httpListenAddr.useProxyProtocol", false, "Whether to use proxy protocol for connections accepted at -httpListenAddr . "+
"See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt . "+
"With enabled proxy protocol http server cannot serve regular /metrics endpoint. Use -pushmetrics.url for metrics pushing")
@@ -46,6 +51,10 @@ var (
failTimeout = flag.Duration("failTimeout", 3*time.Second, "Sets a delay period for load balancing to skip a malfunctioning backend")
maxRequestBodySizeToRetry = flagutil.NewBytes("maxRequestBodySizeToRetry", 16*1024, "The maximum request body size, which can be cached and re-tried at other backends. "+
"Bigger values may require more memory")
backendTLSInsecureSkipVerify = flag.Bool("backend.tlsInsecureSkipVerify", false, "Whether to skip TLS verification when connecting to backends over HTTPS. "+
"See https://docs.victoriametrics.com/vmauth.html#backend-tls-setup")
backendTLSCAFile = flag.String("backend.TLSCAFile", "", "Optional path to TLS root CA file, which is used for TLS verification when connecting to backends over HTTPS. "+
"See https://docs.victoriametrics.com/vmauth.html#backend-tls-setup")
)
func main() {
@@ -155,15 +164,23 @@ func processUserRequest(w http.ResponseWriter, r *http.Request, ui *UserInfo) {
func processRequest(w http.ResponseWriter, r *http.Request, ui *UserInfo) {
u := normalizeURL(r.URL)
up, hc, retryStatusCodes := ui.getURLPrefixAndHeaders(u)
up, hc, dropSrcPathPrefixParts := ui.getURLPrefixAndHeaders(u)
isDefault := false
if up == nil {
missingRouteRequests.Inc()
if ui.DefaultURL == nil {
// Authorization should be requested for http requests without credentials
// to a route that is not in the configuration for unauthorized user.
// See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5236
if ui.BearerToken == "" && ui.Username == "" && len(*authUsers.Load()) > 0 {
w.Header().Set("WWW-Authenticate", `Basic realm="Restricted"`)
http.Error(w, "missing `Authorization` request header", http.StatusUnauthorized)
return
}
missingRouteRequests.Inc()
httpserver.Errorf(w, r, "missing route for %q", u.String())
return
}
up, hc, retryStatusCodes = ui.DefaultURL, ui.HeadersConf, ui.RetryStatusCodes
up, hc = ui.DefaultURL, ui.HeadersConf
isDefault = true
}
maxAttempts := up.getBackendsCount()
@@ -173,17 +190,17 @@ func processRequest(w http.ResponseWriter, r *http.Request, ui *UserInfo) {
}
}
for i := 0; i < maxAttempts; i++ {
bu := up.getLeastLoadedBackendURL()
bu := up.getBackendURL()
targetURL := bu.url
// Don't change path and add request_path query param for default route.
if isDefault {
query := targetURL.Query()
query.Set("request_path", u.Path)
query.Set("request_path", u.String())
targetURL.RawQuery = query.Encode()
} else { // Update path for regular routes.
targetURL = mergeURLs(targetURL, u)
targetURL = mergeURLs(targetURL, u, dropSrcPathPrefixParts)
}
ok := tryProcessingRequest(w, r, targetURL, hc, retryStatusCodes)
ok := tryProcessingRequest(w, r, targetURL, hc, up.retryStatusCodes, ui.httpTransport)
bu.put()
if ok {
return
@@ -197,12 +214,12 @@ func processRequest(w http.ResponseWriter, r *http.Request, ui *UserInfo) {
httpserver.Errorf(w, r, "%s", err)
}
func tryProcessingRequest(w http.ResponseWriter, r *http.Request, targetURL *url.URL, hc HeadersConf, retryStatusCodes []int) bool {
func tryProcessingRequest(w http.ResponseWriter, r *http.Request, targetURL *url.URL, hc HeadersConf, retryStatusCodes []int, transport *http.Transport) bool {
// This code has been copied from net/http/httputil/reverseproxy.go
req := sanitizeRequestHeaders(r)
req.URL = targetURL
req.Host = targetURL.Host
updateHeadersByConfig(req.Header, hc.RequestHeaders)
transportOnce.Do(transportInit)
res, err := transport.RoundTrip(req)
rtb, rtbOK := req.Body.(*readTrackingBody)
if err != nil {
@@ -345,23 +362,77 @@ var (
missingRouteRequests = metrics.NewCounter(`vmauth_http_request_errors_total{reason="missing_route"}`)
)
var (
transport *http.Transport
transportOnce sync.Once
)
func getTransport(insecureSkipVerifyP *bool, caFile string) (*http.Transport, error) {
if insecureSkipVerifyP == nil {
insecureSkipVerifyP = backendTLSInsecureSkipVerify
}
insecureSkipVerify := *insecureSkipVerifyP
if caFile == "" {
caFile = *backendTLSCAFile
}
func transportInit() {
bb := bbPool.Get()
defer bbPool.Put(bb)
bb.B = appendTransportKey(bb.B[:0], insecureSkipVerify, caFile)
transportMapLock.Lock()
defer transportMapLock.Unlock()
tr := transportMap[string(bb.B)]
if tr == nil {
trLocal, err := newTransport(insecureSkipVerify, caFile)
if err != nil {
return nil, err
}
transportMap[string(bb.B)] = trLocal
tr = trLocal
}
return tr, nil
}
var transportMap = make(map[string]*http.Transport)
var transportMapLock sync.Mutex
func appendTransportKey(dst []byte, insecureSkipVerify bool, caFile string) []byte {
dst = encoding.MarshalBool(dst, insecureSkipVerify)
dst = encoding.MarshalBytes(dst, bytesutil.ToUnsafeBytes(caFile))
return dst
}
var bbPool bytesutil.ByteBufferPool
func newTransport(insecureSkipVerify bool, caFile string) (*http.Transport, error) {
tr := http.DefaultTransport.(*http.Transport).Clone()
tr.ResponseHeaderTimeout = *responseTimeout
// Automatic compression must be disabled in order to fix https://github.com/VictoriaMetrics/VictoriaMetrics/issues/535
tr.DisableCompression = true
// Disable HTTP/2.0, since VictoriaMetrics components don't support HTTP/2.0 (because there is no sense in this).
tr.ForceAttemptHTTP2 = false
tr.MaxIdleConnsPerHost = *maxIdleConnsPerBackend
if tr.MaxIdleConns != 0 && tr.MaxIdleConns < tr.MaxIdleConnsPerHost {
tr.MaxIdleConns = tr.MaxIdleConnsPerHost
}
transport = tr
tlsCfg := tr.TLSClientConfig
if tlsCfg == nil {
tlsCfg = &tls.Config{}
tr.TLSClientConfig = tlsCfg
}
if insecureSkipVerify || caFile != "" {
tlsCfg.ClientSessionCache = tls.NewLRUClientSessionCache(0)
tlsCfg.InsecureSkipVerify = insecureSkipVerify
if caFile != "" {
data, err := fs.ReadFileOrHTTP(caFile)
if err != nil {
return nil, fmt.Errorf("cannot read tls_ca_file: %w", err)
}
rootCA := x509.NewCertPool()
if !rootCA.AppendCertsFromPEM(data) {
return nil, fmt.Errorf("cannot parse data read from tls_ca_file %q", caFile)
}
tlsCfg.RootCAs = rootCA
}
}
return tr, nil
}
var (

View File

@@ -6,12 +6,13 @@ import (
"strings"
)
func mergeURLs(uiURL, requestURI *url.URL) *url.URL {
func mergeURLs(uiURL, requestURI *url.URL, dropSrcPathPrefixParts int) *url.URL {
targetURL := *uiURL
if strings.HasPrefix(requestURI.Path, "/") {
srcPath := dropPrefixParts(requestURI.Path, dropSrcPathPrefixParts)
if strings.HasPrefix(srcPath, "/") {
targetURL.Path = strings.TrimSuffix(targetURL.Path, "/")
}
targetURL.Path += requestURI.Path
targetURL.Path += srcPath
requestParams := requestURI.Query()
// fast path
if len(requestParams) == 0 {
@@ -32,18 +33,34 @@ func mergeURLs(uiURL, requestURI *url.URL) *url.URL {
return &targetURL
}
func (ui *UserInfo) getURLPrefixAndHeaders(u *url.URL) (*URLPrefix, HeadersConf, []int) {
func dropPrefixParts(path string, parts int) string {
if parts <= 0 {
return path
}
for parts > 0 {
path = strings.TrimPrefix(path, "/")
n := strings.IndexByte(path, '/')
if n < 0 {
return ""
}
path = path[n:]
parts--
}
return path
}
func (ui *UserInfo) getURLPrefixAndHeaders(u *url.URL) (*URLPrefix, HeadersConf, int) {
for _, e := range ui.URLMaps {
for _, sp := range e.SrcPaths {
if sp.match(u.Path) {
return e.URLPrefix, e.HeadersConf, e.RetryStatusCodes
return e.URLPrefix, e.HeadersConf, e.DropSrcPathPrefixParts
}
}
}
if ui.URLPrefix != nil {
return ui.URLPrefix, ui.HeadersConf, ui.RetryStatusCodes
return ui.URLPrefix, ui.HeadersConf, ui.DropSrcPathPrefixParts
}
return nil, HeadersConf{}, nil
return nil, HeadersConf{}, 0
}
func normalizeURL(uOrig *url.URL) *url.URL {

View File

@@ -7,20 +7,94 @@ import (
"testing"
)
func TestCreateTargetURLSuccess(t *testing.T) {
f := func(ui *UserInfo, requestURI, expectedTarget, expectedRequestHeaders, expectedResponseHeaders string, expectedRetryStatusCodes []int) {
func TestDropPrefixParts(t *testing.T) {
f := func(path string, parts int, expectedResult string) {
t.Helper()
result := dropPrefixParts(path, parts)
if result != expectedResult {
t.Fatalf("unexpected result; got %q; want %q", result, expectedResult)
}
}
f("", 0, "")
f("", 1, "")
f("", 10, "")
f("foo", 0, "foo")
f("foo", -1, "foo")
f("foo", 1, "")
f("/foo", 0, "/foo")
f("/foo/bar", 0, "/foo/bar")
f("/foo/bar/baz", 0, "/foo/bar/baz")
f("foo", 0, "foo")
f("foo/bar", 0, "foo/bar")
f("foo/bar/baz", 0, "foo/bar/baz")
f("/foo/", 0, "/foo/")
f("/foo/bar/", 0, "/foo/bar/")
f("/foo/bar/baz/", 0, "/foo/bar/baz/")
f("/foo", 1, "")
f("/foo/bar", 1, "/bar")
f("/foo/bar/baz", 1, "/bar/baz")
f("foo", 1, "")
f("foo/bar", 1, "/bar")
f("foo/bar/baz", 1, "/bar/baz")
f("/foo/", 1, "/")
f("/foo/bar/", 1, "/bar/")
f("/foo/bar/baz/", 1, "/bar/baz/")
f("/foo", 2, "")
f("/foo/bar", 2, "")
f("/foo/bar/baz", 2, "/baz")
f("foo", 2, "")
f("foo/bar", 2, "")
f("foo/bar/baz", 2, "/baz")
f("/foo/", 2, "")
f("/foo/bar/", 2, "/")
f("/foo/bar/baz/", 2, "/baz/")
f("/foo", 3, "")
f("/foo/bar", 3, "")
f("/foo/bar/baz", 3, "")
f("foo", 3, "")
f("foo/bar", 3, "")
f("foo/bar/baz", 3, "")
f("/foo/", 3, "")
f("/foo/bar/", 3, "")
f("/foo/bar/baz/", 3, "/")
f("/foo/", 4, "")
f("/foo/bar/", 4, "")
f("/foo/bar/baz/", 4, "")
}
func TestCreateTargetURLSuccess(t *testing.T) {
f := func(ui *UserInfo, requestURI, expectedTarget, expectedRequestHeaders, expectedResponseHeaders string,
expectedRetryStatusCodes []int, expectedLoadBalancingPolicy string, expectedDropSrcPathPrefixParts int) {
t.Helper()
if err := ui.initURLs(); err != nil {
t.Fatalf("cannot initialize urls inside UserInfo: %s", err)
}
u, err := url.Parse(requestURI)
if err != nil {
t.Fatalf("cannot parse %q: %s", requestURI, err)
}
u = normalizeURL(u)
up, hc, retryStatusCodes := ui.getURLPrefixAndHeaders(u)
up, hc, dropSrcPathPrefixParts := ui.getURLPrefixAndHeaders(u)
if up == nil {
t.Fatalf("cannot determie backend: %s", err)
}
bu := up.getLeastLoadedBackendURL()
target := mergeURLs(bu.url, u)
target := mergeURLs(bu.url, u, dropSrcPathPrefixParts)
bu.put()
if target.String() != expectedTarget {
t.Fatalf("unexpected target; got %q; want %q", target, expectedTarget)
@@ -29,14 +103,20 @@ func TestCreateTargetURLSuccess(t *testing.T) {
if headersStr != expectedRequestHeaders {
t.Fatalf("unexpected request headers; got %s; want %s", headersStr, expectedRequestHeaders)
}
if !reflect.DeepEqual(retryStatusCodes, expectedRetryStatusCodes) {
t.Fatalf("unexpected retryStatusCodes; got %d; want %d", retryStatusCodes, expectedRetryStatusCodes)
if !reflect.DeepEqual(up.retryStatusCodes, expectedRetryStatusCodes) {
t.Fatalf("unexpected retryStatusCodes; got %d; want %d", up.retryStatusCodes, expectedRetryStatusCodes)
}
if up.loadBalancingPolicy != expectedLoadBalancingPolicy {
t.Fatalf("unexpected loadBalancingPolicy; got %q; want %q", up.loadBalancingPolicy, expectedLoadBalancingPolicy)
}
if dropSrcPathPrefixParts != expectedDropSrcPathPrefixParts {
t.Fatalf("unexpected dropSrcPathPrefixParts; got %d; want %d", dropSrcPathPrefixParts, expectedDropSrcPathPrefixParts)
}
}
// Simple routing with `url_prefix`
f(&UserInfo{
URLPrefix: mustParseURL("http://foo.bar"),
}, "", "http://foo.bar/.", "[]", "[]", nil)
}, "", "http://foo.bar/.", "[]", "[]", nil, "least_loaded", 0)
f(&UserInfo{
URLPrefix: mustParseURL("http://foo.bar"),
HeadersConf: HeadersConf{
@@ -45,29 +125,31 @@ func TestCreateTargetURLSuccess(t *testing.T) {
Value: "aaa",
}},
},
RetryStatusCodes: []int{503, 501},
}, "/", "http://foo.bar", `[{"bb" "aaa"}]`, `[]`, []int{503, 501})
RetryStatusCodes: []int{503, 501},
LoadBalancingPolicy: "first_available",
DropSrcPathPrefixParts: 2,
}, "/a/b/c", "http://foo.bar/c", `[{"bb" "aaa"}]`, `[]`, []int{503, 501}, "first_available", 2)
f(&UserInfo{
URLPrefix: mustParseURL("http://foo.bar/federate"),
}, "/", "http://foo.bar/federate", "[]", "[]", nil)
}, "/", "http://foo.bar/federate", "[]", "[]", nil, "least_loaded", 0)
f(&UserInfo{
URLPrefix: mustParseURL("http://foo.bar"),
}, "a/b?c=d", "http://foo.bar/a/b?c=d", "[]", "[]", nil)
}, "a/b?c=d", "http://foo.bar/a/b?c=d", "[]", "[]", nil, "least_loaded", 0)
f(&UserInfo{
URLPrefix: mustParseURL("https://sss:3894/x/y"),
}, "/z", "https://sss:3894/x/y/z", "[]", "[]", nil)
}, "/z", "https://sss:3894/x/y/z", "[]", "[]", nil, "least_loaded", 0)
f(&UserInfo{
URLPrefix: mustParseURL("https://sss:3894/x/y"),
}, "/../../aaa", "https://sss:3894/x/y/aaa", "[]", "[]", nil)
}, "/../../aaa", "https://sss:3894/x/y/aaa", "[]", "[]", nil, "least_loaded", 0)
f(&UserInfo{
URLPrefix: mustParseURL("https://sss:3894/x/y"),
}, "/./asd/../../aaa?a=d&s=s/../d", "https://sss:3894/x/y/aaa?a=d&s=s%2F..%2Fd", "[]", "[]", nil)
}, "/./asd/../../aaa?a=d&s=s/../d", "https://sss:3894/x/y/aaa?a=d&s=s%2F..%2Fd", "[]", "[]", nil, "least_loaded", 0)
// Complex routing with `url_map`
ui := &UserInfo{
URLMaps: []URLMap{
{
SrcPaths: getSrcPaths([]string{"/api/v1/query"}),
SrcPaths: getSrcPaths([]string{"/vmsingle/api/v1/query"}),
URLPrefix: mustParseURL("http://vmselect/0/prometheus"),
HeadersConf: HeadersConf{
RequestHeaders: []Header{
@@ -87,7 +169,9 @@ func TestCreateTargetURLSuccess(t *testing.T) {
},
},
},
RetryStatusCodes: []int{503, 500, 501},
RetryStatusCodes: []int{503, 500, 501},
LoadBalancingPolicy: "first_available",
DropSrcPathPrefixParts: 1,
},
{
SrcPaths: getSrcPaths([]string{"/api/v1/write"}),
@@ -105,11 +189,12 @@ func TestCreateTargetURLSuccess(t *testing.T) {
Value: "y",
}},
},
RetryStatusCodes: []int{502},
RetryStatusCodes: []int{502},
DropSrcPathPrefixParts: 2,
}
f(ui, "/api/v1/query?query=up", "http://vmselect/0/prometheus/api/v1/query?query=up", `[{"xx" "aa"} {"yy" "asdf"}]`, `[{"qwe" "rty"}]`, []int{503, 500, 501})
f(ui, "/api/v1/write", "http://vminsert/0/prometheus/api/v1/write", "[]", "[]", nil)
f(ui, "/api/v1/query_range", "http://default-server/api/v1/query_range", `[{"bb" "aaa"}]`, `[{"x" "y"}]`, []int{502})
f(ui, "/vmsingle/api/v1/query?query=up", "http://vmselect/0/prometheus/api/v1/query?query=up", `[{"xx" "aa"} {"yy" "asdf"}]`, `[{"qwe" "rty"}]`, []int{503, 500, 501}, "first_available", 1)
f(ui, "/api/v1/write", "http://vminsert/0/prometheus/api/v1/write", "[]", "[]", []int{502}, "least_loaded", 0)
f(ui, "/foo/bar/api/v1/query_range", "http://default-server/api/v1/query_range", `[{"bb" "aaa"}]`, `[{"x" "y"}]`, []int{502}, "least_loaded", 2)
// Complex routing regexp paths in `url_map`
ui = &UserInfo{
@@ -125,17 +210,17 @@ func TestCreateTargetURLSuccess(t *testing.T) {
},
URLPrefix: mustParseURL("http://default-server"),
}
f(ui, "/api/v1/query?query=up", "http://vmselect/0/prometheus/api/v1/query?query=up", "[]", "[]", nil)
f(ui, "/api/v1/query_range?query=up", "http://vmselect/0/prometheus/api/v1/query_range?query=up", "[]", "[]", nil)
f(ui, "/api/v1/label/foo/values", "http://vmselect/0/prometheus/api/v1/label/foo/values", "[]", "[]", nil)
f(ui, "/api/v1/write", "http://vminsert/0/prometheus/api/v1/write", "[]", "[]", nil)
f(ui, "/api/v1/foo/bar", "http://default-server/api/v1/foo/bar", "[]", "[]", nil)
f(ui, "/api/v1/query?query=up", "http://vmselect/0/prometheus/api/v1/query?query=up", "[]", "[]", nil, "least_loaded", 0)
f(ui, "/api/v1/query_range?query=up", "http://vmselect/0/prometheus/api/v1/query_range?query=up", "[]", "[]", nil, "least_loaded", 0)
f(ui, "/api/v1/label/foo/values", "http://vmselect/0/prometheus/api/v1/label/foo/values", "[]", "[]", nil, "least_loaded", 0)
f(ui, "/api/v1/write", "http://vminsert/0/prometheus/api/v1/write", "[]", "[]", nil, "least_loaded", 0)
f(ui, "/api/v1/foo/bar", "http://default-server/api/v1/foo/bar", "[]", "[]", nil, "least_loaded", 0)
f(&UserInfo{
URLPrefix: mustParseURL("http://foo.bar?extra_label=team=dev"),
}, "/api/v1/query", "http://foo.bar/api/v1/query?extra_label=team=dev", "[]", "[]", nil)
}, "/api/v1/query", "http://foo.bar/api/v1/query?extra_label=team=dev", "[]", "[]", nil, "least_loaded", 0)
f(&UserInfo{
URLPrefix: mustParseURL("http://foo.bar?extra_label=team=mobile"),
}, "/api/v1/query?extra_label=team=dev", "http://foo.bar/api/v1/query?extra_label=team%3Dmobile", "[]", "[]", nil)
}, "/api/v1/query?extra_label=team=dev", "http://foo.bar/api/v1/query?extra_label=team%3Dmobile", "[]", "[]", nil, "least_loaded", 0)
}
func TestCreateTargetURLFailure(t *testing.T) {
@@ -146,7 +231,7 @@ func TestCreateTargetURLFailure(t *testing.T) {
t.Fatalf("cannot parse %q: %s", requestURI, err)
}
u = normalizeURL(u)
up, hc, retryStatusCodes := ui.getURLPrefixAndHeaders(u)
up, hc, dropSrcPathPrefixParts := ui.getURLPrefixAndHeaders(u)
if up != nil {
t.Fatalf("unexpected non-empty up=%#v", up)
}
@@ -156,8 +241,8 @@ func TestCreateTargetURLFailure(t *testing.T) {
if hc.ResponseHeaders != nil {
t.Fatalf("unexpected non-empty response headers=%q", hc.ResponseHeaders)
}
if retryStatusCodes != nil {
t.Fatalf("unexpected non-empty retryStatusCodes=%d", retryStatusCodes)
if dropSrcPathPrefixParts != 0 {
t.Fatalf("unexpected non-zero dropSrcPathPrefixParts=%d", dropSrcPathPrefixParts)
}
}
f(&UserInfo{}, "/foo/bar")

View File

@@ -1,438 +1,3 @@
# vmbackup
See vmbackup docs [here](https://docs.victoriametrics.com/vmbackup.html).
`vmbackup` creates VictoriaMetrics data backups from [instant snapshots](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-work-with-snapshots).
`vmbackup` supports incremental and full backups. Incremental backups are created automatically if the destination path already contains data from the previous backup.
Full backups can be sped up with `-origin` pointing to an already existing backup on the same remote storage. In this case `vmbackup` makes server-side copy for the shared
data between the existing backup and new backup. It saves time and costs on data transfer.
Backup process can be interrupted at any time. It is automatically resumed from the interruption point when restarting `vmbackup` with the same args.
Backed up data can be restored with [vmrestore](https://docs.victoriametrics.com/vmrestore.html).
See [this article](https://medium.com/@valyala/speeding-up-backups-for-big-time-series-databases-533c1a927883) for more details.
See also [vmbackupmanager](https://docs.victoriametrics.com/vmbackupmanager.html) tool built on top of `vmbackup`. This tool simplifies
creation of hourly, daily, weekly and monthly backups.
## Supported storage types
`vmbackup` supports the following `-dst` storage types:
* [GCS](https://cloud.google.com/storage/). Example: `gs://<bucket>/<path/to/backup>`
* [S3](https://aws.amazon.com/s3/). Example: `s3://<bucket>/<path/to/backup>`
* [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs/). Example: `azblob://<container>/<path/to/backup>`
* Any S3-compatible storage such as [MinIO](https://github.com/minio/minio), [Ceph](https://docs.ceph.com/en/pacific/radosgw/s3/) or [Swift](https://platform.swiftstack.com/docs/admin/middleware/s3_middleware.html). See [these docs](#advanced-usage) for details.
* Local filesystem. Example: `fs://</absolute/path/to/backup>`. Note that `vmbackup` prevents from storing the backup into the directory pointed by `-storageDataPath` command-line flag, since this directory should be managed solely by VictoriaMetrics or `vmstorage`.
## Use cases
### Regular backups
Regular backup can be performed with the following command:
```console
./vmbackup -storageDataPath=</path/to/victoria-metrics-data> -snapshot.createURL=http://localhost:8428/snapshot/create -dst=gs://<bucket>/<path/to/new/backup>
```
* `</path/to/victoria-metrics-data>` - path to VictoriaMetrics data pointed by `-storageDataPath` command-line flag in single-node VictoriaMetrics or in cluster `vmstorage`.
There is no need to stop VictoriaMetrics for creating backups since they are performed from immutable [instant snapshots](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-work-with-snapshots).
* `http://victoriametrics:8428/snapshot/create` is the url for creating snapshots according to [these docs](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-work-with-snapshots). `vmbackup` creates a snapshot by querying the provided `-snapshot.createURL`, then performs the backup and then automatically removes the created snapshot.
* `<bucket>` is an already existing name for [GCS bucket](https://cloud.google.com/storage/docs/creating-buckets).
* `<path/to/new/backup>` is the destination path where new backup will be placed.
### Regular backups with server-side copy from existing backup
If the destination GCS bucket already contains the previous backup at `-origin` path, then new backup can be sped up
with the following command:
```console
./vmbackup -storageDataPath=</path/to/victoria-metrics-data> -snapshot.createURL=http://localhost:8428/snapshot/create -dst=gs://<bucket>/<path/to/new/backup> -origin=gs://<bucket>/<path/to/existing/backup>
```
It saves time and network bandwidth costs by performing server-side copy for the shared data from the `-origin` to `-dst`.
### Incremental backups
Incremental backups are performed if `-dst` points to an already existing backup. In this case only new data is uploaded to remote storage.
It saves time and network bandwidth costs when working with big backups:
```console
./vmbackup -storageDataPath=</path/to/victoria-metrics-data> -snapshot.createURL=http://localhost:8428/snapshot/create -dst=gs://<bucket>/<path/to/existing/backup>
```
### Smart backups
Smart backups mean storing full daily backups into `YYYYMMDD` folders and creating incremental hourly backup into `latest` folder:
* Run the following command every hour:
```console
./vmbackup -storageDataPath=</path/to/victoria-metrics-data> -snapshot.createURL=http://localhost:8428/snapshot/create -dst=gs://<bucket>/latest
```
Where `<latest-snapshot>` is the latest [snapshot](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-work-with-snapshots).
The command will upload only changed data to `gs://<bucket>/latest`.
* Run the following command once a day:
```console
./vmbackup -storageDataPath=</path/to/victoria-metrics-data> -snapshot.createURL=http://localhost:8428/snapshot/create -dst=gs://<bucket>/<YYYYMMDD> -origin=gs://<bucket>/latest
```
Where `<daily-snapshot>` is the snapshot for the last day `<YYYYMMDD>`.
This approach saves network bandwidth costs on hourly backups (since they are incremental) and allows recovering data from either the last hour (`latest` backup)
or from any day (`YYYYMMDD` backups). Note that hourly backup shouldn't run when creating daily backup.
Do not forget to remove old backups when they are no longer needed in order to save storage costs.
See also [vmbackupmanager tool](https://docs.victoriametrics.com/vmbackupmanager.html) for automating smart backups.
### Server-side copy of the existing backup
Sometimes it is needed to make server-side copy of the existing backup. This can be done by specifying the source backup path via `-origin` command-line flag,
while the destination path for backup copy must be specified via `-dst` command-line flag. For example, the following command copies backup
from `gs://bucket/foo` to `gs://bucket/bar`:
```console
./vmbackup -origin=gs://bucket/foo -dst=gs://bucket/bar
```
The `-origin` and `-dst` must point to the same object storage bucket or to the same filesystem.
The server-side backup copy is usually performed at much faster speed comparing to the usual backup, since backup data isn't transferred
between the remote storage and locally running `vmbackup` tool.
If the `-dst` already contains some data, then its' contents is synced with the `-origin` data. This allows making incremental server-side copies of backups.
## How does it work?
The backup algorithm is the following:
1. Create a snapshot by querying the provided `-snapshot.createURL`
1. Collect information about files in the created snapshot, in the `-dst` and in the `-origin`.
1. Determine which files in `-dst` are missing in the created snapshot, and delete them. These are usually small files, which are already merged into bigger files in the snapshot.
1. Determine which files in the created snapshot are missing in `-dst`. These are usually small new files and bigger merged files.
1. Determine which files from step 3 exist in the `-origin`, and perform server-side copy of these files from `-origin` to `-dst`.
These are usually the biggest and the oldest files, which are shared between backups.
1. Upload the remaining files from step 3 from the created snapshot to `-dst`.
1. Delete the created snapshot.
The algorithm splits source files into 1 GiB chunks in the backup. Each chunk is stored as a separate file in the backup.
Such splitting balances between the number of files in the backup and the amounts of data that needs to be re-transferred after temporary errors.
`vmbackup` relies on [instant snapshot](https://medium.com/@valyala/how-victoriametrics-makes-instant-snapshots-for-multi-terabyte-time-series-data-e1f3fb0e0282) properties:
* All the files in the snapshot are immutable.
* Old files are periodically merged into new files.
* Smaller files have higher probability to be merged.
* Consecutive snapshots share many identical files.
These properties allow performing fast and cheap incremental backups and server-side copying from `-origin` paths.
See [this article](https://medium.com/@valyala/speeding-up-backups-for-big-time-series-databases-533c1a927883) for more details.
`vmbackup` can work improperly or slowly when these properties are violated.
## Troubleshooting
* If the backup is slow, then try setting higher value for `-concurrency` flag. This will increase the number of concurrent workers that upload data to backup storage.
* If `vmbackup` eats all the network bandwidth or CPU, then either decrease the `-concurrency` command-line flag value or set `-maxBytesPerSecond` command-line flag value to lower value.
* If `vmbackup` consumes all the CPU on systems with big number of CPU cores, then try running it with `-filestream.disableFadvise` command-line flag.
* If `vmbackup` has been interrupted due to temporary error, then just restart it with the same args. It will resume the backup process.
* Backups created from [single-node VictoriaMetrics](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html) cannot be restored
at [cluster VictoriaMetrics](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html) and vice versa.
## Advanced usage
### Providing credentials as a file
Obtaining credentials from a file.
Add flag `-credsFilePath=/etc/credentials` with the following content:
- for S3 (AWS, MinIO or other S3 compatible storages):
```console
[default]
aws_access_key_id=theaccesskey
aws_secret_access_key=thesecretaccesskeyvalue
```
- for GCP cloud storage:
```json
{
"type": "service_account",
"project_id": "project-id",
"private_key_id": "key-id",
"private_key": "-----BEGIN PRIVATE KEY-----\nprivate-key\n-----END PRIVATE KEY-----\n",
"client_email": "service-account-email",
"client_id": "client-id",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://accounts.google.com/o/oauth2/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/service-account-email"
}
```
### Providing credentials via env variables
Obtaining credentials from env variables.
- For AWS S3 compatible storages set env variable `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
Also you can set env variable `AWS_SHARED_CREDENTIALS_FILE` with path to credentials file.
- For GCE cloud storage set env variable `GOOGLE_APPLICATION_CREDENTIALS` with path to credentials file.
- For Azure storage either set env variables `AZURE_STORAGE_ACCOUNT_NAME` and `AZURE_STORAGE_ACCOUNT_KEY`, or `AZURE_STORAGE_ACCOUNT_CONNECTION_STRING`.
Please, note that `vmbackup` will use credentials provided by cloud providers metadata service [when applicable](https://docs.victoriametrics.com/vmbackup.html#using-cloud-providers-metadata-service).
### Using cloud providers metadata service
`vmbackup` and `vmbackupmanager` will automatically use cloud providers metadata service in order to obtain credentials if they are running in cloud environment
and credentials are not explicitly provided via flags or env variables.
### Providing credentials in Kubernetes
The simplest way to provide credentials in Kubernetes is to use [Secrets](https://kubernetes.io/docs/concepts/configuration/secret/)
and inject them into the pod as environment variables. For example, the following secret can be used for AWS S3 credentials:
```yaml
apiVersion: v1
kind: Secret
metadata:
name: vmbackup-credentials
data:
access_key: key
secret_key: secret
```
And then it can be injected into the pod as environment variables:
```yaml
...
env:
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
key: access_key
name: vmbackup-credentials
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
key: secret_key
name: vmbackup-credentials
...
```
A more secure way is to use IAM roles to provide tokens for pods instead of managing credentials manually.
For AWS deployments it will be required to configure [IAM roles for service accounts](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html).
In order to use IAM roles for service accounts with `vmbackup` or `vmbackupmanager` it is required to create ServiceAccount with IAM role mapping:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: monitoring-backups
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::{ACCOUNT_ID}:role/{ROLE_NAME}
```
And [configure pod to use service account](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/).
After this `vmbackup` and `vmbackupmanager` will automatically use IAM role for service account in order to obtain credentials.
For GCP deployments it will be required to configure [Workload Identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity).
In order to use Workload Identity with `vmbackup` or `vmbackupmanager` it is required to create ServiceAccount with Workload Identity annotation:
```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: monitoring-backups
annotations:
iam.gke.io/gcp-service-account: {sa_name}@{project_name}.iam.gserviceaccount.com
```
And [configure pod to use service account](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/).
After this `vmbackup` and `vmbackupmanager` will automatically use Workload Identity for servicpe account in order to obtain credentials.
### Using custom S3 endpoint
Usage with s3 custom url endpoint. It is possible to use `vmbackup` with s3 compatible storages like minio, cloudian, etc.
You have to add a custom url endpoint via flag:
- for MinIO
```console
-customS3Endpoint=http://localhost:9000
```
- for aws gov region
```console
-customS3Endpoint=https://s3-fips.us-gov-west-1.amazonaws.com
```
### Permanent deletion of objects in S3 and compatible storages
`vmbackup` and [vmbackupmanager](https://docs.victoriametrics.com/vmbackupmanager.html) use standard delete operation
for S3-compatible object storage when pefrorming [incremental backups](#incremental-backups).
This operation removes only the current version of the object. This works OK in most cases.
Sometimes it is needed to remove all the versions of an object. In this case pass `-deleteAllObjectVersions` command-line flag to `vmbackup`.
Alternatively, it is possible to use object storage lifecycle rules to remove non-current versions of objects automatically.
Refer to the respective documentation for your object storage provider for more details.
### Command-line flags
Run `vmbackup -help` in order to see all the available options:
```console
-concurrency int
The number of concurrent workers. Higher concurrency may reduce backup duration (default 10)
-configFilePath string
Path to file with S3 configs. Configs are loaded from default location if not set.
See https://docs.aws.amazon.com/general/latest/gr/aws-security-credentials.html
-configProfile string
Profile name for S3 configs. If no set, the value of the environment variable will be loaded (AWS_PROFILE or AWS_DEFAULT_PROFILE), or if both not set, DefaultSharedConfigProfile is used
-credsFilePath string
Path to file with GCS or S3 credentials. Credentials are loaded from default locations if not set.
See https://cloud.google.com/iam/docs/creating-managing-service-account-keys and https://docs.aws.amazon.com/general/latest/gr/aws-security-credentials.html
-customS3Endpoint string
Custom S3 endpoint for use with S3-compatible storages (e.g. MinIO). S3 is used if not set
-deleteAllObjectVersions
Whether to prune previous object versions when deleting an object. By default, when object storage has versioning enabled deleting the file removes only current version. This option forces removal of all previous versions. See: https://docs.victoriametrics.com/vmbackup.html#permanent-deletion-of-objects-in-s3-compatible-storages
-dst string
Where to put the backup on the remote storage. Example: gs://bucket/path/to/backup, s3://bucket/path/to/backup, azblob://container/path/to/backup or fs:///path/to/local/backup/dir
-dst can point to the previous backup. In this case incremental backup is performed, i.e. only changed data is uploaded
-enableTCP6
Whether to enable IPv6 for listening and dialing. By default, only IPv4 TCP and UDP are used
-envflag.enable
Whether to enable reading flags from environment variables in addition to the command line. Command line flag values have priority over values from environment vars. Flags are read only from the command line if this flag isn't set. See https://docs.victoriametrics.com/#environment-variables for more details
-envflag.prefix string
Prefix for environment variables if -envflag.enable is set
-eula
Deprecated, please use -license or -licenseFile flags instead. By specifying this flag, you confirm that you have an enterprise license and accept the ESA https://victoriametrics.com/legal/esa/ . This flag is available only in Enterprise binaries. See https://docs.victoriametrics.com/enterprise.html
-filestream.disableFadvise
Whether to disable fadvise() syscall when reading large data files. The fadvise() syscall prevents from eviction of recently accessed data from OS page cache during background merges and backups. In some rare cases it is better to disable the syscall if it uses too much CPU
-flagsAuthKey string
Auth key for /flags endpoint. It must be passed via authKey query arg. It overrides httpAuth.* settings
-fs.disableMmap
Whether to use pread() instead of mmap() for reading data files. By default, mmap() is used for 64-bit arches and pread() is used for 32-bit arches, since they cannot read data files bigger than 2^32 bytes in memory. mmap() is usually faster for reading small data chunks than pread()
-http.connTimeout duration
Incoming http connections are closed after the configured timeout. This may help to spread the incoming load among a cluster of services behind a load balancer. Please note that the real timeout may be bigger by up to 10% as a protection against the thundering herd problem (default 2m0s)
-http.disableResponseCompression
Disable compression of HTTP responses to save CPU resources. By default, compression is enabled to save network bandwidth
-http.idleConnTimeout duration
Timeout for incoming idle http connections (default 1m0s)
-http.maxGracefulShutdownDuration duration
The maximum duration for a graceful shutdown of the HTTP server. A highly loaded server may require increased value for a graceful shutdown (default 7s)
-http.pathPrefix string
An optional prefix to add to all the paths handled by http server. For example, if '-http.pathPrefix=/foo/bar' is set, then all the http requests will be handled on '/foo/bar/*' paths. This may be useful for proxied requests. See https://www.robustperception.io/using-external-urls-and-proxies-with-prometheus
-http.shutdownDelay duration
Optional delay before http server shutdown. During this delay, the server returns non-OK responses from /health page, so load balancers can route new requests to other servers
-httpAuth.password string
Password for HTTP server's Basic Auth. The authentication is disabled if -httpAuth.username is empty
-httpAuth.username string
Username for HTTP server's Basic Auth. The authentication is disabled if empty. See also -httpAuth.password
-httpListenAddr string
TCP address for exporting metrics at /metrics page (default ":8420")
-internStringCacheExpireDuration duration
The expiry duration for caches for interned strings. See https://en.wikipedia.org/wiki/String_interning . See also -internStringMaxLen and -internStringDisableCache (default 6m0s)
-internStringDisableCache
Whether to disable caches for interned strings. This may reduce memory usage at the cost of higher CPU usage. See https://en.wikipedia.org/wiki/String_interning . See also -internStringCacheExpireDuration and -internStringMaxLen
-internStringMaxLen int
The maximum length for strings to intern. A lower limit may save memory at the cost of higher CPU usage. See https://en.wikipedia.org/wiki/String_interning . See also -internStringDisableCache and -internStringCacheExpireDuration (default 500)
-license string
Lisense key for VictoriaMetrics Enterprise. See https://victoriametrics.com/products/enterprise/ . Trial Enterprise license can be obtained from https://victoriametrics.com/products/enterprise/trial/ . This flag is available only in Enterprise binaries. The license key can be also passed via file specified by -licenseFile command-line flag
-license.forceOffline
Whether to enable offline verification for VictoriaMetrics Enterprise license key, which has been passed either via -license or via -licenseFile command-line flag. The issued license key must support offline verification feature. Contact info@victoriametrics.com if you need offline license verification. This flag is avilable only in Enterprise binaries
-licenseFile string
Path to file with license key for VictoriaMetrics Enterprise. See https://victoriametrics.com/products/enterprise/ . Trial Enterprise license can be obtained from https://victoriametrics.com/products/enterprise/trial/ . This flag is available only in Enterprise binaries. The license key can be also passed inline via -license command-line flag
-loggerDisableTimestamps
Whether to disable writing timestamps in logs
-loggerErrorsPerSecondLimit int
Per-second limit on the number of ERROR messages. If more than the given number of errors are emitted per second, the remaining errors are suppressed. Zero values disable the rate limit
-loggerFormat string
Format for logs. Possible values: default, json (default "default")
-loggerJSONFields string
Allows renaming fields in JSON formatted logs. Example: "ts:timestamp,msg:message" renames "ts" to "timestamp" and "msg" to "message". Supported fields: ts, level, caller, msg
-loggerLevel string
Minimum level of errors to log. Possible values: INFO, WARN, ERROR, FATAL, PANIC (default "INFO")
-loggerOutput string
Output for the logs. Supported values: stderr, stdout (default "stderr")
-loggerTimezone string
Timezone to use for timestamps in logs. Timezone must be a valid IANA Time Zone. For example: America/New_York, Europe/Berlin, Etc/GMT+3 or Local (default "UTC")
-loggerWarnsPerSecondLimit int
Per-second limit on the number of WARN messages. If more than the given number of warns are emitted per second, then the remaining warns are suppressed. Zero values disable the rate limit
-maxBytesPerSecond size
The maximum upload speed. There is no limit if it is set to 0
Supports the following optional suffixes for size values: KB, MB, GB, TB, KiB, MiB, GiB, TiB (default 0)
-memory.allowedBytes size
Allowed size of system memory VictoriaMetrics caches may occupy. This option overrides -memory.allowedPercent if set to a non-zero value. Too low a value may increase the cache miss rate usually resulting in higher CPU and disk IO usage. Too high a value may evict too much data from the OS page cache resulting in higher disk IO usage
Supports the following optional suffixes for size values: KB, MB, GB, TB, KiB, MiB, GiB, TiB (default 0)
-memory.allowedPercent float
Allowed percent of system memory VictoriaMetrics caches may occupy. See also -memory.allowedBytes. Too low a value may increase cache miss rate usually resulting in higher CPU and disk IO usage. Too high a value may evict too much data from the OS page cache which will result in higher disk IO usage (default 60)
-metricsAuthKey string
Auth key for /metrics endpoint. It must be passed via authKey query arg. It overrides httpAuth.* settings
-origin string
Optional origin directory on the remote storage with old backup for server-side copying when performing full backup. This speeds up full backups
-pprofAuthKey string
Auth key for /debug/pprof/* endpoints. It must be passed via authKey query arg. It overrides httpAuth.* settings
-pushmetrics.extraLabel array
Optional labels to add to metrics pushed to -pushmetrics.url . For example, -pushmetrics.extraLabel='instance="foo"' adds instance="foo" label to all the metrics pushed to -pushmetrics.url
Supports an array of values separated by comma or specified via multiple flags.
-pushmetrics.interval duration
Interval for pushing metrics to -pushmetrics.url (default 10s)
-pushmetrics.url array
Optional URL to push metrics exposed at /metrics page. See https://docs.victoriametrics.com/#push-metrics . By default, metrics exposed at /metrics page aren't pushed to any remote storage
Supports an array of values separated by comma or specified via multiple flags.
-s3ForcePathStyle
Prefixing endpoint with bucket name when set false, true by default. (default true)
-s3StorageClass string
The Storage Class applied to objects uploaded to AWS S3. Supported values are: GLACIER, DEEP_ARCHIVE, GLACIER_IR, INTELLIGENT_TIERING, ONEZONE_IA, OUTPOSTS, REDUCED_REDUNDANCY, STANDARD, STANDARD_IA.
See https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html
-snapshot.createURL string
VictoriaMetrics create snapshot url. When this is given a snapshot will automatically be created during backup. Example: http://victoriametrics:8428/snapshot/create . There is no need in setting -snapshotName if -snapshot.createURL is set
-snapshot.deleteURL string
VictoriaMetrics delete snapshot url. Optional. Will be generated from -snapshot.createURL if not provided. All created snapshots will be automatically deleted. Example: http://victoriametrics:8428/snapshot/delete
-snapshotName string
Name for the snapshot to backup. See https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-work-with-snapshots. There is no need in setting -snapshotName if -snapshot.createURL is set
-storageDataPath string
Path to VictoriaMetrics data. Must match -storageDataPath from VictoriaMetrics or vmstorage (default "victoria-metrics-data")
-tls
Whether to enable TLS for incoming HTTP requests at -httpListenAddr (aka https). -tlsCertFile and -tlsKeyFile must be set if -tls is set
-tlsCertFile string
Path to file with TLS certificate if -tls is set. Prefer ECDSA certs instead of RSA certs as RSA certs are slower. The provided certificate file is automatically re-read every second, so it can be dynamically updated
-tlsCipherSuites array
Optional list of TLS cipher suites for incoming requests over HTTPS if -tls is set. See the list of supported cipher suites at https://pkg.go.dev/crypto/tls#pkg-constants
Supports an array of values separated by comma or specified via multiple flags.
-tlsKeyFile string
Path to file with TLS key if -tls is set. The provided key file is automatically re-read every second, so it can be dynamically updated
-tlsMinVersion string
Optional minimum TLS version to use for incoming requests over HTTPS if -tls is set. Supported values: TLS10, TLS11, TLS12, TLS13
-version
Show VictoriaMetrics version
```
## How to build from sources
It is recommended using [binary releases](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/latest) - see `vmutils-*` archives there.
### Development build
1. [Install Go](https://golang.org/doc/install). The minimum supported version is Go 1.20.
1. Run `make vmbackup` from the root folder of [the repository](https://github.com/VictoriaMetrics/VictoriaMetrics).
It builds `vmbackup` binary and puts it into the `bin` folder.
### Production build
1. [Install docker](https://docs.docker.com/install/).
1. Run `make vmbackup-prod` from the root folder of [the repository](https://github.com/VictoriaMetrics/VictoriaMetrics).
It builds `vmbackup-prod` binary and puts it into the `bin` folder.
### Building docker images
Run `make package-vmbackup`. It builds `victoriametrics/vmbackup:<PKG_TAG>` docker image locally.
`<PKG_TAG>` is auto-generated image tag, which depends on source code in the repository.
The `<PKG_TAG>` may be manually set via `PKG_TAG=foobar make package-vmbackup`.
The base docker image is [alpine](https://hub.docker.com/_/alpine) but it is possible to use any other base image
by setting it via `<ROOT_IMAGE>` environment variable. For example, the following command builds the image on top of [scratch](https://hub.docker.com/_/scratch) image:
```console
ROOT_IMAGE=scratch make package-vmbackup
```
vmbackup docs can be edited at [docs/vmbackup.md](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/docs/vmbackup.md).

View File

@@ -168,7 +168,7 @@ See the docs at https://docs.victoriametrics.com/vmbackup.html .
func newSrcFS() (*fslocal.FS, error) {
if err := snapshot.Validate(*snapshotName); err != nil {
return nil, fmt.Errorf("invalid -snapshotName=%q: %s", *snapshotName, err)
return nil, fmt.Errorf("invalid -snapshotName=%q: %w", *snapshotName, err)
}
snapshotPath := filepath.Join(*storageDataPath, "snapshots", *snapshotName)

View File

@@ -1,548 +1,3 @@
# vmbackupmanager
See vmbackupmanager docs [here](https://docs.victoriametrics.com/vmbackupmanager.html).
***vmbackupmanager is a part of [enterprise package](https://docs.victoriametrics.com/enterprise.html).
It is available for download and evaluation at [releases page](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/latest).
See how to request a free trial license [here](https://victoriametrics.com/products/enterprise/trial/).***
The VictoriaMetrics backup manager automates regular backup procedures. It supports the following backup intervals: **hourly**, **daily**, **weekly** and **monthly**.
Multiple backup intervals may be configured simultaneously. I.e. the backup manager creates hourly backups every hour, while it creates daily backups every day, etc.
Backup manager must have read access to the storage data, so best practice is to install it on the same machine (or as a sidecar) where the storage node is installed.
The backup service makes a backup every hour and puts it to the latest folder and then copies data to the folders
which represent the backup intervals (hourly, daily, weekly and monthly)
The required flags for running the service are as follows:
* -eula - should be true and means that you have the legal right to run a backup manager. That can either be a signed contract or an email
with confirmation to run the service in a trial period.
* -storageDataPath - path to VictoriaMetrics or vmstorage data path to make backup from.
* -snapshot.createURL - VictoriaMetrics creates snapshot URL which will automatically be created during backup. Example: <http://victoriametrics:8428/snapshot/create>
* -dst - backup destination at [the supported storage types](https://docs.victoriametrics.com/vmbackup.html#supported-storage-types).
* -credsFilePath - path to file with GCS or S3 credentials. Credentials are loaded from default locations if not set.
See [https://cloud.google.com/iam/docs/creating-managing-service-account-keys](https://cloud.google.com/iam/docs/creating-managing-service-account-keys)
and [https://docs.aws.amazon.com/general/latest/gr/aws-security-credentials.html](https://docs.aws.amazon.com/general/latest/gr/aws-security-credentials.html).
Backup schedule is controlled by the following flags:
* -disableHourly - disable hourly run. Default false
* -disableDaily - disable daily run. Default false
* -disableWeekly - disable weekly run. Default false
* -disableMonthly - disable monthly run. Default false
By default, all flags are turned on and Backup Manager backups data every hour for every interval (hourly, daily, weekly and monthly).
The backup manager creates the following directory hierarchy at **-dst**:
* /latest/ - contains the latest backup
* /hourly/ - contains hourly backups. Each backup is named as *YYYY-MM-DD:HH*
* /daily/ - contains daily backups. Each backup is named as *YYYY-MM-DD*
* /weekly/ - contains weekly backups. Each backup is named as *YYYY-WW*
* /monthly/ - contains monthly backups. Each backup is named as *YYYY-MM*
To get the full list of supported flags please run the following command:
```console
./vmbackupmanager --help
```
The service creates a **full** backup each run. This means that the system can be restored fully
from any particular backup using [vmrestore](https://docs.victoriametrics.com/vmrestore.html).
Backup manager uploads only the data that has been changed or created since the most recent backup (incremental backup).
This reduces the consumed network traffic and the time needed for performing the backup.
See [this article](https://medium.com/@valyala/speeding-up-backups-for-big-time-series-databases-533c1a927883) for details.
*Please take into account that the first backup upload could take a significant amount of time as it needs to upload all the data.*
There are two flags which could help with performance tuning:
* -maxBytesPerSecond - the maximum upload speed. There is no limit if it is set to 0
* -concurrency - The number of concurrent workers. Higher concurrency may improve upload speed (default 10)
## Example of Usage
GCS and cluster version. You need to have a credentials file in json format with following structure:
credentials.json
```json
{
"type": "service_account",
"project_id": "<project>",
"private_key_id": "",
"private_key": "-----BEGIN PRIVATE KEY-----\-----END PRIVATE KEY-----\n",
"client_email": "test@<project>.iam.gserviceaccount.com",
"client_id": "",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/test%40<project>.iam.gserviceaccount.com"
}
```
Backup manager launched with the following configuration:
```console
export NODE_IP=192.168.0.10
export VMSTORAGE_ENDPOINT=http://127.0.0.1:8428
./vmbackupmanager -dst=gs://vmstorage-data/$NODE_IP -credsFilePath=credentials.json -storageDataPath=/vmstorage-data -snapshot.createURL=$VMSTORAGE_ENDPOINT/snapshot/create -eula
```
Expected logs in vmbackupmanager:
```console
info lib/backup/actions/backup.go:131 server-side copied 81 out of 81 parts from GCS{bucket: "vmstorage-data", dir: "192.168.0.10//latest/"} to GCS{bucket: "vmstorage-data", dir: "192.168.0.10//weekly/2020-34/"} in 2.549833008s
info lib/backup/actions/backup.go:169 backed up 853315 bytes in 2.882 seconds; deleted 0 bytes; server-side copied 853315 bytes; uploaded 0 bytes
```
Expected logs in vmstorage:
```console
info VictoriaMetrics/lib/storage/table.go:146 creating table snapshot of "/vmstorage-data/data"...
info VictoriaMetrics/lib/storage/storage.go:311 deleting snapshot "/vmstorage-data/snapshots/20200818201959-162C760149895DDA"...
info VictoriaMetrics/lib/storage/storage.go:319 deleted snapshot "/vmstorage-data/snapshots/20200818201959-162C760149895DDA" in 0.169 seconds
```
The result on the GCS bucket
* The root folder
<img alt="root folder" src="vmbackupmanager_root_folder.png">
* The latest folder
<img alt="latest folder" src="vmbackupmanager_latest_folder.png">
Please, see [vmbackup docs](https://docs.victoriametrics.com/vmbackup.html#advanced-usage) for more examples of authentication with different
storage types.
## Backup Retention Policy
Backup retention policy is controlled by:
* -keepLastHourly - keep the last N hourly backups. Disabled by default
* -keepLastDaily - keep the last N daily backups. Disabled by default
* -keepLastWeekly - keep the last N weekly backups. Disabled by default
* -keepLastMonthly - keep the last N monthly backups. Disabled by default
> *Note*: 0 value in every keepLast flag results into deletion of ALL backups for particular type (hourly, daily, weekly and monthly)
> *Note*: retention policy does not enforce removing previous versions of objects in object storages such if versioning is enabled. See [these docs](https://docs.victoriametrics.com/vmbackup.html#permanent-deletion-of-objects-in-s3-and-compatible-storages) for more details.
Lets assume we have a backup manager collecting daily backups for the past 10 days.
<img alt="retention policy daily before retention cycle" src="vmbackupmanager_rp_daily_1.png">
We enable backup retention policy for backup manager by using following configuration:
```console
export NODE_IP=192.168.0.10
export VMSTORAGE_ENDPOINT=http://127.0.0.1:8428
./vmbackupmanager -dst=gs://vmstorage-data/$NODE_IP -credsFilePath=credentials.json -storageDataPath=/vmstorage-data -snapshot.createURL=$VMSTORAGE_ENDPOINT/snapshot/create
-keepLastDaily=3 -eula
```
Expected logs in backup manager on start:
```console
info lib/logger/flag.go:20 flag "keepLastDaily" = "3"
```
Expected logs in backup manager during retention cycle:
```console
info app/vmbackupmanager/retention.go:106 daily backups to delete [daily/2021-02-13 daily/2021-02-12 daily/2021-02-11 daily/2021-02-10 daily/2021-02-09 daily/2021-02-08 daily/2021-02-07]
```
The result on the GCS bucket. We see only 3 daily backups:
<img alt="retention policy daily after retention cycle" src="vmbackupmanager_rp_daily_2.png">
### Protection backups against deletion by retention policy
You can protect any backup against deletion by retention policy with the `vmbackupmanager backups lock` command.
For instance:
```console
./vmbackupmanager backup lock daily/2021-02-13 -dst=<DST_PATH> -storageDataPath=/vmstorage-data -eula
```
After that the backup won't be deleted by retention policy.
You can view the `locked` attribute in backup list:
```console
./vmbackupmanager backup list -dst=<DST_PATH> -storageDataPath=/vmstorage-data -eula
```
To remove protection, you can use the command `vmbackupmanager backups unlock`.
For example:
```console
./vmbackupmanager backup unlock daily/2021-02-13 -dst=<DST_PATH> -storageDataPath=/vmstorage-data -eula
```
## API methods
`vmbackupmanager` exposes the following API methods:
* GET `/api/v1/backups` - returns list of backups in remote storage.
Example output:
```json
[{"name":"daily/2023-04-07","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:07+00:00"},{"name":"hourly/2023-04-07:11","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:06+00:00"},{"name":"latest","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:04+00:00"},{"name":"monthly/2023-04","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:10+00:00"},{"name":"weekly/2023-14","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:09+00:00"}]
```
> Note: `created_at` field is in RFC3339 format.
* GET `/api/v1/backups/<BACKUP_NAME>` - returns backup info by name.
Example output:
```json
{"name":"daily/2023-04-07","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:07+00:00","locked":true}
```
* PUT `/api/v1/backups/<BACKUP_NAME>` - update "locked" attribute for backup by name.
Example request body:
```json
{"locked":true}
```
Example response:
```json
{"name":"daily/2023-04-07","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:07+00:00", "locked": true}
```
* POST `/api/v1/restore` - saves backup name to restore when [performing restore](#restore-commands).
Example request body:
```json
{"backup":"daily/2022-10-06"}
```
* GET `/api/v1/restore` - returns backup name from restore mark if it exists.
Example response:
```json
{"backup":"daily/2022-10-06"}
```
* DELETE `/api/v1/restore` - delete restore mark.
## CLI
`vmbackupmanager` exposes CLI commands to work with [API methods](#api-methods) without external dependencies.
Supported commands:
```console
vmbackupmanager backup
vmbackupmanager backup list
List backups in remote storage
vmbackupmanager backup lock
Locks backup in remote storage against deletion
vmbackupmanager backup unlock
Unlocks backup in remote storage for deletion
vmbackupmanager restore
Restore backup specified by restore mark if it exists
vmbackupmanager restore get
Get restore mark if it exists
vmbackupmanager restore delete
Delete restore mark if it exists
vmbackupmanager restore create [backup_name]
Create restore mark
```
By default, CLI commands are using `http://127.0.0.1:8300` endpoint to reach `vmbackupmanager` API.
It can be changed by using flag:
```
-apiURL string
vmbackupmanager address to perform API requests (default "http://127.0.0.1:8300")
```
### Backup commands
`vmbackupmanager backup list` lists backups in remote storage:
```console
$ ./vmbackupmanager backup list
[{"name":"daily/2023-04-07","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:07+00:00"},{"name":"hourly/2023-04-07:11","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:06+00:00"},{"name":"latest","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:04+00:00"},{"name":"monthly/2023-04","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:10+00:00"},{"name":"weekly/2023-14","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:09+00:00"}]
```
### Restore commands
Restore commands are used to create, get and delete restore mark.
Restore mark is used by `vmbackupmanager` to store backup name to restore when running restore.
Create restore mark:
```console
$ ./vmbackupmanager restore create daily/2022-10-06
```
Get restore mark if it exists:
```console
$ ./vmbackupmanager restore get
{"backup":"daily/2022-10-06"}
```
Delete restore mark if it exists:
```console
$ ./vmbackupmanager restore delete
```
Perform restore:
```console
$ /vmbackupmanager-prod restore -dst=gs://vmstorage-data/$NODE_IP -credsFilePath=credentials.json -storageDataPath=/vmstorage-data
```
Note that `vmsingle` or `vmstorage` should be stopped before performing restore.
If restore mark doesn't exist at `storageDataPath`(restore wasn't requested) `vmbackupmanager restore` will exit with successful status code.
### How to restore backup via CLI
1. Run `vmbackupmanager backup list` to get list of available backups:
```console
$ /vmbackupmanager-prod backup list
[{"name":"daily/2023-04-07","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:07+00:00"},{"name":"hourly/2023-04-07:11","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:06+00:00"},{"name":"latest","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:04+00:00"},{"name":"monthly/2023-04","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:10+00:00"},{"name":"weekly/2023-14","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:09+00:00"}]
```
1. Run `vmbackupmanager restore create` to create restore mark:
- Use relative path to backup to restore from currently used remote storage:
```console
$ /vmbackupmanager-prod restore create daily/2023-04-07
```
- Use full path to backup to restore from any remote storage:
```console
$ /vmbackupmanager-prod restore create azblob://test1/vmbackupmanager/daily/2023-04-07
```
1. Stop `vmstorage` or `vmsingle` node
1. Run `vmbackupmanager restore` to restore backup:
```console
$ /vmbackupmanager-prod restore -credsFilePath=credentials.json -storageDataPath=/vmstorage-data
```
1. Start `vmstorage` or `vmsingle` node
### How to restore in Kubernetes
1. Ensure there is an init container with `vmbackupmanager restore` in `vmstorage` or `vmsingle` pod.
For [VictoriaMetrics operator](https://docs.victoriametrics.com/operator/VictoriaMetrics-Operator.html) deployments it is required to add:
```yaml
vmbackup:
restore:
onStart:
enabled: "true"
```
See operator `VMStorage` schema [here](https://docs.victoriametrics.com/operator/api.html#vmstorage) and `VMSingle` [here](https://docs.victoriametrics.com/operator/api.html#vmsinglespec).
1. Enter container running `vmbackupmanager`
1. Use `vmbackupmanager backup list` to get list of available backups:
```console
$ /vmbackupmanager-prod backup list
[{"name":"daily/2023-04-07","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:07+00:00"},{"name":"hourly/2023-04-07:11","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:06+00:00"},{"name":"latest","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:04+00:00"},{"name":"monthly/2023-04","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:10+00:00"},{"name":"weekly/2023-14","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:09+00:00"}]
```
1. Use `vmbackupmanager restore create` to create restore mark:
- Use relative path to backup to restore from currently used remote storage:
```console
$ /vmbackupmanager-prod restore create daily/2023-04-07
```
- Use full path to backup to restore from any remote storage:
```console
$ /vmbackupmanager-prod restore create azblob://test1/vmbackupmanager/daily/2023-04-07
```
1. Restart pod
#### Restore cluster into another cluster
These steps are assuming that [VictoriaMetrics operator](https://docs.victoriametrics.com/operator/VictoriaMetrics-Operator.html) is used to manage `VMCluster`.
Clusters here are referred to as `source` and `destination`.
1. Create a new cluster with access to *source* cluster `vmbackupmanager` storage and same number of storage nodes.
Add the following section in order to enable restore on start (operator `VMStorage` schema can be found [here](https://docs.victoriametrics.com/operator/api.html#vmstorage):
```yaml
vmbackup:
restore:
onStart:
enabled: "true"
```
Note: it is safe to leave this section in the cluster configuration, since it will be ignored if restore mark doesn't exist.
> Important! Use different `-dst` for *destination* cluster to avoid overwriting backup data of the *source* cluster.
1. Enter container running `vmbackupmanager` in *source* cluster
1. Use `vmbackupmanager backup list` to get list of available backups:
```console
$ /vmbackupmanager-prod backup list
[{"name":"daily/2023-04-07","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:07+00:00"},{"name":"hourly/2023-04-07:11","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:06+00:00"},{"name":"latest","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:04+00:00"},{"name":"monthly/2023-04","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:10+00:00"},{"name":"weekly/2023-14","size_bytes":318837,"size":"311.4ki","created_at":"2023-04-07T16:15:09+00:00"}]
```
1. Use `vmbackupmanager restore create` to create restore mark at each pod of the *destination* cluster.
Each pod in *destination* cluster should be restored from backup of respective pod in *source* cluster.
For example: `vmstorage-source-0` in *source* cluster should be restored from `vmstorage-destination-0` in *destination* cluster.
```console
$ /vmbackupmanager-prod restore create s3://source_cluster/vmstorage-source-0/daily/2023-04-07
```
1. Restart `vmstorage` pods of *destination* cluster. On pod start `vmbackupmanager` will restore data from the specified backup.
## Monitoring
`vmbackupmanager` exports various metrics in Prometheus exposition format at `http://vmbackupmanager:8300/metrics` page. It is recommended setting up regular scraping of this page
either via [vmagent](https://docs.victoriametrics.com/vmagent.html) or via Prometheus, so the exported metrics could be analyzed later.
Use the official [Grafana dashboard](https://grafana.com/grafana/dashboards/17798-victoriametrics-backupmanager/) for `vmbackupmanager` overview.
Graphs on this dashboard contain useful hints - hover the `i` icon in the top left corner of each graph in order to read it.
If you have suggestions for improvements or have found a bug - please open an issue on github or add
a review to the dashboard.
## Configuration
### Flags
Pass `-help` to `vmbackupmanager` in order to see the full list of supported
command-line flags with their descriptions.
The shortlist of configuration flags is the following:
```
vmbackupmanager performs regular backups according to the provided configs.
subcommands:
backup: provides auxiliary backup-related commands
restore: restores backup specified by restore mark if it exists
command-line flags:
-apiURL string
vmbackupmanager address to perform API requests (default "http://127.0.0.1:8300")
-concurrency int
The number of concurrent workers. Higher concurrency may reduce backup duration (default 10)
-configFilePath string
Path to file with S3 configs. Configs are loaded from default location if not set.
See https://docs.aws.amazon.com/general/latest/gr/aws-security-credentials.html
-configProfile string
Profile name for S3 configs. If no set, the value of the environment variable will be loaded (AWS_PROFILE or AWS_DEFAULT_PROFILE), or if both not set, DefaultSharedConfigProfile is used
-credsFilePath string
Path to file with GCS or S3 credentials. Credentials are loaded from default locations if not set.
See https://cloud.google.com/iam/docs/creating-managing-service-account-keys and https://docs.aws.amazon.com/general/latest/gr/aws-security-credentials.html
-customS3Endpoint string
Custom S3 endpoint for use with S3-compatible storages (e.g. MinIO). S3 is used if not set
-deleteAllObjectVersions
Whether to prune previous object versions when deleting an object. By default, when object storage has versioning enabled deleting the file removes only current version. This option forces removal of all previous versions. See: https://docs.victoriametrics.com/vmbackup.html#permanent-deletion-of-objects-in-s3-compatible-storages
-disableDaily
Disable daily run. Default false
-disableHourly
Disable hourly run. Default false
-disableMonthly
Disable monthly run. Default false
-disableWeekly
Disable weekly run. Default false
-dst string
The root folder of Victoria Metrics backups. Example: gs://bucket/path/to/backup/dir, s3://bucket/path/to/backup/dir or fs:///path/to/local/backup/dir
-enableTCP6
Whether to enable IPv6 for listening and dialing. By default, only IPv4 TCP and UDP are used
-envflag.enable
Whether to enable reading flags from environment variables in addition to the command line. Command line flag values have priority over values from environment vars. Flags are read only from the command line if this flag isn't set. See https://docs.victoriametrics.com/#environment-variables for more details
-envflag.prefix string
Prefix for environment variables if -envflag.enable is set
-eula
Deprecated, please use -license or -licenseFile flags instead. By specifying this flag, you confirm that you have an enterprise license and accept the ESA https://victoriametrics.com/legal/esa/ . This flag is available only in Enterprise binaries. See https://docs.victoriametrics.com/enterprise.html
-filestream.disableFadvise
Whether to disable fadvise() syscall when reading large data files. The fadvise() syscall prevents from eviction of recently accessed data from OS page cache during background merges and backups. In some rare cases it is better to disable the syscall if it uses too much CPU
-flagsAuthKey string
Auth key for /flags endpoint. It must be passed via authKey query arg. It overrides httpAuth.* settings
-fs.disableMmap
Whether to use pread() instead of mmap() for reading data files. By default, mmap() is used for 64-bit arches and pread() is used for 32-bit arches, since they cannot read data files bigger than 2^32 bytes in memory. mmap() is usually faster for reading small data chunks than pread()
-http.connTimeout duration
Incoming http connections are closed after the configured timeout. This may help to spread the incoming load among a cluster of services behind a load balancer. Please note that the real timeout may be bigger by up to 10% as a protection against the thundering herd problem (default 2m0s)
-http.disableResponseCompression
Disable compression of HTTP responses to save CPU resources. By default, compression is enabled to save network bandwidth
-http.idleConnTimeout duration
Timeout for incoming idle http connections (default 1m0s)
-http.maxGracefulShutdownDuration duration
The maximum duration for a graceful shutdown of the HTTP server. A highly loaded server may require increased value for a graceful shutdown (default 7s)
-http.pathPrefix string
An optional prefix to add to all the paths handled by http server. For example, if '-http.pathPrefix=/foo/bar' is set, then all the http requests will be handled on '/foo/bar/*' paths. This may be useful for proxied requests. See https://www.robustperception.io/using-external-urls-and-proxies-with-prometheus
-http.shutdownDelay duration
Optional delay before http server shutdown. During this delay, the server returns non-OK responses from /health page, so load balancers can route new requests to other servers
-httpAuth.password string
Password for HTTP server's Basic Auth. The authentication is disabled if -httpAuth.username is empty
-httpAuth.username string
Username for HTTP server's Basic Auth. The authentication is disabled if empty. See also -httpAuth.password
-httpListenAddr string
Address to listen for http connections (default ":8300")
-internStringCacheExpireDuration duration
The expiry duration for caches for interned strings. See https://en.wikipedia.org/wiki/String_interning . See also -internStringMaxLen and -internStringDisableCache (default 6m0s)
-internStringDisableCache
Whether to disable caches for interned strings. This may reduce memory usage at the cost of higher CPU usage. See https://en.wikipedia.org/wiki/String_interning . See also -internStringCacheExpireDuration and -internStringMaxLen
-internStringMaxLen int
The maximum length for strings to intern. A lower limit may save memory at the cost of higher CPU usage. See https://en.wikipedia.org/wiki/String_interning . See also -internStringDisableCache and -internStringCacheExpireDuration (default 500)
-keepLastDaily int
Keep last N daily backups. If 0 is specified next retention cycle removes all backups for given time period. (default -1)
-keepLastHourly int
Keep last N hourly backups. If 0 is specified next retention cycle removes all backups for given time period. (default -1)
-keepLastMonthly int
Keep last N monthly backups. If 0 is specified next retention cycle removes all backups for given time period. (default -1)
-keepLastWeekly int
Keep last N weekly backups. If 0 is specified next retention cycle removes all backups for given time period. (default -1)
-license string
Lisense key for VictoriaMetrics Enterprise. See https://victoriametrics.com/products/enterprise/ . Trial Enterprise license can be obtained from https://victoriametrics.com/products/enterprise/trial/ . This flag is available only in Enterprise binaries. The license key can be also passed via file specified by -licenseFile command-line flag
-license.forceOffline
Whether to enable offline verification for VictoriaMetrics Enterprise license key, which has been passed either via -license or via -licenseFile command-line flag. The issued license key must support offline verification feature. Contact info@victoriametrics.com if you need offline license verification. This flag is avilable only in Enterprise binaries
-licenseFile string
Path to file with license key for VictoriaMetrics Enterprise. See https://victoriametrics.com/products/enterprise/ . Trial Enterprise license can be obtained from https://victoriametrics.com/products/enterprise/trial/ . This flag is available only in Enterprise binaries. The license key can be also passed inline via -license command-line flag
-loggerDisableTimestamps
Whether to disable writing timestamps in logs
-loggerErrorsPerSecondLimit int
Per-second limit on the number of ERROR messages. If more than the given number of errors are emitted per second, the remaining errors are suppressed. Zero values disable the rate limit
-loggerFormat string
Format for logs. Possible values: default, json (default "default")
-loggerJSONFields string
Allows renaming fields in JSON formatted logs. Example: "ts:timestamp,msg:message" renames "ts" to "timestamp" and "msg" to "message". Supported fields: ts, level, caller, msg
-loggerLevel string
Minimum level of errors to log. Possible values: INFO, WARN, ERROR, FATAL, PANIC (default "INFO")
-loggerOutput string
Output for the logs. Supported values: stderr, stdout (default "stderr")
-loggerTimezone string
Timezone to use for timestamps in logs. Timezone must be a valid IANA Time Zone. For example: America/New_York, Europe/Berlin, Etc/GMT+3 or Local (default "UTC")
-loggerWarnsPerSecondLimit int
Per-second limit on the number of WARN messages. If more than the given number of warns are emitted per second, then the remaining warns are suppressed. Zero values disable the rate limit
-maxBytesPerSecond int
The maximum upload speed. There is no limit if it is set to 0
-memory.allowedBytes size
Allowed size of system memory VictoriaMetrics caches may occupy. This option overrides -memory.allowedPercent if set to a non-zero value. Too low a value may increase the cache miss rate usually resulting in higher CPU and disk IO usage. Too high a value may evict too much data from the OS page cache resulting in higher disk IO usage
Supports the following optional suffixes for size values: KB, MB, GB, TB, KiB, MiB, GiB, TiB (default 0)
-memory.allowedPercent float
Allowed percent of system memory VictoriaMetrics caches may occupy. See also -memory.allowedBytes. Too low a value may increase cache miss rate usually resulting in higher CPU and disk IO usage. Too high a value may evict too much data from the OS page cache which will result in higher disk IO usage (default 60)
-metricsAuthKey string
Auth key for /metrics endpoint. It must be passed via authKey query arg. It overrides httpAuth.* settings
-pprofAuthKey string
Auth key for /debug/pprof/* endpoints. It must be passed via authKey query arg. It overrides httpAuth.* settings
-pushmetrics.extraLabel array
Optional labels to add to metrics pushed to -pushmetrics.url . For example, -pushmetrics.extraLabel='instance="foo"' adds instance="foo" label to all the metrics pushed to -pushmetrics.url
Supports an array of values separated by comma or specified via multiple flags.
-pushmetrics.interval duration
Interval for pushing metrics to -pushmetrics.url (default 10s)
-pushmetrics.url array
Optional URL to push metrics exposed at /metrics page. See https://docs.victoriametrics.com/#push-metrics . By default, metrics exposed at /metrics page aren't pushed to any remote storage
Supports an array of values separated by comma or specified via multiple flags.
-runOnStart
Upload backups immediately after start of the service. Otherwise the backup starts on new hour
-s3ForcePathStyle
Prefixing endpoint with bucket name when set false, true by default. (default true)
-s3StorageClass string
The Storage Class applied to objects uploaded to AWS S3. Supported values are: GLACIER, DEEP_ARCHIVE, GLACIER_IR, INTELLIGENT_TIERING, ONEZONE_IA, OUTPOSTS, REDUCED_REDUNDANCY, STANDARD, STANDARD_IA.
See https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html
-snapshot.createURL string
VictoriaMetrics create snapshot url. When this is given a snapshot will automatically be created during backup.Example: http://victoriametrics:8428/snapshot/create
-snapshot.deleteURL string
VictoriaMetrics delete snapshot url. Optional. Will be generated from snapshot.createURL if not provided. All created snaphosts will be automatically deleted.Example: http://victoriametrics:8428/snapshot/delete
-storageDataPath string
Path to VictoriaMetrics data. Must match -storageDataPath from VictoriaMetrics or vmstorage (default "victoria-metrics-data")
-tls
Whether to enable TLS for incoming HTTP requests at -httpListenAddr (aka https). -tlsCertFile and -tlsKeyFile must be set if -tls is set
-tlsCertFile string
Path to file with TLS certificate if -tls is set. Prefer ECDSA certs instead of RSA certs as RSA certs are slower. The provided certificate file is automatically re-read every second, so it can be dynamically updated
-tlsCipherSuites array
Optional list of TLS cipher suites for incoming requests over HTTPS if -tls is set. See the list of supported cipher suites at https://pkg.go.dev/crypto/tls#pkg-constants
Supports an array of values separated by comma or specified via multiple flags.
-tlsKeyFile string
Path to file with TLS key if -tls is set. The provided key file is automatically re-read every second, so it can be dynamically updated
-tlsMinVersion string
Optional minimum TLS version to use for incoming requests over HTTPS if -tls is set. Supported values: TLS10, TLS11, TLS12, TLS13
-version
Show VictoriaMetrics version
```
vmbackupmanager docs can be edited at [docs/vmbackupmanager.md](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/docs/vmbackupmanager.md).

Binary file not shown.

Before

Width:  |  Height:  |  Size: 29 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 25 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 99 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 64 KiB

File diff suppressed because it is too large Load Diff

View File

@@ -358,10 +358,13 @@ func (c *Client) getSeries() ([]*Series, error) {
func (c *Client) do(q influx.Query) ([]queryValues, error) {
res, err := c.Query(q)
if err != nil {
return nil, fmt.Errorf("query %q err: %s", q.Command, err)
return nil, fmt.Errorf("query error: %s", err)
}
if res.Error() != nil {
return nil, fmt.Errorf("response error: %s", res.Error())
}
if len(res.Results) < 1 {
return nil, fmt.Errorf("exploration query %q returned 0 results", q.Command)
return nil, fmt.Errorf("query returned 0 results")
}
return parseResult(res.Results[0])
}

Some files were not shown because too many files have changed in this diff Show More