* feat: add query history
* fix: change detect keyUp for nav query history
* feat: set default query history
* feat: change graph legend
* update dependencies
* update codemirror version
* fix: correct update period time after zoom/pan
* fix: optimize data processing for the graph
* fix: eliminate memory leaks related to mouse events
* fix: correct display of straight line
* Merge branch 'master' into vmui-fix-reset-graph
* app/vmselect: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This removes the unneeded level of indirection and improves code readability.
The "prometheus" and "graphite" constants aren't going to change in the future, so there is no sense in hiding them behind constants.
This should improve the maximum data ingestion speed for highly-loaded vmagent instances
which run on beefy servers with many CPU cores and big amounts of RAM
Previously only the lower part of 64-bit hash was used for calculating the offset.
This may give uneven distribution in some cases. So let's use all the available 64 bits from the hash
for calculating the offset.
Prometheus allows to have groups with no rules, so we should support
it in vmalert as well for compatibility reasons.
It is also allowed to hot-reload empty groups by adding or removing rules.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Do not store in memory the response from the last scrape per each target if -promscrape.noStaleMarkers option is enabled.
This should reduce memory usage when the scraped targets return large responses.
Previously, ID for alert entity was generated without alertname or groupname.
This led to collision, when multiple alerting rules within the same group
producing same labelsets. E.g. expr: `sum(metric1) by (job) > 0` and
expr: `sum(metric2) by (job) > 0` could result into same labelset `job: "job"`.
The issue affects only UI and Web API parts of vmalert, because alert ID is used
only for displaying and finding active alerts. It does not affect state restore
procedure, since this label was added right before pushing to remote storage.
The change now adds all extra labels right after receiving response from the datasource.
And removes adding extra labels before pushing to remote storage.
Additionally, change introduces a new flag `Restored` which will be displayed in UI
for alerts which have been restored from remote storage on restart.
* adds tab as second separator for graphite text protocol
* changes indexFunc for indexAny
* Update lib/protoparser/graphite/parser_test.go
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* feat: add query history
* fix: change detect keyUp for nav query history
* feat: set default query history
* app/vmselect/vmui: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This should make visible the set flags at flag.Visit(), which is used later for logging
and exporting the `is_set` label for these flags at /metrics page
Commit fixes potential race condition when group update
and generating of ID() happens simultaneously.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Regression was introduced during code refactoring. It potentially
could lead to situation when SIGHUP signals were ignored while
vmalert was still busy with initing group manager.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The following errors:
vendor/cloud.google.com/go/storage/storage.go:1447:53: o.GetCustomerEncryption().GetKeySha256 undefined (type *"google.golang.org/genproto/googleapis/storage/v2".Object_CustomerEncryption has no field or method GetKeySha256)
vendor/cloud.google.com/go/storage/writer.go:439:10: q.GetCommittedSize undefined (type *"google.golang.org/genproto/googleapis/storage/v2".QueryWriteStatusResponse has no field or method GetCommittedSize)
The extra `/` may cause issues when additional path prefixes
are configured. Also, removing it makes it consistent
with the rest of declarations.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
vmctl: properly convert influx bools into integer representation
When using vmctl influx, the import would fail importing boolean fields
with:
```
failed to convert value "some".0 to float64: unexpected value type true
```
This converts `true` to `1` and `false` to `0`.
Fixes#1709
Sort series by a hash calculated from the series labels. This should guarantee "random" selection of the returned time series.
Previously the selection could be biased, since time series were sorted alphabetically by label names and label values.
Stream parsing mode can be automatically enabled when scraping targets with big response bodies
exceeding the -promscrape.minResponseSizeForStreamParse , so it must be always initialized.
This allows sending staleness marks and properly calculate scrape_series_added metric in stream parsing mode
at the cost of the increased memory usage, since now the potentially big response is kept
in the lastScrape byte slice per each scrapeWork.
In practice the memory usage increase shouldn't be big, since the response size
is usually much smaller than the parsed metrics from this response after the relabeling,
which usually adds a big pile of target-specific labels per each metric.
* vmalert: adjust `http.Transport.MaxIdleConns` value accordingly to `http.Transport.MaxIdleConnsPerHost`
`http.Transport.MaxIdleConnsPerHost` setting is controlled by `datasource.maxIdleConnections` flag,
while `http.Transport.MaxIdleConns` is inherited from DefaultTransport and is equal to `100`.
The fix adjusts `http.Transport.MaxIdleConns` value if it is lower than `http.Transport.MaxIdleConnsPerHost`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmauth: adjust `http.Transport.MaxIdleConns` value accordingly to `http.Transport.MaxIdleConnsPerHost`
`http.Transport.MaxIdleConnsPerHost` setting is controlled by `maxIdleConnsPerBackend` flag,
while `http.Transport.MaxIdleConns` is inherited from DefaultTransport and is equal to `100`.
The fix adjusts `http.Transport.MaxIdleConns` value if it is lower than `http.Transport.MaxIdleConnsPerHost`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The source link is controlled by `external.url` and `external.alert.source`
flags, in the same way as for alertmanager notifications.
The source link is added to Alerts list view, and specific Alert view.
* added guide for VM operator
* Update docs/guides/getting-started-with-vm-operator.md
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
* Update docs/guides/getting-started-with-vm-operator.md
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
* Fixed different typos and added improvements from proposals
* move remoteWrite.url to other place
* fixed typo
* rephrased vminsert explanation
* remove not needed parameters for default setup
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
* adds read-only mode for vmstorage
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/269
* changes order a bit
* moves isFreeDiskLimitReached var to storage struct
renames functions to be consistent
change protoparser api - with optional storage limit check for given openned storage
* renames freeSpaceLimit to ReadOnly
The list of functions, which can adjust lookbehind window is more limited than the rest of functions,
so it is better from maintainability and readability PoV using the allowlist instead of blocklist.
It is expected that the `deriv(m[d])` returns non-empty value if the lookbehind window `d`
contains less than 2 samples in the same way as `rate()` does.
This is a follow-up after 3e084be06b .
Previously, `predict_linear` returned slightly different results comparing
to Prometheus. The change makes linear regression algorithm compatible
with Prometheus.
`deriv` was excluded from the list of functions which can adjust the time
window for the same reasons.
The fix makes the binary comparison func to check for NaNs
before executing the actual comparison. This prevents VM
to return values for non-existing samples for expressions
which contain bool comparisons. Please see added test
for example.
It appeared, that `testRowsEqual` NaN comparison was incorrect.
The fix caused some tests to fail. Please see the change and
tests updated.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Adjustment results into discrepancy between Prometheus and VM on time windows
smaller than scrape interval.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The fix will always return zero if received set of items consists of one
element only, which also means no deviation.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The change affects `count/stddev/stdvar_over_time` funcs and makes
them to return NaN instead of zero when there is no datapoints
in a time window.
This is needed for improving compatibility with Prometheus.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The removed fast path optimisations weren't consistent with
`quantile` function behavior and results into discrepancy.
Specifically, results didn't match in cases when:
* 0 < phi > 1;
* values contain only one element.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: `quantile` func compatiblity with Prometheus
The `quantile` func was previously calculated by https://github.com/valyala/histogram
package. The result of such calculation was always the closest real value to
requested quantile. While in Prometheus implementation interpolation is used.
Such difference may result into discrepancy in output between Prometheus and
VictoriaMetrics.
This commit adds a Prometheus-like `quantile` function. It also used by other
functions which depend on it, such as `quantiles`, `quantile_over_time`, `median` etc.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1625
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: `quantile` review fixes
* quantile functions were split into multiple to provide
different API for already sorted data;
* float64sPool is used for reducing allocations. Items in pool may have
different sizes, but defining a new pool was complicates due to name collisions;
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: make sorting for query result similar to Prometheus
Updated sorting allows to get the order of series in result similar or equal
to what Prometheus returns.
The change is needed for compatibility reasons.
* Update app/vmselect/promql/exec_test.go
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
`omitempty` tag resulted into skipping this param on marshaling,
which was used as a checksum for groups configuration. Since on
config reload checksums are compared before applying changes,
any change to `interval` only didn't trigger config reload.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1641
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The commit contains the following changes:
- Show vmui when requesting /graph page in order to be compatible with Prometheus datasource in Grafana.
- Properly encode query args at vmui url.
- Set the number of points on the graph to the number of horizontal pixels divided by 2. Previously it was hardcoded to 30.
- Do not save server url to persistent storage at browser, since it should be always obtained from the url.
- Run `make vmui-update` for updating vmui embedded into VictoriaMetrics.
* feat: change url params for compatible prometheus
* style: add comment for TimeParams
* fix: change get default server for single version
* fix: change function for get query string value
* vmalert: add flag to limit the max value for auto-resovle duration for alerts
The new flag `rule.maxResolveDuration` suppose to limit max value for
alert.End param, which is used by notifiers like Alertmanager for alerts auto resolve.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1586
Also reduce CPU usage when applying `series_limit` to scrape targets with constant set of metrics.
The main idea is to perform the calculations on scrape_series_added and series_limit
only if the set of metrics exposed by the target has been changed.
Scrape targets rarely change the set of exposed metrics,
so this optimization should reduce CPU usage in general case.
These actions simlify metrics filtering. For example,
- action: keep_metrics
regex: 'foo|bar|baz'
would leave only metrics with `foo`, `bar` and `baz` names, while the rest of metrics will be deleted.
The commit also makes possible to split long regexps into multiple lines. For example, the following config is equivalent to the config above:
- action: keep_metrics
regex:
- foo
- bar
- baz
New UI pages:
/ - welcome page with API handlers list;
/groups - list of all rules per group;
/alerts - list of all active alerts;
/groupID/alertID/status - status of the active alert;
The number of series per target can be limited with the following options:
* Global limit with `-promscrape.maxSeriesPerTarget` command-line option.
* Per-target limit with `max_series: N` option in `scrape_config` section.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1561
* dasbhoard: replace `null` datasources
null datasource value may confuse Grafana and make it drop panel query in some
versions.
* docker: bump grafana image version
* dashboards: add URL variable selector to vmagent dashboard
* dashboards: add new panel `Remote write connection saturation` to vmagent dashboard
* alerts: add new alert for `Remote write connection saturation` panel of vmagent dashboard
* dashboards: add "Logging rate" panel to vmagent dashboard
The purpose of update is to make README and flags description more
clear to the reader. Especially, show that vm-account-id flag is required
for clustered version of VM.
* vmalert: allow extra GET params in datasource package
ExtraParams will be added as GET params to every HTTP request made by datasource.
The `roundDigits` param, for example, was substituted by corresponding extra param.
* vmalert: add nocache=1 param for replay process
The `nocache=1` param is VictoriaMetrics specific parameter which prevents it
from caching and boundaries aligning for queries. We set it to avoid cache
pollution in `replay` mode and also to avoid unnecessary time range boundaries
alignment.
* vmalert: mention nocache=1 in replay description
* vmalert: fix bug with unused param
* vmalert: remove `vmalert_execution_duration_seconds` metric
The summary for `vmalert_execution_duration_seconds` metric gives no additional
value comparing to `vmalert_iteration_duration_seconds` metric.
* vmalert: update config reload success metric properly
Previously, if there was unsuccessfull attempt to reload config and then
rollback to previous version - the metric remained set to 0.
* vmalert: add Grafana dashboard to overview application metrics
* docker: include vmalert target into list for scraping
* vmalert: extend notifier metrics with addr label
The change adds an `addr` label to metrics for alerts_sent and alerts_send_errors
to identify which exact address is having issues.
The according change was made to vmalert dashboard.
* vmalert: update documentation and docker environment for vmalert's dashboard
Mention Grafana's dashboard in vmalert's README in a new section #Monitoring.
Update docker-compose env to automatically add vmalert's dashboard.
Update docker-compose README with additional info about services.
* Document all the functions supported by MetricsQL, including PromQL functions
* Group functions by their type: rollup functions, transform functions, label manipulation functions and aggregate functions.
* Document implicit query transformations.
Store the scraped response body instead of storing the parsed and relabeld metrics.
This should reduce memory usage, since the response body takes less memory than the parsed and relabeled metrics.
This is especially true for Kubernetes service discovery, which adds many long labels for all the scraped metrics.
This should also reduce CPU usage, since the marshaling of the parsed
and relabeld metrics has been substituted by response body copying.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1526
Before, metric `vm_request_duration_seconds` was update only on successful
attempts which could be misleading. For example, timeout errors on netstorage
request may be not accounted in the metric and won't be visible on dashboards.
Using `defer` statement to update the metric after query arguments validation
may improve the situation.
This option allows reducing CPU usage a bit when VictoriaMetrics is used
for collecting and processing non-Prometheus data. For example, InfluxDB line protocol, Graphite, OpenTSDB, CSV, etc.
This option can be useful when vmagent consumes too much additional memory
for staleness markers functionality and when staleness markers aren't needed.
* vmalert: allow to disable automatically added path to remote write address via disablePathAppend flag
* docs: update docs to include remoteWrite.disablePathAppend
This metric can be used for determining high saturation of every connection to remote storage with
an alerting query `rate(vmagent_remotewrite_send_duration_seconds_total) > 0.9s`.
This query triggers when a connection is satureated by more than 90%
This reverts commit 94dfcb6747a3b29a11d14e71bea21a2312bb6346.
It is better to remove staleness marks (decimal.StaleNaN) before calling rollupConfig.Do, e.g. in preFunc
Prometheus stalenss marks shouldn't be changed in removeCounterResets. Otherwise they will be converted to an ordinary NaN values,
which couldn't be removed in dropStaleNaNs() function later. This may result in incorrect calculations for rollup functions.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1526
* docs: update "number of open files" tuning recomendation
Make "number of open files" recomendation not only Prometheus specific to avoid
confusion for users who does not use Prometheus.
* docs: mention fstrim in Tuning section
* vmalert: expose new metrics for tracking number of produced samples during last evaluation
Two new metrics were added to track the number of samples produced during the last evaluation:
* vmalert_recording_rules_last_evaluation_samples
* vmalert_alerting_rules_last_evaluation_samples
The gauge type is used to remain consistent with Prometheus metric
`prometheus_rule_group_last_evaluation_samples` which is on the group level.
However, the counter type was considered as well.
Two metrics instead of one are used to make it easier to separate recording and
alerting rules. It is likely, number of samples produced by recording rules is
more important so people will refer to it more frequently.
The expected usage of the new metric is the following:
```
- alert: RecordingRuleReturnsEmptyResults
expr: sum(vmalert_recording_rules_last_evaluation_samples) by(recording) < 1
annotations:
summary: Recording rule {{$labels.recording}} returns empty results.
Please verify expression correctness.
```
Addresses https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1494
* vmalert: rename `vmalert_alerts_error` to `vmalert_alerting_rules_error` to remain consistent with recording rules metrics
* feature: Add multitenant for vmagent
* Minor fix
* Fix rcs index out of range
* Minor fix
* Fix multi Init
* Fix multi Init
* Fix multi Init
* Add default multi
* Adjust naming
* Add TenantInserted metrics
* Add TenantInserted metrics
* fix: remove unused metrics for vmagent
* fix: remove unused metrics for vmagent
Co-authored-by: mghader <marc.ghader@ubisoft.com>
Co-authored-by: Sebastian YEPES <syepes@gmail.com>
* Rename -search.maxMetricsPointSearch to -search.maxSamplesPerQuery, so it is more consistent with the existing -search.maxSamplesPerSeries
* Move the -search.maxSamplesPerQuery from vmstorage to vmselect, so it could effectively limit the number of raw samples obtained from all the vmstorage nodes
* Document the -search.maxSamplesPerQuery in docs/CHANGELOG.md
* fix: move request button to server input
* feat: add switch for query autocomplete
* refactor: rename state for popover open
* feat: add detect os by userAgent
* fix: change hotkey to run query for mac
* fix: change detect mac os
* fix: change div to span inside Typography
Co-authored-by: yury <yurymolodov@victoriametrics.com>
This should improve the readability and usefullness of the /api/v1/status/top_queries when debugging slow queries
or queries that take too much cpu time.
Previously the switch occurred when the cache size becomes 100% of its capacity. The cache size could never reach 100% capacity.
This could prevent from switching from the split cache to full cache, thus reducing the cache effectiveness.
- Support durations anywhere in MetricsQL queries. E.g. sum_over_time(m[1h])/1h is equivalent to sum_over_time(m[1h])/3600
- Support durations without suffix. E.g. rate(m[300]) is equivalent to rate(m[5m])
Previously needsDedup() could return true if the de-duplication wasn't needed for the following case:
d < interval
/ \
| v | v |
interval interval
Now it properly returns false for this case
This reverts commit 7c6d3981bf.
Reason for revert: high contention at bucket16Pool on systems with big number of CPU cores.
This slows down query processing significantly.
This should reduce memory usage on systems with big number of CPU cores,
since every inmemoryPart object occupies at least 64KB of memory and sync.Pool maintains
a separate pool inmemoryPart objects per each CPU core.
Though the new scheme for the pool worsens per-cpu cache locality, this should be amortized
by big sizes of inmemoryPart objects.
CPU and memory profiles show that the pool capacity for inmemoryBlock objects is too small.
This results in the increased load on memory allocation code in Go runtime.
Increase the pool capacity in order to reduce the load on Go runtime.
This should improve hit ratio for tagFiltersCache when big number of new time series are constantly registered
(aka high churn rate). This, in turn, should reduce CPU usage for queries over such time series.
Previously the stats for cache misses could be improperly counted, because it had inflated cache misses
if the entry was missing in the curr cache, but was existing in the prev cache.
The same applies to cache requests - they were inflated if the entry was missing in the curr cache.
To add a Copy button wrap code snippet with the following element:
```
<div class="with-copy" markdown="1">
<your-code-snippet>
</div>
```
See the changes to `Kubernetes monitoring with VictoriaMetrics Single` for details.
Previously the switch from `split` to `whole` mode had been performed too early,
e.g. when the current cache size became bigger than 1/4 of the allowed cache size.
Now it is performed when the current cache size becomes bigger than 1/2 of the allowed cache size.
This change can reduce memory usage for data ingestion path when big number of active time series are ingested.
One minute cache timeout result in slower queries in some production workloads where the interval
between query execution is in the range 1 minute - 2 minutes.
This should reduce memory usage on a system with high number of active time series and a high churn rate.
One minute is enough for caching the blocks needed for repeated queries (e.g. alerting rules, recording rules and dashboard refreshes).
This should reduce memory usage for the pool on systems with big number of CPU cores.
The sync.Pool maintains per-CPU pools, so the total number of objects in the pool
is proportional to the number of available CPU cores. The channel limits the number
of pooled objects by its own capacity. This means smaller number of pooled objects on average.
This should reduce resource usage (CPU, RAM, disk IO) at vmstorage nodes
if the addresses of vmstorage nodes are passed in random order to vminsert nodes.
* Change default value of '-remoteWrite.queues' to cgroup.AvailableCPUS() * 2 to reduce scrape interval
Default value of vmagent option '-remotewrite.queues' is 4 and default
size of vmagent ScheudleUnmarshalWorkers is number of CPUs, when available
CPUs is much greater than 4, e.g 32, worker are competing push queues
which will increase scrape interval and may cause scrape timeout.
* Update README and flag description
Co-authored-by: xiaozy <xiaozy01@fenbi.com>
Due to staleness handling, increase_pure were using incorrect previous value
during calculation in cases where series disappears for period longer
than staleness period and then returns back. The fix suppose to account
for a real datapoint value before staleness takes place. The fix should
remove unexpected spikes while using `increase_pure` for staled series.
* dashboard: update single version dash
The update contains the following changes:
* display anonymous memory usage metric. This metric suppose to reflect
memory usage of the process which can't be freed by OS;
* add legends to all panels. This is important for cases when users share
the screenshots;
* modify panels for Grafana v8.0.0
* dashboard: update single version dash tags
* dashboard: update vmagent dash
The update contains the following changes:
* display anonymous memory usage metric. This metric suppose to reflect
memory usage of the process which can't be freed by OS;
* add legends to all panels. This is important for cases when users share
the screenshots;
* modify panels for Grafana v8.0.0
This panic can be raised by the reverseProxy on aborted request to the backend.
So handle it (e.g. suppress) at reverseProxy.ServeHTTP call.
Do not suppress the panic at lib/httpserver generic HTTP handler,
since it may result in an inconsistent state left after the panicking handler.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1353
* vmalert: fix mistake with object reuse while parsing response
During the refactoring, the wrong optimisations was applied in
parse function which caused metric fields reset. The change removes
optimisation.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1369
* vmalert: add test to cover multiple metrics in one response
* vmalert: support rules backfilling (aka `replay`)
vmalert can `replay` configured rules in the past
and backfill results via remote write protocol.
It supports MetricsQL/PromQL storage as data source,
and can backfill data to remote write compatible
storage.
Supports recording and alerting rules `replay`. See more
details in README.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/836
* vmalert: review fixes
* vmalert: readme fixes
* new feature: relabel logging
Use scrape_configs[x].relabel_debug = true to log metric names inkl.
labels before and after relabeling. After relabeling related metrics
get dropped, i.e. not submitted to servers.
* vminsert wants relabel logging, too.
The pool for inmemoryBlock struct doesn't give any performance gains in production workloads,
while it may result in excess memory usage for inmemoryBlock structs inside the pool during
background merge of indexdb.
New flag `-rule.configCheckInterval` defines how often `vmalert` will re-read
config file. If it detects any changes, the config will be reloaded.
This behaviour is turned off by default.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/512
This speeds up bucket32.addBucketAtPos() when bucket32.buckets contains big number of items,
since the copying of bucket16 pointers is much faster than the copying of bucket16 objects.
This is a cpu profile for copying bucket16 objects:
10ms 13.43s (flat, cum) 32.01% of Total
10ms 120ms 650: b.b16his = append(b.b16his[:pos+1], b.b16his[pos:]...)
. . 651: b.b16his[pos] = hi
. 13.31s 652: b.buckets = append(b.buckets[:pos+1], b.buckets[pos:]...)
. . 653: b16 := &b.buckets[pos]
. . 654: *b16 = bucket16{}
. . 655: return b16
. . 656:}
This is a cpu profile for copying pointers to bucket16:
10ms 1.14s (flat, cum) 2.19% of Total
. 100ms 647: b.b16his = append(b.b16his[:pos+1], b.b16his[pos:]...)
. . 648: b.b16his[pos] = hi
10ms 700ms 649: b.buckets = append(b.buckets[:pos+1], b.buckets[pos:]...)
. 330ms 650: b16 := &bucket16{}
. . 651: b.buckets[pos] = b16
. . 652: return b16
. . 653:}
The new setting `extra_filter_labels` may be assigned to group.
If it is, then all rules within a group will automatically filter
for configured labels. The feature is well-described here
https://docs.victoriametrics.com#prometheus-querying-api-enhancements
New setting is compatible only with VM datasource.
Previously the blocked directories were removed sequentially by a single goroutine.
This can be not enough for highly loaded VictoriaMetrics that accepts millions of sample per second,
when big number of LSM parts are created and removed at high rate.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1313
* changes vmalert query function
for prometheus rules compatibility its better to use labels as map.
it simplifies template evaluation and allow to ignore can't evaluate field error
because map will return default value.
fixes https://github.com/VictoriaMetrics/operator/issues/243
These numbers are exposed via the following metrics:
- vmagent_hourly_series_limit_current_series
- vmagent_daily_series_limit_current_series
Expose also the limits via the following metrics:
- vmagent_hourly_series_limit_max_series
- vmagent_daily_series_limit_max_series
The `::tag` type is needed in cases when field and tag names are equal, which
results into unexpected results in InfluxQL. Setting the type explicitly helps
InfluxDB to understand which exact column we apply filter to.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1299
duplicates map helps to determine wheter extra labels has overriden
labels which make time series unique. It was using a sorted hashed
labels sequence as a key. But hashing algorithm could have collisions,
so it is more convenient to not use hashing at all.
Log message for recording rules duplicates was improved as well.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1293
These functions are non-trivial, while their code has minimal differences.
It is better from maintainability PoV to merge these functions into a single function.
It must match all the time series on the given time range.
Previously it was matched to all the time series without the restriction on the given time range.
Dependency updates must be under manual control, since the resulting code diffs must be reviewed manually for the sake of security.
It is done with `make vendor-update` now.
* Add vendor license checker, update codecov action, add dependbot for github actions
* update gitingore, temprorary turn on check
* fix action name
* change action rules to trigger only when vendor changes
* remove obsolete line from main action
Starting from v1.56.0 VM supports `round_digits` which allows to limit
the number of digits after the decimal point in response value. The feature
can be used to reduce entropy of produced by recording rules values
and significantly improve the compression. See more details in link below.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/525
Previously, `startGroup` could exit on restore errors despite the
`remoteRead.ignoreRestoreErrors` flag value. Now vmalert checks the
flag value before deciding whether to return error or just log it.
This should eliminate possible race when an update on endpoints depends on pods and/or services, which are missing in the cache yet.
This could result in missing targets based on endpoints or endpointslices.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1240
Alerting rules now can return specific error type ErrStateRestore to indicate
whether restore state procedure failed. Such errors were returned and logged
before as well. But now user can specify whether to just log these errors
(remoteRead.ignoreRestoreErrors=true) or to stop the process
(remoteRead.ignoreRestoreErrors=false). The latter is important when VM isn't
ready yet to serve queries from vmalert and it needs to wait.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1252
Panics may leave the process in inconsistent state. That's why it is better to stop the process after the panic
instead of recovering from the panic. Unfortunately, the standard net/http.Server recovers panics in request handlers.
See https://github.com/golang/go/issues/16542 . That's lib/httpserver must stop the process on itself after the panic.
* Simplify arguments list for fn `queryDataSource` to improve readbility
* vmalert: adjust `time` param according to rule evaluation interval
With this change, vmalert will start to use rule's evaluation interval
for truncating the `time` param. This is mostly needed to produce consistent
time series with timestamps unaffected by vmalert start time. Now, timestamp
becomes predictable.
Additionally, adjustment is similar to what Grafana does for plotting range graphs.
Hence, recording rule series and recording rule expression plotted in grafana
suppose to become similar in most of cases.
* changes vmalert Querier with per rule querier
it allows to changes some parametrs based on rule setting
for instance - alert type, tenant for cluster version or event endpoint url.
Previously, vmalert used `lastExecTime` timestamp when writing recording rules
to the remote storage. This may be incorrect, if vmalert uses `datasource.lookback` flag,
which means rule's expression will be executed at some moment in the past.
To avoid such situations, vmalert now will use returned timestamp instead of `lastExecTime`.
This should increase block sizes and subsequently increase the maximum possible bandwidth per each connection to remote storage.
This, in turn, should reduce the probability of storing the data in local buffers.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1235
This partially reverts fb82c4b9fa
It has been appeared that the additional memory allocation may result in higher GC pauses.
It is better to spend CPU time on copying bigger bucket16 structs instead of increasing query latencies due to higher GC pauses
* [draft] per tenant statistic
* updates metric name
update graph
adds link and example config
* quick fix
* adds grafana dashboard
adds example alert
Co-authored-by: f41gh7 <nik@victoriametrics.com>
Just like the existing infrastructure for `BUILDINFO_TAG`, this can ease the production of [reproducible builds](https://wiki.freebsd.org/ReproducibleBuilds).
(e.g. in FreeBSD the date the port was committed is used at build time, not the actual build time, so that an identical port produced at different times produces an identical executable)
* docs: drop table of contents for `vmctl`
We already have it autogenerated on .github.io, so no need to keep it.
* docs: mention OpenTSDB migration feature for vmctl
* docs: sync docs for `vmalert`
* add more documentation on OpenTSDB migration explaining what chunking means
* more clarification of OpenTSDB aggregations
* break out what a retention string becomes
* add more docs around retention strings
* add example of running program and fix mistake in how hard offsets are handled
* fix formatting
The major change is adding `sort` directive to docs. For those docs which are copied
from internal packages `sort` is added via makefile command. For the rest it is added
manually since they're updated manually as well.
The rest of changes is connected with markdown formatting. For example, changing headers
in some files (`##` => `#`) makes navigation on .github.io to look better. This especially
useful for `changelog` docs.
Table of contents for `vmctl` is dropped, since we already have it autogenerated on .github.io.
No link changes expected. The corresponding PR to `cluster` branch will be made in follow-up PR.
It should be faster querying all the labels and/or all the values instead of querying per-day labels/values on time ranges exceeding maxDaysForPerDaySearch
Remove async registration of apiWatchers, since it breaks discovering `role: endpoints` and `role: endpointslices` targets,
which depend on pod and service objects.
There is no need in reloading `endpoints` and `endpointslices` targets if the referenced `pod` or `service` objects change,
since in this case the corresponding `endpoints` and `endpointslices` objects should also change because they contain
ResourceVersion of the referenced `pod` or `service` objects, which is modified on object update.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1182
This option can be useful when samples for the same time series are ingested with distinct order of labels.
For example, metric{k1="v1",k2="v2"} and metric{k2="v2",k1="v1"}.
Alert `TooHighChurnRate24h` suppose to cover cases when churn rate
is low but results in multiple times higher number than total
number of active series.
* dashboard: update single node dashboard
* add number of new series created over last 24h;
* bump version requirements.
* dashboard: update vmagent dashboard
* add panel for open file descriptors;
* add panel for disk I/O;
* add panel for `vmagent_remotewrite_packets_dropped_total` metric;
* bump version requirements.
* adds blocks drop at 400 BadRequest status code
recieved from remote storage,
not expected that remote storage will be able to handle it on retry
* removes error logging for dropped blocks,
its expected error
3s evaluation interval is too small for practical setups. It can result in increased load on datasource.
So it is better to remove it from example config args, which are usually copy-pasted by novice users.
This adds the following new metrics for each VictoriaMetrics app:
* process_resident_memory_anonymous_bytes - the RSS share for memory allocated by the process itself.
This share cannot be freed by the OS, so it must be taken into account by OOM killer.
* process_resident_memory_pagecache_bytes - the RSS share for page cache memory (aka memory-mapped files).
This share can be freed by the OS at any time, so it must be ignored by OOM killer.
There was a signficant refactoring in the code responsible for time series search,
so it can result in both speed ups and slow downs depending on used queries.
The archive contains the following executables for Windows:
* vmagent
* vmalert
* vmauth
* vmctl
Other components - vmbackup, vmrestore, victoria-metrics - aren't supported for Windows yet
Do not pass filter metric ids to getMetricIDsForTagFilter, since it has been appeared that this slows down
the function by multiple times when it finds big number of metricIDs (tens of millions).
* dashboard: update single node dashboard
* add panel `Open FDs` for file descriptors metrics;
* add panel `Disk writes/reads` to show the real read/write
load on storage layer;
* add `process_resident_memory_bytes` metric to memory usage panel;
* add stats panel to show available CPUs, memory and disk space;
* rm flags panel since it didn't prove its usefulness.
* alerts: add alert for reaching FDs limit
Do not cache too big byte buffers and too big writeRequestCtx objects,
since it is cheaper to re-create them instead of wasting RAM for their caching.
This reverts 7f6f350ee1
Previously multiple scrape jobs could create multiple watchers for the same apiURL. Now only a single watcher is used.
This should reduce load on Kubernetes API server when many scrape job configs use Kubernetes service discovery.
See the description for `sample_limit` option from Prometheus docs:
Per-scrape limit on number of scraped samples that will be accepted.
If more than this number of samples are present after metric relabeling
the entire scrape will be treated as failed. 0 means no limit.
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config
Serialize reloading per-role objects, so they don't occupy too much memory when objects for many scrape jobs are simultaneously refreshed.
Do not reload per-role objects if they were already refreshed by concurrent goroutines. This should reduce load on Kubernetes API server
when big number of scrape jobs are configured for the same Kubernetes role.
This is a follow-up for 17b87725ed
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1113
Previously vmagent was creating a separate Kubernetes object cache per each scrape job.
This could result in increased memory usage when monitoring a Kubernetes cluster with big number of objects (pods / nodes / services, etc.)
as seen at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1113
Now it uses a shared map of scrape objects across multiple scrape jobs.
* fixes windows compilation,
adds signal impl for windows,
adds free space usage for windows,
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1036
NOTE victoria metrics database still CANNOT work under windows system,
only vmagent is supported.
To completly port victoria metrics, you have to fix issues with separators,
parsing and posix file removall
* rollback separator
* Adds windows setInformation api,
it must behave like unix, need to test it.
changes procutil
* check for invlaid param
* Fixes posix delete semantic
* refactored a bit
* fixes openbsd build
* removed windows api call
* Fixes code after windows add
* Update lib/procutil/signal_windows.go
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
The first and the last buckets are usually `[0 ... leMin]` and `(leMax ... +Inf)`. If they are merged with adjancent buckets,
then the resulting accuracy can suffer.
Main points:
* Revert changes outside lib/promscrape/discovery/kuberntes . These changes can be applied later in a separate commit
* Minimize changes in lib/promscrape/discovery/kubernetes compared to a93e644001
* Corner case fixes.
* started work on sd for k8s
* continue work on watch sd
* fixes
* continue work
* continue work on sd k8s
* disable gzip
* fixes typos
* log errror
* minor fix
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
The v1.15.0 exports the following additional metrics:
process_io_read_bytes_total - the number of bytes read via io syscalls such as read and pread
process_io_written_bytes_total - the number of bytes written via io syscalls such as write and pwrite
process_io_read_syscalls_total - the number of read syscalls such as read and pread
process_io_write_syscalls_total - the number of write syscalls such as write and pwrite
process_io_storage_read_bytes_total - the number of bytes read from storage layer
process_io_storage_written_bytes_total - the number of bytes written to storage layer
These metrics can be used for monitoring process io
While this may increase CPU and disk IO usage needed for background merge,
this also recudes CPU usage during queries in production. This is because
such queries tend to read recently added data and it is better to have lower number
of parts for such data in order to reduce CPU usage.
This partially reverts ebf8da3730
The filter arg has been removed in the commit c7ee2fabb8
because it was preventing from caching the number of matching time series per each tf.
Now the cache contains duration for tf execution, so the filter shouldn't break such caching.
For example `{label=~"foo\.bar"}` should be converted to `{label="foo.bar"}`. Previously it has was mistakenly conveted to `{label="foo\.bar"}` .
This could result in missing time series for such tag filters.
These metrics may result in big number of time series when vmagent scrapes thousands of targets and these targets constantly changes.
* It is better using `up == 0` query for determining failing targets.
* It is better using the following query for determining targets with exceeded limit on the number of metrics:
scrape_samples_scraped > 0 if up == 0
The `filter` arg breaks the logic for sorting tag filters by the matching metrics,
which may result in non-optimal performance during time series search.
Production workloads show that indexdb blocks must be cached unconditionally for reducing CPU usage.
This shouldn't increase memory usage too much, since unused blocks are removed from the cache every two minutes.
* init implementation for graphite alerts
* adds graphite support for vmalert
* small fix
* changes vmalert graphite api with type
* updates tests
* small fix
* fixes graphite parse
* Fixes graphite from time
It is better developing vmctl tool in VictoriaMetrics repository, so it could be released
together with the rest of vmutils tools such as vmalert, vmagent, vmbackup, vmrestore and vmauth.
These metrics could be useful for determining imporperly working scrape targets.
Note that these metrics are exported only for failing scrape targets. They aren't exposed for normally working targets.
* Include individual binary checksums for vmutils
* Consistent archive/binary artefacts between arm64/amd64 for vmutils
* architecture in arhcive, checksums
* not in binaries
* adds extra_label to all import apis,
changes priority for extra_label - now it has priority over original labels
* Update README.md
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
* Update README.md
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
* adds extra labels to vmagent import api
changes order for adding labels, now its added after user values
* adds tests for extra_label
* import fix
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
List contains examples for the alerting rules which might be executed
via `vmalert` to track the health state of VM components. It is assumed
that list will be revised and calibrated for each system individually.
On templates validation stage vmalert does not acutally send queries, so for complex
chained expression validation may fail. To avoid this, we add a blank sample in response
so validation can pass successfully. Later, during the rule execution, stub will be replaced
with real `query` function.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/989
Currently, alertmanager spams logs with `Notify attempt failed, will retry later` message
because default receiver is unreachable. The change updates default configuration with
blackhole receiver which means alertmanager will continue to accept alerts but won't make
attempts to send them anywhere.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/995
* adds snap docs,
adds release information for snap package,
adds docs notes about configuration management with snap package.
* adds release page mention
* version fix for snap, its awful
* revert version
* Apply suggestions from code review
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
Previously it could be modified in order to improve response cache hit ratio.
This is unneeded, since cache hit ratio should remain good because the query time range
should be already aligned to multiple of `step` values.
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/976
* Adds query stat handler,
for query and query_range api, victoriametrics tracks query execution time,
stats are expored at /api/v1/status/queries endpoint with topN param
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/907
* fixed query stats bugs
* improves queryStats tracker
* improves query stat
* small fix
* fix tests
* added more tests
* fixes 386 tests
* naming fixes
* adds drop for outdated records
* adds proxy_url support,
adds proxy_url to the dockerswarm, eureka, kubernetes and consul service discovery,
adds proxy_url to the scrape_config for targets scrapping,
http based proxy is supported atm,
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/503
* fixes imports
Previously such parts could remain undeleted for long durations until they are merged with other parts.
This should help for `-retentionPeriod` values smaller than one month.
It is unclear why `go install` doesn't work in Github Actions. Needs additional investigation.
The following error is returned now:
cannot find package "golang.org/x/lint/golint" in any of:
/opt/hostedtoolcache/go/1.15.5/x64/src/golang.org/x/lint/golint (from $GOROOT)
/home/runner/go/src/golang.org/x/lint/golint (from $GOPATH)
* adds ArrayDuration and ArrayBool flags,
makes sendTimeout and tlsInsecure configurable per remoteWrite url
* added backward compatibility testcases for ArrayDuration and ArrayBool
* fixes bool flag
* fixes test cases
The commit adds a support for template function `query`,
`first` and `value`. The function `query` executes
a MetricsQL query for active alerts. In vmalert we
update templates on every evaluation for active alerts
to keep them up to date. With `query` func it may become
a perf issue since it will fire a query on every execution.
We should keep it in mind for now.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/539
This is a preparation for Go 1.16, which deprecates `go get` for installing binaries.
See https://tip.golang.org/doc/go1.16#go-command :
go install, with or without a version suffix (as described above), is now the recommended way
to build and install packages in module mode. go get should be used with the -d flag to adjust
the current module's dependencies without building packages, and use of go get to build and install
packages is deprecated. In a future release, the -d flag will always be enabled.
This diff is just to suggest wording to let people know there is no future-compatible guaranteed way to make their own native format files for import yet.
This should prevent from possible 'memory leaks' when a pointer to ScrapeWork item stored in the slice
could prevent from releasing memory occupied by all the ScrapeWork items stored in the slice when they
are no longer used.
See the related commit e205975716 and the related issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/825
* adds consul watch api,
it must reduce load on consul service with blocking wait requests,
changed discoveryClient api with fetchResponseMeta callback.
* small fix
* fix after master merge
* adds watch client at discovery utils
* fixes consul watcher,
changes namings,
fixes data race
* small typo fix
* sanity fix
* fix naming and service node update
OpenMetrics timestamps are floating-point numbers, that represent Unix timestamp in seconds.
This differs from Prometheus exposition format, where timestamps are integer numbers representing Unix timestamp in milliseconds.
All the callers for fs.OpenReaderAt expect that the file will be opened.
So it is better to log fatal error inside fs.MustOpenReaderAt instead of leaving this to the caller.
It is recommended to have at least of 50% of free RAM on vmstorage nodes in order handle possible
RAM usage spikes during rolling upgrade for vmstorage nodes when time series
are re-routed from temporarily unavailable node to the remaining active nodes.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/891
This commit also adds `"isPartial":{true|false}` field to `/api/v1/*` responses. `"isPartial":true` is set when the response
is based on a partial data because some of vmstorage nodes weren't available during query processing.
* Add omitempty for DisableCompression and DisableKeepAlive fields in ScrapeConfig
* Add omitempty annotation to all the default/optional values
* Fix annotations after review
This should prevent from holding previously discovered []ScrapeWork slices when a part of discovered targets changes over time.
This should reduce memory usage for the case when big number of discovered scrape targets changes over time.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/825
The previous implementation treated extra labels (global and rule labels) as
separate label set to returned time series labels. Hence, time series always contained
only original labels and alert ID was generated from sorted labels key-values.
Extra labels didn't affect the generated ID and were applied on the following actions:
- templating for Summary and Annotations;
- persisting state via remote write;
- restoring state via remote read.
Such behaviour caused difficulties on restore procedure because extra labels had to be dropped
before checking the alert ID, but that not always worked. Consider the case when expression
returns the following time series `up{job="foo"}` and rule has extra label `job=bar`.
This would mean that restored alert ID will be always different to the real time series because
of collision.
To solve the situation extra labels are now always applied beforehand and `vmalert` doesn't
store original labels anymore. However, this could result into a new error situation.
Consider the case when expression returns two time series `up{job="foo"}` and `up{job="baz"}`,
while rule has extra label `job=bar`. In such case, applying extra labels will result into
two identical time series and `vmalert` will return error:
`result contains metrics with the same labelset after applying rule labels`
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/870
* adds leading forward slash check for scrapeURL path
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/835
* adds ready probe for scrape config initialization,
it should prevent metrics loss during vmagent rolling update,
/ready api will return 425 http code, if some scrape config still waits for initialization.
* updates docs
* Update app/vmagent/README.md
* renames var
* Update app/vmagent/README.md
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
Previously such blocks were cleaned after they weren't accessed during 10 minutes.
Now they are cleaned after one minute of missing access. This should reduce memory usage in general case.
* add short_version label to vm_app_version metric
use case: Version panel of Grafana dashboard should use a live query, but currently it uses a template query which becomes stale. Grafana is not able to preform regex substitution on labels.
* Update metrics.go
* fix compile
* dashboard: add `Storage full ETA` panel
The new panel suppose to help to estimate the time needed to run out of free
disk space.
Thx to @belm0 @hekmon
* disable legend for `Storage full ETA` panel
Label `alertgroup` was introduced in #611 and automatically added to generated
time series. By mistake, this new label wasn't correctly purged on restore event
and affected alert's ID uniqueness. This commit removes `alertgroup` label
in restore function.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/870
The new release of github.com/VictoriaMetrics/metricsql adds more optimizations for `foo{filters1} op bar{filters2}`:
* rollup_func(foo[d]) op bar{filters}
* transform_func(foo) op bar{filters}
* num_or_scalar op bar op baz{filters}
This should prevent from double counting for time series at the time when it changes label.
The most common case is in K8S, which changes pod uid label with each new deployment.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/748
This reverts commit ac45082216.
Reason for revert: the previous behavior for VictoriaMetrics is easier to understand and use by users -
functions, which don't change the meaning of the time series shouldn't drop metric name.
Now the following functions do not drop metric names:
* ceil
* floor
* round
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/674
This reverts commit bb61a4769b.
Reason for revert: the previous behavior for VictoriaMetrics is easier to understand and use by users -
functions, which don't change the meaning of the time series shouldn't drop metric name.
Now the following functions do not drop metric name:
* clamp_min
* clamp_max
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/674
This reverts commit e5202a4eae.
Reason for revert: the previous behavior for VictoriaMetrics is easier to understand and use by users -
functions, which don't change the meaning of the time series shouldn't drop metric name.
Now the following functions do not drop metric name:
* max_over_time
* min_over_time
* avg_over_time
* quantile_over_time
* geomean_over_time
* mode_over_time
* holt_winters
* predict_linear
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/674
Show also original labels for duplicate targets in error message in order to simplify debugging the issue.
Now `/targets` endpoint accepts optional `show_original_labels=1` query arg, which shows original labels for each target.
This may simplify debugging for target relabeling.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/651
Remove the code that uses metricIDs caches for the current and the previous hour during metricIDs search,
since this code became unused after implementing per-day inverted index almost a year ago.
While at it, fix a bug, which could prevent from finding time series with names containing dots (aka Graphite-like names
such as `foo.bar.baz`).
Previously the maximum cache lifetime has been limited by 10 seconds. Now it is extended up to a day.
This should reduce CPU usage in the following cases:
* when querying recently added data with small churn rate for time series
* when querying historical data
This reverts commit bab6a15ae0.
Reason for revert: the `fetchData` arg is used in cluster branch.
Leaving this arg in master branch makes smaller the diff with cluster branch.
* Add improvements to ec2 discovery
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/771
role_arn support with aws sts
instance iam_role support
refreshing temporary tokens
* Apply suggestions from code review
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
* changed implementation, removed tests, clean up code
* moved endpoint builder into getEC2APIResponse
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
The strategy is:
- Periodical flushing of inmemory blocks to files, so they aren't lost on unclean shutdown.
- Periodical syncing of metadata for persisted queues, so the metadata remains in sync with the persisted data.
- Automatic adjusting of too big chunk size when opening the queue. The chunk size may be bigger than the writer offset after unclean shutdown.
- Skipping of broken chunk file if it cannot be read.
- Fsyncing finalized chunk files.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/687
Previously certain errors in timestamps and/or values could be silently skipped,
which could lead to samples with zero values stored in the database.
Updates https://github.com/VictoriaMetrics/vmctl/issues/25
On config reload event `vmalert` reloads configuration for every group. While
it works for simple configurations, the more complex and heavy installations may
suffer from frequent config reloads.
The change introduces the `checksum` field for every group and is set to md5 hash
of yaml configuration. The checksum will change if on any change to group
definition like rules order or annotation change. Comparing the `checksum` field
on config reload event helps to detect if group should be updated.
The groups update is now done concurrently, so reload duration will be limited by
the slowest group now.
Partially solves #691 by improving config reload speed.
This is similar to https://github.com/prometheus/prometheus/pull/6838 , which will be added in Prometheus v2.21.
See https://github.com/prometheus/prometheus/releases/tag/v2.21.0-rc.1
* Added endpointslices discovery to k8s api
Started from 1.17 k8s version endpointslices is beta,
it allows to query k8s api for endpoints more efficient.
It presents at scrape_config.yaml as separate role for kubernetes_sd_config.
kubernetes_sd_config:
- role: endpointslices
* fixed typos, changed EndpointConditions signature - with values instead of pointers
This should reduce disk space usage when scraping targets containing metrics with identical names
such as `node_cpu_seconds_total`, histograms, quantiles, etc.
Expose `vm_timestamps_blocks_merged_total` and `vm_timestamps_bytes_saved_total` metrics for monitoring
the effectiveness of timestamp blocks merging.
* revise /api/v1/series docs
Further clarification for #735
* clarify how default range differers from Prometheus API
* avoid `start=0` suggestion when confirming delete, because it will cause a timeout in most deployments
* Update README.md
aws sdk has complicated logic for chosing profile name and we shouldn't set
it to `default` value. It leads to bugs and improper configuration.
Set it to empty value by default is safe. It will be automatically set to `default` by sdk.
These functions returns the number of raw samples that don't exceed `le` or are bigger than `gt`.
These functions are complement to already existing `share_le_over_time(m[d], le)` and `share_gt_over_time(m[d], gt)`.
* VMAlert start with empty rules dir
There are some applications (operator for instance), that generates alerts configuration at runtime
and vmalert must start correctly without rules to support this behaviour.
Later application will add rules files and send SIGHUP to vmalert,
which will trigger reading rules files and start rules exectuion.
Removing rules files with SIGHUP signal must stop rules execution and
vmalert will wait for new rules.
* imports sorted
* added test cases for empty rules, removed blank line
* fixed imports conflict
* updated tests
* feat: spread load of rule evaluation by group when starting new groups
* review: reduce the resulting diff.
* Update app/vmalert/group.go
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
The previous basis on `cap(sw.labels)` doesn't work anymore after 7785869ccc ,
because `sw.labels` may be reset multiple times when processing big number of rows.
* dashboard: rename var `datasource` to `ds` for consistency reason
Dasbhoards for cluster version or vmagent operate with datasource variable
named `ds`. For consistency sake we rename this variable in single node version
as well.
* dashboard: add instance variable picker
See dashboard reviews here https://grafana.com/grafana/dashboards/10229/reviews
* dashboard: limit number of buckets in histogram to 12 for vmagent dashboard
* dashboard: bump version requirement in description for single version
* dashboard: drop extra series override for single version
* dashboard: set Y-min to zero for most of panels in vmagent dashboard
Description for `-rule` flag uses as example specific chars like asterisks
which could be interpreted wrong by different shells. To avoid this, description
now contains quoted flag values.
See also #708
Arch 386 is a 32-bit architecture and interprets int type for numbers as an explicit int32,
whereas on most modern CPUs int is implicitly an int64. This makes tests to fail with
`int overflow` error.
Previously targets with big number of metrics and/or labels could generated too big buffers,
which then could be re-used when scraping targets with small number of metrics.
This resulted in memory waste.
Now big buffers are used only for targets with big number of metrics / labels,
while small buffers are used for targets with small number of metrics / labels.
The previous notion was inconsistent with what `decimal.Round` does.
According to [wiki](https://en.wikipedia.org/wiki/Significant_figures) rounding
applied to all significant figures, not just decimal ones.
* app/vmalert: extend metrics set exported by `vmalert` #573
New metrics were added to improve observability:
+ vmalert_alerts_pending{alertname, group} - number of pending alerts per group
per alert;
+ vmalert_alerts_acitve{alertname, group} - number of active alerts per group
per alert;
+ vmalert_alerts_error{alertname, group} - is 1 if alertname ended up with error
during prev execution, is 0 if no errors happened;
+ vmalert_recording_rules_error{recording, group} - is 1 if recording rule
ended up with error during prev execution, is 0 if no errors happened;
* vmalert_iteration_total{group, file} - now contains group and file name labels.
This should improve control over specific groups;
* vmalert_iteration_duration_seconds{group, file} - now contains group and file name labels. This should improve control over specific groups;
Some collisions for alerts and recording rules are possible, because neither
group name nor alert/recording rule name are unique for compatibility reasons.
Commit contains list of TODOs for Unregistering metrics since groups and rules
are ephemeral and could be removed without application restart. In order to
unlock Unregistering feature corresponding PR was filed - https://github.com/VictoriaMetrics/metrics/pull/13
* app/vmalert: extend metrics set exported by `vmalert` #573
The changes are following:
* add an ID label to rules metrics, since `name` collisions within one group is
a common case - see the k8s example alerts;
* supports metrics unregistering on rule updates. Consider the case when one rule
was added or removed from the group, or the whole group was added or removed.
The change depends on https://github.com/VictoriaMetrics/metrics/pull/16
where race condition for Unregister method was fixed.
Previously the limit has been raised to GOMAXPROCS, but it has been appeared that this
increases query latencies since more CPUs are busy with merges.
While at it, substitute `*MergeConcurrencyLimitCh` channels with simple integer limits.
Previously the `vm_slow_row_inserts_total` metric may be incremented multiple times for different data points per a single time series,
while only a single increment is needed when inserting the first data point for this time series.
`external.label` flag supposed to help to distinguish alert or recording rules
source in situations when more than one `vmalert` runs for the same datasource
or AlertManager.
due to memory align the metaindexRow structure use 64-byte pre object.
this commit changes the order of field, make metaindexRow use 56-byte pre
object.
Signed-off-by: Sasasu <su@sasasu.me>
This function works with both Prometheus-style and VictoriaMetrics-style buckets.
The function removes buckets with the lowest values in order to reserve the highest precision.
The function is useful for building heatmaps in Grafana from too big number of buckets.
Previously the time spent on inverted index search could exceed the configured `-search.maxQueryDuration`.
This commit stops searching in inverted index on query timeout.
Previously the enabled relabeling with `-relabelConfig` command-line flag could result in missing labels
if a single Influx line protocol message contains multiple field values.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/638
This condition may occur after the following sequence of events:
1) A goroutine enters the loop body when len(addRowsConcurrencyCh) == cap(addRowsConcurrencyCh) inside Storage.searchTSIDs.
2) All the goroutines return from Storage.AddRows.
3) The goroutine from step 1 blocks on searchTSIDsCond.Wait() inside the loop body.
The goroutine remains blocked until the next call to Storage.AddRows, which calls searchTSIDsCond.Signal().
This may take indefinite time.
This should prvent from inflated `increase()` results for time series that start from big initial values.
Such cases may occur when a label value changes in a metric without counter reset.
`vmagent` Grafana dashboard suppose to provide basic observability over multiple
`vmagent` instances. Dashboard is saved in Grafana export format so it can be easily
imported. It was also integrated into docker-compose environment.
This is a follow-up commit after 12b16077c4 ,
which didn't reset the `tsidCache` in all the required places.
This could result in indefinite errors like:
missing metricName by metricID ...; this could be the case after unclean shutdown; deleting the metricID, so it could be re-created next time
Fix this by resetting the cache inside deleteMetricIDs function.
* add vm_protoparser_rows_read_total metrics to promscrape
move vm_protoparser_rows_read_total for promscrape to better place
move vm_protoparser_rows_read_total for promscrape to better place
* remove possibility of infinity loop at prometheus parser
Array type flag is now defined as `value` type in flag description when printed.
This change adds additional description to every Array type flag so it would be
clear what exact type is used:
```
-remoteWrite.urlRelabelConfig array
Optional path to relabel config for the corresponding -remoteWrite.url
Supports array of values separated by comma or specified via multiple flags.
```
Metric `vm_persistentqueue_bytes_pending` is a gauge that shows current amount
of bytes in persistentqueue flushed on disk as a difference between write and read
offsets. This metric is very similar to `vmagent_remotewrite_pending_data_bytes`
except of accounting for bytes in-memory.
The change to `vm_promscrape_targets` metric suppose to improve observability
for `vmagent` so it will be possible to track how many targets are up or down
for every specific scrape group:
```
vm_promscrape_targets{type="static_configs", status="down"} 1
vm_promscrape_targets{type="static_configs", status="up"} 2
```
Previously multiple goroutines could access remoteWriteCtx.tss concurrently, which could lead to data race
and improper relabeling. Now each goroutine has its own copy of tss during relabeling.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/599
Previously the concurrency has been limited to GOMAXPROCS*2. This had little sense,
since every call to Storage.AddRows is bound to CPU, so the maximum ingestion bandwidth
is achieved when the number of concurrent calls to Storage.AddRows is limited to the number of CPUs,
i.e. to GOMAXPROCS.
Heavy queries could result in the lack of CPU resources for processing the current data ingestion stream.
Prevent this by delaying queries' execution until free resources are available for data ingestion.
Expose `vm_search_delays_total` metric, which may be used in for alerting when there is no enough CPU resources
for data ingestion and/or for executing heavy queries.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/291
* app/vmalert: add retries to remotewrite
Remotewrite pkg now does limited number of retries if write request failed.
This suppose to make vmalert state persisting more reliable.
New metrics were added to remotewrite in order to track rows/bytes sent/dropped.
defaultFlushInterval was increased from 1s to 5s for sanity reasons.
* fix
* wip
* wip
* wip
* fix bits alignment bug for 32-bit systems
* fix mistakenly dropped field
The maximum deadline miss duration is reduced to 2x scrape_interval in the worst case.
By default it is limited to scrape_interval configured for the given scrape target.
* Fix Auto metrics relabeled errors
* Finalize auto-genenated Labels
* Fix Test Errors
* fix error logs when queue is full
Co-authored-by: xinyulong <xinyulong@kuaishou.com>
which errcheck ||GO111MODULE=off go get -u github.com/kisielk/errcheck
which errcheck ||GO111MODULE=off go get github.com/kisielk/errcheck
check-all:fmtvetlinterrcheckgolangci-lint
@@ -122,8 +239,8 @@ benchmark-pure:
GO111MODULE=on CGO_ENABLED=0 go test -mod=vendor -bench=. ./app/...
vendor-update:
GO111MODULE=on go get -u ./lib/...
GO111MODULE=on go get -u ./app/...
GO111MODULE=on go get -u -d ./lib/...
GO111MODULE=on go get -u -d ./app/...
GO111MODULE=on go mod tidy
GO111MODULE=on go mod vendor
@@ -133,23 +250,46 @@ app-local:
app-local-pure:
CGO_ENABLED=0GO111MODULE=on go build $(RACE) -mod=vendor -ldflags "$(GO_BUILDINFO)" -o bin/$(APP_NAME)-pure$(RACE)$(PKG_PREFIX)/app/$(APP_NAME)
app-local-with-goarch:
GO111MODULE=on go build $(RACE) -mod=vendor -ldflags "$(GO_BUILDINFO)" -o bin/$(APP_NAME)-$(GOARCH)$(RACE)$(PKG_PREFIX)/app/$(APP_NAME)
app-local-windows-with-goarch:
CGO_ENABLED=0GO111MODULE=on go build $(RACE) -mod=vendor -ldflags "$(GO_BUILDINFO)" -o bin/$(APP_NAME)-windows-$(GOARCH)$(RACE).exe $(PKG_PREFIX)/app/$(APP_NAME)
quicktemplate-gen:install-qtc
qtc
install-qtc:
which qtc ||GO111MODULE=off go get -u github.com/valyala/quicktemplate/qtc
which qtc ||GO111MODULE=off go get github.com/valyala/quicktemplate/qtc
golangci-lint:install-golangci-lint
golangci-lint run --exclude '(SA4003|SA1019|SA5011):' -D errcheck -D structcheck --timeout 2m
install-golangci-lint:
which golangci-lint ||GO111MODULE=off go get -u github.com/golangci/golangci-lint/cmd/golangci-lint
which golangci-lint ||curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b $(shell go env GOPATH)/bin v1.43.0
install-wwhrd:
which wwhrd ||GO111MODULE=off go get github.com/frapposelli/wwhrd
check-licenses:install-wwhrd
wwhrd check -f .wwhrd.yml
copy-docs:
echo"---\nsort: ${ORDER}\n---\n" > ${DST}
cat ${SRC} >> ${DST}
# Copies docs for all components and adds the order tag.
httpListenAddr=flag.String("httpListenAddr",":8428","TCP address to listen for http connections")
minScrapeInterval=flag.Duration("dedup.minScrapeInterval",0,"Remove superflouos samples from time series if they are located closer to each other than this duration. "+
"This may be useful for reducing overhead when multiple identically configured Prometheus instances write data to the same VictoriaMetrics. "+
"Deduplication is disabled if the -dedup.minScrapeInterval is 0")
minScrapeInterval=flag.Duration("dedup.minScrapeInterval",0,"Leave only the first sample in every time series per each discrete interval "+
"equal to -dedup.minScrapeInterval > 0. See https://docs.victoriametrics.com/#deduplication for details")
dryRun=flag.Bool("dryRun",false,"Whether to check only -promscrape.config and then exit. "+
"Unknown config entries are allowed in -promscrape.config by default. This can be changed with -promscrape.config.strictParse")
)
funcmain(){
// Write flags and help message to stdout, since it is easier to grep or pipe.
flag.CommandLine.SetOutput(os.Stdout)
flag.Usage=usage
envflag.Parse()
buildinfo.Init()
logger.Init()
ifpromscrape.IsDryRun(){
*dryRun=true
}
if*dryRun{
iferr:=promscrape.CheckConfig();err!=nil{
logger.Fatalf("error when checking -promscrape.config: %s",err)
}
logger.Infof("-promscrape.config is ok; exitting with 0 status code")
return
}
logger.Infof("starting VictoriaMetrics at %q...",*httpListenAddr)
measurementFieldSeparator=flag.String("influxMeasurementFieldSeparator","_","Separator for '{measurement}{separator}{field_name}' metric name when inserted via Influx line protocol")
skipSingleField=flag.Bool("influxSkipSingleField",false,"Uses '{measurement}' instead of '{measurement}{separator}{field_name}' for metic name if Influx line contains only a single field")
measurementFieldSeparator=flag.String("influxMeasurementFieldSeparator","_","Separator for '{measurement}{separator}{field_name}' metric name when inserted via InfluxDB line protocol")
skipSingleField=flag.Bool("influxSkipSingleField",false,"Uses '{measurement}' instead of '{measurement}{separator}{field_name}' for metic name if InfluxDB line contains only a single field")
skipMeasurement=flag.Bool("influxSkipMeasurement",false,"Uses '{field_name}' as a metric name while ignoring '{measurement}' and '-influxMeasurementFieldSeparator'")
httpListenAddr=flag.String("httpListenAddr",":8429","TCP address to listen for http connections. "+
"Set this flag to empty value in order to disable listening on any port. This mode may be useful for running multiple vmagent instances on the same server. "+
"Note that /targets and /metrics pages aren't available if -httpListenAddr=''")
influxListenAddr=flag.String("influxListenAddr","","TCP and UDP address to listen for Influx line protocol data. Usually :8189 must be set. Doesn't work if empty")
influxListenAddr=flag.String("influxListenAddr","","TCP and UDP address to listen for InfluxDB line protocol data. Usually :8189 must be set. Doesn't work if empty. "+
"This flag isn't needed when ingesting data over HTTP - just send it to http://<vmagent>:8429/write")
graphiteListenAddr=flag.String("graphiteListenAddr","","TCP and UDP address to listen for Graphite plaintext data. Usually :2003 must be set. Doesn't work if empty")
opentsdbListenAddr=flag.String("opentsdbListenAddr","","TCP and UDP address to listen for OpentTSDB metrics. "+
"Telnet put messages and HTTP /api/put messages are simultaneously served on TCP port. "+
"Usually :4242 must be set. Doesn't work if empty")
opentsdbHTTPListenAddr=flag.String("opentsdbHTTPListenAddr","","TCP address to listen for OpentTSDB HTTP put requests. Usually :4242 must be set. Doesn't work if empty")
configAuthKey=flag.String("configAuthKey","","Authorization key for accessing /config page. It must be passed via authKey query arg")
dryRun=flag.Bool("dryRun",false,"Whether to check only config files without running vmagent. The following files are checked: "+
"-promscrape.config, -remoteWrite.relabelConfig, -remoteWrite.urlRelabelConfig . See also -promscrape.config.dryRun")
tlsInsecureSkipVerify=flag.Bool("remoteWrite.tlsInsecureSkipVerify",false,"Whether to skip tls verification when connecting to -remoteWrite.url")
tlsInsecureSkipVerify=flagutil.NewArrayBool("remoteWrite.tlsInsecureSkipVerify","Whether to skip tls verification when connecting to -remoteWrite.url")
tlsCertFile=flagutil.NewArray("remoteWrite.tlsCertFile","Optional path to client-side TLS certificate file to use when connecting to -remoteWrite.url. "+
"If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url")
tlsKeyFile=flagutil.NewArray("remoteWrite.tlsKeyFile","Optional path to client-side TLS certificate key to use when connecting to -remoteWrite.url. "+
@@ -34,107 +40,107 @@ var (
"If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url")
basicAuthPassword=flagutil.NewArray("remoteWrite.basicAuth.password","Optional basic auth password to use for -remoteWrite.url. "+
"If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url")
basicAuthPasswordFile=flagutil.NewArray("remoteWrite.basicAuth.passwordFile","Optional path to basic auth password to use for -remoteWrite.url. "+
"The file is re-read every second. "+
"If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url")
bearerToken=flagutil.NewArray("remoteWrite.bearerToken","Optional bearer auth token to use for -remoteWrite.url. "+
"If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url")
bearerTokenFile=flagutil.NewArray("remoteWrite.bearerTokenFile","Optional path to bearer token file to use for -remoteWrite.url. "+
"The token is re-read from the file every second. "+
"If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url")
oauth2ClientID=flagutil.NewArray("remoteWrite.oauth2.clientID","Optional OAuth2 clientID to use for -remoteWrite.url. "+
"If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url")
oauth2ClientSecret=flagutil.NewArray("remoteWrite.oauth2.clientSecret","Optional OAuth2 clientSecret to use for -remoteWrite.url. "+
"If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url")
oauth2ClientSecretFile=flagutil.NewArray("remoteWrite.oauth2.clientSecretFile","Optional OAuth2 clientSecretFile to use for -remoteWrite.url. "+
"If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url")
oauth2TokenURL=flagutil.NewArray("remoteWrite.oauth2.tokenUrl","Optional OAuth2 tokenURL to use for -remoteWrite.url. "+
"If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url")
oauth2Scopes=flagutil.NewArray("remoteWrite.oauth2.scopes","Optional OAuth2 scopes to use for -remoteWrite.url. Scopes must be delimited by ';'. "+
"If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url")
logger.Errorf("unexpected status code received after sending a block with size %d bytes to %q: %d; response body=%q; re-sending the block in %.3f seconds",
flushInterval=flag.Duration("remoteWrite.flushInterval",time.Second,"Interval for flushing the data to remote storage. "+
"Higher value reduces network bandwidth usage at the cost of delayed push of scraped data to remote storage. "+
"Minimum supported interval is 1 second")
maxUnpackedBlockSize=flag.Int("remoteWrite.maxBlockSize",32*1024*1024,"The maximum size in bytes of unpacked request to send to remote storage. "+
"It shouldn't exceed -maxInsertRequestSize from VictoriaMetrics")
"This option takes effect only when less than 10K data points per second are pushed to -remoteWrite.url")
maxUnpackedBlockSize=flagutil.NewBytes("remoteWrite.maxBlockSize",8*1024*1024,"The maximum block size to send to remote storage. Bigger blocks may improve performance at the cost of the increased memory usage. See also -remoteWrite.maxRowsPerBlock")
maxRowsPerBlock=flag.Int("remoteWrite.maxRowsPerBlock",10000,"The maximum number of samples to send in each block to remote storage. Higher number may improve performance at the cost of the increased memory usage. See also -remoteWrite.maxBlockSize")
)
// the maximum number of rows to send per each block.
unparsedLabelsGlobal=flagutil.NewArray("remoteWrite.label","Optional label in the form 'name=value' to add to all the metrics before sending them to -remoteWrite.url. "+
"Pass multiple -remoteWrite.label flags in order to add multiple flags to metrics before sending them to remote storage")
"Pass multiple -remoteWrite.label flags in order to add multiple labels to metrics before sending them to remote storage")
relabelConfigPathGlobal=flag.String("remoteWrite.relabelConfig","","Optional path to file with relabel_config entries. These entries are applied to all the metrics "+
"before sending them to -remoteWrite.url. See https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config for details")
"before sending them to -remoteWrite.url. See https://docs.victoriametrics.com/vmagent.html#relabeling for details")
relabelDebugGlobal=flag.Bool("remoteWrite.relabelDebug",false,"Whether to log metrics before and after relabeling with -remoteWrite.relabelConfig. "+
"If the -remoteWrite.relabelDebug is enabled, then the metrics aren't sent to remote storage. This is useful for debugging the relabeling configs")
relabelConfigPaths=flagutil.NewArray("remoteWrite.urlRelabelConfig","Optional path to relabel config for the corresponding -remoteWrite.url")
relabelDebug=flagutil.NewArrayBool("remoteWrite.urlRelabelDebug","Whether to log metrics before and after relabeling with -remoteWrite.urlRelabelConfig. "+
"If the -remoteWrite.urlRelabelDebug is enabled, then the metrics aren't sent to the corresponding -remoteWrite.url. "+
"This is useful for debugging the relabeling configs")
returnnil,fmt.Errorf("too many -remoteWrite.urlRelabelConfig args: %d; it mustn't exceed the number of -remoteWrite.url or -remoteWrite.multitenantURL args: %d",
remoteWriteURLs=flagutil.NewArray("remoteWrite.url","Remote storage URL to write data to. It must support Prometheus remote_write API. "+
"It is recommended using VictoriaMetrics as remote storage. Example url: http://<victoriametrics-host>:8428/api/v1/write . "+
"Pass multiple -remoteWrite.url flags in order to write data concurrently to multiple remote storage systems")
tmpDataPath=flag.String("remoteWrite.tmpDataPath","vmagent-remotewrite-data","Path to directory where temporary data for remote write component is stored")
queues=flag.Int("remoteWrite.queues",1,"The number of concurrent queues to each -remoteWrite.url. Set more queues if a single queue "+
"isn't enough for sending high volume of collected data to remote storage")
"Pass multiple -remoteWrite.url flags in order to replicate data to multiple remote storage systems. See also -remoteWrite.multitenantURL")
remoteWriteMultitenantURLs=flagutil.NewArray("remoteWrite.multitenantURL","Base path for multitenant remote storage URL to write data to. "+
"See https://docs.victoriametrics.com/vmagent.html#multitenancy for details. Example url: http://<vminsert>:8480 . "+
"Pass multiple -remoteWrite.multitenantURL flags in order to replicate data to multiple remote storage systems. See also -remoteWrite.url")
tmpDataPath=flag.String("remoteWrite.tmpDataPath","vmagent-remotewrite-data","Path to directory where temporary data for remote write component is stored. "+
"See also -remoteWrite.maxDiskUsagePerURL")
queues=flag.Int("remoteWrite.queues",cgroup.AvailableCPUs()*2,"The number of concurrent queues to each -remoteWrite.url. Set more queues if default number of queues "+
"isn't enough for sending high volume of collected data to remote storage. Default value is 2 * numberOfAvailableCPUs")
showRemoteWriteURL=flag.Bool("remoteWrite.showURL",false,"Whether to show -remoteWrite.url in the exported metrics. "+
"It is hidden by default, since it can contain sensistive auth info")
maxPendingBytesPerURL=flag.Int("remoteWrite.maxDiskUsagePerURL",0,"The maximum file-based buffer size in bytes at -remoteWrite.tmpDataPath "+
"It is hidden by default, since it can contain sensitive info such as auth key")
maxPendingBytesPerURL=flagutil.NewBytes("remoteWrite.maxDiskUsagePerURL",0,"The maximum file-based buffer size in bytes at -remoteWrite.tmpDataPath "+
"for each -remoteWrite.url. When buffer size reaches the configured maximum, then old data is dropped when adding new data to the buffer. "+
"Buffered data is stored in ~500MB chunks, so the minimum practical value for this flag is 500000000. "+
"Disk usage is unlimited if the value is set to 0")
significantFigures=flagutil.NewArrayInt("remoteWrite.significantFigures","The number of significant figures to leave in metric values before writing them "+
"to remote storage. See https://en.wikipedia.org/wiki/Significant_figures . Zero value saves all the significant figures. "+
"This option may be used for improving data compression for the stored metrics. See also -remoteWrite.roundDigits")
roundDigits=flagutil.NewArrayInt("remoteWrite.roundDigits","Round metric values to this number of decimal digits after the point before writing them to remote storage. "+
"Examples: -remoteWrite.roundDigits=2 would round 1.236 to 1.24, while -remoteWrite.roundDigits=-1 would round 126.78 to 130. "+
"By default digits rounding is disabled. Set it to 100 for disabling it for a particular remote storage. "+
"This option may be used for improving data compression for the stored metrics")
sortLabels=flag.Bool("sortLabels",false,`Whether to sort labels for incoming samples before writing them to all the configured remote storage systems. `+
`This may be needed for reducing memory usage at remote storage when the order of labels in incoming samples is random. `+
`For example, if m{k1="v1",k2="v2"} may be sent as m{k2="v2",k1="v1"}`+
`Enabled sorting for labels can slow down ingestion performance a bit`)
maxHourlySeries=flag.Int("remoteWrite.maxHourlySeries",0,"The maximum number of unique series vmagent can send to remote storage systems during the last hour. "+
"Excess series are logged and dropped. This can be useful for limiting series cardinality. See https://docs.victoriametrics.com/vmagent.html#cardinality-limiter")
maxDailySeries=flag.Int("remoteWrite.maxDailySeries",0,"The maximum number of unique series vmagent can send to remote storage systems during the last 24 hours. "+
"Excess series are logged and dropped. This can be useful for limiting series churn rate. See https://docs.victoriametrics.com/vmagent.html#cardinality-limiter")
)
varrwctxs[]*remoteWriteCtx
var(
// rwctxsDefault contains statically populated entries when -remoteWrite.url is specified.
rwctxsDefault[]*remoteWriteCtx
// rwctxsMap contains dynamically populated entries when -remoteWrite.multitenantURL is specified.
Execute the query against storage which was used for `-remoteWrite.url` during the `replay`.
### Additional configuration
There are following non-required `replay` flags:
*`-replay.maxDatapointsPerQuery` - the max number of data points expected to receive in one request.
In two words, it affects the max time range for every `/query_range` request. The higher the value,
the less requests will be issued during `replay`.
*`-replay.ruleRetryAttempts` - when datasource fails to respond vmalert will make this number of retries
per rule before giving up.
*`-replay.rulesDelay` - delay between sequential rules execution. Important in cases if there are chaining
(rules which depend on each other) rules. It is expected, that remote storage will be able to persist
previously accepted data during the delay, so data will be available for the subsequent queries.
Keep it equal or bigger than `-remoteWrite.flushInterval`.
See full description for these flags in `./vmalert --help`.
### Limitations
* Graphite engine isn't supported yet;
*`query` template function is disabled for performance reasons (might be changed in future);
## Monitoring
`vmalert` exports various metrics in Prometheus exposition format at `http://vmalert-host:8880/metrics` page.
We recommend setting up regular scraping of this page either through `vmagent` or by Prometheus so that the exported
metrics may be analyzed later.
Use official [Grafana dashboard](https://grafana.com/grafana/dashboards/14950) for `vmalert` overview.
If you have suggestions for improvements or have found a bug - please open an issue on github or add
a review to the dashboard.
## Configuration
Pass `-help` to `vmalert` in order to see the full list of supported
command-line flags with their descriptions.
The shortlist of configuration flags is the following:
```
Usage of vmalert:
-datasource.appendTypePrefix
Whether to add type prefix to -datasource.url based on the query type. Set to true if sending different query types to the vmselect URL.
-datasource.basicAuth.password string
Optional basic auth password for -datasource.url
Optional basic auth password for -datasource.url
-datasource.basicAuth.passwordFile string
Optional path to basic auth password to use for -datasource.url
-datasource.basicAuth.username string
Optional basic auth username for -datasource.url
-datasource.tlsCAFile value
Optional path to TLS CA file to use for verifying connections to -datasource.url. By default system CA is used.
-datasource.tlsCertFile value
Optional path to client-side TLS certificate file to use when connecting to -datasource.url.
Optional basic auth username for -datasource.url
-datasource.bearerToken string
Optional bearer auth token to use for -datasource.url.
-datasource.bearerTokenFile string
Optional path to bearer token file to use for -datasource.url.
-datasource.lookback duration
Lookback defines how far into the past to look when evaluating queries. For example, if the datasource.lookback=5m then param "time" with value now()-5m will be added to every query.
-datasource.maxIdleConnections int
Defines the number of idle (keep-alive connections) to each configured datasource. Consider setting this value equal to the value: groups_total * group.concurrency. Too low a value may result in a high number of sockets in TIME_WAIT state. (default 100)
-datasource.queryStep duration
queryStep defines how far a value can fallback to when evaluating queries. For example, if datasource.queryStep=15s then param "step" with value "15s" will be added to every query.If queryStep isn't specified, rule's evaluationInterval will be used instead.
-datasource.roundDigits int
Adds "round_digits" GET param to datasource requests. In VM "round_digits" limits the number of digits after the decimal point in response values.
-datasource.tlsCAFile string
Optional path to TLS CA file to use for verifying connections to -datasource.url. By default, system CA is used
-datasource.tlsCertFile string
Optional path to client-side TLS certificate file to use when connecting to -datasource.url
-datasource.tlsInsecureSkipVerify
Whether to skip tls verification when connecting to -datasource.url
-datasource.tlsKeyFile value
Optional path to client-side TLS certificate key to use when connecting to -datasource.url.
-datasource.tlsServerName value
Optional TLS server name to use for connections to -datasource.url. By default the server name from -datasource.url is used.
Whether to skip tls verification when connecting to -datasource.url
-datasource.tlsKeyFile string
Optional path to client-side TLS certificate key to use when connecting to -datasource.url
-datasource.tlsServerName string
Optional TLS server name to use for connections to -datasource.url. By default, the server name from -datasource.url is used
-datasource.url string
VictoriaMetrics or VMSelect url. Required parameter. E.g. http://127.0.0.1:8428
VictoriaMetrics or vmselect url. Required parameter. E.g. http://127.0.0.1:8428
-disableAlertgroupLabel
Whether to disable adding group's name as label to generated alerts and time series.
-dryRun -rule
Whether to check only config files without running vmalert. The rules file are validated. The -rule flag must be specified.
-enableTCP6
Whether to enable IPv6 for listening and dialing. By default only IPv4 TCP and UDP is used
-envflag.enable
Whether to enable reading flags from environment variables additionally to command line. Command line flag values have priority over values from environment vars. Flags are read only from command line if this flag isn't set. See https://docs.victoriametrics.com/#environment-variables for more details
-envflag.prefix string
Prefix for environment variables if -envflag.enable is set
-evaluationInterval duration
How often to evaluate the rules (default 1m0s)
How often to evaluate the rules (default 1m0s)
-external.alert.source string
External Alert Source allows to override the Source link for alerts sent to AlertManager for cases where you want to build a custom link to Grafana, Prometheus or any other service.
eg. 'explore?orgId=1&left=[\"now-1h\",\"now\",\"VictoriaMetrics\",{\"expr\": \"{{$expr|quotesEscape|crlfEscape|queryEscape}}\"},{\"mode\":\"Metrics\"},{\"ui\":[true,true,true,\"none\"]}]'.If empty '/api/v1/:groupID/alertID/status' is used
-external.label array
Optional label in the form 'name=value' to add to all generated recording rules and alerts. Pass multiple -label flags in order to add multiple label sets.
Supports an array of values separated by comma or specified via multiple flags.
-external.url string
External URL is used as alert's source for sent alerts to the notifier
External URL is used as alert's source for sent alerts to the notifier
-fs.disableMmap
Whether to use pread() instead of mmap() for reading data files. By default mmap() is used for 64-bit arches and pread() is used for 32-bit arches, since they cannot read data files bigger than 2^32 bytes in memory. mmap() is usually faster for reading small data chunks than pread()
-http.connTimeout duration
Incoming http connections are closed after the configured timeout. This may help to spread the incoming load among a cluster of services behind a load balancer. Please note that the real timeout may be bigger by up to 10% as a protection against the thundering herd problem (default 2m0s)
-http.disableResponseCompression
Disable compression of HTTP responses to save CPU resources. By default compression is enabled to save network bandwidth
-http.idleConnTimeout duration
Timeout for incoming idle http connections (default 1m0s)
-http.maxGracefulShutdownDuration duration
The maximum duration for a graceful shutdown of the HTTP server. A highly loaded server may require increased value for a graceful shutdown (default 7s)
-http.pathPrefix string
An optional prefix to add to all the paths handled by http server. For example, if '-http.pathPrefix=/foo/bar' is set, then all the http requests will be handled on '/foo/bar/*' paths. This may be useful for proxied requests. See https://www.robustperception.io/using-external-urls-and-proxies-with-prometheus
-http.shutdownDelay duration
Optional delay before http server shutdown. During this delay, the server returns non-OK responses from /health page, so load balancers can route new requests to other servers
-httpAuth.password string
Password for HTTP Basic Auth. The authentication is disabled if -httpAuth.username is empty
-httpAuth.username string
Username for HTTP Basic Auth. The authentication is disabled if empty. See also -httpAuth.password
-httpListenAddr string
Address to listen for http connections (default ":8880")
Address to listen for http connections (default ":8880")
-loggerDisableTimestamps
Whether to disable writing timestamps in logs
-loggerErrorsPerSecondLimit int
Per-second limit on the number of ERROR messages. If more than the given number of errors are emitted per second, the remaining errors are suppressed. Zero values disable the rate limit
-loggerFormat string
Format for logs. Possible values: default, json (default "default")
-loggerLevel string
Minimum level of errors to log. Possible values: INFO, WARN, ERROR, FATAL, PANIC (default "INFO")
-loggerOutput string
Output for the logs. Supported values: stderr, stdout (default "stderr")
-loggerTimezone string
Timezone to use for timestamps in logs. Timezone must be a valid IANA Time Zone. For example: America/New_York, Europe/Berlin, Etc/GMT+3 or Local (default "UTC")
-loggerWarnsPerSecondLimit int
Per-second limit on the number of WARN messages. If more than the given number of warns are emitted per second, then the remaining warns are suppressed. Zero values disable the rate limit
-memory.allowedBytes size
Allowed size of system memory VictoriaMetrics caches may occupy. This option overrides -memory.allowedPercent if set to a non-zero value. Too low a value may increase the cache miss rate usually resulting in higher CPU and disk IO usage. Too high a value may evict too much data from OS page cache resulting in higher disk IO usage
Supports the following optional suffixes for size values: KB, MB, GB, KiB, MiB, GiB (default 0)
-memory.allowedPercent float
Allowed percent of system memory VictoriaMetrics caches may occupy. See also -memory.allowedBytes. Too low a value may increase cache miss rate usually resulting in higher CPU and disk IO usage. Too high a value may evict too much data from OS page cache which will result in higher disk IO usage (default 60)
-metricsAuthKey string
Auth key for /metrics. It overrides httpAuth settings
-notifier.tlsCAFile value
Optional path to TLS CA file to use for verifying connections to -notifier.url. By default system CA is used.
-notifier.tlsCertFile value
Optional path to client-side TLS certificate file to use when connecting to -notifier.url.
-notifier.tlsInsecureSkipVerify
Whether to skip tls verification when connecting to -notifier.url
-notifier.tlsKeyFile value
Optional path to client-side TLS certificate key to use when connecting to -notifier.url.
-notifier.tlsServerName value
Optional TLS server name to use for connections to -notifier.url. By default the server name from -notifier.url is used.
-notifier.url string
Prometheus alertmanager URL. Required parameter. e.g. http://127.0.0.1:9093
Auth key for /metrics. It overrides httpAuth settings
-notifier.basicAuth.password array
Optional basic auth password for -notifier.url
Supports an array of values separated by comma or specified via multiple flags.
-notifier.basicAuth.username array
Optional basic auth username for -notifier.url
Supports an array of values separated by comma or specified via multiple flags.
-notifier.tlsCAFile array
Optional path to TLS CA file to use for verifying connections to -notifier.url. By default system CA is used
Supports an array of values separated by comma or specified via multiple flags.
-notifier.tlsCertFile array
Optional path to client-side TLS certificate file to use when connecting to -notifier.url
Supports an array of values separated by comma or specified via multiple flags.
-notifier.tlsInsecureSkipVerify array
Whether to skip tls verification when connecting to -notifier.url
Supports array of values separated by comma or specified via multiple flags.
-notifier.tlsKeyFile array
Optional path to client-side TLS certificate key to use when connecting to -notifier.url
Supports an array of values separated by comma or specified via multiple flags.
-notifier.tlsServerName array
Optional TLS server name to use for connections to -notifier.url. By default the server name from -notifier.url is used
Supports an array of values separated by comma or specified via multiple flags.
-notifier.url array
Prometheus alertmanager URL. Required parameter. e.g. http://127.0.0.1:9093
Supports an array of values separated by comma or specified via multiple flags.
-pprofAuthKey string
Auth key for /debug/pprof. It overrides httpAuth settings
-remoteRead.basicAuth.password string
Optional basic auth password for -remoteRead.url
Optional basic auth password for -remoteRead.url
-remoteRead.basicAuth.passwordFile string
Optional path to basic auth password to use for -remoteRead.url
-remoteRead.basicAuth.username string
Optional basic auth username for -remoteRead.url
Optional basic auth username for -remoteRead.url
-remoteRead.bearerToken string
Optional bearer auth token to use for -remoteRead.url.
-remoteRead.bearerTokenFile string
Optional path to bearer token file to use for -remoteRead.url.
-remoteRead.disablePathAppend
Whether to disable automatic appending of '/api/v1/query' path to the configured -remoteRead.url.
-remoteRead.ignoreRestoreErrors
Whether to ignore errors from remote storage when restoring alerts state on startup. (default true)
-remoteRead.lookback duration
Lookback defines how far to look into past for alerts timeseries. For example, if lookback=1h then range from now() to now()-1h will be scanned. (default 1h0m0s)
-remoteRead.tlsCAFile value
Optional path to TLS CA file to use for verifying connections to -remoteRead.url. By default system CA is used.
-remoteRead.tlsCertFile value
Optional path to client-side TLS certificate file to use when connecting to -remoteRead.url.
Lookback defines how far to look into past for alerts timeseries. For example, if lookback=1h then range from now() to now()-1h will be scanned. (default 1h0m0s)
-remoteRead.tlsCAFile string
Optional path to TLS CA file to use for verifying connections to -remoteRead.url. By default system CA is used
-remoteRead.tlsCertFile string
Optional path to client-side TLS certificate file to use when connecting to -remoteRead.url
-remoteRead.tlsInsecureSkipVerify
Whether to skip tls verification when connecting to -remoteRead.url
-remoteRead.tlsKeyFile value
Optional path to client-side TLS certificate key to use when connecting to -remoteRead.url.
-remoteRead.tlsServerName value
Optional TLS server name to use for connections to -remoteRead.url. By default the server name from -remoteRead.url is used.
Whether to skip tls verification when connecting to -remoteRead.url
-remoteRead.tlsKeyFile string
Optional path to client-side TLS certificate key to use when connecting to -remoteRead.url
-remoteRead.tlsServerName string
Optional TLS server name to use for connections to -remoteRead.url. By default the server name from -remoteRead.url is used
-remoteRead.url vmalert
Optional URL to VictoriaMetrics or VMSelect that will be used to restore alerts state. This configuration makes sense only if vmalert was configured with `remoteWrite.url` before and has been successfully persisted its state. E.g. http://127.0.0.1:8428
Optional URL to VictoriaMetrics or vmselect that will be used to restore alerts state. This configuration makes sense only if vmalert was configured with `remoteWrite.url` before and has been successfully persisted its state. E.g. http://127.0.0.1:8428. See also -remoteRead.disablePathAppend
-remoteWrite.basicAuth.password string
Optional basic auth password for -remoteWrite.url
Optional basic auth password for -remoteWrite.url
-remoteWrite.basicAuth.passwordFile string
Optional path to basic auth password to use for -remoteWrite.url
-remoteWrite.basicAuth.username string
Optional basic auth username for -remoteWrite.url
Optional basic auth username for -remoteWrite.url
-remoteWrite.bearerToken string
Optional bearer auth token to use for -remoteWrite.url.
-remoteWrite.bearerTokenFile string
Optional path to bearer token file to use for -remoteWrite.url.
-remoteWrite.concurrency int
Defines number of readers that concurrently write into remote storage (default 1)
Defines number of writers for concurrent writing into remote querier (default 1)
-remoteWrite.disablePathAppend
Whether to disable automatic appending of '/api/v1/write' path to the configured -remoteWrite.url.
-remoteWrite.flushInterval duration
Defines interval of flushes to remote write endpoint (default 5s)
Defines interval of flushes to remote write endpoint (default 5s)
-remoteWrite.maxBatchSize int
Defines defines max number of timeseries to be flushed at once (default 1000)
Defines defines max number of timeseries to be flushed at once (default 1000)
-remoteWrite.maxQueueSize int
Defines the max number of pending datapoints to remote write endpoint (default 100000)
-remoteWrite.tlsCAFile value
Optional path to TLS CA file to use for verifying connections to -remoteWrite.url. By default system CA is used.
-remoteWrite.tlsCertFile value
Optional path to client-side TLS certificate file to use when connecting to -remoteWrite.url.
Defines the max number of pending datapoints to remote write endpoint (default 100000)
-remoteWrite.tlsCAFile string
Optional path to TLS CA file to use for verifying connections to -remoteWrite.url. By default system CA is used
-remoteWrite.tlsCertFile string
Optional path to client-side TLS certificate file to use when connecting to -remoteWrite.url
-remoteWrite.tlsInsecureSkipVerify
Whether to skip tls verification when connecting to -remoteWrite.url
-remoteWrite.tlsKeyFile value
Optional path to client-side TLS certificate key to use when connecting to -remoteWrite.url.
-remoteWrite.tlsServerName value
Optional TLS server name to use for connections to -remoteWrite.url. By default the server name from -remoteWrite.url is used.
Whether to skip tls verification when connecting to -remoteWrite.url
-remoteWrite.tlsKeyFile string
Optional path to client-side TLS certificate key to use when connecting to -remoteWrite.url
-remoteWrite.tlsServerName string
Optional TLS server name to use for connections to -remoteWrite.url. By default the server name from -remoteWrite.url is used
-remoteWrite.url string
Optional URL to VictoriaMetrics or VMInsert where to persist alerts state in form of timeseries. E.g. http://127.0.0.1:8428
-rule value
Path to the file with alert rules.
Supports patterns. Flag can be specified multiple times.
Examples:
-rule /path/to/file. Path to a single file with alerting rules
-rule dir/*.yaml -rule /*.yaml. Relative path to all .yaml files in "dir" folder,
absolute path to all .yaml files in root.
Optional URL to VictoriaMetrics or vminsert where to persist alerts state and recording rules results in form of timeseries. For example, if -remoteWrite.url=http://127.0.0.1:8428 is specified, then the alerts state will be written to http://127.0.0.1:8428/api/v1/write . See also -remoteWrite.disablePathAppend
-replay.maxDatapointsPerQuery int
Max number of data points expected in one request. The higher the value, the less requests will be made during replay. (default 1000)
-replay.ruleRetryAttempts int
Defines how many retries to make before giving up on rule if request for it returns an error. (default 5)
-replay.rulesDelay duration
Delay between rules evaluation within the group. Could be important if there are chained rules inside of the groupand processing need to wait for previous rule results to be persisted by remote storage before evaluating the next rule.Keep it equal or bigger than -remoteWrite.flushInterval. (default 1s)
-replay.timeFrom string
The time filter in RFC3339 format to select time series with timestamp equal or higher than provided value. E.g. '2020-01-01T20:07:00Z'
-replay.timeTo string
The time filter in RFC3339 format to select timeseries with timestamp equal or lower than provided value. E.g. '2020-01-01T20:07:00Z'
-rule array
Path to the file with alert rules.
Supports patterns. Flag can be specified multiple times.
Examples:
-rule="/path/to/file". Path to a single file with alerting rules
-rule="dir/*.yaml" -rule="/*.yaml". Relative path to all .yaml files in "dir" folder,
absolute path to all .yaml files in root.
Rule files may contain %{ENV_VAR} placeholders, which are substituted by the corresponding env vars.
Supports an array of values separated by comma or specified via multiple flags.
-rule.configCheckInterval duration
Interval for checking for changes in '-rule' files. By default the checking is disabled. Send SIGHUP signal in order to force config check for changes
-rule.maxResolveDuration duration
Limits the maximum duration for automatic alert expiration, which is by default equal to 3 evaluation intervals of the parent group.
-rule.validateExpressions
Whether to validate rules expressions via MetricsQL engine (default true)
Whether to validate rules expressions via MetricsQL engine (default true)
-rule.validateTemplates
Whether to validate annotation and label templates (default true)
Whether to validate annotation and label templates (default true)
-tls
Whether to enable TLS (aka HTTPS) for incoming requests. -tlsCertFile and -tlsKeyFile must be set if -tls is set
-tlsCertFile string
Path to file with TLS certificate. Used only if -tls is set. Prefer ECDSA certs instead of RSA certs as RSA certs are slower
-tlsKeyFile string
Path to file with TLS key. Used only if -tls is set
-version
Show VictoriaMetrics version
```
Pass `-help` to `vmalert` in order to see the full list of supported
command-line flags with their descriptions.
`vmalert` supports "hot" config reload via the following methods:
* send SIGHUP signal to `vmalert` process;
* send GET request to `/-/reload` endpoint;
* configure `-rule.configCheckInterval` flag for periodic reload
on config change.
To reload configuration without `vmalert` restart send SIGHUP signal
or send GET request to `/-/reload` endpoint.
### Contributing
## Contributing
`vmalert` is mostly designed and built by VictoriaMetrics community.
Feel free to share your experience and ideas for improving this
Feel free to share your experience and ideas for improving this
software. Please keep simplicity as the main priority.
2. Run `make vmalert-arm-prod` or `make vmalert-arm64-prod` from the root folder of [the repository](https://github.com/VictoriaMetrics/VictoriaMetrics).
It builds `vmalert-arm-prod` or `vmalert-arm64-prod` binary respectively and puts it into the `bin` folder.
addr=flag.String("datasource.url","","VictoriaMetrics or VMSelect url. Required parameter."+
"E.g. http://127.0.0.1:8428")
basicAuthUsername=flag.String("datasource.basicAuth.username","","Optional basic auth username for -datasource.url")
basicAuthPassword=flag.String("datasource.basicAuth.password","","Optional basic auth password for -datasource.url")
addr=flag.String("datasource.url","","VictoriaMetrics or vmselect url. Required parameter."+
"E.g. http://127.0.0.1:8428")
appendTypePrefix=flag.Bool("datasource.appendTypePrefix",false,"Whether to add type prefix to -datasource.url based on the query type. Set to true if sending different query types to the vmselect URL.")
basicAuthUsername=flag.String("datasource.basicAuth.username","","Optional basic auth username for -datasource.url")
basicAuthPassword=flag.String("datasource.basicAuth.password","","Optional basic auth password for -datasource.url")
basicAuthPasswordFile=flag.String("datasource.basicAuth.passwordFile","","Optional path to basic auth password to use for -datasource.url")
bearerToken=flag.String("datasource.bearerToken","","Optional bearer auth token to use for -datasource.url.")
bearerTokenFile=flag.String("datasource.bearerTokenFile","","Optional path to bearer token file to use for -datasource.url.")
tlsInsecureSkipVerify=flag.Bool("datasource.tlsInsecureSkipVerify",false,"Whether to skip tls verification when connecting to -datasource.url")
tlsCertFile=flag.String("datasource.tlsCertFile","","Optional path to client-side TLS certificate file to use when connecting to -datasource.url")
tlsKeyFile=flag.String("datasource.tlsKeyFile","","Optional path to client-side TLS certificate key to use when connecting to -datasource.url")
tlsCAFile=flag.String("datasource.tlsCAFile","","Optional path to TLS CA file to use for verifying connections to -datasource.url. "+
"By default system CA is used")
tlsServerName=flag.String("datasource.tlsServerName","","Optional TLS server name to use for connections to -datasource.url. "+
"By default the server name from -datasource.url is used")
tlsCAFile=flag.String("datasource.tlsCAFile","",`Optional path to TLS CA file to use for verifying connections to -datasource.url. By default, system CA is used`)
tlsServerName=flag.String("datasource.tlsServerName","",`Optional TLS server name to use for connections to -datasource.url. By default, the server name from -datasource.url is used`)
lookBack=flag.Duration("datasource.lookback",0,`Lookback defines how far into the past to look when evaluating queries. For example, if the datasource.lookback=5m then param "time" with value now()-5m will be added to every query.`)
queryStep=flag.Duration("datasource.queryStep",0,"queryStep defines how far a value can fallback to when evaluating queries. "+
"For example, if datasource.queryStep=15s then param \"step\" with value \"15s\" will be added to every query."+
"If queryStep isn't specified, rule's evaluationInterval will be used instead.")
maxIdleConnections=flag.Int("datasource.maxIdleConnections",100,`Defines the number of idle (keep-alive connections) to each configured datasource. Consider setting this value equal to the value: groups_total * group.concurrency. Too low a value may result in a high number of sockets in TIME_WAIT state.`)
roundDigits=flag.Int("datasource.roundDigits",0,`Adds "round_digits" GET param to datasource requests. `+
`In VM "round_digits" limits the number of digits after the decimal point in response values.`)
)
// Param represents an HTTP GET param
typeParamstruct{
Key,Valuestring
}
// Init creates a Querier from provided flag values.
funcInit()(Querier,error){
// Provided extraParams will be added as GET params to
rulePath=flagutil.NewArray("rule",`Path to the file with alert rules.
Supports patterns. Flag can be specified multiple times.
rulePath=flagutil.NewArray("rule",`Path to the file with alert rules.
Supports patterns. Flag can be specified multiple times.
Examples:
-rule/path/to/file. Path to a single file with alerting rules
-ruledir/*.yaml -rule/*.yaml. Relative path to all .yaml files in "dir" folder,
absolute path to all .yaml files in root.`)
-rule="/path/to/file". Path to a single file with alerting rules
-rule="dir/*.yaml" -rule="/*.yaml". Relative path to all .yaml files in "dir" folder,
absolute path to all .yaml files in root.
Rule files may contain %{ENV_VAR} placeholders, which are substituted by the corresponding env vars.`)
rulesCheckInterval=flag.Duration("rule.configCheckInterval",0,"Interval for checking for changes in '-rule' files. "+
"By default the checking is disabled. Send SIGHUP signal in order to force config check for changes")
httpListenAddr=flag.String("httpListenAddr",":8880","Address to listen for http connections")
evaluationInterval=flag.Duration("evaluationInterval",time.Minute,"How often to evaluate the rules")
validateTemplates=flag.Bool("rule.validateTemplates",true,"Whether to validate annotation and label templates")
validateExpressions=flag.Bool("rule.validateExpressions",true,"Whether to validate rules expressions via MetricsQL engine")
maxResolveDuration=flag.Duration("rule.maxResolveDuration",0,"Limits the maximum duration for automatic alert expiration, "+
"which is by default equal to 3 evaluation intervals of the parent group.")
externalURL=flag.String("external.url","","External URL is used as alert's source for sent alerts to the notifier")
externalAlertSource=flag.String("external.alert.source","",`External Alert Source allows to override the Source link for alerts sent to AlertManager for cases where you want to build a custom link to Grafana, Prometheus or any other service.
eg. 'explore?orgId=1&left=[\"now-1h\",\"now\",\"VictoriaMetrics\",{\"expr\": \"{{$expr|quotesEscape|pathEscape}}\"},{\"mode\":\"Metrics\"},{\"ui\":[true,true,true,\"none\"]}]'.If empty '/api/v1/:groupID/alertID/status' is used`)
eg. 'explore?orgId=1&left=[\"now-1h\",\"now\",\"VictoriaMetrics\",{\"expr\": \"{{$expr|quotesEscape|crlfEscape|queryEscape}}\"},{\"mode\":\"Metrics\"},{\"ui\":[true,true,true,\"none\"]}]'.If empty '/api/v1/:groupID/alertID/status' is used`)
externalLabels=flagutil.NewArray("external.label","Optional label in the form 'name=value' to add to all generated recording rules and alerts. "+
"Pass multiple -label flags in order to add multiple label sets.")
remoteReadLookBack=flag.Duration("remoteRead.lookback",time.Hour,"Lookback defines how far to look into past for alerts timeseries."+
" For example, if lookback=1h then range from now() to now()-1h will be scanned.")
remoteReadIgnoreRestoreErrors=flag.Bool("remoteRead.ignoreRestoreErrors",true,"Whether to ignore errors from remote storage when restoring alerts state on startup.")
disableAlertGroupLabel=flag.Bool("disableAlertgroupLabel",false,"Whether to disable adding group's name as label to generated alerts and time series.")
dryRun=flag.Bool("dryRun",false,"Whether to check only config files without running vmalert. The rules file are validated. The `-rule` flag must be specified.")
)
varalertURLGeneratorFnnotifier.AlertURLGenerator
funcmain(){
// Write flags and help message to stdout, since it is easier to grep or pipe.
flag.CommandLine.SetOutput(os.Stdout)
@@ -53,35 +69,74 @@ func main() {
buildinfo.Init()
logger.Init()
if*dryRun{
u,_:=url.Parse("https://victoriametrics.com/")
notifier.InitTemplateFunc(u)
groups,err:=config.Parse(*rulePath,true,true)
iferr!=nil{
logger.Fatalf("failed to parse %q: %s",*rulePath,err)
}
iflen(groups)==0{
logger.Fatalf("No rules for validation. Please specify path to file(s) with alerting and/or recording rules using `-rule` flag")
addr=flag.String("remoteRead.url","","Optional URL to VictoriaMetrics or VMSelect that will be used to restore alerts"+
"state. This configuration makes sense only if `vmalert` was configured with `remoteWrite.url` before and has been successfully persisted its state."+
"E.g. http://127.0.0.1:8428")
addr=flag.String("remoteRead.url","","Optional URL to VictoriaMetrics or vmselect that will be used to restore alerts"+
"state. This configuration makes sense only if `vmalert` was configured with `remoteWrite.url` before and has been successfully persisted its state."+
"E.g. http://127.0.0.1:8428. See also -remoteRead.disablePathAppend")
basicAuthUsername=flag.String("remoteRead.basicAuth.username","","Optional basic auth username for -remoteRead.url")
basicAuthPassword=flag.String("remoteRead.basicAuth.password","","Optional basic auth password for -remoteRead.url")
basicAuthPasswordFile=flag.String("remoteRead.basicAuth.passwordFile","","Optional path to basic auth password to use for -remoteRead.url")
bearerToken=flag.String("remoteRead.bearerToken","","Optional bearer auth token to use for -remoteRead.url.")
bearerTokenFile=flag.String("remoteRead.bearerTokenFile","","Optional path to bearer token file to use for -remoteRead.url.")
tlsInsecureSkipVerify=flag.Bool("remoteRead.tlsInsecureSkipVerify",false,"Whether to skip tls verification when connecting to -remoteRead.url")
tlsCertFile=flag.String("remoteRead.tlsCertFile","","Optional path to client-side TLS certificate file to use when connecting to -remoteRead.url")
tlsKeyFile=flag.String("remoteRead.tlsKeyFile","","Optional path to client-side TLS certificate key to use when connecting to -remoteRead.url")
@@ -22,11 +26,12 @@ var (
"By default system CA is used")
tlsServerName=flag.String("remoteRead.tlsServerName","","Optional TLS server name to use for connections to -remoteRead.url. "+
"By default the server name from -remoteRead.url is used")
disablePathAppend=flag.Bool("remoteRead.disablePathAppend",false,"Whether to disable automatic appending of '/api/v1/query' path to the configured -remoteRead.url.")
)
// Init creates a Querier from provided flag values.
addr=flag.String("remoteWrite.url","","Optional URL to VictoriaMetrics or VMInsert where to persist alerts state"+
"and recording rules results in form of timeseries. E.g. http://127.0.0.1:8428")
basicAuthUsername=flag.String("remoteWrite.basicAuth.username","","Optional basic auth username for -remoteWrite.url")
basicAuthPassword=flag.String("remoteWrite.basicAuth.password","","Optional basic auth password for -remoteWrite.url")
addr=flag.String("remoteWrite.url","","Optional URL to VictoriaMetrics or vminsert where to persist alerts state"+
"and recording rules results in form of timeseries. For example, if -remoteWrite.url=http://127.0.0.1:8428 is specified, "+
"then the alerts state will be written to http://127.0.0.1:8428/api/v1/write . See also -remoteWrite.disablePathAppend")
basicAuthUsername=flag.String("remoteWrite.basicAuth.username","","Optional basic auth username for -remoteWrite.url")
basicAuthPassword=flag.String("remoteWrite.basicAuth.password","","Optional basic auth password for -remoteWrite.url")
basicAuthPasswordFile=flag.String("remoteWrite.basicAuth.passwordFile","","Optional path to basic auth password to use for -remoteWrite.url")
bearerToken=flag.String("remoteWrite.bearerToken","","Optional bearer auth token to use for -remoteWrite.url.")
bearerTokenFile=flag.String("remoteWrite.bearerTokenFile","","Optional path to bearer token file to use for -remoteWrite.url.")
maxQueueSize=flag.Int("remoteWrite.maxQueueSize",1e5,"Defines the max number of pending datapoints to remote write endpoint")
maxBatchSize=flag.Int("remoteWrite.maxBatchSize",1e3,"Defines defines max number of timeseries to be flushed at once")
@@ -27,6 +31,7 @@ var (
"By default system CA is used")
tlsServerName=flag.String("remoteWrite.tlsServerName","","Optional TLS server name to use for connections to -remoteWrite.url. "+
"By default the server name from -remoteWrite.url is used")
disablePathAppend=flag.Bool("remoteWrite.disablePathAppend",false,"Whether to disable automatic appending of '/api/v1/write' path to the configured -remoteWrite.url.")
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.