_stream field can be empty for the recently ingested rows because respective entry in indexdb is not yet searchable as it haven't been flushed to storage yet.
This change just skips such items in the output response to make it more consistent.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6042
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmauth: do not increment backend_errors when hitting concurrency limit
Previously, both "vmauth_concurrent_requests_limit_reached_total" and "vmauth_user_request_backend_errors_total" were incremented.
This was based on the assumption that if concurrency limit is hit the backend must be failing to handle the request thus meaning an error.
This assumption does not work in case the endpoint can be overloaded by the misbehaving client sending too many requests within the timeframe.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5565
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
This should improve debuggability of unexpected deletion of directories inside partitions.
While at it, log the proper path to parts.json when the directory for big part is missing in the partition.
parts.json is located inside directory with small parts, and there is no parts.json file inside directory with big parts.
* docs/cluster: update cluster resizing info
- add example of resources distribution
- add info about how to handle uneven disk usage after adding a new storage node
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Update docs/Cluster-VictoriaMetrics.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* Update docs/Cluster-VictoriaMetrics.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* vmui: fix parsing of fractional values
* vmui/vmanomaly: update display logic to align with vmanomaly /query_range API
* vmui/vmanomaly: rename flag anomalyView to isAnomalyView
- Automatically reload changed TLS root CA pointed by -remoteWrite.tlsCAFile command-line flag
- Automatically reload changed TLS root CA configured via oauth2.tsl_config.ca_file option at -promscrape.config
- Document the change as a feature instead of a bug at docs/CHANGELOG.md
- Simplify the code at lib/promauth, which is responsible for reloading changed TLS root CA files.
- Simplify the usage of lib/promauth.Config.NewRoundTripper() - now it accepts the base http.Transport
instead of a callback, which can change the internal http.Transport.
- Reuse the default tls config if lib/promauth.Config doesn't contain tls-specific configs.
This should reduce memory usage a bit when tls isn't used for scraping big number of targets.
- Do not re-read TLS root CA files on every processed request. Re-read them once per second.
This should reduce CPU usage when scraping big number of targets over https.
- Do not store cert.pem and key.pem files in TestTLSConfigWithCertificatesFilesUpdate, since they can be loaded
from byte slices via crypto/tls.X509KeyPair().
- Remove obsolete comparisons of string representations for authConfig and proxyAuthConfig at areEqualScrapeConfigs().
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5725
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5526
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2171
* lib/{promauth,promscrape}: automatically refresh root CA certificates after changes on disk
Added a custom `http.RoundTripper` implementation which checks for root CA content changes and updates `tls.Config` used by `http.RoundTripper` after detecting CA change.
Client certificate changes are not tracked by this implementation since `tls.Config` already supports passing certificate dynamically by overriding `tls.Config.GetClientCertificate`.
This change implements dynamic reload of root CA only for streaming client used for scraping. Blocking client (`fasthttp.HostClient`) does not support using custom transport so can't use this implementation.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5526
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promauth/config: update NewRoundTripper API
Update API to allow user to update only parameters required for transport.
Add warning log when reloading Root CA failed.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promauth/config: fix mutex acquire logic
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promauth/config: replace RWMutex with regular mutex to simplify the code
- remove additional mutex used for getRootCABytes - require callee to use mutex
- replace RWMutex with regular mutex
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promauth/config: refactor
- hold the mutex lock to avoid round tripper being re-created twice
- move recreation logic into separate func to simplify the code
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
- Rename -opentelemetry.sanitizeMetrics command-line flag to more clear -opentelemetry.usePrometheusNaming
- Clarify the description of the change at docs/CHANGELOG.md
- Rename promrelabel.SanitizeLabelNameParts to more clear promrelabel.SplitMetricNameToTokens
- Properly split metric names at '_' char in promerlabel.SplitMetricNameToTokens.
- Add tests for various edge cases for Prometheus metric names' normalization
according to the code at b865505850/pkg/translator/prometheus/normalize_name.go
- Extract the code responsible for Prometheus metric names' normalization into a separate file (santize.go)
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6037
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6035
- Make the configuration more clear by accepting the list of ignored labels during sharding
via a dedicated command-line flag - -remoteWrite.shardByURL.ignoreLabels.
This prevents from overloading the meaning of -remoteWrite.shardByURL.labels command-line flag.
- Removed superfluous memory allocation per each processed sample if sharding by remote storage is enabled.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5938
Remove description for -search.maxExportDuration and -search.maxStatusRequestDuration command-line flags
from the 'Resource usage limits' chapter, since these flags are rarely used for limiting resource usage
and they are already documented in the 'List of command-line flags' chapter.
- Fix docs for new functions at app/vmselect/graphite/functions.json
- Properly drain series lists on errors in aggregateSeriesListsGeneric() and aggregateSeriesList()
- Add links to docs for the added functions at docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5809
- Allow specifying only a single HTTP header for reading auth tokens via -httpAuthHeader command-line flag.
This is better from security PoV, since this prevents from accidental reading of auth token from undesired
HTTP header. By default the -httpAuthHeader equals to Authorization. When it is overridden, then
auth token isn't read from Authorization header - it is read only from the specified header.
- Document the -httpAuthHeader command-line flag at https://docs.victoriametrics.com/vmauth/#reading-auth-tokens-from-other-http-headers
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6009
* docs/vmanomaly: v1.12.0 & link updates
* add autotuned description to model section
* - update refs of vmanomaly on enterprise and vmalert pages
- add diagrams for model types
- update self-monitoring section
* - fix typos
- remove .index.html from links
This reverts commit cb23685681.
Reason for revert: the "fix" may hide programming bugs related to incorrect creation of folders
before their use. This may complicate detecting and fixing such bugs in the future.
There are the following fixes for the issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5985 :
- To configure the OS to do not drop data from the system-wide temporary directory (aka /tmp).
- To run VictoriaMetrics with -cacheDataPath command-line flag, which points to the directory,
which cannot be removed automatically by the OS.
The case when the user accidentally deletes the directory with some files created by VictoriaMetrics
shouldn't be considered as expected, so VictoriaMetrics shouldn't try resolving this case automatically.
It is much better from operation and debuggability PoV is to crash with the clear `directory doesn't exist` error
in this case.
The remotewrite.Stop() expects that there are no pending calls to TryPush().
This means that the ingestionRateLimiter.Register() must be unblocked inside TryPush() when calling remotewrite.Stop().
Provide remotewrite.StopIngestionRateLimiter() function for unblocking the rate limiter before calling the remotewrite.Stop().
While at it, move the rate limiter into lib/ratelimiter package, since it has two users.
Also move the description of the feature to the correct place at docs/CHANGELOG.md.
Also cross-reference -remoteWrite.rateLimit and -maxIngestionRate command-line flags.
This is a follow-up for 02bccd1eb9
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5900
Custom HTTP headers are set via net/http.Header.Set or net/http.Header.Add functions.
These functions always convert header keys to canonical form. So the change at b577413d3b
isn't visible to users of VictoriaMetrics components.
There is no need in documenting this change at docs/CHANGELOG.md, since it doesn't give any useful information to users.
This is a follow-up for e6dd52b04c
* lib/storage: add ability to use downsampling for the given series filter
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs: add information about downsampling filters
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs: fix MetricsQL filter
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage/downsampling: treat missing downsampling filter as a bug
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage/part_header: verify correctness of downsampling filters when opening partition
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage/downsampling: save only appliable rules in part metadata
Filter and save only rules which are appliable to partition based on MinTimestamp of stored data.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage/downsampling: update log messages for final dedup
Properly specify a reason of re-running deduplication for partition.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage: consistently use MaxTimestamp to determine deduplication/downsampling rules
Using MinTimestamp leads to applying downsampling to parts which are only partially covered by downsampling rule.
For example, partition covers range [1000-2000]. At t=2100 and rule offset 500 data with t=2100-500 => 1600 must be downsampled. The range check against MinTimestamp evaluates to true even though partition contains range which must not be downsampled - [1600:2000].
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Follow-up
- Apply the first matching downsampling period if multiple filters match the given time series.
This allows fine-tuning the downsampling config for the specific needs.
- Take into account downsampling filters during search queries.
- Reduce the difference between community and enterprise branches. This should simplify further maintenance of these branches.
- Properly parse series filters with colons inside them.
- Document the feature at docs/CHANGELOG.md.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4960
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/storage: adds metrics for downsampling
vm_downsampling_partitions_scheduled - shows the number of parts, that must be downsampled
vm_downsampling_partitions_scheduled_size_bytes - shows total size in bytes for parts, the must be donwsampled
These two metrics answer the questions - is downsampling running? how many parts scheduled for downsampling and how many of them currently downsampled? Storage space that it occupies.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2612
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* rm recommendation to keep look-behind window empty, as it is not correct
* mention the change of default value for `-search.latencyOffset`
Signed-off-by: hagen1778 <roman@victoriametrics.com>
During shutdown period of vmalert, remotewrite client retrieve all pending time series from buffer queue, compose them into 1 batch and execute remote write.
This final batch may exceed the limit of -remoteWrite.maxBatchSize, and be rejected by the receiver (gateway, vmcluster or others).
This changes ensures that even during shutdown vmalert won't exceed the max batch size limit for remote write
destination.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6025
To horizontally scale streaming aggregation, you might want to deploy a separate hashing tier
of vmagents that route to a separate aggregation tier. The hashing tier should shard by all labels
except the instance-level labels, to ensure the input metrics are routed correctly to the aggregator
instance responsible for those labels.
For this to achieve we introduce `remoteWrite.shardByURL.inverseLabels` flag to inverse logic of `remoteWrite.shardByURL.labels`
---------
Co-authored-by: Eugene Ma <eugene.ma@airbnb.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* vmalert: fix sending alert messages
1. fix `endsAt` field in messages that send to alertmanager, previously rule with small interval could never be triggered;
2. fix behavior of `-rule.resendDelay`, before it could prevent sending firing message when rule state is volatile.
* docs: update changelog notes
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
Store the deadline when the metricID entries must be deleted from indexdb
if metricID->metricName entry isn't found after the deadline. This should
make the code more clear comparing the the previous version, where the timestamp
of the first metricID->metricName lookup miss was stored in missingMetricIDs.
Remove the misleading comment about the importance of the order for creating entries
in the inverted index when registering new time series. The order doesn't matter,
since any subset of the created entries can become visible for search
before any other subset after registering in indexdb.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5948
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5959
* lib/storage/table: properly wait for force merges to be completed during shutdown
Properly keep track of running background merges and wait for merges completion when closing the table.
Previously, force merge was not in sync with overall storage shutdown which could lead to holding ptw ref.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs: add changelog entry
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
vmselect uses a cache folder in file system for two purposes:
1. Storing rollup cache results on shutdown;
2. Storing temporary search results from vmstorage during query executions.
It could happen that cache folder is deleted accidentally by user, or by OS
during cleanup routines. This would cause vmselect to:
1. panic on /metrics call, because `MustGetFreeSpace` will fail;
2. return query error user, as it won't be able to store temporary search results.
The changes in this commit are the following:
1. Make `MustGetFreeSpace` to try re-creating the cache folder if it is missing;
2. Make vmselect to try re-creating the cache folder if it can't persist tmp search
results.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5985
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
* [vmagent] added ingestion rate limiting with new flag `-maxIngestionRate`. This flag can be used to limit the number of samples ingested by vmagent per second. If the limit is exceeded, the ingestion rate will be throttled.
* fix changelog
* fix review comment
When `--vm-native-step-interval` is specified, explore phase will be executed
within specified intervals. Discovered metric names will be associated with
time intervals at which they were discovered. This suppose to reduce number
of requests vmctl makes per metric name since it will skip time intervals
when metric name didn't exist.
This should also reduce probability of exceeding complexity limits
for number of selected series in one request during explore phase.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5369
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Links with upper-case simply don't work for unknown reason.
Once the reason is fixed on docs side, this commit can be reverted.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It is confusing for cluster users to find that VMUI is available at vmselect as it is only mentioned in the list of URLs. Explicit mention of vmselect URL in docs will make it easier to discover.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
This makes the code less fragile - it is harder to skip the convertToCompositeTagFilterss() call now.
While at it, call indexSearch.containsTimeRange() inside indexSearch.searchMetricIDsInternal()
in order to quickly terminate search of time series in the old indexdb for new time ranges.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5055
This is a follow-up for 2d31fd7855
It is assumed that URLPrefix.busOriginal will be initialized
durin Unmarshal of the config. But in tests we set fields manually,
so this field never get initialized properly.
Fixes the error `panic: runtime error: integer divide by zero`
at `vmauth.getLeastLoadedBackendURL`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
If tlsServerName isn't empty, then it is likely the https request is sent to IP instead of hostname.
In this case the request will fail, since Go automatically sets the Host header to the IP instead
of the desired hostname at tlsServerName. So set the Host header to tlsServerName if itsn't empty.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5802
The `match[]` filter is mandatory at /api/v1/series, so it mustn't be dropped here.
There is no sense in dropping `match[]` filter together with `extra_label` and `extra_filters[]`
at /api/v1/labels and /api/v1/label/.../values if -search.ignoreExtraFiltersAtLabelsAPI commnad-line flag is set,
since:
- the `match[]` filter triggers slow path at these APIs;
- the `extra_label` and `extra_filters[]` filters narrow down the number of matched time series,
so they improve performance comparing to the case when only `match[]` filter is left,
while `extra_label` and `extra_filters[]` filters are dropped.
This is a follow-up for 0b7a23a91d
Improved trace display for better visual separation of branches:
* Increased left padding for each element
* Added padding for the last element in the branch
The change also removes misleading `default` value from README for `maxConcurrentInserts`
cmd-line flag.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This reverts commit eb40395a1c.
Reason for revert: it has been appeared that the performance gain on multiple CPU cores
wasn't visible because the benchmark was generating incorrect pushSample.key.
See a207e0bf687d65f5198207477248d70c69284296
Previously samples were dropped on the first incomplete interval and the next complete interval.
Also make sure that the de-duplication is performed just before flushing the aggregate state.
This should help the case then dedup_interval = interval.
* use docs.victoriametrics.com instead of github docs
* add links to common terms used in VictoriaMetrics
Signed-off-by: hagen1778 <roman@victoriametrics.com>
For example, if `interval: 1m`, then data flush occurs at the end of every minute,
while `interval: 1h` leads to data flush at the end of every hour.
Add `no_align_flush_to_interval` option, which can be used for disabling the alignment.
The labelsMap struct employs the fact that label indexes are condensed around 0,
so it stores the referred labels in a slice instead of map and uses slice index as label key.
This allows increasing the LabelsCompressor.Decompress performance by up to 3x.
This also reduces the latency of data flush in stream aggregation.
- Reduce memory usage by up to 5x when de-duplicating samples across big number of time series.
- Reduce memory usage by up to 5x when aggregating across big number of output time series.
- Add lib/promutils.LabelsCompressor, which is going to be used by other VictoriaMetrics components
for reducing memory usage for marshaled []prompbmarshal.Label.
- Add `dedup_interval` option at aggregation config, which allows setting individual
deduplication intervals per each aggregation.
- Add `keep_metric_names` option at aggregation config, which allows keeping the original
metric names in the output samples.
- Add `unique_samples` output, which counts the number of unique sample values.
- Add `increase_prometheus` and `total_prometheus` outputs, which ignore the first sample
per each newly encountered time series.
- Use 64-bit hashes instead of marshaled labels as map keys when calculating `count_series` output.
This makes obsolete https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5579
- Expose various metrics, which may help debugging stream aggregation:
- vm_streamaggr_dedup_state_size_bytes - the size of data structures responsible for deduplication
- vm_streamaggr_dedup_state_items_count - the number of items in the deduplication data structures
- vm_streamaggr_labels_compressor_size_bytes - the size of labels compressor data structures
- vm_streamaggr_labels_compressor_items_count - the number of entries in the labels compressor
- vm_streamaggr_flush_duration_seconds - a histogram, which shows the duration of stream aggregation flushes
- vm_streamaggr_dedup_flush_duration_seconds - a histogram, which shows the duration of deduplication flushes
- vm_streamaggr_flush_timeouts_total - counter for timed out stream aggregation flushes,
which took longer than the configured interval
- vm_streamaggr_dedup_flush_timeouts_total - counter for timed out deduplication flushes,
which took longer than the configured dedup_interval
- Actualize docs/stream-aggregation.md
The memory usage reduction increases CPU usage during stream aggregation by up to 30%.
This commit is based on https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5850
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5898
The match[] at /api/v1/labels and /api/v1/label/.../values also may lead to slow requests and
high resource usage if it matches big number of time series. So it must be igrnored if -search.ignoreExtraFiltersAtLabelsAPI
command-line flag is set.
This is a follow-up for fab02faa3f
* app/{vmagent,vminsert}: adds /v1/metrics suffix for opentelemetry route path
it must fix compatibility with opentemetry-collector [spec](https://opentelemetry.io/docs/specs/otlp/\#otlphttp-request)
this suffix is hard-coded and cannot be changed with collector configuration
* Apply suggestions from code review
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* metricsql: fix label_join() when `dst_label` is equal to one of the `src_label`
* Update app/vmselect/promql/transform.go
* Update docs/CHANGELOG.md
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Document the ability to read OpenTelemetry data from Amazon Firehose at docs/CHANGELOG.md
- Simplify parsing Firehose data. There is no need in trying to optimize the parsing with fastjson
and byte slice tricks, since OpenTelemetry protocol is really slooow because of over-engineering.
It is better to write clear code for better maintanability in the future.
- Move Firehose parser from /lib/protoparser/firehose to lib/protoparser/opentelemetry/firehose,
since it is used only by opentelemetry parser.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5893
* deployment: create a separate env for VictoriaLogs
The new environment consists of the following components:
* VictoriaLogs
* fluentbit for collecting logs and sending to VictoriaLogs
* VictoriaMetrics for scraping and storing metrics from fluentbit and VictoriaLogs
* Grafana with VictoriaLogs datasource for monitoring
-----------------
The motivation for creating a separate environment is to simplify existing environments
and make it easier to update or modify them in future.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Prevsiously they were swapped - the first arg should be the label name and the second arg should be label filters
This is a follow-up for e389b7b959e8144fdff5075bf7a5a39b2b0c6dd3
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5847
This solves two issues:
1. The vm_backups_uploaded_bytes_total metric will grow more smoothly
2. This prevents from int overflow at metrics.Counter.Add() when uploading files bigger than 2GiB
This should simplify code maintenance by gradually converting to atomic.* types instead of calling atomic.* functions
on int and bool types.
See ea9e2b19a5
The issue has been introduced in bace9a2501
The improper fix was in the d4c0615dcd ,
since it fixed the issue just by an accident, because Go comiler aligned the rawRowsShards field
by 4-byte boundary inside partition struct.
The proper fix is to use atomic.Int64 field - this guarantees that the access to this field
won't result in unaligned 64-bit atomic operation. See https://github.com/golang/go/issues/50860
and https://github.com/golang/go/issues/19057
It has been appeared that there are VictoriaMetrics users, who rely on the fact that
VictoriaMetrics components were closing incoming connections to -httpListenAddr every 2 minutes
by default. So let's return back this value by default in order to fix the breaking change
made at d8c1db7953 .
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1304#issuecomment-1961891450 .
Previously the (date, metricID) entries for dates older than the last 2 days were removed.
This could lead to slow check for the (date, metricID) entry in the indexdb during ingesting historical data (aka backfilling).
The issue has been introduced in 431aa16c8d
This commit returns back limits for these endpoints, which have been removed at 5d66ee88bd ,
since it has been appeared that missing limits result in high CPU usage, while the introduced concurrency limiter
results in failed lightweight requests to these endpoints because of timeout when heavyweight requests are executed.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5055
* vmui: add a time picker to the "Logs Explorer" page #5673
* Update app/vmui/packages/vmui/src/pages/ExploreLogs/hooks/useFetchLogs.ts
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Do not convert shard items to part when a shard becomes full. Instead, collect multiple
full shards and then convert them to a searchable part at once. This reduces
the number of searchable parts, which, in turn, should increase query performance,
since queries need to scan smaller number of parts.
* app/vmselect: adds milliseconds to the csv export response for rfc3339
* milliseconds is a standard prescion for VictoriaMetrics query request responses
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5837
* app/victoria-metrics: adds tests for csv export/import
follow-up after 3541a8d0cf96dd4f8563624c4aab6816615d0756
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
Previously the interval between item addition and its conversion to searchable in-memory part
could vary significantly because of too coarse per-second precision. Switch from fasttime.UnixTimestamp()
to time.Now().UnixMilli() for millisecond precision. It is OK to use time.Now() for tracking
the time when buffered items must be converted to searchable in-memory parts, since time.Now()
calls aren't located in hot paths.
Increase the flush interval for converting buffered samples to searchable in-memory parts
from one second to two seconds. This should reduce the number of blocks, which are needed
to be processed during high-frequency alerting queries. This, in turn, should reduce CPU usage.
While at it, hardcode the maximum size of rawRows shard to 8Mb, since this size gives the optimal
data ingestion pefromance according to load tests. This reduces memory usage and CPU usage on systems
with big amounts of RAM under high data ingestion rate.
The pooled rawRowsBlock objects occupies big amounts of memory between flushes,
and the flushes are relatively rare. So it is better to don't use the pool
and to allocate rawRow blocks on demand. This should reduce the average
memory usage between flushes.
The buffer can be quite big under high ingestion rate (e.g. more than 100MB).
This leads to increased memory usage between buffer flushes.
So it is better to re-create the buffer on every flush in order to reduce memory usage
between buffer flushes.
This should improve maintainability of the code related to rollup functions,
since it is located in rollup.go
While at it, properly return empty results from holt_winters(), rate_over_sum(),
sum2_over_time(), geomean_over_time() and distinct_over_time() when there are no real samples
on the selected lookbehind window. Previously the previous sample value was mistakenly
returned from these functions.
* [lib/promutils, lib/httputils] fixed floating-point error when parsing time in RFC3339 format (#5801)
* fixed tests
* fixed test
* Revert "fixed test"
This reverts commit 8a29764806.
* Revert "fixed tests"
This reverts commit 9ce13d1042.
* Revert "[lib/promutils, lib/httputils] fixed floating-point error when parsing time in RFC3339 format (#5801)"
This reverts commit a7a04bd4
* [lib/httputils] fixed floating-point error when parsing time in RFC3339 format (#5801)
---------
Co-authored-by: Nikolay <nik@victoriametrics.com>
Currently the docker-compose examples for loading `victoriametrics-datasource` uses 2 environment variables:
- `GF_ALLOW_LOADING_UNSIGNED_PLUGINS`
- `GF_DEFAULT_APP_MODE`
I believe both of the env vars are trying to achieve the same thing. `GF_DEFAULT_APP_MODE` disables code signing for all plugins and `GF_ALLOW_LOADING_UNSIGNED_PLUGINS` intends to disable code signing for just `victoriametrics-datasource`.
Keeping the scope narrowed to just `victoriametrics-datasource` would be preferable in this case.
Unfortunately `GF_ALLOW_LOADING_UNSIGNED_PLUGINS` is misspelled. According to [grafana docs](https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/#override-configuration-with-environment-variables), the format is supposed to be `GF_<SectionName>_<KeyName>`. In other words the current env var is missing the section name.
This PR proposes to:
1. fix the typo
2. remove the global disablement of code signing
Alternatively, if you prefer to keep codesigning disabled globally, please remove `GF_ALLOW_LOADING_UNSIGNED_PLUGINS` env var as it confuses things
* support case-insensitive search
* reflect search condition in URL, so link can be sharable
* support filtering on /alerts page
* fix collapseAll/expandAll logic to respect only shown entries
* add changelog
b60dcbe11f
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Consistently return the first `limit` log entries if the total size of found log entries doesn't exceed 1Mb.
See app/vlselect/logsql/sort_writer.go . Previously random log entries could be returned with each request.
- Document the change at docs/VictoriaLogs/CHANGELOG.md
- Document the `limit` query arg at docs/VictoriaLogs/querying/README.md
- Make the change less intrusive.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5674
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5778
* - apply v1.10 changes
- chapter on model types (uni/multivariate and rolling)
* - update self-monitoring labels description
- fix typos
* fix duplicated text and link rendering
* app/vmauth: properly release memory during config reload
previously metrics package hold a refrence for channels for users concurrent requests.
it case of churn at `name` field of users configuration, new metric was created. But previous one wasn't deleted.
It prevented full parsed configuration from being garbace collected.
now all config related metrics are bound to corresponding metrics.Set and unregistered during config reload process.
It also must fix an issue with incorrect values for current concurrent user requests
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4690
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Instead, log a sample of these long items once per 5 seconds into error log,
so users could notice and fix the issue with too long labels or too many labels.
Previously this panic could occur in production when ingesting samples with too long labels.
The 3c246cdf00 added an optimization where the previous metaindexRow
could be saved to disk when the current block header couldn't be added indexBlock because the resulting
indexBlock size became too big. This could result in an empty metaindexRow.firstItem for the next metaindexRow.
This allows removing importing unneeded command-line flags into binaries, which import lib/storage,
which, in turn, was importing lib/snapshot in order to use Time, Validate and NewName functions.
This is a follow-up for 83e55456e2
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5738
Closing client connections every 2 minutes doesn't help load balancing -
this just leads to "jumpy" connections between multiple backend servers,
e.g. the load isn't spread evenly among backend servers, and instead jumps
between the servers every 2 minutes.
It is still possible periodically closing client connections by specifying non-zero -http.connTimeout command-line flag.
This should help with https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1304#issuecomment-1636997037
This is a follow-up for d387da142e
Sync description after changes in 83e55456e2
Remove extra text in flags description for vmbackupmanager as it breaks layout
when rendered at docs.victoriametrics.com
Signed-off-by: hagen1778 <roman@victoriametrics.com>
There is no sense in storing commonPrefix for blockHeader containing only a single item,
since this only increases blockHeader size without any benefits.
This panic could occur when samples with too long label values are ingested into VictoriaMetrics.
This could result in too long fistItem and commonPrefix values at blockHeader (up to 64kb each).
This may inflate the maximum index block size by 4 * maxIndexBlockSize.
* add more details to changelog
* simplify panels description
* remove capacity planning recommendation, as it proves it incompetent
Signed-off-by: hagen1778 <roman@victoriametrics.com>
During background downsampling, rate(vm_deduplicated_samples_total{type="merge"}) could be much bigger than
rate(vm_rows_added_to_storage_total) and it could last quite some time,
which causes negative values of Storage full ETA and confuses users, see playground.
Instead of trying to get more accurate results during downsampling, I think it's ok to ignore
vm_deduplicated_samples_total at all, it's more reasonable to see Storage full ETA increase after downsampling.
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
changes(), increases_over_time() and resets() shouldn't take into account
value changes, which may occur because of precision errors.
The maximum guaranteed precision for raw samples stored in VictoriaMetrics is 12 decimal digits.
So do not count relative changes for values if they are smaller than 1e-12 comparing to the value.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/767
For example, -fooDuration=',10s,' is now supported - it sets three command-line flag values:
- the first and the last one are set to the default value for `-fooDuration`
- the second one is set to 10s
* vmui: fix select closing on click outside (#5728)
* vmui: clear entered text in select after selecting a value (#5727)
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This should significantly reduce the number of open ReaderAt files
on VictoriaMetrics and VictoriaLogs startup.
The open files can be tracked via vm_fs_readers metric
GOGC can be already set via environment variable. There is no need in adding
new approaches for setting the GOGC (such as command-line flag), since they complicate operations.
It should help identifying cases when too much CPU is spent on garbage collection,
and advice users on how this can be addressed.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Remove temporary file before closing it in order to signal the OS that it shouldn't
store the file contents from page cache to disk when the file is closed.
Gracefully handle the case when the file cannot be removed before being closed -
in this case remove the file after closing it. This allows working on Windows.
Also remove superflouos opening of temporary file for reading - re-use already opened file handle for writing.
This is a follow-up for 9b1e002287
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4020
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70
The easyproto-based marshaler is 2x slower than the previous custom marshaler,
so let's stick with it. This improves the performance for sending data to remote storage at vmagent
and reduces CPU usage to pre-v1.97.0 levels.
This case is possible after a new brsPool is allocated. The fix is to verify whether len(brsPool) >= len(brs.brs)
before trying to append a new item to brsPool and sharing its contents with brs.brs.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5733
* app/vmselect: set proper timestamp for cached instant responses
The change updates `getSumInstantValues` to prefer timestamp
from the most recent results. Before, timestamp from cached series
was used.
The old behavior had negative impact on recording rules as they
were getting responses with shifted timestamps in past.
Subsequent recording or alerting rules fetching results of these
recording rules could get no result due to staleness interval.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5659
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
There is no sense in running more than GOMAXPROCS concurrent marshalers,
since they are CPU-bound. More concurrent marshalers do not increase the marshaling bandwidth,
but they may result in more RAM usage.
Entries for the previous dates is usually not used, so there is little sense in keeping them in memory.
This should reduce the size of storage/date_metricID cache, which can be monitored
via vm_cache_entries{type="storage/date_metricID"} metric.
This limit has little sense for these APIs, since:
- Thses APIs frequently result in scanning of all the time series on the given time range.
For example, if extra_filters={datacenter="some_dc"} .
- Users expect these APIs shouldn't hit the -search.maxUniqueTimeseries limit,
which is intended for limiting resource usage at /api/v1/query and /api/v1/query_range requests.
Also limit the concurrency for /api/v1/labels, /api/v1/label/.../values
and /api/v1/series requests in order to limit the maximum memory usage and CPU usage for these API.
This limit shouldn't affect typical use cases for these APIs:
- Grafana dashboard load when dashboard labels should be loaded
- Auto-suggestion list load when editing the query in Grafana or vmui
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5055
* port un-synced changed from docs/readme to readme
* consistently use `sh` instead of `console` highlight, as it looks like
a more appropriate syntax highlight
* consistently use `sh` instead of `bash`, as it is shorter
* consistently use `yaml` instead of `yml`
See syntax codes here https://gohugo.io/content-management/syntax-highlighting/
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Optimize the performance of data merge: decimal.CalibrateScale() from 49633 ns/op to 9146 ns/op
* Optimize the performance of data merge: decimal.CalibrateScale()
Sending unfinished aggregate states tend to produce unexpected anomalies with lower values than expected.
The old behavior can be restored by specifying `flush_on_shutdown: true` setting in streaming aggregation config
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: fix data race during hot-config reload
During hot-reload, the logic evokes the group update and rules evaluation
interruption simultaneously. Falsely assuming that interruption happens before
the update. However, it could happen that group will be updated first and only
after the rules evaluation will be cancelled. Which will result in permanent
interruption for all rules within the group.
The fix caches the cancel context function into local variable first. And only after
performs the group update. With cached cancel function we can safely call it without
worrying that we cancel the evaluation for already updated group.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Revert "app/vmalert: fix data race during hot-config reload"
This reverts commit a4bb7e8932.
* app/vmalert: fix data race during hot-config reload
During hot-reload, the logic evokes the group update and rules evaluation
interruption simultaneously. Falsely assuming that interruption happens before
the update. However, it could happen that group will be updated first and only
after the rules evaluation will be cancelled. Which will result in permanent
interruption for all rules within the group.
The fix cancels the evaulation context before applying the update, making sure
that the context will be cancelled for old group always.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Maintain a separate worker pool per each part type (in-memory, file, big and small).
Previously a shared pool was used for merging all the part types.
A single merge worker could merge parts with mixed types at once. For example,
it could merge simultaneously an in-memory part plus a big file part.
Such a merge could take hours for big file part. During the duration of this merge
the in-memory part was pinned in memory and couldn't be persisted to disk
under the configured -inmemoryDataFlushInterval .
Another common issue, which could happen when parts with mixed types are merged,
is uncontrolled growth of in-memory parts or small parts when all the merge workers
were busy with merging big files. Such growth could lead to significant performance
degradataion for queries, since every query needs to check ever growing list of parts.
This could also slow down the registration of new time series, since VictoriaMetrics
searches for the internal series_id in the indexdb for every new time series.
The third issue is graceful shutdown duration, which could be very long when a background
merge is running on in-memory parts plus big file parts. This merge couldn't be interrupted,
since it merges in-memory parts.
A separate pool of merge workers per every part type elegantly resolves both issues:
- In-memory parts are merged to file-based parts in a timely manner, since the maximum
size of in-memory parts is limited.
- Long-running merges for big parts do not block merges for in-memory parts and small parts.
- Graceful shutdown duration is now limited by the time needed for flushing in-memory parts to files.
Merging for file parts is instantly canceled on graceful shutdown now.
- Deprecate -smallMergeConcurrency command-line flag, since the new background merge algorithm
should automatically self-tune according to the number of available CPU cores.
- Deprecate -finalMergeDelay command-line flag, since it wasn't working correctly.
It is better to run forced merge when needed - https://docs.victoriametrics.com/#forced-merge
- Tune the number of shards for pending rows and items before the data goes to in-memory parts
and becomes visible for search. This improves the maximum data ingestion rate and the maximum rate
for registration of new time series. This should reduce the duration of data ingestion slowdown
in VictoriaMetrics cluster on e.g. re-routing events, when some of vmstorage nodes become temporarily
unavailable.
- Prevent from possible "sync: WaitGroup misuse" panic on graceful shutdown.
This is a follow-up for fa566c68a6 .
Thanks @misutoth to for the inspiration at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5212
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5190
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3790
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3551
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3337
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3425
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3647
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3641
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/648
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/291
* docs: remove raw and endraw tags as they are not needed for the new version of site
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
* revert formating in vmaler
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
---------
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
The maxFileParts usage has been accidentally removed in fa566c68a6
While at it, add Count suffix to *AssistedMerges counter names in order to make them less misleading.
Previously their names were falsely suggesting that these are gauges, which show the number of concurrently
executed assisted merges.
This reverts cd4f641d32 , since it has been appeared that the disabled compression
for vmstorage->vmselect data increase network bandwidth usage by more than 10x on typical production workloads,
while it decreases CPU usage at vmstorage by up to 10% and improves query latency by up to 10%.
The 10x increase in network usage is too high price for 10% improvements on query latency and vmstorage CPU usage.
This may result in network bandwidth bottlenecks, which can reduce the overall performance and stability
of VictoriaMetrics cluster. That's why return back the vmstorage->vmselect data compression by default.
The vmstorage->vmselect compression can be disabled by passing -rpc.disableCompression command-line flag to vmstorage.
The vmselect->vmselect compression in multi-level cluster setup can be disabled by passing -clusternative.disableCompression command-line flag.
It has been appeared that the registration of new time series slows down linearly
with the number of indexdb parts, since VictoriaMetrics needs to check every indexdb part
when it searches for TSID by newly ingested metric name.
The number of in-memory parts grows when new time series are registered
at high rate. The number of in-memory parts grows faster on systems with big number
of CPU cores, because the mergeset maintains per-CPU buffers with newly added entries
for the indexdb, and every such entry is transformed eventually into a separate in-memory part.
The solution has been suggested in https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5212
by @misutoth - to limit the number of in-memory parts with buffered channel.
This solution is implemented in this commit. Additionally, this commit merges per-CPU parts
into a single part before adding it to the list of in-memory parts. This reduces CPU load
when searching for TSID by newly ingested metric name.
The https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5212 recommends setting the limit on the number
of in-memory parts to 100, but my internal testing shows that much lower limit 15 works with the same efficiency
on a system with 16 CPU cores while reducing memory usage for `indexdb/dataBlocks` cache by up to 50%.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5190
This allows reducing the indexdb/tagFiltersToMetricIDs cache size by 8 on average.
The cache size can be checked via vm_cache_size_bytes{type="indexdb/tagFiltersToMetricIDs"} metric exposed at /metrics page.
Measuring read / write duration from / to in-memory buffers has little sense,
since it will be always fast. It is better to measure read / write duration from / to
real files at vm_filestream_write_duration_seconds_total and vm_filestream_read_duration_seconds_total metrics.
This also reduces overhead on time.Now() and Histogram.UpdateDuration() calls
per each filestream.Reader.Read() and filestream.Writer.Write() call when the data is read / written from / to in-memory buffers.
This is a follow-up for 2f63dec2e3
* lib/promscrape: respect `0` value for `series_limit` param
Respect `0` value for `series_limit` param in `scrape_config`
even if global limit was set via `-promscrape.seriesLimitPerTarget`.
Previously, `0` value will be ignored in favor of `-promscrape.seriesLimitPerTarget`.
This behavior aligns with possibility to override `series_limit` value via
relabeling with `__series_limit__` label.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update docs/CHANGELOG.md
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Previously, it was not possible to determine which tenant sends metrics with excessive amount of labels of label values.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* vmui: fix the logic of closing the popper #5470
* vmui: fix the logic of caching autocomplete results #5472
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This reduces the number of memory allocations at the cost of possible memory usage increase,
since now different metric name strings may hold references to the previous byte slice.
This is good tradeoff, since ProcessSearchQuery is called in vmselect, and vmselect isn't usually limited by memory.
This change has been extracted from https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5527
This should smooth CPU and RAM usage spikes related to these periodic tasks,
by reducing the probability that multiple concurrent periodic tasks are performed at the same time.
The dateMetricIDCache puts recently registered (date, metricID) entries into mutable cache protected by the mutex.
The dateMetricIDCache.Has() checks for the entry in the mutable cache when it isn't found in the immutable cache.
Access to the mutable cache is protected by the mutex. This means this access is slow on systems with many CPU cores.
The mutabe cache was merged into immutable cache every 10 seconds in order to avoid slow access to mutable cache.
This means that ingestion of new time series to VictoriaMetrics could result in significant slowdown for up to 10 seconds
because of bottleneck at the mutex.
Fix this by merging the mutable cache into immutable cache after len(cacheItems) / 2
cache hits under the mutex, e.g. when the entry is found in the mutable cache.
This should automatically adjust intervals between merges depending on the addition rate
for new time series (aka churn rate):
- The interval will be much smaller than 10 seconds under high churn rate.
This should reduce the mutex contention for mutable cache.
- The interval will be bigger than 10 seconds under low churn rate.
This should reduce the uneeded work on merging of mutable cache into immutable cache.
The change removes artificial delay before returning error, which sometimes
caused less retry events than expected.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: fix watcher start order for roles endpoints and endpointslice
Previously the groupWatcher could be mistakenly stopped when requests for pod or services resources take too long.
* remove mislead comment
* docs/sd_configs.md: mention -promscrape.kubernetes.attachNodeMetadataAll flag in the description for attach_metadata section
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4640
* wip
* lib/promscrape/kubernetes: prevent from stopping groupWatcher when there are in-flight apiWatcher.mustStart() calls
groupWatcher is stopped if it has zero registered apiWatchers during 14 seconds.
But such a groupWatcher can be still in use if apiWatcher for `role: endpoints` or `role: endpointslice`
is being registered and the discovery of the associated `pod` and/or `service` objects takes longer
than 14 seconds - see the beginning of groupWatcher.startWatchersForRole() function for details.
Track the number of in-flight calls to apiWatcher.mustStart() and prevent from stopping the associated groupWatcher
if the number of in-flight calls is non-zero.
P.S. postponing the discovery of `pod` and/or `service` objects associated with `endpoints` or `endpointslice` roles
isn't the best solution, since it slows down initial discovery of `endpoints` and `endpointslice` targets.
* typo fix
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Examples:
1) -metricsAuthKey=file:///abs/path/to/file - reads flag value from the given absolute filepath
2) -metricsAuthKey=file://./relative/path/to/file - reads flag value from the given relative filepath
3) -metricsAuthKey=http://some-host/some/path?query_arg=abc - reads flag value from the given url
The flag value is automatically updated when the file contents changes.
* app/vmauth: adds metric_labels and backend_errors counter
it must improve observability for user requests with new metric - per user backend errors counter.
it's needed to calculate requests fail rate to the configured backends.
metric_labels configuration allows to perform additional aggregations on top of multiple users from configuration section.
It could be multiple clients or clients with separate read/write tokens
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5565
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* app/vmselect/promql: properly handle possible negative results caused by float operations precision error in rollup functions like rate() or increase()
* fix test
- docs/sd_configs.md: moved hetzner_sd_configs docs to the correct place according to alphabetical order of SD names,
document missing __meta_hetzner_role label.
- lib/promscrape/config.go: added missing MustStop() call for Hetzner SD,
and moved the code to the correct place according to alphabetical order of SD names.
- lib/promscrape/discovery/hetzner: properly handle pagination for hloud API responses,
populate missing __meta_hetzner_role label like Prometheus does.
- Properly populate __meta_hetzner_public_ipv6_network label like Prometheus does.
- Remove unused SDConfig.Token.
- Remove "omitempty" annotation from SDConfig.Role field, since this field is mandatory.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5550
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3154
It is likely this value was borrowed from panel `Slow inserts` panel
from Grafana dasbhoard for VM single/cluster installations. This is a mistake.
There is no such thing as "tolerable churn rate", as tolerancy depends on the amount
of allocated resources.
Although, it is unclear what is meant by 5%. If it refers to 5% of new time series per second,
then such churn rate is extremely high. It would mean that the avg life of a time series is 20s.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
app/victoriametrics: update the test suite
* simplify /query and /query_range test cases configuration and tests
* support instant queries with lookbehind window like `query=foo[5m]`
* support instant queries selecting scalar value like `query=42`
* add query_range test for prometheus
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmui/vmanomaly: add support models that produce only `anomaly_score`
* vmui/vmanomaly: fix display legend
* vmui/vmanomaly: update comment on anomaly threshold
- Clarify the bugfix description at docs/CHANGELOG.md
- Simplify the code by accessing prefetchedMetricIDs struct under the lock
instead of using lockless access to immutable struct.
This shouldn't worsen code scalability too much on busy systems with many CPU cores,
since the code executed under the lock is quite small and fast.
This allows removing cloning of prefetchedMetricIDs struct every time
new metric names are pre-fetched. This should reduce load on Go GC,
since the cloning of uin64set.Set struct allocates many new objects.
Add docs to explain that `latest` backup folder can be accessed
more frequently than others. Hence, it is recommended to keep it
on storages with low access price.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Before, this cache was limited only by size.
Cache invalidation by time happens with jitter to prevent thundering herd problem.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It was calculating the number of dropped time series instead of the number of dropped samples.
While at it, drop vmalert_remotewrite_dropped_bytes_total metric, since it was inconsistently calculated -
at one place it was calculating raw protobuf-encoded sample sizes, while at another place it was calculating
the size of snappy-compressed prompbmarshal.WriteRequest protobuf message.
Additionally, this metric has zero practical sense, so just drop it in order to reduce the level of confusion.
* update link to book a demo for AD & RCA
* fix invalid refs in components/model
* - fix staging -> prod links
- replace capitalized FAQ headers
- change the section order on main page
- replace :latest tag with current stable
* add panels for detailed visualization of traffic usage between vmstorage, vminsert, vmselect
components and their clients. New panels are available in the rows dedicated to specific components.
* update "Slow Queries" panel to show percentage of the slow queries to the total number of read queries
served by vmselect. The percentage value should make it more clear for users whether there is a service degradation.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: drop `rollupDefault` function as duplicate
It is unclear why there are two identical fns `rollupDefault`
and `rollupDistinct`. Dropping one of them.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update app/vmselect/promql/rollup.go
* Update app/vmselect/promql/rollup.go
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The user may which to control the endpoint parameters for instance to
set the audience when requesting an access token. Exposing the
parameters as a map allows for additional use cases without requiring
modification.
* lib/awsapi: properly assume role with webIdentity token
introduce new irsaRoleArn param for config. It's only needed for authorization with webIdentity token.
First credentials obtained with irsa role and the next sts assume call for an actual roleArn made with those credentials.
Common use case for it - cross AWS accounts authorization
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3822
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Some clients may ingest samples via OpenTelemetry protocol without Resource labels.
Previously VictoriaMetrics was silently dropping such samples.
The commit 317834f876 added vm_protoparser_rows_dropped_total{type="opentelemetry",reason="resource_not_set"}
counter for tracking of such dropped samples. See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5459
It is better from usability PoV to accept such samples instead of dropping them and incrementing the corresponding counter.
Before, retries happened only on writes into a network connection
between source and destination. But errors returned by server after
all the data was transmitted were logged, but not retried.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Currently, it is impossible to understand why metrics are not ingested when resource is not set by OTEL exporter. Adding metric should simplify debugging and make it improve debuggability.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
app/vmselect/promql/eval.go:evalAggrFunc shunts evaluation
of AggrFuncExpr over rollupFunc over MetricsExpr to an optimized
path. tryGetArgRollupFuncWithMetricExpr() checks whether expression
can be shunted, but it mangles the AggrFuncExpr when the aggregation
function has more than one argument. This results in queries like
`sum(aggr_over_time("avg_over_time",m))` failing with error message
'expecting at least 2 args to "aggr_over_time"; got 1 args' while
the analogous query `sum(avg_over_time(m))` executes successfully.
This fix removes the unnecessary mangling.
Signed-off-by: Anton Tykhyy <atykhyy@gmail.com>
Previously `retry_status_codes: []` and `drop_src_path_prefix_parts: 0` at `url_map` were equivalent to missing values.
This was resulting in using the user-level values instead.
* vmui: add show quick tip for autocomplete
* vmui: auto-completion usability improvements #5348
* vmui: add const for min symbols in autocomplete
* Use proper queries to VictoriaMetrics
* vmui: fix comments for autocomplete
* app/vmselect: run `make vmui-update`
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* fix typos in docs
* add `shard-` prefix to generated links when `-promscrape.cluster.memberURLTemplate` is enabled
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This is follow-up after
75196d7234
It updates some of the alerting rules to remove unnecessary aggregations.
It keeps aggregations for expressions which are using multiple time series
filters to make sure their label will match.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Aggregations with by() have one sideeffect, that any custom labels you add to hosts are dropped too which can be used for alerts routing.
Therefore, some good practice could be to use without() instead, with labels, like without(path) , or without(url) to get same aggregations but with any external labels left intact.
Before, vmalert would send notifications with labels containing characters
not supported by Alertmanager validator, resulting into validation errors
like `msg="Failed to validate alerts" err="invalid label set: invalid name "foo.bar"`
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Previously the lower bound could be too small, which could result in missing values at the beginning of the graph
for default_rollup() function. This function is automatically applied to all the series selectors if they aren't
explicitly wrapped into a rollup function - see https://docs.victoriametrics.com/MetricsQL.html#implicit-query-conversions
While at it, properly take into account `-search.minStalenessInterval` command-line flag when adjusting
the lower bound for the selected time range.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5388
* vmagent: export `vm_promscrape_scrape_pool_targets` metric to track the number of targets that each scrape_job discovers
* add extra panel for new metric
- Add links to relevant docs into descriptions for every -kafka.* and -gcp.pubsub.* command-line flags.
- Wait until message processing goroutines are stopped before returning from gcppubsub.Stop().
- Prevent from multiple calls to Init() without Stop().
- Drop message if tenantID cannot be parsed properly.
- Take into account tenantID for all the supported message formats.
- Support gzip-compressed messages for graphite format.
- Use exponential backoff sleep when the message cannot be pushed to remote storage systems
because of disabled on-disk persistence - https://docs.victoriametrics.com/vmagent.html#disabling-on-disk-persistence
- Unblock from sleep as soon as Stop() is called. Previously the sleep could take up to 2 seconds after Stop() is called.
- Remove unused globalCtx and initContext from app/vmagent/remotewrite/gcppubsub
- Mention Google PubSub support at docs/enterprise.md
- Make Google PubSub docs more clear at docs/vmagent.md
This is a follow-up for commits 115245924a5f096c5a3383d6cc8e8b6fbd421984
and e6eab781ce42285a6a1750dc01eba6801dd35516 .
Updates https://github.com/VictoriaMetrics/VictoriaMetrics-enterprise/pull/717
Updates https://github.com/VictoriaMetrics/VictoriaMetrics-enterprise/pull/713
* app/vmalert: expose `/vmalert/api/v1/rule` and `/api/v1/rule` API which returns rule status in JSON format
* app/vmalert: hide updates if query param not set
* app/vmalert: fix panic (recursion call)
* app/vmalert: add needed group name and file name
* app/vmalert: fix comment, update behavior
* app/vmalert: fix description
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
E.g. replace `fs.Dir + filePath` with `path.Join(fs.Dir, filePath)`
The fs.Dir is guaranteed to end with slash - see Init() functions.
The filePath may start with slash. If it starts with slash, then `fs.Dir + filePath` constructs
an incorrect path with double slashes.
path.Join() properly substitutes duplicate slashes with a single slash in this case.
While at it, also substitute incorrect usage of filepath.Join() with path.Join()
for constructing paths to object storage systems, which expect forward slashes in paths.
filepath.Join() substittues forward slashes with backslashes on Windows, so this may break
creating or managing backups from Windows.
This is a follow-up for 0399367be602b577baf6a872ca81bf0f99ba401b
Updates https://github.com/VictoriaMetrics/VictoriaMetrics-enterprise/pull/719
* lib/backup/s3remote: remove prev object versions for recursive delete
- fix error caused by sending empty objects list to be deleted. This was possible in case old versions of objects where deleted, but root-level entries where still available. This caused paginator to return an empty page which wasn't skipped.
- delete previous versions of objects recursively for S3 remote
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs/changelog: add vmbackupmanager fix entry
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/backup/s3remote: unify path construction for S3 objects
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Previously concurrency for static and fast queries was limited with the -search.maxConcurrentRequests
command-line flag. This could complicate identifying heavy queries via `vmui` at `Top queries` and `Active queries` pages,
since `vmui` and these pages couldn't be opened on overloaded vmselect.
Thanks to @f41gh7 for the idea.
Previously the /service-discovery page didn't show targets dropped because of sharding
( https://docs.victoriametrics.com/vmagent.html#scraping-big-number-of-targets ).
Show also the reason why every target is dropped at /service-discovery page.
This should improve debuging why particular targets are dropped.
While at it, do not remove dropped targets from the list at /service-discovery page
until the total number of targets exceeds the limit passed to -promscrape.maxDroppedTargets .
Previously the list was cleaned up every 10 minutes from the entries, which weren't updated
for the last minute. This could complicate debugging of dropped targets.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5389
* lib/streamaggr: properly reference slice with labels
by limiting slice capacity. It must fix issues with slice modification, in case of append new slice will be allocated, instead of modifying refrenced slice
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5402
* Reduce memory allocations when output_relabel_configs adds new labels to output samples
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* prevent /api/v1 from panic on parsing rows
* add tests for Extract function for v1 and v2 api's
* separate request types in different pools to prevent different objects mixing
* add changelog line
543f218fe9
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Add Try* prefix to functions, which return bool result in order to improve readability and reduce the probability of missing check
for the result returned from these functions.
- Call the adjustSampleValues() only once on input samples. Previously it was called on every attempt to flush data to peristent queue.
- Properly restore the initial state of WriteRequest passed to tryPushWriteRequest() before returning from this function
after unsuccessful push to persistent queue. Previously a part of WriteRequest samples may be lost in such case.
- Add -remoteWrite.dropSamplesOnOverload command-line flag, which can be used for dropping incoming samples instead
of returning 429 Too Many Requests error to the client when -remoteWrite.disableOnDiskQueue is set and the remote storage
cannot keep up with the data ingestion rate.
- Add vmagent_remotewrite_samples_dropped_total metric, which counts the number of dropped samples.
- Add vmagent_remotewrite_push_failures_total metric, which counts the number of unsuccessful attempts to push
data to persistent queue when -remoteWrite.disableOnDiskQueue is set.
- Remove vmagent_remotewrite_aggregation_metrics_dropped_total and vm_promscrape_push_samples_dropped_total metrics,
because they are replaced with vmagent_remotewrite_samples_dropped_total metric.
- Update 'Disabling on-disk persistence' docs at docs/vmagent.md
- Update stale comments in the code
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5088
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2110
* app/vmagent: allow to disabled on-disk queue
Previously, it wasn't possible to build data processing pipeline with a
chain of vmagents. In case when remoteWrite for the last vmagent in the
chain wasn't accessible, it persisted data only when it has enough disk
capacity. If disk queue is full, it started to silently drop ingested
metrics.
New flags allows to disable on-disk persistent and immediatly return an
error if remoteWrite is not accessible anymore. It blocks any writes and
notify client, that data ingestion isn't possible.
Main use case for this feature - use external queue such as kafka for
data persistence.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2110
* adds test, updates readme
* apply review suggestions
* update docs for vmagent
* makes linter happy
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Tests showed that importing a single line with 70MB size takes 5.3GiB
RSS memory for VictoriaMetrics single-node.
In the scenario when user exports and imports data from one VM to another,
it could possibly lead to OOM exception for destination VM.
Importing a single line with 16MB size taks 1.3GiB RSS memory.
Hence, the limit for `import.maxLineLen` was decreased from 100MB to 10MB
to improve reliability of VictoriaMetrics during imports.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The default caching for Go artifacts from actions/setup-go@v4 uses the hash of go.sum file
as a cache key - see https://github.com/actions/cache/blob/main/examples.md#go---modules .
This isn't enough for VictoriaMetrics case, since different makefile actions build different the Go artifacts,
which need to be cached. So embed the action name in the cache key.
This case is possible after the following steps:
1. vmagent successfully performed handshake with the -remoteWrite.url and the remote storage supports zstd-compressed data.
2. remote storage became unavailable or slow to ingest data, vmagent compressed the collected data into blocks with zstd and puts these blocks to persistent queue on disk.
3. vmagent restarts and the remote storage is unavailable during the handshake, then vmagent falls back to Snappy compression.
4. vmagent starts sending zstd-compressed data from persistent queue to the remote storage, while falsely advertizing it sends Snappy-compressed data.
5. The remote storage receives zstd-compressed data and fails unpacking it with Snappy.
The solution is the same as 12cd32fd75, just fall back to zstd decompression if Snappy decompression fails.
Previously the number of memory allocations inside copyTimeseriesShallow() was equal to 1+len(tss)
Reduce this number to 2 by pre-allocating a slice of timeseries structs with len(tss) length.
The `path!="/favicon.ico"` filter has little sense, since there are many other special paths,
which may be filtered out - /metrics, /flags, /health, /ping, /robots.txt, /-/healthy, /-/ready, /reload, etc.
See /lib/httpserver/httpserver.go for more details.
It will be hard or impossible to maintain filters for all these paths, so it is better to drop this filter
in order to simplify queries and improve the consistency of these queries.
The buffered connection could have exceeded the underlying connection
deadline during reading or writing to an internal buffer.
With this change, buffered connection struct additionally checks
for a deadline in Read/Write methods.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
evalRollupFuncNoCache() may return time series with identical labels (aka duplicate series)
when performing queries satisfying all the following conditions:
- It must select time series with multiple metric names. For example, {__name__=~"foo|bar"}
- The series selector must be wrapped into rollup function, which drops metric names. For example, rate({__name__=~"foo|bar"})
- The rollup function must be wrapped into aggregate function, which has no streaming optimization.
For example, quantile(0.9, rate({__name__=~"foo|bar"})
In this case VictoriaMetrics shouldn't return `cannot merge series: duplicate series found` error.
Instead, it should fall back to query execution with disabled cache.
Also properly store the merged results. Previously they were incorrectly stored because of a typo
introduced in the commit 41a0fdaf39
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5332
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5337
`version` label won't show the difference if various flavors of the same
version were deployed. But `short_version` will.
For example, on the sandbox env we test VM builds before new version release.
Without this change, the version update won't be visible on dashboard.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/querytracer: makes package concurrent safe to use
it must fix various issues with concurrent code usage.
Especially, when it's not reasonable to wait for all goroutines to be finished
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- If min_over_time(m[offset] @ timestamp) <= min_over_time(m[offset] @ (timestamp-window)),
then the optimization can be applied.
- If max_over_time(m[offset] @ timestamp) >= max_over_time(m[offset] @ (timestamp-window)),
then the optimization can be applied.
* vmui: reduced the number of server requests
* run `make vmui-update vmui-logs-update`
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This case is possible after the following steps:
1. vmagent tries to perform handshake with the -remoteWrite.url in order to determine whether
the remote storage supports zstd-compressed data.
2. The remote storage is unavailable during the handshake. In this case vmagent falls back to Snappy compression
for the data sent to the remote storage.
3. vmagent compresses the collected data into blocks with Snappy and puts these blocks to persistent queue on disk.
4. The remote storage becomes available.
5. vmagent restarts, performs the handshake with the remote storage and detects that it supports zstd-compressed data.
6. vmagent starts sending Snappy-compressed data from persistent queue to the remote storage,
while falsely advertizing it sends zstd-compressed data.
7. The remote storage receives Snappy-compressed data and fails unpacking it with zstd.
The solution is to just fall back to Snappy decompression if zstd decompression fails.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5301
We have failed test on master branch.
```
--- FAIL: TestFormatLogMessage (0.00s)
logger_test.go:24: unexpected result; got
"foo: abcde, \"foo bar baz\", xx"
want
"foo: a..e, \"f..z\", xx"
```
if failed because maxArgs maxLen <= 4 in the `LimitStringLen` in that case we always will return the income string
but in the test we limit the maxLen by value 4
```
f("foo: %s, %q, %s", []interface{}{"abcde", fmt.Errorf("foo bar baz"), "xx"}, 4, `foo: a..e, "f..z", xx`)
Previously the `Host` header was remained unchanged when passing it in requests to backends.
This may improperly work if the backend uses host-based routing.
While at it, allows http/2.0 requests to backends. While VictoriaMetrics components
do not accept http/2.0 requests, other backends can require such requests.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240
- Re-use identically configured http.Transport across multiple users.
This fixes handling of the limit on the number of connection, which can be established per each backend
via -maxIdleConnsPerBackend command-line flag. This limit stopped working after 323f3720ed
- Add docs about backend TLS setup at https://docs.victoriametrics.com/vmauth.html#backend-tls-setup
- Add ability to disable backend TLS verification for all the users via -backend.tlsInsecureSkipVerify command-line flag.
This flag may be useful when -auth.config contains big number of users, and every user must disable backend TLS verification.
- Add ability to specify TLS Root CA via tls_ca_file option at per-user basis and via -backend.tlsCAFile command-line flag
across all the users.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240
Previously entries which were accessed only 1 time weren't cached.
It has been appeared that some rarely executed heavy queries may read indexdb block twice
in a row instead of once. There is no need in caching such a block then.
This change should eliminate cache size spikes for indexdb/dataBlocks when such heavy queries are executed.
Expose -blockcache.missesBeforeCaching command-line flag, which can be used for fine-tuning
the number of cache misses needed before storing the block in the caching.
* app/vmalert: update remote-write process
* automatically retry remote-write requests on closed connections. The change should reduce the amount of logs produced in environments with short-living connections or environments without support of keep-alive on network balancers.
* increment `vmalert_remotewrite_errors_total` metric if all retries to send remote-write request failed. Before, this metric was incremented only if remote-write client's buffer is overloaded.
* increment `vmalert_remotewrite_dropped_rows_total` amd `vmalert_remotewrite_dropped_bytes_total` metrics if remote-write client's buffer is overloaded. Before, these metrics were incremented only after unsuccessful HTTP calls.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update docs/CHANGELOG.md
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Hui Wang <haley@victoriametrics.com>
reduce the number of queries for restoring alerts state on start-up.
The change should speed up the restore process and reduce pressure on `remoteRead.url`.
* vmauth: add browser authorization request for http requests without credentials to a route that is not in the `unauthorized_user` section (when `unauthorized_user` is specified).
* add link to issue in CHANGELOG
* Extend vmauth docs
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This reduction is based on production testing.
Also expose -search.minWindowForInstantRollupOptimization command-line flag, so users could fine-tune this arg for their needs
vmalert expects string value for stats.seriesFetched, so it is impossible
switching to number without breaking compatibility with old vmalert releases :(
It is still unclear why stats.seriesFetched has string type in the first place...
Repeated instant queries with long lookbehind windows, which contain one of the following rollup functions,
are optimized via partial result caching:
- sum_over_time()
- count_over_time()
- avg_over_time()
- increase()
- rate()
The basic idea of optimization is to calculate
rf(m[d] @ t)
as
rf(m[offset] @ t) + rf(m[d] @ (t-offset)) - rf(m[offset] @ (t-d))
where rf(m[d] @ (t-offset)) is cached query result, which was calculated previously
The offset may be in the range of up to 1 hour.
The SECURITY label should be applied only to changes, which fix security issues.
The change at ad839aa492 adds new command-line flags, which can be used
for improving security in some cases. They do not fix any security issues.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5111
The new metric gets increased each time `-search.logQueryMemoryUsage` memory limit
is exceeded by a query. This metric should help to identify expensive and heavy queries
without inspecting the logs.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The new rule for vmalert supposed to detect groups that miss their
evaulations due to slow queries.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The panel `Errors rate to Alertmanager` had `group` label filter
applied to the expression, while the metric `vmalert_alerts_send_errors_total`
doesn't have that label. This resulted into always empty results.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
fix possible missing firing states for alerting rules in replay mode
Before if one firing stage is bigger than single query request range, like rule with a big `for`, alerting rule won't able to be detected as firing.
Co-authored-by: hagen1778 <roman@victoriametrics.com>
support `Strict-Transport-Security`, `Content-Security-Policy` and `X-Frame-Options`
HTTP headers in all VictoriaMetrics components.
The values for headers can be specified by users via the following flags:
`-http.header.hsts`, `-http.header.csp` and `-http.header.frameOptions`.
Co-authored-by: hagen1778 <roman@victoriametrics.com>
This allows substituting FATAL panics with recoverable runtime errors such as missing or invalid TLS CA file
and/or missing/invalid /var/run/secrets/kubernetes.io/serviceaccount/namespace file.
Now these errors are logged instead of PANIC'ing, so they can be fixed by updating the corresponding files
without the need to restart vmagent.
This is a follow-up for 90427abc65
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5243
Previously url watchers for pod, service and node objects could be mistakenly closed
when service discovery was set up only for endpoints and endpointslice roles,
since watchers for these roles may start start pod, service and node url watchers
with nil apiWatcher passed to groupWatcher.startWatchersForRole().
Now all the url watchers, which belong to a particular groupWatcher, are stopped at once
when this groupWatcher has no apiWatcher subscribers.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5216
The issue has been introduced in v1.93.5 when addressing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
* do not print redundant error logs when failed to scrape consul or nomad target
prometheus performs the same because it uses consul lib which just drops the error(1806bcb38c/api/api.go (L1134))
- Make sure that invalid/missing TLS CA file or TLS client certificate files at vmagent startup
don't prevent from processing the corresponding scrape targets after the file becomes correct,
without the need to restart vmagent.
Previously scrape targets with invalid TLS CA file or TLS client certificate files
were permanently dropped after the first attempt to initialize them, and they didn't
appear until the next vmagent reload or the next change in other places of the loaded scrape configs.
- Make sure that TLS CA is properly re-loaded from file after it changes without the need to restart vmagent.
Previously the old TLS CA was used until vmagent restart.
- Properly handle errors during http request creation for the second attempt to send data to remote system
at vmagent and vmalert. Previously failed request creation could result in nil pointer dereferencing,
since the returned request is nil on error.
- Add more context to the logged error during AWS sigv4 request signing before sending the data to -remoteWrite.url at vmagent.
Previously it could miss details on the source of the request.
- Do not create a new HTTP client per second when generating OAuth2 token needed to put in Authorization header
of every http request issued by vmagent during service discovery or target scraping.
Re-use the HTTP client instead until the corresponding scrape config changes.
- Cache error at lib/promauth.Config.GetAuthHeader() in the same way as the auth header is cached,
e.g. the error is cached for a second now. This should reduce load on CPU and OAuth2 server
when auth header cannot be obtained because of temporary error.
- Share tls.Config.GetClientCertificate function among multiple scrape targets with the same tls_config.
Cache the loaded certificate and the error for one second. This should significantly reduce CPU load
when scraping big number of targets with the same tls_config.
- Allow loading TLS certificates from HTTP and HTTPs urls by specifying these urls at `tls_config->cert_file` and `tls_config->key_file`.
- Improve test coverage at lib/promauth
- Skip unreachable or invalid files specified at `scrape_config_files` during vmagent startup, since these files may become valid later.
Previously vmagent was exitting in this case.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959
It could return either `failed to read` or `failed to parse` errors depending
on whether the given url can be loaded or not under the current environment
c9375cac5e
Descriptions were updated in attempt to make it more clear for readers,
re-phrasing and linking missing docs.
`eval_delay` was added to tests to verify it can be unmarshalled.
`eval_delay` is now applied before timestamp alignment to make it more predictable.
Before, if delay < interval the timestamp won't be aligned.
`eval_delay` and `eval_offset` was added to API output.
`PreviouslySentSeriesToRW` converted to private `previouslySentSeriesToRW`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Using `min_over_time` should reduce the amount of false positives when
component is running in near-the-threshold state. Now it should trigger
only if all collected samples were above the threshold on 10m interval.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Please provide a brief description of the changes you made. Be as specific as possible to help others understand the purpose and impact of your modifications.
### Checklist
The following checks are mandatory:
- [ ] I have read the [Contributing Guidelines](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/CONTRIBUTING.md)
- [ ] All commits are signed and include `Signed-off-by` line. Use `git commit -s` to include `Signed-off-by` your commits. See this [doc](https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-Work) about how to sign your commits.
- [ ] Tests are passing locally. Use `make test` to run all tests locally.
- [ ] Linting is passing locally. Use `make check-all` to run all linters locally.
Further checks are optional for External Contributions:
- [ ] Include a link to the GitHub issue in the commit message, if issue exists.
- [ ] Mention the change in the [Changelog](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/docs/CHANGELOG.md). Explain what has changed and why. If there is a related issue or documentation change - link them as well.
Tips for writing a good changelog message::
* Write a human-readable changelog message that describes the problem and solution.
* Include a link to the issue or pull request in your changelog message.
* Use specific language identifying the fix, such as an error message, metric name, or flag name.
* Provide a link to the relevant documentation for any new features you add or modify.
- [ ] After your pull request is merged, please add a message to the issue with instructions for how to test the fix or try the feature you added. Here is an [example](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4048#issuecomment-1546453726)
- [ ] Do not close the original issue before the change is released. Please note, in some cases Github can automatically close the issue once PR is merged. Re-open the issue in such case.
- [ ] If the change somehow affects public interfaces (a new flag was added or updated, or some behavior has changed) - add the corresponding change to documentation.
Examples of good changelog messages:
1. FEATURE: [vmagent](https://docs.victoriametrics.com/vmagent.html): add support for [VictoriaMetrics remote write protocol](https://docs.victoriametrics.com/vmagent.html#victoriametrics-remote-write-protocol) when [sending / receiving data to / from Kafka](https://docs.victoriametrics.com/vmagent.html#kafka-integration). This protocol allows saving egress network bandwidth costs when sending data from `vmagent` to `Kafka` located in another datacenter or availability zone. See [this feature request](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1225).
2. BUGFIX: [stream aggregation](https://docs.victoriametrics.com/stream-aggregation.html): suppress `series after dedup` error message in logs when `-remoteWrite.streamAggr.dedupInterval` command-line flag is set at [vmagent](https://docs.victoriametrics.com/vmgent.html) or when `-streamAggr.dedupInterval` command-line flag is set at [single-node VictoriaMetrics](https://docs.victoriametrics.com/).
@@ -14,3 +14,8 @@ We are open to third-party pull requests provided they follow [KISS design princ
- Avoid automated decisions, which may hurt cluster availability, consistency or performance.
Adhering `KISS` principle simplifies the resulting code and architecture, so it can be reviewed, understood and verified by many people.
Before sending a pull request please check the following:
- [ ] All commits are signed and include `Signed-off-by` line. Use `git commit -s` to include `Signed-off-by` your commits. See this [doc](https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-Work) about how to sign your commits.
- [ ] Tests are passing locally. Use `make test` to run all tests locally.
- [ ] Linting is passing locally. Use `make check-all` to run all linters locally.
httpListenAddr=flag.String("httpListenAddr",":9428","TCP address to listen for http connections. See also -httpListenAddr.useProxyProtocol")
useProxyProtocol=flag.Bool("httpListenAddr.useProxyProtocol",false,"Whether to use proxy protocol for connections accepted at -httpListenAddr . "+
httpListenAddrs=flagutil.NewArrayString("httpListenAddr","TCP address to listen for incoming http requests. See also -httpListenAddr.useProxyProtocol")
useProxyProtocol=flagutil.NewArrayBool("httpListenAddr.useProxyProtocol","Whether to use proxy protocol for connections accepted at the given -httpListenAddr . "+
"See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt . "+
"With enabled proxy protocol http server cannot serve regular /metrics endpoint. Use -pushmetrics.url for metrics pushing")
gogc=flag.Int("gogc",100,"GOGC to use. See https://tip.golang.org/doc/gc-guide")
)
funcmain(){
@@ -34,27 +32,31 @@ func main() {
flag.CommandLine.SetOutput(os.Stdout)
flag.Usage=usage
envflag.Parse()
cgroup.SetGOGC(*gogc)
buildinfo.Init()
logger.Init()
pushmetrics.Init()
logger.Infof("starting VictoriaLogs at %q...",*httpListenAddr)
listenAddrs:=*httpListenAddrs
iflen(listenAddrs)==0{
listenAddrs=[]string{":9428"}
}
logger.Infof("starting VictoriaLogs at %q...",listenAddrs)
httpListenAddr=flag.String("httpListenAddr",":8428","TCP address to listen for http connections. See also -httpListenAddr.useProxyProtocol")
useProxyProtocol=flag.Bool("httpListenAddr.useProxyProtocol",false,"Whether to use proxy protocol for connections accepted at -httpListenAddr . "+
httpListenAddrs=flagutil.NewArrayString("httpListenAddr","TCP addresses to listen for incoming http requests. See also -tls and -httpListenAddr.useProxyProtocol")
useProxyProtocol=flagutil.NewArrayBool("httpListenAddr.useProxyProtocol","Whether to use proxy protocol for connections accepted at the corresponding -httpListenAddr . "+
"See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt . "+
"With enabled proxy protocol http server cannot serve regular /metrics endpoint. Use -pushmetrics.url for metrics pushing")
minScrapeInterval=flag.Duration("dedup.minScrapeInterval",0,"Leave only the last sample in every time series per each discrete interval "+
"equal to -dedup.minScrapeInterval > 0. See https://docs.victoriametrics.com/#deduplication and https://docs.victoriametrics.com/#downsampling")
"equal to -dedup.minScrapeInterval > 0. See also -streamAggr.dedupInterval and https://docs.victoriametrics.com/#deduplication")
dryRun=flag.Bool("dryRun",false,"Whether to check config files without running VictoriaMetrics. The following config files are checked: "+
"-promscrape.config, -relabelConfig and -streamAggr.config. Unknown config entries aren't allowed in -promscrape.config by default. "+
"This can be changed with -promscrape.config.strictParse=false command-line flag")
@@ -48,7 +48,6 @@ func main() {
envflag.Parse()
buildinfo.Init()
logger.Init()
pushmetrics.Init()
ifpromscrape.IsDryRun(){
*dryRun=true
@@ -67,30 +66,37 @@ func main() {
return
}
logger.Infof("starting VictoriaMetrics at %q...",*httpListenAddr)
listenAddrs:=*httpListenAddrs
iflen(listenAddrs)==0{
listenAddrs=[]string{":8428"}
}
logger.Infof("starting VictoriaMetrics at %q...",listenAddrs)
t.Fatalf("not expected number of csv lines want=%d\ngot=%d test=%s.%s\n\response=%q",len(data),test.ExpectedResultLinesCount,q,test.Issue,strings.Join(data,"\n"))
<!doctype html><htmllang="en"><head><metacharset="utf-8"/><linkrel="icon"href="./favicon.ico"/><metaname="viewport"content="width=device-width,initial-scale=1,maximum-scale=5"/><metaname="theme-color"content="#000000"/><metaname="description"content="UI for VictoriaMetrics"/><linkrel="apple-touch-icon"href="./apple-touch-icon.png"/><linkrel="icon"type="image/png"sizes="32x32"href="./favicon-32x32.png"><linkrel="manifest"href="./manifest.json"/><title>VM UI</title><scriptsrc="./dashboards/index.js"type="module"></script><metaname="twitter:card"content="summary_large_image"><metaname="twitter:image"content="./preview.jpg"><metaname="twitter:title"content="UI for VictoriaMetrics"><metaname="twitter:description"content="Explore and troubleshoot your VictoriaMetrics data"><metaname="twitter:site"content="@VictoriaMetrics"><metaproperty="og:title"content="Metric explorer for VictoriaMetrics"><metaproperty="og:description"content="Explore and troubleshoot your VictoriaMetrics data"><metaproperty="og:image"content="./preview.jpg"><metaproperty="og:type"content="website"><scriptdefer="defer"src="./static/js/main.02178f4b.js"></script><linkhref="./static/css/main.9a224445.css"rel="stylesheet"></head><body><noscript>You need to enable JavaScript to run this app.</noscript><divid="root"></div></body></html>
<!doctype html><htmllang="en"><head><metacharset="utf-8"/><linkrel="icon"href="./favicon.ico"/><metaname="viewport"content="width=device-width,initial-scale=1,maximum-scale=5"/><metaname="theme-color"content="#000000"/><metaname="description"content="UI for VictoriaMetrics"/><linkrel="apple-touch-icon"href="./apple-touch-icon.png"/><linkrel="icon"type="image/png"sizes="32x32"href="./favicon-32x32.png"><linkrel="manifest"href="./manifest.json"/><title>VM UI</title><scriptsrc="./dashboards/index.js"type="module"></script><metaname="twitter:card"content="summary_large_image"><metaname="twitter:image"content="./preview.jpg"><metaname="twitter:title"content="UI for VictoriaMetrics"><metaname="twitter:description"content="Explore and troubleshoot your VictoriaMetrics data"><metaname="twitter:site"content="@VictoriaMetrics"><metaproperty="og:title"content="Metric explorer for VictoriaMetrics"><metaproperty="og:description"content="Explore and troubleshoot your VictoriaMetrics data"><metaproperty="og:image"content="./preview.jpg"><metaproperty="og:type"content="website"><scriptdefer="defer"src="./static/js/main.034044a7.js"></script><linkhref="./static/css/main.bc07cc78.css"rel="stylesheet"></head><body><noscript>You need to enable JavaScript to run this app.</noscript><divid="root"></div></body></html>
httpListenAddr=flag.String("httpListenAddr",":8429","TCP address to listen for http connections. "+
httpListenAddrs=flagutil.NewArrayString("httpListenAddr","TCP address to listen for incoming http requests. "+
"Set this flag to empty value in order to disable listening on any port. This mode may be useful for running multiple vmagent instances on the same server. "+
"Note that /targets and /metrics pages aren't available if -httpListenAddr=''. See also -httpListenAddr.useProxyProtocol")
useProxyProtocol=flag.Bool("httpListenAddr.useProxyProtocol",false,"Whether to use proxy protocol for connections accepted at -httpListenAddr . "+
"Note that /targets and /metrics pages aren't available if -httpListenAddr=''. See also -tls and -httpListenAddr.useProxyProtocol")
useProxyProtocol=flagutil.NewArrayBool("httpListenAddr.useProxyProtocol","Whether to use proxy protocol for connections accepted at the corresponding -httpListenAddr . "+
"See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt . "+
"With enabled proxy protocol http server cannot serve regular /metrics endpoint. Use -pushmetrics.url for metrics pushing")
influxListenAddr=flag.String("influxListenAddr","","TCP and UDP address to listen for InfluxDB line protocol data. Usually :8089 must be set. Doesn't work if empty. "+
@@ -69,7 +70,8 @@ var (
"See also -opentsdbHTTPListenAddr.useProxyProtocol")
opentsdbHTTPUseProxyProtocol=flag.Bool("opentsdbHTTPListenAddr.useProxyProtocol",false,"Whether to use proxy protocol for connections accepted "+
"at -opentsdbHTTPListenAddr . See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt")
configAuthKey=flag.String("configAuthKey","","Authorization key for accessing /config page. It must be passed via authKey query arg")
configAuthKey=flagutil.NewPassword("configAuthKey","Authorization key for accessing /config page. It must be passed via authKey query arg")
reloadAuthKey=flagutil.NewPassword("reloadAuthKey","Auth key for /-/reload http endpoint. It must be passed as authKey=...")
dryRun=flag.Bool("dryRun",false,"Whether to check config files without running vmagent. The following files are checked: "+
"Unknown config entries aren't allowed in -promscrape.config by default. This can be changed by passing -promscrape.config.strictParse=false command-line flag")
@@ -96,7 +98,6 @@ func main() {
remotewrite.InitSecretFlags()
buildinfo.Init()
logger.Init()
pushmetrics.Init()
ifpromscrape.IsDryRun(){
iferr:=promscrape.CheckConfig();err!=nil{
@@ -119,13 +120,18 @@ func main() {
return
}
logger.Infof("starting vmagent at %q...",*httpListenAddr)
listenAddrs:=*httpListenAddrs
iflen(listenAddrs)==0{
listenAddrs=[]string{":8429"}
}
logger.Infof("starting vmagent at %q...",listenAddrs)
rateLimit=flagutil.NewArrayInt("remoteWrite.rateLimit",0,"Optional rate limit in bytes per second for data sent to the corresponding -remoteWrite.url. "+
"By default, the rate limit is disabled. It can be useful for limiting load on remote storage when big amounts of buffered data "+
"is sent after temporary unavailability of the remote storage")
"is sent after temporary unavailability of the remote storage. See also -maxIngestionRate")
sendTimeout=flagutil.NewArrayDuration("remoteWrite.sendTimeout",time.Minute,"Timeout for sending a single block of data to the corresponding -remoteWrite.url")
proxyURL=flagutil.NewArrayString("remoteWrite.proxyURL","Optional proxy URL for writing data to the corresponding -remoteWrite.url. "+
tlsHandshakeTimeout=flagutil.NewArrayDuration("remoteWrite.tlsHandshakeTimeout",20*time.Second,"The timeout for estabilishing tls connections to the corresponding -remoteWrite.url")
tlsInsecureSkipVerify=flagutil.NewArrayBool("remoteWrite.tlsInsecureSkipVerify","Whether to skip tls verification when connecting to the corresponding -remoteWrite.url")
tlsCertFile=flagutil.NewArrayString("remoteWrite.tlsCertFile","Optional path to client-side TLS certificate file to use when connecting "+
"to the corresponding -remoteWrite.url")
@@ -58,8 +61,10 @@ var (
oauth2ClientID=flagutil.NewArrayString("remoteWrite.oauth2.clientID","Optional OAuth2 clientID to use for the corresponding -remoteWrite.url")
oauth2ClientSecret=flagutil.NewArrayString("remoteWrite.oauth2.clientSecret","Optional OAuth2 clientSecret to use for the corresponding -remoteWrite.url")
oauth2ClientSecretFile=flagutil.NewArrayString("remoteWrite.oauth2.clientSecretFile","Optional OAuth2 clientSecretFile to use for the corresponding -remoteWrite.url")
oauth2TokenURL=flagutil.NewArrayString("remoteWrite.oauth2.tokenUrl","Optional OAuth2 tokenURL to use for the corresponding -remoteWrite.url")
oauth2Scopes=flagutil.NewArrayString("remoteWrite.oauth2.scopes","Optional OAuth2 scopes to use for the corresponding -remoteWrite.url. Scopes must be delimited by ';'")
oauth2EndpointParams=flagutil.NewArrayString("remoteWrite.oauth2.endpointParams","Optional OAuth2 endpoint parameters to use for the corresponding -remoteWrite.url . "+
`The endpoint parameters must be set in JSON format: {"param1":"value1",...,"paramN":"valueN"}`)
oauth2TokenURL=flagutil.NewArrayString("remoteWrite.oauth2.tokenUrl","Optional OAuth2 tokenURL to use for the corresponding -remoteWrite.url")
oauth2Scopes=flagutil.NewArrayString("remoteWrite.oauth2.scopes","Optional OAuth2 scopes to use for the corresponding -remoteWrite.url. Scopes must be delimited by ';'")
awsUseSigv4=flagutil.NewArrayBool("remoteWrite.aws.useSigv4","Enables SigV4 request signing for the corresponding -remoteWrite.url. "+
"It is expected that other -remoteWrite.aws.* command-line flags are set if sigv4 request signing is enabled")
remoteWriteURLs=flagutil.NewArrayString("remoteWrite.url","Remote storage URL to write data to. It must support either VictoriaMetrics remote write protocol "+
"or Prometheus remote_write protocol. Example url: http://<victoriametrics-host>:8428/api/v1/write . "+
"Pass multiple -remoteWrite.url options in order to replicate the collected data to multiple remote storage systems. "+
"The data can be sharded among the configured remote storage systems if -remoteWrite.shardByURL flag is set. "+
"See also -remoteWrite.multitenantURL")
"The data can be sharded among the configured remote storage systems if -remoteWrite.shardByURL flag is set")
remoteWriteMultitenantURLs=flagutil.NewArrayString("remoteWrite.multitenantURL","Base path for multitenant remote storage URL to write data to. "+
"See https://docs.victoriametrics.com/vmagent.html#multitenancy for details. Example url: http://<vminsert>:8480 . "+
"Pass multiple -remoteWrite.multitenantURL flags in order to replicate data to multiple remote storage systems. See also -remoteWrite.url")
"Pass multiple -remoteWrite.multitenantURL flags in order to replicate data to multiple remote storage systems. "+
"This flag is deprecated in favor of -enableMultitenantHandlers . See https://docs.victoriametrics.com/vmagent.html#multitenancy")
enableMultitenantHandlers=flag.Bool("enableMultitenantHandlers",false,"Whether to process incoming data via multitenant insert handlers according to "+
"https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#url-format . By default incoming data is processed via single-node insert handlers "+
"according to https://docs.victoriametrics.com/#how-to-import-time-series-data ."+
"See https://docs.victoriametrics.com/vmagent.html#multitenancy for details")
shardByURL=flag.Bool("remoteWrite.shardByURL",false,"Whether to shard outgoing series across all the remote storage systems enumerated via -remoteWrite.url . "+
"By default the data is replicated across all the -remoteWrite.url . See https://docs.victoriametrics.com/vmagent.html#sharding-among-remote-storages")
tmpDataPath=flag.String("remoteWrite.tmpDataPath","vmagent-remotewrite-data","Path to directory where temporary data for remote write component is stored. "+
"See also -remoteWrite.maxDiskUsagePerURL")
shardByURLLabels=flagutil.NewArrayString("remoteWrite.shardByURL.labels","Optional list of labels, which must be used for sharding outgoing samples "+
"among remote storage systems if -remoteWrite.shardByURL command-line flag is set. By default all the labels are used for sharding in order to gain "+
"even distribution of series over the specified -remoteWrite.url systems. See also -remoteWrite.shardByURL.ignoreLabels")
shardByURLIgnoreLabels=flagutil.NewArrayString("remoteWrite.shardByURL.ignoreLabels","Optional list of labels, which must be ignored when sharding outgoing samples "+
"among remote storage systems if -remoteWrite.shardByURL command-line flag is set. By default all the labels are used for sharding in order to gain "+
"even distribution of series over the specified -remoteWrite.url systems. See also -remoteWrite.shardByURL.labels")
tmpDataPath=flag.String("remoteWrite.tmpDataPath","vmagent-remotewrite-data","Path to directory for storing pending data, which isn't sent to the configured -remoteWrite.url . "+
"See also -remoteWrite.maxDiskUsagePerURL and -remoteWrite.disableOnDiskQueue")
keepDanglingQueues=flag.Bool("remoteWrite.keepDanglingQueues",false,"Keep persistent queues contents at -remoteWrite.tmpDataPath in case there are no matching -remoteWrite.url. "+
"Useful when -remoteWrite.url is changed temporarily and persistent queue files will be needed later on.")
queues=flag.Int("remoteWrite.queues",cgroup.AvailableCPUs()*2,"The number of concurrent queues to each -remoteWrite.url. Set more queues if default number of queues "+
"isn't enough for sending high volume of collected data to remote storage. Default value is 2 * numberOfAvailableCPUs")
"isn't enough for sending high volume of collected data to remote storage. "+
"Default value depends on the number of available CPU cores. It should work fine in most cases since it minimizes resource usage")
showRemoteWriteURL=flag.Bool("remoteWrite.showURL",false,"Whether to show -remoteWrite.url in the exported metrics. "+
"It is hidden by default, since it can contain sensitive info such as auth key")
maxPendingBytesPerURL=flagutil.NewArrayBytes("remoteWrite.maxDiskUsagePerURL",0,"The maximum file-based buffer size in bytes at -remoteWrite.tmpDataPath "+
@@ -68,6 +84,8 @@ var (
"Excess series are logged and dropped. This can be useful for limiting series cardinality. See https://docs.victoriametrics.com/vmagent.html#cardinality-limiter")
maxDailySeries=flag.Int("remoteWrite.maxDailySeries",0,"The maximum number of unique series vmagent can send to remote storage systems during the last 24 hours. "+
"Excess series are logged and dropped. This can be useful for limiting series churn rate. See https://docs.victoriametrics.com/vmagent.html#cardinality-limiter")
maxIngestionRate=flag.Int("maxIngestionRate",0,"The maximum number of samples vmagent can receive per second. Data ingestion is paused when the limit is exceeded. "+
"By default there are no limits on samples ingestion rate. See also -remoteWrite.rateLimit")
streamAggrConfig=flagutil.NewArrayString("remoteWrite.streamAggr.config","Optional path to file with stream aggregation config. "+
"See https://docs.victoriametrics.com/stream-aggregation.html . "+
@@ -78,8 +96,18 @@ var (
streamAggrDropInput=flagutil.NewArrayBool("remoteWrite.streamAggr.dropInput","Whether to drop all the input samples after the aggregation "+
"with -remoteWrite.streamAggr.config. By default, only aggregates samples are dropped, while the remaining samples "+
"are written to the corresponding -remoteWrite.url . See also -remoteWrite.streamAggr.keepInput and https://docs.victoriametrics.com/stream-aggregation.html")
streamAggrDedupInterval=flagutil.NewArrayDuration("remoteWrite.streamAggr.dedupInterval",0,"Input samples are de-duplicated with this interval before being aggregated. "+
"Only the last sample per each time series per each interval is aggregated if the interval is greater than zero")
streamAggrDedupInterval=flagutil.NewArrayDuration("remoteWrite.streamAggr.dedupInterval",0,"Input samples are de-duplicated with this interval before optional aggregation "+
"with -remoteWrite.streamAggr.config . See also -dedup.minScrapeInterval and https://docs.victoriametrics.com/stream-aggregation.html#deduplication")
streamAggrIgnoreOldSamples=flagutil.NewArrayBool("remoteWrite.streamAggr.ignoreOldSamples","Whether to ignore input samples with old timestamps outside the current aggregation interval "+
"for the corresponding -remoteWrite.streamAggr.config . See https://docs.victoriametrics.com/stream-aggregation.html#ignoring-old-samples")
streamAggrDropInputLabels=flagutil.NewArrayString("streamAggr.dropInputLabels","An optional list of labels to drop from samples "+
"before stream de-duplication and aggregation . See https://docs.victoriametrics.com/stream-aggregation.html#dropping-unneeded-labels")
disableOnDiskQueue=flag.Bool("remoteWrite.disableOnDiskQueue",false,"Whether to disable storing pending data to -remoteWrite.tmpDataPath "+
"when the configured remote storage systems cannot keep up with the data ingestion rate. See https://docs.victoriametrics.com/vmagent.html#disabling-on-disk-persistence ."+
"See also -remoteWrite.dropSamplesOnOverload")
dropSamplesOnOverload=flag.Bool("remoteWrite.dropSamplesOnOverload",false,"Whether to drop samples when -remoteWrite.disableOnDiskQueue is set and if the samples "+
"cannot be pushed into the configured remote storage systems in a timely manner. See https://docs.victoriametrics.com/vmagent.html#disabling-on-disk-persistence")
)
var(
@@ -92,11 +120,19 @@ var (
// Data without tenant id is written to defaultAuthToken if -remoteWrite.multitenantURL is specified.
defaultAuthToken=&auth.Token{}
// ErrQueueFullHTTPRetry must be returned when TryPush() returns false.
logger.Warnf("rounding the -remoteWrite.maxDiskUsagePerURL=%d to the minimum supported value: %d",maxPendingBytes,persistentqueue.DefaultChunkFileSize)
vmalert-tool unittest is compatible with [Prometheus config format for tests](https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/#test-file-format)
except `promql_expr_test` field. Use `metricsql_expr_test` field name instead. The name is different because vmalert-tool
validates and executes [MetricsQL](https://docs.victoriametrics.com/MetricsQL.html) expressions,
which aren't always backward compatible with [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/).
### Test file format
The configuration format for files specified in `--files` cmd-line flag is the following:
```yaml
# Path to the files or http url containing [rule groups](https://docs.victoriametrics.com/vmalert.html#groups) configuration.
# Enterprise version of vmalert-tool supports S3 and GCS paths to rules.
rule_files:
[- <string> ]
# The evaluation interval for rules specified in `rule_files`
[ evaluation_interval:<duration> | default = 1m ]
# Groups listed below will be evaluated by order.
# Not All the groups need not be mentioned, if not, they will be evaluated by define order in rule_files.
group_eval_order:
[- <string> ]
# The list of unit test files to be checked during evaluation.
tests:
[- <test_group> ]
```
#### `<test_group>`
```yaml
# Interval between samples for input series
interval:<duration>
# Time series to persist into the database according to configured <interval> before running tests.
input_series:
[- <series> ]
# Name of the test group, optional
[ name:<string> ]
# Unit tests for alerting rules
alert_rule_test:
[- <alert_test_case> ]
# Unit tests for Metricsql expressions.
metricsql_expr_test:
[- <metricsql_expr_test> ]
# External labels accessible for templating.
external_labels:
[ <labelname>:<string> ... ]
```
#### `<series>`
```yaml
# series in the following format '<metric name>{<label name>=<label value>, ...}'
tlsCAFile=flag.String("datasource.tlsCAFile","",`Optional path to TLS CA file to use for verifying connections to -datasource.url. By default, system CA is used`)
tlsServerName=flag.String("datasource.tlsServerName","",`Optional TLS server name to use for connections to -datasource.url. By default, the server name from -datasource.url is used`)
oauth2ClientID=flag.String("datasource.oauth2.clientID","","Optional OAuth2 clientID to use for -datasource.url. ")
oauth2ClientSecret=flag.String("datasource.oauth2.clientSecret","","Optional OAuth2 clientSecret to use for -datasource.url.")
oauth2ClientSecretFile=flag.String("datasource.oauth2.clientSecretFile","","Optional OAuth2 clientSecretFile to use for -datasource.url. ")
oauth2TokenURL=flag.String("datasource.oauth2.tokenUrl","","Optional OAuth2 tokenURL to use for -datasource.url.")
oauth2Scopes=flag.String("datasource.oauth2.scopes","","Optional OAuth2 scopes to use for -datasource.url. Scopes must be delimited by ';'")
oauth2ClientID=flag.String("datasource.oauth2.clientID","","Optional OAuth2 clientID to use for -datasource.url")
oauth2ClientSecret=flag.String("datasource.oauth2.clientSecret","","Optional OAuth2 clientSecret to use for -datasource.url")
oauth2ClientSecretFile=flag.String("datasource.oauth2.clientSecretFile","","Optional OAuth2 clientSecretFile to use for -datasource.url")
oauth2EndpointParams=flag.String("datasource.oauth2.endpointParams","","Optional OAuth2 endpoint parameters to use for -datasource.url . "+
`The endpoint parameters must be set in JSON format: {"param1":"value1",...,"paramN":"valueN"}`)
oauth2TokenURL=flag.String("datasource.oauth2.tokenUrl","","Optional OAuth2 tokenURL to use for -datasource.url")
oauth2Scopes=flag.String("datasource.oauth2.scopes","","Optional OAuth2 scopes to use for -datasource.url. Scopes must be delimited by ';'")
lookBack=flag.Duration("datasource.lookback",0,`Lookback defines how far into the past to look when evaluating queries. For example, if the datasource.lookback=5m then param "time" with value now()-5m will be added to every query.`)
lookBack=flag.Duration("datasource.lookback",0,`Deprecated: please adjust "-search.latencyOffset" at datasource side `+
`or specify "latency_offset" in rule group's params. Lookback defines how far into the past to look when evaluating queries. `+
`For example, if the datasource.lookback=5m then param "time" with value now()-5m will be added to every query.`)
queryStep=flag.Duration("datasource.queryStep",5*time.Minute,"How far a value can fallback to when evaluating queries. "+
"For example, if -datasource.queryStep=15s then param \"step\" with value \"15s\" will be added to every query. "+
"If set to 0, rule's evaluation interval will be used instead.")
logger.Warnf("flag `datasource.queryTimeAlignment` is deprecated and will be removed in next releases, please use `eval_alignment` in rule group instead")
logger.Warnf("flag `-datasource.queryTimeAlignment` is deprecated and will be removed in next releases. Please use `eval_alignment` in rule group instead.")
}
if*lookBack!=0{
logger.Warnf("flag `-datasource.lookback` is deprecated and will be removed in next releases. Please adjust `-search.latencyOffset` at datasource side or specify `latency_offset` in rule group's params. See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5155 for details.")
@@ -47,8 +47,8 @@ all files with prefix rule_ in folder dir.
See https://docs.victoriametrics.com/vmalert.html#reading-rules-from-object-storage
`)
ruleTemplatesPath=flagutil.NewArrayString("rule.templates",`Path or glob pattern to location with go template definitions
for rules annotations templating. Flag can be specified multiple times.
ruleTemplatesPath=flagutil.NewArrayString("rule.templates",`Path or glob pattern to location with go template definitions`+
`for rules annotations templating. Flag can be specified multiple times.
Examples:
-rule.templates="/path/to/file". Path to a single file with go templates
-rule.templates="dir/*.tpl" -rule.templates="/*.tpl". Relative path to all .tpl files in "dir" folder,
@@ -59,8 +59,8 @@ absolute path to all .tpl files in root.
configCheckInterval=flag.Duration("configCheckInterval",0,"Interval for checking for changes in '-rule' or '-notifier.config' files. "+
"By default, the checking is disabled. Send SIGHUP signal in order to force config check for changes.")
httpListenAddr=flag.String("httpListenAddr",":8880","Address to listen for http connections. See also -httpListenAddr.useProxyProtocol")
useProxyProtocol=flag.Bool("httpListenAddr.useProxyProtocol",false,"Whether to use proxy protocol for connections accepted at -httpListenAddr . "+
httpListenAddrs=flagutil.NewArrayString("httpListenAddr","Address to listen for incoming http requests. See also -tls and -httpListenAddr.useProxyProtocol")
useProxyProtocol=flagutil.NewArrayBool("httpListenAddr.useProxyProtocol","Whether to use proxy protocol for connections accepted at the corresponding -httpListenAddr . "+
"See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt . "+
"With enabled proxy protocol http server cannot serve regular /metrics endpoint. Use -pushmetrics.url for metrics pushing")
evaluationInterval=flag.Duration("evaluationInterval",time.Minute,"How often to evaluate the rules")
@@ -96,7 +96,6 @@ func main() {
notifier.InitSecretFlags()
buildinfo.Init()
logger.Init()
pushmetrics.Init()
if!*remoteReadIgnoreRestoreErrors{
logger.Warnf("flag `remoteRead.ignoreRestoreErrors` is deprecated and will be removed in next releases.")
// try to extract optional scheme and alertsPath from __address__.
ifstrings.HasPrefix(address,"http://"){
scheme="http"
address=address[len("http://"):]
}elseifstrings.HasPrefix(address,"https://"){
scheme="https"
address=address[len("https://"):]
}
ifn:=strings.IndexByte(address,'/');n>=0{
alertsPath=address[n:]
address=address[:n]
}
m.Add("__address__",address)
m.Add("__scheme__",scheme)
m.Add("__alerts_path__",alertsPath)
m.AddFrom(metaLabels)
returnm
}
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.