_stream field can be empty for the recently ingested rows because respective entry in indexdb is not yet searchable as it haven't been flushed to storage yet.
This change just skips such items in the output response to make it more consistent.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6042
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmauth: do not increment backend_errors when hitting concurrency limit
Previously, both "vmauth_concurrent_requests_limit_reached_total" and "vmauth_user_request_backend_errors_total" were incremented.
This was based on the assumption that if concurrency limit is hit the backend must be failing to handle the request thus meaning an error.
This assumption does not work in case the endpoint can be overloaded by the misbehaving client sending too many requests within the timeframe.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5565
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
This should improve debuggability of unexpected deletion of directories inside partitions.
While at it, log the proper path to parts.json when the directory for big part is missing in the partition.
parts.json is located inside directory with small parts, and there is no parts.json file inside directory with big parts.
* docs/cluster: update cluster resizing info
- add example of resources distribution
- add info about how to handle uneven disk usage after adding a new storage node
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Update docs/Cluster-VictoriaMetrics.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* Update docs/Cluster-VictoriaMetrics.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* vmui: fix parsing of fractional values
* vmui/vmanomaly: update display logic to align with vmanomaly /query_range API
* vmui/vmanomaly: rename flag anomalyView to isAnomalyView
- Automatically reload changed TLS root CA pointed by -remoteWrite.tlsCAFile command-line flag
- Automatically reload changed TLS root CA configured via oauth2.tsl_config.ca_file option at -promscrape.config
- Document the change as a feature instead of a bug at docs/CHANGELOG.md
- Simplify the code at lib/promauth, which is responsible for reloading changed TLS root CA files.
- Simplify the usage of lib/promauth.Config.NewRoundTripper() - now it accepts the base http.Transport
instead of a callback, which can change the internal http.Transport.
- Reuse the default tls config if lib/promauth.Config doesn't contain tls-specific configs.
This should reduce memory usage a bit when tls isn't used for scraping big number of targets.
- Do not re-read TLS root CA files on every processed request. Re-read them once per second.
This should reduce CPU usage when scraping big number of targets over https.
- Do not store cert.pem and key.pem files in TestTLSConfigWithCertificatesFilesUpdate, since they can be loaded
from byte slices via crypto/tls.X509KeyPair().
- Remove obsolete comparisons of string representations for authConfig and proxyAuthConfig at areEqualScrapeConfigs().
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5725
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5526
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2171
* lib/{promauth,promscrape}: automatically refresh root CA certificates after changes on disk
Added a custom `http.RoundTripper` implementation which checks for root CA content changes and updates `tls.Config` used by `http.RoundTripper` after detecting CA change.
Client certificate changes are not tracked by this implementation since `tls.Config` already supports passing certificate dynamically by overriding `tls.Config.GetClientCertificate`.
This change implements dynamic reload of root CA only for streaming client used for scraping. Blocking client (`fasthttp.HostClient`) does not support using custom transport so can't use this implementation.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5526
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promauth/config: update NewRoundTripper API
Update API to allow user to update only parameters required for transport.
Add warning log when reloading Root CA failed.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promauth/config: fix mutex acquire logic
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promauth/config: replace RWMutex with regular mutex to simplify the code
- remove additional mutex used for getRootCABytes - require callee to use mutex
- replace RWMutex with regular mutex
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promauth/config: refactor
- hold the mutex lock to avoid round tripper being re-created twice
- move recreation logic into separate func to simplify the code
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
- Rename -opentelemetry.sanitizeMetrics command-line flag to more clear -opentelemetry.usePrometheusNaming
- Clarify the description of the change at docs/CHANGELOG.md
- Rename promrelabel.SanitizeLabelNameParts to more clear promrelabel.SplitMetricNameToTokens
- Properly split metric names at '_' char in promerlabel.SplitMetricNameToTokens.
- Add tests for various edge cases for Prometheus metric names' normalization
according to the code at b865505850/pkg/translator/prometheus/normalize_name.go
- Extract the code responsible for Prometheus metric names' normalization into a separate file (santize.go)
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6037
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6035
- Make the configuration more clear by accepting the list of ignored labels during sharding
via a dedicated command-line flag - -remoteWrite.shardByURL.ignoreLabels.
This prevents from overloading the meaning of -remoteWrite.shardByURL.labels command-line flag.
- Removed superfluous memory allocation per each processed sample if sharding by remote storage is enabled.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5938
Remove description for -search.maxExportDuration and -search.maxStatusRequestDuration command-line flags
from the 'Resource usage limits' chapter, since these flags are rarely used for limiting resource usage
and they are already documented in the 'List of command-line flags' chapter.
- Fix docs for new functions at app/vmselect/graphite/functions.json
- Properly drain series lists on errors in aggregateSeriesListsGeneric() and aggregateSeriesList()
- Add links to docs for the added functions at docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5809
- Allow specifying only a single HTTP header for reading auth tokens via -httpAuthHeader command-line flag.
This is better from security PoV, since this prevents from accidental reading of auth token from undesired
HTTP header. By default the -httpAuthHeader equals to Authorization. When it is overridden, then
auth token isn't read from Authorization header - it is read only from the specified header.
- Document the -httpAuthHeader command-line flag at https://docs.victoriametrics.com/vmauth/#reading-auth-tokens-from-other-http-headers
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6009
* docs/vmanomaly: v1.12.0 & link updates
* add autotuned description to model section
* - update refs of vmanomaly on enterprise and vmalert pages
- add diagrams for model types
- update self-monitoring section
* - fix typos
- remove .index.html from links
This reverts commit cb23685681.
Reason for revert: the "fix" may hide programming bugs related to incorrect creation of folders
before their use. This may complicate detecting and fixing such bugs in the future.
There are the following fixes for the issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5985 :
- To configure the OS to do not drop data from the system-wide temporary directory (aka /tmp).
- To run VictoriaMetrics with -cacheDataPath command-line flag, which points to the directory,
which cannot be removed automatically by the OS.
The case when the user accidentally deletes the directory with some files created by VictoriaMetrics
shouldn't be considered as expected, so VictoriaMetrics shouldn't try resolving this case automatically.
It is much better from operation and debuggability PoV is to crash with the clear `directory doesn't exist` error
in this case.
The remotewrite.Stop() expects that there are no pending calls to TryPush().
This means that the ingestionRateLimiter.Register() must be unblocked inside TryPush() when calling remotewrite.Stop().
Provide remotewrite.StopIngestionRateLimiter() function for unblocking the rate limiter before calling the remotewrite.Stop().
While at it, move the rate limiter into lib/ratelimiter package, since it has two users.
Also move the description of the feature to the correct place at docs/CHANGELOG.md.
Also cross-reference -remoteWrite.rateLimit and -maxIngestionRate command-line flags.
This is a follow-up for 02bccd1eb9
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5900
Custom HTTP headers are set via net/http.Header.Set or net/http.Header.Add functions.
These functions always convert header keys to canonical form. So the change at b577413d3b
isn't visible to users of VictoriaMetrics components.
There is no need in documenting this change at docs/CHANGELOG.md, since it doesn't give any useful information to users.
This is a follow-up for e6dd52b04c
* lib/storage: add ability to use downsampling for the given series filter
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs: add information about downsampling filters
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs: fix MetricsQL filter
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage/downsampling: treat missing downsampling filter as a bug
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage/part_header: verify correctness of downsampling filters when opening partition
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage/downsampling: save only appliable rules in part metadata
Filter and save only rules which are appliable to partition based on MinTimestamp of stored data.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage/downsampling: update log messages for final dedup
Properly specify a reason of re-running deduplication for partition.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage: consistently use MaxTimestamp to determine deduplication/downsampling rules
Using MinTimestamp leads to applying downsampling to parts which are only partially covered by downsampling rule.
For example, partition covers range [1000-2000]. At t=2100 and rule offset 500 data with t=2100-500 => 1600 must be downsampled. The range check against MinTimestamp evaluates to true even though partition contains range which must not be downsampled - [1600:2000].
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Follow-up
- Apply the first matching downsampling period if multiple filters match the given time series.
This allows fine-tuning the downsampling config for the specific needs.
- Take into account downsampling filters during search queries.
- Reduce the difference between community and enterprise branches. This should simplify further maintenance of these branches.
- Properly parse series filters with colons inside them.
- Document the feature at docs/CHANGELOG.md.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4960
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/storage: adds metrics for downsampling
vm_downsampling_partitions_scheduled - shows the number of parts, that must be downsampled
vm_downsampling_partitions_scheduled_size_bytes - shows total size in bytes for parts, the must be donwsampled
These two metrics answer the questions - is downsampling running? how many parts scheduled for downsampling and how many of them currently downsampled? Storage space that it occupies.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2612
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* rm recommendation to keep look-behind window empty, as it is not correct
* mention the change of default value for `-search.latencyOffset`
Signed-off-by: hagen1778 <roman@victoriametrics.com>
During shutdown period of vmalert, remotewrite client retrieve all pending time series from buffer queue, compose them into 1 batch and execute remote write.
This final batch may exceed the limit of -remoteWrite.maxBatchSize, and be rejected by the receiver (gateway, vmcluster or others).
This changes ensures that even during shutdown vmalert won't exceed the max batch size limit for remote write
destination.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6025
To horizontally scale streaming aggregation, you might want to deploy a separate hashing tier
of vmagents that route to a separate aggregation tier. The hashing tier should shard by all labels
except the instance-level labels, to ensure the input metrics are routed correctly to the aggregator
instance responsible for those labels.
For this to achieve we introduce `remoteWrite.shardByURL.inverseLabels` flag to inverse logic of `remoteWrite.shardByURL.labels`
---------
Co-authored-by: Eugene Ma <eugene.ma@airbnb.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* vmalert: fix sending alert messages
1. fix `endsAt` field in messages that send to alertmanager, previously rule with small interval could never be triggered;
2. fix behavior of `-rule.resendDelay`, before it could prevent sending firing message when rule state is volatile.
* docs: update changelog notes
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
Store the deadline when the metricID entries must be deleted from indexdb
if metricID->metricName entry isn't found after the deadline. This should
make the code more clear comparing the the previous version, where the timestamp
of the first metricID->metricName lookup miss was stored in missingMetricIDs.
Remove the misleading comment about the importance of the order for creating entries
in the inverted index when registering new time series. The order doesn't matter,
since any subset of the created entries can become visible for search
before any other subset after registering in indexdb.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5948
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5959
* lib/storage/table: properly wait for force merges to be completed during shutdown
Properly keep track of running background merges and wait for merges completion when closing the table.
Previously, force merge was not in sync with overall storage shutdown which could lead to holding ptw ref.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs: add changelog entry
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
vmselect uses a cache folder in file system for two purposes:
1. Storing rollup cache results on shutdown;
2. Storing temporary search results from vmstorage during query executions.
It could happen that cache folder is deleted accidentally by user, or by OS
during cleanup routines. This would cause vmselect to:
1. panic on /metrics call, because `MustGetFreeSpace` will fail;
2. return query error user, as it won't be able to store temporary search results.
The changes in this commit are the following:
1. Make `MustGetFreeSpace` to try re-creating the cache folder if it is missing;
2. Make vmselect to try re-creating the cache folder if it can't persist tmp search
results.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5985
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
* [vmagent] added ingestion rate limiting with new flag `-maxIngestionRate`. This flag can be used to limit the number of samples ingested by vmagent per second. If the limit is exceeded, the ingestion rate will be throttled.
* fix changelog
* fix review comment
When `--vm-native-step-interval` is specified, explore phase will be executed
within specified intervals. Discovered metric names will be associated with
time intervals at which they were discovered. This suppose to reduce number
of requests vmctl makes per metric name since it will skip time intervals
when metric name didn't exist.
This should also reduce probability of exceeding complexity limits
for number of selected series in one request during explore phase.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5369
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Links with upper-case simply don't work for unknown reason.
Once the reason is fixed on docs side, this commit can be reverted.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It is confusing for cluster users to find that VMUI is available at vmselect as it is only mentioned in the list of URLs. Explicit mention of vmselect URL in docs will make it easier to discover.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
This makes the code less fragile - it is harder to skip the convertToCompositeTagFilterss() call now.
While at it, call indexSearch.containsTimeRange() inside indexSearch.searchMetricIDsInternal()
in order to quickly terminate search of time series in the old indexdb for new time ranges.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5055
This is a follow-up for 2d31fd7855
It is assumed that URLPrefix.busOriginal will be initialized
durin Unmarshal of the config. But in tests we set fields manually,
so this field never get initialized properly.
Fixes the error `panic: runtime error: integer divide by zero`
at `vmauth.getLeastLoadedBackendURL`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
If tlsServerName isn't empty, then it is likely the https request is sent to IP instead of hostname.
In this case the request will fail, since Go automatically sets the Host header to the IP instead
of the desired hostname at tlsServerName. So set the Host header to tlsServerName if itsn't empty.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5802
The `match[]` filter is mandatory at /api/v1/series, so it mustn't be dropped here.
There is no sense in dropping `match[]` filter together with `extra_label` and `extra_filters[]`
at /api/v1/labels and /api/v1/label/.../values if -search.ignoreExtraFiltersAtLabelsAPI commnad-line flag is set,
since:
- the `match[]` filter triggers slow path at these APIs;
- the `extra_label` and `extra_filters[]` filters narrow down the number of matched time series,
so they improve performance comparing to the case when only `match[]` filter is left,
while `extra_label` and `extra_filters[]` filters are dropped.
This is a follow-up for 0b7a23a91d
Improved trace display for better visual separation of branches:
* Increased left padding for each element
* Added padding for the last element in the branch
The change also removes misleading `default` value from README for `maxConcurrentInserts`
cmd-line flag.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This reverts commit eb40395a1c.
Reason for revert: it has been appeared that the performance gain on multiple CPU cores
wasn't visible because the benchmark was generating incorrect pushSample.key.
See a207e0bf687d65f5198207477248d70c69284296
Previously samples were dropped on the first incomplete interval and the next complete interval.
Also make sure that the de-duplication is performed just before flushing the aggregate state.
This should help the case then dedup_interval = interval.
* use docs.victoriametrics.com instead of github docs
* add links to common terms used in VictoriaMetrics
Signed-off-by: hagen1778 <roman@victoriametrics.com>
For example, if `interval: 1m`, then data flush occurs at the end of every minute,
while `interval: 1h` leads to data flush at the end of every hour.
Add `no_align_flush_to_interval` option, which can be used for disabling the alignment.
The labelsMap struct employs the fact that label indexes are condensed around 0,
so it stores the referred labels in a slice instead of map and uses slice index as label key.
This allows increasing the LabelsCompressor.Decompress performance by up to 3x.
This also reduces the latency of data flush in stream aggregation.
- Reduce memory usage by up to 5x when de-duplicating samples across big number of time series.
- Reduce memory usage by up to 5x when aggregating across big number of output time series.
- Add lib/promutils.LabelsCompressor, which is going to be used by other VictoriaMetrics components
for reducing memory usage for marshaled []prompbmarshal.Label.
- Add `dedup_interval` option at aggregation config, which allows setting individual
deduplication intervals per each aggregation.
- Add `keep_metric_names` option at aggregation config, which allows keeping the original
metric names in the output samples.
- Add `unique_samples` output, which counts the number of unique sample values.
- Add `increase_prometheus` and `total_prometheus` outputs, which ignore the first sample
per each newly encountered time series.
- Use 64-bit hashes instead of marshaled labels as map keys when calculating `count_series` output.
This makes obsolete https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5579
- Expose various metrics, which may help debugging stream aggregation:
- vm_streamaggr_dedup_state_size_bytes - the size of data structures responsible for deduplication
- vm_streamaggr_dedup_state_items_count - the number of items in the deduplication data structures
- vm_streamaggr_labels_compressor_size_bytes - the size of labels compressor data structures
- vm_streamaggr_labels_compressor_items_count - the number of entries in the labels compressor
- vm_streamaggr_flush_duration_seconds - a histogram, which shows the duration of stream aggregation flushes
- vm_streamaggr_dedup_flush_duration_seconds - a histogram, which shows the duration of deduplication flushes
- vm_streamaggr_flush_timeouts_total - counter for timed out stream aggregation flushes,
which took longer than the configured interval
- vm_streamaggr_dedup_flush_timeouts_total - counter for timed out deduplication flushes,
which took longer than the configured dedup_interval
- Actualize docs/stream-aggregation.md
The memory usage reduction increases CPU usage during stream aggregation by up to 30%.
This commit is based on https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5850
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5898
The match[] at /api/v1/labels and /api/v1/label/.../values also may lead to slow requests and
high resource usage if it matches big number of time series. So it must be igrnored if -search.ignoreExtraFiltersAtLabelsAPI
command-line flag is set.
This is a follow-up for fab02faa3f
* app/{vmagent,vminsert}: adds /v1/metrics suffix for opentelemetry route path
it must fix compatibility with opentemetry-collector [spec](https://opentelemetry.io/docs/specs/otlp/\#otlphttp-request)
this suffix is hard-coded and cannot be changed with collector configuration
* Apply suggestions from code review
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* metricsql: fix label_join() when `dst_label` is equal to one of the `src_label`
* Update app/vmselect/promql/transform.go
* Update docs/CHANGELOG.md
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Document the ability to read OpenTelemetry data from Amazon Firehose at docs/CHANGELOG.md
- Simplify parsing Firehose data. There is no need in trying to optimize the parsing with fastjson
and byte slice tricks, since OpenTelemetry protocol is really slooow because of over-engineering.
It is better to write clear code for better maintanability in the future.
- Move Firehose parser from /lib/protoparser/firehose to lib/protoparser/opentelemetry/firehose,
since it is used only by opentelemetry parser.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5893
* deployment: create a separate env for VictoriaLogs
The new environment consists of the following components:
* VictoriaLogs
* fluentbit for collecting logs and sending to VictoriaLogs
* VictoriaMetrics for scraping and storing metrics from fluentbit and VictoriaLogs
* Grafana with VictoriaLogs datasource for monitoring
-----------------
The motivation for creating a separate environment is to simplify existing environments
and make it easier to update or modify them in future.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Prevsiously they were swapped - the first arg should be the label name and the second arg should be label filters
This is a follow-up for e389b7b959e8144fdff5075bf7a5a39b2b0c6dd3
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5847
This solves two issues:
1. The vm_backups_uploaded_bytes_total metric will grow more smoothly
2. This prevents from int overflow at metrics.Counter.Add() when uploading files bigger than 2GiB
This should simplify code maintenance by gradually converting to atomic.* types instead of calling atomic.* functions
on int and bool types.
See ea9e2b19a5
The issue has been introduced in bace9a2501
The improper fix was in the d4c0615dcd ,
since it fixed the issue just by an accident, because Go comiler aligned the rawRowsShards field
by 4-byte boundary inside partition struct.
The proper fix is to use atomic.Int64 field - this guarantees that the access to this field
won't result in unaligned 64-bit atomic operation. See https://github.com/golang/go/issues/50860
and https://github.com/golang/go/issues/19057
It has been appeared that there are VictoriaMetrics users, who rely on the fact that
VictoriaMetrics components were closing incoming connections to -httpListenAddr every 2 minutes
by default. So let's return back this value by default in order to fix the breaking change
made at d8c1db7953 .
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1304#issuecomment-1961891450 .
Previously the (date, metricID) entries for dates older than the last 2 days were removed.
This could lead to slow check for the (date, metricID) entry in the indexdb during ingesting historical data (aka backfilling).
The issue has been introduced in 431aa16c8d
This commit returns back limits for these endpoints, which have been removed at 5d66ee88bd ,
since it has been appeared that missing limits result in high CPU usage, while the introduced concurrency limiter
results in failed lightweight requests to these endpoints because of timeout when heavyweight requests are executed.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5055
* vmui: add a time picker to the "Logs Explorer" page #5673
* Update app/vmui/packages/vmui/src/pages/ExploreLogs/hooks/useFetchLogs.ts
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Do not convert shard items to part when a shard becomes full. Instead, collect multiple
full shards and then convert them to a searchable part at once. This reduces
the number of searchable parts, which, in turn, should increase query performance,
since queries need to scan smaller number of parts.
* app/vmselect: adds milliseconds to the csv export response for rfc3339
* milliseconds is a standard prescion for VictoriaMetrics query request responses
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5837
* app/victoria-metrics: adds tests for csv export/import
follow-up after 3541a8d0cf96dd4f8563624c4aab6816615d0756
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
Previously the interval between item addition and its conversion to searchable in-memory part
could vary significantly because of too coarse per-second precision. Switch from fasttime.UnixTimestamp()
to time.Now().UnixMilli() for millisecond precision. It is OK to use time.Now() for tracking
the time when buffered items must be converted to searchable in-memory parts, since time.Now()
calls aren't located in hot paths.
Increase the flush interval for converting buffered samples to searchable in-memory parts
from one second to two seconds. This should reduce the number of blocks, which are needed
to be processed during high-frequency alerting queries. This, in turn, should reduce CPU usage.
While at it, hardcode the maximum size of rawRows shard to 8Mb, since this size gives the optimal
data ingestion pefromance according to load tests. This reduces memory usage and CPU usage on systems
with big amounts of RAM under high data ingestion rate.
The pooled rawRowsBlock objects occupies big amounts of memory between flushes,
and the flushes are relatively rare. So it is better to don't use the pool
and to allocate rawRow blocks on demand. This should reduce the average
memory usage between flushes.
The buffer can be quite big under high ingestion rate (e.g. more than 100MB).
This leads to increased memory usage between buffer flushes.
So it is better to re-create the buffer on every flush in order to reduce memory usage
between buffer flushes.
This should improve maintainability of the code related to rollup functions,
since it is located in rollup.go
While at it, properly return empty results from holt_winters(), rate_over_sum(),
sum2_over_time(), geomean_over_time() and distinct_over_time() when there are no real samples
on the selected lookbehind window. Previously the previous sample value was mistakenly
returned from these functions.
* [lib/promutils, lib/httputils] fixed floating-point error when parsing time in RFC3339 format (#5801)
* fixed tests
* fixed test
* Revert "fixed test"
This reverts commit 8a29764806.
* Revert "fixed tests"
This reverts commit 9ce13d1042.
* Revert "[lib/promutils, lib/httputils] fixed floating-point error when parsing time in RFC3339 format (#5801)"
This reverts commit a7a04bd4
* [lib/httputils] fixed floating-point error when parsing time in RFC3339 format (#5801)
---------
Co-authored-by: Nikolay <nik@victoriametrics.com>
Currently the docker-compose examples for loading `victoriametrics-datasource` uses 2 environment variables:
- `GF_ALLOW_LOADING_UNSIGNED_PLUGINS`
- `GF_DEFAULT_APP_MODE`
I believe both of the env vars are trying to achieve the same thing. `GF_DEFAULT_APP_MODE` disables code signing for all plugins and `GF_ALLOW_LOADING_UNSIGNED_PLUGINS` intends to disable code signing for just `victoriametrics-datasource`.
Keeping the scope narrowed to just `victoriametrics-datasource` would be preferable in this case.
Unfortunately `GF_ALLOW_LOADING_UNSIGNED_PLUGINS` is misspelled. According to [grafana docs](https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/#override-configuration-with-environment-variables), the format is supposed to be `GF_<SectionName>_<KeyName>`. In other words the current env var is missing the section name.
This PR proposes to:
1. fix the typo
2. remove the global disablement of code signing
Alternatively, if you prefer to keep codesigning disabled globally, please remove `GF_ALLOW_LOADING_UNSIGNED_PLUGINS` env var as it confuses things
* support case-insensitive search
* reflect search condition in URL, so link can be sharable
* support filtering on /alerts page
* fix collapseAll/expandAll logic to respect only shown entries
* add changelog
b60dcbe11f
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Consistently return the first `limit` log entries if the total size of found log entries doesn't exceed 1Mb.
See app/vlselect/logsql/sort_writer.go . Previously random log entries could be returned with each request.
- Document the change at docs/VictoriaLogs/CHANGELOG.md
- Document the `limit` query arg at docs/VictoriaLogs/querying/README.md
- Make the change less intrusive.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5674
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5778
* - apply v1.10 changes
- chapter on model types (uni/multivariate and rolling)
* - update self-monitoring labels description
- fix typos
* fix duplicated text and link rendering
* app/vmauth: properly release memory during config reload
previously metrics package hold a refrence for channels for users concurrent requests.
it case of churn at `name` field of users configuration, new metric was created. But previous one wasn't deleted.
It prevented full parsed configuration from being garbace collected.
now all config related metrics are bound to corresponding metrics.Set and unregistered during config reload process.
It also must fix an issue with incorrect values for current concurrent user requests
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4690
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Instead, log a sample of these long items once per 5 seconds into error log,
so users could notice and fix the issue with too long labels or too many labels.
Previously this panic could occur in production when ingesting samples with too long labels.
The 3c246cdf00 added an optimization where the previous metaindexRow
could be saved to disk when the current block header couldn't be added indexBlock because the resulting
indexBlock size became too big. This could result in an empty metaindexRow.firstItem for the next metaindexRow.
This allows removing importing unneeded command-line flags into binaries, which import lib/storage,
which, in turn, was importing lib/snapshot in order to use Time, Validate and NewName functions.
This is a follow-up for 83e55456e2
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5738
Closing client connections every 2 minutes doesn't help load balancing -
this just leads to "jumpy" connections between multiple backend servers,
e.g. the load isn't spread evenly among backend servers, and instead jumps
between the servers every 2 minutes.
It is still possible periodically closing client connections by specifying non-zero -http.connTimeout command-line flag.
This should help with https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1304#issuecomment-1636997037
This is a follow-up for d387da142e
Sync description after changes in 83e55456e2
Remove extra text in flags description for vmbackupmanager as it breaks layout
when rendered at docs.victoriametrics.com
Signed-off-by: hagen1778 <roman@victoriametrics.com>
There is no sense in storing commonPrefix for blockHeader containing only a single item,
since this only increases blockHeader size without any benefits.
This panic could occur when samples with too long label values are ingested into VictoriaMetrics.
This could result in too long fistItem and commonPrefix values at blockHeader (up to 64kb each).
This may inflate the maximum index block size by 4 * maxIndexBlockSize.
* add more details to changelog
* simplify panels description
* remove capacity planning recommendation, as it proves it incompetent
Signed-off-by: hagen1778 <roman@victoriametrics.com>
During background downsampling, rate(vm_deduplicated_samples_total{type="merge"}) could be much bigger than
rate(vm_rows_added_to_storage_total) and it could last quite some time,
which causes negative values of Storage full ETA and confuses users, see playground.
Instead of trying to get more accurate results during downsampling, I think it's ok to ignore
vm_deduplicated_samples_total at all, it's more reasonable to see Storage full ETA increase after downsampling.
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
changes(), increases_over_time() and resets() shouldn't take into account
value changes, which may occur because of precision errors.
The maximum guaranteed precision for raw samples stored in VictoriaMetrics is 12 decimal digits.
So do not count relative changes for values if they are smaller than 1e-12 comparing to the value.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/767
For example, -fooDuration=',10s,' is now supported - it sets three command-line flag values:
- the first and the last one are set to the default value for `-fooDuration`
- the second one is set to 10s
* vmui: fix select closing on click outside (#5728)
* vmui: clear entered text in select after selecting a value (#5727)
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This should significantly reduce the number of open ReaderAt files
on VictoriaMetrics and VictoriaLogs startup.
The open files can be tracked via vm_fs_readers metric
GOGC can be already set via environment variable. There is no need in adding
new approaches for setting the GOGC (such as command-line flag), since they complicate operations.
It should help identifying cases when too much CPU is spent on garbage collection,
and advice users on how this can be addressed.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Remove temporary file before closing it in order to signal the OS that it shouldn't
store the file contents from page cache to disk when the file is closed.
Gracefully handle the case when the file cannot be removed before being closed -
in this case remove the file after closing it. This allows working on Windows.
Also remove superflouos opening of temporary file for reading - re-use already opened file handle for writing.
This is a follow-up for 9b1e002287
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4020
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70
The easyproto-based marshaler is 2x slower than the previous custom marshaler,
so let's stick with it. This improves the performance for sending data to remote storage at vmagent
and reduces CPU usage to pre-v1.97.0 levels.
This case is possible after a new brsPool is allocated. The fix is to verify whether len(brsPool) >= len(brs.brs)
before trying to append a new item to brsPool and sharing its contents with brs.brs.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5733
* app/vmselect: set proper timestamp for cached instant responses
The change updates `getSumInstantValues` to prefer timestamp
from the most recent results. Before, timestamp from cached series
was used.
The old behavior had negative impact on recording rules as they
were getting responses with shifted timestamps in past.
Subsequent recording or alerting rules fetching results of these
recording rules could get no result due to staleness interval.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5659
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
There is no sense in running more than GOMAXPROCS concurrent marshalers,
since they are CPU-bound. More concurrent marshalers do not increase the marshaling bandwidth,
but they may result in more RAM usage.
Entries for the previous dates is usually not used, so there is little sense in keeping them in memory.
This should reduce the size of storage/date_metricID cache, which can be monitored
via vm_cache_entries{type="storage/date_metricID"} metric.
This limit has little sense for these APIs, since:
- Thses APIs frequently result in scanning of all the time series on the given time range.
For example, if extra_filters={datacenter="some_dc"} .
- Users expect these APIs shouldn't hit the -search.maxUniqueTimeseries limit,
which is intended for limiting resource usage at /api/v1/query and /api/v1/query_range requests.
Also limit the concurrency for /api/v1/labels, /api/v1/label/.../values
and /api/v1/series requests in order to limit the maximum memory usage and CPU usage for these API.
This limit shouldn't affect typical use cases for these APIs:
- Grafana dashboard load when dashboard labels should be loaded
- Auto-suggestion list load when editing the query in Grafana or vmui
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5055
* port un-synced changed from docs/readme to readme
* consistently use `sh` instead of `console` highlight, as it looks like
a more appropriate syntax highlight
* consistently use `sh` instead of `bash`, as it is shorter
* consistently use `yaml` instead of `yml`
See syntax codes here https://gohugo.io/content-management/syntax-highlighting/
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Optimize the performance of data merge: decimal.CalibrateScale() from 49633 ns/op to 9146 ns/op
* Optimize the performance of data merge: decimal.CalibrateScale()
Sending unfinished aggregate states tend to produce unexpected anomalies with lower values than expected.
The old behavior can be restored by specifying `flush_on_shutdown: true` setting in streaming aggregation config
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: fix data race during hot-config reload
During hot-reload, the logic evokes the group update and rules evaluation
interruption simultaneously. Falsely assuming that interruption happens before
the update. However, it could happen that group will be updated first and only
after the rules evaluation will be cancelled. Which will result in permanent
interruption for all rules within the group.
The fix caches the cancel context function into local variable first. And only after
performs the group update. With cached cancel function we can safely call it without
worrying that we cancel the evaluation for already updated group.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Revert "app/vmalert: fix data race during hot-config reload"
This reverts commit a4bb7e8932.
* app/vmalert: fix data race during hot-config reload
During hot-reload, the logic evokes the group update and rules evaluation
interruption simultaneously. Falsely assuming that interruption happens before
the update. However, it could happen that group will be updated first and only
after the rules evaluation will be cancelled. Which will result in permanent
interruption for all rules within the group.
The fix cancels the evaulation context before applying the update, making sure
that the context will be cancelled for old group always.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Maintain a separate worker pool per each part type (in-memory, file, big and small).
Previously a shared pool was used for merging all the part types.
A single merge worker could merge parts with mixed types at once. For example,
it could merge simultaneously an in-memory part plus a big file part.
Such a merge could take hours for big file part. During the duration of this merge
the in-memory part was pinned in memory and couldn't be persisted to disk
under the configured -inmemoryDataFlushInterval .
Another common issue, which could happen when parts with mixed types are merged,
is uncontrolled growth of in-memory parts or small parts when all the merge workers
were busy with merging big files. Such growth could lead to significant performance
degradataion for queries, since every query needs to check ever growing list of parts.
This could also slow down the registration of new time series, since VictoriaMetrics
searches for the internal series_id in the indexdb for every new time series.
The third issue is graceful shutdown duration, which could be very long when a background
merge is running on in-memory parts plus big file parts. This merge couldn't be interrupted,
since it merges in-memory parts.
A separate pool of merge workers per every part type elegantly resolves both issues:
- In-memory parts are merged to file-based parts in a timely manner, since the maximum
size of in-memory parts is limited.
- Long-running merges for big parts do not block merges for in-memory parts and small parts.
- Graceful shutdown duration is now limited by the time needed for flushing in-memory parts to files.
Merging for file parts is instantly canceled on graceful shutdown now.
- Deprecate -smallMergeConcurrency command-line flag, since the new background merge algorithm
should automatically self-tune according to the number of available CPU cores.
- Deprecate -finalMergeDelay command-line flag, since it wasn't working correctly.
It is better to run forced merge when needed - https://docs.victoriametrics.com/#forced-merge
- Tune the number of shards for pending rows and items before the data goes to in-memory parts
and becomes visible for search. This improves the maximum data ingestion rate and the maximum rate
for registration of new time series. This should reduce the duration of data ingestion slowdown
in VictoriaMetrics cluster on e.g. re-routing events, when some of vmstorage nodes become temporarily
unavailable.
- Prevent from possible "sync: WaitGroup misuse" panic on graceful shutdown.
This is a follow-up for fa566c68a6 .
Thanks @misutoth to for the inspiration at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5212
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5190
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3790
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3551
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3337
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3425
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3647
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3641
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/648
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/291
* docs: remove raw and endraw tags as they are not needed for the new version of site
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
* revert formating in vmaler
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
---------
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
The maxFileParts usage has been accidentally removed in fa566c68a6
While at it, add Count suffix to *AssistedMerges counter names in order to make them less misleading.
Previously their names were falsely suggesting that these are gauges, which show the number of concurrently
executed assisted merges.
This reverts cd4f641d32 , since it has been appeared that the disabled compression
for vmstorage->vmselect data increase network bandwidth usage by more than 10x on typical production workloads,
while it decreases CPU usage at vmstorage by up to 10% and improves query latency by up to 10%.
The 10x increase in network usage is too high price for 10% improvements on query latency and vmstorage CPU usage.
This may result in network bandwidth bottlenecks, which can reduce the overall performance and stability
of VictoriaMetrics cluster. That's why return back the vmstorage->vmselect data compression by default.
The vmstorage->vmselect compression can be disabled by passing -rpc.disableCompression command-line flag to vmstorage.
The vmselect->vmselect compression in multi-level cluster setup can be disabled by passing -clusternative.disableCompression command-line flag.
It has been appeared that the registration of new time series slows down linearly
with the number of indexdb parts, since VictoriaMetrics needs to check every indexdb part
when it searches for TSID by newly ingested metric name.
The number of in-memory parts grows when new time series are registered
at high rate. The number of in-memory parts grows faster on systems with big number
of CPU cores, because the mergeset maintains per-CPU buffers with newly added entries
for the indexdb, and every such entry is transformed eventually into a separate in-memory part.
The solution has been suggested in https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5212
by @misutoth - to limit the number of in-memory parts with buffered channel.
This solution is implemented in this commit. Additionally, this commit merges per-CPU parts
into a single part before adding it to the list of in-memory parts. This reduces CPU load
when searching for TSID by newly ingested metric name.
The https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5212 recommends setting the limit on the number
of in-memory parts to 100, but my internal testing shows that much lower limit 15 works with the same efficiency
on a system with 16 CPU cores while reducing memory usage for `indexdb/dataBlocks` cache by up to 50%.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5190
This allows reducing the indexdb/tagFiltersToMetricIDs cache size by 8 on average.
The cache size can be checked via vm_cache_size_bytes{type="indexdb/tagFiltersToMetricIDs"} metric exposed at /metrics page.
Measuring read / write duration from / to in-memory buffers has little sense,
since it will be always fast. It is better to measure read / write duration from / to
real files at vm_filestream_write_duration_seconds_total and vm_filestream_read_duration_seconds_total metrics.
This also reduces overhead on time.Now() and Histogram.UpdateDuration() calls
per each filestream.Reader.Read() and filestream.Writer.Write() call when the data is read / written from / to in-memory buffers.
This is a follow-up for 2f63dec2e3
* lib/promscrape: respect `0` value for `series_limit` param
Respect `0` value for `series_limit` param in `scrape_config`
even if global limit was set via `-promscrape.seriesLimitPerTarget`.
Previously, `0` value will be ignored in favor of `-promscrape.seriesLimitPerTarget`.
This behavior aligns with possibility to override `series_limit` value via
relabeling with `__series_limit__` label.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update docs/CHANGELOG.md
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Previously, it was not possible to determine which tenant sends metrics with excessive amount of labels of label values.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* vmui: fix the logic of closing the popper #5470
* vmui: fix the logic of caching autocomplete results #5472
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This reduces the number of memory allocations at the cost of possible memory usage increase,
since now different metric name strings may hold references to the previous byte slice.
This is good tradeoff, since ProcessSearchQuery is called in vmselect, and vmselect isn't usually limited by memory.
This change has been extracted from https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5527
This should smooth CPU and RAM usage spikes related to these periodic tasks,
by reducing the probability that multiple concurrent periodic tasks are performed at the same time.
The dateMetricIDCache puts recently registered (date, metricID) entries into mutable cache protected by the mutex.
The dateMetricIDCache.Has() checks for the entry in the mutable cache when it isn't found in the immutable cache.
Access to the mutable cache is protected by the mutex. This means this access is slow on systems with many CPU cores.
The mutabe cache was merged into immutable cache every 10 seconds in order to avoid slow access to mutable cache.
This means that ingestion of new time series to VictoriaMetrics could result in significant slowdown for up to 10 seconds
because of bottleneck at the mutex.
Fix this by merging the mutable cache into immutable cache after len(cacheItems) / 2
cache hits under the mutex, e.g. when the entry is found in the mutable cache.
This should automatically adjust intervals between merges depending on the addition rate
for new time series (aka churn rate):
- The interval will be much smaller than 10 seconds under high churn rate.
This should reduce the mutex contention for mutable cache.
- The interval will be bigger than 10 seconds under low churn rate.
This should reduce the uneeded work on merging of mutable cache into immutable cache.
The change removes artificial delay before returning error, which sometimes
caused less retry events than expected.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: fix watcher start order for roles endpoints and endpointslice
Previously the groupWatcher could be mistakenly stopped when requests for pod or services resources take too long.
* remove mislead comment
* docs/sd_configs.md: mention -promscrape.kubernetes.attachNodeMetadataAll flag in the description for attach_metadata section
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4640
* wip
* lib/promscrape/kubernetes: prevent from stopping groupWatcher when there are in-flight apiWatcher.mustStart() calls
groupWatcher is stopped if it has zero registered apiWatchers during 14 seconds.
But such a groupWatcher can be still in use if apiWatcher for `role: endpoints` or `role: endpointslice`
is being registered and the discovery of the associated `pod` and/or `service` objects takes longer
than 14 seconds - see the beginning of groupWatcher.startWatchersForRole() function for details.
Track the number of in-flight calls to apiWatcher.mustStart() and prevent from stopping the associated groupWatcher
if the number of in-flight calls is non-zero.
P.S. postponing the discovery of `pod` and/or `service` objects associated with `endpoints` or `endpointslice` roles
isn't the best solution, since it slows down initial discovery of `endpoints` and `endpointslice` targets.
* typo fix
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Examples:
1) -metricsAuthKey=file:///abs/path/to/file - reads flag value from the given absolute filepath
2) -metricsAuthKey=file://./relative/path/to/file - reads flag value from the given relative filepath
3) -metricsAuthKey=http://some-host/some/path?query_arg=abc - reads flag value from the given url
The flag value is automatically updated when the file contents changes.
* app/vmauth: adds metric_labels and backend_errors counter
it must improve observability for user requests with new metric - per user backend errors counter.
it's needed to calculate requests fail rate to the configured backends.
metric_labels configuration allows to perform additional aggregations on top of multiple users from configuration section.
It could be multiple clients or clients with separate read/write tokens
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5565
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* app/vmselect/promql: properly handle possible negative results caused by float operations precision error in rollup functions like rate() or increase()
* fix test
- docs/sd_configs.md: moved hetzner_sd_configs docs to the correct place according to alphabetical order of SD names,
document missing __meta_hetzner_role label.
- lib/promscrape/config.go: added missing MustStop() call for Hetzner SD,
and moved the code to the correct place according to alphabetical order of SD names.
- lib/promscrape/discovery/hetzner: properly handle pagination for hloud API responses,
populate missing __meta_hetzner_role label like Prometheus does.
- Properly populate __meta_hetzner_public_ipv6_network label like Prometheus does.
- Remove unused SDConfig.Token.
- Remove "omitempty" annotation from SDConfig.Role field, since this field is mandatory.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5550
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3154
It is likely this value was borrowed from panel `Slow inserts` panel
from Grafana dasbhoard for VM single/cluster installations. This is a mistake.
There is no such thing as "tolerable churn rate", as tolerancy depends on the amount
of allocated resources.
Although, it is unclear what is meant by 5%. If it refers to 5% of new time series per second,
then such churn rate is extremely high. It would mean that the avg life of a time series is 20s.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
app/victoriametrics: update the test suite
* simplify /query and /query_range test cases configuration and tests
* support instant queries with lookbehind window like `query=foo[5m]`
* support instant queries selecting scalar value like `query=42`
* add query_range test for prometheus
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmui/vmanomaly: add support models that produce only `anomaly_score`
* vmui/vmanomaly: fix display legend
* vmui/vmanomaly: update comment on anomaly threshold
- Clarify the bugfix description at docs/CHANGELOG.md
- Simplify the code by accessing prefetchedMetricIDs struct under the lock
instead of using lockless access to immutable struct.
This shouldn't worsen code scalability too much on busy systems with many CPU cores,
since the code executed under the lock is quite small and fast.
This allows removing cloning of prefetchedMetricIDs struct every time
new metric names are pre-fetched. This should reduce load on Go GC,
since the cloning of uin64set.Set struct allocates many new objects.
Add docs to explain that `latest` backup folder can be accessed
more frequently than others. Hence, it is recommended to keep it
on storages with low access price.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Before, this cache was limited only by size.
Cache invalidation by time happens with jitter to prevent thundering herd problem.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It was calculating the number of dropped time series instead of the number of dropped samples.
While at it, drop vmalert_remotewrite_dropped_bytes_total metric, since it was inconsistently calculated -
at one place it was calculating raw protobuf-encoded sample sizes, while at another place it was calculating
the size of snappy-compressed prompbmarshal.WriteRequest protobuf message.
Additionally, this metric has zero practical sense, so just drop it in order to reduce the level of confusion.
* update link to book a demo for AD & RCA
* fix invalid refs in components/model
* - fix staging -> prod links
- replace capitalized FAQ headers
- change the section order on main page
- replace :latest tag with current stable
* add panels for detailed visualization of traffic usage between vmstorage, vminsert, vmselect
components and their clients. New panels are available in the rows dedicated to specific components.
* update "Slow Queries" panel to show percentage of the slow queries to the total number of read queries
served by vmselect. The percentage value should make it more clear for users whether there is a service degradation.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: drop `rollupDefault` function as duplicate
It is unclear why there are two identical fns `rollupDefault`
and `rollupDistinct`. Dropping one of them.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update app/vmselect/promql/rollup.go
* Update app/vmselect/promql/rollup.go
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The user may which to control the endpoint parameters for instance to
set the audience when requesting an access token. Exposing the
parameters as a map allows for additional use cases without requiring
modification.
* lib/awsapi: properly assume role with webIdentity token
introduce new irsaRoleArn param for config. It's only needed for authorization with webIdentity token.
First credentials obtained with irsa role and the next sts assume call for an actual roleArn made with those credentials.
Common use case for it - cross AWS accounts authorization
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3822
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Some clients may ingest samples via OpenTelemetry protocol without Resource labels.
Previously VictoriaMetrics was silently dropping such samples.
The commit 317834f876 added vm_protoparser_rows_dropped_total{type="opentelemetry",reason="resource_not_set"}
counter for tracking of such dropped samples. See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5459
It is better from usability PoV to accept such samples instead of dropping them and incrementing the corresponding counter.
Before, retries happened only on writes into a network connection
between source and destination. But errors returned by server after
all the data was transmitted were logged, but not retried.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Currently, it is impossible to understand why metrics are not ingested when resource is not set by OTEL exporter. Adding metric should simplify debugging and make it improve debuggability.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
app/vmselect/promql/eval.go:evalAggrFunc shunts evaluation
of AggrFuncExpr over rollupFunc over MetricsExpr to an optimized
path. tryGetArgRollupFuncWithMetricExpr() checks whether expression
can be shunted, but it mangles the AggrFuncExpr when the aggregation
function has more than one argument. This results in queries like
`sum(aggr_over_time("avg_over_time",m))` failing with error message
'expecting at least 2 args to "aggr_over_time"; got 1 args' while
the analogous query `sum(avg_over_time(m))` executes successfully.
This fix removes the unnecessary mangling.
Signed-off-by: Anton Tykhyy <atykhyy@gmail.com>
Previously `retry_status_codes: []` and `drop_src_path_prefix_parts: 0` at `url_map` were equivalent to missing values.
This was resulting in using the user-level values instead.
* vmui: add show quick tip for autocomplete
* vmui: auto-completion usability improvements #5348
* vmui: add const for min symbols in autocomplete
* Use proper queries to VictoriaMetrics
* vmui: fix comments for autocomplete
* app/vmselect: run `make vmui-update`
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* fix typos in docs
* add `shard-` prefix to generated links when `-promscrape.cluster.memberURLTemplate` is enabled
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This is follow-up after
75196d7234
It updates some of the alerting rules to remove unnecessary aggregations.
It keeps aggregations for expressions which are using multiple time series
filters to make sure their label will match.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Aggregations with by() have one sideeffect, that any custom labels you add to hosts are dropped too which can be used for alerts routing.
Therefore, some good practice could be to use without() instead, with labels, like without(path) , or without(url) to get same aggregations but with any external labels left intact.
Before, vmalert would send notifications with labels containing characters
not supported by Alertmanager validator, resulting into validation errors
like `msg="Failed to validate alerts" err="invalid label set: invalid name "foo.bar"`
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Previously the lower bound could be too small, which could result in missing values at the beginning of the graph
for default_rollup() function. This function is automatically applied to all the series selectors if they aren't
explicitly wrapped into a rollup function - see https://docs.victoriametrics.com/MetricsQL.html#implicit-query-conversions
While at it, properly take into account `-search.minStalenessInterval` command-line flag when adjusting
the lower bound for the selected time range.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5388
* vmagent: export `vm_promscrape_scrape_pool_targets` metric to track the number of targets that each scrape_job discovers
* add extra panel for new metric
- Add links to relevant docs into descriptions for every -kafka.* and -gcp.pubsub.* command-line flags.
- Wait until message processing goroutines are stopped before returning from gcppubsub.Stop().
- Prevent from multiple calls to Init() without Stop().
- Drop message if tenantID cannot be parsed properly.
- Take into account tenantID for all the supported message formats.
- Support gzip-compressed messages for graphite format.
- Use exponential backoff sleep when the message cannot be pushed to remote storage systems
because of disabled on-disk persistence - https://docs.victoriametrics.com/vmagent.html#disabling-on-disk-persistence
- Unblock from sleep as soon as Stop() is called. Previously the sleep could take up to 2 seconds after Stop() is called.
- Remove unused globalCtx and initContext from app/vmagent/remotewrite/gcppubsub
- Mention Google PubSub support at docs/enterprise.md
- Make Google PubSub docs more clear at docs/vmagent.md
This is a follow-up for commits 115245924a5f096c5a3383d6cc8e8b6fbd421984
and e6eab781ce42285a6a1750dc01eba6801dd35516 .
Updates https://github.com/VictoriaMetrics/VictoriaMetrics-enterprise/pull/717
Updates https://github.com/VictoriaMetrics/VictoriaMetrics-enterprise/pull/713
* app/vmalert: expose `/vmalert/api/v1/rule` and `/api/v1/rule` API which returns rule status in JSON format
* app/vmalert: hide updates if query param not set
* app/vmalert: fix panic (recursion call)
* app/vmalert: add needed group name and file name
* app/vmalert: fix comment, update behavior
* app/vmalert: fix description
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
E.g. replace `fs.Dir + filePath` with `path.Join(fs.Dir, filePath)`
The fs.Dir is guaranteed to end with slash - see Init() functions.
The filePath may start with slash. If it starts with slash, then `fs.Dir + filePath` constructs
an incorrect path with double slashes.
path.Join() properly substitutes duplicate slashes with a single slash in this case.
While at it, also substitute incorrect usage of filepath.Join() with path.Join()
for constructing paths to object storage systems, which expect forward slashes in paths.
filepath.Join() substittues forward slashes with backslashes on Windows, so this may break
creating or managing backups from Windows.
This is a follow-up for 0399367be602b577baf6a872ca81bf0f99ba401b
Updates https://github.com/VictoriaMetrics/VictoriaMetrics-enterprise/pull/719
* lib/backup/s3remote: remove prev object versions for recursive delete
- fix error caused by sending empty objects list to be deleted. This was possible in case old versions of objects where deleted, but root-level entries where still available. This caused paginator to return an empty page which wasn't skipped.
- delete previous versions of objects recursively for S3 remote
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs/changelog: add vmbackupmanager fix entry
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/backup/s3remote: unify path construction for S3 objects
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Previously concurrency for static and fast queries was limited with the -search.maxConcurrentRequests
command-line flag. This could complicate identifying heavy queries via `vmui` at `Top queries` and `Active queries` pages,
since `vmui` and these pages couldn't be opened on overloaded vmselect.
Thanks to @f41gh7 for the idea.
Previously the /service-discovery page didn't show targets dropped because of sharding
( https://docs.victoriametrics.com/vmagent.html#scraping-big-number-of-targets ).
Show also the reason why every target is dropped at /service-discovery page.
This should improve debuging why particular targets are dropped.
While at it, do not remove dropped targets from the list at /service-discovery page
until the total number of targets exceeds the limit passed to -promscrape.maxDroppedTargets .
Previously the list was cleaned up every 10 minutes from the entries, which weren't updated
for the last minute. This could complicate debugging of dropped targets.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5389
* lib/streamaggr: properly reference slice with labels
by limiting slice capacity. It must fix issues with slice modification, in case of append new slice will be allocated, instead of modifying refrenced slice
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5402
* Reduce memory allocations when output_relabel_configs adds new labels to output samples
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* prevent /api/v1 from panic on parsing rows
* add tests for Extract function for v1 and v2 api's
* separate request types in different pools to prevent different objects mixing
* add changelog line
543f218fe9
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Add Try* prefix to functions, which return bool result in order to improve readability and reduce the probability of missing check
for the result returned from these functions.
- Call the adjustSampleValues() only once on input samples. Previously it was called on every attempt to flush data to peristent queue.
- Properly restore the initial state of WriteRequest passed to tryPushWriteRequest() before returning from this function
after unsuccessful push to persistent queue. Previously a part of WriteRequest samples may be lost in such case.
- Add -remoteWrite.dropSamplesOnOverload command-line flag, which can be used for dropping incoming samples instead
of returning 429 Too Many Requests error to the client when -remoteWrite.disableOnDiskQueue is set and the remote storage
cannot keep up with the data ingestion rate.
- Add vmagent_remotewrite_samples_dropped_total metric, which counts the number of dropped samples.
- Add vmagent_remotewrite_push_failures_total metric, which counts the number of unsuccessful attempts to push
data to persistent queue when -remoteWrite.disableOnDiskQueue is set.
- Remove vmagent_remotewrite_aggregation_metrics_dropped_total and vm_promscrape_push_samples_dropped_total metrics,
because they are replaced with vmagent_remotewrite_samples_dropped_total metric.
- Update 'Disabling on-disk persistence' docs at docs/vmagent.md
- Update stale comments in the code
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5088
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2110
* app/vmagent: allow to disabled on-disk queue
Previously, it wasn't possible to build data processing pipeline with a
chain of vmagents. In case when remoteWrite for the last vmagent in the
chain wasn't accessible, it persisted data only when it has enough disk
capacity. If disk queue is full, it started to silently drop ingested
metrics.
New flags allows to disable on-disk persistent and immediatly return an
error if remoteWrite is not accessible anymore. It blocks any writes and
notify client, that data ingestion isn't possible.
Main use case for this feature - use external queue such as kafka for
data persistence.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2110
* adds test, updates readme
* apply review suggestions
* update docs for vmagent
* makes linter happy
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Tests showed that importing a single line with 70MB size takes 5.3GiB
RSS memory for VictoriaMetrics single-node.
In the scenario when user exports and imports data from one VM to another,
it could possibly lead to OOM exception for destination VM.
Importing a single line with 16MB size taks 1.3GiB RSS memory.
Hence, the limit for `import.maxLineLen` was decreased from 100MB to 10MB
to improve reliability of VictoriaMetrics during imports.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The default caching for Go artifacts from actions/setup-go@v4 uses the hash of go.sum file
as a cache key - see https://github.com/actions/cache/blob/main/examples.md#go---modules .
This isn't enough for VictoriaMetrics case, since different makefile actions build different the Go artifacts,
which need to be cached. So embed the action name in the cache key.
This case is possible after the following steps:
1. vmagent successfully performed handshake with the -remoteWrite.url and the remote storage supports zstd-compressed data.
2. remote storage became unavailable or slow to ingest data, vmagent compressed the collected data into blocks with zstd and puts these blocks to persistent queue on disk.
3. vmagent restarts and the remote storage is unavailable during the handshake, then vmagent falls back to Snappy compression.
4. vmagent starts sending zstd-compressed data from persistent queue to the remote storage, while falsely advertizing it sends Snappy-compressed data.
5. The remote storage receives zstd-compressed data and fails unpacking it with Snappy.
The solution is the same as 12cd32fd75, just fall back to zstd decompression if Snappy decompression fails.
Previously the number of memory allocations inside copyTimeseriesShallow() was equal to 1+len(tss)
Reduce this number to 2 by pre-allocating a slice of timeseries structs with len(tss) length.
The `path!="/favicon.ico"` filter has little sense, since there are many other special paths,
which may be filtered out - /metrics, /flags, /health, /ping, /robots.txt, /-/healthy, /-/ready, /reload, etc.
See /lib/httpserver/httpserver.go for more details.
It will be hard or impossible to maintain filters for all these paths, so it is better to drop this filter
in order to simplify queries and improve the consistency of these queries.
The buffered connection could have exceeded the underlying connection
deadline during reading or writing to an internal buffer.
With this change, buffered connection struct additionally checks
for a deadline in Read/Write methods.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
evalRollupFuncNoCache() may return time series with identical labels (aka duplicate series)
when performing queries satisfying all the following conditions:
- It must select time series with multiple metric names. For example, {__name__=~"foo|bar"}
- The series selector must be wrapped into rollup function, which drops metric names. For example, rate({__name__=~"foo|bar"})
- The rollup function must be wrapped into aggregate function, which has no streaming optimization.
For example, quantile(0.9, rate({__name__=~"foo|bar"})
In this case VictoriaMetrics shouldn't return `cannot merge series: duplicate series found` error.
Instead, it should fall back to query execution with disabled cache.
Also properly store the merged results. Previously they were incorrectly stored because of a typo
introduced in the commit 41a0fdaf39
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5332
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5337
`version` label won't show the difference if various flavors of the same
version were deployed. But `short_version` will.
For example, on the sandbox env we test VM builds before new version release.
Without this change, the version update won't be visible on dashboard.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/querytracer: makes package concurrent safe to use
it must fix various issues with concurrent code usage.
Especially, when it's not reasonable to wait for all goroutines to be finished
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- If min_over_time(m[offset] @ timestamp) <= min_over_time(m[offset] @ (timestamp-window)),
then the optimization can be applied.
- If max_over_time(m[offset] @ timestamp) >= max_over_time(m[offset] @ (timestamp-window)),
then the optimization can be applied.
* vmui: reduced the number of server requests
* run `make vmui-update vmui-logs-update`
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This case is possible after the following steps:
1. vmagent tries to perform handshake with the -remoteWrite.url in order to determine whether
the remote storage supports zstd-compressed data.
2. The remote storage is unavailable during the handshake. In this case vmagent falls back to Snappy compression
for the data sent to the remote storage.
3. vmagent compresses the collected data into blocks with Snappy and puts these blocks to persistent queue on disk.
4. The remote storage becomes available.
5. vmagent restarts, performs the handshake with the remote storage and detects that it supports zstd-compressed data.
6. vmagent starts sending Snappy-compressed data from persistent queue to the remote storage,
while falsely advertizing it sends zstd-compressed data.
7. The remote storage receives Snappy-compressed data and fails unpacking it with zstd.
The solution is to just fall back to Snappy decompression if zstd decompression fails.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5301
We have failed test on master branch.
```
--- FAIL: TestFormatLogMessage (0.00s)
logger_test.go:24: unexpected result; got
"foo: abcde, \"foo bar baz\", xx"
want
"foo: a..e, \"f..z\", xx"
```
if failed because maxArgs maxLen <= 4 in the `LimitStringLen` in that case we always will return the income string
but in the test we limit the maxLen by value 4
```
f("foo: %s, %q, %s", []interface{}{"abcde", fmt.Errorf("foo bar baz"), "xx"}, 4, `foo: a..e, "f..z", xx`)
Previously the `Host` header was remained unchanged when passing it in requests to backends.
This may improperly work if the backend uses host-based routing.
While at it, allows http/2.0 requests to backends. While VictoriaMetrics components
do not accept http/2.0 requests, other backends can require such requests.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240
- Re-use identically configured http.Transport across multiple users.
This fixes handling of the limit on the number of connection, which can be established per each backend
via -maxIdleConnsPerBackend command-line flag. This limit stopped working after 323f3720ed
- Add docs about backend TLS setup at https://docs.victoriametrics.com/vmauth.html#backend-tls-setup
- Add ability to disable backend TLS verification for all the users via -backend.tlsInsecureSkipVerify command-line flag.
This flag may be useful when -auth.config contains big number of users, and every user must disable backend TLS verification.
- Add ability to specify TLS Root CA via tls_ca_file option at per-user basis and via -backend.tlsCAFile command-line flag
across all the users.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240
Previously entries which were accessed only 1 time weren't cached.
It has been appeared that some rarely executed heavy queries may read indexdb block twice
in a row instead of once. There is no need in caching such a block then.
This change should eliminate cache size spikes for indexdb/dataBlocks when such heavy queries are executed.
Expose -blockcache.missesBeforeCaching command-line flag, which can be used for fine-tuning
the number of cache misses needed before storing the block in the caching.
* app/vmalert: update remote-write process
* automatically retry remote-write requests on closed connections. The change should reduce the amount of logs produced in environments with short-living connections or environments without support of keep-alive on network balancers.
* increment `vmalert_remotewrite_errors_total` metric if all retries to send remote-write request failed. Before, this metric was incremented only if remote-write client's buffer is overloaded.
* increment `vmalert_remotewrite_dropped_rows_total` amd `vmalert_remotewrite_dropped_bytes_total` metrics if remote-write client's buffer is overloaded. Before, these metrics were incremented only after unsuccessful HTTP calls.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update docs/CHANGELOG.md
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Hui Wang <haley@victoriametrics.com>
reduce the number of queries for restoring alerts state on start-up.
The change should speed up the restore process and reduce pressure on `remoteRead.url`.
* vmauth: add browser authorization request for http requests without credentials to a route that is not in the `unauthorized_user` section (when `unauthorized_user` is specified).
* add link to issue in CHANGELOG
* Extend vmauth docs
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This reduction is based on production testing.
Also expose -search.minWindowForInstantRollupOptimization command-line flag, so users could fine-tune this arg for their needs
vmalert expects string value for stats.seriesFetched, so it is impossible
switching to number without breaking compatibility with old vmalert releases :(
It is still unclear why stats.seriesFetched has string type in the first place...
Repeated instant queries with long lookbehind windows, which contain one of the following rollup functions,
are optimized via partial result caching:
- sum_over_time()
- count_over_time()
- avg_over_time()
- increase()
- rate()
The basic idea of optimization is to calculate
rf(m[d] @ t)
as
rf(m[offset] @ t) + rf(m[d] @ (t-offset)) - rf(m[offset] @ (t-d))
where rf(m[d] @ (t-offset)) is cached query result, which was calculated previously
The offset may be in the range of up to 1 hour.
The SECURITY label should be applied only to changes, which fix security issues.
The change at ad839aa492 adds new command-line flags, which can be used
for improving security in some cases. They do not fix any security issues.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5111
The new metric gets increased each time `-search.logQueryMemoryUsage` memory limit
is exceeded by a query. This metric should help to identify expensive and heavy queries
without inspecting the logs.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The new rule for vmalert supposed to detect groups that miss their
evaulations due to slow queries.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The panel `Errors rate to Alertmanager` had `group` label filter
applied to the expression, while the metric `vmalert_alerts_send_errors_total`
doesn't have that label. This resulted into always empty results.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
fix possible missing firing states for alerting rules in replay mode
Before if one firing stage is bigger than single query request range, like rule with a big `for`, alerting rule won't able to be detected as firing.
Co-authored-by: hagen1778 <roman@victoriametrics.com>
support `Strict-Transport-Security`, `Content-Security-Policy` and `X-Frame-Options`
HTTP headers in all VictoriaMetrics components.
The values for headers can be specified by users via the following flags:
`-http.header.hsts`, `-http.header.csp` and `-http.header.frameOptions`.
Co-authored-by: hagen1778 <roman@victoriametrics.com>
This allows substituting FATAL panics with recoverable runtime errors such as missing or invalid TLS CA file
and/or missing/invalid /var/run/secrets/kubernetes.io/serviceaccount/namespace file.
Now these errors are logged instead of PANIC'ing, so they can be fixed by updating the corresponding files
without the need to restart vmagent.
This is a follow-up for 90427abc65
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5243
Previously url watchers for pod, service and node objects could be mistakenly closed
when service discovery was set up only for endpoints and endpointslice roles,
since watchers for these roles may start start pod, service and node url watchers
with nil apiWatcher passed to groupWatcher.startWatchersForRole().
Now all the url watchers, which belong to a particular groupWatcher, are stopped at once
when this groupWatcher has no apiWatcher subscribers.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5216
The issue has been introduced in v1.93.5 when addressing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
* do not print redundant error logs when failed to scrape consul or nomad target
prometheus performs the same because it uses consul lib which just drops the error(1806bcb38c/api/api.go (L1134))
- Make sure that invalid/missing TLS CA file or TLS client certificate files at vmagent startup
don't prevent from processing the corresponding scrape targets after the file becomes correct,
without the need to restart vmagent.
Previously scrape targets with invalid TLS CA file or TLS client certificate files
were permanently dropped after the first attempt to initialize them, and they didn't
appear until the next vmagent reload or the next change in other places of the loaded scrape configs.
- Make sure that TLS CA is properly re-loaded from file after it changes without the need to restart vmagent.
Previously the old TLS CA was used until vmagent restart.
- Properly handle errors during http request creation for the second attempt to send data to remote system
at vmagent and vmalert. Previously failed request creation could result in nil pointer dereferencing,
since the returned request is nil on error.
- Add more context to the logged error during AWS sigv4 request signing before sending the data to -remoteWrite.url at vmagent.
Previously it could miss details on the source of the request.
- Do not create a new HTTP client per second when generating OAuth2 token needed to put in Authorization header
of every http request issued by vmagent during service discovery or target scraping.
Re-use the HTTP client instead until the corresponding scrape config changes.
- Cache error at lib/promauth.Config.GetAuthHeader() in the same way as the auth header is cached,
e.g. the error is cached for a second now. This should reduce load on CPU and OAuth2 server
when auth header cannot be obtained because of temporary error.
- Share tls.Config.GetClientCertificate function among multiple scrape targets with the same tls_config.
Cache the loaded certificate and the error for one second. This should significantly reduce CPU load
when scraping big number of targets with the same tls_config.
- Allow loading TLS certificates from HTTP and HTTPs urls by specifying these urls at `tls_config->cert_file` and `tls_config->key_file`.
- Improve test coverage at lib/promauth
- Skip unreachable or invalid files specified at `scrape_config_files` during vmagent startup, since these files may become valid later.
Previously vmagent was exitting in this case.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959
It could return either `failed to read` or `failed to parse` errors depending
on whether the given url can be loaded or not under the current environment
c9375cac5e
Descriptions were updated in attempt to make it more clear for readers,
re-phrasing and linking missing docs.
`eval_delay` was added to tests to verify it can be unmarshalled.
`eval_delay` is now applied before timestamp alignment to make it more predictable.
Before, if delay < interval the timestamp won't be aligned.
`eval_delay` and `eval_offset` was added to API output.
`PreviouslySentSeriesToRW` converted to private `previouslySentSeriesToRW`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Using `min_over_time` should reduce the amount of false positives when
component is running in near-the-threshold state. Now it should trigger
only if all collected samples were above the threshold on 10m interval.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: limit the number of parallel workers by 32
The change should improve performance and memory usage during query processing
on machines with big number of CPU cores. The number of parallel workers for
query processing is controlled via `-search.maxWorkersPerQuery` command-line flag.
By default, the number of workers is limited by the number of available CPU cores,
but not more than 32. The limit can be increased via `-search.maxWorkersPerQuery`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
- The `-search.maxWorkersPerQuery` command-line flag doesn't limit resource usage,
so move it from the `resource usage limits` to `troubleshooting` chapter at docs/Single-server-VictoriaMetrics.md
- Make more clear the description for the `-search.maxWorkersPerQuery` command-line flag
- Add the description of `-search.maxWorkersPerQuery` to docs/Cluster-VictoriaMetrics.md
- Limit the maximum value, which can be passed to `-search.maxWorkersPerQuery`, to GOMAXPROCS,
because bigger values may worsen query performance and increase CPU usage
- Improve the the description of the change at docs/CHANGELOG.md. Mark it as FEATURE instead of BUGFIX,
since it is closer to a feature than to a bugfix.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5087
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* fix inconsistent behaviors with prometheus when scraping
1. address https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959. skip job with wrong syntax in `scrape_configs` with error logs instead of exiting;
2. show error messages on vmagent /targets ui if there are wrong auth configs in `scrape_configs`, previously will print error logs and do scrape without auth header;
3. don't send requests if there are wrong auth configs in:
1. vmagent remoteWrite;
2. vmalert datasource/remoteRead/remoteWrite/notifier.
* add changelogs
* address review comments
* fix ut
This can be useful in the following queries:
drop_empty_series(temperature <= 30) default 40
This query drops temperature series with all the values bigger than 30 on the selected time range,
while replacing gaps in the remaining series with 40.
The query without drop_empty_series:
(temperature <= 30) default 40
would leave all the temperature series with all the values bigger than 30 on the selected time range,
and replace all their values with 40. This is not what could be epxected in some cases
like here - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5071
This reverts commit 3d7a77bf82.
Reason for revert: relative links do not work properly at GitHub code
and at GitHub wiki. For example, the following page contains broken links
before reverting this commit:
https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/docs/VictoriaLogs/CHANGELOG.md
It is always better to use absolute links thank relative links, since the page contents
can be copy-n-pasted to other pages, which are located in vastly different directories,
and all the links will remain working.
* fixed error when creating a full backup using the `-origin` flag (#5144)
* Update docs/CHANGELOG.md
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This is a follow-up for f60c08a7bd
Changes:
- Make sure all the urls related to NewRelic protocol start from /newrelic . Previously some urls were started from /api/v1/newrelic
- Remove /api/v1 part from NewRelic urls, since it has no sense
- Remove automatic transformation from CamelCase to snake_case for NewRelic labels and metric names,
since it may complicate the transition from NewRelic to VictoriaMetrics. Preserve all the metric names and label names,
so users could query metrics and labels by the same names which are used in NewRelic.
The automatic transformation from CamelCase to snake_case can be added later as a special action for relabeling rules if needed.
- Properly update per-tenant data ingestion stats at app/vmagent/newrelic/request_handler.go . Previously it was always zero.
- Fix NewRelic urls in vmagent when multitenant data ingestion is enabled. Previously they were mistakenly started from `/`.
- Document NewRelic data ingestion url at docs/Cluster-VictoriaMetrics.md
- Remove superflouos memory allocations at lib/protoparser/newrelic
- Improve tests at lib/protoparser/newrelic/*
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3520
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4712
- Make more clear the docs at docs/enterprise.md, so readers could figure out faster
on how to obtain enterprise key and how to pass it to VictoriaMetrics Enterprise components.
- Fix examples at docs/enterprise.md, which were referring to non-existing `-license-file` command-line flag.
The `-licenseFile` command-line flag must be used instead.
- Improve the description of `-license*` command-line flags, so users could understand
faster how to use them.
- Improve the warning message, which is emitted when the deprecated -eula command-line flag is passed,
so the user could figure out how to switch faster to -license* command-line flags.
- Disallow running VictoriaMetrics components with both -license and -licenseFile command-line flags.
- Disallow running VictoriaMetrics components when -licensFile points to an empty file.
- Consistently use the phrase "This flag is available only in Enterprise binaries" across
all the enterprise-specific command-line flags.
- Remove unneeded level of indirection for `noLicenseMessage` and `expiredMessage` string contants
in order to improve code readability and maintainability.
- Remove unneded `return` statements after `logger.Fatalf()` calls, since these calls exit the app and never return.
- Make sure that the info log message about successful license verification is emitted
when the license is verified successfully. Previously the error message could be logged
when the license payload is invalid or if it misses some required features.
This reverts commit a8345bb1b9
Reason for revert: VictoriaMetrics binaries are consistently created inside `bin` directory at the root of the repository
when running `make <vm-app>` according to https://docs.victoriametrics.com/#how-to-build-from-sources
If some dev environments create binaries inside random directories, then it is better to provide docs
at https://docs.victoriametrics.com/#how-to-build-from-sources on how to setup these IDEs, so they
consistently create binaries at bin/* directory at the root of the repository instead of trying to add
random ignore rules inside .gitignore.
As for the data directories created by VictoriaMetrics components, they may be created at random places too,
so there is little sense in trying to add ignore rules for all these directories inside .gitignore.
It is better to document that the built binaries must be consistently started from the repository root,
so data directories are created at the repository root. The .gitignore already contains rule
for blocking common data directories, which can be created by VictoriaMetrics components at the repository root.
- Reduce vertical space usage, so more information is available on the screen without the need to scroll.
- Show information for lines with higher values at the top of the legend under the graph.
This should simplify graph analysis when it contains many lines.
Fix vminsert/vmstorage/vmselect metrics filtering when dashboard is used
to display data from many sub-clusters with unique job names.
Before, only one specific job could have been accounted for component-specific panels,
instead of all available jobs for the component.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
reduce lock contention for heavy aggregation requests
previously lock contetion may happen on machine with big number of CPU due to enabled string interning. sync.Map was a choke point for all aggregation requests.
Now instead of interning, new string is created. It may increase CPU and memory usage for some cases.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5087
* vmalert: add `query_time_alignment` for rule group
1. add `eval_alignment` attribute for group which by default is true. So group rule query stamp will be aligned with interval and propagated to ALERT metrics and the messages for alertmanager;
2. deprecate `datasource.queryTimeAlignment` flag.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5049
Strip sensitive information such as auth headers or passwords from datasource, remote-read,
remote-write or notifier URLs in log messages or UI. This behavior is by default and is controlled via
`-datasource.showURL`, `-remoteRead.showURL`, `remoteWrite.showURL` or `-notifier.showURL` cmd-line flags.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5044
* lib/promscrape: make concurrency control optional
Before, `-maxConcurrentInserts` was limiting all calls to `promscrape.Parse`
function: during ingestion and scraping. This behavior is incorrect.
Cmd-line flag `-maxConcurrentInserts` should have effect onl on ingestion.
Since both pipelines use the same `promscrape.Parse` function, we extend it
to make concurrency limiter optional. So caller can decide whether concurrency
should be limited or not.
This commit makes c53b5788b4
obsolete.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Revert "dashboards: move `Concurrent inserts` panel to Troubleshooting section"
This reverts commit c53b5788b4.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Move uniqueFields from rows to blockStreamMerger struct.
This allows localizing all the references to uniqueFields inside blockStreamMerger.mustWriteBlock(),
which should improve readability and maintainability of the code.
- Remove logging of the event when blocks cannot be merged because they contain more than maxColumnsPerBlock,
since the provided logging didn't provide the solution for the issue with too many columns.
I couldn't figure out the proper solution, which could be helpful for end user,
so decided to remove the logging until we find the solution.
This commit also contains the following additional changes:
- It truncates field names longer than 128 chars during logs ingestion.
This should prevent from ingesting bogus field names.
This also should prevent from too big columnsHeader blocks,
which could negatively affect search query performance,
since columnsHeader is read on every scan of the corresponding data block.
- It limits the maximum length of const column value to 256.
Longer values are stored in an ordinary columns.
This helps limiting the size of columnsHeader blocks
and improving search query performance by avoiding
reading too long const columns on every scan of the corresponding data block.
- It deduplicates columns with identical names during data ingestion
and background merging. Previously it was possible to pass columns with duplicate names
to block.mustInitFromRows(), and they were stored as is in the block.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4762
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4969
docs: add clarification of the retention filter usage
Updated documentation regarding retention filter usage if duration is set lower than
`-retentionPeriod` flag value.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
* lib/promscrape: add metric `vm_promscrape_scrapes_skipped_total`
add metric `vm_promscrape_scrapes_skipped_total`to show whether vmagent skips the scrapes.
This could happen if vmagent is overloaded or target is responding too slow for configured `scrape_interval`.
The follow-up commit should add a corresponding alerting rule and panel to vmagent dashboard.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* deployment/docker: add `TooManyScrapeSkips` alerting rule for vmagent
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: add panels `Scrape duration 0.99 quantile` and `Skipped scrapes` to vmagent dashboard
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Compare the actual free disk space to the value provided via -storage.minFreeDiskSpaceBytes
directly inside the Storage.IsReadOnly(). This should work fast in most cases.
This simplifies the logic at lib/storage.
- Do not take into account -storage.minFreeDiskSpaceBytes during background merges, since
it results in uncontrolled growth of small parts when the free disk space approaches -storage.minFreeDiskSpaceBytes.
The background merge logic uses another mechanism for determining whether there is enough
disk space for the merge - it reserves the needed disk space before the merge
and releases it after the merge. This prevents from out of disk space errors during background merge.
- Properly handle corner cases for flushing in-memory data to disk when the storage
enters read-only mode. This is better than losing the in-memory data.
- Return back Storage.MustAddRows() instead of Storage.AddRows(),
since the only case when AddRows() can return error is when the storage is in read-only mode.
This case must be handled by the caller by calling Storage.IsReadOnly()
before adding rows to the storage.
This simplifies the code a bit, since the caller of Storage.MustAddRows() shouldn't handle
errors returned by Storage.AddRows().
- Properly store parsed logs to Storage if parts of the request contain invalid log lines.
Previously the parsed logs could be lost in this case.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4737
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4945
This should reduce tail latency during data ingestion.
This shouldn't slow down data ingestion in the worst case, since assisted merges are spread among
distinct addRows/addItems calls after this change.
It was added in order to limit number of goroutines performing assisted merges during ingestion.
It turned out that blocking ingestion goroutines lower ingestion performance and limits overall ingestion around 40k items per seconds because of lock contention.
Removing parts merge sync.Cond allows to remove lock contention at write path and significantly improves write performance.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4775
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* vmui: update information about tsdb usage in cluster version
* vmui: cleanup
* vmui: add CHANGELOG.md
* vmui: cleanup
* vmui: update logic, move information to the visible place
* app/vmui: remove values fetch, update documentation for cardinality explorer
* app/vmui: update CHANGELOG.md
Moved because this panel is related to both: scraped and ingested data.
Before, it could have give a misleading impression that it is related to ingested metrics only.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docker-compose: add vmauth to cluster env
vmauth acts as a balancer and used as an example of how to interconnect
VM components via vmauth.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docker-compose: add vmauth to cluster env
vmauth acts as a balancer and used as an example of how to interconnect
VM components via vmauth.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
lib/promscrape/discovery/kubernetes: supress context.Cancelled error in logs
It is possible that context.Cancelled will appear after k8s watcher was closed due to reload(see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850).
Logging an error misinforms user and looks like vmagent discovery will stop working even though this does not affect discovery.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* expose metrics `vmauth_config_last_reload_*` for tracking the state of config reloads, similarly to vmagent/vmalert components.
* do not print logs like `SIGHUP received...` once per configured `-configCheckInterval` cmd-line flag. This log will be printed only if config reload was invoked manually.
* prevent configuration reloading if there were no changes in config. This improves memory usage when `-configCheckInterval` cmd-line flag is configured and config has extensive list of regexp expressions requiring additional memory on parsing.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/storage/partition: add check to ensure parts exist on disk
If part exists in parts.json but is missing on disk there will be a misleading error similar to "unexpected number of substrings in the part name".
This change forces verification of part existence and throws a correct error in case it is missing on disk.
Such issue can be result of https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5005 or disk corruption.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage/partition: use filepath.Join instead of string concatenation
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage/partition: add action points for error message
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* all: add a check for missing part in lib/mergeset and lib/logstorage
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Switch from summary to histogram for vl_http_request_duration_seconds metric.
This allows calculating request duration quantiles across multiple hosts
via histogram_quantile(0.99, sum(vl_http_request_duration_seconds_bucket) by (vmrange)).
- Take into account only successfully processed data ingestion requests
when updating vl_http_request_duration_seconds histogram.
Failed requests are ignored, since they may significantly skew measurements.
- Clarify the description of the change at docs/VictoriaLogs/CHANGELOG.md.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4934
- Move the bugfix description to the correct place in docs/CHANGELOG.md
- Prevent from logging of 'context canceled' errors after the url watcher is stopped,
since these errors are expected and may confuse users.
- Remove unused urlWatcher.refCount field.
- Remove unused urlWatcher.close() method.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
* lib/promscrape/discovery/kubernetes: fix leaking api watcher
goroutine which was polling k8s API had no execution control. This leaded to leaking goroutines during config reload.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: use reference counting for urlWatcher cleanup
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: remove waitgroup sync for goroutines polling API server
This is unnecessary since context will is cancelled and new requests will not be sent. Also, using waitgroup will increase time required to perform reload which might result in missed scrapes.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: clarify comment
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Apply suggestions from code review
* lib/promscrape/discovery/kubernetes: address review feedback
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
Adding limit on ingestion allows to avoid issues like this one https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4762
Such issues are often caused by misconfigurtion on log persing/ingestion side and preventing such rows from being ingested allows to avoid performance implications created by storing such log rows.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
This automation doesn't work as intended on LTS releases, bugfix releases and custom releases,
since it assumes every new tag is related only to new release.
Also the github.com/VictoriaMetrics/ops repository may contain manually set custom tags
for VictoriaMetrics components (for example, for testing the latest bugfixes or features),
which are overwritten by the generated pull request.
The way to go is to manually update tags at github.com/VictoriaMetrics/ops repository when needed
instead of trying to automate this process.
This should prevent from emitting too long lines when too long args are passed to logger.* functions.
For example, too long MetricsQL queries or too long data samples.
* Introduce flagutil.Duration
To avoid conversion bugs
* Fix tests
* Clarify documentation re. month=31 days
* Add fasttime.UnixTime() to obtain time.Time
The goal is to refactor out the last usage of `.Msecs`.
* Use fasttime for time.Now()
* wip
- Remove fasttime.UnixTime(), since it doesn't improve code readability and maintainability
- Run `make docs-sync` for syncing changes from README.md to docs/ folder
- Make lib/flagutil.Duration.Msec private
- Rename msecsPerMonth const to msecsPer31Days in order to be consistent with retention31Days
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* feat: add cardinality support for prometheus (#4320)
* docs/CHANGELOG.md: add cardinality support for prometheus
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* deployment/docker: add VictoriaLogs configuration
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker/victorialogs: remove outdated comment
It was added in order to indicate that it is required to build VictoriaLogs manually before starting it at the time there was no public release available.
Currently, there is a public tag and it is not required to build it from sources.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker/victorialogs/fluentbit: include log path in stream configuration
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker: add reference to monitoring setup for VictoriaLogs
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* added ability to set and clear response headers (#4825)
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* added ability to set and clear response headers (#4825)
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fix review comment
Signed-off-by: Alexander Marshalov <_@marshalov.org>
---------
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* app/vminsert: fixes readonly check
previously vminsert doesn't check readOnly state for vmstorage, since check was never performed for nil buffer
In this case every 30 second storage node loss readonly state and received some data.
It caused re-routing and possible slow down for ingestion
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4870
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Disallow parsing multitenant token at auth.NewToken().
Use auth.NewTokenPossibleMultitenant() at vminsert only. All the other callers should call auth.NewToken(),
since they do not support multitenant token.
This is a follow-up for f0c06b428e
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4910
- Document the change at docs/CHANGELOG.md
- Set the default value for -vmstorageUserTimeout to 3 seconds. This is much better
than the 0 value, which means that TCP connection to unreachable vmstorage could block
for up to 16 minutes.
- Document -vmstorageUserTimeout at docs/Cluster-VictoriaMetrics.md
The previous code could result in the following data race:
1. The s.ptwHot partition is marked to be deleted
2. ptw.decRef() is called on it
3. ptw.pt is set to nil
4. s.ptwHot.pt is accessed from concurrent goroutine, which leads to panic.
The change clears s.ptwHot under s.partitionsLock in order to prevent from the data race.
This is a follow-up for 8d50032dd6
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4895
`-promscrape.cluster.membersCount` by default should be `1`, like every
single vmagent is a cluster of one member on its own.
The change additionally validates that user can't set `-promscrape.cluster.membersCount`
to value lower than `1`.
eabcfc9bcd
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Initially, stream parse mode was reading data from response and parsing it on flight. This was causing longer delay to read the whole response and required increasing timeout value to allow data processing while reading. So that 908e35affd increased timeout value to fix this.
But after 74c00a8762 response in stream parse mode is saved into memory and then parsed eliminating necessity of having timeout value higher that for usual scrape.
Updates: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4847
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* vmagent: retry failed write request on the closed connection
Retry failed write request on the closed connection immediately,
without waiting for backoff. This should improve data delivery speed
and reduce amount of error logs emitted by vmagent when using idle connections.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4139
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmagent: retry failed write request on the closed connection
Re-instantinate request before retry as body could have been already spoiled.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
* vmalert: correctly re-instantinate HTTP req on retries
Previosly, request retry to datasource re-used existing HTTP request.
But if request object was already partially processed (body was read),
then retry will be unsuccessful.
The change re-instantinates HTTP request object before retry.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: review fix
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Add button to prettify query
Just capitalizes query text for now
* Add /prettify-query API handler
* Replace UI pretiffier using prettifier API
* Add showing server errors
Had to pass setQueryErrors from useFetchQuery.ts
* Use serverUrl from global AppState
* Change icon to AutoAwsome icon + added style change color when button is active
* Add sync/await to prettifyQuery function
* Doc public function for lint
* Minor async fix
* Removed extra blank lines
* Extract usePrettifyQuery hook
* Made more generic style for :active button
* Refactor usePrettifyQuery
However, prettify errors don't clean up query errors, but should
* Add prettyQuery functionality to CHANGELOG.md
* Reuse queryErrors
* Unhide errors on start
---------
Co-authored-by: Tamara <toma.vashchuk@gmail.com>
exclude assets/README.md from publishing on the docs website
as its purpose is different to other docs.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Fix Prometheus-compatible naming after applying the relabeling if -usePromCompatibleNaming command-line flag is set.
This should prevent from possible Prometheus-incompatible metric names and label names generated by the relabeling.
- Do not return anything from relabelCtx.appendExtraLabels() function, since it cannot change the number of time series
passed to it. Append labels for the passed time series in-place.
- Remove promrelabel.FinalizeLabels() call after adding extra labels to time series, since this call has been already
made at relabelCtx.applyRelabeling(). It is user's responsibility if he passes labels with double underscore prefixes
to -remoteWrite.label.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4247
This reverts commit 251d3d54a4.
Reason for revert: this commit removes important information, which was placed in the docs/assets folder
in order to reduce the probability of putting images related to particular docs here,
with the reasoning why this is a bad practice.
This information should remain in the docs/assets folder. Probably, the file should be renamed
from README.md to README or to some other visible name, which doesn't lead to generation of unneded
documentation for the assets folder.
app/vmctl: don't interrupt migration process if tenant has no data
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Alexander Marshalov <_@marshalov.org>
VictoriaLogs has its own release schedule, so it must be released separately via:
make publish-victoria-logs release-victoria-logs
TODO: sync VictoriaLogs and VictoriaMetrics releases after VictoriaLogs goes out of preview stage.
This will simplify release process and upgrades at user side.
vmagent: properly add extra labels before sending data to remote storage
labels from `remoteWrite.label` are now added to sent metrics just before they
are pushed to `remoteWrite.url` after all relabelings, including stream aggregation relabelings (#4247)
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4247
Signed-off-by: Alexander Marshalov <_@marshalov.org>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
Fix display of ingested rows rate for `Samples ingested/s`
and `Samples rate` panels for vmagent's dasbhoard.
Previously, not all ingested protocols were accounted in these panels.
An extra panel `Rows rate` was added to `Ingestion` section to display the split
for rows ingested rate by protocol.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This reverts commit 252643d100.
Reason for revert: the commit incorrectly fixes the the issue.
The `remoteAddr` must be properly quoted inside lib/httpserver.GetQuotedRemoteAddr().
It isn't quoted properly if the request contains X-Forwarded-For header.
The proper fix will be included in the follow-up commit.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4676
{vmagent/remotewrite,vminsert/common}: fix dropInput and keepInput flags inconsistency
Sync behavior for dropInput and keepInput flags between single-node and vmagent.
Fix vmagent not respecting dropInput flag and reverse logic for keepInput.
The deferred call's arguments are evaluated immediately, but the function call is not executed until the surrounding function returns.
Signed-off-by: Abirdcfly <fp544037857@gmail.com>
* rename `configErr` to `lastConfigErr` to reduce confusion
* add tests to verify metrics and msg are set properly
* fix mistake when config success metric wasn't restored after an error
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Value of `-dedup.minScrapeInterval` comand-line flag must be higher than `evaluation_interval` in order to make sure that only one sample on each evaluation will be left after deduplication.
Moreover, value of `-dedup.minScrapeInterval` must be a multiple of vmalert's `evaluation_interval` in order to make sure that samples will be aligned between deduplication window periods.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4774#issuecomment-1663940811
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Correctly calculate `Bytes per point` value for single-server and cluster VM dashboards.
Before, the calculation mistakenly accounted for the number of entries in indexdb in
denominator, which could have shown lower values than expected.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The `ConcurrentFlushesHitTheLimit` could be related to components like
vminsert, vmstorage, vm-single-node and vmagent. Moving this alert
to the `health` section of alerts will be benefitial for all components
and will remove the duplicates from single/cluster alerts.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The new panel supposed to show whether the number of concurrent
inserts processed by vmagent isn't reaching the limit.
The panel contains recommendation what to do if limit is reached.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
vmauth: allow configuring deadline for a backend to be excluded from the rotation
The new flag `-failTimeout` allows overriding default time for a bad backend
to be excluded from rotation. The override option could be useful for systems
where it is expected for backends to be off for significant periods of time.
Co-authored-by: Zakhar Bessarab <zekker6@gmail.com>
Binary export API protocol can be disabled via `-vm-native-disable-binary-protocol` cmd-line flag when migrating data from VictoriaMetrics. Disabling binary protocol
can be useful for deduplication of the exported data before ingestion.
For this, deduplication need to be configured at `-vm-native-src-addr` side
and `-vm-native-disable-binary-protocol` should be set on vmctl side.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/protoparser: adds opentelemetry parser
app/{vmagent,vminsert}: adds opentelemetry ingestion path
Adds ability to ingest data with opentelemetry protocol
protobuf and json encoding is supported
data converted into prometheus protobuf timeseries
each data type has own converter and it may produce multiple timeseries
from single datapoint (for summary and histogram).
only cumulative aggregationFamily is supported for sum(prometheus
counter) and histogram.
Apply suggestions from code review
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
updates deps
fixes tests
wip
wip
wip
wip
lib/protoparser/opentelemetry: moves to vtprotobuf generator
go mod vendor
lib/protoparse/opentelemetry: reduce memory allocations
* wip
- Remove support for JSON parsing, since it is too fragile and is rarely used in practice.
The most clients send OpenTelemetry metrics in protobuf.
The JSON parser can be added in the future if needed.
- Remove unused code from lib/protoparser/opentelemetry/pb and lib/protoparser/opentelemetry/proto
- Do not re-use protobuf message between ParseStream() calls, since there is high chance
of high fragmentation of the re-used message because of too complex nested structure of the message.
* wip
* wip
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The important change is to highlight that restore procedure happens
only once and only for already loaded rules. Config hot-reload
doesn't trigger the restore procedure.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Add -remoteWrite.shardByURL command-line flag, which instructs vmagent to spread evenly
outgoing time series data among the configured remote storage systems specified via -remoteWrite.url .
Samples for the same time series go to the same -remoteWrite.url . This allows building horizontally
scalable stream aggregation when samples for counter and histogram series must be aggregated
by the same second-level vmagent instance.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4637
- Use a byte slice instead of a map for tracking indexes for matching series.
This improves performance, since access by slice index is faster than access by map key.
- Re-use the byte slice for tracking indexes for matching series.
This removes unnecessary memory allocations and improves stream aggregation performance a bit.
- Add an ability to return to the previous behvaiour by specifying -remoteWrite.streamAggr.dropInput command-line flag.
In this case all the input samples are dropped when stream aggregation is enabled.
- Backport the new stream aggregation behaviour from vmagent to single-node VictoriaMetrics when -streamAggr.config
option is set.
- Improve docs regarding this change at docs/CHANGELOG.md
- Document the new behavior at docs/stream-aggregation.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4243
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4575
Previously all the incoming samples were de-duplicated, even if their series doesn't
match aggregation rule filters. This could result in increased CPU usage.
Now the de-duplication isn't applied to samples for series, which do not match
aggregation rule filters. Such samples are just ignored.
* lib/storage: pre-create timeseries before indexDB rotation
during an hour before indexDB rotation start creating records at the next indexDB
it must improve performance during switch for the next indexDB and remove ingestion issues.
Since there is no need for creation new index records for timeseries already ingested into current indexDB
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4563
* lib/storage: further work on indexdb rotation optimization
- Document the change at docs/CHAGNELOG.md
- Move back various caches from indexDB to Storage. This makes the change less intrusive.
The dateMetricIDCache now takes into account indexDB generation, so it stores (date, metricID)
entries for both the current and the next indexDB.
- Consolidate the code responsible for idbNext pre-filling into prefillNextIndexDB() function.
This improves code readability and maintainability a bit.
- Rewrite and simplify the code responsible for calculating the next retention timestamp.
Add various tests for corner cases of this code.
- Remove indexdb pre-filling from RegisterMetricNames() function, since this function is rarely called.
It is OK to add indexdb entries on demand in this function. This simplifies the code.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401
* docs/CHANGELOG.md: refer to https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4563
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* app/vmalert/datasource/graphite: allow overriding "from" parameter for datasource queries
Fixes construction of URL parameters for graphite render to allow overriding "from" parameter.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4685
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmalert/datasource/graphite: update flow for building URL parameters
Makes flow of building URL parameters same as Prometheus datasource has:
1) Setting all default values
2) Merging those values with provided `extraParams`
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Update docs/CHANGELOG.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
The change disables version updates for repo packages.
Please note, security updates should not be affected by the change
according to https://docs.github.com/en/code-security/dependabot/dependabot-version-updates/configuration-options-for-the-dependabot.yml-file#open-pull-requests-limit:
```
open-pull-requests-limit
By default, Dependabot opens a maximum of five pull requests for version updates. Once there are five open pull requests from Dependabot, Dependabot will not open any new requests until some of those open requests are merged or closed.
This option has no impact on security updates, which have a separate, internal limit of ten open pull requests.
```
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Round staleness_interval durations to the upper number of seconds.
This should prevent from under-calculations for fractional staleness intervals.
- Rename stalenessInterval field at *AggrState structs into stalenessSecs, since it holds seconds.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4667
- Improve docs
- Hide `debug relabeling` column when -promscrape.dropOriginalLabels command-line flag is set
- Inline the code from the added template functions, since the code is harder to follow
with the template functions, especially when these functions have misleading names.
Also, these functions are used only in one place, e.g. they do not reduce the amounts of code.
- Hide `click to show original labels` title at `labels` column when original labels aren't available.
- Show the reason on whey original labels aren't available at /service-discovery page.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4597
- Parse protobuf if Content-Type isn't set to `application/json` - this behavior is documented at https://grafana.com/docs/loki/latest/api/#push-log-entries-to-loki
- Properly handle gzip'ped JSON requests. The `gzip` header must be read from `Content-Encoding` instead of `Content-Type` header
- Properly flush all the parsed logs with the explicit call to vlstorage.MustAddRows() at the end of query handler
- Check JSON field types more strictly.
- Allow parsing Loki timestamp as floating-point number. Such a timestamp can be generated by some clients,
which store timestamps in float64 instead of int64.
- Optimize parsing of Loki labels in Prometheus text exposition format.
- Simplify tests.
- Remove lib/slicesutil, since there are no more users for it.
- Update docs with missing info and fix various typos. For example, it should be enough to have `instance` and `job` labels
as stream fields in most Loki setups.
- Allow empty of missing timestamps in the ingested logs.
The current timestamp at VictoriaLogs side is then used for the ingested logs.
This simplifies debugging and testing of the provided HTTP-based data ingestion APIs.
The remaining MAJOR issue, which needs to be addressed: victoria-logs binary size increased from 13MB to 22MB
after adding support for Loki data ingestion protocol at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4482 .
This is because of shitty protobuf dependencies. They must be replaced with another protobuf implementation
similar to the one used at lib/prompb or lib/prompbmarshal .
Loki uses default labels format without "or" operator. This format can't create a list of LabelFilters, so only first set of LabelFilters should be used.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmagent: fix creating target id if `--promscrape.dropOriginalLabels` flag was used
* app/vmagent: hide links if OriginalLabels was dropped
* app/vmagent: update CHANGELOG.md and added information to the docs
* app/vmagent: fix comments
* app/vlinsert: add support of loki push protocol
- implemented loki push protocol for both Protobuf and JSON formats
- added examples in documentation
- added example docker-compose
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vlinsert: move protobuf metric into its own file
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker/victorialogs/promtail: update reference to docker image
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker/victorialogs/promtail: make volume name unique
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vlinsert/loki: add license reference
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker/victorialogs/promtail: fix volume name
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs/VictoriaLogs/data-ingestion: add stream fields for loki JSON ingestion example
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vlinsert/loki: move entities to places where those are used
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vlinsert/loki: refactor to use common components
- use CommonParameters from insertutils
- stop ingestion after first error similar to elasticsearch and jsonline
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vlinsert/loki: address review feedback
- add missing logstorage.PutLogRows calls
- refactor tenant ID parsing to use common function
- reduce number of allocations for parsing by reusing logfields slices
- add tests and benchmarks for requests processing funcs
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
This eliminates the need in .(*T) casting for results obtained from Load()
Leave atomic.Value for map, since atomic.Pointer[map[...]...] makes double pointer to map,
because map is already a pointer type.
- Add `Active queries` chapter to VMUI docs
- Set `Content-Type: json` header inside promql.WriteActiveQueries() handler,
in order to be consistent with other request handlers called at app/vmselect/main.go
- Pass the request to promql.WriteActiveQueries() handler, so it can change its output
depending on the provided request params. This also improves consistency of
promql.WriteActiveQueries() args with other request hanlers at app/vmselect/main.go
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4653
* feat: add page to display a list of active queries (#4598)
* app/vmagent: code formatting
* fix: remove console
---------
Co-authored-by: dmitryk-dk <kozlovdmitriyy@gmail.com>
Using `job=~$job_storage` forces "Cache usage" panel to display only vmstorage caches, but there is a cache peresent at vmselect(`promql/rollupResult`).
Updated selector to match generic `$job` so that all caches will be displayed with an option to display per-job caches.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
The following syntax is supported: _time:filter offset off
For example:
- _time:5m offset 1h - 5-minute duration one hour before the current time
- _time:2023 offset 2w - 2023 year with the 2 weeks offset in the past
The `a op b keep_metric_names` is ambigouos to `a op (b keep_metric_names)` when `b` is a transform or rollup function.
For example, `a + rate(b) keep_metric_names`. So it is better to use more clear syntax: `(a op b) keep_metric_names`
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3710
- Sort MetricName tags only once before the benchmark loop.
- Obtain indexSearch per each benchmark loop in order to give a chance for background merge
for the recently created parts
Previously all the newly ingested time series were registered in global `MetricName -> TSID` index.
This index was used during data ingestion for locating the TSID (internal series id)
for the given canonical metric name (the canonical metric name consists of metric name plus all its labels sorted by label names).
The `MetricName -> TSID` index is stored on disk in order to make sure that the data
isn't lost on VictoriaMetrics restart or unclean shutdown.
The lookup in this index is relatively slow, since VictoriaMetrics needs to read the corresponding
data block from disk, unpack it, put the unpacked block into `indexdb/dataBlocks` cache,
and then search for the given `MetricName -> TSID` entry there. So VictoriaMetrics
uses in-memory cache for speeding up the lookup for active time series.
This cache is named `storage/tsid`. If this cache capacity is enough for all the currently ingested
active time series, then VictoriaMetrics works fast, since it doesn't need to read the data from disk.
VictoriaMetrics starts reading data from `MetricName -> TSID` on-disk index in the following cases:
- If `storage/tsid` cache capacity isn't enough for active time series.
Then just increase available memory for VictoriaMetrics or reduce the number of active time series
ingested into VictoriaMetrics.
- If new time series is ingested into VictoriaMetrics. In this case it cannot find
the needed entry in the `storage/tsid` cache, so it needs to consult on-disk `MetricName -> TSID` index,
since it doesn't know that the index has no the corresponding entry too.
This is a typical event under high churn rate, when old time series are constantly substituted
with new time series.
Reading the data from `MetricName -> TSID` index is slow, so inserts, which lead to reading this index,
are counted as slow inserts, and they can be monitored via `vm_slow_row_inserts_total` metric exposed by VictoriaMetrics.
Prior to this commit the `MetricName -> TSID` index was global, e.g. it contained entries sorted by `MetricName`
for all the time series ever ingested into VictoriaMetrics during the configured -retentionPeriod.
This index can become very large under high churn rate and long retention. VictoriaMetrics
caches data from this index in `indexdb/dataBlocks` in-memory cache for speeding up index lookups.
The `indexdb/dataBlocks` cache may occupy significant share of available memory for storing
recently accessed blocks at `MetricName -> TSID` index when searching for newly ingested time series.
This commit switches from global `MetricName -> TSID` index to per-day index. This allows significantly
reducing the amounts of data, which needs to be cached in `indexdb/dataBlocks`, since now VictoriaMetrics
consults only the index for the current day when new time series is ingested into it.
The downside of this change is increased indexdb size on disk for workloads without high churn rate,
e.g. with static time series, which do no change over time, since now VictoriaMetrics needs to store
identical `MetricName -> TSID` entries for static time series for every day.
This change removes an optimization for reducing CPU and disk IO spikes at indexdb rotation,
since it didn't work correctly - see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401 .
At the same time the change fixes the issue, which could result in lost access to time series,
which stop receving new samples during the first hour after indexdb rotation - see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2698
The issue with the increased CPU and disk IO usage during indexdb rotation will be addressed
in a separate commit according to https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401#issuecomment-1553488685
This is a follow-up for 1f28b46ae9
The number of parts in the snapshot partition may be zero if concurrent goroutine just
started creating new partition, but didn't put data into it yet when the current
goroutine made a snapshot.
This reverts commit 20b18e9feb.
Reason for revert: running goimports on `make check-all` introduces the following issues:
- It runs only on modified files, which weren't commited yet into git repository.
This means the formatting for the remaining files becomes different comparing to the formatting
for the changed files. This also means that the goimports has no any effect
at github actions and when the changed code is already commited to git repository.
- `gomiports` performs formatting in the same way as gofmt, so `make fmt` becomes unnecessary.
But when `gofmt` is substituted with `goimports`, then it performs unnecessary formatting for *.qtpl.go files.
It is possible to make a hack, which will prepare a list of all the *.go files at lib/ and app/
without the *.qtpl.go files, and then feed this list to `goimports`, but this looks too fragile
for the task of just fixing the ordering of Go imports.
So it is better to leave source code formatting as is with `gofmt`, while manually fixing improper ordering
of Go import from time to time in dedicated commits until better solution arises.
Add a break if gotAlert is nil
This removes the following golangci-lint warning:
app/vmalert/alerting_test.go:868:8: SA5011(related information): this check suggests that the pointer can be nil (staticcheck)
if gotAlert == nil {
^
* app/vmctl: fix panic `--remote-read-filter-time-start` flag not defined
* app/vmctl: update CHANGELOG.md
---------
Co-authored-by: Nikolay <nik@victoriametrics.com>
It could happen for low evaluation intervals and irregular
delays during execution that evaluation time would get
a negative offset. This could result into cumulative
discrepancy between the actual time and evaluation time for rules.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* make: add goimports task
Adds task to fix imports formatting implace.
Formats imports into:
- native library
- external libraries
- local packages based on github.com/VictoriaMetrics/VictoriaMetrics prefix
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* make: add goimports install task
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* make: run goimports only for changed files
Applying goimports to all existing files would create a lot of problems with cherry-picking changes between different branches used for development. To avoid this it was decided to only run goimports on changed files to fix formatting gradually.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* make: update goimports to run on all changed files
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
- Clarify docs about -replicationFactor command-line flag at vmselect
- Clarify description for -replicationFactor and -search.skipSlowReplicas command-line flags
- Fix the logic for returning responses if -search.skipSlowReplicas command-line flag
is enabled. The logic was broken in the 173ccf4333,
so it could return responses only if some of vmstorage nodes return error,
while it should return when query results are successfully collected from more than
(len(storageNodes) - replicationFactor) vmstorage nodes.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1207
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/711
* vmselect: introduce `search.skipSlowReplicas` cmd-line flag
vmselect has two logical conditions during request processing when
`-replicationFactor` cmd-line flag is set:
1. If at least `len(storageNodes) - replicationFactor` responded, it could skip
waiting for the rest of nodes to respond. This could lead to problems described
here https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1207.
2. Mark response as partial if less than `len(storageNodes) - replicationFactor` responded
without an error.
The P1 showed itself error-prone and became the main reason why
`-replicationFactor` wasn't recommended to use at vmselect level.
However, this optimization could be still very useful in situations
when there are slow and fast replicas in cluster.
But P2 remains viable and important conditionless.
Hiding P1 behind the feature-flag `search.skipSlowReplicas`
should make `-replicationFactor` flag usable again. And let users
choose whether they want P1 to be respected.
Related issues
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1207https://github.com/VictoriaMetrics/VictoriaMetrics/issues/711
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docs: update changelog
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* fix removing storage data dir before restoring from backup
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fix review comment
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fix review comment
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fixes after merge with `enterprise-single-node` branch
Signed-off-by: Alexander Marshalov <_@marshalov.org>
---------
Signed-off-by: Alexander Marshalov <_@marshalov.org>
- Clarify the scope of the fix at docs/CHANGELOG.md
- Handle the case when -search.maxSamplesPerSeries limit is exceeded
in the same way as the -search.maxSamplesPerQuery limit.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4472
libcrypto3 and libssl3 in Alpine 3.18.0 have versions `3.1.0-r4`
which contains CVE-2023-2650:
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2023-2650
Use ALpine image 3.18.2 which contains fixed versions of libssl3
and libcrypto3: 3.1.1-r0
NB: In Openshift these containers are marked as vulnerabilities
because of these CVEs.
Error message will be present for any auth error, but message claims an error is about OAuth2 configuration which is confusing.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
The validation was needed for covering corner cases when storage is tested with data from 1970.
This resulted into unexpected search results, as year was parsed incorrectly from the given timestamp.
Co-authored-by: hagen1778 <roman@victoriametrics.com>
It is required for periods to be multiplies, but it was not stated clearly in documentation.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
The change focuses on rectifying inconsistencies in the navigation behavior of the application
and eliminating issues encountered when manually altering the URL.
The key updates include:
- Refactoring of the routing mechanism to handle all possible routes and their states.
- Enhancement of the React Router usage to ensure a smoother navigation experience.
- Handling application state when the URL is manually changed.
Added a line for `probeSelector` and links to objects selectors section to make it easier to find more details about selectors.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
expose `vmauth_user_request_duration_seconds`
and `vmauth_unauthorized_user_request_duration_seconds` summary metrics
for measuring requests latency per user.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It is impossible to run OS vmauth with the provided config.
The example of using ip filters should be only a part of docs.
All other examples should work seamlessly with OS version.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docs: add scroll to the selected element
* docs: scroll to root if element not found
* docs: simplify code
* docs: code cleanup
* docs: fix comments (fix code formatting, check element only inside sidebar container)
By default, vmalert will make multiple retry attempts with exponential delay.
The total time spent during retry attempts shouldn't exceed `-remoteWrite.retryMaxTime` (default is 30s).
When retry time is exceeded vmalert drops the data dedicated for `-remoteWrite.url`.
Before, vmalert dropped data after 5 retry attempts with 1s delay between attempts (not configurable).
See `-remoteWrite.retryMinInterval` and `-remoteWrite.retryMaxTime` cmd-line flags.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
- remove second panel for disk usage. It is not very useful for users and brings more confusion than profit from having it.
- update CPU graph to show number of used CPUs to make it less ambiguous
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
This reverts commit c19048dc13.
Reason for revert: it has been appeared that the net/http.ResponseWriter is already buffered,
so there in no need in double bufferring
This simplifies routing at auth proxies such as vmauth to vlselect component,
which serves VMUI - just route all the requests, which start with /select/, to vlselect.
If the `Content-Type: application/json` request header isn't set,
then the server can improperly consume the request body when parsing request parameters
vmalert: retry all errors except 4XX status codes
Retry all errors except 4XX status codes while pushing via remote-write
to the remote storage. Previously, errors like broken connection could
prevent vmalert from retrying the request.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* fix: optimize the preparation of data for the graph
* fix: optimize tooltip rendering
* fix: optimize re-rendering of the chart
* vmui: memory leak fix
* lib/storage: creates parts.json on start-up if it not exists.
It fixes migrations from versions below v1.90.0.
Previously parts.json was created only after successful merge.
But if merge was interruped for some reason (OOM or shutdown), parts.json wasn't created and partitions left after interruped merge weren't properly deleted.
Since VM cannot check if it must be removed or not.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4336
* Apply suggestions from code review
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* Update lib/storage/partition.go
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
This error may be wrapped in another error, and should normally be tested using
`errors.Is(err, net.ErrClosed)`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
app/vmagent/remotewrite: fix vmagent panic on shutdown
Currently, when vmagent is stopping it first flushes pending series in remote write context and proceeds to stop streaming aggregation. This leads to streaming aggregation being unable to write results into pending timeseries (since it is already nil) and panic.
This can lead to losing some aggregation results being lost almost silently.
The fix is reordering flow to first stop streaming aggregation and flush all pending time series after that.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker: update VictoriaMetrics version from v1.91.1 to v1.91.2 in docker compose files
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* deployment/marketplace: update VictoriaMetrics version from v1.91.1 to v1.91.2 in marketplace files
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmctl: add verbose output for docker installations or when TTY isn't available
* app/vmctl: fix tests
* app/vmctl: make vmctl interactive if no tty
* app/vmctl: cleanup
* app/vmctl: add comment
---------
Co-authored-by: Nikolay <nik@victoriametrics.com>
* vmalert: fix nil map assignment
The storage instance with nil map params was created for remote-read purposes.
And before change 7a9ae9de0d this map was ignored in ApplyParams.
Now, it started to be used and vmalert panics in runtime.
The fix properly inits map for at `NewVMStorage` and verifies it is not nil
on assignment in `ApplyParams`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: add to changelog
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly clone Storage params
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly clone Storage params
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly clone Storage params
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
at arm based CPUs only 9 digits after comma matches for tests.
Especially at holtWinters functions. Since it only takes effect at tests
it makes no sense for changing float prescision at actual functions
Previously the location inside the sendPrometheusError() was logged.
This could make hard investigating error locations via `vm_log_messages_total` metric.
This reverts the following commits:
- e0e16a2d36
- 2ce02a7fe6
The reason for revert: the updated logic breaks assumptions made
when fixing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2698 .
For example, if a time series stop receiving new samples during the first
day after the indexdb rotation, there are chances that the time series
won't be registered in the new indexdb. This is OK until the next indexdb
rotation, since the time series is registered in the previous indexdb,
so it can be found during queries. But the time series will become invisible
for search after the next indexdb rotation, while its data is still there.
There is also incompletely solved issue with the increased CPU and disk IO resource
usage just after the indexdb rotation. There was an attempt to fix it, but it didn't fix
it in full, while introducing the issue mentioned above. See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401
TODO: to find out the solution, which simultaneously solves the following issues:
- increased memory usage for setups high churn rate and long retention (e.g. what the reverted commit does)
- increased CPU and disk IO usage during indexdb rotation ( https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401 )
- https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2698
Possible solution - to create the new indexdb in one hour before the indexdb rotation
and to gradually pre-populate it with the needed index data during the last hour before indexdb rotation.
Then the new indexdb will contain all the needed data just after the rotation,
so it won't trigger increased CPU and disk IO.
- Document the change at docs/CHANGELOG.md
- Clarify comments for non-trivial code touched by the commit
- Improve the logic behind maybeCreateIndexes():
- Correctly create per-day indexes if the indexdb rotation is performed during
the first hour or the last hour of the day by UTC.
Previously there was a possibility of missing index entries on that day.
- Increase the duration for creating new indexes in the current indexdb for up to 22 hours
after indexdb rotation. This should reduce the increased resource usage
after indexdb rotation.
It is safe to postpone index creation for the current day until the last hour
of the current day after indexdb rotation by UTC, since the corresponding (date, ...)
entries exist in the previous indexdb.
- Search for TSID by (date, MetricName) in both the current and the previous indexdb.
Previously the search was performed only in the current indexdb. This could lead
to excess creation of per-day indexes for the current day just after indexdb rotation.
- Search for (date, metricID) entries in both the current and the previous indexdb.
Previously the search was performed only in the current indexdb. This could lead
to excess creation of per-day indexes for the current day just after indexdb rotation.
The new index substitutes global MetricName=>TSID index
used for locating TSIDs on ingestion path.
For installations with high ingestion and churn rate, global
MetricName=>TSID index can grow enormously making
index lookups too expensive. This also results into bigger
than expected cache growth for indexdb blocks.
New per-day index supposed to be much smaller and more efficient.
This should improve ingestion speed and reliability during
re-routings in cluster.
The negative outcome could be occupied disk size, since
per-day index is more expensive comparing to global index.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* added backup locking/unlocking against retention policy to vmbackupmanager
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* added docs for new commands
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fix review comments
Signed-off-by: Alexander Marshalov <_@marshalov.org>
---------
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* lib/storage: follow-up after a50d63c376
- ensure retentionMsecs is rounded to day
- remove localTimeOffset in test as localOffset is ignored when using `UnixMilli`
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage: restore retention timezone offset effect on retention deadline
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* feat: improvement of the top queries page
* vmui/docs: enhancements to top queries page
* Apply suggestions from code review
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
vmui: change default font size to 14px for better readability
vmui: fix bug with missing text on buttons in safari
---------
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* app/vmui: added Labels with the highest number of unique values
* app/vmui: cleanup
* app/vmui: cleanup
* app/vmui: add table description
* app/vmui: fix comment, updated CHANGELOG.md
* app/vmui: disable links
* app/vmui: added actions to the table, it will show values for selected label with the highest number of series
* app/vmui: fix comment
Previously, metric `vmalert_alerting_rules_last_evaluation_series_fetched`
would be set to 0 for const expressions, because const expression do not match
any series. This may result into a confusion: no series were matched but response isn't empty.
The change updates the logic behind metric: if no series were matched but there are samples
in response - use amount of samples as number of series.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: expand rule groups on anchor click
before, anchor click was only updating the URL.
To expand the group, user had to click on rule's block.
Now, group will toggle automatically.
* vmalert: allow filtering group in web UI
The new filter allows to filter groups and rules within
groups by: errors only or noMatch only.
The filtering supposed to help navigating big numbers of groups/rules.
Filtering is reflected in URL, so can be shared as a link.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Without reset, labels duplicates could have been added during stream aggregation.
Since `ctx.Labels` is reused during processing of many series, each series will
add its labels to the context. Even if the same labels were already addeded on prev
iteration. Now, we reset `ctx.Labels` on each iteration to contain so labels from
different series didn't interfere.
This could have cause exceeding of the limit on number of labels per pushed time series.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4277
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This reverts commit 9e99f2f5b3.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4068
Reason for revert: this breaks valid use cases:
- If timestamps aren't specified in the incoming samples on purpose. For example, if stream aggregation is used
as StatsD replacement. StatsD protocol has no timestamp concept for incoming samples.
See https://github.com/b/statsd_spec
- If all the samples must be aggregated, even if they contain stale timestamps.
for example, if the stream aggregation produces some counter of some events,
it may be better to count all the events even if they were delayed before
being ingested into VictoriaMetrics.
Is is also unclear how to determine whether the sample becomes stale.
For example, if the aggregation interval equals to 1h, and the previous
aggregation cycle just finished 10 minutes ago, what to do with the newly
incoming sample with the timestamp 30 minutes older than the current time?
The answer highly depends on the context, so it is unsafe to uncoditionally
use a single logic for dropping the old samples here.
app/vmalert: detect alerting rules which don't match any series at all
vmalert starts to understand /query responses which contain object:
```
"stats":{"seriesFetched": "42"}
```
If object is present, vmalert parses it and populates a new field
`SeriesFetched`. This field is then used to populate the new metric
`vmalert_alerting_rules_last_evaluation_series_fetched` and to
display warnings in the vmalert's UI.
If response doesn't contain the new object (Prometheus or
VictoriaMetrics earlier than v1.90), then `SeriesFetched=nil`.
In this case, UI will contain no additional warnings.
And `vmalert_alerting_rules_last_evaluation_series_fetched` will
be set to `-1`. Negative value of the metric will help to compile
correct alerting rule in follow-up.
Thanks for the initial implementation to @Haleygo
See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4056
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4039
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It makes it easier for users who build and self-host images to publish their images without changing tags manually.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
It appears that 90% usage for anonymous mem usage
is already concerning. So we lowering the threshold to 80%.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
previously during sync for mutable and immutable cache parts, link for hotEntry with current date may be not properly updated
it corrupts cache for backfilling metrics and increased cpu load
Windows doesn't allow to remove dir with opened files. Usually it's a case for snapshots, hard cannot be removed if file is openned.
With this change, dir will be renamed and properly deleted at the next process start.
It's recommended to restart vmstorage/vmsingle for snapshots deletion completion periodically.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70
* vmselect: exit early from queue on context cancel
When `-search.maxConcurrentRequests` is reached, vmselect puts
request in the queue. It is expected, that requests in the queue
will be processed as soon as it would be enough capacity to do so.
However, it could happen that while request was waiting its turn,
the client could have already cancel it (close the connection,
or just close the tab with UI). In this case, we should de-queue
such requests to avoid spending extra resources on them.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: address review comments
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: add common labels to all ports discovered from endpoints
Sets
`__meta_kubernetes_endpoints_name` and `__meta_kubernetes_namespace` labels to all ports of pod.
Prometheus sets those labels to all ports in pod (0ab9553611/discovery/kubernetes/endpoints.go (L267C15-L269)) even if port is not matching any service.
See: #4154
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: fix test for updated discovery logic
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/streamaggr: discard samples with timestamps not matching aggregation interval
Samples with timestamps lower than `now - aggregation_interval` are likely to be written via backfilling and should not be used for calculation of aggregation.
See #4068
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/streamaggr: make log message more descriptive, fix imports
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Templating of `-external.alert.source` is not expected to have access to the query which was causing runtime error when query function was passed as nil.
See: #4181
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmagent,lib/persistentqueue: show warning message if `--remoteWrite.maxDiskUsagePerURL` flag lower than 500MB
* app/vmagent,lib/persistentqueue: linter fix
* app/vmagent,lib/persistentqueue: fix comment
* feat: display heatmap in the explore metrics (#4111)
* fix: correct calc step for heatmap
* fix: remove spaces in the result of getDurationFromMilliseconds
* feat: add button "show today" to date picker
* feat: add comparison with the prev day (#3967)
* vmui/docs: add comparison of data to cardinality page
* feat: add WithTemplate page
* app/vmselect/prometheus: enable json mode for expand with expr API
* app/vmselect/prometheus: enable CORS and add content type
* feat: add api for expand with templates
* fix: remove console from useExpandWithExprs
* app/vmselect/prometheus: fix escaping
* vmui: integrate WITH template
* app/vmctl: check content type instead of form param
* fix: add content-type for fetch with-exprs
* fix: add a header to the server's response that allows the "Content-Type" header
* app/vmctl: added comment and cleanup
* app/vmctl: use format query param
---------
Co-authored-by: dmitryk-dk <kozlovdmitriyy@gmail.com>
* app/vmctl: add support for the different time format in the native binary protocol
* app/vmctl: update flag description, update CHANGELOG.md
* app/vmctl: add comment to exported function
* lib/httpserver: introduce `-http.maxConcurrentRequests` command-line flag
Introduce `-http.maxConcurrentRequests` command-line flag to protect
VM components from resource exhaustion during unexpected spikes of HTTP requests.
By default, the new flag's value is set to 0 which means no limits are applied.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/httpserver: mention http.maxConcurrentRequests in docs
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- added info about metric `vm_vminsert_metrics_read_total`,
- small doc refactoring
- and added make-command for running docs in docker.
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* vmalert: retry datasource requests with EOF or unexpected EOF errors
Retry failed read request on the closed connection one more time.
This may improve rules execution reliability when connection
between vmalert and datasource closes unexpectedly.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: fix old tests
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This handler will instruct search engines that indexing is not allowed for the content exposed to the internet. This should help to address issues like #4128 when instances are exposed to the internet without authentication.
"How do We Keep Metrics for a Long Time in VictoriaMetrics" article is referenced twice in "Third-party articles and slides about VictoriaMetrics" section
Callers of OpenStorage() log the returned error and exit.
The error logging and exit can be performed inside MustOpenStorage()
alongside with printing the stack trace for better debuggability.
This simplifies the code at caller side.
Use fs.MustReadDir() instead of os.ReadDir() across the code in order to reduce the code verbosity.
The fs.MustReadDir() logs the error with the directory name and the call stack on error
before exit. This information should be enough for debugging the cause of the error.
Callers of CreateFlockFile log the returned err and exit.
It is better to log the error inside the MustCreateFlockFile together with the path
to the specified directory and the call stack. This simplifies
the code at the callers' side while leaving the debuggability at the same level.
Callers of InitFromFilePart log the error and exit.
It is better to log the error with the path to the part and the call stack
directly inside the MustInitFromFilePart() function.
This simplifies the code at callers' side while leaving the same level of debuggability.
Callers of this function log the returned error and exit.
It is better logging the error together with the path to the filename
and call stack directly inside the function. This simplifies
the code at callers' side without reducing the level of debuggability
Callers of this function log the returned error and exit.
Let's log the error with the path to the filename and call stack
inside the function. This simplifies the code at callers' side
without reducing the level of debuggability.
- Add 'BUG:' prefix to error messages related to programming errors aka bugs.
- Consistently log the path to the file in all the messages in order to improve debuggability.
Callers of ReadFullData() log the error and then exit.
So let's log the error with the path to the filename and the call stack
inside MustReadData(). This simplifies the code at callers' side,
while leaving the debuggability at the same level.
Callers of these functions log the returned error and then exit.
Let's log the error with the call stack inside the function itself.
This simplifies the code at callers' side, while leaving the same
level of debuggability in case of errors.
Callers of this function log the returned error and then exit.
Let's log the error with the call stack inside the function itself.
This simplifies the code at callers' side, while leaving the same
level of debuggability in case of errors.
Callers of this function log the returned error and then exit.
Let's log the error with the call stack inside the function itself.
This simplifies the code at callers' side, while leaving the same
level of debuggability in case of errors.
Callers of this function log the returned error and exit.
So let's just log the error with the given filepath and the call stack
inside the function itself and then exit. This simplifies the code
at callers' place while leaves the same level of debuggability in case of errors.
Callers of these functions log the returned error and then exit. The returned error already contains the path
to directory, which was failed to be created. So let's just log the error together with the call stack
inside these functions. This leaves the debuggability of the returned error at the same level
while allows simplifying the code at callers' side.
While at it, properly use MustMkdirFailIfExist instead of MustMkdirIfNotExist inside inmemoryPart.MustStoreToDisk().
It is expected that the inmemoryPart.MustStoreToDick() must fail if there is already a directory under the given path.
When WriteFileAndSync fails, then the caller eventually logs the error message
and exits. The error message returned by WriteFileAndSync already contains the path
to the file, which couldn't be created. This information alongside the call stack
is enough for debugging the issue. So just use log.Panicf("FATAL: ...") inside MustWriteAndSync().
This simplifies error handling at caller side a bit.
This is a follow-up after 42bba64aa7
Previously the part directory listing was fsync'ed implicitly inside partHeader.WriteMetadata()
by calling fs.WriteFileAtomically(). Now it must be fsync'ed explicitly.
There is no need in fsync'ing the parent directory, since it is fsync'ed by the caller
when updating parts.json file.
Previously the created part directory listing was fsynced implicitly
when storing metadata.json file in it.
Also remove superflouous fsync for part directory listing,
which was called at blockStreamWriter.MustClose().
After that the metadata.json file is created, so an additional fsync
for the directory contents is needed.
Improperly configured -bigMergeConcurrency command-line flag usually leads to uncontrolled
growth of unmerged parts, which, in turn, increases CPU usage and query durations.
So it is better deprecating this flag. In rare cases -smallMergeConcurrency command-line flag
can be used instead for controlling the concurrency of background merges.
This makes it easier to understand exact point in time which is included in this backup.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* fix: correct display of errors for query
* fix: change the logic of histogram detection
* feat: hide empty buckets from the graph
* fix: revert server url
The logic employed for re-using the previously loaded scrape target was broken initially.
The commit cc0427897c tried to fix it, but the new logic
became too complex and fragile. So it is better to just remove this logic,
since the targets from temporarily broken file should be eventually loaded on next
attempts every -promscrape.fileSDCheckInterval
This also allows removing fragile hacks around __vm_filepath label.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3989
* lib/promscrape: fix the problem with scrape work duplicates when file_sd_config can't be read
* lib/promscrape: clarified comment
* lib/promscrape: made better approach to handle a problem with growing []*ScrapeWork on each error when loading config
* lib/promscrape: added CHANGELOG.md
* Update docs/CHANGELOG.md
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* feat: add tips for working with the graph and legend
* feat: add the ability to collapse the legend
* vmui/docs: add the ability to collapse the legend
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/storage: check for free disk space before opening tables
We check for free disk space before call to `openTable`,
so `Storage` can be set to ReadOnly before mergeWorkers start.
Before the change, there was a chance that merges will start
even if Storage has to start in ReadOnly mode because of
`-storage.minFreeDiskSpaceBytes` limit.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4023
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/storage: chore
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update lib/storage/storage.go
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Make sure that the last successfully loaded config is used on hot-reload failure
- Properly cleanup resources occupied by already initialized aggregators
when the current aggregator fails to be initialized
- Expose distinct vmagent_streamaggr_config_reload* metrics per each -remoteWrite.streamAggr.config
This should simplify monitoring and debugging failed reloads
- Remove race condition at app/vminsert/common.MustStopStreamAggr when calling sa.MustStop() while sa
could be in use at realoadSaConfig()
- Remove lib/streamaggr.aggregator.hasState global variable, since it may negatively impact scalability
on system with big number of CPU cores at hasState.Store(true) call inside aggregator.Push().
- Remove fine-grained aggregator reload - reload all the aggregators on config change instead.
This simplifies the code a bit. The fine-grained aggregator reload may be returned back
if there will be demand from real users for it.
- Check -relabelConfig and -streamAggr.config files when single-node VictoriaMetrics runs with -dryRun flag
- Return back accidentally removed changelog for v1.87.4 at docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3639
Verifying status code helps to avoid misleading errors caused by attempt to parse unsuccessful response.
Related issue: #4034
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
- Compare directory names instead of paths to directory when determining which persistent queues must be deleted
This is less error-prone solution, since paths to the same directory can differ, which could lead
to accidental directory removal for the existing -remoteWrite.url
- Log the `removed %d dangling queues` message when at least a single queue has been removed
- Consistently use filepath.Join() for creating paths to persistent queues.
This is needed for Windows support (see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70 )
- Clarify the description of the change at docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4014
There is a bug here where if you have a single bucket like:
foo{vmrange="4.084e+02...4.642e+02"} 2 123
The expected output is three le encoded buckets like:
foo{le="4.084e+02"} 0 123
foo{le="4.642e+02"} 2 123
foo{le="+Inf"} 2 123
This correctly encodes the start and end of the vmrange.
If however, the input contains the previous bucket, and that bucket is
empty then you only get the end le and +Inf out currently, i.e:
foo{vmrange="7.743e+05...8.799e+05"} 5 123
foo{vmrange="6.813e+05...7.743e+05"} 0 123
results in:
foo{le="8.799e+05"} 5 123
foo{le="+Inf"} 5 123
This causes issues when you go to compute a quantile because this means
that the assumed lower bound of the buckets is 0 and this we interpolate
between 0->end rather than the vmrange start->end as expected.
- Expose stats.seriesFetched at `/api/v1/query_range` responses too
for the sake of consistency.
- Initialize QueryStats when it is needed and pass it to EvalConfig then.
This guarantees that the QueryStats is properly collected when the query
contains some subqueries.
The change adds a new field `seriesFetched` to EvalConfig object.
Since EvalConfig object can be copied inside `Exec`,
`seriesFetched` is a pointer which can be updated by all copied
objects.
The reason for having stats is that other components, like vmalert,
could benefit from this information.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
using `runtime.Gosched` requires acquiring global lock to check if there are any other goroutines to perform tasks. with the latest versions of runtime it can pause running goroutines automatically without requiring to call `Gosched` directly.
Updates #3966
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
- Use windows.FlushFileBuffers() instead of windows.Fsync() at streamTracker.adviseDontNeed()
for consistency with implementations for other architectures.
- Use filepath.Base() instead of filepath.Split(), since the dir part isn't used.
This simplifies the code a bit.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70
This is a follow-up for 43b24164ef
* lib/fs: adds memory map for windows
it should improve performance for file reading
* lib/storage: replace '/' with os specific separator
it must fix an errors for windows
* lib/fs: mention windows fsync support
* lib/filestream: adds fdatasync for windows writes
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70
- Allocate and initialize seriesByWorkerID slice in a single go instead
of initializing every item in the list separately.
This should reduce CPU usage a bit.
- Properly set anti-false sharing padding at timeseriesWithPadding structure
- Document the change at docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3966
* vmselect/promql: refactor `evalRollupNoIncrementalAggregate` to use lock-less approach for parallel workers computation
Locking there is causing issues when running on highly multi-core system as it introduces lock contention during results merge.
New implementation uses lock less approach to store results per workerID and merges final result in the end, this is expected to significantly reduce lock contention and CPU usage for systems with high number of cores.
Related: #3966
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* vmselect/promql: add pooling for `timeseriesWithPadding` to reduce allocations
Related: #3966
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* vmselect/promql: refactor `evalRollupFuncWithSubquery` to avoid using locks
Uses same approach as `evalRollupNoIncrementalAggregate` to remove locking between workers and reduce lock contention.
Related: #3966
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* allowed using dashes and dots in environment variables names for templating config files with envtemplate (#3999)
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* Apply suggestions from code review
---------
Signed-off-by: Alexander Marshalov <_@marshalov.org>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Setting up VMAnomaly on NodeExporter metrics with VictoriaMetrics and AlertManager.
* vmanomaly-guide-draft
* aletr graphs and description
* readme vmanomaly tutorial
* Added back fit_every param for performance
* vmanomaly guide fixes
* added spaces div
* spaces + resize image
* alert example grammar
* quotation marks
* docker link
* typo fixed
* more links
* reader section rephrased
* label change
* lower case for grafana service
* lower case for vm service
* yaml markdown
---------
Co-authored-by: Dima Lazerka <dima@victoriametrics.com>
* lib/netutil: log only parsing errors for proxy-protocol
Previosly every error was logged. With configured TCP health checks at load-balancer or kubernetes, vmauth spams a lot of false positive error message into logs
* Update docs/CHANGELOG.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* Update lib/netutil/tcplistener.go
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
This opens the possibility to remove tssLock from evalRollupFuncWithSubquery()
in the follow-up commit from @zekker6 in order to speed up the code
for systems with many CPU cores.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3966
Call runtime.Gosched() only when there is a work to steal from other workers.
Simplify the timeseriesWorker() and unpackWroker() code a bit by inlining stealTimeseriesWork() and stealUnpackWork().
This should reduce CPU usage when processing queries on systems with big number of CPU cores.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3966
* vmalert: support logs suppressing during config reloads
The change is mostly required for ENT version of vmalert,
since it supports object-storage for config files.
Reading data from object storage could be time-consuming,
so vmalert emits logs to track the progress.
However, these logs are mostly needed on start or on
manual config reload. Printing these logs each time
`rule.configCheckInterval` is triggered would too verbose.
So the change allows to control logs emitting during
config reloads.
Now, logs are emitted during start up or when SIGHUP is receieved.
For periodicall config checks logs emitted by config pkg are suppressed.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: review fixes
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This commit changes background merge algorithm, so it becomes compatible with Windows file semantics.
The previous algorithm for background merge:
1. Merge source parts into a destination part inside tmp directory.
2. Create a file in txn directory with instructions on how to atomically
swap source parts with the destination part.
3. Perform instructions from the file.
4. Delete the file with instructions.
This algorithm guarantees that either source parts or destination part
is visible in the partition after unclean shutdown at any step above,
since the remaining files with instructions is replayed on the next restart,
after that the remaining contents of the tmp directory is deleted.
Unfortunately this algorithm doesn't work under Windows because
it disallows removing and moving files, which are in use.
So the new algorithm for background merge has been implemented:
1. Merge source parts into a destination part inside the partition directory itself.
E.g. now the partition directory may contain both complete and incomplete parts.
2. Atomically update the parts.json file with the new list of parts after the merge,
e.g. remove the source parts from the list and add the destination part to the list
before storing it to parts.json file.
3. Remove the source parts from disk when they are no longer used.
This algorithm guarantees that either source parts or destination part
is visible in the partition after unclean shutdown at any step above,
since incomplete partitions from step 1 or old source parts from step 3 are removed
on the next startup by inspecting parts.json file.
This algorithm should work under Windows, since it doesn't remove or move files in use.
This algorithm has also the following benefits:
- It should work better for NFS.
- It fits object storage semantics.
The new algorithm changes data storage format, so it is impossible to downgrade
to the previous versions of VictoriaMetrics after upgrading to this algorithm.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3236
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3821
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70
* app/vmagent: allow vm proto for kafka consumer and producer
it should reduce network usage up to 50%.
According to benchmarks without any encoding at kafka topic, it reduces traffic up to 50%.
With enabled zstd at kafka topic, it shows no diffence in traffic. So it
doesn't make much sense to use it.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1225
* mention eb61a7dd68b834b08d01727a918f207700348ada at changelog
* app/vmagent: bumps kafka lib version
it allows compiling vmagent for arm64 machines
fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2271
* mention d19b1a888248c96cfd7ccee00ba6f596d89be1d7 at change log
* app/vmagent: adds natural concurrency for kafka consumer
it should improve performance for data consumption
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1957
* mention change 0c143bb22ca2e7e0b7eec9bc84a94ee2b41626ca
* Update app/vmagent/kafka/consumer.go
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* Update app/vmagent/kafka/consumer_cgo.go
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* vmalert: support concurrent reading from object storage
Config reading from GCS or S3 can be slow if object storage
contains a big number of files. Object storages are usually
fast for downloading and are slow for individual operations.
If there would be thousands of files to read, vmalert could
spend significant time for retrieving those because it is
done sequentially.
The change introduces ability to read configs from object
storage concurrently. By default, both GCS and S3 are now
read with 50 concurrent readers. This significantly reduces
the load time:
* loading 500 files with concurrency=1 takes 27s
* loading 500 files with concurrency=50 takes <1s
* vmalert: add note to Changelog
* vmalert: cleanup
* vmalert: use ticker properly
* app/vmalert: improve status reporting during config loading
* vmalert: support concurrent reading from object storage
Config reading from GCS or S3 can be slow if object storage
contains a big number of files. Object storages are usually
fast for downloading and are slow for individual operations.
If there would be thousands of files to read, vmalert could
spend significant time for retrieving those because it is
done sequentially.
The change introduces ability to read configs from object
storage concurrently. By default, both GCS and S3 are now
read with 50 concurrent readers. This significantly reduces
the load time:
* loading 500 files with concurrency=1 takes 27s
* loading 500 files with concurrency=50 takes <1s
* app/vmalert: make linter happy
* dashboards/cluser: use `quantile` since `median` isn't supported by PromQL
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/*: add `restarts` annotation to show when there were restarts
The cluster's annotation query is aggregated `by job`,
while vmagent/vmalert are aggregated `by job, instance`.
This is because cluster dashboard can contains too many instances
and annotation could become too noisy.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/*: support instance filter in Version annotation
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The change also introduces `List` method to `FS` interface.
The `List` method can be used for wildcard support in object storage FS.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
- Sync the description for -httpListenAddr.useProxyProtocol command-line flag at vmagent and vmauth,
so it is consistent with the description at vmauth and victoria-metrics
- Add a sample of panic text to docs/CHANGELOG.md, so it could be googled
- Mention the -httpListenAddr.useProxyProtocol command-line flag in the description for the bugfix
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3335
Each group in vmalert starts with an artifical delay to avoid
thundering herd problem. For some groups with high evaluation
intervals, the delay could be significant.
If during this delay user will remove the group from the config
and hot-reload it - vmalert will have to wait until the delay
ends. This results into slow config reloading and UI hang.
The change moves the start-delay logic back to the group's
`start` method. Now, group can immediately exit from the
delay when `group.close()` method is called.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
lib{mergset,storage}: prevent possible race condition with logging stats for merges
Previously partwrapper could be release by background process and reference for part may be invalid
during logging stats. It will lead to panic at vmstorage
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3897
app/vmctl: vm-native - split migration on per-metric basis
`vm-native` mode now splits the migration process on per-metric basis.
This allows to migrate metrics one-by-one according to the specified filter.
This change allows to retry export/import requests for a specific metric and provides a better
understanding of the migration progress.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
When group's update() or close() method is called, the group
still need to wait for its current evaluation to finish.
Sometimes, evaluation could take a significant amount of time
which slows configuration update or vmalert's graceful shutdown.
The change interrupts current evaluation in order to speed up
the graceful shutdown or config update procedures.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Use flag.Duration instead of flagutil.Duration for -snapshotCreateTimeout,
since the flagutil.Duration is intended mostly for big durations, e.g. days, months and years,
while the -snapshotCreateTimeout is usually smaller than one hour.
- Add links to https://docs.victoriametrics.com/#how-to-work-with-snapshots in docs/CHANGELOG.md,
so readers could easily find the corresponding docs when reading the changelog.
- Properly remove all the created directories on unsuccessful attempt to create
snapshot in Storage.CreateSnapshot().
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3551
* lib/{fs,mergeset,storage}: skip `.must-remove.` dirs when creating snapshot (#3858)
* lib/{mergeset,storage}: add timeout configuration for snapshots creation, remove incomplete snapshots from storage
* docs: fix formatting
* app/vmstorage: add metrics to track status of snapshots
* app/vmstorage: use `vm_http_requests_total` metric for snapshot endpoints metrics, rename new flag to make name more clear
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmstorage: update flag name in docs
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmstorage: reflect new metrics names change in docs
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/promscrape: set `vm_promscrape_config_last_reload_successful` to 1 if there was no promscrape config provided
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape: register `vm_promscrape_config_*` metrics only in case promscrape config is used
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Added `victoria-metrics-linux-s390x` to allow single node builds for `s390x` platform.
Leaving other packaging options at the moment, as on this platform, they're mostly going to be built from/with container images hosted within the company as a base, and not alpine.
* vmselect/promql: check for deadline in `count_values` fn
`count_values` could be very slow during the data processing.
Checking for deadline between iterations supposed to reduce
probability of exceeding `search.maxQueryDuration`.
The change also adds a new trace record, which captures the time
spent in aggregation function. Before that, the trace for aggr funcs
could be confusing since it doesn't account for all the places where
time was spent.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* metricsql: support optional 2nd argument for rollup functions
Support optional 2nd argument `min`, `max` or `avg` for rollup functions:
* rollup
* rollup_delta
* rollup_deriv
* rollup_increase
* rollup_rate
* rollup_scrape_interval
If second argument is passed, then rollup function will return only the selected aggregation type.
This change can be useful for situations where only one type of rollup calculation is needed.
For example, `rollup_rate(requests_total[5m], "max")`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Return immediately on context cancel during the backoff sleep.
This should help with https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3747
- Add a comment describing why the second attempt to obtain the response from remote side
is perfromed immediately after the first attempt.
- Remove fasthttp dependency from lib/promscrape/discoveryutils
- Set context deadline before calling doRequestWithPossibleRetry().
This simplifies the doRequestWithPossibleRetry() a bit.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3293
* fix: do not use exponential backoff for first retry of scrape request (#3293)
* lib/promscrape: refactor `doRequestWithPossibleRetry` backoff to simplify logic
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Update lib/promscrape/client.go
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* lib/promscrape: refactor `doRequestWithPossibleRetry` to make it more straightforward
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* feat: improve mobile ui
* feat: improve mobile ui
* fix: change style server url
* fix: improve ExploreMetrics mobile
* fix: display global settings on all pages
* update helper scripts to latest versions
* added missed command for initialisation variables from Makefile
* deployment/marketplace/vultr/helper-scripts/vultr-helper.sh: update helper script to latest version
* fixed typo for using VM_VERSION variable
* added an example of specifying the VM_VERSION and tokens for API's
* set packer logging to STDOUT by default
* Modify API version when running in Container App
* Handle expires on from token response
Response from IMDS does not always contain expires in value which is
currently used to get the token expiry time. An example resources that
doesn't provide it are Container Apps and App Service.
Signed-off-by: Mattias Ängehov <mattias.angehov@castoredc.com>
* Fix client id parameter for user assigned identity
* Apply suggestions from code review
---------
Signed-off-by: Mattias Ängehov <mattias.angehov@castoredc.com>
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
- Do not generate __meta_server label, since it is unavailable in Prometheus.
- Add a link to https://docs.victoriametrics.com/sd_configs.html#kuma_sd_configs to docs/CHANGELOG.md,
so users could click it and read the docs without the need to search the corresponding docs.
- Remove kumaTarget struct, since it is easier generating labels for discovered targets
directly from the response returned by Kuma. This simplifies the code.
- Store the generated labels for discovered targets inside atomic.Value. This allows reading them
from concurrent goroutines without the need to use mutex.
- Use synchronouse requests to Kuma instead of long polling, since there is a little sense
in the long polling when the Kuma server may return 304 Not Modified response every -promscrape.kumaSDCheckInterval.
- Remove -promscrape.kuma.waitTime command-line flag, since it is no longer needed when long polling isn't used.
- Set default value for -promscrape.kumaSDCheckInterval to 30s in order to be consistent with Prometheus.
- Remove unnecessary indirections for string literals, which are used only once, in order to improve code readability.
- Remove unused fields from discoveryRequest and discoveryResponse.
- Update tests.
- Document why fetch_timeout and refresh_interval options are missing in kuma_sd_config.
- Add docs to discoveryutils.RequestCallback and discoveryutils.ResponseCallback,
since these are public types.
Side notes: it is weird that Prometheus implementation for kuma_sd_configs sets `instance` label,
since usually this label is set by the Prometheus itself to __address__ after the relabeling phase.
See https://www.robustperception.io/life-of-a-label/
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3389
See https://github.com/prometheus/prometheus/issues/7919
and https://github.com/prometheus/prometheus/pull/8844
as a reference implementation in Prometheus
`lib/protoparser/prometheus` is used by various applications,
such as `app/vmalert`. The recent change to the
`lib/protoparser/prometheus` package introduced a new dependency
of `lib/writeconcurrencylimiter` which exposes some metrics.
Because of the dependency, now all applications which have this
dependency also expose these metrics.
Creating a new `lib/protoparser/prometheus/stream` package helps
to remove these metrics from apps which use `lib/protoparser/prometheus`
as dependency.
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3761
Signed-off-by: hagen1778 <roman@victoriametrics.com>
`avg` can be affected by just one outlier, which may lead
to false conclusions. `median` is supposed to reflect
reality better by leveling outliers out.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
While at it, stop sending requests to unavailable backend for 3 seconds
before the next attempt. This should reduce the amounts of useless work
and the number of useless network packets when the backend is temporarily unavailable.
* app/vmauth: add concurent requests limit per auth record
* app/vmauth: added clarification comment
* app/vmauth: remove unused code
* app/vmauth: move read from limiter
* app/vmauth: fix text
* app/vmauth: fix comments
* - Clarify the docs for the max_concurrent_requests option at docs/vmauth.md
- Clarify the description of the change at docs/CHANGELOG.md
- Make sure that the -maxConcurrentRequests takes precedence over per-user max_concurrent_requests
- Update tests for verifying that the max_concurrent_requests option is parsed properly
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3346
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Document the change at docs/CHANGELOG.md
- Add `Reading rules from object storage` section to docs/vmalert.md
- Add `s3` prefix to command-line flags related to the configuration of s3 and gcs clients
- Explicitly mention that reading rules from object storage is supported only in enterprise version
* vmalert: support object storage for rules
Support loading of alerting and recording rules from object
storages `gcs://`, `gs://`, `s3://`.
* review fixes
* vmalert: use group's ID in UI to avoid collisions
Identical group names are allowed. So we should used IDs
for various groupings and aggregations in UI.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: prevent disabling state updates tracking
The minimum number of update states to track is now set to 1.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly update `debug` and `update_entries_limit` params on hot-reload
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: display `debug` field for rule in UI
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: exclude `updates` field from json marhsaling
This field isn't correctly marshaled right now.
And implementing the correct marshaling for it doesn't
seem right, since json representation is mostly used
by systems like Grafana. And Grafana doesn't expect this
field to be present.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* fix test for disabled state
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* fix test for disabled state
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
docs: clarifications between standalone/cluster ingestion endpoints
This is an attempt to make it a bit clearer to the user that the cluster version ingestion URLs are different from the standalone ones. I have also changed the order of the list items to make it a bit clearer and hopefully stop the user simply inferring that `/prometheus/api/v1` is only related to Prometheus data.
* vmalert: speed up state restore procedure on start
Alerts state restore procedure has been changed to become asynchronous.
It doesn't block groups start anymore which significantly improves vmalert's startup time.
Instead, state restore is called by each group in their goroutines after the first rules
evaluation.
While previously state restore attempt was made for all loaded alerting rules,
now it is called only for alerts which became active after the first evaluation.
This reduces the amount of API calls to the configured remote read URL.
This also means that `remoteRead.ignoreRestoreErrors` command-line flag becomes deprecated now
and will have no effect if configured.
See relevant issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2608
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* make lint happy
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Apply suggestions from code review
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
previously historical data backfilling may trigger force merge for previous month every hour
it consumes cpu, disk io and decrease cluster performance.
Following commit fixes it by applying deduplication for InMemoryParts
The limit has been increased from 300 bytes to 500 bytes according to the collected production stats.
This allows reducing CPU usage without significant increase of RAM usage in most practical cases.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3692
This allows better controlling requests to backends and providing better error logging.
For example, if the backend was unavailable, then the ReverseProxy was logging the error
message without client ip and the initial request uri. This could harden debugging.
This is based on https://github.com/VictoriaMetrics/VictoriaMetrics/pull/3486
The docs/assets folder should be used only for assets specific to docs generation at https://docs.victoriametrics.com, e.g. css, js and images.
All the other assets related to specific docs should be placed in the same folder as the corresponding *.md file.
These assets should have the same name prefix as the corresponding doc file name. This simplifies tracking the lifetime of these assets.
For example, if the doc is removed, it is very easy to remove all assets associated with it with a simple `rm -rf docs/doc-name*` command.
This also simplifies generating correct urls for doc-specific assets from both https://docs.victoriametrics.com
and from https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/docs/ - just refer to the asset name without any directory prefixes.
* feat: include fonts in the build
* fix: reduce size fonts
* wip
- Document the change at docs/CHANGELOG.md
- Run `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Allow users fine-tuning the maximum string length for interning via -internStringMaxLen command-line flag.
This may be used for fine-tuning RAM vs CPU usage for certain workloads.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3692
- Call httpserver.GetQuotedRemoteAddr() and httpserver.GetRequestURI() only when the error occurs.
This saves CPU time on fast path when there are no parsing errors.
- Create a helper function - httpserver.LogError() - for logging the error with the request uri and remote addr context.
Stale series are sent when there is a difference between current
and previous scrapes. Those series which disappeared in the current scrape
are marked as stale and sent to the remote storage.
Sending stale series requires memory allocation and in case when too many
series disappear in the same it could result in noticeable memory spike.
For example, re-deploy of a big fleet of service can result into
excessive memory usage for vmagent, because all the series with old
pod name will be marked as stale and sent to the remote write storage.
This change limits the number of stale series which can be sent at once,
so memory usage remains steady.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3668https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3675
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This prevents vmbackup from leaking passwords into logs like shown below.
2023-01-11T15:00:01.050Z info VictoriaMetrics/lib/logger/flag.go:12 build version: vmbackup-20221214-211706-tags-v1.85.1-0-g09a70d3e9
2023-01-11T15:00:01.050Z info VictoriaMetrics/lib/logger/flag.go:13 command-line flags
2023-01-11T15:00:01.050Z info VictoriaMetrics/lib/logger/flag.go:20 -dst="fs:///vm-backups/latest"
2023-01-11T15:00:01.050Z info VictoriaMetrics/lib/logger/flag.go:20 -snapshot.createURL="http://user:super_sercret123@victoriametricspshot/create"
2023-01-11T15:00:01.050Z info VictoriaMetrics/lib/logger/flag.go:20 -storageDataPath="/storage"
2023-01-11T15:00:01.050Z info VictoriaMetrics/app/vmbackup/main.go:53 Snapshot create url http://user:super_sercret123@victoriametrics:8428/snapshot/create
2023-01-11T15:00:01.050Z info VictoriaMetrics/app/vmbackup/main.go:60 Snapshot delete url http://user:super_sercret123@victoriametrics:8428/snapshot/delete
Assisted merges are intended to be performed by goroutines, which accept the incoming samples,
in order to limit the data ingestion rate.
The worker, which converts pending samples to parts, shouldn't be penalized by assisted merges,
since this may result in increased number of pending rows as seen at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3647#issuecomment-1385039142
when the assisted merge takes too much time.
- Document the fix at docs/CHANGELOG.md
- Limit the concurrency for sendStaleMarkers() function in order to limit its memory usage
when big number of targets disappear and staleness markers are sent
for all the metrics exposed by these targets.
- Make sure that the writeRequestCtx is returned to the pool
when there is no need to send staleness markers.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3668
We have limited amount of time used by Github CI runners
and JS analysis accounts for a half of it.
Since JS represents only a small fraction of the codebase
and is solely maintained by one person - I suggest to disable
the CodeQL check in order to save CI runners time.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* feat: make the step input field global
* fix: correct get step from url
* fix: set minimumSignificantDigits to 1
* app/vmselect/vmui: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Add a comment describing the purpose of the `role` field inside `apiConfig` struct
- Revert changes at lib/promscrape/discovery/dockerswarm/dockerswarm.go ,
since they reduce code readability. E.g. the reader needs to look up the named string constants
in order to get their values.
vmselect passes query timeout to vmstorage in seconds.
The commit 20e9598254 treated it as timeout in nanoseconds.
Fix this in order to prevent from the following errors under vmstorage load:
cannot process vmselect request: cannot execute "search_v7": couldn't start executing the request in 0.000 seconds,
since -search.maxConcurrentRequests=... concurrent requests are already executed.
The per-series timestamps are usually shared among series, so it is unsafe modifying them.
The issue has been appeared after the optimization at 2f3ddd4884
* {lib/server, app/}: use `httpAuth.*` flag as fallback for `*AuthKey` if it is not set
* lib/ingestserver/opentsdbhttp: fix opentdb HTTP handler not respecting `httpAuth.*` flags
* Apply suggestions from code review
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* vmagent: add minimal scrape file exampe for vmagent quick start
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
* replace example with link to your prometheus.yml in docker
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
- Use promutils.Labels.GetLabels() instead of comparing promutils.Labels.Labels to nil.
This make the code more consistent with other places.
- Mention the release where the issue has been introduced at docs/CHANGELOG.md.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3624
Previously the selected time series were split evenly among available CPU cores
for further processing - e.g unpacking the data and applying the given rollup
function to the unpacked data.
Some time series could be processed slower than others.
This could result in uneven work distribution among available CPU cores,
e.g. some CPU cores could complete their work sooner than others.
This could slow down query execution.
The new algorithm allows stealing time series to process from other CPU cores
when all the local work is done. This should reduce the maximum time
needed for query execution (aka tail latency).
The new algorithm should also scale better on systems with many CPU cores,
since every CPU processes locally assigned time series without inter-CPU communications.
The inter-CPU communications are used only when all the local work is finished
and the pending work from other CPUs needs to be stealed.
Unpack time series with less than 400K samples in the currently running goroutine.
Previously a new goroutine was being started for unpacking the samples.
This was requiring additional memory allocations.
Usually the number of blocks returned per each time series during queries is around 4.
So it is a good idea to pre-allocate 4 block references per time series
in order to reduce the number of memory allocations.
See the list of configs supported by Prometheus at f88a0a7d83/discovery/nomad/nomad.go (L76-L84)
- Removed "token" option. In can be set either via NOMAD_TOKEN env var or via `bearer_token` config option.
- Removed "scheme" option. It is automatically detected depending on whether the `tls_config` is set.
- Removed "services" and "tags" options, since they aren't supported by Prometheus.
- Added "region" option. If it is missing, then the region is read from NOMAD_REGION env var.
If this var is empty, then it is set to "global" in the same way as Nomad client does.
See 865ee8d37c/api/api.go (L297)
and 865ee8d37c/api/api.go (L555-L556)
- If the "server" option is missing, then it is read from NOMAD_ADDR in the same way
as Nomad client does - see 865ee8d37c/api/api.go (L294-L296)
This is a follow-up for 8aee209c53
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3367
This should reduce the maximum memory usage for processScrapedData() function by 2x.
The only part, which can be IO-bound in the processScrapedData() is pushData() call,
when it buffers data to persistent queue if the remote storage cannot keep up
with the data ingestion speed. In this case it is OK if the scrape pace will be limited.
This should reduce memory usage when scraping big number of targets,
since this limits the summary memory usage during concurrent parsing and relabeling
by the number of available CPU cores.
Previously the -maxConcurrentInserts was limiting the number of established client connections,
which write data to VictoriaMetrics. Some of these connections could be idle.
Such connections do not consume big amounts of CPU and RAM, so there is a little sense in limiting
the number of such connections. So now the -maxConcurrentInserts command-line option
limits the number of concurrently executed insert requests, not including idle connections.
It is recommended removing -maxConcurrentInserts command-line option, since the default value
for this option should work good for most cases.
This should prevent from out of memory errors when big number of vmselect
nodes send many concurrent requests to vmstorage
The limit can be controlled at vmstorage via the following command-line flags:
- search.maxConcurrentRequests
- search.maxQueueDuration
See https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#resource-usage-limits
- Document the bugfix at docs/CHANGELOG.md
- Wait until all the worker goroutines are done in consulWatcher.mustStop()
- Do not log `context canceled` errors when discovering consul serviceNames
- Removed explicit handling of gzipped responses at lib/promscrape/discoveryutils.Client,
since this handling is automatically performed by net/http.Transport.
See DisableCompression option at https://pkg.go.dev/net/http#Transport .
- Remove explicit handling of the proxyURL, since it is automatically handled
by net/http.Transport. See Proxy option at https://pkg.go.dev/net/http#Transport .
- Expliticly set MaxIdleConnsPerHost, since its default value equals to 2.
Such a small value may result in excess tcp connection churn
when more than 2 concurrent requests are processed by lib/promscrape/discoveryutils.Client.
- Do not set explicitly the `Host` request header, since it is automatically set by net/http.Client.
- Backport the bugfix to the recently added nomad_sd_configs - see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3367
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3468
- Remove undocumented `username` and `password` config options from `nomad_sd_config`.
TODO: probably, remove these options from `consul_sd_config` too?
These options exist there for backwards compatibility purposes.
- Add __meta_nomad_service_alloc_id and __meta_nomad_service_job_id meta-labels
These labels contain AllocID and JobID fields for the discovered Nomad services.
- Various typo fixes.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3367
Got "Failed to upgrade legacy queries Datasource $ds was not found" in
Grafana on operator dashboard.
It's datasource variable was incorrectly named `datasource`.
Also made the rest of the dashboards have homogeneous datasource-variable
names and selections, matching vmagent dashboard.
- Show in the line tooltip the number of the query which generates the given line.
This simplifies comparison of lines generated by multiple queries.
- Show metric name as __name__ label in the line tooltip in the same way as other labels are shown there.
This makes the label information in the tooltip more consistent.
- Properly quote label values with JSON.stringify(). This prevents from improper formatting
when label values contain doublequote chars.
- Remove double curly braces artifact at graph legend for lines without names and labels.
- Properly use modifier for regular expressions across the code.
There is no need to manually call `queryDuration.UpdateDuration(startTime)`, because `defer queryDuration.UpdateDuration(startTime)` is executed at the beginning of the function(L660).
Allow configuring the default number of stored rule's update states in memory
via global `-rule.updateEntriesLimit` command-line flag or per-rule via rule's
`update_entries_limit` configuration param.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This simplifies manual usage of the APIs. For example, the following query
would return the results over the 2022 year.
/api/v1/query_range?start=2022&end=2023&step=1d&query=...
This is equivalent to:
/api/v1/query_range?start=2022-01-01T00:00:00Z&end=2023-01-01T00:00:00Z&step=1d&query=...
Use "unexpected status code returned from %q: %d; expecting %d" log message format
instead of less clear format "unexpected status code returned from %q; expecting %d; got %d"
This is a follow-up for c612bb165e
- Rename `Custom panel` tab to more clear `Query` tab
- Rename `Cardinality` tab to `Explore cardinality`, so it becomes consistent with `Explore metrics` tab
- Move `Dashboards` tab to the end, since it isn't used too much
- Document the feature at docs/CHANGELOG.md.
- Document the metrics explorer at https://docs.victoriametrics.com/#metrics-explorer .
- Properly set `start` and `end` args for the selected time range
when performing the request, which returns metric names.
- Improve queries, so they return lower number of lines and labels.
This should improve metrics' exploration.
- Properly encode label filters and query args before passing them to VictoriaMetrics.
- Various cosmetic fixes.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3386
* app/vmbackupmanager: add metrics for better observability, include more information to `/api/v1/backups` API call response
* app/vmbackupmanager: drop old metrics before creating new ones
* app/vmbackupmanager: use `_total` postfix for counter metrics
* app/vmbackupmanager: remove `_total` postfix for gauge-like metrics
* app/vmbackupmanager: add `_last_run_failed` metrics for backups and retention
* app/vmbackupmanager: address review feedback
* app/vmbackupmanager: fix metric name
* app/vmbackupmanager: address review feedback, remove background updates of metrics, add restoring state of `_last_run_failed` metric from remote storage
* app/vmbackupmanager: improve performance for backup size calculation
* app/vmbackupmanager: refactor backup and retention runs to deduplicate each run logic
* {app/vmbackupmanager,lib/formatutil}: move HumanizeBytes into lib package
* app/vmbackupmanager: fix creating new metrics instead of reusing existing ones
* lit/formatutil: add comment to make linter happy
* app/vmbackupmanager: address review feedback
Previously too short lookbehind window d for rate(m[d]) could be automatically extended
if it didn't cover at least two raw samples. This was needed in order to guarantee
non-empty results from rate(m[d]) on short time ranges.
Now the lookbehind window isn't extended if it is set explicitly,
since it is expected that the user knows what he is doing.
The lookbehind window continues to be extended when needed if it isn't set explicitly.
For example, in the case of rate(m).
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3483
The issue triggers after the indexdb rotation for time series, which stop receiving new samples.
This results in missing data for such time series in query responses.
This commit should address the https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3502
The issue has been introduced in 2dd93449d8
- Document the change at docs/CHANELOG.md
- Log fatal errors if the -loggerJSONFields contains unexpected values
- Rename -loggerJsonFields to -loggerJSONFields for the sake of consistency naming commonly used in Go
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2348
Previously, $job_select, $job_storage and $job_insert
didn't respect the $job filter. This change updates
the variable queries to account for set $job variable.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This fixes handling of values bigger than 2GiB for the following command-line flags:
- -storage.minFreeDiskSpaceBytes
- -remoteWrite.maxDiskUsagePerURL
Blocked small merges may result into big number of small parts, which, in turn,
may result in increased CPU and memory usage during queries, since queries need to inspect
all the existing small parts.
The issue has been introduced in 8189770c50
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3337
Previously only up to 100K results were cached.
This could result in sub-optimal performance when more than 100K unique strings were actually used.
For example, when the relabeling rule was applied to a million of unique Graphite metric names
like in the https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3466
This commit should reduce the long-term CPU usage for https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3466
after all the unique Graphite metrics are registered in the FastStringMatcher.Transform() cache.
It is expected that the number of unique strings, which are passed to FastStringMatcher.Match(),
FastStringTransformer.Transform() and to InternString() during the last 5 minutes,
is limited, so the function results fit memory. Otherwise OOM crash can occur.
This should be the case for typical production workloads.
The new annotation is hidden by default and suppose to show
component `short_version` label change on the panels.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The system links are absolute, e.g. they start from `/`, so there are high chances
they won't work as expected when requested via proxy such as vmselect with -vmalert.proxyURL
command-line flag.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/3424
http.Request was used as a part of state struct
for generating the curl command when viewing the rule's
state changes.
It appears, that holding a referencing is far more expensive
than generating the curl command immediately.
On the test with 40k rules, this change reduces memory
and CPU usage by 50%.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/promscrape/discovery/azure: remove API server from URL returned by azure
* lib/promscrape/discovery/azure: validate nextLink contains same URL as apiServer
This should simplify further debugging, since the first thing to start the debugging by query trace
is to know the version of VictoriaMetrics, which produced this trace.
Alert `RequestErrorsToAPI` could be permanently triggered due to
mistakes in clients configuration. However, such requests are unlikely
to cause VM health state change. So there is no need in displaying
this alert because there will be no correlation caused by it.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Document the change at docs/CHANGELOG.md
- Run `make docs-sync` for copying app/vmgateway/README.md to docs/vmgateway.md
in order to propagate docs' changes to https://docs.victoriametrics.com/vmgateway.html
* vmalert: correctly return error for RW failures
By mistake, in 0989649ad0 the error
for remote write failures weren't return to user.
This change fixes it.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The main purpose of this command-line flag is to increase the lifetime of low-end flash storage
with the limited number of write operations it can perform. Such flash storage is usually
installed on Raspberry PI or similar appliances.
For example, `-inmemoryDataFlushInterval=1h` reduces the frequency of disk write operations
to up to once per hour if the ingested one-hour worth of data fits the limit for in-memory data.
The in-memory data is searchable in the same way as the data stored on disk.
VictoriaMetrics automatically flushes the in-memory data to disk on graceful shutdown via SIGINT signal.
The in-memory data is lost on unclean shutdown (hardware power loss, OOM crash, SIGKILL).
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3337
The new panels have been added to the vmstorage and drilldown rows.
`Disk space usage %` is supposed to show disk space usage percentage.
This panel is now also referred by `DiskRunsOutOfSpace` alerting rule.
This panel has Drilldown option to show absolute values.
`Disk space usage % by type` shows the relation between datapoints
and indexdb size. It supposed to help identify cases when indexdb
starts to take too much disk space.
This panel has Drilldown option to show absolute values.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Method `metrics()` now pre-allocates slices for labels
and results from query responses. This reduces the number
of allocations on the hot path for instant requests.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The recent change in modifying default value
of `datasource.queryStep` flag resulted in situation
where replay mode was always running queries with
step=`datasource.queryStep`. When it should always
use rule's evaluation interval.
The fix is related not to replay mode only, but
for all Range requests. Now step param is set
individually for each mode.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Fixes a missing `&` char in data link for ETA panel
on cluster dashboards. Without `&` char it generates
wrong link when click on Drilldown menu.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: add `remoteWrite.sendTimeout` command-line flag to configure timeout for sending data to `remoteWrite.url`
* vmalert: remove WriteTimeout from clients Cfg
No need to have it as a part of configuration struct:
* the client isn't used by other packages;
* there are no internal tests to check the WriteTimeout.
* vmalert: remove DisablePathAppend from clients Cfg
No need to have it as a part of configuration struct:
* the client isn't used by other packages;
* there are no internal tests to check the DisablePathAppend.
Co-authored-by: hagen1778 <roman@victoriametrics.com>
- Return meta-labels for the discovered targets via promutils.Labels
instead of map[string]string. This improves the speed of generating
meta-labels for discovered targets by up to 5x.
- Remove memory allocations in hot paths during ScrapeWork generation.
The ScrapeWork contains scrape settings for a single discovered target.
This improves the service discovery speed by up to 2x.
The change list is the following:
* bump Grafana version to 9.2.6;
* replace old "Graph" panel with "TimeSeries" panel;
* show % usage of Mem and CPU additionally to of absolute values;
* `Caches` row was removed. All needed info for caches is now part of `Troubleshooting`;
* add Annotations for Alert triggers. Not all alerts are supposed to be displayed
on the dashboard, but only those with label `show_at: dashboard`.
See `alerts.yml` change.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The change list is the following:
* bump Grafana version to 9.2.6;
* replace old Graph panel with TimeSeries panel;
* add RemoteWrite section;
* allow configuring topK elements for some of the panels;
* Preer grouping by job instead of grouping by instance.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* flag reference update
there is no flag `-datasource.disablePathAppend` and datasource actually checking for `-remoteRead.disablePathAppend`
* update source for doc as well
The change list is the following:
* bump Grafana version to 9.2.6;
* add version change annotations;
* switch to per-job panels instead of per-instance;
* add drilldown option for resource usage panels.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The change list is the following:
* bump Grafana version to 9.2.6;
* remove artifacts in data links.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* {app/vmstorage,app/vmselect}: add API to get list of existing tenants
* {app/vmstorage,app/vmselect}: add API to get list of existing tenants
* app/vmselect: fix error message
* {app/vmstorage,app/vmselect}: fix error messages
* app/vmselect: change log level for error handling
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* some unexpected DS UIDs were removed;
* replace `$instance.*` filter with `$instance` since we respect
the instance port anyway;
* remove predefined datasource for `clusterbytenant`
in favour of datasource variable `ds`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The purpose of the update is to make the dash more usable
for large installations with many instances. Panels which showed
metrics per-instance (Mem, CPU) now are showing metrics per-job or min/max/avg
aggregations in % instead. This supposed to help immediately to identify
resource shortage and remain usable for small and big installations.
For cases when detailed info is needed, to the bottom of the dashboard
a new row `Drilldown` was added. Panels like Mem or CPU now contain
a `data-link` named `Drilldown` (cis shown on line click) which takes
user to more detailed panel.
The change list is the following:
* bump Grafana version to 9.1.0;
* replace old "Graph" panel with "TimeSeries" panel;
* improve Uptime panel to show number of instances per job;
* show % usage of Mem and CPU instead of absolute values;
* `Caches` row was removed. All needed info for caches is now part of `Troubleshooting`;
* add `Drilldown` section for detailed resource usage;
* add Annotations for Alert triggers. Not all alerts are supposed to be displayed
on the dashboard, but only those with label `show_at: dashboard`.
See `alerts-cluster.yml` change.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* fix: reset the value of the switches trace and cache
* fix: add cursor text for inputs
* fix: solve the Infinite loop of useFetchQuery.ts
* fix: change condition for show/hide autocomplete
* fix: add limit error length for input
The default list of alerting rules contains the basic
rules for checking vmalert's health state and is recommended
to use for monitoring vmalert deployments.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It has been appeared that some production workloads could suffer for some time
after every reset of the previous cache when it gets less than 5% of requests
after the needed item isn't found in the current cache. This could result
in reduced cache hit rates, which, in turn, could increase CPU, disk IO and RAM
usage needed for reading, unpacking and caching the missed data from disk.
This commit reduces the cache miss threshold for resetting the previous cache from 5% to 1%.
This should reduce the possible negative impact after each cache reset by at least 5x,
while reducing the total memory used by caches.
This is a follow-up for d906d8573e
* docs/operator: change VMAgentRemoteWriteSettings.MaxDiskUsagePerURL type from int32 to int64
* docs: operator updates api description
Co-authored-by: f41gh7 <nik@victoriametrics.com>
- Remove trailing whitespace at the end of lines
- Remove redundant sentence stating that time series matching the given selector will be deleted.
It should be clear from the surrounding context.
* deployment/docker/docker-compose-cluster.yml: bump VictoriaMetrics Cluster components to the latest v1.83.0 version
* deployment/docker/docker-compose.yml: bump VictoriaMetrics Single node and vmutils to the latest v1.83.0 version
The issue was in the `labels := dst[offset:]` line in the beginning of appendExtraLabels() function.
The `dst` may be re-allocated when adding extra labels to it. In this case the addition of `exported_`
prefix to labels inside `labels` slice become invisible in the returned `dst` labels.
While at it, properly handle some corner cases:
- Add additional `exported_` prefix to clashing metric labels with already existing `exported_` prefix.
- Store scraped metric names in `exported___name__` label if scrape target contains `__name__` label.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3278
Thanks to @jplanckeel for the initial attempt to fix this issue
at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/3281
* feat: apply serverURL on down Enter
* fix: change method of set time range
* fix: remove prevent run fetch without changes
* fix: prevent reset timerange when autorefresh
Previously the `quotesEscape` function was escaping only double quotes.
This wasn't enough, since the input string could contain other special chars,
which must be escaped when put inside JSON string. For example, carriage return and line feed chars (\n\r),
backslash char, etc. This led to the following issues, which were improperly fixed:
- https://github.com/VictoriaMetrics/VictoriaMetrics/issues/890 - this issue
was "fixed" by introducing the `crlfEscape` function, which led to unnecessary
complications in user templates, while not fixing various corner cases
such as backslash chars in the input string.
See 1de15ad490
- https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3139 - this issue
was "fixed" by urlencoding the whole string passed to -external.alert.source
command-line flag. This led to invalid urls, which couldn't be parsed by Grafana.
See 00c838353d
and 4bd0244599
This commit properly encodes the input string passed to `quotesEscape`, so it can be safely embedded inside JSON strings.
This commit deprecates crlfEscape template function and adds the following new template functions:
- strvalue and stripDomain - these functions are supported by Prometheus, so they were added
for compatibility purposes.
- jsonEscape and htmlEscape for converting the input string to valid quoted JSON string
and for html-escaping the input string, so it could be safely embedded as a plaintext
into html.
This commit also documents all supported template functions at https://docs.victoriametrics.com/vmalert.html#template-functions
The deprecated crlfEscape function isn't documented on purpose, since its usefulness is negative in general case.
This reverts commit 00c838353d.
Reason for revert: it incorrectly fixes the issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3139 .
Now `-external.alert.source=explore?orgId=1&left=...` is converted to the following invalid url, which cannot be handled by Grafana:
https://grafana.example.com/explore%3ForgId%3D1%26left%3D...
The next commit will contain the correct fix of the issue - the `quotesEscape` function must
properly escape the string, so it could be embedded into JSON string. This function must
properly escape \n\r chars too. In this case the `crlfEscape` function becomes unnecessary.
Actually, the next commit makes the `crlfEscape` function deprecated.
Previously netstorage.MustStop() call didn't free up all the resources,
so the subsequent call to nestorage.Init() would panic.
This allows writing tests, which call nestorage.Init() + nestorage.MustStop() in a loop.
Blocks outside the configured retention are eventually deleted during background merge.
But such blocks may reside in the storage for long time until background merge.
Previously VictoriaMetrics could spend additional CPU time on processing such blocks
during search queries. Now these blocks are skipped.
The searchTSIDs function was searching for metricIDs matching the the given tag filters
and then was locating the corresponding TSID entries for the found metricIDs.
The TSID entries aren't needed when searching for time series names (aka MetricName),
so this commit removes the uneeded TSID search from the implementation of /api/v1/series API.
This improves perfromance of /api/v1/series calls.
This commit also improves performance a bit for /api/v1/query and /api/v1/query_range calls,
since now these calls cache small metricIDs instead of big TSID entries
in the indexdb/tagFilters cache (now this cache is named indexdb/tagFiltersToMetricIDs)
without the need to compress the saved entries in order to save cache space.
This commit also removes concurrency limiter during searching for matching time series,
which was introduced in 8f16388428, since the concurrency
for all the read queries is already limited with -search.maxConcurrentRequests command-line flag.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/648
This increases the maximum time for cache population with new entries from 20 minutes to 40 minutes.
This
This change shouldn't increase memory usage for caches, since the prev cache cleaner
should free up memory by deleting unused prev cache as soon as possible.
See 08ca45d238 for details on prev cache cleaner.
The panic has been introduced in 68f3a02589
While at it, add padding to shard structs in order to avoid false sharing on mordern CPUs
This should improve scalability on systems with many CPU cores
This means that the majority of requests are successfully served from the current cache,
so the previous cache can be reset in order to free up memory.
The message about dropped data still remains at `error` level.
The change supposed to make log message more clear about how
serious it is.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* feat: add maximum display series by tabs
* feat: add warning on PredefinedPanels.tsx
* docs/CHANGELOG.md: vmui limit number of plotted series
* docs/CHANGELOG.md: vmui limit number of plotted series
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/backup: set s3 default region to us-west-2
it should fix an error with region detection for bucket, if AWS_REGION env var is not set
* Update lib/backup/s3remote/s3.go
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* Multi retention standardization of docs
While reading through the docs the implementation details had different formatting for storageNode. This standardizes them and adds a link to the retention docs.
* Update docs/guides/guide-vmcluster-multiple-retention-setup.md
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
The default value of `-datasource.queryStep` has changed, so we update
the troubleshooting docs accordingly.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Sort labels explicitly after calling the ParsedConfigs.Apply() when needed.
This reduces CPU usage when performing metric-level relabeling, where labels' sorting isn't needed.
Previously the `__address__` label could contain only `host:port` part of the target url,
while the scheme and metrics path were obtained from `__scheme__` and `__metrics_path__`
labels. Now it is possible to set the full url in `__address__` label.
This makes valid the following scrape config, which is frequently used by novice users:
scrape_configs:
- job_name: foo
static_configs:
- targets:
- http://host1/metrics1
- https://host2/metrics2
The stored backups would help to identify docs corruption
but aren't needed for commiting. So `.tmp` backup files
are also git-ignored.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Optimize fast path for /api/v1/import when importing numeric values
* Move the docs about the change from features to bugfixes at docs/CHANGELOG.md
* Update tests at lib/protoparser/vmimport
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3161
* app/vmselect: properly work when export import json from `api/v1/{export, import}` API
* app/vmselect: update convert function
* app/vmselect: export null if `math.IsNaN(v)`
* app/vmselect: get float from json
* lib/protoparser: add test
* docs: add change log
* lib/protoparser: make export import api compatible
* Document the addition of Azure blob storage support in vmbackup / vmrestore
* List the supported storage system types at docs/vmrestore.md
* Mention about azblob storage system support at -src and -dst command-line flags
for vmbackup / vmrestore tools.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1029
* Use vm_account_id and vm_project_id labels to be consistent with https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#multitenancy-via-labels
* Document the feature that vmalert now exposes vm_account_id and vm_project_id
labels if -clusterMode is set.
* Use literal strings instead of string constants for vm_account_id and vm_project_id.
This improves code readability.
The change is supposed to provide additional flexibility for generating alert's
source link based on label values.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The `docs-sync` command was updated to modify images path
for assets in `docs/` folder. The change allows to refer images from `docs`
without copying them to the root folder.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Incorrect 301 redirects can be cached by user agents such as web browsers.
This can complicate recovery procedure after the incorrect redirect is fixed,
e.g. web browser cache must be reset.
The related issue - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1752
* app/vminsert: allows parsing tenant id from labels
it should help mitigate issues with vmagent's multiTenant mode, which works incorrectly at heavy load
and it cannot handle more then 100 different tenants.
This functional hidden with flag and do not change vminsert default behaviour
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2970
* Update docs/Cluster-VictoriaMetrics.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* wip
* app/vminsert/netstorage: clean remaining labels in order to free up GC
* docs/Cluster-VictoriaMetrics.md: typo fix
* wip
* wip
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Cache `action: replace` results for non-trivial regexs and return them next time
instead of performing CPU-intensive regex replacement.
Optimize also `action: labelmap_all` and `action: replace_all` in the same way.
Previously empty series (e.g. series with all NaN samples) were passed to aggregate functions.
Such series must be ingored by all the aggregate functions.
So it is better from consistency PoV filtering out empty series before applying aggregate functions.
* app/vmselect: ignore empty series for `limit_offset`
VictoriaMetrics doesn't return empty series (with all NaN values) to
the user. But such series are filtered after transform functions.
It means `limit_offset` will account for empty series as well.
For example, let's consider following data set:
```
time series:
foo{label="1"} NaN, NaN, NaN, NaN // empty series
foo{label="2"} 1, 2, 3, 4
foo{label="3"} 4, 3, 2, 1
```
When user requests all series for metric `foo` the empty series
will be filtered out:
```
/query=foo:
foo{label="v2"} 1, 2, 3, 4
foo{label="v3"} 4, 3, 2, 1
```
But `limit_offset(1, 1, foo)` is applied to original series, not filtered yet.
So it will return `foo{label="v2"}` (skips the first in list)
```
/query=limit_offset(1, 1, foo):
foo{label="v2"} 1, 2, 3, 4
```
Expected result would be to apply `limit_offset` to already filtered list,
so in result we receive `foo{label="v3"}`:
```
/query=limit_offset(1, 1, foo):
foo{label="v3"} 4, 3, 2, 1
```
The change does exactly that - filters empty series before applying `limit_offset`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: ignore empty series for `limit_offset`
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
According to Ruler specification, only labels returned within time series
should be available for use in annotations.
For long time, vmalert didn't respect this rule. And in PR
https://github.com/VictoriaMetrics/VictoriaMetrics/pull/2403
this was fixed for the sake of compatibility. However, this resulted
into users confusion, as they expected all configured and extra labels
to be available - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3013
This fix allows to use extra labels in Annotations. But in the case of conflicts
the original labels (extracted from time series) are preferred.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/{httpserver,netutil}: allow to define min and max TLS version of the http server
* lib/httpserver: added descriptions about tls supported versions
* lib/netutil: check minimal tls version, added supported tls versions to error
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The workaround was introduced to fix https://github.com/VictoriaMetrics/VictoriaMetrics/issues/962.
However, it didn't prove itself useful. Instead, it is recommended using `increase_pure` function.
Removing the workaround makes VM to produce accurate results when calculating
`delta` or `increase` functions over slow-changing counters with vary intervals
between data points.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Clarify the description for -datasource.queryStep command-line flag
- Consistently use a single dash in front of -datasource.queryStep command-line flag
- Update -help output at docs/vmalert.md
- Consistently use single dash in front of command-line flags instead of double dashes.
- Add a warning that too small -search.latencyOffset may lead to incomplete query results.
Change default value for command-line flag `datasource.queryStep` from `0s` to `5m`.
Param `step` is added by vmalert to every rule evaluation request sent to datasource.
Before this change, `step` was equal to group's evaluation interval by default.
Param `step` for instant queries defines how far VM can look back for the last written data point.
The change supposed to improve reliability of the rules evaluation when evaluation interval
is lower than scraping interval.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Cluster components always have `-cluster` suffix. The change fixes
incorrect image tag in docker-compose manifest.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* deployment/docker/docker-compose.yml: adds version tags for VictoriaMetrics containers
* deployment/docker/docker-compose-cluster.yml: adds version tags for VictoriaMetrics containers
* deployment/docker: move cluster compose env to master branch
The change supposed to simplify the process of maintaining for
single/cluster docker-compose envs, alerts, dashboards. It also
supposes to reduce confusion for users when looking for cluster
related alerts/configs.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* deployment/docker: move cluster compose env to master branch
Review updates.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Now vmalert will print the following messages on dupliсates:
```
"recording rule \"record\"; expr: \"up == 1\"; labels: summary={{ value|query }}" is a duplicate within the group "test"
"alerting rule \"alert\"; expr: \"up == 1\"; labels: description={{ value|query }}" is a duplicate within the group "test"
```
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3127
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The standard Snappy encoder from github.com/golang/snappy shows quite good performance number
for compressing the Prometheus remote_write proto messages according to the added benchmarks,
so there is no need in switching to github.com/klauspost/compress/s2 yet.
* dashboards/cluster: few updates
* apply consistent formatting across panels;
* make resource usage panels per component more detailed;
* add extra panels to vmselect for displaying
`vm_rows_read_per_query`, `vm_rows_scanned_per_query`,
`vm_rows_read_per_series` and `vm_series_read_per_query` metrics.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/single: few updates
* apply consistent formatting across panels;
* add extra panels to Performance for displaying
`vm_rows_read_per_query`, `vm_rows_scanned_per_query`,
`vm_rows_read_per_series` and `vm_series_read_per_query` metrics.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmagent: few updates
* apply consistent formatting across panels;
* add panels for showing number of samples ingested
or scraped;
* adapt resource usage panels for multiple selected jobs/instances;
* add adhoc variable;
* display vmagent's version in Stats.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmalert: few updates
* apply consistent formatting across panels;
* adapt resource usage panels for multiple selected jobs/instances;
* show vmalert version in Stats section.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: always re-evaluate Annotations
Previously, Annotations were evaluated only:
1. On alert creating.
2. On alert's value change.
This is premature optimization. It was assumed that since annotations
could contain only text with alert's labels or value - there is no need
in spending resources to re-compile Annotations.
Later, template function `query` was added, which can execute
arbitrary queries and return different results on every evaluation.
So if it was used in annotations, it would be executed only on init
or value change.
Another case when optimization caused an issue - annotations hot reload.
In this case, annotations of the active alert won't change even if Rule's
annotations were changed.
This fix enables Annotations re-evaluation on each iteration to resolve
issues above. It would have some impact on performance, but it is unlikely
it will be noticeable.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: add tp Changelog
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The change adds an example of `curl` command to the Rule's page.
The command is generated for each recorded state. It is supposed
user can just copy&execute the command to see what was returned
to vmalert.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmselect/promql: add alphanumeric sort by label (sort_by_label_numeric)
* vmselect/promql: fix tests, add documentation
* vmselect/promql: update test
* vmselect/promql: update for alphanumeric sorting, fix tests
* vmselect/promql: remove comments
* vmselect/promql: cleanup
* vmselect/promql: avoid memory allocations, update functions descriptions
* vmselect/promql: make linter happy (remove ineffectual assigment)
* vmselect/promql: add test case, fix behavior when strings are equal
* vendor: update github.com/VictoriaMetrics/metricsql from v0.44.1 to v0.45.0
this adds support for sort_by_label_numeric and sort_by_label_numeric_desc functions
* wip
* lib/promscrape: read response body into memory in stream parsing mode before parsing it
This reduces scrape duration for targets returning big responses.
The response body was already read into memory in stream parsing mode before this change,
so this commit shouldn't increase memory usage.
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
vmalert: add experimental feature of storing Rule's evaluation state
The new feature keeps last 20 state changes of each Rule
in memory. The state are available for view on the Rule's
view page. The page can be opened by clicking on `Details`
link next to Rule's name on the `/groups` page.
States change suppose to help in investigating cases when Rule
doesn't generate alerts or records.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This reduces scrape duration for targets returning big responses.
The response body was already read into memory in stream parsing mode before this change,
so this commit shouldn't increase memory usage.
- Rename logDebug() to logDebugf() and pass format string together
with format args directly to logDebugf(). This eliminates fmt.Sprintf()
overhead at logDebug() call site when debugging is disabled.
- Format labels in debug message in Prometheus format, e.g. {label1="value1",...labelN="valueN"}
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3025
`go install` is the preferred way for installing go binaries starting
from the minimum supported Go version for VictoriaMetrics - Go1.18 -
see https://tip.golang.org/doc/go1.18#go-command
The automated push of release tags to Github require specifying the remote repository name
when doing `git push <remote-repo-name> v1.xx.y`.
The remote repository name can differ in different environments,
so it cannot be put into Makefile rule.
TODO: create a Makefile rule, which generates standard remote names for public
and private repositories in Git, so `git push` for release tags could be automated then.
- Add getCommonParamsWithDefaultDuration function and use it at /api/v1/series, /api/v1/labels and /api/v1/label/.../values
- Document the default behaviour for setting 5 minutes time range if start arg isn't passed to /api/v1/series, /api/v1/labels and /api/v1/label/.../values
- Document the change at docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/3052
The panic may trigger during data blocks' processing received
from vmstorage nodes when some of vmstorage nodes return an error
or when `-replicationFactor` is set to values higher than 2 at `vmselect`.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3058
Note that the parallel execution of `union()` args may take more memory and CPU time
than the sequential execution if args contain heavy queries, which may load all the available CPU,
disk and memory resources and vmselect and vmstorage levels.
- Use getScalar() function for obtaining the expected scalar from phi arg
- Reduce the error message returned to the user when incorrect phi is passed to histogram_quantiles
- Improve the description of this bugfix in the docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3026
Cache sanitized label names and return them next time.
This reduces the number of allocations and speeds up the SanitizeLabelName()
function for common case when the number of unique label names is smaller than 100k
The following regex patterns are optimized:
- literal string match, e.g. "foo"
- prefix match, e.g. "foo.*" and "foo.+"
- substring match, e.g. ".*foo.*" and ".+foo.+"
- alternate values match, e.g. "foo|bar|baz"
This is OK, since the anchors are implicitly applied to the whole regexp.
This optimization should improve the speed for regexp series filters with explicit $ and ^ anchors.
For example, `{label="^(foo|bar)$"}`
- Document the change at docs/CHANGELOG.md
- Move auth token parsing from app/vmagent/opentsdbhttp/ to app/vmagent/main.go,
since it must be parsed only when multitenancy support is enabled at vmagent side.
See https://docs.victoriametrics.com/vmagent.html#multitenancy
These metrics allow alerting when the number of unique series approach the limit.
For example, the following query alerts when the number of series reaches 90% of the configured limit:
vm_hourly_series_limit_current_series / vm_hourly_series_limit_max_series > 0.9
The io/ioutil package is deprecated since Go1.16 - see https://tip.golang.org/doc/go1.16#ioutil
VictoriaMetrics requires at least Go1.18, so it is time to remove the io/ioutil from source code
This is a follow-up for 02ca2342ab
ioutil.ReadAll is deprecated since Go1.16 - see https://tip.golang.org/doc/go1.16#ioutil
VictoriaMetrics requires at least Go1.18, so it is OK to switch from ioutil.ReadAll to io.ReadAll.
This is a follow-up for 02ca2342ab
The ioutil.ReadDir is deprecated since Go1.16 - see https://tip.golang.org/doc/go1.16#ioutil
VictoriaMetrics requires at least Go1.18, so it is time to switch from io.ReadDir to os.ReadDir
This is a follow-up for 02ca2342ab
The ioutil.{Read|Write}File is deprecated since Go1.16 -
see https://tip.golang.org/doc/go1.16#ioutil
VictoriaMetrics needs at least Go1.18, so it is safe to remove ioutil usage
from source code.
This is a follow-up for 02ca2342ab
* lib/storage: bump max merge concurrency for small parts to 15
The change is based on the feedback from users on github.
Thier examples show, that limit of 8 sometimes become a
bottleneck. Users report that without limit concurrency
can climb up to 15-20 merges at once.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update lib/storage/partition.go
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Flags `-search.maxTagKeys` and `-search.maxTagValues` are present at vmstorage and not at vmselect
```
# ./vmselect-prod -h | egrep 'search.maxTagKeys|search.maxTagValues'
# ./vmstorage-prod -h | egrep 'search.maxTagKeys|search.maxTagValues'
-search.maxTagKeys int
-search.maxTagValues int
```
We switch default alert's source link to redirect user
to vmalert's UI instead of previous JSON object. While it breaks
compatibility, it also supposed to improve user's experience.
The old behavior can be achieved by updating `-external.alert.source`
command-line flag.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The 429 status code means that the server is overwhelmed with requests.
The client can retry the request after some wait time.
Implement this strategy for service discovery and scrape requests.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2940
* Explicitly store a pointer to UserReadableError in the error interface.
Previously Go automatically converted the value to a pointer before storing in the error interface.
* Add Unwrap() method to UserReadableError, so it can be used transparently with the other code,
which calls errors.Is() and errors.As().
* Document the change in docs/CHANGELOG.md
When read query fails, VM returns rich error message with
all the details. While these details might be useful
for debugging specific cases, they're usually too verbose
for users.
Introducing a new error type `UserReadableError` is supposed
to allow to return to user only the most important parts
of the error trace. This supposed to improve error readability
in web interfaces such as VMUI or Grafana.
The full error trace is still logged with the full context
and can be found in vmselect logs.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Previously a single syncwg.WaitGroup was used for tracking the lifetime of processBlock callbacks
across all the per-vmstorage goroutines. This could be slow on systems with many CPU cores
because of inter-CPU synchronization overhead.
Use a separate per-vmstorage sync.WaitGroup instead in order to reduce inter-CPU synchronization overhead.
This should imrpove performance for heavy queries over big number of blocks on multi-CPU systems.
Other components, such as `vmagent`, mark these flags as sensitive and
hide them from the `/metrics` endpoint by default. This commit adds
similar handling to the `vmalert` component, hiding them by default, to
prevent logging of secrets inappropriately.
Showing of these values is controlled by an additional flag.
Follow up to https://github.com/VictoriaMetrics/VictoriaMetrics/pull/2947
* lib/storage: prevent excessive loops when storage is in RO
Returning nil error when storage is in RO mode results
into excessive loops and function calls which could
result into CPU exhaustion. Returning an err instead
will trigger delays in the for loop and save some resources.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* document the change
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
vmalert can be successfully used with datasources
compatible with Prometheus HTTP API. So we remove comments or
notes in Readme which are saying opposite.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/auth: add tests from NewToken function
* lib/auth: update test, fix problem with type conversion
* lib/auth: update test description
* lib/auth: simplify failure tests
The `-mod=vendor` is automatically set when there is a `vendor` directory
starting from Go1.14 - see https://go.dev/doc/go1.14#go-command
Since the minimum supported Go version for VictoriaMetrics is Go1.17,
then the `-mod=vendor` option is no longer needed.
* Update README.md: fix typo to "HA pairs"
Update README.md to fix typo from "HA paris" to "HA pairs"
* Update docs/README.md : Fix of typo from "HA paris"
Update docs/README.md : Fix of typo from "HA paris" to "HA pairs"
* Update of Single-server-VictoriaMetrics.md to fix typo "HA paris" to "HA pairs"
Update of Single-server-VictoriaMetrics.md to fix typo "HA paris" to "HA pairs"
This adds the following push-related metrics when -pushmetrics.url is set:
- metrics_push_interval_seconds
- metrics_push_total
- metrics_push_errors_total
- metrics_push_bytes_pushed_total
- metrics_push_duration_seconds
- metrics_push_block_size_bytes
Updates https://github.com/VictoriaMetrics/metrics/issues/35
* deployment/docker/Makefile: added docker-scan
docker-scan based on native 'docker scan' function that use snyk.io, see https://docs.docker.com/engine/scan/
* set to call 'docker-scan after release binaries but before publishing
Reduce inter-CPU communications when processing the query over big number of time series.
This should improve performance for queries over big number of time series
on systems with many CPU cores.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2896
Based on b596ac3745
Thanks to @zqyzyq for the idea.
Usernames could be duplicate if it has uniq password.
vmauth makes routing based on auth token and username + password combination must be unique for this case.
The new metric `vmagent_remotewrite_queues` exports a static value of
number of configured remote write queus. This metric is useful to
calculate total saturation per each configured URL with given number
of queues. See corresponding changes to vmagent alerts and dashboard.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The change allows to specify default value for `getScrapeInterval`
function when actual interval can't be calculated.
Before the change, function were returning `maxSilenceInterval` (5m)
in such cases, which may be not correct for instant queries processing.
The specific scenario where using `maxSilenceInterval` caused issues
is the following:
1. Series becomes stale;
2. Client (in this case vmalert) continues to request series every 15s;
3. Database returns empty results as expected;
4. But at some specific moment of time database returns datapoints from `now()-5m`,
because lookback window was extended to `maxSilenceInterval`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmselect: cover special cases for vmalert's routing in single-node version
* remove trailing `/` from requests
* redirect to vmalert's home page when `/vmalert` is requested.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: fix review comments
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update app/vmselect/main.go
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Use binary search instead of linear scan when locating the run of smallest timestamps
in blocks with intersected time ranges. This should improve performance
when merging blocks with big number of samples
- Skip samples with duplicate timestamps. This should increase query performance
in cluster version of VictoriaMetrics with the enabled replication.
* vmalert: deprecate alert's status link
Deprecate alert's status link `/api/v1/<groupID>/<alertID>/status` in favour of
`api/v1/alerts?group_id=<group_id>&alert_id=<alert_id>"`.
The change was needed for simplifying logic in vmselect for proxying vmalert's requests.
The old alert's status link will be still supported for a few versions but will be removed in the future.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2825
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: fix review comments
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Previously the time series could be put into dateMetricIDCache without
registering in the per-day inverted index if GetOrCreateTSIDByName
finds TSID entry in the global index. This could lead to missing
series in query results.
The issue has been introduced in the commit 55e7afae3a,
which has been included in VictoriaMetrics v1.78.0
* docs: warn about potential issue with read queries for 1.78.0
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docs: warn about potential issue with read queries for 1.78.0
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- show dates in human-readable format, e.g. 2022-05-07, instead of a numeric value
- limit the maximum length of queries and filters shown in trace messages
Production experience shows that 100k is too big for /api/v1/series .
It leads to increased CPU usage when Grafana queries /api/v1/series over VictoriaMetrics
with big number of time series during auto-completion and when modifying template variables.
Previously SearchMetricNames was returning unmarshaled metric names.
This wasn't great for vmstorage, which should spend additional CPU time
for marshaling the metric names before sending them to vmselect.
While at it, remove possible duplicate metric names, which could occur when
multiple samples for new time series are ingested via concurrent requests.
Also sort the metric names before returning them to the client.
This simplifies debugging of the returned metric names across repeated requests to /api/v1/series
querytracer has been added to the following storage.Storage methods:
- RegisterMetricNames
- DeleteMetrics
- SearchTagValueSuffixes
- SearchGraphitePaths
Previously the cache could store 10K unique regexps. When every regexp is huge (e.g. hundreds of kilobytes),
then the total cache size could grow to multiples of gigabytes. Now the cache size is limited by the total length
of all cached regexps. So huge regexps won't result in high memory usage for the cache.
alerts: filter out non error log messages for `TooManyLogs`
Info and Warn error levels aren't always a result of malfunctioning
or faulty state. So we filter them out.
If the previous dial attempt was unsuccessful, then all the new dial attempts are skipped
until the background goroutine determines that the given address can be successfully dialed.
This reduces query latency when some of vmstorage nodes are unavailable and dialing them is slow.
This should help with https://github.com/VictoriaMetrics/VictoriaMetrics/issues/711
This commit is based on ideas from the https://github.com/VictoriaMetrics/VictoriaMetrics/pull/2756
The main differences are:
- The check for healthy/unhealthy storage nodes is moved one level lower from app/vmselect/netstorage to lib/netutil.ConnPool.
This makes possible re-using this feature everywhere lib/netutil.ConnPool is used.
- The check doesn't take into account handshake errors for already established connections.
Handshake errors usually mean improperly configured VictoriaMetrics cluster, so they shouldn't be ignored.
The commit 5fb45173ae takes into account only newly registered series
when applying cardinality limits. This means that the cardinality limit could be exceeded with already registered series.
This commit returns back accounting for already registered series when applying cardinality limits.
Previously the creation of per-day indexes and global indexes
for the newly registered time series was decoupled.
Now global indexes and per-day indexes for the current day are created toghether for new time series.
This should speed up registering new time series a bit.
* vmalert: remove head of line blocking for sending alerts
This change makes sending alerts to notifiers concurrent instead
of sequential. This eliminates head of line blocking, where first
faulty notifier address prevents the rest of notifiers from
receiving notifications.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: make default timeout for sending alerts 10s
Previous value of 1m was too high and was inconsistent
with default timeout defined for notifiers via
configuration file.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: linter checks fix
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmselect: limit `end` param max value by 2d in future
The change is applied only to service handlers like `/labels` or `/series`
and limits the `end` param by max value <= now() + 2 days. The same limit
is applied for the ingested data, so no reason to allow to request data
in future far than that.
The change is also needed for corner cases like https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2669
where too high `end` value triggers inefficient global index search.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docs/CHANGELOG.md: document the bugfix
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This allows filling the seriesCountByFocusLabelValue list in the /api/v1/status/tsdb response
with label values for the specified focusLabel, which contain the highest number of time series.
TODO: add this to Cardinality explorer at VMUI - https://docs.victoriametrics.com/#cardinality-explorer
Suggest `-search.maxStalenessInterval=10s` instead of `-search.maxStalenessInterval=1ms`,
since `1ms` would result in empty graphs in most cases, since the interval between data points
on the graph is usually much higher than 1ms. For example, if the graph shows time range of one hour
and it contains 1000 points, then the interval between points on the graph would equal to
3600s/1000=3.6 seconds.
* feat: make datepicker to be set to last 30 min by default
* fix: correct spinner while loading data
* feat: change legend style
* app/vmselect: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
vmalert: support `limit` param in groups definition
`limit` param limits number of time series samples produced by a single rule
during execution.
On reaching the limit rule will return an err.
Signed-off-by: lihaowei <haoweili35@gmail.com>
The change updates histogram for registering SD update duration
only SD is considered as `active`. SD is active if at least
one scraper for this SD has started.
This change supposed to reduce metrics cardinality produced
by duration histogram which gets updated even if SD isn't configured.
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2671
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Workers count for merges affects the max part size during merges. Such behaviour
protects storage from running out of disk space for scenario when all workers
are merging parts with the max size.
This works very well for most cases. But for systems where high number of CPUs
is allocated for vmstorage components this could significantly impact the max
part size and result in more unmerged parts than expected.
While checking multiple production highly loaded setups it was discovered that
`max_over_time(vm_active_merges{type="storage/big}[1h]}"` rarely exceeds 2,
and `max_over_time(vm_active_merges{type="storage/small}[1h]}"` rarely exceeds 4.
The change in this commit limits the max value for concurrency accordingly.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Remove unused js bloatware from /targets page. This strips down binary size by more than 100Kb
- Add /service-discovery page for API compatibility with Prometheus
- Properly load bootstrap.min.css from /prometheus/targets
- Serve static contents for /targets page from app/vminsert instead of app/vmselect, because /targets page is served from there
- Take into account `ca`, `key` and `cert` values when generating string representation of TLSConfig.
Print hashes instead of real values because of security considerations.
- Properly update Config.tlsCertDigets when `key` and `cert` values are set.
This allows properly updating scrape targets after these values are updated in configs.
- Do not re-generate certificate from `key` and `cert` values per each call to getTLSCert,
because these values are immutable.
- Do not set `ca` value from `ca_file` value, so it isn't exposed at `/config` page.
- Generate proper error messages on incorrect `key`, `cert` or `ca` values.
Fixes security vulnerability in nth-check version <=1.0.2
My previous version pin was insufficient, as it was imported again through a different (svgo -> css-select).
- data modification
- more applications for gauges
- recommendations for instrumenting app with metrics
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: properly updates discovered scrape works
previously, added or updated scrapeworks may override previuosly
discovered.
it happens because swosByKey may contain small subset of kubernetes
objects with it's labels.
It happens for objectsUpdated and objectsAdded maps, which include only changed elements
* Properly calculate vm_promscrape_discovery_kubernetes_scrape_works
Co-authored-by: f41gh7 <nik@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The new metric shows the configured evaluation interval per group.
Metric updates its value when group's interval is changed during
hot reload.
The new metric can be used to estimate how close group
is to start missing evaluation rounds. The following query
will show the % of used time by the group to evaluate all rules
before the next round:
```
(max(vmalert_iteration_duration_seconds{quantile="0.99"}) / vmalert_iteration_interval_seconds) * 100
```
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2618
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This adds the ability to utilize sigv4 signing for all AWS services not
just "aps". When the newly introduced property "service" is not set it
will default to "aps".
Signed-off-by: Boris Petersen <boris.petersen@idealo.de>
docs: update docs for beginners
QuickStart page was updated with more relevant information.
Key Concepts was added to cover basics for the VictoriaMetrics.
Unexpectedly, Grafana makes an extra request to `/rules`
handler in addition to `/api/v1/rules` calls in alerts UI.
This happens only for Grafana versions older than 8.5.*.
Apparently, this is related to support of other monitoring
systems.
Prometheus responds with `text/html` content for UI page `/rules`
to such requests. Actually, returning just a blank page with
SC=200 works as well.
Returning actual response of `/api/v1/rules`
results in error in Grafana since it expects a `yaml` (?) in response.
So we add a placeholder to `vmalert`.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2583
Signed-off-by: hagen1778 <roman@victoriametrics.com>
For liquid text processor double braces `{{` `}}`
are special chars for templating.
Since we use them in some of our docs with different purpose,
we must escape them to avoid syntax errors from liquid.
For escaping curly braces we use bult-in plugin which helps
to enclose sections of text via `{% raw %}` and `{% endraw %}`.
This approach prevents liquid syntax errors and makes render correct.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Rules executor within group tracks series sent to remote write
in order to mark them as stale if they had disappeared in next
evaluation round.
The executor uses rules ID as a key to identifies series which belong to rule.
On config reload, executor remains active but the set of rules could change.
Hence, we need to properly cleanup the tracker for rules which has been disappeared
on config reload.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* deployment/docker: pass `-buildvs=false` to `go build` for production builds
This should resolve the `error obtaining VCS status: exit status 128` error
when the environment contains incorrect version of git or has incorrect access rights
to the directory with VictoriaMetrics source code.
See the following links for additional info:
- https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2508#issuecomment-1117126702 ,
- https://github.com/google/ko/issues/672
- https://github.com/golang/go/issues/49004
* lib/netutil: limit the number of concurrently established connections when calling ConnPool.Get()
This should reduce potential spikes in the number of established connections in the following cases:
- when the connection establishing procedure becomes temporarily slow
- after a temporary spike in the rate of ConnPool.Get() calls
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2552
* docs/CHANGELOG.md: document c8af625bcc
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1322#issuecomment-1120276146
* docs/Cluster-VictoriaMetrics.md: typo fix: `by by` -> `by`
* docs: add `resource usage limits` docs, which describe fine-grained tuning for various resource usage limits
* docs/Cluster-VictoriaMetrics.md: the `/api/v1/label/.../values` query can take CPU and ram at both vmstorage and vmselect
* Update root Readme and root vmagent readme
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This should reduce potential spikes in the number of established connections in the following cases:
- when the connection establishing procedure becomes temporarily slow
- after a temporary spike in the rate of ConnPool.Get() calls
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2552
* vmalert: calculate time for firing alert based on the given timestamp
Previously, current time was used for checking the `firing` threshold.
This is not correct, since alerts are evaluated at specific timestamps.
Hence, this specific timestamp supposed to be used in the calculation.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly calculate evaluation timestamp for rules
Timestamp for rules evaluation should be calculated after
the artifical delay for groups start. Otherwise, evaluation
timestamp can fall back too far in time.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Make it possible to migrate timeseries while restoring the
original timeseries name previously written from Prometheus
to InfluxDB v1 via remote_write.
Fixes: https://github.com/VictoriaMetrics/vmctl/issues/8
Previously ScrapeConfig.clone() was improperly copying promauth.Secret fields -
their contents was replaced with `<secret>` value.
This led to inability to use passwords and secrets in `-promscrape.config` file.
The bug has been introduced in v1.77.0 in the commit 67b10896d2
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2551
Do not assume the db label to be the last one and also
make sure we are not skipping it and everything afterwards.
Breaking the loop would cause following labels to be empty.
* {lib/promscrape,app/vmagent}: adds sigv4 support for vmagent remoteWrite
moves aws related code into separate lib from lib/promscrape
it allows to write data from vmagent to the AWS managed prometheus (cortex)
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1287
* Apply suggestions from code review
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
vmctl: fix vmctl blocking on process interrupt
This change prevents vmctl from indefinite blocking on
receiving the interrupt signal. The update touches all
import modes and suppose to improve tool reliability.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2491
This adds a metric for the rate limit.
The limit is present as a flag currently:
`flag{name="remoteWrite.rateLimit", value="500000", is_set="true"} 1`
We are running many instances of vmagent and when creating alerts it is harder than it needs to be when extracting the value from the flag.
With this change it should be easier to monitor how close to the limit we are.
`((100/vmagent_remotewrite_rate_limit{account="account"})*sum (rate(vmagent_remotewrite_conn_bytes_written_total{account="account"}))) and ON (account) flag{name="remoteWrite.rateLimit"} == 1`
Function `ValidateTemplates`, used on the vmalert startup,
is supposed to check whether used templates and functions
in loaded rules are correct. The function was parsing
and executing loaded templates.
However, rules may contain functions which can't be executed
without values (label values or query results), like `slice`.
Because of this, validation for completely valid expression
`{{ slice $labels.job 9 }}` will fail since `$labels.job`
is empty during validation.
This PR updates `ValidateTemplates` function to only parse
templates without executing them.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2514
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/{storage,flagutil} - Add option for snapshot autoremoval
- add prometheus-like duration as command flag
- add option to delete stale snapshots
- update duration.go flag to re-use own code
* wip
* lib/flagutil: re-use Duration.Set() call in NewDuration
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
add progress bars to the VM importer
The new progress bars supposed to display the processing speed per each
VM importer worker. This info should help to identify if there is a bottleneck
on the VM side during the import process, without waiting for its finish.
The new progress bars can be disabled by passing `vm-disable-progress-bar` flag.
Plotting multiple progress bars requires using experimental progress bar pool
from github.com/cheggaaa/pb/v3. Switch to progress bar pool required changes
in all import modes.
The openTSDB mode wasn't changed due to its implementation, which implies individual progress
bars per each series. Because of this, using the pool wasn't possible.
Signed-off-by: dmitryk-dk <kozlovdmitriyy@gmail.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
* Export "null" in jsonl instead of NaN
The NaN appeared because of staleness markers that were added for compatibility. I think it's better to use json `null`, implemented here.
Also maybe it also makes sense to add a flag like `?skip-staleness-markers=true` to `/export`, to skip nulls at all?
* Update app/vmselect/prometheus/export.qtpl
* app/vmselect/prometheus/export.qtpl.go: `make quicktemplate-gen`
* docs/CHANGELOG.md: document the change
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* app/vmselect: adds API /api/v1/status/buildinfo
it should fix an compability error with grafana 8.5 prometheus datasource
https://github.com/grafana/grafana/pull/46771
* Update main.go
* dashboards: remove index filter from stats panel for DiskUsage
The diskUsage stats panel was showing disk usage without including
size of the index, which is not correct. The filter was removed
to reflect the total disk usage.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2368
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: add adhoc filter to dasbhoard variables
The adhoc filter allows to quickly apply global filters without
modifying the panels.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: add new panel `IndexDB items rate`
The new panel supposed to reflect the pressure on indexDB
caused by churn rate or new series registration.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: rm "Deferred merges" panel since it could be misleading
See more context here https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1682#issuecomment-938608067
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: replace fixed interval of `5m` for `rate` expressions
Before we used fixed `5m` interval for expressions with `rate` func.
Unfortunately, this interval wasn't a fit for all the cases. So we
switch to `$__rate_interval` instead.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: bump version requirement
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: rm `vm_indexdb_items_added_size_bytes_total` expression
Rate over `vm_indexdb_items_added_size_bytes_total` doesn't seem to be useful
on the dasbhoard panel.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This should prevent from `duplicate time series` errors when executing the following query:
kube_pod_container_resource_requests{resource="cpu"} * on (namespace,pod) group_left() (kube_pod_status_phase{phase=~"Pending|Running"}==1)
where `kube_pod_status_phase{phase=~"Pending|Running"}==1` filters out diplicate time series
replaces internStringMap with sync.Map - it greatly reduces lock contention
concurently reload scrape work for api watcher - each object labels added by dedicated CPU
changes can be tested with following script https://gist.github.com/f41gh7/6f8f8d8719786aff1f18a85c23aebf70
Reduce the number of memory allocations in this function. This improves its performance by up to 50%.
This should improve service discovery speed when big number of potential targets with big number of meta-labels
are generated by service discovery.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2270
This reduces the number of allocations and improves the performance for updating dropped targets' map.
This map is exposed at /api/v1/targets as in droppedTargets list.
* lib/promscrape: adds job restart method
it must restart only ScrapeConfig with changed content
this change greatly reduce time, that needed for job restart
and it should decrease possible data loss when config frequently changed at kubernetes based deployments
Apply suggestions from code review
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* wip
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/httpserver: added tlsCipherSuites flag
* lib/httpserver: compare lower case strings
* lib/httpserver: use EqualFold
* lib/httpserver: used flagutil.NewArray, supported only strings cipher suites
* lib/httpserver: updated flag description, added flag to documentation
* Update lib/httpserver/httpserver.go
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This change fixes incorrect marshalling for ScrapeConfig
it affects http endpoint and ScrapeConfig checksum.
With omitempty, custom Marshaller is not called if field is not a pointer.
Previously this issue happened at vmalert
* lib/promscrape: allows to use k8s pod name as clusterMemberNum
it must improve user expirience and simplify clustering scrapers.
it must allow to use vmagent cluster with distroless images
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2359
* Apply suggestions from code review
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
previously, if native block cannot be unmarshaled, wg.Done wasn't called by unmarshal work.
It leads to connection blocking and possible dead-lock at client side
Before, relabeling for notifier configured via file was supported
only for target labels discovered via SD.
With this change, new config field `alert_relabel_configs` is introduced
for applying relabeling to labels of sent alerts.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This metric is equivalent to `vm_available_memory_bytes`, but it has better name,
since the metric is related to a process, not VictoriaMetrics itself.
Leave `vm_available_memory_bytes` for backwards compatibility.
To improve compatibility with Prometheus alerting the order of
templates processing has changed.
Before, vmalert did all labels processing beforehand. It meant
all extra labels (such as `alertname`, `alertgroup` or rule labels)
were available in templating. All collisions were resolved in favour
of extra labels.
In Prometheus, only labels from the received metric are available in
templating, so no collisions are possible.
This change makes vmalert's behaviour similar to Prometheus.
For example, consider alerting rule which is triggered by time series
with `alertname` label. In vmalert, this label would be overriden
by alerting rule's name everywhere: for alert labels, for annotations, etc.
In Prometheus, it would be overriden for alert's labels only, but in annotations
the original label value would be available.
See more details here https://github.com/prometheus/compliance/issues/80
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This should improve the performance for items sorting inside inmemoryBlock.MarshalUnsortedData
if they have common prefix.
While at it, improve the performance for inmemoryBlock.updateCommonPrefix for sorted items.
This should improve performance for inmemoryBlock.MarshalSortedData during background merge.
This reduces memory usage under production workloads by up to 10%,
while CPU spent on GC remains roughly the same.
The CPU spent on GC can be monitored with go_memstats_gc_cpu_fraction metric
The following actions are taken:
- Increase the TLS hashdshake timeout from 5 seconds to 10 seconds
- Increase dial timeout from 5 seconds to 30 seconds
- Specify DialContext instead of Dial in http.Transport. This allows properly handling
the Context arg during dialing the remote storage
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1699
* lib/protoparser: changes ParseStream for native format
uses reader instead of http.Request
updates app/vmagent and app/vmagent method usage
* app/vmctl: add verify-block subcommand
it allows to check exported from VictoriaMetrics data block in native format
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2362
Update app/vmctl/README.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
The new flag `datasource.disableKeepAlive` allows disabling keepalive
connections. This may be useful if there are multiple datasource
replicas (e.g. vmselects) behind the HTTP balancer to avoid uneven
load spread because of long-lived connections.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Executor recently gain field for storing previously sent series.
Since the same executor object can be used in multiple goroutines,
the access to this field should be serialized.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: split alert's `Start` field into `ActiveAt` and `Start`
The `ActiveAt` field identifies when alert becomes active for rules
with `for > 0`. Previously, this value was stored in field `Start`.
The field `Start` now identifies the moment alert became `FIRING`.
The split is needed in order to distinguish these two moments
in the API responses for alerts.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: support specific moment of time for rules evaluation
The Querier interface was extended to accept a new argument
used as a timestamp at which evaluation should be made.
It is needed to align rules execution time within the group.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: mark disappeared series as stale
Series generated by alerting rules, which were sent to remote write
now will be marked as stale if they will disappear on the next
evaluation. This would make ALERTS and ALERTS_FOR_TIME series
more precise.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: evaluate rules at fixed timestamp
Before, time at which rules were evaluated was calculated
right before rule execution. The change makes sure
that timestamp is calculated only once per evalution round
and all rules are using the same timestamp.
It also updates the logic of resending of already resolved
alert notification.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: allow overridin `alertname` label value if it is present in response
Previously, `alertname` was always equal to the Alerting Rule name. Now,
its value can be overriden if series in response containt the different value
for this label.
The change is needed for improving compatibility with Prometheus.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: align rules evaluation in time
Now, evaluation timestamp for rules evaluates as if
there was no delay in rules evaluation. It means, that
rules will be evaluated at fixed timestamps+group_interval.
This way provides more consistent evaluation results and
improves compatibility with Prometheus,
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: add metric for missed iterations
New metric `vmalert_iteration_missed_total` will show
whether rules evaluation round was missed.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: reduce delay before the initial rule evaluation in group
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: rollback alertname override
According to the spec:
```
The alert name from the alerting rule (HighRequestLatency from the example above) MUST be added to the labels of the alert with the label name as alertname. It MUST override any existing alertname label.
```
https://github.com/prometheus/compliance/blob/main/alert_generator/specification.md#step-3
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: throw err immediately on dedup detection
```
The execution of an alerting rule MUST error out immediately and MUST NOT send any alerts
or add samples to samples receiver if there is more than one alert with the same labels
```
https://github.com/prometheus/compliance/blob/main/alert_generator/specification.md#step-4
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: cleanup
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: use strings builder to reduce allocs
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/{storage,regexpcache}: replaces regexpCacheMap with LRU cache
It should decrease memory usage for regexp caching
with storing cacheEntry by pointer - golang map should be able to effectivly shrink it's size
original issue with this case - unexpected map grows and storage OOM
Apply suggestions from code review
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
Adds missing metrics for regexp cache and regexpPrefixes cache
* wip
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* Update url-examples.md
+additional examples
* Update
* Update url-examples
Some changes requested by Roman.
* Update url-examples.md
* Update url-example
* Update url-examples
Additional info and marking for /labels part
* Update url-example
Added example with complex query which needs encoding:
How to execute the query similar to - sum(increase(foo{status="bar"}[5m])) by (status)
* Update url-samples
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* fix for issue 2255 - matchTagFilters for positive empty-match filters
* add example to comments
* formatting
* add test for positive empty match
* formatting
* vmalert: add support for `sortByLabel` template function
* vmalert: update API according to Prometheus conformance program
The changes to the API, field names and URL path has been made
according to the Prometheus specification for `alert_generator`
https://github.com/prometheus/compliance/blob/main/alert_generator/specification.md
* vmalert: fix the timestamp of the evaluated rules
The timestamp used for alert's `EndsAt` was calculated
before sending the notification. While the correct way
is to use the timestamp taken right before rules evaluation.
* vmalert: add `-datasource.queryTimeAlignment` flag
The flag is supposed to provide ability to disable `time`
param alignment when executing rules. By default, this flag
is enabled, so it remains backward compatible.
The flag was introduced to achieve better compatibility
with Prometheus behaviour according to https://github.com/prometheus/compliance/blob/main/alert_generator/specification.md
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Some of the releases could negatively affect performance for a limited
period of time due to some changes in core. Update details are meant to
warn users about expected changes in peformance after the update.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The lifetime of storageBlock is much shorter comparing to the lifetime of inmemoryPart,
so sync.Pool usage should reduce overall memory usage and improve performance
because of better locality of reference when marshaling inmemoryBlock to inmemoryPart.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2247
There is no need to sort the underlying data according to sorted items there.
This should reduce cpu usage when registering new time series in `indexdb`.
Thanks to @ahfuzhang for the suggestion at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2245
* fixes jwt token parse with correct base64Url decoding
it must be applied according to jwt RFC that requires token to be URL safe
added slow path for decoding tokens with std base64 decoding
adds error logging for vmgateway
* docs/CHANGELOG.md: document the bugfix
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/discovery/consul: update services on the watcher's start
Previously, watcher's start was only initing goroutines for discovery
but not waiting for the first iteration to end. It means first Consul
discovery wasn't returning discovered targets until the next iteration.
The change makes the watcher's start blocking until we get first discovery
iteration done and all registries updated.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: remove workarounds for consul SD
Now when consul SD lib properly updates services
on the first start, we don't need workarounds in vmalert.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/discovery/consul: update after review
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* added missed runbook for udpating k8s VM Cluster in DO
* Update deployment/marketplace/digitialocean/one-click-droplet/RELEASE_GUIDE.md
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Lowering threshold from 50% to 5% will be more sufficient
for discovering un-healthy system state. It also goes in
sync with alert definition in cluster branch.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The new row Caches adds more visibility for cache utilization by VM.
It replaces the old `Cache size` panel.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: plot cpu limits for vmagent, vmalert and vm-single dashboards
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* alerts: add `TooHighCPUUsage` alert for all VM components
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: bump components version requirements
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Remove unneeded dependency on `numeral` package
* Properly parse numbers obtained from /api/v1/query_range according to
https://prometheus.io/docs/prometheus/latest/querying/api/#expression-query-result-formats
* Optimize updating processing the received data from /api/v1/query_range
* Make smoother zoom on `ctrl+scroll`
* Reduce the number of points received from /api/v1/query_range by 2x in order to reduce load on backend
- Postpone the pre-poulation to the last hour of the current day. This should reduce the number
of useless entries in the next per-day index, which shouldn't be created there,
when the corresponding time series are stopped to be pushed during the current day.
- Make the pre-population more smooth in time by using the hash of MetricID instead of MetricID itself
when calculating the need for for the given MetricID pre-population.
- Sync the logic for pre-population of the next day inverted index with the logic of pre-populating tsid cache
after indexdb rotation. This should improve code maintainability.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/430
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401
* lib/index: reduce read/write load after indexDB rotation
IndexDB in VM is responsible for storing TSID - ID's used for identifying
time series. The index is stored on disk and used by both ingestion and read path.
IndexDB is stored separately to data parts and is global for all stored data.
It can't be deleted partially as VM deletes data parts. Instead, indexDB is
rotated once in `retention` interval.
The rotation procedure means that `current` indexDB becomes `previous`,
and new freshly created indexDB struct becomes `current`. So in any time,
VM holds indexDB for current and previous retention periods.
When time series is ingested or queried, VM checks if its TSID is present
in `current` indexDB. If it is missing, it checks the `previous` indexDB.
If TSID was found, it gets copied to the `current` indexDB. In this way
`current` indexDB stores only series which were active during the retention
period.
To improve indexDB lookups, VM uses a cache layer called `tsidCache`. Both
write and read path consult `tsidCache` and on miss the relad lookup happens.
When rotation happens, VM resets the `tsidCache`. This is needed for ingestion
path to trigger `current` indexDB re-population. Since index re-population
requires additional resources, every index rotation event may cause some extra
load on CPU and disk. While it may be unnoticeable for most of the cases,
for systems with very high number of unique series each rotation may lead
to performance degradation for some period of time.
This PR makes an attempt to smooth out resource usage after the rotation.
The changes are following:
1. `tsidCache` is no longer reset after the rotation;
2. Instead, each entry in `tsidCache` gains a notion of indexDB to which
they belong;
3. On ingestion path after the rotation we check if requested TSID was
found in `tsidCache`. Then we have 3 branches:
3.1 Fast path. It was found, and belongs to the `current` indexDB. Return TSID.
3.2 Slow path. It wasn't found, so we generate it from scratch,
add to `current` indexDB, add it to `tsidCache`.
3.3 Smooth path. It was found but does not belong to the `current` indexDB.
In this case, we add it to the `current` indexDB with some probability.
The probability is based on time passed since the last rotation with some threshold.
The more time has passed since rotation the higher is chance to re-populate `current` indexDB.
The default re-population interval in this PR is set to `1h`, during which entries from
`previous` index supposed to slowly re-populate `current` index.
The new metric `vm_timeseries_repopulated_total` was added to identify how many TSIDs
were moved from `previous` indexDB to the `current` indexDB. This metric supposed to
grow only during the first `1h` after the last rotation.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/promscrape: support prometheus-like duration in scrape configs
The change allows to specify duration values like `1d`, `1w`
for fields `scrape_interval`, `scrape_timeout`, etc.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/817#issuecomment-1033384766
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/blockcache: make linter happy
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/promscrape: support prometheus-like duration in scrape configs
* add support for extra fields `scrape_align_interval` and `scrape_offset`;
* support Prometheus duration parsing for `__scrape_interval__`
and `__scrape_duration__` labels;
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
* wip
* docs/CHANGELOG.md: document the feature
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* adds CGO build for arm64
it must improve performance for arm64 based deployments of vmstorage and
vmsingle for 15-20%
it depends on gozstd package update for correct musl gozstd vendoring
* typo fixes
* docs/CHANGELOG.md: document the change
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This should improve data ingestion speed if time series samples are ingested with interval bigger than 2 minutes.
The actual interval could exceed 2 minutes if the original interval between samples doesn't exceed 2 minutes
in the case of slow inserts. Slow inserts may appear in the following cases:
* Big number of new time series are pushed to VictoriaMetrics, so they couldn't be registered in 2 minutes.
* MetricName->tsid cache reset on indexdb rotation or due to unclean shutdown.
In this case VictoriaMetrics needs to load MetricName->tsid entries for all the incoming series from IndexDB.
IndexDB uses the block cache for increasing lookup performance. If the cache has no the needed block,
then IndexDB reads and unpacks the block from disk. This requires an extra disk read IO and CPU.
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2007
This also should increase performance for periodically executed queries with intervals from 2 minutes to 5 minutes.
See the previous similar commit - 43103be011
It is possible that the timeout can be increased further. Let's collect production numbers for this change
so the timeout could be adjusted further.
Previously limits for new caches were taken from cache stats.
These limits could mismatch the original limits. This could result in failed cache load
if the stored cache has been created with the limits obtained from cache stats.
* Mention about the ability to configure vmalert notifiers via files in docs/CHANGELOG.md
* Mention about the ability to use Consul service discovery for vmalert notifiers in docs/CHANGELOG.md
* Run `make docs-sync` in order to sync app/vmalert/README.md to docs/vmalert.md
vmalert: support configuration file for notifiers
* vmalert notifiers now can be configured via file
see https://docs.victoriametrics.com/vmalert.html#notifier-configuration-file
* add support of Consul service discovery for notifiers config
see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1947
* add UI section for currently loaded/discovered notifiers
* deprecate `-rule.configCheckInterval` in favour of `-configCheckInterval`
* add ability to suppress logs for duplicated targets for notifiers discovery
* change behaviour of `vmalert_alerts_send_errors_total` - it now accounts
for failed alerts, not HTTP calls.
This metric shows the number of CPU cores available to the process.
This allows creating alerting rules on CPU saturation with the following query:
rate(process_cpu_seconds_total[5m]) / process_cpu_cores_available > 0.9
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2107
* optimized code ,because only the first error,so no need var errors []error
* optimized code ,because only the first error,so no need var errors []error
Co-authored-by: lirenzuo <lirenzuo@shein.com>
* FAQ update: how downsampling and deduplication will work at the same time
* Update docs/FAQ.md
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
This adds more optimization cases for https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusLabelNonOptimization
For example:
* Multi-level transform functions. For example, abs(round(foo{a="b"})) + bar{x="y"}
is now optimized to abs(round(foo{a="b",x="y"})) + bar{a="b",x="y"}
* Binary operations with `on()`, `without()`, `group_left()` and `group_right()` modifiers.
For example, foo{a="b"} on (a) + bar is now optimized to foo{a="b"} on (a) + bar{a="b"}
* Multi-level binary operations. For example, foo{a="b"} + bar{x="y"} + baz{z="q"}
is now optimized to foo{a="b",x="y",z="q"} + bar{a="b",x="y",z="q"} + baz{a="b",x="y",z="q"}
* Aggregate functions. For example, sum(foo{a="b"}) by (c) + bar{c="d"}
is now optimized to sum(foo{a="b",c="d"}) by (c) + bar{c="d"}
Previously bytesutil.Resize() was copying the original byte slice contents to a newly allocated slice.
This wasted CPU cycles and memory bandwidth in some places, where the original slice contents wasn't needed
after slize resizing. Switch such places to bytesutil.ResizeNoCopy().
Rename the original bytesutil.Resize() function to bytesutil.ResizeWithCopy() for the sake of improved readability.
Additionally, allocate new slice with `make()` instead of `append()`. This guarantees that the capacity of the allocated slice
exactly matches the requested size. The `append()` could return a slice with bigger capacity as an optimization for further `append()` calls.
This could result in excess memory usage when the returned byte slice was cached (for instance, in lib/blockcache).
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2007
- Optimize Cache.RemoveBlocksFromPart(), so it doesn't need to iterate over all the cached blocks.
- Cache blocks if there were no cache misses during the last 2 minutes.
This may be the case when new blocks are added simultaneously to the storage and to the cache.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2007
* fix: add date validate for time range
* app/vmselect/vmui: `make vmui-update`
* docs/CHANGELOG.md: document the bugfix
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Previously these caches could exceed limits set via `-memory.allowedPercent` and/or `-memory.allowedBytes`,
since limits were set independently per each data part. If the number of data parts was big, then limits could be exceeded,
which could result to out of memory errors.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2007
* feat: replace @codemirror to text field
* feat: switch to Preact from React
* fix: optimize mui imports
* fix: remove unused vars
* update package-lock.json
* app/vmselect/vmui: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Document the bugfix at docs/CHANGELOG.md
- Set __address__ field after copying commonLabels to the resulting map of discovered labels.
This makes sure that the correct __address__ label is used.
This reverts commit 2104330d4c.
This check doesn't work well for community pull requests, since third-party users
aren't motivated to rebase pull requests to branch head after they are created.
This check is useful for private repositories though.
* Simplify queries to OpenTSDB (and make them properly appear in OpenTSDB query stats) and also tweak defaults a bit
* Convert seconds to milliseconds before writing to VictoriaMetrics and increase subquery size
Signed-off-by: John Seekins <jseekins@datto.com>
On import process interruption `vmctl` now prints the max and min timestamps of:
* last failed batch if import ended with error;
* last sent batch if import was cancelled by user.
To get more details for each timeseries in batch user needs to specify `--verbose` flag.
The change does not relate to `vm-native` mode, since `vmctl` has no control over
transferred data in this mode.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1236
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It will prevent merging in a branch that's not based on its base branch HEAD, leading to streamlined history.
Note it will not prevent squash commits, nor commits directly to base branch.
* feat: add a reset query by clicking the logo
* feat: add sequence number for query fields
* feat: invert behavior on the graph's legend
* app/vmselect/vmui: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* vmalert: check if remoteWrite is configured for replay mode
The purpose of `replay` mode is to backfill results of recording
or alerting rules. So `remoteWrite.url` should be required.
Otherwise, process can fail on attempt to send data.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update app/vmalert/main.go
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* vmagent: add error log for skipped data block when rejected by receiving side
Previously, rejected data blocks were silently dropped - only metrics were update.
From operational perspective, having an additional logging for such cases is preferable.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1911
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmagent: throttle log messages about skipped blocks
The new type of logger was added to logger pacakge.
This new type supposed to control number of logged messages
by time.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/logger: make LogThrottler public, so its methods can be inspected by external packages
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This should handle the case when the original job_name has been changed in -promscrape.config ,
while the resulting job label remains the same because it is overriden via relabeling.
* fix: remove disabling custom step when zooming
* feat: add a dynamic calc of the width of the graph
* fix: add validate y-axis limits
* fix: correct axis limits for value 0
* fix: change logic create time series
* fix: change types for tooltip
* fix: correct points on the line
* fix: change the logic for set graph width
* fix: stop checking the period when auto-refresh is enabled
* app/vmselect/vmui: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* dashboards/vmsingle: add "Merges deferred" panel
The new panel supposed to show if there were deferred merges
due to insufficient disk space.
It goes within alerting rule which suppose to send a signal
in such cases.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmsingle: add "Cache usage" panel
The new panel supposed to show the % of the used cache
compared to allowed size by type.
It should help to determine underutilized types of caches.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmsingle: bump version requirement
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmsingle: rm alert for `vm_merge_need_free_disk_space`
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update FAQ.md
Adding explanation "Why do same metrics have differences in VictoriaMetrics and Prometheus dashboards?"
* Update FAQ
* Update FAQ.md
* Apply suggestions from code review
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
* Document changes_prometheus(), increase_prometheus() and delta_prometheus() functions.
* Simplify their implementation
* Mention these functions in docs/CHANGELOG.md
* dashboards/vmagent: shuffle panels for better visibility
More important error/dropped panels were moved higher on the main row.
Network usage panel moved to Resource usage row.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmagent: add Troubleshooting row to show top 5 instances/jobs by churn rate
New panels are supposed to show top 5 jobs or targets which generate the most
of the churn rate. They were placed into a new row "Troubleshooting".
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmagent: add panels for showing persistent queue saturation
New panels were added to Torubleshooting row to show the persistent queue
saturation. The corresponding alerts were added and linked to these
panels as well.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmagent: add alert "RejectedRemoteWriteDataBlocksAreDropped"
New alert suppose to send a notification when vmagent starts to drop
data blocks rejected by configured remote write destiantion.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Simplify queries to OpenTSDB (and make them properly appear in OpenTSDB query stats) and also tweak defaults a bit
Signed-off-by: John Seekins <jseekins@datto.com>
When using `vmalert` with older Prometheus versions, the passed
`step=2m` may be parsed by Prometheus with an err: "cannot parse \"2m0s\" to a valid duration".
In order to improve compatibility vmalert will always convert step duration to seconds.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1943
Signed-off-by: hagen1778 <roman@victoriametrics.com>
For example, `{__graphite__=~"foo.(bar|baz)"}` is automatically converted to `{__graphite__=~"foo.{bar,baz}"}` before execution.
This allows using multi-value Grafana template variables such as `{__graphite__=~"foo.($app)"}`.
* feat: add a label for the Query field
* fix: change zoom position
* fix: add description and error code to alerts
* fix: correct logic query history
* fix: correct update query history
* feat: add custom step
* update package-lock.json
* docs: document that VMUI now supports overriding of `step` query arg, which is passed to `/api/v1/query_range`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Service labels like `alertname` or `alertgroup` were attached
after template expanding for `labels` section. Because of this,
labels `alertname` or `alertgroup` weren't available for templating
in `labels` section of alert's definition.
This commit changes the order of labels attaching and adds a test
for verifying these labels availability.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1921
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* feat: change duration by "enter"
* fix: optimize data processing for chart
* feat: set minimum step to 1ms
* update dependencies
* feat: remove save the last query to local storage
* fix: handle an error in a table with subqueries
* feat: store display type in URL
* Revert "feat: store display type in URL"
This reverts commit ccc242c69a.
* feat: store display type in URL
* refactor: move the time setting to a folder
* refactor: move the query configurator to a folder
* refactor: move the auth settings to a folder
* feat: improve styles
* feat: add multi query
* update package-lock
* feat: add display multiple queries
* feat: add limits for multiple queries
* update dependencies
* feat: add history for multiple queries
* feat: add line type to legend
* feat: change style for switch
* feat: change the logic for axes limits for multiple queries
* update package-lock.json
* update dependencies
* feat: add the filter to legend
* wip
* lib/httpserver: add missing 127.0.0.1 hostname to the logged address for http and pprof server if the address starts with ':'
This allows copy-pasting the url to http server from logs.
* lib/httpserver: add missing 127.0.0.1 hostname to the logged address for http and pprof server if the address starts with ':'
This allows copy-pasting the url to http server from logs.
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* Document the ability to specify http or https urls in `-auth.config` at docs/CHANGELOG.md
* Move the ReadFileOrHTTP to lib/fs, so it can be re-used in other places where a file
should be read from the given path. For example, in `-promscrape.config` at `vmagent`.
* add support for reading remote auth_config file via http
* fix lint
* fix defer on close body
Co-authored-by: Tiago Magalhães <tmagalhaes@wavecom.pt>
* vmalert: introduce additional HTTP URL params per-group configuration
The new group field `params` allows to configure custom HTTP URL params
per each group. These params will be applied to every request before
executing rule's expression. Hot config reload is also supported.
Field `extra_filter_labels` was deprecated in favour of `params` field.
vmalert will print deprecation log message if config file contains
the deprecated field.
`params` fields are supported by both Prometheus and Graphite datasource types.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: provide more examples for `params` field
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: set higher priority for `params` setting
If there would be a conflict between URL params set in `datasource.url` flag
and params in group definition the latter will have higher priority.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The bump was required for `vmalert` package.
`vmalert` docs now also contain an updated description.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The vm_cache_size_max_bytes metric can be used for determining caches which reach their capacity via the following query:
vm_cache_size_bytes / vm_cache_size_max_bytes > 0.9
* removes FileSize from backup part key
it should fix download restoration for backups
* Update lib/backup/common/part.go
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The exported data isn't de-duplicated by default due to performance reasons.
It is expected that the de-duplication is applied during importing the exported data.
The deduplication is applied only when exporting data via /api/v1/export if `reduce_mem_usage=1` query arg isn't passed to the request.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1837
Previously, vmalert would print an err message and set vmalert_config_last_reload_successful=0
only once during a hot reload of a bad config. Such behaviour may result into non noticed
event of a bad config reload attempt
Now, it continues to print error messages and keep vmalert_config_last_reload_successful state
until successful attempt will be made or config state will be rolled back to prev state.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
For a long time notifier.Addr flag was required. The assumption was that vmalert will
be always used for alerting. However, practice shows that some users need only
recording rules. In this case, requirement of notifier.Addr is ambigious.
The change verifies if loaded config contains recording or alerting rules and
if there are corresponding flags set. This is true for initial config load
and hot reload.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* fix: handle an error in a table with subqueries
* feat: store display type in URL
* Revert "feat: store display type in URL"
This reverts commit ccc242c69a.
* Simplify queries to OpenTSDB (and make them properly appear in OpenTSDB query stats) and also tweak defaults a bit
Signed-off-by: John Seekins <jseekins@datto.com>
* remove extraneous printlns
Signed-off-by: John Seekins <jseekins@datto.com>
* remove empty line
Signed-off-by: John Seekins <jseekins@datto.com>
* fix bug in offset calcuation and closer to working with simpler queries
Signed-off-by: John Seekins <jseekins@datto.com>
* fix boolean eval
Signed-off-by: John Seekins <jseekins@datto.com>
* fix casting and check for multiple series
Signed-off-by: John Seekins <jseekins@datto.com>
Describe remoteWrite.url is used to persist rules and alerts state info,
and add an additional paragraph explaining the separation between
-remoteRead.url and -datasource.url.
Fixes#1810.
* feat: add query history
* fix: change detect keyUp for nav query history
* feat: set default query history
* feat: change graph legend
* update dependencies
* update codemirror version
* fix: correct update period time after zoom/pan
* fix: optimize data processing for the graph
* fix: eliminate memory leaks related to mouse events
* fix: correct display of straight line
* Merge branch 'master' into vmui-fix-reset-graph
* app/vmselect: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This removes the unneeded level of indirection and improves code readability.
The "prometheus" and "graphite" constants aren't going to change in the future, so there is no sense in hiding them behind constants.
This should improve the maximum data ingestion speed for highly-loaded vmagent instances
which run on beefy servers with many CPU cores and big amounts of RAM
Previously only the lower part of 64-bit hash was used for calculating the offset.
This may give uneven distribution in some cases. So let's use all the available 64 bits from the hash
for calculating the offset.
Prometheus allows to have groups with no rules, so we should support
it in vmalert as well for compatibility reasons.
It is also allowed to hot-reload empty groups by adding or removing rules.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Do not store in memory the response from the last scrape per each target if -promscrape.noStaleMarkers option is enabled.
This should reduce memory usage when the scraped targets return large responses.
Previously, ID for alert entity was generated without alertname or groupname.
This led to collision, when multiple alerting rules within the same group
producing same labelsets. E.g. expr: `sum(metric1) by (job) > 0` and
expr: `sum(metric2) by (job) > 0` could result into same labelset `job: "job"`.
The issue affects only UI and Web API parts of vmalert, because alert ID is used
only for displaying and finding active alerts. It does not affect state restore
procedure, since this label was added right before pushing to remote storage.
The change now adds all extra labels right after receiving response from the datasource.
And removes adding extra labels before pushing to remote storage.
Additionally, change introduces a new flag `Restored` which will be displayed in UI
for alerts which have been restored from remote storage on restart.
* adds tab as second separator for graphite text protocol
* changes indexFunc for indexAny
* Update lib/protoparser/graphite/parser_test.go
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* feat: add query history
* fix: change detect keyUp for nav query history
* feat: set default query history
* app/vmselect/vmui: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This should make visible the set flags at flag.Visit(), which is used later for logging
and exporting the `is_set` label for these flags at /metrics page
Commit fixes potential race condition when group update
and generating of ID() happens simultaneously.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Regression was introduced during code refactoring. It potentially
could lead to situation when SIGHUP signals were ignored while
vmalert was still busy with initing group manager.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The following errors:
vendor/cloud.google.com/go/storage/storage.go:1447:53: o.GetCustomerEncryption().GetKeySha256 undefined (type *"google.golang.org/genproto/googleapis/storage/v2".Object_CustomerEncryption has no field or method GetKeySha256)
vendor/cloud.google.com/go/storage/writer.go:439:10: q.GetCommittedSize undefined (type *"google.golang.org/genproto/googleapis/storage/v2".QueryWriteStatusResponse has no field or method GetCommittedSize)
The extra `/` may cause issues when additional path prefixes
are configured. Also, removing it makes it consistent
with the rest of declarations.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
vmctl: properly convert influx bools into integer representation
When using vmctl influx, the import would fail importing boolean fields
with:
```
failed to convert value "some".0 to float64: unexpected value type true
```
This converts `true` to `1` and `false` to `0`.
Fixes#1709
Sort series by a hash calculated from the series labels. This should guarantee "random" selection of the returned time series.
Previously the selection could be biased, since time series were sorted alphabetically by label names and label values.
Stream parsing mode can be automatically enabled when scraping targets with big response bodies
exceeding the -promscrape.minResponseSizeForStreamParse , so it must be always initialized.
This allows sending staleness marks and properly calculate scrape_series_added metric in stream parsing mode
at the cost of the increased memory usage, since now the potentially big response is kept
in the lastScrape byte slice per each scrapeWork.
In practice the memory usage increase shouldn't be big, since the response size
is usually much smaller than the parsed metrics from this response after the relabeling,
which usually adds a big pile of target-specific labels per each metric.
* vmalert: adjust `http.Transport.MaxIdleConns` value accordingly to `http.Transport.MaxIdleConnsPerHost`
`http.Transport.MaxIdleConnsPerHost` setting is controlled by `datasource.maxIdleConnections` flag,
while `http.Transport.MaxIdleConns` is inherited from DefaultTransport and is equal to `100`.
The fix adjusts `http.Transport.MaxIdleConns` value if it is lower than `http.Transport.MaxIdleConnsPerHost`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmauth: adjust `http.Transport.MaxIdleConns` value accordingly to `http.Transport.MaxIdleConnsPerHost`
`http.Transport.MaxIdleConnsPerHost` setting is controlled by `maxIdleConnsPerBackend` flag,
while `http.Transport.MaxIdleConns` is inherited from DefaultTransport and is equal to `100`.
The fix adjusts `http.Transport.MaxIdleConns` value if it is lower than `http.Transport.MaxIdleConnsPerHost`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The source link is controlled by `external.url` and `external.alert.source`
flags, in the same way as for alertmanager notifications.
The source link is added to Alerts list view, and specific Alert view.
* added guide for VM operator
* Update docs/guides/getting-started-with-vm-operator.md
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
* Update docs/guides/getting-started-with-vm-operator.md
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
* Fixed different typos and added improvements from proposals
* move remoteWrite.url to other place
* fixed typo
* rephrased vminsert explanation
* remove not needed parameters for default setup
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
* adds read-only mode for vmstorage
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/269
* changes order a bit
* moves isFreeDiskLimitReached var to storage struct
renames functions to be consistent
change protoparser api - with optional storage limit check for given openned storage
* renames freeSpaceLimit to ReadOnly
The list of functions, which can adjust lookbehind window is more limited than the rest of functions,
so it is better from maintainability and readability PoV using the allowlist instead of blocklist.
It is expected that the `deriv(m[d])` returns non-empty value if the lookbehind window `d`
contains less than 2 samples in the same way as `rate()` does.
This is a follow-up after 3e084be06b .
Previously, `predict_linear` returned slightly different results comparing
to Prometheus. The change makes linear regression algorithm compatible
with Prometheus.
`deriv` was excluded from the list of functions which can adjust the time
window for the same reasons.
The fix makes the binary comparison func to check for NaNs
before executing the actual comparison. This prevents VM
to return values for non-existing samples for expressions
which contain bool comparisons. Please see added test
for example.
It appeared, that `testRowsEqual` NaN comparison was incorrect.
The fix caused some tests to fail. Please see the change and
tests updated.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Adjustment results into discrepancy between Prometheus and VM on time windows
smaller than scrape interval.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The fix will always return zero if received set of items consists of one
element only, which also means no deviation.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The change affects `count/stddev/stdvar_over_time` funcs and makes
them to return NaN instead of zero when there is no datapoints
in a time window.
This is needed for improving compatibility with Prometheus.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The removed fast path optimisations weren't consistent with
`quantile` function behavior and results into discrepancy.
Specifically, results didn't match in cases when:
* 0 < phi > 1;
* values contain only one element.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: `quantile` func compatiblity with Prometheus
The `quantile` func was previously calculated by https://github.com/valyala/histogram
package. The result of such calculation was always the closest real value to
requested quantile. While in Prometheus implementation interpolation is used.
Such difference may result into discrepancy in output between Prometheus and
VictoriaMetrics.
This commit adds a Prometheus-like `quantile` function. It also used by other
functions which depend on it, such as `quantiles`, `quantile_over_time`, `median` etc.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1625
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: `quantile` review fixes
* quantile functions were split into multiple to provide
different API for already sorted data;
* float64sPool is used for reducing allocations. Items in pool may have
different sizes, but defining a new pool was complicates due to name collisions;
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: make sorting for query result similar to Prometheus
Updated sorting allows to get the order of series in result similar or equal
to what Prometheus returns.
The change is needed for compatibility reasons.
* Update app/vmselect/promql/exec_test.go
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
`omitempty` tag resulted into skipping this param on marshaling,
which was used as a checksum for groups configuration. Since on
config reload checksums are compared before applying changes,
any change to `interval` only didn't trigger config reload.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1641
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The commit contains the following changes:
- Show vmui when requesting /graph page in order to be compatible with Prometheus datasource in Grafana.
- Properly encode query args at vmui url.
- Set the number of points on the graph to the number of horizontal pixels divided by 2. Previously it was hardcoded to 30.
- Do not save server url to persistent storage at browser, since it should be always obtained from the url.
- Run `make vmui-update` for updating vmui embedded into VictoriaMetrics.
* feat: change url params for compatible prometheus
* style: add comment for TimeParams
* fix: change get default server for single version
* fix: change function for get query string value
* vmalert: add flag to limit the max value for auto-resovle duration for alerts
The new flag `rule.maxResolveDuration` suppose to limit max value for
alert.End param, which is used by notifiers like Alertmanager for alerts auto resolve.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1586
Also reduce CPU usage when applying `series_limit` to scrape targets with constant set of metrics.
The main idea is to perform the calculations on scrape_series_added and series_limit
only if the set of metrics exposed by the target has been changed.
Scrape targets rarely change the set of exposed metrics,
so this optimization should reduce CPU usage in general case.
These actions simlify metrics filtering. For example,
- action: keep_metrics
regex: 'foo|bar|baz'
would leave only metrics with `foo`, `bar` and `baz` names, while the rest of metrics will be deleted.
The commit also makes possible to split long regexps into multiple lines. For example, the following config is equivalent to the config above:
- action: keep_metrics
regex:
- foo
- bar
- baz
New UI pages:
/ - welcome page with API handlers list;
/groups - list of all rules per group;
/alerts - list of all active alerts;
/groupID/alertID/status - status of the active alert;
The number of series per target can be limited with the following options:
* Global limit with `-promscrape.maxSeriesPerTarget` command-line option.
* Per-target limit with `max_series: N` option in `scrape_config` section.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1561
* dasbhoard: replace `null` datasources
null datasource value may confuse Grafana and make it drop panel query in some
versions.
* docker: bump grafana image version
* dashboards: add URL variable selector to vmagent dashboard
* dashboards: add new panel `Remote write connection saturation` to vmagent dashboard
* alerts: add new alert for `Remote write connection saturation` panel of vmagent dashboard
* dashboards: add "Logging rate" panel to vmagent dashboard
The purpose of update is to make README and flags description more
clear to the reader. Especially, show that vm-account-id flag is required
for clustered version of VM.
* vmalert: allow extra GET params in datasource package
ExtraParams will be added as GET params to every HTTP request made by datasource.
The `roundDigits` param, for example, was substituted by corresponding extra param.
* vmalert: add nocache=1 param for replay process
The `nocache=1` param is VictoriaMetrics specific parameter which prevents it
from caching and boundaries aligning for queries. We set it to avoid cache
pollution in `replay` mode and also to avoid unnecessary time range boundaries
alignment.
* vmalert: mention nocache=1 in replay description
* vmalert: fix bug with unused param
* vmalert: remove `vmalert_execution_duration_seconds` metric
The summary for `vmalert_execution_duration_seconds` metric gives no additional
value comparing to `vmalert_iteration_duration_seconds` metric.
* vmalert: update config reload success metric properly
Previously, if there was unsuccessfull attempt to reload config and then
rollback to previous version - the metric remained set to 0.
* vmalert: add Grafana dashboard to overview application metrics
* docker: include vmalert target into list for scraping
* vmalert: extend notifier metrics with addr label
The change adds an `addr` label to metrics for alerts_sent and alerts_send_errors
to identify which exact address is having issues.
The according change was made to vmalert dashboard.
* vmalert: update documentation and docker environment for vmalert's dashboard
Mention Grafana's dashboard in vmalert's README in a new section #Monitoring.
Update docker-compose env to automatically add vmalert's dashboard.
Update docker-compose README with additional info about services.
* Document all the functions supported by MetricsQL, including PromQL functions
* Group functions by their type: rollup functions, transform functions, label manipulation functions and aggregate functions.
* Document implicit query transformations.
Store the scraped response body instead of storing the parsed and relabeld metrics.
This should reduce memory usage, since the response body takes less memory than the parsed and relabeled metrics.
This is especially true for Kubernetes service discovery, which adds many long labels for all the scraped metrics.
This should also reduce CPU usage, since the marshaling of the parsed
and relabeld metrics has been substituted by response body copying.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1526
Before, metric `vm_request_duration_seconds` was update only on successful
attempts which could be misleading. For example, timeout errors on netstorage
request may be not accounted in the metric and won't be visible on dashboards.
Using `defer` statement to update the metric after query arguments validation
may improve the situation.
This option allows reducing CPU usage a bit when VictoriaMetrics is used
for collecting and processing non-Prometheus data. For example, InfluxDB line protocol, Graphite, OpenTSDB, CSV, etc.
This option can be useful when vmagent consumes too much additional memory
for staleness markers functionality and when staleness markers aren't needed.
* vmalert: allow to disable automatically added path to remote write address via disablePathAppend flag
* docs: update docs to include remoteWrite.disablePathAppend
This metric can be used for determining high saturation of every connection to remote storage with
an alerting query `rate(vmagent_remotewrite_send_duration_seconds_total) > 0.9s`.
This query triggers when a connection is satureated by more than 90%
This reverts commit 94dfcb6747a3b29a11d14e71bea21a2312bb6346.
It is better to remove staleness marks (decimal.StaleNaN) before calling rollupConfig.Do, e.g. in preFunc
Prometheus stalenss marks shouldn't be changed in removeCounterResets. Otherwise they will be converted to an ordinary NaN values,
which couldn't be removed in dropStaleNaNs() function later. This may result in incorrect calculations for rollup functions.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1526
* docs: update "number of open files" tuning recomendation
Make "number of open files" recomendation not only Prometheus specific to avoid
confusion for users who does not use Prometheus.
* docs: mention fstrim in Tuning section
* vmalert: expose new metrics for tracking number of produced samples during last evaluation
Two new metrics were added to track the number of samples produced during the last evaluation:
* vmalert_recording_rules_last_evaluation_samples
* vmalert_alerting_rules_last_evaluation_samples
The gauge type is used to remain consistent with Prometheus metric
`prometheus_rule_group_last_evaluation_samples` which is on the group level.
However, the counter type was considered as well.
Two metrics instead of one are used to make it easier to separate recording and
alerting rules. It is likely, number of samples produced by recording rules is
more important so people will refer to it more frequently.
The expected usage of the new metric is the following:
```
- alert: RecordingRuleReturnsEmptyResults
expr: sum(vmalert_recording_rules_last_evaluation_samples) by(recording) < 1
annotations:
summary: Recording rule {{$labels.recording}} returns empty results.
Please verify expression correctness.
```
Addresses https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1494
* vmalert: rename `vmalert_alerts_error` to `vmalert_alerting_rules_error` to remain consistent with recording rules metrics
* feature: Add multitenant for vmagent
* Minor fix
* Fix rcs index out of range
* Minor fix
* Fix multi Init
* Fix multi Init
* Fix multi Init
* Add default multi
* Adjust naming
* Add TenantInserted metrics
* Add TenantInserted metrics
* fix: remove unused metrics for vmagent
* fix: remove unused metrics for vmagent
Co-authored-by: mghader <marc.ghader@ubisoft.com>
Co-authored-by: Sebastian YEPES <syepes@gmail.com>
* Rename -search.maxMetricsPointSearch to -search.maxSamplesPerQuery, so it is more consistent with the existing -search.maxSamplesPerSeries
* Move the -search.maxSamplesPerQuery from vmstorage to vmselect, so it could effectively limit the number of raw samples obtained from all the vmstorage nodes
* Document the -search.maxSamplesPerQuery in docs/CHANGELOG.md
* fix: move request button to server input
* feat: add switch for query autocomplete
* refactor: rename state for popover open
* feat: add detect os by userAgent
* fix: change hotkey to run query for mac
* fix: change detect mac os
* fix: change div to span inside Typography
Co-authored-by: yury <yurymolodov@victoriametrics.com>
This should improve the readability and usefullness of the /api/v1/status/top_queries when debugging slow queries
or queries that take too much cpu time.
Previously the switch occurred when the cache size becomes 100% of its capacity. The cache size could never reach 100% capacity.
This could prevent from switching from the split cache to full cache, thus reducing the cache effectiveness.
- Support durations anywhere in MetricsQL queries. E.g. sum_over_time(m[1h])/1h is equivalent to sum_over_time(m[1h])/3600
- Support durations without suffix. E.g. rate(m[300]) is equivalent to rate(m[5m])
Previously needsDedup() could return true if the de-duplication wasn't needed for the following case:
d < interval
/ \
| v | v |
interval interval
Now it properly returns false for this case
This reverts commit 7c6d3981bf.
Reason for revert: high contention at bucket16Pool on systems with big number of CPU cores.
This slows down query processing significantly.
This should reduce memory usage on systems with big number of CPU cores,
since every inmemoryPart object occupies at least 64KB of memory and sync.Pool maintains
a separate pool inmemoryPart objects per each CPU core.
Though the new scheme for the pool worsens per-cpu cache locality, this should be amortized
by big sizes of inmemoryPart objects.
CPU and memory profiles show that the pool capacity for inmemoryBlock objects is too small.
This results in the increased load on memory allocation code in Go runtime.
Increase the pool capacity in order to reduce the load on Go runtime.
This should improve hit ratio for tagFiltersCache when big number of new time series are constantly registered
(aka high churn rate). This, in turn, should reduce CPU usage for queries over such time series.
Previously the stats for cache misses could be improperly counted, because it had inflated cache misses
if the entry was missing in the curr cache, but was existing in the prev cache.
The same applies to cache requests - they were inflated if the entry was missing in the curr cache.
To add a Copy button wrap code snippet with the following element:
```
<div class="with-copy" markdown="1">
<your-code-snippet>
</div>
```
See the changes to `Kubernetes monitoring with VictoriaMetrics Single` for details.
Previously the switch from `split` to `whole` mode had been performed too early,
e.g. when the current cache size became bigger than 1/4 of the allowed cache size.
Now it is performed when the current cache size becomes bigger than 1/2 of the allowed cache size.
This change can reduce memory usage for data ingestion path when big number of active time series are ingested.
One minute cache timeout result in slower queries in some production workloads where the interval
between query execution is in the range 1 minute - 2 minutes.
This should reduce memory usage on a system with high number of active time series and a high churn rate.
One minute is enough for caching the blocks needed for repeated queries (e.g. alerting rules, recording rules and dashboard refreshes).
This should reduce memory usage for the pool on systems with big number of CPU cores.
The sync.Pool maintains per-CPU pools, so the total number of objects in the pool
is proportional to the number of available CPU cores. The channel limits the number
of pooled objects by its own capacity. This means smaller number of pooled objects on average.
This should reduce resource usage (CPU, RAM, disk IO) at vmstorage nodes
if the addresses of vmstorage nodes are passed in random order to vminsert nodes.
* Change default value of '-remoteWrite.queues' to cgroup.AvailableCPUS() * 2 to reduce scrape interval
Default value of vmagent option '-remotewrite.queues' is 4 and default
size of vmagent ScheudleUnmarshalWorkers is number of CPUs, when available
CPUs is much greater than 4, e.g 32, worker are competing push queues
which will increase scrape interval and may cause scrape timeout.
* Update README and flag description
Co-authored-by: xiaozy <xiaozy01@fenbi.com>
Due to staleness handling, increase_pure were using incorrect previous value
during calculation in cases where series disappears for period longer
than staleness period and then returns back. The fix suppose to account
for a real datapoint value before staleness takes place. The fix should
remove unexpected spikes while using `increase_pure` for staled series.
* dashboard: update single version dash
The update contains the following changes:
* display anonymous memory usage metric. This metric suppose to reflect
memory usage of the process which can't be freed by OS;
* add legends to all panels. This is important for cases when users share
the screenshots;
* modify panels for Grafana v8.0.0
* dashboard: update single version dash tags
* dashboard: update vmagent dash
The update contains the following changes:
* display anonymous memory usage metric. This metric suppose to reflect
memory usage of the process which can't be freed by OS;
* add legends to all panels. This is important for cases when users share
the screenshots;
* modify panels for Grafana v8.0.0
This panic can be raised by the reverseProxy on aborted request to the backend.
So handle it (e.g. suppress) at reverseProxy.ServeHTTP call.
Do not suppress the panic at lib/httpserver generic HTTP handler,
since it may result in an inconsistent state left after the panicking handler.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1353
* vmalert: fix mistake with object reuse while parsing response
During the refactoring, the wrong optimisations was applied in
parse function which caused metric fields reset. The change removes
optimisation.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1369
* vmalert: add test to cover multiple metrics in one response
* vmalert: support rules backfilling (aka `replay`)
vmalert can `replay` configured rules in the past
and backfill results via remote write protocol.
It supports MetricsQL/PromQL storage as data source,
and can backfill data to remote write compatible
storage.
Supports recording and alerting rules `replay`. See more
details in README.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/836
* vmalert: review fixes
* vmalert: readme fixes
* new feature: relabel logging
Use scrape_configs[x].relabel_debug = true to log metric names inkl.
labels before and after relabeling. After relabeling related metrics
get dropped, i.e. not submitted to servers.
* vminsert wants relabel logging, too.
The pool for inmemoryBlock struct doesn't give any performance gains in production workloads,
while it may result in excess memory usage for inmemoryBlock structs inside the pool during
background merge of indexdb.
New flag `-rule.configCheckInterval` defines how often `vmalert` will re-read
config file. If it detects any changes, the config will be reloaded.
This behaviour is turned off by default.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/512
This speeds up bucket32.addBucketAtPos() when bucket32.buckets contains big number of items,
since the copying of bucket16 pointers is much faster than the copying of bucket16 objects.
This is a cpu profile for copying bucket16 objects:
10ms 13.43s (flat, cum) 32.01% of Total
10ms 120ms 650: b.b16his = append(b.b16his[:pos+1], b.b16his[pos:]...)
. . 651: b.b16his[pos] = hi
. 13.31s 652: b.buckets = append(b.buckets[:pos+1], b.buckets[pos:]...)
. . 653: b16 := &b.buckets[pos]
. . 654: *b16 = bucket16{}
. . 655: return b16
. . 656:}
This is a cpu profile for copying pointers to bucket16:
10ms 1.14s (flat, cum) 2.19% of Total
. 100ms 647: b.b16his = append(b.b16his[:pos+1], b.b16his[pos:]...)
. . 648: b.b16his[pos] = hi
10ms 700ms 649: b.buckets = append(b.buckets[:pos+1], b.buckets[pos:]...)
. 330ms 650: b16 := &bucket16{}
. . 651: b.buckets[pos] = b16
. . 652: return b16
. . 653:}
A clear and concise description of what the bug is.
It would be a great [upgrading](https://docs.victoriametrics.com/#how-to-upgrade) to [the latest avaialble release](https://github.com/VictoriaMetrics/VictoriaMetrics/releases)
and verifying whether the bug is reproducible there.
It is also recommended reading [troubleshooting docs](https://docs.victoriametrics.com/#troubleshooting).
**To Reproduce**
Steps to reproduce the behavior.
**Expected behavior**
A clear and concise description of what you expected to happen.
**Screenshots**
If applicable, add screenshots to help explain your problem.
**Version**
The line returned when passing `--version` command line flag to binary. For example:
Please provide a brief description of the changes you made. Be as specific as possible to help others understand the purpose and impact of your modifications.
### Checklist
The following checks are mandatory:
- [ ] I have read the [Contributing Guidelines](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/CONTRIBUTING.md)
- [ ] All commits are signed and include `Signed-off-by` line. Use `git commit -s` to include `Signed-off-by` your commits. See this [doc](https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-Work) about how to sign your commits.
- [ ] Tests are passing locally. Use `make test` to run all tests locally.
- [ ] Linting is passing locally. Use `make check-all` to run all linters locally.
Further checks are optional for External Contributions:
- [ ] Include a link to the GitHub issue in the commit message, if issue exists.
- [ ] Mention the change in the [Changelog](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/docs/CHANGELOG.md). Explain what has changed and why. If there is a related issue or documentation change - link them as well.
Tips for writing a good changelog message::
* Write a human-readable changelog message that describes the problem and solution.
* Include a link to the issue or pull request in your changelog message.
* Use specific language identifying the fix, such as an error message, metric name, or flag name.
* Provide a link to the relevant documentation for any new features you add or modify.
- [ ] After your pull request is merged, please add a message to the issue with instructions for how to test the fix or try the feature you added. Here is an [example](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4048#issuecomment-1546453726)
- [ ] Do not close the original issue before the change is released. Please note, in some cases Github can automatically close the issue once PR is merged. Re-open the issue in such case.
- [ ] If the change somehow affects public interfaces (a new flag was added or updated, or some behavior has changed) - add the corresponding change to documentation.
Examples of good changelog messages:
1. FEATURE: [vmagent](https://docs.victoriametrics.com/vmagent.html): add support for [VictoriaMetrics remote write protocol](https://docs.victoriametrics.com/vmagent.html#victoriametrics-remote-write-protocol) when [sending / receiving data to / from Kafka](https://docs.victoriametrics.com/vmagent.html#kafka-integration). This protocol allows saving egress network bandwidth costs when sending data from `vmagent` to `Kafka` located in another datacenter or availability zone. See [this feature request](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1225).
2. BUGFIX: [stream aggregation](https://docs.victoriametrics.com/stream-aggregation.html): suppress `series after dedup` error message in logs when `-remoteWrite.streamAggr.dedupInterval` command-line flag is set at [vmagent](https://docs.victoriametrics.com/vmgent.html) or when `-streamAggr.dedupInterval` command-line flag is set at [single-node VictoriaMetrics](https://docs.victoriametrics.com/).
@@ -14,3 +14,8 @@ We are open to third-party pull requests provided they follow [KISS design princ
- Avoid automated decisions, which may hurt cluster availability, consistency or performance.
Adhering `KISS` principle simplifies the resulting code and architecture, so it can be reviewed, understood and verified by many people.
Before sending a pull request please check the following:
- [ ] All commits are signed and include `Signed-off-by` line. Use `git commit -s` to include `Signed-off-by` your commits. See this [doc](https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-Work) about how to sign your commits.
- [ ] Tests are passing locally. Use `make test` to run all tests locally.
- [ ] Linting is passing locally. Use `make check-all` to run all linters locally.
which errcheck ||GO111MODULE=off go get github.com/kisielk/errcheck
check-all:fmtvetlinterrcheckgolangci-lint
check-all:fmtvetgolangci-lintgovulncheck
test:
GO111MODULE=on go test -mod=vendor ./lib/... ./app/...
go test ./lib/... ./app/...
test-race:
GO111MODULE=on go test -mod=vendor -race ./lib/... ./app/...
go test -race ./lib/... ./app/...
test-pure:
GO111MODULE=on CGO_ENABLED=0 go test -mod=vendor ./lib/... ./app/...
CGO_ENABLED=0 go test ./lib/... ./app/...
test-full:
GO111MODULE=on go test -mod=vendor -coverprofile=coverage.txt -covermode=atomic ./lib/... ./app/...
go test -coverprofile=coverage.txt -covermode=atomic ./lib/... ./app/...
test-full-386:
GO111MODULE=on GOARCH=386 go test -mod=vendor -coverprofile=coverage.txt -covermode=atomic ./lib/... ./app/...
GOARCH=386 go test -coverprofile=coverage.txt -covermode=atomic ./lib/... ./app/...
benchmark:
GO111MODULE=on go test -mod=vendor -bench=. ./lib/...
GO111MODULE=on go test -mod=vendor -bench=. ./app/...
go test -bench=. ./lib/...
go test -bench=. ./app/...
benchmark-pure:
GO111MODULE=on CGO_ENABLED=0 go test -mod=vendor -bench=. ./lib/...
GO111MODULE=on CGO_ENABLED=0 go test -mod=vendor -bench=. ./app/...
CGO_ENABLED=0 go test -bench=. ./lib/...
CGO_ENABLED=0 go test -bench=. ./app/...
vendor-update:
GO111MODULE=on go get -u -d ./lib/...
GO111MODULE=on go get -u -d ./app/...
GO111MODULE=on go mod tidy
GO111MODULE=on go mod vendor
go get -u -d ./lib/...
go get -u -d ./app/...
go mod tidy -compat=1.21
go mod vendor
app-local:
CGO_ENABLED=1GO111MODULE=on go build $(RACE) -mod=vendor -ldflags "$(GO_BUILDINFO)" -o bin/$(APP_NAME)$(RACE)$(PKG_PREFIX)/app/$(APP_NAME)
CGO_ENABLED=1 go build $(RACE) -ldflags "$(GO_BUILDINFO)" -o bin/$(APP_NAME)$(RACE)$(PKG_PREFIX)/app/$(APP_NAME)
app-local-pure:
CGO_ENABLED=0GO111MODULE=on go build $(RACE) -mod=vendor -ldflags "$(GO_BUILDINFO)" -o bin/$(APP_NAME)-pure$(RACE)$(PKG_PREFIX)/app/$(APP_NAME)
CGO_ENABLED=0 go build $(RACE) -ldflags "$(GO_BUILDINFO)" -o bin/$(APP_NAME)-pure$(RACE)$(PKG_PREFIX)/app/$(APP_NAME)
app-local-with-goarch:
GO111MODULE=on go build $(RACE) -mod=vendor -ldflags "$(GO_BUILDINFO)" -o bin/$(APP_NAME)-$(GOARCH)$(RACE)$(PKG_PREFIX)/app/$(APP_NAME)
app-local-goos-goarch:
CGO_ENABLED=$(CGO_ENABLED)GOOS=$(GOOS)GOARCH=$(GOARCH) go build $(RACE) -ldflags "$(GO_BUILDINFO)" -o bin/$(APP_NAME)-$(GOOS)-$(GOARCH)$(RACE)$(PKG_PREFIX)/app/$(APP_NAME)
app-local-windows-with-goarch:
CGO_ENABLED=0GO111MODULE=on go build $(RACE) -mod=vendor -ldflags "$(GO_BUILDINFO)" -o bin/$(APP_NAME)-windows-$(GOARCH)$(RACE).exe $(PKG_PREFIX)/app/$(APP_NAME)
app-local-windows-goarch:
CGO_ENABLED=0GOOS=windows GOARCH=$(GOARCH) go build $(RACE) -ldflags "$(GO_BUILDINFO)" -o bin/$(APP_NAME)-windows-$(GOARCH)$(RACE).exe $(PKG_PREFIX)/app/$(APP_NAME)
quicktemplate-gen:install-qtc
qtc
install-qtc:
which qtc ||GO111MODULE=off go get github.com/valyala/quicktemplate/qtc
which qtc ||go install github.com/valyala/quicktemplate/qtc@latest
golangci-lint:install-golangci-lint
golangci-lint run --exclude '(SA4003|SA1019|SA5011):' -D errcheck -D structcheck --timeout 2m
golangci-lint run
install-golangci-lint:
which golangci-lint || curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b $(shell go env GOPATH)/bin v1.40.1
which golangci-lint || curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b $(shell go env GOPATH)/bin v1.57.1
govulncheck:install-govulncheck
govulncheck ./...
install-govulncheck:
which govulncheck || go install golang.org/x/vuln/cmd/govulncheck@latest
install-wwhrd:
which wwhrd ||GO111MODULE=off go get github.com/frapposelli/wwhrd
which wwhrd ||go install github.com/frapposelli/wwhrd@latest
check-licenses:install-wwhrd
wwhrd check -f .wwhrd.yml
copy-docs:
echo"---\nsort: ${ORDER}\n---\n" > ${DST}
cat ${SRC} >> ${DST}
# The 'printf' function is used instead of 'echo' or 'echo -e' to handle line breaks (e.g. '\n') in the same way on different operating systems (MacOS/Ubuntu Linux/Arch Linux) and their shells (bash/sh/zsh/fish).
# For details, see https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4548#issue-1782796419 and https://stackoverflow.com/questions/8467424/echo-newline-in-bash-prints-literal-n
httpListenAddrs=flagutil.NewArrayString("httpListenAddr","TCP address to listen for incoming http requests. See also -httpListenAddr.useProxyProtocol")
useProxyProtocol=flagutil.NewArrayBool("httpListenAddr.useProxyProtocol","Whether to use proxy protocol for connections accepted at the given -httpListenAddr . "+
"See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt . "+
"With enabled proxy protocol http server cannot serve regular /metrics endpoint. Use -pushmetrics.url for metrics pushing")
)
funcmain(){
// Write flags and help message to stdout, since it is easier to grep or pipe.
flag.CommandLine.SetOutput(os.Stdout)
flag.Usage=usage
envflag.Parse()
buildinfo.Init()
logger.Init()
listenAddrs:=*httpListenAddrs
iflen(listenAddrs)==0{
listenAddrs=[]string{":9428"}
}
logger.Infof("starting VictoriaLogs at %q...",listenAddrs)
httpListenAddr=flag.String("httpListenAddr",":8428","TCP address to listen for http connections")
minScrapeInterval=flag.Duration("dedup.minScrapeInterval",0,"Remove superflouos samples from time series if they are located closer to each other than this duration. "+
"This may be useful for reducing overhead when multiple identically configured Prometheus instances write data to the same VictoriaMetrics. "+
"Deduplication is disabled if the -dedup.minScrapeInterval is 0")
dryRun=flag.Bool("dryRun",false,"Whether to check only -promscrape.config and then exit. "+
"Unknown config entries are allowed in -promscrape.config by default. This can be changed with -promscrape.config.strictParse")
httpListenAddrs=flagutil.NewArrayString("httpListenAddr","TCP addresses to listen for incoming http requests. See also -tls and -httpListenAddr.useProxyProtocol")
useProxyProtocol=flagutil.NewArrayBool("httpListenAddr.useProxyProtocol","Whether to use proxy protocol for connections accepted at the corresponding -httpListenAddr . "+
"See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt . "+
"With enabled proxy protocol http server cannot serve regular /metrics endpoint. Use -pushmetrics.url for metrics pushing")
minScrapeInterval=flag.Duration("dedup.minScrapeInterval",0,"Leave only the last sample in every time series per each discrete interval "+
"equal to -dedup.minScrapeInterval > 0. See also -streamAggr.dedupInterval and https://docs.victoriametrics.com/#deduplication")
dryRun=flag.Bool("dryRun",false,"Whether to check config files without running VictoriaMetrics. The following config files are checked: "+
"-promscrape.config, -relabelConfig and -streamAggr.config. Unknown config entries aren't allowed in -promscrape.config by default. "+
"This can be changed with -promscrape.config.strictParse=false command-line flag")
inmemoryDataFlushInterval=flag.Duration("inmemoryDataFlushInterval",5*time.Second,"The interval for guaranteed saving of in-memory data to disk. "+
"The saved data survives unclean shutdowns such as OOM crash, hardware reset, SIGKILL, etc. "+
"Bigger intervals may help increase the lifetime of flash storage with limited write cycles (e.g. Raspberry PI). "+
"Smaller intervals increase disk IO load. Minimum supported value is 1s")
)
funcmain(){
@@ -46,33 +56,47 @@ func main() {
iferr:=promscrape.CheckConfig();err!=nil{
logger.Fatalf("error when checking -promscrape.config: %s",err)
}
logger.Infof("-promscrape.config is ok; exitting with 0 status code")
t.Fatalf("not expected number of csv lines want=%d\ngot=%d test=%s.%s\n\response=%q",len(data),test.ExpectedResultLinesCount,q,test.Issue,strings.Join(data,"\n"))
maxConcurrentRequests=flag.Int("search.maxConcurrentRequests",getDefaultMaxConcurrentRequests(),"The maximum number of concurrent search requests. "+
"It shouldn't be high, since a single request can saturate all the CPU cores, while many concurrently executed requests may require high amounts of memory. "+
"See also -search.maxQueueDuration")
maxQueueDuration=flag.Duration("search.maxQueueDuration",10*time.Second,"The maximum time the search request waits for execution when -search.maxConcurrentRequests "+
"limit is reached; see also -search.maxQueryDuration")
maxQueryDuration=flag.Duration("search.maxQueryDuration",time.Second*30,"The maximum duration for query execution")
)
funcgetDefaultMaxConcurrentRequests()int{
n:=cgroup.AvailableCPUs()
ifn<=4{
n*=2
}
ifn>16{
// A single request can saturate all the CPU cores, so there is no sense
// in allowing higher number of concurrent requests - they will just contend
<!doctype html><htmllang="en"><head><metacharset="utf-8"/><linkrel="icon"href="./favicon.ico"/><metaname="viewport"content="width=device-width,initial-scale=1,maximum-scale=5"/><metaname="theme-color"content="#000000"/><metaname="description"content="UI for VictoriaMetrics"/><linkrel="apple-touch-icon"href="./apple-touch-icon.png"/><linkrel="icon"type="image/png"sizes="32x32"href="./favicon-32x32.png"><linkrel="manifest"href="./manifest.json"/><title>VM UI</title><scriptsrc="./dashboards/index.js"type="module"></script><metaname="twitter:card"content="summary_large_image"><metaname="twitter:image"content="./preview.jpg"><metaname="twitter:title"content="UI for VictoriaMetrics"><metaname="twitter:description"content="Explore and troubleshoot your VictoriaMetrics data"><metaname="twitter:site"content="@VictoriaMetrics"><metaproperty="og:title"content="Metric explorer for VictoriaMetrics"><metaproperty="og:description"content="Explore and troubleshoot your VictoriaMetrics data"><metaproperty="og:image"content="./preview.jpg"><metaproperty="og:type"content="website"><scriptdefer="defer"src="./static/js/main.034044a7.js"></script><linkhref="./static/css/main.bc07cc78.css"rel="stylesheet"></head><body><noscript>You need to enable JavaScript to run this app.</noscript><divid="root"></div></body></html>
retentionPeriod=flagutil.NewDuration("retentionPeriod","7d","Log entries with timestamps older than now-retentionPeriod are automatically deleted; "+
"log entries with timestamps outside the retention are also rejected during data ingestion; the minimum supported retention is 1d (one day); "+
"see https://docs.victoriametrics.com/VictoriaLogs/#retention")
futureRetention=flagutil.NewDuration("futureRetention","2d","Log entries with timestamps bigger than now+futureRetention are rejected during data ingestion; "+
"see https://docs.victoriametrics.com/VictoriaLogs/#retention")
storageDataPath=flag.String("storageDataPath","victoria-logs-data","Path to directory with the VictoriaLogs data; "+
"see https://docs.victoriametrics.com/VictoriaLogs/#storage")
inmemoryDataFlushInterval=flag.Duration("inmemoryDataFlushInterval",5*time.Second,"The interval for guaranteed saving of in-memory data to disk. "+
"The saved data survives unclean shutdowns such as OOM crash, hardware reset, SIGKILL, etc. "+
"Bigger intervals may help increase the lifetime of flash storage with limited write cycles (e.g. Raspberry PI). "+
"Smaller intervals increase disk IO load. Minimum supported value is 1s")
logNewStreams=flag.Bool("logNewStreams",false,"Whether to log creation of new streams; this can be useful for debugging of high cardinality issues with log streams; "+
"see https://docs.victoriametrics.com/VictoriaLogs/keyConcepts.html#stream-fields ; see also -logIngestedRows")
logIngestedRows=flag.Bool("logIngestedRows",false,"Whether to log all the ingested log entries; this can be useful for debugging of data ingestion; "+
"see https://docs.victoriametrics.com/VictoriaLogs/data-ingestion/ ; see also -logNewStreams")
minFreeDiskSpaceBytes=flagutil.NewBytes("storage.minFreeDiskSpaceBytes",10e6,"The minimum free disk space at -storageDataPath after which "+
"the storage stops accepting new data")
)
// Init initializes vlstorage.
//
// Stop must be called when vlstorage is no longer needed
funcInit(){
ifstrg!=nil{
logger.Panicf("BUG: Init() has been already called")
}
ifretentionPeriod.Duration()<24*time.Hour{
logger.Fatalf("-retentionPeriod cannot be smaller than a day; got %s",retentionPeriod)
}
cfg:=&logstorage.StorageConfig{
Retention:retentionPeriod.Duration(),
FlushInterval:*inmemoryDataFlushInterval,
FutureRetention:futureRetention.Duration(),
LogNewStreams:*logNewStreams,
LogIngestedRows:*logIngestedRows,
MinFreeDiskSpaceBytes:minFreeDiskSpaceBytes.N,
}
logger.Infof("opening storage at -storageDataPath=%s",*storageDataPath)
See vmagent docs [here](https://docs.victoriametrics.com/vmagent.html).
`vmagent` is a tiny but mighty agent which helps you collect metrics from various sources
and store them in [VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics)
or any other Prometheus-compatible storage systems that support the `remote_write` protocol.
<img alt="vmagent" src="vmagent.png">
## Motivation
While VictoriaMetrics provides an efficient solution to store and observe metrics, our users needed something fast
and RAM friendly to scrape metrics from Prometheus-compatible exporters into VictoriaMetrics.
Also, we found that our user's infrastructure are like snowflakes in that no two are alike. Therefore we decided to add more flexibility
to `vmagent` such as the ability to push metrics instead of pulling them. We did our best and will continue to improve vmagent.
## Features
* Can be used as a drop-in replacement for Prometheus for scraping targets such as [node_exporter](https://github.com/prometheus/node_exporter).
See [Quick Start](#quick-start) for details.
* Can add, remove and modify labels (aka tags) via Prometheus relabeling. Can filter data before sending it to remote storage. See [these docs](#relabeling) for details.
* Accepts data via all ingestion protocols supported by VictoriaMetrics:
* Influx line protocol via `http://<vmagent>:8429/write`. See [these docs](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-send-data-from-influxdb-compatible-agents-such-as-telegraf).
* Graphite plaintext protocol if `-graphiteListenAddr` command-line flag is set. See [these docs](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-send-data-from-graphite-compatible-agents-such-as-statsd).
* OpenTSDB telnet and http protocols if `-opentsdbListenAddr` command-line flag is set. See [these docs](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-send-data-from-opentsdb-compatible-agents).
* Prometheus remote write protocol via `http://<vmagent>:8429/api/v1/write`.
* JSON lines import protocol via `http://<vmagent>:8429/api/v1/import`. See [these docs](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-import-data-in-json-line-format).
* Native data import protocol via `http://<vmagent>:8429/api/v1/import/native`. See [these docs](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-import-data-in-native-format).
* Data in Prometheus exposition format. See [these docs](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-import-data-in-prometheus-exposition-format) for details.
* Arbitrary CSV data via `http://<vmagent>:8429/api/v1/import/csv`. See [these docs](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-import-csv-data).
* Can replicate collected metrics simultaneously to multiple remote storage systems.
* Works smoothly in environments with unstable connections to remote storage. If the remote storage is unavailable, the collected metrics
are buffered at `-remoteWrite.tmpDataPath`. The buffered metrics are sent to remote storage as soon as the connection
to the remote storage is repaired. The maximum disk usage for the buffer can be limited with `-remoteWrite.maxDiskUsagePerURL`.
* Uses lower amounts of RAM, CPU, disk IO and network bandwidth compared with Prometheus.
* Scrape targets can be spread among multiple `vmagent` instances when big number of targets must be scraped. See [these docs](#scraping-big-number-of-targets).
* Can efficiently scrape targets that expose millions of time series such as [/federate endpoint in Prometheus](https://prometheus.io/docs/prometheus/latest/federation/). See [these docs](#stream-parsing-mode).
* Can deal with high cardinality and high churn rate issues by limiting the number of unique time series sent to remote storage systems. See [these docs](#cardinality-limiter).
## Quick Start
Please download `vmutils-*` archive from [releases page](https://github.com/VictoriaMetrics/VictoriaMetrics/releases), unpack it
and configure the following flags to the `vmagent` binary in order to start scraping Prometheus targets:
*`-promscrape.config` with the path to Prometheus config file (usually located at `/etc/prometheus/prometheus.yml`)
*`-remoteWrite.url` with the remote storage endpoint such as VictoriaMetrics, the `-remoteWrite.url` argument can be specified multiple times to replicate data concurrently to an arbitrary number of remote storage systems.
Then send Influx data to `http://vmagent-host:8429`. See [these docs](https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-send-data-from-influxdb-compatible-agents-such-as-telegraf) for more details.
`vmagent` is also available in [docker images](https://hub.docker.com/r/victoriametrics/vmagent/tags).
Pass `-help` to `vmagent` in order to see [the full list of supported command-line flags with their descriptions](#advanced-usage).
## Configuration update
`vmagent` should be restarted in order to update config options set via command-line args.
`vmagent` supports multiple approaches for reloading configs from updated config files such as `-promscrape.config`, `-remoteWrite.relabelConfig` and `-remoteWrite.urlRelabelConfig`:
* Sending `SUGHUP` signal to `vmagent` process:
```bash
kill -SIGHUP `pidof vmagent`
```
* Sending HTTP request to `http://vmagent:8429/-/reload` endpoint.
There is also `-promscrape.configCheckInterval` command-line option, which can be used for automatic reloading configs from updated `-promscrape.config` file.
## Use cases
### IoT and Edge monitoring
`vmagent` can run and collect metrics in IoT and industrial networks with unreliable or scheduled connections to their remote storage.
It buffers the collected data in local files until the connection to remote storage becomes available and then sends the buffered
data to the remote storage. It re-tries sending the data to remote storage until any errors are resolved.
The maximum buffer size can be limited with `-remoteWrite.maxDiskUsagePerURL`.
`vmagent` works on various architectures from the IoT world - 32-bit arm, 64-bit arm, ppc64, 386, amd64.
See [the corresponding Makefile rules](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/app/vmagent/Makefile) for details.
### Drop-in replacement for Prometheus
If you use Prometheus only for scraping metrics from various targets and forwarding those metrics to remote storage
then `vmagent` can replace Prometheus. Typically, `vmagent` requires lower amounts of RAM, CPU and network bandwidth compared with Prometheus.
See [these docs](#how-to-collect-metrics-in-prometheus-format) for details.
### Replication and high availability
`vmagent` replicates the collected metrics among multiple remote storage instances configured via `-remoteWrite.url` args.
If a single remote storage instance temporarily is out of service, then the collected data remains available in another remote storage instance.
`vmagent` buffers the collected data in files at `-remoteWrite.tmpDataPath` until the remote storage becomes available again and then it sends the buffered data to the remote storage in order to prevent data gaps.
### Relabeling and filtering
`vmagent` can add, remove or update labels on the collected data before sending it to the remote storage. Additionally,
it can remove unwanted samples via Prometheus-like relabeling before sending the collected data to remote storage.
Please see [these docs](#relabeling) for details.
### Splitting data streams among multiple systems
`vmagent` supports splitting the collected data between muliple destinations with the help of `-remoteWrite.urlRelabelConfig`,
which is applied independently for each configured `-remoteWrite.url` destination. For example, it is possible to replicate or split
data among long-term remote storage, short-term remote storage and a real-time analytical system [built on top of Kafka](https://github.com/Telefonica/prometheus-kafka-adapter).
Note that each destination can receive it's own subset of the collected data due to per-destination relabeling via `-remoteWrite.urlRelabelConfig`.
### Prometheus remote_write proxy
`vmagent` can be used as a proxy for Prometheus data sent via Prometheus `remote_write` protocol. It can accept data via the `remote_write` API
at the`/api/v1/write` endpoint. Then apply relabeling and filtering and proxy it to another `remote_write` system .
The `vmagent` can be configured to encrypt the incoming `remote_write` requests with `-tls*` command-line flags.
Also, Basic Auth can be enabled for the incoming `remote_write` requests with `-httpAuth.*` command-line flags.
### remote_write for clustered version
While `vmagent` can accept data in several supported protocols (OpenTSDB, Influx, Prometheus, Graphite) and scrape data from various targets, writes are always peformed in Promethes remote_write protocol. Therefore for the [clustered version](https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html), `-remoteWrite.url` the command-line flag should be configured as `<schema>://<vminsert-host>:8480/insert/<customer-id>/prometheus/api/v1/write`
## How to collect metrics in Prometheus format
Specify the path to `prometheus.yml` file via `-promscrape.config` command-line flag. `vmagent` takes into account the following
sections from [Prometheus config file](https://prometheus.io/docs/prometheus/latest/configuration/configuration/):
* `global`
* `scrape_configs`
All other sections are ignored, including the [remote_write](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write) section.
Use `-remoteWrite.*` command-line flag instead for configuring remote write settings.
The following scrape types in [scrape_config](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) section are supported:
* `static_configs` - is for scraping statically defined targets. See [these docs](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#static_config) for details.
* `file_sd_configs` - is for scraping targets defined in external files (aka file-based service discover).
See [these docs](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#file_sd_config) for details
* `kubernetes_sd_configs` - for scraping targets in Kubernetes (k8s).
See [kubernetes_sd_config](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config) for details.
* `ec2_sd_configs` - is for scraping targets in Amazon EC2.
See [ec2_sd_config](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#ec2_sd_config) for details.
`vmagent` doesn't support the `profile` config param yet.
* `gce_sd_configs` - is for scraping targets in Google Compute Engine (GCE).
See [gce_sd_config](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#gce_sd_config) for details.
`vmagent` provides the following additional functionality for `gce_sd_config`:
* if `project` arg is missing then `vmagent` uses the project for the instance where it runs;
* if `zone` arg is missing then `vmagent` uses the zone for the instance where it runs;
* if `zone` arg is equal to `"*"`, then `vmagent` discovers all the zones for the given project;
* `zone` may contain an arbitrary number of zones, i.e. `zone: [us-east1-a, us-east1-b]`.
* `consul_sd_configs` - is for scraping the targets registered in Consul.
See [consul_sd_config](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#consul_sd_config) for details.
* `dns_sd_configs` - is for scraping targets discovered from DNS records (SRV, A and AAAA).
See [dns_sd_config](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#dns_sd_config) for details.
* `openstack_sd_configs` - is for scraping OpenStack targets.
See [openstack_sd_config](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#openstack_sd_config) for details.
[OpenStack identity API v3](https://docs.openstack.org/api-ref/identity/v3/) is supported only.
* `dockerswarm_sd_configs` - is for scraping Docker Swarm targets.
See [dockerswarm_sd_config](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#dockerswarm_sd_config) for details.
* `eureka_sd_configs` - is for scraping targets registered in [Netflix Eureka](https://github.com/Netflix/eureka).
See [eureka_sd_config](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#eureka_sd_config) for details.
Please file feature requests to [our issue tracker](https://github.com/VictoriaMetrics/VictoriaMetrics/issues) if you need other service discovery mechanisms to be supported by `vmagent`.
`vmagent` also support the following additional options in `scrape_configs` section:
* `disable_compression: true` - to disable response compression on a per-job basis. By default `vmagent` requests compressed responses from scrape targets
to save network bandwidth.
* `disable_keepalive: true` - to disable [HTTP keep-alive connections](https://en.wikipedia.org/wiki/HTTP_persistent_connection) on a per-job basis.
By default, `vmagent` uses keep-alive connections to scrape targets to reduce overhead on connection re-establishing.
* `stream_parse: true` - for scraping targets in a streaming manner. This may be useful for targets exporting big number of metrics. See [these docs](#stream-parsing-mode).
Note that `vmagent` doesn't support `refresh_interval` option for these scrape configs. Use the corresponding `-promscrape.*CheckInterval`
command-line flag instead. For example, `-promscrape.consulSDCheckInterval=60s` sets `refresh_interval` for all the `consul_sd_configs`
entries to 60s. Run `vmagent -help` in order to see default values for the `-promscrape.*CheckInterval` flags.
The file pointed by `-promscrape.config` may contain `%{ENV_VAR}` placeholders which are substituted by the corresponding `ENV_VAR` environment variable values.
## Adding labels to metrics
Labels can be added to metrics by the following mechanisms:
* The `global -> external_labels` section in `-promscrape.config` file. These labels are added only to metrics scraped from targets configured in the `-promscrape.config` file. They aren't added to metrics collected via other [data ingestion protocols](https://docs.victoriametrics.com/#how-to-import-time-series-data).
* The `-remoteWrite.label` command-line flag. These labels are added to all the collected metrics before sending them to `-remoteWrite.url`. For example, the following command will start `vmagent`, which will add `{datacenter="foobar"}` label to all the metrics pushed to all the configured remote storage systems (all the `-remoteWrite.url` flag values):
* `replace_all`: replaces all of the occurences of `regex` in the values of `source_labels` with the `replacement` and stores the results in the `target_label`.
* `labelmap_all`: replaces all of the occurences of `regex` in all the label names with the `replacement`.
* `keep_if_equal`: keeps the entry if all the label values from `source_labels` are equal.
* `drop_if_equal`: drops the entry if all the label values from `source_labels` are equal.
The relabeling can be defined in the following places:
* At the `scrape_config -> relabel_configs` section in `-promscrape.config` file. This relabeling is applied to target labels.
* At the `scrape_config -> metric_relabel_configs` section in `-promscrape.config` file. This relabeling is applied to all the scraped metrics in the given `scrape_config`.
* At the `-remoteWrite.relabelConfig` file. This relabeling is aplied to all the collected metrics before sending them to remote storage.
* At the `-remoteWrite.urlRelabelConfig` files. This relabeling is applied to metrics before sending them to the corresponding `-remoteWrite.url`.
You can read more about relabeling in the following articles:
* [How to use Relabeling in Prometheus and VictoriaMetrics](https://valyala.medium.com/how-to-use-relabeling-in-prometheus-and-victoriametrics-8b90fc22c4b2)
* [Life of a label](https://www.robustperception.io/life-of-a-label)
* [Discarding targets and timeseries with relabeling](https://www.robustperception.io/relabelling-can-discard-targets-timeseries-and-alerts)
* [Dropping labels at scrape time](https://www.robustperception.io/dropping-metrics-at-scrape-time-with-prometheus)
* [Extracting labels from legacy metric names](https://www.robustperception.io/extracting-labels-from-legacy-metric-names)
* [relabel_configs vs metric_relabel_configs](https://www.robustperception.io/relabel_configs-vs-metric_relabel_configs)
## Stream parsing mode
By default `vmagent` reads the full response from scrape target into memory, then parses it, applies [relabeling](#relabeling) and then pushes the resulting metrics to the configured `-remoteWrite.url`. This mode works good for the majority of cases when the scrape target exposes small number of metrics (e.g. less than 10 thousand). But this mode may take big amounts of memory when the scrape target exposes big number of metrics. In this case it is recommended enabling stream parsing mode. When this mode is enabled, then `vmagent` reads response from scrape target in chunks, then immediately processes every chunk and pushes the processed metrics to remote storage. This allows saving memory when scraping targets that expose millions of metrics. Stream parsing mode may be enabled either globally for all of the scrape targets by passing `-promscrape.streamParse` command-line flag or on a per-scrape target basis with `stream_parse: true` option. For example:
```yml
scrape_configs:
- job_name: 'big-federate'
stream_parse: true
static_configs:
- targets:
- big-prometeus1
- big-prometeus2
honor_labels: true
metrics_path: /federate
params:
'match[]': ['{__name__!=""}']
```
Note that `sample_limit` option doesn't work if stream parsing is enabled because the parsed data is pushed to remote storage as soon as it is parsed. Therefore the `sample_limit` option doesn't make sense during stream parsing.
## Scraping big number of targets
A single `vmagent` instance can scrape tens of thousands of scrape targets. Sometimes this isn't enough due to limitations on CPU, network, RAM, etc.
In this case scrape targets can be split among multiple `vmagent` instances (aka `vmagent` horizontal scaling and clustering).
Each `vmagent` instance in the cluster must use identical `-promscrape.config` files with distinct `-promscrape.cluster.memberNum` values.
The flag value must be in the range `0 ... N-1`, where `N` is the number of `vmagent` instances in the cluster.
The number of `vmagent` instances in the cluster must be passed to `-promscrape.cluster.membersCount` command-line flag. For example, the following commands
spread scrape targets among a cluster of two `vmagent` instances:
By default each scrape target is scraped only by a single `vmagent` instance in the cluster. If there is a need for replicating scrape targets among multiple `vmagent` instances,
then `-promscrape.cluster.replicationFactor` command-line flag must be set to the desired number of replicas. For example, the following commands
start a cluster of three `vmagent` instances, where each target is scraped by two `vmagent` instances:
If each target is scraped by multiple `vmagent` instances, then data deduplication must be enabled at remote storage pointed by `-remoteWrite.url`.
See [these docs](https://docs.victoriametrics.com/#deduplication) for details.
## Scraping targets via a proxy
`vmagent` supports scraping targets via http, https and socks5 proxies. Proxy address must be specified in `proxy_url` option. For example, the following scrape config instructs
target scraping via https proxy at `https://proxy-addr:1234`:
```yml
scrape_configs:
- job_name: foo
proxy_url: https://proxy-addr:1234
```
Proxy can be configured with the following optional settings:
* `proxy_authorization` for generic token authorization. See [Prometheus docs for details on authorization section](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config)
* `proxy_bearer_token` and `proxy_bearer_token_file` for Bearer token authorization
* `proxy_basic_auth` for Basic authorization. See [these docs](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config).
* `proxy_tls_config` for TLS config. See [these docs](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#tls_config).
For example:
```yml
scrape_configs:
- job_name: foo
proxy_url: https://proxy-addr:1234
proxy_basic_auth:
username: foobar
password: secret
proxy_tls_config:
insecure_skip_verify: true
cert_file: /path/to/cert
key_file: /path/to/key
ca_file: /path/to/ca
server_name: real-server-name
```
## Cardinality limiter
By default `vmagent` doesn't limit the number of time series written to remote storage systems specified at `-remoteWrite.url`. The limit can be enforced by setting the following command-line flags:
* `-remoteWrite.maxHourlySeries` - limits the number of unique time series `vmagent` can write to remote storage systems during the last hour. Useful for limiting the number of active time series.
* `-remoteWrite.maxDailySeries` - limits the number of unique time series `vmagent` can write to remote storage systems during the last day. Useful for limiting daily churn rate.
Both limits can be set simultaneously. If any of these limits is reached, then samples for new time series are dropped instead of sending them to remote storage systems. A sample of dropped series is put in the log with `WARNING` level.
The exceeded limits can be [monitored](#monitoring) with the following metrics:
* `vmagent_hourly_series_limit_rows_dropped_total` - the number of metrics dropped due to exceeded hourly limit on the number of unique time series.
* `vmagent_daily_series_limit_rows_dropped_total` - the number of metrics dropped due to exceeded daily limit on the number of unique time series.
These limits are approximate, so `vmagent` can underflow/overflow the limit by a small percentage (usually less than 1%).
## Monitoring
`vmagent` exports various metrics in Prometheus exposition format at `http://vmagent-host:8429/metrics` page. We recommend setting up regular scraping of this page
either through `vmagent` itself or by Prometheus so that the exported metrics may be analyzed later.
Use official [Grafana dashboard](https://grafana.com/grafana/dashboards/12683) for `vmagent` state overview.
If you have suggestions for improvements or have found a bug - please open an issue on github or add a review to the dashboard.
`vmagent` also exports the status for various targets at the following handlers:
* `http://vmagent-host:8429/targets`. This handler returns human-readable status for every active target.
This page is easy to query from the command line with `wget`, `curl` or similar tools.
It accepts optional `show_original_labels=1` query arg which shows the original labels per each target before applying the relabeling.
This information may be useful for debugging target relabeling.
* `http://vmagent-host:8429/api/v1/targets`. This handler returns data compatible with [the corresponding page from Prometheus API](https://prometheus.io/docs/prometheus/latest/querying/api/#targets).
* `http://vmagent-host:8429/ready`. This handler returns http 200 status code when `vmagent` finishes it's initialization for all service_discovery configs.
It may be useful to perform `vmagent` rolling update without any scrape loss.
## Troubleshooting
* We recommend you [set up the official Grafana dashboard](#monitoring) in order to monitor the state of `vmagent'.
* We recommend you increase the maximum number of open files in the system (`ulimit -n`) when scraping a big number of targets,
as `vmagent` establishes at least a single TCP connection per target.
* If `vmagent` uses too big amounts of memory, then the following options can help:
* Enabling stream parsing. See [these docs](#stream-parsing-mode).
* Reducing the number of output queues with `-remoteWrite.queues` command-line option.
* Reducing the amounts of RAM vmagent can use for in-memory buffering with `-memory.allowedPercent` or `-memory.allowedBytes` command-line option. Another option is to reduce memory limits in Docker and/or Kuberntes if `vmagent` runs under these systems.
* Reducing the number of CPU cores vmagent can use by passing `GOMAXPROCS=N` environment variable to `vmagent`, where `N` is the desired limit on CPU cores. Another option is to reduce CPU limits in Docker or Kubernetes if `vmagent` runs under these systems.
* When `vmagent` scrapes many unreliable targets, it can flood the error log with scrape errors. These errors can be suppressed
by passing `-promscrape.suppressScrapeErrors` command-line flag to `vmagent`. The most recent scrape error per each target can be observed at `http://vmagent-host:8429/targets`
and `http://vmagent-host:8429/api/v1/targets`.
* The `/api/v1/targets` page could be useful for debugging relabeling process for scrape targets.
This page contains original labels for targets dropped during relabeling (see "droppedTargets" section in the page output). By default the `-promscrape.maxDroppedTargets` targets are shown here. If your setup drops more targets during relabeling, then increase `-promscrape.maxDroppedTargets` command-line flag value to see all the dropped targets. Note that tracking each dropped target requires up to 10Kb of RAM. Therefore big values for `-promscrape.maxDroppedTargets` may result in increased memory usage if a big number of scrape targets are dropped during relabeling.
* If `vmagent` scrapes a big number of targets then the `-promscrape.dropOriginalLabels` command-line option may be passed to `vmagent` in order to reduce memory usage.
This option drops `"discoveredLabels"` and `"droppedTargets"` lists at `/api/v1/targets` page, which may result in reduced debuggability for improperly configured per-target relabeling.
* If `vmagent` scrapes targets with millions of metrics per target (for example, when scraping [federation endpoints](https://prometheus.io/docs/prometheus/latest/federation/)),
we recommend enabling [stream parsing mode](#stream-parsing-mode) in order to reduce memory usage during scraping.
* We recommend you increase `-remoteWrite.queues` if `vmagent_remotewrite_pending_data_bytes` metric exported at `http://vmagent-host:8429/metrics` page grows constantly.
* If you see gaps in the data pushed by `vmagent` to remote storage when `-remoteWrite.maxDiskUsagePerURL` is set, try increasing `-remoteWrite.queues`.
Such gaps may appear because `vmagent` cannot keep up with sending the collected data to remote storage. Therefore it starts dropping the buffered data
if the on-disk buffer size exceeds `-remoteWrite.maxDiskUsagePerURL`.
*`vmagent` drops data blocks if remote storage replies with `400 Bad Request` and `409 Conflict` HTTP responses. The number of dropped blocks can be monitored via `vmagent_remotewrite_packets_dropped_total` metric exported at [/metrics page](#monitoring).
* Use `-remoteWrite.queues=1` when `-remoteWrite.url` points to remote storage, which doesn't accept out-of-order samples (aka data backfilling). Such storage systems include Prometheus, Cortex and Thanos.
*`vmagent` buffers scraped data at the `-remoteWrite.tmpDataPath` directory until it is sent to `-remoteWrite.url`.
The directory can grow large when remote storage is unavailable for extended periods of time and if `-remoteWrite.maxDiskUsagePerURL` isn't set.
If you don't want to send all the data from the directory to remote storage then simply stop `vmagent` and delete the directory.
* By default `vmagent` masks `-remoteWrite.url` with `secret-url` values in logs and at `/metrics` page because
the url may contain sensitive information such as auth tokens or passwords.
Pass `-remoteWrite.showURL` command-line flag when starting `vmagent` in order to see all the valid urls.
* By default `vmagent` evenly spreads scrape load in time. If a particular scrape target must be scraped at the beginning of some interval,
then `scrape_align_interval` option must be used. For example, the following config aligns hourly scrapes to the beginning of hour:
```yml
scrape_configs:
- job_name: foo
scrape_interval: 1h
scrape_align_interval: 1h
```
* By default `vmagent` evenly spreads scrape load in time. If a particular scrape target must be scraped at specific offset, then `scrape_offset` option must be used.
For example, the following config instructs `vmagent` to scrape the target at 10 seconds of every minute:
```yml
scrape_configs:
- job_name: foo
scrape_interval: 1m
scrape_offset: 10s
```
* If you see `skipping duplicate scrape target with identical labels` errors when scraping Kubernetes pods, then it is likely these pods listen to multiple ports
or they use an init container. These errors can either be fixed or suppressed with the `-promscrape.suppressDuplicateScrapeTargetErrors` command-line flag.
See the available options below if you prefer fixing the root cause of the error:
The following relabeling rule may be added to `relabel_configs` section in order to filter out pods with unneeded ports:
2. Run `make vmagent-prod` from the root folder of [the repository](https://github.com/VictoriaMetrics/VictoriaMetrics).
It builds `vmagent-prod` binary and puts it into the `bin` folder.
### Building docker images
Run `make package-vmagent`. It builds `victoriametrics/vmagent:<PKG_TAG>` docker image locally.
`<PKG_TAG>` is an auto-generated image tag, which depends on source code in [the repository](https://github.com/VictoriaMetrics/VictoriaMetrics).
The `<PKG_TAG>` may be manually set via `PKG_TAG=foobar make package-vmagent`.
The base docker image is [alpine](https://hub.docker.com/_/alpine) but it is possible to use any other base image
by setting it via `<ROOT_IMAGE>` environment variable. For example, the following command builds the image on top of [scratch](https://hub.docker.com/_/scratch) image:
```bash
ROOT_IMAGE=scratch make package-vmagent
```
### ARM build
ARM build may run on Raspberry Pi or on [energy-efficient ARM servers](https://blog.cloudflare.com/arm-takes-wing/).
### Development ARM build
1. [Install Go](https://golang.org/doc/install). The minimum supported version is Go 1.15.
2. Run `make vmagent-arm` or `make vmagent-arm64` from the root folder of [the repository](https://github.com/VictoriaMetrics/VictoriaMetrics)
It builds `vmagent-arm` or `vmagent-arm64` binary respectively and puts it into the `bin` folder.
2. Run `make vmagent-arm-prod` or `make vmagent-arm64-prod` from the root folder of [the repository](https://github.com/VictoriaMetrics/VictoriaMetrics).
It builds `vmagent-arm-prod` or `vmagent-arm64-prod` binary respectively and puts it into the `bin` folder.
## Profiling
`vmagent` provides handlers for collecting the following [Go profiles](https://blog.golang.org/profiling-go-programs):
* Memory profile can be collected with the following command:
The command for collecting CPU profile waits for 30 seconds before returning.
The collected profiles may be analyzed with [go tool pprof](https://github.com/google/pprof).
## Advanced usage
`vmagent` can be fine-tuned with various command-line flags. Run `./vmagent -help` in order to see the full list of these flags with their desciptions and default values:
```
./vmagent -help
vmagent collects metrics data via popular data ingestion protocols and routes them to VictoriaMetrics.
See the docs at https://docs.victoriametrics.com/vmagent.html .
-csvTrimTimestamp duration
Trim timestamps when importing csv data to this duration. Minimum practical duration is 1ms. Higher duration (i.e. 1s) may be used for reducing disk space usage for timestamp data (default 1ms)
-dryRun
Whether to check only config files without running vmagent. The following files are checked: -promscrape.config, -remoteWrite.relabelConfig, -remoteWrite.urlRelabelConfig . Unknown config entries are allowed in -promscrape.config by default. This can be changed with -promscrape.config.strictParse
-enableTCP6
Whether to enable IPv6 for listening and dialing. By default only IPv4 TCP and UDP is used
-envflag.enable
Whether to enable reading flags from environment variables additionally to command line. Command line flag values have priority over values from environment vars. Flags are read only from command line if this flag isn't set
-envflag.prefix string
Prefix for environment variables if -envflag.enable is set
-fs.disableMmap
Whether to use pread() instead of mmap() for reading data files. By default mmap() is used for 64-bit arches and pread() is used for 32-bit arches, since they cannot read data files bigger than 2^32 bytes in memory. mmap() is usually faster for reading small data chunks than pread()
-graphiteListenAddr string
TCP and UDP address to listen for Graphite plaintext data. Usually :2003 must be set. Doesn't work if empty
-graphiteTrimTimestamp duration
Trim timestamps for Graphite data to this duration. Minimum practical duration is 1s. Higher duration (i.e. 1m) may be used for reducing disk space usage for timestamp data (default 1s)
-http.connTimeout duration
Incoming http connections are closed after the configured timeout. This may help to spread the incoming load among a cluster of services behind a load balancer. Please note that the real timeout may be bigger by up to 10% as a protection against the thundering herd problem (default 2m0s)
-http.disableResponseCompression
Disable compression of HTTP responses to save CPU resources. By default compression is enabled to save network bandwidth
-http.idleConnTimeout duration
Timeout for incoming idle http connections (default 1m0s)
-http.maxGracefulShutdownDuration duration
The maximum duration for a graceful shutdown of the HTTP server. A highly loaded server may require increased value for a graceful shutdown (default 7s)
-http.pathPrefix string
An optional prefix to add to all the paths handled by http server. For example, if '-http.pathPrefix=/foo/bar' is set, then all the http requests will be handled on '/foo/bar/*' paths. This may be useful for proxied requests. See https://www.robustperception.io/using-external-urls-and-proxies-with-prometheus
-http.shutdownDelay duration
Optional delay before http server shutdown. During this delay, the server returns non-OK responses from /health page, so load balancers can route new requests to other servers
-httpAuth.password string
Password for HTTP Basic Auth. The authentication is disabled if -httpAuth.username is empty
-httpAuth.username string
Username for HTTP Basic Auth. The authentication is disabled if empty. See also -httpAuth.password
-httpListenAddr string
TCP address to listen for http connections. Set this flag to empty value in order to disable listening on any port. This mode may be useful for running multiple vmagent instances on the same server. Note that /targets and /metrics pages aren't available if -httpListenAddr='' (default ":8429")
-import.maxLineLen size
The maximum length in bytes of a single line accepted by /api/v1/import; the line length can be limited with 'max_rows_per_line' query arg passed to /api/v1/export
Supports the following optional suffixes for size values: KB, MB, GB, KiB, MiB, GiB (default 104857600)
-influx.databaseNames array
Comma-separated list of database names to return from /query and /influx/query API. This can be needed for accepting data from Telegraf plugins such as https://github.com/fangli/fluent-plugin-influxdb
Supports an array of values separated by comma or specified via multiple flags.
-influx.maxLineSize size
The maximum size in bytes for a single Influx line during parsing
Supports the following optional suffixes for size values: KB, MB, GB, KiB, MiB, GiB (default 262144)
-influxListenAddr string
TCP and UDP address to listen for Influx line protocol data. Usually :8189 must be set. Doesn't work if empty. This flag isn't needed when ingesting data over HTTP - just send it to http://<vmagent>:8429/write
-influxMeasurementFieldSeparator string
Separator for '{measurement}{separator}{field_name}' metric name when inserted via Influx line protocol (default "_")
-influxSkipMeasurement
Uses '{field_name}' as a metric name while ignoring '{measurement}' and '-influxMeasurementFieldSeparator'
-influxSkipSingleField
Uses '{measurement}' instead of '{measurement}{separator}{field_name}' for metic name if Influx line contains only a single field
-influxTrimTimestamp duration
Trim timestamps for Influx line protocol data to this duration. Minimum practical duration is 1ms. Higher duration (i.e. 1s) may be used for reducing disk space usage for timestamp data (default 1ms)
-insert.maxQueueDuration duration
The maximum duration for waiting in the queue for insert requests due to -maxConcurrentInserts (default 1m0s)
-loggerDisableTimestamps
Whether to disable writing timestamps in logs
-loggerErrorsPerSecondLimit int
Per-second limit on the number of ERROR messages. If more than the given number of errors are emitted per second, the remaining errors are suppressed. Zero values disable the rate limit
-loggerFormat string
Format for logs. Possible values: default, json (default "default")
-loggerLevel string
Minimum level of errors to log. Possible values: INFO, WARN, ERROR, FATAL, PANIC (default "INFO")
-loggerOutput string
Output for the logs. Supported values: stderr, stdout (default "stderr")
-loggerTimezone string
Timezone to use for timestamps in logs. Timezone must be a valid IANA Time Zone. For example: America/New_York, Europe/Berlin, Etc/GMT+3 or Local (default "UTC")
-loggerWarnsPerSecondLimit int
Per-second limit on the number of WARN messages. If more than the given number of warns are emitted per second, then the remaining warns are suppressed. Zero values disable the rate limit
-maxConcurrentInserts int
The maximum number of concurrent inserts. Default value should work for most cases, since it minimizes the overhead for concurrent inserts. This option is tigthly coupled with -insert.maxQueueDuration (default 16)
-maxInsertRequestSize size
The maximum size in bytes of a single Prometheus remote_write API request
Supports the following optional suffixes for size values: KB, MB, GB, KiB, MiB, GiB (default 33554432)
-memory.allowedBytes size
Allowed size of system memory VictoriaMetrics caches may occupy. This option overrides -memory.allowedPercent if set to a non-zero value. Too low a value may increase the cache miss rate usually resulting in higher CPU and disk IO usage. Too high a value may evict too much data from OS page cache resulting in higher disk IO usage
Supports the following optional suffixes for size values: KB, MB, GB, KiB, MiB, GiB (default 0)
-memory.allowedPercent float
Allowed percent of system memory VictoriaMetrics caches may occupy. See also -memory.allowedBytes. Too low a value may increase cache miss rate usually resulting in higher CPU and disk IO usage. Too high a value may evict too much data from OS page cache which will result in higher disk IO usage (default 60)
-metricsAuthKey string
Auth key for /metrics. It overrides httpAuth settings
-opentsdbHTTPListenAddr string
TCP address to listen for OpentTSDB HTTP put requests. Usually :4242 must be set. Doesn't work if empty
-opentsdbListenAddr string
TCP and UDP address to listen for OpentTSDB metrics. Telnet put messages and HTTP /api/put messages are simultaneously served on TCP port. Usually :4242 must be set. Doesn't work if empty
-opentsdbTrimTimestamp duration
Trim timestamps for OpenTSDB 'telnet put' data to this duration. Minimum practical duration is 1s. Higher duration (i.e. 1m) may be used for reducing disk space usage for timestamp data (default 1s)
-opentsdbhttp.maxInsertRequestSize size
The maximum size of OpenTSDB HTTP put request
Supports the following optional suffixes for size values: KB, MB, GB, KiB, MiB, GiB (default 33554432)
-opentsdbhttpTrimTimestamp duration
Trim timestamps for OpenTSDB HTTP data to this duration. Minimum practical duration is 1ms. Higher duration (i.e. 1s) may be used for reducing disk space usage for timestamp data (default 1ms)
-pprofAuthKey string
Auth key for /debug/pprof. It overrides httpAuth settings
-promscrape.cluster.memberNum int
The number of number in the cluster of scrapers. It must be an unique value in the range 0 ... promscrape.cluster.membersCount-1 across scrapers in the cluster
-promscrape.cluster.membersCount int
The number of members in a cluster of scrapers. Each member must have an unique -promscrape.cluster.memberNum in the range 0 ... promscrape.cluster.membersCount-1 . Each member then scrapes roughly 1/N of all the targets. By default cluster scraping is disabled, i.e. a single scraper scrapes all the targets
-promscrape.cluster.replicationFactor int
The number of members in the cluster, which scrape the same targets. If the replication factor is greater than 2, then the deduplication must be enabled at remote storage side. See https://docs.victoriametrics.com/#deduplication (default 1)
-promscrape.config string
Optional path to Prometheus config file with 'scrape_configs' section containing targets to scrape. See https://docs.victoriametrics.com/#how-to-scrape-prometheus-exporters-such-as-node-exporter for details
-promscrape.config.dryRun
Checks -promscrape.config file for errors and unsupported fields and then exits. Returns non-zero exit code on parsing errors and emits these errors to stderr. See also -promscrape.config.strictParse command-line flag. Pass -loggerLevel=ERROR if you don't need to see info messages in the output.
-promscrape.config.strictParse
Whether to allow only supported fields in -promscrape.config . By default unsupported fields are silently skipped
-promscrape.configCheckInterval duration
Interval for checking for changes in '-promscrape.config' file. By default the checking is disabled. Send SIGHUP signal in order to force config check for changes
-promscrape.consul.waitTime duration
Wait time used by Consul service discovery. Default value is used if not set
-promscrape.consulSDCheckInterval duration
Interval for checking for changes in Consul. This works only if consul_sd_configs is configured in '-promscrape.config' file. See https://prometheus.io/docs/prometheus/latest/configuration/configuration/#consul_sd_config for details (default 30s)
-promscrape.disableCompression
Whether to disable sending 'Accept-Encoding: gzip' request headers to all the scrape targets. This may reduce CPU usage on scrape targets at the cost of higher network bandwidth utilization. It is possible to set 'disable_compression: true' individually per each 'scrape_config' section in '-promscrape.config' for fine grained control
-promscrape.disableKeepAlive
Whether to disable HTTP keep-alive connections when scraping all the targets. This may be useful when targets has no support for HTTP keep-alive connection. It is possible to set 'disable_keepalive: true' individually per each 'scrape_config' section in '-promscrape.config' for fine grained control. Note that disabling HTTP keep-alive may increase load on both vmagent and scrape targets
-promscrape.discovery.concurrency int
The maximum number of concurrent requests to Prometheus autodiscovery API (Consul, Kubernetes, etc.) (default 100)
-promscrape.discovery.concurrentWaitTime duration
The maximum duration for waiting to perform API requests if more than -promscrape.discovery.concurrency requests are simultaneously performed (default 1m0s)
-promscrape.dnsSDCheckInterval duration
Interval for checking for changes in dns. This works only if dns_sd_configs is configured in '-promscrape.config' file. See https://prometheus.io/docs/prometheus/latest/configuration/configuration/#dns_sd_config for details (default 30s)
-promscrape.dockerswarmSDCheckInterval duration
Interval for checking for changes in dockerswarm. This works only if dockerswarm_sd_configs is configured in '-promscrape.config' file. See https://prometheus.io/docs/prometheus/latest/configuration/configuration/#dockerswarm_sd_config for details (default 30s)
-promscrape.dropOriginalLabels
Whether to drop original labels for scrape targets at /targets and /api/v1/targets pages. This may be needed for reducing memory usage when original labels for big number of scrape targets occupy big amounts of memory. Note that this reduces debuggability for improper per-target relabeling configs
-promscrape.ec2SDCheckInterval duration
Interval for checking for changes in ec2. This works only if ec2_sd_configs is configured in '-promscrape.config' file. See https://prometheus.io/docs/prometheus/latest/configuration/configuration/#ec2_sd_config for details (default 1m0s)
-promscrape.eurekaSDCheckInterval duration
Interval for checking for changes in eureka. This works only if eureka_sd_configs is configured in '-promscrape.config' file. See https://prometheus.io/docs/prometheus/latest/configuration/configuration/#eureka_sd_config for details (default 30s)
-promscrape.fileSDCheckInterval duration
Interval for checking for changes in 'file_sd_config'. See https://prometheus.io/docs/prometheus/latest/configuration/configuration/#file_sd_config for details (default 30s)
-promscrape.gceSDCheckInterval duration
Interval for checking for changes in gce. This works only if gce_sd_configs is configured in '-promscrape.config' file. See https://prometheus.io/docs/prometheus/latest/configuration/configuration/#gce_sd_config for details (default 1m0s)
-promscrape.kubernetes.apiServerTimeout duration
How frequently to reload the full state from Kuberntes API server (default 30m0s)
-promscrape.kubernetesSDCheckInterval duration
Interval for checking for changes in Kubernetes API server. This works only if kubernetes_sd_configs is configured in '-promscrape.config' file. See https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config for details (default 30s)
-promscrape.maxDroppedTargets int
The maximum number of droppedTargets to show at /api/v1/targets page. Increase this value if your setup drops more scrape targets during relabeling and you need investigating labels for all the dropped targets. Note that the increased number of tracked dropped targets may result in increased memory usage (default 1000)
-promscrape.maxScrapeSize size
The maximum size of scrape response in bytes to process from Prometheus targets. Bigger responses are rejected
Supports the following optional suffixes for size values: KB, MB, GB, KiB, MiB, GiB (default 16777216)
-promscrape.openstackSDCheckInterval duration
Interval for checking for changes in openstack API server. This works only if openstack_sd_configs is configured in '-promscrape.config' file. See https://prometheus.io/docs/prometheus/latest/configuration/configuration/#openstack_sd_config for details (default 30s)
-promscrape.streamParse
Whether to enable stream parsing for metrics obtained from scrape targets. This may be useful for reducing memory usage when millions of metrics are exposed per each scrape target. It is posible to set 'stream_parse: true' individually per each 'scrape_config' section in '-promscrape.config' for fine grained control
-promscrape.suppressDuplicateScrapeTargetErrors
Whether to suppress 'duplicate scrape target' errors; see https://docs.victoriametrics.com/vmagent.html#troubleshooting for details
-promscrape.suppressScrapeErrors
Whether to suppress scrape errors logging. The last error for each target is always available at '/targets' page even if scrape errors logging is suppressed
-remoteWrite.basicAuth.password array
Optional basic auth password to use for -remoteWrite.url. If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.basicAuth.passwordFile array
Optional path to basic auth password to use for -remoteWrite.url. The file is re-read every second. If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.basicAuth.username array
Optional basic auth username to use for -remoteWrite.url. If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.bearerToken array
Optional bearer auth token to use for -remoteWrite.url. If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.bearerTokenFile array
Optional path to bearer token file to use for -remoteWrite.url. The token is re-read from the file every second. If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.flushInterval duration
Interval for flushing the data to remote storage. This option takes effect only when less than 10K data points per second are pushed to -remoteWrite.url (default 1s)
-remoteWrite.label array
Optional label in the form 'name=value' to add to all the metrics before sending them to -remoteWrite.url. Pass multiple -remoteWrite.label flags in order to add multiple labels to metrics before sending them to remote storage
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.maxBlockSize size
The maximum size in bytes of unpacked request to send to remote storage. It shouldn't exceed -maxInsertRequestSize from VictoriaMetrics
Supports the following optional suffixes for size values: KB, MB, GB, KiB, MiB, GiB (default 8388608)
-remoteWrite.maxDailySeries int
The maximum number of unique series vmagent can send to remote storage systems during the last 24 hours. Excess series are logged and dropped. This can be useful for limiting series churn rate. See also -remoteWrite.maxHourlySeries
-remoteWrite.maxDiskUsagePerURL size
The maximum file-based buffer size in bytes at -remoteWrite.tmpDataPath for each -remoteWrite.url. When buffer size reaches the configured maximum, then old data is dropped when adding new data to the buffer. Buffered data is stored in ~500MB chunks, so the minimum practical value for this flag is 500000000. Disk usage is unlimited if the value is set to 0
Supports the following optional suffixes for size values: KB, MB, GB, KiB, MiB, GiB (default 0)
-remoteWrite.maxHourlySeries int
The maximum number of unique series vmagent can send to remote storage systems during the last hour. Excess series are logged and dropped. This can be useful for limiting series cardinality. See also -remoteWrite.maxDailySeries
-remoteWrite.oauth2.clientID array
Optional OAuth2 clientID to use for -remoteWrite.url. If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.oauth2.clientSecret array
Optional OAuth2 clientSecret to use for -remoteWrite.url. If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.oauth2.clientSecretFile array
Optional OAuth2 clientSecretFile to use for -remoteWrite.url. If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.oauth2.scopes array
Optional OAuth2 scopes to use for -remoteWrite.url. Scopes must be delimited by ';'. If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.oauth2.tokenUrl array
Optional OAuth2 tokenURL to use for -remoteWrite.url. If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.proxyURL array
Optional proxy URL for writing data to -remoteWrite.url. Supported proxies: http, https, socks5. Example: -remoteWrite.proxyURL=socks5://proxy:1234
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.queues int
The number of concurrent queues to each -remoteWrite.url. Set more queues if default number of queues isn't enough for sending high volume of collected data to remote storage (default 4)
-remoteWrite.rateLimit array
Optional rate limit in bytes per second for data sent to -remoteWrite.url. By default the rate limit is disabled. It can be useful for limiting load on remote storage when big amounts of buffered data is sent after temporary unavailability of the remote storage
Supports array of values separated by comma or specified via multiple flags.
-remoteWrite.relabelConfig string
Optional path to file with relabel_config entries. These entries are applied to all the metrics before sending them to -remoteWrite.url. See https://docs.victoriametrics.com/vmagent.html#relabeling for details
-remoteWrite.roundDigits array
Round metric values to this number of decimal digits after the point before writing them to remote storage. Examples: -remoteWrite.roundDigits=2 would round 1.236 to 1.24, while -remoteWrite.roundDigits=-1 would round 126.78 to 130. By default digits rounding is disabled. Set it to 100 for disabling it for a particular remote storage. This option may be used for improving data compression for the stored metrics
Supports array of values separated by comma or specified via multiple flags.
-remoteWrite.sendTimeout array
Timeout for sending a single block of data to -remoteWrite.url
Supports array of values separated by comma or specified via multiple flags.
-remoteWrite.showURL
Whether to show -remoteWrite.url in the exported metrics. It is hidden by default, since it can contain sensitive info such as auth key
-remoteWrite.significantFigures array
The number of significant figures to leave in metric values before writing them to remote storage. See https://en.wikipedia.org/wiki/Significant_figures . Zero value saves all the significant figures. This option may be used for improving data compression for the stored metrics. See also -remoteWrite.roundDigits
Supports array of values separated by comma or specified via multiple flags.
-remoteWrite.tlsCAFile array
Optional path to TLS CA file to use for verifying connections to -remoteWrite.url. By default system CA is used. If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.tlsCertFile array
Optional path to client-side TLS certificate file to use when connecting to -remoteWrite.url. If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.tlsInsecureSkipVerify array
Whether to skip tls verification when connecting to -remoteWrite.url
Supports array of values separated by comma or specified via multiple flags.
-remoteWrite.tlsKeyFile array
Optional path to client-side TLS certificate key to use when connecting to -remoteWrite.url. If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.tlsServerName array
Optional TLS server name to use for connections to -remoteWrite.url. By default the server name from -remoteWrite.url is used. If multiple args are set, then they are applied independently for the corresponding -remoteWrite.url
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.tmpDataPath string
Path to directory where temporary data for remote write component is stored. See also -remoteWrite.maxDiskUsagePerURL (default "vmagent-remotewrite-data")
-remoteWrite.url array
Remote storage URL to write data to. It must support Prometheus remote_write API. It is recommended using VictoriaMetrics as remote storage. Example url: http://<victoriametrics-host>:8428/api/v1/write . Pass multiple -remoteWrite.url flags in order to write data concurrently to multiple remote storage systems
Supports an array of values separated by comma or specified via multiple flags.
-remoteWrite.urlRelabelConfig array
Optional path to relabel config for the corresponding -remoteWrite.url
Supports an array of values separated by comma or specified via multiple flags.
-sortLabels
Whether to sort labels for incoming samples before writing them to all the configured remote storage systems. This may be needed for reducing memory usage at remote storage when the order of labels in incoming samples is random. For example, if m{k1="v1",k2="v2"} may be sent as m{k2="v2",k1="v1"}Enabled sorting for labels can slow down ingestion performance a bit
-tls
Whether to enable TLS (aka HTTPS) for incoming requests. -tlsCertFile and -tlsKeyFile must be set if -tls is set
-tlsCertFile string
Path to file with TLS certificate. Used only if -tls is set. Prefer ECDSA certs instead of RSA certs as RSA certs are slower
-tlsKeyFile string
Path to file with TLS key. Used only if -tls is set
-version
Show VictoriaMetrics version
```
vmagent docs can be edited at [docs/vmagent.md](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/docs/vmagent.md).
measurementFieldSeparator=flag.String("influxMeasurementFieldSeparator","_","Separator for '{measurement}{separator}{field_name}' metric name when inserted via Influx line protocol")
skipSingleField=flag.Bool("influxSkipSingleField",false,"Uses '{measurement}' instead of '{measurement}{separator}{field_name}' for metic name if Influx line contains only a single field")
measurementFieldSeparator=flag.String("influxMeasurementFieldSeparator","_","Separator for '{measurement}{separator}{field_name}' metric name when inserted via InfluxDB line protocol")
skipSingleField=flag.Bool("influxSkipSingleField",false,"Uses '{measurement}' instead of '{measurement}{separator}{field_name}' for metric name if InfluxDB line contains only a single field")
skipMeasurement=flag.Bool("influxSkipMeasurement",false,"Uses '{field_name}' as a metric name while ignoring '{measurement}' and '-influxMeasurementFieldSeparator'")
dbLabel=flag.String("influxDBLabel","db","Default label for the DB name sent over '?db={db_name}' query parameter")
httpListenAddr=flag.String("httpListenAddr",":8429","TCP address to listen for http connections. "+
httpListenAddrs=flagutil.NewArrayString("httpListenAddr","TCP address to listen for incoming http requests. "+
"Set this flag to empty value in order to disable listening on any port. This mode may be useful for running multiple vmagent instances on the same server. "+
"Note that /targets and /metrics pages aren't available if -httpListenAddr=''")
influxListenAddr=flag.String("influxListenAddr","","TCP and UDP address to listen for Influx line protocol data. Usually :8189 must be set. Doesn't work if empty. "+
"This flag isn't needed when ingesting data over HTTP - just send it to http://<vmagent>:8429/write")
graphiteListenAddr=flag.String("graphiteListenAddr","","TCP and UDP address to listen for Graphite plaintext data. Usually :2003 must be set. Doesn't work if empty")
opentsdbListenAddr=flag.String("opentsdbListenAddr","","TCP and UDP address to listen for OpentTSDB metrics. "+
"Note that /targets and /metrics pages aren't available if -httpListenAddr=''. See also -tls and -httpListenAddr.useProxyProtocol")
useProxyProtocol=flagutil.NewArrayBool("httpListenAddr.useProxyProtocol","Whether to use proxy protocol for connections accepted at the corresponding -httpListenAddr . "+
"See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt . "+
"With enabled proxy protocol http server cannot serve regular /metrics endpoint. Use -pushmetrics.url for metrics pushing")
influxListenAddr=flag.String("influxListenAddr","","TCP and UDP address to listen for InfluxDB line protocol data. Usually :8089 must be set. Doesn't work if empty. "+
"This flag isn't needed when ingesting data over HTTP - just send it to http://<vmagent>:8429/write . "+
"See also -influxListenAddr.useProxyProtocol")
influxUseProxyProtocol=flag.Bool("influxListenAddr.useProxyProtocol",false,"Whether to use proxy protocol for connections accepted at -influxListenAddr . "+
"See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt")
graphiteListenAddr=flag.String("graphiteListenAddr","","TCP and UDP address to listen for Graphite plaintext data. Usually :2003 must be set. Doesn't work if empty. "+
"See also -graphiteListenAddr.useProxyProtocol")
graphiteUseProxyProtocol=flag.Bool("graphiteListenAddr.useProxyProtocol",false,"Whether to use proxy protocol for connections accepted at -graphiteListenAddr . "+
"See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt")
opentsdbListenAddr=flag.String("opentsdbListenAddr","","TCP and UDP address to listen for OpenTSDB metrics. "+
"Telnet put messages and HTTP /api/put messages are simultaneously served on TCP port. "+
"Usually :4242 must be set. Doesn't work if empty")
opentsdbHTTPListenAddr=flag.String("opentsdbHTTPListenAddr","","TCP address to listen for OpentTSDB HTTP put requests. Usually :4242 must be set. Doesn't work if empty")
dryRun=flag.Bool("dryRun",false,"Whether to check only config files without running vmagent. The following files are checked: "+
"Unknown config entries are allowed in -promscrape.config by default. This can be changed with -promscrape.config.strictParse")
"Usually :4242 must be set. Doesn't work if empty. See also -opentsdbListenAddr.useProxyProtocol")
opentsdbUseProxyProtocol=flag.Bool("opentsdbListenAddr.useProxyProtocol",false,"Whether to use proxy protocol for connections accepted at -opentsdbListenAddr . "+
"See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt")
opentsdbHTTPListenAddr=flag.String("opentsdbHTTPListenAddr","","TCP address to listen for OpenTSDB HTTP put requests. Usually :4242 must be set. Doesn't work if empty. "+
"See also -opentsdbHTTPListenAddr.useProxyProtocol")
opentsdbHTTPUseProxyProtocol=flag.Bool("opentsdbHTTPListenAddr.useProxyProtocol",false,"Whether to use proxy protocol for connections accepted "+
"at -opentsdbHTTPListenAddr . See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt")
configAuthKey=flagutil.NewPassword("configAuthKey","Authorization key for accessing /config page. It must be passed via authKey query arg")
reloadAuthKey=flagutil.NewPassword("reloadAuthKey","Auth key for /-/reload http endpoint. It must be passed as authKey=...")
dryRun=flag.Bool("dryRun",false,"Whether to check config files without running vmagent. The following files are checked: "+
"Unknown config entries aren't allowed in -promscrape.config by default. This can be changed by passing -promscrape.config.strictParse=false command-line flag")
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.