vmagent
vmagent is a tiny agent that helps you collect metrics from various sources,
relabel and filter the collected metrics
and store them in VictoriaMetrics
or any other storage systems via Prometheus remote_write protocol
or via VictoriaMetrics remote_write protocol.
See Quick Start for details.
Motivation
While VictoriaMetrics provides an efficient solution to store and observe metrics, our users needed something fast
and RAM friendly to scrape metrics from Prometheus-compatible exporters into VictoriaMetrics.
Also, we found that our users' infrastructures are like snowflakes in that no two are alike. Therefore, we decided to add more flexibility
to vmagent such as the ability to accept metrics via popular push protocols
and to discover Prometheus-compatible targets and scrape metrics from them.
Features
- Can be used as a drop-in replacement for Prometheus for discovering and scraping targets such as node_exporter.
Note that single-node VictoriaMetrics can also discover and scrape Prometheus-compatible targets in the same way vmagent does - see these docs.
- Can add, remove and modify labels (aka tags) via Prometheus relabeling and filter data before sending it to remote storage. See these docs for details.
- Can accept data via all the ingestion protocols supported by VictoriaMetrics - see these docs.
- Can aggregate incoming samples by time and by labels before sending them to remote storage - see these docs.
- Can replicate collected metrics simultaneously to multiple Prometheus-compatible remote storage systems - see these docs.
- Can save egress network bandwidth usage costs when VictoriaMetrics remote write protocol is used for sending the data to VictoriaMetrics.
- Works smoothly in environments with unstable connections to remote storage. If the remote storage is unavailable, the collected metrics
are buffered at -remoteWrite.tmpDataPath. The buffered metrics are sent to remote storage as soon as the connection to the remote storage is restored. The maximum disk usage for the buffer can be limited with -remoteWrite.maxDiskUsagePerURL.
- Uses much lower amounts of RAM, CPU, disk IO and network bandwidth than Prometheus. The RAM usage and CPU usage can be reduced further if needed according to these docs.
- Scrape targets can be spread among multiple vmagent instances when a large number of targets must be scraped. See these docs.
- Can load scrape configs from multiple files. See these docs.
- Can efficiently scrape targets that expose millions of time series such as /federate endpoint in Prometheus. See these docs.
- Can deal with high cardinality and high churn rate issues by limiting the number of unique time series at scrape time and before sending them to remote storage systems. See these docs.
- Can write collected metrics to multiple tenants. See these docs.
- Can read and write data from / to Kafka. See these docs.
- Can read and write data from / to Google PubSub. See these docs.
Quick Start
Please download the vmutils-* archive from the releases page (vmagent is also available as docker images at Docker Hub and Quay),
unpack it and pass the following flags to the vmagent binary in order to start scraping Prometheus-compatible targets
and sending the data to the Prometheus-compatible remote storage:
- -promscrape.config with the path to the Prometheus config file (usually located at /etc/prometheus/prometheus.yml). The path can point either to a local file or to an http url. See scrape config examples. vmagent doesn't support some sections of the Prometheus config file, so you may need either to delete these sections or to run vmagent with the -promscrape.config.strictParse=false command-line flag. In this case vmagent ignores unsupported sections. See the list of unsupported sections.
- -remoteWrite.url of a Prometheus-compatible remote storage endpoint (e.g., VictoriaMetrics) to send data to. The -remoteWrite.url may refer to a DNS SRV address. See these docs for details.
Example command for writing the data received via supported push-based protocols
to single-node VictoriaMetrics located at victoria-metrics-host:8428:
/path/to/vmagent -remoteWrite.url=https://victoria-metrics-host:8428/api/v1/write
See these docs if you need to write data to VictoriaMetrics cluster.
Example command for scraping Prometheus targets and writing the data to single-node VictoriaMetrics:
/path/to/vmagent -promscrape.config=/path/to/prometheus.yml -remoteWrite.url=https://victoria-metrics-host:8428/api/v1/write
See how to scrape Prometheus-compatible targets for more details.
If you use single-node VictoriaMetrics, then you can discover and scrape Prometheus-compatible targets directly from VictoriaMetrics
without the need to use vmagent - see these docs.
vmagent can save network bandwidth usage costs under high load when VictoriaMetrics remote write protocol is used.
Common vmagent issues are covered in the troubleshooting docs.
See various use cases for vmagent.
Pass -help to vmagent in order to see the full list of supported command-line flags with their descriptions.
Use Cases
IoT and Edge monitoring
vmagent can run and collect metrics in IoT environments and industrial networks with unreliable or scheduled connections to remote storage.
It buffers the collected data in local files until the connection to remote storage becomes available and then sends the buffered
data to the remote storage. It retries sending the data to remote storage until errors are resolved.
The maximum on-disk size for the buffered metrics can be limited with -remoteWrite.maxDiskUsagePerURL.
vmagent works on several common architectures used in IoT environments - 32-bit arm, 64-bit arm, ppc64, 386, amd64.
vmagent can save on network bandwidth usage costs by using VictoriaMetrics remote write protocol.
See how to optimize index size at VictoriaMetrics for IoT and industrial monitoring.
Drop-in replacement for Prometheus
If you use Prometheus only for scraping metrics from various targets and forwarding these metrics to remote storage
then vmagent can replace Prometheus. Typically, vmagent requires lower amounts of RAM, CPU and network bandwidth compared with Prometheus.
See these docs for details.
Statsd alternative
vmagent can be used as an alternative to statsd
when stream aggregation is enabled.
See these docs for details.
Flexible metrics relay
vmagent can accept metrics in various popular data ingestion protocols, apply relabeling
to the accepted metrics (for example, change metric names/labels or drop unneeded metrics) and then forward the relabeled metrics
to other remote storage systems that support Prometheus remote_write protocol (including other vmagent instances).
Replication and high availability
vmagent replicates the collected metrics among multiple remote storage instances configured via -remoteWrite.url args.
If a single remote storage instance is temporarily out of service, then the collected data remains available in the other remote storage instances.
vmagent buffers the collected data in files at -remoteWrite.tmpDataPath until the remote storage becomes available again,
and then it sends the buffered data to the remote storage in order to prevent data gaps.
VictoriaMetrics cluster already supports replication,
so there is no need to specify multiple -remoteWrite.url flags when writing data to the same cluster.
See these docs.
Relabeling and filtering
vmagent can add, remove or update labels on the collected data before sending it to the remote storage.
It can filter scrape targets or remove unwanted samples via Prometheus-like relabeling.
Please see Relabeling cookbook for details.
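For example, a minimal relabeling rule sketch that drops unwanted metrics before they are sent to remote storage (the go_ metric name prefix below is purely illustrative):

# Hypothetical rule: drop every metric whose name starts with go_
- action: drop
  source_labels: [__name__]
  regex: "go_.*"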
Sharding among remote storages
By default vmagent replicates data to remote storage systems via the -remoteWrite.url command-line flag.
If the -remoteWrite.shardByURL command-line flag is set, then vmagent spreads
the outgoing time series evenly among all the remote storage
systems listed in -remoteWrite.url.
It is possible to additionally replicate samples among remote storage systems by passing the -remoteWrite.shardByURLReplicas=N
command-line flag to vmagent alongside the -remoteWrite.shardByURL command-line flag.
This instructs vmagent to write every outgoing sample to N distinct remote storage systems listed in -remoteWrite.url
in addition to sharding.
Samples for the same time series are routed to the same remote storage system if -remoteWrite.shardByURL flag is specified.
This allows building scalable data processing pipelines when a single remote storage cannot keep up with the data ingestion workload.
For example, this allows horizontal scaling with stream aggregation
by routing outgoing samples for the same time series like counter
and histogram types from top-level vmagent instances
to the same second-level vmagent instance, so they are aggregated properly.
If -remoteWrite.shardByURL command-line flag is set, then all the metric labels are used for even sharding
among remote storage systems specified in -remoteWrite.url.
The -remoteWrite.shardByURL may not work as expected when SRV URLs are in use. An SRV record might resolve to multiple addresses, and one address is chosen randomly for all subsequent logic, including sharding. This makes sharding inconsistent: samples of the same time series always go to the same remote write URL/SRV record, but they may reach different addresses randomly based on the DNS resolution.
For example, if you set -remoteWrite.url=srv+foo and it resolves to three addresses (192.168.1.1, 192.168.1.2, 192.168.1.3), vmagent will choose only one of them randomly every time it (re-)creates the connection. In contrast, specifying the addresses manually (-remoteWrite.url=192.168.1.1 -remoteWrite.url=192.168.1.2 -remoteWrite.url=192.168.1.3) will shard samples across all three URLs.
Sometimes it may be needed to use only a particular set of labels for sharding. For example, it may be necessary to route all the metrics with the same instance label
to the same -remoteWrite.url. In this case you can specify comma-separated list of these labels in the -remoteWrite.shardByURL.labels
command-line flag. For example, -remoteWrite.shardByURL.labels=instance,__name__ would shard metrics with the same name and instance
label to the same -remoteWrite.url.
Sometimes, it may be necessary to ignore some labels when sharding samples across multiple -remoteWrite.url backends.
For example, this is the case when all the raw samples with the same set of labels,
except for the instance and pod labels, must be routed to the same backend. In this case the list of ignored labels must be passed to the
-remoteWrite.shardByURL.ignoreLabels command-line flag: -remoteWrite.shardByURL.ignoreLabels=instance,pod.
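For example, the following sketch (hostnames are placeholders) shards outgoing series among three backends by the instance and __name__ labels, while replicating every sample to two of them:

# Shard by instance and metric name; each sample goes to 2 of the 3 backends.
/path/to/vmagent \
  -remoteWrite.url=http://storage-1:8428/api/v1/write \
  -remoteWrite.url=http://storage-2:8428/api/v1/write \
  -remoteWrite.url=http://storage-3:8428/api/v1/write \
  -remoteWrite.shardByURL=true \
  -remoteWrite.shardByURL.labels=instance,__name__ \
  -remoteWrite.shardByURLReplicas=2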
See also how to scrape large number of targets.
Splitting data streams among multiple systems
vmagent supports splitting the collected data between multiple destinations with the help of -remoteWrite.urlRelabelConfig,
which is applied independently for each configured -remoteWrite.url destination. For example, it is possible to replicate or split
data among long-term remote storage, short-term remote storage and a real-time analytical system built on top of Kafka.
Note that each destination can receive its own subset of the collected data due to per-destination relabeling via -remoteWrite.urlRelabelConfig.
For example, let's assume that all the metrics scraped or received by vmagent have the label env with the value of dev or prod.
To route metrics with the label env=dev to dev and metrics with the label env=prod to prod apply the following config:
- Create a relabeling config file relabelDev.yml to drop all metrics that don't have label env=dev:
- action: keep
source_labels: [env]
regex: "dev"
- Create a relabeling config file relabelProd.yml to drop all metrics that don't have label env=prod:
- action: keep
source_labels: [env]
regex: "prod"
- Configure vmagent with 2 -remoteWrite.url flags pointing to destinations dev and prod with corresponding -remoteWrite.urlRelabelConfig config files:
./vmagent \
-remoteWrite.url=http://<dev-url> -remoteWrite.urlRelabelConfig=relabelDev.yml \
-remoteWrite.url=http://<prod-url> -remoteWrite.urlRelabelConfig=relabelProd.yml
With this configuration vmagent will only forward metrics with the label env=dev to http://<dev-url> and
metrics with the label env=prod to http://<prod-url>.
Please note, the order of flags is important: the first -remoteWrite.urlRelabelConfig will be applied to the
first -remoteWrite.url, and so on.
Prometheus remote_write proxy
vmagent can be used as a proxy for Prometheus data sent via Prometheus remote_write protocol. It can accept data via the remote_write API
at the /api/v1/write endpoint. It then applies relabeling and filtering and proxies the data to another remote_write system.
vmagent can also be configured to accept the incoming remote_write requests over an encrypted (TLS) connection with -tls* command-line flags.
Basic Auth can be enabled for the incoming remote_write requests with -httpAuth.* command-line flags.
remote_write for clustered version
While vmagent can accept data in several supported protocols (OpenTSDB, Influx, Prometheus, Graphite) and scrape data from various targets,
writes are always performed in Prometheus remote_write protocol. Therefore, for the clustered version,
the -remoteWrite.url command-line flag should be configured as <schema>://<vminsert-host>:8480/insert/<accountID>/prometheus/api/v1/write
according to these docs.
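For example, the following sketch (the vminsert host is a placeholder, and tenant 0 is illustrative) writes the collected data to a single tenant of a VictoriaMetrics cluster:

/path/to/vmagent -remoteWrite.url=http://vminsert-host:8480/insert/0/prometheus/api/v1/write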
There is also support for multitenant writes. See these docs.
Flexible deduplication
Deduplication at stream aggregation allows setting up arbitrary complex de-duplication schemes for the collected samples. Examples:
- The following command instructs vmagent to send only the last sample for each time series every 60 seconds:

  ./vmagent -remoteWrite.url=http://remote-storage/api/v1/write -streamAggr.dedupInterval=60s

- The following command instructs vmagent to merge time series with different replica label values and then to send only the last sample for each merged series every 60 seconds:

  ./vmagent -remoteWrite.url=http://remote-storage/api/v1/write -streamAggr.dropInputLabels=replica -streamAggr.dedupInterval=60s
Life of a sample
vmagent supports limiting, relabeling, deduplication and stream aggregation for all metric samples, scraped or pushed.
The received data is then forwarded to the specified -remoteWrite.url destinations. The pipeline is as follows:
%%{init: { "themeCSS": ".nodeLabel, .edgeLabel { white-space: nowrap; word-break: normal; overflow-wrap: normal; }" }}%%
flowchart TB
A[Pushed or Scraped samples] --> B[Ingestion rate limiting<br><b>-maxIngestionRate</b>]
B --> C[Global <a href="https://docs.victoriametrics.com/victoriametrics/relabeling/">relabeling</a><br><b>-remoteWrite.relabelConfig</b>]
C --> D[complexity limiting<br><b>-maxLabelsPerTimeseries</b><br><b>-maxLabelNameLen</b><br><b>-maxLabelValueLen</b>]
D --> E[<a href="https://docs.victoriametrics.com/victoriametrics/vmagent/#cardinality-limiter">cardinality limiting</a><br><b>-remoteWrite.maxHourlySeries</b><br><b>-remoteWrite.maxDailySeries</b>]
E --> F[Global <a href="https://docs.victoriametrics.com/victoriametrics/stream-aggregation">aggregation</a><br><b>-streamAggr.config</b><br><b>-streamAggr.dedupInterval</b>]
F --> G[<a href="https://docs.victoriametrics.com/victoriametrics/vmagent/#replication-and-high-availability">replicate</a> to each <b>-remoteWrite.url</b><br/>or <a href="https://docs.victoriametrics.com/victoriametrics/vmagent/#sharding-among-remote-storages">shard</a> if <b>-remoteWrite.shardByURL</b> is set]
%% Left branch
G --> H1[per-url <a href="https://docs.victoriametrics.com/victoriametrics/relabeling/">relabeling</a><br><b>-remoteWrite.urlRelabelConfig</b>]
H1 --> H2[per-url <a href="https://docs.victoriametrics.com/victoriametrics/stream-aggregation">aggregation</a><br><b>-remoteWrite.streamAggr.config</b><br><b>-remoteWrite.streamAggr.dedupInterval</b>]
H2 --> H3["per-url <a href="https://docs.victoriametrics.com/victoriametrics/vmagent/#calculating-disk-space-for-persistence-queue">queue</a> (default: enabled)<br><b>-remoteWrite.disableOnDiskQueue</b>"]
H3 --> H4[<a href="https://docs.victoriametrics.com/victoriametrics/vmagent/#adding-labels-to-metrics">add extra labels</a><br><b>-remoteWrite.label</b>]
H4 --> H5[[push to <b>-remoteWrite.url</b>]]
%% Right branch
G --> R1[per-url <a href="https://docs.victoriametrics.com/victoriametrics/relabeling/">relabeling</a><br><b>-remoteWrite.urlRelabelConfig</b>]
R1 --> R2[per-url <a href="https://docs.victoriametrics.com/victoriametrics/stream-aggregation">aggregation</a><br><b>-remoteWrite.streamAggr.config</b><br><b>-remoteWrite.streamAggr.dedupInterval</b>]
R2 --> R3["per-url <a href="https://docs.victoriametrics.com/victoriametrics/vmagent/#calculating-disk-space-for-persistence-queue">queue</a> (default: enabled)<br><b>-remoteWrite.disableOnDiskQueue</b>"]
R3 --> R4[<a href="https://docs.victoriametrics.com/victoriametrics/vmagent/#adding-labels-to-metrics">add extra labels</a><br><b>-remoteWrite.label</b>]
R4 --> R5[[push to <b>-remoteWrite.url</b>]]
Scraping has additional settings that can be applied before samples are pushed to the processing pipeline above:
- Service Discovery relabeling
- Scraping relabeling
- sample_limit, series_limit and label_limit in scrape_configs.
How to push data to vmagent
vmagent supports the same set of push-based data ingestion protocols as VictoriaMetrics does
in addition to the pull-based Prometheus-compatible targets' scraping:
- DataDog "submit metrics" API. See these docs.
- InfluxDB line protocol via http://<vmagent>:8429/write. See these docs.
- Graphite plaintext protocol if -graphiteListenAddr command-line flag is set. See these docs.
- OpenTelemetry http API. See these docs.
- NewRelic API. See these docs.
- OpenTSDB telnet and http protocols if -opentsdbListenAddr command-line flag is set. See these docs.
- Zabbix Connector streaming protocol. See these docs.
- Prometheus remote write protocol via http://<vmagent>:8429/api/v1/write.
- JSON lines import protocol via http://<vmagent>:8429/api/v1/import. See these docs.
- Native data import protocol via http://<vmagent>:8429/api/v1/import/native. See these docs.
- Prometheus exposition format via http://<vmagent>:8429/api/v1/import/prometheus. See these docs for details.
- Arbitrary CSV data via http://<vmagent>:8429/api/v1/import/csv. See these docs.
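For example, a single sample in Prometheus exposition format can be pushed to vmagent with curl (the host and metric name below are placeholders):

curl -d 'foo_metric{job="push-example"} 42' http://vmagent:8429/api/v1/import/prometheus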
How to collect metrics in Prometheus format
Specify the path to prometheus.yml file via -promscrape.config command-line flag. vmagent takes into account the following
sections from Prometheus config file:
- global
- scrape_configs
All other sections are ignored, including the remote_write section.
Use -remoteWrite.* command-line flag instead for configuring remote write settings. See the list of unsupported config sections.
The file pointed to by -promscrape.config may contain %{ENV_VAR} placeholders that are substituted by the corresponding ENV_VAR environment variable values.
scrape_config enhancements
vmagent supports the following additional options in scrape_configs section:
- headers - a list of HTTP headers to send to a scrape target with each scrape request. This can be used when the scrape target needs custom authorization and authentication. For example:
scrape_configs:
- job_name: custom_headers
headers:
- "TenantID: abc"
- "My-Auth: TopSecret"
- disable_compression: true for disabling response compression on a per-job basis. By default, vmagent requests compressed responses from scrape targets for saving network bandwidth.
- disable_keepalive: true for disabling HTTP keep-alive connections on a per-job basis. By default, vmagent uses keep-alive connections to scrape targets for reducing overhead on connection re-establishing.
- series_limit: N for limiting the number of unique time series a single scrape target can expose. See these docs.
- stream_parse: true for scraping targets in a streaming manner. This may be useful when targets export large number of metrics. See these docs.
- scrape_align_interval: duration for aligning scrapes to the given interval instead of using random offset in the range [0 ... scrape_interval] for scraping each target. The random offset helps to spread scrapes evenly in time.
- scrape_offset: duration for specifying the exact offset for scraping instead of using random offset in the range [0 ... scrape_interval].
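A sketch combining several of these options in a single scrape job (the job name, target address and limit values are illustrative):

scrape_configs:
  - job_name: tuned_job
    # trade network bandwidth for lower CPU usage on the target
    disable_compression: true
    disable_keepalive: true
    # drop samples for series beyond the first 10000 unique series
    series_limit: 10000
    # align scrapes to the start of every minute
    scrape_align_interval: 1m
    static_configs:
      - targets: ["node-exporter:9100"]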
See scrape_configs docs for more details on all the supported options.
Loading scrape configs from multiple files
vmagent supports loading scrape configs from multiple files specified
in the scrape_config_files section of -promscrape.config file. For example, the following -promscrape.config instructs vmagent
to load scrape configs from all the *.yml files under the configs directory, from the local file single_scrape_config.yml
and from the URL https://config-server/scrape_config.yml:
scrape_config_files:
- configs/*.yml
- single_scrape_config.yml
- https://config-server/scrape_config.yml
Every referred file can contain an arbitrary number of supported scrape configs.
There is no need to specify a top-level scrape_configs section in these files. For example:
- job_name: foo
static_configs:
- targets: ["vmagent:8429"]
- job_name: bar
kubernetes_sd_configs:
- role: pod
vmagent is able to dynamically reload these files - see these docs.
Unsupported Prometheus config sections
vmagent doesn't support the following sections in Prometheus config file passed to -promscrape.config command-line flag:
- remote_write. This section is substituted with various -remoteWrite* command-line flags. See the full list of flags. The remote_write section isn't supported in order to reduce possible confusion when vmagent is used for accepting incoming metrics via supported push protocols. In this case the -promscrape.config file isn't needed.
- remote_read. This section isn't supported at all, since vmagent doesn't provide Prometheus querying API. It is expected that the querying API is provided by the remote storage specified via -remoteWrite.url such as VictoriaMetrics. See Prometheus querying API docs for VictoriaMetrics.
- rule_files and alerting. These sections are supported by vmalert.
The list of supported service discovery types is available in how-to-collect-metrics-in-prometheus-format.
Additionally, vmagent doesn't support the refresh_interval option at service discovery sections.
This option is substituted with -promscrape.*CheckInterval command-line flags, which are specific to each service discovery type.
See the full list of command-line flags for vmagent.
Configuration update
vmagent should be restarted in order to update config options set via command-line args.
vmagent supports multiple approaches for reloading configs from updated config files such as
-promscrape.config, -remoteWrite.relabelConfig, -remoteWrite.urlRelabelConfig, -streamAggr.config
and -remoteWrite.streamAggr.config:
- Sending SIGHUP signal to vmagent process:

  kill -SIGHUP `pidof vmagent`

- Sending HTTP request to http://vmagent:8429/-/reload endpoint. This endpoint can be protected with -reloadAuthKey command-line flag.
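For example, assuming vmagent listens on the default port, a config reload can be triggered with:

# if -reloadAuthKey is set, the key must be passed with the request
curl http://vmagent:8429/-/reload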
There is also -promscrape.configCheckInterval command-line flag, that can be used for automatic reloading configs from updated -promscrape.config file.
SRV URLs
If vmagent encounters URLs with srv+ prefix in hostname (such as http://srv+some-addr/some/path), then it resolves the some-addr DNS SRV
record into a TCP address with hostname and TCP port, and then uses the resulting URL when connecting to it.
SRV URLs are supported in the following places:
- In -remoteWrite.url command-line flag. For example, if victoria-metrics DNS SRV record contains victoria-metrics-host:8428 TCP address, then -remoteWrite.url=http://srv+victoria-metrics/api/v1/write is automatically resolved into -remoteWrite.url=http://victoria-metrics-host:8428/api/v1/write. If the DNS SRV record is resolved into multiple TCP addresses, then vmagent uses a randomly chosen address per each connection it establishes to the remote storage.
- In scrape target addresses aka __address__ label - see these docs for details.
- In urls used for service discovery.

SRV urls are useful when HTTP services run on different TCP ports or when they can change TCP ports over time (for instance, after a restart).
VictoriaMetrics remote write protocol
vmagent supports sending data to the configured -remoteWrite.url either via Prometheus remote write protocol
or via VictoriaMetrics remote write protocol.
VictoriaMetrics remote write protocol provides the following benefits comparing to Prometheus remote write protocol:
- Reduced network bandwidth usage by 2x-5x. This allows saving network bandwidth usage costs when vmagent and the configured remote storage systems are located in different datacenters, availability zones or regions.
- Reduced disk read/write IO and disk space usage at vmagent when the remote storage is temporarily unavailable. In this case vmagent buffers the incoming data to disk using the VictoriaMetrics remote write format. This reduces disk read/write IO and disk space usage by 2x-5x compared to Prometheus remote write format.
See blogpost Save network costs with VictoriaMetrics remote write protocol.
vmagent uses VictoriaMetrics remote write protocol by default {{% available_from "v1.116.0" %}} when it sends data to VictoriaMetrics components such as other vmagent instances,
single-node VictoriaMetrics
or vminsert at cluster version. If needed, it automatically downgrades to the Prometheus remote write protocol at runtime.
It is possible to force switch to VictoriaMetrics remote write protocol by specifying -remoteWrite.forceVMProto
command-line flag for the corresponding -remoteWrite.url.
It is possible to tune the compression level for VictoriaMetrics remote write protocol with -remoteWrite.vmProtoCompressLevel command-line flag.
Bigger values reduce network usage at the cost of higher CPU usage. Negative values reduce CPU usage at the cost of higher network usage.
The default value for the compression level is 0, the minimum value is -22 and the maximum value is 22. The default value works optimally
in most cases, so changing it isn't recommended.
vmagent automatically switches to Prometheus remote write protocol when it sends data to old versions of VictoriaMetrics components
or to other Prometheus-compatible remote storage systems. It is possible to force switch to Prometheus remote write protocol
by specifying -remoteWrite.forcePromProto command-line flag for the corresponding -remoteWrite.url.
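For example, the following command (the storage host is a placeholder) forces the VictoriaMetrics remote write protocol for the given URL:

/path/to/vmagent -remoteWrite.url=http://victoria-metrics:8428/api/v1/write -remoteWrite.forceVMProto=true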
Multitenancy
By default vmagent collects the data without tenant identifiers
and routes it to the remote storage specified via -remoteWrite.url command-line flag. The -remoteWrite.url can point to /insert/<tenant_id>/prometheus/api/v1/write path
at vminsert according to these docs.
flowchart LR
A["requests_total{instance=foo}"] --> |/api/v1/write| V[vmagent]
B["requests_total{instance=bar}"] <--> |scrape| V
V --> |"/insert/#60;tenant_id#62;/#60;suffix#62;"| C[vminsert]
In this case, all the metrics written to /insert/<tenant_id>/prometheus/api/v1/write will belong to the specified <tenant_id> tenant.
Multitenancy via labels
vmagent can write data to multiple distinct tenants if -remoteWrite.url points to multitenant url at VictoriaMetrics cluster
and tenant is specified via multitenancy labels:
flowchart LR
A["requests_total{instance=foo, vm_account_id=0}"] --> |/api/v1/write| V[vmagent]
B["requests_total{instance=bar, vm_account_id=1}"] <--> |scrape| V
V --> |"/insert/multitenant/#60;suffix#62;"| C[vminsert]
<tenant_id> is extracted from the vm_account_id and vm_project_id labels.
The vm_account_id and vm_project_id labels can be specified via relabeling before sending the metrics to -remoteWrite.url.
For example, the following relabeling rule instructs sending metrics to <account_id>:0 tenant defined in the prometheus.io/account_id annotation of Kubernetes pod deployment:
scrape_configs:
- kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_account_id]
target_label: vm_account_id
vmagent can get the tenant identifier from the __tenant_id__ label at target discovery phase.
It implicitly converts the __tenant_id__ label into vm_account_id and vm_project_id labels and attaches
them to the scraped metrics and metrics metadata.
For example, the following relabeling rule instructs sending metrics to 10:5 tenant defined in the prometheus.io/tenant_id: 10:5 annotation of Kubernetes pod deployment:
scrape_configs:
- kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_tenant_id]
target_label: __tenant_id__
vmagent can enforce adding labels to all scraped or forwarded metrics.
Multitenancy via path
vmagent can write data to multiple distinct tenants if -remoteWrite.url points to multitenant url at VictoriaMetrics cluster,
tenant is specified in the write path and -enableMultitenantHandlers command-line flag is set:
flowchart LR
A["requests_total{instance=foo}"] --> |/insert/0/#60;suffix#62;| V[vmagent]
B["requests_total{instance=bar}"] --> |/insert/1/#60;suffix#62;| V
V --> |"/insert/multitenant/#60;suffix#62;"| C[vminsert]
In this configuration, vmagent accepts writes via the same multitenant endpoints (/insert/<accountID>/<suffix>) as vminsert.
For all received data, vmagent will automatically convert tenant identifiers from the URL to vm_account_id and vm_project_id labels and set tenant info in metadata.
These tenant labels are added before applying relabeling specified via -remoteWrite.relabelConfig
and -remoteWrite.urlRelabelConfig command-line flags.
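A sketch of this setup (the vminsert host is a placeholder):

# accept writes on /insert/<accountID>/<suffix> and forward them to the
# cluster's multitenant endpoint with tenant labels attached
/path/to/vmagent -enableMultitenantHandlers=true \
  -remoteWrite.url=http://vminsert:8480/insert/multitenant/prometheus/api/v1/write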
Multitenancy via headers
vmagent can write data to multiple distinct tenants if -remoteWrite.url points to multitenant url at VictoriaMetrics cluster,
tenant is specified via headers {{% available_from "v1.143.0" %}}, both -enableMultitenantHandlers and -enableMultitenancyViaHeaders command-line flags are set:
flowchart LR
A["requests_total{instance=foo}"] --> |/insert/#60;suffix#62; <br>--header AccountID: 0| V[vmagent]
B["requests_total{instance=bar}"] --> |/insert/#60;suffix#62; <br>--header AccountID: 1| V
V --> |"/insert/multitenant/#60;suffix#62;"| C[vminsert]
In this configuration, vmagent accepts writes via the same simplified multitenant endpoints (/insert/<suffix>) as vminsert.
The tenant information is extracted from the AccountID and ProjectID HTTP headers, which are expected to be attached to all the incoming requests. If the headers are missing, then the default tenant 0:0 is used.
For all received data, vmagent will automatically convert tenant identifiers from the headers to vm_account_id and vm_project_id labels and set tenant info in metadata.
These tenant labels are added before applying relabeling specified via -remoteWrite.relabelConfig
and -remoteWrite.urlRelabelConfig command-line flags.
vmauth can enforce adding headers to all
forwarded requests via headers param in the config file.
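For example, a minimal vmauth config sketch (user credentials and tenant values are placeholders) that attaches tenant headers to all requests forwarded to vmagent:

users:
  - username: tenant-42-writer
    password: secret
    url_prefix: "http://vmagent:8429/"
    # headers attached to every proxied request
    headers:
      - "AccountID: 42"
      - "ProjectID: 0"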
Adding labels to metrics
Extra labels can be added to metrics collected by vmagent via the following mechanisms:
- The global -> external_labels section in -promscrape.config file. These labels are added only to metrics scraped from targets configured in the -promscrape.config file. They aren't added to metrics collected via other data ingestion protocols.
- The -remoteWrite.label command-line flag. These labels are added to all the collected metrics before sending them to all configured -remoteWrite.url. For example, the following command starts vmagent, which adds {datacenter="foobar"} label to all the metrics pushed to all the configured -remoteWrite.url destinations:

  /path/to/vmagent -remoteWrite.label=datacenter=foobar ...

- Via relabeling. Relabeling can be applied globally and per each configured -remoteWrite.url destination. See Relabeling Cookbook.
- Add extra_label GET param to -remoteWrite.url address (only works when sending data to VictoriaMetrics components):

  /path/to/vmagent -remoteWrite.url=http://127.0.0.1:8428/api/v1/write?extra_label="env=prod"
Automatically generated metrics
vmagent automatically generates the following metrics per each scrape of every Prometheus-compatible target
and attaches instance, job and other target-specific labels to these metrics:
- up - this metric exposes 1 value on successful scrape and 0 value on unsuccessful scrape. This allows monitoring failing scrapes with the following MetricsQL query: up == 0
- scrape_duration_seconds - the duration of the scrape for the given target. This allows monitoring slow scrapes. For example, the following MetricsQL query returns scrapes, which take more than 1.5 seconds to complete: scrape_duration_seconds > 1.5
- scrape_timeout_seconds - the configured timeout for the current scrape target (aka scrape_timeout). This allows detecting targets with scrape durations close to the configured scrape timeout. For example, the following MetricsQL query returns targets (identified by instance label), which take more than 80% of the configured scrape_timeout during scrapes: scrape_duration_seconds / scrape_timeout_seconds > 0.8
- scrape_response_size_bytes - response size in bytes for the given target. This allows monitoring the amount of scraped data and adjusting the max_scrape_size option for scraped targets. For example, the following MetricsQL query returns targets with scrape response bigger than 10MiB: scrape_response_size_bytes > 10MiB
- scrape_samples_scraped - the number of samples parsed per each scrape. This allows detecting targets, which expose too many series. For example, the following MetricsQL query returns targets, which expose more than 10000 metrics: scrape_samples_scraped > 10000
- scrape_samples_limit - the configured limit on the number of samples the given target can expose. The limit can be set via sample_limit option at scrape_configs. This metric is exposed only if the sample_limit is set. This allows detecting targets, which expose too many metrics compared to the configured sample_limit. For example, the following query returns targets (identified by instance label), which expose more than 80% metrics compared to the configured sample_limit: scrape_samples_scraped / scrape_samples_limit > 0.8
- scrape_samples_post_metric_relabeling - the number of samples left after applying metric-level relabeling from metric_relabel_configs section (see relabeling docs for more details). This allows detecting targets with too many series after the relabeling. For example, the following MetricsQL query returns targets with more than 10000 metrics after the relabeling: scrape_samples_post_metric_relabeling > 10000
- scrape_labels_limit - the configured limit on the number of labels the given target can expose per sample. The limit can be set via label_limit option at scrape_configs. This metric is exposed only if the label_limit is set.
- scrape_series_added - an approximate number of new series the given target generates during the current scrape. This metric allows detecting targets (identified by instance label), which lead to high churn rate. For example, the following MetricsQL query returns targets, which generate more than 1000 new series during the last hour: sum_over_time(scrape_series_added[1h]) > 1000. vmagent sets scrape_series_added to zero when it runs with -promscrape.noStaleMarkers command-line flag or when it scrapes a target with no_stale_markers: true option, e.g. when staleness markers are disabled. Restarting vmagent can cause scrape_series_added to rise because all time series are new to a newly started vmagent.
- scrape_series_limit - the limit on the number of unique series the given target can expose according to these docs. This metric is exposed only if the series limit is set.
- scrape_series_current - the number of unique series the given target exposed so far. This metric is exposed only if the series limit is set according to these docs. This metric allows alerting when the number of exposed series by the given target reaches the limit. For example, the following query would alert when the target exposes more than 90% of unique series compared to the configured limit: scrape_series_current / scrape_series_limit > 0.9
- scrape_series_limit_samples_dropped - exposes the number of dropped samples during the scrape because of the exceeded limit on the number of unique series. This metric is exposed only if the series limit is set according to these docs. This metric allows alerting when scraped samples are dropped because of the exceeded limit. For example, the following query alerts when at least a single sample is dropped because of the exceeded limit during the last hour: sum_over_time(scrape_series_limit_samples_dropped[1h]) > 0
If the target exports metrics with names clashing with the automatically generated metric names, then vmagent automatically
adds exported_ prefix to these metric names, so they don't clash with automatically generated metric names.
Relabeling defined in relabel_configs or metric_relabel_configs of scrape config isn't applied to automatically
generated metrics. However, these metrics can still be relabeled via -remoteWrite.relabelConfig before being sent to the remote address.
Prometheus staleness markers
vmagent sends Prometheus staleness markers to -remoteWrite.url in the following cases:
- If they are passed to vmagent via Prometheus remote_write protocol.
- If the metric disappears from the list of scraped metrics, then a stale marker is sent for this particular metric.
- If the scrape target becomes temporarily unavailable, then stale markers are sent for all the metrics scraped from this target.
- If the scrape target is removed from the list of targets, then stale markers are sent for all the metrics scraped from this target.
Prometheus staleness markers' tracking needs additional memory, since it must store the previous response body per each scrape target in order to compare it to the current response body. The memory usage may be reduced by disabling staleness tracking in the following ways:
- By passing -promscrape.noStaleMarkers command-line flag to vmagent. This disables staleness tracking across all the targets.
- By specifying no_stale_markers: true option in the scrape_config for the corresponding target.
When staleness tracking is disabled, then vmagent doesn't track the number of new time series per each scrape,
e.g. it sets scrape_series_added metric to zero. See these docs for details.
Metric metadata
vmagent accepts {{% available_from "v1.137.0" %}} metric metadata exposed by scrape targets in Prometheus exposition format, received via Prometheus remote write v1 or OpenTelemetry protocol by default. Set -enableMetadata=false to disable metadata processing {{% available_from "v1.125.1" %}}.
During processing, metadata won't be dropped or modified by relabeling or streaming aggregation.
When -enableMultitenantHandlers is enabled, vmagent adds tenant info to metadata specified in multitenant endpoint (/insert/<accountID>/<suffix>).
However, if the /insert/multitenant/<suffix> endpoint is used, vmagent preserves the tenant information provided in the metadata by the sender. If the sender does not specify tenant information, the default tenant 0:0 is used.
Metadata requires extra memory, disk space, and network traffic.
Use -remoteWrite.disableMetadata{{% available_from "v1.140.0" %}} to fully disable sending metadata from vmagent.
This reduces network traffic and resource usage when metadata is not required.
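For example, the following command (the storage URL is a placeholder) disables sending metadata to the configured remote storage:

/path/to/vmagent -remoteWrite.url=http://victoria-metrics:8428/api/v1/write -remoteWrite.disableMetadata=true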
Stream parsing mode
By default, vmagent parses the full response from the scrape target, applies relabeling
and then pushes the resulting metrics to the configured -remoteWrite.url in one go. This mode works great for the majority of cases
when the scrape target exposes a small number of metrics (e.g. less than 10K). But this mode may consume large amounts of memory
when the scrape target exposes a large number of metrics (for example, when vmagent scrapes kube-state-metrics
in a large Kubernetes cluster). It is recommended to enable stream parsing for such targets.
When stream parsing is enabled, vmagent processes the response from the scrape target in chunks.
This saves memory when scraping targets that expose millions of metrics.
Stream parsing is automatically enabled for scrape targets returning response bodies with sizes larger than
the -promscrape.minResponseSizeForStreamParse command-line flag value. Additionally,
stream parsing can be explicitly enabled in the following places:
- Via -promscrape.streamParse command-line flag. In this case all the scrape targets defined in the file pointed by -promscrape.config are scraped in stream parsing mode.
- Via stream_parse: true option at scrape_configs section. In this case all the scrape targets defined in this section are scraped in stream parsing mode.
- Via __stream_parse__=true label, which can be set via relabeling at relabel_configs section. In this case stream parsing mode is enabled for the corresponding scrape targets. Typical use case: to set the label via Kubernetes annotations for targets exposing large number of metrics.
Examples:
scrape_configs:
- job_name: 'big-federate'
stream_parse: true
static_configs:
- targets:
- big-prometheus1
- big-prometheus2
honor_labels: true
metrics_path: /federate
params:
'match[]': ['{__name__!=""}']
Note that vmagent in stream parsing mode sends up to sample_limit samples to the configured -remoteWrite.url
instead of dropping all the samples read from the target, because the parsed data is sent to the remote storage
as soon as it is parsed in stream parsing mode.
Scraping big number of targets
A single vmagent instance can scrape tens of thousands of scrape targets. Sometimes this isn't enough due to limitations on CPU, network, RAM, etc.
In this case scrape targets can be split among multiple vmagent instances (aka vmagent horizontal scaling, sharding and clustering).
The number of vmagent instances in the cluster must be passed to -promscrape.cluster.membersCount command-line flag.
Each vmagent instance in the cluster must use identical -promscrape.config files with distinct -promscrape.cluster.memberNum values
in the range 0 ... N-1, where N is the number of vmagent instances in the cluster specified via -promscrape.cluster.membersCount.
For example, the following commands spread scrape targets among a cluster of two vmagent instances:
/path/to/vmagent -promscrape.cluster.membersCount=2 -promscrape.cluster.memberNum=0 -promscrape.config=/path/to/config.yml ...
/path/to/vmagent -promscrape.cluster.membersCount=2 -promscrape.cluster.memberNum=1 -promscrape.config=/path/to/config.yml ...
The -promscrape.cluster.memberNum can be set to a StatefulSet pod name when vmagent runs in Kubernetes.
The pod name must end with a number in the range 0 ... -promscrape.cluster.membersCount-1. For example, -promscrape.cluster.memberNum=vmagent-0.
By default, each scrape target is scraped only by a single vmagent instance in the cluster. If there is a need for replicating scrape targets among multiple vmagent instances,
then -promscrape.cluster.replicationFactor command-line flag must be set to the desired number of replicas. For example, the following commands
start a cluster of three vmagent instances, where each target is scraped by two vmagent instances:
/path/to/vmagent -promscrape.cluster.membersCount=3 -promscrape.cluster.replicationFactor=2 -promscrape.cluster.memberNum=0 -promscrape.config=/path/to/config.yml ...
/path/to/vmagent -promscrape.cluster.membersCount=3 -promscrape.cluster.replicationFactor=2 -promscrape.cluster.memberNum=1 -promscrape.config=/path/to/config.yml ...
/path/to/vmagent -promscrape.cluster.membersCount=3 -promscrape.cluster.replicationFactor=2 -promscrape.cluster.memberNum=2 -promscrape.config=/path/to/config.yml ...
Every vmagent in the cluster exposes all the discovered targets at http://vmagent:8429/service-discovery page.
Each discovered target on this page contains its status (UP, DOWN or DROPPED with the reason why the target has been dropped).
If the target is dropped because of sharding to other vmagent instances in the cluster, then the status column contains
-promscrape.cluster.memberNum values for vmagent instances where the given target is scraped.
The /service-discovery page provides links to the corresponding vmagent instances if -promscrape.cluster.memberURLTemplate command-line flag is set.
Every occurrence of %d inside the -promscrape.cluster.memberURLTemplate is substituted with the -promscrape.cluster.memberNum
for the corresponding vmagent instance. For example, -promscrape.cluster.memberURLTemplate='http://vmagent-instance-%d:8429/targets'
generates http://vmagent-instance-42:8429/targets url for vmagent instance, which runs with -promscrape.cluster.memberNum=42.
Note that vmagent shows up to -promscrape.maxDroppedTargets dropped targets on the /service-discovery page.
Increase the -promscrape.maxDroppedTargets command-line flag value if the /service-discovery page misses some dropped targets.
If each target is scraped by multiple vmagent instances, then data deduplication must be enabled at remote storage pointed by -remoteWrite.url.
The -dedup.minScrapeInterval must be set to the scrape_interval configured at -promscrape.config.
See these docs for details.
The -promscrape.cluster.memberLabel command-line flag allows specifying a name for member num label to add to all the scraped metrics.
The value of the member num label is set to -promscrape.cluster.memberNum. For example, the following config instructs adding vmagent_instance="0" label
to all the metrics scraped by the given vmagent instance:
/path/to/vmagent -promscrape.cluster.membersCount=2 -promscrape.cluster.memberNum=0 -promscrape.cluster.memberLabel=vmagent_instance
See also how to shard data among multiple remote storage systems.
High availability
It is possible to run multiple identically configured vmagent instances or vmagent
clusters, so they scrape
the same set of targets and push the collected data to the same set of VictoriaMetrics remote storage systems.
Two identically configured vmagent instances or clusters are usually called an HA pair.
When running HA pairs, deduplication must be configured on the VictoriaMetrics side in order to de-duplicate received samples. See these docs for details.
It is also recommended to pass different values to -promscrape.cluster.name command-line flag per vmagent
instance or per vmagent cluster in a HA setup. This is needed for proper data de-duplication.
See this issue for details.
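A sketch of such an HA pair (the config path, cluster names and storage URL are placeholders); both instances scrape the same targets, while deduplication at the remote storage removes the duplicate samples:

/path/to/vmagent -promscrape.config=/path/to/config.yml -promscrape.cluster.name=ha-a -remoteWrite.url=http://victoria-metrics:8428/api/v1/write
/path/to/vmagent -promscrape.config=/path/to/config.yml -promscrape.cluster.name=ha-b -remoteWrite.url=http://victoria-metrics:8428/api/v1/write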
Scraping targets via a proxy
vmagent supports scraping targets via http, https and socks5 proxies. Proxy address must be specified in proxy_url option. For example, the following scrape config instructs
target scraping via https proxy at https://proxy-addr:1234:
scrape_configs:
- job_name: foo
proxy_url: https://proxy-addr:1234
Proxy can be configured with the following optional settings:
- proxy_authorization for generic token authorization. See these docs.
- proxy_basic_auth for Basic authorization. See these docs.
- proxy_bearer_token and proxy_bearer_token_file for Bearer token authorization.
- proxy_oauth2 for OAuth2 config. See these docs.
- proxy_tls_config for TLS config. See these docs.
- proxy_headers for passing additional HTTP headers in requests to proxy.
For example:
scrape_configs:
- job_name: foo
proxy_url: https://proxy-addr:1234
proxy_basic_auth:
username: foobar
password: secret
proxy_tls_config:
insecure_skip_verify: true
cert_file: /path/to/cert
key_file: /path/to/key
ca_file: /path/to/ca
server_name: real-server-name
proxy_headers:
- "Proxy-Auth: top-secret"
On-disk persistence
vmagent stores pending data that cannot be sent to the configured remote storage systems in a timely manner.
By default, vmagent writes all the pending data to the folder configured via -remoteWrite.tmpDataPath cmd-line flag
until this data is sent to the configured -remoteWrite.url systems or until the folder becomes full.
The maximum data size that can be saved to -remoteWrite.tmpDataPath per every configured -remoteWrite.url can be
limited via -remoteWrite.maxDiskUsagePerURL command-line flag. When this limit is reached, vmagent drops the oldest
data from disk in order to save newly ingested data.
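For example, the following sketch (the URL and path are placeholders) caps the on-disk buffer at 10GiB for the configured remote storage:

/path/to/vmagent -remoteWrite.url=http://victoria-metrics:8428/api/v1/write \
  -remoteWrite.tmpDataPath=/var/lib/vmagent-remotewrite-data \
  -remoteWrite.maxDiskUsagePerURL=10GiB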
The folder structure of persisted data is as follows:
<remoteWrite.tmpDataPath>
└── persistent-queue
└── 1_B9EB7BE220B91E9D
Each remote write URL corresponds to a folder similar to 1_B9EB7BE220B91E9D.
It's generated based on the following information:
- The sequence order of the remote write URL cmd-line flags, starting from 1.
- The hash result of the remote write URL itself, excluding query parameters and fragments.
For example, for the remote write configs:
-remoteWrite.url=http://example-1:8428/prometheus/api/v1/write?foo=bar#baz
-remoteWrite.url=http://user:pass@example-2:8428/prometheus/api/v1/write?qux=quux#quuz
vmagent will generate the following persistent queue folders:
# 1_<hash(http://example-1:8428/prometheus/api/v1/write)>, query parameters foo=bar and fragment baz are removed.
1_BA6E4303DCFA0D45
# 2_<hash(http://user:pass@example-2:8428/prometheus/api/v1/write)>, query parameters qux=quux and fragment quuz are removed.
2_0AAFDF53E314A72A
Disabling On-disk persistence
There are cases when it is better to disable on-disk persistence for pending data on the vmagent side:
- When the persistent disk performance isn't enough for the given data processing rate.
- When it is better to buffer pending data at the client side instead of buffering it at vmagent side in the -remoteWrite.tmpDataPath folder.
- When the data is already buffered at Kafka or Google PubSub.
- When it is better to drop pending data instead of buffering it.
In this case -remoteWrite.disableOnDiskQueue command-line flag can be passed to vmagent per each configured -remoteWrite.url.
vmagent works in the following way if the corresponding remote storage system at -remoteWrite.url cannot keep up with the data ingestion rate
and the -remoteWrite.disableOnDiskQueue command-line flag is set:
- It returns 429 Too Many Requests HTTP error to clients, which send data to vmagent via supported HTTP endpoints. If -remoteWrite.dropSamplesOnOverload command-line flag is set or if multiple -remoteWrite.url command-line flags are set, then the ingested samples are silently dropped instead of returning the error to clients.
- It suspends consuming data from Kafka or Google PubSub until the remote storage becomes available. If -remoteWrite.dropSamplesOnOverload command-line flag is set or if multiple -remoteWrite.disableOnDiskQueue command-line flags are set for different -remoteWrite.url options, then the fetched samples are silently dropped instead of suspending data consumption from Kafka or Google PubSub.
- It drops samples pushed to vmagent via non-HTTP protocols and logs the error. Pass -remoteWrite.dropSamplesOnOverload command-line flag in order to suppress error messages in this case.
- It drops samples scraped from Prometheus-compatible targets, because it is better from operations perspective to drop samples instead of blocking the scrape process.
- It drops stream aggregation output samples, because it is better from operations perspective to drop output samples instead of blocking the stream aggregation process.
The number of dropped samples because of overloaded remote storage can be monitored via vmagent_remotewrite_samples_dropped_total metric.
The number of unsuccessful attempts to send data to overloaded remote storage can be monitored via vmagent_remotewrite_push_failures_total metric.
Running vmagent on hosts with more RAM or increasing the value for -memory.allowedPercent may reduce the number of unsuccessful attempts or dropped samples
on spiky workloads, since vmagent may buffer more data in memory before returning the error or dropping data.
vmagent still may write pending in-memory data to -remoteWrite.tmpDataPath on graceful shutdown
if -remoteWrite.disableOnDiskQueue command-line flag is specified. It may also read buffered data from -remoteWrite.tmpDataPath
on startup.
When -remoteWrite.disableOnDiskQueue command-line flag is set, vmagent may send the same samples multiple times to the configured remote storage
if it cannot keep up with the data ingestion rate. In this case deduplication
must be enabled on all the configured remote storage systems.
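A sketch combining these flags (the storage URL is a placeholder):

# buffer only in memory; drop samples instead of erroring out on overload
/path/to/vmagent -remoteWrite.url=http://victoria-metrics:8428/api/v1/write \
  -remoteWrite.disableOnDiskQueue=true -remoteWrite.dropSamplesOnOverload=true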
Cardinality limiter
By default, vmagent doesn't limit the number of time series each scrape target can expose.
The limit can be enforced in the following places:
- Via -promscrape.seriesLimitPerTarget command-line flag. This limit is applied individually to all the scrape targets defined in the file pointed by -promscrape.config.
- Via series_limit config option at scrape_config section. The series_limit allows overriding the -promscrape.seriesLimitPerTarget on a per-scrape_config basis. If series_limit is set to 0 or to a negative value, then it isn't applied to the given scrape_config, even if -promscrape.seriesLimitPerTarget command-line flag is set.
- Via __series_limit__ label, which can be set with relabeling at relabel_configs section. The __series_limit__ allows overriding the series_limit on a per-target basis. If __series_limit__ is set to 0 or to a negative value, then it isn't applied to the given target. Typical use case: to set the limit via Kubernetes annotations for targets, which may expose too high number of time series.
Scraped metrics for time series exceeding the given limit are dropped; the limit is tracked over a 24h time window.
vmagent creates the following additional per-target metrics for targets with non-zero series limit:
- scrape_series_limit_samples_dropped - the number of dropped samples during the scrape when the unique series limit is exceeded.
- scrape_series_limit - the series limit for the given target.
- scrape_series_current - the current number of series for the given target.
These metrics are automatically sent to the configured -remoteWrite.url alongside with the scraped per-target metrics.
These metrics allow building the following alerting rules:
- scrape_series_current / scrape_series_limit > 0.9 - alerts when the number of series exposed by the target reaches 90% of the limit.
- sum_over_time(scrape_series_limit_samples_dropped[1h]) > 0 - alerts when some samples are dropped because the series limit on a particular target is reached.
See also sample_limit option at scrape_config section.
By default, vmagent doesn't limit the number of time series written to remote storage systems specified at -remoteWrite.url.
The limit can be enforced by setting the following command-line flags:
- -remoteWrite.maxHourlySeries - limits the number of unique time series vmagent can write to remote storage systems during the last hour. Useful for limiting the number of active time series.
- -remoteWrite.maxDailySeries - limits the number of unique time series vmagent can write to remote storage systems during the last day. Useful for limiting daily churn rate.
It is possible to use -1 as a value for these flags {{% available_from "v1.140.0" %}} in order to enable series tracking while setting the limit to the maximum possible value.
This is useful for estimating the number of unique series written to remote storage systems without enforcing limits.
Both limits can be set simultaneously. If any of these limits is reached, then samples for new time series are dropped instead of sending
them to remote storage systems. A sample of dropped series is put in the log with WARNING level.
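For example, the following sketch (the storage URL and limit values are illustrative) enforces both limits:

/path/to/vmagent -remoteWrite.url=http://victoria-metrics:8428/api/v1/write \
  -remoteWrite.maxHourlySeries=1000000 -remoteWrite.maxDailySeries=5000000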
vmagent exposes the following metrics at http://vmagent:8429/metrics page (see monitoring docs for details):
- vmagent_hourly_series_limit_rows_dropped_total - the number of metrics dropped due to exceeded hourly limit on the number of unique time series.
- vmagent_hourly_series_limit_max_series - the hourly series limit set via -remoteWrite.maxHourlySeries.
- vmagent_hourly_series_limit_current_series - the current number of unique series registered during the last hour.
- vmagent_daily_series_limit_rows_dropped_total - the number of metrics dropped due to exceeded daily limit on the number of unique time series.
- vmagent_daily_series_limit_max_series - the daily series limit set via -remoteWrite.maxDailySeries.
- vmagent_daily_series_limit_current_series - the current number of unique series registered during the last day.
These limits are approximate, so vmagent can underflow/overflow the limit by a small percentage (usually less than 1%).
See also cardinality explorer docs.
Monitoring
vmagent exports various metrics in Prometheus exposition format at http://vmagent-host:8429/metrics page.
We recommend setting up regular scraping of this page either through vmagent itself or by a Prometheus-compatible scraper,
so that the exported metrics may be analyzed later.
If you use Google Cloud Managed Prometheus for scraping metrics from VictoriaMetrics components, then pass -metrics.exposeMetadata
command-line flag to them, so they add TYPE and HELP comments per each exposed metric at /metrics page.
See these docs for details.
Use official Grafana dashboard for vmagent state overview.
Graphs on this dashboard contain useful hints - hover over the `i` icon in the top left corner of each graph in order to read it.
If you have suggestions for improvements or have found a bug - please open an issue on github
or add a review to the dashboard.
vmagent also exports the status for various targets at the following pages:
- `http://vmagent-host:8429/targets`. This page shows the current status for every active target.
- `http://vmagent-host:8429/service-discovery`. This page shows the list of discovered targets with the discovered `__meta_*` labels according to these docs. This page may help with debugging target relabeling.
- `http://vmagent-host:8429/api/v1/targets`. This handler returns a JSON response compatible with the corresponding page from the Prometheus API.
- `http://vmagent-host:8429/ready`. This handler returns HTTP 200 status code when `vmagent` finishes its initialization for all the service_discovery configs. It may be useful for performing a `vmagent` rolling update without any scrape loss.
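Since the `/api/v1/targets` response follows the Prometheus targets API shape, it can be inspected from the command line; for example (assuming `jq` is available):

```sh
curl -s http://vmagent-host:8429/api/v1/targets \
  | jq '.data.activeTargets[] | {scrapeUrl, health, lastError}'
```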
Troubleshooting
- It is recommended to set up the official Grafana dashboard in order to monitor the state of `vmagent`.
- It is recommended to increase the maximum number of open files in the system (`ulimit -n`) when scraping a large number of targets, as `vmagent` establishes at least one TCP connection per target.
- If `vmagent` uses too much RAM or CPU, then follow these recommendations.
- When `vmagent` scrapes many unreliable targets, it can flood the error log with scrape errors. It is recommended to investigate and fix these errors. If it is unfeasible to fix all the reported errors, then they can be suppressed by passing the `-promscrape.suppressScrapeErrors` command-line flag to `vmagent`. The most recent scrape error per each target can be observed at `http://vmagent-host:8429/targets` and `http://vmagent-host:8429/api/v1/targets`.
- The `http://vmagent-host:8429/config` page shows the currently active `-promscrape.config` configuration. Access to the endpoint can be protected via the `-configAuthKey` command-line flag.
- The pages `http://vmagent-host:8429/remotewrite-relabel-config` and `http://vmagent-host:8429/remotewrite-url-relabel-config` {{% available_from "v1.129.0" %}} show the currently active `-remoteWrite.relabelConfig` and `-remoteWrite.urlRelabelConfig` configuration correspondingly. Access to these endpoints can be protected via the `-configAuthKey` command-line flag.
- The `http://vmagent-host:8429/service-discovery` page could be useful for debugging the relabeling process for scrape targets. This page contains original labels for targets dropped during relabeling. By default, up to `-promscrape.maxDroppedTargets` targets are shown here. If your setup drops more targets during relabeling, then increase the `-promscrape.maxDroppedTargets` command-line flag value to see all the dropped targets. Note that tracking each dropped target requires up to 10Kb of RAM. Therefore, big values for `-promscrape.maxDroppedTargets` may result in increased memory usage if a large number of scrape targets are dropped during relabeling.
- It is recommended to increase `-remoteWrite.queues` if the `vmagent_remotewrite_pending_data_bytes` metric grows constantly. It is also recommended to increase the `-remoteWrite.maxBlockSize` and `-remoteWrite.maxRowsPerBlock` command-line flags in this case. This can improve data ingestion performance to the configured remote storage systems at the cost of higher memory usage. See the example command after this list.
- If you see gaps in the data pushed by `vmagent` to remote storage when `-remoteWrite.maxDiskUsagePerURL` is set, try increasing `-remoteWrite.queues`. Such gaps may appear because `vmagent` cannot keep up with sending the collected data to remote storage. Therefore, it starts dropping the buffered data if the on-disk buffer size exceeds `-remoteWrite.maxDiskUsagePerURL`.
- `vmagent` drops data blocks if remote storage replies with `400 Bad Request` or `409 Conflict` HTTP responses. The number of dropped blocks can be monitored via the `vmagent_remotewrite_packets_dropped_total` metric exported at the /metrics page.
- Use `-remoteWrite.queues=1` when `-remoteWrite.url` points to remote storage, which doesn't accept out-of-order samples (aka data backfilling). Such storage systems include Prometheus, Mimir, Cortex and Thanos, which typically emit `out of order sample` errors. The best solution is to use remote storage with backfilling support such as VictoriaMetrics.
- `vmagent` buffers scraped data at the `-remoteWrite.tmpDataPath` directory until it is sent to `-remoteWrite.url`. The directory can grow large when remote storage is unavailable for extended periods of time and if the maximum directory size isn't limited with the `-remoteWrite.maxDiskUsagePerURL` command-line flag. If you don't want to send all the buffered data from the directory to remote storage, then simply stop `vmagent` and delete the directory.
- If `vmagent` runs on a host with slow persistent storage, which cannot keep up with the volume of processed samples, then it is possible to disable the persistent storage with the `-remoteWrite.disableOnDiskQueue` command-line flag. See these docs for more details.
- By default `vmagent` masks `-remoteWrite.url` with `secret-url` values in logs and at the `/metrics` page, because the url may contain sensitive information such as auth tokens or passwords. Pass the `-remoteWrite.showURL` command-line flag when starting `vmagent` in order to see all the valid urls.
- By default `vmagent` evenly spreads scrape load in time. If a particular scrape target must be scraped at the beginning of some interval, then the `scrape_align_interval` option must be used. For example, the following config aligns hourly scrapes to the beginning of the hour:

  ```yaml
  scrape_configs:
    - job_name: foo
      scrape_interval: 1h
      scrape_align_interval: 1h
  ```
- By default `vmagent` evenly spreads scrape load in time. If a particular scrape target must be scraped at a specific offset, then the `scrape_offset` option must be used. For example, the following config instructs `vmagent` to scrape the target at the 10th second of every minute:

  ```yaml
  scrape_configs:
    - job_name: foo
      scrape_interval: 1m
      scrape_offset: 10s
  ```
- If you see `skipping duplicate scrape target with identical labels` errors when scraping Kubernetes pods, then it is likely these pods listen to multiple ports or they use an init container. These errors can either be fixed or suppressed with the `-promscrape.suppressDuplicateScrapeTargetErrors` command-line flag. See the available options below if you prefer fixing the root cause of the error:

  The following relabeling rule may be added to the `relabel_configs` section in order to filter out pods with unneeded ports:

  ```yaml
  - action: keep_if_equal
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port, __meta_kubernetes_pod_container_port_number]
  ```

  The following relabeling rule may be added to the `relabel_configs` section in order to filter out init container pods:

  ```yaml
  - action: drop
    source_labels: [__meta_kubernetes_pod_container_init]
    regex: true
  ```
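As referenced in the remote-write tuning item above, a combined command might look like the following sketch. All flag values are illustrative only and should be derived from your own ingestion rate and memory budget:

```sh
./vmagent -promscrape.config=prometheus.yml \
  -remoteWrite.url=... \
  -remoteWrite.queues=16 \
  -remoteWrite.maxBlockSize=16MB \
  -remoteWrite.maxRowsPerBlock=20000 \
  -remoteWrite.maxDiskUsagePerURL=10GB
```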
See also the section below on calculating disk space for the persistence queue.
Calculating disk space for persistence queue
vmagent buffers collected metrics on disk at the directory specified via -remoteWrite.tmpDataPath command-line flag
until the metrics are sent to remote storage configured via -remoteWrite.url command-line flag.
The -remoteWrite.tmpDataPath directory can grow large when remote storage is unavailable for extended
periods of time and if the maximum directory size isn't limited with -remoteWrite.maxDiskUsagePerURL command-line flag.
To estimate the disk size to allocate for the persistent queue, or to estimate the -remoteWrite.maxDiskUsagePerURL command-line flag value,
take into account the following attributes:
- The size in bytes of the data stream sent by vmagent.
  Run the query `sum(rate(vmagent_remotewrite_bytes_sent_total[1h])) by(instance,url)` in vmui or Grafana to get the number of bytes sent by each vmagent instance per second.
- The amount of time the persistent queue should keep the data before starting to drop it.
  For example, if `vmagent` should be able to buffer the data for at least 6 hours, then the following query can be used for estimating the needed amount of disk space in gigabytes:

  ```
  sum(rate(vmagent_remotewrite_bytes_sent_total[1h])) by(instance,url) * 6h / 1Gi
  ```
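As an illustration only: if a vmagent instance sends about 2 MiB/s to a single remote storage, then buffering 6 hours of data requires roughly 2 MiB/s * 21600 s ≈ 42 GiB, so something like `-remoteWrite.maxDiskUsagePerURL=45GiB` would provide the 6-hour buffer with a small amount of headroom.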
Additional notes:
- Ensure that `vmagent` monitoring is configured properly.
- Re-evaluate the estimation each time when:
  - there is an increase in the vmagent's workload
  - there is a change in relabeling rules which could increase the amount of metrics to send
  - there is a change in the number of configured `-remoteWrite.url` addresses
- The minimum disk size to allocate for the persistent queue is 500Mi per each `-remoteWrite.url`.
- The on-disk persistent queue can be disabled if needed. See these docs.
Security
See general recommendations regarding security.
vmagent's /remotewrite-relabel-config and /remotewrite-url-relabel-config endpoints {{% available_from "v1.129.0" %}}
can be protected via -configAuthKey command-line flag.
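A minimal sketch of protecting these endpoints (the key value is illustrative; endpoints protected with auth keys in VictoriaMetrics components typically expect the key via the `authKey` query arg):

```sh
./vmagent -remoteWrite.url=... -configAuthKey=my-secret
# Query a protected endpoint, supplying the key:
curl 'http://vmagent-host:8429/remotewrite-relabel-config?authKey=my-secret'
```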
mTLS protection
By default vmagent accepts HTTP requests at port 8429 (this port can be changed via the -httpListenAddr command-line flag),
so it is expected to run in an isolated trusted network.
The Enterprise version of vmagent supports accepting mTLS
requests at this port by specifying the -tls and -mtls command-line flags. For example, the following command runs vmagent, which accepts only mTLS requests at port 8429:
```sh
./vmagent -tls -mtls -remoteWrite.url=...
```
By default, the system-wide TLS Root CA is used for verifying client certificates if -mtls command-line flag is specified.
It is possible to specify custom TLS Root CA via -mtlsCAFile command-line flag.
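For example, the following sketch enables mTLS with a custom Root CA (the certificate path is illustrative):

```sh
./vmagent -tls -mtls -mtlsCAFile=/path/to/clients-root-ca.crt -remoteWrite.url=...
```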
Performance optimizations
vmagent is optimized for low CPU usage and low RAM usage without the need to tune any configs. Sometimes it is necessary to optimize the CPU / RAM usage of vmagent even further.
For example, if vmagent needs to scrape thousands of targets in resource-constrained environments, then the following options may help to reduce CPU and RAM usage:
- Set the GOGC environment variable to `100`. This reduces CPU usage at the cost of higher RAM usage.
- Set the GOMAXPROCS environment variable to a value slightly bigger than the number of CPU cores used by `vmagent`. Another option is to set the CPU limit in Kubernetes / Docker to an integer value bigger than the number of CPU cores used by `vmagent`. This reduces RAM and CPU usage when `vmagent` runs in an environment with a large number of available CPU cores. Note that it may be necessary to increase the `-remoteWrite.queues` command-line flag to a larger value if `GOMAXPROCS` is set to too small a value, since by default `-remoteWrite.queues` is proportional to `GOMAXPROCS`.
- Avoid increasing the value for the `-maxConcurrentRequests` command-line flag. The default value is optimized for the best performance in most cases, even if many clients send data to `vmagent` via many concurrent connections and the number of these connections significantly exceeds the default value for the `-maxConcurrentRequests` command-line flag. `vmagent` puts incoming requests into a wait queue if the number of concurrently executed requests exceeds `-maxConcurrentRequests`. The pending requests in the wait queue do not consume CPU and do not consume significant amounts of RAM, so it is OK to have thousands of pending requests in the wait queue. Pending requests in the wait queue are canceled if they wait for their execution longer than the duration specified in the `-insert.maxQueueDuration` command-line flag. Canceled requests can be monitored via the `vm_concurrent_insert_limit_timeout_total` metric.
- Disable response compression at scrape targets via the `-promscrape.disableCompression` command-line flag or via the `disable_compression: true` option in the scrape_config. This reduces CPU usage at the cost of higher network bandwidth usage between `vmagent` and scrape targets.
- Disable tracking of original labels for the discovered targets via the `-promscrape.dropOriginalLabels` command-line flag. This helps to reduce RAM usage when `vmagent` discovers a large number of scrape targets and the majority of these targets are dropped. This is a typical case when `vmagent` discovers Kubernetes targets. The downside of using the `-promscrape.dropOriginalLabels` command-line flag is the reduced debuggability of improperly configured per-target relabeling.
- Disable staleness markers via the `-promscrape.noStaleMarkers` command-line flag or via the `no_stale_markers: true` option in the scrape_config. This reduces RAM and CPU usage. Note that disabling staleness markers may result in unexpected query results when scrape targets are frequently rotated (this is a typical case in Kubernetes).
- Set the `-memory.allowedBytes` command-line flag to a value close to the actual memory usage of `vmagent`. Another option is to set the memory limit in Kubernetes / Docker to a value 50% larger than the actual memory usage of `vmagent`. This should reduce memory usage spikes for `vmagent` running in an environment with a large amount of available memory and when the remote storage cannot keep up with the data ingestion rate. Increasing the `-remoteWrite.queues` command-line flag value may help in this case too.
- In extreme cases it may be useful to set the `-promscrape.disableKeepAlive` command-line flag in order to save RAM on HTTP keep-alive connections to thousands of scrape targets.
- Increase the `scrape_interval` option in the `global` section of the `-promscrape.config` and/or at every scrape_config to reduce CPU usage. For example, increasing the `scrape_interval` from `10s` to `30s` across all the targets decreases CPU usage at `vmagent` by up to 3x.
Example command, which runs vmagent in an optimized mode:
```sh
GOGC=100 GOMAXPROCS=1 ./vmagent -promscrape.disableCompression -promscrape.dropOriginalLabels -promscrape.noStaleMarkers -memory.allowedBytes=1GiB -promscrape.disableKeepAlive ...
```
How to build from sources
We recommend using the official binary releases - vmagent is located in the vmutils-... archives.
Building vmagent from source code may be needed when developing or testing a new feature or bugfix.
Development build
- Install Go.
- Run `make vmagent` from the root folder of the repository. It builds the `vmagent` binary and puts it into the `bin` folder.
Production build
- Install docker.
- Run `make vmagent-prod` from the root folder of the repository. It builds the `vmagent-prod` binary and puts it into the `bin` folder.
Building docker images
Run `make package-vmagent`. It builds the `victoriametrics/vmagent:<PKG_TAG>` docker image locally.
`<PKG_TAG>` is an auto-generated image tag, which depends on source code in the repository.
The `<PKG_TAG>` may be manually set via `PKG_TAG=foobar make package-vmagent`.
The base docker image is alpine but it is possible to use any other base image
by setting it via the `ROOT_IMAGE` environment variable. For example, the following command builds the image on top of the scratch image:
```sh
ROOT_IMAGE=scratch make package-vmagent
```
ARM build
ARM build may run on Raspberry Pi or on energy-efficient ARM servers.
Development ARM build
- Install Go.
- Run `make vmagent-linux-arm` or `make vmagent-linux-arm64` from the root folder of the repository. It builds the `vmagent-linux-arm` or `vmagent-linux-arm64` binary respectively and puts it into the `bin` folder.
Production ARM build
- Install docker.
- Run `make vmagent-linux-arm-prod` or `make vmagent-linux-arm64-prod` from the root folder of the repository. It builds the `vmagent-linux-arm-prod` or `vmagent-linux-arm64-prod` binary respectively and puts it into the `bin` folder.
Profiling
vmagent provides handlers for collecting the following Go profiles:
- Memory profile can be collected with the following command (replace `0.0.0.0` with the hostname if needed):

  ```sh
  curl http://0.0.0.0:8429/debug/pprof/heap > mem.pprof
  ```
- CPU profile can be collected with the following command (replace `0.0.0.0` with the hostname if needed):

  ```sh
  curl http://0.0.0.0:8429/debug/pprof/profile > cpu.pprof
  ```
The command for collecting CPU profile waits for 30 seconds before returning.
The collected profiles may be analyzed with `go tool pprof`.
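For example, the following standard `go tool pprof` invocation (it requires a local Go installation) opens an interactive web UI for exploring the collected memory profile:

```sh
go tool pprof -http=:8080 mem.pprof
```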
It is safe to share the collected profiles from a security point of view, since they do not contain sensitive information.
Advanced usage
vmagent can be fine-tuned with various command-line flags. Run ./vmagent -help in order to see the full list of these flags with their descriptions and default values.
Common flags
These flags are available in both VictoriaMetrics OSS and VictoriaMetrics Enterprise. {{% content "vmagent_common_flags.md" %}}
Enterprise flags
These flags are available only in VictoriaMetrics enterprise. {{% content "vmagent_enterprise_flags.md" %}}
The section below contains backward-compatible anchors for links that were moved or renamed.
Relabeling
Moved to Relabeling Cookbook.
Relabeling enhancements
Moved to Relabeling Cookbook.
Graphite relabeling
Moved to Relabeling Cookbook.
Relabel debug
Moved to Relabeling Cookbook.
Debugging scrape targets
Moved to Relabeling Cookbook.
Google PubSub integration
Moved to integrations/pubsub.
Reading metrics from PubSub
Moved to integrations/pubsub/#reading-metrics.
Consume metrics from multiple topics
Moved to integrations/pubsub/#multiple-topics.
Command-line flags for PubSub consumer
Moved to integrations/pubsub/#consumer-command-line-flags.
Writing metrics to PubSub
Moved to integrations/pubsub/#writing-metrics.
Command-line flags for PubSub producer
Moved to integrations/pubsub/#producer-command-line-flags.
Kafka integration
Moved to integrations/kafka.
Reading metrics from Kafka
Moved to integrations/kafka/#reading-metrics.
Command-line flags for Kafka consumer
Moved to integrations/kafka/#consumer-command-line-flags.
Writing metrics to Kafka
Moved to integrations/kafka/#writing-metrics.
Estimating message size and rate
Moved to integrations/kafka/#estimating-message-size-and-rate.
Kafka broker authorization and authentication
Moved to integrations/kafka/#kafka-broker-authorization-and-authentication.
